Content and metadata analysis & management

Research opportunities

Most of us are drowning in data -- in both our business and personal lives -- generating enormous numbers of e-mails, documents, digital photos, presentations, etc. Almost every business process -- e.g., making purchases or scheduling meetings -- creates more data.

Yet even though we're preserving more files, e-mails and database records, we're less able to find and use them efficiently. We make multiple copies, save multiple versions, store information in many places, and often even lose things.

From a corporate standpoint, the problem is even more serious. Even though disks are cheap, enterprise storage-management costs are soaring. Companies need better tools for storing only what they want to store, for knowing what they are storing and why, and for making stored information available.

Existing tools for managing these huge collections of loosely related information are inadequate. How do you store a trillion items so people can find the right information at the right time? How do you index it intelligently to account for each item’s current and future purpose and status? For instance:

  • If it’s urgent, it needs to be immediately and constantly available.
  • If it’s archival, it needs to be stored safely.
  • If it’s shared, everyone involved must have the correct version and receive the right updates.
  • If it’s confidential, only certain users should be able to access it.
  • If it’s expired, it needs to be disposed of properly.

Research focus

HP Labs researchers are developing tools and techniques to improve personal and professional information management. By finding and harnessing patterns in large data sets and streams, we are helping users do the right thing with each piece of information and get the most out of the information they have -- all while keeping data growth under control.

Current work

In cooperation with some of HP’s leading-edge customers, as well as with several universities, our researchers are addressing:

  • Content categorization and analysis -- developing tools that help users organize hundreds of thousands of documents, making sure that each is in the right place with minimal oversight and intervention
  • Document and image-similarity analysis -- developing a new class of clustering algorithms designed to identify and weed out duplicate and outdated documents
  • Efficient data scanning -- making storage systems information-aware to avoid wasting valuable bandwidth and storage resources
  • Policy and privacy management -- evaluating the content of documents and files, along with associated metadata, to help secure information in keeping with corporate policies, and doing so in a way that’s minimally invasive to users’ privacy
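
To make the idea of weeding out duplicate documents concrete, here is a minimal sketch of one textbook approach: comparing documents by the overlap of their word shingles (Jaccard similarity). This is not the clustering work mentioned above, whose details this page does not give. The function names, the shingle size and the threshold are all assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text (k is an illustrative choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Return (name, name, score) for every pair scoring at or above threshold.

    docs maps a document name to its text. The all-pairs comparison is
    quadratic, so this only suits small collections.
    """
    sets = {name: shingles(text, k) for name, text in docs.items()}
    names = sorted(sets)
    pairs = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            score = jaccard(sets[x], sets[y])
            if score >= threshold:
                pairs.append((x, y, score))
    return pairs
```

At the scale described here, an all-pairs comparison would be impractical; it is shown only because it keeps the similarity measure easy to see.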

Technical contributions

Several products from HP StorageWorks -- including the Reference Information Storage System, an information archive and retrieval system -- are incorporating our content-based chunking algorithms and content-addressable storage research to improve storage efficiency.
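
As a rough illustration of how content-based chunking and content-addressable storage fit together, the sketch below cuts a byte stream wherever a rolling hash of the last few bytes matches a pattern, then stores each chunk under its SHA-256 digest. It is a simplified, assumption-laden example and not the algorithm used in the products named above: the window size, hash parameters, chunk-size limits and the `ChunkStore` class are all hypothetical.

```python
import hashlib

WINDOW = 16            # bytes covered by the rolling hash (assumed value)
BASE = 257             # polynomial base for the rolling hash (assumed value)
MOD = (1 << 31) - 1    # modulus for the rolling hash (assumed value)
MASK = (1 << 6) - 1    # cut when the low 6 bits of the hash are zero
MIN_CHUNK = 32         # must be >= WINDOW so cuts depend only on recent bytes
MAX_CHUNK = 1024       # force a cut if no boundary is found


def chunk(data: bytes):
    """Split data at content-defined boundaries and return the list of chunks."""
    chunks = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        pos = i - start
        if pos >= WINDOW:
            # Slide the window: drop the oldest byte before adding the new one.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        length = pos + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class ChunkStore:
    """Hypothetical in-memory content-addressable store keyed by SHA-256."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes):
        """Store data and return its recipe: the ordered list of chunk keys."""
        recipe = []
        for c in chunk(data):
            key = hashlib.sha256(c).hexdigest()
            self.blobs.setdefault(key, c)  # identical chunks are stored once
            recipe.append(key)
        return recipe

    def get(self, recipe):
        """Rebuild the original bytes from a recipe."""
        return b"".join(self.blobs[k] for k in recipe)
```

Because boundaries depend on the content rather than on fixed offsets, a small insertion near the start of a file typically changes only the chunks around the edit; later chunks usually hash to the same keys and need not be stored again.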

We’re also working with HP’s Enterprise Storage and Server division on new backup and archiving solutions. The same technologies are used to synchronize data between customers and remote data centers, helping to improve efficiency, robustness and resilience to data communication errors.
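
One way to picture hash-based synchronization is sketched below: the sender transmits only the chunks whose keys the remote side lacks, and the receiver rejects any chunk whose content does not match its key. This is an illustrative sketch under assumed names (`key_of`, `plan_transfer`, `receive`), not a description of HP's actual protocol.

```python
import hashlib


def key_of(chunk: bytes) -> str:
    """Content address of a chunk (SHA-256 is an assumed choice)."""
    return hashlib.sha256(chunk).hexdigest()


def plan_transfer(local_chunks, remote_keys):
    """Return (recipe, to_send).

    recipe is the ordered list of keys needed to rebuild the data;
    to_send holds only the chunks whose keys the remote does not have.
    """
    recipe, to_send = [], {}
    for c in local_chunks:
        k = key_of(c)
        recipe.append(k)
        if k not in remote_keys:
            to_send[k] = c
    return recipe, to_send


def receive(remote_store, key, payload):
    """Store payload under key only if its hash matches.

    A corrupted transfer is detected and rejected, so the sender can retry.
    """
    if key_of(payload) != key:
        return False
    remote_store[key] = payload
    return True
```

Checking each chunk against its own key is what gives this scheme its tolerance of communication errors: damage is confined to a single chunk, which can be sent again on its own.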

Our technologies are used by HP Services to better manage millions of support documents. We have developed novel data mining methods and algorithms, including new methods for feature selection, hierarchical categorization and clustering.

While working with HP’s customer support business, we developed breakthrough methods for quantification -- an area previously little known in the data mining research community.

Quantification involves taking large numbers of hard-to-classify documents and accurately estimating how many belong to each category. This work is being used to help measure and manage various support problems -- based on free text in support call records -- so they can be addressed as efficiently and effectively as possible.
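
To show why quantification differs from classifying each document, here is a minimal sketch that contrasts a naive "classify and count" estimate with a commonly described correction, often called adjusted count, which uses the classifier's true-positive and false-positive rates. This page does not say which methods were developed, so treat this as one illustrative technique; the function names and the example rates are assumptions.

```python
def classify_and_count(predictions):
    """Naive estimate: the fraction of documents the classifier labels positive.

    predictions is a non-empty sequence of 0/1 classifier outputs.
    """
    return sum(predictions) / len(predictions)


def adjusted_count(predictions, tpr, fpr):
    """Correct the naive estimate using known classifier error rates.

    If p is the true positive fraction, the observed rate is
    p * tpr + (1 - p) * fpr, so p = (observed - fpr) / (tpr - fpr).
    tpr and fpr would be measured on labeled validation data.
    """
    if tpr <= fpr:
        raise ValueError("classifier must have tpr > fpr")
    observed = classify_and_count(predictions)
    estimate = (observed - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion
```

For example, with an assumed true-positive rate of 0.8 and false-positive rate of 0.1, a classifier that flags 240 of 1,000 call records gives a naive estimate of 24%, while the adjusted estimate is about 20%. The category total can be estimated well even though many individual documents are misclassified.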

© 2009 Hewlett-Packard Development Company, L.P.