Technical Reports




Finding Similar Files in Large Document Repositories

Forman, George; Eshghi, Kave; Chiocchetti, Stephane

HPL-2005-42R1

Keyword(s): content management; document management; near duplicate detection; similarity; scalability

Abstract: Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction. The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing. We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.

Notes: Copyright ACM. To be published in and presented at the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), 21-25 August 2005, Chicago, IL, USA

7 Pages
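
The abstract describes the approach only in outline: chunk each file's byte stream, treat chunk signatures as nodes shared between files, and read clusters off the resulting file-chunk graph. The Python sketch below is one possible illustration of that outline, not the authors' implementation. Everything specific in it is an assumption made for the example: the boundary test (an MD5 digest of a sliding window, standing in for whatever rolling hash the report uses), the `window`, `mask`, and `min_size` parameters, SHA-1 as the chunk signature, and the Jaccard `threshold` used to link files. The optional bipartite graph partitioning step is not shown.

```python
import hashlib
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set


def chunk_bytes(data: bytes, window: int = 16, mask: int = 0x3F,
                min_size: int = 32) -> List[bytes]:
    """Split data at content-defined boundaries (illustrative parameters).

    A boundary is declared after byte i when a hash of the trailing
    `window` bytes has its low bits (selected by `mask`) all zero and the
    current chunk is at least `min_size` bytes long. min_size must be
    >= window so the window never starts before the beginning of data.
    """
    assert min_size >= window
    chunks, start = [], 0
    for i in range(len(data)):
        if i + 1 - start < min_size:
            continue
        digest = hashlib.md5(data[i + 1 - window:i + 1]).digest()
        if int.from_bytes(digest[:4], "big") & mask == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def find_clusters(files: Dict[str, bytes],
                  threshold: float = 0.5) -> List[Set[str]]:
    """Group files whose chunk-signature sets overlap (assumed criterion)."""
    # Signatures of the chunks found in each file.
    chunks_of = {
        name: {hashlib.sha1(c).hexdigest() for c in chunk_bytes(data)}
        for name, data in files.items()
    }
    # File-chunk graph, stored as: chunk signature -> files containing it.
    files_with = defaultdict(set)
    for name, digests in chunks_of.items():
        for d in digests:
            files_with[d].add(name)
    # Count shared chunks only for file pairs that co-occur on some chunk.
    shared = defaultdict(int)
    for names in files_with.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1

    # Union-find over files; link a pair when its Jaccard overlap is high.
    parent = {name: name for name in files}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), n in shared.items():
        union_size = len(chunks_of[a]) + len(chunks_of[b]) - n
        if n / union_size >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for name in files:
        groups[find(name)].add(name)
    return list(groups.values())
```

Because boundaries depend on content rather than on byte offsets, inserting a few bytes into a file should change only the chunks near the edit, so an older and a newer version of a document still share most of their signatures. Calling `find_clusters({"v1.txt": old_bytes, "v2.txt": new_bytes, "other.txt": unrelated_bytes})` with hypothetical inputs like these would be expected to group the two versions together and leave the unrelated file on its own.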

© 2009 Hewlett-Packard Development Company, L.P.