Technical Reports

HP Labs



Finding Similar Files in Large Document Repositories

Forman, George; Eshghi, Kave; Chiocchetti, Stephane

HPL-2005-42R1

Keyword(s): content management; document management; near duplicate detection; similarity; scalability

Abstract: Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction. The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing. We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.

Notes: Copyright ACM. To be published in and presented at the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), 21-25 August 2005, Chicago, IL, USA.
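The abstract names three steps: chunk each file's byte stream, derive signatures that files may share, and analyze the file-chunk graph for clusters. The Python sketch below illustrates that general idea only and is not the report's implementation. Everything specific in it is an assumption made for illustration: the boundary rule (a SHA-1 hash of a trailing byte window), the parameters `window`, `mask`, and `min_size`, the function names `chunk_bytes` and `cluster_files`, and the use of connected components as the clustering step. The optional bipartite graph partitioning mentioned in the abstract is not shown.

```python
import hashlib
from collections import defaultdict


def chunk_bytes(data, window=8, mask=0x0F, min_size=16):
    """Split data at content-defined boundaries (illustrative parameters)."""
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        if end - start < min_size:
            continue  # enforce a minimum chunk length
        # The boundary decision depends only on the trailing window of bytes,
        # so an edit in one place tends not to shift boundaries far away.
        digest = hashlib.sha1(data[end - window:end]).digest()
        if digest[0] & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def cluster_files(files):
    """Group files that are linked, directly or transitively, by a shared chunk."""
    # Bipartite file-chunk graph, stored as chunk signature -> file names.
    chunk_to_files = defaultdict(set)
    for name, data in files.items():
        for chunk in chunk_bytes(data):
            chunk_to_files[hashlib.sha1(chunk).hexdigest()].add(name)

    # Union-find over file names to obtain connected components.
    parent = {name: name for name in files}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for names in chunk_to_files.values():
        first, *rest = sorted(names)
        for other in rest:
            parent[find(other)] = find(first)

    clusters = defaultdict(set)
    for name in files:
        clusters[find(name)].add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))
```

Calling `cluster_files({"a.txt": data_a, "b.txt": data_b})` with two byte strings returns a list of sets of file names. Files that share no chunk end up in separate single-element sets. In practice one would likely require more than a single shared chunk before linking two files, which this sketch omits for brevity.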

7 Pages


© 2009 Hewlett-Packard Development Company, L.P.
