Finding Similar Files in Large Document Repositories
Forman, George; Eshghi, Kave; Chiocchetti, Stephane
HPL-2005-42R1
Keyword(s): content management; document management; near duplicate detection; similarity; scalability
Abstract: Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction. The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing. We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.
Notes: Copyright ACM. To be published in and presented at the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'05), 21-25 August 2005, Chicago, IL, USA.
7 Pages
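The pipeline outlined in the abstract (chunk the byte stream, derive signatures, analyze the file-chunk graph for clusters) can be illustrated with a minimal Python sketch. The abstract does not specify the chunking algorithm, the hash function, or any clustering thresholds, so everything concrete here is an assumption made for illustration: the rolling-sum boundary test, the `window`, `mask`, and `min_size` parameters, the use of SHA-1, and the names `chunk_bytes` and `cluster_files`. The optional bipartite graph partitioning step is not shown.

```python
import hashlib
from collections import defaultdict


def chunk_bytes(data: bytes, window: int = 16, mask: int = 0x3F, min_size: int = 32):
    """Split data at content-defined boundaries.

    Assumption: a simple rolling sum over the last `window` bytes stands in
    for whatever fingerprint the report actually uses.
    """
    chunks, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling += b
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        # Declare a boundary when the low bits of the window sum match the mask.
        if i + 1 - start >= min_size and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def cluster_files(files: dict) -> list:
    """Group file names that share at least one chunk signature."""
    # File-chunk graph, stored as signature -> set of files containing it.
    chunk_to_files = defaultdict(set)
    for name, data in files.items():
        for chunk in chunk_bytes(data):
            chunk_to_files[hashlib.sha1(chunk).hexdigest()].add(name)

    # Union-find over files: any shared chunk links two files.
    parent = {name: name for name in files}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for names in chunk_to_files.values():
        first, *rest = sorted(names)
        for other in rest:
            parent[find(other)] = find(first)

    groups = defaultdict(set)
    for name in files:
        groups[find(name)].add(name)
    # Report only clusters with more than one file.
    return [group for group in groups.values() if len(group) > 1]
```

Because boundaries depend only on nearby bytes, an insertion early in a file tends to leave later chunks unchanged, which is what lets edited versions share signatures. Linking files on a single shared chunk is the crudest possible rule; a practical system would likely require a minimum amount of shared content before treating two files as related.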