Technical Reports

HPL-2008-38R2

Click here for full text: PDF

Data Weaving: Scaling Up the State-Of-The-Art in Data Clustering

Bekkerman, Ron; Scholz, Martin
HP Laboratories

HPL-2008-38R2

Keyword(s): information-theoretic clustering, multi-modal clustering, parallel and distributed data mining

Abstract: The enormous amount and dimensionality of data processed by modern data mining tools require effective, scalable unsupervised learning techniques. Unfortunately, the majority of previously proposed clustering algorithms are either effective or scalable. This paper is concerned with information- theoretic clustering (ITC) that has historically been considered the state-of-the-art in clustering multi- dimensional data. Most existing ITC methods are computationally expensive and not easily scalable. Those few ITC methods that scale well (using, e.g., parallelization) are often out-performed by the others, of an inherently sequential nature. First, we justify this observation theoretically. We then propose data weaving -- a novel method for parallelizing sequential clustering algorithms. Data weaving is intrinsically multi-modal -- it allows simultaneous clustering of a few types of data (modalities). Finally, we use data weaving to parallelize multi-modal ITC, which results in proposing a powerful DataLoom algorithm. In our experimentation with small datasets, DataLoom shows practically identical performance compared to expensive sequential alternatives. On large datasets, however, DataLoom demonstrates significant gains over other parallel clustering methods. To illustrate the scalability, we simultaneously clustered rows and columns of a contingency table with over 120 billion entries.

10 Pages

Additional Publication Information: Published in ACM CIKM, Conference on Information & Knowledge Management, Napa, CA Oct 27, 2008

External Posting Date: April 6, 2008 [Fulltext]. Approved for External Publication
Internal Posting Date: August 21, 2008 [Fulltext]

Back to Index