Technical Reports
HPL-2008-38R2
Data Weaving: Scaling Up the State-Of-The-Art in Data Clustering
Bekkerman, Ron; Scholz, Martin
HP Laboratories
HPL-2008-38R2
Keyword(s): information-theoretic clustering, multi-modal clustering, parallel and distributed data mining
Abstract: The enormous amount and dimensionality of data processed by modern data mining tools require effective, scalable unsupervised learning techniques. Unfortunately, the majority of previously proposed clustering algorithms are either effective or scalable. This paper is concerned with information- theoretic clustering (ITC) that has historically been considered the state-of-the-art in clustering multi- dimensional data. Most existing ITC methods are computationally expensive and not easily scalable. Those few ITC methods that scale well (using, e.g., parallelization) are often out-performed by the others, of an inherently sequential nature. First, we justify this observation theoretically. We then propose data weaving -- a novel method for parallelizing sequential clustering algorithms. Data weaving is intrinsically multi-modal -- it allows simultaneous clustering of a few types of data (modalities). Finally, we use data weaving to parallelize multi-modal ITC, which results in proposing a powerful DataLoom algorithm. In our experimentation with small datasets, DataLoom shows practically identical performance compared to expensive sequential alternatives. On large datasets, however, DataLoom demonstrates significant gains over other parallel clustering methods. To illustrate the scalability, we simultaneously clustered rows and columns of a contingency table with over 120 billion entries.
10 Pages
Additional Publication Information: Published in ACM CIKM, Conference on Information & Knowledge Management, Napa, CA Oct 27, 2008
External Posting Date: April 6, 2008 [Fulltext]. Approved for External Publication
Internal Posting Date: August 21, 2008 [Fulltext]