Tech Report: HPL-2000-6:Scale Up Center-Based Data

HP Labs Technical Reports

Click here for full text:

Scale Up Center-Based Data Clustering Algorithms by Parallelism

Zhang, Bin; Hsu, Meichun

HPL-2000-6

Keyword(s): parallel algorithms; data mining; data clustering; K-Means; K-Harmonic Means; Expectation Maximization

Abstract:As data collection increases at an accelerating rate with the advances of computers and networking technology, analyzing the data (data mining) becomes very important. Data clustering is one of the basic tools widely used as a component in many data mining solutions. Even though many data clustering algorithms have been developed in the last few decades, they face new challenges in front of hugh data sets. Algorithms with quadratic (or higher order) computational complexity, like agglomerative algorithms, drop out very quickly. More efficient algorithms like K-Means and EM, which have linear cost per iteration, also need scale-up before they can be applied to very large data sets. This paper shows that many parameter estimation algorithms, including the clustering algorithms like K-Means, K-Harmonic Means and EM, have intrinsic parallel structure in them. Many workstations over a LAN or a multiple-processor computer can be efficiently used to run this class of algorithms in parallel. With 60 workstations running in parallel (on a fast LAN), clustering 28.8 GBytes of 40 dimensional data into 100 clusters, the utilization of the computing units is above 80%.

23 Pages

Back to Index