TY - GEN

T1 - K-means clustering versus validation measures

T2 - KDD 2006: 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

AU - Xiong, Hui

AU - Wu, Junjie

AU - Chen, Jian

PY - 2006

Y1 - 2006

N2 - K-means is a widely used partitional clustering method. While considerable research effort has gone into characterizing the key features of K-means clustering, further investigation is needed to reveal whether and how data distributions affect its performance. In this paper, we revisit the K-means clustering problem by answering three questions. First, how do the "true" cluster sizes affect the performance of K-means clustering? Second, is entropy an algorithm-independent validation measure for K-means clustering? Finally, what is the distribution of the clustering results produced by K-means? To that end, we first illustrate that K-means tends to generate clusters with relatively uniform sizes. In addition, we show that the entropy measure, an external clustering validation measure, favors clustering algorithms that tend to reduce high variation in cluster sizes. Finally, our experimental results indicate that K-means tends to produce clusters in which the variation of the cluster sizes, as measured by the Coefficient of Variation (CV), falls in a specific range, approximately 0.3 to 1.0.

AB - K-means is a widely used partitional clustering method. While considerable research effort has gone into characterizing the key features of K-means clustering, further investigation is needed to reveal whether and how data distributions affect its performance. In this paper, we revisit the K-means clustering problem by answering three questions. First, how do the "true" cluster sizes affect the performance of K-means clustering? Second, is entropy an algorithm-independent validation measure for K-means clustering? Finally, what is the distribution of the clustering results produced by K-means? To that end, we first illustrate that K-means tends to generate clusters with relatively uniform sizes. In addition, we show that the entropy measure, an external clustering validation measure, favors clustering algorithms that tend to reduce high variation in cluster sizes. Finally, our experimental results indicate that K-means tends to produce clusters in which the variation of the cluster sizes, as measured by the Coefficient of Variation (CV), falls in a specific range, approximately 0.3 to 1.0.

KW - Coefficient of Variation (CV)

KW - Entropy

KW - K-means Clustering

UR - http://www.scopus.com/inward/record.url?scp=33749563831&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33749563831&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:33749563831

SN - 1595933395

SN - 9781595933393

T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

SP - 779

EP - 784

BT - KDD 2006

Y2 - 20 August 2006 through 23 August 2006

ER -