TY - GEN
T1 - A generalization of proximity functions for K-means
AU - Wu, Junjie
AU - Xiong, Hui
AU - Chen, Jian
AU - Zhou, Wenjun
PY - 2007
Y1 - 2007
AB - K-means is a widely used partitional clustering method. Considerable effort has been devoted to finding better proximity (distance) functions for K-means. However, the common characteristics of these proximity functions remain unknown. To this end, in this paper, we show that all proximity functions that fit K-means clustering can be generalized as K-means distance, which can be derived from a differentiable convex function. A general proof of sufficient and necessary conditions for K-means distance functions is also provided. In addition, we reveal that K-means has a general uniformization effect; that is, K-means tends to produce clusters with relatively balanced cluster sizes. This uniformization effect exists regardless of the proximity function used. Finally, we have conducted extensive experiments on various real-world data sets, and the results provide evidence of the uniformization effect. We also observed that external clustering validation measures, such as Entropy and Variation of Information (VI), have difficulty in measuring clustering quality when the data have skewed class-size distributions.
UR - http://www.scopus.com/inward/record.url?scp=49749114842&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=49749114842&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2007.59
DO - 10.1109/ICDM.2007.59
M3 - Conference contribution
AN - SCOPUS:49749114842
SN - 0769530184
SN - 9780769530185
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 361
EP - 370
BT - Proceedings of the 7th IEEE International Conference on Data Mining, ICDM 2007
T2 - 7th IEEE International Conference on Data Mining, ICDM 2007
Y2 - 28 October 2007 through 31 October 2007
ER -