External validation measures for K-means clustering: A data distribution perspective

Junjie Wu, Jian Chen, Hui Xiong, Ming Xie

Research output: Contribution to journal › Article › peer-review

41 Scopus citations

Abstract

Cluster validation is an important part of any cluster analysis. External measures such as entropy, purity and mutual information are often used to evaluate K-means clustering. However, whether these measures are indeed suitable for K-means clustering remains unknown. In this paper, we show that a data distribution view is of great use in selecting the right measures for K-means clustering. Specifically, we first introduce the data distribution view of K-means and the resulting uniform effect on highly imbalanced data sets. We also collect eight external measures widely used in recent data mining tasks as candidates for K-means evaluation. We then demonstrate that only three measures, namely the variation of information (VI), the van Dongen criterion (VD) and the Mirkin metric (M), can detect the negative uniform effect of K-means in the clustering results. We also provide new normalization schemes for these three measures, i.e., VI_norm, VD_norm and M_norm, which enable comparisons of clustering quality across data sets. Finally, we explore properties of the three measures, such as consistency and sensitivity, and give some advice on how to use them in K-means practice.
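As an illustration of the three measures named in the abstract, the following is a minimal sketch that computes them from the contingency counts between true class labels and cluster labels. It assumes the commonly cited definitions: VI = H(C) + H(K) - 2 I(C, K), the unscaled van Dongen form 2n - Σ_i max_j n_ij - Σ_j max_i n_ij, and the Mirkin form Σ_i n_i.² + Σ_j n_.j² - 2 Σ_ij n_ij². The scaling and the normalized variants used in the paper may differ and are not reproduced here; all function names are hypothetical.

```python
import math
from collections import Counter


def contingency(labels_true, labels_pred):
    """Return joint counts n_ij plus row (class) and column (cluster) totals."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    joint = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    return joint, rows, cols


def variation_of_information(labels_true, labels_pred):
    """VI = H(C) + H(K) - 2 I(C, K), using natural logarithms (assumed form)."""
    joint, rows, cols = contingency(labels_true, labels_pred)
    n = len(labels_true)
    h_rows = -sum(c / n * math.log(c / n) for c in rows.values())
    h_cols = -sum(c / n * math.log(c / n) for c in cols.values())
    mutual_info = sum(
        nij / n * math.log(nij * n / (rows[i] * cols[j]))
        for (i, j), nij in joint.items()
    )
    return h_rows + h_cols - 2 * mutual_info


def van_dongen(labels_true, labels_pred):
    """Unscaled van Dongen criterion: 2n - sum of row maxima - sum of column maxima."""
    joint, _, _ = contingency(labels_true, labels_pred)
    n = len(labels_true)
    row_max, col_max = Counter(), Counter()
    for (i, j), nij in joint.items():
        row_max[i] = max(row_max[i], nij)
        col_max[j] = max(col_max[j], nij)
    return 2 * n - sum(row_max.values()) - sum(col_max.values())


def mirkin(labels_true, labels_pred):
    """Mirkin metric: sum of squared marginals minus twice the squared joint counts."""
    joint, rows, cols = contingency(labels_true, labels_pred)
    return (
        sum(c * c for c in rows.values())
        + sum(c * c for c in cols.values())
        - 2 * sum(v * v for v in joint.values())
    )
```

All three functions return zero when the clustering matches the class structure exactly (up to relabeling) and grow as the two partitions diverge, so lower values indicate better agreement.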

Original language: English (US)
Pages (from-to): 6050-6061
Number of pages: 12
Journal: Expert Systems With Applications
Volume: 36
Issue number: 3 PART 2
State: Published - Apr 2009

All Science Journal Classification (ASJC) codes

  • Engineering (all)
  • Computer Science Applications
  • Artificial Intelligence

Keywords

  • Cluster validation
  • External criteria
  • K-means
  • Normalization
