TY - JOUR
T1 - COG: Local decomposition for rare class analysis
AU - Wu, Junjie
AU - Xiong, Hui
AU - Chen, Jian
N1 - Funding Information:
This research was partially supported by the National Natural Science Foundation of China (NSFC) (No. 70901002, 70621061, 70890082), National Science Foundation (NSF) via grant number CNS 0831186, and the Rutgers Seed Funding for Collaborative Computing Research. This work was also supported in part by the Doctoral Fund of Ministry of Education of China (No. 360285), and the Lan Tian Xin Xiu 2008 Seed Funding of Beihang University (No. 221531).
PY - 2010/3
Y1 - 2010/3
N2 - Given its importance, the problem of predicting rare classes in large-scale multi-labeled data sets has attracted great attention in the literature. However, rare class analysis remains a critical challenge, because no natural way has been developed for handling imbalanced class distributions. This paper fills this void by developing a method for classification using local clustering (COG). Specifically, for a data set with an imbalanced class distribution, we perform clustering within each large class and produce sub-classes with relatively balanced sizes. Then, we apply traditional supervised learning algorithms, such as support vector machines (SVMs), for classification. Along this line, we explore key properties of local clustering for a better understanding of the effect of COG on rare class analysis. We also provide a systematic analysis of the time and space complexity of the COG method. Experimental results on various real-world data sets show that COG produces significantly higher prediction accuracies on rare classes than state-of-the-art methods, and that the COG scheme can greatly improve the computational performance of SVMs. Furthermore, we show that COG can also improve the performance of traditional supervised learning algorithms on data sets with balanced class distributions. Finally, as two case studies, we have applied COG to two real-world applications: credit card fraud detection and network intrusion detection.
KW - K-means clustering
KW - Local clustering
KW - Rare class analysis
KW - Support vector machines (SVMs)
UR - http://www.scopus.com/inward/record.url?scp=77649273505&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77649273505&partnerID=8YFLogxK
U2 - 10.1007/s10618-009-0146-1
DO - 10.1007/s10618-009-0146-1
M3 - Article
AN - SCOPUS:77649273505
VL - 20
SP - 191
EP - 220
JO - Data Mining and Knowledge Discovery
JF - Data Mining and Knowledge Discovery
SN - 1384-5810
IS - 2
ER -