The minimum consistent subset cover problem and its applications in data mining

Byron J. Gao, Martin Ester, Oliver Schulte, Jin Yi Cai, Hui Xiong

Research output: Chapter in Book/Report/Conference proceedingConference contribution

15 Scopus citations

Abstract

In this paper, we introduce and study the Minimum Consistent Subset Cover (MCSC) problem. Given a finite ground set X and a constraint t, find the minimum number of consistent subsets that cover X, where a subset of X is consistent if it satisfies t. The MCSC problem generalizes the traditional set covering problem and has Minimum Clique Partition, a dual problem of graph coloring, as an instance. Many practical data mining problems in the areas of rule learning, clustering, and frequent pattern mining can be formulated as MCSC instances. In particular, we discuss the Minimum Rule Set problem that minimizes model complexity of decision rules as well as some converse k-clustering problems that minimize the number of clusters satisfying certain distance constraints. We also show how the MCSC problem can find applications in frequent pattern summarization. For any of these MCSC formulations, our proposed novel graph-based generic algorithm CAG can be directly applicable. CAG starts by constructing a maximal optimal partial solution, then performs an example-driven specific-to-general search on a dynamically maintained bipartite assignment graph to simultaneously learn a set of consistent subsets with small cardinality covering the ground set. Our experiments on benchmark datasets show that CAG achieves good results compared to existing popular heuristics.

Original languageEnglish (US)
Title of host publicationKDD-2007
Subtitle of host publicationProceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Pages310-319
Number of pages10
DOIs
StatePublished - 2007
Externally publishedYes
EventKDD-2007: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - San Jose, CA, United States
Duration: Aug 12 2007Aug 15 2007

Publication series

NameProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Other

OtherKDD-2007: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country/TerritoryUnited States
CitySan Jose, CA
Period8/12/078/15/07

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems

Keywords

  • Converse k-clustering
  • Minimum consistent subset cover
  • Minimum rule set
  • Pattern summarization

Fingerprint

Dive into the research topics of 'The minimum consistent subset cover problem and its applications in data mining'. Together they form a unique fingerprint.

Cite this