TY - GEN

T1 - The minimum consistent subset cover problem and its applications in data mining

AU - Gao, Byron J.

AU - Ester, Martin

AU - Schulte, Oliver

AU - Cai, Jin Yi

AU - Xiong, Hui

PY - 2007

Y1 - 2007

N2 - In this paper, we introduce and study the Minimum Consistent Subset Cover (MCSC) problem. Given a finite ground set X and a constraint t, find the minimum number of consistent subsets that cover X, where a subset of X is consistent if it satisfies t. The MCSC problem generalizes the traditional set covering problem and has Minimum Clique Partition, a dual problem of graph coloring, as an instance. Many practical data mining problems in the areas of rule learning, clustering, and frequent pattern mining can be formulated as MCSC instances. In particular, we discuss the Minimum Rule Set problem that minimizes model complexity of decision rules as well as some converse k-clustering problems that minimize the number of clusters satisfying certain distance constraints. We also show how the MCSC problem can find applications in frequent pattern summarization. For any of these MCSC formulations, our proposed novel graph-based generic algorithm CAG can be directly applicable. CAG starts by constructing a maximal optimal partial solution, then performs an example-driven specific-to-general search on a dynamically maintained bipartite assignment graph to simultaneously learn a set of consistent subsets with small cardinality covering the ground set. Our experiments on benchmark datasets show that CAG achieves good results compared to existing popular heuristics.

AB - In this paper, we introduce and study the Minimum Consistent Subset Cover (MCSC) problem. Given a finite ground set X and a constraint t, find the minimum number of consistent subsets that cover X, where a subset of X is consistent if it satisfies t. The MCSC problem generalizes the traditional set covering problem and has Minimum Clique Partition, a dual problem of graph coloring, as an instance. Many practical data mining problems in the areas of rule learning, clustering, and frequent pattern mining can be formulated as MCSC instances. In particular, we discuss the Minimum Rule Set problem that minimizes model complexity of decision rules as well as some converse k-clustering problems that minimize the number of clusters satisfying certain distance constraints. We also show how the MCSC problem can find applications in frequent pattern summarization. For any of these MCSC formulations, our proposed novel graph-based generic algorithm CAG can be directly applicable. CAG starts by constructing a maximal optimal partial solution, then performs an example-driven specific-to-general search on a dynamically maintained bipartite assignment graph to simultaneously learn a set of consistent subsets with small cardinality covering the ground set. Our experiments on benchmark datasets show that CAG achieves good results compared to existing popular heuristics.

KW - Converse k-clustering

KW - Minimum consistent subset cover

KW - Minimum rule set

KW - Pattern summarization

UR - http://www.scopus.com/inward/record.url?scp=36849021586&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=36849021586&partnerID=8YFLogxK

U2 - 10.1145/1281192.1281228

DO - 10.1145/1281192.1281228

M3 - Conference contribution

AN - SCOPUS:36849021586

SN - 1595936092

SN - 9781595936097

T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

SP - 310

EP - 319

BT - KDD-2007

T2 - KDD-2007: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Y2 - 12 August 2007 through 15 August 2007

ER -