TY - JOUR
T1 - TAPER
T2 - A two-step approach for all-strong-pairs correlation query in large databases
AU - Xiong, Hui
AU - Shekhar, Shashi
AU - Tan, Pang Ning
AU - Kumar, Vipin
N1 - Funding Information:
This work was partially supported by US National Science Foundation (NSF) grant IIS-0308264, NSF grant ITR ACI-0325949, and by the US Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAD19-01-2-0014. The content of this work does not necessarily reflect the position or policy of the government and no official endorsement should be inferred. Access to computing facilities was provided by the AHPCRC and the Minnesota Supercomputing Institute. In addition, the authors thank Professor Joydeep Ghosh (University of Texas, Austin) and Professor Ke Wang (Simon Fraser University, Canada) for valuable comments. Finally, they are grateful to Kim Koffolt for her timely and detailed feedback to help improve the readability of this paper.
PY - 2006/4
Y1 - 2006/4
N2 - Given a user-specified minimum correlation threshold θ and a market-basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the number of items and transactions are large, the computation cost of this query can be very high. The goal of this paper is to provide computationally efficient algorithms to answer the all-strong-pairs correlation query. Indeed, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient, but also exhibits special monotone properties which allow pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs corElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning is independent of or improves when the number of items is increased in data sets with Zipf-like or linear rank-support distributions. Experimental results from synthetic and real-world data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives. Finally, we demonstrate that the algorithmic ideas developed in the TAPER algorithm can be extended to efficiently compute negative correlation and uncentered Pearson's correlation coefficient.
AB - Given a user-specified minimum correlation threshold θ and a market-basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the number of items and transactions are large, the computation cost of this query can be very high. The goal of this paper is to provide computationally efficient algorithms to answer the all-strong-pairs correlation query. Indeed, we identify an upper bound of Pearson's correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearson's correlation coefficient, but also exhibits special monotone properties which allow pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs corElation queRy (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning is independent of or improves when the number of items is increased in data sets with Zipf-like or linear rank-support distributions. Experimental results from synthetic and real-world data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives. Finally, we demonstrate that the algorithmic ideas developed in the TAPER algorithm can be extended to efficiently compute negative correlation and uncentered Pearson's correlation coefficient.
KW - Association analysis
KW - Data mining
KW - Pearson's correlation coefficient
KW - Statistical computing
UR - http://www.scopus.com/inward/record.url?scp=33644654860&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33644654860&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2006.1599388
DO - 10.1109/TKDE.2006.1599388
M3 - Article
AN - SCOPUS:33644654860
SN - 1041-4347
VL - 18
SP - 493
EP - 508
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 4
ER -