TY - JOUR
T1 - Finding frequent items in data streams
AU - Charikar, Moses
AU - Chen, Kevin
AU - Farach-Colton, Martin
N1 - Funding Information:
∗Corresponding author. E-mail addresses: moses@cs.princeton.edu, (M. Charikar), kevinc@cs.berkeley.edu (K. Chen), farach@cs.rutgers.edu (M. Farach-Colton). 1Most of this work was done while the authors were at Google Inc. Moses Charikar gratefully acknowledges support from NSF ITR Grant CCR-0205594 and DOE Early Career Principal Investigator award DE-FG02-02ER25540.
PY - 2004/1/26
Y1 - 2004/1/26
N2 - We present a 1-pass algorithm for estimating the most frequent items in a data stream using limited storage space. Our method relies on a data structure called a COUNT SKETCH, which allows us to reliably estimate the frequencies of frequent items in the stream. Our algorithm achieves better space bounds than the previously known best algorithms for this problem for several natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this latter problem has not been previously studied in the literature.
AB - We present a 1-pass algorithm for estimating the most frequent items in a data stream using limited storage space. Our method relies on a data structure called a COUNT SKETCH, which allows us to reliably estimate the frequencies of frequent items in the stream. Our algorithm achieves better space bounds than the previously known best algorithms for this problem for several natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this latter problem has not been previously studied in the literature.
KW - Approximation
KW - Frequent items
KW - Streaming algorithm
UR - http://www.scopus.com/inward/record.url?scp=0348230637&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0348230637&partnerID=8YFLogxK
U2 - 10.1016/S0304-3975(03)00400-6
DO - 10.1016/S0304-3975(03)00400-6
M3 - Conference article
AN - SCOPUS:0348230637
SN - 0304-3975
VL - 312
SP - 3
EP - 15
JO - Theoretical Computer Science
JF - Theoretical Computer Science
IS - 1
T2 - Automata, Languages and Programming
Y2 - 8 July 2002 through 13 July 2002
ER -