TY - GEN

T1 - The DLT priority sampling is essentially optimal

AU - Szegedy, Mario

PY - 2006

Y1 - 2006

N2 - The priority sampling procedure of N. Duffield, C. Lund and M. Thorup is not only an exciting new approach to sampling weighted data streams, but it has also proven to be highly successful in a variety of practical applications. We resolve the two major issues related to its performance. First we solve the main conjecture of N. Alon, N. Duffield, C. Lund and M. Thorup in [1], which states that the standard deviation for the subset sum estimator obtained from k priority samples is upper bounded by W/√k - 1, where W denotes the actual subset sum that the estimator estimates. Although Alon et al. give an O(W/√k - 1) upper bound on the standard deviation of the estimator, their formula cannot be used as a performance guarantee in an applied setting, because the constants coming up in their proof are very large. Our constant cannot be improved. We also resolve the conjecture of Duffield, C. Lund and M. Thorup which states that the variance of the priority sampling procedure is not larger than the variance of the threshold sampling procedure with sample size only one smaller. This is the main conjecture in [7]. The conjecture's significance is that the latter procedure is provably optimal within a very general class of sampling algorithms, but unlike priority sampling, it is not practical. Our result therefore certifies that priority sampling offers the unlikely feat of uniting mathematical elegance, (essential) optimality and applicability. Our proof is based on a new integral formula and on very finely tuned telescopic sums.

AB - The priority sampling procedure of N. Duffield, C. Lund and M. Thorup is not only an exciting new approach to sampling weighted data streams, but it has also proven to be highly successful in a variety of practical applications. We resolve the two major issues related to its performance. First we solve the main conjecture of N. Alon, N. Duffield, C. Lund and M. Thorup in [1], which states that the standard deviation for the subset sum estimator obtained from k priority samples is upper bounded by W/√k - 1, where W denotes the actual subset sum that the estimator estimates. Although Alon et al. give an O(W/√k - 1) upper bound on the standard deviation of the estimator, their formula cannot be used as a performance guarantee in an applied setting, because the constants coming up in their proof are very large. Our constant cannot be improved. We also resolve the conjecture of Duffield, C. Lund and M. Thorup which states that the variance of the priority sampling procedure is not larger than the variance of the threshold sampling procedure with sample size only one smaller. This is the main conjecture in [7]. The conjecture's significance is that the latter procedure is provably optimal within a very general class of sampling algorithms, but unlike priority sampling, it is not practical. Our result therefore certifies that priority sampling offers the unlikely feat of uniting mathematical elegance, (essential) optimality and applicability. Our proof is based on a new integral formula and on very finely tuned telescopic sums.

KW - Integral formula

KW - Internet traffic

KW - Network measurement

KW - Priority sampling

KW - Subset sum estimate

KW - Telescopic sums

UR - http://www.scopus.com/inward/record.url?scp=33748098958&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33748098958&partnerID=8YFLogxK

U2 - 10.1145/1132516.1132539

DO - 10.1145/1132516.1132539

M3 - Conference contribution

AN - SCOPUS:33748098958

SN - 1595931341

SN - 9781595931344

T3 - Proceedings of the Annual ACM Symposium on Theory of Computing

SP - 150

EP - 158

BT - STOC'06

PB - Association for Computing Machinery (ACM)

T2 - 38th Annual ACM Symposium on Theory of Computing, STOC'06

Y2 - 21 May 2006 through 23 May 2006

ER -