TY - GEN
T1 - Consistent sampling through extremal process
AU - Li, Ping
AU - Li, Xiaoyun
AU - Samorodnitsky, Gennady
AU - Zhao, Weijie
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/4/19
Y1 - 2021/4/19
N2 - The1 Jaccard similarity has been widely used in search and machine learning, especially in industrial practice. For binary (0/1) data, the Jaccard similarity is often called the "resemblance"and the method of minwise hashing has been the standard tool for computing resemblances in massive data. For general weighted data, the commonly used sampling algorithm for computing the (weighted) Jaccard similarity is the Consistent Weighted Sampling (CWS). A convenient (and perhaps also mysterious) implementation of CWS is the so-called "0-bit CWS"published in KDD 2015 [31], which, in this paper, we refer to as the "relaxed CWS"and was purely an empirical observation without theoretical justification. The difficulty in the analysis of the "relaxed CWS"is due to the complicated probability problem, which we could not resolve at this point. In this paper, we propose using extremal processes to generate samples for estimating the Jaccard similarity. Surprisingly, the proposed "extremal sampling"(ES) scheme makes it possible to analyze the "relaxed ES"variant. Through some novel probability endeavours, we are able to rigorously compute the bias of the "relaxed ES"which, to a good extent, explains why the "relaxed ES"works so well and when it does not in extreme corner cases. Interestingly, compared with CWS, the resultant algorithm only involves counting and does not need sophisticated mathematical operations (as required by CWS). It is therefore not surprising that the proposed ES scheme is actually noticeably faster than CWS. Although ES is different from CWS (and other algorithms in the literature for estimating the Jaccard similarity), in retrospect ES is indeed closely related to CWS. This paper provides the much needed insight which connects CWS with extremal processes. This insight may help understand CWS (and variants), and might help develop new algorithms for similarity estimation, in future research.
AB - The1 Jaccard similarity has been widely used in search and machine learning, especially in industrial practice. For binary (0/1) data, the Jaccard similarity is often called the "resemblance"and the method of minwise hashing has been the standard tool for computing resemblances in massive data. For general weighted data, the commonly used sampling algorithm for computing the (weighted) Jaccard similarity is the Consistent Weighted Sampling (CWS). A convenient (and perhaps also mysterious) implementation of CWS is the so-called "0-bit CWS"published in KDD 2015 [31], which, in this paper, we refer to as the "relaxed CWS"and was purely an empirical observation without theoretical justification. The difficulty in the analysis of the "relaxed CWS"is due to the complicated probability problem, which we could not resolve at this point. In this paper, we propose using extremal processes to generate samples for estimating the Jaccard similarity. Surprisingly, the proposed "extremal sampling"(ES) scheme makes it possible to analyze the "relaxed ES"variant. Through some novel probability endeavours, we are able to rigorously compute the bias of the "relaxed ES"which, to a good extent, explains why the "relaxed ES"works so well and when it does not in extreme corner cases. Interestingly, compared with CWS, the resultant algorithm only involves counting and does not need sophisticated mathematical operations (as required by CWS). It is therefore not surprising that the proposed ES scheme is actually noticeably faster than CWS. Although ES is different from CWS (and other algorithms in the literature for estimating the Jaccard similarity), in retrospect ES is indeed closely related to CWS. This paper provides the much needed insight which connects CWS with extremal processes. This insight may help understand CWS (and variants), and might help develop new algorithms for similarity estimation, in future research.
KW - Extremal Process
KW - Jaccard Similarity
KW - Massive Data
UR - http://www.scopus.com/inward/record.url?scp=85107995746&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107995746&partnerID=8YFLogxK
U2 - 10.1145/3442381.3449955
DO - 10.1145/3442381.3449955
M3 - Conference contribution
AN - SCOPUS:85107995746
T3 - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
SP - 1317
EP - 1327
BT - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
PB - Association for Computing Machinery, Inc
T2 - 2021 World Wide Web Conference, WWW 2021
Y2 - 19 April 2021 through 23 April 2021
ER -