Consistent sampling through extremal process

Ping Li, Xiaoyun Li, Gennady Samorodnitsky, Weijie Zhao

Research output: Chapter in Book/Report/Conference proceedingConference contribution

6 Scopus citations

Abstract

The1 Jaccard similarity has been widely used in search and machine learning, especially in industrial practice. For binary (0/1) data, the Jaccard similarity is often called the "resemblance"and the method of minwise hashing has been the standard tool for computing resemblances in massive data. For general weighted data, the commonly used sampling algorithm for computing the (weighted) Jaccard similarity is the Consistent Weighted Sampling (CWS). A convenient (and perhaps also mysterious) implementation of CWS is the so-called "0-bit CWS"published in KDD 2015 [31], which, in this paper, we refer to as the "relaxed CWS"and was purely an empirical observation without theoretical justification. The difficulty in the analysis of the "relaxed CWS"is due to the complicated probability problem, which we could not resolve at this point. In this paper, we propose using extremal processes to generate samples for estimating the Jaccard similarity. Surprisingly, the proposed "extremal sampling"(ES) scheme makes it possible to analyze the "relaxed ES"variant. Through some novel probability endeavours, we are able to rigorously compute the bias of the "relaxed ES"which, to a good extent, explains why the "relaxed ES"works so well and when it does not in extreme corner cases. Interestingly, compared with CWS, the resultant algorithm only involves counting and does not need sophisticated mathematical operations (as required by CWS). It is therefore not surprising that the proposed ES scheme is actually noticeably faster than CWS. Although ES is different from CWS (and other algorithms in the literature for estimating the Jaccard similarity), in retrospect ES is indeed closely related to CWS. This paper provides the much needed insight which connects CWS with extremal processes. This insight may help understand CWS (and variants), and might help develop new algorithms for similarity estimation, in future research.

Original languageEnglish (US)
Title of host publicationThe Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
PublisherAssociation for Computing Machinery, Inc
Pages1317-1327
Number of pages11
ISBN (Electronic)9781450383127
DOIs
StatePublished - Apr 19 2021
Externally publishedYes
Event2021 World Wide Web Conference, WWW 2021 - Ljubljana, Slovenia
Duration: Apr 19 2021Apr 23 2021

Publication series

NameThe Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021

Conference

Conference2021 World Wide Web Conference, WWW 2021
Country/TerritorySlovenia
CityLjubljana
Period4/19/214/23/21

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Software

Keywords

  • Extremal Process
  • Jaccard Similarity
  • Massive Data

Fingerprint

Dive into the research topics of 'Consistent sampling through extremal process'. Together they form a unique fingerprint.

Cite this