Asymmetric minwise hashing for indexing binary inner products and set containment

Anshumali Shrivastava, Ping Li

Research output: Chapter in Book/Report/Conference proceedingConference contribution

60 Scopus citations

Abstract

Minwise hashing (Minhash) is a widely popular indexing scheme in practice. Minhash is designed for estimating set resemblance and is known to be suboptimal in many applications where the desired measure is set overlap (i.e., inner product between binary vectors) or set containment. Minhash has inherent bias towards smaller sets, which adversely affects its performance in applications where such a penalization is not desirable. In this paper, we propose asymmetric minwise hashing (MH-ALSH), to provide a solution to this well-known problem. The new scheme utilizes asymmetric transformations to cancel the bias of traditionalminhash towards smaller sets, making the final "collision probability" monotonic in the inner product. Our theoretical comparisons show that, for the task of retrieving with binary inner products, asymmetric minhash is provably better than traditional minhash and other recently proposed hashing algorithms for general inner products. Thus, we obtain an algorithmic improvement over existing approaches in the literature. Experimental evaluations on four publicly available highdimensional datasets validate our claims. The proposed scheme outperforms, often significantly, other hashing algorithms on the task of near neighbor retrieval with set containment. Our proposal is simple and easy to implement in practice.

Original languageEnglish (US)
Title of host publicationWWW 2015 - Proceedings of the 24th International Conference on World Wide Web
PublisherAssociation for Computing Machinery, Inc
Pages981-991
Number of pages11
ISBN (Electronic)9781450334693
DOIs
StatePublished - May 18 2015
Event24th International Conference on World Wide Web, WWW 2015 - Florence, Italy
Duration: May 18 2015May 22 2015

Publication series

NameWWW 2015 - Proceedings of the 24th International Conference on World Wide Web

Other

Other24th International Conference on World Wide Web, WWW 2015
Country/TerritoryItaly
CityFlorence
Period5/18/155/22/15

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Software

Fingerprint

Dive into the research topics of 'Asymmetric minwise hashing for indexing binary inner products and set containment'. Together they form a unique fingerprint.

Cite this