TY - GEN
T1 - Asymmetric minwise hashing for indexing binary inner products and set containment
AU - Shrivastava, Anshumali
AU - Li, Ping
PY - 2015/5/18
Y1 - 2015/5/18
N2 - Minwise hashing (Minhash) is a widely popular indexing scheme in practice. Minhash is designed for estimating set resemblance and is known to be suboptimal in many applications where the desired measure is set overlap (i.e., inner product between binary vectors) or set containment. Minhash has inherent bias towards smaller sets, which adversely affects its performance in applications where such a penalization is not desirable. In this paper, we propose asymmetric minwise hashing (MH-ALSH), to provide a solution to this well-known problem. The new scheme utilizes asymmetric transformations to cancel the bias of traditionalminhash towards smaller sets, making the final "collision probability" monotonic in the inner product. Our theoretical comparisons show that, for the task of retrieving with binary inner products, asymmetric minhash is provably better than traditional minhash and other recently proposed hashing algorithms for general inner products. Thus, we obtain an algorithmic improvement over existing approaches in the literature. Experimental evaluations on four publicly available highdimensional datasets validate our claims. The proposed scheme outperforms, often significantly, other hashing algorithms on the task of near neighbor retrieval with set containment. Our proposal is simple and easy to implement in practice.
AB - Minwise hashing (Minhash) is a widely popular indexing scheme in practice. Minhash is designed for estimating set resemblance and is known to be suboptimal in many applications where the desired measure is set overlap (i.e., inner product between binary vectors) or set containment. Minhash has inherent bias towards smaller sets, which adversely affects its performance in applications where such a penalization is not desirable. In this paper, we propose asymmetric minwise hashing (MH-ALSH), to provide a solution to this well-known problem. The new scheme utilizes asymmetric transformations to cancel the bias of traditionalminhash towards smaller sets, making the final "collision probability" monotonic in the inner product. Our theoretical comparisons show that, for the task of retrieving with binary inner products, asymmetric minhash is provably better than traditional minhash and other recently proposed hashing algorithms for general inner products. Thus, we obtain an algorithmic improvement over existing approaches in the literature. Experimental evaluations on four publicly available highdimensional datasets validate our claims. The proposed scheme outperforms, often significantly, other hashing algorithms on the task of near neighbor retrieval with set containment. Our proposal is simple and easy to implement in practice.
UR - http://www.scopus.com/inward/record.url?scp=84968845887&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84968845887&partnerID=8YFLogxK
U2 - 10.1145/2736277.2741285
DO - 10.1145/2736277.2741285
M3 - Conference contribution
AN - SCOPUS:84968845887
T3 - WWW 2015 - Proceedings of the 24th International Conference on World Wide Web
SP - 981
EP - 991
BT - WWW 2015 - Proceedings of the 24th International Conference on World Wide Web
PB - Association for Computing Machinery, Inc
T2 - 24th International Conference on World Wide Web, WWW 2015
Y2 - 18 May 2015 through 22 May 2015
ER -