An efficient partition based method for exact set similarity joins

Dong Deng, Guoliang Li, He Wen, Jianhua Feng

Research output: Chapter in Book/Report/Conference proceedingChapter

66 Scopus citations

Abstract

We study the exact set similarity join problem, which, given two collections of sets, finds out all the similar set pairs from the collections. Existing methods generally utilize the prefix filter based framework. They generate a prefix for each set and prune all the pairs whose prefixes are disjoint. However the pruning power is limited, because if two dissimilar sets share a common element in their prefixes, they cannot be pruned. To address this problem, we propose a partition- based framework. We design a partition scheme to partition the sets into several subsets and guarantee that two sets are similar only if they share a common subset. To improve the pruning power, we propose a mixture of the subsets and their 1-deletion neighborhoods (the subset of a set by eliminating one element). As there are multiple allocation strategies to generate the mixture, we evaluate difierent allocations and design a dynamic-programming algorithm to select the optimal one. However the time complexity of generating the optimal one is O(s3) for a set with size s. To speed up the allocation selection, we develop a greedy algorithm with an approximation ratio of 2. To further reduce the complexity, we design an adaptive grouping mechanism, and the two techniques can reduce the complexity to O(s log s). Experimental results on three real-world datasets show our method achieves high performance and outperforms state- of-the-art methods by 2-5 times.

Original languageEnglish (US)
Title of host publicationProceedings of the VLDB Endowment
PublisherAssociation for Computing Machinery
Pages360-371
Number of pages12
Edition4
StatePublished - 2016
Externally publishedYes
Event42nd International Conference on Very Large Data Bases, VLDB 2016 - Delhi, India
Duration: Sep 5 2016Sep 9 2016

Publication series

NameProceedings of the VLDB Endowment
Number4
Volume9
ISSN (Electronic)2150-8097

Conference

Conference42nd International Conference on Very Large Data Bases, VLDB 2016
Country/TerritoryIndia
CityDelhi
Period9/5/169/9/16

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Computer Science(all)

Fingerprint

Dive into the research topics of 'An efficient partition based method for exact set similarity joins'. Together they form a unique fingerprint.

Cite this