Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search

Jin Wang, Guoliang Li, Dong Deng, Yong Zhang, Jianhua Feng

Research output: Chapter in Book/Report/Conference proceedingConference contribution

31 Scopus citations

Abstract

String similarity search is a fundamental operation in data cleaning and integration. It has two variants, threshold-based string similarity search and top-k string similarity search. Existing algorithms are efficient either for the former or the latter; most of them can't support both two variants. To address this limitation, we propose a unified framework. We first recursively partition strings into disjoint segments and build a hierarchical segment tree index (HS-Tree) on top of the segments. Then we utilize the HS-Tree to support similarity search. For threshold-based search, we identify appropriate tree nodes based on the threshold to answer the query and devise an efficient algorithm (HS-Search). For top-k search, we identify promising strings with large possibility to be similar to the query, utilize these strings to estimate an upper bound which is used to prune dissimilar strings, and propose an algorithm (HS-Topk). We also develop effective pruning techniques to further improve the performance. Experimental results on real-world datasets show our method achieves high performance on the two problems and significantly outperforms state-of-the-art algorithms.

Original languageEnglish (US)
Title of host publication2015 IEEE 31st International Conference on Data Engineering, ICDE 2015
PublisherIEEE Computer Society
Pages519-530
Number of pages12
ISBN (Electronic)9781479979639
DOIs
StatePublished - May 26 2015
Externally publishedYes
Event2015 31st IEEE International Conference on Data Engineering, ICDE 2015 - Seoul, Korea, Republic of
Duration: Apr 13 2015Apr 17 2015

Publication series

NameProceedings - International Conference on Data Engineering
Volume2015-May
ISSN (Print)1084-4627

Conference

Conference2015 31st IEEE International Conference on Data Engineering, ICDE 2015
CountryKorea, Republic of
CitySeoul
Period4/13/154/17/15

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Information Systems

Fingerprint Dive into the research topics of 'Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search'. Together they form a unique fingerprint.

Cite this