MassJoin: A mapreduce-based method for scalable string similarity joins

Dong Deng, Guoliang Li, Shuang Hao, Jiannan Wang, Jianhua Feng

Research output: Chapter in Book/Report/Conference proceedingConference contribution

83 Scopus citations

Abstract

String similarity join is an essential operation in data integration. The era of big data calls for scalable algorithms to support large-scale string similarity joins. In this paper, we study scalable string similarity joins using MapReduce. We propose a MapReduce-based framework, called MASSJOIN, which supports both set-based similarity functions and character-based similarity functions. We extend the existing partition-based signature scheme to support set-based similarity functions. We utilize the signatures to generate key-value pairs. To reduce the transmission cost, we merge key-value pairs to significantly reduce the number of key-value pairs, from cubic to linear complexity, while not sacrificing the pruning power. To improve the performance, we incorporate 'light-weight' filter units into the key-value pairs which can be utilized to prune large number of dissimilar pairs without significantly increasing the transmission cost. Experimental results on real-world datasets show that our method significantly outperformed state-of-the-art approaches.

Original languageEnglish (US)
Title of host publication2014 IEEE 30th International Conference on Data Engineering, ICDE 2014
PublisherIEEE Computer Society
Pages340-351
Number of pages12
ISBN (Print)9781479925544
DOIs
StatePublished - 2014
Externally publishedYes
Event30th IEEE International Conference on Data Engineering, ICDE 2014 - Chicago, IL, United States
Duration: Mar 31 2014Apr 4 2014

Publication series

NameProceedings - International Conference on Data Engineering
ISSN (Print)1084-4627

Conference

Conference30th IEEE International Conference on Data Engineering, ICDE 2014
Country/TerritoryUnited States
CityChicago, IL
Period3/31/144/4/14

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Information Systems

Fingerprint

Dive into the research topics of 'MassJoin: A mapreduce-based method for scalable string similarity joins'. Together they form a unique fingerprint.

Cite this