Josie: Overlap set similarity search for finding joinable tables in data lakes

Erkang Zhu, Fatemeh Nargesian, Dong Deng, Renée J. Miller

Research output: Chapter in Book/Report/Conference proceedingConference contribution

6 Scopus citations

Abstract

We present a new solution for finding joinable tables in massive data lakes: given a table and one join column, find tables that can be joined with the given table on the largest number of distinct values. The problem can be formulated as an overlap set similarity search problem by considering columns as sets and matching values as intersection between sets. Although set similarity search is well-studied in the field of approximate string search (e.g., fuzzy keyword search), the solutions are designed for and evaluated over sets of relatively small size (average set size rarely much over 100 and maximum set size in the low thousands) with modest dictionary sizes (the total number of distinct values in all sets is only a few million). We observe that modern data lakes typically have massive set sizes (with maximum set sizes that may be tens of millions) and dictionaries that include hundreds of millions of distinct values. Our new algorithm, JOSIE (JOining Search using Intersection Estimation) minimizes the cost of set reads and inverted index probes used in finding the top-k sets. We show that JOSIE completely out performs the state-of-the-art overlap set similarity search techniques on data lakes. More surprising, we also consider state-of-the-art approximate algorithm and show that our new exact search algorithm performs almost as well, and even in some cases better, on real data lakes.

Original languageEnglish (US)
Title of host publicationSIGMOD 2019 - Proceedings of the 2019 International Conference on Management of Data
PublisherAssociation for Computing Machinery
Pages847-864
Number of pages18
ISBN (Electronic)9781450356435
DOIs
StatePublished - Jun 25 2019
Event2019 International Conference on Management of Data, SIGMOD 2019 - Amsterdam, Netherlands
Duration: Jun 30 2019Jul 5 2019

Publication series

NameProceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)0730-8078

Conference

Conference2019 International Conference on Management of Data, SIGMOD 2019
CountryNetherlands
CityAmsterdam
Period6/30/197/5/19

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems

Fingerprint Dive into the research topics of 'Josie: Overlap set similarity search for finding joinable tables in data lakes'. Together they form a unique fingerprint.

Cite this