Dima: A distributed in-memory similarity-based query processing system

Ji Sun, Zeyuan Shang, Guoliang Li, Dong Deng, Zhifeng Bao

Research output: Contribution to journal › Conference article › peer-review

17 Scopus citations


Data analysts in industry spend more than 80% of their time on data cleaning and integration over the whole data analytics process because of data errors and inconsistencies. This calls for effective query processing techniques that tolerate such errors and inconsistencies. In this paper, we develop a distributed in-memory similarity-based query processing system called Dima. Dima supports two core similarity-based query operations: similarity search and similarity join. Dima extends the SQL programming interface so that users can easily invoke these two operations in their data analysis jobs. To avoid expensive data transformation in a distributed environment, we design selectable signatures, where two records approximately match if they share common signatures. More importantly, we can adaptively select the signatures to balance the workload. Dima builds signature-based global indexes and local indexes to support efficient similarity search and join. Since Spark is a widely adopted distributed in-memory computing system, we have seamlessly integrated Dima into Spark and developed effective query optimization techniques in Spark. To the best of our knowledge, this is the first full-fledged distributed in-memory system that can support similarity-based query processing. We demonstrate our system in several scenarios, including entity matching, web table integration and query recommendation.
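
The signature idea in the abstract can be illustrated with a minimal single-machine sketch. It is not Dima's implementation: the abstract does not specify how selectable signatures are built, so the sketch assumes prefix filtering over word tokens with Jaccard similarity, a standard signature scheme in which two sufficiently similar records must share at least one rare token. The function names `prefix_signatures` and `similarity_join` are hypothetical, and there is no distribution, workload balancing, or Spark integration here.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_signatures(tokens, threshold, freq):
    """Signatures of a record: its rarest tokens (assumed prefix-filter scheme).

    With tokens sorted by a global order, two sets whose Jaccard similarity
    is at least `threshold` must share a token within their first
    len - ceil(threshold * len) + 1 tokens.
    """
    ranked = sorted(tokens, key=lambda t: (freq[t], t))
    # Small epsilon guards against floating-point overshoot before ceil.
    keep = len(ranked) - math.ceil(threshold * len(ranked) - 1e-9) + 1
    return ranked[:keep]


def similarity_join(records, threshold):
    """Self-join: return (i, j, similarity) for all pairs i < j above threshold."""
    token_sets = [set(r.lower().split()) for r in records]

    # Global token frequencies define the ordering (rare tokens first).
    freq = defaultdict(int)
    for tokens in token_sets:
        for t in tokens:
            freq[t] += 1

    index = defaultdict(list)  # signature -> ids of records seen so far
    results = []
    for rid, tokens in enumerate(token_sets):
        candidates = set()
        for sig in prefix_signatures(tokens, threshold, freq):
            candidates.update(index[sig])  # records sharing this signature
            index[sig].append(rid)
        # Verify candidates with the exact similarity function.
        for cid in candidates:
            sim = jaccard(tokens, token_sets[cid])
            if sim >= threshold:
                results.append((cid, rid, sim))
    return sorted(results)


if __name__ == "__main__":
    products = [
        "apple iphone 7 plus",
        "apple iphone 7",
        "samsung galaxy s8",
        "galaxy s8 samsung phone",
    ]
    for left, right, sim in similarity_join(products, threshold=0.7):
        print(products[left], "~", products[right], round(sim, 2))
```

Only records that share a signature are compared, so most dissimilar pairs are never verified. A similarity search against a single query record would follow the same pattern, probing the signature index with the query's signatures instead of joining the collection with itself.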

Original language: English (US)
Pages (from-to): 1925-1928
Number of pages: 4
Journal: Proceedings of the VLDB Endowment
Issue number: 12
State: Published - Aug 1 2017
Externally published: Yes
Event: 43rd International Conference on Very Large Data Bases, VLDB 2017 - Munich, Germany
Duration: Aug 28 2017 → Sep 1 2017

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Computer Science (all)

