Scalable column concept determination for web tables using large knowledge bases

Dong Deng, Yu Jiang, Guoliang Li, Jian Li, Cong Yu

Research output: Contribution to journal › Conference article › peer-review

35 Scopus citations

Abstract

Tabular data on the Web has become a rich source of structured data that is useful for ordinary users to explore. Due to its potential, tables on the Web have recently attracted a number of studies with the goals of understanding the semantics of those Web tables and providing effective search and exploration mechanisms over them. An important part of table understanding and search is column concept determination, i.e., identifying the most appropriate concepts associated with the columns of the tables. The problem becomes especially challenging with the availability of increasingly rich knowledge bases that contain hundreds of millions of entities. In this paper, we focus on an important instantiation of the column concept determination problem, namely, the concepts of a column are determined by fuzzy matching its cell values to the entities within a large knowledge base. We provide an efficient and scalable MapReduce-based solution that is scalable to both the number of tables and the size of the knowledge base and propose two novel techniques: knowledge concept aggregation and knowledge entity partition. We prove that both the problem of finding the optimal aggregation strategy and that of finding the optimal partition strategy are NP-hard, and propose efficient heuristic techniques by leveraging the hierarchy of the knowledge base. Experimental results on real-world datasets show that our method achieves high annotation quality and performance, and scales well.
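
As a rough illustration of the problem statement in the abstract (not the authors' MapReduce algorithm), the sketch below determines the concepts of a column by fuzzy matching each cell value against entity names and counting, per concept, how many cells matched. The token-level Jaccard similarity, the 0.5 threshold, the function names, and the toy knowledge base are all assumptions made for this example; the abstract does not say which similarity function or threshold the paper uses.

```python
from collections import Counter


def tokens(text):
    """Lowercase a string and split it into a set of whitespace tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-level Jaccard similarity (an assumed stand-in for 'fuzzy matching')."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def column_concepts(cells, kb, threshold=0.5):
    """Count, for each concept, the cells that fuzzily match one of its entities.

    kb maps an entity name to the set of concepts it belongs to.
    Each cell contributes at most one vote per concept.
    """
    votes = Counter()
    for cell in cells:
        matched = set()
        # Naive scan: every cell is compared with every entity.
        for entity, concepts in kb.items():
            if jaccard(cell, entity) >= threshold:
                matched |= concepts
        votes.update(matched)
    return votes


# Hypothetical toy knowledge base, for illustration only.
KB = {
    "barack obama": {"politician", "person"},
    "angela merkel": {"politician", "person"},
    "paris": {"city"},
}

column = ["Barack Obama", "Angela Merkel", "Paris Hilton"]
votes = column_concepts(column, KB)
```

Here two cells vote for "politician" and "person", while "Paris Hilton" matches the entity "paris" only loosely and adds a single vote for "city". The nested loop compares every cell with every entity, which is the kind of cost that becomes impractical for knowledge bases with hundreds of millions of entities.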

Original language: English (US)
Pages (from-to): 1606-1617
Number of pages: 12
Journal: Proceedings of the VLDB Endowment
Volume: 6
Issue number: 13
State: Published - Aug 2013
Externally published: Yes
Event: 39th International Conference on Very Large Data Bases, VLDB 2013 - Trento, Italy
Duration: Aug 26 2013 → Aug 30 2013

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Computer Science (all)
