Taxonomic data integration from multilingual Wikipedia editions

Gerard de Melo, Gerhard Weikum

Research output: Contribution to journal › Article › peer-review

13 Scopus citations

Abstract

Information systems are increasingly making use of taxonomic knowledge about words and entities. A taxonomic knowledge base may reveal that the Lago di Garda is a lake and that lakes as well as ponds, reservoirs, and marshes are all bodies of water. As the number of available taxonomic knowledge sources grows, there is a need for techniques to integrate such data into combined, unified taxonomies. In particular, the Wikipedia encyclopedia has been used by a number of projects, but its multilingual nature has largely been neglected. This paper investigates how entities from all editions of Wikipedia as well as WordNet can be integrated into a single coherent taxonomic class hierarchy. We rely on linking heuristics to discover potential taxonomic relationships, graph partitioning to form consistent equivalence classes of entities, and a Markov chain-based ranking approach to construct the final taxonomy. This results in MENTA (Multilingual Entity Taxonomy), a resource that describes 5.4 million entities and is one of the largest multilingual lexical knowledge bases currently available.
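
To make the Markov chain-based ranking step more concrete, the following is a minimal sketch of a random walk with restart over a small graph of candidate taxonomic links. It is not the MENTA implementation: the function name `rank_candidates`, the edge weights, the restart probability, and the toy graph are all assumptions chosen for illustration. Only the Lago di Garda example comes from the abstract.

```python
def rank_candidates(edges, start, restart=0.15, iterations=50):
    """Score nodes by a random walk with restart from `start`.

    edges: dict mapping node -> {neighbor: weight}; weights stand in for
           the confidence of a candidate taxonomic link (hypothetical values).
    Returns a dict mapping every node to its share of the probability mass.
    """
    # Collect every node that appears as a source or a target.
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)

    # All probability mass begins at the start entity.
    scores = {n: 0.0 for n in nodes}
    scores[start] = 1.0

    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for node, mass in scores.items():
            out = edges.get(node, {})
            total = sum(out.values())
            if total == 0:
                # Dangling node: send all of its mass back to the start.
                nxt[start] += mass
                continue
            # With probability `restart`, jump back to the start entity.
            nxt[start] += restart * mass
            # Otherwise follow an outgoing link in proportion to its weight.
            for target, weight in out.items():
                nxt[target] += (1 - restart) * mass * weight / total
        scores = nxt
    return scores


# Toy graph with made-up weights.
graph = {
    "Lago di Garda": {"lake": 2.0, "tourist attraction": 1.0},
    "lake": {"body of water": 1.0},
}
scores = rank_candidates(graph, "Lago di Garda")
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

In this sketch, candidates reached through more heavily weighted links accumulate more probability mass, so "lake" scores above "tourist attraction" as a parent for Lago di Garda. How the actual system defines its transition probabilities and selects final parents is described in the paper itself.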

Original language: English (US)
Pages (from-to): 1-39
Number of pages: 39
Journal: Knowledge and Information Systems
Volume: 39
Issue number: 1
State: Published - Apr 2014
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Hardware and Architecture
  • Artificial Intelligence

Keywords

  • Graph
  • Multilingual
  • Ranking
  • Taxonomy induction
