Faerie: Efficient filtering algorithms for approximate dictionary-based entity extraction

Guoliang Li, Dong Deng, Jianhua Feng

Research output: Chapter in Book/Report/Conference proceedingConference contribution

32 Scopus citations

Abstract

Dictionary-based entity extraction identifies predefined entities (e.g., person names or locations) from a document. A recent trend for improving extraction recall is to support approximate entity extraction, which finds all substrings in the document that approximately match entities in a given dictionary. Existing methods to address this problem support either token-based similarity (e.g., Jaccard Similarity) or character-based dissimilarity (e.g., Edit Distance). It calls for a unified method to support various similarity/dissimilarity functions, since a unified method can reduce the programming efforts, hardware requirements, and the manpower. In addition, many substrings in the document have overlaps, and we have an opportunity to utilize the shared computation across the overlaps to avoid unnecessary redundant computation. In this paper, we propose a unified framework to support many similarity/dissimilarity functions, such as jaccard similarity, cosine similarity, dice similarity, edit similarity, and edit distance. We devise efficient filtering algorithms to utilize the shared computation and develop effective pruning techniques to improve the performance. The experimental results show that our method achieves high performance and outperforms state-of-the-art studies.

Original languageEnglish (US)
Title of host publicationProceedings of SIGMOD 2011 and PODS 2011
Pages529-540
Number of pages12
DOIs
StatePublished - 2011
Externally publishedYes
Event2011 ACM SIGMOD and 30th PODS 2011 Conference - Athens, Greece
Duration: Jun 12 2011Jun 16 2011

Publication series

NameProceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)0730-8078

Conference

Conference2011 ACM SIGMOD and 30th PODS 2011 Conference
Country/TerritoryGreece
CityAthens
Period6/12/116/16/11

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems

Keywords

  • approximate entity extraction
  • filtering algorithms

Fingerprint

Dive into the research topics of 'Faerie: Efficient filtering algorithms for approximate dictionary-based entity extraction'. Together they form a unique fingerprint.

Cite this