2ED: An efficient entity extraction algorithm using two-level edit-distance

Zeyi Wen, Dong Deng, Rui Zhang, Ramamohanarao Kotagiri

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

5 Scopus citations

Abstract

Entity extraction is fundamental to many text mining tasks such as organisation name recognition. A popular approach to entity extraction is based on string matching against a dictionary of known entities. For approximate entity extraction from free text, considering solely character-based or solely token-based similarity cannot simultaneously deal with minor name variations at the token level and typos at the character level. Moreover, the tolerance of mismatch at the character level may differ from that at the token level, and the tolerance thresholds of the two levels should be individually customisable. In this paper, we propose an efficient character-level and token-level edit-distance based algorithm called FuzzyED. To improve the efficiency of FuzzyED, we develop various novel techniques including (i) a spanning-based technique for producing candidate sub-strings, (ii) a lower bound on dissimilarity to determine the boundaries of candidate sub-strings, (iii) a core-token-based technique that makes use of the importance of tokens to reduce the number of unpromising candidate sub-strings, and (iv) a shrinking technique to reuse computation. Empirical results on real-world datasets show that FuzzyED can efficiently extract entities and produce a high F1 score in the range of [0.91, 0.97].
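The idea of separate character-level and token-level tolerances can be illustrated with a small sketch. This is not the FuzzyED algorithm and uses none of its four optimisations. It is a naive baseline under one assumed definition of two-level edit distance: two tokens count as equal when their character-level edit distance is at most `char_tau`, and a candidate sub-string matches an entity when the token-level edit distance computed with that fuzzy equality is at most `token_tau`. The function names, both parameter names, and the whitespace tokenisation are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def char_edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def token_edit_distance(x: Sequence[str], y: Sequence[str], char_tau: int) -> int:
    """Levenshtein distance over token sequences.

    Assumption: two tokens are treated as equal (cost 0) when their
    character-level edit distance is at most char_tau.
    """
    prev = list(range(len(y) + 1))
    for i, tx in enumerate(x, 1):
        curr = [i]
        for j, ty in enumerate(y, 1):
            cost = 0 if char_edit_distance(tx, ty) <= char_tau else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def extract(text: str, dictionary: Sequence[str],
            char_tau: int = 1, token_tau: int = 1) -> List[Tuple[str, str]]:
    """Naive extraction: test every token window against every entity."""
    tokens = text.lower().split()
    matches = []
    for entity in dictionary:
        e_tokens = entity.lower().split()
        # Only window lengths within token_tau of the entity length can match.
        shortest = max(1, len(e_tokens) - token_tau)
        longest = len(e_tokens) + token_tau
        for start in range(len(tokens)):
            for length in range(shortest, longest + 1):
                window = tokens[start:start + length]
                if len(window) < length:
                    break  # ran past the end of the text
                if token_edit_distance(window, e_tokens, char_tau) <= token_tau:
                    matches.append((entity, " ".join(window)))
    return matches
```

For example, `extract("she joined the univercity of melbourne in 2019", ["University of Melbourne"], char_tau=1, token_tau=0)` tolerates the one-character typo in "univercity" while requiring every token to be present. Raising `token_tau` to 1 would additionally accept windows with a missing or extra token. The brute-force enumeration of windows is exactly the cost that candidate-pruning techniques such as those listed in the abstract aim to avoid.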

Original language: English (US)
Title of host publication: Proceedings - 2019 IEEE 35th International Conference on Data Engineering, ICDE 2019
Publisher: IEEE Computer Society
Pages: 998-1009
Number of pages: 12
ISBN (Electronic): 9781538674741
State: Published - Apr 2019
Event: 35th IEEE International Conference on Data Engineering, ICDE 2019 - Macau, China
Duration: Apr 8 2019 → Apr 11 2019

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2019-April
ISSN (Print): 1084-4627

Conference

Conference: 35th IEEE International Conference on Data Engineering, ICDE 2019
Country: China
City: Macau
Period: 4/8/19 → 4/11/19

All Science Journal Classification (ASJC) codes

  • Software
  • Signal Processing
  • Information Systems

Keywords

  • Approximation
  • Edit distance
  • Entity extraction
