Record linkage is the process of identifying whether two separate records refer to the same real-world entity when some elements of the records' identifying information (attributes) agree and others disagree. Existing record-linkage decision methodologies use the outcomes of comparisons over the whole set of attributes. Here, we propose an alternative scheme that assesses the attributes sequentially, allowing a decision to be made at any attribute-comparison stage, and thus before all available attributes have been exhausted. The scheme we develop is optimal in that it minimizes a well-defined average cost criterion, and the corresponding optimal solution can be easily mapped into a decision tree to facilitate the record-linkage decision process. Experiments on real datasets indicate that our methodology outperforms existing approaches.
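To make the idea of sequential assessment concrete, the following is a minimal sketch of early stopping during attribute comparison. It is not the cost-optimal rule developed in the paper: it swaps in a simple sequential likelihood-ratio test with fixed thresholds. The function name `sequential_link`, the per-attribute agreement probabilities `m_probs` (among true matches) and `u_probs` (among non-matches), and the threshold values are all assumptions made for illustration.

```python
import math

def sequential_link(agreements, m_probs, u_probs,
                    upper=math.log(20.0), lower=-math.log(20.0)):
    """Compare attributes one at a time and stop as soon as the evidence is decisive.

    agreements: list of booleans, True if the two records agree on that attribute.
    m_probs[i]: assumed probability that attribute i agrees for a true match.
    u_probs[i]: assumed probability that attribute i agrees for a non-match.
    upper/lower: hypothetical stopping thresholds on the log-likelihood ratio.
    Returns (decision, number_of_attributes_compared).
    """
    llr = 0.0   # accumulated log-likelihood ratio, match vs. non-match
    used = 0    # how many attributes have been compared so far
    for agree, m, u in zip(agreements, m_probs, u_probs):
        used += 1
        if agree:
            llr += math.log(m / u)
        else:
            llr += math.log((1.0 - m) / (1.0 - u))
        if llr >= upper:
            return "match", used        # early stop: strong evidence for a match
        if llr <= lower:
            return "non-match", used    # early stop: strong evidence against
    # All attributes exhausted without crossing a threshold:
    # fall back to the sign of the accumulated evidence.
    return ("match" if llr > 0 else "non-match"), used


m = [0.9, 0.9, 0.9]
u = [0.1, 0.1, 0.1]
print(sequential_link([True, True, False], m, u))
print(sequential_link([True, False, True], m, u))
```

In the first call the decision is reached after two attributes, so the third is never compared; in the second call the evidence stays between the thresholds and all three attributes are used. Because each stage has only two outcomes (agree or disagree), a stopping rule of this kind can be unrolled into a binary decision tree, which is the sort of representation the abstract refers to.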
All Science Journal Classification (ASJC) codes
- Information Systems
- Information Systems and Management
- Duplicate detection
- Optimal stopping