The main challenges encountered when linking two or more sources are scalability to large datasets and achieving high result quality. The scalability problem is due to the ever-growing size of the datasets to be linked and the quadratic nature of the record linkage process. Hence, the naive approach of comparing each record from one source with all records from another source dramatically affects the runtime of the process and is infeasible for very large datasets. Therefore, refined solutions like blocking/indexing and partitioning are required during the process. The result quality of record linkage depends on several factors, such as the quality of the input data, the blocking strategy followed and the classifier used. The aim is to reduce both the number of false positives (non-match pairs that are labeled as matches) and false negatives (match pairs that are labeled as non-matches). Record linkage addresses these challenges by following the steps outlined in Fig. 2.
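The quadratic nature of the naive approach can be sketched as follows. This is an illustrative toy example, not an implementation from the text; the data and the equality-based similarity check are hypothetical.

```python
# Naive (quadratic) linkage sketch: every record of source A is compared
# with every record of source B, i.e. len(a) * len(b) comparisons in total.
def naive_linkage(a, b, similar):
    """Return all pairs passing the similarity check, and the comparison count."""
    matches, comparisons = [], 0
    for ra in a:
        for rb in b:
            comparisons += 1
            if similar(ra, rb):
                matches.append((ra, rb))
    return matches, comparisons

source_a = ["john smith", "jane doe", "bob mueller"]
source_b = ["jon smith", "jane doe"]
pairs, n = naive_linkage(source_a, source_b, lambda x, y: x == y)
# n == len(source_a) * len(source_b) == 6
```

For two sources of a million records each, this already means 10^12 comparisons, which motivates the blocking step described below.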
Pre-processing: Since the different datasets to be linked are heterogeneous and sometimes dirty, the data owners have to agree on a set of fields to use in the linkage process. Generally these fields, also called quasi-identifiers (QIDs), have a high selectivity and together allow the identification of a record. Furthermore, the data owners define a set of rules to transform these fields into a standard form, e.g. how to deal with missing values, case sensitivity and abbreviations. For example, the data holders in Fig. 1 agreed to use three common fields (first name, last name and address) and then normalized them (lower casing, ä to ae, …).
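A minimal pre-processing sketch, modeled on the normalization rules mentioned above (the concrete rule set and field names are assumptions for illustration): lower-case the QID fields, replace German umlauts such as ä by ae, and map empty strings to a unified missing-value encoding.

```python
# Hypothetical standardization rules agreed on by the data owners.
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def normalize(value):
    """Bring one QID field value into the agreed standard form."""
    if value is None or value.strip() == "":
        return None                      # unified encoding for missing values
    value = value.strip().lower()        # case insensitivity
    for ch, repl in UMLAUTS.items():
        value = value.replace(ch, repl)  # ä -> ae etc.
    return value

record = {"first": "Jörg", "last": "MÜLLER ", "address": ""}
clean = {k: normalize(v) for k, v in record.items()}
# clean == {"first": "joerg", "last": "mueller", "address": None}
```

Applying the same rules on both sources is essential; otherwise identical persons end up with different normalized QID values and are never matched.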
Blocking: As explained above, the main goal of blocking is to mitigate the quadratic nature of the linkage process. For each record from one dataset, only a small subset of records from the other datasets constitutes candidate matches that must be checked for similarity. To avoid comparisons between dissimilar records, one can put "probably similar" records into the same block, so that only records within the same block are compared with each other. A block is defined by a key shared by all records belonging to it, e.g. the same zip code or specific letters of the name. Other methods use filter techniques that exploit characteristics of the records, the similarity function used and its threshold to prune dissimilar pairs before comparison. Finally, in the case of Big Data, the aforementioned strategies can be combined within a parallel environment to further improve the runtime.
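The key-based variant can be sketched as follows; the blocking key (first two letters of the last name) and the records are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import product

def build_blocks(records, key):
    """Group records by their blocking key."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    return blocks

# Hypothetical blocking key: first two letters of the last name.
key = lambda r: r["last"][:2]
a = [{"last": "smith"}, {"last": "mueller"}]
b = [{"last": "smyth"}, {"last": "doe"}]
blocks_a, blocks_b = build_blocks(a, key), build_blocks(b, key)

# Only records sharing a block key become candidate pairs.
candidates = [
    pair
    for k in blocks_a.keys() & blocks_b.keys()
    for pair in product(blocks_a[k], blocks_b[k])
]
# 1 candidate pair ("smith"/"smyth") instead of 2 * 2 = 4 comparisons
```

The choice of key is a trade-off: a coarse key saves few comparisons, while an overly selective key may place true matches into different blocks and thus cause false negatives.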
Similarity computation: In this step, records from the same block or partition are compared with each other using one or more similarity functions. The output of this computation is either a single value, generally normalized between 0 and 1, that expresses the degree of similarity between two records, or a vector of normalized similarity values. The latter case occurs when different similarity functions are used for different attributes.
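A sketch of the vector variant, assuming (for illustration) that the same string similarity is applied per attribute; here the standard library's `difflib.SequenceMatcher.ratio`, which is normalized to [0, 1], stands in for attribute-specific similarity functions such as Jaro-Winkler.

```python
from difflib import SequenceMatcher

def sim(x, y):
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, x, y).ratio()

def similarity_vector(ra, rb, fields):
    """One similarity value per compared attribute."""
    return [sim(ra[f], rb[f]) for f in fields]

ra = {"first": "john", "last": "smith"}
rb = {"first": "jon", "last": "smith"}
vec = similarity_vector(ra, rb, ["first", "last"])
# vec[1] == 1.0 because the last names are identical;
# vec[0] < 1.0 because of the typo "john" vs. "jon"
```

The resulting vector is the input of the classification step.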
Classification: the similarity values between a pair of records are used to assign the pair to one of the classes match, non-match or possible match. Several classification methods exist, depending on the input data and the use case, e.g. threshold-based, rule-based, (un)supervised learning or the probabilistic method proposed by Fellegi and Sunter.
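The threshold-based method can be sketched as follows; the two thresholds and the aggregation by averaging are assumptions for illustration, not values prescribed by the text.

```python
def classify(vector, upper=0.85, lower=0.6):
    """Map an averaged similarity vector to one of the three classes."""
    score = sum(vector) / len(vector)
    if score >= upper:
        return "match"
    if score >= lower:
        return "possible match"   # typically forwarded to manual review
    return "non-match"

classify([0.95, 1.0])   # -> "match"
classify([0.7, 0.6])    # -> "possible match"
classify([0.2, 0.3])    # -> "non-match"
```

In the Fellegi-Sunter model, the score is instead a likelihood ratio computed from agreement probabilities, but the two-threshold decision structure is the same.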
Evaluation: the aim of this step is to evaluate the record linkage process w.r.t. its runtime, its complexity and the quality of its result. Runtime and complexity are closely related and generally depend on the efficiency of the blocking method. The blocking scheme should exclude as many dissimilar pairs as possible without causing a runtime overhead. The quality is expressed by two measures from information retrieval, precision and recall. Besides the number of true matches, precision and recall are influenced by the number of false positives and false negatives, respectively.
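The two quality measures follow directly from the counts defined above; the concrete counts in this sketch are toy values for illustration.

```python
def precision(tp, fp):
    """Fraction of declared matches that are true matches."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of true matches that were actually found."""
    return tp / (tp + fn)

# Hypothetical linkage outcome: 80 true positives,
# 20 false positives, 10 false negatives.
tp, fp, fn = 80, 20, 10
p = precision(tp, fp)   # 80 / 100 = 0.8
r = recall(tp, fn)      # 80 / 90  ~ 0.889
```

A blocking scheme that discards true matches lowers the achievable recall, while a weak classifier that labels dissimilar pairs as matches lowers the precision.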