Many companies and organizations collect huge amounts of data about people simply by offering their services in the form of online applications. Another, "more official" way to gather such data is to ask people directly (by means of printed forms), as is the case in hospitals and public administration. In both cases, each data owner holds information that covers only one or a few aspects of each person. However, analyzing such data, mining interesting patterns, or improving decision-making processes generally requires clean, aggregated data that is spread across several organizations. Record linkage operates as a preprocessing step for these tasks, with the main goal of finding records, stored in different databases, that refer to the same real-world object or person. This process finds application in many areas such as healthcare, national security, and business. In healthcare, for example, linking records from two or more hospitals makes it possible to adapt the treatment of patients' diseases.
The main impediment when linking person-related data across many organizations is privacy. In several countries, processing such data is subject to strict privacy policies that govern, for example, how and where the data is stored and whether it may be exchanged with a third party. Privacy-Preserving Record Linkage (PPRL) provides techniques and methods to efficiently link similar records in different databases without compromising privacy and confidentiality.
An example of PPRL in the healthcare area is illustrated in Fig. 1. Two research groups from two different hospitals are interested in a possible correlation between two diseases, type 2 diabetes and Alzheimer's disease. The database of hospital 1 contains records about patients who generally suffer from diabetes and other cardiovascular diseases. The database of hospital 2 stores records about patients afflicted with different forms of dementia. Since there are no unique IDs to identify persons present in both databases, the data holders first agree on some fields to be used in the linkage. These fields, known as Quasi-Identifiers (QIDs), have the property that together they enable the identification of a person; in the databases of Fig. 1, fields like first and last name, date of birth, and address are well suited to be used as QIDs. The other type of fields, which hold highly sensitive data about patients, should not be used alongside the QIDs in the linkage process, because they are highly private and add no value to the linkage. For example, if we consider the records 34599 and 733 from hospitals 1 and 2 respectively, the only fields we need in the linkage process are first and last name and the address. The other fields, weight and disease, are on the one hand sensitive and should not be disclosed, and on the other hand too dissimilar to be compared with each other (e.g. diabetes and Alzheimer's disease). After the QIDs have been normalized/standardized, they are encoded/encrypted in a way that allows their comparison (preserves similarity) without disclosing the identity of the persons they represent (preserves privacy). The following step is to compare the encoded records and to classify each pair as a match or a non-match. The data owners then know the IDs of the matching records and can exchange some sensitive fields of the respective matches to continue their research.
For example, the data owners know that the pairs (34599 - 733) and (34601 - 734) are matches, and after exchanging some of their sensitive information (e.g. disease) they can deduce that there is possibly a correlation between type 2 diabetes and Alzheimer's disease.
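The normalization, encoding, comparison, and classification steps described above can be illustrated with a minimal Python sketch. It assumes a Bloom-filter encoding of character bigrams with a keyed hash (HMAC) and the Dice coefficient as the similarity measure. This is one commonly used PPRL technique, chosen here only for illustration, since the text does not prescribe a particular encoding. The key, filter length, number of hash functions, threshold, and sample values are all hypothetical.

```python
import hashlib
import hmac
import unicodedata

# All constants below are assumptions made for this sketch.
SECRET_KEY = b"shared-secret"  # hypothetical key agreed on by both data owners
FILTER_BITS = 1024             # hypothetical Bloom filter length
NUM_HASHES = 4                 # hypothetical number of hash functions
THRESHOLD = 0.8                # hypothetical match threshold


def normalize(value: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().split())


def bigrams(text: str) -> set[str]:
    """Split a padded string into its set of character bigrams."""
    padded = f"_{text}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(qids: list[str]) -> frozenset[int]:
    """Return the set bit positions of a Bloom filter built from the QIDs."""
    positions = set()
    for field in qids:
        for gram in bigrams(normalize(field)):
            for k in range(NUM_HASHES):
                message = f"{k}:{gram}".encode()
                digest = hmac.new(SECRET_KEY, message, hashlib.sha256).digest()
                positions.add(int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return frozenset(positions)


def dice(a: frozenset[int], b: frozenset[int]) -> float:
    """Dice coefficient of two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def classify(a: frozenset[int], b: frozenset[int], threshold: float = THRESHOLD) -> str:
    """Label an encoded record pair as match or non-match."""
    return "match" if dice(a, b) >= threshold else "non-match"


# Invented sample values; the actual records of Fig. 1 are not reproduced here.
rec_h1 = ["Anna", "Schmidt", "12 Main Street"]
rec_h2 = ["anna ", "SCHMIDT", "12  main street"]
print(classify(encode(rec_h1), encode(rec_h2)))  # match
```

Only the encoded bit positions would be compared in such a setup, never the plaintext QIDs. Small spelling differences change only a few bigrams, so similar values still yield a high Dice score, while the keyed hash is meant to prevent a party without the key from reproducing the encoding.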
An important question after the execution of such a process is: what information does each data owner gain about the other's database?
- The first kind of information known to each data owner is the set of matches. The data holder of hospital 1 knows that the patients with IDs 34599 and 34601 both suffer from Alzheimer's disease (and analogously for hospital 2), but we can argue that this was the goal of the linkage process.
- Furthermore, each data holder knows that the patients behind its own non-matching records do not suffer from any disease stored in the other database (e.g. the patient of record 34589 in database 1 does not suffer from any kind of dementia). This kind of information gain (or leak) is unavoidable in the record linkage process.
- The last and most important point concerns the other party's non-matches, which must not be disclosed: no data holder should learn anything about the non-matching records of the other (e.g. the data holder of hospital 1 learns nothing about record 732 stored in database 2).
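The three points above can be made concrete with a small Python sketch that computes what each data owner sees after the linkage. It uses only the record IDs mentioned in the text; the `OwnerView` class and the `view_for_owner` function are hypothetical names introduced for illustration and do not model an actual PPRL protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OwnerView:
    """Information available to one data owner after the linkage."""
    matched_pairs: frozenset    # (hospital 1 ID, hospital 2 ID) pairs
    own_non_matches: frozenset  # own records without a partner record


def view_for_owner(own_ids, matches, own_position):
    """Build the view of one owner.

    own_position is 0 if the owner's IDs come first in each pair, else 1.
    """
    matched_own = {pair[own_position] for pair in matches}
    return OwnerView(frozenset(matches), frozenset(set(own_ids) - matched_own))


# Record IDs taken from the running example in the text.
hospital1_ids = {34589, 34599, 34601}
hospital2_ids = {732, 733, 734}
matches = {(34599, 733), (34601, 734)}

view1 = view_for_owner(hospital1_ids, matches, own_position=0)
view2 = view_for_owner(hospital2_ids, matches, own_position=1)

# IDs of database 2 that become visible to hospital 1 through the matches.
visible_to_h1 = {pair[1] for pair in view1.matched_pairs}

print(view1.own_non_matches)  # frozenset({34589})
print(732 in visible_to_h1)   # False
```

In this sketch, hospital 1 sees the matched pairs and its own non-match 34589, while record 732, a non-match of hospital 2, never appears in its view.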