A PPRL protocol specifies the steps that each participating party must follow to carry out the linkage process. Protocols differ mainly in the number of parties involved and in the degree of privacy they provide.
The two-party protocol, illustrated in Fig. 4, involves the minimum possible number of parties: the two data owners. To find matching records, the parties exchange messages about their records without disclosing the records themselves. One known way to realize this is Secure Multiparty Computation (SMC), introduced by Yao. In our case, the inputs of such a computation are two private records, one from each party, and a public similarity function (match or non-match). The goal is to compute the similarity between the two records by exchanging cryptographic messages between the data holders. A computation is considered successful if the parties obtain the result of the function while neither of them learns the data held by the other. Although this protocol is more secure than the others (explained later), its main drawback is scalability: each record from one source must be compared with every record from the other source (quadratic complexity). Furthermore, comparing each pair of records requires the exchange of a relatively high number of messages.
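To make the two-party idea concrete, the following is a minimal sketch under strong assumptions. It does not implement Yao's SMC. It substitutes a simpler, well-known technique, commutative (Diffie-Hellman-style) blinding, and supports exact matching only. The function names (`hash_to_group`, `random_key`, `blind`, `two_party_matches`), the record format, and the toy modulus are all hypothetical choices for illustration, and the parameters are not suitable for real use. Both parties are simulated in one process, and the counter shows the exhaustive pairwise comparison that causes the quadratic cost.

```python
import hashlib
import math
import secrets

# Toy modulus (the Mersenne prime 2^127 - 1). Assumption: illustration only,
# far too small and unvetted for real deployments.
P = 2**127 - 1


def hash_to_group(record: str) -> int:
    """Map a record string to an integer in [1, P-1]."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def random_key() -> int:
    """Pick a secret exponent coprime to P-1, so blinding is a permutation."""
    while True:
        key = secrets.randbelow(P - 3) + 2  # value in [2, P-2]
        if math.gcd(key, P - 1) == 1:
            return key


def blind(value: int, key: int) -> int:
    """Commutative blinding: (v^a)^b == (v^b)^a (mod P)."""
    return pow(value, key, P)


def two_party_matches(records_a, records_b):
    """Simulate both data owners; return matching index pairs and the
    number of pairwise comparisons performed."""
    key_a, key_b = random_key(), random_key()

    # Message 1: each owner blinds its own hashed records and sends them.
    sent_by_a = [blind(hash_to_group(r), key_a) for r in records_a]
    sent_by_b = [blind(hash_to_group(r), key_b) for r in records_b]

    # Message 2: each owner blinds the values it received with its own key.
    double_a = [blind(v, key_b) for v in sent_by_a]  # computed by owner B
    double_b = [blind(v, key_a) for v in sent_by_b]  # computed by owner A

    # Exhaustive comparison: every record of A against every record of B.
    matches, comparisons = [], 0
    for i, value_a in enumerate(double_a):
        for j, value_b in enumerate(double_b):
            comparisons += 1
            if value_a == value_b:
                matches.append((i, j))
    return matches, comparisons


if __name__ == "__main__":
    owner_a = ["anna|smith|1980", "bob|jones|1975"]
    owner_b = ["carl|young|1990", "anna|smith|1980", "dora|hill|1969"]
    print(two_party_matches(owner_a, owner_b))  # ([(0, 1)], 6)
```

With 2 and 3 records the sketch performs 2 × 3 = 6 comparisons. The count grows as the product of the two source sizes, which is the scalability problem described above.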
In the three-party protocol, a trusted third party, also known as the linkage unit (LU), takes part in the linkage process. As shown in Fig. 5, the data owners preprocess and encode their data on their own, then both send their encoded records to the LU, which runs an adapted linkage algorithm. At the end, the LU returns only the IDs of the matching records to the data owners. Compared with the two-party protocol, this protocol is less secure because of the additional LU, but the LU can apply blocking and/or parallelization techniques to improve scalability and runtime.
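The message flow of the LU-based protocol can be sketched as follows. This is a simplified illustration, not the encoding or linkage algorithm of any particular system. It assumes the data owners share a secret key unknown to the LU, encode each record with a keyed hash (HMAC-SHA256), and need exact matches only. The names `preprocess`, `encode`, and `linkage_unit`, as well as the record layout, are hypothetical. The hash index inside `linkage_unit` stands in for blocking: records are compared only within the same bucket instead of across all pairs.

```python
import hashlib
import hmac
from collections import defaultdict


def preprocess(fields):
    """Normalize a record: trim whitespace, lowercase, join the fields."""
    return "|".join(field.strip().lower() for field in fields)


def encode(records, shared_key):
    """Run by each data owner: map record ID -> keyed hash of the record."""
    return {
        record_id: hmac.new(
            shared_key, preprocess(fields).encode("utf-8"), hashlib.sha256
        ).hexdigest()
        for record_id, fields in records.items()
    }


def linkage_unit(encoded_a, encoded_b):
    """Run by the LU: sees only IDs and codes, returns matching ID pairs."""
    index = defaultdict(list)  # code -> IDs from owner A (bucket per code)
    for record_id, code in encoded_a.items():
        index[code].append(record_id)
    return [
        (id_a, id_b)
        for id_b, code in encoded_b.items()
        for id_a in index.get(code, [])
    ]


if __name__ == "__main__":
    key = b"secret shared by the data owners only"  # assumption
    owner_a = {"a1": ("Anna ", "Smith", "1980"), "a2": ("Bob", "Jones", "1975")}
    owner_b = {"b7": ("anna", "SMITH", "1980"), "b9": ("Dora", "Hill", "1969")}
    print(linkage_unit(encode(owner_a, key), encode(owner_b, key)))
    # [('a1', 'b7')]
```

The LU never sees plaintext fields, yet it does learn which encoded records coincide, which illustrates why the additional party weakens the privacy guarantees.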
Multi-party protocols generalize the two protocols described above to settings in which more than two data owners are involved in the linkage process. They can be realized with or without an LU. The main difference lies in the kind of matches sought. The simpler case looks for matching records that are present in all datasets. The more difficult case aims to find matches that are present in subsets of the datasets.
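The two kinds of multi-party matches can be contrasted with a small sketch. It assumes that every party has already encoded its records into comparable codes, that matching is exact, and that each party's dataset is deduplicated (at most one record per code). The names `group_by_code`, `matches_in_all`, `matches_in_subsets`, and the `min_parties` parameter are hypothetical.

```python
from collections import defaultdict


def group_by_code(encoded_datasets):
    """Group record IDs by code: code -> {party: record ID}."""
    groups = defaultdict(dict)
    for party, encoded in encoded_datasets.items():
        for record_id, code in encoded.items():
            # Assumption: one record per code per party (deduplicated data).
            groups[code][party] = record_id
    return groups


def matches_in_all(encoded_datasets):
    """Matches present in every dataset."""
    n_parties = len(encoded_datasets)
    groups = group_by_code(encoded_datasets)
    return [members for members in groups.values() if len(members) == n_parties]


def matches_in_subsets(encoded_datasets, min_parties=2):
    """Matches present in at least `min_parties` of the datasets."""
    groups = group_by_code(encoded_datasets)
    return [members for members in groups.values() if len(members) >= min_parties]


if __name__ == "__main__":
    datasets = {
        "A": {"a1": "x", "a2": "y"},
        "B": {"b1": "x", "b2": "z"},
        "C": {"c1": "x", "c2": "y"},
    }
    print(matches_in_all(datasets))
    # [{'A': 'a1', 'B': 'b1', 'C': 'c1'}]
    print(matches_in_subsets(datasets))
    # [{'A': 'a1', 'B': 'b1', 'C': 'c1'}, {'A': 'a2', 'C': 'c2'}]
```

In this plaintext, exact-match toy the subset case is only a change of threshold. That simplicity is an artifact of the sketch: it sidesteps both the privacy constraints and the fact that p parties have 2^p - p - 1 subsets of size two or more, which is likely part of what makes the subset case harder in a real protocol.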