#### CENTER FOR SCALABLE DATA ANALYTICS AND ARTIFICIAL INTELLIGENCE

## Introduction to Privacy Preserving Record Linkage - Data anonymization

### Data anonymization

To preserve privacy and confidentiality, several data masking and encoding techniques have been developed for PPRL. Below we describe some of them and indicate which kind of protocol each one suits, or in which step of the PPRL process it is applied.

Commutative encryption:

Commutative encryption is a well-known encryption scheme used in the three-pass protocol to exchange private messages between two parties without sharing any key. It can also be used as a special case of secure multi-party computation to implement basic operations such as set intersection, which is a fundamental operation for computing the similarity between two records represented as sets of tokens. The basic idea of commutative encryption is the existence of a function f that satisfies the commutative property:

$f_{k_1}(f_{k_2}(x)) = f_{k_2}(f_{k_1}(x))$, where $k_1$ and $k_2$ are two private keys owned by the two parties respectively and $x$ is the message to exchange. An example of such a function is $f_k(x) = x^k \bmod n$.
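The commutative property of the example function can be checked numerically. The following is a minimal sketch: the modulus `n`, the keys `k1` and `k2`, and the message `x` are made-up illustrative values, far too small for real use.

```python
# Illustrative values only (assumptions, not taken from the text):
n = 2**31 - 1          # a small prime modulus
k1, k2 = 65_537, 257   # hypothetical private keys of the two parties
x = 123_456_789        # the message, encoded as an integer smaller than n


def f(k, x):
    """Exponentiation function f_k(x) = x**k mod n."""
    return pow(x, k, n)


# Applying the two keys in either order yields the same value,
# because both orders compute x**(k1*k2) mod n.
left = f(k1, f(k2, x))
right = f(k2, f(k1, x))
print(left == right)
```

Both orders reduce to $x^{k_1 k_2} \bmod n$, which is why the order of encryption does not matter.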

Let us now sketch how commutative encryption can be used in PPRL. Two parties, Alice and Bob, each have their own encryption key $k_i$ and decryption key $d_i$. For each pair of records (represented as sets of tokens), Alice and Bob execute the following steps:

1. Each party encrypts the elements of its token set using its own encryption key, where $E_i(k, t)$ denotes the encryption of token $t$ by party $i$ using key $k$.
2. Alice sends her encrypted tokens $E_A(k_a, t_i)$ to Bob, and Bob sends his encrypted tokens $E_B(k_b, t_i)$ to Alice.
3. Alice encrypts Bob's tokens using her key, i.e. $E_A(k_a, E_B(k_b, t))$, and sends the pairs $\langle E_B(k_b, t), E_A(k_a, E_B(k_b, t)) \rangle$ back to Bob.
4. Bob encrypts Alice's tokens using his key, i.e. $E_B(k_b, E_A(k_a, t))$.
5. Bob decrypts his part of each pair received from Alice, obtaining $\langle t, E_A(k_a, E_B(k_b, t)) \rangle$. Because $E$ is commutative, he can check whether $E_A(k_a, E_B(k_b, t)) = E_B(k_b, E_A(k_a, t))$, which indicates that both parties hold the token $t$.
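The steps above can be sketched in a single process as follows. This is only an illustration under several assumptions: the exponentiation function is used as the commutative cipher, tokens are mapped to integers with SHA-256, the modulus `P` and the helpers `token_to_int`, `keygen`, and `shared_tokens` are hypothetical names, and Bob recovers his plaintext tokens through a lookup table instead of an explicit decryption key.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime used as an illustrative modulus


def token_to_int(token):
    """Map a token to an integer in [2, P - 1] (assumed encoding)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 2) + 2


def keygen():
    """Pick a random exponent coprime to P - 1, so encryption is invertible."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def encrypt(k, value):
    """Commutative cipher: value**k mod P."""
    return pow(value, k, P)


def shared_tokens(alice_tokens, bob_tokens):
    ka, kb = keygen(), keygen()

    # Steps 1 and 2: each party encrypts its own tokens and sends them over.
    alice_once = {encrypt(ka, token_to_int(t)) for t in alice_tokens}
    bob_once = {encrypt(kb, token_to_int(t)): t for t in bob_tokens}

    # Step 3: Alice encrypts Bob's ciphertexts and returns them as pairs.
    pairs = [(c, encrypt(ka, c)) for c in bob_once]

    # Step 4: Bob encrypts Alice's ciphertexts with his own key.
    alice_twice = {encrypt(kb, c) for c in alice_once}

    # Step 5: Bob maps his single-encrypted values back to tokens
    # (a lookup stands in for decryption) and compares the
    # doubly encrypted values.
    return {bob_once[c] for c, cc in pairs if cc in alice_twice}


print(shared_tokens({"pe", "et", "te", "er"}, {"pe", "et", "te", "ex"}))
```

In a real deployment the two key holders would run on separate machines and exchange only the encrypted values; the single function here merely makes the data flow visible.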

This kind of computation, although very secure, still has scalability problems. The algorithm above already requires several message exchanges between the parties to compare a single pair of records, and these steps must be carried out for all record pairs (quadratic complexity).

Hash function:

A very simple way to “guarantee” privacy is to use a hash function to generate hash values of the records and then compare them privately using a two- or three-party protocol. Because plain hashing is vulnerable to dictionary attacks, security can be enhanced by using a keyed hash function such as HMAC (for example HMAC-SHA-1), whose secret key is unknown to the attacker. However, this kind of data anonymization has the main drawback that it only allows checking whether two records are equal (equality, not similarity), which is not sufficient when dealing with dirty data.
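As a minimal sketch of keyed hashing, the following uses Python's standard `hmac` module. The function name `encode_record`, the normalization step, the choice of SHA-256, and the example key are assumptions made for illustration.

```python
import hashlib
import hmac


def encode_record(record, secret_key):
    """Return a keyed hash (HMAC-SHA-256) of a normalized record string."""
    normalized = " ".join(record.lower().split())
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


key = b"secret shared by the data owners"  # hypothetical key

a = encode_record("Peter  Smith", key)
b = encode_record("peter smith", key)
c = encode_record("Peter Smyth", key)

print(hmac.compare_digest(a, b))  # same record after normalization
print(hmac.compare_digest(a, c))  # a single typo gives an unrelated hash
```

The second comparison shows the drawback described above: a one-character difference produces a completely different hash value, so only exact matches can be found.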

Phonetic encoding:

Phonetic encoding is widely used in record linkage and database systems to generate blocks of candidates. It can also be used in PPRL to overcome the scalability problem. The basic idea of phonetic encoding is to produce a code for each name based on its pronunciation while tolerating small typos. Soundex, for example, generates codes of length four (one character followed by three digits), e.g. Carol and Carrell share the same code C640. Note that phonetic encoding is language dependent.
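To make the Soundex example concrete, here is a minimal sketch of the common American Soundex rules (letters H and W are skipped, vowels separate runs of equal digits). It is an illustration only and does not cover every published variant.

```python
# Digit assigned to each consonant group in American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue                  # H and W do not break a run of equal digits
        digit = CODES.get(c, "")      # vowels map to "" and reset the run
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "000")[:4]         # pad with zeros, truncate to length four


print(soundex("Carol"), soundex("Carrell"))
```

Records whose names share a code can then be placed in the same block, so that only records within a block need to be compared.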

Bloom filter:

The Bloom filter was originally presented by Bloom to efficiently check set membership. Its adaptation for PPRL was proposed by Schnell in a three-party protocol. The workflow to encode two records from two different parties into Bloom filters is illustrated in Fig. 7. First, the QIDs of each record are tokenized into n-grams. Then, starting with an empty bit array of length l, the elements of each token set are mapped into that record's bit array using k hash functions.

Fig. 7: An example of Bloom filter to encode two records and the way to compute their similarity using Jaccard function

When appropriate parameters are used, such as the length of the bit array and the number of hash functions, Bloom filters can provide both high privacy and high result quality. Furthermore, the authors proposed some methods to harden them.
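The Bloom filter encoding and the Jaccard comparison described above can be sketched as follows. This is an illustration under assumptions: bigrams with padding are used as tokens, the k hash functions are simulated by prefixing an index to a SHA-256 input, and the names `qgrams`, `bloom_encode`, and `jaccard` as well as the default parameters are hypothetical.

```python
import hashlib


def qgrams(value, q=2):
    """Split a string into padded q-grams (bigrams by default)."""
    padded = "_" * (q - 1) + value.lower() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def bloom_encode(value, length=1000, num_hashes=2):
    """Return the set of bit positions set to 1 in the Bloom filter."""
    bits = set()
    for gram in qgrams(value):
        for i in range(num_hashes):
            # The i-th hash function: SHA-256 over the index and the token.
            digest = hashlib.sha256(f"{i}:{gram}".encode("utf-8")).digest()
            bits.add(int.from_bytes(digest, "big") % length)
    return bits


def jaccard(a, b):
    """Jaccard similarity of two sets of bit positions."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


peter = bloom_encode("peter")
pete = bloom_encode("pete")
smith = bloom_encode("smith")

print(jaccard(peter, pete))   # many shared bigrams, high similarity
print(jaccard(peter, smith))  # no shared bigrams, low similarity
```

Unlike the hash-based encoding, similar values yield overlapping bit patterns, so approximate matches can still be detected on the encoded data.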