Page 32 - FIGI - Big data, machine learning, consumer protection and privacy
Figure 3 – Pseudonymisation process. Source: KI Protect
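As an illustration of the substitution step shown in Figure 3, the following sketch (hypothetical field names, in Python) replaces direct identifiers with random pseudonyms while holding the re-attribution table separately, as the surrounding text describes:

```python
import secrets

def pseudonymise(records, direct_identifiers=("name", "email")):
    """Return pseudonymised records plus a separately held lookup table.

    Hypothetical example: the lookup table mapping pseudonyms back to
    identities would be stored apart from the data, under separate
    technical and administrative controls.
    """
    lookup = {}  # pseudonym -> original direct identifiers
    out = []
    for record in records:
        pseudonym = secrets.token_hex(8)  # random, non-derivable pseudonym
        lookup[pseudonym] = {k: record[k] for k in direct_identifiers if k in record}
        # Indirect attributes remain; only direct identifiers are removed
        cleaned = {k: v for k, v in record.items() if k not in direct_identifiers}
        cleaned["pseudonym"] = pseudonym
        out.append(cleaned)
    return out, lookup

records = [{"name": "A. Patel", "email": "a@example.org", "balance": 120}]
pseudonymised, lookup = pseudonymise(records)
```

Note that the pseudonymised records alone cannot be attributed to an individual, but anyone holding the lookup table can reverse the process, which is why the two must be kept apart.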
Whereas de-identification involves removing both of these, pseudonymisation removes only directly identifying data, so that the personal data cannot be attributed to a specific individual without the use of additional information. Such additional information is kept separately and protected by technical and administrative measures to prevent such attribution. The basic pseudonymisation process is not complex, simply substituting alternative attributes.146

De-identification is one means by which organizations can comply with “data minimization” requirements in data protection laws, i.e., to collect, store and use only the personal data that is necessary and relevant for the purpose for which it is used (see section 5.1).

De-identification rarely eliminates the risk of re-identification. Re-identification may occur if de-identification was incorrectly implemented or controlled, or where it is possible to link de-identified data with already known personal data or publicly available information. Effective de-identification requires expert understanding of the data and the wider data ecosystem, including the reasons and means by which adverse parties might seek to re-identify individuals.

Some experts criticise de-identification as ineffective and as promoting a false sense of security by assuming unrealistic, artificially constrained models of what an adversary might do.147 In a famous example from 1997, by linking health data that had been stripped of personal identifiers with publicly available voter registration data, it was possible to identify Governor William Weld of Massachusetts and thus link him to his medical records. (The Governor had previously assured constituents that their health data was kept confidential.)148

One study in 2013 found that 95% of mobility traces are uniquely identifiable given four random spatio-temporal points (date and time), and that over 50% of users are uniquely identifiable from two randomly chosen points (which will typically be home and work).149 Subsequent studies have found similar results using large datasets (e.g., 1 million people in Latin America) and, applying the methodology to bank transaction data, found that four points were enough to uniquely identify 90% of credit card users.150

Richer data makes it possible to “name” an individual by a collection of fields or attributes, for example postal code, date of birth and sex.

Geolocation data carries particular risks of identification or re-identification of individuals. It is possible to combine user data linked to a persistent, non-unique identifier with other data to develop an enhanced profile of a person. Even geolocation data alone may be used to identify a user, because the two most common user locations will typically be their home and work addresses. Sensitive data about an individual, for example a particular medical condition, may be inferred from their attendance at particular locations, such as an abortion clinic or mosque.

Measures may be employed to reduce such risks, such as accepting only insights rather than full datasets, accepting only data that has already been aggregated or de-identified, and applying additional filters where data is drawn from devices, e.g., accepting only geo-fenced data, removing home, work and sensitive locations, restricting the time window of the data, and “blurring” or “fuzzing” datasets.
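The uniqueness result from the 2013 mobility study can be illustrated with a toy experiment: given a set of user traces, measure what fraction of users are pinned down uniquely by a few randomly chosen spatio-temporal points. The sketch below uses invented data and a simplified matching rule, purely for illustration:

```python
import random

def fraction_unique(traces, k, trials=200, seed=0):
    """Estimate the fraction of sampled users uniquely identified by
    k random points from their own trace (toy version of the study's method)."""
    rng = random.Random(seed)
    users = list(traces)
    unique = 0
    for _ in range(trials):
        user = rng.choice(users)
        points = set(rng.sample(traces[user], k))
        # A user "matches" if all k points appear in their trace
        matches = [u for u in users if points <= set(traces[u])]
        if matches == [user]:
            unique += 1
    return unique / trials

# Invented traces: (cell tower id, hour of day) observations per user
traces = {
    "u1": [("cell_12", 8), ("cell_40", 13), ("cell_12", 19), ("cell_7", 22)],
    "u2": [("cell_12", 8), ("cell_40", 13), ("cell_33", 19), ("cell_7", 22)],
    "u3": [("cell_5", 9), ("cell_40", 14), ("cell_33", 18), ("cell_7", 23)],
}
```

With more points, more users become unique; the studies cited found that four points sufficed for the vast majority of real mobility and card-transaction traces.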
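The filtering and “blurring” measures listed above can be sketched as follows. The sensitive-location list, exclusion radius and grid size here are invented for the example; a real implementation would use proper geodesic distances rather than simple degree thresholds:

```python
# Hypothetical list of sensitive places to strip out (coordinates made up)
SENSITIVE = [(51.5014, -0.1419)]

def blur(lat, lon, decimals=2):
    """Snap a coordinate to a coarse grid (~1 km) by rounding."""
    return round(lat, decimals), round(lon, decimals)

def filter_and_blur(points, exclusion_radius_deg=0.01):
    """Drop points near sensitive locations, then blur location and
    coarsen timestamps to the hour for the remainder."""
    out = []
    for lat, lon, t in points:
        if any(abs(lat - s_lat) < exclusion_radius_deg and
               abs(lon - s_lon) < exclusion_radius_deg
               for s_lat, s_lon in SENSITIVE):
            continue  # remove points near sensitive locations entirely
        b_lat, b_lon = blur(lat, lon)
        out.append((b_lat, b_lon, t // 3600 * 3600))  # coarsen time to the hour
    return out
```

The same pattern extends to removing inferred home and work locations, which, as noted above, are themselves strongly identifying.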