
Figure 3 – Pseudonymisation process. Source: KI Protect
Whereas de-identification involves removing both of these, pseudonymisation removes only directly identifying data so that the personal data cannot be attributed to a specific individual without the use of additional information. Such additional information is kept separately and protected by technical and administrative measures to prevent such attribution.146 The basic pseudonymisation process is not complex, simply substituting alternative attributes for directly identifying data (see Figure 3).
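A minimal sketch of that substitution step is shown below. The record fields, the use of a random token from Python's secrets module and the in-memory lookup table are illustrative assumptions; a production system would rely on a vetted pseudonymisation tool and keep the mapping under the separate technical and administrative controls described above.

```python
import secrets

# Hypothetical customer record mixing direct identifiers with other attributes.
record = {
    "name": "A. Customer",
    "national_id": "123-45-6789",
    "account_balance": 1520.75,
    "postal_code": "02139",
}

DIRECT_IDENTIFIERS = {"name", "national_id"}

# Pseudonym-to-identity mapping. Under pseudonymisation this table is stored
# separately and protected by technical and administrative measures.
lookup_table = {}


def pseudonymise(rec):
    """Replace direct identifiers with a random token; keep everything else."""
    token = secrets.token_hex(8)
    lookup_table[token] = {k: rec[k] for k in DIRECT_IDENTIFIERS if k in rec}
    out = {k: v for k, v in rec.items() if k not in DIRECT_IDENTIFIERS}
    out["pseudonym"] = token
    return out


print(pseudonymise(record))
```

Storing the lookup table apart from the pseudonymised records is what distinguishes pseudonymisation from full de-identification: re-identification remains possible, but only for whoever holds the mapping.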
De-identification is one means by which organizations can comply with “data minimization” requirements in data protection laws, i.e., to collect, store and use only the personal data that is necessary and relevant for the purpose for which it is used (see section 5.1).
De-identification rarely eliminates the risk of re-identification. Re-identification may occur if de-identification was incorrectly implemented or controlled, or where it is possible to link de-identified data with already known personal data or publicly available information. Effective de-identification requires expert understanding of the data and the wider data ecosystem, including the reasons and means by which adverse parties might seek to re-identify individuals.
Some experts criticise de-identification as being ineffective and as promoting a false sense of security by assuming unrealistic, artificially constrained models of what an adversary might do.147 In a famous example in 1997, by linking health data that had been stripped of personal identifiers with publicly available voter registration data, it was possible to identify Governor William Weld of Massachusetts and thus link him to his medical records. (The Governor had previously assured constituents that their health data was kept confidential.)148
One study in 2013 found that 95% of mobility traces are uniquely identifiable given four random spatio-temporal points (location and time), and over 50% of users are uniquely identifiable from two randomly chosen points (which will typically be home and work).149 Subsequent studies have found similar results using large datasets (e.g., 1 million people in Latin America) and, applying the methodology to bank transaction data, found that four points were enough to uniquely identify 90% of credit card users.150
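The kind of uniqueness measurement behind these findings can be illustrated on a toy dataset. The four hand-made traces, the (cell, hour) representation and the sampling procedure below are assumptions for illustration only; the published studies work with millions of real records.

```python
import random

# Toy spatio-temporal traces: user -> set of (cell_id, hour) points.
traces = {
    "u1": {("cellA", 8), ("cellB", 9), ("cellC", 18), ("cellA", 22)},
    "u2": {("cellA", 8), ("cellD", 10), ("cellE", 19), ("cellF", 23)},
    "u3": {("cellB", 9), ("cellD", 10), ("cellC", 18), ("cellG", 7)},
    "u4": {("cellA", 8), ("cellB", 9), ("cellE", 19), ("cellH", 12)},
}


def is_unique(user, k, rng):
    """Do k randomly chosen points from this user's trace single them out?"""
    sample = rng.sample(sorted(traces[user]), k)
    matches = [u for u, pts in traces.items() if all(p in pts for p in sample)]
    return matches == [user]


def uniqueness_rate(k, trials=500, seed=0):
    """Fraction of draws in which k points match exactly one user."""
    rng = random.Random(seed)
    hits = sum(is_unique(u, k, rng) for _ in range(trials) for u in traces)
    return hits / (trials * len(traces))


print("2 points:", uniqueness_rate(k=2))
print("4 points:", uniqueness_rate(k=4))
```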
Richer data makes it possible to “name” an individual by a collection of fields or attributes, for example postal code, date of birth and sex.
Geolocation data carries particular risks of identification or re-identification of individuals. It is possible to combine user data linked to a persistent, non-unique identifier with other data to develop an enhanced profile of a person. Even geolocation data alone may be used to identify a user, because the two most common user locations will typically be their home and work addresses. Sensitive data about an individual, for example a particular medical condition, may be inferred from their attendance at particular locations, such as an abortion clinic or mosque. Measures may be employed to reduce such risks, such as accepting only insights rather than full datasets, accepting only data that has already been aggregated or de-identified, and applying additional filters where data is drawn from devices, e.g., accepting only geo-fenced data, removing home, work and sensitive locations, restricting the time span of the data, and “blurring” or “fuzzing” datasets.
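A sketch of two of these device-side filters, removal of points near sensitive locations and coordinate “blurring”, is shown below. The sample coordinates, the 300-metre exclusion radius and the grid size are illustrative assumptions rather than recommended values.

```python
import math

# Illustrative sensitive sites (e.g., home, work, clinic) to strip out.
SENSITIVE_SITES = [(51.5010, -0.1420), (51.5155, -0.0922)]
EXCLUSION_RADIUS_M = 300      # assumed radius, not a recommended value
GRID_DEGREES = 0.01           # coarse "blurring" grid, roughly 1 km


def distance_m(a, b):
    """Approximate planar distance in metres, adequate at city scale."""
    dlat = (a[0] - b[0]) * 111_320
    dlon = (a[1] - b[1]) * 111_320 * math.cos(math.radians(a[0]))
    return math.hypot(dlat, dlon)


def blur(point):
    """Snap a coordinate to a coarse grid ("fuzzing")."""
    return (round(point[0] / GRID_DEGREES) * GRID_DEGREES,
            round(point[1] / GRID_DEGREES) * GRID_DEGREES)


def filter_trace(points):
    """Drop points near sensitive sites, then blur whatever remains."""
    kept = [p for p in points
            if all(distance_m(p, s) > EXCLUSION_RADIUS_M for s in SENSITIVE_SITES)]
    return [blur(p) for p in kept]


trace = [(51.5012, -0.1419), (51.5300, -0.1000), (51.5074, -0.1278)]
print(filter_trace(trace))
```

Such filters reduce, but do not eliminate, re-identification risk, consistent with the caveats above.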


