6. Technical requirements for effective de facto anonymization
The effective implementation of de facto anonymization (i.e. the fulfillment of certain formalized anonymity criteria) depends on the anonymization technique(s) used. The GDPR itself does not specify which anonymization techniques should be used. This section presents some of the common de-identification techniques and recommendations for checking their effectiveness.
6.1 Overview of de-identification techniques
Many de-identification techniques exist that can be used to de-identify personal data. These meet – depending on the methodological approach and potential “re-identification attack model” – certain formalized anonymity criteria (for example: k-anonymity, l-diversity, t-closeness, differential privacy). Which de-identification technique or combination of these can guarantee sufficient de facto anonymization must always be assessed in light of the specific individual case at hand.
6.1.1 Removal of identifiers
Personal data can consist of identifying attributes (e.g. name or identity card number), quasi-identifying attributes (e.g. date of birth, place of residence or gender) as well as sensitive attributes (e.g. illnesses, sexual tendencies, very old age, etc.). In this context, the term “sensitive attribute” is not to be equated with the special categories within the meaning of Article 9 (1) GDPR. One speaks of a sensitive attribute if the disclosure of the content and its attribution to a person entail a particular risk of potential harm or invasions of privacy (this also includes, for example, bank details, social security numbers or photographs).23 By removing the identifying and quasi-identifying attributes, data can be de-identified. In this case, one or more identifying or quasi-identifying attributes (i.e. identifiers) are completely deleted from a set of data, so that conclusions about an individual person are no longer possible, or at least become very difficult. Yet, removing these identifiers is usually only the first step towards de facto anonymization.
Example:
The user’s name as well as the user number and the vehicle number are deleted from GPS location data generated by vehicles. In this way, the GPS location data can be traced back to a single person only with considerable difficulty (and possibly only with corresponding additional knowledge).
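For illustration, the following is a minimal Python sketch of such an identifier removal; the record structure and field names are purely hypothetical assumptions chosen to mirror the GPS example.

```python
# Minimal sketch of identifier removal; the field names are assumptions.
trips = [
    {"user_name": "A. Smith", "user_no": 1001, "vehicle_no": "V-17",
     "lat": 48.137, "lon": 11.575, "timestamp": "2021-03-01T08:00"},
    {"user_name": "B. Jones", "user_no": 1002, "vehicle_no": "V-23",
     "lat": 48.139, "lon": 11.580, "timestamp": "2021-03-01T08:05"},
]

IDENTIFIERS = ("user_name", "user_no", "vehicle_no")

# Delete the (quasi-)identifying attributes; the remaining location data is
# de-identified but usually not yet de facto anonymized.
de_identified = [{k: v for k, v in rec.items() if k not in IDENTIFIERS}
                 for rec in trips]
print(de_identified)
```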
6.1.2 Randomization
Randomization/perturbation (i.e. a type of “disturbance”) refers to techniques (see a selection of individual such techniques below under 6.1.2.1 to 6.1.2.6) with which data values are replaced by artificially generated values in order to “alter” or “perturb” a data set in such a way that the direct link between certain data and the data subjects is removed. The data should only be altered to such an extent that at least statistical properties of the data set are retained for analysis purposes.
6.1.2.1 Data swapping
In swapping, certain attributes of a data subject are artificially swapped for the attributes of another person. Ideally, this happens randomly or pseudo-randomly,24 whereby it must be ensured that no set of data ultimately reproduces itself. The technique can be improved if the attribute values of a specific person do not exactly match those of the other person.
24 Pseudorandomness is computed randomness. To an observer it looks like “real” randomness, but it can be reproduced or reversed with knowledge of the key material.
Example:
In a customer list, the customer’s place of residence is to be swapped. For example, if person A lives in place X and person B in place Y, then after swapping the “place” information the database records that person A lives in place Y and person B lives in place X. If, however, further elements were swapped between person A and person B, this could lead to the result that the set of data largely reproduces itself and the purpose of the swap is therefore not achieved. Therefore, only a non-decisive part should be swapped between two specific sets of data.
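A minimal Python sketch of such a swap follows; the customer records and the restriction of the swap to the “place” attribute are illustrative assumptions.

```python
# Minimal sketch of data swapping: records are paired pseudo-randomly and
# only their "place" attribute is exchanged, so no record reproduces itself.
import random

customers = [
    {"name": "Person A", "place": "X"},
    {"name": "Person B", "place": "Y"},
    {"name": "Person C", "place": "Z"},
    {"name": "Person D", "place": "W"},
]

def swap_places(records, seed=None):
    """Pair records at (pseudo-)random and swap their 'place' values."""
    rng = random.Random(seed)
    indices = list(range(len(records)))
    rng.shuffle(indices)
    for i, j in zip(indices[0::2], indices[1::2]):
        records[i]["place"], records[j]["place"] = (
            records[j]["place"], records[i]["place"])
    return records

print(swap_places(customers, seed=42))
```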
6.1.2.2 Cryptographic hash function
A cryptographic hash function maps data or input values of any length to an output value of fixed length – the so-called hash value. A cryptographic hash function is a one-way function, so that no conclusions can be drawn about the original data from the hash value alone. In addition, a cryptographic hash function is collision-resistant, meaning that it is practically infeasible to find two different input values that produce the same hash value. This process is known as hashing. The cryptographic hash function itself is standardized and to that extent (generally) known. Therefore, the use of cryptographic hash functions alone does not automatically protect against reversal. A re-identification attacker who knows the stored hash value can hash various candidate input values using the known hash function until one of them matches the stored hash value. Reversal therefore depends on the extent to which a re-identification attacker knows or can narrow down the type of possible input values (e.g. telephone numbers).
To increase the difficulty of reversal, a random value is often added to the input value, which changes the hash value. If this random value may become known (it is typically stored alongside the hash value), it is referred to as “salt.” If the random value is kept secret, it is called “pepper.” In order for the random value to offer the highest possible security against a re-identification attacker, it should be of sufficient complexity and length and kept as secret as possible.
In addition, other de-identification techniques (such as stochastic overlay; see 6.1.2.3) or specific technical and organizational measures (such as access restrictions and restrictive rights and roles) are recommended.
Example:
In practice, hashing is used, for instance, to avoid having to save user passwords from online portals in clear text, i.e. unencrypted. Only the hash value, i.e. the result of the cryptographic hash function applied to the password, is saved. If a password is entered, a hash value is also generated from the entry and, if the two hash values match, it is practically certain that the password entered matches the password stored in the database. To prevent the hash values of simpler passwords from being determined by trial and error, a random value is usually added to the password before hashing (salt).
Another widespread use of hashing is the de-identified storage of IP addresses, for which the same procedure can be used.
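The following minimal Python sketch illustrates salted hashing with the standard library; the choice of SHA-256 and the salt length are illustrative assumptions (for real password storage, dedicated key-derivation functions such as bcrypt or scrypt are preferable to a plain hash).

```python
# Minimal sketch of salted hashing with SHA-256 from Python's standard library.
import hashlib
import os

def hash_with_salt(value: str, salt: bytes) -> str:
    """Return the hex digest of SHA-256 applied to salt + value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

salt = os.urandom(16)                     # random salt, stored with the hash
stored_hash = hash_with_salt("my secret password", salt)

# Verification: hash the entered value with the same salt and compare.
entered = "my secret password"
print(hash_with_salt(entered, salt) == stored_hash)   # True
```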
6.1.2.3 Stochastic overlay (“additive noise”)
In the case of stochastic overlay, a random “measurement error” is deliberately added to the data, for example by overlaying random data (which are generated, for example, by adding random values to the existing values). This method can only be used on numeric values.
Example:
In the case of numerical values, for instance, the last digit is replaced by a random number (for example with GPS coordinates).
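A minimal Python sketch of such an overlay follows; the noise range is an illustrative assumption and would have to be chosen according to the required accuracy and protection level.

```python
# Minimal sketch of stochastic overlay ("additive noise") on numeric values,
# here GPS coordinates; the noise scale is an illustrative assumption.
import random

def add_noise(value: float, scale: float = 0.001) -> float:
    """Overlay a numeric value with a small random 'measurement error'."""
    return value + random.uniform(-scale, scale)

lat, lon = 48.137154, 11.575382
print(add_noise(lat), add_noise(lon))
```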
6.1.2.4 Synthetic data generation
In this method, artificial sets of data are created on the basis of a statistical model. The model is constructed with reference to statistical attributes of the original data, so that the synthetic data reflect the statistical properties of the original data. Samples are then drawn from this model in order to form a new data set.
Example:
From a dataset about burglaries in a certain region, only the statistical findings are extracted into a mathematical model, which now calculates other scenarios based on these statistical findings and possibly other added parameters.
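As an illustration, the following minimal Python sketch fits a very simple statistical model (mean and standard deviation) to an assumed numeric column and samples artificial values from it; real synthetic data generation typically uses far richer models.

```python
# Minimal sketch of synthetic data generation from a simple statistical model.
import random
import statistics

original_values = [1200.0, 950.0, 3100.0, 780.0, 1650.0]   # illustrative data

mu = statistics.mean(original_values)
sigma = statistics.stdev(original_values)

rng = random.Random(0)
# Sample new, artificial values from the fitted model (truncated at zero).
synthetic_values = [max(0.0, rng.gauss(mu, sigma)) for _ in range(5)]
print(synthetic_values)
```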
6.1.2.5 Perturbation
With perturbation, data values are replaced by artificial values. The aim is to change the data in such a way that statistical properties of the data set are nevertheless retained for analyses. The methods offer a high level of protection against attacks, since the generated entries, which are created using random-based methods, no longer correspond to real persons. This is, however, also a disadvantage because flexibility in terms of analyses is lost.
Example:
A comprehensive data set contains, in addition to the classification “jobseeker, in training, self-employed, employed and retired,” the decade of birth (1950 to 1959, 1960 to 1969, 1970 to 1979, 1980 to 1989, 1990 to 1999, etc.) of the corresponding individuals. These values are replaced by randomly generated artificial information.
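A minimal Python sketch of such a replacement follows; the categories and records are illustrative assumptions.

```python
# Minimal sketch of perturbation: attribute values are replaced by artificially
# generated entries drawn from the same value ranges.
import random

STATUSES = ["jobseeker", "in training", "self-employed", "employed", "retired"]
DECADES  = ["1950-1959", "1960-1969", "1970-1979", "1980-1989", "1990-1999"]

records = [{"status": "employed", "decade": "1970-1979"},
           {"status": "retired",  "decade": "1950-1959"}]

rng = random.Random(1)
perturbed = [{"status": rng.choice(STATUSES), "decade": rng.choice(DECADES)}
             for _ in records]
print(perturbed)   # artificial entries that no longer correspond to real persons
```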
6.1.2.6 Permutation
With permutation, data is shuffled between data sets within attributes. With this method, no values of the data set are altered, but the original data set is broken down into two parts (for example two tables) and linked via a group ID. This softens the association between the values from table 1 and the values from table 2.
Example:
In a data set of 30 patients, the table with the personal data is divided into the quasi-identifying attributes (age, gender and place of residence) on the one hand and the sensitive attributes on the other (course of the disease, symptoms). The split tables are still linked to one another via a group ID. 30 different sensitive values (table 2) are now possible for one entry in table 1 (quasi-identifying attributes). It is no longer possible to determine what course of disease and what symptoms the patients on the list have.
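The split described above can be sketched in Python as follows; the patient records, attribute names and the single group ID are illustrative assumptions.

```python
# Minimal sketch of permutation: the record set is split into a table of
# quasi-identifiers and a table of sensitive attributes, linked only by a
# shared group ID.
patients = [
    {"age": 34, "gender": "f", "residence": "X", "course": "mild",   "symptoms": "cough"},
    {"age": 57, "gender": "m", "residence": "Y", "course": "severe", "symptoms": "fever"},
]

GROUP_ID = 1   # in the example above, all 30 patients would share one group ID

table1 = [{"group": GROUP_ID, "age": p["age"], "gender": p["gender"],
           "residence": p["residence"]} for p in patients]
table2 = [{"group": GROUP_ID, "course": p["course"],
           "symptoms": p["symptoms"]} for p in patients]

# Within a group, any sensitive row of table2 could belong to any row of
# table1, which softens the association between the two tables.
print(table1)
print(table2)
```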
6.1.3 Generalization/aggregation
Data can be de-identified by reducing their accuracy using various techniques (see a selection of individual such techniques below under 6.1.3.1 and 6.1.3.2) (for example, categorical values can be replaced by more general values based on a taxonomy, for instance the term “academic” replaces the designations judge, doctor or pharmacist). For numeric attributes, exact information is replaced by intervals (for example, the age 30 is replaced by the interval 30-35). This makes the data less specific and means that it can no longer easily be traced back to individual persons. However, if the number of sets of data is too small or the spread is too low, aggregation can still allow a personal reference.
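For illustration, the following is a minimal Python sketch of taxonomy-based and interval-based generalization; the taxonomy and the interval width are illustrative assumptions.

```python
# Minimal sketch of generalization: categorical values are mapped one level up
# a taxonomy and numeric ages are replaced by intervals.
TAXONOMY = {"doctor": "academic", "judge": "academic", "pharmacist": "academic",
            "electrician": "craftsman", "painter": "craftsman"}

def generalize_age(age: int, width: int = 5) -> str:
    """Replace an exact age by an interval of the given width."""
    lower = (age // width) * width
    return f"{lower}-{lower + width}"

record = {"occupation": "doctor", "age": 32}
generalized = {"occupation": TAXONOMY[record["occupation"]],
               "age": generalize_age(record["age"])}
print(generalized)   # {'occupation': 'academic', 'age': '30-35'}
```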
6.1.3.1 Use of various generalization schemes
Depending on the generalization approach, a distinction can be made between different schemes: In a so-called “full-domain generalization scheme,” all values of an attribute are generalized to the same level. If “doctor,” “judge” and “pharmacist” are replaced by “academic” in the above example, “electricians” and “painters” would also have to be generalized to “craftsmen.”
In a so-called “subtree generalization scheme,” all so-called “child nodes”25 of a “parent node” are generalized. A so-called “sibling generalization scheme” is similar to the above, but here only specific child nodes of a parent node are generalized. For example, “doctor” can be replaced by “academic” without altering the designation “judge.” A so-called “cell generalization scheme,” on the other hand, allows the generalization of only selected individual values. For example, the value “judge” can be generalized in one entry and at the same time retained in another entry in the same table.
25 In graph theory, a node is an element of the node set of a graph. An edge indicates whether two nodes are related to one another, i.e. connected to one another in the graphical representation of the node set. For a node other than the root, the node from which it is reached via an incoming edge is called its parent node. Conversely, all nodes that are reached from a node via an outgoing edge are called its children, child nodes, or descendants (the gender-neutral terms “parent” and “child” have largely displaced the older “father” and “son” terminology).
So-called “multidimensional generalization” considers several attributes at the same time and provides different generalization approaches for the respective attributes. For example, the group “Doctor, 32” can be replaced by “Doctor, (30-40),” whereas all entries with “Doctor, 36” are generalized to “Academic, 36.”
Example:
In a set of data about patients – after the (unambiguously) identifying attributes (name, health insurance number) have been erased – all quasi-identifying and sensitive attributes are generalized one level up (the patient’s exact home address becomes the neighborhood, the age becomes a specified age range and the broken leg becomes a fracture).
6.1.3.2 Micro-aggregation
Micro-aggregation describes a technique of de-identification in which the data are grouped according to similarity in the attribute values and the individual values are combined into a representative value for each group, such as the mean or median. While individual attribute values are altered (or generalized) with classic aggregation, with micro-aggregation the attribute values remain the same and are only summarized. Micro-aggregation therefore has the advantage over classic aggregation, among others, that it leads to less data loss and regularly maintains the granularity of the data to a higher degree.
Example:
A very simple form of aggregation is the summary of all data points to an average value. In principle, this no longer allows any conclusions to be drawn about individuals (for example the average salary of a software developer in a larger group). For example, patient data can be de-identified with the help of micro-aggregation by first dividing the patients into groups according to age and then replacing the individual age values within an age group with the age mean of this group.
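A minimal Python sketch of micro-aggregation on an age attribute follows; the grouping by decade is an illustrative simplification of the similarity-based grouping described above.

```python
# Minimal sketch of micro-aggregation: ages are grouped (here simply by decade)
# and each individual age is replaced by the mean of its group.
from collections import defaultdict
from statistics import mean

ages = [23, 27, 25, 41, 44, 48, 62, 65]

groups = defaultdict(list)
for age in ages:
    groups[age // 10].append(age)        # group key: decade of the age value

group_means = {key: round(mean(values), 1) for key, values in groups.items()}
micro_aggregated = [group_means[age // 10] for age in ages]
print(micro_aggregated)   # [25.0, 25.0, 25.0, 44.3, 44.3, 44.3, 63.5, 63.5]
```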
6.2 Formalized anonymization criteria
A distinction is made between de-identification techniques on the one hand and formalized anonymization criteria on the other. “Formalized anonymization criteria” are not techniques as such, but a mathematical description of the specific “security level” of the intended de-identification as a result of the planned (combination of) de-identification techniques to be used. The fulfillment of a formalized degree of de-identification is not synonymous with the achievement of a de facto anonymization; rather, the other necessary criteria (see 6.2.1 to 6.2.3) must also be observed.
6.2.1 Differential privacy
Differential privacy is a mathematical definition of requirements to make the degree of de-identification measurable.26 Differential privacy aims to provide an accurate indication of the likelihood of re-identification without the need to identify individual data sets.
The level of the risk of re-identification is determined by the parameter epsilon (ε),27 expressed as the probability that a query on a database that contains an additional set of data will produce the same result as a query on another database that does not contain this set of data. The smaller the factor ε, the higher the protection against a re-identification attack. Which value ε must assume in order to achieve the degree of de facto anonymization according to this measurement method can only be assessed for each situation individually, since the quantity of the data plays a particularly significant role here.
26 A randomized function κ provides ε-differential privacy if, for all sets of data D1 and D2 which differ in at most one entry, and all S ⊆ Range(κ), the following applies: Pr[κ(D1) ∈ S] ≤ e^ε · Pr[κ(D2) ∈ S].
27 Papastefanou, “Database Reconstruction Theorem” und die Verletzung der Privatsphäre (Differential Privacy), CR 2020, 379-386 (382 et seq.).
Example:
At a conference, the number of all participants per subject area should be published. For this purpose, the data is aggregated and random noise is added to the result, which is selected according to the contribution of an individual user (for example, each participant can choose a maximum of three subject areas and is only counted once for each subject area). If the possible contribution of the users changes, the parameters of the noise must also be changed (for example, if a user is to choose only a single subject).
The catchphrase “local differential privacy” refers to the addition of statistical noise to each individual set of data before it is collected, which means that drawing conclusions about individuals becomes impossible, while the data de-identified in this manner still permit a statistical evaluation. “Central differential privacy,” on the other hand, means that data is first aggregated and then provided with random noise in order to disguise the existence of individual sets of data of users in the collected data. In both cases, the noise comes from a recognized distribution (usually Laplace or Gauss) with predetermined parameters, which are obtained from known properties of the existing sets of data (for example, how often a single user has contributed a value and how much influence his data has on the result of the aggregation).
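The following minimal Python sketch illustrates the central variant with the Laplace mechanism for the conference example above; the epsilon value and the sensitivity of 3 (each participant contributes to at most three subject-area counts) are illustrative assumptions.

```python
# Minimal sketch of the Laplace mechanism (central differential privacy):
# a count is published with Laplace noise of scale sensitivity / epsilon.
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(true_count: int, epsilon: float, sensitivity: float) -> float:
    """Return a noisy count; the smaller epsilon, the stronger the noise."""
    return true_count + laplace_noise(sensitivity / epsilon)

counts = {"privacy": 120, "security": 95, "ai": 210}   # true participant counts
epsilon, sensitivity = 1.0, 3.0   # each participant picks up to three topics
print({topic: round(dp_count(n, epsilon, sensitivity), 1)
       for topic, n in counts.items()})
```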
6.2.2 k-anonymity
k-anonymity is a formal data protection model that describes the probability with which one set of data can be linked to another. This allows a statement to be made about the probability of re-identification.
For de-identification, k-anonymity requires that sets of data are altered to such an extent that no conclusion can be drawn about a single person (i.e. each person is indistinguishable from at least k-1 other persons). The “k-value” expresses how often a combination of attribute values of a set of data occurs within a data collection (so-called equivalence class).
Example:
As part of a medical study, the zip code, attending physician and illness are saved and the sensitive information about the illness is to be de-identified. If there are two identical entries in the table with the same attributes for zip code, attending physician and illness, the k-value is 2.
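A minimal Python sketch of such a check follows, using the common reading of the k-value as the size of the smallest equivalence class over the quasi-identifying attributes; the records and column names are illustrative assumptions.

```python
# Minimal sketch of a k-anonymity check: k is the size of the smallest
# equivalence class over the quasi-identifying attributes.
from collections import Counter

records = [
    {"zip": "80331", "physician": "Dr. A", "illness": "flu"},
    {"zip": "80331", "physician": "Dr. A", "illness": "flu"},
    {"zip": "80335", "physician": "Dr. B", "illness": "asthma"},
    {"zip": "80335", "physician": "Dr. B", "illness": "asthma"},
    {"zip": "80335", "physician": "Dr. B", "illness": "asthma"},
]

QUASI_IDENTIFIERS = ("zip", "physician")

classes = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in records)
k = min(classes.values())
print(f"k = {k}")   # k = 2: the smallest equivalence class contains two entries
```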
Yet, k-anonymity has weaknesses. Due to the homogeneity of the equivalence classes (i.e. all k sets of data of an equivalence class have identical attributes) or due to additional background knowledge (i.e. an attacker knows about the existence of a person in a database and can attribute this person to the correct equivalence class, so that they can potentially exclude certain sensitive attributes for the person due to the additional knowledge), a re-identification is possible. These weaknesses are to be remedied through further developments of k-anonymity (through l-diversity and t-closeness, see below).
6.2.3 l-diversity and t-closeness
l-diversity is an extension of k-anonymity intended to remedy a weak point of the k-anonymity model, namely that the k-value alone makes no statement about how the persons within a k-group are represented. In the case of l-diversity, the association of a sensitive or otherwise easily identifiable attribute (cf. 6.1.1) with a person is protected by hiding it among at least l other sensitive attribute values. An attacker therefore needs background knowledge about at least l-1 values in order to be able to deduce the correct attribute by excluding enough incorrect sensitive attributes.
Example:
With a k-factor of 5, for example, two persons over 100 years of age are included and the age information is available in the data set. Due to the background knowledge that two persons over 100 years of age are present in a data set and since persons over 100 years of age are very rare, these two people can easily be re-identified. By further de-identifying this information, for example by attributing people over 100 years of age to a different age group, the set of data can be anonymized according to the l-diversity.
The t-closeness model, in turn, refines the approach of l-diversity by forming equivalence classes (also known as blocks) that are similar to the original distribution of the attribute values in the data. Another condition is introduced for this: Not only should at least l different values be represented in an equivalence class, but it is also necessary that each value is represented in a block as frequently as corresponds to the original distribution of each individual attribute.
Example:
An insurer wants to create a statistical overview of the districts in which most insurance claims are reported by customers and also break this down by age. To do this, however, it is not absolutely necessary to directly relate and process the zip code, age and the number of reported insurance claims of the customers concerned. If the t-closeness method is applied accordingly, the last digit of the zip code could be omitted and this could be divided into blocks, for example 8080*, 8033*, etc. The same can be done with the age of the customer, for example in five-year steps. The number of reported insurance claims would consequently only be displayed and processed in relation to these blocks (i.e. zip code areas and age areas).
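For illustration, a minimal Python sketch of an l-diversity check on such blocks follows; the records, block definitions and the choice of the number of reported claims as the sensitive attribute are illustrative assumptions.

```python
# Minimal sketch of an l-diversity check: within each equivalence class
# (block), the sensitive attribute must take at least l distinct values.
from collections import defaultdict

records = [
    {"zip_block": "8080*", "age_band": "30-34", "claims": 1},
    {"zip_block": "8080*", "age_band": "30-34", "claims": 3},
    {"zip_block": "8033*", "age_band": "50-54", "claims": 0},
    {"zip_block": "8033*", "age_band": "50-54", "claims": 0},
]

blocks = defaultdict(set)
for r in records:
    blocks[(r["zip_block"], r["age_band"])].add(r["claims"])

l_value = min(len(values) for values in blocks.values())
print(f"l = {l_value}")   # l = 1: the second block is not diverse (all claims are 0)
```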
6.3 Effectiveness of anonymization
Depending on the nature of the raw data set, a combination of the formalized anonymization criteria or de-identification techniques listed above (see 5.3) may be necessary. The methods used for de-identification must ensure that, in terms of de facto anonymization (see 3.3.1 above), it cannot be expected that someone will be able to:
• pick out a single person from the database (“singling out”) – this is the case as long as sets of data (for example in a table) can be attributed to individual persons;
Example:
In an employee list, a wide variety of attributes are swapped or stochastically overlaid in the sets of data, but not the position of the employee in the company. Since there is only one head of the IT department in many companies, he can be easily identified from the database (if you know that the employee’s position has not been swapped), even if some of the attributes stored in the database are not accurate for the head of the IT department.
• establish a connection between two sets of data of a database or between two independent databases in the sense of linkability – this means the possibility of attributing at least two different entries to the same person or the same group of people, regardless of whether they are in the same database or not;28
Example:
When de-identifying search queries via publicly accessible search engines, for instance, it cannot be ruled out that the data could be linked to a specific person with the help of other information available on the Internet.
28 In individual cases, the quality of the information provided by the anonymized data (e.g. statistics) can be contingent on various sets of data being further attributable to a single person without this person having to be known. In such a case, effective anonymization would have to ensure that re-identification is not possible despite the linking of several sets of data.
• derive information from a database by means of inference – according to this, it must in all likelihood be impossible to derive the value or content of a set of data from that of other entries.
Example:
A census collects certain information about the population. The data is published, but aggregated in such a way that a result is only returned for a query if it covers at least five people; otherwise no result is given. However, the selection of certain criteria for the inhabitants of a locality (gender, age group, nationality) results in exactly five people whose school education is identical. It can thus be deduced from the aggregated data that if you meet a person from this locality with the corresponding gender, age and nationality, you would implicitly also know their schooling.
If one or more of these “re-identification attacks” leads to “success,” i.e. this enables a – partial – re-identification, personal data still exists.
Generally, no single method and no single criterion is sufficient in itself to effectively and de facto anonymize data. Sufficient de facto anonymization therefore regularly requires a combination of different methods of randomization and generalization (see also 5.3 and 6.4.3).
Note:
The effectiveness of the de facto anonymization must be assessed and documented for the respective situation at hand and the methods used for de-identification. The following questions can serve as a guide:
• Who could have a motive for re-identification?
• What resources are available to someone for re-identification?
• What steps and what (time) effort are required for the de-identified data to be restored?
• What data is publicly available that could be used to restore the personal reference? This could be information from public registers (e.g. commercial register, land register, register of associations, etc.), information from social media, information that can be found via search engines or databases, or information that can otherwise be accessed via the Internet or from other data sources.
• Are there third parties with access to the de-identified data who possess further information on the basis of which the personal reference can be restored (e.g. the original data set, location data which can enable identification in combination with the de-identified data, etc.)?
6.4 Selection of the anonymization method
To determine the selection of the anonymization method (i.e. the required combination of de-identification techniques), the following aspects in particular must be taken into account:
6.4.1 What kind of data sets are involved?
The first step is to evaluate the type and nature of the sets of data concerned:
• What “sensitivity” does the data have? (Sensitivity describes not only special categories of personal data within the meaning of Article 9 GDPR, but also information that is particularly relevant for the data subject, such as bank and account information; see 5.2.)
• Would re-identification entail a high risk under data protection law?
• How much data, how many persons and which (“type” of) persons (e.g. children) are affected?
6.4.2 For which use case is the data being anonymized?
In order for the data to be anonymized to have the quality required for the respective use case, the specific purpose of the anonymized data must also be taken into account when selecting the de-identification techniques to be applied. Too high a degree of anonymization could make the data unusable, whereas too low a degree of de-identification can rule out de facto anonymization.
6.4.3 Levels of anonymization
The EU data protection authorities propose a step-by-step approach to anonymization, i.e., different techniques should be combined (see also 5.3).29
The following order can be used:
• Removal of identifiers: First, the identifiers should be removed, i.e. all directly or indirectly identifying attributes should be erased.
• Randomization: The next step is to randomize the data sets in order to remove a direct link between the data and the data subjects.
• Generalization: In the last step, generalization and aggregation are used to reduce the accuracy of the data.
Depending on the specific use case and how prone the data sets are to re-identification, however, a different order of de-identification techniques may be more expedient. In any case, the use of just a single de-identification technique will only in the rarest of cases suffice to achieve a sufficient level of de-identification.
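A minimal Python sketch of such a layered approach follows; the record structure, the noise range and the interval width are illustrative assumptions, and each step corresponds to one of the techniques sketched earlier.

```python
# Minimal sketch of a step-by-step anonymization pipeline combining identifier
# removal, randomization and generalization on a single record.
import random

def anonymize(record: dict) -> dict:
    out = dict(record)
    # Step 1: removal of identifiers
    for identifier in ("name", "customer_id"):
        out.pop(identifier, None)
    # Step 2: randomization (stochastic overlay of the numeric salary value)
    out["salary"] = round(out["salary"] + random.uniform(-500, 500), 2)
    # Step 3: generalization (the exact age becomes a ten-year interval)
    lower = (out["age"] // 10) * 10
    out["age"] = f"{lower}-{lower + 9}"
    return out

print(anonymize({"name": "A. Smith", "customer_id": 42, "age": 37, "salary": 52000.0}))
```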
6.4.4 Review
After each de-identification step, it must be checked whether sufficient de facto anonymization has already been achieved and whether a “re-identification attacker” would accordingly have to invest a disproportionately large amount of time, money and manpower in order to carry out a re-identification. See section 6.5 below for details.
6.5 Regular review of the anonymization method
Due to technical progress and possible changes in other relevant objective factors (see 5.2), the anonymization method used (i.e., the sum of the de-identification methods used which lead to effective de facto anonymization) must be regularly reviewed and – if necessary – updated.
Information security is not to be considered a state, but rather a continuous improvement process (ISO/IEC 27001) and should be regularly reviewed based on the PDCA cycle (Plan-Do-Check-Act)30 or by means of a comparable methodology. The PDCA system is also recommended for implemented or planned de facto anonymizations in order to regularly take stock of the security status. Since technical progress cannot realistically be forecast either in terms of de-identification techniques or future hardware performance (and, in the context of de facto anonymization, the controller initially only checks whether re-identification is unlikely at the time the anonymization method is carried out, see 5.2 and 6.2), the period between such reviews should be kept rather short in case of doubt.
30 Cf. for instance GDD-Praxishilfe DS-GVO II, Verantwortlichkeiten und Aufgaben nach der DSGVO, page 7, available at https://www.gdd.de/downloads/praxishilfen/GDD-Praxishilfe_DS-GVO_2.pdf. The PDCA cycle describes a four-phase process that is used to control and continuously improve processes and products. Processes are initially planned (plan), tested (do), the test phase is evaluated (check) and, on this basis, the process is improved (act).
The technical guidelines of the German Federal Office for Information Security (BSI) on cryptographic procedures can also be used as a guide. These provide information about the reliability of forecasts for the security of cryptographic procedures and are therefore an indicator of the period over which a technology used (for anonymization) can be considered effective.31
Note:
Audits by external, specialized service providers can also be used to review the effectiveness of the anonymization method used. To date, though, the known service providers have not yet offered reviews of the effectiveness of the anonymization method used.
If the review and evaluation of the anonymization method leads to the result that it is no longer sufficiently effective and re-identification is possible, remedial action must be taken, as otherwise a risk for the rights and freedoms of the data subjects arises. This can possibly mean that old methods can no longer be used, or at least no longer alone without additional accompanying de-identification techniques, and must be replaced by new measures, or even that the entire anonymization method must be redesigned. In individual cases, it may also be necessary to again de-identify already de-identified data that were de-identified with the “old” anonymization method, using the new anonymization method in order to be able to continue to use these sets of data in a de facto anonymized form.
Even if anonymized data has been disclosed to third parties or made publicly accessible, it must be deleted or replaced by newly anonymized data if it later becomes re-identifiable again. This should be taken into account before disclosing de facto anonymized data. In addition, an erasure concept – even if this is not legally required for anonymized data – can create additional legal certainty and risk minimization when dealing with de facto anonymized data. The parameters for an exchange or a return/erasure of this data can, for example, be contractually regulated.