A GUIDE TO BASIC ANONYMISATION
STEP 4: COMPUTE YOUR RISK
Applicable to:
Internal data sharing (de-identified data)
Internal data sharing (anonymised data) or External data sharing
Long-term data retention
Synthetic data
k-anonymity⁹ is a simple method¹⁰,¹¹ for computing the re-identification risk level of a dataset. It groups identical records together and takes the size of the smallest group as the dataset's k-anonymity value. The smallest group is used because it represents the worst-case scenario when assessing the overall re-identification risk of the dataset. A k-anonymity value of 1 means that at least one record is unique. Generally, only indirect identifiers are considered when computing k-anonymity.¹²

A higher k-anonymity value means a lower risk of re-identification, while a lower value implies a higher risk. Generally, the industry threshold for the k-anonymity value is 3 or 5.¹³ Where possible, a higher threshold should be set to minimise re-identification risks. Refer to Chapter 3 (Anonymisation) of PDPC’s Advisory Guidelines on the Personal Data Protection Act for Selected Topics for the criteria for determining whether data may be considered sufficiently anonymised.
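| Postal code | Age      | Favourite show                 | Group k |
|-------------|----------|--------------------------------|---------|
| 22xxxx      | 21 to 25 | Emily in Paris                 | k=2     |
| 22xxxx      | 21 to 25 | Emily in Paris                 | k=2     |
| 10xxxx      | 41 to 45 | Brooklyn Nine-Nine             | k=4     |
| 10xxxx      | 41 to 45 | Brooklyn Nine-Nine             | k=4     |
| 10xxxx      | 41 to 45 | Brooklyn Nine-Nine             | k=4     |
| 10xxxx      | 41 to 45 | Brooklyn Nine-Nine             | k=4     |
| 58xxxx      | 56 to 60 | Attenborough’s Life in Colour  | k=3     |
| 58xxxx      | 56 to 60 | Attenborough’s Life in Colour  | k=3     |
| 58xxxx      | 56 to 60 | Attenborough’s Life in Colour  | k=3     |

Overall k=2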
The above diagram illustrates a dataset with three groups of identical records. The k value of each group ranges from 2 to 4. Overall, the dataset’s k-anonymity value is 2, which is the lowest group value (and therefore the highest risk) in the entire dataset.¹⁴
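The calculation in the illustration can be reproduced with a short script. The following is a minimal sketch in Python, not a prescribed implementation. It assumes that all three columns shown (postal code, age and favourite show) are the attributes used for grouping, and the function names group_sizes, k_anonymity and meets_threshold are hypothetical names chosen for this example.

```python
from collections import Counter

# Records from the illustration: (postal code, age band, favourite show).
# Assumption: all three columns are used to decide whether records are identical.
records = (
    [("22xxxx", "21 to 25", "Emily in Paris")] * 2
    + [("10xxxx", "41 to 45", "Brooklyn Nine-Nine")] * 4
    + [("58xxxx", "56 to 60", "Attenborough's Life in Colour")] * 3
)


def group_sizes(rows):
    """Count how many records share each identical combination of values."""
    return Counter(rows)


def k_anonymity(rows):
    """Return the size of the smallest group of identical records."""
    sizes = group_sizes(rows)
    if not sizes:
        raise ValueError("cannot compute k-anonymity of an empty dataset")
    return min(sizes.values())


def meets_threshold(rows, threshold):
    """Check the dataset's k-anonymity value against a chosen threshold."""
    return k_anonymity(rows) >= threshold


sizes = group_sizes(records)
k = k_anonymity(records)

print(sorted(sizes.values()))       # group sizes: [2, 3, 4]
print(k)                            # overall k-anonymity value: 2
print(meets_threshold(records, 3))  # False, because the smallest group has 2 records
```

With a threshold of 3, this example dataset would not pass, because the smallest group contains only two records. In practice, the records would first be reduced to the indirect identifiers selected for the computation before the groups are counted.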