Synthetic data

What is synthetic data and what does it do?

Synthetic data is ‘artificial’ data generated by data synthesis algorithms, which replicate patterns and the statistical properties of real data (which may be personal data). It is generated from real data using a model trained to reproduce the characteristics and structure of that data. This means that when you analyse the synthetic data, the analysis should produce very similar results to analysis carried out on the original real data.

It can be a useful tool for training AI models in environments where access to large datasets is not possible.
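As a minimal sketch of the description above (a model is trained on real data, then sampled to produce artificial records), the Python below fits a two-variable Gaussian to a made-up stand-in for real data and draws new rows from it. The Gaussian model, the age and income columns, and the function names `fit_gaussian_2d` and `sample_synthetic` are assumptions for illustration only; practical data synthesis tools use richer models.

```python
import math
import random
import statistics

def fit_gaussian_2d(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(model, n, rng):
    """Draw n artificial rows that follow the fitted distribution."""
    rho = model["rho"]
    spread = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + spread * z2)
        rows.append((x, y))
    return rows

rng = random.Random(0)
# Hypothetical stand-in for a real dataset: (age, income) pairs.
real = []
for _ in range(2000):
    age = rng.gauss(40, 10)
    real.append((age, 20000 + 500 * age + rng.gauss(0, 4000)))

model = fit_gaussian_2d(real)                   # "train" on the real data
synthetic = sample_synthetic(model, 5000, rng)  # generate artificial rows
refit = fit_gaussian_2d(synthetic)              # analyse the synthetic data
print(round(model["rho"], 2), round(refit["rho"], 2))
```

Under these assumptions, analysing `synthetic` should give means and a correlation close to those of `real`. The sketch also draws 5,000 rows from a model fitted to only 2,000.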

There are two main types of synthetic data:

• “partially” synthetic data, which synthesises only some variables of the original data; and
• “fully” synthetic data, which synthesises all variables.
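The difference between the two types can be sketched in a few lines of Python. The `synthesise` function and the example records are hypothetical, and the generator is deliberately naive: it redraws each chosen column independently from the values seen in that column, which a real synthesiser would not do.

```python
import random

def synthesise(rows, columns, rng):
    """Return copies of rows with the named columns replaced.

    Naive stand-in synthesiser: each replaced value is drawn independently
    from the values observed in that column, so relationships between
    columns are not preserved and real values can reappear.
    """
    pools = {c: [r[c] for r in rows] for c in columns}
    return [
        {k: (rng.choice(pools[k]) if k in pools else v) for k, v in r.items()}
        for r in rows
    ]

# Hypothetical records, invented for this example.
records = [
    {"age": 34, "postcode": "AB1", "diagnosis": "asthma"},
    {"age": 51, "postcode": "CD2", "diagnosis": "diabetes"},
    {"age": 27, "postcode": "EF3", "diagnosis": "none"},
]
rng = random.Random(1)
partial = synthesise(records, ["age", "postcode"], rng)  # only some variables
full = synthesise(records, list(records[0]), rng)        # all variables
```

In `partial`, the `diagnosis` column still holds the original values, so those values remain real data; in `full`, every column has been regenerated.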

How does synthetic data assist with data protection compliance?

Synthetic data requires real data to generate it, which may involve the processing of personal data. However, data synthesis may allow large datasets to be generated from small datasets. This can help you comply with the data minimisation principle as it reduces or eliminates the processing of personal data.

You should consider synthetic data for generating non-personal data in situations where you do not need to, or cannot, share personal data. If you are generating synthetic data derived from personal data, any inherent biases in that data will be carried through. You should:

• ensure that you can detect and correct bias in the generation of synthetic data, and ensure that the synthetic data is representative; and
• consider whether you are using synthetic data to make decisions that have consequences (ie legal or health consequences) for individuals.
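One simple way to check representativeness, sketched here under stated assumptions, is to compare each group's share of the synthetic data with its share of the original data. The function names, the group labels and the 0.05 threshold are all hypothetical. A check like this only shows whether generation shifted the balance between groups; it cannot reveal bias that was already present in the original data.

```python
from collections import Counter

def share_gaps(real_groups, synthetic_groups):
    """Synthetic share minus real share for every group label."""
    real_counts = Counter(real_groups)
    synth_counts = Counter(synthetic_groups)
    labels = set(real_counts) | set(synth_counts)
    return {
        g: synth_counts[g] / len(synthetic_groups)
        - real_counts[g] / len(real_groups)
        for g in labels
    }

def flag_groups(gaps, threshold=0.05):
    """Labels whose share moved by more than the (arbitrary) threshold."""
    return sorted(g for g, gap in gaps.items() if abs(gap) > threshold)

# Made-up group labels for illustration.
real_groups = ["A"] * 70 + ["B"] * 25 + ["C"] * 5
synthetic_groups = ["A"] * 80 + ["B"] * 19 + ["C"] * 1

gaps = share_gaps(real_groups, synthetic_groups)
flagged = flag_groups(gaps)
```

A flagged group is a prompt to investigate the generation step, for example by adjusting the model or reweighting the output, before the synthetic data is used.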

What do we need to know about implementing synthetic data?

Generating synthetic data is an active research area and, at present, it may not be a viable solution for many data processing scenarios. Synthetic data is being considered as a type of statistical disclosure control method for open data release.

What are the risks associated with the use of synthetic data?

The degree to which synthetic data is an accurate proxy for the original data depends on the utility of the method and model. The more that the synthetic
