PHILIPP M DIESINGER
THE DAWN OF SYNTHETIC DATA
In the rapidly evolving landscape of technology and data, a groundbreaking trend is emerging: the rise of synthetic data. As data becomes the lifeblood of modern businesses, researchers and developers are looking for innovative ways to harness its power while addressing privacy concerns and data scarcity. Synthetic data, an approach that generates artificial datasets with properties mimicking those of real-world data, is gaining momentum.
PHILIPP IS A DATA SCIENTIST, AI ENTHUSIAST AND ESTABLISHED LEADER OF LARGE-SCALE DIGITAL TRANSFORMATIONS. HE IS A PARTNER AT BCG X. PHILIPP HOLDS A PHD IN THEORETICAL PHYSICS FROM HEIDELBERG UNIVERSITY AND SPENT THREE YEARS AT MIT DEVELOPING A STRONG BACKGROUND IN AI RESEARCH AND LIFE SCIENCES.
Synthetic data is information that has been artificially created by computer algorithms, as opposed to traditional real data based on observations of real-world events. It is generated by algorithms and models that replicate the statistical properties, structures, and relationships found in real data, and it is often used as a substitute for actual data, especially where privacy, security, or a limited dataset poses challenges.
Synthetic data is not a new concept. Academic disciplines, including computational physics and engineering, have long employed it. These fields have successfully modelled and simulated complex systems, ranging from molecular structures and intercity traffic to optical devices and entire galaxies. The simulations are grounded in first principles and generate data that portrays the behaviour of these systems. This synthetic data is then subjected to statistical analysis to derive insights and predict system properties.
Synthetic data is also often generated from known statistical probability distributions of system components. This method allows synthetic data to be created even from limited datasets, by empirically measuring distributions and then sampling from them to expand and augment the dataset. Well before the advent of modern computing, mathematicians employed analytical techniques: they derived probability distributions from first principles and propagated them to the system level, often using results such as the central limit theorem.
While the notion of synthetic data is not a recent development, its relevance has risen significantly in recent years, and the number of industry applications has increased dramatically. Synthetic data now finds applications across a multitude of industries. Notably, autonomous vehicles, aircraft, and drones are trained on hyper-realistic 3D-rendered data. Industry giants like Amazon have made a name for themselves by using synthetic data to train their warehouse robots to recognise and handle packages of diverse shapes and sizes. The healthcare sector is increasingly harnessing synthetic data to train AI systems while ensuring that patient privacy remains uncompromised. The surge in the relevance of synthetic data is aided by
26 | THE DATA SCIENTIST