Creating Synthetic Data for Testing Digital Platforms


A client needed to test their new digital platform and understand the range of scenarios it might encounter. Using real data was not possible because of ethical and privacy concerns, so the client required a synthetic dataset that reflected real-world scenarios, was statistically representative, and could be built without access to sensitive information.


Creating synthetic data that is both realistic and ethical can be challenging. There is a risk that synthetic data could inadvertently identify real individuals, leading to privacy concerns. Additionally, generating realistic data requires understanding the distributions and correlations of actual data, and accurately replicating these factors in the synthetic data. 
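
As a minimal sketch of what replicating a correlation can look like, the Python example below draws one field first and then draws a second field conditional on it, so the relationship between the two carries into the synthetic records. The field names (`age_band`, `tenure`) and all probabilities are made-up assumptions for illustration; they are not figures from this project, where such values would come from published open-source statistics.

```python
import random

# Hypothetical, made-up probabilities for illustration only. In practice these
# would be estimated from published open-source statistics.
AGE_BAND_WEIGHTS = {"18-34": 0.3, "35-64": 0.5, "65+": 0.2}
TENURE_GIVEN_AGE = {
    "18-34": {"rent": 0.70, "own": 0.30},
    "35-64": {"rent": 0.35, "own": 0.65},
    "65+": {"rent": 0.20, "own": 0.80},
}


def weighted_choice(rng, weights):
    """Pick one key from a {value: probability} mapping."""
    values = list(weights.keys())
    probs = list(weights.values())
    return rng.choices(values, weights=probs, k=1)[0]


def sample_record(rng):
    """Draw the age band first, then tenure conditional on that age band."""
    age_band = weighted_choice(rng, AGE_BAND_WEIGHTS)
    tenure = weighted_choice(rng, TENURE_GIVEN_AGE[age_band])
    return {"age_band": age_band, "tenure": tenure}


def ownership_rate(records, age_band):
    """Share of records in the given age band whose tenure is 'own'."""
    group = [r for r in records if r["age_band"] == age_band]
    return sum(r["tenure"] == "own" for r in group) / len(group)


rng = random.Random(0)  # fixed seed so the example is repeatable
records = [sample_record(rng) for _ in range(5000)]
```

Sampling the two fields independently would reproduce each marginal distribution but lose the link between them; comparing `ownership_rate(records, "65+")` with `ownership_rate(records, "18-34")` is one simple way to check that the link survived.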


To create synthetic data that can be used ethically to test new systems, we used the concept of digital twins or digital shadows. In this project, we developed a synthetic database of fake people, addresses and scenarios, which acted as a proxy for real-life data that could be used to test a digital platform. 

To generate the synthetic data, we used open-source data for each of the parameters needed and captured the correlations and relationships between them using simple mathematical approaches. We then added a prefix to each column of data to make it clear that the data was synthetic. For fields such as names, we took a sample of real first names and paired them with surnames generated from the phonetic alphabet, producing names that look realistic but are fake.
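
The naming and labelling steps described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the first names listed are a hypothetical sample, "phonetic alphabet" is assumed to mean the NATO spelling alphabet, the prefix is assumed to be applied to column names, and the `synthetic_` prefix and all function names are invented for the example.

```python
import random

# Hypothetical sample of first names (the project used a sample of real ones).
FIRST_NAMES = ["Aisha", "Daniel", "Maria", "Tom", "Priya", "James"]

# Assumption: "phonetic alphabet" means the NATO spelling alphabet, whose code
# words make surnames that are readable but obviously artificial.
PHONETIC_SURNAMES = [
    "Alfa", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf", "Hotel",
    "India", "Juliett", "Kilo", "Lima", "Mike", "November", "Oscar", "Papa",
    "Quebec", "Romeo", "Sierra", "Tango", "Uniform", "Victor", "Whiskey",
    "X-ray", "Yankee", "Zulu",
]

# Assumption: the prefix is applied to column names; the value is illustrative.
SYNTHETIC_PREFIX = "synthetic_"


def make_person(rng):
    """Pair a sampled first name with a phonetic-alphabet surname."""
    return {
        "first_name": rng.choice(FIRST_NAMES),
        "surname": rng.choice(PHONETIC_SURNAMES),
    }


def prefix_columns(record, prefix=SYNTHETIC_PREFIX):
    """Return a copy of the record with every column name prefixed."""
    return {f"{prefix}{column}": value for column, value in record.items()}


def generate_people(n, seed=0):
    """Generate n labelled synthetic person records, repeatable via the seed."""
    rng = random.Random(seed)
    return [prefix_columns(make_person(rng)) for _ in range(n)]


people = generate_people(10)
```

Each record then has the keys `synthetic_first_name` and `synthetic_surname`, so anyone who encounters the table downstream can see from the column names alone that it holds no real individuals.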

Results and Impact

The result of this work was a statistically accurate synthetic database that could be used to test digital platforms. The data was based on evidence rather than assumptions, and its use was not limited by privacy concerns. The client was able to use this data to test their new system before it became fully operational, without accessing sensitive personal information.

Ethical Impact

Our ethical approach to creating synthetic data allowed us to provide valuable insights to our client without risking the privacy of real individuals. By using open-source datasets and collaborating with our client, we created a comprehensive picture of scenarios that aligned with the use case of their new platform. This approach allows the data to be used effectively to test new systems and improve decision-making, while protecting the privacy of real individuals.

