Synthetic Data: generate realistic data artificially
Synthetic data is a new trend emerging in the analytics world. It is artificially created data, often engineered to carry specific, known properties. Several companies have been founded around synthetic data in recent years, and Gartner estimates that it will soon reach the peak of the hype cycle (see figure).
In this article we will introduce synthetic data and then focus on structured synthetic data, its types and corresponding use cases. Finally, we will show you how the MakeYourData tool implements them all.
What is synthetic data and why should you use it
Synthetic data is artificially generated data engineered to have specific properties. It is produced by an analyst or a developer rather than by a real-world process. Synthetic data serves many purposes, ranging from improving an application to enabling data sharing. Here are two notable examples.
A proven use case for synthetic data is in the automotive industry, specifically in the development of self-driving cars. Here artificial situations, which are a form of visual synthetic data, are used to teach cars to handle events that rarely occur in reality or that cannot be staged just for learning purposes. For example, you cannot ask real people to step in front of a moving car to simulate the risk of running them over and so improve the car's response.
CERN, the particle physics research centre near Geneva, has long relied on synthetic data techniques. The research centre operates extremely complex detectors, and to extract results it needs to know their effects in detail. For this reason, Monte Carlo techniques, which are a form of synthetic data generation, are used to simulate the detectors and their effect on particles. The simulations produce the same sort of data the real detectors produce, with the advantage that the settings and conditions of the simulation are known, so input and output can be compared to measure the detectors’ effect.
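To make the idea of comparing input and output concrete, here is a toy Monte Carlo sketch in Python. It is not how real detector simulations work: the function names, the 5% energy resolution and the 90% detection efficiency are all assumptions chosen for illustration. The point is only that, because the simulated input is known, the detector's effect can be measured from the output.

```python
import random


def simulate_detector(true_energies, resolution=0.05, efficiency=0.9, seed=0):
    """Toy detector: misses some particles and smears the energy of the rest.

    resolution and efficiency are illustrative assumptions, not real values.
    """
    rng = random.Random(seed)
    events = []
    for e_true in true_energies:
        if rng.random() > efficiency:
            # The particle went undetected: no measured value.
            events.append((e_true, None))
        else:
            # Gaussian smearing proportional to the true energy.
            e_measured = rng.gauss(e_true, resolution * e_true)
            events.append((e_true, e_measured))
    return events


def measured_efficiency(events):
    """Fraction of simulated particles that produced a measurement."""
    detected = sum(1 for _, measured in events if measured is not None)
    return detected / len(events)
```

Calling `measured_efficiency(simulate_detector([100.0] * 5000))` should return a value close to the efficiency that was put in, which is exactly the kind of check that is impossible with real data alone.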
In the following sections we will focus on structured synthetic data in tabular form.
Structured synthetic data types and use cases
The use cases for structured synthetic data are diverse. They can range from anonymising datasets to testing. In addition, there are several different techniques to generate data, each best suited to different use cases. This section will give you an overview of techniques and corresponding use cases.
Models that generate synthetic data range from those set entirely by the user to those partially or completely learned from existing datasets. There are (at least) three classes of structured synthetic data techniques, depending on the way they generate data and the level of human intervention: rule-based, statistically generated and AI-generated. Let us describe these classes and some typical use cases.
Rule-based synthetic data
Rule-based synthetic data is data generated by a pre-defined set of rules. For example, if you want to generate an HR dataset with employees and hire dates you might use these rules:
- Extract the employee’s name at random from a list of names
- Assign a hire date in a given range
- Assign an end date later than the hire date, consistent with an average tenure of 1 year
- Generate a salary distributed around the company average
This method gives you full control over the conditions under which data is generated. It also lets you start generating without any input dataset, and for this reason it is also referred to as “pure synthetic data”.
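The four HR rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not the way any particular tool implements them: the name list, the date range, the salary figures and the choice of an exponential distribution for tenure are all assumptions made for the example.

```python
import random
from datetime import date, timedelta

# Illustrative assumptions, not values from a real company.
NAMES = ["Alice", "Bruno", "Chloé", "Dario", "Elena"]
HIRE_START = date(2015, 1, 1)
HIRE_END = date(2021, 12, 31)
MEAN_TENURE_DAYS = 365      # average tenure of about 1 year
MEAN_SALARY = 80_000
SALARY_STD = 10_000


def generate_employee(rng):
    # Rule 1: pick the name at random from a list of names.
    name = rng.choice(NAMES)
    # Rule 2: assign a hire date in the given range.
    span_days = (HIRE_END - HIRE_START).days
    hire_date = HIRE_START + timedelta(days=rng.randint(0, span_days))
    # Rule 3: end date strictly later than the hire date; tenure drawn from
    # an exponential distribution with a mean of roughly one year (assumed).
    tenure_days = 1 + int(rng.expovariate(1 / MEAN_TENURE_DAYS))
    end_date = hire_date + timedelta(days=tenure_days)
    # Rule 4: salary normally distributed around the company average.
    salary = round(max(0.0, rng.gauss(MEAN_SALARY, SALARY_STD)), 2)
    return {"name": name, "hire_date": hire_date,
            "end_date": end_date, "salary": salary}


def generate_dataset(n_rows, seed=0):
    # A fixed seed makes the generated dataset reproducible.
    rng = random.Random(seed)
    return [generate_employee(rng) for _ in range(n_rows)]
```

Because every rule is explicit, changing the behaviour of the dataset (a longer tenure, a different salary spread) means changing one constant, with no input data required.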
Use cases
Prototyping: Most applications, for example mobile applications or business intelligence dashboards, need data as input. However, when development starts, real data might not exist yet, or it might not be usable due to confidentiality constraints. This causes delays and hidden costs. Synthetic data can fill the gap and let you start prototyping fast.
Demos: Any company producing applications needs to prove their value, and this is often done via product demonstrations. Data is needed to bring the applications to life, and real data might not be usable due to confidentiality or regulatory constraints. Synthetic data can again fill the gap.
Statistically generated synthetic data
In this case synthetic data is generated starting from an existing dataset. A model is built that distils statistical properties of the dataset that are later used to generate new data.
Using the HR example again, we could start from an existing employee database, use it to find the average tenure and average salary for different positions, and then use this information to generate data normally distributed around these values.
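As a rough sketch of this approach, the Python code below distils a per-position mean and standard deviation from an existing dataset and then samples new rows from normal distributions with those parameters. The field names (`position`, `salary`, `tenure_years`) and the function names are hypothetical, and a real model would usually capture more than two moments per column.

```python
import random
import statistics
from collections import defaultdict


def fit_by_position(records):
    """Distil mean and standard deviation of salary and tenure per position."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["position"]].append(record)
    model = {}
    for position, rows in grouped.items():
        salaries = [r["salary"] for r in rows]
        tenures = [r["tenure_years"] for r in rows]
        model[position] = {
            "weight": len(rows),  # how common the position is in the source
            "salary": (statistics.mean(salaries), statistics.pstdev(salaries)),
            "tenure": (statistics.mean(tenures), statistics.pstdev(tenures)),
        }
    return model


def sample(model, n_rows, seed=0):
    """Generate new rows normally distributed around the fitted values."""
    rng = random.Random(seed)
    positions = list(model)
    weights = [model[p]["weight"] for p in positions]
    rows = []
    for _ in range(n_rows):
        # Keep the original mix of positions.
        position = rng.choices(positions, weights=weights)[0]
        salary_mean, salary_std = model[position]["salary"]
        tenure_mean, tenure_std = model[position]["tenure"]
        rows.append({
            "position": position,
            "salary": round(rng.gauss(salary_mean, salary_std), 2),
            # Tenure cannot be negative, so clip at zero.
            "tenure_years": max(0.0, rng.gauss(tenure_mean, tenure_std)),
        })
    return rows
```

Note that salary and tenure are sampled independently here, so any correlation between them in the original data is lost; this is one of the limitations that learned models try to address.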
Use cases
Testing: Applications also need testing to detect unwanted behaviour and catch errors early before the application goes live. Synthetic data can help by reproducing real datasets but adding artificial mistakes or inconsistencies.
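One way to add artificial mistakes is sketched below in Python. The three error kinds, the field names (`salary`, `hire_date`, `end_date`) and the `inject_errors` function are assumptions made for this example; salaries are assumed to be positive numbers. The function returns a log of what was corrupted, so a test can check that the application catches every injected error.

```python
import copy
import random


def inject_errors(rows, error_rate=0.1, seed=0):
    """Return a corrupted copy of rows plus a log of (row index, error kind)."""
    rng = random.Random(seed)
    corrupted = copy.deepcopy(rows)  # leave the input dataset untouched
    log = []
    for index, row in enumerate(corrupted):
        if rng.random() >= error_rate:
            continue  # this row stays clean
        kind = rng.choice(["missing_salary", "negative_salary", "swapped_dates"])
        if kind == "missing_salary":
            row["salary"] = None
        elif kind == "negative_salary":
            row["salary"] = -abs(row["salary"])
        else:
            # End date now precedes the hire date: an inconsistency.
            row["hire_date"], row["end_date"] = row["end_date"], row["hire_date"]
        log.append((index, kind))
    return corrupted, log
```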
Augmenting volume: Another type of testing is “load testing”, namely exploring the application's behaviour under stress. Often this translates to large input datasets. Synthetic data can help increase the volume of data while keeping a realistic look and feel.
Anonymising: The line between anonymisation and synthetic data is faint. Anonymising a dataset can be seen as generating new data with properties similar to the original but with key characteristics hidden or obfuscated. This is very close to the definition we just gave of statistically generated synthetic data. In other words, if the statistical rules are sufficiently stringent, synthetic data can be considered a form of anonymisation.
AI generated synthetic data
Data generated by Artificial Intelligence and Machine Learning is the extreme version of statistically generated data. The difference is that the model is not pre-set by the user, but it is learned automatically by the algorithm.
The advantages of this kind of data generation are that the model can be more flexible and at the same time, if it is a good model, it should inherently respect all correlations between variables.
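To illustrate what "respecting correlations" means, the Python sketch below fits a bivariate Gaussian to two columns and samples new pairs from it. This is deliberately a much simpler stand-in for the learned generative models discussed here, not an example of them: it only captures linear correlation between two variables, whereas a learned model aims to capture richer dependencies across many columns. All function names are hypothetical.

```python
import math
import random
import statistics


def fit_bivariate_gaussian(xs, ys):
    """Estimate means, variances and covariance of two columns."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    n = len(xs)
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return mean_x, mean_y, var_x, var_y, cov_xy


def sample_bivariate_gaussian(params, n_rows, seed=0):
    """Draw correlated pairs using the Cholesky factor of the covariance."""
    mean_x, mean_y, var_x, var_y, cov_xy = params
    rng = random.Random(seed)
    l11 = math.sqrt(var_x)  # assumes the first column is not constant
    l21 = cov_xy / l11
    l22 = math.sqrt(max(0.0, var_y - l21 ** 2))
    rows = []
    for _ in range(n_rows):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((mean_x + l11 * z1, mean_y + l21 * z1 + l22 * z2))
    return rows


def correlation(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    _, _, var_x, var_y, cov_xy = fit_bivariate_gaussian(xs, ys)
    return cov_xy / math.sqrt(var_x * var_y)
```

If tenure and salary are strongly correlated in the source data, `correlation` computed on the sampled pairs should come out close to the value computed on the original pairs, which independent per-column sampling would not achieve.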
Use cases
Machine Learning: Machine learning needs large datasets to train models. In many cases this data is not available. For example, in fraud detection or other forms of anomaly detection, the events are (hopefully) so rare that there are not enough of them to train a model. In these cases, synthetic data can be used to augment the volume of the training dataset. If you want to use synthetic data to train your model, the generated data should match the original very closely, to an extent that only AI-generated data can guarantee.
Anonymisation: Again, this technique can be used as a form of anonymisation. Researchers often need access to datasets that are confidential. At the same time, to draw high-level conclusions, they usually care about the statistical properties of the dataset rather than about individuals. Removing names and other identifiers is often not enough, because a person can be identified by the exact combination of their measurements. In this case, synthetic data can help by replacing the original dataset entirely. Here too, as the purpose is in-depth analysis, AI is the only method that gives a good enough representation of the initial data while ensuring full anonymity.
MakeYourData: synthesise your data as you need
Did you find this topic interesting? Do you think that your organisation has use cases for synthetic data? At Argusa we developed the tool for you. MakeYourData (MKYD) is software that lets you implement all the use cases described above without programming knowledge. If you want to know more, please visit our website at www.argusa.ch/mkyd or contact us at info@argusa.ch. You can also follow us on LinkedIn and watch out for our blogs and webinars about synthetic data and MKYD.
Webinars - Discover MKYD with Argusa
If you'd like to know more about MKYD, join us in a series of webinars led by Team Argusa! The first one, about using MKYD for Testing and Prototyping, takes place on Tuesday June 21st. To sign up and find out more, please visit https://www.argusa.ch/post/discover-makeyourdata-with-argusa