
Synthetic data for speed, security and scale

Author: HSD Foundation

  • What if you could share data with partners, governments, and other organizations to boost innovation without breaking privacy laws?
  • Wouldn’t it be great if you could better use your company’s closely guarded customer data and maintain the highest standards of privacy and safety?
  • Imagine if you could create new revenue streams for your business by monetizing your data, without compromising personal or sensitive information.
  • That’s the promise of synthetic data, which is poised to revolutionize the way the world uses and benefits from its data.

In today’s world, data truly makes the world go ‘round. It’s fundamental to virtually everything we do. And data assumes even greater power and importance when it’s shared. Think about how much more quickly diseases could be cured, or how much waste could be reduced, or how much more efficiently ecosystems could run if data were able to be freely exchanged. Of course, such sharing isn’t possible today because we’re limited to using our own data that, for good reason, is highly protected.

What is synthetic data?

Synthetic data, simply put, is data artificially generated by an AI algorithm that has been trained on a real data set. The goal is to reproduce the statistical properties and patterns of the existing dataset by modelling its probability distribution and sampling from it. The algorithm essentially creates new data that has all the same characteristics as the original data, leading to the same answers, but, crucially, none of the original data can ever be reconstructed from either the algorithm or the synthetic data it has created. As a result, the synthetic data set has the same predictive power as the original data, but none of the privacy concerns that restrict the use of most original data sets.

Here’s an example. Imagine, as a simple exercise, that you are interested in creating synthetic data about athletes, specifically height and speed. We can represent the relationship between these two variables as a simple linear function. If you take this function and want to create synthetic data, it’s easy enough to have a machine randomly generate a set of points that conform to the equation. This is our synthetic set: same equation, different values.
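For the curious, here is a minimal sketch of that linear example. The slope, intercept, and noise values are invented purely for illustration; the point is simply that the machine keeps the learned relationship while producing entirely new values.

```python
# A minimal sketch of the linear example above (not any vendor's engine).
# The coefficients and noise level are hypothetical illustration values.
import numpy as np

rng = np.random.default_rng(42)

# Pretend this is the pattern learned from real athlete data:
# speed (m/s) ≈ slope * height (cm) + intercept, plus some spread.
slope, intercept, noise_std = 0.03, 2.5, 0.4

# Generate brand-new synthetic records that follow the same equation.
n_samples = 1_000
height = rng.normal(loc=180, scale=8, size=n_samples)                 # synthetic heights
speed = slope * height + intercept + rng.normal(0, noise_std, n_samples)

synthetic = np.column_stack([height, speed])
print(synthetic[:5])  # same relationship as the original data, different values
```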

Now imagine you are interested in height, speed, blood pressure, blood oxygen, and so on. The data is much more complicated, representing it requires more complex, non-linear relationships, and we need the power of AI to help us determine the "pattern." Using the same thinking as in our simple example, one can then use the trained AI to create data points that approximate this new, more complex pattern we have learned and thus create our synthetic data set.
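To make the multivariate idea concrete, here is a small sketch that fits a simple generative model to stand-in "real" data and then samples new records from it. The Gaussian mixture model and the placeholder measurements are assumptions for illustration only; real synthetic-data engines typically use far more powerful models.

```python
# A minimal sketch of the multivariate case: learn the joint "pattern"
# of (stand-in) real data, then sample entirely new synthetic records.
# GaussianMixture is an illustrative stand-in for a real engine's model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for the real dataset: height, speed, blood pressure, blood oxygen.
real = np.column_stack([
    rng.normal(180, 8, 2_000),    # height (cm)
    rng.normal(7.5, 0.6, 2_000),  # speed (m/s)
    rng.normal(120, 10, 2_000),   # systolic blood pressure (mmHg)
    rng.normal(97, 1.5, 2_000),   # blood oxygen (%)
])

# Learn the joint probability distribution of the real data...
model = GaussianMixture(n_components=5, random_state=0).fit(real)

# ...then sample entirely new records that follow that learned pattern.
synthetic, _ = model.sample(2_000)
print(synthetic[:3])
```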

Synthetic data is a boon for researchers. One example is what the National Institutes of Health (NIH) in the U.S. is doing with Syntegra, an IT services start-up. Syntegra is using its synthetic data engine to generate and validate a non-identifiable replica of the NIH’s database of COVID-19 patient records comprising more than 2.7 million screened individuals and more than 413,000 COVID-19-positive patients. The synthetic data set, which precisely duplicates the original data set’s statistical properties but with no links to the original information, can be shared and used by researchers across the globe to learn more about the disease and accelerate progress in treatments and vaccines.

While the pandemic has illustrated potential health research-oriented use cases for synthetic data, we see potential for the technology across a range of other industries. For instance, in financial services, where restrictions around data usage and customer privacy are particularly limiting, companies are starting to use synthetic data to help them identify and eliminate bias in how they treat customers—without contravening data privacy regulations. Retailers are beginning to recognize how they could create new revenue streams by selling synthetic copies of their customers’ purchasing behavior that companies such as consumer goods manufacturers would find extremely valuable—all while keeping their customers’ personal details safely locked up.

The value for business: Security, speed, and scale

While the use of synthetic data today is still nascent, it’s poised for massive growth in the coming years because it offers companies security, speed, and scale when working with data and AI.

Synthetic data’s most obvious benefit is in eliminating the risk of exposing critical data and compromising the privacy and security of companies and customers. Techniques such as encryption, anonymization, and advanced privacy-preserving methods (for example, homomorphic encryption or secure multiparty computation) focus on protecting the original data and any information in it that could be traced back to an individual. So long as the original data is in play, there’s always a risk of compromising or exposing it in some way. Synthetic data doesn’t disguise or modify the original data; it replaces it.

This is one of the main points of the COVID-19 example noted earlier and, indeed, is a big selling point for the healthcare industry at large. Imagine if we had pooled all the data we collectively have about everybody who’s contracted the disease around the world since the beginning, and we were sharing it with whoever wanted to use it. We likely would have been better off but, legally, there’s no chance of that happening. The NIH’s initiative demonstrates how synthetic data can hurdle the privacy barrier.

Another big challenge companies face is getting access to their data quickly so they can start generating value from it. Synthetic data eliminates the roadblocks of privacy and security protocols that often make it difficult and time-consuming to get and use data.

Consider the experience of one financial institution. The enterprise had a cache of rich and valuable data that could help decision makers solve a variety of business problems. And yet, the data was so highly protected and controlled that getting access to it was an arduous process—even if the data would never leave the company. In one case, it took six months to get even a small amount of data, which the analysis team used very quickly. Another six months followed just to get an update. To get around this access obstacle, the company created synthetic data from its original data. Now the team can continuously update and model the data and generate ongoing powerful insights into how to improve business performance.

Furthermore, with synthetic data, a company can quickly train ML models on large datasets, which means faster training, testing, and deployment of an AI solution. This addresses a real challenge many companies face: a lack of enough data to train a model. Access to a large set of synthetic data gives ML engineers and data scientists more confidence in the results they’re getting at the different stages of model development, which means getting to market more quickly with new products and services and, ultimately, delivering more value faster.

Scale: Sharing to solve bigger problems

Scale is a by-product of security and speed. Secure and faster access to data make it possible to expand the amount of data you can analyze and, by extension, the types and numbers of problems you can solve. This is attractive to big companies, whose current modeling efforts tend to be quite narrow because they’re limited to just the data they own. Companies can, of course, purchase third-party data in its "original" form, but it’s often prohibitively expensive (and comes with the related privacy concerns). Synthetic data sets from third parties make it much easier and cheaper for companies to supplement their own data with additional data from many other sources, so they can learn more about the problem they’re trying to solve and get more accurate answers—without the worry of compromising anyone’s privacy.

Here’s an example. Every bank is obliged, both by its own standards and by regulators, to identify and stamp out fraud. And each bank is on its own quest, working independently of others and committing significant resources to the cause, because regulators require it and only the bank itself is allowed to comb through its data to look for suspicious activity. If banks used synthetic data, they could share information about their investigations and analyses. By pooling their synthetic data sets with industry peers, they could get a holistic picture of all the people interacting with banks in a particular country, not just with their own institution, which would help streamline and speed up the detection process and, ultimately, eliminate more fraud using fewer resources.

Why isn’t everybody using it?

The benefits of synthetic data are compelling and significant. But realizing them requires more than just plugging in an AI tool to analyze your data sets. Generating synthetic data properly requires people with truly advanced knowledge of AI and specialized skills, as well as very specific, sophisticated frameworks that enable a company to validate that it created what it set out to create.

This is a critical point. The team working on the effort must be able to demonstrate to the business (or to regulators or customers, if necessary) that the artificial data they created truly represents the original data, yet cannot be related to, or expose, the original data set in any way. That’s really hard to do. If the synthetic data doesn’t match, important patterns from the original will be missing, which means subsequent modeling efforts might overlook potentially big opportunities or, worse, generate inaccurate insights.
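As a rough illustration of what demonstrating that match can involve at its simplest, the sketch below compares each column’s distribution and the overall correlation structure of a real and a synthetic dataset. The column names are hypothetical, and production validation frameworks go considerably further than this.

```python
# A minimal fidelity check: compare per-column distributions
# (Kolmogorov-Smirnov test) and correlation structure between
# the real and synthetic datasets. Names are hypothetical.
import numpy as np
from scipy.stats import ks_2samp

def fidelity_report(real: np.ndarray, synthetic: np.ndarray, names):
    # Per-column distribution comparison.
    for i, name in enumerate(names):
        stat, p_value = ks_2samp(real[:, i], synthetic[:, i])
        print(f"{name:>15}: KS statistic={stat:.3f}, p-value={p_value:.3f}")
    # Worst-case difference between the two correlation matrices.
    corr_gap = np.abs(np.corrcoef(real, rowvar=False)
                      - np.corrcoef(synthetic, rowvar=False)).max()
    print(f"Largest correlation difference: {corr_gap:.3f}")

# Example usage with the arrays from the earlier sketch:
# fidelity_report(real, synthetic,
#                 ["height", "speed", "blood_pressure", "blood_oxygen"])
```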

There’s also the challenge of bias, which can easily creep into AI models that have been trained on human-created datasets that contain inherent, historical biases. If a company creates a synthetic data set that simply copies the original, the new data will have all the same biases. Therefore, you have to make complex adjustments to the AI models so they can account for bias and create a fairer and more representative synthetic data set. And that’s not easy, but it’s possible.

Synthetic data can also be used to generate datasets that conform to a pre-agreed definition of fairness. Using that fairness metric as a constraint in the generating model’s optimization, the new dataset will not only accurately reflect the original one, it will also do so in a way that meets that specific definition of fairness. As a result, this new, fairer dataset can be used to train a model without the need for downstream bias mitigation strategies, such as algorithmic fairness interventions, which can lead to accuracy trade-offs. Mostly.AI, for example, has demonstrated the effectiveness of this approach on the well-known COMPAS recidivism dataset that fueled racially discriminatory algorithmic outcomes. Mostly.AI’s approach reduced the gap between high COMPAS scores for African Americans (59%) and Caucasians (35%) to just 1%, with "minimal compromises to predictive accuracy."
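To make the idea of a "pre-agreed definition of fairness" concrete, here is a small sketch of one such metric: a demographic-parity-style gap between two groups. The column names and the toy data are assumptions for illustration (the group rates echo the 59% and 35% figures cited above), and the snippet only measures the gap; it does not show how a particular generator would enforce it as a training constraint.

```python
# A minimal sketch of one fairness metric: the absolute difference in
# positive-outcome rates between two groups (a demographic-parity gap).
# Columns and values are hypothetical illustration data.
import numpy as np

def parity_gap(outcome: np.ndarray, group: np.ndarray) -> float:
    """Absolute difference in mean outcome between group 0 and group 1."""
    return abs(outcome[group == 0].mean() - outcome[group == 1].mean())

# Toy dataset: binary outcome plus a binary protected attribute, with the
# biased rates reported for the original COMPAS scores (59% vs. 35%).
rng = np.random.default_rng(1)
group = rng.integers(0, 2, 10_000)
outcome = rng.binomial(1, np.where(group == 0, 0.59, 0.35))

print(f"Parity gap in the biased data: {parity_gap(outcome, group):.2f}")
# A fairness-constrained generator would aim to produce synthetic data
# where this gap is near zero while preserving overall predictive signal.
```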

Beyond ensuring the actual mechanics of creating synthetic data are sound, most companies also need to get past common cultural resistance to the concept. "It won’t work in our company." "I don’t trust it—it doesn’t sound secure." "The regulators will never go for it." We faced this at a North American financial services firm we worked with. When we initially broached the topic with some of the company’s executives, we had to do a lot of work educating them—as well as the risk and legal teams—on how synthetic data works. But now that they’ve gotten their heads around it, there’s no stopping them.
