Synthetic data is information that has been artificially generated, as opposed to “normal” data produced by real-world events.
Let’s assume that you are one of the 2.9 billion people on Earth with a Facebook profile. That’s “normal” data—it’s real and probably includes some of your personal information. Now let’s say that you decided to make a Facebook profile for your alter ego, who is an entirely different person with a made-up birthday, hometown, and backstory. That’s synthetic data.
Just because data is synthetic doesn’t make it useless, though. In theory, vast amounts of synthetic training data can be generated and fed to deep learning models. Datasets can be fully or partially synthetic, and the possibilities are nearly endless.
For example, synthetic datasets can be used to test systems by generating a large pool of “fake” user profiles (like your alter ego’s Facebook profile) to run through a predictive solution for validation. While this data is artificial, it reflects real-world data, and research has shown that it can be just as good—and in some cases better—than real-world data.
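To make this concrete, here is a minimal sketch of generating a pool of fake user profiles with Python’s standard library. All of the field names and value pools are hypothetical, chosen purely for illustration:

```python
import random
from datetime import date, timedelta

# Hypothetical value pools for illustration only.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Casey"]
CITIES = ["Springfield", "Riverton", "Lakeside", "Hillview"]

def fake_profile(rng: random.Random) -> dict:
    """Generate one synthetic user profile. No real person is behind it."""
    birthday = date(1970, 1, 1) + timedelta(days=rng.randrange(40 * 365))
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice('ABCDE')}.",
        "hometown": rng.choice(CITIES),
        "birthday": birthday.isoformat(),
        "friend_count": rng.randrange(0, 2000),
    }

rng = random.Random(42)  # seeded for reproducible test runs
profiles = [fake_profile(rng) for _ in range(1000)]
```

A validation pipeline could then run these 1,000 profiles through a predictive system exactly as it would real accounts.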
According to a Gartner study, 60% of the data used in artificial intelligence (AI) and machine learning (ML) development will be synthetically generated by 2024. By 2030, the same report predicts, most data will be synthetic. “The fact is you won’t be able to build high-quality, high-value AI models without synthetic data,” the report says.
ML teams require easy access to large datasets to train their machine learning models. The more accurate and diverse training data that is available, the more reliable and accurate a model’s predictions will be.
If ML teams were to rely solely on real-world data, however, they would be limited by whatever data happens to be available. For simple ML models this might not be a problem, but a lack of available training data, smaller sample sizes, and limited diversity can become especially problematic for more niche and specialized ML models.
This is exactly why synthetic data is key. Since it has been generated entirely by machines, synthetic datasets can be completely customized and built to be as diverse as is needed for a given application. This leads to ML models that are far better at making their predictions than they otherwise would have been if they were trained solely on real-world data.
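One simple form of that customization is class balance: a synthetic generator can emit exactly as many examples of each class as the model needs, regardless of how rare a class is in the wild. The sketch below assumes a hypothetical animal-classification task with made-up per-class weight distributions:

```python
import random

# Hypothetical per-class feature distributions (mean weight in kg).
CLASS_MEANS = {"cat": 4.0, "dog": 20.0, "bird": 0.5, "lizard": 1.0}

def balanced_dataset(n_per_class: int, rng: random.Random) -> list:
    """Emit exactly n_per_class samples for every class, regardless of
    how rare that class is in real-world data collection."""
    data = []
    for label, mean in CLASS_MEANS.items():
        for _ in range(n_per_class):
            # Sample a plausible weight around the class mean, floored at 50 g.
            weight = max(0.05, rng.gauss(mean, mean * 0.2))
            data.append({"label": label, "weight_kg": round(weight, 2)})
    rng.shuffle(data)
    return data

rng = random.Random(3)
dataset = balanced_dataset(250, rng)
```

Real-world collection would almost never yield a perfectly even split like this; with synthetic generation it is a one-line design decision.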
Synthetic data also has various benefits in the context of deep learning. Once a generation pipeline is in place, for example, it is usually fast and cheap to produce as much data as an application needs. That said, some datasets, such as photorealistic video, may require more processing power and thus be more expensive to produce, in terms of both time and money.
In addition to providing a virtually unlimited supply of training data, properly created synthetic data can alleviate the privacy and regulatory concerns that come with working with real-world datasets. This is because synthetic records aren’t (or shouldn’t be) connected to any recognizable real-world individual.
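The privacy idea can be illustrated with a toy example: instead of copying real records, fit a simple distribution to them and sample fresh values from it. The data here is invented for illustration, and a real pipeline would use far more sophisticated (and formally audited) generators:

```python
import random
import statistics

# Toy "real" dataset: ages of actual users (the sensitive source data).
real_ages = [23, 31, 27, 45, 38, 29, 52, 34, 41, 26]

# Fit a simple distribution to the real data instead of copying rows.
mu = statistics.mean(real_ages)       # 34.6
sigma = statistics.stdev(real_ages)

rng = random.Random(0)
# Sample synthetic ages from the fitted distribution, floored at 18.
synthetic_ages = [max(18, round(rng.gauss(mu, sigma))) for _ in range(1000)]
# No synthetic row corresponds to any particular real individual,
# yet the aggregate shape of the data is preserved.
```

This is only the intuition; naive approaches like this can still leak information about outliers, which is why production systems add stronger guarantees.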
While synthetic data can offer many advantages, there are also some challenges.
The main challenge is one of quality: the quality of synthetic data can vary greatly. Synthetic data is usually created by generative algorithms that learn from input data, meaning that the quality of the output depends heavily on the quality of the input. If the input data is biased, the output will likely be skewed, too.
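A tiny sketch shows how faithfully a generator can reproduce input bias. Here a trivial "generator" just learns category frequencies from a skewed input and samples from them; the group names are hypothetical:

```python
import random
from collections import Counter

def fit_categorical(samples: list):
    """'Train' a trivial generator: estimate category frequencies."""
    counts = Counter(samples)
    total = sum(counts.values())
    cats = list(counts)
    weights = [counts[c] / total for c in cats]
    return cats, weights

# Biased input: 90% of records come from one group.
biased_input = ["group_a"] * 900 + ["group_b"] * 100

cats, weights = fit_categorical(biased_input)
rng = random.Random(1)
synthetic = rng.choices(cats, weights=weights, k=10_000)

# The synthetic output reproduces the input skew (roughly a 90/10 split).
share_a = synthetic.count("group_a") / len(synthetic)
```

Generating ten times more data did nothing to fix the imbalance; the bias was baked into what the generator learned.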
The second challenge concerns realism. Synthetic data needs to be sufficiently realistic so that it appears as natural, real-world content.
Let’s say that you wanted to build a model that could detect car crashes. There is only a limited amount of input data available, and it’s incredibly difficult, time-consuming, and expensive to go out into the real world and simulate car crashes. Even then, your own simulated crashes are unlikely to come close to the real thing, meaning that you can only achieve a certain level of realism.
There are some other challenges that ML teams should be aware of, too.
On top of the (sometimes significant) time and effort that synthetic data generation requires, the output may need quality control, and the generated data might not cover the outlier situations that real-world data would include. This can lead to inaccuracies in a model’s predictions.
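The outlier gap is easy to demonstrate. In this sketch, a naive generator is fitted only to a sensor’s typical operating range (the readings and the range are invented for illustration), so it never produces the rare fault case a model would need to learn:

```python
import random

# Toy real-world sensor readings, including one rare fault spike (an outlier).
real_readings = [20.1, 19.8, 20.4, 20.0, 19.9, 98.7]  # 98.7 = rare fault

# A naive synthetic generator fitted only to the typical range:
lo, hi = 19.5, 21.0  # hypothetical "normal operating band"
rng = random.Random(7)
synthetic = [rng.uniform(lo, hi) for _ in range(10_000)]

# Even 10,000 synthetic samples never cover the fault case.
covers_fault = any(x > 90 for x in synthetic)  # False
```

A model trained on this synthetic set would never see a fault, so deliberately injecting rare or extreme cases is a standard part of synthetic data design.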
One way to alleviate these challenges is to use a video game engine such as Unity or Unreal Engine.
Unity is the more straightforward of the two. It is a cross-platform engine that enables users to build real-time 3D projects for industries spanning games, animation, automotive, architecture, and more. Unity even publishes its own guide to generating synthetic data on the platform, which uses the Unity Perception package.
The Unity Perception package enables a new workflow in Unity for generating synthetic datasets and supports both Universal and High Definition Render Pipelines.
The first release of Unity Perception includes tools for dataset capture built around four primary features: object labeling, labelers, image capture, and custom metrics. It also provides a simple interface for defining object-label associations, which are picked up automatically and fed to the labelers. A labeler uses this object information to generate ground-truth data such as 2D bounding boxes or semantic segmentation masks.
According to Unity, recent research from Google Cloud AI used 64 grocery products regularly available in stores—cereal boxes, paper towels, and so on—to demonstrate the efficacy of object detection models trained entirely on synthetic data. To test this research, Unity chose an equal number of products close to the originals in size, shape, and texture diversity. From these, they created a library of 3D assets using digital content creation tools, scanned labels, and photogrammetry.
Ultimately, the project was a success and demonstrates the potential for using Unity to generate 3D assets for the training of machine learning models. The potential use cases here are endless, with applications in everything from fashion and design to agriculture and farming.
Unity’s 3D model and scene creation tools would be a perfect fit for our earlier example of re-creating car crash scenes. Although it may take a while for ML teams to build out these virtual models and environments and then use them together to generate realistic crashes, the ultimate reward is unlimited training data, which is something that you can’t put a price on.
Today, data is everything. While real-world data is constantly being generated, it simply isn’t enough to serve some of the more high-level developments and breakthroughs that are occurring with each passing day. This is where synthetic data comes in.
Synthetic data generation provides a cost-effective and efficient solution for ML teams that need to get their hands on more data but cannot collect it from the real world. Even where data can be collected from the real world, synthetic data is in many cases a much better option than collecting real-world data because it can be fine-tuned to meet the demands of a particular ML project.