How data augmentation can improve ML model accuracy
The performance of a machine learning model, particularly deep learning models, depends on both the quality and quantity of available training data.
However, an insufficient amount of training data is one of the most common challenges ML teams face, particularly in the enterprise, and it can lead to all sorts of problems, such as overfitting.
Overfitting in machine learning
When machine learning models are trained on limited examples, they tend to “overfit.” Overfitting happens when a model performs accurately on its training data but fails to perform when faced with new, unseen data.
There are many ways to avoid overfitting in machine learning, such as choosing different algorithms, modifying a model’s architecture, and adjusting hyperparameters. Ultimately, however, the main way to avoid overfitting is by adding more high-quality data to the training dataset.
For example, consider the computer vision system in a production self-driving car. Without a large and diverse training dataset, the vehicle could misclassify objects in the real world. On the other hand, if a self-driving car is trained on images of objects from different angles and under different lighting conditions, it will become much better at identifying them in the real world.
However, gathering extra training data can be expensive and time-consuming. In certain situations, it can even be impossible. This challenge becomes more difficult in supervised learning, where training examples must be labeled by human experts.
To overcome this challenge, businesses can use data augmentation to reduce their reliance on collecting and preparing training data, enabling ML teams to build more accurate machine learning models more quickly while reducing the risk of overfitting.
What is data augmentation?
Data augmentation is a method for artificially increasing the amount of available data, either by making slightly modified copies of existing data or by creating new synthetic data from it. This can be as simple as making small changes to existing examples, or as involved as using deep learning models to generate completely artificial data points.
Datasets that are created through data augmentation are useful because they can improve the predictive accuracy and general performance of machine learning models by reducing the risk of overfitting, where a model learns the noise and quirks of its training set rather than the underlying patterns.
Data augmentation is most commonly used in machine learning models that involve text or image classification, because these are areas where collecting new data can be difficult. In the case of images as input data, for example, augmentation can involve:
- Affine transformations of the image data (e.g., rotation, rescaling, translation)
- Elastic distortions
- Adding noise
- Cropping portions of the image
- Convolutional filters
- Darkening and brightening
- Color modification
- Changing contrast
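Several of the transformations above can be sketched with plain NumPy arrays standing in for images. This is an illustrative sketch only; real pipelines typically use libraries such as torchvision or Albumentations, and the `augment` helper below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, rng):
    """Return a list of augmented copies of an H x W x 3 image array."""
    h, w, _ = image.shape
    out = []
    out.append(np.fliplr(image))                        # horizontal flip
    out.append(np.rot90(image))                         # 90-degree rotation
    noisy = image + rng.normal(0, 10, image.shape)      # additive Gaussian noise
    out.append(np.clip(noisy, 0, 255).astype(image.dtype))
    out.append(np.clip(image * 1.2, 0, 255).astype(image.dtype))  # brightening
    top, left = h // 4, w // 4                          # central crop
    out.append(image[top:h - top, left:w - left])
    return out

image = rng.integers(0, 256, (32, 32, 3), dtype=np.uint8)  # a stand-in "photo"
copies = augment(image, rng)
print(len(copies))  # 5 augmented variants from a single source image
```

Each call turns one labeled example into several, which is exactly how augmentation multiplies a small dataset without new collection effort.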
Machine learning applications, particularly in the domain of deep learning, continue to grow and diversify. Data augmentation is important because it improves the predictive ability and outcomes of machine learning models by forming new and different examples with which to train them. Data augmentation techniques can also make machine learning models more robust by exposing them to variations they may see in the real world.
Data augmentation vs synthetic data
Although the two are very similar, there are some key differences between data augmentation and synthetic data. They are both artificial but to different extents.
- Synthetic data: This data is partly or completely artificial, often produced by generative models such as generative adversarial networks (GANs).
- Augmented data: This data is derived from real data, usually images, with some form of minor and realistic transformation introduced to increase the diversity of the training dataset.
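The distinction can be made concrete with a small tabular example. Below, a simple Gaussian fit stands in for a generative model (GANs play this role for complex data like images); the numbers and noise scale are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
real = rng.normal(loc=5.0, scale=2.0, size=(200, 3))   # "real" tabular data

# Augmented: real rows with a minor, realistic perturbation applied.
augmented = real + rng.normal(0, 0.1, real.shape)

# Synthetic: entirely new rows drawn from a model fitted to the real data.
mu, sigma = real.mean(axis=0), real.std(axis=0)
synthetic = rng.normal(mu, sigma, size=(200, 3))
```

Every augmented row stays close to a specific real row, while each synthetic row is new and corresponds to no real example at all.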
How can data augmentation help?
The main benefit of data augmentation is reducing the potential for overfitting. For instance, a classification model that has been trained on three images will be limited to recognizing and classifying those exact images. Even adding slight variations of the data will improve its ability to generalize. As such, there are two scenarios when data augmentation is particularly beneficial:
When you’ve got a small dataset
Most machine learning projects start with smaller datasets. Given that the accuracy of your model’s predictions lives and dies on the amount of high-quality data that it is trained with, this is less than ideal. However, data augmentation can help ML teams expand their datasets and ensure that they are providing their model with enough training data to develop a reliable AI application.
When you can’t control the input data
ML training datasets tend to be in pretty good shape. However, anyone with even the slightest bit of experience in data science will tell you that real-world data isn’t. So, what happens when you cannot control the input data that is being fed into an algorithm? Without data augmentation, your model’s predictive accuracy will suffer because it was never trained on the less-than-ideal scenarios it will encounter in the real world.
What are the benefits of data augmentation?
Some of the benefits of data augmentation in machine learning include:
- Adding more training data
- Mitigating data scarcity
- Reducing overfitting
- Resolving class imbalance
- Improving the predictive accuracy of models
- Reducing the cost of collecting and processing data
- Protecting data privacy
- Enabling rare event prediction
- Training models on scenarios that are difficult to simulate
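One of the benefits above, resolving class imbalance, can be sketched for tabular data: jittered copies of minority-class examples are added until every class matches the majority count. The `balance_by_augmentation` helper and its `noise_scale` are illustrative choices, not a standard API:

```python
import numpy as np

rng = np.random.default_rng(42)

def balance_by_augmentation(X, y, rng, noise_scale=0.05):
    """Oversample minority classes with jittered copies of their rows
    until every class matches the majority class count."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        deficit = target - count
        if deficit == 0:
            continue
        pool = X[y == cls]
        picks = pool[rng.integers(0, len(pool), deficit)]   # sample with replacement
        X_parts.append(picks + rng.normal(0, noise_scale, picks.shape))
        y_parts.append(np.full(deficit, cls))
    return np.concatenate(X_parts), np.concatenate(y_parts)

X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)          # 9:1 class imbalance
X_bal, y_bal = balance_by_augmentation(X, y, rng)
print(np.bincount(y_bal))  # [90 90]
```

Techniques such as SMOTE follow the same idea but interpolate between neighboring minority examples rather than adding noise.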
While there are plenty of benefits, it is important to be aware of the challenges of data augmentation. First, businesses that go down the data augmentation route must build systems for evaluating the quality of augmented datasets; as data augmentation methods become more widely used, assessing the quality of their output becomes essential. Bias is another big challenge. Simply put, if a dataset contains biases, the data augmented from it will inherit them. Putting together a well-designed data augmentation strategy is therefore important to prevent this from happening.
Data augmentation limitations
Data augmentation isn’t magic; it can’t solve all your data problems. Think of it as a performance enhancer for your models. Depending on your target application, you still need a training dataset with enough examples before augmentation.
In some applications, training data might be too limited for data augmentation to work. In these scenarios, businesses must simply collect more data until they reach a minimum threshold for using data augmentation. Occasionally, it might be possible to use transfer learning, where a model is trained on a large general dataset (e.g., ImageNet) and then repurposed by fine-tuning it on the limited data you have available for your target application.
In addition, data augmentation does nothing to address other challenges, such as bias in the underlying data, and the augmentation process itself must be adjusted to address other potential issues, such as class imbalance.
When used wisely, though, data augmentation can be a very powerful tool.
How to choose augmentations for your task
There are three things that you need to create a robust data augmentation pipeline: domain expertise, a business need, and a good dose of common sense.
Depending on your project’s domain, some data augmentations will make sense and many will not. If you are working with satellite images, for example, good choices would be cropping, rotations, and scaling, because they do not distort objects. If you are working with medical images, better choices might be color transformations and grid distortion. To identify the best augmentations, domain expertise can be very helpful.
By business need, we mean creating augmentations that are aligned with your ML project’s goals. Returning to the example of self-driving cars, if you were developing a computer vision system for one, would it make sense to use horizontal flips? There is no right or wrong answer here; it all depends on what your computer vision system would be expected to see.
It is possible to create too many augmentations. While more training data is generally better, you should stop transforming images before they become unrecognizable. If you as a human cannot tell what is in an image, how can you expect the model to? If you’re not sure whether a particular data augmentation is a good idea, train several models using different augmentation pipelines and compare their accuracy.
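That last suggestion can be sketched end to end: train a simple model with each candidate pipeline and compare held-out accuracy. Everything below is an illustrative stand-in, with a toy nearest-centroid classifier in place of a real model and 2D points in place of images, and both pipelines are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two candidate augmentation pipelines for a toy 2D point-classification task.
def pipeline_a(X, rng):
    return X + rng.normal(0, 0.3, X.shape)            # jitter only

def pipeline_b(X, rng):
    return X * rng.uniform(0.5, 1.5, (len(X), 1))     # random rescaling

def accuracy_with(pipeline, X_tr, y_tr, X_te, y_te, rng):
    """Train a nearest-centroid classifier on augmented copies of the
    training set and report held-out accuracy."""
    X_aug = np.concatenate([X_tr] + [pipeline(X_tr, rng) for _ in range(3)])
    y_aug = np.concatenate([y_tr] * 4)
    centroids = np.stack([X_aug[y_aug == c].mean(axis=0) for c in (0, 1)])
    dists = ((X_te[:, None, :] - centroids[None]) ** 2).sum(axis=-1)
    return float((dists.argmin(axis=1) == y_te).mean())

# Two well-separated Gaussian blobs as a stand-in dataset.
X0 = rng.normal([-2, -2], 1.0, (50, 2))
X1 = rng.normal([2, 2], 1.0, (50, 2))
X, y = np.concatenate([X0, X1]), np.array([0] * 50 + [1] * 50)
X_tr, X_te, y_tr, y_te = X[::2], X[1::2], y[::2], y[1::2]

for name, p in [("pipeline_a", pipeline_a), ("pipeline_b", pipeline_b)]:
    print(name, accuracy_with(p, X_tr, y_tr, X_te, y_te, rng))
```

On real tasks the same comparison loop applies unchanged: hold the model and data split fixed, vary only the augmentation pipeline, and let validation accuracy pick the winner.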