Building a machine learning model is hard. Deploying it is even harder. Running experiments, writing code for production, integrating with infrastructure, running ongoing tests, and avoiding training-serving skew – none of this is easy.
While things are getting better as more open-source libraries and powerful tools (e.g., Vertex AI, Qwak.com, and AzureML) hit the market, post-deployment challenges are still a real pain for organizations that are looking to implement machine learning into their operations.
Let’s look at some recent statistics.
Organizations reported that they invested $28.5 billion into the development of machine learning models and applications in 2019. Despite this huge investment, only 35% of organizations say that they have analytical models deployed in production.
While this may look surprising, it begins to make sense when you consider the many challenges organizations face in successfully deploying their models: increased model complexity over time hampering maintainability, bias, schema changes, skewed data, and more. A lot can go wrong.
In this article, we are going to look at instances of training-serving skew, one of the most common problems when it comes to deploying machine learning models.
First, though, it makes sense to look at and break down the typical machine learning workflow. This is because training-serving skew is caused by the ML training process.
When training a machine learning model, we tend to follow the same five steps: data acquisition, data cleansing, feature generation, model training, and deployment.
Data acquisition is the collection of data from relevant sources before it is stored, cleaned, and processed for use in model training. This is the first and most crucial ML training step because the quality and quantity of data that is acquired will directly impact the model and how good it is at making predictions.
The terms “Data is the New Oil” and “Garbage in, Garbage Out” are not without substance; they emphasize that a clean data input is essential for an ML model’s successful development and ongoing performance.
Once data has been acquired, it must be cleansed and prepared. This is another critically important step that identifies and corrects errors – for example, empty columns and duplicated rows – in the dataset that may negatively impact the predictive model. These steps are so important, in fact, that data acquisition and data cleansing can account for as much as 80% of the total time taken to develop a model and get it to production.
Typical steps involved in the data cleansing process include removing duplicated rows, handling incomplete or missing values, correcting incorrectly formatted entries, and filtering out irrelevant or incompatible records.
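As a minimal sketch of a few of these cleansing steps using pandas (the dataset and column names here are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical raw dataset with common quality issues:
# a duplicated row, a missing value, and inconsistent formatting.
raw = pd.DataFrame({
    "age": [34, 34, np.nan, 51],
    "country": ["US", "US", "de", "DE"],
})

# Drop exact duplicate rows.
clean = raw.drop_duplicates()

# Fill missing numeric values with the column median.
clean["age"] = clean["age"].fillna(clean["age"].median())

# Normalize inconsistent categorical formatting.
clean["country"] = clean["country"].str.upper()
```

Real pipelines involve many more checks (schema validation, outlier handling, and so on), but the pattern is the same: each step removes a class of discrepancy before the data reaches training.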
Feature generation focuses on taking the raw, unstructured data that has been collected, cleaned, and pre-processed and turning it into useful representations by defining features (for example, variables) for use in statistical analysis. The process of feature generation adds new information that the training model can access during the training process which should, in theory, lead to a more accurate predictive model.
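A toy sketch of feature generation with NumPy (the feature names and input data are invented for illustration): raw per-user records are condensed into a fixed-length numeric vector the model can consume.

```python
import numpy as np

def make_features(amounts, timestamps):
    """Derive simple numeric features from raw transaction data."""
    amounts = np.asarray(amounts, dtype=float)
    timestamps = np.asarray(timestamps, dtype=float)
    return np.array([
        amounts.mean(),                       # average spend
        amounts.max(),                        # largest single transaction
        len(amounts),                         # transaction count
        timestamps.max() - timestamps.min(),  # activity time span
    ])

features = make_features([10.0, 25.0, 5.0], [100.0, 160.0, 400.0])
```

Each derived value adds information the raw records only carry implicitly, which is exactly what makes this code worth reusing verbatim at serving time, as we will see later.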
The purpose of model training is to build the best mathematical representation for the relationship that exists between dataset features and, in the case of supervised learning, a target label, or alternatively among the features themselves in the case of unsupervised learning. Model training is a key step that, if done correctly with a high-quality dataset, leads to a model that’s ready for validation, testing, deployment, and ongoing training.
Deployment is the final and arguably most difficult step in machine learning: integrating the machine learning model into a production environment, where it actively serves predictions to end users.
As we mentioned earlier, many organizations experience significant hurdles with model deployment due to incompatible infrastructure and mismatches between the language a model is written in and the languages the production system supports. Organizations must also ensure that the deployed model performs as expected and is robust enough to handle unexpected inputs in the data distribution.
It is the first two steps – data acquisition and data cleansing – that are the most crucial in building a model that can make accurate predictions and practical decisions. This requires high-quality datasets that are free from errors.
Unfortunately, we don’t live in a perfect world and real-world data can be messy. Very messy. And if left untreated, messy datasets can lead to problems in machine learning model production such as bias, degraded predictive ability, and an all-around useless machine learning tool that cannot perform as intended.
In short – poor data quality is any machine learning model’s number one enemy and developers go to great lengths to ensure data quality.
The primary way this is ensured is through data cleansing. This is a time-consuming task that combs through datasets to ensure that they have been thoroughly cleaned and processed to remove data that is incomplete, incorrect, incompatible, duplicated, irrelevant, or incorrectly formatted. This helps to remove discrepancies that can harm machine learning models in the training and production stages.
Even with the most thorough data cleansing and seemingly perfect datasets, there is still potential for a little-talked-about challenge in machine learning that can cause major discrepancies in performance between the model training and model deployment stages: training-serving skew.
“Past performance is no guarantee of future results.”
This isn’t just small print in the Terms & Conditions of most financial products that most of us choose to ignore; it is an adage that holds true in the world of ML model development.
ML models in production can experience reduced performance over time not only due to being fed bad data and poor programming but also due to datasets and profiles that are constantly evolving.
This is a concept often referred to as model decay or drift. It is a natural occurrence in ML models and the speed of decay can vary greatly. In some models, it can take years. In others, it can happen over the course of a few days.
One of the biggest post-production problems that can lead to an expedited rate of decay is training-serving skew, a problem that can arise quite easily and be difficult to detect.
Training-serving skew is a difference in an ML model’s behavior between training and serving (deployment). It is essentially a discrepancy between the feature engineering code a model uses during training and the code it uses during serving.
Training-serving skew can be caused by, among other things, feature engineering code that is duplicated or re-implemented between the training and serving pipelines, bugs introduced while refactoring the serving stack, and mismatches in the computational resources available at training and serving time.
It is very easy for training-serving skew to crop up. Let’s consider an example.
Imagine that a machine learning pipeline trains a new ML model every day. During routine feature engineering, an engineer carries out some refactoring of the model’s serving stack and accidentally introduces a bug that pins a specific feature to -1.
This is a pretty big error caused by a relatively small mistake on the engineer’s part, but because the ML model is robust to data changes, it doesn’t output any error. The model continues to generate predictions with lower accuracy, all the while the engineer is unaware of the error. The serving data then gets re-used for training the next ML model and the problem persists in a cycle, getting progressively worse until it is finally discovered.
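A stripped-down sketch of how such a bug can slip through (all names and numbers here are invented): the serving-side vector is still well-formed, so nothing crashes and no error surfaces.

```python
def featurize_training(record):
    """Feature code used by the training pipeline."""
    return [record["age"] / 100.0, record["income"] / 1e5]

def featurize_serving(record):
    """Serving-stack copy after the refactor: the second feature is
    accidentally pinned to -1 instead of record["income"] / 1e5."""
    return [record["age"] / 100.0, -1.0]

record = {"age": 40, "income": 50_000}
train_vec = featurize_training(record)  # well-formed vector
serve_vec = featurize_serving(record)   # also well-formed, but wrong
```

Because both vectors have the right shape and plausible values, the model happily keeps predicting; only its accuracy quietly degrades.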
As this scenario shows, training-serving skew can quite easily crop up via a bug in your model’s code and cause serious repercussions further down the line, potentially stopping your model from working entirely.
Training-serving skew impacts machine learning models in many of the same ways that regular skewed or bad data would – by reducing the model’s performance over time as it gradually decays.
While training-serving skew and data drift appear to be the same because they lead to the same result – model decay and degradation – they are actually not the same things. Although the way that we analyze them is similar, the root cause of training-serving skew is not the same as data drift.
In training-serving skew, there is no “drift”. Drift assumes that something changes while the model is in production. This isn’t the case with training-serving skew, which is a mismatch rather than a change.
It is possible to avoid training-serving skew by following best practices.
Ideally, engineers should be re-using the same feature engineering code to ensure that any given raw data input maps to the same feature vector during training and deployment (serving). If this does not happen, then a training-serving skew exists which opens the model to potential degradation.
One of the most common reasons for this skew is a mismatch in computational resources at training and deployment time.
Let’s imagine for a moment that you are an engineer working on a new project. You decide to write your pipeline using Apache Spark. A few months pass and you finally have the first version of your ML model that you are ready to deploy for initial testing via a microservice.
It would not be very efficient to require your microservice to connect to Spark just to make a new prediction. As a result, you decide to re-implement your feature engineering code using NumPy, for instance, to avoid any extra infrastructure. You now have two feature engineering codebases to maintain – one in Spark and one in NumPy – and you must ensure that any given input produces the same output in both, or you will introduce training-serving skew.
So, if you cannot re-use your feature engineering training code for any reason – such as in the above scenario – it is imperative that you test for training-serving skew before deploying a new model.
You can achieve this by passing your raw data through your training and deployment pipelines then comparing the output. All raw input vectors should map to the same output feature vector. If they don’t, you have training-serving skew.
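A minimal sketch of such a parity check (the feature functions and records are stand-ins for your real training and serving pipelines):

```python
import numpy as np

def features_training(record):
    # stand-in for the training pipeline's feature code
    return np.array([record["amount"] / 100.0, float(record["is_member"])])

def features_serving(record):
    # stand-in for the serving-side reimplementation
    return np.array([record["amount"] / 100.0, float(record["is_member"])])

def check_parity(records, f_train, f_serve, atol=1e-9):
    """True if every raw record maps to the same feature vector."""
    return all(np.allclose(f_train(r), f_serve(r), atol=atol) for r in records)

records = [{"amount": 250, "is_member": True},
           {"amount": 40, "is_member": False}]

# Matching feature code passes the check...
ok = check_parity(records, features_training, features_serving)

# ...while a serving-side bug (a feature pinned to -1) is caught.
def features_serving_buggy(record):
    v = features_serving(record)
    v[1] = -1.0
    return v

buggy_ok = check_parity(records, features_training, features_serving_buggy)
```

Running a check like this over a representative sample of raw data before each deployment turns training-serving skew from a silent failure into a loud one.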
Machine learning is a fundamental technology that helps organizations deploy complex solutions that work to save time, reduce cost by creating efficient workflows, and unlock previously untapped sources of revenue.
These goals are hard to accomplish if a machine learning model isn’t performing at its best. While poor performance is often attributed to low-quality datasets, this isn’t always the case. Problems can still arise in production even when a model has been trained on seemingly perfect datasets, and one common culprit is training-serving skew.
Training-serving skew, a problem that arises due to the most minor discrepancies in feature engineering code between training and deployment (serving), can be severely damaging to a model and very difficult to detect.
It is therefore important for engineers and others involved in the ML model pipeline to employ active best practices to reduce the potential for instances of training-serving skew to crop up and damage model performance.