ML models for AB testing - Advanced model deployment patterns

By Ran Romano
October 28, 2022

Decision-making based on machine learning has become a foundational aspect of running a modern business. Machine learning plays an important role in everything from gathering customer leads and converting those leads into customers to serving them and preventing them from defecting. Modern data platforms process huge amounts of data and still achieve fast response times with high accuracy. A carefully curated sequence of steps involving continuous training, evaluation, deployment, and monitoring of models is set up behind the scenes to accomplish this.

Deploying and maintaining machine learning models is a lot more complex than maintaining the rest of the code base. Changes in performance brought by newer versions of models are very difficult to assess because of their black-box nature. A newer model version with better average accuracy may still fail on a critical input that the earlier model predicted correctly. Then there is the problem of data drift. The data that the model receives when it is deployed may be significantly different from what it was trained on. Or worse yet, the input data resembles the training data initially but diverges as time passes. Both these problems are difficult to tackle and require a considerable amount of effort.

To combat these complexities, architects use a variety of deployment strategies to systematically test different models and arrive at the best fit. This article talks about popular model deployment strategies and the factors that need to be considered while choosing one.

Understanding Model Implementation Sequence 

Taking a machine learning model to production is an elaborate process with many stages involved. At a high level, it includes the below stages.

  • Data Collection and Preparation
  • Model Architecture Shortlisting and Training
  • Testing and Evaluation
  • Deployment and Monitoring

Each of the above stages can include multiple sub-stages, depending upon the use case.

Data Collection and Preparation

A model is only as good as the data it is trained on. Hence ML practitioners spend a considerable amount of time cleansing and preparing data for training. The source of the data can be the organization’s own data repositories or publicly available datasets. For example, an eCommerce organization trying to implement a recommendation system will already have historic purchase pattern data in that domain if the business has been running for a while. If the organization is new to the game, it has to rely on rule-based models first and accumulate data before moving to better models.

Some problems can rely on publicly available data sets. For example, suppose the same eCommerce business wants to identify the category of a product automatically when a seller uploads an image. Images of such common products may be available in publicly usable form and can be readily used. In the worst case, the organization can employ its own photographers to take pictures of common products.

Once the data set source is identified, the next step is to label the data. This is a very time-consuming process. Some problems may allow using automated scripts to generate labeled data, but most of them require at least a manual review. Data cleansing is another time-consuming step. At least for the initial versions, any data element that can confuse the model algorithm is removed. The cleansing and preparation part varies a lot depending upon whether you are handling a problem with numerical input, image input, or text input. 

Model Architecture Shortlisting And Training

Finalizing the model architecture requires considerable research and analysis. Deep learning-based models are the de facto choice in problems involving complex relationships. Statistical machine learning-based models work well for problems with simpler input-output relationships and less data availability. There is no rule of thumb for selecting the model architecture. In fact, data scientists arrive only at a shortlist after the initial research phase.

Finalizing architecture is not only about reading research papers and finding a suitable one. In most cases, data scientists have to tweak the loss functions and parameters like learning rate to arrive at the final model that works for the problem at hand. This happens after the shortlist is created. Multiple model architectures are tried out in parallel with different hyperparameters to arrive at the right one.

Testing and Evaluation

ML model testing involves assessing the trained models against benchmarks to decide the final architecture. The result of this testing is a final model architecture, which can be a single model or an ensemble of multiple models. Metrics that closely reflect the business objectives are an important element of arriving at the right model. Hence, evaluation metrics are not defined at this stage, but during the initial planning stage itself. Typical metrics are different varieties of precision, recall, and accuracy, tweaked for the problem at hand.
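As a rough illustration of the metrics mentioned above, here is a minimal sketch that computes accuracy, precision, and recall for a binary classifier. The function name binary_metrics and the sample labels are made up for this example; real projects usually tweak these metrics for the problem at hand.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative data only: 8 labeled examples and one model's predictions.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 1, 0, 1, 0]
metrics = binary_metrics(labels, predictions)
print(metrics)
```

Precision matters more when false positives are costly, and recall matters more when missing a positive is costly, which is why the metric should be agreed upon during planning.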

At times, even after careful evaluation, it is difficult to arrive at a single architecture. In such cases, the only way out is to evaluate the models on real user data itself and decide on the best one. But this is easier said than done. Typically, organizations will not be able to afford to disturb a production system to evaluate models. Hence data architects go to great lengths to devise deployment architecture that facilitates A/B testing without affecting production systems. 

Deployment and Monitoring

A good deployment architecture is one that provides great response time in a cost-effective manner. It should also facilitate the evaluation of multiple models at the same time and generate enough information for the development team to assess how well each model performs. This article will talk about such advanced deployment patterns in later sections.

Maintaining the models is a never-ending process. Once the first version of the model is deployed, the team continues to work on exploring newer architectures and training the models with new data. Since there is no dearth of data by this time, data scientists can now try out even more complex architectures that they may have left out earlier. This continuous improvement process provides more model alternatives that can replace the first one.

The decision to replace a model is a complicated one for many reasons. For one, the newer model may perform better on the test data set, but there is no guarantee that it will perform the same when real user data arrives. Worse yet, it may perform better on average but fail in special cases that the previous one got right. The balance between precision and recall is another aspect that makes this decision complex. A good deployment strategy can take care of these aspects to an extent by providing reliable data to rely on when choosing between models.
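The "better on average, worse on special cases" problem can be made concrete with a small check. The sketch below is only an illustration with made-up labels and hypothetical helper names (accuracy, find_regressions): it lists the inputs that the old model got right and the new model gets wrong.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def find_regressions(
    y_true: Sequence[str], old_pred: Sequence[str], new_pred: Sequence[str]
) -> List[int]:
    """Indices where the old model was correct but the new model is wrong."""
    return [
        i
        for i, (t, o, n) in enumerate(zip(y_true, old_pred, new_pred))
        if o == t and n != t
    ]


# Illustrative data only.
truth = ["shirt", "shoe", "phone", "shirt", "watch"]
old_model = ["shirt", "shoe", "laptop", "shoe", "watch"]
new_model = ["shirt", "shoe", "phone", "shirt", "shoe"]

print(accuracy(truth, old_model), accuracy(truth, new_model))
regressed = find_regressions(truth, old_model, new_model)
print(regressed)
```

Here the new model has the higher accuracy, yet it still breaks one case the old model handled. Whether that trade is acceptable depends on how critical the regressed inputs are.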

Model Deployment Patterns

The choice of model deployment pattern depends on the constraints imposed by the problem at hand. Whether the model requires A/B testing to finalize an architecture is an important consideration while deciding on the deployment pattern. A/B testing is a randomized experiment that routes a random subset of users to one model and the remaining users to a second model. The results are then compared, and the version that improves the business objective metric the most is the one finally chosen.
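To make the A/B idea concrete, here is a minimal sketch under a few assumptions that are not from the article: users are assigned by hashing a user ID so that the same user always sees the same model, and the two groups are compared with a standard two-proportion z-test on conversion counts. The function names, the experiment label, and the numbers are all illustrative.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str = "model-ab-test") -> str:
    """Deterministically map a user to variant "A" or "B" with a 50/50 split."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 100 < 50 else "B"


def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z-statistic for the difference in conversion rate (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err


# Made-up numbers: 1000 users per variant, 100 vs 130 conversions.
z = two_proportion_z(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(assign_variant("user-42"), round(z, 2))
```

A value of |z| above roughly 1.96 corresponds to the conventional 5% significance level for a two-sided test. Comparing raw rates without such a check can mistake noise for an improvement.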

The possibility of a gradual rollout is another factor that affects the choice of deployment pattern. In cases where a model is already deployed successfully, it may not be a good idea to route the full traffic to the new model all at once. For example, consider a recommendation engine that provides product recommendations on the user’s home page based on their previous purchases. Suddenly replacing this model entirely with a new one is a big risk. A drastic change in recommendations can lead to a mass exodus of users.

Some use cases require explicit routing of input to models for best performance. This could be part of a larger strategy of improving the accuracy of the whole system. For example, let's say, data scientists come up with a model that does very well on certain kinds of data. In such cases, it does not make sense to replace the model entirely for the complete input data, but use it only when specific rules are met. 

The next section will talk about some of the advanced deployment patterns based on the above factors. 

Shadow Model Deployment

In the case of shadow model deployment, a new version of the model is deployed alongside an existing model that is already processing requests. The new version is called the shadow model and the already running one is the live model. The shadow model processes all the requests processed by the live model. The application will use the results from the live model for normal operation but will store the results from the shadow model for further analysis. The advantage of this approach is that data scientists can get information on how the new model will perform in production without disturbing the operation of the platform. Analysis of stored results provides a perspective on how well the new version of the model is performing.
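The flow described above can be sketched in a few lines. This is a simplified, synchronous illustration rather than a production design: the class name ShadowDeployment, the in-memory log, and the two toy models are assumptions made for the example. A real system would more likely call the shadow model asynchronously and write results to durable storage.

```python
from typing import Any, Callable, Dict, List


class ShadowDeployment:
    """Serve results from the live model and record the shadow model's output."""

    def __init__(self, live_model: Callable[[Any], Any], shadow_model: Callable[[Any], Any]):
        self.live_model = live_model
        self.shadow_model = shadow_model
        self.shadow_log: List[Dict[str, Any]] = []  # stand-in for persistent storage

    def predict(self, request: Any) -> Any:
        live_result = self.live_model(request)
        entry: Dict[str, Any] = {"request": request, "live": live_result}
        try:
            entry["shadow"] = self.shadow_model(request)
        except Exception as exc:
            # A failing shadow model must never affect the response to the user.
            entry["shadow_error"] = repr(exc)
        self.shadow_log.append(entry)
        return live_result  # only the live result reaches the application


# Toy stand-ins for real models (hypothetical).
def live_model(request: str) -> str:
    return "clothing"


def shadow_model(request: str) -> str:
    return "clothing" if "shirt" in request else "footwear"


deployment = ShadowDeployment(live_model, shadow_model)
print(deployment.predict("red running shoe"))
print(deployment.shadow_log)
```

The stored log is what the data scientists analyze later to compare the two models on identical inputs.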

This model testing strategy works in cases where internal teams can assert whether the model’s results were valid or not. A recommendation model is an example where this is not the case. The only person who can categorically say whether a recommendation was valid is the user, who expresses that validity by clicking on the recommendation or purchasing the product. Even though a recommendation generated by a shadow model can be reviewed by business teams, it is not a categorical validation like in the case of, say, an image processing model.

Consider the example of the computer vision model that an eCommerce organization developed to automatically assign the category of a product image uploaded by a seller. In this case, internal teams can assess the results of a shadow model with full confidence. After all, a shirt categorized as an article of clothing does not need much debate. The stored results of the new version of the model, in this case, are directly comparable to the previous results. One can calculate precision, recall, and accuracy on these results to test the ML models and easily decide which one is the better bet.

Shadow model testing is used when there is no need for A/B testing. Architects use this pattern when they require validation of the model on complete user data before making a switch. The downside of this pattern is resource usage. Since all the data passes through both models, this pattern generally requires close to double the resources used by the single-model strategy. One can try to optimize resource usage by finding stages that both models can share. For example, some parts of pre-processing can usually be done once.

Weighted Traffic Split Model Testing

This pattern is used when the business needs to evaluate a model based on feedback from real users. If a shadow deployment is possible, it is usually run first, before moving to a weighted traffic-based deployment. The shadow deployment gives the organization enough confidence to push the model to production, but since real user feedback is still missing, it is not enough to justify rolling out the model completely. Weighted traffic model testing is used when A/B testing is required.

This pattern is generally used when the model cannot be fully evaluated by the internal teams and requires validation from real users. For example, consider the case of the recommendation engine again. A recommendation engine’s primary validation is whether the user clicks on the recommendations and whether this results in additional revenue compared to the earlier algorithm or a recommendation-less approach. A weighted traffic split is a great option to evaluate recommendation models. Organizations can start with bare minimum weights and monitor the results to see whether the new recommendation model brings them more revenue. If the users exposed to the new recommendation model click on the recommendations more often than users served by the previous version, that is a strong sign that the new model is doing better.

Weighted traffic ML model testing is also beneficial when there are two new contenders. One can use a 1:1 split and route traffic randomly to the two models. The decision in this case is even simpler. Whichever model leads to higher revenue becomes the winner.

The weighted traffic strategy’s biggest advantage is the possibility of a gradual rollout. The traffic can be increased gradually when the internal teams become confident of the model's ability to help customers. This kind of rollout is called canary deployment. Compared to the shadow model pattern, this is less expensive because there is no duplicate model inference. 
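A static weighted split with a gradual ramp-up might look like the sketch below. The class name WeightedRouter, the model names, and the specific weights are illustrative assumptions. The routing itself relies on random.choices from the Python standard library.

```python
import random
from collections import Counter
from typing import Dict, Optional


class WeightedRouter:
    """Pick a model for each request according to configurable traffic weights."""

    def __init__(self, weights: Dict[str, float], seed: Optional[int] = None):
        self._rng = random.Random(seed)
        self.set_weights(weights)

    def set_weights(self, weights: Dict[str, float]) -> None:
        if not weights or any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
            raise ValueError("weights must be non-negative and sum to a positive value")
        self.weights = dict(weights)

    def choose(self) -> str:
        names = list(self.weights)
        return self._rng.choices(names, weights=[self.weights[n] for n in names], k=1)[0]


# Start the canary with 5% of traffic on the candidate model.
router = WeightedRouter({"current": 0.95, "candidate": 0.05}, seed=7)
counts = Counter(router.choose() for _ in range(10_000))
print(counts)

# Once the team is confident, shift more traffic to the candidate.
router.set_weights({"current": 0.5, "candidate": 0.5})
```

This sketch picks a model per request. In practice, teams often make the assignment sticky per user so that one user does not bounce between models during the experiment.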

There is a more complicated form of weighted traffic split that uses dynamic weights instead of static weights. This model testing strategy can be used in cases where model performance can be directly correlated to a business outcome like revenue or profit. In this case, the deployment algorithm keeps track of the success of each model (say, whether the user clicks on a recommendation or not) and gradually increases the weight of the better-performing model. This method borrows from reinforcement learning, where the setup is commonly known as the multi-armed bandit problem.
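The article does not name a specific algorithm for dynamic weights. As one simple possibility, the sketch below uses an epsilon-greedy strategy: most traffic goes to the model with the best observed success rate, while a small fraction keeps exploring the others. The class name, the model names, and the simulated click rates are all made up for illustration.

```python
import random
from typing import Dict, Iterable, Optional


class EpsilonGreedyRouter:
    """Route most traffic to the best-performing model, exploring with probability epsilon."""

    def __init__(self, model_names: Iterable[str], epsilon: float = 0.1, seed: Optional[int] = None):
        self.epsilon = epsilon
        self._rng = random.Random(seed)
        self.trials: Dict[str, int] = {name: 0 for name in model_names}
        self.successes: Dict[str, int] = {name: 0 for name in self.trials}

    def success_rate(self, name: str) -> float:
        return self.successes[name] / self.trials[name] if self.trials[name] else 0.0

    def choose(self) -> str:
        names = list(self.trials)
        if self._rng.random() < self.epsilon:
            return self._rng.choice(names)  # explore: pick any model
        return max(names, key=self.success_rate)  # exploit: pick the current best

    def record(self, name: str, success: bool) -> None:
        self.trials[name] += 1
        self.successes[name] += int(success)


# Simulation with hypothetical click-through rates for two models.
true_click_rate = {"model_a": 0.05, "model_b": 0.20}
router = EpsilonGreedyRouter(true_click_rate, epsilon=0.1, seed=3)
user_rng = random.Random(11)
for _ in range(10_000):
    model = router.choose()
    clicked = user_rng.random() < true_click_rate[model]
    router.record(model, clicked)

print(router.trials)
```

Over time the router sends the bulk of the traffic to the model with the higher click rate. Other bandit strategies, such as Thompson sampling, follow the same track-and-reweight idea with different exploration rules.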

Rule-Based Traffic Split Model Testing

This ML model testing strategy is used when the organization requires close control of the data elements that are fed to each kind of model. It is mainly used when there are multiple models that each perform better on specific cases. The rule-based split improves the overall accuracy of the system by feeding each model its preferred input type. As in the case of weighted traffic model testing, this strategy comes after the business has already validated the efficiency of the model using the shadow strategy or offline batch validation.

Consider the example of the recommendation model. During model testing, developers find that the current model does extremely well on users from a certain geography, let's say the US, but not so well on others. So they attempt to improve the model for better overall performance. But they find that trying to improve the overall accuracy reduces the otherwise excellent performance for the US. In this case, the team can decide to field the new model for all the other cases and use the existing one only for cases where they are sure about the demographic. The result will be an overall higher accuracy of the system without compromising on earlier results.
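The geography example above can be sketched as a small rule-based router. The Rule and RuleBasedRouter names, the country field, and the model names are assumptions for illustration. The first matching rule wins, and anything that matches no rule falls through to the default model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    matches: Callable[[Dict[str, str]], bool]  # predicate over request attributes
    model: str  # model to use when the predicate is true


class RuleBasedRouter:
    """Send each request to the model of the first matching rule."""

    def __init__(self, rules: List[Rule], default_model: str):
        self.rules = rules
        self.default_model = default_model

    def route(self, request: Dict[str, str]) -> str:
        for rule in self.rules:
            if rule.matches(request):
                return rule.model
        return self.default_model


# Keep the existing model only where the user is known to be in the US.
router = RuleBasedRouter(
    rules=[Rule("us_users", lambda req: req.get("country") == "US", "existing_model")],
    default_model="new_model",
)

print(router.route({"country": "US"}))
print(router.route({"country": "DE"}))
print(router.route({}))  # unknown geography falls back to the default
```

Keeping the rules in an ordered list makes the precedence explicit, which helps when the number of rules starts to grow.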

Attributes like geography, language, gender, age, etc. often have a big effect on model performance. Having multiple models for specific criteria is easier than developing one generic model that works well in all cases. Hence such a model arrangement is very common in eCommerce platforms. The advantage of this system is the close control it allows over the model inference flow. The disadvantage is the need for a sophisticated rule engine that redirects traffic according to set rules. Overuse of this strategy can lead to an explosion of rules and a maintenance mess. At some point, developers will have to make an effort to consolidate models and reduce rules without losing accuracy.

Conclusion

Machine learning model testing is a complicated affair that requires sophisticated deployment architectures. Replacing a model may not always be a straightforward decision because of the difficulty of assessing the efficiency of different models. It is common to see models that performed well on test data fall flat when real user data arrives. Disturbing an already running production system with a new model is a risky affair and can quickly lead to customer retention nightmares. This article talked about several advanced deployment patterns that can reduce such risks. There is no one-size-fits-all answer to this problem. The choice of the deployment architecture depends on the problem at hand and the risk appetite of the organization.

Having access to a well-designed machine learning platform can abstract away most of the work involved in solving these uncertainties. Such platforms allow developers to focus on the core problem without worrying about the complexities of deployment. 

Qwak simplifies the productionization of machine learning models at scale. Qwak’s ML Platform empowers data science and ML engineering teams to deliver ML models to production at scale. 

By abstracting the complexities of model deployment, integration and optimization, Qwak brings agility and high-velocity to all ML initiatives designed to transform business, innovate, and create competitive advantage.

You can check out Qwak here.
