How to Deploy Machine Learning Models into Production

Discover Qwak's strategies for effectively productionizing machine learning models, focusing on development, architecture, and operational efficiency.
Pavel Klushin
Head of Solution Architecture at Qwak
March 1, 2022


Modern businesses use machine learning models to solve many kinds of business challenges that would otherwise require human intervention. Developing machine learning models is inherently complex and experimental. The same factors make running a machine learning model in production more complex than running a typical web service. In this post, you will learn about the steps involved in taking a machine learning model to production, the typical architectural patterns, and the challenges faced along the way.

Productionizing ML Models 

Taking a machine learning model to production generally involves the following stages.

  1. Setting up a repeatable development process
  2. Dealing with model explainability 
  3. Defining the model serving architecture
  4. Setting up model monitoring and verification
  5. Establishing a process for model updates

Each of these steps includes many uncertainties because of the ever-evolving nature of input data as well as the models. 

Repeatable Development Process

The development process of a typical machine learning model involves a lot of experimentation. Establishing a process to minimize the chaos around the experiments is the first big step towards taking a machine learning model to production. This is typically done by fixing an accuracy metric and a test data set early in the development process. Once these are fixed, every experiment can be measured against them to decide whether it delivers a statistically significant improvement.
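One way to check whether an experiment really improved on the baseline is a paired bootstrap over the fixed test set. The sketch below is only an illustration: the function names, the choice of plain accuracy as the metric, and the number of resamples are all assumptions, not something the process above prescribes.

```python
import random


def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def paired_bootstrap(labels, baseline_preds, candidate_preds,
                     n_resamples=2000, seed=0):
    """Resample the fixed test set with replacement and return the
    fraction of resamples in which the candidate gets strictly more
    examples right than the baseline."""
    rng = random.Random(seed)
    n = len(labels)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        base = sum(labels[i] == baseline_preds[i] for i in idx)
        cand = sum(labels[i] == candidate_preds[i] for i in idx)
        if cand > base:
            wins += 1
    return wins / n_resamples
```

A win fraction close to 1.0 suggests the gain is unlikely to be an artifact of which examples happen to be in the test set. A value near 0.5 suggests the two models are hard to tell apart on this data.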

Model Explainability

Once an acceptable model is developed, the explainability of the model architecture comes into the picture. While this is not a required step in every machine learning project, it is very much part of the story in highly governed domains like finance or healthcare. The developers may present their results on the established data sets as the first logical milestone, but the real challenge starts when the model is scrutinized by the business stakeholders. Questions about why and how the model arrived at its conclusions will have to be answered, and in some cases there will even be a requirement to generate an evidence tree along with the model output.

For example, if one is developing a model to predict churn risk, there will be questions about why specific customers were classified as at-risk. Because model predictions may vary slightly on the same test data across different training iterations, dealing with model explainability can be one of the most frustrating phases of taking a machine learning model to production.
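For a simple tree-based churn model, the evidence behind a prediction can be the list of decisions taken on the way to a leaf. The following is a minimal sketch under loud assumptions: the tree is hand-written for illustration (a real one would be learned from data), and the feature names and thresholds are hypothetical.

```python
# Hypothetical hand-written tree; feature names and thresholds are
# illustrative only.
TREE = {
    "feature": "days_since_last_complaint", "threshold": 30,
    "left": {  # complained within the last 30 days
        "feature": "lifetime_value", "threshold": 500,
        "left": {"label": "at-risk"},
        "right": {"label": "safe"},
    },
    "right": {"label": "safe"},
}


def predict_with_evidence(tree, customer):
    """Walk the tree and record every comparison that led to the label."""
    evidence = []
    node = tree
    while "label" not in node:
        value = customer[node["feature"]]
        go_left = value <= node["threshold"]
        op = "<=" if go_left else ">"
        evidence.append(f"{node['feature']} = {value} {op} {node['threshold']}")
        node = node["left"] if go_left else node["right"]
    return node["label"], evidence
```

Calling `predict_with_evidence(TREE, {"days_since_last_complaint": 10, "lifetime_value": 200})` returns the label together with the two comparisons that produced it, which is the kind of trail a stakeholder can review.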

The Model Serving Architecture

Once an acceptable model is developed, a model serving architecture can be finalized by considering the below factors.

  • Frequency of model inferences
  • Latency requirements for model inferences
  • Typical load on inference APIs

The above factors may seem very similar to those considered while deploying any web service. But the high hardware requirements and ever-changing inputs mean that model serving APIs need many more checks and balances.

Model Monitoring And Verification

After finalizing the serving architecture, auto-deployment pipelines can be configured to check the performance of the model on the test set and manage the deployment process after successful experiments.

Machine learning models make predictions based on what they learned from the training data. The underlying assumption is that the training data comes from the same distribution as the production data. In reality, this assumption rarely holds and production input will differ from the training data. The extent of the difference depends on how good a job the development team did while collecting data for training. This variation will grow as the model ages or as external factors change, so there must be mechanisms to detect such data drift and variations in inputs. Model monitoring is therefore an important aspect of taking a machine learning model to production.
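One common way to quantify this kind of drift for a single numeric feature is the Population Stability Index (PSI). The article does not name a specific technique, so treat the sketch below as one possible option. The bin count, the smoothing constant, and the function name are assumptions.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training sample (expected)
    and a production sample (actual) of one numeric feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            i = int((v - lo) / width)
            # Values outside the training range fall into the edge bins.
            counts[min(max(i, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A frequently quoted rule of thumb treats a PSI above roughly 0.2 as a sign of significant shift, but the right alert threshold depends on the feature and the business context.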

Model monitoring involves keeping watch on the important metrics of model performance. These include accuracy metrics such as precision, recall, and the ROC curve, computed against up-to-date ground truth generated for production input. Model monitoring should also track operational metrics such as response time and resource utilization.
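As a small illustration of the accuracy side of monitoring, precision and recall can be computed from a sample of production predictions once ground truth labels are available. The function name and the `"at-risk"` default label are assumptions for this sketch.

```python
def precision_recall(labels, preds, positive="at-risk"):
    """Precision and recall of the positive class on a labeled sample."""
    pairs = list(zip(labels, preds))
    tp = sum(l == positive and p == positive for l, p in pairs)
    fp = sum(l != positive and p == positive for l, p in pairs)
    fn = sum(l == positive and p != positive for l, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Tracking these two numbers per day or per batch makes a slow decline visible well before stakeholders notice it.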

Model Evolution

Since machine learning models need to be continuously updated to adapt to variations of input data and environment, there must be a semi-automated training process that considers production data and keeps training new models. This is generally done by using a fraction of the production data for verification and annotation. The corrected data will then be added to the original training set. 

The test set also requires particular attention while the models evolve. If the test set remains the same throughout the evolution, the model's results in production will start diverging from its test results. This can lead to a scenario where the development team relies on the wrong evidence to find and solve issues in the model. Hence it is recommended to continuously introduce data from production into the test set.
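The feedback loop described above can be sketched as a function that folds verified production records into both the training set and the test set. The 20% test share, the function name, and the record format are assumptions for illustration.

```python
import random


def fold_in_production(train_set, test_set, verified_records,
                       test_fraction=0.2, seed=0):
    """Split verified production records between the training and test
    sets so that both keep tracking production data."""
    rng = random.Random(seed)
    shuffled = list(verified_records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    new_train = train_set + shuffled[n_test:]
    new_test = test_set + shuffled[:n_test]
    return new_train, new_test
```

Keeping each record in exactly one of the two sets matters here: a record that lands in both would leak into evaluation and inflate the test metrics.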

Machine Learning in Production - Architecture

Let us consider an example to understand the typical architecture pattern followed while deploying a machine learning model. Consider a telecom company that uses a customer churn prediction algorithm to identify at-risk customers. The company uses a customer support dashboard where reports on at-risk customers are displayed. This risk report is generated daily and is available to the customer service managers when they begin their day.

To accomplish this objective, three data flows need to be implemented.

  1. Development and training flow
  2. Model inference flow
  3. Model monitoring and verification flow

Development and training flow

The training flow starts with fetching data from the company’s data warehouse and preprocessing it. The data warehouse is periodically populated by ETL jobs from the company’s operational database. The preprocessed data is loaded into a feature store where data scientists and analysts explore the data and design experiments. For example, in the customer churn analysis problem, customer features like last interaction, last complaint, current location, and lifetime value can be stored in a feature store for analysts to explore while developing the model. The features that eventually make it into the model are decided only after many experiments. The experiments finally arrive at a model architecture that can solve the problem. Typically, such problems are solved using random forests or other variations of decision trees.

A model registry is generally implemented to manage the versioning of the model and store all the production model contenders that are built as part of various experiments. The decision to push a model to the model server is taken after evaluating the models on standard test sets. 
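To make the registry idea concrete, here is a minimal in-memory sketch. The class names, the stage labels, and the single `test_accuracy` score are all hypothetical. A real registry would also store the model artifacts and their metadata.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    test_accuracy: float
    stage: str = "candidate"  # candidate -> production -> archived


@dataclass
class ModelRegistry:
    versions: list = field(default_factory=list)

    def register(self, test_accuracy):
        """Store a new production contender and return its record."""
        mv = ModelVersion(len(self.versions) + 1, test_accuracy)
        self.versions.append(mv)
        return mv

    def production(self):
        """Return the version currently marked as production, if any."""
        return next((v for v in self.versions if v.stage == "production"), None)

    def promote_if_better(self, candidate, min_gain=0.0):
        """Promote the candidate only if it beats the current production
        model on the standard test set by more than min_gain."""
        current = self.production()
        if current is not None:
            if candidate.test_accuracy <= current.test_accuracy + min_gain:
                return False
            current.stage = "archived"
        candidate.stage = "production"
        return True
```

The promotion gate is the part that an auto-deployment pipeline would call after each successful experiment.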

Inference flow

The inference flow generally starts with fetching data from the operational database and converting it into a format supported by the model server. The inference then runs on this formatted input. In the case of churn prediction, this can be a batch process since the results are expected on the dashboard every morning for the customer service managers. The results generated by the batch job are written to the results database and to the operational database, where they can be consumed by the customer service managers.
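A nightly batch job of this kind could look roughly like the sketch below. It uses SQLite from the standard library as a stand-in for the operational and results databases. The table names, the column names, and the `score` rule are hypothetical placeholders for a real schema and a real call to the model server.

```python
import sqlite3


def score(features):
    """Hypothetical stand-in for a call to the model server."""
    return "at-risk" if features["days_since_last_interaction"] > 60 else "safe"


def run_batch(conn):
    """Fetch customers, format them as model input, score them, and
    store the results for the dashboard to read."""
    rows = conn.execute(
        "SELECT customer_id, days_since_last_interaction FROM customers"
    ).fetchall()
    results = [
        (cid, score({"days_since_last_interaction": days}))
        for cid, days in rows
    ]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS churn_results "
        "(customer_id INTEGER PRIMARY KEY, risk TEXT)"
    )
    # INSERT OR REPLACE keeps the job safe to re-run on the same day.
    conn.executemany("INSERT OR REPLACE INTO churn_results VALUES (?, ?)", results)
    conn.commit()
    return len(results)
```

A scheduler would run `run_batch` before the start of the working day so that the report is ready when the managers log in.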

That said, inference flows need not always be batch-oriented. Real-time inference mechanisms are also very common these days. An example could be a surveillance system that needs to generate immediate alerts based on what it sees from a camera. In such cases, images are continuously streamed to a queue like RabbitMQ or Kafka, or to a managed one like Kinesis. The inference jobs are queue consumers that act on the images and push the results to a socket or an operational database. Having a buffering queue and stream processing provides a clean way to horizontally scale the inference process according to real-time traffic patterns.
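The consumer pattern can be illustrated without a message broker by using Python's in-process `queue.Queue` as a stand-in for Kafka or RabbitMQ. Everything named here (the frame format, the `infer` rule, the worker count) is an assumption made for the sketch.

```python
import queue
import threading


def infer(frame):
    """Hypothetical stand-in for running the model on one frame."""
    return {"frame_id": frame["id"], "alert": frame["motion"] > 0.5}


def consume(frames, results):
    """Worker loop: pull frames until a None sentinel arrives."""
    while True:
        frame = frames.get()
        if frame is None:
            break
        results.put(infer(frame))


def run_consumers(frame_list, n_workers=2):
    """Fan frames out to n_workers consumers and collect the results."""
    frames, results = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=consume, args=(frames, results))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for frame in frame_list:
        frames.put(frame)
    for _ in workers:  # one sentinel per worker shuts them all down
        frames.put(None)
    for w in workers:
        w.join()
    collected = [results.get() for _ in range(results.qsize())]
    return sorted(collected, key=lambda r: r["frame_id"])
```

Scaling out in this model means raising `n_workers`. With a real broker the equivalent step is adding more consumer processes to the same consumer group or queue.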

In some cases, inference can even be web service based. For example, a model with low load or a low memory footprint can be deployed as a Flask or Spring Boot web service. Running an inference is then just like calling any other web service API.
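As a dependency-free illustration, the sketch below exposes a model as a plain WSGI application, which is the same interface a Flask app implements under the hood. The `/predict` route, the `open_complaints` feature, and the rule inside `predict` are hypothetical.

```python
import json


def predict(features):
    """Hypothetical stand-in for a small in-memory model."""
    risky = features.get("open_complaints", 0) >= 2
    return {"churn_risk": "at-risk" if risky else "safe"}


def app(environ, start_response):
    """WSGI entry point: POST a JSON object of features to /predict."""
    if environ["REQUEST_METHOD"] != "POST" or environ.get("PATH_INFO") != "/predict":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    features = json.loads(environ["wsgi.input"].read(length) or b"{}")
    body = json.dumps(predict(features)).encode()
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

For local experiments, the app can be served with the standard library via `wsgiref.simple_server.make_server("127.0.0.1", 8000, app).serve_forever()`. A production deployment would normally sit behind a proper WSGI server.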

Model monitoring and verification flow

The model results repository acts as the base information source for the model monitoring system. This database holds not only the model evaluation results but also the performance details of the model server, including the time taken for job completion. The model monitoring system uses the model evaluation results to compute metrics for a fraction of the results, relying on manual annotation or real-life evidence. In this case, it is easy to get real-life evidence from the feedback of the customer service agents. In other cases, model quality assessors may be involved who manually annotate a fraction of the results to arrive at model metrics. The verified data is then fed back into the training feature stores for further use.

Stitching all these together, this is what our functional diagram looks like.

Selecting the technical stack for the various components of this architecture is a herculean task, and the component that deserves particular attention is model serving. Choosing a model-independent serving framework is very important to keep further model development and evolution flexible. What started as a simple decision tree model can quickly end up being a multi-model pipeline with neural networks and GPUs involved.

Much of the work involved in selecting architectural components and establishing processes can be abstracted away by using a good machine learning platform. Qwak is one such platform that can streamline your build, deployment, and model maintenance process through a single utility. Visit Qwak and start free.

Now that we are familiar with the architectural components required in setting up a production machine learning model pipeline, let us understand some of the typical challenges faced by the development teams while doing this.

Typical challenges in productionizing ML models

Meeting Model Performance Requirements

Meeting the accuracy metrics set by the business stakeholders is one of the biggest challenges in deploying a machine learning model to production. This is aggravated by the fact that production accuracy usually diverges from test set accuracy, and the gap tends to widen for as long as the model is not updated. Chasing accuracy metrics while reacting to external factors is an endless cat-and-mouse game that machine learning developers are cursed with.

Models will suffer from various kinds of data drift as they continue to be used in production. There could be changes in the input data representations that suddenly turn your model predictions from acceptable to nonsense. At times, changes in external factors can play havoc with the models. For example, the entry of a small regional telecom player with rock-bottom prices can suddenly trigger churn in a specific region, and your model would have had no clue about it.

Meeting Serving Layer Performance Requirements

The serving layer architecture is particularly important because of its ability to play havoc with future model development. The serving layer must support a model-agnostic deployment mechanism and should be capable of hosting models built with all the popular model-building frameworks for future-proofing.
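One way to keep the serving layer model-agnostic is to have it depend only on a small prediction interface and wrap each framework behind an adapter. The sketch below uses `typing.Protocol` for that interface. The `Servable`, `CallableAdapter`, and `ModelServer` names are hypothetical and are not taken from any particular serving framework.

```python
from typing import Any, Callable, Protocol


class Servable(Protocol):
    """The only thing the server needs to know about a model."""
    def predict(self, inputs: list[dict[str, Any]]) -> list[Any]: ...


class RuleModel:
    """A simple rule-based model that already matches the interface."""
    def predict(self, inputs):
        return ["at-risk" if row["open_complaints"] >= 2 else "safe"
                for row in inputs]


class CallableAdapter:
    """Wraps any per-row scoring function, such as a call into another
    framework, so that it looks like a Servable."""
    def __init__(self, fn: Callable[[dict[str, Any]], Any]):
        self._fn = fn

    def predict(self, inputs):
        return [self._fn(row) for row in inputs]


class ModelServer:
    def __init__(self):
        self._models: dict[str, Servable] = {}

    def deploy(self, name: str, model: Servable) -> None:
        self._models[name] = model

    def infer(self, name: str, inputs: list[dict[str, Any]]) -> list[Any]:
        return self._models[name].predict(inputs)
```

With this shape, swapping a decision tree for a neural network changes only the adapter, not the server or its clients.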

Meeting the business requirements for the serving layer can be particularly challenging for high-load applications. In some cases, model inference has to be real-time, and if the model input is a data-heavy asset like an image or an audio signal, an elaborate serving architecture will have to be implemented. In most cases, this is done using a streaming queue like Kafka or RabbitMQ and implementing models as consumers or subscribers rather than as APIs.

Dealing with explainability

Even before a model is approved for deployment to production, there will be questions from multiple directions about the quality of predictions and the reason why predictions are made the way they are. These questions will continue long after deployment and there will be continuous requests to build more explainability into models. As a matter of fact, a low accuracy model with high explainability is at times preferred over a better model with low explainability. 

Explainability needs to be considered during the model development process itself. While it is straightforward to bring some explainability to a simple decision tree model, it gets much harder as model complexity increases. At times, a separate model whose only job is to generate explanations for predictions has to be developed and deployed as a companion to the original model. Developing such models is easier said than done because, for many of the complex problems that can only be solved using deep learning, there is often no reliable way to fully explain the outcomes.

Model Governance Challenges

Once models are used in production, many questions about model governance tend to arise. The most prominent ones concern the impact of the models on customers and how to gain visibility into it. In our example, how do you conclude whether a customer who was classified as at-risk really intended to switch and was saved by the timely intervention of the customer service executive?

Complying with internal and governmental regulations is another big aspect of model governance. Working out how the organization’s security policies apply to the model, as well as to the data collected for training and testing, is another aspect that takes up considerable time.


Deploying machine learning models is a complex process that involves multiple stages, each with its own uncertainties. The experimental nature of model development and the ever-changing data environment carry over into the deployment process and into model operation in production.

Having access to a well-designed machine learning platform can abstract away most of the work involved in resolving these uncertainties. Such platforms allow developers to focus on the core problem without worrying about the complexities of deployment. Qwak is a machine learning platform that can streamline your complete model development and deployment process. You can check out Qwak and sign up for a free trial.

Also, there are many great machine learning podcasts that explore ML topics.
