The Machine Learning Lifecycle in 2022

Pavel Klushin

Head of Solution Architecture at Qwak

August 26, 2022

It seems like everybody is breaking into machine learning (ML) nowadays. Its proliferation alongside ‘big data’ has naturally led to a scramble among organizations that want to figure out how to use all the data they’re collecting in a way that delivers value to their bottom lines. Indeed, the ML space is growing so rapidly that the machine learning market is expected to reach US$117 billion by 2027.

Although it’s great that the influx in popularity of ML is leading to lots of newcomers entering the ML space—after all, AI and ML solutions can only improve if there’s widespread participation—it needs to be made clear that building and incorporating an ML project in a production environment is a highly technical feat. It’s no walk in the park, and firms entering the market without sufficient ML experience could be in for a rude awakening.

Unfortunately, many firms seem to be under the impression that running an ML project is fairly straightforward as long as you’ve got the right data and computing resources for training. This couldn’t be further from the truth, and it’s an assumption that could cause organizations to waste money by embarking on projects that have a near-zero chance of ever making it to deployment.

In this article, we’re going to discuss what the lifecycle of a machine learning project actually looks like, to give organizational leaders a better understanding of what’s involved.

The machine learning lifecycle

The reality is that machine learning projects are not straightforward. Rather, they’re a cycle iterating between improving the data, the model, and evaluation. The cycle never truly finishes, and it’s crucial for developing ML models because it focuses on using model results and evaluation to refine your dataset.

This means that unfortunately, the ML lifecycle isn’t something that you can complete once and forget about. Much like any system, a deployed ML model requires ongoing monitoring, maintenance, and updates. ML models that have been deployed in production environments are going to need regular updates as you uncover biases in the model, add new sources of data, require additional functionality, and more.

With the right approach and tooling, however, managing the ML lifecycle needn’t be something to fret about. We’re now going to break the process down into its four main phases: data, model, evaluation, and production.

Phase 1: Data

The lifeblood of any ML model is the quantity and quality of the data that it’s trained with. The biggest data-related tasks in the typical machine learning lifecycle are:

Data collection — You need to collect as much relevant data as possible. Generally, the more representative data a model is trained with, the better. It’s also useful to keep a reserve of data that you can add to the training set when performance issues arise.

Data annotation — Annotation is the important process of labeling the examples in a dataset. It’s an incredibly laborious task that can take hours, days, or even weeks to complete, which is why automated data annotation tools have been developed.

Annotation schema — This is one of the most important parts of the data phase of the lifecycle because a poorly constructed annotation schema will lead to ambiguous classes and edge cases that make model training more difficult. It’s important for teams to thoroughly define their annotation schema as a result.
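
One way to make a schema concrete is to write it down as data and check incoming labels against it. The sketch below assumes a simple classification task. The AnnotationSchema class, its validate method, and the example classes are hypothetical names made up for illustration, not part of any particular annotation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationSchema:
    """Hypothetical schema: the allowed class labels plus a short definition of each."""
    classes: dict  # label -> plain-language definition

    def validate(self, annotations):
        """Return (index, label) pairs whose label is not defined in the schema."""
        return [
            (i, label)
            for i, label in enumerate(annotations)
            if label not in self.classes
        ]


# Illustrative classes only; a real schema would come from your own labeling guide.
schema = AnnotationSchema(classes={
    "car": "Four-wheeled passenger vehicle, including SUVs",
    "truck": "Vehicle built to carry cargo, including pickups",
    "bicycle": "Two-wheeled, pedal-powered vehicle",
})

labels = ["car", "truck", "van", "bicycle", "Car"]
problems = schema.validate(labels)
print(problems)  # [(2, 'van'), (4, 'Car')]
```

The two flagged labels show the kind of ambiguity a written definition per class is meant to rule out: an undefined class ("van") and an inconsistent spelling ("Car").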

When trying to improve model performance, ML teams will spend most of their time trying to perfect the data. This is because if a model is not performing well, the cause is almost always a data-related problem, such as a training dataset containing too many biases. In addition, making improvements to models generally involves things like mining hard examples, rebalancing, and updating annotations and schema.
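
As a rough illustration of rebalancing, the sketch below counts examples per class and then randomly duplicates minority-class examples until every class matches the largest one. Random oversampling is only one of several rebalancing strategies, and the function names and sample data here are assumptions chosen for the example.

```python
import random
from collections import Counter


def class_counts(labels):
    """Count how many examples each class has."""
    return Counter(labels)


def oversample(examples, labels, seed=0):
    """Randomly duplicate minority-class examples until every class matches the largest one."""
    rng = random.Random(seed)
    target = max(Counter(labels).values())

    # Group examples by their class label.
    by_class = {}
    for example, label in zip(examples, labels):
        by_class.setdefault(label, []).append(example)

    out_examples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for item in items + extra:
            out_examples.append(item)
            out_labels.append(label)
    return out_examples, out_labels


examples = ["img_01", "img_02", "img_03", "img_04", "img_05"]
labels = ["cat", "cat", "cat", "cat", "dog"]

print(class_counts(labels))  # Counter({'cat': 4, 'dog': 1})
balanced_examples, balanced_labels = oversample(examples, labels)
print(class_counts(balanced_labels))  # Counter({'cat': 4, 'dog': 4})
```

Duplicating examples like this is usually applied to the training split only, so that evaluation data still reflects the real class distribution.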

Phase 2: Model

The model phase involves creating a model and a training pipeline, then training and tracking model versions. Although the end result of any ML project is a model, this phase requires the least amount of time in comparison to data, evaluation, and production.

Typical model phase tasks might include:

Exploring existing models — ML teams explore existing models in a bid to reuse available resources and get a head start on model production. During the model phase, you’ll most likely be fine-tuning an existing model that was trained on a related task and uploaded to GitHub rather than creating a new one from scratch. This is normal.

Constructing training loops — Your data is highly likely to differ from what was used to train the model you’re using. For example, for a model using image datasets, considerations like object size and resolution need to be accounted for when creating the training pipeline for your own model.

Experiment tracking — In the course of building a model, there’ll be multiple iterations of the ML lifecycle. This will lead to you training lots of different models, so it’s important that you’re accurately and meticulously tracking different versions of the model and the hyperparameters and data that it was trained on. Organization is key!
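
To give a sense of what tracking involves, here is a minimal sketch that appends one record per training run to a JSON Lines file, tying each model version to its hyperparameters, a fingerprint of the training data, and its metrics. The function names, record fields, and file layout are assumptions for illustration. Dedicated experiment-tracking tools cover far more than this.

```python
import hashlib
import json
import os
import tempfile
import time


def dataset_fingerprint(rows):
    """Hash the training rows so a run can be tied to the exact data it saw."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:12]


def log_run(log_path, model_version, hyperparameters, rows, metrics):
    """Append one experiment record as a JSON line and return it."""
    record = {
        "timestamp": time.time(),
        "model_version": model_version,
        "hyperparameters": hyperparameters,
        "data_fingerprint": dataset_fingerprint(rows),
        "metrics": metrics,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def best_run(log_path, metric):
    """Return the logged run with the highest value for the given metric."""
    with open(log_path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])


# Example usage with made-up runs, written to a temporary directory.
log_file = os.path.join(tempfile.mkdtemp(), "runs.jsonl")
rows = [{"text": "good", "label": 1}, {"text": "bad", "label": 0}]
log_run(log_file, "v1", {"lr": 0.01, "epochs": 5}, rows, {"accuracy": 0.81})
log_run(log_file, "v2", {"lr": 0.001, "epochs": 10}, rows, {"accuracy": 0.86})
print(best_run(log_file, "accuracy")["model_version"])  # v2
```

Because the data fingerprint is stored with every run, two runs with different results can be checked quickly for whether the data changed or only the hyperparameters did.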

Phase 3: Evaluation

Once you’ve got a model that has been trained, it’s time to see how well it performs on new data by evaluating it. Tasks include:

Visualizing model output — Once you’ve got a trained model, it needs to be run on a few samples so that the output can be evaluated. This is the best way to figure out if there are any bugs in your training pipeline. It will also show if there are any other major errors such as mislabeled classes.

Looking at failure cases — Everything that your model does is influenced by and based on the data that was used to train it. If your model is performing worse than you expected, then you need to take a look at the data. Although it’s useful to look at cases where your model is doing well, you also need to look at failure cases where your model has predicted something incorrectly.

Coming up with solutions — Looking at failures is the first step in putting together solutions for fixing model performance. Most of the time, you’ll be going back to add training data in areas where your model failed so that it can learn some more, but solutions might also include fixing annotations or changing pre- and/or post-processing steps in your pipeline.
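
The failure analysis described above can be sketched in a few lines: collect every sample the model got wrong, then count which (true, predicted) mistakes occur most often to see where more training data or annotation fixes are likely to help. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def failure_cases(samples, y_true, y_pred):
    """Return (sample, true_label, predicted_label) for every wrong prediction."""
    return [
        (sample, truth, pred)
        for sample, truth, pred in zip(samples, y_true, y_pred)
        if truth != pred
    ]


def confusion_pairs(failures):
    """Count how often each (true, predicted) mistake occurs, most frequent first."""
    return Counter((truth, pred) for _, truth, pred in failures).most_common()


samples = ["img_01", "img_02", "img_03", "img_04", "img_05", "img_06"]
y_true = ["cat", "dog", "dog", "cat", "bird", "dog"]
y_pred = ["cat", "cat", "cat", "cat", "bird", "bird"]

failures = failure_cases(samples, y_true, y_pred)
print(failures)
# [('img_02', 'dog', 'cat'), ('img_03', 'dog', 'cat'), ('img_06', 'dog', 'bird')]
print(confusion_pairs(failures))
# [(('dog', 'cat'), 2), (('dog', 'bird'), 1)]
```

In this toy example every failure involves the "dog" class, which would point toward collecting more dog examples or rechecking how dogs were annotated.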

Phase 4: Production

You can only get to the production phase when you’ve got a model that performs well without any major errors. But this doesn’t mean that the work is over. Far from it, actually. Production is the most important phase and the most difficult to manage, and it involves a lot of work, including:

Model monitoring — You’ll need to test your deployed model to ensure that it is still performing as expected on test data with respect to your evaluation metrics and things like inference speed.

New data evaluation — Having a model in production means that you’ll always be passing new data through. The model will never have been tested on this data, so it’s important to perform an evaluation and look at samples to see how it performs.

Adding new functionality — Even if your model works flawlessly, there’s always going to be room for improvement. Adding new functionality is an inevitability for most models because you’ll always want to make the model more efficient, expand its capabilities, and ensure that it’s strengthening your bottom line as much as possible.
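
A basic monitoring check along these lines might compare accuracy and inference latency on a recent batch against agreed limits and raise an alert when either is breached. The sketch below assumes that ground-truth labels are available for the batch. The MonitorThresholds class, the threshold values, and the sample numbers are all made up for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MonitorThresholds:
    """Hypothetical limits; choose values that match your own evaluation metrics."""
    min_accuracy: float = 0.90
    max_mean_latency_ms: float = 100.0


def check_model_health(y_true, y_pred, latencies_ms, thresholds):
    """Compare accuracy and mean inference latency on a batch against the thresholds."""
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    accuracy = correct / len(y_true)
    mean_latency = mean(latencies_ms)

    alerts = []
    if accuracy < thresholds.min_accuracy:
        alerts.append(f"accuracy {accuracy:.2f} is below {thresholds.min_accuracy:.2f}")
    if mean_latency > thresholds.max_mean_latency_ms:
        alerts.append(
            f"mean latency {mean_latency:.1f} ms is above "
            f"{thresholds.max_mean_latency_ms:.1f} ms"
        )
    return {"accuracy": accuracy, "mean_latency_ms": mean_latency, "alerts": alerts}


report = check_model_health(
    y_true=[1, 0, 1, 1, 0],
    y_pred=[1, 0, 0, 1, 0],
    latencies_ms=[40.0, 55.0, 120.0, 60.0, 45.0],
    thresholds=MonitorThresholds(),
)
print(report["alerts"])  # ['accuracy 0.80 is below 0.90']
```

Running a check like this on a schedule turns "is the model still performing as expected?" into a question with a yes-or-no answer, which is what makes it possible to alert on.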

Don’t underestimate the machine learning lifecycle

Only a very small (we’re talking minuscule) number of organizations that try to incorporate machine learning actually make it to the stage where a model is deployed into production. This is because, despite what people may think, developing, deploying, and managing an ML model is an incredibly complicated and labor-intensive process.

That shouldn’t put you off, though. While it was once the case that only organizations with significant amounts of money behind them or dedicated machine learning teams (or, in most cases, both!) could be said to be in a position to deploy their own ML models, the proliferation of machine learning services and tooling has made the prospect of doing so much more accessible to smaller businesses.

That’s not to say it’s easy, though. Even with the best ML tooling in the world, building and deploying an ML model is a lot of work. Whether this is worth it depends entirely on your organization, what you’re trying to achieve, and how much potential value ML could deliver.

Smaller businesses that do decide to build their own models are increasingly turning to platforms like Qwak to get the job done.

Qwak is the full-service machine learning platform that enables teams to take their models and transform them into well-engineered products. Our cloud-based platform removes the friction from ML development and deployment while enabling fast iterations, limitless scaling, and customizable infrastructure.

Want to find out more about how Qwak could help you deploy your ML models effectively? Get in touch for your free demo!
