In November 2018, Ville Tuulos, a machine learning infrastructure architect, was the very first person to publicly dissect and discuss the Netflix machine learning infrastructure at the annual ‘QCon’ software development conference held in San Francisco.
Although this took place almost four years ago, Tuulos’ talk is still highly interesting and relevant and provides a deep dive into the machine learning architecture that has been developed at one of the world’s largest entertainment companies.
In this article, we are going to summarize his talk and discuss some of the concepts that he covers. Alternatively, you can watch the full 49-minute video on YouTube. All image assets used in this article have been taken directly from the video.
Ville Tuulos is an ML architect who has been developing infrastructure for machine learning for over two decades.
During the course of his career, he has worked as an ML researcher in academia and as an ML leader at several large companies, including Netflix, where he worked between August 2017 and March 2021. By the time he left Netflix, he was the company’s manager of machine learning infrastructure and led Metaflow, a popular open-source framework for data science infrastructure.
Nowadays, Tuulos is the co-founder and CEO of Outerbounds, a company that’s developing modern, human-centric ML infrastructure based on Metaflow. Tuulos is also the author of Effective Data Science Infrastructure, which discusses how to make data scientists more productive.
Tuulos begins his talk by comparing ML infrastructure to an online store, noting that building one was a huge technical challenge just two decades ago. Back then, store owners were forced to build the whole system themselves, starting with setting up servers, because cloud infrastructure didn’t exist.
Nowadays, however, platforms (e.g., WordPress & WooCommerce, Shopify, Etsy) have emerged that allow pretty much anyone, even those without technical expertise, to build their own online store. As a result, the biggest challenge in setting one up has more to do with having a good product and knowing your customer than with configuring infrastructure.
He goes on to add that the same thing is going to happen with ML infrastructure. While in recent years companies have had to build their own infrastructure from scratch, platforms like Qwak have solved this problem by providing advanced tooling out of the box. Today, building your own ML infrastructure is largely unnecessary; your time and resources are better spent elsewhere.
Just a few years ago, machine learning infrastructure was a major technical pain point at Netflix. Today, largely due to the work of Tuulos and his team, the company’s ML development is becoming more human-centric, and its infrastructure is guided by two key principles:
At Netflix, machine learning is used company-wide, and the company recognizes that general ML researchers and data scientists who build vision models using Python and TensorFlow, for example, are not the best people to build models for numeric problems (e.g., revenue models) using R.
Tuulos emphasizes that while it’s important to hire a specialized data scientist for each problem domain, that person isn’t necessarily also a DevOps specialist who can take charge of cloud infrastructure setup. This is where ML infrastructure comes into play.
Tuulos acknowledges that there will eventually be some sort of standard solution that enables data scientists to apply machine learning to very different types of problems. At Netflix, however, they wanted to be ahead of the curve and solve their customers’ problems with ML today, so they needed to build their own ML infrastructure.
According to Tuulos, ML workflows can be divided into eight building blocks:
The above slide, created by Tuulos, illustrates the stages that should be considered when building ML infrastructure. The arrows indicate that, in general, the more infrastructure a step needs, the less data scientists care about how it’s done. In other words, ML infrastructure teams should care most about the things that data scientists care about least.
In his talk, Tuulos describes how his team started a project to analyze sentiment in tweets written about a Netflix series. Although Netflix already had different tools that allowed them to execute each step in the model-building process, nothing was connecting them.
Although this didn’t impact the build process, it caused several problems in production and raised a lot of questions:
Tuulos says that he and his team checked the code and found that 60 percent of it was related to infrastructure and only 40 percent was related to data science. Pondering the questions above, they realized that they were missing a piece of infrastructure and that this gap was costly. They therefore engineered their own solution: Metaflow, which acts as the link between all these different technologies by wrapping around them.
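To make the idea of a layer that links separate tools more concrete, here is a minimal sketch in plain Python. It is not Metaflow’s API; the `Workflow` class, the step names, and the toy keyword "model" are all hypothetical and exist only for illustration. The sketch shows one thin wrapper running otherwise independent steps in order and recording a snapshot of the state after each one, so that it is possible to tell which data and which model produced a given result.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical toy "glue layer". This is NOT Metaflow's API, only an
# illustration of wrapping separate steps in one workflow that tracks state.
Step = Callable[[Dict[str, Any]], Dict[str, Any]]


@dataclass
class Workflow:
    name: str
    steps: List[Step] = field(default_factory=list)
    history: List[Dict[str, Any]] = field(default_factory=list)

    def step(self, func: Step) -> Step:
        """Register a function as the next step of the workflow."""
        self.steps.append(func)
        return func

    def run(self, **inputs: Any) -> Dict[str, Any]:
        """Run the steps in order, snapshotting the state after each one."""
        state: Dict[str, Any] = dict(inputs)
        for func in self.steps:
            state = {**state, **func(state)}
            self.history.append({"step": func.__name__, "state": dict(state)})
        return state


flow = Workflow("tweet_sentiment")


@flow.step
def load_data(state):
    # Stand-in for a data warehouse query (made-up example tweets).
    return {"tweets": ["great show", "bad ending", "great cast"]}


@flow.step
def train(state):
    # Stand-in for model training: a trivial keyword-weight "model".
    return {"model": {"great": 1, "bad": -1}}


@flow.step
def score(state):
    # Sum the keyword weights of each tweet's words.
    model = state["model"]
    scores = [sum(model.get(w, 0) for w in t.split()) for t in state["tweets"]]
    return {"scores": scores}


result = flow.run()
print(result["scores"])
print([entry["step"] for entry in flow.history])
```

In this toy version, each step only contains domain logic, while ordering and bookkeeping live in the wrapper. A real framework would additionally handle concerns such as compute resources, scheduling, and persistent storage, which this sketch leaves out.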
Prior to building Metaflow, Netflix’s average time from project idea to deployment was four months. Now, the median is just one week, which suggests that working with ML infrastructure enables ML teams to iterate much more quickly. This gives a significant advantage to fast-moving companies like Netflix.
Utilizing ML infrastructure doesn’t need to be complicated. While building your own would have been a huge challenge once upon a time, modern ML platforms make it much easier to set up, and it’s only going to become easier as these platforms continue to improve.
If you take anything away from Tuulos’ talk, it’s that this infrastructure is becoming increasingly important for modern ML development. This is because it helps ML teams to iterate and get their models into production sooner, which is critical when models constantly need to be re-trained in response to new data.
Even if you’re already using tooling to get your models into production, remember that it’s not enough to use a different tool for each step in your workflow. As we have just explored, by deploying dedicated ML infrastructure, Netflix cut its time from idea to deployment from four months to one week.
If you want to do the same, your best option is to use a standardized platform that supports all the tools, languages, and frameworks that you use in your workflows. That’s where Qwak comes in.
Qwak is a managed platform that unifies ML engineering and data operations, providing agile infrastructure that enables businesses to continuously productionize their ML models at scale. If you’re interested in learning more, check out our platform.