“Big data” has been around for more than a decade, and much has been done to adapt the tech stack; however, getting value from big data is complicated. To do so, you need a team of skilled data scientists to sort through it. Companies understand this, as evidenced by the 15x to 20x growth in data science jobs between 2016 and 2020.
But even with a capable team of data scientists on hand, you still need to overcome the major hurdle of putting those ideas into production. It is crucial that your engineers and data scientists work together to effect a real impact on business.
As a result, you’ll often see highly talented, extensive data engineering departments that manage the organization's data across multiple data stores, such as the data lake and the data warehouse. In more advanced companies, this data is well organized and backed by a robust data transformation layer.
Next to the data engineering team, it’s not rare to find a data science department responsible for building machine learning (ML) models on top of the organization's data. These models must add a business “edge” to the product and keep the company ahead of the market. Examples of ML models include product recommendation, customer churn prediction, fraud detection, image recognition, customer support optimization, natural language processing, and many more.
Data scientists use many different research tools to work with the data and build their models. However, the friction starts the moment a model is “ready.” We’ll elaborate on the main reasons in the next section, but in a nutshell, the problem is the definition of “ready.” From the data scientists’ point of view, “ready” means that the model addresses the expected business use case and meets the agreed statistical thresholds (accuracy, F1 score, etc.). The data engineers’ definition of “ready,” on the other hand, is tied to real-world concerns (we’ll touch on those later): they want to make sure that the ML model they just “received” from their data scientist counterparts can withstand production pressure in terms of scale, code quality, data ingestion, and so on. Because of this difference in mindset, a new profession has emerged to help companies bridge the gap between data scientists and engineers: machine learning engineers.
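To make the two definitions of “ready” concrete, here is a minimal Python sketch that checks both at once. The function names, the F1 threshold of 0.8, and the 50 ms latency budget are all hypothetical values chosen for illustration; a real team would agree on its own metrics and limits.

```python
import time


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = positive), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def is_ready(predict, X, y_true, min_f1=0.8, max_latency_ms=50.0):
    """Check both views of 'ready' (thresholds are illustrative assumptions).

    'statistical' is the data scientist's view: does the model meet the metric?
    'operational' is the engineer's view: is the average prediction fast enough?
    """
    start = time.perf_counter()
    y_pred = [predict(x) for x in X]
    avg_latency_ms = (time.perf_counter() - start) * 1000 / len(X)
    return {
        "statistical": f1_score(y_true, y_pred) >= min_f1,
        "operational": avg_latency_ms <= max_latency_ms,
    }
```

A model that passes only one of the two checks is “ready” for one team and not for the other, which is exactly where the friction comes from.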
The ML engineer’s goal is to act as a mediator between these two departments. On the one hand, they understand statistics, model mechanisms, and data; on the other, they understand what it takes to run things in production: how to manage versions, support scale, and build features for real-time production inference.
At their core, data scientists are innovators who draw new insights from the data ingested by your company every day; engineers, meanwhile, build on those insights to create sustainable solutions that can “live” in the production environment.
Using a variety of tools and techniques, ranging from data mining to statistical analysis, data scientists manipulate, interpret, and merchandize data to create business outcomes. The process of collecting, organizing, and interpreting data aims to identify significant trends and relevant information.
Although engineers and data scientists work together, there are some distinct differences between the two roles. For example, engineers place a higher value on the "production readiness" of systems.
Engineers want their systems to be fast and reliable, and that applies to the models generated by data scientists as well as to their format and scalability.
Therefore, engineering and data science teams have different day-to-day concerns.
So how can these two roles thrive together and make sure they are creating the desired business impact? We’ll focus on some of the main areas of overlap below.
Putting a few scientists and engineers in the same room and asking them to solve the world's problems will not suffice. They must first learn each other's terminology and speak the same language.
On the data science side, it helps if data scientists learn and adopt engineering practices and standards. They also need to understand what it means to have their code running as part of the production environment, where scale, uptime, monitoring, tests, etc. are critical.
On the engineering side, the first thing to understand is that data science code may look similar to “classic” software development code, but the concept is completely different: ML models change while they are in production (through retraining) and are much more sensitive to data changes, so they also need to be released and monitored differently.
One of the biggest challenges is creating code standards for ML models. The reasons range from managing versions and running tests to keeping code standards consistent across different data scientists.
You can read much more about it in Yuval’s blog.
The other problem with ML code is that it is developed in “lab” environments. As a result, challenges can appear when it meets production, for example in supporting production scale (not just single or batch prediction commands). The following are required in the real world:
Version management – One of the most important things that is often ignored in the move from research to production is version management. In the research phase, data scientists can create hundreds of versions of each model, and most of them will never be relevant. That is not the case in production: in a well-defined production environment, every version that has run in production is well managed and reproducible, both for investigation and as a fallback option.
Serving that can scale – When building models, it’s very common to run the statistical tests but “forget” to check whether the model can actually support the expected production load, plus some buffer. In addition, scale should be flexible and set up in a way that allows scaling up and down with the business.
Monitoring – Like any other production application, ML requires monitoring and alerts so the relevant people know when something is going in the wrong direction.
Analytics – ML is often a data product and is heavily affected by data changes. For that reason, it’s not enough to “just” monitor the application itself; data monitoring is also crucial to make sure your ML model is healthy.
Automations – A critical part of managing ML models is creating automations that act when the model degrades over time. For example, if the accuracy drops below 80%, trigger a retrain and upload a new version of the model.
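The monitoring and automation requirements above can be sketched in a few lines of Python. This is an illustrative sketch only: the class and function names are hypothetical, the 80% threshold comes from the example above, and the `retrain` and `deploy` callables stand in for whatever training pipeline and release process a team actually uses.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks rolling accuracy over the most recent labeled predictions."""

    def __init__(self, window=100, threshold=0.80):
        self.outcomes = deque(maxlen=window)  # True for each correct prediction
        self.threshold = threshold

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retrain(self):
        # Only judge the model once the window is full, to avoid noisy alerts.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold


def check_and_retrain(monitor, retrain, deploy):
    """Retrain and deploy when accuracy is degraded.

    Returns whatever `deploy` returns (for example a version id), or None
    when no action was needed.
    """
    if not monitor.needs_retrain():
        return None
    new_model = retrain()
    version = deploy(new_model)
    monitor.outcomes.clear()  # start measuring the new model from scratch
    return version
```

In practice, a scheduler or a streaming job would call `check_and_retrain` periodically, and the same pattern can be applied to data-quality metrics, not only to accuracy.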
The best way to maximize the value of clean code is to "productize" it internally by creating an environment where engineers and data scientists can each draw on their strengths. This is the "feature store": a centralized place where documented and curated features (independent variables) are stored.
We use this data management layer to feed curated data into our ML algorithms. Aside from standardization and ease-of-use, our main advantage is that our feature store allows consistency between our models. It has significantly increased the reliability of our algorithms and increased our data team's efficiency. Data scientists and engineers know that when they take a feature off the shelf, it has been stress-tested to ensure it will work at scale.
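As a rough illustration of the idea, a feature store can be thought of as a registry of named, documented feature definitions that every model reads from. The sketch below is a toy in-memory version with hypothetical names (`Feature`, `FeatureStore`, `get_vector`); it is not a description of any particular product or of the system mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Feature:
    name: str
    description: str                  # documentation travels with the feature
    compute: Callable[[dict], float]  # turns a raw record into a feature value


class FeatureStore:
    """Toy in-memory registry of curated features (illustrative only)."""

    def __init__(self):
        self._features: Dict[str, Feature] = {}

    def register(self, feature: Feature) -> None:
        # A single definition per name keeps models consistent with each other.
        if feature.name in self._features:
            raise ValueError(f"feature {feature.name!r} is already registered")
        self._features[feature.name] = feature

    def describe(self, name: str) -> str:
        return self._features[name].description

    def get_vector(self, names: List[str], record: dict) -> List[float]:
        """Build a model input from raw data using the shared definitions."""
        return [self._features[n].compute(record) for n in names]


store = FeatureStore()
store.register(Feature(
    name="order_total_usd",
    description="Price multiplied by quantity for a single order.",
    compute=lambda r: r["price"] * r["quantity"],
))
store.register(Feature(
    name="is_returning_customer",
    description="1.0 if the customer has at least one previous order.",
    compute=lambda r: 1.0 if r["previous_orders"] > 0 else 0.0,
))
```

Because training and serving code both call `get_vector` with the same feature names, the two paths cannot silently drift apart in how a feature is computed.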
In recent years, big data and ML have created both new opportunities and challenges at the organizational level. Phase one was the realization that big data in and of itself wasn't going to create efficiencies.
Phase two is about helping data scientists, who are wonderful at finding value, put their ideas into practice in a way that meets the rigors of an engineering team operating at scale, with thousands of customers relying on the product.
In short, the second phase depends on data scientists and data engineers extracting value from big data together. Collaboration between the two teams should be seamless and organized; once they are in sync, your customers will get the most value from your business.
As we wrote in the opening paragraph, one of the easiest ways to “solve” this is to hire a dedicated ML engineer who can bridge the gap between data scientists and data engineers. ML engineers usually take the ML code from the data scientists and add all the engineering pieces required to run ML in production: version management, serialization, serving, monitoring, analytics, feature management, and automation.
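As one small example of those engineering pieces, version management and serialization can be sketched as a model registry that stores every published version and allows falling back to an earlier one. The `ModelRegistry` class below is hypothetical and keeps everything in memory; it uses Python's standard `pickle` module for serialization, which should only be used with trusted data.

```python
import pickle


class ModelRegistry:
    """Toy registry: every published version stays loadable for fallback."""

    def __init__(self):
        self._versions = {}  # version number -> serialized model bytes
        self._active = None  # version currently served in production

    def publish(self, model):
        """Serialize the model, store it as a new version, and activate it."""
        version = len(self._versions) + 1
        self._versions[version] = pickle.dumps(model)
        self._active = version
        return version

    def load(self, version=None):
        """Load a specific version, or the active one by default."""
        version = self._active if version is None else version
        if version not in self._versions:
            raise KeyError(f"unknown model version: {version}")
        return pickle.loads(self._versions[version])

    def rollback(self, version):
        """Point production back at an earlier, known-good version."""
        if version not in self._versions:
            raise KeyError(f"unknown model version: {version}")
        self._active = version


registry = ModelRegistry()
v1 = registry.publish({"weights": [0.1, 0.2]})  # stand-in for a trained model
v2 = registry.publish({"weights": [0.3, 0.4]})
registry.rollback(v1)                           # v2 misbehaves: fall back to v1
```

A production setup would persist the serialized models and their metadata in durable storage, but the contract is the same: nothing that ran in production is ever overwritten.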