A machine learning model is only as good as the data it is trained on. Or, to be more precise, it can only be as good as the features that have been built from that data. We all know that.
If — to quote the arguably overused but nonetheless accurate saying — data is the new oil, then a machine learning model’s features are the oil gushers, and they must be treated appropriately. To reach the oil, you need to drill deep, and while it is hard work, the end results are more than worth it. The same can be said for finding the right features for a machine learning model.
In machine learning, a feature is an individual measurable property or characteristic that is taken from either a raw data point or an aggregation of raw data points. Choosing high-quality, informative, and independent features is a crucial element for developing effective algorithms and models.
Feature engineering is the process of creating new features for a model and is a critical part of any machine learning process. The better a model’s features are, the more accurate it is, which naturally leads to better returns for the business using it.
The specific features that are used in a model will depend on the function of the model and the predictions that it is trying to make. If, for example, a model is to be used in transaction monitoring for predicting whether a transaction might be fraudulent, relevant features might be whether the transaction was in a foreign country, whether it was for a larger amount than normal, or if the transaction is unusual for that customer.
These features may be calculated from data points such as the transaction’s location, its value, the average value of the customer’s purchases, and aggregated spending patterns.
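To make this concrete, here is a minimal sketch of how such fraud-detection features might be derived from a raw transaction and the customer's aggregated history. All field names (`home_country`, `avg_purchase`, the 2x threshold) are illustrative assumptions, not a prescribed schema.

```python
def build_features(transaction: dict, customer_profile: dict) -> dict:
    """Turn raw data points into model-ready feature values."""
    amount = transaction["amount"]
    avg = customer_profile["avg_purchase"]
    return {
        # Was the transaction made outside the customer's home country?
        "is_foreign": transaction["country"] != customer_profile["home_country"],
        # Flag amounts well above the customer's typical purchase.
        "is_large_amount": amount > 2 * avg,
        # Ratio features are often more informative than binary flags.
        "amount_to_avg_ratio": amount / avg if avg else 0.0,
    }

features = build_features(
    {"amount": 950.0, "country": "FR"},
    {"home_country": "GB", "avg_purchase": 120.0},
)
```

Each key in the returned dictionary is one feature the fraud model would consume.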
While we all know that the right data is of utmost importance for developing the right features and training ML models, preparing this data is a major challenge for data scientists.
Estimates suggest that 80 percent of the average data scientist’s time is spent on data preparation alone: collecting data, cleaning it, organizing it, and developing it into features. It is a monotonous, time-consuming, and tedious process, and most data scientists would agree.
What’s worse is that data preparation is often done unnecessarily. Almost every data scientist has found themselves wading through data to calculate the very same features that another data scientist within the same company has already built. On top of that, data scientists spend a huge amount of time replicating the same feature engineering pipelines every time they want to deploy a model.
On top of data preparation, every new machine learning project begins with the same task of searching for the right features. The problem is that there usually is not a centralized place to search for them because features are scattered all over the place.
While both these aspects are highly inefficient, it doesn’t need to be this way.
Organizations know this and they are turning to feature stores in their droves to overcome the woes of data preparation.
A feature store is a system designed to automate the computation, tracking, and management of the data that feeds multiple machine learning models. Here, “store” means “storage”: a centralized repository where each feature is defined once, computed from standardized input data, and made available to any model that needs it.
Although feature stores are relatively new, they are playing an increasingly important role in the development of machine learning models. This is driven by two trends: the growing adoption of machine learning solutions by businesses, and regulators’ efforts to reform data governance.
The use of a feature store ensures that features are always kept up to date for predictions. Feature stores also consistently maintain the history of each feature’s values for model training and re-training, enable the simple re-use of features across the business, make it easy to standardize feature definitions, and help data scientists achieve consistency between offline model development and online deployment (in other words, they help to avoid training-serving skew).
A machine learning feature store typically includes:
A feature store manages data pipelines that transform raw data into feature values. These can be either scheduled pipelines that aggregate data at specific intervals or real-time pipelines that are triggered by events and update feature values on the fly.
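A scheduled (batch) pipeline of this kind can be sketched as a simple aggregation job that rolls raw event rows up into per-customer feature values at a fixed interval. The data layout and feature names below are assumptions for illustration.

```python
from collections import defaultdict

def daily_spend_pipeline(transactions: list[dict]) -> dict:
    """Scheduled pipeline: aggregate raw events into feature values."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tx in transactions:
        totals[tx["customer_id"]] += tx["amount"]
        counts[tx["customer_id"]] += 1
    # One feature row per customer, ready to write to the feature store.
    return {
        cid: {"daily_total": totals[cid], "daily_tx_count": counts[cid]}
        for cid in totals
    }

raw = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "c1", "amount": 25.0},
    {"customer_id": "c2", "amount": 5.0},
]
features = daily_spend_pipeline(raw)
```

A real-time pipeline would run the same transformation logic per event rather than per batch, which is exactly why centralizing the definition matters.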
A feature registry contains standardized feature definitions to act as a centralized source of information, and a feature store makes searching through these features and feature definitions a painless task. APIs and UIs are exposed to data scientists so they can clearly see available features, pipelines, and training data sets and incorporate the features needed for their use case.
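Conceptually, the registry is a searchable catalogue of feature definitions. The sketch below assumes a minimal schema (name, description, owner); production registries also track data types, freshness, and lineage.

```python
class FeatureRegistry:
    """A toy centralized catalogue of standardized feature definitions."""

    def __init__(self):
        self._definitions = {}

    def register(self, name: str, description: str, owner: str) -> None:
        self._definitions[name] = {"description": description, "owner": owner}

    def search(self, keyword: str) -> list[str]:
        """Return features whose name or description matches the keyword."""
        kw = keyword.lower()
        return [
            name for name, d in self._definitions.items()
            if kw in name.lower() or kw in d["description"].lower()
        ]

registry = FeatureRegistry()
registry.register("monthly_customer_spend", "Total card spend per month", "risk-team")
registry.register("daily_tx_count", "Transactions per customer per day", "fraud-team")
matches = registry.search("spend")
```

This is the lookup a data scientist performs before building anything new: search first, re-use if a match exists.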
Feature stores persist historical feature values in an offline database so that training examples can be built with feature values aligned to the same point in time. Because all historical values are stored alongside the latest ones, the feature store can generate complete, correctly aligned training sets, and regenerate them for re-training as features are updated.
Feature stores serve ML models a single feature vector made up of the newest feature values. Real-time feature serving is useful when models need the most up-to-date values for specific metrics; a basic example is weather reporting. With a feature store, these values are immediately available to the model, which leads to more accurate predictions.
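The online-serving path can be sketched as a key-value store that keeps only the latest value per feature per entity and assembles them into one vector at request time. The store interface below is an illustrative simplification, not any particular product's API.

```python
class OnlineStore:
    """Toy online store: newest value per (entity, feature)."""

    def __init__(self):
        self._latest = {}

    def write(self, entity_id: str, feature_name: str, value) -> None:
        # Later writes overwrite earlier ones; only the newest value is kept.
        self._latest[(entity_id, feature_name)] = value

    def get_feature_vector(self, entity_id: str, feature_names: list[str]) -> list:
        """Serve the newest value for each requested feature."""
        return [self._latest.get((entity_id, f)) for f in feature_names]

store = OnlineStore()
store.write("c1", "avg_purchase", 120.0)
store.write("c1", "avg_purchase", 135.0)  # newer value replaces the old one
store.write("c1", "tx_count_7d", 4)
vector = store.get_feature_vector("c1", ["avg_purchase", "tx_count_7d"])
```

At prediction time, the model receives `vector` directly, with no feature computation on the request path.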
Because a feature store keeps all feature values updated and stores all historical values in chronological order, it becomes easier to monitor models and keep track of things like feature drift, prediction drift, and model accuracy.
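Because that history is available, drift checks become straightforward. The sketch below compares the mean of recently served feature values against a training-time baseline; the mean-shift test and the 0.25 threshold are illustrative choices, and production monitoring often uses PSI or KS statistics instead.

```python
def mean_drift(baseline: list[float], recent: list[float],
               threshold: float = 0.25) -> bool:
    """Flag drift when the relative shift in the mean exceeds the threshold."""
    base_mean = sum(baseline) / len(baseline)
    recent_mean = sum(recent) / len(recent)
    shift = abs(recent_mean - base_mean) / abs(base_mean)
    return shift > threshold

# Feature values at training time vs. values served recently in production.
training_values = [100.0, 110.0, 90.0, 105.0]
serving_values = [150.0, 160.0, 155.0]
drifted = mean_drift(training_values, serving_values)
```

A flagged feature is a prompt to investigate the upstream data or re-train the model.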
Machine learning feature stores improve the efficiency and productivity of data scientists and the accuracy of ML models by enabling:
The typical ML model development workflow requires data to be gathered, transformed, and processed, and features to be engineered from scratch for each new project. This is because there is usually no easy way for features to be shared, which leads to multiple teams working in their own silos and, in turn, to time and effort being wasted through repetition.
With a feature store, however, ML teams can easily start on a new project by exploring readily-available features. In most cases, features that have been built by other teams for past projects can be re-used for new projects. This not only saves time and effort but gives ML teams more scope to focus on making their model the best it can be.
When there is no way for features to be consistently calculated, models can vary hugely between different data silos, teams, and projects. In banking, for example, one team may calculate “monthly customer spend” by subtracting monthly spend from monthly money in, whereas another might calculate it using monthly spend alone.
While both of these calculations may be reasonable, if they are both called “monthly customer spend”, the result is inconsistently calculated metrics across different pipelines. A feature store’s single feature registry, however, creates one centralized location where every feature is calculated in the same way, which eliminates these consistency problems.
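The fix is structural: both teams import the one registered definition instead of re-implementing it. The calculation below is an assumed example of what such a canonical definition might look like.

```python
def monthly_customer_spend(transactions: list[dict]) -> float:
    """The single canonical definition, registered once in the feature store."""
    # Count only outgoing (positive) amounts as spend.
    return sum(tx["amount"] for tx in transactions if tx["amount"] > 0)

month = [{"amount": 40.0}, {"amount": 60.0}, {"amount": -500.0}]  # -500 = salary in

# Two different teams' pipelines reuse the same definition rather than
# re-implementing (and silently diverging on) the metric.
team_a_value = monthly_customer_spend(month)
team_b_value = monthly_customer_spend(month)
```

Because both pipelines call the same function, the metric cannot drift apart between teams.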
The sets of feature values used for ML model training must match the values that were known at the time of the events the model is being trained on. This ensures that when the model makes predictions in deployment, the input feature values it receives are consistent with those it saw during training.
A machine learning feature store solves this by producing training data sets with time-consistent feature values, taken from each feature’s history at the point in time of the events being modeled.
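The core operation is a point-in-time lookup: for each training event, pick the feature value that was current at the event's timestamp, never a later one. Timestamps are plain integers in this sketch for simplicity.

```python
def point_in_time_value(history: list[tuple], event_time: int):
    """history: (timestamp, value) pairs sorted ascending.
    Return the latest value recorded at or before event_time."""
    value = None
    for ts, v in history:
        if ts <= event_time:
            value = v
        else:
            break
    return value

# Recorded history of one feature for one customer.
avg_purchase_history = [(1, 100.0), (5, 120.0), (9, 300.0)]

# A training event at t=6 must see 120.0, not the later value 300.0,
# which would leak future information into training.
training_value = point_in_time_value(avg_purchase_history, 6)
```

Joining every training event against feature histories this way is exactly what prevents the training-serving skew described above.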
Model management and governance is an often overlooked task that data scientists need to stay on top of. This is especially true now that ML teams are starting to be held more accountable for their models by regulators. ML teams need to be ready to explain, among other things, why their models work the way they do, what data has been fed to them, when and why this was done, and what predictions they are making.
With a feature store, it is easy for ML teams to identify what data a model has been trained on and compare that to what data the deployed model has been fed. This makes iterating, training, and debugging an ML model easier because scientists can see exactly what data was used and when. Moreover, this level of insight makes it easier for ML teams to explain why their model made particular predictions in the past.
You only need to glance at the companies that have recently built their own feature stores for their ML platforms to get an idea of how seriously they are being taken. Ridesharing app Uber built Palette, Airbnb built Zipline, and Netflix built Time Travel.
Fortunately, you don’t need to build your own feature store to take advantage of the benefits. There are many options out there that either offer fully managed feature stores (such as those offered by Google Cloud and Qwak) or will build one for you.
Given everything that we have discussed, it seems as if feature stores are on the up. They are the data warehouses of the machine learning world, and they deliver a whole host of benefits to enterprise ML platforms that are building their entire machine learning development ecosystems around them.
Feature stores enable data scientists to scale their machine learning in a way that has never been possible before. Not only do they save a whole load of time and make your models more accurate, but they also offer your data scientists more structure and consistency which makes their jobs easier and more enjoyable.
So, while it would be disingenuous to say that you must or should be using a feature store, it is a good idea to implement one and start reaping the benefits — and you can get started right now with Qwak.
Our Feature Store enables data scientists and ML engineers to collaborate quickly and effectively among themselves and with the R&D organization. It’s an easy way to develop features using batch and real-time data sources and serve them in production instantly.