Today's businesses are rushing to implement Artificial Intelligence (AI) and Machine Learning (ML) systems to boost profitability and take part in what many call the fourth industrial revolution. After all, some research suggests AI can increase business productivity by up to 40%. But are ML projects delivering the expected results? Is it only hype, or does success take more than hiring data scientists and ML engineers to deploy an ML or AI application?
Research shows that 70% of companies don't realize significant benefits from AI projects and that 87% of data science projects never reach production. Common reasons include insufficient data quality and quantity, misalignment with the company's objectives, poor problem identification, and a scarcity of talent. But one essential challenge companies face is keeping ML solutions sustainable: as the user base grows, managing large volumes of data becomes an issue and requires data scientists to constantly account for changes in the data.
Of course, an ML application is only as good as the data it gets, and data scientists must ensure that they train their ML models on the latest data to capture relevant trends and behavioral patterns. One way is to implement an efficient feature store that streamlines your ML pipelines and helps make your application successful. In this article, we will discuss in detail what a feature store is, its functions, components, and architectures, and the factors you must consider to build an effective one.
A feature store is a medium through which data scientists get relevant data from available sources. Such sources can include databases, a data warehouse, or a data lake, which usually store data in its raw form. Data scientists cannot work directly with raw data; they typically apply transformations to get a clean dataset they can use for training their models. These modifications may involve filling in missing values, filtering columns, and aggregating data. Such transformations are usually referred to as feature engineering, as data scientists try to extract relevant features from datasets to create predictive ML models. A feature store, therefore, is a central repository that stores all such features.
Feature stores are helpful when the feature engineering process involves more than simple filtering and aggregation. They become necessary if your transformation pipelines consist of complex calculations that can take up significant computing power. For instance, the process may include merging and normalizing values from several data sources to create a new feature column that acts as input to an ML model. Typically, organizations working at a larger scale require sophisticated feature stores to speed up the transformation process.
More precisely, a feature store becomes necessary when multiple ML applications are running and each requires the same set of features. Without a feature store, there is considerable duplication, as each team runs the same transformations to produce the same feature set - a time-consuming and costly process. With a feature store, each team can access the features directly without rerunning data operations. You also need a feature store if you have multiple pipelines serving several applications, since features generated by different pipelines can drift apart and skew the feature values. Finally, a feature store is necessary if your ML applications require real-time or streaming data to function correctly. After all, you cannot keep rerunning your pipelines at regular intervals to update the ML models - the process would be inefficient and reduce the quality of your product or service.
Feature stores serve several functions that improve the overall ML deployment process. Primarily, feature stores speed up the development procedure, ensure compliance and governance, and allow for better collaboration among different teams. Let's explore these functions in more detail.
Before discussing the various feature store architectures, it is essential to understand the components that make up a typical feature store and where it fits in the overall ML pipeline. Commonly, an ML pipeline starts with a data source such as a data warehouse or a data lake. Next, a transformation layer converts raw data from the sources into usable features for data scientists' models. Once the layer applies the necessary transformations, it sends the modified data back to the sources, from where an ML application or data scientists fetch the relevant features. The process usually involves Extract, Transform, and Load (ETL) jobs.
It is at the transformation stage that feature creation usually takes place. For example, you may have a dataset of individual customers and the amount each spent on certain goods on each day of the month, but you want the average spending of each customer as a feature in your ML model. You can write a simple SQL query to compute the average spending for each customer - a pretty simple transformation. More complex transformations can involve several steps to calculate a single feature. The transformation steps are collectively known as a Directed Acyclic Graph (DAG), and the average spending is the feature.
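As a sketch of this kind of transformation, here is what the average-spending aggregation might look like in pandas (the table and column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical raw transactions table: one row per customer per day
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "date": pd.to_datetime(["2023-01-01", "2023-01-02",
                            "2023-01-01", "2023-01-02"]),
    "amount": [500.0, 800.0, 200.0, 400.0],
})

# The transformation step: aggregate raw spending into an
# average-spending feature per customer
avg_spending = (
    transactions.groupby("customer_id")["amount"]
    .mean()
    .rename("avg_spending")
    .reset_index()
)
print(avg_spending)
```

In a real pipeline, this single step would be one node in the DAG, and its output table would be what lands in the feature store.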
After transformations, the features flow to two destinations - the inference store and the training store. An inference store, such as Redis, holds the transformed features so ML applications can use them in real time as new data comes in, ensuring no latency in delivering features to the deployed ML models for making inferences. In contrast, data scientists use the features in the training store to train new ML models on their machines. The training store can be a data lake, such as Amazon S3, from which different team members access features based on their access rights.
The training store should ensure point-in-time correctness, meaning that the feature values must exactly match the labels for a particular period. Suppose the average spending in a specific month for an individual customer was USD 10,000. Assuming the month had 30 days, it would be incorrect to show spending of USD 10,000 for each day of the month for that customer. Instead, the training store's feature table should show a running average for each day. Assume the spending for the first three days was USD 500, 800, and 1,000. The average until day three should then be approximately USD 766.67 - it is only at day 30 that you should see an average of USD 10,000. A feature store must ensure point-in-time correctness so that the training data contains up-to-date and relevant feature values. If you then want to train a model using, say, the third-day average of each month, the data scientist can simply pick up those feature values from the training store.
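The running-average example above can be sketched with pandas' expanding mean; the figures match the USD 500/800/1,000 example from the text:

```python
import pandas as pd

# Daily spending for one customer over the first three days of the month
daily = pd.DataFrame({
    "day": [1, 2, 3],
    "spend": [500.0, 800.0, 1000.0],
})

# Point-in-time correct feature: the average *as of* each day,
# not the full-month average copied onto every row
daily["avg_to_date"] = daily["spend"].expanding().mean()
print(daily)
# Day 3 average: (500 + 800 + 1000) / 3 ≈ 766.67
```

A training store that joins labels against `avg_to_date` on the matching day, rather than against a final monthly figure, is what point-in-time correctness means in practice.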
Now that you know what a feature store is, why you may need one, and its functions and components, it's time to discuss the three types of architectures available to organize your features effectively: the literal feature store, the physical feature store, and the virtual feature store. All serve to ensure that the ML development process is smooth and efficient, and organizations must choose the one that best suits their goals and objectives.
A literal feature store only acts as a repository and performs no transformations. Custom-built pipelines apply the relevant changes, and you can use the literal feature store for training offline models and for online inference. The data scientist points the pipelines at the new literal feature store so that they dump all the features into the repository after running their computations. Of course, the data scientist must also point all the models to the literal feature store's location to get the data for inference and training. Since the infrastructure is simple, it has a low adoption and maintenance cost.
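Since a literal feature store is essentially a feature repository with no transformation engine, a minimal in-memory sketch can convey the idea (all class and method names here are hypothetical, not any vendor's API): pipelines write computed features in, and models or data scientists read them out.

```python
from datetime import datetime

class LiteralFeatureStore:
    """Toy repository: stores precomputed feature tables and performs
    no transformations itself - pipelines do that elsewhere."""

    def __init__(self):
        self._tables = {}

    def write(self, name, rows):
        # Versioned by write time so new values never overwrite old tables
        version = datetime.utcnow().isoformat()
        self._tables.setdefault(name, []).append((version, rows))
        return version

    def read_latest(self, name):
        _version, rows = self._tables[name][-1]
        return rows

# A transformation pipeline dumps its computed features...
store = LiteralFeatureStore()
store.write("avg_spending", {1: 650.0, 2: 300.0})

# ...and a model fetches them for inference or training
features = store.read_latest("avg_spending")
```

Note that all the feature computation happens before `write` is called, which is exactly what makes this architecture cheap to adopt and maintain.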
Organizations working at a smaller scale - with a couple of data scientists who can maintain their own pipelines, transform data through a specialized transformation layer to get the relevant features, and have a decent versioning system, but who want a central repository to store and access all the features - may benefit from the literal feature store's simplicity. Feast is one open-source literal feature store you can set up locally. It can quickly integrate with existing infrastructure and deliver point-in-time correct features. Although it offers simple on-demand transformations, you still need a separate transformation layer for complex feature engineering.
Of course, despite a literal feature store's simplicity, some downsides still exist. For example, changing a feature or running a new transformation with a literal feature store requires the data scientist to modify the transformation pipelines manually. The data scientist must also create a new feature table so that the literal feature store saves the new feature values without replacing the old tables. And all the models and data repositories must point to the new transformation pipeline so that every component uses the latest features. As such, the data scientist will have to make several manual adjustments with a literal feature store.
A physical feature store allows you to compute and store features and usually comes with a Domain Specific Language (DSL) to write transformations. Apart from the usual inference and training modules, a physical feature store has a metadata layer that stores information, such as the feature's owner, description, service-level agreements (SLAs), etc., so it's easier for team members to collaborate and share their work. And since a physical feature store also computes transformations, it has a transformation engine that replaces existing transformation pipelines.
In terms of performance and functionality, a physical feature store is superior to a literal feature store, which doesn't come with a transformation engine - it is only a feature repository. However, a physical feature store has a higher adoption cost and can be a black box for data scientists: they cannot customize the feature store's transformation engine to suit different ML architectures. Of course, making customizations is easier if an organization builds a physical feature store in-house instead of purchasing it from a third-party vendor. But building a physical feature store from scratch can be time-consuming and requires a relevant skill set.
As such, organizations must assess whether they need a physical feature store or can work with a literal store instead. Usually, organizations whose ML applications work on live or streaming data need a physical feature store for its low-latency feature delivery: the transformation engine can quickly create the relevant features and deliver them to the inference store. For example, companies like Uber, Lyft, and Airbnb have physical feature stores built in-house to serve the ML applications behind their user experience. Uber's Michelangelo, for instance, is a complete MLOps platform with a shared feature store of around 10,000 features and comes with a DSL that allows data scientists more flexibility with feature engineering.
A virtual feature store is more of a workflow than an actual store. Unlike the physical and literal feature stores, a virtual store organizes and streamlines the feature creation process while remaining within the bounds of the existing infrastructure - it doesn't replace the infrastructure or add to it. For example, a literal feature store sits on top of the existing infrastructure to provide inference and a training store. In contrast, a physical feature store replaces the current infrastructure with a new transformation engine. However, a virtual feature store aims to make the feature engineering process a central component throughout the ML development lifecycle.
A virtual feature store provides a metadata layer that stores a feature's owner, name, version, description, logic, SLAs, etc. The information acts as an abstraction over a data scientist's workflow, meaning they can use their preferred architecture and write custom features while adhering to specific protocols that ensure a standardized feature engineering process. So, a virtual feature store primarily manages and coordinates the feature creation and storage process rather than providing a separate transformation engine and storage platform; teams can use whatever platforms they want for transformation and storage. By establishing a robust feature governance and monitoring framework, a virtual feature store ensures the reusability and shareability of features.
The virtual feature store gives data scientists the flexibility to use their preferred architectures, languages, and platforms to build their ML pipelines while streamlining the feature engineering process to ensure quick deployment, compliance, and collaboration. Organizations will find a virtual feature store the right choice if they want to avoid the vendor lock-in of a physical feature store, lack the expertise to build one in-house, and want the low adoption cost of a literal feature store. Usually, companies with multiple data infrastructures implement a virtual store to avoid the cost and time of adding to or replacing all of them.
The above section outlined some common feature store architectures you can implement for managing features. However, things get pretty complicated once we go into the details of each feature store's components. Organizations face a recurring challenge where they must balance future-proofing their architecture against premature optimization. The challenge is more apparent when implementing a solution in-house. Of course, even third-party open-source tools can soon become obsolete - data science tools are often updated or superseded within a year or so. However, switching to a new tool is more manageable than revamping an entire custom-built solution that took months or years to build. In any case, you must be careful about such issues to ensure your architecture is flexible enough for easy upgrades.
Firstly, a feature registry must be a primary consideration when opting for any feature store architecture. The registry stores critical metadata, such as the feature's owner, description, logic, etc., and allows for easy migration of the feature store from one platform to another. The registry will also help you manage access controls while ensuring the features are discoverable and reusable. You want a solution with an easy permissions management module and a powerful search engine that lets team members quickly find what they want.
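At its core, a feature registry is structured metadata plus search. A minimal sketch (all names here are hypothetical, not a real registry's API) might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureMeta:
    """Metadata a registry typically tracks for each feature."""
    name: str
    owner: str
    description: str
    tags: list = field(default_factory=list)

class FeatureRegistry:
    """Toy registry: stores feature metadata and supports keyword
    search so features stay discoverable and reusable."""

    def __init__(self):
        self._entries = {}

    def register(self, meta: FeatureMeta):
        self._entries[meta.name] = meta

    def search(self, keyword: str):
        kw = keyword.lower()
        return [m for m in self._entries.values()
                if kw in m.name.lower()
                or kw in m.description.lower()
                or any(kw in t.lower() for t in m.tags)]

registry = FeatureRegistry()
registry.register(FeatureMeta(
    name="avg_spending",
    owner="data-team",
    description="Average customer spending per month",
    tags=["spending", "customer"],
))
hits = registry.search("spending")
```

A production registry would add versioning, access controls, and lineage on top of this, but the discoverability problem it solves is the same.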
Secondly, generating historically accurate features with streaming data is yet another challenge, and it requires you to think about backfilling data - filling in missing values from the past when upgrading to a new system or updating a feature table with new records. Usually, data backfilling is necessary whenever bad data enters the system. However, backfilling becomes difficult if you have a large amount of streaming data. One solution is to apply the same transformation logic across the board to avoid skew between online and offline data sources. Another is to implement feature logging, which records features computed online at inference time so they can later be reused for training models.
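Backfilling can be sketched as rerunning the same transformation logic over the historical window that needs repair - sharing one function between the online and offline paths is what avoids skew (the dates and column names below are illustrative):

```python
import pandas as pd

def compute_feature(df):
    # The single transformation used both online and offline,
    # which avoids skew between the two paths
    return df.groupby("customer_id")["amount"].mean().rename("avg_spending")

# Raw event data; suppose the feature table has a gap for January 2nd
raw = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "date": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-02"]),
    "amount": [500.0, 800.0, 300.0],
})

# Backfill: select the missing window and rerun the same logic over it
missing_window = raw[raw["date"] == "2023-01-02"]
backfilled = compute_feature(missing_window).reset_index()
```

The backfilled rows can then be appended to the training store; at streaming scale the selection and recompute would run as a batch job, but the principle is the same.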
You'll have to consider the solutions you use for an online or offline store. Redis is commonly used for online inference, while an offline store might be Amazon S3, Databricks, Google Cloud, etc., or any other data lake or warehouse. Of course, you must also consider building an orchestration layer if you have a hybrid environment to integrate your entire IT infrastructure with ML pipelines. Also, with data regulations like GDPR, it becomes essential for you to choose a platform that lets you easily manage access to sensitive features.
Additionally, you should build a feature monitoring component as part of the architecture to keep track of failed jobs, feature skew, latency, memory usage, etc. Finally, it would help to consider which platforms data scientists prefer to create features. For example, Jupyter notebooks are the standard IDE in data science. You must ensure the feature store easily integrates with the preferred IDE so a data scientist can quickly fetch the relevant features without waiting long for data loading jobs to provide the results.
Given the complexity and significance of a feature store, it is paramount that you set aside a healthy budget for investing in an architecture that is scalable, flexible, and future-proof. However, building such an architecture can be challenging. As the sections above suggest, creating a feature store is not only about its direct components; you have to consider several surrounding aspects to ensure an efficient solution. An easier path is a managed MLOps platform that quickly organizes your ML pipelines and takes your projects to production.
Qwak is one tool that provides an entire feature store as part of a complete MLOps solution that helps you quickly deploy ML models. The feature store module follows a physical feature store architecture with a transformation engine and a storage layer. It has a robust feature registry and monitoring capabilities to streamline the feature creation workflow. It can also perform feature logging to process large-scale online data, and its training store ensures the point-in-time correctness of historical data. So, contact us now and optimize your ML workflows to get the maximum return on investment (ROI) from your ML projects using Qwak.
Qwak simplifies the productionization of machine learning models at scale. Qwak’s Feature Store and ML Platform empower data science and ML engineering teams to Build, Train and Deploy ML models to production continuously.
By abstracting the complexities of model deployment, integration, and optimization, Qwak brings agility and high velocity to all ML initiatives designed to transform business, innovate, and create a competitive advantage.