Leveraging Snowflake Data for Machine Learning with Qwak's Feature Store
This article serves as a comprehensive guide for integrating Qwak's Feature Store with the Snowflake Data Cloud. The guide covers essential prerequisites, setting up a connection to Snowflake, defining data sources and entities, and consuming features in both batch and real-time machine learning models.
Machine learning models need well-organized, high-quality features for training and making predictions. Qwak's Feature Store is a central place that makes it easier to go from raw data to usable machine learning features. This guide shows you how to use Qwak's Feature Store with the Snowflake Data Cloud to manage and serve your machine learning features effectively. The Snowflake Data Cloud is a global network where organizations mobilize their data and apps, put AI to work, and collaborate across teams.
What is Qwak's Feature Store?
A Feature Store is a centralized system designed to store, manage, and serve machine learning features. It addresses common challenges in machine learning, such as feature consistency, reusability, and real-time serving.
Qwak's Feature Store stands out by offering seamless integration with the Snowflake Data Cloud. It lets you read raw data from Snowflake, transform it, and store the result for use as offline or online features in machine learning models.
Overall, Qwak's Feature Store provides a powerful solution for managing machine learning features, enabling organizations to build more accurate and effective machine learning models.
Prerequisites
Before diving into the integration process, make sure you have the following set up:
- A Qwak account: You'll need this to access Qwak's Feature Store.
- Qwak SDK installed locally: The SDK is essential for interacting with the feature store.
- Additional Python Libraries: Install `pyarrow` and `pyathena` as they may be required for data manipulation and querying with the Qwak Client.
1. Connecting to Snowflake
For this tutorial, we'll be fetching data from a Snowflake table with the following schema:
Defining the Data Source
To connect to Snowflake, you'll need to define a SnowflakeSource object. This object specifies the connection details and the table to be queried.
- date_created_column: This column serves as the timestamp that Qwak uses to filter data. It's assumed that the data source employs SCD Type 2 dimensions for historical data storage. The column should be of type datetime.
- username and password: These credentials are stored as Secrets within the Qwak platform for enhanced security (refer to the screenshot below for more details).
- host and the rest of the connection details: These parameters inform Qwak where to connect and which resource (table) to access. You can find all these details in your Snowflake account.
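To make the role of `date_created_column` more concrete, here is a minimal sketch in plain Python. It is not Qwak's implementation: it assumes the timestamp column is used to select rows that fall inside an ingestion window, and the row layout, the `DATE_CREATED` column name, and the `rows_in_window` helper are all hypothetical.

```python
from datetime import datetime

# Hypothetical SCD Type 2 style rows: every change to a user is stored as a
# new row stamped with the time it was created.
rows = [
    {"USER_ID": "u1", "PLAN": "free", "DATE_CREATED": datetime(2023, 1, 1, 9, 0)},
    {"USER_ID": "u1", "PLAN": "pro", "DATE_CREATED": datetime(2023, 1, 3, 12, 0)},
    {"USER_ID": "u2", "PLAN": "free", "DATE_CREATED": datetime(2023, 1, 2, 8, 30)},
]


def rows_in_window(rows, column, start, end):
    """Return rows whose timestamp column lies in the half-open window [start, end)."""
    return [row for row in rows if start <= row[column] < end]


# Select only the rows created on 2 or 3 January 2023.
window = rows_in_window(
    rows,
    column="DATE_CREATED",
    start=datetime(2023, 1, 2),
    end=datetime(2023, 1, 4),
)
print([(row["USER_ID"], row["PLAN"]) for row in window])
```

Because each change is kept as its own row, filtering on the creation timestamp is enough to reconstruct what was known during any given window.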
Secure Storage of Credentials
Qwak offers a secret service that allows you to securely store sensitive information like usernames and passwords. This ensures that your credentials are encrypted and managed securely.
Entities are business objects that you want to make predictions about. In this example, we define a user entity.
To test the data source, you can use the `get_sample()` method. It checks the connection to the Snowflake table and validates the data by retrieving a sample of the first 10 rows.
The result should be:
Registering the Data Source
Once you've verified the data sample, the next step is to register the Data Source and Entity in Qwak. This can be done effortlessly using Qwak's CLI as shown below:
The `-p` flag specifies the file where Qwak should look for Feature Store definitions.
The command output should look something like this:
Beyond testing the data source locally, you can now see it in your Qwak Dashboard and reference it in your FeatureSets.
2. Transforming Snowflake Data into Reusable Feature Vectors
In Qwak, feature sets are either SQL queries or Python functions designed to transform raw data into usable features. These feature sets can be scheduled for regular updates and can also be backfilled to generate historical features.
Defining the Feature Set
When defining a FeatureSet, consider the following components:
- `name`: This identifies the FeatureSet when you're consuming features.
- `entity`: This sets the unique key for each feature vector, which in this example is a registered user.
- `data_source`: Specifies where to pull the raw data from. This should have been defined in the previous step.
- `timestamp_column_name`: This is the column that Qwak uses to sift through historical data.
Scheduling and Backfilling
You can schedule a FeatureSet to run at regular intervals using cron scheduler syntax. For instance, in this example the `user-features` FeatureSet is set up to fetch new data every day at 8:00 AM.
The backfill option is used only when registering the feature set. It tells Qwak how far back in time to fetch historical data for the FeatureSet.
Finally, `user_churn_features` is a method that returns an SQL-based transformation. This lets you filter, transform, and customize the FeatureSet's schema and data.
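The daily 8:00 AM schedule and the backfill window described above can be sketched in plain Python. This is an illustration only, not Qwak's scheduler: it assumes the schedule is written as the standard five-field cron expression `0 8 * * *`, and it assumes that backfilling replays the same daily cadence from the backfill start date. The helper names and the dates are hypothetical.

```python
from datetime import datetime, timedelta

# Standard five-field cron: minute hour day-of-month month day-of-week.
DAILY_8AM_CRON = "0 8 * * *"


def daily_run_time(cron_expression):
    """Extract (hour, minute) from a cron expression that fires once per day."""
    minute, hour, day_of_month, month, day_of_week = cron_expression.split()
    if (day_of_month, month, day_of_week) != ("*", "*", "*"):
        raise ValueError("this sketch only handles once-a-day expressions")
    return int(hour), int(minute)


def scheduled_runs(cron_expression, backfill_start, now):
    """List every daily run time from the backfill start up to and including `now`."""
    hour, minute = daily_run_time(cron_expression)
    run = backfill_start.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if run < backfill_start:
        # The first slot already passed on the start day, so begin the next day.
        run += timedelta(days=1)
    runs = []
    while run <= now:
        runs.append(run)
        run += timedelta(days=1)
    return runs


runs = scheduled_runs(
    DAILY_8AM_CRON,
    backfill_start=datetime(2023, 1, 1),  # hypothetical backfill start date
    now=datetime(2023, 1, 3, 12, 0),      # hypothetical current time
)
for run in runs:
    print(run.isoformat())
```

An earlier backfill start date simply produces more historical runs, which is why the backfill setting matters only at registration time.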
Because we already registered the Entity and DataSource, we can now query a sample for this FeatureSet to validate it works as expected.
And the sample should look something like the following:
Registering the Feature Set
As with the DataSource, registering a FeatureSet is straightforward:
The `-p` flag specifies the file where Qwak should look for Feature Store definitions.
Once you've set up the FeatureSet, you should see it reflected in the Qwak Dashboard. At this point, the data ingestion and processing pipeline should have already kicked off.
By registering the FeatureSet, Qwak stores the resulting data in two types of stores: an Offline Store and an Online Store.
- Qwak Offline Store: This store uses Apache Iceberg, a high-performance open table format, on top of object storage, and it is optimized for batch consumption.
- Qwak Online Store: This store is built on an in-memory cache database, enabling low-latency feature retrieval. It's particularly useful for real-time predictions.
3. Consuming Features for ML Model Training
Most modern ML models are trained in batches, often referred to as offline training. In this section, we'll demonstrate how to consume features from Qwak's Offline Feature Store for model training.
To retrieve features from the Offline Store, you'll use Qwak's OfflineClient. This requires a key-to-features mapping dictionary, along with start and end datetime values to specify the data fetching range.
The key_to_features mapping dictionary should follow this format, where the listed features are the ones used for model training or prediction:
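As a rough sketch of these inputs, the snippet below builds a mapping from the entity key to fully qualified feature names, plus a start and end datetime. It assumes features are referenced as `<feature-set-name>.<feature-name>` and keyed by the `user_id` entity key. The feature names, the dates, and the `qualify` helper are hypothetical.

```python
from datetime import datetime, timedelta

FEATURE_SET = "user-features"  # FeatureSet name used in this guide


def qualify(feature_set, feature_names):
    """Prefix each feature with its FeatureSet name: '<feature-set>.<feature>'."""
    return [f"{feature_set}.{name}" for name in feature_names]


# Hypothetical feature names; replace them with the columns your FeatureSet produces.
key_to_features = {
    "user_id": qualify(FEATURE_SET, ["avg_purchase", "days_since_signup"]),
}

# Hypothetical 30-day training window.
end_date = datetime(2023, 6, 1)
start_date = end_date - timedelta(days=30)

print(key_to_features)
print(start_date.date(), "to", end_date.date())
```

This sketch only prepares the three inputs described above. It does not call `OfflineClient` itself.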
Running the code snippet above will return the following features sample:
4. Consuming Features for Real-Time Predictions
For real-time predictions, latency is a critical factor. In such cases, you should use Qwak's Online Store for feature retrieval.
The OnlineClient serves as the query interface for Qwak's Online Store, offering fast feature retrieval.
To use the get_feature_values method, you'll need to specify two things:
- ModelSchema: This is generally used to define what the ML model's inference endpoint should expect during a prediction request. Here, it's also used to inform the OnlineClient which features are needed for your model.
- Query DataFrame: This is straightforward; in our example, it contains the user_id entity key to filter results.
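Conceptually, the two inputs combine into a keyed lookup. The sketch below illustrates that idea with a plain dictionary standing in for the Online Store. It does not use the `OnlineClient` API, and the stored values, the feature names, and the `lookup` helper are hypothetical. Returning `None` for unknown keys is a choice made for this sketch, not documented Qwak behavior.

```python
# In-memory stand-in for an online store: entity key -> latest feature values.
online_store = {
    "u1": {"user-features.avg_purchase": 42.5, "user-features.days_since_signup": 120},
    "u2": {"user-features.avg_purchase": 7.0, "user-features.days_since_signup": 15},
}

# Features the model declares it needs (the role ModelSchema plays above).
requested_features = ["user-features.avg_purchase"]

# One record per prediction request, carrying the entity key
# (the role the query DataFrame plays above).
query_rows = [{"user_id": "u2"}, {"user_id": "unknown"}]


def lookup(store, features, rows, key="user_id"):
    """Return one feature vector per query row; unknown keys yield None values."""
    results = []
    for row in rows:
        stored = store.get(row[key], {})
        vector = {key: row[key]}
        vector.update({name: stored.get(name) for name in features})
        results.append(vector)
    return results


vectors = lookup(online_store, requested_features, query_rows)
for vector in vectors:
    print(vector)
```

Only the requested features are returned, even though the stand-in store holds more, which mirrors how the schema narrows the lookup to what the model needs.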
The output should look something like this:
Troubleshooting
This section covers common issues you might encounter and how to resolve them.
FeatureSet Data Pipeline Fails
If your data ingestion pipeline fails, the first step is to consult the logs for clues about the failure. Navigate to the 'Feature Set Jobs' section in the Qwak Dashboard, as shown below.
Feature Retrieval fails for the Online or Offline Store
If you find that the Offline or Online client isn't retrieving any rows for a given key, you can verify the data in the Qwak UI under the 'FeatureSet Samples' section using an SQL query. For more detailed troubleshooting steps, refer to our documentation.
Note: When constructing your query, make sure to enclose column names in double quotes and prefix them with <feature-store.feature>, as shown in the example below.
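A small helper can keep the quoting consistent. The sketch below assumes the prefix means `<feature-set-name>.<feature-name>` and wraps each column in double quotes. The table name, the feature names, and both helper functions are hypothetical, so adjust them to what the Qwak UI actually shows for your FeatureSet.

```python
def quoted_column(feature_set, feature):
    """Build a double-quoted, FeatureSet-prefixed column name."""
    name = f"{feature_set}.{feature}"
    if '"' in name:
        raise ValueError("column names must not contain double quotes")
    return f'"{name}"'


def build_sample_query(feature_set, features, table, limit=10):
    """Assemble a SELECT statement over the quoted, prefixed columns."""
    columns = ", ".join(quoted_column(feature_set, feature) for feature in features)
    return f'SELECT {columns} FROM "{table}" LIMIT {int(limit)}'


query = build_sample_query(
    feature_set="user-features",
    features=["avg_purchase", "days_since_signup"],  # hypothetical feature names
    table="user-features",                           # hypothetical table name
)
print(query)
```

The double quotes matter because a name such as `user-features.avg_purchase` contains a hyphen and a dot, which most SQL dialects would otherwise parse as operators or qualifiers.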
In this comprehensive guide, we've walked you through the process of integrating Qwak's Feature Store with Snowflake to manage and serve machine learning features effectively. From setting up prerequisites to defining entities and feature sets, we've covered all the essential steps. We also delved into the specifics of consuming features for both batch and real-time machine learning models.
By now, you should have a solid understanding of how to leverage Qwak's Feature Store in conjunction with Snowflake's data warehousing capabilities. Whether you're looking to fetch features for offline batch training or need low-latency feature retrieval for real-time predictions, Qwak's dual storage system has you covered.
Thank you for reading, and we hope this guide empowers you to build more accurate and efficient machine learning models.
Learn more about how to build a full-blown ML application with your Snowflake data in this blog post.