Real-Time Feature Sets in Feature Store

Alon Lev

Co-Founder & CEO at Qwak

June 20, 2023

What are real-time feature sets?

Real-time features are calculated at request time rather than in advance from a defined data source such as BigQuery or Kafka.
As a result, the raw data for the requested feature must either arrive in the request itself or be fetched directly by the feature set definition (code), for example by calling an external API or running an ad-hoc query against a database.

In addition, real-time features have an optional, user-defined staleness parameter. It sets how old a stored value may be before it is recalculated, so features stay up to date even when they were not calculated in advance.
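To make the staleness idea concrete, here is a minimal sketch of how a string such as '1m' could be interpreted and checked. It is not Qwak's implementation: the helper names `parse_staleness` and `is_stale` and the set of unit suffixes are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed unit suffixes; only the '1m' form appears in this article.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_staleness(spec: str) -> timedelta:
    """Turn a string such as '1m' or '24h' into a timedelta."""
    value, unit = int(spec[:-1]), spec[-1]
    if unit not in _UNITS or value < 0:
        raise ValueError(f"unsupported staleness spec: {spec!r}")
    return timedelta(**{_UNITS[unit]: value})


def is_stale(computed_at: Optional[datetime], now: datetime, max_staleness: str) -> bool:
    """A value is stale if it was never computed or is older than the limit."""
    if computed_at is None:
        return True
    return now - computed_at > parse_staleness(max_staleness)


now = datetime(2023, 6, 20, 12, 0, 0)
print(is_stale(now - timedelta(seconds=30), now, "1m"))  # False
print(is_stale(now - timedelta(minutes=5), now, "1m"))   # True
```

A value that was never computed is treated as stale here, which matches the idea that a missing value and an outdated value both trigger a fresh calculation.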

Computed real-time feature values are sent to the model and automatically written to the feature store so that future requests can reuse them.

When should I use real-time feature sets?

The main use case for real-time feature sets is when data manipulation needs to be part of model inference, but the data returned from the Feature Store is missing, stale, or needs to be enriched. At inference time you do not want to (or cannot) rely on a pre-calculated value, yet you still want the transformation to be reusable and managed in a single location.

Real-time calculation

You work at a fintech company and need to convert amounts in different currencies to their USD equivalent before sending them to the model. The transaction amount is taken from the transaction itself, and you want every model in your organization to perform exactly the same currency conversion.
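As a rough illustration of this use case, the sketch below keeps the conversion in one function that any model could call. The rate table, the request field names, and the function name `amount_in_usd` are hypothetical placeholders, not part of Qwak or of any real rates source.

```python
from decimal import Decimal, ROUND_HALF_UP

# Placeholder rates for illustration only; a real feature would load current rates.
USD_RATES = {"USD": Decimal("1"), "EUR": Decimal("1.10"), "GBP": Decimal("1.25")}


def amount_in_usd(amount: str, currency: str) -> Decimal:
    """Convert a transaction amount to USD, rounded to cents."""
    try:
        rate = USD_RATES[currency.upper()]
    except KeyError:
        raise ValueError(f"no rate for currency {currency!r}") from None
    return (Decimal(amount) * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# The raw values arrive in the inference request itself.
request = {"transaction_amount": "100.00", "currency": "eur"}
print(amount_in_usd(request["transaction_amount"], request["currency"]))  # 110.00
```

Using `Decimal` instead of floats avoids rounding surprises with monetary amounts, and a single shared function is what keeps the conversion identical across models.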

External API

You work at an insurance company, and every time a new customer onboards to your platform you invoke an external API to run a background check. As in the previous use case, your goal is to standardize access to the external API while minimizing the number of calls. With a real-time feature in Qwak, if a request for a specific customer was already served in the last 24 hours, the stored result can be reused instead of calling the API again.

How do I use real-time feature sets?

Code example - Plaid API (Real-time) with a Snowflake batch connection


# Imports for the Qwak feature-set helpers (feature_set, realtime, batch,
# QwakFeatureSet, SparkSqlTransformation) and for the Plaid client
# (PlaidClient, TransactionsGetRequest) are omitted here.
from typing import Dict, List


@feature_set(name="transactions-realtime-fs")
class Realtime(QwakFeatureSet):

    # Real-time extractor: fetches transactions for the given accounts from
    # Plaid at request time, with a maximum staleness of one minute
    @realtime(max_staleness='1m')
    def extract_transaction_category(self, accounts_ids: List[str]) -> List[Dict]:
        transaction_requests = TransactionsGetRequest(accounts=accounts_ids)
        return PlaidClient().transactions_get(transaction_requests)

    # Batch job that runs every 30 minutes and updates the online & offline stores
    @batch.data_sources(data_sources=["snowflake_plaid_transactions"])
    @batch.scheduling(cron_expression="*/30 * * * *")
    def transform(self):
        return SparkSqlTransformation(sql="""
            SELECT user_id,
                   transaction_amount
            FROM snowflake_plaid_transactions""")
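The cron expression in the example, `*/30 * * * *`, fires at minute 0 and minute 30 of every hour. The following sketch expands that minute field and computes the next run time; it only handles expressions of the form `*/N * * * *`, and the helper names are made up for this illustration.

```python
from datetime import datetime, timedelta
from typing import List


def expand_minute_step(field: str) -> List[int]:
    """Expand a cron minute field of the form '*/N' into the matching minutes."""
    if not field.startswith("*/"):
        raise ValueError("this sketch only handles '*/N' minute fields")
    step = int(field[2:])
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(0, 60, step))


def next_run(after: datetime, cron_expression: str) -> datetime:
    """Next run strictly after `after`, for '*/N * * * *' expressions only."""
    minute_field, *rest = cron_expression.split()
    if rest != ["*", "*", "*", "*"]:
        raise ValueError("only '*/N * * * *' is supported in this sketch")
    minutes = expand_minute_step(minute_field)
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    while candidate.minute not in minutes:
        candidate += timedelta(minutes=1)
    return candidate


print(expand_minute_step("*/30"))  # [0, 30]
print(next_run(datetime(2023, 6, 20, 12, 7), "*/30 * * * *"))  # 2023-06-20 12:30:00
```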

More insights about real-time feature sets

How should I get training data when I'm using real-time features?

Training data for real-time features currently comes from the offline store. Real-time functionality is incorporated as part of a batch feature set, so training runs against the offline store rather than directly against real-time data.
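One common way to build training rows from an offline store is to take, for each labelled event, the latest feature value recorded at or before the event time. The sketch below shows that pattern with an in-memory stand-in; the article does not state that Qwak selects values this way, so treat the selection rule, the data layout, and the name `value_as_of` as assumptions.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# Stand-in offline store: per key, (timestamp, value) pairs sorted by timestamp.
OFFLINE: Dict[str, List[Tuple[datetime, float]]] = {
    "user-1": [
        (datetime(2023, 6, 1), 20.0),
        (datetime(2023, 6, 10), 35.0),
    ],
}


def value_as_of(key: str, event_time: datetime) -> Optional[float]:
    """Latest stored value at or before event_time, or None if there is none."""
    history = OFFLINE.get(key, [])
    idx = bisect_right([ts for ts, _ in history], event_time)
    return history[idx - 1][1] if idx else None


# Each label is (key, event time, target); features are looked up as of the event.
labels = [("user-1", datetime(2023, 6, 5), 0), ("user-1", datetime(2023, 6, 15), 1)]
training_rows = [(value_as_of(k, t), y) for k, t, y in labels]
print(training_rows)  # [(20.0, 0), (35.0, 1)]
```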

What happens if I don't provide the data needed for a real-time calculation?

If the real-time function is invoked and returns no data, null values are provided. The real-time function is called when the requested key is either outdated or does not exist in the store; if the function has no relevant data for that key, null values are returned instead.
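The null-value behaviour can be pictured with a small sketch: look for a fresh stored row, fall back to the real-time function, and fill any feature that is still missing with `None`. The feature names, the `resolve_features` helper, and the `lookup` function are illustrative assumptions rather than Qwak APIs.

```python
from typing import Callable, Dict, Optional

FEATURES = ["transaction_amount", "transaction_category"]


def resolve_features(key: str,
                     fresh_store: Dict[str, Dict[str, object]],
                     realtime_fn: Callable[[str], Optional[Dict[str, object]]],
                     ) -> Dict[str, Optional[object]]:
    """Return stored values if present; else call the real-time function; else nulls."""
    row = fresh_store.get(key)          # missing here means absent or stale
    if row is None:
        row = realtime_fn(key) or {}    # the function ran but may have found nothing
    return {name: row.get(name) for name in FEATURES}


def lookup(key: str) -> Optional[Dict[str, object]]:
    """Hypothetical real-time function that only knows one key."""
    return {"transaction_amount": 42.0} if key == "user-1" else None


print(resolve_features("user-1", {}, lookup))
# {'transaction_amount': 42.0, 'transaction_category': None}
print(resolve_features("user-2", {}, lookup))
# {'transaction_amount': None, 'transaction_category': None}
```

A model that consumes these features therefore needs to tolerate nulls, for example through imputation or a default value.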

Can I see feature lineage in real-time?

Yes. Because a real-time feature is implemented as a batch feature set with an added real-time function, its feature lineage can be observed and traced back to its source.

About Qwak

Qwak has transformed the MLOps lifecycle, enabling practitioners to scale their models into production faster than ever. The end-to-end platform reliably handles data transformation, storage, pipelines, build and deploy, and encompasses a next generation of MLOps tooling for businesses.