
5 must-haves for any ML data catalog

Although phrases like “data is an asset” and “data is the new oil” are arguably corporate clichés, they are accurate. As businesses continue to digitize and derive real value from data, its importance to them grows; it becomes part of their DNA.

Pavel Klushin

Head of Solution Architecture at Qwak

May 20, 2022


To extract this value from data, however, businesses must first know what they are looking at. And although this sounds simple enough, it has become a huge barrier to businesses both big and small. This is where data catalogs come in—they help businesses keep track of the data that they have.

What is a data catalog?

A data catalog is exactly what it sounds like. It is an organized inventory of a business’s data assets. Data catalogs are useful for helping organizations manage their data. They also help data professionals collect, organize, access, and enrich data to support discovery and governance.

Let’s expand on this with an analogy.

When you go to the library and you need to find a book, you use its catalog to find out whether the book you are looking for is there, where it can be found, what edition it is, and more. In short, it holds all the information that you might need to decide whether you want the book and, if you do, how you can find it.

Now imagine that you have a catalog that covers every single library in the country, and that from a single interface you can find every library in the country that stocks the book that you are looking for, and you can find all the details that you could possibly ever need.

That’s what a data catalog does. It gives businesses a single, clear view of and deeper visibility into all their data.

Sounds good—but why?

Because of volume. With more data than ever before, finding the right data has become much more difficult. On top of that, more stringent rules and regulations, such as the EU General Data Protection Regulation (GDPR), now govern how data is managed.

When used correctly, a data catalog can help you organize and manage data more efficiently. Instead of searching high and low for random nuggets of data, a user can quickly and easily find exactly what they are looking for. And in machine learning applications, a data catalog goes a step further by helping users identify which data has affected which model.

Indeed, businesses are building more and more machine learning (ML) models, each iteration of which will use different snapshots of data. Knowing what data has been used with what model is a major challenge that needs to be solved beforehand; it can never be an afterthought.

In addition to all this, there is also the money aspect. One of the most compelling reasons for adopting a machine learning data catalog is the potential boost to your bottom line. Data catalogs directly contribute towards reducing time to insight, increasing business engagement, and improving the utilization of your datasets.

Machine learning data catalog requirements

Data catalogs are a must-have for any business that is serious about building powerful and reliable ML models. Any data catalog that you use should ideally meet the following requirements:

Include automatic tracking
Be searchable
Be auditable
Enable change tracking
Include API access for integration
Built with collaboration in mind

Below, we are going to look at each of these requirements in turn.

Include automatic tracking

When it comes to building ML models, automation is the goal. There is little point in asking your ML teams to track their data and experiments in an Excel spreadsheet. A data catalog must therefore automatically be able to track any data assets that are being used to train a model.

Imagine a pipeline that takes data from a Spark cluster, augments it with data from an S3 bucket of flat files, normalizes it, trains hundreds of ML models, and automatically selects the best one for deployment—that is the level of automation that you need to be looking for in a data catalog.
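As a rough illustration of what automatic tracking means in practice, the sketch below wraps a training step so that every dataset passed to it is fingerprinted and logged without the ML team doing anything by hand. The in-memory CATALOG list, the track_inputs decorator, and the fingerprint helper are hypothetical names invented for this example; they are not part of any particular catalog product.

```python
import functools
import hashlib
from datetime import datetime, timezone

# Hypothetical in-memory catalog; a real catalog would persist these records.
CATALOG = []

def fingerprint(rows):
    """Return a short, stable hash of a dataset's contents."""
    digest = hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()
    return digest[:12]

def track_inputs(model_name):
    """Decorator that records every dataset passed to a training step."""
    def decorator(train_fn):
        @functools.wraps(train_fn)
        def wrapper(**datasets):
            CATALOG.append({
                "model": model_name,
                "datasets": {name: fingerprint(rows) for name, rows in datasets.items()},
                "recorded_at": datetime.now(timezone.utc).isoformat(),
            })
            return train_fn(**datasets)
        return wrapper
    return decorator

@track_inputs("churn-model")
def train(events, profiles):
    # Stand-in for real training: just report how many rows were seen.
    return len(events) + len(profiles)

rows_seen = train(events=[(1, "click"), (2, "view")], profiles=[(1, "alice")])
```

After the call, CATALOG holds one record naming the model and the fingerprint of each input dataset, which is the kind of entry a spreadsheet would otherwise have to capture manually.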

Be searchable

A data catalog that can’t be searched is just… well, useless. What is the point of organizing data if you can’t then intuitively search for it?

When choosing a data catalog, less is not more. Ensure that whatever catalog you are looking at has powerful, granular-level native search functionality so that you can effortlessly find a specific data asset, no matter how far along the pipeline your project is or how many input and output files are connected to an experiment.
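To make "granular search" a little more concrete, here is a minimal sketch of filtering catalog entries by free text and by tags at the same time. The Asset class, the search function, and the sample entries are assumptions made up for illustration; real catalogs typically index far richer metadata.

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str
    description: str
    tags: set = field(default_factory=set)

def search(assets, text="", tags=()):
    """Return assets whose name or description contains `text`
    (case-insensitive) and that carry every tag in `tags`."""
    needle = text.lower()
    required = set(tags)
    return [
        a for a in assets
        if needle in f"{a.name} {a.description}".lower() and required <= a.tags
    ]

# Hypothetical catalog entries.
ASSETS = [
    Asset("orders_raw", "Raw order events from the web shop", {"raw", "orders"}),
    Asset("orders_clean", "Deduplicated orders used for training", {"clean", "orders"}),
    Asset("users_clean", "Anonymized user profiles", {"clean", "pii-removed"}),
]

matches = search(ASSETS, text="orders", tags=["clean"])
```

Combining a text match with a tag filter is what lets a user narrow "everything about orders" down to the one cleaned dataset that fed an experiment.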

Be auditable

We don’t need to explain to you how ML models are the result of complex sets of pipeline steps. Over the course of building a model, your data is cleaned, normalized, transformed, anonymized, and analyzed for bias, and every step in the process must be tracked so that you can show lineage from the original data source to the final model in deployment.

Any data catalog that you use must therefore include automatic data tracking so that every data transformation and its result can be stored and used for reference later. Ideally, your catalog will enable you to track the history of a model’s lifecycle and take a dive into any transformation at any time.
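The lineage idea can be sketched as a simple chain of records, where each artifact points to the step that produced it and the artifact it was derived from. The LINEAGE mapping, the artifact names, and the trace function below are hypothetical; the sketch also assumes each artifact has a single parent, whereas real pipelines often merge several inputs.

```python
# Hypothetical lineage records: each artifact maps to
# (step that produced it, the input artifact it was derived from).
LINEAGE = {
    "model_v3": ("train", "features_v3"),
    "features_v3": ("normalize", "anonymized_v3"),
    "anonymized_v3": ("anonymize", "cleaned_v3"),
    "cleaned_v3": ("clean", "raw_orders"),
}

def trace(artifact, lineage):
    """Walk from an artifact back to its original source,
    returning the (step, input) pairs encountered along the way."""
    path = []
    while artifact in lineage:
        step, parent = lineage[artifact]
        path.append((step, parent))
        artifact = parent
    return path

history = trace("model_v3", LINEAGE)
source = history[-1][1]  # the artifact with no recorded parent
```

Walking the chain from the deployed model back to raw_orders is the audit trail described above: every transformation is visible and can be inspected in order.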

Enable change tracking

Data changes over time, so you will always be building new and more advanced models that account for these changes. You must be able to track the lifetime of these models and how their data has changed.

The reason for this is simple: a model that went into production a year ago and was trained on two-year-old data can behave very differently from a model built today and trained on data that’s six months old. Your data catalog should therefore be powerful enough to track every model over its entire lifetime and make it easy to visualize changes.
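One way to picture change tracking is as a diff between the data snapshots behind two versions of the same model. The snapshot fields, the sample values, and the diff_snapshots function below are invented for illustration only; an actual catalog would record and compare much more, such as distributions and schema types.

```python
from datetime import date

# Hypothetical snapshot records for two versions of one model.
SNAPSHOTS = {
    "v1": {
        "data_as_of": date(2020, 5, 1),
        "rows": 120_000,
        "columns": {"age", "plan", "tenure"},
    },
    "v2": {
        "data_as_of": date(2021, 11, 1),
        "rows": 310_000,
        "columns": {"age", "plan", "tenure", "region"},
    },
}

def diff_snapshots(old, new):
    """Summarize how the training data changed between two model versions."""
    return {
        "days_between": (new["data_as_of"] - old["data_as_of"]).days,
        "row_delta": new["rows"] - old["rows"],
        "added_columns": sorted(new["columns"] - old["columns"]),
        "removed_columns": sorted(old["columns"] - new["columns"]),
    }

changes = diff_snapshots(SNAPSHOTS["v1"], SNAPSHOTS["v2"])
```

A summary like this is the raw material for the visualizations mentioned above: how much newer the data is, how much more of it there is, and which fields appeared or disappeared.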

Include API access for integration

As we have said, data is your most valuable asset, and you shouldn’t be relying on an outdated system that ties you in.

While backups, integrations, and private cloud are all great, you can’t beat a data catalog that includes full API access. This enables you to stay in complete control of your data and integrate your catalog with any external system and, if the time comes, move to a different system. When looking for a data catalog, look for options that include a full open API so that you can benefit from full freedom and control.

Built with collaboration in mind

In the post-pandemic world, distributed remote teams are becoming the norm. Your data catalog should therefore facilitate collaboration across different teams and locations with built-in functionality for things like in-line chats, annotations, comments, and more.

Get started with a machine learning data catalog

ML data catalogs enable real-time data discovery and automatic, contextual organization of datasets. By using data catalogs, businesses can build a single source of truth for all their data, track lineage, collaborate, and search and access the right data through a single, intuitive dashboard.

Modern, augmented data catalogs also facilitate collaboration between dispersed teams and geographies, empower all data users within a business to make data-driven decisions, simplify management and governance, and enable data democratization.

While evaluating the different data catalogs available on the market, key things to look out for are automated ingestion, searchability and auditability, inventorying, tagging, profiling, and lineage mapping.

There are diverse challenges in machine learning. Why continue your search when the solution is right in front of you, though?

The Qwak platform includes everything that you need to build and deploy powerful machine learning models at scale. Qwak is a fully managed platform that unifies ML engineering and data operations, providing agile infrastructure that enables the continuous productionization of ML models.