GDPR considerations when training machine learning models

Tech innovators and research institutions harness AI and ML to develop new materials and drugs, identify fraud, and safeguard crops, all with the power of data.

Pavel Klushin

Head of Solution Architecture at Qwak

July 13, 2022

By leveraging the capabilities of AI and ML, tech innovators and research institutions are making new materials, discovering new drugs, detecting fraud, protecting crops, and more. We also encounter AI algorithms regularly in our daily lives, from email filters to personalized music suggestions. Very few areas of life today haven’t been touched by AI and ML, and this has all been made possible by data.

We are currently living in the age of “Big Data”. Each day, we create roughly 2.5 quintillion bytes of data, and it’s estimated that by 2025 there’ll be 436 exabytes (or 436 billion gigabytes) of it globally. This is good, because data is the key to innovation as long as we can collect, analyze, and use it. Artificial intelligence (AI) and machine learning (ML) technologies are what make it possible for us to do so at the scale that modern applications demand.

However, new data collection and analytics methods have caught the attention of regulators around the world, and more and more laws and regulations are being introduced that restrict the purposes for which data can be collected and used. A major example is the EU General Data Protection Regulation (GDPR), which took effect in 2018 and has had a significant global impact on how ML teams can use the personal data of people in the EU.

What is the GDPR?

The European Union adopted its Data Protection Directive in 1995, long before most Internet users began to share their data online. After years of discussions and preparations, the EU adopted the GDPR in 2016 to replace this directive, and the regulation took effect in May 2018.

By introducing the GDPR, the EU aimed to harmonize data privacy laws across all its Member States, safeguard personal data when it is transferred abroad, and provide individuals with more control over their personal data. In short, the GDPR applies to data that, either alone or in combination with other data, can identify a person. This includes:

Personally identifiable information (name, address, date of birth, etc.)
Racial and ethnic data
Web-based data (location, IP address, cookies, etc.)
Political opinions
Sexual orientation
Health and genetic data
Biometric data

Generally speaking, if information can be used to identify an individual, privacy and personal data protection rules apply.

Viewed by many as the gold standard in data protection regulation, the GDPR has been the starting point for data protection laws outside of the EU. This includes state-level laws in the United States, such as the California Consumer Privacy Act.

What is the impact of the GDPR?

When the GDPR took effect in May 2018, it had an instant impact on all companies in Europe and on any company that holds the personal data of people in the EU. It doesn’t matter if a company is based outside of the EU, for example in the United States; if it processes or holds the personal data of people in the EU, for example by offering them goods or services, then the GDPR applies.

The tech industry immediately took issue with the stringent rules that the GDPR introduced because they affect two core areas of AI and ML.

First, the GDPR enhances data security by placing strict obligations on companies that collect and process any personal data of people in the EU. Given that most AI systems require large volumes of data to train and learn from, you can see the problem.

Second, the GDPR explicitly addresses what it calls “automated individual decision-making” and “profiling”. Under Article 22 of the GDPR, a person has the right not to be subject to a decision based solely on automated processing, including profiling, if it “produces legal effects” concerning them or similarly significantly affects them. This covers decisions that an AI system makes without human intervention. For example, an AI system might analyze a user’s credit card history to identify spending patterns and then automatically decline a credit application; this would be subject to the GDPR’s rules.
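One common way teams approach this rule is to keep a human in the loop for decisions that carry legal or similarly significant effects. The following is a minimal sketch of that idea, not legal advice and not a complete reading of Article 22 (which also has exceptions). The function name `route_decision`, the `has_legal_effect` flag, and the threshold are all assumptions made for illustration.

```python
def route_decision(score, has_legal_effect, threshold=0.5):
    """Propose an outcome from a model score and decide who finalizes it.

    Hypothetical rule: anything flagged as having a legal (or similarly
    significant) effect is never finalized automatically.
    """
    proposed = "approve" if score >= threshold else "decline"
    if has_legal_effect:
        # The model only proposes; a person reviews and makes the decision.
        return {"proposed": proposed, "status": "pending_human_review"}
    return {"proposed": proposed, "status": "automated"}


# A credit decision has legal effect; a playlist suggestion does not.
credit = route_decision(score=0.31, has_legal_effect=True)
playlist = route_decision(score=0.82, has_legal_effect=False)
```

In practice, deciding what counts as a legal or similarly significant effect is a question for your legal team, not for the code.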

GDPR’s limitations on AI and ML

At the core of the GDPR are six data protection principles, which give rise to a number of challenges for AI and ML. These include:

Fairness and discrimination

The GDPR says that personal data must be processed fairly, with respect for the data subject’s interests. It also places an obligation on the data controller (e.g., the organization whose ML team stores and uses the data) to take measures to prevent discriminatory effects on individuals. Many ML systems are inadvertently trained on biased data and behave in discriminatory ways as a result. ML teams must learn how to detect and mitigate these biases to be compliant with the GDPR.
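Mitigating bias starts with measuring it. As a rough illustration, the sketch below computes per-group selection rates and the gap between them, a metric often called demographic parity difference. The GDPR does not prescribe this or any other metric, and the predictions and group labels here are made-up sample values.

```python
from collections import defaultdict


def selection_rates(predictions, groups):
    """Return the share of positive predictions (1) for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_gap(predictions, groups):
    """Largest difference in selection rate between any two groups."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical model outputs (1 = approved) and a protected attribute.
preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_gap(preds, groups)
```

Here group "a" is approved three times out of four and group "b" once out of four, so the gap is 0.5. A large gap is a signal to investigate the training data and features; it is not proof of unlawful discrimination on its own.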

Purpose limitation

Purpose limitation means that personal data must be collected for specified, explicit purposes, and data subjects must be informed about those purposes. This enables subjects to choose whether to consent. That said, ML systems sometimes use information that’s a by-product of the original data (e.g., using social media data to calculate a user metric). In this situation, the GDPR states that the data can be used if the “further purpose” is compatible with the original one. If it isn’t, a new legal basis, such as additional consent, is required.
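In a training pipeline, one simple way to respect purpose limitation is to filter records by the purposes each data subject has agreed to before the data ever reaches the model. The sketch below assumes consent is the legal basis and that consent is stored per subject as a set of purpose strings; the `ConsentRecord` class and the purpose names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ConsentRecord:
    subject_id: str
    purposes: set = field(default_factory=set)


def filter_by_purpose(rows, consents, purpose):
    """Keep only rows whose data subject consented to `purpose`."""
    allowed = {c.subject_id for c in consents if purpose in c.purposes}
    return [row for row in rows if row["subject_id"] in allowed]


consents = [
    ConsentRecord("u1", {"service_delivery", "model_training"}),
    ConsentRecord("u2", {"service_delivery"}),
]
rows = [
    {"subject_id": "u1", "spend": 120.0},
    {"subject_id": "u2", "spend": 80.0},
]

# Only u1 agreed to model training, so only u1's row is used.
training_rows = filter_by_purpose(rows, consents, "model_training")
```

Subjects with no consent record at all are excluded by default, which is the safer failure mode.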

Data minimization

The GDPR says that collected data should be adequate, relevant, and limited to what is necessary, thus encouraging ML teams to think through the application of their models. Engineers must determine what data, and how much of it, is necessary for a project. Of course, this isn’t always possible to predict, so developers should continuously reassess the type and quantity of data needed to fulfill minimization requirements.
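A practical starting point for minimization is an explicit allowlist of features, combined with replacing direct identifiers with a keyed hash. The sketch below is one possible approach; the field names, the allowlist, and the key are placeholders. Note that pseudonymized data still counts as personal data under the GDPR, so this reduces risk rather than taking the data out of scope.

```python
import hashlib
import hmac

ALLOWED_FEATURES = {"age_band", "purchase_count"}  # hypothetical allowlist


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(record, secret_key):
    """Keep allowlisted features and a pseudonymous ID; drop everything else."""
    slim = {k: v for k, v in record.items() if k in ALLOWED_FEATURES}
    slim["pid"] = pseudonymize(record["user_id"], secret_key)
    return slim


raw = {
    "user_id": "u1",
    "email": "a@example.com",
    "age_band": "30-39",
    "purchase_count": 7,
}
# The key below is a placeholder; a real key belongs in a secrets manager.
clean = minimize(raw, secret_key=b"replace-with-a-managed-secret")
```

An allowlist forces every new feature to be justified before it is collected, which fits the continuous reassessment described above better than a blocklist would.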

Transparency

The GDPR empowers data subjects to decide which of their data is used by third-party controllers. This means you must be open and transparent about why you’re collecting data and what you intend to do with it. Unfortunately, the nature of developing ML systems means that this can be difficult to do. After all, many models behave like black boxes, and it’s not always clear how a model makes decisions, particularly in sophisticated applications.

Developing GDPR-friendly ML models

Regardless of where you’re located or what you may think of the GDPR, you must ensure that all your processes are compliant with it if you’re using the personal data of people in the EU. Those who violate the GDPR could find themselves subject to large fines (up to €20 million or 4% of global annual turnover, whichever is higher), which is a situation that all ML teams should want to avoid.

Thankfully, there are plenty of methods that ML teams can use to go about building GDPR-compliant ML models and AI systems. Here’s a quick look at three of these:

Generative adversarial networks (GANs)

A trend that’s sweeping the ML space right now is using less data more efficiently rather than using lots of data inefficiently. A GAN reduces the need for real training data by learning to generate synthetic data that resembles it. In other words, real samples are used to teach the model what realistic output should look like, and the synthetic output can then stand in for some of the real data.

This method uses two neural networks: a “generator” and a “discriminator”. The generator learns how to produce synthetic samples (such as images) that resemble the real data, while the discriminator learns how to tell the difference between the real data and the synthetic data. The downside, however, is that GANs don’t eliminate the need for real data, because they themselves require a lot of it to be trained properly.
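For readers who want the formal version, the contest between the two networks is usually written as the minimax objective from the original GAN formulation. The symbols are not from this article: G is the generator, D is the discriminator, p_data is the distribution of the real data, and p_z is the noise distribution that the generator samples from.

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to maximize this value by scoring real samples close to 1 and generated samples close to 0, while the generator tries to minimize it by producing samples that the discriminator scores as real.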

Federated learning

In federated learning, personal data is used but it doesn’t actually leave the system that stores it. It’s never collected or uploaded to an AI or ML system’s backend. Instead, the model is trained locally on the system where the data exists, and only the resulting model updates are sent back and merged into the master model.

Although federated learning can avoid many GDPR challenges, it doesn’t avoid all of them. Transparency remains an issue because personal data is still being used for training purposes, and a locally trained ML model will experience more limitations than ML models trained on a dedicated system.
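To make the idea concrete, here is a toy sketch in the style of federated averaging, using a one-parameter model (y = w * x) and plain Python. The client datasets are synthetic, and the function names, learning rate, and number of rounds are arbitrary choices for illustration. A production setup would add secure aggregation, client sampling, and much more.

```python
import random


def local_update(w, data, lr=0.5, epochs=5):
    """Run a few gradient steps for y = w * x on one client's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(global_w, clients, rounds=20):
    """Clients train locally; the server averages the returned weights."""
    total = sum(len(data) for data in clients)
    for _ in range(rounds):
        local_ws = [local_update(global_w, data) for data in clients]
        # Only the weights leave each client; the raw (x, y) pairs stay local.
        global_w = sum(w * len(d) for w, d in zip(local_ws, clients)) / total
    return global_w


def make_client(n):
    """Hypothetical client dataset drawn from y = 3x plus small noise."""
    xs = [random.uniform(-1, 1) for _ in range(n)]
    return [(x, 3 * x + random.gauss(0, 0.01)) for x in xs]


random.seed(0)
clients = [make_client(20), make_client(50), make_client(30)]
w = federated_average(0.0, clients)
```

The server ends up with a weight close to 3 without ever seeing a single raw data point. Weighting each client by its number of samples keeps larger datasets from being underrepresented in the average.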

Transfer learning

Transfer learning enables the reuse of previous work and helps to democratize AI. In transfer learning, an existing pre-trained model is taken as the starting point and retrained (fine-tuned) to meet the current purpose.

Since training starts from a pre-existing model, less data is required. The drawback with transfer learning is that it works best when the previous model has already been trained on a large dataset and is reliable and free of significant biases.
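The sketch below illustrates the most basic form of transfer learning, where a pre-trained feature extractor is frozen and only a small new "head" is trained on a handful of labelled examples. To stay dependency-free, `pretrained_features` is a made-up stand-in for a real pre-trained network, and the dataset is a deliberately tiny toy example.

```python
import math


def pretrained_features(x):
    """Stand-in for a frozen, pre-trained feature extractor (hypothetical)."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, lr=0.5, epochs=200):
    """Fit only a logistic-regression head on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    feats = [pretrained_features(x) for x in samples]  # extractor is never updated
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias) - y
            weights = [w - lr * error * v for w, v in zip(weights, f)]
            bias -= lr * error
    return weights, bias


def predict(x, weights, bias):
    """Classify one input using the frozen features and the trained head."""
    f = pretrained_features(x)
    return int(sigmoid(sum(w * v for w, v in zip(weights, f)) + bias) >= 0.5)


# Six labelled examples for the new task: label is 1 when x0 + x1 > 0.
samples = [(1.0, 0.5), (0.8, 0.9), (-1.0, -0.2), (-0.5, -0.9), (0.3, 0.4), (-0.6, -0.1)]
labels = [1, 1, 0, 0, 1, 0]
weights, bias = train_head(samples, labels)
```

Because the extractor already produces useful features, the head has only three parameters to learn, which is why so few examples are enough. It also means that any bias baked into the pre-trained model carries over to the new one.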

There’s no perfect solution

There’s no perfect solution to GDPR challenges. The three training methods that we’ve mentioned are pretty limited, and while they may comply with some GDPR principles they might contradict others.

The long and short of the situation is that if ML teams want to continue developing advanced ML models, they’re going to have to get comfortable with working GDPR compliance into their workflows, working with synthetic data, or taking a hybrid approach that uses synthetic data where possible and complies with the GDPR where it isn’t.

We recently explored the subject of synthetic data in a blog post. We recommend reading this to learn more about how it can be used to train ML models.