Big Data | 10 min read | 3 April 2023

GDPR, the forgotten done right

Marcin Szymaniuk

CEO | Senior Data Engineer

Big data and GDPR, user's right to be forgotten

In the previous article, we covered anonymization and pseudonymization – techniques used in the context of ensuring data privacy, and more specifically, in the context of GDPR. While the basic idea and goal of anonymization are intuitive, GDPR introduces more subtle regulations, such as the Right to Be Forgotten. This concept is tricky because it is subject to interpretation. The complexity is further compounded by technical implementation details within a complex ecosystem that handles large datasets.

Open to interpretation

If you possess customer data and the customer revokes their consent, requesting to be forgotten, you must take action. You need to ensure that you remove the data. What’s tricky about this? There are at least two aspects to consider:

1. If you decide to simply delete the data, ensure that the deletion actually occurs on a physical level. Modern data platforms excel at abstracting certain operations, so just because you think you deleted the data doesn’t mean it’s unrecoverable or that it’s no longer present in your systems.

2. Alternatively, you might not need to remove the data if you choose to anonymize it instead. In this case, you must ensure that you correctly identify all the Personally Identifiable Information (PII) fields and that the anonymization process is accurate. More on this topic follows below.

The details count: you have to pay attention to the deletion behaviour of each tool you use.

The devil lies in the implementation

Deleting data when a customer revokes their consent may seem appealing from a security perspective. The assumption is that once the data is gone, you are safe, right? While this is true, the deletion process itself is not so straightforward. The complexity arises from the varied ways in which tools within the Data & Analytics ecosystem handle data removal. The deletion behavior in these tools does not always align with the general intuition of ‘deletion.’

Most modern data processing tools are optimized for ingestion and analytics queries, and their ability to perform fine-grained data deletion is limited. Sometimes, this means that you cannot simply remove individual records. At other times, it means that the deletion is not immediate. Finally, it might be that the data is merely marked as unavailable, but not actually deleted.

Do you see how some of these options might not be acceptable from a regulatory perspective? Therefore, it is crucial to consider technical limitations when deciding on a strategy for the Right to Be Forgotten.

Below, we will explore some commonly used data processing tools:

1. Traditional relational databases – These databases are good at operations involving single rows at a time, making it easy and safe to delete specific rows.

2. Hadoop and HDFS – Files in HDFS cannot be modified in place, so it is impossible to delete a single record. If you want to remove the data of a single user, you need to rewrite the entire dataset. The rewrite comes with its own considerations, but the main one is whether it’s acceptable from a performance and cost perspective to rewrite the entire dataset just to remove a single record.

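3. Delta, Hudi, Iceberg – These table formats allow for the deletion of individual records, but you need to understand that a ‘deletion’ does not immediately erase the data. Depending on the format and its configuration, it either writes a delete marker or produces new data files, while the old files remain in storage and stay reachable through earlier table versions. If you want to ensure that the data physically disappears eventually, you need to carefully design the compaction and cleanup strategy (for example, VACUUM in Delta) specific to the storage you are using.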

4. NoSQL databases – Each NoSQL database has its own way of handling deletion. For instance, Cassandra uses tombstones – data-deletion markers – which lead to similar concerns as described in the previous point. A carefully designed compaction strategy is a must.

5. BigQuery – Large-scale deletion is not what BigQuery is optimized for, which is why the tool has quotas for these kinds of operations. You also need to be aware that the data does not disappear immediately. That’s why it’s worth considering a full rewrite or crypto-shredding (described later in this article) in certain scenarios.
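To make the difference between logical and physical deletion more tangible, here is a minimal sketch in Python. It is a toy model, not the API of any of the tools listed above: the `TombstoneStore` class and its method names are made up for illustration. It mimics an append-only store in which a delete only appends a marker, and the data physically disappears only after compaction.

```python
class TombstoneStore:
    """Toy append-only store (hypothetical): deletes add a marker,
    compaction rewrites the log without the deleted records."""

    def __init__(self):
        # The list stands in for immutable files on disk.
        self._log = []

    def put(self, key, value):
        self._log.append(("put", key, value))

    def delete(self, key):
        # Logical delete: nothing is removed, a tombstone is appended.
        self._log.append(("tombstone", key, None))

    def get(self, key):
        # The most recent entry for a key decides what a reader sees.
        for op, k, value in reversed(self._log):
            if k == key:
                return value if op == "put" else None
        return None

    def physically_contains(self, key):
        # True while any version of the record is still stored.
        return any(op == "put" and k == key for op, k, _ in self._log)

    def compact(self):
        # Keep only the latest live version of each key and drop
        # tombstones together with the records they hide.
        latest = {}
        for op, k, value in self._log:
            latest[k] = (op, value)
        self._log = [("put", k, v) for k, (op, v) in latest.items() if op == "put"]


if __name__ == "__main__":
    store = TombstoneStore()
    store.put("user-42", {"email": "jane@example.com"})
    store.delete("user-42")
    print(store.get("user-42"))                  # the reader no longer sees the record
    print(store.physically_contains("user-42"))  # but it is still stored
    store.compact()
    print(store.physically_contains("user-42"))  # gone only after compaction
```

The point of the sketch is the gap between `delete` and `compact`: in a real system, how long that gap lasts depends on the compaction or cleanup settings of the specific tool.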

Does this cover all options?

Unfortunately, the variety of tools makes it impossible to mention them all in a single article. You have to carefully investigate your ecosystem in the context of the Right to Be Forgotten. I only mention some of the commonly used tools to illustrate the complexity of the problem.

Anonymised data does not equal forgotten. The physical evidence remains, like forgotten luggage that anyone can look through to de-anonymise the data again.

The right to be forgotten is not fulfilled if the data is not erased on a physical level. The data is then simply hidden, left behind until somebody finds it.

Does anonymised = forgotten?

Deleting records is not the only option for forgetting a customer. You can consider anonymizing the records of customers who revoke their consent. Anonymizing the dataset has the obvious benefit of retaining some of the data (without PII) so it can be used for analytics purposes in the future.

However, remember that you need to be very careful when designing the architecture and procedures. Things to keep in mind:

1. Anonymization should be a non-reversible process.

2. You have to correctly identify and anonymize the PII fields.

3. Anonymizing just PII fields might not be enough. You have to ensure that you prevent indirect identification of the user’s data. More about this will be covered in the next article.

4. You must have a formal process in place and follow it when handling your datasets. Automate as much as possible to limit the chance of human errors and privacy breaches.

5. Naive anonymization of individual records comes with problems similar to deletion itself. You have to consider the properties of the analytics tools in your stack before deciding on this approach.
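The first two points, plus the call for automation, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names in `PII_FIELDS` and both function names are hypothetical, and the sketch drops PII fields outright instead of hashing them, since a hash of an email address or phone number can often be reversed by guessing inputs. It does not address indirect identification.

```python
# Hypothetical list of fields classified as PII for this dataset.
PII_FIELDS = frozenset({"customer_id", "name", "email", "phone"})


def anonymize_record(record, pii_fields=PII_FIELDS):
    """Return a copy of the record without PII fields.

    Dropping the values is non-reversible as long as the original
    record is physically removed afterwards.
    """
    return {k: v for k, v in record.items() if k not in pii_fields}


def assert_no_pii(record, pii_fields=PII_FIELDS):
    """Automated check that can run in a pipeline after anonymization."""
    leftover = pii_fields & record.keys()
    if leftover:
        raise ValueError(f"PII fields still present: {sorted(leftover)}")


if __name__ == "__main__":
    original = {
        "customer_id": "c-1001",
        "email": "jane@example.com",
        "country": "PL",
        "basket_value": 129.90,
    }
    anonymized = anonymize_record(original)
    assert_no_pii(anonymized)
    print(anonymized)
```

Keeping the PII field list in one place and checking the output automatically reduces the chance that a newly added column slips through unnoticed.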

One of the specialized techniques for handling the Right to Be Forgotten is crypto-shredding. The basic idea is to encrypt customer records and remove the encryption key when the customer requests to be forgotten. Using this technique can free you from concerns related to the cost of reprocessing the entire dataset. At the same time, it must be carefully thought through, especially when it comes to storing, securing, and removing the encryption keys.
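The idea can be sketched in Python as follows. Everything here is an assumption made for illustration: the `CryptoShredStore` class is hypothetical, keys are kept in a plain dictionary instead of a dedicated key management system, and the cipher is a toy keystream built from HMAC-SHA256 so that the example needs only the standard library. A real implementation should use a vetted, authenticated encryption scheme from an established library.

```python
import hashlib
import hmac
import secrets


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream from HMAC-SHA256 in counter mode. Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hmac.new(key, nonce + counter.to_bytes(8, "big"),
                            hashlib.sha256).digest())
        counter += 1
    return bytes(out[:length])


def _xor(data: bytes, stream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream))


class CryptoShredStore:
    """Hypothetical store: one key per user, records kept only encrypted."""

    def __init__(self):
        self._keys = {}     # user_id -> key; in practice a separate, secured key store
        self._records = []  # (user_id, nonce, ciphertext); never rewritten

    def write(self, user_id: str, payload: bytes) -> None:
        key = self._keys.setdefault(user_id, secrets.token_bytes(32))
        nonce = secrets.token_bytes(16)
        ciphertext = _xor(payload, _keystream(key, nonce, len(payload)))
        self._records.append((user_id, nonce, ciphertext))

    def read(self, user_id: str) -> list:
        key = self._keys.get(user_id)
        if key is None:
            return []  # key shredded: the ciphertext can no longer be decrypted
        return [_xor(ct, _keystream(key, nonce, len(ct)))
                for uid, nonce, ct in self._records if uid == user_id]

    def forget(self, user_id: str) -> None:
        # Only the key is removed; the large dataset is left untouched.
        self._keys.pop(user_id, None)


if __name__ == "__main__":
    store = CryptoShredStore()
    store.write("user-42", b"jane@example.com")
    print(store.read("user-42"))
    store.forget("user-42")
    print(store.read("user-42"))
```

Note that `forget` touches only the key store. That is what avoids the dataset rewrite, and it is also why the key store itself, including its backups, needs a reliable physical deletion procedure.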

Look for a description of crypto-shredding in one of the upcoming articles.

What Have You Missed?

Once you have determined the appropriate method for addressing the Right to Be Forgotten (RTBF), you are nearly prepared to begin the implementation process. Why nearly? A crucial aspect of this process is identifying all datasets containing sensitive information. It goes without saying that managing your master data is essential. However, are you fully aware of the life cycles of your datasets? Can you confidently say that you are handling all derived datasets? In other words, are you dealing with a data lake, data lakehouse, or perhaps a data swamp?
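As a rough illustration of what "handling all derived datasets" means, the sketch below walks a lineage graph to list every dataset that is directly or indirectly derived from a source containing personal data. The dataset names and the `downstream_datasets` function are hypothetical, and the sketch assumes that lineage is already recorded as a simple mapping from each dataset to the datasets built from it.

```python
from collections import deque


def downstream_datasets(lineage, sources):
    """Return the sources plus every dataset derived from them.

    lineage maps a dataset name to the datasets built directly from it.
    """
    seen = set(sources)
    queue = deque(sources)
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # also guards against cycles
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    # Hypothetical lineage of a small data platform.
    lineage = {
        "customers_master": ["orders_enriched", "marketing_features"],
        "orders_enriched": ["weekly_report"],
        "marketing_features": ["weekly_report", "churn_training_set"],
        "product_catalog": ["orders_enriched"],
    }
    affected = downstream_datasets(lineage, ["customers_master"])
    print(sorted(affected))
```

Every dataset in the result is a candidate for deletion, anonymization, or key removal when a customer asks to be forgotten. Datasets missing from the lineage mapping are exactly the ones that turn a data lake into a data swamp.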

Maintaining control over data lineage is a critical component of Data Governance, and we will discuss its significance in the context of RTBF in an upcoming article.