Over the last few years we have seen huge investments in Big Data platforms (the global big data analytics market was valued at nearly $400 billion in 2025, according to Fortune Business Insights), especially in Data Lakehouse architecture and cloud analytics, and that is no coincidence. These platforms offer performance, scalability, and flexibility in a single unified platform for everything, while still supporting data governance and quality control.

There has also been a shift in how personal data is managed, driven by regulations such as the General Data Protection Regulation (GDPR). An issue once confined to IT departments is now a matter of broader business concern.

Perhaps the most challenging part of GDPR for data platforms is the Right to Erasure (Art. 17), also known as “the right to be forgotten”. It means that, upon request, a person’s data must be completely removed unless there is a legal basis for retaining it.

This is where the problem becomes apparent. Today’s platforms are built for durable versioning, replication, and retention, which does not align well with GDPR’s demand for selective, verifiable deletion. A retail company that stores customer purchase history in a Lakehouse environment may quickly discover that deleting one user’s data requires scanning raw logs, analytics tables, and ML training datasets. Without a proper architecture, the user’s data might remain in places we wouldn’t think to look, potentially violating GDPR.

In Lakehouse environments, manual or script-driven solutions do not scale, so compliance has to be built directly into the architecture. In short, that means applying the Privacy by Design principle, taking control of versioning and retention rules, and creating processes that automate compliance. Most importantly, we must first know exactly where data comes from, how it is transformed, and where it goes.

GDPR as a system requirement

From an architectural standpoint, GDPR is more than a legal text: it is a set of system requirements.

The regulation specifies several data subject rights that correspond directly to system capabilities. The most relevant to data engineering are the Right to Erasure (Art. 17), the Right to Restriction of Processing (Art. 18), and the Right to Data Portability (Art. 20). Each of these rights implies tangible system requirements.

Right to Erasure (Art. 17)

For the right to erasure to be properly fulfilled, a system must locate all occurrences of the user’s data, delete or anonymize it in every storage tier, ensure the deletion cascades through all aggregated datasets, and prevent re-emergence through reprocessing or historical restoration. This goes far beyond a simple DELETE; it is a complex, multi-step operation.
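
The steps above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory dict of tables stands in for the storage tiers; the table names, the `user_id` key, and the `ANONYMIZED` placeholder are illustrative, not any specific platform’s API.

```python
# Sketch of a locate -> anonymize -> verify erasure routine over toy tables.
from typing import Any

PLACEHOLDER = "ANONYMIZED"

def erase_user(tables: dict[str, list[dict[str, Any]]], user_id: str) -> dict[str, int]:
    """Locate, anonymize, and verify removal of one user's records."""
    touched: dict[str, int] = {}
    # Steps 1 and 2: locate every occurrence and anonymize it in each tier.
    for name, rows in tables.items():
        count = 0
        for row in rows:
            if row.get("user_id") == user_id:
                row["user_id"] = PLACEHOLDER
                row.pop("email", None)  # drop direct identifiers entirely
                count += 1
        if count:
            touched[name] = count
    # Step 3: verify nothing survives in any tier before reporting success.
    for name, rows in tables.items():
        assert all(row.get("user_id") != user_id for row in rows), name
    return touched

tables = {
    "bronze_events": [{"user_id": "u1", "email": "a@b.c", "event": "login"}],
    "gold_sales":    [{"user_id": "u1", "amount": 10}, {"user_id": "u2", "amount": 5}],
}
report = erase_user(tables, "u1")
```

Even in this toy form, the verification pass matters: reporting success without re-scanning every tier is exactly how data re-emerges later.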

 

Right to Restriction of Processing (Art. 18)

The right to restriction of processing requires our systems to flag data as restricted and prevent downstream pipelines from accessing it. The system should also record when and for how long the data is restricted. In distributed data systems, this means metadata is managed centrally, so the system knows which data has a special legal status and how to treat it.
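
A minimal sketch of such centrally managed restriction metadata, assuming a simple in-memory registry; the record shape and the time-window semantics are illustrative assumptions, not any specific governance product’s API.

```python
# Central registry of restriction windows; pipelines consult it before reading.
from datetime import date

class RestrictionRegistry:
    def __init__(self):
        self._restrictions = {}  # user_id -> (start, end) of restriction window

    def restrict(self, user_id, start, end):
        self._restrictions[user_id] = (start, end)

    def is_restricted(self, user_id, on_day):
        window = self._restrictions.get(user_id)
        return window is not None and window[0] <= on_day <= window[1]

def run_pipeline(rows, registry, today):
    # Downstream pipelines must skip records with a special legal status.
    return [r for r in rows if not registry.is_restricted(r["user_id"], today)]

registry = RestrictionRegistry()
registry.restrict("u1", date(2025, 1, 1), date(2025, 3, 31))
rows = [{"user_id": "u1"}, {"user_id": "u2"}]
visible = run_pipeline(rows, registry, date(2025, 2, 1))
```

The key design choice is that the registry is consulted at read time, so every pipeline sees the same legal status without copying flags into each table.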

 

Right to Data Portability (Art. 20)

The right to data portability requires our systems to provide customers with their data in a structured, machine-readable format. The easiest way to comply is to keep data consistent across all layers and sources; otherwise, we risk exporting incorrect data or omitting requested information, which could itself constitute a GDPR violation. We therefore need proper schema and lineage management, or portability requests become costly and time-consuming.
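
A portability export can be as simple as gathering one user’s records per source and serializing them into one structured document. This is a hedged sketch; the source names and record layout are invented for illustration.

```python
# Collect a user's records from several (illustrative) sources into one JSON document.
import json

def export_user_data(sources, user_id):
    """Build a portable, machine-readable export for one data subject."""
    export = {"user_id": user_id, "sources": {}}
    for name, rows in sources.items():
        matches = [r for r in rows if r.get("user_id") == user_id]
        if matches:
            export["sources"][name] = matches
    return json.dumps(export, sort_keys=True)

sources = {
    "orders":  [{"user_id": "u1", "item": "book"}, {"user_id": "u2", "item": "pen"}],
    "profile": [{"user_id": "u1", "name": "Alice"}],
}
payload = export_user_data(sources, "u1")
```

The hard part in production is not the serialization but knowing that `sources` really covers every system holding the user’s data, which is where schema and lineage management come in.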

 

These rights translate directly into architectural constraints in modern data lakehouses.

Architectural implications in modern lakehouses

Not long ago, in one of our projects, we had a choice between data lakes and data warehouses: data lakes for unstructured data, warehouses for structured data. Today’s Lakehouse systems combine the best of both worlds. They typically consist of a raw/landing zone for batch or streaming ingestion, a transformation layer, and an aggregated serving layer for analytics and BI. They also offer benefits such as versioned storage with time travel and the separation of storage and compute. While this makes Lakehouse architecture all the more powerful, it poses a compliance challenge.

In practice, personal data may exist in multiple forms:

  • Structured formats (like Parquet)
  • Semi-structured (like JSON or logs)
  • Feature Stores (for ML models)
  • Analytical aggregates and materialized views
  • Disaster recovery and backup

Every additional environment where our data exists makes it that much harder to delete. Personal data in a silver table within a medallion architecture is not automatically deleted from a historical snapshot, a downstream gold table, a training dataset, or logs.

Replication mechanisms further complicate things. Data may be copied across regions for latency optimization, shared between environments, or replicated into separate analytics platforms. Each copy creates another instance that must be located and removed.

Backup and disaster recovery introduce another layer of risk. If data is replicated into cold storage or secondary regions, erasure must propagate across all recovery layers; otherwise, a restore may reintroduce deleted data. In large enterprises, backup and restoration are standard practice, so this is not just a theoretical issue.

From a design perspective, these realities require more than reactive deletion scripts; they demand architecture-level decisions aligned with the Privacy by Design principle. That means minimizing the ingestion of unnecessary personal data, separating behavioral data such as transactions or logs from PII (Personally Identifiable Information), and replacing identifiers with tokens as early as possible in pipelines. When identity data is isolated, it is much easier to delete or update it without rewriting large datasets.
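
The tokenization idea can be sketched as a separate “identity vault”: behavioral rows carry only a token, and erasure touches only the small vault. The vault layout and token scheme below are illustrative assumptions.

```python
# Early-pipeline pseudonymization: PII lives only in a small, separate vault.
import uuid

class IdentityVault:
    """Maps stable tokens to PII; deleting a vault entry orphans the token."""
    def __init__(self):
        self._by_identity = {}   # raw identifier -> token
        self._by_token = {}      # token -> raw identifier

    def tokenize(self, identifier):
        token = self._by_identity.get(identifier)
        if token is None:
            token = uuid.uuid4().hex
            self._by_identity[identifier] = token
            self._by_token[token] = identifier
        return token

    def forget(self, identifier):
        token = self._by_identity.pop(identifier, None)
        if token:
            del self._by_token[token]   # behavioral rows keep an orphaned token

def ingest(events, vault):
    # Replace the identifier as early as possible; downstream tables never see PII.
    return [{"user": vault.tokenize(e["email"]), "action": e["action"]} for e in events]

vault = IdentityVault()
rows = ingest([{"email": "a@b.c", "action": "buy"},
               {"email": "a@b.c", "action": "view"}], vault)
vault.forget("a@b.c")  # erasure touches only the vault, not the large fact table
```

After `forget`, the behavioral rows remain usable for aggregate analytics but can no longer be linked back to a person, so the large tables never need to be rewritten.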

 

Without this careful engineering, it becomes almost impossible to delete data with full certainty, which may expose a business to penalties, failed audits, or reputational harm. Regulators have already fined many organizations millions of euros for failing to comply and fully delete customer data.

Identifying personal data in distributed systems

One of the most important points is that personal data must be identified before it can be deleted, and this is one of the hardest problems in Big Data systems. Personal data can be explicit, like a name, email, or ID number; hidden in a nested JSON structure; derived, like a risk score or customer segment; or buried in free-text logs.

Schemas, columns, and tables change over time. They may be restructured or renamed across versions, so without PII identification tools it may become impossible to delete data with the rigor GDPR requires. For example, a social media platform may retain PII in posts, logs, and third-party analytics pipelines, so deletion may be only partial. Beyond the regulatory problems, this also erodes user trust.

To address such issues, we need centralized data catalogs that make data easier to find, tools for automated PII detection and labeling, metadata-driven governance frameworks, and PII declaration during pipeline development. Data governance needs to be integrated into the engineering process from the outset; otherwise, it not only risks compliance failures but also becomes much harder to retrofit at later stages.
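
Automated PII detection often starts with simple rule-based scanning of column samples. This is a hedged sketch of that first layer only; the regexes and the 50% sample threshold are illustrative, and real deployments combine such heuristics with catalogs and ML-based classifiers.

```python
# Rule-based PII detection over sampled column values, for automated labeling.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def detect_pii(column_values, sample_threshold=0.5):
    """Label a column with every PII kind matching at least half the samples."""
    labels = set()
    for kind, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in column_values if pattern.search(str(v)))
        if column_values and hits / len(column_values) >= sample_threshold:
            labels.add(kind)
    return labels

labels = detect_pii(["a@b.com", "c@d.org", "not-an-email"])
clean = detect_pii(["42", "17", "99"])
```

The labels produced this way feed the catalog, so downstream deletion and restriction jobs know which columns carry personal data even after tables are renamed.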

Operational enforcements: retention, automation and lineage

Retention control is an important but frequently undervalued aspect of GDPR compliance. Retention policies cannot just sit there gathering digital dust; they need to be enforced. Organizations must define retention periods, and the system must automatically delete data once it expires.

In versioned systems, implementing deletion is significantly more complex. Even if we “delete” data, it may still exist in earlier snapshots. On a Lakehouse platform with a 90-day retention policy, deleting something does not always mean it is gone: it may be gone from one place but still exist in another. A well-thought-out policy reduces not only the storage of old data but also the scope of DSAR-related deletions, thereby limiting future risk.
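
The snapshot problem can be made concrete with a small sketch, assuming a table is just a list of (created_at, rows) snapshots; this layout is an illustration, not any specific table format’s API.

```python
# Retention enforcement that also expires versioned snapshots.
from datetime import date, timedelta

def enforce_retention(snapshots, today, retention_days=90):
    """Drop every snapshot older than the retention window.

    Deleting a row from the live version is not enough: the row survives in
    older snapshots until they age out of the window as well.
    """
    cutoff = today - timedelta(days=retention_days)
    return [(created, rows) for created, rows in snapshots if created >= cutoff]

today = date(2025, 6, 30)
snapshots = [
    (date(2025, 1, 1), [{"user_id": "u1"}]),  # old snapshot still holds u1
    (date(2025, 6, 1), []),                   # live version: u1 already deleted
]
kept = enforce_retention(snapshots, today)
```

Until the January snapshot ages out, the user’s data is still restorable via time travel; explicitly expiring old versions is what makes the deletion real.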

 

Ultimately, retention involves both cost savings and risk management.

Manual processing of Data Subject Access Requests (DSARs) may be acceptable at small scale, but it does not scale to enterprise-grade Lakehouse systems. A compliant system requires a centralized request repository, an identity verification system, a DSAR intake procedure, orchestration services for deletion/restriction execution, end-to-end audit logging, and verifiable proof of execution. This is where automation comes in: it ensures consistency across data layers, reduces human error, preserves execution history, and generates compliance KPIs. Consider a bank that has to process thousands of DSARs every month. Without automation, it would need a sizeable team manually completing each request, with massive delays and an increased risk of GDPR violations as the result.
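
The core of such a workflow is small: a request repository plus an append-only audit trail that doubles as proof of execution. The sketch below assumes in-memory storage, and its statuses and step names are invented for illustration.

```python
# Minimal automated DSAR pipeline: verified intake, orchestrated execution, audit trail.
from datetime import datetime, timezone

class DsarProcessor:
    def __init__(self):
        self.requests = {}   # request_id -> status
        self.audit_log = []  # (timestamp, request_id, step): proof of execution

    def _record(self, request_id, step):
        self.audit_log.append((datetime.now(timezone.utc), request_id, step))

    def intake(self, request_id, verified_identity):
        if not verified_identity:
            raise ValueError("identity must be verified before intake")
        self.requests[request_id] = "received"
        self._record(request_id, "intake")

    def execute(self, request_id, delete_fn):
        delete_fn()                          # orchestrated deletion/restriction
        self.requests[request_id] = "completed"
        self._record(request_id, "erasure-executed")

processor = DsarProcessor()
processor.intake("req-1", verified_identity=True)
processor.execute("req-1", delete_fn=lambda: None)
```

In a real system `delete_fn` would be the orchestration service fanning out deletions across layers; the point here is that every step lands in the audit log automatically.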

 

Thanks to automation, we can generate consistent, auditable responses within legal deadlines. Under GDPR, accountability means that you must not only follow the rules but also be able to prove compliance: the system needs to generate logs and reports showing that data was deleted as intended and that the changes propagated through the system.

This is not possible without automation.

Data lineage is a foundation of GDPR compliance. Seeing how data flows through different systems, tracing the paths transformations follow, and understanding how data is processed are invaluable for detecting personal data in downstream systems and evaluating the effects of delete operations. An advanced architecture needs to know not only where data resides but also how it is transformed; without that knowledge, effective deletion is a long shot.
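
Lineage-driven impact analysis reduces to graph traversal: given where a user’s data enters, find every dataset a deletion must cascade to. The sketch below assumes lineage is available as a simple adjacency map; the dataset names are illustrative.

```python
# Breadth-first traversal of a lineage graph to find the blast radius of a delete.
from collections import deque

def downstream_of(lineage, start):
    """Return every dataset reachable from `start`,
    i.e. everything a delete in `start` must cascade to."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

lineage = {
    "bronze_events": ["silver_users"],
    "silver_users": ["gold_segments", "ml_features"],
    "ml_features": ["churn_model_training"],
}
impact = downstream_of(lineage, "bronze_events")
```

In practice the adjacency map would come from a lineage or catalog service rather than being hand-written, but the traversal and the resulting deletion checklist are the same.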

Conclusion

The Right to Erasure is not a trivial operational issue; it is a design challenge. In today’s Data Lakehouse ecosystem, GDPR compliance cannot be achieved simply by documenting, scripting, or cleaning up. It demands system-level Privacy by Design, technical enforcement of retention and versioning, automated DSAR workflows, full data lineage, and centralized metadata governance.

Only when these are fully integrated into the design can organizations guarantee that personal data can be identified, managed, restricted, and completely and permanently erased. In a world of distributed analytics and AI pipelines, compliance is no longer an afterthought; it is a design attribute. Designing for erasure is not about deleting data: it is about designing systems that honor human rights by default.

***

All content in this blog is created exclusively by technical experts specializing in Data Consulting, Data Visualization, Data Engineering, and Data Science. Our aim is purely educational, providing valuable insights without marketing intent.