Lessons

Security & Compliance

PII lifecycle — deletion, retention, and the right to be forgotten

Fully delete data for compliance (GDPR right to be forgotten): logical DELETE vs physical VACUUM, retention properties, targeted partition deletes, and audit.

A user invokes their GDPR "right to be forgotten." You run DELETE FROM users WHERE user_id = ….

Predict: is the data gone?

No — and this is the most important compliance fact in the section. A Delta DELETE does not actually erase the data. Understanding why, and what does, is the whole lesson.


The spine

Beat 1 — the anchor: DELETE is logical; only VACUUM erases

Anchor. A Delta DELETE is logical — it writes new files without the deleted rows and tombstones the old ones in the log, but the old files remain on storage and stay reachable via time travel. The data is truly gone only when VACUUM removes those unreferenced files — and VACUUM only removes files older than the retention window (delta.deletedFileRetentionDuration, default 7 days). Compliant deletion = DELETE, then VACUUM after retention.

From [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/): Delta never edits files in place; it adds new ones and marks old ones removed. So the "deleted" PII lives on in old files, recoverable with VERSION AS OF, until VACUUM purges them. That's why a DELETE-only GDPR routine is a review finding. (With the 7-day default, the files persist until a VACUUM runs on day 8+.)

Beat 2 — doing forget-me correctly

  1. DELETE the rows, and propagate to every derived/downstream table that copied the PII — deleting from user_lookup doesn't clean user_aggregates.
  2. VACUUM once the retention window has passed → files physically removed, no longer time-travel-recoverable.
  3. VACUUM will not remove files still inside deletedFileRetentionDuration (a safety guard for active readers/time travel) — until then the deleted data remains reachable.

Lock it. DELETE = logical (files linger, time-travel-recoverable). VACUUM past retention = physical erase. Forget-me = DELETE all copies → VACUUM after the window.


The dials (skim now; return when a question needs one)

◆ Retention properties (and the trade-off)

PropertyDefaultControls
delta.deletedFileRetentionDuration7 dayshow long removed files are kept before VACUUM can purge → also the recovery/time-travel window for deleted data
delta.logRetentionDuration30 dayshow long transaction-log history (versions) is kept

Set with ALTER TABLE … SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '15 days'). The tension: longer = more recovery but PII lingers; shorter = faster compliant erasure but less recovery. Tune to policy.

◆ Faster/cleaner deletes — DROP, partitions, appendOnly

◆ Two supporting facts

Takeaways (rebuild it from these)

  1. Delta DELETE is logical — old files remain, time-travel-reachable; PII gone only when VACUUM removes files older than deletedFileRetentionDuration (7 days default).
  2. Compliant forget-me = DELETE (all copies, incl. downstream) → VACUUM after retention; VACUUM won't touch files inside the window.
  3. Retention: deletedFileRetentionDuration (7d, VACUUM/recovery window) vs logRetentionDuration (30d, history). Longer = more recovery but PII lingers.
  4. DROP a managed table physically deletes files (external = metadata only). Partition by a low-cardinality PII-segregating column for ACL isolation + boundary-aligned deletes. appendOnly blocks deletes (conflicts with forget-me).
  5. Audit logs capture PAT/user identity per REST call; DBFS is a path abstraction over object storage, where real deletions happen.

Before you move on — say these without scrolling up

  1. You ran DELETE for a forget-me request — is the PII gone? What's the full procedure?
  2. The two retention properties, their defaults, and the trade-off you tune.
  3. DROP on a managed vs external table — what happens to the files?
  4. How does partitioning make PII deletion efficient — and what property would block forget-me?

That completes Section 7's security story: who's in (identity/secrets) → what they can do (ACLs, UC privileges) → what they can see (row filters/masks) → protect the values (anonymization) → erase for good (this lesson).

Prerequisites