Lessons

Security & Compliance

Anonymization — hashing, pseudonymization, and protecting values at rest

Anonymize/pseudonymize PII using hashing (SHA-2) and related techniques, apply masking in place, and ensure consistent masking across distributed pipelines.

Column masks ([Row filters and column masks — access control inside a table](/lessons/s7-row-col-masks/)) hide values at query time — but the raw data still sits in the files. Copy a table to a sandbox, hand a dataset to a vendor, or simply refuse to store real PII at rest, and masking isn't enough: you must transform the stored value itself. This lesson is those transforms, with hashing as the workhorse and a few exact SHA-2 facts the exam checks.


The spine

Beat 1 — the anchor: recoverable? linkable?

Anchor. Protecting a value at rest is a choice on two axes: can you get the original back, and can you still link records that had the same value? Anonymization (hashing) is one-way — no recovery — but deterministic, so equal inputs still match. Pseudonymization replaces the identifier with a consistent code so records stay linkable without exposing the real value. Tokenization is reversible via a separately-stored vault. Pick by which of recoverable / linkable you need.

Beat 2 — SHA-2 hashing, and the in-place trap

sha2(col, numBits) is the default: one-way and deterministic (same input → same output). Determinism makes it a pseudonymizer — hash an email that's both the only PII and the join key, and records stay linkable on the hash while the address is gone. Three exact facts:

Predict: you write df.withColumn("ssn_hash", sha2("ssn", 256)) and save. Is the SSN protected?

NowithColumn("ssn_hash", …) adds a new column and leaves the original ssn intact, so raw SSNs still get written. To mask in place, overwrite the original column (same name) or project it out:

df.withColumn("ssn", F.sha2("ssn", 256))   # RIGHT — overwrites in place

Tell: "hashed a new column but PII still exposed" → you added instead of replaced.

Lock it. sha2 = one-way + deterministic (linkable, not recoverable); valid 224/256/384/512 (128 fails), fixed-length output. Overwrite the column to mask in place. Recover later → tokenization.


The dials (skim now; return when a question needs one)

◆ The landscape

◆ Consistency across a distributed pipeline

The hard part in big batch and streaming pipelines: every node must apply identical masking logic — drift leaks rows. Fix: a single, standardized masking function (managed through Unity Catalog) enforced everywhere, applied once at Bronze ingestion so PII is anonymized before it's ever stored. Clean design: one Lakeflow pipeline serving both batch and streaming with a shared masking function at Bronze.

◆ Name the collision

Takeaways (rebuild it from these)

  1. Protect at rest by transforming the value: anonymize (hash, one-way, deterministic→linkable), pseudonymize/tokenize (linkable/reversible), vs query-time masking (display only).
  2. sha2 valid lengths 224/256/384/512 (0=256; 128 fails); fixed-length output (256→64 hex); deterministic; not reversible (tokenize to recover).
  3. withColumn("x_hash", …) adds a column and leaves the raw PII — overwrite the original column to mask in place.
  4. Landscape: salting (strength, breaks linkability), suppression (NULL), generalization (ranges), k-anonymity (≥k per quasi-identifier group, guards re-identification), tokenization (reversible).
  5. Distributed consistency = one shared masking function at Bronze, ideally one Lakeflow pipeline for batch+streaming.

Before you move on — say these without scrolling up

  1. Recoverable vs linkable — where do hashing, tokenization, and masking each land?
  2. sha2 valid lengths — and which common one fails? Output length for a short vs long input?
  3. withColumn("ssn_hash", …) — is the PII protected? What's the fix?
  4. How do you guarantee every node in a big pipeline masks identically?

Next: the last security duty — deleting data for good ("right to be forgotten") and why a Delta DELETE alone doesn't do it → [PII lifecycle — deletion, retention, and the right to be forgotten](/lessons/s7-pii-lifecycle/).

Prerequisites

Leads to