Anonymize/pseudonymize PII using hashing (SHA-2) and related techniques, apply masking in place, and ensure consistent masking across distributed pipelines.

Column masks ([Row filters and column masks — access control inside a table](/lessons/s7-row-col-masks/)) hide values at query time — but the raw data still sits in the files. Copy a table to a sandbox, hand a dataset to a vendor, or simply refuse to store real PII at rest, and masking isn't enough: you must transform the stored value itself. This lesson is those transforms, with hashing as the workhorse and a few exact SHA-2 facts the exam checks.

The spine

Beat 1 — the anchor: recoverable? linkable?

Anchor. Protecting a value at rest is a choice on two axes: can you get the original back, and can you still link records that had the same value? Anonymization (hashing) is one-way — no recovery — but deterministic, so equal inputs still match. Pseudonymization replaces the identifier with a consistent code so records stay linkable without exposing the real value. Tokenization is reversible via a separately-stored vault. Pick by which of recoverable / linkable you need.

Beat 2 — SHA-2 hashing, and the in-place trap

sha2(col, numBits) is the default: one-way and deterministic (same input → same output). Determinism makes it a pseudonymizer — hash an email that's both the only PII and the join key, and records stay linkable on the hash while the address is gone. Three exact facts:

Valid bit lengths: 224, 256, 384, 512 (0 = 256). 128 is NOT a SHA-2 length — sha2(col, 128) fails.
Output is fixed-length regardless of input — sha2(x, 256) is always 256 bits = 64 hex chars, short email or long password alike (why a hashed column takes a CHAR(64)).
A hash is not encryption — you can't recover the original. Need it back → tokenization (random token + secure vault), not hashing.

Predict: you write df.withColumn("ssn_hash", sha2("ssn", 256)) and save. Is the SSN protected?

…

No — withColumn("ssn_hash", …) adds a new column and leaves the original ssn intact, so raw SSNs still get written. To mask in place, overwrite the original column (same name) or project it out:

df.withColumn("ssn", F.sha2("ssn", 256))   # RIGHT — overwrites in place

Tell: "hashed a new column but PII still exposed" → you added instead of replaced.

Lock it. sha2 = one-way + deterministic (linkable, not recoverable); valid 224/256/384/512 (128 fails), fixed-length output. Overwrite the column to mask in place. Recover later → tokenization.

The dials (skim now; return when a question needs one)

◆ The landscape

Salting (sha2(concat(x, salt), 256)) resists dictionary/rainbow attacks — but a salt breaks cross-dataset linkability unless the same salt is shared (trades matchability for strength).
Suppression — replace with NULL.
Generalization — coarsen: exact age → age range, full ZIP → 3-digit prefix. Keeps utility, lowers identifiability.
k-anonymity — every combination of quasi-identifiers (age-band, city, gender) appears ≥ k times so no row is unique enough to re-identify; regulators typically want k ≥ 5. Motivation: even "anonymized" rows can be re-identified by joining quasi-identifiers to an external dataset — so generalize the quasi-identifiers, don't just hash the obvious PII.
Tokenization — reversible pseudonymization via a vault (when you must re-identify later).

◆ Consistency across a distributed pipeline

The hard part in big batch and streaming pipelines: every node must apply identical masking logic — drift leaks rows. Fix: a single, standardized masking function (managed through Unity Catalog) enforced everywhere, applied once at Bronze ingestion so PII is anonymized before it's ever stored. Clean design: one Lakeflow pipeline serving both batch and streaming with a shared masking function at Bronze.

◆ Name the collision

Masking ([Row filters and column masks — access control inside a table](/lessons/s7-row-col-masks/)) — hides at query time; raw stays in files.
Anonymization / hashing — transforms the stored value, one-way, deterministic → still linkable.
Tokenization — reversible via a vault (recover the original).
Encryption — reversible with a key (a different tool; hashing is not this).

Takeaways (rebuild it from these)

Protect at rest by transforming the value: anonymize (hash, one-way, deterministic→linkable), pseudonymize/tokenize (linkable/reversible), vs query-time masking (display only).
sha2 valid lengths 224/256/384/512 (0=256; 128 fails); fixed-length output (256→64 hex); deterministic; not reversible (tokenize to recover).
withColumn("x_hash", …) adds a column and leaves the raw PII — overwrite the original column to mask in place.
Landscape: salting (strength, breaks linkability), suppression (NULL), generalization (ranges), k-anonymity (≥k per quasi-identifier group, guards re-identification), tokenization (reversible).
Distributed consistency = one shared masking function at Bronze, ideally one Lakeflow pipeline for batch+streaming.

Before you move on — say these without scrolling up

Recoverable vs linkable — where do hashing, tokenization, and masking each land?
sha2 valid lengths — and which common one fails? Output length for a short vs long input?
withColumn("ssn_hash", …) — is the PII protected? What's the fix?
How do you guarantee every node in a big pipeline masks identically?

Next: the last security duty — deleting data for good ("right to be forgotten") and why a Delta DELETE alone doesn't do it → [PII lifecycle — deletion, retention, and the right to be forgotten](/lessons/s7-pii-lifecycle/).

Anonymization — hashing, pseudonymization, and protecting values at rest