Column masks ([Row filters and column masks — access control inside a table](/lessons/s7-row-col-masks/)) hide values at query time — but the raw data still sits in the files. Copy a table to a sandbox, hand a dataset to a vendor, or simply refuse to store real PII at rest, and masking isn't enough: you must transform the stored value itself. This lesson is those transforms, with hashing as the workhorse and a few exact SHA-2 facts the exam checks.
The spine
Beat 1 — the anchor: recoverable? linkable?
Anchor. Protecting a value at rest is a choice on two axes: can you get the original back, and can you still link records that had the same value? Anonymization (hashing) is one-way — no recovery — but deterministic, so equal inputs still match. Pseudonymization replaces the identifier with a consistent code so records stay linkable without exposing the real value. Tokenization is reversible via a separately-stored vault. Pick by which of recoverable / linkable you need.
Beat 2 — SHA-2 hashing, and the in-place trap
sha2(col, numBits) is the default: one-way and deterministic (same input → same output). Determinism makes it a pseudonymizer — hash an email that's both the only PII and the join key, and records stay linkable on the hash while the address is gone. Three exact facts:
- Valid bit lengths: 224, 256, 384, 512 (
0= 256). 128 is NOT a SHA-2 length —sha2(col, 128)fails. - Output is fixed-length regardless of input —
sha2(x, 256)is always 256 bits = 64 hex chars, short email or long password alike (why a hashed column takes aCHAR(64)). - A hash is not encryption — you can't recover the original. Need it back → tokenization (random token + secure vault), not hashing.
Predict: you write
df.withColumn("ssn_hash", sha2("ssn", 256))and save. Is the SSN protected?
…
No — withColumn("ssn_hash", …) adds a new column and leaves the original ssn intact, so raw SSNs still get written. To mask in place, overwrite the original column (same name) or project it out:
df.withColumn("ssn", F.sha2("ssn", 256)) # RIGHT — overwrites in place
Tell: "hashed a new column but PII still exposed" → you added instead of replaced.
Lock it.
sha2= one-way + deterministic (linkable, not recoverable); valid 224/256/384/512 (128 fails), fixed-length output. Overwrite the column to mask in place. Recover later → tokenization.
The dials (skim now; return when a question needs one)
◆ The landscape
- Salting (
sha2(concat(x, salt), 256)) resists dictionary/rainbow attacks — but a salt breaks cross-dataset linkability unless the same salt is shared (trades matchability for strength). - Suppression — replace with
NULL. - Generalization — coarsen: exact age → age range, full ZIP → 3-digit prefix. Keeps utility, lowers identifiability.
- k-anonymity — every combination of quasi-identifiers (age-band, city, gender) appears ≥ k times so no row is unique enough to re-identify; regulators typically want k ≥ 5. Motivation: even "anonymized" rows can be re-identified by joining quasi-identifiers to an external dataset — so generalize the quasi-identifiers, don't just hash the obvious PII.
- Tokenization — reversible pseudonymization via a vault (when you must re-identify later).
◆ Consistency across a distributed pipeline
The hard part in big batch and streaming pipelines: every node must apply identical masking logic — drift leaks rows. Fix: a single, standardized masking function (managed through Unity Catalog) enforced everywhere, applied once at Bronze ingestion so PII is anonymized before it's ever stored. Clean design: one Lakeflow pipeline serving both batch and streaming with a shared masking function at Bronze.
◆ Name the collision
- Masking (
[Row filters and column masks — access control inside a table](/lessons/s7-row-col-masks/)) — hides at query time; raw stays in files. - Anonymization / hashing — transforms the stored value, one-way, deterministic → still linkable.
- Tokenization — reversible via a vault (recover the original).
- Encryption — reversible with a key (a different tool; hashing is not this).
Takeaways (rebuild it from these)
- Protect at rest by transforming the value: anonymize (hash, one-way, deterministic→linkable), pseudonymize/tokenize (linkable/reversible), vs query-time masking (display only).
sha2valid lengths 224/256/384/512 (0=256; 128 fails); fixed-length output (256→64 hex); deterministic; not reversible (tokenize to recover).withColumn("x_hash", …)adds a column and leaves the raw PII — overwrite the original column to mask in place.- Landscape: salting (strength, breaks linkability), suppression (NULL), generalization (ranges), k-anonymity (≥k per quasi-identifier group, guards re-identification), tokenization (reversible).
- Distributed consistency = one shared masking function at Bronze, ideally one Lakeflow pipeline for batch+streaming.
Before you move on — say these without scrolling up
- Recoverable vs linkable — where do hashing, tokenization, and masking each land?
sha2valid lengths — and which common one fails? Output length for a short vs long input?withColumn("ssn_hash", …)— is the PII protected? What's the fix?- How do you guarantee every node in a big pipeline masks identically?
Next: the last security duty — deleting data for good ("right to be forgotten") and why a Delta DELETE alone doesn't do it → [PII lifecycle — deletion, retention, and the right to be forgotten](/lessons/s7-pii-lifecycle/).