Last lesson gave three violation actions: warn (keep the bad row in your clean table — no), drop (delete it — the row's gone, only a count survives), fail (halt the pipeline — one bad record stops everything). In production none of those is what a mature team wants. They want a fourth behaviour: set the bad row aside where it can be seen, understood, and reprocessed. That's quarantine.
The spine
Beat 1 — the anchor: route aside, don't discard and don't block
Anchor. Quarantine = split incoming data into two destinations — valid rows to the clean table, invalid rows to a quarantine table (with metadata: why it failed, when). Nothing is lost, nothing is blocked. The clean table stays clean; bad rows wait in a side table for a human to inspect, fix the source, and replay them.
Predict:
expect_or_dropremoves the bad row so your clean table stays clean. Why isn't that enough in production?
…
Because drop throws the actual row away — you keep a count of failures but not the rows, so you can never investigate what was wrong or reprocess it once fixed. Even 0.1% of 100K/day is ~36K silently deleted rows a year. Quarantine keeps the evidence.
Beat 2 — the pattern to memorise: warn, then split (NOT drop)
Here's the buried insight, and it's the trap: to quarantine on business rules you use expectations in warn mode — not DROP ROW — because warn keeps every row, and you split them yourself.
from pyspark import pipelines as dp # exam/older code: import dlt; @dp == @dlt
rules = {
"valid_amount": "amount > 0",
"has_user": "user_id IS NOT NULL",
"valid_email": "email LIKE '%@%'",
}
quarantine_rule = "NOT({})".format(" AND ".join(rules.values())) # true when ANY rule fails
@dp.table
@dp.expect_all(rules) # WARN: record metrics, keep ALL rows
def raw_tagged():
return spark.readStream.table("bronze_events")
@dp.table # clean: every rule passes
def valid_events():
return dp.read("raw_tagged").filter(" AND ".join(rules.values()))
@dp.table # quarantine: any rule failed
def quarantine_events():
return dp.read("raw_tagged").filter(quarantine_rule)
Predict: swap
expect_allforexpect_all_or_drophere — what breaks?
…
You'd have nothing to quarantine — drop deletes the rows before you can route them. That's the trap in every quarantine question: an answer using expect_or_drop cannot quarantine.
Lock it. Quarantine =
expect_all(warn, keeps all + logs metrics) → two filtered tables:valid(rules pass) andquarantine(NOT(rules)). Add metadata (current_timestamp(), afailure_reason) for triage + replay.
The dials (skim now; return when a question needs one)
◆ Two kinds of "bad," two catch points
- Didn't parse (malformed JSON, type mismatch, unexpected column) — caught at ingest by Auto Loader
_rescued_data([Auto Loader — incremental file ingestion and schema evolution](/lessons/s2-autoloader/)):_rescued_data IS NULL= clean,IS NOT NULL= quarantine.df_good = df_raw.filter("_rescued_data IS NULL").drop("_rescued_data") df_bad = df_raw.filter("_rescued_data IS NOT NULL") # → quarantine - Parsed but violates a business rule (negative amount,
@-less email) — caught by expectations in warn mode (above).
A robust pipeline quarantines both.
◆ Two shapes: two tables, or one table + a flag
The split can be two tables (valid + quarantine, above) or one table with a boolean is_quarantined column, then partitioned by that flag so consumers read only the clean partition. Both are valid quarantine designs — the flag-and-partition form keeps everything in one place while still separating good from bad.
◆ Isolate the quarantine work from the hot path
For a real-time pipeline, keep the production stream lean: filter out bad records and write only valid rows to the production tables at full speed, and run a separate, lightweight process that periodically reads/stores the quarantined records. The good-data job shouldn't share compute with the bad-data handling, so triage never slows the live path.
◆ Where each gate belongs — the medallion pyramid ([The one job — and the two axes everything lives on](/lessons/f1-the-one-job/))
Enforcement tightens as data climbs:
- Bronze — structural checks in warn (
expect): non-null ids, valid timestamps. Keep almost everything (raw fidelity). - Silver — business rules with drop or quarantine: non-negative amounts, valid categories. The clean/quarantine fork usually lives here.
- Gold — critical invariants with fail (
expect_or_fail): a null country code in a financial rollup should stop the run.
Proportional enforcement: higher layer + more damage → harder gate.
◆ Gotchas
- drop vs quarantine — both remove bad rows from the clean table, but drop deletes (only a count remains) while quarantine preserves them for replay. "Investigate / reprocess / don't lose them" ⇒ quarantine ⇒ warn + split, never
expect_or_drop. - Quarantine tables grow unbounded — set a retention/archival policy.
- Exam/Notion write
import dlt/@dlt.expect*; current module ispyspark.pipelines/@dp— same behaviour, recognise both.
Takeaways (rebuild it from these)
- Quarantine = the fourth option: route bad rows aside — clean table stays clean, bad rows preserved with metadata for inspection + replay.
- Two bad-row sources: parse failures →
_rescued_data(IS NULLclean /IS NOT NULLquarantine); rule failures → expectations. - Expectation quarantine = warn + filter split:
rulesdict →expect_all(keeps all, logs metrics) →valid+quarantine(NOT(rules)).expect_or_dropcan't quarantine — it deletes first. - Shapes: two tables or one table +
is_quarantinedflag partitioned by it. Keep the hot path lean — quarantine handling on a separate process/compute. - Gate by layer: Bronze warn (structural), Silver drop/quarantine (business), Gold fail (critical). Quarantine tables need retention;
@dlt(exam) ==@dp(current).
Before you move on — say these without scrolling up
- Why is
expect_or_dropthe wrong tool for quarantine — and what do you use instead? - The warn-and-split pattern — what are the three tables/parts?
- Two sources of bad rows and where each is caught.
- Two valid quarantine shapes — and how do you keep the live stream fast?
That closes Section 3's story: transform ([Advanced transformations — window functions, joins, aggregations](/lessons/s3-transformations/)) → cleanse/dedup ([Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/)) → enforce rules ([Data quality — expectations and constraints, and what happens to a bad row](/lessons/s3-data-quality/)) → handle the bad (this lesson). Next section: monitoring what all this produces.