Design pipelines that isolate invalid records (quarantine) for inspection and replay, using expectations and Auto Loader rescued data.

Last lesson gave three violation actions: warn (keep the bad row in your clean table — no), drop (delete it — the row's gone, only a count survives), fail (halt the pipeline — one bad record stops everything). In production none of those is what a mature team wants. They want a fourth behaviour: set the bad row aside where it can be seen, understood, and reprocessed. That's quarantine.

The spine

Beat 1 — the anchor: route aside, don't discard and don't block

Anchor. Quarantine = split incoming data into two destinations — valid rows to the clean table, invalid rows to a quarantine table (with metadata: why it failed, when). Nothing is lost, nothing is blocked. The clean table stays clean; bad rows wait in a side table for a human to inspect, fix the source, and replay them.

Predict: expect_or_drop removes the bad row so your clean table stays clean. Why isn't that enough in production?

…

Because drop throws the actual row away — you keep a count of failures but not the rows, so you can never investigate what was wrong or reprocess it once fixed. Even 0.1% of 100K/day is ~36K silently deleted rows a year. Quarantine keeps the evidence.

Beat 2 — the pattern to memorise: warn, then split (NOT drop)

Here's the buried insight, and it's the trap: to quarantine on business rules you use expectations in warn mode — not DROP ROW — because warn keeps every row, and you split them yourself.

from pyspark import pipelines as dp   # exam/older code: import dlt; @dp == @dlt

rules = {
    "valid_amount": "amount > 0",
    "has_user":     "user_id IS NOT NULL",
    "valid_email":  "email LIKE '%@%'",
}
quarantine_rule = "NOT({})".format(" AND ".join(rules.values()))  # true when ANY rule fails

@dp.table
@dp.expect_all(rules)                 # WARN: record metrics, keep ALL rows
def raw_tagged():
    return spark.readStream.table("bronze_events")

@dp.table                              # clean: every rule passes
def valid_events():
    return dp.read("raw_tagged").filter(" AND ".join(rules.values()))

@dp.table                              # quarantine: any rule failed
def quarantine_events():
    return dp.read("raw_tagged").filter(quarantine_rule)

Predict: swap expect_all for expect_all_or_drop here — what breaks?

…

You'd have nothing to quarantine — drop deletes the rows before you can route them. That's the trap in every quarantine question: an answer using expect_or_drop cannot quarantine.

Lock it. Quarantine = expect_all (warn, keeps all + logs metrics) → two filtered tables: valid (rules pass) and quarantine (NOT(rules)). Add metadata (current_timestamp(), a failure_reason) for triage + replay.

The dials (skim now; return when a question needs one)

◆ Two kinds of "bad," two catch points

Didn't parse (malformed JSON, type mismatch, unexpected column) — caught at ingest by Auto Loader _rescued_data ([Auto Loader — incremental file ingestion and schema evolution](/lessons/s2-autoloader/)): _rescued_data IS NULL = clean, IS NOT NULL = quarantine.
```
df_good = df_raw.filter("_rescued_data IS NULL").drop("_rescued_data")
df_bad  = df_raw.filter("_rescued_data IS NOT NULL")   # → quarantine
```
Parsed but violates a business rule (negative amount, @-less email) — caught by expectations in warn mode (above).

A robust pipeline quarantines both.

◆ Two shapes: two tables, or one table + a flag

The split can be two tables (valid + quarantine, above) or one table with a boolean is_quarantined column, then partitioned by that flag so consumers read only the clean partition. Both are valid quarantine designs — the flag-and-partition form keeps everything in one place while still separating good from bad.

◆ Isolate the quarantine work from the hot path

For a real-time pipeline, keep the production stream lean: filter out bad records and write only valid rows to the production tables at full speed, and run a separate, lightweight process that periodically reads/stores the quarantined records. The good-data job shouldn't share compute with the bad-data handling, so triage never slows the live path.

◆ Where each gate belongs — the medallion pyramid (`[The one job — and the two axes everything lives on](/lessons/f1-the-one-job/)`)

Enforcement tightens as data climbs:

Bronze — structural checks in warn (expect): non-null ids, valid timestamps. Keep almost everything (raw fidelity).
Silver — business rules with drop or quarantine: non-negative amounts, valid categories. The clean/quarantine fork usually lives here.
Gold — critical invariants with fail (expect_or_fail): a null country code in a financial rollup should stop the run.

Proportional enforcement: higher layer + more damage → harder gate.

◆ Gotchas

drop vs quarantine — both remove bad rows from the clean table, but drop deletes (only a count remains) while quarantine preserves them for replay. "Investigate / reprocess / don't lose them" ⇒ quarantine ⇒ warn + split, never expect_or_drop.
Quarantine tables grow unbounded — set a retention/archival policy.
Exam/Notion write import dlt / @dlt.expect*; current module is pyspark.pipelines / @dp — same behaviour, recognise both.

Takeaways (rebuild it from these)

Quarantine = the fourth option: route bad rows aside — clean table stays clean, bad rows preserved with metadata for inspection + replay.
Two bad-row sources: parse failures → _rescued_data (IS NULL clean / IS NOT NULL quarantine); rule failures → expectations.
Expectation quarantine = warn + filter split: rules dict → expect_all (keeps all, logs metrics) → valid + quarantine (NOT(rules)). expect_or_drop can't quarantine — it deletes first.
Shapes: two tables or one table + is_quarantined flag partitioned by it. Keep the hot path lean — quarantine handling on a separate process/compute.
Gate by layer: Bronze warn (structural), Silver drop/quarantine (business), Gold fail (critical). Quarantine tables need retention; @dlt (exam) == @dp (current).

Before you move on — say these without scrolling up

Why is expect_or_drop the wrong tool for quarantine — and what do you use instead?
The warn-and-split pattern — what are the three tables/parts?
Two sources of bad rows and where each is caught.
Two valid quarantine shapes — and how do you keep the live stream fast?

That closes Section 3's story: transform ([Advanced transformations — window functions, joins, aggregations](/lessons/s3-transformations/)) → cleanse/dedup ([Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/)) → enforce rules ([Data quality — expectations and constraints, and what happens to a bad row](/lessons/s3-data-quality/)) → handle the bad (this lesson). Next section: monitoring what all this produces.

Quarantining bad data — the third option beyond drop and fail