Lessons

Data Transformation, Cleansing & Quality

Quarantining bad data — the third option beyond drop and fail

Design pipelines that isolate invalid records (quarantine) for inspection and replay, using expectations and Auto Loader rescued data.

Last lesson gave three violation actions: warn (keep the bad row in your clean table — no), drop (delete it — the row's gone, only a count survives), fail (halt the pipeline — one bad record stops everything). In production none of those is what a mature team wants. They want a fourth behaviour: set the bad row aside where it can be seen, understood, and reprocessed. That's quarantine.


The spine

Beat 1 — the anchor: route aside, don't discard and don't block

Anchor. Quarantine = split incoming data into two destinations — valid rows to the clean table, invalid rows to a quarantine table (with metadata: why it failed, when). Nothing is lost, nothing is blocked. The clean table stays clean; bad rows wait in a side table for a human to inspect, fix the source, and replay them.

Predict: expect_or_drop removes the bad row so your clean table stays clean. Why isn't that enough in production?

Because drop throws the actual row away — you keep a count of failures but not the rows, so you can never investigate what was wrong or reprocess it once fixed. Even 0.1% of 100K/day is ~36K silently deleted rows a year. Quarantine keeps the evidence.

Beat 2 — the pattern to memorise: warn, then split (NOT drop)

Here's the buried insight, and it's the trap: to quarantine on business rules you use expectations in warn mode — not DROP ROW — because warn keeps every row, and you split them yourself.

from pyspark import pipelines as dp   # exam/older code: import dlt; @dp == @dlt

rules = {
    "valid_amount": "amount > 0",
    "has_user":     "user_id IS NOT NULL",
    "valid_email":  "email LIKE '%@%'",
}
quarantine_rule = "NOT({})".format(" AND ".join(rules.values()))  # true when ANY rule fails

@dp.table
@dp.expect_all(rules)                 # WARN: record metrics, keep ALL rows
def raw_tagged():
    return spark.readStream.table("bronze_events")

@dp.table                              # clean: every rule passes
def valid_events():
    return dp.read("raw_tagged").filter(" AND ".join(rules.values()))

@dp.table                              # quarantine: any rule failed
def quarantine_events():
    return dp.read("raw_tagged").filter(quarantine_rule)

Predict: swap expect_all for expect_all_or_drop here — what breaks?

You'd have nothing to quarantine — drop deletes the rows before you can route them. That's the trap in every quarantine question: an answer using expect_or_drop cannot quarantine.

Lock it. Quarantine = expect_all (warn, keeps all + logs metrics) → two filtered tables: valid (rules pass) and quarantine (NOT(rules)). Add metadata (current_timestamp(), a failure_reason) for triage + replay.


The dials (skim now; return when a question needs one)

◆ Two kinds of "bad," two catch points

A robust pipeline quarantines both.

◆ Two shapes: two tables, or one table + a flag

The split can be two tables (valid + quarantine, above) or one table with a boolean is_quarantined column, then partitioned by that flag so consumers read only the clean partition. Both are valid quarantine designs — the flag-and-partition form keeps everything in one place while still separating good from bad.

◆ Isolate the quarantine work from the hot path

For a real-time pipeline, keep the production stream lean: filter out bad records and write only valid rows to the production tables at full speed, and run a separate, lightweight process that periodically reads/stores the quarantined records. The good-data job shouldn't share compute with the bad-data handling, so triage never slows the live path.

◆ Where each gate belongs — the medallion pyramid ([The one job — and the two axes everything lives on](/lessons/f1-the-one-job/))

Enforcement tightens as data climbs:

Proportional enforcement: higher layer + more damage → harder gate.

◆ Gotchas

Takeaways (rebuild it from these)

  1. Quarantine = the fourth option: route bad rows aside — clean table stays clean, bad rows preserved with metadata for inspection + replay.
  2. Two bad-row sources: parse failures → _rescued_data (IS NULL clean / IS NOT NULL quarantine); rule failures → expectations.
  3. Expectation quarantine = warn + filter split: rules dict → expect_all (keeps all, logs metrics) → valid + quarantine (NOT(rules)). expect_or_drop can't quarantine — it deletes first.
  4. Shapes: two tables or one table + is_quarantined flag partitioned by it. Keep the hot path lean — quarantine handling on a separate process/compute.
  5. Gate by layer: Bronze warn (structural), Silver drop/quarantine (business), Gold fail (critical). Quarantine tables need retention; @dlt (exam) == @dp (current).

Before you move on — say these without scrolling up

  1. Why is expect_or_drop the wrong tool for quarantine — and what do you use instead?
  2. The warn-and-split pattern — what are the three tables/parts?
  3. Two sources of bad rows and where each is caught.
  4. Two valid quarantine shapes — and how do you keep the live stream fast?

That closes Section 3's story: transform ([Advanced transformations — window functions, joins, aggregations](/lessons/s3-transformations/)) → cleanse/dedup ([Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/)) → enforce rules ([Data quality — expectations and constraints, and what happens to a bad row](/lessons/s3-data-quality/)) → handle the bad (this lesson). Next section: monitoring what all this produces.

Prerequisites

Leads to