Lessons

Data Transformation, Cleansing & Quality

Data quality — expectations and constraints, and what happens to a bad row

Enforce data quality using Lakeflow pipeline expectations and Delta table constraints; choose the correct violation behavior (warn/drop/fail).

Deduped data can still be wrong — a negative price, a null customer id, a timestamp from 1970. Data quality is declaring what "valid" means and deciding what happens to a row that fails. Databricks gives you two mechanisms that look alike and are constantly confused on the exam. One axis keeps them apart forever.


The spine

Beat 1 — the anchor: two questions, where and what

Anchor. Every data-quality control answers two things. (1) Where does the rule live — on the table (a constraint, checked on every write) or in the pipeline (an expectation, checked as data flows through a Lakeflow dataset)? (2) What happens to a violating rowkept and counted, dropped, or fail the whole write? Pick the mechanism by answering both.

The one difference that drives everything: constraints only ever do one thing to a bad row (fail the write); expectations let you choose.

Beat 2 — constraints: the hard gate (fail-only)

A constraint lives on the Delta table, so it guards every write, no matter who does it:

ALTER TABLE orders ADD CONSTRAINT valid_amount CHECK (amount > 0);

Predict: you add that CHECK to a table that already contains a row with amount = -5. What happens?

It errors out. Delta validates a new CHECK against existing and future data — so all current rows must already pass before the constraint is added. And once it's on, a future write that violates it fails the entire transaction: recall [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/) — the write is atomic, so no records land, not even the valid ones in that batch. That's "hard gate": bad data can never enter, but one bad row can halt your whole load. There is no warn or drop mode for constraints — pass the write or fail it.

The enforced-vs-informational trap. Databricks enforces CHECK and NOT NULL. It does not enforce PRIMARY KEY, FOREIGN KEY, UNIQUE — those are informational (documentation + optimizer hints via RELY). So "I added a PRIMARY KEY, why did a duplicate still insert?" → because PK is informational. Enforce uniqueness yourself ([Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/)).

Beat 3 — expectations: the configurable gate (and the #1 trap)

An expectation lives inside a Lakeflow pipeline ([Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/)) and checks rows as the dataset computes. Its power: you choose the action, and every violation is tracked as a DQ metric in the event log (constraints track nothing).

IntentSQLPython decoratorViolating row →
Warn (default)CONSTRAINT c EXPECT (cond)@dp.expect / @dp.expect_or_warnKept in the dataset; counted
Drop… EXPECT (cond) ON VIOLATION DROP ROW@dp.expect_or_dropDropped; counted
Fail… EXPECT (cond) ON VIOLATION FAIL UPDATE@dp.expect_or_failAborts the pipeline update

Predict: a bare CONSTRAINT valid_id EXPECT (id IS NOT NULL) with no ON VIOLATION clause — do the null-id rows get removed?

No — this is the #1 trap. The default is warn: the invalid rows stay in the table, merely reported in the metrics. If a question wants bad rows gone, the answer must say DROP ROW (or expect_or_drop); to stop the pipeline dead on any violation, FAIL UPDATE (or expect_or_fail).

Lock it. Constraint = on the table, fail-only, validates existing+future, atomic (no rows land). Expectation = in the pipeline, you choose warn/drop/fail, logged as metrics. Bare EXPECT = warn = keeps bad rows.


The dials (skim now; return when a question needs one)

◆ Grouped + reusable rules (metadata-driven quality)

Grouped variants apply one action to a set of rules: @dp.expect_all({...}), @dp.expect_all_or_drop({...}), @dp.expect_all_or_fail({...}) — each dict entry is name: condition. Because that dict is just data, you keep the rules in one place and apply them to many tables. The portable, tested pattern: store rules in a Delta table outside the pipeline's target schema, read them at build time into a dict (filtered by a pipeline parameter — a tag or schema name), and pass to expect_all. Same idea when rules come from a metadata JSON: load it, loop its entries, apply via expect_all. That's how a pipeline "applies rules dynamically without hardcoding" — never copy-pasting @dp.expect onto every function.

◆ Expectations are per-row — completeness needs a join first

Every expectation is evaluated on a single row, so you cannot directly assert a cross-table fact like "the report contains every source row." To validate completeness, first materialize an intermediate table doing the comparison — a LEFT OUTER JOIN of source against report — then an expectation that no report key is null. Tell: "expectation to prove all source rows are present" → not directly; build the join first.

◆ Name the collision: constraint vs expectation

Delta constraintLakeflow expectation
Lives onthe tablea pipeline dataset
Guardsevery write to the tablerows as the pipeline computes them
Bad-row optionsfail only (hard)warn / drop / fail (choose)
Metricsnonelogged as DQ metrics in the event log
KeywordCHECK / NOT NULLEXPECT / @dp.expect*

Decode: "never allow a negative value into this table, from any source" → table CHECK. "In our pipeline, drop bad rows but keep loading and let me monitor how many" → expectation DROP ROW. "Track invalid rows but keep them for inspection" → expectation, default warn (then route them — next lesson).

Takeaways (rebuild it from these)

  1. Two axes: where the rule lives (table = constraint; pipeline = expectation) and what happens to a bad row (keep/drop/fail).
  2. Constraints (CHECK, NOT NULL) are enforced, fail-only — validated against existing+new data; a violation aborts the atomic write (no rows land).
  3. PRIMARY KEY/FOREIGN KEY/UNIQUE are informational, not enforced (RELY) — they won't stop duplicates; enforce yourself ([Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/)).
  4. Expectations choose the action: EXPECT/expect_or_warn = warn (keeps bad rows!), DROP ROW/expect_or_drop = drop, FAIL UPDATE/expect_or_fail = abort; all counted in the event log. Grouped = expect_all*; reusable = rules-in-a-table + parameter.
  5. Expectations are per-row — completeness needs a LEFT OUTER JOIN intermediate table first. A bare EXPECT does not delete bad rows — the #1 trap.

Before you move on — say these without scrolling up

  1. Add a CHECK to a table that already has a violating row — what happens, and why?
  2. A CHECK violation on a 1000-row batch with one bad row — how many rows land?
  3. PRIMARY KEY didn't stop a duplicate — why?
  4. Bare EXPECT (id IS NOT NULL), no ON VIOLATION — are the null rows removed?
  5. "Prove the report has every source row" — why can't one expectation do it directly?

Next: warn and drop either keep bad rows in your clean table or throw them away — production routes them aside instead, losing nothing → [Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/).

Prerequisites

Leads to