Deduped data can still be wrong — a negative price, a null customer id, a timestamp from 1970. Data quality is declaring what "valid" means and deciding what happens to a row that fails. Databricks gives you two mechanisms that look alike and are constantly confused on the exam. One axis keeps them apart forever.
The spine
Beat 1 — the anchor: two questions, where and what
Anchor. Every data-quality control answers two things. (1) Where does the rule live — on the table (a constraint, checked on every write) or in the pipeline (an expectation, checked as data flows through a Lakeflow dataset)? (2) What happens to a violating row — kept and counted, dropped, or fail the whole write? Pick the mechanism by answering both.
The one difference that drives everything: constraints only ever do one thing to a bad row (fail the write); expectations let you choose.
Beat 2 — constraints: the hard gate (fail-only)
A constraint lives on the Delta table, so it guards every write, no matter who does it:
ALTER TABLE orders ADD CONSTRAINT valid_amount CHECK (amount > 0);
Predict: you add that
CHECKto a table that already contains a row withamount = -5. What happens?
…
It errors out. Delta validates a new CHECK against existing and future data — so all current rows must already pass before the constraint is added. And once it's on, a future write that violates it fails the entire transaction: recall [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/) — the write is atomic, so no records land, not even the valid ones in that batch. That's "hard gate": bad data can never enter, but one bad row can halt your whole load. There is no warn or drop mode for constraints — pass the write or fail it.
The enforced-vs-informational trap. Databricks enforces CHECK and NOT NULL. It does not enforce PRIMARY KEY, FOREIGN KEY, UNIQUE — those are informational (documentation + optimizer hints via RELY). So "I added a PRIMARY KEY, why did a duplicate still insert?" → because PK is informational. Enforce uniqueness yourself ([Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/)).
Beat 3 — expectations: the configurable gate (and the #1 trap)
An expectation lives inside a Lakeflow pipeline ([Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/)) and checks rows as the dataset computes. Its power: you choose the action, and every violation is tracked as a DQ metric in the event log (constraints track nothing).
| Intent | SQL | Python decorator | Violating row → |
|---|---|---|---|
| Warn (default) | CONSTRAINT c EXPECT (cond) | @dp.expect / @dp.expect_or_warn | Kept in the dataset; counted |
| Drop | … EXPECT (cond) ON VIOLATION DROP ROW | @dp.expect_or_drop | Dropped; counted |
| Fail | … EXPECT (cond) ON VIOLATION FAIL UPDATE | @dp.expect_or_fail | Aborts the pipeline update |
Predict: a bare
CONSTRAINT valid_id EXPECT (id IS NOT NULL)with noON VIOLATIONclause — do the null-id rows get removed?
…
No — this is the #1 trap. The default is warn: the invalid rows stay in the table, merely reported in the metrics. If a question wants bad rows gone, the answer must say DROP ROW (or expect_or_drop); to stop the pipeline dead on any violation, FAIL UPDATE (or expect_or_fail).
Lock it. Constraint = on the table, fail-only, validates existing+future, atomic (no rows land). Expectation = in the pipeline, you choose warn/drop/fail, logged as metrics. Bare
EXPECT= warn = keeps bad rows.
The dials (skim now; return when a question needs one)
◆ Grouped + reusable rules (metadata-driven quality)
Grouped variants apply one action to a set of rules: @dp.expect_all({...}), @dp.expect_all_or_drop({...}), @dp.expect_all_or_fail({...}) — each dict entry is name: condition. Because that dict is just data, you keep the rules in one place and apply them to many tables. The portable, tested pattern: store rules in a Delta table outside the pipeline's target schema, read them at build time into a dict (filtered by a pipeline parameter — a tag or schema name), and pass to expect_all. Same idea when rules come from a metadata JSON: load it, loop its entries, apply via expect_all. That's how a pipeline "applies rules dynamically without hardcoding" — never copy-pasting @dp.expect onto every function.
◆ Expectations are per-row — completeness needs a join first
Every expectation is evaluated on a single row, so you cannot directly assert a cross-table fact like "the report contains every source row." To validate completeness, first materialize an intermediate table doing the comparison — a LEFT OUTER JOIN of source against report — then an expectation that no report key is null. Tell: "expectation to prove all source rows are present" → not directly; build the join first.
◆ Name the collision: constraint vs expectation
| Delta constraint | Lakeflow expectation | |
|---|---|---|
| Lives on | the table | a pipeline dataset |
| Guards | every write to the table | rows as the pipeline computes them |
| Bad-row options | fail only (hard) | warn / drop / fail (choose) |
| Metrics | none | logged as DQ metrics in the event log |
| Keyword | CHECK / NOT NULL | EXPECT / @dp.expect* |
Decode: "never allow a negative value into this table, from any source" → table CHECK. "In our pipeline, drop bad rows but keep loading and let me monitor how many" → expectation DROP ROW. "Track invalid rows but keep them for inspection" → expectation, default warn (then route them — next lesson).
Takeaways (rebuild it from these)
- Two axes: where the rule lives (table = constraint; pipeline = expectation) and what happens to a bad row (keep/drop/fail).
- Constraints (
CHECK,NOT NULL) are enforced, fail-only — validated against existing+new data; a violation aborts the atomic write (no rows land). PRIMARY KEY/FOREIGN KEY/UNIQUEare informational, not enforced (RELY) — they won't stop duplicates; enforce yourself ([Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/)).- Expectations choose the action:
EXPECT/expect_or_warn= warn (keeps bad rows!),DROP ROW/expect_or_drop= drop,FAIL UPDATE/expect_or_fail= abort; all counted in the event log. Grouped =expect_all*; reusable = rules-in-a-table + parameter. - Expectations are per-row — completeness needs a LEFT OUTER JOIN intermediate table first. A bare
EXPECTdoes not delete bad rows — the #1 trap.
Before you move on — say these without scrolling up
- Add a
CHECKto a table that already has a violating row — what happens, and why? - A
CHECKviolation on a 1000-row batch with one bad row — how many rows land? PRIMARY KEYdidn't stop a duplicate — why?- Bare
EXPECT (id IS NOT NULL), noON VIOLATION— are the null rows removed? - "Prove the report has every source row" — why can't one expectation do it directly?
Next: warn and drop either keep bad rows in your clean table or throw them away — production routes them aside instead, losing nothing → [Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/).