Recall the two ways a source changes from [The one job — and the two axes everything lives on](/lessons/f1-the-one-job/): new records arrive (Way 1), or an existing record changes (Way 2, handled by a merge into the target). This lesson is Way 2 at production scale — where that merge stops being something you hand-write and becomes a single declaration.
The spine
Beat 1 — what CDC is, and the pain
Define it plainly: CDC (Change Data Capture) is a feed of change events from a source system (Oracle, MySQL, a Kafka topic). Instead of shipping the whole table, the source sends one event per row-change — each stamped insert / update / delete, plus a column saying when the change happened at the source.
Now feel the pain. To turn that feed into a correct target table by hand, you'd write: a foreachBatch block, a rank window function to pick the latest change per key in each batch, and a MERGE to apply it (from [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/): MERGE is the keyed upsert). But the genuinely hard part is order:
Predict: the same customer edits their address twice before the next pipeline run, and a network retry makes the older edit arrive after the newer one. Apply them in arrival order — what breaks?
…
You overwrite the new value with the old one. Correctness can't depend on when the pipeline happened to see an event.
Beat 2 — the inversion (the anchor)
Recall the LSDP anchor from [Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/) — declare what you want, let the framework write the plumbing. AUTO CDC applies it exactly here:
Anchor. You declare the key (which column identifies a record) and the sequence (which column orders the changes); Databricks writes the MERGE that keeps the target correct — applying the latest change per key, in the right order, even when events arrive out of order. It's f1's "changed record → merge into target," made declarative.
One naming note, because you'll meet both forms (renamed-product trap): the current API is AUTO CDC (CREATE FLOW … AS AUTO CDC INTO in SQL, dp.create_auto_cdc_flow() in Python). Its older name is APPLY CHANGES INTO, which still works and is exactly what the Nov-2025 exam guide and most questions still show. Same clauses, same behaviour — one thing, two names.
Beat 3 — the two clauses that carry everything
CREATE OR REFRESH STREAMING TABLE my_catalog.silver.customers;
CREATE FLOW customers_cdc AS AUTO CDC INTO my_catalog.silver.customers
FROM stream(bronze_customer_changes) -- a streaming source
KEYS (customer_id) -- what uniquely identifies a record
SEQUENCE BY updated_at -- the logical-time column that orders changes
APPLY AS DELETE WHEN operation = 'DELETE' -- (optional) how a delete event is recognised
COLUMNS * EXCEPT (operation, updated_at) -- (optional) which columns to write
STORED AS SCD TYPE 1; -- or TYPE 2 for full history
Two clauses do the real work — define them one at a time:
KEYS (customer_id)— the primary key. It drives the merge: key already in the target → update it; key absent → insert it. (Keys must be immutable — if a key value changes, AUTO CDC treats it as a brand-new record, because it can't tell it's the same one.)SEQUENCE BY updated_at— the column representing logical time (when the change happened at the source). This is the piece that solves out-of-order delivery: Databricks applies the change with the highest sequence value per key, not the one that arrived last. The source system stamps this (Oracle's commit timestamp) — you never set it; it separates physical arrival order from logical event order.
Ground it in f1: a customer edits their phone twice before the next run → AUTO CDC keeps the edit with the latest updated_at, whichever arrived first.
Lock it. KEYS = identity (match→update, no-match→insert; must be immutable). SEQUENCE BY = logical time; applies the latest change per key, fixing out-of-order arrival. Source stamps the sequence, never you.
The dials (skim now; return when a question needs one)
◆ SCD Type 1 vs Type 2 — how much history
STORED AS SCD TYPE … is the second dial. (SCD = Slowly Changing Dimension — a dimension whose attributes change over time, like a customer's address.)
- Type 1 — overwrite, no history. Only the current value survives; one row per key, always current. Customer changes phone → old number gone. Use when history doesn't matter.
- Type 2 — keep full history. The old row is expired (an end-timestamp stamped on it) and a new row inserted for the new value (with a start-timestamp). The target holds the whole timeline. Databricks manages the temporal columns for you (
__START_AT/__END_AT). Use when you must answer "what was true at the time" (audit, point-in-time joins, regulatory history).
The exam tell: "keep only the latest / current state" → Type 1; "track history / preserve previous values / point-in-time" → Type 2.
◆ Two more the docs add
- AUTO CDC FROM SNAPSHOT (Python-only) — for a source with no change feed, only periodic full snapshots.
dp.create_auto_cdc_from_snapshot_flow(...)compares successive snapshots and infers the inserts/updates/deletes. Use when the upstream can only give you "here's the whole table again." bitemporalSCD — tracks two time dimensions (when the change happened at source and when the system recorded it) viaSYSTEM SEQUENCE BY. Niche; reach for it only to reconstruct "what did we believe was true at time T."
◆ The collision to guard: CDC is not CDF
Both are about "changes" and both are tested. CDC (Change Data Capture) = consuming changes from an external source into Databricks — this lesson. CDF (Change Data Feed) = producing a stream of changes from a Delta table for downstream consumers — a different feature ([Change Data Feed — emitting a table's changes downstream](/lessons/s6-cdf/)). Direction is the tell: CDC comes in from Oracle/Kafka; CDF goes out of your Delta table.
Takeaways (rebuild it from these)
- CDC = a feed of insert/update/delete events from a source. AUTO CDC (older name APPLY CHANGES INTO, still on the exam) turns that feed into a correct target with one declaration instead of hand-written foreachBatch + rank-window + MERGE.
KEYS= identity (match→update, no-match→insert); keys must be immutable.SEQUENCE BY= logical time; applies the latest change per key and fixes out-of-order arrival. The source stamps it, not you.- SCD Type 1 = overwrite (current only); SCD Type 2 = expire old + insert new (full history, temporal columns auto-managed).
- It's the declarative form of f1's "changed record → merge into target," on a streaming source (
[Streaming tables vs materialized views](/lessons/s1-streaming-tables-vs-mv/)), built on the MERGE machine from[How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/). - CDC (in, from a source) ≠ CDF (out, from a Delta table) →
[Change Data Feed — emitting a table's changes downstream](/lessons/s6-cdf/).
Before you move on — say these without scrolling up
- Two edits to one key arrive out of order — what makes AUTO CDC apply the right one, and who stamps that column?
KEYS— what does a match vs no-match do? Why must keys be immutable?- "Track a customer's address history for audit" — Type 1 or Type 2, and what does the framework do to the old row?
- CDC vs CDF — which direction is each?
The Type-2 idea returns when we model dimensions in [Dimensional modelling — SCD types and the star schema](/lessons/s10-dimensional/); "pick the latest per key" returns, in a different guise, when we handle duplicates in [Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/).