Recall the two ways a source changes from [The one job — and the two axes everything lives on](/lessons/f1-the-one-job/): new records arrive (Way 1), or an existing record changes (Way 2, handled by a merge into the target). This lesson is Way 2 at production scale — where that merge stops being something you hand-write and becomes a single declaration.

The spine

Beat 1 — what CDC is, and the pain

Define it plainly: CDC (Change Data Capture) is a feed of change events from a source system (Oracle, MySQL, a Kafka topic). Instead of shipping the whole table, the source sends one event per row-change — each stamped insert / update / delete, plus a column saying when the change happened at the source.

Now feel the pain. To turn that feed into a correct target table by hand, you'd write: a foreachBatch block, a rank window function to pick the latest change per key in each batch, and a MERGE to apply it (from [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/): MERGE is the keyed upsert). But the genuinely hard part is order:

Predict: the same customer edits their address twice before the next pipeline run, and a network retry makes the older edit arrive after the newer one. Apply them in arrival order — what breaks?

…

You overwrite the new value with the old one. Correctness can't depend on when the pipeline happened to see an event.

Beat 2 — the inversion (the anchor)

Recall the LSDP anchor from [Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/) — declare what you want, let the framework write the plumbing. AUTO CDC applies it exactly here:

Anchor. You declare the key (which column identifies a record) and the sequence (which column orders the changes); Databricks writes the MERGE that keeps the target correct — applying the latest change per key, in the right order, even when events arrive out of order. It's f1's "changed record → merge into target," made declarative.

One naming note, because you'll meet both forms (renamed-product trap): the current API is AUTO CDC (CREATE FLOW … AS AUTO CDC INTO in SQL, dp.create_auto_cdc_flow() in Python). Its older name is APPLY CHANGES INTO, which still works and is exactly what the Nov-2025 exam guide and most questions still show. Same clauses, same behaviour — one thing, two names.

Beat 3 — the two clauses that carry everything

CREATE OR REFRESH STREAMING TABLE my_catalog.silver.customers;

CREATE FLOW customers_cdc AS AUTO CDC INTO my_catalog.silver.customers
FROM stream(bronze_customer_changes)      -- a streaming source
KEYS (customer_id)                         -- what uniquely identifies a record
SEQUENCE BY updated_at                      -- the logical-time column that orders changes
APPLY AS DELETE WHEN operation = 'DELETE'   -- (optional) how a delete event is recognised
COLUMNS * EXCEPT (operation, updated_at)    -- (optional) which columns to write
STORED AS SCD TYPE 1;                        -- or TYPE 2 for full history

Two clauses do the real work — define them one at a time:

KEYS (customer_id) — the primary key. It drives the merge: key already in the target → update it; key absent → insert it. (Keys must be immutable — if a key value changes, AUTO CDC treats it as a brand-new record, because it can't tell it's the same one.)
SEQUENCE BY updated_at — the column representing logical time (when the change happened at the source). This is the piece that solves out-of-order delivery: Databricks applies the change with the highest sequence value per key, not the one that arrived last. The source system stamps this (Oracle's commit timestamp) — you never set it; it separates physical arrival order from logical event order.

Ground it in f1: a customer edits their phone twice before the next run → AUTO CDC keeps the edit with the latest updated_at, whichever arrived first.

Lock it. KEYS = identity (match→update, no-match→insert; must be immutable). SEQUENCE BY = logical time; applies the latest change per key, fixing out-of-order arrival. Source stamps the sequence, never you.

The dials (skim now; return when a question needs one)

◆ SCD Type 1 vs Type 2 — how much history

STORED AS SCD TYPE … is the second dial. (SCD = Slowly Changing Dimension — a dimension whose attributes change over time, like a customer's address.)

Type 1 — overwrite, no history. Only the current value survives; one row per key, always current. Customer changes phone → old number gone. Use when history doesn't matter.
Type 2 — keep full history. The old row is expired (an end-timestamp stamped on it) and a new row inserted for the new value (with a start-timestamp). The target holds the whole timeline. Databricks manages the temporal columns for you (__START_AT / __END_AT). Use when you must answer "what was true at the time" (audit, point-in-time joins, regulatory history).

The exam tell: "keep only the latest / current state" → Type 1; "track history / preserve previous values / point-in-time" → Type 2.

◆ Two more the docs add

AUTO CDC FROM SNAPSHOT (Python-only) — for a source with no change feed, only periodic full snapshots. dp.create_auto_cdc_from_snapshot_flow(...) compares successive snapshots and infers the inserts/updates/deletes. Use when the upstream can only give you "here's the whole table again."
bitemporal SCD — tracks two time dimensions (when the change happened at source and when the system recorded it) via SYSTEM SEQUENCE BY. Niche; reach for it only to reconstruct "what did we believe was true at time T."

◆ The collision to guard: CDC is not CDF

Both are about "changes" and both are tested. CDC (Change Data Capture) = consuming changes from an external source into Databricks — this lesson. CDF (Change Data Feed) = producing a stream of changes from a Delta table for downstream consumers — a different feature ([Change Data Feed — emitting a table's changes downstream](/lessons/s6-cdf/)). Direction is the tell: CDC comes in from Oracle/Kafka; CDF goes out of your Delta table.

Takeaways (rebuild it from these)

CDC = a feed of insert/update/delete events from a source. AUTO CDC (older name APPLY CHANGES INTO, still on the exam) turns that feed into a correct target with one declaration instead of hand-written foreachBatch + rank-window + MERGE.
KEYS = identity (match→update, no-match→insert); keys must be immutable. SEQUENCE BY = logical time; applies the latest change per key and fixes out-of-order arrival. The source stamps it, not you.
SCD Type 1 = overwrite (current only); SCD Type 2 = expire old + insert new (full history, temporal columns auto-managed).
It's the declarative form of f1's "changed record → merge into target," on a streaming source ([Streaming tables vs materialized views](/lessons/s1-streaming-tables-vs-mv/)), built on the MERGE machine from [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/).
CDC (in, from a source) ≠ CDF (out, from a Delta table) → [Change Data Feed — emitting a table's changes downstream](/lessons/s6-cdf/).

Before you move on — say these without scrolling up

Two edits to one key arrive out of order — what makes AUTO CDC apply the right one, and who stamps that column?
KEYS — what does a match vs no-match do? Why must keys be immutable?
"Track a customer's address history for audit" — Type 1 or Type 2, and what does the framework do to the old row?
CDC vs CDF — which direction is each?

The Type-2 idea returns when we model dimensions in [Dimensional modelling — SCD types and the star schema](/lessons/s10-dimensional/); "pick the latest per key" returns, in a different guise, when we handle duplicates in [Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/).

APPLY CHANGES — CDC and SCD, declaratively