Lessons

Developing Code for Data Processing

APPLY CHANGES — CDC and SCD, declaratively

Use APPLY CHANGES APIs to simplify CDC in Lakeflow Spark Declarative Pipelines.

Recall the two ways a source changes from [The one job — and the two axes everything lives on](/lessons/f1-the-one-job/): new records arrive (Way 1), or an existing record changes (Way 2, handled by a merge into the target). This lesson is Way 2 at production scale — where that merge stops being something you hand-write and becomes a single declaration.


The spine

Beat 1 — what CDC is, and the pain

Define it plainly: CDC (Change Data Capture) is a feed of change events from a source system (Oracle, MySQL, a Kafka topic). Instead of shipping the whole table, the source sends one event per row-change — each stamped insert / update / delete, plus a column saying when the change happened at the source.

Now feel the pain. To turn that feed into a correct target table by hand, you'd write: a foreachBatch block, a rank window function to pick the latest change per key in each batch, and a MERGE to apply it (from [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/): MERGE is the keyed upsert). But the genuinely hard part is order:

Predict: the same customer edits their address twice before the next pipeline run, and a network retry makes the older edit arrive after the newer one. Apply them in arrival order — what breaks?

You overwrite the new value with the old one. Correctness can't depend on when the pipeline happened to see an event.

Beat 2 — the inversion (the anchor)

Recall the LSDP anchor from [Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/) — declare what you want, let the framework write the plumbing. AUTO CDC applies it exactly here:

Anchor. You declare the key (which column identifies a record) and the sequence (which column orders the changes); Databricks writes the MERGE that keeps the target correct — applying the latest change per key, in the right order, even when events arrive out of order. It's f1's "changed record → merge into target," made declarative.

One naming note, because you'll meet both forms (renamed-product trap): the current API is AUTO CDC (CREATE FLOW … AS AUTO CDC INTO in SQL, dp.create_auto_cdc_flow() in Python). Its older name is APPLY CHANGES INTO, which still works and is exactly what the Nov-2025 exam guide and most questions still show. Same clauses, same behaviour — one thing, two names.

Beat 3 — the two clauses that carry everything

CREATE OR REFRESH STREAMING TABLE my_catalog.silver.customers;

CREATE FLOW customers_cdc AS AUTO CDC INTO my_catalog.silver.customers
FROM stream(bronze_customer_changes)      -- a streaming source
KEYS (customer_id)                         -- what uniquely identifies a record
SEQUENCE BY updated_at                      -- the logical-time column that orders changes
APPLY AS DELETE WHEN operation = 'DELETE'   -- (optional) how a delete event is recognised
COLUMNS * EXCEPT (operation, updated_at)    -- (optional) which columns to write
STORED AS SCD TYPE 1;                        -- or TYPE 2 for full history

Two clauses do the real work — define them one at a time:

Ground it in f1: a customer edits their phone twice before the next run → AUTO CDC keeps the edit with the latest updated_at, whichever arrived first.

Lock it. KEYS = identity (match→update, no-match→insert; must be immutable). SEQUENCE BY = logical time; applies the latest change per key, fixing out-of-order arrival. Source stamps the sequence, never you.


The dials (skim now; return when a question needs one)

◆ SCD Type 1 vs Type 2 — how much history

STORED AS SCD TYPE … is the second dial. (SCD = Slowly Changing Dimension — a dimension whose attributes change over time, like a customer's address.)

The exam tell: "keep only the latest / current state" → Type 1; "track history / preserve previous values / point-in-time" → Type 2.

◆ Two more the docs add

◆ The collision to guard: CDC is not CDF

Both are about "changes" and both are tested. CDC (Change Data Capture) = consuming changes from an external source into Databricks — this lesson. CDF (Change Data Feed) = producing a stream of changes from a Delta table for downstream consumers — a different feature ([Change Data Feed — emitting a table's changes downstream](/lessons/s6-cdf/)). Direction is the tell: CDC comes in from Oracle/Kafka; CDF goes out of your Delta table.

Takeaways (rebuild it from these)

  1. CDC = a feed of insert/update/delete events from a source. AUTO CDC (older name APPLY CHANGES INTO, still on the exam) turns that feed into a correct target with one declaration instead of hand-written foreachBatch + rank-window + MERGE.
  2. KEYS = identity (match→update, no-match→insert); keys must be immutable. SEQUENCE BY = logical time; applies the latest change per key and fixes out-of-order arrival. The source stamps it, not you.
  3. SCD Type 1 = overwrite (current only); SCD Type 2 = expire old + insert new (full history, temporal columns auto-managed).
  4. It's the declarative form of f1's "changed record → merge into target," on a streaming source ([Streaming tables vs materialized views](/lessons/s1-streaming-tables-vs-mv/)), built on the MERGE machine from [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/).
  5. CDC (in, from a source) ≠ CDF (out, from a Delta table)[Change Data Feed — emitting a table's changes downstream](/lessons/s6-cdf/).

Before you move on — say these without scrolling up

  1. Two edits to one key arrive out of order — what makes AUTO CDC apply the right one, and who stamps that column?
  2. KEYS — what does a match vs no-match do? Why must keys be immutable?
  3. "Track a customer's address history for audit" — Type 1 or Type 2, and what does the framework do to the old row?
  4. CDC vs CDF — which direction is each?

The Type-2 idea returns when we model dimensions in [Dimensional modelling — SCD types and the star schema](/lessons/s10-dimensional/); "pick the latest per key" returns, in a different guise, when we handle duplicates in [Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/).

Prerequisites

Leads to