This one sits a little apart from the file-layout story of the rest of Section 6 — it's about movement of changes, not layout. But it closes a gap you already felt in [Streaming tables vs materialized views](/lessons/s1-streaming-tables-vs-mv/), so it lands naturally.
The spine
Beat 1 — settle the collision: CDF is the out direction
Recall from [APPLY CHANGES — CDC and SCD, declaratively](/lessons/s1-apply-changes-cdc/): CDC consumes changes into Databricks from an external source (Oracle, Kafka); CDF (Change Data Feed) produces a stream of changes out of a Delta table for downstream consumers. Direction is the whole tell — CDC comes in, CDF goes out. This lesson is CDF.
Beat 2 — the limitation it fixes (why it's in the objective)
Predict: a streaming table is append-only (
[Streaming tables vs materialized views](/lessons/s1-streaming-tables-vs-mv/)) — it propagates new rows beautifully. So when a row in Silver is updated or deleted, what does a plain streaming read carry downstream to Gold?
…
Nothing — updates and deletes are invisible to an append-only streaming read. If Gold must reflect updated/deleted Silver rows, a streaming table can't carry that. That's the gap CDF fills:
Anchor. With CDF enabled, a Delta table records what changed at the row level — each change stamped insert / update_preimage / update_postimage / delete, with version and timestamp — so a downstream consumer processes only the changes instead of re-reading the whole table. That "only the changes" (including updates and deletes a streaming read drops) is the latency + efficiency win.
Lock it. CDF = out (from a Delta table); CDC = in. CDF fixes the streaming-table append-only limitation by carrying updates and deletes, and lets consumers process only changes → lower latency than full re-reads.
The dials (skim now; return when a question needs one)
◆ Enabling and reading it
ALTER TABLE customers SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
Once on, each DML also writes change metadata (a _change_data folder). Read the feed:
spark.read.format("delta")
.option("readChangeFeed", "true")
.option("startingVersion", 5) # or startingTimestamp; readStream for continuous
.table("customers")
# each row gains: _change_type, _commit_version, _commit_timestamp
The _change_type values: insert, delete, and — for an update — two rows, update_preimage (values before) and update_postimage (values after). That before/after pair is the raw material for SCD Type 2 history ([Dimensional modelling — SCD types and the star schema](/lessons/s10-dimensional/)). Read as a batch (startingVersion) or a stream (readChangeFeed on readStream).
◆ The re-read trap
A tested failure: a job that overwrites/appends the entire change history each run produces duplicates — "the entire history of updated records appended at each execution." The fix is to process CDF incrementally with a MERGE keyed on the id (modify only changed records), or read the CDF as a stream from the last version — not re-scan all history every time.
◆ The cost — enable it deliberately
CDF isn't free: writing change metadata adds write overhead (extra files per DML). Enable it only on tables whose downstream consumers actually need the change stream — typically Silver feeding multiple Golds, or tables shared for incremental consumption. Don't blanket-enable. (And VACUUM deletes _change_data files past retention, [Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM](/lessons/s6-compaction/).)
◆ Where it connects
- Downstream of
[APPLY CHANGES — CDC and SCD, declaratively](/lessons/s1-apply-changes-cdc/): AUTO CDC consumes a change feed; CDF produces one. You can chain them — CDF out of Silver, AUTO CDC into Gold. [Delta Sharing — live data out, without copies](/lessons/s4-delta-sharing/): share a table's CDF so a recipient consumes only changes.[Dimensional modelling — SCD types and the star schema](/lessons/s10-dimensional/): preimage/postimage feeds SCD Type 2.
Takeaways (rebuild it from these)
- CDF = the out direction (produce changes from a Delta table); CDC = in. Same "changes," opposite arrows.
- It fixes the append-only limitation of streaming tables — carries updates and deletes downstream and lets consumers process only changes → lower latency.
- Enable
delta.enableChangeDataFeed = true; readreadChangeFeed=true+startingVersion/timestamp (batch) orreadStream(stream). Rows gain_change_type(insert / update_preimage / update_postimage / delete),_commit_version,_commit_timestamp. - Process CDF incrementally (MERGE / stream) — re-scanning full history each run duplicates. It has write overhead — enable only where needed.
Before you move on — say these without scrolling up
- CDC vs CDF — which direction is each?
- What can't an append-only streaming read carry to Gold, and how does CDF fix it?
- The four
_change_typevalues — and which pair an update produces. - A CDF job keeps producing duplicates — why, and the fix?
Last in Section 6: when a query is still slow, how to read the evidence to find out why → [Reading the evidence — Query Profile & Spark UI](/lessons/s6-query-profile/).