Lessons

Cost & Performance Optimisation

Change Data Feed — emitting a table's changes downstream

Apply Change Data Feed (CDF) to address specific limitations of streaming tables and enhance latency.

This one sits a little apart from the file-layout story of the rest of Section 6 — it's about movement of changes, not layout. But it closes a gap you already felt in [Streaming tables vs materialized views](/lessons/s1-streaming-tables-vs-mv/), so it lands naturally.


The spine

Beat 1 — settle the collision: CDF is the out direction

Recall from [APPLY CHANGES — CDC and SCD, declaratively](/lessons/s1-apply-changes-cdc/): CDC consumes changes into Databricks from an external source (Oracle, Kafka); CDF (Change Data Feed) produces a stream of changes out of a Delta table for downstream consumers. Direction is the whole tell — CDC comes in, CDF goes out. This lesson is CDF.

Beat 2 — the limitation it fixes (why it's in the objective)

Predict: a streaming table is append-only ([Streaming tables vs materialized views](/lessons/s1-streaming-tables-vs-mv/)) — it propagates new rows beautifully. So when a row in Silver is updated or deleted, what does a plain streaming read carry downstream to Gold?

Nothing — updates and deletes are invisible to an append-only streaming read. If Gold must reflect updated/deleted Silver rows, a streaming table can't carry that. That's the gap CDF fills:

Anchor. With CDF enabled, a Delta table records what changed at the row level — each change stamped insert / update_preimage / update_postimage / delete, with version and timestamp — so a downstream consumer processes only the changes instead of re-reading the whole table. That "only the changes" (including updates and deletes a streaming read drops) is the latency + efficiency win.

Lock it. CDF = out (from a Delta table); CDC = in. CDF fixes the streaming-table append-only limitation by carrying updates and deletes, and lets consumers process only changes → lower latency than full re-reads.


The dials (skim now; return when a question needs one)

◆ Enabling and reading it

ALTER TABLE customers SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

Once on, each DML also writes change metadata (a _change_data folder). Read the feed:

spark.read.format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", 5)      # or startingTimestamp; readStream for continuous
  .table("customers")
# each row gains: _change_type, _commit_version, _commit_timestamp

The _change_type values: insert, delete, and — for an update — two rows, update_preimage (values before) and update_postimage (values after). That before/after pair is the raw material for SCD Type 2 history ([Dimensional modelling — SCD types and the star schema](/lessons/s10-dimensional/)). Read as a batch (startingVersion) or a stream (readChangeFeed on readStream).

◆ The re-read trap

A tested failure: a job that overwrites/appends the entire change history each run produces duplicates — "the entire history of updated records appended at each execution." The fix is to process CDF incrementally with a MERGE keyed on the id (modify only changed records), or read the CDF as a stream from the last version — not re-scan all history every time.

◆ The cost — enable it deliberately

CDF isn't free: writing change metadata adds write overhead (extra files per DML). Enable it only on tables whose downstream consumers actually need the change stream — typically Silver feeding multiple Golds, or tables shared for incremental consumption. Don't blanket-enable. (And VACUUM deletes _change_data files past retention, [Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM](/lessons/s6-compaction/).)

◆ Where it connects

Takeaways (rebuild it from these)

  1. CDF = the out direction (produce changes from a Delta table); CDC = in. Same "changes," opposite arrows.
  2. It fixes the append-only limitation of streaming tables — carries updates and deletes downstream and lets consumers process only changes → lower latency.
  3. Enable delta.enableChangeDataFeed = true; read readChangeFeed=true + startingVersion/timestamp (batch) or readStream (stream). Rows gain _change_type (insert / update_preimage / update_postimage / delete), _commit_version, _commit_timestamp.
  4. Process CDF incrementally (MERGE / stream) — re-scanning full history each run duplicates. It has write overhead — enable only where needed.

Before you move on — say these without scrolling up

  1. CDC vs CDF — which direction is each?
  2. What can't an append-only streaming read carry to Gold, and how does CDF fix it?
  3. The four _change_type values — and which pair an update produces.
  4. A CDF job keeps producing duplicates — why, and the fix?

Last in Section 6: when a query is still slow, how to read the evidence to find out why → [Reading the evidence — Query Profile & Spark UI](/lessons/s6-query-profile/).

Prerequisites

Leads to