Lessons

Developing Code for Data Processing

Streaming tables vs materialized views

Explain the advantages and disadvantages of streaming tables compared to materialized views.

In [Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/) I said one line inside your function — read_stream() vs read() — silently decides what kind of table you get, and left the decision itself for here. This is that decision. It's a named exam objective, so we make it airtight.


The spine

Beat 1 — the one question that separates them

These two only exist inside LSDP (in a plain notebook you just call readStream/read and write a Delta table yourself). And the trap is that both are declared with the exact same @dlt.table decorator — so what tells them apart isn't how you declare them. It's the answer to one question you must ask of every table in a pipeline:

When new data arrives, should this table just append the new rows — or recompute its whole result from scratch?

Predict it on two tables:

That difference is the whole lesson:

Anchor. @dlt.table + read_stream()Streaming Table (append the new rows). @dlt.table + read()Materialized View (recompute the result). LSDP reads your read-method and classifies the table for you.

# STREAMING TABLE — read_stream() → new records are APPENDED
@dlt.table
def bronze_orders():
    return spark.readStream.format("cloudFiles").option("cloudFiles.format","json").load("/landing/orders")

# MATERIALIZED VIEW — read() (batch) → result RECOMPUTED each run
@dlt.table
def gold_revenue():
    return dlt.read("silver_orders").groupBy("restaurant_id").sum("amount")

Beat 2 — why they differ: the stateless/stateful test again

Apply the test from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/) and the whole behaviour falls out:

Lock it. Streaming table = checkpointed, exactly-once, append-only → ingestion/dedup/CDC source. Materialized view = stateless recompute, sees the whole dataset → aggregations/joins → Gold metrics.

Beat 3 — the sharp doubt: how is a recomputing MV not ruinously slow?

You should be uneasy here, so name it: if an MV is stateless and recomputes from scratch, doesn't it re-read ten million rows every run when its source is a streaming table?

Predict: where would it get "only the new rows" from, if it keeps no checkpoint?

From the source table's Delta transaction log versionsnot a streaming checkpoint. Reaching back to [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/): LSDP looks at the source's log, sees exactly which rows appeared between version 47 and version 52, and recomputes only the affected part of the result. So an MV is incremental — but via table version tracking (a property of the Delta table at rest), not operation state (a running query's notepad). Same outcome, different machine.

This is exactly the checkpoint-vs-Delta-log collision from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/), showing up again: an MV doesn't keep a streaming checkpoint; it leans on the source's log.

Lock it. An MV over a streaming source is still incremental — through the source's Delta-log versions, never a streaming checkpoint.


The dials (skim now; return when a question needs one)

◆ When to use each

Common production pattern (medallion from [The one job — and the two axes everything lives on](/lessons/f1-the-one-job/)): a streaming table at Bronze feeds a materialized view at Gold. Bronze accumulates new records cheaply; Gold recomputes its aggregation using Delta version tracking to touch only the new Bronze rows.

◆ The continuous-mode trap for MVs

Recall triggered vs continuous from [Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/). For a streaming table, continuous mode is fine (it just appends). For a materialized view it's a trap: it must recompute its entire aggregation every few seconds as each micro-batch lands — a three-table join with GROUP BY and PERCENTILE over millions of rows becomes ruinously expensive. Always use triggered mode for MVs. If a question pairs an MV with continuous mode and asks about cost, that pairing is the wrong answer.

◆ The boundary that catches people

Can you make a plain Delta table inside LSDP — not an MV? No. Inside LSDP, @dlt.table + dlt.read() is always a materialized view; LSDP manages it. Want full manual control over how/when a table is written? Step outside LSDP into a notebook and use the DataFrame writer (df.write.format("delta").saveAsTable(...)). Rule: want LSDP to manage it → @dlt.table; want manual control → imperative notebook.

Takeaways (rebuild it from these)

  1. Both are @dlt.table; the read method decides: read_stream()streaming table (append), read()materialized view (recompute).
  2. Streaming table = checkpointed, exactly-once, append-only → ingestion/dedup/CDC source. MV = stateless recompute, supports aggregations/joins → Gold metrics.
  3. An MV over a streaming source is still incremental — via the source's Delta-log versions, not a streaming checkpoint.
  4. Common pattern: streaming table (Bronze) → materialized view (Gold).
  5. Never run an MV in continuous mode — full recompute per micro-batch is ruinous. Triggered only.

Before you move on — say these without scrolling up

  1. The single question that decides streaming-table vs MV — what is it?
  2. Which one is append-only, and which sees the whole dataset? Which can do a GROUP BY?
  3. An MV reads from a streaming table — how does it stay incremental without a checkpoint?
  4. Why is continuous mode a trap for a materialized view?

Next: the most powerful thing you build on top of a streaming source — turning a stream of change events into a correctly-updated table, declaratively. → [APPLY CHANGES — CDC and SCD, declaratively](/lessons/s1-apply-changes-cdc/)

Prerequisites

Leads to