Explain the advantages and disadvantages of streaming tables compared to materialized views.

In [Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/) I said one line inside your function — read_stream() vs read() — silently decides what kind of table you get, and left the decision itself for here. This is that decision. It's a named exam objective, so we make it airtight.

The spine

Beat 1 — the one question that separates them

These two only exist inside LSDP (in a plain notebook you just call readStream/read and write a Delta table yourself). And the trap is that both are declared with the exact same @dlt.table decorator — so what tells them apart isn't how you declare them. It's the answer to one question you must ask of every table in a pipeline:

When new data arrives, should this table just append the new rows — or recompute its whole result from scratch?

Predict it on two tables:

Bronze: raw orders, exactly as they land. A new order arrives — do you recompute anything? No. You append it. Nothing about the old rows changed.
Gold: revenue per restaurant. A new order arrives — can you just append? No. That restaurant's total must be revised. To be correct you recompute the aggregation.

That difference is the whole lesson:

Anchor. @dlt.table + read_stream() → Streaming Table (append the new rows). @dlt.table + read() → Materialized View (recompute the result). LSDP reads your read-method and classifies the table for you.

# STREAMING TABLE — read_stream() → new records are APPENDED
@dlt.table
def bronze_orders():
    return spark.readStream.format("cloudFiles").option("cloudFiles.format","json").load("/landing/orders")

# MATERIALIZED VIEW — read() (batch) → result RECOMPUTED each run
@dlt.table
def gold_revenue():
    return dlt.read("silver_orders").groupBy("restaurant_id").sum("amount")

Beat 2 — why they differ: the stateless/stateful test again

Apply the test from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/) and the whole behaviour falls out:

A streaming table is stateful (in the progress sense) — it keeps a checkpoint of exactly which records it has processed (last Delta version / Kafka offset). Next run, it resumes from there; each record flows through exactly once; the table only ever grows (append-only). This is the machine for continuous ingestion.
A materialized view is stateless — no checkpoint, no memory of prior runs. Each run it executes its query against whatever exists now. It doesn't need memory: to compute "revenue per restaurant" you don't track which rows you already summed — you just sum all rows that exist. That's why MVs can do aggregations, joins, and windows (they see the whole dataset), and streaming tables can't (their output is append-only).

Lock it. Streaming table = checkpointed, exactly-once, append-only → ingestion/dedup/CDC source. Materialized view = stateless recompute, sees the whole dataset → aggregations/joins → Gold metrics.

Beat 3 — the sharp doubt: how is a recomputing MV not ruinously slow?

You should be uneasy here, so name it: if an MV is stateless and recomputes from scratch, doesn't it re-read ten million rows every run when its source is a streaming table?

Predict: where would it get "only the new rows" from, if it keeps no checkpoint?

…

From the source table's Delta transaction log versions — not a streaming checkpoint. Reaching back to [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/): LSDP looks at the source's log, sees exactly which rows appeared between version 47 and version 52, and recomputes only the affected part of the result. So an MV is incremental — but via table version tracking (a property of the Delta table at rest), not operation state (a running query's notepad). Same outcome, different machine.

This is exactly the checkpoint-vs-Delta-log collision from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/), showing up again: an MV doesn't keep a streaming checkpoint; it leans on the source's log.

Lock it. An MV over a streaming source is still incremental — through the source's Delta-log versions, never a streaming checkpoint.

The dials (skim now; return when a question needs one)

◆ When to use each

Streaming table — data arrives continuously (Auto Loader, Kafka, or a Delta table read as a stream), you need exactly-once, and output is insert-only: raw events, dedup, a CDC source. The natural Bronze/Silver shape.
Materialized view — the result depends on the whole dataset: aggregations (SUM/COUNT/PERCENTILE), joins, windowing. The natural Gold shape.

Common production pattern (medallion from [The one job — and the two axes everything lives on](/lessons/f1-the-one-job/)): a streaming table at Bronze feeds a materialized view at Gold. Bronze accumulates new records cheaply; Gold recomputes its aggregation using Delta version tracking to touch only the new Bronze rows.

◆ The continuous-mode trap for MVs

Recall triggered vs continuous from [Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/). For a streaming table, continuous mode is fine (it just appends). For a materialized view it's a trap: it must recompute its entire aggregation every few seconds as each micro-batch lands — a three-table join with GROUP BY and PERCENTILE over millions of rows becomes ruinously expensive. Always use triggered mode for MVs. If a question pairs an MV with continuous mode and asks about cost, that pairing is the wrong answer.

◆ The boundary that catches people

Can you make a plain Delta table inside LSDP — not an MV? No. Inside LSDP, @dlt.table + dlt.read() is always a materialized view; LSDP manages it. Want full manual control over how/when a table is written? Step outside LSDP into a notebook and use the DataFrame writer (df.write.format("delta").saveAsTable(...)). Rule: want LSDP to manage it → @dlt.table; want manual control → imperative notebook.

Takeaways (rebuild it from these)

Both are @dlt.table; the read method decides: read_stream() → streaming table (append), read() → materialized view (recompute).
Streaming table = checkpointed, exactly-once, append-only → ingestion/dedup/CDC source. MV = stateless recompute, supports aggregations/joins → Gold metrics.
An MV over a streaming source is still incremental — via the source's Delta-log versions, not a streaming checkpoint.
Common pattern: streaming table (Bronze) → materialized view (Gold).
Never run an MV in continuous mode — full recompute per micro-batch is ruinous. Triggered only.

Before you move on — say these without scrolling up

The single question that decides streaming-table vs MV — what is it?
Which one is append-only, and which sees the whole dataset? Which can do a GROUP BY?
An MV reads from a streaming table — how does it stay incremental without a checkpoint?
Why is continuous mode a trap for a materialized view?

Next: the most powerful thing you build on top of a streaming source — turning a stream of change events into a correctly-updated table, declaratively. → [APPLY CHANGES — CDC and SCD, declaratively](/lessons/s1-apply-changes-cdc/)

Streaming tables vs materialized views