Build and manage reliable, production-ready batch and streaming pipelines using Lakeflow Spark Declarative Pipelines and Auto Loader; use expectations for quality.

Recall the second axis from [The one job — and the two axes everything lives on](/lessons/f1-the-one-job/): who builds the keep-it-correct machinery — your hands (procedural) or the engine's (declarative)? This lesson is the declarative side, and it's the framework the exam leans on hardest in Section 1. It runs on the Structured Streaming engine you just met in [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/) and produces the Delta tables from [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/).

The spine

Beat 1 — the pain, so you feel why this exists

Picture the hand-written pipeline from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/): readStream → transform → writeStream, by hand. Now scale it to a real medallion flow — bronze, silver, gold, each feeding the next. Ask yourself:

Predict: beyond the actual transformations, what else must you now write and maintain by hand?

…

Dependency ordering (bronze before silver). Retries on failure. Incremental bookkeeping (checkpoints per table). Recovery after a crash. Quality checks. And here's the sting: as the pipeline grows, that operational plumbing becomes larger than the actual logic. You spend more code babysitting the machinery than expressing what the data should become.

Beat 2 — the inversion (the anchor)

Lakeflow Spark Declarative Pipelines flips it. You stop writing the how and write only the what:

Anchor. You declare what each table is (its transformation); Databricks owns the how — dependency order, orchestration, retries, incremental refresh, and quality tracking. You write logic; the framework writes plumbing.

Every feature in this lesson is just a consequence of that one sentence. When a question asks "what does LSDP do for you," the answer is always some flavour of the how.

Family point so you don't overreach: LSDP is not a new engine. Underneath, it generates and runs the same Structured Streaming / Spark from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/) and writes the same Delta tables from [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/). On the f1 map, LSDP is the declarative steering wheel; a plain notebook is the procedural one. Same engine, same Delta output — different hands on the controls. And it doesn't replace notebooks: LSDP is preferred for standard medallion pipelines, CDC, streaming ingestion, and anywhere quality/lineage matter; imperative notebooks stay for one-time historical loads and ad-hoc work.

Beat 3 — the mechanism: a declaration builds a graph

Here's how the framework can own ordering — and it produces a result that surprises people, so predict it:

You mark a function with @dlt.table. That decorator does not run the function then and there — it registers a declaration. When the pipeline runs, LSDP reads all your declarations, sees that silver reads from bronze and gold reads from silver, and builds a dependency graph.

Predict: you define gold at the top of your file and bronze at the bottom. Which runs first?

…

Bronze — because execution order comes from the graph, not the order you wrote the code. You never hand-wire the sequence; the graph does. That's "declare it, let the engine order it" — the anchor, made concrete.

import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders, ingested continuously")   # a STREAMING table (read_stream)
@dlt.expect("valid_amount", "amount > 0")                  # a quality rule, declared inline
def bronze_orders():
    return dlt.read_stream("cloud_files").select("order_id", "customer_id", "amount")

@dlt.table(comment="Revenue per restaurant")               # a MATERIALIZED VIEW (read)
def gold_revenue():
    return dlt.read("bronze_orders").groupBy("restaurant_id").sum("amount")

Lock it. The decorator registers a declaration; LSDP builds a dependency graph and runs in graph order, not code order. You declare tables; it writes and sequences the plumbing.

The dials (skim now; return when a question needs one)

◆ The rename trap — `dlt` → `pyspark.pipelines`

A textbook renamed-product trap. The Python module was renamed from dlt to pyspark.pipelines — current code writes from pyspark import pipelines as dp and uses @dp.table / @dp.view / dp.create_streaming_table(…). The old import dlt / @dlt.table still works (kept as an alias) and is what the exam's questions still show, so I use @dlt.* in these lessons. dp.* and @dlt.* are the same framework under two names.

◆ Table vs view — the storage question

Two kinds of dataset; the choice is only "does anyone outside the pipeline need it?"

Declaration	Persists as	Queryable outside the run?
`@dlt.table`	a Delta table in Unity Catalog	Yes — downstream consumers can read it
`@dlt.view`	a temporary dataset, exists only during the run	No — intermediate hop, takes no storage

Rule: outside consumers need it → table; just an internal hop → view.

◆ The read method classifies the table (`read_stream` vs `read`)

One line inside the function silently sets the table's whole nature — the exam's favourite lever, and the seed of [Streaming tables vs materialized views](/lessons/s1-streaming-tables-vs-mv/):

dlt.read_stream(...) → read the source continuously, track position with a checkpoint, return only new records → a streaming table (append-only, incremental).
dlt.read(...) → read everything that exists now, every run, no checkpoint → a materialized view (recomputed each run).

Callback to [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/): read_stream isn't LSDP-only — plain spark.readStream does the same in a notebook. LSDP just manages the checkpoint for you (the anchor again).

◆ Expectations — quality declared next to the data

Because the framework owns "the how," quality checks live inside the declaration. Three behaviours, and the distinction is tested directly:

Decorator	A violating row…	Pipeline continues?
`@dlt.expect("name", "cond")`	is kept, violation logged/tracked	Yes
`@dlt.expect_or_drop("name", "cond")`	is dropped — not written downstream	Yes
`@dlt.expect_or_fail("name", "cond")`	fails the whole pipeline	No

The "good rows to silver, bad rows to a quarantine table" pattern is built from these — taken apart in [Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/).

◆ Triggered vs continuous — the cost/latency dial

A pipeline setting (distinct from a streaming query's trigger):

Triggered — runs on a schedule, processes what's available, stops. Cheaper; the batch pattern. Default answer unless the question stresses real-time.
Continuous — runs forever, cluster stays alive 24/7. Lower latency, more expensive.

Match to the f1 arrival axis: scheduled chunks → triggered; a genuine never-ending stream needing freshness → continuous.

◆ The medallion seat (and the `LIVE.` gotcha)

LSDP is built for the medallion flow from [The one job — and the two axes everything lives on](/lessons/f1-the-one-job/): Auto Loader → bronze streaming table → silver streaming table (clean/dedup, expectations) → gold materialized view (aggregate). Each @dlt.table reads the previous layer, and the dependency graph wires the order — the whole source→bronze→silver→gold circuit, declared instead of hand-built. One gotcha to bank: LIVE. (or a sibling dataset by name) refers to this pipeline's tables; a table outside the pipeline uses its normal catalog.schema.table name. Mixing these up is a common failure.

Takeaways (rebuild it from these)

LSDP = the declarative steering wheel on the Spark/Structured-Streaming engine — you declare tables; it writes the plumbing (order, retries, incremental, quality). Coexists with notebooks.
@dlt.table/@dlt.view register declarations; LSDP builds a dependency graph and runs in graph order, not code order. Table = persisted Delta; view = temporary.
read_stream() → streaming table (checkpointed, incremental); read() → materialized view (full recompute). LSDP manages the checkpoint for you.
Expectations enforce quality inline: expect (log), expect_or_drop (drop row), expect_or_fail (kill pipeline).
Triggered (scheduled, cheaper) vs continuous (24/7, low-latency).

Before you move on — say these without scrolling up

In one sentence, what does the framework own that you'd otherwise hand-write?
You define gold above bronze in the file — which runs first, and why?
read_stream() vs read() — which kind of table does each make?
A row violates @dlt.expect_or_fail — what happens to the pipeline?

Next, the decision the read-method seeds, in full: when a table should be a streaming table vs a materialized view, and why continuous mode is a trap for MVs. → [Streaming tables vs materialized views](/lessons/s1-streaming-tables-vs-mv/)

Lakeflow Spark Declarative Pipelines