Everything in this exam — and in the craft — sits on one idea. Get it, and most of the question bank stops being facts to memorise and becomes things you can work out on the spot. So we build it slowly here, once, and then spend the rest of the course watching it reappear in disguise. This is the lesson the others lean on; nothing below is skippable.
The whole job, and the only question that matters
Data engineering is moving data from a source to a target. A source is wherever the data is born — an app's database, a stream of events, files landing in cloud storage. A target is the table someone actually uses — a dashboard's numbers, a cleaned table, a model's input. If the source never changed, this would be trivial: copy it once and walk away.
But the source never sits still. New orders arrive; a customer edits their address; yesterday's data is corrected. And the target has to keep reflecting the source as it changes. So the entire field collapses to a single question:
when the source changes, how do you keep the target correct while doing as little work as possible?
There is always one answer that works no matter what: throw the target away and rebuild it from the entire source, every time. It is always correct — and it is hopeless the moment the source has millions of rows, because you'd reprocess everything just to absorb a handful of new records. Every technique you will ever learn — Auto Loader, MERGE, streaming state, Change Data Feed, liquid clustering — exists to avoid that full rebuild. Each is one clever way of touching only what changed.
Anchor. Every technique in data engineering is one clever way of doing only the work a change demands. Everything else is detail hanging off this. When a question stumps you, ask "what's the least work this change actually requires?" — the answer is usually the option that does exactly that and no more.
A source changes in exactly two ways — and they need opposite machines
Here is the first fork, and the single most useful distinction in the whole subject. Look closely and a source changes in two ways that feel similar and need completely different handling. Quietly sliding between them is the easiest way to get lost, so we split them now and never blur them again.
Way 1 — new records arrive
A customer places a new order. Nothing old is touched; a row is simply added at the bottom.
Say the target is revenue per restaurant. First, one word we'll lean on hard — "the query." The query is the transformation that produces the target: here, "sum the order amounts, grouped by restaurant." It is not a lookup you run against the target afterward; it is the logic that builds the target in the first place. Hold that — people trip on it constantly.
Now one new order lands. To keep revenue correct you do not want to re-add ten million historical orders. You want to take the single new order and add its amount to that restaurant's running total. But to add to a running total, the engine must be holding that total somewhere between runs. That held-onto running total — the engine's working memory carried from one run to the next — is exactly what the word state means. Say it plainly: state = the running-total-style memory the engine keeps so it can update an aggregate from new rows alone, instead of rescanning history.
This kills a wrong mental picture before it forms. A table of two million rows and twenty-five columns does not mean two million things in memory. For "revenue per 500 restaurants," the engine remembers about 500 small things — one running total each. Rows stream past, each nudges a total, then the row is discarded. So state grows with the number of distinct groups (keys), never with the number of rows. That one fact is why grouping on a high-cardinality column like customer_id (millions of keys) is the classic state-explosion trap, while grouping by restaurant_id (hundreds) is cheap — same rows, wildly different memory.
Way 2 — an existing record changes
Now the customer edits their phone number. Nothing new arrives; a record that already existed is now different. There is nothing to total up. You just need that customer's row in the target to show the newest value — so you find the row and overwrite it. That operation is a merge. And if they edited it twice before the next run, you take the most recent edit and merge that one.
Where is the "memory" of their current phone number? Not in any engine scratchpad. It's just the data already sitting in the target. The target row is the memory.
Untie the knot before it forms
Here's the exact doubt that makes this feel slippery, so let's name it out loud: "Isn't this all just source-to-target? And isn't 'state' tied to a specific query?" Both Ways are source-to-target — that never changes. But the word state belongs only to Way 1: it's the engine's running-total memory for aggregations, and yes it's tied to the query (a different aggregation keeps different totals; a pure filter keeps none at all). Way 2 is not state — it's updating a row, and the memory is the target data itself. One word, "keep the target correct," stretched across two machines is the whole source of confusion. Pull them apart and half of streaming stops being mysterious.
The two machines, side by side
| Way 1 — engine state | Way 2 — updating the target | |
|---|---|---|
| Triggered by | an aggregation/window/join/dedup (a running SUM/AVG/COUNT) | a source record changing (edit a phone/email) |
| What it holds | tiny accumulators, one per group (per restaurant: running sum) | the actual current rows in the target |
| Where it lives | the engine's state store | the target table — it's just data |
| How the change lands | engine nudges the accumulator from new rows only | you MERGE the new value into the row |
| Tied to the query? | yes — different query, different state | no — it's about the data, not the query |
Read a question's verbs: SUM/AVG/COUNT/window/dedup → you're in Way-1 state; "insert or update this record" / "keep the latest value" → you're in Way-2 merge. That reflex alone answers a surprising number of questions.
Why the engine's memory survives a restart
One more piece, because it explains a lot later. If the stream goes quiet for twenty days and then a job restart happens, the running totals are still there. How? Because the engine doesn't keep state only in the cluster's memory — it also saves it to durable cloud storage as it goes. On restart it reads that saved copy, picks the totals back up, and continues exactly where it left off.
The picture: the cluster is the desk the engine works at; the saved copy is a filing cabinet it never loses. And to name the real components behind that picture — the desk's memory is the state store (on Databricks, RocksDB running on the cluster), and the filing cabinet is the checkpoint in cloud storage. (Exactly how the checkpoint and write-ahead log guarantee this is opened later, at the write stage — [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/).) For now, just hold: state is durable because it's continuously written down, not merely remembered.
The map: two axes, and every tool lives in one box
Zoom out. Every technique you'll meet sits at the crossing of two independent choices — and "the six things people list" are just their combinations.
- How does the data arrive? Batch — a finite pile you process once and it doesn't move. Or streaming — endless, always arriving. Batch usually lets you skip the keep-it-correct machinery (the pile is frozen); streaming forces it (the data keeps coming).
- Who builds the machinery? Procedural — your hands (you write
readStream → transform → writeStreamyourself). Or declarative — you declare the tables you want and let the engine build the machinery (Lakeflow Spark Declarative Pipelines).
BATCH (finite pile) STREAMING (endless)
PROCEDURAL spark.read → write spark.readStream → writeStream
(your hands) (this IS Structured Streaming, used directly)
DECLARATIVE a materialized view a streaming table
(engine's) (@dp.table on a batch read) (@dp.table on a stream)
These axes are independent — how the data arrives is a separate question from who writes the code. That independence matters: nearly every "which approach?" question is just asking you to place the scenario in one of these four boxes, and the two axes are chosen separately.
The shape the target takes: medallion
Last piece of furniture, because every pipeline question assumes it. A target isn't built in one leap; data is refined through three layers: Bronze (raw, exactly as ingested, append-only — the audit copy), Silver (cleaned, deduplicated, conformed), Gold (business aggregates and models the dashboards read). It's just the one job — source→target — repeated: source→bronze→silver→gold, each hop doing only the work its change demands. When a lesson later asks "which layer does this belong in?", you answer it from this: raw fidelity → bronze, cleaning → silver, business rollups → gold.
Takeaways (rebuild the field from these)
- The one job: keep the target correct as the source changes, doing only the work the change demands. Every technique is a way to avoid the full rebuild.
- A source changes two ways — new records (Way 1, handled by engine state) and changed records (Way 2, handled by a
MERGEinto the target). Different machines; never blur them. Verbs tell you which: SUM/window/dedup → state; "keep latest per key" → merge. - State grows with distinct keys, not rows (2M rows → ~500 totals); high-cardinality grouping is the state-explosion trap. State survives restarts because it's saved to a checkpoint in durable storage.
- Everything lives on two independent axes: batch vs streaming (how data arrives) × procedural vs declarative (who builds the machinery). Place the scenario in one of the four boxes.
- Targets are refined bronze → silver → gold — the same job, repeated, each hop minimal.
Next, we take Way 1 — new records arriving — into its hardest home, streaming, and watch exactly what "state" costs and how the engine keeps it from growing forever. That's where Section 1 (writing the code that does this job) begins → [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/).