Every other lesson taught one concept well. This one does the thing the exam actually rewards and the others can't: it connects them. The Professional exam rarely asks "what is X" — it drops you into a scenario ("a nightly job re-ingests files and duplicates appear… what's wrong?") and expects you to reason across ingestion, the Delta log, deduplication, and idempotency at once. That judgment only comes from seeing the whole system as one machine. So read this as a single story — the life of one dataset from a raw file to a governed, monitored, deletable gold table — and watch the same few ideas reappear in every section wearing different clothes. At the end I give you the procedure I'd use to answer any question on this exam.
Here's the promise: if the individual lessons are the trees, this is the map of the forest, plus a compass.
Part 0 — The two ground truths everything rests on
Before the journey, hold two facts. Almost every correct answer is a consequence of one of them.
Ground truth 1 — a table is files + a transaction log. A Delta table isn't a mutable thing; it's a pile of immutable Parquet files plus an ordered transaction log of add/remove actions ([How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/)). This one fact is ACID, time travel, Change Data Feed, "DELETE doesn't really delete," optimistic concurrency, and why data skipping works. When a question confuses you, ask "what is this doing to the files and the log?" and the fog clears.
Ground truth 2 — work is distributed across executors, and reading is bytes. Spark spreads work over executor nodes; the driver coordinates. Cost and time are dominated by how many bytes you read and shuffle ([The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/)). So "make it faster/cheaper" is almost always "read fewer bytes" or "shuffle less," and "this correct code is slow" is often "it's running on one node instead of distributing" ([Job & environment configuration — compute and Spark tuning](/lessons/s1-job-env-config/)).
And one meta-anchor sitting above both, the single most useful sentence for this exam:
The platform already has a native feature for this — don't hand-roll it, and do only the work a change demands. AQE, Predictive Optimization, liquid clustering, AUTO CDC, native row filters/masks,
event_log(), deletion vectors — when an answer choice reinvents one of these by hand, it's usually the wrong choice. The exam is, over and over, "recognize the built-in that solves this."
Keep those three in your pocket. Now the journey.
Part 1 — Data arrives: ingestion and the two axes
A retail company sends you orders two ways: a Kafka stream of live order events, and a daily SFTP drop of batch files. Your first decision is the one that organizes the entire platform — the two axes from [The one job — and the two axes everything lives on](/lessons/f1-the-one-job/):
- Batch vs streaming — is this a finite set you process once, or an unbounded feed you process forever? This isn't about the data, it's about the reading pattern. Kafka → streaming; the SFTP files could be either.
- Procedural vs declarative — do you write the how (Spark code, a Structured Streaming query) or declare the what (a Lakeflow pipeline that figures out the how)?
You land both sources in a bronze table as raw as possible — the medallion's first layer exists so you never lose the source of truth. Ingestion tools: Auto Loader (cloudFiles) for incrementally picking up new files with schema evolution (its _rescued_data column catches what doesn't parse — remember that name, it comes back in quality), or COPY INTO for simpler idempotent loads.
The moment data streams, state appears ([Structured Streaming & the state model](/lessons/s1-structured-streaming-state/)). A streaming aggregation or dedup must remember things across micro-batches, and on an unbounded stream that memory grows forever unless you bound it with a watermark. Watermark is the single most important word in streaming — tattoo this: any streaming aggregation, dedup, or stream-stream join that doesn't mention a watermark is an unbounded-state bug. You'll meet the watermark again in three different sections; it's the same idea every time.
The first connected trap. "A nightly job re-ingests all files from a directory and rows duplicate." Nothing here is about ingestion syntax — it's about idempotency ([Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/)): re-running reprocessed the same files. The fix is a native, replay-safe pattern (insert-only MERGE, or Auto Loader's exactly-once file tracking), not a bigger cluster. See how the question lives between sections? That's the exam.
Part 2 — Data is shaped: pipelines, transformation, quality
Now bronze → silver → gold. You choose Lakeflow Spark Declarative Pipelines ([Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/)) because it's declarative: you define the tables and the dependencies, it handles orchestration, retries, and — crucially — data quality expectations. This is the "use the native feature" anchor in action: instead of hand-writing quality checks and CDC logic, you declare them.
Three shaping concerns, each a lesson, each connected:
- Transform (
[Advanced transformations — window functions, joins, aggregations](/lessons/s3-transformations/)) — window functions are the star. The anchor that unlocks them: a window function computes per row without collapsing rows (unlikeGROUP BY). And notice the collision that spans sections: "window" means the analyticOVER(...)here, but the timewindow(col, "5 min")in streaming — same word, different tool. - Cleanse (
[Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/)) — dedup, where the streaming/batch axis bites again: batch dedup isROW_NUMBER()=1(a window function — see how Part 2's tools compose?); streaming dedup needs that watermark from Part 1. Same watermark. Third appearance coming. - Enforce quality (
[Data quality — expectations and constraints, and what happens to a bad row](/lessons/s3-data-quality/),[Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/)) — expectations with three actions (warn / drop / fail), and the production pattern that routes bad rows aside (quarantine) instead of dropping them. The bad rows split on Auto Loader's_rescued_data(parse failures) or expectation results (rule failures) — the same_rescued_datafrom Part 1.
And CDC: when the source is a change feed, you don't hand-write merges — AUTO CDC ([APPLY CHANGES — CDC and SCD, declaratively](/lessons/s1-apply-changes-cdc/)) does dedup + ordering (SEQUENCE BY) + SCD for you. Native feature, again.
The connected trap here. "Bad rows should be kept for investigation, not lost." An answer using expect_or_drop is wrong — drop discards. The right answer keeps rows (warn) and splits them ([Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/)). You can only see that if you know what each expectation action does to a row — a fact that also determines where its metrics land (Part 6).
Part 3 — Data is made fast and cheap: performance
Gold tables get queried constantly, so now Ground Truth 2 dominates: reads = bytes. The whole of Section 6 is one loop:
- Skip bytes (
[The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/)) — data skipping uses file min/max stats (first-32-columns rule) to prune files. If your filter column isn't reflected in the layout, nothing prunes. - Right-size files (
[Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM](/lessons/s6-compaction/)) —OPTIMIZE, auto-compaction fix the small-file problem;VACUUMcleans up (and — foreshadowing Part 7 — is what actually deletes data). - Organize the layout (
[Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)](/lessons/s6-data-layout/)) — liquid clustering (the modern default) over partitioning/Z-order; handle skew. - Let the platform maintain it (
[Letting the platform maintain layout — Predictive Optimization & managed tables](/lessons/s6-predictive-optimization/)) — Predictive Optimization + Automatic Liquid Clustering (CLUSTER BY AUTO). Native feature, again. - Diagnose, don't guess (
[Reading the evidence — Query Profile & Spark UI](/lessons/s6-query-profile/)) — read the Query Profile / Spark UI; each symptom (bad skipping / wrong join / spill / skew) has one known fix from the lessons above.
Notice the through-line: performance keeps reaching back to Ground Truth 1. Data skipping is reading the log's stats. Compaction is rewriting files and logging add/remove. The transaction log is the substrate of performance, not a separate topic.
Connected trap. "A join is slow, the profile shows a huge shuffle, one side is small." → broadcast join (ship the small side, don't shuffle the big one). AQE can flip it automatically — the native feature. The knobs (shuffle.partitions, broadcastTimeout) live in [Job & environment configuration — compute and Spark tuning](/lessons/s1-job-env-config/), a Section 1 lesson. Performance answers pull from the compute-config lesson constantly.
Part 4 — Data runs in production: orchestration, deployment, debugging
The pipeline works in a notebook; now it must run reliably, deploy safely, and be debuggable — Section 9's territory, built on Section 1's Jobs foundation ([Jobs & orchestration — multi-task, dependencies, control flow](/lessons/s1-jobs-orchestration/), [Jobs via REST API and CLI](/lessons/s1-jobs-api-cli/)).
- Orchestrate — multi-task jobs with
depends_on, task values, for-each, condition tasks; production identity is a service principal under a cluster policy (the security thread from Part 7 starting early). - Deploy — Databricks Asset Bundles (validate → deploy → run), Git Folders for CI/CD. Environments isolated dev/test/prod (also a security control — Part 7).
- Debug — job repair re-runs only failed tasks (do-only-the-work-a-change-demands, again); the Spark UI / logs / system tables find spill, skew, and driver-only bottlenecks. The diagnostic tools are the same Query Profile / Spark UI from Part 3 — used here to debug failures instead of tune performance. One toolset, two lenses.
Connected trap. "This migrated code is correct but takes 20 minutes." Not a bug — it's %sh or a driver-side Python loop running on one node ([Job & environment configuration — compute and Spark tuning](/lessons/s1-job-env-config/)). Ground Truth 2: distribute it as Spark DataFrame ops. A "debugging" question answered by a distribution fact.
Part 5 — Data is watched: monitoring and alerting
Running isn't enough; you must know what's happening ([The monitoring map — which surface answers which question](/lessons/s5-observability-surfaces/)). The anchor: pick the surface by scope × lifetime. And here the whole platform's monitoring converges:
- System tables (
[System tables — the account's durable, queryable memory](/lessons/s5-system-tables/)) — durable, account-wide SQL:billing.usage(in DBUs, joinlist_pricesfor dollars) attributes cost perrun_as;access.auditis the who-did-what (a security evidence source — Part 7). - Pipeline event log (
[The pipeline event log — where a Lakeflow pipeline records itself](/lessons/s5-pipeline-event-log/)) —event_log()TVF,event_type='flow_progress', where the data-quality expectation results from Part 2 actually land (passed_records/failed_records). The quality you declared in Section 3 is the quality you read in Section 5. Same metrics, two ends. - SQL Alerts (
[SQL Alerts — the single-value rule that makes or breaks them](/lessons/s5-sql-alerts/)) — the single-value rule: an alert compares one value, so the query must do the thinking (collapse conditions withCASE→1/0). Alert on system-table cost, on event-logfailed_records, on a null-rate — any monitoring query becomes a page.
Connected trap. "Alert when email-null% > 15% AND phone-invalid% > 10%." Wrong answers hand the alert two columns; the right answer wraps both in one CASE returning 1/0. The insight ("alert evaluates one value") is a Section 5 fact, but the query you write is a Section 3 data-quality query. Sections fuse.
Part 6 — Data is protected: security and compliance
The biggest section, and it layers from outside in ([Access control — least privilege and the object permission ladders](/lessons/s7-access-control/) → [Unity Catalog privileges — the three-level traversal and delegation](/lessons/s7-uc-privileges/) → [Row filters and column masks — access control inside a table](/lessons/s7-row-col-masks/) → [Anonymization — hashing, pseudonymization, and protecting values at rest](/lessons/s7-anonymization/) → [PII lifecycle — deletion, retention, and the right to be forgotten](/lessons/s7-pii-lifecycle/)):
- Who's in — object ACLs (the view→operate→manage ladders), secrets for credentials (
[Secrets — storing credentials, redaction, and scope ACLs](/lessons/s7-secrets/): redaction isn't security; scope ACLs are). - What they can do to data — UC's three-level traversal (
USE CATALOG+USE SCHEMA+SELECT; miss any level → blocked),MANAGEto delegate. - What they can see inside a table — row filters (→boolean) and column masks (→value), attached as UDFs, evaluated per caller. But masks are query-time, not at-rest.
- Protecting the values themselves — because masks don't protect files, you transform stored values:
sha2(deterministic → linkable, one-way), tokenization (reversible). AndwithColumn("x_hash",…)adds a column — you must overwrite to actually mask. - Erasing for good — the capstone compliance fact that ties straight back to Ground Truth 1:
DELETEis logical (old files linger, time-travel-reachable); onlyVACUUMpast the 7-day retention physically erases. GDPR forget-me = DELETE all copies → VACUUM.
Connected trap. "We ran DELETE for a GDPR request — are we compliant?" No — the files remain until VACUUM, because of the transaction log from Part 0. A security question answered by a Delta internals fact. This is the exam's signature move.
Part 7 — The through-lines (read this twice)
Now zoom out. A handful of ideas recur across every section. If you internalize these, you've compressed the whole exam:
A. The transaction log is everywhere. Time travel, CDF, "DELETE is logical," data-skipping stats, OCC conflicts, OPTIMIZE/VACUUM, schema evolution on MERGE — all are the add/remove log ([How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/)). Half of "advanced" questions are log questions in disguise.
B. The watermark is the streaming-state bound. Streaming aggregation, streaming dedup, stream-stream joins — each needs a watermark to keep state finite ([Structured Streaming & the state model](/lessons/s1-structured-streaming-state/)). No watermark in a streaming-state answer ⇒ wrong.
C. "Use the native feature." AQE (join/skew/partition fixes), Predictive Optimization + CLUSTER BY AUTO, AUTO CDC, native row filters/masks, event_log(), deletion vectors, insert-only MERGE, job repair. Hand-rolled reinventions are usually the distractor.
D. Do only the work a change demands. Incremental over full-recompute, MERGE/CDF over reprocess-everything, job repair over full re-run, materialized views that refresh incrementally, VACUUM only past retention. The whole platform is built to touch only what changed.
E. Match the tool to (scope × lifetime) / (batch × streaming) / (who needs what). Monitoring surfaces, access ladders, compute types — most "which should you use" questions are a two-axis lookup, not memorization.
F. The recurring collisions — say which one out loud. These trip everyone because the same word means different things:
| Word | Which is it? | Tell |
|---|---|---|
| partition | Spark shuffle partition · table partition (directory) · Kafka partition | code/spark.sql.shuffle.partitions vs PARTITIONED BY vs Kafka offset |
| window | analytic OVER(...) · time window(col,"5 min") | ranking/running-total vs streaming time-bucket |
| event log | cluster (compute) · pipeline (flow_progress) · audit (access.audit) | the scope word in the question |
| state | streaming state store · SCD table state | micro-batch memory vs historical rows |
| access mode | Standard (was Shared) · Dedicated (was Single user) | multi-user+UC vs single/jobs |
| USAGE | old traversal name · = USE CATALOG/USE SCHEMA now | Hive-era vs UC |
| owner | workspace object = 1 user only · UC object = can be a group | jobs/notebooks vs catalogs/tables |
| DELETE | logical tombstone · VACUUM = physical | still time-travelable until vacuumed |
| mask | query-time display · hashing = at-rest | files still raw vs value transformed |
G. The renamed-product traps (teach current, recognize the exam's old name). The exam bank predates some renames:
| Exam / older | Current |
|---|---|
APPLY CHANGES INTO | AUTO CDC |
dlt / @dlt.* | pyspark.pipelines / @dp |
| DLT / Delta Live Tables | Lakeflow Spark Declarative Pipelines |
| Workflows · Repos | Lakeflow Jobs · Git Folders |
| Single user / Shared access mode | Dedicated / Standard |
Jobs API 2.1 (include_history) | 2.2 (pagination, only_latest) |
| unlimited retries + max-concurrent-1 (streaming) | Continuous mode (auto backoff) |
| legacy SQL alerts (UNKNOWN state) | Alerts v2 (OK/TRIGGERED/ERROR) |
EXPECTATION_FAILED action rows | event_type='flow_progress' |
USAGE | USE CATALOG / USE SCHEMA |
Part 8 — The procedure: how to answer any question
When a scenario question lands, run this — it's the compression of everything above:
- Name the domain axis. Is this batch or streaming? Procedural or declarative? Read or write? Which medallion layer? This alone eliminates half the options (e.g. streaming ⇒ the answer needs a watermark / checkpoint).
- Find the anchor. Reduce to the one principle: "reads = bytes" (perf), "DELETE is logical" (compliance), "alert evaluates one value" (alerts), "traverse all three levels" (UC), "duplicate = key × bounded?" (dedup). The anchor usually names the answer.
- Prefer the native feature and the minimal work. If one option uses a built-in (AQE, AUTO CDC, native mask, job repair,
MERGE) and another hand-rolls or full-recomputes, lean to the built-in. Through-lines C and D. - Say the collision out loud. If the question hinges on "window," "partition," "event log," "access mode," "owner," "USAGE," "DELETE," "mask" — disambiguate with the tell (table F) before choosing.
- Check the freshness trap. Is the question using an old name (table G)? The current mechanism is usually right, but if the question is clearly legacy, match its world (e.g. it explicitly asks the "recommended retry policy," the legacy "unlimited + max-concurrent-1" may be the expected answer). Know both.
- Verify the option does what it claims to the files/log/state. Ground Truth 1 and 2. "Does this actually remove the data / bound the state / reduce bytes read?" Many distractors are plausible sentences that don't survive this check (
expect_or_drop"quarantines";withColumn"masks in place";DELETE"erases").
Run steps 1→6 and the scenario questions stop being memory tests and become derivations.
Part 9 — If you remember only this
- A table is files + a log. That log is ACID, time travel, CDF, data skipping, and why DELETE needs VACUUM.
- Reads = bytes; distribute the work; skip/prune/right-size to read fewer.
- Streaming ⇒ state ⇒ watermark. Always.
- Declarative + native features beat hand-rolled: Lakeflow SDP, expectations, AUTO CDC, AQE, Predictive Optimization, native masks,
event_log(), job repair. - Do only the work a change demands — incremental, MERGE/CDF, repair, refresh.
- Security layers outside-in: identity → traversal privileges → row/column visibility → transform the values → erase (DELETE+VACUUM).
- Monitoring = match surface to scope × lifetime; the quality you declare is the quality you read; alerts evaluate one value.
- When stuck, name the collision, prefer the built-in, check what it does to the files/log/state.
You've walked one dataset from a raw file to a governed, fast, monitored, deletable gold table — and every exam domain showed up exactly where it belongs in that journey. That's not ten disconnected topics. It's one machine. Now you can see it whole, and answer from the picture instead of the flashcard.
Go back to any lesson above for the depth; come back here for the map.