Lessons

Foundations

The whole picture — how a lakehouse fits together, and how to answer any question about it

Synthesize all exam domains into one connected mental model — follow data end to end, surface the cross-section through-lines, and apply a repeatable question-answering procedure.

Every other lesson taught one concept well. This one does the thing the exam actually rewards and the others can't: it connects them. The Professional exam rarely asks "what is X" — it drops you into a scenario ("a nightly job re-ingests files and duplicates appear… what's wrong?") and expects you to reason across ingestion, the Delta log, deduplication, and idempotency at once. That judgment only comes from seeing the whole system as one machine. So read this as a single story — the life of one dataset from a raw file to a governed, monitored, deletable gold table — and watch the same few ideas reappear in every section wearing different clothes. At the end I give you the procedure I'd use to answer any question on this exam.

Here's the promise: if the individual lessons are the trees, this is the map of the forest, plus a compass.


Part 0 — The two ground truths everything rests on

Before the journey, hold two facts. Almost every correct answer is a consequence of one of them.

Ground truth 1 — a table is files + a transaction log. A Delta table isn't a mutable thing; it's a pile of immutable Parquet files plus an ordered transaction log of add/remove actions ([How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/)). This one fact is ACID, time travel, Change Data Feed, "DELETE doesn't really delete," optimistic concurrency, and why data skipping works. When a question confuses you, ask "what is this doing to the files and the log?" and the fog clears.

Ground truth 2 — work is distributed across executors, and reading is bytes. Spark spreads work over executor nodes; the driver coordinates. Cost and time are dominated by how many bytes you read and shuffle ([The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/)). So "make it faster/cheaper" is almost always "read fewer bytes" or "shuffle less," and "this correct code is slow" is often "it's running on one node instead of distributing" ([Job & environment configuration — compute and Spark tuning](/lessons/s1-job-env-config/)).

And one meta-anchor sitting above both, the single most useful sentence for this exam:

The platform already has a native feature for this — don't hand-roll it, and do only the work a change demands. AQE, Predictive Optimization, liquid clustering, AUTO CDC, native row filters/masks, event_log(), deletion vectors — when an answer choice reinvents one of these by hand, it's usually the wrong choice. The exam is, over and over, "recognize the built-in that solves this."

Keep those three in your pocket. Now the journey.


Part 1 — Data arrives: ingestion and the two axes

A retail company sends you orders two ways: a Kafka stream of live order events, and a daily SFTP drop of batch files. Your first decision is the one that organizes the entire platform — the two axes from [The one job — and the two axes everything lives on](/lessons/f1-the-one-job/):

You land both sources in a bronze table as raw as possible — the medallion's first layer exists so you never lose the source of truth. Ingestion tools: Auto Loader (cloudFiles) for incrementally picking up new files with schema evolution (its _rescued_data column catches what doesn't parse — remember that name, it comes back in quality), or COPY INTO for simpler idempotent loads.

The moment data streams, state appears ([Structured Streaming & the state model](/lessons/s1-structured-streaming-state/)). A streaming aggregation or dedup must remember things across micro-batches, and on an unbounded stream that memory grows forever unless you bound it with a watermark. Watermark is the single most important word in streaming — tattoo this: any streaming aggregation, dedup, or stream-stream join that doesn't mention a watermark is an unbounded-state bug. You'll meet the watermark again in three different sections; it's the same idea every time.

The first connected trap. "A nightly job re-ingests all files from a directory and rows duplicate." Nothing here is about ingestion syntax — it's about idempotency ([Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/)): re-running reprocessed the same files. The fix is a native, replay-safe pattern (insert-only MERGE, or Auto Loader's exactly-once file tracking), not a bigger cluster. See how the question lives between sections? That's the exam.


Part 2 — Data is shaped: pipelines, transformation, quality

Now bronze → silver → gold. You choose Lakeflow Spark Declarative Pipelines ([Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/)) because it's declarative: you define the tables and the dependencies, it handles orchestration, retries, and — crucially — data quality expectations. This is the "use the native feature" anchor in action: instead of hand-writing quality checks and CDC logic, you declare them.

Three shaping concerns, each a lesson, each connected:

And CDC: when the source is a change feed, you don't hand-write merges — AUTO CDC ([APPLY CHANGES — CDC and SCD, declaratively](/lessons/s1-apply-changes-cdc/)) does dedup + ordering (SEQUENCE BY) + SCD for you. Native feature, again.

The connected trap here. "Bad rows should be kept for investigation, not lost." An answer using expect_or_drop is wrong — drop discards. The right answer keeps rows (warn) and splits them ([Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/)). You can only see that if you know what each expectation action does to a row — a fact that also determines where its metrics land (Part 6).


Part 3 — Data is made fast and cheap: performance

Gold tables get queried constantly, so now Ground Truth 2 dominates: reads = bytes. The whole of Section 6 is one loop:

  1. Skip bytes ([The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/)) — data skipping uses file min/max stats (first-32-columns rule) to prune files. If your filter column isn't reflected in the layout, nothing prunes.
  2. Right-size files ([Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM](/lessons/s6-compaction/)) — OPTIMIZE, auto-compaction fix the small-file problem; VACUUM cleans up (and — foreshadowing Part 7 — is what actually deletes data).
  3. Organize the layout ([Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)](/lessons/s6-data-layout/)) — liquid clustering (the modern default) over partitioning/Z-order; handle skew.
  4. Let the platform maintain it ([Letting the platform maintain layout — Predictive Optimization & managed tables](/lessons/s6-predictive-optimization/)) — Predictive Optimization + Automatic Liquid Clustering (CLUSTER BY AUTO). Native feature, again.
  5. Diagnose, don't guess ([Reading the evidence — Query Profile & Spark UI](/lessons/s6-query-profile/)) — read the Query Profile / Spark UI; each symptom (bad skipping / wrong join / spill / skew) has one known fix from the lessons above.

Notice the through-line: performance keeps reaching back to Ground Truth 1. Data skipping is reading the log's stats. Compaction is rewriting files and logging add/remove. The transaction log is the substrate of performance, not a separate topic.

Connected trap. "A join is slow, the profile shows a huge shuffle, one side is small." → broadcast join (ship the small side, don't shuffle the big one). AQE can flip it automatically — the native feature. The knobs (shuffle.partitions, broadcastTimeout) live in [Job & environment configuration — compute and Spark tuning](/lessons/s1-job-env-config/), a Section 1 lesson. Performance answers pull from the compute-config lesson constantly.


Part 4 — Data runs in production: orchestration, deployment, debugging

The pipeline works in a notebook; now it must run reliably, deploy safely, and be debuggable — Section 9's territory, built on Section 1's Jobs foundation ([Jobs & orchestration — multi-task, dependencies, control flow](/lessons/s1-jobs-orchestration/), [Jobs via REST API and CLI](/lessons/s1-jobs-api-cli/)).

Connected trap. "This migrated code is correct but takes 20 minutes." Not a bug — it's %sh or a driver-side Python loop running on one node ([Job & environment configuration — compute and Spark tuning](/lessons/s1-job-env-config/)). Ground Truth 2: distribute it as Spark DataFrame ops. A "debugging" question answered by a distribution fact.


Part 5 — Data is watched: monitoring and alerting

Running isn't enough; you must know what's happening ([The monitoring map — which surface answers which question](/lessons/s5-observability-surfaces/)). The anchor: pick the surface by scope × lifetime. And here the whole platform's monitoring converges:

Connected trap. "Alert when email-null% > 15% AND phone-invalid% > 10%." Wrong answers hand the alert two columns; the right answer wraps both in one CASE returning 1/0. The insight ("alert evaluates one value") is a Section 5 fact, but the query you write is a Section 3 data-quality query. Sections fuse.


Part 6 — Data is protected: security and compliance

The biggest section, and it layers from outside in ([Access control — least privilege and the object permission ladders](/lessons/s7-access-control/)[Unity Catalog privileges — the three-level traversal and delegation](/lessons/s7-uc-privileges/)[Row filters and column masks — access control inside a table](/lessons/s7-row-col-masks/)[Anonymization — hashing, pseudonymization, and protecting values at rest](/lessons/s7-anonymization/)[PII lifecycle — deletion, retention, and the right to be forgotten](/lessons/s7-pii-lifecycle/)):

  1. Who's in — object ACLs (the view→operate→manage ladders), secrets for credentials ([Secrets — storing credentials, redaction, and scope ACLs](/lessons/s7-secrets/): redaction isn't security; scope ACLs are).
  2. What they can do to data — UC's three-level traversal (USE CATALOG + USE SCHEMA + SELECT; miss any level → blocked), MANAGE to delegate.
  3. What they can see inside a table — row filters (→boolean) and column masks (→value), attached as UDFs, evaluated per caller. But masks are query-time, not at-rest.
  4. Protecting the values themselves — because masks don't protect files, you transform stored values: sha2 (deterministic → linkable, one-way), tokenization (reversible). And withColumn("x_hash",…) adds a column — you must overwrite to actually mask.
  5. Erasing for good — the capstone compliance fact that ties straight back to Ground Truth 1: DELETE is logical (old files linger, time-travel-reachable); only VACUUM past the 7-day retention physically erases. GDPR forget-me = DELETE all copies → VACUUM.

Connected trap. "We ran DELETE for a GDPR request — are we compliant?" No — the files remain until VACUUM, because of the transaction log from Part 0. A security question answered by a Delta internals fact. This is the exam's signature move.


Part 7 — The through-lines (read this twice)

Now zoom out. A handful of ideas recur across every section. If you internalize these, you've compressed the whole exam:

A. The transaction log is everywhere. Time travel, CDF, "DELETE is logical," data-skipping stats, OCC conflicts, OPTIMIZE/VACUUM, schema evolution on MERGE — all are the add/remove log ([How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/)). Half of "advanced" questions are log questions in disguise.

B. The watermark is the streaming-state bound. Streaming aggregation, streaming dedup, stream-stream joins — each needs a watermark to keep state finite ([Structured Streaming & the state model](/lessons/s1-structured-streaming-state/)). No watermark in a streaming-state answer ⇒ wrong.

C. "Use the native feature." AQE (join/skew/partition fixes), Predictive Optimization + CLUSTER BY AUTO, AUTO CDC, native row filters/masks, event_log(), deletion vectors, insert-only MERGE, job repair. Hand-rolled reinventions are usually the distractor.

D. Do only the work a change demands. Incremental over full-recompute, MERGE/CDF over reprocess-everything, job repair over full re-run, materialized views that refresh incrementally, VACUUM only past retention. The whole platform is built to touch only what changed.

E. Match the tool to (scope × lifetime) / (batch × streaming) / (who needs what). Monitoring surfaces, access ladders, compute types — most "which should you use" questions are a two-axis lookup, not memorization.

F. The recurring collisions — say which one out loud. These trip everyone because the same word means different things:

WordWhich is it?Tell
partitionSpark shuffle partition · table partition (directory) · Kafka partitioncode/spark.sql.shuffle.partitions vs PARTITIONED BY vs Kafka offset
windowanalytic OVER(...) · time window(col,"5 min")ranking/running-total vs streaming time-bucket
event logcluster (compute) · pipeline (flow_progress) · audit (access.audit)the scope word in the question
statestreaming state store · SCD table statemicro-batch memory vs historical rows
access modeStandard (was Shared) · Dedicated (was Single user)multi-user+UC vs single/jobs
USAGEold traversal name · = USE CATALOG/USE SCHEMA nowHive-era vs UC
ownerworkspace object = 1 user only · UC object = can be a groupjobs/notebooks vs catalogs/tables
DELETElogical tombstone · VACUUM = physicalstill time-travelable until vacuumed
maskquery-time display · hashing = at-restfiles still raw vs value transformed

G. The renamed-product traps (teach current, recognize the exam's old name). The exam bank predates some renames:

Exam / olderCurrent
APPLY CHANGES INTOAUTO CDC
dlt / @dlt.*pyspark.pipelines / @dp
DLT / Delta Live TablesLakeflow Spark Declarative Pipelines
Workflows · ReposLakeflow Jobs · Git Folders
Single user / Shared access modeDedicated / Standard
Jobs API 2.1 (include_history)2.2 (pagination, only_latest)
unlimited retries + max-concurrent-1 (streaming)Continuous mode (auto backoff)
legacy SQL alerts (UNKNOWN state)Alerts v2 (OK/TRIGGERED/ERROR)
EXPECTATION_FAILED action rowsevent_type='flow_progress'
USAGEUSE CATALOG / USE SCHEMA

Part 8 — The procedure: how to answer any question

When a scenario question lands, run this — it's the compression of everything above:

  1. Name the domain axis. Is this batch or streaming? Procedural or declarative? Read or write? Which medallion layer? This alone eliminates half the options (e.g. streaming ⇒ the answer needs a watermark / checkpoint).
  2. Find the anchor. Reduce to the one principle: "reads = bytes" (perf), "DELETE is logical" (compliance), "alert evaluates one value" (alerts), "traverse all three levels" (UC), "duplicate = key × bounded?" (dedup). The anchor usually names the answer.
  3. Prefer the native feature and the minimal work. If one option uses a built-in (AQE, AUTO CDC, native mask, job repair, MERGE) and another hand-rolls or full-recomputes, lean to the built-in. Through-lines C and D.
  4. Say the collision out loud. If the question hinges on "window," "partition," "event log," "access mode," "owner," "USAGE," "DELETE," "mask" — disambiguate with the tell (table F) before choosing.
  5. Check the freshness trap. Is the question using an old name (table G)? The current mechanism is usually right, but if the question is clearly legacy, match its world (e.g. it explicitly asks the "recommended retry policy," the legacy "unlimited + max-concurrent-1" may be the expected answer). Know both.
  6. Verify the option does what it claims to the files/log/state. Ground Truth 1 and 2. "Does this actually remove the data / bound the state / reduce bytes read?" Many distractors are plausible sentences that don't survive this check (expect_or_drop "quarantines"; withColumn "masks in place"; DELETE "erases").

Run steps 1→6 and the scenario questions stop being memory tests and become derivations.


Part 9 — If you remember only this

You've walked one dataset from a raw file to a governed, fast, monitored, deletable gold table — and every exam domain showed up exactly where it belongs in that journey. That's not ten disconnected topics. It's one machine. Now you can see it whole, and answer from the picture instead of the flashcard.

Go back to any lesson above for the depth; come back here for the map.

Prerequisites