[Jobs & orchestration — multi-task, dependencies, control flow](/lessons/s1-jobs-orchestration/) sequenced the tasks; this lesson decides what they run on and how Spark behaves while they run. It's where "the pipeline is correct but slow/expensive" gets fixed.
The spine
Beat 1 — the anchor: two moves, never blur them
Every config question is one of two moves — hold them apart:
Anchor. First, pick the cheapest compute that correctly fits the workload. Then, tune Spark to the shape of the data (how it shuffles, skews, fits in memory). That's the whole lesson: choose the box, then tune the engine inside it.
Beat 2 — Move 1: which compute (two dials)
Dial A — cluster type. Recall the table from [Jobs & orchestration — multi-task, dependencies, control flow](/lessons/s1-jobs-orchestration/), holding the "why":
| Type | Use for | Why |
|---|---|---|
| Job compute (new job cluster) | production jobs | fresh + isolated + terminates after run → pay only during execution |
| All-purpose | interactive development only | shared, always-on → wasteful, un-isolated for production |
| Serverless | bursty/lightweight, SQL warehouse tasks | Databricks-managed, auto-scales, no provisioning wait |
| Instance pool | frequent short jobs needing fast startup | pre-warmed VMs → sub-minute startup |
Tells: "production, lowest cost" → new job cluster; "huge dashboard, no provisioning wait" → serverless; "tiny jobs firing constantly, startup hurts" → instance pool.
Dial B — access mode (a governance setting, names recently changed — renamed-product trap): "Single user" → Dedicated, "Shared" → Standard.
- Standard (was Shared) — multi-user, full Unity Catalog governance (row filters, column masks). The default for interactive/multi-user work.
- Dedicated (was Single user) — bound to one principal; for automated jobs and workloads needing privileged access.
Mapping: many analysts, full governance → Standard; a job as a service principal → Dedicated (you'll meet these under governance in [Unity Catalog inheritance — how one grant cascades](/lessons/s8-uc-inheritance/)). Production identity rule: jobs run as a service principal, never a personal account, under a cluster policy — a rule set constraining what users may configure (instance types, node count, auto-termination, tags). Its advantages: cost control, standardisation, guardrails. What it does not do is make clusters faster — so "which is NOT an advantage of cluster policies?" → the speed claim.
Lock it. Cheapest correct compute = pick type (job/all-purpose/serverless/pool) + access mode (Standard vs Dedicated). Production = service principal + cluster policy.
Beat 3 — Move 2: tune Spark to the data's shape
Only a handful of configs are tested, and each maps to a symptom:
| Config | Controls | Rule of thumb |
|---|---|---|
spark.sql.shuffle.partitions | partitions a shuffle (groupBy/join) produces — default 200 | set to ~2–3× total executor cores; 200 is too high for a small cluster |
spark.sql.adaptive.enabled (AQE) | lets Spark adjust the plan at runtime — coalesce partitions, fix skew, switch join strategy | keep on in production |
spark.executor.memory | RAM per executor | raise it on out-of-memory (OOM) |
spark.sql.broadcastTimeout | how long a broadcast join may take | raise for a larger broadcast dimension |
AQE (Adaptive Query Execution) is the important one: Spark builds a plan before it knows real data sizes; AQE revises it mid-flight using actual shuffle stats — merging tiny partitions, splitting skewed ones, flipping a sort-merge join to a broadcast join when a side is small. The single highest-value "just turn it on" setting.
Two symptom→fix pairs the exam likes:
- Data spills to disk during a sort/shuffle (partitions too big for memory) → increase
shuffle.partitionsso each is smaller. (Debugging the spill is[Reading the evidence — Query Profile & Spark UI](/lessons/s6-query-profile/).) - Excessive shuffling in a
groupBy→ repartition by the key first (df.repartition("region").groupBy("region")…) so each key is co-located.
Lock it. Tune to the data:
shuffle.partitions(spill), AQE (keep on),executor.memory(OOM). Repartition by key to cut groupBy shuffle.
The dials (skim now; return when a question needs one)
◆ Driver vs distributed — why %sh code is slow
A quiet performance trap: not everything in a notebook uses the cluster's parallelism. A cluster is a driver node plus executor nodes; Spark distributes DataFrame/SQL work across executors, but some things run only on the driver — one machine. %sh runs shell commands on the driver only; so does plain single-node Python (a for loop over rows, a driver-side pandas op). That's the tell behind "migrated legacy code is correct but takes 20 minutes" — it's doing the work on one node. %sh pwd prints the driver's directory precisely because %sh is driver-local. Fix: refactor driver-only code into Spark DataFrame/SQL so it distributes.
◆ The objective's exact phrasings
- "appropriate configs for environments and dependencies" → compute + library setup (dependencies are
[Managing third-party libraries](/lessons/s1-third-party-libs/)). - "high memory for notebook tasks" → heavy in-memory work (large collect, wide transform) → higher-memory cluster / more
executor.memory. - "auto-optimization to disallow retries" → a task that must not auto-retry (a non-idempotent side effect) → set max retries to 0.
◆ Photon — the free speedup
Photon is Databricks' vectorized C++ execution engine — a drop-in accelerator for SQL/DataFrame scans and aggregations, no code change, just enable it. The biggest single performance lever for analytical workloads; ties into the cost story in Section 6 ([Letting the platform maintain layout — Predictive Optimization & managed tables](/lessons/s6-predictive-optimization/)).
Takeaways (rebuild it from these)
- Two moves: which compute (cheapest correct), then tune Spark to the data shape.
- Cluster type: job compute = production (ephemeral, cheap); all-purpose = dev only; serverless = bursty/SQL; instance pool = fast startup. Access mode: Standard (was Shared, multi-user + UC governance) vs Dedicated (was Single user, for jobs). Production = service principal + cluster policy.
- Spark knobs:
shuffle.partitions(default 200; raise to cure spill), AQE (runtime plan fixes — keep on),executor.memory(OOM). Excess groupBy shuffle → repartition by the key. - Driver-only code (
%sh, single-node Python) doesn't distribute → rewrite as Spark. Photon = free vectorized speedup. - Objective decode: high-memory notebook task → bigger memory; "disallow retries" → max retries = 0.
Before you move on — say these without scrolling up
- The two moves every config question reduces to — name them.
- "Production job, lowest cost" — which cluster type? "Full UC governance for many analysts" — which access mode?
- A job spills to disk during a shuffle — which knob, up or down?
- Migrated code is correct but takes 20 min on big data — what's likely wrong, and the fix?
Next in Section 1: custom logic Spark's built-ins can't express — [UDFs — Python vs Pandas, and why the type is everything](/lessons/s1-udfs/) — and why the kind of UDF you choose can make a job 100× faster or slower.