Choose the appropriate configs for environments and dependencies, high memory for notebook tasks, and auto-optimization to disallow retries.

[Jobs & orchestration — multi-task, dependencies, control flow](/lessons/s1-jobs-orchestration/) sequenced the tasks; this lesson decides what they run on and how Spark behaves while they run. It's where "the pipeline is correct but slow/expensive" gets fixed.

The spine

Beat 1 — the anchor: two moves, never blur them

Every config question is one of two moves — hold them apart:

Anchor. First, pick the cheapest compute that correctly fits the workload. Then, tune Spark to the shape of the data (how it shuffles, skews, fits in memory). That's the whole lesson: choose the box, then tune the engine inside it.

Beat 2 — Move 1: which compute (two dials)

Dial A — cluster type. Recall the table from [Jobs & orchestration — multi-task, dependencies, control flow](/lessons/s1-jobs-orchestration/), holding the "why":

Type	Use for	Why
Job compute (new job cluster)	production jobs	fresh + isolated + terminates after run → pay only during execution
All-purpose	interactive development only	shared, always-on → wasteful, un-isolated for production
Serverless	bursty/lightweight, SQL warehouse tasks	Databricks-managed, auto-scales, no provisioning wait
Instance pool	frequent short jobs needing fast startup	pre-warmed VMs → sub-minute startup

Tells: "production, lowest cost" → new job cluster; "huge dashboard, no provisioning wait" → serverless; "tiny jobs firing constantly, startup hurts" → instance pool.

Dial B — access mode (a governance setting, names recently changed — renamed-product trap): "Single user" → Dedicated, "Shared" → Standard.

Standard (was Shared) — multi-user, full Unity Catalog governance (row filters, column masks). The default for interactive/multi-user work.
Dedicated (was Single user) — bound to one principal; for automated jobs and workloads needing privileged access.

Mapping: many analysts, full governance → Standard; a job as a service principal → Dedicated (you'll meet these under governance in [Unity Catalog inheritance — how one grant cascades](/lessons/s8-uc-inheritance/)). Production identity rule: jobs run as a service principal, never a personal account, under a cluster policy — a rule set constraining what users may configure (instance types, node count, auto-termination, tags). Its advantages: cost control, standardisation, guardrails. What it does not do is make clusters faster — so "which is NOT an advantage of cluster policies?" → the speed claim.

Lock it. Cheapest correct compute = pick type (job/all-purpose/serverless/pool) + access mode (Standard vs Dedicated). Production = service principal + cluster policy.

Beat 3 — Move 2: tune Spark to the data's shape

Only a handful of configs are tested, and each maps to a symptom:

Config	Controls	Rule of thumb
`spark.sql.shuffle.partitions`	partitions a shuffle (groupBy/join) produces — default 200	set to ~2–3× total executor cores; 200 is too high for a small cluster
`spark.sql.adaptive.enabled` (AQE)	lets Spark adjust the plan at runtime — coalesce partitions, fix skew, switch join strategy	keep on in production
`spark.executor.memory`	RAM per executor	raise it on out-of-memory (OOM)
`spark.sql.broadcastTimeout`	how long a broadcast join may take	raise for a larger broadcast dimension

AQE (Adaptive Query Execution) is the important one: Spark builds a plan before it knows real data sizes; AQE revises it mid-flight using actual shuffle stats — merging tiny partitions, splitting skewed ones, flipping a sort-merge join to a broadcast join when a side is small. The single highest-value "just turn it on" setting.

Two symptom→fix pairs the exam likes:

Data spills to disk during a sort/shuffle (partitions too big for memory) → increase shuffle.partitions so each is smaller. (Debugging the spill is [Reading the evidence — Query Profile & Spark UI](/lessons/s6-query-profile/).)
Excessive shuffling in a groupBy → repartition by the key first (df.repartition("region").groupBy("region")…) so each key is co-located.

Lock it. Tune to the data: shuffle.partitions (spill), AQE (keep on), executor.memory (OOM). Repartition by key to cut groupBy shuffle.

The dials (skim now; return when a question needs one)

◆ Driver vs distributed — why `%sh` code is slow

A quiet performance trap: not everything in a notebook uses the cluster's parallelism. A cluster is a driver node plus executor nodes; Spark distributes DataFrame/SQL work across executors, but some things run only on the driver — one machine. %sh runs shell commands on the driver only; so does plain single-node Python (a for loop over rows, a driver-side pandas op). That's the tell behind "migrated legacy code is correct but takes 20 minutes" — it's doing the work on one node. %sh pwd prints the driver's directory precisely because %sh is driver-local. Fix: refactor driver-only code into Spark DataFrame/SQL so it distributes.

◆ The objective's exact phrasings

"appropriate configs for environments and dependencies" → compute + library setup (dependencies are [Managing third-party libraries](/lessons/s1-third-party-libs/)).
"high memory for notebook tasks" → heavy in-memory work (large collect, wide transform) → higher-memory cluster / more executor.memory.
"auto-optimization to disallow retries" → a task that must not auto-retry (a non-idempotent side effect) → set max retries to 0.

◆ Photon — the free speedup

Photon is Databricks' vectorized C++ execution engine — a drop-in accelerator for SQL/DataFrame scans and aggregations, no code change, just enable it. The biggest single performance lever for analytical workloads; ties into the cost story in Section 6 ([Letting the platform maintain layout — Predictive Optimization & managed tables](/lessons/s6-predictive-optimization/)).

Takeaways (rebuild it from these)

Two moves: which compute (cheapest correct), then tune Spark to the data shape.
Cluster type: job compute = production (ephemeral, cheap); all-purpose = dev only; serverless = bursty/SQL; instance pool = fast startup. Access mode: Standard (was Shared, multi-user + UC governance) vs Dedicated (was Single user, for jobs). Production = service principal + cluster policy.
Spark knobs: shuffle.partitions (default 200; raise to cure spill), AQE (runtime plan fixes — keep on), executor.memory (OOM). Excess groupBy shuffle → repartition by the key.
Driver-only code (%sh, single-node Python) doesn't distribute → rewrite as Spark. Photon = free vectorized speedup.
Objective decode: high-memory notebook task → bigger memory; "disallow retries" → max retries = 0.

Before you move on — say these without scrolling up

The two moves every config question reduces to — name them.
"Production job, lowest cost" — which cluster type? "Full UC governance for many analysts" — which access mode?
A job spills to disk during a shuffle — which knob, up or down?
Migrated code is correct but takes 20 min on big data — what's likely wrong, and the fix?

Next in Section 1: custom logic Spark's built-ins can't express — [UDFs — Python vs Pandas, and why the type is everything](/lessons/s1-udfs/) — and why the kind of UDF you choose can make a job 100× faster or slower.

Job & environment configuration — compute and Spark tuning