Developing Code for Data Processing

Job & environment configuration — compute and Spark tuning

Choose the appropriate configs for environments and dependencies, high memory for notebook tasks, and auto-optimization to disallow retries.

[Jobs & orchestration — multi-task, dependencies, control flow](/lessons/s1-jobs-orchestration/) sequenced the tasks; this lesson decides what they run on and how Spark behaves while they run. It's where "the pipeline is correct but slow/expensive" gets fixed.


The spine

Beat 1 — the anchor: two moves, never blur them

Every config question is one of two moves — hold them apart:

Anchor. First, pick the cheapest compute that correctly fits the workload. Then, tune Spark to the shape of the data (how it shuffles, skews, fits in memory). That's the whole lesson: choose the box, then tune the engine inside it.

Beat 2 — Move 1: which compute (two dials)

Dial A — cluster type. Recall the table from [Jobs & orchestration — multi-task, dependencies, control flow](/lessons/s1-jobs-orchestration/), holding the "why":

| Type | Use for | Why |
|---|---|---|
| Job compute (new job cluster) | production jobs | fresh + isolated + terminates after run → pay only during execution |
| All-purpose | interactive development only | shared, always-on → wasteful, un-isolated for production |
| Serverless | bursty/lightweight, SQL warehouse tasks | Databricks-managed, auto-scales, no provisioning wait |
| Instance pool | frequent short jobs needing fast startup | pre-warmed VMs → sub-minute startup |

Tells: "production, lowest cost" → new job cluster; "huge dashboard, no provisioning wait" → serverless; "tiny jobs firing constantly, startup hurts" → instance pool.
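
The table and its tells can be read as a small lookup. A minimal sketch in Python: the function name, the flag names, and the order in which the flags are checked are all hypothetical, chosen only to mirror the rows above; this is not a Databricks API.

```python
def pick_cluster_type(production: bool,
                      no_provisioning_wait: bool = False,
                      frequent_short_runs: bool = False) -> str:
    """Map the workload tells from the table to a cluster type (illustrative only)."""
    if not production:
        return "all-purpose"      # interactive development only
    if no_provisioning_wait:
        return "serverless"       # Databricks-managed, no provisioning wait
    if frequent_short_runs:
        return "instance pool"    # pre-warmed VMs for fast startup
    return "job compute"          # fresh, isolated, terminates after the run


print(pick_cluster_type(production=True))
print(pick_cluster_type(production=True, frequent_short_runs=True))
```

The default branch is job compute on purpose: for a production job with no special constraint, the table's answer is a new job cluster.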

Dial B — access mode (a governance setting, names recently changed — renamed-product trap): "Single user" → Dedicated, "Shared" → Standard.

Mapping: many analysts, full governance → Standard; a job as a service principal → Dedicated (you'll meet these under governance in [Unity Catalog inheritance — how one grant cascades](/lessons/s8-uc-inheritance/)). Production identity rule: jobs run as a service principal, never a personal account, under a cluster policy, which is a rule set constraining what users may configure (instance types, node count, auto-termination, tags). Its advantages: cost control, standardisation, guardrails. What it does not do is make clusters faster, so "which is NOT an advantage of cluster policies?" → the speed claim.

Lock it. Cheapest correct compute = pick type (job/all-purpose/serverless/pool) + access mode (Standard vs Dedicated). Production = service principal + cluster policy.
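
As a rough illustration of the production rule, the fragment below sketches a job definition as a Python dict shaped like a Databricks Jobs API payload. The field names (`run_as`, `job_clusters`, `new_cluster`, `policy_id`, `data_security_mode`) are written from memory of that API and should be checked against the current reference. Every value (job name, IDs, runtime version, node type, notebook path) is a made-up placeholder.

```python
import json

# Hypothetical job definition: all values are placeholders, and the field
# names are assumed to match the Databricks Jobs API. Verify before use.
job_spec = {
    "name": "nightly-orders-load",
    # Run as a service principal, never a personal account.
    "run_as": {"service_principal_name": "00000000-0000-0000-0000-000000000000"},
    "job_clusters": [
        {
            "job_cluster_key": "etl_cluster",
            # A new job cluster: created for the run, terminated afterwards.
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
                "policy_id": "ABC123DEF456",          # cluster policy guardrails
                "data_security_mode": "SINGLE_USER",  # assumed API value for Dedicated
            },
        }
    ],
    "tasks": [
        {
            "task_key": "load_orders",
            "job_cluster_key": "etl_cluster",
            "notebook_task": {"notebook_path": "/Workspace/etl/load_orders"},
        }
    ],
}

print(json.dumps(job_spec, indent=2))
```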

Beat 3 — Move 2: tune Spark to the data's shape

Only a handful of configs are tested, and each maps to a symptom:

| Config | Controls | Rule of thumb |
|---|---|---|
| spark.sql.shuffle.partitions | partitions a shuffle (groupBy/join) produces; default 200 | set to ~2–3× total executor cores; 200 is too high for a small cluster |
| spark.sql.adaptive.enabled (AQE) | lets Spark adjust the plan at runtime: coalesce partitions, fix skew, switch join strategy | keep on in production |
| spark.executor.memory | RAM per executor | raise it on out-of-memory (OOM) |
| spark.sql.broadcastTimeout | how long a broadcast join may take | raise for a larger broadcast dimension |
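
The shuffle-partition rule of thumb is simple arithmetic. A minimal sketch, assuming a hypothetical helper name and an example cluster of 4 executors with 8 cores each:

```python
def suggested_shuffle_partitions(num_executors: int,
                                 cores_per_executor: int,
                                 factor: int = 3) -> int:
    """Rule of thumb from the table: roughly 2-3x the total executor cores."""
    if num_executors < 1 or cores_per_executor < 1:
        raise ValueError("need at least one executor with at least one core")
    return num_executors * cores_per_executor * factor


# Example: 4 executors x 8 cores, using the upper multiplier of 3.
partitions = suggested_shuffle_partitions(4, 8)
print(partitions)  # 96, well under the default of 200

# On a live cluster the value would then be applied with:
# spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
```

The multiplier is only a starting point; spill or tiny tasks in the Spark UI are the signal to move it up or down.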

AQE (Adaptive Query Execution) is the important one: Spark builds a plan before it knows real data sizes; AQE revises it mid-flight using actual shuffle stats — merging tiny partitions, splitting skewed ones, flipping a sort-merge join to a broadcast join when a side is small. The single highest-value "just turn it on" setting.
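
The three AQE behaviours named above each have their own Spark SQL switch. A minimal sketch of setting them explicitly: the helper `apply_settings` is hypothetical, and recent Spark versions already enable AQE by default, so this mainly documents intent.

```python
# AQE-related Spark SQL settings and what each one governs.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",                     # master switch
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge tiny partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed partitions
}


def apply_settings(spark, settings):
    """Apply each key/value through spark.conf.set and return what was set."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return dict(settings)


# In a Databricks notebook `spark` is predefined, so the call would be:
# apply_settings(spark, AQE_SETTINGS)
```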

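Two symptom→fix pairs the exam likes: a shuffle that spills to disk → raise spark.sql.shuffle.partitions; excess shuffle on a groupBy → repartition by the grouping key.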

Lock it. Tune to the data: shuffle.partitions (spill), AQE (keep on), executor.memory (OOM). Repartition by key to cut groupBy shuffle.
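
The "repartition by key" fix can be sketched in PySpark as follows. The function name, DataFrame, and column names are hypothetical. Note that `repartition` is itself a shuffle, so the gain is assumed to come from later key-based steps reusing that partitioning rather than shuffling again.

```python
def total_by_key(df, key: str, value: str):
    """Repartition on the grouping key, then aggregate on that same key."""
    partitioned = df.repartition(key)   # rows with the same key land together
    return partitioned.groupBy(key).agg({value: "sum"})


# Hypothetical usage in a notebook, where `orders` is a Spark DataFrame:
# totals = total_by_key(orders, "customer_id", "amount")
# totals.explain()   # inspect the plan for Exchange (shuffle) steps
```

Checking the plan with `explain()` before and after is the safest way to confirm the change actually removed a shuffle for a given query.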


The dials (skim now; return when a question needs one)

◆ Driver vs distributed — why %sh code is slow

A quiet performance trap: not everything in a notebook uses the cluster's parallelism. A cluster is a driver node plus executor nodes; Spark distributes DataFrame/SQL work across executors, but some things run only on the driver — one machine. %sh runs shell commands on the driver only; so does plain single-node Python (a for loop over rows, a driver-side pandas op). That's the tell behind "migrated legacy code is correct but takes 20 minutes" — it's doing the work on one node. %sh pwd prints the driver's directory precisely because %sh is driver-local. Fix: refactor driver-only code into Spark DataFrame/SQL so it distributes.
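
The contrast can be sketched with two versions of the same operation. Both function names and the column are hypothetical; the first pulls every row to the driver and loops in plain Python, the second hands the expression to Spark.

```python
def double_on_driver(df, column: str):
    """Driver-only: collect() brings every row to one machine, then Python loops."""
    return [row[column] * 2 for row in df.collect()]


def double_distributed(df, column: str):
    """Distributed: the SQL expression is evaluated by Spark on the executors."""
    return df.selectExpr(f"{column} * 2 AS doubled")


# Hypothetical usage, where `orders` is a Spark DataFrame with an `amount` column:
# slow = double_on_driver(orders, "amount")     # list built on the driver
# fast = double_distributed(orders, "amount")   # still a DataFrame, still parallel
```

Both give the same numbers; only the second scales with the size of the cluster.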

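◆ The objective's exact phrasings

The objective's wording decodes directly: "high memory for notebook tasks" → give the task more memory; "disallow retries" → set max retries to 0.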
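
The objective's "disallow retries" phrasing corresponds to a task whose maximum retries is zero. A minimal sketch of such a task as a Python dict: `max_retries` and `retry_on_timeout` are assumed to be the Jobs API task fields that control this, and the task name and notebook path are placeholders.

```python
# Hypothetical task definition; field names assumed from the Jobs API.
task_no_retries = {
    "task_key": "load_orders",   # placeholder name
    "notebook_task": {"notebook_path": "/Workspace/etl/load_orders"},  # placeholder path
    "max_retries": 0,            # a failed run is not retried
    "retry_on_timeout": False,   # a timed-out run is not retried either
}

print(task_no_retries["max_retries"])
```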

◆ Photon — the free speedup

Photon is Databricks' vectorized C++ execution engine: a drop-in accelerator for SQL/DataFrame scans and aggregations that needs no code change, only enabling. It is the biggest single performance lever for analytical workloads, and it ties into the cost story in Section 6 ([Letting the platform maintain layout — Predictive Optimization & managed tables](/lessons/s6-predictive-optimization/)).
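
Because Photon is a cluster-level switch rather than a code change, enabling it amounts to one extra field on the cluster definition. A minimal sketch, assuming the Clusters API field is `runtime_engine` with the value `"PHOTON"`; the helper name and the base spec values are hypothetical.

```python
def with_photon(cluster_spec: dict) -> dict:
    """Return a copy of a cluster spec with Photon requested (field name assumed)."""
    updated = dict(cluster_spec)          # leave the caller's dict untouched
    updated["runtime_engine"] = "PHOTON"
    return updated


base = {"spark_version": "15.4.x-scala2.12", "num_workers": 2}  # placeholder values
print(with_photon(base))
```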

Takeaways (rebuild it from these)

  1. Two moves: which compute (cheapest correct), then tune Spark to the data shape.
  2. Cluster type: job compute = production (ephemeral, cheap); all-purpose = dev only; serverless = bursty/SQL; instance pool = fast startup. Access mode: Standard (was Shared, multi-user + UC governance) vs Dedicated (was Single user, for jobs). Production = service principal + cluster policy.
  3. Spark knobs: shuffle.partitions (default 200; raise to cure spill), AQE (runtime plan fixes — keep on), executor.memory (OOM). Excess groupBy shuffle → repartition by the key.
  4. Driver-only code (%sh, single-node Python) doesn't distribute → rewrite as Spark. Photon = free vectorized speedup.
  5. Objective decode: high-memory notebook task → bigger memory; "disallow retries" → max retries = 0.

Before you move on — say these without scrolling up

  1. The two moves every config question reduces to — name them.
  2. "Production job, lowest cost" — which cluster type? "Full UC governance for many analysts" — which access mode?
  3. A job spills to disk during a shuffle — which knob, up or down?
  4. Migrated code is correct but takes 20 min on big data — what's likely wrong, and the fix?

Next in Section 1: custom logic Spark's built-ins can't express — [UDFs — Python vs Pandas, and why the type is everything](/lessons/s1-udfs/) — and why the kind of UDF you choose can make a job 100× faster or slower.
