Developing Code for Data Processing

Jobs & orchestration — multi-task, dependencies, control flow

Create and automate ETL workloads using Jobs via UI/APIs/CLI; create pipeline components that use control-flow operators (if/else, for-each).

We've built the transformations: Structured Streaming, LSDP, AUTO CDC. This lesson covers what runs them: on a schedule, in the right order, recovering when a piece fails. It's a big slice of the exam, but it collapses to one clean distinction.


The spine

Beat 1 — Pipeline vs Job (the anchor)

Two words get blurred; separate them first. Recall from [Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/) that a pipeline is the transformation logic — the @dlt.table declarations (or notebook code) that say what happens to data. A Lakeflow Job (formerly "Workflow") is the operational wrapper — it says when and how that logic runs.

Anchor. Pipeline = what happens to data. Job = when and how it runs (schedule, cluster, retries, task order, alerts). Every orchestration feature below is just the Job answering "when and how."

A subtlety to bank: a Job has no "declarative vs imperative" flavour — that lives inside the task (a task can run an LSDP pipeline or a notebook). The Job just wraps and sequences tasks.
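To make the split concrete, here is a minimal sketch of a job definition written as a Python dict. The job name, cron expression, and pipeline ID are placeholders, and the field names (schedule, max_concurrent_runs, tasks, pipeline_task, max_retries) are assumed to match the Jobs API job settings; verify them against the current API reference before use.

```python
import json

# Hypothetical job definition: every value below is a placeholder.
job_spec = {
    "name": "nightly-restaurant-etl",
    # WHEN: every day at 02:00 UTC (Quartz cron: sec min hour day month weekday)
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC",
    },
    # HOW: never overlap two runs of this job
    "max_concurrent_runs": 1,
    "tasks": [
        {
            "task_key": "run_pipeline",
            # WHAT lives elsewhere: the task only points at the pipeline
            "pipeline_task": {"pipeline_id": "<your-pipeline-id>"},
            "max_retries": 2,
        }
    ],
}

# Everything except "tasks" is operational wrapper, not transformation logic.
operational_keys = sorted(k for k in job_spec if k != "tasks")
print(operational_keys)
print(json.dumps(job_spec["tasks"][0], indent=2))
```

Notice that no transformation code appears anywhere in the dict: the Job only references the pipeline by ID and answers "when and how."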

Beat 2 — a Job is a graph of tasks

A real Job is rarely one step. It's a multi-task Job: tasks wired into a dependency graph. This is the same graph idea as LSDP in [Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/), but now at the task level, and you wire it explicitly with depends_on.

Ground it in the restaurant pipeline from [The one job — and the two axes everything lives on](/lessons/f1-the-one-job/): an ingest task, then silver (depends_on: ingest), then gold (depends_on: silver) — a chain the Job runs top to bottom, restarting nothing that already succeeded.

Lock it. Job = a graph of tasks; depends_on orders them; independent tasks run in parallel.
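The ordering rule can be sketched locally with Python's standard-library graphlib. This is not how the Jobs scheduler is implemented; it only illustrates which tasks become runnable together. The audit task is a hypothetical addition to the restaurant chain, included to show two independent tasks.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: task_key -> the task_keys it depends on.
depends_on = {
    "ingest": [],
    "silver": ["ingest"],
    "gold": ["silver"],
    "audit": ["ingest"],  # hypothetical task, independent of silver
}

def execution_waves(graph):
    """Group tasks into waves; tasks in one wave have all dependencies met and could run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # also raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(depends_on)
print(waves)  # [['ingest'], ['audit', 'silver'], ['gold']]
```

Waves are a simplification: a real scheduler can start each task as soon as its own dependencies finish, without waiting for the rest of a wave.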

Beat 3 — the surprise: what rolls back when a task fails?

Here's the buried doubt, and the official sample question tests it directly. A Job has tasks A → (B, C in parallel). A and B succeed; C fails.

Predict: what's the state of the data now? Does the Job roll back because one task failed?

A and B's work is fully committed, and some of C's operations may have already completed. There is no automatic cross-task rollback. A Job is a dependency graph for orchestration, not a single database transaction — each task commits its own work as it goes. (Recall [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/): atomicity is per Delta commit, not per Job. A task's individual writes are atomic; the Job as a whole is not.) So "because C failed, everything rolls back" is always wrong.

Lock it. No cross-task rollback. Succeeded tasks stay committed; a failed task may have partially completed.
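A toy simulation of the A → (B, C) scenario makes the point visible. It assumes nothing about Databricks internals: a plain list stands in for committed table state, each append models one atomic commit, and B and C run one after the other here only to keep the output deterministic.

```python
# Stand-in for committed table state; each append models one atomic commit.
committed = []

def task_a():
    committed.append("A: bronze written")

def task_b():
    committed.append("B: silver_orders written")

def task_c():
    committed.append("C: first write done")  # this write commits
    raise RuntimeError("C failed before its second write")

def run_job():
    """Run each task and record its outcome; nothing is ever undone."""
    status = {}
    for name, task in [("A", task_a), ("B", task_b), ("C", task_c)]:
        try:
            task()
            status[name] = "SUCCESS"
        except RuntimeError:
            status[name] = "FAILED"
    return status

status = run_job()
print(status)     # {'A': 'SUCCESS', 'B': 'SUCCESS', 'C': 'FAILED'}
print(committed)  # all three writes remain, including C's partial work
```

The failed task leaves its first write behind, and the successful tasks are untouched. Any cleanup of C's partial output has to be designed in; the Job will not do it.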


The dials (skim now; return when a question needs one)

◆ Control flow — passing data and branching

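The exam objective names control flow (if/else, for-each). These are Job-level operators, and they sit alongside two related mechanisms. Task Values pass data from one task to a later one; the values arrive as strings, so cast them before use. A Condition task is the if/else: it evaluates a condition, and the untaken branch is skipped. A For-Each task takes one task template and runs it as parallel instances. A dependency on outcome:"failed" runs a task only after an upstream failure, which is where cleanup and alerts go.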
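As a hedged sketch of how if/else and for-each might look in a job's task list, here is a Python dict shaped like a Jobs API payload. Task keys, notebook paths, the row_count task value, and the region names are all hypothetical; the field names (condition_task, for_each_task, the outcome key on depends_on, and the {{...}} references) are assumptions to check against the current Jobs API reference.

```python
import json

# Hypothetical task list; every key, path, and value is a placeholder.
tasks = [
    {
        "task_key": "ingest",
        "notebook_task": {"notebook_path": "/Jobs/ingest"},
        # Inside this notebook a task value would be set with something like:
        #   dbutils.jobs.taskValues.set(key="row_count", value=1200)
    },
    {
        "task_key": "has_rows",
        "depends_on": [{"task_key": "ingest"}],
        # if/else: both operands are strings, so the task value arrives as text
        "condition_task": {
            "op": "GREATER_THAN",
            "left": "{{tasks.ingest.values.row_count}}",
            "right": "0",
        },
    },
    {
        "task_key": "per_region",
        # Runs only on the "true" branch; tasks on the other branch are skipped
        "depends_on": [{"task_key": "has_rows", "outcome": "true"}],
        "for_each_task": {
            "inputs": '["emea", "amer", "apac"]',  # JSON array passed as a string
            "concurrency": 3,                      # up to 3 iterations at once
            "task": {
                "task_key": "load_region",
                "notebook_task": {
                    "notebook_path": "/Jobs/load_region",
                    "base_parameters": {"region": "{{input}}"},
                },
            },
        },
    },
]

# One template, one iteration per input element.
regions = json.loads(tasks[2]["for_each_task"]["inputs"])
print(regions)

# Values that arrive as strings need an explicit cast before numeric use.
row_count = int("1200")
print(row_count > 0)
```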

◆ Which cluster does a Job run on?

Match the cluster to the work:

| Cluster type | Use for | Why |
| --- | --- | --- |
| New job cluster | production batch jobs | fresh, isolated, terminates after the run → pay only during execution |
| All-purpose cluster | interactive development only | shared, always-on → wasteful and un-isolated for production |
| Serverless | lightweight/bursty tasks, SQL warehouse tasks | auto-scales, no provisioning overhead |
| Instance pool | frequent short jobs needing fast startup | pre-warmed VMs → sub-minute startup |

Tell: "production job, lowest cost" → new job cluster, never all-purpose. And recall the streaming recovery config from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/): a streaming Job wants new job cluster + unlimited retries + max concurrent runs = 1.
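The streaming recovery recipe can be sketched as a job definition in Python. The job name, notebook path, runtime version, and node type are placeholders, and the field names are assumed to match the Jobs API job settings. The sketch also assumes that max_retries set to -1 means "retry indefinitely."

```python
# Hypothetical streaming job definition; all values are placeholders.
streaming_job = {
    "name": "orders-stream",
    "max_concurrent_runs": 1,  # never two copies of the same stream
    "job_clusters": [
        {
            "job_cluster_key": "stream_cluster",
            # New job cluster: created for the run, terminated afterwards
            "new_cluster": {
                "spark_version": "<runtime-version>",
                "node_type_id": "<node-type>",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "stream_orders",
            "job_cluster_key": "stream_cluster",
            "notebook_task": {"notebook_path": "/Jobs/stream_orders"},
            "max_retries": -1,  # assumed: -1 = unlimited retries
        }
    ],
}

def uses_all_purpose_cluster(spec):
    """True if any task pins an existing (all-purpose) cluster via existing_cluster_id."""
    return any("existing_cluster_id" in task for task in spec["tasks"])

print(uses_all_purpose_cluster(streaming_job))  # False
```

A check like uses_all_purpose_cluster is a hypothetical helper, but it captures the exam tell: a production job that references an existing cluster ID is the wrong answer.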

◆ Repair vs rerun

When one task fails, you don't rerun the whole Job. repair-run reruns only the failed task and its downstream dependents, reusing successful upstream results — cheaper and safer. run-now reruns the entire Job from the start (expensive, risks duplicate work). Always prefer repair-run for a single failed task. (The REST/CLI form is [Jobs via REST API and CLI](/lessons/s1-jobs-api-cli/); monitoring is [Operational job monitoring — REST/CLI, notifications, retry policy](/lessons/s5-job-monitoring/).)
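The "failed plus downstream" rule can be worked out by hand on a small graph. The function below is an illustrative sketch, not the service's implementation. The run and job IDs are placeholders, and the payload fields and endpoint paths in the comments are assumptions to confirm against the Jobs API reference.

```python
def tasks_to_repair(depends_on, failed):
    """Return the failed tasks plus every task downstream of them, sorted by name."""
    to_rerun = set(failed)
    changed = True
    while changed:
        changed = False
        for task, upstream in depends_on.items():
            # A task must rerun if any of its upstream tasks is being rerun.
            if task not in to_rerun and to_rerun.intersection(upstream):
                to_rerun.add(task)
                changed = True
    return sorted(to_rerun)

# Hypothetical graph: A -> (B, C), and D depends on C.
depends_on = {"A": [], "B": ["A"], "C": ["A"], "D": ["C"]}
rerun = tasks_to_repair(depends_on, failed={"C"})
print(rerun)  # ['C', 'D']

# Placeholder payloads for the two options (IDs are made up):
repair_payload = {"run_id": 12345, "rerun_tasks": rerun}  # POST /api/2.1/jobs/runs/repair
run_now_payload = {"job_id": 678}                         # POST /api/2.1/jobs/run-now
```

A and B never appear in the repair list, so their committed results are reused; run-now would execute all four tasks again.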

Takeaways (rebuild it from these)

  1. Pipeline = what; Job = when/how. The Job wraps tasks and sequences them.
  2. A Job is a graph of tasks: depends_on orders them; independent tasks run in parallel.
  3. Control flow: Task Values (pass data, cast from string), Condition task (if/else; untaken branch skipped), For-Each (one template, parallel instances), outcome:"failed" (cleanup/alerts).
  4. New job cluster for production (isolated, pay-per-run); all-purpose is dev-only. Streaming recovery = new job cluster + unlimited retries + max concurrent runs 1.
  5. repair-run reruns only failed + downstream (preferred); run-now reruns everything. No automatic cross-task rollback.

Before you move on — say these without scrolling up

  1. Pipeline vs Job — which is "what," which is "when/how"?
  2. B and C both depend only on A — when do they run?
  3. C fails after A and B succeeded — what's committed, and what does NOT happen?
  4. One task failed — repair-run or run-now, and what's the difference?

Next: the same Jobs, driven programmatically — the REST API and CLI the exam quizzes verbatim. → [Jobs via REST API and CLI](/lessons/s1-jobs-api-cli/)
