Pass Databricks DE Pro

Monitor and control jobs programmatically (REST API, CLI), configure failure notifications, and set retry policies for production and streaming jobs.

Reference card. The operational knobs for watching and controlling jobs from outside the UI. It extends the API/CLI mechanics of [Jobs via REST API and CLI](/lessons/s1-jobs-api-cli/) toward monitoring: read run state, get notified on failure, set the right retry policy. One idea threads through it: monitoring only reads, so every call is a GET; the trap is singular (one object) vs plural (a collection). Each item pairs a tell (the phrasing in the question) with the answer.

Read job state via REST — everything is a `GET`

| Need | Endpoint | Tell |
| --- | --- | --- |
| List all jobs | `GET /api/2.2/jobs/list` | "retrieve available / list jobs" → `jobs/list` (not `jobs/get` = one job) |
| One run's details, incl. repair history | `GET /api/2.2/jobs/runs/get` (`run_id`; `include_history=true` in 2.1) | "latest run including repair history" → `runs/get` |
| All runs of a job | `GET /api/2.2/jobs/runs/list` (`job_id`, time filters) | "all / failed runs in a window" → `runs/list` |

Singular vs plural is the trap: runs/get = one run (detail, with its repair history); runs/list = many runs (collection). Same for jobs/get vs jobs/list.
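As a minimal sketch of the singular/plural split, the snippet below builds (but does not send) two read requests with the Python standard library. The workspace URL, token, `run_id` and `job_id` values are hypothetical placeholders, and `build_get` is an illustrative helper, not part of any Databricks SDK.

```python
import urllib.parse
import urllib.request

API_PREFIX = "/api/2.2/jobs"
HOST = "https://example.cloud.databricks.com"  # hypothetical workspace URL
TOKEN = "dapi-placeholder"                     # hypothetical access token


def build_get(host: str, token: str, endpoint: str, **params) -> urllib.request.Request:
    """Build a GET request for a Jobs API read endpoint without sending it."""
    url = f"{host.rstrip('/')}{API_PREFIX}/{endpoint}"
    query = urllib.parse.urlencode(params)
    if query:
        url = f"{url}?{query}"
    return urllib.request.Request(
        url, method="GET", headers={"Authorization": f"Bearer {token}"}
    )


# Singular: one run, addressed by run_id, with its repair history.
one_run = build_get(HOST, TOKEN, "runs/get", run_id=456, include_history="true")

# Plural: the collection of runs that belong to one job.
many_runs = build_get(HOST, TOKEN, "runs/list", job_id=123)
```

Passing either request to `urllib.request.urlopen` would perform the call; both are reads, so neither needs a request body.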

The 2.1 → 2.2 version shift (freshness)

The question bank often shows 2.1 (e.g. `runs/get?include_history=true`); the current version is 2.2, which changes the following:

Pagination of tasks/job_clusters past 100 elements.
only_latest=true on runs/get — omit runs superseded by a retry/repair.
has_more removed → use next_page_token; expand_tasks=true for per-job task arrays.
Queueing on by default for jobs created via 2.2.

Recognise 2.1 phrasings, prefer 2.2.
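The `next_page_token` pattern can be sketched as a small generator. This assumes a `runs/list`-style response with a `runs` array and an optional `next_page_token`; `fetch_page` is a hypothetical callable standing in for the HTTP call, and the fake pages exist only to show the loop.

```python
from typing import Callable, Iterator, Optional


def iter_runs(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every run across pages, following next_page_token until it is absent."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("runs", [])
        token = page.get("next_page_token")
        if not token:  # no token means this was the last page
            break


# Fake two-page response, keyed by the page token that requests it.
FAKE_PAGES = {
    None: {"runs": [{"run_id": 1}, {"run_id": 2}], "next_page_token": "p2"},
    "p2": {"runs": [{"run_id": 3}]},
}


def fake_fetch(page_token: Optional[str]) -> dict:
    return FAKE_PAGES[page_token]


run_ids = [run["run_id"] for run in iter_runs(fake_fetch)]
```

The loop never consults `has_more`; the presence or absence of the token is the only stop condition.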

CLI — "did today's run succeed?"

The exam scenario: run an external script only if today's daily job finished successfully.

```shell
databricks jobs list-runs --job-id 123 \
  --start-time-from <today-midnight-epoch-ms> \
  --completed-only
```

--completed-only → only finished runs (read success/failure).
--start-time-from (today's midnight epoch) → restrict to today.
Then branch on the run's result state.

(Older CLI: databricks runs list / runs get; current unified CLI: databricks jobs list-runs / jobs get-run — recognise both.)
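The two moving parts of that scenario, the midnight timestamp and the branch on result state, might look like this in Python. It is a sketch under assumptions: the run list is taken to be already parsed from JSON, each run is assumed to carry `state.result_state` with the value `SUCCESS` on success, and both function names are hypothetical.

```python
from datetime import datetime, timezone


def midnight_epoch_ms(now: datetime) -> int:
    """Epoch milliseconds for 00:00 on the same day (and timezone) as `now`."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(midnight.timestamp() * 1000)


def todays_run_succeeded(runs: list) -> bool:
    """True if any completed run in the list reports a SUCCESS result state."""
    return any(
        run.get("state", {}).get("result_state") == "SUCCESS" for run in runs
    )


# Value to pass as --start-time-from, here using UTC as the job's timezone.
start_time_from = midnight_epoch_ms(datetime.now(timezone.utc))

# Example branch on already-fetched runs (sample data, not real output).
sample_runs = [{"run_id": 7, "state": {"result_state": "SUCCESS"}}]
if todays_run_succeeded(sample_runs):
    next_step = "run external script"
else:
    next_step = "skip"
```

Pass a timezone-aware `datetime` in the job's schedule timezone; a naive value would be interpreted in the machine's local time.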

Failure notifications — the task-retry gotcha

Job-level notifications fire only on the overall job's success/failure. So a task that fails and is retried produces no notification — and if a retry eventually succeeds, the job ends "success," silently. Fix: task-level notifications. Tell: "notifications aren't sent when failed tasks are retried" → task-level notifications, not job-level.
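To make the job-level vs task-level distinction concrete, here is a hedged fragment of job settings written as a Python dict. The job name, task key, and email addresses are hypothetical, the task body (notebook, cluster) is omitted, and the field names `email_notifications`, `on_failure`, and `max_retries` reflect the Jobs API settings as I understand them.

```python
job_settings = {
    "name": "nightly-ingest",  # hypothetical job name
    # Job-level: evaluated against the outcome of the job as a whole.
    "email_notifications": {"on_failure": ["oncall@example.com"]},
    "tasks": [
        {
            "task_key": "load_orders",  # hypothetical task
            "max_retries": 2,
            # Task-level: attached to this task rather than to the job.
            "email_notifications": {"on_failure": ["data-eng@example.com"]},
        }
    ],
}

# Recipients configured on the task itself.
task_recipients = job_settings["tasks"][0]["email_notifications"]["on_failure"]
```

The two `email_notifications` blocks are independent, so a task can alert a different audience than the job does.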

Retry policy for production streaming (freshness)

Exam / legacy best practice: unlimited retries (auto-restart the failed query) + maximum concurrent runs = 1 (never two instances of the same stream). This is the answer the question bank expects.
Current recommendation: run in Continuous mode, which auto-retries the whole job with exponential backoff (you can't set a retry policy on a continuous job) — the modern replacement.

Tell: "recommended retry policy for production streaming" → unlimited retries + max concurrent 1 (legacy/exam); know Continuous mode is the modern replacement.
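The two policies can be contrasted as job-settings fragments, shown here as Python dicts. This is a sketch: the task key is hypothetical, task bodies are omitted, and it assumes that `max_retries: -1` means "retry indefinitely" and that continuous mode is enabled through a `continuous` block with `pause_status`.

```python
# Legacy / exam answer: unlimited task retries, never two concurrent runs.
legacy_streaming = {
    "max_concurrent_runs": 1,
    "tasks": [
        {
            "task_key": "stream_events",  # hypothetical task
            "max_retries": -1,            # -1 = retry indefinitely
        }
    ],
}

# Modern replacement: continuous mode; no task retry policy is set,
# because retries are handled for the job as a whole.
continuous_streaming = {
    "continuous": {"pause_status": "UNPAUSED"},
    "tasks": [{"task_key": "stream_events"}],
}
```

The absence of `max_retries` in the second fragment is the point: the retry behaviour comes from the mode, not from a per-task setting.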

Recall (say without scrolling up)

Why are monitoring calls all GET, and what's the singular-vs-plural trap?
"Latest run including repair history" — which endpoint?
Failed tasks get retried but no notification arrives — why, and the fix?
Recommended production-streaming retry policy (legacy/exam) — and its modern replacement.

That completes Section 5's monitoring story: the map of surfaces ([The monitoring map — which surface answers which question](/lessons/s5-observability-surfaces/)) → durable system tables ([System tables — the account's durable, queryable memory](/lessons/s5-system-tables/)) → pipeline event log ([The pipeline event log — where a Lakeflow pipeline records itself](/lessons/s5-pipeline-event-log/)) → SQL Alerts to get told ([SQL Alerts — the single-value rule that makes or breaks them](/lessons/s5-sql-alerts/)) → operational REST/CLI, notifications, retries here.