Reference card. The operational knobs for watching and controlling jobs from outside the UI — extending the API/CLI mechanics of [Jobs via REST API and CLI](/lessons/s1-jobs-api-cli/) toward monitoring: read run state, get told on failure, set the right retry. One idea threads it: monitoring only reads, so it's all GET; the trap is singular (one thing) vs plural (a collection). Each item is a tell + the answer.
Read job state via REST — everything is a GET
| Need | Endpoint | Tell |
|---|---|---|
| List all jobs | GET /api/2.2/jobs/list | "retrieve available / list jobs" → jobs/list (not jobs/get = one job) |
| One run's details, incl. repair history | GET /api/2.2/jobs/runs/get (run_id; include_history=true in 2.1) | "latest run including repair history" → runs/get |
| All runs of a job | GET /api/2.2/jobs/runs/list (job_id, time filters) | "all / failed runs in a window" → runs/list |
Singular vs plural is the trap: runs/get = one run (detail, with its repair history); runs/list = many runs (collection). Same for jobs/get vs jobs/list.
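As a rough illustration of the table above, the three read calls differ only in path and query parameters. This is a sketch, not a client: the workspace host and the helper name `build_get_url` are hypothetical, and only the URL is built, so no request is sent.

```python
from urllib.parse import urlencode

API_ROOT = "/api/2.2/jobs"  # version prefix taken from the table above


def build_get_url(host: str, endpoint: str, **params) -> str:
    """Build the URL for a read-only (GET) Jobs API call.

    `host` is a placeholder workspace URL; `endpoint` is e.g. "runs/get".
    """
    query = urlencode(params)
    url = f"{host.rstrip('/')}{API_ROOT}/{endpoint}"
    return f"{url}?{query}" if query else url


HOST = "https://example-workspace.cloud.databricks.com"  # hypothetical host

# Plural: a collection of jobs.
list_jobs = build_get_url(HOST, "list")
# Singular: one run, identified by run_id, with its repair history.
one_run = build_get_url(HOST, "runs/get", run_id=456, include_history="true")
# Plural: many runs of one job.
all_runs = build_get_url(HOST, "runs/list", job_id=123)
```

Each URL would be sent with the GET method and an authorization header; how the token is obtained is outside this sketch.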
The 2.1 → 2.2 version shift (freshness)
The question bank often shows 2.1 (e.g. runs/get?include_history=true); the current version is 2.2, which changes the following:
- Pagination of tasks / job_clusters past 100 elements.
- only_latest=true on runs/get: omit runs superseded by a retry/repair.
- has_more removed → use next_page_token; expand_tasks=true for per-job task arrays.
- Queueing on by default for jobs created via 2.2.
Recognise 2.1 phrasings, prefer 2.2.
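The next_page_token change is easiest to see as a loop. The sketch below assumes the list response carries a `runs` array and an optional `next_page_token`, and that the token is passed back as a `page_token` query parameter. `fetch_page` is a hypothetical callable standing in for the real GET, and the fake pages exist only so the loop can be exercised offline.

```python
from typing import Callable, Iterator


def iter_runs(fetch_page: Callable[[dict], dict], **params) -> Iterator[dict]:
    """Yield every run across pages by following next_page_token.

    `fetch_page` is a hypothetical callable that performs the GET on
    runs/list with the given query parameters and returns parsed JSON.
    """
    query = dict(params)
    while True:
        page = fetch_page(query)
        yield from page.get("runs", [])
        token = page.get("next_page_token")
        if not token:  # no token means this was the last page
            return
        query = {**params, "page_token": token}


# Stand-in for the real HTTP call (made-up data, keyed by page token).
_FAKE_PAGES = {
    None: {"runs": [{"run_id": 1}, {"run_id": 2}], "next_page_token": "p2"},
    "p2": {"runs": [{"run_id": 3}]},
}


def fake_fetch(query: dict) -> dict:
    return _FAKE_PAGES[query.get("page_token")]


run_ids = [run["run_id"] for run in iter_runs(fake_fetch, job_id=123)]
```

The loop never consults has_more; it stops when the response carries no token.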
CLI — "did today's run succeed?"
The exam scenario: run an external script only if today's daily job finished successfully.
databricks jobs list-runs --job-id 123 \
--start-time-from <today-midnight-epoch-ms> \
--completed-only
- --completed-only → only finished runs (read success/failure).
- --start-time-from (today's midnight epoch) → restrict to today.
- Then branch on the run's result state.
(Older CLI: databricks runs list / runs get; current unified CLI: databricks jobs list-runs / jobs get-run — recognise both.)
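The two pieces a wrapper script needs are today's midnight in epoch milliseconds and a check on the result state. The sketch below assumes each run in the parsed output exposes `state.result_state` with the value `SUCCESS` on success, as in 2.1-style responses; verify the field against the API version you use. Both function names are hypothetical.

```python
from datetime import datetime, timezone


def midnight_epoch_ms(now: datetime) -> int:
    """Epoch milliseconds for 00:00 on the day of `now`, in now's timezone."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    return int(midnight.timestamp() * 1000)


def todays_run_succeeded(runs: list) -> bool:
    """True if at least one completed run reports a SUCCESS result state.

    Assumes each run dict has the shape {"state": {"result_state": ...}}.
    """
    return any(
        run.get("state", {}).get("result_state") == "SUCCESS" for run in runs
    )


# Fixed timestamp so the example is reproducible; a script would use
# datetime.now() with the timezone the job schedule is defined in.
now = datetime(2024, 5, 17, 14, 30, tzinfo=timezone.utc)
start_ms = midnight_epoch_ms(now)  # value for --start-time-from

# Made-up parsed output of the list-runs call.
sample_runs = [{"run_id": 9, "state": {"result_state": "SUCCESS"}}]
should_run_script = todays_run_succeeded(sample_runs)
```

The external script then runs only when `should_run_script` is true.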
Failure notifications — the task-retry gotcha
Job-level notifications fire only on the overall job's success/failure. So a task that fails and is retried produces no notification — and if a retry eventually succeeds, the job ends "success," silently. Fix: task-level notifications. Tell: "notifications aren't sent when failed tasks are retried" → task-level notifications, not job-level.
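To make the job-level versus task-level distinction concrete, here is a sketch of a job definition as a Python dict, with a small check for tasks that retry but alert nobody. The field names (`email_notifications.on_failure`, `max_retries`) follow the Jobs API payload shape as best recalled, so treat them as assumptions. The job, the task keys, the address, and the helper `silent_retrying_tasks` are all hypothetical.

```python
# Hypothetical job settings, shaped like a Jobs API payload.
job_settings = {
    "name": "nightly-load",
    # Job level: fires only on the overall job outcome.
    "email_notifications": {"on_failure": ["oncall@example.com"]},
    "tasks": [
        {
            "task_key": "ingest",
            "max_retries": 2,
            # Task level: fires when this task itself fails.
            "email_notifications": {"on_failure": ["oncall@example.com"]},
        },
        # Retries are configured, but no task-level alert: failures stay silent.
        {"task_key": "transform", "max_retries": 3},
    ],
}


def silent_retrying_tasks(settings: dict) -> list:
    """Task keys that retry on failure but have no task-level on_failure alert."""
    silent = []
    for task in settings.get("tasks", []):
        retries = task.get("max_retries", 0)
        recipients = task.get("email_notifications", {}).get("on_failure", [])
        if retries != 0 and not recipients:
            silent.append(task["task_key"])
    return silent


flagged = silent_retrying_tasks(job_settings)
```

Here `transform` is flagged: it can fail and be retried while the job-level notification stays quiet.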
Retry policy for production streaming (freshness)
- Exam / legacy best practice: unlimited retries (auto-restart the failed query) + maximum concurrent runs = 1 (never two instances of the same stream). The answer the bank expects.
- Current recommendation: run in Continuous mode, which auto-retries the whole job with exponential backoff (you can't set a retry policy on a continuous job) — the modern replacement.
Tell: "recommended retry policy for production streaming" → unlimited retries + max concurrent 1 (legacy/exam); know Continuous mode is the modern replacement.
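The two policies can be contrasted as job settings. This sketch assumes that `max_retries: -1` means retry indefinitely, that `max_concurrent_runs` is a job-level field, and that continuous mode is enabled through a `continuous` block. The job name, task key, and the helper `retry_style` are hypothetical.

```python
# Legacy / exam-style settings: retry forever, never overlap.
legacy_streaming_job = {
    "name": "orders-stream",  # hypothetical
    "max_concurrent_runs": 1,
    "tasks": [{"task_key": "stream", "max_retries": -1}],  # -1 assumed unlimited
}

# Current-style settings: continuous trigger, no per-task retry policy.
continuous_streaming_job = {
    "name": "orders-stream",
    "continuous": {"pause_status": "UNPAUSED"},
    "tasks": [{"task_key": "stream"}],
}


def retry_style(settings: dict) -> str:
    """Classify a job definition as 'continuous', 'legacy', or 'other'."""
    if "continuous" in settings:
        return "continuous"
    tasks = settings.get("tasks", [])
    unlimited = bool(tasks) and all(t.get("max_retries") == -1 for t in tasks)
    if unlimited and settings.get("max_concurrent_runs") == 1:
        return "legacy"
    return "other"


styles = (retry_style(legacy_streaming_job), retry_style(continuous_streaming_job))
```

The continuous definition carries no `max_retries` at all, which mirrors the point that a retry policy cannot be set on a continuous job.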
Recall (say without scrolling up)
- Why are monitoring calls all GET, and what's the singular-vs-plural trap?
- "Latest run including repair history": which endpoint?
- Failed tasks get retried but no notification arrives — why, and the fix?
- Recommended production-streaming retry policy (legacy/exam) — and its modern replacement.
That completes Section 5's monitoring story: the map of surfaces ([The monitoring map — which surface answers which question](/lessons/s5-observability-surfaces/)) → durable system tables ([System tables — the account's durable, queryable memory](/lessons/s5-system-tables/)) → pipeline event log ([The pipeline event log — where a Lakeflow pipeline records itself](/lessons/s5-pipeline-event-log/)) → SQL Alerts to get told ([SQL Alerts — the single-value rule that makes or breaks them](/lessons/s5-sql-alerts/)) → operational REST/CLI, notifications, retries here.