Lessons

Monitoring & Alerting

Operational job monitoring — REST/CLI, notifications, retry policy

Monitor and control jobs programmatically (REST API, CLI), configure failure notifications, and set retry policies for production and streaming jobs.

Reference card. The operational knobs for watching and controlling jobs from outside the UI — extending the API/CLI mechanics of [Jobs via REST API and CLI](/lessons/s1-jobs-api-cli/) toward monitoring: read run state, get told on failure, set the right retry. One idea threads it: monitoring only reads, so it's all GET; the trap is singular (one thing) vs plural (a collection). Each item is a tell + the answer.

Read job state via REST — everything is a GET

NeedEndpointTell
List all jobsGET /api/2.2/jobs/list"retrieve available / list jobs" → jobs/list (not jobs/get = one job)
One run's details, incl. repair historyGET /api/2.2/jobs/runs/get (run_id; include_history=true in 2.1)"latest run including repair history" → runs/get
All runs of a jobGET /api/2.2/jobs/runs/list (job_id, time filters)"all / failed runs in a window" → runs/list

Singular vs plural is the trap: runs/get = one run (detail, with its repair history); runs/list = many runs (collection). Same for jobs/get vs jobs/list.

The 2.1 → 2.2 version shift (freshness)

Bank often shows 2.1 (e.g. runs/get?include_history=true); current is 2.2:

Recognise 2.1 phrasings, prefer 2.2.

CLI — "did today's run succeed?"

The exam scenario: run an external script only if today's daily job finished successfully.

databricks jobs list-runs --job-id 123 \
  --start-time-from <today-midnight-epoch-ms> \
  --completed-only

(Older CLI: databricks runs list / runs get; current unified CLI: databricks jobs list-runs / jobs get-run — recognise both.)

Failure notifications — the task-retry gotcha

Job-level notifications fire only on the overall job's success/failure. So a task that fails and is retried produces no notification — and if a retry eventually succeeds, the job ends "success," silently. Fix: task-level notifications. Tell: "notifications aren't sent when failed tasks are retried" → task-level notifications, not job-level.

Retry policy for production streaming (freshness)

Tell: "recommended retry policy for production streaming" → unlimited retries + max concurrent 1 (legacy/exam); know Continuous mode is the modern replacement.

Recall (say without scrolling up)

  1. Why are monitoring calls all GET, and what's the singular-vs-plural trap?
  2. "Latest run including repair history" — which endpoint?
  3. Failed tasks get retried but no notification arrives — why, and the fix?
  4. Recommended production-streaming retry policy (legacy/exam) — and its modern replacement.

That completes Section 5's monitoring story: the map of surfaces ([The monitoring map — which surface answers which question](/lessons/s5-observability-surfaces/)) → durable system tables ([System tables — the account's durable, queryable memory](/lessons/s5-system-tables/)) → pipeline event log ([The pipeline event log — where a Lakeflow pipeline records itself](/lessons/s5-pipeline-event-log/)) → SQL Alerts to get told ([SQL Alerts — the single-value rule that makes or breaks them](/lessons/s5-sql-alerts/)) → operational REST/CLI, notifications, retries here.

Prerequisites