Monitoring & Alerting

The monitoring map — which surface answers which question

Identify the correct Databricks observability surface for a monitoring question — Spark UI/Query Profile, cluster event log, job run history, pipeline event log, and system tables.

Something's wrong — a job ran long, a cluster cost too much, a table has nulls it shouldn't — and the first question is never how do I fix it, it's where do I even look? Databricks has half a dozen monitoring surfaces that overlap enough to be genuinely confusing (three different things are literally called an "event log"). This lesson is the map: every other Section 5 lesson goes deep on one surface; this one tells you which to reach for.


The spine

Beat 1 — the anchor: scope × lifetime

Anchor. Every monitoring surface answers a question at one scope — a single query, a cluster, a job's runs, a pipeline, or the whole account — and keeps history for one lifetime — from ephemeral (gone when the cluster stops) to 60 days to durable. Name the scope and lifetime your question needs, and the surface is chosen for you.

Predict: you want to explain last quarter's compute bill per team. What scope, what lifetime — and does the Spark UI qualify?

Scope = account, lifetime = durable. The Spark UI is query-scoped and ephemeral: it forgets everything when the cluster stops, so it can't answer a quarterly bill. A surface with the wrong scope or lifetime can be a good tool and still be the wrong one for the question.

Beat 2 — the five surfaces

| Surface | Scope | Answers | Lifetime | Lesson |
| --- | --- | --- | --- | --- |
| Spark UI / Query Profile | one query execution | why was this slow: operators, shuffle, spill, skew | ephemeral | [Reading the evidence — Query Profile & Spark UI](/lessons/s6-query-profile/) |
| Cluster event log | one cluster | lifecycle: start, autoscale, restart, terminate, and why | tied to cluster | this lesson |
| Job run history | a job's runs | did each run succeed/fail, notebook results | 60 days | this lesson + [Operational job monitoring — REST/CLI, notifications, retry policy](/lessons/s5-job-monitoring/) |
| Pipeline event log | a Lakeflow pipeline | flow progress + data-quality metrics | Delta table | [The pipeline event log — where a Lakeflow pipeline records itself](/lessons/s5-pipeline-event-log/) |
| System tables | the whole account | durable SQL history: cost, audit, query history | durable | [System tables — the account's durable, queryable memory](/lessons/s5-system-tables/) |

Lock it. Query→Spark UI (ephemeral). Cluster→cluster event log. Job→run history (60 days). Pipeline→pipeline event log. Account→system tables (durable).
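The lock-it line is effectively a lookup table keyed on the scope word. A minimal Python sketch of that idea follows; `SURFACES` and `pick_surface` are illustrative names invented for this example and are not part of any Databricks API.

```python
# Hypothetical lookup: the scope of the question picks the surface.
# Entries mirror the table above; none of these names come from a Databricks API.
SURFACES = {
    "query": ("Spark UI / Query Profile", "ephemeral"),
    "cluster": ("cluster event log", "tied to cluster"),
    "job": ("job run history", "60 days"),
    "pipeline": ("pipeline event log", "Delta table"),
    "account": ("system tables", "durable"),
}


def pick_surface(scope: str) -> str:
    """Return 'surface (lifetime)' for a scope word, or raise on an unknown scope."""
    try:
        surface, lifetime = SURFACES[scope.lower()]
    except KeyError:
        raise ValueError(f"unknown scope: {scope!r}") from None
    return f"{surface} ({lifetime})"


print(pick_surface("account"))  # system tables (durable)
```

For the Predict question above, the scope word is "account", so the lookup lands on system tables rather than the Spark UI.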


The dials (skim now; return when a question needs one)

◆ The cluster event log — the machine's diary (taught here)

A cluster starts, waits for nodes, autoscales up and down, maybe restarts, and terminates; the cluster event log records each lifecycle event with a timestamp and a reason. It differs from the Spark UI: Spark UI = the work (stages/tasks); cluster event log = the machine (its size and state over time). The exam uses it for autoscaling forensics. Analysts share an autoscaling interactive cluster, and the admin wants to know whether upscaling comes from too many concurrent users or from heavy individual queries. The cluster event log's UPSIZE/DOWNSIZE events, with timestamps and reasons, let you line up when the cluster grew against what ran. Tell: "when and why did the cluster resize" → cluster event log (not Spark UI, not system tables).
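The forensics step can be sketched in Python: filter the event log down to resize events and put them in time order, ready to compare against what was running. This is a sketch under assumptions. The sample events are made up, the field names (`timestamp`, `type`, `details`) are assumed, and the `UPSIZE`/`DOWNSIZE` labels follow this lesson's wording, so real event payloads may use different names.

```python
from datetime import datetime, timezone

# Hypothetical sample events. Field names and the UPSIZE/DOWNSIZE labels follow
# this lesson's wording; real event payloads may differ.
events = [
    {"timestamp": 1700004200000, "type": "DOWNSIZE", "details": {"reason": "idle"}},
    {"timestamp": 1700000000000, "type": "STARTING", "details": {}},
    {"timestamp": 1700000600000, "type": "UPSIZE", "details": {"reason": "autoscale"}},
]

RESIZE_TYPES = {"UPSIZE", "DOWNSIZE"}


def resize_timeline(events):
    """Keep only resize events, oldest first, as (UTC time, type, reason) tuples."""
    resizes = [e for e in events if e["type"] in RESIZE_TYPES]
    resizes.sort(key=lambda e: e["timestamp"])  # epoch milliseconds
    return [
        (
            datetime.fromtimestamp(e["timestamp"] / 1000, tz=timezone.utc),
            e["type"],
            e["details"].get("reason", "unknown"),
        )
        for e in resizes
    ]


for when, kind, reason in resize_timeline(events):
    print(when.isoformat(), kind, reason)
```

With the timeline in hand, the admin's question becomes a join by time: many users active at each upsize points to concurrency, while one long query spanning it points to heavy individual work.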

◆ Job run history — 60 days, then export (taught here)

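Every job keeps a run history: each run's state, parameters, notebook results. The hard fact: retention = 60 days, then runs are auto-removed. To keep records longer, export them before they expire: save run results as HTML, deliver logs to storage, or pull run records through the REST API or CLI.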

Tell: "keep job run records beyond the window" → window is 60 days, fix is export (nothing extends it in place). The durable "how much did jobs cost / who ran what" is system tables ([System tables — the account's durable, queryable memory](/lessons/s5-system-tables/)), not run history.
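Because nothing extends the window in place, an export job has to catch runs before they age out. A minimal Python sketch of that selection step follows. `runs_to_export`, the 7-day safety margin, and the run dict shape (`run_id`, `start_time` in epoch milliseconds) are assumptions made for illustration; only the 60-day figure comes from this lesson.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=60)  # retention window stated in this lesson


def runs_to_export(runs, now, margin=timedelta(days=7)):
    """Return run IDs whose age is within `margin` of the retention window.

    `runs` is a list of dicts with "run_id" and "start_time" (epoch milliseconds);
    that shape is an assumption for this sketch, not a documented payload.
    """
    cutoff = now - (RETENTION - margin)
    due = []
    for run in runs:
        started = datetime.fromtimestamp(run["start_time"] / 1000, tz=timezone.utc)
        if started <= cutoff:
            due.append(run["run_id"])
    return due


now = datetime(2024, 3, 1, tzinfo=timezone.utc)
runs = [
    {"run_id": 101, "start_time": int((now - timedelta(days=58)).timestamp() * 1000)},
    {"run_id": 102, "start_time": int((now - timedelta(days=10)).timestamp() * 1000)},
]
print(runs_to_export(runs, now))  # [101]
```

The margin is a design choice: exporting a week early leaves room for a failed export to be retried before the run disappears.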

◆ The three-"event log" collision

Disambiguate by the scope word: cluster → cluster event log; pipeline / expectation → pipeline event log; who accessed / auditor → audit log.

Takeaways (rebuild it from these)

  1. Choose by scope × lifetime: query→Spark UI (ephemeral); cluster→cluster event log; job→run history (60 days); pipeline→pipeline event log; account→system tables (durable).
  2. Cluster event log = lifecycle diary — UPSIZE/DOWNSIZE with timestamps + reasons → explains when/why a cluster resized.
  3. Job run history = 60 days; keep longer only by export (HTML / logs to storage / REST-CLI pull).
  4. "Event log" is overloaded — cluster (compute), pipeline (flow_progress/DQ), audit (system.access). Use the scope word.
  5. Ephemeral surfaces answer why this run; durable ones answer the bill and audit over time.

Before you move on — say these without scrolling up

  1. The two dimensions that pick a monitoring surface.
  2. "Why did the shared cluster keep upsizing?" — which surface, and which events?
  3. Job run history retention, and the only way to keep records longer.
  4. The three things called "event log" — the scope word that tells them apart.

Next: the durable, account-wide continent on the map — system tables, and how one SQL query attributes cost per user → [System tables — the account's durable, queryable memory](/lessons/s5-system-tables/).
