Something is wrong: a job ran long, a cluster cost too much, a table has nulls it shouldn't. The first question is never "how do I fix it?" but "where do I even look?" Databricks has half a dozen monitoring surfaces that overlap enough to be genuinely confusing (three different things are literally called an "event log"). This lesson is the map: every other Section 5 lesson goes deep on one surface; this one tells you which to reach for.
The spine
Beat 1 — the anchor: scope × lifetime
Anchor. Every monitoring surface answers a question at one scope — a single query, a cluster, a job's runs, a pipeline, or the whole account — and keeps history for one lifetime — from ephemeral (gone when the cluster stops) to 60 days to durable. Name the scope and lifetime your question needs, and the surface is chosen for you.
Predict: you want to explain last quarter's compute bill per team. What scope, what lifetime — and does the Spark UI qualify?
…
Scope = account, lifetime = durable. The Spark UI is query-scoped and ephemeral: it forgets everything when the cluster stops, so it can't answer a quarterly bill. A surface with the wrong scope is still a monitoring tool, just one pointed at the wrong thing.
Beat 2 — the five surfaces
| Surface | Scope | Answers | Lifetime | Lesson |
|---|---|---|---|---|
| Spark UI / Query Profile | one query execution | why was this slow — operators, shuffle, spill, skew | ephemeral | [Reading the evidence — Query Profile & Spark UI](/lessons/s6-query-profile/) |
| Cluster event log | one cluster | lifecycle — start, autoscale, restart, terminate, why | tied to cluster | this lesson |
| Job run history | a job's runs | did each run succeed/fail, notebook results | 60 days | this lesson + [Operational job monitoring — REST/CLI, notifications, retry policy](/lessons/s5-job-monitoring/) |
| Pipeline event log | a Lakeflow pipeline | flow progress + data-quality metrics | Delta table | [The pipeline event log — where a Lakeflow pipeline records itself](/lessons/s5-pipeline-event-log/) |
| System tables | the whole account | durable SQL history — cost, audit, query history | durable | [System tables — the account's durable, queryable memory](/lessons/s5-system-tables/) |
Lock it. Query→Spark UI (ephemeral). Cluster→cluster event log. Job→run history (60 days). Pipeline→pipeline event log. Account→system tables (durable).
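The table can also be written down as a plain lookup, which makes the "name the scope, get the surface" rule concrete. This is only a sketch: the `Surface` class and `pick_surface` function are hypothetical names introduced for illustration, and the values are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Surface:
    name: str
    lifetime: str


# Scope -> surface, copied from the table in this lesson.
SURFACES = {
    "query": Surface("Spark UI / Query Profile", "ephemeral"),
    "cluster": Surface("Cluster event log", "tied to cluster"),
    "job": Surface("Job run history", "60 days"),
    "pipeline": Surface("Pipeline event log", "Delta table"),
    "account": Surface("System tables", "durable"),
}


def pick_surface(scope: str) -> Surface:
    """Return the monitoring surface for a scope word (case-insensitive)."""
    try:
        return SURFACES[scope.lower()]
    except KeyError:
        raise ValueError(
            f"unknown scope {scope!r}; expected one of {sorted(SURFACES)}"
        ) from None


# The quarterly-bill question from Beat 1: account scope.
print(pick_surface("account"))
```

The lookup is keyed on scope alone because, in the table, each scope maps to exactly one surface; lifetime then acts as a check that the surface keeps history long enough for the question.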
The dials (skim now; return when a question needs one)
◆ The cluster event log — the machine's diary (taught here)
A cluster starts, waits for nodes, autoscales up/down, maybe restarts, terminates — and the cluster event log records each lifecycle event with a timestamp + reason. Different from the Spark UI: Spark UI = work (stages/tasks); cluster event log = the machine (its size/state over time). The exam's use is autoscaling forensics: analysts share an autoscaling interactive cluster and the admin wants to know whether upscaling is from too many concurrent users or heavy individual queries → the cluster event log's UPSIZE/DOWNSIZE events (with timestamps + reasons) let you line up when it grew against what ran. Tell: "when and why did the cluster resize" → cluster event log (not Spark UI, not system tables).
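The "line up when it grew against what ran" step might look like the sketch below. It assumes the events have already been collected into plain dictionaries; the keys (`timestamp`, `type`, `reason`), the `queries` list, the five-minute window, and every sample value are illustrative assumptions, and the event type names follow this lesson's UPSIZE/DOWNSIZE shorthand rather than any exact API enum.

```python
from datetime import datetime, timedelta


def resize_events(events):
    """Keep only resize events, oldest first."""
    resizes = [e for e in events if e["type"].startswith(("UPSIZE", "DOWNSIZE"))]
    return sorted(resizes, key=lambda e: e["timestamp"])


def users_before(event, queries, window=timedelta(minutes=5)):
    """Distinct users who started a query in the window leading up to the event."""
    start = event["timestamp"] - window
    return {q["user"] for q in queries if start <= q["started"] <= event["timestamp"]}


# Illustrative data only; not taken from a real cluster.
events = [
    {"timestamp": datetime(2024, 1, 8, 9, 12), "type": "UPSIZE", "reason": "autoscale"},
    {"timestamp": datetime(2024, 1, 8, 9, 0), "type": "STARTING", "reason": "user start"},
    {"timestamp": datetime(2024, 1, 8, 11, 40), "type": "DOWNSIZE", "reason": "autoscale"},
]
queries = [
    {"user": "ana", "started": datetime(2024, 1, 8, 9, 9)},
    {"user": "ben", "started": datetime(2024, 1, 8, 9, 10)},
    {"user": "ana", "started": datetime(2024, 1, 8, 9, 11)},
]

for event in resize_events(events):
    if event["type"].startswith("UPSIZE"):
        active = sorted(users_before(event, queries))
        print(event["timestamp"], event["reason"], active)
```

Several distinct users just before each upsize points toward concurrency; one user appearing alone before each upsize points toward heavy individual queries.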
◆ Job run history — 60 days, then export (taught here)
Every job keeps a run history: each run's state, parameters, notebook results. The hard fact: retention = 60 days, then runs are auto-removed. To keep longer:
- Export notebook run results to HTML from the run detail page, or
- deliver cluster logs to DBFS / cloud storage, or
- pull run metadata via the Jobs REST API / CLI on a schedule ([Operational job monitoring — REST/CLI, notifications, retry policy](/lessons/s5-job-monitoring/)).
Tell: "keep job run records beyond the window" → window is 60 days, fix is export (nothing extends it in place). The durable "how much did jobs cost / who ran what" is system tables ([System tables — the account's durable, queryable memory](/lessons/s5-system-tables/)), not run history.
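A scheduled export needs to know which runs are about to age out. The sketch below picks them from run records that are assumed to be already fetched; the `run_id` and `start_time` (epoch milliseconds) keys are assumed field names, `SAFETY_MARGIN` and `runs_to_export` are hypothetical, and the sample runs are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=60)     # retention window stated in this lesson
SAFETY_MARGIN = timedelta(days=7)  # hypothetical buffer; tune to your export schedule


def runs_to_export(runs, now):
    """Return run_ids that started at least (RETENTION - SAFETY_MARGIN) ago."""
    cutoff = now - (RETENTION - SAFETY_MARGIN)
    due = []
    for run in runs:
        # start_time is assumed to be epoch milliseconds.
        started = datetime.fromtimestamp(run["start_time"] / 1000, tz=timezone.utc)
        if started <= cutoff:
            due.append(run["run_id"])
    return due


def to_ms(dt):
    """Convert an aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


# Illustrative records only.
now = datetime(2024, 3, 1, tzinfo=timezone.utc)
runs = [
    {"run_id": 1, "start_time": to_ms(datetime(2024, 1, 2, tzinfo=timezone.utc))},
    {"run_id": 2, "start_time": to_ms(datetime(2024, 2, 20, tzinfo=timezone.utc))},
]
print(runs_to_export(runs, now))
```

Exporting a few days before the limit, rather than on day 60 itself, leaves room for a failed or delayed export job to be rerun before the records are removed.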
◆ The three-"event log" collision
- Cluster event log: a cluster's lifecycle (resize/restart/terminate). Compute state.
- Pipeline event log: a Lakeflow pipeline's `flow_progress` + data-quality metrics ([The pipeline event log — where a Lakeflow pipeline records itself](/lessons/s5-pipeline-event-log/)). Pipeline + quality.
- Audit log (`system.access.audit`): who did what across the account ([System tables — the account's durable, queryable memory](/lessons/s5-system-tables/)). Security.
Disambiguate by the scope word: cluster → cluster event log; pipeline / expectation → pipeline event log; who accessed / auditor → audit log.
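As a memory aid, the scope-word rule can be sketched as a tiny keyword router. The function name `which_event_log` and the exact keyword lists are assumptions made for illustration; real exam questions need reading, not string matching.

```python
def which_event_log(question: str) -> str:
    """Map a question to one of the three 'event logs' by its scope word."""
    q = question.lower()
    # First match wins, so a question mixing scope words resolves top-down.
    rules = [
        (("who accessed", "auditor", "audit"), "audit log (system.access.audit)"),
        (("pipeline", "expectation"), "pipeline event log"),
        (("cluster",), "cluster event log"),
    ]
    for keywords, surface in rules:
        if any(keyword in q for keyword in keywords):
            return surface
    return "unclear: no scope word found"


for question in (
    "When and why did the cluster resize?",
    "Which expectation dropped rows last night?",
    "The auditor asks who accessed the payroll table.",
):
    print(question, "->", which_event_log(question))
```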
Takeaways (rebuild it from these)
- Choose by scope × lifetime: query→Spark UI (ephemeral); cluster→cluster event log; job→run history (60 days); pipeline→pipeline event log; account→system tables (durable).
- Cluster event log = lifecycle diary: `UPSIZE/DOWNSIZE` with timestamps + reasons → explains when/why a cluster resized.
- Job run history = 60 days; keep longer only by export (HTML / logs to storage / REST-CLI pull).
- "Event log" is overloaded: cluster (compute), pipeline (`flow_progress`/DQ), audit (`system.access`). Use the scope word.
- Ephemeral surfaces answer why this run; durable ones answer the bill and audit over time.
Before you move on — say these without scrolling up
- The two dimensions that pick a monitoring surface.
- "Why did the shared cluster keep upsizing?" — which surface, and which events?
- Job run history retention, and the only way to keep records longer.
- The three things called "event log" — the scope word that tells them apart.
Next: the durable, account-wide continent on the map — system tables, and how one SQL query attributes cost per user → [System tables — the account's durable, queryable memory](/lessons/s5-system-tables/).