Identify the correct Databricks observability surface for a monitoring question — Spark UI/Query Profile, cluster event log, job run history, pipeline event log, and system tables.

Something's wrong: a job ran long, a cluster cost too much, a table has nulls it shouldn't. The first question is never "how do I fix it?" but "where do I even look?" Databricks has half a dozen monitoring surfaces that overlap enough to be genuinely confusing (three different things get called an "event log"). This lesson is the map: every other Section 5 lesson goes deep on one surface; this one tells you which to reach for.

The spine

Beat 1 — the anchor: scope × lifetime

Anchor. Every monitoring surface answers a question at one scope — a single query, a cluster, a job's runs, a pipeline, or the whole account — and keeps history for one lifetime — from ephemeral (gone when the cluster stops) to 60 days to durable. Name the scope and lifetime your question needs, and the surface is chosen for you.

Predict: you want to explain last quarter's compute bill per team. What scope, what lifetime — and does the Spark UI qualify?

…

Scope = account, lifetime = durable. The Spark UI is query-scoped and ephemeral: it forgets everything when the cluster stops, so it can't answer a quarterly bill. A surface with the wrong scope is the right kind of tool pointed at the wrong thing.

Beat 2 — the five surfaces

| Surface | Scope | Answers | Lifetime | Lesson |
| --- | --- | --- | --- | --- |
| Spark UI / Query Profile | one query execution | why was this slow: operators, shuffle, spill, skew | ephemeral | [Reading the evidence — Query Profile & Spark UI](/lessons/s6-query-profile/) |
| Cluster event log | one cluster | lifecycle: start, autoscale, restart, terminate, and why | tied to cluster | this lesson |
| Job run history | a job's runs | did each run succeed/fail, notebook results | 60 days | this lesson + [Operational job monitoring — REST/CLI, notifications, retry policy](/lessons/s5-job-monitoring/) |
| Pipeline event log | a Lakeflow pipeline | flow progress + data-quality metrics | Delta table | [The pipeline event log — where a Lakeflow pipeline records itself](/lessons/s5-pipeline-event-log/) |
| System tables | the whole account | durable SQL history: cost, audit, query history | durable | [System tables — the account's durable, queryable memory](/lessons/s5-system-tables/) |

Lock it. Query→Spark UI (ephemeral). Cluster→cluster event log. Job→run history (60 days). Pipeline→pipeline event log. Account→system tables (durable).
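The scope × lifetime rule can be sketched as a small lookup. This is only an illustration of the reasoning, not a Databricks API: the `Surface` class, `pick_surface` function, and the numeric lifetime ranking are all assumptions made for the example, and treating "Delta table" as equivalent to "durable" is a simplification.

```python
from dataclasses import dataclass
from typing import Optional

# Lifetimes ordered from shortest to longest. The ranks are an illustrative
# assumption; "Delta table" is ranked like "durable" for simplicity.
LIFETIME_RANK = {
    "ephemeral": 0,
    "tied to cluster": 1,
    "60 days": 2,
    "Delta table": 3,
    "durable": 3,
}


@dataclass(frozen=True)
class Surface:
    name: str
    scope: str
    lifetime: str


# The five surfaces from the table above.
SURFACES = [
    Surface("Spark UI / Query Profile", "query", "ephemeral"),
    Surface("Cluster event log", "cluster", "tied to cluster"),
    Surface("Job run history", "job", "60 days"),
    Surface("Pipeline event log", "pipeline", "Delta table"),
    Surface("System tables", "account", "durable"),
]


def pick_surface(scope: str, needed_lifetime: str) -> Optional[Surface]:
    """Return the surface for this scope if it keeps history long enough."""
    for surface in SURFACES:
        long_enough = LIFETIME_RANK[surface.lifetime] >= LIFETIME_RANK[needed_lifetime]
        if surface.scope == scope and long_enough:
            return surface
    return None  # right scope may exist, but its history does not last


# The quarterly-bill question: account scope, durable lifetime.
print(pick_surface("account", "durable"))
# The Spark UI is query-scoped and ephemeral, so it cannot qualify.
print(pick_surface("query", "durable"))
```

Asking for `("query", "durable")` returns `None`, which mirrors the Predict answer: the Spark UI has the right kind of detail but not the lifetime a quarterly question needs.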

The dials (skim now; return when a question needs one)

◆ The cluster event log — the machine's diary (taught here)

A cluster starts, waits for nodes, autoscales up and down, maybe restarts, then terminates, and the cluster event log records each lifecycle event with a timestamp and a reason. It differs from the Spark UI: Spark UI = the work (stages/tasks); cluster event log = the machine (its size and state over time). The exam's use is autoscaling forensics. Analysts share an autoscaling interactive cluster, and the admin wants to know whether upscaling comes from too many concurrent users or from heavy individual queries. The cluster event log's UPSIZE/DOWNSIZE events (with timestamps + reasons) let you line up when the cluster grew against what was running. Tell: "when and why did the cluster resize" → cluster event log (not Spark UI, not system tables).
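The "line up when it grew against what ran" step might look like the sketch below. All data here is hypothetical sample data: in practice the resize events would come from the cluster event log and the query activity from wherever you record it. The field names, the 5-minute lookback window, and the 3-user threshold are assumptions chosen for illustration, not values from Databricks.

```python
from datetime import datetime, timedelta

# Hypothetical resize events, shaped loosely like cluster event log entries.
resize_events = [
    {"time": datetime(2024, 5, 1, 9, 5), "type": "UPSIZE"},
    {"time": datetime(2024, 5, 1, 14, 30), "type": "UPSIZE"},
    {"time": datetime(2024, 5, 1, 18, 0), "type": "DOWNSIZE"},
]

# Hypothetical query activity on the same cluster.
queries = [
    {"user": "ana", "start": datetime(2024, 5, 1, 9, 0), "end": datetime(2024, 5, 1, 9, 10)},
    {"user": "bo", "start": datetime(2024, 5, 1, 9, 1), "end": datetime(2024, 5, 1, 9, 6)},
    {"user": "cy", "start": datetime(2024, 5, 1, 9, 3), "end": datetime(2024, 5, 1, 9, 8)},
    {"user": "dee", "start": datetime(2024, 5, 1, 14, 0), "end": datetime(2024, 5, 1, 15, 0)},
]


def active_users_before(event_time, query_log, window=timedelta(minutes=5)):
    """Distinct users with a query overlapping the window ending at event_time."""
    window_start = event_time - window
    return sorted(
        {q["user"] for q in query_log
         if q["start"] <= event_time and q["end"] >= window_start}
    )


def explain_upsizes(events, query_log, min_users_for_concurrency=3):
    """For each UPSIZE event, guess whether concurrency or a heavy query drove it."""
    report = []
    for event in events:
        if event["type"] != "UPSIZE":
            continue
        users = active_users_before(event["time"], query_log)
        if len(users) >= min_users_for_concurrency:
            cause = "many concurrent users"
        else:
            cause = "heavy individual queries"
        report.append((event["time"], users, cause))
    return report


for when, users, cause in explain_upsizes(resize_events, queries):
    print(when.isoformat(), users, cause)
```

With this sample data the morning upsize coincides with three overlapping users, while the afternoon one coincides with a single long-running query. The classification is a heuristic; the event's own reason field should be read alongside it.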

◆ Job run history — 60 days, then export (taught here)

Every job keeps a run history: each run's state, parameters, notebook results. The hard fact: retention = 60 days, then runs are auto-removed. To keep longer:

Export notebook run results to HTML from the run detail page, or
deliver cluster logs to DBFS / cloud storage, or
pull run metadata via Jobs REST API / CLI on a schedule ([Operational job monitoring — REST/CLI, notifications, retry policy](/lessons/s5-job-monitoring/)).

Tell: "keep job run records beyond the window" → the window is 60 days, and the fix is export (nothing extends it in place). The durable answer to "how much did jobs cost / who ran what" lives in system tables ([System tables — the account's durable, queryable memory](/lessons/s5-system-tables/)), not run history.
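The third export option, pulling run metadata on a schedule, could be sketched as follows. The pagination loop takes any `fetch_page` callable, so it can be exercised without a workspace. The HTTP helper assumes the Jobs API 2.1 `GET /api/2.1/jobs/runs/list` endpoint with `job_id`, `limit`, and `page_token` parameters and `runs` / `has_more` / `next_page_token` response fields; check these against your workspace's API reference before relying on them. The function names, the `host` and `token` parameters, and the JSON Lines output file are hypothetical choices for this example.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional


def iter_runs(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every run, following next_page_token until has_more is false."""
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("runs", [])
        token = page.get("next_page_token")
        if not page.get("has_more") or token is None:
            break


def archive_runs(fetch_page: Callable[[Optional[str]], dict], path: str) -> int:
    """Append one JSON line per run to `path`; return how many were written."""
    written = 0
    with open(path, "a", encoding="utf-8") as out:
        for run in iter_runs(fetch_page):
            out.write(json.dumps(run) + "\n")
            written += 1
    return written


def make_http_fetcher(host: str, token: str, job_id: int):
    """Build a fetch_page that calls the runs/list endpoint (assumed shape).

    Not exercised in this sketch; `host` is a workspace URL and `token`
    a personal access token, both supplied by the caller.
    """
    def fetch_page(page_token: Optional[str]) -> dict:
        params = {"job_id": job_id, "limit": 25}
        if page_token:
            params["page_token"] = page_token
        url = f"{host}/api/2.1/jobs/runs/list?{urllib.parse.urlencode(params)}"
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"}
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch_page
```

A scheduled job would call `archive_runs(make_http_fetcher(host, token, job_id), "runs.jsonl")` more often than every 60 days. Because the file is opened in append mode, repeated pulls will write the same run more than once, so a real archive would deduplicate on `run_id`.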

◆ The three-"event log" collision

Cluster event log — a cluster's lifecycle (resize/restart/terminate). Compute state.
Pipeline event log — a Lakeflow pipeline's flow_progress + data-quality metrics ([The pipeline event log — where a Lakeflow pipeline records itself](/lessons/s5-pipeline-event-log/)). Pipeline + quality.
Audit log (system.access.audit) — who did what across the account ([System tables — the account's durable, queryable memory](/lessons/s5-system-tables/)). Security.

Disambiguate by the scope word: cluster → cluster event log; pipeline / expectation → pipeline event log; who accessed / auditor → audit log.
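As a memory aid, the scope-word rule can be written as a tiny keyword matcher. This is a toy sketch: the `which_event_log` function is hypothetical, plain substring matching is a deliberate simplification, and checking audit words first (so "who accessed the cluster" resolves to the audit log) is an assumption about which cue should win.

```python
from typing import Optional

# Scope words taken from the rule above. Order matters: audit cues are
# checked first, then pipeline cues, then the generic word "cluster".
SCOPE_WORDS = [
    ("audit log (system.access.audit)", ("who accessed", "auditor")),
    ("pipeline event log", ("pipeline", "expectation")),
    ("cluster event log", ("cluster",)),
]


def which_event_log(question: str) -> Optional[str]:
    """Return the event log suggested by the first scope word found, if any."""
    text = question.lower()
    for log_name, words in SCOPE_WORDS:
        if any(word in text for word in words):
            return log_name
    return None


print(which_event_log("Why did the cluster restart overnight?"))
print(which_event_log("Which expectation dropped rows yesterday?"))
print(which_event_log("The auditor asks who accessed the customers table."))
```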

Takeaways (rebuild it from these)

Choose by scope × lifetime: query→Spark UI (ephemeral); cluster→cluster event log; job→run history (60 days); pipeline→pipeline event log; account→system tables (durable).
Cluster event log = lifecycle diary — UPSIZE/DOWNSIZE with timestamps + reasons → explains when/why a cluster resized.
Job run history = 60 days; keep longer only by export (HTML / logs to storage / REST-CLI pull).
"Event log" is overloaded — cluster (compute), pipeline (flow_progress/DQ), audit (system.access). Use the scope word.
Ephemeral surfaces answer "why did this run behave this way?"; durable ones answer billing and audit questions over time.

Before you move on — say these without scrolling up

The two dimensions that pick a monitoring surface.
"Why did the shared cluster keep upsizing?" — which surface, and which events?
Job run history retention, and the only way to keep records longer.
The three things called "event log" — the scope word that tells them apart.

Next: the durable, account-wide continent on the map — system tables, and how one SQL query attributes cost per user → [System tables — the account's durable, queryable memory](/lessons/s5-system-tables/).

The monitoring map — which surface answers which question