Lessons

Monitoring & Alerting

SQL Alerts — the single-value rule that makes or breaks them

Design Databricks SQL Alerts that evaluate a query result against a threshold, including collapsing multi-condition logic into a single evaluated value.

You've built the monitoring queries — cost per user from system tables, failed-records from the pipeline event log, null-rate on a table. But nobody watches a dashboard at 3 a.m. A SQL Alert runs a query on a schedule and notifies you when the result crosses a line. Simple idea — with exactly one sharp edge that every exam question here pokes at.


The spine

Beat 1 — the anchor: the alert only compares; the query must think

Predict: you want to be paged when "missing-email % > 15% AND invalid-phone % > 10%." Can you just hand the alert those two columns and an AND?

No — and that "no" is the whole lesson:

Anchor. A SQL Alert evaluates a single value — one field from the query result — against a threshold, and notifies when true. It cannot natively "trigger when A > x AND B > y." All the intelligence lives in the SQL query: write a query whose one returned field is true exactly when you want to be paged. The alert only compares; the query must think.

Internalize the inversion and every SQL-alert question becomes: how do I make the query emit a single field meaning "alert now"?

Beat 2 — the consequence: collapse conditions into one field

The most-tested move. Two conditions → one field, two clean ways:

1. A CASE returning 1/0 — evaluate the boolean in SQL, alert on = 1:

SELECT CASE
         WHEN new_tickets > 200 AND under_processing > 150 THEN 1 ELSE 0
       END AS should_alert
FROM ( SELECT SUM(CASE WHEN status='new' THEN 1 ELSE 0 END) AS new_tickets,
              SUM(CASE WHEN status='in_progress' THEN 1 ELSE 0 END) AS under_processing
       FROM tickets );
-- Alert condition: should_alert = 1

Both thresholds live inside the CASE; the alert checks one scalar. (The call-center/ticket question and its siblings.)

2. Compute the metrics as aggregate columns and point the alert's evaluated field at the one that matters. Same principle — query does the arithmetic, alert reads one number.

The wrong answers always return multiple columns/rows and expect the alert to AND them — it can't.

Lock it. Multiple conditions → CASE WHEN c1 AND c2 THEN 1 ELSE 0 END, alert on = 1 (or a single aggregate). One field carries the verdict.


The dials (skim now; return when a question needs one)

◆ Grouped queries — query decides what, alert config decides how often

When the query groups (e.g. mean(temperature) > 120 per sensor_id), the alert evaluates its configured column against the threshold across the returned rows — firing per group as framed. If it sends three consecutive notifications, that's the alert re-evaluating on its schedule and re-notifying while the condition holds — cadence is an alert setting, not query logic. Keep straight: the query decides what is compared; the alert schedule/notification config decides how often you hear it. (So "average temp for at least one sensor exceeded 120 on three consecutive executions" = the query's per-sensor condition held across three scheduled runs.)

◆ Temporal conditions — push the time logic into the query

"Alert when the daily average of percent_null exceeds 5% for three consecutive days, notify once":

That percent_null typically comes from a Lakehouse Monitoring profile-metrics table — Databricks auto-profiles a table and writes metrics (percent_null, drift) to a metrics table on a schedule, which you then alert on. The standard "monitor a column's null rate over time" pipeline.

◆ The current model — Alerts v2 (freshness)

Channels (email, Slack, PagerDuty, webhooks), schedule, destinations are configuration. And tune thresholds / "notify once" to avoid alert fatigue — an ignored alert is worse than none.

Takeaways (rebuild it from these)

  1. A SQL Alert compares one field to a threshold — the alert only compares; the query must do the thinking.
  2. Multiple conditions → collapse into one field: CASE WHEN c1 AND c2 THEN 1 ELSE 0 END, alert on = 1. Returning many columns for the alert to AND is the classic wrong answer.
  3. Grouped query → alert evaluates the chosen column per result; notification cadence ("three consecutive", "notify once") is an alert setting, not query logic.
  4. Temporal conditions → time logic in the query (GROUP BY DATE_TRUNC, window/self-join); often fed by Lakehouse Monitoring profile-metrics (percent_null, drift).
  5. Alerts v2 = source + comparison_operator + threshold; states OK/TRIGGERED/ERROR (no UNKNOWN).

Before you move on — say these without scrolling up

  1. Why can't an alert do "A > x AND B > y" directly — and what do you do instead?
  2. A grouped alert fires three times — is that the query or the alert config?
  3. "Daily avg over 3 consecutive days, notify once" — which part is query, which is alert setting?
  4. Alerts v2 evaluation — the three pieces, and the states.

Next: the operational side — driving jobs via REST/CLI, notifications, and retry settings → [Operational job monitoring — REST/CLI, notifications, retry policy](/lessons/s5-job-monitoring/).

Prerequisites

Leads to