Bronze is raw; the real engineering happens when you transform it into silver and gold. This lesson is the advanced-transformation toolkit the exam expects you to write, not just recognise. One tool dominates — window functions — so we build there first, then joins and aggregations.
The spine
Beat 1 — the anchor: a window keeps every row; GROUP BY collapses
Here's the distinction that makes window functions click. Say you want to tag each employee with a salary tier within their department — but you must keep every employee row in the output.
Predict: can
GROUP BY departmentdo that?
…
No. GROUP BY collapses each department into one summary row — the individual employees vanish. You need something that computes a per-group value but keeps the rows. That's a window function:
Anchor. A window function computes a value for each row using a group of related rows, but keeps every row.
GROUP BYcollapses a group to one row; a window function adds a column to every row, computed over its group. That's the whole power: rank, running total, "previous row" — without losing the rows.
A window is three parts inside OVER (...):
PARTITION BY— which rows form the group (e.g. perdepartment). (The analytic "window" — not a time window; see the collision below.)ORDER BY— the sequence within the group (e.g. bysalary).- the frame (
ROWS BETWEEN …) — which rows of the ordered group feed this row's calculation.
Beat 2 — ranking, and the tie behaviour (the most-tested detail)
Three ranking functions look identical and differ only on ties:
| Function | On a tie (equal ORDER BY value) | Sequence |
|---|---|---|
ROW_NUMBER() | each tied row gets a different number (arbitrary among ties) | 1, 2, 3, 4 |
RANK() | tied rows get the same rank, then it skips | 1, 2, 2, 4 |
DENSE_RANK() | tied rows get the same rank, no gap | 1, 2, 2, 3 |
Predict: "assign a salary tier per department, where employees with the same salary share the same tier, and tiers don't skip numbers" — which one?
…
DENSE_RANK() — same value → same tier (rules out ROW_NUMBER), no gaps (rules out RANK).
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank
w = Window.partitionBy("department").orderBy(col("salary").desc())
df.withColumn("tier", dense_rank().over(w))
Beat 3 — running totals, and the frame gotcha
For a cumulative sum/average, the frame is what makes it "running":
AVG(score) OVER (PARTITION BY student ORDER BY exam_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) -- cumulative average
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW = "all rows from the start of the group up to this row" → a running/cumulative value (this exact frame is the cumulative-average answer).
The gotcha to bank: state the frame explicitly. Leave it off with an
ORDER BYpresent and you inherit a default frame you didn't intend — so your "running total" quietly isn't running.
LAG() / LEAD() reach the previous/next row (e.g. "days since a user's last event").
Lock it. Window = per-row value over a group, keeps every row. Ties →
ROW_NUMBER(all different) /RANK(gaps) /DENSE_RANK(no gaps). Running total → explicit frameUNBOUNDED PRECEDING → CURRENT ROW.
The dials (skim now; return when a question needs one)
◆ Name the collision: analytic window vs time window
Two different things are both called "window," and Section-3 questions use both:
- Analytic window —
OVER (PARTITION BY …)above; a per-row calc over related rows. - Time window —
window(event_time, "5 minutes"), the tumbling/sliding time bucket from[Structured Streaming & the state model](/lessons/s1-structured-streaming-state/)used in streaming aggregations.
Tell: OVER (...) → analytic window; window(col, "…") inside a groupBy → time window. "Metrics per non-overlapping 10-minute interval per server" is a time window, not OVER. And note the full signature — window(timeCol, windowDuration, slideDuration, startTime): non-overlapping means slideDuration = windowDuration (tumbling), and "offset by 3 minutes" is the startTime argument (window(ts, "10 minutes", "10 minutes", "3 minutes")).
◆ Joins — pick by "which unmatched rows do I keep," plus two you'll forget
Standard INNER / LEFT / RIGHT / FULL OUTER you know. The two people forget, because they're tested:
LEFT ANTI JOIN— left rows with no match on the right. The efficient exclusion filter — "transactions from customers not in the blocked list" — cleaner/faster thanLEFT JOIN … WHERE right IS NULL.LEFT SEMI JOIN— left rows that have a match, without duplicating when the right has many matches. "Orders that have a matching customer," no fan-out.
Recall from [Reading the evidence — Query Profile & Spark UI](/lessons/s6-query-profile/) / [Job & environment configuration — compute and Spark tuning](/lessons/s1-job-env-config/): a broadcast join ships a small dimension to every executor so the big fact isn't shuffled (/*+ BROADCAST(dim) */ or let AQE do it).
◆ Aggregations beyond GROUP BY
ROLLUP— hierarchical subtotals:GROUP BY ROLLUP(year, month, day)→ totals by year, year+month, year+month+day, and grand total. For a natural hierarchy.CUBE— all 2ⁿ combinations of the grouping columns. When every combination is meaningful — but it explodes, so filter.- Higher-order functions (
transform,filter,reduce) operate on array columns element-wise — more efficient thanexplode+ re-collect.
◆ Extracting date parts (a small tested trap)
To get the date from a timestamp: to_date(ts), date(ts), or CAST(ts AS DATE). To get a component (day/month/year number): date_part('day', ts), EXTRACT(DAY FROM ts), dayofmonth(ts). Tell: date_part('day', ts) returns the day number, so it's not "extracting the date" — that's the odd-one-out in a "which is NOT valid for extracting the date" question.
Takeaways (rebuild it from these)
- Window = per-row value over a group, keeping every row (GROUP BY collapses; window adds a column).
PARTITION BY+ORDER BY+ frame. - Ties:
ROW_NUMBER(all different) ·RANK(same, gaps) ·DENSE_RANK(same, no gaps → "same value = same tier"). - Running totals need explicit
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; always state the frame.LAG/LEAD= previous/next row. - Collision: analytic
OVER(...)vs timewindow(col,"5 min")(+startTime= offset,slide = window= non-overlapping). - Joins: ANTI = exclusion, SEMI = "has a match, no dup", broadcast = small dim, no shuffle. ROLLUP = hierarchy, CUBE = all combos; higher-order fns for arrays.
Before you move on — say these without scrolling up
- You need a per-employee tier but must keep every row — why can't GROUP BY, and what does?
- "Same salary → same tier, no skipped numbers" — which ranking function?
- What turns a plain
SUM(...) OVER(...)into a running total — and the gotcha if you omit it? OVER(...)vswindow(col, "5 minutes")— which is which, and where does "offset by 3 minutes" go?
Next: a specific cleansing transformation with a sharp gotcha — deduplication, and why the fix depends on how far apart the duplicates arrive → [Deduplication — distinct, keep-latest, and the streaming state trap](/lessons/s3-dedup/).