A UDF (User-Defined Function) is your own function applied to a column or group, for when Spark's built-ins can't express the logic. Simple idea — but the exam isn't testing how to write one; it's testing which kind, because the choice can make identical logic run at near-native speed or 100× slower. That gap has one cause, and once you see it the whole topic collapses.
The spine
Beat 1 — prefer built-ins; a UDF is the fallback
Family before mechanics. Spark ships built-in functions (col, when, from_json, date math, sha2, …) that are fastest — they run JVM-native and Spark's optimizer (Catalyst) can see inside them and rearrange them.
Predict: what does Catalyst lose the moment you wrap logic in a UDF?
…
Its x-ray vision. A UDF is a black box the optimizer must run as-is — it can't see in, can't reorder, can't optimize. So rule zero: if a built-in can do it, use the built-in. A UDF is the exception, not the tool.
Beat 2 — the anchor: the JVM↔Python crossing
Here's the one mechanism that explains every performance difference. Spark runs on the JVM; your Python UDF runs in a Python process. So every row a Python UDF touches must be serialized out of the JVM, shipped to Python, computed, and shipped back.
Predict: do that one row at a time across ten million rows — where does the time actually go?
…
Into the serialization, not the work. The crossing dwarfs the computation. That single fact ranks every UDF type:
Anchor. A UDF's speed is decided by how data crosses the JVM↔Python boundary. Row-by-row crossing = slow. Batch crossing via Apache Arrow (vectorization) = fast.
Define the hero: Apache Arrow is a columnar in-memory format that hands Python a whole batch of a column at once, with no per-row serialization. "Vectorized" = "operates on the batch, not row-by-row." Arrow is why Pandas UDFs are fast.
Lock it. Built-ins never leave the JVM (fastest). UDFs cross the boundary — row-by-row (slow) or Arrow-batched (fast). That distinction is the whole lesson.
The dials (skim now; return when a question needs one)
◆ The types, ranked by that crossing
| Type | Speed | How it crosses | Use when |
|---|---|---|---|
| Built-in functions | fastest | never leaves the JVM; Catalyst-optimised | always, if they cover the logic |
| Python UDF | slowest | row-by-row JVM→Python serialisation; black box | legacy code / trivial one-offs only |
Pandas UDF — scalar (@pandas_udf) | 5–100× faster | Arrow ships a whole column batch; pandas Series in, Series out | custom per-value logic Spark lacks |
Pandas UDF — grouped map (applyInPandas) | batch-per-group | Arrow ships an entire group as one pandas DataFrame | logic needing the whole group at once |
| SQL UDF | depends | callable inside SQL | when the function must be called from SQL; return type explicit |
# SLOW — Python UDF: one row at a time across the boundary
@udf("double")
def to_usd(x): return x * 1.09
# FAST — scalar Pandas UDF: Arrow hands you the whole column batch, vectorised
@pandas_udf("double")
def to_usd(s: pd.Series) -> pd.Series: return s * 1.09
◆ The two exam tells
- "This Python UDF is slow — how to speed it up?" → rewrite as a scalar Pandas UDF. Name the mechanism (almost verbatim an answer): it "leverages Apache Arrow to enable vectorized operations between the JVM and Python runtimes, reducing serialization costs."
- "The logic needs the whole group" — a 7-day rolling average per store, ranking within a group, per-key state →
applyInPandas(grouped map). Each group arrives as one pandas DataFrame. Recall the state idea from[Structured Streaming & the state model](/lessons/s1-structured-streaming-state/): a running calculation per group is exactlygroupBy(...).applyInPandas(fn, schema). (It triggers a shuffle to gather each group, unless already partitioned by that key.)
Gotcha: a Pandas UDF must return the exact type declared in @pandas_udf(returnType) — a mismatch is a runtime error. Same for a registered SQL UDF: return type must be explicit.
◆ When a built-in beats any UDF (the "trick" answer)
Many things people write UDFs for are already built-ins. Ranking within a group, running totals, "latest per key" → window functions (row_number, rank, sum() over …), native and Catalyst-optimised. So before reaching for applyInPandas, ask whether a window function does it — usually faster and simpler ([Advanced transformations — window functions, joins, aggregations](/lessons/s3-transformations/)).
Takeaways (rebuild it from these)
- Prefer built-ins (JVM-native, Catalyst optimises them). A UDF is a black box — the fallback only.
- The cost is the JVM↔Python crossing; Arrow vectorization is the cure. That distinction ranks everything.
- Python UDF = row-by-row, slowest. Scalar Pandas UDF (
@pandas_udf) = Arrow batch, 5–100× faster → the fix for a slow Python UDF.applyInPandas= whole group as a DataFrame → group-level / stateful-per-key logic. SQL UDF = callable from SQL, explicit return type. - Pandas UDFs must return the declared type exactly.
- Before a grouped UDF, check whether a window function already does it natively.
Before you move on — say these without scrolling up
- What does Catalyst lose when you use a UDF instead of a built-in?
- Where does the time actually go in a slow Python UDF — and what fixes it, by what mechanism?
- Logic needs the whole group at once — which UDF?
- Before writing a grouped UDF for "rank within a group," what should you check first?
Next: getting the packages your code (UDFs included) depends on onto the cluster, safely and reproducibly → [Managing third-party libraries](/lessons/s1-third-party-libs/).