A UDF (User-Defined Function) is your own function applied to a column or group, for when Spark's built-ins can't express the logic. Simple idea — but the exam isn't testing how to write one; it's testing which kind, because the choice can make identical logic run at near-native speed or 100× slower. That gap has one cause, and once you see it the whole topic collapses.

The spine

Beat 1 — prefer built-ins; a UDF is the fallback

Family before mechanics. Spark ships built-in functions (col, when, from_json, date math, sha2, …) that are fastest — they run JVM-native and Spark's optimizer (Catalyst) can see inside them and rearrange them.

Predict: what does Catalyst lose the moment you wrap logic in a UDF?

…

Its x-ray vision. A UDF is a black box the optimizer must run as-is — it can't see in, can't reorder, can't optimize. So rule zero: if a built-in can do it, use the built-in. A UDF is the exception, not the tool.

Beat 2 — the anchor: the JVM↔Python crossing

Here's the one mechanism that explains every performance difference. Spark runs on the JVM; your Python UDF runs in a Python process. So every row a Python UDF touches must be serialized out of the JVM, shipped to Python, computed, and shipped back.

Predict: do that one row at a time across ten million rows — where does the time actually go?

…

Into the serialization, not the work. The crossing dwarfs the computation. That single fact ranks every UDF type:

Anchor. A UDF's speed is decided by how data crosses the JVM↔Python boundary. Row-by-row crossing = slow. Batch crossing via Apache Arrow (vectorization) = fast.

Define the hero: Apache Arrow is a columnar in-memory format that hands Python a whole batch of a column at once, with no per-row serialization. "Vectorized" = "operates on the batch, not row-by-row." Arrow is why Pandas UDFs are fast.

Lock it. Built-ins never leave the JVM (fastest). UDFs cross the boundary — row-by-row (slow) or Arrow-batched (fast). That distinction is the whole lesson.

The dials (skim now; return when a question needs one)

◆ The types, ranked by that crossing

Type	Speed	How it crosses	Use when
Built-in functions	fastest	never leaves the JVM; Catalyst-optimised	always, if they cover the logic
Python UDF	slowest	row-by-row JVM→Python serialisation; black box	legacy code / trivial one-offs only
Pandas UDF — scalar (`@pandas_udf`)	5–100× faster	Arrow ships a whole column batch; pandas Series in, Series out	custom per-value logic Spark lacks
Pandas UDF — grouped map (`applyInPandas`)	batch-per-group	Arrow ships an entire group as one pandas DataFrame	logic needing the whole group at once
SQL UDF	depends	callable inside SQL	when the function must be called from SQL; return type explicit

# SLOW — Python UDF: one row at a time across the boundary
@udf("double")
def to_usd(x): return x * 1.09

# FAST — scalar Pandas UDF: Arrow hands you the whole column batch, vectorised
@pandas_udf("double")
def to_usd(s: pd.Series) -> pd.Series: return s * 1.09

◆ The two exam tells

"This Python UDF is slow — how to speed it up?" → rewrite as a scalar Pandas UDF. Name the mechanism (almost verbatim an answer): it "leverages Apache Arrow to enable vectorized operations between the JVM and Python runtimes, reducing serialization costs."
"The logic needs the whole group" — a 7-day rolling average per store, ranking within a group, per-key state → applyInPandas (grouped map). Each group arrives as one pandas DataFrame. Recall the state idea from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/): a running calculation per group is exactly groupBy(...).applyInPandas(fn, schema). (It triggers a shuffle to gather each group, unless already partitioned by that key.)

Gotcha: a Pandas UDF must return the exact type declared in @pandas_udf(returnType) — a mismatch is a runtime error. Same for a registered SQL UDF: return type must be explicit.

◆ When a built-in beats any UDF (the "trick" answer)

Many things people write UDFs for are already built-ins. Ranking within a group, running totals, "latest per key" → window functions (row_number, rank, sum() over …), native and Catalyst-optimised. So before reaching for applyInPandas, ask whether a window function does it — usually faster and simpler ([Advanced transformations — window functions, joins, aggregations](/lessons/s3-transformations/)).

Takeaways (rebuild it from these)

Prefer built-ins (JVM-native, Catalyst optimises them). A UDF is a black box — the fallback only.
The cost is the JVM↔Python crossing; Arrow vectorization is the cure. That distinction ranks everything.
Python UDF = row-by-row, slowest. Scalar Pandas UDF (@pandas_udf) = Arrow batch, 5–100× faster → the fix for a slow Python UDF. applyInPandas = whole group as a DataFrame → group-level / stateful-per-key logic. SQL UDF = callable from SQL, explicit return type.
Pandas UDFs must return the declared type exactly.
Before a grouped UDF, check whether a window function already does it natively.

Before you move on — say these without scrolling up

What does Catalyst lose when you use a UDF instead of a built-in?
Where does the time actually go in a slow Python UDF — and what fixes it, by what mechanism?
Logic needs the whole group at once — which UDF?
Before writing a grouped UDF for "rank within a group," what should you check first?

Next: getting the packages your code (UDFs included) depends on onto the cluster, safely and reproducibly → [Managing third-party libraries](/lessons/s1-third-party-libs/).

UDFs — Python vs Pandas, and why the type is everything