Lessons

Developing Code for Data Processing

UDFs — Python vs Pandas, and why the type is everything

Develop User-Defined Functions (UDFs) using Pandas/Python UDF.

A UDF (User-Defined Function) is your own function applied to a column or group, for when Spark's built-ins can't express the logic. Simple idea — but the exam isn't testing how to write one; it's testing which kind, because the choice can make identical logic run at near-native speed or 100× slower. That gap has one cause, and once you see it the whole topic collapses.


The spine

Beat 1 — prefer built-ins; a UDF is the fallback

Family before mechanics. Spark ships built-in functions (col, when, from_json, date math, sha2, …) that are fastest — they run JVM-native and Spark's optimizer (Catalyst) can see inside them and rearrange them.

Predict: what does Catalyst lose the moment you wrap logic in a UDF?

Its x-ray vision. A UDF is a black box the optimizer must run as-is — it can't see in, can't reorder, can't optimize. So rule zero: if a built-in can do it, use the built-in. A UDF is the exception, not the tool.

Beat 2 — the anchor: the JVM↔Python crossing

Here's the one mechanism that explains every performance difference. Spark runs on the JVM; your Python UDF runs in a Python process. So every row a Python UDF touches must be serialized out of the JVM, shipped to Python, computed, and shipped back.

Predict: do that one row at a time across ten million rows — where does the time actually go?

Into the serialization, not the work. The crossing dwarfs the computation. That single fact ranks every UDF type:

Anchor. A UDF's speed is decided by how data crosses the JVM↔Python boundary. Row-by-row crossing = slow. Batch crossing via Apache Arrow (vectorization) = fast.

Define the hero: Apache Arrow is a columnar in-memory format that hands Python a whole batch of a column at once, with no per-row serialization. "Vectorized" = "operates on the batch, not row-by-row." Arrow is why Pandas UDFs are fast.

Lock it. Built-ins never leave the JVM (fastest). UDFs cross the boundary — row-by-row (slow) or Arrow-batched (fast). That distinction is the whole lesson.


The dials (skim now; return when a question needs one)

◆ The types, ranked by that crossing

TypeSpeedHow it crossesUse when
Built-in functionsfastestnever leaves the JVM; Catalyst-optimisedalways, if they cover the logic
Python UDFslowestrow-by-row JVM→Python serialisation; black boxlegacy code / trivial one-offs only
Pandas UDF — scalar (@pandas_udf)5–100× fasterArrow ships a whole column batch; pandas Series in, Series outcustom per-value logic Spark lacks
Pandas UDF — grouped map (applyInPandas)batch-per-groupArrow ships an entire group as one pandas DataFramelogic needing the whole group at once
SQL UDFdependscallable inside SQLwhen the function must be called from SQL; return type explicit
# SLOW — Python UDF: one row at a time across the boundary
@udf("double")
def to_usd(x): return x * 1.09

# FAST — scalar Pandas UDF: Arrow hands you the whole column batch, vectorised
@pandas_udf("double")
def to_usd(s: pd.Series) -> pd.Series: return s * 1.09

◆ The two exam tells

  1. "This Python UDF is slow — how to speed it up?" → rewrite as a scalar Pandas UDF. Name the mechanism (almost verbatim an answer): it "leverages Apache Arrow to enable vectorized operations between the JVM and Python runtimes, reducing serialization costs."
  2. "The logic needs the whole group" — a 7-day rolling average per store, ranking within a group, per-key state → applyInPandas (grouped map). Each group arrives as one pandas DataFrame. Recall the state idea from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/): a running calculation per group is exactly groupBy(...).applyInPandas(fn, schema). (It triggers a shuffle to gather each group, unless already partitioned by that key.)

Gotcha: a Pandas UDF must return the exact type declared in @pandas_udf(returnType) — a mismatch is a runtime error. Same for a registered SQL UDF: return type must be explicit.

◆ When a built-in beats any UDF (the "trick" answer)

Many things people write UDFs for are already built-ins. Ranking within a group, running totals, "latest per key" → window functions (row_number, rank, sum() over …), native and Catalyst-optimised. So before reaching for applyInPandas, ask whether a window function does it — usually faster and simpler ([Advanced transformations — window functions, joins, aggregations](/lessons/s3-transformations/)).

Takeaways (rebuild it from these)

  1. Prefer built-ins (JVM-native, Catalyst optimises them). A UDF is a black box — the fallback only.
  2. The cost is the JVM↔Python crossing; Arrow vectorization is the cure. That distinction ranks everything.
  3. Python UDF = row-by-row, slowest. Scalar Pandas UDF (@pandas_udf) = Arrow batch, 5–100× faster → the fix for a slow Python UDF. applyInPandas = whole group as a DataFrame → group-level / stateful-per-key logic. SQL UDF = callable from SQL, explicit return type.
  4. Pandas UDFs must return the declared type exactly.
  5. Before a grouped UDF, check whether a window function already does it natively.

Before you move on — say these without scrolling up

  1. What does Catalyst lose when you use a UDF instead of a built-in?
  2. Where does the time actually go in a slow Python UDF — and what fixes it, by what mechanism?
  3. Logic needs the whole group at once — which UDF?
  4. Before writing a grouped UDF for "rank within a group," what should you check first?

Next: getting the packages your code (UDFs included) depends on onto the cluster, safely and reproducibly → [Managing third-party libraries](/lessons/s1-third-party-libs/).

Prerequisites

Leads to