The src/ package from [Python project structure for Databricks Asset Bundles](/lessons/s1-dabs-project/) is only trustworthy if it's tested. And here's the motivating truth the exam circles: a pipeline that crashes is annoying; a pipeline that silently produces subtly wrong numbers is catastrophic — it destroys trust without anyone noticing. Testing catches the silent kind before production.
The spine
Beat 1 — the anchor: testable by design first
Predict: can you cleanly unit-test a 200-line notebook cell that reads, joins, aggregates, and writes all at once?
…
No — there's no seam to grab. So the real move is to write code that's testable by construction, and the tool the objective names is DataFrame.transform:
Anchor. Write each transformation as a pure function `df → df`, and chain them with `.transform()`. Now each function can be tested in complete isolation, which is what makes everything below possible.
from pyspark.sql.functions import col, lower

# each step is a pure df -> df function; lower() is the column function for lower-casing
def normalize_email(df): return df.withColumn("email", lower(col("email")))
def add_fraud_flag(df): return df.withColumn("is_fraud", col("amount") > 50000)
# chained, and each piece independently testable
result = raw.transform(normalize_email).transform(add_fraud_flag)
That's why DataFrame.transform shows up as a "what's the advantage?" question — the advantage is modular, composable, testable transformations, not a performance trick.
Lock it. Pure `df → df` functions chained with `.transform()` = the seam that makes testing possible.
The dials (skim now; return when a question needs one)
◆ The three layers
| Layer | Speed | Runs on | Tests |
|---|---|---|---|
| Unit | milliseconds | local Spark (no cluster) | one function in isolation |
| Integration | minutes | a real cluster | interactions between subsystems / the full pipeline |
| CI/CD gate | per push | CI runner (local Spark) | runs unit tests automatically, blocks a broken merge |
Two distinctions the exam draws sharply:
- Unit testing = individual functions in isolation — payoff: easy troubleshooting, each step checked alone.
- Integration testing = interactions between subsystems — not individual functions (that's unit), not a complete end-to-end use case (that's system). "Validates how components work together" → integration.
◆ The tools (define each)
- Local Spark session: `SparkSession.builder.master("local[*]")` spins up Spark inside the test process, no cluster. `local[*]` uses all cores; plain `local` is single-threaded, which keeps test runs repeatable.
- `assertDataFrameEqual(result, expected)`: the core assertion, rows and values match. Row order is ignored by default; pass `checkRowOrder=True` when the order itself is part of what you are testing.
- `assertSchemaEqual`: checks only the schema (column names and types; nullability handling varies by PySpark version), not data. Use when you control shape but not exact values.
- Mocking: replace external calls in a unit test, e.g. `with patch("dbutils.secrets") as m: m.get.return_value = "test"`, so no real secret scope is hit. The patch target here is schematic; `patch` needs the importable dotted path where your code looks up `dbutils`.
- Built-in debugger: the notebook/editor step-debugger for a failing transform (the objective calls it out).
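The mocking idea can be sketched with the standard library alone. `get_api_token` is a hypothetical helper, and passing `dbutils` in as an argument is an assumption of this sketch (in a notebook it is a global); the scope and key names are invented. `dbutils.secrets.get(scope, key)` is the real call being stood in for.

```python
from unittest.mock import MagicMock


def get_api_token(dbutils):
    # Hypothetical helper: reads a secret through whatever dbutils it is given.
    return dbutils.secrets.get(scope="demo-scope", key="api-token")


# In a unit test, a MagicMock stands in for the real dbutils object.
fake_dbutils = MagicMock()
fake_dbutils.secrets.get.return_value = "test"

token = get_api_token(fake_dbutils)

assert token == "test"
# Verify the code asked for the expected secret, without touching a real scope.
fake_dbutils.secrets.get.assert_called_once_with(scope="demo-scope", key="api-token")
```

Passing the dependency in keeps the function testable without any patching; `patch` does the same job when the code reaches for a module-level name instead.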
◆ Where tests live, and how they gate
Define test functions in Files in Git Folders (formerly "Files in Repos") — separate from notebook code, importable. Then the CI/CD gate: a stage runs pytest tests/ on a fresh runner using local Spark; any failed assertion stops deployment. Unit tests run on every push; integration tests less often, on real Databricks infra. (Git + CI/CD wiring is [Git Folders & CI/CD — version control inside the workspace](/lessons/s9-git-cicd/).)
Takeaways (rebuild it from these)
- Testable-by-design first: transformations as pure `df → df` functions, chained with `DataFrame.transform` (modular, composable, testable: that's its point).
- Unit (isolated functions, local Spark, ms) → Integration (subsystem interactions, real cluster) → CI gate (auto-run on push, blocks broken merges). Integration ≠ individual functions (unit) and ≠ full use case (system).
- `assertDataFrameEqual` (rows + values; row order ignored unless `checkRowOrder=True`) · `assertSchemaEqual` (schema only) · local Spark (`master("local[*]")`) · mock external calls.
- Test code lives in Git Folders; the CI stage runs `pytest` and blocks deploy on failure.
Before you move on — say these without scrolling up
- Why can't you cleanly unit-test one big notebook cell, and what's the fix?
- `DataFrame.transform`'s real advantage: is it speed?
- "Validates how components work together": unit, integration, or system?
- `assertDataFrameEqual` vs `assertSchemaEqual`: what does each check?
Next in Section 1 — two tight reference cards: reading job parameters + secrets ([Reference card — job parameters & secrets in notebooks](/lessons/s1-job-params-widgets/)), and the control-flow operators ([Reference card — pipeline control-flow operators](/lessons/s1-control-flow/)).