Develop unit and integration tests using assertDataFrameEqual, assertSchemaEqual, DataFrame.transform, and testing frameworks, to ensure code correctness, including a built-in debugger.

The src/ package from [Python project structure for Databricks Asset Bundles](/lessons/s1-dabs-project/) is only trustworthy if it's tested. And here's the motivating truth the exam circles: a pipeline that crashes is annoying; a pipeline that silently produces subtly wrong numbers is catastrophic — it destroys trust without anyone noticing. Testing catches the silent kind before production.

The spine

Beat 1 — the anchor: testable by design first

Predict: can you cleanly unit-test a 200-line notebook cell that reads, joins, aggregates, and writes all at once?

…

No — there's no seam to grab. So the real move is to write code that's testable by construction, and the tool the objective names is DataFrame.transform:

Anchor. Write each transformation as a pure function df → df, and chain them with .transform(). Now each function can be tested in complete isolation — which is what makes everything below possible.

from pyspark.sql.functions import col, lower

# Column has no .lower() method; lower() comes from pyspark.sql.functions
def normalize_email(df):    return df.withColumn("email", lower(col("email")))
def add_fraud_flag(df):     return df.withColumn("is_fraud", col("amount") > 50000)

# chained, and each piece independently testable
result = raw.transform(normalize_email).transform(add_fraud_flag)

That's why DataFrame.transform shows up as a "what's the advantage?" question — the advantage is modular, composable, testable transformations, not a performance trick.

Lock it. Pure df→df functions chained with .transform() = the seam that makes testing possible.

The dials (skim now; return when a question needs one)

◆ The three layers

Layer	Speed	Runs on	Tests
Unit	milliseconds	local Spark (no cluster)	one function in isolation
Integration	minutes	a real cluster	interactions between subsystems / the full pipeline
CI/CD gate	per push	CI runner (local Spark)	runs unit tests automatically, blocks a broken merge

Two distinctions the exam draws sharply:

Unit testing = individual functions in isolation — payoff: easy troubleshooting, each step checked alone.
Integration testing = interactions between subsystems — not individual functions (that's unit), not a complete end-to-end use case (that's system). "Validates how components work together" → integration.

◆ The tools (define each)

Local Spark session — SparkSession.builder.master("local[*]").getOrCreate() spins up Spark inside the test process, no cluster. local[*] uses all cores; plain local runs a single worker thread, which removes parallelism as a source of run-to-run variation (useful for repeatable tests).
assertDataFrameEqual(result, expected) — the core assertion: schema, rows, and values match. Row order is ignored by default (checkRowOrder=False); pass checkRowOrder=True when the order itself is part of what you are testing.
assertSchemaEqual — checks only the schema (names, types, nullability), not data. Use when you control shape but not exact values.
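Mocking — replace external calls in a unit test with a stand-in, e.g. a unittest.mock.MagicMock passed in place of dbutils with secrets.get.return_value = "test", so no real secret scope is hit. If you use patch(), its target must be an importable dotted path to where your code looks the name up; a bare patch("dbutils.secrets") typically fails in a local test run because there is usually no importable module named dbutils.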
Built-in debugger — the notebook/editor step-debugger for a failing transform (the objective calls it out).
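
The mocking entry above can be sketched with nothing but the standard library. The helper build_jdbc_url, the scope and key names, and the host are all hypothetical; the one design assumption is that the code under test receives dbutils as an argument, so the test can hand it a fake.

```python
from unittest.mock import MagicMock


def build_jdbc_url(dbutils, host):
    """Hypothetical helper: reads a password from a secret scope and builds a URL."""
    password = dbutils.secrets.get(scope="prod-scope", key="db-password")
    return f"jdbc:postgresql://{host}/sales?password={password}"


def test_build_jdbc_url_uses_secret():
    # MagicMock creates .secrets.get on demand; we only pin its return value
    fake_dbutils = MagicMock()
    fake_dbutils.secrets.get.return_value = "test"

    url = build_jdbc_url(fake_dbutils, "db.example.com")

    # The fake value flows through, and no real secret scope was contacted
    assert url == "jdbc:postgresql://db.example.com/sales?password=test"
    fake_dbutils.secrets.get.assert_called_once_with(
        scope="prod-scope", key="db-password"
    )
```

Passing the dependency in, rather than reaching for a notebook global, keeps the function testable on a laptop or CI runner where dbutils does not exist.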

◆ Where tests live, and how they gate

Define test functions in Files in Git Folders (formerly "Files in Repos") — separate from notebook code, importable. Then the CI/CD gate: a stage runs pytest tests/ on a fresh runner using local Spark; any failed assertion stops deployment. Unit tests run on every push; integration tests less often, on real Databricks infra. (Git + CI/CD wiring is [Git Folders & CI/CD — version control inside the workspace](/lessons/s9-git-cicd/).)

Takeaways (rebuild it from these)

Testable-by-design first: transformations as pure df→df functions, chained with DataFrame.transform (modular, composable, testable — that's its point).
Unit (isolated functions, local Spark, ms) → Integration (subsystem interactions, real cluster) → CI gate (auto-run on push, blocks broken merges). Integration ≠ individual functions (unit) and ≠ full use case (system).
assertDataFrameEqual (schema+rows+values; row order ignored unless checkRowOrder=True) · assertSchemaEqual (schema only) · local Spark (master("local[*]")) · mock external calls.
Test code lives in Git Folders; the CI stage runs pytest and blocks deploy on failure.

Before you move on — say these without scrolling up

Why can't you cleanly unit-test one big notebook cell — and what's the fix?
DataFrame.transform's real advantage — is it speed?
"Validates how components work together" — unit, integration, or system?
assertDataFrameEqual vs assertSchemaEqual — what does each check?

Next in Section 1 — two tight reference cards: reading job parameters + secrets ([Reference card — job parameters & secrets in notebooks](/lessons/s1-job-params-widgets/)), and the control-flow operators ([Reference card — pipeline control-flow operators](/lessons/s1-control-flow/)).

Unit & integration testing on Databricks