Unit & integration testing on Databricks

Develop unit and integration tests using assertDataFrameEqual, assertSchemaEqual, DataFrame.transform, and testing frameworks, to ensure code correctness, including a built-in debugger.

The src/ package from [Python project structure for Databricks Asset Bundles](/lessons/s1-dabs-project/) is only trustworthy if it's tested. And here's the motivating truth the exam circles: a pipeline that crashes is annoying; a pipeline that silently produces subtly wrong numbers is catastrophic — it destroys trust without anyone noticing. Testing catches the silent kind before production.


The spine

Beat 1 — the anchor: testable by design first

Predict: can you cleanly unit-test a 200-line notebook cell that reads, joins, aggregates, and writes all at once?

No — there's no seam to grab. So the real move is to write code that's testable by construction, and the tool the objective names is DataFrame.transform:

Anchor. Write each transformation as a pure function df → df, and chain them with .transform(). Now each function can be tested in complete isolation — which is what makes everything below possible.

from pyspark.sql.functions import col, lower

def normalize_email(df):
    return df.withColumn("email", lower(col("email")))

def add_fraud_flag(df):
    return df.withColumn("is_fraud", col("amount") > 50000)

# chained, and each piece independently testable
result = raw.transform(normalize_email).transform(add_fraud_flag)

That's why DataFrame.transform shows up as a "what's the advantage?" question — the advantage is modular, composable, testable transformations, not a performance trick.

Lock it. Pure df→df functions chained with .transform() = the seam that makes testing possible.


The dials (skim now; return when a question needs one)

◆ The three layers

| Layer | Speed | Runs on | Tests |
|---|---|---|---|
| Unit | milliseconds | local Spark (no cluster) | one function in isolation |
| Integration | minutes | a real cluster | interactions between subsystems / the full pipeline |
| CI/CD gate | per push | CI runner (local Spark) | runs unit tests automatically, blocks a broken merge |
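
Keeping a unit test in the millisecond, no-cluster tier generally means it cannot call a real external service. One way to do that is sketched below with the standard library's unittest.mock. Here `convert_to_usd` and `fetch_rate` are hypothetical names invented for this example, not part of the lesson's package.

```python
from unittest.mock import Mock


def convert_to_usd(amount, currency, fetch_rate):
    # fetch_rate is passed in so a test can substitute a fake;
    # in production it might wrap a network call (hypothetical).
    rate = fetch_rate(currency)
    return round(amount * rate, 2)


def test_convert_to_usd_uses_mocked_rate():
    fake_fetch = Mock(return_value=1.25)  # stands in for the external call
    assert convert_to_usd(100, "EUR", fake_fetch) == 125.0
    # The mock also records how it was called.
    fake_fetch.assert_called_once_with("EUR")
```

Injecting the dependency as an argument is a design choice made here for simplicity. Patching a module attribute with unittest.mock.patch is another common route.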

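Two distinctions the exam draws sharply: integration tests are not tests of individual functions (that is unit testing), and they are not validation of a full use case (that is system testing). Integration sits between the two and checks how components work together.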

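◆ The tools (define each)

assertDataFrameEqual: compares the rows and values of two DataFrames; row order is ignored unless you pass checkRowOrder=True.

assertSchemaEqual: compares schemas only, never the data.

Local Spark: a SparkSession built with master("local[*]"), so unit tests need no cluster.

Mocks: stand-ins for external calls, so a unit test stays isolated.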

◆ Where tests live, and how they gate

Define test functions in Files in Git Folders (formerly "Files in Repos") — separate from notebook code, importable. Then the CI/CD gate: a stage runs pytest tests/ on a fresh runner using local Spark; any failed assertion stops deployment. Unit tests run on every push; integration tests less often, on real Databricks infra. (Git + CI/CD wiring is [Git Folders & CI/CD — version control inside the workspace](/lessons/s9-git-cicd/).)

Takeaways (rebuild it from these)

  1. Testable-by-design first: transformations as pure df→df functions, chained with DataFrame.transform (modular, composable, testable — that's its point).
  2. Unit (isolated functions, local Spark, ms) → Integration (subsystem interactions, real cluster) → CI gate (auto-run on push, blocks broken merges). Integration ≠ individual functions (unit) and ≠ full use case (system).
  3. assertDataFrameEqual (rows+values; row order ignored by default, pass checkRowOrder=True when order matters) · assertSchemaEqual (schema only) · local Spark (master("local[*]")) · mock external calls.
  4. Test code lives in Git Folders; the CI stage runs pytest and blocks deploy on failure.

Before you move on — say these without scrolling up

  1. Why can't you cleanly unit-test one big notebook cell — and what's the fix?
  2. DataFrame.transform's real advantage — is it speed?
  3. "Validates how components work together" — unit, integration, or system?
  4. assertDataFrameEqual vs assertSchemaEqual — what does each check?

Next in Section 1 — two tight reference cards: reading job parameters + secrets ([Reference card — job parameters & secrets in notebooks](/lessons/s1-job-params-widgets/)), and the control-flow operators ([Reference card — pipeline control-flow operators](/lessons/s1-control-flow/)).
