Developing Code for Data Processing

Python project structure for Databricks Asset Bundles

Design and implement a scalable Python project structure optimized for Databricks Asset Bundles (DABs), enabling modular development, deployment automation, and CI/CD integration.

You've built pipelines ([Lakeflow Spark Declarative Pipelines](/lessons/s1-lakeflow-sdp/)), jobs ([Jobs & orchestration — multi-task, dependencies, control flow](/lessons/s1-jobs-orchestration/)), and their dependencies ([Managing third-party libraries](/lessons/s1-third-party-libs/)). This lesson packages all of it as code in one versioned project so it deploys the same way to dev and prod — the foundation of CI/CD.


The spine

Beat 1 — the pain, then the anchor

Naming caveat up front (renamed-product trap): current docs call this Declarative Automation Bundles; the Nov-2025 exam guide and questions still say Databricks Asset Bundles (DABs) — I'll say DABs to match.

Predict: without DABs, your job/pipeline config lives in the UI, hand-clicked per environment. What goes wrong across dev, uat, prod?

Configuration drift (dev and prod quietly diverge), no version control on the config, and manual deployment mistakes. DABs fixes all three by making your Databricks resources — jobs, pipelines, clusters — infrastructure as code:

Anchor. Define your Databricks resources once, as code, in a versioned bundle; deploy that same definition to any environment, with only the per-environment values swapped. One source of truth, many targets.

Beat 2 — how one definition serves many environments

The mechanism is targets + variable substitution. A target is a named environment (dev, uat, prod) with its own workspace URL, catalog, cluster size, and identity. Variable substitution does the swapping: ${var.catalog} resolves to dev_catalog under the dev target and to prod_catalog under prod.

Predict: so how do you deploy to prod without copy-pasting a whole second config?

You don't copy anything — the one definition stays fixed; only the target's values swap in. That's the anchor made concrete: same recipe, different ingredients per kitchen.

Lock it. One versioned definition + per-target values (${var.…}) = one source of truth deployed to many environments. Prod runs as a service principal (recall [Jobs & orchestration — multi-task, dependencies, control flow](/lessons/s1-jobs-orchestration/)).
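
To make the swap concrete, here is a toy resolver in Python. It is only a sketch of the idea, not how the Databricks CLI implements substitution. The TARGETS dictionary, the resolve helper, and the sales.orders table name are all made up for illustration; only ${var.catalog}, dev_catalog, and prod_catalog come from the lesson.

```python
import re

# Hypothetical per-target values (names mirror the lesson's example).
TARGETS = {
    "dev": {"catalog": "dev_catalog"},
    "prod": {"catalog": "prod_catalog"},
}

# One shared definition; "sales.orders" is an invented table name.
DEFINITION = {
    "name": "ingest_job",
    "table": "${var.catalog}.sales.orders",
}

# Matches placeholders of the form ${var.<name>}.
_VAR = re.compile(r"\$\{var\.(\w+)\}")


def resolve(definition, target):
    """Return a copy of `definition` with ${var.x} filled in for `target`."""
    values = TARGETS[target]

    def substitute(match):
        # Look up the variable name captured inside ${var.<name>}.
        return values[match.group(1)]

    return {key: _VAR.sub(substitute, text) for key, text in definition.items()}


if __name__ == "__main__":
    for target in ("dev", "prod"):
        print(target, "->", resolve(DEFINITION, target)["table"])
```

The definition never changes between the two calls; only the target name does, which is the whole point of the anchor.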


The dials (skim now; return when a question needs one)

◆ The project shape

my_bundle/
├── databricks.yml            ← the ONE root config: bundle name, targets, includes
├── resources/                ← one YAML per job/pipeline (separate files avoid merge conflicts)
│    ├── ingest_job.yml
│    └── etl_pipeline.yml
├── src/                      ← your Python package (modular, importable, testable)
│    └── my_pkg/…
└── requirements.txt / *.whl  ← dependencies (from [Managing third-party libraries](/lessons/s1-third-party-libs/))

◆ The src/ package and sys.path

Keep transformation logic in an importable Python package under src/ (built into a wheel per [Managing third-party libraries](/lessons/s1-third-party-libs/)), not pasted into notebooks — so it's modular and unit-testable ([Unit & integration testing on Databricks](/lessons/s1-testing/)). That raises the one Python-internals fact the exam asks directly:

sys.path = the list of directories Python searches when you import a module. To inspect it: import sys; print(sys.path).

For import my_pkg to work, the package's location must be on sys.path — which a good bundle handles by installing your wheel (it lands on the path) rather than relying on fragile relative paths. "Which variable lists the directories searched for modules?" → sys.path.
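
A small experiment shows why the location matters. This sketch builds a throwaway src/my_pkg in a temporary directory and tries the import before and after that src/ directory is on sys.path. The can_import and demo helpers are hypothetical names, and the manual sys.path edit is here only to expose the mechanism; in a bundle, installing the wheel is what puts the package on the path.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def can_import(name):
    """Return True if `name` is importable with the current sys.path."""
    sys.modules.pop(name, None)      # forget any earlier import of this name
    importlib.invalidate_caches()    # make sure new files on disk are seen
    try:
        importlib.import_module(name)
        return True
    except ModuleNotFoundError:
        return False


def demo(name="my_pkg"):
    """Try importing a throwaway package before and after a sys.path change."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / "src" / name
        pkg_dir.mkdir(parents=True)
        (pkg_dir / "__init__.py").write_text("GREETING = 'hello'\n")

        src = str(pkg_dir.parent)
        before = can_import(name)    # src/ is not on sys.path yet
        sys.path.insert(0, src)
        try:
            after = can_import(name)  # now Python searches src/ as well
        finally:
            # Clean up so the demo leaves the interpreter as it found it.
            sys.path.remove(src)
            sys.modules.pop(name, None)
    return before, after


if __name__ == "__main__":
    print(demo())
```

Assuming no other my_pkg is installed in the environment, the first attempt fails and the second succeeds, even though the files on disk are identical both times.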

◆ The four CLI commands (and the CI/CD order)

| Command | Does | When |
| --- | --- | --- |
| databricks bundle init | scaffolds a new project (like git init) | once, at project start |
| databricks bundle validate | checks the YAML, touches no workspace | before every deploy |
| databricks bundle deploy -t dev | creates/updates resources in the target | on every change |
| databricks bundle run <resource> -t dev | triggers a job/pipeline to verify | after deploy |

Tell: the CI/CD sequence is validate → deploy → run, not init, because init is a one-time developer action, not part of an automated pipeline. (Deploy mechanics are [Declarative Automation Bundles — deploying Databricks as code](/lessons/s9-dabs-deploy/); the Git side is [Git Folders & CI/CD — version control inside the workspace](/lessons/s9-git-cicd/).)
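
As a sketch of what a CI step might run, the helper below builds the three commands in order for a given target. It only constructs and prints them; nothing is executed, and a real pipeline would more likely be plain shell steps. The ci_commands name and the ingest_job resource key are assumptions for illustration (the key is borrowed from the example file name in the project shape above).

```python
import shlex


def ci_commands(target, resource):
    """Return the validate -> deploy -> run commands for one target.

    `init` is deliberately absent: it scaffolds a project once and is
    not part of an automated pipeline.
    """
    return [
        ["databricks", "bundle", "validate"],
        ["databricks", "bundle", "deploy", "-t", target],
        ["databricks", "bundle", "run", resource, "-t", target],
    ]


if __name__ == "__main__":
    # Print the commands a CI job would run, in order, without running them.
    for command in ci_commands("dev", "ingest_job"):
        print(shlex.join(command))
```

Keeping each command as a list of arguments means it could be handed to subprocess.run(..., check=True) later, so a failed validate stops the pipeline before deploy.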

Takeaways (rebuild it from these)

  1. DABs (current docs: Declarative Automation Bundles) = infrastructure as code: one versioned definition, deployed per-target.
  2. databricks.yml (one, at root) + resources/ (one YAML per job) + targets + ${var.…} substitution + mode: production (service principal).
  3. Logic lives in an importable src/ package (a wheel), which is why sys.path matters and what makes the code unit-testable.
  4. CLI: init (once) · validate (no workspace) · deploy -t · run -t. CI/CD = validate → deploy → run (never init).

Before you move on — say these without scrolling up

  1. Three things that go wrong without DABs — and the one idea that fixes all three.
  2. How does one definition deploy to prod without copy-pasting config?
  3. sys.path — what is it, and why does a bundle care?
  4. The CI/CD command sequence — and which command is not in it?

Next: how you prove that src/ package is correct before it ships → [Unit & integration testing on Databricks](/lessons/s1-testing/).
