Reference lesson. Your code — a UDF from [UDFs — Python vs Pandas, and why the type is everything](/lessons/s1-udfs/), a pipeline transform — needs a package the cluster doesn't have. Installing it seems trivial (%pip install), and that's the trap: the method that's easiest interactively is the one that silently fails in production.
The spine — pick by three questions
Anchor. Choose the install method by scope (this notebook? the whole cluster? this job?), persistence (survives a restart?), and production-safety (versioned and reproducible?).
%pip is dev-only; production dependencies live in the bundle.
Predict before the table: you %pip install a package, the cluster restarts overnight — is it still there? No — %pip is session-scoped and gone on restart. That's why it's never production.
The menu (method → the three questions)
| Method | Scope | Persists a restart? | Use when |
|---|---|---|---|
| %pip install | current notebook session | No, gone on restart | interactive dev only. Never production |
| Cluster library (UI/API) | all notebooks on that cluster | Yes, until detached | a shared dep a whole team's cluster agrees on |
| requirements.txt in a DABs bundle | that bundle's job cluster | Yes, installed before code runs | production jobs: version-pinned, in Git |
| Init script (.sh) | cluster system level | Yes, runs on every startup | private package indices, auth, OS-level deps |
| Unity Catalog Volume | governed central storage | Yes | regulated environments needing an audit trail |
| Python wheel in DABs | bundle level | Yes — built + deployed automatically | your own custom shared code across jobs |
Tell: "production job, reproducible" → requirements.txt in DABs; "our own shared code across jobs" → a wheel; "just testing in a notebook" → %pip (only there).
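"Version-pinned" is the part of the production answer that is easy to check mechanically. A minimal sketch, assuming pinned means an exact `==` version: the helper name `unpinned_requirements` and the sample contents are hypothetical, and the regex is a deliberately simple heuristic, not a full requirements parser (URL or path requirements are reported as unpinned).

```python
import re

# Heuristic: package name, optional [extras], then an exact "==" pin with no
# wildcard, optionally followed by an environment marker after ";".
_PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^\s;*,]+\s*(;.*)?$"
)


def unpinned_requirements(text: str) -> list[str]:
    """Return the requirement lines in `text` that lack an exact == pin."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):
            # Skip blank lines and pip options such as -r or --index-url.
            continue
        if not _PINNED.match(line):
            problems.append(line)
    return problems


# Hypothetical requirements.txt contents, for illustration only.
sample = """\
pandas==2.2.2
requests>=2.0
numpy  # latest
pyarrow==15.0.2
"""
print(unpinned_requirements(sample))
```

A check like this could run in CI before the bundle deploys, so that an unpinned dependency fails review rather than changing versions between runs.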
Wheels — the format for custom code
A wheel (.whl) is the standard packaged-and-installable format for Python code. Not sbt (Scala), not npm (JS), not CRAN (R). Build a wheel, put it in Workspace Files or a UC Volume, install via a cluster library or a bundle's requirements.txt. "How do I distribute my custom Python package?" → wheel.
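A built wheel's filename already states what it contains, which helps when confirming that the file you uploaded is the version you meant to ship. A minimal sketch, assuming the standard wheel naming convention (distribution, version, optional build tag, Python tag, ABI tag, platform tag); the function name and the example filename `my_shared_lib-0.3.1-py3-none-any.whl` are hypothetical.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None  # the build tag is optional and usually absent
    elif len(parts) == 6:
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    return WheelName(dist, version, build, py, abi, plat)


# Hypothetical wheel name, for illustration only.
info = parse_wheel_filename("my_shared_lib-0.3.1-py3-none-any.whl")
print(info.distribution, info.version, info.platform_tag)
```

A `py3-none-any` wheel is pure Python and not tied to a platform, which is the usual case for shared helper code.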
The gotchas the exam loves
- After %pip install, run %restart_python; otherwise the interpreter doesn't pick up the new package even though pip reported success ("why isn't my import working?").
- Two versions of the same package on a cluster → the cluster fails to start: a startup error, not a runtime one.
- Air-gapped (no-internet) workspace: can't pip install from PyPI, so init-script-with-pip fails. Answer: build the wheel, upload to Workspace Files or a UC Volume, install via a cluster library or bundle requirements.txt.
- Freshness (verify against docs): init scripts must live in workspace files or UC Volumes; the old DBFS location is deprecated. A scenario storing an init script on DBFS → that placement is the dated/wrong part.
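The restart gotcha can be diagnosed from inside the notebook. A minimal sketch, assuming the package exposes a `__version__` attribute (many do, not all): it compares the version pip installed on disk with the version of the module already loaded in the running interpreter. The helper names are hypothetical, and the result is a heuristic, not a guarantee.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Version recorded in the installed package metadata, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def needs_restart(dist_name: str, module_name: str) -> bool:
    """True if an already-imported module reports a different version than
    the one installed on disk, i.e. the interpreter still holds the old copy."""
    module = sys.modules.get(module_name)
    if module is None:
        # Not imported in this interpreter yet, so nothing stale is loaded.
        return False
    loaded = getattr(module, "__version__", None)
    on_disk = installed_version(dist_name)
    return loaded is not None and on_disk is not None and loaded != on_disk


# Example: returns None when the distribution is not installed at all.
print(installed_version("surely-not-an-installed-distribution-xyz"))
```

If `needs_restart` returns True for the package you just installed, the import is still bound to the old version and restarting the Python process is the likely fix.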
Where it connects: this is one half of the objective's "configs for environments and dependencies" from [Job & environment configuration — compute and Spark tuning](/lessons/s1-job-env-config/). The production answer (requirements.txt / wheel in the bundle) previews [Python project structure for Databricks Asset Bundles](/lessons/s1-dabs-project/) and [Declarative Automation Bundles — deploying Databricks as code](/lessons/s9-dabs-deploy/).
Recall (say without scrolling up)
- The three questions that pick an install method — name them.
- Production job needs a reproducible dependency — which method?
- Air-gapped workspace, can't reach PyPI — how do you get your package on?
- After %pip install your import fails: what did you forget?