Pass Databricks DE Pro

Manage and troubleshoot external third-party library installations and dependencies in Databricks, including PyPI packages, local wheels, and source archives.

Reference lesson. Your code — a UDF from [UDFs — Python vs Pandas, and why the type is everything](/lessons/s1-udfs/), a pipeline transform — needs a package the cluster doesn't have. Installing it seems trivial (%pip install), and that's the trap: the method that's easiest interactively is the one that silently fails in production.

The spine — pick by three questions

Anchor. Choose the install method by scope (this notebook? the whole cluster? this job?), persistence (survives a restart?), and production-safety (versioned and reproducible?). %pip is dev-only; production dependencies live in the bundle.

Predict before the table: you %pip install a package, the cluster restarts overnight. Is it still there? No: a %pip install is notebook-scoped, tied to that notebook's Python session, and gone on restart. That's why %pip is never the production answer.

The menu (method → the three questions)

Method	Scope	Persists a restart?	Use when
`%pip install`	current notebook session	No — gone on restart	interactive dev only. Never production
Cluster library (UI/API)	all notebooks on that cluster	Yes, reinstalled on every start until uninstalled	a shared dep a whole team's cluster agrees on
`requirements.txt` in a DABs bundle	that bundle's job cluster	Yes — installed before code runs	production jobs — version-pinned, in Git
Init script (`.sh`)	cluster system level	Yes — runs on every startup	private package indices, auth, OS-level deps
Unity Catalog Volume	governed central storage	Yes	regulated environments needing an audit trail
Python wheel in DABs	bundle level	Yes — built + deployed automatically	your own custom shared code across jobs

Tell: "production job, reproducible" → requirements.txt in DABs; "our own shared code across jobs" → a wheel; "just testing in a notebook" → %pip (only there).
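"Version-pinned" in the table above can be checked mechanically before a bundle is deployed. The following is a minimal sketch, not a Databricks feature: the helper name `unpinned_requirements` and the sample file contents are hypothetical, and the check only understands simple `name==version` lines (paths to wheels, `-r` includes, and URL requirements would be reported as unpinned).

```python
import re

# Matches "name==version", optionally with extras such as name[extra]==1.2.3.
# Deliberately strict: ranges (>=, ~=) and bare names count as unpinned.
_PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s]+$")


def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return the requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # skip blank and comment-only lines
        if not _PINNED.match(line):
            unpinned.append(line)
    return unpinned


# Hypothetical requirements.txt contents, for illustration only.
sample = """
# dependencies for the job
requests==2.31.0
pandas>=2.0
pyyaml
"""

print(unpinned_requirements(sample))  # ['pandas>=2.0', 'pyyaml']
```

A check like this could run in CI next to the bundle so that an open-ended range never reaches a production job cluster.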

Wheels — the format for custom code

A wheel (.whl) is the standard packaged-and-installable format for Python code. Not sbt (Scala), not npm (JS), not CRAN (R). Build a wheel, put it in Workspace Files or a UC Volume, install via a cluster library or a bundle's requirements.txt. "How do I distribute my custom Python package?" → wheel.
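Once a wheel is sitting in Workspace Files or a UC Volume, its filename already states what it contains. The sketch below splits a filename according to the standard wheel naming convention (`distribution-version[-build]-python-abi-platform.whl`). The function name and the example filename `my_shared_lib-0.1.0-py3-none-any.whl` are hypothetical, and the parsing is intentionally simple rather than a full validator.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]  # optional build tag, usually absent
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    return WheelName(dist, version, build, py, abi, plat)


# Hypothetical wheel name for a team's shared package.
info = parse_wheel_filename("my_shared_lib-0.1.0-py3-none-any.whl")
print(info.distribution, info.version, info.python_tag)  # my_shared_lib 0.1.0 py3
```

A `py3-none-any` suffix indicates a pure-Python wheel with no platform-specific binaries, which is the common case for shared job code.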

The gotchas the exam loves

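After %pip install, run %restart_python (equivalently dbutils.library.restartPython()). Without it, an interpreter that has already imported the package, or an older version of it, keeps using the copy in memory even though pip reported success ("why isn't my import working?").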
Two versions of the same package on a cluster → the cluster fails to start — a startup error, not a runtime one.
Air-gapped (no-internet) workspace — can't pip install from PyPI, so init-script-with-pip fails. Answer: build the wheel, upload to Workspace Files or a UC Volume, install via a cluster library or bundle requirements.txt.
Freshness (verify against docs): init scripts must live in workspace files or UC Volumes — the old DBFS location is deprecated. A scenario storing an init script on DBFS → that placement is the dated/wrong part.
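Two of the gotchas above lend themselves to a quick diagnostic. This is a minimal sketch using only the standard library: `installed_version` asks the current interpreter which version of a package it can see, and `find_version_conflicts` flags a package pinned to more than one version across a list of pins. Both helper names and the sample pins are hypothetical, and the conflict check only understands `name==version` strings.

```python
import re
from collections import defaultdict
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Version visible to the current interpreter, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_version_conflicts(pins: list[str]) -> dict[str, set[str]]:
    """Map each package pinned to more than one version to those versions."""
    versions: dict[str, set[str]] = defaultdict(set)
    for pin in pins:
        name, _, version = pin.partition("==")
        if not version:
            continue  # not a simple name==version pin; ignored in this sketch
        # Normalise case and separators so "My_Pkg" and "my-pkg" compare equal.
        key = re.sub(r"[-_.]+", "-", name.strip()).lower()
        versions[key].add(version.strip())
    return {name: vs for name, vs in versions.items() if len(vs) > 1}


# Hypothetical pins gathered from a cluster library list and a requirements file.
pins = ["requests==2.31.0", "Requests==2.28.2", "numpy==1.26.4"]
print(find_version_conflicts(pins))
print(installed_version("surely-not-installed-pkg-xyz-123"))  # None
```

If `installed_version` reports the newly installed version while an already-imported module still behaves like the old one, that is consistent with the interpreter needing a restart.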

Where it connects: this is one half of the objective's "configs for environments and dependencies" from [Job & environment configuration — compute and Spark tuning](/lessons/s1-job-env-config/). The production answer (requirements.txt / wheel in the bundle) previews [Python project structure for Databricks Asset Bundles](/lessons/s1-dabs-project/) and [Declarative Automation Bundles — deploying Databricks as code](/lessons/s9-dabs-deploy/).

Recall (say without scrolling up)

The three questions that pick an install method — name them.
Production job needs a reproducible dependency — which method?
Air-gapped workspace, can't reach PyPI — how do you get your package on?
After %pip install your import fails — what did you forget?

Managing third-party libraries
