Developing Code for Data Processing

Managing third-party libraries

Manage and troubleshoot external third-party library installations and dependencies in Databricks, including PyPI packages, local wheels, and source archives.

Reference lesson. Your code — a UDF from [UDFs — Python vs Pandas, and why the type is everything](/lessons/s1-udfs/), a pipeline transform — needs a package the cluster doesn't have. Installing it seems trivial (%pip install), and that's the trap: the method that's easiest interactively is the one that silently fails in production.

The spine — pick by three questions

Anchor. Choose the install method by scope (this notebook? the whole cluster? this job?), persistence (survives a restart?), and production-safety (versioned and reproducible?). %pip is dev-only; production dependencies live in the bundle.

Predict before the table: you %pip install a package, the cluster restarts overnight — is it still there? No — %pip is session-scoped and gone on restart. That's why it's never production.
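
To test that prediction yourself, ask the interpreter whether a distribution is currently installed, once before and once after the restart. A minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` and the package name are hypothetical, chosen only for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Hypothetical package name: substitute whatever you installed with %pip.
print(installed_version("some-package-you-pip-installed"))
```

If the helper returns a version string before the restart and `None` after it, the package was session-scoped.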

The menu (method → the three questions)

| Method | Scope | Persists a restart? | Use when |
| --- | --- | --- | --- |
| %pip install | current notebook session | No, gone on restart | interactive dev only. Never production |
| Cluster library (UI/API) | all notebooks on that cluster | Yes, until detached | a shared dep a whole team's cluster agrees on |
| requirements.txt in a DABs bundle | that bundle's job cluster | Yes, installed before code runs | production jobs: version-pinned, in Git |
| Init script (.sh) | cluster system level | Yes, runs on every startup | private package indices, auth, OS-level deps |
| Unity Catalog Volume | governed central storage | Yes | regulated environments needing an audit trail |
| Python wheel in DABs | bundle level | Yes, built + deployed automatically | your own custom shared code across jobs |

Tell: "production job, reproducible" → requirements.txt in DABs; "our own shared code across jobs" → a wheel; "just testing in a notebook" → %pip (only there).
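
"Version-pinned" is checkable. The sketch below flags requirement lines that are not pinned with `==`. It is a simplified illustration, not a full parser of pip's requirements format: it only understands `name==version` lines and will flag anything else, including wheel paths and URLs. The function name `unpinned_requirements` and the sample file contents are hypothetical.

```python
import re
from typing import List

# Matches "name==version" with optional extras and an optional environment marker,
# e.g. "requests[socks]==2.31.0 ; python_version >= '3.9'". Wildcards like "1.*" do not match.
_PINNED = re.compile(
    r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,\s-]+\])?\s*==\s*[^\s;*]+\s*(;.*)?$"
)


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that are not pinned with '=='."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options such as --index-url
            continue
        if not _PINNED.match(line):
            problems.append(line)
    return problems


# Hypothetical requirements.txt contents.
sample = """
requests==2.31.0
pandas>=2.0
numpy
"""
print(unpinned_requirements(sample))  # ['pandas>=2.0', 'numpy']
```

A check like this could run in CI before the bundle deploys, so a loose version range never reaches a production job.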

Wheels — the format for custom code

A wheel (.whl) is the standard packaged-and-installable format for Python code. Not sbt (Scala), not npm (JS), not CRAN (R). Build a wheel, put it in Workspace Files or a UC Volume, install via a cluster library or a bundle's requirements.txt. "How do I distribute my custom Python package?" → wheel.
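
When a wheel refuses to install, the filename is often the first clue, because the standard wheel naming convention encodes the distribution, version, and compatibility tags: `{distribution}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl`. A minimal sketch that splits a filename into those parts; the helper `parse_wheel_filename` and the package name `my_shared_lib` are hypothetical.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into the components of the wheel naming convention."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:  # no build tag present
        dist, ver, py, abi, plat = parts
        return WheelName(dist, ver, None, py, abi, plat)
    if len(parts) == 6:  # optional build tag present
        dist, ver, build, py, abi, plat = parts
        return WheelName(dist, ver, build, py, abi, plat)
    raise ValueError(f"unexpected wheel filename layout: {filename!r}")


# Hypothetical wheel built from your own shared code.
info = parse_wheel_filename("my_shared_lib-0.1.0-py3-none-any.whl")
print(info.distribution, info.version, info.platform_tag)  # my_shared_lib 0.1.0 any
```

A `py3-none-any` wheel is pure Python and not tied to a platform; a platform-specific tag is worth checking against the cluster's runtime if installation fails.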

The gotchas the exam loves

Where it connects: this is one half of the objective's "configs for environments and dependencies" from [Job & environment configuration — compute and Spark tuning](/lessons/s1-job-env-config/). The production answer (requirements.txt / wheel in the bundle) previews [Python project structure for Databricks Asset Bundles](/lessons/s1-dabs-project/) and [Declarative Automation Bundles — deploying Databricks as code](/lessons/s9-dabs-deploy/).

Recall (say without scrolling up)

  1. The three questions that pick an install method — name them.
  2. Production job needs a reproducible dependency — which method?
  3. Air-gapped workspace, can't reach PyPI — how do you get your package on?
  4. After %pip install your import fails — what did you forget?
