Learn
Deep, connected lessons. Read one, then practice its questions in the same place.
Structured Streaming & the state model
Compare Spark Structured Streaming and Lakeflow Spark Declarative Pipelines to determine the optimal approach; build reliable batch and streaming pipelines. (Underlies streaming across Sections 1 and 3.)
Lakeflow Spark Declarative Pipelines
Build and manage reliable, production-ready batch and streaming pipelines using Lakeflow Spark Declarative Pipelines and Auto Loader; use expectations for quality.
Streaming tables vs materialized views
Explain the advantages and disadvantages of streaming tables compared to materialized views.
APPLY CHANGES — CDC and SCD, declaratively
Use APPLY CHANGES APIs to simplify CDC in Lakeflow Spark Declarative Pipelines.
Jobs & orchestration — multi-task, dependencies, control flow
Create and automate ETL workloads using Jobs via UI/APIs/CLI; create pipeline components that use control-flow operators (if/else, for-each).
Jobs via REST API and CLI
Create and automate ETL workloads using Jobs via UI/APIs/CLI.
Job & environment configuration — compute and Spark tuning
Choose the appropriate configurations for environments and dependencies, high memory for notebook tasks, and auto-optimization settings that disallow retries.
UDFs — Python vs Pandas, and why the type is everything
Develop User-Defined Functions (UDFs) using Pandas/Python UDF.
Managing third-party libraries
Manage and troubleshoot external third-party library installations and dependencies in Databricks, including PyPI packages, local wheels, and source archives.
Python project structure for Databricks Asset Bundles
Design and implement a scalable Python project structure optimized for Databricks Asset Bundles (DABs), enabling modular development, deployment automation, and CI/CD integration.
Unit & integration testing on Databricks
Develop unit and integration tests using assertDataFrameEqual, assertSchemaEqual, DataFrame.transform, and testing frameworks to ensure code correctness, and debug with the built-in debugger.
Reference card — job parameters & secrets in notebooks
Understand the notebook development environment, variable management, and creating secure, configurable code.
Reference card — pipeline control-flow operators
Create a pipeline component that uses control-flow operators (e.g., if/else, for-each).
Advanced transformations — window functions, joins, aggregations
Write efficient Spark SQL and PySpark code to apply advanced transformations — window functions, joins, aggregations — on large datasets.
Deduplication — distinct, keep-latest, and the streaming state trap
Deduplicate data using appropriate techniques for batch and streaming, including watermark-bounded streaming deduplication and idempotent merges.
Data quality — expectations and constraints, and what happens to a bad row
Enforce data quality using Lakeflow pipeline expectations and Delta table constraints; choose the correct violation behavior (warn/drop/fail).
Quarantining bad data — the third option beyond drop and fail
Design pipelines that isolate invalid records (quarantine) for inspection and replay, using expectations and Auto Loader rescued data.
Delta Sharing — live data out, without copies
Share data with Delta Sharing — Databricks-to-Databricks vs open protocol, creating shares, WITH HISTORY, egress, and shareable object types.
Lakehouse Federation — query external data in place
Use Lakehouse Federation to query external data sources (databases, warehouses) in place through Unity Catalog, without copying or migrating the data.
The monitoring map — which surface answers which question
Identify the correct Databricks observability surface for a monitoring question — Spark UI/Query Profile, cluster event log, job run history, pipeline event log, and system tables.
System tables — the account's durable, queryable memory
Use system tables (billing usage, access audit logs, query history) to build observability, cost-attribution, and compliance reporting with SQL.
The pipeline event log — where a Lakeflow pipeline records itself
Query the Lakeflow pipeline event log to extract execution progress and data-quality expectation metrics programmatically.
SQL Alerts — the single-value rule that makes or breaks them
Design Databricks SQL Alerts that evaluate a query result against a threshold, including collapsing multi-condition logic into a single evaluated value.
Operational job monitoring — REST/CLI, notifications, retry policy
Monitor and control jobs programmatically (REST API, CLI), configure failure notifications, and set retry policies for production and streaming jobs.
The performance model — why a query is slow, and the one lever
Understand the optimization techniques Databricks uses for query performance on large datasets (data skipping, file pruning, etc.).
Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM
Understand Delta optimization techniques; keep large-dataset queries performant (file sizing, compaction) and manage storage.
Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)
Understand Delta optimization techniques such as deletion vectors and liquid clustering, and the benefits of liquid clustering over partitioning and Z-order.
Letting the platform maintain layout — Predictive Optimization & managed tables
Understand how and why Unity Catalog managed tables, together with predictive optimization, reduce operational overhead and maintenance burden.
Change Data Feed — emitting a table's changes downstream
Apply Change Data Feed (CDF) to address specific limitations of streaming tables and reduce latency.
Reading the evidence — Query Profile & Spark UI
Use the query profile to analyze a query and identify bottlenecks — bad data skipping, inefficient join types, and data shuffling.
Access control — least privilege and the object permission ladders
Apply least-privilege access controls to workspace objects (clusters, jobs, notebooks, pipelines) using permission levels and ownership rules.
Unity Catalog privileges — the three-level traversal and delegation
Grant Unity Catalog data privileges (SELECT, MODIFY, USE CATALOG/SCHEMA, MANAGE) following least privilege, and delegate administration without granting admin.
Row filters and column masks — access control inside a table
Enforce row-level and column-level security in Unity Catalog using row filters and column masks (SQL UDFs), and contrast with dynamic views.
Secrets — storing credentials, redaction, and scope ACLs
Store and retrieve credentials with Databricks secret scopes, understand output redaction and its limits, and control access with scope-level ACLs.
Anonymization — hashing, pseudonymization, and protecting values at rest
Anonymize/pseudonymize PII using hashing (SHA-2) and related techniques, apply masking in place, and ensure consistent masking across distributed pipelines.
PII lifecycle — deletion, retention, and the right to be forgotten
Fully delete data for compliance (GDPR right to be forgotten): logical DELETE vs physical VACUUM, retention properties, targeted partition deletes, and audit.
Unity Catalog inheritance — how one grant cascades
Understand the Unity Catalog privilege inheritance model — grants cascade to current and future children — plus ALL PRIVILEGES scope and the default workspace catalog.
Discoverability & metadata — comments, tags, and DESCRIBE
Make data discoverable and documented with comments, tags, AI-generated comments, and the metadata inspection commands (DESCRIBE EXTENDED).
Declarative Automation Bundles — deploying Databricks as code
Package and deploy Databricks resources with Declarative Automation Bundles (Databricks Asset Bundles): validate → deploy → run, targets, and binding existing jobs.
Git Folders & CI/CD — version control inside the workspace
Use Git Folders for branch-based development, collaboration, and CI/CD; understand the notebook source format and how it enables testing and version control.
Dimensional modelling — SCD types and the star schema
Model dimensions with the right slowly-changing-dimension (SCD) type, build star schemas on Delta, and understand informational constraints.
Delta data models — managed vs external, clones, and materialization
Choose Delta table types (managed vs external), copy safely with shallow/deep clone, and pick the right materialization (view, table, materialized view).