Learn

Deep, connected lessons. Read one, then practice its questions in the same place.

The one job — and the two axes everything lives on

How Delta Lake works — the transaction log

Foundation for Cost & Performance (§6) and Data Modelling (§10); Delta Lake is named across the exam.

Compare Spark Structured Streaming and Lakeflow Spark Declarative Pipelines to determine the optimal approach; build reliable batch and streaming pipelines. (Underlies streaming across Sections 1 and 3.)

21 Q

Lakeflow Spark Declarative Pipelines

Build and manage reliable, production-ready batch and streaming pipelines using Lakeflow Spark Declarative Pipelines and Auto Loader; use expectations for quality.

4 Q

Streaming tables vs materialized views

Explain the advantages and disadvantages of streaming tables compared to materialized views.

2 Q

APPLY CHANGES — CDC and SCD, declaratively

Use APPLY CHANGES APIs to simplify CDC in Lakeflow Spark Declarative Pipelines.

7 Q

Jobs & orchestration — multi-task, dependencies, control flow

Create and automate ETL workloads using Jobs via UI/APIs/CLI; create pipeline components that use control-flow operators (if/else, for-each).

Jobs via REST API and CLI

Create and automate ETL workloads using Jobs via UI/APIs/CLI.

Job & environment configuration — compute and Spark tuning

Choose the appropriate configs for environments and dependencies, high memory for notebook tasks, and auto-optimization to disallow retries.

6 Q

UDFs — Python vs Pandas, and why the type is everything

Develop User-Defined Functions (UDFs) using Pandas/Python UDF.

5 Q

Managing third-party libraries

Manage and troubleshoot external third-party library installations and dependencies in Databricks, including PyPI packages, local wheels, and source archives.

3 Q

Python project structure for Databricks Asset Bundles

Design and implement a scalable Python project structure optimized for Databricks Asset Bundles (DABs), enabling modular development, deployment automation, and CI/CD integration.

2 Q

Unit & integration testing on Databricks

Develop unit and integration tests using assertDataFrameEqual, assertSchemaEqual, DataFrame.transform, and testing frameworks to ensure code correctness, and debug with the built-in debugger.

1 Q

Reference card — job parameters & secrets in notebooks

Understand the notebook development environment, variable management, and creating secure, configurable code.

3 Q

Reference card — pipeline control-flow operators

Create a pipeline component that uses control-flow operators (e.g., if/else, for-each).

1 Q

Auto Loader — incremental file ingestion and schema evolution

Incrementally ingest files from cloud storage with Auto Loader (cloudFiles): schema inference/evolution modes, rescued data, formats, throttling, and how it compares with COPY INTO.

19 Q

Advanced transformations — window functions, joins, aggregations

Write efficient Spark SQL and PySpark code to apply advanced transformations — window functions, joins, aggregations — on large datasets.

4 Q

Deduplication — distinct, keep-latest, and the streaming state trap

Deduplicate data using appropriate techniques for batch and streaming, including watermark-bounded streaming deduplication and idempotent merges.

11 Q

Data quality — expectations and constraints, and what happens to a bad row

Enforce data quality using Lakeflow pipeline expectations and Delta table constraints; choose the correct violation behavior (warn/drop/fail).

14 Q

Quarantining bad data — the third option beyond drop and fail

Design pipelines that isolate invalid records (quarantine) for inspection and replay, using expectations and Auto Loader rescued data.

6 Q

Delta Sharing — live data out, without copies

Share data with Delta Sharing — Databricks-to-Databricks vs open protocol, creating shares, WITH HISTORY, egress, and shareable object types.

15 Q

Lakehouse Federation — query external data in place

Use Lakehouse Federation to query external data sources (databases, warehouses) in place through Unity Catalog, without copying or migrating the data.

5 Q

The monitoring map — which surface answers which question

Identify the correct Databricks observability surface for a monitoring question — Spark UI/Query Profile, cluster event log, job run history, pipeline event log, and system tables.

3 Q

System tables — the account's durable, queryable memory

Use system tables (billing usage, audit access, query history) to build observability, cost-attribution, and compliance reporting with SQL.

2 Q

The pipeline event log — where a Lakeflow pipeline records itself

Query the Lakeflow pipeline event log to extract execution progress and data-quality expectation metrics programmatically.

2 Q

SQL Alerts — the single-value rule that makes or breaks them

Design Databricks SQL Alerts that evaluate a query result against a threshold, including collapsing multi-condition logic into a single evaluated value.

4 Q

Operational job monitoring — REST/CLI, notifications, retry policy

Monitor and control jobs programmatically (REST API, CLI), configure failure notifications, and set retry policies for production and streaming jobs.

6 Q

The performance model — why a query is slow, and the one lever

Understand the optimization techniques Databricks uses for query performance on large datasets (data skipping, file pruning, etc.).

15 Q

Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM

Understand Delta optimization techniques; keep large-dataset queries performant (file sizing, compaction) and manage storage.

9 Q

Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)

Understand Delta optimization techniques such as deletion vectors and liquid clustering; the benefits of liquid clustering over partitioning and Z-order.

26 Q

Letting the platform maintain layout — Predictive Optimization & managed tables

Understand how and why using Unity Catalog managed tables reduces operational overhead and maintenance burden; predictive optimization.

6 Q

Change Data Feed — emitting a table's changes downstream

Apply Change Data Feed (CDF) to address specific limitations of streaming tables and reduce latency.

9 Q

Reading the evidence — Query Profile & Spark UI

Use the query profile to analyze a query and identify bottlenecks — bad data skipping, inefficient join types, and data shuffling.

30 Q

Access control — least privilege and the object permission ladders

Apply least-privilege access controls to workspace objects (clusters, jobs, notebooks, pipelines) using permission levels and ownership rules.

13 Q

Unity Catalog privileges — the three-level traversal and delegation

Grant Unity Catalog data privileges (SELECT, MODIFY, USE CATALOG/SCHEMA, MANAGE) following least privilege, and delegate administration without granting admin.

9 Q

Row filters and column masks — access control inside a table

Enforce row-level and column-level security in Unity Catalog using row filters and column masks (SQL UDFs), and contrast with dynamic views.

11 Q

Secrets — storing credentials, redaction, and scope ACLs

Store and retrieve credentials with Databricks secret scopes, understand output redaction and its limits, and control access with scope-level ACLs.

11 Q

Anonymization — hashing, pseudonymization, and protecting values at rest

Anonymize/pseudonymize PII using hashing (SHA-2) and related techniques, apply masking in place, and ensure consistent masking across distributed pipelines.

5 Q

PII lifecycle — deletion, retention, and the right to be forgotten

Fully delete data for compliance (GDPR right to be forgotten): logical DELETE vs physical VACUUM, retention properties, targeted partition deletes, and audit.

11 Q

Unity Catalog inheritance — how one grant cascades

Understand the Unity Catalog privilege inheritance model — grants cascade to current and future children — plus ALL PRIVILEGES scope and the default workspace catalog.

6 Q

Discoverability & metadata — comments, tags, and DESCRIBE

Make data discoverable and documented with comments, tags, AI-generated comments, and the metadata inspection commands (DESCRIBE EXTENDED).

9 Q

Declarative Automation Bundles — deploying Databricks as code

Package and deploy Databricks resources with Declarative Automation Bundles (Databricks Asset Bundles): validate → deploy → run, targets, and binding existing jobs.

11 Q

Git Folders & CI/CD — version control inside the workspace

Use Git Folders for branch-based development, collaboration, and CI/CD; understand the notebook source format and how it enables testing and version control.

10 Q

Dimensional modelling — SCD types and the star schema

Model dimensions with the right slowly-changing-dimension (SCD) type, build star schemas on Delta, and understand informational constraints.

5 Q

Delta data models — managed vs external, clones, and materialization

Choose Delta table types (managed vs external), copy safely with shallow/deep clone, and pick the right materialization (view, table, materialized view).

13 Q

The whole picture — how a lakehouse fits together, and how to answer any question about it

Synthesize all exam domains into one connected mental model — follow data end to end, surface the cross-section through-lines, and apply a repeatable question-answering procedure.

Foundations

Developing Code for Data Processing

Data Ingestion

Data Transformation, Cleansing & Quality

Data Sharing & Federation

Monitoring & Alerting

Cost & Performance Optimisation

Security & Compliance

Data Governance

Debugging & Deploying

Data Modelling

Capstone