Understand the optimization techniques Databricks uses for query performance on large datasets (data skipping, file pruning, etc.).

Section 6 is the biggest slice of the exam after Developing Code, and it looks like a grab-bag — OPTIMIZE, VACUUM, Z-order, liquid clustering, deletion vectors, data skipping, CDF, query profiles. It isn't. Every one of those hangs off a single sentence. Internalise it here and the whole section becomes derivable.

The spine

Beat 1 — the anchor: cost is the bytes you read

Recall from [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/) that a Delta table is Parquet files plus a log, and each add entry carries per-column min/max statistics for its file. Hold that — it's the hinge for everything here.

Predict: table A is 10 TB but a query reads 200 MB of it. Table B is 100 GB but every query reads all of it. Which query is slow?

…

B — because slowness isn't about how big the table is, it's about how many bytes a query is forced to touch.

Anchor. A query is slow mostly because it reads too many bytes. So every Section-6 technique is one of two moves: read fewer files (skip the ones you don't need), or make the files you do read the right size and shape (so skipping works and there's no per-file overhead). "Go faster" = "read less."

Beat 2 — data skipping: the free win, and how it works

SELECT * FROM events WHERE event_date = '2026-06-01'

Predict: before opening any Parquet file, what can Delta check to avoid reading a file whose dates are all in January?

…

Its min/max stats in the log. A file whose event_date range is [2026-01-01, 2026-03-31] cannot contain June 1 — so Delta skips it without opening it. Only files whose range overlaps the filter get read. That's data skipping (a.k.a. file pruning) — automatic, free, the single biggest lever when it works.

The catch — and this is the thread through the next two lessons: skipping only helps if file ranges are tight and non-overlapping on the filter column. If every file contains dates from all of 2026 (written in random order), every file overlaps June 1, nothing is skipped, and Spark full-scans despite the filter.

Predict: a WHERE event_date = … query still reads the whole table. What's the most likely cause?

…

The filter column isn't reflected in the table's layout (not partitioned/clustered/Z-ordered on it), so the per-file stats don't separate the data. Skipping is only as good as the layout ([Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)](/lessons/s6-data-layout/)).

Lock it. Cost ≈ bytes read. Data skipping uses min/max stats to skip non-overlapping files — but only if the layout makes ranges tight on the filter column. "Skipping isn't working" → filter column not in the layout → full scan.

The dials (skim now; return when a question needs one)

◆ The statistics — min/max/null-count, first 32 columns

Delta collects min, max, and null-count per file (not average) — by default only for the first 32 columns. Two tested consequences:

A column beyond position 32 has no stats → can't skip on it → scan. Fix: reorder so filter columns are within the first 32, or raise delta.dataSkippingNumIndexedColumns.
Long high-cardinality strings (free text/JSON) give uninformative stats — wasteful to index and useless as filters; keep indexed columns to the ones you filter/join on.

(Inspect with DESCRIBE DETAIL for file count/size; per-file stats live in the log — the debugging angle is [Reading the evidence — Query Profile & Spark UI](/lessons/s6-query-profile/).)

◆ Dynamic file pruning — skipping at runtime

Beyond static filters, Databricks also does dynamic file pruning: it skips files using a filter discovered at runtime — e.g. from the small side of a join, the actual key values feed back to prune the big fact table's files. Same idea (skip non-matching files), but the pruning filter is known only mid-execution. Tell: "the query leveraged dynamic file pruning" = runtime-filter-based skipping, often in joins.

◆ Partition pruning vs data skipping (name the collision)

Partition pruning — a table physically partitioned into folders (/event_date=2026-06-01/); a filter on that column ignores whole folders. Coarse, directory-level.
Data skipping — the min/max-stats mechanism, at the file level, on any indexed column. Finer, no folders needed.

Both are "read fewer files." Modern Databricks favours data skipping + clustering over folder partitioning ([Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)](/lessons/s6-data-layout/)), but the exam tests the distinction.

◆ File size — the Goldilocks problem

Even with perfect skipping, file extremes wreck you:

Too many tiny files — Spark pays a fixed open/read/close cost per file; millions of 1 KB files are slow regardless. The small-file problem (from frequent small streaming writes, over-partitioning, a bad repartition(10000)).
Too few huge files — can't be skipped within and can't parallelise well.

Target: right-sized files (~128 MB–1 GB) — the job of compaction ([Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM](/lessons/s6-compaction/)).

Takeaways (rebuild it from these)

A query's cost ≈ bytes read. Every technique is "read fewer files" or "right-size/shape files." Touched size matters, not table size.
Data skipping / file pruning uses per-file min/max stats to skip files that can't match — automatic and free, but only when ranges are tight on the filter column.
Stats = min/max/null-count (not average), first 32 columns by default — a filter column beyond that (or a long high-cardinality string) can't be skipped on.
Dynamic file pruning = skip via a runtime filter (often from a join). Partition pruning (folders) vs data skipping (files) — both "read less."
Files must be right-sized (~128 MB–1 GB): too many tiny = per-file overhead; too few huge = no fine skipping/parallelism.

Before you move on — say these without scrolling up

What actually determines a query's speed — and the two moves every technique makes?
How does Delta skip a file without opening it, and when does that fail?
Which three stats, on how many columns by default — and which stat is not collected?
Partition pruning vs data skipping vs dynamic file pruning — one line each.

Next: the cure for wrong-sized files — compaction, with OPTIMIZE, optimized writes, auto compaction, and VACUUM → [Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM](/lessons/s6-compaction/).

The performance model — why a query is slow, and the one lever