Understand delta optimization techniques such as deletion vectors and liquid clustering; the benefits of liquid clustering over partitioning and Z-order.

[The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/) left a cliffhanger: data skipping only works if each file's min/max range is tight on the column you filter — else every file overlaps and Spark reads everything. [Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM](/lessons/s6-compaction/) made files the right size. This lesson makes their contents the right shape. It's the most-tested topic in the section, and it's a clean evolution story.

The spine

Beat 1 — the anchor: co-locate similar values

Anchor. Data skipping is only as good as your layout. To skip more, co-locate similar values into the same files, so each file's min/max range on your filter columns is narrow and non-overlapping. The three techniques below are just increasingly good ways to do that co-location: partitioning → Z-order → liquid clustering.

Beat 2 — partitioning, and why it became the fallback

Hive-style partitioning splits the table into folders by a column value (/country=US/). A filter on that column prunes whole folders. Simple, fine for a low-cardinality column you almost always filter on.

Predict: you partition a table by customer_id (millions of values). What goes wrong?

…

Millions of folders, each with one tiny file — the small-file pathology from [The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/). That's the first of three failure modes:

High cardinality → small files. Never partition on high-cardinality columns.
Data skew. If "Enterprise" has 10× the rows of "Small," that folder is huge; one executor sweats. Partitioning causes skew.
Rigid. Partition columns are fixed at creation; changing them rewrites the whole table.

Rule of thumb: don't partition tables under ~1 TB, and only when each partition holds ≥ ~1 GB. Under that, partitioning fragments more than it helps.

Beat 3 — Z-order → liquid clustering

Z-ordering doesn't make folders; it sorts and co-locates similar rows into the same files, tightening ranges. You run it inside compaction (it sorts while OPTIMIZE already has the files in memory — no extra pass):

OPTIMIZE my_table ZORDER BY (customer_id, event_date);

Multi-column Z-order uses a Z-curve (Morton curve) — a space-filling curve mapping N-D values to one sort order so points close in N-D stay close in 1-D. Its limits set up the next step: you choose columns upfront, and changing them = a full-table OPTIMIZE rewrite; no skew/small-file help.

Liquid clustering (CLUSTER BY, GA DBR 13.3+) was built to fix exactly that:

CREATE TABLE t (...) CLUSTER BY (customer_id, event_date);
ALTER TABLE t CLUSTER BY (event_date, region);   -- change keys ANY TIME — no rewrite
OPTIMIZE t;                                        -- clusters incrementally (new/changed data only)

Flexible keys — change with ALTER TABLE … CLUSTER BY, no rewrite (partitioning + Z-order both force one).
Incremental — OPTIMIZE reorganizes only new/unclustered data.
Fixes skew + small files together — splits oversized groups and combines undersized ones, so files are balanced and executors get equal work.

Lock it. Co-locate to tighten ranges: partitioning (folders, low-cardinality, ≥1 TB tables) → Z-order (within files, upfront columns, rewrite to change) → liquid clustering (change keys anytime, incremental, fixes skew+small-files). New data stays unclustered until the next OPTIMIZE.

The dials (skim now; return when a question needs one)

◆ The comparison the objective asks for directly

	Hive partitioning	Z-order	Liquid clustering
Change keys later	rewrite whole table	rewrite whole table	any time, no rewrite
Handles small files	no (can worsen)	no	yes — balanced
Handles data skew	no (can cause)	no	yes — splits/combines
Recommendation	legacy / big low-cardinality	existing tables	new tables (preferred)

Derivable tell: "skew + frequently-changing filter columns + no costly rewrites, table < 1 TB" → liquid clustering (reconstruct it, don't memorise). Note: liquid clustering is an alternative to partitioning/Z-order, not combined with them on the same table.

◆ Deletion vectors — keep updates from wrecking the layout

From [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/): deletion vectors (Merge-on-Read) mark rows dead with a tiny bitmap + write a small new-rows file, instead of rewriting a whole file. Why it's a layout topic: without DVs, every UPDATE/DELETE rewrites big files and churns your organised layout back into small files. DVs keep point-changes cheap; a periodic OPTIMIZE (or Predictive Optimization) physically applies them and re-tidies. So the two priority actions for a heavily-updated large table (a real exam scenario): liquid-cluster on the merge/join keys (so MERGE prunes files) + enable deletion vectors (so updates don't rewrite whole files).

◆ Skew: on-disk layout vs at-query-time (and salting)

Liquid clustering balances skew on disk (splits hot groups).
AQE (from [Job & environment configuration — compute and Spark tuning](/lessons/s1-job-env-config/)) fixes skew at query time by splitting skewed shuffle partitions.
Salting — a manual fix when one key is hot: add a random prefix to the skewed key before aggregating (spreading it across partitions), then aggregate again after stripping the prefix. Good layout + AQE + (if needed) salting is the full skew answer.

◆ Read-parallelism knob — `spark.sql.files.maxPartitionBytes`

Distinct from file layout: spark.sql.files.maxPartitionBytes (default 128 MB) controls how large files are split into Spark partitions when read — lower it to get more, smaller read partitions (more parallelism / less memory pressure per task); it's a read-side tuning knob, not a storage-layout one. (Cluster-UI Spark configs apply to all notebooks on that cluster.)

Takeaways (rebuild it from these)

Skipping is only as good as layout: co-locate similar values → tight ranges. Evolution: partitioning → Z-order → liquid clustering.
Partitioning = folders; low-cardinality, ≥1 GB partitions, tables ≳1 TB. High cardinality → small files; causes skew; rigid (rewrite to change).
Z-order (OPTIMIZE … ZORDER BY, Z-curve) co-locates within files; upfront columns; changing = full rewrite; no skew/small-file help.
Liquid clustering (CLUSTER BY) = modern default: change keys anytime no rewrite, incremental OPTIMIZE, fixes skew + small files. Alternative to (not combined with) partitioning/Z-order; new data unclustered until next OPTIMIZE.
Deletion vectors keep UPDATE/DELETE from shredding layout → heavily-updated big table = cluster on join keys + enable DVs. Skew: liquid clustering (disk) + AQE (query time) + salting (manual). maxPartitionBytes = read-split size.

Before you move on — say these without scrolling up

Three failure modes of partitioning — and the cardinality/size rules of thumb.
Z-order vs liquid clustering — the two things liquid clustering does that Z-order can't.
After creating a clustered table, when does new data actually get clustered?
Heavily-updated large table — the two priority actions, and why DVs matter for layout.

Next: on UC managed tables you increasingly don't run OPTIMIZE or pick clustering keys at all — the platform does → [Letting the platform maintain layout — Predictive Optimization & managed tables](/lessons/s6-predictive-optimization/).

Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)