[The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/) left a cliffhanger: data skipping only works if each file's min/max range is tight on the column you filter — else every file overlaps and Spark reads everything. [Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM](/lessons/s6-compaction/) made files the right size. This lesson makes their contents the right shape. It's the most-tested topic in the section, and it's a clean evolution story.
The spine
Beat 1 — the anchor: co-locate similar values
Anchor. Data skipping is only as good as your layout. To skip more, co-locate similar values into the same files, so each file's min/max range on your filter columns is narrow and non-overlapping. The three techniques below are just increasingly good ways to do that co-location: partitioning → Z-order → liquid clustering.
Beat 2 — partitioning, and why it became the fallback
Hive-style partitioning splits the table into folders by a column value (/country=US/). A filter on that column prunes whole folders. Simple, fine for a low-cardinality column you almost always filter on.
Predict: you partition a table by
customer_id(millions of values). What goes wrong?
…
Millions of folders, each with one tiny file — the small-file pathology from [The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/). That's the first of three failure modes:
- High cardinality → small files. Never partition on high-cardinality columns.
- Data skew. If "Enterprise" has 10× the rows of "Small," that folder is huge; one executor sweats. Partitioning causes skew.
- Rigid. Partition columns are fixed at creation; changing them rewrites the whole table.
Rule of thumb: don't partition tables under ~1 TB, and only when each partition holds ≥ ~1 GB. Under that, partitioning fragments more than it helps.
Beat 3 — Z-order → liquid clustering
Z-ordering doesn't make folders; it sorts and co-locates similar rows into the same files, tightening ranges. You run it inside compaction (it sorts while OPTIMIZE already has the files in memory — no extra pass):
OPTIMIZE my_table ZORDER BY (customer_id, event_date);
Multi-column Z-order uses a Z-curve (Morton curve) — a space-filling curve mapping N-D values to one sort order so points close in N-D stay close in 1-D. Its limits set up the next step: you choose columns upfront, and changing them = a full-table OPTIMIZE rewrite; no skew/small-file help.
Liquid clustering (CLUSTER BY, GA DBR 13.3+) was built to fix exactly that:
CREATE TABLE t (...) CLUSTER BY (customer_id, event_date);
ALTER TABLE t CLUSTER BY (event_date, region); -- change keys ANY TIME — no rewrite
OPTIMIZE t; -- clusters incrementally (new/changed data only)
- Flexible keys — change with
ALTER TABLE … CLUSTER BY, no rewrite (partitioning + Z-order both force one). - Incremental — OPTIMIZE reorganizes only new/unclustered data.
- Fixes skew + small files together — splits oversized groups and combines undersized ones, so files are balanced and executors get equal work.
Lock it. Co-locate to tighten ranges: partitioning (folders, low-cardinality, ≥1 TB tables) → Z-order (within files, upfront columns, rewrite to change) → liquid clustering (change keys anytime, incremental, fixes skew+small-files). New data stays unclustered until the next OPTIMIZE.
The dials (skim now; return when a question needs one)
◆ The comparison the objective asks for directly
| Hive partitioning | Z-order | Liquid clustering | |
|---|---|---|---|
| Change keys later | rewrite whole table | rewrite whole table | any time, no rewrite |
| Handles small files | no (can worsen) | no | yes — balanced |
| Handles data skew | no (can cause) | no | yes — splits/combines |
| Recommendation | legacy / big low-cardinality | existing tables | new tables (preferred) |
Derivable tell: "skew + frequently-changing filter columns + no costly rewrites, table < 1 TB" → liquid clustering (reconstruct it, don't memorise). Note: liquid clustering is an alternative to partitioning/Z-order, not combined with them on the same table.
◆ Deletion vectors — keep updates from wrecking the layout
From [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/): deletion vectors (Merge-on-Read) mark rows dead with a tiny bitmap + write a small new-rows file, instead of rewriting a whole file. Why it's a layout topic: without DVs, every UPDATE/DELETE rewrites big files and churns your organised layout back into small files. DVs keep point-changes cheap; a periodic OPTIMIZE (or Predictive Optimization) physically applies them and re-tidies. So the two priority actions for a heavily-updated large table (a real exam scenario): liquid-cluster on the merge/join keys (so MERGE prunes files) + enable deletion vectors (so updates don't rewrite whole files).
◆ Skew: on-disk layout vs at-query-time (and salting)
- Liquid clustering balances skew on disk (splits hot groups).
- AQE (from
[Job & environment configuration — compute and Spark tuning](/lessons/s1-job-env-config/)) fixes skew at query time by splitting skewed shuffle partitions. - Salting — a manual fix when one key is hot: add a random prefix to the skewed key before aggregating (spreading it across partitions), then aggregate again after stripping the prefix. Good layout + AQE + (if needed) salting is the full skew answer.
◆ Read-parallelism knob — spark.sql.files.maxPartitionBytes
Distinct from file layout: spark.sql.files.maxPartitionBytes (default 128 MB) controls how large files are split into Spark partitions when read — lower it to get more, smaller read partitions (more parallelism / less memory pressure per task); it's a read-side tuning knob, not a storage-layout one. (Cluster-UI Spark configs apply to all notebooks on that cluster.)
Takeaways (rebuild it from these)
- Skipping is only as good as layout: co-locate similar values → tight ranges. Evolution: partitioning → Z-order → liquid clustering.
- Partitioning = folders; low-cardinality, ≥1 GB partitions, tables ≳1 TB. High cardinality → small files; causes skew; rigid (rewrite to change).
- Z-order (
OPTIMIZE … ZORDER BY, Z-curve) co-locates within files; upfront columns; changing = full rewrite; no skew/small-file help. - Liquid clustering (
CLUSTER BY) = modern default: change keys anytime no rewrite, incremental OPTIMIZE, fixes skew + small files. Alternative to (not combined with) partitioning/Z-order; new data unclustered until next OPTIMIZE. - Deletion vectors keep UPDATE/DELETE from shredding layout → heavily-updated big table = cluster on join keys + enable DVs. Skew: liquid clustering (disk) + AQE (query time) + salting (manual).
maxPartitionBytes= read-split size.
Before you move on — say these without scrolling up
- Three failure modes of partitioning — and the cardinality/size rules of thumb.
- Z-order vs liquid clustering — the two things liquid clustering does that Z-order can't.
- After creating a clustered table, when does new data actually get clustered?
- Heavily-updated large table — the two priority actions, and why DVs matter for layout.
Next: on UC managed tables you increasingly don't run OPTIMIZE or pick clustering keys at all — the platform does → [Letting the platform maintain layout — Predictive Optimization & managed tables](/lessons/s6-predictive-optimization/).