Lessons

Cost & Performance Optimisation

Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)

Understand delta optimization techniques such as deletion vectors and liquid clustering; the benefits of liquid clustering over partitioning and Z-order.

[The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/) left a cliffhanger: data skipping only works if each file's min/max range is tight on the column you filter — else every file overlaps and Spark reads everything. [Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM](/lessons/s6-compaction/) made files the right size. This lesson makes their contents the right shape. It's the most-tested topic in the section, and it's a clean evolution story.


The spine

Beat 1 — the anchor: co-locate similar values

Anchor. Data skipping is only as good as your layout. To skip more, co-locate similar values into the same files, so each file's min/max range on your filter columns is narrow and non-overlapping. The three techniques below are just increasingly good ways to do that co-location: partitioning → Z-order → liquid clustering.

Beat 2 — partitioning, and why it became the fallback

Hive-style partitioning splits the table into folders by a column value (/country=US/). A filter on that column prunes whole folders. Simple, fine for a low-cardinality column you almost always filter on.

Predict: you partition a table by customer_id (millions of values). What goes wrong?

Millions of folders, each with one tiny file — the small-file pathology from [The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/). That's the first of three failure modes:

Rule of thumb: don't partition tables under ~1 TB, and only when each partition holds ≥ ~1 GB. Under that, partitioning fragments more than it helps.

Beat 3 — Z-order → liquid clustering

Z-ordering doesn't make folders; it sorts and co-locates similar rows into the same files, tightening ranges. You run it inside compaction (it sorts while OPTIMIZE already has the files in memory — no extra pass):

OPTIMIZE my_table ZORDER BY (customer_id, event_date);

Multi-column Z-order uses a Z-curve (Morton curve) — a space-filling curve mapping N-D values to one sort order so points close in N-D stay close in 1-D. Its limits set up the next step: you choose columns upfront, and changing them = a full-table OPTIMIZE rewrite; no skew/small-file help.

Liquid clustering (CLUSTER BY, GA DBR 13.3+) was built to fix exactly that:

CREATE TABLE t (...) CLUSTER BY (customer_id, event_date);
ALTER TABLE t CLUSTER BY (event_date, region);   -- change keys ANY TIME — no rewrite
OPTIMIZE t;                                        -- clusters incrementally (new/changed data only)

Lock it. Co-locate to tighten ranges: partitioning (folders, low-cardinality, ≥1 TB tables) → Z-order (within files, upfront columns, rewrite to change) → liquid clustering (change keys anytime, incremental, fixes skew+small-files). New data stays unclustered until the next OPTIMIZE.


The dials (skim now; return when a question needs one)

◆ The comparison the objective asks for directly

Hive partitioningZ-orderLiquid clustering
Change keys laterrewrite whole tablerewrite whole tableany time, no rewrite
Handles small filesno (can worsen)noyes — balanced
Handles data skewno (can cause)noyes — splits/combines
Recommendationlegacy / big low-cardinalityexisting tablesnew tables (preferred)

Derivable tell: "skew + frequently-changing filter columns + no costly rewrites, table < 1 TB"liquid clustering (reconstruct it, don't memorise). Note: liquid clustering is an alternative to partitioning/Z-order, not combined with them on the same table.

◆ Deletion vectors — keep updates from wrecking the layout

From [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/): deletion vectors (Merge-on-Read) mark rows dead with a tiny bitmap + write a small new-rows file, instead of rewriting a whole file. Why it's a layout topic: without DVs, every UPDATE/DELETE rewrites big files and churns your organised layout back into small files. DVs keep point-changes cheap; a periodic OPTIMIZE (or Predictive Optimization) physically applies them and re-tidies. So the two priority actions for a heavily-updated large table (a real exam scenario): liquid-cluster on the merge/join keys (so MERGE prunes files) + enable deletion vectors (so updates don't rewrite whole files).

◆ Skew: on-disk layout vs at-query-time (and salting)

◆ Read-parallelism knob — spark.sql.files.maxPartitionBytes

Distinct from file layout: spark.sql.files.maxPartitionBytes (default 128 MB) controls how large files are split into Spark partitions when read — lower it to get more, smaller read partitions (more parallelism / less memory pressure per task); it's a read-side tuning knob, not a storage-layout one. (Cluster-UI Spark configs apply to all notebooks on that cluster.)

Takeaways (rebuild it from these)

  1. Skipping is only as good as layout: co-locate similar values → tight ranges. Evolution: partitioning → Z-order → liquid clustering.
  2. Partitioning = folders; low-cardinality, ≥1 GB partitions, tables ≳1 TB. High cardinality → small files; causes skew; rigid (rewrite to change).
  3. Z-order (OPTIMIZE … ZORDER BY, Z-curve) co-locates within files; upfront columns; changing = full rewrite; no skew/small-file help.
  4. Liquid clustering (CLUSTER BY) = modern default: change keys anytime no rewrite, incremental OPTIMIZE, fixes skew + small files. Alternative to (not combined with) partitioning/Z-order; new data unclustered until next OPTIMIZE.
  5. Deletion vectors keep UPDATE/DELETE from shredding layout → heavily-updated big table = cluster on join keys + enable DVs. Skew: liquid clustering (disk) + AQE (query time) + salting (manual). maxPartitionBytes = read-split size.

Before you move on — say these without scrolling up

  1. Three failure modes of partitioning — and the cardinality/size rules of thumb.
  2. Z-order vs liquid clustering — the two things liquid clustering does that Z-order can't.
  3. After creating a clustered table, when does new data actually get clustered?
  4. Heavily-updated large table — the two priority actions, and why DVs matter for layout.

Next: on UC managed tables you increasingly don't run OPTIMIZE or pick clustering keys at all — the platform does → [Letting the platform maintain layout — Predictive Optimization & managed tables](/lessons/s6-predictive-optimization/).

Prerequisites

Leads to