Lessons

Cost & Performance Optimisation

Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM

Understand Delta optimization techniques; keep large-dataset queries performant (file sizing, compaction) and manage storage.

[The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/) gave the target: files ~128 MB–1 GB. This lesson is how you get there — and why the cleanup step at the end is the one operation that can break time travel.


The spine

Beat 1 — why small files are inevitable (recall f2)

From [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/): Parquet files are immutable, so every write creates new files. A pipeline appending a small batch every 5 minutes makes 288 files/day, ~2,000/week; over-partitioning or a stray repartition(10000) does it instantly. Left alone, the table drifts to the "millions of tiny files" pathology from lesson 1. Compaction isn't optional maintenance — it's how a live table stays fast.

Beat 2 — the anchor: three ways to right-size, one way to clean up

Anchor. You right-size files after the fact (OPTIMIZE), prevent bad sizes at write time (optimized writes), or auto-fix after every write (auto compaction). Separately, VACUUM deletes the old files those operations leave behind — and VACUUM is the only one here that touches time travel.

Keep two ideas apart from the start, because the exam leans on the confusion:

Predict: you run OPTIMIZE, then VACUUM. Which one made your queries faster, and which one made storage cheaper?

OPTIMIZE (compaction) makes queries fast (fewer, bigger files to read). VACUUM makes storage cheap (deletes dead files). Different operations, different jobs — VACUUM does not speed queries.

Beat 3 — OPTIMIZE, and why it's safe for history

OPTIMIZE reads small files and rewrites them into fewer large ones via bin-packing (fills ~1 GB bins):

OPTIMIZE my_table;
OPTIMIZE my_table WHERE event_date = '2026-06-01';   -- one partition (cheaper)

Predict: after OPTIMIZE, does VERSION AS OF still work?

Yes. In [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/) terms, OPTIMIZE removes the small files from the current state and adds the big ones — but the old small files still physically exist (tombstoned) until VACUUM. So OPTIMIZE does not break time travel; queries just read the compacted files.

Lock it. OPTIMIZE = bin-pack into ~1 GB files, makes queries fast, safe for time travel. VACUUM = delete tombstoned files, makes storage cheap, the risky one.


The dials (skim now; return when a question needs one)

◆ Prevention: optimized writes, auto compaction, Auto Optimize

MethodWhen it runsTrade-off / target
Manual OPTIMIZEon demand / scheduledyou own the schedule; ~1 GB (1024 MB) target
Optimized writesat write time — a shuffle consolidates each partition to one file before writingslower writes, clean files from the start
Auto compactionasynchronously after a write, once a small-file threshold is crossedsmall post-write overhead; runs an OPTIMIZE without Z-order, ~128 MB target; good for high-frequency streaming

Auto Optimize = the umbrella term for optimized writes + auto compaction together. Tell: tiny batches every few minutes where nightly OPTIMIZE can't keep up → auto compaction; clean files immediately on a batch write → optimized writes; big periodic cleanup → OPTIMIZE.

◆ Target file size autotuning (a tested nuance)

Delta autotunes the target file size. For a table with frequent MERGEs, it tunes to a smaller target — smaller files mean each MERGE rewrites less data, cutting MERGE duration. So "why did files get smaller after enabling optimized writes/auto compaction on a MERGE-heavy table?" → autotuned to a smaller target to speed MERGEs, not a bug.

◆ VACUUM — reclaim storage, and the one thing that breaks time travel

VACUUM my_table;                  -- default 7-day retention
VACUUM my_table RETAIN 720 HOURS; -- 30 days
VACUUM my_table DRY RUN;          -- list what WOULD be deleted

Three traps to lock:

The before/after the exam draws:

Before OPTIMIZEAfter OPTIMIZEAfter VACUUM
Query speedslowfastsame as after OPTIMIZE
Storage costhighhigh (old files remain)low
Time travelworksworksbroken past retention

◆ Where it connects

OPTIMIZE rarely travels alone — pair it with a layout clause (OPTIMIZE … ZORDER BY …) so it co-locates similar data while rewriting, tightening the min/max ranges skipping needs ([Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)](/lessons/s6-data-layout/)). And on UC managed tables you increasingly run none of this — Predictive Optimization schedules OPTIMIZE and VACUUM for you ([Letting the platform maintain layout — Predictive Optimization & managed tables](/lessons/s6-predictive-optimization/)).

Takeaways (rebuild it from these)

  1. Small files are inevitable (immutable files + frequent writes); compaction keeps a live table fast.
  2. OPTIMIZE = bin-pack into ~1 GB files (doesn't break time travel). Optimized writes = prevent small files at write (adds a shuffle). Auto compaction = async post-write OPTIMIZE (~128 MB, no Z-order). Auto Optimize = both together. File size autotunes smaller for MERGE-heavy tables.
  3. Compaction makes queries fast; VACUUM makes storage cheap — different jobs. VACUUM deletes tombstoned files past a 7-day default, doesn't speed queries, and is the only operation that breaks time travel (RETAIN 0 HOURS = nukes it; DRY RUN first).
  4. OPTIMIZE usually runs with a layout clause (ZORDER); on managed tables Predictive Optimization can run it all.

Before you move on — say these without scrolling up

  1. OPTIMIZE vs VACUUM — which makes queries fast, which makes storage cheap?
  2. Does OPTIMIZE break time travel? Does VACUUM? Why the difference?
  3. Auto compaction vs optimized writes vs Auto Optimize — when each?
  4. Files got smaller after enabling auto compaction on a MERGE-heavy table — bug or feature?

Next: organizing the contents of those well-sized files so skipping actually skips → [Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)](/lessons/s6-data-layout/).

Prerequisites

Leads to