[The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/) gave the target: files ~128 MB–1 GB. This lesson is how you get there — and why the cleanup step at the end is the one operation that can break time travel.
The spine
Beat 1 — why small files are inevitable (recall f2)
From [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/): Parquet files are immutable, so every write creates new files. A pipeline appending a small batch every 5 minutes makes 288 files/day, ~2,000/week; over-partitioning or a stray repartition(10000) does it instantly. Left alone, the table drifts to the "millions of tiny files" pathology from lesson 1. Compaction isn't optional maintenance — it's how a live table stays fast.
Beat 2 — the anchor: three ways to right-size, one way to clean up
Anchor. You right-size files after the fact (
OPTIMIZE), prevent bad sizes at write time (optimized writes), or auto-fix after every write (auto compaction). Separately,VACUUMdeletes the old files those operations leave behind — andVACUUMis the only one here that touches time travel.
Keep two ideas apart from the start, because the exam leans on the confusion:
Predict: you run
OPTIMIZE, thenVACUUM. Which one made your queries faster, and which one made storage cheaper?
…
OPTIMIZE (compaction) makes queries fast (fewer, bigger files to read). VACUUM makes storage cheap (deletes dead files). Different operations, different jobs — VACUUM does not speed queries.
Beat 3 — OPTIMIZE, and why it's safe for history
OPTIMIZE reads small files and rewrites them into fewer large ones via bin-packing (fills ~1 GB bins):
OPTIMIZE my_table;
OPTIMIZE my_table WHERE event_date = '2026-06-01'; -- one partition (cheaper)
Predict: after
OPTIMIZE, doesVERSION AS OFstill work?
…
Yes. In [How Delta Lake works — the transaction log](/lessons/f2-delta-transaction-log/) terms, OPTIMIZE removes the small files from the current state and adds the big ones — but the old small files still physically exist (tombstoned) until VACUUM. So OPTIMIZE does not break time travel; queries just read the compacted files.
Lock it. OPTIMIZE = bin-pack into ~1 GB files, makes queries fast, safe for time travel. VACUUM = delete tombstoned files, makes storage cheap, the risky one.
The dials (skim now; return when a question needs one)
◆ Prevention: optimized writes, auto compaction, Auto Optimize
| Method | When it runs | Trade-off / target |
|---|---|---|
Manual OPTIMIZE | on demand / scheduled | you own the schedule; ~1 GB (1024 MB) target |
| Optimized writes | at write time — a shuffle consolidates each partition to one file before writing | slower writes, clean files from the start |
| Auto compaction | asynchronously after a write, once a small-file threshold is crossed | small post-write overhead; runs an OPTIMIZE without Z-order, ~128 MB target; good for high-frequency streaming |
Auto Optimize = the umbrella term for optimized writes + auto compaction together. Tell: tiny batches every few minutes where nightly OPTIMIZE can't keep up → auto compaction; clean files immediately on a batch write → optimized writes; big periodic cleanup → OPTIMIZE.
◆ Target file size autotuning (a tested nuance)
Delta autotunes the target file size. For a table with frequent MERGEs, it tunes to a smaller target — smaller files mean each MERGE rewrites less data, cutting MERGE duration. So "why did files get smaller after enabling optimized writes/auto compaction on a MERGE-heavy table?" → autotuned to a smaller target to speed MERGEs, not a bug.
◆ VACUUM — reclaim storage, and the one thing that breaks time travel
VACUUM my_table; -- default 7-day retention
VACUUM my_table RETAIN 720 HOURS; -- 30 days
VACUUM my_table DRY RUN; -- list what WOULD be deleted
Three traps to lock:
- VACUUM does not speed queries — after OPTIMIZE, queries already read only the compacted files; VACUUM only frees storage.
- VACUUM breaks time travel past the retention window — it deletes the files
VERSION AS OFneeds. The only operation here that does. (If a query fails with "file not found," suspect a VACUUM ran.) RETAIN 0 HOURSdeletes all old files immediately → destroys time travel. Non-prod only; alwaysDRY RUNfirst. (And a shallow clone referencing old files can be corrupted by a source VACUUM — UC guards this, but it's why DRY RUN matters.) VACUUM also deletes CDF_change_datafiles past retention ([Change Data Feed — emitting a table's changes downstream](/lessons/s6-cdf/)).
The before/after the exam draws:
| Before OPTIMIZE | After OPTIMIZE | After VACUUM | |
|---|---|---|---|
| Query speed | slow | fast | same as after OPTIMIZE |
| Storage cost | high | high (old files remain) | low |
| Time travel | works | works | broken past retention |
◆ Where it connects
OPTIMIZE rarely travels alone — pair it with a layout clause (OPTIMIZE … ZORDER BY …) so it co-locates similar data while rewriting, tightening the min/max ranges skipping needs ([Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)](/lessons/s6-data-layout/)). And on UC managed tables you increasingly run none of this — Predictive Optimization schedules OPTIMIZE and VACUUM for you ([Letting the platform maintain layout — Predictive Optimization & managed tables](/lessons/s6-predictive-optimization/)).
Takeaways (rebuild it from these)
- Small files are inevitable (immutable files + frequent writes); compaction keeps a live table fast.
OPTIMIZE= bin-pack into ~1 GB files (doesn't break time travel). Optimized writes = prevent small files at write (adds a shuffle). Auto compaction = async post-write OPTIMIZE (~128 MB, no Z-order). Auto Optimize = both together. File size autotunes smaller for MERGE-heavy tables.- Compaction makes queries fast; VACUUM makes storage cheap — different jobs. VACUUM deletes tombstoned files past a 7-day default, doesn't speed queries, and is the only operation that breaks time travel (
RETAIN 0 HOURS= nukes it;DRY RUNfirst). - OPTIMIZE usually runs with a layout clause (
ZORDER); on managed tables Predictive Optimization can run it all.
Before you move on — say these without scrolling up
- OPTIMIZE vs VACUUM — which makes queries fast, which makes storage cheap?
- Does OPTIMIZE break time travel? Does VACUUM? Why the difference?
- Auto compaction vs optimized writes vs Auto Optimize — when each?
- Files got smaller after enabling auto compaction on a MERGE-heavy table — bug or feature?
Next: organizing the contents of those well-sized files so skipping actually skips → [Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)](/lessons/s6-data-layout/).