In [The one job — and the two axes everything lives on](/lessons/f1-the-one-job/) the target of every pipeline was "a Delta table," and I asked you to take that on trust. This lesson pays that debt. Almost every storage, optimisation, and modelling question on the exam is really a question about the one mechanism we build here — so if you own this, a whole third of the bank becomes derivable.
First — what Delta Lake is (and is not)
Get the family right before any mechanics (this is the trap most people fall into). Delta Lake is not a database, and not a new storage system. It does not hold your data in some special place. Your data is, and stays, ordinary Parquet files sitting in cheap cloud object storage (ADLS, S3, GCS). What Delta Lake adds is a metadata layer — a transaction log — written next to those Parquet files, in a folder called _delta_log. That log is the entire difference between "a pile of files" and "a reliable table."
So the relationship, plainly: object storage holds the files; Parquet is the file format; Delta Lake is the log laid on top that turns those files into a table with guarantees. It doesn't move or rewrite your data to exist — it just starts keeping a ledger about it. (This is why the lakehouse from [The one job — and the two axes everything lives on](/lessons/f1-the-one-job/) is possible at all: warehouse-grade reliability on top of lake-grade cheap files.)
Why a plain pile of files isn't a table — the pain
Object storage on its own is "a filesystem with a billing model." It has no notion of a transaction, no schema enforcement, no way to update one row, no cleanup when a write dies halfway. Three concrete failures follow, and each maps to one letter of ACID — worth feeling before we fix them:
- The half-finished write. Your job writes 1,000,000 rows to a Silver table; the cluster dies at row 500,000. Now the folder holds 500,000 rows in partial files. A downstream read is silently, exactly 50% wrong. Nobody gets an error. (no Atomicity)
- The concurrent read-write mess. A nightly write is landing new files while a data scientist reads the same table. They get some old files and some half-written new ones — a model trained on data that never actually existed at any moment. (no Isolation)
- The schema corruption. An upstream change starts sending
priceas the string"Twenty Dollars". Object storage accepts anything, so the bad rows land, and tomorrow's dashboardSUM(price)crashes — or worse, quietly returns garbage. (no Consistency)
Delta fixes all three with one idea: the log.
The anchor: current state = every file added, minus every file removed
This single sentence generates everything else, so plant it and keep returning to it:
Anchor. A Delta table's current contents = the set of data files that have been added and not later removed, according to the log. Delta never edits a Parquet file's bytes; it only records "this file is now part of the table" or "this file no longer is."
Everything below — reading, updating, deleting, time travel, ACID — is just a consequence of keeping an honest add/remove ledger.
The transaction log, concretely
A Delta table on disk is two things: the Parquet data files, and the _delta_log folder beside them.
my_table/
├── _delta_log/
│ ├── 00000000000000000000.json ← version 0 (e.g. CREATE)
│ ├── 00000000000000000001.json ← version 1 (INSERT)
│ ├── 00000000000000000002.json ← version 2 (UPDATE)
│ └── ...
├── part-00000-aaa.parquet ← data files (Parquet, immutable)
├── part-00001-bbb.parquet
└── part-00002-ccc.parquet
Each NN...N.json is one commit — the atomic record of one transaction. A commit holds up to four kinds of entry, but only the last matters for the anchor:
- commitInfo — the audit line: who, what operation, when, from which notebook/job. (This is what powers
DESCRIBE HISTORY.) - metaData — the table schema (column names, types, nullability). Written on create and whenever the schema changes.
- protocol — the minimum Delta reader/writer version needed to read the table safely.
- add / remove — the file operations.
add= "this Parquet file is now part of the table" (and it carries the file's per-column min/max stats, which later power data skipping — see[The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/)).remove= "this file is no longer part of the current table."
Reading the table is then trivial to state: Delta does not replay history row-by-row. It walks the log, takes every add, subtracts every remove, and reads exactly the Parquet files still standing. A worked example — five commits on an invoices table:
| Version | Operation | Active files after |
|---|---|---|
| v0 | CREATE + load invoices 101–200 | file1 |
| v1 | INSERT invoices 1–100 | file1, file2 |
| v2 | UPDATE one row | file1, file2 (+deletion vector), file3 |
| v3 | DELETE one row | file1, file2 (+updated deletion vector), file3 |
| v4 | INSERT invoices 201–99457 | file1, file2, file3, file4 |
SELECT * at v4 reads the four active files (applying deletion vectors — defined below); any file that was ever removed is simply ignored.
Scaling the log — and the checkpoint collision resolved. Thousands of tiny JSON commits would be slow to read, so every ~10 commits Delta writes a Delta-log checkpoint: a single Parquet snapshot of the net current state (adds minus removes so far). A read then loads the latest checkpoint plus only the handful of JSON commits after it — cost is O(commits since last checkpoint), never O(all history).
Now, the word you were promised in [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/): this Delta-log checkpoint is a different machine from the streaming checkpoint. Same word, two meanings — ask "which is in play?":
| "checkpoint" | What it is | Where it lives |
|---|---|---|
| Streaming checkpoint | a streaming query's progress bookmark + operation state, for crash recovery | the query's checkpointLocation |
| Delta-log checkpoint | a periodic Parquet snapshot of a table's transaction log, so reads stay fast | inside the table's _delta_log/ |
One belongs to a running query; the other belongs to a table at rest. They never touch.
ACID, each solving one of the three pains
- Atomicity — all or nothing. Delta writes the new Parquet files first, then writes the commit JSON as one atomic operation. If the cluster dies before the JSON lands, those Parquet files are orphans — no
addentry ever references them, so they are invisible to every reader and get cleaned up later by VACUUM (see[Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM](/lessons/s6-compaction/)). The half-finished write simply never becomes part of the table. - Consistency — schema-on-write. Every write is validated against the table's schema before the commit. Wrong type, missing NOT-NULL, a failed CHECK constraint → the write fails before any file is registered. Bad data is rejected at the door, not discovered at query time.
- Isolation — snapshot isolation via OCC. Define it: Optimistic Concurrency Control assumes conflicts are rare, so it takes no locks — every writer proceeds freely, and conflict is checked only at commit time. Writer T2 asks "did any file I touched change between the version I read and now?" If T1 and T2 touched different files, both commit; only if they touched the same files does T2 retry. Crucially Delta works at the file level, not the row level — in analytics, writers usually hit different partitions, so real conflicts are rare and OCC wins. (Contrast: a traditional database uses pessimistic locking — lock the data up front, make others wait — which is right for high-contention row-level banking, wrong for file-level analytics.)
- Durability. Once the commit JSON is in object storage, it's durable — ADLS/S3 replicate with 11-nines durability. Cluster death after commit changes nothing; the next read sees the committed state.
What actually happens to files on each DML — the rule that makes you dangerous
The rule: Parquet files are IMMUTABLE. Delta never changes bytes inside an existing file. Every change writes new files and records add/remove. (This is forced by physics — you cannot seek into a file on S3/ADLS and overwrite four bytes; object storage is write-once. Delta works with that constraint, not against it.)
- INSERT → write a new Parquet file, record one
add. - UPDATE / DELETE → here's the clever part. Rewriting a whole 1-million-row file to change one row would be brutal, so Databricks uses deletion vectors (a technique called Merge-on-Read). Define it: a deletion vector (DV) is a tiny bitmap stored beside a Parquet file that marks which row positions to ignore when reading. So an UPDATE writes (a) a DV marking the old row dead in its file, and (b) a small new file holding the updated row; a DELETE just writes/updates a DV — a "soft delete." No big rewrite. On read, Delta loads the file, skips the DV-marked rows, and adds the new file — the anchor still holds (adds minus removes, now with "ignore these positions" layered on). This is why UPDATE/DELETE on Delta is cheap. Two precisions to be exact (verify against docs, since defaults shift): deletion vectors are not blanket-on for every Delta table — they're auto-enabled on new tables in recent runtimes / SQL warehouses and can be turned on explicitly, but don't assume an arbitrary table has them; and the soft-deleted rows are only physically removed when you run
OPTIMIZEorREORG TABLE … APPLY (PURGE). The full performance story is[Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)](/lessons/s6-data-layout/). - MERGE → the upsert: INSERT + UPDATE + DELETE in one atomic commit, matched on a key. Recall
[The one job — and the two axes everything lives on](/lessons/f1-the-one-job/): a changing record is handled by a merge into the target — this is that operation, and it's the engine under[APPLY CHANGES — CDC and SCD, declaratively](/lessons/s1-apply-changes-cdc/)(APPLY CHANGES is MERGE that Databricks writes for you). One tell worth banking: INSERT maps columns by position; MERGE maps by name — so when column order might differ, MERGE is the safe choice.
Batch writes: CTAS, overwrite, append, and DROP
The DML above is row-level; you also write tables in bulk, and a cluster of exam questions lives here. All of it is still just adds/removes on the log — but the behaviours are worth stating.
- CTAS (
CREATE TABLE t AS SELECT …) — the SELECT runs once, at definition time, and its result is materialized into a new Delta table. Afterwards, queryingtreads the stored data; it does not re-run the SELECT. That's the key contrast with a view (which recomputes on every read). So "when does the join logic in a CTAS execute?" → once, when the table is defined. - CTAS is independent of its source. Because CTAS copied the data into a new table, dropping the source table afterwards leaves the CTAS table intact — they share no files. ("Drop the source" → only the source goes; the target is unaffected.)
- Write modes —
df.write.mode("append").saveAsTable(t)adds rows (a newaddcommit);mode("overwrite")replaces the table's current data with the new result — the old files areremoved from the current state but still exist for time travel until VACUUM. A nightly "completely overwritten from source" table is exactlyoverwrite, and comparing last-night-vs-tonight is a job for time travel +EXCEPT(below). saveAsTablevssave(recall from the family framing):saveAsTableregisters the table in Unity Catalog (governed, queryable by name);savejust writes files to a path with no catalog entry.- DROP — recall managed vs external:
DROPon a managed table deletes the files; on an external table it removes only the metadata and the files survive.
Time travel — a free consequence
Because the log preserves every add/remove, the state at any past version is fully reconstructable. SELECT * FROM t VERSION AS OF 3 or TIMESTAMP AS OF '2026-01-15' reads the files that were active then. RESTORE creates a new version matching an old state (history is preserved, not erased). This works right up until VACUUM physically deletes the old files — the one action that truly breaks time travel (again, [Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM](/lessons/s6-compaction/)).
Takeaways (rebuild it from these)
- Delta Lake = a transaction log (
_delta_log) over immutable Parquet files in object storage. Not a database, not new storage — a ledger on top. - Current state = adds − removes. Every capability derives from that ledger.
- A commit JSON carries commitInfo / metaData / protocol / add-remove; Delta-log checkpoints snapshot the net state so reads stay O(since last checkpoint). This checkpoint is not the streaming checkpoint.
- ACID, each solving a real failure: atomic JSON commit; schema-on-write; OCC file-level isolation (no locks, conflict at commit); durable object storage.
- Parquet is immutable — every change writes new files. UPDATE/DELETE use deletion vectors (mark rows dead, don't rewrite); MERGE is the keyed upsert (maps by name). Time travel falls out of the preserved log.
From this foundation, Section 6 becomes "what you do to these files to keep them fast" — compaction and cleanup ([Right-sizing files — OPTIMIZE, optimized writes, auto compaction, VACUUM](/lessons/s6-compaction/)), layout ([Organizing files — partitioning, Z-order, liquid clustering (and deletion vectors)](/lessons/s6-data-layout/), [The performance model — why a query is slow, and the one lever](/lessons/s6-performance-model/)), and change streams ([Change Data Feed — emitting a table's changes downstream](/lessons/s6-cdf/)) — and Section 10 becomes "how you shape them into models" ([Delta data models — managed vs external, clones, and materialization](/lessons/s10-data-models/)).