In [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/) I said bronze is usually a stateless streaming query — read the source as it arrives and append it. This whole section is about the single most common source: files landing in a cloud bucket. Auto Loader is the purpose-built tool for it, and the exam tests it hard (it's the whole 7% of this section).
We'll learn it the way it actually behaves — as a story that happens to one folder over time. Picture s3://bucket/incoming/ where your restaurant's order files (JSON) drop every few minutes. Watch that folder across three moments: files land, more files land, and then a file lands that doesn't match. Each moment teaches one core idea.
The spine — one landing folder, three moments
Moment 1 — files land: what Auto Loader is
You need every file in incoming/ loaded into a bronze Delta table exactly once — never missed, never reprocessed — and you don't want to babysit which files you've already seen.
Predict: the hand-rolled way is to glob the folder each run and diff it against what's already loaded. Why is that fragile at scale?
…
Because you're maintaining the "what have I seen" state by hand — and at thousands of files a day it drifts: a file processed twice, or missed during a crash. That's the f1 anchor again ([The one job — and the two axes everything lives on](/lessons/f1-the-one-job/): do only the work a change demands) applied to files — and Auto Loader is the machine that does it for you:
Anchor. Auto Loader (
format("cloudFiles")) incrementally and idempotently discovers and ingests only the new files in a cloud location, tracking what it has already processed in its checkpoint. It is Structured Streaming under the hood ([Structured Streaming & the state model](/lessons/s1-structured-streaming-state/)) — so it's exactly-once and needs acheckpointLocation— just pointed at files instead of Kafka.
(spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", "/chk/schema")
.load("s3://bucket/incoming/"))
Lock it.
cloudFiles= incremental, exactly-once file ingestion. It's streaming (from[Structured Streaming & the state model](/lessons/s1-structured-streaming-state/)) aimed at a folder, so it carries a checkpoint.
Moment 2 — more files land: how does it know which are new?
Overnight, 5 new files drop into a folder that already holds 10,000 processed ones.
Predict: how does Auto Loader ingest only those 5 the next run — without re-reading all 10,005?
…
Two halves. Remembering what's done is the checkpoint (the same progress-bookmark idea from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/) — it records which files are already ingested). Discovering what's new is a separate choice, and it's tested directly:
- Directory listing mode (the default) — Auto Loader lists the input directory and compares against the checkpoint to find new files. Simple, no setup.
- File notification mode — instead of listing, it subscribes to the cloud provider's file-arrival events (SNS/SQS on S3, Event Grid on ADLS). No expensive listing — built for very high file volumes where listing a giant directory every trigger is the bottleneck.
Tell: "default Auto Loader execution mode" → directory listing. "millions of files, directory listing is too slow/expensive" → file notification.
Lock it. Checkpoint = what's already done. Discovery = directory listing (default) vs file notification (high volume, event-driven).
Moment 3 — a file lands that doesn't match: inference, evolution, rescue
For months every order file looked like {order_id, amount} with amount an integer. Today two odd files arrive: one has a brand-new column tip; another sends amount as the string "twenty".
Predict: does the mismatched row crash the stream, get silently dropped, or something else?
…
The answer depends on one setting — but first, how Auto Loader even has a schema to violate. It infers the schema by sampling files, and cloudFiles.schemaLocation persists that inferred schema so later runs skip re-sampling (cheaper, stable) — it's the companion to the checkpoint. Now, when a new column appears, cloudFiles.schemaEvolutionMode decides what happens:
| Mode | On a new column | Stream fails? |
|---|---|---|
addNewColumns (default) | stream fails once, but the new column is added to the schema — restart picks it up | yes, once, then evolves |
rescue | schema does not evolve; unexpected data goes to _rescued_data | never fails |
failOnNewColumns | stream fails and does not evolve — you fix the schema manually | yes |
none | ignore new columns, no rescue | no |
The surprise worth sitting with: the default (addNewColumns) fails the stream — but productively. It records the new column, so one restart continues with the wider schema. People expect the default to "just handle it silently"; it doesn't.
And the "twenty"-where-int-expected value? That's what _rescued_data is for — a catch-all column holding anything that doesn't fit the schema (type mismatches, missing columns, extra fields). The bad value isn't lost and doesn't crash the row; it lands in _rescued_data, where you can inspect or route it. That's the exact split point for the quarantine pattern ([Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/)): rows with a non-null _rescued_data are your bad-data path.
Lock it. Auto Loader infers a schema (persisted in
schemaLocation). New column →schemaEvolutionMode(defaultaddNewColumns= fail-once-then-add;rescue= never fail). Mismatched values →_rescued_data(the quarantine split).
The dials (skim now; return when a question needs one)
◆ Two "bad data" catchers — don't confuse them
_rescued_data— schema mismatches (type/extra/missing), kept as a column on the good rows. The quarantine split ([Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/)).badRecordsPath— a Spark-native read option (JSON/CSV) that writes unparseable records/files to a path instead of failing. Use for genuinely corrupt input, not schema drift.
◆ Formats — including images/PDFs
cloudFiles.format = json, csv, parquet, avro, and binaryFile. For images/PDFs, binaryFile ingests each file as raw bytes. Tell: "ingest JPEG/PNG images incrementally" → Auto Loader + binaryFile.
◆ Filtering and throttling
pathGlobFilter— standard Spark file-source option to restrict ingestion to a glob (e.g. only*.jpg).cloudFiles.maxBytesPerTrigger(andmaxFilesPerTrigger) — a soft cap on how much a micro-batch processes. Tell: large files cause long, unpredictable micro-batches → cap withmaxBytesPerTrigger.
◆ Reader evolution vs writer evolution (two different knobs)
cloudFiles.schemaEvolutionModegoverns the reader — what Auto Loader does when it sees a new column.mergeSchema=trueon the Delta write governs whether new columns are actually added to the target table.
A fully-evolving append pipeline often needs both — the reader accepts the column and the writer adds it downstream.
◆ Where it sits — vs COPY INTO
Both load files idempotently (each tracks what it's loaded), for different shapes:
- Auto Loader — continuous / high-volume / streaming ingestion of many files, with schema evolution. The default for "new files keep arriving."
COPY INTO— a simpler SQL, batch idempotent load; good for smaller or periodic loads.
Tell: "incrementally process new files as they arrive, exactly-once, minimal infra, schema evolution" → Auto Loader; "simple periodic SQL batch load" → COPY INTO.
Takeaways (rebuild it from these)
- Auto Loader =
format("cloudFiles"): incremental, idempotent, exactly-once file ingestion. It's Structured Streaming, so it needs a checkpoint (and aschemaLocationfor the inferred schema). - Discovery: directory listing (default) vs file notification (event-driven, for very high volume).
- Schema evolution modes:
addNewColumns(default — fail once, then add),rescue(never fail, extras →_rescued_data),failOnNewColumns(fail, no evolve),none. - Bad data:
_rescued_data(schema mismatch → quarantine split,[Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/)) vsbadRecordsPath(unparseable).binaryFilefor images/PDFs;pathGlobFilterto select;maxBytesPerTriggerto cap batch size. - Reader
schemaEvolutionMode≠ writermergeSchema. Auto Loader = continuous/high-volume streaming;COPY INTO= simple batch SQL load.
Before you move on — say these without scrolling up
- Why is hand-globbing a folder fragile — and what does Auto Loader use to know a file is already done?
- 10,000 files processed, 5 new ones land — the two discovery modes, and which is the default?
- A file arrives with a new column under the default mode — what happens to the stream, exactly?
amountarrives as"twenty"instead of an int — where does that value go, and how does that feed quarantine?- Auto Loader vs COPY INTO — one tell for each.
Section 2 is one lesson because it's one deep tool. Next, we pick up where _rescued_data pointed — cleaning and quality-gating that ingested data in Section 3.