Incrementally ingest files from cloud storage with Auto Loader (cloudFiles): schema inference/evolution modes, rescued data, formats, throttling, and vs COPY INTO.

In [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/) I said bronze is usually a stateless streaming query — read the source as it arrives and append it. This whole section is about the single most common source: files landing in a cloud bucket. Auto Loader is the purpose-built tool for it, and the exam tests it hard (it's the whole 7% of this section).

We'll learn it the way it actually behaves — as a story that happens to one folder over time. Picture s3://bucket/incoming/ where your restaurant's order files (JSON) drop every few minutes. Watch that folder across three moments: files land, more files land, and then a file lands that doesn't match. Each moment teaches one core idea.

The spine — one landing folder, three moments

Moment 1 — files land: what Auto Loader is

You need every file in incoming/ loaded into a bronze Delta table exactly once — never missed, never reprocessed — and you don't want to babysit which files you've already seen.

Predict: the hand-rolled way is to glob the folder each run and diff it against what's already loaded. Why is that fragile at scale?

…

Because you're maintaining the "what have I seen" state by hand — and at thousands of files a day it drifts: a file processed twice, or missed during a crash. That's the f1 anchor again ([The one job — and the two axes everything lives on](/lessons/f1-the-one-job/): do only the work a change demands) applied to files — and Auto Loader is the machine that does it for you:

Anchor. Auto Loader (format("cloudFiles")) incrementally and idempotently discovers and ingests only the new files in a cloud location, tracking what it has already processed in its checkpoint. It is Structured Streaming under the hood ([Structured Streaming & the state model](/lessons/s1-structured-streaming-state/)) — so it's exactly-once and needs a checkpointLocation — just pointed at files instead of Kafka.

(spark.readStream.format("cloudFiles")
   .option("cloudFiles.format", "json")
   .option("cloudFiles.schemaLocation", "/chk/schema")
   .load("s3://bucket/incoming/"))

Lock it. cloudFiles = incremental, exactly-once file ingestion. It's streaming (from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/)) aimed at a folder, so it carries a checkpoint.

Moment 2 — more files land: how does it know which are new?

Overnight, 5 new files drop into a folder that already holds 10,000 processed ones.

Predict: how does Auto Loader ingest only those 5 the next run — without re-reading all 10,005?

…

Two halves. Remembering what's done is the checkpoint (the same progress-bookmark idea from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/) — it records which files are already ingested). Discovering what's new is a separate choice, and it's tested directly:

Directory listing mode (the default) — Auto Loader lists the input directory and compares against the checkpoint to find new files. Simple, no setup.
File notification mode — instead of listing, it subscribes to the cloud provider's file-arrival events (SNS/SQS on S3, Event Grid on ADLS). No expensive listing — built for very high file volumes where listing a giant directory every trigger is the bottleneck.

Tell: "default Auto Loader execution mode" → directory listing. "millions of files, directory listing is too slow/expensive" → file notification.

Lock it. Checkpoint = what's already done. Discovery = directory listing (default) vs file notification (high volume, event-driven).

Moment 3 — a file lands that doesn't match: inference, evolution, rescue

For months every order file looked like {order_id, amount} with amount an integer. Today two odd files arrive: one has a brand-new column tip; another sends amount as the string "twenty".

Predict: does the mismatched row crash the stream, get silently dropped, or something else?

…

The answer depends on one setting — but first, how Auto Loader even has a schema to violate. It infers the schema by sampling files, and cloudFiles.schemaLocation persists that inferred schema so later runs skip re-sampling (cheaper, stable) — it's the companion to the checkpoint. Now, when a new column appears, cloudFiles.schemaEvolutionMode decides what happens:

Mode	On a new column	Stream fails?
`addNewColumns` (default)	stream fails once, but the new column is added to the schema — restart picks it up	yes, once, then evolves
`rescue`	schema does not evolve; unexpected data goes to `_rescued_data`	never fails
`failOnNewColumns`	stream fails and does not evolve — you fix the schema manually	yes
`none`	ignore new columns, no rescue	no

The surprise worth sitting with: the default (addNewColumns) fails the stream — but productively. It records the new column, so one restart continues with the wider schema. People expect the default to "just handle it silently"; it doesn't.

And the "twenty"-where-int-expected value? That's what _rescued_data is for — a catch-all column holding anything that doesn't fit the schema (type mismatches, missing columns, extra fields). The bad value isn't lost and doesn't crash the row; it lands in _rescued_data, where you can inspect or route it. That's the exact split point for the quarantine pattern ([Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/)): rows with a non-null _rescued_data are your bad-data path.

Lock it. Auto Loader infers a schema (persisted in schemaLocation). New column → schemaEvolutionMode (default addNewColumns = fail-once-then-add; rescue = never fail). Mismatched values → _rescued_data (the quarantine split).

The dials (skim now; return when a question needs one)

◆ Two "bad data" catchers — don't confuse them

_rescued_data — schema mismatches (type/extra/missing), kept as a column on the good rows. The quarantine split ([Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/)).
badRecordsPath — a Spark-native read option (JSON/CSV) that writes unparseable records/files to a path instead of failing. Use for genuinely corrupt input, not schema drift.

◆ Formats — including images/PDFs

cloudFiles.format = json, csv, parquet, avro, and binaryFile. For images/PDFs, binaryFile ingests each file as raw bytes. Tell: "ingest JPEG/PNG images incrementally" → Auto Loader + binaryFile.

◆ Filtering and throttling

pathGlobFilter — standard Spark file-source option to restrict ingestion to a glob (e.g. only *.jpg).
cloudFiles.maxBytesPerTrigger (and maxFilesPerTrigger) — a soft cap on how much a micro-batch processes. Tell: large files cause long, unpredictable micro-batches → cap with maxBytesPerTrigger.

◆ Reader evolution vs writer evolution (two different knobs)

cloudFiles.schemaEvolutionMode governs the reader — what Auto Loader does when it sees a new column.
mergeSchema=true on the Delta write governs whether new columns are actually added to the target table.

A fully-evolving append pipeline often needs both — the reader accepts the column and the writer adds it downstream.

◆ Where it sits — vs COPY INTO

Both load files idempotently (each tracks what it's loaded), for different shapes:

Auto Loader — continuous / high-volume / streaming ingestion of many files, with schema evolution. The default for "new files keep arriving."
COPY INTO — a simpler SQL, batch idempotent load; good for smaller or periodic loads.

Tell: "incrementally process new files as they arrive, exactly-once, minimal infra, schema evolution" → Auto Loader; "simple periodic SQL batch load" → COPY INTO.

Takeaways (rebuild it from these)

Auto Loader = format("cloudFiles"): incremental, idempotent, exactly-once file ingestion. It's Structured Streaming, so it needs a checkpoint (and a schemaLocation for the inferred schema).
Discovery: directory listing (default) vs file notification (event-driven, for very high volume).
Schema evolution modes: addNewColumns (default — fail once, then add), rescue (never fail, extras → _rescued_data), failOnNewColumns (fail, no evolve), none.
Bad data: _rescued_data (schema mismatch → quarantine split, [Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/)) vs badRecordsPath (unparseable). binaryFile for images/PDFs; pathGlobFilter to select; maxBytesPerTrigger to cap batch size.
Reader schemaEvolutionMode ≠ writer mergeSchema. Auto Loader = continuous/high-volume streaming; COPY INTO = simple batch SQL load.

Before you move on — say these without scrolling up

Why is hand-globbing a folder fragile — and what does Auto Loader use to know a file is already done?
10,000 files processed, 5 new ones land — the two discovery modes, and which is the default?
A file arrives with a new column under the default mode — what happens to the stream, exactly?
amount arrives as "twenty" instead of an int — where does that value go, and how does that feed quarantine?
Auto Loader vs COPY INTO — one tell for each.

Section 2 is one lesson because it's one deep tool. Next, we pick up where _rescued_data pointed — cleaning and quality-gating that ingested data in Section 3.

Auto Loader — incremental file ingestion and schema evolution