Lessons

Data Ingestion

Auto Loader — incremental file ingestion and schema evolution

Incrementally ingest files from cloud storage with Auto Loader (cloudFiles): schema inference/evolution modes, rescued data, formats, throttling, and vs COPY INTO.

In [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/) I said bronze is usually a stateless streaming query — read the source as it arrives and append it. This whole section is about the single most common source: files landing in a cloud bucket. Auto Loader is the purpose-built tool for it, and the exam tests it hard (it's the whole 7% of this section).

We'll learn it the way it actually behaves — as a story that happens to one folder over time. Picture s3://bucket/incoming/ where your restaurant's order files (JSON) drop every few minutes. Watch that folder across three moments: files land, more files land, and then a file lands that doesn't match. Each moment teaches one core idea.


The spine — one landing folder, three moments

Moment 1 — files land: what Auto Loader is

You need every file in incoming/ loaded into a bronze Delta table exactly once — never missed, never reprocessed — and you don't want to babysit which files you've already seen.

Predict: the hand-rolled way is to glob the folder each run and diff it against what's already loaded. Why is that fragile at scale?

Because you're maintaining the "what have I seen" state by hand — and at thousands of files a day it drifts: a file processed twice, or missed during a crash. That's the f1 anchor again ([The one job — and the two axes everything lives on](/lessons/f1-the-one-job/): do only the work a change demands) applied to files — and Auto Loader is the machine that does it for you:

Anchor. Auto Loader (format("cloudFiles")) incrementally and idempotently discovers and ingests only the new files in a cloud location, tracking what it has already processed in its checkpoint. It is Structured Streaming under the hood ([Structured Streaming & the state model](/lessons/s1-structured-streaming-state/)) — so it's exactly-once and needs a checkpointLocation — just pointed at files instead of Kafka.

(spark.readStream.format("cloudFiles")
   .option("cloudFiles.format", "json")
   .option("cloudFiles.schemaLocation", "/chk/schema")
   .load("s3://bucket/incoming/"))

Lock it. cloudFiles = incremental, exactly-once file ingestion. It's streaming (from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/)) aimed at a folder, so it carries a checkpoint.

Moment 2 — more files land: how does it know which are new?

Overnight, 5 new files drop into a folder that already holds 10,000 processed ones.

Predict: how does Auto Loader ingest only those 5 the next run — without re-reading all 10,005?

Two halves. Remembering what's done is the checkpoint (the same progress-bookmark idea from [Structured Streaming & the state model](/lessons/s1-structured-streaming-state/) — it records which files are already ingested). Discovering what's new is a separate choice, and it's tested directly:

Tell: "default Auto Loader execution mode"directory listing. "millions of files, directory listing is too slow/expensive"file notification.

Lock it. Checkpoint = what's already done. Discovery = directory listing (default) vs file notification (high volume, event-driven).

Moment 3 — a file lands that doesn't match: inference, evolution, rescue

For months every order file looked like {order_id, amount} with amount an integer. Today two odd files arrive: one has a brand-new column tip; another sends amount as the string "twenty".

Predict: does the mismatched row crash the stream, get silently dropped, or something else?

The answer depends on one setting — but first, how Auto Loader even has a schema to violate. It infers the schema by sampling files, and cloudFiles.schemaLocation persists that inferred schema so later runs skip re-sampling (cheaper, stable) — it's the companion to the checkpoint. Now, when a new column appears, cloudFiles.schemaEvolutionMode decides what happens:

ModeOn a new columnStream fails?
addNewColumns (default)stream fails once, but the new column is added to the schema — restart picks it upyes, once, then evolves
rescueschema does not evolve; unexpected data goes to _rescued_datanever fails
failOnNewColumnsstream fails and does not evolve — you fix the schema manuallyyes
noneignore new columns, no rescueno

The surprise worth sitting with: the default (addNewColumns) fails the stream — but productively. It records the new column, so one restart continues with the wider schema. People expect the default to "just handle it silently"; it doesn't.

And the "twenty"-where-int-expected value? That's what _rescued_data is for — a catch-all column holding anything that doesn't fit the schema (type mismatches, missing columns, extra fields). The bad value isn't lost and doesn't crash the row; it lands in _rescued_data, where you can inspect or route it. That's the exact split point for the quarantine pattern ([Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/)): rows with a non-null _rescued_data are your bad-data path.

Lock it. Auto Loader infers a schema (persisted in schemaLocation). New column → schemaEvolutionMode (default addNewColumns = fail-once-then-add; rescue = never fail). Mismatched values → _rescued_data (the quarantine split).


The dials (skim now; return when a question needs one)

◆ Two "bad data" catchers — don't confuse them

◆ Formats — including images/PDFs

cloudFiles.format = json, csv, parquet, avro, and binaryFile. For images/PDFs, binaryFile ingests each file as raw bytes. Tell: "ingest JPEG/PNG images incrementally" → Auto Loader + binaryFile.

◆ Filtering and throttling

◆ Reader evolution vs writer evolution (two different knobs)

A fully-evolving append pipeline often needs both — the reader accepts the column and the writer adds it downstream.

◆ Where it sits — vs COPY INTO

Both load files idempotently (each tracks what it's loaded), for different shapes:

Tell: "incrementally process new files as they arrive, exactly-once, minimal infra, schema evolution"Auto Loader; "simple periodic SQL batch load"COPY INTO.

Takeaways (rebuild it from these)

  1. Auto Loader = format("cloudFiles"): incremental, idempotent, exactly-once file ingestion. It's Structured Streaming, so it needs a checkpoint (and a schemaLocation for the inferred schema).
  2. Discovery: directory listing (default) vs file notification (event-driven, for very high volume).
  3. Schema evolution modes: addNewColumns (default — fail once, then add), rescue (never fail, extras → _rescued_data), failOnNewColumns (fail, no evolve), none.
  4. Bad data: _rescued_data (schema mismatch → quarantine split, [Quarantining bad data — the third option beyond drop and fail](/lessons/s3-quarantine/)) vs badRecordsPath (unparseable). binaryFile for images/PDFs; pathGlobFilter to select; maxBytesPerTrigger to cap batch size.
  5. Reader schemaEvolutionMode ≠ writer mergeSchema. Auto Loader = continuous/high-volume streaming; COPY INTO = simple batch SQL load.

Before you move on — say these without scrolling up

  1. Why is hand-globbing a folder fragile — and what does Auto Loader use to know a file is already done?
  2. 10,000 files processed, 5 new ones land — the two discovery modes, and which is the default?
  3. A file arrives with a new column under the default mode — what happens to the stream, exactly?
  4. amount arrives as "twenty" instead of an int — where does that value go, and how does that feed quarantine?
  5. Auto Loader vs COPY INTO — one tell for each.

Section 2 is one lesson because it's one deep tool. Next, we pick up where _rescued_data pointed — cleaning and quality-gating that ingested data in Section 3.

Prerequisites