Lessons

Data Modelling

Delta data models — managed vs external, clones, and materialization

Choose Delta table types (managed vs external), copy safely with shallow/deep clone, and pick the right materialization (view, table, materialized view).

Beyond what you model (dimensions, facts) sit the physical choices: where the table's data lives, how you copy it, and whether a result is recomputed each read or stored. Three small decisions, each with a sharp tell.


The spine

Beat 1 — managed vs external: the LOCATION keyword decides

Predict: you CREATE TABLE … LOCATION 's3://…/mydata'. Later you DROP it. What happens to the files?

They survive — because LOCATION made it external, and DROP on an external table removes only metadata.

Anchor. Specifying a LOCATION = external (unmanaged) — data lives at your path; DROP removes only metadata (files stay). No LOCATION = managed — Databricks owns the storage; DROP deletes the underlying files ([PII lifecycle — deletion, retention, and the right to be forgotten](/lessons/s7-pii-lifecycle/)).

"Which is external?" → the one created with LOCATION. A subtlety: if a schema is created with a LOCATION, a CTAS inside it with no explicit table location inherits the schema's location (the table lands under that path — even as a managed table in a mounted container).

Beat 2 — clones: shallow (cheap, fragile) vs deep (independent, safe)

The difference is whether the data files are copied:

Tell: "fast zero-copy sandbox" → shallow (don't VACUUM the source); "backup that stays valid no matter what happens to the source" → deep.

Lock it. LOCATION → external (DROP keeps files); no LOCATION → managed (DROP deletes files). Shallow clone = metadata + shared files (fragile); deep clone = copied files (independent).


The dials (skim now; return when a question needs one)

◆ Materialization — recompute or store?

For a gold-layer result, choose when the work happens:

ObjectStores data?RecomputedUse when
Viewnoevery readcheap query, rarely run, always-fresh
Table (batch)yeswhen you rebuild ityou control refresh explicitly
Materialized viewyes (precomputed)auto / scheduled, incrementallyfrequently-queried aggregates needing only periodic refresh
Streaming tableyesincrementally per new datacontinuous append ingestion ([Streaming tables vs materialized views](/lessons/s1-streaming-tables-vs-mv/))

Predict: a gold metric is queried constantly but only needs to recompute daily. A plain view or something else?

Not a plain view (it re-runs the heavy aggregation on every read) → materialize it: a materialized view (precomputed, auto-refreshed) or a nightly batch table overwritten each run. Tell: "precomputed results, refreshed automatically/on a schedule, fast reads" → materialized view; "raw incremental ingestion" → streaming table.

◆ Declare the schema — don't lean on inference

Databricks infers schema using the most permissive types that fit the sampled data — which quietly accepts values you'd rather reject and can widen types. When correctness matters, explicitly declare the schema (types + constraints). Declaration is enforcement; inference is convenience.

Takeaways (rebuild it from these)

  1. LOCATION → external (unmanaged); no LOCATIONmanaged. DROP deletes data for managed, metadata-only for external. A schema's LOCATION is inherited by CTAS tables created in it.
  2. SHALLOW CLONE = metadata only, references source files (cheap, but source VACUUM breaks it); DEEP CLONE = copies files, independent (safe backup).
  3. Materialize by need: view (recompute each read) · table (manual rebuild) · materialized view (precomputed, auto/scheduled — for frequent reads) · streaming table (incremental ingestion).
  4. Gold metrics queried often but refreshed daily → materialized view / nightly batch table, not a plain view.
  5. Declare schemas explicitly for quality — inference picks the most permissive types.

Before you move on — say these without scrolling up

  1. CREATE TABLE … LOCATION … then DROP — do the files survive? Managed or external?
  2. Shallow vs deep clone — which breaks if you VACUUM the source, and why?
  3. A metric queried constantly but refreshed daily — which materialization, and why not a view?
  4. Why declare a schema instead of trusting inference?

That completes Section 10 — and the last of the section lessons: model the dimensions (SCD/star) → choose the physical table (managed/clone/materialization).

Prerequisites