Choose Delta table types (managed vs external), copy safely with shallow/deep clone, and pick the right materialization (view, table, materialized view).

Beyond what you model (dimensions, facts) sit the physical choices: where the table's data lives, how you copy it, and whether a result is recomputed each read or stored. Three small decisions, each with a sharp tell.

The spine

Beat 1 — managed vs external: the `LOCATION` keyword decides

Predict: you CREATE TABLE … LOCATION 's3://…/mydata'. Later you DROP it. What happens to the files?

…

They survive — because LOCATION made it external, and DROP on an external table removes only metadata.

Anchor. Specifying a LOCATION = external (unmanaged) — data lives at your path; DROP removes only metadata (files stay). No LOCATION = managed — Databricks owns the storage; DROP deletes the underlying files ([PII lifecycle — deletion, retention, and the right to be forgotten](/lessons/s7-pii-lifecycle/)).

"Which is external?" → the one created with LOCATION. A subtlety: if a schema is created with a LOCATION, a CTAS inside it with no explicit table location inherits the schema's location (the table lands under that path — even as a managed table in a mounted container).

Beat 2 — clones: shallow (cheap, fragile) vs deep (independent, safe)

The difference is whether the data files are copied:

SHALLOW CLONE — copies only transaction-log metadata and references the source's existing data files. Instant and cheap (a quick dev/test copy) — but shares files, so VACUUM on the source can purge files the clone needs and break it.
DEEP CLONE — copies the data files too. Fully independent, survives the source's VACUUM. The right choice for a real backup.

Tell: "fast zero-copy sandbox" → shallow (don't VACUUM the source); "backup that stays valid no matter what happens to the source" → deep.

Lock it. LOCATION → external (DROP keeps files); no LOCATION → managed (DROP deletes files). Shallow clone = metadata + shared files (fragile); deep clone = copied files (independent).

The dials (skim now; return when a question needs one)

◆ Materialization — recompute or store?

For a gold-layer result, choose when the work happens:

Object	Stores data?	Recomputed	Use when
View	no	every read	cheap query, rarely run, always-fresh
Table (batch)	yes	when you rebuild it	you control refresh explicitly
Materialized view	yes (precomputed)	auto / scheduled, incrementally	frequently-queried aggregates needing only periodic refresh
Streaming table	yes	incrementally per new data	continuous append ingestion (`[Streaming tables vs materialized views](/lessons/s1-streaming-tables-vs-mv/)`)

Predict: a gold metric is queried constantly but only needs to recompute daily. A plain view or something else?

…

Not a plain view (it re-runs the heavy aggregation on every read) → materialize it: a materialized view (precomputed, auto-refreshed) or a nightly batch table overwritten each run. Tell: "precomputed results, refreshed automatically/on a schedule, fast reads" → materialized view; "raw incremental ingestion" → streaming table.

◆ Declare the schema — don't lean on inference

Databricks infers schema using the most permissive types that fit the sampled data — which quietly accepts values you'd rather reject and can widen types. When correctness matters, explicitly declare the schema (types + constraints). Declaration is enforcement; inference is convenience.

Takeaways (rebuild it from these)

LOCATION → external (unmanaged); no LOCATION → managed. DROP deletes data for managed, metadata-only for external. A schema's LOCATION is inherited by CTAS tables created in it.
SHALLOW CLONE = metadata only, references source files (cheap, but source VACUUM breaks it); DEEP CLONE = copies files, independent (safe backup).
Materialize by need: view (recompute each read) · table (manual rebuild) · materialized view (precomputed, auto/scheduled — for frequent reads) · streaming table (incremental ingestion).
Gold metrics queried often but refreshed daily → materialized view / nightly batch table, not a plain view.
Declare schemas explicitly for quality — inference picks the most permissive types.

Before you move on — say these without scrolling up

CREATE TABLE … LOCATION … then DROP — do the files survive? Managed or external?
Shallow vs deep clone — which breaks if you VACUUM the source, and why?
A metric queried constantly but refreshed daily — which materialization, and why not a view?
Why declare a schema instead of trusting inference?

That completes Section 10 — and the last of the section lessons: model the dimensions (SCD/star) → choose the physical table (managed/clone/materialization).

Delta data models — managed vs external, clones, and materialization