Beyond what you model (dimensions, facts) sit the physical choices: where the table's data lives, how you copy it, and whether a result is recomputed each read or stored. Three small decisions, each with a sharp tell.
The spine
Beat 1 — managed vs external: the LOCATION keyword decides
Predict: you
CREATE TABLE … LOCATION 's3://…/mydata'. Later youDROPit. What happens to the files?
…
They survive — because LOCATION made it external, and DROP on an external table removes only metadata.
Anchor. Specifying a
LOCATION= external (unmanaged) — data lives at your path;DROPremoves only metadata (files stay). NoLOCATION= managed — Databricks owns the storage;DROPdeletes the underlying files ([PII lifecycle — deletion, retention, and the right to be forgotten](/lessons/s7-pii-lifecycle/)).
"Which is external?" → the one created with LOCATION. A subtlety: if a schema is created with a LOCATION, a CTAS inside it with no explicit table location inherits the schema's location (the table lands under that path — even as a managed table in a mounted container).
Beat 2 — clones: shallow (cheap, fragile) vs deep (independent, safe)
The difference is whether the data files are copied:
SHALLOW CLONE— copies only transaction-log metadata and references the source's existing data files. Instant and cheap (a quick dev/test copy) — but shares files, soVACUUMon the source can purge files the clone needs and break it.DEEP CLONE— copies the data files too. Fully independent, survives the source'sVACUUM. The right choice for a real backup.
Tell: "fast zero-copy sandbox" → shallow (don't VACUUM the source); "backup that stays valid no matter what happens to the source" → deep.
Lock it.
LOCATION→ external (DROP keeps files); no LOCATION → managed (DROP deletes files). Shallow clone = metadata + shared files (fragile); deep clone = copied files (independent).
The dials (skim now; return when a question needs one)
◆ Materialization — recompute or store?
For a gold-layer result, choose when the work happens:
| Object | Stores data? | Recomputed | Use when |
|---|---|---|---|
| View | no | every read | cheap query, rarely run, always-fresh |
| Table (batch) | yes | when you rebuild it | you control refresh explicitly |
| Materialized view | yes (precomputed) | auto / scheduled, incrementally | frequently-queried aggregates needing only periodic refresh |
| Streaming table | yes | incrementally per new data | continuous append ingestion ([Streaming tables vs materialized views](/lessons/s1-streaming-tables-vs-mv/)) |
Predict: a gold metric is queried constantly but only needs to recompute daily. A plain view or something else?
…
Not a plain view (it re-runs the heavy aggregation on every read) → materialize it: a materialized view (precomputed, auto-refreshed) or a nightly batch table overwritten each run. Tell: "precomputed results, refreshed automatically/on a schedule, fast reads" → materialized view; "raw incremental ingestion" → streaming table.
◆ Declare the schema — don't lean on inference
Databricks infers schema using the most permissive types that fit the sampled data — which quietly accepts values you'd rather reject and can widen types. When correctness matters, explicitly declare the schema (types + constraints). Declaration is enforcement; inference is convenience.
Takeaways (rebuild it from these)
LOCATION→ external (unmanaged); noLOCATION→ managed.DROPdeletes data for managed, metadata-only for external. A schema'sLOCATIONis inherited byCTAStables created in it.SHALLOW CLONE= metadata only, references source files (cheap, but sourceVACUUMbreaks it);DEEP CLONE= copies files, independent (safe backup).- Materialize by need: view (recompute each read) · table (manual rebuild) · materialized view (precomputed, auto/scheduled — for frequent reads) · streaming table (incremental ingestion).
- Gold metrics queried often but refreshed daily → materialized view / nightly batch table, not a plain view.
- Declare schemas explicitly for quality — inference picks the most permissive types.
Before you move on — say these without scrolling up
CREATE TABLE … LOCATION …thenDROP— do the files survive? Managed or external?- Shallow vs deep clone — which breaks if you
VACUUMthe source, and why? - A metric queried constantly but refreshed daily — which materialization, and why not a view?
- Why declare a schema instead of trusting inference?
That completes Section 10 — and the last of the section lessons: model the dimensions (SCD/star) → choose the physical table (managed/clone/materialization).