Lessons

Data Sharing & Federation

Lakehouse Federation — query external data in place

Use Lakehouse Federation to query external data sources (databases, warehouses) in place through Unity Catalog, without copying or migrating the data.

Delta Sharing was data going out. This lesson is the mirror image — data coming in. Your org's data isn't all in Databricks: some lives in PostgreSQL, some in Snowflake, some in an Oracle box the trading desk owns. You want to query it from Databricks — join it to your lakehouse tables — without an ETL that copies it all in first. That's Lakehouse Federation.


The spine

Beat 1 — the anchor

Predict: to join Oracle portfolio data to your lakehouse market data, the "obvious" move is to ETL Oracle into Delta first. What does Federation let you skip?

The copy entirely. You reach the source where it lives:

Anchor. Lakehouse Federation lets you query external data sources — databases, warehouses, other lakes — in place, directly, through Unity Catalog, with no migration or copy. The external data stays put; you just reach it under UC governance.

Beat 2 — the setup: connection → foreign catalog → grant

Two objects, in order — and the order is tested:

  1. A connection — how to reach the external system (host + credentials). One per source (e.g. mysql_connection).
  2. A foreign catalog — a UC catalog that mirrors the external system, so its tables appear in your namespace and you query them with normal SQL under UC.

Predict: you've created mysql_connection. What's the next step?

Create the foreign catalog from that connection — then grant UC permissions on the foreign catalog / schemas / tables so teams can use it. (Connection first, foreign catalog second, governance third — every question tests this sequence.)

Lock it. Federation = query external sources in place, through UC, no copy. Setup = connection → foreign catalog → grant.


The dials (skim now; return when a question needs one)

◆ Read-only and always current

Federation hits the source system directly, read-only. Two consequences the exam raises:

The whole-feature tell: "query Oracle / Snowflake / SQL Server / Postgres from Databricks without moving the data" → Lakehouse Federation.

◆ Name the collision (with Delta Sharing)

Both honour "don't copy the data." Providing data to a partner → Sharing; reaching into an external database → Federation.

Takeaways (rebuild it from these)

  1. Lakehouse Federation = query external sources in place, through UC, no copy/migration.
  2. Setup order: connection (reach the source) → foreign catalog (mirror it into UC) → grant UC permissions. Query with normal SQL.
  3. Read-only and live — results always reflect the current source, so no staleness/consistency gap and no ETL to maintain.
  4. Collision: Federation = data in (query external); Delta Sharing = data out (share yours). Both avoid copying.

Before you move on — say these without scrolling up

  1. What does Federation let you skip versus the "ETL it in first" approach?
  2. You've made the connection — the next two steps, in order?
  3. "Worried about staleness across teams querying external data" — why is Federation the answer?
  4. Providing data to a partner vs reaching into an external DB — which feature each?

That closes Section 4 — the no-copy pair. Next: monitoring and alerting on everything the pipelines produce.

Prerequisites