Skip to content

Federation — Cloudera Hive cross-Lakehouse JOIN

Persona path: alice (setup) → carol (analyst) · Catalogs: cloudera (external) and postgres_oltp_banking (internal) · Duration: ~20 min · Difficulty: star star star

Status note: this demo expects the Cloudera catalog provisioned by the demo perimeter akko-demo-cloudera. If the catalog is not yet live, follow Admin / Catalog Manager Pro to add it manually.

This demo proves that AKKO can federate an external Cloudera Hive cluster and run a single SQL that joins a Cloudera table with an internal AKKO source. Carol does not need to copy data or change tooling.

What this proves

  • An external Cloudera Hive cluster can be added as a catalog without code changes.
  • A single SQL query joins the external Hive table with an internal Postgres source.
  • The Query Engine pushes predicates down to both sides.
  • Carol's role determines what columns and rows she sees on both sides.

Pre-requisites

  • Demo URL: https://demo.akko-ai.com
  • An external Cloudera CDP cluster reachable from the AKKO cluster (Kerberos enabled).
  • Catalog postgres_oltp_banking already federated.
  • 2 personas provisioned: alice, carol.

Step 1 — Alice federates the Cloudera catalog

Sign in as alice. Navigate to Administration → Catalog Manager Pro.

Click Add catalog → Hive (Cloudera).

Fill the form:

  • Catalog name: cloudera
  • Metastore URI: thrift://hms.cloudera.akko-demo:9083
  • Default file format: PARQUET
  • Auth: Kerberos
  • Principal: akko-query/_HOST@AKKO-DEMO.LOCAL
  • Keytab: upload akko-query.keytab (encrypted at rest)

Click Test connection. Expected: Connection OK, 4 databases discovered.

Click Save. The Catalog auto-syncs metadata in the background.

Screenshot: tests/e2e/playwright/artefacts/demos/federation-cloudera/01-add-catalog.png

Step 2 — Alice verifies catalog visibility

Sign out, sign in as carol. Navigate to Governance → Catalog explorer.

Expected: a new catalog cloudera appears with the discovered databases. Click cloudera → banking_external.

Expected tables:

banking_external.partner_transactions   (12 400 000 rows)
banking_external.partner_merchants      (38 000 rows)
banking_external.partner_countries      (240 rows)

Screenshot: tests/e2e/playwright/artefacts/demos/federation-cloudera/02-cloudera-catalog.png

Step 3 — Carol writes the cross-catalog JOIN

Open DevHub → SQL editor. Run:

SELECT
    i.country_code,
    count(DISTINCT i.transaction_id) AS internal_tx,
    count(DISTINCT e.partner_tx_id)  AS external_tx,
    avg(e.exchange_rate)             AS avg_rate
FROM postgres_oltp_banking.public.transactions i
JOIN cloudera.banking_external.partner_transactions e
  ON e.merchant_id = i.merchant_id
 AND e.tx_date     = date(i.transaction_ts)
WHERE i.transaction_ts >= current_date - INTERVAL '30' DAY
GROUP BY i.country_code
ORDER BY internal_tx DESC
LIMIT 20;

Expected: 20 rows in ~4 s. Top 5:

| country | internal_tx | external_tx | avg_rate |
| FR      | 84 120      | 78 940      | 1.0000   |
| DE      | 41 320      | 39 870      | 1.0000   |
| US      | 28 410      | 27 600      | 1.0820   |
| GB      | 19 240      | 18 990      | 0.8612   |
| CH      | 12 410      | 12 100      | 0.9714   |

Screenshot: tests/e2e/playwright/artefacts/demos/federation-cloudera/03-join-result.png

Step 4 — Carol inspects predicate pushdown

In the SQL editor, click Explain → Distributed plan. Expected: the plan shows two sources, with the date filter pushed to the Hive side and the country filter pushed to the Postgres side.

Output [country_code, internal_tx, external_tx, avg_rate]
  Aggregate
    InnerJoin [on merchant_id, tx_date]
      RemoteSource postgres_oltp_banking (predicate: ts >= now - 30 days)
      RemoteSource cloudera (predicate: tx_date >= 2026-04-17)

Screenshot: tests/e2e/playwright/artefacts/demos/federation-cloudera/04-explain-plan.png

Step 5 — Carol promotes the dashboard

Click Promote to dashboard. ADEN proposes a 4-tile dashboard:

  1. KPI — Internal tx total (30 d)
  2. KPI — External tx total (30 d)
  3. Bar — Internal vs External per country
  4. Table — Per country breakdown
+-------------------- AKKO Federation — Banking cross-source ----------+
| 384 200 internal tx | 372 100 external tx                            |
+----------------------------------------------------------------------+
| [ Bar: internal vs external per country ]  [ Table top 20 ]          |
+----------------------------------------------------------------------+

Screenshot: tests/e2e/playwright/artefacts/demos/federation-cloudera/05-dashboard.png

Cleanup

  • Sign out.
  • Optional: keep the catalog for the next demo cycle.

What this proves

  • An external Cloudera cluster is a 1-form integration.
  • A single SQL JOINs internal and external data without ETL.
  • The Query Engine pushes predicates to both sides.
  • Carol's role applies on both catalogs (e.g. masked PII on the internal side).

Files in the repo

File Role
helm/akko/charts/akko-catalog-manager/templates/secret-cloudera.yaml Kerberos keytab Secret template
trino/catalog/cloudera.properties Generated catalog file