Cloudera-style external lakehouse federation

AKKO connects to legacy data lakes you already own without moving a single byte. Federation, not migration. Sprint 82 A2.

What it does

The cloudera Trino catalog reads Hive tables that live in a customer-owned Hadoop-style cluster (HDFS NameNode + DataNode + Hive Metastore + optional Kerberos KDC). Once federated, those tables appear alongside your other AKKO sources: same SQL, same RBAC, same governance.

The customer-facing surface uses layer-first wording: Lakehouse external. The vendor name Cloudera shows only on the architecture page (admin expert view) and in this governance memo. The pattern works against any deployment that exposes a Hive Metastore Thrift endpoint: Cloudera CDH or CDP (Cloudera Data Platform), Amazon EMR, Google Dataproc, Hortonworks HDP, or a custom Apache Hive install.

Architecture

┌─────────────────────────────────────────┐    ┌──────────────────────────────────────┐
│ AKKO namespace                          │    │ akko-demo-cloudera namespace         │
│                                          │    │  (customer-owned, untouched)         │
│  Trino coordinator                       │    │                                       │
│    /etc/trino/catalog/cloudera.properties│    │  Hive Metastore (Thrift :9083) ◄─┐   │
│    /etc/trino/cloudera-conf/             │    │  HDFS NameNode (RPC :9000)       │   │
│    /etc/security/cloudera-keytabs/       │────►  KDC (Kerberos :88)              │   │
│                                          │    │                                   │   │
│  NetworkPolicy : `part-of: akko` ───────►│    │  Whitelist : ns=akko + label     │   │
│                                          │    │              `part-of: akko`     │   │
└─────────────────────────────────────────┘    └──────────────────────────────────────┘

Two authentication modes

The catalog body is rendered conditionally based on global.cloudera.authentication (Helm value).

NONE (default — current demo cluster ships this)

Trino opens a plain Thrift TCP connection to the Hive Metastore. The Metastore's hive-site.xml does NOT enable SASL/Kerberos on the Thrift listener, so hive.metastore.authentication.type=NONE is mandatory on the Trino side; otherwise the handshake fails immediately.

connector.name=hive
hive.metastore=thrift
hive.metastore.uri=thrift://akko-demo-cloudera-hive-metastore.akko-demo-cloudera.svc.cluster.local:9083
hive.metastore.authentication.type=NONE
hive.hdfs.authentication.type=NONE

KERBEROS

Trino performs a SASL/GSSAPI (Kerberos) handshake against the Hive Metastore using a keytab provisioned by the cloudera-keytab-job (akko-init post-install hook). The Job execs into the KDC pod, calls kadmin.local addprinc -randkey trino@AKKO.LOCAL, exports the keytab with ktadd, streams it back as base64, and creates the akko-cloudera-trino-keytab Secret in the akko namespace. The Trino coordinator mounts the Secret at /etc/security/cloudera-keytabs/ and references it from the catalog body:

connector.name=hive
hive.metastore=thrift
hive.metastore.uri=thrift://<hive-metastore-fqdn>:9083
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=hive/<hive-metastore-fqdn>@AKKO.LOCAL
hive.metastore.client.principal=trino@AKKO.LOCAL
hive.metastore.client.keytab=/etc/security/cloudera-keytabs/trino-cloudera.keytab
hive.hdfs.authentication.type=KERBEROS
hive.hdfs.trino.principal=trino@AKKO.LOCAL
hive.hdfs.trino.keytab=/etc/security/cloudera-keytabs/trino-cloudera.keytab
hive.config.resources=/etc/trino/cloudera-conf/core-site.xml,/etc/trino/cloudera-conf/hdfs-site.xml
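
The two catalog bodies shown for NONE and KERBEROS share their first lines and differ only in the authentication properties. As a rough illustration of that conditional rendering (the real chart does this in a Helm template, which is not shown in this memo), a minimal Python sketch could look like the following. The function name and the keys of the `kerberos` dictionary are hypothetical.

```python
# Illustrative only: mirrors the two catalog bodies documented above.
# The actual rendering happens in the AKKO Helm chart, not in Python.
CONF_DIR = "/etc/trino/cloudera-conf"


def render_catalog(metastore_uri, authentication="NONE", kerberos=None):
    """Return a cloudera.properties body for the given authentication mode."""
    mode = authentication.upper()
    if mode not in ("NONE", "KERBEROS"):
        raise ValueError(f"unsupported authentication mode: {authentication!r}")

    lines = [
        "connector.name=hive",
        "hive.metastore=thrift",
        f"hive.metastore.uri={metastore_uri}",
        f"hive.metastore.authentication.type={mode}",
    ]
    if mode == "NONE":
        lines.append("hive.hdfs.authentication.type=NONE")
    else:
        if not kerberos:
            raise ValueError("KERBEROS mode needs principal and keytab settings")
        # Hypothetical key names; the chart values use their own naming.
        keytab = kerberos["keytab"]
        client = kerberos["client_principal"]
        lines += [
            f"hive.metastore.service.principal={kerberos['hive_service_principal']}",
            f"hive.metastore.client.principal={client}",
            f"hive.metastore.client.keytab={keytab}",
            "hive.hdfs.authentication.type=KERBEROS",
            f"hive.hdfs.trino.principal={client}",
            f"hive.hdfs.trino.keytab={keytab}",
            f"hive.config.resources={CONF_DIR}/core-site.xml,{CONF_DIR}/hdfs-site.xml",
        ]
    return "\n".join(lines) + "\n"
```

Calling `render_catalog("thrift://<hive-metastore-fqdn>:9083")` yields the five-line NONE body; passing `"KERBEROS"` plus the principal and keytab settings yields the longer body.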

Activation requires the upstream Hive Metastore to be redeployed with hive.metastore.sasl.enabled=true, which is currently a TODO in the akko-demo-cloudera chart (Sprint 61.2.4).
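
The keytab hand-off performed by the cloudera-keytab-job (binary keytab streamed back as base64, then stored in a Secret) can be sketched as below. This is an assumption-laden illustration, not the Job's actual script: the helper names are hypothetical, and the Secret data key `trino-cloudera.keytab` is inferred from the keytab path used in the catalog body.

```python
import base64


def decode_streamed_keytab(b64_text):
    """Decode the base64 text streamed from the KDC pod into raw keytab bytes."""
    # Terminals and `base64` wrap lines, so drop all whitespace first.
    raw = base64.b64decode("".join(b64_text.split()), validate=True)
    if not raw:
        raise ValueError("streamed keytab is empty")
    return raw


def build_keytab_secret(keytab_bytes,
                        name="akko-cloudera-trino-keytab",
                        namespace="akko",
                        key="trino-cloudera.keytab"):
    """Build a Kubernetes Secret manifest (as a dict) holding the keytab."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Secret `data` values must be base64-encoded strings.
        "data": {key: base64.b64encode(keytab_bytes).decode("ascii")},
    }
```

The resulting dictionary can be serialized to JSON and applied with `kubectl apply -f -`. Mounting the Secret at /etc/security/cloudera-keytabs/ then exposes the file under the data key's name.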

Enabling the catalog

In your values overlay:

global:
  cloudera:
    enabled: true
    # Override to point at your real Cloudera / CDP / EMR endpoint
    metastoreUri: "thrift://<your-hive-metastore-host>:9083"
    hdfsUri: "hdfs://<your-namenode-host>:9000"
    # Switch to KERBEROS once the source cluster is hardened
    authentication: NONE
    kerberos:
      bootstrap: true     # auto-provision the trino@<realm> keytab
      realm: "<YOUR.REALM>"
      kdcHost: "<your-kdc-fqdn>"
      kdcPort: 88
      hiveServicePrincipal: "hive/<hive-fqdn>@<YOUR.REALM>"
      hdfsServicePrincipal: "hdfs/<namenode-fqdn>@<YOUR.REALM>"
      clientPrincipal: "trino@<YOUR.REALM>"

Then:

helm upgrade akko helm/akko/ -n akko \
  --reuse-values \
  -f helm/examples/values-netcup.yaml

The Trino coordinator picks up the new mount + catalog on rolling restart. The bootstrap Job runs after every upgrade and registers the catalog idempotently via CREATE CATALOG cloudera USING hive WITH (...).
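
To make the registration step concrete, here is a minimal sketch of how a CREATE CATALOG statement could be assembled from a property map. It assumes idempotence is obtained with `IF NOT EXISTS`; the memo does not say how the bootstrap Job actually achieves it, and the helper names are hypothetical.

```python
import re


def sql_string(value):
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    return "'" + str(value).replace("'", "''") + "'"


def build_create_catalog(name, connector, properties):
    """Build a CREATE CATALOG statement from catalog properties.

    `connector.name` is expressed through USING, so it is skipped in WITH.
    """
    for ident in (name, connector):
        if not re.fullmatch(r"[a-z_][a-z0-9_]*", ident):
            raise ValueError(f"unexpected identifier: {ident!r}")
    props = ",\n  ".join(
        f'"{key}" = {sql_string(val)}'
        for key, val in properties.items()
        if key != "connector.name"
    )
    # Assumption: IF NOT EXISTS makes repeated runs after each upgrade harmless.
    return f"CREATE CATALOG IF NOT EXISTS {name} USING {connector}\nWITH (\n  {props}\n)"
```

For example, `build_create_catalog("cloudera", "hive", {"hive.metastore.uri": "thrift://<hive-metastore-fqdn>:9083"})` produces a statement that can be posted to the coordinator as svc-catalog-manager.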

NetworkPolicy

Trino pods need the app.kubernetes.io/part-of: akko label so the akko-demo-cloudera namespace NetworkPolicy whitelists ingress on ports 9083 (Hive Thrift), 9000 (HDFS NameNode RPC), 88 (Kerberos KDC). This label is set automatically by the AKKO umbrella via the trino.coordinator.labels override in values-trino.yaml.

Without the label, nc -z <hive-metastore> 9083 returns Connection refused (post-DNAT RST from kube-router enforcing the NP). See gotcha_kube_router_np_post_dnat.md for the root-cause analysis.
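
Where `nc` is not available in the Trino image, the same reachability check can be approximated with a few lines of Python. This is a sketch only; the `probe` helper is hypothetical, and the hostnames in the usage note are placeholders.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try a TCP connect and classify the outcome, similar to `nc -z`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The symptom described above when the NetworkPolicy rejects the pod.
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"
```

Run it from inside a Trino pod against each whitelisted port, for example `probe("<hive-metastore>", 9083)`, `probe("<namenode>", 9000)` and `probe("<kdc>", 88)`. A result of `refused` on all three is consistent with a missing part-of label rather than a single service being down.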

OPA — service-account override

The bootstrap Job registers the catalog via CREATE CATALOG SQL using the svc-catalog-manager service account. OPA denies CreateCatalog unless the caller carries the akko-admin group. The Trino file-based group provider does NOT populate groups for header-only (non-JWT) requests, so a one-line override in OPA's user_overrides data file is required:

akko-opa:
  groupPolicies:
    user_overrides:
      svc-catalog-manager:
        data_scope: ["*"]

This promotes svc-catalog-manager to admin via the existing data.user_overrides[user].data_scope = ["*"] wildcard rule (Sprint 68 ADEN deep-dive). No upstream OPA Rego changes required.
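
The decision described above can be paraphrased in Python to show why the override is enough. This is not the Rego policy itself, only an illustration of its documented behaviour; the function names are hypothetical.

```python
def is_admin(user, groups, user_overrides):
    """Admin if the caller has the akko-admin group or a wildcard data_scope."""
    if "akko-admin" in groups:
        return True
    override = user_overrides.get(user, {})
    return "*" in override.get("data_scope", [])


def allow_create_catalog(user, groups, user_overrides):
    """CreateCatalog is denied unless the caller counts as admin."""
    return is_admin(user, groups, user_overrides)


# Header-only requests arrive with no groups, so only the override can match.
overrides = {"svc-catalog-manager": {"data_scope": ["*"]}}
assert allow_create_catalog("svc-catalog-manager", [], overrides)
assert not allow_create_catalog("svc-catalog-manager", [], {})
```

Because the service account arrives without groups, the group check can never succeed for it, which is why the wildcard scope is the path that grants access.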

Smoke test

After deployment, run the following from a pod with cluster-internal Trino access:

# As svc-catalog-manager (used by the bootstrap Job)
curl -s -X POST \
  -H 'X-Trino-User: svc-catalog-manager' \
  -H 'Content-Type: text/plain' \
  --data 'SHOW SCHEMAS FROM cloudera' \
  http://akko-trino:8080/v1/statement

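The POST returns a JSON document that contains a nextUri. Polling that URI until no nextUri remains yields the result rows, and a successful run returns at least the default schema.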
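
Because the Trino statement endpoint hands results back page by page through `nextUri`, a small client is often easier to read than chained curl calls. The following is a minimal sketch using only the standard library; the function names are hypothetical, and the URL and user mirror the curl example above.

```python
import json
import urllib.request


def post_statement(sql, base_url="http://akko-trino:8080", user="svc-catalog-manager"):
    """Submit a statement and return the first response page as a dict."""
    req = urllib.request.Request(
        f"{base_url}/v1/statement",
        data=sql.encode("utf-8"),
        headers={"X-Trino-User": user, "Content-Type": "text/plain"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def get_page(url, user="svc-catalog-manager"):
    """Fetch a follow-up page referenced by nextUri."""
    req = urllib.request.Request(url, headers={"X-Trino-User": user})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def collect_rows(first_page, fetch_next=get_page):
    """Follow nextUri until it disappears, gathering all data rows."""
    rows, page = [], first_page
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))
        next_uri = page.get("nextUri")
        if not next_uri:
            return rows
        page = fetch_next(next_uri)
```

Usage from a pod with cluster-internal access: `collect_rows(post_statement("SHOW SCHEMAS FROM cloudera"))`. The `fetch_next` parameter exists so the paging logic can be exercised without a live coordinator.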

Troubleshooting

Connection refused to port 9083

Trino pods are missing the app.kubernetes.io/part-of: akko label, so the cross-namespace NetworkPolicy denies ingress. Verify with:

kubectl -n akko get pod -l app.kubernetes.io/name=trino \
  -o jsonpath='{.items[0].metadata.labels}' | tr ',' '\n' | grep part-of

If missing, ensure coordinator.labels.app.kubernetes.io/part-of: akko is set in values-trino.yaml and re-upgrade.

Access Denied: Cannot create catalog cloudera

OPA user_overrides.svc-catalog-manager is not set. Add the override in your values overlay and re-upgrade. The override survives Trino restarts because OPA reloads its data on every helm upgrade.

Cannot obtain metadata on first query

The Hive Metastore's hive-site.xml enforces SASL but the catalog runs in NONE mode (or the reverse). Verify with kubectl exec -n akko-demo-cloudera deploy/akko-demo-cloudera-hive-metastore -- grep -A1 sasl /opt/hive/conf/hive-site.xml (the -A1 flag also prints the value line that follows the property name). If hive.metastore.sasl.enabled is true, your catalog MUST run in KERBEROS mode.
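
The same comparison can be automated by parsing hive-site.xml instead of reading grep output. The sketch below assumes the standard Hadoop configuration layout (`<property>` elements with `<name>` and `<value>` children); the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def metastore_sasl_enabled(hive_site_xml):
    """Return True if hive.metastore.sasl.enabled is set to true."""
    if isinstance(hive_site_xml, str):
        hive_site_xml = hive_site_xml.encode("utf-8")
    root = ET.fromstring(hive_site_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == "hive.metastore.sasl.enabled":
            return (prop.findtext("value") or "").strip().lower() == "true"
    return False  # Hive treats an absent property as disabled


def check_mode(hive_site_xml, catalog_mode):
    """Return None when both sides agree, else a short mismatch message."""
    expected = "KERBEROS" if metastore_sasl_enabled(hive_site_xml) else "NONE"
    if catalog_mode.upper() == expected:
        return None
    return f"catalog runs in {catalog_mode.upper()} but the Metastore expects {expected}"
```

Feed it the file content obtained through `kubectl exec ... -- cat /opt/hive/conf/hive-site.xml` together with the current value of global.cloudera.authentication.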

Kerberos handshake fails with Server not found in Kerberos database

The service principal in the catalog body uses _HOST substitution but Trino doesn't expand it (it expects the full FQDN). Use the explicit form hive/<full-service-fqdn>@<REALM> everywhere.
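
When principals are copied from Hadoop configuration files, the `_HOST` token tends to come along with them. A small helper can rewrite such values into the explicit form before they reach the values overlay. This is a sketch; the helper name is hypothetical, and lowercasing the hostname is an assumption based on the usual Kerberos convention.

```python
import re

# service/host@REALM, with no whitespace and no stray separators
_SERVICE_PRINCIPAL = re.compile(r"^[^/@\s]+/[^/@\s]+@[^/@\s]+$")


def explicit_principal(principal, service_fqdn):
    """Replace _HOST with the service FQDN and validate the result."""
    # Assumption: hostnames in principals are lowercase, as is conventional.
    resolved = principal.replace("_HOST", service_fqdn.lower())
    if not _SERVICE_PRINCIPAL.match(resolved):
        raise ValueError(f"not a service/host@REALM principal: {resolved!r}")
    return resolved
```

For example, `explicit_principal("hive/_HOST@AKKO.LOCAL", "<hive-metastore-fqdn>")` returns the value to place in hiveServicePrincipal. Client principals without a host part, such as trino@AKKO.LOCAL, are outside the scope of this helper.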

Server certificate validation failed

The KDC issues tickets for kdc-fqdn:88 but the Trino pod resolves kdc-fqdn to a different IP than what's in the krb5.conf. Check /etc/trino/cloudera-conf/krb5.conf (rendered from the akko-trino-cloudera-conf ConfigMap) and ensure the [realms] block references the same FQDN that resolves cluster-internally.

  • akko-demo-cloudera chart — the simulated Hadoop perimeter (HDFS + Hive Metastore + KDC) used for the demo
  • Climscore federation — another external customer data source pattern (Postgres direct connection, no Hive layer)
  • RBAC matrix — how the 5 AKKO roles see the cloudera catalog