Cloudera-style external lakehouse federation¶
AKKO connects to legacy data lakes you already own without moving a single byte. Federation, not migration. Sprint 82 A2.
What it does¶
The cloudera Trino catalog reads Hive tables that live in a customer-
owned Hadoop-style cluster (HDFS NameNode + DataNode + Hive Metastore +
optional Kerberos KDC). Once federated, those tables appear alongside
your other AKKO sources — same SQL, same RBAC, same governance.
Customer-facing surfaces use layer-first wording: Lakehouse external.
The vendor name Cloudera appears only on the architecture page (admin
expert view) and in this governance memo. The pattern works against
any deployment that exposes a Hive Metastore Thrift endpoint:
Cloudera CDH/CDP, Amazon EMR, Google Dataproc, Hortonworks HDP, or a
custom Apache Hive install.
Architecture¶
┌──────────────────────────────────────────┐    ┌──────────────────────────────────────┐
│  AKKO namespace                          │    │  akko-demo-cloudera namespace        │
│                                          │    │  (customer-owned, untouched)         │
│  Trino coordinator                       │    │                                      │
│    /etc/trino/catalog/cloudera.properties│    │  Hive Metastore (Thrift :9083)  ◄─┐  │
│    /etc/trino/cloudera-conf/             │    │  HDFS NameNode (RPC :9000)        │  │
│    /etc/security/cloudera-keytabs/       │────►  KDC (Kerberos :88)               │  │
│                                          │    │                                   │  │
│  NetworkPolicy : `part-of: akko` ───────►│    │  Whitelist : ns=akko + label      │  │
│                                          │    │              `part-of: akko`      │  │
└──────────────────────────────────────────┘    └──────────────────────────────────────┘
Two authentication modes¶
The catalog body is rendered conditionally based on
global.cloudera.authentication (Helm value).
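As a rough illustration of that conditional rendering, the shell function below prints one of two catalog bodies depending on a mode argument. It is not the chart's actual Helm template, which is not shown on this page. The name render_cloudera_catalog and its arguments are hypothetical, and the keys simply mirror the two bodies listed in the sections that follow.

```shell
# Hypothetical stand-in for the Helm template: prints a catalog body for one mode.
# Usage: render_cloudera_catalog <NONE|KERBEROS> <metastore-fqdn> [realm]
render_cloudera_catalog() {
  local mode=$1 host=$2 realm=${3:-AKKO.LOCAL}
  local keytab=/etc/security/cloudera-keytabs/trino-cloudera.keytab

  # Reject unknown modes before printing anything.
  case "$mode" in
    NONE|KERBEROS) ;;
    *) echo "unsupported authentication mode: $mode" >&2; return 1 ;;
  esac

  # Keys shared by both modes.
  printf '%s\n' \
    "connector.name=hive" \
    "hive.metastore=thrift" \
    "hive.metastore.uri=thrift://${host}:9083"

  if [ "$mode" = NONE ]; then
    printf '%s\n' \
      "hive.metastore.authentication.type=NONE" \
      "hive.hdfs.authentication.type=NONE"
  else
    printf '%s\n' \
      "hive.metastore.authentication.type=KERBEROS" \
      "hive.metastore.service.principal=hive/${host}@${realm}" \
      "hive.metastore.client.principal=trino@${realm}" \
      "hive.metastore.client.keytab=${keytab}" \
      "hive.hdfs.authentication.type=KERBEROS" \
      "hive.hdfs.trino.principal=trino@${realm}" \
      "hive.hdfs.trino.keytab=${keytab}" \
      "hive.config.resources=/etc/trino/cloudera-conf/core-site.xml,/etc/trino/cloudera-conf/hdfs-site.xml"
  fi
}
```

Calling it with NONE and the demo Metastore FQDN reproduces the five-line body of the default mode; anything other than NONE or KERBEROS is rejected with a non-zero status.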
NONE (default — current demo cluster ships this)¶
Trino opens a plain Thrift TCP connection to the Hive Metastore. The
Metastore's hive-site.xml does NOT enable SASL/Kerberos on the
Thrift listener, so hive.metastore.authentication.type=NONE on the
Trino side is mandatory; otherwise the handshake fails immediately.
connector.name=hive
hive.metastore=thrift
hive.metastore.uri=thrift://akko-demo-cloudera-hive-metastore.akko-demo-cloudera.svc.cluster.local:9083
hive.metastore.authentication.type=NONE
hive.hdfs.authentication.type=NONE
KERBEROS¶
Trino performs a SASL/GSSAPI (Kerberos) handshake against the Hive
Metastore using a keytab provisioned by the cloudera-keytab-job
(akko-init post-install hook). The Job execs into the KDC pod, runs
kadmin.local addprinc -randkey trino@AKKO.LOCAL, exports the key with
ktadd, streams the keytab back as base64, and creates the
akko-cloudera-trino-keytab Secret in the akko namespace. The Trino
coordinator mounts the Secret at /etc/security/cloudera-keytabs/ and
references it from the catalog body:
connector.name=hive
hive.metastore=thrift
hive.metastore.uri=thrift://<hive-metastore-fqdn>:9083
hive.metastore.authentication.type=KERBEROS
hive.metastore.service.principal=hive/<hive-metastore-fqdn>@AKKO.LOCAL
hive.metastore.client.principal=trino@AKKO.LOCAL
hive.metastore.client.keytab=/etc/security/cloudera-keytabs/trino-cloudera.keytab
hive.hdfs.authentication.type=KERBEROS
hive.hdfs.trino.principal=trino@AKKO.LOCAL
hive.hdfs.trino.keytab=/etc/security/cloudera-keytabs/trino-cloudera.keytab
hive.config.resources=/etc/trino/cloudera-conf/core-site.xml,/etc/trino/cloudera-conf/hdfs-site.xml
Activation requires the upstream Hive Metastore to be re-deployed with
hive.metastore.sasl.enabled=true — currently a TODO in the
akko-demo-cloudera chart (Sprint 61.2.4).
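The keytab provisioning steps described above could look roughly like the following when run by hand. This is a sketch, not the source of the cloudera-keytab-job: the KDC workload name (deploy/akko-demo-cloudera-kdc), the scratch path inside the KDC pod, and the function name are assumptions. The principal, Secret name, and keytab file name are taken from this page.

```shell
# Sketch of the keytab bootstrap, assuming kubectl access to both namespaces.
KDC_NS=${KDC_NS:-akko-demo-cloudera}
KDC_TARGET=${KDC_TARGET:-deploy/akko-demo-cloudera-kdc}  # assumption: your KDC workload
PRINCIPAL=${PRINCIPAL:-trino@AKKO.LOCAL}
REMOTE_KEYTAB=/tmp/trino-cloudera.keytab                 # assumption: scratch path in the KDC pod

provision_trino_keytab() {
  local workdir rc
  workdir=$(mktemp -d) || return 1

  # 1. Create the principal with a random key.
  kubectl -n "$KDC_NS" exec "$KDC_TARGET" -- \
    kadmin.local -q "addprinc -randkey $PRINCIPAL" || return 1

  # 2. Export its key into a keytab file inside the KDC pod.
  kubectl -n "$KDC_NS" exec "$KDC_TARGET" -- \
    kadmin.local -q "ktadd -k $REMOTE_KEYTAB $PRINCIPAL" || return 1

  # 3. Stream the binary keytab back as base64 and decode it locally.
  kubectl -n "$KDC_NS" exec "$KDC_TARGET" -- base64 "$REMOTE_KEYTAB" \
    | base64 -d > "$workdir/trino-cloudera.keytab" || return 1

  # 4. Create or update the Secret that the Trino coordinator mounts.
  kubectl -n akko create secret generic akko-cloudera-trino-keytab \
    --from-file=trino-cloudera.keytab="$workdir/trino-cloudera.keytab" \
    --dry-run=client -o yaml | kubectl -n akko apply -f -
  rc=$?

  # The keytab is a credential; do not leave the local copy on disk.
  rm -rf "$workdir"
  return "$rc"
}
```

Piping create --dry-run=client into apply keeps the last step repeatable: a second run updates the existing Secret instead of failing on a name clash.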
Enabling the catalog¶
In your values overlay :
global:
  cloudera:
    enabled: true
    # Override to point at your real Cloudera / CDP / EMR endpoint
    metastoreUri: "thrift://<your-hive-metastore-host>:9083"
    hdfsUri: "hdfs://<your-namenode-host>:9000"
    # Switch to KERBEROS once the source cluster is hardened
    authentication: NONE
    kerberos:
      bootstrap: true # auto-provision the trino@<realm> keytab
      realm: "<YOUR.REALM>"
      kdcHost: "<your-kdc-fqdn>"
      kdcPort: 88
      hiveServicePrincipal: "hive/<hive-fqdn>@<YOUR.REALM>"
      hdfsServicePrincipal: "hdfs/<namenode-fqdn>@<YOUR.REALM>"
      clientPrincipal: "trino@<YOUR.REALM>"
Then apply the overlay with helm upgrade.
The Trino coordinator picks up the new mount + catalog on rolling
restart. The bootstrap Job runs after every upgrade and registers the
catalog idempotently via CREATE CATALOG cloudera USING hive WITH (...).
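After the rolling restart, one quick way to confirm that Trino actually sees the new files is to list them inside a Trino pod. The helper below is a sketch: the function name is hypothetical, the label selector is the one used in the troubleshooting section of this page, and the two paths come from the architecture diagram. It picks the first matching pod, which may be a worker rather than the coordinator.

```shell
# Usage: verify_cloudera_mounts [namespace]   (namespace defaults to akko)
verify_cloudera_mounts() {
  local ns=${1:-akko} pod
  pod=$(kubectl -n "$ns" get pod -l app.kubernetes.io/name=trino \
        -o jsonpath='{.items[0].metadata.name}') || return 1
  if [ -z "$pod" ]; then
    echo "no Trino pod found in namespace $ns" >&2
    return 1
  fi
  # ls exits non-zero if either path is missing, so the function fails loudly.
  kubectl -n "$ns" exec "$pod" -- \
    ls -l /etc/trino/catalog/cloudera.properties /etc/trino/cloudera-conf/
}
```

In KERBEROS mode the same check can be extended by hand to /etc/security/cloudera-keytabs/.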
NetworkPolicy¶
Trino pods need the app.kubernetes.io/part-of: akko label so the
akko-demo-cloudera namespace NetworkPolicy whitelists ingress on
ports 9083 (Hive Thrift), 9000 (HDFS NameNode RPC), 88 (Kerberos KDC).
This label is set automatically by the AKKO umbrella via the
trino.coordinator.labels override in values-trino.yaml.
Without the label, nc -z <hive-metastore> 9083 returns
Connection refused (post-DNAT RST from kube-router enforcing the NP).
See gotcha_kube_router_np_post_dnat.md for the
root-cause analysis.
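To probe all three whitelisted ports in one pass, a small loop around nc -z is enough. This sketch assumes a netcat build that supports -z and -w and must be run from a pod in the akko namespace. The function name and the three host arguments are placeholders, and only TCP is probed.

```shell
# Usage: check_cloudera_ports <hive-metastore-host> <namenode-host> <kdc-host>
check_cloudera_ports() {
  local rc=0 target
  for target in "$1:9083" "$2:9000" "$3:88"; do
    # ${target%:*} is the host part, ${target##*:} the port part.
    if nc -z -w 3 "${target%:*}" "${target##*:}"; then
      echo "open $target"
    else
      echo "closed $target"
      rc=1
    fi
  done
  return "$rc"
}
```

A pod without the part-of label should report all three targets as closed, while a correctly labelled Trino pod should report them as open (port 88 only if a KDC is deployed).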
OPA — service-account override¶
The bootstrap Job registers the catalog via CREATE CATALOG SQL using
the svc-catalog-manager service account. OPA denies CreateCatalog
unless the caller carries the akko-admin group. Because the Trino
file-based group provider does NOT populate groups for header-only
(non-JWT) requests, a one-line override in OPA's user_overrides data
file is required.
This promotes svc-catalog-manager to admin via the existing
data.user_overrides[user].data_scope = ["*"] wildcard rule (Sprint
68 ADEN deep-dive). No upstream OPA Rego changes required.
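Going by the rule path quoted above, the override document would have roughly the shape written out below. This is an inference from data.user_overrides[user].data_scope, not a copy of the AKKO data file: the function name and output file name are hypothetical, and how the document reaches OPA (Helm value, ConfigMap, bundle) is not specified on this page.

```shell
# Writes a candidate user_overrides document to a scratch file for review.
# Usage: write_opa_override [output-file]
write_opa_override() {
  local out=${1:-user_overrides.json}   # hypothetical file name
  cat > "$out" <<'EOF'
{
  "user_overrides": {
    "svc-catalog-manager": {
      "data_scope": ["*"]
    }
  }
}
EOF
  echo "$out"
}
```

If the opa CLI is at hand, opa eval -d user_overrides.json 'data.user_overrides["svc-catalog-manager"].data_scope' should return the wildcard list, which is a cheap sanity check before wiring the file into a deployment.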
Smoke test¶
After deployment, from a pod with cluster-internal Trino access :
# As svc-catalog-manager (used by the bootstrap Job)
curl -s -X POST \
-H 'X-Trino-User: svc-catalog-manager' \
-H 'Content-Type: text/plain' \
--data 'SHOW SCHEMAS FROM cloudera' \
http://akko-trino:8080/v1/statement
A successful response returns at least the default schema.
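Trino's HTTP client protocol is asynchronous: the POST to /v1/statement usually answers with a nextUri rather than rows, and the client keeps fetching nextUri until it no longer appears. The sketch below wraps the curl call above in such a loop. The names trino_run and TRINO_URL are hypothetical, and the sed-based extraction assumes the compact single-line JSON that Trino normally emits; a JSON-aware tool would be more robust.

```shell
# Usage: trino_run '<SQL>' [user]
# Prints every raw JSON page of the query, one per line.
trino_run() {
  local sql=$1 user=${2:-svc-catalog-manager}
  local base=${TRINO_URL:-http://akko-trino:8080}
  local resp next

  # Submit the statement.
  resp=$(curl -s -X POST \
    -H "X-Trino-User: $user" \
    -H 'Content-Type: text/plain' \
    --data "$sql" \
    "$base/v1/statement") || return 1

  while :; do
    printf '%s\n' "$resp"
    # Pull out nextUri, if present; an empty result means the query is finished.
    next=$(printf '%s' "$resp" | sed -n 's/.*"nextUri":"\([^"]*\)".*/\1/p')
    [ -n "$next" ] || break
    resp=$(curl -s -H "X-Trino-User: $user" "$next") || return 1
  done
}
```

For the smoke test, trino_run 'SHOW SCHEMAS FROM cloudera' | grep '"data"' shows the page that carries the schema names, and an "error" object in any page points at the failure.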
Troubleshooting¶
Connection refused to port 9083¶
Trino pods are missing the app.kubernetes.io/part-of: akko label, so
the cross-namespace NetworkPolicy denies ingress. Verify with:
kubectl -n akko get pod -l app.kubernetes.io/name=trino \
-o jsonpath='{.items[0].metadata.labels}' | tr ',' '\n' | grep part-of
If missing, ensure coordinator.labels.app.kubernetes.io/part-of: akko
is set in values-trino.yaml and re-upgrade.
Access Denied: Cannot create catalog cloudera¶
OPA user_overrides.svc-catalog-manager is not set. Add the override
in your values overlay and re-upgrade. The override survives Trino
restarts because OPA reloads its data on every helm upgrade.
Cannot obtain metadata on first query¶
The Hive Metastore's hive-site.xml enforces SASL but the catalog
runs in NONE mode (or the reverse). Verify with:
kubectl exec -n akko-demo-cloudera deploy/akko-demo-cloudera-hive-metastore -- grep -A1 sasl /opt/hive/conf/hive-site.xml
If hive.metastore.sasl.enabled is true, your catalog MUST run in KERBEROS mode.
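The same comparison can be scripted once both files are copied locally (for example with kubectl exec ... cat). The function below is a sketch under assumptions: check_auth_match is a hypothetical name, and it expects the usual Hadoop XML layout in which the <value> element follows its <name> within two lines.

```shell
# Usage: check_auth_match <hive-site.xml> <cloudera.properties>
# Succeeds when the Metastore SASL setting and the catalog mode agree.
check_auth_match() {
  local sasl=false mode
  if grep -A2 '<name>hive.metastore.sasl.enabled</name>' "$1" \
       | grep -q '<value>true</value>'; then
    sasl=true
  fi

  # An absent key is treated as NONE, Trino's default for this property.
  mode=$(sed -n 's/^hive\.metastore\.authentication\.type=//p' "$2")
  mode=${mode:-NONE}

  case "$sasl:$mode" in
    true:KERBEROS|false:NONE)
      echo "OK: metastore sasl.enabled=$sasl, catalog mode=$mode"
      ;;
    *)
      echo "MISMATCH: metastore sasl.enabled=$sasl, catalog mode=$mode"
      return 1
      ;;
  esac
}
```

A MISMATCH line with sasl.enabled=true and mode=NONE is the combination behind this error; the reverse combination fails the same way.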
Kerberos handshake fails with Server not found in Kerberos database¶
The service principal in the catalog body uses _HOST substitution,
and the hostname Trino fills in (taken from the Metastore host it
connects to) does not match the principal registered in the KDC. Use
the explicit form hive/<full-service-fqdn>@<REALM> everywhere.
Server certificate validation failed¶
The KDC issues tickets for kdc-fqdn:88 but the Trino pod resolves
kdc-fqdn to a different IP than what's in the krb5.conf. Check
/etc/trino/cloudera-conf/krb5.conf (rendered from the
akko-trino-cloudera-conf ConfigMap) and ensure the [realms] block
references the same FQDN that resolves cluster-internally.
Related¶
- akko-demo-cloudera chart: the simulated Hadoop perimeter (HDFS + Hive Metastore + KDC) used for the demo
- Climscore federation: another external customer data source pattern (Postgres direct connection, no Hive layer)
- RBAC matrix: how the 5 AKKO roles see the cloudera catalog