# ADR-030 — Trino Gateway for coordinator HA
- Status: Proposed (Sprint 45 planning)
- Date: 2026-04-25
- Drivers: CTO prospect audit 2026-04-25 — "single Trino coordinator = single point of failure"; Sprint 44 Stream HA close-out
## Context
Trino's architecture is deliberately single-coordinator: the coordinator
owns query parsing, planning, scheduling and result coordination. Worker
HA is trivial (we shipped that in commit 7d301cb: workers 1 → 2 +
retry-policy=QUERY), but coordinator HA is not. Killing the coordinator
pod kills every in-flight query and locks every Cockpit / ADEN / Superset /
dbt caller for the ~30-60 s it takes to reschedule.
A tier-2 banking prospect flagged this as a P0 dealbreaker for SaaS distribution: they expect coordinator HA out of the box.
## Considered options
### Option A — Trino Gateway (project trinodb/trino-gateway)
A purpose-built reverse proxy in front of N independent coordinators. Routes queries by routing rules (round-robin, hash, custom). Each coordinator runs its own scheduler; the Gateway tracks query → coordinator mapping for the lifetime of the query so result fetching works.
- License: Apache 2.0 (governance: Trino Software Foundation; descended from Lyft's gateway project). Compatible with the AKKO commercial-safe rule.
- Maturity: 2.x stable, used in production by Bloomberg, Lyft, Pinterest.
- Operator effort: separate Helm chart, separate database (PostgreSQL — can reuse `akko-postgresql`), 2-3 coordinator replicas behind it.
- Failover behaviour: if a coordinator dies, in-flight queries on it fail and need re-submission; new queries from the same client Just Work because the Gateway picks another coordinator.
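For scale: the Gateway itself is a small service driven by one config file. Below is a minimal sketch of what its `config.yaml` could look like, assuming the upstream's documented key names; the database name, credentials, and port are illustrative placeholders, not decided values.

```yaml
# Illustrative trino-gateway config sketch. Key names follow the
# upstream docs; db name, credentials and port are placeholders.
serverConfig:
  http-server.http.port: 8080            # port the akko-trino Service would target
dataStore:                               # query -> coordinator mapping lives here
  jdbcUrl: jdbc:postgresql://akko-postgresql:5432/trino_gateway
  user: gateway
  password: changeme
  driver: org.postgresql.Driver
```

The 2-3 coordinators are then registered as backends (via the Gateway's admin API or seeded in the database), each assignable to a routing group.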
### Option B — Single coordinator + aggressive retry-policy=QUERY
Skip Gateway. Rely on Trino's built-in QUERY-level retry: if the coordinator dies mid-query, the client driver re-submits, hits the new coordinator (after pod restart), retries transparently.
- Pros: zero operator effort, no extra component.
- Cons: 30-60 s outage window per coordinator restart. Long queries lose all progress. Doesn't address the prospect's "single point of failure" red flag — it just makes recovery automatic.
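The recovery path Option B leans on is plain re-submission from the client side. A toy sketch of that loop; `flaky_submit` stands in for a real Trino client call and is purely illustrative, not AKKO code:

```python
import time

def run_with_retry(submit, max_attempts=3, backoff_s=0.0):
    """Re-submit a query when the coordinator connection drops."""
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s)  # in reality: the 30-60 s pod reschedule window

# Simulated coordinator: dies on the first submission, healthy after restart.
calls = {"n": 0}
def flaky_submit():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("coordinator pod terminated")
    return "query result"

print(run_with_retry(flaky_submit))  # succeeds on the second attempt
```

The cons above live in `backoff_s`: the retry works, but the client still eats the full restart window before the second attempt can succeed.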
### Option C — k8s-native HPA on coordinator with HA Service
Run multiple coordinators behind a regular k8s Service, hash-routed, each coordinator scheduling independently. Will not work: clients lose query state mid-fetch when the Service round-robins their next request to a different coordinator.
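The failure mode in miniature: Trino clients fetch results in pages tied to state held by the coordinator that planned the query, so a stateless round-robin Service strands the client on a later fetch. A toy model (names and paging are illustrative, not the real protocol):

```python
class Coordinator:
    """Toy coordinator: query state is in-memory and node-local."""
    def __init__(self, name):
        self.name = name
        self.queries = {}
    def submit(self, qid):
        self.queries[qid] = ["page-1", "page-2"]
    def fetch(self, qid):
        if qid not in self.queries:
            raise KeyError(f"{self.name}: unknown query {qid}")
        return self.queries[qid].pop(0)

c1, c2 = Coordinator("coord-1"), Coordinator("coord-2")
c1.submit("q42")        # round-robin sent the submit to coord-1
print(c1.fetch("q42"))  # ok while we stay on coord-1
try:
    c2.fetch("q42")     # next round-robin hop: state does not exist there
except KeyError as e:
    print("fetch failed:", e)
```

Option A avoids exactly this: the Gateway pins each query id to its coordinator for the lifetime of the query.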
## Decision
Adopt Option A — Trino Gateway (trinodb/trino-gateway) as the AKKO HA story for the Trino coordinator.
Default install: 2 coordinators + Gateway in front. The Gateway exposes the same Service name (`akko-trino`) every consumer already uses, so no client-side change is needed.
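In Helm terms the default install could look like the fragment below; these key names are hypothetical AKKO chart values sketched for illustration, not the actual chart schema:

```yaml
# Hypothetical AKKO values sketch; key names are illustrative only.
trino:
  coordinator:
    replicas: 2                     # two independent coordinators
trinoGateway:
  enabled: true
  serviceNameOverride: akko-trino   # keep the Service name consumers already use
  postgresql:
    host: akko-postgresql           # reuse the existing database
```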
## Consequences
Positive
- True coordinator HA: rolling restart, single-pod kill, single-node drain → no client-visible failure beyond a sub-second blip.
- Existing dbt / ADEN / Superset / Cockpit configs keep working (the Gateway exposes the existing `akko-trino` Service and just routes to coordinators internally).
- Built-in routing rules unlock future per-tenant query isolation (one coordinator per tenant for noisy-neighbour control).
Negative
- New component to operate (Gateway pod + its Postgres state).
- ~150 MB additional RAM for the Gateway pod itself.
- ~250 MB additional Postgres footprint (Gateway routing state).
## Implementation plan
Sprint 45 — phased rollout:
- ADR + chart skeleton (this commit): plumbing only, default off.
- Helm sub-chart `akko-trino-gateway`: vendored upstream chart + AKKO defaults (PostgreSQL co-located on `akko-postgresql`, ServiceMonitor, NetworkPolicy, Ingress on `gateway.<domain>`).
- Coordinator multi-replica: bump the Trino server config to allow multiple coordinators (Trino 480 supports that since the elastic coordinator work).
- Service rewiring: the `akko-trino` Service swaps from coordinator-direct to Gateway-direct (a one-line change in the trino chart values, transparent to consumers).
- Fault-injection test: Playwright + kubectl delete pod loop to prove zero client-visible failure during a coordinator restart.
DoD: `kubectl -n akko delete pod -l component=coordinator` while running `dbt run banking_demo` → the DAG completes with at most 1 transparent retry. Ditto for ADEN `/query` 5x in a row from Cockpit.
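The kill half of that DoD can be scripted; a sketch building on the command above (namespace and label are taken from the DoD, the 5x loop from the ADEN check; point it only at a test cluster):

```python
import shlex
import subprocess

NAMESPACE = "akko"

def delete_coordinator_cmd(namespace=NAMESPACE):
    """Build the pod-kill command quoted in the DoD above."""
    return shlex.split(
        f"kubectl -n {namespace} delete pod -l component=coordinator"
    )

def kill_loop(rounds=5, runner=subprocess.run):
    """Kill a coordinator `rounds` times; the Gateway should mask each kill."""
    for _ in range(rounds):
        runner(delete_coordinator_cmd(), check=True)

if __name__ == "__main__":
    kill_loop()  # run alongside `dbt run banking_demo` and watch for failures
```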
## References
- Sprint 44 Stream HA dealbreaker list: akko-technical-map/sprints/sprint-43-5-master-plan.md
- Trino Gateway upstream: https://trinodb.github.io/trino-gateway/
- Trino 480 elastic coordinator notes: see Trino release notes, "queue.coordinator.*"
- Related ADR-021 (Catalog Manager Pro): same hot-path, complementary to Gateway.