ADR-030 — Trino Gateway for coordinator HA

  • Status: Proposed (Sprint 45 planning)
  • Date: 2026-04-25
  • Drivers: CTO prospect audit 2026-04-25 — "single Trino coordinator = single point of failure"; Sprint 44 Stream HA close-out

Context

Trino's architecture is deliberately single-coordinator: the coordinator owns query parsing, planning, scheduling and result coordination. Worker HA is trivial (we shipped that in commit 7d301cb: workers 1 → 2 + retry-policy=QUERY), but coordinator HA is not. Killing the coordinator pod kills every in-flight query and blocks every Cockpit / ADEN / Superset / dbt caller for the ~30-60 s it takes to reschedule the pod.

A banking-tier-2 prospect flagged this as a P0 dealbreaker for SaaS distribution. They expect coordinator HA out of the box.

Considered options

Option A — trino-gateway (project trinodb/trino-gateway)

A purpose-built reverse proxy in front of N independent coordinators. Routes queries by routing rules (round-robin, hash, custom). Each coordinator runs its own scheduler; the Gateway tracks query → coordinator mapping for the lifetime of the query so result fetching works.

  • License: Apache 2.0 (governance: Trino Software Foundation; descended from Lyft's presto-gateway). Compatible with the AKKO commercial-safe rule.
  • Maturity: 2.x stable, used in production by Bloomberg, Lyft, Pinterest.
  • Operator effort: separate Helm chart, separate database (PostgreSQL — can reuse akko-postgresql), 2-3 coordinator replicas behind it.
  • Failover behaviour: if a coordinator dies, its in-flight queries fail and need re-submission; new queries from the same client succeed because the Gateway routes them to a surviving coordinator.
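Operationally, the Gateway reduces to one YAML config plus backend registration. A minimal sketch, assuming the upstream config layout; hostnames and credentials below are placeholders:

```yaml
# trino-gateway config sketch -- hostnames/credentials are placeholders
serverConfig:
  http-server.http.port: 8080          # port the akko-trino Service would front
dataStore:
  # query -> coordinator mapping and the backend list live here
  jdbcUrl: jdbc:postgresql://akko-postgresql:5432/gateway
  user: gateway
  password: changeme
  driver: org.postgresql.Driver
routingRules:
  rulesEngineEnabled: false            # plain round-robin across active backends
```

Coordinators themselves are registered as backends through the Gateway's admin API/UI, not in this file.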

Option B — Single coordinator + aggressive retry-policy=QUERY

Skip the Gateway. Rely on Trino's built-in QUERY-level retry: if the coordinator dies mid-query, the client driver re-submits, hits the restarted coordinator (after the pod reschedules), and the retry is transparent.

  • Pros: zero operator effort, no extra component.
  • Cons: 30-60 s outage window per coordinator restart. Long queries lose all progress. Doesn't address the prospect's "single point of failure" red flag — it just makes recovery automatic.
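For comparison, Option B is essentially a one-line coordinator setting (Trino's fault-tolerant execution, which commit 7d301cb already enables); the tuning knob shown is optional:

```properties
# coordinator config.properties
retry-policy=QUERY
# optional: cap re-submissions per query (Trino's default is 4)
query-retry-attempts=4
```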

Option C — k8s-native HPA on coordinator with HA Service

Run multiple coordinators behind a regular k8s Service, hash-routed, with each coordinator scheduling independently. This will not work: clients lose query state mid-fetch when the Service round-robins their next request to a different coordinator.

Decision

Adopt Option A — trinodb/trino-gateway — as the AKKO HA story for the Trino coordinator.

Default install: 2 coordinators + the Gateway in front. The Gateway exposes the same Service name (akko-trino) every consumer already uses, so no client-side change is required.
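In chart terms the default install could look like the following values sketch — every key here is hypothetical until the akko-trino-gateway sub-chart lands in Sprint 45:

```yaml
# values.yaml sketch -- key names are hypothetical, not the final chart schema
trino:
  coordinator:
    replicas: 2              # N independent coordinators behind the Gateway
gateway:
  enabled: false             # default off during the plumbing-only phase
  postgresql:
    host: akko-postgresql    # reuse the existing cluster database
  service:
    name: akko-trino         # keep the Service name consumers already use
```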

Consequences

Positive

  • True coordinator HA: rolling restart, single-pod kill, single-node drain → no client-visible failure beyond a sub-second blip.
  • Existing dbt / ADEN / Superset / Cockpit configs keep working (Gateway exposes the existing akko-trino Service, just routes to coordinators internally).
  • Built-in routing rules unlock future per-tenant query isolation (one coordinator per tenant for noisy-neighbour control).
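The per-tenant isolation idea could build on the Gateway's routing-rules engine; a sketch, assuming the upstream rules-file format (the header name and routing group are illustrative):

```yaml
# routing rule sketch -- pins tenant-a traffic to its own coordinator group
---
name: tenant-a-isolation
description: "Route tenant A to a dedicated coordinator (noisy-neighbour control)"
condition: 'request.getHeader("X-Trino-Client-Tags") contains "tenant-a"'
actions:
  - 'result.put("routingGroup", "tenant-a")'
```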

Negative

  • New component to operate (Gateway pod + its Postgres state).
  • ~150 MB additional RAM for the Gateway pod itself.
  • ~250 MB additional Postgres footprint (Gateway routing state).

Implementation plan

Sprint 45 — phased rollout:

  1. ADR + chart skeleton (this commit): plumbing only, default off.
  2. Helm sub-chart akko-trino-gateway: vendored upstream chart + AKKO defaults (PostgreSQL co-located on akko-postgresql, ServiceMonitor, NetworkPolicy, Ingress on gateway.<domain>).
  3. Coordinator multi-replica: bump the Trino server config to allow multiple coordinators (supported since the elastic coordinator work in Trino 480).
  4. Service rewiring: akko-trino Service swaps from coordinator-direct to Gateway-direct (one-line change in trino chart values, transparent to consumers).
  5. Fault-injection test: Playwright + kubectl delete pod loop to prove zero client-visible failure during a coordinator restart.

DoD: kubectl -n akko delete pod -l component=coordinator while dbt run banking_demo is in flight → the DAG completes with at most 1 transparent retry. Same for ADEN /query, 5x in a row from Cockpit.

References

  • Sprint 44 Stream HA dealbreaker list: akko-technical-map/sprints/sprint-43-5-master-plan.md
  • Trino Gateway upstream: https://trinodb.github.io/trino-gateway/
  • Trino 480 elastic coordinator notes: see Trino release notes, "queue.coordinator.*"
  • Related ADR-021 (Catalog Manager Pro): same hot-path, complementary to Gateway.