
ADR-031 — PostgreSQL HA via CloudNativePG operator

  • Status: Proposed (Sprint 45 planning)
  • Date: 2026-04-25
  • Drivers: CTO prospect audit 2026-04-25 — "Postgres single primary + nightly pg_dump = unacceptable for banking SaaS"; Sprint 44 Stream HA close-out

Context

AKKO ships two PostgreSQL instances by design (rule documented in CLAUDE.md):

  • akko-postgresql — infra metadata (Keycloak, Airflow, Polaris, OpenMetadata, MLflow, Superset, JupyterHub, LiteLLM).
  • akko-postgresql-data — functional / customer data (analytics, geospatial, RAG, banking demo tables).

Both run as a single-primary StatefulSet with one PVC. Failure modes today:

  • Pod / node outage → ~30 s of total platform downtime while the StatefulSet reschedules; every Keycloak login, every Trino query, every Airflow run fails during that window.
  • Disk corruption → restore from yesterday's pg_dump = up to 24 h of data loss (RPO 24 h, RTO ~30 min manual).
  • No read-replica → reporting queries (Superset, Cockpit usage page) contend with OLTP.

The Sprint 44 Stream HA dealbreaker list explicitly asked for "Postgres HA via CloudNativePG". Tier-2 banking prospects expect RPO < 5 min and RTO < 5 min on the metadata DB, and RPO 0 (synchronous replication) on the customer data DB.

Considered options

Option A — CloudNativePG (CNPG) operator

Apache 2.0 operator originated by EDB, now a CNCF Sandbox project (graduating). It manages PostgreSQL clusters declaratively through a Cluster CRD (a minimal sketch follows the list below):

  • 1 primary + N synchronous or asynchronous replicas; automated failover is handled by the operator, with manual switchover via the kubectl cnpg plugin.
  • WAL streaming to S3 (akko-storage Lego layer, already shipped) for point-in-time recovery (PITR) to any point within the retention window.
  • Replication slots, PgBouncer pooling, monitoring and backups all exposed as first-class CRDs.
  • A single operator, no Patroni / repmgr / etcd-for-Postgres glue; cluster state lives in the Kubernetes API.
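
For a feel of what "declarative" means here, a minimal sketch of a Cluster manifest; names and sizes are placeholders, not the values we will ship:

```yaml
# Hedged sketch only: names and sizes are placeholders, not the shipped AKKO values.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-example            # placeholder name
  namespace: akko
spec:
  instances: 3                # 1 primary + 2 replicas, reconciled by the operator
  storage:
    size: 20Gi                # one RWO PVC per instance
```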

Status: CNCF Sandbox project (on the graduation track), ~12k GitHub stars, backed and adopted by EDB itself, with parts of the Kubegres / Patroni community converging on it.

Option B — Bitnami postgresql-ha chart (repmgr + Pgpool-based)

repmgr + Pgpool-II failover and pooling stack, packaged by Bitnami / VMware.

  • Pros: classic battle-tested combo, large operational community.
  • Cons: 2025-12-03 Bitnami flipped to maintenance mode; future of the chart unclear (cf. memory feedback_governance_first.md — governance > license, and Bitnami now flagged as single-vendor risk). Also requires us to operate a Pgpool-II layer in front of Postgres just for routing and failover.

Option C — Stay single-primary, beef up backup story

Keep the StatefulSet but add (illustrative Velero schedule after this list):

  • continuous WAL archiving to akko-storage S3,
  • Velero scheduled snapshots of the PVC,
  • an automated restore drill (already shipped as Sprint 44 commit 2cf5b70).
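
A rough sketch of the Velero piece, assuming Velero is installed; the schedule name, label selector and timing are assumptions, not shipped config:

```yaml
# Illustrative only: schedule name, labels and timing are assumptions, not shipped config.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: akko-postgresql-nightly     # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"             # nightly, on top of continuous WAL archiving
  template:
    includedNamespaces:
      - akko
    labelSelector:
      matchLabels:
        app: akko-postgresql        # assumed label on the current StatefulSet pods/PVCs
    snapshotVolumes: true           # snapshot the PVC through the configured volume snapshotter
```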

Pros: zero new component.

Cons: doesn't fix the 30 s downtime on pod death. Banking prospect called this out as a non-starter.

Option D — Native pg_basebackup standby pair (manual)

Roll our own primary + standby pair with custom Bash, pg_basebackup and streaming replication. Rejected: this reinvents what CNPG already packages correctly, and offers low operational confidence for an SLA-bound platform.

Decision

Adopt Option A — CloudNativePG.

Two Cluster resources:

  • akko-postgresql cluster: primary + 1 async replica + 1 sync replica → RPO 0 on infra metadata, RTO < 30 s automated failover.
  • akko-postgresql-data cluster: primary + 2 sync replicas (banking demo tables) → RPO 0, RTO < 30 s. WAL → akko-storage S3 with 30-day retention for PITR (sketch below).
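
A hedged sketch of the akko-postgresql-data cluster described above; field values follow the estimates in this ADR and must be confirmed against the CNPG Cluster API:

```yaml
# Sketch, not the final manifest: storage size and naming follow the estimates in this ADR.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: akko-postgresql-data
  namespace: akko
spec:
  instances: 3              # 1 primary + 2 synchronous replicas
  minSyncReplicas: 2        # commits wait for both replicas -> RPO 0
  maxSyncReplicas: 2
  storage:
    size: 80Gi              # matches the ~80 Gi estimate in Consequences
```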

The default CNPG image is ghcr.io/cloudnative-pg/postgresql:16. CNPG supports custom images via the imageName field, so our existing akko-postgres:2026.04 image (PostGIS + pgvector) is vendored as akko/akko-postgres-cnpg:2026.04 (our base rebuilt on the CNPG entrypoint), with the extensions preloaded via the postgresql.parameters.shared_preload_libraries list.
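
The custom-image wiring, as a hedged fragment of the Cluster spec; the placement and value of shared_preload_libraries are assumptions to confirm against the CNPG docs for the pinned operator version:

```yaml
# Fragment of the Cluster spec above; the shared_preload_libraries placement and
# value are assumptions to confirm against the CNPG docs for the pinned version.
spec:
  imageName: akko/akko-postgres-cnpg:2026.04        # vendored base: CNPG entrypoint + PostGIS + pgvector
  postgresql:
    parameters:
      shared_preload_libraries: "<libraries baked into akko-postgres-cnpg>"   # placeholder value
```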

Consequences

Positive

  • True HA: no user-visible downtime on pod / node loss for banking-critical workloads.
  • WAL archiving + PITR: RPO drops to seconds, RTO drops to ~5 min for the worst case.
  • Sync replicas + read-only Service for reporting → Superset and Cockpit Usage page no longer contend with OLTP.
  • Backup / restore drill replaces our hand-rolled CronJobs (commit 2cf5b70 becomes a sanity check, not the primary DR posture).
  • Operator-managed certificates for replication.

Negative

  • One more operator to install + monitor (~80 MB RAM for the operator pod itself).
  • Migration is non-trivial: we need to dump from the current StatefulSet, restore into the new CNPG Cluster, and cut consumers over to the new Service. Estimated 2-3 h downtime for the migration window (do it during a planned maintenance window, not on a busy demo day).
  • Extra Helm dependency: cnpg/cloudnative-pg chart pinned in Chart.yaml → +1 chart, +1 update treadmill.
  • ~3-5x storage requirement (primary + 2 sync replicas + WAL) → from ~20 Gi today to ~80 Gi for the data cluster.

Implementation plan

Sprint 45 — phased rollout (estimated ~25 h effort):

  1. Operator install (1 h): add the cloudnative-pg sub-chart dependency in helm/akko/Chart.yaml, default enabled: false (dependency sketch below). Helm install on Netcup with --set cloudnative-pg.enabled=true.
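
A hedged sketch of that dependency entry; the chart version is a placeholder to pin at implementation time:

```yaml
# helm/akko/Chart.yaml (excerpt); chart version is a placeholder to pin at implementation time.
dependencies:
  - name: cloudnative-pg
    version: "0.x.y"                                   # placeholder, pin the tested release
    repository: https://cloudnative-pg.github.io/charts
    condition: cloudnative-pg.enabled                  # default false; flipped to true on Netcup
```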

  2. Image vendoring (3 h): fork akko-postgres:2026.04 into akko-postgres-cnpg:2026.04, based on ghcr.io/cloudnative-pg/postgresql:16-bookworm, keeping the PostGIS + pgvector extensions preloaded. Build wired into the CI pipeline in helm/scripts/build-images.sh.

  3. Two Cluster CRDs (4 h): helm/akko/charts/akko-postgres/templates/cluster.yaml and helm/akko/charts/akko-postgresql-data/templates/cluster.yaml, gated by the cnpg.enabled flag (gating sketch below). Default off — single-primary StatefulSet stays the path for offline / dev installs.
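
The gating in the chart templates could look like this sketch; the value keys are assumptions, only the cnpg.enabled flag comes from the step above:

```yaml
# helm/akko/charts/akko-postgresql-data/templates/cluster.yaml (sketch); value keys are assumptions.
{{- if .Values.cnpg.enabled }}
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: akko-postgresql-data
spec:
  instances: {{ .Values.cnpg.instances | default 3 }}
  storage:
    size: {{ .Values.cnpg.storageSize | default "80Gi" }}
{{- end }}
```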

  4. Migration script (5 h): helm/scripts/migrate-postgres-to-cnpg.sh that dumps from current StatefulSet, scales it down, deploys the CNPG Cluster, restores from dump, swaps the akko-postgresql Service selector to point at CNPG primary. Idempotent + reversible.

  5. WAL archiving (3 h): configure cluster.spec.backup.barmanObjectStore to push WAL to akko-storage S3 (canonical akko-s3 Secret, Sprint 43.5 glue). Retention via the backup retentionPolicy, i.e. a 30-day recovery window (fragment below).
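
A hedged fragment of that backup stanza; the endpoint URL, bucket path and Secret key names are assumptions to verify against the Sprint 43.5 akko-s3 conventions:

```yaml
# Fragment of the Cluster spec; endpoint URL, bucket path and Secret key names are assumptions.
spec:
  backup:
    retentionPolicy: "30d"                            # 30-day recovery window for PITR
    barmanObjectStore:
      destinationPath: s3://akko-storage/backups/     # base backups + WAL pushed under this prefix
      endpointURL: http://akko-storage:9000           # assumed in-cluster S3 endpoint
      s3Credentials:
        accessKeyId:
          name: akko-s3                               # canonical Secret (Sprint 43.5 glue)
          key: access-key                             # assumed key name
        secretAccessKey:
          name: akko-s3
          key: secret-key                             # assumed key name
```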

  6. Failover drill (3 h): extend the restore-drill CronJob (commit 2cf5b70) to also run kubectl cnpg promote against a designated replica weekly, and verify the platform stays available throughout (illustrative CronJob below).
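
One possible shape for the weekly promotion job; the image, ServiceAccount, schedule and instance name are all assumptions rather than the shipped drill:

```yaml
# Sketch only: image, ServiceAccount, schedule and instance name are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: akko-postgres-failover-drill       # hypothetical name, extends the restore-drill job
  namespace: akko
spec:
  schedule: "0 4 * * 1"                     # weekly, Monday 04:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: akko-failover-drill    # needs RBAC on cnpg.io Clusters and pods
          restartPolicy: Never
          containers:
            - name: promote
              image: registry.example/akko/kubectl-cnpg:latest   # hypothetical image bundling kubectl + the cnpg plugin
              command: ["kubectl", "cnpg", "promote", "akko-postgresql-data", "akko-postgresql-data-2"]
              # instance name is illustrative; point it at a designated sync replica
```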

  7. Rollout to Netcup (3 h): planned maintenance window, migration script, smoke tests.

  8. Documentation + ADR finalisation (3 h): this ADR moves to Status: Accepted once Netcup is migrated and the failover drill has fired green twice.

DoD: kubectl -n akko delete pod akko-postgresql-cluster-1 (the current primary) → CNPG promotes a sync replica within 30 s, with no Cockpit / Keycloak / Airflow downtime visible to the user. WAL archiving visible under the akko-storage backups/wal/ prefix.

Migration risks + mitigations

  • Risk: dump / restore window exceeds the planned 3 h → Mitigation: pre-test on a staging copy of Netcup; keep the StatefulSet as a hot rollback for one sprint.
  • Risk: pgvector / PostGIS extensions not loaded by CNPG → Mitigation: the vendored akko-postgres-cnpg image preloads them; the smoke test in step 7 explicitly checks vector_ops and postgis_version().
  • Risk: Service name mismatch breaks consumers → Mitigation: the akko-postgresql Service keeps its name; only the selector flips to the CNPG primary.
  • Risk: the storage class doesn't support multi-attach (RWX) → Mitigation: Netcup already runs RWO local-path; CNPG only needs one RWO PVC per replica, never RWX.

References

  • CloudNativePG: https://cloudnative-pg.io/
  • Sprint 44 Stream HA: akko-technical-map/sprints/sprint-43-5-master-plan.md
  • Sprint 44 dealbreaker #1 (OPA HA): commit 1c4a677 (already shipped)
  • Restore drill (complement to CNPG, not replacement): commit 2cf5b70
  • Bitnami maintenance-mode notice (rationale for B → A): https://github.com/bitnami/charts/issues/...
  • Memory feedback_governance_first.md — governance > license (CNPG = CNCF Sandbox = neutral foundation).