
ADR-031 — PostgreSQL HA via CloudNativePG operator

  • Status: Proposed (Sprint 45 planning)
  • Date: 2026-04-25
  • Drivers: CTO prospect audit 2026-04-25 — "Postgres single primary + nightly pg_dump = unacceptable for banking SaaS"; Sprint 44 Stream HA close-out

Context

AKKO ships two PostgreSQL instances by design (rule documented in CLAUDE.md):

  • akko-postgresql — infra metadata (Keycloak, Airflow, Polaris, OpenMetadata, MLflow, Superset, JupyterHub, LiteLLM).
  • akko-postgresql-data — functional / customer data (analytics, geospatial, RAG, banking demo tables).

Both run as a single-primary StatefulSet with one PVC. Failure modes today:

  • Pod / node outage → ~30 s of total platform downtime while the StatefulSet reschedules; every Keycloak login, every Trino query, every Airflow run fails during that window.
  • Disk corruption → restore from yesterday's pg_dump = up to 24 h of data loss (RPO 24 h, RTO ~30 min manual).
  • No read-replica → reporting queries (Superset, Cockpit usage page) contend with OLTP.

The Sprint 44 Stream HA dealbreaker list explicitly asked for "Postgres HA via CloudNativePG". Tier-2 banking prospects expect RPO < 5 min and RTO < 5 min on the metadata DB, and RPO 0 (synchronous replication) on the customer data DB.

Considered options

Option A — CloudNativePG (CNPG) operator

Apache 2.0 operator originated by EDB, now a CNCF Sandbox project (graduating). It manages PostgreSQL clusters declaratively through a Cluster CRD (a minimal sketch follows the list below):

  • 1 primary + N synchronous or asynchronous replicas; automated failover is handled by the operator, with manual switchover via the kubectl cnpg plugin.
  • WAL streaming to S3 (akko-storage Lego layer, already shipped) for point-in-time recovery (PITR) to any point within the retention window.
  • Replication slots, PgBouncer pooling, monitoring and backups all exposed as first-class CRDs.
  • A single operator, no Patroni / repmgr / etcd-for-Postgres glue; cluster state lives in the Kubernetes API.
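
For a feel of what "declarative" means here, a minimal sketch of a Cluster manifest; names and sizes are placeholders, not the values we will ship:

```yaml
# Hedged sketch only: names and sizes are placeholders, not the shipped AKKO values.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-example            # placeholder name
  namespace: akko
spec:
  instances: 3                # 1 primary + 2 replicas, reconciled by the operator
  storage:
    size: 20Gi                # one RWO PVC per instance
```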

Status: CNCF Sandbox project (on the graduation track), ~12k GitHub stars, backed and adopted by EDB itself, with parts of the Kubegres / Patroni community converging on it.

Option B — Bitnami postgresql-ha chart (repmgr + Pgpool-based)

repmgr + Pgpool-II failover and pooling stack, packaged by Bitnami / VMware.

  • Pros: classic battle-tested combo, large operational community.
  • Cons: 2025-12-03 Bitnami flipped to maintenance mode; future of the chart unclear (cf. memory feedback_governance_first.md — governance > license, and Bitnami now flagged as single-vendor risk). Also requires us to operate a Pgpool-II layer in front of Postgres just for routing and failover.

Option C — Stay single-primary, beef up backup story

Keep the StatefulSet but add (illustrative Velero schedule after this list):

  • continuous WAL archiving to akko-storage S3,
  • Velero scheduled snapshots of the PVC,
  • an automated restore drill (already shipped as Sprint 44 commit 2cf5b70).
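
A rough sketch of the Velero piece, assuming Velero is installed; the schedule name, label selector and timing are assumptions, not shipped config:

```yaml
# Illustrative only: schedule name, labels and timing are assumptions, not shipped config.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: akko-postgresql-nightly     # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"             # nightly, on top of continuous WAL archiving
  template:
    includedNamespaces:
      - akko
    labelSelector:
      matchLabels:
        app: akko-postgresql        # assumed label on the current StatefulSet pods/PVCs
    snapshotVolumes: true           # snapshot the PVC through the configured volume snapshotter
```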

Pros: zero new component.

Cons: doesn't fix the 30 s downtime on pod death. Banking prospect called this out as a non-starter.

Option D — Native pg_basebackup standby pair (manual)

Roll our own primary + standby pair with custom Bash, pg_basebackup and streaming replication. Rejected: this reinvents what CNPG already packages correctly, and offers low operational confidence for an SLA-bound platform.

Decision

Adopt Option A — CloudNativePG.

Two Cluster resources:

  • akko-postgresql cluster: primary + 1 async replica + 1 sync replica → RPO 0 on infra metadata, RTO < 30 s automated failover.
  • akko-postgresql-data cluster: primary + 2 sync replicas (banking demo tables) → RPO 0, RTO < 30 s. WAL → akko-storage S3 with 30-day retention for PITR (sketch below).
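
A hedged sketch of the akko-postgresql-data cluster described above; field values follow the estimates in this ADR and must be confirmed against the CNPG Cluster API:

```yaml
# Sketch, not the final manifest: storage size and naming follow the estimates in this ADR.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: akko-postgresql-data
  namespace: akko
spec:
  instances: 3              # 1 primary + 2 synchronous replicas
  minSyncReplicas: 2        # commits wait for both replicas -> RPO 0
  maxSyncReplicas: 2
  storage:
    size: 80Gi              # matches the ~80 Gi estimate in Consequences
```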

The default CNPG image is ghcr.io/cloudnative-pg/postgresql:16. CNPG supports custom images via the imageName field, so our existing akko-postgres:2026.04 image (PostGIS + pgvector) is vendored as akko/akko-postgres-cnpg:2026.04 (our base rebuilt on the CNPG entrypoint), with the extensions preloaded via the postgresql.parameters.shared_preload_libraries list.
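
The custom-image wiring, as a hedged fragment of the Cluster spec; the placement and value of shared_preload_libraries are assumptions to confirm against the CNPG docs for the pinned operator version:

```yaml
# Fragment of the Cluster spec above; the shared_preload_libraries placement and
# value are assumptions to confirm against the CNPG docs for the pinned version.
spec:
  imageName: akko/akko-postgres-cnpg:2026.04        # vendored base: CNPG entrypoint + PostGIS + pgvector
  postgresql:
    parameters:
      shared_preload_libraries: "<libraries baked into akko-postgres-cnpg>"   # placeholder value
```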

Consequences

Positive

  • True HA: no user-visible downtime on pod / node loss for banking-critical workloads.
  • WAL archiving + PITR: RPO drops to seconds, RTO drops to ~5 min for the worst case.
  • Sync replicas + read-only Service for reporting → Superset and Cockpit Usage page no longer contend with OLTP.
  • Backup / restore drill replaces our hand-rolled CronJobs (commit 2cf5b70 becomes a sanity check, not the primary DR posture).
  • Operator-managed certificates for replication.

Negative

  • One more operator to install + monitor (~80 MB RAM for the operator pod itself).
  • Migration is non-trivial: we need to dump from the current StatefulSet, restore into the new CNPG Cluster, and cut consumers over to the new Service. Estimated 2-3 h downtime for the migration window (do it during a planned maintenance window, not on a busy demo day).
  • Extra Helm dependency: cnpg/cloudnative-pg chart pinned in Chart.yaml → +1 chart, +1 update treadmill.
  • ~3-5x storage requirement (primary + 2 sync replicas + WAL) → from ~20 Gi today to ~80 Gi for the data cluster.

Implementation plan

Sprint 45 — phased rollout (estimated ~25 h effort):

  1. Operator install (1 h): add the cloudnative-pg sub-chart dependency in helm/akko/Chart.yaml, default enabled: false (dependency sketch below). Helm install on Netcup with --set cloudnative-pg.enabled=true.
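
A hedged sketch of that dependency entry; the chart version is a placeholder to pin at implementation time:

```yaml
# helm/akko/Chart.yaml (excerpt); chart version is a placeholder to pin at implementation time.
dependencies:
  - name: cloudnative-pg
    version: "0.x.y"                                   # placeholder, pin the tested release
    repository: https://cloudnative-pg.github.io/charts
    condition: cloudnative-pg.enabled                  # default false; flipped to true on Netcup
```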

  2. Image vendoring (3 h): fork akko-postgres:2026.04 into akko-postgres-cnpg:2026.04, based on ghcr.io/cloudnative-pg/postgresql:16-bookworm, keeping the PostGIS + pgvector extensions preloaded. Build wired into the CI pipeline in helm/scripts/build-images.sh.

  3. Two Cluster CRDs (4 h): helm/akko/charts/akko-postgres/templates/cluster.yaml and helm/akko/charts/akko-postgresql-data/templates/cluster.yaml, gated by the cnpg.enabled flag (gating sketch below). Default off — single-primary StatefulSet stays the path for offline / dev installs.
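
The gating in the chart templates could look like this sketch; the value keys are assumptions, only the cnpg.enabled flag comes from the step above:

```yaml
# helm/akko/charts/akko-postgresql-data/templates/cluster.yaml (sketch); value keys are assumptions.
{{- if .Values.cnpg.enabled }}
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: akko-postgresql-data
spec:
  instances: {{ .Values.cnpg.instances | default 3 }}
  storage:
    size: {{ .Values.cnpg.storageSize | default "80Gi" }}
{{- end }}
```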

  4. Migration script (5 h): helm/scripts/migrate-postgres-to-cnpg.sh that dumps from current StatefulSet, scales it down, deploys the CNPG Cluster, restores from dump, swaps the akko-postgresql Service selector to point at CNPG primary. Idempotent + reversible.

  5. WAL archiving (3 h): configure cluster.spec.backup.barmanObjectStore to push WAL to akko-storage S3 (canonical akko-s3 Secret, Sprint 43.5 glue). Retention via the backup retentionPolicy, i.e. a 30-day recovery window (fragment below).
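
A hedged fragment of that backup stanza; the endpoint URL, bucket path and Secret key names are assumptions to verify against the Sprint 43.5 akko-s3 conventions:

```yaml
# Fragment of the Cluster spec; endpoint URL, bucket path and Secret key names are assumptions.
spec:
  backup:
    retentionPolicy: "30d"                            # 30-day recovery window for PITR
    barmanObjectStore:
      destinationPath: s3://akko-storage/backups/     # base backups + WAL pushed under this prefix
      endpointURL: http://akko-storage:9000           # assumed in-cluster S3 endpoint
      s3Credentials:
        accessKeyId:
          name: akko-s3                               # canonical Secret (Sprint 43.5 glue)
          key: access-key                             # assumed key name
        secretAccessKey:
          name: akko-s3
          key: secret-key                             # assumed key name
```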

  6. Failover drill (3 h): extend the restore-drill CronJob (commit 2cf5b70) to also run kubectl cnpg promote against a designated replica weekly, and verify the platform stays available throughout (illustrative CronJob below).
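
One possible shape for the weekly promotion job; the image, ServiceAccount, schedule and instance name are all assumptions rather than the shipped drill:

```yaml
# Sketch only: image, ServiceAccount, schedule and instance name are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: akko-postgres-failover-drill       # hypothetical name, extends the restore-drill job
  namespace: akko
spec:
  schedule: "0 4 * * 1"                     # weekly, Monday 04:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: akko-failover-drill    # needs RBAC on cnpg.io Clusters and pods
          restartPolicy: Never
          containers:
            - name: promote
              image: registry.example/akko/kubectl-cnpg:latest   # hypothetical image bundling kubectl + the cnpg plugin
              command: ["kubectl", "cnpg", "promote", "akko-postgresql-data", "akko-postgresql-data-2"]
              # instance name is illustrative; point it at a designated sync replica
```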

  7. Rollout to Netcup (3 h): planned maintenance window, migration script, smoke tests.

  8. Documentation + ADR finalisation (3 h): this ADR moves to Status: Accepted once Netcup is migrated and the failover drill has fired green twice.

DoD: kubectl -n akko delete pod akko-postgresql-cluster-1 (the current primary) → CNPG promotes a sync replica within 30 s, with no Cockpit / Keycloak / Airflow downtime visible to the user. WAL archiving visible under the akko-storage backups/wal/ prefix.

Migration risks + mitigations

  • Risk: dump / restore window exceeds the planned 3 h → Mitigation: pre-test on a staging copy of Netcup; keep the StatefulSet as a hot rollback for one sprint.
  • Risk: pgvector / PostGIS extensions not loaded by CNPG → Mitigation: the vendored akko-postgres-cnpg image preloads them; the smoke test in step 7 explicitly checks vector_ops and postgis_version().
  • Risk: Service name mismatch breaks consumers → Mitigation: the akko-postgresql Service keeps its name; only the selector flips to the CNPG primary.
  • Risk: the storage class doesn't support multi-attach (RWX) → Mitigation: Netcup already runs RWO local-path; CNPG only needs one RWO PVC per replica, never RWX.

References

  • CloudNativePG: https://cloudnative-pg.io/
  • Sprint 44 Stream HA: akko-technical-map/sprints/sprint-43-5-master-plan.md
  • Sprint 44 dealbreaker #1 (OPA HA): commit 1c4a677 (already shipped)
  • Restore drill (complement to CNPG, not replacement): commit 2cf5b70
  • Bitnami maintenance-mode notice (rationale for B → A): https://github.com/bitnami/charts/issues/...
  • Memory feedback_governance_first.md — governance > license (CNPG = CNCF Sandbox = neutral foundation).