# ADR-031 — PostgreSQL HA via CloudNativePG operator
- Status: Proposed (Sprint 45 planning)
- Date: 2026-04-25
- Drivers: CTO prospect audit 2026-04-25 — "Postgres single primary + nightly pg_dump = unacceptable for banking SaaS"; Sprint 44 Stream HA close-out
## Context
AKKO ships two PostgreSQL instances by design (rule documented in
`CLAUDE.md`):

- `akko-postgresql` — infra metadata (Keycloak, Airflow, Polaris, OpenMetadata, MLflow, Superset, JupyterHub, LiteLLM).
- `akko-postgresql-data` — functional / customer data (analytics, geospatial, RAG, banking demo tables).
Both run as a single-primary StatefulSet with one PVC. Failure modes today:
- Pod / node outage → ~30 s of total platform downtime while the StatefulSet reschedules; every Keycloak login, every Trino query, every Airflow run fails during that window.
- Disk corruption → restore from yesterday's pg_dump = up to 24 h of data loss (RPO 24 h, RTO ~30 min manual).
- No read-replica → reporting queries (Superset, Cockpit usage page) contend with OLTP.
The Sprint 44 Stream HA dealbreaker list explicitly asked for "Postgres HA via CloudNativePG". Tier-2 banking prospects expect RPO < 5 min and RTO < 5 min on the metadata DB, and RPO 0 with synchronous replication on the customer-data DB.
## Considered options
### Option A — CloudNativePG (CNPG) operator
Apache 2.0 operator from EDB / CNCF Sandbox (graduating). Manages PostgreSQL clusters declaratively via a `Cluster` CRD:
- 1 primary + N synchronous / asynchronous replicas; automated failover is handled by the operator, with manual promotion available via the `kubectl cnpg` plugin.
- WAL streaming to S3 (akko-storage Lego layer — already shipped) for point-in-time recovery (PITR) at any second of the retention window.
- Replication slots, pgBouncer, monitoring, backups, all as first-class CRDs.
- Single binary, no Patroni / repmgr / etcd-for-Postgres glue.
Status: CNCF Sandbox project (graduation track), ~12k GitHub stars, adopted by EDB itself; the Kubegres / Patroni communities are converging on it.
### Option B — Bitnami postgresql-ha chart (Patroni-based)
Patroni + etcd + repmgr stack, packaged by Bitnami / VMware.
- Pros: classic battle-tested combo, large operational community.
- Cons: on 2025-12-03 Bitnami flipped to maintenance mode; the future of the chart is unclear (cf. memory `feedback_governance_first.md` — governance > license, and Bitnami is now flagged as a single-vendor risk). Also requires us to run an etcd cluster just for Postgres consensus.
### Option C — Stay single-primary, beef up backup story
Keep the StatefulSet but add:

- continuous WAL archiving to akko-storage S3,
- Velero scheduled snapshot of the PVC,
- automated restore drill (already shipped as Sprint 44 commit `2cf5b70`).
- Pros: zero new component.
- Cons: doesn't fix the 30 s downtime on pod death. The banking prospect called this out as a non-starter.
### Option D — Native pg_basebackup standby pair (manual)
Roll our own primary + standby with custom Bash + pg_basebackup + streaming replication. Rejected — this reinvents what CNPG packages correctly, with low operational confidence for an SLA-bound platform.
## Decision
Adopt Option A — CloudNativePG.
Two `Cluster` resources (a manifest sketch follows the image note below):

- `akko-postgresql` cluster: primary + 1 async replica + 1 sync replica → RPO 0 on infra metadata, RTO < 30 s automated failover.
- `akko-postgresql-data` cluster: primary + 2 sync replicas (banking demo tables) → RPO 0, RTO < 30 s. WAL → akko-storage S3 with 30 days retention for PITR.
The default CNPG image is `ghcr.io/cloudnative-pg/postgresql:16`. Our existing `akko-postgres:2026.04` custom image (PostGIS + pgvector) is not lost: the extensions stay preloaded via the `postgresql.parameters.shared_preload_libraries` list, and CNPG supports custom images via the `imageName` field, so we will vendor akko-postgres as `akko/akko-postgres-cnpg:2026.04` — our base image plus the CNPG entrypoint.
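For orientation, a minimal sketch of what the `akko-postgresql-data` Cluster could look like under the decision above — field names and sizes are illustrative and should be validated against the CNPG version we pin:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: akko-postgresql-data
  namespace: akko
spec:
  instances: 3                                  # 1 primary + 2 replicas
  # Keep both replicas synchronous so a committed banking transaction is never
  # lost on failover (RPO 0); confirm these fields against the pinned operator version.
  minSyncReplicas: 2
  maxSyncReplicas: 2
  imageName: akko/akko-postgres-cnpg:2026.04    # vendored base (PostGIS + pgvector) + CNPG entrypoint
  storage:
    size: 40Gi                                  # illustrative; see the storage note under Consequences
    storageClass: local-path
```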
## Consequences
**Positive**
- True HA: zero downtime on pod / node loss for banking critical workloads.
- WAL archiving + PITR: RPO drops to seconds, RTO drops to ~5 min for the worst case.
- Sync replicas + a read-only Service for reporting → Superset and the Cockpit Usage page no longer contend with OLTP (Service naming sketched after this list).
- Backup / restore drill replaces our hand-rolled CronJobs (commit `2cf5b70` becomes a sanity check, not the primary DR posture).
- Operator-managed certificates for replication.
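CNPG exposes one Service per role for each Cluster (`<cluster>-rw` for the primary, `<cluster>-ro` for replicas). A sketch of how a reporting consumer could target the replicas — the `superset` values keys below are hypothetical; only the Service naming comes from CNPG's convention:

```yaml
# Services created by the operator for the data cluster:
#   akko-postgresql-data-rw -> always the current primary (OLTP writes)
#   akko-postgresql-data-ro -> replicas only (reporting reads)
# Hypothetical Helm values pointing Superset's reporting datasource at the replicas:
superset:
  reportingDatabase:
    host: akko-postgresql-data-ro.akko.svc.cluster.local
    port: 5432
```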
**Negative**
- One more operator to install + monitor (~80 MB RAM for the operator pod itself).
- Migration is non-trivial: we need to dump from the current StatefulSet, restore into the new CNPG Cluster, and cut consumers over to the new Service. Estimated 2-3 h downtime for the migration window (do it during a planned maintenance window, not on a busy demo day).
- Extra Helm dependency: `cnpg/cloudnative-pg` chart pinned in Chart.yaml → +1 chart, +1 update treadmill.
- ~3-5x storage requirement (primary + 2 sync replicas + WAL) → from ~20 Gi today to ~80 Gi for the data cluster.
## Implementation plan
Sprint 45 — phased rollout (estimated ~25 h effort):

1. Operator install (1 h): add the `cloudnative-pg/cloudnative-pg` sub-chart dependency in `helm/akko/Chart.yaml`, default `enabled: false`. Helm install on Netcup with `--set enabled=true` (dependency and gating sketched after this plan).
2. Image vendoring (3 h): fork `akko-postgres:2026.04` to `akko-postgres-cnpg:2026.04` based on `ghcr.io/cloudnative-pg/postgresql:16-bookworm`, keep the PostGIS + pgvector preload extensions. CI pipeline in `helm/scripts/build-images.sh`.
3. Two Cluster CRDs (4 h): `helm/akko/charts/akko-postgres/templates/cluster.yaml` and `helm/akko/charts/akko-postgresql-data/templates/cluster.yaml`, gated by the `cnpg.enabled` flag. Default off — the single-primary StatefulSet stays the path for offline / dev installs.
4. Migration script (5 h): `helm/scripts/migrate-postgres-to-cnpg.sh` that dumps from the current StatefulSet, scales it down, deploys the CNPG Cluster, restores from the dump, and swaps the `akko-postgresql` Service selector to point at the CNPG primary. Idempotent + reversible.
5. WAL archiving (3 h): configure `cluster.spec.backup.barmanObjectStore` to push WAL to akko-storage S3 (canonical akko-s3 Secret, Sprint 43.5 glue). Retention policy (30-day recovery window) via `retentionPolicy` (backup stanza sketched after the DoD below).
6. Failover drill (3 h): extend the restore-drill CronJob (commit `2cf5b70`) to also run a `kubectl cnpg promote` on a designated replica weekly. Verify the platform stays available throughout.
7. Rollout to Netcup (3 h): planned maintenance window, migration script, smoke tests.
8. Documentation + ADR finalisation (3 h): this ADR moves to Status: Accepted once Netcup is migrated and the failover drill has fired green twice.
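A sketch of the wiring behind steps 1 and 3 — the chart version is a placeholder, and the split between `cloudnative-pg.enabled` (operator) and `cnpg.enabled` (our Cluster templates) is an assumption to settle during implementation:

```yaml
# helm/akko/Chart.yaml — the operator as an optional sub-chart
dependencies:
  - name: cloudnative-pg
    repository: https://cloudnative-pg.github.io/charts
    version: 0.x.y                      # placeholder — pin the latest tested release
    condition: cloudnative-pg.enabled
---
# helm/akko/values.yaml — both knobs off by default (offline / dev installs keep the StatefulSet)
cloudnative-pg:
  enabled: false
cnpg:
  enabled: false                        # gates our Cluster templates (step 3)
```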
DoD: `kubectl -n akko delete pod akko-postgresql-cluster-1` (the primary) → CNPG promotes a sync replica within 30 s, with no Cockpit / Keycloak / Airflow downtime visible to the user. WAL archiving visible under the akko-storage `backups/wal/` prefix.
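A sketch of the backup stanza behind step 5 and the WAL check above — the bucket path, endpoint and Secret key names are assumptions to be matched to the canonical akko-s3 Secret:

```yaml
spec:
  backup:
    retentionPolicy: "30d"                                 # 30-day PITR window
    barmanObjectStore:
      destinationPath: s3://akko-backups/postgresql-data   # assumed bucket / prefix
      endpointURL: http://akko-storage.akko.svc:9000       # assumed in-cluster S3 endpoint
      s3Credentials:
        accessKeyId:
          name: akko-s3                                    # canonical akko-s3 Secret (Sprint 43.5 glue)
          key: access-key-id                               # assumed key names
        secretAccessKey:
          name: akko-s3
          key: secret-access-key
      wal:
        compression: gzip
```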
## Migration risks + mitigations
| Risk | Mitigation |
|---|---|
| Dump / restore window > planned 3 h | Pre-test on staging copy of Netcup. Keep StatefulSet as a hot rollback for 1 sprint. |
| pgvector / PostGIS extensions not loaded by CNPG | The vendored `akko-postgres-cnpg` image preloads them; the smoke test in step 7 explicitly queries vector_ops + postgis_version() (sketched after this table). |
| Service name mismatch breaks consumers | akko-postgresql Service stays named the same; only the selector flips. |
| Storage class doesn't support RWO multi-attach | Already RWO local-path on Netcup; CNPG only needs one PVC per replica, not RWX. |
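For the extension risk above, a sketch of the step-7 smoke test as a one-shot Job — the Job name, the `app` database / user and the CNPG-generated `<cluster>-app` Secret name are assumptions to verify against the actual Cluster; the query below only checks that both extensions ship in the image, the full drill would go on to exercise vector_ops and postgis_version():

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: akko-postgres-cnpg-smoke-test               # hypothetical name
  namespace: akko
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: psql
          image: akko/akko-postgres-cnpg:2026.04    # vendored image, psql included
          env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: akko-postgresql-data-app    # CNPG-generated "<cluster>-app" Secret (to confirm)
                  key: password
          command:
            - psql
            - -h
            - akko-postgresql-data-rw               # CNPG read-write Service
            - -U
            - app                                   # default CNPG application user (assumed)
            - -d
            - app
            - -c
            - SELECT name, default_version FROM pg_available_extensions WHERE name IN ('postgis', 'vector')
```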
## References

- CloudNativePG: https://cloudnative-pg.io/
- Sprint 44 Stream HA: `akko-technical-map/sprints/sprint-43-5-master-plan.md`
- Sprint 44 dealbreaker #1 (OPA HA): commit `1c4a677` (already shipped)
- Restore drill (complement to CNPG, not replacement): commit `2cf5b70`
- Bitnami maintenance-mode notice (rationale for B → A): https://github.com/bitnami/charts/issues/...
- Memory `feedback_governance_first.md` — governance > license (CNPG = CNCF Sandbox = neutral foundation).