Disaster Recovery playbook

AKKO 2026.04 ships with a fully declarative backup strategy. This playbook documents the measured RTO / RPO and the exact commands to recover from each class of incident, as required by DORA Art. 11 for financial workloads.

Every number below is measured on reference hardware

Your numbers will differ. Re-run the drills below after provisioning and record the observed RTO / RPO in a signed test report annexed to this document.
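One way to capture the observed RTO is to wrap each drill in a small timer. The sketch below is only an illustration: the function name `time_drill` and the output format are assumptions, not part of the AKKO chart.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and print how long it took, in seconds.
# The command's own output is sent to stderr so stdout carries only the number.
time_drill() {
  local start end
  start=$(date +%s)
  "$@" >&2
  local status=$?
  end=$(date +%s)
  echo $(( end - start ))
  return "$status"
}

# Example with a stand-in command; replace `sleep 1` with a drill script.
elapsed=$(time_drill sleep 1)
echo "Observed RTO: ${elapsed} s"
```

The printed figure is wall-clock time for the wrapped command only. Time spent detecting the incident or deciding to restore would need to be added separately in the test report.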

Scope

Stateful component | Backup mechanism | Tool
PostgreSQL (Keycloak/Airflow/OM/Superset/MLflow/Polaris) | Logical dump + WAL archiving | pg_dumpall CronJob + Barman (optional)
PostgreSQL Data (functional) | Logical dump + WAL archiving | Same as above
Iceberg tables (object storage warehouse) | Object replication + Iceberg snapshots | Velero + S3 bucket replication
object storage object data | Bucket replication or S3 mirror | mc mirror / Velero
Keycloak realm + users | JSON export + DB dump | Admin API + pg_dump
Trino catalogs (dynamic) | ConfigMap snapshot | Helm release + Catalog Manager OPA CM
ADEN query cache + receipts | PostgreSQL dump | Same as PostgreSQL above
Prometheus / VictoriaMetrics TSDB | Volume snapshot | Velero CSI snapshot
logs layer audit data | S3 Object Lock cold storage | logs layer native (configured)
Helm state (releases) | etcd (via Velero) | Velero namespace-scoped backup

Targets

Class | RTO target | RPO target | Scenario
Critical data | 2 h | 1 h | PostgreSQL Data / Iceberg warehouse corruption
Control plane | 4 h | 4 h | Keycloak / OPA / catalog-manager outage
Observability | 12 h | 24 h | VictoriaMetrics / logs layer pod crash
Full cluster rebuild | 24 h | 4 h | Entire K8s cluster lost

Backup schedule

All CronJobs live in the akko namespace and are driven by the chart:

Job | Schedule | Retention
akko-pg-backup (infra) | 0 */4 * * * | 14 d local, 90 d cold (S3)
akko-pg-data-backup (functional) | 0 */2 * * * | 14 d local, 180 d cold
akko-minio-replicate | continuous | 365 d WORM
akko-keycloak-realm-export | 0 3 * * * | 30 d
akko-velero-ns-akko | 0 1 * * * | 30 d
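The schedules above bound the RPO that periodic dumps can deliver on their own: a failure just before the next run loses up to one full interval of changes. The sketch below does that arithmetic for the two PostgreSQL jobs. It reads the intervals off the cron expressions (4 h and 2 h). Pairing each job with an RPO target from the Targets section is an assumption made here for illustration, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check: can a periodic dump alone meet an RPO target?
# Worst-case loss from dumps alone is one full interval between runs.
dump_meets_rpo() {
  local interval_h=$1 rpo_h=$2
  if [ "$interval_h" -le "$rpo_h" ]; then
    echo "ok"
  else
    echo "needs-wal"
  fi
}

# Intervals from the schedule above; RPO pairings are assumed, not stated by the chart.
echo "akko-pg-backup (4 h interval, 4 h RPO): $(dump_meets_rpo 4 4)"
echo "akko-pg-data-backup (2 h interval, 1 h RPO): $(dump_meets_rpo 2 1)"
```

Where the result is `needs-wal`, the difference would have to be covered by the WAL archiving listed under Scope rather than by the dump schedule.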

Drill 1 — Restore a single PostgreSQL database

Scenario: someone dropped a table in superset_metadata.

# 1. Stop the consumer
kubectl -n akko scale deploy akko-superset --replicas=0

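# 2. Pull the most recent dump from S3 cold storage
#    (aws s3 cp does not expand wildcards, so list the prefix and pick the newest key;
#    this relies on the date stamp in the file name sorting chronologically)
latest=$(aws s3 ls s3://akko-backups/pg/ | awk '{print $4}' \
  | grep '^superset_metadata_' | sort | tail -n 1)
aws s3 cp "s3://akko-backups/pg/${latest}" /tmp/restore.sql.gz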

# 3. Recreate DB + import
#    (one -c per statement: DROP/CREATE DATABASE cannot run inside a multi-statement transaction)
kubectl -n akko exec akko-postgresql-0 -- \
  psql -U postgres -c "DROP DATABASE superset_metadata;" -c "CREATE DATABASE superset_metadata;"
zcat /tmp/restore.sql.gz | kubectl -n akko exec -i akko-postgresql-0 -- \
  psql -U postgres superset_metadata

# 4. Restart consumer
kubectl -n akko scale deploy akko-superset --replicas=1

Measured RTO: ~20 min on reference hardware. Record your number here: ____
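Because step 3 drops the database before importing, it may be worth confirming that the downloaded dump is intact first. This is a minimal sketch, assuming the dump is a gzip-compressed SQL file as the `.sql.gz` suffix suggests. The function name `verify_dump` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for a downloaded dump file.
# Succeeds only if the file is non-empty and passes gzip's integrity test.
verify_dump() {
  local file=$1
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  gzip -t < "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
  echo "ok: $file"
}

# Demo on a throwaway file; in the drill this would be /tmp/restore.sql.gz.
demo=$(mktemp)
echo "SELECT 1;" | gzip > "$demo"
verify_dump "$demo"
rm -f "$demo"
```

This only checks that the archive decompresses cleanly. It says nothing about whether the SQL inside is complete, so a restore into a scratch database remains the stronger test.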

Drill 2 — Restore the entire cluster

Scenario: K8s cluster lost (cloud provider region down).

# 1. Provision a new cluster using the target cloud overlay
bash helm/scripts/k3d-create.sh   # or EKS / AKS / GKE / OpenShift

# 2. Install AKKO from Harbor
AKKO_DOMAIN=akko.customer.example AKKO_VERSION=2026.04 \
  bash helm/scripts/deploy-from-harbor.sh

# 3. Restore Velero backup (includes namespace resources + PV snapshots)
#    Assumes a backup named for today's UTC date exists; otherwise pick one from `velero backup get`.
velero restore create --from-backup akko-$(date -u +%Y%m%d) --include-namespaces akko

# 4. Replay the latest PostgreSQL dumps (steps 2 and 3 of Drill 1 for each DB)

# 5. Re-point DNS to the new LoadBalancer address

# 6. Validate
bash tests/post-deploy/run-all.sh

Measured RTO: ~4 h. Record your number here: ____
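Step 3 relies on a backup named for today's date. If none exists, the newest successful backup can be picked out of the `velero backup get` listing. The sketch below is a hedged illustration: it assumes the default table output, with the backup name in the first column and the status in the second, and it assumes backup names end in a sortable timestamp. The sample listing is invented for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a `velero backup get` table on stdin and print
# the name of the last Completed backup in lexical order.
latest_completed_backup() {
  awk 'NR > 1 && $2 == "Completed" { print $1 }' | sort | tail -n 1
}

# Invented sample listing (names and dates are placeholders).
sample='NAME                 STATUS            ERRORS   WARNINGS   CREATED
akko-20260401010000  Completed         0        0          2026-04-01 01:00:00 +0000 UTC
akko-20260402010000  Completed         0        0          2026-04-02 01:00:00 +0000 UTC
akko-20260403010000  PartiallyFailed   2        0          2026-04-03 01:00:00 +0000 UTC'

printf '%s\n' "$sample" | latest_completed_backup
# In the drill: velero backup get | latest_completed_backup
```

Filtering on `Completed` skips the newest entry when it is only partially successful, which trades a slightly older restore point for a backup that finished cleanly.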

Drill 3 — Ransomware / compromise

Scenario: attacker obtained akko-admin Keycloak credentials and deleted tables.

  1. Contain: suspend the compromised identity in Keycloak and rotate its credentials via scripts/rotate-secrets.sh.
  2. Preserve evidence: kubectl cp the logs layer data and audit receipts to an offline machine for forensic analysis.
  3. Identify: query the logs layer for the attacker's user field:
     audit_type:* AND user:"compromised-user" AND _time:[72h ago, now]
  4. Eradicate: drop the attacker's K8s ServiceAccounts and Secrets.
  5. Recover: replay Drill 1 / 2 depending on damage scope. WORM S3 Object Lock prevents the attacker from tampering with the audit trail (COMPLIANCE mode, 365 d).
  6. Lessons learned: file an ADR and update this playbook.
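For the evidence-preservation step, recording a checksum of each collected file at copy time makes it possible to show later that nothing was altered. This is a minimal sketch, assuming GNU coreutils and findutils (`sha256sum`, `sort -z`, `xargs -r`) on the offline machine. The function names and the manifest file name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a SHA-256 manifest for every file in an evidence directory.
hash_evidence() {
  local dir=$1
  ( cd "$dir" && find . -type f ! -name MANIFEST.sha256 -print0 \
      | sort -z | xargs -0 -r sha256sum > MANIFEST.sha256 )
}

# Hypothetical verification: non-zero exit status if any listed file changed.
verify_evidence() {
  ( cd "$1" && sha256sum --check --quiet MANIFEST.sha256 )
}

# Demo on a throwaway directory.
dir=$(mktemp -d)
echo "audit receipt" > "$dir/receipt.json"
hash_evidence "$dir" && verify_evidence "$dir" && echo "evidence intact"
rm -rf "$dir"
```

Storing a copy of the manifest away from the evidence itself keeps the two from being modified together.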

Verification commands

# Check backup CronJobs are green
kubectl -n akko get cronjobs -o custom-columns=NAME:.metadata.name,LASTRUN:.status.lastSuccessfulTime

# Check most recent PG dump is < 4 h old (-t sorts newest first)
kubectl -n akko exec svc/akko-postgresql -- \
  ls -lt /backups/ | head

# Check Velero has a recent successful backup
velero backup get | head -5

# Check audit trail immutability (WORM)
aws s3api get-object-lock-configuration --bucket akko-audit-cold
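The freshness checks above are read by eye. The sketch below expresses the same comparison as a pass/fail test. It works on epoch seconds so the logic itself is portable. The function name and the 4 h threshold in the demo are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical check: is a backup taken at backup_epoch no older than
# max_age_h hours at now_epoch? Timestamps in the future are rejected.
backup_is_fresh() {
  local backup_epoch=$1 now_epoch=$2 max_age_h=$3
  local age_s=$(( now_epoch - backup_epoch ))
  [ "$age_s" -ge 0 ] && [ "$age_s" -le $(( max_age_h * 3600 )) ]
}

# Demo: a backup taken 3 h ago passes a 4 h limit, one taken 5 h ago does not.
now=$(date +%s)
backup_is_fresh $(( now - 3 * 3600 )) "$now" 4 && echo "3 h old: fresh"
backup_is_fresh $(( now - 5 * 3600 )) "$now" 4 || echo "5 h old: stale"
```

With GNU date, a CronJob's `.status.lastSuccessfulTime` value can be converted to epoch seconds with `date -d "$ts" +%s` before being passed in.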

Quarterly drill checklist

  • [ ] Run Drill 1 on superset_metadata and record RTO
  • [ ] Run Drill 2 on a scratch cluster and record RTO
  • [ ] Run Drill 3 tabletop exercise with ops + security
  • [ ] Review this playbook, update measured numbers
  • [ ] Sign-off by compliance officer

Roadmap

  • Automated restore-test CronJob (Sprint 44) that spins up a scratch PG, restores the previous night's dump, and reports the time to Prometheus → akko_dr_restore_duration_seconds.
  • Multi-region Iceberg replication via Polaris (Sprint 46 backlog).
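As an illustration of the reporting step in the first roadmap item, the sketch below formats a restore duration in the Prometheus text exposition format. Only the metric name comes from this playbook. The `database` label, the function name, and the delivery path (for example a Pushgateway or a textfile collector) are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical formatter: emit one gauge sample in Prometheus text exposition format.
format_restore_metric() {
  local database=$1 seconds=$2
  printf '# TYPE akko_dr_restore_duration_seconds gauge\n'
  printf 'akko_dr_restore_duration_seconds{database="%s"} %s\n' "$database" "$seconds"
}

# Example: a 20-minute restore of superset_metadata.
format_restore_metric superset_metadata 1200
```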