# Disaster Recovery playbook
AKKO 2026.04 ships with a fully declarative backup strategy. This playbook documents the measured RTO / RPO and the exact commands to recover from each class of incident; a tested recovery playbook is required by DORA Art. 11 for financial workloads.
Every number below was measured on reference hardware; your numbers will differ. Re-run the drills after provisioning and record the observed RTO / RPO in a signed test report annexed to this document.
## Scope
| Stateful component | Backup mechanism | Tool |
|---|---|---|
| PostgreSQL (Keycloak/Airflow/OM/Superset/MLflow/Polaris) | Logical dump + WAL archiving | pg_dumpall CronJob + Barman (optional) |
| PostgreSQL Data (functional) | Logical dump + WAL archiving | Same as above |
| Iceberg tables (object storage warehouse) | Object replication + Iceberg snapshots | Velero + S3 bucket replication |
| Object storage data | Bucket replication or S3 mirror | mc mirror / Velero |
| Keycloak realm + users | JSON export + DB dump | Admin API + pg_dump |
| Trino catalogs (dynamic) | ConfigMap snapshot | Helm release + Catalog Manager OPA CM |
| ADEN query cache + receipts | PostgreSQL dump | Same as PostgreSQL above |
| Prometheus / VictoriaMetrics TSDB | Volume snapshot | Velero CSI snapshot |
| Logs layer audit data | S3 Object Lock cold storage | Logs layer native (configured) |
| Helm state (releases) | etcd (via Velero) | Velero namespace-scoped backup |
## Targets
| Class | RTO target | RPO target | Scenario |
|---|---|---|---|
| Critical data | 2 h | 1 h | PostgreSQL Data / Iceberg warehouse corruption |
| Control plane | 4 h | 4 h | Keycloak / OPA / catalog-manager outage |
| Observability | 12 h | 24 h | VictoriaMetrics / logs layer pod crash |
| Full cluster rebuild | 24 h | 4 h | Entire K8s cluster lost |
## Backup schedule

All CronJobs live in the `akko` namespace and are driven by the chart:
| Job | Schedule | Retention |
|---|---|---|
| `akko-pg-backup` (infra) | `0 */4 * * *` | 14 d local, 90 d cold (S3) |
| `akko-pg-data-backup` (functional) | `0 */2 * * *` | 14 d local, 180 d cold |
| `akko-minio-replicate` | continuous | 365 d WORM |
| `akko-keycloak-realm-export` | `0 3 * * *` | 30 d |
| `akko-velero-ns-akko` | `0 1 * * *` | 30 d |
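As a quick freshness check against these schedules, a minimal sketch, assuming dumps land in a flat directory and using the 2 h functional-DB interval as the staleness budget (the directory, file name, and threshold are illustrative):

```shell
# Hypothetical RPO freshness check: complain if the newest dump in BACKUP_DIR
# is older than the tightest schedule above (0 */2 * * * => 2 h budget).
# Demo setup only: a scratch directory with one freshly created dump file.
BACKUP_DIR=$(mktemp -d)
touch "$BACKUP_DIR/superset_metadata_20260401.sql.gz"

RPO_SECONDS=$((2 * 3600))
newest="$BACKUP_DIR/$(ls -t "$BACKUP_DIR" | head -1)"
age=$(( $(date +%s) - $(stat -c %Y "$newest") ))
if [ "$age" -le "$RPO_SECONDS" ]; then
  echo "RPO OK: newest dump is ${age}s old"
else
  echo "RPO BREACH: newest dump is ${age}s old"
fi
```

In production the same comparison would run against the real `/backups/` mount (or the S3 listing) instead of a scratch directory.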
## Drill 1 — Restore a single PostgreSQL database
Scenario: someone dropped a table in `superset_metadata`.
```bash
# 1. Stop the consumer
kubectl -n akko scale deploy akko-superset --replicas=0

# 2. Pull the most recent dump from S3 cold storage
#    (aws s3 cp does not expand wildcards, so list and pick the newest key)
latest=$(aws s3 ls s3://akko-backups/pg/ | awk '{print $4}' \
  | grep '^superset_metadata_' | sort | tail -1)
aws s3 cp "s3://akko-backups/pg/${latest}" /tmp/restore.sql.gz

# 3. Recreate DB + import
kubectl -n akko exec -it akko-postgresql-0 -- \
  psql -U postgres -c "DROP DATABASE superset_metadata; CREATE DATABASE superset_metadata;"
zcat /tmp/restore.sql.gz | kubectl -n akko exec -i akko-postgresql-0 -- \
  psql -U postgres superset_metadata

# 4. Restart consumer
kubectl -n akko scale deploy akko-superset --replicas=1
```
Measured RTO: ~20 min on reference hardware. Record your number here: ____
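To make "record your number" reproducible, a minimal timing wrapper can bracket the drill; the `restore` function here is a placeholder standing in for steps 1-4 above:

```shell
# Wrap the restore steps and print the elapsed wall-clock time (measured RTO).
restore() {
  # Placeholder for steps 1-4 of Drill 1; replace with the real commands.
  sleep 1
}
start=$(date +%s)
restore
rto=$(( $(date +%s) - start ))
echo "measured RTO: ${rto}s"
```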
## Drill 2 — Restore the entire cluster
Scenario: K8s cluster lost (cloud provider region down).
```bash
# 1. Provision a new cluster using the target cloud overlay
bash helm/scripts/k3d-create.sh        # or EKS / AKS / GKE / OpenShift

# 2. Install AKKO from Harbor
AKKO_DOMAIN=akko.customer.example AKKO_VERSION=2026.04 \
  bash helm/scripts/deploy-from-harbor.sh

# 3. Restore Velero backup (includes namespace resources + PV snapshots)
velero restore create --from-backup "akko-$(date -u +%Y%m%d)" --include-namespaces akko

# 4. Replay the latest PostgreSQL dumps (step 2 of Drill 1 for each DB)

# 5. Re-point DNS to the new LoadBalancer address

# 6. Validate
bash tests/post-deploy/run-all.sh
```
Measured RTO: ~4 h. Record your number here: ____
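Step 6 rarely passes on the first try immediately after a restore; a retry loop bounded by the 24 h full-rebuild RTO budget from the targets table could look like this sketch (the `check` function is a placeholder for `bash tests/post-deploy/run-all.sh`):

```shell
# Re-run validation until it passes or the RTO budget is exhausted.
check() { true; }   # placeholder for: bash tests/post-deploy/run-all.sh
deadline=$(( $(date +%s) + 24 * 3600 ))
attempt=0
until check; do
  attempt=$((attempt + 1))
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "RTO budget exhausted after $attempt attempts"
    exit 1
  fi
  sleep 60
done
echo "validation passed after $attempt retries"
```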
## Drill 3 — Ransomware / compromise
Scenario: an attacker obtained `akko-admin` Keycloak credentials and deleted tables.

- **Contain** — suspend the compromised identity in Keycloak and rotate its credentials via `scripts/rotate-secrets.sh`.
- **Preserve evidence** — `kubectl cp` the logs layer data + audit receipts to an offline machine for forensic analysis.
- **Identify** — query the logs layer for the attacker's `user` field.
- **Eradicate** — drop the attacker's K8s ServiceAccounts and Secrets.
- **Recover** — replay Drill 1 / 2 depending on damage scope. WORM S3 Object Lock prevents the attacker from tampering with the audit trail (COMPLIANCE mode, 365 d).
- **Lessons learned** — file an ADR and update this playbook.
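The exact query for the Identify step depends on the logs backend. As a backend-agnostic sketch over exported audit receipts, assuming JSON-lines records with a `user` field (the field name and the sample records are illustrative, not the real receipt schema):

```shell
# Demo data standing in for exported audit receipts (JSON lines).
cat > /tmp/audit.jsonl <<'EOF'
{"user":"akko-admin","action":"DROP TABLE","ts":"2026-04-01T02:11:00Z"}
{"user":"analyst1","action":"SELECT","ts":"2026-04-01T02:12:00Z"}
EOF

# List and count every action performed by the compromised identity.
grep '"user":"akko-admin"' /tmp/audit.jsonl
hits=$(grep -c '"user":"akko-admin"' /tmp/audit.jsonl)
echo "suspicious events: $hits"
```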
## Verification commands
```bash
# Check backup CronJobs are green
kubectl -n akko get cronjobs -o custom-columns=NAME:.metadata.name,LASTRUN:.status.lastSuccessfulTime

# Check the most recent PG dump is < 4 h old
kubectl -n akko exec svc/akko-postgresql -- ls -l /backups/ | head

# Check Velero has a recent successful backup
velero backup get | head -5

# Check audit trail immutability (WORM)
aws s3api get-object-lock-configuration --bucket akko-audit-cold
```
## Quarterly drill checklist
- [ ] Run Drill 1 on `superset_metadata` and record RTO
- [ ] Run Drill 2 on a scratch cluster and record RTO
- [ ] Run Drill 3 as a tabletop exercise with ops + security
- [ ] Review this playbook and update measured numbers
- [ ] Sign-off by the compliance officer
## Roadmap
- Automated restore-test CronJob (Sprint 44) that spins up a scratch PG, restores the previous night's dump, and reports the time to Prometheus → `akko_dr_restore_duration_seconds`.
- Multi-region Iceberg replication via Polaris (Sprint 46 backlog).