# Disaster Recovery playbook
AKKO 2026.04 ships with a fully declarative backup strategy. This playbook documents the measured RTO / RPO and the exact commands to recover from each class of incident; a tested recovery playbook is required by DORA Art. 11 for financial workloads.
Every number below was measured on reference hardware; your numbers will differ. Re-run the drills after provisioning and record the observed RTO / RPO in a signed test report annexed to this document.
## Scope
| Stateful component | Backup mechanism | Tool |
|---|---|---|
| PostgreSQL (Keycloak/Airflow/OM/Superset/MLflow/Polaris) | Logical dump + WAL archiving | pg_dumpall CronJob + Barman (optional) |
| PostgreSQL Data (functional) | Logical dump + WAL archiving | Same as above |
| Iceberg tables (object storage warehouse) | Object replication + Iceberg snapshots | Velero + S3 bucket replication |
| Object storage data | Bucket replication or S3 mirror | mc mirror / Velero |
| Keycloak realm + users | JSON export + DB dump | Admin API + pg_dump |
| Trino catalogs (dynamic) | ConfigMap snapshot | Helm release + Catalog Manager OPA CM |
| ADEN query cache + receipts | PostgreSQL dump | Same as PostgreSQL above |
| Prometheus / VictoriaMetrics TSDB | Volume snapshot | Velero CSI snapshot |
| Logs layer audit data | S3 Object Lock cold storage | Logs layer native (configured) |
| Helm state (releases) | etcd (via Velero) | Velero namespace-scoped backup |
## Targets
| Class | RTO target | RPO target | Scenario |
|---|---|---|---|
| Critical data | 2 h | 1 h | PostgreSQL Data / Iceberg warehouse corruption |
| Control plane | 4 h | 4 h | Keycloak / OPA / catalog-manager outage |
| Observability | 12 h | 24 h | VictoriaMetrics / logs layer pod crash |
| Full cluster rebuild | 24 h | 4 h | Entire K8s cluster lost |
## Backup schedule

All CronJobs live in the `akko` namespace and are driven by the chart:
| Job | Schedule | Retention |
|---|---|---|
| `akko-pg-backup` (infra) | `0 */4 * * *` | 14 d local, 90 d cold (S3) |
| `akko-pg-data-backup` (functional) | `0 */2 * * *` | 14 d local, 180 d cold |
| `akko-minio-replicate` | continuous | 365 d WORM |
| `akko-keycloak-realm-export` | `0 3 * * *` | 30 d |
| `akko-velero-ns-akko` | `0 1 * * *` | 30 d |
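As a quick freshness check against these schedules, a minimal sketch, assuming dumps land in a flat directory and using the 2 h functional-DB interval as the staleness budget (the directory, file name, and threshold are illustrative):

```shell
# Hypothetical RPO freshness check: complain if the newest dump in BACKUP_DIR
# is older than the tightest schedule above (0 */2 * * * => 2 h budget).
# Demo setup only: a scratch directory with one freshly created dump file.
BACKUP_DIR=$(mktemp -d)
touch "$BACKUP_DIR/superset_metadata_20260401.sql.gz"

RPO_SECONDS=$((2 * 3600))
newest="$BACKUP_DIR/$(ls -t "$BACKUP_DIR" | head -1)"
age=$(( $(date +%s) - $(stat -c %Y "$newest") ))
if [ "$age" -le "$RPO_SECONDS" ]; then
  echo "RPO OK: newest dump is ${age}s old"
else
  echo "RPO BREACH: newest dump is ${age}s old"
fi
```

In production the same comparison would run against the real `/backups/` mount (or the S3 listing) instead of a scratch directory.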
## Drill 1 — Restore a single PostgreSQL database
Scenario: someone dropped a table in `superset_metadata`.
```bash
# 1. Stop the consumer
kubectl -n akko scale deploy akko-superset --replicas=0

# 2. Pull the most recent dump from S3 cold storage
#    (aws s3 cp does not expand wildcards, so list and pick the newest key)
latest=$(aws s3 ls s3://akko-backups/pg/ | awk '{print $4}' \
  | grep '^superset_metadata_' | sort | tail -1)
aws s3 cp "s3://akko-backups/pg/${latest}" /tmp/restore.sql.gz

# 3. Recreate DB + import
kubectl -n akko exec -it akko-postgresql-0 -- \
  psql -U postgres -c "DROP DATABASE superset_metadata; CREATE DATABASE superset_metadata;"
zcat /tmp/restore.sql.gz | kubectl -n akko exec -i akko-postgresql-0 -- \
  psql -U postgres superset_metadata

# 4. Restart consumer
kubectl -n akko scale deploy akko-superset --replicas=1
```
Measured RTO: ~20 min on reference hardware. Record your number here: ____
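To make "record your number" reproducible, a minimal timing wrapper can bracket the drill; the `restore` function here is a placeholder standing in for steps 1-4 above:

```shell
# Wrap the restore steps and print the elapsed wall-clock time (measured RTO).
restore() {
  # Placeholder for steps 1-4 of Drill 1; replace with the real commands.
  sleep 1
}
start=$(date +%s)
restore
rto=$(( $(date +%s) - start ))
echo "measured RTO: ${rto}s"
```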
## Drill 2 — Restore the entire cluster
Scenario: K8s cluster lost (cloud provider region down).
```bash
# 1. Provision a new cluster using the target cloud overlay
bash helm/scripts/k3d-create.sh        # or EKS / AKS / GKE / OpenShift

# 2. Install AKKO from Harbor
AKKO_DOMAIN=akko.customer.example AKKO_VERSION=2026.04 \
  bash helm/scripts/deploy-from-harbor.sh

# 3. Restore Velero backup (includes namespace resources + PV snapshots)
velero restore create --from-backup "akko-$(date -u +%Y%m%d)" --include-namespaces akko

# 4. Replay the latest PostgreSQL dumps (step 2 of Drill 1 for each DB)

# 5. Re-point DNS to the new LoadBalancer address

# 6. Validate
bash tests/post-deploy/run-all.sh
```
Measured RTO: ~4 h. Record your number here: ____
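Step 6 rarely passes on the first try immediately after a restore; a retry loop bounded by the 24 h full-rebuild RTO budget from the targets table could look like this sketch (the `check` function is a placeholder for `bash tests/post-deploy/run-all.sh`):

```shell
# Re-run validation until it passes or the RTO budget is exhausted.
check() { true; }   # placeholder for: bash tests/post-deploy/run-all.sh
deadline=$(( $(date +%s) + 24 * 3600 ))
attempt=0
until check; do
  attempt=$((attempt + 1))
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "RTO budget exhausted after $attempt attempts"
    exit 1
  fi
  sleep 60
done
echo "validation passed after $attempt retries"
```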
## Drill 3 — Ransomware / compromise
Scenario: an attacker obtained `akko-admin` Keycloak credentials and deleted tables.

- **Contain** — suspend the compromised identity in Keycloak and rotate its credentials via `scripts/rotate-secrets.sh`.
- **Preserve evidence** — `kubectl cp` the logs layer data + audit receipts to an offline machine for forensic analysis.
- **Identify** — query the logs layer for the attacker's `user` field.
- **Eradicate** — drop the attacker's K8s ServiceAccounts and Secrets.
- **Recover** — replay Drill 1 / 2 depending on damage scope. WORM S3 Object Lock prevents the attacker from tampering with the audit trail (COMPLIANCE mode, 365 d).
- **Lessons learned** — file an ADR and update this playbook.
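The exact query for the Identify step depends on the logs backend. As a backend-agnostic sketch over exported audit receipts, assuming JSON-lines records with a `user` field (the field name and the sample records are illustrative, not the real receipt schema):

```shell
# Demo data standing in for exported audit receipts (JSON lines).
cat > /tmp/audit.jsonl <<'EOF'
{"user":"akko-admin","action":"DROP TABLE","ts":"2026-04-01T02:11:00Z"}
{"user":"analyst1","action":"SELECT","ts":"2026-04-01T02:12:00Z"}
EOF

# List and count every action performed by the compromised identity.
grep '"user":"akko-admin"' /tmp/audit.jsonl
hits=$(grep -c '"user":"akko-admin"' /tmp/audit.jsonl)
echo "suspicious events: $hits"
```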
## Verification commands
```bash
# Check backup CronJobs are green
kubectl -n akko get cronjobs -o custom-columns=NAME:.metadata.name,LASTRUN:.status.lastSuccessfulTime

# Check the most recent PG dump is < 4 h old
kubectl -n akko exec svc/akko-postgresql -- ls -l /backups/ | head

# Check Velero has a recent successful backup
velero backup get | head -5

# Check audit trail immutability (WORM)
aws s3api get-object-lock-configuration --bucket akko-audit-cold
```
## Quarterly drill checklist
- [ ] Run Drill 1 on `superset_metadata` and record RTO
- [ ] Run Drill 2 on a scratch cluster and record RTO
- [ ] Run Drill 3 as a tabletop exercise with ops + security
- [ ] Review this playbook and update measured numbers
- [ ] Sign-off by the compliance officer
## Roadmap
- Automated restore-test CronJob (Sprint 44) that spins up a scratch PG, restores the previous night's dump, and reports the time to Prometheus → `akko_dr_restore_duration_seconds`.
- Multi-region Iceberg replication via Polaris (Sprint 46 backlog).