Skip to content

DR drill execution log

Use this form after every quarterly DR drill (Drills 1–3 in dr-playbook.md). The signed log is the auditor's proof that recovery procedures were tested and the measured RTO/RPO match the targets. Without this log, the playbook is theatre.

Workflow:

  1. Operator copies this template into a new file dr-drill-YYYY-Q[1-4].md in akko-confidential/dr-drills/ (private repo — drill outputs may contain operational details).
  2. Operator + on-call witness execute each drill, fill in the log.
  3. Compliance officer co-signs at the bottom.
  4. The signed PDF is attached to the audit trail (cold storage in the akko-audit-cold bucket with object-lock retention ≥ 7 years).

Field Value
Drill date YYYY-MM-DD HH:MM TZ
Quarter Q1 / Q2 / Q3 / Q4 — YYYY
Cluster netcup-prod / aks-prod / eks-prod / scratch
AKKO version (from helm list -n akko)
Operator name + role
Witness name + role (must be different person)
Compliance officer name + role
Playbook revision git SHA of docs/docs/admin/dr-playbook.md

Drill 1 — single-database PITR

See playbook §4.1. Restore one PostgreSQL DB to a point-in-time 5 min before "now". Reference DB: superset_metadata.

Step Target Measured Pass / Fail Notes
Locate latest base backup < 5 min __ min ☐ ☐
Restore to scratch PVC < 15 min __ min ☐ ☐
Replay WAL to T-5 < 5 min __ min ☐ ☐
Verify row counts match expected < 1 min __ s ☐ ☐
Cut-over Service to restored DB < 5 min __ min ☐ ☐
Total RTO 20 min __ min ☐ ☐
Measured RPO 5 min __ s ☐ ☐

Issues encountered (if any):

(blank)

Action items raised (link to backlog tickets):

(blank)

Drill 2 — full-cluster restore on scratch hardware

See playbook §4.2. From a fresh k8s cluster, restore the entire AKKO namespace from Velero + ObjectStorage. Reference target RTO: 4 h.

Step Target Measured Pass / Fail Notes
Provision scratch cluster (k3s on cold-spare hardware) < 30 min __ min ☐ ☐
Install cert-manager + Traefik + Velero < 15 min __ min ☐ ☐
velero restore create from latest backup < 90 min __ min ☐ ☐
Restore SeaweedFS volumes from snapshot < 60 min __ min ☐ ☐
helm upgrade akko to reconcile state < 15 min __ min ☐ ☐
Verify cockpit + Trino + ADEN respond < 5 min __ min ☐ ☐
Run banking-end-to-end smoke (read-only path) < 10 min __ min ☐ ☐
Total RTO 4 h __ h __ min ☐ ☐
Measured RPO 24 h (last nightly) __ h ☐ ☐

Components that failed to restore cleanly (if any):

(blank)

Did the audit trail (decision_logs, query_log, admin_events) restore intact and tamper-free? ☐ Yes ☐ No — explain:

(blank)

Drill 3 — tabletop incident response

See playbook §4.3. No infrastructure changes — this is a paper exercise where the on-call team walks through the response to a named incident. Witness scores each participant's clarity + speed.

Scenario chosen (one of):

  • ☐ A1: Active database lost (storage corruption)
  • ☐ A2: Ransomware encrypts SeaweedFS volumes
  • ☐ A3: Keycloak realm wiped (admin error)
  • ☐ A4: Region-level cloud outage (provider down)
  • ☐ A5: Insider threat — admin exfiltrates audit log
  • ☐ A6: Custom — describe:
(blank)
Phase Owner Time-to-action Observed
Detection (alert fires / user reports) Monitoring < 5 min __
Triage (severity + scope) On-call < 10 min __
Communication (Slack #incidents) On-call < 2 min __
Containment decision Security lead < 15 min __
Recovery start (which drill triggered) Operator < 30 min __
Status update cadence All-hands every 30 min __
All-clear declaration Compliance end of MTTR __
Post-mortem within 7 d Author TBD within 7 d __

Misses or hesitations observed (transparent — the post-mortem is the output of the drill, not the verdict on the team):

(blank)

Process changes raised:

(blank)

Backup-system spot-checks (run BEFORE drill 1)

# PostgreSQL — last basebackup must be < 4 h old
kubectl -n akko exec svc/akko-postgresql -- ls -lt /backups/ | head

# Velero — last successful backup
velero backup get --selector akko.scope=full | head -5

# SeaweedFS volume snapshot — last snapshot
kubectl -n akko exec deploy/akko-seaweedfs-master -- \
  weed shell <<< 'volume.list' | grep -E 'snapshot.*[0-9]+'

# Audit cold storage — object-lock still in force
aws s3api get-object-lock-configuration \
  --bucket akko-audit-cold | jq '.ObjectLockConfiguration.Rule.DefaultRetention'

# cert-manager — wildcard cert not expiring < 30 d
kubectl -n cert-manager get certificate -o json | \
  jq '.items[] | {name: .metadata.name, expiry: .status.notAfter}'

Record any anomalies above the drill section so a degraded backup state is captured even if the drills themselves pass.


Sign-off

Role Name Signature (initials) Date
Operator
Witness
Compliance officer

Retention — store the signed PDF in cold storage with object-lock retention ≥ 7 years (DORA Art. 11 + GDPR Art. 30 records of processing). The git revision of the playbook used for this drill (recorded in the Header) lets future auditors prove that the procedures actually executed match the procedures documented at the time.