DR drill execution log¶
Use this form after every quarterly DR drill (Drills 1–3 in
dr-playbook.md). The signed log is the auditor's proof that recovery procedures were tested and the measured RTO/RPO match the targets. Without this log, the playbook is theatre.Workflow:
- Operator copies this template into a new file
dr-drill-YYYY-Q[1-4].mdinakko-confidential/dr-drills/(private repo — drill outputs may contain operational details).- Operator + on-call witness execute each drill, fill in the log.
- Compliance officer co-signs at the bottom.
- The signed PDF is attached to the audit trail (cold storage in the
akko-audit-coldbucket with object-lock retention ≥ 7 years).
Header¶
| Field | Value |
|---|---|
| Drill date | YYYY-MM-DD HH:MM TZ |
| Quarter | Q1 / Q2 / Q3 / Q4 — YYYY |
| Cluster | netcup-prod / aks-prod / eks-prod / scratch |
| AKKO version | (from helm list -n akko) |
| Operator | name + role |
| Witness | name + role (must be different person) |
| Compliance officer | name + role |
| Playbook revision | git SHA of docs/docs/admin/dr-playbook.md |
Drill 1 — single-database PITR¶
See playbook §4.1. Restore one PostgreSQL DB to a point-in-time 5 min before "now". Reference DB:
superset_metadata.
| Step | Target | Measured | Pass / Fail | Notes |
|---|---|---|---|---|
| Locate latest base backup | < 5 min | __ min | ☐ ☐ | |
| Restore to scratch PVC | < 15 min | __ min | ☐ ☐ | |
| Replay WAL to T-5 | < 5 min | __ min | ☐ ☐ | |
| Verify row counts match expected | < 1 min | __ s | ☐ ☐ | |
| Cut-over Service to restored DB | < 5 min | __ min | ☐ ☐ | |
| Total RTO | 20 min | __ min | ☐ ☐ | |
| Measured RPO | 5 min | __ s | ☐ ☐ |
Issues encountered (if any):
Action items raised (link to backlog tickets):
Drill 2 — full-cluster restore on scratch hardware¶
See playbook §4.2. From a fresh k8s cluster, restore the entire AKKO namespace from Velero + ObjectStorage. Reference target RTO: 4 h.
| Step | Target | Measured | Pass / Fail | Notes |
|---|---|---|---|---|
| Provision scratch cluster (k3s on cold-spare hardware) | < 30 min | __ min | ☐ ☐ | |
| Install cert-manager + Traefik + Velero | < 15 min | __ min | ☐ ☐ | |
velero restore create from latest backup |
< 90 min | __ min | ☐ ☐ | |
| Restore SeaweedFS volumes from snapshot | < 60 min | __ min | ☐ ☐ | |
helm upgrade akko to reconcile state |
< 15 min | __ min | ☐ ☐ | |
| Verify cockpit + Trino + ADEN respond | < 5 min | __ min | ☐ ☐ | |
| Run banking-end-to-end smoke (read-only path) | < 10 min | __ min | ☐ ☐ | |
| Total RTO | 4 h | __ h __ min | ☐ ☐ | |
| Measured RPO | 24 h (last nightly) | __ h | ☐ ☐ |
Components that failed to restore cleanly (if any):
Did the audit trail (decision_logs, query_log, admin_events) restore intact and tamper-free? ☐ Yes ☐ No — explain:
Drill 3 — tabletop incident response¶
See playbook §4.3. No infrastructure changes — this is a paper exercise where the on-call team walks through the response to a named incident. Witness scores each participant's clarity + speed.
Scenario chosen (one of):
- ☐ A1: Active database lost (storage corruption)
- ☐ A2: Ransomware encrypts SeaweedFS volumes
- ☐ A3: Keycloak realm wiped (admin error)
- ☐ A4: Region-level cloud outage (provider down)
- ☐ A5: Insider threat — admin exfiltrates audit log
- ☐ A6: Custom — describe:
| Phase | Owner | Time-to-action | Observed |
|---|---|---|---|
| Detection (alert fires / user reports) | Monitoring | < 5 min | __ |
| Triage (severity + scope) | On-call | < 10 min | __ |
| Communication (Slack #incidents) | On-call | < 2 min | __ |
| Containment decision | Security lead | < 15 min | __ |
| Recovery start (which drill triggered) | Operator | < 30 min | __ |
| Status update cadence | All-hands | every 30 min | __ |
| All-clear declaration | Compliance | end of MTTR | __ |
| Post-mortem within 7 d | Author TBD | within 7 d | __ |
Misses or hesitations observed (transparent — the post-mortem is the output of the drill, not the verdict on the team):
Process changes raised:
Backup-system spot-checks (run BEFORE drill 1)¶
# PostgreSQL — last basebackup must be < 4 h old
kubectl -n akko exec svc/akko-postgresql -- ls -lt /backups/ | head
# Velero — last successful backup
velero backup get --selector akko.scope=full | head -5
# SeaweedFS volume snapshot — last snapshot
kubectl -n akko exec deploy/akko-seaweedfs-master -- \
weed shell <<< 'volume.list' | grep -E 'snapshot.*[0-9]+'
# Audit cold storage — object-lock still in force
aws s3api get-object-lock-configuration \
--bucket akko-audit-cold | jq '.ObjectLockConfiguration.Rule.DefaultRetention'
# cert-manager — wildcard cert not expiring < 30 d
kubectl -n cert-manager get certificate -o json | \
jq '.items[] | {name: .metadata.name, expiry: .status.notAfter}'
Record any anomalies above the drill section so a degraded backup state is captured even if the drills themselves pass.
Sign-off¶
| Role | Name | Signature (initials) | Date |
|---|---|---|---|
| Operator | |||
| Witness | |||
| Compliance officer |
Retention — store the signed PDF in cold storage with object-lock retention ≥ 7 years (DORA Art. 11 + GDPR Art. 30 records of processing). The git revision of the playbook used for this drill (recorded in the Header) lets future auditors prove that the procedures actually executed match the procedures documented at the time.