Runbooks: Production incident playbooks
Each PrometheusRule alert has a dedicated runbook covering:
- Step-by-step diagnosis
- Remediation per cause
- Permanent fix (R02: zero workarounds)
- Applicable lessons learned
Index
| Runbook | Alert | Severity |
|---|---|---|
| pod-crashloop.md | PodCrashLoopBackOff | 🔴 critical |
| pod-oom-killed.md | PodOOMKilled | 🟡 warning / 🔴 critical |
| pvc-full.md | PVCAlmostFull / PVCFull | 🟡 warning / 🔴 critical |
| node-disk-full.md | NodeDiskFull / NodeDiskPressure | 🔴 critical |
| node-high-cpu.md | NodeCPUHigh | 🟡 warning / 🔴 critical |
| node-high-memory.md | NodeMemoryHigh | 🟡 warning / 🔴 critical |
| image-pull-backoff.md | ImagePullBackOff | 🔴 critical |
| tls-cert-expires.md | TLSCertExpiresSoon | 🟡 warning / 🔴 critical |
| loki-ingestion-slow.md | LokiIngestionSlow | 🟡 warning |
| openmetadata-api-errors.md | OpenMetadataAPIErrors5xx | 🔴 critical |
| polaris-exceptions.md | PolarisAPIExceptions | 🔴 critical |
| smoke-test-failing.md | SmokeTestFailing | 🔴 critical |
| keycloak-sso-broken.md | (manual) | 🔴 critical |
| helm-upgrade-stuck.md | (manual) | 🔴 critical |
| trino-slow-queries.md | TrinoHighQueryLatency | 🟡 warning |
15 runbooks covered. Sprint 22 suite COMPLETE.
Standard format
Each runbook follows this template:
# Runbook: <alert name>
**Alert**: `<name>` (severity `<level>`)
**Symptom**: <user-facing description>
## Diagnostic
1. Check the state of the pod/service
2. Read the logs
3. Check the K8s events
## Causes + remediation
| Cause | Symptom | Fix |
...
## Permanent fix (R02)
Describe the fix **in the code**, followed by commit, push, and CI.
## Prevention
PrometheusRule, tests Playwright, dashboards.
## Lessons learned
Link to L<N> in _RULES.md.
Contributing a runbook
- Create the new file in `docs/docs/admin/runbooks/`
- Also create the FR version `<name>.fr.md` (R14)
- Add it to the `docs/mkdocs.yml` nav
- If the corresponding alert exists in `prometheusrule-akko.yaml`, add the `annotations.runbook_url` field pointing to this file
- Commit + push
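As an illustration of the `annotations.runbook_url` step, here is a minimal sketch of a PrometheusRule entry for the `PodCrashLoopBackOff` alert from the index. Only the alert name, its severity, and the `runbook_url` field come from this page. The resource name, group name, expression, `for` duration, and URL are assumptions made for the example, and the real `prometheusrule-akko.yaml` may be organized differently.

```yaml
# Illustrative fragment only; not copied from prometheusrule-akko.yaml.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: akko-alerts              # hypothetical resource name
spec:
  groups:
    - name: pods                 # hypothetical group name
      rules:
        - alert: PodCrashLoopBackOff
          # Assumed expression using a kube-state-metrics series;
          # the project's actual expr may differ.
          expr: 'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0'
          for: 5m                # assumed duration
          labels:
            severity: critical
          annotations:
            # Placeholder URL: replace with the published address
            # of the runbook page for this alert.
            runbook_url: https://docs.example.com/admin/runbooks/pod-crashloop/
```

With this annotation in place, a notification for the alert can carry a direct link to its runbook, so the on-call engineer does not have to look it up in the index.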