
Runbooks — Production incident playbooks

Each PrometheusRule alert has a dedicated runbook containing:

- Step-by-step diagnostics
- Remediation for each cause
- A permanent fix (R02: zero workarounds)
- Applicable lessons learned

Index

| Runbook | Alert | Severity |
|---|---|---|
| pod-crashloop.md | PodCrashLoopBackOff | 🔴 critical |
| pod-oom-killed.md | PodOOMKilled | 🟡 warning / 🔴 critical |
| pvc-full.md | PVCAlmostFull / PVCFull | 🟡 warning / 🔴 critical |
| node-disk-full.md | NodeDiskFull / NodeDiskPressure | 🔴 critical |
| node-high-cpu.md | NodeCPUHigh | 🟡 warning / 🔴 critical |
| node-high-memory.md | NodeMemoryHigh | 🟡 warning / 🔴 critical |
| image-pull-backoff.md | ImagePullBackOff | 🔴 critical |
| tls-cert-expires.md | TLSCertExpiresSoon | 🟡 warning / 🔴 critical |
| loki-ingestion-slow.md | LokiIngestionSlow | 🟡 warning |
| openmetadata-api-errors.md | OpenMetadataAPIErrors5xx | 🔴 critical |
| polaris-exceptions.md | PolarisAPIExceptions | 🔴 critical |
| smoke-test-failing.md | SmokeTestFailing | 🔴 critical |
| keycloak-sso-broken.md | (manual) | 🔴 critical |
| helm-upgrade-stuck.md | (manual) | 🔴 critical |
| trino-slow-queries.md | TrinoHighQueryLatency | 🟡 warning |

15 runbooks covered: Sprint 22 suite COMPLETE.

Standard format

Each runbook follows this template:

# Runbook: <alert name>

**Alert**: `<name>` (severity `<level>`)
**Symptom**: <user-facing description>

## Diagnostic
1. Check the state of the pod/service
2. Read the logs
3. Check the K8s events

## Causes + remediation
| Cause | Symptom | Fix |
...

## Permanent fix (R02)
Describe how to fix it **in the code** + commit + push + CI.

## Prevention
PrometheusRule, Playwright tests, dashboards.

## Lessons learned
Link to L<N> in _RULES.md.
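
A runbook can be checked against the template above before review. The following is a minimal sketch, not part of the project's tooling: it assumes that the level-2 headings of the template are the required sections, and the names `section_titles` and `missing_sections` are hypothetical helpers introduced here for illustration.

```python
import re

# Matches level-2 Markdown headings such as "## Diagnostic".
# Deeper headings ("### ...") and the "# Runbook:" title do not match.
H2 = re.compile(r"^##[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def section_titles(markdown: str) -> list[str]:
    """Return the level-2 heading titles in order of appearance."""
    return [m.group(1) for m in H2.finditer(markdown)]


def missing_sections(runbook: str, template: str) -> list[str]:
    """List the template sections that the runbook does not contain.

    Assumption: a runbook is considered complete when every level-2
    heading of the template appears in it with the same title.
    """
    present = set(section_titles(runbook))
    return [title for title in section_titles(template) if title not in present]
```

Calling `missing_sections(runbook_text, template_text)` with both files read as strings returns an empty list when no section is missing. Because the required titles are read from the template itself, the same check applies to the translated `.fr.md` variant when it is compared against a template in the same language.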

Contributing a runbook

  1. Create a new file in docs/docs/admin/runbooks/
  2. Also create the French version <name>.fr.md (R14)
  3. Add it to the nav in docs/mkdocs.yml
  4. If the corresponding alert exists in prometheusrule-akko.yaml, add the annotations.runbook_url field pointing to this file
  5. Commit + push
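
Step 4 is easy to forget, so it may be worth verifying automatically. The sketch below assumes the manifest follows the usual PrometheusRule layout (`spec.groups[].rules[]`, each alerting rule carrying an `alert` name and optional `annotations`) and that it has already been parsed into a dictionary, for example with a YAML parser. The function name `alerts_missing_runbook_url` is hypothetical and not part of the repository.

```python
def alerts_missing_runbook_url(rule_manifest: dict) -> list[str]:
    """Return the names of alerts that have no annotations.runbook_url.

    Assumption: `rule_manifest` is a parsed PrometheusRule manifest
    with the layout spec.groups[].rules[].
    """
    missing = []
    for group in rule_manifest.get("spec", {}).get("groups", []):
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                # Recording rules use "record" instead of "alert"; skip them.
                continue
            annotations = rule.get("annotations") or {}
            if not annotations.get("runbook_url"):
                missing.append(name)
    return missing
```

An empty result means every alerting rule links to a runbook. Alerts flagged here either need a new runbook or only the missing annotation; the manual runbooks in the index (keycloak-sso-broken.md, helm-upgrade-stuck.md) have no alert and are therefore outside the scope of this check.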