Runbooks: Production incident playbooks

Each PrometheusRule alert has a dedicated runbook covering:
- Step-by-step diagnosis
- Remediation per cause
- Permanent fix (R02: zero workarounds)
- Applicable lessons learned

Index

Runbook	Alert	Severity
pod-crashloop.md	PodCrashLoopBackOff	🔴 critical
pod-oom-killed.md	PodOOMKilled	🟡 warning / 🔴 critical
pvc-full.md	PVCAlmostFull / PVCFull	🟡 warning / 🔴 critical
node-disk-full.md	NodeDiskFull / NodeDiskPressure	🔴 critical
node-high-cpu.md	NodeCPUHigh	🟡 warning / 🔴 critical
node-high-memory.md	NodeMemoryHigh	🟡 warning / 🔴 critical
image-pull-backoff.md	ImagePullBackOff	🔴 critical
tls-cert-expires.md	TLSCertExpiresSoon	🟡 warning / 🔴 critical
loki-ingestion-slow.md	LokiIngestionSlow	🟡 warning
openmetadata-api-errors.md	OpenMetadataAPIErrors5xx	🔴 critical
polaris-exceptions.md	PolarisAPIExceptions	🔴 critical
smoke-test-failing.md	SmokeTestFailing	🔴 critical
keycloak-sso-broken.md	(manual)	🔴 critical
helm-upgrade-stuck.md	(manual)	🔴 critical
trino-slow-queries.md	TrinoHighQueryLatency	🟡 warning

15 runbooks covered (Sprint 22 follow-up: COMPLETE).

Standard format

Each runbook follows this template:

# Runbook: <alert name>

**Alert**: `<name>` (severity `<level>`)
**Symptom**: <user-facing description>

## Diagnosis
1. Check the state of the pod/service
2. Read the logs
3. Check the K8s events

## Causes + remediation
| Cause | Symptom | Fix |
...

## Permanent fix (R02)
Describe how to fix it **in the code** + commit + push + CI.

## Prevention
PrometheusRule, Playwright tests, dashboards.

## Lessons learned
Link to L<N> in _RULES.md.

Contributing a runbook

1. Create a new file in docs/docs/admin/runbooks/
2. Also create the French version <name>.fr.md (R14)
3. Add it to the nav in docs/mkdocs.yml
4. If the corresponding alert exists in prometheusrule-akko.yaml, add the annotations.runbook_url field pointing to this file
5. Commit + push
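
As an illustration of the annotations.runbook_url field mentioned above, a rule entry in prometheusrule-akko.yaml might look roughly like the sketch below. Only the alert name, the severity, and the runbook file name come from the index on this page. The metadata, group name, `expr`, `for` duration, and docs host are assumptions made for the example and should be replaced with the project's real values.

```yaml
# Sketch only: a PrometheusRule (Prometheus Operator CRD) with a runbook link.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: akko                      # hypothetical resource name
spec:
  groups:
    - name: pods                  # hypothetical group name
      rules:
        - alert: PodCrashLoopBackOff
          # Illustrative expression based on kube-state-metrics,
          # not necessarily the expression the project uses.
          expr: max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
          for: 5m                 # hypothetical duration
          labels:
            severity: critical
          annotations:
            # Hypothetical host; the path assumes MkDocs directory-style URLs
            # for docs/docs/admin/runbooks/pod-crashloop.md.
            runbook_url: https://docs.example.com/admin/runbooks/pod-crashloop/
```

With this annotation in place, the link travels with the alert, so whoever receives the notification can open the matching runbook directly.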