Runbook: ImagePullBackOff¶

Alerte : ImagePullBackOff (PrometheusRule, severity critical)

Symptôme :

Pod <pod> n'arrive pas à tirer l'image <image> depuis Harbor ou un registry externe.

Severity : 🔴 critical — notif Slack #akko-critical, repeat 30 min

Diagnostic¶

1. Identifier l'image qui échoue¶

export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl describe pod -n akko <pod> | grep -A3 "Failed to pull"

2. Vérifier les events (détail de l'erreur)¶

kubectl get events -n akko --sort-by='.lastTimestamp' | grep -i "pull" | tail -10

3. Tester le pull manuellement depuis un worker¶

ssh root@<node> 'crictl pull <image>'
# ou sur un k3s avec containerd :
ssh root@<node> 'ctr -n k8s.io images pull <image>'

Causes fréquentes + fix¶

Cause	Symptôme events	Fix
Image n'existe pas	`manifest unknown`	Vérifier tag dans `helm/akko/values.yaml` vs Harbor UI. Re-build + push via CI.
Tag `:latest` mutable	Pull OK hier, fail aujourd'hui	R08 violé — pinner à `YYYY.MM`. Fix: `values.yaml` + commit.
Harbor auth	`401 Unauthorized`	Vérifier `imagePullSecrets` (`kubectl get secret harbor-creds -n akko`). Recréer si expiré.
Rate limit Docker Hub	`toomanyrequests`	Éviter Docker Hub — mirrorer via Harbor (voir harbor.md).
Network policy bloque Harbor	`connection refused` vers IP Harbor	Vérifier NetworkPolicy autorise egress vers port registry (30500).
DNS interne cassé	`no such host`	`kubectl get svc -n harbor` + `kubectl run -it --rm --restart=Never test --image=busybox -- nslookup harbor.harbor`.

Fix pérenne (R02 : zero workaround)¶

Jamais : kubectl edit deployment pour changer l'image en live.

Toujours : 1. Edit helm/akko/values.yaml ou values-netcup.yaml avec le bon tag 2. Commit + push 3. CI rebuild + deploy 4. Vérifier pod Running

Si l'image doit être rebuilt : 1. Edit Dockerfile dans docker/<image>/ 2. Commit + push — le workflow deploy-netcup.yml détecte les paths docker/** et relance le build

Prévention¶

Tag pinning R08 : tous les tags doivent être YYYY.MM — jamais :latest. Linter CI : grep -rn ':latest' helm/ docker/ | grep -v example
Harbor retention policy : configurer rétention 90 jours + 5 derniers tags pour éviter les purges accidentelles.
Fallback registry : prévoir un mirror Harbor pour les images upstream (postgres, python, jupyter) si Docker Hub pose problème.

Lessons learned¶

L08 — Tout tag :latest casse tôt ou tard. R08 est strict, zéro tolérance.

Liens utiles¶

Harbor Registry — setup + authentification
CI GitHub Actions — workflow de push
logs layer : {namespace="akko", pod=~".*<service>.*"} |= "pull"