Runbook: NodeCPUHigh

Alert: NodeCPUHigh (PrometheusRule, severity warning at 80%, critical at 95%)

Symptom:

Node <node> is saturating its CPU (average utilization above 80% over 5 min). Pods start to be throttled.

Severity: 🟡 warning at 80%, 🔴 critical at 95%

Diagnosis

1. Identify the loaded node

export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
kubectl top node

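2. Top CPU-consuming pods on this node

# Heaviest pods across the cluster
kubectl top pod -A --sort-by=cpu | head -20
# Pods scheduled on the node, to cross-check against the list above
kubectl get pod -A -o wide --field-selector spec.nodeName=<node>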

3. "Node CPU" dashboard (under Dashboards)

PromQL query:

1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle", instance=~"<node>.*"}[5m]))
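
The arithmetic behind this query can be sketched in a few lines. `node_cpu_seconds_total{mode="idle"}` counts idle seconds per core, so a node-level figure needs the idle time of all cores divided by the total core time available in the window. The helper name `busy_pct` and the sample numbers below are illustrative assumptions, not part of the AKKO tooling.

```shell
# busy_pct IDLE_SECONDS WINDOW_SECONDS CORES
# Hypothetical helper: share of CPU time spent non-idle over a window.
# IDLE_SECONDS is the idle time summed over all cores during the window.
busy_pct() {
  awk -v idle="$1" -v window="$2" -v cores="$3" \
    'BEGIN { printf "%.1f\n", 100 * (1 - idle / (window * cores)) }'
}

# 4 cores, 240 idle core-seconds in total over a 5 min (300 s) window:
busy_pct 240 300 4   # prints 80.0
```

With these numbers the node sits exactly on the warning threshold.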

4. Check for throttling

kubectl exec -n akko <pod> -- cat /sys/fs/cgroup/cpu.stat 2>/dev/null | grep throttled
# If nr_throttled > 0, the container has been throttled at its CPU limit
# (on cgroup v1 nodes the file is typically /sys/fs/cgroup/cpu/cpu.stat)
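
A non-zero `nr_throttled` only says that throttling happened at some point since the container started. A rough way to judge how much it matters is the share of scheduler periods that were throttled (`nr_throttled / nr_periods`, both present in `cpu.stat`). The sketch below assumes that reading; the function name `throttle_pct` is hypothetical.

```shell
# throttle_pct: read cpu.stat content on stdin and print the percentage
# of CFS periods in which the cgroup was throttled (one decimal).
# Prints "n/a" when nr_periods is missing or zero (no CPU limit set).
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else             print "n/a"
    }'
}

# Example with sample cpu.stat content:
printf 'nr_periods 200\nnr_throttled 50\n' | throttle_pct   # prints 25.0

# Against a live pod (same placeholders as the command above):
#   kubectl exec -n akko <pod> -- cat /sys/fs/cgroup/cpu.stat | throttle_pct
```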

Common causes and fixes

Cause	Typical symptom	Fix
Heavy Trino query	Coordinator at 100% CPU for 10+ min	Find it in `system.runtime.queries` and kill it with `CALL system.runtime.kill_query(query_id => '<query_id>', message => '<reason>')`; see trino-slow-queries.fr.md
Spark driver close to OOM	GC thrashing	Increase driver memory or reduce the executor count
Airflow scheduler loop	Scheduler CPU constantly > 80%	Raise `parallelism`, shard the DAGs
Postgres VACUUM FULL	PG stuck at 100% CPU	Wait for the vacuum to finish or cancel it with `pg_cancel_backend()`
Undersized node	Consistently > 70% over 24 h	Add a node or increase the VM size
Mining/crypto container	Unexplained 100% CPU	Audit seccomp + NetworkPolicy egress

Emergency fix (< 5 min)

# 1. Identify the process consuming CPU
ssh root@<node> 'top -b -n1 | head -20'

# 2. If it is Trino: list the running queries, then kill the offender
kubectl exec -n akko deploy/akko-trino -- \
  trino --execute="SELECT query_id, state, query FROM system.runtime.queries WHERE state='RUNNING' ORDER BY created DESC LIMIT 5"

# Kill the query over HTTP with an admin token:
curl -X DELETE -H "Authorization: Bearer $TRINO_ADMIN_TOKEN" \
  https://trino.$AKKO_DOMAIN/v1/query/<query_id>

# 3. If the pods can be rescheduled elsewhere: drain the node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --grace-period=30

Long-term fix (R02)

Option A: raise the resource limits

# helm/examples/values-dev.yaml (Trino example)
trino:
  coordinator:
    resources:
      limits:
        cpu: 2        # was 1
        memory: 4Gi
  worker:
    resources:
      limits:
        cpu: 4        # was 2

Option B: add a node

Netcup: helm upgrade with a new pool
EKS: eksctl scale nodegroup --cluster=<cluster> --name=<nodegroup> --nodes=<N+1>
k3d dev: k3d node create <name> --cluster <cluster>

Option C: HPA (Horizontal Pod Autoscaler)

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 8
  targetCPUUtilizationPercentage: 70
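
To get a feel for what these values imply, the core HPA scaling rule documented by Kubernetes is `desired = ceil(current × currentUtilization / targetUtilization)`, clamped to the min/max replica bounds. Utilization is measured against the pods' CPU requests, not their limits. The sketch below is a simplification: it ignores the controller's tolerance band and stabilization window, and `hpa_desired` is a hypothetical helper name.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_CPU_PCT TARGET_CPU_PCT MIN MAX
# Hypothetical helper: ceil(current * cpu / target), clamped to [MIN, MAX].
# Runs in a subshell so its variables do not leak into the caller.
hpa_desired() (
  current=$1; cpu=$2; target=$3; min=$4; max=$5
  # Integer ceiling without floating point.
  desired=$(( (current * cpu + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
)

# 2 replicas averaging 95% of their CPU request, target 70%, bounds 2..8:
hpa_desired 2 95 70 2 8   # prints 3
```

With the values above, a sustained 95% average would therefore add one replica, and the count can never exceed `maxReplicas: 8`.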

Prevention

"Node resources" dashboard (under Dashboards) with CPU, RAM, and disk per node
PrometheusRule early warning at 70% sustained for 10 min
Quarterly capacity planning review
Throttling alerts based on container_cpu_cfs_throttled_periods_total

Lessons learned

Trino is the main CPU consumer, which is expected. Do not panic at 80% if the running query is legitimate. Do escalate when CPU stays pinned at 100% for more than 30 min.

Useful links

Trino slow queries
Node disk full
Dashboard: Dashboards / "AKKO — Kubernetes overview"