Runbook: Trino slow queries / high latency

Alert: TrinoHighQueryLatency (severity: warning)

Trino P95 query latency > 5s


Diagnosis

1. Identify the slow queries

export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

# Via Trino UI
kubectl port-forward -n akko svc/akko-trino 8080:8080
# Open http://localhost:8080 → Query History

Or via SQL:

-- Top 10 slowest queries over the last 24h
SELECT
  user, query,
  date_diff('second', create_time, end_time) AS duration_s,
  total_bytes / 1024.0 / 1024.0 AS mb_scanned,
  total_rows
FROM system.runtime.queries
WHERE create_time > NOW() - INTERVAL '24' HOUR
  AND state = 'FINISHED'
ORDER BY duration_s DESC
LIMIT 10;

2. Check memory on the Trino coordinator + workers

kubectl top pods -n akko | grep -E 'trino'

If the pods are close to their memory limits, queries time out or the queue builds up.
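
To see whether a queue is actually building up, a quick count by state against system.runtime.queries helps (a minimal sketch; adjust the state filter as needed):

-- Currently queued vs running queries; a growing QUEUED count means the cluster can't keep up
SELECT state, count(*) AS queries
FROM system.runtime.queries
WHERE state IN ('QUEUED', 'RUNNING')
GROUP BY state;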

3. Iceberg metadata stats

-- Iceberg: orphan files, uncompacted manifests
SHOW STATS FOR iceberg.banking.customers;
SELECT * FROM iceberg.banking."customers$files" LIMIT 10;
SELECT * FROM iceberg.banking."customers$manifests" LIMIT 10;

Causes + remediation

Case 1: Fragmented Iceberg table (many small files)

Fix: Iceberg compaction

ALTER TABLE iceberg.banking.customers EXECUTE optimize;
ALTER TABLE iceberg.banking.customers EXECUTE expire_snapshots(retention_threshold => '7d');
ALTER TABLE iceberg.banking.customers EXECUTE remove_orphan_files;
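
If only the small files need rewriting, optimize also accepts a file size threshold (a minimal sketch; the 128MB value is an assumption, match it to your target file size):

-- Only rewrite data files smaller than the threshold
ALTER TABLE iceberg.banking.customers EXECUTE optimize(file_size_threshold => '128MB');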

Case 2: Not enough workers

Fix: bump trino.server.workers in the values file

# helm/examples/values-dev.yaml
trino:
  server:
    workers: 3  # default is often 1
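
After the rollout, confirm the workers actually registered with the coordinator (a minimal sketch using the system.runtime.nodes table):

-- Non-coordinator nodes currently known to the cluster
SELECT count(*) AS worker_nodes
FROM system.runtime.nodes
WHERE coordinator = false;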

Case 3: Inefficient query (full scan, cross join)

Fix: analyze with EXPLAIN ANALYZE

EXPLAIN ANALYZE SELECT ...;

Recommend partitioning / bucketing on frequently filtered tables (see the sketch below).
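
For example, an Iceberg table partitioned and bucketed on commonly filtered columns (a minimal sketch; signup_date and customer_id are hypothetical, pick the columns your queries actually filter on):

-- Hypothetical: partition by month of a date column, bucket on an id column
CREATE TABLE iceberg.banking.customers_partitioned
WITH (partitioning = ARRAY['month(signup_date)', 'bucket(customer_id, 16)'])
AS SELECT * FROM iceberg.banking.customers;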

Case 4: OPA policy latency (every query hits OPA)

Check:

kubectl logs -n akko deploy/akko-akko-opa --tail=50 | grep -i 'eval\|slow'

If OPA is the bottleneck, review the policies (caching, decision logging).
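
Access-control checks run during query analysis/planning, so OPA overhead tends to show up there; a minimal sketch, assuming the queued_time_ms / analysis_time_ms columns exposed by your Trino version's system.runtime.queries:

-- Queries with the highest analysis time; consistently high values point at planning / access-control overhead
SELECT query_id, state, queued_time_ms, analysis_time_ms
FROM system.runtime.queries
ORDER BY analysis_time_ms DESC
LIMIT 10;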


Lessons learned

No Trino-perf L recorded yet. Add one to _RULES.md after the first incident is resolved.


See also