ADR-037 — mTLS service mesh selection (Sprint 52 P1)

Date: 2026-04-26 (Sprint 52 kickoff)
Status: Proposed, awaiting operator decision
Related: ADR-029 governance over license, Stream Sécu P1 (#205)

Context

Sprint 47 V2 closed user-facing layer security (functional FQDN + oauth2-proxy ForwardAuth + ADR-035 wildcard TLS). Sprint 49 closed network policy gaps (33/33 charts) + OPA admin-only fns.

The next layer up is inter-service traffic encryption and mutual authentication. Today, every pod-to-pod call inside the akko namespace travels as plaintext HTTP. NetworkPolicy gates who can talk to whom, but not what travels on the wire.

Threat model that mTLS closes:

  • Compromised pod inside akko ns sniffs traffic of another tenant pod
  • Operator with kubectl exec privileges can run tcpdump on any pod
  • Lateral movement after a breach via plaintext credential capture

Sprint 52 Stream Sécu P1 (#205) calls for "mTLS + cosign + encryption at rest". Cosign verify-policy shipped as template (ADR-036 follow-up). Encryption at rest already partial (pgcrypto Sprint 46 A2). mTLS is the remaining structural piece.

Options considered

Option A — Linkerd (CNCF Graduated, Rust-based proxy)

| Dimension | Score | Notes |
| --- | --- | --- |
| License | | Apache 2.0, CNCF Graduated (Foundation-neutral, Sprint 47 governance rule respected) |
| Resource cost | | ~10 MB RAM per pod (Rust micro-proxy); fits Netcup demo |
| Setup complexity | | linkerd install \| kubectl apply -f - + namespace annotation, that's it |
| mTLS posture | | Auto-mTLS by default, certificate rotation every 24h |
| Observability | | Built-in Grafana-like dashboard + Prometheus metrics out of the box |
| Multi-cluster | ⚠️ | Possible but not as polished as Istio |
| Egress control | ⚠️ | Basic; relies on NetworkPolicy for fine-grained egress |
| Ecosystem | ⚠️ | Smaller than Istio but stable |
| AKKO Lego fit | | One sub-chart akko-linkerd (planned), namespace annotation, done |

Option B — Istio (CNCF Graduated, Envoy-based proxy)

| Dimension | Score | Notes |
| --- | --- | --- |
| License | | Apache 2.0, CNCF Graduated (Foundation-neutral) |
| Resource cost | | ~150 MB RAM per pod (Envoy + Pilot + Gateway). Sinks Netcup. |
| Setup complexity | | 5 helm charts (base, istiod, gateway, ztunnel, ambient) |
| mTLS posture | | PeerAuthentication CRD, sophisticated identity model |
| Observability | | Kiali + Jaeger native. Big setup. |
| Multi-cluster | | Best in class |
| Egress control | | Sophisticated AuthorizationPolicy CRD |
| Ecosystem | | Largest community, most blog posts |
| AKKO Lego fit | | Couples 5 sub-charts; conflicts with the AKKO "small Lego" philosophy |

Option C — Cilium service mesh (eBPF, sidecarless)

| Dimension | Score | Notes |
| --- | --- | --- |
| License | | Apache 2.0, CNCF Graduated |
| Resource cost | | Zero per-pod overhead (eBPF in kernel) |
| Setup complexity | ⚠️ | Replaces the cluster CNI (k3s flannel → cilium). Not zero-touch. |
| mTLS posture | | mTLS via SPIFFE identities, no sidecar |
| Observability | | Hubble UI native; flow-level visibility |
| Multi-cluster | | Cluster Mesh built-in |
| Egress control | | Layer 7 policies via CiliumNetworkPolicy |
| Ecosystem | | Backed by Isovalent (now Cisco). CNCF Graduated 2023. |
| AKKO Lego fit | ⚠️ | Heavier coupling than Linkerd (CNI swap), but no per-pod cost |

Option D — None (status quo)

| Dimension | Score | Notes |
| --- | --- | --- |
| Risk | | Plaintext inter-pod traffic, no defence against compromised pod |
| Ops effort | | Nothing to install / maintain |
| Audit/Compliance | | Fails SOC2 / ISO27001 / PCI-DSS controls on data-in-transit |

Decision drivers

  1. Sovereignty rule (ADR-029) — Foundation-neutral wins over single-vendor. All 3 mesh options are CNCF Graduated, so this is neutral between A/B/C.
  2. Resource cost on Netcup — Single-node demo box has 16 GB RAM. Istio's 150 MB/pod × 50 pods = 7.5 GB just for sidecars. Linkerd's 10 MB × 50 = 500 MB. Cilium = 0 MB sidecar.
  3. Multi-infra portability — chart must work on k3d / k3s / EKS / GKE / AKS / OpenShift / bare-metal. Cilium needs CNI swap (not trivial on k3s). Linkerd/Istio install on top of any CNI.
  4. AKKO Lego philosophy — small, swappable sub-charts. Linkerd's single sub-chart plus a namespace annotation is the simplest fit; Istio's 5 charts break modularity.
  5. First-cluster bootstrap — Netcup is the canonical demo. Whatever ships must "just work" on a fresh helm install akko .... Cilium's CNI swap can't be retrofitted on a running k3s cluster without pod downtime.
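
The arithmetic behind driver 2 can be reproduced with a short script. This is a minimal sketch that uses only the figures quoted above (a 16 GB node, 50 pods, and per-pod estimates of 150 MB, 10 MB, and 0 MB). The function and constant names are illustrative, and the per-pod numbers are this ADR's estimates, not measurements.

```python
# Per-pod sidecar memory estimates quoted in this ADR (MB); not measured values.
SIDECAR_MB_PER_POD = {"linkerd": 10, "istio": 150, "cilium": 0}

NODE_RAM_MB = 16 * 1000  # Netcup demo box, 16 GB (decimal, matching "7.5 GB" above)
POD_COUNT = 50           # pod count used in the decision drivers


def sidecar_overhead_mb(mesh: str, pods: int = POD_COUNT) -> int:
    """Total sidecar memory for `pods` meshed pods, in MB."""
    return SIDECAR_MB_PER_POD[mesh] * pods


def overhead_share(mesh: str, pods: int = POD_COUNT, node_mb: int = NODE_RAM_MB) -> float:
    """Fraction of node RAM consumed by sidecars alone."""
    return sidecar_overhead_mb(mesh, pods) / node_mb


if __name__ == "__main__":
    for mesh in SIDECAR_MB_PER_POD:
        total = sidecar_overhead_mb(mesh)
        print(f"{mesh:8s} {total:5d} MB  ({overhead_share(mesh):.1%} of node RAM)")
```

Under these assumptions Istio's sidecars alone would take a little under half of the node's RAM, while Linkerd's stay in the low single-digit percent range. Re-running with a different pod count shows how quickly the gap widens.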

Recommendation

Option A — Linkerd as the default akko-mtls sub-chart.

Rationale:

  • Lowest resource overhead among the CNI-agnostic options, consistent with the Netcup single-node demo
  • CNI-agnostic, retrofittable on any running cluster
  • Apache 2.0 + CNCF Graduated (ADR-029 compliant)
  • Minimal AKKO Lego coupling (one sub-chart, namespace annotation)
  • Auto-mTLS and automatic workload certificate rotation require no routine operator work

Cilium ranks higher on raw performance (eBPF, no sidecar) but the CNI swap is a dealbreaker for a chart that must drop into existing k3s / EKS / GKE clusters. Recommend revisiting Cilium once AKKO has a "first-deploy bootstrap" runbook that can include the CNI choice.

Istio is over-engineered for the AKKO use case and doesn't fit on demo boxes.

Implementation outline (Sprint 52 P1)

  1. Add helm/akko/charts/akko-mtls/ sub-chart wrapping Linkerd's official Helm charts (linkerd-control-plane + linkerd-crds).
  2. Annotate every other AKKO sub-chart's pod template with linkerd.io/inject: enabled (gated by global.security.mtls.enabled).
  3. Update NetworkPolicy templates to also allow Linkerd's identity and destination services on ports 8080 and 8086.
  4. Ship a tests/integration/test_mtls.py that asserts linkerd viz tap shows mTLS=true on every cross-pod call inside akko ns.
  5. Document the activation runbook in docs/admin/mtls.md (EN+FR).
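
Step 4 could start from a check like the one below. It is a minimal sketch, not the project's actual test. It assumes the text output of linkerd viz tap prints one event per line and that request lines carry a `tls=<value>` field, with `tls=true` for mutually authenticated calls; that format should be confirmed against the Linkerd version actually deployed. The helper name `non_mtls_requests` and the sample lines are hypothetical.

```python
import re

# Assumption: request events in `linkerd viz tap` text output start with "req"
# and carry a `tls=<value>` field. Verify against the deployed Linkerd version.
_TLS_FIELD = re.compile(r"\btls=(\S+)")


def non_mtls_requests(tap_output: str) -> list[str]:
    """Return the request lines whose tls field is missing or not 'true'."""
    offenders = []
    for line in tap_output.splitlines():
        line = line.strip()
        if not line.startswith("req "):
            continue  # skip response/end events and blank lines
        match = _TLS_FIELD.search(line)
        if match is None or match.group(1) != "true":
            offenders.append(line)
    return offenders


# Hypothetical captured output, for illustration only.
SAMPLE = """\
req id=0:0 proxy=in src=10.42.0.11:51234 dst=10.42.0.12:8080 tls=true :method=GET :path=/health
rsp id=0:0 proxy=in src=10.42.0.11:51234 dst=10.42.0.12:8080 tls=true :status=200
req id=0:1 proxy=in src=10.42.0.13:40000 dst=10.42.0.12:8080 tls=not_provided_by_remote :method=GET :path=/
"""

if __name__ == "__main__":
    for line in non_mtls_requests(SAMPLE):
        print("plaintext call:", line)
```

In an integration test, the input string would come from a bounded capture of the tap stream inside the akko namespace, and the assertion would be that the returned list is empty. Keeping the parser as a pure function lets it be unit-tested without a cluster.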

Default akko-mtls.enabled=false for one sprint then flip to true on the next major (v2026.05).

Consequences

Positive

  • Inter-pod traffic encrypted + authenticated by default
  • Zero per-pod application change (sidecar-injected)
  • Compliance posture improves for SOC2 / ISO27001 / PCI-DSS audits
  • Single-vendor risk avoided (CNCF Foundation governance)

Negative

  • ~10 MB RAM per pod (50 pods × 10 MB = 500 MB on Netcup, manageable)
  • One more sub-chart to maintain
  • Linkerd's adoption is smaller than Istio — fewer Stack Overflow answers
  • Initial mesh injection requires a one-time pod restart cluster-wide

Risks

  • Linkerd's UI requires a separate linkerd-viz install — operator must learn the CLI for linkerd check / linkerd identity
  • Workload certificates rotate automatically (every 24h by default), but the trust anchor and issuer certificates do not and must be renewed before they expire. The AKKO chart should ship a CronJob or use cert-manager integration
  • Compatibility with Spark Connect / Trino federation needs E2E validation (HTTP/2 + gRPC pass through Linkerd, but verify on Netcup before flipping enabled=true)
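
The trust anchor renewal risk comes down to an expiry check that a CronJob could run. The sketch below shows only the date logic and assumes the certificate's not-after timestamp has already been read by some other means. `needs_renewal`, the 30-day threshold, and the example dates are all hypothetical; none of them come from Linkerd or the AKKO chart.

```python
from datetime import datetime, timedelta, timezone


def needs_renewal(not_after: datetime, now: datetime,
                  threshold: timedelta = timedelta(days=30)) -> bool:
    """True when the certificate expires within `threshold` of `now`.

    Both datetimes must be timezone-aware. The 30-day default is an
    illustrative choice, not a Linkerd recommendation.
    """
    return not_after - now <= threshold


if __name__ == "__main__":
    now = datetime(2026, 4, 26, tzinfo=timezone.utc)     # ADR date, used as an example
    expiry = datetime(2026, 5, 10, tzinfo=timezone.utc)  # hypothetical not-after value
    print("renew trust anchor:", needs_renewal(expiry, now))
```

An already expired certificate also returns True, so the same check covers both the "renew soon" and "renew now" cases. With cert-manager integration this logic would be unnecessary, since renewal is handled there.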

Reference

  • Linkerd : https://linkerd.io (CNCF Graduated 2021)
  • ADR-029 — governance over license (Foundation-neutral rule)
  • ADR-032 — image signing + SBOM (cosign)
  • ADR-036 — functional URLs (sister-document for the user-facing layer)
  • Stream Sécu P1 task #205 (this ADR's home)