ADR-037: mTLS service mesh selection (Sprint 52 P1)
- Date: 2026-04-26 (Sprint 52 kickoff)
- Status: Proposed, awaiting operator decision
- Related: ADR-029 governance over license, Stream Sécu P1 (#205)
Context
Sprint 47 V2 closed out security for the user-facing layer (functional FQDN + oauth2-proxy ForwardAuth + ADR-035 wildcard TLS). Sprint 49 closed network policy gaps (33/33 charts) + OPA admin-only functions.
The next layer up is inter-service traffic encryption and mutual authentication. Today, every pod-to-pod call inside the akko namespace is plaintext over HTTP. NetworkPolicy gates who can talk to whom, but not what travels on the wire.
Threats that mTLS closes:
- Compromised pod inside akko ns sniffs traffic of another tenant pod
- Operator with kubectl exec privileges can run tcpdump on any pod
- Lateral movement after a breach via plaintext credential capture
Sprint 52 Stream Sécu P1 (#205) calls for "mTLS + cosign + encryption at rest". Cosign verify-policy shipped as template (ADR-036 follow-up). Encryption at rest already partial (pgcrypto Sprint 46 A2). mTLS is the remaining structural piece.
Options considered
Option A: Linkerd (CNCF Graduated, Rust-based proxy)
| Dimension | Score | Notes |
|---|---|---|
| License | ✅ | Apache 2.0, CNCF Graduated (Foundation-neutral, Sprint 47 governance rule respected) |
| Resource cost | ✅ | ~10 MB RAM per pod (Rust micro-proxy) — fits Netcup demo |
| Setup complexity | ✅ | `linkerd install --crds` then `linkerd install`, each piped to `kubectl apply -f -`, plus a namespace annotation |
| mTLS posture | ✅ | Auto-mTLS by default, certificate rotation every 24h |
| Observability | ✅ | Grafana-like dashboard + Prometheus metrics through the `linkerd-viz` extension (separate install) |
| Multi-cluster | ⚠️ | Possible but not as polished as Istio |
| Egress control | ⚠️ | Basic — relies on NetworkPolicy for fine-grained egress |
| Ecosystem | ⚠️ | Smaller than Istio but stable |
| AKKO Lego fit | ✅ | One sub-chart akko-linkerd (planned), namespace annotation, done |
Option B: Istio (most popular, Envoy-based)
| Dimension | Score | Notes |
|---|---|---|
| License | ✅ | Apache 2.0, CNCF Graduated (Foundation-neutral) |
| Resource cost | ❌ | ~150 MB RAM per pod (Envoy + Pilot + Gateway). Sinks Netcup. |
| Setup complexity | ❌ | 5 helm charts (base, istiod, gateway, ztunnel, ambient) |
| mTLS posture | ✅ | PeerAuthentication CRD, sophisticated identity model |
| Observability | ✅ | Kiali + Jaeger native. Big setup. |
| Multi-cluster | ✅ | Best in class |
| Egress control | ✅ | Sophisticated AuthorizationPolicy CRD |
| Ecosystem | ✅ | Largest community, most blog posts |
| AKKO Lego fit | ❌ | Couples 5 sub-charts; conflicts with the AKKO "small Lego" philosophy |
Option C: Cilium service mesh (eBPF, sidecarless)
| Dimension | Score | Notes |
|---|---|---|
| License | ✅ | Apache 2.0, CNCF Graduated |
| Resource cost | ✅ | Zero per-pod overhead (eBPF in kernel) |
| Setup complexity | ⚠️ | Replaces the cluster CNI (k3s flannel → cilium). Not zero-touch. |
| mTLS posture | ✅ | mTLS via SPIFFE identities, no sidecar |
| Observability | ✅ | Hubble UI native — flow-level visibility |
| Multi-cluster | ✅ | Cluster Mesh built-in |
| Egress control | ✅ | Layer 7 policies via CiliumNetworkPolicy |
| Ecosystem | ✅ | Backed by Isovalent (now Cisco). CNCF Graduated 2023. |
| AKKO Lego fit | ⚠️ | Heavier coupling than Linkerd (CNI swap), but no per-pod cost |
Option D: None (status quo)
| Dimension | Score | Notes |
|---|---|---|
| Risk | ❌ | Plaintext inter-pod traffic, no defence against compromised pod |
| Ops effort | ✅ | Nothing to install / maintain |
| Audit/Compliance | ❌ | Fails SOC2 / ISO27001 / PCI-DSS controls on data-in-transit |
Decision drivers
- Sovereignty rule (ADR-029): Foundation-neutral wins over single-vendor. All 3 mesh options are CNCF Graduated, so this is neutral between A/B/C.
- Resource cost on Netcup: the single-node demo box has 16 GB RAM. Istio's 150 MB/pod × 50 pods = 7.5 GB just for sidecars. Linkerd's 10 MB × 50 = 500 MB. Cilium = 0 MB sidecar.
- Multi-infra portability: the chart must work on k3d / k3s / EKS / GKE / AKS / OpenShift / bare-metal. Cilium needs a CNI swap (not trivial on k3s). Linkerd and Istio install on top of any CNI.
- AKKO Lego philosophy: small, swappable sub-charts. Linkerd's single sub-chart plus namespace annotation is the simplest fit; Istio's 5 charts break modularity.
- First-cluster bootstrap: Netcup is the canonical demo. Whatever ships must "just work" on a fresh `helm install akko ...`. Cilium's CNI swap can't be retrofitted on a running k3s cluster without pod downtime.
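The resource-cost arithmetic above can be reproduced with a short sketch. The per-pod figures are the rough estimates quoted in this ADR, not measurements, and the function and constant names are hypothetical.

```python
# Sidecar memory estimate for the single-node demo box.
# Per-pod figures are the ADR's rough estimates, not measured values.
NODE_RAM_MB = 16 * 1000  # 16 GB node, decimal units as in the ADR's arithmetic
POD_COUNT = 50           # pod count assumed in the ADR

PER_POD_SIDECAR_MB = {
    "linkerd": 10,
    "istio": 150,
    "cilium": 0,  # sidecarless: no per-pod proxy
}


def sidecar_overhead_mb(per_pod_mb: int, pods: int = POD_COUNT) -> int:
    """Total sidecar memory across all pods, in MB."""
    return per_pod_mb * pods


def share_of_node(overhead_mb: int, node_ram_mb: int = NODE_RAM_MB) -> float:
    """Fraction of node RAM consumed by sidecars."""
    return overhead_mb / node_ram_mb


if __name__ == "__main__":
    for mesh, per_pod in PER_POD_SIDECAR_MB.items():
        total = sidecar_overhead_mb(per_pod)
        print(f"{mesh:8s} {total:5d} MB  ({share_of_node(total):.1%} of node RAM)")
```

Under these assumptions Istio sidecars would take close to half of the node's RAM, while Linkerd stays at a few percent.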
Recommendation
Option A — Linkerd as the default akko-mtls sub-chart.
Rationale:
- Lowest resource overhead consistent with the Netcup single-node demo
- CNI-agnostic, retrofittable on any running cluster
- Apache 2.0 + CNCF Graduated (ADR-029 compliant)
- Minimal AKKO Lego coupling (one sub-chart, namespace annotation)
- Auto-mTLS + cert rotation requires zero ongoing operator work
Cilium ranks higher on raw performance (eBPF, no sidecar) but the CNI swap is a dealbreaker for a chart that must drop into existing k3s / EKS / GKE clusters. Recommend revisiting Cilium once AKKO has a "first-deploy bootstrap" runbook that can include the CNI choice.
Istio is over-engineered for the AKKO use case and doesn't fit on demo boxes.
Implementation outline (Sprint 52 P1)
- Add a `helm/akko/charts/akko-mtls/` sub-chart wrapping Linkerd's official Helm charts (linkerd-control-plane + linkerd-crds).
- Annotate every other AKKO sub-chart's pod template with `linkerd.io/inject: enabled` (gated by `global.security.mtls.enabled`).
- Update NetworkPolicy templates to also allow Linkerd's identity + destination services on port 8080 + 8086.
- Ship a `tests/integration/test_mtls.py` that asserts `linkerd viz tap` shows `tls=true` on every cross-pod call inside akko ns.
- Document the activation runbook in `docs/admin/mtls.md` (EN+FR).
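The integration test in the outline above could be built around a small checker like the following sketch. It assumes tap output arrives as text lines of space-separated `key=value` tokens in which request events start with `req` and meshed traffic carries `tls=true`; the function name is hypothetical, and invoking the CLI to collect the lines is left out.

```python
import re
from typing import Iterable, List

# Assumed line shape (illustrative only), space-separated key=value tokens:
#   req id=0:0 proxy=in src=10.0.0.1:5000 dst=10.0.0.2:8080 tls=true :method=GET
TLS_TOKEN = re.compile(r"\btls=(\S+)")


def unmeshed_requests(tap_lines: Iterable[str]) -> List[str]:
    """Return the request lines that do not carry tls=true."""
    offenders = []
    for line in tap_lines:
        line = line.strip()
        if not line.startswith("req "):
            continue  # only request events are checked; other events are skipped
        match = TLS_TOKEN.search(line)
        if match is None or match.group(1) != "true":
            offenders.append(line)
    return offenders
```

A test would then collect tap lines for the akko namespace and assert that `unmeshed_requests(lines)` is empty, which keeps the parsing logic testable without a live cluster.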
Default `akko-mtls.enabled=false` for one sprint, then flip to `true` on the next major (v2026.05).
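The NetworkPolicy change in the outline could look roughly like the manifest built below. This is a sketch under assumptions: the policy name is hypothetical, the control plane is assumed to live in a namespace called `linkerd`, and the rule is additive to the per-chart policies that already exist. Only the two ports named in the outline are opened.

```python
import json

# Ports named in the outline: Linkerd identity and destination services.
LINKERD_CONTROL_PLANE_PORTS = (8080, 8086)


def linkerd_egress_policy(namespace: str = "akko",
                          linkerd_namespace: str = "linkerd") -> dict:
    """Build a NetworkPolicy allowing egress to the Linkerd control plane.

    The policy name and the 'linkerd' namespace are assumptions for illustration.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-linkerd-control-plane", "namespace": namespace},
        "spec": {
            "podSelector": {},  # applies to every pod in the namespace
            "policyTypes": ["Egress"],
            "egress": [{
                "to": [{
                    "namespaceSelector": {
                        # label set automatically on namespaces by recent Kubernetes
                        "matchLabels": {"kubernetes.io/metadata.name": linkerd_namespace}
                    }
                }],
                "ports": [
                    {"protocol": "TCP", "port": port}
                    for port in LINKERD_CONTROL_PLANE_PORTS
                ],
            }],
        },
    }


if __name__ == "__main__":
    # JSON output can be piped to kubectl apply -f - for a manual check.
    print(json.dumps(linkerd_egress_policy(), indent=2))
```

In the chart itself this would be a Helm template rather than Python; the dictionary form is only meant to make the intended rule explicit.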
Consequences
Positive
- Inter-pod traffic encrypted + authenticated by default
- Zero per-pod application change (sidecar-injected)
- Compliance posture improves for SOC2 / ISO27001 / PCI-DSS audits
- Single-vendor risk avoided (CNCF Foundation governance)
Negative
- ~10 MB RAM per pod (50 pods × 10 MB = 500 MB on Netcup, manageable)
- One more sub-chart to maintain
- Linkerd's adoption is smaller than Istio — fewer Stack Overflow answers
- Initial mesh injection requires a one-time pod restart cluster-wide
Risks
- Linkerd's UI requires a separate `linkerd-viz` install, and the operator must learn the CLI for `linkerd check` / `linkerd identity`
- Workload certificates rotate automatically (every 24h by default), but the trust anchor and issuer certificate do not: the AKKO chart should ship a CronJob or use cert-manager integration to renew them
- Compatibility with Spark Connect / Trino federation needs E2E validation (HTTP/2 + gRPC pass through Linkerd, but verify on Netcup before flipping enabled=true)
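A CronJob of the kind mentioned in the certificate risk needs a rule for deciding when renewal is due. The sketch below shows one such rule; the function name and the 30-day window are hypothetical, and reading the expiry date out of the certificate is deliberately omitted.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_renewal(not_after: datetime,
                  now: Optional[datetime] = None,
                  window: timedelta = timedelta(days=30)) -> bool:
    """Return True when the certificate expires within the renewal window.

    not_after must be timezone-aware; the 30-day default is an assumption.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return not_after - now <= window
```

An already expired certificate also returns True, since the remaining lifetime is then negative.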
Reference
- Linkerd: https://linkerd.io (CNCF Graduated 2021)
- ADR-029 — governance over license (Foundation-neutral rule)
- ADR-032 — image signing + SBOM (cosign)
- ADR-036 — functional URLs (sister-document for the user-facing layer)
- Stream Sécu P1 task #205 (this ADR's home)