ADR-035 — TLS wildcard via cert-manager + Let's Encrypt DNS-01
- Status: Proposed (Sprint 47 V2, the durable fix for the functional FQDN refactor)
- Date: 2026-04-25
- Drivers: The Sprint 47 functional URL refactor needs every new sub-domain to be reachable over HTTPS without per-host operator intervention. The Traefik-Hub-managed per-host cert flow is neither durable nor portable across infrastructures.
Context
Sprint 47 introduces ~15 new functional sub-domains (lab, bi,
orchestrator, federation, compute, experiments, llm,
metrics, directory, storage, registry, alerts, mcp,
mcp-catalog, auth/identity Phase 2). The DNS wildcard
*.akko-ai.com → 159.195.77.208 is already in place at the registrar.
Current TLS state on Netcup:
- Legacy hosts (jupyter.akko-ai.com, mlflow.akko-ai.com, etc.) have
per-host Let's Encrypt certs auto-provisioned at first hit by
Traefik Hub.
- New hosts (lab.akko-ai.com, bi.akko-ai.com, etc.) get no cert, so the
TLS handshake fails with tlsv1 alert internal error and the user sees
the redirect_uri login flow break.
Why per-host is not the right answer:
1. Let's Encrypt rate-limits new certs to 50/week per registered domain. With ~30 sub-domains today plus every new layer or per-tenant sub-domain tomorrow, the limit is reached fast.
2. Per-host certs bind to Traefik Hub's specific config, so moving to EKS / AKS / GKE / OpenShift / bare-metal breaks the contract.
3. The Lego rule is violated: swapping the ingress controller (Traefik → NGINX → Cilium Gateway) shouldn't force a TLS migration.
Considered options
Option A — cert-manager + Let's Encrypt DNS-01 wildcard
cert-manager is a CNCF Graduated project. It issues a single *.akko-ai.com cert that
covers every present and future sub-domain.
- License: Apache 2.0.
- Mechanism: DNS-01 challenge (cert-manager provisions a
_acme-challenge.akko-ai.com TXT record via the DNS provider API).
- Portable: works on every k8s distro (k3d/k3s/kind/EKS/AKS/GKE/OpenShift/bare-metal/air-gapped if private CA is used).
- Storage: one Secret akko-wildcard-tls referenced by every Ingress' tls.secretName.
- Renewal: cert-manager renews 30 days before expiry, fully automated.
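To illustrate the storage point, an Ingress that consumes the shared Secret might look like the sketch below. The Ingress name, backend Service name and port are hypothetical; only the host, the akko namespace and the akko-wildcard-tls Secret name come from this ADR. Kubernetes resolves tls.secretName in the Ingress's own namespace, so this assumes the AKKO Ingresses live in the same namespace as the Secret.

```yaml
# Sketch only: the "lab" Ingress/Service names and port 80 are assumptions.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: lab
  namespace: akko
spec:
  tls:
    - hosts:
        - lab.akko-ai.com
      secretName: akko-wildcard-tls   # shared wildcard Secret, same namespace
  rules:
    - host: lab.akko-ai.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: lab
                port:
                  number: 80
```

Because the tls block only names a Secret, the same manifest should work unchanged whether that Secret was issued by cert-manager or created by hand.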
Option B — cert-manager + HTTP-01 per FQDN
Same operator, but issues a separate cert per host via the HTTP-01 challenge (port 80 reachable from LE).
- Pros: no DNS provider API token needed.
- Cons:
  - Hits the LE rate limit (50 new certs per registered domain in any rolling 168 h window) at scale.
  - Ingress must terminate on port 80 too (some Traefik / OpenShift Routes are HTTPS-only by default).
  - Each new sub-domain triggers a fresh ACME flow → ~30 s wait at first request (vs. 0 s with wildcard).
Option C — keep Traefik Hub auto-cert
Traefik Hub auto-provisions per-host certs from the edge cluster. Works on the Netcup deploy today.
- Pros: zero install (already in place).
- Cons:
  - Tied to Traefik Hub specifically; non-Traefik clusters need a different solution.
  - Per-host = same LE rate-limit problem as Option B.
  - The functional URL refactor (Sprint 47) was designed around the Lego rule, so TLS provisioning should be Lego too.
Option D — operator-supplied cert (BYOS)
Operator hands AKKO a wildcard cert from their corporate CA / Sectigo / DigiCert account.
- Pros: works air-gapped, no LE dependency.
- Cons: not zero-config — every customer brings their own cert.
Decision
Adopt Option A — cert-manager + Let's Encrypt DNS-01 wildcard as the default AKKO TLS posture, with Option D (BYOS) as a documented fallback for air-gapped / banking customers.
The wildcard *.akko-ai.com cert covers every sub-domain (lab, bi,
orchestrator, federation, compute, experiments, llm,
metrics, directory, storage, registry, alerts, mcp,
mcp-catalog, aden, rag, reports, docs, query, keycloak,
identity …) and any future layer added without operator action.
Consequences
Positive
- One cert renewal cycle instead of N. Zero rate-limit pressure.
- Adding a new sub-domain (Sprint 48+ tenant routing, etc.) is a single Helm values flip — no cert dance.
- Works identically on k3d / k3s / EKS / AKS / GKE / OpenShift / bare-metal, because cert-manager with a DNS-01 solver only needs outbound access to the ACME endpoint and the DNS provider API.
- Multi-tenant per-customer sub-domains (<tenant>.<domain>) come for free.
Negative
- DNS provider API token required (Cloudflare / Netcup CCP / Route53 / Azure DNS). Stored in a K8s Secret with restricted RBAC.
- One additional operator (cert-manager controller pod, ~70 MB RAM).
- Initial rollout: ~5-10 minutes for the first wildcard issuance (DNS-01 propagation + ACME challenge); renewals then run in the background with no user-visible wait.
Neutral
- Wildcard certs cover only one DNS level (*.akko-ai.com matches lab.akko-ai.com but not lab.tenant1.akko-ai.com). For nested multi-tenant DNS, a second wildcard *.tenant1.akko-ai.com is needed (handled by adding a second Certificate resource).
Implementation plan
Sprint 47 V2 (estimated 4 h):
- Install cert-manager as a Helm sub-chart dependency
(cert-manager/cert-manager v1.16+, in its own cert-manager ns).
- DNS provider Secret (operator-provided). Document the four biggest providers + a script for each:
- ClusterIssuer (Let's Encrypt prod with DNS-01 solver):
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata: { name: akko-letsencrypt }
    spec:
      acme:
        email: "{{ .Values.global.acmeEmail }}"
        server: https://acme-v02.api.letsencrypt.org/directory
        privateKeySecretRef: { name: akko-letsencrypt-account }
        solvers:
          - dns01:
              cloudflare:
                apiTokenSecretRef:
                  name: dns-provider-token
                  key: api-token
- Wildcard Certificate:
- Flip every Ingress' tls.secretName from akko-tls to akko-wildcard-tls. One sed pass on helm/akko/charts/*/templates/ingress.yaml + helm/scripts/generate-domain-values.sh.
- Validation:
- Re-flip cockpit subdomain refs to functional FQDNs (revert of
the Sprint 47 hot-fix 39f7e8b).
- Run Playwright suite on the live cluster: every persona × 10 actions green.
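The DNS provider Secret and wildcard Certificate steps above could be sketched as follows, assuming the Cloudflare solver from the ClusterIssuer step. The Certificate name akko-wildcard and the inclusion of the apex domain are assumptions, not taken from the chart; the Secret names, key, issuer name and akko namespace follow this ADR. For a ClusterIssuer, cert-manager reads the token Secret from its cluster resource namespace, which by default is the namespace cert-manager runs in.

```yaml
# Sketch only. The token value is a placeholder; never commit a real token.
apiVersion: v1
kind: Secret
metadata:
  name: dns-provider-token
  namespace: cert-manager        # ClusterIssuer secrets are read from here by default
type: Opaque
stringData:
  api-token: "<cloudflare-api-token>"
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: akko-wildcard            # hypothetical name
  namespace: akko
spec:
  secretName: akko-wildcard-tls  # Secret that the Ingresses reference
  issuerRef:
    name: akko-letsencrypt
    kind: ClusterIssuer
  dnsNames:
    - "*.akko-ai.com"
    - "akko-ai.com"              # assumption: added because a wildcard does not cover the apex
```

Of the providers named under Consequences, Cloudflare, Route53 and Azure DNS have built-in cert-manager DNS-01 solvers; Netcup would most likely need a webhook solver instead of the cloudflare block shown in the ClusterIssuer.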
BYOS (air-gapped) fallback — Option D
For customers without internet access for ACME challenges:
# values-byos-tls.yaml
cert-manager:
  enabled: false
global:
  tls:
    secretName: akko-wildcard-tls  # operator pre-creates this
    # via:
    #   kubectl -n akko create secret tls akko-wildcard-tls \
    #     --cert=path/to/wildcard.pem \
    #     --key=path/to/wildcard.key
The chart honours either path: cert-manager-issued OR operator-supplied.
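One way the chart could honour both paths is a condition on the sub-chart dependency, so that cert-manager.enabled: false skips the install entirely. This is a sketch under assumptions: the Chart.yaml location and the version constraint are illustrative, and the repository URL is the public Jetstack chart repository.

```yaml
# helm/akko/Chart.yaml (excerpt, hypothetical layout)
dependencies:
  - name: cert-manager
    repository: https://charts.jetstack.io
    version: ">=1.16.0"               # assumption: mirrors the "v1.16+" target
    condition: cert-manager.enabled   # set to false in values-byos-tls.yaml
```

The ClusterIssuer and Certificate templates would presumably need the same guard (for example a Helm if on that value), so that the BYOS path renders no cert-manager resources against a cluster that lacks the CRDs.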
Multi-tenant nested wildcards (Sprint 48+ preview)
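When customer onboarding adds <tenant>.akko-ai.com per-tenant
sub-domains, the platform-wide wildcard already covers the tenant host itself.
Hosts nested beneath it need a second Certificate resource for *.<tenant>.akko-ai.com,
which can be issued without touching the platform-wide one.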
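A per-tenant Certificate could be sketched as below, reusing the akko-letsencrypt ClusterIssuer. The tenant name tenant1 (borrowed from the example under Consequences), the resource name and the Secret name are hypothetical.

```yaml
# Sketch only: names are illustrative, not defined by the chart.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: akko-wildcard-tenant1
  namespace: akko
spec:
  secretName: akko-wildcard-tenant1-tls   # separate Secret from the platform cert
  issuerRef:
    name: akko-letsencrypt
    kind: ClusterIssuer
  dnsNames:
    - "*.tenant1.akko-ai.com"
```

Each tenant Certificate counts as one new certificate against the Let's Encrypt per-registered-domain limit, so onboarding many tenants at once would still need to watch that quota.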
Verification at install time
The akko-init smoke test gets a new check:
kubectl -n akko get secret akko-wildcard-tls -o json | \
jq -r '.data["tls.crt"]' | base64 -d | openssl x509 -noout -ext subjectAltName | \
grep -q "DNS:\\*\\.${AKKO_DOMAIN}" && echo OK || exit 1
References
- Sprint 47 master plan: akko-technical-map/sprints/sprint-47-functional-urls-master-plan.md
- Sprint 47 hot-fix (revert cockpit subdomains): commit 39f7e8b
- cert-manager docs: https://cert-manager.io/docs/configuration/acme/dns01/
- Memory rule: feedback_no_workarounds.md ("STRICT: production-ready fixes only")