ADR-035 — TLS wildcard via cert-manager + Let's Encrypt DNS-01

  • Status: Proposed (Sprint 47 V2: durable fix for the functional FQDN refactor)
  • Date: 2026-04-25
  • Drivers: The Sprint 47 functional URL refactor needs every new sub-domain to be reachable over HTTPS without per-host operator intervention. The Traefik-Hub-managed per-host cert flow is neither durable nor portable across infrastructures.

Context

Sprint 47 introduces ~15 new functional sub-domains (lab, bi, orchestrator, federation, compute, experiments, llm, metrics, directory, storage, registry, alerts, mcp, mcp-catalog, auth/identity Phase 2). The DNS wildcard *.akko-ai.com → 159.195.77.208 is already in place at the registrar.

Current TLS state on Netcup:

  • Legacy hosts (jupyter.akko-ai.com, mlflow.akko-ai.com, etc.) have per-host Let's Encrypt certs, auto-provisioned at first hit by Traefik Hub.
  • New hosts (lab.akko-ai.com, bi.akko-ai.com, etc.) get no cert, so the TLS handshake fails with tlsv1 alert internal error and the user sees the redirect_uri loop break.

Why per-host is not the right answer:

  1. Let's Encrypt rate-limits new certs to 50 per week per registered domain. With ~30 sub-domains today, plus every new layer and per-tenant sub-domain tomorrow, the limit is reached fast.
  2. Per-host certs bind to Traefik Hub's specific config, so moving to EKS / AKS / GKE / OpenShift / bare-metal breaks the contract.
  3. The Lego rule is violated: swapping the ingress controller (Traefik → NGINX → Cilium Gateway) shouldn't force a TLS migration.

Considered options

Option A — cert-manager + Let's Encrypt DNS-01 wildcard

CNCF Graduated project. Issues a single *.akko-ai.com cert that covers every present and future sub-domain.

  • License: Apache 2.0.
  • Mechanism: DNS-01 challenge (cert-manager provisions a _acme-challenge.akko-ai.com TXT record via the DNS provider API).
  • Portable: works on every k8s distro (k3d / k3s / kind / EKS / AKS / GKE / OpenShift / bare-metal, and air-gapped if a private CA is used).
  • Storage: one Secret akko-wildcard-tls referenced by every Ingress' tls.secretName.
  • Renewal: cert-manager renews 30 days before expiry, fully automated.
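
To make the Storage bullet concrete, here is a minimal Ingress sketch. The Ingress name, backend Service name, and port are hypothetical; only the host, namespace, and Secret name come from this ADR. An Ingress can only reference a TLS Secret in its own namespace, so the sketch assumes every AKKO Ingress lives in the akko namespace next to akko-wildcard-tls.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: lab                 # hypothetical Ingress name
  namespace: akko           # must match the namespace of the TLS Secret
spec:
  tls:
    - hosts:
        - lab.akko-ai.com   # covered by the *.akko-ai.com SAN
      secretName: akko-wildcard-tls
  rules:
    - host: lab.akko-ai.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: lab   # hypothetical backend Service
                port:
                  number: 80
```

Any other sub-domain would reuse the same secretName and change only the host entries.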

Option B — cert-manager + HTTP-01 per FQDN

Same operator, but issues a separate cert per host via the HTTP-01 challenge (port 80 reachable from LE).

  • Pros: no DNS provider API token needed.
  • Cons:
    • Hits the LE rate limit at scale (50 new certs per registered domain per rolling 168 h window).
    • Ingress must terminate on port 80 too (some Traefik / OpenShift Routes are HTTPS-only by default).
    • Each new sub-domain triggers a fresh ACME flow → ~30 s wait at first request (vs. 0 s with wildcard).
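
For comparison, an HTTP-01 issuer for Option B might look like the sketch below. The issuer name, account Secret name, email address, and the traefik ingress class are assumptions, not values from this ADR. Let's Encrypt only issues wildcard certificates through DNS-01, so this issuer could only ever serve per-host Certificates.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: akko-letsencrypt-http01          # hypothetical name
spec:
  acme:
    email: ops@example.com               # placeholder contact address
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: akko-letsencrypt-http01-account   # hypothetical account key Secret
    solvers:
      - http01:
          ingress:
            # Assumed ingress class; the solver creates a temporary
            # Ingress on port 80 to answer the challenge.
            ingressClassName: traefik
```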

Option C — keep Traefik Hub auto-cert

Traefik Hub auto-provisions per-host certs from the edge cluster. Works on the Netcup deploy today.

  • Pros: zero install (already in place).
  • Cons:
    • Tied to Traefik Hub specifically; non-Traefik clusters need a different solution.
    • Per-host = same LE rate-limit problem as Option B.
    • The functional URL refactor (Sprint 47) was designed around the Lego rule; TLS provisioning should be Lego too.

Option D — operator-supplied cert (BYOS)

Operator hands AKKO a wildcard cert from their corporate CA / Sectigo / DigiCert account.

  • Pros: works air-gapped, no LE dependency.
  • Cons: not zero-config — every customer brings their own cert.

Decision

Adopt Option A — cert-manager + Let's Encrypt DNS-01 wildcard as the default AKKO TLS posture, with Option D (BYOS) as a documented fallback for air-gapped / banking customers.

The wildcard *.akko-ai.com cert covers every sub-domain (lab, bi, orchestrator, federation, compute, experiments, llm, metrics, directory, storage, registry, alerts, mcp, mcp-catalog, aden, rag, reports, docs, query, keycloak, identity …) and any future layer added without operator action.

Consequences

Positive

  • One cert renewal cycle instead of N. Zero rate-limit pressure.
  • Adding a new sub-domain (Sprint 48+ tenant routing, etc.) is a single Helm values flip — no cert dance.
  • Works identically on k3d / k3s / EKS / AKS / GKE / OpenShift / bare-metal because cert-manager + Let's Encrypt is universal.
  • Multi-tenant per-customer sub-domains (<tenant>.<domain>) come for free.

Negative

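  • DNS provider API token required (Cloudflare / Netcup CCP / Route53 / Azure DNS), stored in a K8s Secret with restricted RBAC. Providers without a built-in cert-manager DNS-01 solver (Netcup, for one) additionally need a webhook solver.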
  • One additional operator (cert-manager controller pod, ~70 MB RAM).
  • Initial rollout: ~5-10 minutes for the first wildcard issuance (DNS-01 propagation + ACME challenge); renewals then happen in the background before the old cert expires.

Neutral

  • Wildcard certs cover only one DNS level (*.akko-ai.com matches lab.akko-ai.com but not lab.tenant1.akko-ai.com). For nested multi-tenant DNS, a second wildcard *.tenant1.akko-ai.com is needed (handled by adding a second Certificate resource).

Implementation plan

Sprint 47 V2 (estimated 4 h):

  1. Install cert-manager as a Helm sub-chart dependency (cert-manager/cert-manager v1.16+, in its own cert-manager ns).
    # helm/akko/Chart.yaml
    dependencies:
      - name: cert-manager
        version: 1.16.2
        repository: https://charts.jetstack.io
        condition: cert-manager.enabled
    
  2. DNS provider Secret (operator-provided). Document the four biggest providers + a script for each:
    apiVersion: v1
    kind: Secret
    metadata: { name: dns-provider-token, namespace: cert-manager }
    stringData:
      # one of: cloudflare-token, netcup-customer-id+netcup-key, route53-key, azure-tenant-id
      api-token: "<from-operator>"
    
  3. ClusterIssuer (Let's Encrypt prod with DNS-01 solver):
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata: { name: akko-letsencrypt }
    spec:
      acme:
        email: "{{ .Values.global.acmeEmail }}"
        server: https://acme-v02.api.letsencrypt.org/directory
        privateKeySecretRef: { name: akko-letsencrypt-account }
        solvers:
          - dns01:
              cloudflare:
                apiTokenSecretRef:
                  name: dns-provider-token
                  key: api-token
    
  4. Wildcard Certificate:
    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata: { name: akko-wildcard, namespace: akko }
    spec:
      secretName: akko-wildcard-tls
      dnsNames:
        - "{{ .Values.global.domain }}"
        - "*.{{ .Values.global.domain }}"
      issuerRef:
        name: akko-letsencrypt
        kind: ClusterIssuer
    
  5. Flip every Ingress' tls.secretName from akko-tls to akko-wildcard-tls. One sed pass on helm/akko/charts/*/templates/ingress.yaml + helm/scripts/generate-domain-values.sh.
  6. Validation:
    echo | openssl s_client -connect lab.akko-ai.com:443 -servername lab.akko-ai.com 2>&1 \
      | openssl x509 -noout -ext subjectAltName | grep akko-ai
    # expect: DNS:akko-ai.com, DNS:*.akko-ai.com
    
  7. Re-flip cockpit subdomain refs to functional FQDNs (revert of the Sprint 47 hot-fix 39f7e8b).
  8. Run Playwright suite on the live cluster — every persona × 10 actions green.
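
Before pointing step 3 at the production endpoint, it may be worth rehearsing the DNS-01 flow against Let's Encrypt's staging environment. Its certificates are not publicly trusted, but its rate limits are far more lenient, so repeated test issuances do not eat into the production quota. A minimal sketch, assuming the same Cloudflare token Secret as step 3; the names akko-letsencrypt-staging and akko-letsencrypt-staging-account are hypothetical:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: akko-letsencrypt-staging               # hypothetical name
spec:
  acme:
    email: "{{ .Values.global.acmeEmail }}"
    # Staging directory: untrusted certs, relaxed rate limits.
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: akko-letsencrypt-staging-account   # hypothetical; kept separate from the prod account key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: dns-provider-token
              key: api-token
```

The Certificate from step 4 would reference this issuer in issuerRef.name until the challenge succeeds, then switch back to akko-letsencrypt. While on staging, the validation in step 6 would show the expected SANs but an untrusted issuer.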

BYOS (air-gapped) fallback — Option D

For customers without internet access for ACME challenges:

# values-byos-tls.yaml
cert-manager:
  enabled: false
global:
  tls:
    secretName: akko-wildcard-tls    # operator pre-creates this
                                      # via:
                                      # kubectl -n akko create secret tls akko-wildcard-tls \
                                      #   --cert=path/to/wildcard.pem \
                                      #   --key=path/to/wildcard.key

The chart honours either path: cert-manager-issued OR operator-supplied.
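
One way the chart could honour both paths is to render the wildcard Certificate only when the cert-manager sub-chart is enabled. The sketch below is an assumption about how the template might be written, not the chart's actual contents: the file path is hypothetical, and it assumes global.tls.secretName and cert-manager.enabled are both set in the default values.yaml. Because the cert-manager key contains a hyphen, it has to be read with index rather than dot notation.

```yaml
{{- /*
helm/akko/templates/wildcard-certificate.yaml (hypothetical path).
Rendered only when the cert-manager sub-chart is enabled. In BYOS mode
nothing is rendered and the operator's pre-created Secret is used as-is.
*/ -}}
{{- if index .Values "cert-manager" "enabled" }}
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: akko-wildcard
  namespace: akko
spec:
  # Same Secret name in both modes, so Ingress templates never branch.
  secretName: "{{ .Values.global.tls.secretName }}"
  dnsNames:
    - "{{ .Values.global.domain }}"
    - "*.{{ .Values.global.domain }}"
  issuerRef:
    name: akko-letsencrypt
    kind: ClusterIssuer
{{- end }}
```

With this shape, the Ingress templates read global.tls.secretName in both modes and do not need to know which path produced the Secret.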

Multi-tenant nested wildcards (Sprint 48+ preview)

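When customer onboarding adds per-tenant sub-domains (<tenant>.akko-ai.com), the platform wildcard already covers the tenant host itself. Hosts nested below it (e.g. lab.<tenant>.akko-ai.com) are covered by a second Certificate resource for *.<tenant>.akko-ai.com, without touching the platform-wide one.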
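
A minimal sketch of such a per-tenant Certificate, using tenant1 as the example tenant. The resource name and Secret name are hypothetical; the issuer is the ClusterIssuer from the implementation plan.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: akko-wildcard-tenant1            # hypothetical name
  namespace: akko
spec:
  secretName: akko-wildcard-tenant1-tls  # hypothetical; separate from akko-wildcard-tls
  dnsNames:
    - "*.tenant1.akko-ai.com"            # covers lab.tenant1.akko-ai.com, bi.tenant1.akko-ai.com, ...
  issuerRef:
    name: akko-letsencrypt
    kind: ClusterIssuer
```

Ingresses for that tenant's nested hosts would then reference the tenant Secret instead of akko-wildcard-tls.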

Verification at install time

The akko-init smoke test gets a new check:

kubectl -n akko get secret akko-wildcard-tls -o json | \
  jq -r '.data["tls.crt"]' | base64 -d | openssl x509 -noout -ext subjectAltName | \
  grep -q "DNS:\\*\\.${AKKO_DOMAIN}" && echo OK || exit 1

References

  • Sprint 47 master plan: akko-technical-map/sprints/sprint-47-functional-urls-master-plan.md
  • Sprint 47 hot-fix (revert cockpit subdomains): commit 39f7e8b
  • cert-manager docs: https://cert-manager.io/docs/configuration/acme/dns01/
  • Memory rule: feedback_no_workarounds.md ("STRICT: production-ready fixes only")