
ADR-044 — JWT validators split KEYCLOAK_ISSUER from KEYCLOAK_JWKS_URL, anchor on canonical FQDN

  • Status : Accepted
  • Date : 2026-04-29
  • Sprint : 60.8
  • Drivers : founder bug B11 (live demo broken on /api/cockpit/catalog-manager/api/admin/catalogs), Sprint 47 functional FQDN rollout (ADR-036)
  • Related : ADR-036 (functional FQDN), ADR-038 (OIDC user matching by sub), ADR-040 (cockpit-backend service account)

Context

Every AKKO sub-service that validates an end-user JWT (catalog-manager, ai-service, akko-rag, the future cockpit-backend) performs two distinct operations on the bearer token :

  1. Validate that the iss claim matches the issuer URL we trust.
  2. Fetch the JSON Web Key Set (JWKS) to verify the signature.

Historically these two URLs came from a single value: the JWKS URL was derived from KEYCLOAK_ISSUER by appending /protocol/openid-connect/certs. This collapses two distinct concerns into one configuration knob and breaks in two predictable ways :

Failure mode 1 : public vs in-cluster mismatch

The iss claim Keycloak emits depends on KC_HOSTNAME. AKKO's standard deployment (ADR-036) sets it to the canonical public FQDN keycloak.<domain>. When a service runs in-cluster and validates iss = https://keycloak.<domain>/realms/akko, the value is correct, but fetching JWKS from the same URL forces a hairpin DNS hop (pod → public DNS → ingress → Traefik → Keycloak service), which is slow and fragile (NetworkPolicy egress, TLS chain, oauth2-proxy interception).

The natural choice is to fetch JWKS in-cluster (http://akko-akko-keycloak:8080/realms/akko/protocol/openid-connect/certs) but validate iss against the public URL, so the two must be configured independently.
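The two operations from the Context, and the legacy coupling between them, can be sketched as follows. This is a minimal illustration using only the standard library: it decodes the payload without verifying the signature, so it is not a usable validator. The function names are hypothetical and do not come from any AKKO service.

```python
import base64
import json


def unverified_claims(token: str) -> dict:
    """Decode the JWT payload WITHOUT verifying the signature (illustration only)."""
    payload_b64 = token.split(".")[1]
    # base64url payloads drop their padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def issuer_matches(token: str, trusted_issuer: str) -> bool:
    """Operation 1: compare the iss claim with the issuer we trust (exact match)."""
    return unverified_claims(token).get("iss") == trusted_issuer


def legacy_jwks_url(issuer: str) -> str:
    """Operation 2 under the legacy scheme: the JWKS URL is derived from the issuer."""
    return f"{issuer}/protocol/openid-connect/certs"
```

Because legacy_jwks_url takes the issuer as its only input, any change to the issuer (public vs in-cluster) also moves the JWKS fetch, which is the coupling this ADR removes.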

Failure mode 2 : alias vs canonical hostname (B11 founder bug)

Sprint 47 introduced functional FQDN aliases : identity.<domain>, login.<domain>, auth.<domain>. These are routing aliases — the ingress accepts these hostnames and proxies them to the Keycloak service. Convenient for oauth2-proxy --redirect-url, terrible for iss validation.

KC_HOSTNAME is not changed by these aliases; it stays keycloak.<domain>. So Keycloak emits iss = https://keycloak.<domain>/realms/akko regardless of whether the user logged in via identity.<domain> or login.<domain>. A service that auto-derives KEYCLOAK_ISSUER from global.domain with the wrong prefix (e.g. https://identity.<domain>/realms/akko) rejects every legitimate JWT with invalid jwt: Invalid issuer.

Live evidence captured 2026-04-29 :

$ kubectl exec deploy/akko-akko-aden -- python3 -c '
  import urllib.request, json
  print(json.loads(urllib.request.urlopen(
    "http://akko-akko-keycloak:8080/realms/akko/.well-known/openid-configuration"
  ).read())["issuer"])'
https://keycloak.akko-ai.com/realms/akko

The catalog-manager Helm chart was auto-deriving https://identity.akko-ai.com/realms/akko, which produced a 401 on every request.
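The alias mismatch can be reproduced with plain string comparison. The sketch below assumes the issuer check is an exact string match and uses a hypothetical derive_issuer helper; the domain and realm are taken from the live evidence in this section.

```python
REALM_PATH = "/realms/akko"
CANONICAL_PREFIX = "keycloak"  # matches KC_HOSTNAME per ADR-036


def derive_issuer(domain: str, prefix: str = CANONICAL_PREFIX) -> str:
    """Build an issuer URL from a hostname prefix and the deployment domain."""
    return f"https://{prefix}.{domain}{REALM_PATH}"


domain = "akko-ai.com"
emitted_iss = derive_issuer(domain)  # what Keycloak puts in the iss claim

for alias in ("identity", "login", "auth"):
    # An issuer derived from a routing alias never equals the emitted claim,
    # so a strict comparison rejects every token.
    assert derive_issuer(domain, alias) != emitted_iss
```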

Decision

  1. Two distinct env vars in every AKKO service that validates JWTs :

     • KEYCLOAK_ISSUER : the value compared against the JWT iss claim.

     • KEYCLOAK_JWKS_URL : the URL to fetch the public signing keys.

  2. KEYCLOAK_ISSUER resolution priority (Helm template) :

.Values.keycloak.issuer                                  # explicit override (BYO-IdP)
> "https://keycloak.{{ .Values.global.domain }}/realms/akko"  # auto-derive
> "http://{{ .Release.Name }}-akko-keycloak:8080/realms/akko" # k3d / dev fallback

The auto-derive must use keycloak.<domain>, never identity., login., auth., or any alias — these are browser redirect aliases, not the canonical IdP endpoint.

  3. KEYCLOAK_JWKS_URL defaults to in-cluster :
.Values.keycloak.jwksUrl                                                              # explicit override
> "http://{{ .Release.Name }}-akko-keycloak:8080/realms/akko/protocol/openid-connect/certs"  # in-cluster Service

Hairpin DNS for JWKS is a code smell — the in-cluster Service is the right path for server-to-server traffic.
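The two priority chains above can be mirrored in Python to make the ordering explicit. This sketch assumes that ">" means "first non-empty value wins" and that the k3d / dev fallback applies when global.domain is unset. The function names are hypothetical; the real logic lives in the Helm templates.

```python
from typing import Optional

REALM = "akko"
JWKS_PATH = "/protocol/openid-connect/certs"


def resolve_issuer(explicit: Optional[str], domain: Optional[str], release: str) -> str:
    """Mirror the KEYCLOAK_ISSUER chain: override > canonical FQDN > dev fallback."""
    if explicit:
        return explicit  # BYO-IdP override
    if domain:
        # Always the canonical keycloak.<domain>, never a routing alias.
        return f"https://keycloak.{domain}/realms/{REALM}"
    return f"http://{release}-akko-keycloak:8080/realms/{REALM}"


def resolve_jwks_url(explicit: Optional[str], release: str) -> str:
    """Mirror the KEYCLOAK_JWKS_URL chain: override > in-cluster Service."""
    if explicit:
        return explicit
    return f"http://{release}-akko-keycloak:8080/realms/{REALM}{JWKS_PATH}"
```

Note that resolve_jwks_url never looks at the domain: the JWKS default is independent of whatever the issuer resolves to.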

  4. Source code (Python example) :
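import os

# Value compared against the JWT iss claim.
KC_ISSUER = os.environ.get("KEYCLOAK_ISSUER", "<dev fallback>")
# URL used to fetch the signing keys; derived from the issuer when unset.
KC_JWKS_URL = os.environ.get(
    "KEYCLOAK_JWKS_URL",
    f"{KC_ISSUER}/protocol/openid-connect/certs",  # backwards-compat
)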

The KEYCLOAK_JWKS_URL env var falls back to deriving from KEYCLOAK_ISSUER when unset, so existing dev / k3d images keep working without a chart upgrade. Production deployments override both.
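To see the backwards-compatible fallback in isolation, the same lookup can be wrapped in a function that takes the environment as a mapping. This is a sketch: keycloak_settings and the DEV_ISSUER value are hypothetical stand-ins, not names from the service code.

```python
from typing import Mapping, Tuple

DEV_ISSUER = "http://localhost:8080/realms/akko"  # hypothetical dev fallback


def keycloak_settings(env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (issuer, jwks_url), deriving the JWKS URL only when it is unset."""
    issuer = env.get("KEYCLOAK_ISSUER", DEV_ISSUER)
    jwks_url = env.get(
        "KEYCLOAK_JWKS_URL",
        f"{issuer}/protocol/openid-connect/certs",  # backwards-compat
    )
    return issuer, jwks_url


# Legacy image: only the issuer is set, so the JWKS URL follows it.
legacy = keycloak_settings(
    {"KEYCLOAK_ISSUER": "https://keycloak.akko-ai.com/realms/akko"}
)

# Production: both are set, so the JWKS fetch stays in-cluster.
production = keycloak_settings({
    "KEYCLOAK_ISSUER": "https://keycloak.akko-ai.com/realms/akko",
    "KEYCLOAK_JWKS_URL": "http://akko-akko-keycloak:8080/realms/akko/protocol/openid-connect/certs",
})
```

In a service, the function would be called with os.environ; passing a plain dict keeps the behavior easy to test.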

  5. Customer override pattern (BYO-IdP, for customers who run their own AD / Okta / Azure AD instead of the bundled Keycloak) :
# values-<customer>.yaml
akko-catalog-manager:
  keycloak:
    issuer: "https://idp.bigcorp.com/realms/bigcorp"  # customer IdP
    jwksUrl: "https://idp.bigcorp.com/realms/bigcorp/protocol/openid-connect/certs"

In BYO-IdP the issuer and JWKS URL are both public (the customer's IdP is not in-cluster), so we drop the in-cluster shortcut.

Consequences

Positive

  • JWT validation works regardless of which alias FQDN the user logged in through: the iss claim is anchored on the canonical Keycloak hostname, not on the routing alias the browser hit.
  • Server-to-server JWKS fetch stays in-cluster — lower latency, no hairpin DNS, no TLS handshake, no NetworkPolicy egress to the public ingress.
  • A customer install sets keycloak.issuer and keycloak.jwksUrl to their IdP URLs, with no code change and no secret juggling.

Negative

  • One extra env var per service (KEYCLOAK_JWKS_URL). Documented in the service template + values.yaml comment block.
  • Operators migrating from old releases must run a helm upgrade -f values-<env>.yaml (not --reuse-values) to pick up the new env var — see gotcha helm_reuse_values_drops_overlays.

Affected services

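| Service         | Status  | Sprint                               |
|-----------------|---------|--------------------------------------|
| catalog-manager | ✅ Done | 60.8 (commits cbabbd9, fa0b7dc)      |
| ai-service      | TODO    | 60.10                                |
| akko-rag        | TODO    | 60.10                                |
| cockpit-backend | TODO    | 60.10 (already designed per ADR-040) |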

Validation

Baseline gates in tests/integration/test_security_baseline.py :

  • test_catalog_manager_issuer_auto_derives_from_global_domain : helm template ... global.domain=acme.example.com must produce KEYCLOAK_ISSUER="https://keycloak.acme.example.com/realms/akko".
  • test_catalog_manager_jwks_stays_in_cluster — JWKS URL must contain the in-cluster Service name, never leak the public hostname.
  • test_catalog_manager_chart_default_verify_jwt_true — no insecure verifyJwt: false in chart-by-default.
  • test_catalog_manager_main_py_splits_issuer_and_jwks — source code reads both env vars independently (Sprint 39.5 anti-regression).

Live verification (Playwright) :

  • tests/playwright/tests/diag-catalog-manager-401.spec.ts : POSTs to /api/cockpit/catalog-manager/api/admin/catalogs, asserts that the response status is not 401 and that the body does not contain Invalid issuer.

References

  • ADR-036 : functional FQDN canonical naming (keycloak. for IdP)
  • gotcha keycloak_iss_canonical_hostname (memory) : the always-the-canonical-hostname rule
  • gotcha helm_reuse_values_drops_overlays (memory) : why the fix didn't propagate during the rollback cycle 2026-04-29
  • gotcha volumemount_override_masks_image (memory) : why the rebuild seemed to do nothing — the pod had a kubectl-patched volumeMount overriding /app/app/main.py
  • Founder directive 2026-04-29 : "no workaround, prod-ready, no hardcoding"