ADR-044: JWT validators split KEYCLOAK_ISSUER from KEYCLOAK_JWKS_URL, anchor on canonical FQDN
- Status : Accepted
- Date : 2026-04-29
- Sprint : 60.8
- Drivers : founder bug B11 (live demo broken on /api/cockpit/catalog-manager/api/admin/catalogs), Sprint 47 functional FQDN rollout (ADR-036)
- Related : ADR-036 (functional FQDN), ADR-038 (OIDC user matching by sub), ADR-040 (cockpit-backend service account)
Context
Every AKKO sub-service that validates an end-user JWT (catalog-manager,
ai-service, akko-rag, the future cockpit-backend) does two distinct
operations on the bearer token :
- Validate that the iss claim matches the issuer URL we trust.
- Fetch the JSON Web Key Set (JWKS) to verify the signature.
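The first of these two operations can be sketched with the standard library alone. This is a minimal illustration, not the services' actual code: the helper names unverified_claims and issuer_matches are hypothetical, and the sketch deliberately skips signature verification, which a real validator must perform with the keys from the JWKS before trusting any claim.

```python
import base64
import json


def unverified_claims(token: str) -> dict:
    """Decode the payload segment of a compact JWT WITHOUT verifying it.

    Assumption: illustration and diagnostics only. A real validator
    verifies the signature against the JWKS before reading claims.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated segments")
    payload = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def issuer_matches(token: str, trusted_issuer: str) -> bool:
    """Operation 1: exact, case-sensitive comparison of the iss claim."""
    return unverified_claims(token).get("iss") == trusted_issuer
```

Nothing in this check needs a network call. Only the second operation, the JWKS fetch, depends on a reachable URL, which is why the two can be configured separately.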
Historically both URLs came from a single value: the JWKS URL was derived
from KEYCLOAK_ISSUER by appending /protocol/openid-connect/certs. This collapses
two distinct concerns into one configuration knob and breaks in two
predictable ways :
Failure mode 1 : public vs in-cluster mismatch
The iss claim Keycloak emits depends on KC_HOSTNAME. AKKO's standard
deployment (ADR-036) sets it to the canonical public FQDN
keycloak.<domain>. When a service runs in-cluster and tries to validate
iss = https://keycloak.<domain>/realms/akko, the value is correct,
but the fetch of JWKS at the same URL forces a hairpin DNS hop (pod →
public DNS → ingress → traefik → keycloak service) which is slow and
fragile (NetworkPolicy egress, TLS chain, oauth2-proxy interception).
The natural choice is to fetch JWKS in-cluster
(http://akko-akko-keycloak:8080/realms/akko/protocol/openid-connect/certs)
but validate iss as the public URL, so the two must be configured independently.
Failure mode 2 : alias vs canonical hostname (B11 founder bug)
Sprint 47 introduced functional FQDN aliases : identity.<domain>,
login.<domain>, auth.<domain>. These are routing aliases — the
ingress accepts these hostnames and proxies them to the Keycloak service.
Convenient for oauth2-proxy --redirect-url, terrible for iss
validation.
KC_HOSTNAME is not changed by these aliases: it stays
keycloak.<domain>. So Keycloak emits
iss = https://keycloak.<domain>/realms/akko regardless of whether the user
logged in via the identity. or login. alias. A service that auto-derives
KEYCLOAK_ISSUER from global.domain with the wrong prefix
(e.g. https://identity.<domain>/realms/akko) rejects every legitimate JWT
with invalid jwt: Invalid issuer.
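The failure can be made concrete with a small sketch of the comparison a validator performs. The domain example.test and the helper derive_issuer are hypothetical. The comparison is an exact string match, so an issuer derived from an alias prefix never equals the value Keycloak emits.

```python
def derive_issuer(prefix: str, domain: str, realm: str = "akko") -> str:
    """Build an issuer URL from a hostname prefix (hypothetical helper)."""
    return f"https://{prefix}.{domain}/realms/{realm}"


DOMAIN = "example.test"  # placeholder domain, not a real deployment

# What Keycloak puts in the token: always built from KC_HOSTNAME.
emitted_iss = derive_issuer("keycloak", DOMAIN)

for prefix in ("keycloak", "identity", "login", "auth"):
    expected = derive_issuer(prefix, DOMAIN)
    verdict = "accepted" if expected == emitted_iss else "rejected (Invalid issuer)"
    print(f"{expected}: {verdict}")
```

Only the keycloak prefix yields a match. The three aliases still route to the same Keycloak service, but they never appear in the iss claim.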
Live evidence captured 2026-04-29 :
$ kubectl exec deploy/akko-akko-aden -- python3 -c '
import urllib.request, json
print(json.loads(urllib.request.urlopen(
"http://akko-akko-keycloak:8080/realms/akko/.well-known/openid-configuration"
).read())["issuer"])'
https://keycloak.akko-ai.com/realms/akko
The catalog-manager Helm chart was auto-deriving
https://identity.akko-ai.com/realms/akko, which produced a 401 on every request.
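The manual check above could also run as a startup self-check. The sketch below is an assumption for illustration, not something this ADR mandates: the function names are hypothetical, and the fetch parameter exists only so the comparison can be exercised without a live Keycloak.

```python
import json
import urllib.request
from typing import Callable


def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """Fetch and parse a JSON document over HTTP(S)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read())


def check_issuer(
    configured_issuer: str,
    discovery_url: str,
    fetch: Callable[[str], dict] = fetch_json,
) -> None:
    """Raise if the configured issuer differs from the one the IdP advertises."""
    advertised = fetch(discovery_url)["issuer"]
    if advertised != configured_issuer:
        raise RuntimeError(
            f"KEYCLOAK_ISSUER={configured_issuer!r} but the IdP advertises "
            f"{advertised!r}; every token would fail the iss check"
        )
```

Run against the in-cluster discovery URL, this would have flagged the identity. prefix at pod start instead of on the first user request.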
Decision
- Two distinct env vars in every AKKO service that validates JWT :
  - KEYCLOAK_ISSUER: the value compared against the JWT iss claim.
  - KEYCLOAK_JWKS_URL: the URL to fetch the public signing keys.
- KEYCLOAK_ISSUER resolution priority (Helm template):
.Values.keycloak.issuer # explicit override (BYO-IdP)
> "https://keycloak.{{ .Values.global.domain }}/realms/akko" # auto-derive
> "http://{{ .Release.Name }}-akko-keycloak:8080/realms/akko" # k3d / dev fallback
The auto-derive must use keycloak.<domain>, never identity.,
login., auth., or any alias — these are browser redirect aliases,
not the canonical IdP endpoint.
- KEYCLOAK_JWKS_URL defaults to in-cluster :
.Values.keycloak.jwksUrl # explicit override
> "http://{{ .Release.Name }}-akko-keycloak:8080/realms/akko/protocol/openid-connect/certs" # in-cluster Service
Hairpin DNS for JWKS is a code smell — the in-cluster Service is the right path for server-to-server traffic.
- Source code (Python example) :
import os

KC_ISSUER = os.environ.get("KEYCLOAK_ISSUER", "<dev fallback>")
KC_JWKS_URL = os.environ.get(
    "KEYCLOAK_JWKS_URL",
    # Backwards-compat: derive from the issuer when the dedicated var is unset.
    f"{KC_ISSUER}/protocol/openid-connect/certs",
)
The KEYCLOAK_JWKS_URL env var falls back to deriving from
KEYCLOAK_ISSUER when unset, so existing dev / k3d images keep
working without a chart upgrade. Production deployments override
both.
- Customer override pattern (BYO-IdP — when the customer has their own AD / Okta / Azure AD instead of the bundled Keycloak) :
# values-<customer>.yaml
akko-catalog-manager:
  keycloak:
    issuer: "https://idp.bigcorp.com/realms/bigcorp" # customer IdP
    jwksUrl: "https://idp.bigcorp.com/realms/bigcorp/protocol/openid-connect/certs"
In BYO-IdP the issuer and JWKS URL are both public (the customer's IdP is not in-cluster), so we drop the in-cluster shortcut.
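The resolution priorities in this Decision can be restated in Python to make the ordering easy to check. This is a sketch of the rules as written here, not the chart's actual template. The function names are hypothetical, and the release name akko is an assumption taken from the akko-akko-keycloak Service name used elsewhere in this ADR.

```python
from typing import Optional

REALM_PATH = "/realms/akko"
CERTS_PATH = "/protocol/openid-connect/certs"


def resolve_issuer(explicit: Optional[str], domain: Optional[str], release: str) -> str:
    """KEYCLOAK_ISSUER: explicit override > canonical FQDN > k3d / dev fallback."""
    if explicit:
        return explicit
    if domain:
        # Always the canonical keycloak.<domain>, never an alias prefix.
        return f"https://keycloak.{domain}{REALM_PATH}"
    return f"http://{release}-akko-keycloak:8080{REALM_PATH}"


def resolve_jwks_url(explicit: Optional[str], release: str) -> str:
    """KEYCLOAK_JWKS_URL: explicit override > in-cluster Service."""
    if explicit:
        return explicit
    return f"http://{release}-akko-keycloak:8080{REALM_PATH}{CERTS_PATH}"


# Bundled Keycloak: public issuer, in-cluster JWKS.
print(resolve_issuer(None, "acme.example.com", "akko"))
print(resolve_jwks_url(None, "akko"))

# BYO-IdP: both values come from the customer's overrides.
idp = "https://idp.bigcorp.com/realms/bigcorp"
print(resolve_issuer(idp, "acme.example.com", "akko"))
print(resolve_jwks_url(idp + CERTS_PATH, "akko"))
```

The two functions share no inputs beyond the release name, which is the point of the split: overriding one value never changes the other.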
Consequences
Positive
- JWT validation works whichever alias FQDN the user logged in through: the iss claim is anchored on the canonical Keycloak hostname, not on the routing alias the browser hit.
- Server-to-server JWKS fetch stays in-cluster: lower latency, no hairpin DNS, no TLS handshake, no NetworkPolicy egress to the public ingress.
- Customer install just sets keycloak.issuer (and keycloak.jwksUrl) to their IdP URLs: no code change, no secret juggling.
Negative
- One extra env var per service (KEYCLOAK_JWKS_URL). Documented in the service template + values.yaml comment block.
- Operators migrating from old releases must run helm upgrade -f values-<env>.yaml (not --reuse-values) to pick up the new env var; see gotcha helm_reuse_values_drops_overlays.
Affected services
| Service | Status | Sprint |
|---|---|---|
| catalog-manager | ✅ Done | 60.8 (commits cbabbd9, fa0b7dc) |
| ai-service | TODO | 60.10 |
| akko-rag | TODO | 60.10 |
| cockpit-backend | TODO | 60.10 (already designed per ADR-040) |
Validation
Baseline gates in tests/integration/test_security_baseline.py :
- test_catalog_manager_issuer_auto_derives_from_global_domain: helm template ... global.domain=acme.example.com must produce KEYCLOAK_ISSUER="https://keycloak.acme.example.com/realms/akko".
- test_catalog_manager_jwks_stays_in_cluster: JWKS URL must contain the in-cluster Service name, never leak the public hostname.
- test_catalog_manager_chart_default_verify_jwt_true: no insecure verifyJwt: false in the chart defaults.
- test_catalog_manager_main_py_splits_issuer_and_jwks: source code reads both env vars independently (Sprint 39.5 anti-regression).
Live verification (Playwright) :
- tests/playwright/tests/diag-catalog-manager-401.spec.ts: POSTs to /api/cockpit/catalog-manager/api/admin/catalogs, asserts the response is not a 401 and that the body does not contain Invalid issuer.
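For readers without access to the test suite, the "JWKS stays in cluster" gate could be expressed roughly as follows. This is a hypothetical sketch, not the content of test_security_baseline.py: the function name, the env mapping, and the release name akko are all assumptions for illustration.

```python
from urllib.parse import urlsplit


def jwks_stays_in_cluster(env: dict, release: str, public_domain: str) -> bool:
    """Hypothetical gate: the JWKS host is the in-cluster Service, not a public name."""
    host = urlsplit(env["KEYCLOAK_JWKS_URL"]).hostname or ""
    is_service = host == f"{release}-akko-keycloak"
    is_public = host == public_domain or host.endswith("." + public_domain)
    return is_service and not is_public


# Env vars as a rendered chart might expose them (illustrative values).
rendered = {
    "KEYCLOAK_ISSUER": "https://keycloak.acme.example.com/realms/akko",
    "KEYCLOAK_JWKS_URL": "http://akko-akko-keycloak:8080/realms/akko/protocol/openid-connect/certs",
}
print(jwks_stays_in_cluster(rendered, "akko", "acme.example.com"))
```

Comparing parsed hostnames rather than raw substrings avoids false positives when a public hostname happens to contain the Service name.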
References
- ADR-036 : functional FQDN canonical naming (keycloak. for IdP)
- gotcha keycloak_iss_canonical_hostname (memory) : the always-the-canonical-hostname rule
- gotcha helm_reuse_values_drops_overlays (memory) : why the fix didn't propagate during the rollback cycle 2026-04-29
- gotcha volumemount_override_masks_image (memory) : why the rebuild seemed to do nothing (the pod had a kubectl-patched volumeMount overriding /app/app/main.py)
- Founder directive 2026-04-29 : "no workaround, prod-ready, no hardcoding"