ADR-041 — ADEN scope-first architecture (zero hardcoded fallback)

Status: ACCEPTED (2026-04-27, founder directive)
Sprint: 58 (planned, post Sprint 56 D8 close-out)
Related: ADR-038 (OIDC sub-based matching), ADR-039 (no hardcoded identities)

Context

ADEN's current catalog discovery loop is two-tiered:

  1. Primary: search OpenMetadata via the akko-aden-bot OAuth client.
  2. Fallback: when OpenMetadata is empty/unreachable, run a UNION ALL across <cat>.information_schema.tables for every catalog in AKKO_ADEN_FALLBACK_CATALOGS (default tpch,tpcds,postgresql,iceberg).

The fallback path was responsible for the 504 Gateway Time-out the founder hit on 2026-04-27 — Trino spent 60–90 s evaluating the UNION across the benchmark catalogs (tpch and tpcds expose every standard schema family tiny / sf1 / sf10 / sf100 / sf1000), and nginx terminated the request before ADEN could return.

Beyond the performance failure, the fallback violates ADR-039: the catalog list is hardcoded (an env default with vendor-specific names), and the user's authorisation scope is not consulted before scanning. ADEN searches the catalog before knowing what the user has access to, then filters post hoc. That ordering is exactly what a prompt-injection attack exploits.

Founder feedback, 2026-04-27 (translated from French):

"Before going to search the catalog, we need to know what the user has access to, and search within what they have access to, not the other way round. Restrict the catalog at the start of the request instead of searching the catalog first."

"Everything depends on the customer, everything depends on their real data; nothing hardcoded. Whatever our tool is, I don't want the data hardcoded, nor the users, nor the roles, nor the queries."

Decision

Scope-first, catalog-second. ADEN's /query handler reverses the discovery order:

  1. Read the user's authorised scope from OPA first (~50 ms, cached). Output: a list of { catalog, schema, table } triples the user can SELECT from.
  2. Restrict OpenMetadata search to that scope — pass the triples as a filter so OM only returns metadata for tables the user already has access to.
  3. Trino information_schema fallback is removed entirely. If OM is unreachable, ADEN returns a 503 with a clear error pointing at OM health, not a degraded response that scans the cluster's full catalog.
  4. The LLM prompt only includes tables the user can read. SQL generated against tables outside the scope is structurally impossible.

This satisfies three constraints at once:

  • Safety: prompt injection cannot generate SQL on tables the user can't see, because the LLM literally doesn't know they exist.
  • Performance: scope is small (typically 5–50 tables), no cross-catalog UNION, no benchmark-catalog scanning. p95 < 2 s target.
  • Zero hardcoded catalogs: the AKKO_ADEN_FALLBACK_CATALOGS env var is deleted; the scope is derived from OPA at request time, OPA is driven by Keycloak group attributes, and those attributes are in turn driven by the customer's IdP.
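
The four steps of the decision can be sketched as a single handler. This is a minimal illustration, not ADEN's actual code: handle_query, ServiceUnavailable, and the injected fetch_scope / search_metadata callables are hypothetical names, and metadata hits are assumed to be plain "catalog.schema.table" strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopeEntry:
    catalog: str
    schema: str
    table: str

    @property
    def fqn(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"


class ServiceUnavailable(Exception):
    """Stands in for an HTTP 503 response (hypothetical name)."""


def handle_query(user_id, question, fetch_scope, search_metadata):
    # Step 1: resolve the user's authorised scope before touching any catalog.
    try:
        scope = fetch_scope(user_id)
    except ConnectionError as exc:
        raise ServiceUnavailable("OPA unreachable") from exc

    # Step 2: search metadata only inside that scope.
    try:
        hits = search_metadata(question, scope)
    except ConnectionError as exc:
        # Step 3: no Trino fallback; surface the outage instead.
        raise ServiceUnavailable("OpenMetadata unreachable, check OM health") from exc

    # Step 4: only in-scope tables reach the prompt. The re-filter is a
    # defensive assumption in case the search backend ignores the filter.
    allowed = {entry.fqn for entry in scope}
    visible = [fqn for fqn in hits if fqn in allowed]
    context = "\n".join(visible)
    return f"<schema_context>\n{context}\n</schema_context>\n{question}"
```

Injecting the two lookups keeps the ordering explicit and lets a test substitute an unreachable OPA or OM without a network.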

Implementation plan (Sprint 58)

D1 — Hotfix (today, PR open)

  • Set AKKO_ADEN_FALLBACK_CATALOGS="" default in the chart so customer installs no longer scan vendor benchmark catalogs.
  • Live cluster: kubectl set env deploy/akko-akko-aden AKKO_ADEN_FALLBACK_CATALOGS="" applied 2026-04-27 18:14 UTC. Demo unblocked.
  • Persisted in helm/akko/charts/akko-aden/templates/deployment.yaml.

D2 — Scope endpoint in OPA (1 day)

Add an OPA rule data.aden.user_scope[user] returning the list of FQN triples the user can read. Sources:

  • data.group_policies[user_groups] (existing OPA group→tables mapping)
  • data.user_overrides[user] (existing per-user overrides)

Existing OPA already computes column-mask + row-filter per user; this new rule is a sibling that returns the parent scope. Customer sets the mapping via Keycloak group attributes, no chart change.
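
The real rule would be written in OPA's policy language; the Python below only models the intended semantics so the expected output is concrete. It assumes, without confirmation from the source, that both data documents map a key to a list of "catalog.schema.table" strings and that per-user overrides are additive.

```python
def user_scope(user, user_groups, group_policies, user_overrides):
    """Return de-duplicated {catalog, schema, table} triples for one user."""
    fqns = set()
    # Union of every table granted to any of the user's groups.
    for group in user_groups:
        fqns.update(group_policies.get(group, []))
    # Assumption: overrides only add tables; they never revoke.
    fqns.update(user_overrides.get(user, []))

    triples = []
    for fqn in sorted(fqns):
        parts = fqn.split(".")
        if len(parts) != 3:
            continue  # skip malformed entries rather than guess
        catalog, schema, table = parts
        triples.append({"catalog": catalog, "schema": schema, "table": table})
    return triples
```

A user with no groups and no overrides gets an empty list, which downstream code can treat as "no authorised data" rather than "everything".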

D3 — ADEN code refactor (2 days)

  • New _fetch_user_scope(user_id, roles) calls OPA, returns list[ScopeEntry] (FQN triples).
  • Change _search_openmetadata to accept a scope filter and pass it to the OM search query as index=table_search_index&query.bool.filter=....
  • DELETE _trino_catalog_fallback function entirely.
  • DELETE AKKO_ADEN_FALLBACK_CATALOGS and AKKO_ADEN_BENCHMARK_SCALES env vars.
  • LLM prompt builder: only include scope tables in <schema_context>.
  • Failure mode: if OPA unreachable → 503; if OM unreachable → 503; if OM returns 0 hits within scope → 404 "no tables matching in your authorised data".

D4 — Customer onboarding doc (0.5 day)

docs/admin/aden-onboarding.md describing how a customer wires their Keycloak group attributes to OPA scope, with a Climascore example. The customer brings their data, their groups, their permissions — ADEN uses them as-is.

D5 — Regression tests (1 day)

tests/integration/test_aden_scope_first.py:

  • alice (akko-admin) sees all banking tables
  • bob (akko-engineer) sees only the engineering subset
  • carol (akko-analyst) cannot see customer PII columns
  • adversarial: prompt-injected "SELECT FROM secret_table" returns "table not in your scope" without leaking the existence of the table
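
The adversarial case can be pinned down at unit level before the integration test exists. The sketch below is an assumption about how the check might look: check_scope and SCOPE_ERROR are hypothetical names, and extracting the referenced tables from generated SQL is left out. The property it demonstrates is that a forbidden table and a nonexistent table produce the same message.

```python
from typing import Iterable, Optional

SCOPE_ERROR = "table not in your scope"


def check_scope(referenced_tables: Iterable[str], scope: set) -> Optional[str]:
    """Return an error message if any referenced table is outside the scope."""
    for fqn in referenced_tables:
        if fqn not in scope:
            # Same message whether or not the table exists, and the
            # table name is never echoed back to the caller.
            return SCOPE_ERROR
    return None


def test_out_of_scope_table_is_rejected_without_leak():
    scope = {"pg.bank.accounts"}
    forbidden = check_scope(["pg.bank.secret_table"], scope)
    nonexistent = check_scope(["pg.bank.no_such_table"], scope)
    assert forbidden == SCOPE_ERROR
    assert forbidden == nonexistent
    assert "secret_table" not in forbidden


def test_in_scope_table_is_allowed():
    assert check_scope(["pg.bank.accounts"], {"pg.bank.accounts"}) is None
```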

Consequences

Migration

Customers upgrading from < 2026.05 will see the fallback gone. If they relied on it (which they shouldn't have — it was never documented as stable behaviour), they need to populate OpenMetadata before deploying the new ADEN. The akko-init om-ingest Job is the canonical seeding path; it must run successfully on first install.

Side effect: om-ingest must be load-bearing

Before D3 ships, the om-ingest Job's silent-failure pattern (sys.exit(0) on bot token decryption failure) is unacceptable. The Job must fail loudly so install errors surface immediately. Tracked as a separate fix in this same PR session.
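
A minimal sketch of the fail-loudly behaviour, assuming the Job's entry point is a Python script. run_ingest, TokenDecryptionError, and the two injected callables are hypothetical stand-ins for the real om-ingest code.

```python
import sys


class TokenDecryptionError(Exception):
    """Hypothetical error raised when the bot token cannot be decrypted."""


def run_ingest(decrypt_bot_token, ingest) -> int:
    """Return the process exit code: 0 on success, 1 on decryption failure."""
    try:
        token = decrypt_bot_token()
    except TokenDecryptionError as exc:
        # Previously this path ended in sys.exit(0), hiding the failure.
        print(f"om-ingest: bot token decryption failed: {exc}", file=sys.stderr)
        return 1
    # Any exception raised by ingest() propagates, so the process
    # still exits non-zero with a traceback.
    ingest(token)
    return 0


# Entry point would be: sys.exit(run_ingest(decrypt_bot_token, ingest))
```

A non-zero exit code marks the pod as failed, so Kubernetes reports the Job failure instead of a false success.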

Side effect: OPA must publish scope

The OPA configmap currently publishes column_masks and row_filters but not scope. D2 adds it. Customer's existing Keycloak attribute schema doesn't change — OPA just exposes a new derived view.

ADEN bot service account: OAuth secret drift (caught 2026-04-27)

Today's diagnostic also surfaced that the K8s akko-aden-bot-oidc Secret holds a strong random secret while the Keycloak client kept the dev placeholder akko-dev-aden-bot. The keycloak-clients-job is supposed to reconcile via PUT, but it hit a silent failure path. This fix is orthogonal to D3 and lives in keycloak-clients-job.yaml — covered by Sprint 56 D8.3 valuification (PR #141) once that lands.

Rollback

If D3 hits a customer that hasn't seeded OpenMetadata, ADEN returns 503. That's better than a 60 s timeout that masks the data problem. The operator deploys their OM ingestion (their connector, their schemas) and ADEN starts working. The chart no longer carries a band-aid.

Comparison with industry

| Vendor | Catalog discovery | Scope source |
| --- | --- | --- |
| Snowflake Cortex | only over schemas the role has GRANT on | RBAC engine |
| Databricks AI/BI Genie | restricted to allowed catalogs in workspace | Unity Catalog |
| Dremio Sonar AI | uses Dremio's existing user view | privilege grants |
| AKKO post-ADR-041 | scope from OPA (driven by customer IdP groups) | OPA + Keycloak |
| AKKO pre-ADR-041 | scan all catalogs via env-hardcoded list | env var (vendor names) |

References

  • ADR-039 — no hardcoded identities (umbrella directive)
  • ADR-038 — OIDC sub-based user matching
  • feedback_no_hardcoded_users_roles_permissions.md
  • gotcha_om_silent_failure_in_init_job.md (to be added — sys.exit(0) pattern hides real seeding failures)