ADR-049 — AKKO Catalog Sync Daemon

  • Status: Proposed (Sprint 63 — 2026-05)
  • Owner: AKKO Platform / ADEN team
  • Related: ADR-041 (ADEN scope-first), ADR-042 (multi-tier cache), ADR-043 (Milvus vector_catalog), ADR-046 (engine_storage_domain naming)

Context

When an AKKO operator (or an AKKO customer) adds a new dataset, schema, or full catalog (Cloudera Hive, Postgres OLTP, S3 Iceberg, …), the metadata layer is cold:

  • Tables sit in OpenMetadata with empty description and no business tag. Stewards rarely back-fill these manually.
  • The Milvus collection catalog (ADR-043) used by ADEN for NL→SQL schema linking only embeds tables that have a description — meaning brand-new sources are invisible to the semantic search.
  • The verified-queries cache (semantic-models/<domain>.yaml) only contains the queries the team explicitly hand-curated. Real usage patterns (frequent JOINs, filters, projections) are never harvested.
  • Implicit foreign keys between tables are not declared, so cross-source joins suggested by ADEN sometimes mis-link entities.

Result: the recall@5 of the Milvus retrieval step drops sharply for recently onboarded customers. ADEN feels great on the seeded banking demo, mediocre on a fresh prospect's data. Competitors Atlan, DataHub Cloud, Coalesce, and Select Star all ship LLM-based auto-enrichment, but they all do it as cloud SaaS that exfiltrates samples to AWS Bedrock or proprietary LLM endpoints. That is incompatible with the AKKO sovereignty positioning (DORA, NIS2, GDPR-friendly, "never leaves your VPC").

Decision

Ship akko-catalog-sync, an in-cluster Python daemon that performs five enrichment passes against the AKKO data plane:

  1. Discovery — list OM tables with an incremental watermark on lastModifiedTimestamp. Webhook + CronJob + manual API trigger.
  2. Profiling — SELECT * FROM <fqn> TABLESAMPLE BERNOULLI(5) LIMIT 100 via Trino, then compute cardinality / null % / top values per column.
  3. LLM enrichment — short RAG prompt (column schema + 5 anonymised sample rows + upstream lineage tables) sent to LiteLLM → qwen2.5:3b (Ollama in-cluster, not a cloud LLM). Output is one factual ≤20-word description.
  4. PII tagging — vote between OM Presidio NER (already bundled, regex + transformer recognisers) and a zero-shot LLM call. PII flag set when both agree.
  5. Implicit FK discovery — column fingerprint (top-100 value hash) matching across tables; confirmation via EXPLAIN COUNT(*) WHERE NOT IN. Pushed to OM as inferred lineage edges.
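The fingerprint-matching idea in pass 5 can be sketched as follows. This is a minimal illustration, not the daemon's actual API: `column_fingerprint`, `candidate_fks`, and the `profiles` shape are hypothetical names, and the containment confirmation step is only described in the docstring.

```python
import hashlib
from collections import Counter

def column_fingerprint(values, top_n=100):
    """Hash a column's top-N most frequent sampled values into a stable digest.

    Two columns that share the same top-100 value set (e.g. a primary key and
    the foreign key that references it) collide on this digest.
    """
    top = [str(v) for v, _ in Counter(values).most_common(top_n)]
    return hashlib.sha256("\x1f".join(sorted(top)).encode()).hexdigest()

def candidate_fks(profiles):
    """Yield cross-table column pairs whose fingerprints collide.

    `profiles` maps "table.column" -> sampled values from the profiling pass.
    A collision is only a candidate; the daemon still confirms containment
    (child values absent from the parent) before posting an inferred lineage
    edge to OpenMetadata.
    """
    by_digest = {}
    for col, values in profiles.items():
        by_digest.setdefault(column_fingerprint(values), []).append(col)
    for cols in by_digest.values():
        for i, a in enumerate(cols):
            for b in cols[i + 1:]:
                if a.split(".")[0] != b.split(".")[0]:  # skip same-table pairs
                    yield (a, b)
```

Hashing the sorted top values (rather than comparing raw value lists) keeps the matching pass O(columns) instead of O(columns²) on the comparison itself.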

A sixth pass — query log mining — parses query-completed events from the Trino event listener (already enabled in helm/akko/values.yaml) and proposes verified queries to add to the domain semantic-model YAMLs. The daemon never edits these YAMLs in place; it surfaces suggestions in the cockpit "Catalog Sync" page for the steward to accept.
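The JOIN-harvesting step of the query log miner can be illustrated with a regex stand-in. The real pass uses sqlparse; the pattern below only catches the simple `JOIN t ON a = b` shape, and `mine_joins` and the suggestion dict are hypothetical names for illustration.

```python
import re
from collections import Counter

# Naive stand-in for sqlparse-based extraction: table, left operand, right operand.
JOIN_RE = re.compile(
    r"\bJOIN\s+([\w.]+)\s+(?:\w+\s+)?ON\s+([\w.]+)\s*=\s*([\w.]+)",
    re.IGNORECASE,
)

def mine_joins(query_texts, min_count=2):
    """Count JOIN predicates across harvested query texts and keep the
    frequent ones as verified-query suggestions for steward review."""
    seen = Counter()
    for sql in query_texts:
        for table, left, right in JOIN_RE.findall(sql):
            # Normalise operand order so a = b and b = a count as one pattern.
            seen[(table.lower(), tuple(sorted((left.lower(), right.lower()))))] += 1
    return [{"table": t, "on": on, "count": c}
            for (t, on), c in seen.items() if c >= min_count]
```

The `min_count` threshold is what keeps one-off exploratory queries out of the suggestion list.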

Publish path:

  • OM REST PATCH /tables/name/{fqn} for description
  • OM REST PATCH /classifications for tags (PII, domain, tier)
  • OM REST POST /lineage for inferred FK edges
  • LiteLLM POST /embeddings (akko-embed, 768-d nomic-embed-text-v2) → MilvusClient.upsert(catalog, …) reusing the schema declared in docker/aden/vector_catalog.py.
  • Audit row in akko_postgres.catalog_sync.audit_log.
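The description leg of the publish path can be sketched as below. The JSON Patch body follows OM's PATCH semantics, but the exact endpoint path, the `pass` label, and the audit-row column names are assumptions here, not the final wiring; the HTTP client is injected so the flow stays testable without a live OpenMetadata.

```python
import json
from typing import Callable

def publish_description(fqn: str, description: str,
                        patch: Callable[[str, list], int]) -> dict:
    """Publish one generated description, then return the audit row to insert
    into akko_postgres.catalog_sync.audit_log (column names illustrative).

    `patch` abstracts the HTTP client, e.g. a thin wrapper over an
    authenticated PATCH with the application/json-patch+json content type.
    """
    body = [{"op": "add", "path": "/description", "value": description}]
    status = patch(f"/v1/tables/name/{fqn}", body)
    return {
        "table_fqn": fqn,
        "pass": "llm_enrichment",
        "status": status,
        "payload": json.dumps(body, sort_keys=True),
    }
```

Returning the audit row from the same call site makes it hard to publish without leaving a trace, which is the point of the audit table.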

Sovereignty guarantee

The daemon's NetworkPolicy (chart-bundled) only allows egress to:

  • kube-dns (UDP/TCP 53)
  • The same namespace's OpenMetadata (8585), Trino (8080), LiteLLM (4000), Milvus (19530), Postgres (5432).

No public Internet egress is permitted. Anonymisation runs in-cluster before the LLM prompt, with Presidio masking email/IBAN/phone/SSN. The daemon never streams sample row values to any external service.
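The masking contract can be illustrated with a regex stand-in. Production uses Presidio's recognisers (NER plus checksum validation, e.g. IBAN mod-97); the patterns below are deliberately naive and `mask_samples` is a hypothetical name — the point is only that sample cells are rewritten in-cluster before any prompt is built.

```python
import re

# Naive stand-ins for Presidio recognisers; order matters only in that
# already-masked text contains no digits for later patterns to re-match.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "PHONE": re.compile(r"\+?\d[\d .-]{8,}\d"),
}

def mask_samples(rows):
    """Replace matched spans with <TYPE> placeholders in every sampled cell."""
    masked = []
    for row in rows:
        out = {}
        for col, val in row.items():
            text = str(val)
            for label, rx in PATTERNS.items():
                text = rx.sub(f"<{label}>", text)
            out[col] = text
        masked.append(out)
    return masked
```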

To our knowledge, this is the only OSS LLM-based catalog auto-enricher with that guarantee: DataHub Cloud uses AWS Bedrock for "AI Documentation", while Atlan, Coalesce, and Select Star are SaaS by design. AKKO ships strictly on-prem.

Recall hypothesis

Two papers (arXiv 2503.09003 on RAG description generation; arXiv 2502.20657 on NL2SQL with descriptions capped at 20 words) report 80-87% steward acceptance of auto-generated descriptions and a 15-25 percentage-point recall@5 lift in NL2SQL pipelines when the schema is described instead of left bare. The Sprint 63.8 A/B run will validate this on a 50-table golden set across the 6 AKKO demo domains. Decision rule:

  • Recall@5 lift ≥ 15 pp + steward acceptance ≥ 70 % → ship catalogSyncEnabled=true as default for all new clusters.
  • Lift between 5 and 15 pp → keep MVP, prioritise prompt iteration (Sprint 64.1).
  • Lift < 5 pp → reduce scope to PII tagging + Milvus push only; keep the daemon as a lighter automation but drop the description LLM prompt path.
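The decision rule above, encoded as a function. Thresholds come straight from this ADR; the case "lift ≥ 15 pp but acceptance < 70%" is not spelled out in the text, so routing it to prompt iteration is an assumption here.

```python
def ab_decision(recall_lift_pp: float, steward_acceptance: float) -> str:
    """Sprint 63.8 decision rule; lift in percentage points, acceptance in [0, 1]."""
    if recall_lift_pp >= 15 and steward_acceptance >= 0.70:
        return "default-on"       # catalogSyncEnabled=true for all new clusters
    if 5 <= recall_lift_pp < 15:
        return "iterate-prompts"  # keep MVP, prompt iteration in Sprint 64.1
    if recall_lift_pp < 5:
        return "reduce-scope"     # PII tagging + Milvus push only
    # Assumed fallback (not in the ADR): good lift but low steward acceptance.
    return "iterate-prompts"
```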

Alternatives rejected

  • Reuse the existing aden-catalog-indexer CronJob — it already embeds OM tables into Milvus, but it does not generate descriptions. Extending it with an LLM call would mix concerns (sync schedule, embedding, profiling, generation) and break the single-purpose design. The catalog-sync daemon calls the same LiteLLM endpoint for embeddings and pushes to the same collection, but owns the description-generation pass on its own clock.
  • DataHub Actions framework as a drop-in — Apache 2.0, closest fit, but the "AI Documentation" feature path forces samples through AWS Bedrock (Acryl-managed account). Forking and replacing the LLM call with Ollama would be a significant rebuild + carry the heavy DataHub Action runtime that AKKO does not need (no Kafka, no transformer chain).
  • Inline enrichment inside ADEN at query time — too slow per-query (LLM round-trip on every cold question), and would fail the 60-second p95 budget.

Status fields

Implementation phases :

| Phase | Scope | Status | Note |
| --- | --- | --- | --- |
| 63.1 | Service skeleton | partial | Dockerfile, FastAPI app, Helm sub-chart, umbrella wiring |
| 63.2 | Discovery + sampling | partial | OM list + Trino TABLESAMPLE wired in app |
| 63.3 | LLM enrichment | partial | LiteLLM round-trip + ≤20-word truncate |
| 63.4 | FK discovery | TODO | fingerprint logic + OM lineage POST |
| 63.5 | Query log miner | TODO | VictoriaLogs poll + sqlparse JOIN extraction |
| 63.6 | Publish + audit Postgres | partial | OM PATCH + Milvus upsert; audit log table TODO |
| 63.7 | Cockpit Catalog Sync UI | TODO | accept/reject UX |
| 63.8 | A/B golden set | TODO | 50 tables, human-annotated |
| 63.9 | Docs + integration tests | TODO | mkdocs FR/EN + akko-test-all.sh L7 |

Consequences

  • One additional Deployment + optional CronJob in the platform — modest footprint (50m CPU / 256 MiB memory steady-state).
  • Adds a load source on Trino (TABLESAMPLE on every covered table on every full sync). Mitigated by samplePercent=5 default + watermark.
  • Increases LiteLLM request volume — qwen2.5:3b is the cheap description model, fallback to qwen2.5-coder:7b only on low confidence. Operator can flip both off with one Helm value.
  • New Secrets to provision (OM bot token, optional LiteLLM key, Milvus token). Documented in docs/docs/services/catalog-sync.{md,fr.md} (Sprint 63.9).