# ADR-049 — AKKO Catalog Sync Daemon
- Status: Proposed (Sprint 63 — 2026-05)
- Owner: AKKO Platform / ADEN team
- Related: ADR-041 (ADEN scope-first), ADR-042 (multi-tier cache), ADR-043 (Milvus vector_catalog), ADR-046 (engine_storage_domain naming)
## Context
When an AKKO operator (or an AKKO customer) adds a new dataset, schema, or full catalog (Cloudera Hive, Postgres OLTP, S3 Iceberg, …) the metadata layer is cold:

- Tables sit in OpenMetadata with an empty `description` and no business tags. Stewards rarely back-fill these manually.
- The Milvus collection `catalog` (ADR-043), used by ADEN for NL→SQL schema linking, only embeds tables that have a description — meaning brand-new sources are invisible to the semantic search.
- The verified-queries cache (`semantic-models/<domain>.yaml`) only contains the queries the team explicitly hand-curated. Real usage patterns (frequent JOINs, filters, projections) are never harvested.
- Implicit foreign keys between tables are not declared, so cross-source joins suggested by ADEN sometimes mis-link entities.
Result: the recall@5 of the Milvus retrieval step drops sharply for recently onboarded customers. ADEN feels great on the seeded banking demo, mediocre on a fresh prospect's data. Meanwhile, competitors Atlan, DataHub Cloud, Coalesce, and Select Star all ship LLM-based auto-enrichment — but they all do it as cloud SaaS that exfiltrates samples to AWS Bedrock or proprietary LLM endpoints, incompatible with the AKKO sovereignty positioning (DORA, NIS2, GDPR-friendly, "never leaves your VPC").
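For clarity, the recall@5 number tracked here can be computed as below. The helper and the toy data are illustrative, not taken from the codebase:

```python
def recall_at_k(retrieved: list[list[str]], gold: list[str], k: int = 5) -> float:
    """Fraction of questions whose gold table appears in the top-k retrieved tables."""
    hits = sum(1 for ranked, answer in zip(retrieved, gold) if answer in ranked[:k])
    return hits / len(gold)

# Toy example: 2 of 3 questions surface their gold table in the top 5.
retrieved = [
    ["sales.orders", "sales.items", "crm.leads", "hr.staff", "fin.gl"],
    ["crm.accounts", "crm.contacts", "sales.orders", "fin.ap", "fin.ar"],
    ["hr.staff", "hr.payroll", "fin.gl", "fin.ap", "fin.ar"],
]
gold = ["sales.orders", "crm.contacts", "crm.leads"]
print(round(recall_at_k(retrieved, gold), 2))  # → 0.67
```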
## Decision
Ship `akko-catalog-sync`, an in-cluster Python daemon that performs five enrichment passes against the AKKO data plane:

- Discovery — lists OM tables with an incremental watermark on `lastModifiedTimestamp`. Triggered by webhook, CronJob, or manual API call.
- Profiling — `SELECT * FROM <fqn> TABLESAMPLE BERNOULLI(5) LIMIT 100` via Trino, then computes cardinality, null %, and top values per column.
- LLM enrichment — a short RAG prompt (column schema + 5 anonymised sample rows + upstream lineage tables) sent to LiteLLM → `qwen2.5:3b` (Ollama in-cluster, not a cloud LLM). Output is one factual ≤20-word description.
- PII tagging — a vote between OM Presidio NER (already bundled, regex + transformer recognisers) and a zero-shot LLM call. The PII flag is set only when both agree.
- Implicit FK discovery — column fingerprint (top-100 value hash) matching across tables; confirmation via `EXPLAIN COUNT(*) WHERE NOT IN`. Pushed to OM as inferred lineage edges.
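The fingerprint step of the FK-discovery pass can be sketched as follows. The function names and the 0.9 containment threshold are assumptions for illustration, and the confirming `EXPLAIN` anti-join probe is not shown:

```python
import hashlib

def column_fingerprint(top_values: list[str]) -> set[str]:
    """Hash each of the (up to) top-100 distinct values of a column."""
    return {hashlib.sha256(v.encode()).hexdigest()[:16] for v in top_values[:100]}

def fk_candidate(child: list[str], parent: list[str], threshold: float = 0.9) -> bool:
    """Candidate FK when most of the child's top values also occur in the parent.

    Containment rather than Jaccard: a parent key column is usually much
    larger than the slice of it referenced by any one child table.
    """
    child_fp, parent_fp = column_fingerprint(child), column_fingerprint(parent)
    if not child_fp:
        return False
    return len(child_fp & parent_fp) / len(child_fp) >= threshold

orders_customer_id = ["c1", "c2", "c3", "c2", "c1"]
customers_id = ["c1", "c2", "c3", "c4", "c5"]
print(fk_candidate(orders_customer_id, customers_id))  # → True
```

Hashing the values (rather than shipping them around) keeps raw samples out of the matching index, consistent with the sovereignty stance below.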
A sixth pass — query log mining — parses query-completed events from the Trino event listener (already enabled in `helm/akko/values.yaml`) and proposes verified queries to add to the domain semantic-model YAMLs. The daemon never edits these YAMLs in place; it surfaces suggestions in the cockpit "Catalog Sync" page for the steward to accept.
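The shape of that harvest can be sketched with a simplified regex in place of the planned sqlparse-based extraction (phase 63.5); `mine_join_pairs` and the pattern are illustrative only:

```python
import re
from collections import Counter

# Simplified stand-in: the real miner parses statements with sqlparse, but the
# harvest boils down to counting which table pairs are joined in completed queries.
JOIN_RE = re.compile(r"\bFROM\s+([\w.]+)(?:\s+\w+)?\s+JOIN\s+([\w.]+)", re.IGNORECASE)

def mine_join_pairs(queries: list[str]) -> Counter:
    """Count (table, table) pairs seen in FROM ... JOIN ... clauses."""
    pairs: Counter = Counter()
    for sql in queries:
        for left, right in JOIN_RE.findall(sql):
            pairs[tuple(sorted((left.lower(), right.lower())))] += 1
    return pairs

logs = [
    "SELECT * FROM sales.orders o JOIN crm.customers c ON o.cid = c.id",
    "SELECT c.name FROM sales.orders JOIN crm.customers ON cid = id",
]
print(mine_join_pairs(logs).most_common(1))  # → [(('crm.customers', 'sales.orders'), 2)]
```

Pairs above a frequency threshold would become the suggested verified queries shown in the cockpit.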
Publish path:

- OM REST `PATCH /tables/name/{fqn}` for descriptions
- OM REST `PATCH /classifications` for tags (PII, domain, tier)
- OM REST `POST /lineage` for inferred FK edges
- LiteLLM `POST /embeddings` (akko-embed, 768-d nomic-embed-text-v2) → `MilvusClient.upsert(catalog, …)`, reusing the schema declared in `docker/aden/vector_catalog.py`
- Audit row in `akko_postgres.catalog_sync.audit_log`
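The description publish step can be sketched as a JSON Patch request against the OM endpoint named above. `OM_URL`, the bearer-token plumbing, and the JSON Patch body shape are deployment-specific assumptions, not verified against the OM version in use:

```python
import json
import urllib.request

OM_URL = "http://openmetadata:8585/api/v1"  # assumed in-namespace service address
OM_BOT_TOKEN = "<bot-token>"                # provisioned via Secret (see Consequences)

def patch_description(fqn: str, description: str) -> urllib.request.Request:
    """Build the JSON Patch request that sets a table's description in OM."""
    body = json.dumps([{"op": "add", "path": "/description", "value": description}])
    return urllib.request.Request(
        f"{OM_URL}/tables/name/{fqn}",
        data=body.encode(),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {OM_BOT_TOKEN}",
            "Content-Type": "application/json-patch+json",
        },
    )

req = patch_description("trino.sales.public.orders", "Customer orders, one row per order.")
print(req.get_method(), req.full_url)
```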
## Sovereignty guarantee
The daemon's NetworkPolicy (chart-bundled) only allows egress to:

- `kube-dns` (UDP/TCP 53)
- the same namespace's OpenMetadata (8585), Trino (8080), LiteLLM (4000), Milvus (19530), and Postgres (5432)

No public Internet egress is permitted. Anonymisation runs in-cluster before the LLM prompt, with Presidio masking email/IBAN/phone/SSN values. The daemon never streams sample row values to any external service.
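The in-cluster masking step can be illustrated with a simplified regex stand-in. The production pass uses Presidio's recognisers (regexes plus transformer NER); these three patterns only show the shape of the transformation applied to sample rows before they reach the prompt:

```python
import re

# Simplified stand-in for the Presidio masking pass; patterns are illustrative.
# Order matters: IBAN must be masked before the looser phone pattern runs.
PATTERNS = {
    "<EMAIL>": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "<IBAN>": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "<PHONE>": re.compile(r"\+?\d[\d .-]{7,}\d"),
}

def mask_sample(value: str) -> str:
    """Replace PII-looking substrings with placeholder tags before prompting."""
    for tag, pattern in PATTERNS.items():
        value = pattern.sub(tag, value)
    return value

print(mask_sample("contact: jane.doe@example.com, FR7630006000011234567890189"))
# → contact: <EMAIL>, <IBAN>
```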
To our knowledge, this is the only OSS LLM-based catalog auto-enricher with that guarantee. DataHub Cloud uses AWS Bedrock for "AI Documentation", while Atlan, Coalesce, and Select Star are SaaS by design. AKKO ships strictly on-prem.
## Recall hypothesis
Two papers (arXiv 2503.09003 on RAG description generation; arXiv 2502.20657 on NL2SQL with descriptions capped at 20 words) report 80-87% steward acceptance of auto-generated descriptions and a 15-25 percentage-point recall@5 lift in NL2SQL pipelines when the schema is described instead of left bare. The Sprint 63.8 A/B run will validate this on a 50-table golden set across the 6 AKKO demo domains. Decision rule:
- Recall@5 lift ≥ 15 pp and steward acceptance ≥ 70% → ship `catalogSyncEnabled=true` as the default for all new clusters.
- Lift between 5 and 15 pp → keep the MVP, prioritise prompt iteration (Sprint 64.1).
- Lift < 5 pp → reduce scope to PII tagging + Milvus push only; keep the daemon as a lighter automation but drop the description LLM prompt path.
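Encoded as a helper, the rule reads as below. The function name is illustrative, and the final branch covers a combination the rule above leaves unstated (high lift but low acceptance), which is an assumption here:

```python
def sprint_63_8_decision(lift_pp: float, acceptance: float) -> str:
    """Map the A/B outcome (lift in percentage points, acceptance in [0, 1])
    to the decision rule above."""
    if lift_pp >= 15 and acceptance >= 0.70:
        return "ship: catalogSyncEnabled=true by default"
    if 5 <= lift_pp < 15:
        return "keep MVP, iterate prompts (Sprint 64.1)"
    if lift_pp < 5:
        return "reduce scope: PII tagging + Milvus push only"
    # Unstated in the ADR: lift >= 15 pp but acceptance < 70% — assumed to hold.
    return "hold: review steward feedback before shipping"

print(sprint_63_8_decision(18.0, 0.82))  # → ship: catalogSyncEnabled=true by default
```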
## Alternatives rejected
- Reuse the existing `aden-catalog-indexer` CronJob — it already embeds OM tables into Milvus, but it does not generate descriptions. Extending it with an LLM call would mix concerns (sync schedule, embedding, profiling, generation) and break its single-purpose design. The catalog-sync daemon calls the same LiteLLM endpoint for embeddings and pushes to the same collection, but owns the description-generation pass on its own clock.
- DataHub Actions framework as a drop-in — Apache 2.0, closest fit, but the "AI Documentation" feature path forces samples through AWS Bedrock (Acryl-managed account). Forking and replacing the LLM call with Ollama would be a significant rebuild, and would carry the heavy DataHub Actions runtime that AKKO does not need (no Kafka, no transformer chain).
- Inline enrichment inside ADEN at query time — too slow per query (an LLM round-trip on every cold question), and would fail the 60-second p95 budget.
## Status fields
Implementation phases :
| Phase | Done | Note |
|---|---|---|
| 63.1 Service skeleton | partial | Dockerfile, FastAPI app, Helm sub-chart, umbrella wiring |
| 63.2 Discovery + sampling | partial | OM list + Trino TABLESAMPLE wired in app |
| 63.3 LLM enrichment | partial | LiteLLM round-trip + ≤20-word truncate |
| 63.4 FK discovery | TODO | fingerprint logic + OM lineage POST |
| 63.5 Query log miner | TODO | VictoriaLogs poll + sqlparse JOIN extraction |
| 63.6 Publish + audit Postgres | partial | OM PATCH + Milvus upsert ; audit log table TODO |
| 63.7 Cockpit Catalog Sync UI | TODO | accept/reject UX |
| 63.8 A/B golden set | TODO | 50 human-annotated tables |
| 63.9 Docs + integration tests | TODO | mkdocs FR/EN + akko-test-all.sh L7 |
## Consequences
- One additional Deployment + optional CronJob in the platform — modest footprint (50m CPU / 256 MiB memory steady-state).
- Adds a load source on Trino (TABLESAMPLE on every covered table on every full sync). Mitigated by the `samplePercent=5` default + the watermark.
- Increases LiteLLM request volume — `qwen2.5:3b` is the cheap description model, with fallback to `qwen2.5-coder:7b` only on low confidence. The operator can flip both off with one Helm value.
- New Secrets to provision (OM bot token, optional LiteLLM key, Milvus token). Documented in `docs/docs/services/catalog-sync.{md,fr.md}` (Sprint 63.9).