akko-catalog-sync — auto-enrich catalog + semantic layer¶
akko-catalog-sync is the AKKO answer to the catalog cold-start
problem : when an operator adds a new dataset, schema or full catalog,
OpenMetadata stays empty (no description, no tags, no FK), the Milvus
collection used by ADEN cannot retrieve it via semantic search, and the
NL→SQL recall@5 of the new source drops sharply.
The daemon walks the OM table inventory, samples 5 % of each table via
Trino TABLESAMPLE BERNOULLI, anonymises the rows (Presidio masking),
sends a short RAG prompt to a local Ollama model (qwen2.5:3b via
the LiteLLM gateway), and writes the result back to OM + Milvus. The
sample data never leaves the cluster.
Sprint 63 / ADR-049 — ship skeleton, MVP path
Phase 63.1 (this release) ships the service skeleton, the OM PATCH and Milvus upsert paths, and an off-by-default toggle. Phases 63.4–63.7 add foreign-key discovery, query-log mining and a cockpit accept/reject UI. See ADR-049.
Architecture¶
flowchart LR
subgraph Triggers
CRON[CronJob 6h]
HOOK[POST /sync<br/>webhook]
CLI[Manual --table=fqn]
end
subgraph Daemon["akko-catalog-sync (FastAPI)"]
DISC[Discovery<br/>OM list + watermark]
SAMP[Sampler<br/>Trino TABLESAMPLE 5%]
ENR[Enrich<br/>Ollama qwen2.5:3b<br/>≤20 words]
PUB[Publish<br/>OM PATCH + Milvus upsert]
end
subgraph Backends
OM[(OpenMetadata)]
TR[(Trino)]
LL[LiteLLM → Ollama]
MV[(Milvus<br/>collection: catalog)]
end
CRON --> DISC
HOOK --> DISC
CLI --> DISC
DISC --> SAMP --> ENR --> PUB
DISC --> OM
SAMP --> TR
ENR --> LL
PUB --> OM
PUB --> MV
Sovereignty guarantee¶
The chart-bundled NetworkPolicy only allows :
kube-dns(UDP/TCP 53)- Same-namespace egress to OpenMetadata (8585), Trino (8080), LiteLLM (4000), Milvus (19530), Postgres (5432)
No public Internet egress is allowed. This is the only OSS LLM-based catalog auto-enricher with that property — DataHub Cloud routes samples through AWS Bedrock, Atlan / Coalesce / Select Star / Secoda are SaaS by design.
API¶
| Endpoint | Method | Purpose |
|---|---|---|
/healthz |
GET | Liveness / readiness probe |
/metrics |
GET | Prometheus exposition |
/status |
GET | Feature flags + config snapshot |
/sync |
POST | Run a discover → sample → enrich → publish pass |
POST /sync request body :
Default is dry_run=true (set in the chart values). The cockpit "Sync
staged changes" UI flips to false when an operator accepts a delta.
Configuration (zero hardcoding)¶
Every URL, model and toggle is sourced from env vars wired by the
Helm sub-chart. No host name, password or model ID is baked into the
Python code. The values block lives at
helm/akko/charts/akko-catalog-sync/values.yaml.
Key flags (all in global.features.catalogSyncEnabled umbrella) :
features.llmDescriptions— turn the LLM description path on/offfeatures.piiLlmVote— Presidio + LLM majority vote for PII taggingfeatures.fkDiscovery— phase 63.4 fingerprint matchingfeatures.queryMiner— phase 63.5 VictoriaLogs JOIN extractionfeatures.milvusPush— push enrichment to ADEN's collection
Metrics¶
akko_catalog_sync_tables_processed_total{outcome="ok|dry_run|error|sample_failed"}akko_catalog_sync_llm_latency_seconds(histogram)akko_catalog_sync_publish_failures_total{sink="openmetadata|milvus"}
Smoke test¶
After enabling the daemon (global.features.catalogSyncEnabled=true)
and provisioning the OM bot token Secret :
kubectl -n akko run csync-curl --rm -it --image=curlimages/curl:8.10.1 -- \
sh -c 'curl -sX POST -H "Content-Type: application/json" \
-d "{\"limit\":3,\"dry_run\":true}" \
http://akko-akko-catalog-sync:8000/sync | head -c 4000'
You should see a JSON array with 3 entries, each with a generated
≤20-word description and om_updated:false (dry-run).