
akko-catalog-sync — auto-enrich catalog + semantic layer

akko-catalog-sync is the AKKO answer to the catalog cold-start problem: when an operator adds a new dataset, schema or full catalog, OpenMetadata (OM) has nothing recorded for it (no descriptions, no tags, no foreign keys), the Milvus collection used by ADEN cannot retrieve it via semantic search, and NL→SQL recall@5 on the new source drops sharply.

The daemon walks the OM table inventory, samples 5 % of each table via Trino TABLESAMPLE BERNOULLI, anonymises the rows (Presidio masking), sends a short RAG prompt to a local Ollama model (qwen2.5:3b via the LiteLLM gateway), and writes the result back to OM + Milvus. The sample data never leaves the cluster.
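
The sampling step can be illustrated with a small query builder. This is a minimal sketch, not the daemon's actual code: the function names, the three-part `catalog.schema.table` name check and the `max_rows` cap are assumptions made for illustration. Only `TABLESAMPLE BERNOULLI` and the 5 % rate come from the description above.

```python
def quote_ident(part: str) -> str:
    """Quote one identifier part for Trino, doubling embedded double quotes."""
    return '"' + part.replace('"', '""') + '"'


def build_sample_query(fqn: str, percent: float = 5.0, max_rows: int = 200) -> str:
    """Build a Bernoulli-sampled SELECT for a catalog.schema.table name.

    max_rows is a hypothetical safety cap; the source does not mention one.
    """
    parts = fqn.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {fqn!r}")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    table = ".".join(quote_ident(p) for p in parts)
    # BERNOULLI keeps each row independently with probability percent/100.
    return (
        f"SELECT * FROM {table} "
        f"TABLESAMPLE BERNOULLI ({percent:g}) LIMIT {int(max_rows)}"
    )


if __name__ == "__main__":
    print(build_sample_query("iceberg.banking.transactions"))
```

Quoting each identifier part keeps table names with unusual characters from breaking the statement, which matters when the names come straight from a catalog inventory.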

Sprint 63 / ADR-049 — ship skeleton, MVP path

Phase 63.1 (this release) ships the service skeleton, the OM PATCH and Milvus upsert paths, and an off-by-default toggle. Phases 63.4–63.7 add foreign-key discovery, query-log mining and a cockpit accept/reject UI. See ADR-049.

Architecture

flowchart LR
  subgraph Triggers
    CRON[CronJob 6h]
    HOOK[POST /sync<br/>webhook]
    CLI[Manual --table=fqn]
  end
  subgraph Daemon["akko-catalog-sync (FastAPI)"]
    DISC[Discovery<br/>OM list + watermark]
    SAMP[Sampler<br/>Trino TABLESAMPLE 5%]
    ENR[Enrich<br/>Ollama qwen2.5:3b<br/>≤20 words]
    PUB[Publish<br/>OM PATCH + Milvus upsert]
  end
  subgraph Backends
    OM[(OpenMetadata)]
    TR[(Trino)]
    LL[LiteLLM → Ollama]
    MV[(Milvus<br/>collection: catalog)]
  end
  CRON --> DISC
  HOOK --> DISC
  CLI --> DISC
  DISC --> SAMP --> ENR --> PUB
  DISC --> OM
  SAMP --> TR
  ENR --> LL
  PUB --> OM
  PUB --> MV
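
The flow in the diagram can be sketched as a loop over table names with the three stages injected as callables. This is an illustrative skeleton only: `run_pass`, `TableResult` and the callable signatures are hypothetical, and the mapping of failures to outcome labels is an assumption (the label values themselves are the ones listed under Metrics).

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class TableResult:
    fqn: str
    description: str
    outcome: str  # "ok", "dry_run", "error" or "sample_failed"


def run_pass(
    fqns: Iterable[str],
    sample: Callable[[str], list],
    enrich: Callable[[str, list], str],
    publish: Callable[[str, str], None],
    dry_run: bool = True,
) -> list:
    """Run sample -> enrich -> publish for each table, isolating failures per table."""
    results = []
    for fqn in fqns:
        try:
            rows = sample(fqn)
        except Exception:
            # Assumption: a sampling failure gets its own outcome label.
            results.append(TableResult(fqn, "", "sample_failed"))
            continue
        try:
            description = enrich(fqn, rows)
            if dry_run:
                # Dry run: keep the generated text but skip every write.
                results.append(TableResult(fqn, description, "dry_run"))
                continue
            publish(fqn, description)
            results.append(TableResult(fqn, description, "ok"))
        except Exception:
            results.append(TableResult(fqn, "", "error"))
    return results
```

Passing the stages in as functions keeps the loop testable without OpenMetadata, Trino, LiteLLM or Milvus running.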

Sovereignty guarantee

The chart-bundled NetworkPolicy allows only the following egress:

  • kube-dns (UDP/TCP 53)
  • Same-namespace egress to OpenMetadata (8585), Trino (8080), LiteLLM (4000), Milvus (19530), Postgres (5432)

No public Internet egress is allowed. This is the only OSS LLM-based catalog auto-enricher with that property: DataHub Cloud routes samples through AWS Bedrock, and Atlan, Coalesce, Select Star and Secoda are SaaS by design.
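
To make the shape of such an egress allow-list concrete, the sketch below builds an equivalent NetworkPolicy manifest as a Python dictionary (Kubernetes accepts JSON manifests). It is not the chart's template: the policy name, the pod labels and the kube-dns selector are assumptions, and only the port numbers and the `akko` namespace are taken from this page.

```python
import json

# Ports listed in this page; the backend names are used only as dict keys here.
BACKEND_PORTS = {
    "openmetadata": 8585,
    "trino": 8080,
    "litellm": 4000,
    "milvus": 19530,
    "postgres": 5432,
}


def build_egress_policy(namespace: str = "akko") -> dict:
    """Return an egress-only NetworkPolicy: DNS plus same-namespace backends."""
    dns_rule = {
        # Assumed selector: the conventional kube-dns labels in kube-system.
        "to": [{
            "namespaceSelector": {
                "matchLabels": {"kubernetes.io/metadata.name": "kube-system"}
            },
            "podSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
        }],
        "ports": [
            {"protocol": "UDP", "port": 53},
            {"protocol": "TCP", "port": 53},
        ],
    }
    backend_rule = {
        # An empty podSelector under "to" matches pods in the policy's own namespace.
        "to": [{"podSelector": {}}],
        "ports": [{"protocol": "TCP", "port": p} for p in BACKEND_PORTS.values()],
    }
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        # Hypothetical name and label; the real chart may use different ones.
        "metadata": {"name": "akko-catalog-sync-egress", "namespace": namespace},
        "spec": {
            "podSelector": {
                "matchLabels": {"app.kubernetes.io/name": "akko-catalog-sync"}
            },
            "policyTypes": ["Egress"],
            "egress": [dns_rule, backend_rule],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_egress_policy(), indent=2))
```

Because `policyTypes` contains `Egress` and no rule targets an external CIDR, any destination not matched by the two rules is dropped for the selected pods.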

API

| Endpoint | Method | Purpose |
|----------|--------|---------|
| /healthz | GET | Liveness / readiness probe |
| /metrics | GET | Prometheus exposition |
| /status | GET | Feature flags + config snapshot |
| /sync | POST | Run a discover → sample → enrich → publish pass |

POST /sync request body:

{
  "fqns": ["iceberg.banking.transactions"],
  "since_ms": 0,
  "dry_run": true,
  "limit": 50
}

The default is dry_run=true (set in the chart values). The cockpit "Sync staged changes" UI sets it to false when an operator accepts a delta.
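
A client for this endpoint might look like the following sketch, using only the standard library. The field names match the request body shown above. `SyncRequest` and `post_sync` are hypothetical names, and the defaults for `fqns`, `since_ms` and `limit` are assumptions copied from the example (only the `dry_run` default is stated by the chart).

```python
import json
import urllib.request
from dataclasses import asdict, dataclass, field


@dataclass
class SyncRequest:
    fqns: list = field(default_factory=list)  # table names to sync
    since_ms: int = 0                         # watermark, epoch milliseconds
    dry_run: bool = True                      # safe default: no writes
    limit: int = 50                           # max tables per pass

    def to_json(self) -> str:
        """Validate the fields and serialise them as the POST body."""
        if self.limit <= 0:
            raise ValueError("limit must be positive")
        if self.since_ms < 0:
            raise ValueError("since_ms must be non-negative")
        return json.dumps(asdict(self))


def post_sync(base_url: str, req: SyncRequest, timeout: float = 60.0):
    """POST the request to <base_url>/sync and return the decoded JSON reply."""
    http_req = urllib.request.Request(
        f"{base_url}/sync",
        data=req.to_json().encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(http_req, timeout=timeout) as resp:
        return json.load(resp)
```

From a pod inside the namespace, a call such as `post_sync("http://akko-akko-catalog-sync:8000", SyncRequest(limit=3))` would exercise the same dry-run path as the smoke test.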

Configuration (zero hardcoding)

Every URL, model and toggle is sourced from env vars wired by the Helm sub-chart. No host name, password or model ID is baked into the Python code. The values block lives at helm/akko/charts/akko-catalog-sync/values.yaml.

Key flags (all under the global.features.catalogSyncEnabled umbrella):

  • features.llmDescriptions — turn the LLM description path on/off
  • features.piiLlmVote — Presidio + LLM majority vote for PII tagging
  • features.fkDiscovery — phase 63.4 fingerprint matching
  • features.queryMiner — phase 63.5 VictoriaLogs JOIN extraction
  • features.milvusPush — push enrichment to ADEN's collection
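
One way to read such toggles from the environment is sketched below. The environment variable names are hypothetical (the actual names are set by the Helm sub-chart and are not listed here), and treating the umbrella flag as a hard gate over every feature is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_TRUE = {"1", "true", "yes", "on"}


def env_flag(env: Mapping[str, str], name: str, default: bool = False) -> bool:
    """Parse a boolean env var; unset means default, unknown strings mean False."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE


@dataclass(frozen=True)
class Features:
    llm_descriptions: bool
    pii_llm_vote: bool
    fk_discovery: bool
    query_miner: bool
    milvus_push: bool


def load_features(env: Mapping[str, str] = os.environ) -> Features:
    """Build the feature set; every variable name below is illustrative."""
    # Assumption: the umbrella flag switches every feature off when false.
    enabled = env_flag(env, "CATALOG_SYNC_ENABLED")
    return Features(
        llm_descriptions=enabled and env_flag(env, "FEATURE_LLM_DESCRIPTIONS"),
        pii_llm_vote=enabled and env_flag(env, "FEATURE_PII_LLM_VOTE"),
        fk_discovery=enabled and env_flag(env, "FEATURE_FK_DISCOVERY"),
        query_miner=enabled and env_flag(env, "FEATURE_QUERY_MINER"),
        milvus_push=enabled and env_flag(env, "FEATURE_MILVUS_PUSH"),
    )
```

Accepting the environment as a parameter lets tests pass a plain dictionary instead of mutating `os.environ`.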

Metrics

  • akko_catalog_sync_tables_processed_total{outcome="ok|dry_run|error|sample_failed"}
  • akko_catalog_sync_llm_latency_seconds (histogram)
  • akko_catalog_sync_publish_failures_total{sink="openmetadata|milvus"}
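
As a quick way to inspect these counters without a Prometheus server, the sketch below parses the text returned by /metrics and computes a failure ratio. It is a simplified reader that assumes one sample per line in the standard text exposition format; the helper names and the choice to count both `error` and `sample_failed` as failures are illustrative.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   akko_catalog_sync_tables_processed_total{outcome="ok"} 8.0
_SAMPLE = re.compile(
    r'^akko_catalog_sync_tables_processed_total\{([^}]*)\}\s+([0-9.eE+-]+)'
)
_OUTCOME = re.compile(r'outcome="([^"]*)"')


def outcomes_from_exposition(text: str) -> dict:
    """Sum the tables_processed counter per outcome label value."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue  # comments (# HELP / # TYPE) and other metrics
        label = _OUTCOME.search(match.group(1))
        if label:
            totals[label.group(1)] += float(match.group(2))
    return dict(totals)


def error_ratio(totals: dict) -> float:
    """Share of processed tables that ended in error or sample_failed."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    failed = totals.get("error", 0.0) + totals.get("sample_failed", 0.0)
    return failed / total
```

For alerting, the same ratio would normally be computed in PromQL over a rate window; the parser is only meant for ad hoc checks.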

Smoke test

After enabling the daemon (global.features.catalogSyncEnabled=true) and provisioning the OM bot token Secret:

kubectl -n akko run csync-curl --rm -it --image=curlimages/curl:8.10.1 -- \
  sh -c 'curl -sX POST -H "Content-Type: application/json" \
    -d "{\"limit\":3,\"dry_run\":true}" \
    http://akko-akko-catalog-sync:8000/sync | head -c 4000'

You should see a JSON array with 3 entries, each with a generated ≤20-word description and om_updated:false (dry-run).
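
The expected shape of the reply can be checked mechanically. The sketch below is one possible validator: `om_updated` is the field named above, while the `description` key and the function name are assumptions about the response schema.

```python
import json


def check_dry_run_response(body: str, expected: int = 3, max_words: int = 20) -> list:
    """Return a list of human-readable problems; an empty list means the reply looks right."""
    entries = json.loads(body)
    if not isinstance(entries, list):
        return ["response is not a JSON array"]
    problems = []
    if len(entries) != expected:
        problems.append(f"expected {expected} entries, got {len(entries)}")
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i} is not an object")
            continue
        # "description" is an assumed key name for the generated text.
        words = len(str(entry.get("description", "")).split())
        if words == 0 or words > max_words:
            problems.append(f"entry {i} description has {words} words")
        if entry.get("om_updated") is not False:
            problems.append(f"entry {i} om_updated is not false")
    return problems
```

Note that the smoke-test command pipes the reply through `head -c 4000`, so a long response may be cut mid-document and fail to parse; run the check against the full body.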