Catalog auto-enrichment with NORA¶
TL;DR¶
NORA is the steward review queue for AI-generated catalog enrichments.
The Catalog Auto-Enrichment Service samples a small subset of rows
from every federated source (operational DBs, lakehouse tables, etc.),
asks the on-prem LLM to propose a description + PII tags + foreign-key
candidates, and writes those proposals to a Postgres queue. Stewards
review them in the cockpit /#nora page, then accept / edit / reject.
Accepted proposals are written back to the Catalog layer (OpenMetadata) and immediately upserted into the Vector layer (Milvus) so the ADEN natural-language query engine reflects them in milliseconds.
No row sample ever leaves the cluster.
Layer mapping¶
| Layer (user-facing) | Engine (admin) |
|---|---|
| Catalog | OpenMetadata |
| Catalog Auto-Enrichment Service | akko-catalog-sync daemon |
| AI Gateway | LiteLLM |
| Vector store | Milvus |
| Steward review queue | Postgres catalog_sync.proposals_pending |
| Cockpit review page | /#nora |
Vendor names appear only here in the Architecture / Operations doc and never in user-facing surfaces (sidebar, KPIs, toasts).
End-to-end flow¶
flowchart LR
A[Federated source<br/>postgres_oltp_banking, iceberg, cloudera, climscore, …] -->|TABLESAMPLE| B[Catalog Auto-Enrichment Service]
B -->|prompt| C[AI Gateway]
C -->|description<br/>+ PII tags<br/>+ FK candidates| B
B -->|insert| D[Steward review queue<br/>proposals_pending]
D --> E[Cockpit /#nora]
E -->|accept| F[Catalog]
E -->|accept| G[Vector store]
F --> H[ADEN NL→SQL]
G --> H
Triggers¶
The service has 3 entry points :
- CronJob (default) — runs every 6 h, sweeps tables modified since
the last watermark. Safe to leave on permanently ; the LLM cost is
bounded by
AKKO_SYNC_BATCH_SIZEper run. - Admin manual trigger (UI) —
/#norapage exposes an admin-only Trigger catalog auto-enrichment card. Admin selects a federated source from the datalist (populated live from/api/catalogs/sources), optionally caps the table count, optionally ticksdry-run, and clicks Run enrichment. The cockpit callsPOST /api/cockpit/catalog-sync/syncwhich proxies to the daemon. - Direct HTTP (operator scripts) —
POST /syncon the daemon service inside the cluster, body : See HTTP API below for the full schema.
HTTP API¶
POST /sync on akko-catalog-sync:8000
| Field | Type | Default | Description |
|---|---|---|---|
source |
string|null | null |
Trino catalog / OM service name. Resolves to FQN prefix <source>.. Mutually exclusive with fqns. |
fqns |
string[]|null | null |
Explicit allow-list of fully-qualified table names. Wins over source. |
since_ms |
int | 0 |
Incremental watermark (Unix ms). 0 = full scan within the FQN allow-list. |
limit |
int | 50 |
Max tables processed in this run (1–500). |
dry_run |
bool|null | env default | When true, no write to Catalog or Vector store. |
Response : 200 with an array of per-table results, each containing
fqn, description, confidence, pii_columns, om_updated,
milvus_upserted, error.
Error responses :
404 no tables matched the request— the OM/tableslisting is empty for that source. Usually means OM ingestion has not run yet (see Prerequisites below).404 no OpenMetadata tables matched source='<x>'— source name differs from the OM serviceFQN. Check Catalog Manager.401 / 403— caller missing admin role (admin gate is enforced upstream by the cockpit oauth2-proxy).
Prerequisites¶
Before NORA can produce proposals for a source, the source must be indexed in OpenMetadata. This is a one-time setup per source :
# 1. Register the database service in OpenMetadata
# (OM UI → Settings → Database Services → Add)
# 2. Run a metadata ingestion workflow against it
# (OM UI → Ingestions → Run, or Airflow DAG akko_om_ingestion)
# 3. Confirm tables appear in /api/catalogs/search?q=*
Once the ingestion has populated table_search_index, the
/api/catalogs/sources endpoint will list the source and the cockpit
trigger card datalist will offer it as a suggestion.
The Sprint 82 A4 mission documented the 6 target sources :
postgres_oltp_bankingpostgres_oltp_publicsectorpostgres_postgis_esgpostgres_postgis_iotcloudera(delivered by Sprint 82 A2)climscore(delivered by Sprint 82 A3, read-only federation)
RBAC matrix¶
| Persona | Sidebar /#nora |
Trigger card | Accept / Reject | Dry-run only |
|---|---|---|---|---|
| alice (admin) | yes | yes | yes | yes |
| dave (steward) | yes | no | yes | no |
| bob (engineer) | no | no | no | no |
| carol (analyst) | no | no | no | no |
| eve (viewer) | no | no | no | no |
The admin trigger surface is hidden from stewards on purpose : stewards review, admins operate. A steward who needs a fresh batch contacts an admin or waits for the 6 h CronJob.
Operational runbook¶
| Symptom | Likely cause | Fix |
|---|---|---|
| Trigger returns 404 "no tables matched" | OM ingestion not run for that source | Register service + run ingestion workflow |
| Trigger returns 404 "source name differs" | Typo, or OM service named differently from the Trino catalog | Check /api/catalogs/sources, use the exact name |
| Card hidden for an admin | Token claim cached pre-deploy | Hard refresh + re-login |
| Datalist empty | /api/catalogs/sources returned [] or 5xx |
Check OM health + cockpit-backend logs |
pending count not growing after trigger |
Daemon returned 200 but with error per row |
Inspect daemon logs : sampler / LLM / sink failure |
| Accepted proposal not visible in OM | OM PATCH best-effort fire-and-forget failed | Re-trigger the same proposal accept (idempotent) |
| ADEN still returns old description | Milvus upsert raced with read | Wait 1 s, retry — eventual consistency is typically sub-second |
Discipline notes¶
- Zero hardcoded sources — the cockpit datalist is fed by
/api/catalogs/sourceswhich enumerates OM database services. Adding a new source = registering it in OM and running ingestion ; no code change required. - Layer-first vocabulary — every label visible to users speaks of "Catalog Auto-Enrichment", "Federated source", "Lakehouse table", "Operational DB". Vendor names live in this admin doc only.
- No SSH bypass — every operator action goes through the cockpit oauth2-proxy → cockpit-backend → daemon. The daemon never exposes a public Ingress.
- Sample-aware —
AKKO_SAMPLE_PERCENT(default 5%) andAKKO_DESCRIPTION_MAX_WORDS(default 100) cap both the row sample size and the LLM output, so no full-table snapshot ever leaves the cluster and no descriptions balloon unreasonably.