Catalog auto-enrichment with NORA¶

TL;DR¶

NORA is the steward review queue for AI-generated catalog enrichments. The Catalog Auto-Enrichment Service samples a small subset of rows from every federated source (operational DBs, lakehouse tables, etc.), asks the on-prem LLM to propose a description + PII tags + foreign-key candidates, and writes those proposals to a Postgres queue. Stewards review them in the cockpit /#nora page, then accept / edit / reject.

Accepted proposals are written back to the Catalog layer (OpenMetadata) and immediately upserted into the Vector layer (Milvus) so the ADEN natural-language query engine reflects them in milliseconds.

No row sample ever leaves the cluster.

Layer mapping¶

Layer (user-facing)	Engine (admin)
Catalog	OpenMetadata
Catalog Auto-Enrichment Service	`akko-catalog-sync` daemon
AI Gateway	LiteLLM
Vector store	Milvus
Steward review queue	Postgres `catalog_sync.proposals_pending`
Cockpit review page	`/#nora`

Vendor names appear only here in the Architecture / Operations doc and never in user-facing surfaces (sidebar, KPIs, toasts).

End-to-end flow¶

flowchart LR
    A[Federated source<br/>postgres_oltp_banking, iceberg, cloudera, climscore, …] -->|TABLESAMPLE| B[Catalog Auto-Enrichment Service]
    B -->|prompt| C[AI Gateway]
    C -->|description<br/>+ PII tags<br/>+ FK candidates| B
    B -->|insert| D[Steward review queue<br/>proposals_pending]
    D --> E[Cockpit /#nora]
    E -->|accept| F[Catalog]
    E -->|accept| G[Vector store]
    F --> H[ADEN NL→SQL]
    G --> H

Triggers¶

The service has 3 entry points :

CronJob (default) — runs every 6 h, sweeps tables modified since the last watermark. Safe to leave on permanently ; the LLM cost is bounded by AKKO_SYNC_BATCH_SIZE per run.
Admin manual trigger (UI) — /#nora page exposes an admin-only Trigger catalog auto-enrichment card. Admin selects a federated source from the datalist (populated live from /api/catalogs/sources), optionally caps the table count, optionally ticks dry-run, and clicks Run enrichment. The cockpit calls POST /api/cockpit/catalog-sync/sync which proxies to the daemon.
Direct HTTP (operator scripts) — POST /sync on the daemon service inside the cluster, body :
```
{ "source": "postgres_oltp_banking", "limit": 50, "dry_run": false }
```
See HTTP API below for the full schema.

HTTP API¶

POST /sync on akko-catalog-sync:8000

Field	Type	Default	Description
`source`	string\|null	`null`	Trino catalog / OM service name. Resolves to FQN prefix `<source>.`. Mutually exclusive with `fqns`.
`fqns`	string[]\|null	`null`	Explicit allow-list of fully-qualified table names. Wins over `source`.
`since_ms`	int	`0`	Incremental watermark (Unix ms). 0 = full scan within the FQN allow-list.
`limit`	int	`50`	Max tables processed in this run (1–500).
`dry_run`	bool\|null	env default	When `true`, no write to Catalog or Vector store.

Response : 200 with an array of per-table results, each containing fqn, description, confidence, pii_columns, om_updated, milvus_upserted, error.

Error responses :

404 no tables matched the request — the OM /tables listing is empty for that source. Usually means OM ingestion has not run yet (see Prerequisites below).
404 no OpenMetadata tables matched source='<x>' — source name differs from the OM serviceFQN. Check Catalog Manager.
401 / 403 — caller missing admin role (admin gate is enforced upstream by the cockpit oauth2-proxy).

Prerequisites¶

Before NORA can produce proposals for a source, the source must be indexed in OpenMetadata. This is a one-time setup per source :

# 1. Register the database service in OpenMetadata
#    (OM UI → Settings → Database Services → Add)
# 2. Run a metadata ingestion workflow against it
#    (OM UI → Ingestions → Run, or Airflow DAG akko_om_ingestion)
# 3. Confirm tables appear in /api/catalogs/search?q=*

Once the ingestion has populated table_search_index, the /api/catalogs/sources endpoint will list the source and the cockpit trigger card datalist will offer it as a suggestion.

The Sprint 82 A4 mission documented the 6 target sources :

postgres_oltp_banking
postgres_oltp_publicsector
postgres_postgis_esg
postgres_postgis_iot
cloudera (delivered by Sprint 82 A2)
climscore (delivered by Sprint 82 A3, read-only federation)

RBAC matrix¶

Persona	Sidebar `/#nora`	Trigger card	Accept / Reject	Dry-run only
alice (admin)	yes	yes	yes	yes
dave (steward)	yes	no	yes	no
bob (engineer)	no	no	no	no
carol (analyst)	no	no	no	no
eve (viewer)	no	no	no	no

The admin trigger surface is hidden from stewards on purpose : stewards review, admins operate. A steward who needs a fresh batch contacts an admin or waits for the 6 h CronJob.

Operational runbook¶

Symptom	Likely cause	Fix
Trigger returns 404 "no tables matched"	OM ingestion not run for that source	Register service + run ingestion workflow
Trigger returns 404 "source name differs"	Typo, or OM service named differently from the Trino catalog	Check `/api/catalogs/sources`, use the exact `name`
Card hidden for an admin	Token claim cached pre-deploy	Hard refresh + re-login
Datalist empty	`/api/catalogs/sources` returned `[]` or 5xx	Check OM health + cockpit-backend logs
`pending` count not growing after trigger	Daemon returned 200 but with `error` per row	Inspect daemon logs : sampler / LLM / sink failure
Accepted proposal not visible in OM	OM PATCH best-effort fire-and-forget failed	Re-trigger the same proposal accept (idempotent)
ADEN still returns old description	Milvus upsert raced with read	Wait 1 s, retry — eventual consistency is typically sub-second

Discipline notes¶

Zero hardcoded sources — the cockpit datalist is fed by /api/catalogs/sources which enumerates OM database services. Adding a new source = registering it in OM and running ingestion ; no code change required.
Layer-first vocabulary — every label visible to users speaks of "Catalog Auto-Enrichment", "Federated source", "Lakehouse table", "Operational DB". Vendor names live in this admin doc only.
No SSH bypass — every operator action goes through the cockpit oauth2-proxy → cockpit-backend → daemon. The daemon never exposes a public Ingress.
Sample-aware — AKKO_SAMPLE_PERCENT (default 5%) and AKKO_DESCRIPTION_MAX_WORDS (default 100) cap both the row sample size and the LLM output, so no full-table snapshot ever leaves the cluster and no descriptions balloon unreasonably.