Skip to content

Catalog auto-enrichment with NORA

TL;DR

NORA is the steward review queue for AI-generated catalog enrichments. The Catalog Auto-Enrichment Service samples a small subset of rows from every federated source (operational DBs, lakehouse tables, etc.), asks the on-prem LLM to propose a description + PII tags + foreign-key candidates, and writes those proposals to a Postgres queue. Stewards review them in the cockpit /#nora page, then accept / edit / reject.

Accepted proposals are written back to the Catalog layer (OpenMetadata) and immediately upserted into the Vector layer (Milvus) so the ADEN natural-language query engine reflects them in milliseconds.

No row sample ever leaves the cluster.


Layer mapping

Layer (user-facing) Engine (admin)
Catalog OpenMetadata
Catalog Auto-Enrichment Service akko-catalog-sync daemon
AI Gateway LiteLLM
Vector store Milvus
Steward review queue Postgres catalog_sync.proposals_pending
Cockpit review page /#nora

Vendor names appear only here in the Architecture / Operations doc and never in user-facing surfaces (sidebar, KPIs, toasts).


End-to-end flow

flowchart LR
    A[Federated source<br/>postgres_oltp_banking, iceberg, cloudera, climscore, …] -->|TABLESAMPLE| B[Catalog Auto-Enrichment Service]
    B -->|prompt| C[AI Gateway]
    C -->|description<br/>+ PII tags<br/>+ FK candidates| B
    B -->|insert| D[Steward review queue<br/>proposals_pending]
    D --> E[Cockpit /#nora]
    E -->|accept| F[Catalog]
    E -->|accept| G[Vector store]
    F --> H[ADEN NL→SQL]
    G --> H

Triggers

The service has 3 entry points :

  1. CronJob (default) — runs every 6 h, sweeps tables modified since the last watermark. Safe to leave on permanently ; the LLM cost is bounded by AKKO_SYNC_BATCH_SIZE per run.
  2. Admin manual trigger (UI)/#nora page exposes an admin-only Trigger catalog auto-enrichment card. Admin selects a federated source from the datalist (populated live from /api/catalogs/sources), optionally caps the table count, optionally ticks dry-run, and clicks Run enrichment. The cockpit calls POST /api/cockpit/catalog-sync/sync which proxies to the daemon.
  3. Direct HTTP (operator scripts) — POST /sync on the daemon service inside the cluster, body :
    { "source": "postgres_oltp_banking", "limit": 50, "dry_run": false }
    
    See HTTP API below for the full schema.

HTTP API

POST /sync on akko-catalog-sync:8000

Field Type Default Description
source string|null null Trino catalog / OM service name. Resolves to FQN prefix <source>.. Mutually exclusive with fqns.
fqns string[]|null null Explicit allow-list of fully-qualified table names. Wins over source.
since_ms int 0 Incremental watermark (Unix ms). 0 = full scan within the FQN allow-list.
limit int 50 Max tables processed in this run (1–500).
dry_run bool|null env default When true, no write to Catalog or Vector store.

Response : 200 with an array of per-table results, each containing fqn, description, confidence, pii_columns, om_updated, milvus_upserted, error.

Error responses :

  • 404 no tables matched the request — the OM /tables listing is empty for that source. Usually means OM ingestion has not run yet (see Prerequisites below).
  • 404 no OpenMetadata tables matched source='<x>' — source name differs from the OM serviceFQN. Check Catalog Manager.
  • 401 / 403 — caller missing admin role (admin gate is enforced upstream by the cockpit oauth2-proxy).

Prerequisites

Before NORA can produce proposals for a source, the source must be indexed in OpenMetadata. This is a one-time setup per source :

# 1. Register the database service in OpenMetadata
#    (OM UI → Settings → Database Services → Add)
# 2. Run a metadata ingestion workflow against it
#    (OM UI → Ingestions → Run, or Airflow DAG akko_om_ingestion)
# 3. Confirm tables appear in /api/catalogs/search?q=*

Once the ingestion has populated table_search_index, the /api/catalogs/sources endpoint will list the source and the cockpit trigger card datalist will offer it as a suggestion.

The Sprint 82 A4 mission documented the 6 target sources :

  • postgres_oltp_banking
  • postgres_oltp_publicsector
  • postgres_postgis_esg
  • postgres_postgis_iot
  • cloudera (delivered by Sprint 82 A2)
  • climscore (delivered by Sprint 82 A3, read-only federation)

RBAC matrix

Persona Sidebar /#nora Trigger card Accept / Reject Dry-run only
alice (admin) yes yes yes yes
dave (steward) yes no yes no
bob (engineer) no no no no
carol (analyst) no no no no
eve (viewer) no no no no

The admin trigger surface is hidden from stewards on purpose : stewards review, admins operate. A steward who needs a fresh batch contacts an admin or waits for the 6 h CronJob.


Operational runbook

Symptom Likely cause Fix
Trigger returns 404 "no tables matched" OM ingestion not run for that source Register service + run ingestion workflow
Trigger returns 404 "source name differs" Typo, or OM service named differently from the Trino catalog Check /api/catalogs/sources, use the exact name
Card hidden for an admin Token claim cached pre-deploy Hard refresh + re-login
Datalist empty /api/catalogs/sources returned [] or 5xx Check OM health + cockpit-backend logs
pending count not growing after trigger Daemon returned 200 but with error per row Inspect daemon logs : sampler / LLM / sink failure
Accepted proposal not visible in OM OM PATCH best-effort fire-and-forget failed Re-trigger the same proposal accept (idempotent)
ADEN still returns old description Milvus upsert raced with read Wait 1 s, retry — eventual consistency is typically sub-second

Discipline notes

  • Zero hardcoded sources — the cockpit datalist is fed by /api/catalogs/sources which enumerates OM database services. Adding a new source = registering it in OM and running ingestion ; no code change required.
  • Layer-first vocabulary — every label visible to users speaks of "Catalog Auto-Enrichment", "Federated source", "Lakehouse table", "Operational DB". Vendor names live in this admin doc only.
  • No SSH bypass — every operator action goes through the cockpit oauth2-proxy → cockpit-backend → daemon. The daemon never exposes a public Ingress.
  • Sample-awareAKKO_SAMPLE_PERCENT (default 5%) and AKKO_DESCRIPTION_MAX_WORDS (default 100) cap both the row sample size and the LLM output, so no full-table snapshot ever leaves the cluster and no descriptions balloon unreasonably.