RAG Pipeline

AKKO ships a ready-to-run RAG (Retrieval-Augmented Generation) pipeline that stays 100% inside your cluster. Embeddings are computed by nomic-embed-text running on Ollama and stored in PostgreSQL via the pgvector extension; answers are generated by qwen2.5:3b, served through LiteLLM.

The interactive walk-through lives in notebooks/rag-pipeline-demo.ipynb, mounted read-only inside every JupyterHub pod.

Why RAG on AKKO

  • Sovereign — no call to OpenAI, Pinecone, Weaviate Cloud, or Anthropic. Models and vectors never leave the cluster.
  • Composable — the same pgvector tables are queryable from Trino (postgresql.akko.kb_chunks) and exposed to agents via MCP.
  • Cheap — nomic-embed-text is a 137 MB model that embeds thousands of chunks per minute on CPU.
  • Auditable — every embedding call goes through LiteLLM, which logs the caller (X-User-Id) and the model used.

Architecture

flowchart LR
    subgraph Ingest[Ingest]
        DOC[Source documents<br/>PDF, Markdown, HTML, Confluence export]
        CHUNK[LangChain chunker<br/>RecursiveCharacterTextSplitter<br/>chunk_size=1000, overlap=200]
        EMB[Ollama<br/>nomic-embed-text<br/>768 dims]
        PG[(PostgreSQL<br/>pgvector · HNSW index)]
    end
    subgraph Query[Query]
        Q[User question]
        EMBQ[Ollama<br/>nomic-embed-text]
        SEARCH[pgvector<br/>cosine top-k=5]
        PROMPT[Prompt builder<br/>context + question]
        LLM[Ollama<br/>qwen2.5:3b]
        ANS[Answer + citations]
    end
    DOC --> CHUNK --> EMB --> PG
    Q --> EMBQ --> SEARCH
    PG --> SEARCH
    SEARCH --> PROMPT
    Q --> PROMPT
    PROMPT --> LLM --> ANS

Storage Model

PostgreSQL table definition (auto-provisioned by postgres-init, see postgres/init/05-pgvector.sql):

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE akko.kb_chunks (
  id            BIGSERIAL PRIMARY KEY,
  source        TEXT        NOT NULL,    -- e.g. 'docs/handbook.pdf'
  chunk_index   INT         NOT NULL,
  content       TEXT        NOT NULL,
  tokens        INT         NOT NULL,
  embedding     vector(768) NOT NULL,
  metadata      JSONB       NOT NULL DEFAULT '{}'::jsonb,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- HNSW index for sub-linear cosine search
CREATE INDEX kb_chunks_hnsw ON akko.kb_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

CREATE INDEX kb_chunks_source_idx ON akko.kb_chunks (source);
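
With the HNSW index in place, cosine top-k retrieval is plain SQL. A minimal sketch using psycopg, assuming the notebook's POSTGRES_URL and a 768-dim query vector (qvec is a placeholder here; in practice it comes from nomic-embed-text):

# Cosine top-k against akko.kb_chunks — psycopg 3 assumed.
# <=> is pgvector's cosine-distance operator, served by the HNSW index.
import os
import psycopg

qvec = [0.0] * 768  # placeholder; embed the question first
literal = "[" + ",".join(map(str, qvec)) + "]"

with psycopg.connect(os.environ["POSTGRES_URL"]) as conn:
    rows = conn.execute(
        """
        SELECT source, chunk_index, content,
               1 - (embedding <=> %s::vector) AS cosine_similarity
        FROM akko.kb_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT 5
        """,
        (literal, literal),
    ).fetchall()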

Trino catalog access: postgresql.akko.kb_chunks — same table, exposed through the Trino PostgreSQL connector. Columns of type vector are promoted to ARRAY<DOUBLE> in Trino, so akko_ai_similarity and akko_ai_search work out of the box.
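
A hedged sketch of the same access from Python via the trino client (host, port, and user below are assumptions; see Admin / RBAC for the real endpoint):

# Query the chunks through Trino's PostgreSQL connector.
import trino

conn = trino.dbapi.connect(
    host="akko-trino",   # assumed service name
    port=8080,           # assumed port
    user="alice",
    catalog="postgresql",
    schema="akko",
)
cur = conn.cursor()
cur.execute(
    "SELECT source, chunk_index, content "
    "FROM kb_chunks WHERE content LIKE '%pgvector%' LIMIT 5"
)
for source, chunk_index, content in cur.fetchall():
    print(f"{source}:{chunk_index}  {content[:80]}")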

Ingest Pipeline

# notebooks/rag-pipeline-demo.ipynb — simplified
import os

from langchain_community.vectorstores import PGVector
from langchain_ollama import OllamaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
emb      = OllamaEmbeddings(
    base_url=os.environ["OLLAMA_HOST"],   # http://akko-ollama:11434
    model="nomic-embed-text",
)

store = PGVector(
    collection_name="akko.kb_chunks",
    connection_string=os.environ["POSTGRES_URL"],
    embedding_function=emb,
)

for doc in documents:
    chunks = splitter.split_documents([doc])
    store.add_documents(chunks)

Every env var is injected by the spawner (jupyterhub_config.py); no hostnames are hardcoded.
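
For reference, a hedged sketch of what that injection could look like in jupyterhub_config.py (KubeSpawner's environment hook is standard; AKKO's actual values are an assumption):

# jupyterhub_config.py — illustrative only
c.KubeSpawner.environment = {
    "OLLAMA_HOST": "http://akko-ollama:11434",
    "LITELLM_HOST": "http://akko-litellm:4000",
    "RAG_EMBED_MODEL": "nomic-embed-text",
    "RAG_CHAT_MODEL": "qwen2.5:3b",
    "RAG_TOP_K": "5",
}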

Query Pipeline

# continues from the ingest snippet above (os and store are already defined)
from langchain_ollama import ChatOllama
from langchain.prompts import PromptTemplate

chat = ChatOllama(
    base_url=os.environ["OLLAMA_HOST"],
    model="qwen2.5:3b",
    temperature=0,
)

retriever = store.as_retriever(search_type="similarity", search_kwargs={"k": 5})
prompt = PromptTemplate.from_template(
    "Answer the question using only the context below. Cite sources as [source:chunk_index].\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

chain    = prompt | chat
question = "How does AKKO store embeddings?"  # example question

docs = retriever.invoke(question)
ctx  = "\n\n".join(
    f"[{d.metadata['source']}:{d.metadata['chunk_index']}] {d.page_content}"
    for d in docs
)
ans = chain.invoke({"context": ctx, "question": question})
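
ChatOllama returns an AIMessage, so the generated text lives in .content:

# Print the answer and the chunks it was grounded on
print(ans.content)
for d in docs:
    print(f"- {d.metadata['source']} (chunk {d.metadata['chunk_index']})")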

Configuration

Key Helm values (helm/akko/values.yaml):

akko-ollama:
  models:
    - name: qwen2.5-coder:7b
    - name: qwen2.5:3b
    - name: nomic-embed-text

akko-postgresql-data:
  extensions:
    - pgvector
    - postgis

akko-litellm:
  routes:
    - model_name: nomic-embed-text
      litellm_params:
        model: ollama/nomic-embed-text
        api_base: http://akko-ollama:11434

Notebook environment variables:

| Variable | Default | Purpose |
|---|---|---|
| OLLAMA_HOST | http://akko-ollama:11434 | Direct Ollama (bypasses LiteLLM for speed) |
| LITELLM_HOST | http://akko-litellm:4000 | Auditable path (recommended for shared jobs) |
| POSTGRES_URL | postgresql://alice:...@akko-postgresql-data:5432/akko | Data PostgreSQL (pgvector) |
| RAG_EMBED_MODEL | nomic-embed-text | Embedding model |
| RAG_CHAT_MODEL | qwen2.5:3b | Generation model |
| RAG_TOP_K | 5 | Retriever top-k |
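
To take the auditable path, point an OpenAI-compatible client at LiteLLM instead of calling Ollama directly. A minimal sketch (the LITELLM_KEY fallback is an assumption; see Troubleshooting):

# Embeddings via LiteLLM so the call is logged (caller + model)
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["LITELLM_HOST"],                      # http://akko-litellm:4000
    api_key=os.environ.get("LITELLM_KEY", "sk-placeholder"),  # virtual key, if required
)

resp = client.embeddings.create(model="nomic-embed-text", input="hello AKKO")
vec = resp.data[0].embedding  # 768 floats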

RBAC

| Role | Insert chunks | Query chunks | Create KB collection |
|---|---|---|---|
| akko-admin | yes | yes | yes |
| akko-engineer | yes | yes | yes |
| akko-analyst | no (read-only) | yes | no |
| akko-user | no | yes (PII-masked) | no |
| akko-viewer | no | yes (published KBs only) | no |

All access is enforced by OPA on top of the standard PostgreSQL Trino connector rules (see Admin / RBAC). The akko.kb_chunks.metadata column can carry a project key used by rowFilters to restrict cross-team visibility.
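
For example, a hypothetical ingest step that tags chunks with a project key (the key name and value are illustrative assumptions):

# Tag chunks so OPA rowFilters can scope them to a team
for chunk in chunks:
    chunk.metadata["project"] = "data-platform"  # hypothetical key/value
store.add_documents(chunks)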

Performance Notes

  • HNSW index brings the search from O(n) to ~O(log n). On 1 M chunks, p95 stays below 50 ms.
  • Batch embedding — call store.add_documents([batch]) with 32-64 chunks to saturate Ollama without exhausting memory (see the sketch after this list).
  • Re-ingest strategy — delete by source, re-chunk, re-embed. pgvector has no tombstones, so a vacuum runs weekly via pg_cron (akko-postgresql-data).
  • Prefer akko_ai_search in SQL — if the question is a literal string, the Trino AI plugin caches the query embedding per worker (see AI / Trino functions).
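
A minimal batching sketch, reusing the splitter and store from the ingest pipeline above:

# Batched ingestion: 64 chunks per call keeps Ollama busy without
# holding the whole corpus in flight at once
BATCH_SIZE = 64

chunks = splitter.split_documents(documents)
for i in range(0, len(chunks), BATCH_SIZE):
    store.add_documents(chunks[i : i + BATCH_SIZE])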

Integration Points

  • JupyterHub — notebooks/rag-pipeline-demo.ipynb is the reference implementation.
  • ADEN — when AKKO_RAG_ENABLED=true, ADEN calls the retriever before the SQL generation prompt to inject domain knowledge (table descriptions, business rules).
  • MCP Trino — agents can run execute_query against postgresql.akko.kb_chunks to search documents.
  • Superset — a dedicated dashboard (AKKO RAG insights) plots embeddings (UMAP) for drift monitoring.

Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| could not find extension "vector" | pgvector not installed in the data cluster | Check akko-postgresql-data values; the image is akko-postgres:2026.03, which ships pgvector |
| Very slow ingest (> 30 s / chunk) | Model not loaded in Ollama memory | kubectl exec -it deploy/akko-ollama -- ollama pull nomic-embed-text |
| HTTP 401 on OllamaEmbeddings | Notebook missing LiteLLM virtual key | Use OLLAMA_HOST directly, or set LITELLM_KEY in the spawner env |
| Answers repeat the question verbatim | Retriever returned irrelevant chunks | Raise RAG_TOP_K, tune chunk_size/overlap, or re-chunk the corpus |
| HNSW index rebuild blocks inserts | Running CREATE INDEX on a populated table | Use CREATE INDEX CONCURRENTLY or build the index before bulk ingest |
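
To confirm the embedding model is actually resident, list Ollama's local models (GET /api/tags is Ollama's model-list endpoint):

# Quick health check: which models does Ollama currently have?
import os
import requests

r = requests.get(os.environ["OLLAMA_HOST"] + "/api/tags", timeout=5)
r.raise_for_status()
print([m["name"] for m in r.json()["models"]])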