RAG Pipeline

AKKO ships a ready-to-run RAG (Retrieval-Augmented Generation) pipeline that stays 100% inside your cluster. Embeddings are computed by nomic-embed-text running on Ollama and stored in PostgreSQL via the pgvector extension; answers are generated by qwen2.5:3b, served through LiteLLM.

The interactive walk-through lives in notebooks/rag-pipeline-demo.ipynb, mounted read-only inside every JupyterHub pod.

Why RAG on AKKO

  • Sovereign — no call to OpenAI, Pinecone, Weaviate Cloud, or Anthropic. Models and vectors never leave the cluster.
  • Composable — the same pgvector tables are queryable from Trino (postgresql.akko.kb_chunks) and exposed to agents via MCP.
  • Cheap — nomic-embed-text is a 137 MB model that embeds thousands of chunks per minute on CPU.
  • Auditable — every embedding call goes through LiteLLM, which logs the caller (X-User-Id) and the model used.

Architecture

flowchart LR
    subgraph Ingest[Ingest]
        DOC[Source documents<br/>PDF, Markdown, HTML, Confluence export]
        CHUNK[LangChain chunker<br/>RecursiveCharacterTextSplitter<br/>chunk_size=1000, overlap=200]
        EMB[Ollama<br/>nomic-embed-text<br/>768 dims]
        PG[(PostgreSQL<br/>pgvector · HNSW index)]
    end
    subgraph Query[Query]
        Q[User question]
        EMBQ[Ollama<br/>nomic-embed-text]
        SEARCH[pgvector<br/>cosine top-k=5]
        PROMPT[Prompt builder<br/>context + question]
        LLM[Ollama<br/>qwen2.5:3b]
        ANS[Answer + citations]
    end
    DOC --> CHUNK --> EMB --> PG
    Q --> EMBQ --> SEARCH
    PG --> SEARCH
    SEARCH --> PROMPT
    Q --> PROMPT
    PROMPT --> LLM --> ANS

Storage Model

PostgreSQL table definition (auto-provisioned by postgres-init, see postgres/init/05-pgvector.sql):

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE akko.kb_chunks (
  id            BIGSERIAL PRIMARY KEY,
  source        TEXT        NOT NULL,    -- e.g. 'docs/handbook.pdf'
  chunk_index   INT         NOT NULL,
  content       TEXT        NOT NULL,
  tokens        INT         NOT NULL,
  embedding     vector(768) NOT NULL,
  metadata      JSONB       NOT NULL DEFAULT '{}'::jsonb,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- HNSW index for sub-linear cosine search
CREATE INDEX kb_chunks_hnsw ON akko.kb_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

CREATE INDEX kb_chunks_source_idx ON akko.kb_chunks (source);
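
With the HNSW index in place, cosine top-k retrieval is plain SQL. A minimal sketch using psycopg, assuming the notebook's POSTGRES_URL and a 768-dim query vector (qvec is a placeholder here; in practice it comes from nomic-embed-text):

# Cosine top-k against akko.kb_chunks — psycopg 3 assumed.
# <=> is pgvector's cosine-distance operator, served by the HNSW index.
import os
import psycopg

qvec = [0.0] * 768  # placeholder; embed the question first
literal = "[" + ",".join(map(str, qvec)) + "]"

with psycopg.connect(os.environ["POSTGRES_URL"]) as conn:
    rows = conn.execute(
        """
        SELECT source, chunk_index, content,
               1 - (embedding <=> %s::vector) AS cosine_similarity
        FROM akko.kb_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT 5
        """,
        (literal, literal),
    ).fetchall()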

Trino catalog access: postgresql.akko.kb_chunks — same table, exposed through the Trino PostgreSQL connector. Columns of type vector are promoted to ARRAY<DOUBLE> in Trino, so akko_ai_similarity and akko_ai_search work out of the box.
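
A hedged sketch of the same access from Python via the trino client (host, port, and user below are assumptions; see Admin / RBAC for the real endpoint):

# Query the chunks through Trino's PostgreSQL connector.
import trino

conn = trino.dbapi.connect(
    host="akko-trino",   # assumed service name
    port=8080,           # assumed port
    user="alice",
    catalog="postgresql",
    schema="akko",
)
cur = conn.cursor()
cur.execute(
    "SELECT source, chunk_index, content "
    "FROM kb_chunks WHERE content LIKE '%pgvector%' LIMIT 5"
)
for source, chunk_index, content in cur.fetchall():
    print(f"{source}:{chunk_index}  {content[:80]}")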

Ingest Pipeline

# notebooks/rag-pipeline-demo.ipynb — simplified
import os

from langchain_community.vectorstores import PGVector
from langchain_ollama import OllamaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
emb      = OllamaEmbeddings(
    base_url=os.environ["OLLAMA_HOST"],   # http://akko-ollama:11434
    model="nomic-embed-text",
)

store = PGVector(
    collection_name="akko.kb_chunks",
    connection_string=os.environ["POSTGRES_URL"],
    embedding_function=emb,
)

for doc in documents:
    chunks = splitter.split_documents([doc])
    store.add_documents(chunks)

Every env var is injected by the spawner (jupyterhub_config.py); no hostnames are hardcoded.
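
For reference, a hedged sketch of what that injection could look like in jupyterhub_config.py (KubeSpawner's environment hook is standard; AKKO's actual values are an assumption):

# jupyterhub_config.py — illustrative only
c.KubeSpawner.environment = {
    "OLLAMA_HOST": "http://akko-ollama:11434",
    "LITELLM_HOST": "http://akko-litellm:4000",
    "RAG_EMBED_MODEL": "nomic-embed-text",
    "RAG_CHAT_MODEL": "qwen2.5:3b",
    "RAG_TOP_K": "5",
}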

Query Pipeline

# continues from the ingest snippet above (os and store are already defined)
from langchain_ollama import ChatOllama
from langchain.prompts import PromptTemplate

chat = ChatOllama(
    base_url=os.environ["OLLAMA_HOST"],
    model="qwen2.5:3b",
    temperature=0,
)

retriever = store.as_retriever(search_type="similarity", search_kwargs={"k": 5})
prompt = PromptTemplate.from_template(
    "Answer the question using only the context below. Cite sources as [source:chunk_index].\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

chain    = prompt | chat
question = "How does AKKO store embeddings?"  # example question

docs = retriever.invoke(question)
ctx  = "\n\n".join(
    f"[{d.metadata['source']}:{d.metadata['chunk_index']}] {d.page_content}"
    for d in docs
)
ans = chain.invoke({"context": ctx, "question": question})
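
ChatOllama returns an AIMessage, so the generated text lives in .content:

# Print the answer and the chunks it was grounded on
print(ans.content)
for d in docs:
    print(f"- {d.metadata['source']} (chunk {d.metadata['chunk_index']})")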

Configuration

Key Helm values (helm/akko/values.yaml):

akko-ollama:
  models:
    - name: qwen2.5-coder:7b
    - name: qwen2.5:3b
    - name: nomic-embed-text

akko-postgresql-data:
  extensions:
    - pgvector
    - postgis

akko-litellm:
  routes:
    - model_name: nomic-embed-text
      litellm_params:
        model: ollama/nomic-embed-text
        api_base: http://akko-ollama:11434

Notebook environment variables:

| Variable | Default | Purpose |
|---|---|---|
| OLLAMA_HOST | http://akko-ollama:11434 | Direct Ollama (bypasses LiteLLM for speed) |
| LITELLM_HOST | http://akko-litellm:4000 | Auditable path (recommended for shared jobs) |
| POSTGRES_URL | postgresql://alice:...@akko-postgresql-data:5432/akko | Data PostgreSQL (pgvector) |
| RAG_EMBED_MODEL | nomic-embed-text | Embedding model |
| RAG_CHAT_MODEL | qwen2.5:3b | Generation model |
| RAG_TOP_K | 5 | Retriever top-k |
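
To take the auditable path, point an OpenAI-compatible client at LiteLLM instead of calling Ollama directly. A minimal sketch (the LITELLM_KEY fallback is an assumption; see Troubleshooting):

# Embeddings via LiteLLM so the call is logged (caller + model)
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["LITELLM_HOST"],                      # http://akko-litellm:4000
    api_key=os.environ.get("LITELLM_KEY", "sk-placeholder"),  # virtual key, if required
)

resp = client.embeddings.create(model="nomic-embed-text", input="hello AKKO")
vec = resp.data[0].embedding  # 768 floats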

RBAC

| Role | Insert chunks | Query chunks | Create KB collection |
|---|---|---|---|
| akko-admin | yes | yes | yes |
| akko-engineer | yes | yes | yes |
| akko-analyst | no (read-only) | yes | no |
| akko-user | no | yes (PII-masked) | no |
| akko-viewer | no | yes (published KBs only) | no |

All access is enforced by OPA on top of the standard PostgreSQL Trino connector rules (see Admin / RBAC). The akko.kb_chunks.metadata column can carry a project key used by rowFilters to restrict cross-team visibility.
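
For example, a hypothetical ingest step that tags chunks with a project key (the key name and value are illustrative assumptions):

# Tag chunks so OPA rowFilters can scope them to a team
for chunk in chunks:
    chunk.metadata["project"] = "data-platform"  # hypothetical key/value
store.add_documents(chunks)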

Performance Notes

  • HNSW index brings the search from O(n) to ~O(log n). On 1 M chunks, p95 stays below 50 ms.
  • Batch embedding — call store.add_documents([batch]) with 32-64 chunks to saturate Ollama without exhausting memory (see the sketch after this list).
  • Re-ingest strategy — delete by source, re-chunk, re-embed. pgvector has no tombstones, so a vacuum runs weekly via pg_cron (akko-postgresql-data).
  • Prefer akko_ai_search in SQL — if the question is a literal string, the Trino AI plugin caches the query embedding per worker (see AI / Trino functions).
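
A minimal batching sketch, reusing the splitter and store from the ingest pipeline above:

# Batched ingestion: 64 chunks per call keeps Ollama busy without
# holding the whole corpus in flight at once
BATCH_SIZE = 64

chunks = splitter.split_documents(documents)
for i in range(0, len(chunks), BATCH_SIZE):
    store.add_documents(chunks[i : i + BATCH_SIZE])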

Integration Points

  • JupyterHub — notebooks/rag-pipeline-demo.ipynb is the reference implementation.
  • ADEN — when AKKO_RAG_ENABLED=true, ADEN calls the retriever before the SQL generation prompt to inject domain knowledge (table descriptions, business rules).
  • MCP Trino — agents can run execute_query against postgresql.akko.kb_chunks to search documents.
  • Superset — a dedicated dashboard (AKKO RAG insights) plots embeddings (UMAP) for drift monitoring.

Troubleshooting

| Symptom | Likely cause | Fix |
|---|---|---|
| could not find extension "vector" | pgvector not installed in the data cluster | Check akko-postgresql-data values; the image is akko-postgres:2026.03, which ships pgvector |
| Very slow ingest (> 30 s / chunk) | Model not loaded in Ollama memory | kubectl exec -it deploy/akko-ollama -- ollama pull nomic-embed-text |
| HTTP 401 on OllamaEmbeddings | Notebook missing LiteLLM virtual key | Use OLLAMA_HOST directly, or set LITELLM_KEY in the spawner env |
| Answers repeat the question verbatim | Retriever returned irrelevant chunks | Raise RAG_TOP_K, tune chunk_size/overlap, or re-chunk the corpus |
| HNSW index rebuild blocks inserts | Running CREATE INDEX on a populated table | Use CREATE INDEX CONCURRENTLY or build the index before bulk ingest |
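
To confirm the embedding model is actually resident, list Ollama's local models (GET /api/tags is Ollama's model-list endpoint):

# Quick health check: which models does Ollama currently have?
import os
import requests

r = requests.get(os.environ["OLLAMA_HOST"] + "/api/tags", timeout=5)
r.raise_for_status()
print([m["name"] for m in r.json()["models"]])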