# RAG Pipeline
AKKO ships a ready-to-run RAG (Retrieval-Augmented Generation) pipeline that stays 100% inside your cluster. Embeddings are computed by `nomic-embed-text` running on Ollama and stored in PostgreSQL via the pgvector extension; at query time the closest chunks are retrieved with a cosine search and the answer is generated by `qwen2.5:3b`, served through LiteLLM.
The interactive walk-through lives in `notebooks/rag-pipeline-demo.ipynb`, mounted read-only inside every JupyterHub pod.
## Why RAG on AKKO
- Sovereign — no calls to OpenAI, Pinecone, Weaviate Cloud, or Anthropic. Models and vectors never leave the cluster.
- Composable — the same pgvector tables are queryable from Trino (`postgresql.akko.kb_chunks`) and exposed to agents via MCP.
- Cheap — `nomic-embed-text` is a 137 MB model that embeds thousands of chunks per minute on CPU.
- Auditable — every embedding call goes through LiteLLM, which logs the caller (`X-User-Id`) and the model used.
## Architecture
```mermaid
flowchart LR
    subgraph Ingest[Ingest]
        DOC[Source documents<br/>PDF, Markdown, HTML, Confluence export]
        CHUNK[LangChain chunker<br/>RecursiveCharacterTextSplitter<br/>chunk_size=1000, overlap=200]
        EMB[Ollama<br/>nomic-embed-text<br/>768 dims]
        PG[(PostgreSQL<br/>pgvector · HNSW index)]
    end
    subgraph Query[Query]
        Q[User question]
        EMBQ[Ollama<br/>nomic-embed-text]
        SEARCH[pgvector<br/>cosine top-k=5]
        PROMPT[Prompt builder<br/>context + question]
        LLM[Ollama<br/>qwen2.5:3b]
        ANS[Answer + citations]
    end
    DOC --> CHUNK --> EMB --> PG
    Q --> EMBQ --> SEARCH
    PG --> SEARCH
    SEARCH --> PROMPT
    Q --> PROMPT
    PROMPT --> LLM --> ANS
```
## Storage Model
PostgreSQL table definition (auto-provisioned by `postgres-init`; see `postgres/init/05-pgvector.sql`):
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE akko.kb_chunks (
    id          BIGSERIAL PRIMARY KEY,
    source      TEXT NOT NULL,              -- e.g. 'docs/handbook.pdf'
    chunk_index INT NOT NULL,
    content     TEXT NOT NULL,
    tokens      INT NOT NULL,
    embedding   vector(768) NOT NULL,
    metadata    JSONB NOT NULL DEFAULT '{}'::jsonb,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- HNSW index for sub-linear cosine search
CREATE INDEX kb_chunks_hnsw ON akko.kb_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

CREATE INDEX kb_chunks_source_idx ON akko.kb_chunks (source);
```
Trino catalog access: `postgresql.akko.kb_chunks` — the same table, exposed through the Trino PostgreSQL connector. Columns of type `vector` are promoted to `ARRAY<DOUBLE>` in Trino, so `akko_ai_similarity` and `akko_ai_search` work out of the box.
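Because the chunks live in a plain PostgreSQL table, you can also bypass LangChain and query pgvector directly. A minimal sketch, assuming psycopg 3 is available in the notebook image (`top_k_chunks` is an illustrative helper, not part of the notebook):

```python
import os
import psycopg

def top_k_chunks(query_embedding: list[float], k: int = 5):
    """Cosine top-k over akko.kb_chunks; <=> is pgvector's cosine distance."""
    # pgvector accepts a '[v1,v2,...]' text literal cast to ::vector.
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with psycopg.connect(os.environ["POSTGRES_URL"]) as conn:
        return conn.execute(
            """
            SELECT source, chunk_index, content,
                   1 - (embedding <=> %s::vector) AS cosine_similarity
            FROM akko.kb_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec, vec, k),
        ).fetchall()
```

The HNSW index above accelerates exactly this `ORDER BY embedding <=> …` pattern.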
## Ingest Pipeline
```python
# notebooks/rag-pipeline-demo.ipynb — simplified
import os

from langchain_community.vectorstores import PGVector
from langchain_ollama import OllamaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

emb = OllamaEmbeddings(
    base_url=os.environ["OLLAMA_HOST"],   # http://akko-ollama:11434
    model="nomic-embed-text",
)

store = PGVector(
    collection_name="akko.kb_chunks",
    connection_string=os.environ["POSTGRES_URL"],
    embedding_function=emb,
)

for doc in documents:
    chunks = splitter.split_documents([doc])
    store.add_documents(chunks)
```
Every environment variable is injected by the spawner (`jupyterhub_config.py`); no hostnames are hardcoded.
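For orientation, a hedged sketch of what that spawner fragment can look like, assuming KubeSpawner (the shipped `jupyterhub_config.py` is authoritative and may differ):

```python
# jupyterhub_config.py — illustrative fragment
c.KubeSpawner.environment = {
    "OLLAMA_HOST": "http://akko-ollama:11434",
    "LITELLM_HOST": "http://akko-litellm:4000",
    "RAG_EMBED_MODEL": "nomic-embed-text",
    "RAG_CHAT_MODEL": "qwen2.5:3b",
    "RAG_TOP_K": "5",
    # POSTGRES_URL is typically injected per user from a Secret, not inlined here.
}
```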
## Query Pipeline
```python
from langchain_ollama import ChatOllama
from langchain.prompts import PromptTemplate

chat = ChatOllama(
    base_url=os.environ["OLLAMA_HOST"],
    model="qwen2.5:3b",
    temperature=0,
)

retriever = store.as_retriever(search_type="similarity", search_kwargs={"k": 5})

prompt = PromptTemplate.from_template(
    "Answer the question using only the context below. Cite sources as [source:chunk_index].\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

chain = prompt | chat

docs = retriever.invoke(question)
ctx = "\n\n".join(
    f"[{d.metadata['source']}:{d.metadata['chunk_index']}] {d.page_content}"
    for d in docs
)
ans = chain.invoke({"context": ctx, "question": question})
```
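The same steps fold into a small helper when you want to ask several questions in a row — an illustrative wrapper over the objects defined above, not part of the notebook API:

```python
def ask(question: str) -> str:
    # Retrieve, build the cited context block, then generate.
    docs = retriever.invoke(question)
    ctx = "\n\n".join(
        f"[{d.metadata['source']}:{d.metadata['chunk_index']}] {d.page_content}"
        for d in docs
    )
    return chain.invoke({"context": ctx, "question": question}).content

print(ask("Where are the LiteLLM virtual keys configured?"))
```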
## Configuration
Key Helm values (`helm/akko/values.yaml`):
```yaml
akko-ollama:
  models:
    - name: qwen2.5-coder:7b
    - name: qwen2.5:3b
    - name: nomic-embed-text

akko-postgresql-data:
  extensions:
    - pgvector
    - postgis

akko-litellm:
  routes:
    - model_name: nomic-embed-text
      litellm_params:
        model: ollama/nomic-embed-text
        api_base: http://akko-ollama:11434
```
| Variable (notebook env) | Default | Purpose |
|---|---|---|
| `OLLAMA_HOST` | `http://akko-ollama:11434` | Direct Ollama (bypasses LiteLLM for speed) |
| `LITELLM_HOST` | `http://akko-litellm:4000` | Auditable path (recommended for shared jobs) |
| `POSTGRES_URL` | `postgresql://alice:...@akko-postgresql-data:5432/akko` | Data PostgreSQL (pgvector) |
| `RAG_EMBED_MODEL` | `nomic-embed-text` | Embedding model |
| `RAG_CHAT_MODEL` | `qwen2.5:3b` | Generation model |
| `RAG_TOP_K` | `5` | Retriever top-k |
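In notebook code these resolve naturally with defaults matching the table above — a hedged convenience sketch:

```python
import os

# Fall back to the documented defaults when the spawner omits a variable.
EMBED_MODEL = os.environ.get("RAG_EMBED_MODEL", "nomic-embed-text")
CHAT_MODEL = os.environ.get("RAG_CHAT_MODEL", "qwen2.5:3b")
TOP_K = int(os.environ.get("RAG_TOP_K", "5"))
OLLAMA_HOST = os.environ.get("OLLAMA_HOST", "http://akko-ollama:11434")
```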
## RBAC
| Role | Insert chunks | Query chunks | Create KB collection |
|---|---|---|---|
| `akko-admin` | yes | yes | yes |
| `akko-engineer` | yes | yes | yes |
| `akko-analyst` | no (read-only) | yes | no |
| `akko-user` | no | yes (PII-masked) | no |
| `akko-viewer` | no | yes (published KBs only) | no |
All access is enforced by OPA on top of the standard PostgreSQL Trino connector rules (see Admin / RBAC). The `akko.kb_chunks.metadata` column can carry a project key used by `rowFilters` to restrict cross-team visibility.
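For the row filters to have something to match on, ingest code should stamp the project key into each document's metadata before chunking — a hedged sketch, assuming (as the schema above suggests) that `Document.metadata` is persisted into the JSONB `metadata` column:

```python
# "project" is an illustrative key name; align it with your OPA policy.
for doc in documents:
    doc.metadata["project"] = "team-alpha"

# split_documents copies metadata onto every chunk, so a rowFilter such as
# metadata->>'project' = '<team>' sees the key on every row.
chunks = splitter.split_documents(documents)
store.add_documents(chunks)
```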
## Performance Notes
- HNSW index brings the search from O(n) to ~O(log n). On 1 M chunks, p95 stays below 50 ms.
- Batch embedding — call `store.add_documents([batch])` with 32-64 chunks to saturate Ollama without blowing memory (see the sketch after this list).
- Re-ingest strategy — delete by `source`, re-chunk, re-embed. `pgvector` has no tombstones, so a vacuum runs weekly via `pg_cron` (`akko-postgresql-data`).
- Prefer `akko_ai_search` in SQL — if the question is a literal string, the Trino AI plugin caches the query embedding per worker (see AI / Trino functions).
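A hedged sketch of the batching pattern from the second bullet:

```python
# Slices of 64 chunks keep Ollama busy without materialising
# embeddings for the whole corpus at once.
BATCH = 64
chunks = splitter.split_documents(documents)
for i in range(0, len(chunks), BATCH):
    store.add_documents(chunks[i : i + BATCH])
```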
## Integration Points
- JupyterHub — `notebooks/rag-pipeline-demo.ipynb` is the reference implementation.
- ADEN — when `AKKO_RAG_ENABLED=true`, ADEN calls the retriever before the SQL generation prompt to inject domain knowledge (table descriptions, business rules).
- MCP Trino — agents can run `execute_query` against `postgresql.akko.kb_chunks` to search documents (see the example after this list).
- Superset — a dedicated dashboard (AKKO RAG insights) plots embeddings (UMAP) for drift monitoring.
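As an illustration of the MCP path, here is the kind of statement an agent could pass to `execute_query` — plain Trino SQL, no AKKO-specific functions assumed:

```python
# Keyword pre-filter over the Trino-exposed table; an agent would send
# this string as the execute_query argument.
sql = """
SELECT source, chunk_index, substr(content, 1, 200) AS preview
FROM postgresql.akko.kb_chunks
WHERE lower(content) LIKE '%retention policy%'
LIMIT 5
"""
```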
## Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| `could not find extension "vector"` | pgvector not installed in the data cluster | Check `akko-postgresql-data` values; the image `akko-postgres:2026.03` ships pgvector |
| Very slow ingest (> 30 s/chunk) | Model not loaded in Ollama memory | `kubectl exec -it deploy/akko-ollama -- ollama pull nomic-embed-text` |
| HTTP 401 on `OllamaEmbeddings` | Notebook missing LiteLLM virtual key | Use `OLLAMA_HOST` directly, or set `LITELLM_KEY` in the spawner env |
| Answers repeat the question verbatim | Retriever returned irrelevant chunks | Raise `RAG_TOP_K`, tune `chunk_size`/`overlap`, or re-chunk the corpus |
| HNSW index rebuild blocks inserts | Running `CREATE INDEX` on a populated table | Use `CREATE INDEX CONCURRENTLY` (see the sketch after this table) or build the index before bulk ingest |
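For the last row, note that `CREATE INDEX CONCURRENTLY` cannot run inside a transaction block, so a client like psycopg needs autocommit — a hedged sketch:

```python
import os
import psycopg

# autocommit=True is required: CONCURRENTLY refuses transaction blocks.
with psycopg.connect(os.environ["POSTGRES_URL"], autocommit=True) as conn:
    conn.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS kb_chunks_hnsw "
        "ON akko.kb_chunks USING hnsw (embedding vector_cosine_ops) "
        "WITH (m = 16, ef_construction = 64)"
    )
```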
## Related
- AI / ADEN · AI / Trino functions · AI / MCP servers
- Services / Ollama · Services / LiteLLM · Services / PostgreSQL
- Architecture / AI Stack
- Notebook: `notebooks/rag-pipeline-demo.ipynb` in the repo