# Ollama
Local LLM inference server for AI-powered data exploration.
|  |  |
|---|---|
| URL | https://ollama.akko.local |
| Helm sub-chart | `akko-ollama` (custom) |
## Overview
Ollama runs open-weight LLMs locally on your infrastructure. No API keys, no data leaving your network. Combined with LiteLLM (unified AI gateway), jupyter-ai in notebooks, and pgvector in PostgreSQL, it enables fully private AI workflows including RAG (Retrieval-Augmented Generation) pipelines.
## Models
AKKO ships with three pre-pulled models:
| Model | Size | Purpose |
|---|---|---|
| `qwen2.5-coder:7b` | 4.7 GB | Code generation and technical assistance |
| `qwen2.5:3b` | 2.0 GB | General chat and text generation |
| `nomic-embed-text` | 274 MB | Text embeddings (768 dimensions) for RAG pipelines |
The ollama-init sidecar pulls these models on first startup. This can take several minutes depending on your internet connection.
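Conceptually, the init step amounts to the following (a sketch of the model pulls, not the literal sidecar entrypoint, which may differ):

```bash
# Pull the three models listed above into the shared model cache
# (sketch only; the actual ollama-init script may differ)
ollama pull qwen2.5-coder:7b
ollama pull qwen2.5:3b
ollama pull nomic-embed-text
```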
## Architecture
```
JupyterHub (jupyter-ai)
         |
         v
LiteLLM (:4000) ──────→ Ollama (:11434)
(OpenAI-compatible      (model inference)
 API gateway)
         |
         v
PostgreSQL (pgvector)
(embedding storage)
```
- Ollama handles model loading and inference
- LiteLLM provides an OpenAI-compatible API layer, routing requests to the appropriate Ollama model (see the config sketch after this list)
- jupyter-ai in notebooks connects to LiteLLM for AI-assisted coding
- pgvector in PostgreSQL stores embeddings for RAG pipelines
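As a sketch of that routing, a LiteLLM proxy config for the models above might look like this (hypothetical values, not the shipped chart defaults):

```yaml
# Hypothetical LiteLLM model_list routing requests to the in-cluster Ollama service
model_list:
  - model_name: ollama/qwen2.5:3b
    litellm_params:
      model: ollama/qwen2.5:3b
      api_base: http://akko-akko-ollama:11434
  - model_name: ollama/qwen2.5-coder:7b
    litellm_params:
      model: ollama/qwen2.5-coder:7b
      api_base: http://akko-akko-ollama:11434
```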
## Using Ollama from Notebooks
### Direct API Access
```python
import requests

# Call Ollama's native REST API directly via the in-cluster service name
response = requests.post(
    "http://akko-akko-ollama:11434/api/generate",
    json={
        "model": "qwen2.5:3b",
        "prompt": "Explain Apache Iceberg in one paragraph.",
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(response.json()["response"])
```
### Via LiteLLM (OpenAI-Compatible)
```python
from openai import OpenAI

# LiteLLM exposes the Ollama models behind an OpenAI-compatible endpoint
client = OpenAI(
    base_url="http://akko-akko-litellm:4000/v1",
    api_key="akko-dev-litellm-key",
)
response = client.chat.completions.create(
    model="ollama/qwen2.5:3b",
    messages=[{"role": "user", "content": "What is a lakehouse?"}],
)
print(response.choices[0].message.content)
```
### Via jupyter-ai
The %%ai magic command in JupyterLab connects to LiteLLM:
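For example (a sketch; it assumes jupyter-ai's `openai-chat` provider has been pointed at LiteLLM, e.g. by setting `OPENAI_API_BASE` to `http://akko-akko-litellm:4000/v1`, and that the model alias matches the LiteLLM example above):

```
%%ai openai-chat:ollama/qwen2.5:3b
What is a lakehouse? Answer in two sentences.
```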
### Embeddings for RAG
```python
import requests

# Request a 768-dimensional embedding from nomic-embed-text
response = requests.post(
    "http://akko-akko-ollama:11434/api/embed",
    json={
        "model": "nomic-embed-text",
        "input": "Apache Iceberg is an open table format for analytic datasets.",
    },
)
embedding = response.json()["embeddings"][0]
# Store in PostgreSQL with pgvector (see the sketch below)
```
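As a sketch of that last step, assuming the pgvector-python package is installed (the host, credentials, and table name below are illustrative, not AKKO defaults):

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# Connection string and table name are assumptions for illustration.
with psycopg.connect("postgresql://akko:akko@akko-postgres:5432/akko") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    register_vector(conn)  # teach psycopg to adapt numpy arrays to vector
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rag_docs ("
        "id bigserial PRIMARY KEY, content text, embedding vector(768))"
    )
    conn.execute(
        "INSERT INTO rag_docs (content, embedding) VALUES (%s, %s)",
        ("Apache Iceberg is an open table format for analytic datasets.",
         np.array(embedding)),  # 768 dims matches nomic-embed-text
    )
```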
## Healthcheck
The Ollama container image does not include curl or wget. The healthcheck uses a TCP probe:
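A sketch of what such a probe can look like in a Compose-style healthcheck (interval and retry values are illustrative, not the shipped AKKO config):

```yaml
healthcheck:
  # CMD invokes bash directly; bash's /dev/tcp pseudo-device opens a TCP
  # connection to the Ollama port and exits non-zero if it is closed.
  test: ["CMD", "bash", "-c", "exec 3<>/dev/tcp/127.0.0.1/11434"]
  interval: 30s
  timeout: 5s
  retries: 5
```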
**Note:** Use `CMD` (not `CMD-SHELL`) because the Ollama image's `/bin/sh` does not support `/dev/tcp`.
## Resource Requirements
| Model | Minimum RAM | Recommended |
|---|---|---|
| `nomic-embed-text` | 1 GB | 2 GB |
| `qwen2.5:3b` | 4 GB | 6 GB |
| `qwen2.5-coder:7b` | 8 GB | 10 GB |
**Docker Desktop memory:** Loading `qwen2.5-coder:7b` requires at least 10 GB of available RAM in Docker Desktop. If Ollama crashes or restarts, increase the memory allocation.
## Known Issues
**No tool calling with small models:** Small models like TinyLlama do not support MCP/tool calling. Use `qwen2.5-coder:7b` or `qwen2.5:3b` for function calling capabilities.
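For example, function calling through the OpenAI-compatible LiteLLM endpoint (a sketch; the tool definition is hypothetical, and whether the model actually emits a tool call depends on its function-calling support):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://akko-akko-litellm:4000/v1",
    api_key="akko-dev-litellm-key",
)

# Hypothetical tool, for illustration only
tools = [{
    "type": "function",
    "function": {
        "name": "get_table_row_count",
        "description": "Return the number of rows in a table.",
        "parameters": {
            "type": "object",
            "properties": {"table": {"type": "string"}},
            "required": ["table"],
        },
    },
}]

response = client.chat.completions.create(
    model="ollama/qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "How many rows are in the orders table?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```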
**Model pull takes time:** The `qwen2.5-coder:7b` model is 4.7 GB. The ollama-init sidecar can take 5+ minutes to download it on first startup. Do not interrupt the process.
**GPU acceleration:** Ollama supports GPU acceleration via NVIDIA CUDA. On macOS with Apple Silicon, a native Ollama install uses Metal automatically (containers in Docker Desktop cannot access the GPU). On Linux, pass `--gpus all` in the Docker configuration to enable CUDA.
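As a sketch with the Docker CLI (assumes the NVIDIA Container Toolkit is installed; the image tag, port, and volume are illustrative, not the AKKO deployment):

```bash
# Expose all NVIDIA GPUs to a standalone Ollama container
docker run --gpus all -p 11434:11434 -v ollama:/root/.ollama ollama/ollama
```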