Ollama

Local LLM inference server for AI-powered data exploration.

URL https://ollama.akko.local
Helm sub-chart akko-ollama (custom)

Overview

Ollama runs open-weight LLMs locally on your infrastructure. No API keys, no data leaving your network. Combined with LiteLLM (unified AI gateway), jupyter-ai in notebooks, and pgvector in PostgreSQL, it enables fully private AI workflows including RAG (Retrieval-Augmented Generation) pipelines.


Models

AKKO ships with three pre-pulled models:

Model              Size     Purpose
qwen2.5-coder:7b   4.7 GB   Code generation and technical assistance
qwen2.5:3b         2.0 GB   General chat and text generation
nomic-embed-text   274 MB   Text embeddings (768 dimensions) for RAG pipelines

The ollama-init sidecar pulls these models on first startup. This can take several minutes depending on your internet connection.
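To check which models have been pulled, you can query Ollama's /api/tags endpoint from a notebook (the service name below is the in-cluster hostname used in the examples later on this page):

import requests

# /api/tags lists the models available locally on the Ollama server.
response = requests.get("http://akko-akko-ollama:11434/api/tags")
for model in response.json()["models"]:
    print(model["name"], model["size"])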


Architecture

  JupyterHub (jupyter-ai)
        |
  LiteLLM (:4000)  ──→  Ollama (:11434)
  (OpenAI-compatible       (model inference)
   API gateway)
        |
  PostgreSQL (pgvector)
  (embedding storage)

  • Ollama handles model loading and inference
  • LiteLLM provides an OpenAI-compatible API layer, routing requests to the appropriate Ollama model
  • jupyter-ai in notebooks connects to LiteLLM for AI-assisted coding
  • pgvector in PostgreSQL stores embeddings for RAG pipelines
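As a quick sanity check, both layers expose simple HTTP endpoints you can hit from a notebook; a minimal sketch, using the in-cluster hostnames and development API key shown in the examples below:

import requests

# Ollama: /api/version returns the running server version.
print(requests.get("http://akko-akko-ollama:11434/api/version").json())

# LiteLLM: the OpenAI-compatible /v1/models route lists the models it can route to.
print(
    requests.get(
        "http://akko-akko-litellm:4000/v1/models",
        headers={"Authorization": "Bearer akko-dev-litellm-key"},
    ).json()
)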

Using Ollama from Notebooks

Direct API Access

import requests

response = requests.post(
    "http://akko-akko-ollama:11434/api/generate",
    json={
        "model": "qwen2.5:3b",
        "prompt": "Explain Apache Iceberg in one paragraph.",
        "stream": False,
    },
)
print(response.json()["response"])

Via LiteLLM (OpenAI-Compatible)

from openai import OpenAI

client = OpenAI(
    base_url="http://akko-akko-litellm:4000/v1",
    api_key="akko-dev-litellm-key",
)

response = client.chat.completions.create(
    model="ollama/qwen2.5:3b",
    messages=[{"role": "user", "content": "What is a lakehouse?"}],
)
print(response.choices[0].message.content)

Via jupyter-ai

The %%ai magic command in JupyterLab connects to LiteLLM:

%%ai ollama:qwen2.5:3b
Explain the difference between Spark and Trino.
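If the %%ai magic is not available in a fresh kernel, the jupyter-ai magics extension can be loaded explicitly first:

%load_ext jupyter_ai_magics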

Embeddings for RAG

import requests

response = requests.post(
    "http://akko-akko-ollama:11434/api/embed",
    json={
        "model": "nomic-embed-text",
        "input": "Apache Iceberg is an open table format for analytic datasets.",
    },
)
embedding = response.json()["embeddings"][0]
# Store in PostgreSQL with pgvector
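A minimal sketch of the storage step, assuming psycopg2 is available in the notebook; the PostgreSQL hostname, credentials, and table name below are placeholders for illustration and depend on your deployment:

import psycopg2

# Hypothetical connection details -- substitute your deployment's values.
conn = psycopg2.connect(
    host="akko-akko-postgresql", dbname="akko", user="akko", password="akko"
)
with conn, conn.cursor() as cur:
    # The vector extension may already be enabled by the platform;
    # creating it requires sufficient privileges.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "  id serial PRIMARY KEY, content text, embedding vector(768))"
    )
    # pgvector accepts a '[v1,v2,...]' literal; nomic-embed-text returns 768 dimensions.
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
        (
            "Apache Iceberg is an open table format for analytic datasets.",
            "[" + ",".join(str(x) for x in embedding) + "]",
        ),
    )
conn.close()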

Healthcheck

The Ollama container image does not include curl or wget. The healthcheck uses a TCP probe:

healthcheck:
  test: ["CMD", "bash", "-c", "exec 3<>/dev/tcp/localhost/11434"]

Note

Use CMD (not CMD-SHELL) because the Ollama image's /bin/sh does not support /dev/tcp.


Resource Requirements

Model              Minimum RAM   Recommended RAM
nomic-embed-text   1 GB          2 GB
qwen2.5:3b         4 GB          6 GB
qwen2.5-coder:7b   8 GB          10 GB

Docker Desktop memory

Loading qwen2.5-coder:7b requires at least 10 GB of available RAM in Docker Desktop. If Ollama crashes or restarts, increase the memory allocation.


Known Issues

No tool calling with small models

Small models like TinyLlama do not support MCP/tool calling. Use qwen2.5-coder:7b or qwen2.5:3b for function calling capabilities.
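As an illustration, tool definitions can be sent through LiteLLM using the standard OpenAI tools parameter; the tool below is hypothetical and only shows the request shape:

from openai import OpenAI

client = OpenAI(
    base_url="http://akko-akko-litellm:4000/v1",
    api_key="akko-dev-litellm-key",
)

# Hypothetical tool definition -- only the request shape matters here.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_row_count",
            "description": "Return the number of rows in a table.",
            "parameters": {
                "type": "object",
                "properties": {"table": {"type": "string"}},
                "required": ["table"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="ollama/qwen2.5:3b",
    messages=[{"role": "user", "content": "How many rows does the orders table have?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)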

Model pull takes time

The qwen2.5-coder:7b model is 4.7 GB, so the ollama-init sidecar can take five minutes or more to download it on first startup. Do not interrupt the process.

GPU acceleration

Ollama supports GPU acceleration via NVIDIA CUDA. On macOS with Apple Silicon, Metal acceleration is used automatically. On Linux, pass --gpus all in the Docker configuration to enable CUDA.
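For Docker Compose deployments, the equivalent of --gpus all can be expressed as a device reservation on the ollama service; a minimal sketch, assuming the NVIDIA driver and NVIDIA Container Toolkit are installed on the host:

ollama:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]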