
# LiteLLM — AI Gateway

LiteLLM provides an OpenAI-compatible API gateway that unifies access to all LLM backends in AKKO. Instead of calling Ollama directly, services and notebooks use LiteLLM's standard API — making it trivial to swap backends (Ollama, vLLM, cloud APIs) without changing application code.

## Architecture

```
Notebooks / Cockpit / Applications
                |
    +-----------v------------+
    |        LiteLLM         |  OpenAI-compatible API (port 4000)
    |      (AI Gateway)      |
    +-----------+------------+
                |
    +-----------v------------+
    |         Ollama         |  Local LLM runtime (port 11434)
    |  qwen2.5-coder:7b,     |
    |  qwen2.5:3b,           |
    |  nomic-embed-text      |
    +------------------------+
```

## Configured Models

| Model | Size | Use Case |
|---|---|---|
| `qwen2.5-coder:7b` | 4.7 GB | Code generation, text-to-SQL, MCP tool calling |
| `qwen2.5:3b` | 2.0 GB | General chat, jupyter-ai conversations |
| `nomic-embed-text` | 137 MB | RAG embeddings (pgvector) |

## Usage

### From Notebooks (jupyter-ai)

jupyter-ai is pre-configured to use LiteLLM as its backend:

```python
# jupyter-ai uses JUPYTER_AI_DEFAULT_PROVIDER and JUPYTER_AI_DEFAULT_MODEL.
# These are set automatically by the JupyterHub spawner.
```
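If the spawner's defaults ever need to be reproduced by hand (for example in a standalone notebook container), the environment would look roughly like this. Only `JUPYTER_AI_DEFAULT_PROVIDER` and `JUPYTER_AI_DEFAULT_MODEL` are named on this page; the provider/model values and the OpenAI-SDK variables below are assumptions based on the Direct API Calls example:

```shell
# Assumed values; the JupyterHub spawner normally sets these automatically.
export JUPYTER_AI_DEFAULT_PROVIDER=openai          # provider name is an assumption
export JUPYTER_AI_DEFAULT_MODEL=qwen2.5:3b         # per the Configured Models table
export OPENAI_BASE_URL=http://akko-akko-litellm:4000/v1
export OPENAI_API_KEY=akko-dev-litellm-key         # dev master_key
```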

### Direct API Calls

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://akko-akko-litellm:4000/v1",
    api_key="akko-dev-litellm-key",  # master_key from values
)

response = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[{"role": "user", "content": "Write a SQL query to find top customers"}],
)
print(response.choices[0].message.content)
```

### Text-to-SQL Example

```python
response = client.chat.completions.create(
    model="qwen2.5-coder:7b",
    messages=[
        {
            "role": "system",
            "content": "You are a SQL expert. Generate Trino SQL queries for an Iceberg lakehouse.",
        },
        {
            "role": "user",
            "content": "Show me the top 10 customers by total balance",
        },
    ],
)
print(response.choices[0].message.content)
```
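The `nomic-embed-text` model is served through the same OpenAI-compatible endpoint, so RAG embeddings can be generated with `client.embeddings.create` using the client from the Direct API Calls example. The cosine helper below is a local illustration of how pgvector would rank the resulting vectors; it is a sketch, not AKKO code:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; pgvector's <=> operator is the matching cosine *distance* (1 - similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With the client from the Direct API Calls example above:
# emb = client.embeddings.create(
#     model="nomic-embed-text",
#     input=["top customers by balance", "largest account holders"],
# )
# v1, v2 = emb.data[0].embedding, emb.data[1].embedding
# print(cosine(v1, v2))  # semantically close texts score near 1.0
```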

## URLs

| Mode | URL |
|---|---|
| Kubernetes (k3d) | https://llm.akko.local |

## Configuration

### Kubernetes (Helm)

```yaml
akko-litellm:
  enabled: true
  ollamaHost: "akko-akko-ollama"
  config:
    general_settings:
      master_key: "your-secret-key"
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 1Gi     # LiteLLM is OOMKilled below 1Gi under load
```

### Memory Requirement

LiteLLM requires a memory limit of at least 1Gi; with a 512Mi limit it is OOMKilled under load.

## Authentication

LiteLLM is protected by OAuth2-Proxy (ForwardAuth middleware). Users must be authenticated via Keycloak SSO to access the API through the ingress.

For internal service-to-service calls (within the cluster), use the `master_key` for authentication.
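An in-cluster call authenticated with the `master_key` can be sketched with the standard library alone; the service name and key below are the dev values used elsewhere on this page, and the request is only constructed, not sent:

```python
import json
import urllib.request

# Dev values from this page's examples.
LITELLM_URL = "http://akko-akko-litellm:4000/v1/chat/completions"
MASTER_KEY = "akko-dev-litellm-key"

req = urllib.request.Request(
    LITELLM_URL,
    data=json.dumps({
        "model": "qwen2.5:3b",
        "messages": [{"role": "user", "content": "ping"}],
    }).encode(),
    headers={
        "Authorization": f"Bearer {MASTER_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)
# Inside the cluster this bypasses OAuth2-Proxy, which guards only the ingress:
# body = urllib.request.urlopen(req).read()
```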

## Troubleshooting

### LiteLLM OOMKilled (CrashLoopBackOff)

**Symptoms:** The LiteLLM pod enters `CrashLoopBackOff`. `kubectl describe pod` shows `OOMKilled` as the last termination reason.

**Cause:** LiteLLM requires at least 1Gi of memory. If the memory limit is set below that threshold (e.g. 512Mi), the kernel OOM killer terminates the process.

**Solution:**

```shell
# Check current memory limits
kubectl get pod -n akko -l app.kubernetes.io/name=akko-litellm \
  -o jsonpath='{.items[0].spec.containers[0].resources}'

# Set the memory limit to at least 1Gi in your values file
# (akko-litellm.resources.limits.memory: 1Gi), or override it directly:
helm upgrade akko helm/akko/ -n akko -f helm/examples/values-dev.yaml \
  --set akko-litellm.resources.limits.memory=1Gi
```

### Ollama Backend Unreachable

**Symptoms:** API calls to LiteLLM return `Connection refused` or `502 Bad Gateway`. Logs show connection errors to the Ollama backend.

**Cause:** The Ollama pod is not running or not ready, or the `ollamaHost` value in the Helm chart does not match the actual Ollama service name.

**Solution:**

```shell
# Verify the Ollama pod is running
kubectl get pods -n akko -l app.kubernetes.io/name=akko-ollama

# Check the Ollama service endpoint
kubectl get svc -n akko | grep ollama

# Check LiteLLM logs for connection errors
kubectl logs -n akko deploy/akko-akko-litellm --tail=50

# Verify the ollamaHost value matches the service name
kubectl get cm -n akko -l app.kubernetes.io/name=akko-litellm -o yaml | grep -i ollama
```

### API Key Validation Failure

**Symptoms:** Requests return `401 Unauthorized` or `Authentication Error: Invalid API Key`. Internal services that previously worked start failing.

**Cause:** The `master_key` configured in LiteLLM does not match the key used by the calling service. This often happens after a Helm upgrade where the key was regenerated or overridden.

**Solution:**

```shell
# Check the current master_key in the LiteLLM secret
kubectl get secret -n akko -l app.kubernetes.io/name=akko-litellm -o yaml

# Verify notebook/service environment variables match
kubectl exec -n akko deploy/akko-jupyterhub -- env | grep LITELLM

# Restart LiteLLM after fixing the key
kubectl rollout restart -n akko deploy/akko-akko-litellm
```

### Model Not Found

**Symptoms:** API calls return `Model not found: <model-name>`. The LiteLLM UI shows no available models.

**Cause:** The model name in the request does not match any model configured in LiteLLM's routing config, or Ollama has not finished pulling the model yet.

**Solution:**

```shell
# List models available in Ollama
kubectl exec -n akko deploy/akko-akko-ollama -- ollama list

# Check LiteLLM logs for registered models
kubectl logs -n akko deploy/akko-akko-litellm --tail=100 | grep -i model

# Verify the LiteLLM config map lists the expected models
kubectl get cm -n akko -l app.kubernetes.io/name=akko-litellm -o yaml
```

### Slow Response Times

**Symptoms:** API calls take 30+ seconds to return. Timeouts occur on larger prompts. Notebooks appear to hang during AI-assisted tasks.

**Cause:** The Ollama backend is running on CPU (expected in dev), the model is too large for the available resources, or concurrent requests are queuing because Ollama processes them sequentially.

**Solution:**

```shell
# Check Ollama resource usage
kubectl top pod -n akko -l app.kubernetes.io/name=akko-ollama

# Check whether Ollama is CPU-bound (high CPU, no GPU)
kubectl describe pod -n akko -l app.kubernetes.io/name=akko-ollama | grep -A5 "Resources"

# Consider a smaller model (qwen2.5:3b instead of the 7B coder) for faster responses,
# or switch to vLLM with a GPU in production for higher throughput.
```
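Because Ollama serializes requests, one client-side mitigation is to send only code-heavy or long prompts to the 7B coder model and keep ordinary chat on the faster 3B model. A minimal sketch; the length threshold and routing rule are illustrative, not part of AKKO:

```python
# Route prompts to the cheapest adequate model (model names from the
# Configured Models table; the 2000-character threshold is an assumption).
def pick_model(prompt: str, code_task: bool = False) -> str:
    if code_task or len(prompt) > 2000:
        return "qwen2.5-coder:7b"  # larger and slower, better at code/SQL
    return "qwen2.5:3b"            # smaller and faster for plain chat
```

The chosen name is then passed as `model=` in `client.chat.completions.create`, so no other application code changes.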