# vLLM — High-Throughput LLM Inference

vLLM is a high-throughput LLM inference server designed for production GPU environments. It replaces Ollama when GPU hardware is available, serving an OpenAI-compatible API with continuous batching and optimized GPU memory utilization. When vLLM is enabled, LiteLLM routes to it automatically.
## Architecture

```text
Notebooks / Cockpit / Applications
              |
       +------v-------+
       |   LiteLLM    |  OpenAI-compatible API (port 4000)
       | (AI Gateway) |
       +------+-------+
              |
       +------v-------+
       |     vLLM     |  GPU inference (port 8000)
       +--------------+
        Qwen/Qwen2.5-7B-Instruct
```
In development (no GPU), LiteLLM routes to Ollama instead. In production (GPU available), LiteLLM routes to vLLM. The switch is transparent — application code never changes.
- **Dev:** LiteLLM → Ollama (CPU, multiple small models)
- **Prod:** LiteLLM → vLLM (GPU, single high-throughput model)
## URLs

| Mode | URL |
|---|---|
| Kubernetes (production) | Internal only — accessed via LiteLLM gateway |

vLLM does not have its own ingress. All requests go through LiteLLM (`https://llm.akko.local`), which routes to vLLM internally.
## Usage

### From Notebooks (via LiteLLM)

No code changes are needed. The same LiteLLM API works whether the backend is Ollama or vLLM:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://akko-akko-litellm:4000/v1",
    api_key="akko-dev-litellm-key",
)

response = client.chat.completions.create(
    model="akko-coder",  # routed to vLLM in production
    messages=[{"role": "user", "content": "Write a SQL query for customer segmentation"}],
)
print(response.choices[0].message.content)
```
### Direct vLLM API (internal)

For services that need to bypass LiteLLM:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://akko-akko-vllm:8000/v1",
    api_key="EMPTY",  # vLLM does not require auth internally
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain Apache Iceberg"}],
)
print(response.choices[0].message.content)
```
## Configuration

### Dev Environment (`values-dev.yaml`)

vLLM is disabled in dev; Ollama handles inference on CPU.
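A minimal sketch of the corresponding override (the field name matches the Helm value used elsewhere in this page; the exact file layout is an assumption):

```yaml
# values-dev.yaml (sketch)
akko-vllm:
  enabled: false   # no GPU in dev; LiteLLM routes to Ollama instead
```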
### Production Environment (`values-production.yaml`)

vLLM is enabled with GPU resources:

```yaml
akko-vllm:
  enabled: true
  replicaCount: 1
  model:
    name: "Qwen/Qwen2.5-7B-Instruct"
    maxModelLen: 4096
    gpuMemoryUtilization: 0.9

akko-litellm:
  vllmBackend:
    enabled: true
    host: "akko-akko-vllm"
    port: 8000
    model: "Qwen/Qwen2.5-7B-Instruct"
```
## Model Configuration

| Parameter | Default | Description |
|---|---|---|
| `model.name` | `Qwen/Qwen2.5-7B-Instruct` | HuggingFace model ID |
| `model.maxModelLen` | `4096` | Maximum sequence length (tokens) |
| `model.gpuMemoryUtilization` | `0.9` | Fraction of GPU memory to use (0.0-1.0) |
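For reference, these settings correspond to vLLM's own server flags; a standalone invocation with the same values would look like this (the flag names are vLLM's, but the claim that the chart forwards these exact values is an assumption):

```bash
# Standalone equivalent of the chart's model settings (sketch)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --port 8000
```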
## Key Features
| Feature | Description |
|---|---|
| Continuous Batching | Dynamically batches incoming requests for maximum throughput |
| PagedAttention | Efficient GPU memory management, serves more concurrent requests |
| OpenAI-Compatible API | Drop-in replacement — same API as OpenAI, GPT4All, Ollama |
| HuggingFace Models | Load any model from HuggingFace Hub by ID |
| GPU Memory Optimization | Configurable utilization ratio to balance throughput and model size |
| Tensor Parallelism | Scale across multiple GPUs for larger models |
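To make the PagedAttention row concrete, here is a toy Python sketch of the core idea (not vLLM's actual code): KV-cache memory is handed out in fixed-size blocks as sequences grow, rather than reserving a full `maxModelLen`-sized slab per sequence up front. The block size of 16 matches vLLM's default; everything else is illustrative.

```python
# Toy sketch of PagedAttention's allocation idea (not vLLM's real code).
BLOCK_SIZE = 16  # tokens per block; 16 is also vLLM's default block size


class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical blocks

    def append_token(self, seq_id: str, pos: int) -> None:
        """Claim a new physical block only at each block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:  # first token of a new block
            table.append(self.free_blocks.pop())


cache = PagedKVCache(num_blocks=8)
for pos in range(40):  # a 40-token sequence
    cache.append_token("seq-0", pos)

# 40 tokens occupy ceil(40 / 16) = 3 blocks, leaving 5 free for other sequences
print(len(cache.block_tables["seq-0"]), len(cache.free_blocks))  # 3 5
```

Because short sequences only hold the blocks they actually use, far more concurrent requests fit in the same VRAM than with contiguous per-sequence allocation.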
## Dev vs Production

| Aspect | Dev (Ollama) | Production (vLLM) |
|---|---|---|
| Hardware | CPU (Apple Silicon / x86) | NVIDIA GPU (CUDA) |
| Models | Multiple small models (3B-7B) | Single optimized model (7B+) |
| Throughput | Low (1-5 tokens/s) | High (50-200+ tokens/s) |
| Batching | Sequential | Continuous batching |
| Memory | System RAM | GPU VRAM (16 Gi+ recommended) |
| Helm value | `akko-vllm.enabled: false` | `akko-vllm.enabled: true` |
## Resource Requirements
| Resource | Minimum | Recommended |
|---|---|---|
| GPU | 1x NVIDIA (16 GB VRAM) | 1x NVIDIA A100 (40/80 GB) |
| RAM | 8 Gi | 16 Gi |
| GPU Memory Utilization | 0.8 | 0.9 |
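The 16 GB VRAM floor follows from the model weights alone; a quick estimate (the ~7.6B parameter count for Qwen2.5-7B-Instruct is an approximation from its published model card):

```python
# Rough VRAM needed just to hold the weights in fp16/bf16 (2 bytes per parameter)
params = 7.6e9  # Qwen2.5-7B-Instruct, approximate parameter count
weight_gib = params * 2 / 2**30
print(f"{weight_gib:.1f} GiB")  # ~14.2 GiB, close to a 16 GB card's limit
```

Everything beyond the weights (KV cache, activations) must fit in the remaining fraction, which is why `gpuMemoryUtilization` and `maxModelLen` trade off against each other on smaller cards.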
**GPU Required:** vLLM requires an NVIDIA GPU with CUDA support. It will not start on CPU-only nodes. For development without a GPU, use Ollama instead (`akko-vllm.enabled: false`).
**Model Download:** On first startup, vLLM downloads the model from HuggingFace Hub. For Qwen/Qwen2.5-7B-Instruct, this is approximately 15 GB. Ensure sufficient disk space and network bandwidth.
## Healthcheck

vLLM exposes a `/health` endpoint used by Kubernetes probes:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120  # model loading takes time
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
```
**Slow Startup:** vLLM takes 60-120 seconds to load a 7B model into GPU memory. The `initialDelaySeconds` values are set accordingly to avoid premature restarts.
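An alternative to inflating `initialDelaySeconds` is a Kubernetes `startupProbe`, which holds off the liveness probe until the first success; a sketch (whether the chart exposes this knob is an assumption):

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 30   # tolerates up to ~300 s of model loading
```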
## Troubleshooting

### GPU Not Detected

**Symptoms:** The vLLM pod fails to start. Logs show `RuntimeError: No CUDA GPUs are available` or `torch.cuda.is_available()` returns `False`. The pod enters `CrashLoopBackOff`.

**Cause:** The Kubernetes node does not have an NVIDIA GPU, the NVIDIA device plugin is not installed, or the pod spec is missing the `nvidia.com/gpu` resource request.

**Solution:**

```bash
# Check if the NVIDIA device plugin is running
kubectl get pods -n kube-system | grep nvidia

# Verify GPU resources are advertised by nodes
kubectl describe nodes | grep -A5 "nvidia.com/gpu"

# Check vLLM pod events for scheduling failures
kubectl describe pod -n akko -l app.kubernetes.io/name=akko-vllm | grep -A10 "Events"

# Verify the pod requests GPU resources
kubectl get pod -n akko -l app.kubernetes.io/name=akko-vllm -o jsonpath='{.items[0].spec.containers[0].resources}'

# If no GPU is available, disable vLLM and use Ollama instead
# helm upgrade akko helm/akko/ -n akko --set akko-vllm.enabled=false
```
### CUDA Version Mismatch

**Symptoms:** The vLLM pod starts but crashes immediately. Logs show `CUDA driver version is insufficient for CUDA runtime version` or `CUDA error: no kernel image is available for execution on the device`.

**Cause:** The CUDA runtime version in the vLLM container image does not match the CUDA driver version installed on the host node. This typically happens with older GPU drivers or newer container images.

**Solution:**

```bash
# Check the host CUDA driver version
kubectl exec -n akko deploy/akko-akko-vllm -- nvidia-smi 2>/dev/null || echo "nvidia-smi not available"

# Check the CUDA version reported by the vLLM container
kubectl logs -n akko deploy/akko-akko-vllm --tail=20 | grep -i "cuda\|driver"

# Verify GPU compute capability matches the model requirements
kubectl exec -n akko deploy/akko-akko-vllm -- python3 -c "import torch; print(torch.cuda.get_device_capability())" 2>/dev/null

# Fixes:
# 1. Update the NVIDIA driver on the host node
# 2. Use a vLLM image built for your CUDA version
# 3. Pin the vLLM image tag to a version compatible with your driver
```
### Model Download Timeout

**Symptoms:** The vLLM pod stays in `Running` state but never becomes `Ready`. Logs show `Downloading model...` with slow or stalled progress. The readiness probe eventually fails and the pod restarts.

**Cause:** The HuggingFace model download is slow due to network bandwidth limitations, a proxy blocking large downloads, or a missing `HF_TOKEN` for gated models.

**Solution:**

```bash
# Check download progress in the logs
kubectl logs -n akko deploy/akko-akko-vllm --tail=30 | grep -i "download\|model\|huggingface"

# Verify network connectivity from the pod
kubectl exec -n akko deploy/akko-akko-vllm -- curl -sI https://huggingface.co

# Increase the readiness probe timeout to allow for slow downloads
# In values: initialDelaySeconds: 300, failureThreshold: 30

# For gated models, ensure HF_TOKEN is set
kubectl get secret -n akko -l app.kubernetes.io/name=akko-vllm -o yaml | grep HF_TOKEN

# Consider pre-downloading the model to a PVC to avoid repeated downloads
```
### Out of VRAM

**Symptoms:** vLLM crashes with `torch.cuda.OutOfMemoryError: CUDA out of memory` or `ValueError: The model's max seq len is too large`. The pod restarts repeatedly.

**Cause:** The model requires more GPU VRAM than is available. The `gpuMemoryUtilization` ratio is set too high (leaving no room for the KV cache), or `maxModelLen` is too large for the available memory.

**Solution:**

```bash
# Check GPU memory usage
kubectl exec -n akko deploy/akko-akko-vllm -- nvidia-smi 2>/dev/null

# Check the current configuration
kubectl get cm -n akko -l app.kubernetes.io/name=akko-vllm -o yaml | grep -i "memory\|model_len\|gpu"

# Reduce GPU memory utilization and max sequence length, e.g.:
#   akko-vllm.model.gpuMemoryUtilization: 0.8 (down from 0.9)
#   akko-vllm.model.maxModelLen: 2048 (down from 4096)
helm upgrade akko helm/akko/ -n akko -f helm/examples/values-dev.yaml \
  --set akko-vllm.model.gpuMemoryUtilization=0.8 \
  --set akko-vllm.model.maxModelLen=2048

# Or switch to a smaller model (e.g., 3B instead of 7B)
```
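To see why lowering `maxModelLen` helps, estimate the KV-cache footprint per sequence. The architecture numbers below (28 layers, 4 KV heads via grouped-query attention, head dim 128, fp16) are taken from Qwen2.5-7B's published config and should be treated as assumptions:

```python
# Per-token KV-cache size: K and V tensors, per layer, per KV head, in fp16
num_layers, num_kv_heads, head_dim, dtype_bytes = 28, 4, 128, 2
kv_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(kv_per_token // 1024, "KiB per token")        # 56 KiB

# One full-length sequence at maxModelLen 4096 vs 2048
print(kv_per_token * 4096 // 2**20, "MiB at 4096")  # 224 MiB
print(kv_per_token * 2048 // 2**20, "MiB at 2048")  # 112 MiB
```

Halving `maxModelLen` halves the worst-case cache per sequence, so the same VRAM headroom admits more concurrent requests; models without grouped-query attention cache one KV pair per attention head and can need several times more.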
### Slow Startup (Model Loading)

**Symptoms:** The vLLM pod is `Running` but not `Ready` for several minutes. The health endpoint returns errors. LiteLLM routes fail because the vLLM backend is not yet available.

**Cause:** Loading a 7B+ model into GPU memory takes 60-120+ seconds. The default `initialDelaySeconds` may be insufficient for larger models or slower hardware.

**Solution:**

```bash
# Check pod age and readiness status
kubectl get pods -n akko -l app.kubernetes.io/name=akko-vllm -o wide

# Follow startup logs in real time
kubectl logs -n akko deploy/akko-akko-vllm -f --tail=20

# Check if the health endpoint responds
kubectl exec -n akko deploy/akko-akko-vllm -- curl -s http://localhost:8000/health || echo "Not ready yet"

# If the pod keeps restarting due to probe failures, increase the delays
# In values: livenessProbe.initialDelaySeconds: 300
# In values: readinessProbe.initialDelaySeconds: 120, failureThreshold: 30

# Verify LiteLLM handles vLLM startup gracefully
kubectl logs -n akko deploy/akko-akko-litellm --tail=20 | grep -i "vllm\|backend\|error"
```