
Whisper — Speech-to-Text

Whisper provides audio transcription (speech-to-text) for RAG pipelines and audio analysis in AKKO. It exposes an OpenAI-compatible API that transcribes audio files into text, fully offline with no external API calls.

Architecture

ai-service / ADEN / Cockpit
            |
      +-----v-------+
      |   Whisper   |  REST API (port 8000)
      | (Speech-to- |  WAV/MP3/M4A/FLAC/OGG -> Text
      |    Text)    |
      +-------------+

Supported Formats

| Input Format | Description |
| --- | --- |
| WAV | Uncompressed PCM audio |
| MP3 | MPEG Layer 3 compressed audio |
| M4A | AAC/ALAC compressed audio |
| FLAC | Lossless compressed audio |
| OGG | Vorbis/Opus compressed audio |
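Clients can pre-filter files by extension before submitting them for transcription. A minimal sketch, assuming the extension set mirrors the table above (the helper name is illustrative, not part of the service API):

```python
# Extensions accepted by the Whisper service, per the table above.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}

def is_supported_audio(path: str) -> bool:
    """Return True if the file's extension is one Whisper can transcribe."""
    dot = path.rfind(".")
    return dot != -1 and path[dot:].lower() in SUPPORTED_EXTENSIONS
```

This also works on S3 URIs, since only the trailing extension is inspected.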

Usage

From Trino (akko_ai_transcribe)

-- Transcribe a single audio file from object storage
SELECT akko_ai_transcribe('s3://akko-documents/meeting-2026-04.wav');

-- Transcribe all audio files and store results
SELECT
    file_path,
    akko_ai_transcribe(file_path) AS transcript
FROM iceberg.raw.audio_files;

From ai-service (REST API)

import httpx

# Upload a file (transcription can take minutes; allow a generous timeout)
with open("meeting.wav", "rb") as f:
    response = httpx.post(
        "http://akko-akko-ai-service:8000/v1/transcribe",
        files={"file": ("meeting.wav", f)},
        timeout=300,
    )
response.raise_for_status()
print(response.json()["text"])

# Or use an S3 URI
response = httpx.get(
    "http://akko-akko-ai-service:8000/v1/transcribe",
    params={"s3_uri": "s3://akko-documents/meeting.wav"},
    timeout=300,
)
response.raise_for_status()
print(response.json()["text"])
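The two call styles above differ only in how the audio source is passed. A small request-builder sketch that captures that distinction as pure logic (the function name and returned keys are illustrative, not part of the API):

```python
# Maps an input source to the HTTP call style used by /v1/transcribe above:
# S3 URIs go as a GET query parameter, local files as a multipart POST upload.
BASE_URL = "http://akko-akko-ai-service:8000"

def build_transcribe_request(source: str) -> dict:
    """Return the method and arguments for a given audio source."""
    if source.startswith("s3://"):
        return {
            "method": "GET",
            "url": f"{BASE_URL}/v1/transcribe",
            "params": {"s3_uri": source},
        }
    return {
        "method": "POST",
        "url": f"{BASE_URL}/v1/transcribe",
        "file_path": source,
    }
```

A caller can then dispatch to `httpx.get` or `httpx.post` based on the `method` field.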

From Notebooks

import requests

# Transcribe an audio file from object storage
resp = requests.get(
    "http://akko-akko-ai-service:8000/v1/transcribe",
    params={"s3_uri": "s3://akko-documents/interview.mp3"},
    timeout=300,
)
result = resp.json()
print(f"Language: {result['language']}")
print(f"Duration: {result['duration_seconds']}s")
print(f"Transcript:\n{result['text']}")

Health Check

curl http://akko-akko-whisper:8000/health

Airflow DAG

The akko_audio_transcription DAG runs every 15 minutes and automatically:

  1. Lists new audio files in the akko-documents S3 bucket
  2. Transcribes each file via the AI Service /v1/transcribe endpoint
  3. Stores the transcript in pgvector rag.documents (content_type=audio/transcript)
  4. Tracks processed files in rag.audio_transcription_tracking
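The interplay of steps 1 and 4 is what makes the DAG incremental: only files without a row in the tracking table are transcribed. A sketch of that selection step as pure logic (names are illustrative; the real DAG reads the tracked set from `rag.audio_transcription_tracking`):

```python
# Incremental selection: keep only bucket listings that have no tracking row.
def select_new_files(listed: list[str], already_processed: set[str]) -> list[str]:
    """Return audio files from the S3 listing not yet transcribed."""
    return [f for f in listed if f not in already_processed]
```

Each successful transcription then appends its file to the tracked set, so the next 15-minute run skips it.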

Configuration

Kubernetes (Helm)

akko-whisper:
  enabled: true
  image:
    repository: hwdsl2/whisper-server
    tag: "latest"  # Pin to a specific version in production
  whisperModel: "base"  # Options: tiny, base, small, medium, large
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: "2"
      memory: 2Gi

Whisper Model Selection

| Model | Size | Speed | Accuracy | Use Case |
| --- | --- | --- | --- | --- |
| tiny | 39 MB | Fastest | Low | Quick previews, development |
| base | 74 MB | Fast | Moderate | Default, good balance |
| small | 244 MB | Moderate | Good | Production with decent hardware |
| medium | 769 MB | Slow | High | High-quality transcription |
| large | 1.5 GB | Slowest | Highest | Maximum accuracy |
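One way to automate the trade-off in the table above is to pick the most accurate model whose download fits a size budget. A sketch using the figures from the table (the helper is illustrative; 1.5 GB is approximated as 1536 MB):

```python
# Download sizes in MB, from the model table above (1.5 GB ~ 1536 MB).
MODEL_SIZE_MB = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1536}
# Ordered from least to most accurate, per the table.
ACCURACY_ORDER = ["tiny", "base", "small", "medium", "large"]

def largest_model_within(size_budget_mb: int) -> str:
    """Pick the most accurate model whose download fits the budget."""
    fitting = [m for m in ACCURACY_ORDER if MODEL_SIZE_MB[m] <= size_budget_mb]
    if not fitting:
        raise ValueError("no Whisper model fits the given size budget")
    return fitting[-1]
```

Note that download size is not the same as resident memory; validate the memory limits separately (see below).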

Memory Requirement

The Whisper model is loaded into memory at startup. The base model requires ~256 Mi, while large requires ~2 Gi. Adjust resource limits accordingly.

Network Access

Whisper is an internal service with no internet access. It processes audio locally using CPU-based speech recognition. The NetworkPolicy restricts:

  • Ingress: Only ai-service, ADEN, and cockpit can reach port 8000
  • Egress: DNS only (no internet access)

RBAC

The akko_ai_transcribe Trino function is available to:

  • admin — Full access
  • engineer — Full access
  • analyst — Full access
  • steward — No access (governance-only role)
  • viewer — No access
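A service gating calls on the caller's role could mirror the list above with a simple membership check (a sketch; the function name and role strings as plain lowercase identifiers are assumptions):

```python
# Roles permitted to call akko_ai_transcribe, per the RBAC list above.
TRANSCRIBE_ROLES = {"admin", "engineer", "analyst"}

def can_transcribe(role: str) -> bool:
    """True if the given role may call akko_ai_transcribe."""
    return role in TRANSCRIBE_ROLES
```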

Troubleshooting

Whisper Pod CrashLoopBackOff (OOMKilled)

Symptoms: The Whisper pod enters CrashLoopBackOff status. kubectl describe pod shows OOMKilled as the last termination reason.

Cause: The selected Whisper model is too large for the configured memory limits.

Solution:

# Check current memory limits
kubectl get pod -n akko -l app.kubernetes.io/name=akko-whisper -o jsonpath='{.items[0].spec.containers[0].resources}'

# Use a smaller model or increase memory
helm upgrade akko helm/akko/ -n akko -f helm/examples/values-dev.yaml \
  --set akko-whisper.whisperModel=tiny \
  --set akko-whisper.resources.limits.memory=1Gi

Slow Transcription

Symptoms: Audio transcription takes several minutes for short files. CPU usage is at 100%.

Cause: Whisper uses CPU-based inference. Larger models and longer audio files require more processing time.

Solution:

# Check CPU allocation
kubectl top pod -n akko -l app.kubernetes.io/name=akko-whisper

# Use a smaller model for faster processing
helm upgrade akko helm/akko/ -n akko -f helm/examples/values-dev.yaml \
  --set akko-whisper.whisperModel=tiny

# Or increase CPU limits
helm upgrade akko helm/akko/ -n akko -f helm/examples/values-dev.yaml \
  --set akko-whisper.resources.limits.cpu=4

Empty Transcription Results

Symptoms: The /v1/transcribe endpoint returns {"status": "error", "error": "Could not transcribe audio"}.

Cause: The audio file may be corrupted, in an unsupported format, or contain only silence.

Solution:

# Check Whisper pod logs
kubectl logs -n akko -l app.kubernetes.io/name=akko-whisper --tail=50

# Verify the Whisper service is reachable from ai-service
kubectl exec -n akko deploy/akko-akko-ai-service -- \
  curl -s "http://akko-akko-whisper:8000/health"