MLflow — Experiment Tracking & Model Registry

MLflow provides ML experiment tracking and a model registry for AKKO. Data scientists log experiments from JupyterHub notebooks, compare runs, and register production-ready models — all with metadata stored in PostgreSQL and artifacts stored in object storage (S3-compatible).

Architecture

JupyterHub (notebooks)          Airflow (DAGs)
          \                          /
           v                        v
      +----------------------------+
      |       MLflow (:5000)       |
      |    Tracking Server + UI    |
      +------+--------------+------+
             |              |
             v              v
 +------------------+  +-------------------------------+
 |  PostgreSQL      |  |  Object storage (S3)          |
 |  (metadata:      |  |  s3://akko-warehouse/mlflow/  |
 |   experiments,   |  |  (artifacts: models,          |
 |   runs, params,  |  |   datasets, plots)            |
 |   metrics)       |  +-------------------------------+
 +------------------+
  • PostgreSQL stores experiment metadata (runs, parameters, metrics, tags)
  • object storage stores artifacts (model binaries, datasets, plots) at s3://akko-warehouse/mlflow/
  • MLflow UI provides experiment comparison, metric visualization, and model versioning
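The metadata/artifact split also shows up in artifact URIs: with the artifact root above, MLflow's default layout places each run's files under <root>/<experiment_id>/<run_id>/artifacts. A small sketch of that layout (the helper function is ours, not an MLflow API):

```python
ARTIFACT_ROOT = "s3://akko-warehouse/mlflow"

def run_artifact_root(experiment_id: str, run_id: str) -> str:
    """Where MLflow's default layout places a run's artifacts under the root."""
    return f"{ARTIFACT_ROOT}/{experiment_id}/{run_id}/artifacts"

# Example: artifacts for run "abc123" in experiment "7"
print(run_artifact_root("7", "abc123"))
# s3://akko-warehouse/mlflow/7/abc123/artifacts
```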

URLs

Mode URL
Kubernetes (k3d) https://experiments.akko.local

Usage

From Notebooks (JupyterHub)

The MLFLOW_TRACKING_URI environment variable is pre-configured in the JupyterHub spawner. No setup required:

import mlflow

# Automatic — MLFLOW_TRACKING_URI is set by the spawner
mlflow.set_experiment("banking-risk-model")

with mlflow.start_run():
    mlflow.log_param("algorithm", "random_forest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)

    # Log the trained model
    mlflow.sklearn.log_model(model, "model")

Experiment Comparison

import mlflow

# Search runs across experiments
runs = mlflow.search_runs(
    experiment_names=["banking-risk-model"],
    order_by=["metrics.accuracy DESC"],
    max_results=10,
)
print(runs[["params.algorithm", "metrics.accuracy", "metrics.f1_score"]])
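Picking the best run from a search result reduces to a max over one metric column. A sketch with plain dicts standing in for the flattened rows that search_runs returns (the helper name and sample data are ours):

```python
def best_run(runs: list, metric: str) -> dict:
    """Return the run with the highest value for the given metric key."""
    return max(runs, key=lambda r: r.get(metric, float("-inf")))

# Stand-in data shaped like flattened search_runs rows:
candidates = [
    {"run_id": "a1", "metrics.accuracy": 0.91},
    {"run_id": "b2", "metrics.accuracy": 0.94},
    {"run_id": "c3", "metrics.accuracy": 0.89},
]
print(best_run(candidates, "metrics.accuracy")["run_id"])  # b2
```

With a pandas DataFrame (the default return type), the equivalent is `runs.loc[runs["metrics.accuracy"].idxmax()]`.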

Model Registry

import mlflow

# Register the best model
result = mlflow.register_model(
    model_uri="runs:/<run-id>/model",
    name="banking-risk-classifier",
)

# Transition to production
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="banking-risk-classifier",
    version=result.version,
    stage="Production",
)
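Once registered, the model can be loaded by stage through a models:/ URI instead of a run ID. A minimal sketch (the helper is ours; the load call needs a reachable tracking server, so it is shown commented out):

```python
def registry_uri(name: str, stage: str = "Production") -> str:
    """Build a models:/ URI that resolves to the latest version in a stage."""
    return f"models:/{name}/{stage}"

uri = registry_uri("banking-risk-classifier")
print(uri)  # models:/banking-risk-classifier/Production

# In a notebook with MLFLOW_TRACKING_URI set:
# import mlflow
# model = mlflow.pyfunc.load_model(uri)
# predictions = model.predict(input_df)
```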

From Airflow DAGs

Airflow workers have MLFLOW_TRACKING_URI set, so DAGs can log runs and register models:

import mlflow

mlflow.set_experiment("airflow-etl-quality")

with mlflow.start_run(run_name="daily-etl"):
    mlflow.log_metric("rows_processed", 125000)
    mlflow.log_metric("data_quality_score", 0.98)

Configuration

Kubernetes (Helm)

akko-mlflow:
  enabled: true
  image:
    repository: localhost:5050/akko-mlflow   # custom image (k3d: k3d-akko-registry:5050/akko-mlflow)
    tag: "2026.03"
  database:
    host: "akko-postgresql"
    name: "mlflow"
  artifacts:
    root: "s3://akko-warehouse/mlflow/"
    s3Endpoint: "http://akko-minio:9000"
  ingress:
    host: "experiments.akko.local"
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi     # Dev override: 768Mi

Environment Variables (Notebooks & Airflow)

Variable                 Value                          Purpose
MLFLOW_TRACKING_URI      http://akko-akko-mlflow:5000   Points notebooks/DAGs to the MLflow server
MLFLOW_S3_ENDPOINT_URL   http://akko-minio:9000         Artifact storage endpoint
AWS_ACCESS_KEY_ID        (from secret)                  Object storage credentials for artifact access
AWS_SECRET_ACCESS_KEY    (from secret)                  Object storage credentials for artifact access

Key Features

Feature               Description
Experiment Tracking   Log parameters, metrics, and tags for every run
Model Registry        Version models, transition stages (Staging/Production/Archived)
Artifact Storage      Store models, datasets, and plots on object storage (S3-compatible)
UI                    Compare experiments, visualize metrics, browse artifacts
Autologging           mlflow.autolog() for scikit-learn, PyTorch, XGBoost, LightGBM

Authentication

MLflow is protected by OAuth2-Proxy (ForwardAuth middleware). Users must be authenticated via Keycloak SSO to access the UI through the ingress.

For internal service-to-service calls (within the cluster), no authentication is required — services connect directly to http://akko-akko-mlflow:5000.
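A service running in the cluster can therefore point its MLflow client straight at the service DNS name. Sketch (the helper is ours; the defaults match the address above):

```python
def in_cluster_tracking_uri(host: str = "akko-akko-mlflow", port: int = 5000) -> str:
    """Direct, unauthenticated in-cluster address for the tracking server."""
    return f"http://{host}:{port}"

print(in_cluster_tracking_uri())  # http://akko-akko-mlflow:5000

# import mlflow
# mlflow.set_tracking_uri(in_cluster_tracking_uri())
```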


Healthcheck

MLflow exposes a /health endpoint used by Kubernetes probes:

livenessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 15
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 10

Resource Requirements

Component       Minimum RAM   Recommended
MLflow server   256 Mi        512 Mi

PostgreSQL dependency

MLflow requires the mlflow database to exist in PostgreSQL. The akko-init job creates it automatically on first deployment.


Troubleshooting

Artifact Storage Failure (Object Storage Connection)

Symptoms: mlflow.log_artifact() or mlflow.sklearn.log_model() raises ClientError: An error occurred (NoSuchBucket) or ConnectionRefusedError. The MLflow UI shows runs but artifacts are missing.

Cause: object storage is unreachable from the MLflow pod, the akko-warehouse bucket does not exist, or the S3 credentials (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) are incorrect.

Solution:

# Verify object storage pod is running
kubectl get pods -n akko -l app.kubernetes.io/name=minio

# Check if the bucket exists
kubectl exec -n akko deploy/akko-minio -- mc ls local/akko-warehouse/mlflow/ 2>/dev/null || echo "Bucket or path missing"

# Verify MLflow environment variables
kubectl exec -n akko deploy/akko-akko-mlflow -- env | grep -E "AWS_|S3_ENDPOINT|ARTIFACT"

# Check MLflow logs for S3 errors
kubectl logs -n akko deploy/akko-akko-mlflow --tail=50 | grep -i "s3\|minio\|artifact\|bucket"

Database Connection Failure (PostgreSQL)

Symptoms: MLflow pod enters CrashLoopBackOff. Logs show OperationalError: could not connect to server or FATAL: database "mlflow" does not exist.

Cause: The PostgreSQL pod is not ready, the mlflow database was not created by the init job, or the database credentials in the MLflow configuration are incorrect.

Solution:

# Check PostgreSQL pod status
kubectl get pods -n akko -l app.kubernetes.io/name=akko-postgresql

# Verify the mlflow database exists
kubectl exec -n akko deploy/akko-postgresql -- psql -U postgres -c "\l" | grep mlflow

# If missing, create it manually
kubectl exec -n akko deploy/akko-postgresql -- psql -U postgres -c "CREATE DATABASE mlflow;"

# Check MLflow database connection string
kubectl logs -n akko deploy/akko-akko-mlflow --tail=50 | grep -i "database\|postgres\|connection"

# Restart MLflow after fixing the database
kubectl rollout restart -n akko deploy/akko-akko-mlflow

Experiment Tracking Failures

Symptoms: mlflow.set_experiment() or mlflow.start_run() raises MlflowException: API request failed from notebooks. The MLflow UI returns a 500 error.

Cause: The MLFLOW_TRACKING_URI environment variable in the notebook spawner points to a wrong or unreachable address, or the MLflow server is overloaded / restarting.

Solution:

# Verify MLflow pod is healthy
kubectl get pods -n akko -l app.kubernetes.io/name=akko-mlflow
kubectl exec -n akko deploy/akko-akko-mlflow -- curl -s http://localhost:5000/health

# Check the tracking URI configured in JupyterHub
kubectl exec -n akko deploy/akko-jupyterhub -- env | grep MLFLOW_TRACKING_URI

# Test connectivity from a notebook pod
kubectl exec -n akko $(kubectl get pod -n akko -l app=jupyterhub -o jsonpath='{.items[0].metadata.name}') \
  -- curl -s http://akko-akko-mlflow:5000/api/2.0/mlflow/experiments/search

# Check MLflow server logs
kubectl logs -n akko deploy/akko-akko-mlflow --tail=100

Model Registry Permission Errors

Symptoms: mlflow.register_model() raises RestException: RESOURCE_ALREADY_EXISTS or PERMISSION_DENIED. Model version transitions fail silently.

Cause: A model with the same name already exists in a different experiment, or the MLflow server is running in read-only mode due to a database migration issue.

Solution:

# List registered models via the API
kubectl exec -n akko deploy/akko-akko-mlflow -- \
  curl -s http://localhost:5000/api/2.0/mlflow/registered-models/search | python3 -m json.tool

# Check for database migration issues
kubectl logs -n akko deploy/akko-akko-mlflow --tail=100 | grep -i "migration\|alembic\|upgrade"

# If the database schema is outdated, trigger a manual upgrade.
# Run the command through a shell inside the pod so $POSTGRES_PASSWORD
# expands from the pod's environment, not your local one:
kubectl exec -n akko deploy/akko-akko-mlflow -- sh -c \
  'mlflow db upgrade "postgresql://postgres:${POSTGRES_PASSWORD}@akko-postgresql:5432/mlflow"'

MLflow UI Not Loading (502 Bad Gateway)

Symptoms: Navigating to https://experiments.akko.local returns a 502 error. The pod is running but the ingress cannot reach it.

Cause: The readiness probe has not passed yet (MLflow is still initializing), the service port mapping is incorrect, or OAuth2-Proxy is failing to authenticate.

Solution:

# Check pod readiness
kubectl get pods -n akko -l app.kubernetes.io/name=akko-mlflow -o wide

# Test the health endpoint directly
kubectl exec -n akko deploy/akko-akko-mlflow -- curl -s http://localhost:5000/health

# Check ingress configuration
kubectl get ingress -n akko | grep mlflow
kubectl describe ingress -n akko akko-akko-mlflow

# Check OAuth2-Proxy logs if authentication is failing
kubectl logs -n akko deploy/akko-oauth2-proxy --tail=50 | grep -i mlflow