MLflow — Experiment Tracking & Model Registry¶
MLflow provides ML experiment tracking and a model registry for AKKO. Data scientists log experiments from JupyterHub notebooks, compare runs, and register production-ready models — all with metadata stored in PostgreSQL and artifacts stored in object storage (S3-compatible).
Architecture¶
```
JupyterHub (notebooks)       Airflow (DAGs)
           \                    /
        +---v------------------v---+
        |      MLflow (:5000)      |
        |   Tracking Server + UI   |
        +-----+--------------+-----+
              |              |
+-------------v--+   +-------v--------------+
| PostgreSQL     |   | object storage (S3)  |
| (metadata:     |   | s3://akko-warehouse/ |
| experiments,   |   |   mlflow/            |
| runs, params,  |   | (artifacts: models,  |
| metrics)       |   |  datasets, plots)    |
+----------------+   +----------------------+
```
- PostgreSQL stores experiment metadata (runs, parameters, metrics, tags)
- Object storage stores artifacts (model binaries, datasets, plots) at s3://akko-warehouse/mlflow/
- MLflow UI provides experiment comparison, metric visualization, and model versioning
URLs¶
| Mode | URL |
|---|---|
| Kubernetes (k3d) | https://experiments.akko.local |
Usage¶
From Notebooks (JupyterHub)¶
The MLFLOW_TRACKING_URI environment variable is pre-configured in the JupyterHub spawner. No setup required:
```python
import mlflow

# Automatic: MLFLOW_TRACKING_URI is set by the spawner
mlflow.set_experiment("banking-risk-model")

with mlflow.start_run():
    mlflow.log_param("algorithm", "random_forest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("f1_score", 0.91)

    # Log the trained model
    mlflow.sklearn.log_model(model, "model")
```
Experiment Comparison¶
```python
import mlflow

# Search runs across experiments
runs = mlflow.search_runs(
    experiment_names=["banking-risk-model"],
    order_by=["metrics.accuracy DESC"],
    max_results=10,
)
print(runs[["params.algorithm", "metrics.accuracy", "metrics.f1_score"]])
```
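The DataFrame above can also be post-processed in code, for example to pick the top run programmatically. A minimal sketch, assuming the results are converted to plain dictionaries (`best_run` is a hypothetical helper, not part of the MLflow API):

```python
# Hypothetical helper: pick the best run from search results represented
# as plain dicts, e.g. runs.to_dict("records") on the DataFrame returned
# by mlflow.search_runs().
def best_run(candidates, metric="metrics.accuracy"):
    """Return the candidate dict with the highest value for `metric`."""
    return max(candidates, key=lambda r: r.get(metric, float("-inf")))

candidates = [
    {"run_id": "a1", "metrics.accuracy": 0.91},
    {"run_id": "b2", "metrics.accuracy": 0.94},
    {"run_id": "c3", "metrics.accuracy": 0.89},
]
print(best_run(candidates)["run_id"])  # b2
```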
Model Registry¶
```python
import mlflow

# Register the best model
result = mlflow.register_model(
    model_uri="runs:/<run-id>/model",
    name="banking-risk-classifier",
)

# Transition to production
client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="banking-risk-classifier",
    version=result.version,
    stage="Production",
)
```
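Once a version is in the Production stage, consumers can load it by stage rather than by run ID using a `models:/` URI. A sketch (`stage_uri` is a hypothetical helper; the actual load call needs a reachable tracking server, so it is shown commented out):

```python
def stage_uri(name: str, stage: str = "Production") -> str:
    """Hypothetical helper: build a registry URI like models:/<name>/<stage>."""
    return f"models:/{name}/{stage}"

uri = stage_uri("banking-risk-classifier")
print(uri)  # models:/banking-risk-classifier/Production

# With MLFLOW_TRACKING_URI set (e.g. in a notebook or DAG):
# import mlflow
# model = mlflow.pyfunc.load_model(uri)
# predictions = model.predict(input_df)
```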
From Airflow DAGs¶
Airflow workers have MLFLOW_TRACKING_URI set, so DAGs can log runs and register models:
```python
import mlflow

mlflow.set_experiment("airflow-etl-quality")

with mlflow.start_run(run_name="daily-etl"):
    mlflow.log_metric("rows_processed", 125000)
    mlflow.log_metric("data_quality_score", 0.98)
```
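In a real DAG the metric values would come out of the ETL step rather than being hard-coded. A sketch of how the quality score might be derived before logging (`data_quality_score` is a hypothetical helper, not part of any AKKO DAG; the MLflow calls are shown commented out since they need a reachable tracking server):

```python
def data_quality_score(valid_rows: int, total_rows: int) -> float:
    """Hypothetical helper: fraction of rows passing validation checks."""
    if total_rows == 0:
        return 0.0
    return round(valid_rows / total_rows, 4)

score = data_quality_score(valid_rows=122_500, total_rows=125_000)
print(score)  # 0.98

# Inside the DAG task (MLFLOW_TRACKING_URI is set on Airflow workers):
# import mlflow
# mlflow.set_experiment("airflow-etl-quality")
# with mlflow.start_run(run_name="daily-etl"):
#     mlflow.log_metric("rows_processed", 125_000)
#     mlflow.log_metric("data_quality_score", score)
```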
Configuration¶
Kubernetes (Helm)¶
```yaml
akko-mlflow:
  enabled: true
  image:
    repository: localhost:5050/akko-mlflow  # custom image (k3d: k3d-akko-registry:5050/akko-mlflow)
    tag: "2026.03"
  database:
    host: "akko-postgresql"
    name: "mlflow"
  artifacts:
    root: "s3://akko-warehouse/mlflow/"
    s3Endpoint: "http://akko-minio:9000"
  ingress:
    host: "experiments.akko.local"
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi  # Dev override: 768Mi
```
Environment Variables (Notebooks & Airflow)¶
| Variable | Value | Purpose |
|---|---|---|
| MLFLOW_TRACKING_URI | http://akko-akko-mlflow:5000 | Points notebooks/DAGs to the MLflow server |
| MLFLOW_S3_ENDPOINT_URL | http://akko-minio:9000 | Artifact storage endpoint |
| AWS_ACCESS_KEY_ID | (from secret) | Object storage credentials for artifact access |
| AWS_SECRET_ACCESS_KEY | (from secret) | Object storage credentials for artifact access |
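When working outside the cluster (for example against kubectl port-forward tunnels), these variables are not injected automatically and must be set by hand. A sketch, assuming localhost forward ports for MLflow (5000) and the object store (9000) and placeholder credentials:

```python
import os

# Mirrors the in-cluster settings from the table above; the localhost
# ports assume kubectl port-forward tunnels to MLflow and the object store.
os.environ["MLFLOW_TRACKING_URI"] = "http://localhost:5000"
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://localhost:9000"
# Real credentials come from the cluster secret; placeholders here.
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-key>"

print(os.environ["MLFLOW_TRACKING_URI"])
```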
Key Features¶
| Feature | Description |
|---|---|
| Experiment Tracking | Log parameters, metrics, and tags for every run |
| Model Registry | Version models, transition stages (Staging/Production/Archived) |
| Artifact Storage | Store models, datasets, and plots on object storage (S3-compatible) |
| UI | Compare experiments, visualize metrics, browse artifacts |
| Autologging | mlflow.autolog() for scikit-learn, PyTorch, XGBoost, LightGBM |
Authentication¶
MLflow is protected by OAuth2-Proxy (ForwardAuth middleware). Users must be authenticated via Keycloak SSO to access the UI through the ingress.
For internal service-to-service calls (within the cluster), no authentication is required — services connect directly to http://akko-akko-mlflow:5000.
Healthcheck¶
MLflow exposes a /health endpoint used by Kubernetes probes:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 15
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 10
```
Resource Requirements¶
| Component | Minimum RAM | Recommended |
|---|---|---|
| MLflow server | 256 Mi | 512 Mi |
**PostgreSQL dependency:** MLflow requires the mlflow database to exist in PostgreSQL. The akko-init job creates it automatically on first deployment.
Troubleshooting¶
Artifact Storage Failure (Object Storage Connection)¶
Symptoms: mlflow.log_artifact() or mlflow.sklearn.log_model() raises ClientError: An error occurred (NoSuchBucket) or ConnectionRefusedError. The MLflow UI shows runs but artifacts are missing.
Cause: object storage is unreachable from the MLflow pod, the akko-warehouse bucket does not exist, or the S3 credentials (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) are incorrect.
Solution:
```shell
# Verify object storage pod is running
kubectl get pods -n akko -l app.kubernetes.io/name=minio

# Check if the bucket exists
kubectl exec -n akko deploy/akko-minio -- mc ls local/akko-warehouse/mlflow/ 2>/dev/null || echo "Bucket or path missing"

# Verify MLflow environment variables
kubectl exec -n akko deploy/akko-akko-mlflow -- env | grep -E "AWS_|S3_ENDPOINT|ARTIFACT"

# Check MLflow logs for S3 errors
kubectl logs -n akko deploy/akko-akko-mlflow --tail=50 | grep -i "s3\|minio\|artifact\|bucket"
```
Database Connection Failure (PostgreSQL)¶
Symptoms: MLflow pod enters CrashLoopBackOff. Logs show OperationalError: could not connect to server or FATAL: database "mlflow" does not exist.
Cause: The PostgreSQL pod is not ready, the mlflow database was not created by the init job, or the database credentials in the MLflow configuration are incorrect.
Solution:
```shell
# Check PostgreSQL pod status
kubectl get pods -n akko -l app.kubernetes.io/name=akko-postgresql

# Verify the mlflow database exists
kubectl exec -n akko deploy/akko-postgresql -- psql -U postgres -c "\l" | grep mlflow

# If missing, create it manually
kubectl exec -n akko deploy/akko-postgresql -- psql -U postgres -c "CREATE DATABASE mlflow;"

# Check MLflow database connection string
kubectl logs -n akko deploy/akko-akko-mlflow --tail=50 | grep -i "database\|postgres\|connection"

# Restart MLflow after fixing the database
kubectl rollout restart -n akko deploy/akko-akko-mlflow
```
Experiment Tracking Failures¶
Symptoms: mlflow.set_experiment() or mlflow.start_run() raises MlflowException: API request failed from notebooks. The MLflow UI returns a 500 error.
Cause: The MLFLOW_TRACKING_URI environment variable in the notebook spawner points to a wrong or unreachable address, or the MLflow server is overloaded / restarting.
Solution:
```shell
# Verify MLflow pod is healthy
kubectl get pods -n akko -l app.kubernetes.io/name=akko-mlflow
kubectl exec -n akko deploy/akko-akko-mlflow -- curl -s http://localhost:5000/health

# Check the tracking URI configured in JupyterHub
kubectl exec -n akko deploy/akko-jupyterhub -- env | grep MLFLOW_TRACKING_URI

# Test connectivity from a notebook pod
kubectl exec -n akko $(kubectl get pod -n akko -l app=jupyterhub -o jsonpath='{.items[0].metadata.name}') \
  -- curl -s http://akko-akko-mlflow:5000/api/2.0/mlflow/experiments/search

# Check MLflow server logs
kubectl logs -n akko deploy/akko-akko-mlflow --tail=100
```
Model Registry Permission Errors¶
Symptoms: mlflow.register_model() raises RestException: RESOURCE_ALREADY_EXISTS or PERMISSION_DENIED. Model version transitions fail silently.
Cause: A model with the same name already exists in a different experiment, or the MLflow server is running in read-only mode due to a database migration issue.
Solution:
```shell
# List registered models via the API
kubectl exec -n akko deploy/akko-akko-mlflow -- \
  curl -s http://localhost:5000/api/2.0/mlflow/registered-models/search | python3 -m json.tool

# Check for database migration issues
kubectl logs -n akko deploy/akko-akko-mlflow --tail=100 | grep -i "migration\|alembic\|upgrade"

# If the database schema is outdated, trigger a manual upgrade
kubectl exec -n akko deploy/akko-akko-mlflow -- mlflow db upgrade postgresql://postgres:$POSTGRES_PASSWORD@akko-postgresql:5432/mlflow
```
MLflow UI Not Loading (502 Bad Gateway)¶
Symptoms: Navigating to https://experiments.akko.local returns a 502 error. The pod is running but the ingress cannot reach it.
Cause: The readiness probe has not passed yet (MLflow is still initializing), the service port mapping is incorrect, or OAuth2-Proxy is failing to authenticate.
Solution:
```shell
# Check pod readiness
kubectl get pods -n akko -l app.kubernetes.io/name=akko-mlflow -o wide

# Test the health endpoint directly
kubectl exec -n akko deploy/akko-akko-mlflow -- curl -s http://localhost:5000/health

# Check ingress configuration
kubectl get ingress -n akko | grep mlflow
kubectl describe ingress -n akko akko-akko-mlflow

# Check OAuth2-Proxy logs if authentication is failing
kubectl logs -n akko deploy/akko-oauth2-proxy --tail=50 | grep -i mlflow
```