AKKO CI/CD Pipeline — Woodpecker + Buildah + Harbor

How changes flow from git push to a running pod on the AKKO cluster.

The before/after in one line

Before: SSH into Netcup, git pull, run bash helm/scripts/deploy-netcup-full.sh manually. Zero audit trail, zero pre-merge gate, no "this commit was tested".

After: a git push to main fires an automatic pipeline run: the validation pipelines check the change, Buildah rebuilds 13 OCI images in parallel pods and pushes them to Harbor, and helm upgrades the live cluster. GitHub shows ✓/✗ on every commit.

Architecture at a glance

  Developer
     │ git push / open PR
  ┌──────────────┐   GitHub webhook    ┌──────────────────────────┐
  │ AKKO-p/AKKO  │ ──────────────────► │ Woodpecker server        │
  │ (GitHub)     │   /api/hook         │ ci.akko-ai.com           │
  └──────────────┘                     │ (akko ns on k3s)         │
                                       └──────────────────────────┘
                                      spawn step pod per .woodpecker/*.yml
                    ┌─────────────────────────────────────────────────────────┐
                    │ 10 validation pipelines (all run on PR & push)          │
                    │ ───────────────────────────────────────────────         │
                    │ 00-validate          helm dependency + changeme scan    │
                    │ 01-helm-lint         helm lint (prod + rendered)        │
                    │ 02-helm-template     kubeconform on rendered manifests  │
                    │ 03-no-hardcoding     lint script (no akko.local leaks)  │
                    │ 04-secrets           gitleaks + plaintext scan          │
                    │ 05-doc-consistency   README/CLAUDE/Chart cross-check    │
                    │ 06-trivy             fs + config CVE scan               │
                    │ 07-trino-ai-plugin   Maven test on the Trino plugin     │
                    │ 10-aden-tests        pytest + coverage gate             │
                    │ 13-np-coverage       NetworkPolicy coverage lint        │
                    │                                                         │
                    │ +1 push-only (Harbor creds, not exposed to PR events)  │
                    │ 11-license-check     syft license scan on Harbor images │
                    │                                                         │
                    │ +1 push + manual (the actual deploy)                    │
                    │ 08-deploy-netcup     13 Buildah builds + helm upgrade   │
                    │                                                         │
                    │ +1 manual-only (needs live cluster)                     │
                    │ 09-tests             pytest + Playwright smoke          │
                    └─────────────────────────────────────────────────────────┘
                                  │                            │
          ✓ all green             │                            │ 08 specific
                                  ▼                            ▼
          GitHub status           │              ┌─────────────────────────┐
          green checkmark         │              │ 13 Buildah build pods   │
                                  │              │ each pushes to Harbor   │
                                  │              └─────────────────────────┘
                                  │                            │
                                  │                            ▼
                                  │              ┌─────────────────────────┐
                                  │              │ helm-upgrade step       │
                                  │              │ reuse-values, wait 15m  │
                                  │              └─────────────────────────┘
                                  │                            │
                                  │                            ▼
                                  │              ┌─────────────────────────┐
                                  └────────────► │ k3s cluster pulls new   │
                                                 │ 2026.04 image digests   │
                                                 └─────────────────────────┘

The 13 validation pipelines

Each .woodpecker/NN-*.yml is a separate pipeline. Most run in parallel on both PR and push events, limited by the 2-agent pool capacity; the push-only and manual exceptions are flagged in the table below.

File Purpose Typical runtime
00-validate.yml helm dependency update + template + no changeme in rendered manifests ~30s
01-helm-lint.yml helm lint against both default and rendered-with-secrets ~20s
02-helm-template.yml kubeconform validates each rendered manifest against k8s schema ~40s
03-no-hardcoding.yml helm/scripts/lint-no-hardcoding.sh — 0 akko.local literals ~25s
04-secrets.yml gitleaks + grep for akko-dev- plaintext in committed helm files ~15s
05-doc-consistency.yml cross-check service counts, Helm deps, Keycloak clients across README/CLAUDE/docs ~20s
06-trivy.yml Trivy fs scan for CRITICAL/HIGH + Trivy config scan of helm chart ~60s
07-trino-ai-plugin.yml mvn test on docker/trino-ai-functions — 79 JUnit tests ~45s
10-aden-tests.yml pytest on docker/aden/tests/ with 30% coverage gate ~10s
13-np-coverage.yml check-networkpolicy-coverage.sh — every chart with Ingress has traefikIngress NP ~5s
11-license-check.yml syft license scan pulling images from Harbor (push-only, needs creds) ~90s
09-tests.yml pytest integration + Playwright smoke against live cluster (manual-only) ~5min
08-deploy-netcup.yml The big one. 13 Buildah builds + helm upgrade. See below. ~30-40min
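
For orientation, most of these pipelines are only a handful of YAML lines. The sketch below shows the general shape; the image tag, chart path and exact when: filters are illustrative, and the real .woodpecker/01-helm-lint.yml is authoritative:

# .woodpecker/01-helm-lint.yml — shape only, not the literal file
when:
  - event: [pull_request, push]
    branch: main

steps:
  helm-lint:
    image: alpine/helm:3.16        # any image with helm and a shell works
    commands:
      - helm dependency update helm/akko/
      - helm lint helm/akko/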

Why Buildah (and not docker build or kaniko)

k3s on Netcup runs on containerd, not Docker. The docker:27-cli image has the CLI binary but no daemon to connect to. Every docker build inside a Woodpecker step pod died with ERROR: Docker daemon is not running.

Mounting /var/run/docker.sock is not an option either — k3s ships only containerd on the host, so there is no Docker socket to mount, and normalising host-socket mounts would be a privilege escalation in any case.

The first attempt used kaniko (Google's daemon-less builder). That worked technically but on 2026-04-23 we discovered that GoogleContainerTools/kaniko had been archived by Google — last commit was the archive notice on 2025-06-03. Locking AKKO into an unmaintained dependency was unacceptable, so we pivoted.

Buildah (Red Hat, Apache 2.0, actively maintained with releases every ~3 weeks) became the canonical builder. It:

  • Is a statically-compiled Go binary (buildah) that reads a Dockerfile and produces an OCI image from inside an ordinary pod
  • Needs no daemon, no privileged access, no host mount
  • Is the default builder in OpenShift Pipelines (Tekton) — Red Hat commits to multi-year support cycles
  • Has 100% Dockerfile compatibility (same RUN/COPY/ARG/ENV/HEALTHCHECK semantics as docker build)
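
To make the drop-in claim concrete: the same Dockerfile and build context work with either tool, only the front-end command changes (a generic example, not the AKKO invocation):

# with a Docker daemon available:
docker build -t example/app:dev -f Dockerfile .
# daemonless, inside an ordinary pod:
buildah bud --layers -t example/app:dev -f Dockerfile .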

See akko-technical-map/decisions/ADR-024-buildah-replaces-kaniko.md (private planning repo) for the full alternative analysis (Buildah vs BuildKit vs Podman vs nerdctl).

The state-pollution lesson

An earlier kaniko attempt ran all 13 builds in a single pod via a shell loop. Kaniko's /kaniko/ and /workspace/ directories are shared across invocations, and after 2-3 builds the filesystem gets polluted (observed: /var/mail: too many levels of symbolic links).

Buildah would have the same class of issue if run sequentially. The fix applies equally: one Woodpecker step per image → one pod per build → each starts from a clean filesystem. 13 step declarations in .woodpecker/08-deploy-netcup.yml, parallelised across the agent pool.

The deploy pipeline (08-deploy-netcup) in detail

steps:
  - detect-changes            # writes /tmp/needs_rebuild
  - build-postgres            # Buildah pod → pushes harbor.akko-ai.com/akko/postgres:2026.04
  - build-spark               # parallel with others up to 2 agents
  - build-notebook
  - build-mlflow
  - build-cockpit
  - build-trino
  - build-ai-service
  - build-airflow
  - build-dbt
  - build-mcp-trino
  - build-mcp-openmetadata
  - build-docs
  - build-catalog-manager
  - helm-upgrade              # depends_on: [every build-*]

Each build-* step is this template:

image: quay.io/buildah/stable:v1.43.1
commands:
  - 'mkdir -p /tmp/auth'
  - 'AUTH=$(printf "%s:%s" "$HARBOR_USERNAME" "$HARBOR_PASSWORD" | base64 | tr -d "\n"); printf ''{"auths":{"harbor.akko-ai.com":{"auth":"%s"}}}'' "$AUTH" > /tmp/auth/config.json'
  - 'export REGISTRY_AUTH_FILE=/tmp/auth/config.json'
  - 'buildah --storage-driver vfs bud --layers --format docker -t harbor.akko-ai.com/akko/<name>:<tag> -f <dockerfile> <context>'
  - 'buildah --storage-driver vfs push harbor.akko-ai.com/akko/<name>:<tag>'
environment:
  HARBOR_USERNAME: {from_secret: harbor_username}
  HARBOR_PASSWORD: {from_secret: harbor_password}
depends_on: [detect-changes]

Notes:

  • --storage-driver vfs avoids needing fuse-overlayfs or kernel capabilities — runs in a PodSecurity-restricted pod. Slower than overlay (~20%) but portable.
  • --layers enables intermediate layer caching within a single build (same as docker).
  • --format docker produces Docker v2 images (OCI format also works; docker is more widely accepted by older registries).
  • REGISTRY_AUTH_FILE is buildah's way of pointing at the Harbor auth config — equivalent to docker's ~/.docker/config.json.

helm-upgrade then:

  1. Writes kubeconfig from a Woodpecker secret
  2. Regenerates values-domain.yaml from the AKKO_DOMAIN secret
  3. helm upgrade akko helm/akko/ -n akko --install --reuse-values --wait --timeout=15m
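
Put together, the step looks roughly like this; the secret-to-file plumbing, the environment variable name and the domain-regeneration helper are illustrative, while the helm upgrade line is the one quoted above:

image: alpine/helm:3.16                       # image choice illustrative; anything with helm + sh works
commands:
  - 'mkdir -p /root/.kube'
  - 'echo "$KUBECONFIG_CONTENT" > /root/.kube/config'            # 1. kubeconfig from the Woodpecker secret
  - 'sh helm/scripts/generate-values-domain.sh "$AKKO_DOMAIN"'   # 2. hypothetical helper name; regenerates values-domain.yaml
  - 'helm upgrade akko helm/akko/ -n akko --install --reuse-values --wait --timeout=15m'   # 3.
environment:
  KUBECONFIG_CONTENT: {from_secret: kubeconfig}
  AKKO_DOMAIN: {from_secret: akko_domain}
depends_on: [build-postgres, build-spark]     # ...plus every other build-* step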

k3s sees the new image digests in the updated Deployments, rolls pods one at a time, and the cluster is live on the new version.

The secrets that power all this

Stored in Woodpecker's own secret vault (not k8s Secrets, not GitHub Actions secrets):

Name Consumer Events
harbor_username build-* steps, 11-license-check push, manual
harbor_password idem push, manual
kubeconfig helm-upgrade step push, manual
akko_domain helm-upgrade step, 09-tests push, manual

Woodpecker refuses to inject any from_secret: into event: pull_request pipelines — that's the anti-exfiltration default. So every pipeline that needs a secret must either omit event: pull_request from its when: block or accept that it compile-errors on PR commits.
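
In practice that just means the when: block of a secret-consuming pipeline lists push and manual only. A schematic excerpt (the syft invocation, image tag and auth variables are illustrative, not a quote from 11-license-check.yml):

when:
  - event: [push, manual]
    branch: main

steps:
  license-scan:
    image: anchore/syft:latest                 # illustrative; the real step may use a shell-capable image with syft installed
    environment:
      SYFT_REGISTRY_AUTH_USERNAME: {from_secret: harbor_username}
      SYFT_REGISTRY_AUTH_PASSWORD: {from_secret: harbor_password}
    commands:
      - syft registry:harbor.akko-ai.com/akko/cockpit:2026.04 -o table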

Rebuild / redeploy path

Automatic (canonical)

git push origin main
# → pipelines 00..13 (minus the manual-only ones) run in parallel on the main push
# → 08-deploy-netcup builds 13 images with Buildah, pushes Harbor, runs helm upgrade

Manual (fallback when CI is down or you're iterating)

ssh root@159.195.77.208
cd /root/akko && git pull
bash helm/scripts/deploy-netcup-full.sh

The manual script uses docker build (it still works on the Netcup host, which has Docker installed for dev convenience). It's the fallback documented in feedback_deploy_cycle_netcup.md. Prefer the automatic path — CI gives an audit trail and runs the validation checks first.

Hot reload a single service (construction phase)

During active development, going through the full 13-build + helm-upgrade pipeline for a one-line CSS change or a routes_query.py tweak is wasteful. Use dev-reload.sh instead — it builds only the named image locally on the Netcup host, pushes it with a dev tag, and patches the live Deployment:

ssh root@159.195.77.208
cd /root/akko && git pull
bash helm/scripts/dev-reload.sh cockpit           # 30-60 s instead of 30-40 min
bash helm/scripts/dev-reload.sh akko-rag          # for a new service
bash helm/scripts/dev-reload.sh ai-service

Uses a timestamped dev tag (harbor.akko-ai.com/akko/<svc>:2026.04-dev-<ts>) so the production :2026.04 tag stays untouched. The next automatic helm upgrade on a main push rolls the Deployment back to the clean tag. For the full service list, run bash helm/scripts/dev-reload.sh with no argument (it prints the help).
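
Under the hood, a script like this can stay very small. A sketch, assuming Docker on the Netcup host, per-service Dockerfile paths under docker/<svc>/, and a container named after the service; the real script is the source of truth:

#!/usr/bin/env bash
# Build one service locally, push a dev-tagged image, point the live Deployment at it.
set -euo pipefail
SVC="${1:?usage: dev-reload.sh <service>}"
TS="$(date +%Y%m%d%H%M%S)"
IMG="harbor.akko-ai.com/akko/${SVC}:2026.04-dev-${TS}"

docker build -t "$IMG" -f "docker/${SVC}/Dockerfile" "docker/${SVC}"   # paths illustrative
docker push "$IMG"
kubectl -n akko set image "deployment/${SVC}" "${SVC}=${IMG}"          # assumes container name == service name
kubectl -n akko rollout status "deployment/${SVC}" --timeout=5m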

When to prefer which path:

Path Latency Use when
dev-reload.sh <svc> 30 s - 5 min iterating on ONE service, fast feedback loop
git push main → CI 5-40 min change is ready, PR green, want audit trail
deploy-netcup-full.sh 30-45 min CI is down, need everything rebuilt

Rollback

helm -n akko rollback akko   # previous revision
# OR, to reset to a tagged image version:
AKKO_TAG=2026.04-stable bash helm/scripts/deploy-netcup-full.sh
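
To land on a specific earlier revision rather than just the previous one, plain Helm commands are enough (nothing AKKO-specific here):

helm -n akko history akko            # list revisions, dates and statuses
helm -n akko rollback akko 42        # 42 = the revision number you want (example)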

Build caching

Two caching layers speed up rebuilds:

  1. Path-filtered triggers (see .woodpecker/08-deploy-netcup.yml). A push that only touches docs/** triggers only build-docs + helm-upgrade. Unrelated builds skip entirely, shaving ~90 % of pipeline time on typical commits.

  2. Buildah registry cache. Every buildah bud writes intermediate layers to harbor.akko-ai.com/akko/cache-<service> via --cache-to and pulls them back via --cache-from. When a build IS unavoidable, buildah resumes from the last valid cached layer instead of re-running every RUN directive (a config sketch follows at the end of this section). Typical speedup on a single-line Dockerfile change:

Image Cold Cached
notebook-slim ~12 min ~2-3 min
notebook-full (FROM slim) ~35 min ~4-6 min
trino (Maven plugin build) ~10 min ~2 min
cockpit (nginx + HTML) ~3 min ~40 s

The first run after this feature lands is as slow as a cold build — the cache repos don't exist yet in Harbor. Buildah tolerates a 404 on --cache-from and populates the cache via --cache-to at the end. From run #2 onwards, the cache kicks in.

Cache images live in separate Harbor repos (akko/cache-*) so production tags (akko/<service>:2026.04) stay clean. Recommended Harbor retention policy on cache-* repos: keep last 5 manifests + GC weekly.
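
Schematically, the two layers combine in a single build-* step like this (service name, paths and the step-level when: placement are illustrative; .woodpecker/08-deploy-netcup.yml is authoritative):

build-cockpit:
  image: quay.io/buildah/stable:v1.43.1
  when:
    - event: push
      path:
        include: ['docker/cockpit/**', 'helm/akko/charts/cockpit/**']   # layer 1: path-filtered trigger
  commands:
    # layer 2: registry-backed layer cache (--layers is required for --cache-to/--cache-from)
    - >-
      buildah --storage-driver vfs bud --layers --format docker
      --cache-from harbor.akko-ai.com/akko/cache-cockpit
      --cache-to harbor.akko-ai.com/akko/cache-cockpit
      -t harbor.akko-ai.com/akko/cockpit:2026.04
      -f docker/cockpit/Dockerfile docker/cockpit
    - buildah --storage-driver vfs push harbor.akko-ai.com/akko/cockpit:2026.04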

Known debt

  • Trust flag unused: Buildah runs fully unprivileged — we do not enable the repo's trusted.volumes flag. Any future step requiring it should be evaluated via ADR before flipping.
  • 09-tests gated to manual: it runs live smoke tests against the cluster, which only makes sense post-deploy. The proper fix is an inter-pipeline depends_on: [helm-upgrade] once Woodpecker exposes step dependencies across files (v3.4 currently doesn't support cross-pipeline deps).
  • Dockerfile rebuild-from-scratch latent bugs: docker build's layer cache had been hiding real bugs for months. When a fresh rebuild exposed them, we fixed notebook (libpq5 broken deps on Ubuntu 24.04) and mlflow (apt-key deprecation on Debian bullseye) in PR #13. Other Dockerfiles may surface similar issues on the next full rebuild — the pattern to remove is usually a stray apt-get -y upgrade between apt-get update and apt-get install.
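
For the last point, the pattern in question looks schematically like this (illustrative, not an excerpt from an actual AKKO Dockerfile):

# Fragile: the blanket upgrade pulls in whatever the base image's repos serve on rebuild day,
# and the breakage only shows up once the layer cache is cold.
RUN apt-get update && apt-get -y upgrade && apt-get install -y libpq5

# Robust: install exactly what is needed, no blanket upgrade, clean the lists.
RUN apt-get update && apt-get install -y --no-install-recommends libpq5 \
    && rm -rf /var/lib/apt/lists/*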

Tool maintenance policy

Every builder/tool integrated into AKKO CI must pass a maintenance check before adoption: last commit < 3 months old, last release < 6 months old, ≥ 2 active maintainers, clear backing organisation. Decisions documented via ADR. See feedback_verify_tool_maintained.md in the session memory.

References