AKKO CI/CD Pipeline — Woodpecker + Buildah + Harbor¶
How changes flow from git push to a running pod on the AKKO cluster.
The before/after in one line¶
Before: SSH into Netcup, git pull, run bash helm/scripts/deploy-netcup-full.sh manually. Zero audit trail, zero pre-merge gate, no "this commit was tested".
After: a git push main fires an automatic pipeline, 13 checks run on the change, Buildah rebuilds 13 OCI images in parallel pods, pushes them to Harbor, and helm upgrades the live cluster. GitHub shows ✓/✗ on every commit.
Architecture at a glance¶
Developer
│
│ git push / open PR
▼
┌──────────────┐ GitHub webhook ┌──────────────────────────┐
│ AKKO-p/AKKO │ ──────────────────► │ Woodpecker server │
│ (GitHub) │ /api/hook │ ci.akko-ai.com │
└──────────────┘ │ (akko ns on k3s) │
└──────────────────────────┘
│
spawn step pod per .woodpecker/*.yml
▼
┌─────────────────────────────────────────────────────────┐
│  13 validation pipelines (grouped by trigger)           │
│ ─────────────────────────────────────────────── │
│ 00-validate helm dependency + changeme scan │
│ 01-helm-lint helm lint (prod + rendered) │
│ 02-helm-template kubeconform on rendered manifests │
│ 03-no-hardcoding lint script (no akko.local leaks) │
│ 04-secrets gitleaks + plaintext scan │
│ 05-doc-consistency README/CLAUDE/Chart cross-check │
│ 06-trivy fs + config CVE scan │
│ 07-trino-ai-plugin Maven test on the Trino plugin │
│ 10-aden-tests pytest + coverage gate │
│ 13-np-coverage NetworkPolicy coverage lint │
│ │
│  +1 push-only (Harbor creds, not exposed to PR events)  │
│ 11-license-check syft license scan on Harbor images │
│ │
│ +1 push + manual (the actual deploy) │
│ 08-deploy-netcup 13 Buildah builds + helm upgrade │
│ │
│ +1 manual-only (needs live cluster) │
│ 09-tests pytest + Playwright smoke │
└─────────────────────────────────────────────────────────┘
│ │
✓ all green │ │ 08 specific
▼ ▼
GitHub status │ ┌─────────────────────────┐
green checkmark │ │ 13 Buildah build pods │
│ │ each pushes to Harbor │
│ └─────────────────────────┘
│ │
│ ▼
│ ┌─────────────────────────┐
│ │ helm-upgrade step │
│ │ reuse-values, wait 15m │
│ └─────────────────────────┘
│ │
│ ▼
│ ┌─────────────────────────┐
└────────────► │ k3s cluster pulls new │
│ 2026.04 image digests │
└─────────────────────────┘
The 13 validation pipelines¶
Each `.woodpecker/NN-*.yml` is a separate pipeline. Most run in parallel on both PR and push events (exceptions are noted below), up to the 2-agent pool capacity.
| File | Purpose | Typical runtime |
|---|---|---|
| `00-validate.yml` | `helm dependency update` + template + no `changeme` in rendered manifests | ~30s |
| `01-helm-lint.yml` | `helm lint` against both default and rendered-with-secrets values | ~20s |
| `02-helm-template.yml` | kubeconform validates each rendered manifest against the k8s schema | ~40s |
| `03-no-hardcoding.yml` | `helm/scripts/lint-no-hardcoding.sh` — 0 `akko.local` literals | ~25s |
| `04-secrets.yml` | gitleaks + grep for `akko-dev-` plaintext in committed helm files | ~15s |
| `05-doc-consistency.yml` | cross-check service counts, Helm deps, Keycloak clients across README/CLAUDE/docs | ~20s |
| `06-trivy.yml` | Trivy fs scan for CRITICAL/HIGH + Trivy config scan of the helm chart | ~60s |
| `07-trino-ai-plugin.yml` | `mvn test` on `docker/trino-ai-functions` — 79 JUnit tests | ~45s |
| `10-aden-tests.yml` | pytest on `docker/aden/tests/` with a 30% coverage gate | ~10s |
| `13-np-coverage.yml` | `check-networkpolicy-coverage.sh` — every chart with an Ingress has a `traefikIngress` NP | ~5s |
| `11-license-check.yml` | syft license scan pulling images from Harbor (push-only, needs creds) | ~90s |
| `09-tests.yml` | pytest integration + Playwright smoke against the live cluster (manual-only) | ~5min |
| `08-deploy-netcup.yml` | The big one: 13 Buildah builds + helm upgrade. See below. | ~30-40min |
Why Buildah (and not docker build or kaniko)¶
k3s on Netcup runs on containerd, not Docker. The `docker:27-cli` image has the CLI binary but no daemon to connect to; every `docker build` inside a Woodpecker step pod died with `ERROR: Docker daemon is not running`.
Mounting `/var/run/docker.sock` is not an option — k3s ships only containerd on the host. And mounting host sockets is a privilege escalation we don't want to normalise.
The first attempt used kaniko (Google's daemon-less builder). That worked technically but on 2026-04-23 we discovered that GoogleContainerTools/kaniko had been archived by Google — last commit was the archive notice on 2025-06-03. Locking AKKO into an unmaintained dependency was unacceptable, so we pivoted.
Buildah (Red Hat, Apache 2.0, actively maintained with releases every ~3 weeks) became the canonical builder. It:
- Is a statically-compiled Go binary (`buildah`) that reads a Dockerfile and produces an OCI image from inside an ordinary pod
- Needs no daemon, no privileged access, no host mounts
- Is the default builder in OpenShift Pipelines (Tekton) — Red Hat commits to multi-year support cycles
- Has 100% Dockerfile compatibility (same `RUN`/`COPY`/`ARG`/`ENV`/`HEALTHCHECK` semantics as `docker build`)
See akko-technical-map/decisions/ADR-024-buildah-replaces-kaniko.md (private planning repo) for the full alternative analysis (Buildah vs BuildKit vs Podman vs nerdctl).
The state-pollution lesson¶
An earlier kaniko attempt ran all 13 builds in a single pod via a shell loop. Kaniko's `/kaniko/` and `/workspace/` directories are shared across invocations, and after 2-3 builds the filesystem gets polluted (observed: `/var/mail: too many levels of symbolic links`).
Buildah would have the same class of issue if run sequentially in one pod. The fix applies equally: one Woodpecker step per image → one pod per build → each starts from a clean filesystem. That means 13 step declarations in `.woodpecker/08-deploy-netcup.yml`, parallelised across the agent pool.
The deploy pipeline (08-deploy-netcup) in detail¶
```yaml
steps:
  - detect-changes          # writes /tmp/needs_rebuild
  - build-postgres          # Buildah pod → pushes harbor.akko-ai.com/akko/postgres:2026.04
  - build-spark             # parallel with others up to 2 agents
  - build-notebook
  - build-mlflow
  - build-cockpit
  - build-trino
  - build-ai-service
  - build-airflow
  - build-dbt
  - build-mcp-trino
  - build-mcp-openmetadata
  - build-docs
  - build-catalog-manager
  - helm-upgrade            # depends_on: [every build-*]
```
Each `build-*` step follows this template:
```yaml
image: quay.io/buildah/stable:v1.43.1
commands:
  - 'mkdir -p /tmp/auth'
  - 'AUTH=$(printf "%s:%s" "$HARBOR_USERNAME" "$HARBOR_PASSWORD" | base64 | tr -d "\n"); printf ''{"auths":{"harbor.akko-ai.com":{"auth":"%s"}}}'' "$AUTH" > /tmp/auth/config.json'
  - 'export REGISTRY_AUTH_FILE=/tmp/auth/config.json'
  - 'buildah --storage-driver vfs bud --layers --format docker -t harbor.akko-ai.com/akko/<name>:<tag> -f <dockerfile> <context>'
  - 'buildah --storage-driver vfs push harbor.akko-ai.com/akko/<name>:<tag>'
environment:
  HARBOR_USERNAME: {from_secret: harbor_username}
  HARBOR_PASSWORD: {from_secret: harbor_password}
depends_on: [detect-changes]
```
Notes:
- `--storage-driver vfs` avoids needing fuse-overlayfs or kernel capabilities — it runs in a PodSecurity-restricted pod. Slower than `overlay` (~20%) but portable.
- `--layers` enables intermediate layer caching within a single build (same as docker).
- `--format docker` produces Docker v2 images (OCI format also works; `docker` is more widely accepted by older registries).
- `REGISTRY_AUTH_FILE` is buildah's way of pointing at the Harbor auth config — equivalent to docker's `~/.docker/config.json`.
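The auth-file line in the template is dense. Here is the same generation with dummy credentials (`ci-user`/`s3cret` are placeholders, not real values), runnable anywhere `base64` is available, so the JSON shape buildah expects via `REGISTRY_AUTH_FILE` can be inspected:

```shell
# Same auth-config generation as the build template, with dummy credentials.
# The resulting file has the same shape as docker's ~/.docker/config.json.
AUTH=$(printf "%s:%s" "ci-user" "s3cret" | base64 | tr -d "\n")
printf '{"auths":{"harbor.akko-ai.com":{"auth":"%s"}}}' "$AUTH" > /tmp/auth-demo.json
cat /tmp/auth-demo.json
# → {"auths":{"harbor.akko-ai.com":{"auth":"Y2ktdXNlcjpzM2NyZXQ="}}}
```

The `auth` value is just `base64(username:password)`, which is why the template strips the trailing newline with `tr -d "\n"` before embedding it.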
`helm-upgrade` then:
- Writes the kubeconfig from a Woodpecker secret
- Regenerates `values-domain.yaml` from the `AKKO_DOMAIN` secret
- Runs `helm upgrade akko helm/akko/ -n akko --install --reuse-values --wait --timeout=15m`
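A rough sketch of that step as it could appear in `.woodpecker/08-deploy-netcup.yml` (the image choice, the `KUBECONFIG_CONTENT` variable name, and the values-regeneration helper are assumptions for illustration, not quoted from the repo):

```yaml
helm-upgrade:
  image: alpine/helm:3          # assumed image; any helm-capable image works
  environment:
    KUBECONFIG_CONTENT: {from_secret: kubeconfig}
    AKKO_DOMAIN: {from_secret: akko_domain}
  commands:
    - mkdir -p ~/.kube && printf '%s' "$KUBECONFIG_CONTENT" > ~/.kube/config
    - chmod 600 ~/.kube/config
    # regenerate values-domain.yaml from AKKO_DOMAIN (helper name is hypothetical)
    - sh helm/scripts/generate-values-domain.sh
    - helm upgrade akko helm/akko/ -n akko --install --reuse-values --wait --timeout=15m
  depends_on: [build-postgres, build-spark]   # … plus the other build-* steps
```

`--wait --timeout=15m` makes the step fail (and the pipeline go red) if any rolled Deployment never becomes ready.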
k3s sees the new image digests in the updated Deployments, rolls pods one at a time, and the cluster is live on the new version.
The secrets that power all this¶
Stored in Woodpecker's own secret vault (not k8s Secrets, not GitHub Actions secrets):
| Name | Consumer | Events |
|---|---|---|
| `harbor_username` | `build-*` steps, 11-license-check | push, manual |
| `harbor_password` | idem | push, manual |
| `kubeconfig` | helm-upgrade step | push, manual |
| `akko_domain` | helm-upgrade step, 09-tests | push, manual |
Woodpecker refuses to inject any `from_secret:` into `event: pull_request` pipelines — that's the anti-exfiltration default. So every pipeline that needs a secret must either omit `event: pull_request` from its `when:` block or accept a pipeline-compilation error on PR commits.
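A hedged sketch of the `when:` block such a secret-consuming pipeline might carry (the branch filter is an assumption, not quoted from the repo):

```yaml
when:
  - event: [push, manual]   # pull_request deliberately absent → secrets may be injected
    branch: main            # assumed branch filter
```

With `pull_request` listed here, Woodpecker would refuse to resolve the `from_secret:` references and the pipeline would error out on PR events.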
Rebuild / redeploy path¶
Automatic (canonical)¶
```shell
git push origin main
# → pipelines 00..13 run in parallel on the main push
# → 08-deploy-netcup builds 13 images with Buildah, pushes to Harbor, runs helm upgrade
```
Manual (fallback when CI is down or you're iterating)¶
The manual script uses `docker build` (it still works on the Netcup host, which has Docker installed for dev convenience). It's the fallback documented in feedback_deploy_cycle_netcup.md. Prefer the automatic path — CI gives an audit trail and runs the validation checks first.
Hot reload a single service (construction phase)¶
During active development, going through the full 13-build + helm-upgrade pipeline for a one-line CSS change or a `routes_query.py` tweak is wasteful. Use `dev-reload.sh` instead — it builds only the named image locally on the Netcup host, pushes it with a dev tag, and patches the live Deployment:
```shell
ssh root@159.195.77.208
cd /root/akko && git pull
bash helm/scripts/dev-reload.sh cockpit      # 30-60 s instead of 30-40 min
bash helm/scripts/dev-reload.sh akko-rag     # for a new service
bash helm/scripts/dev-reload.sh ai-service
```
Uses a timestamped dev tag (`harbor.akko-ai.com/akko/<svc>:2026.04-dev-<ts>`) so the production `:2026.04` tag stays untouched. The next automatic helm upgrade on a main push rolls the Deployment back to the clean tag. Full service list: `bash helm/scripts/dev-reload.sh` (no argument prints the help).
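The tag scheme can be sketched in isolation. The script internals are assumed, not quoted, and the fixed timestamp stands in for the real `date`-derived one:

```shell
# Sketch of the dev-tag scheme used by dev-reload.sh (internals assumed, not verbatim).
svc="cockpit"
ts="20260423120000"   # the real script derives this from `date`; fixed here for illustration
tag="harbor.akko-ai.com/akko/${svc}:2026.04-dev-${ts}"
echo "$tag"
# → harbor.akko-ai.com/akko/cockpit:2026.04-dev-20260423120000

# The script then roughly does:
#   docker build -t "$tag" "docker/$svc" && docker push "$tag"
#   kubectl -n akko set image "deployment/$svc" "$svc=$tag"
```

Because the dev tag never collides with `:2026.04`, a stray dev push cannot corrupt what the CI pipeline deploys.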
When to prefer which path:
| Path | Latency | Use when |
|---|---|---|
| `dev-reload.sh <svc>` | 30 s - 5 min | iterating on ONE service, fast feedback loop |
| `git push main` → CI | 5-40 min | change is ready, PR green, want an audit trail |
| `deploy-netcup-full.sh` | 30-45 min | CI is down, need everything rebuilt |
Rollback¶
```shell
helm -n akko rollback akko   # previous revision
# OR, to reset to a tagged image version:
AKKO_TAG=2026.04-stable bash helm/scripts/deploy-netcup-full.sh
```
Build caching¶
Two caching layers speed up rebuilds:
1. **Path-filtered triggers** (see `.woodpecker/08-deploy-netcup.yml`). A push that only touches `docs/**` triggers only `build-docs` + `helm-upgrade`. Unrelated builds skip entirely, shaving ~90% of pipeline time on typical commits.
2. **Buildah registry cache.** Every `buildah bud` writes intermediate layers to `harbor.akko-ai.com/akko/cache-<service>` via `--cache-to` and pulls them back via `--cache-from`. When a build *is* unavoidable, buildah resumes from the last valid cached layer instead of re-running every `RUN` directive. Typical speedup on a single-line Dockerfile change:
| Image | Cold | Cached |
|---|---|---|
| notebook-slim | ~12 min | ~2-3 min |
| notebook-full (FROM slim) | ~35 min | ~4-6 min |
| trino (Maven plugin build) | ~10 min | ~2 min |
| cockpit (nginx + HTML) | ~3 min | ~40 s |
The first run after this feature lands is as slow as a cold build — the cache repos don't exist yet in Harbor. Buildah tolerates a 404 on `--cache-from` and populates the cache via `--cache-to` at the end. From run #2 onwards, the cache kicks in.
Cache images live in separate Harbor repos (`akko/cache-*`) so production tags (`akko/<service>:2026.04`) stay clean. Recommended Harbor retention policy on `cache-*` repos: keep last 5 manifests + GC weekly.
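A cached build step then differs from the plain template only by the two cache flags (`--cache-from`/`--cache-to` require `--layers`; the cache repo name follows this doc's convention, while the Dockerfile path is illustrative):

```yaml
build-trino:
  image: quay.io/buildah/stable:v1.43.1
  commands:
    - >-
      buildah --storage-driver vfs bud --layers --format docker
      --cache-from harbor.akko-ai.com/akko/cache-trino
      --cache-to harbor.akko-ai.com/akko/cache-trino
      -t harbor.akko-ai.com/akko/trino:2026.04
      -f docker/trino/Dockerfile docker/trino
    - buildah --storage-driver vfs push harbor.akko-ai.com/akko/trino:2026.04
```

Pointing both flags at the same repo means each successful build both consumes and refreshes the cache.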
Known debt¶
- **Trust flag unused:** Buildah runs fully unprivileged — we do not enable the repo's `trusted.volumes` flag. Any future step requiring it should be evaluated via ADR before flipping it.
- **09-tests gated to manual:** it runs live smoke tests against the cluster, which only makes sense post-deploy. The proper path is an inter-pipeline `depends_on: [helm-upgrade]` once Woodpecker exposes it across files (currently v3.4 doesn't support cross-pipeline deps).
- **Dockerfile rebuild-from-scratch latent bugs:** `docker build`'s layer cache had been hiding real bugs for months. When a fresh rebuild exposed them we fixed notebook (`libpq5` broken deps on Ubuntu 24.04) and mlflow (`apt-key` deprecation on Debian bullseye) in PR #13. Other Dockerfiles may surface similar issues on the next full rebuild — the pattern to remove is usually a stray `apt-get -y upgrade` between `apt-get update` and `apt-get install`.
Tool maintenance policy¶
Every builder/tool integrated into AKKO CI must pass a maintenance check before adoption: last commit < 3 months old, last release < 6 months old, ≥ 2 active maintainers, clear backing organisation. Decisions documented via ADR. See feedback_verify_tool_maintained.md in the session memory.
References¶
- `.woodpecker/*.yml` — the 13 pipelines
- `helm/scripts/build-images.sh` — local dev build (docker-based)
- `helm/scripts/deploy-netcup-full.sh` — manual fallback deploy
- Woodpecker docs — v3.4 reference
- Buildah docs — `buildah bud`, `buildah push`