ADR-006: Ad-hoc to Production Workflow Pattern

Status

Accepted

Date

2026-03-09

Context

One of Databricks' strongest selling points is the seamless path from notebook exploration to production jobs — same environment, same runtime, one click to schedule. AKKO must provide an equivalent workflow using only open-source tools, without vendor-specific "jobs" APIs or proprietary runtimes.

The challenge: how do data teams go from "I wrote a query in a notebook" to "this runs every day in production, monitored and governed"?

Decision

Implement the Ad-hoc to Production workflow as a standard pipeline through open tools:

JupyterHub (explore) --> code-server (refactor) --> Git push --> dbt model --> Airflow DAG --> Trino/Iceberg (production)

Workflow stages:

Stage 1: Explore (JupyterHub)

  • Data scientists and analysts explore data in Jupyter notebooks
  • Connect to Trino (SQL) or Spark (PySpark) for interactive queries
  • Use pandas, plotly, seaborn for quick visualizations
  • Query Iceberg tables with time travel for historical analysis
  • Output: working notebook with validated logic (see the query sketch below)
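
A minimal sketch of this exploration step in Python, assuming a Trino coordinator reachable at trino:8080 and a hypothetical iceberg.sales.orders table; the host, user, and table names are illustrative only:

    import pandas as pd
    import trino

    # connect to Trino with the Iceberg catalog (host/catalog/schema are assumptions)
    conn = trino.dbapi.connect(host="trino", port=8080, user="analyst",
                               catalog="iceberg", schema="sales")
    cur = conn.cursor()

    # interactive aggregation over the current state of the hypothetical orders table
    cur.execute("SELECT order_date, sum(amount) AS revenue "
                "FROM orders GROUP BY order_date ORDER BY order_date")
    df = pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])

    # Iceberg time travel via Trino: read the table as it was at an earlier point
    cur.execute("SELECT count(*) FROM orders "
                "FOR TIMESTAMP AS OF TIMESTAMP '2026-03-02 00:00:00 UTC'")
    print(cur.fetchone())

    df.plot(x="order_date", y="revenue")  # quick look with pandas plotting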

Stage 2: Refactor (code-server)

  • Open code-server (VS Code in the browser) from the JupyterHub environment
  • Extract notebook logic into clean Python modules or dbt SQL models
  • Write unit tests, add type hints, apply linting
  • Structure code following project conventions (src/, tests/, models/)
  • Output: production-ready code in a Git branch (see the module and test sketch below)
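
A sketch of what the refactored artifact might look like, assuming a hypothetical src/akko_pipelines/orders.py module and a matching pytest test; the file, function, and column names are illustrative, not part of the AKKO codebase:

    # src/akko_pipelines/orders.py
    import pandas as pd

    def daily_revenue(orders: pd.DataFrame) -> pd.DataFrame:
        """Aggregate raw orders into one revenue row per day (logic lifted from the notebook)."""
        return (
            orders.groupby("order_date", as_index=False)
            .agg(revenue=("amount", "sum"))
        )

    # tests/test_orders.py
    def test_daily_revenue_sums_per_day() -> None:
        orders = pd.DataFrame({
            "order_date": ["2026-03-01", "2026-03-01", "2026-03-02"],
            "amount": [10.0, 5.0, 7.0],
        })
        result = daily_revenue(orders)
        assert result["revenue"].tolist() == [15.0, 7.0]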

Stage 3: Version (Git)

  • Push branch to Git repository (Gitea, GitHub, GitLab)
  • Create pull request for code review
  • CI runs tests, linting, dbt compile
  • Merge to main after approval
  • Output: reviewed, tested code in the main branch (see the CI sketch below)
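
One possible shape for the CI gate is a small check script the CI job invokes; the script path (scripts/ci_checks.py) and the exact tool choices are assumptions, though they match the stack described in the other stages:

    # scripts/ci_checks.py -- hypothetical gate run by CI on every pull request
    import subprocess
    import sys

    CHECKS = [
        ["pytest", "-q"],                   # unit tests from Stage 2
        ["ruff", "check", "src", "tests"],  # linting
        ["dbt", "compile"],                 # verify the dbt project compiles
    ]

    def main() -> int:
        for cmd in CHECKS:
            print("running:", " ".join(cmd))
            if subprocess.run(cmd).returncode != 0:
                return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())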

Stage 4: Model (dbt)

  • dbt models define transformations as SQL SELECT statements
  • Materialized as Iceberg tables via the dbt Trino adapter
  • dbt tests validate data quality (uniqueness, not-null, relationships)
  • dbt docs generate lineage and documentation automatically
  • Output: tested, documented transformation models (see the model sketch below)
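
A sketch of the dbt layer, driven here through dbt's programmatic Python API (dbt-core 1.5+); the model name daily_revenue, the upstream stg_orders reference, and the columns are illustrative, and in a real project the SQL lives in models/*.sql files rather than a Python string:

    from dbt.cli.main import dbtRunner

    # illustrative contents of models/marts/daily_revenue.sql, shown inline only for
    # this sketch -- a plain SELECT that the Trino adapter materializes as an Iceberg table
    DAILY_REVENUE_SQL = """
    {{ config(materialized='table') }}
    select
        order_date,
        sum(amount) as revenue
    from {{ ref('stg_orders') }}
    group by order_date
    """

    # run the model and its schema tests (unique / not_null declared in schema.yml)
    dbt = dbtRunner()
    dbt.invoke(["run", "--select", "daily_revenue"])
    dbt.invoke(["test", "--select", "daily_revenue"])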

Stage 5: Orchestrate (Airflow)

  • Airflow DAG schedules the dbt run + downstream tasks
  • OpenLineage provider emits lineage events to OpenMetadata
  • Alerting on failure (Slack, email, PagerDuty via providers)
  • Retry policies, SLAs, and monitoring are built in
  • Output: scheduled, monitored production pipeline (see the DAG sketch below)
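
A minimal sketch of the production DAG, assuming dbt is available on the Airflow workers and the project sits at /opt/dbt/akko (both assumptions); once the OpenLineage provider is configured, lineage for these tasks is emitted without extra code:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_revenue_pipeline",
        schedule="@daily",
        start_date=datetime(2026, 3, 1),
        catchup=False,
        default_args={
            "retries": 2,
            "retry_delay": timedelta(minutes=5),
            "email_on_failure": True,  # or Slack/PagerDuty via provider callbacks
        },
    ) as dag:
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="cd /opt/dbt/akko && dbt run --select daily_revenue",
        )
        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command="cd /opt/dbt/akko && dbt test --select daily_revenue",
        )
        dbt_run >> dbt_test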

Stage 6: Serve (Trino/Iceberg)

  • Production Iceberg tables available via Trino for BI tools (Superset)
  • Governed via Polaris RBAC (catalog-level) and OPA (row/column-level)
  • Time travel enables rollback and audit
  • Output: governed, queryable production data (see the audit and rollback sketch below)
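
A sketch of how snapshot auditing and rollback might look through Trino's Iceberg connector, reusing the hypothetical table from the earlier stages; the snapshot id is a placeholder, and access is assumed to be permitted by the Polaris/OPA policies in force:

    import trino

    conn = trino.dbapi.connect(host="trino", port=8080, user="data_engineer",
                               catalog="iceberg", schema="sales")
    cur = conn.cursor()

    # audit: list the table's snapshot history via the Iceberg metadata table
    cur.execute('SELECT snapshot_id, committed_at, operation '
                'FROM "daily_revenue$snapshots" ORDER BY committed_at')
    for snapshot_id, committed_at, operation in cur.fetchall():
        print(snapshot_id, committed_at, operation)

    # rollback: restore the table to a known-good snapshot using the Trino Iceberg
    # system procedure; the snapshot id below is a placeholder
    cur.execute("CALL iceberg.system.rollback_to_snapshot("
                "'sales', 'daily_revenue', 8954597067493422955)")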

Alternatives Considered

Databricks-style Unified Platform

  • Single environment for exploration and production
  • Notebooks directly schedulable as "jobs"
  • But: proprietary runtime, vendor lock-in, no Git-native workflow
  • Rejected: contradicts AKKO's open-source and sovereignty principles

Spark Submit Only (No dbt)

  • Skip dbt, use PySpark scripts directly in Airflow
  • Simpler stack (fewer tools)
  • But: loses dbt's testing framework, documentation generation, and SQL-first accessibility for analysts
  • Rejected: dbt's value for data quality and documentation outweighs the added complexity

Notebook-as-Production (Papermill)

  • Run notebooks in production via Papermill (parameterized execution)
  • Keeps the same artifact from exploration to production
  • But: notebooks are hard to test, review, and version control cleanly
  • Hidden state, execution order issues, merge conflicts in JSON
  • Rejected: notebooks are for exploration, not production

Consequences

Positive

  • Fully portable — every tool in the chain is open-source and replaceable
  • Audit-friendly — Git history provides complete provenance of every transformation
  • Skill-building — team learns industry-standard tools (Git, dbt, Airflow), not vendor-specific APIs
  • Separation of concerns — exploration, transformation, orchestration, and governance are cleanly separated
  • No vendor lock-in at any stage

Negative

  • More steps than Databricks' "one-click schedule" — requires discipline and tooling knowledge
  • Onboarding takes longer — team must learn JupyterHub + Git + dbt + Airflow (mitigated by training and templates)
  • Context switching between tools (notebook, code-server, Git, Airflow UI) can slow initial development

Neutral

  • Each stage can be skipped for simple use cases (e.g., SQL-only analysts can go directly from JupyterHub SQL to Superset without dbt)
  • Workflow can evolve — adding CI/CD (GitHub Actions), data contracts, or automated testing does not require changing the fundamental pattern

References