ADR-006: Ad-hoc to Production Workflow Pattern

Status

Accepted

Date

2026-03-09

Context

One of Databricks' strongest selling points is the seamless path from notebook exploration to production jobs — same environment, same runtime, one click to schedule. AKKO must provide an equivalent workflow using only open-source tools, without vendor-specific "jobs" APIs or proprietary runtimes.

The challenge: how do data teams go from "I wrote a query in a notebook" to "this runs every day in production, monitored and governed"?

Decision

Implement the Ad-hoc to Production workflow as a standard pipeline through open tools:

JupyterHub (explore) --> code-server (refactor) --> Git push --> dbt model --> Airflow DAG --> Trino/Iceberg (production)

Workflow stages:

Stage 1: Explore (JupyterHub)

  • Data scientists and analysts explore data in Jupyter notebooks
  • Connect to Trino (SQL) or Spark (PySpark) for interactive queries
  • Use pandas, plotly, seaborn for quick visualizations
  • Query Iceberg tables with time travel for historical analysis
  • Output: working notebook with validated logic (see the query sketch below)
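
A minimal sketch of this exploration step in Python, assuming a Trino coordinator reachable at trino:8080 and a hypothetical iceberg.sales.orders table; the host, user, and table names are illustrative only:

    import pandas as pd
    import trino

    # connect to Trino with the Iceberg catalog (host/catalog/schema are assumptions)
    conn = trino.dbapi.connect(host="trino", port=8080, user="analyst",
                               catalog="iceberg", schema="sales")
    cur = conn.cursor()

    # interactive aggregation over the current state of the hypothetical orders table
    cur.execute("SELECT order_date, sum(amount) AS revenue "
                "FROM orders GROUP BY order_date ORDER BY order_date")
    df = pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])

    # Iceberg time travel via Trino: read the table as it was at an earlier point
    cur.execute("SELECT count(*) FROM orders "
                "FOR TIMESTAMP AS OF TIMESTAMP '2026-03-02 00:00:00 UTC'")
    print(cur.fetchone())

    df.plot(x="order_date", y="revenue")  # quick look with pandas plotting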

Stage 2: Refactor (code-server)

  • Open code-server (VS Code in the browser) from the JupyterHub environment
  • Extract notebook logic into clean Python modules or dbt SQL models
  • Write unit tests, add type hints, apply linting
  • Structure code following project conventions (src/, tests/, models/)
  • Output: production-ready code in a Git branch (see the module and test sketch below)
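
A sketch of what the refactored artifact might look like, assuming a hypothetical src/akko_pipelines/orders.py module and a matching pytest test; the file, function, and column names are illustrative, not part of the AKKO codebase:

    # src/akko_pipelines/orders.py
    import pandas as pd

    def daily_revenue(orders: pd.DataFrame) -> pd.DataFrame:
        """Aggregate raw orders into one revenue row per day (logic lifted from the notebook)."""
        return (
            orders.groupby("order_date", as_index=False)
            .agg(revenue=("amount", "sum"))
        )

    # tests/test_orders.py
    def test_daily_revenue_sums_per_day() -> None:
        orders = pd.DataFrame({
            "order_date": ["2026-03-01", "2026-03-01", "2026-03-02"],
            "amount": [10.0, 5.0, 7.0],
        })
        result = daily_revenue(orders)
        assert result["revenue"].tolist() == [15.0, 7.0]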

Stage 3: Version (Git)

  • Push branch to Git repository (Gitea, GitHub, GitLab)
  • Create pull request for code review
  • CI runs tests, linting, dbt compile
  • Merge to main after approval
  • Output: reviewed, tested code in the main branch (see the CI sketch below)
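
One possible shape for the CI gate is a small check script the CI job invokes; the script path (scripts/ci_checks.py) and the exact tool choices are assumptions, though they match the stack described in the other stages:

    # scripts/ci_checks.py -- hypothetical gate run by CI on every pull request
    import subprocess
    import sys

    CHECKS = [
        ["pytest", "-q"],                   # unit tests from Stage 2
        ["ruff", "check", "src", "tests"],  # linting
        ["dbt", "compile"],                 # verify the dbt project compiles
    ]

    def main() -> int:
        for cmd in CHECKS:
            print("running:", " ".join(cmd))
            if subprocess.run(cmd).returncode != 0:
                return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())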

Stage 4: Model (dbt)

  • dbt models define transformations as SQL SELECT statements
  • Materialized as Iceberg tables via the dbt Trino adapter
  • dbt tests validate data quality (uniqueness, not-null, relationships)
  • dbt docs generate lineage and documentation automatically
  • Output: tested, documented transformation models (see the model sketch below)
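
A sketch of the dbt layer, driven here through dbt's programmatic Python API (dbt-core 1.5+); the model name daily_revenue, the upstream stg_orders reference, and the columns are illustrative, and in a real project the SQL lives in models/*.sql files rather than a Python string:

    from dbt.cli.main import dbtRunner

    # illustrative contents of models/marts/daily_revenue.sql, shown inline only for
    # this sketch -- a plain SELECT that the Trino adapter materializes as an Iceberg table
    DAILY_REVENUE_SQL = """
    {{ config(materialized='table') }}
    select
        order_date,
        sum(amount) as revenue
    from {{ ref('stg_orders') }}
    group by order_date
    """

    # run the model and its schema tests (unique / not_null declared in schema.yml)
    dbt = dbtRunner()
    dbt.invoke(["run", "--select", "daily_revenue"])
    dbt.invoke(["test", "--select", "daily_revenue"])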

Stage 5: Orchestrate (Airflow)

  • Airflow DAG schedules the dbt run + downstream tasks
  • OpenLineage provider emits lineage events to OpenMetadata
  • Alerting on failure (Slack, email, PagerDuty via providers)
  • Retry policies, SLAs, and monitoring are built in
  • Output: scheduled, monitored production pipeline (see the DAG sketch below)
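
A minimal sketch of the production DAG, assuming dbt is available on the Airflow workers and the project sits at /opt/dbt/akko (both assumptions); once the OpenLineage provider is configured, lineage for these tasks is emitted without extra code:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_revenue_pipeline",
        schedule="@daily",
        start_date=datetime(2026, 3, 1),
        catchup=False,
        default_args={
            "retries": 2,
            "retry_delay": timedelta(minutes=5),
            "email_on_failure": True,  # or Slack/PagerDuty via provider callbacks
        },
    ) as dag:
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="cd /opt/dbt/akko && dbt run --select daily_revenue",
        )
        dbt_test = BashOperator(
            task_id="dbt_test",
            bash_command="cd /opt/dbt/akko && dbt test --select daily_revenue",
        )
        dbt_run >> dbt_test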

Stage 6: Serve (Trino/Iceberg)

  • Production Iceberg tables available via Trino for BI tools (Superset)
  • Governed via Polaris RBAC (catalog-level) and OPA (row/column-level)
  • Time travel enables rollback and audit
  • Output: governed, queryable production data (see the audit and rollback sketch below)
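
A sketch of how snapshot auditing and rollback might look through Trino's Iceberg connector, reusing the hypothetical table from the earlier stages; the snapshot id is a placeholder, and access is assumed to be permitted by the Polaris/OPA policies in force:

    import trino

    conn = trino.dbapi.connect(host="trino", port=8080, user="data_engineer",
                               catalog="iceberg", schema="sales")
    cur = conn.cursor()

    # audit: list the table's snapshot history via the Iceberg metadata table
    cur.execute('SELECT snapshot_id, committed_at, operation '
                'FROM "daily_revenue$snapshots" ORDER BY committed_at')
    for snapshot_id, committed_at, operation in cur.fetchall():
        print(snapshot_id, committed_at, operation)

    # rollback: restore the table to a known-good snapshot using the Trino Iceberg
    # system procedure; the snapshot id below is a placeholder
    cur.execute("CALL iceberg.system.rollback_to_snapshot("
                "'sales', 'daily_revenue', 8954597067493422955)")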

Alternatives Considered

Databricks-style Unified Platform

  • Single environment for exploration and production
  • Notebooks directly schedulable as "jobs"
  • But: proprietary runtime, vendor lock-in, no Git-native workflow
  • Rejected: contradicts AKKO's open-source and sovereignty principles

Spark Submit Only (No dbt)

  • Skip dbt, use PySpark scripts directly in Airflow
  • Simpler stack (fewer tools)
  • But: loses dbt's testing framework, documentation generation, and SQL-first accessibility for analysts
  • Rejected: dbt's value for data quality and documentation outweighs the added complexity

Notebook-as-Production (Papermill)

  • Run notebooks in production via Papermill (parameterized execution)
  • Keeps the same artifact from exploration to production
  • But: notebooks are hard to test, review, and version control cleanly
  • Hidden state, execution order issues, merge conflicts in JSON
  • Rejected: notebooks are for exploration, not production

Consequences

Positive

  • Fully portable — every tool in the chain is open-source and replaceable
  • Audit-friendly — Git history provides complete provenance of every transformation
  • Skill-building — team learns industry-standard tools (Git, dbt, Airflow), not vendor-specific APIs
  • Separation of concerns — exploration, transformation, orchestration, and governance are cleanly separated
  • No vendor lock-in at any stage

Negative

  • More steps than Databricks' "one-click schedule" — requires discipline and tooling knowledge
  • Onboarding takes longer — team must learn JupyterHub + Git + dbt + Airflow (mitigated by training and templates)
  • Context switching between tools (notebook, code-server, Git, Airflow UI) can slow initial development

Neutral

  • Each stage can be skipped for simple use cases (e.g., SQL-only analysts can go directly from JupyterHub SQL to Superset without dbt)
  • Workflow can evolve — adding CI/CD (GitHub Actions), data contracts, or automated testing does not require changing the fundamental pattern

References