# ADR-006: Ad-hoc to Production Workflow Pattern
## Status
Accepted
## Date
2026-03-09
## Context
One of Databricks' strongest selling points is the seamless path from notebook exploration to production jobs — same environment, same runtime, one click to schedule. AKKO must provide an equivalent workflow using only open-source tools, without vendor-specific "jobs" APIs or proprietary runtimes.
The challenge: how do data teams go from "I wrote a query in a notebook" to "this runs every day in production, monitored and governed"?
## Decision
Implement the Ad-hoc to Production workflow as a standard pipeline through open tools:
JupyterHub (explore) --> code-server (refactor) --> Git push --> dbt model --> Airflow DAG --> Trino/Iceberg (production)
Workflow stages:
### Stage 1: Explore (JupyterHub)
- Data scientists and analysts explore data in Jupyter notebooks
- Connect to Trino (SQL) or Spark (PySpark) for interactive queries (see the sketch after this list)
- Use pandas, plotly, seaborn for quick visualizations
- Query Iceberg tables with time travel for historical analysis
- Output: working notebook with validated logic
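
For example, exploration in this stage might look like the following minimal sketch, assuming the `trino` Python client and pandas are installed in the JupyterHub image; the host, catalog, schema, and table names are placeholders rather than AKKO configuration.

```python
import pandas as pd
import trino

# Hypothetical Trino endpoint and catalog; adjust to the cluster's actual setup.
conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="raw",
)

cur = conn.cursor()
cur.execute(
    "SELECT customer_id, order_total, order_ts "
    "FROM orders "
    "WHERE order_ts >= TIMESTAMP '2026-01-01 00:00:00'"
)

# Load the result into pandas for quick inspection and plotting in the notebook
df = pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])
df.describe()
```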
### Stage 2: Refactor (code-server)
- Open code-server (VS Code in browser) from the JupyterHub environment
- Extract notebook logic into clean Python modules or dbt SQL models (a sketch follows this list)
- Write unit tests, add type hints, apply linting
- Structure code following project conventions (src/, tests/, models/)
- Output: production-ready code in a Git branch
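
As an illustration of this stage, one way the notebook logic above could end up as a module plus a unit test; the file layout, function, and column names are hypothetical, not part of AKKO's conventions.

```python
# src/transforms/orders.py -- logic extracted from the exploration notebook
import pandas as pd


def daily_order_totals(orders: pd.DataFrame) -> pd.DataFrame:
    """Aggregate raw orders into one row per day with the summed order total."""
    return (
        orders.assign(order_date=orders["order_ts"].dt.date)
        .groupby("order_date", as_index=False)["order_total"]
        .sum()
    )


# tests/test_orders.py -- unit test that CI (Stage 3) runs with pytest
# (in a real layout this would import daily_order_totals from src.transforms.orders)
def test_daily_order_totals():
    orders = pd.DataFrame(
        {
            "order_ts": pd.to_datetime(["2026-01-01 09:00", "2026-01-01 17:00"]),
            "order_total": [10.0, 5.0],
        }
    )
    result = daily_order_totals(orders)
    assert result.loc[0, "order_total"] == 15.0
```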
### Stage 3: Version (Git)
- Push branch to Git repository (Gitea, GitHub, GitLab)
- Create pull request for code review
- CI runs tests, linting, and dbt compile (see the sketch after this list)
- Merge to main after approval
- Output: reviewed, tested code in main branch
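
The CI checks listed above could be driven by a small entry-point script such as the sketch below; the specific tools (pytest, ruff) and paths are assumptions, and many teams would express the same steps directly in their CI system's configuration instead.

```python
# Hypothetical CI entry point mirroring the checks above: tests, linting, dbt compile.
import subprocess
import sys

CHECKS = [
    ["pytest", "tests/"],
    ["ruff", "check", "src/"],
    ["dbt", "compile", "--project-dir", "dbt"],
]

for cmd in CHECKS:
    print("running:", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit(1)  # fail the pipeline on the first failing check
```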
### Stage 4: Model (dbt)
- dbt models define transformations as SQL SELECT statements
- Materialized as Iceberg tables via the dbt Trino adapter
- dbt tests validate data quality (uniqueness, not-null, relationships); see the sketch after this list
- dbt docs generate lineage and documentation automatically
- Output: tested, documented transformation models
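
For local runs or CI, the models and tests can also be exercised through dbt's programmatic entry point (available in dbt-core 1.5+); the selector name below is illustrative, and the project/profile configuration for the Trino adapter is assumed to be in place.

```python
from dbt.cli.main import dbtRunner, dbtRunnerResult

# Programmatic equivalent of `dbt build --select daily_order_totals`:
# materializes the model via Trino and runs its data tests.
dbt = dbtRunner()
res: dbtRunnerResult = dbt.invoke(["build", "--select", "daily_order_totals"])

if not res.success:
    raise RuntimeError("dbt build failed; inspect the logs for failing models or tests")
```

The same `dbt build` invocation is what the Airflow DAG in Stage 5 wraps on a schedule.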
### Stage 5: Orchestrate (Airflow)
- Airflow DAG schedules the dbt run + downstream tasks (a minimal DAG sketch follows this list)
- OpenLineage provider emits lineage events to OpenMetadata
- Alerting on failure (Slack, email, PagerDuty via providers)
- Retry policies, SLAs, and monitoring built-in
- Output: scheduled, monitored production pipeline
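
A minimal DAG sketch for this stage, assuming Airflow 2.4+ with dbt installed on the workers; the dag_id, schedule, and project path are assumptions, and OpenLineage emission comes from the separately configured provider rather than from anything in this file.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "retries": 2,                          # retry policy for transient failures
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,              # or Slack/PagerDuty via the respective providers
}

with DAG(
    dag_id="daily_order_totals",
    start_date=datetime(2026, 3, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="dbt build --project-dir /opt/akko/dbt --profiles-dir /opt/akko/dbt",
    )
    # Downstream tasks (e.g. refreshing a Superset dashboard cache) would chain here:
    # dbt_build >> refresh_superset
```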
### Stage 6: Serve (Trino/Iceberg)
- Production Iceberg tables available via Trino for BI tools (Superset)
- Governed via Polaris RBAC (catalog-level) and OPA (row/column-level)
- Time travel enables rollback and audit (see the query sketch after this list)
- Output: governed, queryable production data
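
As a sketch of the audit/rollback angle, the Trino Iceberg connector supports time-travel reads with `FOR TIMESTAMP AS OF`; the endpoint, schema, and table names below are placeholders.

```python
import trino

conn = trino.dbapi.connect(
    host="trino.example.internal",
    port=8080,
    user="bi_service",
    catalog="iceberg",
    schema="marts",
)
cur = conn.cursor()

# Read the production table as it existed at a past instant, e.g. to audit a
# report or to validate a rollback target before restoring a snapshot.
cur.execute(
    "SELECT count(*) FROM daily_order_totals "
    "FOR TIMESTAMP AS OF TIMESTAMP '2026-03-01 00:00:00 UTC'"
)
print("row count as of 2026-03-01:", cur.fetchone()[0])
```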
## Alternatives Considered
### Databricks-style Unified Platform
- Single environment for exploration and production
- Notebooks directly schedulable as "jobs"
- But: proprietary runtime, vendor lock-in, no Git-native workflow
- Rejected: contradicts AKKO's open-source and sovereignty principles
### Spark Submit Only (No dbt)
- Skip dbt, use PySpark scripts directly in Airflow
- Simpler stack (fewer tools)
- But: loses dbt's testing framework, documentation generation, and SQL-first accessibility for analysts
- Rejected: dbt's value for data quality and documentation outweighs the added complexity
### Notebook-as-Production (Papermill)
- Run notebooks in production via Papermill (parameterized execution)
- Keeps the same artifact from exploration to production
- But: notebooks are hard to test, review, and version control cleanly
- Hidden state, execution order issues, merge conflicts in JSON
- Rejected: notebooks are for exploration, not production
## Consequences
### Positive
- Fully portable — every tool in the chain is open-source and replaceable
- Audit-friendly — Git history provides complete provenance of every transformation
- Skill-building — the team learns industry-standard tools (Git, dbt, Airflow), not vendor-specific APIs
- Separation of concerns — exploration, transformation, orchestration, and governance are cleanly separated
- No vendor lock-in at any stage
### Negative
- More steps than Databricks' "one-click schedule" — requires discipline and tooling knowledge
- Onboarding takes longer — team must learn JupyterHub + Git + dbt + Airflow (mitigated by training and templates)
- Context switching between tools (notebook, code-server, Git, Airflow UI) can slow initial development
### Neutral
- Individual stages can be skipped for simple use cases (e.g., SQL-only analysts can go directly from JupyterHub SQL to Superset without dbt)
- Workflow can evolve — adding CI/CD (GitHub Actions), data contracts, or automated testing does not require changing the fundamental pattern