# ADR-001: Use Apache Iceberg as Table Format

## Status
Accepted
## Date
2026-03-09
## Context
AKKO needs an open table format to implement a lakehouse architecture on top of object storage (MinIO/S3). The table format must support ACID transactions, time travel, schema evolution, and be engine-agnostic (Trino + Spark at minimum). Three serious contenders exist in the market:
- Apache Iceberg — Apache TLP, backed by Apple/Netflix/Snowflake/Confluent
- Delta Lake — Created by Databricks, open-sourced under Linux Foundation
- Apache Hudi — Created by Uber, Apache TLP, streaming-oriented
## Decision
Use Apache Iceberg as the sole table format for AKKO.
Why Iceberg wins:

1. Industry convergence — Snowflake (Iceberg Tables), Databricks (UniForm reads Iceberg), AWS (Iceberg as default in Athena/Glue), Google BigQuery (Iceberg support). No other format has this breadth of adoption.
2. REST catalog standard — Iceberg defines an open REST catalog spec. Apache Polaris, Tabular, and Gravitino all implement it. No proprietary metastore required.
3. Partition evolution — Change the partitioning strategy without rewriting existing data. Delta and Hudi cannot do this.
4. Schema evolution — Add, rename, drop, and reorder columns safely, with full backward/forward compatibility.
5. Time travel — Query any snapshot by ID or timestamp. Built in, no extra configuration.
6. Hidden partitioning — Users write SQL without knowing the partition columns; the engine resolves partitions automatically.
7. Apache TLP — True community governance, no single-vendor control.
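Points 3–6 above can be sketched in Spark SQL. This is illustrative only: the catalog name `lake`, schema `db`, table `events`, and the snapshot ID are placeholders, and the `ADD/DROP PARTITION FIELD` statements assume the Iceberg Spark SQL extensions are enabled.

```sql
-- Hidden partitioning: partitioned by a transform, not an exposed column.
CREATE TABLE lake.db.events (
    id         BIGINT,
    event_time TIMESTAMP,
    payload    STRING
) USING iceberg
PARTITIONED BY (days(event_time));

-- Queries filter on the raw column; Iceberg prunes day partitions automatically.
SELECT count(*) FROM lake.db.events
WHERE event_time >= TIMESTAMP '2026-03-01 00:00:00';

-- Partition evolution: change the layout without rewriting existing files.
ALTER TABLE lake.db.events ADD PARTITION FIELD bucket(16, id);
ALTER TABLE lake.db.events DROP PARTITION FIELD days(event_time);

-- Schema evolution: safe in-place column changes.
ALTER TABLE lake.db.events ADD COLUMN region STRING;
ALTER TABLE lake.db.events RENAME COLUMN payload TO body;

-- Time travel: query an older snapshot by timestamp or by snapshot ID.
SELECT * FROM lake.db.events TIMESTAMP AS OF '2026-03-08 00:00:00';
SELECT * FROM lake.db.events VERSION AS OF 8727164733570508417;  -- placeholder ID
```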
## Alternatives Considered

### Delta Lake
- Created and controlled by Databricks
- OSS version lacks many features available in Databricks Runtime (Liquid Clustering, UniForm write, Z-ordering optimizations)
- Delta UniForm can read Iceberg metadata, which validates Iceberg as the interchange format
- Tight coupling with Spark; Trino support exists but is second-class
- Rejected: vendor-centric, OSS version is a subset
### Apache Hudi
- Designed for streaming upserts (Uber use case)
- Complex architecture (Copy-on-Write vs Merge-on-Read, multiple table types)
- Smaller community than Iceberg, less engine support
- Overkill for batch-first workloads
- Rejected: complexity not justified, streaming not a current priority
## Consequences

### Positive
- Near-zero lock-in risk — Iceberg is the most portable format, readable by every major engine
- Future-proof — all major cloud providers converging on Iceberg
- Clean separation of compute (Trino/Spark) and storage (MinIO/S3)
- Polaris REST catalog provides centralized governance
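To make the "no proprietary metastore" point concrete, here is a client-side REST catalog configuration in PyIceberg's `.pyiceberg.yaml` format. It is a sketch: the hostnames, credentials, and warehouse name are placeholders for whatever Polaris deployment AKKO runs.

```yaml
# .pyiceberg.yaml — any REST-capable client can talk to the same catalog.
catalog:
  polaris:
    type: rest
    uri: http://polaris.internal:8181/api/catalog   # placeholder host
    credential: <client-id>:<client-secret>          # placeholder credentials
    warehouse: akko
    # Data files live in MinIO/S3; the catalog serves only metadata pointers.
    s3.endpoint: http://minio.internal:9000          # placeholder host
```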
### Negative
- Iceberg compaction and maintenance (snapshot expiration, data-file rewrites) must be scheduled by the platform; there is no managed auto-optimization as with Databricks
- Fewer streaming-optimized features compared to Hudi (acceptable for batch-first platform)
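The maintenance burden above maps to Iceberg's built-in Spark procedures, which can run as a scheduled job. A sketch, again assuming a catalog named `lake` and an illustrative table `db.events`; retention values are examples, not recommendations.

```sql
-- Expire old snapshots: bounds time-travel history and frees unreferenced files.
CALL lake.system.expire_snapshots(
    table       => 'db.events',
    older_than  => TIMESTAMP '2026-03-02 00:00:00',
    retain_last => 10);

-- Compact small data files into larger ones.
CALL lake.system.rewrite_data_files(table => 'db.events');

-- Delete files in the table location not referenced by any snapshot.
CALL lake.system.remove_orphan_files(table => 'db.events');
```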
### Neutral
- Iceberg metadata files grow over time — periodic maintenance is standard practice
- Migration from other formats is straightforward via CTAS (CREATE TABLE AS SELECT)
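The CTAS migration path looks like this in Trino, where one statement reads a legacy Hive table and writes an Iceberg table; the catalog, schema, and column names are placeholders.

```sql
-- Copy a legacy Hive table into Iceberg in a single statement (Trino).
CREATE TABLE iceberg.analytics.orders
WITH (partitioning = ARRAY['month(order_ts)'])
AS SELECT * FROM hive.legacy.orders;
```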