ADR-001: Use Apache Iceberg as Table Format

Status

Accepted

Date

2026-03-09

Context

AKKO needs an open table format to implement a lakehouse architecture on top of object storage (MinIO/S3). The table format must support ACID transactions, time travel, schema evolution, and be engine-agnostic (Trino + Spark at minimum). Three serious contenders exist in the market:

  • Apache Iceberg — Apache TLP, backed by Apple/Netflix/Snowflake/Confluent
  • Delta Lake — Created by Databricks, open-sourced under Linux Foundation
  • Apache Hudi — Created by Uber, Apache TLP, streaming-oriented

Decision

Use Apache Iceberg as the sole table format for AKKO.

Why Iceberg wins:

  1. Industry convergence — Snowflake (Iceberg Tables), Databricks (UniForm reads Iceberg), AWS (Iceberg as the default in Athena/Glue), Google BigQuery (Iceberg support). No other format has this breadth of adoption.
  2. REST catalog standard — Iceberg defines an open REST catalog spec, implemented by Apache Polaris, Tabular, and Gravitino. No proprietary metastore required.
  3. Partition evolution — change the partitioning strategy without rewriting existing data. Delta and Hudi cannot do this.
  4. Schema evolution — add, rename, drop, and reorder columns safely, with full backward/forward compatibility.
  5. Time travel — query any snapshot by ID or timestamp. Built in, no extra configuration.
  6. Hidden partitioning — users write SQL without knowing the partition columns; the engine resolves partitions automatically.
  7. Apache TLP — true community governance, no single-vendor control.
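Points 3, 5, and 6 can be illustrated with Spark SQL against an Iceberg catalog. This is a sketch only; the catalog and table names (`lake.db.events`) and the snapshot ID are hypothetical:

```sql
-- Hidden partitioning: partition by a transform of ts.
-- Readers never reference the partition column directly.
CREATE TABLE lake.db.events (
  id      BIGINT,
  ts      TIMESTAMP,
  payload STRING
) USING iceberg
PARTITIONED BY (days(ts));

-- Filters on ts are pruned to matching partitions automatically.
SELECT count(*) FROM lake.db.events
WHERE ts >= TIMESTAMP '2026-03-01 00:00:00';

-- Partition evolution: switch to hourly granularity.
-- Only applies to new data; existing files are not rewritten.
ALTER TABLE lake.db.events
  REPLACE PARTITION FIELD days(ts) WITH hours(ts);

-- Time travel: query a past snapshot by timestamp or by snapshot ID.
SELECT * FROM lake.db.events TIMESTAMP AS OF '2026-03-08 00:00:00';
SELECT * FROM lake.db.events VERSION AS OF 1234567890;
```

Trino exposes the same capabilities with slightly different syntax (`FOR TIMESTAMP AS OF` / `FOR VERSION AS OF`), which is exactly the engine-agnostic behavior the Context section requires.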

Alternatives Considered

Delta Lake

  • Created and controlled by Databricks
  • OSS version lacks many features available in Databricks Runtime (Liquid Clustering, UniForm write, Z-ordering optimizations)
  • Delta UniForm can read Iceberg metadata, which validates Iceberg as the interchange format
  • Tight coupling with Spark; Trino support exists but is second-class
  • Rejected: vendor-centric, OSS version is a subset

Apache Hudi

  • Designed for streaming upserts (Uber use case)
  • Complex architecture (Copy-on-Write vs Merge-on-Read, multiple table types)
  • Smaller community than Iceberg, less engine support
  • Overkill for batch-first workloads
  • Rejected: complexity not justified, streaming not a current priority

Consequences

Positive

  • Near-zero lock-in risk — Iceberg is the most portable format, readable by every major engine
  • Future-proof — all major cloud providers converging on Iceberg
  • Clean separation of compute (Trino/Spark) and storage (MinIO/S3)
  • Polaris REST catalog provides centralized governance

Negative

  • Iceberg compaction and maintenance (expire snapshots, rewrite data files) must be managed — no auto-optimization like Databricks
  • Fewer streaming-optimized features compared to Hudi (acceptable for batch-first platform)
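The maintenance burden noted above is typically handled with Iceberg's built-in Spark procedures, run on a schedule. A minimal sketch, assuming a catalog named `lake` and a table `db.events` (both hypothetical):

```sql
-- Drop snapshots older than the retention window and the
-- data files only they reference.
CALL lake.system.expire_snapshots(
  table      => 'db.events',
  older_than => TIMESTAMP '2026-02-09 00:00:00'
);

-- Compact small files into fewer, larger ones.
CALL lake.system.rewrite_data_files(table => 'db.events');

-- Delete files in the table location no snapshot references.
CALL lake.system.remove_orphan_files(table => 'db.events');
```

In practice these calls would be wired into a scheduled job (e.g. an Airflow DAG or cron-driven Spark job) rather than run by hand.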

Neutral

  • Iceberg metadata files grow over time — periodic maintenance is standard practice
  • Migration from other formats is straightforward via CTAS (CREATE TABLE AS SELECT)
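The CTAS migration path can be sketched in Trino, assuming an `iceberg` catalog alongside a legacy `hive` catalog (names hypothetical):

```sql
-- Copy a legacy Hive table into Iceberg, choosing a partition
-- transform at creation time.
CREATE TABLE iceberg.analytics.events
WITH (partitioning = ARRAY['day(ts)'])
AS SELECT * FROM hive.legacy.events;
```

One statement rewrites the data into Iceberg-managed Parquet files with fresh metadata; the old table stays readable until cutover.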
