ADR-001: Use Apache Iceberg as Table Format

Status

Accepted

Date

2026-03-09

Context

AKKO needs an open table format to implement a lakehouse architecture on top of object storage (MinIO/S3). The table format must support ACID transactions, time travel, schema evolution, and be engine-agnostic (Trino + Spark at minimum). Three serious contenders exist in the market:

  • Apache Iceberg — Apache TLP, backed by Apple/Netflix/Snowflake/Confluent
  • Delta Lake — Created by Databricks, open-sourced under Linux Foundation
  • Apache Hudi — Created by Uber, Apache TLP, streaming-oriented

Decision

Use Apache Iceberg as the sole table format for AKKO.

Why Iceberg wins:

  1. Industry convergence — Snowflake (Iceberg Tables), Databricks (UniForm reads Iceberg), AWS (Iceberg as the default in Athena/Glue), Google BigQuery (Iceberg support). No other format has this breadth of adoption.
  2. REST catalog standard — Iceberg defines an open REST catalog spec, implemented by Apache Polaris, Tabular, and Gravitino. No proprietary metastore required.
  3. Partition evolution — change the partitioning strategy without rewriting existing data. Delta and Hudi cannot do this.
  4. Schema evolution — add, rename, drop, and reorder columns safely, with full backward/forward compatibility.
  5. Time travel — query any snapshot by ID or timestamp. Built in, no extra configuration.
  6. Hidden partitioning — users write SQL without knowing the partition columns; the engine resolves partitions automatically.
  7. Apache TLP — true community governance, no single-vendor control.
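Points 3, 5, and 6 can be illustrated with Spark SQL against an Iceberg catalog. This is a sketch only; the catalog and table names (`lake.db.events`) and the snapshot ID are hypothetical:

```sql
-- Hidden partitioning: partition by a transform of ts.
-- Readers never reference the partition column directly.
CREATE TABLE lake.db.events (
  id      BIGINT,
  ts      TIMESTAMP,
  payload STRING
) USING iceberg
PARTITIONED BY (days(ts));

-- Filters on ts are pruned to matching partitions automatically.
SELECT count(*) FROM lake.db.events
WHERE ts >= TIMESTAMP '2026-03-01 00:00:00';

-- Partition evolution: switch to hourly granularity.
-- Only applies to new data; existing files are not rewritten.
ALTER TABLE lake.db.events
  REPLACE PARTITION FIELD days(ts) WITH hours(ts);

-- Time travel: query a past snapshot by timestamp or by snapshot ID.
SELECT * FROM lake.db.events TIMESTAMP AS OF '2026-03-08 00:00:00';
SELECT * FROM lake.db.events VERSION AS OF 1234567890;
```

Trino exposes the same capabilities with slightly different syntax (`FOR TIMESTAMP AS OF` / `FOR VERSION AS OF`), which is exactly the engine-agnostic behavior the Context section requires.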

Alternatives Considered

Delta Lake

  • Created and controlled by Databricks
  • OSS version lacks many features available in Databricks Runtime (Liquid Clustering, UniForm write, Z-ordering optimizations)
  • Delta UniForm can read Iceberg metadata, which validates Iceberg as the interchange format
  • Tight coupling with Spark; Trino support exists but is second-class
  • Rejected: vendor-centric, OSS version is a subset

Apache Hudi

  • Designed for streaming upserts (Uber use case)
  • Complex architecture (Copy-on-Write vs Merge-on-Read, multiple table types)
  • Smaller community than Iceberg, less engine support
  • Overkill for batch-first workloads
  • Rejected: complexity not justified, streaming not a current priority

Consequences

Positive

  • Near-zero lock-in risk — Iceberg is the most portable format, readable by every major engine
  • Future-proof — all major cloud providers converging on Iceberg
  • Clean separation of compute (Trino/Spark) and storage (MinIO/S3)
  • Polaris REST catalog provides centralized governance

Negative

  • Iceberg compaction and maintenance (expire snapshots, rewrite data files) must be managed — no auto-optimization like Databricks
  • Fewer streaming-optimized features compared to Hudi (acceptable for batch-first platform)
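The maintenance burden noted above is typically handled with Iceberg's built-in Spark procedures, run on a schedule. A minimal sketch, assuming a catalog named `lake` and a table `db.events` (both hypothetical):

```sql
-- Drop snapshots older than the retention window and the
-- data files only they reference.
CALL lake.system.expire_snapshots(
  table      => 'db.events',
  older_than => TIMESTAMP '2026-02-09 00:00:00'
);

-- Compact small files into fewer, larger ones.
CALL lake.system.rewrite_data_files(table => 'db.events');

-- Delete files in the table location no snapshot references.
CALL lake.system.remove_orphan_files(table => 'db.events');
```

In practice these calls would be wired into a scheduled job (e.g. an Airflow DAG or cron-driven Spark job) rather than run by hand.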

Neutral

  • Iceberg metadata files grow over time — periodic maintenance is standard practice
  • Migration from other formats is straightforward via CTAS (CREATE TABLE AS SELECT)
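The CTAS migration path can be sketched in Trino, assuming an `iceberg` catalog alongside a legacy `hive` catalog (names hypothetical):

```sql
-- Copy a legacy Hive table into Iceberg, choosing a partition
-- transform at creation time.
CREATE TABLE iceberg.analytics.events
WITH (partitioning = ARRAY['day(ts)'])
AS SELECT * FROM hive.legacy.events;
```

One statement rewrites the data into Iceberg-managed Parquet files with fresh metadata; the old table stays readable until cutover.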
