Apache Spark

Overview

Apache Spark provides distributed data processing for AKKO, with native Iceberg integration via the Polaris REST catalog. Spark is primarily accessed through the Spark Connect gRPC protocol from JupyterHub notebooks using PySpark.

Architecture

  akko-spark image (custom)
  ├── Iceberg runtime JAR
  ├── Iceberg AWS bundle JAR
  ├── Spark Connect JAR
  ├── OpenLineage Spark listener
  └── OpenMetadata Spark agent

  ┌──────────────────┐
  │   spark-master   │  Cluster manager (Spark Standalone)
  │  :8090 (Web UI)  │
  └────────┬─────────┘
           │ spark:// (:7077)
  ┌────────┴─────────┐
  │   spark-worker   │  Executor node
  └──────────────────┘

  ┌──────────────────┐
  │  spark-connect   │  gRPC server (:15002)
  └────────┬─────────┘
           │ gRPC (sc://)
  ┌────────┴─────────┐
  │    Notebooks     │  PySpark client
  │  (akko-notebook) │
  └──────────────────┘

All three services (spark-master, spark-worker, spark-connect) use the same custom Docker image: akko-spark.
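For local experimentation, the same wiring can be sketched in docker-compose form (illustrative only; the production deployment uses the Helm chart, and the service names, commands, and environment here are assumptions):

```yaml
# Illustrative sketch: three services from the single akko-spark image.
# SPARK_NO_DAEMONIZE keeps the Spark scripts in the foreground for Compose.
services:
  spark-master:
    image: akko-spark:2026.03
    command: /opt/spark/sbin/start-master.sh
    environment:
      SPARK_NO_DAEMONIZE: "1"
      SPARK_MASTER_WEBUI_PORT: "8090"
    ports: ["8090:8090", "7077:7077"]
  spark-worker:
    image: akko-spark:2026.03
    command: /opt/spark/sbin/start-worker.sh spark://spark-master:7077
    environment:
      SPARK_NO_DAEMONIZE: "1"
    depends_on: [spark-master]
  spark-connect:
    image: akko-spark:2026.03
    command: /opt/spark/sbin/start-connect-server.sh --master spark://spark-master:7077
    environment:
      SPARK_NO_DAEMONIZE: "1"
    ports: ["15002:15002"]
```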

Custom Image

The akko-spark image is built from apache/spark:3.5.1-python3 with additional JARs baked in:

docker/spark/Dockerfile

FROM apache/spark:3.5.1-python3

USER root

ARG MAVEN=https://repo1.maven.org/maven2

# Iceberg + AWS + Spark Connect JARs (Maven Central)
RUN wget -q -P /opt/spark/jars/ \
    $MAVEN/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.2/iceberg-spark-runtime-3.5_2.12-1.5.2.jar \
    $MAVEN/org/apache/iceberg/iceberg-aws-bundle/1.5.2/iceberg-aws-bundle-1.5.2.jar \
    $MAVEN/org/apache/spark/spark-connect_2.12/3.5.1/spark-connect_2.12-3.5.1.jar

# OpenLineage listener (Maven Central) + OpenMetadata agent
# (the openmetadata-spark-agent download URL is elided here)
RUN wget -q -P /opt/spark/jars/ \
    $MAVEN/io/openlineage/openlineage-spark_2.12/1.28.0/openlineage-spark_2.12-1.28.0.jar \
    openmetadata-spark-agent.jar

USER spark

JAR                       Version  Purpose
iceberg-spark-runtime     1.5.2    Iceberg table format support
iceberg-aws-bundle        1.5.2    S3 filesystem support
spark-connect             3.5.1    Spark Connect gRPC server
openlineage-spark         1.28.0   OpenLineage event emission
openmetadata-spark-agent  1.0      Lineage transport to OpenMetadata

JARs Must Be in the Docker Image

Spark JARs must be baked into the Docker image at build time. Using --jars or --packages at runtime causes ClassLoader conflicts and unpredictable failures. Always rebuild the image when adding or updating JARs:

docker build -t akko-spark:2026.03 \
  -f docker/spark/Dockerfile docker/spark/

Spark Connect

Spark Connect is a gRPC-based protocol (introduced in Spark 3.4) that decouples the client (PySpark in notebooks) from the server (Spark cluster). This means:

  • No JARs needed in the notebook image
  • Thin client: only pyspark pip package required
  • Server-side execution with client-side DataFrame API

The Spark Connect server listens on port 15002 (gRPC).
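The connection string follows the pattern sc://host:port. A tiny helper (hypothetical, not part of PySpark) makes the scheme concrete:

```python
# Hypothetical helper (not part of PySpark): builds a Spark Connect
# connection string of the form sc://host:port and validates the port.
def connect_url(host: str, port: int = 15002) -> str:
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return f"sc://{host}:{port}"

print(connect_url("akko-akko-spark-connect"))  # sc://akko-akko-spark-connect:15002
```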

Connecting from Notebooks

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .remote("sc://akko-akko-spark-connect:15002") \
    .getOrCreate()

# Query Iceberg tables
df = spark.sql("SELECT * FROM iceberg.analytics.transactions")
df.show()

Environment Variable

The SPARK_REMOTE environment variable is pre-set in notebook containers to sc://akko-akko-spark-connect:15002. PySpark will use it automatically if no .remote() is specified:

# This also works (uses SPARK_REMOTE env var)
spark = SparkSession.builder.getOrCreate()
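The precedence can be sketched in plain Python (a simplification; the real resolution logic lives inside SparkSession.builder):

```python
import os

# Simplified sketch of endpoint resolution: an explicit .remote(...)
# argument wins; otherwise the SPARK_REMOTE env var is used.
def resolve_remote(explicit=None):
    return explicit or os.environ.get("SPARK_REMOTE")

os.environ["SPARK_REMOTE"] = "sc://akko-akko-spark-connect:15002"
print(resolve_remote())                    # sc://akko-akko-spark-connect:15002
print(resolve_remote("sc://other:15002"))  # sc://other:15002
```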

Iceberg Integration

Spark reads and writes Iceberg tables through the Polaris REST catalog. The catalog configuration is passed via Spark properties in the Helm values:

Property                              Value
spark.sql.catalog.iceberg             org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.type        rest
spark.sql.catalog.iceberg.uri         http://akko-akko-polaris:8181/api/catalog
spark.sql.catalog.iceberg.warehouse   akko-warehouse
spark.sql.catalog.iceberg.credential  root:{POLARIS_ROOT_SECRET}
spark.sql.catalog.iceberg.scope       PRINCIPAL_ROLE:ALL
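In the Helm values, the properties land in a block along these lines (a sketch; the `sparkProperties` key name and exact nesting depend on the akko chart, only the property names and values are taken from the table above):

```yaml
# Illustrative Helm values fragment -- key layout is chart-specific.
sparkProperties:
  spark.sql.catalog.iceberg: org.apache.iceberg.spark.SparkCatalog
  spark.sql.catalog.iceberg.type: rest
  spark.sql.catalog.iceberg.uri: http://akko-akko-polaris:8181/api/catalog
  spark.sql.catalog.iceberg.warehouse: akko-warehouse
```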

Example: Writing Iceberg Tables

# Create a namespace
spark.sql("CREATE NAMESPACE IF NOT EXISTS iceberg.analytics")

# Create a table
spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg.analytics.transactions (
        id BIGINT,
        customer_id INT,
        amount DECIMAL(10,2),
        transaction_date DATE
    ) USING iceberg
""")

# Insert data
spark.sql("""
    INSERT INTO iceberg.analytics.transactions VALUES
    (1, 101, 250.00, DATE '2026-01-15'),
    (2, 102, 1500.00, DATE '2026-01-16')
""")

MinIO as S3 Storage Backend

Iceberg table data files (Parquet) are stored in MinIO, AKKO's S3-compatible object store. The S3 configuration is passed through Spark properties:

Property                                        Value
spark.sql.catalog.iceberg.s3.endpoint           http://akko-minio:9000
spark.sql.catalog.iceberg.s3.path-style-access  true
spark.sql.catalog.iceberg.s3.access-key-id      {MINIO_ROOT_USER}
spark.sql.catalog.iceberg.s3.secret-access-key  {MINIO_ROOT_PASSWORD}

OpenLineage

The Spark image includes the OpenLineage listener JAR, which emits lineage events (job start, complete, fail) for every Spark job. These events capture input/output datasets, enabling end-to-end data lineage tracking.
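Activating the listener takes Spark properties along these lines (property names are from the OpenLineage Spark integration; the transport URL and namespace values are illustrative, not this deployment's actual settings):

```properties
# Illustrative OpenLineage wiring -- URL and namespace are placeholders.
spark.extraListeners              io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type  http
spark.openlineage.transport.url   http://lineage-backend:5000
spark.openlineage.namespace       akko
```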

Services Summary

Service        Port                           Protocol     Purpose
spark-master   8090 (Web UI), 7077 (cluster)  HTTP, Spark  Cluster manager
spark-worker   --                             Spark        Task executor
spark-connect  15002                          gRPC         Client-facing Spark Connect server

Known Issues

Important Gotchas

  • JARs must be in Docker image: Never use --jars or --packages. ClassLoader issues will cause silent failures. Always rebuild the image.
  • SerializedLambda on .collect(): Calling .collect() on Iceberg metadata tables (e.g., snapshots, history) fails with a SerializedLambda error in Spark Connect mode. Use .show() instead:

    # This fails:
    spark.sql("SELECT * FROM iceberg.analytics.transactions.snapshots").collect()
    
    # This works:
    spark.sql("SELECT * FROM iceberg.analytics.transactions.snapshots").show()
    
  • Rebuild after Dockerfile changes: After modifying the Dockerfile, rebuild images with helm/scripts/build-images.sh and redeploy with helm upgrade akko helm/akko/ -n akko.

  • Polaris credential vending: MinIO does not support STS AssumeRole. The Polaris catalog is configured with stsUnavailable=true to bypass credential vending and use direct S3 credentials.