Apache Spark

Overview

Apache Spark provides distributed data processing for AKKO, with native Iceberg integration via the Polaris REST catalog. Spark is primarily accessed through the Spark Connect gRPC protocol from JupyterHub notebooks using PySpark.

Architecture

  akko-spark image (custom)
  ├── Iceberg runtime JAR
  ├── Iceberg AWS bundle JAR
  ├── Spark Connect JAR
  ├── OpenLineage Spark listener
  └── OpenMetadata Spark agent

  ┌──────────────────┐
  │   spark-master   │  Cluster manager (Spark Standalone)
  │  :8090 (Web UI)  │
  └────────┬─────────┘
           │ spark:// (:7077)
  ┌────────┴─────────┐
  │   spark-worker   │  Executor node
  └──────────────────┘

  ┌──────────────────┐
  │  spark-connect   │  gRPC server (:15002)
  └────────┬─────────┘
           │ gRPC (sc://)
  ┌────────┴─────────┐
  │    Notebooks     │  PySpark client
  │  (akko-notebook) │
  └──────────────────┘

All three services (spark-master, spark-worker, spark-connect) use the same custom Docker image: akko-spark.
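For local experimentation, the same wiring can be sketched in docker-compose form (illustrative only; the production deployment uses the Helm chart, and the service names, commands, and environment here are assumptions):

```yaml
# Illustrative sketch: three services from the single akko-spark image.
# SPARK_NO_DAEMONIZE keeps the Spark scripts in the foreground for Compose.
services:
  spark-master:
    image: akko-spark:2026.03
    command: /opt/spark/sbin/start-master.sh
    environment:
      SPARK_NO_DAEMONIZE: "1"
      SPARK_MASTER_WEBUI_PORT: "8090"
    ports: ["8090:8090", "7077:7077"]
  spark-worker:
    image: akko-spark:2026.03
    command: /opt/spark/sbin/start-worker.sh spark://spark-master:7077
    environment:
      SPARK_NO_DAEMONIZE: "1"
    depends_on: [spark-master]
  spark-connect:
    image: akko-spark:2026.03
    command: /opt/spark/sbin/start-connect-server.sh --master spark://spark-master:7077
    environment:
      SPARK_NO_DAEMONIZE: "1"
    ports: ["15002:15002"]
```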

Custom Image

The akko-spark image is built from apache/spark:3.5.1-python3 with additional JARs baked in:

docker/spark/Dockerfile

FROM apache/spark:3.5.1-python3

USER root

ARG MAVEN=https://repo1.maven.org/maven2

# Iceberg + AWS + Spark Connect JARs (Maven Central)
RUN wget -q -P /opt/spark/jars/ \
    $MAVEN/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.5.2/iceberg-spark-runtime-3.5_2.12-1.5.2.jar \
    $MAVEN/org/apache/iceberg/iceberg-aws-bundle/1.5.2/iceberg-aws-bundle-1.5.2.jar \
    $MAVEN/org/apache/spark/spark-connect_2.12/3.5.1/spark-connect_2.12-3.5.1.jar

# OpenLineage listener (Maven Central) + OpenMetadata agent
# (the openmetadata-spark-agent download URL is elided here)
RUN wget -q -P /opt/spark/jars/ \
    $MAVEN/io/openlineage/openlineage-spark_2.12/1.28.0/openlineage-spark_2.12-1.28.0.jar \
    openmetadata-spark-agent.jar

USER spark

JAR                       Version  Purpose
iceberg-spark-runtime     1.5.2    Iceberg table format support
iceberg-aws-bundle        1.5.2    S3 filesystem support
spark-connect             3.5.1    Spark Connect gRPC server
openlineage-spark         1.28.0   OpenLineage event emission
openmetadata-spark-agent  1.0      Lineage transport to OpenMetadata

JARs Must Be in the Docker Image

Spark JARs must be baked into the Docker image at build time. Using --jars or --packages at runtime causes ClassLoader conflicts and unpredictable failures. Always rebuild the image when adding or updating JARs:

docker build -t akko-spark:2026.03 \
  -f docker/spark/Dockerfile docker/spark/

Spark Connect

Spark Connect is a gRPC-based protocol (introduced in Spark 3.4) that decouples the client (PySpark in notebooks) from the server (Spark cluster). This means:

  • No JARs needed in the notebook image
  • Thin client: only pyspark pip package required
  • Server-side execution with client-side DataFrame API

The Spark Connect server listens on port 15002 (gRPC).
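The connection string follows the pattern sc://host:port. A tiny helper (hypothetical, not part of PySpark) makes the scheme concrete:

```python
# Hypothetical helper (not part of PySpark): builds a Spark Connect
# connection string of the form sc://host:port and validates the port.
def connect_url(host: str, port: int = 15002) -> str:
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return f"sc://{host}:{port}"

print(connect_url("akko-akko-spark-connect"))  # sc://akko-akko-spark-connect:15002
```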

Connecting from Notebooks

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .remote("sc://akko-akko-spark-connect:15002") \
    .getOrCreate()

# Query Iceberg tables
df = spark.sql("SELECT * FROM iceberg.analytics.transactions")
df.show()

Environment Variable

The SPARK_REMOTE environment variable is pre-set in notebook containers to sc://akko-akko-spark-connect:15002. PySpark will use it automatically if no .remote() is specified:

# This also works (uses SPARK_REMOTE env var)
spark = SparkSession.builder.getOrCreate()
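The precedence can be sketched in plain Python (a simplification; the real resolution logic lives inside SparkSession.builder):

```python
import os

# Simplified sketch of endpoint resolution: an explicit .remote(...)
# argument wins; otherwise the SPARK_REMOTE env var is used.
def resolve_remote(explicit=None):
    return explicit or os.environ.get("SPARK_REMOTE")

os.environ["SPARK_REMOTE"] = "sc://akko-akko-spark-connect:15002"
print(resolve_remote())                    # sc://akko-akko-spark-connect:15002
print(resolve_remote("sc://other:15002"))  # sc://other:15002
```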

Iceberg Integration

Spark reads and writes Iceberg tables through the Polaris REST catalog. The catalog configuration is passed via Spark properties in the Helm values:

Property                              Value
spark.sql.catalog.iceberg             org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg.type        rest
spark.sql.catalog.iceberg.uri         http://akko-akko-polaris:8181/api/catalog
spark.sql.catalog.iceberg.warehouse   akko-warehouse
spark.sql.catalog.iceberg.credential  root:{POLARIS_ROOT_SECRET}
spark.sql.catalog.iceberg.scope       PRINCIPAL_ROLE:ALL
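In the Helm values, the properties land in a block along these lines (a sketch; the `sparkProperties` key name and exact nesting depend on the akko chart, only the property names and values are taken from the table above):

```yaml
# Illustrative Helm values fragment -- key layout is chart-specific.
sparkProperties:
  spark.sql.catalog.iceberg: org.apache.iceberg.spark.SparkCatalog
  spark.sql.catalog.iceberg.type: rest
  spark.sql.catalog.iceberg.uri: http://akko-akko-polaris:8181/api/catalog
  spark.sql.catalog.iceberg.warehouse: akko-warehouse
```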

Example: Writing Iceberg Tables

# Create a namespace
spark.sql("CREATE NAMESPACE IF NOT EXISTS iceberg.analytics")

# Create a table
spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg.analytics.transactions (
        id BIGINT,
        customer_id INT,
        amount DECIMAL(10,2),
        transaction_date DATE
    ) USING iceberg
""")

# Insert data
spark.sql("""
    INSERT INTO iceberg.analytics.transactions VALUES
    (1, 101, 250.00, DATE '2026-01-15'),
    (2, 102, 1500.00, DATE '2026-01-16')
""")

MinIO as S3 Storage Backend

Iceberg table data files (Parquet) are stored in MinIO, AKKO's S3-compatible object store. The S3 configuration is passed through Spark properties:

Property                                        Value
spark.sql.catalog.iceberg.s3.endpoint           http://akko-minio:9000
spark.sql.catalog.iceberg.s3.path-style-access  true
spark.sql.catalog.iceberg.s3.access-key-id      {MINIO_ROOT_USER}
spark.sql.catalog.iceberg.s3.secret-access-key  {MINIO_ROOT_PASSWORD}

OpenLineage

The Spark image includes the OpenLineage listener JAR, which emits lineage events (job start, complete, fail) for every Spark job. These events capture input/output datasets, enabling end-to-end data lineage tracking.
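Activating the listener takes Spark properties along these lines (property names are from the OpenLineage Spark integration; the transport URL and namespace values are illustrative, not this deployment's actual settings):

```properties
# Illustrative OpenLineage wiring -- URL and namespace are placeholders.
spark.extraListeners              io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type  http
spark.openlineage.transport.url   http://lineage-backend:5000
spark.openlineage.namespace       akko
```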

Services Summary

Service        Port                           Protocol     Purpose
spark-master   8090 (Web UI), 7077 (cluster)  HTTP, Spark  Cluster manager
spark-worker   --                             Spark        Task executor
spark-connect  15002                          gRPC         Client-facing Spark Connect server

Known Issues

Important Gotchas

  • JARs must be in Docker image: Never use --jars or --packages. ClassLoader issues will cause silent failures. Always rebuild the image.
  • SerializedLambda on .collect(): Calling .collect() on Iceberg metadata tables (e.g., snapshots, history) fails with a SerializedLambda error in Spark Connect mode. Use .show() instead:

    # This fails:
    spark.sql("SELECT * FROM iceberg.analytics.transactions.snapshots").collect()
    
    # This works:
    spark.sql("SELECT * FROM iceberg.analytics.transactions.snapshots").show()
    
  • Rebuild after Dockerfile changes: After modifying the Dockerfile, rebuild images with helm/scripts/build-images.sh and redeploy with helm upgrade akko helm/akko/ -n akko.

  • Polaris credential vending: MinIO does not support STS AssumeRole. The Polaris catalog is configured with stsUnavailable=true to bypass credential vending and use direct S3 credentials.