# Apache Spark

## Overview
Apache Spark provides distributed data processing for AKKO, with native Iceberg integration via the Polaris REST catalog. Spark is primarily accessed through the Spark Connect gRPC protocol from JupyterHub notebooks using PySpark.
## Architecture

```
akko-spark image (custom)
├── Iceberg runtime JAR
├── Iceberg AWS bundle JAR
├── Spark Connect JAR
├── OpenLineage Spark listener
└── OpenMetadata Spark agent
```

```
┌─────────────────┐
│  spark-master   │  Cluster manager (Spark Standalone)
│  :8090 (Web UI) │
└────────┬────────┘
         │
┌────────┴────────┐
│  spark-worker   │  Executor node
└────────┬────────┘
         │
┌────────┴────────┐
│  spark-connect  │  gRPC server (:15002)
│  (Thrift-like)  │
└─────────────────┘
         ▲
         │ gRPC (sc://)
┌────────┴────────┐
│    Notebooks    │  PySpark client
│ (akko-notebook) │
└─────────────────┘
```

All three services (`spark-master`, `spark-worker`, `spark-connect`) use the same custom Docker image: `akko-spark`.
## Custom Image

The `akko-spark` image is built from `apache/spark:3.5.1-python3` with additional JARs baked in:

```dockerfile
FROM apache/spark:3.5.1-python3

USER root

# Iceberg + AWS JARs (download URLs abbreviated to JAR filenames)
RUN wget -q -P /opt/spark/jars/ \
        iceberg-spark-runtime-3.5_2.12-1.5.2.jar \
        iceberg-aws-bundle-1.5.2.jar \
        spark-connect_2.12-3.5.1.jar

# OpenLineage + OpenMetadata agent (download URLs abbreviated to JAR filenames)
RUN wget -q -P /opt/spark/jars/ \
        openlineage-spark_2.12-1.28.0.jar \
        openmetadata-spark-agent.jar

USER spark
```
| JAR | Version | Purpose |
|---|---|---|
| `iceberg-spark-runtime` | 1.5.2 | Iceberg table format support |
| `iceberg-aws-bundle` | 1.5.2 | S3 filesystem support |
| `spark-connect` | 3.5.1 | Spark Connect gRPC server |
| `openlineage-spark` | 1.28.0 | OpenLineage event emission |
| `openmetadata-spark-agent` | 1.0 | Lineage transport to OpenMetadata |
**JARs Must Be in the Docker Image**

Spark JARs must be baked into the Docker image at build time. Using `--jars` or `--packages` at runtime causes ClassLoader conflicts and unpredictable failures. Always rebuild the image when adding or updating JARs:
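```bash
# Rebuild the custom images and redeploy (commands from the Known Issues section below)
helm/scripts/build-images.sh
helm upgrade akko helm/akko/ -n akko
```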
## Spark Connect

Spark Connect is a gRPC-based protocol (introduced in Spark 3.4) that decouples the client (PySpark in notebooks) from the server (Spark cluster). This means:

- No JARs are needed in the notebook image
- Thin client: only the `pyspark` pip package is required
- Server-side execution with a client-side DataFrame API

The Spark Connect server listens on port 15002 (gRPC).
### Connecting from Notebooks

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .remote("sc://akko-akko-spark-connect:15002") \
    .getOrCreate()

# Query Iceberg tables
df = spark.sql("SELECT * FROM iceberg.analytics.transactions")
df.show()
```
**Environment Variable**

The `SPARK_REMOTE` environment variable is pre-set in notebook containers to `sc://akko-akko-spark-connect:15002`. PySpark will use it automatically if no `.remote()` is specified:
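```python
from pyspark.sql import SparkSession

# With SPARK_REMOTE already set in the container, no explicit .remote() is needed
spark = SparkSession.builder.getOrCreate()
```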
## Iceberg Integration
Spark reads and writes Iceberg tables through the Polaris REST catalog. The catalog configuration is passed via Spark properties in the Helm values:
| Property | Value |
|---|---|
| `spark.sql.catalog.iceberg` | `org.apache.iceberg.spark.SparkCatalog` |
| `spark.sql.catalog.iceberg.type` | `rest` |
| `spark.sql.catalog.iceberg.uri` | `http://akko-akko-polaris:8181/api/catalog` |
| `spark.sql.catalog.iceberg.warehouse` | `akko-warehouse` |
| `spark.sql.catalog.iceberg.credential` | `root:{POLARIS_ROOT_SECRET}` |
| `spark.sql.catalog.iceberg.scope` | `PRINCIPAL_ROLE:ALL` |
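A quick way to confirm the catalog wiring from a notebook; a minimal check, assuming the `iceberg` catalog name configured above:

```python
# List the namespaces exposed by the Polaris-backed catalog
spark.sql("SHOW NAMESPACES IN iceberg").show()
```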
### Example: Writing Iceberg Tables
```python
# Create a namespace
spark.sql("CREATE NAMESPACE IF NOT EXISTS iceberg.analytics")

# Create a table
spark.sql("""
    CREATE TABLE IF NOT EXISTS iceberg.analytics.transactions (
        id BIGINT,
        customer_id INT,
        amount DECIMAL(10,2),
        transaction_date DATE
    ) USING iceberg
""")

# Insert data
spark.sql("""
    INSERT INTO iceberg.analytics.transactions VALUES
        (1, 101, 250.00, DATE '2026-01-15'),
        (2, 102, 1500.00, DATE '2026-01-16')
""")
```
## MinIO as S3 Storage Backend

Iceberg table data files (Parquet) are stored in MinIO, AKKO's S3-compatible object storage. The S3 configuration is passed through Spark properties:
| Property | Value |
|---|---|
| `spark.sql.catalog.iceberg.s3.endpoint` | `http://akko-minio:9000` |
| `spark.sql.catalog.iceberg.s3.path-style-access` | `true` |
| `spark.sql.catalog.iceberg.s3.access-key-id` | `{MINIO_ROOT_USER}` |
| `spark.sql.catalog.iceberg.s3.secret-access-key` | `{MINIO_ROOT_PASSWORD}` |
## OpenLineage
The Spark image includes the OpenLineage listener JAR, which emits lineage events (job start, complete, fail) for every Spark job. These events capture input/output datasets, enabling end-to-end data lineage tracking.
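The listener is enabled through standard Spark properties set on the server side. A sketch of a typical OpenLineage 1.x configuration; the transport URL and namespace below are illustrative assumptions, not values taken from AKKO's Helm values:

```properties
# Register the OpenLineage listener shipped in the akko-spark image
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener

# Where to send lineage events (hypothetical endpoint)
spark.openlineage.transport.type=http
spark.openlineage.transport.url=http://openlineage-api:5000

# Logical namespace attached to emitted events (illustrative)
spark.openlineage.namespace=akko
```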
## Services Summary

| Service | Port | Protocol | Purpose |
|---|---|---|---|
| `spark-master` | 8090 (Web UI), 7077 (cluster) | HTTP, Spark | Cluster manager |
| `spark-worker` | -- | Spark | Task executor |
| `spark-connect` | 15002 | gRPC | Client-facing Spark Connect server |
## Known Issues

**Important Gotchas**

- **JARs must be in the Docker image**: Never use `--jars` or `--packages`; ClassLoader issues will cause silent failures. Always rebuild the image.
- **`SerializedLambda` on `.collect()`**: Calling `.collect()` on Iceberg metadata tables (e.g., `snapshots`, `history`) fails with a `SerializedLambda` error in Spark Connect mode. Use `.show()` instead, as in the example after this list.
- **Rebuild after Dockerfile changes**: After modifying the Dockerfile, rebuild images with `helm/scripts/build-images.sh` and redeploy with `helm upgrade akko helm/akko/ -n akko`.
- **Polaris credential vending**: MinIO does not support STS AssumeRole, so the Polaris catalog is configured with `stsUnavailable=true` to bypass credential vending and use direct S3 credentials.
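For example, with the table from the earlier examples:

```python
# Works in Spark Connect mode: renders the metadata table without collecting it
spark.sql("SELECT * FROM iceberg.analytics.transactions.snapshots").show()

# Fails in Spark Connect mode with a SerializedLambda error:
# spark.sql("SELECT * FROM iceberg.analytics.transactions.snapshots").collect()
```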