# Unified Data + AI Architecture
The thesis behind AKKO: big data and AI are not two stacks; they share the same infrastructure primitives. The industry has split them in two since 2023, because the AI wave arrived after the data platforms were already built. The cost of that split is duplicated plumbing, fragmented governance, and two teams that don't talk to each other.
AKKO collapses the split.
## The five things both worlds need
Every data pipeline, whether it's an ELT load of 10 billion transactions or a RAG pipeline over 10 million PDFs, runs through the same five layers:
| Layer | Need | AKKO component |
|---|---|---|
| Storage | Cheap, durable object store; columnar formats | MinIO + Iceberg + Polaris |
| Compute | Distributed, container-native, schedulable | Spark + Trino |
| Orchestration | Cron- or event-triggered DAGs, retries, audit | Airflow 3 |
| Governance | SSO, RBAC, fine-grained row/column policies, audit | Keycloak + OPA + OpenMetadata |
| Observability | Metrics, traces, logs, lineage | Grafana / VictoriaMetrics + Tempo + Loki + OpenMetadata |
The AI-specific layers (embeddings, vector search, model registry, LLM serving) are additions to this stack, not replacements for it:
| AI-specific need | AKKO component |
|---|---|
| Embedding model hosting | Ollama (local) + LiteLLM (cloud bridge) |
| Vector store (small) | pgvector in akko-postgresql-data |
| Vector store (medium) | OpenSearch dense_vector (already bundled) |
| Vector store (huge) | Iceberg tables with VECTOR columns + Trino |
| Model registry | MLflow |
| Serving | Custom FastAPI services (akko-ai-service, future akko-rag) |
| Document extraction | Docling (IBM Research, MIT) |
| ASR | Whisper |
| LLM-native SQL | ADEN (AKKO Data-Engine Native) + Trino AI plugin |
Every one of these components runs on the same Kubernetes cluster, authenticates through the same Keycloak, is governed by the same OPA policies, and records its lineage in the same OpenMetadata graph.
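For instance, the embedding layer can be consumed through one call shape whether the model runs locally or in the cloud. A minimal sketch, assuming a LiteLLM client, an in-cluster Ollama service URL, and illustrative model names (none of these are AKKO defaults):

```python
# Minimal sketch: one embedding call that can target local Ollama or a cloud
# provider through LiteLLM. Model names and api_base are assumptions.
from litellm import embedding

def embed_text(text: str, local: bool = True) -> list[float]:
    if local:
        # Ollama served inside the cluster (hypothetical service URL).
        response = embedding(
            model="ollama/nomic-embed-text",
            input=[text],
            api_base="http://ollama.akko.svc:11434",
        )
    else:
        # Same call shape against a cloud provider via the LiteLLM bridge;
        # requires provider credentials in the environment.
        response = embedding(model="text-embedding-3-small", input=[text])
    return response.data[0]["embedding"]

vec = embed_text("high-risk customer with repeated audit findings")
print(len(vec))  # dimensionality depends on the model
```

Swapping local for cloud is a parameter, not a rewrite, which is the point of bundling Ollama and LiteLLM together.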
## Three volume tiers with one API surface
For the document-intelligence case (PDFs at scale), AKKO scales gracefully from a small office to national-archive volumes. The consuming side (APIs, cockpit UI, SQL dialect) never changes — only the backend swaps:
```
Tier 1 — Small (up to ~1,000 PDFs ≈ 250k chunks)
├─ pgvector in akko-postgresql-data
├─ FastAPI direct ingest
├─ query latency < 50 ms
└─ Use case: SMB, law firms, prospect demos

Tier 2 — Mid (up to ~100,000 PDFs ≈ 25M chunks)
├─ OpenSearch dense_vector (bundled in AKKO)
├─ Airflow nightly batch ingestion
├─ Hybrid BM25 keyword + dense vector search
├─ query latency < 100 ms
└─ Use case: regional insurers, consulting firms, mid-market

Tier 3 — Big data (millions of PDFs ≈ billions of chunks)
├─ Iceberg tables on MinIO (cold tier)
├─ OpenSearch hot tier (last 12 months)
├─ Spark batch embedding distributed on cluster
├─ Trino SQL for hybrid vector + structured queries
├─ Qdrant optional for ultra-low latency hot set
└─ Use case: pharma, national banks, gov/fisc, media
```
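To make Tier 1 concrete, here is a minimal sketch of the nearest-neighbour lookup pgvector would serve, assuming a hypothetical rag_chunks table with an embedding column and an illustrative DSN:

```python
# Minimal Tier-1 sketch: nearest-neighbour lookup in pgvector.
# Table/column names and the DSN are assumptions, not AKKO defaults.
import psycopg

def search_chunks(query_vec: list[float], k: int = 5) -> list[tuple[str, float]]:
    # pgvector accepts a vector literal like '[0.1,0.2,...]';
    # <=> is cosine distance.
    literal = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"
    with psycopg.connect("dbname=akko user=akko host=akko-postgresql-data") as conn:
        rows = conn.execute(
            """
            SELECT text, embedding <=> %s::vector AS distance
            FROM rag_chunks
            ORDER BY distance
            LIMIT %s
            """,
            (literal, k),
        ).fetchall()
    return rows
```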
Backend selection happens via values.yaml. The cockpit page, the REST API, and the SQL dialect stay identical; the client can start on Tier 1 and graduate to Tier 3 without re-architecting.
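At Tier 2 the same request would be answered by OpenSearch instead. A hedged sketch of the hybrid BM25 + dense-vector query the backend might issue, with an assumed index name, field names, and host:

```python
# Minimal Tier-2 sketch: hybrid BM25 + k-NN query against OpenSearch.
# Index name, field names, and host are assumptions, not AKKO defaults.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "akko-opensearch", "port": 9200}])

def hybrid_search(query_text: str, query_vec: list[float], k: int = 10):
    body = {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    # BM25 keyword leg
                    {"match": {"text": query_text}},
                    # dense-vector leg (OpenSearch k-NN plugin)
                    {"knn": {"embedding": {"vector": query_vec, "k": k}}},
                ]
            }
        },
    }
    return client.search(index="rag-chunks", body=body)["hits"]["hits"]
```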
## Why this matters commercially
Competing vendors force the client into a category at purchase time:
| Vendor | Forces client to |
|---|---|
| Databricks Vector Search | Be big from day one (expensive); no on-prem |
| Pinecone | Be SaaS-only; not sovereign; proprietary licence |
| Snowflake Cortex | Be all-in on Snowflake; proprietary lock-in |
| Weaviate Cloud / Qdrant Cloud | Just the vector store; everything else your problem |
| Elasticsearch + "some AI" | Keep the data/AI split; the Elastic licence restricts offering it as a managed service |
AKKO grows with the client. Start Tier 1 on a single VM, grow to Tier 3 on a 50-node Kubernetes cluster, with the same API surface throughout. Tier 1 is bundled in the Lab edition, Tier 2 is unlocked in the Data edition, and Tier 3 in the Enterprise edition. An upgrade is a flipped value in values.yaml and a redeploy: no schema migration, no API contract rewrite.
## A concrete pipeline: from PDF to dashboard
Take a banking example: branches upload audit PDFs to MinIO, and analysts ask "show me high-risk customers with historical audit concerns". The path:
```
Branch uploads PDF ──► MinIO bucket ◄─── S3 event
                             │
                             ▼ Airflow DAG
                    ┌─────────────────┐
                    │ Docling extract │
                    └─────────────────┘
                             │ markdown + tables + OCR
                             ▼
                    ┌─────────────────┐
                    │ Spark job:      │
                    │ chunk + embed   │
                    │ via Ollama UDF  │
                    └─────────────────┘
                             │ billions of (text, vec) rows
                             ▼
          ┌────────────────────────────────────┐
          │ Iceberg table rag.banking_chunks   │
          │ partitioned by year + quarter      │
          └────────────────────────────────────┘
                             │
        ┌────────────────────┼─────────────────────────┐
        ▼                    ▼                         ▼
┌───────────────┐ ┌──────────────────────┐   ┌───────────────────┐
│ Trino SQL     │ │ OpenSearch hot tier  │   │ akko-rag service  │
│ JOINs chunks  │ │ last 12 months only  │   │ cockpit UI chat   │
│ + customers   │ │ BM25 + vector hybrid │   │ with citations    │
│ + loans       │ └──────────────────────┘   └───────────────────┘
│ + transactions│            │                         │
└───────────────┘            └────────────┬────────────┘
                                          │
                               ┌──────────▼───────────┐
                               │ ADEN (natural lang)  │
                               │ → SQL → exec → chart │
                               └──────────────────────┘
                                          │
                                          ▼
                                 Superset dashboard,
                                 row-filtered by OPA,
                                 published from Jupyter
```
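The "chunk + embed via Ollama UDF" step in the middle of that diagram could look roughly like this. A sketch assuming an Ollama service reachable from the Spark executors, an illustrative embedding model, and a hypothetical staging table of Docling-extracted chunks:

```python
# Minimal sketch of the "chunk + embed via Ollama UDF" Spark job.
# Model name, service URL, and table names are assumptions, not AKKO defaults.
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, FloatType

spark = SparkSession.builder.appName("banking-embed").getOrCreate()

def embed(text: str) -> list[float]:
    # Each executor calls the in-cluster Ollama embeddings endpoint.
    resp = requests.post(
        "http://ollama.akko.svc:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

embed_udf = udf(embed, ArrayType(FloatType()))

# Chunks produced by the Docling extraction step (hypothetical staging table).
chunks = spark.table("iceberg.rag.banking_chunks_staging")
(chunks.withColumn("embedding", embed_udf(col("text")))
       .writeTo("iceberg.rag.banking_chunks")
       .append())
```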
A single SQL statement can look like:
```sql
SELECT c.customer_id, c.name,
       SUM(t.amount) AS volume_12m,
       COUNT(DISTINCT l.loan_id) AS active_loans,
       (SELECT COUNT(*)
        FROM iceberg.rag.banking_chunks
        -- metadata is a JSON string; Trino has no Postgres-style ->> operator
        WHERE json_extract_scalar(metadata, '$.document_type') = 'audit_report'
          AND lower(text) LIKE '%' || lower(c.customer_id) || '%') AS audit_mentions
FROM iceberg.banking_prod.customers c
LEFT JOIN iceberg.banking_prod.transactions t ON t.customer_id = c.customer_id
LEFT JOIN iceberg.banking_prod.loans l ON l.customer_id = c.customer_id
WHERE c.risk_rating IN ('HIGH', 'CRITICAL')
GROUP BY c.customer_id, c.name
ORDER BY audit_mentions DESC
LIMIT 50;
```
One query joins structured customer data, transaction history, loan portfolios, and evidence mined from the audit-PDF corpus. That is what the unified architecture delivers. Every other vendor forces you to issue three separate queries (to their SQL engine, to their vector DB, to their BI tool) and then stitch the results together in application code.
## Shared governance = real governance
Because everything lives in one mesh, one OPA policy protects:
- Who can query a Trino table
- Who can read an Iceberg partition
- Who can retrieve a RAG chunk
- Who can see a row in a Superset dashboard
- Who can call an MLflow model
That policy is version-controlled in Git (helm/akko/charts/akko-opa/) and enforced on every read path, with no way to bypass it. The audit trail spans the entire pipeline end-to-end: a regulator asking "who accessed this customer's data in the last 12 months?" gets a complete answer from one query, not a manual stitch across five systems.
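Enforcement is uniform because every component asks OPA the same question through its standard REST data API. A sketch of that decision call; the policy path and input shape are assumptions about how AKKO structures its policies:

```python
# Minimal sketch: a policy decision via OPA's REST data API.
# The policy path (akko/authz/allow) and input fields are assumptions.
import requests

def is_allowed(user: str, action: str, resource: str) -> bool:
    resp = requests.post(
        "http://akko-opa:8181/v1/data/akko/authz/allow",
        json={"input": {"user": user, "action": action, "resource": resource}},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json().get("result", False)

# The same call shape guards a Trino table read, a RAG chunk retrieval,
# or an MLflow model invocation.
print(is_allowed("analyst@bank", "read", "iceberg.banking_prod.customers"))
```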
## What to remember
- Big data and AI infrastructure are the same problem — move data through compute pipelines with governance and observability.
- AKKO bundles the primitives once and composes them for both worlds.
- The three-tier volume strategy means the same API serves a 10-person consulting firm and a 10,000-person pharma.
- Governance and lineage are inherent, not afterthoughts — they're the default when components share auth, catalog, and metrics.
## Related
- CI/CD Pipeline — how changes to all of this deploy
- Control Plane — unified API over Datasets / Workspaces / Pipelines / Agents
- AI Stack — the AI-specific layer detail
- Data Flow — structured-data-only pipeline detail