Unified Data + AI Architecture

The thesis behind AKKO: big data and AI are not two stacks. They share the same infrastructure primitives. The industry has been splitting them in two since 2023, because the AI wave arrived after the data stacks were already built; the cost of that split is duplicated plumbing, fragmented governance, and two teams that don't talk to each other.

AKKO collapses the split.

The five things both worlds need

Every data pipeline — whether it's an ELT of 10 billion transactions or a RAG over 10 million PDFs — runs through the same five layers:

| Layer | Need | AKKO component |
|---|---|---|
| Storage | Cheap, durable object store; columnar formats | MinIO + Iceberg + Polaris |
| Compute | Distributed, container-native, schedulable | Spark + Trino |
| Orchestration | Cron- or event-triggered DAGs, retries, audit | Airflow 3 |
| Governance | SSO, RBAC, fine-grained row/column policies, audit | Keycloak + OPA + OpenMetadata |
| Observability | Metrics, traces, logs, lineage | Grafana / VictoriaMetrics + Tempo + Loki + OpenMetadata |

The AI-specific layers — embeddings, vector search, model registry, LLM serving — are additions to this, not replacements:

| AI-specific need | AKKO component |
|---|---|
| Embedding model hosting | Ollama (local) + LiteLLM (cloud bridge) |
| Vector store (small) | pgvector in akko-postgresql-data |
| Vector store (medium) | OpenSearch dense_vector (already bundled) |
| Vector store (huge) | Iceberg tables with VECTOR columns + Trino |
| Model registry | MLflow |
| Serving | Custom FastAPI services (akko-ai-service, future akko-rag) |
| Document extraction | Docling (IBM Research, MIT) |
| ASR | Whisper |
| LLM-native SQL | ADEN (AKKO Data-Engine Native) + Trino AI plugin |

Every single one runs on the same Kubernetes, authenticates via the same Keycloak, is governed by the same OPA policies, and records its lineage in the same OpenMetadata graph.
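
The "small" vector tier is conceptually nothing more than cosine similarity over stored embeddings; pgvector's `<=>` operator computes cosine distance server-side. A minimal brute-force sketch with toy three-dimensional vectors (the chunk IDs and texts are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], chunks: list[tuple], k: int = 3) -> list[tuple]:
    """Rank (chunk_id, text, vec) rows by similarity to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), cid, text)
              for cid, text, vec in chunks]
    return sorted(scored, reverse=True)[:k]

chunks = [
    ("c1", "audit finding: credit risk", [0.9, 0.1, 0.0]),
    ("c2", "branch opening hours",       [0.0, 0.2, 0.9]),
    ("c3", "risk rating downgrade",      [0.8, 0.3, 0.1]),
]
results = top_k([1.0, 0.2, 0.0], chunks, k=2)
print([cid for _, cid, _ in results])  # → ['c1', 'c3']
```

pgvector performs the same ranking with `ORDER BY embedding <=> :query LIMIT k`, using an index instead of a full scan.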

Three volume tiers with one API surface

For the document-intelligence case (PDFs at scale), AKKO scales gracefully from a small office to national-archive volumes. The consuming side (APIs, cockpit UI, SQL dialect) never changes — only the backend swaps:

Tier 1 — Small (up to ~1 000 PDFs ≈ 250k chunks)
├─ pgvector in akko-postgresql-data
├─ FastAPI direct ingest
├─ query latency < 50 ms
└─ Use case: SMB, law firms, prospect demos

Tier 2 — Mid (up to ~100 000 PDFs ≈ 25M chunks)
├─ OpenSearch dense_vector (bundled in AKKO)
├─ Airflow nightly batch ingestion
├─ Hybrid BM25 keyword + dense vector search
├─ query latency < 100 ms
└─ Use case: regional insurers, consulting firms, mid-market

Tier 3 — Big data (millions of PDFs ≈ billions of chunks)
├─ Iceberg tables on MinIO (cold tier)
├─ OpenSearch hot tier (last 12 months)
├─ Spark batch embedding distributed on cluster
├─ Trino SQL for hybrid vector + structured queries
├─ Qdrant optional for ultra-low latency hot set
└─ Use case: pharma, national banks, gov/fisc, media
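
Tiers 2 and 3 both rank results by fusing a BM25 keyword list with a dense-vector list. Reciprocal rank fusion is one standard way to merge the two rankings; the sketch below illustrates the idea only and is not AKKO's actual fusion code (OpenSearch ships its own hybrid scoring):

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25  = ["d3", "d1", "d7"]   # keyword hits, best first
dense = ["d1", "d5", "d3"]   # vector hits, best first
print(rrf_merge([bm25, dense]))  # → ['d1', 'd3', 'd5', 'd7']
```

`k = 60` is the conventional damping constant from the RRF literature; documents appearing in both lists (d1, d3) outrank those appearing in only one.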

Backend selection is a single switch in values.yaml. The cockpit page, the REST API, and the SQL dialect stay identical; the client can start on Tier 1 and graduate to Tier 3 without re-architecting.
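
As a sketch, the tier switch in values.yaml could look like the following; the key names here are illustrative assumptions, not the chart's real schema:

```yaml
# Hypothetical excerpt — key names are illustrative, not AKKO's actual values schema.
akko-rag:
  vectorBackend: opensearch   # pgvector | opensearch | iceberg
  opensearch:
    index: rag-chunks
    hybrid: true              # BM25 + dense vector
  iceberg:
    warehouse: s3a://akko-warehouse
    hotTierMonths: 12
```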

Why this matters commercially

Competing vendors force the client into a category at purchase time:

| Vendor | Forces the client to |
|---|---|
| Databricks Vector Search | Be big from day one (expensive); no on-prem |
| Pinecone | Be SaaS-only; not sovereign; proprietary licence |
| Snowflake Cortex | Be all-in on Snowflake; proprietary lock-in |
| Weaviate Cloud / Qdrant Cloud | Take just the vector store; everything else is your problem |
| Elasticsearch + "some AI" | Accept an outdated split; the Elastic licence blocks SaaS |

AKKO grows with the client. Start Tier 1 on a single VM, grow to Tier 3 on a 50-node Kubernetes cluster, same API surface. Tier 1 is bundled in the Lab edition; Tier 2 unlocks in the Data edition; Tier 3 in the Enterprise edition. Upgrading means flipping a value in values.yaml and redeploying: no schema migration, no API contract rewrite.

A concrete pipeline: from PDF to dashboard

Take a banking example. Branches upload audit PDFs to MinIO, and analysts ask: "show me high-risk customers with historical audit concerns". The path:

  Branch uploads PDF  ──►  MinIO bucket  ◄─── S3 event
                                               ▼  Airflow DAG
                                      ┌────────────────┐
                                      │ Docling extract│
                                      └────────────────┘
                                               │ markdown + tables + OCR
                                      ┌────────────────┐
                                      │ Spark job:     │
                                      │ chunk + embed  │
                                      │ via Ollama UDF │
                                      └────────────────┘
                                               │ billions of (text, vec) rows
                            ┌────────────────────────────────────┐
                            │ Iceberg table  rag.banking_chunks  │
                            │ partitioned by year + quarter      │
                            └────────────────────────────────────┘
        ┌──────────────────────────────────────┼──────────────────────────────┐
        ▼                                      ▼                              ▼
 ┌──────────────┐                  ┌──────────────────────┐         ┌───────────────────┐
 │ Trino SQL    │                  │ OpenSearch hot tier  │         │ akko-rag service  │
 │ JOINs chunks │                  │ last 12 months only  │         │ cockpit UI chat   │
 │ + customers  │                  │ BM25 + vector hybrid │         │ with citations    │
 │ + loans      │                  └──────────────────────┘         └───────────────────┘
 │ +transactions│
 └──────────────┘                              └──────────┬───────────────────┘
                                               ┌──────────▼───────────┐
                                               │ ADEN (natural lang)  │
                                               │ → SQL → exec → chart │
                                               └──────────────────────┘
                                               Superset dashboard,
                                               row-filtered by OPA,
                                               published from Jupyter
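
The "chunk + embed" Spark stage reduces to a windowing function plus one embedding call per chunk. A single-process sketch, with the Docling extraction and the Ollama embedding call stubbed out (in the real pipeline this logic would run inside a Spark UDF over the extracted markdown):

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split extracted markdown into overlapping character windows.

    Real ingestion would chunk on token or section boundaries; fixed
    character windows keep the sketch simple.
    """
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def embed(chunk: str) -> list[float]:
    """Stand-in for the Ollama embedding call — returns a placeholder vector."""
    return [float(len(chunk))]

doc = "x" * 1000                    # pretend output of Docling extraction
rows = [(i, c, embed(c)) for i, c in enumerate(chunk_text(doc))]
print(len(rows))  # → 3
```

Replacing the `embed` stub with a real model call and writing `rows` to the Iceberg table is, roughly, the rest of the stage.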

A single SQL statement can look like:

SELECT c.customer_id, c.name,
       SUM(t.amount)             AS volume_12m,
       COUNT(DISTINCT l.loan_id) AS active_loans,
       (SELECT COUNT(*)
          FROM iceberg.rag.banking_chunks
          -- metadata assumed stored as a JSON column
          WHERE json_extract_scalar(metadata, '$.document_type') = 'audit_report'
            AND lower(text) LIKE '%' || lower(CAST(c.customer_id AS varchar)) || '%') AS audit_mentions
FROM iceberg.banking_prod.customers c
LEFT JOIN iceberg.banking_prod.transactions t USING (customer_id)
LEFT JOIN iceberg.banking_prod.loans l USING (customer_id)
WHERE c.risk_rating IN ('HIGH', 'CRITICAL')
GROUP BY c.customer_id, c.name
ORDER BY audit_mentions DESC
LIMIT 50;

One query joins structured customer data, transaction history, the loan portfolio, and semantic search over a corpus of audit PDFs. That is what the unified architecture delivers. Every other vendor forces you to issue three separate queries (to their SQL engine, to their vector DB, to their BI tool) and then stitch the results together in application code.

Shared governance = real governance

Because everything lives in one mesh, one OPA policy protects:

  • Who can query a Trino table
  • Who can read an Iceberg partition
  • Who can retrieve a RAG chunk
  • Who can see a row in a Superset dashboard
  • Who can call an MLflow model

That policy is version-controlled in Git (helm/akko/charts/akko-opa/). It's enforced at every read path without bypass. The audit trail spans the entire pipeline end-to-end — a regulator asking "who accessed this customer's data in the last 12 months" gets a complete answer in one query, not a manual stitch across five systems.
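
The policy itself is the Rego under helm/akko/charts/akko-opa/. As an illustration of the kind of decision it encodes, here is a row-level rule rendered in Python; the attribute and role names are hypothetical:

```python
def allow_row(user: dict, row: dict) -> bool:
    """Allow if the user's role grants the row's classification
    and the row's region is within the user's scope.
    Roles and attributes are illustrative, not AKKO's real model."""
    role_clearance = {
        "analyst": {"public", "internal"},
        "auditor": {"public", "internal", "restricted"},
    }
    cleared  = row["classification"] in role_clearance.get(user["role"], set())
    in_scope = row["region"] in user["regions"]
    return cleared and in_scope

auditor = {"role": "auditor", "regions": {"EU"}}
print(allow_row(auditor, {"classification": "restricted", "region": "EU"}))  # True
print(allow_row(auditor, {"classification": "restricted", "region": "US"}))  # False
```

Because every read path (Trino, RAG retrieval, Superset, MLflow) asks OPA the same question, one such rule covers all five bullet points above.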

What to remember

  1. Big data and AI infrastructure are the same problem — move data through compute pipelines with governance and observability.
  2. AKKO bundles the primitives once and composes them for both worlds.
  3. The three-tier volume strategy means the same API serves a 10-person consulting firm and a 10,000-person pharma.
  4. Governance and lineage are inherent, not afterthoughts — they're the default when components share auth, catalog, and metrics.

Where to go next

  • CI/CD Pipeline — how changes to all of this deploy
  • Control Plane — unified API over Datasets / Workspaces / Pipelines / Agents
  • AI Stack — the AI-specific layer detail
  • Data Flow — structured-data-only pipeline detail