# Unified Data + AI Architecture
The thesis behind AKKO: big data and AI are not two stacks; they share the same infrastructure primitives. The industry has split them in two since 2023, because the AI wave arrived after the data platforms were already built. The cost of that split is duplicated plumbing, fragmented governance, and two teams that don't talk to each other.
AKKO collapses the split.
## The five things both worlds need
Every data pipeline, whether it's an ELT load of 10 billion transactions or a RAG pipeline over 10 million PDFs, runs through the same five layers:
| Layer | Need | AKKO component |
|---|---|---|
| Storage | Cheap, durable object store; columnar formats | MinIO + Iceberg + Polaris |
| Compute | Distributed, container-native, schedulable | Spark + Trino |
| Orchestration | Cron- or event-triggered DAGs, retries, audit | Airflow 3 |
| Governance | SSO, RBAC, fine-grained row/column policies, audit | Keycloak + OPA + OpenMetadata |
| Observability | Metrics, traces, logs, lineage | Grafana / VictoriaMetrics + Tempo + Loki + OpenMetadata |
The AI-specific layers (embeddings, vector search, model registry, LLM serving) are additions to this stack, not replacements for it:
| AI-specific need | AKKO component |
|---|---|
| Embedding model hosting | Ollama (local) + LiteLLM (cloud bridge) |
| Vector store (small) | pgvector in akko-postgresql-data |
| Vector store (medium) | OpenSearch dense_vector (already bundled) |
| Vector store (huge) | Iceberg tables with VECTOR columns + Trino |
| Model registry | MLflow |
| Serving | Custom FastAPI services (akko-ai-service, future akko-rag) |
| Document extraction | Docling (IBM Research, MIT) |
| ASR | Whisper |
| LLM-native SQL | ADEN (AKKO Data-Engine Native) + Trino AI plugin |
Every one of these components runs on the same Kubernetes cluster, authenticates through the same Keycloak, is governed by the same OPA policies, and records its lineage in the same OpenMetadata graph.
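For instance, the embedding layer can be consumed through one call shape whether the model runs locally or in the cloud. A minimal sketch, assuming a LiteLLM client, an in-cluster Ollama service URL, and illustrative model names (none of these are AKKO defaults):

```python
# Minimal sketch: one embedding call that can target local Ollama or a cloud
# provider through LiteLLM. Model names and api_base are assumptions.
from litellm import embedding

def embed_text(text: str, local: bool = True) -> list[float]:
    if local:
        # Ollama served inside the cluster (hypothetical service URL).
        response = embedding(
            model="ollama/nomic-embed-text",
            input=[text],
            api_base="http://ollama.akko.svc:11434",
        )
    else:
        # Same call shape against a cloud provider via the LiteLLM bridge;
        # requires provider credentials in the environment.
        response = embedding(model="text-embedding-3-small", input=[text])
    return response.data[0]["embedding"]

vec = embed_text("high-risk customer with repeated audit findings")
print(len(vec))  # dimensionality depends on the model
```

Swapping local for cloud is a parameter, not a rewrite, which is the point of bundling Ollama and LiteLLM together.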
## Three volume tiers with one API surface
For the document-intelligence case (PDFs at scale), AKKO scales gracefully from a small office to national-archive volumes. The consuming side (APIs, cockpit UI, SQL dialect) never changes — only the backend swaps:
```
Tier 1 — Small (up to ~1,000 PDFs ≈ 250k chunks)
├─ pgvector in akko-postgresql-data
├─ FastAPI direct ingest
├─ query latency < 50 ms
└─ Use case: SMB, law firms, prospect demos

Tier 2 — Mid (up to ~100,000 PDFs ≈ 25M chunks)
├─ OpenSearch dense_vector (bundled in AKKO)
├─ Airflow nightly batch ingestion
├─ Hybrid BM25 keyword + dense vector search
├─ query latency < 100 ms
└─ Use case: regional insurers, consulting firms, mid-market

Tier 3 — Big data (millions of PDFs ≈ billions of chunks)
├─ Iceberg tables on MinIO (cold tier)
├─ OpenSearch hot tier (last 12 months)
├─ Spark batch embedding distributed on cluster
├─ Trino SQL for hybrid vector + structured queries
├─ Qdrant optional for ultra-low latency hot set
└─ Use case: pharma, national banks, gov/fisc, media
```
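To make Tier 1 concrete, here is a minimal sketch of the nearest-neighbour lookup pgvector would serve, assuming a hypothetical rag_chunks table with an embedding column and an illustrative DSN:

```python
# Minimal Tier-1 sketch: nearest-neighbour lookup in pgvector.
# Table/column names and the DSN are assumptions, not AKKO defaults.
import psycopg

def search_chunks(query_vec: list[float], k: int = 5) -> list[tuple[str, float]]:
    # pgvector accepts a vector literal like '[0.1,0.2,...]';
    # <=> is cosine distance.
    literal = "[" + ",".join(f"{x:.6f}" for x in query_vec) + "]"
    with psycopg.connect("dbname=akko user=akko host=akko-postgresql-data") as conn:
        rows = conn.execute(
            """
            SELECT text, embedding <=> %s::vector AS distance
            FROM rag_chunks
            ORDER BY distance
            LIMIT %s
            """,
            (literal, k),
        ).fetchall()
    return rows
```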
Backend selection happens via values.yaml. The cockpit page, the REST API, and the SQL dialect stay identical; the client can start on Tier 1 and graduate to Tier 3 without re-architecting.
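At Tier 2 the same request would be answered by OpenSearch instead. A hedged sketch of the hybrid BM25 + dense-vector query the backend might issue, with an assumed index name, field names, and host:

```python
# Minimal Tier-2 sketch: hybrid BM25 + k-NN query against OpenSearch.
# Index name, field names, and host are assumptions, not AKKO defaults.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "akko-opensearch", "port": 9200}])

def hybrid_search(query_text: str, query_vec: list[float], k: int = 10):
    body = {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    # BM25 keyword leg
                    {"match": {"text": query_text}},
                    # dense-vector leg (OpenSearch k-NN plugin)
                    {"knn": {"embedding": {"vector": query_vec, "k": k}}},
                ]
            }
        },
    }
    return client.search(index="rag-chunks", body=body)["hits"]["hits"]
```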
## Why this matters commercially
Competing vendors force the client into a category at purchase time:
| Vendor | Forces client to |
|---|---|
| Databricks Vector Search | Be big from day one (expensive); no on-prem |
| Pinecone | Be SaaS-only; not sovereign; proprietary licence |
| Snowflake Cortex | Be all-in on Snowflake; proprietary lock-in |
| Weaviate Cloud / Qdrant Cloud | Just the vector store; everything else your problem |
| Elasticsearch + "some AI" | Keep the data/AI split; the Elastic licence restricts offering it as a managed service |
AKKO grows with the client. Start Tier 1 on a single VM, grow to Tier 3 on a 50-node Kubernetes cluster, with the same API surface throughout. Tier 1 is bundled in the Lab edition, Tier 2 is unlocked in the Data edition, and Tier 3 in the Enterprise edition. An upgrade is a flipped value in values.yaml and a redeploy: no schema migration, no API contract rewrite.
## A concrete pipeline: from PDF to dashboard
Take a banking example: branches upload audit PDFs to MinIO, and analysts ask "show me high-risk customers with historical audit concerns". The path:
```
Branch uploads PDF ──► MinIO bucket ◄─── S3 event
                             │
                             ▼ Airflow DAG
                    ┌─────────────────┐
                    │ Docling extract │
                    └─────────────────┘
                             │ markdown + tables + OCR
                             ▼
                    ┌─────────────────┐
                    │ Spark job:      │
                    │ chunk + embed   │
                    │ via Ollama UDF  │
                    └─────────────────┘
                             │ billions of (text, vec) rows
                             ▼
          ┌────────────────────────────────────┐
          │ Iceberg table rag.banking_chunks   │
          │ partitioned by year + quarter      │
          └────────────────────────────────────┘
                             │
        ┌────────────────────┼─────────────────────────┐
        ▼                    ▼                         ▼
┌───────────────┐ ┌──────────────────────┐   ┌───────────────────┐
│ Trino SQL     │ │ OpenSearch hot tier  │   │ akko-rag service  │
│ JOINs chunks  │ │ last 12 months only  │   │ cockpit UI chat   │
│ + customers   │ │ BM25 + vector hybrid │   │ with citations    │
│ + loans       │ └──────────────────────┘   └───────────────────┘
│ + transactions│            │                         │
└───────────────┘            └────────────┬────────────┘
                                          │
                               ┌──────────▼───────────┐
                               │ ADEN (natural lang)  │
                               │ → SQL → exec → chart │
                               └──────────────────────┘
                                          │
                                          ▼
                                 Superset dashboard,
                                 row-filtered by OPA,
                                 published from Jupyter
```
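The "chunk + embed via Ollama UDF" step in the middle of that diagram could look roughly like this. A sketch assuming an Ollama service reachable from the Spark executors, an illustrative embedding model, and a hypothetical staging table of Docling-extracted chunks:

```python
# Minimal sketch of the "chunk + embed via Ollama UDF" Spark job.
# Model name, service URL, and table names are assumptions, not AKKO defaults.
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, FloatType

spark = SparkSession.builder.appName("banking-embed").getOrCreate()

def embed(text: str) -> list[float]:
    # Each executor calls the in-cluster Ollama embeddings endpoint.
    resp = requests.post(
        "http://ollama.akko.svc:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

embed_udf = udf(embed, ArrayType(FloatType()))

# Chunks produced by the Docling extraction step (hypothetical staging table).
chunks = spark.table("iceberg.rag.banking_chunks_staging")
(chunks.withColumn("embedding", embed_udf(col("text")))
       .writeTo("iceberg.rag.banking_chunks")
       .append())
```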
A single SQL statement can look like:
```sql
SELECT c.customer_id, c.name,
       SUM(t.amount) AS volume_12m,
       COUNT(DISTINCT l.loan_id) AS active_loans,
       (SELECT COUNT(*)
        FROM iceberg.rag.banking_chunks
        -- metadata is a JSON string; Trino has no Postgres-style ->> operator
        WHERE json_extract_scalar(metadata, '$.document_type') = 'audit_report'
          AND lower(text) LIKE '%' || lower(c.customer_id) || '%') AS audit_mentions
FROM iceberg.banking_prod.customers c
LEFT JOIN iceberg.banking_prod.transactions t ON t.customer_id = c.customer_id
LEFT JOIN iceberg.banking_prod.loans l ON l.customer_id = c.customer_id
WHERE c.risk_rating IN ('HIGH', 'CRITICAL')
GROUP BY c.customer_id, c.name
ORDER BY audit_mentions DESC
LIMIT 50;
```
One query joins structured customer data, transaction history, loan portfolios, and evidence mined from the audit-PDF corpus. That is what the unified architecture delivers. Every other vendor forces you to issue three separate queries (to their SQL engine, to their vector DB, to their BI tool) and then stitch the results together in application code.
## Shared governance = real governance
Because everything lives in one mesh, one OPA policy protects:
- Who can query a Trino table
- Who can read an Iceberg partition
- Who can retrieve a RAG chunk
- Who can see a row in a Superset dashboard
- Who can call an MLflow model
That policy is version-controlled in Git (helm/akko/charts/akko-opa/) and enforced on every read path, with no way to bypass it. The audit trail spans the entire pipeline end-to-end: a regulator asking "who accessed this customer's data in the last 12 months?" gets a complete answer from one query, not a manual stitch across five systems.
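Enforcement is uniform because every component asks OPA the same question through its standard REST data API. A sketch of that decision call; the policy path and input shape are assumptions about how AKKO structures its policies:

```python
# Minimal sketch: a policy decision via OPA's REST data API.
# The policy path (akko/authz/allow) and input fields are assumptions.
import requests

def is_allowed(user: str, action: str, resource: str) -> bool:
    resp = requests.post(
        "http://akko-opa:8181/v1/data/akko/authz/allow",
        json={"input": {"user": user, "action": action, "resource": resource}},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json().get("result", False)

# The same call shape guards a Trino table read, a RAG chunk retrieval,
# or an MLflow model invocation.
print(is_allowed("analyst@bank", "read", "iceberg.banking_prod.customers"))
```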
## What to remember
- Big data and AI infrastructure are the same problem — move data through compute pipelines with governance and observability.
- AKKO bundles the primitives once and composes them for both worlds.
- The three-tier volume strategy means the same API serves a 10-person consulting firm and a 10,000-person pharma.
- Governance and lineage are inherent, not afterthoughts — they're the default when components share auth, catalog, and metrics.
## Related
- CI/CD Pipeline — how changes to all of this deploy
- Control Plane — unified API over Datasets / Workspaces / Pipelines / Agents
- AI Stack — the AI-specific layer detail
- Data Flow — structured-data-only pipeline detail