ADR-042 — RAG strategy: intelligent chunking, vector store, AI usage telemetry¶
Status: PROPOSED (2026-04-27, founder directive — to validate)
Sprint: 59 (planned, post Sprint 58 ADEN scope-first)
Related: ADR-039 (no hardcoded identities), ADR-041 (ADEN scope-first), Sprint 41 RAG 3-tier delivery
Context¶
Founder feedback, 2026-04-27 evening, after consulting an external AI expert (translated from French):
"for the RAG, an AI expert told me the most important part is intelligent chunking, and told me the best vector database is https://github.com/milvus-io/milvus; he also told me it will be important to highlight (so to validate whether we have all the information) the AI consumption and usage part as well, so we need to check whether we already have that in the akko cockpit"
Three sub-decisions need to be made together because they shape the RAG product:

- Chunking strategy — the current `docker/akko-rag/app/chunker.py` is a word-based sliding window (400 words, fixed). The code itself flags this as Phase 0 ("Phase 2 upgrades this to the docling chapter-aware splitter"). Phase 2 was never shipped.
- Vector store — current is `pgvector` (Sprint 41) as tier 1, with OpenSearch + Iceberg as tiers 2/3 for cold + scale. The founder's external expert names Milvus (LF AI & Data Foundation) as best-in-class. Question: replace pgvector as tier 1 with Milvus, or keep pgvector for the hot tier and slot Milvus in as a tier 1.5 / tier 2 between hot and OpenSearch?
- AI usage telemetry — the cockpit "Utilisation" page has KPI placeholders (`--`) but no live data. The founder wants visibility on consumption and usage. LiteLLM exposes Prometheus metrics; nothing wires them into a Perses dashboard or cockpit page.
Current state (audited 2026-04-27)¶
Chunking (akko-rag)¶
- `chunker.py`: word-based, fixed window of 400 words (a proxy for tokens), overlap configurable via `chunk_size_tokens` / `chunk_overlap_tokens` (an illustrative sketch follows this list).
- No semantic boundaries (paragraph / section / heading).
- No structural awareness (a table / list / code block stays bundled with surrounding prose, often splitting mid-row).
- Same chunker at ingest and reindex (deterministic doc IDs — important).
- The Phase 2 placeholder comment references the docling chapter-aware split. docling is already in the AKKO image (`docker/akko-rag/Dockerfile`).
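For reference, the current behaviour boils down to a fixed sliding window over whitespace-split words. A minimal sketch (illustrative only, not the actual `chunker.py` code; the overlap default shown here is an assumption):

```python
# Illustrative sketch of the current word-based sliding window; NOT the actual
# docker/akko-rag/app/chunker.py code. The overlap default is an assumption.
def chunk_text_words(text: str,
                     chunk_size_tokens: int = 400,
                     chunk_overlap_tokens: int = 50) -> list[str]:
    words = text.split()
    step = max(chunk_size_tokens - chunk_overlap_tokens, 1)
    chunks: list[str] = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size_tokens]
        if not window:
            break
        chunks.append(" ".join(window))   # fixed window, no semantic boundaries
        if start + chunk_size_tokens >= len(words):
            break                         # last window reached the end of the doc
    return chunks
```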
Vector store (akko-rag)¶
- Tier 1 — `pgvector` extension on `akko-postgresql-data` (the banking demo uses 50k embeddings, sub-100ms p95).
- Tier 2 — OpenSearch with `dense_vector` + BM25 hybrid (Sprint 42).
- Tier 3 — Iceberg cold table with Lance / Parquet (batch retrieval for archival queries).
- All three are already integrated. No vendor lock-in: the akko-rag service picks the tier per query SLO (a hypothetical routing sketch follows this list).
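A hypothetical sketch of that per-query tier selection (the names here are invented for illustration; the real routing logic lives in the akko-rag service):

```python
# Hypothetical sketch of per-query tier routing; names are invented for
# illustration, the real logic lives in the akko-rag service.
from dataclasses import dataclass

@dataclass
class QuerySLO:
    max_latency_ms: int     # latency budget the caller accepts
    include_archived: bool  # whether cold / archival documents are in scope

def pick_tier(slo: QuerySLO) -> str:
    if slo.include_archived:
        return "iceberg"     # tier 3: batch retrieval over the cold table
    if slo.max_latency_ms <= 100:
        return "pgvector"    # tier 1: hot tier, sub-100ms p95
    return "opensearch"      # tier 2: dense_vector + BM25 hybrid
```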
AI usage telemetry¶
- LiteLLM in chart: ✅ exposes `/metrics` (Prometheus format) by default (`litellm_total_requests`, `litellm_total_tokens`, `litellm_total_cost`, `litellm_request_latency_seconds`, etc.). Verified by checking the upstream LiteLLM 1.72.2 source (a quick check snippet follows this list).
- Prometheus scrape: NOT configured for the LiteLLM service. No `ServiceMonitor`, no `scrape_config` in `helm/akko/charts/akko-litellm/`.
- Cockpit "Utilisation" page: KPI tiles show `--` because no backend fetches the data.
- Audit trail in OPA / cockpit: separate concern, already partially wired.
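A quick in-cluster check that the metric names above are actually exposed (the service hostname is an assumption; the `:4000` port matches the scrape target planned in section C):

```python
# Quick in-cluster check that the LiteLLM metric names listed above are exposed.
# The service hostname is an assumption; :4000 matches the planned scrape target.
import urllib.request

METRICS_URL = "http://akko-litellm:4000/metrics"   # hypothetical service name
EXPECTED = (
    "litellm_total_requests",
    "litellm_total_tokens",
    "litellm_total_cost",
    "litellm_request_latency_seconds",
)

body = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
for name in EXPECTED:
    print(f"{name}: {'present' if name in body else 'MISSING'}")
```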
Decision¶
A. Chunking — implement docling chapter-aware splitter (Sprint 59 D1-D2)¶
Replace the word-based chunker with the docling semantic splitter when
ingesting structured documents (PDF, DOCX, Markdown). Keep the
word-based splitter as a fallback for plain text streams without
detectable structure (logs, CSVs).
Why: the external expert's directive matches industry consensus — Anthropic's contextual retrieval blog post (2024-09), LlamaIndex's SemanticSplitterNodeParser, LangChain's RecursiveCharacterTextSplitter, Cohere's chunking benchmark — all converge on the same point: split on natural boundaries, not byte counts. Recall improves 15–35% on benchmarks (BeIR / MS MARCO) with semantic chunking.
How:
- `docker/akko-rag/app/chunker.py`: add `chunk_text_semantic(text, doc_meta) -> list[Chunk]` that delegates to `docling.chunking` for PDF/DOCX/MD inputs and falls back to the current `chunk_text` for plain text (see the sketch after this list).
- Keep deterministic chunk IDs: `hash(doc_id + chunk_index + chunk_strategy)`, so an old word-based corpus and a new semantic corpus can coexist during migration.
- Helm value `akko-rag.chunking.strategy: auto | word | semantic` so customers can pin the strategy if they have a deterministic corpus they don't want to re-chunk.
- Re-index migration: provide a Job `akko-rag-reindex` that streams every doc through the new chunker. Tracked separately.
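A minimal sketch of the proposed entry point. The `Chunk` shape and helper names below are placeholders, and the exact `docling.chunking` call is to be settled during D1:

```python
# Sketch of the proposed chunk_text_semantic entry point; a proposal, not the
# final implementation. Helper names and the Chunk shape are assumptions.
import hashlib
from dataclasses import dataclass

STRUCTURED_FORMATS = {"pdf", "docx", "md"}

@dataclass
class Chunk:
    chunk_id: str
    text: str
    strategy: str  # "semantic" or "word", stored so migration can tell corpora apart

def _chunk_id(doc_id: str, chunk_index: int, chunk_strategy: str) -> str:
    # Deterministic chunk ID: hash(doc_id + chunk_index + chunk_strategy), so a
    # word-based corpus and a semantic corpus can coexist during migration.
    return hashlib.sha256(f"{doc_id}:{chunk_index}:{chunk_strategy}".encode()).hexdigest()

def _docling_split(text: str, doc_meta: dict) -> list[str]:
    # Placeholder for the docling chapter-aware splitter (docling.chunking);
    # stand-in behaviour: split on blank lines as a rough structural boundary.
    return [p for p in text.split("\n\n") if p.strip()]

def _word_split(text: str, window: int = 400) -> list[str]:
    # Stand-in for the existing word-based chunk_text fallback.
    words = text.split()
    return [" ".join(words[i:i + window]) for i in range(0, len(words), window)]

def chunk_text_semantic(text: str, doc_meta: dict) -> list[Chunk]:
    doc_id = doc_meta["doc_id"]
    if doc_meta.get("format") in STRUCTURED_FORMATS:
        pieces, strategy = _docling_split(text, doc_meta), "semantic"
    else:
        pieces, strategy = _word_split(text), "word"
    return [Chunk(_chunk_id(doc_id, i, strategy), piece, strategy)
            for i, piece in enumerate(pieces)]
```

The `strategy` value is stored on each chunk so the reindex Job and the `auto | word | semantic` Helm toggle can tell the two corpora apart.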
B. Vector store — keep pgvector default, add Milvus as opt-in tier (Sprint 59 D3-D5)¶
Do not replace pgvector. The current 3-tier architecture is sound and pgvector handles the hot tier well for a typical AKKO customer (thousands to low-millions of embeddings).
Add Milvus as an opt-in alternative tier-1 for customers with hard scaling requirements (10M+ embeddings, sub-50ms p99 across multi-AZ, GPU-accelerated index). Same Lego principle as MinIO → SeaweedFS → Ceph : one default packaged, additional slots for customers with specific needs.
Why Milvus over alternatives:
- ✅ License: Apache 2.0 (clean for AKKO commercial redistribution).
- ✅ Governance: LF AI & Data Foundation (Linux Foundation neutral — matches the feedback_governance_first.md rule).
- ✅ Maturity: graduated LF AI & Data project, 30k+ GitHub stars, reference clients at scale.
- ✅ Hybrid search built in (dense + sparse + BM25) — eliminates the need to wire OpenSearch separately for the hot tier when a customer picks Milvus.
- ⚠️ Operational overhead: separate stateful service with etcd + an object-storage backend. Heavy for small customers; appropriate for large ones.
Why not replace the pgvector default:
- pgvector benefits from PostgreSQL's existing operational story (CloudNativePG HA, backup, point-in-time recovery).
- Most AKKO RAG corpora fit in pgvector comfortably (100k–5M chunks).
- Adding Milvus as the default doubles the operational burden for customers who don't need it.
How:
- New sub-chart `helm/akko/charts/akko-milvus/` with Milvus standalone + etcd + MinIO-compatible object storage (or SeaweedFS).
- Helm value `akko-rag.hotTier.backend: pgvector | milvus` to switch.
- akko-rag service code: abstract the hot-tier client behind an interface; add `MilvusHotTierClient` next to the existing `PgvectorHotTierClient` (see the sketch after this list).
- Default disabled (`akko-milvus.enabled: false`); the customer or an AKKO operator flips it on demand.
C. AI usage telemetry — Sprint 59 D6-D8¶
Goal: the cockpit "Utilisation" page populates with live data (tokens IN/OUT, cost per model, top users, p95 latency, error rate).
Sources:
1. LiteLLM `/metrics` (Prometheus). Already exposed; needs a ServiceMonitor + scrape config.
2. ADEN `/metrics` (added Sprint 27 — verified `aden_query_total`, `aden_query_duration_seconds`, `aden_llm_calls_total`).
3. akko-rag `/metrics` (Sprint 41 — `rag_query_total`, `rag_embeddings_count`).
4. OpenMetadata audit log — who queried what (different concern; the cockpit page already loads this from OM).
Wiring:
- `helm/akko/charts/akko-litellm/templates/servicemonitor.yaml`: scrape `:4000/metrics` (where LiteLLM exposes them).
- `helm/akko/charts/akko-aden/templates/servicemonitor.yaml`: already exists; verify the Prometheus selector picks it up.
- `helm/akko/charts/akko-rag/templates/servicemonitor.yaml`: verify.
- New Perses dashboard `akko-ai-usage.json`:
  - 4 stat panels: total tokens (24h), total cost (24h), top user (24h), p95 latency (5m).
  - 2 timeseries: tokens per minute by model, cost per minute by model.
  - 1 table: top 10 users by token consumption (24h).
- New cockpit page section under "Utilisation > AI Usage" — embeds the Perses dashboard via the same `metrics.<domain>` iframe pattern fixed in PR #149.
- Existing cockpit KPIs (`#usage-llm-requests`, `#usage-trino-queries`, `#usage-online-users`): wire them to a new `/api/usage/summary` endpoint in cockpit-backend (when ADR-040 lands) that aggregates Prometheus queries and returns single numbers (see the sketch after this list).
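A sketch of that `/api/usage/summary` aggregation. The Prometheus address, the histogram bucket suffix, and the framework wiring are assumptions; the LiteLLM metric names come from the audit above:

```python
# Sketch of the cockpit-backend /api/usage/summary aggregation. The Prometheus
# address, bucket suffix and framework wiring are assumptions; the litellm_*
# metric names come from the current-state audit above.
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://prometheus-operated:9090"   # assumed in-cluster address

QUERIES = {
    "tokens_24h":  'sum(increase(litellm_total_tokens[24h]))',
    "cost_24h":    'sum(increase(litellm_total_cost[24h]))',
    "latency_p95": 'histogram_quantile(0.95, '
                   'sum(rate(litellm_request_latency_seconds_bucket[5m])) by (le))',
}

def _instant_query(promql: str) -> float | None:
    # Prometheus instant-query HTTP API: /api/v1/query?query=<promql>
    url = f"{PROMETHEUS_URL}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"
    data = json.load(urllib.request.urlopen(url, timeout=5))
    result = data["data"]["result"]
    return float(result[0]["value"][1]) if result else None

def usage_summary() -> dict:
    # Returns the single numbers the cockpit KPI tiles need.
    return {name: _instant_query(q) for name, q in QUERIES.items()}
```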
Consequences¶
Migration cost¶
- Chunking: 2 days (new code path + tests + migration Job + values toggle).
- Milvus opt-in: 3 days (sub-chart + abstraction layer + tests + values).
- Telemetry: 3 days (3 ServiceMonitors + Perses dashboard + cockpit page + 1 cockpit-backend endpoint).
- Total: ~8 days; fits a sprint with one engineer.
Customer onboarding¶
The new chunking strategy and telemetry come for free at install time. Milvus is opt-in: customers don't need it until they hit pgvector's limits.
Demo / showcase¶
After Sprint 59:
- The "Comment ADEN a répondu" panel (already shipped) shows the chunking strategy in `pipeline_steps`.
- The cockpit "Utilisation" page shows live token counts during a demo, visible to the prospect — a strong sales signal.
- The Milvus tab in the Architecture page demonstrates the multi-tier vector store flexibility.
Rollback¶
- Chunking: the Helm value `chunking.strategy: word` reverts to the current behaviour. The reindex Job is opt-in.
- Milvus: `akko-milvus.enabled: false` removes it entirely.
- Telemetry: ServiceMonitors are passive; a failure to scrape doesn't break anything.
Comparison with industry¶
| Vendor | Chunking | Vector store | AI usage telemetry |
|---|---|---|---|
| OpenAI ChatGPT Enterprise | semantic + structural | proprietary | dashboard built-in |
| Anthropic Claude.ai workspace | contextual retrieval | proprietary | per-org usage page |
| Microsoft Copilot Studio | semantic | Azure AI Search | Power BI integration |
| Snowflake Cortex Search | structural (Apache Tika) | proprietary | Snowsight |
| Databricks Vector Search | structural via DocAI | Mosaic / FAISS | Databricks SQL |
| AKKO post-ADR-042 | docling structural + word fallback | pgvector default + Milvus opt-in | LiteLLM/ADEN/RAG metrics → Perses + cockpit |
| AKKO pre-ADR-042 | word-only, fixed 400 | pgvector / OpenSearch / Iceberg 3-tier | KPI tiles with -- placeholders |
Validation needed (founder confirms before D1)¶
- Chunking direction: OK to ship the docling structural splitter as the default for PDF/DOCX/MD?
- Milvus scope: opt-in tier (this ADR's recommendation), or replace pgvector as the default?
- Telemetry scope: token/cost/latency/top-users — anything else on the demo wishlist (cost-per-customer, GPU usage, RAG retrieval precision/recall)?
Related¶
- Sprint 41 deliverable : 3-tier RAG (pgvector / OpenSearch / Iceberg)
- ADR-039 — no hardcoded identities (chunking values, model lists, etc. must be customer-driven)
- `feedback_governance_first.md` — Milvus LF AI & Data Foundation pick
- `feedback_no_hardcoded_anything.md` — strategy/threshold/model values via Helm