ADR-042 — RAG strategy: intelligent chunking, vector store, AI usage telemetry

Status: PROPOSED (2026-04-27, founder directive, to validate)
Sprint: 59 (planned, post Sprint 58 ADEN scope-first)
Related: ADR-039 (no hardcoded identities), ADR-041 (ADEN scope-first), Sprint 41 RAG 3-tier delivery

Context

Founder feedback from the evening of 2026-04-27, after consulting an external AI expert:

"For the RAG, an AI expert told me that the most important part is intelligent chunking, and that the best vector database is https://github.com/milvus-io/milvus. He also told me it will be important to highlight (so, to be validated if we have all the information) the AI consumption and usage side, so we need to check whether we already have that in the AKKO cockpit."

Three sub-decisions need to be made together because they shape the RAG product:

  1. Chunking strategy — current docker/akko-rag/app/chunker.py is a word-based sliding window (400 words, fixed). The code itself flags this as Phase 0 ("Phase 2 upgrades this to the docling chapter-aware splitter"). Phase 2 was never shipped.

  2. Vector store — current is pgvector (Sprint 41) as tier 1, with OpenSearch + Iceberg as tiers 2/3 for cold storage and scale. The founder's external expert names Milvus (LF AI & Data Foundation) as best-in-class. Question: replace pgvector as tier 1 with Milvus, or keep pgvector for the hot tier and slot Milvus in as a tier 1.5 / tier 2 between hot and OpenSearch?

  3. AI usage telemetry — cockpit "Utilisation" page has KPI placeholders (--) but no live data. Founder wants visibility on consumption and usage. LiteLLM exposes Prometheus metrics; nothing wires them into a Perses dashboard or cockpit page.

Current state (audited 2026-04-27)

Chunking (akko-rag)

  • chunker.py: word-based sliding window, fixed at 400 words (a proxy for tokens); window size and overlap configurable via chunk_size_tokens / chunk_overlap_tokens (minimal sketch after this list).
  • No semantic boundaries (paragraph / section / heading).
  • No structural awareness (table / list / code block stays bundled with surrounding prose, often splitting mid-row).
  • Same chunker at ingest and reindex (deterministic doc IDs — important).
  • Phase 2 placeholder comment references docling chapter-aware split. docling is already in the AKKO image (docker/akko-rag/Dockerfile).
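
For reference, the Phase 0 behaviour reduces to a plain sliding window. The sketch below is a reconstruction from the audit notes above, not a copy of chunker.py; the Chunk shape and the overlap default are assumptions.

```python
# Reconstruction of the Phase 0 word-based sliding window; the Chunk shape
# and the overlap default are assumptions, not the actual chunker.py code.
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    index: int
    text: str

def chunk_text(doc_id: str, text: str,
               chunk_size_tokens: int = 400,
               chunk_overlap_tokens: int = 50) -> list[Chunk]:
    """Words stand in as a cheap proxy for tokens; no semantic boundaries."""
    words = text.split()
    step = chunk_size_tokens - chunk_overlap_tokens
    chunks: list[Chunk] = []
    for index, start in enumerate(range(0, max(len(words), 1), step)):
        window = words[start:start + chunk_size_tokens]
        if not window:
            break
        chunks.append(Chunk(doc_id, index, " ".join(window)))
    return chunks
```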

Vector store (akko-rag)

  • Tier 1 — pgvector extension on akko-postgresql-data (banking demo uses 50k embeddings, sub-100ms p95).
  • Tier 2 — OpenSearch with dense_vector + BM25 hybrid (Sprint 42).
  • Tier 3 — Iceberg cold table with Lance / Parquet (batch retrieval for archival queries).
  • All three are already integrated. No vendor lock-in: the akko-rag service picks a tier per query SLO (illustrated in the sketch below).
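
The per-query routing in the last bullet can be pictured as a small selector. Purely illustrative: the real akko-rag selector, its tier names, and its SLO thresholds are not specified in this ADR.

```python
# Hypothetical tier router; thresholds and names are illustrative only.
from enum import Enum

class Tier(Enum):
    PGVECTOR = "pgvector"      # tier 1: hot path, sub-100ms p95
    OPENSEARCH = "opensearch"  # tier 2: hybrid dense + BM25
    ICEBERG = "iceberg"        # tier 3: cold, batch retrieval

def pick_tier(latency_budget_ms: int, archival: bool) -> Tier:
    """Route a query to the cheapest tier that can meet its SLO."""
    if archival:
        return Tier.ICEBERG
    if latency_budget_ms <= 100:
        return Tier.PGVECTOR
    return Tier.OPENSEARCH
```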

AI usage telemetry

  • LiteLLM in chart: ✅ exposes /metrics (Prometheus format) by default (litellm_total_requests, litellm_total_tokens, litellm_total_cost, litellm_request_latency_seconds, etc.). Verified by checking the upstream LiteLLM 1.72.2 source.
  • Prometheus scrape: NOT configured for the LiteLLM service. No ServiceMonitor, no scrape_config in helm/akko/charts/akko-litellm/.
  • Cockpit "Utilisation" page: KPI tiles show -- because no backend fetches the data.
  • Audit trail in OPA / cockpit: a separate concern, already partially wired.

Decision

A. Chunking — implement docling chapter-aware splitter (Sprint 59 D1-D2)

Replace the word-based chunker with the docling semantic splitter when ingesting structured documents (PDF, DOCX, Markdown). Keep the word-based splitter as a fallback for plain-text streams without detectable structure (logs, CSVs).

Why: the external expert's directive matches industry consensus. Anthropic's contextual retrieval blog post (2024-09), LlamaIndex's SemanticSplitterNodeParser, LangChain's RecursiveCharacterTextSplitter, and Cohere's chunking benchmark all converge on the same rule: split on natural boundaries, not byte counts. Semantic chunking improves recall by 15–35% on benchmarks (BeIR / MS MARCO).

How:

  • docker/akko-rag/app/chunker.py: add chunk_text_semantic(text, doc_meta) -> list[Chunk] that delegates to docling.chunking for PDF/DOCX/MD inputs and falls back to the current chunk_text for plain text (sketch after this list).
  • Keep deterministic chunk IDs: hash(doc_id + chunk_index + chunk_strategy), so an old word-based corpus and a new semantic corpus can coexist during migration.
  • Helm value akko-rag.chunking.strategy: auto | word | semantic, so customers can pin the strategy if they have a deterministic corpus they don't want to re-chunk.
  • Re-index migration: provide a Job akko-rag-reindex that streams every doc through the new chunker. Tracked separately.
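
A minimal sketch of the first two points, assuming docling's documented DocumentConverter / HybridChunker API. The ADR names the signature chunk_text_semantic(text, doc_meta), but docling converts whole source documents, so the sketch takes a file path; the Chunk shape is illustrative.

```python
# Sketch only: assumes docling's documented DocumentConverter / HybridChunker
# API; the Chunk shape and function names are illustrative, not final code.
import hashlib
from dataclasses import dataclass

from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

@dataclass
class Chunk:
    chunk_id: str
    text: str

def make_chunk_id(doc_id: str, index: int, strategy: str) -> str:
    # Deterministic: hash(doc_id + chunk_index + chunk_strategy), so word-based
    # and semantic chunks of the same doc never collide during migration.
    return hashlib.sha256(f"{doc_id}:{index}:{strategy}".encode()).hexdigest()

def chunk_document_semantic(source_path: str, doc_id: str) -> list[Chunk]:
    # docling parses the document structure (chapters, sections, tables)...
    document = DocumentConverter().convert(source_path).document
    # ...and HybridChunker splits on those semantic boundaries.
    return [
        Chunk(make_chunk_id(doc_id, i, "semantic"), c.text)
        for i, c in enumerate(HybridChunker().chunk(dl_doc=document))
    ]
```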

B. Vector store — keep pgvector default, add Milvus as opt-in tier (Sprint 59 D3-D5)

Do not replace pgvector. The current 3-tier architecture is sound, and pgvector handles the hot tier well for a typical AKKO customer (thousands to low millions of embeddings).

Add Milvus as an opt-in alternative tier 1 for customers with hard scaling requirements (10M+ embeddings, sub-50ms p99 across multi-AZ, GPU-accelerated index). Same Lego principle as MinIO → SeaweedFS → Ceph: one default packaged, additional slots for customers with specific needs.

Why Milvus over alternatives:

  • ✅ License: Apache 2.0 (clean for AKKO commercial redistribution).
  • ✅ Governance: LF AI & Data Foundation (Linux Foundation, neutral), which matches the feedback_governance_first.md rule.
  • ✅ Maturity: graduated CNCF-style project, 30k+ GitHub stars, reference clients at scale.
  • ✅ Hybrid search built in (dense + sparse + BM25), which eliminates the need to wire OpenSearch separately for the hot tier when a customer picks Milvus.
  • ⚠️ Operational overhead: a separate stateful service with etcd + an object-storage backend. Heavy for small customers; appropriate for large ones.

Why not replace the pgvector default:

  • pgvector benefits from PostgreSQL's existing operational story (CloudNativePG HA, backup, point-in-time recovery).
  • Most AKKO RAG corpora fit comfortably in pgvector (100k–5M chunks).
  • Adding Milvus as the default doubles the operational burden for customers who don't need it.

How:

  • New sub-chart helm/akko/charts/akko-milvus/ with Milvus standalone + etcd + a MinIO-compatible object store (or SeaweedFS).
  • Helm value akko-rag.hotTier.backend: pgvector | milvus to switch.
  • akko-rag service code: abstract the hot-tier client behind an interface; add MilvusHotTierClient next to the existing PgvectorHotTierClient (sketch after this list).
  • Default disabled (akko-milvus.enabled: false); the customer or an AKKO operator flips it on demand.
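
A hedged sketch of that abstraction. Only PgvectorHotTierClient and MilvusHotTierClient are named in this ADR; the interface, method signatures, and factory below are assumptions.

```python
# Sketch of the proposed hot-tier abstraction; the interface and method
# signatures are assumptions, not existing akko-rag code.
from typing import Protocol

class HotTierClient(Protocol):
    def upsert(self, chunk_id: str, embedding: list[float], text: str) -> None: ...
    def search(self, embedding: list[float], top_k: int) -> list[str]: ...

class PgvectorHotTierClient:
    # Existing Sprint 41 path: SQL against the pgvector extension.
    def upsert(self, chunk_id: str, embedding: list[float], text: str) -> None:
        raise NotImplementedError  # wraps the current pgvector queries
    def search(self, embedding: list[float], top_k: int) -> list[str]:
        raise NotImplementedError

class MilvusHotTierClient:
    # New: pymilvus against the opt-in akko-milvus standalone service.
    def upsert(self, chunk_id: str, embedding: list[float], text: str) -> None:
        raise NotImplementedError
    def search(self, embedding: list[float], top_k: int) -> list[str]:
        raise NotImplementedError

def make_hot_tier(backend: str) -> HotTierClient:
    # Driven by the Helm value akko-rag.hotTier.backend: pgvector | milvus.
    clients = {"pgvector": PgvectorHotTierClient, "milvus": MilvusHotTierClient}
    return clients[backend]()
```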

C. AI usage telemetry — Sprint 59 D6-D8

Goal: the cockpit "Utilisation" page populates with live data (tokens IN/OUT, cost per model, top users, p95 latency, error rate).

Sources:

  1. LiteLLM /metrics (Prometheus). Already exposed; needs a ServiceMonitor + scrape config.
  2. ADEN /metrics (added Sprint 27; verified aden_query_total, aden_query_duration_seconds, aden_llm_calls_total).
  3. akko-rag /metrics (Sprint 41: rag_query_total, rag_embeddings_count).
  4. OpenMetadata audit log: who queried what (a different concern; the cockpit page already loads this from OM).

Wiring:

  • helm/akko/charts/akko-litellm/templates/servicemonitor.yaml: scrape :4000/metrics (where LiteLLM exposes them).
  • helm/akko/charts/akko-aden/templates/servicemonitor.yaml: already exists; verify the Prometheus selector picks it up.
  • helm/akko/charts/akko-rag/templates/servicemonitor.yaml: verify.
  • New Perses dashboard akko-ai-usage.json:
      ◦ 4 stat panels: total tokens (24h), total cost (24h), top user (24h), p95 latency (5m).
      ◦ 2 timeseries: tokens per minute by model, cost per minute by model.
      ◦ 1 table: top 10 users by token consumption (24h).
  • New cockpit page section under "Utilisation > AI Usage": embeds the Perses dashboard via the same metrics.<domain> iframe pattern fixed in PR #149.
  • Existing cockpit KPIs (#usage-llm-requests, #usage-trino-queries, #usage-online-users): wire them to a new /api/usage/summary endpoint in cockpit-backend (when ADR-040 lands) that aggregates Prometheus queries and returns single numbers (sketch after this list).
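
A hedged sketch of that /api/usage/summary aggregation, assuming an in-cluster Prometheus reachable over its standard HTTP API and the LiteLLM metric names listed above. The _bucket suffix presumes litellm_request_latency_seconds is a histogram; verify every name against the actually scraped series.

```python
# Sketch only: the Prometheus service URL, metric names, and histogram shape
# are assumptions to be verified once the ServiceMonitors are scraping.
import requests

PROM_URL = "http://prometheus:9090"  # assumption: in-cluster Prometheus service

def prom_scalar(query: str) -> float:
    """Run an instant PromQL query and return the first sample (or 0.0)."""
    resp = requests.get(f"{PROM_URL}/api/v1/query",
                        params={"query": query}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def usage_summary() -> dict:
    """Single-number aggregates for the cockpit KPI tiles."""
    return {
        "total_tokens_24h": prom_scalar(
            "sum(increase(litellm_total_tokens[24h]))"),
        "total_cost_24h": prom_scalar(
            "sum(increase(litellm_total_cost[24h]))"),
        "p95_latency_5m": prom_scalar(
            "histogram_quantile(0.95, "
            "sum(rate(litellm_request_latency_seconds_bucket[5m])) by (le))"),
    }
```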

Consequences

Migration cost

  • Chunking: 2 days (new code path + tests + migration Job + values toggle).
  • Milvus opt-in: 3 days (sub-chart + abstraction layer + tests + values).
  • Telemetry: 3 days (3 ServiceMonitors + Perses dashboard + cockpit page + 1 cockpit-backend endpoint).
  • Total: ~8 days; fits one sprint with one engineer.

Customer onboarding

The new chunking strategy and telemetry come for free at install time. Milvus is opt-in: customers don't need it until they hit pgvector limits.

Demo / showcase

After Sprint 59:

  • The "Comment ADEN a répondu" ("How ADEN answered") panel (already shipped) shows the chunking strategy in pipeline_steps.
  • The cockpit "Utilisation" page shows live token counts during a demo, visible to the prospect: a strong sales signal.
  • The Milvus tab on the Architecture page demonstrates the multi-tier vector store's flexibility.

Rollback

  • Chunking: the Helm value chunking.strategy: word reverts to the current behaviour. The reindex Job is opt-in.
  • Milvus: akko-milvus.enabled: false removes it entirely.
  • Telemetry: ServiceMonitors are passive; failure to scrape doesn't break anything.

Comparison with industry

| Vendor | Chunking | Vector store | AI usage telemetry |
|---|---|---|---|
| OpenAI ChatGPT Enterprise | semantic + structural | proprietary | built-in dashboard |
| Anthropic Claude.ai workspace | contextual retrieval | proprietary | per-org usage page |
| Microsoft Copilot Studio | semantic | Azure AI Search | Power BI integration |
| Snowflake Cortex Search | structural (Apache Tika) | proprietary | Snowsight |
| Databricks Vector Search | structural via DocAI | Mosaic / FAISS | Databricks SQL |
| AKKO post-ADR-042 | docling structural + word fallback | pgvector default + Milvus opt-in | LiteLLM/ADEN/RAG metrics → Perses + cockpit |
| AKKO pre-ADR-042 | word-only, fixed 400 | pgvector / OpenSearch / Iceberg 3-tier | KPI tiles with -- placeholders |

Validation needed (founder confirms before D1)

  1. Chunking direction: OK to ship the docling structural splitter as the default for PDF/DOCX/MD?
  2. Milvus scope: opt-in tier (this ADR's recommendation), or replace pgvector as the default?
  3. Telemetry scope: tokens/cost/latency/top users; anything else on the demo wishlist (cost per customer, GPU usage, RAG retrieval precision/recall)?

References

  • Sprint 41 deliverable: 3-tier RAG (pgvector / OpenSearch / Iceberg)
  • ADR-039: no hardcoded identities (chunking values, model lists, etc. must be customer-driven)
  • feedback_governance_first.md: the Milvus LF AI & Data Foundation pick
  • feedback_no_hardcoded_anything.md: strategy/threshold/model values via Helm