# ADR-007: `akko_ai_search` Query-Embedding Cache — JVM-Local vs Distributed
## Status
Accepted
## Date
2026-04-15
## Context
AKKO ships a Trino scalar function `akko_ai_search(query, column)`, implemented in the `akko-trino` plugin (`dev.akko.trino.ai`). Each invocation needs an embedding vector for the `query` argument and calls `/v1/embed` on the AI Service (Ollama-backed FastAPI) to obtain it.
Two pathological patterns emerge without caching:
- Per-row explosion — a `SELECT akko_ai_search('churn risk', notes) FROM events` over N rows would, naively, issue N HTTP calls for the same query string. Query-level memoization inside the worker eliminates this.
- Cross-worker duplication — Trino distributes splits across workers. Even with per-worker memoization, a 3-worker cluster running the same query against a large table issues up to 3 embed calls for one distinct `query` string. At scale (10+ workers, dashboard fan-out, repeated queries across sessions), this grows linearly with fleet size.
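The per-worker memoization described above can be sketched with the JDK alone (all names here are illustrative, not the plugin's actual code): a `ConcurrentHashMap` collapses N per-row lookups for the same query string into a single embed call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class QueryMemoSketch {
    // Counts how many times the (stubbed) embed endpoint is actually hit.
    static final AtomicInteger embedCalls = new AtomicInteger();

    // Stand-in for the HTTP call to the AI Service's /v1/embed endpoint.
    static float[] embed(String query) {
        embedCalls.incrementAndGet();
        return new float[] { query.length() }; // placeholder vector
    }

    public static void main(String[] args) {
        Map<String, float[]> memo = new ConcurrentHashMap<>();
        // Simulate a 1000-row scan evaluating akko_ai_search('churn risk', notes):
        // every row asks for the same query embedding, but only the first miss
        // actually calls embed(); the rest are served from the map.
        for (int row = 0; row < 1000; row++) {
            memo.computeIfAbsent("churn risk", QueryMemoSketch::embed);
        }
        System.out.println(embedCalls.get()); // one embed call, not 1000
    }
}
```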
A state-of-the-art design would back the cache with Redis / Dragonfly / Postgres so that the embedding is computed exactly once per distinct query across the entire cluster. This ADR documents why AKKO deliberately chose a simpler design.
## Decision
Ship `akko_ai_search` with a JVM-local Caffeine cache, scoped to the Trino worker process:
- Implementation: `com.github.benmanes.caffeine.cache.Cache<String, float[]>`
- Capacity: 1024 entries (max size, LRU eviction on overflow)
- TTL: 1 hour (access-based — each hit resets the expiry)
- Eviction policy: access-order LRU via `expireAfterAccess(1h)` + `maximumSize(1024)`
- Key: raw query string (normalized by the AI Service, not here)
- Scope: per-JVM, per-worker — no shared state, no external dependency
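The parameters above map directly onto Caffeine's builder API. A minimal sketch (field and helper names are illustrative, not the plugin's actual code; requires the `com.github.ben-manes.caffeine:caffeine` dependency):

```java
import java.time.Duration;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Per-JVM cache configured with the values listed above.
final Cache<String, float[]> embeddingCache = Caffeine.newBuilder()
        .maximumSize(1024)                       // max-size eviction on overflow
        .expireAfterAccess(Duration.ofHours(1))  // access-based TTL: each hit resets expiry
        .build();

// On each call: return the cached vector, or fall through to /v1/embed on a miss.
// callEmbedEndpoint is a hypothetical stand-in for the HTTP client call.
float[] vector = embeddingCache.get(queryString, q -> callEmbedEndpoint(q));
```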
The cache is exposed via JMX under `dev.akko.trino.ai:type=aisearch` with three counters: `CacheHits`, `CacheMisses`, `CacheSize`. These surface in Prometheus via the Trino JMX exporter and in Grafana's "AI Functions" dashboard.
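The JMX surface can be sketched with the JDK's `javax.management` API alone (interface and class names here are illustrative, not necessarily the plugin's):

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.LongAdder;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// MXBean mirroring the three counters named above.
interface AiSearchCacheMXBean {
    long getCacheHits();
    long getCacheMisses();
    long getCacheSize();
}

class AiSearchCacheStats implements AiSearchCacheMXBean {
    final LongAdder hits = new LongAdder();
    final LongAdder misses = new LongAdder();
    volatile long size;

    public long getCacheHits()   { return hits.sum(); }
    public long getCacheMisses() { return misses.sum(); }
    public long getCacheSize()   { return size; }

    // Register under the ObjectName scraped by the Trino JMX exporter.
    void register() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(this, new ObjectName("dev.akko.trino.ai:type=aisearch"));
    }
}
```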
## Alternatives Considered
### Redis / Dragonfly shared cache
- True cluster-wide deduplication — one embed call per distinct query, ever.
- But: adds a hard runtime dependency on Redis for a query plugin. Requires fail-open fallback logic (Redis down → call AI Service directly → log and continue), TLS wiring inside Trino's classpath, AUTH secret rotation, and a new Helm sub-chart or reuse of an existing Redis (AKKO does not ship one today).
- Network roundtrip per lookup (even on a hit) is ~0.5–2 ms — two orders of magnitude slower than an in-process Caffeine hit (sub-microsecond).
- Rejected: the operational surface (new service, new secrets, new failure mode, new alerts) is disproportionate to the savings at AKKO's current cluster sizes.
### Postgres shared cache (reuse `akko-postgresql`)
- Zero new components — AKKO already ships `akko-postgresql` (infra tier) with room for a `trino_ai_cache(query_hash, embedding vector, updated_at)` table.
- pgvector is already installed on `akko-postgresql-data` (RAG tier) — the schema is trivial.
- But: every cache lookup becomes a synchronous Postgres roundtrip on the hot path of every row evaluation. For a 10k-row `akko_ai_search` scan, that's 10k Postgres queries per worker — catastrophic, and it defeats the point of caching in the first place (we'd be trading embed latency for DB latency, with extra serialization overhead).
- Could be mitigated with a JVM-local L1 cache in front of the Postgres L2 — but that is exactly what this ADR proposes, minus the L2. Adding an L2 later is cheap if we ever need it.
- Rejected: synchronous DB reads on every scalar call violate Trino's latency contract for scalar UDFs.
### No cache at all
- Simplest possible implementation.
- Acceptable for low-cardinality workloads (one `akko_ai_search` per dashboard widget, a few hundred rows).
- But: catastrophic for dashboard refresh against 10k+ row tables — N embed calls per widget per refresh, overwhelming the AI Service and turning a sub-second function into a multi-minute scan.
- Rejected: the common case (semantic search over medium-to-large tables) is exactly the pathological case for a cacheless design.
## Consequences
### Positive
- Zero new dependencies — no Redis, no extra Postgres schema, no Helm chart changes, no new secrets, no new alerts.
- Zero new failure modes — the cache cannot be "down"; worst case is a cold miss that falls through to the AI Service exactly as if the cache were absent.
- Sub-microsecond hits — Caffeine is the reference high-performance JVM cache; a hit is a `ConcurrentHashMap` lookup plus LRU bookkeeping.
- Automatic invalidation on pod restart — if the operator updates the embedding model (`nomic-embed-text` → something else), a rolling restart of Trino workers flushes stale embeddings naturally. No manual cache-bust, no TTL tuning, no stampede.
- Observability out of the box — JMX counters integrate with the existing Prometheus/Grafana pipeline with no plumbing work.
### Negative
- Duplicated embed calls proportional to worker count — a 3-worker cluster performs up to 3 embed calls per distinct query (once per worker, amortized across that worker's lifetime). This is mitigated in practice because the AI Service itself caches embeddings (Ollama model-level caching + FastAPI memoization), so cross-worker duplicates are served from the AI Service's own cache at low cost — typically <20 ms — rather than triggering full model inference.
- No cross-session persistence — when a worker restarts, its cache is cold. Acceptable because `akko_ai_search` is deterministic (same input → same output) and the AI Service layer absorbs the repopulation cost.
- Coarse TTL — 1 hour is a compromise between memory pressure and recency. For workloads with >1024 distinct queries per hour per worker, LRU will evict before the TTL fires.
### Neutral
- Migration path is clean — if we revisit this, adding an L2 (Redis or Postgres) behind the existing Caffeine L1 is a localized change inside `AiSearchFunction.java`. No API changes, no user-facing behavior change.
- Cache key is the raw query string — this matches the AI Service contract; if we later normalize (lowercasing, whitespace collapsing), we should do it in one place, ideally the AI Service, and document accordingly.
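If an L2 is ever added, the lookup path could look like the sketch below (stdlib maps standing in for Caffeine and Redis/Postgres; all names hypothetical). The point is that the hot path stays on the in-process L1, and the L2 is consulted only on an L1 miss:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class TieredCacheSketch {
    final Map<String, float[]> l1 = new ConcurrentHashMap<>(); // stand-in for Caffeine L1
    final Map<String, float[]> l2 = new ConcurrentHashMap<>(); // stand-in for Redis/Postgres L2
    final Function<String, float[]> embed;                     // stand-in for /v1/embed

    TieredCacheSketch(Function<String, float[]> embed) { this.embed = embed; }

    float[] lookup(String query) {
        return l1.computeIfAbsent(query, q -> {
            float[] cached = l2.get(q);     // only on an L1 miss do we leave the JVM
            if (cached != null) return cached;
            float[] fresh = embed.apply(q); // last resort: compute the embedding
            l2.put(q, fresh);               // populate L2 for other workers
            return fresh;
        });
    }
}
```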
## When to Revisit
Re-open this decision if any of the following becomes true for a sustained period (not a transient spike):
- Trino cluster scales beyond 10 workers — duplicated embed calls grow linearly and the AI Service cache may no longer absorb the cost.
- AI Service `/v1/embed` p99 latency exceeds 500 ms sustainably — duplicates become user-visible.
- JMX metrics show `CacheHits / (CacheHits + CacheMisses)` < 70% under normal workload — this indicates either that 1024 entries is too small or that the access pattern is cache-hostile.
- A second Trino plugin needs the same embedding cache — extracting it to a shared service becomes justified by reuse.
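The hit-rate threshold above is a simple ratio of the two JMX counters; a minimal helper (hypothetical, not part of the plugin) that also guards the cold-cache division by zero:

```java
class CacheHitRate {
    // Hit rate in [0, 1]; a cold cache (no lookups yet) reports 1.0 so that it
    // does not spuriously trip the < 70% revisit threshold right after a restart.
    static double hitRate(long cacheHits, long cacheMisses) {
        long total = cacheHits + cacheMisses;
        return total == 0 ? 1.0 : (double) cacheHits / total;
    }
}
```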
## Metrics
Exposed via the JMX bean `dev.akko.trino.ai:type=aisearch`:
- `CacheHits` — cumulative count of lookups served from cache.
- `CacheMisses` — cumulative count of lookups that triggered an AI Service call.
- `CacheSize` — current number of entries (≤ 1024).
Scraped by Trino's JMX exporter, aggregated in Prometheus, visualized in the Grafana "AI Functions" dashboard alongside AI Service latency and Ollama GPU/CPU utilization.
## References
- Caffeine cache (GitHub)
- Trino scalar function SPI
- AKKO Trino AI plugin: `docker/trino-ai-functions/` (custom image `akko-trino:2026.04`)
- AI Service `/v1/embed` endpoint: `docker/ai-service/`
- Related sprint work: Trino AI plugin investigation (Task #90) and `akko_ai_search` implementation (Task #91)
- Related ADR: ADR-003 (Polaris) — same philosophy of minimizing runtime dependencies for query-path components