ADR-007: akko_ai_search Query-Embedding Cache — JVM-Local vs Distributed

Status

Accepted

Date

2026-04-15

Context

AKKO ships a Trino scalar function akko_ai_search(query, column) implemented in the akko-trino plugin (dev.akko.trino.ai). Each invocation needs an embedding vector for the query argument and calls /v1/embed on the AI Service (Ollama-backed FastAPI) to obtain it.
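
For orientation, this is roughly the shape such a function takes against Trino's annotation-based scalar function SPI; the DOUBLE similarity score and the helper methods are illustrative assumptions, not the plugin's actual code:

    import io.airlift.slice.Slice;
    import io.trino.spi.function.Description;
    import io.trino.spi.function.ScalarFunction;
    import io.trino.spi.function.SqlType;
    import io.trino.spi.type.StandardTypes;

    public final class AiSearchFunctionSketch
    {
        private AiSearchFunctionSketch() {}

        @ScalarFunction("akko_ai_search")
        @Description("Semantic similarity of a natural-language query against a column value")
        @SqlType(StandardTypes.DOUBLE)
        public static double aiSearch(
                @SqlType(StandardTypes.VARCHAR) Slice query,
                @SqlType(StandardTypes.VARCHAR) Slice value)
        {
            // Invoked once per row: without caching, every row re-embeds the
            // same constant query string via POST /v1/embed on the AI Service.
            float[] q = embedViaAiService(query.toStringUtf8());
            float[] v = embedViaAiService(value.toStringUtf8());
            return cosineSimilarity(q, v);
        }

        // Hypothetical helper: one HTTP roundtrip to the AI Service (elided here).
        private static float[] embedViaAiService(String text)
        {
            throw new UnsupportedOperationException("HTTP client elided in this sketch");
        }

        private static double cosineSimilarity(float[] a, float[] b)
        {
            double dot = 0;
            double normA = 0;
            double normB = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }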

Two pathological patterns emerge without caching:

  1. Per-row explosion — a SELECT akko_ai_search('churn risk', notes) FROM events over N rows would, naively, issue N HTTP calls for the same query string. Query-level memoization inside the worker eliminates this.
  2. Cross-worker duplication — Trino distributes splits across workers. Even with per-worker memoization, a 3-worker cluster running the same query against a large table issues up to 3 embed calls for one distinct query string. At scale (10+ workers, dashboard fan-out, repeated queries across sessions), duplication grows linearly with fleet size.

A state-of-the-art design would back the cache with Redis, Dragonfly, or Postgres so that the embedding is computed exactly once per distinct query for the entire cluster. This ADR documents why AKKO deliberately chose a simpler design.

Decision

Ship akko_ai_search with a JVM-local Caffeine cache, scoped to the Trino worker process:

  • Implementation: com.github.benmanes.caffeine.cache.Cache<String, float[]>
  • Capacity: 1024 entries (maximumSize(1024); size-based eviction on overflow)
  • TTL: 1 hour (access-based — each hit resets the expiry)
  • Eviction policy: expireAfterAccess(1h) for recency plus maximumSize(1024) size eviction (Caffeine's W-TinyLFU policy, which approximates LRU)
  • Key: raw query string (normalized by the AI Service, not here)
  • Scope: per-JVM, per-worker — no shared state, no external dependency
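
A minimal sketch of that configuration and the get-or-compute lookup (illustrative names; the real code lives in the plugin):

    import java.time.Duration;
    import java.util.function.Function;

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    public final class QueryEmbeddingCache
    {
        private final Cache<String, float[]> cache = Caffeine.newBuilder()
                .maximumSize(1024)                       // size-based eviction on overflow
                .expireAfterAccess(Duration.ofHours(1))  // access-based TTL: each hit resets expiry
                .recordStats()                           // feeds the JMX hit/miss counters
                .build();

        public float[] embeddingFor(String query, Function<String, float[]> embed)
        {
            // Atomic get-or-compute: on a miss, exactly one caller per key invokes
            // /v1/embed; concurrent lookups for the same key wait for that load.
            return cache.get(query, embed);
        }
    }

This also covers pattern 1 from the Context section: per-row calls for a constant query string collapse into a single load per worker.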

The cache is exposed via JMX under dev.akko.trino.ai:type=aisearch with three counters: CacheHits, CacheMisses, CacheSize. These surface in Prometheus via the Trino JMX exporter and in Grafana's "AI Functions" dashboard.
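
One plausible wiring for those counters, assuming airlift's jmxutils annotations (the library Trino itself uses for JMX export) and Caffeine's recordStats(); registration of the bean under the domain above is omitted:

    import com.github.benmanes.caffeine.cache.Cache;
    import org.weakref.jmx.Managed;

    public final class AiSearchCacheJmx
    {
        private final Cache<String, float[]> cache;

        public AiSearchCacheJmx(Cache<String, float[]> cache)
        {
            this.cache = cache;
        }

        @Managed
        public long getCacheHits()
        {
            // Requires Caffeine.newBuilder().recordStats(), else this is always 0.
            return cache.stats().hitCount();
        }

        @Managed
        public long getCacheMisses()
        {
            return cache.stats().missCount();
        }

        @Managed
        public long getCacheSize()
        {
            return cache.estimatedSize();
        }
    }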

Alternatives Considered

Redis / Dragonfly shared cache

  • True cluster-wide deduplication — one embed call per distinct query, ever.
  • But: adds a hard runtime dependency on Redis for a query plugin. Requires fail-open fallback logic (Redis down → call the AI Service directly → log and continue; sketched after this list), TLS wiring inside Trino's classpath, AUTH secret rotation, and a new Helm sub-chart or reuse of an existing Redis (AKKO does not ship one today).
  • Network roundtrip per lookup (even on a hit) is ~0.5–2 ms — roughly three orders of magnitude slower than an in-process Caffeine hit (sub-microsecond).
  • Rejected: the operational surface (new service, new secrets, new failure mode, new alerts) is disproportionate to the savings at AKKO's current cluster sizes.
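
For scale, the fail-open lookup alone would look roughly like the following; RedisClient and EmbedClient are hypothetical interfaces, and none of this exists in the plugin today:

    // Hypothetical fail-open L2 lookup; illustrates the extra failure handling
    // a Redis-backed cache forces onto the query hot path.
    public final class FailOpenRedisLookup
    {
        interface RedisClient
        {
            float[] get(String key); // null on miss
            void put(String key, float[] value);
        }

        interface EmbedClient
        {
            float[] embed(String query);
        }

        private final RedisClient redis;
        private final EmbedClient embedClient;

        FailOpenRedisLookup(RedisClient redis, EmbedClient embedClient)
        {
            this.redis = redis;
            this.embedClient = embedClient;
        }

        float[] lookup(String query)
        {
            float[] embedding = null;
            try {
                embedding = redis.get(query); // ~0.5–2 ms roundtrip, even on a hit
            }
            catch (RuntimeException e) {
                // Fail-open: a Redis outage must not fail the Trino query.
            }
            if (embedding == null) {
                embedding = embedClient.embed(query);
                try {
                    redis.put(query, embedding); // best-effort write-back
                }
                catch (RuntimeException ignored) {
                }
            }
            return embedding;
        }
    }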

Postgres shared cache (reuse akko-postgresql)

  • Zero new components — AKKO already ships akko-postgresql (infra tier) with room for a trino_ai_cache(query_hash, embedding vector, updated_at) table.
  • pgvector is already installed on akko-postgresql-data (RAG tier) — schema is trivial.
  • But: every cache lookup becomes a synchronous Postgres roundtrip on the hot path of every row evaluation. For a 10k-row akko_ai_search scan, that's 10k Postgres queries per worker — catastrophic, and it defeats the point of caching in the first place (we'd be trading embed latency for DB latency, with extra serialization overhead).
  • Could be mitigated with a JVM-local L1 cache in front of the Postgres L2 — but that's exactly what this ADR proposes, minus the L2. Adding L2 later is cheap if we ever need it.
  • Rejected: synchronous DB reads on every scalar call violate Trino's latency contract for scalar UDFs.

No cache at all

  • Simplest possible implementation.
  • Acceptable for low-cardinality workloads (one akko_ai_search per dashboard widget, a few hundred rows).
  • But: catastrophic for dashboard refresh against 10k+ row tables — N embed calls per widget per refresh, overwhelming the AI Service and turning a sub-second function into a multi-minute scan.
  • Rejected: the common case (semantic search over medium-to-large tables) is exactly the pathological case for a cacheless design.

Consequences

Positive

  • Zero new dependencies — no Redis, no extra Postgres schema, no Helm chart changes, no new secrets, no new alerts.
  • Zero new failure modes — the cache cannot be "down"; worst case is a cold miss that falls through to the AI Service exactly as if the cache were absent.
  • Sub-microsecond hits — Caffeine is the reference high-performance JVM cache; a hit is a ConcurrentHashMap lookup plus eviction-policy bookkeeping.
  • Automatic invalidation on pod restart — if the operator updates the embedding model (nomic-embed-text → something else), rolling-restart of Trino workers flushes stale embeddings naturally. No manual cache-bust, no TTL tuning, no stampede.
  • Observability out of the box — JMX counters integrate with the existing Prometheus/Grafana pipeline with no plumbing work.

Negative

  • Duplicated embed calls proportional to worker count — a 3-worker cluster performs up to 3 embed calls per distinct query (once per worker, amortized across that worker's lifetime). This is mitigated in practice because the AI Service itself caches embeddings (Ollama model-level caching + FastAPI memoization), so cross-worker duplicates are served from the AI Service's own cache at low cost — typically <20 ms — rather than triggering full model inference.
  • No cross-session persistence — when a worker restarts, its cache is cold. Acceptable because akko_ai_search is deterministic (same input → same output) and the AI Service layer absorbs the repopulation cost.
  • Coarse TTL — 1 hour is a compromise between memory pressure and recency. For workloads with >1024 distinct queries per hour per worker, size-based eviction will kick in before the TTL does.

Neutral

  • Migration path is clean — if we revisit this, adding an L2 (Redis or Postgres) behind the existing Caffeine L1 is a localized change inside AiSearchFunction.java (see the sketch after this list). No API changes, no user-facing behavior change.
  • Cache key is the raw query string — this matches the AI Service contract; if we later normalize (lowercasing, whitespace collapsing) we should do it in one place, ideally the AI Service, and document accordingly.
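
A sketch of that migration path: the existing Caffeine L1 stays in front, and a hypothetical, optional L2Store slots in behind it (empty today, which makes the behavior identical to the current design):

    import java.time.Duration;
    import java.util.Optional;

    import com.github.benmanes.caffeine.cache.Cache;
    import com.github.benmanes.caffeine.cache.Caffeine;

    public final class LayeredEmbeddingCache
    {
        interface L2Store
        {
            Optional<float[]> get(String query);
            void put(String query, float[] embedding);
        }

        interface EmbedClient
        {
            float[] embed(String query);
        }

        private final Cache<String, float[]> l1 = Caffeine.newBuilder()
                .maximumSize(1024)
                .expireAfterAccess(Duration.ofHours(1))
                .build();
        private final Optional<L2Store> l2; // Optional.empty() today; Redis/Postgres later
        private final EmbedClient embedClient;

        LayeredEmbeddingCache(Optional<L2Store> l2, EmbedClient embedClient)
        {
            this.l2 = l2;
            this.embedClient = embedClient;
        }

        float[] get(String query)
        {
            // L1 hit: in-process, sub-microsecond. L1 miss: try L2, then fall
            // back to the AI Service and write back to L2 if one is configured.
            return l1.get(query, q -> l2.flatMap(store -> store.get(q))
                    .orElseGet(() -> {
                        float[] embedding = embedClient.embed(q);
                        l2.ifPresent(store -> store.put(q, embedding));
                        return embedding;
                    }));
        }
    }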

When to Revisit

Re-open this decision if any of the following becomes true for a sustained period (not a transient spike):

  • Trino cluster scales beyond 10 workers — duplicated embed calls grow linearly and the AI Service cache may no longer absorb the cost.
  • AI Service /v1/embed p99 latency exceeds 500 ms sustainably — duplicates become user-visible.
  • JMX metrics show CacheHits / (CacheHits + CacheMisses) < 70% under normal workload — indicates either 1024 entries is too small or the access pattern is cache-hostile.
  • A second Trino plugin needs the same embedding cache — extracting it to a shared service becomes justified by reuse.

Metrics

Exposed via JMX bean dev.akko.trino.ai:type=aisearch:

  • CacheHits — cumulative count of lookups served from cache.
  • CacheMisses — cumulative count of lookups that triggered an AI Service call.
  • CacheSize — current number of entries (≤ 1024).

Scraped by Trino's JMX exporter, aggregated in Prometheus, visualized in the Grafana "AI Functions" dashboard alongside AI Service latency and Ollama GPU/CPU utilization.

References

  • Caffeine cache (GitHub)
  • Trino scalar function SPI
  • AKKO Trino AI plugin: docker/trino-ai-functions/ (custom image akko-trino:2026.04)
  • AI Service /v1/embed endpoint: docker/ai-service/
  • Related sprint work: Trino AI plugin investigation (Task #90) and akko_ai_search implementation (Task #91)
  • Related ADR: ADR-003 (Polaris) — same philosophy of minimizing runtime dependencies for query-path components