Semantic Cache Poisoning: The Hidden Risk of LLM Cost Optimization

Security invariant

Similarity is a performance signal, not authorization

A cache hit should never cross a tenant, role, policy, model, or system-prompt boundary. Within an allowed boundary, similarity is only the first test: intent, freshness, provenance, and response safety still decide whether reuse is acceptable.

01 // The optimization

How a semantic cache turns meaning into a key

A conventional cache needs an exact key. A semantic cache embeds a prompt with a model such as text-embedding-3-small, stores that numerical vector beside the LLM response, and searches for the nearest existing vector when another request arrives. OpenAI describes embeddings as numerical representations that measure how related two pieces of text are.

01 Embed

Convert the new prompt into a vector.

02 Search

Find the nearest cached prompt in the allowed namespace.

03 Compare

Evaluate distance, intent, policy, and freshness.

04 Reuse or call

Return a safe hit or ask the LLM and consider admission.

Teams often use cosine similarity s, where values nearer 1 indicate closer direction. Some databases expose cosine distance d = 1 - s, where smaller is closer. Confusing those conventions silently reverses a policy. A rule such as “distance below 0.2” is therefore not merely a tuning constant; it is an integrity decision that determines which answer may replace a fresh generation.
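A minimal sketch of keeping the two conventions apart, assuming you know from the store's documentation which one it returns. The names cosineSimilarity, Score, and meetsThreshold are illustrative and not part of any library; the idea is simply to tag every score with its convention and normalize to similarity before a policy ever sees it.

```typescript
// Cosine similarity between two equal-length vectors: values nearer 1 mean closer direction.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("a zero vector has no direction");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cosine distance as some stores expose it: d = 1 - s, so smaller is closer.
const toSimilarity = (distance: number): number => 1 - distance;

// Hypothetical wrapper type: every score carries the convention it was measured in.
type Score = { kind: "similarity" | "distance"; value: number };

// Normalize to similarity first, so a threshold can never be applied to the wrong scale.
function meetsThreshold(score: Score, minSimilarity: number): boolean {
  const s = score.kind === "similarity" ? score.value : toSimilarity(score.value);
  return s >= minSimilarity;
}
```

With this shape, a rule written as "similarity at least 0.8" gives the same answer whether the store reported a similarity of 0.9 or a distance of 0.1, and a raw number with no declared convention cannot reach the comparison at all.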

02 // Vector-space exploitation

The cache has a blind spot: semantic keys are deliberately fuzzy

An attacker does not need to overwrite a database row. They need a legitimate-looking request whose vector lands near a useful cluster while its wording persuades the LLM to produce a specific false, unsafe, or attacker-controlled response. If the system admits that prompt-response pair, the attacker has planted a poisoned coordinate.

Probe. The attacker varies prompts and observes timing, wording, or hit behavior to map a valuable neighborhood.

Plant. A crafted prompt receives an attacker-favored answer and is admitted to the shared cache.

Collide. A normal user's nearby query selects the poisoned entry under the similarity threshold.

Replay. The service returns the cached answer without invoking the LLM's current instructions or checks.

The key collision is structural. Cryptographic hashes try to make a tiny input change produce a radically different output; embeddings intentionally keep related inputs close. A 2026 ICML paper frames semantic keys as “fuzzy hashes” and reports an 86% response-hijacking hit rate for its black-box attack in evaluated settings, including agent workflows. NDSS researchers separately demonstrated semantic cache poisoning across services from AWS, Azure, and Alibaba.
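The contrast can be seen in a few lines. This is only an illustration: SHA-256 comes from node:crypto, while wordOverlap is a toy stand-in for an embedding model (an assumption made so the example runs without a model call) that scores two prompts by the Jaccard overlap of their word sets.

```typescript
import { createHash } from "node:crypto";

const sha256Hex = (text: string): string =>
  createHash("sha256").update(text, "utf8").digest("hex");

// Count the hex positions at which two equal-length digests differ.
function differingHexChars(a: string, b: string): number {
  let n = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) n++;
  }
  return n;
}

// Toy stand-in for an embedding comparison (NOT a real embedding):
// Jaccard overlap of the lowercase word sets of the two prompts.
function wordOverlap(a: string, b: string): number {
  const words = (s: string): Set<string> =>
    new Set(s.toLowerCase().match(/[a-z0-9']+/g) ?? []);
  const wa = words(a);
  const wb = words(b);
  let shared = 0;
  wa.forEach((w) => {
    if (wb.has(w)) shared++;
  });
  const union = wa.size + wb.size - shared;
  return union === 0 ? 1 : shared / union;
}

const promptA = "How do I reset my password?";
const promptB = "How do I reset my password";

// The exact-key view: one dropped character yields an unrelated digest.
console.log("differing hex chars:", differingHexChars(sha256Hex(promptA), sha256Hex(promptB)));
// The fuzzy-key view: the same edit leaves the similarity score untouched.
console.log("word overlap:", wordOverlap(promptA, promptB));
```

An exact-key cache would treat the two prompts as strangers; a fuzzy key treats them as the same question. That property is what makes semantic caching useful and also what an attacker aims at.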

Why ordinary guardrails miss it

The poisoned response can be generated during one request but delivered during another. Output checks that run only after a fresh LLM call, or system instructions that the cached path never evaluates, cannot protect the victim.

03 // Dynamic thresholding

There is no universal safe cosine threshold

A static threshold assumes every intent tolerates the same semantic error. It does not. “How do I reset my password?” may tolerate a carefully validated FAQ response. “Should this transaction be approved?”, “Which tenant owns this record?”, or a tool-execution request should not accept a fuzzy substitute at all. Negation, numbers, named entities, dates, and permission language can change the correct answer while leaving vectors deceptively close.

Public FAQ: strict semantic hit

Calibrated per intent, followed by an equivalence check.

General knowledge: very strict + fresh

Short TTL, source and model version match, reject time-sensitive queries.

High risk / personalized: exact match or bypass

No shared semantic response for auth, finance, medical, tools, or private data.

secure-cache.ts (illustrative thresholds; calibrate with evals)

const policy = {
  public_faq:      { semantic: true,  minSimilarity: 0.97 },
  general:         { semantic: true,  minSimilarity: 0.99 },
  high_risk:       { semantic: false, minSimilarity: 1.00 },
  personalized:    { semantic: false, minSimilarity: 1.00 }
};

async function lookup(ctx, prompt) {
  const intent = await classifyIntent(prompt);
  const rule = policy[intent.class];
  // Unknown intent classes have no rule: fail closed and skip the cache.
  if (!rule || !rule.semantic || intent.timeSensitive) return null;

  const namespace = cacheNamespace(ctx);
  const vector = await embed(prompt);
  const hit = await vectorStore.nearest(vector, {
    limit: 1,
    filter: { namespace, intent: intent.class, status: "approved" }
  });

  if (!hit || hit.similarity < rule.minSimilarity) return null;
  if (!(await equivalent(prompt, hit.prompt))) return null;
  if (await redis.exists(`sc:quarantine:${hit.id}`)) return null;

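  // Verify integrity before counting the hit. The await is harmless if the
  // helper is synchronous and avoids an always-truthy Promise if it is not.
  if (!(await verifyResponseDigest(hit))) return null;
  await recordCacheHit(hit, ctx.actorId);
  return hit.response;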
}

The second-stage equivalent() check should explicitly compare intent, entities, negation, constraints, and required freshness, not merely run the same embedding comparison twice. Calibrate thresholds against labeled false-hit tests for each intent. Research on adaptive semantic caching similarly argues that one static threshold provides no general correctness guarantee.
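One way to make part of that second stage concrete is a cheap lexical pre-filter that runs before any model-based judgment. The sketch below is an assumption about how such a guard could look, not the equivalent() helper used above: extractFeatures and lexicalGuardsAgree are hypothetical names, the negation list is deliberately short and English-only, and "entities" are approximated as capitalized words that are not sentence-initial.

```typescript
type Features = { negated: boolean; numbers: string[]; entities: string[] };

// Small, English-only negation list (assumption); extend it per language and domain.
const NEGATIONS =
  /\b(not|no|never|without|cannot|can't|don't|doesn't|isn't|shouldn't|won't)\b/i;

function extractFeatures(prompt: string): Features {
  // Numbers, including simple decimals and thousands separators.
  const numbers = (prompt.match(/\d+(?:[.,]\d+)*/g) ?? []).slice().sort();

  // Crude entity proxy: capitalized tokens after the first word of the prompt.
  const entities = prompt
    .split(/\s+/)
    .filter(Boolean)
    .slice(1)
    .map((t) => t.replace(/[^A-Za-z0-9-]/g, ""))
    .filter((t) => t.length > 1 && /^[A-Z]/.test(t))
    .map((t) => t.toLowerCase())
    .sort();

  return { negated: NEGATIONS.test(prompt), numbers, entities };
}

const sameList = (a: string[], b: string[]): boolean =>
  a.length === b.length && a.every((v, i) => v === b[i]);

// True only when negation, numbers, and entity proxies all match.
// A "true" here is permission to continue checking, not proof of equivalence.
function lexicalGuardsAgree(incoming: string, cached: string): boolean {
  const x = extractFeatures(incoming);
  const y = extractFeatures(cached);
  return (
    x.negated === y.negated &&
    sameList(x.numbers, y.numbers) &&
    sameList(x.entities, y.entities)
  );
}
```

A guard like this rejects "Refund order 1042" against a cached "Refund order 1043" even when their vectors are nearly identical. It will also reject some harmless paraphrases, which is the intended bias: a false miss costs one LLM call, while a false hit delivers the wrong answer.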

04 // Cryptographic namespace + metadata

Bind reuse to the exact security context

Do not append a random salt to prompt text before embedding; that damages semantic geometry and still does not create authorization. Instead, derive a deterministic, server-side HMAC namespace from every fact that changes who may receive an answer or how it was produced. Then apply an exact metadata filter before vector ranking.

cache-namespace.ts (security partition before similarity)

import { createHash, createHmac } from "node:crypto";

const sha256 = (value) =>
  createHash("sha256").update(value).digest("hex");

function cacheNamespace(ctx) {
  const boundary = JSON.stringify({
    tenantId: ctx.tenantId,
    role: ctx.role,
    model: ctx.model,
    embeddingModel: ctx.embeddingModel,
    systemPromptHash: sha256(ctx.systemPrompt),
    toolPolicyVersion: ctx.toolPolicyVersion
  });

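  const key = process.env.CACHE_NAMESPACE_KEY;
  // Fail closed with a clear message instead of letting createHmac
  // throw a generic type error on an undefined key.
  if (!key) throw new Error("CACHE_NAMESPACE_KEY is not set");

  return createHmac("sha256", key)
    .update(boundary)
    .digest("base64url");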
}

Store the namespace with intent, systemPromptHash, model versions, provenance, admission status, TTL, and a digest of the response. Rotate the namespace when any policy input changes. The HMAC prevents clients from forging another partition; it does not stop semantic collisions inside that partition, which is why thresholds, validation, and monitoring still matter.
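A sketch of what the stored record and its integrity check might look like. The CacheEntry shape, responseDigest, and verifyEntry are illustrative names, and two choices here are assumptions rather than requirements from the text: the digest is a keyed HMAC instead of a plain hash, so that someone with write access to the vector store but not the key cannot recompute it, and it covers the namespace and prompt as well as the response so an entry cannot be moved between partitions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical record shape; field names are assumptions for illustration.
interface CacheEntry {
  id: string;
  namespace: string;
  intent: string;
  prompt: string;
  response: string;
  status: "approved" | "pending" | "quarantined";
  createdAt: number; // epoch milliseconds
  ttlSeconds: number;
  responseDigest: string; // hex HMAC over namespace, prompt, and response
}

function responseDigest(
  key: string,
  namespace: string,
  prompt: string,
  response: string
): string {
  const h = createHmac("sha256", key);
  // Length-prefix each field so content cannot shift across field boundaries.
  for (const field of [namespace, prompt, response]) {
    h.update(`${Buffer.byteLength(field, "utf8")}:`);
    h.update(field, "utf8");
  }
  return h.digest("hex");
}

// Serve an entry only if it is approved, unexpired, and its digest still matches.
function verifyEntry(key: string, entry: CacheEntry, now: number = Date.now()): boolean {
  if (entry.status !== "approved") return false;
  if (now > entry.createdAt + entry.ttlSeconds * 1000) return false;

  const expected = Buffer.from(
    responseDigest(key, entry.namespace, entry.prompt, entry.response),
    "hex"
  );
  const actual = Buffer.from(entry.responseDigest, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}
```

The digest detects tampering with a stored entry. It says nothing about whether the response was safe to admit in the first place, which is a separate decision.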

Admission is as important as lookup

Never cache every model output automatically. Validate schema and safety, reject responses containing secrets or personalized data, and keep untrusted-user entries private until they pass review, consensus, or a trusted-publisher rule. Some workloads should use an exact-key cache only.
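The admission rules above can be sketched as a single decision function. Everything here is hypothetical scaffolding: the Candidate fields, the intent class names (borrowed from the policy table earlier), and especially SECRET_PATTERNS, which is a tiny illustrative list and no substitute for a real secret and PII scanner.

```typescript
type Admission = "shared" | "private" | "reject";

// Hypothetical candidate shape; a real system would also carry schema
// validation results and safety-classifier output.
interface Candidate {
  response: string;
  intentClass: string;
  trustedPublisher: boolean; // produced by a trusted internal publisher
  reviewed: boolean; // passed human review or a consensus check
}

// Illustrative patterns only (assumption): PEM private keys, strings shaped
// like AWS access key IDs, and email addresses treated as personal data.
const SECRET_PATTERNS: RegExp[] = [
  /-----BEGIN [A-Z ]*PRIVATE KEY-----/,
  /\bAKIA[0-9A-Z]{16}\b/,
  /\b[\w.+-]+@[\w-]+\.[\w.-]+\b/,
];

// Only these intent classes may ever produce shared semantic entries.
const SHAREABLE_INTENTS = new Set(["public_faq", "general"]);

function admissionDecision(c: Candidate): Admission {
  if (c.response.trim().length === 0) return "reject";
  if (SECRET_PATTERNS.some((p) => p.test(c.response))) return "reject";
  // High-risk and personalized intents never enter the shared cache.
  if (!SHAREABLE_INTENTS.has(c.intentClass)) return "private";
  // Untrusted-user output stays private until it is reviewed.
  return c.trustedPublisher || c.reviewed ? "shared" : "private";
}
```

The default outcome is "private", not "shared": an entry must earn its way into the shared cache, so a bug in the trust signals degrades hit rate instead of exposing other users to unreviewed output.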

05 // Anomalous-hit detection

Watch the coordinates attackers keep touching

Poisoning and vector-space scanning create telemetry. A single entry may attract an unusual number of near matches, distinct actors, repeated near misses, or a sudden change in intent mix. Redis can keep short-window counters cheaply: INCR and EXPIRE form a fixed window, while HyperLogLog PFADD/PFCOUNT estimates unique actors without retaining their raw identifiers.

cache-telemetry.ts (node-redis)

async function recordCacheHit(hit, actorId) {
  const minute = Math.floor(Date.now() / 60_000);
  const hitsKey = `sc:hits:${hit.id}:${minute}`;
  const actorsKey = `sc:actors:${hit.id}:${minute}`;
  const actorTag = hmacTelemetryId(actorId); // never store the raw ID

  const hits = await redis.eval(`
    local n = redis.call("INCR", KEYS[1])
    if n == 1 then redis.call("EXPIRE", KEYS[1], ARGV[1]) end
    return n
  `, { keys: [hitsKey], arguments: ["120"] });

  await redis.pfAdd(actorsKey, [actorTag]);
  await redis.expire(actorsKey, 120);
  const uniqueActors = await redis.pfCount(actorsKey);

  const baseline = await baselineFor(hit.intent, hit.namespace);
  if (hits > baseline.maxHits || uniqueActors > baseline.maxActors) {
    await redis.set(`sc:quarantine:${hit.id}`, "1", { EX: 900 });
    await alertSecurity({ hit, hits, uniqueActors });
  }
}

Baselines must be per intent and namespace; a popular public FAQ naturally behaves differently from an internal finance query. Quarantine should fail safe by forcing a fresh LLM path, rerunning all output controls, and preserving evidence for review. Track similarity bands and near misses too: a scanner may never trigger one coordinate's hit counter while still probing a dense region.
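Near-miss tracking can be sketched independently of the hit counters. The version below keeps counts in process memory so it runs on its own; that is an assumption for illustration, and a deployment would more plausibly use expiring Redis keys like the counters above. similarityBand, NearMissTracker, and the 0.05 margin are all hypothetical and would need calibration per intent.

```typescript
type Band = "hit" | "near_miss" | "far";

// Classify a lookup by how close it came to the intent's threshold.
function similarityBand(
  similarity: number,
  minSimilarity: number,
  nearMissMargin: number = 0.05
): Band {
  if (similarity >= minSimilarity) return "hit";
  if (similarity >= minSimilarity - nearMissMargin) return "near_miss";
  return "far";
}

// Counts near misses per actor and time window, regardless of which entry
// was nearly matched, so probing spread across a dense region still adds up.
class NearMissTracker {
  private readonly counts = new Map<string, number>();
  private readonly maxNearMisses: number;

  constructor(maxNearMisses: number) {
    this.maxNearMisses = maxNearMisses;
  }

  // Returns true once this actor exceeds the limit within the window.
  // Note: this sketch never evicts old windows; real storage needs a TTL.
  record(actorTag: string, windowId: number, band: Band): boolean {
    if (band !== "near_miss") return false;
    const key = `${actorTag}:${windowId}`;
    const n = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, n);
    return n > this.maxNearMisses;
  }
}
```

Keying by actor instead of by entry is the point of the design: a scanner that touches fifty different coordinates once each never trips a per-entry hit counter, but it does accumulate fifty near misses under one actor tag.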

06 // Secure architecture

Treat cached responses as untrusted, replayable artifacts

Partition first

Exact tenant, role, system-prompt, model, and tool-policy namespace filtering occurs before nearest-neighbor search.

Classify before lookup

High-risk, personalized, tool-using, and time-sensitive intents bypass shared semantic reuse.

Control admission

Only validated, attributable, non-sensitive responses become shared entries; everything else stays private or uncached.

Detect and quarantine

Hit-rate, actor-cardinality, near-miss, and drift signals can immediately remove an entry from service.

Finally, run the cache path through the same authorization, response schema, content policy, and audit pipeline as a fresh generation. Log why an entry was selected, which namespace and policy matched, its similarity and equivalence scores, and whether the response digest verified. Red-team with close paraphrases, negation flips, entity swaps, stale facts, cross-role requests, and deliberately poisoned admissions.

Semantic caching can still cut latency and cost. The mistake is treating proximity as proof. Make security boundaries exact, similarity intent-aware, admission selective, and every popular coordinate observable. Then a poisoned vector becomes a contained anomaly instead of an answer your application confidently repeats.

Sources // primary research and documentation