AI security // vector integrity
Semantic Cache Poisoning
The optimization that skips an expensive LLM call can also skip the model, its system prompt, and its safety checks—then replay an attacker's answer to an innocent user.
Security invariant
Similarity is a performance signal, not authorization
A cache hit should never cross a tenant, role, policy, model, or system-prompt boundary. Within an allowed boundary, similarity is only the first test: intent, freshness, provenance, and response safety still decide whether reuse is acceptable.
01 // The optimization
How a semantic cache turns meaning into a key
A conventional cache needs an exact key. A semantic cache embeds a
prompt with a model such as text-embedding-3-small,
stores that numerical vector beside the LLM response, and searches
for the nearest existing vector when another request arrives.
OpenAI describes embeddings as numerical representations that
measure how related two pieces of text are.
Convert the new prompt into a vector.
Find the nearest cached prompt in the allowed namespace.
Evaluate distance, intent, policy, and freshness.
Return a safe hit or ask the LLM and consider admission.
Teams often use cosine similarity s, where values
nearer 1 indicate closer direction. Some databases expose cosine
distance d = 1 - s, where smaller is closer. Confusing
those conventions silently reverses a policy. A rule such as
“distance below 0.2” is therefore not merely a tuning constant; it
is an integrity decision that determines which answer may replace
a fresh generation.
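The two conventions can be made explicit in code so that policy checks only ever compare one of them. The following is a minimal sketch, not any vector store's API: the helper names and the metric labels "cosine_similarity" and "cosine_distance" are assumptions made for illustration.

```javascript
// Hypothetical helpers; names and metric labels are illustrative only.
function cosineSimilarity(a, b) {
  if (a.length !== b.length || a.length === 0) {
    throw new RangeError("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new RangeError("a zero vector has no direction");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Normalize whatever the store reports into similarity s, using d = 1 - s.
function toSimilarity(score, metric) {
  if (metric === "cosine_similarity") return score;
  if (metric === "cosine_distance") return 1 - score;
  throw new Error(`unknown metric: ${metric}`); // fail closed on ambiguity
}

function passesThreshold(score, metric, minSimilarity) {
  return toSimilarity(score, metric) >= minSimilarity;
}
```

With this shape, a raw score of 0.02 passes a 0.97 threshold only when it is labeled as a distance. The same number labeled as a similarity is rejected, which is the reversal the paragraph above warns about.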
02 // Vector-space exploitation
The cache has a blind spot: semantic keys are deliberately fuzzy
An attacker does not need to overwrite a database row. They need a legitimate-looking request whose vector lands near a useful cluster while its wording persuades the LLM to produce a specific false, unsafe, or attacker-controlled response. If the system admits that prompt-response pair, the attacker has planted a poisoned coordinate.
Probe. The attacker varies prompts and observes timing, wording, or hit behavior to map a valuable neighborhood.
Plant. A crafted prompt receives an attacker-favored answer and is admitted to the shared cache.
Collide. A normal user's nearby query selects the poisoned entry under the similarity threshold.
Replay. The service returns the cached answer without invoking the LLM's current instructions or checks.
The key collision is structural. Cryptographic hashes try to make a tiny input change produce a radically different output; embeddings intentionally keep related inputs close. A 2026 ICML paper frames semantic keys as “fuzzy hashes” and reports an 86% response-hijacking hit rate for its black-box attack in evaluated settings, including agent workflows. NDSS researchers separately demonstrated semantic cache poisoning across services from AWS, Azure, and Alibaba.
The poisoned response can be generated during one request but delivered during another. Output checks that run only after a fresh LLM call, or system instructions that the cached path never evaluates, cannot protect the victim.
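One way to close that gap is to route both paths through the same output check. The sketch below assumes three hypothetical dependencies supplied by the service (`deps.lookup`, `deps.generate`, `deps.checkOutput`); none of them is a real library call, and the result shape is illustrative.

```javascript
// Hypothetical wiring: lookup, generate, and checkOutput stand in for the
// service's own cache lookup, model call, and current output policy.
async function answer(prompt, deps) {
  const cached = await deps.lookup(prompt); // null or undefined on a miss
  if (cached !== null && cached !== undefined) {
    // The cached text is checked against today's policy, not the policy
    // that was active when the entry was generated.
    const cachedVerdict = await deps.checkOutput(cached, prompt);
    if (cachedVerdict.ok) return { source: "cache", response: cached };
    // A cached entry that fails the check is treated as a miss.
  }
  const fresh = await deps.generate(prompt);
  const freshVerdict = await deps.checkOutput(fresh, prompt);
  return freshVerdict.ok
    ? { source: "llm", response: fresh }
    : { source: "blocked", response: null };
}
```

The design choice is that a cache hit saves the generation cost but never the policy check, so an entry planted under older or weaker rules is still evaluated before it reaches a user.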
03 // Dynamic thresholding
There is no universal safe cosine threshold
A static threshold assumes every intent tolerates the same semantic error. It does not. “How do I reset my password?” may tolerate a carefully validated FAQ response. “Should this transaction be approved?”, “Which tenant owns this record?”, or a tool-execution request should not accept a fuzzy substitute at all. Negation, numbers, named entities, dates, and permission language can change the correct answer while leaving vectors deceptively close.
Calibrated per intent, followed by an equivalence check.
Short TTL, source and model version match, reject time-sensitive queries.
No shared semantic response for auth, finance, medical, tools, or private data.
// Per-intent reuse policy. Thresholds are illustrative; calibrate each one
// against labeled false-hit tests for that intent.
const policy = {
  public_faq:   { semantic: true,  minSimilarity: 0.97 },
  general:      { semantic: true,  minSimilarity: 0.99 },
  high_risk:    { semantic: false, minSimilarity: 1.00 },
  personalized: { semantic: false, minSimilarity: 1.00 }
};

async function lookup(ctx, prompt) {
  const intent = await classifyIntent(prompt);
  const rule = policy[intent.class];
  // Fail closed: an unknown intent class gets no semantic reuse.
  if (!rule || !rule.semantic || intent.timeSensitive) return null;

  // Exact security boundary first, fuzzy ranking second.
  const namespace = cacheNamespace(ctx);
  const vector = await embed(prompt);
  const hit = await vectorStore.nearest(vector, {
    limit: 1,
    filter: { namespace, intent: intent.class, status: "approved" }
  });

  // Assumes the store reports similarity (higher is closer), not distance.
  if (!hit || hit.similarity < rule.minSimilarity) return null;
  if (!(await equivalent(prompt, hit.prompt))) return null;
  if (await redis.exists(`sc:quarantine:${hit.id}`)) return null;

  await recordCacheHit(hit, ctx.actorId);
  // Await the check: if it is async, an un-awaited promise is always truthy.
  return (await verifyResponseDigest(hit)) ? hit.response : null;
}
The second-stage equivalent check should explicitly
compare intent, entities, negation, constraints, and required
freshness—not merely run the same embedding comparison twice.
Calibrate thresholds against labeled false-hit tests for each
intent. Research on adaptive semantic caching similarly argues that
one static threshold provides no general correctness guarantee.
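As a rough illustration of what "not the same embedding comparison twice" can mean, the sketch below is a cheap lexical pre-filter that rejects a candidate hit when the two prompts disagree on negation or on the numbers they mention. It is an assumption-laden toy: the function names and the English-only negation list are hypothetical, it does not handle entities, typographic apostrophes, or antonyms such as "enable" versus "disable", and it would sit in front of a fuller equivalence check rather than replace it.

```javascript
// Hypothetical, English-only negation cues; extend per language and domain.
const NEGATIONS = new Set(["not", "no", "never", "without", "cannot"]);

function surfaceFeatures(text) {
  const tokens = text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  return {
    // True if any token is a negation cue or a contraction ending in n't.
    negated: tokens.some((t) => NEGATIONS.has(t) || t.endsWith("n't")),
    // Sorted, joined list of purely numeric tokens.
    numbers: tokens.filter((t) => /^\d+$/.test(t)).sort().join(","),
  };
}

// Returns true when the prompts differ on negation or on their numbers,
// in which case the cached answer should not be reused.
function obviousMismatch(promptA, promptB) {
  const a = surfaceFeatures(promptA);
  const b = surfaceFeatures(promptB);
  return a.negated !== b.negated || a.numbers !== b.numbers;
}
```

For example, "Refunds are allowed within 30 days" and "Refunds are allowed within 90 days" are likely to embed close together, yet this filter flags them because the numeric tokens differ. A false return only means no obvious conflict was found, so the remaining checks still apply.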
04 // Cryptographic namespace + metadata
Bind reuse to the exact security context
Do not append a random salt to prompt text before embedding; that damages semantic geometry and still does not create authorization. Instead, derive a deterministic, server-side HMAC namespace from every fact that changes who may receive an answer or how it was produced. Then apply an exact metadata filter before vector ranking.
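import { createHash, createHmac } from "node:crypto";

// Hex SHA-256 digest, used here to fingerprint the system prompt.
const sha256 = (value) =>
  createHash("sha256").update(value).digest("hex");

function cacheNamespace(ctx) {
  // Fail closed if the server-side key is missing or empty.
  const key = process.env.CACHE_NAMESPACE_KEY;
  if (!key) throw new Error("CACHE_NAMESPACE_KEY is not set");

  // The fixed property order keeps the serialized boundary deterministic.
  const boundary = JSON.stringify({
    tenantId: ctx.tenantId,
    role: ctx.role,
    model: ctx.model,
    embeddingModel: ctx.embeddingModel,
    systemPromptHash: sha256(ctx.systemPrompt),
    toolPolicyVersion: ctx.toolPolicyVersion
  });

  return createHmac("sha256", key)
    .update(boundary)
    .digest("base64url");
}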
Store the namespace with intent,
systemPromptHash, model versions, provenance,
admission status, TTL, and a digest of the response. Rotate the
namespace when any policy input changes. The HMAC prevents clients
from forging another partition; it does not stop semantic
collisions inside that partition, which is why thresholds,
validation, and monitoring still matter.
Never cache every model output automatically. Validate schema and safety, reject responses containing secrets or personalized data, and keep untrusted-user entries private until they pass review, consensus, or a trusted-publisher rule. Some workloads should use an exact-key cache only.
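The admission rule can be written as a small decision function. This is a sketch under stated assumptions: the boolean fields on `candidate` are hypothetical outputs of validators the service would have to supply, and the status strings are illustrative apart from "approved", which matches the lookup filter shown earlier.

```javascript
// Hypothetical admission gate. Every flag must be set explicitly; a missing
// flag is treated as a failure so the gate fails closed.
function admissionStatus(candidate) {
  const {
    schemaValid,      // response matched the expected schema
    safetyPassed,     // response passed the current content policy
    containsSecrets,  // a secret scanner flagged the response
    personalized,     // response depends on the requesting user's data
    trustedPublisher, // entry comes from a trusted-publisher rule
    reviewed,         // entry passed review or consensus
  } = candidate;

  if (schemaValid !== true || safetyPassed !== true) return "rejected";
  if (containsSecrets !== false || personalized !== false) return "rejected";
  if (trustedPublisher === true || reviewed === true) return "approved";
  return "private"; // usable only by its author until promoted
}
```

Only entries with status "approved" would be visible to shared lookups. A "private" entry stays scoped to the actor who produced it, which limits what a planted response can reach.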
05 // Anomalous-hit detection
Watch the coordinates attackers keep touching
Poisoning and vector-space scanning create telemetry. A single
entry may attract an unusual number of near matches, distinct
actors, repeated near misses, or a sudden change in intent mix.
Redis can keep short-window counters cheaply: INCR and
EXPIRE form a fixed window, while HyperLogLog
PFADD/PFCOUNT estimates unique actors
without retaining their raw identifiers.
// Call shapes below follow a node-redis v4-style client.
async function recordCacheHit(hit, actorId) {
  // One counter pair per entry per minute (fixed window).
  const minute = Math.floor(Date.now() / 60_000);
  const hitsKey = `sc:hits:${hit.id}:${minute}`;
  const actorsKey = `sc:actors:${hit.id}:${minute}`;
  const actorTag = hmacTelemetryId(actorId); // never store the raw ID

  // INCR and EXPIRE run atomically so the counter always gets a TTL.
  const hits = await redis.eval(`
    local n = redis.call("INCR", KEYS[1])
    if n == 1 then redis.call("EXPIRE", KEYS[1], ARGV[1]) end
    return n
  `, { keys: [hitsKey], arguments: ["120"] });

  // HyperLogLog estimate of distinct actors touching this entry.
  await redis.pfAdd(actorsKey, [actorTag]);
  await redis.expire(actorsKey, 120);
  const uniqueActors = await redis.pfCount(actorsKey);

  const baseline = await baselineFor(hit.intent, hit.namespace);
  if (hits > baseline.maxHits || uniqueActors > baseline.maxActors) {
    // Quarantine for 15 minutes; lookups skip quarantined entries.
    await redis.set(`sc:quarantine:${hit.id}`, "1", { EX: 900 });
    await alertSecurity({ hit, hits, uniqueActors });
  }
}
Baselines must be per intent and namespace; a popular public FAQ naturally behaves differently from an internal finance query. Quarantine should fail safe by forcing a fresh LLM path, rerunning all output controls, and preserving evidence for review. Track similarity bands and near misses too: a scanner may never trigger one coordinate's hit counter while still probing a dense region.
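Near misses can be counted per actor and region rather than per entry. The sketch below is one possible shape, with assumptions throughout: the 0.03 margin, the `sc:nearmiss:` key layout, and the function names are all hypothetical, and `redis` is assumed to expose `incr` and `expire` as a node-redis v4-style client does.

```javascript
// Classify a lookup result relative to the intent's threshold.
function similarityBand(similarity, minSimilarity, margin = 0.03) {
  if (similarity >= minSimilarity) return "hit";
  if (similarity >= minSimilarity - margin) return "near_miss";
  return "miss";
}

// Count near misses per actor within a namespace and intent, in one-minute
// fixed windows. `actorTag` should be the keyed telemetry ID, not a raw ID.
async function recordLookupBand(redis, lookup, now = Date.now()) {
  const { namespace, intent, actorTag, similarity, minSimilarity } = lookup;
  const band = similarityBand(similarity, minSimilarity);
  if (band !== "near_miss") return { band, count: 0 };

  const minute = Math.floor(now / 60_000);
  const key = `sc:nearmiss:${namespace}:${intent}:${actorTag}:${minute}`;
  const count = await redis.incr(key);
  // Not atomic with INCR; a Lua script like the hit counter's would be.
  if (count === 1) await redis.expire(key, 120);
  return { band, count };
}
```

An actor whose near-miss count climbs while its hit count stays flat is a candidate for the scanning pattern described above. What count is abnormal still has to come from the per-intent, per-namespace baseline.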
06 // Secure architecture
Treat cached responses as untrusted, replayable artifacts
Partition first
Exact tenant, role, system-prompt, model, and tool-policy namespace filtering occurs before nearest-neighbor search.
Classify before lookup
High-risk, personalized, tool-using, and time-sensitive intents bypass shared semantic reuse.
Control admission
Only validated, attributable, non-sensitive responses become shared entries; everything else stays private or uncached.
Detect and quarantine
Hit-rate, actor-cardinality, near-miss, and drift signals can immediately remove an entry from service.
Finally, run the cache path through the same authorization, response schema, content policy, and audit pipeline as a fresh generation. Log why an entry was selected, which namespace and policy matched, its similarity and equivalence scores, and whether the response digest verified. Red-team with close paraphrases, negation flips, entity swaps, stale facts, cross-role requests, and deliberately poisoned admissions.
Semantic caching can still cut latency and cost. The mistake is treating proximity as proof. Make security boundaries exact, similarity intent-aware, admission selective, and every popular coordinate observable. Then a poisoned vector becomes a contained anomaly instead of an answer your application confidently repeats.
Sources // primary research and documentation
Technical references
- OpenAI — text-embedding-3-small model documentation
- NDSS 2026 — When Cache Poisoning Meets LLM Systems
- ICML 2026 — From Similarity to Vulnerability: Key Collision Attack on LLM Semantic Caching
- vCache — Adaptive Semantic Prompt Caching with VectorQ
- Redis — Rate limiter architecture and atomic counters
- Redis — PFCOUNT and HyperLogLog cardinality