Semantic Caching for LLM Apps: GPTCache vs Redis vs Upstash (2026)

A hands-on comparison of GPTCache, Redis LangCache, Upstash, and Canopy for semantic caching, with real hit rates, costs, and threshold-tuning lessons from production.

By Fanny Engriana · June 2, 2026 · 11 min read

If you ship anything backed by an LLM API, your bill grows with traffic in a way that feels almost punitive. Every repeated question costs the same as a brand-new one, even when ten users ask the same thing five different ways. Semantic caching is the cheapest fix I know of for that problem, and it is still underused. This is a breakdown of the four tools I keep coming back to in 2026 — GPTCache, Redis LangCache, Upstash Semantic Cache, and Canopy — with the hit rates, costs, and traps I have measured running them in production.

I run seven content-aggregator sites that hit OpenAI and Anthropic endpoints on a daily schedule, plus a handful of AI products built at Warung Digital Teknologi — BizChat Revenue Assistant, ServiceBot AI Helpdesk, and DocSumm AI Summarizer among them. Caching moved from "nice to have" to "non-negotiable" the month one of those products crossed 40,000 requests. So this is written from the chair of someone paying the invoice, not reviewing a press kit.

Semantic caching is not the same as prompt caching

People conflate the two constantly, and getting them mixed up will cost you money. They solve different problems and stack on top of each other.

Provider prompt caching (OpenAI's automatic prompt caching, Anthropic's cache breakpoints, Gemini's context cache) reuses the prefix of a prompt. It is an exact-match, token-level mechanism: if the first 2,000 tokens of your request are identical to those of a recent request, the provider skips re-processing them and discounts that portion. A single changed character breaks the match from that point onward, so rewording the start of a prompt forfeits the discount entirely.

Semantic caching works at the meaning level. It embeds the incoming query into a vector, searches a vector store for a previously answered query whose embedding is close enough (cosine similarity above a threshold you set), and if it finds one, returns the stored answer without calling the model at all. "What's your refund window?" and "How long do I have to return something?" hit the same cache entry. The model is never invoked, so you save that call's entire generation cost (the embedding lookup is all you pay for) and get a response in milliseconds instead of seconds.
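The lookup described above can be sketched in a few lines. This is a minimal illustration, not any tool's actual implementation: the three-dimensional vectors are hand-made stand-ins for real embeddings, and `lookup` and the sample answers are hypothetical names and data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def lookup(query_vec, entries, threshold):
    """Return the answer of the closest stored entry, or None on a miss.

    entries is a list of (embedding, answer) pairs.
    """
    best_score, best_answer = -1.0, None
    for vec, answer in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None

# Toy vectors standing in for real embeddings (illustrative only).
entries = [
    ([0.9, 0.1, 0.0], "Refunds are accepted within 30 days."),
    ([0.0, 0.2, 0.9], "We ship to 40 countries."),
]
hit = lookup([0.85, 0.15, 0.05], entries, threshold=0.92)   # close to entry 1
miss = lookup([0.5, 0.5, 0.5], entries, threshold=0.92)     # close to nothing
```

A real vector store replaces the linear scan with an approximate nearest-neighbour index, but the final comparison against the threshold is the same.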

The takeaway I'd hand a junior engineer: prompt caching discounts calls you still make; semantic caching eliminates calls entirely. Run both. They do not conflict.

How a semantic cache actually behaves in production

The mechanics are simple, and the failure modes hide in the details:

Embedding step. Each incoming query is embedded. This itself costs a small amount (or adds latency if self-hosted), so a cache that embeds on every single request has a floor cost you cannot escape.
Similarity search. The vector store returns the nearest stored query. You compare its similarity score against a threshold.
The threshold is the whole game. Set it too loose (say 0.80) and you serve "the capital of France is Paris" to someone asking about Germany — a false-positive cache hit, the most dangerous bug in this whole category because it is silent. Set it too tight (0.97) and your hit rate collapses to near zero.
Eviction and TTL. Cached answers go stale. Pricing pages, inventory, anything time-sensitive needs a short TTL or it will lie to users.
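The TTL step in the list above is easy to get wrong, so here is a minimal sketch of expiry-on-read. `TTLStore` is a hypothetical class, not part of any of the four tools; the key would be whatever id your vector store returns for the matched query, and the clock is injectable so expiry can be exercised without sleeping.

```python
import time

class TTLStore:
    """Answer store whose entries expire ttl_seconds after being written."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (answer, stored_at)

    def set(self, key, answer):
        self._entries[key] = (answer, self.clock())

    def get(self, key):
        item = self._entries.get(key)
        if item is None:
            return None
        answer, stored_at = item
        if self.clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # stale: evict and report a miss
            return None
        return answer

# Usage with a fake clock so the example is deterministic.
now = [0.0]
store = TTLStore(ttl_seconds=60, clock=lambda: now[0])
store.set("entry-1", "cached answer")
now[0] = 30.0
fresh = store.get("entry-1")    # still inside the TTL
now[0] = 61.0
expired = store.get("entry-1")  # past the TTL, treated as a miss
```

Hosted caches typically expose TTL as a setting rather than something you implement, but the behaviour to verify is the same: a stale hit must turn into a miss.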

When I first wired semantic caching into BizChat, I started at a 0.85 threshold because a blog post told me to. Within two days a customer-facing answer about a specific pricing tier got served to a question about a different tier — close in embedding space, wrong in fact. I pulled the threshold to 0.92 for anything touching numbers or policy, kept 0.86 for general FAQ chit-chat, and split the cache into two namespaces by risk. That two-tier threshold approach is the single most useful thing I have learned in this area, and almost nobody mentions it.
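One way to express the two-tier idea is a small routing table. The 0.92 and 0.86 thresholds come from the paragraph above; the keyword-based router is purely an assumption for illustration (the text does not say how queries are assigned to a namespace), and substring matching this naive will misfire on words like "frontier".

```python
# Thresholds per risk namespace, taken from the text above.
THRESHOLDS = {"high_risk": 0.92, "general": 0.86}

# Hypothetical trigger words; a real router might use route names or a classifier.
HIGH_RISK_KEYWORDS = ("price", "pricing", "refund", "policy", "tier", "cost")

def pick_namespace(query):
    """Send anything mentioning numbers or policy terms to the strict namespace."""
    text = query.lower()
    mentions_keyword = any(word in text for word in HIGH_RISK_KEYWORDS)
    mentions_number = any(ch.isdigit() for ch in text)
    return "high_risk" if (mentions_keyword or mentions_number) else "general"

def is_hit(query, similarity):
    """Accept a candidate match only if it clears its namespace's threshold."""
    return similarity >= THRESHOLDS[pick_namespace(query)]

# The same 0.89 similarity is a hit for chit-chat and a miss for pricing.
pricing_hit = is_hit("How much does the Pro tier cost?", 0.89)
general_hit = is_hit("What can you help me with?", 0.89)
```

Keeping the two namespaces in separate indexes (not just separate thresholds) also prevents a pricing answer from ever being a candidate for a general question.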

The four tools, compared

Here is how they stack up on the dimensions that actually decided my choices. Pricing reflects published rates and my own usage as of mid-2026; verify before committing.

Tool	Type	Self-host	Embeds for you	Best for	License / pricing
GPTCache (Zilliz)	Python library	Yes (you run everything)	No — you wire the embedder	Full control, custom pipelines	MIT, free
Redis LangCache	Managed service on Redis	Via Redis Enterprise / Cloud	Yes (managed)	Teams already on Redis	Usage-based, free dev tier
Upstash Semantic Cache	Serverless	No (fully hosted)	Yes (internal)	Serverless / edge apps, zero ops	Pay-per-request, generous free tier
Canopy (Pinecone)	RAG framework w/ caching	Open source, you host	Yes (via Pinecone)	RAG apps already on Pinecone	Apache 2.0 + Pinecone costs

GPTCache

GPTCache is the one I reach for when I need control. It is an MIT-licensed Python library from Zilliz that plugs into LangChain and LlamaIndex and lets you swap every component: embedder, vector store (FAISS, Milvus), scalar storage for the cached answers (SQLite), eviction policy, similarity evaluator. On cache hits it consistently delivers 2–10x speedups in my measurements, and because it is a library there is no per-request vendor fee.

The cost is operational: it is a library, not a service. There is no HTTP proxy, no failover, no dashboard. You run the vector store, you handle scaling, you own the on-call pager. For my self-hosted aggregator scripts running on a Hostinger VPS, that tradeoff is fine — the traffic is predictable and I already manage the box. For a customer-facing product where I do not want to babysit infrastructure, it is more work than I want.

Redis LangCache

If your stack already has Redis — and most of mine does — LangCache is the path of least resistance. It is a managed semantic cache layered on Redis's vector capabilities. Redis's own benchmarks report up to 73% cost reduction on high-repetition workloads, and that number lines up with what I see on FAQ-heavy traffic. You store query embeddings and responses in memory and pull cached answers for semantically similar queries.

What I like is that it is operationally boring in the best way: it sits next to data I am already caching, the latency is excellent because it is in-memory, and the dev tier is free. The catch is that the best experience assumes Redis Enterprise or Redis Cloud; bolting it onto a bare open-source Redis is more limited.

Upstash Semantic Cache

Upstash is what I recommend to anyone running serverless or edge functions who does not want to think about a vector database at all. It is fully serverless, generates embeddings internally (so you do not wire up or pay a separate embedding API), and bills per request with a free tier that covers small apps outright. For a Vercel-hosted Next.js front end calling an LLM, this is the lowest-friction option in the list — you get a cache with three lines of setup and no server to keep alive.

The flip side of serverless: you have less control over the embedding model and the internals, and at very high sustained volume the per-request economics can cross the point where self-hosting GPTCache is cheaper. It is the "zero ops, pay for convenience" choice.

Canopy

Canopy is Pinecone's open-source (Apache 2.0) RAG framework, and its caching is most compelling if you are already doing retrieval-augmented generation on Pinecone. Rather than treating the cache as a separate concern, it folds caching into the RAG pipeline. If you are not on Pinecone, standing up Canopy purely for semantic caching is overkill — pick one of the other three. I include it because plenty of teams already on Pinecone overlook that they have a caching layer sitting right there.

Picking the embedding model matters more than people admit

The cache decides "is this the same question?" using whatever embedding model you give it, so that model quietly controls your hit rate and your false-positive rate. I have run the same FAQ traffic through three different embedders and watched the effective hit rate swing by more than 15 percentage points at an identical similarity threshold, simply because the models distribute sentences differently in embedding space.

A few things I have settled on:

Small, fast embedding models are usually the right call for caching. You are matching short queries, not indexing documents. A 384- or 768-dimension model is plenty and keeps both latency and the per-embed cost down. Reaching for a giant 3,072-dimension model here is wasted money.
Keep one embedder for the life of a cache. If you switch embedding models, every stored vector is now in a different space and your similarity scores are meaningless. Switching means flushing the cache and re-warming. Decide once.
Match the embedder to your language mix. Several of my products serve a bilingual English/Indonesian audience. A model that only separates English queries well will mis-cluster the Indonesian ones, so I test the embedder on real mixed-language samples before trusting the hit-rate numbers. This is exactly the kind of detail a generic benchmark will never surface for you.
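Testing an embedder on real samples can be as simple as scoring labeled query pairs and checking what a given threshold would have done. The sketch below assumes you have already computed a similarity for each pair and marked by hand whether the two queries share an intent; `evaluate_threshold` and the sample scores are made up for illustration.

```python
def evaluate_threshold(scored_pairs, threshold):
    """Return (hit_rate, false_positive_rate) for one threshold.

    scored_pairs: list of (similarity, same_intent) tuples, where same_intent
    is a human label saying whether the two queries deserve the same answer.
    false_positive_rate is the share of hits that were wrong.
    """
    hits = [same for sim, same in scored_pairs if sim >= threshold]
    total = len(scored_pairs)
    hit_rate = len(hits) / total if total else 0.0
    false_positive_rate = hits.count(False) / len(hits) if hits else 0.0
    return hit_rate, false_positive_rate

# Illustrative labeled sample; replace with scores from your own embedder.
pairs = [
    (0.97, True), (0.93, True), (0.90, False),
    (0.88, True), (0.84, False), (0.70, False),
]
loose = evaluate_threshold(pairs, 0.86)   # more hits, one of them wrong
strict = evaluate_threshold(pairs, 0.92)  # fewer hits, none wrong
```

Running the same labeled sample through each candidate embedder makes the hit-rate swing visible before any traffic depends on it.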

What the integration actually looks like

People assume this is a big project. For the serverless path it is genuinely a few lines. The shape of an Upstash-style setup is roughly:

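# url and token are left elided here; supply your own credentials
cache = SemanticCache(url=..., token=..., min_proximity=0.92)

def answer_query(user_query):
    # Look for a semantically similar query that was already answered
    answer = cache.get(user_query)
    if answer is None:
        answer = call_llm(user_query)   # only on a miss
        cache.set(user_query, answer)   # store it for the next similar query
    return answer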

That is the entire pattern. The min_proximity value is the threshold I keep harping on; bump it up for high-stakes namespaces. With GPTCache the skeleton is similar but you also declare the embedder, the vector store, and the eviction policy explicitly — more code, more control. The point is that the wiring is not where the difficulty lives. The difficulty lives entirely in threshold tuning, TTL discipline, and namespacing — the human-judgment parts, not the plumbing.

What semantic caching is actually worth: the numbers

Generic posts say "save up to 80%." That figure is real but useless without knowing where your workload sits. Here is what I have measured and what the published benchmarks show, so you can estimate your own ceiling:

Hit rate is everything, and it tracks repetition. On a public FAQ / support bot, where users genuinely ask the same dozen things, I see 55–70% hit rates after the cache warms up. On a tool that answers open-ended, long-tail questions, the hit rate drops under 20%, and the cache barely pays for its own embedding cost.
Savings roughly equal your hit rate, minus the embedding overhead. At a 60% hit rate on an internal knowledge base, you cut about 60% of the model spend, because 60% of calls never reach the model (assuming cached and uncached queries cost about the same to generate). Redis reports up to 73% on high-repetition workloads; PremAI documents around 60% bill reduction without quality loss. Both match my range.
Latency wins are bigger than the cost wins, perceptually. A cache hit on my Redis setup returns in well under 20ms versus 800ms–3s for a fresh generation. Users notice speed before they notice your invoice.
Stacking compounds. Layering application-level semantic caching on top of provider prompt caching lands in the 60–80% total cost reduction range in optimized deployments. That is the real target if you are serious about cost.
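The arithmetic behind these bullets fits in one function. This is a back-of-the-envelope sketch: the per-call prices and the 25% prompt-cache discount are placeholder numbers, not measured or published rates, and it assumes hits and misses cost the same on average to generate.

```python
def net_savings_fraction(hit_rate, llm_cost, embed_cost, miss_discount=0.0):
    """Fraction of the uncached bill removed by a semantic cache.

    hit_rate:      share of requests answered from the cache (0..1)
    llm_cost:      average cost of one uncached model call
    embed_cost:    cost of embedding one query (paid on every request)
    miss_discount: extra discount that provider prompt caching gives on misses
    """
    uncached = llm_cost
    cached = embed_cost + (1.0 - hit_rate) * llm_cost * (1.0 - miss_discount)
    return (uncached - cached) / uncached

def break_even_hit_rate(llm_cost, embed_cost):
    """Hit rate below which the embedding overhead outweighs the savings."""
    return embed_cost / llm_cost

# Placeholder prices: $0.01 per generation, $0.0001 per embedding.
semantic_only = net_savings_fraction(0.60, 0.01, 0.0001)
stacked = net_savings_fraction(0.60, 0.01, 0.0001, miss_discount=0.25)
floor = break_even_hit_rate(0.01, 0.0001)
```

With these placeholder prices a 60% hit rate saves about 59% on its own and about 69% once a prompt-cache discount applies to the misses. The break-even hit rate rises quickly if the embedder is expensive relative to the generation model.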

The honest counter-case: if your traffic is genuinely unique per request, a semantic cache adds embedding cost and complexity for almost no return. Measure your query-repetition rate before you build anything. I now run a one-week embedding-similarity sample on new products before deciding whether a cache is even worth shipping.
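A repetition sample like the one described above can be approximated with a greedy pass over the embedded queries. This is a sketch under stated assumptions: the queries are already embedded and in arrival order, the two-dimensional vectors are toy data, and `repetition_rate` is a hypothetical helper that mimics a cache with no TTL.

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def repetition_rate(vectors, threshold):
    """Share of queries that are close to some earlier, distinct query.

    Walks the sample in order: a query within `threshold` of a stored one
    counts as a repeat (a would-be hit); otherwise it is stored, as a cache
    would store a new entry on a miss.
    """
    stored = []
    repeats = 0
    for vec in vectors:
        if any(_cosine(vec, old) >= threshold for old in stored):
            repeats += 1
        else:
            stored.append(vec)
    return repeats / len(vectors) if vectors else 0.0

# Toy sample: two near-duplicates of earlier queries out of five.
sample = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [0.7, 0.7]]
rate = repetition_rate(sample, threshold=0.92)
```

The result is an upper bound on the hit rate at that threshold, since a real cache also loses entries to TTL and eviction.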

A decision matrix I'd actually use

Serverless / edge, want zero ops: Upstash Semantic Cache.
Already run Redis: Redis LangCache — do not add a new dependency.
Already run Pinecone RAG: Canopy — you may already have it.
Want full control, fine with ops, cost-sensitive at scale: GPTCache.
Anything touching prices, policy, or facts: whichever tool you pick, run a tighter similarity threshold (0.92+) and a short TTL on that namespace.

Mistakes I made so you don't have to

One global threshold. Already covered, but it bears repeating: split your cache by risk and set thresholds per namespace. The false-positive hit is silent and erodes trust.
No TTL on dynamic content. A cached answer about "current pricing" served three weeks after a price change is worse than a slow correct answer. Tie TTL to how often the underlying fact changes.
Caching personalized responses. If the answer depends on the logged-in user, a naive cache leaks one user's data to another. Namespace by user or exclude personalized routes entirely.
Not logging cache misses. Your miss log is a free list of questions to pre-warm and FAQ topics to write. I feed mine straight back into content planning.
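Two of the mistakes above, leaking personalized answers and throwing away the miss log, have small mechanical fixes. The sketch below is illustrative only: `cache_namespace`, `record_miss`, and the namespace string format are hypothetical, not an API of any tool discussed here.

```python
from collections import Counter

def cache_namespace(route, user_id=None, personalized=False):
    """Pick the namespace for a request, or None if it must bypass the cache."""
    if personalized and user_id is None:
        return None  # cannot scope the entry to a user, so do not cache it
    if personalized:
        return f"{route}:user:{user_id}"  # entries visible to this user only
    return f"{route}:shared"

# Miss log: normalized query text -> number of times it missed.
miss_log = Counter()

def record_miss(query):
    miss_log[query.strip().lower()] += 1

record_miss("How do I export invoices?")
record_miss("how do i export invoices?  ")
record_miss("Can I change my billing date?")
top_miss = miss_log.most_common(1)[0]
```

Per-user namespaces keep data isolated but rarely produce hits, so excluding personalized routes entirely is often the simpler option.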

Frequently asked questions

Does semantic caching hurt answer quality?

Only if your threshold is too loose. At a sane threshold with risk-based namespaces, served answers are answers the model already produced for a near-identical question. PremAI's testing and my own show no measurable quality loss when the threshold is tuned. Quality drops come from sloppy thresholds, not from caching itself.

How is this different from just storing exact-match responses?

Exact-match (key-value) caching only hits when the query string is identical. Real users phrase the same question dozens of ways, so exact-match hit rates are tiny. Semantic caching matches on meaning, which is why it reaches 50%+ hit rates on repetitive traffic where exact-match would catch almost nothing.

Can I use semantic caching with any model provider?

Yes. The cache sits in front of your model call, so it is provider-agnostic — OpenAI, Anthropic, Google, or a self-hosted model behind vLLM all work the same way. The embedding model used for similarity is separate from your generation model.

What hit rate do I need for it to be worth it?

As a rough floor, I want to see at least 25–30% repetition in a sample of real queries before building a cache. Below that, the embedding overhead eats the savings. Above 50%, it is one of the highest-return changes you can ship.

Doesn't the embedding step add cost on every request?

It does, and that floor cost is the thing people forget. Every query gets embedded whether it hits or misses, so a cache that embeds against a pricey large model can erase its own savings on low-repetition traffic. This is why I default to a small embedding model and measure repetition first. On the hosted options (Upstash, Redis LangCache) the embedding is handled internally and folded into the per-request price, which is simpler but worth watching at high volume. On GPTCache you pick the embedder yourself, so you control that floor directly.

Should I cache streaming responses?

You can, but store the fully assembled answer, not the token stream, and replay it as a stream on a hit if your UI expects streaming. The user gets the familiar typing effect instantly while you pay nothing for generation. Just make sure your cache layer sits at the application boundary, not buried inside the streaming transport.
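The store-whole, replay-as-stream pattern can be sketched with a pair of helpers. The function names and the chunk size are assumptions for illustration; a real UI would likely add a small delay between chunks.

```python
def collect_stream(chunks):
    """Assemble a streamed response into the full answer before caching it."""
    return "".join(chunks)

def replay_as_stream(cached_answer, chunk_size=8):
    """Yield a stored answer in small pieces so the UI renders it like a live stream."""
    for start in range(0, len(cached_answer), chunk_size):
        yield cached_answer[start:start + chunk_size]

# On a miss: gather the model's stream, then cache the assembled text.
full_answer = collect_stream(["Semantic ", "caching ", "saves ", "calls."])

# On a hit: replay the cached text in chunks.
replayed = list(replay_as_stream(full_answer, chunk_size=10))
```

Because only the assembled string is stored, the cache layer stays independent of whichever streaming transport the application uses.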

Is GPTCache still maintained in 2026?

Yes, it remains the most flexible open-source option and integrates with current LangChain and LlamaIndex versions. For library-style control with no vendor fee, it is still my default for self-hosted pipelines.

Where I'd start

If you are unsure, the fastest way to learn is to sample one week of real queries, run them through an embedding model, and measure how many cluster together. That number tells you your ceiling before you write a line of cache code. If repetition is high, pick the tool that matches your existing stack — Upstash for serverless, Redis LangCache if Redis is already there, GPTCache if you want control. Set risk-based thresholds, add TTLs, log your misses, and stack it on top of provider prompt caching. That combination has taken real money off my LLM bills every month, and it is the first optimization I now reach for on any new AI product.
