You built an AI app. It works. But your OpenAI bill keeps climbing because 70% of queries are questions you've already answered. What if you could stop that in two lines of code?
Apache 2.0 · Python 3.10+ · PyPI
Not word-for-word — nobody asks the same exact string. But “How do I reset my password?” and “I forgot my password, help” are the same question. You pay for both. Every day. Every user.
Three tiers of intelligence — from instant verbatim hits to fresh AI-synthesized responses.
SQLite for storage, ONNX for embeddings. No Redis, no cloud, no API keys for the cache itself. If anything goes wrong — your app keeps running.
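The storage side can be pictured as a single SQLite file. A minimal sketch of that idea — exact-match only, with an illustrative table name and API, not cacheback's actual schema:

```python
import sqlite3

class LocalCache:
    """Minimal single-file SQLite cache sketch: no Redis, no server process.
    The real library layers vector search on top; this shows only the storage idea."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (prompt TEXT PRIMARY KEY, answer TEXT)"
        )

    def get(self, prompt):
        row = self.db.execute(
            "SELECT answer FROM cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        return row[0] if row else None

    def put(self, prompt, answer):
        # INSERT OR REPLACE keeps the latest answer for a repeated prompt.
        self.db.execute(
            "INSERT OR REPLACE INTO cache (prompt, answer) VALUES (?, ?)",
            (prompt, answer),
        )
        self.db.commit()
```

Because everything lives in one file (or in memory), there is nothing to provision and nothing extra to deploy.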
“How do I cancel?” and “Where's the cancel button?” match. Vector embeddings, not string comparison.
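Conceptually, the lookup embeds the query and compares it to stored entries by cosine similarity instead of string equality. A toy sketch with a stand-in bag-of-words embedder — cacheback uses real MiniLM sentence embeddings via ONNX, and the class name, vocabulary, and threshold below are all illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def embed(text):
    # Stand-in embedder: bag-of-words over a tiny vocabulary.
    # The real library produces dense MiniLM embeddings instead.
    vocab = ["cancel", "password", "reset"]
    words = text.lower().replace("?", " ").split()
    return [float(w in words) for w in vocab]

class SemanticCache:
    """Toy semantic cache: store (embedding, response), match by similarity."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def put(self, query, response):
        self.entries.append((embed(query), response))

    def get(self, query):
        qv = embed(query)
        best, best_sim = None, 0.0
        for vec, response in self.entries:
            sim = cosine(qv, vec)
            if sim > best_sim:
                best, best_sim = response, sim
        # Only return a hit above the similarity threshold; else miss.
        return best if best_sim >= self.threshold else None
```

With this sketch, "How do I cancel?" and "Where's the cancel button?" land on the same entry, while "How do I reset my password?" misses and would fall through to the API.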
MiniLM-L6-v2 · ONNX

CEAG creates fresh responses from cached knowledge. Unique, contextual answers — not copy-pasted text. Quality score: 0.942.
Cached Ensemble Augmented Generation

Change OpenAI() to CachedOpenAI(). Same API, same types, same streaming. Anthropic wrapper too.
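The drop-in pattern behind a wrapper like CachedOpenAI() can be sketched generically: wrap the real client, intercept the call you want to cache, and pass everything else through untouched. All names below are illustrative, not cacheback's actual API:

```python
class CachedClient:
    """Drop-in wrapper sketch: same surface as the wrapped client,
    but answers from the cache when possible."""

    def __init__(self, client, cache):
        self._client = client
        self._cache = cache

    def create(self, prompt):
        hit = self._cache.get(prompt)
        if hit is not None:
            return hit  # cache hit: no API call, no cost
        response = self._client.create(prompt)  # miss: call the real API
        self._cache.put(prompt, response)
        return response

    def __getattr__(self, name):
        # Everything else passes straight through to the real client,
        # so existing code keeps working unchanged.
        return getattr(self._client, name)
```

Because unrecognized attributes delegate to the inner client, the wrapper keeps the same types and methods your code already expects.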
Cache hits stream back chunk-by-chunk, exactly like the original API. Your frontend doesn't know the difference.
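Buffer-and-replay streaming can be sketched in a few lines: on a miss, tee each chunk into a buffer while yielding it in real time; on a hit, replay the buffered chunks one by one. A simplified illustration, not cacheback's internals:

```python
def stream_with_cache(cache, key, upstream):
    """On a hit, replay buffered chunks; on a miss, stream from upstream
    while recording chunks, then store the full sequence."""
    cached = cache.get(key)
    if cached is not None:
        yield from cached          # replay: the frontend sees identical chunks
        return
    buffer = []
    for chunk in upstream():       # upstream: callable returning a chunk iterator
        buffer.append(chunk)
        yield chunk                # pass each chunk through in real time
    cache[key] = buffer            # persist only after the stream completes
```

Storing the buffer only after the stream finishes means a half-completed upstream stream never poisons the cache.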
buffer & replay

Disk full? Corrupt database? ONNX model missing? Cache fails silently, app calls the API directly. 14 failure scenarios tested.
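The fail-open idea looks roughly like this: every cache operation is wrapped so that any error degrades to a plain API call. A simplified sketch with illustrative names:

```python
def ask(prompt, cache, call_api):
    """Fail-open sketch: any cache error is treated as a miss,
    so the cache can never take the app down."""
    try:
        hit = cache.get(prompt)
        if hit is not None:
            return hit
    except Exception:
        pass  # disk full, corrupt DB, missing model: all become a miss
    answer = call_api(prompt)
    try:
        cache.put(prompt, answer)
    except Exception:
        pass  # failing to store must not fail the request
    return answer
```

The request path has no dependency on the cache being healthy; the worst case is simply the cost of an uncached call.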
graceful degradation

Don't want to change code? Run cacheback-proxy and point your base URL at it. Works with any language.
GPTCache, LiteLLM, Portkey — they're fine tools. But none of them give you zero-config local embeddings, CEAG synthesis, and a drop-in wrapper in a single pip install.
| | cacheback | GPTCache | LiteLLM | Portkey |
|---|---|---|---|---|
| Setup | pip install, done | pip + Milvus or Redis | pip + config + proxy | SaaS signup + API key |
| Embeddings | Local ONNX (90MB, offline) | External API required | N/A — gateway, not cache | Cloud only |
| Integration | CachedOpenAI() drop-in | Own API, new patterns | Proxy — change base_url | Proxy — change base_url |
| CEAG synthesis | Fresh responses from cache | Verbatim return only | No semantic cache | Verbatim return only |
| Works offline | Fully local, no cloud | Needs embedding API | Needs upstream API | Cloud service required |
| Multimodal | Text + image + voice | Text + image | N/A | Text only |
| Failure handling | 14 scenarios, graceful | Basic | Good fallback | Good fallback |
| Cost | $0 — Apache 2.0 | $0 — MIT | Free + paid tiers | SaaS pricing |
| Dependencies | SQLite + hnswlib only | Milvus / Redis / Qdrant | Varies by config | Cloud infrastructure |
Comparison based on public documentation as of March 2026. All listed tools are good at what they do — we just solve a different problem: zero-infra semantic cache with synthesis.
A simple cache only works where questions repeat. CEAG synthesis goes further — it uses conversation context to create fresh responses even for personalized queries. We'd rather be honest upfront than after you install.
Pick your SDK. Change one import. Deploy. Your app is now up to 70% cheaper to run.
The full SDK is free, open source, Apache 2.0. Use it in production, fork it, sell products built on it. We make money when you want us on speed dial.
Everything. Forever. No trial, no limit.
You ship to prod. We watch your back.
Compliance, isolation, architecture help.
Two lines of code. Savings start on the first duplicate.