v0.2.0 — now with synthesis

Your users ask the same questions. You pay every time.

You built an AI app. It works. But your OpenAI bill keeps climbing because 70% of queries are questions you've already answered. What if you could stop that in three lines of code?

$ pip install cacheback-ai

Apache 2.0 · Python 3.10+ · PyPI

from cacheback.openai import CachedOpenAI

client = CachedOpenAI()  # swap one word. that's it.

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is Python?"}]
)

# Monday:  "What is Python?" → OpenAI API ($0.01, 900ms)
# Tuesday: "Explain Python"  → cache hit  ($0.00, 4ms)
Query → Embed → Match → Respond

70% less API spend · <5ms cache response · 0.942 quality score (CQS) · 167 tests passing

A support bot gets 10,000 questions a day.
6,000 of them have been asked before.

Not word-for-word — nobody asks the same exact string. But “How do I reset my password?” and “I forgot my password, help” are the same question. You pay for both. Every day. Every user.
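"Same meaning" is measured with cosine similarity between embedding vectors. A minimal sketch of the math, using tiny made-up 4-dimensional vectors (real embeddings, e.g. MiniLM's, are 384-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — illustrative numbers only:
reset_pw  = [0.82, 0.51, 0.10, 0.04]   # "How do I reset my password?"
forgot_pw = [0.79, 0.55, 0.12, 0.07]   # "I forgot my password, help"
weather   = [0.05, 0.11, 0.90, 0.41]   # "What's the weather today?"

print(cosine(reset_pw, forgot_pw))  # close to 1.0 → same intent, cache hit
print(cosine(reset_pw, weather))    # much lower → different question
```

Different phrasings of the same question land near each other in embedding space, which is what lets the cache match them where string comparison fails.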

What you're doing now

10K queries/day: $300/day
Wait time per query: 800–2000ms
Rate limit pressure: High
Monthly bill: $9,000

After adding cacheback

Same 10K queries: $90/day
Repeated query response: <5ms
Rate limit pressure: Low
Monthly bill: $2,700
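The arithmetic behind these figures is just the hit rate applied to the bill. A back-of-envelope sketch with the illustrative numbers from the table:

```python
# Illustrative cost model — the numbers mirror the example table, not a guarantee.
queries_per_day = 10_000
cost_per_query  = 0.03          # $300/day ÷ 10K queries
hit_rate        = 0.70          # fraction of queries served from cache at ~$0

daily_before = queries_per_day * cost_per_query
daily_after  = queries_per_day * (1 - hit_rate) * cost_per_query

print(f"${daily_before:.0f}/day → ${daily_after:.0f}/day")          # $300/day → $90/day
print(f"${daily_before * 30:.0f}/mo → ${daily_after * 30:.0f}/mo")  # $9000/mo → $2700/mo
```

Your actual hit rate depends on how repetitive your traffic is; support and FAQ workloads tend to sit at the high end.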

It doesn't just remember answers. It creates new ones.

Most caches are dumb: exact match or nothing. cacheback understands meaning. And when it finds similar queries in its memory, it synthesizes a fresh, contextual response.

VERBATIM HIT

Identical question?

“What is Python?” asked twice. Same meaning, instant return. <5ms. $0. Done.

CEAG SYNTHESIS

Similar question?

“Explain Python for beginners” — not identical, but close. CEAG synthesizes a fresh answer from cached knowledge. Fast. Fraction of cost.

UPSTREAM CALL

Completely new?

Never seen before. Calls the real API, caches the response. Next time someone asks something similar — it's ready. The cache gets smarter over time.
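The three tiers above amount to a similarity-threshold dispatch. A minimal sketch of that decision — the threshold values and function name here are hypothetical, not cacheback's actual configuration:

```python
# Hypothetical thresholds for illustration only.
VERBATIM_THRESHOLD = 0.95   # near-identical meaning: return cached text as-is
SYNTH_THRESHOLD    = 0.80   # close enough to synthesize from cached knowledge

def resolve(similarity: float) -> str:
    """Map the best cached-neighbor similarity to one of the three tiers."""
    if similarity >= VERBATIM_THRESHOLD:
        return "verbatim-hit"    # <5ms, $0
    if similarity >= SYNTH_THRESHOLD:
        return "ceag-synthesis"  # fresh answer built from cached neighbors
    return "upstream-call"       # call the real API, then cache the result

print(resolve(0.99))  # verbatim-hit
print(resolve(0.87))  # ceag-synthesis
print(resolve(0.42))  # upstream-call
```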

Query → Embedding [0.3, 0.7, …] → cosine similarity 0.94 → Response

Production-ready. Runs on your machine. Nothing to manage.

SQLite for storage, ONNX for embeddings. No Redis, no cloud, no API keys for the cache itself. If anything goes wrong — your app keeps running.

Understands meaning

“How do I cancel?” and “Where's the cancel button?” match. Vector embeddings, not string comparison.

MiniLM-L6-v2 · ONNX

Synthesizes, not parrots

CEAG creates fresh responses from cached knowledge. Unique, contextual answers — not copy-pasted text. Quality: 0.942.

Cached Ensemble Augmented Generation

One word to integrate

Change OpenAI() to CachedOpenAI(). Same API, same types, same streaming. Anthropic wrapper too.

sync + async

Streaming just works

Cache hits stream back chunk-by-chunk, exactly like the original API. Your frontend doesn't know the difference.

buffer & replay

Can't break your app

Disk full? Corrupt database? ONNX model missing? Cache fails silently, app calls the API directly. 14 failure scenarios tested.

graceful degradation
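The fail-open pattern is simple to state: any cache-side exception falls through to the real API. A minimal sketch, with hypothetical function names (not cacheback's API):

```python
# Illustrative fail-open wrapper — the callables are placeholders.
def cached_completion(prompt, call_api, cache_lookup, cache_store):
    try:
        hit = cache_lookup(prompt)        # may raise: disk full, corrupt DB, missing model...
        if hit is not None:
            return hit
    except Exception:
        pass                              # cache failure must never fail the app
    resp = call_api(prompt)               # fallback: the real API
    try:
        cache_store(prompt, resp)
    except Exception:
        pass                              # storing is best-effort too
    return resp

def broken_lookup(prompt):                # simulate a corrupt cache database
    raise OSError("database disk image is malformed")

out = cached_completion(
    "hi",
    call_api=lambda p: "live answer",
    cache_lookup=broken_lookup,
    cache_store=lambda p, r: None,
)
print(out)  # live answer — the app kept running despite the broken cache
```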

Zero-code proxy

Don't want to change code? Run cacheback-proxy, point your base URL to it. Works with any language.

OpenAI-compatible API

Other AI caches exist. Here's why developers switch.

GPTCache, LiteLLM, Portkey — they're fine tools. But none of them give you zero-config local embeddings, CEAG synthesis, and a drop-in wrapper in a single pip install.

|                  | cacheback                | GPTCache                | LiteLLM                 | Portkey                 |
|------------------|--------------------------|-------------------------|-------------------------|-------------------------|
| Setup            | pip install, done        | pip + Milvus or Redis   | pip + config + proxy    | SaaS signup + API key   |
| Embeddings       | Local ONNX (90MB, offline) | External API required | N/A (gateway, not cache) | Cloud only             |
| Integration      | CachedOpenAI() drop-in   | Own API, new patterns   | Proxy (change base_url) | Proxy (change base_url) |
| CEAG synthesis   | Fresh responses from cache | Verbatim return only  | No semantic cache       | Verbatim return only    |
| Works offline    | Fully local, no cloud    | Needs embedding API     | Needs upstream API      | Cloud service required  |
| Multimodal       | Text + image + voice     | Text + image            | N/A                     | Text only               |
| Failure handling | 14 scenarios, graceful   | Basic                   | Good fallback           | Good fallback           |
| Cost             | $0 (Apache 2.0)          | $0 (MIT)                | Free + paid tiers       | SaaS pricing            |
| Dependencies     | SQLite + hnswlib only    | Milvus / Redis / Qdrant | Varies by config        | Cloud infrastructure    |

Comparison based on public documentation as of March 2026. All listed tools are good at what they do — we just solve a different problem: zero-infra semantic cache with synthesis.

cacheback is not for everything. Here's exactly where it works.

Simple cache works where questions repeat. CEAG synthesis goes further — it uses conversation context to create fresh responses even for personalized queries. We'd rather be honest upfront than after you install.

Where cacheback saves you money

Customer support bots
70% of tickets are variants of 20 topics
FAQ & knowledge bases
Same questions asked by thousands of users
Translation pipelines
Same phrases and sentences recur constantly
Classification APIs
Deterministic — same input, same label
Code Q&A (generic)
“How to do X in Python” — Stack Overflow model
Personalized chatbots CEAG
CEAG includes conversation context when synthesizing. Fresh responses adapted to each user — not verbatim cache returns
Content & copywriting CEAG
Blog posts, product descriptions, marketing copy. CEAG synthesizes from similar cached content, adapted to your brief
Voice assistants COMING SOON
Whisper transcription → semantic match
Image understanding COMING SOON
CLIP embeddings — similar photos, visual Q&A
Spatial & 3D queries RESEARCH
CLIP+3D — “show me furniture like this”

Where it won't help

Real-time data queries
Stock prices, weather, live scores change every second. Even CEAG can't refresh stale facts — it synthesizes text, not data. Use TTL-based caching for time-sensitive scenarios.
Image & video generation
DALL-E, Midjourney, Sora produce visual output. Can't synthesize new media from cached pieces — different modality. Tip: prompt caching for refinement does work.
Unique document analysis
“Analyze MY contract”, “Review MY code.” Each input is unique per user with no recurring patterns across your userbase.
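For the time-sensitive cases above, the right tool is expiry, not similarity. A minimal TTL cache sketch (illustrative only — not part of cacheback):

```python
import time

class TTLCache:
    """Entries expire after `ttl` seconds instead of matching semantically."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store: dict[str, tuple[float, str]] = {}

    def put(self, key: str, value: str):
        self._store[key] = (time.monotonic(), value)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:  # stale: evict and miss
            del self._store[key]
            return None
        return value

cache = TTLCache(ttl=1.0)
cache.put("AAPL", "$198.11")
print(cache.get("AAPL"))   # fresh: returns the quote
time.sleep(1.1)
print(cache.get("AAPL"))   # expired: None — go fetch live data
```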

You were expecting more steps. There aren't any.

Pick your SDK. Change one import. Deploy. Your API spend can drop by up to 70%.

from cacheback.openai import CachedOpenAI

client = CachedOpenAI()  # Exact same API as openai.OpenAI

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain caching"}]
)
print(resp.choices[0].message.content)
print(resp.cacheback_hit)  # True if from cache
from cacheback.anthropic import CachedAnthropic

client = CachedAnthropic()

msg = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain caching"}]
)
print(msg.content[0].text)
print(msg.cacheback_hit)  # True if from cache
# Terminal: start the proxy
$ pip install cacheback-ai[proxy]
$ cacheback-proxy   # runs on :8990

# Your code: just change base_url
import openai

client = openai.OpenAI(
    base_url="http://localhost:8990/v1"
)

# Works with any language: curl, Node.js, Go...
# Zero code changes needed.

There is no catch.

The full SDK is free, open source, Apache 2.0. Use it in production, fork it, sell products built on it. We make money when you want us on speed dial.

Open Source
$0

Everything. Forever. No trial, no limit.

  • Semantic cache (SQLite + hnswlib)
  • OpenAI + Anthropic wrappers
  • Streaming, proxy mode, CEAG
  • Text, image, voice, audio embedders
  • Use commercially. No strings.
pip install
Pro
$99/mo

Compliance, isolation, architecture help.

  • Everything in Open Source
  • PII filter (coming soon)
  • Namespace isolation
  • Slack support channel
  • Architecture review call
Contact Us

Your next API call could be free

Two lines of code. Savings start on the first duplicate.