You built an AI app. It works. But your OpenAI bill keeps climbing because 70% of queries are questions you've already answered. What if you could stop that in two lines of code?
Apache 2.0 · Python 3.10+ · PyPI
Not word-for-word — nobody asks the same exact string. But “How do I reset my password?” and “I forgot my password, help” are the same question. You pay for both. Every day. Every user.
Three tiers of intelligence — from instant verbatim hits to fresh AI-synthesized responses.
SQLite for storage, ONNX for embeddings. No Redis, no cloud, no API keys for the cache itself. If anything goes wrong — your app keeps running.
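The storage side can be pictured as a single SQLite file. A minimal sketch of that idea — exact-match only, with an illustrative table name and API, not cacheback's actual schema:

```python
import sqlite3

class LocalCache:
    """Minimal single-file SQLite cache sketch: no Redis, no server process.
    The real library layers vector search on top; this shows only the storage idea."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (prompt TEXT PRIMARY KEY, answer TEXT)"
        )

    def get(self, prompt):
        row = self.db.execute(
            "SELECT answer FROM cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        return row[0] if row else None

    def put(self, prompt, answer):
        # INSERT OR REPLACE keeps the latest answer for a repeated prompt.
        self.db.execute(
            "INSERT OR REPLACE INTO cache (prompt, answer) VALUES (?, ?)",
            (prompt, answer),
        )
        self.db.commit()
```

Because everything lives in one file (or in memory), there is nothing to provision and nothing extra to deploy.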
“How do I cancel?” and “Where's the cancel button?” match. Vector embeddings, not string comparison.
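Conceptually, the lookup embeds the query and compares it to stored entries by cosine similarity instead of string equality. A toy sketch with a stand-in bag-of-words embedder — cacheback uses real MiniLM sentence embeddings via ONNX, and the class name, vocabulary, and threshold below are all illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def embed(text):
    # Stand-in embedder: bag-of-words over a tiny vocabulary.
    # The real library produces dense MiniLM embeddings instead.
    vocab = ["cancel", "password", "reset"]
    words = text.lower().replace("?", " ").split()
    return [float(w in words) for w in vocab]

class SemanticCache:
    """Toy semantic cache: store (embedding, response), match by similarity."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def put(self, query, response):
        self.entries.append((embed(query), response))

    def get(self, query):
        qv = embed(query)
        best, best_sim = None, 0.0
        for vec, response in self.entries:
            sim = cosine(qv, vec)
            if sim > best_sim:
                best, best_sim = response, sim
        # Only return a hit above the similarity threshold; else miss.
        return best if best_sim >= self.threshold else None
```

With this sketch, "How do I cancel?" and "Where's the cancel button?" land on the same entry, while "How do I reset my password?" misses and would fall through to the API.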
MiniLM-L6-v2 · ONNX

CEAG creates fresh responses from cached knowledge. Unique, contextual answers — not copy-pasted text. Quality score: 0.942.
Cached Ensemble Augmented Generation

Change OpenAI() to CachedOpenAI(). Same API, same types, same streaming. Anthropic wrapper too.
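The drop-in pattern behind a wrapper like CachedOpenAI() can be sketched generically: wrap the real client, intercept the call you want to cache, and pass everything else through untouched. All names below are illustrative, not cacheback's actual API:

```python
class CachedClient:
    """Drop-in wrapper sketch: same surface as the wrapped client,
    but answers from the cache when possible."""

    def __init__(self, client, cache):
        self._client = client
        self._cache = cache

    def create(self, prompt):
        hit = self._cache.get(prompt)
        if hit is not None:
            return hit  # cache hit: no API call, no cost
        response = self._client.create(prompt)  # miss: call the real API
        self._cache.put(prompt, response)
        return response

    def __getattr__(self, name):
        # Everything else passes straight through to the real client,
        # so existing code keeps working unchanged.
        return getattr(self._client, name)
```

Because unrecognized attributes delegate to the inner client, the wrapper keeps the same types and methods your code already expects.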
Cache hits stream back chunk-by-chunk, exactly like the original API. Your frontend doesn't know the difference.
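Buffer-and-replay streaming can be sketched in a few lines: on a miss, tee each chunk into a buffer while yielding it in real time; on a hit, replay the buffered chunks one by one. A simplified illustration, not cacheback's internals:

```python
def stream_with_cache(cache, key, upstream):
    """On a hit, replay buffered chunks; on a miss, stream from upstream
    while recording chunks, then store the full sequence."""
    cached = cache.get(key)
    if cached is not None:
        yield from cached          # replay: the frontend sees identical chunks
        return
    buffer = []
    for chunk in upstream():       # upstream: callable returning a chunk iterator
        buffer.append(chunk)
        yield chunk                # pass each chunk through in real time
    cache[key] = buffer            # persist only after the stream completes
```

Storing the buffer only after the stream finishes means a half-completed upstream stream never poisons the cache.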
buffer & replay

Disk full? Corrupt database? ONNX model missing? Cache fails silently, app calls the API directly. 14 failure scenarios tested.
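The fail-open idea looks roughly like this: every cache operation is wrapped so that any error degrades to a plain API call. A simplified sketch with illustrative names:

```python
def ask(prompt, cache, call_api):
    """Fail-open sketch: any cache error is treated as a miss,
    so the cache can never take the app down."""
    try:
        hit = cache.get(prompt)
        if hit is not None:
            return hit
    except Exception:
        pass  # disk full, corrupt DB, missing model: all become a miss
    answer = call_api(prompt)
    try:
        cache.put(prompt, answer)
    except Exception:
        pass  # failing to store must not fail the request
    return answer
```

The request path has no dependency on the cache being healthy; the worst case is simply the cost of an uncached call.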
graceful degradation

Don't want to change code? Run cacheback-proxy and point your base URL at it. Works with any language.
GPTCache, LiteLLM, Portkey — they're fine tools. But none of them give you zero-config local embeddings, CEAG synthesis, and a drop-in wrapper in a single pip install.
| | cacheback | GPTCache | LiteLLM | Portkey |
|---|---|---|---|---|
| Setup | pip install, done | pip + Milvus or Redis | pip + config + proxy | SaaS signup + API key |
| Embeddings | Local ONNX (90MB, offline) | External API required | N/A — gateway, not cache | Cloud only |
| Integration | CachedOpenAI() drop-in | Own API, new patterns | Proxy — change base_url | Proxy — change base_url |
| CEAG synthesis | Fresh responses from cache | Verbatim return only | No semantic cache | Verbatim return only |
| Works offline | Fully local, no cloud | Needs embedding API | Needs upstream API | Cloud service required |
| Multimodal | Text + image + voice | Text + image | N/A | Text only |
| Failure handling | 14 scenarios, graceful | Basic | Good fallback | Good fallback |
| Cost | $0 — Apache 2.0 | $0 — MIT | Free + paid tiers | SaaS pricing |
| Dependencies | SQLite + hnswlib only | Milvus / Redis / Qdrant | Varies by config | Cloud infrastructure |
Comparison based on public documentation as of March 2026. All listed tools are good at what they do — we just solve a different problem: zero-infra semantic cache with synthesis.
A simple cache only works where questions repeat. CEAG synthesis goes further — it uses conversation context to create fresh responses even for personalized queries. We'd rather be honest upfront than after you install.
Pick your SDK. Change one import. Deploy. Your app is now up to 70% cheaper to run.
The full SDK is free, open source, Apache 2.0. Use it in production, fork it, sell products built on it. We make money when you want us on speed dial.
Everything. Forever. No trial, no limit.
You ship to prod. We watch your back.
Compliance, isolation, architecture help.
Two lines of code. Savings start on the first duplicate.