
How cacheback works

Not a simple cache — intelligent synthesis with conversation context. Learns from every query. 10x cheaper AI.

You pay for the same question many times

Without cache

User A: "What is photosynthesis?" $0.03 · 3s
User B: "How does photosynthesis work?" $0.03 · 3s
User C: "Explain photosynthesis" $0.03 · 3s
Total $0.09 · 9s

With semantic cache

User A: "What is photosynthesis?" $0.03 · 3s
User B: "How does photosynthesis work?" $0.00 · 3ms
User C: "Explain photosynthesis" $0.00 · 3ms
Total $0.03 · 3s

Same question, three wordings. Exact-match cache misses them. Semantic cache catches all three.

Turn text into numbers (embedding)

A small model (90 MB, runs locally) converts each sentence into a list of 384 numbers. Sentences with similar meanings produce similar lists; sentences with different meanings produce different ones.

"What is photosynthesis?"
0.12 -0.45 0.78 0.33 -0.21 0.56 ... ×384
"How does photosynthesis work?"
0.11 -0.44 0.79 0.31 -0.19 0.54 ... ×384
Similarity:
0.94 CACHE HIT
"What is the capital of France?"
0.91 0.22 -0.56 0.08 0.67 -0.33 ... ×384
Similarity:
0.23 CACHE MISS
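The similarity score is typically cosine similarity between the two vectors. A minimal sketch using only Python's standard library and the truncated 6-number prefixes shown above (real embeddings use all 384 dimensions, so the exact scores differ):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Truncated example vectors from above
photo_what = [0.12, -0.45, 0.78, 0.33, -0.21, 0.56]
photo_how  = [0.11, -0.44, 0.79, 0.31, -0.19, 0.54]
france     = [0.91, 0.22, -0.56, 0.08, 0.67, -0.33]

print(cosine_similarity(photo_what, photo_how))  # close to 1.0: similar meaning
print(cosine_similarity(photo_what, france))     # much lower: different meaning
```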

The full flow in 5 steps

1. Query arrives: "How does photosynthesis work?"
2. Embedding (convert to numbers): the MiniLM model converts the text into 384 numbers. 2 ms · $0.00 · runs locally on your device.
3. Search the index for similar queries: compare those 384 numbers with previously stored queries. 1 ms · algorithm: hnswlib (nearest-neighbor search).
4. Threshold check: is the similarity above 0.88? The threshold decides whether the query is similar enough.
5. Hit or miss:
   YES → CACHE HIT: return the stored response. 3 ms · $0.00.
   NO → CACHE MISS: send to the AI and store the response. 3 sec · $0.03.
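The flow above can be sketched in a few lines. This is a sketch, not the actual cacheback internals: it uses a brute-force list scan in place of the hnswlib index, and `embed` and `call_llm` are hypothetical stand-ins you would supply:

```python
import math

SIMILARITY_THRESHOLD = 0.88   # the threshold from step 4
cache = []                    # (embedding, response) pairs; hnswlib would index these

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer(query, embed, call_llm):
    """Embed the query, search the cache, return a hit or fill a miss."""
    vec = embed(query)                                    # step 2: local model, ~2 ms
    if cache:                                             # step 3: nearest-neighbor search
        best_vec, best_response = max(cache, key=lambda e: cosine(e[0], vec))
        if cosine(best_vec, vec) > SIMILARITY_THRESHOLD:  # step 4: threshold check
            return best_response                          # CACHE HIT: ~3 ms, $0.00
    response = call_llm(query)                            # CACHE MISS: ~3 s, $0.03
    cache.append((vec, response))                         # store for future queries
    return response
```

The second, differently-worded photosynthesis query never reaches the LLM: its embedding lands close enough to the first one to clear the threshold.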

We don't return old answers — we synthesize new ones

Simple cache returns exactly what was stored. CEAG takes the 5 nearest cached responses, adds conversation context, and a small model synthesizes a fresh, personalized response. Then ensemble gates verify quality.

Query ("How does photosynthesis work?") → Top-5 from cache (hnswlib · 1 ms) → Synthesis (Phi-4-mini · 300 ms) → Ensemble (debate / MoA / RPI) → Response (fresh · contextual)

With conversation context, the same 5 cached answers produce a different, personalized response for each user.
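The pipeline can be sketched as a single function. `top_k_search`, `synthesize`, and `quality_gate` are hypothetical stand-ins for the hnswlib lookup, the Phi-4-mini synthesis call, and the debate/MoA/RPI ensemble; what CEAG does on a gate failure isn't described here, so this sketch just signals a fallback:

```python
def ceag_respond(query, context, top_k_search, synthesize, quality_gate):
    """CEAG sketch: retrieve nearest cached answers, synthesize, then gate."""
    neighbors = top_k_search(query, k=5)           # top-5 cached responses, ~1 ms
    draft = synthesize(query, neighbors, context)  # fresh, personalized draft, ~300 ms
    if quality_gate(draft, neighbors):             # ensemble verifies the draft
        return draft                               # fresh, contextual response
    return None                                    # gate failed: fall back (e.g. full LLM)
```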

                  Simple Cache        CEAG                   Full LLM
                  (verbatim return)   (synthesis + context)  (full generation)
Cost              $0.00               $0.002                 $0.03
Latency           3 ms                300 ms                 3 sec
Quality           static              ~85% of GPT-4          100%
Personalization   NO                  YES                    YES

CEAG is 15x cheaper than a full LLM call and 100x slower than a simple cache hit, but its responses are fresh and personalized. The ensemble (debate/MoA/RPI) filters out hallucinations and synthesis errors.

Not just text — every modality

The same mechanism works for images, voice, and physical space. Only the encoder changes — the rest of the infrastructure stays identical.

💬

Text → Text

Cache LLM responses for repeating questions
query → MiniLM → 384 numbers → search → hit/miss
READY
🎨

Text → Image

Cache generated images (DALL-E, Midjourney)
prompt → MiniLM → 384 numbers → search → cached image
1 DAY OF WORK
🎤

Voice → Text

Cache responses for repeating voice commands
audio → Whisper → text → MiniLM → search → hit/miss
2-3 DAYS OF WORK
🤖

Image → Action

Cache spatial recognition for robots/drones
camera → CLIP → 512 numbers → search → cached action
RESEARCH
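The "only the encoder changes" claim can be sketched as one cache class parameterized by an encoder callable. The class and encoder names below are illustrative, not the actual cacheback API:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    """One cache core for every modality: only the encoder differs."""

    def __init__(self, encoder, threshold=0.88):
        self.encoder = encoder      # MiniLM, Whisper+MiniLM, CLIP, ...
        self.threshold = threshold
        self.entries = []           # (vector, payload) pairs

    def lookup(self, raw_input):
        vec = self.encoder(raw_input)
        for stored_vec, payload in self.entries:
            if cosine(stored_vec, vec) > self.threshold:
                return payload      # the payload can be text, an image, or an action
        return None                 # miss: generate, then store()

    def store(self, raw_input, payload):
        self.entries.append((self.encoder(raw_input), payload))

# Hypothetical usage: the same class backs every modality.
# text_cache  = SemanticCache(minilm_encode)   # text -> text
# image_cache = SemanticCache(minilm_encode)   # prompt -> cached image
# robot_cache = SemanticCache(clip_encode)     # camera frame -> cached action
```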

How much do you save?

An interactive calculator takes three inputs: requests per day, cost per request ($), and cache hit rate (%). With its default settings:

Cost without cache (daily): $300.00
Cost with cache (daily): $120.00
Monthly savings with current settings: $5,400
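The arithmetic behind these figures is simple: only cache misses pay the full per-request price. The 10,000 requests/day value below is an assumption chosen to match the $300/day figure; at that volume, $0.03 per request and a 60% hit rate reproduce the numbers above:

```python
def daily_cost(requests_per_day, cost_per_request, hit_rate=0.0):
    """Only cache misses pay the full LLM price; hits are treated as free."""
    return requests_per_day * cost_per_request * (1.0 - hit_rate)

def monthly_savings(requests_per_day, cost_per_request, hit_rate, days=30):
    without = daily_cost(requests_per_day, cost_per_request)
    with_cache = daily_cost(requests_per_day, cost_per_request, hit_rate)
    return (without - with_cache) * days

# Values matching the example above (request count is an assumption
# consistent with the $300/day figure):
print(daily_cost(10_000, 0.03))             # 300.0
print(daily_cost(10_000, 0.03, 0.60))       # approximately 120.0
print(monthly_savings(10_000, 0.03, 0.60))  # approximately 5400.0
```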

Cache learns from every query

Cache starts empty. Every response from GPT-4/Claude is stored automatically. The more queries, the higher the hit rate — cache becomes smarter over time.

Day 1–7 · Cold · 0–15% hit rate: the cache is empty; every query goes to GPT-4 and the responses get stored.
Week 2–4 · Warm · 15–35%: popular questions hit the cache; CEAG synthesizes variants.
Month 2–3 · Hot · 35–60%: the cache covers most topics; costs drop by half.
Month 4+ · Mature · 60–75%+: the cache is an expert in your domain. Lock-in.
The longer you use it, the less you pay. Cache learns from GPT-4/Claude responses — building your product's knowledge base for free.

Two lines of code

# Before (you pay for every query):
from openai import OpenAI
client = OpenAI()

# After (repeats are free):
from cacheback import CachedOpenAI
client = CachedOpenAI()

# The rest of your code doesn't change.
# Cache runs locally. Zero data leaves your machine.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is photosynthesis?"}]
)

Ready to cut your AI costs?

One pip install. Two lines of code. 70% savings.

pip install cacheback-ai