
How cacheback works

Not a simple cache — intelligent synthesis with conversation context. Learns from every query. 10x cheaper AI.

You pay for the same question many times

Without cache

User A: "What is photosynthesis?" $0.03 · 3s
User B: "How does photosynthesis work?" $0.03 · 3s
User C: "Explain photosynthesis" $0.03 · 3s
Total $0.09 · 9s

With semantic cache

User A: "What is photosynthesis?" $0.03 · 3s
User B: "How does photosynthesis work?" $0.00 · 3ms
User C: "Explain photosynthesis" $0.00 · 3ms
Total $0.03 · 3s

Same question, three wordings. Exact-match cache misses them. Semantic cache catches all three.

Turn text into numbers (embedding)

A small model (90 MB, runs locally) converts each sentence into a list of 384 numbers. Sentences with similar meanings produce similar lists; sentences with different meanings produce different ones.

"What is photosynthesis?"
0.12 -0.45 0.78 0.33 -0.21 0.56 ... ×384
"How does photosynthesis work?"
0.11 -0.44 0.79 0.31 -0.19 0.54 ... ×384
Similarity:
0.94 CACHE HIT
"What is the capital of France?"
0.91 0.22 -0.56 0.08 0.67 -0.33 ... ×384
Similarity:
0.23 CACHE MISS
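The similarity score is typically cosine similarity between the two vectors. A minimal sketch using only Python's standard library and the truncated 6-number prefixes shown above (real embeddings use all 384 dimensions, so the exact scores differ):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Truncated example vectors from above
photo_what = [0.12, -0.45, 0.78, 0.33, -0.21, 0.56]
photo_how  = [0.11, -0.44, 0.79, 0.31, -0.19, 0.54]
france     = [0.91, 0.22, -0.56, 0.08, 0.67, -0.33]

print(cosine_similarity(photo_what, photo_how))  # close to 1.0: similar meaning
print(cosine_similarity(photo_what, france))     # much lower: different meaning
```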

The full flow in 5 steps

1. Query arrives: "How does photosynthesis work?"
2. Embedding (convert to numbers): the MiniLM model converts the text into 384 numbers. 2 ms · $0.00 · runs locally on your device.
3. Search the index for similar queries: compare those 384 numbers with previously stored queries. 1 ms · algorithm: hnswlib (nearest-neighbor search).
4. Threshold check: is the similarity above 0.88? The threshold decides whether the query is similar enough.
5. Hit or miss:
   YES → CACHE HIT: return the stored response. 3 ms · $0.00.
   NO → CACHE MISS: send to the AI and store the response. 3 sec · $0.03.
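The flow above can be sketched in a few lines. This is a sketch, not the actual cacheback internals: it uses a brute-force list scan in place of the hnswlib index, and `embed` and `call_llm` are hypothetical stand-ins you would supply:

```python
import math

SIMILARITY_THRESHOLD = 0.88   # the threshold from step 4
cache = []                    # (embedding, response) pairs; hnswlib would index these

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer(query, embed, call_llm):
    """Embed the query, search the cache, return a hit or fill a miss."""
    vec = embed(query)                                    # step 2: local model, ~2 ms
    if cache:                                             # step 3: nearest-neighbor search
        best_vec, best_response = max(cache, key=lambda e: cosine(e[0], vec))
        if cosine(best_vec, vec) > SIMILARITY_THRESHOLD:  # step 4: threshold check
            return best_response                          # CACHE HIT: ~3 ms, $0.00
    response = call_llm(query)                            # CACHE MISS: ~3 s, $0.03
    cache.append((vec, response))                         # store for future queries
    return response
```

The second, differently-worded photosynthesis query never reaches the LLM: its embedding lands close enough to the first one to clear the threshold.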

We don't return old answers — we synthesize new ones

Simple cache returns exactly what was stored. CEAG takes the 5 nearest cached responses, adds conversation context, and a small model synthesizes a fresh, personalized response. Then ensemble gates verify quality.

Query ("How does photosynthesis work?") → Top-5 from cache (hnswlib · 1 ms) → Synthesis (Phi-4-mini · 300 ms) → Ensemble (debate / MoA / RPI) → Response (fresh · contextual)

With conversation context, the same 5 cached answers produce a different, personalized response for each user.
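The pipeline can be sketched as a single function. `top_k_search`, `synthesize`, and `quality_gate` are hypothetical stand-ins for the hnswlib lookup, the Phi-4-mini synthesis call, and the debate/MoA/RPI ensemble; what CEAG does on a gate failure isn't described here, so this sketch just signals a fallback:

```python
def ceag_respond(query, context, top_k_search, synthesize, quality_gate):
    """CEAG sketch: retrieve nearest cached answers, synthesize, then gate."""
    neighbors = top_k_search(query, k=5)           # top-5 cached responses, ~1 ms
    draft = synthesize(query, neighbors, context)  # fresh, personalized draft, ~300 ms
    if quality_gate(draft, neighbors):             # ensemble verifies the draft
        return draft                               # fresh, contextual response
    return None                                    # gate failed: fall back (e.g. full LLM)
```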

                  Simple Cache        CEAG                   Full LLM
                  (verbatim return)   (synthesis + context)  (full generation)
Cost              $0.00               $0.002                 $0.03
Latency           3 ms                300 ms                 3 sec
Quality           static              ~85% of GPT-4          100%
Personalization   NO                  YES                    YES

CEAG is 15x cheaper than a full LLM call and 100x slower than a simple cache hit, but its responses are fresh and personalized. The ensemble (debate/MoA/RPI) filters out hallucinations and synthesis errors.

Not just text — every modality

The same mechanism works for images, voice, and physical space. Only the encoder changes — the rest of the infrastructure stays identical.

💬

Text → Text

Cache LLM responses for repeating questions
query → MiniLM → 384 numbers → search → hit/miss
READY
🎨

Text → Image

Cache generated images (DALL-E, Midjourney)
prompt → MiniLM → 384 numbers → search → cached image
1 DAY OF WORK
🎤

Voice → Text

Cache responses for repeating voice commands
audio → Whisper → text → MiniLM → search → hit/miss
2-3 DAYS OF WORK
🤖

Image → Action

Cache spatial recognition for robots/drones
camera → CLIP → 512 numbers → search → cached action
RESEARCH
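The "only the encoder changes" claim can be sketched as one cache class parameterized by an encoder callable. The class and encoder names below are illustrative, not the actual cacheback API:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    """One cache core for every modality: only the encoder differs."""

    def __init__(self, encoder, threshold=0.88):
        self.encoder = encoder      # MiniLM, Whisper+MiniLM, CLIP, ...
        self.threshold = threshold
        self.entries = []           # (vector, payload) pairs

    def lookup(self, raw_input):
        vec = self.encoder(raw_input)
        for stored_vec, payload in self.entries:
            if cosine(stored_vec, vec) > self.threshold:
                return payload      # the payload can be text, an image, or an action
        return None                 # miss: generate, then store()

    def store(self, raw_input, payload):
        self.entries.append((self.encoder(raw_input), payload))

# Hypothetical usage: the same class backs every modality.
# text_cache  = SemanticCache(minilm_encode)   # text -> text
# image_cache = SemanticCache(minilm_encode)   # prompt -> cached image
# robot_cache = SemanticCache(clip_encode)     # camera frame -> cached action
```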

How much do you save?

An interactive calculator takes three inputs: requests per day, cost per request ($), and cache hit rate (%). With its default settings:

Cost without cache (daily): $300.00
Cost with cache (daily): $120.00
Monthly savings with current settings: $5,400
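The arithmetic behind these figures is simple: only cache misses pay the full per-request price. The 10,000 requests/day value below is an assumption chosen to match the $300/day figure; at that volume, $0.03 per request and a 60% hit rate reproduce the numbers above:

```python
def daily_cost(requests_per_day, cost_per_request, hit_rate=0.0):
    """Only cache misses pay the full LLM price; hits are treated as free."""
    return requests_per_day * cost_per_request * (1.0 - hit_rate)

def monthly_savings(requests_per_day, cost_per_request, hit_rate, days=30):
    without = daily_cost(requests_per_day, cost_per_request)
    with_cache = daily_cost(requests_per_day, cost_per_request, hit_rate)
    return (without - with_cache) * days

# Values matching the example above (request count is an assumption
# consistent with the $300/day figure):
print(daily_cost(10_000, 0.03))             # 300.0
print(daily_cost(10_000, 0.03, 0.60))       # approximately 120.0
print(monthly_savings(10_000, 0.03, 0.60))  # approximately 5400.0
```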

Cache learns from every query

Cache starts empty. Every response from GPT-4/Claude is stored automatically. The more queries, the higher the hit rate — cache becomes smarter over time.

Day 1–7 · Cold · 0–15% hit rate: the cache is empty; every query goes to GPT-4 and the responses get stored.
Week 2–4 · Warm · 15–35%: popular questions hit the cache; CEAG synthesizes variants.
Month 2–3 · Hot · 35–60%: the cache covers most topics; costs drop by half.
Month 4+ · Mature · 60–75%+: the cache is an expert in your domain. Lock-in.
The longer you use it, the less you pay. Cache learns from GPT-4/Claude responses — building your product's knowledge base for free.

Two lines of code

# Before (you pay for every query):
from openai import OpenAI
client = OpenAI()

# After (repeats are free):
from cacheback import CachedOpenAI
client = CachedOpenAI()

# The rest of your code doesn't change.
# Cache runs locally. Zero data leaves your machine.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is photosynthesis?"}]
)

Ready to cut your AI costs?

One pip install. Two lines of code. 70% savings.

pip install cacheback-ai