HOW A 60-LINE PROXY MADE MINIMAX-M2.5 4× FASTER ON CLAUDE CODE
──────────────────────────────────────────────────────────────────
The 42.7% MiniMax-M2.5 Terminal-Bench result on our leaderboard didn’t come from llama-server alone. Sitting between Claude Code and llama-server is a tiny Python proxy — claude-cache-proxy — that fixes a single bug in how Claude Code talks to local model servers. Without it, our 5-day evaluation would have been a 20-day evaluation. With it, prefill drops from ~60 seconds per turn to under one.
Here’s what it does, and why you need it if you’re running Claude Code against llama-server.
The bug: one header invalidates the whole cache
Claude Code injects a system header on every request that looks like this:
x-anthropic-billing-header: cc_version=2.1.63.a43; cc_entrypoint=sdk-cli; cch=e2224;
Several of those fields change per request. Critically, the header lives at the start of the system prompt: not in the HTTP headers, but inside the message body that gets tokenized and fed to the model.
llama-server’s KV cache works on prefix matching: if turn N’s tokenized prompt starts with the same tokens as turn N-1, it reuses the cache for that shared prefix and only evaluates the new tokens. But because Claude Code’s billing header diverges at ~token 33 of every request, the cache miss happens almost immediately. Every single turn forces a full prefill from scratch.
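The prefix-matching behavior can be sketched in a few lines of Python. This is an illustration of the concept, not llama-server's actual implementation:

```python
def shared_prefix_len(prev_tokens, new_tokens):
    """Count how many leading tokens two prompts share.
    A prefix-matching KV cache only re-evaluates tokens past this point."""
    n = 0
    for a, b in zip(prev_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Two 18,000-token prompts that diverge at token 33 (the billing header)
# force re-evaluation of nearly the entire prompt:
prev = list(range(18_000))
new = prev[:33] + [-1] + prev[34:]   # one changed token near the start
reusable = shared_prefix_len(prev, new)
print(reusable)                      # 33 -> almost nothing is reused
```

One changed token near the front of the prompt costs you the entire cached suffix behind it, which is why a volatile header at token 33 is so expensive.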
The cost: 60 seconds of wasted work, every turn
Do the math on a typical agent turn:
- ~18,000 prompt tokens (system prompt + tool definitions + conversation history)
- ~300 tok/s prefill on Mac Studio M3 Ultra
- = ~60 seconds of prefill per turn
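Spelled out, using the same approximate numbers as above:

```python
prompt_tokens = 18_000        # system prompt + tool definitions + history
prefill_tok_per_s = 300       # approximate M3 Ultra prefill rate

# Cold cache: the whole prompt is re-evaluated every turn.
seconds_per_turn = prompt_tokens / prefill_tok_per_s
print(seconds_per_turn)       # 60.0

# Warm cache: only the new tokens since last turn are evaluated.
# With ~128 new tokens per turn, prefill drops well under a second.
new_tokens = 128
print(new_tokens / prefill_tok_per_s)
```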
Multiply that across hundreds of turns in a single Terminal-Bench task and you’ve burned tens of minutes per task on work the model already did. Multiply across 436 tasks and you’ve burned days. This is what was driving the 166 AgentTimeoutErrors in our leaderboard run — the model wasn’t slow at thinking; the server was discarding its memory between every message.
The fix: normalize the header before llama-server sees it
claude-cache-proxy is a small Python http.server that sits between Claude Code and llama-server. It intercepts each request, runs a byte-level regex over the request body, and replaces the entire billing header value with a fixed string before forwarding to the upstream server.
That’s it. ~60 lines of Python. No model awareness, no JSON parsing, no streaming changes — just a regex on the bytes.
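The core transform looks roughly like this. Note this is a simplified reconstruction: the pattern and the replacement string below are illustrative, not copied from the claude-cache-proxy repo.

```python
import re

# Pin the ever-changing billing-header fields to one constant so that
# every request body tokenizes to the same prefix.
# (Illustrative pattern; the real proxy's regex may differ.)
BILLING_RE = re.compile(rb"x-anthropic-billing-header:(?: \w+=[^;]*;)+")
FIXED = b"x-anthropic-billing-header: normalized;"

def normalize(body: bytes) -> bytes:
    """Replace the per-request billing header with a fixed string."""
    return BILLING_RE.sub(FIXED, body)

# Two requests that differ only in billing fields become byte-identical:
a = b'{"system":"x-anthropic-billing-header: cc_version=2.1.63.a43; cch=e2224;\\nYou are..."}'
b_ = b'{"system":"x-anthropic-billing-header: cc_version=2.1.64.b01; cch=f9abc;\\nYou are..."}'
assert normalize(a) == normalize(b_)
```

Working on raw bytes means the proxy never has to parse JSON or understand the Messages schema; it just guarantees the upstream server sees a stable prefix.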
After the swap, every request’s tokenized prefix matches the previous request’s prefix up through the conversation history. llama-server’s prefix matcher latches on, and ~99% of the KV cache survives between turns.
The numbers
Same model (MiniMax-M2.5 Q8_0), same hardware (M3 Ultra Mac Studio), same Terminal-Bench task (fix-git):
| Configuration | Time | Cache working? |
|---|---|---|
| llama-server, no proxy | 8:40 | No |
| llama-server + cache-proxy | 1:50 | Yes |
A 4.7× speedup from a regex.
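Converting those wall-clock times to a ratio:

```python
no_proxy = 8 * 60 + 40     # 8:40 -> 520 s
with_proxy = 1 * 60 + 50   # 1:50 -> 110 s
speedup = no_proxy / with_proxy
print(round(speedup, 1))   # 4.7
```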
Wiring it up with blobfish
The full local stack is four pieces in series:
Claude Code → blobfish agent → cache-proxy (:8081) → llama-server (:8080) → MiniMax-M2.5
Setup takes a couple of minutes. We covered llama.cpp and the model in Running MiniMax-M2.5 Locally with llama.cpp — start with that for the model side. Then layer in the proxy:
1. Clone the proxy:
git clone https://github.com/cchuter/claude-cache-proxy
cd claude-cache-proxy
2. Start llama-server on port 8080 (your usual MiniMax-M2.5 invocation — the port is the only thing that matters for the proxy):
llama-server \
--model ~/models/MiniMax-M2.5-GGUF-unsloth/MiniMax-M2.5-Q8_0-00001-of-00006.gguf \
--host 0.0.0.0 --port 8080 \
--jinja --ctx-size 196608 --context-shift \
--n-gpu-layers 999 --parallel 1 --flash-attn on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 \
--metrics --verbose
3. Start the proxy on port 8081, pointed at llama-server:
python3 cache-proxy.py
That’s the whole proxy command. Defaults are --port 8081 --upstream http://localhost:8080, which is exactly what you want.
4. Point blobfish at the proxy instead of at llama-server directly:
ANTHROPIC_BASE_URL=http://localhost:8081 \
ANTHROPIC_API_KEY=no-key \
./scripts/run-terminal-bench.sh \
--backend claude \
--model minimax/minimax-m2.5 \
--agent-profile simple \
-k 5 -n 1
Note the 8081 — that’s the proxy. Claude Code talks to the proxy, the proxy normalizes and forwards to llama-server on 8080. From Claude Code’s perspective, it’s hitting an Anthropic Messages endpoint. From llama-server’s perspective, every request starts with the same prefix and the cache stays warm.
Verifying it actually works
Run llama-server with --verbose and watch the slot logs. Each request prints something like:
Cache hit (good):
slot update_slots: id 0 | task 232 | new prompt, ... task.n_tokens = 17928
slot update_slots: id 0 | task 232 | n_tokens = 17800, memory_seq_rm [17800, end)
17,800 of 17,928 tokens served from cache. Only 128 new tokens needed evaluation. Prefill cost: under a second.
Cache miss (broken — proxy not running, or pointed at the wrong port):
slot update_slots: id 0 | task 232 | n_tokens = 33, memory_seq_rm [33, end)
Only 33 tokens matched. The entire prompt is being re-evaluated. The billing header is diverging exactly where you’d expect it to.
The number after memory_seq_rm is your cache-hit canary. If it’s small (~33), the proxy isn’t in the path. If it’s large (close to your prompt size), you’re good.
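If you want to watch the canary programmatically rather than eyeball the logs, a small parser over the slot lines does it. The line format matches the examples above; the 50%-of-prompt threshold below is an arbitrary cutoff chosen for illustration:

```python
import re

# Extract the reused-token count from llama-server's verbose slot logs
# and classify each request as a cache hit or miss.
SEQ_RM = re.compile(r"n_tokens = (\d+), memory_seq_rm \[\d+, end\)")

def classify(line: str, prompt_size: int = 17_928):
    m = SEQ_RM.search(line)
    if not m:
        return None
    reused = int(m.group(1))
    # Close to the prompt size: warm cache. Tiny (~33): cold every turn.
    return "hit" if reused > 0.5 * prompt_size else "miss"

line_hit = "slot update_slots: id 0 | task 232 | n_tokens = 17800, memory_seq_rm [17800, end)"
line_miss = "slot update_slots: id 0 | task 232 | n_tokens = 33, memory_seq_rm [33, end)"
print(classify(line_hit), classify(line_miss))   # hit miss
```

Pipe llama-server's stderr through a loop over `classify` and a run of misses tells you immediately that the proxy has dropped out of the path.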
What’s next
The same stack — llama-server + cache-proxy + blobfish — is running a Terminal-Bench evaluation on MiniMax-M2.7 as we speak. Same hardware, same harness, same proxy. When the run finishes, the result lands on the leaderboard and we’ll post the breakdown.
Until then: if you’re running Claude Code against any local model server and your inference feels mysteriously slow over multi-turn conversations, check whether your KV cache is actually surviving between turns. Odds are good it isn’t — and the fix is 60 lines of Python.