RUNNING CLAUDE CODE LOCALLY WITH MINIMAX-M2.5
────────────────────────────────────────────────
A complete guide to running Claude Code against a local MiniMax-M2.5 model on Apple Silicon. No API key, no data leaving your machine, no rate limits. From zero to working in about 30 minutes (plus model download time).
What you’ll end up with
Claude Code CLI → cache-proxy (:8081) → llama-server (:8080) → MiniMax-M2.5 Q8_0
Claude Code thinks it’s talking to the Anthropic API. llama-server speaks the same protocol natively. The cache proxy in the middle fixes a header issue that would otherwise kill performance (more on that in step 6).
Prerequisites
- Mac with Apple Silicon (M1/M2/M3/M4 — any variant)
- Unified memory: 512 GB for Q8_0 (full quality), ~192 GB for Q4_K_M (good quality, smaller). If you're on a 128 GB machine, look at Q3_K_M or smaller quants (see the memory table at the end), or reduce context.
- Homebrew installed:
  /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Node.js 18+ (for Claude Code): brew install node
- Python 3 (for the cache proxy): already on macOS
Step 1: Install Claude Code
npm install -g @anthropic-ai/claude-code
Verify it’s installed:
claude --version
Step 2: Install llama.cpp
brew install llama.cpp
This gives you llama-server at /opt/homebrew/bin/llama-server. Verify:
llama-server --version
Step 3: Download MiniMax-M2.5
We use the Q8_0 quantization from the unsloth/MiniMax-M2.5-GGUF repo on Hugging Face. It’s split across multiple files, ~243 GB total.
# Install the Hugging Face CLI
brew install huggingface-cli
# Download MiniMax-M2.5 Q8_0
hf download \
  unsloth/MiniMax-M2.5-GGUF \
  --include "Q8_0/*" \
  --local-dir ~/models/MiniMax-M2.5-GGUF
This takes a while — 243 GB is a lot of bits. Once done, you’ll have the split GGUF files:
~/models/MiniMax-M2.5-GGUF/Q8_0/
├── MiniMax-M2.5-Q8_0-00001-of-00006.gguf
├── MiniMax-M2.5-Q8_0-00002-of-00006.gguf
├── ...
└── MiniMax-M2.5-Q8_0-00006-of-00006.gguf
llama-server discovers all the splits from the first file.
On a smaller machine? Replace Q8_0 with Q4_K_M (or another quant) throughout this guide. Check the repo for available quantizations.
Step 4: Start llama-server
llama-server \
  --model ~/models/MiniMax-M2.5-GGUF/Q8_0/MiniMax-M2.5-Q8_0-00001-of-00006.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --jinja \
  --ctx-size 131072 \
  --context-shift \
  --n-gpu-layers 999 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --reasoning-budget 0 \
  --metrics \
  --verbose
Model loading takes a couple minutes on first launch. You’ll see layer offloading progress, then slot info when it’s ready.
| Flag | Why |
|---|---|
| --jinja | Template rendering for tool-use messages |
| --ctx-size 131072 | 128K context — plenty for most coding sessions |
| --context-shift | Keeps going if context exceeds the window |
| --n-gpu-layers 999 | All layers on Metal GPU |
| --parallel 1 | Single slot — prevents cache-breaking rotation |
| --flash-attn on | Flash attention on Metal |
| --cache-type-k q8_0 | Quantized KV cache keys (must match values) |
| --cache-type-v q8_0 | Quantized KV cache values (must match keys) |
| --reasoning-budget 0 | Disable thinking tokens (saves time per request) |
| --verbose | Shows cache hit/miss stats for debugging |
Leave this running in its own terminal.
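Rather than watching the log for slot info, you can poll the server until the model has finished loading. A small sketch, assuming llama.cpp's built-in GET /health endpoint (present in recent builds; verify on yours):

```python
# Poll llama-server until it reports healthy (i.e. the model is loaded).
# Assumes the GET /health endpoint of recent llama.cpp server builds.
import time
import urllib.request
import urllib.error

def wait_ready(url: str = "http://localhost:8080/health", timeout: float = 600.0) -> bool:
    """Return True once the server answers the health URL with HTTP 200."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, or still loading the model
        time.sleep(2)
    return False
```

Handy in scripts that start llama-server and then immediately want to send requests.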
Step 5: Test llama-server directly
Before adding the proxy, verify the server works:
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: no-key" \
-d '{
"model": "minimax-m2.5",
"max_tokens": 128,
"messages": [
{"role": "user", "content": "What is a blobfish?"}
]
}'
If you get a response, the server is working. If not, check the llama-server terminal for errors.
Step 6: Start the cache proxy
Claude Code injects a billing header into every request, and its value changes per request. That change silently invalidates llama-server's KV cache between turns, forcing a full 60-second prefill every time. The proxy normalizes this header so the cache survives across turns, cutting multi-turn latency by 4×.
# Clone the proxy
git clone https://github.com/cchuter/claude-cache-proxy
cd claude-cache-proxy
# Start it
python3 cache-proxy.py
The proxy listens on port 8081 and forwards to localhost:8080. Leave it running in a second terminal.
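The core of the fix is tiny. Here's a minimal sketch of the normalization idea, not the actual cache-proxy.py; the header name below is a hypothetical stand-in, so check the repo for what Claude Code really sends:

```python
# Sketch: rewrite volatile headers so the upstream request is byte-identical
# across turns, keeping llama-server's cached prefix stable.
# "anthropic-billing-id" is a hypothetical name for illustration only.
VOLATILE_HEADERS = {"anthropic-billing-id"}

def normalize_headers(headers: dict) -> dict:
    """Replace values of per-request headers with a fixed placeholder."""
    return {
        name: ("normalized" if name.lower() in VOLATILE_HEADERS else value)
        for name, value in headers.items()
    }
```

The real proxy also forwards the body to llama-server unchanged; only the volatile header value is pinned.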
For the full explanation of what’s happening under the hood, see How a 60-Line Proxy Made MiniMax-M2.5 4× Faster on Claude Code.
Step 7: Run Claude Code
Open a new terminal in any project directory and set three environment variables:
export ANTHROPIC_BASE_URL=http://localhost:8081
export ANTHROPIC_API_KEY=no-key
export ANTHROPIC_MODEL=minimax/minimax-m2.5
Then launch Claude Code:
claude
That’s it. Claude Code connects to the proxy (which forwards to llama-server), and you’ve got a fully local AI coding assistant running MiniMax-M2.5 on your Mac.
Try giving it a task:
> Explain the directory structure of this project
The first turn takes a few seconds (initial prefill). Subsequent turns are fast — the cache proxy ensures llama-server reuses its KV cache.
Tip: Add the exports to your shell profile (~/.zshrc) to make them persistent:
echo 'export ANTHROPIC_BASE_URL=http://localhost:8081' >> ~/.zshrc
echo 'export ANTHROPIC_API_KEY=no-key' >> ~/.zshrc
echo 'export ANTHROPIC_MODEL=minimax/minimax-m2.5' >> ~/.zshrc
Step 8: Verify the cache is working
In your llama-server terminal (the one with --verbose), watch the output as Claude Code sends requests. You should see:
Good (cache hitting):
slot update_slots: id 0 | task 5 | n_tokens = 8900, memory_seq_rm [8900, end)
This means 8,900 tokens were served from cache. Only the new tokens (after position 8,900) needed evaluation.
Bad (cache broken — proxy not running):
slot update_slots: id 0 | task 5 | n_tokens = 33, memory_seq_rm [33, end)
Only 33 tokens matched — the entire prompt is being re-evaluated every turn. If you see this, check that:
- Claude Code is pointing at :8081 (the proxy), not :8080 (llama-server directly)
- cache-proxy.py is running
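If you'd rather not eyeball the log, the reused-prefix count can be pulled out of those update_slots lines with a short script. The pattern matches the log format shown above, which may shift between llama.cpp versions:

```python
# Extract the reused-prefix token count from llama-server --verbose log lines like:
#   slot update_slots: id 0 | task 5 | n_tokens = 8900, memory_seq_rm [8900, end)
import re
from typing import Optional

SLOT_RE = re.compile(r"update_slots:.*n_tokens = (\d+), memory_seq_rm \[(\d+), end\)")

def cached_prefix(line: str) -> Optional[int]:
    """Return the cached-prefix length from a log line, or None if it doesn't match."""
    m = SLOT_RE.search(line)
    return int(m.group(1)) if m else None
```

Pipe the llama-server log through this per line: large, growing values mean the cache is hitting; tiny values every turn mean the proxy isn't doing its job.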
Troubleshooting
“Model not loading” / split files not found: Ensure all GGUF split files are in the same directory. llama-server discovers them from the first file but needs them all present.
"--flash-attn" error:
Newer llama.cpp requires --flash-attn on (not just --flash-attn).
Out of memory / system becomes unresponsive:
Reduce --ctx-size (try 65536 for 64K) or use a smaller quantization. Q8_0 at 128K context needs ~260 GB total.
Very slow despite cache working:
Check --reasoning-budget 0 is set. Without it, the model may spend 1-2 minutes on “thinking” tokens before each response.
Claude Code says “model not found” or similar:
Make sure ANTHROPIC_MODEL is set to minimax/minimax-m2.5. The model name just needs to be a non-empty string — llama-server ignores it and uses whatever model it loaded.
The full picture
Once all three pieces are running, your local stack looks like this:
| Port | Service | Role |
|---|---|---|
| 8080 | llama-server | Model inference (MiniMax-M2.5) |
| 8081 | cache-proxy.py | Normalizes billing header for cache hits |
| — | Claude Code | CLI client, pointed at :8081 |
You can stop and restart Claude Code freely — the proxy and server stay warm. The KV cache persists across Claude Code sessions as long as llama-server stays running.
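To sanity-check the whole chain without launching Claude Code, you can send the same /v1/messages request from step 5 through the proxy. A minimal Python version of that curl, pointed at :8081:

```python
# Send a minimal Anthropic-style /v1/messages request and return the parsed JSON.
# Same request shape as the curl test in step 5, but through the proxy port.
import json
import urllib.request

def messages_request(base_url: str, prompt: str, max_tokens: int = 128) -> dict:
    body = json.dumps({
        "model": "minimax-m2.5",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/messages",
        data=body,
        headers={"Content-Type": "application/json", "x-api-key": "no-key"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)

# Usage: messages_request("http://localhost:8081", "What is a blobfish?")
```

If this returns a response but Claude Code misbehaves, the problem is in the client configuration (step 7), not the serving stack.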
Going deeper
- Benchmarking llama-server: How We Chose Our Inference Parameters — KV cache type testing, context depth sweep, optimal flags
- How a 60-Line Proxy Made MiniMax-M2.5 4× Faster on Claude Code — deep dive on the billing header bug and cache proxy
- MiniMax-M2.5 on Terminal-Bench: 42.7% from a Mac Studio — this stack running a full coding benchmark
Memory requirements by quant
| Quantization | Model size | KV cache (128K) | Total | Min RAM |
|---|---|---|---|---|
| Q8_0 | ~243 GB | ~17 GB | ~260 GB | 512 GB |
| Q6_K | ~188 GB | ~17 GB | ~205 GB | 256 GB |
| Q4_K_M | ~138 GB | ~17 GB | ~155 GB | 192 GB |
| Q3_K_M | ~110 GB | ~17 GB | ~127 GB | 128 GB |
These are approximate. Reduce --ctx-size to cut KV cache usage proportionally.
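Because the KV cache scales linearly with --ctx-size, you can estimate other context settings from the table's 128K figures. A quick helper, reusing the ~17 GB q8_0 number above:

```python
# Estimate q8_0 KV-cache size for a given --ctx-size by scaling linearly
# from the ~17 GB approximation at 128K (131072 tokens) in the table above.
def kv_cache_gb(ctx_size: int, kv_at_128k_gb: float = 17.0) -> float:
    return kv_at_128k_gb * ctx_size / 131072

# e.g. halving context to 64K roughly halves the cache:
# kv_cache_gb(65536) -> 8.5
```

Subtract that from your unified memory (minus the model size and OS headroom) to see how much context your machine can afford.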