RUNNING CLAUDE CODE LOCALLY WITH MINIMAX-M2.5
────────────────────────────────────────────────
A complete guide to running Claude Code against a local MiniMax-M2.5 model on Apple Silicon. No API key, no data leaving your machine, no rate limits. From zero to working in about 30 minutes (plus model download time).
What you’ll end up with
Claude Code CLI → cache-proxy (:8081) → llama-server (:8080) → MiniMax-M2.5 Q8_0
Claude Code thinks it’s talking to the Anthropic API. llama-server speaks the same protocol natively. The cache proxy in the middle fixes a header issue that would otherwise kill performance (more on that in step 6).
Prerequisites
- Mac with Apple Silicon (M1/M2/M3/M4 — any variant)
- Unified memory: 512 GB for Q8_0 (full quality), ~192 GB for Q4_K_M (good quality, smaller). If you're on a 128 GB machine, look at Q3_K_M or smaller quants (see the memory table at the end), or reduce context.
- Homebrew installed:
  /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Node.js 18+ (for Claude Code): brew install node
- Python 3 (for the cache proxy): already on macOS
Step 1: Install Claude Code
npm install -g @anthropic-ai/claude-code
Verify it’s installed:
claude --version
Step 2: Install llama.cpp
brew install llama.cpp
This gives you llama-server at /opt/homebrew/bin/llama-server. Verify:
llama-server --version
Step 3: Download MiniMax-M2.5
We use the Q8_0 quantization from the unsloth/MiniMax-M2.5-GGUF repo on Hugging Face. It’s split across multiple files, ~243 GB total.
# Install the Hugging Face CLI
brew install huggingface-cli
# Download MiniMax-M2.5 Q8_0
hf download \
  unsloth/MiniMax-M2.5-GGUF \
  --include "Q8_0/*" \
  --local-dir ~/models/MiniMax-M2.5-GGUF
This takes a while — 243 GB is a lot of bits. Once done, you’ll have the split GGUF files:
~/models/MiniMax-M2.5-GGUF/Q8_0/
├── MiniMax-M2.5-Q8_0-00001-of-00006.gguf
├── MiniMax-M2.5-Q8_0-00002-of-00006.gguf
├── ...
└── MiniMax-M2.5-Q8_0-00006-of-00006.gguf
llama-server discovers all the splits from the first file.
On a smaller machine? Replace Q8_0 with Q4_K_M (or another quant) throughout this guide. Check the repo for available quantizations.
Step 4: Start llama-server
llama-server \
  --model ~/models/MiniMax-M2.5-GGUF/Q8_0/MiniMax-M2.5-Q8_0-00001-of-00006.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --jinja \
  --ctx-size 131072 \
  --context-shift \
  --n-gpu-layers 999 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --reasoning-budget 0 \
  --metrics \
  --verbose
Model loading takes a couple minutes on first launch. You’ll see layer offloading progress, then slot info when it’s ready.
| Flag | Why |
|---|---|
| --jinja | Template rendering for tool-use messages |
| --ctx-size 131072 | 128K context — plenty for most coding sessions |
| --context-shift | Keeps going if context exceeds the window |
| --n-gpu-layers 999 | All layers on Metal GPU |
| --parallel 1 | Single slot — prevents cache-breaking rotation |
| --flash-attn on | Flash attention on Metal |
| --cache-type-k q8_0 | Quantized KV cache keys (must match values) |
| --cache-type-v q8_0 | Quantized KV cache values (must match keys) |
| --reasoning-budget 0 | Disable thinking tokens (saves time per request) |
| --verbose | Shows cache hit/miss stats for debugging |
Leave this running in its own terminal.
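Rather than watching the log for slot info, you can poll the server until the model has finished loading. A small sketch, assuming llama.cpp's built-in GET /health endpoint (present in recent builds; verify on yours):

```python
# Poll llama-server until it reports healthy (i.e. the model is loaded).
# Assumes the GET /health endpoint of recent llama.cpp server builds.
import time
import urllib.request
import urllib.error

def wait_ready(url: str = "http://localhost:8080/health", timeout: float = 600.0) -> bool:
    """Return True once the server answers the health URL with HTTP 200."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, or still loading the model
        time.sleep(2)
    return False
```

Handy in scripts that start llama-server and then immediately want to send requests.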
Step 5: Test llama-server directly
Before adding the proxy, verify the server works:
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: no-key" \
-d '{
"model": "minimax-m2.5",
"max_tokens": 128,
"messages": [
{"role": "user", "content": "What is a blobfish?"}
]
}'
If you get a response, the server is working. If not, check the llama-server terminal for errors.
Step 6: Start the cache proxy
Claude Code injects a billing header into every request, and its value changes per request. That change silently invalidates llama-server's KV cache between turns, forcing a full 60-second prefill every time. The proxy normalizes this header so the cache survives across turns, cutting multi-turn latency by 4×.
# Clone the proxy
git clone https://github.com/cchuter/claude-cache-proxy
cd claude-cache-proxy
# Start it
python3 cache-proxy.py
The proxy listens on port 8081 and forwards to localhost:8080. Leave it running in a second terminal.
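The core of the fix is tiny. Here's a minimal sketch of the normalization idea, not the actual cache-proxy.py; the header name below is a hypothetical stand-in, so check the repo for what Claude Code really sends:

```python
# Sketch: rewrite volatile headers so the upstream request is byte-identical
# across turns, keeping llama-server's cached prefix stable.
# "anthropic-billing-id" is a hypothetical name for illustration only.
VOLATILE_HEADERS = {"anthropic-billing-id"}

def normalize_headers(headers: dict) -> dict:
    """Replace values of per-request headers with a fixed placeholder."""
    return {
        name: ("normalized" if name.lower() in VOLATILE_HEADERS else value)
        for name, value in headers.items()
    }
```

The real proxy also forwards the body to llama-server unchanged; only the volatile header value is pinned.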
For the full explanation of what’s happening under the hood, see How a 60-Line Proxy Made MiniMax-M2.5 4× Faster on Claude Code.
Step 7: Run Claude Code
Open a new terminal in any project directory and set three environment variables:
export ANTHROPIC_BASE_URL=http://localhost:8081
export ANTHROPIC_API_KEY=no-key
export ANTHROPIC_MODEL=minimax/minimax-m2.5
Then launch Claude Code:
claude
That’s it. Claude Code connects to the proxy (which forwards to llama-server), and you’ve got a fully local AI coding assistant running MiniMax-M2.5 on your Mac.
Try giving it a task:
> Explain the directory structure of this project
The first turn takes a few seconds (initial prefill). Subsequent turns are fast — the cache proxy ensures llama-server reuses its KV cache.
Tip: Add the exports to your shell profile (~/.zshrc) to make them persistent:
echo 'export ANTHROPIC_BASE_URL=http://localhost:8081' >> ~/.zshrc
echo 'export ANTHROPIC_API_KEY=no-key' >> ~/.zshrc
echo 'export ANTHROPIC_MODEL=minimax/minimax-m2.5' >> ~/.zshrc
Step 8: Verify the cache is working
In your llama-server terminal (the one with --verbose), watch the output as Claude Code sends requests. You should see:
Good (cache hitting):
slot update_slots: id 0 | task 5 | n_tokens = 8900, memory_seq_rm [8900, end)
This means 8,900 tokens were served from cache. Only the new tokens (after position 8,900) needed evaluation.
Bad (cache broken — proxy not running):
slot update_slots: id 0 | task 5 | n_tokens = 33, memory_seq_rm [33, end)
Only 33 tokens matched — the entire prompt is being re-evaluated every turn. If you see this, check that:
- Claude Code is pointing at :8081 (the proxy), not :8080 (llama-server directly)
- cache-proxy.py is running
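If you'd rather not eyeball the log, the reused-prefix count can be pulled out of those update_slots lines with a short script. The pattern matches the log format shown above, which may shift between llama.cpp versions:

```python
# Extract the reused-prefix token count from llama-server --verbose log lines like:
#   slot update_slots: id 0 | task 5 | n_tokens = 8900, memory_seq_rm [8900, end)
import re
from typing import Optional

SLOT_RE = re.compile(r"update_slots:.*n_tokens = (\d+), memory_seq_rm \[(\d+), end\)")

def cached_prefix(line: str) -> Optional[int]:
    """Return the cached-prefix length from a log line, or None if it doesn't match."""
    m = SLOT_RE.search(line)
    return int(m.group(1)) if m else None
```

Pipe the llama-server log through this per line: large, growing values mean the cache is hitting; tiny values every turn mean the proxy isn't doing its job.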
Troubleshooting
“Model not loading” / split files not found: Ensure all GGUF split files are in the same directory. llama-server discovers them from the first file but needs them all present.
"--flash-attn" error:
Newer llama.cpp requires --flash-attn on (not just --flash-attn).
Out of memory / system becomes unresponsive:
Reduce --ctx-size (try 65536 for 64K) or use a smaller quantization. Q8_0 at 128K context needs ~260 GB total.
Very slow despite cache working:
Check --reasoning-budget 0 is set. Without it, the model may spend 1-2 minutes on “thinking” tokens before each response.
Claude Code says “model not found” or similar:
Make sure ANTHROPIC_MODEL is set to minimax/minimax-m2.5. The model name just needs to be a non-empty string — llama-server ignores it and uses whatever model it loaded.
The full picture
Once all three pieces are running, your local stack looks like this:
| Port | Service | Role |
|---|---|---|
| 8080 | llama-server | Model inference (MiniMax-M2.5) |
| 8081 | cache-proxy.py | Normalizes billing header for cache hits |
| — | Claude Code | CLI client, pointed at :8081 |
You can stop and restart Claude Code freely — the proxy and server stay warm. The KV cache persists across Claude Code sessions as long as llama-server stays running.
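To sanity-check the whole chain without launching Claude Code, you can send the same /v1/messages request from step 5 through the proxy. A minimal Python version of that curl, pointed at :8081:

```python
# Send a minimal Anthropic-style /v1/messages request and return the parsed JSON.
# Same request shape as the curl test in step 5, but through the proxy port.
import json
import urllib.request

def messages_request(base_url: str, prompt: str, max_tokens: int = 128) -> dict:
    body = json.dumps({
        "model": "minimax-m2.5",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/messages",
        data=body,
        headers={"Content-Type": "application/json", "x-api-key": "no-key"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)

# Usage: messages_request("http://localhost:8081", "What is a blobfish?")
```

If this returns a response but Claude Code misbehaves, the problem is in the client configuration (step 7), not the serving stack.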
Going deeper
- Benchmarking llama-server: How We Chose Our Inference Parameters — KV cache type testing, context depth sweep, optimal flags
- How a 60-Line Proxy Made MiniMax-M2.5 4× Faster on Claude Code — deep dive on the billing header bug and cache proxy
- MiniMax-M2.5 on Terminal-Bench: 42.7% from a Mac Studio — this stack running a full coding benchmark
Memory requirements by quant
| Quantization | Model size | KV cache (128K) | Total | Min RAM |
|---|---|---|---|---|
| Q8_0 | ~243 GB | ~17 GB | ~260 GB | 512 GB |
| Q6_K | ~188 GB | ~17 GB | ~205 GB | 256 GB |
| Q4_K_M | ~138 GB | ~17 GB | ~155 GB | 192 GB |
| Q3_K_M | ~110 GB | ~17 GB | ~127 GB | 128 GB |
These are approximate. Reduce --ctx-size to cut KV cache usage proportionally.
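Because the KV cache scales linearly with --ctx-size, you can estimate other context settings from the table's 128K figures. A quick helper, reusing the ~17 GB q8_0 number above:

```python
# Estimate q8_0 KV-cache size for a given --ctx-size by scaling linearly
# from the ~17 GB approximation at 128K (131072 tokens) in the table above.
def kv_cache_gb(ctx_size: int, kv_at_128k_gb: float = 17.0) -> float:
    return kv_at_128k_gb * ctx_size / 131072

# e.g. halving context to 64K roughly halves the cache:
# kv_cache_gb(65536) -> 8.5
```

Subtract that from your unified memory (minus the model size and OS headroom) to see how much context your machine can afford.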