┌── THE BLOB BLOG ─────────────────────────────────────────────────────────┐

RUNNING CLAUDE CODE LOCALLY WITH MINIMAX-M2.5

────────────────────────────────────────────────

A complete guide to running Claude Code against a local MiniMax-M2.5 model on Apple Silicon. No API key, no data leaving your machine, no rate limits. From zero to working in about 30 minutes (plus model download time).

What you’ll end up with

Claude Code CLI → cache-proxy (:8081) → llama-server (:8080) → MiniMax-M2.5 Q8_0

Claude Code thinks it’s talking to the Anthropic API. llama-server speaks the same protocol natively. The cache proxy in the middle fixes a header issue that would otherwise kill performance (more on that in step 6).

Prerequisites

  • Mac with Apple Silicon (M1/M2/M3/M4 — any variant)
  • Unified memory: 512 GB for Q8_0 (full quality), 192 GB for Q4_K_M (good quality, smaller). On a 128 GB machine, look at Q3_K_M; below that, you'll need even smaller quants.
  • Homebrew installed (/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)")
  • Node.js 18+ (for Claude Code): brew install node
  • Python 3 (for the cache proxy): already on macOS

Step 1: Install Claude Code

npm install -g @anthropic-ai/claude-code

Verify it’s installed:

claude --version

Step 2: Install llama.cpp

brew install llama.cpp

This gives you llama-server at /opt/homebrew/bin/llama-server. Verify:

llama-server --version

Step 3: Download MiniMax-M2.5

We use the Q8_0 quantization from the unsloth/MiniMax-M2.5-GGUF repo on Hugging Face. It’s split across multiple files, ~243 GB total.

# Install the Hugging Face CLI
brew install huggingface-cli

# Download MiniMax-M2.5 Q8_0
hf download \
  unsloth/MiniMax-M2.5-GGUF \
  --include "Q8_0/*" \
  --local-dir ~/models/MiniMax-M2.5-GGUF

This takes a while — 243 GB is a lot of bits. Once done, you’ll have the split GGUF files:

~/models/MiniMax-M2.5-GGUF/Q8_0/
├── MiniMax-M2.5-Q8_0-00001-of-00006.gguf
├── MiniMax-M2.5-Q8_0-00002-of-00006.gguf
├── ...
└── MiniMax-M2.5-Q8_0-00006-of-00006.gguf

llama-server discovers all the splits from the first file.

On a smaller machine? Replace Q8_0 with Q4_K_M (or another quant) throughout this guide. Check the repo for available quantizations.
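Before moving on, it's worth confirming every split actually arrived. A small sketch (Python, assuming the standard `-NNNNN-of-NNNNN.gguf` split naming shown above):

```python
import re
from pathlib import Path

def missing_splits(filenames):
    """Given GGUF filenames, return the part numbers missing from the set.

    Assumes llama.cpp's split naming: <name>-00001-of-00006.gguf, etc.
    """
    pat = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")
    parts, total = set(), None
    for name in filenames:
        m = pat.search(name)
        if m:
            parts.add(int(m.group(1)))
            total = int(m.group(2))
    if total is None:
        return []  # no split files matched at all
    return sorted(set(range(1, total + 1)) - parts)

if __name__ == "__main__":
    # Check the download directory used in this guide
    model_dir = Path.home() / "models/MiniMax-M2.5-GGUF/Q8_0"
    names = [p.name for p in model_dir.glob("*.gguf")] if model_dir.exists() else []
    print("missing parts:", missing_splits(names))
```

An empty list means you're good; anything else means re-run the `hf download` command, which resumes partial downloads.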

Step 4: Start llama-server

llama-server \
  --model ~/models/MiniMax-M2.5-GGUF/Q8_0/MiniMax-M2.5-Q8_0-00001-of-00006.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --jinja \
  --ctx-size 131072 \
  --context-shift \
  --n-gpu-layers 999 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --reasoning-budget 0 \
  --metrics \
  --verbose

Model loading takes a couple minutes on first launch. You’ll see layer offloading progress, then slot info when it’s ready.

Flag                    Why
--jinja                 Template rendering for tool-use messages
--ctx-size 131072       128K context; plenty for most coding sessions
--context-shift         Keeps going if context exceeds the window
--n-gpu-layers 999      All layers on Metal GPU
--parallel 1            Single slot; prevents cache-breaking rotation
--flash-attn on         Flash attention on Metal
--cache-type-k q8_0     Quantized KV cache keys (must match values)
--cache-type-v q8_0     Quantized KV cache values (must match keys)
--reasoning-budget 0    Disable thinking tokens (saves time per request)
--verbose               Shows cache hit/miss stats for debugging

Leave this running in its own terminal.

Step 5: Test llama-server directly

Before adding the proxy, verify the server works:

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: no-key" \
  -d '{
    "model": "minimax-m2.5",
    "max_tokens": 128,
    "messages": [
      {"role": "user", "content": "What is a blobfish?"}
    ]
  }'

If you get a response, the server is working. If not, check the llama-server terminal for errors.
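The same smoke test works from Python if you'd rather script it. A sketch using only the standard library; the payload mirrors the curl call above:

```python
import json
import urllib.request

def build_messages_request(prompt, base="http://localhost:8080"):
    """Build an Anthropic-style /v1/messages request for llama-server."""
    body = json.dumps({
        "model": "minimax-m2.5",
        "max_tokens": 128,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base}/v1/messages",
        data=body,
        headers={"Content-Type": "application/json", "x-api-key": "no-key"},
    )

if __name__ == "__main__":
    req = build_messages_request("What is a blobfish?")
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            print(json.load(resp))
    except OSError as exc:
        print("llama-server not reachable:", exc)
```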

Step 6: Start the cache proxy

Claude Code injects a billing header whose value changes on every request. That change silently invalidates llama-server's KV cache between turns, forcing a full prefill (roughly 60 seconds on a large context) every time. The proxy normalizes the header so the cache survives between turns, cutting multi-turn latency by about 4×.

# Clone the proxy
git clone https://github.com/cchuter/claude-cache-proxy
cd claude-cache-proxy

# Start it
python3 cache-proxy.py

The proxy listens on port 8081 and forwards to localhost:8080. Leave it running in a second terminal.

For the full explanation of what’s happening under the hood, see How a 60-Line Proxy Made MiniMax-M2.5 4× Faster on Claude Code.
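If you're curious before reading that post: the essential move is a header rewrite. A minimal sketch of the idea only; the real header name and the proxy's forwarding logic live in the repo, and `x-billing-token` below is a made-up stand-in:

```python
# Hypothetical illustration of the proxy's core trick: pin any header whose
# value changes per request, so upstream sees identical bytes every turn.
# "x-billing-token" is NOT the real header name; it is a stand-in.
VOLATILE_HEADERS = {"x-billing-token"}

def normalize_headers(headers):
    """Return a copy of the request headers with volatile values pinned."""
    out = {}
    for name, value in headers.items():
        if name.lower() in VOLATILE_HEADERS:
            out[name] = "fixed"  # constant value every turn
        else:
            out[name] = value
    return out
```

In the real proxy this runs inside a small HTTP forwarder listening on :8081 and relaying everything else to llama-server untouched.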

Step 7: Run Claude Code

Open a new terminal in any project directory and set three environment variables:

export ANTHROPIC_BASE_URL=http://localhost:8081
export ANTHROPIC_API_KEY=no-key
export ANTHROPIC_MODEL=minimax/minimax-m2.5

Then launch Claude Code:

claude

That’s it. Claude Code connects to the proxy (which forwards to llama-server), and you’ve got a fully local AI coding assistant running MiniMax-M2.5 on your Mac.

Try giving it a task:

> Explain the directory structure of this project

The first turn takes a few seconds (initial prefill). Subsequent turns are fast — the cache proxy ensures llama-server reuses its KV cache.

Tip: Add the exports to your shell profile (~/.zshrc) to make them persistent:

echo 'export ANTHROPIC_BASE_URL=http://localhost:8081' >> ~/.zshrc
echo 'export ANTHROPIC_API_KEY=no-key' >> ~/.zshrc
echo 'export ANTHROPIC_MODEL=minimax/minimax-m2.5' >> ~/.zshrc

Step 8: Verify the cache is working

In your llama-server terminal (the one with --verbose), watch the output as Claude Code sends requests. You should see:

Good (cache hitting):

slot update_slots: id 0 | task 5 | n_tokens = 8900, memory_seq_rm [8900, end)

This means 8,900 tokens were served from cache. Only the new tokens (after position 8,900) needed evaluation.

Bad (cache broken — proxy not running):

slot update_slots: id 0 | task 5 | n_tokens = 33, memory_seq_rm [33, end)

Only 33 tokens matched — the entire prompt is being re-evaluated every turn. If you see this, check that:

  1. Claude Code is pointing at :8081 (the proxy), not :8080 (llama-server directly)
  2. cache-proxy.py is running
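If you'd rather not eyeball the logs, you can grep for the reused-token count. A throwaway sketch that parses the log lines shown above (the exact format may shift between llama.cpp versions):

```python
import re

# Matches lines like:
#   slot update_slots: id 0 | task 5 | n_tokens = 8900, memory_seq_rm [8900, end)
LINE = re.compile(r"update_slots:.*n_tokens = (\d+)")

def reused_tokens(log_line):
    """Return the cached-token count from an update_slots line, or None."""
    m = LINE.search(log_line)
    return int(m.group(1)) if m else None
```

Pipe llama-server's output through it; a small number on turn two and onward means the proxy isn't in the path.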

Troubleshooting

“Model not loading” / split files not found: Ensure all GGUF split files are in the same directory. llama-server discovers them from the first file but needs them all present.

"–flash-attn" error: Newer llama.cpp requires --flash-attn on (not just --flash-attn).

Out of memory / system becomes unresponsive: Reduce --ctx-size (try 65536 for 64K) or use a smaller quantization. Q8_0 at 128K context needs ~260 GB total.

Very slow despite cache working: Check --reasoning-budget 0 is set. Without it, the model may spend 1-2 minutes on “thinking” tokens before each response.

Claude Code says “model not found” or similar: Make sure ANTHROPIC_MODEL is set to minimax/minimax-m2.5. The model name just needs to be a non-empty string — llama-server ignores it and uses whatever model it loaded.

The full picture

Once all three pieces are running, your local stack looks like this:

Port   Service          Role
8080   llama-server     Model inference (MiniMax-M2.5)
8081   cache-proxy.py   Normalizes billing header for cache hits
-      Claude Code      CLI client, pointed at :8081

You can stop and restart Claude Code freely — the proxy and server stay warm. The KV cache persists across Claude Code sessions as long as llama-server stays running.

Going deeper

Memory requirements by quant

Quantization   Model size   KV cache (128K)   Total     Min RAM
Q8_0           ~243 GB      ~17 GB            ~260 GB   512 GB
Q6_K           ~188 GB      ~17 GB            ~205 GB   256 GB
Q4_K_M         ~138 GB      ~17 GB            ~155 GB   192 GB
Q3_K_M         ~110 GB      ~17 GB            ~127 GB   128 GB

These are approximate. Reduce --ctx-size to cut KV cache usage proportionally.
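The KV cache column scales linearly with context length. A back-of-the-envelope calculator; note the layer/head numbers below are hypothetical placeholders chosen to land near the ~17 GB column, not MiniMax-M2.5's real architecture:

```python
def kv_cache_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=1.0625):
    """Approximate KV cache size: keys + values, every layer, every position.

    bytes_per_elem: ~1.0625 for q8_0 (8 bits plus scale overhead), 2.0 for f16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Hypothetical architecture numbers, purely for illustration
size = kv_cache_bytes(ctx=131072, n_layers=62, n_kv_heads=8, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # → 17.7 GB; halving --ctx-size halves this
```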

SysOp: online  •  node: 1  •  baud: 2400