RUNNING MINIMAX-M2.5 LOCALLY WITH LLAMA.CPP
──────────────────────────────────────────────
How to set up llama.cpp and run MiniMax-M2.5 locally on Apple Silicon for use with Claude Code, coding agents, or any tool that speaks the Anthropic Messages API.
Why llama.cpp?
llama.cpp has a native Anthropic-compatible /v1/messages endpoint — including tool use and streaming. No format conversion layers, no middleware. You point your tools at http://localhost:8080 and it just works.
We tried LM Studio first but its Anthropic-to-ChatML conversion broke prompt cache prefix matching. llama-server doesn’t have this problem because it speaks the Anthropic Messages format natively.
What you need
- Mac with Apple Silicon and enough unified memory for your model. MiniMax-M2.5 at Q8_0 needs ~250 GB total (model weights + KV cache). Smaller quants will work with less RAM.
- Homebrew installed
For reference, we’re running this on an M3 Ultra with 512 GB unified memory.
Step 1: Install llama.cpp
brew install llama.cpp
This gives you llama-server at /opt/homebrew/bin/llama-server. Verify it’s there:
llama-server --version
Step 2: Download the model
We’re using the Q8_0 quantization of MiniMax-M2.5 from Hugging Face. It’s split across 7 files, ~243 GB total.
# Install the Hugging Face CLI if you don't have it
brew install huggingface-cli
# Download MiniMax-M2.5 Q8_0
hf download \
unsloth/MiniMax-M2.5-GGUF \
--include "Q8_0/*" \
--local-dir ~/models/MiniMax-M2.5-GGUF
This will take a while — 243 GB is a lot of bits. You can use a smaller quantization (Q4_K_M, Q6_K, etc.) if you’re memory-constrained. Check the unsloth/MiniMax-M2.5-GGUF repo for available quants.
After downloading, you should have 7 split files. Note that hf preserves the repo's folder structure, so with --include "Q8_0/*" they land in a Q8_0/ subdirectory:
~/models/MiniMax-M2.5-GGUF/Q8_0/
├── MiniMax-M2.5-Q8_0-00001-of-00007.gguf
├── MiniMax-M2.5-Q8_0-00002-of-00007.gguf
├── ...
└── MiniMax-M2.5-Q8_0-00007-of-00007.gguf
llama-server auto-discovers the remaining splits from the first file.
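If a download was interrupted, it can be worth confirming no split is missing before starting the server. A minimal sketch that checks the -NNNNN-of-MMMMM suffixes (the check_splits helper is ours, not part of llama.cpp or the Hugging Face tooling):

```python
import re
from pathlib import Path

def check_splits(model_dir: str) -> list[str]:
    """Return sorted .gguf files and raise if any split is missing.

    Assumes the standard GGUF split naming scheme:
    <name>-NNNNN-of-MMMMM.gguf
    """
    files = sorted(Path(model_dir).glob("*.gguf"))
    if not files:
        raise FileNotFoundError(f"no .gguf files in {model_dir}")
    m = re.search(r"-(\d{5})-of-(\d{5})\.gguf$", files[0].name)
    if not m:
        return [f.name for f in files]  # single-file model, nothing to check
    total = int(m.group(2))
    have = {int(re.search(r"-(\d{5})-of-", f.name).group(1)) for f in files}
    missing = set(range(1, total + 1)) - have
    if missing:
        raise FileNotFoundError(f"missing splits: {sorted(missing)}")
    return [f.name for f in files]
```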
Step 3: Start the server
llama-server \
--model ~/models/MiniMax-M2.5-GGUF/Q8_0/MiniMax-M2.5-Q8_0-00001-of-00007.gguf \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 131072 \
--n-gpu-layers 999 \
--parallel 1 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--reasoning-budget 0 \
--metrics \
--verbose
Model loading takes a couple minutes. Once you see slot info in the logs, the server is ready.
What the flags do
| Flag | Value | Why |
|---|---|---|
| --host | 0.0.0.0 | Allows Docker containers to connect via host.docker.internal |
| --port | 8080 | Server port |
| --ctx-size | 131072 | 128K context window (model supports up to 196K) |
| --n-gpu-layers | 999 | Offload all layers to Metal GPU |
| --parallel | 1 | Single slot — prevents slot rotation from breaking the KV cache |
| --flash-attn | on | Flash attention on Metal for faster inference |
| --cache-type-k | q8_0 | Quantize KV cache keys (~50% memory savings vs F16) |
| --cache-type-v | q8_0 | Quantize KV cache values |
| --reasoning-budget | 0 | Disables thinking/reasoning tokens (saves ~1.5 min per request) |
| --metrics | (none) | Enables Prometheus metrics at /metrics |
| --verbose | (none) | Logs cache hit/miss stats for debugging |
The --parallel 1 flag is important. With multiple slots, llama-server can rotate which slot handles a request, which means your carefully built-up KV cache gets ignored. One slot means every request builds on the same cache.
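What the single slot preserves is plain prefix matching over the tokenized prompt: the server can skip re-processing however many leading tokens the new request shares with what's already in the slot's KV cache. A toy illustration of the idea (not llama.cpp's actual code):

```python
def common_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Number of leading tokens the new prompt shares with the cached one.

    Everything before this point is served from the KV cache; only the
    tokens after it need prefill.
    """
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

# A multi-turn chat extends the previous prompt verbatim, so the entire
# old prompt is a cache hit and only the new turn is re-processed.
turn1 = [1, 17, 942, 8, 303]
turn2 = turn1 + [55, 1201, 9]
```

This is also why the format-conversion issue mentioned earlier matters: if a middleware layer rewrites the prompt between turns, the shared prefix shrinks and the cache is effectively useless.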
Step 4: Test it
The server exposes an Anthropic-compatible endpoint at /v1/messages:
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: no-key" \
-d '{
"model": "minimax-m2.5",
"max_tokens": 256,
"messages": [
{"role": "user", "content": "What is a blobfish?"}
]
}'
You should get a JSON response back (add "stream": true to the payload if you want streaming). The x-api-key can be anything — there’s no auth on local inference.
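The same request can be sent from Python using only the standard library. A sketch (the build_payload and ask helpers are ours, not part of any SDK):

```python
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Build an Anthropic Messages API payload for a single user turn."""
    return {
        "model": "minimax-m2.5",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST to llama-server's /v1/messages and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/messages",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json", "x-api-key": "no-key"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Anthropic-style responses carry a list of content blocks.
    return body["content"][0]["text"]
```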
Step 5: Point your tools at it
For Claude Code or any Anthropic SDK client:
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=no-key
export ANTHROPIC_MODEL=minimax-m2.5
That’s it. Your tools now talk to your local model.
Memory usage
On our M3 Ultra (512 GB unified memory):
| Component | Size |
|---|---|
| Model weights (Q8_0) | ~243 GB |
| KV cache (Q8_0, 128K ctx) | ~16.8 GB |
| Compute buffers | ~0.7 GB |
| Total | ~260 GB |
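The KV cache figure scales linearly with context length: per token, the server stores one key and one value vector per layer. A back-of-the-envelope sketch — the layer count and per-layer KV width below are illustrative placeholders, not MiniMax-M2.5’s actual architecture (read the real values from the GGUF metadata):

```python
def kv_cache_bytes(ctx: int, n_layers: int, kv_dim: int,
                   bytes_per_elem: float) -> float:
    """Approximate KV cache size: keys + values, every layer, every token.

    q8_0 stores 8-bit values plus per-block scales, ~1.0625 bytes per
    element; f16 would be 2.0 (hence the ~50% savings).
    """
    return 2 * ctx * n_layers * kv_dim * bytes_per_elem

# Illustrative numbers only (NOT the real MiniMax-M2.5 config):
size = kv_cache_bytes(ctx=131072, n_layers=60, kv_dim=1024,
                      bytes_per_elem=1.0625)
print(f"{size / 1e9:.1f} GB")  # → 17.1 GB, the right order of magnitude
```

Halving --ctx-size halves this number, which is why it is the first knob to turn when memory is tight.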
If you’re on a 128 GB machine, look at Q4_K_M or other smaller quants, which will cut the model weight size significantly.
Monitoring
While the server is running:
# Check inference speed and token counts
curl -s http://localhost:8080/metrics | grep -E '^llamacpp:(prompt_tokens|predicted_tokens)'
# Check slot state
curl -s http://localhost:8080/slots | python3 -m json.tool
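If you want to track throughput over time rather than eyeball raw Prometheus text, a few lines of parsing suffice. A sketch (the sample metric names mirror what the grep above looks for; check your server’s actual /metrics output for the exact names):

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text-format output into {metric_name: value},
    skipping blank lines and # HELP / # TYPE comments."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        out[name] = float(value)
    return out

# Sample of what llama-server's /metrics output looks like:
sample = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 10240
llamacpp:predicted_tokens_total 512
"""
```

Polling this in a loop and diffing successive counter values gives tokens per second without any extra tooling.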
Troubleshooting
Server errors with --flash-attn: This version requires an explicit value: --flash-attn on (not just --flash-attn).
Slow despite running: Check if reasoning is enabled. The slot params in /slots will show thinking_forced_open: true. Use --reasoning-budget 0 to disable.
Model not loading: Ensure all 7 GGUF split files are in the model directory. llama-server discovers them from the first file but needs them all present.
Out of memory: Reduce --ctx-size (try 65536 for 64K context) or use a smaller quantization.
What’s next
In a follow-up post we’ll cover the prompt caching proxy — a small Python script that sits in front of llama-server and fixes a cache-busting issue with Claude Code, taking multi-turn inference from ~9 minutes down to under 2 minutes.