RUNNING MINIMAX-M2.5 LOCALLY WITH LLAMA.CPP
──────────────────────────────────────────────
How to set up llama.cpp and run MiniMax-M2.5 locally on Apple Silicon for use with Claude Code, coding agents, or any tool that speaks the Anthropic Messages API.
Why llama.cpp?
llama.cpp has a native Anthropic-compatible /v1/messages endpoint — including tool use and streaming. No format conversion layers, no middleware. You point your tools at http://localhost:8080 and it just works.
We tried LM Studio first but its Anthropic-to-ChatML conversion broke prompt cache prefix matching. llama-server doesn’t have this problem because it speaks the Anthropic Messages format natively.
What you need
- Mac with Apple Silicon and enough unified memory for your model. MiniMax-M2.5 at Q8_0 needs ~250 GB total (model weights + KV cache). Smaller quants will work with less RAM.
- Homebrew installed
For reference, we’re running this on an M3 Ultra with 512 GB unified memory.
Step 1: Install llama.cpp
brew install llama.cpp
This gives you llama-server at /opt/homebrew/bin/llama-server. Verify it’s there:
llama-server --version
Step 2: Download the model
We’re using the Q8_0 quantization of MiniMax-M2.5 from Hugging Face. It’s split across 7 files, ~243 GB total.
# Install the Hugging Face CLI if you don't have it
brew install huggingface-cli
# Download MiniMax-M2.5 Q8_0
hf download \
unsloth/MiniMax-M2.5-GGUF \
--include "Q8_0/*" \
--local-dir ~/models/MiniMax-M2.5-GGUF
This will take a while — 243 GB is a lot of bits. You can use a smaller quantization (Q4_K_M, Q6_K, etc.) if you’re memory-constrained. Check the unsloth/MiniMax-M2.5-GGUF repo for available quants.
After downloading, you should have 7 split files. Note that hf preserves the repo's folder structure, so with --include "Q8_0/*" they land in a Q8_0/ subdirectory:
~/models/MiniMax-M2.5-GGUF/Q8_0/
├── MiniMax-M2.5-Q8_0-00001-of-00007.gguf
├── MiniMax-M2.5-Q8_0-00002-of-00007.gguf
├── ...
└── MiniMax-M2.5-Q8_0-00007-of-00007.gguf
llama-server auto-discovers the remaining splits from the first file.
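If a download was interrupted, it can be worth confirming no split is missing before starting the server. A minimal sketch that checks the -NNNNN-of-MMMMM suffixes (the check_splits helper is ours, not part of llama.cpp or the Hugging Face tooling):

```python
import re
from pathlib import Path

def check_splits(model_dir: str) -> list[str]:
    """Return sorted .gguf files and raise if any split is missing.

    Assumes the standard GGUF split naming scheme:
    <name>-NNNNN-of-MMMMM.gguf
    """
    files = sorted(Path(model_dir).glob("*.gguf"))
    if not files:
        raise FileNotFoundError(f"no .gguf files in {model_dir}")
    m = re.search(r"-(\d{5})-of-(\d{5})\.gguf$", files[0].name)
    if not m:
        return [f.name for f in files]  # single-file model, nothing to check
    total = int(m.group(2))
    have = {int(re.search(r"-(\d{5})-of-", f.name).group(1)) for f in files}
    missing = set(range(1, total + 1)) - have
    if missing:
        raise FileNotFoundError(f"missing splits: {sorted(missing)}")
    return [f.name for f in files]
```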
Step 3: Start the server
llama-server \
--model ~/models/MiniMax-M2.5-GGUF/Q8_0/MiniMax-M2.5-Q8_0-00001-of-00007.gguf \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 131072 \
--n-gpu-layers 999 \
--parallel 1 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--reasoning-budget 0 \
--metrics \
--verbose
Model loading takes a couple minutes. Once you see slot info in the logs, the server is ready.
What the flags do
| Flag | Value | Why |
|---|---|---|
| --host | 0.0.0.0 | Allows Docker containers to connect via host.docker.internal |
| --port | 8080 | Server port |
| --ctx-size | 131072 | 128K context window (model supports up to 196K) |
| --n-gpu-layers | 999 | Offload all layers to Metal GPU |
| --parallel | 1 | Single slot — prevents slot rotation from breaking the KV cache |
| --flash-attn | on | Flash attention on Metal for faster inference |
| --cache-type-k | q8_0 | Quantize KV cache keys (~50% memory savings vs F16) |
| --cache-type-v | q8_0 | Quantize KV cache values |
| --reasoning-budget | 0 | Disables thinking/reasoning tokens (saves ~1.5 min per request) |
| --metrics | (none) | Enables Prometheus metrics at /metrics |
| --verbose | (none) | Logs cache hit/miss stats for debugging |
The --parallel 1 flag is important. With multiple slots, llama-server can rotate which slot handles a request, which means your carefully built-up KV cache gets ignored. One slot means every request builds on the same cache.
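What the single slot preserves is plain prefix matching over the tokenized prompt: the server can skip re-processing however many leading tokens the new request shares with what's already in the slot's KV cache. A toy illustration of the idea (not llama.cpp's actual code):

```python
def common_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Number of leading tokens the new prompt shares with the cached one.

    Everything before this point is served from the KV cache; only the
    tokens after it need prefill.
    """
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

# A multi-turn chat extends the previous prompt verbatim, so the entire
# old prompt is a cache hit and only the new turn is re-processed.
turn1 = [1, 17, 942, 8, 303]
turn2 = turn1 + [55, 1201, 9]
```

This is also why the format-conversion issue mentioned earlier matters: if a middleware layer rewrites the prompt between turns, the shared prefix shrinks and the cache is effectively useless.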
Step 4: Test it
The server exposes an Anthropic-compatible endpoint at /v1/messages:
curl http://localhost:8080/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: no-key" \
-d '{
"model": "minimax-m2.5",
"max_tokens": 256,
"messages": [
{"role": "user", "content": "What is a blobfish?"}
]
}'
You should get a JSON response back (add "stream": true to the payload if you want streaming). The x-api-key can be anything — there’s no auth on local inference.
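The same request can be sent from Python using only the standard library. A sketch (the build_payload and ask helpers are ours, not part of any SDK):

```python
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Build an Anthropic Messages API payload for a single user turn."""
    return {
        "model": "minimax-m2.5",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def ask(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST to llama-server's /v1/messages and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/messages",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json", "x-api-key": "no-key"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Anthropic-style responses carry a list of content blocks.
    return body["content"][0]["text"]
```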
Step 5: Point your tools at it
For Claude Code or any Anthropic SDK client:
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_API_KEY=no-key
export ANTHROPIC_MODEL=minimax-m2.5
That’s it. Your tools now talk to your local model.
Memory usage
On our M3 Ultra (512 GB unified memory):
| Component | Size |
|---|---|
| Model weights (Q8_0) | ~243 GB |
| KV cache (Q8_0, 128K ctx) | ~16.8 GB |
| Compute buffers | ~0.7 GB |
| Total | ~260 GB |
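The KV cache figure scales linearly with context length: per token, the server stores one key and one value vector per layer. A back-of-the-envelope sketch — the layer count and per-layer KV width below are illustrative placeholders, not MiniMax-M2.5’s actual architecture (read the real values from the GGUF metadata):

```python
def kv_cache_bytes(ctx: int, n_layers: int, kv_dim: int,
                   bytes_per_elem: float) -> float:
    """Approximate KV cache size: keys + values, every layer, every token.

    q8_0 stores 8-bit values plus per-block scales, ~1.0625 bytes per
    element; f16 would be 2.0 (hence the ~50% savings).
    """
    return 2 * ctx * n_layers * kv_dim * bytes_per_elem

# Illustrative numbers only (NOT the real MiniMax-M2.5 config):
size = kv_cache_bytes(ctx=131072, n_layers=60, kv_dim=1024,
                      bytes_per_elem=1.0625)
print(f"{size / 1e9:.1f} GB")  # → 17.1 GB, the right order of magnitude
```

Halving --ctx-size halves this number, which is why it is the first knob to turn when memory is tight.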
If you’re on a 128 GB machine, look at Q4_K_M or other smaller quants, which will cut the model weight size significantly.
Monitoring
While the server is running:
# Check inference speed and token counts
curl -s http://localhost:8080/metrics | grep -E '^llamacpp:(prompt_tokens|predicted_tokens)'
# Check slot state
curl -s http://localhost:8080/slots | python3 -m json.tool
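If you want to track throughput over time rather than eyeball raw Prometheus text, a few lines of parsing suffice. A sketch (the sample metric names mirror what the grep above looks for; check your server’s actual /metrics output for the exact names):

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text-format output into {metric_name: value},
    skipping blank lines and # HELP / # TYPE comments."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        out[name] = float(value)
    return out

# Sample of what llama-server's /metrics output looks like:
sample = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 10240
llamacpp:predicted_tokens_total 512
"""
```

Polling this in a loop and diffing successive counter values gives tokens per second without any extra tooling.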
Troubleshooting
Server errors with --flash-attn: This version requires an explicit value: --flash-attn on (not just --flash-attn).
Slow despite running: Check if reasoning is enabled. The slot params in /slots will show thinking_forced_open: true. Use --reasoning-budget 0 to disable.
Model not loading: Ensure all 7 GGUF split files are in the model directory. llama-server discovers them from the first file but needs them all present.
Out of memory: Reduce --ctx-size (try 65536 for 64K context) or use a smaller quantization.
What’s next
In a follow-up post we’ll cover the prompt caching proxy — a small Python script that sits in front of llama-server and fixes a cache-busting issue with Claude Code, taking multi-turn inference from ~9 minutes down to under 2 minutes.