BENCHMARKING LLAMA-SERVER: HOW WE CHOSE OUR INFERENCE PARAMETERS
───────────────────────────────────────────────────────────────────
Running Terminal-Bench evaluations means sustained local inference — hundreds of agent tasks, each with multi-turn conversations that can run for 15-30 minutes. Every tok/s matters. We used llama-bench to find the optimal llama-server parameters for MiniMax-M2.5 on our Mac Studio and found one configuration choice that costs you 40% of your throughput.
The setup
- Hardware: Mac Studio, M3 Ultra, 512 GB unified memory
- Model: MiniMax-M2.5 Q8_0 (226 GB, 228B parameters, 10B active MoE)
- Software: llama.cpp build 8340 via Homebrew
- Benchmark tool: llama-bench
Finding: never mix KV cache types
We tested four KV cache quantization configurations: all combinations of q8_0 and q4_0 for keys and values.
The results are stark:
| Keys | Values | Prompt (tok/s) | Generation (tok/s) | Penalty |
|---|---|---|---|---|
| q8_0 | q8_0 | 562 | 37.4 | — |
| q4_0 | q4_0 | 561 | 37.2 | ~0% |
| q8_0 | q4_0 | 289 | 22.3 | -40% |
| q4_0 | q8_0 | 286 | 22.3 | -40% |
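The Penalty column is just the relative drop in generation speed versus the matched-type baseline:

```python
# Penalty = relative change in generation speed vs. the q8_0/q8_0 baseline.
baseline, mixed = 37.4, 22.3   # tok/s, from the table above
penalty = (mixed - baseline) / baseline
print(f"{penalty:.0%}")  # → -40%
```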
Matched KV cache types perform identically, whether you use q8_0 or q4_0. Mix them, and you lose 40% of your throughput — on both prompt processing and generation. This is likely due to Metal kernel optimizations that handle uniform quantization types via a fast path; mixed types force a slower conversion path on every cache access.
The practical takeaway: if you want to save memory with q4_0 caches, use q4_0 for both. Never mix.
We chose q8_0 for both keys and values — it gives us the best quality at no speed cost compared to q4_0, and the 512 GB of unified memory gives us plenty of headroom.
Finding: throughput at different context depths
Agent tasks in Terminal-Bench fill context progressively as the conversation grows. We measured combined throughput (prompt processing + token generation) at different context depths:
| Context depth | Throughput (tok/s) | vs 4K |
|---|---|---|
| 4,096 | 335 | — |
| 8,192 | 345 | +3% |
| 16,384 | 291 | -13% |
| 32,768 | 207 | -38% |
| 65,536 | 128 | -62% |
| 131,072 | 69 | -79% |
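A note on the "Combined" column: for llama-bench's -pg tests this is, to our reading, total tokens divided by total wall time across both phases. A minimal sketch (the `combined_tps` helper is illustrative, not part of llama-bench):

```python
# Combined throughput for a prompt+generation run: total tokens over total time.
def combined_tps(pp_tokens, pp_speed, gen_tokens, gen_speed):
    total_s = pp_tokens / pp_speed + gen_tokens / gen_speed
    return (pp_tokens + gen_tokens) / total_s

# Example with the short-prompt speeds measured earlier (562 pp, 37.4 tg):
print(f"{combined_tps(4096, 562, 128, 37.4):.0f} tok/s")  # → 394 tok/s
```

The measured 335 tok/s at 4K depth is lower than this naive estimate because prompt processing speed itself falls as prompts grow, as the prompt-size table further down shows.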
Throughput drops predictably as context grows. This is dominated by prompt processing cost — MiniMax’s hybrid architecture (lightning attention for most layers, softmax attention every 7th layer) means token generation itself stays relatively stable. But the overall request latency increases because each new turn re-processes the full conversation.
We chose 128K context for Terminal-Bench. Most agent tasks stay well under this limit, and it frees ~9 GB of KV cache memory compared to the model’s maximum 196K. That memory headroom helps with sustained performance over long evaluation runs.
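The ~9 GB figure is easy to sanity-check. A back-of-envelope calculation, assuming the 196K maximum is exactly 196,608 tokens and the savings are measured in GiB:

```python
# Implied per-token KV cache cost, from the numbers in this post.
GIB = 1024 ** 3
tokens_saved = 196_608 - 131_072           # model max context minus our --ctx-size
per_token_kib = 9 * GIB / tokens_saved / 1024
print(f"{per_token_kib:.0f} KiB per token")  # → 144 KiB per token
```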
Prompt processing speed also decreases as the prompt grows:
| Prompt tokens | Speed (tok/s) |
|---|---|
| 512 | 562 |
| 2,048 | 528 |
| 8,192 | 416 |
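To get a feel for what these numbers mean per agent turn, here is a rough latency model built from the measured speeds above (the `turn_latency_s` helper and the 128-token reply length are illustrative, not part of our tooling):

```python
# Rough per-turn latency: re-process the whole conversation, then generate.
def turn_latency_s(prompt_tokens, pp_speed, gen_tokens=128, tg_speed=37.4):
    return prompt_tokens / pp_speed + gen_tokens / tg_speed

# An 8,192-token conversation at the measured 416 tok/s prompt speed:
print(f"{turn_latency_s(8192, 416):.1f} s")  # → 23.1 s
```

Most of that time is prompt re-processing, which is why request latency climbs with conversation length even though generation speed stays relatively stable.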
Keeping llama.cpp current
llama.cpp releases multiple builds per day. For MiniMax-M2.5 on Apple Silicon, several recent changes matter:
Gated Delta Net optimizations — MiniMax’s lightning attention layers use the Gated Delta Net (GDN) architecture. Build b8333 fixed a memory coalescing issue in the fused GDN kernel where state matrices were accessed column-wise on row-major storage, wasting GPU cache bandwidth. This resolved a 39% regression on Metal. The --fused-gdn flag now gives users control over this optimization.
KV cache hybrid model fix (b8025) — Fixed KV cache handling specifically for hybrid architectures like MiniMax that mix attention and linear layers. This is exactly our model.
Metal kernel improvements — Adaptive CPU/GPU interleave (b7951) optimizes workload distribution, bin kernel optimization (b8312) avoids unnecessary divisions, and various flash attention fixes improve stability on Apple GPUs.
Staying on the latest build is worth it. A simple `brew upgrade llama.cpp` before a benchmark run ensures you’re getting the latest Metal optimizations.

```shell
brew upgrade llama.cpp
llama-server --version
```
Our configuration
Here’s what we run for Terminal-Bench evaluations:
```shell
llama-server \
  --model ~/models/MiniMax-M2.5-GGUF/MiniMax-M2.5-Q8_0-00001-of-00006.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --jinja \
  --ctx-size 131072 \
  --n-gpu-layers 999 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --metrics
```
Every flag is there for a reason:
| Flag | Why |
|---|---|
| `--jinja` | Template rendering for tool use |
| `--ctx-size 131072` | 128K context — enough for agent tasks, saves ~9 GB vs max |
| `--n-gpu-layers 999` | All layers on Metal GPU |
| `--parallel 1` | Single slot — prevents KV cache rotation between requests |
| `--flash-attn on` | Required for KV cache quantization, faster attention |
| `--cache-type-k q8_0` | Quantized KV cache — must match type-v |
| `--cache-type-v q8_0` | Quantized KV cache — must match type-k |
| `--metrics` | Prometheus endpoint for monitoring |
Running the benchmarks yourself
```shell
# Install
brew install llama.cpp

# KV cache type comparison
llama-bench \
  -m path/to/model.gguf \
  -ngl 999 -fa 1 \
  -ctk q8_0,q4_0 -ctv q8_0,q4_0

# Context depth sweep
llama-bench \
  -m path/to/model.gguf \
  -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 \
  -pg 4096,128 -pg 8192,128 -pg 16384,128 \
  -pg 32768,128 -pg 65536,128 -pg 131072,128
```
What’s next
In an upcoming post we’ll cover the prompt caching proxy that sits between Claude Code and llama-server, and how our hook system steers the agent during Terminal-Bench evaluations.