┌── THE BLOB BLOG ─────────────────────────────────────────────────────────┐

BENCHMARKING LLAMA-SERVER: HOW WE CHOSE OUR INFERENCE PARAMETERS

───────────────────────────────────────────────────────────────────

Running Terminal-Bench evaluations means sustained local inference — hundreds of agent tasks, each with multi-turn conversations that can run for 15-30 minutes. Every tok/s matters. We used llama-bench to find the optimal llama-server parameters for MiniMax-M2.5 on our Mac Studio and found one configuration choice that costs you 40% of your throughput.

The setup

  • Hardware: Mac Studio, M3 Ultra, 512 GB unified memory
  • Model: MiniMax-M2.5 Q8_0 (226 GB, 228B parameters, 10B active MoE)
  • Software: llama.cpp build 8340 via Homebrew
  • Benchmark tool: llama-bench

Finding: never mix KV cache types

We tested four KV cache quantization configurations: all combinations of q8_0 and q4_0 for keys and values.

[Chart: Token generation speed by KV cache type. Matching types (q8_0/q8_0: 37.4 tok/s, q4_0/q4_0: 37.2 tok/s) in green; mixed types (22.3 tok/s, 40% slower) in red.]

The results are stark:

  Keys   Values   Prompt (tok/s)   Generation (tok/s)   Penalty
  q8_0   q8_0     562              37.4                 baseline
  q4_0   q4_0     561              37.2                 ~0%
  q8_0   q4_0     289              22.3                 -40%
  q4_0   q8_0     286              22.3                 -40%

Matching KV cache types perform identically regardless of whether you use q8_0 or q4_0. But mix them and you lose 40% of your throughput — on both prompt processing and generation. This is likely due to Metal kernel optimizations that handle uniform quantization types via a fast path. Mixed types force a slower conversion path on every cache access.

The practical takeaway: if you want to save memory with q4_0 caches, use q4_0 for both. Never mix.

We chose q8_0 for both keys and values — it gives us the best quality at no speed cost compared to q4_0, and the 512 GB of unified memory gives us plenty of headroom.

Finding: throughput at different context depths

Agent tasks in Terminal-Bench fill context progressively as the conversation grows. We measured combined throughput (prompt processing + token generation) at different context depths:

[Chart: Combined throughput vs context depth. Each test fills the context to the given depth, then generates 128 tokens.]

  Context depth   Throughput (tok/s)   vs 4K
  4,096           335                  baseline
  8,192           345                  +3%
  16,384          291                  -13%
  32,768          207                  -38%
  65,536          128                  -62%
  131,072         69                   -79%

Throughput drops predictably as context grows. This is dominated by prompt processing cost — MiniMax’s hybrid architecture (lightning attention for most layers, softmax attention every 7th layer) means token generation itself stays relatively stable. But the overall request latency increases because each new turn re-processes the full conversation.
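That re-processing cost compounds: if every turn appends tokens and the whole conversation is processed again, total prompt work grows quadratically with the number of turns. A toy model (illustrative only, not measured data) makes the point:

```python
def total_prompt_tokens(turn_sizes):
    """Toy model: each turn re-processes the entire conversation so far."""
    total, ctx = 0, 0
    for new_tokens in turn_sizes:
        ctx += new_tokens   # conversation grows by this turn's tokens
        total += ctx        # the full context is processed again
    return total

# Ten turns of 1,000 tokens each: 55,000 tokens processed
# to serve 10,000 tokens of actual conversation.
print(total_prompt_tokens([1000] * 10))
```

In practice, prompt caching of the shared prefix can avoid much of this worst case; the model above shows what happens without it.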

We chose 128K context for Terminal-Bench. Most agent tasks stay well under this limit, and it frees ~9 GB of KV cache memory compared to the model’s maximum 196K. That memory headroom helps with sustained performance over long evaluation runs.
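Working backwards from those numbers (a back-of-the-envelope sketch using the ~9 GB figure, not the model's exact tensor shapes), the q8_0 KV cache comes out to roughly 144 KB per token of context:

```python
max_ctx = 196_608            # model maximum (~196K tokens), per the text above
our_ctx = 131_072            # 128K, our chosen context size
freed_bytes = 9 * 1024**3    # ~9 GB saved, per the measurements above

per_token_kb = freed_bytes / (max_ctx - our_ctx) / 1024
print(f"~{per_token_kb:.0f} KB of q8_0 KV cache per token")
```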

Prompt processing speed also falls off as the prompt batch grows:

  Prompt tokens   Speed (tok/s)
  512             562
  2,048           528
  8,192           416
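llama-bench can produce this sweep in one invocation: the -p flag accepts a comma-separated list of prompt lengths (the model path here is a placeholder):

```shell
# prompt-processing speed at three prompt lengths
llama-bench \
  -m path/to/model.gguf \
  -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 \
  -p 512,2048,8192
```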

Keeping llama.cpp current

llama.cpp releases multiple builds per day. For MiniMax-M2.5 on Apple Silicon, several recent changes matter:

Gated Delta Net optimizations — MiniMax’s lightning attention layers use the Gated Delta Net (GDN) architecture. Build b8333 fixed a memory coalescing issue in the fused GDN kernel where state matrices were accessed column-wise on row-major storage, wasting GPU cache bandwidth. This resolved a 39% regression on Metal. The --fused-gdn flag now gives users control over this optimization.

KV cache hybrid model fix (b8025) — Fixed KV cache handling specifically for hybrid architectures like MiniMax that mix attention and linear layers. This is exactly our model.

Metal kernel improvements — Adaptive CPU/GPU interleave (b7951) optimizes workload distribution, bin kernel optimization (b8312) avoids unnecessary divisions, and various flash attention fixes improve stability on Apple GPUs.

Staying on the latest build is worth it. A simple brew upgrade llama.cpp before a benchmark run ensures you’re getting the latest Metal optimizations.

brew upgrade llama.cpp
llama-server --version

Our configuration

Here’s what we run for Terminal-Bench evaluations:

llama-server \
  --model ~/models/MiniMax-M2.5-GGUF/MiniMax-M2.5-Q8_0-00001-of-00006.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --jinja \
  --ctx-size 131072 \
  --n-gpu-layers 999 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --metrics

Every flag is there for a reason:

  Flag                  Why
  --jinja               Template rendering for tool use
  --ctx-size 131072     128K context: enough for agent tasks, saves ~9 GB vs max
  --n-gpu-layers 999    All layers on the Metal GPU
  --parallel 1          Single slot; prevents KV cache rotation between requests
  --flash-attn on       Required for KV cache quantization; faster attention
  --cache-type-k q8_0   Quantized KV cache; must match type-v
  --cache-type-v q8_0   Quantized KV cache; must match type-k
  --metrics             Prometheus endpoint for monitoring

Running the benchmarks yourself

# Install
brew install llama.cpp

# KV cache type comparison
llama-bench \
  -m path/to/model.gguf \
  -ngl 999 -fa 1 \
  -ctk q8_0,q4_0 -ctv q8_0,q4_0

# Context depth sweep
llama-bench \
  -m path/to/model.gguf \
  -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 \
  -pg 4096,128 -pg 8192,128 -pg 16384,128 \
  -pg 32768,128 -pg 65536,128 -pg 131072,128
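To compare configurations programmatically rather than by eyeballing the table output, llama-bench can emit machine-readable results (we use -o json; the field names below are assumptions based on recent builds and may differ on yours). A sketch that ranks each configuration against the fastest one:

```python
import json

# A hand-written stand-in for `llama-bench ... -o json` output.
# Field names (type_k, type_v, avg_ts) are assumptions: check your build.
sample = json.loads("""[
  {"type_k": "q8_0", "type_v": "q8_0", "avg_ts": 37.4},
  {"type_k": "q4_0", "type_v": "q4_0", "avg_ts": 37.2},
  {"type_k": "q8_0", "type_v": "q4_0", "avg_ts": 22.3}
]""")

def penalties(results):
    """Percent change of each KV cache config vs the fastest one."""
    best = max(r["avg_ts"] for r in results)
    return {f'{r["type_k"]}/{r["type_v"]}': round(100 * (r["avg_ts"] / best - 1))
            for r in results}

print(penalties(sample))
```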

What’s next

In an upcoming post we’ll cover the prompt caching proxy that sits between Claude Code and llama-server, and how our hook system steers the agent during Terminal-Bench evaluations.

SysOp: online | node: 1 | baud: 2400