┌── THE BLOB BLOG ─────────────────────────────────────────────────────────┐

BENCHMARKING LLAMA-SERVER: HOW WE CHOSE OUR INFERENCE PARAMETERS

───────────────────────────────────────────────────────────────────

Running Terminal-Bench evaluations means sustained local inference — hundreds of agent tasks, each with multi-turn conversations that can run for 15-30 minutes. Every tok/s matters. We used llama-bench to find the optimal llama-server parameters for MiniMax-M2.5 on our Mac Studio and found one configuration choice that costs you 40% of your throughput.

The setup

  • Hardware: Mac Studio, M3 Ultra, 512 GB unified memory
  • Model: MiniMax-M2.5 Q8_0 (226 GB, 228B parameters, 10B active MoE)
  • Software: llama.cpp build 8340 via Homebrew
  • Benchmark tool: llama-bench

Finding: never mix KV cache types

We tested four KV cache quantization configurations: all combinations of q8_0 and q4_0 for keys and values.

[Chart: Token generation speed by KV cache type. Matching types (q8_0/q8_0: 37.4 tok/s, q4_0/q4_0: 37.2 tok/s) in green; mixed types (22.3 tok/s, 40% slower) in red.]

The results are stark:

  Keys   Values   Prompt (tok/s)   Generation (tok/s)   Penalty
  q8_0   q8_0     562              37.4                 baseline
  q4_0   q4_0     561              37.2                 ~0%
  q8_0   q4_0     289              22.3                 -40%
  q4_0   q8_0     286              22.3                 -40%

Matching KV cache types perform identically regardless of whether you use q8_0 or q4_0. But mix them and you lose 40% of your throughput — on both prompt processing and generation. This is likely due to Metal kernel optimizations that handle uniform quantization types via a fast path. Mixed types force a slower conversion path on every cache access.

The practical takeaway: if you want to save memory with q4_0 caches, use q4_0 for both. Never mix.

We chose q8_0 for both keys and values — it gives us the best quality at no speed cost compared to q4_0, and the 512 GB of unified memory gives us plenty of headroom.

Finding: throughput at different context depths

Agent tasks in Terminal-Bench fill context progressively as the conversation grows. We measured combined throughput (prompt processing + token generation) at different context depths:

[Chart: Combined throughput vs context depth. Each test fills the context to the given depth, then generates 128 tokens.]

  Context depth   Throughput (tok/s)   vs 4K
  4,096           335                  baseline
  8,192           345                  +3%
  16,384          291                  -13%
  32,768          207                  -38%
  65,536          128                  -62%
  131,072         69                   -79%

Throughput drops predictably as context grows. This is dominated by prompt processing cost — MiniMax’s hybrid architecture (lightning attention for most layers, softmax attention every 7th layer) means token generation itself stays relatively stable. But the overall request latency increases because each new turn re-processes the full conversation.
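That re-processing cost compounds: if every turn appends tokens and the whole conversation is processed again, total prompt work grows quadratically with the number of turns. A toy model (illustrative only, not measured data) makes the point:

```python
def total_prompt_tokens(turn_sizes):
    """Toy model: each turn re-processes the entire conversation so far."""
    total, ctx = 0, 0
    for new_tokens in turn_sizes:
        ctx += new_tokens   # conversation grows by this turn's tokens
        total += ctx        # the full context is processed again
    return total

# Ten turns of 1,000 tokens each: 55,000 tokens processed
# to serve 10,000 tokens of actual conversation.
print(total_prompt_tokens([1000] * 10))
```

In practice, prompt caching of the shared prefix can avoid much of this worst case; the model above shows what happens without it.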

We chose 128K context for Terminal-Bench. Most agent tasks stay well under this limit, and it frees ~9 GB of KV cache memory compared to the model’s maximum 196K. That memory headroom helps with sustained performance over long evaluation runs.
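Working backwards from those numbers (a back-of-the-envelope sketch using the ~9 GB figure, not the model's exact tensor shapes), the q8_0 KV cache comes out to roughly 144 KB per token of context:

```python
max_ctx = 196_608            # model maximum (~196K tokens), per the text above
our_ctx = 131_072            # 128K, our chosen context size
freed_bytes = 9 * 1024**3    # ~9 GB saved, per the measurements above

per_token_kb = freed_bytes / (max_ctx - our_ctx) / 1024
print(f"~{per_token_kb:.0f} KB of q8_0 KV cache per token")
```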

Prompt processing speed also falls off as the prompt batch grows:

  Prompt tokens   Speed (tok/s)
  512             562
  2,048           528
  8,192           416
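llama-bench can produce this sweep in one invocation: the -p flag accepts a comma-separated list of prompt lengths (the model path here is a placeholder):

```shell
# prompt-processing speed at three prompt lengths
llama-bench \
  -m path/to/model.gguf \
  -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 \
  -p 512,2048,8192
```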

Keeping llama.cpp current

llama.cpp releases multiple builds per day. For MiniMax-M2.5 on Apple Silicon, several recent changes matter:

Gated Delta Net optimizations — MiniMax’s lightning attention layers use the Gated Delta Net (GDN) architecture. Build b8333 fixed a memory coalescing issue in the fused GDN kernel where state matrices were accessed column-wise on row-major storage, wasting GPU cache bandwidth. This resolved a 39% regression on Metal. The --fused-gdn flag now gives users control over this optimization.

KV cache hybrid model fix (b8025) — Fixed KV cache handling specifically for hybrid architectures like MiniMax that mix attention and linear layers. This is exactly our model.

Metal kernel improvements — Adaptive CPU/GPU interleave (b7951) optimizes workload distribution, bin kernel optimization (b8312) avoids unnecessary divisions, and various flash attention fixes improve stability on Apple GPUs.

Staying on the latest build is worth it. A simple brew upgrade llama.cpp before a benchmark run ensures you’re getting the latest Metal optimizations.

brew upgrade llama.cpp
llama-server --version

Our configuration

Here’s what we run for Terminal-Bench evaluations:

llama-server \
  --model ~/models/MiniMax-M2.5-GGUF/MiniMax-M2.5-Q8_0-00001-of-00006.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --jinja \
  --ctx-size 131072 \
  --n-gpu-layers 999 \
  --parallel 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --metrics

Every flag is there for a reason:

  Flag                  Why
  --jinja               Template rendering for tool use
  --ctx-size 131072     128K context: enough for agent tasks, saves ~9 GB vs max
  --n-gpu-layers 999    All layers on the Metal GPU
  --parallel 1          Single slot; prevents KV cache rotation between requests
  --flash-attn on       Required for KV cache quantization; faster attention
  --cache-type-k q8_0   Quantized KV cache; must match type-v
  --cache-type-v q8_0   Quantized KV cache; must match type-k
  --metrics             Prometheus endpoint for monitoring

Running the benchmarks yourself

# Install
brew install llama.cpp

# KV cache type comparison
llama-bench \
  -m path/to/model.gguf \
  -ngl 999 -fa 1 \
  -ctk q8_0,q4_0 -ctv q8_0,q4_0

# Context depth sweep
llama-bench \
  -m path/to/model.gguf \
  -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 \
  -pg 4096,128 -pg 8192,128 -pg 16384,128 \
  -pg 32768,128 -pg 65536,128 -pg 131072,128
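To compare configurations programmatically rather than by eyeballing the table output, llama-bench can emit machine-readable results (we use -o json; the field names below are assumptions based on recent builds and may differ on yours). A sketch that ranks each configuration against the fastest one:

```python
import json

# A hand-written stand-in for `llama-bench ... -o json` output.
# Field names (type_k, type_v, avg_ts) are assumptions: check your build.
sample = json.loads("""[
  {"type_k": "q8_0", "type_v": "q8_0", "avg_ts": 37.4},
  {"type_k": "q4_0", "type_v": "q4_0", "avg_ts": 37.2},
  {"type_k": "q8_0", "type_v": "q4_0", "avg_ts": 22.3}
]""")

def penalties(results):
    """Percent change of each KV cache config vs the fastest one."""
    best = max(r["avg_ts"] for r in results)
    return {f'{r["type_k"]}/{r["type_v"]}': round(100 * (r["avg_ts"] / best - 1))
            for r in results}

print(penalties(sample))
```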

What’s next

In an upcoming post we’ll cover the prompt caching proxy that sits between Claude Code and llama-server, and how our hook system steers the agent during Terminal-Bench evaluations.

SysOp: online | node: 1 | baud: 2400