DEEPSEEK V4 FLASH ON LLAMA.CPP: ARCHITECTURE PORT AND LESSONS
────────────────────────────────────────────────────────────────
DeepSeek V4 Flash is now running end-to-end in a fork of llama.cpp on Apple Silicon: model loading, Metal-offloaded inference, tool calling, and a full quantization pipeline from base safetensors. This post is a technical writeup of what was needed — what’s new versus V3.2, where the surprises were, and what the configuration looks like once it works.
The fork: cchuter/llama.cpp, branch feat/v4-port. Build version 9154, commit d6f1cc635. Upstreaming is on the roadmap; this post is also intended as the source for an issue / PR thread on ggml-org/llama.cpp.
The model
DeepSeek V4 Flash is a 284B-parameter Mixture-of-Experts language model with three architectural features beyond what V3.2 introduced:
- Hyper-connections (count=4, sinkhorn iterations=20, ε=1e-6) replacing residual connections, with per-layer hc_attn_*/hc_ffn_* tensors and an output-level output_hc_* triple (toy sketch after this list).
- Compressed-attention KV with per-layer compressors (attn_compressor_{ape,kv,gate,norm}) producing a latent K cache. Compress ratios are per-layer (attention.compress_ratios array; alternates 0/4/128 across the 43 layers).
- A separate indexer pass with its own compressors (indexer_compressor_{ape,kv,gate,norm}), indexer_proj, indexer_attn_q_b, and a per-layer indexer K cache, used to drive sparse attention with top_k=512.
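For orientation, here is a toy, self-contained sketch of Sinkhorn normalization at the config's shape: a 4×4 mixing matrix (count=4), 20 alternating row/column normalization passes, ε=1e-6 as a numerical floor. The fork's actual hyper-connection kernel (DSV4_HC_SPLIT_SINKHORN, covered under Metal kernels below) operates on activation-derived mixing weights; treat this as an illustration of the iteration, not the fork's code.

#include <array>
#include <cstdio>

constexpr int   N     = 4;      // hyper-connection count
constexpr int   ITERS = 20;     // sinkhorn iterations
constexpr float EPS   = 1e-6f;  // numerical floor (assumed use of the config's ε)

using Mat = std::array<std::array<float, N>, N>;

// Alternating row/column normalization; inputs are assumed non-negative
// (e.g. post-exponentiation), after which the matrix converges toward
// doubly stochastic.
static void sinkhorn(Mat & m) {
    for (int it = 0; it < ITERS; ++it) {
        for (auto & row : m) {           // normalize each row to sum 1
            float s = EPS;
            for (float v : row) s += v;
            for (float & v : row) v /= s;
        }
        for (int j = 0; j < N; ++j) {    // normalize each column to sum 1
            float s = EPS;
            for (int i = 0; i < N; ++i) s += m[i][j];
            for (int i = 0; i < N; ++i) m[i][j] /= s;
        }
    }
}

int main() {
    Mat m = {{{1, 2, 3, 4}, {4, 3, 2, 1}, {1, 1, 1, 1}, {2, 2, 1, 1}}};
    sinkhorn(m);
    for (const auto & row : m) {
        std::printf("%.4f %.4f %.4f %.4f\n", row[0], row[1], row[2], row[3]);
    }
}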
V4 inherits V3.2’s Direct Sparse Attention (DSA) plumbing, so @fairydreaming’s PR #21149 is a hard prerequisite — the fork’s V4 work sits on top of fairydreaming/llama.cpp:deepseek-dsa. V4 also reuses V3.2’s MoE expert routing (256 experts × 6 active + 1 shared) and YaRN-scaled RoPE, but with a sqrtsoftplus expert gating function that’s new to V4 and requires a corresponding branch in TextModel.set_gguf_parameters().
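The exact form of the gating function isn't spelled out anywhere I've found; from the name, a reasonable reading is g(x) = sqrt(softplus(x)) applied to the router logits before top-k expert selection. A minimal sketch under that assumption:

#include <cmath>
#include <cstdio>
#include <initializer_list>

// Assumed form: g(x) = sqrt(log(1 + e^x)), with softplus computed stably
// for large x. Monotone like sigmoid/softmax scoring, but compressing
// large logits toward sqrt(x).
static float sqrtsoftplus(float x) {
    const float sp = x > 20.0f ? x : std::log1p(std::exp(x));
    return std::sqrt(sp);
}

int main() {
    for (float x : {-4.0f, 0.0f, 4.0f}) {
        std::printf("g(%+.1f) = %.4f\n", x, sqrtsoftplus(x));
    }
}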
V4 is also the first model in the DeepSeek family to ship with FP8 e4m3 attention weights and FP4 routed expert weights (with FP8 e8m0 scales) directly in the safetensors. There is no fp16/bf16 distribution. This shapes the converter design — see “Quantization” below.
What changed in the fork
130 commits on top of upstream master. Largest deltas, broken out by subsystem:
Architecture registration
- src/llama-arch.{cpp,h}: new LLM_ARCH_DEEPSEEK4 entry plus tensor name constants for ATTN_COMPRESSOR_*, INDEXER_*, HC_ATTN_*, HC_FFN_*, OUTPUT_HC_*. New KV constants for HASH_LAYER_COUNT, HYPER_CONNECTION_*, ATTENTION_COMPRESS_RATIOS, ATTENTION_COMPRESS_ROPE_FREQ_BASE, ATTENTION_OUTPUT_LORA_RANK, ATTENTION_OUTPUT_GROUP_COUNT, and NEXTN_PREDICT_LAYERS.
- src/models/deepseek4.cpp: 1544-line llama_model_deepseek4 class implementing load_arch_hparams, load_arch_tensors, the per-layer compressor allocator, and the forward graph including hyper-connection sinkhorn iterations and the indexer-driven sparse-attention path.
- src/models/deepseek32.cpp: 510-line restructure of V3.2 to fit the per-arch class pattern that landed upstream (commit 994118a18, May 4) after fairydreaming’s branch. V4 piggybacks on this structure.
KV cache assembly
V4 has three K caches in flight per layer: the standard SWA mem_attn K, a compressed-attention K (cache.attn_k), and an indexer K (cache.index_k, only when compress_ratios[il] == 4). The hybrid allocator src/llama-memory-hybrid-iswa.cpp was extended (+667 lines) to handle the V4-specific tensors, with a sibling src/llama-kv-cache-dsa.{cpp,h} carrying over from V3.2.
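A toy enumeration of the per-layer cache plan this implies. The 0/4/128 values come from the model's compress_ratios array; the modulo-3 cycling here is an assumption for illustration, since the real pattern is read from the config:

#include <cstdio>
#include <vector>

int main() {
    const int n_layer = 43;
    const int pattern[3] = {0, 4, 128};   // assumed cycling of the per-layer ratios
    std::vector<int> compress_ratios(n_layer);
    for (int il = 0; il < n_layer; ++il) {
        compress_ratios[il] = pattern[il % 3];
    }
    for (int il = 0; il < n_layer; ++il) {
        // every layer: SWA mem_attn K + compressed cache.attn_k;
        // cache.index_k only where compress_ratios[il] == 4 (per the post)
        const bool has_index_k = compress_ratios[il] == 4;
        std::printf("layer %2d: ratio %3d -> mem_attn K + cache.attn_k%s\n",
                    il, compress_ratios[il], has_index_k ? " + cache.index_k" : "");
    }
}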
The forward graph concatenates the SWA K view with the compressed K view via ggml_concat (src/models/deepseek4.cpp ~line 1352), which asserts equal dtype. This becomes important under quantized KV cache requests — see below.
Metal kernels
V4 introduces several ops not present in upstream:
- GGML_OP_DSV4_FP8_KV_QUANTIZE — per-token fp8 quantization of K activations before they enter the cache. This is the source of the “K is already fp8-quantized” property that breaks q8_0 KV.
- Lightning-indexer ops driving the sparse selection of attended tokens.
- Hyper-connection ops including the sinkhorn normalization iteration.
Additions land in ggml/src/ggml-metal/ggml-metal-{device,impl,ops}.{cpp,h,m} and ggml-metal.metal (+391 lines of kernels). GGML_OP_COUNT was bumped to 102 with a static_assert update.
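The op bump follows ggml's usual registration pattern: extend the op enum, extend the name tables, and keep a static_assert that forces the two to move together. A self-contained miniature of that invariant (names illustrative, not ggml's actual tables):

// Miniature of the enum/name-table sync check; real ggml keeps several
// parallel tables (name, symbol, ...) guarded the same way.
enum ggml_op_mini {
    OP_NONE,
    OP_DSV4_FP8_KV_QUANTIZE,   // one of the new V4 ops
    OP_COUNT,                  // bumped whenever an op is added
};

static const char * OP_NAMES[] = {
    "NONE",
    "DSV4_FP8_KV_QUANTIZE",
};

static_assert(sizeof(OP_NAMES)/sizeof(OP_NAMES[0]) == OP_COUNT,
              "op name table out of sync with enum");

int main() { return 0; }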
Converter (convert_hf_to_gguf.py)
A DeepseekV4Model class (+811 lines including supporting infrastructure) handles:
- FP8 e4m3 weights with FP8 e8m0 per-block scales (decode sketch after this list). Requires TORCH_FLOAT8_E8M0FNU = getattr(torch, "float8_e8m0fnu", None) (torch ≥ 2.5) plus LazyTorchTensor._dtype_str_map entries for F8_E8M0/F8_E8M0FNU. Without these, the safetensors loader can’t even identify the scale tensors.
- FP4 routed expert weights. _write_deepseek4_expert_tensors raises NotImplementedError if --outtype is not one of {iq2_xxs, iq2_xs, q2_k, tq2_0, tq1_0, q8_0}. Going FP4 → bf16 → re-quantize loses the per-block scale information; FP4 → these compact targets preserves it. --outtype f16 is unreachable for V4.
- CLI plumbing for --deepseek4-expert-outtypes, --deepseek4-max-layers (debug; partial export), and --deepseek4-expert-workers. These need matching kwargs in ModelBase.__init__ and the model_class(...) call site.
- A scoring_func == "sqrtsoftplus" branch in the base TextModel.set_gguf_parameters().
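For the FP8 e4m3 weights (first bullet above), the arithmetic the converter has to reproduce is exponent/mantissa decoding plus a power-of-two block scale. A self-contained sketch, assuming the OCP MX conventions (e4m3fn: bias 7, no infinities; e8m0: pure exponent, bias 127) and leaving block layout aside:

#include <cmath>
#include <cstdint>
#include <cstdio>

// Decode one fp8 e4m3fn byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
static float fp8_e4m3_to_f32(uint8_t b) {
    const int sign = (b >> 7) & 1;
    const int exp  = (b >> 3) & 0xF;
    const int man  =  b       & 0x7;
    float v;
    if (exp == 0) {
        v = std::ldexp((float) man / 8.0f, -6);             // subnormal
    } else if (exp == 0xF && man == 0x7) {
        v = NAN;                                            // e4m3fn has NaN but no inf
    } else {
        v = std::ldexp(1.0f + (float) man / 8.0f, exp - 7); // normal
    }
    return sign ? -v : v;
}

// Decode one e8m0 scale byte: a bare biased exponent, value = 2^(b - 127).
static float e8m0_scale_to_f32(uint8_t b) {
    return b == 0xFF ? NAN : std::ldexp(1.0f, (int) b - 127);
}

int main() {
    const uint8_t w = 0x48;  // e4m3: 2^2 = 4.0
    const uint8_t s = 0x7E;  // e8m0: 2^-1 = 0.5
    std::printf("%g\n", fp8_e4m3_to_f32(w) * e8m0_scale_to_f32(s));  // prints 2
}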
gguf-py/gguf/constants.py (+147 lines) and gguf-py/gguf/gguf_writer.py (+24 lines) carry the V4 enum, V4 tensor names, V4 KV constants, and matching writer helpers (add_attention_compress_ratios, add_attention_output_lora_rank, etc.).
Chat template
models/templates/deepseek-ai-DeepSeek-V4.jinja (96 lines). Note that tokenizer_config.json in the upstream HuggingFace repo does not ship a chat_template field — converters need to embed the template at build time (gguf_new_metadata.py --chat-template-file); otherwise the produced GGUFs fall back to a default that doesn’t emit V4’s <|DSML|tool_calls> markers.
common/chat.cpp detects V4 via the dsml_token substring in the rendered template and routes through common_chat_params_init_deepseek_v3_2 (V4 inherits V3.2’s tool-call grammar entirely; only the chat-token format differs).
The q8_0 KV cache pitfall
The most generally applicable lesson from this port: q8_0 KV cache silently corrupts V4 inference.
Symptoms: launch with --cache-type-k q8_0 --cache-type-v q8_0 and the model produces looping runs of = characters, single-character output, or "Mirror …"-style noise instead of coherent text. No error, no crash. Without the gate suite this is easy to miss, because tool-call paths sometimes get past the broken KV cache on short sequences.
Root cause, as observed in gates: V4 feeds cache tensors with values that have already been rounded through its FP8 KV path. In llama.cpp this op returns float tensors, not FP8 storage, but the value distribution is already constrained by V4’s e4m3-style quantize/dequantize step. Re-quantizing those latent K values again with generic q8_0’s linear per-32-element scale corrupts attention enough to produce coherent-looking but wrong decode loops. The robust fix is to force the V4 K/latent caches to F16 until a V4-specific cache format is designed and validated.
The fix: pin V4 KV cache types to GGML_TYPE_F16 regardless of --cache-type-k|v, with a one-line LLAMA_LOG_WARN on first override. The coercion happens in two places for defense in depth:
- Primary (src/llama-context.cpp::llama_init_from_model): coerces params.type_k/params.type_v before the shared validations at :351-355 and :3047-3077 (sketch after this list). This matters because those validators reject q8_0 V cache without flash-attn, reject quantized KV under SPLIT_MODE_TENSOR, etc. — they need to see the effective fp16 types, not the user’s original q8 request.
- Defense-in-depth (src/llama-model.cpp::create_memory): re-pins the V4-specific cache types to fp16. Any direct caller of create_memory is also covered.
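The primary coercion in sketch form (stub types stand in for llama.cpp's definitions, and the helper name is mine; the fork inlines the check rather than factoring it out like this):

#include <cstdio>

enum ggml_type { GGML_TYPE_F16, GGML_TYPE_Q8_0 };
enum llm_arch  { LLM_ARCH_DEEPSEEK4, LLM_ARCH_OTHER };
struct llama_context_params { ggml_type type_k, type_v; };

// Runs before the shared KV validators so they see the effective fp16 types.
static void pin_v4_kv_types(llm_arch arch, llama_context_params & p) {
    if (arch != LLM_ARCH_DEEPSEEK4) return;
    if (p.type_k != GGML_TYPE_F16 || p.type_v != GGML_TYPE_F16) {
        std::fprintf(stderr,
            "warn: V4 K values are already fp8-shaped; overriding --cache-type-k/v to f16\n");
        p.type_k = GGML_TYPE_F16;
        p.type_v = GGML_TYPE_F16;
    }
}

int main() {
    llama_context_params p = { GGML_TYPE_Q8_0, GGML_TYPE_Q8_0 };
    pin_v4_kv_types(LLM_ARCH_DEEPSEEK4, p);
    return (p.type_k == GGML_TYPE_F16 && p.type_v == GGML_TYPE_F16) ? 0 : 1;
}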
The forward-graph ggml_concat between the SWA K view and the compressed K view (~line 1352 of deepseek4.cpp) is the proximate reason both caches must agree on dtype — but the deeper reason is that neither of them tolerates q8_0 in the first place.
A regression gate (tests/v4-port/gate-server-chat-q8.sh, three modes: warn, warn-fa-off, bail) prevents this from regressing.
Quantization recipe
V4’s quantization pipeline goes through Q8_0 as a mandatory intermediate (FP4 routed experts can’t be written to bf16/f16 — see Converter section above). From Q8_0 we re-quantize to whatever final target is wanted:
# Step 1: HF safetensors (FP8 e4m3 weights, FP4 routed experts) → Q8_0 GGUF
python3 convert_hf_to_gguf.py ~/models/DeepSeek-V4-Flash \
--outfile ~/models/DeepSeek-V4-Flash-Q8_0.gguf --outtype q8_0
# Output: ~282 GiB, 8.50 BPW, ~30–60 min on M3 Ultra
llama-quantize also needs an integer-tensor passthrough in src/llama-quant.cpp::tensor_allows_quantization — V4’s ffn_gate_tid2eid.weight tensors are I32 lookup tables, and the dequantize path explicitly rejects integer types. Returning false for tensor->type == GGML_TYPE_I8/I16/I32/I64 skips the lookup tables, which are then passed through unchanged (sketch below).
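A self-contained sketch of that check (stub types again; the real predicate lives in src/llama-quant.cpp and sees ggml's full type enum):

enum ggml_type { GGML_TYPE_F16, GGML_TYPE_I8, GGML_TYPE_I16, GGML_TYPE_I32, GGML_TYPE_I64 };
struct ggml_tensor { ggml_type type; };

static bool tensor_allows_quantization(const ggml_tensor * t) {
    switch (t->type) {
        // integer lookup tables (e.g. V4's ffn_gate_tid2eid.weight) must be
        // copied verbatim; the dequantize path would reject them
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_I64:
            return false;
        default:
            return true;  // float-family tensors proceed to normal (re)quantization
    }
}

int main() {
    const ggml_tensor lut = { GGML_TYPE_I32 };
    return tensor_allows_quantization(&lut) ? 1 : 0;  // expect 0: passthrough
}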
Naming scheme: <base>-XL, <base>-XXL
Plain Q2_K body alone turns out to be too aggressive for V4 under realistic agent prompts — once the model lands on a high-prior repeated structure ("The\n", ". , git, ignore, ., git, keep"-style filename lists) it can’t escape and the decode degenerates. The fix is selective per-tensor pinning at higher precision on the tensors that drive attention discrimination. We use a t-shirt size suffix scheme to label these:
| Tier | What’s pinned high (above the body) |
|---|---|
| <base> (no suffix) | nothing — just llama-quantize defaults |
| <base>-XL | output projection + token embeddings → Q8_0 |
| <base>-XXL | XL + attention weights → Q6_K, hyper-connections (hc_*, output_hc_*) → Q8_0, indexer projections → Q6_K |
Larger tier = more “armor” on critical-path tensors = more stable on long agent contexts, slightly bigger file, body precision unchanged. The scheme generalizes: Q4_K_M-XL, IQ2_M-XXL, etc. would all mean the same delta over their respective bases.
Recipes
# Q4_K_M (no suffix): default mix from Q8_0 source
./build/bin/llama-quantize --allow-requantize \
~/models/DeepSeek-V4-Flash-Q8_0.gguf \
~/models/DeepSeek-V4-Flash-Q4_K_M.gguf \
Q4_K_M
# ~160 GiB, 4.84 BPW. Known-good baseline; reward=1.0 on a Terminal-Bench task at this quant.
# Q2_K-XL: smaller body, output + token_embd preserved at Q8_0
./build/bin/llama-quantize --allow-requantize \
--output-tensor-type q8_0 \
--token-embedding-type q8_0 \
~/models/DeepSeek-V4-Flash-Q8_0.gguf \
~/models/DeepSeek-V4-Flash-Q2_K-XL.gguf \
Q2_K
# ~97 GiB, 2.92 BPW. Fastest practical body, but vulnerable to long-context loops on agent prompts.
# Q2_K-XXL: XL + attention/hc/indexer pinned high
./build/bin/llama-quantize --allow-requantize \
--output-tensor-type q8_0 \
--token-embedding-type q8_0 \
--tensor-type "indexer=q6_K" \
--tensor-type "attn_q_a=q6_K" --tensor-type "attn_q_b=q6_K" \
--tensor-type "attn_kv=q6_K" \
--tensor-type "attn_output_a=q6_K" --tensor-type "attn_output_b=q6_K" \
--tensor-type "hc_=q8_0" \
~/models/DeepSeek-V4-Flash-Q8_0.gguf \
~/models/DeepSeek-V4-Flash-Q2_K-XXL.gguf \
Q2_K
# ~99 GiB, 2.98 BPW. Recommended if you want something smaller than Q4_K_M without the loop-class failures.
# Q4_K_M-XXL: pin the kernels that scale with context. Targets the
# Q8-vs-Q4 agent-context speed gap (see "Quant choice for agent
# workloads" below).
./build/bin/llama-quantize --allow-requantize \
--output-tensor-type q8_0 \
--token-embedding-type q8_0 \
--tensor-type "indexer=q8_0" \
--tensor-type "attn_q_a=q8_0" --tensor-type "attn_q_b=q8_0" \
--tensor-type "attn_kv=q8_0" \
--tensor-type "attn_output_a=q8_0" --tensor-type "attn_output_b=q8_0" \
--tensor-type "hc_=q8_0" \
~/models/DeepSeek-V4-Flash-Q8_0.gguf \
~/models/DeepSeek-V4-Flash-Q4_K_M-XXL.gguf \
Q4_K_M
# ~163 GiB, 4.92 BPW (+3 GiB over plain Q4_K_M). Q4-footprint with the
# context-scaling kernels at Q8 precision.
The XXL recipe specifically targets the failure mode observed on plain Q2_K body:
- attn_* at Q6_K: prevents attention-score noise from compounding over long context
- hc_* at Q8_0: V4’s hyper-connections replace residual connections; sinkhorn-normalized noise here drives degeneracy fast
- indexer.* at Q6_K: the lightning indexer drives sparse-attention top-k selection — needs precision
Substring-match gotcha for --tensor-type patterns: the matcher is substring-based on the full tensor name. indexer_=q8_0 matches indexer_compressor_* (the literal underscore continues) but not indexer.attn_q_b or indexer.proj (dot after indexer). Use the bare prefix indexer=q8_0 to catch both the dotted and underscored families. We verified this empirically — an early XXL build with indexer_= only pinned the small compressor tensors and missed the large indexer.attn_q_b projections, which are the ones that actually matter for context-scaling decode speed.
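The semantics are easy to demonstrate in isolation (tensor names below are illustrative of the two families):

#include <cstdio>
#include <string>

// Plain substring match on the full tensor name, as --tensor-type does.
static bool matches(const std::string & name, const std::string & pat) {
    return name.find(pat) != std::string::npos;
}

int main() {
    const std::string underscored = "blk.3.indexer_compressor_kv.weight";
    const std::string dotted      = "blk.3.indexer.attn_q_b.weight";
    std::printf("indexer_ vs underscored: %d\n", matches(underscored, "indexer_")); // 1
    std::printf("indexer_ vs dotted:      %d\n", matches(dotted,      "indexer_")); // 0 (dot follows)
    std::printf("indexer  vs dotted:      %d\n", matches(dotted,      "indexer"));  // 1
}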
tests/v4-port/build-quants.sh wraps the standard non-pinned variants.
Why not IQ2_XXS / IQ2_M / IQ1_M?
The natural target for an MoE this large is the IQ-series (~2.0 BPW), with IQ2_XXS for ffn_{gate,up}_exps and Q2_K for ffn_down_exps as a typical mixed recipe. All of these targets require an importance matrix.
llama-imatrix currently segfaults on V4 inside the activation-collection path, blocking IQ-series quants. The scheduler reports fused Gated Delta Net (autoregressive) enabled even though V4 doesn’t use GDN — likely a tensor-classification heuristic mismatch that triggers a hook into a code path the V4 graph doesn’t survive. This is open as a follow-up.
Server configuration
llama-server \
--model ~/models/DeepSeek-V4-Flash-Q8_0.gguf \
--host 0.0.0.0 --port 8080 \
--jinja \
--reasoning off \
--ctx-size 393216 \
--n-gpu-layers 999 \
--parallel 1 \
--flash-attn on \
--threads-batch 32 \
--temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0 \
--metrics --verbose
| Flag | Why |
|---|---|
| --jinja | V4 uses the embedded chat template for tool calls |
| --cache-type-k|v q8_0 | Don’t. Silently overridden to fp16. Save the flag for non-V4 architectures. |
| --reasoning off | Default for agent workloads — <think> blocks add latency without changing tool-call quality, and an unbounded budget can produce 4000+ token thinking bursts that eat the per-task timeout. Use --reasoning auto only when you specifically want chain-of-thought visible in the output. |
| --ctx-size 393216 | The model card recommends ≥384k for Think Max mode; 384k also covers any agent context we’ve seen. Drop to 131072 if KV-cache footprint is tight. |
| --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0 | Match the model card and generation_config.json: V4 was trained against full-distribution sampling. Lower temps (0.7) collapse onto highest-prior tokens and increase repetition / over-determinism — observed empirically as agent narration loops on smaller quants and over-engineered FFI rabbit holes on larger ones. |
Anthropic-shape thinking field is silently ignored. Claude Code clients send "thinking": {"type": "adaptive"} per the new Anthropic API. llama-server’s compat layer only extracts a budget when type == "enabled" and budget_tokens is present — adaptive falls through and the request uses whatever the server defaults to. The server-side --reasoning / --reasoning-budget flags are the actual ceiling regardless of what the client sends. Don’t rely on client-side budgets reaching the model.
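A sketch of the compat-layer behavior as described, using nlohmann::json (which llama-server vendors); the function and return convention are illustrative, not the server's actual code:

#include <nlohmann/json.hpp>
#include <cstdio>

using json = nlohmann::json;

// Returns the client-requested thinking budget, or -1 to fall back to server defaults.
static int extract_thinking_budget(const json & req) {
    if (!req.contains("thinking")) return -1;
    const auto & t = req.at("thinking");
    if (t.value("type", "") == "enabled" && t.contains("budget_tokens")) {
        return t.at("budget_tokens").get<int>();
    }
    return -1; // "adaptive" (and anything else) falls through silently
}

int main() {
    const json adaptive = { {"thinking", { {"type", "adaptive"} }} };
    const json enabled  = { {"thinking", { {"type", "enabled"}, {"budget_tokens", 2048} }} };
    std::printf("adaptive -> %d\n", extract_thinking_budget(adaptive)); // -1: ignored
    std::printf("enabled  -> %d\n", extract_thinking_budget(enabled));  // 2048
}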
Per-request max_tokens overrides server-side --predict. If you set --predict N server-side hoping to cap runaway generation, be aware that any per-request max_tokens value (Claude Code sends 48000 by default) takes precedence. The server-side flag is only a fallback when the client sends nothing. To enforce a real ceiling, clamp max_tokens in a proxy.
Memory footprint (Q4_K_M, 128k ctx)
| Component | Size |
|---|---|
| Model weights (Q4_K_M) | ~160 GiB |
| KV cache (forced f16, 128k ctx) | ~6 GiB |
| Compute buffers | ~7 GiB |
| Total on Metal device | ~173 GiB |
Q2_K-XL cuts the model weight to ~97 GiB; Q2_K-XXL is ~99 GiB. Q8_0 grows it to ~282 GiB. KV cache is fp16 in all cases regardless of the --cache-type-* flags.
Performance (M3 Ultra, NGL=999)
Decode tok/s from tests/v4-port/gate-speed.sh (small prompt, single-slot):
| Quant | Size | BPW | Decode tok/s | Notes |
|---|---|---|---|---|
| Q8_0 | 282 GiB | 8.50 | ~16 | reference |
| Q4_K_M | 160 GiB | 4.84 | 15.69 | known-good baseline |
| Q2_K-XL | 97 GiB | 2.92 | 16.38 | small but unstable on long agent prompts |
| Q2_K-XXL | 99 GiB | 2.98 | ~16 | recommended small variant |
Decode is compute-bound, not bandwidth-bound, at small contexts on V4 — the lightning indexer, hyper-connection sinkhorn iterations, and per-token expert routing dominate. Going from 4.84 BPW down to 2.92 BPW only shaves ~4% off decode time at gate-speed scale.
Quant choice for agent workloads (the Q8 surprise)
Gate-speed numbers above are measured on tiny prompts. They mislead. The same quants under realistic 25k-token agent contexts (Claude Code system prompt + a handful of tool results) diverge sharply:
| Quant | Decode @ 25k ctx | Δ vs gate-speed |
|---|---|---|
| Q8_0 | 16.62 tok/s | +4% (essentially flat) |
| Q4_K_M | 9.95 tok/s | -37% |
Q8_0 is 1.7× faster than Q4_K_M at agent context, despite reading 1.76× more bytes per token. The bandwidth-vs-compute intuition you’d apply to a typical dense model breaks here. Why:
- V4 decode is compute-bound on the indexer / sinkhorn / expert-routing path, not on weight bandwidth. These ops do roughly the same compute regardless of weight quantization — the dequant pass is incremental.
- Q8_0’s dequant kernel is the simplest possible: int8 × per-block scale, almost no bit-shuffling. Q4_K_M unpacks super-blocks with multi-level scales (4-bit quants + 6-bit scales + 6-bit mins) — much more setup per fetch.
- Per-matmul setup cost dominates at batch=1 when V4’s MoE routes to 6 of 256 experts per token. ~18 small matmuls per token (gate/up/down × 6) — Q4_K_M’s heavier unpack loses to Q8_0’s straightforward path even with more bytes to read.
- Long context exposes the gap: Q8 is essentially flat across context (~16 tok/s at gate-speed → 16.6 tok/s at 25k); Q4_K_M drops 37%. The flat-vs-context-scaling pattern points at ops that scale with context length — namely the lightning indexer and compressed-attention projections.
The fix is a recipe, not a kernel rewrite (yet). Q4_K_M-XXL (recipe above) pins exactly the tensors that scale with context — indexer.*, attn_q_a/q_b, attn_kv, attn_output_a/b, hc_* — at Q8_0 while leaving the routed expert FFN at Q4_K. ~163 GiB / 4.92 BPW (+3 GiB / +0.08 BPW over plain Q4_K_M). The proper fix is adding K-quant fast paths to the V4 Metal kernels (lightning indexer, hc_split_sinkhorn, etc.) so plain Q4_K_M decodes at Q8 speed; that’s a follow-up PR.
Practical guidance:
| Use case | Recommended quant |
|---|---|
| Agent / Claude Code (you have ≥350 GiB RAM headroom) | Q8_0 — fastest at agent context, model card-recommended sampling works out of the box |
| Agent / Claude Code (footprint-constrained) | Q4_K_M-XXL — ~163 GiB, recovers most of Q8’s agent-context speed |
| Chat / one-shot generation | Any quant; gate-speed numbers apply |
| Smallest viable footprint, accept some loop risk | Q2_K-XXL — ~99 GiB |
Validation gates
tests/v4-port/run-all-gates.sh runs:
- gate-loader — confirms architecture is recognized
- gate-coherence (NGL=0 + NGL=999) — short coherent decode at temp 0
- gate-speed — minimum 10 tok/s decode floor
- gate-tools — 5/5 tool-call requests against a weather fixture
- gate-server-chat — three chat completions (small + medium prompts + tool fixture)
- gate-server-chat-q8 (MODE=warn and MODE=warn-fa-off) — exercises the q8 KV silent-override path
End-to-end: a largest-eigenval Terminal-Bench trial completed in 5:47 with reward=1.0 on Q2_K-OutQ8, confirming the full agentic loop (tool calls, multi-turn reasoning, file writes, test runs) works at the smallest practical quant.
Open items
- llama-imatrix segfault on V4 — blocks IQ2_XXS / IQ2_M / IQ1_M quants. Crash is in the activation-collection hook during the first chunk’s forward pass, with the scheduler misclassifying V4’s hyper-connection or recurrent-state tensors as Gated Delta Net.
- K-quant fast paths in V4 Metal kernels — the Q8 vs Q4 agent-context speed gap (1.7×) is caused by V4’s lightning indexer + sinkhorn + hc_* kernels having a slow generic dequant path for K-quants. The XXL recipe works around this by pinning those tensors at Q8_0; a real fix adds typed specializations like kernel_lightning_indexer_q4_K_f32. Estimated ~1–2 days per kernel for clean specializations; the lightning indexer is the highest-priority target.
- CUDA backend coverage — only the lightning indexer has a CUDA kernel (560 LOC). The other five V4 ops (DSV4_HC_SPLIT_SINKHORN, DSV4_HC_WEIGHTED_SUM, DSV4_HC_EXPAND, DSV4_FP8_KV_QUANTIZE, DSV4_ROPE_TAIL) only have CPU and Metal implementations. Running on CUDA today would round-trip activations CUDA→CPU→CUDA per layer, slower than CPU-only.
- DSA Metal kernels for V3.2 — V4’s metal kernels work because they’re V4-specific; V3.2 sparse-attention models still won’t run on Metal in this build.
- Upstreaming — V3.2/DSA via PR #21149 needs to land first. V4 will likely follow as a separate PR layered on top.
Links
- Fork: cchuter/llama.cpp@feat/v4-port
- V3.2/DSA upstream PR: ggml-org/llama.cpp#21149 (fairydreaming)
- Base model: deepseek-ai/DeepSeek-V4-Flash
- Completion writeups in the fork: docs/plans/v4-port-debug-completion.md, docs/plans/v4-port-kv-q8-completion.md
- HF model card: docs/model-cards/DeepSeek-V4-Flash-Q4_K_M.md
Comments, corrections, and PRs welcome.