DEEPSEEK V4 FLASH ON LLAMA.CPP: ARCHITECTURE PORT AND LESSONS
────────────────────────────────────────────────────────────────
DeepSeek V4 Flash is now running end-to-end in a fork of llama.cpp on Apple Silicon: model loading, Metal-offloaded inference, tool calling, and a full quantization pipeline from base safetensors. This post is a technical writeup of what was needed — what’s new versus V3.2, where the surprises were, and what the configuration looks like once it works.
The fork: cchuter/llama.cpp, branch feat/v4-port. Build version 9154, commit d6f1cc635. Upstreaming is on the roadmap; this post is also intended as the source for an issue / PR thread on ggml-org/llama.cpp.
The model
DeepSeek V4 Flash is a 284B-parameter Mixture-of-Experts language model with three architectural features beyond what V3.2 introduced:
- Hyper-connections (count=4, sinkhorn iterations=20, ε=1e-6) replacing residual connections, with per-layer hc_attn_*/hc_ffn_* tensors and an output-level output_hc_* triple (toy sketch after this list).
- Compressed-attention KV with per-layer compressors (attn_compressor_{ape,kv,gate,norm}) producing a latent K cache. Compress ratios are per-layer (attention.compress_ratios array; alternates 0/4/128 across the 43 layers).
- A separate indexer pass with its own compressors (indexer_compressor_{ape,kv,gate,norm}), indexer_proj, indexer_attn_q_b, and a per-layer indexer K cache, used to drive sparse attention with top_k=512.
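For orientation, here is a toy, self-contained sketch of Sinkhorn normalization at the config's shape: a 4×4 mixing matrix (count=4), 20 alternating row/column normalization passes, ε=1e-6 as a numerical floor. The fork's actual hyper-connection kernel (DSV4_HC_SPLIT_SINKHORN, covered under Metal kernels below) operates on activation-derived mixing weights; treat this as an illustration of the iteration, not the fork's code.

#include <array>
#include <cstdio>

constexpr int   N     = 4;      // hyper-connection count
constexpr int   ITERS = 20;     // sinkhorn iterations
constexpr float EPS   = 1e-6f;  // numerical floor (assumed use of the config's ε)

using Mat = std::array<std::array<float, N>, N>;

// Alternating row/column normalization; inputs are assumed non-negative
// (e.g. post-exponentiation), after which the matrix converges toward
// doubly stochastic.
static void sinkhorn(Mat & m) {
    for (int it = 0; it < ITERS; ++it) {
        for (auto & row : m) {           // normalize each row to sum 1
            float s = EPS;
            for (float v : row) s += v;
            for (float & v : row) v /= s;
        }
        for (int j = 0; j < N; ++j) {    // normalize each column to sum 1
            float s = EPS;
            for (int i = 0; i < N; ++i) s += m[i][j];
            for (int i = 0; i < N; ++i) m[i][j] /= s;
        }
    }
}

int main() {
    Mat m = {{{1, 2, 3, 4}, {4, 3, 2, 1}, {1, 1, 1, 1}, {2, 2, 1, 1}}};
    sinkhorn(m);
    for (const auto & row : m) {
        std::printf("%.4f %.4f %.4f %.4f\n", row[0], row[1], row[2], row[3]);
    }
}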
V4 inherits V3.2’s Direct Sparse Attention (DSA) plumbing, so @fairydreaming’s PR #21149 is a hard prerequisite — the fork’s V4 work sits on top of fairydreaming/llama.cpp:deepseek-dsa. V4 also reuses V3.2’s MoE expert routing (256 experts × 6 active + 1 shared) and YaRN-scaled RoPE, but with a sqrtsoftplus expert gating function that’s new to V4 and requires a corresponding branch in TextModel.set_gguf_parameters().
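The exact form of the gating function isn't spelled out anywhere I've found; from the name, a reasonable reading is g(x) = sqrt(softplus(x)) applied to the router logits before top-k expert selection. A minimal sketch under that assumption:

#include <cmath>
#include <cstdio>
#include <initializer_list>

// Assumed form: g(x) = sqrt(log(1 + e^x)), with softplus computed stably
// for large x. Monotone like sigmoid/softmax scoring, but compressing
// large logits toward sqrt(x).
static float sqrtsoftplus(float x) {
    const float sp = x > 20.0f ? x : std::log1p(std::exp(x));
    return std::sqrt(sp);
}

int main() {
    for (float x : {-4.0f, 0.0f, 4.0f}) {
        std::printf("g(%+.1f) = %.4f\n", x, sqrtsoftplus(x));
    }
}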
V4 is also the first model in the DeepSeek family to ship with FP8 e4m3 attention weights and FP4 routed expert weights (with FP8 e8m0 scales) directly in the safetensors. There is no fp16/bf16 distribution. This shapes the converter design — see “Quantization” below.
What changed in the fork
130 commits on top of upstream master. Largest deltas, broken out by subsystem:
Architecture registration
- src/llama-arch.{cpp,h}: new LLM_ARCH_DEEPSEEK4 entry plus tensor name constants for ATTN_COMPRESSOR_*, INDEXER_*, HC_ATTN_*, HC_FFN_*, OUTPUT_HC_*. New KV constants for HASH_LAYER_COUNT, HYPER_CONNECTION_*, ATTENTION_COMPRESS_RATIOS, ATTENTION_COMPRESS_ROPE_FREQ_BASE, ATTENTION_OUTPUT_LORA_RANK, ATTENTION_OUTPUT_GROUP_COUNT, and NEXTN_PREDICT_LAYERS.
- src/models/deepseek4.cpp: 1544-line llama_model_deepseek4 class implementing load_arch_hparams, load_arch_tensors, the per-layer compressor allocator, and the forward graph including hyper-connection sinkhorn iterations and the indexer-driven sparse-attention path.
- src/models/deepseek32.cpp: 510-line restructure of V3.2 to fit the per-arch class pattern that landed upstream (commit 994118a18, May 4) after fairydreaming’s branch. V4 piggybacks on this structure.
KV cache assembly
V4 has three K caches in flight per layer: the standard SWA mem_attn K, a compressed-attention K (cache.attn_k), and an indexer K (cache.index_k, only when compress_ratios[il] == 4). The hybrid allocator src/llama-memory-hybrid-iswa.cpp was extended (+667 lines) to handle the V4-specific tensors, with a sibling src/llama-kv-cache-dsa.{cpp,h} carrying over from V3.2.
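A toy enumeration of the per-layer cache plan this implies. The 0/4/128 values come from the model's compress_ratios array; the modulo-3 cycling here is an assumption for illustration, since the real pattern is read from the config:

#include <cstdio>
#include <vector>

int main() {
    const int n_layer = 43;
    const int pattern[3] = {0, 4, 128};   // assumed cycling of the per-layer ratios
    std::vector<int> compress_ratios(n_layer);
    for (int il = 0; il < n_layer; ++il) {
        compress_ratios[il] = pattern[il % 3];
    }
    for (int il = 0; il < n_layer; ++il) {
        // every layer: SWA mem_attn K + compressed cache.attn_k;
        // cache.index_k only where compress_ratios[il] == 4 (per the post)
        const bool has_index_k = compress_ratios[il] == 4;
        std::printf("layer %2d: ratio %3d -> mem_attn K + cache.attn_k%s\n",
                    il, compress_ratios[il], has_index_k ? " + cache.index_k" : "");
    }
}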
The forward graph concatenates the SWA K view with the compressed K view via ggml_concat (src/models/deepseek4.cpp ~line 1352), which asserts equal dtype. This becomes important under quantized KV cache requests — see below.
Metal kernels
V4 introduces several ops not present in upstream:
- GGML_OP_DSV4_FP8_KV_QUANTIZE — per-token fp8 quantization of K activations before they enter the cache. This is the source of the “K is already fp8-quantized” property that breaks q8_0 KV.
- Lightning-indexer ops driving the sparse selection of attended tokens.
- Hyper-connection ops including the sinkhorn normalization iteration.
Additions land in ggml/src/ggml-metal/ggml-metal-{device,impl,ops}.{cpp,h,m} and ggml-metal.metal (+391 lines of kernels). GGML_OP_COUNT was bumped to 102 with a static_assert update.
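The op bump follows ggml's usual registration pattern: extend the op enum, extend the name tables, and keep a static_assert that forces the two to move together. A self-contained miniature of that invariant (names illustrative, not ggml's actual tables):

// Miniature of the enum/name-table sync check; real ggml keeps several
// parallel tables (name, symbol, ...) guarded the same way.
enum ggml_op_mini {
    OP_NONE,
    OP_DSV4_FP8_KV_QUANTIZE,   // one of the new V4 ops
    OP_COUNT,                  // bumped whenever an op is added
};

static const char * OP_NAMES[] = {
    "NONE",
    "DSV4_FP8_KV_QUANTIZE",
};

static_assert(sizeof(OP_NAMES)/sizeof(OP_NAMES[0]) == OP_COUNT,
              "op name table out of sync with enum");

int main() { return 0; }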
Converter (convert_hf_to_gguf.py)
A DeepseekV4Model class (+811 lines including supporting infrastructure) handles:
- FP8 e4m3 weights with FP8 e8m0 per-block scales (decode sketch after this list). Requires TORCH_FLOAT8_E8M0FNU = getattr(torch, "float8_e8m0fnu", None) (torch ≥ 2.5) plus LazyTorchTensor._dtype_str_map entries for F8_E8M0/F8_E8M0FNU. Without these, the safetensors loader can’t even identify the scale tensors.
- FP4 routed expert weights. _write_deepseek4_expert_tensors raises NotImplementedError if --outtype is not one of {iq2_xxs, iq2_xs, q2_k, tq2_0, tq1_0, q8_0}. Going FP4 → bf16 → re-quantize loses the per-block scale information; FP4 → these compact targets preserves it. --outtype f16 is unreachable for V4.
- CLI plumbing for --deepseek4-expert-outtypes, --deepseek4-max-layers (debug; partial export), and --deepseek4-expert-workers. These need matching kwargs in ModelBase.__init__ and the model_class(...) call site.
- A scoring_func == "sqrtsoftplus" branch in the base TextModel.set_gguf_parameters().
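For the FP8 e4m3 weights (first bullet above), the arithmetic the converter has to reproduce is exponent/mantissa decoding plus a power-of-two block scale. A self-contained sketch, assuming the OCP MX conventions (e4m3fn: bias 7, no infinities; e8m0: pure exponent, bias 127) and leaving block layout aside:

#include <cmath>
#include <cstdint>
#include <cstdio>

// Decode one fp8 e4m3fn byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
static float fp8_e4m3_to_f32(uint8_t b) {
    const int sign = (b >> 7) & 1;
    const int exp  = (b >> 3) & 0xF;
    const int man  =  b       & 0x7;
    float v;
    if (exp == 0) {
        v = std::ldexp((float) man / 8.0f, -6);             // subnormal
    } else if (exp == 0xF && man == 0x7) {
        v = NAN;                                            // e4m3fn has NaN but no inf
    } else {
        v = std::ldexp(1.0f + (float) man / 8.0f, exp - 7); // normal
    }
    return sign ? -v : v;
}

// Decode one e8m0 scale byte: a bare biased exponent, value = 2^(b - 127).
static float e8m0_scale_to_f32(uint8_t b) {
    return b == 0xFF ? NAN : std::ldexp(1.0f, (int) b - 127);
}

int main() {
    const uint8_t w = 0x48;  // e4m3: 2^2 = 4.0
    const uint8_t s = 0x7E;  // e8m0: 2^-1 = 0.5
    std::printf("%g\n", fp8_e4m3_to_f32(w) * e8m0_scale_to_f32(s));  // prints 2
}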
gguf-py/gguf/constants.py (+147 lines) and gguf-py/gguf/gguf_writer.py (+24 lines) carry the V4 enum, V4 tensor names, V4 KV constants, and matching writer helpers (add_attention_compress_ratios, add_attention_output_lora_rank, etc.).
Chat template
models/templates/deepseek-ai-DeepSeek-V4.jinja (96 lines). Note that tokenizer_config.json in the upstream HuggingFace repo does not ship a chat_template field — converters need to embed the template at build time (gguf_new_metadata.py --chat-template-file); otherwise the produced GGUFs fall back to a default that doesn’t emit V4’s <|DSML|tool_calls> markers.
common/chat.cpp detects V4 via the dsml_token substring in the rendered template and routes through common_chat_params_init_deepseek_v3_2 (V4 inherits V3.2’s tool-call grammar entirely; only the chat-token format differs).
The q8_0 KV cache pitfall
The most generally applicable lesson from this port: q8_0 KV cache silently corrupts V4 inference.
Symptoms: launch with --cache-type-k q8_0 --cache-type-v q8_0 and the model produces looping runs of = characters, single-character output, or "Mirror …"-style noise instead of coherent text. No error, no crash. Without the gate suite this is easy to miss, because tool-call paths sometimes get past the broken KV cache on short sequences.
Root cause, as observed in gates: V4 feeds cache tensors with values that have already been rounded through its FP8 KV path. In llama.cpp this op returns float tensors, not FP8 storage, but the value distribution is already constrained by V4’s e4m3-style quantize/dequantize step. Re-quantizing those latent K values again with generic q8_0’s linear per-32-element scale corrupts attention enough to produce coherent-looking but wrong decode loops. The robust fix is to force the V4 K/latent caches to F16 until a V4-specific cache format is designed and validated.
The fix: pin V4 KV cache types to GGML_TYPE_F16 regardless of --cache-type-k|v, with a one-line LLAMA_LOG_WARN on first override. The coercion happens in two places for defense in depth:
- Primary (src/llama-context.cpp::llama_init_from_model): coerces params.type_k/params.type_v before the shared validations at :351-355 and :3047-3077 (sketch after this list). This matters because those validators reject q8_0 V cache without flash-attn, reject quantized KV under SPLIT_MODE_TENSOR, etc. — they need to see the effective fp16 types, not the user’s original q8 request.
- Defense-in-depth (src/llama-model.cpp::create_memory): re-pins the V4-specific cache types to fp16. Any direct caller of create_memory is also covered.
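The primary coercion in sketch form (stub types stand in for llama.cpp's definitions, and the helper name is mine; the fork inlines the check rather than factoring it out like this):

#include <cstdio>

enum ggml_type { GGML_TYPE_F16, GGML_TYPE_Q8_0 };
enum llm_arch  { LLM_ARCH_DEEPSEEK4, LLM_ARCH_OTHER };
struct llama_context_params { ggml_type type_k, type_v; };

// Runs before the shared KV validators so they see the effective fp16 types.
static void pin_v4_kv_types(llm_arch arch, llama_context_params & p) {
    if (arch != LLM_ARCH_DEEPSEEK4) return;
    if (p.type_k != GGML_TYPE_F16 || p.type_v != GGML_TYPE_F16) {
        std::fprintf(stderr,
            "warn: V4 K values are already fp8-shaped; overriding --cache-type-k/v to f16\n");
        p.type_k = GGML_TYPE_F16;
        p.type_v = GGML_TYPE_F16;
    }
}

int main() {
    llama_context_params p = { GGML_TYPE_Q8_0, GGML_TYPE_Q8_0 };
    pin_v4_kv_types(LLM_ARCH_DEEPSEEK4, p);
    return (p.type_k == GGML_TYPE_F16 && p.type_v == GGML_TYPE_F16) ? 0 : 1;
}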
The forward-graph ggml_concat between the SWA K view and the compressed K view (~line 1352 of deepseek4.cpp) is the proximate reason both caches must agree on dtype — but the deeper reason is that neither of them tolerates q8_0 in the first place.
A regression gate (tests/v4-port/gate-server-chat-q8.sh, three modes: warn, warn-fa-off, bail) prevents this from regressing.
Quantization recipe
V4’s quantization pipeline goes through Q8_0 as a mandatory intermediate (FP4 routed experts can’t be written to bf16/f16 — see Converter section above). From Q8_0 we re-quantize to whatever final target is wanted:
# Step 1: HF safetensors (FP8 e4m3 weights, FP4 routed experts) → Q8_0 GGUF
python3 convert_hf_to_gguf.py ~/models/DeepSeek-V4-Flash \
--outfile ~/models/DeepSeek-V4-Flash-Q8_0.gguf --outtype q8_0
# Output: ~282 GiB, 8.50 BPW, ~30–60 min on M3 Ultra
llama-quantize also needs an integer-tensor passthrough in src/llama-quant.cpp::tensor_allows_quantization — V4’s ffn_gate_tid2eid.weight tensors are I32 lookup tables, and the dequantize path explicitly rejects integer types. Returning false for tensor->type == GGML_TYPE_I8/I16/I32/I64 skips the lookup tables, which are then passed through unchanged (sketch below).
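A self-contained sketch of that check (stub types again; the real predicate lives in src/llama-quant.cpp and sees ggml's full type enum):

enum ggml_type { GGML_TYPE_F16, GGML_TYPE_I8, GGML_TYPE_I16, GGML_TYPE_I32, GGML_TYPE_I64 };
struct ggml_tensor { ggml_type type; };

static bool tensor_allows_quantization(const ggml_tensor * t) {
    switch (t->type) {
        // integer lookup tables (e.g. V4's ffn_gate_tid2eid.weight) must be
        // copied verbatim; the dequantize path would reject them
        case GGML_TYPE_I8:
        case GGML_TYPE_I16:
        case GGML_TYPE_I32:
        case GGML_TYPE_I64:
            return false;
        default:
            return true;  // float-family tensors proceed to normal (re)quantization
    }
}

int main() {
    const ggml_tensor lut = { GGML_TYPE_I32 };
    return tensor_allows_quantization(&lut) ? 1 : 0;  // expect 0: passthrough
}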
Naming scheme: <base>-XL, <base>-XXL
Plain Q2_K body alone turns out to be too aggressive for V4 under realistic agent prompts — once the model lands on a high-prior repeated structure ("The\n", ". , git, ignore, ., git, keep"-style filename lists) it can’t escape and the decode degenerates. The fix is selective per-tensor pinning at higher precision on the tensors that drive attention discrimination. We use a t-shirt size suffix scheme to label these:
| Tier | What’s pinned high (above the body) |
|---|---|
| <base> (no suffix) | nothing — just llama-quantize defaults |
| <base>-XL | output projection + token embeddings → Q8_0 |
| <base>-XXL | XL + attention weights → Q6_K, hyper-connections (hc_*, output_hc_*) → Q8_0, indexer projections → Q6_K |
Larger tier = more “armor” on critical-path tensors = more stable on long agent contexts, slightly bigger file, body precision unchanged. The scheme generalizes: Q4_K_M-XL, IQ2_M-XXL, etc. would all mean the same delta over their respective bases.
Recipes
# Q4_K_M (no suffix): default mix from Q8_0 source
./build/bin/llama-quantize --allow-requantize \
~/models/DeepSeek-V4-Flash-Q8_0.gguf \
~/models/DeepSeek-V4-Flash-Q4_K_M.gguf \
Q4_K_M
# ~160 GiB, 4.84 BPW. Known-good baseline; reward=1.0 on a Terminal-Bench task at this quant.
# Q2_K-XL: smaller body, output + token_embd preserved at Q8_0
./build/bin/llama-quantize --allow-requantize \
--output-tensor-type q8_0 \
--token-embedding-type q8_0 \
~/models/DeepSeek-V4-Flash-Q8_0.gguf \
~/models/DeepSeek-V4-Flash-Q2_K-XL.gguf \
Q2_K
# ~97 GiB, 2.92 BPW. Fastest practical body, but vulnerable to long-context loops on agent prompts.
# Q2_K-XXL: XL + attention/hc/indexer pinned high
./build/bin/llama-quantize --allow-requantize \
--output-tensor-type q8_0 \
--token-embedding-type q8_0 \
--tensor-type "indexer=q6_K" \
--tensor-type "attn_q_a=q6_K" --tensor-type "attn_q_b=q6_K" \
--tensor-type "attn_kv=q6_K" \
--tensor-type "attn_output_a=q6_K" --tensor-type "attn_output_b=q6_K" \
--tensor-type "hc_=q8_0" \
~/models/DeepSeek-V4-Flash-Q8_0.gguf \
~/models/DeepSeek-V4-Flash-Q2_K-XXL.gguf \
Q2_K
# ~99 GiB, 2.98 BPW. Recommended if you want something smaller than Q4_K_M without the loop-class failures.
# Q4_K_M-XXL: pin the kernels that scale with context. Targets the
# Q8-vs-Q4 agent-context speed gap (see "Quant choice for agent
# workloads" below).
./build/bin/llama-quantize --allow-requantize \
--output-tensor-type q8_0 \
--token-embedding-type q8_0 \
--tensor-type "indexer=q8_0" \
--tensor-type "attn_q_a=q8_0" --tensor-type "attn_q_b=q8_0" \
--tensor-type "attn_kv=q8_0" \
--tensor-type "attn_output_a=q8_0" --tensor-type "attn_output_b=q8_0" \
--tensor-type "hc_=q8_0" \
~/models/DeepSeek-V4-Flash-Q8_0.gguf \
~/models/DeepSeek-V4-Flash-Q4_K_M-XXL.gguf \
Q4_K_M
# ~163 GiB, 4.92 BPW (+3 GiB over plain Q4_K_M). Q4-footprint with the
# context-scaling kernels at Q8 precision.
The XXL recipe specifically targets the failure mode observed on plain Q2_K body:
- attn_* at Q6_K: prevents attention-score noise from compounding over long context
- hc_* at Q8_0: V4’s hyper-connections replace residual connections; sinkhorn-normalized noise here drives degeneracy fast
- indexer.* at Q6_K: the lightning indexer drives sparse-attention top-k selection — needs precision
Substring-match gotcha for --tensor-type patterns: the matcher is substring-based on the full tensor name. indexer_=q8_0 matches indexer_compressor_* (the literal underscore continues) but not indexer.attn_q_b or indexer.proj (dot after indexer). Use the bare prefix indexer=q8_0 to catch both the dotted and underscored families. We verified this empirically — an early XXL build with indexer_= only pinned the small compressor tensors and missed the large indexer.attn_q_b projections, which are the ones that actually matter for context-scaling decode speed.
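The semantics are easy to demonstrate in isolation (tensor names below are illustrative of the two families):

#include <cstdio>
#include <string>

// Plain substring match on the full tensor name, as --tensor-type does.
static bool matches(const std::string & name, const std::string & pat) {
    return name.find(pat) != std::string::npos;
}

int main() {
    const std::string underscored = "blk.3.indexer_compressor_kv.weight";
    const std::string dotted      = "blk.3.indexer.attn_q_b.weight";
    std::printf("indexer_ vs underscored: %d\n", matches(underscored, "indexer_")); // 1
    std::printf("indexer_ vs dotted:      %d\n", matches(dotted,      "indexer_")); // 0 (dot follows)
    std::printf("indexer  vs dotted:      %d\n", matches(dotted,      "indexer"));  // 1
}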
tests/v4-port/build-quants.sh wraps the standard non-pinned variants.
Why not IQ2_XXS / IQ2_M / IQ1_M?
The natural target for an MoE this large is the IQ-series (~2.0 BPW), with IQ2_XXS for ffn_{gate,up}_exps and Q2_K for ffn_down_exps as a typical mixed recipe. All of these targets require an importance matrix.
llama-imatrix currently segfaults on V4 inside the activation-collection path, blocking IQ-series quants. The scheduler reports fused Gated Delta Net (autoregressive) enabled even though V4 doesn’t use GDN — likely a tensor-classification heuristic mismatch that triggers a hook into a code path the V4 graph doesn’t survive. This is open as a follow-up.
Server configuration
llama-server \
--model ~/models/DeepSeek-V4-Flash-Q8_0.gguf \
--host 0.0.0.0 --port 8080 \
--jinja \
--reasoning off \
--ctx-size 393216 \
--n-gpu-layers 999 \
--parallel 1 \
--flash-attn on \
--threads-batch 32 \
--temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0 \
--metrics --verbose
| Flag | Why |
|---|---|
| --jinja | V4 uses the embedded chat template for tool calls |
| --cache-type-k|v q8_0 | Don’t. Silently overridden to fp16. Save the flag for non-V4 architectures. |
| --reasoning off | Default for agent workloads — <think> blocks add latency without changing tool-call quality, and an unbounded budget can produce 4000+ token thinking bursts that eat the per-task timeout. Use --reasoning auto only when you specifically want chain-of-thought visible in the output. |
| --ctx-size 393216 | The model card recommends ≥384k for Think Max mode; 384k also covers any agent context we’ve seen. Drop to 131072 if KV-cache footprint is tight. |
| --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0 | Match the model card and generation_config.json: V4 was trained against full-distribution sampling. Lower temps (0.7) collapse onto highest-prior tokens and increase repetition / over-determinism — observed empirically as agent narration loops on smaller quants and over-engineered FFI rabbit holes on larger ones. |
Anthropic-shape thinking field is silently ignored. Claude Code clients send "thinking": {"type": "adaptive"} per the new Anthropic API. llama-server’s compat layer only extracts a budget when type == "enabled" and budget_tokens is present — adaptive falls through and the request uses whatever the server defaults to. The server-side --reasoning / --reasoning-budget flags are the actual ceiling regardless of what the client sends. Don’t rely on client-side budgets reaching the model.
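A sketch of the compat-layer behavior as described, using nlohmann::json (which llama-server vendors); the function and return convention are illustrative, not the server's actual code:

#include <nlohmann/json.hpp>
#include <cstdio>

using json = nlohmann::json;

// Returns the client-requested thinking budget, or -1 to fall back to server defaults.
static int extract_thinking_budget(const json & req) {
    if (!req.contains("thinking")) return -1;
    const auto & t = req.at("thinking");
    if (t.value("type", "") == "enabled" && t.contains("budget_tokens")) {
        return t.at("budget_tokens").get<int>();
    }
    return -1; // "adaptive" (and anything else) falls through silently
}

int main() {
    const json adaptive = { {"thinking", { {"type", "adaptive"} }} };
    const json enabled  = { {"thinking", { {"type", "enabled"}, {"budget_tokens", 2048} }} };
    std::printf("adaptive -> %d\n", extract_thinking_budget(adaptive)); // -1: ignored
    std::printf("enabled  -> %d\n", extract_thinking_budget(enabled));  // 2048
}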
Per-request max_tokens overrides server-side --predict. If you set --predict N server-side hoping to cap runaway generation, be aware that any per-request max_tokens value (Claude Code sends 48000 by default) takes precedence. The server-side flag is only a fallback when the client sends nothing. To enforce a real ceiling, clamp max_tokens in a proxy.
Memory footprint (Q4_K_M, 128k ctx)
| Component | Size |
|---|---|
| Model weights (Q4_K_M) | ~160 GiB |
| KV cache (forced f16, 128k ctx) | ~6 GiB |
| Compute buffers | ~7 GiB |
| Total on Metal device | ~173 GiB |
Q2_K-XL cuts the model weight to ~97 GiB; Q2_K-XXL is ~99 GiB. Q8_0 grows it to ~282 GiB. KV cache is fp16 in all cases regardless of the --cache-type-* flags.
Performance (M3 Ultra, NGL=999)
Decode tok/s from tests/v4-port/gate-speed.sh (small prompt, single-slot):
| Quant | Size | BPW | Decode tok/s | Notes |
|---|---|---|---|---|
| Q8_0 | 282 GiB | 8.50 | ~16 | reference |
| Q4_K_M | 160 GiB | 4.84 | 15.69 | known-good baseline |
| Q2_K-XL | 97 GiB | 2.92 | 16.38 | small but unstable on long agent prompts |
| Q2_K-XXL | 99 GiB | 2.98 | ~16 | recommended small variant |
Decode is compute-bound, not bandwidth-bound, at small contexts on V4 — the lightning indexer, hyper-connection sinkhorn iterations, and per-token expert routing dominate. Going from 4.84 BPW down to 2.92 BPW only shaves ~4% off decode time at gate-speed scale.
Quant choice for agent workloads (the Q8 surprise)
Gate-speed numbers above are measured on tiny prompts. They mislead. The same quants under realistic 25k-token agent contexts (Claude Code system prompt + a handful of tool results) diverge sharply:
| Quant | Decode @ 25k ctx | Δ vs gate-speed |
|---|---|---|
| Q8_0 | 16.62 tok/s | +4% (essentially flat) |
| Q4_K_M | 9.95 tok/s | -37% |
Q8_0 is 1.7× faster than Q4_K_M at agent context, despite reading 1.76× more bytes per token. The bandwidth-vs-compute intuition you’d apply to a typical dense model breaks here. Why:
- V4 decode is compute-bound on the indexer / sinkhorn / expert-routing path, not on weight bandwidth. These ops do roughly the same compute regardless of weight quantization — the dequant pass is incremental.
- Q8_0’s dequant kernel is the simplest possible: int8 × per-block scale, almost no bit-shuffling. Q4_K_M unpacks super-blocks with multi-level scales (4-bit quants + 6-bit scales + 6-bit mins) — much more setup per fetch.
- Per-matmul setup cost dominates at batch=1 when V4’s MoE routes to 6 of 256 experts per token. ~18 small matmuls per token (gate/up/down × 6) — Q4_K_M’s heavier unpack loses to Q8_0’s straightforward path even with more bytes to read.
- Long context exposes the gap: Q8 is essentially flat across context (~16 tok/s at gate-speed → 16.6 tok/s at 25k); Q4_K_M drops 37%. The flat-vs-context-scaling pattern points at ops that scale with context length — namely the lightning indexer and compressed-attention projections.
The fix is a recipe, not a kernel rewrite (yet). Q4_K_M-XXL (recipe above) pins exactly the tensors that scale with context — indexer.*, attn_q_a/q_b, attn_kv, attn_output_a/b, hc_* — at Q8_0 while leaving the routed expert FFN at Q4_K. ~163 GiB / 4.92 BPW (+3 GiB / +0.08 BPW over plain Q4_K_M). The proper fix is adding K-quant fast paths to the V4 Metal kernels (lightning indexer, hc_split_sinkhorn, etc.) so plain Q4_K_M decodes at Q8 speed; that’s a follow-up PR.
Practical guidance:
| Use case | Recommended quant |
|---|---|
| Agent / Claude Code (you have ≥350 GiB RAM headroom) | Q8_0 — fastest at agent context, model card-recommended sampling works out of the box |
| Agent / Claude Code (footprint-constrained) | Q4_K_M-XXL — ~163 GiB, recovers most of Q8’s agent-context speed |
| Chat / one-shot generation | Any quant; gate-speed numbers apply |
| Smallest viable footprint, accept some loop risk | Q2_K-XXL — ~99 GiB |
Validation gates
tests/v4-port/run-all-gates.sh runs:
- gate-loader — confirms architecture is recognized
- gate-coherence (NGL=0 + NGL=999) — short coherent decode at temp 0
- gate-speed — minimum 10 tok/s decode floor
- gate-tools — 5/5 tool-call requests against a weather fixture
- gate-server-chat — three chat completions (small + medium prompts + tool fixture)
- gate-server-chat-q8 (MODE=warn and MODE=warn-fa-off) — exercises the q8 KV silent-override path
End-to-end: a largest-eigenval Terminal-Bench trial completed in 5:47 with reward=1.0 on Q2_K-OutQ8, confirming the full agentic loop (tool calls, multi-turn reasoning, file writes, test runs) works at the smallest practical quant.
Open items
- llama-imatrix segfault on V4 — blocks IQ2_XXS / IQ2_M / IQ1_M quants. Crash is in the activation-collection hook during the first chunk’s forward pass, with the scheduler misclassifying V4’s hyper-connection or recurrent-state tensors as Gated Delta Net.
- K-quant fast paths in V4 Metal kernels — the Q8 vs Q4 agent-context speed gap (1.7×) is caused by V4’s lightning indexer + sinkhorn + hc_* kernels having a slow generic dequant path for K-quants. The XXL recipe works around this by pinning those tensors at Q8_0; a real fix adds typed specializations like kernel_lightning_indexer_q4_K_f32. Estimated ~1–2 days per kernel for clean specializations; the lightning indexer is the highest-priority target.
- CUDA backend coverage — only the lightning indexer has a CUDA kernel (560 LOC). The other five V4 ops (DSV4_HC_SPLIT_SINKHORN, DSV4_HC_WEIGHTED_SUM, DSV4_HC_EXPAND, DSV4_FP8_KV_QUANTIZE, DSV4_ROPE_TAIL) only have CPU and Metal implementations. Running on CUDA today would round-trip activations CUDA→CPU→CUDA per layer, slower than CPU-only.
- DSA Metal kernels for V3.2 — V4’s metal kernels work because they’re V4-specific; V3.2 sparse-attention models still won’t run on Metal in this build.
- Upstreaming — V3.2/DSA via PR #21149 needs to land first. V4 will likely follow as a separate PR layered on top.
Links
- Fork: cchuter/llama.cpp@feat/v4-port
- V3.2/DSA upstream PR: ggml-org/llama.cpp#21149 (fairydreaming)
- Base model: deepseek-ai/DeepSeek-V4-Flash
- Completion writeups in the fork: docs/plans/v4-port-debug-completion.md, docs/plans/v4-port-kv-q8-completion.md
- HF model card: docs/model-cards/DeepSeek-V4-Flash-Q4_K_M.md
Comments, corrections, and PRs welcome.