Pruning and Quantizing Qwen 3.5 35B-A3B MoE with REAP and Modal

Sandesh / March 05, 2026

Mixture-of-Experts (MoE) models like Qwen 3.5 offer some of the best performance per unit of compute available today. However, their massive weight files, often exceeding 70GB, make them a challenge to deploy on consumer hardware.

I recently completed a project to prune the Qwen3.5-35B-A3B model by 32%, resulting in a leaner version: Qwen3.5-24B-A3B-REAP-0.32. But I didn't stop at pruning. Using advanced GGUF quantization techniques, I've produced a highly optimized local-inference version that punches far above its weight class.

Even quantized GGUFs like the ones from Unsloth don't fit in 16GB of VRAM. Recent posts on Twitter and Reddit boast about running the Qwen3.5 35B MoE model on 24GB RTX 3090 cards, but most of us don't have access to such beefy cards. That was one of my motivations for this pruning attempt: to be able to run the SOTA model for its size on my 16GB NVIDIA card.

Before going into more detail, here's a rough explanation of how I chose a few parameters. I'm using 32% pruning because it's one of the defaults in the script, and also because it yields a model of around 15GB when using Q4 quants. That's just small enough for my 16GB card to hold the model while leaving some space for the KV cache. Ideally, to replicate the minimal loss during pruning, I would have used a 25% pruning ratio, as Cerebras did when pruning the GLM-4.7-Flash MoE model, which is a similar 30B-class MoE model. For >100B models, Cerebras even uses a 50% pruning ratio, but that seems to be effective mostly for the bigger models.
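As a rough sanity check on the 15GB figure, file size is approximately parameter count times average bits per weight. The sketch below assumes about 24B parameters (taken from the pruned model's name) and an average of 5 bits per weight. That second number is an assumption for a Q4_K_M base with some tensors kept at higher precision; the real average depends on the tensor mix, and the helper name is hypothetical.

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate file size in GB (10^9 bytes), ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed inputs: ~24B parameters, ~5 bits per weight on average.
estimated_size = gguf_size_gb(24e9, 5.0)

# Whatever is left on a 16GB card has to hold the KV cache and runtime buffers.
headroom = 16.0 - estimated_size

print(f"estimated size: {estimated_size:.1f} GB, headroom: {headroom:.1f} GB")
```

The estimate lands close to the observed size, and it also shows how little room remains for context, which is why a lower pruning ratio would not have fit.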

P.S. You can find the orchestration script I used to prune and quantize the models here: sandeshrajbhandari/reap-qwen3.5-modal


The Core: Extending REAP for Qwen 3.5

The standard REAP implementation was built for earlier MoE architectures. To support the cutting-edge Qwen 3.5 series, I created a fork of the REAP repository and developed the feat/qwen3.5-moe-support branch.

Key Technical Fixes in the Fork:

  1. The "Gate" naming convention: Updated src/reap/prune.py to resolve the routing layer using getattr(moe, "router", getattr(moe, "gate", None)).
  2. Handling Forward Pass Variations: Patched src/reap/observer.py to detect single-tensor returns from SparseMoeBlock and wrap them in compatible tuples.
  3. Dtype Mismatch in Metrics: Implemented explicit casting in src/reap/metrics.py to fix RuntimeError in scatter_add_ during similarity computation.
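The first two fixes can be sketched roughly as follows. This is a minimal illustration, not the code from the fork: the helper names are hypothetical, and the assumption that the tuple form is (hidden_states, router_logits) is mine. Only the getattr fallback comes from the description above.

```python
from types import SimpleNamespace


def resolve_router(moe):
    """Find the routing layer whether the block calls it `router` or `gate`."""
    router = getattr(moe, "router", getattr(moe, "gate", None))
    if router is None:
        raise AttributeError("MoE block has neither a 'router' nor a 'gate' attribute")
    return router


def normalize_moe_output(output):
    """Return a tuple even when the block's forward pass returns one tensor.

    Assumption: downstream code expects (hidden_states, router_logits),
    so a bare tensor is paired with None.
    """
    if isinstance(output, tuple):
        return output
    return (output, None)


# Stand-ins for real modules, just to exercise both naming conventions.
old_style = SimpleNamespace(gate="gate-layer")
new_style = SimpleNamespace(router="router-layer")
print(resolve_router(old_style), resolve_router(new_style))
```

Failing loudly when neither attribute exists is a deliberate choice here: a silent None would only surface later as a confusing error inside the pruning loop.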

High-Precision Quantization: The Unsloth-Style Recipe

To ensure the pruned model didn't lose its "intelligence," I implemented a high-precision GGUF quantization pipeline based on the "Unsloth recipe."

1. Importance Matrix (imatrix)

I generated a custom Importance Matrix using llama-imatrix and a diverse calibration corpus. The matrix records activation statistics over that corpus, which tells the quantizer which weights most affect the model's outputs. It can then prioritize precision for those parameters while compressing less important ones more aggressively.
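To make the idea concrete, here is a toy illustration of importance-weighted quantization. It is not llama.cpp's algorithm: the weights, importance values, grid, and function names are all made up for the example. It only shows that picking a quantization scale by importance-weighted error protects the weights that matter more.

```python
def quantize(weights, scale, levels=7):
    """Round each weight to the nearest step of `scale`, clamped to +/- `levels` steps."""
    return [max(-levels, min(levels, round(w / scale))) * scale for w in weights]


def weighted_error(weights, scale, importance):
    """Sum of squared rounding errors, each multiplied by its importance."""
    dequantized = quantize(weights, scale)
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))


def best_scale(weights, importance, candidates):
    """Pick the candidate scale with the lowest importance-weighted error."""
    return min(candidates, key=lambda s: weighted_error(weights, s, importance))


weights = [0.9, -0.05, 0.31, -0.62, 0.02, 0.47]
importance = [0.1, 8.0, 0.2, 0.1, 6.0, 0.3]  # hypothetical activation statistics
uniform = [1.0] * len(weights)

# Candidate scales around the naive choice max|w| / levels.
base = max(abs(w) for w in weights) / 7
candidates = [base * k / 20 for k in range(10, 31)]

plain_scale = best_scale(weights, uniform, candidates)
imatrix_scale = best_scale(weights, importance, candidates)
print(plain_scale, imatrix_scale)
```

With the importance values above, the small weights carry most of the weight in the error term, so the search is free to choose a scale that represents them better at the expense of the large, less important ones.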

2. Custom Tensor Precision

Instead of relying on the stock Q4_K_M tensor mix, I used custom overrides to force critical components into 8-bit (Q8_0):

  • Attention Gates & QKV: Preserves the core attention mechanism accuracy.
  • Shared Experts: These are used in every token pass, so maintaining 8-bit precision is vital for stability.
  • Token Embeddings: Improves vocabulary comprehension and retrieval.

./llama-quantize \
  --imatrix imatrix.dat \
  --token-embedding-type q8_0 \
  --output-tensor-type q6_k \
  --tensor-type "attn_gate=q8_0" \
  --tensor-type "attn_qkv=q8_0" \
  --tensor-type "ffn_down_shexp=q8_0" \
  --tensor-type "ffn_gate_shexp=q8_0" \
  --tensor-type "ffn_up_shexp=q8_0" \
  --tensor-type "ffn_down_exps=q5_k" \
  model-f16.gguf output-IQ4_K_M.gguf Q4_K_M
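When the override list changes between experiments, it can be easier to build this command from a table than to edit it by hand. The following is a small sketch of that approach. The helper name and its parameters are hypothetical; the flags, tensor names, and file names are the ones from the command above.

```python
import shlex


def build_quantize_cmd(src, dst, base_type, imatrix, token_embedding_type,
                       output_tensor_type, overrides):
    """Assemble the llama-quantize argument list from a dict of per-tensor overrides."""
    cmd = [
        "./llama-quantize",
        "--imatrix", imatrix,
        "--token-embedding-type", token_embedding_type,
        "--output-tensor-type", output_tensor_type,
    ]
    for tensor_name, quant_type in overrides.items():
        cmd += ["--tensor-type", f"{tensor_name}={quant_type}"]
    # Positional arguments come last: input file, output file, base quant type.
    cmd += [src, dst, base_type]
    return cmd


OVERRIDES = {
    "attn_gate": "q8_0",
    "attn_qkv": "q8_0",
    "ffn_down_shexp": "q8_0",
    "ffn_gate_shexp": "q8_0",
    "ffn_up_shexp": "q8_0",
    "ffn_down_exps": "q5_k",
}

cmd = build_quantize_cmd(
    src="model-f16.gguf",
    dst="output-IQ4_K_M.gguf",
    base_type="Q4_K_M",
    imatrix="imatrix.dat",
    token_embedding_type="q8_0",
    output_tensor_type="q6_k",
    overrides=OVERRIDES,
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting issues.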

Overcoming Hardware Hurdles with Modal

Hardware limits were the main bottleneck in this project. Here is how Modal helped me get past them:

  • Scaling the VRAM Wall: Upgraded heavy profiling tasks to A100-80GB with a single line change.
  • GGUF Filesystem Hacks: The pipeline patches config.json at runtime so that llama.cpp recognizes the new Qwen 3.5 architecture during conversion. After pruning, config.json listed Qwen3_5ForCausalLM as its architecture, whereas llama.cpp and the official Qwen3.5 models use Qwen3_5MoeForConditionalGeneration. The patch is a hand-written stopgap for now.
  • Robust Sharded Uploads: Developed a script to shard the 50GB+ Safetensors into 5GB pieces, ensuring reliable transfers to Hugging Face despite the large file sizes. I still haven't uploaded the F16 GGUF for future quantization because I haven't sharded it yet, so for now I convert from HF to GGUF every time I run the quant scripts.
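The config.json patch amounts to rewriting one field before the conversion step runs. Here is a minimal sketch of that idea, assuming the architecture name lives in the standard "architectures" list of a Hugging Face config. The function name is hypothetical and the demo uses a throwaway file rather than a real checkpoint.

```python
import json
import tempfile
from pathlib import Path


def patch_architecture(config_path, new_arch):
    """Overwrite the `architectures` field in config.json and return the old value."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    previous = config.get("architectures", [])
    config["architectures"] = [new_arch]
    path.write_text(json.dumps(config, indent=2))
    return previous


# Demo on a temporary file standing in for the pruned model's config.json.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.json"
    cfg.write_text(json.dumps({"architectures": ["Qwen3_5ForCausalLM"]}))
    previous = patch_architecture(cfg, "Qwen3_5MoeForConditionalGeneration")
    patched = json.loads(cfg.read_text())

print(previous, "->", patched["architectures"])
```

Returning the previous value makes it easy to log what was replaced, or to restore it after conversion if the original config should stay untouched.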

The Final Result

The successfully pruned and high-precision quantized models are now available on Hugging Face.

This project demonstrates that with the right pruning techniques, high-precision quantization, and serverless compute, we can make state-of-the-art MoE models accessible to everyone.