TLDR
- Unified Dual-Persona: Thinking mode (Chain-of-Thought ON) vs. Non-Thinking mode (Zero-Shot Speed) in the same weights – no model-swapping gymnastics needed.
- User-Tunable Thinking Budget: Cap reasoning tokens to trade FLOPs for latency without lobotomizing quality.
- Parameter Buffet: 0.6B → 235B, dense and 128-expert MoE variants (22B active params on the flagship).
- Multilingual Jump: From 29 → 119 languages + dialects; Western Euro to low-resource Afrikaans, all on one tokenizer.
- Train-Corpus: 36T tokens (PDF OCR, synthetic math/code, long-context dumps). Basically a bibliophagic leviathan.
- Open-Sourced: Under Apache-2.0 because Alibaba clearly wants community mind-share (and bug-fixes) for free.
- SOTA-Level Benchmarks: Beats DeepSeek-V3 & Llama-4 Maverick on most leaderboards while running ~⅔ the active params. Coding gains are spicy (+12-15% EvalPlus).
Why the "thinking" toggle is so important
Most LLMs either overshare their chain-of-thought or get gagged by RLHF safety masks. Qwen3's /think vs. /no_think flags let you pick per request:
- Latency: fall back to fast, zero-shot responses from the same model weights when a request doesn't need chain-of-thought.
- Cost: cut inference spend by disabling thinking entirely, or capping it, on requests where it adds nothing.
Expect others (OpenAI, Google) to crib this paradigm yesterday.
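Here is what the toggle looks like in plain Hugging Face code, as a minimal sketch. It assumes the enable_thinking switch exposed by Qwen3's chat template; check the model card for the exact knob in your transformers version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Is 9.11 larger than 9.9?"}]

# Thinking ON: the template enables the reasoning persona; the model emits a
# <think>...</think> block before its final answer.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Thinking OFF: same weights, straight zero-shot answer, lower latency.
# prompt = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
# )

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```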
Architectural Design
Layer/Feature | Why? |
---|---|
Grouped-Query Attention + QK-Norm | Saner scaling at 128k context; less KV cache bloat. |
MoE w/ 128 experts, 8 hot | “Cheap” depth-wise width; 22B active beats 37B DeepSeek-V3. |
YaRN + Dual-Chunk Attention | 4× extrapolation; 32k context native, 128k via RoPE magic. |
SwiGLU, RMSNorm | Yada yada; standard post-Llama hygiene. |
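To make the "128 experts, 8 hot" row concrete, here's a toy top-k routing layer. Dimensions, gating details, and the (missing) load-balancing loss are illustrative, not Qwen3's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Toy top-k MoE layer: 128 experts, only 8 run per token."""
    def __init__(self, d_model=1024, d_ff=2048, n_experts=128, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: [tokens, d_model]
        scores = self.router(x)                    # [tokens, n_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen 8
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out                                  # per-token FLOPs scale with top_k, not n_experts
```

That last comment is the whole trick: per-token compute scales with the 8 hot experts, not the 128 total, which is how 22B active parameters get to argue with a 37B-active DeepSeek-V3.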
The Kalomaze Intervention: Qwen MoE 30b-A3B → 16B Without Breaking a Sweat

Community contributions are already highlighting Qwen3's robustness. Notably, Kalomaze took the Qwen MoE 30b-A3B model and demonstrated its remarkable resilience to pruning. The original model features unbalanced experts, a design choice that, as Kalomaze discovered, allows for significant parameter reduction without a corresponding performance drop.
As detailed in this X.com post, Kalomaze was able to prune the 30 billion total parameter model down to a mere 16 billion parameters (the Qwen3-16B-A3B model) while retaining its performance. This is a testament to the efficiency of the unbalanced expert architecture and opens up exciting possibilities for deploying powerful MoE models in more constrained environments.
The key takeaway? Qwen3's MoE isn't just about scale; it's about smart, adaptable scale that the community can further optimize.
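For flavor, here's a generic sketch of usage-based expert ranking, the rough idea behind this kind of pruning. It is not Kalomaze's actual recipe; the helper and the random stand-in logits are purely illustrative:

```python
import torch

def rank_experts_by_usage(router_logits_per_layer, n_experts=128, top_k=8):
    """Rank experts by how often they land in the router's top-k on calibration data.
    Least-used experts come first, making them natural pruning candidates."""
    counts = torch.zeros(n_experts)
    for logits in router_logits_per_layer:             # each: [tokens, n_experts]
        chosen = logits.topk(top_k, dim=-1).indices
        counts += torch.bincount(chosen.flatten(), minlength=n_experts).float()
    return counts.argsort()                             # ascending usage

# Toy usage: random logits stand in for router traces captured on real calibration text.
fake_logits = [torch.randn(4096, 128) for _ in range(48)]
prune_candidates = rank_experts_by_usage(fake_logits)[:64]   # e.g. drop the 64 coldest experts
```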
Perf Receipts (Flagship 235B-A22B, Thinking Mode)
Numbers straight out of the tech report, pages 5-15.
- MMLU-Redux: 92.7
- AIME’24: 85.7 (crushes Grok-3-beta)
- EvalPlus: 77.6 (+13 on DeepSeek)
- BFCL V3 (Agent): 70.8 – top of open board
- CodeForces ELO: 2056 (98.2 percentile)
Quickstart (1-Liner)
pip install -U "transformers>=4.51.0"   # Qwen3 landed natively in 4.51; no trust_remote_code needed
huggingface-cli download Qwen/Qwen3-8B
Flip modes via system prefix:
<|im_start|>system
/think # or /no_think
<|im_end|>
Set max_think_tokens=4096 for a fair balance of latency and quality.
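To sanity-check the soft switch end-to-end, a hedged sketch: the /no_think tag rides in the system turn as above, and max_new_tokens stands in as a crude budget, since max_think_tokens is not a stock transformers generate argument and the exact budget knob depends on your serving stack.

```python
from transformers import pipeline

# Assumes the Qwen3-8B checkpoint from the quickstart above is already downloaded.
chat = pipeline("text-generation", model="Qwen/Qwen3-8B", torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "/no_think"},   # or "/think" for the reasoning persona
    {"role": "user", "content": "Summarize RoPE scaling in two sentences."},
]
out = chat(messages, max_new_tokens=256)          # crude budget: cap total generated tokens
print(out[0]["generated_text"][-1]["content"])    # last turn is the assistant reply
```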