at a glance

                      Qwen3.5-397B-A17B               GPT-5.4
provider              Alibaba                         OpenAI
parameters            397B total / 17B active (MoE)   ~large (est.)
context window        256k tokens                     1M tokens
input / 1M tokens     $0.172                          $2.50
output / 1M tokens    $1.032                          $15.00
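
To make the price difference concrete, here is a minimal sketch that applies the per-1M-token prices from the table to a workload. The workload size (500M input tokens, 100M output tokens) and the function name `workload_cost` are illustrative assumptions, not figures from either provider.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "Qwen3.5-397B-A17B": {"input": 0.172, "output": 1.032},
    "GPT-5.4": {"input": 2.50, "output": 15.00},
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the listed per-1M-token prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical monthly workload (an assumption for illustration only).
INPUT_TOKENS = 500_000_000
OUTPUT_TOKENS = 100_000_000

for model in PRICES:
    cost = workload_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model}: ${cost:,.2f}")
```

For this assumed workload the listed prices come to about $189.20 for Qwen3.5-397B-A17B and $2,750.00 for GPT-5.4. Self-hosting costs are not captured here; the sketch only covers hosted per-token pricing.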

benchmarks

benchmark                            Qwen3.5-397B-A17B   GPT-5.4
Cost (output, per 1M tokens)         $1.03               $15.00
Terminal Bench 2 (shell tasks)       52.5%               75.1%
TAU2-Bench (agentic tool use)        86.7%               98.9%
GPQA Diamond (graduate science)      88.4%               93.0%
HLE with tools (expert knowledge)    48.3%               52.1%
OSWorld-Verified (computer use)      62.2%               75.0%
MMMU-Pro (multimodal reasoning)      79.0%               76.6%

what are these models?

Qwen3.5-397B-A17B is the flagship model in Alibaba’s Qwen3.5 series: a Mixture-of-Experts architecture with 397 billion total parameters and 17 billion active per forward pass. It is open-weight under Apache 2.0, which places it among the largest openly licensed MoE models available. Despite activating only 17B parameters per token, it scores within 5 points of GPT-5.4 on the knowledge benchmarks and leads on MMMU-Pro, though it trails by 12 to 23 points on the agentic ones.

GPT-5.4 is OpenAI’s flagship model in the GPT-5.4 family — the highest-capability tier designed for maximum reasoning, agentic reliability, and multimodal performance. It is closed-source and accessed via OpenAI’s API.

benchmark breakdown

GPT-5.4 dominates on agentic tasks. TAU2-Bench (98.9% vs 86.7%, a 12.2-point gap) and Terminal Bench 2 (75.1% vs 52.5%, a 22.6-point gap) show the largest differences in this comparison. For multi-step tool-calling and shell automation, GPT-5.4 is in a different tier.

The OSWorld gap is 12.8 points. For desktop computer use, GPT-5.4 leads meaningfully.

Qwen3.5-397B-A17B leads on MMMU-Pro. 79.0% vs 76.6% — the only benchmark where the open model wins outright. For multimodal reasoning tasks, it has a slight edge over GPT-5.4.

The GPQA Diamond gap is 4.6 points. For graduate-level science reasoning, the two models are close but GPT-5.4 leads.

HLE with tools is competitive. 48.3% vs 52.1% — on the hardest academic knowledge benchmark, the open MoE model is within 4 points of GPT-5.4.
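
The point gaps quoted in this breakdown can be recomputed directly from the scores. The sketch below is only arithmetic over the numbers already listed; the names `SCORES` and `gap` are illustrative.

```python
# (Qwen3.5-397B-A17B, GPT-5.4) scores in percent, copied from the benchmarks above.
SCORES = {
    "Terminal Bench 2": (52.5, 75.1),
    "TAU2-Bench": (86.7, 98.9),
    "GPQA Diamond": (88.4, 93.0),
    "HLE with tools": (48.3, 52.1),
    "OSWorld-Verified": (62.2, 75.0),
    "MMMU-Pro": (79.0, 76.6),
}


def gap(benchmark: str) -> float:
    """GPT-5.4 score minus Qwen score, in points (negative means Qwen leads)."""
    qwen, gpt = SCORES[benchmark]
    return round(gpt - qwen, 1)


# Print benchmarks from the largest GPT-5.4 lead to the smallest.
for name in sorted(SCORES, key=gap, reverse=True):
    print(f"{name:<18} {gap(name):+.1f}")
```

Sorted this way, Terminal Bench 2 shows the widest gap (22.6 points) and MMMU-Pro is the only negative entry (-2.4), meaning Qwen3.5-397B-A17B leads there.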

when to use Qwen3.5-397B-A17B

  • you need the best open-weight model available at frontier scale
  • cost at scale matters — 17B active parameters means dramatically lower inference cost than a dense 397B model
  • you want self-hosting for data privacy, compliance, or latency control
  • your task is knowledge or multimodal-intensive and you want open-weight flexibility
  • you need Apache 2.0 licensing for commercial use or fine-tuning

when to use GPT-5.4

  • you need maximum agentic reliability — TAU2-Bench at 98.9% is near-perfect
  • terminal and shell automation are core to your workflow
  • you want a zero-config hosted API at the absolute frontier
  • computer-use agents are a primary use case

bringing frontier capability within reach

At 397B total parameters with ~17B active, Qwen3.5-397B-A17B delivers near-frontier knowledge capacity at the serving cost of a mid-size model. Fine-tuned on your domain data, it becomes a high-performance specialist — often approaching frontier reasoning while maintaining dramatically lower per-token costs than GPT-5.4.

For agentic tasks, where the raw gap is largest, trajectory-based fine-tuning is especially powerful. Training on your own tool-calling environment reshapes performance quickly, turning a broad 12-point TAU2 gap into a much narrower, workflow-specific problem that can be systematically closed in production.

frequently asked questions

what does “397b-a17b” mean?

it’s a mixture-of-experts model. 397b total parameters, 17b active per forward pass. the moe router selects which experts to use for each token, so per-token compute is roughly that of a 17b dense model, although all 397b parameters still have to be held in memory. knowledge capacity reflects the full 397b.
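
the routing idea above can be sketched in a few lines. this is a toy illustration of top-k expert routing, not qwen's actual implementation: the expert count (8), the number of active experts (2), the scalar "experts", and the choice to apply softmax over only the selected logits are all assumptions made for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_logits, top_k):
    """Return (expert_index, weight) pairs for the top_k experts of one token."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:top_k]
    # Renormalize over the selected experts only (one common variant).
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, experts, gate_logits, top_k):
    """Weighted sum of the outputs of the selected experts; the rest never run."""
    return sum(w * experts[i](x) for i, w in route_token(gate_logits, top_k))


# Toy setup (hypothetical sizes): 8 scalar experts, 2 active per token.
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
gate_logits = [0.0, 1.0, 3.0, 2.0, -5.0, -5.0, -5.0, -5.0]

print(route_token(gate_logits, top_k=2))
print(moe_layer(1.0, experts, gate_logits, top_k=2))
```

only 2 of the 8 toy experts are evaluated for the token, which is why compute scales with the active parameters while the full expert set still has to be stored.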

is qwen3.5-397b-a17b as good as gpt-5.4?

on multimodal reasoning: slightly better (mmmu-pro, 2.4 points ahead). on pure knowledge benchmarks: within 5 points. on agentic and terminal tasks: gpt-5.4 has a substantial lead of 12 to 23 points. for knowledge-heavy workloads, a fine-tuned qwen3.5-397b-a17b may match or exceed gpt-5.4 on your domain, though the benchmarks here only measure the base models.

can i self-host qwen3.5-397b-a17b?

yes — this is one of the most compelling reasons to use it. full weights are large, but the moe architecture means inference at 17b active params is far cheaper than a dense 397b model. typically deployed on multi-gpu systems (8x a100/h100 or similar), with quantized variants reducing requirements.
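
a rough, weights-only memory estimate shows why quantization matters for a node like this. the sketch assumes 80 gb per gpu and standard bytes-per-parameter figures for each precision; it ignores kv cache, activations, and quantization metadata, so real requirements are higher. names such as `weight_memory_gb` are illustrative.

```python
TOTAL_PARAMS = 397e9   # total parameter count, from the model name
GPU_MEMORY_GB = 80     # assumption: 80 GB A100/H100 cards
NUM_GPUS = 8           # the 8-GPU node mentioned above

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(bytes_per_param: float, params: float = TOTAL_PARAMS) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


node_capacity_gb = GPU_MEMORY_GB * NUM_GPUS

for precision, nbytes in BYTES_PER_PARAM.items():
    need = weight_memory_gb(nbytes)
    verdict = "fits" if need <= node_capacity_gb else "does not fit"
    print(f"{precision}: {need:.1f} GB of weights, {verdict} in {node_capacity_gb} GB")
```

under these assumptions the bf16 weights alone (about 794 gb) exceed the 640 gb of an 8x 80 gb node, while 8-bit (about 397 gb) and 4-bit (about 198.5 gb) weights leave headroom for the kv cache.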

should i fine-tune or use the base model?

for high-value, domain-specific tasks, fine-tuning the 397b-a17b model gives you near-frontier specialist performance at low inference cost. this combination — huge knowledge capacity, cheap inference, fine-tunable — is hard to match with closed models.