Q1 2026 model scorecard — every frontier release ranked

Q1 2026 saw 255 model releases. Here are the ones that matter, ranked and compared.

The scorecard

Model | Provider | SWE-bench | GPQA | Arena Elo | Input $/M | Context | Verdict
Claude Opus 4.6 | Anthropic | 80.8% | 94.3% | 1504 | $5.00 | 1M | Best reasoning quality
Gemini 3.1 Pro | Google | 78.8%* | 90.8% | 1493 | $2.00 | 2M | Best value frontier
GPT-5.4 | OpenAI | N/R | 92.0% | 1484 | $2.50 | 256K | Best computer-use
Grok 4.20 β2 | xAI | — | 86.2% | 1491 | — | — | Dark horse
DeepSeek V4 | DeepSeek | 72.5% | 84.0% | 1445 | $0.28 | 128K | Best price/performance
Llama 4 Maverick | Meta | 68.5% | 78.0% | — | Free | 10M | Best open-weight
Qwen 3.6 Plus | Alibaba | — | 82.0% | — | $0.30 | 1M | Best open CJK
Gemma 4 | Google | — | 72.0% | — | Free | 128K | Best for consumer HW

*Gemini 3.1 Pro leads SWE-bench among GA models. N/R = OpenAI no longer reports SWE-bench Verified.

Key takeaways

1. The pricing race to the bottom

Frontier model input pricing, top to bottom: Opus $5.00 → GPT-5.4 $2.50 → Gemini $2.00 → DeepSeek $0.28. A year ago, $5/M was cheap. Now it’s the premium tier. DeepSeek V4 delivers roughly 85% of frontier quality at under 6% of the cost.
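To make the gap concrete, here is the arithmetic for a hypothetical workload of 1B input tokens per month (the workload size is an illustrative assumption; prices are the scorecard's):

```python
# Monthly input cost at each scorecard price point, assuming a
# hypothetical workload of 1B input tokens per month.
PRICES_PER_M = {  # input $ per 1M tokens, from the scorecard
    "Claude Opus 4.6": 5.00,
    "GPT-5.4": 2.50,
    "Gemini 3.1 Pro": 2.00,
    "DeepSeek V4": 0.28,
}

def monthly_cost(tokens: int, price_per_m: float) -> float:
    """Cost in dollars for `tokens` input tokens at `price_per_m` $/1M."""
    return tokens / 1_000_000 * price_per_m

for model, price in PRICES_PER_M.items():
    print(f"{model}: ${monthly_cost(1_000_000_000, price):,.0f}/mo")
# Opus at $5/M costs $5,000/mo; DeepSeek at $0.28/M costs $280/mo.
```

At this volume the spread is $5,000/mo versus $280/mo, which is why the price axis now drives model choice as much as the quality axis.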

2. Context windows are diverging

Llama 4 at 10M tokens, Gemini at 2M, Claude at 1M, GPT at 256K, DeepSeek at 128K. If your workload needs long context (large codebase analysis, document processing, long conversations), this matters more than benchmark scores.
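A quick way to see whether the window matters for your workload is a back-of-envelope fit check. This sketch uses the common ~4 characters/token heuristic, which is a rough assumption, not an exact tokenizer count:

```python
# Back-of-envelope: will a corpus fit in one context window?
# Assumes ~4 characters per token (a rough heuristic, not exact).
CONTEXT_TOKENS = {  # windows from the scorecard
    "Llama 4 Maverick": 10_000_000,
    "Gemini 3.1 Pro": 2_000_000,
    "Claude Opus 4.6": 1_000_000,
    "GPT-5.4": 256_000,
    "DeepSeek V4": 128_000,
}

def fits(chars: int, window_tokens: int, chars_per_token: int = 4) -> bool:
    """True if `chars` characters fit in `window_tokens` at the heuristic rate."""
    return chars / chars_per_token <= window_tokens

# A ~500k-line codebase at ~40 chars/line is ~20M chars ≈ 5M tokens:
codebase_chars = 20_000_000
for model, window in CONTEXT_TOKENS.items():
    verdict = "fits whole" if fits(codebase_chars, window) else "needs chunking/retrieval"
    print(f"{model}: {verdict}")
```

For that hypothetical codebase, only the 10M window holds everything at once; every other model forces you into chunking or retrieval, which is an architectural decision benchmarks don't capture.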

3. Benchmarks are breaking

OpenAI stopped reporting SWE-bench Verified scores, citing data contamination. Claude Mythos Preview (93.9%) is a research model not available via API. The gap between “benchmark performance” and “production usefulness” is widening. Arena Elo (based on human preference) is arguably the most honest metric left.
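It helps to translate Elo gaps into something interpretable. The standard Elo expected-score formula gives the predicted head-to-head preference rate between two ratings:

```python
# Standard Elo expected-score formula:
# P(A preferred over B) = 1 / (1 + 10 ** ((Rb - Ra) / 400))
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected rate at which A is preferred over B in pairwise votes."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

# Opus 4.6 (1504) vs DeepSeek V4 (1445): a 59-point gap
print(round(elo_win_prob(1504, 1445), 3))
```

A 59-point gap works out to roughly a 58% preference rate: a real edge, but far from a blowout, which is consistent with the "closing gap" story in the next section.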

4. Open-source is closing the gap

DeepSeek V4 (1T params, $0.28/M) and Llama 4 (400B MoE, free) are within 10 points of frontier on most benchmarks. For many production workloads, the quality difference is invisible. The remaining gap is in novel reasoning, long-context coherence, and instruction following.

5. The real differentiator is ecosystem

Benchmarks converge. What matters now: context window size, tool-use quality, agentic capabilities, API ergonomics, and pricing. Claude’s 1M context + Agent SDK + Managed Agents is a different proposition than GPT-5.4’s computer-use strength, which is different from Llama 4’s self-hosting flexibility.

How to choose

Budget-constrained, high volume: DeepSeek V4 or V4 Lite

Need the best quality, cost no object: Claude Opus 4.6

Best all-around value: Gemini 3.1 Pro ($2/M, 2M context, strong benchmarks)

Self-hosting required: Llama 4 Maverick (400B MoE) or Gemma 4 (consumer hardware)

Computer-use / web automation: GPT-5.4

Building agents: Claude (Agent SDK + Managed Agents ecosystem)
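The decision list above can be sketched as a simple routing function. The function name, flags, and priority order are illustrative assumptions for this post, not any vendor's API:

```python
# Minimal routing sketch following the "How to choose" guidance.
# All names and flags are illustrative; priority order is one
# reasonable reading of the list, not a canonical ranking.
def choose_model(budget_sensitive: bool = False,
                 needs_self_hosting: bool = False,
                 needs_computer_use: bool = False,
                 building_agents: bool = False,
                 max_context_needed: int = 100_000) -> str:
    if needs_self_hosting:
        # Gemma 4 runs on consumer hardware but caps at 128K context
        if max_context_needed > 128_000:
            return "Llama 4 Maverick"
        return "Gemma 4"
    if needs_computer_use:
        return "GPT-5.4"
    if building_agents:
        return "Claude Opus 4.6"  # Agent SDK + Managed Agents ecosystem
    if budget_sensitive:
        return "DeepSeek V4"
    return "Gemini 3.1 Pro"  # best all-around value per the scorecard
```

For example, `choose_model(needs_self_hosting=True, max_context_needed=5_000_000)` routes to Llama 4 Maverick, while the no-constraints default lands on the all-around value pick.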