insidejob

Benchmarks

Leaderboard snapshots tracked over time. Each ranking is sorted by score, highest first.

GPQA Diamond

#  Model              Provider   Score
1  Claude Opus 4.6    Anthropic  94.3
2  GPT-5.4            OpenAI     92
3  GPT-5.3 Codex      OpenAI     91.5
4  Gemini 3.1 Pro     Google     90.8
5  Claude Sonnet 4.6  Anthropic  88.5
6  Grok 4.20          xAI        86.2
7  DeepSeek V4        DeepSeek   84

SWE-bench Verified

#  Model                  Provider   Score
1  Claude Mythos Preview  Anthropic  93.9
2  GPT-5.3 Codex          OpenAI     85
3  Claude Opus 4.5        Anthropic  80.9
4  Claude Opus 4.6        Anthropic  80.8
5  Claude Sonnet 4.6      Anthropic  79.6
6  Gemini 3.1 Pro         Google     78.8
7  GPT-5.4                OpenAI     77.2
8  DeepSeek V4            DeepSeek   72.5

#  Model                       Provider   Score
1  Claude Opus 4.6 Thinking    Anthropic  1504
2  Gemini 3.1 Pro Preview      Google     1493
3  Grok 4.20 Beta1             xAI        1491
4  GPT-5.4 High                OpenAI     1484
5  Claude Sonnet 4.6 Thinking  Anthropic  1478
6  GPT-5.4                     OpenAI     1470
7  Gemini 3.1 Flash            Google     1455
8  DeepSeek V4                 DeepSeek   1445