LM Arena (Chatbot Arena) Elo Rankings
| Rank | Model | Provider | Elo score |
|---|---|---|---|
| 1 | Claude Opus 4.6 Thinking | Anthropic | 1504 |
| 2 | Gemini 3.1 Pro Preview | Google | 1493 |
| 3 | Grok 4.20 Beta1 | xAI | 1491 |
| 4 | GPT-5.4 High | OpenAI | 1484 |
| 5 | Claude Sonnet 4.6 Thinking | Anthropic | 1478 |
| 6 | GPT-5.4 | OpenAI | 1470 |
| 7 | Gemini 3.1 Flash | Google | 1455 |
| 8 | DeepSeek V4 | DeepSeek | 1445 |
LM Arena (formerly LMSYS Chatbot Arena) ranks models using Elo ratings from crowdsourced human pairwise comparisons. Users chat with two anonymous models and vote for the better response.
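To make the mechanics concrete, here is a minimal sketch of how Elo ratings can emerge from a stream of pairwise votes. It assumes a textbook online Elo update with a fixed K-factor; the model names, vote data, and `k=4.0` value are illustrative, and LM Arena's production pipeline (which has used Bradley-Terry-style fitting) may differ.

```python
# Minimal sketch: online Elo updates from crowdsourced pairwise votes.
# Assumes a standard fixed K-factor scheme; LM Arena's real pipeline may differ.
# Model names and votes below are illustrative, not real leaderboard data.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, winner: str, loser: str, k: float = 4.0) -> None:
    """Shift both ratings toward the observed outcome (winner 1, loser 0)."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

ratings = {"model_a": 1500.0, "model_b": 1500.0}  # every model starts equal
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    update(ratings, winner, loser)
print(ratings)  # model_a pulls ahead after winning 2 of 3 comparisons
```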
Trends — April 2026
- Reasoning-optimized models dominate. Claude Opus 4.6 Thinking uses hidden chain-of-thought to check and revise its answers before the user sees them.
- Grok 4.20 Beta1 disrupts the top tier, climbing to #3 globally and surpassing GPT-5.4.
- Gemini 3.1 Pro Preview outperforms GPT-5.4 High by 9 Elo points in the text arena; under the Elo model that gap implies only a slightly better-than-even head-to-head win rate (see the sketch after this list).
- Anything above 1400 Elo is considered frontier-level performance.
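As a rough guide to what these point gaps mean, the generic Elo expected-score formula converts a rating difference into a head-to-head win probability. This is standard Elo arithmetic, not LM Arena's published methodology:

```python
# Expected win probability for a given Elo gap (generic Elo formula).
def win_prob(gap: float) -> float:
    return 1.0 / (1.0 + 10 ** (-gap / 400))

print(f"{win_prob(9):.3f}")   # ~0.513: a 9-point lead is barely better than a coin flip
print(f"{win_prob(59):.3f}")  # ~0.584: the gap between #1 (1504) and #8 (1445)
```

By this measure, even the 59-point spread between #1 and #8 predicts the leader winning only about 58% of head-to-head comparisons.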
The leaderboard updates daily as thousands of new human comparisons are processed.