insidejob
Sat, Apr 11 First edition. March 2026 was the densest model release window in AI history — GPT-5.4, Gemini 3.1, DeepSeek V4 (1T params), and Claude Managed Agents all shipped. Open-source models now match proprietary on many benchmarks. full summary
Latest News 7

A comprehensive guide to MITRE ATLAS — 16 tactics, 84 techniques, and 42 case studies for understanding adversarial threats to AI/ML systems.

A technical breakdown of prompt injection attack classes, real CVEs, and the defense mechanisms that work — and those that don't.

Head-to-head comparison of every major model released in Q1 2026. Benchmarks, pricing, context windows, and verdict for each.

Concrete attack scenarios for each OWASP LLM risk, mapped to real CVEs and agentic AI systems. Not a summary — a practitioner's guide.

Working code examples, SDK vs CLI comparison, and when to use which. A practical guide to the renamed Claude Agent SDK.

A cost and capability comparison of Anthropic's three agent execution models. Pricing math, code examples, and decision framework.

Pricing comparison, cost-per-task calculations, and benchmark analysis. When DeepSeek V4 makes sense and when it doesn't.

Releases 3
  • Fully managed agent harness on Anthropic infrastructure
  • Secure sandboxing and long-running sessions
  • Multi-agent coordination in research preview
  • Record 83% on GDPval
  • Record scores on OSWorld-Verified and WebArena Verified
  • Standard, Thinking, and Pro variants
  • 1M context window at standard pricing
  • Opus 80.8% and Sonnet 79.6% on SWE-bench Verified
  • Adaptive, extended, and interleaved thinking
Models 8 pricing per 1M tokens
Model Provider In/Out Context Benchmark
Qwen 3.6 Plus Alibaba $0.3/$1.2 1M GPQA 82%
Gemma 4 Google free 128K GPQA 72%
DeepSeek V4 DeepSeek $0.28/$1.1 128K SWE 72.5%
GPT-5.4 OpenAI $2.5/$10 256K GPQA 92%
Gemini 3.1 Pro Google $2/$12 2M SWE 78.8%
Claude Opus 4.6 Anthropic $5/$25 1M SWE 80.8%
Claude Sonnet 4.6 Anthropic $3/$15 1M SWE 79.6%
Llama 4 Maverick Meta free 10M SWE 68.5%
Security 2 rss
Benchmarks 3

GPQA Diamond

  1. Claude Opus 4.6 94.3
  2. GPT-5.4 92
  3. GPT-5.3 Codex 91.5
  4. Gemini 3.1 Pro 90.8
  5. Claude Sonnet 4.6 88.5

SWE-bench Verified

  1. Claude Mythos Preview 93.9
  2. GPT-5.3 Codex 85
  3. Claude Opus 4.5 80.9
  4. Claude Opus 4.6 80.8
  5. Claude Sonnet 4.6 79.6

LM Arena (Chatbot Arena) Elo Rankings

  1. Claude Opus 4.6 Thinking 1504
  2. Gemini 3.1 Pro Preview 1493
  3. Grok 4.20 Beta1 1491
  4. GPT-5.4 High 1484
  5. Claude Sonnet 4.6 Thinking 1478
ATLAS 5.5.0 16T / 101t / 66st