insidejob — AI intelligence, daily.

Sat, Apr 11 First edition. March 2026 was the densest model release window in AI history — GPT-5.4, Gemini 3.1, DeepSeek V4 (1T params), and Claude Managed Agents all shipped. Open-source models now match proprietary on many benchmarks. full summary

Latest News 7

MITRE ATLAS: the adversarial threat matrix for AI systems ↗ MITRE Apr 11

A comprehensive guide to MITRE ATLAS — 16 tactics, 84 techniques, and 42 case studies for understanding adversarial threats to AI/ML systems.

Prompt injection in 2026: taxonomy, real-world exploits, and defenses ↗ Multiple Apr 10

A technical breakdown of prompt injection attack classes, real CVEs, and the defense mechanisms that work — and those that don't.

Q1 2026 model scorecard — every frontier release ranked ↗ LLM Stats Apr 9

Head-to-head comparison of every major model released in Q1 2026. Benchmarks, pricing, context windows, and verdict for each.

OWASP LLM Top 10 in practice — what each risk looks like in production ↗ OWASP Apr 7

Concrete attack scenarios for each OWASP LLM risk, mapped to real CVEs and agentic AI systems. Not a summary — a practitioner's guide.

Building your first agent with the Claude Agent SDK ↗ Anthropic Apr 4

Working code examples, SDK vs CLI comparison, and when to use which. A practical guide to the renamed Claude Agent SDK.

Managed Agents vs Agent SDK vs Cloud Tasks — which harness? ↗ Anthropic Mar 31

A cost and capability comparison of Anthropic's three agent execution models. Pricing math, code examples, and decision framework.

DeepSeek V4 at $0.28/M — what 1T parameters means for cost ↗ DeepSeek Mar 10

Pricing comparison, cost-per-task calculations, and benchmark analysis. When DeepSeek V4 makes sense and when it doesn't.

Releases 3

Mar 31 Claude Managed Agents beta (2026-04-01) ↗

Fully managed agent harness on Anthropic infrastructure
Secure sandboxing and long-running sessions
Multi-agent coordination in research preview

Mar 4 GPT-5.4 5.4 ↗

Record 83% on GDPval
Record scores on OSWorld-Verified and WebArena Verified
Standard, Thinking, and Pro variants

Jan 31 Claude Opus 4.6 / Sonnet 4.6 4.6 ↗

1M context window at standard pricing
Opus 80.8% and Sonnet 79.6% on SWE-bench Verified
Adaptive, extended, and interleaved thinking

Models 8 pricing per 1M tokens

Model	Provider	In/Out	Context	Benchmark
Qwen 3.6 Plus	Alibaba	$0.3/$1.2	1M	GPQA 82%
Gemma 4	Google	free	128K	GPQA 72%
DeepSeek V4	DeepSeek	$0.28/$1.1	128K	SWE 72.5%
GPT-5.4	OpenAI	$2.5/$10	256K	GPQA 92%
Gemini 3.1 Pro	Google	$2/$12	2M	SWE 78.8%
Claude Opus 4.6	Anthropic	$5/$25	1M	SWE 80.8%
Claude Sonnet 4.6	Anthropic	$3/$15	1M	SWE 79.6%
Llama 4 Maverick	Meta	free	10M	SWE 68.5%

Security 2 rss

critical 9.6 CVE-2025-53773 github-copilot patched critical 9.3 CVE-2025-68664 langchain-core patched

ATLAS Navigator OWASP CISA all security

MITRE ATLAS: the adversarial threat matrix for AI systems ↗ Prompt injection in 2026: taxonomy, real-world exploits, and defenses ↗ OWASP LLM Top 10 in practice — what each risk looks like in production ↗

Benchmarks 3

GPQA Diamond

Claude Opus 4.6 94.3
GPT-5.4 92
GPT-5.3 Codex 91.5
Gemini 3.1 Pro 90.8
Claude Sonnet 4.6 88.5

SWE-bench Verified

Claude Mythos Preview 93.9
GPT-5.3 Codex 85
Claude Opus 4.5 80.9
Claude Opus 4.6 80.8
Claude Sonnet 4.6 79.6

LM Arena (Chatbot Arena) Elo Rankings

Claude Opus 4.6 Thinking 1504
Gemini 3.1 Pro Preview 1493
Grok 4.20 Beta1 1491
GPT-5.4 High 1484
Claude Sonnet 4.6 Thinking 1478

Trends 1 snapshots

Snapshot: 2026-04-12

Model	Arena	GPQA	$/M in
claude opus 4 6	1504	94.3%	$5
gemini 3 1 pro	1493	90.8%	$2
gpt 5 4	1484	92%	$2.5
claude sonnet 4 6	1478	88.5%	$3
deepseek v4	1445	84%	$0.28
llama 4 maverick	—	78%	free
qwen 3 6 plus	—	82%	$0.3
gemma 4	—	72%	free

3 releases tracked 2 advisories

ATLAS 5.5.0 16T / 101t / 66st

16 tactics 101 techniques 0 mitigations 57 case studies

Recently modified

AML.T0050 Command and Scripting Interpreter demonstrated AML.T0024.000 Infer Training Data Membership feasible AML.T0095 Search Open Websites/Domains demonstrated AML.T0000.001 Pre-Print Repositories demonstrated AML.T0002.000 Datasets demonstrated AML.T0008.001 Consumer Hardware realized Browse all 167 techniques