Five frontier AI models in one week. That was February 2026: a period so dense with announcements that keeping track of which lab had released what required a spreadsheet. Gemini 3.1 Pro, GPT-5.3, Claude Sonnet 5 "Fennec", Grok 4.20 and DeepSeek V4 all arrived within days of each other. The AI industry, never shy about drama, had managed to compress an entire year's worth of landmark releases into a single news cycle.
The timing was not purely coincidental. Labs watch each other. Releasing a major model two weeks after a rival is worse than releasing one two weeks before. The result was a pile-up — and, for anyone trying to make sense of it, a headache.
What actually happened
Anthropic moved first. Claude Sonnet 5 "Fennec" launched on February 3rd, and its headline number was striking: 82.1% on SWE-Bench Verified, a widely used coding benchmark that tests whether a model can resolve real GitHub issues. No model had previously broken 80%. Anthropic had also, not long before, posted 80.9% with Claude Opus 4.5 — itself ahead of Gemini 3 Pro (76.2%), GPT-5.1 (76.3%) and Grok 4.1 (74.9%). The Fennec number made Opus 4.5 look almost modest in retrospect.
Google's entry came on February 19th. Gemini 3.1 Pro Preview arrived with a 1m-token context window — enough to ingest roughly 750,000 words, or about ten novels, in a single prompt — and posted 77.1% on ARC-AGI-2, a benchmark designed to resist the kind of pattern-matching at which large language models excel. Whether 77.1% represents genuine reasoning or very sophisticated pattern-matching is a question the benchmark's designers are probably still arguing about.
OpenAI, which had released GPT-5.2 on December 11th and GPT-5.2-Codex a week later, came into February with GPT-5.3 still in development. A 400,000-token context window was reported. Numbers on the standard benchmarks had not been published by the time the week ended. The absence of a scorecard in a week defined by scorecards was conspicuous.
DeepSeek V4 was readied for launch around February 17th. The Chinese lab has made a habit of releasing models that perform comparably to Western rivals at a fraction of the cost, and V4 continued that tradition. Grok 4.20, from xAI, rounded out the five.
What the numbers mean — and don't
SWE-Bench Verified is useful precisely because it tests something concrete: can the model write code that fixes a real bug? An 82.1% score is genuinely impressive. It also means that roughly one in five issues still defeats the model. Coding agents have improved enormously; they have not made developers redundant.
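The mechanics behind that score can be sketched in miniature. SWE-Bench-style evaluation counts an issue as "resolved" only if the model's patch makes the repository's regression test pass; everything in the sketch below — the `slugify` function, its bug, and the test — is invented for illustration, not taken from the benchmark itself.

```python
# Hypothetical miniature of test-based scoring: an instance is "resolved"
# only if the candidate code passes the regression test tied to the issue.
# The slugify example and its test are invented for illustration.

def passes_regression_test(module_source: str) -> bool:
    """Exec the candidate source, then run the issue's regression test."""
    namespace: dict = {}
    try:
        exec(module_source, namespace)
        # Regression test attached to the (invented) GitHub issue.
        assert namespace["slugify"]("Hello World") == "hello-world"
        return True
    except Exception:
        return False

BUGGY_SOURCE = 'def slugify(s):\n    return s.replace(" ", "-")\n'   # forgets .lower()
MODEL_PATCHED = 'def slugify(s):\n    return s.lower().replace(" ", "-")\n'

assert not passes_regression_test(BUGGY_SOURCE)   # the issue reproduces
assert passes_regression_test(MODEL_PATCHED)      # the patch resolves it
```

The pass/fail criterion is what makes the metric concrete, and also what makes the remaining one-in-five failure rate easy to state precisely.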
ARC-AGI-2 is harder to interpret. The benchmark was built to resist memorisation, but building a test that a sufficiently large model cannot eventually game is a Sisyphean task. A 77.1% score tells you something. It does not tell you that the model reasons like a human, or that it can handle genuinely novel problems outside the benchmark's distribution.
Context windows are the most seductive metric of the three, and probably the least informative on its own. A 1m-token window means a model can see more text at once. It says nothing about whether it can reliably retrieve and use information buried in the middle of that text — a problem, sometimes called "lost in the middle", that longer contexts have not solved so much as moved.
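The standard way that retrieval question gets probed is a "needle in a haystack" test: plant one distinctive fact at varying depths in a long filler document and check whether the model can answer a question about it. The sketch below shows the prompt-construction half of such a test; the filler sentence, the needle, and the `call_model` function are invented placeholders, not any lab's actual harness.

```python
# Minimal sketch of needle-in-a-haystack prompt construction. A real run
# would send each prompt to the model under test and plot retrieval
# accuracy against needle depth; the needle and filler here are invented.

def build_haystack(needle: str, depth: float, n_filler: int = 2000) -> str:
    """Embed the needle at a relative depth (0.0 = start, 1.0 = end)."""
    filler = ["The grass is green and the sky is blue."] * n_filler
    position = int(depth * n_filler)
    return "\n".join(filler[:position] + [needle] + filler[position:])

NEEDLE = "The magic number for this exercise is 417."
QUESTION = "What is the magic number?"

prompts = {
    depth: build_haystack(NEEDLE, depth) + "\n\n" + QUESTION
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
}
# for depth, prompt in prompts.items():
#     answer = call_model(prompt)   # hypothetical API call to the model under test
```

Models that ace this test at depth 0.0 and 1.0 can still stumble at 0.5 — which is exactly the "lost in the middle" pattern the paragraph above describes.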
Why it matters that they all came at once
The avalanche created a specific kind of confusion. Benchmarks are only useful when they are stable. A score published one day becomes obsolete the next when a rival posts a higher number on a slightly different evaluation. Enterprise buyers, who make decisions over months rather than days, found the week difficult to parse. Developers building on top of these models faced a genuine dilemma about which API to commit to.
There is also a subtler effect on public perception. When five major announcements arrive simultaneously, each one receives less scrutiny than it would have alone. An 82.1% SWE-Bench score deserves careful examination of methodology; in the noise of four other releases, it gets a headline and a tweet. Labs benefit, probably deliberately, from this compression.
The question the avalanche doesn't answer
The race that produced this extraordinary week is a race for capability. Each lab is trying to build the most capable model, measured by benchmarks that are themselves imperfect proxies for what users actually need. What is missing from the avalanche is any serious answer to the question of what all this capability is for.
Developers are building useful things with these models. Coding assistants that genuinely accelerate work. Research tools that surface relevant literature faster than any human could. Customer service applications that handle routine queries competently. These are real gains.
But the labs are not primarily competing on usefulness. They are competing on benchmark scores, context window sizes and parameter counts. A model that scores 82.1% on SWE-Bench and is unreliable in production is less useful than a model that scores 74% and behaves predictably. That trade-off rarely features in the announcements.
February 2026 will be remembered as the week AI moved fastest. Whether it moved in the right direction is a different question, and one that a benchmark cannot answer.