The AI Model Race in May 2026: What the Benchmarks Actually Say

A grounded May 18, 2026 comparison of GPT-5.5, Claude Opus 4.7, Gemini 3.1, and DeepSeek V4 Pro across reasoning, coding, cost, and benchmark signals.

Joel Saji
Toronto · May 18, 2026 · 8 min read · 1,442 words

May 2026 feels like a reset month for AI models. Not because one lab ended the race, but because the shape of the race changed again. OpenAI pushed GPT-5.5 into the center of the conversation. Anthropic answered with Claude Opus 4.7 and a smaller experimental thread called Claude Mythos. Google kept Gemini 3.1 Pro in the mix while making Flash-Lite harder to ignore on price and speed. DeepSeek V4 Pro reminded everyone that open-weight models are still compressing the gap.

The honest summary is simple: GPT-5.5 looks like the strongest general model on the public benchmark mix I could verify by May 18, 2026. Claude Opus 4.7 looks especially attractive for coding, agents, and long work sessions. Gemini 3.1 Pro remains very competitive on reasoning and browsing-style tasks. Gemini 3.1 Flash-Lite is the model I would reach for when budget and latency matter. DeepSeek V4 Pro is not at the frontier, but it is important because it raises the floor for open systems.

[Figure: Benchmark comparison for frontier AI models in May 2026]

The left panel of the chart uses OpenAI's reported ARC AGI 2 verified scores. The right panel uses the aggregate Elo from NIST CAISI's DeepSeek V4 Pro evaluation.

The new center of gravity

OpenAI's GPT-5.5 announcement is the cleanest public signal from this batch. The model improves on GPT-5.4 on every row of the benchmark table OpenAI published, and it does especially well on hard reasoning and tool-use tasks.

The numbers worth paying attention to:

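| Benchmark | GPT-5.5 | GPT-5.4 | Claude Opus 4.7 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| ARC AGI 2 verified | 85.0 | 73.3 | 75.8 | 77.1 |
| GPQA Diamond | 93.6 | 92.8 | 94.2 | 94.3 |
| BrowseComp | 84.4 | 82.7 | 79.3 | 85.9 |
| SWE-Bench Pro public | 58.6 | 57.7 | 64.3 | 54.2 |
| Terminal-Bench 2.0 | 82.7 | 75.1 | 69.4 | 68.5 |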

That table tells a more interesting story than "one model wins." GPT-5.5 has the broadest profile, with clear leads on ARC AGI 2 verified and Terminal-Bench 2.0. Claude Opus 4.7 still beats it on SWE-Bench Pro public (64.3 to 58.6) in OpenAI's own table. Gemini 3.1 Pro is slightly ahead on BrowseComp and GPQA Diamond. The practical reading is that frontier models are now close enough that product fit matters more than leaderboard placement.
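To make that reading concrete, here is a minimal Python sketch that finds the leader on each benchmark and its margin over the runner-up. The scores are copied from the table above; the `leaders` helper and the variable names are my own, and the sketch assumes higher is better on every row.

```python
# Scores copied from the comparison table above.
# Assumption: higher is better on every benchmark.
SCORES = {
    "ARC AGI 2 verified": {"GPT-5.5": 85.0, "GPT-5.4": 73.3,
                           "Claude Opus 4.7": 75.8, "Gemini 3.1 Pro": 77.1},
    "GPQA Diamond": {"GPT-5.5": 93.6, "GPT-5.4": 92.8,
                     "Claude Opus 4.7": 94.2, "Gemini 3.1 Pro": 94.3},
    "BrowseComp": {"GPT-5.5": 84.4, "GPT-5.4": 82.7,
                   "Claude Opus 4.7": 79.3, "Gemini 3.1 Pro": 85.9},
    "SWE-Bench Pro public": {"GPT-5.5": 58.6, "GPT-5.4": 57.7,
                             "Claude Opus 4.7": 64.3, "Gemini 3.1 Pro": 54.2},
    "Terminal-Bench 2.0": {"GPT-5.5": 82.7, "GPT-5.4": 75.1,
                           "Claude Opus 4.7": 69.4, "Gemini 3.1 Pro": 68.5},
}


def leaders(scores):
    """Return {benchmark: (leading model, margin over the runner-up)}."""
    result = {}
    for bench, row in scores.items():
        # Sort models by score, best first.
        ranked = sorted(row.items(), key=lambda kv: kv[1], reverse=True)
        (top, top_score), (_, second_score) = ranked[0], ranked[1]
        result[bench] = (top, round(top_score - second_score, 1))
    return result


LEADERS = leaders(SCORES)
for bench, (model, margin) in LEADERS.items():
    print(f"{bench}: {model} leads by {margin} points")
```

Three different models lead at least one row, and the GPQA Diamond margin is a tenth of a point, which is why I would not pick a stack from this table alone.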

If I were choosing a model for a new product today, I would not start with a single benchmark. I would start with the workflow:

  • Long-chain reasoning with tools: GPT-5.5 is the first model I would test.
  • Coding agents and refactors: Claude Opus 4.7 deserves a serious run.
  • Research, browsing, and multimodal workflows: Gemini 3.1 Pro is still in the room.
  • High-volume extraction, summarization, and routing: Gemini 3.1 Flash-Lite changes the cost math.
  • Private or self-hosted experiments: DeepSeek V4 Pro is the model to watch.

Claude Opus 4.7 looks built for work, not demos

Anthropic's Claude Opus 4.7 announcement reads less like a magic trick and more like a model built for long, annoying, real jobs. The claim is better sustained performance on coding, agent workflows, research, and writing.

The most concrete public signals are partner and benchmark numbers Anthropic published:

| Signal | Claude Opus 4.7 result | Why it matters |
| --- | --- | --- |
| CursorBench | 70 | Coding agent work, above Opus 4.6 at 58 |
| SWE-Bench Pro public | 64.3 | Strong public coding benchmark result |
| Finance Agent benchmark | 0.813 | Better than Opus 4.6 at 0.767 |
| BigLaw Bench | 90.9 | High score on legal workflow tasks |

I would read those numbers as a pattern: Claude is trying to be the dependable model for messy professional work. That is different from winning every synthetic leaderboard. If you are building an AI coding assistant, internal knowledge worker, or agent that has to keep state across many steps, Opus 4.7 is one of the models you benchmark first.
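For the two rows that quote an Opus 4.6 baseline, the size of the jump is easy to check. This is a small sketch using only the numbers in the table above; the `relative_gain` helper is my own name, and the two benchmarks use different scales, so the percentages are not comparable to each other.

```python
# (Opus 4.7, Opus 4.6) pairs quoted in the table above.
PAIRS = {
    "CursorBench": (70, 58),
    "Finance Agent benchmark": (0.813, 0.767),
}


def relative_gain(new, old):
    """Percentage improvement of `new` over `old`."""
    return (new - old) / old * 100


GAINS = {
    name: round(relative_gain(new, old), 1)
    for name, (new, old) in PAIRS.items()
}

for name, gain in GAINS.items():
    print(f"{name}: +{gain}% over Opus 4.6")
```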

Claude Mythos is more experimental. Anthropic describes it as a restricted preview focused on long-horizon reasoning, but it is not a model most teams can actually deploy today. I would keep it out of production comparisons until availability and evaluation details are less fuzzy.

Gemini's story is two models, not one

Google's Gemini line has a split personality in the best way. Gemini 3.1 Pro is the frontier competitor. Gemini 3.1 Flash-Lite is the practical machine you use when a product has real traffic and a real bill.

Gemini 3.1 Pro remains competitive in the OpenAI comparison table:

  • 77.1 on ARC AGI 2 verified
  • 94.3 on GPQA Diamond
  • 85.9 on BrowseComp
  • 54.2 on SWE-Bench Pro public in OpenAI's comparison table

Those are not sleepy numbers. The weaker spot is coding compared with Claude Opus 4.7 and GPT-5.5. The stronger spot is browsing and general reasoning.

Flash-Lite is the more interesting product move. Google priced it at $0.25 per million input tokens and $1.50 per million output tokens, with a focus on low latency and high throughput. Google also reported an Arena Score of 1432 for Flash-Lite, along with 86.9 on GPQA Diamond and 76.8 on MMMU. If those numbers hold in your own evals, it becomes a strong default for large-scale features that do not need the most expensive model on every request.
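To see what that pricing means for a real bill, here is a rough cost sketch. The two prices come from the paragraph above; the workload (100,000 requests a day, 2,000 input and 300 output tokens each) is a made-up example, and the function names are my own. It ignores caching, batching discounts, and retries.

```python
# Prices quoted above for Gemini 3.1 Flash-Lite (USD per million tokens).
INPUT_USD_PER_M = 0.25
OUTPUT_USD_PER_M = 1.50


def request_cost(input_tokens, output_tokens):
    """Cost in USD of a single request at the quoted prices."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


def monthly_cost(requests_per_day, input_tokens, output_tokens, days=30):
    """Cost in USD of a steady workload over `days` days."""
    return requests_per_day * days * request_cost(input_tokens, output_tokens)


# Hypothetical workload, not a figure from Google or from this post.
estimate = monthly_cost(100_000, 2_000, 300)
print(f"${estimate:,.2f} per month")  # prints $2,850.00 per month
```

Swap in your own traffic and token counts before trusting the result; output tokens cost six times as much as input tokens here, so verbose responses move the number quickly.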

This is the direction I expect more apps to take: one frontier model for hard cases, one cheaper model for routing, drafting, extraction, and background work.

[Figure: Landscape view of major AI models in May 2026]

DeepSeek V4 Pro is not the winner, but it matters

NIST CAISI's DeepSeek V4 Pro evaluation is useful because it is less promotional than a launch post. Their report compared DeepSeek V4 Pro against GPT-5.5, Claude Opus 4.6, and GPT-5.4 mini across their aggregate evaluation suite.

The headline numbers:

| Model | CAISI aggregate Elo |
| --- | --- |
| GPT-5.5 | 1260 |
| Claude Opus 4.6 | 999 |
| DeepSeek V4 Pro | 800 |
| GPT-5.4 mini | 749 |

That puts DeepSeek V4 Pro behind the most capable closed models, but above GPT-5.4 mini in CAISI's aggregate. CAISI also described the model as roughly eight months behind the proprietary frontier on performance, while still being much cheaper for certain workloads.
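For a rough sense of what those gaps could mean, here is a sketch that applies the classic Elo expected-score formula to the ratings above. This is an assumption on my part: I have not confirmed that CAISI's aggregate uses the standard 400-point logistic scale, so treat the output as an illustration of the formula, not as CAISI's own interpretation.

```python
# Ratings copied from the table above.
ELO = {
    "GPT-5.5": 1260,
    "Claude Opus 4.6": 999,
    "DeepSeek V4 Pro": 800,
    "GPT-5.4 mini": 749,
}


def expected_score(rating_a, rating_b, scale=400.0):
    """Classic Elo expected score of A against B.

    Assumption: the standard logistic formula with a 400-point scale.
    The source does not say CAISI's aggregate is calibrated this way.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


deepseek = ELO["DeepSeek V4 Pro"]
for rival, rating in ELO.items():
    if rival != "DeepSeek V4 Pro":
        share = expected_score(deepseek, rating)
        print(f"DeepSeek V4 Pro vs {rival}: expected score {share:.2f}")
```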

That is the part to watch. Open-weight models do not need to beat GPT-5.5 tomorrow to matter. They need to be good enough for private deployment, fine-tuning, local control, and cost-sensitive inference. DeepSeek V4 Pro looks like another step in that direction.

What I would actually compare before choosing

Public benchmarks are a starting point. They are not procurement. If I were choosing a model stack for a product in May 2026, I would run a small internal benchmark across five test areas:

| Test area | What to measure | Models I would start with |
| --- | --- | --- |
| Hard reasoning | Correctness on domain problems | GPT-5.5, Gemini 3.1 Pro |
| Coding | Multi-file change quality and test pass rate | Claude Opus 4.7, GPT-5.5 |
| Agent work | Tool calls, recovery, state tracking | GPT-5.5, Claude Opus 4.7 |
| Volume tasks | Cost, latency, acceptable accuracy | Gemini 3.1 Flash-Lite, DeepSeek V4 Pro |
| Private workflows | Control, data posture, hosting fit | DeepSeek V4 Pro, local variants |
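A small internal benchmark does not need much machinery. The sketch below is one possible shape, not a finished harness: every name in it (`Case`, `Result`, `run_benchmark`, the stub models) is hypothetical, and the stubs stand in for whatever API clients you actually use. It measures pass rate and mean latency, two of the things the table asks for.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # True when the answer is acceptable


@dataclass
class Result:
    model: str
    pass_rate: float
    mean_latency_s: float


def run_benchmark(models: dict[str, Callable[[str], str]],
                  cases: list[Case]) -> list[Result]:
    """Run every case against every model; best pass rate first."""
    if not cases:
        raise ValueError("need at least one test case")
    results = []
    for name, call in models.items():
        passed, elapsed = 0, 0.0
        for case in cases:
            start = time.perf_counter()
            answer = call(case.prompt)
            elapsed += time.perf_counter() - start
            passed += bool(case.check(answer))
        results.append(Result(name, passed / len(cases), elapsed / len(cases)))
    return sorted(results, key=lambda r: r.pass_rate, reverse=True)


# Stub models (hypothetical); replace with real client calls.
def stub_strong(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "Paris"


def stub_weak(prompt: str) -> str:
    return "4"


CASES = [
    Case("What is 2 + 2?", lambda a: a.strip() == "4"),
    Case("Capital of France?", lambda a: "paris" in a.lower()),
]
RESULTS = run_benchmark({"stub-strong": stub_strong, "stub-weak": stub_weak}, CASES)
for r in RESULTS:
    print(f"{r.model}: pass rate {r.pass_rate:.0%}, {r.mean_latency_s:.6f}s mean")
```

The cases are the part that matters. Twenty prompts pulled from your own product, each with a checker you trust, will tell you more than another public leaderboard row.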

The mistake is treating the leaderboard as the architecture. The better pattern is a model router: cheap model first, frontier model when needed, human review where the cost of being wrong is high.
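Here is a minimal sketch of that router pattern. Everything in it is an assumption for illustration: the `Answer` and `Routed` types, the `route` function, the 0.8 threshold, and the idea that each model call returns a confidence estimate. How you estimate confidence (a verifier, a self-check prompt, a heuristic) is the hard part, and this sketch does not solve it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # 0.0 to 1.0; how you estimate it is up to you


@dataclass
class Routed:
    text: str
    handled_by: str           # "cheap" or "frontier"
    needs_human_review: bool


def route(prompt: str,
          cheap: Callable[[str], Answer],
          frontier: Callable[[str], Answer],
          *,
          high_stakes: bool = False,
          min_confidence: float = 0.8) -> Routed:
    """Cheap model first, frontier model when needed, human review last."""
    answer, handled_by = cheap(prompt), "cheap"
    if answer.confidence < min_confidence:
        # Escalate only when the cheap answer looks shaky.
        answer, handled_by = frontier(prompt), "frontier"
    # Flag for review when being wrong is costly, or when even the
    # frontier model is not confident.
    review = high_stakes or answer.confidence < min_confidence
    return Routed(answer.text, handled_by, review)


# Stub models (hypothetical) so the sketch runs without any API keys.
def cheap_stub(prompt: str) -> Answer:
    return Answer("draft answer", 0.9 if len(prompt) < 40 else 0.3)


def frontier_stub(prompt: str) -> Answer:
    return Answer("careful answer", 0.95)


easy = route("Summarize this ticket", cheap_stub, frontier_stub)
hard = route("x" * 100, cheap_stub, frontier_stub)
risky = route("Approve this refund", cheap_stub, frontier_stub, high_stakes=True)
print(easy.handled_by, hard.handled_by, risky.needs_human_review)
```

The threshold is a product decision as much as a technical one: set it too high and every request hits the expensive model, too low and the cheap model's mistakes reach users.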

My read on the race

The model race is less about a single king and more about specialization. GPT-5.5 is the best overall signal. Claude Opus 4.7 feels like the professional workhorse. Gemini 3.1 Pro is still a serious frontier model, and Flash-Lite is a quiet threat because cost wins more products than people admit. DeepSeek V4 Pro is the reminder that open models keep making the floor higher.

The next wave will not be decided by who can write the flashiest launch post. It will be decided inside actual products: fewer retries, better tool use, lower latency, cleaner failure modes, and bills that do not make teams afraid to ship.

That is the benchmark I care about most.
