Math Benchmark Test - Search News

AI scores a ‘C–’ on its hardest math test yet

The second batch of “First Proof” problems is meant to evaluate AI’s usefulness for research-level math. The best model got ...

Why Weibo’s tiny VibeThinker-3B has the AI world arguing over benchmarks again

B, a 3-billion-parameter AI model, is challenging OpenAI, Google and DeepSeek on math and coding benchmarks while reigniting ...

Humans outperform AI at this highly rigorous mathematics test

A new study reveals that human mathematicians have surpassed AI in solving unpublished high-level math problems, challenging ...

Hosted on MSN

AI is actually bad at math, ORCA shows

ORCA benchmark trips up ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 In the world of George Orwell's 1984, two and two make five. And large language models are not much ...

Geeky Gadgets

Al Benchmarks Investigated : Do Companies Tune Private Builds for Leaderboards, Then Ship Weaker Versions?

Are AI benchmarks really the gold standard we’ve been led to believe? Matt Wolfe walks through how these widely accepted metrics, designed to measure the performance of artificial intelligence systems ...

TechCrunch

Why most AI benchmarks tell us so little

On Tuesday, startup Anthropic released a family of generative AI models that it claims achieve best-in-class performance. Just a few days later, rival Inflection AI unveiled a model that it asserts ...

Some results have been hidden because they may be inaccessible to you

Show inaccessible results