Large Language Models Benchmarks

3d on MSN (Opinion)

Multilingual benchmark evaluates how well AI interprets clinical text and health records in nine languages

Researchers at Mass General Brigham recently developed BRIDGE, a multilingual benchmark that evaluates how well large ...

Z.ai’s open-weights GLM-5.2 beats GPT-5.5 on multiple long-horizon coding benchmarks for 1/6th the cost

It allows engineering teams to host frontier-level AI on their own sovereign infrastructure, entirely eliminating vendor lock ...

4d on MSN

China's Z.ai GLM-5.2 tops OpenAI’s GPT 5.5 model on key benchmarks

Chinese startup Z.ai has launched GLM-5.2, a powerful AI model for complex coding projects. This new large language model ...

AI has passed the test but not the exam: Why ‘Humanity’s Last Exam’ matters

There is a temptation, when AI systems begin to outperform human baselines on established tests, to interpret this as a sign ...

Nature

Towards domain-adapted large language models for water and wastewater management: methods, datasets and benchmarking

Large language models (LLMs) have shown significant promise for water and wastewater management. However, current foundation models are not yet reliable. This Perspective outlines a pathway for ...

Why Weibo’s tiny VibeThinker-3B has the AI world arguing over benchmarks again

VibeThinker-3B, a 3-billion-parameter AI model, is challenging OpenAI, Google and DeepSeek on math and coding benchmarks while reigniting ...

Nature

Benchmarking large language model-based agent systems for clinical decision tasks

Clinical decision-making entails complex, data-intensive, and often uncertain judgments, resulting in excessive workload and exceeding the cognitive limits of many clinicians. For more than two ...

KT Unveils Korea-Specific AI Benchmark Covering Rental Fraud and Dokdo Dispute

MM,' which evaluates how safely multimodal large language models (MLLMs) provide answers that reflect Korean social issues and cultural context. This benchmark, co-developed with Korea University, ...

Geeky Gadgets

AI Benchmarks Are Broken: The Leaderboard Illusion

What if the tools we trust to measure progress are actually holding us back? In the rapidly evolving world of large language models (LLMs), AI benchmarks and leaderboards have become the gold standard ...

Crypto Briefing

Wallet V launches public performance benchmark for AI trading agents on Hyperliquid and Aster

Wallet V, a self-custody Web3 wallet, launched a public performance benchmark for the AI trading agents that its users have ...
