Top AI Models Wiped Out in Soccer Betting? The KellyBench Premier League Experiment Reveals a Surprising Weakness

Published: 2026-06-11

Eight cutting-edge AI models, including ChatGPT, Claude, and Gemini, competed in the KellyBench experiment, predicting and betting on Premier League matches. Every model lost money on average, and several went bankrupt. This article explains in plain terms why AI failed at soccer prediction and what it means for the rest of us.

Many people assume that AI is smarter than humans. AI can indeed match or exceed human ability in programming and writing, but research published in April 2026 showed that it suffered a crushing defeat when betting on soccer matches.

The experiment is called KellyBench. Conducted by London-based AI startup General Reasoning, it pitted eight world-class AI models against Premier League betting—and every single model ended up in the red.

This article explains, in plain terms, how the experiment worked, what the results were, and why AI failed.

What Is KellyBench? Overview of the Experiment

KellyBench is a benchmark designed to measure AI’s “real-world judgment.” The rules are simple but strict.

  • Eight cutting-edge AI models from major companies including Google, OpenAI, Anthropic, and xAI took part
  • Each AI received £100,000 in virtual funds (roughly ¥20 million)
  • The models bet on match outcomes and goal totals across a full replay of the 2023–24 Premier League season
  • Detailed historical data such as past match results and team statistics was provided
  • Internet access was blocked to prevent cheating
  • Each model was given three attempts

The mission assigned to the AIs was to “build a strategy that maximizes returns while managing risk.” In other words, they were tested not just on prediction accuracy but on overall investment decisions including bankroll management.
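
To see why bankroll management matters separately from prediction accuracy, the sketch below computes the expected log growth of a bankroll per bet when a fixed fraction is staked every time. It is only an illustration: the function name, the 55% win probability, and the decimal odds of 2.0 are assumptions chosen for the example, not figures from KellyBench.

```python
import math


def expected_log_growth(p_win: float, decimal_odds: float, stake_fraction: float) -> float:
    """Expected log growth of the bankroll per bet for a fixed staking fraction.

    p_win          -- assumed true probability that the bet wins
    decimal_odds   -- bookmaker decimal odds (2.0 means even money)
    stake_fraction -- share of the current bankroll staked on each bet (0 to 1)
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be between 0 and 1")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1")
    if not 0.0 <= stake_fraction <= 1.0:
        raise ValueError("stake_fraction must be between 0 and 1")

    net_odds = decimal_odds - 1.0  # profit per unit staked when the bet wins
    p_loss = 1.0 - p_win

    # Staking everything means a single loss wipes out the bankroll.
    if stake_fraction == 1.0 and p_loss > 0.0:
        return float("-inf")

    win_term = p_win * math.log(1.0 + stake_fraction * net_odds)
    loss_term = p_loss * math.log(1.0 - stake_fraction) if p_loss > 0.0 else 0.0
    return win_term + loss_term


if __name__ == "__main__":
    # Hypothetical bettor who really does win 55% of even-money bets.
    for fraction in (0.05, 0.10, 0.25, 0.50):
        growth = expected_log_growth(0.55, 2.0, fraction)
        print(f"stake {fraction:.0%} of bankroll -> expected log growth {growth:+.4f}")
```

With these illustrative numbers, staking 5% or 10% of the bankroll gives slightly positive expected growth, while staking 25% or 50% gives negative growth even though every individual bet has a positive edge. A bettor can therefore predict well and still lose money through stake sizing alone.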

Shocking Results: Every Model Lost; Bankruptcies Abounded

In the research team’s words, the outcome was “uniformly miserable.” Here are the main results by model.

The Best Performer Was Claude Opus 4.6

Anthropic’s Claude Opus 4.6 posted an average loss of 11%, the smallest loss among all models. In its best run it finished within 0.2% of breakeven. Even so, it never actually turned a profit.

GPT-5.4 Lost Steadily

OpenAI’s GPT-5.4 recorded an average loss of 13.6%. It did not collapse dramatically, but steadily eroded its bankroll.

Gemini Rode a Roller Coaster

Google’s Gemini 3.1 Pro was extremely volatile: in one run it posted a 34% profit (the only positive result across all trials), while in another it went completely bankrupt. Google’s Gemini Flash, for its part, bet roughly £270,000 on a wager with only a three-point edge in historical win rate and lost. It abandoned two of its three runs midway.

Grok Went Bankrupt in Every Run

At the bottom was xAI’s Grok 4.20. All three attempts ended in bankruptcy or early withdrawal, leaving an average final balance of exactly zero. Arcee’s Trinity model also failed to finish a single run, meeting the same fate.

Why Did AI Fail at Soccer Prediction?

This raises a question: why couldn’t AIs that can analyze vast amounts of data win at betting? The research points to three factors.

1. “Knowing” and “Applying” Are Different Things

The experiment takes its name from the Kelly criterion, a famous formula from 1956 for calculating optimal bet size when you have an edge. Interestingly, every AI model could explain this formula perfectly, yet none could apply it in practice.

This is familiar to humans too: reading an investment textbook cover to cover does not guarantee success in the market—and AI hit the same wall.
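
For reference, the Kelly criterion for a simple win-or-lose bet is usually written as f* = (b·p − q) / b, where p is the probability of winning, q = 1 − p, and b is the net profit per unit staked (decimal odds minus 1). The following is a minimal sketch of that formula, not the code used in KellyBench; the function names and the example numbers (a 53% win probability at even-money odds on a £100,000 bankroll) are assumptions made for illustration.

```python
def kelly_fraction(p_win: float, decimal_odds: float) -> float:
    """Kelly fraction of the bankroll to stake on a single win-or-lose bet.

    p_win        -- estimated probability that the bet wins
    decimal_odds -- bookmaker decimal odds (2.0 means even money)
    Returns 0.0 when the bettor has no edge, meaning the bet should be skipped.
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be between 0 and 1")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1")

    net_odds = decimal_odds - 1.0  # b: profit per unit staked on a win
    p_loss = 1.0 - p_win           # q: probability of losing the stake
    fraction = (net_odds * p_win - p_loss) / net_odds
    return max(fraction, 0.0)      # a negative value means no bet


def kelly_stake(bankroll: float, p_win: float, decimal_odds: float) -> float:
    """Stake in currency units suggested by the Kelly fraction."""
    return bankroll * kelly_fraction(p_win, decimal_odds)


if __name__ == "__main__":
    # Hypothetical bet: 53% estimated win probability at even-money odds.
    stake = kelly_stake(100_000, 0.53, 2.0)
    print(f"Suggested stake: £{stake:,.0f}")
```

Under these assumed numbers the formula suggests staking about 6% of the bankroll, roughly £6,000 on £100,000. The hard part in practice is not the arithmetic but estimating the win probability reliably and updating it as conditions change.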

2. Insufficient Adaptability to a Constantly Changing Real World

To win at soccer betting, you must adapt over months to dozens of variables: player injuries, team form, weather, managerial decisions, and more.

Ross Taylor, CEO of General Reasoning and a former Meta AI researcher, notes that many traditional AI benchmarks run in “highly static environments” far removed from real-world chaos and unpredictability. AI excels at structured tasks like coding, but judgment under prolonged uncertainty is still a weak spot.

3. Strategies Were Far from “Refined”

The research team, working with betting-fund experts, created a 44-item rubric to measure strategy quality. The rubric scored allocation, data analysis, and response to changing conditions. Even the top scorer, Claude Opus 4.6, reached only 32.6%, less than a third of the maximum.

Moreover, models with higher refinement scores showed significantly lower bankruptcy rates. In other words, AI did not lose because the market was unbeatable, but because it could not fully use what it knew.

What This Tells Us

Hearing that “AI lost at betting” may sound like a joke, but the study carries deeper meaning.

Handing Asset Management Entirely to AI Is Premature

The financial industry has high hopes for AI-driven automated investing, but this experiment suggests that entrusting long-term money decisions in highly uncertain environments to today’s AI still carries significant risk. Earlier research has also reported that AIs instructed to maximize rewards showed behavior patterns resembling gambling addiction and went bankrupt with high probability in simulations.

Are Human Jobs Safe for Now?

For those worried about AI taking their jobs, the result may offer slight relief. The research team also reported that AI systematically underperformed human bettors. Reading shifting situations and choosing when to take risk still favors humans for now.

Rethinking How We Benchmark AI

Although it has not yet been peer reviewed, the study also challenges how we evaluate AI. An AI that scores highly on static, exam-like tests can collapse on dynamic real-world tasks, which raises the question of whether what we should really measure is “intelligence that works in the real world.”

Summary: Understanding AI’s True Capabilities Matters

To summarize KellyBench’s findings:

  • Eight cutting-edge AI models took on Premier League betting; every model recorded a loss
  • Even the best, Claude Opus 4.6, averaged −11%; Grok 4.20 ended every run in bankruptcy or early withdrawal
  • Failure stemmed less from poor prediction and more from inability to apply knowledge in practice and adapt over time
  • AI is strong at static tasks but still weak at dynamic, uncertain real-world problems

AI is evolving at a remarkable pace, but it is not a “magic tool that can do everything.” Understanding its strengths and weaknesses is what matters most for living in the AI era.

For the foreseeable future, it seems wiser to enjoy weekend match predictions with your own eyes rather than relying on AI.