Claude AI on Humanity's Last Exam: 5% to Score 55%+. Market ends June 30 with $325 daily volume. Trade live on Polymarket via Polymarket Trade.
Anthropic's Claude models are trading at a 5% implied probability of scoring at least 55% on Humanity's Last Exam, a benchmark built to test frontier AI systems on expert-level questions. The exam, developed by the Center for AI Safety and Scale AI, poses closed-ended questions with verifiable answers across mathematics, the natural sciences, the humanities, and other specialized fields, many of which require multi-step reasoning. With a June 30 resolution date, traders are betting decisively against Claude clearing the 55% threshold. The low odds likely reflect skepticism about current Claude capabilities on this specific benchmark, a consensus that Humanity's Last Exam sets a much higher bar than conventional AI benchmarks, or both. The $7,941 in available liquidity and $325 in 24-hour volume point to modest trader interest, more consistent with positions being held than with active speculation. In short, the price implies that traders expect Claude to stay below 55%, whether because the benchmark is calibrated for next-generation systems or because current Claude models have real limitations on this class of reasoning task.
Humanity's Last Exam is a frontier AI benchmark that tests reasoning and specialized knowledge on questions written by subject-matter experts. Unlike older benchmarks such as MMLU or ARC, which mostly test factual recall and narrower reasoning, it targets problems that have historically required human expertise across many academic fields. A 55% score would mean answering more than half of these expert-level questions correctly. The 5% market odds suggest traders believe that threshold sits well above Claude's demonstrated performance, or that the benchmark's difficulty is calibrated to challenge systems beyond the current Claude generation.
Several factors could push the market toward YES. Anthropic has improved capabilities with each Claude release, and the June 30 deadline leaves room for new or updated models before resolution. Continued scaling of model size, training compute, and reasoning techniques could yield step improvements on complex reasoning tasks. Traders may also be underpricing the possibility that a Claude version optimized for this class of task ships before the resolution date.
Conversely, several structural factors explain the pessimistic 5% pricing. Benchmark developers typically calibrate difficulty to challenge frontier systems, so a 55% threshold may be a bar that current Claude models have not yet cleared. Claude's known limitations on certain reasoning tasks, particularly those requiring deep quantitative analysis or long logical chains, could make the target difficult. The benchmark may also include domains, such as novel reasoning under uncertainty or cutting-edge research, where Claude's training data cutoff or architectural constraints leave meaningful gaps. Finally, if the benchmark has been public for months without any AI system reaching 55%, the 5% odds may reflect empirical evidence rather than speculation.
The current pricing implies that traders have high conviction Claude will fall short. At 5¢ per YES share, a YES buyer risks 5¢ to win 95¢, while a NO buyer risks 95¢ to win 5¢, so before fees NO is only worth holding for traders who put the chance of YES below 5%. The low 24-hour volume ($325) relative to $7,941 in liquidity suggests positioning rather than active debate: traders appear to have established NO positions and to be holding them into the June 30 deadline. The market would likely move materially on news of a new Claude release, new information about the benchmark's difficulty, or interim results pointing to higher scores.
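The payoff asymmetry at a 5% YES price can be made concrete with a small expected-value calculation. The sketch below is illustrative only: the function name `expected_profit` is hypothetical, each share is assumed to pay $1 if its side wins, YES and NO prices are assumed to sum to $1 (no spread), and fees and slippage are ignored. The 10% belief is an arbitrary example, not a forecast.

```python
def expected_profit(price: float, believed_prob: float) -> float:
    """Expected profit per share, in dollars, for a share that pays $1 if it wins.

    price: cost of one share in dollars (0 < price < 1)
    believed_prob: the trader's own probability that this side wins
    Fees and slippage are ignored (a simplifying assumption).
    """
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1")
    if not 0.0 <= believed_prob <= 1.0:
        raise ValueError("believed_prob must be between 0 and 1")
    win = believed_prob * (1.0 - price)    # profit when the share pays out $1
    lose = (1.0 - believed_prob) * price   # stake lost when the share pays nothing
    return win - lose


yes_price = 0.05             # 5% implied probability from the market
no_price = 1.0 - yes_price   # assumption: YES and NO prices sum to $1

# Hypothetical trader who believes YES has a 10% chance.
ev_yes = expected_profit(yes_price, 0.10)
ev_no = expected_profit(no_price, 0.90)
print(f"EV per YES share at 5c:  {ev_yes:+.3f}")
print(f"EV per NO share at 95c: {ev_no:+.3f}")
```

Under these assumptions the expected profit reduces to the trader's believed probability minus the price, so a side is only attractive to someone whose own estimate is higher than the market's.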
Market resolves YES if an Anthropic Claude model scores at least 55% on Humanity's Last Exam before June 30, 2026. Resolution is determined by official exam administrators or published benchmark results.
Polymarket Trade is an independent third-party interface to the Polymarket CLOB prediction market exchange on Polygon and is not affiliated with Polymarket, Inc. Prediction markets aggregate trader expectations into real-time probability estimates. Every market question resolves YES or NO based on a specific event outcome; traders buy shares of the side they expect to win. Prices range from 0¢ (certain NO) to 100¢ (certain YES) and reflect the crowd-implied probability of YES. Polymarket Trade is non-custodial: your funds never leave your wallet. Open the full interactive page linked above to place orders, see order book depth, and execute a trade.
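As a rough illustration of how a share price maps to a probability and a payout, here is a minimal sketch. The helper names are hypothetical and are not part of any Polymarket API; the code assumes a winning share redeems for 100¢ and ignores fees, spread, and order book depth.

```python
def implied_probability(price_cents: float) -> float:
    """Convert a share price in cents (0 to 100) to an implied probability."""
    if not 0 <= price_cents <= 100:
        raise ValueError("price_cents must be between 0 and 100")
    return price_cents / 100


def shares_for_stake(stake_usd: float, price_cents: float) -> float:
    """Number of shares a stake buys at a given price (ignores fees and depth)."""
    if not 0 < price_cents < 100:
        raise ValueError("price_cents must be strictly between 0 and 100")
    return stake_usd * 100 / price_cents


def payout_if_correct(stake_usd: float, price_cents: float) -> float:
    """Total payout in dollars, assuming each winning share redeems for $1."""
    return shares_for_stake(stake_usd, price_cents) * 1.0


# A $10 stake on each side of a market priced at 5 cents YES / 95 cents NO.
for side, price in (("YES", 5), ("NO", 95)):
    print(
        f"{side}: implied probability {implied_probability(price):.0%}, "
        f"{shares_for_stake(10, price):.2f} shares, "
        f"pays ${payout_if_correct(10, price):.2f} if correct"
    )
```

The same $10 buys many more YES shares than NO shares at these prices, which is why a low-probability side offers a large payout if it resolves in the buyer's favor.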