Anthropic, having sat on the model for the better part of a month, finally released Claude Opus 4.5, now the highest-scoring model on SWE-bench Verified at 80.9%.
Despite this, it fails a basic physics problem whose real-world framing requires the model to separate the details relevant to the question from the extraneous ones. It also fails to even attempt an answer on CritPT's example challenge, instead hallucinating a multiple-choice letter that does not exist.
Anthropic's approach to reasoning is unlike that of the other frontier labs, and it shows.