xAI did not disappoint with Grok 4 today, not only taking the top place on the HLE benchmark but scoring 1.6x higher than any other model.
Gemini 2.5 Pro, the previous leader, had just surpassed o3 last month, and by only eight percent. Notably, Anthropic's Claude 4 series is missing from the top models, likely due to its hybrid reasoning architecture. Grok 4 has no SWE-bench score yet (perhaps that will come with next month's Grok Code).
Benchmarks drive headlines, but daily work tells a different story.
Daily Usage vs. Benchmarks
Anyone using these models — OpenAI o3, Anthropic Claude 4, Gemini 2.5, Grok 4 — on a daily basis likely has a very different opinion about the top model. According to the internet (aka Twitter aka X), Claude is the best coding model on the planet. But according to LMArena, Gemini 2.5 is still the leader until they finish their Grok 4 benchmarks. Seems suspect.
In my personal usage, I rely primarily on o3, and sometimes Grok 3 for Swift. That choice has more to do with my second-order preferences and my prompt-construction style. I like that o3, especially compared to Claude, takes on fewer unasked-for tasks and seeks input on next steps rather than making assumptions.
In my experience, though especially talented at UI, Claude will create significant technical debt in pursuit of visual excellence: overly verbose code and a lack of coherence across connected views.
Outside of Cursor, I primarily use Repo Prompt and ChatGPT. ChatGPT's "Work with" harness is an incredibly powerful idea that, so far, seems to cause more issues than it solves, so I ditch it for manual copy/paste. Working manually outside of a CLI/IDE, you build an intuition for how important your prompt is when pushing the models for results. It is not just context engineering, but also spelling out the role and the expected process for reaching an outcome. These models are very intelligent, but basic prompts get basic answers. Especially on engineering tasks.
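To make that concrete, here is a minimal sketch in Python of the kind of "role + process" scaffold I mean when assembling a prompt for manual copy/paste. The section names, the helper function, and the example values are all illustrative assumptions, not a fixed recipe:

```python
# A minimal sketch of a "role + process" prompt scaffold for manual copy/paste work.
# The section headings and example content are illustrative, not a prescription.

def build_prompt(role: str, process: list[str], context: str, task: str) -> str:
    """Assemble a structured prompt: who the model is, how it should work, then the work itself."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(process, start=1))
    return (
        f"ROLE\n{role}\n\n"
        f"PROCESS\n{steps}\n\n"
        f"CONTEXT\n{context}\n\n"
        f"TASK\n{task}\n"
    )

prompt = build_prompt(
    role="You are a senior Swift engineer reviewing a networking layer.",
    process=[
        "Restate the task and list any assumptions before writing code.",
        "Ask for missing context instead of guessing.",
        "Propose a plan, then implement only the agreed scope.",
    ],
    context="<paste the relevant files here, e.g. exported from Repo Prompt>",
    task="Refactor the retry logic without changing the public API.",
)
print(prompt)
```

The exact headings do not matter. What matters is that stating the role and the expected process up front tends to pull the models past "basic prompt, basic answer."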
YMMV
Model performance varies vastly across Chat, CLI, and IDE.
Grok 4 and o3 are very sensitive to custom instructions, almost to the detriment of the output — I disable mine often in ChatGPT because of this. Strangely, Grok 3, Claude 4, and Gemini 2.5 are not as sensitive to custom instructions. Their default response structure and personality come through no matter what.
Grok 4 is so sensitive to custom instructions that they can lower its estimated IQ by 5-15 points on difficult, open-ended questions and degrade its tool use and its synthesis of information from those tools. Here is an analysis of Grok answering a layered prompt for a complex, interdisciplinary engineering question.
This is merely a real-world observation of what has been, and will likely continue to be, true for ML researchers: the prompt and harness matter for getting the most out of the models. Many third-party benchmark results vary due to model/harness configuration differences.
But after 2025, I think raw intelligence will matter less. It's important, but there is more to AGI than accurate test-taking. We already see this with the widespread adoption of GPT-4o as the default ChatGPT model. As measured by GPQA, OpenAI's o3 is 60% smarter than GPT-4o, yet most users, and we're talking hundreds of millions of them, don't think about this at all.
Most models know the correct answer to a specific question, most of the time. And Grok 4 seems to be the best model at answering the hardest questions that only the smartest humans can answer. Yet, when it comes to novel scenarios, it's about the same as Grok 3.
So what matters?
I now think about the models across three dimensions: intelligence, reasoning, and agency. Intelligence is the ability to produce the right answer, with supporting evidence, to a specific question. Reasoning is the ability to find a correct set of facts and form an answer from novel, cross-domain, multi-modal information. Agency is not only the ability to use tools, but to use the correct tools, correctly, the right number of times to finish the requested task.
This is a highly qualitative ranking, based on personal usage and small-sample benchmarks, but you can already see a divergence in model skills. Claude Sonnet 4 is by far the best model in terms of agentic behavior, winning on correct tool use and losing ground only to its overeager code generation. o3 is the most intelligent model, and its ability to reason across multiple domains on novel problems is unmatched.
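If it helps to see the rubric written down, here is a tiny sketch of how one might encode it. The class and field names are mine, and the notes are just the qualitative reads above, not measured scores:

```python
from dataclasses import dataclass

@dataclass
class ModelRead:
    """One qualitative note per dimension; free text, not a benchmark score."""
    name: str
    intelligence: str = "no strong read"  # right answer, with evidence, to a specific question
    reasoning: str = "no strong read"     # correct facts and an answer from novel, cross-domain input
    agency: str = "no strong read"        # correct tools, used correctly, the right number of times

# Only the reads stated above are filled in; everything else stays at the default.
reads = [
    ModelRead(
        name="Claude Sonnet 4",
        agency="best-in-class tool use, but loses ground to overeager code generation",
    ),
    ModelRead(
        name="o3",
        intelligence="the most intelligent of the current models",
        reasoning="unmatched on novel, cross-domain problems",
    ),
]

for read in reads:
    print(f"{read.name}: agency -> {read.agency}")
```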
When you think about the models through this lens, you can see why the vocal majority on X loves Claude, and why daily use varies greatly from benchmarks.
Grok 4 is an interesting model. It is capable and intelligent. Its post-training style suits my preferences better than Gemini 2.5's, and it's clearly smarter than both Gemini and Claude. The upgrades coming to Grok 4 next month (Code) and in the fall (Vision) will close the capability gaps that currently keep it from competing as a coding copilot (against Claude Code) or an assistant (against ChatGPT).
Notably, Grok 4 is weaker than most OpenAI models, including GPT-4o, at translation and at instruction following over long context or multi-turn threads. This weakness shows up in Cursor, where Grok 4's agency degrades the longer the task thread runs. These weaknesses say more about the maturity of Grok's post-training regimen than about raw model capability.
I want to like Grok 4.
But so far, I feel that $300 for Grok 4 Heavy is a waste compared to ChatGPT Pro. I will not be surprised to see subscription adoption lag. xAI is scrappy in its demos and marketing, but this is not Tesla vs. legacy automotive: ChatGPT has clear, structured product marketing for an industry-leading feature set.