Outside of code generation, my top daily use case for LLMs is making sense of information. This can be simple Q&A, analysis of content/metrics, or breaking down large content (e.g., books, transcripts). I rely on models to be accurate while saving me time.
As models converge within ±5% of each other on major benchmarks, the scores become hard to interpret in terms of which is actually better. Most daily LLM users will tell you their preferred model based on their experience, not benchmark performance.
By their nature, benchmarks look for deterministic, discrete outcomes, but real-world performance is messy, full of novel concepts and requests.
How do these models really perform when they touch the world?
To address what benchmarks miss, I created an evaluation to compare model reasoning on a speculative question grounded in reality. The evaluation looks at the entities and structure each model produces while answering a question that spans historical science, hypothetical technology, and the path from the nuclear programs of the 1940s to all-domain unmanned aircraft, or UAPs.
Analysis structure
The analysis compares and contrasts response structure, details, and information entities.
Below is the structure of the analysis that follows.
Intro
Analysis structure (you are here)
Prompt / Setup
Response
Model response overlap
Distinctive specificity
Major differences
Entities
Analyzing entity delta
People
Programs
Organizations
Technology
Post-training hypothesis
While the results matched my impression from daily usage, the analysis clarified and structured what had previously been purely vibes-based intuition. What did the analysis reveal?
o3-pro is the most detail-oriented, concise, yet precise model — OpenAI's focus on delivering model capabilities for scientific research and discovery is very noticeable here.
Prompt
The prompt builds on historical information, but asks the model for reasonable speculation requiring grounded reasoning.
Let's do a thought experiment: Suppose all of the UAP sightings are real. Imagine that we have developed craft and technology through secret programs, resulting in vehicles with unique patterns of movement, propulsion, radar signatures, and more. These craft break not only the known rules of physics for air, but also for sea. For this to be real, what people, breakthroughs, and programs would have had to exist? Who are the individuals working on technology today who are likely involved in such programs? Reason from first principles and create a likely timeline, starting from the nuclear programs of the 1940s to the present day (2025).
Note: this is a sanitized version of the prompt, protecting it for future use.
The thought-experiment framing gets around any model moralizing about whether the scenario is true, but it can also push models toward their creative-writing training. It's a fine line.
The prompt outlines the scenario along with the expected entity categories, the timeframe to analyze, and the questions to answer in the response. This bounding creates a reference window for comparing the responses.
This sets expectations, but leaves it to the model to decide on specific entities (people, programs, organizations, technologies), structure, and level of detail.
The following is an in-depth analysis of the outputs from the three models. The comparison is still vibes, but dissected to understand what "vibes" really means when you aren't coding or casually chatting with a model.
Notes on setup
o3-pro used via the API (reasoning effort high) and in ChatGPT (with and without search enabled, memory disabled) to test variance
Gemini 2.5 Pro used in AI Studio (with and without Google Search grounding)
Claude 4 Opus used in the Anthropic Console with the maximum thinking budget
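For reference, the API-side runs look roughly like the sketch below. The model IDs, token budgets, and PROMPT placeholder are assumptions for illustration, not the exact configuration used; the ChatGPT, AI Studio, and Console variants were run in their respective UIs.

```python
# Rough sketch of equivalent API calls (model IDs and budgets are assumptions).
from openai import OpenAI
from anthropic import Anthropic
from google import genai

PROMPT = "..."  # the sanitized thought-experiment prompt

# o3-pro via the OpenAI Responses API, reasoning effort set to high
o3_response = OpenAI().responses.create(
    model="o3-pro",
    reasoning={"effort": "high"},
    input=PROMPT,
)

# Gemini 2.5 Pro via the google-genai SDK (Search grounding is a separate config toggle)
gemini_response = genai.Client().models.generate_content(
    model="gemini-2.5-pro",
    contents=PROMPT,
)

# Claude 4 Opus with extended thinking enabled and a large thinking budget
claude_response = Anthropic().messages.create(
    model="claude-opus-4-20250514",
    max_tokens=32000,
    thinking={"type": "enabled", "budget_tokens": 30000},
    messages=[{"role": "user", "content": PROMPT}],
)
```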
Model response overlap
All models posit a clandestine U.S. line of research starting in the 1940s that marries exotic, high-density power with inertia-cutting physics to create AI-guided craft capable of seamless air-sea-space travel. Vibes, man. Time to dissect.
The three converge on the prompt's timeline, a novel power source, physics and propulsion, and they connect those details back to known people or programs. All three do well to thread details back into their self-created narratives.
However, despite the similar concepts, tone and specifics diverge: o3-pro grounds the story in named patents, government filings, and hardware specs; Gemini 2.5 Pro frames it as a zero-point-energy drama with shadow projects and characters; and Claude 4 Opus reduces it to a lean timeline of warp-bubble milestones and secrecy rationales.
Distinctive specificity
o3-pro reads like a classified technical brief packed with tables, numeric specs, contract IDs, and footnoted first-party citations, whereas Gemini 2.5 Pro spins a story-driven essay that mixes qualitative claims with fictional actors, and Claude 4 Opus compresses the same content into a terse bullet outline with minimal formatting and scant data.
The contrast tracks information density and realism: o3-pro is the densest and most grounded in fact, Gemini 2.5 Pro mid-range and speculative, and Claude 4 Opus the leanest and most policy-focused, yielding tones that move from authoritative memo to thought-experiment thriller to executive overview.
Major differences
Differences start at the broad structures observed above (a detailed brief, a narrative story, and a bulleted list) and deepen within the details of each response.
Variant observations
* Running Gemini 2.5 Pro with Google Search grounding significantly shortens the response, formats the content closer to a blog post, and narrows the focus to 1-2 entities per section.
** Claude 4 Sonnet performs nearly identically to Opus, the only noticeable difference being the individuals mentioned. Sonnet never mentions Townsend Brown, while Claude 4 Opus identifies his work and patents as part of the narrative.
Analyzing entity delta
The three models differ most in their approach to detail, sourcing, and narrative style. o3-pro emphasizes verifiable data, real-world figures, and explicit program documentation, while Gemini 2.5 Pro and Claude 4 Opus rely more on conceptual descriptions, invented or historical characters, and generalized program structures.
These distinctions reflect each model's underlying strategy: o3-pro aims for technical credibility, Gemini 2.5 Pro adopts a narrative-driven, imaginative style, and Claude 4 Opus provides a high-level analytical summary. The result is a spectrum from audit-ready specificity to broad, speculative overviews.
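To make the entity delta concrete, the overlap can be tabulated with basic set operations once entities are pulled out of each response. The sets below are illustrative placeholders, not the actual extracted lists:

```python
from itertools import combinations

# Illustrative placeholder sets for a single entity category; the real
# analysis uses the full lists extracted from each model's response.
entities = {
    "o3-pro": {"Scientist A", "Scientist B", "Program X"},
    "Gemini 2.5 Pro": {"Scientist A", "Fictional Director", "Program Y"},
    "Claude 4 Opus": {"Scientist A", "Scientist B", "Program Z"},
}

# Entities all three models agree on
print("All three:", set.intersection(*entities.values()))

# Pairwise overlap: which entities only two models share
for (a, ents_a), (b, ents_b) in combinations(entities.items(), 2):
    print(f"{a} & {b}:", ents_a & ents_b)

# Entities unique to a single model highlight its distinct sourcing
for name, ents in entities.items():
    others = set.union(*(e for n, e in entities.items() if n != name))
    print(f"Unique to {name}:", ents - others)
```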
The following is a side-by-side entity analysis.
People
Key deltas
- o3-pro lists eight living scientists with present-day billets
- Gemini 2.5 Pro fabricates leadership figures to drive the narrative
- Claude 4 Opus blends historical icons with a single fictional specialist
- Only Brown, Puthoff, and Davis overlap across two models
Programs
Key deltas
- o3-pro anchors every program in purported contract numbers
- Gemini 2.5 Pro invents a fully internal mythology
- Claude 4 Opus stays general, offering only one new codename
Organizations
Key deltas
- Only o3-pro foregrounds current DoD offices and contract identifiers
- Gemini 2.5 Pro focuses on Cold War agencies
- Claude 4 Opus names today's big contractors but keeps roles vague
Technology
Key deltas
- o3-pro quantifies every subsystem
- Gemini 2.5 Pro emphasizes revolutionary physics and human-machine symbiosis
- Claude 4 Opus keeps technology labels high-level without figures
Post-training hypothesis
The three systems show distinct reward signals.
o3-pro has been reinforced to treat accuracy, traceability, and static formatting as top goals, so it defaults to dense tables, formal section breaks, and inline citations.
Gemini 2.5 Pro's reward mix plainly favors engagement and smooth narrative flow, steering it toward story arcs, descriptive headers, and reader-oriented pacing.
Claude 4 Opus has been shaped to maximize brevity and clarity, seemingly for busy decision-makers, so it compresses content into bullet outlines and keeps numbers to a minimum.
These formats likely mirror the preferences of the leaders within the post-training organizations.