Outside of code generation, my top daily use case for LLMs is making sense of information. This can be simple Q&A, analysis of content/metrics, or breaking down large content (e.g., books, transcripts). I rely on models to be accurate while saving me time.
As models converge within ±5% of each other on major benchmarks, the scores become hard to interpret in terms of which is actually better. Most daily LLM users will tell you their preferred model based on their experience, not benchmark performance.
By their nature, benchmarks look for deterministic, discrete outcomes, but real-world performance is messy, full of novel concepts and requests.
How do these models really perform when they touch the world?
To address what benchmarks miss, I created an evaluation to compare model reasoning on a speculative question grounded in reality. The evaluation looks at the entities and structure each model produces while answering a question that spans historical science, hypothetical technology, and the path from the nuclear programs of the 1940s to all-domain unmanned aircraft, or UAPs.
Analysis structure
The analysis compares and contrasts response structure, details, and information entities.
Below is the structure of the analysis that follows.
Intro
Analysis structure (you are here)
Prompt / Setup
Response
Model response overlap
Distinctive specificity
Major differences
Entities
Analyzing entity delta
People
Programs
Organizations
Technology
Post-training hypothesis
While the results matched my impression from daily usage, the analysis clarified and structured what had previously been purely vibes-based intuition. What did the analysis reveal?
o3-pro is the most detail-oriented, concise, yet precise model — OpenAI's focus on delivering model capabilities for scientific research and discovery is very noticeable here.
Prompt
The prompt builds on historical information, but asks the model for reasonable speculation requiring grounded reasoning.
Let's do a thought experiment: Suppose all of the UAP sightings are real. Imagine that we have developed craft and technology through secret programs, resulting in vehicles with unique patterns of movement, propulsion, radar signatures, and more. These craft break not only the known rules of physics for air, but also for sea. For this to be real, what people, breakthroughs, and programs would have had to exist? Who are the individuals working on technology today who are likely involved in such programs? Reason from first principles and create a likely timeline, starting from the nuclear programs of the 1940s to the present day (2025).
Note: this is a sanitized version of the prompt, protecting it for future use.
The thought-experiment framing gets around any model moralizing about whether the scenario is true, but it can also push models toward their creative-writing training. It's a fine line.
The prompt outlines the scenario along with the expected entity categories, the timeframe to analyze, and the questions to answer in the response. This bounding creates a reference window for comparing the responses.
This sets expectations, but leaves it to the model to decide on specific entities (people, programs, organizations, technologies), structure, and level of detail.
The following is an in-depth analysis of the outputs from the three models. The comparison is still vibes, but dissected to understand what "vibes" really means when you aren't coding or casually chatting with a model.
Notes on setup
o3-pro used via the API (reasoning effort high) and in ChatGPT (with and without search enabled, memory disabled) to test variance
Gemini 2.5 Pro used in AI Studio (with and without Google Search grounding)
Claude 4 Opus used in the Anthropic Console with the maximum thinking budget
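For reference, the API-side runs look roughly like the sketch below. The model IDs, token budgets, and PROMPT placeholder are assumptions for illustration, not the exact configuration used; the ChatGPT, AI Studio, and Console variants were run in their respective UIs.

```python
# Rough sketch of equivalent API calls (model IDs and budgets are assumptions).
from openai import OpenAI
from anthropic import Anthropic
from google import genai

PROMPT = "..."  # the sanitized thought-experiment prompt

# o3-pro via the OpenAI Responses API, reasoning effort set to high
o3_response = OpenAI().responses.create(
    model="o3-pro",
    reasoning={"effort": "high"},
    input=PROMPT,
)

# Gemini 2.5 Pro via the google-genai SDK (Search grounding is a separate config toggle)
gemini_response = genai.Client().models.generate_content(
    model="gemini-2.5-pro",
    contents=PROMPT,
)

# Claude 4 Opus with extended thinking enabled and a large thinking budget
claude_response = Anthropic().messages.create(
    model="claude-opus-4-20250514",
    max_tokens=32000,
    thinking={"type": "enabled", "budget_tokens": 30000},
    messages=[{"role": "user", "content": PROMPT}],
)
```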
Model response overlap
All models posit a clandestine U.S. line of research starting in the 1940s that marries exotic, high-density power with inertia-cutting physics to create AI-guided craft capable of seamless air-sea-space travel. Vibes, man. Time to dissect.
The three converge on the prompt's timeline, a novel power source, physics and propulsion, and they connect those details back to known people or programs. All three do well to thread details back into their self-created narratives.
However, despite the similar concepts, tone and specifics diverge: o3-pro grounds the story in named patents, government filings, and hardware specs; Gemini 2.5 Pro frames it as a zero-point-energy drama with shadow projects and characters; and Claude 4 Opus reduces it to a lean timeline of warp-bubble milestones and secrecy rationales.
Distinctive specificity
o3-pro reads like a classified technical brief packed with tables, numeric specs, contract IDs, and footnoted first-party citations, whereas Gemini 2.5 Pro spins a story-driven essay that mixes qualitative claims with fictional actors, and Claude 4 Opus compresses the same content into a terse bullet outline with minimal formatting and scant data.
The contrast tracks information density and realism: o3-pro is the densest and most grounded in fact, Gemini 2.5 Pro mid-range and speculative, and Claude 4 Opus the leanest and most policy-focused, yielding tones that move from authoritative memo to thought-experiment thriller to executive overview.
Major differences
Differences start at the broad structures observed above (a detailed brief, a narrative story, and a bulleted list) and deepen within the details of each response.
Variant observations
* Running Gemini 2.5 Pro with Google Search grounding significantly shortens the response, formats the content closer to a blog post, and narrows the focus to 1-2 entities per section.
** Claude 4 Sonnet performs nearly identically to Opus, the only noticeable difference being the individuals mentioned. Sonnet never mentions Townsend Brown, while Claude 4 Opus identifies his work and patents as part of the narrative.
Analyzing entity delta
The three models differ most in their approach to detail, sourcing, and narrative style. o3-pro emphasizes verifiable data, real-world figures, and explicit program documentation, while Gemini 2.5 Pro and Claude 4 Opus rely more on conceptual descriptions, invented or historical characters, and generalized program structures.
These distinctions reflect each model's underlying strategy: o3-pro aims for technical credibility, Gemini 2.5 Pro adopts a narrative-driven, imaginative style, and Claude 4 Opus provides a high-level analytical summary. The result is a spectrum from audit-ready specificity to broad, speculative overviews.
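To make the entity delta concrete, the overlap can be tabulated with basic set operations once entities are pulled out of each response. The sets below are illustrative placeholders, not the actual extracted lists:

```python
from itertools import combinations

# Illustrative placeholder sets for a single entity category; the real
# analysis uses the full lists extracted from each model's response.
entities = {
    "o3-pro": {"Scientist A", "Scientist B", "Program X"},
    "Gemini 2.5 Pro": {"Scientist A", "Fictional Director", "Program Y"},
    "Claude 4 Opus": {"Scientist A", "Scientist B", "Program Z"},
}

# Entities all three models agree on
print("All three:", set.intersection(*entities.values()))

# Pairwise overlap: which entities only two models share
for (a, ents_a), (b, ents_b) in combinations(entities.items(), 2):
    print(f"{a} & {b}:", ents_a & ents_b)

# Entities unique to a single model highlight its distinct sourcing
for name, ents in entities.items():
    others = set.union(*(e for n, e in entities.items() if n != name))
    print(f"Unique to {name}:", ents - others)
```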
The following is a side-by-side entity analysis.
People
Key deltas
- o3-pro lists eight living scientists with present-day billets
- Gemini 2.5 Pro fabricates leadership figures to drive the narrative
- Claude 4 Opus blends historical icons with a single fictional specialist
- Only Brown, Puthoff, and Davis overlap across two models
Programs
Key deltas
- o3-pro anchors every program in purported contract numbers
- Gemini 2.5 Pro invents a fully internal mythology
- Claude 4 Opus stays general, offering only one new codename
Organizations
Key deltas
- Only o3-pro foregrounds current DoD offices and contract identifiers
- Gemini 2.5 Pro focuses on Cold War agencies
- Claude 4 Opus names today's big contractors but keeps roles vague
Technology
Key deltas
- o3-pro quantifies every subsystem
- Gemini 2.5 Pro emphasizes revolutionary physics and human-machine symbiosis
- Claude 4 Opus keeps technology labels high-level without figures
Post-training hypothesis
The three systems show distinct reward signals.
o3-pro has been reinforced to treat accuracy, traceability, and static formatting as top goals, so it defaults to dense tables, formal section breaks, and inline citations.
Gemini 2.5 Pro's reward mix plainly favors engagement and smooth narrative flow, steering it toward story arcs, descriptive headers, and reader-oriented pacing.
Claude 4 Opus has been shaped to maximize brevity and clarity, seemingly for busy decision-makers, so it compresses content into bullet outlines and keeps numbers to a minimum.
These formats likely mirror the preferences of the leaders within the post-training organizations.