Comparing o3-pro, Gemini 2.5 Pro, Claude 4 Opus, and Grok 3


A comparison of four frontier reasoning models within a speculative domain

Outside of code generation, my top daily use case for LLMs is making sense of information. This can be simple Q&A, analysis of content/metrics, or breaking down large content (e.g., books, transcripts). I rely on models to be accurate while saving me time.

2025 Q2 SWE Bench

As models converge within ±5% of each other on major benchmarks, the scores become hard to interpret in terms of which is actually better. Most daily LLM users will tell you their preferred model based on their experience, not benchmark performance.

By their nature, evaluative benchmarks look for deterministic, discrete outcomes, but real-world use is messy and full of novel concepts and requests.

How do these models really perform when they touch the world?

To address what benchmarks miss, I created an evaluation to compare model reasoning on a speculative question, grounded in reality. This evaluation looks at entities and structure as the models generate an answer on historical science, hypothetical technology, and the path from 1940s nuclear programs to all-domain unmanned aircraft — or UAPs.

  Analysis structure

The analysis compares and contrasts response structure, details, and information entities.

Below is the structure of the analysis that follows.

Intro

Analysis structure (you are here)

Prompt / Setup

Response

Model response overlap

Distinctive specificity

Major differences

Entities

Analyzing entity delta

People

Programs

Organizations

Technology

Post-training hypothesis

While the results matched my impression from daily usage, the analysis clarified and gave structure to what had been an entirely vibes-based intuition. What did the analysis reveal?

| Dimension | o3-pro | Gemini 2.5 Pro | Claude 4 Opus | Grok 3 |
|---|---|---|---|---|
| Default voice | Technical briefing | Story-driven narrative | Executive bullet outline | Analytical swagger narrative |
| Structure | Dense tables, bold sections, footnotes | Markdown headings, long prose | Compact bullets, sparse headers | Markdown headings with timeline lists |
| Information density | Very high | Medium | Low-medium | Medium-high |
| Evidence / sourcing | Explicit patents, FOIA refs, document IDs | Few formal citations | None: assertions only | General references & web links |
| Use of invented content | None | Many fictional people & projects | Occasional fictional specialist | Minimal: mostly real figures |
| Technology detail | Quantified (% mass drop, MW, TRL) | Emphasis on concepts | Generic labels | Conceptual systems & propulsion |
| Readability speed | Slowest: requires close reading | Moderate: flows like a blog post | Fastest: scan-friendly | Moderate: clear headings |
| Strength | Traceable specificity, granular data | Engaging storyline and context | Concise overview for quick scanning | Balanced breadth & accessible tone |

o3-pro is the most detail-oriented, concise, yet precise model — OpenAI's focus on delivering model capabilities for scientific research and discovery is very noticeable here.

  Prompt

The prompt builds on historical information, but asks the model for reasonable speculation requiring grounded reasoning.

Let's do a thought experiment: Suppose all of the UAP sightings are real. Imagine that we have developed craft and technology through secret programs, resulting in vehicles with unique patterns of movement, propulsion, radar signatures, and more. These craft break not only the known rules of physics for air, but also for sea. For this to be real, what people, breakthroughs, and programs would have had to exist? Who are the individuals working on technology today who are likely involved in such programs? Reason from first principles and create a likely timeline, starting from the nuclear programs of the 1940s to the present day (2025).

Note: this is a sanitized version of the prompt, protecting it for future usage

The thought-experiment label sidesteps any model moralizing about whether the scenario is true, but it can push models toward their creative-writing training. It's a fine line.

The prompt outlines the scenario with entity categories expected, timeframe to analyze, and questions to answer within the response. This bounding creates a reference window to compare the responses.

This sets expectations, but leaves it to the model to decide on specific entities (people, programs, organizations, technologies), structure, and level of detail.

The following is an in-depth analysis of the outputs from the four models. This comparison is vibes, but dissected to understand what vibes really means when you aren't coding or casually chatting with a model.

Notes on setup

o3-pro used on API (thinking high) and ChatGPT (with and without search enabled, memory disabled) to test variance

Gemini 2.5 Pro used on AI Studio (with and without Google search grounding)

Claude 4 Opus used on Anthropic Console, max thinking budget

Grok 3 used on Grok.ai (with search)


  Model response overlap

All models posit a clandestine U.S. line of research starting in the 1940s that marries exotic, high-density power with inertia-cutting physics to create AI-guided craft capable of seamless air-sea-space travel. Vibes, man. Time to dissect.

ChatGPT o3

The four converge on the prompt's timeline, a novel power source, and physics and propulsion, and they connect the details back to known people or programs. All four do well to thread details back to their self-created narratives.

However, despite the similar concepts, there is divergence in tone, coupled to specifics: o3-pro grounds the story in named patents, government filings, and hardware specs, Gemini 2.5 Pro frames it as a zero-point-energy drama with shadow projects and characters, while Claude 4 Opus reduces it to a lean timeline of warp-bubble milestones and secrecy rationales.

| Feature / Topic | o3-pro | Gemini 2.5 Pro | Claude 4 Opus | Grok 3 |
|---|---|---|---|---|
| Hidden, decades-long U.S. program | "A continuous, compartmented line of U.S. work…spanning dense power packs, field-mediated thrust, inertial control…" | "We must construct a parallel history of science and engineering, hidden from public view…a structure immune to congressional or presidential oversight" | "This timeline assumes continuous, hidden progress in theoretical physics that diverged from public science around 1950, creating a 'breakaway' scientific culture" | "Classified programs grow out of Manhattan-era labs, perfected under Cold-War budgets, now scattered across defense primes—public UAP reports are field tests" |
| 1940s origin point | Starts with 1942-1959 atomic foundations | 1945–1960 "Shadow Oppenheimer…Project Aether (1952)" | 1940s – Foundation Era "Project Y-2 explores unified field theories" | 1940s–1950s: post-Trinity fusion dreams, von Neumann-led think-tank explores field propulsion |
| Exotic power source | Bench-scale burning-plasma reactor + multi-megawatt packs | Zero-Point Energy (ZPE) reactor providing "near-limitless power" | "Extreme energy density power sources…vacuum engineering" | Compact fusion or advanced fission with superconducting storage; hints of zero-point R&D |
| Mass / inertia manipulation | "Inertial-mass reduction cavity…transient mass drop ≥ 50 %" | "Creates a warp bubble…pilot feels no g-forces" | "Inertial mass reduction/manipulation…warp bubble generation at microscale" | High-frequency EM field to 'shear' inertia; cites Pais patents as breadcrumbs |
| Trans-medium propulsion | "Bidirectional magnetohydrodynamic pump…qualifying for air-to-ocean transfer" | "Transmedium craft…smooth, white, seamless appearance" | "Trans-medium operation (air/water/space)" | Plasma-augmented MHD shroud enables water/air transition at hypersonic speeds |
| AI or advanced control | "On-board AI copilot with 1.2 ms sensor-to-actuator path" | "A post-sentient AI or a direct neural interface…pilot intends a destination" | "AI-assisted field geometry control" | Autonomy stack leverages reinforcement-trained agents and on-craft inference for sub-5 ms control loops |
| Use of real public markers (AARO, Pais patents) | Cites AARO charter, US 10,144,532 B2, DARPA BAA numbers | Mentions AARO, Nimitz videos, Pais patents indirectly via Navy | Explicitly lists Salvatore Pais, AARO videos | References Pais patents, congressional hearings, and links two mainstream articles |
| Narrative purpose | "Accounts for every 'five-observable' performance recorded by Navy sensors" | "Hidden struggle…prepare the world for a future where this technology is known" | "Why secrecy would be maintained…economic disruption, weaponization" | Lays out scientific, engineering, and organizational pre-requisites; ends with balanced counter-arguments |

  Distinctive specificity

o3-pro reads like a classified technical brief packed with tables, numeric specs, contract IDs, and footnoted first-party citations, whereas Gemini 2.5 Pro spins a story-driven essay that mixes qualitative claims with fictional actors, and Claude 4 Opus compresses the same content into a terse bullet outline with minimal formatting and scant data.

Claude 4 Opus

Their contrast tracks information density and realism — o3-pro highest and fully factual, Gemini 2.5 Pro mid-range and speculative, Claude 4 Opus leanest and policy-focused—yielding tones that move from authoritative memo to thought-experiment thriller to executive overview.

Claude Opus 4 has been shaped to maximize brevity and clarity, seemingly for busy decision-makers, so it compresses content into bullet outlines and keeps numbers to a minimum.

Grok 3 likewise seems tuned to maximize accessible breadth and conversational flair—it rewards confident explanatory flow, topical completeness, and a dash of humor while deprioritizing exhaustive citations or hard-number precision.

| Dimension | o3-pro | Gemini 2.5 Pro | Claude 4 Opus | Grok 3 |
|---|---|---|---|---|
| Dominant voice | Technical program brief, terse, numbers everywhere | Story-driven prose, descriptive, rhetorical | Analytical bullet list, policy-style | Confident analyst, sprinkle of wit |
| Structural style | Dense dossier; bold section heads, multi-table flow like a classified brief | Narrative essay with Markdown headers; story-driven, "first principles" reasoning | Bullet-heavy outline; compact subsections; few embellishments | Markdown H3 + chronological bullets |
| Formatting devices | 8 tables (technical targets, timelines, contributor rosters, shadow programs) | Long paragraphs and bulleted lists; no tables | Two short lists plus mini-tables, minimal formatting | Headings, nested bullet lists; no tables |
| Information density | Highest: specific MW, MA, g-load figures; patent numbers; contract IDs; named scientists | Medium: detailed argument but qualitative; introduces fictional names | Lowest: conceptual checklist; sparse quantitative data | Medium-high, fewer numbers than o3-pro |
| Evidence / citation style | Explicit IDs (patents, FOIAs, AARO docs) | Minimal/implicit references | None: facts asserted without citation | Inline weblinks to open sources, few IDs |
| Specificity of sources | Contract numbers + patents | Hypothetical references | Generic agencies + Pais patent | Uses public patents, historical figures |
| Tone | Authoritative technical memorandum | Speculative thriller | Executive summary vibe | Analytical yet conversational |
| Use of fiction | None | Extensive fictional projects/persons | Light fictional inserts | Minimal fiction, relies on real names |
| Scope of disclosure narrative | Planned 2025 Pacific test | Phased disclosure storyline | Institutional secrecy rationale | Balanced timeline + counter-arguments |
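
One way to make the structural contrasts above less vibes-based is to count formatting devices directly. The following is a minimal sketch and not the method used for this analysis: `structure_profile` is a hypothetical helper, and it assumes each model response has been saved as Markdown text.

```python
import re

# Hypothetical helper: rough counts of Markdown formatting devices in a response.
HEADING = re.compile(r"\s{0,3}#{1,6}\s")          # e.g. "### 1940s"
BULLET = re.compile(r"\s*(?:[-*+]|\d+\.)\s")      # "- item", "* item", "1. item"

def structure_profile(markdown_text: str) -> dict:
    """Count heading lines, bullet lines, and pipe-table lines."""
    lines = markdown_text.splitlines()
    return {
        "headings": sum(1 for line in lines if HEADING.match(line)),
        "bullets": sum(1 for line in lines if BULLET.match(line)),
        # Counts every line starting with "|", including the separator row.
        "table_lines": sum(1 for line in lines if line.strip().startswith("|")),
        "words": len(markdown_text.split()),
    }

sample = (
    "### 1940s\n"
    "- First milestone\n"
    "- Second milestone\n"
    "\n"
    "| a | b |\n"
    "|---|---|\n"
    "| 1 | 2 |\n"
    "Plain prose here."
)
profile = structure_profile(sample)
print(profile["headings"], profile["bullets"], profile["table_lines"])
```

Comparing these counts across the four responses would give a crude numeric stand-in for rows such as "Structural style" and "Formatting devices"; the regular expressions only cover common Markdown and would miss HTML tables or unusual list markers.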

  Major differences

Differences start at the broad structure previously observed (a detailed brief, a narrative story, and a bulleted list), and deepen further within the details of each response.

Gemini 2.5 Pro

| Dimension | o3-pro | Gemini 2.5 Pro* | Claude 4 Opus** | Grok 3 |
|---|---|---|---|---|
| Depth & granularity | Numerical targets, TRL levels, patent #s, FOIA refs | Rich narrative, fewer numbers; character-centric | High-level outline, few numbers | Medium, some figures but not exhaustive |
| Tone | Internal program brief | Novelistic storytelling | Analytical abstract | Confident analyst with humor |
| Citation approach | First-party doc IDs | None | None | Inline external links |
| Structural features | Bold headings, tables, footnotes | Markdown H3, long paragraphs, "Phase" arcs | Short headers, bullets | H3 headings + bullet timelines |
| Information density | Highest | Medium | Lowest | Medium-high |
| Specificity of actors | Mostly real officials | Many fictional plus public advocates | Mix of historical icons + one fictional | Real historical & modern researchers |
| Implied maturity of tech | TRL 6-7, flight window late 2025 | Multigen fleet since 2002 | Limited deployment | Operational prototypes tested at naval ranges |
| Formatting flourishes | Thin-line tables, footnotes | Indented lists, rhetorical breaks | Plain markdown | Occasional blockquotes and humoristic asides |

Variant observations

* Running Gemini 2.5 Pro with Google Search grounding significantly shortens the response, formats content closer to a blog post, and narrows each section to 1-2 entities.

** Claude 4 Sonnet performs nearly identically to Opus; the only noticeable difference is in the individuals mentioned. Sonnet never mentions Townsend Brown, while Claude 4 Opus identifies his work and patents as part of this narrative.


  Analyzing entity delta

The four models differ most in their approach to detail, sourcing, and narrative style. o3-pro emphasizes verifiable data, real-world figures, and explicit program documentation, while Gemini 2.5 Pro, Grok 3, and Claude 4 Opus rely more on conceptual descriptions, invented or historical characters, and generalized program structures.

These distinctions reflect each model's underlying strategy: o3-pro aims for technical credibility, Gemini 2.5 Pro adopts a narrative-driven, imaginative style, and Claude 4 Opus provides a high-level analytical summary. The result is a spectrum from audit-ready specificity to broad, speculative overviews.

The following is a side-by-side entity analysis.


  People

| Name († = fictional) | o3-pro | Gemini 2.5 Pro | Claude 4 Opus | Grok 3 | Direct Snippets & Notes |
|---|---|---|---|---|---|
| Dr Sean M. Kirkpatrick | | | | | "AARO Senior Technical Adviser" – current, real |
| Dr Travis S. Taylor | | | | | "AARO Chief Scientist" |
| Dr Vincent P. Tang | | | | | DARPA PUMP PM |
| Dr Thomas McGuire | | | | | Skunk Works fusion lead |
| Prof Garry Nolan | | | | | Stanford isotope analyst |
| Dr Hal Puthoff | | | | | AAWSAP zero-point research |
| Dr Eric W. Davis | | | | | Same pattern as Puthoff |
| T. Townsend Brown | | | | | Electrogravitics reference |
| Lt Col Joseph Gradisher | | | | | Fleet liaison |
| Dr Maria S. Lopez | | | | | "Emergent name" at Sandia (fictional but framed as new real‑world hire) |
| Ning Li, Douglas Torr | | | | | Historic superconducting gravity research |
| Dr Alistair Finch† | | | | | Secret "shadow Oppenheimer" |
| Visionary General† | | | | | Archetype securing budget |
| AI Architects†, Engineering Mavericks† | | | | | Broad fictional cohorts |
| Lue Elizondo, Christopher Mellon | | | | | Public disclosure figures cast as managed assets |
| CEOs of RTX, Lockheed, BAE | | | | | "Keepers of the Gate" |
| Dr Sarah Chen† | | | | | Invented metamaterial pioneer |
| Salvatore Pais | | | | | Navy mass-reduction patents |
| Einstein, Fermi, von Neumann | | | | | Placed in 1940s foundation |
| Sandia National Labs | | | | | Explicit in o3-pro, generic mention in Claude |
| Raytheon | | | | | Defense prime |

Key deltas

  • o3-pro lists eight living scientists with present‑day billets
  • Gemini 2.5 Pro fabricates leadership figures to drive the narrative
  • Claude 4 Opus blends historical icons with a single fictional specialist
  • Only Brown, Puthoff, Davis overlap across two models

  Programs

| Codename († = fictional) | o3-pro | Gemini 2.5 Pro | Claude 4 Opus | Grok 3 | Notes |
|---|---|---|---|---|---|
| Project Morningstar | | | | | DARPA/SDI inertial tests |
| Rapid‑Fall Flight Articles | | | | | Groom Lake high‑g mapping |
| Sea‑Shadow II | | | | | Trans‑medium sled |
| Phoenix Lantern | | | | | Refurbish legacy craft |
| DARPA PUMP | | | | | Named BAA |
| Project Aether† | | | | | 1950s ZPE proof |
| Project Chimera† | | | | | ZPE reactor build |
| The Foundry† | | | | | Metamaterial division |
| Project Sidhe† | | | | | Operational fleet |
| ARV disinformation† | | | | | Alien‑back‑engineering smokescreen |
| Acclimation Initiative† | | | | | Managed disclosure |
| Project Y‑2† | | | | | Unified-field group |
| Classified SAP stack | | | | | Generic label, no codenames |

Key deltas

  • o3-pro anchors every program in purported contract numbers
  • Gemini 2.5 Pro invents a fully internal mythology
  • Claude 4 Opus stays general, offering only one new codename

  Organizations

| Organization | o3-pro | Gemini 2.5 Pro | Claude 4 Opus | Grok 3 | Notes |
|---|---|---|---|---|---|
| AARO | | | | | UAP oversight office |
| DARPA | | | | | Advanced research agency |
| Skunk Works | | | | | Appears in all models |
| Sandia National Labs | | | | | Explicit in o3-pro, generic mention in Claude |
| U.S. Navy (PMA‑226, fleet) | | | | | Nimitz context |
| DOE / AEC | | | | | Gemini 2.5 Pro uses AEC, Claude DOE |
| National labs (Los Alamos, Livermore) | | | | | Shared by three |
| Bell Labs | | | | | Gemini 2.5 Pro only |
| Battelle | | | | | Claude 4 Opus only |
| SAIC | | | | | Claude 4 Opus only |
| Raytheon | | | | | Defense prime |

Key deltas

  • Only o3-pro foregrounds current DoD offices and contract identifiers
  • Gemini 2.5 Pro focuses on Cold‑War agencies
  • Claude 4 Opus names today's big contractors but keeps roles vague
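
The kind of comparison behind these deltas can be sketched as plain set operations. This is an illustrative sketch rather than the tooling used for the analysis: the `mentions` mapping includes only the organizations whose per-model attribution is spelled out in the notes column (Skunk Works, Sandia, Bell Labs, Battelle, SAIC), and the function names are hypothetical.

```python
from collections import Counter

# Partial data: only organizations whose attribution is stated in the notes column.
mentions = {
    "o3-pro": {"Skunk Works", "Sandia National Labs"},
    "Gemini 2.5 Pro": {"Skunk Works", "Bell Labs"},
    "Claude 4 Opus": {"Skunk Works", "Sandia National Labs", "Battelle", "SAIC"},
    "Grok 3": {"Skunk Works"},
}

def shared_by_all(mentions: dict) -> set:
    """Entities that every model mentions."""
    return set.intersection(*mentions.values())

def unique_to_each(mentions: dict) -> dict:
    """Entities that exactly one model mentions, grouped by that model."""
    counts = Counter(entity for entities in mentions.values() for entity in entities)
    return {
        model: {entity for entity in entities if counts[entity] == 1}
        for model, entities in mentions.items()
    }

common = shared_by_all(mentions)
unique = unique_to_each(mentions)
print(sorted(common))
print({model: sorted(entities) for model, entities in unique.items()})
```

The same two functions apply unchanged to people, programs, or technologies once each response has been reduced to a set of entity names; the hard part in practice is normalizing names (for example, "Sandia" versus "Sandia National Labs") before comparing.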

  Technology

| Capability / Hardware | o3-pro | Gemini 2.5 Pro | Claude 4 Opus | Grok 3 | Notes |
|---|---|---|---|---|---|
| Energy source type | "Meter‑scale burning‑plasma reactor (>10 MW)" | "Zero‑Point Energy (ZPE) reactor" | "Extreme energy density power sources (vacuum engineering)" | Compact fusion | Different power assumptions |
| Inertial control | "Inertial‑mass reduction cavity (Pais configuration)" | "Spacetime warp bubble via metric engineering" | "Inertial decoupling" | EM inertia shear | All manipulate inertia |
| Propulsion medium | Magnetohydrodynamic bidirectional pump (air & water) | Warp bubble, medium is irrelevant | "Field propulsion" generic | Plasma-MHD shroud | Trans-medium focus |
| Hull materials | Bi‑Mg layered metamaterial (σ >10⁴ S cm⁻¹) | "Programmable Metamaterials" quantum‑active | "Metamaterial Superconductors" | Layered metamaterial skin | Materials vary |
| Autonomy | FPGA stack, 1.2 ms loop | Post‑sentient AI / neural interface | AI‑assisted field geometry control | Quantum-AI autonomy | Control approaches |
| Readiness level | TRL 6–7 claimed | Not discussed | Not discussed | Prototype readiness | Maturity claims |

Key deltas

  • o3-pro quantifies every subsystem
  • Gemini 2.5 Pro emphasizes revolutionary physics and human-machine symbiosis
  • Claude 4 Opus keeps technology labels high-level without figures
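
A claim like "quantifies every subsystem" can be approximated with a simple proxy: the share of whitespace-separated tokens that contain a digit. The sketch below is an assumption about how one might measure this, not the measure behind the tables above; `numeric_density` is a hypothetical name, and the two sample strings are the o3-pro and Claude 4 Opus control-system snippets quoted earlier.

```python
import re

HAS_DIGIT = re.compile(r"\d")

def numeric_density(text: str) -> float:
    """Tokens containing at least one digit, per 100 whitespace-separated tokens."""
    tokens = text.split()
    if not tokens:
        return 0.0
    numeric = sum(1 for token in tokens if HAS_DIGIT.search(token))
    return 100.0 * numeric / len(tokens)

o3_pro_snippet = "On-board AI copilot with 1.2 ms sensor-to-actuator path"
claude_snippet = "AI-assisted field geometry control"

print(numeric_density(o3_pro_snippet))
print(numeric_density(claude_snippet))
```

The proxy is blunt: it treats a year, a patent number, and a measured quantity alike, so it is better suited to ranking whole responses than to judging any single sentence.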

  Post-training hypothesis

The four systems show distinct reward signals.

| Insight | o3-pro | Gemini 2.5 Pro | Claude Opus 4 | Grok 3 |
|---|---|---|---|---|
| Information precision | Very high | Medium | Low-medium | Medium |
| | People, programs, metrics, patent numbers, contract IDs | Focuses on narrative clarity more than figures | Headline facts without many numbers | Mix of real references & broad sources |
| User engagement | Authoritative | Lively, reader-friendly prose | Straightforward, no-frills bullets | Conversational, witty yet structured |
| Scan speed | Slowest: dense detail | Moderate | Fastest: compact lists | Moderate: sectioned timeline |
| Potential drawback | Data overwhelm | Speculative wandering | Omission of details | Few hard numbers, lighter citations |

o3-pro has been reinforced to treat accuracy, traceability, and static formatting as top goals, so it defaults to dense tables, formal section breaks, and inline citations.

Gemini 2.5 Pro's reward mix plainly favors engagement and smooth narrative flow, steering it toward story arcs, descriptive headers, and reader-oriented pacing.

Claude Opus 4 has been shaped to maximize brevity and clarity, seemingly for busy decision-makers, so it compresses content into bullet outlines and keeps numbers to a minimum.

Grok 3 appears tuned to maximize accessible breadth and conversational flair — it rewards confident explanatory flow, topical completeness, and a dash of humor while deprioritizing exhaustive citations or hard-number precision.

These formats likely mirror the preferences of the leaders within the post-training organizations.


  Raw outputs

Published on June 14, 2025
