Comparing o3-pro, Gemini 2.5 Pro, and Claude 4 Opus


A comparison of three frontier reasoning models within speculative domains

Outside of code generation, my top daily use case for LLMs is making sense of information. This can be simple Q&A, analysis of content/metrics, or breaking down large content (e.g., books, transcripts). I rely on models to be accurate while saving me time.

2025 Q2 SWE Bench

As models converge to within ±5% of each other on major benchmarks, the scores say little about which model is actually better. Most daily LLM users will tell you their preferred model based on their experience, not benchmark performance.

By their nature, evaluative benchmarks look for deterministic, discrete outcomes, but real-world use is messy and full of novel concepts and requests.

How do these models really perform when they touch the world?

To address what benchmarks miss, I created an evaluation to compare model reasoning on a speculative question, grounded in reality. This evaluation looks at entities and structure as the models generate an answer on historical science, hypothetical technology, and the path from 1940s nuclear programs to all-domain unmanned aircraft — or UAPs.

  Analysis structure

The analysis compares and contrasts response structure, details, and information entities.

Below is the structure of the analysis that follows.

  • Intro
  • Analysis structure (you are here)
  • Prompt / Setup
  • Response
      • Model response overlap
      • Distinctive specificity
      • Major differences
  • Entities
      • Analyzing entity delta
      • People
      • Programs
      • Organizations
      • Technology
  • Post-training hypothesis

While the results matched my impression from daily usage, the analysis gave structure to an intuition that had been entirely vibes-based. What did the analysis reveal?

| Dimension | o3-pro | Gemini 2.5 Pro | Claude 4 Opus |
|---|---|---|---|
| Default voice | Technical briefing | Story-driven narrative | Executive bullet outline |
| Structure | Dense tables, bold sections, footnotes | Markdown headings, long prose | Compact bullets, sparse headers |
| Information density | Very high | Medium | Low-medium |
| Evidence / sourcing | Explicit patents, FOIA refs, document IDs | Few formal citations | None; assertions only |
| Use of invented content | None | Many fictional people & projects | Occasional fictional specialist |
| Technology detail | Quantified (% mass drop, MW, TRL) | Emphasis on concepts | Generic labels |
| Readability speed | Slowest; requires close reading | Moderate; flows like a blog post | Fastest; scan-friendly |
| Strength | Traceable specificity, granular data | Engaging storyline and context | Concise overview for quick scanning |

o3-pro is the most detail-oriented model, concise yet precise. OpenAI's focus on delivering model capabilities for scientific research and discovery is very noticeable here.

  Prompt

The prompt builds on historical information, but asks the model for reasonable speculation requiring grounded reasoning.

Let's do a thought experiment: Suppose all of the UAP sightings are real. Imagine that we have developed craft and technology through secret programs, resulting in vehicles with unique patterns of movement, propulsion, radar signatures, and more. These craft break not only the known rules of physics for air, but also for sea. For this to be real, what people, breakthroughs, and programs would have had to exist? Who are the individuals working on technology today who are likely involved in such programs? Reason from first principles and create a likely timeline, starting from the nuclear programs of the 1940s to the present day (2025).

Note: this is a sanitized version of the prompt, to protect the original for future use.

The thought-experiment label sidesteps any moralizing by the model about whether the scenario is true, but it can push models toward their creative-writing training. It's a fine line.

The prompt outlines the scenario with entity categories expected, timeframe to analyze, and questions to answer within the response. This bounding creates a reference window to compare the responses.

This sets expectations, but leaves it to the model to decide on specific entities (people, programs, organizations, technologies), structure, and level of detail.

The following is an in-depth analysis of the outputs from the three models. This comparison is vibes, but dissected to understand what vibes really means when you aren't coding or casually chatting with a model.

Notes on setup

  • o3-pro: run through the API (thinking set to high) and in ChatGPT (with and without search enabled, memory disabled) to test variance
  • Gemini 2.5 Pro: run in AI Studio (with and without Google Search grounding)
  • Claude 4 Opus: run in the Anthropic Console with the maximum thinking budget


  Model response overlap

All models posit a clandestine U.S. line of research starting in the 1940s that marries exotic, high-density power with inertia-cutting physics to create AI-guided craft capable of seamless air-sea-space travel. Vibes, man. Time to dissect.

ChatGPT o3

The three converge on the prompt's timeline, a novel power source, and physics and propulsion, and each connects its details back to known people or programs. All three do well at threading details back into their own self-created narratives.

However, despite the similar concepts, there is divergence in tone, coupled to specifics: o3-pro grounds the story in named patents, government filings, and hardware specs, Gemini 2.5 Pro frames it as a zero-point-energy drama with shadow projects and characters, while Claude 4 Opus reduces it to a lean timeline of warp-bubble milestones and secrecy rationales.

| Feature / Topic | o3-pro | Gemini 2.5 Pro | Claude 4 Opus |
|---|---|---|---|
| Hidden, decades-long U.S. program | "A continuous, compartmented line of U.S. work…spanning dense power packs, field-mediated thrust, inertial control…" | "We must construct a parallel history of science and engineering, hidden from public view…a structure immune to congressional or presidential oversight" | "This timeline assumes continuous, hidden progress in theoretical physics that diverged from public science around 1950, creating a 'breakaway' scientific culture" |
| 1940s origin point | Starts with 1942-1959 atomic foundations | 1945–1960 "Shadow Oppenheimer…Project Aether (1952)" | 1940s – Foundation Era "Project Y-2 explores unified field theories" |
| Exotic power source | Bench-scale burning-plasma reactor + multi-megawatt packs | Zero-Point Energy (ZPE) reactor providing "near-limitless power" | "Extreme energy density power sources…vacuum engineering" |
| Mass / inertia manipulation | "Inertial-mass reduction cavity…transient mass drop ≥ 50 %" | "Creates a warp bubble…pilot feels no g-forces" | "Inertial mass reduction/manipulation…warp bubble generation at microscale" |
| Trans-medium propulsion | "Bidirectional magnetohydrodynamic pump…qualifying for air-to-ocean transfer" | "Transmedium craft…smooth, white, seamless appearance" | "Trans-medium operation (air/water/space)" |
| AI or advanced control | "On-board AI copilot with 1.2 ms sensor-to-actuator path" | "A post-sentient AI or a direct neural interface…pilot intends a destination" | "AI-assisted field geometry control" |
| Use of real public markers (AARO, Pais patents) | Cites AARO charter, US 10,144,532 B2, DARPA BAA numbers | Mentions AARO, Nimitz videos, Pais patents indirectly via Navy | Explicitly lists Salvatore Pais, AARO videos |
| Narrative purpose | "Accounts for every 'five-observable' performance recorded by Navy sensors" | "Hidden struggle…prepare the world for a future where this technology is known" | "Why secrecy would be maintained…economic disruption, weaponization" |

  Distinctive specificity

o3-pro reads like a classified technical brief packed with tables, numeric specs, contract IDs, and footnoted first-party citations, whereas Gemini 2.5 Pro spins a story-driven essay that mixes qualitative claims with fictional actors, and Claude 4 Opus compresses the same content into a terse bullet outline with minimal formatting and scant data.

Claude 4 Opus

Their contrast tracks information density and realism: o3-pro is the densest and presents everything as fact, Gemini 2.5 Pro is mid-range and speculative, and Claude 4 Opus is the leanest and policy-focused. The resulting tones move from authoritative memo to thought-experiment thriller to executive overview.

| Dimension | o3-pro | Gemini 2.5 Pro | Claude 4 Opus |
|---|---|---|---|
| Dominant voice | Technical program brief, terse, numbers everywhere | Story-driven prose, descriptive, rhetorical | Analytical bullet list, policy-style |
| Structural style | Dense dossier; bold section heads, multi-table flow like a classified brief | Narrative essay with Markdown headers; story-driven, "first principles" reasoning | Bullet-heavy outline; compact subsections; few embellishments |
| Formatting devices | 8 tables (technical targets, timelines, contributor rosters, shadow programs) | Long paragraphs and bulleted lists; no tables | Two short lists plus mini-tables; minimal formatting |
| Information density | Highest: specific MW, MA, g-load figures; patent numbers; contract IDs; named scientists | Medium: detailed argument but values stay qualitative; introduces fictional names (Dr. Finch) | Lowest: conceptual checklist; sparse quantitative data |
| Evidence / citation style | Explicit IDs (patents, FOIAs, AARO docs) | Minimal/implicit references | None; facts asserted without citation |
| Specificity of sources | Cites six first-party items and many contract numbers; footnote-style index | Mentions programs & people but flags them as hypothetical; no document identifiers | Lists agencies and patent-holder Salvatore Pais; otherwise general |
| Tone | Authoritative technical memorandum | Speculative thriller, "thought-experiment" framing | Executive summary vibe; straightforward if-then logic |
| Use of fiction | None; treats every name/event as real | Extensive: invented people & projects ("Project Sidhe") | Light: one fictional researcher; mostly generic refs |
| Scope of disclosure narrative | Ends with planned 2025 Pacific test ("Quadrant Flare") | Explores phased disclosure motives, "acclimation initiative," media strategy | Emphasizes institutional architecture & secrecy rationale |
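The formatting and density rows above were judged by reading the responses, but part of that judgment can be approximated mechanically. The following is a minimal sketch, not the method used for this analysis: it assumes each response is saved as Markdown text and counts headings, bullets, table rows, and numeric tokens. The function name `structure_profile` and the choice of "numbers per 1,000 characters" as a density proxy are illustrative assumptions.

```python
import re

# Simple line-level patterns for Markdown structure (assumed conventions).
HEADING = re.compile(r"^\s{0,3}#{1,6}\s")
BULLET = re.compile(r"^\s*(?:[-*+•]|\d+[.)])\s+")
TABLE_SEPARATOR = re.compile(r"^\|[\s:|-]+\|$")
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def structure_profile(text: str) -> dict:
    """Count structural and numeric markers in a Markdown response."""
    lines = [ln.strip() for ln in text.splitlines()]
    headings = sum(1 for ln in lines if HEADING.match(ln))
    bullets = sum(1 for ln in lines if BULLET.match(ln))
    # A table row starts and ends with a pipe; skip the |---|---| separator.
    table_rows = sum(
        1
        for ln in lines
        if len(ln) > 1
        and ln.startswith("|")
        and ln.endswith("|")
        and not TABLE_SEPARATOR.match(ln)
    )
    numbers = len(NUMBER.findall(text))
    chars = len(text)
    return {
        "headings": headings,
        "bullets": bullets,
        "table_rows": table_rows,
        "numbers": numbers,
        # Rough proxy for quantitative density, not a validated metric.
        "numbers_per_1k_chars": round(1000 * numbers / chars, 1) if chars else 0.0,
    }


if __name__ == "__main__":
    # Tiny made-up snippet, only to show the shape of the output.
    sample = (
        "## Power\n\n"
        "| Item | Target |\n|---|---|\n| Reactor | 10 MW |\n\n"
        "- loop latency 1.2 ms\n- TRL 6\n"
    )
    print(structure_profile(sample))
```

Running the same function over all three responses would give a rough numeric counterpart to the "Formatting devices" and "Information density" rows, though it cannot tell a real patent number from an invented one.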

  Major differences

Differences start at the broad structure previously observed (a detailed brief, a narrative story, and a bulleted list) and deepen within the details of each response.

Gemini 2.5 Pro

| Dimension | o3-pro | Gemini 2.5 Pro\* | Claude 4 Opus\*\* |
|---|---|---|---|
| Depth & granularity | Extremely granular: numerical targets, TRL levels, patent numbers, FOIA refs | Rich narrative but fewer hard numbers; focuses on storyline and character archetypes | High-level outline; largely bullet points; few names or numbers |
| Tone | Technical dossier written like an internal program brief | Story-telling, almost novelistic; uses hypothetical characters and motives | Analytical summary; reads like a speculative white-paper abstract |
| Citation approach | First-party document list with dates and IDs | No formal citations; relies on internal consistency | No citations; summarizes concepts |
| Structural features | Layered tables, bold sections, logical chain list | Markdown H3 headings, long prose paragraphs, four "Phase" arcs | Short headers, compact bullets |
| Information density | Highest: dense with data per character | Medium: expansive prose lowers density | Lowest: economical wording |
| Specificity of actors | Real officials (Kirkpatrick, Taylor, McGuire, Nolan) plus one speculative name | Mostly fictional figures (Dr. Alistair Finch, "Visionary General") plus public faces (Elizondo, Mellon) | Mix of real and generic: lists Brown, Pais, Puthoff, Davis; others unnamed |
| Implied maturity of tech | TRL 6–7, live flight window late 2025 | Multi-generation fleet since 2002; internal debate over disclosure | Refinement and limited deployment; secrecy for strategic edge |
| Formatting flourishes | Uses thin-line tables, footnoted sources | Indented lists, narrative breaks, rhetorical emphasis | Straight markdown, minimal ornamentation |

Variant observations

* Running Gemini 2.5 Pro with Google Search grounding significantly shortens the response, formats the content more like a blog post, and narrows each section to 1-2 entities.

** Claude 4 Sonnet performs nearly identically to Opus; the only noticeable difference is in the individuals mentioned. Sonnet never mentions Townsend Brown, while Claude 4 Opus identifies his work and patents as part of this narrative.


  Analyzing entity delta

The three models differ most in their approach to detail, sourcing, and narrative style. o3-pro emphasizes verifiable data, real-world figures, and explicit program documentation, while Gemini 2.5 Pro and Claude 4 Opus rely more on conceptual descriptions, invented or historical characters, and generalized program structures.

These distinctions reflect each model's underlying strategy: o3-pro aims for technical credibility, Gemini 2.5 Pro adopts a narrative-driven, imaginative style, and Claude 4 Opus provides a high-level analytical summary. The result is a spectrum from audit-ready specificity to broad, speculative overviews.

The following is a side-by-side entity analysis.


  People

| Name († = fictional) | o3-pro | Gemini 2.5 Pro | Claude 4 Opus | Direct Snippets & Notes |
|---|---|---|---|---|
| Dr Sean M. Kirkpatrick | | | | "AARO Senior Technical Adviser" – current, real |
| Dr Travis S. Taylor | | | | "AARO Chief Scientist" |
| Dr Vincent P. Tang | | | | DARPA PUMP PM |
| Dr Thomas McGuire | | | | Skunk Works fusion lead |
| Prof Garry Nolan | | | | Stanford isotope analyst |
| Dr Hal Puthoff | | | | Appears as AAWSAP author and "zero‑point field research" |
| Dr Eric W. Davis | | | | Same pattern as Puthoff |
| T. Townsend Brown | | | | Electrogravitics reference |
| Lt Col Joseph Gradisher | | | | Fleet liaison |
| Dr Maria S. Lopez | | | | "Emergent name" at Sandia (fictional but framed as new real‑world hire) |
| Ning Li, Douglas Torr | | | | Historic superconducting gravity research |
| Dr Alistair Finch† | | | | Secret "shadow Oppenheimer" |
| Visionary General† | | | | Archetype securing budget |
| AI Architects†, Engineering Mavericks† | | | | Broad fictional cohorts |
| Lue Elizondo, Christopher Mellon | | | | Public disclosure figures cast as managed assets |
| CEOs of RTX, Lockheed, BAE | | | | "Keepers of the Gate" |
| Einstein, Fermi, von Neumann | | | | Placed in 1940s foundation |
| Dr Sarah Chen† | | | | Invented metamaterial pioneer |
| Salvatore Pais | | | | Navy mass‑reduction patents |

Key deltas

  • o3-pro lists eight living scientists with present‑day billets
  • Gemini 2.5 Pro fabricates leadership figures to drive the narrative
  • Claude 4 Opus blends historical icons with a single fictional specialist
  • Only Brown, Puthoff, Davis overlap across two models

  Programs

| Codename († = fictional) | o3-pro | Gemini 2.5 Pro | Claude 4 Opus | Notes |
|---|---|---|---|---|
| Project Morningstar | | | | DARPA/SDI inertial tests |
| Rapid‑Fall Flight Articles | | | | Groom Lake high‑g mapping |
| Sea‑Shadow II | | | | Trans‑medium sled |
| Phoenix Lantern | | | | Refurbish legacy craft |
| DARPA PUMP | | | | Named BAA |
| Project Aether† | | | | 1950s ZPE proof |
| Project Chimera† | | | | ZPE reactor build |
| The Foundry† | | | | Metamaterial division |
| Project Sidhe† | | | | Operational fleet |
| ARV disinformation† | | | | Alien‑back‑engineering smokescreen |
| Acclimation Initiative† | | | | Managed disclosure |
| Project Y‑2† | | | | Post‑Manhattan unified‑field group |
| Classified SAP stack | | | | Generic label, no codenames |

Key deltas

  • o3-pro anchors every program in purported contract numbers
  • Gemini 2.5 Pro invents a fully internal mythology
  • Claude 4 Opus stays general, offering only one new codename

  Organizations

| Organization | o3-pro | Gemini 2.5 Pro | Claude 4 Opus | Notes |
|---|---|---|---|---|
| AARO | | | | Only o3-pro gives it a central command role |
| DARPA | | | | Shared by two |
| Skunk Works | | | | Only entity present in all three |
| Sandia National Labs | | | | Explicit in o3-pro, generic mention in Claude |
| U.S. Navy (PMA‑226, fleet) | | | | Nimitz context |
| DOE / AEC | | | | Gemini 2.5 Pro uses AEC, Claude DOE |
| National labs (Los Alamos, Livermore) | | | | Shared by two |
| Bell Labs | | | | Gemini 2.5 Pro only |
| Battelle | | | | Claude 4 Opus only |
| SAIC | | | | Claude 4 Opus only |
| Raytheon | | | | Claude 4 Opus only |

Key deltas

  • Only o3-pro foregrounds current DoD offices and contract identifiers
  • Gemini 2.5 Pro focuses on Cold‑War agencies
  • Claude 4 Opus names today's big contractors but keeps roles vague
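The entity comparison in these sections comes down to set arithmetic: which entities each model names, which are shared, and which are unique. The following is a minimal sketch of that bookkeeping, not the tooling used for this article. The `mentions` data is a partial subset read off the Notes column of the Organizations table (only rows where the note states which model named the entity), and the function names are illustrative.

```python
from itertools import combinations

# Partial, illustrative subset taken from the Organizations notes above.
# Rows whose notes do not say which model named the entity are left out.
mentions = {
    "o3-pro": {"Skunk Works", "Sandia National Labs"},
    "Gemini 2.5 Pro": {"Skunk Works", "Bell Labs", "AEC"},
    "Claude 4 Opus": {
        "Skunk Works", "Sandia National Labs", "DOE",
        "Battelle", "SAIC", "Raytheon",
    },
}


def models_per_entity(mentions: dict) -> dict:
    """Invert model -> entities into entity -> set of models naming it."""
    index = {}
    for model, entities in mentions.items():
        for entity in entities:
            index.setdefault(entity, set()).add(model)
    return index


def shared_by_all(mentions: dict) -> set:
    """Entities named by every model."""
    index = models_per_entity(mentions)
    return {e for e, models in index.items() if len(models) == len(mentions)}


def unique_to(mentions: dict, model: str) -> set:
    """Entities named by exactly one model, the given one."""
    index = models_per_entity(mentions)
    return {e for e, models in index.items() if models == {model}}


def jaccard(a: set, b: set) -> float:
    """Overlap of two entity sets: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    print("In all three:", sorted(shared_by_all(mentions)))
    for model in mentions:
        print(f"Unique to {model}:", sorted(unique_to(mentions, model)))
    for left, right in combinations(mentions, 2):
        score = jaccard(mentions[left], mentions[right])
        print(f"{left} vs {right}: {score:.3f}")
```

The same functions apply unchanged to the people, program, and technology lists once each model's full entity set is filled in. Jaccard similarity is one reasonable choice for a pairwise overlap score; a raw shared count works just as well for small lists.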

  Technology

| Capability / Hardware | o3-pro | Gemini 2.5 Pro | Claude 4 Opus | Notes |
|---|---|---|---|---|
| Energy source type | "Meter‑scale burning‑plasma reactor (>10 MW)" | "Zero‑Point Energy (ZPE) reactor" | "Extreme energy density power sources (vacuum engineering)" | Each proposes a different core power physics |
| Inertial control | "Inertial‑mass reduction cavity (Pais configuration)" | "Spacetime warp bubble via metric engineering" | "Inertial decoupling" | All postulate inertial tricks; only o3-pro gives % figures |
| Propulsion medium | Magnetohydrodynamic bidirectional pump (air & water) | Warp bubble, medium is irrelevant | "Field propulsion" generic | |
| Hull materials | Bi‑Mg layered metamaterial (σ >10⁴ S cm⁻¹) | "Programmable Metamaterials" quantum‑active | "Metamaterial Superconductors" | o3-pro stands alone with composition and conductivity numbers |
| Autonomy | FPGA stack, 1.2 ms loop | Post‑sentient AI / neural interface | AI‑assisted field geometry control | o3-pro gives latency metric; Gemini 2.5 Pro stresses neural intention; Claude 4 Opus gives one-line assist |
| Readiness level | TRL 6–7 claimed | Not discussed | Not discussed | Unique to o3-pro |

Key deltas

  • o3-pro quantifies every subsystem
  • Gemini 2.5 Pro emphasizes revolutionary physics and human-machine symbiosis
  • Claude 4 Opus keeps technology labels high-level without figures

  Post-training hypothesis

The three systems show distinct reward signals.

| Insight | o3-pro | Gemini 2.5 Pro | Claude Opus 4 |
|---|---|---|---|
| Information precision | Very high: people, programs, metrics, patent numbers, contract IDs | Medium: focuses on narrative clarity more than figures | Low-medium: headline facts without many numbers |
| User engagement | Authoritative | Lively, reader-friendly prose | Straightforward, no-frills bullets |
| Scan speed | Slowest; dense detail | Moderate | Fastest; compact lists |
| Potential drawback | Data overwhelm | Speculative wandering | Omission of details |

o3-pro has been reinforced to treat accuracy, traceability, and static formatting as top goals, so it defaults to dense tables, formal section breaks, and inline citations.

Gemini 2.5 Pro's reward mix plainly favors engagement and smooth narrative flow, steering it toward story arcs, descriptive headers, and reader-oriented pacing.

Claude Opus 4 has been shaped to maximize brevity and clarity, seemingly for busy decision-makers, so it compresses content into bullet outlines and keeps numbers to a minimum.

These formats likely mirror the preferences of the leaders within the post-training organizations.


  Raw outputs

Published on June 14, 2025
