Intelligent Writing

After years of speculation, the release of GPT-5 was mostly disappointing.¹ It's a good model, better than GPT-4, yet compared with o3 it feels different, not better.

Objectively, GPT-5 is better at math/physics and hallucinates less.² What bothers me most is its writing style. GPT-5 draws from the GPT-4o/4.5 personality and style experiments, and even after developer prompt tweaks, the responses still feel unenjoyable. Too much vibe, not enough substance.

Each week I fed ChatGPT the same question twice to compare the models, but I couldn't articulate the difference I felt.

Why do I enjoy reading o3's answers more?

After weeks of wondering, I think I have a reasonable explanation: o3 writes for an educated audience and doesn't try to be your friend.

To wrap this hypothesis in more than just vibes, I gathered five recent ChatGPT prompts, generated responses with o3 and GPT-5, and ran Flesch-Kincaid readability tests on each (words/sentence, syllables/word).

Topic	Description
CrowdStrike incident	Create a briefing on the 2024 incident
Solar power generation	How far are we from storing 1TWh in America
Programmatic vs DL	What's the difference between code and deep learning intelligence
Archaeological paper	How does evidence for cooking fish 780ka alter our views
Podcast summary	Gather the topics and summarize from a recent JRE podcast
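For reference, both Flesch scores depend only on the two ratios named above. Here is a minimal sketch using the standard published formulas. The function names are illustrative, and the counts in the example are made up rather than taken from any response:

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level: roughly the US school grade needed to read the text."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading-Ease score: higher means easier to read."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts: 160 words in 10 sentences with 320 syllables,
# i.e. 16 words per sentence and 2.0 syllables per word.
print(flesch_kincaid_grade(160, 10, 320))
print(flesch_reading_ease(160, 10, 320))
```

Counting syllables is the fuzzy part in practice. Tools use different heuristics, so scores for the same text can differ slightly between implementations.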

What I found was that the content is nearly identical, with nothing exceptionally different between the responses. Even the syntax is fairly close: tables, em dashes, and overall structure.

The delta quickly appears when you measure readability.

GPT-5 always attempts a conversation, regardless of the prompt or personality setting, and focuses on directly addressing the person it's chatting with.

o3 writes in an academic tone, with clause-laden sentences and elevated vocabulary that push the required reading level into college-graduate territory. GPT-5 packs the same ideas into short, parallel statements with plain verbs, pushing the grade level down while preserving substance.

o3 writes academic papers; GPT-5 holds a conversation.

Feature	o3	GPT-5
Sentence length	~16 words	~10 words
Syllables per word	≈ 2.0	≈ 1.8
Typical F-K grade	13 – 15	7 – 11
Reading-Ease score	12 – 30	35 – 55
Opening style	Declarative taxonomy	Punchy contrast
Tone	Formal, academic	Direct, conversational
Structure	Few dense sentences, nested clauses	Many short sentences, clear pivots
Reader effort	High, requires parsing complex syntax	Moderate, quick scan suffices

The most interesting characteristic of GPT-5 is that, despite instructions not to use punchy, contrastive "not X, but Y" statements, it still cannot help itself. This behavior is common in GPT-4-era models and is likely carried over from GPT-4o personality post-training.

Topic	Model	F-K Grade	Ease Score	Reading Level	Words / Sentence	Syllables / Word	Sentences	Words
CrowdStrike incident	o3	13.1	29.9	College graduate	16.0	1.9	39	625
	GPT-5	8.4	47.5	College	7.0	1.8	60	422
Solar power generation	o3	9.8	43.8	College	10.6	1.8	62	655
	GPT-5	7.5	55.2	10th–12th grade	7.7	1.7	110	851
Programmatic vs DL	o3	15.5	12.7	College graduate	16.2	2.1	64	1035
	GPT-5	7.8	54.4	10th–12th grade	8.5	1.7	93	789
Archaeological paper	o3	13.7	28.3	College graduate	17.5	1.9	90	1573
	GPT-5	11.2	29.4	College graduate	8.1	2.0	68	551
Podcast summary	o3	11.2	34.8	College	11.1	1.9	44	490
	GPT-5	10.5	36.7	College	9.3	1.9	44	409
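The Reading Level column appears to follow the commonly cited Flesch Reading-Ease bands. The post does not say which tool produced the labels, so the cut-offs below are an assumption. With them, a small lookup reproduces every label in the table:

```python
def reading_level(ease: float) -> str:
    """Map a Flesch Reading-Ease score to a reading-level label.

    The thresholds are the commonly cited Flesch bands (an assumption here,
    since the source of the table's labels is not stated).
    """
    bands = [
        (90.0, "5th grade"),
        (80.0, "6th grade"),
        (70.0, "7th grade"),
        (60.0, "8th–9th grade"),
        (50.0, "10th–12th grade"),
        (30.0, "College"),
    ]
    for floor, label in bands:
        if ease >= floor:
            return label
    return "College graduate"


# Ease scores from the CrowdStrike and Solar rows of the table above.
for score in (29.9, 47.5, 43.8, 55.2):
    print(score, reading_level(score))
```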

Across every topic, GPT-5's text demands a lower grade level and yields a higher Reading-Ease score, and it makes subtle use of pronouns to address the reader. When analyzing existing content (the paper, the podcast), the two models converge, staying closer to the source material.

The primary lever is sentence length: o3 adds between 2 and 9 extra words per sentence, raising perceived reading difficulty even when vocabulary density shifts only a tenth of a syllable.

Topic	Δ F-K Grade	Δ Ease	Δ Words / Sentence	Δ Syllables / Word	Δ Sentences	Δ Words
CrowdStrike incident	+4.7	–17.6	+9.0	+0.1	–21	+203
Solar power generation	+2.3	–11.4	+2.9	+0.1	–48	–196
Programmatic vs DL	+7.7	–41.7	+7.7	+0.4	–29	+246
Archaeological paper	+2.5	–1.1	+9.4	–0.1	+22	+1022
Podcast summary	+0.7	–1.9	+1.8	0.0	0	+81
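Because the Flesch-Kincaid grade is linear in both ratios, each grade gap can be split into a sentence-length term and a syllable term. This is a rough sketch using the standard formula weights and the deltas from the table above. The names are illustrative, and the syllable deltas are rounded to one decimal, so the split is only approximate:

```python
WORDS_PER_SENTENCE_WEIGHT = 0.39   # grade levels per extra word per sentence
SYLLABLES_PER_WORD_WEIGHT = 11.8   # grade levels per extra syllable per word


def grade_gap_terms(d_words_per_sentence: float, d_syllables_per_word: float):
    """Return (sentence-length term, syllable term) of a Flesch-Kincaid grade gap."""
    return (
        WORDS_PER_SENTENCE_WEIGHT * d_words_per_sentence,
        SYLLABLES_PER_WORD_WEIGHT * d_syllables_per_word,
    )


# (Δ words/sentence, Δ syllables/word) copied from the delta table, o3 minus GPT-5.
deltas = {
    "CrowdStrike incident": (9.0, 0.1),
    "Solar power generation": (2.9, 0.1),
    "Programmatic vs DL": (7.7, 0.4),
    "Archaeological paper": (9.4, -0.1),
    "Podcast summary": (1.8, 0.0),
}

for topic, (d_wps, d_spw) in deltas.items():
    length_term, syllable_term = grade_gap_terms(d_wps, d_spw)
    total = length_term + syllable_term
    print(f"{topic}: length {length_term:+.2f}, syllables {syllable_term:+.2f}, total {total:+.2f}")
```

Each extra word per sentence is worth 0.39 grade levels, while a shift of 0.1 syllables per word is worth about 1.2. The syllable term is therefore sensitive to the rounding in the table, and the totals match the Δ F-K Grade column only to within rounding.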

There are other subtle quirks, including o3's preference for British English.

In OpenAI's quest to meet the needs of most in a population of hundreds of millions, the language and writing bend toward easier reading.

By chasing mass-market readability, GPT-5 trades the signal of intellect that denser prose sends for reading speed, a shift likely driven by human-preference feedback.

o3 was the first model that truly felt smarter than most people I know, and it's primarily because of how it communicates, not its underlying intelligence.

Writing style shapes perceived intelligence. A model that communicates in graduate-level prose signals intellect, even when raw reasoning is similar.

Hopefully, mass-market preference does not drive us to Idiocracy.

Footnotes

¹ GPT-5 refers to the API version, also known as GPT-5 Thinking in ChatGPT.
² o3 (and perhaps all o-series models) has a significant flaw where it will hallucinate, especially with tool-use responses, and then gaslight the user when challenged.