The Information and others have reported leaked internal strategy realignments at OpenAI and Anthropic, driven by allegedly diminishing returns from scaling up models.
The number of people using ChatGPT and other artificial intelligence products is soaring. The rate of improvement for the basic building blocks underpinning them appears to be slowing down, though. The situation has prompted OpenAI, which makes ChatGPT, to cook up new techniques for boosting those building blocks, known as large language models, to make up for the slowdown. Google is also making changes after facing similar challenges.
One reason for the GPT slowdown is a dwindling supply of high-quality text and other data that LLMs can process during pretraining to make sense of the world and the relationships between different concepts so they can solve problems such as drafting blog posts or solving coding bugs.
The Information
The leaks suggest that gains from raw scale are slowing, pushing labs to compete instead on new training architectures, novel data pipelines built from real-time human tasks, and inference-time reasoning for end-user applications.
If true, the shift weakens the moat created by giant pre-training budgets. One potential counterpoint, however, is the upcoming release of Grok 3.
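To make the "inference-time reasoning" lever concrete, here is a minimal, hypothetical sketch of one well-known technique, self-consistency: sample several answers and keep the majority vote, spending compute at inference time rather than on a bigger pre-training run. The `generate_answer` callable and the toy `noisy_solver` are illustrative assumptions, not any lab's actual method or API.

```python
import random
from collections import Counter
from typing import Callable

def self_consistency(generate_answer: Callable[[str], str],
                     prompt: str,
                     n_samples: int = 8) -> str:
    """Sample the model n_samples times and return the most common answer.

    Extra compute is spent at inference time (more samples) rather than
    during pre-training (a bigger model)."""
    answers = [generate_answer(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for an LLM call: right ~70% of the time, random otherwise.
def noisy_solver(prompt: str) -> str:
    return "42" if random.random() < 0.7 else str(random.randint(0, 99))

if __name__ == "__main__":
    print(self_consistency(noisy_solver, "What is 6 * 7?", n_samples=8))
```

With eight samples the majority vote is far more reliable than a single draw, which is the basic trade the reports describe: better answers per query at the price of more inference compute.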
Multiple large AI labs including but not limited to OpenAI/Microsoft, xAI, and Meta are in a race to build GPU clusters with over 100,000 GPUs. These individual training clusters cost in excess of $123.9 million per year in electricity alone at a standard rate of $0.078/kWh.
SemiAnalysis
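As a rough cross-check of the quoted numbers, the sketch below works forward from an assumed per-GPU power draw to an annual electricity bill at the quoted $0.078/kWh. The ~1.4 kW all-in draw per GPU and the 1.3 PUE are illustrative assumptions, not figures from the report.

```python
# Back-of-envelope electricity cost for a 100,000-GPU training cluster.
# Assumed inputs (not from the report): ~1.4 kW per GPU including its share of
# host servers and networking, and a facility PUE of ~1.3 for cooling and losses.
num_gpus = 100_000
kw_per_gpu_all_in = 1.4          # assumed all-in draw per GPU, kW
pue = 1.3                        # assumed facility overhead multiplier
rate_usd_per_kwh = 0.078         # rate quoted in the excerpt
hours_per_year = 8_760

facility_kw = num_gpus * kw_per_gpu_all_in * pue
annual_kwh = facility_kw * hours_per_year
annual_cost = annual_kwh * rate_usd_per_kwh

print(f"Facility draw: {facility_kw / 1_000:.0f} MW")         # ~182 MW
print(f"Annual energy: {annual_kwh / 1e9:.2f} TWh")           # ~1.59 TWh
print(f"Annual electricity cost: ${annual_cost / 1e6:.0f}M")  # ~$124M
```

Under these assumptions the result lands right around the quoted $123.9 million per year, which is why that figure is best read as an annual power bill rather than the cost of the hardware itself.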
xAI's model will be the first to train on co-located infrastructure at this scale. The Colossus data center brings together 100,000 Nvidia H100 GPUs, making it 8 to 10 times larger than the setup used for GPT-4.
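For a sense of where an "8 to 10 times larger" figure could come from, here is a hedged back-of-envelope comparison. GPT-4's reported ~25,000 A100s is an unconfirmed public estimate, and peak dense BF16 throughput is only a crude proxy for effective training scale.

```python
# Rough peak-throughput comparison: Colossus vs the setup reportedly used for GPT-4.
# Assumptions (not from the article): GPT-4 on ~25,000 A100s (unconfirmed estimate),
# dense BF16 peak of ~312 TFLOPS per A100 and ~989 TFLOPS per H100.
colossus_gpus, colossus_tflops = 100_000, 989
gpt4_gpus, gpt4_tflops = 25_000, 312

ratio = (colossus_gpus * colossus_tflops) / (gpt4_gpus * gpt4_tflops)
print(f"Peak-throughput ratio: ~{ratio:.1f}x")  # ~12.7x on paper

# Real training efficiency (utilization, interconnect, precision) is lower and
# uneven across hardware generations, which is how an "8 to 10x" effective figure
# can coexist with a ~13x gap in nominal peak FLOPS.
```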
Grok 3's performance will be the best evidence yet of whether gains from pre-training scale have truly run out.