Estimating AI productivity gains | Stephen M. Walker II

I was skeptical of LLMs providing time estimates because most generated plans provide multi-week plans for education or engineering tasks.

But then I took the prompt used in the paper and tried it with a few models using my own conversations, and it's pretty close. I also noticed something interesting if you baseline estimates to OpenAI models (o3, GPT-5, GPT-5-Pro):

OpenAI estimates: 1x
Gemini 3 estimates: 0.8x
Sonnet 4.5 estimates: 1.4x
Gemini 2.5 estimates: 1.8x
Grok 4/4.1 estimates: 3x

I used three tasks — article summarization, topic research, and website updates — with the following prompt.

Consider the following conversation:

<conversation>
{{TRANSCRIPT}}
</conversation>

Estimate how many hours a competent professional would need to complete the tasks done by the Assistant.
Assume they have:
- The necessary domain knowledge and skills
- All relevant context and background information
- Access to required tools and resources

Before providing your final answer, use <thinking> tags to break down your reasoning process:
<thinking>
2-5 sentences of reasoning estimating how many hours would be needed to complete the tasks.
</thinking>

Provide your output in the following format:
<answer>A number representing hours (can use decimals like 0.5 for shorter tasks)</answer>