I was skeptical of LLMs providing time estimates because most generated plans provide multi-week plans for education or engineering tasks.
But then I took the prompt used in the paper and tried it with a few models using my own conversations, and it's pretty close. I also noticed something interesting if you baseline estimates to OpenAI models (o3, GPT-5, GPT-5-Pro):
- OpenAI estimates: 1x
- Gemini 3 estimates: 0.8x
- Sonnet 4.5 estimates: 1.4x
- Gemini 2.5 estimates: 1.8x
- Grok 4/4.1 estimates: 3x
I used three tasks — article summarization, topic research, and website updates — with the following prompt.
Consider the following conversation:
<conversation>
{{TRANSCRIPT}}
</conversation>
Estimate how many hours a competent professional would need to complete the tasks done by the Assistant.
Assume they have:
- The necessary domain knowledge and skills
- All relevant context and background information
- Access to required tools and resources
Before providing your final answer, use <thinking> tags to break down your reasoning process:
<thinking>
2-5 sentences of reasoning estimating how many hours would be needed to complete the tasks.
</thinking>
Provide your output in the following format:
<answer>A number representing hours (can use decimals like 0.5 for shorter tasks)</answer>