
1. Research Background and Objectives
AI technology has advanced rapidly in recent years, but what benchmark performance means for real-world capability remains unclear. To address this, the researchers propose a new metric, the "50% Task Completion Time Horizon," which places AI systems and humans on a common scale. The metric is the length of task, measured by how long human experts take to complete it, that an AI model can complete with a 50% success rate. This gives a practical and intuitive measure of AI capability.
2. Key Research Findings
2.1 AI Task Completion Time Horizon
- From 2019 to 2025, AI models' 50% task completion time horizons have been doubling approximately every 7 months.
- The latest AI models (e.g., Claude 3.7 Sonnet) have a measured 50% time horizon of approximately 50 minutes.
- Since 2024, this growth rate may have accelerated further.
"Current AI models can complete tasks that human experts would perform in 50 minutes with a 50% success rate."
2.2 Key Factors in AI Performance Improvement
- Improvement in logical reasoning ability
- Enhancement of tool utilization capability
- Increased adaptability to mistakes
- Strengthened task execution reliability
2.3 AI Limitations
- Low performance on unstructured "messy" tasks.
- Difficulty in environments without clear feedback loops.
- Lack of ability to recognize its own limitations and proactively search for necessary information.
"AI still cannot match human flexibility in complex and unstructured environments."
3. Research Methodology
3.1 Dataset Composition
A dataset consisting of 170 diverse tasks was used:
- HCAST: 97 software and general reasoning tasks (requiring 1 minute to 30 hours).
- RE-Bench: 7 challenging ML research tasks (requiring 8 hours).
- SWAA: 66 short software tasks (requiring 1 second to 30 seconds).
3.2 Human Baseline
- Time taken by expert-level humans to complete tasks was measured and compared with AI performance.
- A total of 2,529 hours of human work data was collected.
4. Key Analysis and Results
4.1 Time Horizon Calculation
- The time horizon is the task length, measured in human completion time, at which an AI model's probability of successful completion is 50%.
- Logistic regression of task success against human completion time was used to estimate each model's time horizon.
"The success rate of AI models is closely correlated with the time it takes human experts to complete tasks."
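The estimation step in 4.1 can be sketched as follows. This is a minimal illustration, not the study's code: the task times and outcomes are invented, the function name fit_logistic is hypothetical, and fitting on the base-2 logarithm of human time with plain gradient descent is an assumption made to keep the example free of dependencies.

```python
import math

# Hypothetical data: human completion time (minutes) and whether the model succeeded.
human_minutes = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]
succeeded = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

def fit_logistic(xs, ys, steps=20000, lr=0.2):
    """Fit P(success) = sigmoid(a + b * (x - mean_x)) by gradient descent on the log-loss."""
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]  # centering speeds up convergence
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(centered, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - y
            grad_b += (p - y) * x
        a -= lr * grad_a / len(xs)
        b -= lr * grad_b / len(xs)
    return a, b, mean_x

log_minutes = [math.log2(m) for m in human_minutes]
a, b, mean_x = fit_logistic(log_minutes, succeeded)

# The predicted success probability is 0.5 where a + b * (x - mean_x) = 0.
horizon_minutes = 2 ** (mean_x - a / b)
print(f"slope: {b:.2f}, 50% time horizon: {horizon_minutes:.0f} minutes")
```

A negative slope means success becomes less likely as tasks get longer; the horizon is the task length at which the fitted curve crosses 50%.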
4.2 Time Horizon Growth Trend
- From 2019 to 2025, the time horizon doubled every 7 months.
- From 2024 to 2025, this growth rate may be accelerating further.
"If current trends continue, AI could automate tasks that take humans a month to perform between 2028 and 2031."
4.3 80% Success Rate Time Horizon
- The time horizon at an 80% success rate is roughly one-fifth of the 50% time horizon.
- This indicates that AI struggles to succeed reliably on longer tasks.
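The relationship between the two thresholds can be illustrated with a logistic curve in log task length. The sketch below is an assumption-laden example, not a fitted result: horizon_at is a hypothetical helper, and the slope and intercept are chosen by hand so that the 50% horizon is 50 minutes and the 80% horizon comes out 5 times shorter.

```python
import math

def horizon_at(success_rate, intercept, slope):
    """Task length (minutes) where sigmoid(intercept + slope * log2(t)) equals success_rate."""
    logit = math.log(success_rate / (1.0 - success_rate))
    return 2 ** ((logit - intercept) / slope)

# Hypothetical coefficients, chosen so that h50 = 50 minutes and h50 / h80 = 5.
slope = -math.log(4) / math.log2(5)   # about -0.6 per doubling of task length
intercept = -slope * math.log2(50)

h50 = horizon_at(0.5, intercept, slope)
h80 = horizon_at(0.8, intercept, slope)
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min, ratio: {h50 / h80:.1f}")
```

Under this model the ratio between the two horizons depends only on the slope: the shallower the curve, the larger the gap between what a model can do half the time and what it can do reliably.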
5. Analysis of AI Performance Improvement Factors
5.1 Improved Tool Utilization and Logical Problem Solving
- AI has become more effective at using tools and has improved in logical problem-solving ability.
- Example: In Python code debugging tasks, early models made repetitive mistakes, while the latest models demonstrate the ability to recognize and correct mistakes.
"Early models repeated the same mistakes, but the latest models recognize errors and try new approaches."
5.2 Areas Still Lacking
- Difficulty in environments without clear feedback.
- Failure to proactively search for information.
- Lack of strategic thinking in complex environments.
6. External Validation
6.1 Comparison with SWE-Bench Verified
- A similar time horizon growth trend was confirmed in the SWE-Bench Verified dataset.
- However, the task time estimates in SWE-Bench Verified tend to be shorter than actual human work times.
6.2 Performance on "Messy" Tasks
- AI success rates decline as tasks become more complex and unstructured.
- However, performance on complex tasks has been improving over time.
7. Future Outlook and Predictions
7.1 The Arrival of 1-Month Time Horizon AI
- Based on current trends, the point at which AI can complete a 1-month (167-hour) task with a 50% success rate is predicted to be between 2028 and 2031.
"A 1-month time horizon AI will not only create enormous economic value but may also possess potentially dangerous capabilities."
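The arithmetic behind this forecast can be checked with a simple extrapolation. The sketch below assumes a constant 7-month doubling time, a starting horizon of 50 minutes, and a starting point of roughly early 2025 (written as 2025.2); the starting date is an assumption for illustration, and the variable names are hypothetical.

```python
import math

current_horizon_hours = 50 / 60   # about 50 minutes, per section 2.1
target_horizon_hours = 167        # one month of work, per section 7.1
doubling_time_months = 7          # observed 2019-2025 trend
start_year = 2025.2               # assumed reference date (early 2025)

# Number of doublings needed to grow from the current horizon to the target.
doublings = math.log2(target_horizon_hours / current_horizon_hours)
months_needed = doublings * doubling_time_months
arrival_year = start_year + months_needed / 12

print(f"doublings needed: {doublings:.2f}")
print(f"months needed: {months_needed:.1f}")
print(f"estimated arrival: {arrival_year:.1f}")
```

With these inputs the target is about 7.6 doublings away, which lands inside the 2028 to 2031 window. A shorter doubling time would pull the estimate earlier and a longer one would push it later.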
7.2 Uncertainty Factors in Predictions
- Differences between real-world tasks and benchmark tasks.
- Changes in future technological progress rates.
- The possibility that automation of AI research and development may accelerate the time horizon growth rate.
8. Conclusions and Recommendations
- The time horizon is a useful metric, both intuitive and quantitative, for evaluating AI's real-world capabilities.
- However, more realistic task datasets and refined human baselines are needed.
- If AI performance continues to improve, safeguards that account for social, economic, and ethical impacts will be essential.
"AI's progress is remarkable, but preparation for managing it safely is necessary."
9. Keywords
- Time Horizon
- AI Performance Evaluation
- Tool Utilization Capability
- Logical Reasoning
- Messy Tasks
- 1-Month AI
- AI Automation