Measuring AI Ability to Complete Long Tasks

1. Research Background and Objectives

AI technology has advanced rapidly in recent years, but what benchmark performance actually means in the real world remains unclear. To address this, the researchers propose a new metric, the "50% task completion time horizon," that allows AI systems to be compared against human capabilities. The metric is the length of task, measured by how long human experts take to complete it, that an AI model can complete with a 50% success rate. This gives a practical and intuitive measure of AI capability.

2. Key Research Findings

2.1 AI Task Completion Time Horizon

From 2019 to 2025, AI models' 50% task completion time horizons have been doubling approximately every 7 months.
The latest AI models (e.g., Claude 3.7 Sonnet) have a measured 50% time horizon of approximately 50 minutes.
Since 2024, the rate of increase may have accelerated further.

"Current AI models can complete tasks that human experts would perform in 50 minutes with a 50% success rate."

2.2 Key Factors in AI Performance Improvement

Improvement in logical reasoning ability
Enhancement of tool utilization capability
Increased adaptability to mistakes
Strengthened task execution reliability

2.3 AI Limitations

Low performance on unstructured "messy" tasks.
Difficulty in environments without clear feedback loops.
Lack of ability to recognize its own limitations and proactively search for necessary information.

"AI still cannot match human flexibility in complex and unstructured environments."

3. Research Methodology

3.1 Dataset Composition

A dataset consisting of 170 diverse tasks was used:

HCAST: 97 software and general reasoning tasks (taking humans 1 minute to 30 hours).
RE-Bench: 7 challenging ML research tasks (taking humans about 8 hours each).
SWAA: 66 short software tasks (taking humans 1 second to 30 seconds).

3.2 Human Baseline

Time taken by expert-level humans to complete tasks was measured and compared with AI performance.
A total of 2,529 hours of human work data was collected.

4. Key Analysis and Results

4.1 Time Horizon Calculation

For each AI model, the time horizon is the task length, measured in human completion time, at which the model's probability of success reaches 50%.
Logistic regression of task success against human completion time was used to estimate this point.

"The time horizon of AI models is closely correlated with the time it takes human experts to complete tasks."
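
As a rough illustration of the calculation above, the sketch below fits a logistic curve of success against the base-2 logarithm of human completion time, then solves for the task length at which predicted success is 50%. The function names, the gradient-ascent fitting loop, the choice of log2 task length as the predictor, and the ten data points are all assumptions made for illustration; they are not the study's code or data.

```python
import math

def fit_logistic(log2_minutes, successes, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(log2_minutes)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(log2_minutes, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def time_horizon(a, b, p=0.5):
    """Task length in minutes at which the fitted success probability equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)

# Hypothetical runs: human completion time in minutes, and whether the model succeeded.
minutes = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]
succeeded = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0]

a, b = fit_logistic([math.log2(m) for m in minutes], succeeded)
h50 = time_horizon(a, b, p=0.5)
print(f"Estimated 50% time horizon: {h50:.0f} minutes")
```

A negative fitted slope `b` means success becomes less likely as tasks get longer, which is what makes a single crossing point at 50% well defined.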

4.2 Time Horizon Growth Trend

From 2019 to 2025, the time horizon doubled approximately every 7 months.
From 2024 to 2025, this growth rate may be accelerating further.

"If current trends continue, AI could automate tasks that take humans a month to perform between 2028 and 2031."
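
The doubling claim can be written as an exponential growth law. The symbols H(t), H_0, t_0, and T_d are notation introduced here for illustration, with the doubling time T_d set to the 7 months reported above.

```latex
H(t) = H_0 \cdot 2^{(t - t_0)/T_d}, \qquad T_d \approx 7 \text{ months}
\quad\Longrightarrow\quad
\frac{H(t_0 + 12 \text{ months})}{H(t_0)} = 2^{12/7} \approx 3.3
```

Read this way, a 7-month doubling time corresponds to roughly a 3.3-fold increase in time horizon per year.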

4.3 80% Success Rate Time Horizon

The time horizon measured at an 80% success rate is about one-fifth of the 50% time horizon.
This indicates that AI struggles to succeed reliably on longer tasks.
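
One way to see why the two thresholds differ by a roughly constant factor: if success probability follows a logistic curve in log2 task length, the gap between the 50% and 80% horizons depends only on the slope of that curve. The sketch below assumes that model form. The names `horizon_ratio`, `implied_slope`, and `beta` are hypothetical, and the slope value is derived from the factor of 5 above rather than reported by the study.

```python
import math

def horizon_ratio(beta, p=0.8):
    """Ratio h50 / h_p when logit(P) = beta * (log2(h50) - log2(task_length))."""
    logit_p = math.log(p / (1.0 - p))
    return 2.0 ** (logit_p / beta)

def implied_slope(ratio, p=0.8):
    """Slope beta (logits per doubling of task length) that yields a given h50 / h_p ratio."""
    logit_p = math.log(p / (1.0 - p))
    return logit_p / math.log2(ratio)

beta = implied_slope(5.0)
print(f"Implied slope: {beta:.2f} logits per doubling of task length")
print(f"Check: h50 / h80 = {horizon_ratio(beta):.1f}")
```

Under this assumption, a shallower slope (smaller `beta`) widens the gap between the two horizons, while a steeper slope narrows it.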

5. Analysis of AI Performance Improvement Factors

5.1 Improved Tool Utilization and Logical Problem Solving

AI has become more effective at using tools and has improved in logical problem-solving ability.
Example: In Python code debugging tasks, early models made repetitive mistakes, while the latest models demonstrate the ability to recognize and correct mistakes.

"Early models repeated the same mistakes, but the latest models recognize errors and try new approaches."

5.2 Areas Still Lacking

Difficulty in environments without clear feedback.
Failure to proactively search for information.
Lack of strategic thinking in complex environments.

6. External Validation

6.1 Comparison with SWE-Bench Verified

A similar time horizon growth trend was confirmed in the SWE-Bench Verified dataset.
However, the time estimates attached to SWE-Bench Verified tasks tend to be shorter than the time humans actually take to complete them.

6.2 Performance on "Messy" Tasks

AI success rates decline as tasks become more complex and unstructured.
However, performance on complex tasks has been improving over time.

7. Future Outlook and Predictions

7.1 The Arrival of 1-Month Time Horizon AI

Based on current trends, the point at which AI can complete a 1-month (167-hour) task with a 50% success rate is predicted to be between 2028 and 2031.

"A 1-month time horizon AI will not only create enormous economic value but may also possess potentially dangerous capabilities."
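
The projection can be sanity-checked with simple arithmetic using only the figures in this summary: a current horizon of about 50 minutes, a target of 167 hours, and a 7-month doubling time. The sketch below is a naive constant-rate extrapolation with a hypothetical function name; it does not reproduce the study's own forecasting method or its uncertainty range.

```python
import math

def months_to_reach(current_minutes, target_minutes, doubling_months=7.0):
    """Months needed to grow from the current horizon to the target at a fixed doubling time."""
    doublings = math.log2(target_minutes / current_minutes)
    return doublings * doubling_months

months = months_to_reach(current_minutes=50, target_minutes=167 * 60)
print(f"{months:.1f} months, or about {months / 12:.1f} years")
```

Going from 50 minutes to 167 hours takes a little under 8 doublings, or roughly four and a half years at a constant 7-month doubling time. Counted from a 2025 starting point, that lands inside the 2028 to 2031 window.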

7.2 Uncertainty Factors in Predictions

Differences between real-world tasks and benchmark tasks.
Changes in future technological progress rates.
The possibility that automation of AI research and development may accelerate the time horizon growth rate.

8. Conclusions and Recommendations

The time horizon is a useful, intuitive, and quantitative metric for evaluating AI's real-world capabilities.
However, more realistic task datasets and refined human baselines are needed.
If AI performance continues to improve, safeguards that account for social, economic, and ethical impacts will be essential.

"AI's progress is remarkable, but preparation for managing it safely is necessary."

9. Keywords

Time Horizon
AI Performance Evaluation
Tool Utilization Capability
Logical Reasoning
Messy Tasks
1-Month AI
AI Automation

