The Paradigm Shift

OpenAI released O3 and O4-mini, which are no longer described as "AI models" but as "AI systems." No existing benchmark captures their full potential.

Systems Thinking

A system is "interconnected elements producing their own behavior over time." The whole is greater than the sum of its parts. A system has three key components: elements, interconnections, and purpose. Feedback loops are what make systems work.

The Breakthrough: Feedback Loops

Previous LLMs: input → output → maybe tool calls → done. No self-reflection, no course correction. To compensate, developers stacked multiple agents.

O3/O4-mini: "They learned not just how to use tools, but when to use them through reinforcement learning." Up to 600 sequential tool calls with reasoning between each one.
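The difference between a single pass and a feedback loop can be sketched in a few lines of Python. This is only an illustrative toy, not how O3 or O4-mini are implemented: `run_with_feedback`, `AgentRun`, `toy_policy`, and the two tools are hypothetical names invented here, and the hand-written policy function stands in for the model's reasoning step. The only figure taken from the text is the cap of 600 tool calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration; not part of any real SDK.
Tool = Callable[[str], str]
# A policy looks at the goal and every observation so far, then returns
# either ("call", tool_name, tool_input) or ("finish", answer, "").
Policy = Callable[[str, List[str]], Tuple[str, str, str]]


@dataclass
class AgentRun:
    answer: Optional[str] = None
    observations: List[str] = field(default_factory=list)
    tool_calls: int = 0


def run_with_feedback(goal: str, policy: Policy, tools: Dict[str, Tool],
                      max_tool_calls: int = 600) -> AgentRun:
    """Reason, act, observe, repeat, until the policy finishes or the budget runs out."""
    run = AgentRun()
    while run.tool_calls < max_tool_calls:
        # Reasoning step: decide what to do given everything observed so far.
        action, name, arg = policy(goal, run.observations)
        if action == "finish":
            run.answer = name
            return run
        # Acting step: call the chosen tool and feed its result back in.
        result = tools[name](arg)
        run.observations.append(f"{name}({arg}) -> {result}")
        run.tool_calls += 1
    return run  # budget exhausted; answer stays None


def toy_policy(goal: str, observations: List[str]) -> Tuple[str, str, str]:
    """Stand-in for the model: picks the next step from what it has seen."""
    if not observations:
        return ("call", "add", "2 3")
    last_result = observations[-1].split(" -> ")[1]
    if len(observations) == 1:
        # The previous tool result becomes the next tool input.
        return ("call", "double", last_result)
    return ("finish", last_result, "")


tools: Dict[str, Tool] = {
    "add": lambda s: str(sum(int(x) for x in s.split())),
    "double": lambda s: str(int(s) * 2),
}

result = run_with_feedback("add 2 and 3, then double it", toy_policy, tools)
print(result.answer, result.tool_calls)
```

The point of the sketch is the loop itself: every tool result is appended to `observations` and passed back to the policy, so the next decision can depend on what the previous action returned. A single-pass design would call the policy once and stop.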

"This is the single biggest breakthrough of this release."

Impact

  • Agents now think after each action and choose appropriate paths
  • Complex workflows automatable without rigid pre-programming
  • Adding tools alone roughly doubled the improvement on the AIME 2025 benchmark (46% improvement model-to-model, 93% improvement once tools were enabled)
  • Agents can run for days without human input
  • Workflow automation platforms (Zapier, Make) face obsolescence

Practical Advice

  1. Take initiative now — implementation will soon be handled by AI systems
  2. Build agents today — most businesses haven't utilized even previous model capabilities
  3. Choose models wisely: O4-mini for most agents (5x cheaper than O3), GPT-4.1 for clear, simple tasks, O3 for mission-critical work with zero tolerance for errors
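The model-selection advice in item 3 can be written down as a small routing function. This is a minimal sketch under stated assumptions: the `Task` fields and `choose_model` are hypothetical names introduced here, and the returned strings are assumed to match the lowercase model identifiers your API client expects, so adjust them if your provider names the models differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical flags for illustration; real routing would need richer signals.
    mission_critical: bool = False  # zero tolerance for errors
    clear_and_simple: bool = False  # well specified, little reasoning needed


def choose_model(task: Task) -> str:
    """Apply the rule of thumb from the list above, strictest condition first."""
    if task.mission_critical:
        return "o3"        # most capable option, reserved for critical work
    if task.clear_and_simple:
        return "gpt-4.1"   # clear, simple tasks
    return "o4-mini"       # default for most agents (5x cheaper than O3 per the text)


print(choose_model(Task()))
print(choose_model(Task(clear_and_simple=True)))
print(choose_model(Task(mission_critical=True, clear_and_simple=True)))
```

Checking `mission_critical` first is a deliberate choice: a task that is both simple and critical still goes to the strongest model, which matches the zero-error-tolerance framing.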

Is This AGI?

Not inherently, but with the right tools, knowledge, and instructions, these systems could achieve AGI-level performance within narrow domains. A multi-agent system in which each agent is specialized and has its own tools, operating like a real organization, could collectively approach AGI.

"With the capabilities of these AI systems alone, we can build true AGI together."
