Meta has open-sourced ARE (Agents Research Environments), its platform for developing and evaluating AI agents. ARE is designed to test agents in environments that closely resemble real-world applications, and it stands to change how researchers and developers measure practical agent performance. Alongside head-to-head comparisons of current models, the significance of the open-source release and its future possibilities are being actively discussed.


1. ARE: A New Testing Ground for AI Agents

ARE is a simulator that places AI agents in environments built from realistic apps, events, notifications, and scenarios. Time keeps flowing even while an agent is thinking, which means slow models can miss deadlines. Agents use tools, receive asynchronous notifications, and operate under rules defined by a directed acyclic graph (DAG).

"Everything is modeled as apps, events, notifications, and scenarios. Time continues to flow even when agents are thinking, and slow models miss deadlines. Agents use tools, receive asynchronous notifications, and operate under rules defined by a directed acyclic graph."
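The "time keeps flowing while the agent thinks" model can be sketched in a few lines. This is a minimal illustration, not ARE's actual API; the `SimulatedClock` and `Environment` classes and their method names are hypothetical:

```python
import heapq

class SimulatedClock:
    """Simulated time: it advances whenever the agent spends time thinking."""
    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds

class Environment:
    """Queues timed events (notifications, deadlines) in simulated time."""
    def __init__(self, clock):
        self.clock = clock
        self.events = []  # min-heap of (fire_time, payload)

    def schedule(self, delay, payload):
        heapq.heappush(self.events, (self.clock.now + delay, payload))

    def poll(self):
        """Return every event whose fire time has already passed."""
        fired = []
        while self.events and self.events[0][0] <= self.clock.now:
            fired.append(heapq.heappop(self.events)[1])
        return fired

clock = SimulatedClock()
env = Environment(clock)
env.schedule(5.0, "reply-deadline")

clock.advance(8.0)   # a slow model "thinks" for 8 simulated seconds
missed = env.poll()  # the 5-second deadline has already fired
print(missed)        # -> ['reply-deadline']
```

The point of the design is that thinking is not free: the environment's event queue keeps running against the clock, so a model that reasons for too long surfaces its deadline failures directly.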

ARE: Simulator interface example


2. Gaia2: A Rigorous Benchmark in a "Smartphone-Like World"

Agents developed in ARE are validated with a benchmark called Gaia2, which comprises 1,120 scenarios spanning 12 representative apps (chat, calendar, shopping, email, etc.). Its main challenge types are:

  • Search
  • Execution
  • Adaptability
  • Time
  • Ambiguity
  • Agent-to-Agent collaboration

Each scenario is evaluated in a verifiable manner.

"Scenarios directly compare the oracle's written actions with the agent's actions (IDs, order, etc.), and an LLM evaluates the content."
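The verifiable comparison described in the quote could look roughly like the following. This is a simplified sketch: the real benchmark also uses an LLM judge for free-form content, which is replaced here by a literal comparison, and the trace format is assumed:

```python
def actions_match(oracle, agent):
    """Compare an oracle action trace against the agent's trace.

    Both traces are lists of (action_id, args) tuples. The check
    requires the same action IDs in the same order; free-form text
    arguments would be judged by an LLM in the real benchmark and
    are compared literally here for simplicity.
    """
    if len(oracle) != len(agent):
        return False
    return all(o_id == a_id and o_args == a_args
               for (o_id, o_args), (a_id, a_args) in zip(oracle, agent))

oracle_trace = [("calendar.create_event", {"title": "Standup"}),
                ("email.send", {"to": "alice@example.com"})]
agent_trace  = [("calendar.create_event", {"title": "Standup"}),
                ("email.send", {"to": "alice@example.com"})]
print(actions_match(oracle_trace, agent_trace))  # -> True
```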

Gaia2 benchmark environment screen


3. Performance Comparison Across Key Models

Testing multiple representative AI models in ARE and Gaia2 revealed no single model with absolute dominance. GPT-5's "high reasoning" variant performed well on difficult tasks but was weak in time-constrained situations. Claude-4 Sonnet achieved a reasonable balance between speed and accuracy, but was more expensive. Open-source models (Kimi-K2, etc.) showed promise in adaptability.

"No single model dominates: GPT-5's 'high reasoning' excels at hard tasks but collapses under time constraints. Claude-4 Sonnet balances speed and accuracy but costs more. Open-source models show good adaptability."

Additionally, simply increasing compute does not keep improving performance; the results showed clear diminishing returns.

Per-model performance and scaling curves


4. Key Insights for Developers

In the experiments, an "inverse scaling" phenomenon appeared frequently: models with strong reasoning capabilities fail at critical moments precisely because they are slow. In other words, when time is tight, deep thinking can become a liability. When multiple agents were deployed to collaborate, weaker models gained cooperative performance, while the effect on the strongest models was mixed and hard to predict.

"Strong reasoning models frequently fail when punctuality matters. Instant-mode experiments confirmed that long reasoning times negatively affect deadlines. Multi-agent environments show collaborative benefits for weaker models, but mixed effects for strong models."
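One practical takeaway from the inverse-scaling result can be sketched as a deadline-aware reasoning budget. This is a hypothetical pattern, not code from ARE: the agent picks a cheaper, faster mode whenever a deep reasoning pass would not finish before the deadline:

```python
def pick_reasoning_mode(time_remaining, deep_latency, fast_latency):
    """Pick the reasoning mode that fits the remaining deadline budget.

    Hypothetical illustration of the trade-off: a deep pass is more
    accurate, but only worthwhile if it can finish before the deadline,
    keeping a small safety margin for acting on the result.
    """
    margin = 0.1 * time_remaining  # reserve 10% to execute the chosen action
    if deep_latency + margin <= time_remaining:
        return "deep"
    if fast_latency + margin <= time_remaining:
        return "fast"
    return "act-immediately"

print(pick_reasoning_mode(30.0, 20.0, 5.0))  # -> 'deep'
print(pick_reasoning_mode(10.0, 20.0, 5.0))  # -> 'fast'
print(pick_reasoning_mode(4.0, 20.0, 5.0))   # -> 'act-immediately'
```

The design choice mirrors the finding above: raw reasoning depth is only valuable when the clock allows it, so latency has to be a first-class input to the agent's policy.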

For detailed papers and live demos, see below:

Key insights chart


5. Significance of the Open-Source Release and Community Reactions

The community has responded very positively to the release. Whereas agents were previously evaluated mostly on "toy-like tasks," many see the value in finally being able to test them in environments that feel like real apps, which enables much more honest assessment.

"ARE is like the 'gym for agents' that's been missing. Now agents are thrown into environments as complex as real apps, and you can actually observe where and when they break down."

The open-sourcing of ARE opens cutting-edge AI agent research opportunities even for smaller teams, and is expected to accelerate the development of agents with more patience than humans in actual customer service and similar applications.

"Meta's open-source enthusiasm is truly remarkable. The next goal is probably a customer support agent with more patience than a human, right?"

"The release of ARE is a very important step in bridging the gap between the lab and real-world environments in AI agent development."

"If you want to try it out yourself, check out the ARE demo on HuggingFace."


Closing Thoughts

Meta's open-source release of ARE represents a major shift: anyone can now verify and experiment with AI agents' real-world capabilities in realistic, app-level scenarios. It clearly reveals the actual strengths and limitations of current models and provides valuable guidance for the future direction of AI research and development. If you're a developer who wants to try it hands-on, check out the demo and paper links and get involved.
