Ensuring AI Agents Work: Evaluation Frameworks for Scaling Success

1. Introduction and Context 🎤

Aparna Dhinakaran opens the talk, apologizing briefly for a slightly hoarse voice.
She emphasizes that today's topic is evaluating AI agents and assistants.
"Today we're going to talk about something really important: how do you actually evaluate AI agents and assistants?"
She notes that while many AI agents are being built and various tools and methodologies are being introduced, verifying that they work correctly in production is critical.
"Making sure that the agents we build actually work well in the real world is incredibly important."
She acknowledges the talk will touch on technical content, but stresses it is equally essential for leadership-level audiences.

2. New Trends in AI Agents: Multimodal and Voice AI 🗣️

Most people are familiar with text-based agents, but voice AI is now spreading rapidly in call centers and similar contexts.
"Voice AI has already taken over call centers. More than a billion call center interactions worldwide are being handled by voice assistants."
She highlights a real-world example: Priceline's Pennybot, an agent that handles travel bookings entirely by voice.
"We've moved beyond text-only — we're now in the era of multimodal agents."
She emphasizes that evaluation approaches must adapt to the form of the agent (text, voice, multimodal).

3. Core Building Blocks of an AI Agent 🧩

Agents are typically composed of three main components:
1. Router: The "boss" that decides what action to take next
2. Skill: The logic chain that actually performs the task
3. Memory: Stores conversation state and past information
"The router acts like a boss — it decides which skill to call."
Different frameworks (LangGraph, CrewAI, LlamaIndex, etc.) implement these differently, but the router–skill–memory structure is universal.
"These three components appear in almost every framework, no matter which one you use."
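As a rough illustration of how the three components fit together, here is a minimal Python sketch. The `Agent` class, the keyword-based router, and the skill names `product_search` and `customer_support` are hypothetical and are not taken from any of the frameworks named above; in a real agent the routing decision would usually come from an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def product_search(query: str) -> str:
    # Hypothetical skill: a real one might call a search API or an LLM.
    return f"search results for: {query}"


def customer_support(query: str) -> str:
    # Hypothetical skill: hands the request to a support flow.
    return f"support ticket opened for: {query}"


@dataclass
class Agent:
    skills: Dict[str, Callable[[str], str]]
    memory: List[dict] = field(default_factory=list)  # conversation state

    def route(self, query: str) -> str:
        # Toy router: a keyword match standing in for an LLM decision.
        if "return" in query.lower():
            return "customer_support"
        return "product_search"

    def handle(self, query: str) -> str:
        skill_name = self.route(query)
        result = self.skills[skill_name](query)
        # Memory records each turn so later turns can refer back to it.
        self.memory.append({"query": query, "skill": skill_name, "result": result})
        return result


agent = Agent(skills={"product_search": product_search,
                      "customer_support": customer_support})
print(agent.handle("I want to make a return"))
print(agent.handle("recommend the best leggings"))
print([turn["skill"] for turn in agent.memory])
```

Keeping the routing decision, the skill call, and the memory write as separate steps is what makes each of them individually observable later on.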

4. What Is a Router? 🚦

The router receives the user's request and decides which skill to invoke.
Examples: "I want to make a return," "Are there any discounts available?" — these questions flow into the router.
The router selects the appropriate skill (e.g., connecting to customer support, providing discount information) and executes it.
"The router won't always get it right, but it needs to choose correctly as often as possible."
If the router calls the wrong skill, the entire flow breaks down.

5. The Role of Skills and Memory 🛠️🧠

Skills: Execute concrete tasks such as API calls, LLM calls, and more.
"For example, if you say 'recommend the best leggings,' a product search skill is invoked."
Memory: Retains information across multiple conversational turns.
"An agent can't be allowed to forget what it said earlier. That's why memory matters."

6. A Real Example: Looking Inside an Agent's Trace 🔍

Using an open-source project, she walks through a trace: a visual record of the agent's internal operations.
"This is what your engineers actually see when they build and debug agents."
Example: A user asks "What's causing latency in my trace?" → the router decides which skill to call → SQL query runs → a data analysis skill is invoked → results are analyzed.
Multiple router calls may occur, with memory storing state at each step.

7. Evaluation Points for Each Component 📝

1) Router Evaluation

Key question: Did the router call the correct skill?
"If I asked for legging recommendations and got connected to customer support instead, something went wrong."
It also matters whether the correct parameters (e.g., material, price range) are passed when a skill is called.
"You must evaluate whether the router calls the right skill with the right arguments."
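A code-based version of this check could look like the sketch below. It assumes a small hand-labeled set of test cases; the field names (`expected_skill`, `actual_skill`, `expected_args`, `actual_args`) and the example skills are hypothetical.

```python
from typing import Any, Dict, List


def evaluate_router(cases: List[Dict[str, Any]]) -> Dict[str, float]:
    """Compare the router's chosen skill and arguments with labeled expectations."""
    if not cases:
        raise ValueError("need at least one test case")
    skill_hits = 0
    arg_hits = 0
    for case in cases:
        skill_ok = case["actual_skill"] == case["expected_skill"]
        skill_hits += skill_ok
        # Arguments only count as correct when the right skill was chosen.
        arg_hits += skill_ok and case["actual_args"] == case["expected_args"]
    n = len(cases)
    return {"skill_accuracy": skill_hits / n, "argument_accuracy": arg_hits / n}


cases = [
    # Right skill, right arguments.
    {"expected_skill": "product_search", "actual_skill": "product_search",
     "expected_args": {"material": "cotton", "max_price": 50},
     "actual_args": {"material": "cotton", "max_price": 50}},
    # Wrong skill: a legging request sent to customer support.
    {"expected_skill": "product_search", "actual_skill": "customer_support",
     "expected_args": {"material": "cotton"}, "actual_args": {}},
    # Right skill, wrong arguments.
    {"expected_skill": "discount_info", "actual_skill": "discount_info",
     "expected_args": {"category": "leggings"},
     "actual_args": {"category": "shoes"}},
    # Right skill, no arguments needed.
    {"expected_skill": "customer_support", "actual_skill": "customer_support",
     "expected_args": {}, "actual_args": {}},
]

result = evaluate_router(cases)
print(result)  # {'skill_accuracy': 0.75, 'argument_accuracy': 0.5}
```

Reporting the two numbers separately shows whether failures come from picking the wrong skill or from filling in its parameters incorrectly.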

2) Skill Evaluation

There are multiple evaluation points inside a skill.
- For a RAG skill: relevance and accuracy of retrieved information
- Evaluation methods: LLM-based or code-based evaluation
"You need to check the skill's accuracy, relevance, and whether it actually produces the desired output."
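For the RAG case, one simple code-based check is precision at k over the retrieved documents. The sketch below assumes that each test query comes with a hand-labeled set of relevant document IDs; the IDs are made up, and the talk does not prescribe this particular metric. An LLM-based evaluation would instead ask a model to judge relevance, which is not shown here.

```python
from typing import List, Set


def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved document IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    return sum(doc_id in relevant for doc_id in top_k) / k


# Hypothetical retrieval result for one query, best match first.
retrieved_ids = ["d3", "d7", "d1", "d9"]
relevant_ids = {"d1", "d3"}

score = precision_at_k(retrieved_ids, relevant_ids, k=3)
print(round(score, 3))  # 0.667
```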

3) Path (Convergence) Evaluation

Check whether an agent completing the same task always finishes in a similar number of steps.
"The same skill built with OpenAI versus Anthropic can result in completely different step counts."
Consistent and reliable paths are essential.
"We call this 'convergence,' and it's one of the hardest parts to evaluate in practice."
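The talk does not give a formula for convergence, so the following is only one possible way to put a number on it. It assumes that the step count of each run has been recorded and scores each run as the shortest observed step count divided by that run's step count, then averages over runs. A score of 1.0 means every run took the shortest observed path.

```python
from typing import List


def convergence_score(step_counts: List[int]) -> float:
    """Average of (shortest observed path length / path length) across runs."""
    if not step_counts or min(step_counts) <= 0:
        raise ValueError("need at least one run, each with a positive step count")
    best = min(step_counts)
    return sum(best / steps for steps in step_counts) / len(step_counts)


# Hypothetical step counts for the same task run four times.
runs = [4, 4, 5, 8]
print(round(convergence_score(runs), 3))  # 0.825
```

Running the same task set against two model providers and comparing the scores gives a rough view of which setup follows the more consistent path.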

8. Additional Evaluation Dimensions for Voice Agents 🎧

Voice agents require evaluating not just text but the audio itself.
"With voice assistant APIs, audio chunks are sent first, and transcripts are generated afterward."
Evaluation points:
- User sentiment
- Speech-to-text accuracy
- Consistency of tone throughout the conversation
- Intent recognition, audio quality, and more
"You need separate evaluation criteria for text, conversation flow, and audio."
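Of the points above, speech-to-text accuracy is the easiest to check in code. A standard metric for it is word error rate (WER), the word-level edit distance between a reference transcript and the generated one, divided by the reference length. The talk does not name a metric, so treat this as an assumed choice; the example sentences are made up. Sentiment, tone, and audio quality need audio or LLM-based evaluators and are not covered here.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # reference word missing from hypothesis
                          curr[j - 1] + 1,     # extra word in hypothesis
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("book a flight to new york", "book a flight to newark")
print(round(wer, 3))  # 0.333
```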

9. A Real Arize Evaluation Case: The Co-pilot Agent 🤖

Arize's Co-pilot agent lets users make natural-language requests for search, summarization, debugging, and more within the product.
"We're an evaluation company, so we apply evals directly to our own product."
Evals are inserted at each step:
- Whether the overall response is correct
- Whether the router selected the right skill
- Whether the correct arguments were passed
- Whether the skill executed properly
"The key point is that evaluation isn't placed at just one step — it's distributed throughout the entire flow."
When a problem occurs, you can quickly debug to identify exactly which step failed.
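The idea of attaching a check to every step and then locating the first failure can be sketched as follows. The trace format (a list of dictionaries with `kind` and `name` fields), the step names, and the checks themselves are hypothetical; this is not Arize's implementation.

```python
from typing import Any, Callable, Dict, List, Optional

Step = Dict[str, Any]
Check = Callable[[Step], bool]


def first_failing_step(trace: List[Step], checks: Dict[str, Check]) -> Optional[str]:
    """Run the check registered for each step kind; return the first step that fails."""
    for step in trace:
        check = checks.get(step["kind"])
        if check is not None and not check(step):
            return f'{step["kind"]}: {step["name"]}'
    return None


# Hypothetical trace: a router decision followed by two skill calls.
trace = [
    {"kind": "router", "name": "choose_skill", "chosen": "run_sql", "expected": "run_sql"},
    {"kind": "skill", "name": "run_sql", "error": None, "output": "42 rows"},
    {"kind": "skill", "name": "analyze_data", "error": "timeout", "output": None},
]

checks = {
    "router": lambda s: s["chosen"] == s["expected"],
    "skill": lambda s: s["error"] is None and s["output"] is not None,
}

print(first_failing_step(trace, checks))  # skill: analyze_data
```

Because each step carries its own check, a failed overall response can be traced back to a specific router decision or skill call instead of being debugged end to end.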

10. Key Message & Closing 🎯

Agent evaluation must be multi-layered across the entire flow.
"If you take away just one thing from this talk, it's this: put evals at every step of your agent."
When something goes wrong, you must be able to quickly pinpoint whether the issue is in the router, a skill, the path, the audio, or elsewhere.
She closes with: "Any questions?"

Keyword Summary

AI agent evaluation
Router
Skill
Memory
Multimodal / voice agents
Trace (internal operation tracking)
Convergence (path consistency)
Multi-layered evaluation (Eval)
Production deployment
Debugging and reliability

"Making sure that the agents we build actually work well in the real world is incredibly important."

"The router acts like a boss — it decides which skill to call."

"An agent can't be allowed to forget what it said earlier. That's why memory matters."

"You must evaluate whether the router calls the right skill with the right arguments."

"The same skill built with OpenAI versus Anthropic can result in completely different step counts."

"If you take away just one thing from this talk, it's this: put evals at every step of your agent."

This talk provides a thorough and accessible walkthrough of the full structure of AI agent evaluation and how to apply it in practice. If you're interested in building and operating agents, this is well worth your time. 😊
