This video is a beginner's guide that explains, step by step, how to evaluate AI products, particularly LLM-based chatbots. Through a real customer support chatbot case study, it demonstrates practical approaches: defining evaluation criteria, human labeling, using LLM judges, and scaling these processes. It frames AI evaluation as essential for measuring and improving AI product quality, and argues that PMs (Product Managers) must play a central role in the process.
1. What Are AI Evaluations (Evals) and Why Do They Matter?
The video opens by emphasizing the importance of AI evaluation. Aman points out that LLMs (Large Language Models) can produce hallucinations, warning that this can negatively impact a company's brand and revenue. He mentions that even CPOs (Chief Product Officers) of companies selling AI products stress the importance of evaluation, stating that evaluation is an essential element for successfully building AI products.
"The CPOs of these companies are saying that evals are really important. You need to think about what AI evaluations actually are."
"If the people selling LLMs are saying that evals are really important so that when LLMs hallucinate, it doesn't negatively impact the company's reputation or brand, then we need to think about how we can use AI evaluations to build meaningful products."
Ultimately, AI evaluation is a tool for measuring the quality of AI systems, and it is emphasized as one of the most important skills a PM should have.
2. Four Types of AI Evaluation Every PM Should Know
Aman categorizes AI evaluation into four main types. These evaluation types can be utilized throughout the entire process from early AI system development to the production stage.
- Code-based eval:
  - The most basic form, with binary pass/fail criteria such as checking whether a specific string exists in a message.
  - For example, ensuring a "United Airlines" chatbot doesn't provide information about "Delta Airlines" when asked about a Delta booking.
  - Recent advances in code generation have made it easy to create code-based evaluations.
- Human eval:
  - A method where PMs or domain experts directly evaluate AI responses.
  - They judge whether "the answer was good" or "met the criteria," making it an important step where human judgment is involved.
  - Since the PM's role is to make "judgments" about the final product experience, being deeply involved in the details of this process is emphasized as critical.
  "The PM's role is to make a judgment about what the final product experience should be. So focusing on the details of human evaluation determines the success or failure of the product."
  "I don't think it's useful to fully outsource human evaluation to external contractors or labelers. PMs need to work directly in the spreadsheet."
- LLM judge:
  - An evaluation method where LLM systems label data the way humans do.
  - The most widely used method for conducting evaluations at scale, prompting LLMs to evaluate responses similarly to humans.
- User eval:
  - Collecting data from actual users, most closely linked to business metrics.
  - Methods where users interact with the application or provide feedback.
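A code-based eval like the United/Delta example above can be just a few lines of code. Here is a minimal sketch in Python (the function names and competitor list are illustrative, not from the video):

```python
def contains_competitor(response: str, competitors: list[str]) -> bool:
    """Return True if the reply mentions any competitor brand (case-insensitive)."""
    lower = response.lower()
    return any(name.lower() in lower for name in competitors)

def code_based_eval(response: str) -> str:
    """Binary pass/fail: a United Airlines bot must not answer with Delta info."""
    if contains_competitor(response, ["Delta Airlines", "Delta"]):
        return "fail"
    return "pass"
```

Because the check is pure code, it can run on every response in CI or production with no labeling cost, which is why it is the cheapest place to start.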
Aman and Peter particularly emphasize the importance of human evaluation among these four types, advising that PMs should personally pay attention to the details. They share experiences from autonomous vehicle development where they checked data daily, manually labeled, and judged "should the car have done this?"
3. Hands-On: Building a Customer Support Chatbot Evaluation
Moving past theoretical explanations, the evaluation process is demonstrated step by step using a real customer support chatbot as an example. This demo follows the process of building a chatbot for running shoe brand 'On Running Shoes.'
3.1. Writing the Chatbot Prompt
The first step is to write the prompt that makes the chatbot work. Aman demonstrates using Anthropic's console system called "Workbench" to write prompts. This tool helps with initial prompt writing and generates prompts based on best practices.
When writing prompts, it's important to clearly define the tasks the chatbot will perform and the necessary context. For the 'On Running Shoes' customer support chatbot, the following information should be included in the prompt:
- User's question: What the user is asking.
- Product information: Shoe prices, descriptions, names, etc.
- Policy information: Return policies, etc.
Aman demonstrates automatically generating a basic prompt by typing "Design a prompt for a customer support bot handling On Running Shoes customer interactions" in the Anthropic console. This prompt is designed to be flexible and reusable using variables like {{user_question}}, {{product_information}}, and {{policy_information}}.
Product information and return policies are then fetched from the On website and entered into the prompt variables. Peter enters a hypothetical question: "I bought Cloud Monster shoes two months ago and want to return them now," and checks the chatbot's response.
The chatbot responds: "I understand you purchased Cloud Monsters two months ago. Unfortunately, our return policy is within 30 days, so your return period has expired." Peter notes the answer is policy-compliant and helpful, but could be more concise, identifying room for improvement.
"I bought Cloud Monster shoes two months ago and want to return them. I don't like them anymore."
"I understand you'd like to return the Cloud Monsters you purchased two months ago. Unfortunately, our return policy is 30 days, so the return period has passed."
3.2. Creating the Evaluation Rubric
Defining evaluation criteria for assessing the chatbot's answers is the next step. Aman says that if you're a PM, you're familiar with spreadsheets, and spreadsheets are the best tool for LLM evaluation.
The evaluation criteria for the example chatbot are defined along three dimensions:
- Product knowledge: Does the chatbot know the product well?
- Following the rules: Does the chatbot follow given rules (policies, etc.) well?
- Tone: Is the chatbot's response tone appropriate?
Each criterion is defined as 'Good,' 'Average,' or 'Bad,' with specific definitions so team members can evaluate consistently. This is used to build a golden dataset as data starts coming in after prompt writing.
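One lightweight way to make such a rubric shared and machine-checkable is a plain dictionary. The sketch below is an assumption: the exact wording of each Good/Average/Bad definition is invented here, not quoted from the video.

```python
# Each criterion maps each allowed label to a short shared definition,
# so every team member labels against the same wording.
RUBRIC = {
    "product_knowledge": {
        "Good": "Product facts are accurate and complete.",
        "Average": "Mostly accurate, but vague or missing details.",
        "Bad": "States incorrect product facts or invents products.",
    },
    "following_the_rules": {
        "Good": "Fully complies with the stated policies.",
        "Average": "Complies, but omits relevant policy details.",
        "Bad": "Contradicts or ignores a stated policy.",
    },
    "tone": {
        "Good": "Friendly, concise, on-brand.",
        "Average": "Polite but stiff or verbose.",
        "Bad": "Overly formal, curt, or off-brand.",
    },
}

LABELS = ("Good", "Average", "Bad")

def validate_label(criterion: str, label: str) -> None:
    """Reject labels outside the rubric so the golden dataset stays consistent."""
    if criterion not in RUBRIC or label not in LABELS:
        raise ValueError(f"unknown criterion/label: {criterion}/{label}")
```

Encoding the rubric this way also makes it trivial to paste the same definitions into an LLM judge prompt later, so humans and the judge are graded against identical wording.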
Peter enters a new question: "I bought Cloud Monsters three weeks ago but lost the box." The chatbot responds: "You're within the return period at three weeks, but I recommend contacting our customer support team."
A debate arises here:
- Peter argues that since the policy doesn't cover lost boxes, the chatbot's answer is 'Good' under current policy, but the policy itself needs improvement.
- Aman counters that the chatbot deflecting to customer support could create more work, and if the policy doesn't cover it, it might be better to directly state "this isn't covered by the policy."
Through this discussion, an improvement item is identified ("a policy for handling lost boxes is needed"), along with a rules-compliance issue rated 'Bad' for not providing a way to contact customer support. Additionally, the chatbot's tone is downgraded from 'Average' to 'Bad' because it's "too formal and stiff" compared to the intended "more energetic and cheerful" 'On bot' tone.
Aman emphasizes that the process of directly examining and evaluating data, and discussing with team members to determine what's good and bad is critically important. He says this happens in Google Sheets, so "it's not sexy, but doing the right evaluation with your team is the most important thing."
"I think spreadsheets are the ultimate product you can use to evaluate LLMs."
"This is exactly what doing evaluations looks like. We're looking at data, evaluating it, and arguing about what's good and bad. And based on that, we'll start building an LLM judge."
"This work happens in Google Sheets, so it's not very sexy. But it's probably one of the most important things to get right with your team. Because if you start poorly, the rest of the evaluations will be terrible."
3.3. Adding Human Labels to the Golden Dataset (Building the Initial Dataset)
Now the initial 5 example questions and answers are organized in a spreadsheet with human labels (Good/Average/Bad) for each criterion (product knowledge, policy compliance, tone). Aman points out an interesting failure mode here: LLMs are weak at math. When asked "I ordered 45 minutes ago, can I change the delivery address?", the chatbot responds "You've exceeded the 60-minute window, so it's not possible." Since 45 minutes is clearly within the 60-minute window, this is a wrong answer, rated 'Bad.' Aman notes that this is the LLM "making things up" based on the policy.
Peter suggests leaving notes next to items rated 'Bad' and later using an LLM to summarize those notes to derive the top 3 product improvements needed. Aman adds that these notes and labels can be used to improve prompts, essentially creating a self-improving agent.
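The spreadsheet rows described above can be modeled as simple records. The sketch below (field names and the second example's exact wording are illustrative) shows how the notes attached to 'Bad' labels can be filtered out as a list of candidate product improvements, the workflow Peter suggests:

```python
# A golden dataset row: question, answer, one label per criterion, and a note.
golden_dataset = [
    {
        "question": "I ordered 45 minutes ago, can I change the delivery address?",
        "answer": "You've exceeded the 60-minute window, so it's not possible.",
        "labels": {"product_knowledge": "Good", "following_the_rules": "Bad", "tone": "Average"},
        "note": "45 min is within the 60-minute window; the bot got the math wrong.",
    },
    {
        "question": "I bought Cloud Monsters three weeks ago but lost the box.",
        "answer": "You're within the return period; please contact customer support.",
        "labels": {"product_knowledge": "Good", "following_the_rules": "Bad", "tone": "Bad"},
        "note": "No policy covers lost boxes; no contact info given; tone too stiff.",
    },
]

def improvement_items(dataset: list[dict]) -> list[str]:
    """Collect the notes from every example that has at least one 'Bad' label."""
    return [row["note"] for row in dataset if "Bad" in row["labels"].values()]
```

The resulting list of notes is exactly what could then be fed back to an LLM to summarize into the "top 3 improvements," or pasted into the system prompt to close the loop.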
3.3.1. Guidelines for Golden Dataset Size
Peter asks how many examples are needed during initial product development. Aman provides the following guidelines:
- Internal testing or small-scale user launch: Start with a minimum of 5-10 examples. The goal at this stage is to gain initial confidence about whether further investment is needed.
- Production launch requiring high confidence: 100+ examples may be needed depending on the industry.
This approach is about balancing speed of iteration with confidence in results. More data increases confidence but slows iteration speed. Teams should determine the appropriate data scale based on their situation.
3.3.2. Prompt Update Simulation
Returning to the Anthropic console, the prompt is improved based on feedback that "answers are too long and the tone is less friendly." Aman demonstrates instructing Claude to "make it friendlier and less formal" to optimize the prompt. Interestingly, Claude generates a Mermaid graph to define 'friendliness' and even produces code, showing a complex multi-step prompt iteration flow.
However, when the chatbot is run again with the improved prompt, the answers are still long and unfriendly; in fact, they are even longer. Aman says this shows that LLMs can struggle with real data and nuance when improving prompts on their own, which is why human labels and evaluation produce better results for prompt optimization.
"It still doesn't sound very friendly. It actually seems longer. The interesting thing about LLMs is that iterating on prompts requires real data and a bit of nuance."
Peter summarizes that the PM's role is to endlessly iterate — running prompts dozens of times, running evaluations, and modifying prompts again.
4. Scaling Evaluation with LLM Judge Prompts
Once the golden dataset is built and you have a feel for human labeling, it's time to scale evaluation. Platforms like Arize can be used to automate this process and build an LLM-judge system.
Aman uploads the 5-example golden dataset CSV to the Arize platform. On this platform, the same data can be used to encode the evaluation criteria as prompts for the LLM judge.
"Now we can take this dataset and use evaluation criteria similar to what we saw in the spreadsheet to build our LLM judge type system."
He writes an evaluation prompt for the LLM in Arize: "Evaluation guidelines: This LLM evaluates a customer support agent's responses based on the given question, product information, and policy. Assess the accuracy, completeness, tone, and effectiveness of the customer support role. Score how well the response adheres to the provided context and guidelines. Provide a Good, Average, or Bad label along with a detailed evaluation of the response." This prompt takes the chatbot's output, question, product information, and policy as inputs and instructs scoring as 'Good,' 'Average,' or 'Bad.'
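Concretely, a judge prompt like this is a template that gets filled with each example's inputs before being sent to the judge model. A minimal sketch (the template wording paraphrases the guidelines above; the function name is invented, and the actual LLM call is omitted):

```python
JUDGE_TEMPLATE = """Evaluation guidelines: You evaluate a customer support agent's
response based on the given question, product information, and policy.
Assess accuracy, completeness, tone, and effectiveness.
Return a label (Good, Average, or Bad) and a short explanation.

Question: {question}
Product information: {product_information}
Policy: {policy_information}
Agent response: {response}
"""

def build_judge_prompt(question: str, product_information: str,
                       policy_information: str, response: str) -> str:
    """Fill the judge template with one example's inputs."""
    return JUDGE_TEMPLATE.format(
        question=question,
        product_information=product_information,
        policy_information=policy_information,
        response=response,
    )
```

The filled prompt would then be sent to whatever judge model the platform is configured with; the key design point is that the judge sees the same context (question, product info, policy) the chatbot saw.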
Aman applies the three previously defined evaluation criteria (product knowledge, policy compliance, tone) to the LLM judge and uses another LLM model like GPT-5 to regenerate and evaluate responses to the same questions. Through this process, evaluation scores for 'policy compliance,' 'product knowledge,' 'tone,' etc. assigned by the LLM judge can be reviewed along with explanations.
"When using LLM judges for AI evaluation, especially early on, I strongly recommend having the LLM generate explanations as well. Because it's like having a human give notes about why the LLM assigned a label."
However, GPT-5 tends to rate everything as 'Good.' It even gives 'Good' to answers that are too long and unnecessarily verbose. Aman says this shows that LLM judges may not align with human judgment, and therefore the process of comparing and adjusting human labels with LLM judge labels is essential.
5. Aligning the LLM Judge with Human Judgment
When problems arise like the LLM judge rating everything as 'Good,' we can't trust the LLM judge's evaluations. What's needed is the process of aligning the LLM judge with human judgment.
The process works as follows:
- Initial human labeling: As done in the spreadsheet earlier, humans directly assign labels to a small set of 5-10 examples. This helps determine whether the LLM judge is trustworthy.
- LLM judge evaluation: Have the LLM judge evaluate this data.
- Check match rate: Verify how well human-assigned labels and LLM judge-assigned labels agree.
Aman uses the human labeling feature in the Arize platform to directly assign human labels for the chatbot responses' 'tone' and 'product knowledge.' He rates 'tone' as 'Bad' due to the chatbot's long answers.
Now human labels and LLM judge labels are compared. The results:
- Product knowledge: 100% match
- Tone: 100% mismatch (LLM judge rated 'Good' while humans rated 'Bad')
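The match-rate check itself is just per-criterion agreement between the two sets of labels. A minimal sketch (the label data below is invented to mirror the results above, not taken from the demo):

```python
def match_rate(human_labels: list[str], judge_labels: list[str]) -> float:
    """Fraction of examples where the LLM judge agrees with the human label."""
    assert len(human_labels) == len(judge_labels)
    hits = sum(h == j for h, j in zip(human_labels, judge_labels))
    return hits / len(human_labels)

# Mirrors the outcome above: full agreement on product knowledge, none on tone.
human = {"product_knowledge": ["Good", "Good", "Bad"], "tone": ["Bad", "Bad", "Bad"]}
judge = {"product_knowledge": ["Good", "Good", "Bad"], "tone": ["Good", "Good", "Good"]}
```

Criteria with a low match rate are exactly the ones whose judge prompt needs tightening before the judge's scores can be trusted at scale.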
This result clearly shows that there's room for improvement in the LLM judge's evaluation prompt for 'tone.' Specific criteria like "judge that overly long answers are not good" need to be added to modify the prompt. These improvements can then be taken back to development environments like the Anthropic console to improve the chatbot's system prompt (e.g., "don't give overly long answers").
"You want the LLM judge to align with human labels and the judgments you're making."
"Product knowledge is 100% match, but tone only matched our human labels once. This is an example showing there's room to improve the evaluation based on the human labels we had."
"We need to improve the tone evaluation prompt. Basically make it more strict about how long the answer is."
6. Conclusion: Summary of the Entire AI Evaluation Process
Aman and Peter summarize the entire AI evaluation process based on the demonstration:
- Write prompts: Write the chatbot's initial prompt and gather necessary information (policies, product info, etc.).
- Create evaluation criteria: Define criteria like product knowledge, policy compliance, and tone. This can be done in spreadsheet form.
- Manual evaluation (human labeling): Manually assign labels to a small set (5-10) of example questions and answers, discuss with team members, and iteratively improve prompts and criteria. This stage is time-consuming but the most important.
- Start with around 10 data points initially, targeting 100+ for production launch.
- Use LLM judges: Scale evaluation using platforms like Arize. Write evaluation prompts for the LLM judge and compare human-labeled data with LLM judge evaluations to check match rates.
- Align LLM judges: Improve evaluation prompts so LLM judge evaluations match human judgment. Focus especially on criteria with low match rates.
- Deploy and real user evaluation: Launch the chatbot to a small user group (e.g., 1% of traffic) via A/B testing and collect actual user feedback (user evaluation). When interpreting this feedback, be careful: a simple 'like/dislike' reaction can conflate complex user emotions with actual system performance.
"This is what evaluation looks like in practice."
"People are always looking for a silver bullet, but I think PMs understanding this process is really critical for AI products to actually work and not be terrible in real life."
This video concludes by emphasizing that while AI product development can be complex and messy, it is essential for PMs to deeply understand and directly participate in this evaluation process to build successful AI products.
