This webinar emphasizes that AI agent simulations can be a powerful tool for predicting human behavior and aiding decision-making. It explains in detail how large language models (LLMs) can be used to create simulation agents that behave like real people, and explores the technology's transformative potential across diverse fields including organizational leadership, market research, policy-making, and even personal soft skills training.
1. The Challenge of Decision-Making Under Uncertainty
We make countless decisions in daily life and organizational management. But Michael points out that these decisions are often based on incomplete information. Predicting how people will react is particularly difficult. For example, it's not easy for a company to predict customer reactions when launching a new service, or for a leader to predict employee responses when reorganizing or introducing a new management style.
"When we make decisions, we often make them based on incomplete information about how people will react."
This incomplete information easily leads to poor judgments. We do our best to predict, but we often end up being wrong. This isn't due to a lack of ability, but because predicting complex human behavior in advance is inherently very difficult. Since new products or policies are hard to reverse once launched, we face the constraint of "only getting one shot." These problems appear across diverse fields including consumer goods, personal finance, product design, policy, and management. Even professors reportedly struggle to predict student reactions.
This challenge is not new. The famous sociologist Robert Merton highlighted the difficulty of engineering collective behavior nearly a century ago, in 1936. For example, if everyone tries to avoid the city and go on vacation at the same time, they all end up in the same place and can't escape the crowds. Predicting people's reactions and making decisions accordingly has always been difficult.
2. The Emergence of the "What-If Machine": AI Agent Simulation
Michael poses the question: what if we had a "what-if machine" to solve these challenges?
"If you had a 'what-if' machine, what would you use it for?"
If we could imagine what would happen before it occurs, we could make much better decisions. For example:
- "If our organization takes this path, how will customers react?"
- "If we fail or go wrong, through what pathway would that happen?"
- "What if we could preview how people will react before deploying this policy change, new product, or strategy?"
He emphasizes that with such a "virtual testing machine," we could make better decisions much more frequently. This vision posed an intriguing question for Michael as a computer science researcher: Can we create simulation AI agents that reproduce human behavior? If we could create AI simulations that behave like our organizations or customers, many of the problems mentioned above could be solved.
Of course, simulation itself is not a new idea. Thomas Schelling, who won the 2005 Nobel Prize in Economics, proposed the idea of agent-based models, which are still used today. These models are used for policy decisions such as predicting pandemic spread. In entertainment, The Sims is a representative example of simulating people's behavior.
Recently, as AI is deployed in environments where it needs to interact with us, concepts like AI colleagues are emerging. The ability to predict how we would react when AI takes certain actions has become necessary. Michael believes this technology can create "look before you launch" tools. He recounts how his past experience designing social media platforms taught him the limitations of modifying policies only after problems occur.
"These problems happened because we couldn't effectively predict what would go wrong."
Ultimately, an opportunity emerged to build tools that identify what policies or rules should be set before launching systems, enabling prevention rather than reaction and avoiding "dumpster fires."
3. Limitations of Existing Simulation Models and the Emergence of LLMs
However, simulation models have until now been too rigid.
- Parameter-based models: Attempts to reduce humans to a few parameters (e.g., 5). These are too impoverished to capture the richness of human behavior.
- Script-based models: Like The Sims, where specific scripts are pre-written for situations such as "if punched, fall down and get angry." These are limited to the types of behavior we can think of and are always incomplete.
Due to these limitations, academics judged these models to be very stylized, with minimal real-world impact. While interesting, they saw few actual applications.
But things began changing a few years ago. Michael and his colleagues noticed that Large Language Models (LLMs) like ChatGPT, Claude, Llama, and DeepSeek had been trained on vast amounts of data about human behavior.
"They have literally read every study on how people behave, and they've also seen all the good, bad, and ugly of human behavior from social media data."
In other words, LLMs have learned both theoretical knowledge and real examples of human behavior. As a result, they realized that LLMs could be prompted to take on the perspective of a person with diverse backgrounds, experiences, and characteristics. By giving an LLM a specific name, description, and situation and asking how that person would react, they could combine multiple responses to create a diverse crowd of people. Placing these crowds in various situations allows prediction of what would happen.
4. The "Smallville" Simulation and Generative Agents
Michael and his research team actually started building these simulations, which became widely known as "Generative Agents." They created a small town called "Smallville" -- a kind of terrarium.
"We built what is almost like a little terrarium. These AI agents are simulated people. We call them generative agents."
Smallville is home to 25 generative agents. They are all fully autonomous AI, each playing a different person's role. The town's artist wakes up and paints, college students sleep in and go to class and do homework -- each living their own daily lives. Through these simulations, a space was created for studying human behavior and how specific interventions affect people.
This research attracted much attention, and notably, Andreessen Horowitz cited Michael's team's research as evidence that this technology would become the next-generation market research tool.
Michael outlined what he would explain in the webinar:
- AI simulation can create believable and accurate human behavior simulations
- A technical guide to making these simulations possible
- Common mistakes (gotchas) people make when simulating
- The current frontiers of this technology -- what will soon become possible
4.1. How to Build Generative Agents
The basic method for creating generative agents is as follows:
- Visualization: Commission pixel art of diverse characters from an online artist. This mainly helps in visually following the simulation's progress (it is not required).
- Persona creation: Create a small persona for each agent. For example, give a pharmacist character named "John Lin" a description of running the town pharmacy and being very friendly.
- Providing relationships and knowledge: Provide information so agents know about other agents in the simulation. For example, input that John is married to Mei Lin and has a college student son, Eddy Lin, studying music theory. Without this information, the agent wouldn't recognize their spouse at the start of the simulation.
"Basically we describe the agents and then let them act on their own from that point forward."
Agents wake from sleep, brush their teeth, shower, prepare breakfast, and start their day by conversing with each other. At this point, Michael shared a demo link for attendees to experience the town in real-time.
Agents primarily use natural language to act in the environment. For example, when an agent named "Isabella Rodriguez" says she's drinking coffee, the system converts this text into specific movements within the game environment. It renders her sitting in a chair drinking coffee and summarizes the action with emojis.
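The translation from action text to game primitives might look roughly like this, assuming a generic `llm(prompt) -> str` callable (the real system's interface isn't shown in the webinar):

```python
def render_action(llm, agent_name: str, action_text: str) -> dict:
    """Translate a natural-language action into game-engine primitives:
    a target location and an emoji summary (names are illustrative)."""
    location = llm(f"{agent_name} is doing: '{action_text}'. "
                   "Which area of the town is this happening in? Answer in one word.")
    emoji = llm(f"Summarize this action in at most two emojis: '{action_text}'")
    return {"agent": agent_name, "location": location, "emoji": emoji}

# e.g. render_action(llm, "Isabella Rodriguez", "drinking coffee at the cafe")
# might return {"agent": "Isabella Rodriguez", "location": "cafe", "emoji": "☕"}
```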
Users can also directly interact and intervene with these agents. For example, acting as a newspaper reporter and asking about mayoral candidates, an agent responds "I heard Sam is running for mayor." Even when speaking in an agent's voice -- "John, you should run for mayor" -- John responds "I need to talk to my family about this important decision." Users can also intervene in the game world by doing things like setting a toaster on fire, and agents recognize this and respond appropriately by putting out the fire and preparing a different breakfast. This provides opportunities to study environmental interventions and agents' responses.
4.2. Information Diffusion and Complex Behavior Simulation
Agents exchange information through conversation. When John talks with his son Eddy in the morning and learns that Eddy is working on a music theory composition assignment, John remembers this. Later, when his wife Mei wakes up and asks about their son, John relays this information, demonstrating how information diffuses through the environment.
To examine more complex scenarios, the research team implanted Isabella, a cafe-running agent, with the intention to plan a Valentine's Day party. The simulation ran from the morning of February 13th through the evening of the 14th.
Remarkably, without any specific party-planning module, the agent spontaneously began planning the party and informing other agents. This spread through information diffusion patterns, similar to rumors or word-of-mouth. Isabella told people about the party, who in turn told others. Isabella even asked her friend Maria for help decorating the cafe. All of this was the agent's spontaneous decision, not pre-programmed.
On Valentine's Day, 12 of the town's 25 agents (roughly half) had heard about the party. Five attended, three said they were busy, and four were interested but ultimately didn't come. Michael noted that while it's hard to determine exactly how accurate this is to reality, it was a broadly plausible result.
Even more interestingly, beyond giving Isabella the party-planning intention, the team implanted a memory in Maria that she had feelings for another agent named Klaus. As a result, Maria invited Klaus to the party, leading to a small "agent romance."
Other researchers reproduced this simulation and intervened with different scenarios:
- New epidemic news: Agents who heard news of a new epidemic (swine flu) via radio didn't show up to the party. Only poor Klaus, who hadn't heard the news, showed up.
- No threat: With no threat, the party proceeded normally.
- Non-infectious disease: When news of non-infectious conditions like diabetes complications was shared, the party proceeded normally.
These experiments demonstrated the ability to ask questions like "What if we intervene this way?" -- showing this can be a very powerful tool.
5. Core Elements of the Generative Agent Architecture
The key components needed to build generative agents are as follows.
5.1. Memory Stream
The first thing an agent needs is memory capability. This is called a memory stream -- a blow-by-blow record of everything the agent observes. From seeing the bed, desk, and wardrobe to actions like "stretching," "writing a diary," and "cleaning the kitchen" -- everything is recorded.
But feeding all these memories into an LLM at once could distract the model. So they use a technique called Retrieval Augmented Generation (RAG). RAG selectively retrieves memories based on:
- Recency: More recently observed or recalled memories are weighted more heavily.
- Importance: "Received a party invitation" is more important than "brushed teeth." Agents self-assess memory importance.
- Relevance: If an agent is taking a math exam, memories related to exams and math class are prioritized.
Through this approach, agents search through vast memories for needed information, much like a Google search. For example, when asked "What are you looking forward to right now?", the system retrieves recent, important, and relevant memories (party planning, decoration orders, idea searches) and feeds them into the LLM's context window. Adding Isabella's self-description enables her to give reasonable responses like "I'm looking forward to this Valentine's Day party." This is how agent memory works.
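A minimal sketch of that retrieval scoring, combining recency, importance, and relevance as described above (the decay constant, equal weights, and dictionary-based memory format are illustrative simplifications):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieval_score(memory: dict, query_embedding: list[float], now_hours: float,
                    w_recency=1.0, w_importance=1.0, w_relevance=1.0) -> float:
    # Recency: exponential decay since the memory was last accessed
    recency = 0.995 ** (now_hours - memory["last_accessed_hours"])
    # Importance: a 1-10 score the agent assigned when the memory was stored
    importance = memory["importance"] / 10.0
    # Relevance: similarity between the memory and the current query
    relevance = cosine(memory["embedding"], query_embedding)
    return w_recency * recency + w_importance * importance + w_relevance * relevance

def retrieve(memories: list[dict], query_embedding: list[float],
             now_hours: float, top_k: int = 5) -> list[dict]:
    """Return the top-k memories to place in the LLM's context window."""
    ranked = sorted(memories,
                    key=lambda m: retrieval_score(m, query_embedding, now_hours),
                    reverse=True)
    return ranked[:top_k]
```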
5.2. Reflection
Michael emphasizes that reflection capability is also crucial. The memory stream is merely an episodic memory of "what happened." But we are more than simple records. Agents need the ability to generate higher-level reflections about who they are, what they like, their tendencies and interests, and their goals.
The research team had agents metaphorically take "shower thoughts." At regular intervals, they pull some memories from the memory stream and ask the agent to reflect on those memories.
"Essentially we have agents take shower thoughts. Figuratively speaking."
For example, from two observations that Klaus Mueller reads about "gentrification" and "urban design," the system generates a higher-level reflection: "Klaus spends a lot of time reading." These reflections are inserted back into the memory stream, generating even higher-level reflections that shape the agent's identity and goals. This makes agents act more consistently with their own goals rather than moving robotically step-by-step.
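A sketch of how that reflection step might be wired up, assuming a generic `llm` callable and an agent object with a `memory_stream` list; the accumulated-importance trigger is an assumption based on the description above:

```python
def reflect(agent, llm, recent_n: int = 100, importance_threshold: int = 150):
    """Periodically distill recent observations into higher-level insights
    and write them back into the memory stream ("shower thoughts")."""
    recent = agent.memory_stream[-recent_n:]
    # Trigger only once enough important events have accumulated
    # (the exact trigger and threshold are assumptions, not from the webinar)
    if sum(m["importance"] for m in recent) < importance_threshold:
        return
    observations = "\n".join(m["text"] for m in recent)
    prompt = ("Given only the observations below, what 3 high-level insights "
              "can you infer about this person?\n" + observations)
    for insight in llm(prompt).splitlines():
        if insight.strip():
            # Reflections re-enter the stream, so later reflections can build
            # on earlier ones into higher and higher-level abstractions
            agent.add_memory(text=insight.strip(), kind="reflection")
```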
5.3. Planning
The final core element is planning capability. Planning is essential for agents to maintain believable behavior over long periods. This has been a longstanding challenge in AI.
Agents first create a plan for the entire day. Then, based on the daily plan, they create hourly plans, which are further broken down into minute-by-minute detailed plans. When agents discover something new while acting in the environment (e.g., John seeing his son Eddy walking near the workplace), the agent uses background knowledge to determine whether to react to that observation, decides the appropriate response, and if necessary, replans.
"This is how agents adapt while the simulation is running."
Through this process, agents flexibly adjust their behavior and plans in response to environmental changes, producing realistic responses.
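A condensed sketch of this plan-then-react loop (generic `llm` callable, simple `agent` object with `summary()` and `current_plan`; the prompts are paraphrased, not taken from the paper):

```python
def plan_day(agent, llm) -> str:
    """Top-down planning: a day outline, refined into hourly then minute chunks."""
    day_plan = llm(f"{agent.summary()}\nOutline {agent.name}'s plan for today "
                   "in 5-8 broad strokes.")
    hourly = llm(f"Break this day plan into hour-long chunks:\n{day_plan}")
    return llm(f"Break these hourly chunks into 5-15 minute actions:\n{hourly}")

def perceive_and_maybe_replan(agent, observation: str, llm):
    """When something new is observed, decide whether to react, and replan if so."""
    decision = llm(
        f"{agent.summary()}\nCurrent plan: {agent.current_plan}\n"
        f"Observation: {observation}\n"
        "Should the agent react to this observation? "
        "Answer 'yes: <reaction>' or 'no'."
    )
    if decision.lower().startswith("yes"):
        reaction = decision.split(":", 1)[1].strip()
        agent.current_plan = llm(
            f"Revise the plan so the agent first does: {reaction}\n"
            f"Old plan: {agent.current_plan}"
        )
```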
6. Measuring Simulation Accuracy: Believability and Validity
Michael's research attracted much attention, but he says it hasn't fully answered the question "Can we create AI agents that reproduce human behavior?" What he showed was agents' "believable" behavior. Disney characters and cartoons are also believable, but they're not accurate. If actual decisions are based on these simulations, how accurately agents reproduce behavior becomes a critical issue.
"Disney characters are believable. Cartoons are believable. But they might not be accurate."
So how can we measure this accuracy? Michael presents several approaches.
6.1. Demographic Agents vs. Persona Agents
- Demographic Agents: Taking a sample from a specific population group and creating agents using only demographic information like age, location, and occupation.
- Persona Agents: Like in the Smallville simulation, creating agents using more narrative-based descriptions about people.
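As a minimal illustration of the difference (contents hypothetical), the two approaches differ mainly in how much of a person the prompt carries:

```python
# Demographic agent: a handful of variables, nothing more
demographic_prompt = (
    "You are a 29-year-old software engineer living in Seattle who votes "
    "independent. Answer the following question as this person would."
)

# Persona agent: a short narrative description, as in Smallville
persona_prompt = (
    "You are John Lin, a pharmacist who runs the town pharmacy and loves to "
    "help people. You are married to Mei Lin, a college professor, and have "
    "a son, Eddy, who studies music theory. Answer the following question "
    "as this person would."
)
```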
However, existing research shows both approaches can produce very simplified and stereotyped behavior. For example, when asked what Joon Park (a Korean researcher) would eat for lunch, the model answered "rice." This was an inaccurate, biased, and highly stereotyped response.
6.2. Creating "Digital Twins" Through In-Depth Interviews
To solve this problem, the research team discovered that rich qualitative information was effective.
- 2-hour in-depth interviews with 1,000 people: They conducted 2-hour in-depth interviews with 1,000 representative Americans. (Later research found that the full 2 hours wasn't necessary.)
- "American Voices Project" script: They used extensive interview scripts from Stanford's David Grusky-led "American Voices Project." The script begins with "Tell me your life story" and covers diverse aspects of life including community, work, finances, health, and politics.
- Digital Twin creation: Based on these 2-hour interviews, they created generative agents as digital twins of each real person. The interview content itself becomes the agent's memory. Now they had 1,000 real people and 1,000 twin agents.
- Comparing real people and agents on surveys/experiments: Real people participated in extensive surveys like the General Social Survey (GSS, about 170 questions), the Big Five personality inventory, behavioral economics games, and various experiments. The generative agents performed the same surveys and experiments.
- Measuring accuracy: Now they could measure how accurately each person's agent version reproduces that person's actual behavior and attitudes.
"We ask how closely this person's agent reproduces that person's actual behavior and attitudes."
The interviews were actually voice interviews conducted by AI agents, yielding very rich information about people's backgrounds, life stories, political leanings, careers, and finances. To create agents from this data, they place the interview transcript at the top of an LLM prompt and instruct: "Based on this interview content, predict how this person would respond to the following survey or experiment."
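A sketch of that prompting setup (the helper function and `llm` callable are hypothetical; the instruction wording paraphrases the webinar's description):

```python
def predict_survey_response(llm, interview_transcript: str,
                            question: str, options: list[str]) -> str:
    """Place the full interview at the top of the prompt, then ask the model
    to answer a survey item as that person would."""
    prompt = (
        "Below is a transcript of an in-depth interview with a person.\n\n"
        f"{interview_transcript}\n\n"
        "Based on this interview, predict how this person would answer "
        "the following survey question. Reply with exactly one option.\n\n"
        f"Question: {question}\n"
        f"Options: {', '.join(options)}"
    )
    return llm(prompt)
```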
6.3. Remarkable Results: Approaching Human Self-Replication Levels
The results were remarkable: the agents reproduced attitudes and behaviors accurately. An important methodological point underpins how accuracy is measured: if a person responds to a survey today and again two weeks later, they won't give exactly the same answers. This natural variability needs to be normalized for.
The research team administered the same survey twice, two weeks apart, to all 1,000 participants. They then measured agent accuracy as a ratio of "how accurately a person reproduces themselves after two weeks."
- 1.0 = The agent reproduces the person's responses with the same accuracy as the person reproduces themselves after two weeks.
- 0.1 = The agent reproduces the person's responses at 10% of the accuracy the person achieves when reproducing themselves after two weeks.
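In code, this normalization is a simple ratio; a sketch assuming exact-match agreement on categorical survey items:

```python
def normalized_accuracy(agent_answers, human_t1, human_t2) -> float:
    """Agent accuracy, normalized by the person's own two-week test-retest
    consistency. 1.0 means the agent matches the person's time-1 answers as
    well as the person's own time-2 answers do."""
    raw = sum(a == h for a, h in zip(agent_answers, human_t1)) / len(human_t1)
    retest = sum(h1 == h2 for h1, h2 in zip(human_t1, human_t2)) / len(human_t1)
    return raw / retest

# If the agent matches 60% of time-1 answers and the person self-replicates
# 75% of them two weeks later, normalized accuracy is 0.60 / 0.75 = 0.8.
```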
Results:
- Random guessing: Random guessing on a large survey reproduces about one-third of a person's responses. This serves as a baseline for how predictable human behavior is.
- Persona/demographic agents: roughly 0.70 on this normalized measure when reproducing a person's responses.
- In-depth interview-based agents: 0.85 normalized accuracy on the General Social Survey (GSS). Michael called this "very impressive."
These results showed similar ranges of normalized coefficients across other tasks including the Big Five personality inventory and behavioral economics games. This suggests considerable potential.
6.4. Bias Reduction and Scientific Research Replication
Moreover, these in-depth interviews had the effect of reducing bias. Simple information like "Republican conservative" alone produces stereotyped responses, but with rich information, models can generate much more nuanced answers.
- Political orientation: Politics was the hardest area to model, but interviews significantly reduced bias.
- Gender and race: Showed smaller gaps than expected, with accuracy differences dropping to less than 1% through interviews.
Particularly remarkable was the ability to replicate scientific research. The team had 1,000 real people and 1,000 agents replicate 5 experimental studies published in top-tier journals.
- Agents replicated 4 out of 5 studies.
- "You might say that's not good, but we're not the ones who missed it. The 1,000 real people also failed to replicate that fifth study."
- The fifth study was actually bad science and wouldn't replicate with real people either. The simulation accurately predicted this fact!
Stanford colleague Robb Willer generalized these results further, finding that simulations could predict effect sizes of pre-registered, unpublished research results (from 10,000 participants) with a very strong correlation between 0.85 and 0.9.
6.5. Building and Utilizing Agent Banks
Through this research, they built an "agent bank" of 1,000 Americans. This agent bank can be used to test diverse questions about depolarization, climate change, and more.
Organizations should consider how to compose agent banks based on their questions of interest:
- Should they represent current customers?
- Should they represent people who aren't current customers but could become potential customers?
- Can existing marketing or user research data (past interview data) be converted into agent banks?
7. Simulation How-To Guide and Gotchas
Michael provides advice for building these kinds of simulations.
7.1. Quality of Agent Definitions
- Bad approach: Defining agents by a single demographic variable (e.g., "conservative," "Republican"). This produces all kinds of stereotyped behavior and underestimates behavioral variance, making uncertain situations appear certain. There's insufficient information.
- Not-so-bad approach: Using 5-6 demographic variables. This could reproduce people at about 70% of the level a person reproduces themselves after two weeks.
- Better approach: Collect the richest data possible, such as in-depth interviews. Surprisingly, even when 80% of a 2-hour interview was deleted to make it much shorter, accuracy only dropped from 0.85 to 0.79. This means interviews provide very rich information.
"There is very rich information in these interviews."
However, it's important that the remaining interview content is relevant to what you're trying to predict. If you interview only about fashion and try to predict climate change views, the model won't have sufficient information to generalize. The same applies to interviewing only about sports teams and trying to predict retirement plans. Ensure the data you collect is relevant.
7.2. Risk Mitigation: The "Ladder" Metaphor
There are risks with this technology. Not all research replicates accurately, and errors can occur in real-world applications. For example, in a retirement plan fee simulation attempted by one company, the actual response from the 18-35 age group for "very aware" was 13%, but the simulation result was only 1.2%. The difference between 13% and 1.2% could significantly impact decisions about whether to ignore a specific demographic or consider them an important minority group.
Michael compares simulations to a "ladder" -- climbing higher means taking greater risks, but also achieving more ambitious goals.
- Possibility stage (lowest risk):
  - Question: What could happen? (No probabilities assigned.)
  - Trust condition: Must be able to generate a plausible chain of events that could lead to potential outcomes. Users should be able to say "Ah, that could happen."
  - Application: Predicting how a troll could disrupt the system and establishing safeguards. This stage generally works well currently.
- Qualitative outcomes stage:
  - Question: Attitudes, chat outcomes, etc.
  - Trust condition: Must be able to accurately estimate individual attitudes.
  - Application: Works well in most cases with sufficiently rich data. Can't replace actual community engagement, but useful for roughly gauging reactions to new policies or products.
- Quantitative outcomes stage:
  - Question: Histograms, bar charts -- where the difference between 5% and 10% matters.
  - Trust condition: Requires actual measurement of quantitative accuracy. Can market research surveys be replicated?
  - Application: Possible in many cases, but errors can occur, so proceed cautiously. The advice is to use simulation to narrow 100 ideas down to 5 promising ones, then A/B test those 5 with real people.
- Multi-agent simulation stage (highest risk):
  - Question: Full market simulations like the Smallville town.
  - Trust condition: Must be able to trust that every individual agent is accurate. If individual agents are accurate, the emergent outcomes from combining them should also be accurate.
  - Application: Not yet ready for direct decision-making application. Can be approached from a complex systems perspective, but this is even more difficult. Multi-agent simulations may be right or wrong, and distinguishing between the two is difficult, so extreme caution is needed.
7.3. Risk Mitigation Strategies
- Secure in-domain data: Ensure the agent's memory contains relevant data. If asking about fashion, provide fashion-related data; if asking about retirement, provide retirement-related data.
- Focus on "rough-edged problems": Focus on problems at the possibility and qualitative outcomes stages where 80% accuracy is still helpful. These aid learning, but "sharp-edged problems" like quantitative outcomes can lead to wrong decisions if inaccurate.
- Validate with small subsamples: For important questions, validate with small groups of real people to ensure the model isn't too far off.
8. Technology Frontiers
This AI agent simulation technology is opening new possibilities in several fields.
8.1. "Look Before You Launch" Tools
This technology originated in online platform design, where policies often caused unexpected side effects; these simulation tools proved very useful for catching problems before they occur.
"Everything we did was set up as a reaction to a 'dumpster fire.' That's a really terrible way to run a community."
Users discovered through iterative work within simulations things like "I didn't know these low-effort posts would lead to this outcome" or "I didn't expect trolls to react this way." And they could fix problems before actually launching the system. Michael gives annual assignments in his online platform design class where students must defend systems against troll attacks, which helps produce better projects.
8.2. Soft Skills Training
Collaborating with conflict and negotiation experts at Stanford Business School, they built a soft skills training tool using this technology. People think they're good at conflict negotiation, but struggle when situations go sideways. Generative agents can serve as sparring partners or training partners.
"Perhaps these generative agents could be sparring partners or training partners."
They created tools for practicing in simulated conflict situations before facing real conflicts like salary negotiation for a new job. Through experiments, one group watched lectures on conflict management strategies, while another watched lectures and also experienced simulated conflict situations. Both groups scored equally well on theoretical knowledge tests about conflict, but only the group that experienced simulations performed better in actual conflict situations. Simulation experience reduced the likelihood of using antisocial strategies by two-thirds. This suggests that trying things out in simulation greatly aids learning.
8.3. Business Applications and Market Research
This technology has great potential for various business applications including market research. As noted in the Andreessen Horowitz report, Michael's Stanford research spun out into a company called Simile.
"I want to emphasize that there is so much we can learn through the opportunity to build this 'what-if' machine."
This technology can serve as a powerful "virtual testing machine" helping us predict what will happen in the future and make better decisions. Michael concluded the webinar by sharing a QR code for an online course he teaches for those wanting more information.
Conclusion
The AI agent simulation presented by Michael at the Stanford Global Alumni Webinar demonstrated an innovative approach to solving the longstanding challenge of human behavior prediction. Particularly with the advancement of Large Language Models (LLMs), it proved that highly believable and accurate "generative agents" can be created through three core architectural elements: "memory," "reflection," and "planning." Real-world examples like the Smallville simulation demonstrated information diffusion, complex behavior prediction, and the ability to predict responses to policy interventions, making the technology's potential tangible.
Of course, this technology is still in its early stages, and as explained through the "ladder metaphor," complex areas like quantitative outcome prediction and multi-agent simulation still require cautious approaches and validation. However, early-stage applications like "look before you launch" tools and soft skills training are already proving their great utility. Ultimately, this "virtual testing machine" is expected to establish itself as a powerful tool helping businesses, policymakers, and individuals all make wiser and more predictable decisions.
