This session features Microsoft Principal Researcher Emre Kiciman introducing the motivation behind the PyWhy-LLM library and its real-world use cases. Through demonstrations, he explains in an approachable way how LLMs can support the critical task of acquiring domain knowledge in the causal inference process. The session covers, in detail, the potential of LLMs, the challenges that remain, and the experimental architecture at this early stage of development.


1. Introduction to the 'PyWhy Causality in Practice' Series and Session Overview

The talk begins as the first session of a new series from the PyWhy community. This series is designed to broadly cover not only PyWhy libraries, but also foundational causal inference theory, practical applications, and recent interesting research trends.

"This lecture series focuses on how to practically connect causal inference and machine learning."

The speaker acknowledged that the connection between LLMs (Large Language Models) and causal inference may feel unfamiliar, but said he would explain the intersection over the course of the presentation.

The format was also explained: questions should be left in the Teams chat during the 45-minute presentation, with live Q&A to follow afterward.


2. The Persistent Problem in Causal Analysis Practice and the Potential of LLMs

Emre emphasizes that a critical component of causal inference analysis is domain knowledge. He points out that expert judgment always plays a major role in setting up causal graphs and assumptions.

"The most central challenge when doing actual causal analysis is: where do we get domain knowledge?"

He then introduces representative failures of observational studies, concretely explaining how devastating the absence of domain knowledge can be. Key examples include:

  • Night lights and myopia: A 1999 study concluded that "children who sleep with a night light on are more likely to develop myopia."

    "Subsequent studies failed to replicate this result, and it turned out to be a spurious correlation caused by myopic parents being more likely to use night lights."

  • Vitamins and cardiovascular disease: Observational findings suggested various vitamins reduced cardiovascular disease, but randomized controlled trials refuted most of these claims.
  • COVID-19 and medical AI: Early in the pandemic, attempts were made to diagnose COVID from chest X-rays, but external factors such as patient position (sitting vs. lying down) and pediatric training data were confounded with the outcome, producing useless diagnostic tools -- a recent, real-world example.

These problems all boil down to "errors that occur when domain knowledge about the data generation process is not properly reflected."


3. PyWhy-LLM: LLM as a Causal Inference Assistant

Against this background, the talk explains the origin and goals of the PyWhy-LLM library.

"PyWhy-LLM is an experimental causal inference support tool created to complement the parts that previously relied only on human experts, by leveraging LLM's vast knowledge."

Since LLMs already internalize extensive domain common sense from massive text corpora, they can play the following roles at each stage of causal analysis:

  • Modeling and Design:
    • Suggesting causal relationships (cause-effect) between variables
    • Identifying 'potential confounding variables' not yet considered
    • Providing variable lists and descriptions

      "Typically at this stage, analysts need to repeatedly interview multiple domain experts, but LLMs can serve as a quick and convenient brainstorming assistant."

  • Identification:
    • Suggesting appropriate 'instrumental variables', 'back-door', 'front-door' variables
  • Validation:
    • Critiquing causal graphs, suggesting alternatives, proposing negative controls, etc.
  • Estimation:
    • Code automation integrated with existing analysis algorithms

The LLM's suggestions are designed to integrate naturally with practical tooling such as the existing PyWhy libraries and NetworkX graphs.
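As a rough illustration of the suggester pattern these stages describe, the sketch below stubs out the LLM with a canned edge set. The class and method names are assumptions for illustration only, not the actual pywhy-llm API.

```python
class StubModelSuggester:
    """Stands in for an LLM-backed, modeling-stage suggester.

    Illustrative sketch only: the names here are not the real
    pywhy-llm API, and a canned edge set replaces the LLM call.
    """

    _known_edges = {
        ("temperature", "ice cream sales"),
        ("temperature", "shark attacks"),
    }

    def suggest_relationship(self, a, b):
        """Return the suggested causal direction between two variables."""
        if (a, b) in self._known_edges:
            return f"{a} -> {b}"
        if (b, a) in self._known_edges:
            return f"{b} -> {a}"
        return None  # no causal relationship suggested

    def suggest_confounders(self, treatment, outcome):
        """Return variables suggested as common causes of both."""
        causes_t = {a for a, b in self._known_edges if b == treatment}
        causes_o = {a for a, b in self._known_edges if b == outcome}
        return sorted(causes_t & causes_o)


suggester = StubModelSuggester()
print(suggester.suggest_relationship("ice cream sales", "temperature"))
print(suggester.suggest_confounders("ice cream sales", "shark attacks"))
```

In the real library, the canned edge set would be replaced by prompted LLM calls, and the returned edges and confounders would feed into a causal graph for the identification and estimation stages.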


4. PyWhy-LLM Code Structure and Key Demonstration Examples

Emre then provides a detailed demonstration of PyWhy-LLM's actual code structure and usage.

  • Under the library's suggesters folder, protocols are defined for each of the four stages of causal analysis.
  • For example, in the modeling stage:
    1. Request a judgment on the causal relationship between two variables ("ice cream sales" vs "temperature")
    2. Input a list of multiple variables and convert them into a causal graph (NetworkX-compatible)
    3. Suggest potential confounding variables for a specific treatment-outcome pair
  • The LLM's responses are returned as options like:

    "Temperature affects ice cream sales (B); ice cream sales and shark attacks have no causal relationship with each other (C)". The code then converts these options into structures usable for practical analysis.

  • When variable names are easily misunderstood by the LLM (e.g., a column named 'sex'):

    "There was a case where the LLM interpreted 'sex' as sexual activity rather than patient gender information." Emre shares this real-world incident and notes that an 'ambiguity check' feature could be added to catch such misreadings.
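The option-letter responses from the demo can be turned into graph edges with a small amount of parsing. The sketch below is illustrative rather than the library's actual code; the A/B/C convention follows the pattern described above.

```python
def edge_from_choice(var_a, var_b, choice):
    """Map an LLM's option letter for a variable pair to a directed
    edge, following the convention in the demo: A means var_a causes
    var_b, B means var_b causes var_a, C means no relationship.
    (Illustrative sketch, not pywhy-llm's actual parsing code.)"""
    choice = choice.strip().upper()
    if choice == "A":
        return (var_a, var_b)
    if choice == "B":
        return (var_b, var_a)
    if choice == "C":
        return None
    raise ValueError(f"unexpected LLM answer: {choice!r}")


# Simulated answers for the variable pairs used in the demo.
answers = {
    ("ice cream sales", "temperature"): "B",    # temperature -> sales
    ("ice cream sales", "shark attacks"): "C",  # no causal link
}

edges = []
for (a, b), choice in answers.items():
    edge = edge_from_choice(a, b, choice)
    if edge is not None:
        edges.append(edge)

print(edges)  # ready to load into e.g. networkx.DiGraph(edges)
```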

He also noted that "currently the Guidance library is used internally, but connections to LangChain and other tools are also open."
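One way the 'ambiguity check' idea could work is to ask the LLM to restate what it thinks each column means before any causal questions are posed, then flag mismatches with the analyst's intent. This is a hedged sketch of that idea only: `llm_describe` is a hypothetical stand-in for a real LLM call, not a pywhy-llm function.

```python
def llm_describe(column_name):
    """Hypothetical stand-in for an LLM call that restates a column's
    meaning; a real implementation would prompt an actual model."""
    guesses = {"sex": "frequency of sexual activity"}
    return guesses.get(column_name, column_name)


def check_ambiguity(columns_with_intent):
    """Return columns whose LLM reading differs from the intended one."""
    flagged = {}
    for column, intended in columns_with_intent.items():
        guess = llm_describe(column)
        if guess.lower() != intended.lower():
            flagged[column] = guess
    return flagged


# 'sex' is meant as patient gender, but the stubbed LLM reads it
# differently, so it gets flagged for the analyst to rename or clarify.
print(check_ambiguity({"sex": "patient gender", "age": "age"}))
```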

During Q&A, participants asked about complex structures such as feedback loops in time-series data, and about the flexibility of pipelines preferred by different analysts. Emre explained that, since the project is still in early development, the team aims for a flexible structure and continued experimentation.


5. More Complex Use Cases and Future Directions

Discussions continued about advanced use case ideas and community participation.

  • For example:

    In response to suggestions like "What if we could query Pandas DataFrames and causal models directly, and get analysis method recommendations from an LLM agent?", Emre replied: "PyWhy-LLM's fundamental design allows each analysis step to be modified modularly and flexibly, and we intend to experiment with compatibility across various frameworks like LangChain, AutoGen, etc." He emphasized that experimental attempts from the community are also very welcome.

He also reiterated that PyWhy-LLM is still an early-stage experimental library.

"We welcome all forms of feedback -- trying it out and reporting issues, bugs, opinions on structure, or direct code contributions. We're in an experimental stage, so we've intentionally kept the structure loose."


6. Summary and Announcements

At the end of the session, future development plans and community communication channels were shared.

  • The full code will soon be updated via GitHub PRs
  • Feedback and issue reporting are actively received through GitHub, Discord, and other PyWhy community channels
  • Future lecture schedules and next speakers will be announced on Discord

"New lectures will continue every two weeks, so please check the Discord channel for updates!"


Closing

This session gave a broad, approachable introduction to PyWhy-LLM and real examples of its effort to ease the longstanding dependence on human experts in causal inference. Analysts should keep in mind that this is an experimental library, but they can actively test it and participate in its development.

"PyWhy-LLM is still experimental, but it has the potential to significantly change causal inference practice in the future" -- with this message, the session concluded in a warm atmosphere emphasizing community-driven development.
