This session features Microsoft Principal Researcher Emre Kiciman introducing the motivation behind the PyWhy-LLM library and its real-world use cases. Through demonstrations, he explains in an approachable way how LLMs can support the critical task of acquiring domain knowledge in the causal inference process. The session covers, in detail, the potential of LLMs, the challenges that remain, and the experimental architecture at this early stage of development.


1. Introduction to the 'PyWhy Causality in Practice' Series and Session Overview

The talk begins as the first session of a new series from the PyWhy community. This series is designed to broadly cover not only PyWhy libraries, but also foundational causal inference theory, practical applications, and recent interesting research trends.

"This lecture series focuses on how to practically connect causal inference and machine learning."

The speaker acknowledged that the connection between LLMs (Large Language Models) and causal inference may feel unfamiliar, but said he would explain the intersection over the course of the presentation.

The format was also explained: questions should be left in the Teams chat during the 45-minute presentation, with live Q&A to follow afterward.


2. The Persistent Problem in Causal Analysis Practice and the Potential of LLMs

Emre emphasizes that a critical component of causal inference analysis is domain knowledge. He points out that expert judgment always plays a major role in setting up causal graphs and assumptions.

"The most central challenge when doing actual causal analysis is: where do we get domain knowledge?"

He then introduces representative failures of observational studies, concretely explaining how devastating the absence of domain knowledge can be. Key examples include:

  • Night lights and myopia: A 1999 study concluded that "children who sleep with a night light on are more likely to develop myopia."

    "Subsequent studies failed to replicate this result, and it turned out to be a spurious correlation caused by myopic parents being more likely to use night lights."

  • Vitamins and cardiovascular disease: Observational findings suggested various vitamins reduced cardiovascular disease, but randomized controlled trials refuted most of these claims.
  • COVID-19 and medical AI: Early in the pandemic, attempts were made to diagnose COVID from chest X-rays, but external factors such as patient position (sitting vs. lying down) and pediatric training data were confounded with the outcome, producing useless diagnostic tools -- a recent, real-world example.

These problems all boil down to "errors that occur when domain knowledge about the data generation process is not properly reflected."


3. PyWhy-LLM: LLM as a Causal Inference Assistant

Against this background, the talk explains the origin and goals of the PyWhy-LLM library.

"PyWhy-LLM is an experimental causal inference support tool created to complement the parts that previously relied only on human experts, by leveraging LLM's vast knowledge."

Since LLMs already internalize extensive domain common sense from massive text corpora, they can play the following roles at each stage of causal analysis:

  • Modeling and Design:
    • Suggesting causal relationships (cause-effect) between variables
    • Identifying 'potential confounding variables' not yet considered
    • Providing variable lists and descriptions

      "Typically at this stage, analysts need to repeatedly interview multiple domain experts, but LLMs can serve as a quick and convenient brainstorming assistant."

  • Identification:
    • Suggesting appropriate 'instrumental variables', 'back-door', 'front-door' variables
  • Validation:
    • Critiquing causal graphs, suggesting alternatives, proposing negative controls, etc.
  • Estimation:
    • Code automation integrated with existing analysis algorithms

The LLM's suggestions are designed to integrate naturally with practical tooling such as the existing PyWhy libraries and NetworkX graphs.
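As a rough illustration of the suggester pattern these stages describe, the sketch below stubs out the LLM with a canned edge set. The class and method names are assumptions for illustration only, not the actual pywhy-llm API.

```python
class StubModelSuggester:
    """Stands in for an LLM-backed, modeling-stage suggester.

    Illustrative sketch only: the names here are not the real
    pywhy-llm API, and a canned edge set replaces the LLM call.
    """

    _known_edges = {
        ("temperature", "ice cream sales"),
        ("temperature", "shark attacks"),
    }

    def suggest_relationship(self, a, b):
        """Return the suggested causal direction between two variables."""
        if (a, b) in self._known_edges:
            return f"{a} -> {b}"
        if (b, a) in self._known_edges:
            return f"{b} -> {a}"
        return None  # no causal relationship suggested

    def suggest_confounders(self, treatment, outcome):
        """Return variables suggested as common causes of both."""
        causes_t = {a for a, b in self._known_edges if b == treatment}
        causes_o = {a for a, b in self._known_edges if b == outcome}
        return sorted(causes_t & causes_o)


suggester = StubModelSuggester()
print(suggester.suggest_relationship("ice cream sales", "temperature"))
print(suggester.suggest_confounders("ice cream sales", "shark attacks"))
```

In the real library, the canned edge set would be replaced by prompted LLM calls, and the returned edges and confounders would feed into a causal graph for the identification and estimation stages.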


4. PyWhy-LLM Code Structure and Key Demonstration Examples

Emre then provides a detailed demonstration of PyWhy-LLM's actual code structure and usage.

  • Under the library's suggesters folder, protocols are defined for each of the four stages of causal analysis.
  • For example, in the modeling stage:
    1. Request a judgment on the causal relationship between two variables ("ice cream sales" vs "temperature")
    2. Input a list of multiple variables and convert them into a causal graph (NetworkX-compatible)
    3. Suggest potential confounding variables for a specific treatment-outcome pair
  • The LLM's responses are returned as options like:

    "Temperature affects ice cream sales (B); ice cream sales and shark attacks have no causal relationship with each other (C)". The code then converts these options into structures usable for practical analysis.

  • When variable names are easily misunderstood by the LLM (e.g., a column named 'sex'):

    "There was a case where the LLM interpreted 'sex' as sexual activity rather than patient gender information." Emre shares this real-world incident and notes that an 'ambiguity check' feature could be added to catch such misreadings.
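The option-letter responses from the demo can be turned into graph edges with a small amount of parsing. The sketch below is illustrative rather than the library's actual code; the A/B/C convention follows the pattern described above.

```python
def edge_from_choice(var_a, var_b, choice):
    """Map an LLM's option letter for a variable pair to a directed
    edge, following the convention in the demo: A means var_a causes
    var_b, B means var_b causes var_a, C means no relationship.
    (Illustrative sketch, not pywhy-llm's actual parsing code.)"""
    choice = choice.strip().upper()
    if choice == "A":
        return (var_a, var_b)
    if choice == "B":
        return (var_b, var_a)
    if choice == "C":
        return None
    raise ValueError(f"unexpected LLM answer: {choice!r}")


# Simulated answers for the variable pairs used in the demo.
answers = {
    ("ice cream sales", "temperature"): "B",    # temperature -> sales
    ("ice cream sales", "shark attacks"): "C",  # no causal link
}

edges = []
for (a, b), choice in answers.items():
    edge = edge_from_choice(a, b, choice)
    if edge is not None:
        edges.append(edge)

print(edges)  # ready to load into e.g. networkx.DiGraph(edges)
```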

He also noted that "currently the Guidance library is used internally, but connections to LangChain and other tools are also open."
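One way the 'ambiguity check' idea could work is to ask the LLM to restate what it thinks each column means before any causal questions are posed, then flag mismatches with the analyst's intent. This is a hedged sketch of that idea only: `llm_describe` is a hypothetical stand-in for a real LLM call, not a pywhy-llm function.

```python
def llm_describe(column_name):
    """Hypothetical stand-in for an LLM call that restates a column's
    meaning; a real implementation would prompt an actual model."""
    guesses = {"sex": "frequency of sexual activity"}
    return guesses.get(column_name, column_name)


def check_ambiguity(columns_with_intent):
    """Return columns whose LLM reading differs from the intended one."""
    flagged = {}
    for column, intended in columns_with_intent.items():
        guess = llm_describe(column)
        if guess.lower() != intended.lower():
            flagged[column] = guess
    return flagged


# 'sex' is meant as patient gender, but the stubbed LLM reads it
# differently, so it gets flagged for the analyst to rename or clarify.
print(check_ambiguity({"sex": "patient gender", "age": "age"}))
```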

During Q&A, participants asked about complex structures such as feedback loops in time-series data, and about the flexibility of pipelines preferred by different analysts. Emre explained that, since the project is still in early development, the team aims for a flexible structure and continued experimentation.


5. More Complex Use Cases and Future Directions

Discussions continued about advanced use case ideas and community participation.

  • For example:

    In response to suggestions like "What if we could query Pandas DataFrames and causal models directly, and get analysis method recommendations from an LLM agent?", Emre replied: "PyWhy-LLM's fundamental design allows each analysis step to be modified modularly and flexibly, and we intend to experiment with compatibility across various frameworks like LangChain, AutoGen, etc." He emphasized that experimental attempts from the community are also very welcome.

He also reiterated that PyWhy-LLM is still an early-stage experimental library.

"We welcome all forms of feedback -- trying it out and reporting issues, bugs, opinions on structure, or direct code contributions. We're in an experimental stage, so we've intentionally kept the structure loose."


6. Summary and Announcements

At the end of the session, future development plans and community communication channels were shared.

  • The full code will soon be updated via GitHub PRs
  • Feedback and issue reporting are actively received through GitHub, Discord, and other PyWhy community channels
  • Future lecture schedules and next speakers will be announced on Discord

"New lectures will continue every two weeks, so please check the Discord channel for updates!"


Closing

This session gave a broad, approachable introduction to PyWhy-LLM and real examples of its effort to ease the longstanding dependence on human experts in causal inference. Analysts should keep in mind that this is an experimental library, but they can actively test it and participate in its development.

"PyWhy-LLM is still experimental, but it has the potential to significantly change causal inference practice in the future" -- with this message, the session concluded in a warm atmosphere emphasizing community-driven development.
