Brief Summary: This paper systematically analyzes how well large language models (LLMs) perform on real-world causal reasoning tasks. LLMs have shown results that rival or significantly outperform state-of-the-art methods in causal graph generation, counterfactual reasoning, and determining the causes of specific events, in some cases approaching the level of human experts. However, the limitations and errors of these models, along with important caveats for using LLMs, also emerge as critical discussion points.


1. Introduction and Research Background

Causality plays a central role in socially important fields such as medicine, science, law, and policy. As large language models (LLMs) have recently demonstrated remarkable capabilities, vigorous debate has begun over the question: "Can LLMs truly perform causal reasoning?"

"Is the causal ability demonstrated by language models merely the result of data memorization, or is it the product of rational causal reasoning?"

On real causal tasks, both humans and LLMs alternate between logic-based and statistics-based reasoning.

LLMs appeared capable of performing various tasks including logic-based/statistics-based causal reasoning, causal graph generation, counterfactual reasoning, and identifying the causes of specific events, yet they sometimes made nonsensical errors. This raised ongoing debate about whether LLMs are truly performing causal reasoning or merely generating plausible sentences.

The paper organizes the major categories of causal reasoning (statistics-based vs. logic-based, general causality vs. specific-event causality, task-based classification) and analyzes the capabilities and limitations LLMs demonstrate in each area with numerical evidence and examples.


2. Theoretical Background on Causality and LLMs

Various Causal Approaches

  • Statistical (covariance) causality: Statistical relationships in data (e.g., "Is smoking a cause of lung cancer?")
  • Logical causality: Reasoning through logic/domain knowledge (e.g., legal liability determinations)
  • Type causality: Causal effects between general variables (e.g., Is a given drug effective on average?)
  • Token causality: The cause of a specific event (e.g., "What caused Mr. A's accident?")

Methods for Evaluating LLM Causal Capabilities

  • Benchmark Q&A: Measuring accuracy rates on standard problems
  • Memory (memorization) tests: Verifying whether the model has memorized the problems themselves
  • Redaction tests: Observing performance changes when key words are removed from sentences
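The redaction test above can be sketched as a small string transform: replace the domain-specific variable names with neutral placeholders, re-query the model, and compare accuracy before and after. The `redact` helper below is a hypothetical illustration of the idea, not code from the paper.

```python
import re

def redact(question: str, terms: list[str]) -> str:
    """Replace each domain term with a neutral placeholder (A, B, ...),
    so the model cannot lean on knowledge tied to the named variables."""
    for i, term in enumerate(terms):
        placeholder = chr(ord("A") + i)
        question = re.sub(re.escape(term), placeholder, question,
                          flags=re.IGNORECASE)
    return question

# A large accuracy drop on the redacted version suggests the model was
# relying on the specific terms rather than on the question's structure.
original = "Does smoking cause lung cancer?"
print(redact(original, ["smoking", "lung cancer"]))  # Does A cause B?
```

Comparing accuracy on the original versus redacted phrasings separates memorized term-level associations from reasoning over the sentence itself.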

3. Causal Graph Generation and Verification with LLMs

3.1 Pairwise Causal Reasoning

Tübingen Dataset

Among standard tasks asking about cause-effect relationships in 108 different variable pairs:

  • Best accuracy of existing state-of-the-art algorithms: 83%
  • GPT-4 based LLM (with appropriate prompting): 97% -- a phenomenal performance

"This is the abalone pair. Considering biological knowledge, the length of the abalone is more likely to vary with changes in age. The answer is <Answer>A</Answer>."

Results on New Data (Published After 2021)

GPT-4 generalized to new pairs not seen during training, achieving a high accuracy of 98.5%.

Medical Domain Application

Medical data on neuropathic pain similarly showed GPT-4 achieving approximately 96% accuracy.

"DLS T5-T6 causes Left T6 Radiculopathy. Degenerative changes in the spine can cause irritation or compression of the nerve root."


3.2 Full Causal Graph Generation Experiments

Applied to complex real-world data including neuropathic pain and Arctic sea ice science:

  • GPT-3.5-turbo: With appropriate prompting, achieved an F1 score of 0.68, outperforming existing algorithms
  • GPT-4: Superior data comprehension; produced the graph closest to the ground truth, with the lowest Hamming distance
  • New Alzheimer's data (released in 2023, not included in training) also maintained excellent performance
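The F1 score and Hamming distance used to compare generated graphs against ground truth can be computed directly over directed edge sets. The sketch below follows the common convention of counting edges present in exactly one of the two graphs as the Hamming distance; the example edges are illustrative, not from the paper's data.

```python
def edge_metrics(pred: set[tuple[str, str]],
                 true: set[tuple[str, str]]) -> tuple[float, int]:
    """F1 over directed edges, plus the Hamming distance
    (number of edges present in exactly one of the two graphs)."""
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    hamming = len(pred ^ true)  # symmetric difference of edge sets
    return f1, hamming

truth = {("disc", "nerve_root"), ("nerve_root", "radiculopathy")}
guess = {("nerve_root", "radiculopathy"), ("radiculopathy", "disc")}
f1, hd = edge_metrics(guess, truth)
print(round(f1, 2), hd)  # 0.5 2
```

Note that a single reversed edge costs two Hamming points (one missing true edge, one spurious predicted edge), which is why orientation errors are penalized heavily under this metric.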

Memory/Attention Experiment Conclusions

  • LLMs tend to partially memorize well-known datasets like Tübingen, but they demonstrate high generalization performance that cannot be explained by simple memorization alone.
  • Fine-tuning of prompts has a significant impact on results.

4. LLM Token Causality and Causal Judgment Capabilities

4.1 Counterfactual Reasoning Ability

  • CRASS (Counterfactual Reasoning Benchmark): GPT-4 scored 92.44%, nearly matching human performance (98.18%) and far surpassing previous LLMs (58%)

"If he had been very nervous, he might have fainted, but it cannot be stated with certainty. Therefore, the man would not have fainted."

  • New counterfactual data (created in 2022): GPT-4, approximately 88.6% accuracy

4.2 Necessary & Sufficient Cause Reasoning

  • Tested with specialized scenarios (beer bottle party, etc.) -- While GPT-3.5-turbo was ambiguous, GPT-4 was capable of highly sophisticated logical causal reasoning on most problems
  • Distinguishing between necessary (minimal change) and sufficient (multiple sufficient causes) conditions was also excellent

"If Mike had not bumped the table, the beer bottle would not have fallen." "If no other events had occurred, the beer bottle would have simply remained on the table, and Mike's action was a sufficient condition for the event."
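The necessity/sufficiency distinction in the beer-bottle scenario can be made concrete with a tiny structural model: a but-for (counterfactual) check for necessity, and an all-other-causes-absent check for sufficiency. This is a hypothetical toy model, not the paper's code.

```python
# Toy structural equation for the scenario: the bottle falls if
# Mike bumps the table OR some other guest knocks it over.
def falls(mike_bumps: bool, other_knock: bool) -> bool:
    return mike_bumps or other_knock

actual = {"mike_bumps": True, "other_knock": False}

# Necessity (but-for test): the event occurred, and it would NOT have
# occurred if Mike's action alone were counterfactually removed.
necessary = falls(**actual) and not falls(**{**actual, "mike_bumps": False})

# Sufficiency: Mike's action produces the event even with every other
# potential cause switched off.
sufficient = falls(mike_bumps=True, other_knock=False)

print(necessary, sufficient)  # True True
```

If another guest had also knocked the bottle (`other_knock=True`), Mike's bump would remain sufficient but fail the but-for test for necessity, which is exactly the distinction the scenarios probe.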

4.3 Normality Judgment

  • LLMs can substantially identify "behavioral normality" including moral and social norm violations
  • However, even GPT-4 remains around the 70% mark and does not always perfectly replicate human intuition

5. Discussion: New Horizons in Causality Research and Applications Opened by LLMs

What LLMs Are Changing

  • Even without experts: by leveraging LLMs to quickly draw on domain causal knowledge and support graph generation/verification, the barrier to entry for causal analysis has dropped dramatically
  • Natural language-based intuitive interaction is possible, accessible to both experts and non-experts
  • Token causality elements (necessity, sufficiency, normality, etc.) are being automated for extraction, assisting in diverse real-world decision-making across law, policy, and root cause analysis
  • Based on reasoning ability, not just memorization, as demonstrated through numerous experiments with new data

What Remains Unchanged and Important Caveats

  • Important conclusions still require rigorous mathematical/statistical verification!
  • LLM-based results must always be cross-verified with human experts and tools -- LLMs should not be blindly trusted
  • LLMs have limitations in generating new causal knowledge when there is no relevant training data

6. Conclusions and Future Directions

LLMs are emerging as a new technology that can automate or assist significant portions of the causal analysis process that previously required human experiential knowledge and common sense, and their applicability is expected to expand rapidly across various fields including science, medicine, and policy. However, unpredictable errors remain, and overfitting, common-sense limitations, and social risks must be carefully managed.

"LLMs lighten the major difficulties of causal reasoning that experts have long struggled with -- causal graph construction, effect estimation, cause attribution. But the principles of rigor and verification can never be compromised."


Closing Remarks

Through this research, we have confirmed that large language models can perform genuinely meaningful causal reasoning and explanation to a substantial degree, not merely through simple parroting (memorization). Going forward, integrating LLMs with existing causal tools, human-LLM collaboration, algorithms that jointly use new data and meta-information, and research on precisely managing LLM errors and biases will be the important challenges ahead.

Key Keywords:

  • Large language models (LLMs), causal graphs, counterfactual reasoning, necessary/sufficient conditions, normality, memorization vs. reasoning, human-LLM collaboration, prompt optimization, causal analysis automation, rigor & verification, real-world applications and ethical risks

Reference materials, full experiments, and code: https://github.com/py-why/pywhy-llm


References and Appendix

The paper's appendix provides abundant practical information including actual prompt writing examples, experimental data, additional coding examples, memorization test methods, real code generation cases, and positive/negative contrast cases using LLMs.


Conclusion

The causal reasoning capability of LLMs represents one of the truly transformative turning points in artificial intelligence development. We are now entering an era where humans and AI can engage in more creative and flexible causal thinking that transcends both logic and statistics, going beyond traditional causal analysis. However, it must be emphasized that cautious application and continuous verification are essential conditions for safe and responsible LLM applications in science, society, and industry.
