Large language models (LLMs) now show performance that matches or surpasses human levels on various causal reasoning tasks (causal reasoning being the ability to identify cause and effect), yet a growing body of research suggests they do not truly understand causal relationships. This article provides a detailed overview of LLMs' causal reasoning capabilities, their limitations, and future research directions. The core conclusion is that LLMs behave like causal "parrots" -- in reality, they merely repeat statistical correlation patterns without achieving deep causal understanding.
1. What Is Causal Reasoning?
When we observe events around us, we humans readily reason about "why did that happen?" For example, if our health improves after taking medicine, we conclude "it must be the medicine," and when we see rain clouds, we predict "it's going to rain soon." This ability to identify causes and predict their effects is exactly what causal reasoning is.
This ability is critically important in science, medicine, policy, and more:
"If you properly understand the cause, you can effectively intervene in problems and avoid wasting effort on the wrong causes."
2. Types of Causal Reasoning Tasks
There are several types of causal reasoning:
- Causal discovery: Finding actual cause-effect relationships between variables using only observational data
- Effect estimation: Quantitatively determining the magnitude of a cause's impact on an effect
- Counterfactual reasoning: Imagining what would have happened if reality had been different
- Actual cause judgment: Identifying the causes that actually influenced a specific event
"If I hadn't smoked, would I still have gotten cancer?" (An example of counterfactual reasoning)
All of these tasks require more than simple memorization or correlation-finding -- they demand the ability to think about a situation from multiple, sometimes hypothetical, perspectives.
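To make these task types concrete, here is a minimal Python sketch of counterfactual reasoning with a toy structural causal model (SCM); the variables, coefficients, and threshold are illustrative assumptions, not taken from any benchmark discussed in this article.

```python
# Toy structural causal model (SCM) for the counterfactual example above.
# The variables, coefficients, and the 0.4 threshold are illustrative assumptions.

def cancer_occurs(smokes: int, u_background: float) -> bool:
    """Structural equation: cancer risk depends on smoking and an unobserved background factor."""
    risk = 0.3 * smokes + 0.5 * u_background  # 0.3 is the causal effect an "effect estimation" task would quantify
    return risk > 0.4

# Factual world: the person smoked; abduction fixes their unobserved background factor.
u_background = 0.5
factual = cancer_occurs(smokes=1, u_background=u_background)          # True: cancer occurred

# Counterfactual world: same person (same u_background), but we intervene and set smokes = 0.
counterfactual = cancer_occurs(smokes=0, u_background=u_background)   # False: no cancer

print(f"factual: {factual}, counterfactual: {counterfactual}")
```

The same structural equations answer both the factual question ("did cancer occur?") and the counterfactual one ("would it have occurred without smoking?"), which is exactly what makes counterfactual reasoning harder than pattern matching.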
3. How Well Do LLMs Perform at Causal Reasoning?
Researchers experimented with how well state-of-the-art LLMs like GPT-3 and GPT-4 perform on various causal reasoning benchmarks. The results were surprisingly impressive!
- Pairwise causal discovery:
In tasks matching cause-effect relationships between pairs of variables across over 100 real-world cases, LLM accuracy reached 97%.
"Spanning diverse fields including physics, biology, and epidemiology, it far surpassed the previous best algorithm (83%)."
- Full causal graph recovery: When drawing overall causal relationship networks with multiple variables, GPT-4 produced structures just as accurate as the latest deep learning methods.
- Counterfactual reasoning: GPT-4 selected the correct answer with 92% accuracy for "what if that hadn't happened?" scenarios.
- Necessary/sufficient cause identification: It correctly identified "a cause without which the event could not have occurred (necessary cause)" and "a cause that is enough on its own to produce the event (sufficient cause)" with 86% accuracy.
- Normality assessment: Even on more difficult tasks judging whether situations were normal (defaults, norm violations, etc.), it recorded accuracy around 70%.
"LLMs achieved high scores by mobilizing common sense and background knowledge just by reading the problem description (prompt), without any data analysis."
4. Limitations and Problems of LLM Causal Reasoning
However, the picture is not perfect. Both GPT-3 and GPT-4, the representative LLMs tested, showed clear weaknesses in specific areas.
- Unexpected failures:
- Context misunderstanding: In situations uncommon in training data, they sometimes completely misinterpreted causal relationships and gave entirely wrong answers
- Logical errors: In some cases, they gave a logically sound answer and then immediately made mistakes on nearly identical problems
- Instability/fragility:
- Overly sensitive to how questions are phrased:
Answers vary significantly depending on how the question is worded (a simple robustness check is sketched after this list).
"Ask the same question twice and you can get a different answer each time." "Because the model relies on linguistic cues, it is governed by the sentence structure of the prompt rather than a true understanding of causal mechanisms."
- Performance variation across benchmarks: For example, GPT-4 excels at counterfactual problems and causal graph drawing, but still has many gaps in normality assessment and similar tasks.
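A minimal way to probe the fragility described above is to ask the same causal question in several paraphrases and compare the answers. In this sketch, `ask_llm` is again a hypothetical stand-in for an LLM client (here it just returns a canned answer so the snippet runs):

```python
# Minimal prompt-sensitivity check: a reliable model should give the same
# answer to every paraphrase of the same causal question.
from collections import Counter

def ask_llm(prompt: str) -> str:
    return "yes"  # hypothetical stand-in -- replace with a real LLM client call

paraphrases = [
    "Does smoking cause lung cancer? Answer yes or no.",
    "Is lung cancer caused by smoking? Answer yes or no.",
    "If someone smokes, does that increase their risk of lung cancer? Answer yes or no.",
]

answers = [ask_llm(p).strip().lower() for p in paraphrases]
print(Counter(answers))  # more than one distinct answer signals prompt sensitivity
```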
5. LLMs Don't Truly Understand Causation
When LLMs discuss "causal relationships," are they really understanding the underlying principles? The answer is that they are closer to "causal parrots"!
"LLMs appear to give causal answers, but in reality they merely copy (repeat) the massive statistical correlations found in their training data."
- Confusing correlation with causation: LLMs imitate statistically recurring "patterns" exactly as they appear in text. They don't actually know why something happened (the causal mechanism); the small simulation below shows how easily a spurious correlation can look like causation.
- Meta SCMs concept: Zecevic and colleagues used the notion of "meta structural causal models (meta SCMs)" to argue that LLMs' causal answers are recycled correlation patterns -- causal statements encountered in training text -- rather than evidence of genuine causal understanding.
"LLMs don't construct actual causal relationships; they simply repeat learned causal statements. (Like parrots)"
6. Future Directions for LLM Causal Reasoning Research
Researchers are now proposing several lines of research to address LLMs' weaknesses.
- Analyzing the true nature of causal reasoning ability: Deeper research into how LLMs apply causal information and methods for combining it with human common sense/domain knowledge
- Achieving stronger, more consistent performance: Combining with external tools, designing more diverse prompts, and increasing reliability by combining multiple LLMs
- Integration with traditional causal analysis methods: LLMs can serve as a vast domain knowledge database to automate the causal analysis preparation process itself
- Supporting explainable causal reasoning and actual causality judgments: Potential for LLMs to support expert "evidence explanation" across diverse fields including law, intelligence analysis, and machine learning
- Human-LLM collaboration: Proposals for collaborative analysis systems where, for example, LLMs provide feedback on human-generated causal graphs, or automatically suggest candidate causal relationships (a minimal pipeline along these lines is sketched after this list)
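As one concrete shape such collaboration could take, here is a minimal pipeline sketch: the LLM proposes candidate edges from domain knowledge, a human expert reviews them, and a traditional method performs the quantitative estimation. `ask_llm`, the prompt, and the variable names are illustrative assumptions (the stub returns a canned proposal so the sketch runs end to end):

```python
# Sketch of an LLM-assisted causal analysis pipeline:
# 1) the LLM proposes candidate causal edges, 2) a human expert reviews them,
# 3) the vetted graph is handed to a traditional effect-estimation method.
import networkx as nx

def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in -- replace with a real LLM client call.
    return "exercise -> blood_pressure\ndiet -> blood_pressure"

variables = ["exercise", "diet", "blood_pressure"]

# Step 1: ask the LLM for plausible edges, one "a -> b" per line.
proposal = ask_llm(f"List plausible causal edges among {variables}, one 'a -> b' per line.")
candidate_edges = [tuple(line.split(" -> ")) for line in proposal.splitlines() if " -> " in line]

# Step 2: human review -- a domain expert removes implausible edges (placeholder step).
vetted_edges = candidate_edges

# Step 3: build the graph and pass it, together with observational data, to a
# standard estimator (e.g. backdoor adjustment) from a causal inference library.
causal_graph = nx.DiGraph(vetted_edges)
print(sorted(causal_graph.edges()))
```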
7. Conclusion
Recently, LLMs have produced remarkable results on various causal reasoning tasks, surpassing both humans and existing algorithms. However, these models clearly have the limitation that they "don't truly understand causation and merely repeat patterns they've seen and learned."
Going forward, LLMs can assist human experts as "easy and flexible natural language-based causal reasoning tools," but we must be wary of the illusion that deep learning alone can grasp the complex principles of causation. The future direction is for humans, LLMs, and traditional causal reasoning methods to complement each other in building safer and more reliable causal reasoning AI.
"LLMs can lead the democratization of causal analysis, but their limitations must always be recognized and used with caution."
Final Thoughts
The causal reasoning capabilities of large language models have made tremendous progress, but deeper research and new directions are needed before achieving a true "understanding of cause and effect." When using LLMs, we must always distinguish between "what they can do well" and "their fundamental limitations" and apply them with care.
