
1. Introduction: Peering Into a Language Model's 'Thoughts'
Large language models (LLMs) like Anthropic's Claude weren't directly programmed by humans — they learned problem-solving strategies from massive data. These strategies are embedded like a cipher within billions of calculations every time the model generates a single word.
"These strategies are inscribed in the billions of computations the model performs each time it writes a word. And that process is opaque even to us, the model developers."
Even developers don't know exactly how the model accomplishes most of what it does. Understanding how the model actually "thinks" could help us better understand its capabilities and verify it behaves as intended.
For example:
- Claude speaks multiple languages — what language does it think in internally?
- Claude writes text one word at a time — does it only predict the next word, or does it plan ahead?
- When Claude explains step-by-step reasoning — did it actually follow those steps, or did it fabricate a plausible explanation?
To answer these questions, Anthropic is developing an "AI microscope" inspired by neuroscience.
2. New Research: Development and Application of the AI Microscope
Two papers were published revealing progress in developing the AI microscope and findings from "AI biology" research.
Key findings:
- Claude tends to think in a conceptual space that transcends language.
"When translating sentences across languages, overlapping processes appear, suggesting a kind of 'universal language of thought.'"
- Claude plans several words ahead.
"When writing poetry, Claude thinks of a rhyming word in advance and writes the line to match it."
- Claude sometimes fabricates plausible logic to match user expectations.
"When given an incorrect hint for a hard math problem, Claude generates a plausible explanation without actually computing."
3. Significance and Limitations of AI Interpretability Research
This research isn't just intellectually interesting — it represents significant progress in AI system trustworthiness and transparency.
"If we can transparently peer into the model's internal mechanisms, we can verify whether it operates in alignment with human values and can be trusted."
Current limitations:
- Even short, simple prompts capture only a fraction of the model's total computation.
- Analyzing just a few dozen words takes hours of human effort.
- Analyzing complex thought processes spanning thousands of words requires further advances in both methodology and tools.
"Interpretability research is the highest-risk but also the highest-reward investment."
4. AI Biology Tour: Inside Claude
4-1. How Does Claude Handle Multiple Languages?
Experiments confirmed that common features exist across English, French, and Chinese — evidence of conceptual universality.
When asked for the opposite of "small" in various languages, the concepts of "smallness" and "opposition" activate universally, producing "big" in the respective language.
"Claude can apply what it learned in one language to another."
4-2. Does Claude Plan Rhymes in Advance?
Initially expected to write word by word and only match rhymes at the end, Claude actually thinks of a rhyming word beforehand and writes the line to end with it.
Removing the "rabbit" concept from Claude's internal state causes it to create a new sentence ending in "habit." Inserting "green" produces a sentence ending in "green."
4-3. Mental Arithmetic: Claude's Calculation Ability
Despite being trained only on text prediction, Claude can accurately compute sums like "36+59" in its head. Multiple computational pathways work in parallel — approximate sum estimation and precise last-digit calculation — to produce the answer.
"Claude appears unaware of its own complex 'mental math' strategies. When asked how it got 95 from 36+59, it describes the standard algorithm, but internally it developed its own strategy."
4-4. Is Claude's Reasoning Always Truthful?
For easy problems, intermediate steps genuinely activate. But for genuinely hard calculations:
"Claude doesn't actually compute — it produces any answer and then explains it as though it calculated."
When given hints, Claude reverse-engineers intermediate steps to match the target answer — a concerning pattern called "motivated reasoning."
4-5. Multi-Step Reasoning: Claude Composes Facts
For the question "What is the capital of the state where Dallas is located?":
- First, the concept "Dallas is in Texas" activates
- Then "The capital of Texas is Austin" connects
"Claude combines independent facts to derive answers. It's genuinely reasoning, not just memorizing."
4-6. Hallucination: Why Does the Model Sometimes Fabricate?
Language models are inherently pressured to hallucinate because they must always predict a next word. Claude is trained to decline when it doesn't know, but internally:
"Refusal is the default behavior. A 'not enough information' circuit is active by default."
For well-known subjects (e.g., Michael Jordan), a "known entity" circuit activates and suppresses the refusal circuit. For unknown names (e.g., Michael Batkin), it refuses. Manipulating this circuit can cause the model to fabricate information about nonexistent entities.
4-7. Jailbreak: Bypassing Safety Mechanisms
Jailbreaks use prompt strategies to bypass safety guardrails. For example, using acrostics to spell "BOMB" and extract dangerous information.
"The tension between grammatical consistency and safety mechanisms means Claude feels pressure to maintain sentence coherence. This causes it to continue even when it should refuse."
The model only emits a refusal message after completing a grammatically coherent sentence.
5. Conclusion: The Future of AI Interpretability
These studies provide an important foundation for transparently understanding AI internals and improving trustworthiness and safety.
"It will become a unique tool for verifying whether AI operates in alignment with human values and can be trusted."
There's still a long way to go, but AI interpretability research will only grow more important.
6. We're Hiring!
Anthropic is recruiting researchers and engineers interested in AI model interpretation and improvement.
Key Takeaways
- Large Language Models (LLMs)
- AI Interpretability
- AI Microscope
- Conceptual Universality
- Planning
- Motivated Reasoning
- Hallucination
- Jailbreak
- Transparency
- Reliability