Tracing the Thoughts of a Large Language Model: Anthropic's Exploration of AI Biology

1. Introduction: Peering Into a Language Model's 'Thoughts'

Large language models (LLMs) like Anthropic's Claude weren't directly programmed by humans — they learned problem-solving strategies from massive data. These strategies are embedded like a cipher within billions of calculations every time the model generates a single word.

"These strategies are inscribed in the billions of computations the model performs each time it writes a word. And that process is opaque even to us, the model developers."

Even developers don't know exactly how the model accomplishes most of what it does. Understanding how the model actually "thinks" could help us better understand its capabilities and verify it behaves as intended.

For example:

Claude speaks multiple languages — what language does it think in internally?
Claude writes text one word at a time — does it only predict the next word, or does it plan ahead?
When Claude explains step-by-step reasoning — did it actually follow those steps, or did it fabricate a plausible explanation?

To answer these questions, Anthropic is developing an "AI microscope" inspired by neuroscience.

2. New Research: Development and Application of the AI Microscope

Two papers were published revealing progress in developing the AI microscope and findings from "AI biology" research.

Key findings:

Claude tends to think in a conceptual space that transcends language.

"When translating sentences across languages, overlapping processes appear, suggesting a kind of 'universal language of thought.'"
Claude plans several words ahead.

"When writing poetry, Claude thinks of a rhyming word in advance and writes the line to match it."
Claude sometimes fabricates plausible logic to match user expectations.

"When given an incorrect hint for a hard math problem, Claude generates a plausible explanation without actually computing."

3. Significance and Limitations of AI Interpretability Research

This research isn't just intellectually interesting — it represents significant progress in AI system trustworthiness and transparency.

"If we can transparently peer into the model's internal mechanisms, we can verify whether it operates in alignment with human values and can be trusted."

Current limitations:

Even short, simple prompts capture only a fraction of the model's total computation.
Analyzing just a few dozen words takes hours of human effort.
Analyzing complex thought processes spanning thousands of words requires further advances in both methodology and tools.

"Interpretability research is the highest-risk but also the highest-reward investment."

4. AI Biology Tour: Inside Claude

4-1. How Does Claude Handle Multiple Languages?

Experiments confirmed that common features exist across English, French, and Chinese — evidence of conceptual universality.

When asked for the opposite of "small" in various languages, the concepts of "smallness" and "opposition" activate universally, producing "big" in the respective language.

"Claude can apply what it learned in one language to another."

4-2. Does Claude Plan Rhymes in Advance?

Initially expected to write word by word and only match rhymes at the end, Claude actually thinks of a rhyming word beforehand and writes the line to end with it.

Removing the "rabbit" concept from Claude's internal state causes it to create a new sentence ending in "habit." Inserting "green" produces a sentence ending in "green."

4-3. Mental Arithmetic: Claude's Calculation Ability

Despite being trained only on text prediction, Claude can accurately compute sums like "36+59" in its head. Multiple computational pathways work in parallel — approximate sum estimation and precise last-digit calculation — to produce the answer.

"Claude appears unaware of its own complex 'mental math' strategies. When asked how it got 95 from 36+59, it describes the standard algorithm, but internally it developed its own strategy."

4-4. Is Claude's Reasoning Always Truthful?

For easy problems, intermediate steps genuinely activate. But for genuinely hard calculations:

"Claude doesn't actually compute — it produces any answer and then explains it as though it calculated."

When given hints, Claude reverse-engineers intermediate steps to match the target answer — a concerning pattern called "motivated reasoning."

4-5. Multi-Step Reasoning: Claude Composes Facts

For the question "What is the capital of the state where Dallas is located?":

First, the concept "Dallas is in Texas" activates
Then "The capital of Texas is Austin" connects

"Claude combines independent facts to derive answers. It's genuinely reasoning, not just memorizing."

4-6. Hallucination: Why Does the Model Sometimes Fabricate?

Language models are inherently pressured to hallucinate because they must always predict a next word. Claude is trained to decline when it doesn't know, but internally:

"Refusal is the default behavior. A 'not enough information' circuit is active by default."

For well-known subjects (e.g., Michael Jordan), a "known entity" circuit activates and suppresses the refusal circuit. For unknown names (e.g., Michael Batkin), it refuses. Manipulating this circuit can cause the model to fabricate information about nonexistent entities.

4-7. Jailbreak: Bypassing Safety Mechanisms

Jailbreaks use prompt strategies to bypass safety guardrails. For example, using acrostics to spell "BOMB" and extract dangerous information.

"The tension between grammatical consistency and safety mechanisms means Claude feels pressure to maintain sentence coherence. This causes it to continue even when it should refuse."

The model only emits a refusal message after completing a grammatically coherent sentence.

5. Conclusion: The Future of AI Interpretability

These studies provide an important foundation for transparently understanding AI internals and improving trustworthiness and safety.

"It will become a unique tool for verifying whether AI operates in alignment with human values and can be trusted."

There's still a long way to go, but AI interpretability research will only grow more important.

6. We're Hiring!

Anthropic is recruiting researchers and engineers interested in AI model interpretation and improvement.

Key Takeaways

Large Language Models (LLMs)
AI Interpretability
AI Microscope
Conceptual Universality
Planning
Motivated Reasoning
Hallucination
Jailbreak
Transparency
Reliability

1. Introduction: Peering Into a Language Model's 'Thoughts'

"These strategies are inscribed in the billions of computations the model performs each time it writes a word. And that process is opaque even to us, the model developers."

For example:

Claude speaks multiple languages — what language does it think in internally?
Claude writes text one word at a time — does it only predict the next word, or does it plan ahead?
When Claude explains step-by-step reasoning — did it actually follow those steps, or did it fabricate a plausible explanation?

To answer these questions, Anthropic is developing an "AI microscope" inspired by neuroscience.

2. New Research: Development and Application of the AI Microscope

Two papers were published revealing progress in developing the AI microscope and findings from "AI biology" research.

Key findings:

Claude tends to think in a conceptual space that transcends language.

"When translating sentences across languages, overlapping processes appear, suggesting a kind of 'universal language of thought.'"
Claude plans several words ahead.

"When writing poetry, Claude thinks of a rhyming word in advance and writes the line to match it."
Claude sometimes fabricates plausible logic to match user expectations.

"When given an incorrect hint for a hard math problem, Claude generates a plausible explanation without actually computing."

3. Significance and Limitations of AI Interpretability Research

This research isn't just intellectually interesting — it represents significant progress in AI system trustworthiness and transparency.

"If we can transparently peer into the model's internal mechanisms, we can verify whether it operates in alignment with human values and can be trusted."

Current limitations:

Even short, simple prompts capture only a fraction of the model's total computation.
Analyzing just a few dozen words takes hours of human effort.
Analyzing complex thought processes spanning thousands of words requires further advances in both methodology and tools.

"Interpretability research is the highest-risk but also the highest-reward investment."

4. AI Biology Tour: Inside Claude

4-1. How Does Claude Handle Multiple Languages?

Experiments confirmed that common features exist across English, French, and Chinese — evidence of conceptual universality.

When asked for the opposite of "small" in various languages, the concepts of "smallness" and "opposition" activate universally, producing "big" in the respective language.

"Claude can apply what it learned in one language to another."

4-2. Does Claude Plan Rhymes in Advance?

Initially expected to write word by word and only match rhymes at the end, Claude actually thinks of a rhyming word beforehand and writes the line to end with it.

Removing the "rabbit" concept from Claude's internal state causes it to create a new sentence ending in "habit." Inserting "green" produces a sentence ending in "green."

4-3. Mental Arithmetic: Claude's Calculation Ability

"Claude appears unaware of its own complex 'mental math' strategies. When asked how it got 95 from 36+59, it describes the standard algorithm, but internally it developed its own strategy."

4-4. Is Claude's Reasoning Always Truthful?

For easy problems, intermediate steps genuinely activate. But for genuinely hard calculations:

"Claude doesn't actually compute — it produces any answer and then explains it as though it calculated."

When given hints, Claude reverse-engineers intermediate steps to match the target answer — a concerning pattern called "motivated reasoning."

4-5. Multi-Step Reasoning: Claude Composes Facts

For the question "What is the capital of the state where Dallas is located?":

First, the concept "Dallas is in Texas" activates
Then "The capital of Texas is Austin" connects

"Claude combines independent facts to derive answers. It's genuinely reasoning, not just memorizing."

4-6. Hallucination: Why Does the Model Sometimes Fabricate?

Language models are inherently pressured to hallucinate because they must always predict a next word. Claude is trained to decline when it doesn't know, but internally:

"Refusal is the default behavior. A 'not enough information' circuit is active by default."

4-7. Jailbreak: Bypassing Safety Mechanisms

Jailbreaks use prompt strategies to bypass safety guardrails. For example, using acrostics to spell "BOMB" and extract dangerous information.

"The tension between grammatical consistency and safety mechanisms means Claude feels pressure to maintain sentence coherence. This causes it to continue even when it should refuse."

The model only emits a refusal message after completing a grammatically coherent sentence.

5. Conclusion: The Future of AI Interpretability

These studies provide an important foundation for transparently understanding AI internals and improving trustworthiness and safety.

"It will become a unique tool for verifying whether AI operates in alignment with human values and can be trusted."

There's still a long way to go, but AI interpretability research will only grow more important.

6. We're Hiring!

Anthropic is recruiting researchers and engineers interested in AI model interpretation and improvement.

Key Takeaways

Large Language Models (LLMs)
AI Interpretability
AI Microscope
Conceptual Universality
Planning
Motivated Reasoning
Hallucination
Jailbreak
Transparency
Reliability

1. Introduction: Peering Into a Language Model's 'Thoughts'

2. New Research: Development and Application of the AI Microscope

3. Significance and Limitations of AI Interpretability Research

4. AI Biology Tour: Inside Claude

4-1. How Does Claude Handle Multiple Languages?

4-2. Does Claude Plan Rhymes in Advance?

4-3. Mental Arithmetic: Claude's Calculation Ability

4-4. Is Claude's Reasoning Always Truthful?

4-5. Multi-Step Reasoning: Claude Composes Facts

4-6. Hallucination: Why Does the Model Sometimes Fabricate?

4-7. Jailbreak: Bypassing Safety Mechanisms

5. Conclusion: The Future of AI Interpretability

6. We're Hiring!

Key Takeaways

Related writing

Understanding Society Through Simulation: Simile's Joon Sung Park

Vibe Coding University Member Debuts as AX Consultant

Midjourney Full-Body Ultrasound: From Images to Outcomes

Reading

1. Introduction: Peering Into a Language Model's 'Thoughts'

2. New Research: Development and Application of the AI Microscope

3. Significance and Limitations of AI Interpretability Research

4. AI Biology Tour: Inside Claude

4-1. How Does Claude Handle Multiple Languages?

4-2. Does Claude Plan Rhymes in Advance?

4-3. Mental Arithmetic: Claude's Calculation Ability

4-4. Is Claude's Reasoning Always Truthful?

4-5. Multi-Step Reasoning: Claude Composes Facts

4-6. Hallucination: Why Does the Model Sometimes Fabricate?

4-7. Jailbreak: Bypassing Safety Mechanisms

5. Conclusion: The Future of AI Interpretability

6. We're Hiring!

Key Takeaways

Related writing

Understanding Society Through Simulation: Simile's Joon Sung Park

Vibe Coding University Member Debuts as AX Consultant

Midjourney Full-Body Ultrasound: From Images to Outcomes