This paper proposes a novel mechanism where large language models (LLMs) recognize on their own the reasoning steps they frequently repeat when solving complex problems like mathematics, then abstract these steps into concise "behaviors" that can be stored and reused. This approach achieves both token efficiency and accuracy improvements over conventional Chain of Thought-based problem solving, with its effectiveness validated across three major experiments. This summary covers the key ideas, methodology, experiments, limitations, and potential for extension in an accessible, chronological format.


1. Introduction: LLM Reasoning Inefficiency and the Metacognitive Approach

Large language models have recently become adept at solving multi-step reasoning problems in math, coding, and more. However, they exhibit a structural limitation: they re-derive already-learned intermediate procedures (such as deriving the geometric series sum formula, unit conversion, or case splitting) from scratch each time, resulting in high token consumption and slow processing.

The question this paper poses is:

"When the sum formula for a finite geometric series is needed again in a similar problem, must the model derive it from scratch every time?"

Currently, most LLMs lack the ability to store or retrieve frequently recurring patterns in a short, reusable format.

The researchers therefore propose a metacognitive pathway -- that is, after solving a problem, the model reflects on its own reasoning process (metacognition) and extracts generalizable steps as 'behaviors' for storage.

These extracted behaviors (name + brief description) accumulate in a "behavior handbook" that the LLM can reference directly when solving future problems, or internalize through fine-tuning.

"By converting verbose derivations into fast, concise behaviors (procedural hints), the LLM remembers not just the right answers but also 'how to think.'"


2. Related Work and How This Paper Differs

2.1 Prior LLM Reasoning Optimization Research

There has been extensive research on making Chain of Thought (CoT) reasoning shorter and more cohesive:

  • Skeleton-of-Thought: Creates an outline in order, then expands each item in parallel
  • TokenSkip: Omits unnecessary tokens
  • Dynasor, MinD: More efficient generation path management

However, while existing approaches explicitly train the model to "write more concisely," this paper shows that efficiency naturally follows when recurring reasoning patterns are abstracted into behaviors.

2.2 LLM Metacognition and Procedural Memory

"The essence of metacognition is 'thinking about thinking.'"

"Existing memory systems (RAG, etc.) store fact-based knowledge, but this paper is innovative in that it stores the reasoning patterns the LLM repeatedly uses -- that is, 'how to think.'"

In other words, the behavior handbook focuses on procedural knowledge (how to think) rather than declarative knowledge (what is true).


3. The Behavior Extraction Process

3.1 Role Separation and Overall Architecture

The framework assigns a single LLM three distinct roles:

  • Metacognitive Strategist: Analyzes its own reasoning process to extract behaviors
  • Teacher: Generates response data using behaviors
  • Student: References handbook behaviors or internalizes them through training

[Figure: Behavior extraction pipeline diagram]

3.2 Actual Behavior Examples and the Extraction Process

A behavior is defined as a (name, description) pair.

systematic_counting: Examine the contribution of each digit one by one, systematically counting possible numbers without overlap or omission.

Extraction Steps:

  1. The LLM answers a question and generates the full reasoning process
  2. The reasoning path and final answer are fed back into the LLM to review whether the logic is sound and whether any generalizable behaviors exist (Reflection prompt)
  3. Based on (1) + (2), the LLM extracts behavior names and descriptions and adds them to the handbook

[Figure: Example prompt used for behavior extraction]

"The key point is that the behavior handbook is not built from specific data or external documents, but is refined from 'recurring methods' drawn from the model's own reasoning experience."


4. Three Ways to Use Behaviors in LLM Reasoning

4.1 Behavior-Conditioned Inference (BCI)

"During problem-solving, the student LLM (e.g., Qwen3-32B) is given relevant behaviors pre-extracted alongside the problem."

"Results: Token usage decreased by up to 46%, while accuracy stayed the same or improved."

  • Behaviors are selected from the handbook via topic matching (e.g., MATH data) or embedding-based retrieval (e.g., AIME data)
  • The selected behaviors and problem are input together to the LLM, generating a concise reasoning flow

[Figure: BCI prompt example]
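The retrieval-then-condition flow can be sketched as follows. This is an illustrative assumption of how selection might work, not the paper's code: a real system would use a dense embedding model for the AIME-style retrieval, so plain bag-of-words cosine similarity stands in here to keep the sketch self-contained.

```python
import math
from collections import Counter

def similarity(a: str, b: str) -> float:
    # Bag-of-words cosine similarity; a stand-in for dense embedding retrieval.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_bci_prompt(problem: str, handbook: dict, k: int = 2) -> str:
    # Rank handbook entries against the problem and prepend the top-k as hints.
    ranked = sorted(
        handbook,
        key=lambda name: similarity(problem, name + " " + handbook[name]),
        reverse=True,
    )
    hints = "\n".join(f"- {name}: {handbook[name]}" for name in ranked[:k])
    return (
        f"Relevant behaviors:\n{hints}\n\n"
        f"Problem: {problem}\n"
        "Solve concisely, reusing the behaviors above where they apply."
    )

handbook = {
    "systematic_counting": "Count possibilities digit by digit without overlap or omission.",
    "geometric_series_sum": "Sum a finite geometric series with S = a(1 - r^n)/(1 - r) instead of re-deriving it.",
}
prompt = build_bci_prompt(
    "Find the sum of the geometric series 3 + 6 + 12 + ... with 10 terms.",
    handbook,
    k=1,
)
```

The student model then receives only the short hint plus the problem, which is why output tokens shrink while input tokens grow only slightly.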


4.2 Behavior-Guided Self-Improvement

Instead of the conventional "critique-and-revise" approach,

"The model receives behaviors it previously extracted itself and applies them to the same problem (or new problems)."

  • Behavior instructions serve as hints, improving accuracy by up to 10 percentage points over the "critique-and-revise" baseline across a range of token budgets
  • "The behaviors themselves, not merely the extra conditioning context, played a decisive role in the performance improvement."


4.3 Behavior-Conditioned Supervised Fine-Tuning (BC-SFT)

This approach internalizes behaviors into the model's parameters to eliminate the cost of retrieving behaviors at test time.

  1. The Metacognitive Strategist extracts behaviors
  2. The Teacher generates response data using BCI with the behaviors
  3. The Student model is fine-tuned on (question, behavior-based response) pairs
  4. At test time, only the question is needed as input. The model leverages its learned behaviors "on its own," in real time

"This approach enables even non-reasoning models (e.g., Qwen2.5-14B-Base) to achieve reasoning-model-level performance and token efficiency."


5. Experimental Results: Validating Efficiency and Performance Improvements

5.1 Behavior-Conditioned Inference (BCI) Experiments

  • Datasets: MATH, AIME-24/25
  • Handbook behavior count: MATH (90-140+ per topic), AIME (1,457 from 60 problems)
  • Key results
    • Comparable or better accuracy than solving without behaviors, with up to 46% token savings
    • Performance increases alongside growing token budgets

[Figure: Behavior-conditioned inference results on MATH. Left: R1-Llama-70B; right: Qwen3-32B]

"When retrieving behaviors from the handbook, input tokens increase slightly, but output (generation) tokens are greatly reduced, making the final inference cost efficient."


5.2 Self-Improvement Experiments

  • Existing approach vs. behavior-based approach comparison
    • Existing: Self-critique and revision of own reasoning
    • Behavior-based: Uses extracted behaviors as "hints" for retry
  • Key patterns
    1. Accuracy gains: The behavior-based approach is always higher (gap widens with larger budgets)
    2. Test Time Scaling: Performance of the behavior-based approach also increases with growing token budgets
    3. Token efficiency: The behavior-based approach used more tokens in self-improvement experiments but achieved substantially higher performance

[Figure: Self-improvement experiment results on AIME-24]


5.3 Behavior-Conditioned Fine-Tuning (BC-SFT) Experiments

  • Train/Test: Various Student models tested using S1 and AIME-24/25 problems
  • Key results
    • Models with BC-SFT fine-tuning consistently achieved higher accuracy and token savings than plain SFT or untrained models
    • The performance gap was especially dramatic for non-reasoning models (Qwen2.5-14B-Base, etc.)

[Figure: AIME-24 BC-SFT performance comparison by model]
[Figure: AIME-25 BC-SFT performance comparison by model]

"It was conclusively demonstrated that the model doesn't just receive more accurate answers -- the 'intermediate reasoning skills' themselves are internalized."


6. Conclusions and Limitations

This paper demonstrates that LLMs' metacognitive capabilities can automatically extract and accumulate recurring reasoning processes, then leverage or internalize them to improve both accuracy and token efficiency.

  • All three approaches (behavior-conditioned inference, self-improvement, behavior-conditioned fine-tuning) show consistent improvements on challenging datasets like mathematics
  • Extensible to diverse domains beyond math, including programming and scientific reasoning

Limitations and Future Work

  • BCI uses a fixed behavior list per problem, unable to add new behaviors in real-time during the solving process

    "A more ideal method would be to modify the model architecture so that the model can search for and use behaviors from the handbook in real-time as needed."

  • Building large-scale behavior handbooks across diverse domains and generating large SFT datasets remain tasks for future work

Closing

By converting complex, slow chains of thought into fast, concise structures (behaviors), LLMs accumulate not just 'right answers' but also 'how to think.' This approach provides a key insight for LLMs to systematically learn and accumulate "how to reason," and could represent an important turning point toward more general-purpose and efficient AI systems.
