As of 2025, we trace the evolution of large language model (LLM) architectures chronologically from GPT-2 to DeepSeek-V3, Llama 4, and Kimi K2. While these models may look similar on the surface, many structural changes aimed at improving efficiency and performance have accumulated along the way. This article provides a friendly overview of each model's key architectural changes and why they matter.


1. Seven Years of LLM Architecture: Same Foundation, Revolutionary Details

From GPT-2 (2019) to DeepSeek-V3 and Llama 4 (2024-2025), the basic structure hasn't changed dramatically, but incremental improvements in efficiency and performance have continued. For example, positional embeddings shifted from absolute embeddings to RoPE (Rotary Position Embedding), Multi-Head Attention (MHA) evolved into Grouped-Query Attention (GQA), and activation functions moved from GELU to SwiGLU.
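To make the GELU-to-SwiGLU change concrete, here is a dependency-free, scalar-level sketch. The weights `w_gate`, `w_up`, and `w_down` stand in for the three weight matrices of a real SwiGLU feed-forward layer; all values are hypothetical and chosen only for illustration.

```python
import math

def silu(x):
    """SiLU / Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu_unit(x, w_gate, w_up, w_down):
    """Scalar sketch of a SwiGLU feed-forward unit:
    down( silu(gate(x)) * up(x) ).
    Real models use weight matrices; scalars keep the sketch minimal."""
    return w_down * (silu(w_gate * x) * (w_up * x))
```

The gating product `silu(gate(x)) * up(x)` is what distinguishes SwiGLU from a plain GELU feed-forward block, which applies a single activation between two linear layers.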

"On the surface, they still look similar, but have there really been revolutionary changes, or are we just refining the same foundation?"

In reality, datasets, training techniques, hyperparameters, and many other factors affect performance, so architecture alone doesn't determine results. Nevertheless, examining what structural experiments LLM developers are conducting in 2025 remains highly informative.

Representative architectures


2. DeepSeek-V3/R1: Achieving Both Efficiency and Performance with MLA and MoE

DeepSeek R1, released in January 2025, is a reasoning-specialized model based on the DeepSeek V3 (announced December 2024) architecture. DeepSeek V3's two key structural features are:

  • Multi-Head Latent Attention (MLA)
  • Mixture-of-Experts (MoE)

2.1. MLA: A New Attention Mechanism That Maximizes Memory Efficiency

Traditional MHA computes separate keys and values for each attention head, while GQA reduces memory usage by sharing one key/value pair across a group of query heads.

MHA vs GQA

"The key idea behind GQA is that multiple query heads share keys/values, reducing parameter count and memory usage."
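The memory effect of that sharing is easy to quantify with a back-of-the-envelope helper. The layer and head counts below are hypothetical, not taken from any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.
    The leading factor of 2 covers K and V; bytes_per_elem=2 assumes fp16."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical config: 32 layers, head_dim 128, 8K context, fp16.
mha = kv_cache_bytes(32, 32, 128, 8192)  # MHA: one K/V pair per query head (32)
gqa = kv_cache_bytes(32, 8, 128, 8192)   # GQA: 8 KV heads shared by 32 query heads
print(mha // gqa)  # → 4: the cache shrinks by exactly the grouping factor
```

Only the number of KV heads changes between the two calls, so the savings ratio is simply `n_query_heads / n_kv_heads`.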

MLA goes a step further: it compresses the key/value tensors into a lower-dimensional latent space before storing them in the KV cache, then projects them back to their original size during inference.
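A minimal sketch of that compress-then-restore idea, using plain Python and deliberately tiny, hypothetical dimensions. Real models compress from thousands of dimensions, and MLA also treats queries and RoPE specially, which this sketch omits.

```python
def matmul(A, B):
    """Naive matrix multiply, just to keep the sketch dependency-free."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical dimensions: model dim 8, latent dim 2.
D_MODEL, D_LATENT = 8, 2
W_down = [[0.1] * D_LATENT for _ in range(D_MODEL)]  # compress: d_model -> d_latent
W_up   = [[0.1] * D_MODEL for _ in range(D_LATENT)]  # restore:  d_latent -> d_model

kv = [[1.0] * D_MODEL]           # one token's key/value row
latent = matmul(kv, W_down)      # this is what MLA stores in the KV cache
restored = matmul(latent, W_up)  # expanded back to full size at inference time

# The cache holds d_latent numbers per token instead of d_model.
assert len(latent[0]) == D_LATENT and len(restored[0]) == D_MODEL
```

The cache-size saving comes entirely from storing `latent` rather than `kv`; the up-projection cost is paid at inference time in exchange.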

MLA structure

Experimental results show that MLA delivers better performance than GQA while also being more memory-efficient.

MLA vs GQA performance

2.2. MoE: Massive Parameters, Efficient Inference

MoE replaces the FeedForward module in each transformer block with multiple "Experts." However, not all experts are used every time -- a router selects only a subset to activate.

MoE structure

For example, DeepSeek-V3 activates only 9 experts per token: 1 shared expert that is always on, plus 8 of the 256 routed experts chosen by the router. The model therefore has 671B (671 billion) total parameters, but only about 37B (37 billion) are active for any given token during inference, making it highly efficient.

"The shared expert is always activated for every token, preventing multiple experts from redundantly learning repeated patterns."
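A toy version of this routing step might look like the following. The gating here is a plain top-k softmax, which is a simplification of DeepSeek-V3's actual gating function, and the expert counts match the 1 shared + 8 routed configuration described above.

```python
import math, random

def route(logits, top_k):
    """Pick the top_k routed experts and softmax-normalize their weights."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    exp = [math.exp(logits[i]) for i in idx]
    total = sum(exp)
    return {i: e / total for i, e in zip(idx, exp)}

random.seed(0)
N_ROUTED, TOP_K = 256, 8                         # routed experts and per-token picks
logits = [random.gauss(0, 1) for _ in range(N_ROUTED)]
chosen = route(logits, TOP_K)

# 8 routed experts plus the always-on shared expert = 9 active experts per token.
active = len(chosen) + 1
assert active == 9
```

Each token gets its own `logits` and therefore its own expert subset; only the shared expert sees every token.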

Shared experts


3. OLMo 2: Design Focused on Transparency and Stability

OLMo 2, developed by AI2 (Allen Institute for AI), features transparently published training data, code, and technical reports. While not top-tier in performance, its clean architectural design and transparency are major strengths.

OLMo 2 performance/cost

3.1. RMSNorm and QK-Norm: Normalization Strategies for Stable Training

OLMo 2 uses RMSNorm but places the normalization layers differently from conventional models. Traditional GPT-style models use Pre-Norm (normalize, then apply attention/FFN), whereas OLMo 2 adopts a Post-Norm variant (apply attention/FFN, then normalize), keeping the normalization inside the residual connection.

Normalization position comparison

"The reason for changing the normalization layer position is training stability."

In practice, the Post-Norm approach improves training stability.

Training stability comparison

Additionally, QK-Norm (query/key normalization) was introduced to apply RMSNorm within the attention mechanism for additional stability.
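The two orderings, plus RMSNorm itself, can be sketched in a few lines. The `sublayer` argument stands in for attention or the FFN, and learnable scale parameters are omitted; QK-Norm would apply the same `rmsnorm` to queries and keys inside the attention computation.

```python
import math

def rmsnorm(x, eps=1e-6):
    """RMSNorm: scale by the root-mean-square (no mean subtraction, unlike LayerNorm)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def pre_norm_block(x, sublayer):
    """GPT-style Pre-Norm: normalize, apply attention/FFN, then add the residual."""
    return [a + b for a, b in zip(x, sublayer(rmsnorm(x)))]

def post_norm_block(x, sublayer):
    """OLMo 2-style Post-Norm: apply attention/FFN, normalize the output,
    then add the residual (normalization stays inside the residual path)."""
    return [a + b for a, b in zip(x, rmsnorm(sublayer(x)))]
```

The residual addition is identical in both; only where the normalization sits relative to the sublayer changes.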

OLMo 2 vs Llama 3 structure


4. Gemma 3 & 3n: The Evolution of Sliding Window Attention and Efficiency

Gemma 3 (Google) introduces sliding window attention to significantly reduce KV cache memory usage.

KV cache reduction

4.1. Sliding Window Attention: From Global to Local

Traditional attention lets every token attend to every other token (global), while sliding window attention only attends within a fixed range (window), making it local.

Global vs sliding window

Gemma 3 interleaves sliding-window (local) and global attention layers at a 5:1 ratio, and also shrinks the window size from Gemma 2's 4096 tokens to 1024. Experiments show a large reduction in KV-cache memory with little to no loss in modeling performance.
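The masking rule behind this is simple enough to state as a predicate. A minimal sketch, with the 1024-token window from above:

```python
def may_attend(q_idx, k_idx, window=None):
    """Causal attention mask: global if window is None, else sliding-window."""
    if k_idx > q_idx:              # causal: never look ahead
        return False
    if window is None:             # global attention: any past token is visible
        return True
    return q_idx - k_idx < window  # local: only the last `window` tokens

# Token 2000 with a 1024-token window sees positions 977..2000 only.
assert may_attend(2000, 1999, window=1024)
assert not may_attend(2000, 500, window=1024)
assert may_attend(2000, 500)       # a global layer still sees position 500
```

The KV-cache saving follows directly: a sliding-window layer only ever needs the last `window` keys and values, regardless of sequence length.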

Minimal performance impact

4.2. Unique Normalization Layer Placement

Gemma 3 places RMSNorm both before and after attention/FFN (Pre-Norm + Post-Norm), pursuing both stability and efficiency.

OLMo2 vs Gemma3 structure

4.3. Gemma 3n: Mobile Optimization and PLE

Gemma 3n uses a Per-Layer Embedding (PLE) technique that keeps only some parameters on the GPU while loading the rest from CPU/SSD on demand, enabling efficient operation even on mobile devices.

PLE memory savings


5. Mistral Small 3.1: Optimized for Fast Inference

Mistral Small 3.1 24B is faster than Gemma 3 27B while delivering better performance on multiple benchmarks. The key lies in a custom tokenizer, a smaller KV cache, and fewer layers, all of which minimize inference latency.

Gemma 3 vs Mistral 3.1 structure


6. Llama 4: MoE Goes Mainstream and Comparison with DeepSeek-V3

Llama 4 also adopts an MoE architecture, showing a very similar structure to DeepSeek-V3.

DeepSeek V3 vs Llama 4 structure

The main differences are that Llama 4 uses GQA while DeepSeek-V3 uses MLA, and that the expert configurations differ: DeepSeek-V3 activates 9 experts per token, while Llama 4 activates only 2.

"In 2025, MoE has become the dominant architecture for large LLMs."


7. Qwen3: Flexible Architecture from Ultra-Small to Ultra-Large

Qwen3 is released in a wide range of sizes -- 0.6B to 32B (dense), 30B-A3B/235B-A22B (MoE) -- offering options from small models suitable for local execution/training to MoE models optimized for large-scale inference.

Qwen3 0.6B vs Llama 3 1B

The MoE variants activate only a fraction of their total parameters per token (e.g., 22B of the 235B model's parameters), pursuing both the learning capacity of large models and inference efficiency simultaneously.

DeepSeek V3 vs Qwen3 235B-A22B

Interestingly, Qwen3 doesn't use shared experts. Developer Junyang Lin responded:

"At the time, shared experts didn't provide significant improvement, and we had concerns about inference optimization. Honestly, there's no clear answer."


8. SmolLM3: Eliminating Positional Embeddings with NoPE

SmolLM3 (3B) is a small model competing with Qwen3 1.7B/4B, Llama 3.2 3B, and Gemma 3 4B.

SmolLM3 performance

The most interesting aspect is NoPE (No Positional Embeddings) -- a structure that doesn't use positional embeddings at all.

Qwen3 4B vs SmolLM3 3B

Previously, positional embeddings (e.g., RoPE) were considered essential, but the NoPE paper showed that length-generalization performance can actually be better without explicit position information; in a causal decoder, the attention mask alone still conveys token order implicitly.
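For contrast, here is a minimal sketch of the per-pair rotation that RoPE applies and that NoPE simply omits. The base `theta=10000` is the commonly used default; the dimensions are illustrative.

```python
import math

def rope_rotate(pair, pos, theta=10000.0, dim_idx=0, d=64):
    """Rotate one (even, odd) feature pair by a position-dependent angle,
    as RoPE does. NoPE skips this step entirely: queries and keys are used
    as-is, and token order is conveyed only implicitly by the causal mask."""
    x, y = pair
    angle = pos * theta ** (-2.0 * dim_idx / d)
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

# Position 0 leaves vectors unchanged; later positions rotate them.
assert rope_rotate((1.0, 0.0), pos=0) == (1.0, 0.0)
```

Since rotation preserves vector length, RoPE injects position purely through the angle between query and key pairs, which is exactly the signal NoPE does without.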

NoPE paper results


9. Kimi K2: 1 Trillion Parameters, an Extended Version of DeepSeek-V3

Kimi K2 is an ultra-large open-weight model with 1 trillion (1T) parameters, using the DeepSeek-V3 architecture as a base but with more experts and fewer MLA heads.

DeepSeek V3 vs Kimi K2

It is also the first model at this scale to be trained with the Muon optimizer, yielding notably fast and stable decreases in the training loss curve.

Kimi K2 vs OLMo 2 loss curves


In Closing

As of 2025, LLM architectures are maintaining the same fundamental skeleton while actively experimenting with diverse approaches for efficiency and performance. MLA, GQA, MoE, sliding window attention, NoPE, and normalization layer placement -- each model pursues memory efficiency, inference speed, training stability, and scalability in its own way.

"After all this time, LLM development is still fascinating. I look forward to seeing what innovations come next!"

