This article explains what embeddings are in LLMs (Large Language Models), how they have evolved, how they are implemented and used in practice, and how modern embeddings are structured and visualized. It proceeds chronologically, from the historical background of embeddings through traditional methods, Word2Vec, and BERT, to real-world examples and graph analysis of the latest LLM embeddings. You'll understand why embeddings matter, what makes a good embedding, and how to work with embeddings in practice.


1. What Are Embeddings?

Embeddings are the semantic backbone of LLMs: the gateway that transforms raw text into numerical vectors that models can work with. For example, when you ask an LLM to debug code, the input tokens are converted into points in a high-dimensional vector space where semantic relationships become mathematical relationships.

Embeddings can be applied not only to text but also to images, audio, graph data, and more. However, this article focuses on text embeddings. The history of embeddings is long, and various approaches have evolved over time.

"Embeddings are the process of converting data into vectors. This article focuses on text embeddings."

There are static embeddings and dynamic (contextual) embeddings. Static embeddings assign a fixed vector to each input token, while dynamic embeddings update a token's vector as it passes through deeper layers of the model, reflecting the surrounding context. Keeping these two types distinct is important throughout this article.


2. What Makes a Good Embedding?

Just as embeddings serve as the language dictionary for LLMs, good embeddings enable models to better understand and communicate in human language. So what are the conditions for a good embedding?

Semantic Representation

Embeddings must capture semantic relationships between words well. For example, "cat" and "dog" should be closer in vector space than "dog" and "strawberry."

Dimensionality

The size of the embedding vector also matters. Too small and it cannot carry enough information; too large and there's a risk of overfitting. For example, GPT-2's smallest variant uses 768-dimensional embeddings, and larger variants go up to 1600.


3. Traditional Embedding Techniques

Early embedding methods used statistical approaches based on word frequency or co-occurrence frequency in large corpora. A representative example is TF-IDF.

TF-IDF (Term Frequency-Inverse Document Frequency)

  • TF (Term Frequency): How often a specific word appears in a document
  • IDF (Inverse Document Frequency): How rare the word is across all documents

TF-IDF calculates word importance by multiplying these two values. For example, if "cat" appears 5 times in a 100-word document and appears in only 2 out of 10 documents, then TF = 5/100 = 0.05 and IDF = ln(10/2) ≈ 1.61, so the TF-IDF score is 0.05 x 1.61 ≈ 0.08.
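The calculation above can be sketched in a few lines of Python; the 100-word document length is an assumption made so that TF comes out to 0.05:

```python
import math

# Worked example from the text, assuming a 100-word document:
# "cat" appears 5 times in the document and in 2 of 10 documents.
tf = 5 / 100                 # term frequency = 0.05
idf = math.log(10 / 2)       # inverse document frequency ~ 1.61
tfidf = tf * idf
print(round(tfidf, 2))       # → 0.08
```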

"The TF-IDF score simultaneously reflects how important a word is within a document and how rare it is across the entire corpus."

The limitation of TF-IDF embeddings is that most words cluster in similar positions, and semantic similarity is not reflected. In other words, even though "cat" and "dog" are semantically close, they have no relationship in TF-IDF vector space.


4. Word2Vec: The Beginning of Semantically Meaningful Embeddings

Word2Vec is a more advanced approach than TF-IDF: it learns embeddings from surrounding words (context). Its two main architectures are CBOW (Continuous Bag of Words) and Skip-gram.

  • CBOW: Predicts the center word from its surrounding words
  • Skip-gram: Predicts the surrounding words from the center word

Word2Vec converts input words into one-hot vectors, passes them through an embedding layer (hidden layer), and the weights of this hidden layer become the embeddings.
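The "hidden-layer weights are the embeddings" idea follows from how one-hot vectors behave under matrix multiplication. A minimal sketch with arbitrary toy sizes:

```python
import numpy as np

# Toy setup: 5-word vocabulary, 3-dimensional embeddings (arbitrary sizes).
vocab_size, dim = 5, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, dim))  # hidden-layer weights = embedding table

word_id = 2
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Multiplying a one-hot vector by W simply selects row `word_id`,
# which is why the hidden-layer weights *are* the word vectors.
print(np.allclose(one_hot @ W, W[word_id]))  # → True
```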

Word2Vec architecture diagram

"The hidden layer of Word2Vec is the embedding itself. The weights of this layer represent the word vectors."

Word2Vec embeddings capture semantic similarity well. For example, vector arithmetic like "Italy - Rome + London ≈ England" becomes possible.
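The analogy arithmetic can be demonstrated with hypothetical 2-D vectors engineered so that country = capital + a shared offset; real Word2Vec spaces only approximate this structure:

```python
import numpy as np

# Hypothetical toy vectors: each country sits at its capital plus (0, 1).
emb = {
    "rome":    np.array([1.0, 0.0]),
    "italy":   np.array([1.0, 1.0]),
    "london":  np.array([2.0, 0.0]),
    "england": np.array([2.0, 1.0]),
}

query = emb["italy"] - emb["rome"] + emb["london"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Exclude the input words, as is conventional in analogy evaluation.
candidates = [w for w in emb if w not in {"italy", "rome", "london"}]
best = max(candidates, key=lambda w: cosine(query, emb[w]))
print(best)  # → england
```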


Additionally, negative sampling enables efficient training even with large vocabularies.

Word2Vec embeddings can be visualized in 2D/3D using the TensorFlow Embedding Projector to examine semantic clusters.

Word2Vec embedding visualization


5. BERT: Dynamic Embeddings That Reflect Context

BERT is an encoder-only model based on the Transformer architecture and a landmark model that drove innovation in NLP. BERT's architecture consists of:

  1. Tokenizer: Converts text into integer sequences
  2. Embedding Layer: Converts tokens into vectors
  3. Encoder: Self-attention-based Transformer blocks
  4. Task Head: Output layer tailored to the task, such as classification or masked-token prediction

During pre-training, BERT simultaneously learns masked language modeling (predicting words that have been masked out of a sentence) and next sentence prediction (classifying whether two sentences are consecutive).

"BERT dynamically updates embeddings so that all words in the input sentence reflect each other's context."

BERT's embeddings are a representative example of contextual embeddings.

BERT architecture diagram


6. Embeddings in Modern LLMs

In LLMs, embeddings perform the first step of converting tokens into vectors and are a core component that significantly affects overall model performance.

Position of Embeddings

  • Static embeddings: Convert input tokens into vectors (token embedding + positional embedding)
  • Dynamic embeddings: Vectors change according to context as they pass through Transformer layers

For example, the word "bank" having different meanings in "river bank" and "bank robbery" is thanks to dynamic embeddings.
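The effect can be illustrated with a toy attention-style mixing step, not a real Transformer: the same static vector for "bank" comes out different depending on its neighbors.

```python
import numpy as np

# Toy illustration with random 4-dimensional "embeddings" (arbitrary sizes).
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=4) for w in ["bank", "river", "robbery"]}

def contextualize(tokens):
    X = np.stack([vocab[t] for t in tokens])   # static token embeddings
    scores = X @ X.T                            # similarity-based scores
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over context
    return weights @ X                          # context-mixed vectors

bank_river = contextualize(["river", "bank"])[1]
bank_robbery = contextualize(["bank", "robbery"])[0]
print(np.allclose(bank_river, bank_robbery))    # → False
```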

LLM embedding structure overview

Learning Embeddings

LLMs learn their embeddings internally, optimized for the model's purpose and data.

"Instead of using pre-trained embeddings like Word2Vec, LLMs directly learn and optimize their input layer embeddings."

Implementing the Embedding Layer

In PyTorch, torch.nn.Embedding is used to implement the embedding layer. This layer serves as a lookup table that takes token indices and returns the corresponding embedding vectors.
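A minimal sketch of the lookup behavior, with toy sizes rather than a real LLM's vocabulary:

```python
import torch

# Toy lookup table: 1000-token vocabulary, 16-dimensional embeddings.
vocab_size, dim = 1000, 16
emb = torch.nn.Embedding(vocab_size, dim)

token_ids = torch.tensor([5, 42, 7])
vectors = emb(token_ids)              # one row of emb.weight per token id
print(vectors.shape)                  # → torch.Size([3, 16])
```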

Embedding layer visualization


7. Embeddings in Practice: DeepSeek-R1-Distill-Qwen-1.5B

What does the embedding layer of a real LLM look like? Let's directly extract and examine the embeddings of the DeepSeek-R1-Distill-Qwen-1.5B model.

  1. Load the model and tokenizer
  2. Extract and save the embedding layer
  3. Load just the embedding layer separately and convert input tokens to vectors
  4. Find the most similar embeddings to a specific word using cosine similarity

Example sentence: "HTML coders are not considered programmers"

Token ID | Token | Embedding vector (partial, 1536 dimensions)
151646   |       | -0.027466, 0.002899, ...
5835     | HTML  | -0.018555, 0.000912, ...
20329    | #cod  | -0.026978, -0.012939, ...
...      | ...   | ...

"Embeddings are 1536-dimensional vectors, with a unique vector assigned to each token."

You can also calculate cosine similarity between embedding vectors to find the most similar words.
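Step 4 of the workflow can be sketched as a cosine-similarity search; here `E` stands in for the extracted (vocab_size x 1536) embedding table and is filled with random values:

```python
import numpy as np

# Hypothetical embedding table: 100 tokens, 1536 dimensions each.
rng = np.random.default_rng(0)
E = rng.normal(size=(100, 1536))
query = E[7]                          # embedding of some chosen token

# Cosine similarity between the query and every row of the table.
sims = (E @ query) / (np.linalg.norm(E, axis=1) * np.linalg.norm(query))

top5 = np.argsort(-sims)[:5]          # ids of the most similar tokens
print(int(top5[0]))                   # → 7 (a token is most similar to itself)
```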


8. Graph Analysis of Embeddings

When you visualize embeddings as a graph, each token becomes a node, and tokens with nearby vectors are connected by edges.

For example, tokenizing the sentence "AI agents will be the most hot topic of artificial intelligence in 2025." and connecting each token's embedding to the 20 most similar embeddings creates an embedding graph like the following.
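The graph construction described above can be sketched with random stand-in embeddings; the node count here is arbitrary, while k = 20 matches the neighbor count used in the text:

```python
import numpy as np

# Hypothetical token embeddings: 50 tokens, 8 dimensions, L2-normalized.
rng = np.random.default_rng(0)
E = rng.normal(size=(50, 8))
E /= np.linalg.norm(E, axis=1, keepdims=True)

k = 20
sims = E @ E.T                                # cosine similarity (unit vectors)
np.fill_diagonal(sims, -np.inf)               # exclude self-loops

# Each token (node) gets an edge to its k most similar tokens.
edges = [(i, int(j)) for i in range(len(E))
         for j in np.argsort(-sims[i])[:k]]
print(len(edges))                             # → 1000 (50 nodes x 20 neighbours)
```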

"The token 'list' has very similar embeddings to various variants like '_list' and 'List'."

Through this kind of graph analysis, you can grasp the semantic clusters of embeddings and relationships between variant tokens at a glance.


9. Closing

Embeddings are a core component of natural language processing and LLMs, and their importance doesn't diminish over time. In this article, we thoroughly examined everything from the basic concepts of embeddings to traditional methods, Word2Vec, BERT, and real-world examples and visualizations of the latest LLM embeddings in chronological order.

Embeddings are intuitive and easy to understand, yet they play a decisive role in model performance and semantic understanding. Understanding the fundamental principles of embeddings will continue to be enormously helpful in working with NLP and LLMs.

"Embeddings are simple, powerful, and will remain at the core of LLMs for the foreseeable future."

If you have questions or feedback, feel free to leave them in the community anytime.
