Understanding RoPE

Last modified: 2025/02/21 | Estimated Reading Time: 15 min | Author: XXM

This article primarily introduces RoPE; I compiled it after a sharing session with my team. The main reference is the original authors' paper, and I have also incorporated some of my own understanding. If you don't have much background knowledge, that's fine: I will begin with the most fundamental position encoding in the transformer block.
For sequence data, among the three basic modeling approaches, both RNNs and CNNs inherently incorporate positional information, so there is no need to add extra position encoding during modeling. As for why the transformer block cannot handle the positional information of tokens in a sequence, a brief explanation is as follows. The computation of a transformer block can be simplified as $\mathrm{FFN}(\mathrm{Attn}(X))$, where $\mathrm{FFN}$, being a fully connected layer applied to each token independently, is clearly independent of position. On the other hand, $\mathrm{Attn}$ is a global computation that only captures content-based relationships between tokens, making it position-independent as well. In fact, the ability of the transformer block to perform highly parallelized computation is achieved precisely by sacrificing positional information.
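To make this concrete, here is a small numpy sketch (my own illustration, not from the paper) showing that, without any position encoding, a token's attention output is unchanged when the surrounding tokens are shuffled:

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no position encoding."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                       # 5 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out = attention(X, Wq, Wk, Wv)
perm = [0, 3, 1, 4, 2]                            # keep token 0 in place, shuffle the rest
out_perm = attention(X[perm], Wq, Wk, Wv)

# Token 0 gets exactly the same representation regardless of the order of its context.
print(np.allclose(out[0], out_perm[0]))           # True
```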
Is positional information important?
The answer to this question is certainly yes. The two sentences 'I raised a cat' and 'A cat raised me' only differ in word order, yet their meanings are completely opposite. From a certain perspective, the purpose of the transformer block is to capture semantic vectors (one can think of the role of the [CLS] token in the BERT model). Therefore, how to incorporate positional information of the sequence into the computation of the transformer block naturally becomes one of the key factors that ultimately influences the model's performance.
So, how should we incorporate the positional information of the sequence into the transformer block?
To answer this question, let's first examine where the sequence information might be added.
For the computation process of the transformer block, $\mathrm{FFN}(\mathrm{Attn}(X))$, there are three places where positional information could be added:
  1. Modify $X$, which means making adjustments to the output of the embedding layer.
  2. Modify $\mathrm{Attn}$. Since $\mathrm{Attn}$ is a computational process, the modification here means adding positional information to one of its intermediate results or to its final output.
  3. Both of the above.
Since the purpose of position encoding is simply to add positional information to the features, approach 3 seems somewhat redundant. In fact, I have not come across such an encoding scheme. If you have, please feel free to email me, and I will promptly update it here. Next, I will introduce approaches 1 and 2 in detail.

Approach 1: Modify the input

Since we are modifying the input, the modification is applied directly to the embedding:
$$x_i' = x_i + p_i$$
Here, $x_i$ is the embedding of the $i$-th token in the sequence, and $p_i$ is the corresponding modification value; they have the same vector dimension. This method of directly adding a position encoding to the token embedding is called absolute position encoding, which, as the name suggests, encodes the absolute position index $i$ directly.
Currently, there are two mainstream approaches for the design of $p_i$:
The first approach is to add a trainable matrix of shape (maximum length × embedding dimension) and let the model learn its parameters during training. This is the approach used in both BERT and GPT. Its advantage is simplicity: there is no need to design the position encodings by hand; the model learns everything itself. The downside is that it loses the ability to extrapolate, since the model only learns position encodings up to a fixed length. Although both BERT and GPT use this encoding method, it was actually proposed in 2017 by Facebook AI:
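As a rough sketch of what this looks like in practice (the sizes below are BERT-like but chosen by me purely for illustration; in a real model both tables are trainable parameters):

```python
import numpy as np

vocab_size, max_len, d = 30522, 512, 768          # illustrative, BERT-base-like sizes

# Two lookup tables: one indexed by token id, one indexed by position.
# In a real model both are learned; here they are just random arrays.
token_emb = np.random.normal(scale=0.02, size=(vocab_size, d))
pos_emb   = np.random.normal(scale=0.02, size=(max_len, d))

def embed(token_ids):
    positions = np.arange(len(token_ids))         # breaks beyond max_len: no extrapolation
    return token_emb[token_ids] + pos_emb[positions]

x = embed(np.array([101, 2023, 2003, 102]))       # hypothetical token ids
print(x.shape)                                    # (4, 768)
```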
The other approach is to define a position encoding function $p_i = f(i)$, which ideally satisfies two requirements. First, the range of this function should have both upper and lower bounds; this is primarily to ensure stable and efficient training of the neural network, and it is one of the reasons for applying various normalization techniques during training. Second, the value of the function must change with $i$ and should not repeat.
The most well-known position encoding in this category that meets these requirements is the Sinusoidal position encoding. The name may seem a bit unfamiliar, but it is simply the encoding designed in the paper Attention Is All You Need:
$$p_{i,2t} = \sin\!\left(\frac{i}{10000^{2t/d}}\right), \qquad p_{i,2t+1} = \cos\!\left(\frac{i}{10000^{2t/d}}\right)$$
Here, $p_{i,2t}$ and $p_{i,2t+1}$ are the position encodings at the $2t$-th and $(2t+1)$-th dimensions for the $i$-th token in the sequence, and $d$ is the dimensionality of the embedding vector.
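A minimal numpy implementation of this formula (sines on the even dimensions, cosines on the odd ones) might look like the following sketch:

```python
import numpy as np

def sinusoidal_encoding(seq_len, d):
    """Sinusoidal position encoding from 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    t = np.arange(d // 2)[None, :]                   # (1, d/2)
    angles = positions / (10000 ** (2 * t / d))      # (seq_len, d/2)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

pe = sinusoidal_encoding(seq_len=128, d=64)
print(pe.shape, pe.min() >= -1, pe.max() <= 1)       # bounded, as required
```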
In addition to satisfying the two conditions above, this encoding also has an interesting property:
$$\sin(\alpha+\beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta, \qquad \cos(\alpha+\beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$$
This implies that if the encoding angle of a position is $\alpha+\beta$, its encoding can be expressed in terms of the encodings at angles $\alpha$ and $\beta$, so there is some relative relationship between these positions. Although this relationship is indirect and may be difficult to exploit, the capability is at least there.
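Concretely, the pair at angle $\alpha+\beta$ can be obtained from the pair at angle $\alpha$ by a linear map that depends only on $\beta$; a short numpy check (my own illustration):

```python
import numpy as np

alpha, beta = 0.7, 1.9
lhs = np.array([np.sin(alpha + beta), np.cos(alpha + beta)])

# Linear map depending only on beta, applied to the encoding at alpha.
M = np.array([[np.cos(beta),  np.sin(beta)],
              [-np.sin(beta), np.cos(beta)]])
rhs = M @ np.array([np.sin(alpha), np.cos(alpha)])

print(np.allclose(lhs, rhs))                      # True
```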
Since the main purpose of this blog is to introduce RoPE, I will not go into further details about other absolute position encodings.

Approach 2: Modify the AttnBlock

Let us first review the two most important computations in $\mathrm{Attn}$, namely the calculation of the $q$, $k$, $v$ vectors and the computation of the attention scores:
$$q_i = W_q x_i, \quad k_i = W_k x_i, \quad v_i = W_v x_i \tag{1}$$
$$a_{i,j} = \frac{q_i \cdot k_j}{\sqrt{d}} \tag{2}$$
Clearly, it is impossible to introduce relative positional information in equation (1), as the computation only involves a single vector $x_i$; if we took this approach, it would reduce to the absolute position encoding of Approach 1. Therefore, we can instead try to introduce the positional information of the two indices in equation (2), allowing the computation of $q_i \cdot k_j$ to be expressed as a function $g(x_i, x_j, i-j)$ of the two tokens and their relative position. Since $q_i \cdot k_j = \lVert q_i\rVert\,\lVert k_j\rVert \cos\langle q_i, k_j\rangle$, if the magnitudes are to stay unchanged, the modification to this computation paradigm can only act on the angle.
Before formally introducing RoPE, let's first recall a concept from matrix transformations. Suppose we have a vector in a two-dimensional space, written in polar form:
$$\boldsymbol{v} = r\begin{pmatrix} \cos\phi \\ \sin\phi \end{pmatrix}$$
Now, let's define a square matrix $R(\theta)$ as follows:
$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
Next, we multiply it by $\boldsymbol{v}$:
$$R(\theta)\,\boldsymbol{v} = r\begin{pmatrix} \cos\theta\cos\phi - \sin\theta\sin\phi \\ \sin\theta\cos\phi + \cos\theta\sin\phi \end{pmatrix} = r\begin{pmatrix} \cos(\phi+\theta) \\ \sin(\phi+\theta) \end{pmatrix}$$
From this result, we can draw the conclusion: multiplying a two-dimensional vector by $R(\theta)$ rotates it by $\theta$ while keeping its magnitude unchanged. Now, using this conclusion, let us rotate $q_i$ by $i\theta$ and $k_j$ by $j\theta$, and redo the previous computation:
$$\big(R(i\theta)\,q_i\big)^\top \big(R(j\theta)\,k_j\big) = q_i^\top R(i\theta)^\top R(j\theta)\, k_j = q_i^\top R\big((j-i)\theta\big)\, k_j$$
That is to say, by rotating $q_i$ and $k_j$ according to their positions, we have successfully introduced the relative position $j-i$ into the attention score while preserving the magnitudes. This is the core idea of RoPE.
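We can verify this numerically with a short sketch: pairs of positions with the same offset produce exactly the same score, no matter where they sit in the sequence.

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix R(theta)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
theta = 0.3

for i, j in [(2, 7), (12, 17), (100, 105)]:       # same relative distance j - i = 5
    score = (rot(i * theta) @ q) @ (rot(j * theta) @ k)
    print(round(score, 6))                         # identical for all three pairs

# The score only depends on j - i:
print(round(q @ rot(5 * theta) @ k, 6))
```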
So, does this conclusion hold in higher-dimensional spaces?
Let us split the $d$-dimensional embedding into $d/2$ two-dimensional subspaces. Following the previous definition, the rotation angles of the $i$-th vector are $i\theta_0, i\theta_1, \dots, i\theta_{d/2-1}$, and the corresponding rotation matrix is defined as:
$$R_i = \begin{pmatrix} \cos i\theta_0 & -\sin i\theta_0 & & & \\ \sin i\theta_0 & \cos i\theta_0 & & & \\ & & \ddots & & \\ & & & \cos i\theta_{d/2-1} & -\sin i\theta_{d/2-1} \\ & & & \sin i\theta_{d/2-1} & \cos i\theta_{d/2-1} \end{pmatrix}$$
We can observe that $R_i$ is actually a block-diagonal matrix composed of $2\times 2$ rotation blocks:
$$R_i = \operatorname{diag}\!\big(R(i\theta_0),\, R(i\theta_1),\, \dots,\, R(i\theta_{d/2-1})\big)$$
We can now extend the derivation from the two-dimensional space to the higher-dimensional space:
$$\big(R_i\, q_i\big)^\top \big(R_j\, k_j\big) = q_i^\top R_i^\top R_j\, k_j = q_i^\top R_{j-i}\, k_j$$
Where:
$$R_{j-i} = \operatorname{diag}\!\big(R\big((j-i)\theta_0\big),\, \dots,\, R\big((j-i)\theta_{d/2-1}\big)\big)$$
Thus, by exploiting the orthogonality of these block-diagonal rotation matrices, we have successfully extended the conclusion to higher-dimensional spaces. Now, let's check the two conditions mentioned earlier. The first one, boundedness, needs no further elaboration, since every entry of the encoding is a sine or cosine. Let's attempt to prove the second condition, that there should be no repetition, by contradiction.
Assume that the rotations of two different positions $i$ and $j$ are identical. This means that the difference of the rotation angles in every subspace is an integer multiple of $2\pi$; in other words, for each subspace $t$ we would need:
$$(i - j)\,\theta_t = 2\pi n, \qquad n \in \mathbb{Z}\setminus\{0\}$$
Since $i - j$ and $n$ are integers, this would require $\theta_t$ to contain a factor of $\pi$; otherwise the equality cannot hold. Moreover, in the original paper the author defines $\theta_t = 10000^{-2t/d}$, which clearly contains no factor of $\pi$, so no two positions share the same rotation.
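Because each $2\times 2$ block only mixes one pair of dimensions, the full rotation never needs to be materialized as a matrix; it reduces to elementwise sines and cosines. Here is a minimal sketch along the lines above (my own code, not the authors'):

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    """Rotate a d-dimensional vector x at position `pos`, treating
    dims (2t, 2t+1) as the t-th two-dimensional subspace."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)   # theta_t = base^{-2t/d}
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
d = 64
q, k = rng.normal(size=d), rng.normal(size=d)

# The rotated inner product depends only on the relative distance j - i.
s1 = apply_rope(q, 3) @ apply_rope(k, 10)
s2 = apply_rope(q, 203) @ apply_rope(k, 210)
print(np.allclose(s1, s2))                         # True
```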

Decay of RoPE

Finally, I would like to discuss the decay property of RoPE. The decay property refers to the fact that, as the relative distance between two tokens increases, the inner product computed after applying RoPE tends to decay. Intuitively, this is a desirable property: the farther apart two tokens are, the smaller their attention score should generally be.
The author gives an explanation based on complex angles in the original paper, which I will adopt directly here. After grouping the dimensions of $q$ and $k$ in pairs, the inner product after applying RoPE can be written with complex numbers (in the original paper the author uses $m$ and $n$ to denote the two positions and $d$ to denote the dimension of the vector, so I will stay consistent with the author here):
$$\big(R_m q\big)^\top \big(R_n k\big) = \mathrm{Re}\!\left[\sum_{t=0}^{d/2-1} q_{[2t:2t+1]}\, k^{*}_{[2t:2t+1]}\, e^{\,\mathrm{i}(m-n)\theta_t}\right]$$
Let $h_t = q_{[2t:2t+1]}\, k^{*}_{[2t:2t+1]}$ and $S_j = \sum_{t=0}^{j-1} e^{\,\mathrm{i}(m-n)\theta_t}$, with the conventions $h_{d/2} = 0$ and $S_0 = 0$. Using summation by parts (Abel summation), we can obtain:
$$\left|\sum_{t=0}^{d/2-1} h_t\, e^{\,\mathrm{i}(m-n)\theta_t}\right| = \left|\sum_{t=0}^{d/2-1} S_{t+1}\,(h_{t+1} - h_t)\right| \le \left(\max_t \big|h_{t+1} - h_t\big|\right) \sum_{t=0}^{d/2-1} \big|S_{t+1}\big|$$
In the paper, the author then examines $\frac{2}{d}\sum_{j=1}^{d/2} |S_j|$ and finds that it decays as the relative distance $m-n$ increases.
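That quantity is easy to reproduce; the following numpy sketch computes $\frac{2}{d}\sum_j |S_j|$ under the $\theta_t = 10000^{-2t/d}$ setting (the dimension $d=128$ is just an example):

```python
import numpy as np

def mean_abs_partial_sums(rel_dist, d=128, base=10000.0):
    """(2/d) * sum_j |S_j|, where S_j = sum_{t<j} exp(i * rel_dist * theta_t)."""
    theta = base ** (-2 * np.arange(d // 2) / d)
    terms = np.exp(1j * rel_dist * theta)
    S = np.cumsum(terms)                           # S_1, ..., S_{d/2}
    return np.abs(S).mean()

for r in [1, 10, 100, 1000]:
    print(r, round(mean_abs_partial_sums(r), 3))   # broadly decays as r grows
```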
However, I believe this property can be understood in another way. If you're not too familiar with the complex number representation in the original paper, feel free to consider my approach.
Let us first assume that both $q_i$ and $k_j$ are all-ones vectors. In that case, each two-dimensional subspace contributes
$$\begin{pmatrix}1 & 1\end{pmatrix} R\big((j-i)\theta_t\big) \begin{pmatrix}1 \\ 1\end{pmatrix} = 2\cos\big((j-i)\theta_t\big)$$
to the inner product. If we denote the relative distance between the two tokens, $j-i$, as $r$, then the attention score between these two vectors is given by:
$$g(r) = \sum_{t=0}^{d/2-1} 2\cos(r\theta_t)$$
Let's first take a quick look at this function:
  1. If $\theta_t = 0$ for every $t$, then $g(r) = d$, which means the attention score is constant and independent of the distance. This is easy to understand: the rotation degenerates to the identity, so we haven't actually changed anything.
  2. If $\theta_t = \theta$ is the same nonzero constant for every $t$, then $g(r) = d\cos(r\theta)$, indicating that as the distance increases, the attention score is simply a repeating periodic function.
As mentioned, the author sets $\theta_t = 10000^{-2t/d}$ in the paper, so the period of the $t$-th term $\cos(r\theta_t)$ is:
$$T_t = \frac{2\pi}{\theta_t} = 2\pi \cdot 10000^{2t/d}$$
Each term is obviously periodic, so although $g(r)$ oscillates in the long run, every $\cos(r\theta_t)$ is still monotonically decreasing while $r\theta_t \in [0, \pi]$, and because the terms have very different periods, the sum decays on the whole before the slower terms complete a cycle. To demonstrate this combined effect of periodicity and decay, let's illustrate it with an experiment:
We replace the constant 10000 in $\theta_t = 10000^{-2t/d}$ with a tunable parameter $B$, and first observe the original setting ($B = 10000$):
[Figure: attention score $g(r)$ versus relative distance $r$ for $B = 10000$]
Indeed, this matches our earlier expectation: periodic repetition combined with decay. We also compared the results for different values of $B$:
[Figures: the same plot for several other values of $B$]
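For reference, plots of this kind can be reproduced with a short matplotlib sketch; the $B$ values below are my own choices for illustration, not necessarily the ones used in the figures above:

```python
import numpy as np
import matplotlib.pyplot as plt

def score(r, d=128, base=10000.0):
    """g(r) = 2 * sum_t cos(r * theta_t) for all-ones q and k."""
    theta = base ** (-2 * np.arange(d // 2) / d)
    return 2 * np.cos(np.outer(r, theta)).sum(axis=-1)

r = np.arange(0, 3000)
for base in [100, 1000, 10000, 100000]:            # hypothetical comparison values
    plt.plot(r, score(r, base=base), label=f"B={base}")
plt.xlabel("relative distance r")
plt.ylabel("attention score g(r)")
plt.legend()
plt.show()
```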
It can be observed that within a certain range there is indeed a decreasing trend; this is the monotonic decay within the first quarter of the cosine cycle, although the leading dimensions, whose periods are short, still exhibit significant periodic oscillation. Therefore, at least on these plots, RoPE can theoretically still convey distance information for a context length of at least 3000.

Before we end

At this point, the introduction to RoPE has come to an end. However, before officially concluding this article, let's take another look at the absolute Sinusoidal position encoding from earlier:
$$p_{i,2t} = \sin\!\left(\frac{i}{10000^{2t/d}}\right), \qquad p_{i,2t+1} = \cos\!\left(\frac{i}{10000^{2t/d}}\right)$$
If we again denote $\theta_t = 10000^{-2t/d}$ and group the dimensions in pairs, then the above expression can be written in the following form:
$$p_i = \big(\sin i\theta_0,\ \cos i\theta_0,\ \sin i\theta_1,\ \cos i\theta_1,\ \dots,\ \sin i\theta_{d/2-1},\ \cos i\theta_{d/2-1}\big)$$
Suddenly, it looks very similar to RoPE's rotational position encoding: in each subspace, the position is represented by a pair of values, $(\sin i\theta_t, \cos i\theta_t)$. However, whereas RoPE rotates the query and key by these angles, the absolute encoding simply adds this pair to the embedding in the 2D Euclidean plane.
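To make the contrast explicit, here is a tiny sketch of the two operations in the same paired layout (again my own illustration): the absolute encoding adds the pair $(\sin i\theta_t, \cos i\theta_t)$, while RoPE uses the same angles to rotate.

```python
import numpy as np

d = 8
theta = 10000.0 ** (-2 * np.arange(d // 2) / d)
i = 5                                              # an arbitrary position
angles = i * theta

# Sinusoidal absolute PE: add the pair (sin, cos) to each 2D subspace of x.
pe = np.empty(d)
pe[0::2], pe[1::2] = np.sin(angles), np.cos(angles)
x = np.ones(d)
x_abs = x + pe

# RoPE: rotate each 2D subspace of x by the same angles instead of adding.
x_rope = np.empty(d)
x_rope[0::2] = x[0::2] * np.cos(angles) - x[1::2] * np.sin(angles)
x_rope[1::2] = x[0::2] * np.sin(angles) + x[1::2] * np.cos(angles)

print(x_abs.round(3))
print(x_rope.round(3))
```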
With that, we’ve covered the design origins of RoPE’s rotational position encoding, along with some elements and properties of position encoding.
Thanks for reading.