Paper Link: https://arxiv.org/abs/2104.09864

Formula:

f_{q,k}(x_m, m) = (W_{q,k} x_m) · e^(i m θ)

Key Components:

  • θ: Fixed non-zero scalar
  • x_m: Embedding vector at position m, treated in 2D as the complex number x_m^(1) + i·x_m^(2)
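As a quick sketch of the 2D case (assuming the complex-number reading above; the function name apply_rope_2d and the value of θ are illustrative, not from the paper):

```python
import numpy as np

def apply_rope_2d(x, m, theta=1.0):
    """Rotate the 2D vector x by the angle m * theta.

    Treats x = (x1, x2) as the complex number x1 + i*x2 and multiplies by
    e^(i*m*theta), matching the 2D formula above.
    """
    z = complex(x[0], x[1]) * np.exp(1j * m * theta)
    return np.array([z.real, z.imag])

# The property this buys: the score between a rotated query and key depends
# only on the relative offset m - n, not on the absolute positions.
q, k = np.array([1.0, 0.5]), np.array([0.3, -0.2])
print(apply_rope_2d(q, 5) @ apply_rope_2d(k, 2))    # positions 5 and 2
print(apply_rope_2d(q, 13) @ apply_rope_2d(k, 10))  # same value: offset is 3 again
```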

Paper Details

In the paper, when talking about attention, notice how they transpose the query vector instead of the key: the per-token score is written as q_m^T k_n rather than the matrix form Q K^T. Since q_m and k_n are single column vectors, both give the same scalar dot product…
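A tiny numerical check of that point (shapes and values here are just for illustration):

```python
import numpy as np

# The per-token score q_m^T k_n is the same number as the (m, n) entry of
# the familiar matrix form Q K^T written with row-vector Q and K.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, head dimension 8
K = rng.normal(size=(4, 8))

m, n = 1, 3
per_token = Q[m] @ K[n]          # q_m^T k_n, both treated as vectors
matrix_form = (Q @ K.T)[m, n]    # (Q K^T)_{m, n}
print(np.isclose(per_token, matrix_form))  # True
```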

Advantage over others

Sinusoidal Positional Encodings are not preferred over Rotary Position Embeddings because no clear pattern is established as the position counter goes up: the encoding is added to the embedding, so the query-key score does not depend in a clean way on the relative distance between positions.

The output also changes significantly (including its magnitude) when the encoding is added, whereas RoPE only rotates the vector and leaves its norm unchanged.
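A small sketch of both points in a 2D toy setting (θ, the vectors, and the sin/cos encoding used here are illustrative, not values from the paper):

```python
import numpy as np

theta = 0.1

def sinusoidal_pe(m):
    # 2D sinusoidal positional encoding for position m, added to the embedding
    return np.array([np.sin(m * theta), np.cos(m * theta)])

def rope(x, m):
    # 2D rotary embedding: rotate x by the angle m * theta
    c, s = np.cos(m * theta), np.sin(m * theta)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

x = np.array([1.0, 0.5])

# Adding the sinusoidal encoding changes the vector (including its norm);
# the rotary embedding leaves the norm unchanged.
print(np.linalg.norm(x), np.linalg.norm(x + sinusoidal_pe(7)), np.linalg.norm(rope(x, 7)))

# With RoPE the query-key score depends only on the relative offset (3 here),
# so the pattern stays the same as positions grow; with the added sinusoidal
# encoding it does not.
q, k = np.array([0.8, -0.3]), np.array([0.2, 0.9])
print(rope(q, 5) @ rope(k, 2), rope(q, 105) @ rope(k, 102))        # equal
print((q + sinusoidal_pe(5)) @ (k + sinusoidal_pe(2)),
      (q + sinusoidal_pe(105)) @ (k + sinusoidal_pe(102)))         # different
```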