Paper Link: https://arxiv.org/pdf/2501.19399
Pretty cool paper; it covers the advantages of Scalable Softmax (SSMax) over standard Softmax in attention.
Evaluates the two with tests like Needle in a Haystack.
Concepts in the Paper
Attention Fading
- As the input vector size grows, running softmax on it produces an increasingly flat probability distribution, so no single element can keep receiving most of the attention (see the sketch below).
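A minimal sketch of the fading effect: one "needle" logit sits among many zero logits, and as the vector length n grows, standard softmax spreads probability mass thinner and thinner. The SSMax-style variant shown here multiplies the logits by s * log(n) before the softmax, which is my reading of the paper's formulation; the value of s is illustrative (in the paper it is a learned per-head parameter), so treat this as a rough demo rather than the exact method.

```python
import numpy as np

def softmax(z):
    """Standard softmax with max-subtraction for numerical stability."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# One "important" logit (the needle) among n-1 uninformative zero logits.
# With standard softmax, the probability assigned to the needle shrinks
# toward 0 as n grows -- the attention fading effect.
s = 0.43  # illustrative scaling parameter (assumed; learned per head in the paper)
for n in (64, 1024, 16384, 262144):
    z = np.zeros(n)
    z[0] = 5.0                                  # the needle logit
    p_plain = softmax(z)                        # standard softmax
    p_ssmax = softmax(s * np.log(n) * z)        # SSMax-style: scale logits by s*log(n)
    print(f"n={n:7d}  softmax p(needle)={p_plain[0]:.4f}  "
          f"SSMax-style p(needle)={p_ssmax[0]:.4f}")
```

Running this, the standard softmax probability on the needle drops from roughly 0.7 at n=64 to well under 0.001 at n=262144, while the log(n)-scaled version stays close to 1, which is the flattening-vs-not contrast the paper is getting at.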