Paper Link: https://arxiv.org/pdf/2501.19399

Pretty cool paper, talks about the advantages of Scalable Softmax (SSMax) over the standard Softmax in attention

Uses tests like Needle in a Haystack

Concepts in the Paper

Attention Fading - As the size of the input vector grows, running softmax on it produces a probability distribution that becomes increasingly flat, so attention can no longer concentrate weight on any single position. A quick numerical demo is sketched below.
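
A minimal NumPy sketch (not the paper's code) to see the effect: one "needle" logit sits a fixed margin above random noise logits, and as the vector length n grows, standard softmax gives it less and less weight. The SSMax form here follows my reading of the paper, scaling the logits by s * log(n) before the exponent; the value s = 0.43 is just an illustrative constant, not a result from the paper.

```python
import numpy as np

def softmax(z):
    # Standard softmax: exp(z_i) / sum_j exp(z_j), shifted for numerical stability.
    e = np.exp(z - z.max())
    return e / e.sum()

def ssmax(z, s=0.43):
    # Scalable Softmax as I read it from the paper: scale logits by s * log(n)
    # before exponentiating, where n is the input length. s is illustrative here.
    n = len(z)
    return softmax(s * np.log(n) * z)

rng = np.random.default_rng(0)
for n in (16, 256, 4096, 65536):
    z = rng.normal(0.0, 1.0, size=n)
    z[0] = z.max() + 2.0  # the "needle" logit: a fixed margin above everything else
    print(f"n={n:>6}  softmax p(needle)={softmax(z)[0]:.4f}  "
          f"ssmax p(needle)={ssmax(z)[0]:.4f}")
```

Running this, the softmax weight on the needle decays toward zero as n grows (attention fading), while the log(n)-scaled version keeps the needle's probability high, which is the intuition behind why SSMax helps on long-context retrieval tests like Needle in a Haystack.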