Paper Link: https://arxiv.org/pdf/2501.19399
Pretty cool paper; it covers the advantages of Scalable Softmax (SSMax) over standard Softmax in attention.
Evaluates the two with tests like Needle in a Haystack.
Concepts in the Paper
Attention Fading
- As the input vector size grows, running softmax on it produces an increasingly flat probability distribution, so no single element can keep receiving most of the attention (see the sketch below).
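A minimal sketch of the fading effect: one "needle" logit sits among many zero logits, and as the vector length n grows, standard softmax spreads probability mass thinner and thinner. The SSMax-style variant shown here multiplies the logits by s * log(n) before the softmax, which is my reading of the paper's formulation; the value of s is illustrative (in the paper it is a learned per-head parameter), so treat this as a rough demo rather than the exact method.

```python
import numpy as np

def softmax(z):
    """Standard softmax with max-subtraction for numerical stability."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# One "important" logit (the needle) among n-1 uninformative zero logits.
# With standard softmax, the probability assigned to the needle shrinks
# toward 0 as n grows -- the attention fading effect.
s = 0.43  # illustrative scaling parameter (assumed; learned per head in the paper)
for n in (64, 1024, 16384, 262144):
    z = np.zeros(n)
    z[0] = 5.0                                  # the needle logit
    p_plain = softmax(z)                        # standard softmax
    p_ssmax = softmax(s * np.log(n) * z)        # SSMax-style: scale logits by s*log(n)
    print(f"n={n:7d}  softmax p(needle)={p_plain[0]:.4f}  "
          f"SSMax-style p(needle)={p_ssmax[0]:.4f}")
```

Running this, the standard softmax probability on the needle drops from roughly 0.7 at n=64 to well under 0.001 at n=262144, while the log(n)-scaled version stays close to 1, which is the flattening-vs-not contrast the paper is getting at.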