In Multi-Head Attention in transformers, new Key and Value vectors are computed for every new token. With a KV Cache, once the new token's Key and Value are computed, they are appended to the Keys and Values already stored in memory, so the projections for earlier tokens don't have to be recomputed.
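A minimal sketch of this append step (the `KVCache` class, tensor shapes, and dimension layout are assumptions for illustration, not any specific library's API):

```python
import torch

class KVCache:
    def __init__(self):
        self.keys = None    # (batch, heads, seq_len, head_dim)
        self.values = None  # (batch, heads, seq_len, head_dim)

    def update(self, new_k: torch.Tensor, new_v: torch.Tensor):
        """Append this step's Key/Value projections along the sequence axis."""
        if self.keys is None:
            # First call: nothing cached yet, so the new tensors become the cache.
            self.keys, self.values = new_k, new_v
        else:
            # Later calls: concatenate onto the cached Keys/Values in memory.
            self.keys = torch.cat([self.keys, new_k], dim=2)
            self.values = torch.cat([self.values, new_v], dim=2)
        return self.keys, self.values
```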
1st Token Generation
When the model generates the first token, there is no KV Cache yet, so it has to compute the Key and Value projections for every prompt token, resulting in higher latency.
Each subsequent token only requires computing the Key and Value for the newest token, with the rest read from the cache, so it has lower latency.
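A sketch of that difference, reusing the `KVCache` class from above (the projection weights, `project` helper, and sizes are made up for this example):

```python
import torch

batch, heads, head_dim = 1, 4, 64
hidden_dim = heads * head_dim
w_k = torch.randn(hidden_dim, hidden_dim)  # hypothetical Key projection weights
w_v = torch.randn(hidden_dim, hidden_dim)  # hypothetical Value projection weights

def project(x: torch.Tensor):
    """Project hidden states into per-head Key and Value tensors."""
    b, t, _ = x.shape
    k = (x @ w_k).view(b, t, heads, head_dim).transpose(1, 2)
    v = (x @ w_v).view(b, t, heads, head_dim).transpose(1, 2)
    return k, v

cache = KVCache()

# 1st token: no cache yet, so Keys/Values are computed for all prompt tokens (higher latency).
prompt_hidden = torch.randn(batch, 10, hidden_dim)
cache.update(*project(prompt_hidden))

# Subsequent tokens: only the newest token's Key/Value are computed and appended (lower latency).
new_hidden = torch.randn(batch, 1, hidden_dim)
keys, values = cache.update(*project(new_hidden))
print(keys.shape)  # torch.Size([1, 4, 11, 64]): 10 prompt tokens + 1 generated token
```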