In Multi-Head Attention in transformers, a new Key and Value are computed for every new token during generation. Once computed, they are appended onto the Keys and Values already held in memory (the KV Cache), so they never have to be recomputed for later tokens, as sketched below.
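
A minimal sketch of the append step, assuming PyTorch-style tensors; the function name `append_to_kv_cache` and the shapes are illustrative, not a specific library API:

```python
import torch

def append_to_kv_cache(k_cache, v_cache, k_new, v_new):
    """Append the new token's Key/Value onto the cache along the sequence axis.
    Assumed shapes: (..., seq_len, head_dim) for the caches and
    (..., 1, head_dim) for the new token; None means nothing is cached yet."""
    if k_cache is None:                     # first step: the cache starts empty
        return k_new, v_new
    k_cache = torch.cat([k_cache, k_new], dim=-2)   # grow along seq_len
    v_cache = torch.cat([v_cache, v_new], dim=-2)
    return k_cache, v_cache
```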

1st Token Generation

When the model is generating the first token, there is no KV Cache yet, so it has to compute the Keys and Values for every prompt token at once, resulting in higher latency for that first step.

Each subsequent token has lower latency, because only the Keys and Values for the single new token need to be computed; everything earlier is read straight from the cache.
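
To make the latency difference concrete, here is a toy single-head attention step with assumed names and shapes (no causal mask or multi-head split, for brevity); the first call passes the whole prompt with empty caches, and every later call passes just one token:

```python
import torch

def attention_step(x_new, w_q, w_k, w_v, k_cache=None, v_cache=None):
    """x_new: (batch, new_tokens, d_model). On the first call new_tokens is the
    full prompt length and the caches are None; afterwards new_tokens is 1."""
    q = x_new @ w_q                                   # queries for the new tokens only
    k = x_new @ w_k                                   # keys for the new tokens only
    v = x_new @ w_v                                   # values for the new tokens only
    if k_cache is not None:                           # reuse everything cached so far
        k = torch.cat([k_cache, k], dim=-2)
        v = torch.cat([v_cache, v], dim=-2)
    scores = (q @ k.transpose(-2, -1)) / k.shape[-1] ** 0.5
    out = torch.softmax(scores, dim=-1) @ v           # causal mask omitted for brevity
    return out, k, v                                  # returned k/v become the new cache
```

The first call does the work for the whole prompt up front; each later call only projects a single token and concatenates it onto the cache, which is why per-token latency drops after the first token.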