In Multi-Head Attention in transformers, new Key and Value vectors are computed for every new token. With a KV Cache, once the new token's Key and Value are computed, they are appended to the Keys and Values already stored in memory, so the projections for earlier tokens don't have to be recomputed.
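A minimal sketch of this append step (the `KVCache` class, tensor shapes, and dimension layout are assumptions for illustration, not any specific library's API):

```python
import torch

class KVCache:
    def __init__(self):
        self.keys = None    # (batch, heads, seq_len, head_dim)
        self.values = None  # (batch, heads, seq_len, head_dim)

    def update(self, new_k: torch.Tensor, new_v: torch.Tensor):
        """Append this step's Key/Value projections along the sequence axis."""
        if self.keys is None:
            # First call: nothing cached yet, so the new tensors become the cache.
            self.keys, self.values = new_k, new_v
        else:
            # Later calls: concatenate onto the cached Keys/Values in memory.
            self.keys = torch.cat([self.keys, new_k], dim=2)
            self.values = torch.cat([self.values, new_v], dim=2)
        return self.keys, self.values
```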
1st Token Generation
When the model generates the first token, there is no KV Cache yet, so it has to compute the Key and Value projections for every prompt token, resulting in higher latency.
Each subsequent token only requires computing the Key and Value for the newest token, with the rest read from the cache, so it has lower latency.
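A sketch of that difference, reusing the `KVCache` class from above (the projection weights, `project` helper, and sizes are made up for this example):

```python
import torch

batch, heads, head_dim = 1, 4, 64
hidden_dim = heads * head_dim
w_k = torch.randn(hidden_dim, hidden_dim)  # hypothetical Key projection weights
w_v = torch.randn(hidden_dim, hidden_dim)  # hypothetical Value projection weights

def project(x: torch.Tensor):
    """Project hidden states into per-head Key and Value tensors."""
    b, t, _ = x.shape
    k = (x @ w_k).view(b, t, heads, head_dim).transpose(1, 2)
    v = (x @ w_v).view(b, t, heads, head_dim).transpose(1, 2)
    return k, v

cache = KVCache()

# 1st token: no cache yet, so Keys/Values are computed for all prompt tokens (higher latency).
prompt_hidden = torch.randn(batch, 10, hidden_dim)
cache.update(*project(prompt_hidden))

# Subsequent tokens: only the newest token's Key/Value are computed and appended (lower latency).
new_hidden = torch.randn(batch, 1, hidden_dim)
keys, values = cache.update(*project(new_hidden))
print(keys.shape)  # torch.Size([1, 4, 11, 64]): 10 prompt tokens + 1 generated token
```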