Multi-head attention extends the basic attention mechanism by allowing the model to attend to different parts of the input from multiple representation subspaces simultaneously. It builds on scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The queries, keys, and values are first linearly projected to smaller dimensions (typically d_k = d_v = d_model / h for h heads), attention is applied to each projected copy in parallel, and the head outputs are concatenated and projected once more:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
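The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the weight matrices, head count, and dimensions are made-up assumptions, each head is taken as a slice of a single full-width projection, and masking and batching are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # Project once to full width, then split columns into n_heads
    # slices of size d_k = d_model / n_heads (one subspace per head).
    d_model = X.shape[-1]
    d_k = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(n_heads):
        s = slice(i * d_k, (i + 1) * d_k)
        heads.append(attention(Q[:, s], K[:, s], V[:, s]))
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_o

# Toy example with assumed sizes: 4 tokens, d_model=8, 2 heads
rng = np.random.default_rng(0)
seq, d_model, h = 4, 8, 2
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
X = rng.standard_normal((seq, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=h)
print(out.shape)  # (4, 8): output keeps the input shape
```

Note that because each head works in a d_model / h subspace, the total compute is close to that of one full-width attention, which is why adding heads is comparatively cheap.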