Multi-head attention extends the basic attention mechanism by allowing the model to attend to different parts of the input from multiple representation subspaces simultaneously. It builds on scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The queries, keys, and values are first linearly projected to smaller dimensions (typically d_k = d_v = d_model / h for h heads), attention is applied to each projected copy in parallel, and the head outputs are concatenated and projected once more:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
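The mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the weight matrices, head count, and dimensions are made-up assumptions, each head is taken as a slice of a single full-width projection, and masking and batching are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # Project once to full width, then split columns into n_heads
    # slices of size d_k = d_model / n_heads (one subspace per head).
    d_model = X.shape[-1]
    d_k = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(n_heads):
        s = slice(i * d_k, (i + 1) * d_k)
        heads.append(attention(Q[:, s], K[:, s], V[:, s]))
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_o

# Toy example with assumed sizes: 4 tokens, d_model=8, 2 heads
rng = np.random.default_rng(0)
seq, d_model, h = 4, 8, 2
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
X = rng.standard_normal((seq, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=h)
print(out.shape)  # (4, 8): output keeps the input shape
```

Note that because each head works in a d_model / h subspace, the total compute is close to that of one full-width attention, which is why adding heads is comparatively cheap.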