Attention(Q, K, V) = softmax(QKᵀ / √dₖ)V
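In code, this formula is only a few lines. Below is a minimal NumPy sketch, assuming 2-D query/key/value matrices of shape (seq_len, d_k); the function name, the boolean-mask convention, and the -1e9 fill value are illustrative choices, not from the original:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q Kᵀ / √d_k) V for 2-D arrays of shape (seq_len, d_k)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # MatMul + Scale
    if mask is not None:
        # Mask (optional): positions where mask is False get a large negative
        # score, so softmax assigns them ~zero weight.
        scores = np.where(mask, scores, -1e9)
    # Numerically stable softmax over each query's scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # Second MatMul: weighted sum of values
```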

Design

[Figure: Scaled Dot-Product Attention architecture]

Key:

  • MatMul: Matrix multiplication of the queries (Q) with the transposed keys (Kᵀ), producing raw attention scores

  • Scale: Raw scores are divided by √dₖ (the square root of the key dimension) so the dot products do not grow too large and push softmax into regions with vanishing gradients

  • Mask (Optional): Applied to prevent attending to certain positions (e.g., future tokens in a decoder)

  • Softmax: Turns the scaled scores into a probability distribution emphasizing the most relevant tokens

  • Second MatMul (Softmax Output × V): Multiplies the softmax weights by the value matrix (V), producing a weighted sum of values based on how much attention each token deserves (put together in the sketch after this list)

  • Q (Query) - The vector (or set of vectors) that represents what the current token is looking for, analogous to a question being asked

  • K (Key) - A key is associated with each part of the input and acts as an identifier for that piece of information. Given a query, you compare it against all of the keys to determine how relevant each input is to that query

  • V (Value) - Contains the actual information or content that will be aggregated. After attention computes a relevance score for each key (usually through a dot product of the query with each key), these scores are used to weight the corresponding value vectors and produce the final output
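To see these pieces working together, here is a usage example of the sketch above; the shapes, random values, and causal mask are all illustrative assumptions, not from the original:

```python
# Toy example: 4 tokens with 8-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

# Causal mask: position i may only attend to positions <= i.
causal = np.tril(np.ones((4, 4), dtype=bool))

out = scaled_dot_product_attention(Q, K, V, mask=causal)
print(out.shape)  # (4, 8): one weighted sum of value vectors per query token
```

Each row of the output is a mixture of the value vectors, weighted by how strongly that token's query matched each key.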