Transformer Encoder

The encoder is composed of a stack of $N = 6$ identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is $\text{LayerNorm}(x + \text{Sublayer}(x))$, where $\text{Sublayer}(x)$ is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension $d_{\text{model}} = 512$.
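To make the residual pattern concrete, here is a minimal pure-Python sketch of $\text{LayerNorm}(x + \text{Sublayer}(x))$ applied twice per layer and stacked $N = 6$ times. It is not the paper's implementation: `toy_self_attention` and `toy_feed_forward` are hypothetical stand-ins for the real sub-layers, the layer norm has no learned gain or bias, and the code works on a single position vector rather than a full sequence. In a real encoder each layer also has its own parameters; "identical" refers to the structure.

```python
import math

D_MODEL = 512   # model width from the text
N_LAYERS = 6    # number of stacked encoder layers from the text


def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for a single position vector."""
    y = sublayer(x)
    # The residual sum only works if the sub-layer preserves the dimension.
    assert len(y) == len(x), "sub-layer must preserve d_model"
    return layer_norm([a + b for a, b in zip(x, y)])


# Hypothetical stand-ins for the real sub-layers; both map d_model -> d_model.
def toy_self_attention(x):
    return [0.5 * v for v in x]


def toy_feed_forward(x):
    return [max(0.0, v) for v in x]


def encoder_layer(x):
    """One encoder layer: attention sub-layer, then feed-forward sub-layer."""
    x = residual_block(x, toy_self_attention)
    return residual_block(x, toy_feed_forward)


def encoder(x, n_layers=N_LAYERS):
    """Apply n_layers encoder layers in sequence."""
    for _ in range(n_layers):
        x = encoder_layer(x)
    return x


x = [math.sin(i) for i in range(D_MODEL)]
out = encoder(x)
print(len(out))  # 512
```

The assertion inside `residual_block` is the reason every sub-layer outputs $d_{\text{model}}$ values: without matching dimensions, the element-wise sum $x + \text{Sublayer}(x)$ is undefined.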

This is taken directly from "Attention Is All You Need" because I didn’t think there was a better way to phrase it.

Ayush Garg
