
BitLinear
Quantization
Binarization of weights
The first step of BitLinear is to binarize the weights using the Signum function.
A scaling factor $\beta = \frac{1}{nm}\lVert W \rVert_1$ is used after binarization to reduce the L2 error between the real-valued and binarized weights.
Mathematical steps:

$$\widetilde{W} = \mathrm{Sign}(W - \alpha), \qquad \mathrm{Sign}(W_{ij}) = \begin{cases} +1, & \text{if } W_{ij} > 0 \\ -1, & \text{if } W_{ij} \le 0 \end{cases}$$

$$\alpha = \frac{1}{nm} \sum_{ij} W_{ij}$$

Subtracting $\alpha$ centralizes the weights to zero mean.
Semantic understanding of the math:
- $W \in \mathbb{R}^{n \times m}$ - the weight matrix (of the linear layer)
- $W_{ij}$ - one specific entry of that matrix
- $\mathrm{Sign}$ - the Signum function
- $n$ - number of rows
- $m$ - number of columns
- $\alpha$ - the mean (average) of all the weights in $W$
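
A minimal NumPy sketch of these steps (not the official BitNet code; the function name `binarize_weights` is my own):

```python
import numpy as np

def binarize_weights(W: np.ndarray):
    # alpha: mean of all entries, used to center W around zero
    alpha = W.mean()
    # Signum of the centered weights: +1 where W - alpha > 0, else -1
    W_bin = np.where(W - alpha > 0, 1.0, -1.0)
    # beta = ||W||_1 / (n*m): the scale that reduces the L2 error
    # between W and beta * W_bin
    beta = np.abs(W).mean()
    return W_bin, beta

W = np.random.randn(4, 3)
W_bin, beta = binarize_weights(W)
print(W_bin)  # entries are all +1 or -1
print(beta)   # a single positive scalar
```

Multiplying `W_bin` by `beta` gives a scaled binary matrix that approximates `W` much more closely (in L2) than the raw sign matrix alone.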
Absmax quantization: scales activations into the range $[-Q_b, Q_b]$:

$$\widetilde{x} = \mathrm{Quant}(x) = \mathrm{Clip}\!\left(x \times \frac{Q_b}{\gamma},\ -Q_b + \epsilon,\ Q_b - \epsilon\right)$$

$$\mathrm{Clip}(x, a, b) = \max(a, \min(b, x)), \qquad \gamma = \lVert x \rVert_\infty$$
Semantic understanding of the math:
The Clip function forces (clips) the value into the interval $[-Q_b + \epsilon,\ Q_b - \epsilon]$.
What are $Q_b$, $\epsilon$, and the range?
- $Q_b = 2^{b-1}$, where $b$ is the bit-width
- Target range: $[-Q_b, Q_b]$
Example:
- If $b = 8$, then $Q_b = 2^{7} = 128$
- Range is roughly $[-128, 128]$
This is the integer range they want the activations to lie in after scaling + clipping.
The $\epsilon$ pulls the bounds slightly inside that range to avoid edge cases like overflow when the values are later cast or rounded to an integer type.
This means that $\gamma$ is equal to the maximum absolute value among all the entries of $x$.
E.g., for $x = [-10, 5, 3, 2, 6]$, the $\gamma$ value would be $10$.
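
A matching NumPy sketch of absmax quantization, following the formulas above (the function name `absmax_quantize` and the concrete `eps` value are my choices for illustration):

```python
import numpy as np

def absmax_quantize(x: np.ndarray, b: int = 8, eps: float = 1e-5):
    Qb = 2 ** (b - 1)            # e.g. b = 8 -> Qb = 128
    gamma = np.abs(x).max()      # gamma = ||x||_inf, assumed nonzero here
    # scale into [-Qb, Qb], then clip slightly inside the bounds
    x_q = np.clip(x * Qb / gamma, -Qb + eps, Qb - eps)
    return x_q, gamma

x = np.array([-10.0, 5.0, 3.0, 2.0, 6.0])
x_q, gamma = absmax_quantize(x)
print(gamma)  # 10.0, matching the example above
print(x_q)    # roughly [-128., 64., 38.4, 25.6, 76.8]
```

Note how the $-10$ entry scales to exactly $-128$ and gets clipped to $-Q_b + \epsilon$: that is precisely the edge case the $\epsilon$ term guards against.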