Affine quantization maps each floating-point scalar weight to a lower-bit integer using a scale factor and, optionally, an offset.
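As a rough illustration of the general idea, the sketch below quantizes a small list of floats to unsigned 4-bit integers with a scale and an offset, then maps them back. The function names, the min/max choice of range, and the example values are assumptions made for illustration, not the layout or rounding used by any particular library.

```python
def affine_quantize(values, bits=4):
    """Map floats to unsigned integers in [0, 2**bits - 1] (illustrative only)."""
    levels = 2 ** bits - 1                 # 15 for 4-bit
    lo, hi = min(values), max(values)
    # Guard against a constant input, where the range would be zero.
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = [min(levels, max(0, round((x - lo) / scale))) for x in values]
    return scale, lo, q


def affine_dequantize(scale, offset, q):
    """Approximate the original floats as scale * q + offset."""
    return [scale * v + offset for v in q]


values = [-1.0, -0.2, 0.3, 2.0]            # made-up example weights
scale, offset, q = affine_quantize(values)
restored = affine_dequantize(scale, offset, q)
```

Each restored value differs from the original by at most about half a quantization step (scale / 2), which is the price of storing 4 bits instead of a full float.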

Legacy quantization comes in two main subcategories, corresponding to symmetric and asymmetric quantization:

  • Type 0 (symmetric): Q4_0, Q5_0, Q8_0
  • Type 1 (asymmetric): Q4_1, Q5_1, Q8_1

The number after the Q in the naming convention indicates the bit width: e.g., Q4 means (most) weights are stored as 4-bit integers.

The _0 and _1 suffixes indicate the type.

Type 0 (Symmetric Quantization)

Type 0 quantization assumes the weight range is symmetric around zero.
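A minimal sketch of what that assumption buys: if the range is taken to be symmetric, a single scale (no offset) is enough to map a block of weights to signed integers. The code below is a simplified illustration, not the exact llama.cpp kernel; the function names, the choice of the range [-7, 7] for 4 bits, and the sample weights are all assumptions.

```python
def quantize_symmetric(block, bits=4):
    """Quantize floats to signed integers in [-qmax, qmax] with one scale."""
    qmax = 2 ** (bits - 1) - 1             # 7 for 4-bit
    absmax = max(abs(x) for x in block)
    # Guard against an all-zero block, where the scale would be zero.
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return scale, q


def dequantize_symmetric(scale, q):
    """Approximate the original floats as scale * q (no offset needed)."""
    return [scale * v for v in q]


weights = [0.7, -0.35, 0.0, 0.1]           # made-up example block
scale, q = quantize_symmetric(weights)
restored = dequantize_symmetric(scale, q)
```

Note that zero always maps exactly to zero here, and the largest-magnitude weight lands on the edge of the integer range. If the real weights are skewed to one side, part of that range goes unused, which is the case the asymmetric variant is meant to handle.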