Affine quantization maps each floating-point scalar weight to a lower-bit integer through a scale factor and, optionally, an offset.
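To make the mapping concrete, here is a minimal sketch in plain Python. The function names (`affine_quantize`, `affine_dequantize`), the min/max choice of range, and the unsigned integer grid are assumptions made for illustration. They are not taken from any particular library and do not reproduce a specific on-disk format.

```python
def affine_quantize(weights, bits):
    """Map floats onto the integer grid [0, 2**bits - 1] with a scale and an offset."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    # Guard against a constant input, where the range would be zero.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = [min(qmax, max(0, round((w - lo) / scale))) for w in weights]
    return q, scale, lo


def affine_dequantize(q, scale, offset):
    """Approximate reconstruction: w is roughly q * scale + offset."""
    return [qi * scale + offset for qi in q]


weights = [-1.0, -0.25, 0.0, 0.5, 2.0]
q, scale, offset = affine_quantize(weights, bits=4)
restored = affine_dequantize(q, scale, offset)
```

Each restored value differs from the original by at most about half a quantization step (`scale / 2`), which is the price paid for storing 4-bit integers instead of floats.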
Legacy Quantization comes in two main subcategories that correspond to symmetric and asymmetric quantization:
- Type 0 (symmetric): Q4_0, Q5_0, Q8_0
- Type 1 (asymmetric): Q4_1, Q5_1, Q8_1
The number after the Q in the naming convention indicates the bit width, e.g. Q4 means (most) weights are stored as 4-bit integers.
The _0 and _1 suffixes indicate the type.
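The naming convention can be read mechanically. The helper below, `parse_legacy_quant_name`, is a hypothetical function written for this explanation and is not part of any library. It only handles names of the form `Q<bits>_<0|1>`.

```python
import re


def parse_legacy_quant_name(name):
    """Split a legacy quantization name into (bit width, type description)."""
    match = re.fullmatch(r"Q(\d+)_([01])", name)
    if match is None:
        raise ValueError(f"not a legacy quantization name: {name!r}")
    bits = int(match.group(1))
    # Suffix _0 marks Type 0 (symmetric), _1 marks Type 1 (asymmetric).
    kind = "symmetric" if match.group(2) == "0" else "asymmetric"
    return bits, kind


parsed = {name: parse_legacy_quant_name(name) for name in ("Q4_0", "Q5_1", "Q8_0")}
```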
Type 0 (Symmetric Quantization)
Type 0 quantization assumes the weight range is symmetric around zero.
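Under that assumption a single scale factor is enough, because zero maps to zero and no offset is needed. The following is a simplified sketch of the idea, assuming a signed integer grid of `[-qmax, qmax]` and a scale taken from the largest absolute weight. The function names are hypothetical, and the rounding and storage details of real Type 0 formats may differ.

```python
def symmetric_quantize(weights, bits):
    """Quantize with a scale only: w is approximated by q * scale."""
    qmax = (1 << (bits - 1)) - 1  # e.g. 7 for 4 bits
    amax = max(abs(w) for w in weights)
    # Guard against an all-zero input, where the scale would be zero.
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def symmetric_dequantize(q, scale):
    """Reconstruct approximate weights from integers and the shared scale."""
    return [qi * scale for qi in q]


block = [-0.7, -0.1, 0.0, 0.35, 0.7]
q_sym, scale_sym = symmetric_quantize(block, bits=4)
restored_sym = symmetric_dequantize(q_sym, scale_sym)
```

Note that a weight of exactly zero is reconstructed exactly, and the positive and negative extremes use the same number of integer levels.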