Affine quantization maps each floating-point scalar weight to a lower-bit integer weight.
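As a rough illustration of that mapping, here is a minimal sketch in plain Python. The function names, the min/max-based choice of scale and offset, and the 4-bit default are assumptions made for this example; they are not taken from any particular library.

```python
def affine_quantize(weights, bits=4):
    """Map floats to unsigned integer codes in [0, 2**bits - 1].

    Hypothetical helper: scale and offset are derived from the min/max
    of the input, which is one common choice among several.
    """
    qmax = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to 1.0 for a constant input, where the range is zero.
    scale = (w_max - w_min) / qmax or 1.0
    codes = [min(qmax, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def affine_dequantize(codes, scale, offset):
    """Reconstruct approximate floats as scale * q + offset."""
    return [scale * q + offset for q in codes]


weights = [-0.5, 0.0, 0.25, 1.0]
codes, scale, offset = affine_quantize(weights, bits=4)
restored = affine_dequantize(codes, scale, offset)
```

Each restored value differs from its original by at most about half of `scale`, which is the rounding error introduced by the mapping.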
Legacy Quantization comes in two main subcategories that correspond to symmetric and asymmetric quantization:
- Type 0 (symmetric): Q4_0, Q5_0, Q8_0
- Type 1 (asymmetric): Q4_1, Q5_1, Q8_1
The number after the Q in the naming convention indicates the bit width, e.g. Q4 means (most) weights are stored as 4-bit integers. The _0 and _1 suffixes indicate the type.
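The naming convention can be decoded mechanically. The sketch below is only an illustration: `parse_legacy_quant_name` is a hypothetical helper written for this example, not a function from any existing tool, and it handles only the legacy names listed above.

```python
import re


def parse_legacy_quant_name(name):
    """Split a legacy name such as 'Q4_0' into (bit width, type label)."""
    match = re.fullmatch(r"Q(\d+)_([01])", name)
    if match is None:
        raise ValueError(f"not a legacy quantization name: {name!r}")
    bits = int(match.group(1))
    # Suffix _0 is Type 0 (symmetric), suffix _1 is Type 1 (asymmetric).
    kind = "symmetric" if match.group(2) == "0" else "asymmetric"
    return bits, kind


for name in ["Q4_0", "Q5_1", "Q8_0"]:
    bits, kind = parse_legacy_quant_name(name)
    print(f"{name}: {bits}-bit, {kind}")
```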
Type 0 (Symmetric Quantization)
Type 0 quantization assumes the weight range is symmetric around zero.
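To make the symmetry assumption concrete, here is a generic sketch of symmetric quantization in plain Python. It assumes signed integer codes and a single scale taken from the largest absolute weight; the function names are hypothetical, and the exact storage layout and rounding used by the Type 0 formats may differ from this simplified version.

```python
def symmetric_quantize(block, bits=4):
    """Quantize with a scale only: the range [-amax, amax] is centered on zero."""
    qmax = (1 << (bits - 1)) - 1  # e.g. 7 for 4-bit signed codes
    amax = max(abs(w) for w in block)
    # An all-zero block has no range; any non-zero scale works there.
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
    return codes, scale


def symmetric_dequantize(codes, scale):
    """Reconstruct approximate floats as scale * q, with no offset term."""
    return [scale * q for q in codes]


block = [-0.7, -0.1, 0.0, 0.35, 0.7]
codes, scale = symmetric_quantize(block, bits=4)
restored = symmetric_dequantize(codes, scale)
```

Because no offset is stored, a weight of exactly zero always maps back to exactly zero. The trade-off is that if the weights are skewed to one side, part of the integer range goes unused.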