tinyML Asia 2021 Dongsoo Lee: Extremely low-bit quantization for Transformers

DongSoo LEE 이동수, Executive Officer, NAVER CLOVA

Deploying the widely used Transformer architecture is challenging because of its heavy computation load and memory overhead during inference, especially when the target device has limited computational resources, such as mobile or edge devices. Quantization is an effective technique for addressing these challenges. Our analysis shows that, for a given number of quantization bits, each block of a Transformer contributes to model accuracy and inference computation in a different manner. Moreover, even within an embedding block, individual words make vastly different contributions. Correspondingly, we propose a mixed precision quantization strategy to represent Transformer weights with an extremely low number of bits (e.g., under 3 bits). For example, for each word in an embedding block, we assign a different number of quantization bits based on its statistical properties. We also introduce a new
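To make the per-word mixed precision idea concrete, here is a minimal NumPy sketch. It is only an illustration under assumptions not stated in the abstract: it uses word frequency as the "statistical property", gives frequent words more bits, and applies plain uniform quantization to each embedding row. The function names, bit budgets, and frequency-based policy are hypothetical and are not the method presented in the talk.

```python
import numpy as np

def assign_bits_by_frequency(word_freqs, low_bits=2, high_bits=3, top_fraction=0.1):
    """Hypothetical policy: the most frequent words keep more bits."""
    order = np.argsort(-np.asarray(word_freqs))           # most frequent first
    n_high = max(1, int(top_fraction * len(word_freqs)))  # top fraction gets high_bits
    bits = np.full(len(word_freqs), low_bits, dtype=int)
    bits[order[:n_high]] = high_bits
    return bits

def quantize_row(row, n_bits):
    """Uniform symmetric quantization of one embedding row to n_bits."""
    levels = 2 ** n_bits - 1
    scale = np.abs(row).max() / (levels / 2) + 1e-12
    q = np.clip(np.round(row / scale), -(levels // 2), levels // 2)
    return q * scale                                       # dequantized reconstruction

# Example: quantize an embedding table with mixed per-word precision.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(1000, 64))                    # vocab_size x embedding_dim
word_freqs = rng.zipf(1.3, size=1000)                      # stand-in word statistics
bits = assign_bits_by_frequency(word_freqs)
quantized = np.stack([quantize_row(r, b) for r, b in zip(embedding, bits)])
```

In this sketch the average bit width stays under 3 bits per weight because only a small fraction of rows use the higher budget; the actual bit-assignment criterion and quantization scheme used in the talk may differ.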