
Quantizing deep convolutional networks for efficient inference: A whitepaper
We present an overview of techniques for quantizing convolutional neural...
read it
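The whitepaper teased above surveys quantization techniques for convolutional networks. As illustration only (not taken from the paper itself), here is a minimal sketch of uniform affine quantization, the basic scheme such overviews build on: a float tensor is mapped to 8-bit integers via a scale and zero point, and dequantized back with bounded error.

```python
import numpy as np

def quantize_uniform_affine(x, num_bits=8):
    """Map a float tensor to unsigned integers with a scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    # Extend the range to cover zero so 0.0 maps to an exact integer.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values; error is at most scale / 2."""
    return scale * (q.astype(np.float32) - zero_point)
```

The function names here are hypothetical; real frameworks expose equivalent "fake quantization" ops with the same scale/zero-point parameterization.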

SYQ: Learning Symmetric Quantization For Efficient Deep Neural Networks
Inference for state-of-the-art deep neural networks is computationally e...
read it

High-Accuracy Inference in Neuromorphic Circuits using Hardware-Aware Training
Neuromorphic Multiply-And-Accumulate (MAC) circuits utilizing synaptic w...
read it

Trained Uniform Quantization for Accurate and Efficient Neural Network Inference on Fixed-Point Hardware
We propose a method of training quantization clipping thresholds for uni...
read it
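The entry above proposes training quantization clipping thresholds. The paper's exact method isn't shown here; the following is a generic sketch in the same spirit (PACT/TQT-style), fitting a symmetric clipping threshold `t` by gradient descent on reconstruction error, with a straight-through estimate of the gradient through rounding. All names are hypothetical.

```python
import numpy as np

def fake_quant(x, t, num_bits=8):
    """Symmetric uniform quantize-dequantize with clipping threshold t."""
    qmax = 2 ** (num_bits - 1) - 1
    s = t / qmax                         # step size is tied to the threshold
    return np.round(np.clip(x, -t, t) / s) * s

def train_threshold(x, num_bits=8, lr=0.01, steps=200):
    """Fit t by gradient descent on MSE, with straight-through estimates of
    dq/dt: (q - x)/t inside the clip range, sign(x) for clipped elements."""
    t = float(np.abs(x).max())           # start with no clipping
    for _ in range(steps):
        q = fake_quant(x, t, num_bits)
        clipped = np.abs(x) >= t
        dq_dt = np.where(clipped, np.sign(x), (q - x) / t)
        t = max(t - lr * np.mean(2.0 * (q - x) * dq_dt), 1e-6)
    return t
```

On heavy-tailed data the learned threshold pulls in below the tensor maximum, trading a little clipping error for finer resolution on the bulk of the values.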

WRPN: Training and Inference using Wide Reduced-Precision Networks
For computer vision applications, prior works have shown the efficacy of...
read it

Minimizing Area and Energy of Deep Learning Hardware Design Using Collective Low Precision and Structured Compression
Deep learning algorithms have shown tremendous success in many recogniti...
read it

WRPN: Wide Reduced-Precision Networks
For computer vision applications, prior works have shown the efficacy of...
read it

VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference
Quantization enables efficient acceleration of deep neural networks by reducing model memory footprint and exploiting low-cost integer math hardware units. Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors. Excessive quantization, reducing precision too aggressively, results in accuracy degradation. When scale factors are shared at a coarse granularity across many dimensions of each tensor, the effective precision of individual elements within the tensor is limited. To reduce quantization-related accuracy loss, we propose using a separate scale factor for each small vector of (≈16-64) elements within a single dimension of a tensor. To achieve an efficient hardware implementation, the per-vector scale factors can be implemented with low-bitwidth integers when calibrated using a two-level quantization scheme. We find that per-vector scaling consistently achieves better inference accuracy at low precision compared to conventional scaling techniques for popular neural networks without requiring retraining. We also modify a deep learning accelerator hardware design to study the area and energy overheads of per-vector scaling support. Our evaluation demonstrates that per-vector scaled quantization with 4-bit weights and activations achieves 37% area and 24% energy savings. 4-bit weights and 8-bit activations achieve near-full-precision accuracy for both BERT-base and BERT-large on SQuAD while reducing area by 26% compared to an 8-bit baseline.
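The two-level scheme described in the abstract can be sketched as follows. This is an illustrative reconstruction from the abstract alone, not the paper's reference implementation, and all function and variable names are assumptions: level one gives each small vector its own scale so an outlier in one vector cannot waste the integer range of the others; level two quantizes those per-vector scales to low-bitwidth integers against a single floating-point scale per tensor, so hardware needs only one FP multiplier per tensor.

```python
import numpy as np

def vs_quant(x, num_bits=4, vec_size=16, scale_bits=8):
    """Two-level per-vector scaled quantization (sketch).

    x.size must be divisible by vec_size; each contiguous vector of
    vec_size elements gets its own scale factor.
    """
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 7 for 4-bit signed
    v = x.reshape(-1, vec_size)

    # Level 1: per-vector floating-point scales.
    fp_scales = np.abs(v).max(axis=1, keepdims=True) / qmax
    fp_scales[fp_scales == 0] = 1.0           # avoid divide-by-zero

    # Level 2: quantize the scale factors themselves to scale_bits integers
    # against one floating-point scale for the whole tensor.
    smax = 2 ** scale_bits - 1
    tensor_scale = fp_scales.max() / smax
    int_scales = np.clip(np.round(fp_scales / tensor_scale), 1, smax)

    # Quantize values with the (integer scale x tensor scale) step size.
    q = np.clip(np.round(v / (int_scales * tensor_scale)), -qmax - 1, qmax)
    dequant = (q * int_scales * tensor_scale).reshape(x.shape)
    return q.astype(np.int8), int_scales.astype(np.uint8), tensor_scale, dequant
```

A dot-product engine can then accumulate the int4 values per vector, multiply each partial sum by the vector's small integer scale, and apply the single floating-point tensor scale once at the end.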
read it