
A Cost-Effective CNN Accelerator Design with Configurable PU on FPGA

Bibliographic Details
Main Authors: Fong, Chi Fung Brian, Mu, Jiandong, Zhang, Wei
Format: Conference Proceeding
Language: English
Description
Summary: Convolutional neural networks (CNNs) are rapidly expanding and being applied to a vast range of applications. Despite their popularity, deploying CNNs on a portable system is challenging due to enormous data volume, intensive computation, and frequent memory access. Hence, many approaches have been proposed to reduce CNN model complexity, such as model pruning and quantization. However, these approaches also bring new challenges. For example, existing designs usually adopt tiling along the channel dimension, which requires a regular channel number; after pruning, the channel number may become highly irregular, incurring heavy zero padding and large resource waste. As for quantization, simple aggressive bit reduction usually results in a large accuracy drop. To address these challenges, we first propose row-based tiling in the kernel dimension, which adapts to different kernel sizes and channel numbers and significantly reduces zero padding. Moreover, we develop a configurable processing unit (PU) design that can be dynamically grouped or split to support this tiling flexibility and enable efficient hardware resource sharing. For quantization, we adopt the recently proposed Incremental Network Quantization (INQ) algorithm, which represents weights in a low-bit power-of-two format; the weights can therefore be applied with minimal computational complexity, since expensive multiplications are replaced by cheap shift operations. We further propose an approximate shifter-based processing element (PE) design as the fundamental building block of the PUs to facilitate the convolution computation. Finally, as a case study, an RTL-level implementation of INQ-quantized AlexNet is realized on a standalone FPGA (Stratix V). Compared with state-of-the-art designs, our accelerator achieves 1.87x higher performance, demonstrating the efficiency of the proposed design methods.
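
The multiply-to-shift substitution that makes INQ attractive here can be illustrated with a short sketch. The Python snippet below (a minimal illustration under assumed parameters, not the paper's RTL design; the helper names `quantize_inq` and `shift_multiply` are hypothetical) shows how a weight constrained to the form ±2^e reduces a multiplication by an activation to a bit shift plus an optional sign flip.

```python
import math

# Minimal sketch of INQ-style power-of-two weight arithmetic (hypothetical
# helper names, not the paper's RTL): a weight of the form w = s * 2**e,
# with s in {-1, +1}, turns the product w * x into a shift of x and an
# optional sign flip, so no hardware multiplier is needed.

def quantize_inq(w: float, min_exp: int = -4, max_exp: int = 0):
    """Round |w| to a nearby power of two within [2**min_exp, 2**max_exp].

    Returns (sign, exponent). Zero weights are omitted for brevity; INQ
    itself also quantizes incrementally, group by group, which this
    one-shot sketch does not model.
    """
    sign = 1 if w >= 0 else -1
    exp = round(math.log2(abs(w)))          # nearest exponent in log domain
    exp = max(min_exp, min(max_exp, exp))   # clamp to the representable range
    return sign, exp

def shift_multiply(x: int, sign: int, exp: int) -> int:
    """Compute sign * x * 2**exp with shifts instead of a multiplication."""
    shifted = x << exp if exp >= 0 else x >> -exp   # arithmetic shift
    return shifted if sign > 0 else -shifted

# Example: a weight of 0.26 quantizes to +2**-2, so multiplying the
# activation 40 by it becomes a right shift by 2:
s, e = quantize_inq(0.26)
print(shift_multiply(40, s, e))   # 10, vs. the exact product 40 * 0.26 = 10.4
```

In hardware, the same idea means each PE needs only a shifter and a sign unit rather than a full multiplier, which is what the abstract's approximate shifter-based PE design builds on.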
ISSN: 2159-3477
DOI: 10.1109/ISVLSI.2019.00015