DARKSIDE: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training

Bibliographic Details
Published in: IEEE Open Journal of the Solid-State Circuits Society, 2022, Vol. 2, pp. 231-243
Main Authors: Garofalo, Angelo, Tortorella, Yvan, Perotti, Matteo, Valente, Luca, Nadalini, Alessandro, Benini, Luca, Rossi, Davide, Conti, Francesco
Format: Article
Language: English
Summary: On-chip deep neural network (DNN) inference and training at the Extreme Edge (TinyML) impose strict latency, throughput, accuracy, and flexibility requirements. Heterogeneous clusters are promising solutions to meet this challenge, combining the flexibility of DSP-enhanced cores with the performance and energy boost of dedicated accelerators. We present DARKSIDE, a System-on-Chip with a heterogeneous cluster of eight RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost performance and efficiency on key compute-intensive DNN kernels, the cluster is enriched with three digital accelerators: 1) a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); 2) a minimal-overhead datamover to marshal 1-b to 32-b data on the fly; and 3) a 16-b floating-point tensor product engine (TPE) for tiled matrix-multiplication acceleration. DARKSIDE is implemented in 65-nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W when working on 2-b integer DNN kernels. When targeting floating-point tensor operations, the TPE provides up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency, enough to enable on-chip floating-point training at competitive speed coupled with ultralow-power quantized inference.
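The abstract names the two kernel classes DARKSIDE accelerates: low-data-reuse depthwise convolutions (the integer engine) and tiled matrix multiplications (the FP16 TPE). The plain-C sketches below illustrate what those operations compute. They are written for this record, not taken from the paper: the function names, data layouts, and tile size are assumptions, and int8_t/float stand in for the chip's 2-b to 32-b integer and 16-b floating-point datatypes.

```c
#include <stdint.h>

/* Depthwise convolution: one K x K filter per channel, valid padding.
 * Each weight is reused only within its own channel, so data reuse is
 * low; this is the kernel class the dedicated engine targets. */
void depthwise_conv2d(const int8_t *in,  /* [C][H][W] input feature map   */
                      const int8_t *w,   /* [C][K][K] one filter per chan */
                      int32_t *out,      /* [C][H-K+1][W-K+1] outputs     */
                      int C, int H, int W, int K)
{
    int OH = H - K + 1, OW = W - K + 1;
    for (int c = 0; c < C; c++)
        for (int oy = 0; oy < OH; oy++)
            for (int ox = 0; ox < OW; ox++) {
                int32_t acc = 0;
                for (int ky = 0; ky < K; ky++)
                    for (int kx = 0; kx < K; kx++)
                        acc += in[(c * H + oy + ky) * W + ox + kx]
                             * w[(c * K + ky) * K + kx];
                out[(c * OH + oy) * OW + ox] = acc;
            }
}
```

Because each filter touches only its own channel, a software implementation on a general-purpose core tends to be bandwidth-bound rather than compute-bound, which is why a dedicated engine (up to 30 MAC/cycle) pays off. The TPE instead targets the tiled matrix multiplication at the heart of dense layers and training:

```c
/* Tiled matrix multiplication Cm += A * B, the pattern the FP16 TPE
 * accelerates. Plain float is used for portability; the tile size T
 * is illustrative, not the TPE's. Caller must zero-initialize Cm. */
void matmul_tiled(const float *A,  /* [M][K] */
                  const float *B,  /* [K][N] */
                  float *Cm,       /* [M][N], zero-initialized */
                  int M, int N, int K)
{
    enum { T = 8 }; /* illustrative tile edge */
    for (int i0 = 0; i0 < M; i0 += T)
        for (int j0 = 0; j0 < N; j0 += T)
            for (int k0 = 0; k0 < K; k0 += T)
                for (int i = i0; i < i0 + T && i < M; i++)
                    for (int j = j0; j < j0 + T && j < N; j++) {
                        float acc = Cm[i * N + j];
                        for (int k = k0; k < k0 + T && k < K; k++)
                            acc += A[i * K + k] * B[k * N + j];
                        Cm[i * N + j] = acc;
                    }
}
```

Tiling keeps one T x T output block and the matching slices of A and B hot in local memory, trading loop overhead for data reuse; a hardware tensor engine performs the same blocking with a fixed-size MAC array.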
ISSN: 2644-1349
DOI: 10.1109/OJSSCS.2022.3210082