
Symphony: Orchestrating Sparse and Dense Tensors with Hierarchical Heterogeneous Processing

Bibliographic Details
Published in: ACM Transactions on Computer Systems, 2023-12, Vol. 41 (1-4), p. 1-30, Article 4
Main Authors: Pellauer, Michael, Clemons, Jason, Balaji, Vignesh, Crago, Neal, Jaleel, Aamer, Lee, Donghyuk, O’Connor, Mike, Parashar, Angshuman, Treichler, Sean, Tsai, Po-An, Keckler, Stephen W., Emer, Joel S.
Format: Article
Language: English
Description
Summary: Sparse tensor algorithms are becoming widespread, particularly in the domains of deep learning, graph and data analytics, and scientific computing. Current high-performance broad-domain architectures, such as GPUs, often suffer memory system inefficiencies by moving too much data or moving it too far through the memory hierarchy. To increase performance and efficiency, proposed domain-specific accelerators tailor their architectures to the data needs of a narrow application domain, but as a result cannot be applied to a wide range of algorithms or applications that contain a mix of sparse and dense algorithms. This article proposes Symphony, a hybrid programmable/specialized architecture that focuses on the orchestration of data throughout the memory hierarchy to simultaneously reduce the movement of unnecessary data and data movement distances. Key elements of the Symphony architecture include (1) specialized reconfigurable units aimed not only at roofline floating-point computations but also at supporting data orchestration features, such as address generation, data filtering, and sparse metadata processing; and (2) distribution of computation resources (both programmable and specialized) throughout the on-chip memory hierarchy. We demonstrate that Symphony can match non-programmable ASIC performance on sparse tensor algebra and provide 31× improved runtime and 44× improved energy over a comparably provisioned GPU for these applications.
ISSN: 0734-2071, 1557-7333
DOI: 10.1145/3630007
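
Note: the sketch below is purely illustrative and is not taken from the article; it is a plain-Python compressed-sparse-row (CSR) matrix-vector multiply of the kind the abstract groups under "sparse tensor algebra". The indptr/indices arrays are an example of the sparse metadata such accelerators must traverse, and the inner loop visits only non-zero entries, which is the data-filtering behavior the abstract mentions. Symphony's actual programming model and hardware mechanisms are described in the full text.

# Illustrative sketch only (assumed example, not Symphony's API):
# y = A @ x with A stored in CSR form. The indptr/indices arrays are the
# sparse metadata; traversing them lets the kernel touch only non-zeros,
# i.e. it filters out data that would otherwise move through the hierarchy.

def spmv_csr(indptr, indices, values, x):
    """Multiply a CSR-format sparse matrix by a dense vector x."""
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        # Only the non-zero entries of this row are visited.
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += values[k] * x[indices[k]]
    return y

if __name__ == "__main__":
    # 3x3 matrix [[1, 0, 2], [0, 0, 3], [4, 5, 0]] in CSR form.
    indptr = [0, 2, 3, 5]        # start offset of each row in indices/values
    indices = [0, 2, 2, 0, 1]    # column index of each non-zero
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(spmv_csr(indptr, indices, values, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]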