Asymmetric Resilience for Accelerator-Rich Systems

Bibliographic Details
Published in:	IEEE computer architecture letters 2019-01, Vol.18 (1), p.83-86
Main Authors:	Leng, Jingwen; Buyuktosunoglu, Alper; Bertran, Ramon; Bose, Pradip; Reddi, Vijay Janapa
Format:	Article
Language:	English
Subjects:	accelerator architecture; Accelerators; Asymmetry; error recovery; Graphics processing units; heterogenous system; Kernel; Memory management; Power efficiency; Reliability; Resilience; Runtime; soft errors; Task analysis; voltage noise

Description
Summary:	Accelerators are becoming popular owing to their exceptional performance and power efficiency. However, researchers have yet to pay close attention to their reliability, a key challenge as technology scaling makes reliable systems harder to build. A straightforward solution is to design each accelerator from the ground up to be reliable by itself. However, such a myopic view of the system, where each accelerator is designed in isolation, is unsustainable as the number of accelerators integrated in SoCs continues to rise. To address this challenge, we propose a paradigm called "asymmetric resilience" that avoids accelerator-specific reliability design. Instead, its core principle is to develop the reliable heterogeneous system around the CPU architecture. We explain the implications of architecting such a system and the modifications a heterogeneous system needs in order to adopt this approach. As an example, we demonstrate how to use asymmetric resilience to handle GPU execution errors using the CPU with minimal overhead. The general principles can be extended to other accelerators.
ISSN:	1556-6056; 1556-6064
DOI:	10.1109/LCA.2019.2917898