Loading…

Efficient Fault-Criticality Analysis for AI Accelerators using a Neural Twin

Owing to the inherent fault tolerance of deep neural network (DNN) models used for classification, many structural faults in the processing elements (PEs) of a systolic array-based AI accelerator are functionally benign. Brute-force fault simulation for determining fault criticality is computational...

Full description

Saved in:
Bibliographic Details
Main Authors: Chaudhuri, Arjun, Chen, Ching-Yuan, Talukdar, Jonti, Madala, Siddarth, Dubey, Abhishek Kumar, Chakrabarty, Krishnendu
Format: Conference Proceeding
Language:English
Subjects:
Online Access:Request full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Owing to the inherent fault tolerance of deep neural network (DNN) models used for classification, many structural faults in the processing elements (PEs) of a systolic array-based AI accelerator are functionally benign. Brute-force fault simulation for determining fault criticality is computationally expensive due to many potential fault sites in the accelerator array and the dependence of criticality characterization of PEs on the functional input data. Supervised learning techniques can be used to accurately estimate fault criticality but it requires ground truth for model training. The ground-truth collection involves extensive and computationally expensive fault simulations. We present a framework for analyzing fault criticality with a negligible amount of ground-truth data. We incorporate the gate-level structural and functional information of the PEs in their "neural twins", referred to as "PE-Nets". The PE netlist is translated into a trainable PE-Net, where the standard-cell instances are substituted by their corresponding "Cell-Nets" and the wires translate to neural connections. Each Cell-Net is a pre-trained DNN that models the Boolean-logic behavior of the corresponding standard cell. In the PE-Net, every neural connection is associated with a bias that represents a perturbation in the signal propagated by that connection. We utilize a recently proposed misclassification-driven training algorithm to sensitize and identify biases that are critical to the functioning of the accelerator for a given application workload. The proposed framework achieves up to 100% accuracy in fault-criticality classification in 16-bit and 32-bit PEs by using the criticality knowledge of only 2% of the total faults in a PE.
ISSN:2378-2250
DOI:10.1109/ITC50571.2021.00015