Inter-kernel Reuse-aware Thread Block Scheduling

Bibliographic Details
Published in:	ACM Transactions on Architecture and Code Optimization, 2020-08, Vol. 17 (3), p. 1-27
Main Authors:	Huzaifa, Muhammad; Alsop, Johnathan; Mahmoud, Abdulrahman; Salvador, Giordano; Sinclair, Matthew D.; Adve, Sarita V.
Format:	Article
Language:	English

Description
Summary:	As GPUs have become more programmable, their performance and energy benefits have made them increasingly popular. However, while GPU compute units continue to improve in performance, on-chip memories lag behind and data accesses are becoming increasingly expensive in performance and energy. Emerging GPU coherence protocols can mitigate this bottleneck by exploiting data reuse in GPU caches across kernel boundaries. Unfortunately, current GPU thread block schedulers are typically not designed to expose such reuse. This article proposes new hardware thread block schedulers that optimize inter-kernel reuse while using work stealing to preserve load balance. Our schedulers are simple, decentralized, and have extremely low overhead. Compared to a baseline round-robin scheduler, the best performing scheduler reduces average execution time and energy by 19% and 11%, respectively, in regular applications, and 10% and 8%, respectively, in irregular applications.
ISSN:	1544-3566, 1544-3973
DOI:	10.1145/3406538