Co-Designing an OpenMP GPU Runtime and Optimizations for Near-Zero Overhead Execution

Bibliographic Details
Main Authors: Doerfert, Johannes, Patel, Atmn, Huber, Joseph, Tian, Shilei, Diaz, Jose M. Monsalve, Chapman, Barbara, Georgakoudis, Giorgis
Format: Conference Proceeding
Language: English
Description
Summary: GPU accelerators are ubiquitous in modern HPC systems. To program them, users choose between vendor-specific, native programming models, such as CUDA, which provide simple parallelism semantics with minimal runtime support, and portable alternatives, such as OpenMP, which offer rich parallel semantics and feature an extensive runtime library to support execution. While the operations of such a runtime can easily limit performance and drain resources, this was to some degree regarded as an unavoidable overhead. In this work we present a co-design methodology for optimizing applications using a specifically crafted OpenMP GPU runtime such that most use cases induce near-zero overhead. Specifically, our approach exposes runtime semantics and state to the compiler so that optimizations can effectively eliminate abstractions and runtime state from the final binary. With the help of user-provided assumptions we can further optimize common patterns that would otherwise increase resource consumption. We evaluated our prototype, built on top of the LLVM/OpenMP GPU offloading infrastructure, with multiple HPC proxy applications and benchmarks. A comparison of CUDA, the original OpenMP runtime, and our co-designed alternative shows that our approach significantly improves performance and lowers resource consumption. Oftentimes we can closely match the CUDA implementation without sacrificing the versatility and portability of OpenMP.
ISSN: 1530-2075
DOI: 10.1109/IPDPS53621.2022.00055