Proteus: Portable Runtime Optimization of GPU Kernel Execution with Just-In-Time Compilation
In high-performance computing (HPC), fast application execution is the primary objective. HPC software is written in high-performance languages (C/C++, Fortran) and is statically compiled ahead-of-time (AOT), prior to execution, using optimizing compilers to generate fast code, typically targeting heterogeneous CPU-GPU architectures. AOT compilation optimizes source code using only the limited information that is statically available or can be inferred at compile time, which precludes optimizations that leverage runtime information.
To lift this limitation, we propose Proteus, an easy-to-use, portable, and lightweight Just-In-Time (JIT) compilation approach that optimizes GPU kernels using runtime information. In contrast to previous JIT compilation solutions that use source code or concrete syntax tree representations bound to a language, our approach dynamically compiles and optimizes using language-agnostic LLVM IR. Further, Proteus extracts runtime information by extending AOT compilation through minimally intrusive developer annotations, which lets it dynamically specialize and optimize GPU kernels for the runtime values of their arguments and threading launch parameters.
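The idea of specializing a kernel for the runtime values of its arguments and launch parameters can be illustrated with a small host-only C++ sketch. This is an analogy, not Proteus's API or implementation: the names `SpecializationCache`, `SpecKey`, and `Kernel` are hypothetical, no LLVM IR or GPU code is involved, and the "compilation" step is simulated by building a closure that captures the runtime values as constants. What it shows is the caching pattern a JIT specializer typically relies on: compile once per distinct combination of runtime values, then reuse the result.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical host-side stand-in for a GPU kernel computing y[i] = a * x[i] + y[i].
using Kernel = std::function<void(const std::vector<double>&, std::vector<double>&)>;

// The runtime values a specialization is keyed on (all illustrative):
// scalar argument `a`, problem size `n`, and threads per block.
using SpecKey = std::tuple<double, std::size_t, unsigned>;

class SpecializationCache {
public:
  // Returns the kernel specialized for these runtime values,
  // "compiling" it on first use and reusing it afterwards.
  const Kernel& get(double a, std::size_t n, unsigned block_size) {
    SpecKey key{a, n, block_size};
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      ++compilations_;  // a real JIT would pay its compilation cost here
      it = cache_.emplace(key, specialize(a, n)).first;
    }
    return it->second;
  }

  // Number of distinct specializations built so far.
  std::size_t compilations() const { return compilations_; }

private:
  // Simulated specialization: `a` and `n` are captured by value, standing in
  // for values a JIT compiler would fold into the generated code as constants.
  static Kernel specialize(double a, std::size_t n) {
    return [a, n](const std::vector<double>& x, std::vector<double>& y) {
      assert(x.size() >= n && y.size() >= n);
      for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
      }
    };
  }

  std::map<SpecKey, Kernel> cache_;
  std::size_t compilations_ = 0;
};
```

Calling `get(2.0, 3, 256)` twice builds one specialization and reuses it on the second call, whereas changing any of the three values triggers a new one. In a real JIT setting the first call for each key carries the dynamic compilation overhead, which is why that overhead matters for end-to-end speedup.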
We evaluate our approach on a diverse set of programs on AMD and NVIDIA GPUs. Results show that Proteus achieves significant end-to-end speedup, up to 2.8$\times$ on AMD and 1.78$\times$ on NVIDIA, compared with typical AOT compilation, recouping the dynamic compilation overhead. Compared with the CUDA-specific Jitify tool, which operates on stringified source code and performs similar runtime optimization, our portable approach achieves 1.23$\times$ higher end-to-end speedup on average, thanks to significantly lower dynamic compilation overhead and, in certain cases, more optimized, faster binary code.
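One way to read the requirement that end-to-end speedup must absorb the dynamic compilation overhead is as a break-even condition. The symbols below are illustrative notation, not taken from the paper: $T_{\mathrm{AOT}}$ is the end-to-end time of the AOT-compiled program, $T_{\mathrm{exec}}$ is the execution time with JIT-specialized kernels, and $T_{\mathrm{comp}}$ is the total time spent on dynamic compilation.

\[
S = \frac{T_{\mathrm{AOT}}}{T_{\mathrm{comp}} + T_{\mathrm{exec}}},
\qquad
S > 1 \iff T_{\mathrm{comp}} < T_{\mathrm{AOT}} - T_{\mathrm{exec}}
\]

Under this simplified model, JIT compilation pays off only when the execution time saved by specialization exceeds the time spent compiling, so lowering $T_{\mathrm{comp}}$ raises $S$ even when the generated code is no faster.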
Tue 4 Mar (displayed time zone: Pacific Time, US & Canada)
14:00 - 15:00
14:00 | 20m | Talk | Code Generation for Cryptographic Kernels Using Multi-word Modular Arithmetic on GPU | Main Conference
14:20 | 20m | Talk | CuAsmRL: optimizing GPU SASS schedules via deep reinforcement learning | Main Conference
14:40 | 20m | Talk | Proteus: Portable Runtime Optimization of GPU Kernel Execution with Just-In-Time Compilation | Main Conference | Giorgis Georgakoudis (Lawrence Livermore National Laboratory), Konstantinos Parasyris (Lawrence Livermore National Laboratory), David Beckingsale (Lawrence Livermore National Laboratory)