Tue 4 Mar 2025 14:40 - 15:00 at Casuarina Ballroom (Level 2) - GPU & Parallelism Chair(s): Bastian Hagedorn

In high-performance computing (HPC), fast application execution is the primary objective. HPC software is written in high-performance languages (C/C++, Fortran) and statically compiled ahead-of-time (AOT), prior to execution, by optimizing compilers that generate fast code, typically targeting heterogeneous CPU-GPU architectures. AOT compilation optimizes source code using only the limited information statically available or inferred at compile time, which precludes optimizations that leverage runtime information.

To lift this limitation, we propose Proteus, an easy-to-use, portable, and lightweight just-in-time (JIT) compilation approach that optimizes GPU kernels using runtime information. In contrast to previous JIT compilation solutions that use source code or concrete syntax tree representations bound to a language, our approach dynamically compiles and optimizes using language-agnostic LLVM IR. Further, Proteus extracts runtime information by extending AOT compilation through minimally intrusive developer annotations, dynamically specializing and optimizing GPU kernels for the runtime values of their arguments and threading launch parameters.

We evaluate our approach on a diverse set of programs on AMD and NVIDIA GPUs. Results show that Proteus achieves significant end-to-end speedup, up to 2.8× on AMD and 1.78× on NVIDIA, compared with typical AOT compilation optimization, recouping the dynamic compilation overhead. Compared with the CUDA-specific Jitify tool, which operates on stringified source code and performs similar runtime optimization, our portable approach achieves 1.23× higher end-to-end speedup on average, thanks to significantly lower dynamic compilation overhead and, in certain cases, more optimized, faster binary code.

Tue 4 Mar

Displayed time zone: Pacific Time (US & Canada)

14:00 - 15:00
14:00
20m
Talk
Code Generation for Cryptographic Kernels Using Multi-word Modular Arithmetic on GPU
Main Conference
Naifeng Zhang Carnegie Mellon University, Franz Franchetti Carnegie Mellon University, USA
14:20
20m
Talk
CuAsmRL: optimizing GPU SASS schedules via deep reinforcement learning
Main Conference
Guoliang He University of Cambridge, Eiko Yoneki University of Cambridge
14:40
20m
Talk
Proteus: Portable Runtime Optimization of GPU Kernel Execution with Just-In-Time Compilation
Main Conference
Giorgis Georgakoudis Lawrence Livermore National Laboratory, Konstantinos Parasyris Lawrence Livermore National Laboratory, David Beckingsale Lawrence Livermore National Laboratory