Proteus: Portable Runtime Optimization of GPU Kernel Execution with Just-In-Time Compilation (CGO 2025 - Main Conference)

Who

Giorgis Georgakoudis, Konstantinos Parasyris, David Beckingsale

Track

CGO 2025 Main Conference

Time Zone

The program is currently displayed in (GMT-08:00) Pacific Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 4 Mar 2025 14:40 - 15:00 at Casuarina Ballroom (Level 2) - GPU & Parallelism Chair(s): Bastian Hagedorn

Abstract

In high-performance computing (HPC), fast application execution is the primary objective. HPC software is written in high-performance languages (C/C++, Fortran) and compiled statically ahead of time (AOT) with optimizing compilers to generate fast code, typically targeting heterogeneous CPU-GPU architectures. AOT compilation optimizes source code using only the limited information that is statically available or can be inferred at compile time, which precludes optimizations that leverage runtime information.

To lift this limitation, we propose Proteus, an easy-to-use, portable, and lightweight Just-In-Time (JIT) compilation approach that optimizes GPU kernels using runtime information. In contrast to previous JIT compilation solutions, which use source code or concrete syntax tree representations bound to a language, our approach dynamically compiles and optimizes using language-agnostic LLVM IR. Further, Proteus extracts runtime information by extending AOT compilation through minimally intrusive developer annotations, which let it dynamically specialize and optimize GPU kernels for the runtime values of their arguments and threading launch parameters.

We evaluate our approach on a diverse set of programs on AMD and NVIDIA GPUs. Results show that Proteus achieves significant end-to-end speedup, up to 2.8× on AMD and 1.78× on NVIDIA, over typical AOT compilation optimization, recouping the dynamic compilation overhead. Compared with the CUDA-specific Jitify tool, which operates on stringified source code and performs similar runtime optimization, our portable approach achieves 1.23× higher end-to-end speedup on average, thanks to significantly lower dynamic compilation overhead and, in certain cases, more optimized, faster binary code.
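
The runtime specialization described in the abstract can be pictured with a small host-side sketch. This is not Proteus code and does not use its API or LLVM: the names `SpecializationCache`, `Kernel`, `scale`, and `block_size` are hypothetical, and a C++ closure stands in for a dynamically compiled GPU kernel. It only illustrates the bookkeeping idea, assuming one specialized variant is built per distinct combination of runtime argument value and launch parameter and then reused, so that the build cost is paid once per combination.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a specialized kernel: a callable with the
// runtime values already bound in.
using Kernel = std::function<void(std::vector<float>&)>;

// Hypothetical cache holding one variant per (scale, block_size) pair.
class SpecializationCache {
public:
    // Returns the variant for these runtime values, building it on the
    // first request. Building stands in for dynamic compilation.
    const Kernel& get(int scale, std::size_t block_size) {
        const auto key = std::make_pair(scale, block_size);
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            ++compilations_;
            it = cache_.emplace(key, build(scale, block_size)).first;
        }
        return it->second;
    }

    // Number of variants built so far (cache misses).
    int compilations() const { return compilations_; }

private:
    static Kernel build(int scale, std::size_t block_size) {
        // Guard against a zero block size, which would never advance.
        const std::size_t step = block_size ? block_size : 1;
        // The captured values play the role of constants that a real JIT
        // compiler could fold into the generated code.
        return [scale, step](std::vector<float>& data) {
            for (std::size_t base = 0; base < data.size(); base += step) {
                for (std::size_t i = base; i < base + step && i < data.size(); ++i) {
                    data[i] *= static_cast<float>(scale);
                }
            }
        };
    }

    std::map<std::pair<int, std::size_t>, Kernel> cache_;
    int compilations_ = 0;
};
```

Calling `get` twice with the same pair builds the variant only once, while a new pair triggers a second build. In this simplified picture, repeated launches with the same runtime values are what allow a one-time compilation cost to be amortized.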

Giorgis Georgakoudis

Lawrence Livermore National Laboratory

United States

Konstantinos Parasyris

Lawrence Livermore National Laboratory

United States

David Beckingsale

Lawrence Livermore National Laboratory

Session Program

Tue 4 Mar
Displayed time zone: Pacific Time (US & Canada)

14:00 - 15:00	GPU & Parallelism (Main Conference) at Casuarina Ballroom (Level 2). Chair(s): Bastian Hagedorn, NVIDIA

14:00, 20m, Talk: Code Generation for Cryptographic Kernels Using Multi-word Modular Arithmetic on GPU (Main Conference)
Naifeng Zhang (Carnegie Mellon University), Franz Franchetti (Carnegie Mellon University, USA)

14:20, 20m, Talk: CuAsmRL: optimizing GPU SASS schedules via deep reinforcement learning (Main Conference)
Guoliang He (University of Cambridge), Eiko Yoneki (University of Cambridge)

14:40, 20m, Talk: Proteus: Portable Runtime Optimization of GPU Kernel Execution with Just-In-Time Compilation (Main Conference)
Giorgis Georgakoudis (Lawrence Livermore National Laboratory), Konstantinos Parasyris (Lawrence Livermore National Laboratory), David Beckingsale (Lawrence Livermore National Laboratory)