CUrator: An Efficient LLM Execution Engine with Optimized Integration of CUDA Libraries
Large Language Models (LLMs) have recently emerged as state-of-the-art learning models with a wide range of applications across diverse computing environments. Among the computational operations that comprise an LLM, GEneral Matrix Multiplication (GEMM) is the most frequently executed. GEMM libraries such as cuBLAS and CUTLASS provide a variety of optimization techniques to achieve high GEMM performance in GPU-enabled computing environments. In particular, CUTLASS, an open-source GPU library for the CUDA programming environment, exposes programmable templates that users can tune for high performance. Previous research has demonstrated the effectiveness of CUTLASS-based GEMMs in improving the performance of real-world deep neural networks on various deep learning platforms. However, these studies have neither considered the model parameters of modern LLMs nor explored the impact of diverse GPU computing environments.
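For readers unfamiliar with CUTLASS, the following minimal sketch (our illustration, not code from the paper) shows what such a programmable template looks like: a single CUTLASS 2.x device-level GEMM whose threadblock, warp, and instruction tile shapes are exactly the kind of template parameters a tuner can vary. The fp16 types, Sm80 target, and tile sizes are illustrative assumptions.

    // Minimal sketch of one CUTLASS 2.x device-level GEMM instantiation
    // (illustrative types and tile shapes; not values taken from the paper).
    #include "cutlass/gemm/device/gemm.h"

    using Gemm = cutlass::gemm::device::Gemm<
        cutlass::half_t, cutlass::layout::RowMajor,     // A: element type, layout
        cutlass::half_t, cutlass::layout::ColumnMajor,  // B: element type, layout
        cutlass::half_t, cutlass::layout::RowMajor,     // C/D: element type, layout
        float,                                          // accumulator type
        cutlass::arch::OpClassTensorOp,                 // run on Tensor Cores
        cutlass::arch::Sm80,                            // target architecture (e.g., A100)
        cutlass::gemm::GemmShape<128, 128, 32>,         // threadblock tile (tunable)
        cutlass::gemm::GemmShape<64, 64, 32>,           // warp tile (tunable)
        cutlass::gemm::GemmShape<16, 8, 16>>;           // MMA instruction shape (tunable)

    // Computes C = A * B on device pointers (alpha = 1, beta = 0).
    cutlass::Status run_gemm(int M, int N, int K,
                             cutlass::half_t const *A, int lda,
                             cutlass::half_t const *B, int ldb,
                             cutlass::half_t *C, int ldc) {
      Gemm gemm_op;
      return gemm_op({{M, N, K},           // problem size
                      {A, lda}, {B, ldb},  // input operands
                      {C, ldc}, {C, ldc},  // epilogue source and destination
                      {1.0f, 0.0f}});      // linear-combination epilogue: alpha, beta
    }

Sweeping these tile shapes, together with the library's other template parameters, over every GEMM in a model is the kind of search the paper's engine automates.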
This paper presents CUrator, an efficient LLM execution engine that achieves optimal end-to-end performance using both the cuBLAS and CUTLASS libraries on different GPUs for modern LLMs such as BERT, GPT, and Llama. CUrator first generates CUTLASS-/cuBLAS-friendly graph IRs for various LLMs on the TVM framework to maximize mapping coverage. On the CUTLASS mapping path, it comprehensively searches the library's programmable tuning parameters to derive optimal kernels for every GEMM in each LLM. CUrator further introduces two optimization techniques: 1) build-time reduction key initialization support for CUTLASS Split-K GEMMs, and 2) Split-K support for CUTLASS Batch GEMMs. Finally, CUrator selects the better-performing of the cuBLAS and CUTLASS mapping paths. Experimental results show that CUrator achieves inference speedups of 1.50x and 4.99x over the baseline for representative LLMs on the A100 GPU in single and half precision, respectively. We believe the CUrator framework can point the way for next-generation tuning frameworks by demonstrating the maximum end-to-end performance of various LLMs on various GPUs.
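To make the Split-K path concrete, below is a hedged sketch of ordinary CUTLASS 2.x Split-K parallel GEMM usage, the kernel family targeted by CUrator's build-time reduction key initialization. This is plain CUTLASS API usage, not CUrator's implementation; the types, target architecture, and split_k_slices value are illustrative assumptions.

    // Hedged sketch: plain CUTLASS 2.x Split-K parallel GEMM (ordinary CUTLASS
    // API, not CUrator's code). split_k_slices partitions the K dimension across
    // threadblocks whose partial products are combined by a separate reduction
    // kernel through a device workspace.
    #include <cstdint>
    #include "cutlass/gemm/device/gemm_splitk_parallel.h"
    #include "cutlass/util/device_memory.h"

    using GemmSplitK = cutlass::gemm::device::GemmSplitKParallel<
        cutlass::half_t, cutlass::layout::RowMajor,     // A
        cutlass::half_t, cutlass::layout::ColumnMajor,  // B
        float, cutlass::layout::RowMajor,               // C/D
        float,                                          // accumulator
        cutlass::arch::OpClassTensorOp,                 // Tensor Cores
        cutlass::arch::Sm80>;                           // e.g., A100

    cutlass::Status run_splitk_gemm(int M, int N, int K,
                                    cutlass::half_t const *A, int lda,
                                    cutlass::half_t const *B, int ldb,
                                    float *C, int ldc,
                                    int split_k_slices) {  // e.g., 8 (illustrative)
      typename GemmSplitK::Arguments args({M, N, K},
                                          {A, lda}, {B, ldb},
                                          {C, ldc}, {C, ldc},
                                          {1.0f, 0.0f},
                                          split_k_slices);

      // Workspace holding per-slice partial results until the reduction runs.
      cutlass::device_memory::allocation<uint8_t> workspace(
          GemmSplitK::get_workspace_size(args));

      GemmSplitK gemm_op;
      cutlass::Status status = gemm_op.initialize(args, workspace.get());
      if (status != cutlass::Status::kSuccess) {
        return status;
      }
      return gemm_op();  // launches the partitioned GEMM and the reduction
    }

Per the abstract, CUrator's two proposed optimizations operate around this kernel family: moving the Split-K reduction key initialization to build time, and extending Split-K partitioning to CUTLASS Batch GEMMs.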
Mon 3 Mar (times in Pacific Time, US & Canada)
15:40 - 16:40 | ML Compilers, Main Conference, at Willow (Level 2). Chair(s): William S. Moses (University of Illinois Urbana-Champaign)
15:40 (20m) Talk | ANT-ACE: An FHE Compiler Framework for Automating Neural Network Inference. Main Conference. Long Li (Ant Group), Jianxin Lai (Ant Group), Peng Yuan (Ant Group), Tianxiang Sui (Ant Group), Yan Liu (Ant Group), Qing Zhu (Ant Group), Xiaojing Zhang (Ant Group), Linjie Xiao (Ant Group), Wenguang Chen (Tsinghua University; Pengcheng Laboratory), Jingling Xue (UNSW Sydney)
16:00 (20m) Talk | CUrator: An Efficient LLM Execution Engine with Optimized Integration of CUDA Libraries. Main Conference.
16:20 (20m) Talk | Accelerating LLMs using an Efficient GEMM library and Target-aware Optimizations on Real-world PIM Devices. Main Conference. Hyeoncheol Kim (Yonsei University), Taehoon Kim (Rebellions Inc), Taehyeong Park (Yonsei University), Donghyeon Kim (Hanyang University), Yongseung Yu (Yonsei University), Hanjun Kim (Yonsei University), Yongjun Park (Yonsei University)