CUrator: An Efficient LLM Execution Engine with Optimized Integration of CUDA Libraries
Large Language Models (LLMs) have recently emerged as state-of-the-art learning models with a wide range of applications across diverse computing environments. Among the computational operations that comprise an LLM, GEneral Matrix Multiplication (GEMM) is the most frequently executed. GEMM libraries such as cuBLAS and CUTLASS provide a variety of optimization techniques to achieve high GEMM performance in GPU-enabled computing environments. In particular, CUTLASS, an open-source GPU library for the CUDA programming environment, exposes programmable templates that users can tune for high performance. Previous research has demonstrated the effectiveness of CUTLASS-based GEMMs in improving the performance of real-world deep neural networks on various deep learning platforms. However, these studies have neither considered the model parameters of modern LLMs nor explored the impact of diverse GPU computing environments.
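For readers unfamiliar with CUTLASS, the following minimal sketch (our illustration, not code from the paper) shows what such a programmable template looks like: a single CUTLASS 2.x device-level GEMM whose threadblock, warp, and instruction tile shapes are exactly the kind of template parameters a tuner can vary. The fp16 types, Sm80 target, and tile sizes are illustrative assumptions.

    // Minimal sketch of one CUTLASS 2.x device-level GEMM instantiation
    // (illustrative types and tile shapes; not values taken from the paper).
    #include "cutlass/gemm/device/gemm.h"

    using Gemm = cutlass::gemm::device::Gemm<
        cutlass::half_t, cutlass::layout::RowMajor,     // A: element type, layout
        cutlass::half_t, cutlass::layout::ColumnMajor,  // B: element type, layout
        cutlass::half_t, cutlass::layout::RowMajor,     // C/D: element type, layout
        float,                                          // accumulator type
        cutlass::arch::OpClassTensorOp,                 // run on Tensor Cores
        cutlass::arch::Sm80,                            // target architecture (e.g., A100)
        cutlass::gemm::GemmShape<128, 128, 32>,         // threadblock tile (tunable)
        cutlass::gemm::GemmShape<64, 64, 32>,           // warp tile (tunable)
        cutlass::gemm::GemmShape<16, 8, 16>>;           // MMA instruction shape (tunable)

    // Computes C = A * B on device pointers (alpha = 1, beta = 0).
    cutlass::Status run_gemm(int M, int N, int K,
                             cutlass::half_t const *A, int lda,
                             cutlass::half_t const *B, int ldb,
                             cutlass::half_t *C, int ldc) {
      Gemm gemm_op;
      return gemm_op({{M, N, K},           // problem size
                      {A, lda}, {B, ldb},  // input operands
                      {C, ldc}, {C, ldc},  // epilogue source and destination
                      {1.0f, 0.0f}});      // linear-combination epilogue: alpha, beta
    }

Sweeping these tile shapes, together with the library's other template parameters, over every GEMM in a model is the kind of search the paper's engine automates.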
This paper presents CUrator, an efficient LLM execution engine that achieves optimal end-to-end performance using both the cuBLAS and CUTLASS libraries on different GPUs for modern LLMs such as BERT, GPT, and Llama. CUrator first generates CUTLASS-/cuBLAS-friendly graph IRs for various LLMs on the TVM framework to maximize mapping coverage. On the CUTLASS mapping path, it comprehensively searches the library's programmable tuning parameters to derive optimal kernels for every GEMM in each LLM. CUrator further introduces two optimization techniques: 1) build-time reduction key initialization support for CUTLASS Split-K GEMMs, and 2) Split-K support for CUTLASS Batch GEMMs. Finally, CUrator selects the better-performing of the cuBLAS and CUTLASS mapping paths. Experimental results show that CUrator achieves inference speedups of 1.50x and 4.99x over the baseline for representative LLMs on the A100 GPU in single and half precision, respectively. We believe the CUrator framework can point the way for next-generation tuning frameworks by demonstrating the maximum end-to-end performance of various LLMs on various GPUs.
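To make the Split-K path concrete, below is a hedged sketch of ordinary CUTLASS 2.x Split-K parallel GEMM usage, the kernel family targeted by CUrator's build-time reduction key initialization. This is plain CUTLASS API usage, not CUrator's implementation; the types, target architecture, and split_k_slices value are illustrative assumptions.

    // Hedged sketch: plain CUTLASS 2.x Split-K parallel GEMM (ordinary CUTLASS
    // API, not CUrator's code). split_k_slices partitions the K dimension across
    // threadblocks whose partial products are combined by a separate reduction
    // kernel through a device workspace.
    #include <cstdint>
    #include "cutlass/gemm/device/gemm_splitk_parallel.h"
    #include "cutlass/util/device_memory.h"

    using GemmSplitK = cutlass::gemm::device::GemmSplitKParallel<
        cutlass::half_t, cutlass::layout::RowMajor,     // A
        cutlass::half_t, cutlass::layout::ColumnMajor,  // B
        float, cutlass::layout::RowMajor,               // C/D
        float,                                          // accumulator
        cutlass::arch::OpClassTensorOp,                 // Tensor Cores
        cutlass::arch::Sm80>;                           // e.g., A100

    cutlass::Status run_splitk_gemm(int M, int N, int K,
                                    cutlass::half_t const *A, int lda,
                                    cutlass::half_t const *B, int ldb,
                                    float *C, int ldc,
                                    int split_k_slices) {  // e.g., 8 (illustrative)
      typename GemmSplitK::Arguments args({M, N, K},
                                          {A, lda}, {B, ldb},
                                          {C, ldc}, {C, ldc},
                                          {1.0f, 0.0f},
                                          split_k_slices);

      // Workspace holding per-slice partial results until the reduction runs.
      cutlass::device_memory::allocation<uint8_t> workspace(
          GemmSplitK::get_workspace_size(args));

      GemmSplitK gemm_op;
      cutlass::Status status = gemm_op.initialize(args, workspace.get());
      if (status != cutlass::Status::kSuccess) {
        return status;
      }
      return gemm_op();  // launches the partitioned GEMM and the reduction
    }

Per the abstract, CUrator's two proposed optimizations operate around this kernel family: moving the Split-K reduction key initialization to build time, and extending Split-K partitioning to CUTLASS Batch GEMMs.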
Mon 3 Mar (times in Pacific Time, US & Canada)
15:40 - 16:40 | ML Compilers, Main Conference, at Willow (Level 2). Chair(s): William S. Moses (University of Illinois Urbana-Champaign)
15:40 (20m) Talk | ANT-ACE: An FHE Compiler Framework for Automating Neural Network Inference. Main Conference. Long Li (Ant Group), Jianxin Lai (Ant Group), Peng Yuan (Ant Group), Tianxiang Sui (Ant Group), Yan Liu (Ant Group), Qing Zhu (Ant Group), Xiaojing Zhang (Ant Group), Linjie Xiao (Ant Group), Wenguang Chen (Tsinghua University; Pengcheng Laboratory), Jingling Xue (UNSW Sydney)
16:00 (20m) Talk | CUrator: An Efficient LLM Execution Engine with Optimized Integration of CUDA Libraries. Main Conference.
16:20 (20m) Talk | Accelerating LLMs using an Efficient GEMM library and Target-aware Optimizations on Real-world PIM Devices. Main Conference. Hyeoncheol Kim (Yonsei University), Taehoon Kim (Rebellions Inc), Taehyeong Park (Yonsei University), Donghyeon Kim (Hanyang University), Yongseung Yu (Yonsei University), Hanjun Kim (Yonsei University), Yongjun Park (Yonsei University)