Accelerating LLMs using an Efficient GEMM library and Target-aware Optimizations on Real-world PIM Devices
Real-time processing of deep learning models on conventional systems such as CPUs and GPUs is highly challenging due to memory bottlenecks. The problem is exacerbated in Large Language Models (LLMs), whose execution is dominated by General Matrix Multiplication (GEMM) operations, which are more memory-intensive than convolution operations. Processing-in-Memory (PIM), which provides high internal bandwidth, is a promising alternative for LLM serving. However, since current PIM systems do not fully replace conventional memory, data transfer between the host and PIM-side memory remains essential. Minimizing this host-PIM transfer cost is therefore crucial for serving LLMs efficiently on PIM.
In this paper, we propose PIM-LLM, an end-to-end framework that accelerates LLMs using an efficient tiled GEMM library and several key target-aware optimizations on real-world PIM systems. We first propose PGEMMlib, which provides tiling techniques optimized for PIM, taking architecture-specific characteristics into account to minimize unnecessary data transfer overhead and maximize parallelism. In addition, the Tile-Selector uses an analytical model to explore optimized tiling parameters and techniques for different GEMM shapes and the available resources of the PIM system. To accelerate LLMs with PGEMMlib, we integrate it into the TVM deep learning compiler framework. We further optimize LLM execution by applying several key optimizations: build-time memory layout adjustment, PIM resource pooling, CPU/PIM cooperation support, and QKV generation fusion. Our evaluation shows that PIM-LLM achieves performance gains of up to 45.75x over the TVM baseline for several well-known LLMs. We believe this work provides key insights for efficient LLM serving on real PIM devices.
Authors: Hyeoncheol Kim (Yonsei University), Taehoon Kim (Rebellions Inc.), Taehyeong Park (Yonsei University), Donghyeon Kim (Hanyang University), Yongseung Yu (Yonsei University), Hanjun Kim (Yonsei University), Yongjun Park (Yonsei University)
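The abstract describes PGEMMlib's tiling and the Tile-Selector's analytical model only at a high level. The sketch below illustrates, under strong simplifying assumptions, how a cost-model-driven tile selection for a PIM-targeted GEMM could look: it estimates host-PIM transfer time and in-PIM compute time for candidate tile shapes and keeps the cheapest one that fits in a bank. All names (PIMConfig, tile_cost, select_tile), the capacity and bandwidth numbers, and the cost formula are illustrative assumptions, not the paper's actual Tile-Selector or PGEMMlib API.

```python
# Minimal sketch of analytical tile selection for a PIM-targeted GEMM.
# The device model and cost formula below are assumptions for illustration only.
from dataclasses import dataclass
from itertools import product

@dataclass
class PIMConfig:
    num_banks: int = 16           # parallel PIM execution units (assumed)
    bank_bytes: int = 256 * 1024  # per-bank scratch capacity in bytes (assumed)
    xfer_gbps: float = 25.0       # host <-> PIM transfer bandwidth, GB/s (assumed)
    pim_gflops: float = 300.0     # aggregate in-PIM compute throughput, GFLOP/s (assumed)

def tile_cost(M, N, K, tm, tn, tk, cfg, dtype_bytes=2):
    """Estimate execution time (s) of one (tm, tn, tk) tiling of an MxKxN GEMM."""
    tiles = -(-M // tm) * -(-N // tn) * -(-K // tk)          # ceil-div tile count
    # Bytes moved between host and PIM per tile: A tile, B tile, and C partial sums.
    bytes_per_tile = (tm * tk + tk * tn + tm * tn) * dtype_bytes
    transfer_s = tiles * bytes_per_tile / (cfg.xfer_gbps * 1e9)
    compute_s = 2.0 * M * N * K / (cfg.pim_gflops * 1e9)     # 2*M*N*K FLOPs in total
    # Tiles are assumed to execute across banks in parallel; transfers are serialized.
    return transfer_s + compute_s / min(cfg.num_banks, tiles)

def select_tile(M, N, K, cfg, candidates=(32, 64, 128, 256), dtype_bytes=2):
    """Pick the tile shape with the lowest estimated cost that fits in one bank."""
    best, best_cost = None, float("inf")
    for tm, tn, tk in product(candidates, repeat=3):
        footprint = (tm * tk + tk * tn + tm * tn) * dtype_bytes
        if footprint > cfg.bank_bytes:
            continue                                         # tile exceeds bank capacity
        cost = tile_cost(M, N, K, tm, tn, tk, cfg, dtype_bytes)
        if cost < best_cost:
            best, best_cost = (tm, tn, tk), cost
    return best, best_cost

if __name__ == "__main__":
    # Example GEMM shape resembling an LLM projection layer during decoding (illustrative).
    cfg = PIMConfig()
    tile, cost = select_tile(M=1, N=4096, K=4096, cfg=cfg)
    print(f"selected tile {tile}, estimated time {cost * 1e3:.3f} ms")
```

A real Tile-Selector as described in the paper would additionally account for PIM command scheduling, bank-level data layout, and the tiling techniques chosen per GEMM shape; this sketch only conveys the general idea of searching tile parameters with an analytical cost model.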