Accelerating LLMs using an Efficient GEMM library and Target-aware Optimizations on Real-world PIM Devices
Real-time processing of deep learning models on conventional systems such as CPUs and GPUs is highly challenging due to memory bottlenecks. The problem is exacerbated in Large Language Models (LLMs), whose execution is dominated by General Matrix Multiplication (GEMM) operations, which are more memory-intensive than convolution operations. Processing-in-Memory (PIM), which provides high internal bandwidth, is a promising alternative for LLM serving. However, since current PIM systems do not fully replace conventional memory, data transfer between the host and PIM-side memory remains essential. Minimizing this host-PIM transfer cost is therefore crucial for serving LLMs efficiently on PIM.
In this paper, we propose PIM-LLM, an end-to-end framework that accelerates LLMs using an efficient tiled GEMM library and several key target-aware optimizations on real-world PIM systems. We first propose PGEMMlib, which provides tiling techniques optimized for PIM, taking architecture-specific characteristics into account to minimize unnecessary data transfer overhead and maximize parallelism. In addition, the Tile-Selector uses an analytical model to explore optimized tiling parameters and techniques for different GEMM shapes and the available resources of the PIM system. To accelerate LLMs with PGEMMlib, we integrate it into the TVM deep learning compiler framework. We further optimize LLM execution by applying several key optimizations: build-time memory layout adjustment, PIM resource pooling, CPU/PIM cooperation support, and QKV generation fusion. Our evaluation shows that PIM-LLM achieves performance gains of up to 45.75x over the TVM baseline for several well-known LLMs. We believe this work provides key insights for efficient LLM serving on real PIM devices.
Authors: Hyeoncheol Kim (Yonsei University), Taehoon Kim (Rebellions Inc.), Taehyeong Park (Yonsei University), Donghyeon Kim (Hanyang University), Yongseung Yu (Yonsei University), Hanjun Kim (Yonsei University), Yongjun Park (Yonsei University)
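The abstract describes PGEMMlib's tiling and the Tile-Selector's analytical model only at a high level. The sketch below illustrates, under strong simplifying assumptions, how a cost-model-driven tile selection for a PIM-targeted GEMM could look: it estimates host-PIM transfer time and in-PIM compute time for candidate tile shapes and keeps the cheapest one that fits in a bank. All names (PIMConfig, tile_cost, select_tile), the capacity and bandwidth numbers, and the cost formula are illustrative assumptions, not the paper's actual Tile-Selector or PGEMMlib API.

```python
# Minimal sketch of analytical tile selection for a PIM-targeted GEMM.
# The device model and cost formula below are assumptions for illustration only.
from dataclasses import dataclass
from itertools import product

@dataclass
class PIMConfig:
    num_banks: int = 16           # parallel PIM execution units (assumed)
    bank_bytes: int = 256 * 1024  # per-bank scratch capacity in bytes (assumed)
    xfer_gbps: float = 25.0       # host <-> PIM transfer bandwidth, GB/s (assumed)
    pim_gflops: float = 300.0     # aggregate in-PIM compute throughput, GFLOP/s (assumed)

def tile_cost(M, N, K, tm, tn, tk, cfg, dtype_bytes=2):
    """Estimate execution time (s) of one (tm, tn, tk) tiling of an MxKxN GEMM."""
    tiles = -(-M // tm) * -(-N // tn) * -(-K // tk)          # ceil-div tile count
    # Bytes moved between host and PIM per tile: A tile, B tile, and C partial sums.
    bytes_per_tile = (tm * tk + tk * tn + tm * tn) * dtype_bytes
    transfer_s = tiles * bytes_per_tile / (cfg.xfer_gbps * 1e9)
    compute_s = 2.0 * M * N * K / (cfg.pim_gflops * 1e9)     # 2*M*N*K FLOPs in total
    # Tiles are assumed to execute across banks in parallel; transfers are serialized.
    return transfer_s + compute_s / min(cfg.num_banks, tiles)

def select_tile(M, N, K, cfg, candidates=(32, 64, 128, 256), dtype_bytes=2):
    """Pick the tile shape with the lowest estimated cost that fits in one bank."""
    best, best_cost = None, float("inf")
    for tm, tn, tk in product(candidates, repeat=3):
        footprint = (tm * tk + tk * tn + tm * tn) * dtype_bytes
        if footprint > cfg.bank_bytes:
            continue                                         # tile exceeds bank capacity
        cost = tile_cost(M, N, K, tm, tn, tk, cfg, dtype_bytes)
        if cost < best_cost:
            best, best_cost = (tm, tn, tk), cost
    return best, best_cost

if __name__ == "__main__":
    # Example GEMM shape resembling an LLM projection layer during decoding (illustrative).
    cfg = PIMConfig()
    tile, cost = select_tile(M=1, N=4096, K=4096, cfg=cfg)
    print(f"selected tile {tile}, estimated time {cost * 1e3:.3f} ms")
```

A real Tile-Selector as described in the paper would additionally account for PIM command scheduling, bank-level data layout, and the tiling techniques chosen per GEMM shape; this sketch only conveys the general idea of searching tile parameters with an analytical cost model.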