Convolution is the most time-consuming computation kernel in Convolutional Neural Network (CNN) applications and in the majority of Graph Neural Network (GNN) applications. To achieve good convolution performance, current NN libraries such as cuDNN usually transform the naive convolution problem either into matrix multiplication form, called the $im2col+MM$ approach, or into a Fourier-domain representation, namely the FFT approach. The former introduces a significant amount of data redundancy in order to reduce the number of data passes and to leverage highly tuned linear algebra libraries; the latter avoids data redundancy but requires multiple passes over the input. Consequently, no single implementation of convolution outperforms the others in all cases. In this paper, we introduce a novel transformation that reinterprets the convolution in NNs as a polynomial multiplication problem. By carefully constructing a number of conceptual polynomials, NN convolution becomes the well-known problem of calculating the coefficients of the product of two polynomials. We evaluate our approach in multiple ways: at the API level, comparing it with cuDNN, one of the most widely used NN libraries, and by implementing it in PyTorch and comparing its performance on benchmark networks. On three NVIDIA GPUs, a 3090Ti, a V100, and an A10G, our approach outperforms cuDNN over a broad range of parameters, with maximum speedups of $34.6\%$, $43.1\%$, and $33.6\%$ on these GPUs, respectively.
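
To make the underlying equivalence concrete, here is a minimal NumPy sketch (our own illustration under stated assumptions, not the paper's implementation): the coefficients of the product of two polynomials are exactly the full linear convolution of their coefficient arrays, and the sliding-window "convolution" used in NNs (really cross-correlation) falls out by flipping the filter and keeping the fully overlapped coefficients. All function names below are hypothetical.

```python
# Sketch: 1-D NN convolution recovered from polynomial multiplication.
# Not the paper's algorithm; it only demonstrates the equivalence the
# abstract describes, using naive O(n*k) coefficient products.
import numpy as np

def poly_mul_coeffs(a, b):
    """Coefficients of the product of two polynomials given as
    coefficient arrays (lowest degree first). This is exactly the
    full linear convolution of a and b."""
    out = np.zeros(len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj  # contribution to the x^(i+j) coefficient
    return out

def conv1d_via_poly(x, w):
    """Valid-mode NN convolution (cross-correlation) of input x with
    filter w, derived from polynomial multiplication: flipping the
    filter aligns the polynomial product with sliding dot products."""
    full = poly_mul_coeffs(x, w[::-1])
    k = len(w)
    return full[k - 1 : len(x)]  # keep only fully overlapped positions

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])
print(conv1d_via_poly(x, w))             # [-2. -2. -2.]
print(np.correlate(x, w, mode="valid"))  # same result
```

The same identity is what lets fast polynomial-multiplication machinery stand in for direct convolution; the paper's contribution lies in how the conceptual polynomials are constructed for NN workloads, which this sketch does not attempt to reproduce.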

Tue 4 Mar

Displayed time zone: Pacific Time (US & Canada)

11:20 - 12:20
Optimizations & Transformations (2)
Main Conference at Casuarina Ballroom (Level 2)
Chair(s): Sebastian Hack Saarland University, Saarland Informatics Campus
11:20
20m
Talk
PreFix: Optimizing the Performance of Heap-Intensive Applications
Main Conference
Chaitanya Mamatha Ananda University of California, Riverside, Rajiv Gupta University of California, Riverside (UCR), Sriraman Tallam Google, Han Shen Google, David Li Google
11:40
20m
Talk
A Priori Loop Nest Normalization: Automatic Loop Scheduling in Complex Applications
Main Conference
Lukas Trümper Daisytuner, Philipp Schaad ETH Zurich, Berke Ates ETH Zurich, Alexandru Calotoiu ETH Zurich, Marcin Copik ETH Zurich, Torsten Hoefler ETH Zurich
12:00
20m
Talk
An Efficient Polynomial Multiplication Derived Implementation of Convolution in Neural Networks
Main Conference
Haoke Xu University of Delaware, Yulin Zhang Minzu University of China, Zitong Cheng University of Delaware, Xiaoming Li University of Delaware