Convolution is the most time-consuming computation kernel in Convolutional Neural Network (CNN) applications and in the majority of Graph Neural Network (GNN) applications. To achieve good convolution performance, current NN libraries such as cuDNN usually transform the naive convolution problem either into matrix multiplication form, called the $im2col+MM$ approach, or into a Fourier-domain representation, namely the FFT approach. The former introduces a significant amount of data redundancy in order to reduce the number of data passes and to leverage highly tuned linear algebra libraries; the latter avoids data redundancy but requires multiple passes over the input. Consequently, no single implementation of convolution outperforms the others in all cases. In this paper, we introduce a novel transformation that reinterprets the convolution in NNs as a polynomial multiplication problem. By carefully constructing a number of conceptual polynomials, NN convolution becomes the well-known problem of calculating the coefficients of the product of two polynomials. We evaluate our approach in multiple ways: at the API level, comparing it with cuDNN, one of the most widely used NN libraries, and by implementing it in PyTorch and comparing its performance on benchmark networks. On three NVIDIA GPUs, a 3090Ti, a V100, and an A10G, our approach outperforms cuDNN over a broad range of parameters, with maximum speedups of $34.6\%$, $43.1\%$, and $33.6\%$ on these GPUs, respectively.
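
To make the underlying equivalence concrete, here is a minimal NumPy sketch (our own illustration under stated assumptions, not the paper's implementation): the coefficients of the product of two polynomials are exactly the full linear convolution of their coefficient arrays, and the sliding-window "convolution" used in NNs (really cross-correlation) falls out by flipping the filter and keeping the fully overlapped coefficients. All function names below are hypothetical.

```python
# Sketch: 1-D NN convolution recovered from polynomial multiplication.
# Not the paper's algorithm; it only demonstrates the equivalence the
# abstract describes, using naive O(n*k) coefficient products.
import numpy as np

def poly_mul_coeffs(a, b):
    """Coefficients of the product of two polynomials given as
    coefficient arrays (lowest degree first). This is exactly the
    full linear convolution of a and b."""
    out = np.zeros(len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj  # contribution to the x^(i+j) coefficient
    return out

def conv1d_via_poly(x, w):
    """Valid-mode NN convolution (cross-correlation) of input x with
    filter w, derived from polynomial multiplication: flipping the
    filter aligns the polynomial product with sliding dot products."""
    full = poly_mul_coeffs(x, w[::-1])
    k = len(w)
    return full[k - 1 : len(x)]  # keep only fully overlapped positions

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])
print(conv1d_via_poly(x, w))             # [-2. -2. -2.]
print(np.correlate(x, w, mode="valid"))  # same result
```

The same identity is what lets fast polynomial-multiplication machinery stand in for direct convolution; the paper's contribution lies in how the conceptual polynomials are constructed for NN workloads, which this sketch does not attempt to reproduce.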

Tue 4 Mar

Displayed time zone: Pacific Time (US & Canada)

11:20 - 12:20
Optimizations & Transformations (2)
Main Conference at Casuarina Ballroom (Level 2)
Chair(s): Sebastian Hack Saarland University, Saarland Informatics Campus
11:20
20m
Talk
PreFix: Optimizing the Performance of Heap-Intensive Applications
Main Conference
Chaitanya Mamatha Ananda University of California, Riverside, Rajiv Gupta University of California, Riverside (UCR), Sriraman Tallam Google, Han Shen Google, David Li Google
11:40
20m
Talk
A Priori Loop Nest Normalization: Automatic Loop Scheduling in Complex Applications
Main Conference
Lukas Trümper Daisytuner, Philipp Schaad ETH Zurich, Berke Ates ETH Zurich, Alexandru Calotoiu ETH Zurich, Marcin Copik ETH Zurich, Torsten Hoefler ETH Zurich
12:00
20m
Talk
An Efficient Polynomial Multiplication Derived Implementation of Convolution in Neural Networks
Main Conference
Haoke Xu University of Delaware, Yulin Zhang Minzu University of China, Zitong Cheng University of Delaware, Xiaoming Li University of Delaware