An Efficient Polynomial Multiplication Derived Implementation of Convolution in Neural Networks
Convolution is the most time-consuming computation kernel in Convolutional Neural Network (CNN) applications and in the majority of Graph Neural Network (GNN) applications. To achieve good convolution performance, current NN libraries such as cuDNN usually transform the naive convolution problem either into a matrix multiplication, the $im2col+MM$ approach, or into a Fourier-domain representation, the FFT approach. The former introduces a significant amount of data redundancy in order to reduce the number of passes over the data and to leverage highly tuned linear algebra libraries; the latter avoids data redundancy but requires multiple passes over the input. Consequently, no single implementation of convolution outperforms the others in all cases. In this paper, we introduce a novel transformation that reinterprets convolution in NNs as a polynomial multiplication problem. By carefully constructing a number of conceptual polynomials, NN convolution becomes the well-known problem of calculating the coefficients of the product of two polynomials. We evaluate our approach in multiple ways: at the API level, comparing it with cuDNN, one of the most widely used NN libraries, and as an implementation in PyTorch whose performance we compare on benchmark networks. On three NVIDIA GPUs (3090Ti, V100, and A10G), our approach outperforms cuDNN over a broad range of parameters, with maximum speedups of $34.6\%$, $43.1\%$, and $33.6\%$ on these GPUs, respectively.
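To make the contrast concrete, below is a minimal 1-D NumPy sketch, an illustration rather than the paper's GPU implementation: the names `signal` and `kernel` and their values are assumptions, and the sketch only checks that direct convolution, the $im2col+MM$ reduction, the Fourier-domain product, and the polynomial-coefficient product underlying the abstract's framing all yield the same coefficients.

```python
import numpy as np

# Signal and kernel as coefficient vectors of two polynomials p(x) and q(x).
signal = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, -1.0, 0.25])
n = len(signal) + len(kernel) - 1  # length of the full convolution

# 1) Direct convolution: c[k] = sum_i signal[i] * kernel[k - i].
direct = np.convolve(signal, kernel)

# 2) Polynomial view: the coefficients of p(x) * q(x) are exactly c.
#    np.polymul expects highest-degree coefficients first, hence the flips.
poly = np.polymul(signal[::-1], kernel[::-1])[::-1]

# 3) im2col + MM view: unfold the zero-padded signal into overlapping
#    windows (the data redundancy the abstract mentions) and reduce the
#    convolution to a single matrix multiplication.
padded = np.pad(signal, len(kernel) - 1)
windows = np.lib.stride_tricks.sliding_window_view(padded, len(kernel))
im2col_mm = windows @ kernel[::-1]

# 4) FFT view: pointwise product in the Fourier domain of zero-padded inputs.
fft = np.fft.irfft(np.fft.rfft(signal, n) * np.fft.rfft(kernel, n), n)

assert np.allclose(direct, poly)
assert np.allclose(direct, im2col_mm)
assert np.allclose(direct, fft)
```

Note how the im2col variant materializes every sliding window explicitly, which is exactly where its data redundancy comes from; the paper's contribution is a further formulation built directly on the polynomial view rather than on either of these two classical reductions.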
Tue 4 Mar (Pacific Time, US & Canada)
11:20 - 12:20 | Optimizations & Transformations (2) | Main Conference at Casuarina Ballroom (Level 2)
Chair(s): Sebastian Hack (Saarland University, Saarland Informatics Campus)

11:20 (20m) Talk | PreFix: Optimizing the Performance of Heap-Intensive Applications
Chaitanya Mamatha Ananda (University of California, Riverside), Rajiv Gupta (University of California, Riverside), Sriraman Tallam (Google), Han Shen (Google), David Li (Google)

11:40 (20m) Talk | A Priori Loop Nest Normalization: Automatic Loop Scheduling in Complex Applications
Lukas Trümper (Daisytuner), Philipp Schaad (ETH Zurich), Berke Ates (ETH Zurich), Alexandru Calotoiu (ETH Zurich), Marcin Copik (ETH Zurich), Torsten Hoefler (ETH Zurich)

12:00 (20m) Talk | An Efficient Polynomial Multiplication Derived Implementation of Convolution in Neural Networks
Haoke Xu (University of Delaware), Yulin Zhang (Minzu University of China), Zitong Cheng (University of Delaware), Xiaoming Li (University of Delaware)