Tensor transposition is a fundamental operation in many high-performance tensor computations, such as tensor contractions. However, most existing tensor transposition methods are out-of-place: they require additional space of the same size as the input tensor. This low space utilization is a serious problem for most accelerators, such as Graphics Processing Units (GPUs), whose memory capacity is relatively limited. To reduce memory usage, we propose IT3, a set of in-place transposition algorithms for third-order tensors. To leverage existing optimized in-place matrix transposition kernels, IT3 first decomposes a third-order tensor into smaller units, such as matrices, rows, or columns, depending on the transposition configuration, and then combines different matrix transposition methods to produce the final result. Theoretical analysis shows that IT3 uses additional memory of at most 50% of the input size. We also present efficient GPU implementations of IT3 tuned for modern GPUs. Experiments covering all possible permutations of third-order tensor transposition were conducted, comparing the GPU implementations of IT3 with the state-of-the-art tensor transposition library for GPUs, CUDA Tensor Transpose (cuTT). The results show that IT3 outperforms cuTT not only in memory usage on all benchmarks but also in speed in most cases.
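To make the decomposition idea concrete, the following host-side sketch shows how three of the permutations reduce to matrix transpositions. It is an illustration only, not the thesis's GPU implementation: it assumes column-major storage (the first index varies fastest) and the convention that, for example, a 231 transpose reorders the modes of a d1 x d2 x d3 tensor to (d2, d3, d1). The helper inplace_transpose and the transpose_231/312/213 wrappers are hypothetical names; the helper stands in for an optimized in-place matrix transposition kernel such as Catanzaro's, and for brevity it uses a scratch copy, which the real in-place kernels avoid.

    // Illustrative sketch (hypothetical, not the thesis's kernels): how the
    // 231, 312, and 213 permutations of a contiguously stored d1 x d2 x d3
    // tensor reduce to matrix transpositions.  Column-major storage is
    // assumed: element (i1, i2, i3) lives at offset i1 + d1*i2 + d1*d2*i3.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Hypothetical stand-in for an optimized in-place matrix transposition
    // kernel (e.g., a Catanzaro-style decomposition).  For brevity this
    // version uses a scratch copy; a real in-place kernel does not.
    void inplace_transpose(float* a, std::size_t rows, std::size_t cols) {
        std::vector<float> tmp(a, a + rows * cols);
        for (std::size_t j = 0; j < cols; ++j)
            for (std::size_t i = 0; i < rows; ++i)
                a[j + i * cols] = tmp[i + j * rows];   // (i, j) -> (j, i)
    }

    // 231 transpose (modes reordered to d2, d3, d1): fusing modes 2 and 3
    // turns the whole permutation into one d1 x (d2*d3) matrix transpose.
    void transpose_231(float* t, std::size_t d1, std::size_t d2, std::size_t d3) {
        inplace_transpose(t, d1, d2 * d3);
    }

    // 312 transpose (modes reordered to d3, d1, d2): fusing modes 1 and 2
    // gives one (d1*d2) x d3 matrix transpose.
    void transpose_312(float* t, std::size_t d1, std::size_t d2, std::size_t d3) {
        inplace_transpose(t, d1 * d2, d3);
    }

    // 213 transpose (modes reordered to d2, d1, d3): the last mode is fixed,
    // so the tensor is d3 independent d1 x d2 slices, each transposed in place.
    void transpose_213(float* t, std::size_t d1, std::size_t d2, std::size_t d3) {
        for (std::size_t k = 0; k < d3; ++k)
            inplace_transpose(t + k * d1 * d2, d1, d2);
    }

    int main() {
        // Tiny usage example: a 2 x 3 x 2 tensor filled with 0..11, then
        // permuted with the 231 mapping (result has modes 3 x 2 x 2).
        std::size_t d1 = 2, d2 = 3, d3 = 2;
        std::vector<float> t(d1 * d2 * d3);
        for (std::size_t n = 0; n < t.size(); ++n) t[n] = static_cast<float>(n);
        transpose_231(t.data(), d1, d2, d3);
        for (float v : t) std::printf("%g ", v);   // 0 2 4 6 8 10 1 3 5 7 9 11
        std::printf("\n");
        return 0;
    }

The 132 and 321 permutations do not fuse into a single matrix transpose in this way; as noted above, IT3 handles such cases by combining different matrix transposition methods, which the later chapters describe in detail.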
Contents

Chinese Abstract
Abstract
List of Figures
List of Tables
1 Introduction
2 Background and Related Work
2.1 Tensor Transposition
2.2 In-place Matrix Transposition
2.3 Catanzaro's Algorithm
2.4 Scatter and Gather Functions
3 Algorithm of IT3
3.1 Algorithms
3.1.1 231 Transpose
3.1.2 312 Transpose
3.1.3 213 Transpose
3.1.4 132 Transpose
3.1.5 321 Transpose
3.2 Implementations of Catanzaro's Algorithm
3.2.1 Data Layout
3.2.2 Column Operation
3.2.3 Row Operation
3.3 GPU Implementations of IT3
3.3.1 231 Transpose and 312 Transpose
3.3.2 213 Transpose
3.3.3 132 Transpose
3.3.4 321 Transpose
4 Performance Evaluation
4.1 231 and 312 Transpose
4.2 213 Transpose
4.3 132 Transpose
4.4 321 Transpose
5 Conclusion and Future Work