Author: 涂智傑
Author (English): Tu, Chih-Chieh
Title (Chinese): 三階張量在GPU上的原地轉置
Title (English): IT3: In-place Transposition of Third-Order Tensor on Graphics Processing Units
Advisor: 李哲榮
Advisor (English): Lee, Che-Rung
Committee Members: 韓永楷、高英哲
Committee Members (English): Hon, Wing-Kai; Kao, Ying-Jer
Defense Date: 2021-01-20
Degree: Master's
Institution: National Tsing Hua University (國立清華大學)
Department: Department of Computer Science (資訊工程學系)
Discipline: Engineering
Field: Electrical and Computer Engineering
Document Type: Academic thesis
Year of Publication: 2021
Academic Year of Graduation: 109 (2020–2021)
Language: English
Pages: 42
Keywords (Chinese): 三階張量、張量、原地轉置、演算法、最佳化 (third-order tensor; tensor; in-place transposition; algorithm; optimization)
Keywords (English): Third-Order Tensor; Tensor; Graphics Processing Units; GPU; In-place Transposition; Algorithm; Optimization
Chinese Abstract (translated): Tensor transposition is a fundamental operation widely used in high-performance tensor computation, for example in tensor contraction. However, most existing tensor transposition methods are out-of-place, meaning they require twice the space occupied by the input tensor to complete the transposition. This inefficient use of space is a serious problem for most accelerators; for instance, the memory capacity of graphics processing units (GPUs) is relatively limited. To improve memory efficiency, this thesis proposes IT3, an in-place transposition algorithm for third-order tensors. To exploit already-optimized matrix transposition kernels, IT3 first decomposes the third-order tensor into smaller units, such as matrices, rows, or columns, depending on the transposition target. IT3 then combines the results of these matrix transpositions to achieve the final goal. Theoretical analysis shows that, compared with out-of-place methods, IT3 saves at least 50% of the additional memory usage. This thesis also presents an efficient GPU implementation of IT3. All possible permutations of third-order tensor transposition were tested experimentally. The GPU implementation of IT3 was compared with CUDA Tensor Transpose (cuTT), the state-of-the-art GPU tensor transposition library; the results show that IT3 not only outperforms cuTT in memory usage in all cases, but also leads in execution speed in most cases.
Abstract: Tensor transposition is a basic operation used in many high-performance tensor computations, such as tensor contractions.
However, most existing methods for tensor transposition are out-of-place, meaning they require additional space of the same size as the input tensor to accomplish the transposition. This low space efficiency is a serious problem for most accelerators, such as Graphics Processing Units (GPUs), because their memory capacity is relatively limited. To improve memory usage, we propose IT3, a set of in-place transposition algorithms for third-order tensors. To utilize existing optimized in-place matrix transposition kernels, IT3 first decomposes a third-order tensor into smaller units, such as matrices, rows, or columns, based on the transposition configuration. After that, IT3 combines different matrix transposition methods to achieve the final result. Theoretical analysis shows that IT3 uses at most 50% of the input size in additional memory. We also present efficient GPU implementations of IT3 that optimize performance for modern GPUs. Experiments covering all possible permutations of third-order tensor transposition were conducted. The GPU implementations of IT3 were compared with the state-of-the-art tensor transposition library for GPUs, CUDA Tensor Transpose (cuTT). The results show that IT3 outperforms cuTT not only in memory usage on all benchmarks, but also in speed in most cases.
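To make the decomposition idea concrete, here is a minimal sketch, assuming row-major storage; it illustrates the underlying index arithmetic only, not the thesis's IT3 code or its GPU kernels, and the function names are invented for this example. Under a row-major layout, the 231 transposition (output[i2][i3][i1] = input[i1][i2][i3]) of a d1 x d2 x d3 tensor is exactly the in-place transposition of a d1 x (d2*d3) matrix over the same linear buffer. The O(n) `visited` bitmap below is a textbook shortcut; avoiding such auxiliary storage is what optimized in-place schemes such as Catanzaro's algorithm address.

```python
import numpy as np

def inplace_matrix_transpose(buf, rows, cols):
    """Cycle-following in-place transposition of a rows x cols row-major
    matrix stored in the 1-D array `buf` (illustrative, not optimized)."""
    n = rows * cols
    visited = np.zeros(n, dtype=bool)  # O(n) flags: a simplification that
                                       # real in-place algorithms avoid
    for start in range(n):
        if visited[start]:
            continue
        idx, val = start, buf[start]
        while True:                    # follow one permutation cycle
            visited[idx] = True
            r, c = divmod(idx, cols)   # (row, col) of idx in the input
            dst = c * rows + r         # where that element lands after transpose
            buf[dst], val = val, buf[dst]
            idx = dst
            if idx == start:
                break

def transpose_231(tensor):
    """In-place 231 transpose: result[i2, i3, i1] == tensor[i1, i2, i3]."""
    d1, d2, d3 = tensor.shape
    flat = tensor.reshape(-1)          # view of the buffer (input is C-contiguous)
    inplace_matrix_transpose(flat, d1, d2 * d3)
    return flat.reshape(d2, d3, d1)

t = np.arange(2 * 3 * 4).reshape(2, 3, 4)
assert np.array_equal(transpose_231(t.copy()), np.transpose(t, (1, 2, 0)))
```

The same index arithmetic makes the 312 transposition the in-place transposition of a (d1*d2) x d3 matrix over the same buffer, which is consistent with the outline below treating the 231 and 312 cases together on top of the matrix transposition kernels.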
Table of Contents:
Chinese Abstract i
Abstract ii
List of Figures v
List of Tables vi
1 Introduction 1
2 Background and Related Works 4
2.1 Tensor Transposition 4
2.2 In-place Matrix Transposition 5
2.3 Catanzaro’s Algorithm 6
2.4 Scatter and Gather Function 9
3 Algorithm of IT3 10
3.1 Algorithms 10
3.1.1 231 Transpose 12
3.1.2 312 Transpose 12
3.1.3 213 Transpose 13
3.1.4 132 Transpose 14
3.1.5 321 Transpose 14
3.2 Implementations of Catanzaro’s Algorithm 15
3.2.1 Data Layout 16
3.2.2 Column Operation 16
3.2.3 Row Operation 19
3.3 GPU Implementations of IT3 21
3.3.1 231 Transpose and 312 Transpose 21
3.3.2 213 Transpose 21
3.3.3 132 Transpose 22
3.3.4 321 Transpose 23
4 Performance Evaluation 24
4.1 231 and 312 Transpose 25
4.2 213 Transpose 28
4.3 132 Transpose 33
4.4 321 Transpose 36
5 Conclusion and Future Works 38
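Section 3.1 of the outline above assigns each permutation its own decomposition. As one more hedged illustration in the same spirit (again plain NumPy index arithmetic, not the thesis's implementation; it reuses `inplace_matrix_transpose` from the sketch after the abstract), the 132 transposition decomposes into d1 independent in-place transpositions of the contiguous d2 x d3 slices:

```python
import numpy as np
# Assumes inplace_matrix_transpose() from the earlier sketch is in scope.

def transpose_132(tensor):
    """In-place 132 transpose: result[i1, i3, i2] == tensor[i1, i2, i3],
    done as d1 independent in-place transposes of contiguous slices."""
    d1, d2, d3 = tensor.shape
    flat = tensor.reshape(-1)                            # row-major buffer
    for i1 in range(d1):
        block = flat[i1 * d2 * d3 : (i1 + 1) * d2 * d3]  # one d2 x d3 slice (a view)
        inplace_matrix_transpose(block, d2, d3)
    return flat.reshape(d1, d3, d2)

t = np.arange(2 * 3 * 4).reshape(2, 3, 4)
assert np.array_equal(transpose_132(t.copy()), np.transpose(t, (0, 2, 1)))
```

Under the same row-major view, the 213 case permutes whole length-d3 rows rather than individual scalars, matching the abstract's mention of rows and columns as decomposition units, and the 321 result can be reached by composing the cases above (for instance, a 132 transpose followed by a 312 transpose).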