National Sun Yat-sen University, thesis/dissertation, Design and Implementation of a Multi-core Graphic Processing Unit based on SIMT Architecture

Title	Design and Implementation of a Multi-core Graphic Processing Unit based on SIMT Architecture (基於SIMT架構之多核繪圖處理器之設計與實作)
Department	Department of Computer Science and Engineering
Year, semester	Spring semester of Academic Year 102 (ROC calendar, 2013-2014)	Language	Chinese
Degree	Master	Number of pages	93
Author	楊賀鈞 Ho-chun Yang
Advisor	張雲南 Yun-Nan Chang
Convenor	謝東佑 Tong-Yu Hsieh
Advisory Committee	黃英哲 Ing-Jer Huang
Date of Exam	2014-07-25	Date of Submission	2014-08-28
Keywords	Unified GPU, Multithreading, Branch divergence, Multi-core, SIMT
Statistics	This thesis has been browsed 5682 times and downloaded 426 times.

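Abstract (translated from the Chinese)
As applications on handheld devices demand ever more graphics capability, designing a high-performance graphics processor has become an important issue. This thesis proposes a multi-core unified graphics processor based on the SIMT (single instruction, multiple threads) architecture. Its multi-core design uses four SIMD (single instruction, multiple data) vector processors, each of which can execute one four-dimensional vector instruction. The processor pipeline has 10 stages, four of which access the register file. A register file designed directly around this access pattern would need four ports to meet the processor's access bandwidth. To reduce the hardware cost of the register file, this thesis adopts a multi-bank register file architecture, which allows single-port register memories to replace multi-port ones. In the proposed SIMT design, four threads of the same type are grouped into a warp. At run time, the threads in a warp execute the same instruction in the same clock cycle, and the data of the threads in a warp map to corresponding banks of the register file. By following a scheduling rule under which a warp, once executed, must wait four clock cycles before it can be executed again, the processor avoids running short of ports when it accesses the single-port register memories. In addition, because every thread in a warp executes the same instruction, the SIMT architecture suffers from branch divergence. This thesis handles the problem with the Thread Frontiers algorithm, implemented by having the processor always execute the smallest PC among the threads in a warp, which gives better hardware utilization under branch divergence. Finally, this thesis designs a hardware rasterization fill unit that supports the fan and strip primitive construction modes. These two modes reduce the number of vertices a user needs to construct an object. The multi-core processor has been verified with multiple test cases at the RTL level and on an FPGA platform, and synthesis shows that the whole processor uses about 1.8M logic gates.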
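The vertex savings from the fan and strip modes can be illustrated with a short sketch. It is not the thesis's hardware: the function names are hypothetical, and the index ordering follows the usual OpenGL ES convention for strips and fans, which is assumed here because the abstract does not give the ordering the rasterization fill unit uses.

```python
def strip_triangles(n_vertices):
    """Vertex-index triples for a triangle strip of n_vertices vertices.

    Assumed ordering (OpenGL ES style): every odd triangle swaps its first
    two indices so that all triangles keep the same winding direction.
    """
    triangles = []
    for i in range(n_vertices - 2):
        if i % 2 == 0:
            triangles.append((i, i + 1, i + 2))
        else:
            triangles.append((i + 1, i, i + 2))
    return triangles


def fan_triangles(n_vertices):
    """Vertex-index triples for a triangle fan: every triangle shares vertex 0."""
    return [(0, i, i + 1) for i in range(1, n_vertices - 1)]


def independent_vertex_count(n_triangles):
    """Vertices needed when each triangle is sent as three separate vertices."""
    return 3 * n_triangles
```

With six vertices, either mode yields four triangles, whereas sending the same four triangles independently would take `independent_vertex_count(4)`, that is 12, vertices. In general a strip or fan of n triangles needs n + 2 vertices instead of 3n.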
Abstract
With the growing demand for embedded graphics processing units (GPUs), developing an efficient GPU has become increasingly important. This thesis proposes a multi-core single-instruction-multiple-thread (SIMT) unified GPU architecture. The proposed GPU is composed of four vector processors, each of which can run 4-way single-instruction-multiple-data (SIMD) instructions. The processor has a deep 10-stage pipeline. Its instructions require up to four register read-write operations, so a direct implementation of the register file would need four ports. To reduce the implementation cost of the register file, this thesis proposes a multi-bank register architecture that allows the register file to be built from single-port memories. Four threads of the same type are grouped into a warp, and the threads of a warp execute the same instruction in the same cycle. Each warp of threads uses a separate register file bank. Five warps of threads run on the multi-core processor in a time-multiplexed manner so that their accesses to the register file do not conflict. In addition, the threads in the same warp may encounter branch divergence. This thesis implements a thread frontier mechanism that chooses the threads with the smallest program counter to run, which leads to better hardware utilization under branch divergence. Finally, this thesis develops a rasterization-fill unit that supports different triangle modes. The proposed multi-core GPU has been implemented and successfully run on an FPGA. The overall gate count is about 1.8M.
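The smallest-program-counter rule described in the abstract can be sketched as a small software simulation. Everything below is an illustrative assumption rather than the thesis's design: the `Thread` class, the toy five-instruction program, and `run_warp` are hypothetical, and the sketch assumes the code is laid out so that the reconvergence point has a higher PC than both branch paths.

```python
from dataclasses import dataclass

HALT = 5  # PC value meaning "thread finished" in this toy program


@dataclass
class Thread:
    x: int
    y: int = 0
    pc: int = 0


# Toy program: each instruction updates one thread and returns its next PC.
def i0(t):  # branch: negative inputs take the path at PC 3
    return 3 if t.x < 0 else 1

def i1(t):  # fall-through path
    t.y = t.x * 2
    return 2

def i2(t):  # jump over the other path to the reconvergence point
    return 4

def i3(t):  # taken path
    t.y = -t.x
    return 4

def i4(t):  # reconvergence point, executed by every thread
    t.y += 1
    return HALT

PROGRAM = [i0, i1, i2, i3, i4]


def run_warp(threads):
    """Run one warp; each step executes only the threads at the smallest PC.

    Returns a trace of (pc, number_of_active_threads) per issued instruction.
    """
    trace = []
    while True:
        live = [t for t in threads if t.pc != HALT]
        if not live:
            return trace
        pc = min(t.pc for t in live)
        active = [t for t in live if t.pc == pc]  # other threads are masked off
        for t in active:
            t.pc = PROGRAM[pc](t)
        trace.append((pc, len(active)))
```

For a warp with inputs 3, -1, 5, -2, the trace is `[(0, 4), (1, 2), (2, 2), (3, 2), (4, 4)]`: the warp splits two and two after the branch, and all four threads are active again at PC 4 without any explicit reconvergence stack.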

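Table of Contents
Thesis approval form
Abstract (Chinese)
Abstract
Table of contents
List of figures
Chapter 1 Introduction
  1.1 Motivation
  1.2 Thesis outline
Chapter 2 Background and related work
  2.1 Graphics pipeline with vertex and pixel processors
  2.2 Background on unified processors
  2.3 Related work on multithreading
    2.3.1 Multithreading background
    2.3.2 Thread-switching mechanisms
    2.3.3 Multithreaded processor architectures
      2.3.3.1 Single-threaded architectures
      2.3.3.2 Multithreaded architectures
      2.3.3.3 SIMT architecture
  2.4 Analysis of processor storage elements
Chapter 3 Multi-core graphics processor design
  3.1 Instruction set description
  3.2 Processor hardware architecture
    3.2.1 Main design concepts
    3.2.2 Functions of the processor modules
    3.2.3 Multi-core compute unit (Multi_core)
    3.2.4 Register file
  3.3 Processor flow
    3.3.1 Top-level flow control
    3.3.2 Compute-stage control
      3.3.2.1 Scheduler data flow
      3.3.2.2 Thread-switching mechanism and hardware implementation
    3.3.3 Finish and Texture instructions
      3.3.3.1 Finish instruction
      3.3.3.2 Texture instruction
  3.4 Branch divergence
    3.4.1 Serialization
    3.4.2 PDOM
    3.4.3 Thread Frontiers
    3.4.4 Performance comparison of branch-divergence handling
Chapter 4 Results and analysis
  4.1 Verification environment
    4.1.1 RTL verification
      4.1.1.1 EASY simulation platform architecture
      4.1.1.2 EASY software simulation flow
      4.1.1.3 RTL-level verification results
    4.1.2 FPGA-level verification
      4.1.2.1 FPGA-level synthesis software and simulation environment
      4.1.2.2 FPGA-level verification results
  4.2 Synthesis results and performance analysis
    4.2.1 Hardware cost
      4.2.1.1 Register file hardware cost
      4.2.1.2 Processor hardware cost
    4.2.2 Performance and speed
Chapter 5 Conclusions and future work
  5.1 Conclusions
  5.2 Possible improvements
    5.2.1 Scalar-architecture multi-core processor
    5.2.2 Dynamic Warp
    5.2.3 ASIC rasterization unit
References
Appendix 1 System Register parameter table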


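Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid breaking the law. Thesis access permission: user-defined availability date. Availability: on campus: available; off campus: available. File: etd-0728114-141237.pdf
Printed copies
Public availability information for printed theses is relatively complete from Academic Year 102 onward. To look up availability information for printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available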


Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS