National Sun Yat-sen University, thesis/dissertation, Design of low-cost multi-thread unified shader architecture

Title	Design of low-cost multi-thread unified shader architecture (低成本多執行緒之單一著色器架構設計)
Department	Department of Computer Science and Engineering
Year, semester	Fall semester of Academic Year 99	Language	Chinese
Degree	Master	Number of pages	81
Author	Ya-hsien Sun (孫亞賢)
Advisor	Yun-nan Chang (張雲南)
Convenor	Shen-fu Hsiao (蕭勝夫)
Advisory Committee	Chih-peng Fan (范志鵬)
Date of Exam	2011-01-18	Date of Submission	2011-02-14
Keywords	Shader, Unified GPU, Per-fragment operation, Multithreading, Schedule

Abstract
Programmable graphics processing units (GPUs) often sit idle while waiting for the results of long-latency instructions, so multithreading is commonly adopted to raise data-path utilization. This thesis proposes a multi-threaded GPU design built around a single unified core, with the following key features. First, the processor core executes not only vertex and fragment shader programs but also a software rasterization module, which most other GPU designs implement as separate hardware. Second, the thread-switching policy is non-preemptive blocked scheduling. Normally, whether an instruction will stall cannot be known until it enters the instruction-decode stage. To achieve zero-penalty thread switching, a single assistant bit is added to each instruction to tell the control unit whether the next instruction in the same thread will stall. This mechanism achieves a speed-up of 1.4 on some of the benchmarks used in this thesis. Third, the register file in a GPU processor usually has up to four access ports and occupies a significant portion of the GPU, especially in multi-threaded designs where the register set is duplicated several times. The proposed multi-bank approach replaces it with multiple two-port register banks, which reduces the implementation cost; our experimental results show that it reduces the overall gate count by about 26.12%. Finally, the remaining fixed-pipeline per-fragment operation module is realized with an iterative time-sharing architecture to further save silicon area. The overall gate count of the proposed GPU is 600K.
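The assistant-bit idea described in the abstract can be illustrated with a small cycle-count model. This is only a sketch under stated assumptions, not the thesis's implementation: the names `Thread`, `add_hint_bits` and `run`, the one-cycle switch penalty, the latency value and the round-robin choice of the next thread are all hypothetical. Each instruction is modelled as a boolean (True for a long-latency instruction), and the hint bit on instruction i records whether instruction i+1 of the same thread is long-latency.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    program: list      # list of bools: True = long-latency instruction
    hint: list         # hint[i] is True when instruction i+1 is long-latency
    pc: int = 0
    ready_at: int = 0  # first cycle at which the thread may issue again

    def done(self):
        return self.pc >= len(self.program)


def add_hint_bits(program):
    """Attach one assistant bit per instruction (assumed encoding)."""
    n = len(program)
    return [i + 1 < n and program[i + 1] for i in range(n)]


def run(programs, long_latency=4, switch_penalty=1, use_hint=False):
    """Return the cycle count of a non-preemptive blocked scheduler.

    Without the hint, a long-latency instruction is only recognised at
    decode, so `switch_penalty` fetch slots are squashed on each switch.
    With the hint, the previous instruction already announced the stall,
    so the switch costs nothing in this model.
    """
    threads = [Thread(list(p), add_hint_bits(p)) for p in programs]
    cycle = 0
    cur = 0
    while any(not t.done() for t in threads):
        t = threads[cur]
        if t.done() or t.ready_at > cycle:
            # Current thread cannot issue: pick the next ready one.
            for k in range(1, len(threads) + 1):
                cand = (cur + k) % len(threads)
                c = threads[cand]
                if not c.done() and c.ready_at <= cycle:
                    cur = cand
                    break
            else:
                cycle += 1  # no thread is ready: idle cycle
            continue
        is_long = t.program[t.pc]
        hinted = use_hint and t.pc > 0 and t.hint[t.pc - 1]
        t.pc += 1
        cycle += 1  # issuing one instruction takes one cycle
        if is_long:
            t.ready_at = cycle + long_latency  # thread blocks on the result
            if not hinted:
                cycle += switch_penalty  # squashed fetch slots
    return cycle


if __name__ == "__main__":
    # Four threads, each alternating short and long-latency instructions.
    workload = [[False, True] * 4 for _ in range(4)]
    print("without hint bit:", run(workload, use_hint=False))
    print("with hint bit:   ", run(workload, use_hint=True))
```

Any cycle counts produced by this toy model depend entirely on the assumed latency, penalty and workload, and are not comparable to the 1.4 speed-up reported in the thesis. The model does show where the saving comes from: every switch caused by a hinted long-latency instruction avoids the squashed fetch slots.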

Table of Contents
Chapter 1 Introduction
  1.1 Motivation
  1.2 Thesis Outline
Chapter 2 Background and Related Work
  2.1 Introduction to 3D Graphics and Its Pipeline
    2.1.1 Introduction to the Geometry Transformation and Shading Subsystems
    2.1.2 Programmable 3D Graphics Architectures
      2.1.2.1 Introduction to Vertex and Fragment Processors
      2.1.2.2 Introduction to Unified Processors
  2.2 Background on Multithreaded Processors
    2.2.1 Multithreaded Processor Architectures
    2.2.2 Multithreaded Scheduling Policies
Chapter 3 Design of the 3D Back-End Rendering Subsystem
  3.1 Fragment Processor: Description and Design
    3.1.1 Instruction Set Functions
    3.1.2 Hardware Architecture
  3.2 Per-fragment operation: Description and Design
    3.2.1 OpenGL ES 2.0 Functional Description and Hardware Design
Chapter 4 Multithreaded Unified Processor: Description and Design
  4.1 Unified fragment operation unit Design
  4.2 Unified GPU Design
    4.2.1 Hardware Architecture
    4.2.2 Multithreading Mechanism and Scheduling Policy
    4.2.3 Analysis of Processor Storage Elements
Chapter 5 System Implementation Results
  5.1 Functional Verification
    5.1.1 Software Verification
    5.1.2 Hardware Verification
      5.1.2.1 RTL Verification
      5.1.2.2 Co-sim Verification
      5.1.2.3 FPGA Verification
    5.1.3 Execution Results
  5.2 System Interface and Address Space
  5.3 Synthesis Data and Performance Analysis
Chapter 6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work


Fulltext
Thesis access permission: not available, on campus or off campus.
Printed copies
Availability: publicly available.
