博碩士論文 etd-0214111-112506 詳細資訊
Title page for etd-0214111-112506
論文名稱
Title
低成本多執行緒之單一著色器架構設計
Design of low-cost multi-thread unified shader architecture
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
81
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2011-01-18
繳交日期
Date of Submission
2011-02-14
關鍵字
Keywords
像素運算模組、多執行緒、排程、著色器、統一圖形處理單元
Shader, Unified GPU, Per-fragment operation, Multithreading, Schedule
統計
Statistics
本論文已被瀏覽 5643 次,被下載 0
The thesis/dissertation has been browsed 5643 times, has been downloaded 0 times.
中文摘要
因為可編程之圖形處理單元經常等待長延遲指令的執行結果而閒置,為了增加使用率,多執行緒的技術常被採用於設計當中。本論文提出的多執行緒之統一圖形處理單元設計,擁有下列幾項特點。第一點,處理核心不僅可以執行頂點著色器與像素著色器,還可以執行用軟體實作的點陣化模組;在其它的圖形處理單元設計中,點陣化模組大部分是使用額外的硬體去實作。此外,我們的設計所採用的執行緒切換策略為不可搶先的阻塞式排程法。通常,一個指令在進入指令解碼階段之前,無法被得知是否會拖延整個程式的執行。為了達到零花費的執行緒切換,一個協助位元被增添於每個指令中,告知控制單元相同執行緒內的下個指令是否會拖延程式的執行。這個機制在本論文所跑的某些測試程式中可以達到1.4倍的加速。圖形處理單元所用的暫存器一般可到四個存取埠,在多執行緒的設計中,由於暫存器會複製好幾份,佔圖形處理單元相當大的部份。基於本論文所提出的方法,使用多塊兩埠的暫存器來取代一般的暫存器設計,可以減少硬體實作的花費。我們的實驗結果顯示這個方法可以減少約整體26.12%的邏輯閘。最後,為了進一步減少面積,剩下的固定化管線之像素運算模組採用分時架構來實現。而本文所提出的圖形處理單元的邏輯閘總數為600K。
Abstract
Programmable graphics processing units (GPUs) often stall while waiting for the results of long-latency instructions, so multithreading is widely used in GPU design to raise data-path utilization. This thesis proposes a multithreaded GPU built around a single unified core, with the following key features. First, the processor core executes not only vertex and fragment shading programs but also a software rasterization module; in most other GPU designs, rasterization is implemented by a separate hardware module. Second, the thread-switching policy is non-preemptive blocked scheduling. Normally, whether an instruction will stall cannot be detected until it enters the instruction-decode stage. To achieve zero-penalty thread switching, a single assistant bit is added to each instruction to tell the control unit whether the next instruction in the same thread will stall. This mechanism achieves a speed-up of 1.4 on some of the benchmarks used in this thesis. Third, the register file of a GPU processor usually has up to four access ports, so it occupies a significant portion of the GPU, especially in multithreaded designs where the register set is duplicated several times. The proposed multi-bank approach lowers the implementation cost by replacing the conventional register file with multiple two-port banks. Our experimental results show that this approach reduces the overall gate count by 26.12%. Finally, the remaining fixed-pipeline per-fragment operations are realized by an iterative time-sharing architecture to further save silicon area. The overall gate count of the proposed GPU is 600K.
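The assistant-bit idea in the abstract can be illustrated with a toy cycle model. This is a minimal sketch under stated assumptions, not the thesis's hardware: the names `Instr`, `annotate`, `make_program`, and `run`, the one-cycle `switch_penalty`, and the latency values are all hypothetical. It assumes that, without the hint, a stall is discovered at decode and costs one bubble cycle before another thread can issue, while a hint bit on the preceding instruction lets the switch happen with no bubble.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    latency: int               # cycles until the result is available (1 = no stall)
    next_stalls: bool = False  # hint bit: will the NEXT instruction of this thread stall?


def annotate(program):
    """Set the hint bit on every instruction that precedes a long-latency one."""
    for cur, nxt in zip(program, program[1:]):
        cur.next_stalls = nxt.latency > 1
    return program


def make_program(latencies):
    """Build an annotated program from a list of per-instruction latencies."""
    return annotate([Instr(lat) for lat in latencies])


def run(programs, use_hint, switch_penalty=1):
    """Return the number of cycles needed to issue every instruction."""
    n = len(programs)
    pc = [0] * n        # next instruction index per thread
    ready_at = [0] * n  # first cycle at which each thread may issue again
    cur, cycle = 0, 0
    while any(pc[t] < len(programs[t]) for t in range(n)):
        # Non-preemptive: stay on the current thread while it can issue,
        # otherwise take the next runnable thread in round-robin order.
        order = [(cur + k) % n for k in range(n)]
        runnable = [t for t in order
                    if pc[t] < len(programs[t]) and ready_at[t] <= cycle]
        if not runnable:
            cycle += 1  # every unfinished thread is waiting: idle cycle
            continue
        cur = runnable[0]
        prog = programs[cur]
        instr = prog[pc[cur]]
        # The hint is carried by the previous instruction of the same thread.
        hinted = pc[cur] > 0 and prog[pc[cur] - 1].next_stalls
        pc[cur] += 1
        cycle += 1
        if instr.latency > 1:
            ready_at[cur] = cycle - 1 + instr.latency  # block this thread
            if not (use_hint and hinted):
                cycle += switch_penalty  # stall found at decode: bubble (assumed cost)
    return cycle


# Two identical threads that alternate short and long-latency instructions.
programs = [make_program([1, 4, 1, 4, 1]) for _ in range(2)]
print("cycles with hint bit:   ", run(programs, use_hint=True))
print("cycles without hint bit:", run(programs, use_hint=False))
```

In this model the hinted run never pays the bubble, so any gap between the two cycle counts comes only from the assumed switch penalty; the 1.4 speed-up reported in the abstract comes from the thesis's own benchmarks and is not reproduced here.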
目次 Table of Contents
CHAPTER 1 Introduction
1.1 Motivation
1.2 Thesis Outline
CHAPTER 2 Background and Related Work
2.1 Introduction to 3D Graphics and Its Pipeline
2.1.1 Introduction to the Geometry Transformation and Shading Subsystem
2.1.2 Programmable 3D Graphics Architecture
2.1.2.1 Introduction to Vertex and Fragment Processors
2.1.2.2 Introduction to Unified Processors
2.2 Background on Multithreaded Processors
2.2.1 Multithreaded Processor Architectures
2.2.2 Multithreading Scheduling Policies
CHAPTER 3 Design of the 3D Back-End Rendering Subsystem
3.1 Fragment Processor Description and Design
3.1.1 Instruction Set Functions
3.1.2 Hardware Architecture
3.2 Per-fragment Operation Description and Design
3.2.1 OpenGL ES 2.0 Functions and Hardware Design
CHAPTER 4 Description and Design of the Multithreaded Unified Processor
4.1 Unified Fragment Operation Unit Design
4.2 Unified GPU Design
4.2.1 Hardware Architecture
4.2.2 Multithreading Mechanism and Scheduling Policy
4.2.3 Analysis of Processor Storage Elements
CHAPTER 5 System Implementation Results
5.1 Functional Verification
5.1.1 Software Verification
5.1.2 Hardware Verification
5.1.2.1 RTL Verification
5.1.2.2 Co-sim Verification
5.1.2.3 FPGA Verification
5.1.3 Execution Results
5.2 System Interface and Address Space
5.3 Synthesis Data and Performance Analysis
CHAPTER 6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
電子全文 Fulltext
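This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid breaking the law.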
論文使用權限 Thesis access permission:校內校外均不公開 not available
開放時間 Available:
校內 Campus:永不公開 not available
校外 Off-campus:永不公開 not available

紙本論文 Printed copies
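Public-access information for printed copies is relatively complete from ROC academic year 102 onward. To look up public-access information for printed copies from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.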
開放時間 Available:已公開 available
