National Sun Yat-sen University, thesis/dissertation, Design of parallel computing processor based on OpenCL architecture

Title	Design of parallel computing processor based on OpenCL architecture (基於OpenCL架構的平行運算處理器之設計)
Department	Department of Computer Science and Engineering
Year, semester	Fall semester of Academic Year 104	Language	Chinese
Degree	Master	Number of pages	59
Author	Chien-te Hsu (許建德)
Advisor	Yun-Nan Chang (張雲南)
Convenor	Shanq-Jang Ruan (阮聖彰)
Advisory Committee	Tong-Yu Hsieh (謝東佑); Ko-Chi Kuo (郭可驥)
Date of Exam	2015-08-21	Date of Submission	2015-09-08
Keywords	Scalar Processor, Multi-core, GPU, General Computing, OpenCL
Statistics	The thesis/dissertation has been browsed 5667 times and downloaded 491 times.

Abstract
Beyond adding shader cores to improve rendering performance, another important trend in modern graphics processing units (GPUs) is exploiting their computing power for general-purpose applications. This thesis extends an earlier academic GPU design specialized for graphics rendering into an embedded general-purpose computing engine that supports massively data-parallel processing. The design follows the platform and memory models defined by the Open Computing Language (OpenCL) framework. Four shader cores are first clustered into a compute unit; each core is treated as a processing element, and the processing elements communicate with one another through a newly added local memory. The overall engine consists of multiple compute units, which exchange messages through global memory; global memory access uses the same control mechanism as texture processing in the original GPU design. To synchronize threads running on different processing elements, OpenCL defines a barrier, which is also implemented in this design. The original GPU design adopted a single instruction, multiple data (SIMD) vector processor architecture, which can be inefficient for general computing because the proportion of vector instructions in such programs is often low. The design therefore converts the vector processor into a scalar processor to achieve better hardware utilization. Also for the sake of utilization, the special function unit is separated from the arithmetic logic unit (ALU) and becomes a resource shared by all processing elements in the same compute unit. Synthesis results show that the complete processor, comprising two compute units, has an equivalent gate count of about 3600k.
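The local-memory and barrier behaviour described in the abstract can be illustrated with a small software sketch. This is not the thesis hardware or OpenCL code; it is a hypothetical Python simulation in which four threads stand in for the four processing elements of one compute unit, a shared list stands in for local memory, and `threading.Barrier` stands in for the OpenCL barrier. The names `run_compute_unit`, `work_item`, and `NUM_PE` are assumptions introduced for illustration.

```python
import threading

NUM_PE = 4  # assumption: four processing elements per compute unit, as in the abstract


def run_compute_unit(inputs):
    """Simulate one compute unit: write to local memory, synchronize, read a neighbour."""
    local_mem = [None] * NUM_PE           # stands in for the shared local memory
    results = [None] * NUM_PE
    barrier = threading.Barrier(NUM_PE)   # stands in for the OpenCL barrier

    def work_item(pe_id):
        # Phase 1: each processing element writes its own slot.
        local_mem[pe_id] = inputs[pe_id] * 2
        # No element may read a neighbour's slot until every write is done.
        barrier.wait()
        # Phase 2: read the neighbour's slot, which is now guaranteed to be valid.
        neighbour = (pe_id + 1) % NUM_PE
        results[pe_id] = local_mem[pe_id] + local_mem[neighbour]

    threads = [threading.Thread(target=work_item, args=(i,)) for i in range(NUM_PE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_compute_unit([1, 2, 3, 4]))
```

Without the barrier, a processing element could read a neighbour's slot before it has been written; the barrier separates the write phase from the read phase, which is the role the abstract attributes to the barrier instruction.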

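Table of Contents
Thesis Approval Form
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
Chapter 1 Introduction
  1.1 Research Motivation
  1.2 Thesis Outline
Chapter 2 Background and Related Work
  2.1 Multi-core Graphics Processors
  2.2 OpenCL Architecture
  2.3 Scalar Processor Architecture
  2.4 Synchronization
Chapter 3 Parallel Computing Processor
  3.1 Vector Instruction Set and Scalar Processor
    3.1.1 Vector Instruction Set
    3.1.2 Vector Instruction Support on the Scalar Processor
      3.1.2.1 Hardware Operation
  3.2 Processor Hardware Architecture
    3.2.1 Functions of the Modules in a Compute Unit
    3.2.2 Multi-core Computing Unit (Multi_core)
    3.2.3 Local Memory Access Mechanism
  3.3 Processor Flow
    3.3.1 Warp Scheduling in the Processor
    3.3.2 Work Item Switching Mechanism and Hardware Implementation
    3.3.3 Global Memory Access and Special Function Instructions
      3.3.3.1 Global Memory Instructions
      3.3.3.2 Special Function Instructions
    3.3.4 Barrier Instruction
      3.3.4.1 Hardware Operation
  3.4 User Interface for Configuring Parallel Processor Parameters
Chapter 4 Verification Method and Result Analysis
  4.1 EASY Platform Verification
  4.2 FPGA-level Verification
    4.2.1 FPGA Synthesis Software and Simulation Environment
    4.2.2 Verification Results
  4.3 Synthesis Data Analysis
Chapter 5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
References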

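Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law. Thesis access permission: user-defined release date. Available: Campus: available; Off-campus: available. etd-0808115-113525.pdf
Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. To look up public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available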

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS