Thesis/Dissertation etd-0724117-184404: Detailed Information
Title page for etd-0724117-184404
Title
在超純量多核心架構下實現基於語意分析之迴圈展開器以提升ILP
Improving ILP with Semantic-Based Loop Unrolling Mechanism in the Hyperscalar Architecture
Department
Year, semester
Language
Degree
Number of pages
87
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2017-07-24
Date of Submission
2017-08-25
Keywords
loop ILP, Hyperscalar, loop semantics, loop unrolling
Statistics
This thesis has been browsed 5727 times and downloaded 97 times.
Abstract
In the multi-core era, exploiting the instruction-level parallelism (ILP) of loops can raise the overall performance of a multi-core architecture, because loops make up the bulk of programs with high-performance computing demands. Loop execution has three characteristics: (1) the same instructions are repeatedly fetched from the instruction cache and decoded; (2) the number of instructions in the loop body limits how many can be issued; and (3) data dependences often exist between iterations. Together these factors lead to poor ILP during loop execution. The problem deserves particular attention on the Hyperscalar architecture, where the computing resources spread across the cores should be used to their fullest.
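To make characteristic (3) concrete, the following sketch estimates ILP as the instruction count divided by the length of the longest dependence chain. It is an idealized toy model, not the thesis's simulator: it assumes unit latency, unlimited issue width, no memory aliasing, and only true (read-after-write) register dependences. The function names, the copy-loop body, and the register names are all illustrative assumptions.

```python
def ilp(instrs):
    """Estimate ILP = instruction count / critical-path length.

    Each instruction is (dest, sources). Assumes unit latency, unlimited
    issue width, and true (read-after-write) register dependences only.
    """
    ready = {}   # register -> cycle at which its latest value is available
    depth = 0    # length of the longest dependence chain seen so far
    for dest, srcs in instrs:
        start = max((ready.get(s, 0) for s in srcs), default=0)
        finish = start + 1
        if dest is not None:
            ready[dest] = finish
        depth = max(depth, finish)
    return len(instrs) / depth


def chained(n):
    """n iterations of a copy loop; each index update depends on the last."""
    out = []
    for _ in range(n):
        out += [("r1", ["r1"]),                 # index += 4
                ("r2", ["r0", "r1"]),           # load  src[index]
                (None, ["r2", "r4", "r1"])]     # store dst[index]
    return out


def independent(n):
    """Same work, but every index is derived from the loop-entry value of r1."""
    out = []
    for k in range(n):
        i, d = f"i{k}", f"d{k}"
        out += [(i, ["r1"]), (d, ["r0", i]), (None, [d, "r4", i])]
    return out


print(ilp(chained(4)), ilp(independent(4)))
```

In this toy model the chained version is limited by the serial index updates, while the independent version lets all four iterations proceed side by side. The numbers it prints describe only this example and say nothing about the thesis's measured results.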
The machine code that a compiler generates for a loop follows a specific instruction-ordering pattern, and by observing this pattern we define the semantics of a loop. In this thesis we propose a semantic-based loop unrolling mechanism for the Hyperscalar architecture. Inside the instruction analyzer (IA), the mechanism parses the compiler-generated machine code, finds the closed interval of loop-body instructions that matches the defined loop semantics, collects information about that interval, and then unrolls the loop based on an analysis of the collected information.
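The abstract does not spell out the loop semantics it defines. As a rough illustration only, the sketch below assumes one common compiler pattern: a conditional branch whose target lies at or before its own address closes a loop whose body is the closed interval from the target to the branch. The `Instr` type, the `detect_loops` function, and the mnemonic set are hypothetical and are not taken from the thesis.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    addr: int
    mnemonic: str
    target: Optional[int] = None   # branch target address, if any


# Assumed subset of ARM conditional-branch mnemonics, for illustration.
COND_BRANCHES = {"BEQ", "BNE", "BLT", "BLE", "BGT", "BGE", "BCC", "BCS"}


def detect_loops(program):
    """Return (start_addr, end_addr) closed intervals of candidate loop bodies.

    Assumed rule: a conditional branch whose target is at or before its own
    address closes a loop spanning [target, branch address].
    """
    loops = []
    for ins in program:
        is_backward = ins.target is not None and ins.target <= ins.addr
        if ins.mnemonic in COND_BRANCHES and is_backward:
            loops.append((ins.target, ins.addr))
    return loops


program = [
    Instr(0x00, "MOV"),
    Instr(0x04, "LDR"),
    Instr(0x08, "ADD"),
    Instr(0x0C, "CMP"),
    Instr(0x10, "BNE", target=0x04),   # backward conditional branch
    Instr(0x14, "BX"),
]
print(detect_loops(program))
```

A real detector would also have to record the information the unrolling step needs, such as the registers read and written inside the interval; that part is omitted here.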
The proposed architecture consists of three units: the loop detect unit (LDU), the loop unrolling unit (LUU), and the loop controller. The LDU finds the loop body according to the defined semantics and collects information about that interval. The LUU unrolls the loop based on the information collected by the LDU, in four steps: (1) decide how many times to unroll the loop according to the available core resources, and add a sequence-number (SEQ) tag to each instruction; (2) rename registers and eliminate the data dependences between iterations; (3) generate tags for the unrolled instructions and add compensation tags, which account for the loop being executed ahead of time, to keep the data correct; and (4) rearrange the issue order so that instructions whose inter-iteration dependences have been eliminated are issued earlier, and generate the instruction tag dispatch table, the loop virtual shared register file (VSRF) mapping table, the loop memory tag mapping table, and the loop specific instruction flush table. The loop controller decides who holds the right to dispatch instructions, based on mispredicted branch instructions and on the loops whose unrolling is complete. If a branch that was predicted not taken but turned out to be taken is the conditional-check branch of an unrolled loop, the loop controller hands the dispatch right to the LUU; when the unrolled loop finishes executing, it returns the dispatch right to the IA.
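Steps (1), (2), and (4) of the unrolling procedure can be sketched in software as follows. This is a minimal model under stated assumptions, not the LUU hardware design: the loop body is a list of `(op, dest, srcs)` tuples, the induction register is updated once per iteration by a hypothetical `ADDI` pseudo-op with a constant step, and the unroll factor is simply passed in (for example, the number of available cores). Compensation tags and the mapping and flush tables of steps (3) and (4) are omitted.

```python
def unroll(body, induction, step, factor):
    """Unroll `body` `factor` times with SEQ numbers and renamed registers.

    The induction update of iteration k is rewritten to read the loop-entry
    value of `induction` plus step * (k + 1), which removes the dependence
    on the previous iteration's update.
    """
    out, rename, seq = [], {}, 0   # rename: architectural reg -> latest renamed reg
    for k in range(factor):
        for op, dest, srcs in body:
            if op == "ADDI" and dest == induction:
                srcs_k, imm = [induction], step * (k + 1)
            else:
                srcs_k, imm = [rename.get(s, s) for s in srcs], None
            dest_k = None
            if dest is not None:
                dest_k = f"{dest}_{k}"     # fresh name per iteration
                rename[dest] = dest_k
            out.append({"seq": seq, "iter": k, "op": op,
                        "dest": dest_k, "srcs": srcs_k, "imm": imm})
            seq += 1
    return out


def reorder(instrs):
    """Issue the dependence-eliminated induction updates first, then the rest
    in original SEQ order."""
    return sorted(instrs, key=lambda i: (i["imm"] is None, i["seq"]))


# Copy loop: load src[r1], store to dst[r1], then advance r1 by 4.
body = [("LDR", "r2", ["r0", "r1"]),
        ("STR", None, ["r2", "r4", "r1"]),
        ("ADDI", "r1", ["r1"])]

unrolled = unroll(body, induction="r1", step=4, factor=2)
for ins in reorder(unrolled):
    print(ins)
```

Because every rewritten induction update reads only the loop-entry value of `r1` and writes a fresh register, hoisting those updates to the front does not change what the other instructions compute. Writing the final renamed values back to the architectural registers after the loop is left out of this sketch.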
In this thesis, verification uses ARM assembly code produced by the Keil μVision5 compiler. The results show that eliminating inter-iteration dependences improves ILP by 20% to 100% (a factor of 1.2 to 2), and that flushing specific instructions reduces the total execution time of loops whose bodies contain internal branch instructions.
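Table of Contents
Thesis Approval Form
Thesis Public Authorization Form
Acknowledgements
Abstract (Chinese)
Abstract (English)
Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1. Research Motivation
1.2. Research Objectives
1.3. Thesis Organization
Chapter 2 Related Work
2.1. Introduction to the Hyperscalar Architecture
2.1.1. Instruction Analyzer
2.1.2. Virtual Shared Register File
2.1.3. Handling of Register Data Flow
2.1.4. Handling of Memory Data Flow
2.1.5. Handling of Instruction Flow
2.2. Improving Loop Execution Efficiency in Pipelined and Superscalar Processors
2.2.1. Loop Buffers
2.2.2. Loop Cache
2.2.3. Branch Prediction
2.2.4. Out-of-Order Execution
2.3. Improving Loop Execution Efficiency in Dataflow Processors
2.3.1. Static Dataflow Processors
2.3.2. Dynamic Dataflow Processors
2.4. Improving Loop Execution Efficiency in VLIW Processors
2.5. Comparison of Processors in Improving Loop Execution Efficiency
Chapter 3 Designing a Semantic-Analysis-Based Loop Unroller in the Hyperscalar Architecture
3.1. System Architecture of Semantic-Analysis-Based Loop Unrolling
3.1.1. Design Concept of the System Architecture
3.1.2. System Architecture
3.2. Loop Detect Unit
3.2.1. Loop Detection
3.2.2. Loop Data Storage Format
3.2.3. Example of the Loop Data Storage Format
3.3. Loop Unrolling Unit
3.3.1. Loop Instruction Dependence Analysis
3.3.2. Register Renaming
3.3.3. Eliminate Iteration Dependence
3.3.4. Loop Compensation Mechanism
3.4. Loop Controller
3.5. Instruction Operation Example
Chapter 4 Simulation and Analysis
4.1. Architecture Simulation
4.1.1. Program Flow of the Simulator Architecture
4.1.2. Test Programs
4.2. Results and Discussion
Chapter 5 Conclusion
References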
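Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as not to violate the law.
Thesis access permission: user-defined availability date
Available:
Campus: available
Off-campus: available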


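Printed copies
Public-availability information for printed theses is relatively complete from academic year 102 onward. To look up this information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available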
