National Sun Yat-sen University, thesis/dissertation, Design of Dynamic Loop Unrolling Mechanism Based on Semantic Analysis of Nested Loop

Title	Design of Dynamic Loop Unrolling Mechanism Based on Semantic Analysis of Nested Loop (設計基於多重迴圈指令序列之語意分析之動態迴圈展開器)
Department	Department of Electrical Engineering
Year, semester	Spring semester of Academic Year 106	Language	Chinese
Degree	Master	Number of pages	76
Author	Yi-Xuan Lu (呂易軒)
Advisor	Jih-Ching Chiu (邱日清)
Convenor	Shiann-Rong Kuang (鄺獻榮)
Advisory Committee	Yu-Liang Chou (周育樑); Tong-Yu Hsieh (謝東佑)
Date of Exam	2018-07-24	Date of Submission	2018-08-06
Keywords	hyperscalar, nested-loop semantics, loop ILP, loop unrolling, loop semantics, nested-loop unrolling
Statistics	The thesis/dissertation has been browsed 5634 times and downloaded 0 times.

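Chinese Abstract
Today's ILP-capable processors cannot analyze instructions or actively reorder program instructions to raise parallelism. They can only fetch instructions in the order produced by the compiler and then, based on the dependences among instructions, dispatch as many ready instructions as possible. Programs with high performance demands, whether in image processing or in machine learning, the current mainstream of development, make heavy use of loop structures. The particular nature of loop structures makes it hard for ILP-capable processors to raise parallelism, so the compiler has to compile the program specially to raise its parallelism; this approach lacks flexibility and cannot be applied to programs that have already been compiled. This thesis proposes a dynamic loop unroller based on semantic analysis of nested-loop instruction sequences and designs it into the instruction analyzer of the Hyperscalar architecture. The dynamic loop unroller has three parts: (1) the loop detection unit, (2) the unrolling control unit, and (3) the loop unrolling unit. It analyzes the instruction sequence according to defined semantics, finds the instruction interval of each loop, and collects data. By analyzing the collected data, the relationships between loops, and the current state of unrolling, it unrolls the loop program segment by segment, removes the particular nature of the loop structure, and, after reordering the instructions, dispatches them to the cores of the processor. To verify the performance gain from adding the dynamic loop unroller, the evaluation takes the convolution and matrix multiplication operations heavily used in machine learning, together with the mix column step of the widely used AES, and uses a general-purpose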
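The loop detection step described in this abstract can be illustrated with a minimal sketch. It is not the thesis's detection logic, whose semantic rules are not given in this record. It assumes that a loop is closed by any branch whose target lies at or before the branch itself, and that one loop is nested in another when its address interval is contained in the other's. The `Instr` type, the addresses, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]


@dataclass
class Instr:
    """Hypothetical decoded instruction: address, mnemonic, optional branch target."""
    addr: int
    op: str
    target: Optional[int] = None


def find_loops(program: List[Instr]) -> List[Interval]:
    """Return the closed (start, end) address interval of every detected loop."""
    loops = []
    for ins in program:
        # Assumption: a branch that jumps to itself or backwards closes a loop.
        if ins.target is not None and ins.target <= ins.addr:
            loops.append((ins.target, ins.addr))
    return loops


def nesting(loops: List[Interval]) -> Dict[Interval, Optional[Interval]]:
    """Map each loop to the smallest loop that encloses it, or None if outermost."""
    parent = {}
    for lp in loops:
        enclosing = [o for o in loops
                     if o != lp and o[0] <= lp[0] and lp[1] <= o[1]]
        parent[lp] = min(enclosing, key=lambda o: o[1] - o[0], default=None)
    return parent


# Illustrative two-level loop; the mnemonics are ARM-like but made up for the sketch.
program = [
    Instr(0, "mov"),
    Instr(4, "mov"),             # outer loop body starts here
    Instr(8, "add"),             # inner loop body starts here
    Instr(12, "add"),
    Instr(16, "cmp"),
    Instr(20, "blt", target=8),  # closes the inner loop
    Instr(24, "add"),
    Instr(28, "cmp"),
    Instr(32, "blt", target=4),  # closes the outer loop
]

loops = find_loops(program)
parents = nesting(loops)
```

A hardware detection unit would work on the fetched instruction stream rather than on a complete program listing, so this sketch only conveys the idea of recovering loop intervals and their nesting from branch targets.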
Abstract
Today's ILP processors cannot analyze the semantic information of instructions or automatically reorder the instruction sequence to increase ILP. They can only fetch instructions sequentially, analyze the dependences between them, and dispatch those whose operands are ready. Programs with high performance requirements, such as image processing or machine learning, contain many loop structures. The particular nature of loops makes it hard for processors to increase ILP. Processors have to rely on a special compiler to compile the code, a method that is inflexible and cannot be applied to code that has already been compiled. This thesis proposes a dynamic loop unrolling mechanism based on semantic analysis of nested loops. Its architecture consists of three units: the loop detect unit (LDU), the unrolling control unit (UCU), and the loop unrolling unit (LUU). The mechanism parses the semantics of the instructions to find the closed interval of the loop body instructions and collects information about the instructions in the loop body. It then cuts the instruction sequence of a nested loop into several segments, unrolls them according to the information from the LDU and the current unrolling state, and dispatches the instructions to the cores. The verifications use ARM instructions generated by
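To make the unrolling idea concrete, here is a minimal sketch of unrolling a loop body with per-copy register renaming. It is a simplification under stated assumptions, not the LUU design: instructions are hypothetical `(op, dst, src...)` tuples, the caller names which registers are loop-local temporaries, and loop-control instructions are assumed to have been removed already.

```python
from typing import Iterable, List, Tuple

Instr = Tuple[str, ...]  # hypothetical form: (op, dst, src1, src2, ...)


def unroll(body: List[Instr], temps: Iterable[str], factor: int) -> List[Instr]:
    """Replicate `body` `factor` times, giving each copy private temporaries."""
    temps = set(temps)
    out = []
    for i in range(factor):
        # Rename only the loop-local temporaries; other registers keep their names.
        rename = {r: f"{r}_{i}" for r in temps}
        for op, dst, *srcs in body:
            out.append((op, rename.get(dst, dst),
                        *[rename.get(s, s) for s in srcs]))
    return out


# Illustrative body: load a value, scale it, accumulate it.
body = [
    ("load", "t0", "a"),
    ("mul", "t1", "t0", "k"),
    ("add", "acc", "acc", "t1"),
]

unrolled = unroll(body, temps={"t0", "t1"}, factor=2)
```

After renaming, the `load` and `mul` of the second copy no longer reuse the registers of the first copy, so they could be issued in parallel with it. The accumulation into `acc` remains a true dependence between copies and is deliberately left untouched in this sketch.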

Table of Contents
Thesis Approval Form
Acknowledgements
Chinese Abstract
Abstract
Contents
List of Figures
Chapter 1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Thesis Organization
Chapter 2 Related Work
  2.1 Superscalar Processors
    2.1.1 Superscalar Processor Architecture
    2.1.2 Exception Handling in Out-of-Order Execution
  2.2 Hyperscalar Processors
    2.2.1 Hyperscalar Processor Architecture
    2.2.2 Instruction Analyzer
    2.2.3 Virtual Shared Register File
  2.3 Improving Loop Efficiency in Pipelined and Superscalar Processors
    2.3.1 Loop Buffers
    2.3.2 Loop Cache
    2.3.3 Branch Prediction
    2.3.4 Loop Unrolling When Writing the Program
  2.4 Improving Loop Execution Efficiency in VLIW Processors
  2.5 Improving Loop Execution Efficiency in the Hyperscalar Architecture
Chapter 3 Design of a Dynamic Loop Unroller Based on Semantic Analysis of Nested-Loop Instruction Sequences in the Hyperscalar Architecture
  3.1 Architecture of the Loop Unrolling System Based on Nested-Loop Semantic Analysis
    3.1.1 System Design Concept
    3.1.2 System Architecture
  3.2 Loop Detection Unit
    3.2.1 Single-Level Loop Detection
    3.2.2 Nested-Loop Identification
    3.2.3 Loop Data Storage Format
    3.2.4 Example of the Loop Data Storage Format
    3.2.5 Loop Detection Unit Architecture
  3.3 Unrolling Control Unit
    3.3.1 Outer-Loop Data Processing
    3.3.2 Outer-Loop Counter Data Compensation
    3.3.3 Processing of Instructions Between Loops
    3.3.4 Loop Unrolling Data Storage Format
    3.3.5 Unrolling Control Unit Architecture
  3.4 Loop Unrolling Unit
    3.4.1 Eliminating Dependences Between Overlapped Operations
    3.4.2 Register Renaming
    3.4.3 Handling of Inter-Loop Instructions and Outer-Loop Counter Compensation Instructions
    3.4.4 Loop Compensation Mechanism
    3.4.5 Loop Unrolling Unit Architecture
  3.5 Overview of the Loop Unroller
Chapter 4 Simulation and Verification
  4.1 Simulation Program Flow
    4.1.1 Introduction to the Simulation Program Flow
    4.1.2 Test Instructions
  4.2 Simulation Results and Analysis
    4.2.1 Simulation Results
    4.2.2 Analysis of Simulation Results
Chapter 5 Conclusion
References


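Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid breaking the law.
Thesis access permission: user define
Available: Campus: available; Off-campus: available
etd-0706118-131509.pdf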
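Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. To look up public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available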
