國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,在SIMD處理器上透過包裝暫存器的方法研究可平行化特性上的限制,A Study of the Limits of Parallelism Available in SIMD Processors Through Register Packing

論文名稱 Title	在SIMD處理器上透過包裝暫存器的方法研究可平行化特性上的限制 A Study of the Limits of Parallelism Available in SIMD Processors Through Register Packing
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	102 學年度第 2 學期 The spring semester of Academic Year 102	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	92
研究生 Author	陳柔佳 Rou-Jia Chen
指導教授 Advisor	希家史提夫 Steve W. Haga
召集委員 Convenor	李宗南 Chungnan Lee
口試委員 Advisory Committee	謝東佑, 林俊宏 Tong-Yu Hsieh; Chun-Hung Lin
口試日期 Date of Exam	2014-01-27	繳交日期 Date of Submission	2014-03-26
關鍵字 Keywords	迴圈展開、暫存器分配、SIMD和向量處理、指令編排 register allocation, loop unrolling, SIMD and vector processing, instruction scheduling
統計 Statistics	本論文已被瀏覽 5693 次，被下載 529 次 The thesis/dissertation has been browsed 5693 times, has been downloaded 529 times.

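Chinese Abstract (English translation)
This thesis designs a SIMD processor that exploits instruction-level parallelism for general-purpose computation on embedded systems, whose hardware is usually smaller in scale than today's widely used CPUs and GPUs. We develop a number of techniques to reduce the instruction scheduling time for our SIMD processor. Some apply branch-and-bound methods to our existing SIMD framework while preserving optimality; these include PRSR, memorization, and register grouping. We additionally provide heuristics, shortcuts that speed up the exhaustive search quickly and efficiently, such as unrolling optimization, instruction distribution, and sign constraint. Through register packing and loop unrolling, we apply our SIMD processor to MiBench and obtain performance comparable to that of a VLIW processor. Moreover, our register packing allows vector-wide load and store instructions to access the SRAM. These memory access instructions occur frequently during program execution and yield a significant speedup in instruction scheduling on our SIMD processor.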
Abstract
This thesis designs an instruction-level-parallelism processor for embedded systems that perform general-purpose computation. The hardware of such embedded systems is smaller in scale than that of currently popular CPUs or GPUs. We develop several techniques to reduce the instruction scheduling time of our SIMD processor. The first group consists of branch-and-bound modifications to the algorithm that maintain optimality: PRSR (pseudo random shift register), memorization, and register grouping. The second group consists of heuristics, shortcuts that let us complete the exhaustive search quickly and efficiently: unrolling optimization, instruction distribution, and sign constraint. Through register packing and loop unrolling, we apply our SIMD processor to MiBench and obtain performance comparable to that of a VLIW processor. Moreover, our register packing allows for a vector-wide load from the SRAM. Such a load is a natural fit for a SIMD processor and achieves significant speedups when our allocator is used.
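The combination of loop unrolling and register packing mentioned in the abstract can be illustrated with a small sketch. It is not the thesis's allocator or scheduler: the lane count `LANES`, the helper `vector_load`, and the function names are all assumptions made for illustration, and a Python list stands in for a packed register. The sketch only shows the general idea that unrolling a loop by the lane count lets several scalar operations share one vector-wide load and one SIMD operation, with a scalar epilogue for leftover elements.

```python
from typing import List, Tuple

LANES = 4  # assumed SIMD width; the abstract does not state the real one


def scalar_add(a: List[int], b: List[int]) -> List[int]:
    """Reference loop: processes one element per iteration."""
    return [a[i] + b[i] for i in range(len(a))]


def vector_load(mem: List[int], base: int) -> List[int]:
    """Hypothetical vector-wide load: LANES consecutive words fill one packed register."""
    return mem[base:base + LANES]


def packed_add(a: List[int], b: List[int]) -> Tuple[List[int], int]:
    """Loop unrolled by LANES; each unrolled body becomes one packed operation.

    Returns the result and the number of packed (vector) additions issued.
    """
    n = len(a)
    out = [0] * n
    vector_ops = 0
    main = n - n % LANES  # largest multiple of LANES not exceeding n

    for i in range(0, main, LANES):
        ra = vector_load(a, i)                 # one load fills all lanes
        rb = vector_load(b, i)
        rc = [x + y for x, y in zip(ra, rb)]   # models one lane-wise SIMD add
        out[i:i + LANES] = rc
        vector_ops += 1

    for i in range(main, n):                   # scalar epilogue for the remainder
        out[i] = a[i] + b[i]

    return out, vector_ops


if __name__ == "__main__":
    a = list(range(10))
    b = [10 * x for x in a]
    result, ops = packed_add(a, b)
    print(result, ops)
```

With ten elements and four assumed lanes, the first eight elements are covered by two packed additions and the last two fall to the scalar epilogue.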

目次 Table of Contents
Thesis Approval Form
Chinese Abstract
English Abstract
1 Introduction
  1.1 Motivation and Goal
  1.2 A Summary of Our Contribution
  1.3 The key Process of Our Methodology
2 Related Work
  2.1 Basic of DFG construction
    2.1.1 Super-Blocking
    2.1.2 Loop Unrolling
  2.2 Instruction Scheduling
    2.2.1 SISD (superscalar)-List Scheduling
    2.2.2 MIMD (VLIW)-Exhaustive Searching
    2.2.3 SIMD (vector)-A Survey of Approaches
3 Methodology
  3.1 Algorithm Modification that Maintain Optimality
    3.1.1 PRSR
    3.1.2 Memorization
    3.1.3 Register Grouping
  3.2 Heuristics
    3.2.1 Unrolling Optimization
    3.2.2 Instructions Distribution
    3.2.3 Sign Constraint
4 Performance Comparison and Result
5 Conclusion
6 References

電子全文 Fulltext
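This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: user-defined release time. Available: Campus: available; Off-campus: available. etd-0224114-122630.pdf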
紙本論文 Printed copies
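Public information on printed copies is relatively complete from Academic Year 102 onward. To look up public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available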

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS