博碩士論文 etd-0224114-122630 詳細資訊
Title page for etd-0224114-122630
論文名稱
Title
在SIMD處理器上透過包裝暫存器的方法研究可平行化特性上的限制
A Study of the Limits of Parallelism Available in SIMD Processors Through Register Packing
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
92
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2014-01-27
繳交日期
Date of Submission
2014-03-26
關鍵字
Keywords
迴圈展開、暫存器分配、SIMD和向量處理、指令編排
register allocation, loop unrolling, SIMD and vector processing, instruction scheduling
統計
Statistics
本論文已被瀏覽 5693 次,被下載 529 次
This thesis/dissertation has been viewed 5693 times and downloaded 529 times.
中文摘要
這篇論文設計了一個有指令平行效益的SIMD處理器在通用計算的嵌入式系統上,這些嵌入式系統的硬體規模常小於目前被廣泛應用的CPU或是GPU。而我們在此篇論文中,研發了許多技巧去改進我們SIMD處理器的指令編排的時間。藉由應用一些分支界定法來改善我們原本已有的SIMD架構,使之保有優化。這些方法包含PRSR、memorization和register grouping。我們另外提供了啟發式方法(heuristic),這是一種快捷方法可以快速且高效率的改進我們的全面式搜尋,例如unrolling optimization、instruction distribution和sign constraint。

透過暫存器包裝和迴圈展開,我們將SIMD處理器應用在MiBench上並已具有可與VLIW處理器兼容的效能。而且,我們的暫存器包裝允許處理向量形式的載入或儲存的指令從靜態隨機存取記憶體。這些記憶體存取指令頻繁發生於程式執行的過程,且在我們的SIMD處理器上對指令編排上有著顯著的加速。
Abstract
This thesis designs a SIMD processor that exploits instruction-level parallelism for general-purpose computation on embedded systems. The hardware of such systems is smaller in scale than that of currently popular CPUs or GPUs. We develop several techniques to reduce the instruction scheduling time of our SIMD processor.

We apply branch-and-bound modifications to the algorithm that maintain optimality: PRSR (pseudo random shift register), memorization, and register grouping. We also provide heuristics, shortcuts that let the exhaustive search finish quickly and efficiently, such as unrolling optimization, instruction distribution, and sign constraint.

Through register packing and loop unrolling, we applied our SIMD processor to MiBench and obtained performance comparable to that of a VLIW processor. Moreover, our register packing allows for a vector-wide load from the SRAM. Such a load is a natural fit for a SIMD processor and achieves significant speedups when our allocator is used.
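To make the last point of the abstract concrete, the following is a minimal sketch of how unrolling a loop by the vector width lets several scalar loads be packed into one vector-wide load. It is an illustrative model only, not the thesis's allocator: `VECTOR_WIDTH`, `scalar_sum`, and `packed_sum` are hypothetical names, the four-lane width is an assumption, and a Python list stands in for the SRAM.

```python
# Illustrative model only: VECTOR_WIDTH, scalar_sum and packed_sum are
# hypothetical names, not taken from the thesis.
VECTOR_WIDTH = 4  # assumed number of lanes in one packed register


def scalar_sum(sram):
    """Add elements one at a time: one load per element."""
    total, loads = 0, 0
    for i in range(len(sram)):
        value = sram[i]  # scalar load
        loads += 1
        total += value
    return total, loads


def packed_sum(sram):
    """Unroll the loop by VECTOR_WIDTH and pack each group into one register."""
    lanes = [0] * VECTOR_WIDTH  # one packed accumulator register
    loads = 0
    full = len(sram) - len(sram) % VECTOR_WIDTH  # elements covered by full vectors
    for i in range(0, full, VECTOR_WIDTH):
        packed = sram[i:i + VECTOR_WIDTH]  # one vector-wide load
        loads += 1
        lanes = [a + b for a, b in zip(lanes, packed)]  # one lane-wise add
    total = sum(lanes)  # horizontal reduction across the lanes
    for i in range(full, len(sram)):  # scalar remainder loop
        total += sram[i]
        loads += 1
    return total, loads
```

For a 10-element array, the scalar loop issues 10 loads, while the packed loop issues 4: two vector-wide loads plus two scalar loads for the remainder. Both return the same sum.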
目次 Table of Contents
論文審定書 Thesis Approval Form
中文摘要 Chinese Abstract
英文摘要 English Abstract
1 Introduction
1.1 Motivation and Goal
1.2 A Summary of Our Contribution
1.3 The key Process of Our Methodology
2 Related Work
2.1 Basic of DFG construction
2.1.1 Super-Blocking
2.1.2 Loop Unrolling
2.2 Instruction Scheduling
2.2.1 SISD (superscalar)-List Scheduling
2.2.2 MIMD (VLIW)-Exhaustive Searching
2.2.3 SIMD (vector)-A Survey of Approaches
3 Methodology
3.1 Algorithm Modification that Maintain Optimality
3.1.1 PRSR
3.1.2 Memorization
3.1.3 Register Grouping
3.2 Heuristics
3.2.1 Unrolling Optimization
3.2.2 Instructions Distribution
3.2.3 Sign Constraint
4 Performance Comparison and Result
5 Conclusion
6 References
參考文獻 References
[ 1] Duncan, Ralph. "A survey of parallel computer architectures." 1990.
[ 2] Pentium processor with MMX technology. http://edc.intel.com/Platforms/Previous/Processors/Pentium-MMX
[ 3] Intel. IA-32 intel architecture optimization reference manual. 2005.
[ 4] IBM. PowerPC microprocessor family: Vector/SIMD multimedia extension technology programming environments manual. IBM Systems and Technology Group, 2005.
[ 5] S. Oberman, G. Favor, and F. Weber. AMD 3DNow! technology: Architecture and implementations. IEEE Micro, 1999.
[ 6] Linear feedback shift register. http://en.wikipedia.org/wiki/Linear_feedback_shift_register
[ 7] Farber, Rob. CUDA application design and development. Elsevier, 2011.
[ 8] Yu-Dan Su. "A High Performance Register Allocator for Vector Architectures With a Unified Register-Set." Master's thesis, National Sun Yat-sen University, 2010.
[ 9] Davidson, Jack W., and Sanjay Jinturkar. "Improving instruction-level parallelism by loop unrolling and dynamic memory disambiguation." Proceedings of the 28th annual international symposium on Microarchitecture. IEEE Computer Society Press, 1995.
[ 10] Lam, Monica. "Software pipelining: An effective scheduling technique for VLIW machines." ACM Sigplan Notices. Vol. 23. No. 7. ACM, 1988.
[ 11] Dai, Yunyang, et al. "SIMD-efficient loop unrolling design for embedded multimedia applications." Multimedia and Expo, 2004. ICME'04. 2004 IEEE International Conference on. Vol. 3. IEEE, 2004.
[ 12] Leupers, Rainer. "Compiler design issues for embedded processors." Design & Test of Computers, IEEE 19.4 (2002): 51-58.
[ 13] Nuzman, Dorit, et al. "Compiling for an indirect vector register architecture." Proceedings of the 5th Conference on Computing Frontiers. ACM, 2008.
[ 14] Franchetti, Franz, et al. "Efficient utilization of SIMD extensions." Proceedings of the IEEE 93.2 (2005): 409-425.
[ 15] Kudriavtsev, Alexei, and Peter Kogge. "Generation of permutations for SIMD processors." ACM SIGPLAN Notices. Vol. 40. No. 7. ACM, 2005.
[ 16] Eichenberger, Alexandre E., et al. "Optimizing compiler for the cell processor." Parallel Architectures and Compilation Techniques, 2005. PACT 2005. 14th International Conference on. IEEE, 2005.
[ 17] Ren, Gang, Peng Wu, and David Padua. "Optimizing data permutations for SIMD devices." ACM SIGPLAN Notices. Vol. 41. No. 6. ACM, 2006.
[ 18] Weiss, Michael. "Strip mining on SIMD architectures." Proceedings of the 5th international conference on Supercomputing. ACM, 1991.
[ 19] Gschwind, Michael, et al. "Synergistic processing in cell's multicore architecture." Micro, IEEE 26.2 (2006): 10-24.
[ 20] Eichenberger, Alexandre E., Peng Wu, and Kevin O'brien. "Vectorization for SIMD architectures with alignment constraints." ACM SIGPLAN Notices. Vol. 39. No. 6. ACM, 2004.
[ 21] Sreraman, N., and R. Govindarajan. "A vectorizing compiler for multimedia extensions." International Journal of Parallel Programming 28.4 (2000): 363-400.
[ 22] Larsen, Samuel, and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. Vol. 35. No. 5. ACM, 2000.
[ 23] Ueng, Sain-Zee, et al. "CUDA-lite: Reducing GPU programming complexity." Languages and Compilers for Parallel Computing. Springer Berlin Heidelberg, 2008. 1-15.
[ 24] Ryoo, Shane, et al. "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA." Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming. ACM, 2008.
[ 25] Ryoo, Shane, et al. "Program optimization space pruning for a multithreaded gpu." Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization. ACM, 2008.
[ 26] Guthaus, Matthew R., Jeffrey S. Ringenberg, Dan Ernst, Todd M. Austin, Trevor Mudge, and Richard B. Brown. "MiBench: A free, commercially representative embedded benchmark suite." IEEE 4th Annual Workshop on Workload Characterization, Austin, TX, December 2001.
電子全文 Fulltext
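本電子全文僅授權使用者為學術研究之目的,進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定,切勿任意重製、散佈、改作、轉貼、播送,以免觸法。
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid breaking the law.
論文使用權限 Thesis access permission:自定論文開放時間 user defined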
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
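紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊,請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。
Availability information for printed theses is relatively complete from academic year 102 onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.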
開放時間 available 已公開 available
