Master's/Doctoral Thesis etd-1004102-042742: detailed record
Title page for etd-1004102-042742
Title
SAGE: An Automatic Analyzing and Parallelizing System to Improve Performance and Reduce Energy on a New High-Performance SoC Architecture—Processor-in-Memory
Department
Year, semester
Language
Degree
Number of pages
103
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2002-09-26
Date of Submission
2002-10-04
Keywords
SoC, Processor-in-Memory architecture, statement-based, automatic parallelizing compiler, energy reduction
Statistics
This thesis/dissertation has been viewed 5689 times and downloaded 0 times.
Abstract
Continuous improvements in semiconductor fabrication density are enabling a new class of System-on-a-Chip (SoC) architectures that combine processing logic (processors) with high-density memory. Such architectures, generally called Processor-in-Memory or Intelligent Memory, support high-performance computing by narrowing the performance gap between the processor and the memory. A Processor-in-Memory system combines several kinds of processors in a single system; these processors differ in their computational and memory-access capabilities, both in performance and in energy consumption. This thesis addresses two main problems: how to improve the performance, and how to reduce the energy consumption, of applications running on Processor-in-Memory architectures. Solving them requires a strategy that identifies the capabilities of the different processors and dispatches the most suitable work to each so that their capabilities are fully exploited. To this end, this study proposes a novel automatic source-to-source parallelizing system, called SAGE, that exploits the advantages of Processor-in-Memory architectures. Unlike conventional iteration-based parallelizing systems, SAGE adopts a statement-based analysis: it decomposes the original program into blocks of statements and produces a feasible execution schedule for the host and memory processors. Several techniques, including statement splitting, weight evaluation, performance scheduling, and energy-reduction scheduling, are designed and integrated into the SAGE system to automatically transform Fortran source programs so as to improve their performance or reduce the energy they consume on a Processor-in-Memory architecture. This thesis describes these techniques in detail and discusses experimental results for real benchmarks transformed by the SAGE system and run on a Processor-in-Memory architecture.
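As a rough illustration of the statement-based scheduling idea summarized above, the following Python sketch greedily assigns weighted statement blocks to either a host processor (P.Host) or a memory processor (P.Mem). The block names, cost numbers, greedy balancing rule, and the omission of data dependences are all illustrative assumptions for exposition; this is not the SAGE algorithm itself, which operates on Fortran source and uses the weight-evaluation and scheduling mechanisms described in Chapter 4.

```python
# Illustrative sketch only: greedy assignment of weighted statement blocks
# to a host processor (P.Host) or a memory processor (P.Mem).
# The costs, block names, and greedy rule are assumptions for exposition;
# data dependences between blocks are ignored for simplicity.

from dataclasses import dataclass

@dataclass
class Block:
    name: str          # label for a group of statements
    host_cost: float   # estimated weight (execution time) on P.Host
    mem_cost: float    # estimated weight (execution time) on P.Mem

def schedule(blocks):
    """Assign each block to whichever processor would finish it earlier,
    given the load already placed on that processor."""
    host_load = mem_load = 0.0
    plan = {}
    for b in blocks:
        if host_load + b.host_cost <= mem_load + b.mem_cost:
            plan[b.name] = "P.Host"
            host_load += b.host_cost
        else:
            plan[b.name] = "P.Mem"
            mem_load += b.mem_cost
    return plan, max(host_load, mem_load)

if __name__ == "__main__":
    blocks = [
        Block("B1", host_cost=4.0, mem_cost=9.0),  # compute-bound: favors P.Host
        Block("B2", host_cost=7.0, mem_cost=2.0),  # memory-bound: favors P.Mem
        Block("B3", host_cost=3.0, mem_cost=3.5),
    ]
    plan, makespan = schedule(blocks)
    print(plan)      # e.g. {'B1': 'P.Host', 'B2': 'P.Mem', 'B3': 'P.Mem'}
    print(makespan)  # finish time of the more heavily loaded processor
```

In the same spirit, an energy-oriented variant would compare per-block energy estimates rather than (or in addition to) time weights when choosing a processor, which is the intuition behind the energy-reduction scheduling mechanisms listed in Section 4.6.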
Table of Contents
Acknowledgement ii
Abstract (in Chinese) iii
Abstract iv
Table of Contents v
List of Tables vii
List of Figures viii
Chapter 1 Introduction 1
Chapter 2 Related Work 4
2.1 PIM Architectural Aspects 4
2.2 PIM Compiler Aspect 6
2.3 Weight Evaluation Aspects 7
2.4 Scheduling Aspects 8
2.5 Energy Reduction Aspects 9
Chapter 3 Processor-in-Memory Architecture 11
3.1 Architecture Description 11
3.2 Synchronization Mechanism 13
3.3 Cache Coherence Mechanism 14
Chapter 4 Methodology 15
4.1 Statement Splitting and WPG Construction 18
4.2 Weight Evaluation 22
4.2.1 Partial Profiling Weight Evaluation Mechanism 23
4.2.2 Self-Patch Weight Evaluation 27
4.3 Performance Scheduling Mechanisms for One-P.Host and One-P.Mem Configuration 31
4.3.1 Basic Scheduling Mechanism 31
4.3.2 Seesaw Dispatching Mechanism 33
4.4 Performance Scheduling Mechanism for One-P.Host and Many-P.Mems Configuration 39
4.5 Optimizing Techniques 51
4.5.1 Tiling for PIM 51
4.5.2 Loop splitting for PIM 55
4.5.3 IMOP recognition 57
4.6 Energy Reduction Scheduling for PIM Architectures 61
4.6.1 Performance-Oriented Energy Reduction Scheduling 62
4.6.2 Energy-Oriented Energy Reduction Scheduling 67
4.6.3 Example 71
4.7 Implementation of the SAGE system 73
Chapter 5 Experimental Results 75
5.1 Experimental Results of the Scheduling Mechanism for the One-P.Host and One-P.Mem Configuration 76
5.2 Experimental Results of the Scheduling Mechanism for the One-P.Host and Many-P.Mem Configuration 79
5.3 Experimental Results of the Energy Reduction Scheduling Mechanism for the One-P.Host and One-P.Mem Configuration 82
Chapter 6 Conclusion 84
References 86
Vita 92
Fulltext
The electronic full text is licensed for individual, non-profit searching, reading, and printing for academic research purposes only. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: not available on or off campus
Available:
On campus: not available (never made public)
Off-campus: not available (never made public)

Printed copies
Information on the public availability of printed copies is relatively complete from academic year 102 (ROC calendar) onward. To check the availability of printed copies from academic year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
Available: yes (already made public)
