Master's/Doctoral Thesis etd-1004102-042742: detailed record
Title page for etd-1004102-042742
Title
SAGE: An Automatic Analyzing and Parallelizing System to Improve Performance and Reduce Energy on a New High-Performance SoC Architecture—Processor-in-Memory
Department
Year, semester
Language
Degree
Number of pages
103
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2002-09-26
Date of Submission
2002-10-04
Keywords
SoC, Processor-in-Memory architecture, statement-based, automatic parallelizing compiler, energy reduction
Statistics
This thesis/dissertation has been viewed 5689 times and downloaded 0 times.
Abstract
Continuous improvements in semiconductor fabrication density are enabling a new class of System-on-a-Chip (SoC) architectures that combine processing logic (processors) with high-density memory. Such architectures, generally called Processor-in-Memory or Intelligent Memory, support high-performance computing by narrowing the performance gap between the processor and the memory. A Processor-in-Memory system combines several kinds of processors in a single system; these processors differ in their computational and memory-access capabilities, both in performance and in energy consumption. This thesis addresses two main problems: how to improve the performance, and how to reduce the energy consumption, of applications running on Processor-in-Memory architectures. Solving them requires a strategy that identifies the capabilities of the different processors and dispatches the most suitable work to each so that their capabilities are fully exploited. To this end, this study proposes a novel automatic source-to-source parallelizing system, called SAGE, that exploits the advantages of Processor-in-Memory architectures. Unlike conventional iteration-based parallelizing systems, SAGE adopts a statement-based analysis: it decomposes the original program into blocks of statements and produces a feasible execution schedule for the host and memory processors. Several techniques, including statement splitting, weight evaluation, performance scheduling, and energy-reduction scheduling, are designed and integrated into the SAGE system to automatically transform Fortran source programs so as to improve their performance or reduce the energy they consume on a Processor-in-Memory architecture. This thesis describes these techniques in detail and discusses experimental results for real benchmarks transformed by the SAGE system and run on a Processor-in-Memory architecture.
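As a rough illustration of the statement-based scheduling idea summarized above, the following Python sketch greedily assigns weighted statement blocks to either a host processor (P.Host) or a memory processor (P.Mem). The block names, cost numbers, greedy balancing rule, and the omission of data dependences are all illustrative assumptions for exposition; this is not the SAGE algorithm itself, which operates on Fortran source and uses the weight-evaluation and scheduling mechanisms described in Chapter 4.

```python
# Illustrative sketch only: greedy assignment of weighted statement blocks
# to a host processor (P.Host) or a memory processor (P.Mem).
# The costs, block names, and greedy rule are assumptions for exposition;
# data dependences between blocks are ignored for simplicity.

from dataclasses import dataclass

@dataclass
class Block:
    name: str          # label for a group of statements
    host_cost: float   # estimated weight (execution time) on P.Host
    mem_cost: float    # estimated weight (execution time) on P.Mem

def schedule(blocks):
    """Assign each block to whichever processor would finish it earlier,
    given the load already placed on that processor."""
    host_load = mem_load = 0.0
    plan = {}
    for b in blocks:
        if host_load + b.host_cost <= mem_load + b.mem_cost:
            plan[b.name] = "P.Host"
            host_load += b.host_cost
        else:
            plan[b.name] = "P.Mem"
            mem_load += b.mem_cost
    return plan, max(host_load, mem_load)

if __name__ == "__main__":
    blocks = [
        Block("B1", host_cost=4.0, mem_cost=9.0),  # compute-bound: favors P.Host
        Block("B2", host_cost=7.0, mem_cost=2.0),  # memory-bound: favors P.Mem
        Block("B3", host_cost=3.0, mem_cost=3.5),
    ]
    plan, makespan = schedule(blocks)
    print(plan)      # e.g. {'B1': 'P.Host', 'B2': 'P.Mem', 'B3': 'P.Mem'}
    print(makespan)  # finish time of the more heavily loaded processor
```

In the same spirit, an energy-oriented variant would compare per-block energy estimates rather than (or in addition to) time weights when choosing a processor, which is the intuition behind the energy-reduction scheduling mechanisms listed in Section 4.6.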
Table of Contents
Acknowledgement ii
Abstract (in Chinese) iii
Abstract iv
Table of Contents v
List of Tables vii
List of Figures viii
Chapter 1 Introduction 1
Chapter 2 Related Work 4
2.1 PIM Architectural Aspects 4
2.2 PIM Compiler Aspect 6
2.3 Weight Evaluation Aspects 7
2.4 Scheduling Aspects 8
2.5 Energy Reduction Aspects 9
Chapter 3 Processor-in-Memory Architecture 11
3.1 Architecture Description 11
3.2 Synchronization Mechanism 13
3.3 Cache Coherence Mechanism 14
Chapter 4 Methodology 15
4.1 Statement Splitting and WPG Construction 18
4.2 Weight Evaluation 22
4.2.1 Partial Profiling Weight Evaluation Mechanism 23
4.2.2 Self-Patch Weight Evaluation 27
4.3 Performance Scheduling Mechanisms for One-P.Host and One-P.Mem Configuration 31
4.3.1 Basic Scheduling Mechanism 31
4.3.2 Seesaw Dispatching Mechanism 33
4.4 Performance Scheduling Mechanism for One-P.Host and Many-P.Mems Configuration 39
4.5 Optimizing Techniques 51
4.5.1 Tiling for PIM 51
4.5.2 Loop splitting for PIM 55
4.5.3 IMOP recognition 57
4.6 Energy Reduction Scheduling for PIM Architectures 61
4.6.1 Performance-Oriented Energy Reduction Scheduling 62
4.6.2 Energy-Oriented Energy Reduction Scheduling 67
4.6.3 Example 71
4.7 Implementation of the SAGE system 73
Chapter 5 Experimental Results 75
5.1 Experimental Results of the Scheduling Mechanism for the One-P.Host and One-P.Mem Configuration 76
5.2 Experimental Results of the Scheduling Mechanism for the One-P.Host and Many-P.Mem Configuration 79
5.3 Experimental Results of the Energy Reduction Scheduling Mechanism for the One-P.Host and One-P.Mem Configuration 82
Chapter 6 Conclusion 84
References 86
Vita 92
Fulltext
The electronic full text is licensed for individual, non-profit searching, reading, and printing for academic research purposes only. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: not available on or off campus
Available:
On campus: not available (never made public)
Off-campus: not available (never made public)

Printed copies
Information on the public availability of printed copies is relatively complete from academic year 102 (ROC calendar) onward. To check the availability of printed copies from academic year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience.
Available: yes (already made public)
