National Sun Yat-sen University (國立中山大學), thesis/dissertation, Study of the Hyperscalar Multi-core Architecture

Title	Study of the Hyperscalar Multi-core Architecture (超多純量多核心架構之研究)
Department	Department of Electrical Engineering
Year, semester	Fall semester of Academic Year 100	Language	English
Degree	Ph.D.	Number of pages	135
Author	Yu-Liang Chou (周育樑)
Advisor	Jih-Ching Chiu (邱日清)
Convenor	Chung-Ping Chung (鍾崇斌)
Advisory Committee	Ce-Kuen Shieh (謝錫堃); Shen-Fu Hsiao (蕭勝夫); Shie-Jue Lee (李錫智); Shiann-Rong Kuang (鄺獻榮)
Date of Exam	2011-08-25	Date of Submission	2011-09-07
Keywords	SIMD, chip multiprocessors, superscalar, dynamic multi-core, reconfigurable hardware, multimedia processing, hyperscalar
Statistics	This thesis/dissertation has been browsed 5726 times and downloaded 1150 times.

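Abstract (translated from the Chinese)
Chip multiprocessors (CMPs) have become the mainstream trend in processor design. In a conventional CMP system, instruction-level parallelism (ILP) is exploited by the architecture of each individual core, and thread-level parallelism (TLP) is exploited by running multiple cores in parallel. However, a conventional CMP architecture must trade off high single-thread performance against high throughput when the hardware is first planned, and it cannot dynamically adjust its ability to exploit ILP and TLP. As a result, current CMPs handle the varied application types of the future inefficiently. To address this challenge, this dissertation proposes the Hyperscalar computing concept, which lets a multi-core architecture dynamically group multiple single-issue cores into one more capable superscalar core. This reconfigurability gives the multi-core architecture more flexibility in handling varied application types: when TLP is low, multiple cores cooperate in a superscalar manner to provide high single-thread performance; otherwise, the cores operate independently to provide high throughput.
First, based on the reconfigurability of Hyperscalar, this dissertation proposes a Hyperscalar dual-core architecture. It can play three different roles: a 2-issue statically scheduled superscalar processor, a homogeneous dual-core processor, and a standalone single-core processor. An instruction-dependency analyzer is designed to connect the two single-issue cores and to handle the role switching of the Hyperscalar dual-core architecture. The analyzer makes it possible for the two cores to cooperate as a statically scheduled superscalar processor: it analyzes the dependencies among the instructions of a single thread and dispatches the instructions to the two cores. Instructions with data dependencies are dispatched to the same core, so the dependencies are resolved by the forwarding paths already inside that core. Simulation results show that, in statically scheduled superscalar mode, the Hyperscalar dual-core architecture achieves a 30.3% performance improvement over a conventional single-core processor. In a 90 nm process, extending a homogeneous dual core into a lightweight Hyperscalar dual core increases total area by only 1.8% and total power by only 1.75%. Its low hardware cost, low power consumption, and low re-engineering cost make it a dual-core architecture well suited to future embedded systems.
Next, targeting high-performance computing systems, this dissertation extends the Hyperscalar dual-core architecture into a Hyperscalar multi-core architecture and proposes the virtual shared register concept, which lets the instructions of a single thread that are distributed across multiple cores logically see one set of registers. Simulation results show that the superscalar configurations of the Hyperscalar architecture that group 2, 4, 8, 16, and 32 cores achieve 95%, 84%, 82%, 85%, and 90% of the performance of conventional 2-, 4-, 8-, 16-, and 32-issue out-of-order superscalar processors, respectively.
Finally, this dissertation proposes the Multi-streaming SIMD concept, applicable to the Hyperscalar multi-core architecture, so that it can efficiently exploit data-level parallelism (DLP). Multi-streaming SIMD feeds the same instruction stream to multiple SIMD units while each SIMD unit processes a different data stream; at the same time, each SIMD unit can exploit its own DLP. Simulation results show that a Multi-streaming SIMD computing engine with four multimedia operation storage units provides a 3.3x to 5.5x performance improvement over the conventional MMX instruction set extension. With the research items above completed, a multi-core architecture suited to the varied applications of the future has been realized.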
Abstract
Current trends in processor design have migrated toward chip multiprocessors (CMPs). CMPs are designed to exploit both instruction-level parallelism (ILP) within processors and thread-level parallelism (TLP) within and across processors. However, conventional CMP designs are forced to choose between high single-thread performance and high peak throughput. This inability to adjust to varying levels of ILP and TLP results in processor inefficiency. To resolve this dilemma facing CMP designers, this dissertation proposed the hyperscalar concept for current multi-core designs. The hyperscalar concept enables a multi-core architecture to dynamically group many scalar in-order cores into a superscalar processor that accelerates a sequential thread. This reconfigurability gives the hyperscalar architecture high flexibility in adapting to different types of applications: it provides high single-thread performance when TLP is low and high throughput when TLP is high. Based on the hyperscalar concept, this dissertation first proposed a hyperscalar dual-core architecture that can play three different roles: a 2-issue statically scheduled superscalar processor, a homogeneous dual-core processor, or a standalone single-core processor. An Instruction-dependency Analyzer (IA) that connects two scalar in-order cores is designed to handle the role switching. The IA makes it possible for the two cores to work together like a 2-issue statically scheduled superscalar processor. The IA dispatches instructions with data dependencies to the same core so that the dependencies can be resolved by the existing forwarding paths in that core.
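The dispatching rule described above (dependent instructions go to the same core) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the IA hardware from the dissertation: the `Instr` type, the `dispatch` function, the tie-break that follows the first dependent source register, and the least-loaded placement of independent instructions are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, any number of sources."""
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


def dispatch(instrs: List[Instr], n_cores: int = 2) -> List[int]:
    """Return a core index for each instruction, in program order.

    Toy policy (an assumption, not the thesis's algorithm): an instruction
    follows the core that produced its first tracked source register;
    an instruction with no tracked producer goes to the least-loaded core.
    """
    producer = {}                # register -> core that most recently wrote it
    load = [0] * n_cores         # number of instructions given to each core
    assignment = []
    for ins in instrs:
        owners = [producer[r] for r in ins.srcs if r in producer]
        if owners:
            core = owners[0]                 # keep the dependency inside one core
        else:
            core = load.index(min(load))     # independent: balance the load
        assignment.append(core)
        load[core] += 1
        if ins.dest is not None:
            producer[ins.dest] = core
    return assignment


program = [
    Instr("r1", ("r0",)),   # A: no tracked producer
    Instr("r2", ("r1",)),   # B: reads A's result
    Instr("r3", ("r4",)),   # C: independent of A and B
    Instr("r5", ("r3",)),   # D: reads C's result
]
print(dispatch(program))    # [0, 0, 1, 1]
```

In this model the chain A, B stays on core 0 and the chain C, D stays on core 1, so each dependency would be satisfied by forwarding inside a single core. How the real IA handles an instruction whose sources were produced on different cores is not stated in the abstract, so the model simply picks one.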
Simulation results show that when the proposed architecture works in a statically scheduled superscalar manner, it achieves 30.3% higher instructions per cycle (IPC) than the traditional five-stage pipelined core on 35 benchmarks from the MiBench suite. The increases in area and power for extending a homogeneous dual-core processor to a hyperscalar dual-core processor are only 1.8% and 1.75%, respectively, using 90 nm CMOS technology. On top of that, this dissertation further extended the hyperscalar dual-core architecture to a hyperscalar multi-core architecture capable of flexibly providing high throughput for uniform parallel applications as well as high performance for more general workloads. It can dynamically unite many scalar cores into a larger out-of-order (OOO) superscalar processor to accelerate a thread. To accomplish this, the Virtual Shared Register File (VSRF) concept was proposed so that the instructions of a thread running in different cores logically see a single, uniform register file. Simulation results show that the 2-, 4-, 8-, 16-, and 32-core-united configurations of the hyperscalar multi-core architecture achieve 95%, 84%, 82%, 85%, and 90% of the performance of monolithic 2-, 4-, 8-, 16-, and 32-issue OOO superscalar processors, respectively, on the SPEC2000 benchmarks. Finally, this dissertation proposed a new technology, called multi-streaming SIMD, applicable to the hyperscalar architecture to efficiently exploit data-level parallelism (DLP). The multi-streaming SIMD technology enables current multimedia extensions to simultaneously manipulate multiple data streams. Simulation results show that a multi-streaming SIMD computing engine with four 4-register multimedia operation storage units provides a 3.3x to 5.5x performance improvement over traditional MMX extensions on twelve multimedia kernels. With the research topics above explored, a promising architecture for future multi-core designs was realized.
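The idea of one SIMD operation manipulating several data streams at once can be pictured with a short conceptual model. It is a sketch only: the function names, the choice of an unsigned saturating add, the 8-bit lane limit, and the use of Python lists as packed registers are assumptions made for illustration and do not reproduce the instruction set or hardware of the dissertation.

```python
from typing import Callable, List

Packed = List[int]  # one packed operand: a list of lane values


def packed_add_saturate(a: Packed, b: Packed, max_value: int = 255) -> Packed:
    """Lane-wise unsigned add that clamps each lane at max_value."""
    if len(a) != len(b):
        raise ValueError("packed operands must have the same number of lanes")
    return [min(x + y, max_value) for x, y in zip(a, b)]


def multi_stream(op: Callable[[Packed, Packed], Packed],
                 streams_a: List[Packed],
                 streams_b: List[Packed]) -> List[Packed]:
    """Apply one operation to every stream pair.

    The operation is named once (one instruction stream); each hypothetical
    SIMD unit applies it to its own pair of operands (its own data stream).
    """
    if len(streams_a) != len(streams_b):
        raise ValueError("both operand sets must cover the same streams")
    return [op(a, b) for a, b in zip(streams_a, streams_b)]


streams_a = [[250, 1, 2, 3], [10, 20, 30, 40]]
streams_b = [[10, 1, 1, 1], [1, 2, 3, 4]]
print(multi_stream(packed_add_saturate, streams_a, streams_b))
# [[255, 2, 3, 4], [11, 22, 33, 44]]
```

Parallelism appears at two levels in this model: across the lanes inside each packed operand, and across the streams handled by separate units. The sequential list comprehension stands in for units that would operate concurrently in hardware.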

Table of Contents
Acknowledgements
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Motivation
  1.2 Research Goals
  1.3 Dissertation Organization
Chapter 2 Design of the Hyperscalar Dual-core Architecture
  2.1 Hyperscalar Dual-core Architecture Overview
  2.2 New System-level Instructions for Switching Three Operation Modes
  2.3 Design Issues for the Hyperscalar Dual-core Architecture in Superscalar Mode
    2.3.1 Register Data Flow Techniques
    2.3.2 Instruction Flow Techniques
    2.3.3 Memory Data Flow Techniques
    2.3.4 Dispatching Strategies in Superscalar Mode
  2.4 Hardware Design of the Hyperscalar Dual-core Architecture
    2.4.1 Building a Hyperscalar Dual-core Processor
    2.4.2 The Hardware Design of the IA
Chapter 3 Design of the Hyperscalar Multi-core Architecture
  3.1 Hyperscalar Multi-core Architecture Overview
  3.2 New System-level Instructions for Uniting the Core Resources
  3.3 Hyperscalar Multi-core Architecture Technology
    3.3.1 Design Issues of the Instruction-dependency Analyzer (IA)
      3.3.1.1 Register Data Flow Techniques
      3.3.1.2 Instruction Flow Techniques
      3.3.1.3 Memory Data Flow Techniques
      3.3.1.4 Hardware of the IA
    3.3.2 Designs of the Virtual Shared Register File (VSRF)
      3.3.2.1 The VSRF Concepts
      3.3.2.2 VSRF Communication Hardware
    3.3.3 Designs of the Scalar In-order Core with Distributed Superscalar Processing Stages
Chapter 4 A Multi-streaming SIMD Multimedia Computing Engine for Hyperscalar Multi-core Architecture
  4.1 The Multi-streaming SIMD Architecture
    4.1.1 Multi-streaming SIMD Design Concepts
    4.1.2 Three Instruction Modes for Dynamically Configuring SIMD Computing Resources
    4.1.3 The Space Addressing Mode for Accessing Multiple Data Streams
    4.1.4 An Example of Using the Three-mode Instructions
  4.2 Implementation of a Multi-streaming SIMD Computing Engine
    4.2.1 Hardware Design Method of the Multi-streaming SIMD Architecture
    4.2.2 Basic Components of the Multi-streaming SIMD Architecture
    4.2.3 Apply the Multi-streaming SIMD Concepts to the Hyperscalar Multi-core Architecture
Chapter 5 Performance Evaluation
  5.1 Experimental Evaluation of the Hyperscalar Dual-core Architecture
    5.1.1 Experimental Environment
    5.1.2 Experimental Results
    5.1.3 Area, Power, and Timing Effects
  5.2 Experimental Evaluation of the Hyperscalar Multi-core Architecture
    5.2.1 Experimental Environment
    5.2.2 Experimental Results
    5.2.3 Area Effects of the Hyperscalar Multi-core Architecture
  5.3 Experimental Evaluation of the Multi-streaming SIMD Architecture
    5.3.1 Experimental Environment
    5.3.2 Experimental Results
    5.3.3 Hardware Complexity Analyses
Chapter 6 Related Work
  6.1 Dynamic Multi-core Architecture
  6.2 Clustered Multithreaded Processors
  6.3 Run-ahead Multi-core Architectures
  6.4 Modern SIMD Architectures
  6.5 Stream Processing Architectures
  6.6 Processor-In-Memory (PIM) Architectures
Chapter 7 Conclusions and Future Work
Bibliography
Personal Publication

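Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid breaking the law.
Thesis access permission: unrestricted (fully open on and off campus)
Available: Campus: available; Off-campus: available
etd-0907111-220723.pdf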
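Printed copies
Public availability information for printed theses is relatively complete from Academic Year 102 onward. To inquire about the availability of printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available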

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS