國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,應用於晶片網路之類環狀網路仲裁策略排程,A Ring-like Arbitration Strategy Schedule for Networks-On-Chips

論文名稱 Title	應用於晶片網路之類環狀網路仲裁策略排程 A Ring-like Arbitration Strategy Schedule for Networks-On-Chips
系所名稱 Department	電機工程學系 Department of Electrical Engineering
畢業學年期 Year, semester	101 學年度第 2 學期 The spring semester of Academic Year 101	語文別 Language	英文 English
學位類別 Degree	博士 Ph.D.	頁數 Number of pages	161
研究生 Author	楊凱名 Kai-ming Yang
指導教授 Advisor	邱日清 Jih-ching Chiu
召集委員 Convenor	鍾崇斌 Chung-Ping Chung
口試委員 Advisory Committee	謝東佑, 楊竹星, 蕭勝夫, 鄺獻榮, 陳中和 Tong-Yu Hsieh; Chu-Sing Yang; Shen-Fu Hsiao; Shiann-Rong Kuang; Chung-Ho Chen
口試日期 Date of Exam	2013-06-06	繳交日期 Date of Submission	2013-07-18
關鍵字 Keywords	晶片網路、優先權選擇器、指令資料流緩衝器、非同步電路、分散式晶片網路仲裁策略 Distributed on-chip network arbitration strategy, Instruction and data stream buffer, Asynchronous circuits, Network-on-chip, Priority selector
統計 Statistics	本論文已被瀏覽 5680 次，被下載 770 次 The thesis/dissertation has been browsed 5680 times, has been downloaded 770 times.

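Chinese Abstract
Multi-core processors have become the mainstream of current processor architectures. Single-chip multi-core systems improve performance by raising instruction-level and thread-level parallelism, so the efficiency of data transfer among cores determines the performance of a multi-core system. This dissertation has two parts. The first uses a fair arbitration mechanism to alleviate node starvation and hotspot problems in multi-core on-chip networks. The second proposes a buffering mechanism that improves data fetching for on-chip network nodes, reducing the performance loss caused by the speed gap between the traditional memory hierarchy and the processor. In multi-core on-chip networks, the fairness, scalability, and simplicity of the strategy that arbitrates collisions in inter-core communication all strongly affect system performance. An unfair strategy leads to starvation and hotspot problems, especially in heavily loaded on-chip networks, and the hardware complexity of the arbitration strategy must also be considered. To address these issues, this dissertation proposes a simple and fair arbitration strategy that appropriately adjusts the priorities of network nodes. In the initial state of a transfer, each node has its own independent priority. When nodes compete, the loser swaps its priority weight with the winner's, and the swapped weights serve as the basis for the next connection. This principle guarantees that the winner's chance in the next connection is reduced; otherwise, the winner's priority increases. Using only simple compare and swap operations, the proposed mechanism efficiently achieves fairness of competition across the whole system. In addition, considering speed and clock skew, the design is implemented with asynchronous circuits. Simulation results show that, with this fair strategy, the proposed schedule alleviates starvation, guarantees deadlock freedom, and improves hotspot problems. In systems with many nodes, the method efficiently provides fair arbitration for the whole system. Although the traditional memory hierarchy can smooth the instruction and data streams, insufficient bandwidth of the instruction and data streams remains the main challenge in raising overall system performance. To improve instruction and data fetching, a buffering mechanism is proposed that switches buffer accesses by exploiting temporal and spatial locality. When an instruction or datum is held in the buffer, it can be reused; at the same time, the prefetch buffer used for prefetching data prefetches into that buffer. Simulation and implementation results show that the mechanism is used most efficiently with a buffer extension depth of 3 and a unit buffer size of 64 Kbyte, at only 4% extra hardware cost. The hit rate of the proposed buffering mechanism exceeds that of a loop buffer by 22% for instruction fetch and that of a first-in-first-out policy by 7% for data fetch.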
Abstract
Multi-core systems on a single chip exploit ILP (Instruction-Level Parallelism) and TLP (Thread-Level Parallelism) to improve system performance. Therefore, the efficiency of transferring data among cores dominates multi-core system performance. This work proposes a fair arbitration strategy to alleviate starvation and hotspot problems for multi-core systems in on-chip networks. In addition, to reduce the gap between the traditional memory hierarchy and processors, a novel buffering mechanism is proposed to improve data fetch for network-on-chip nodes. In multi-core systems built on on-chip networks, the global fairness, scalability, and simplicity of the strategy that arbitrates communication collisions among cores substantially affect system performance. An unfair strategy causes starvation and hotspot problems, especially under heavy loads. In addition, the hardware complexity of the arbitration strategy in the on-chip environment must also be considered. To address these issues, this work presents a simple and fair strategy that properly adjusts the priorities of nodes. In the initial state of transferring data, each node has a unique priority. When competition among nodes occurs in a particular network, the loser swaps its priority with that of the winner. This principle guarantees that a winner's opportunity in the subsequent connection decreases, whereas a node that loses gains priority. Using only simple compare and exchange operations, the proposed arbitration strategy achieves global fairness efficiently. Moreover, considering speed and clock skew, asynchronous circuits are used for the implementation. Simulation results demonstrate that, by applying a fair strategy, the proposed scheme alleviates starvation, guarantees deadlock freedom, and improves hotspot problems. In a large system, this approach efficiently provides fair arbitration across the whole system.
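The compare-and-swap principle described above can be illustrated with a minimal Python sketch. It is not the thesis's asynchronous hardware design. It assumes collisions are resolved pairwise and that a larger integer weight means higher priority; the function names `contend` and `simulate` are hypothetical and chosen only for illustration.

```python
def contend(priority, a, b):
    """Resolve one collision between nodes a and b.

    priority maps node id -> unique integer weight.
    Assumption: the larger weight wins the connection.
    After the grant, winner and loser swap weights, so the winner
    holds the lower weight the next time the two nodes collide.
    """
    winner, loser = (a, b) if priority[a] > priority[b] else (b, a)
    priority[winner], priority[loser] = priority[loser], priority[winner]
    return winner


def simulate(num_nodes, collisions):
    """Run a sequence of pairwise collisions and count wins per node."""
    # Unique initial priorities: node n starts with weight n (an assumption).
    priority = {n: n for n in range(num_nodes)}
    wins = {n: 0 for n in range(num_nodes)}
    for a, b in collisions:
        wins[contend(priority, a, b)] += 1
    return priority, wins
```

Because the weights are only exchanged, they remain a permutation of the initial unique values, and two nodes that collide repeatedly alternate as winner under these assumptions.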
The traditional memory hierarchy design can smooth the data stream and instruction stream. However, the bandwidth of the instruction and data streams is still the main challenge for high-performance microprocessor systems. To improve data and instruction fetching, the proposed buffering architecture exploits both temporal and spatial locality with a relation-exchanging buffering mechanism. On a buffer hit, the instruction or data can be reused. At the same time, the prefetching mechanism is enabled to prefetch the instructions or data that will be used in the near future. According to the simulation results, the proposed buffering mechanism with a depth of 3 and a 64-byte line size, which needs only 4% extra hardware cost, is a cost-effective choice. The hit rate of the proposed buffering mechanism exceeds that of the loop buffer architecture by 22% for instruction-stream fetch and that of the First-In-First-Out (FIFO) strategy by 7% for data-stream fetch.
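The reuse-on-hit and prefetch-on-hit behaviour described above can be illustrated with a generic sketch. This is not the thesis's ABP or relation-exchanging buffer, whose details are not given here. It assumes a small line buffer with FIFO replacement and next-sequential-line prefetch; the class name `LineBuffer` and its methods are hypothetical, and the defaults of depth 3 and 64-byte lines are taken from the abstract.

```python
from collections import OrderedDict


class LineBuffer:
    """Toy line buffer: reuse on hit, prefetch the next line on hit."""

    def __init__(self, depth=3, line_size=64):
        self.depth = depth          # number of lines held at once
        self.line_size = line_size  # bytes per line
        self.lines = OrderedDict()  # resident line numbers, oldest first
        self.hits = 0
        self.accesses = 0

    def _install(self, line):
        # Insert a line, evicting the oldest one when full (FIFO; an assumption).
        if line in self.lines:
            return
        if len(self.lines) >= self.depth:
            self.lines.popitem(last=False)
        self.lines[line] = True

    def access(self, addr):
        """Return True on a hit. A hit also prefetches the next sequential line."""
        line = addr // self.line_size
        self.accesses += 1
        hit = line in self.lines
        if hit:
            self.hits += 1
            self._install(line + 1)  # assumed prefetch target: next line
        else:
            self._install(line)      # demand fetch on a miss
        return hit

    def hit_rate(self):
        return self.hits / self.accesses if self.accesses else 0.0
```

Feeding an address trace through `access` and reading `hit_rate()` gives the kind of hit-rate figure the abstract compares, though the numbers from this toy model say nothing about the thesis's reported results.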

目次 Table of Contents
Thesis Approval Form
Acknowledgments
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Fairness of arbitration strategies
  1.2 Buffering mechanisms for data fetch
  1.3 Centralized and Distributed Strategy approaches
Chapter 2 Background and Relevant Works
  2.1 Multi-core Communications Overview
  2.2 Background
  2.3 Arbitration Strategies
  2.4 Buffering mechanism
    2.4.1 General Mechanisms
    2.4.2 Instruction Stream for VLIW Architectures
Chapter 3 Buffering Mechanism for instruction and data streams
  3.1 ABP buffer
    3.1.1 Design of the ABP buffer
    3.1.2 Primitive ABP buffer
    3.1.3 Extended ABP Buffer
    3.1.4 Hardware architecture of the ABP buffer
    3.1.5 Hardware Complexity
  3.2 Instruction Stream Buffer for VLIW Architectures
    3.2.1 Design of Instruction Stream Buffer for VLIW Architectures
    3.2.2 Hardware Design of Instruction Stream Buffer
Chapter 4 SWP Communication Schedule
  4.1 Definitions of Interconnections
  4.2 Basic Principles of the SWP Scheme
  4.3 Implementation of the SWP scheme
  4.4 Mechanism of the SWP Scheme
  4.5 SWP Scheme Extension for Ring-like Topologies
  4.6 SWP Scheme for Four-degree Network Topologies
  4.7 Hardware Implementation and Overhead
    4.7.1 Head flit switching
    4.7.2 Forward path and backward path
    4.7.3 Asynchronous Transceiver Architecture
Chapter 5 Analysis of the Effect of the Weighting Factor
Chapter 6 Experimental Result
  6.1 Simulated parameters for arbitration strategies
    6.1.1 Evaluation of Area and Latency
    6.1.2 Fairness and Throughput of Arbitration Strategies
    6.1.3 Effect of the Second Arbitration Strategy
  6.2 Simulated experiment for Buffering mechanism
    6.2.1 ABP buffer
    6.2.2 Simulated parameters for buffering mechanisms of VLIW
Chapter 7 Conclusions
Bibliography
Appendix A
  A.1 Introduction
  A.2 Previous Works
    A.2.1 The problem of the priority encoder
    A.2.1.2 Relative Works
  A.3 Balanced Propagation Path for Priority Policy Selector
    A.3.1 Proposed Priority Encoder Scheme
    A.3.2 Analytical Latency
    A.3.3 A Novel Expression Generation Algorithm
  A.4 Implemented VLSI Design
    A.4.1 Design methodology for priority selectors using delayed precharge
    A.4.2 Propagation delay
  A.5 Experimental Results
  A.6 Conclusion
Personal Publication


電子全文 Fulltext
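This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: unrestricted (fully open on and off campus). Available: Campus: available; Off-campus: available. etd-0618113-150752.pdf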
紙本論文 Printed copies
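Public-access information for printed theses is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available.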
