博碩士論文 etd-0829112-205635 詳細資訊
Title page for etd-0829112-205635
論文名稱
Title
叢集計算之容錯設計
The Design of Fault Tolerance of Cluster Computing Platform
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
70
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2012-07-20
繳交日期
Date of Submission
2012-08-29
關鍵字
Keywords
分散式運算、叢集系統、容錯
SLURM, cluster system, Job duplication, fault-tolerance, distributed computing
統計
Statistics
本論文已被瀏覽 5666 次,被下載 504
This thesis has been viewed 5666 times and downloaded 504 times.
中文摘要
在一個分散式的應用服務中,當發生節點失效時,要付出相當高的代價來處理計算結果的錯誤,同時也會造成服務排程器的額外負擔。為了讓計算過程不會因為節點失效而全部重新啟動,只將發生錯誤的節點重新計算即可。因此,本論文考慮三種方法:N+N nodes、N + 1 nodes以及N + 1 nodes with probability,並分別實驗與分析這三種方式的優劣,其中第三種方法在派送前給予權重,之後根據權重換算成機率及nice值(此為SLURM[1]所定義的)進而影響排程器的排程順序。所以當發生錯誤時,正在計算的部份結果能先回到控制節點,而發生故障的機器因本論文的容錯設計能經由再次的派送任務到備份機上做部份計算或不經由重新派送任務而得到完整的計算結果。論文最後也會針對這三種作法分析其優劣。
Abstract
When a node fails in a distributed application service, handling the missing or erroneous results is costly, and it also places additional load on the scheduler. To avoid restarting the whole computation after a fault, only the work of the failed node is recomputed, on a backup machine. This thesis therefore considers three methods, N + N nodes, N + 1 nodes, and N + 1 nodes with probability, and evaluates their advantages and disadvantages experimentally. The third method assigns each job a weight before dispatch, then converts the weight into a probability and a nice value (as defined by SLURM [1]) to influence the order in which the scheduler runs jobs. When a fault occurs, the partial results computed on the healthy nodes are first returned to the control node; the failed node's jobs are then either reassigned to a backup machine for partial recomputation, or need not be reassigned at all, so that the complete result is still obtained. Finally, the thesis analyzes the strengths and weaknesses of the three approaches.
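The weight-to-probability-to-nice-value step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formulas, so everything here is an assumption for illustration only: the function names, the proportional normalization, the linear mapping, and the `nice_max` bound are all hypothetical. The only SLURM fact relied on is that a larger nice value lowers a job's scheduling priority.

```python
# Hypothetical sketch: NOT the thesis's formulas, which the abstract omits.

def weights_to_probabilities(weights):
    """Normalize per-job weights so they sum to 1 (assumed normalization)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {job: w / total for job, w in weights.items()}


def probability_to_nice(probability, nice_max=100):
    """Map a probability to a nice value (assumed linear mapping).

    A higher probability yields a lower nice value, i.e. a higher
    scheduling priority. nice_max is an illustrative bound, not a
    SLURM constant.
    """
    return round((1.0 - probability) * nice_max)


def nice_values(weights, nice_max=100):
    """Compute a nice value for every job from its weight."""
    probabilities = weights_to_probabilities(weights)
    return {job: probability_to_nice(p, nice_max)
            for job, p in probabilities.items()}


if __name__ == "__main__":
    # Job "a" carries three times the weight of job "b".
    print(nice_values({"a": 3, "b": 1}))
```

Under these assumptions the heavier job receives the smaller nice value and would therefore be ordered ahead of the lighter one. In a real deployment such a value could be supplied at submission time, for example through the `--nice` option of `sbatch`.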
目次 Table of Contents
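Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
1. Introduction
1.1. Research Motivation and Objectives
1.2. Thesis Organization
2. Parallel Computing and an Introduction to SLURM
2.1. Parallel Computing
2.2. Introduction to SLURM
2.3. Fault Tolerance
2.4. The Trading Strategy Problem
2.4.1. Moving Average
2.4.2. On-Balance Volume
3. System Architecture
3.1. System Functions
3.2. Hardware Architecture
3.3. Software Architecture
4. Research, Implementation, and Comparison
4.1. Design of the Fault-Tolerance Methods
4.1.1. Scheduling Overview
4.1.2. Fault-Tolerance Overview
4.1.3. Overall Workflow
4.2. Comparison
5. Conclusion
6. References
Appendix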
參考文獻 References
[1]. SLURM: A Highly Scalable Resource Manager, https://computing.llnl.gov/linux/slurm/slurm.html
[2]. J. Darlington, M. Ghanem, H. W. To, “Structured parallel programming,” In Programming Models for Massively Parallel Computers, IEEE Computer Society Press, 1993.
[3]. C. W. Krueger, “Software reuse,” ACM Computing Surveys, 24 (2), 1992, pp. 131-183.
[4]. The GNU Project (2003), http://www.gnu.org/licenses/licenses.html.
[5]. H. Lee, K. Chung, S. Chin, J. Lee, S. Park, H. Yu, “A resources management and fault tolerance services in grid computing,” Journal of Parallel and Distributed Computing, 65, 2005, pp. 1305-1317.
[6]. M. Nandagopal, V. R. Uthariaraj, “Fault tolerant scheduling strategy for computational grid environment,” International Journal of Engineering Science and Technology, 2 (9), 2010, pp. 4361-4372.
[7]. B. Nazir, K. Qureshi, F. G. Khan, “Adaptive checkpointing strategy to tolerate faults in economy based grid,” Journal of Supercomputing, 50, 2009, pp. 1-18.
[8]. C. Jiang, X. Xu, J. Wan, “Replication based job scheduling in grids with security assurance,” In: Proceedings of 3rd international symposium on electronic commerce and security workshops, Guangzhou, China, July 29-31, 2010.
[9]. M. Huda, H. Schmidt, I. Peake, “An agent oriented proactive fault-tolerant framework for grid computing,” In: Proceedings of international conference on e-science and grid computing, Melbourne, Australia, December 5-8, 2005. p. 304-311.
[10]. Q. Zheng, B. Veeravalli, “On the design of communication-aware fault-tolerant scheduling algorithms for precedence constrained tasks in grid computing systems with dedicated communication devices,” Journal of Parallel and Distributed Computing, 69, 2009, pp. 282-294.
[11]. J. Abawajy, “Fault-tolerant scheduling policy for grid computing systems,” In: Proceedings of 18th IEEE international parallel and distributed processing symposium, April 26-30, 2004.
[12]. B. Watson, “The performance of single-keyword and multiple-keyword pattern matching algorithms,” Technical Report CS TR 94-19, Department of Computing Science, Eindhoven University of Technology, 1994.
[13]. G. Navarro, M. Raffinot, “Flexible pattern matching in strings,” Practical Online Search Algorithms for Texts and Biological Sequences, Cambridge University Press, Cambridge, UK, 2002.
[14]. C. Clark, “C hash table – source code for a hash table data structure in C,” Computer Laboratory of the University of Cambridge, Mar 2006. http://www.cl.cam.ac.uk/~cwc22/hashtable/.
[15]. T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to Algorithms, Second Edition. The MIT Press and McGraw-Hill Book Company, 2002.
[16]. D. E. Knuth, J. H. Morris, V. R. Pratt, “Fast pattern matching in strings,” SIAM Journal on Computing, 6 (2), 1977, pp. 323-350.
[17]. J. H. Morris, V. R. Pratt, “A linear pattern-matching algorithm,” Technical Report TR 40, University of California, Berkeley, CA, USA, 1970.
[18]. C. Charras, T. Lecroq, “Exact string matching algorithms – Animations in Java,” Electronic Publication, Laboratoire d’Informatique de Rouen, Université de Rouen, Faculté des Sciences et des Techniques, Jan. 1997. http://www-igm.univ-mlv.fr/~lecroq/string/index.html.
[19]. R. S. Boyer, J. S. Moore, “A fast string searching algorithm,” Communications of the ACM, 20 (10), 1977, pp. 762-772.
[20]. R. N. Horspool, “Practical fast searching in strings,” Software Practice and Experience, 10 (6), 1980, pp. 501-506.
[21]. L. Cleophas, B. W. Watson, G. Zwaan, “A new taxonomy of sublinear keyword pattern matching algorithms,” Technical Report CS-TR 04-07, Department of Mathematics and Computer Science, Technische Universiteit Eindhoven, Mar 2004.
[22]. A. V. Aho, M. J. Corasick, “Efficient string matching: an aid to bibliographic search,” Communications of the ACM, 18 (6), 1975, pp. 333-340.
[23]. S. Wu, U. Manber, “A fast algorithm for multi-pattern searching,” Technical Report TR-94-17, Department of Computer Science. Chung-Cheng University, 1994.
[24]. J. Dean, S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, 51 (1), 2008, pp. 107-113.
[25]. R. Buyya, High Performance Cluster Computing: Architectures and Systems, 1st ed., Prentice Hall PTR, 1999.
[26]. C. Bischof, M. Bucker, P. Gibbon, G. Joubert, T. Lippert, Parallel Computing: Architectures, Algorithms and Applications, IOS Press, 2008.
[27]. A. Silberschatz, P.B. Galvin, G. Gagne, Operating System Concepts, 8th ed., John Wiley, 2009.
[28]. O. Sinnen, Task Scheduling for Parallel Systems, 1st ed., John Wiley and Sons Inc, 2007.
[29]. R. Entezari-Maleki, A. Movaghar, A Genetic-based Scheduling Algorithm to Minimize the Makespan of the Grid Applications. Grid and Distributed Computing, Control and Automation. Communications in Computer and Information Science, 121, 2010, pp. 22-31.
[30]. S. Parsa, R. Entezari-Maleki, RASA: A New Grid Task Scheduling Algorithm, International Journal of Digital Content Technology and its Applications, 3(4), 2009, pp. 91-99.
[31]. Fault-tolerant system, http://en.wikipedia.org/wiki/Fault-tolerant_system
[32]. Rogue transmitter knocks out GPS signals, http://fcw.com/articles/1998/04/12/rogue-transmitter-knocks-out-gps-signals.aspx
[33]. Myrinet, http://en.wikipedia.org/wiki/Myrinet
[34]. L. Lamport, R. Shostak, M. Pease, The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems, 4(3), 1982, pp. 382-401.
[35]. R. Schlichting, F. Schneider, Fail-Stop Processors: An Approach to Designing Fault-Tolerant Computing Systems, ACM Transactions on Computer Systems, 1(3), 1983, pp. 222-238.
[36]. R.H. Arpaci-Dusseau, A.C. Arpaci-Dusseau, Fail-Stutter Fault Tolerance, In Proceedings of the Eighth Workshop on Hot Topics in Operating Systems, 2001.
[37]. C. Leangsuksun, T. Liu, L. Shen, S.L. Scott, Building High Availability and Performance Clusters with Ha-oscar Toolkits, In Proceedings of the High Availability and Performance Workshop, 2003.
[38]. M. Li, D. Goldberg, W. Tao, Y. Tamir, Fault-Tolerant Cluster Management for Reliable High-Performance Computing, In International Conference on Parallel and Distributed Computing and Systems, 2001.
[39]. B.-G. Chun, P. Maniatis, S. Shenker, Diverse Replication for Single-Machine Byzantine-Fault Tolerance, In Proceedings of the USENIX 2008 Annual Technical Conference (ATC '08), 2008.
[40]. M. Trencseni, A. Gazso, Keyspace: A Consistently Replicated, Highly-Available Key-Value Store, 2009.
[41]. J. R. Sklaroff, Redundancy Management Technique for Space Shuttle Computers, IBM Journal of Research and Development, 20(1), 1976.
[42]. M. Jain, R. Gupta, Redundancy Issues in Software and Hardware Systems: An Overview, International Journal of Reliability, Quality and Safety Engineering, 18(1), 2011, pp. 61-98.
[43]. R. Samet, Recovery Device for Real-Time Dual-Redundant Computer Systems, IEEE Transactions on Dependable and Secure Computing, 8(3), 2011, pp. 391-403.
[44]. Z.T. Kalbarczyk, R.K. Iyer, S. Bagchi, K. Whisnant, Chameleon: A Software Infrastructure for Adaptive Fault Tolerance, IEEE Transactions on Parallel and Distributed Systems, 10(6), 1999, pp. 560-588.
[45]. A. Bessani, A. Daidone, I. Gashi, R. Obelheiro, P. Sousa, V. Stankovic, Enhancing Fault / Intrusion Tolerance through Design and Configuration Diversity, 39th IEEE/IFIP International Conference on Dependable Systems and Networks, 2009.
[46]. B. Littlewood, L. Strigini, Fault Tolerance via Diversity Against Design Faults: Design Principles and Reliability Assessment, 22nd International Conference on Software Engineering, 2000.
[47]. B. Littlewood, P. Popov, L. Strigini, Assessing the Reliability of Diverse Fault-Tolerant Software-based Systems, Safety Science, 40(9), 2002, pp. 781-796.
[48]. Y. Oh, S.H. Son, Scheduling Real-Time Tasks for Dependability, Journal of Operational Research Society, 48(6), 1997, pp. 629-639.
[49]. S. Ghosh, R. Melhem, D. Mosse, Fault-Tolerance Through Scheduling of Aperiodic Tasks in Hard Real-Time Multiprocessor Systems, IEEE Transactions on Parallel and Distributed Systems, 8(3), 1997, pp. 272-284.
[50]. G. Manimaran, C.S.R. Murthy, A Fault-Tolerant Dynamic Scheduling Algorithm for Multiprocessor Real-Time Systems and Its Analysis, IEEE Transactions on Parallel and Distributed Systems, 9(11), 1998, pp. 1137-1152.
[51]. R. Al-Omari, A. Somani, G. Manimaran, A New Fault-Tolerant Technique for Improving Schedulability in Multiprocessor Real-Time Systems, In: 15th International Parallel and Distributed Processing Symposium, 2001.
[52]. Q. Zheng, B. Veeravalli, C.K. Tham, Fault-Tolerant Scheduling of Independent Tasks in Computational Grid, In: 10th IEEE International Conference on Communications Systems, 2006.
[53]. Q. Zheng, B. Veeravalli, C.K. Tham, On the Design of Fault-Tolerant Scheduling Strategies Using Primary-Backup Approach for Computational Grids with Low Replication Costs, IEEE Transactions on Computers, 58(3), 2009, pp. 380-393.
[54]. S. Ghemawat, H. Gobioff, and S.T. Leung, “The google file system,” 2003.
[55]. E.B. Nightingale, J.R. Douceur, V. Orgovan, “Cycles, cells and platters: An empirical analysis of hardware failures on a million consumer PCs,” 2011.
[56]. Trading strategy, http://en.wikipedia.org/wiki/Trading_strategy
[57]. Moving average: http://en.wikipedia.org/wiki/Moving_average
[58]. 牛皮:http://zh.wikipedia.org/wiki/%E7%89%9B%E7%9A%AE_(%E8%82%A1%E7%A5%A8)
[59]. OBV:http://en.wikipedia.org/wiki/On-balance_volume
[60]. M. Jette and M. Grondona, “SLURM: simple linux utility for resource management,” Proceedings of Job Scheduling Strategies for Parallel Processing (JSSPP), June 23, 2003, pp. 44-60.
[61]. J.S. Plank and K. Li, “Faster checkpointing with N+1 parity,” 24th International Symposium on Fault-Tolerant Computing, Austin, TX, USA, June 15-17, 1994. p. 288-297.
[62]. D. Huang, H. Fei, L. Li, and Y. Zhu, “Design for N+1 fault-tolerant integrated solar controller,” Computer, Mechatronics, Control and Electronic Engineering (CMCE), Aug. 24-26, 2010, Vol(6), pp. 151-154.
[63]. S. Rani, C. Leangsuksun, A. Tikotekar, V. Rampure, and S.L. Scott. “Toward efficient failure detection and recovery in HPC,” In Proceedings of High Availability and Performance Workshop (HAPCW) 2006, in conjunction with Los Alamos Computer Science Institute (LACSI) Symposium 2006, October 17, 2006.
[64]. HA-OSCAR: Availability Prediction and Modeling of High Availability OSCAR Cluster, IEEE Cluster 2003, Hong Kong, December 2003.
電子全文 Fulltext
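This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid breaking the law.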
論文使用權限 Thesis access permission: 自定論文開放時間 user-defined release date
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
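Public-access information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up public-access information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk at the Office of Library and Information Services. We apologize for any inconvenience.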
開放時間 available 已公開 available
