國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,一實現將啟發式演算法為基礎之排程器實作於Hadoop 之框架,A Framework for the Implementation of Heuristic-based Schedulers on Hadoop

論文名稱 Title	一實現將啟發式演算法為基礎之排程器實作於Hadoop 之框架 A Framework for the Implementation of Heuristic-based Schedulers on Hadoop
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	102 學年度第 2 學期 The spring semester of Academic Year 102	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	54
研究生 Author	江孟修 Meng-hsiu Chiang
指導教授 Advisor	江明朝 Ming-chao Chiang
召集委員 Convenor	楊竹星 Chu-sing Yang
口試委員 Advisory Committee	洪宗貝, 江明朝, 蔡崇煒 Tzung-Pei Hong; Ming-chao Chiang; Chun-Wei Tsai
口試日期 Date of Exam	2014-06-26	繳交日期 Date of Submission	2014-08-20
關鍵字 Keywords	雲端計算、Hadoop、分散式計算、排程、啟發式演算法 distributed computing, heuristic algorithm, cloud computing, Hadoop, scheduling

中文摘要
隨著電腦科技的進步，從單一處理器、多處理器、分散式計算，到目前的雲端計算，電腦系統中的排程器若要即時取得最佳的排程結果也越來越困難。Hadoop 為相當知名的分散式計算系統，也是雲端計算相當熱門的應用，在Hadoop 中，除了相當原始的先進先出排程之外，Facebook、Yahoo 等公司也提出了相應的演算法，以取得更佳的排程結果，但上述兩間公司提出的演算法都並非著眼於排程的最佳化。同時，由於排程最佳化是典型的NP-hard 問題，以窮舉法在合理時間內取得最佳解相當困難。而演化式計算用於解決NP-hard 問題已行之有年，其中也包括了排程最佳化。但在排程器的實作中，套用演化式計算的門檻在於通常作業完成時間等資訊皆為未知，而演化式計算需要這些資訊作為計算最適值的依據。為解決此問題，本文提出一框架，透過有效的預估與訓練，以最小化工作完成時間作為排程目標，以實作出以演化式計算為主的Hadoop 排程系統。根據我們的實驗結果，以此框架設計之排程器可以取得比先進先出與Facebook 的Fair Scheduling 更佳的排程結果，有效的減少完成全部工作所需要的時間。
Abstract
The advance of computer technology from uniprocessing to symmetric multiprocessing to distributed computing and then to cloud computing has made it increasingly difficult to compute, on the fly, an optimal schedule for the tasks to be run on such systems. To achieve better scheduling quality than the primitive First-In-First-Out scheduler, Facebook and Yahoo have developed their own schedulers for Hadoop, a widely used cloud computing system, but neither of them is aimed at optimizing the schedule in terms of makespan. Moreover, since scheduling optimization is an NP-hard problem, a brute-force method is very unlikely to find the optimal solution in a reasonable time. Heuristic algorithms therefore play a vital role in solving this problem. From the perspective of implementation, however, the difficulty is that the completion time of each job, which is needed to calculate the fitness value, is not known in advance. This thesis presents a framework that overcomes this difficulty so that heuristic-based schedulers can be implemented on Hadoop. Our experimental results show that the heuristic-based schedulers give better scheduling results than First-In-First-Out, Facebook’s Fair Scheduler, and Yahoo’s Capacity Scheduler in terms of the makespan of jobs.
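The role of the estimated completion times can be illustrated with a minimal sketch. It is not the thesis's framework or its API: the function names, the round-robin baseline, the simple local search, and the runtime estimates are all assumptions chosen for illustration. The sketch only shows why a heuristic scheduler needs a per-job runtime estimate: without it, the makespan used as the fitness value cannot be computed.

```python
from typing import List, Sequence


def makespan(assignment: Sequence[int], est_runtime: Sequence[float], n_slots: int) -> float:
    """Fitness value: finishing time of the busiest slot, using estimated runtimes."""
    load = [0.0] * n_slots
    for job, slot in enumerate(assignment):
        load[slot] += est_runtime[job]
    return max(load)


def round_robin(n_jobs: int, n_slots: int) -> List[int]:
    """Baseline: hand out jobs to slots in arrival order (hypothetical stand-in for FIFO)."""
    return [job % n_slots for job in range(n_jobs)]


def local_search(assignment: Sequence[int], est_runtime: Sequence[float], n_slots: int) -> List[int]:
    """Illustrative heuristic: move one job at a time to another slot
    whenever the move strictly lowers the estimated makespan."""
    best = list(assignment)
    best_fit = makespan(best, est_runtime, n_slots)
    improved = True
    while improved:
        improved = False
        for job in range(len(best)):
            for slot in range(n_slots):
                if slot == best[job]:
                    continue
                candidate = best.copy()
                candidate[job] = slot
                fit = makespan(candidate, est_runtime, n_slots)
                if fit < best_fit:
                    best, best_fit, improved = candidate, fit, True
    return best


if __name__ == "__main__":
    # Hypothetical runtime estimates (e.g. from earlier runs of similar jobs).
    estimates = [8.0, 7.0, 6.0, 5.0, 4.0]
    start = round_robin(len(estimates), 2)
    tuned = local_search(start, estimates, 2)
    print(makespan(start, estimates, 2), makespan(tuned, estimates, 2))
```

A local search such as this one can stop at a local optimum; metaheuristics such as simulated annealing or genetic algorithms use the same fitness function but explore the search space more broadly.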

目次 Table of Contents
Chapter 1 Introduction
  1.1 Motivation
  1.2 Contributions of the Thesis
  1.3 Organization of the Thesis
Chapter 2 Related Works
  2.1 Scheduling Problem
    2.1.1 Machine Environments
    2.1.2 Constraints and Characteristics
    2.1.3 Schedule Objective
  2.2 Heuristic-based Algorithms
  2.3 Hadoop
    2.3.1 MapReduce
    2.3.2 Hadoop Architecture
    2.3.3 Schedulers in Hadoop
      2.3.3.1 First-In-First-Out
      2.3.3.2 Fair Scheduler
      2.3.3.3 Capacity Scheduler
      2.3.3.4 Other Hadoop Scheduling Research
  2.4 Summary
Chapter 3 The Proposed Framework
  3.1 The Concept
  3.2 The Proposed Framework
    3.2.1 Updating Stage
    3.2.2 Prioritizing Stage
    3.2.3 Scheduling Stage
    3.2.4 Dispatching Stage
  3.3 Summary
Chapter 4 API Implemented and Example
  4.1 Overview
  4.2 APIs
    4.2.1 HScheduler
    4.2.2 Scheduler
    4.2.3 Schedule
    4.2.4 JobRecord
    4.2.5 Tracker
    4.2.6 Slot
  4.3 Example
  4.4 Summary
Chapter 5 Simulation Results
  5.1 Experiments Settings
    5.1.1 Simulation Environment
    5.1.2 Parameter Settings
    5.1.3 Simulated Dataset
  5.2 Simulation Results
  5.3 Analysis
    5.3.1 The average task waiting time
    5.3.2 The cloud rental payment
  5.4 Summary
Chapter 6 Conclusion and Future Works
  6.1 Conclusion
  6.2 Future Works
Bibliography


電子全文 Fulltext
論文使用權限 Thesis access permission: 校內 Campus: 永不公開 not available; 校外 Off-campus: 永不公開 not available
紙本論文 Printed copies
開放時間 Available: 永不公開 not available
