National Sun Yat-sen University, thesis/dissertation, The Study of Linux Container-based Cluster for Data Warehouse Implementation and Its Network Performance Analysis

Title	The Study of Linux Container-based Cluster for Data Warehouse Implementation and Its Network Performance Analysis (研究以Linux Container為基礎之資料倉儲叢集的實現與不同網路架構下之效能分析)
Department	Department of Computer Science and Engineering
Year, semester	Spring semester of Academic Year 102 (ROC calendar)	Language	Chinese
Degree	Master	Number of pages	74
Author	Guan-Sian Wu (吳冠賢)
Advisor	Chun-Hung Lin (林俊宏)
Convenor	Cheng-Fu Chou (周承復)
Advisory Committee	Jain-Shing Liu (劉建興); Chun-I Fan (范俊逸); Hung-Jen Liao (廖宏仁)
Date of Exam	2014-07-18	Date of Submission	2014-07-30
Keywords	Linux Container, Hadoop, Data Warehouse
Statistics	The thesis/dissertation has been browsed 5671 times and downloaded 40 times.

Abstract
Cloud computing is one of today's most discussed topics. Data is growing faster than ever, big data has become a prominent subject in data mining, and speeding up data processing has become an important problem. This thesis studies three questions: how the degree of structure in the data affects computation, how the network affects computation time, and how to build a cloud computing cluster that makes full use of its hardware at a fixed cost. First, the input data is of two types, structured and unstructured, and we examine how execution time varies across programming languages for each type. Second, we examine the influence of the network. When all computing nodes are connected to the same network switch, execution time inevitably suffers once the switch reaches its throughput bottleneck. To avoid this, we propose a network interconnection approach that, at the same cost, reduces the load on the switch and increases the efficiency of network transmission. Finally, we build the data warehouse environment on Linux Container and examine whether this kind of virtual environment can obtain all the computing resources of the physical machine. We also increase the number of Linux Container virtual machines, as far as the hardware resources allow, to examine whether doing so draws more computing performance out of the hardware.

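Table of Contents
Thesis Approval Form i
Acknowledgements ii
Abstract (Chinese) iii
Abstract (English) iv
Table of Contents v
List of Figures vii
List of Tables viii
Chapter 1 Introduction 1
  1.1 Research Motivation 1
  1.2 Thesis Organization 2
Chapter 2 Background 3
  2.1 Introduction to Hadoop 3
    2.1.1 HDFS 3
    2.1.2 MapReduce 5
    2.1.3 Node Roles in the Hadoop Architecture 7
  2.2 Introduction to Hadoop-Related Packages 8
    2.2.1 HBase 8
    2.2.2 Pig 10
    2.2.3 Hive 10
    2.2.4 Impala 11
  2.3 Introduction to Data Warehousing 11
    2.3.2 Introduction to CDH 12
  2.4 Introduction to LXC 13
Chapter 3 System Architecture 15
  3.1 Hardware Specifications 15
  3.2 Software Versions 16
    3.2.1 Operating System Version 16
    3.2.2 Data Warehouse Software Versions 16
    3.2.3 Linux Container Software Versions 16
  3.3 Test Environment Architecture and Service Configuration 17
    3.3.1 Node Naming and Service Deployment 17
    3.3.2 Test Environment 1 / Test Environment 2 18
    3.3.3 Test Environment 3 18
    3.3.4 Test Environment 4 20
    3.3.5 Test Environment 5 21
    3.3.6 Test Environment 6 24
    3.3.7 Test Environment 7 / Test Environment 8 24
    3.3.8 Test Environment 9 / Test Environment 10 24
    3.3.9 Summary of Test Environments 25
Chapter 4 Experimental Design and Data Discussion 26
  4.1 Data Types 26
  4.2 Program Description 27
    4.2.1 Java Code and Workflow 27
    4.2.2 Pig Code and Workflow 32
    4.2.3 Hive Code and Workflow 33
    4.2.4 Impala Code and Workflow 34
  4.3 Experimental Design and Data Discussion 35
    4.3.1 Experiment 1: Environments 1 / 2 / 3 35
    4.3.2 Experiment 2: Environments 2 / 3 / 4 / 5 38
    4.3.3 Experiment 3: Environments 4 / 6 / 7 / 8 / 9 / 10 41
Chapter 5 Conclusions and Future Work 44
References 46
Appendix 51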


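Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: user-defined availability. Available: Campus: available; Off-campus: available. etd-0630114-101432.pdf
Printed copies
Public-access information for printed copies is relatively complete for academic year 102 and later. To inquire about printed copies from academic year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience. Availability: publicly available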


Office of Library and Information Services, National Sun Yat-sen University │ Contact us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS