國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,研究大數據運算與儲存平台架構：以我國全民健保資料運算為例之探討,The study of a big data platform for mining computation and storage : the case study of national health insurance database

論文名稱 Title	研究大數據運算與儲存平台架構：以我國全民健保資料運算為例之探討 The study of a big data platform for mining computation and storage : the case study of national health insurance database
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	103 學年度第 2 學期 The spring semester of Academic Year 103	語文別 Language	中文 Chinese
學位類別 Degree	碩士 Master	頁數 Number of pages	68
研究生 Author	莊順賢 Shun-hsien Chuang
指導教授 Advisor	林俊宏 Chun-Hung Lin
召集委員 Convenor	吳信昇 Shihn-Sheng Wu
口試委員 Advisory Committee	陳璽煌, 蔡瑞修, 溫燕霞 Shi-Huang Chen; Jui-Hsiu Tsai; Yen-Hsia Wen
口試日期 Date of Exam	2015-07-17	繳交日期 Date of Submission	2015-08-03
關鍵字 Keywords	健保資料庫、大數據運算、Apache Hadoop、Apache Hive、Hive on Spark、Impala National Health Insurance Research Database, Big Data Computing, Apache Hadoop, Apache Hive, Hive on Spark, Impala

中文摘要
我國是全球少數擁有完整國民健康保險的國家，病患每次的看診資訊均記錄在健保資料庫中，有了全國民眾的就診資料，就可以利用大數據的技術，來進行分析，並從中取得有用的資訊，這些探勘出來的資訊目前許多應用於預防醫學，或療程及用藥之效能改善，希望能從中減少健保支出，或提升醫療成效與品質，甚至及早預防。但這些龐大的資料不是一般資料處理軟體可以負荷，必須導入新的處理方式來解決資料量過大無法處理的問題。本研究是將全民健康保險資料移至大數據運算平台上面，並且結合各種分析工具來幫助加速資料分析，本論文的目的在於針對目前醫藥人員常用之探勘主題，探討相對於大數據平台之運算，何種工具是最適合、或者最有效率來處理全民健康保險資料。我們將針對幾個典型的運算，分析運算資料（搜尋及計算）的速度，與儲存資料（運算過程中及最後資料回存）的速度。健保資料分析必須要能夠快速計算出想要的資料，並且能快速將資料儲存下來的工具，才是最為合適的工具。關鍵字：健保資料庫、大數據運算、Apache Hadoop、Apache Hive、Hive on Spark、Impala
Abstract
Taiwan is one of the few countries in the world with a comprehensive national health insurance program, and every patient visit is recorded in the National Health Insurance Research Database. With treatment records covering the whole population, big data technology can be used to analyze the data and extract useful information. The mined information is currently applied mostly to preventive medicine and to improving the effectiveness of treatment and medication, with the aim of reducing national health insurance expenditure, raising the effectiveness and quality of health care, and even preventing disease early. However, data at this scale exceed what ordinary data processing software can handle, so a new processing approach is needed. In this study, the health insurance data are moved onto a big data computing platform and combined with a variety of analysis tools to accelerate data analysis. The purpose of this thesis is to determine, for the mining topics commonly studied by medical personnel, which tool is the most suitable or most efficient for processing national health insurance data on a big data platform. For several typical computations, we analyze the speed of computing on the data (searching and calculating) and the speed of storing the data (intermediate results and the final write-back). The most suitable tool for health insurance data analysis is one that can both compute the desired results quickly and store them quickly.
Keywords: National Health Insurance Research Database, Big Data Computing, Apache Hadoop, Apache Hive, Hive on Spark, Impala

目次 Table of Contents
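論文審定書 Thesis Approval Form i
論文公開授權書 Thesis Authorization Form ii
誌謝 Acknowledgements iii
摘要 Abstract (Chinese) iv
Abstract v
目錄 Table of Contents vi
圖次 List of Figures ix
表次 List of Tables x
第一章 序論 Chapter 1 Introduction 1
  1.1 研究動機 Research Motivation 1
  1.2 研究目的 Research Objectives 1
  1.3 論文架構 Thesis Organization 2
第二章 研究背景 Chapter 2 Background 4
  2.1 全民健康保險研究資料庫介紹 Introduction to the National Health Insurance Research Database 4
    2.1.1 資料描述 Data Description 4
    2.1.2 串檔說明 File Linkage 5
  2.2 大數據運算平台介紹 Introduction to Big Data Computing Platforms 5
    2.2.1 Apache Hadoop介紹 Introduction to Apache Hadoop 6
      2.2.1.1 HDFS 6
      2.2.1.2 MapReduce 9
    2.2.2 CDH介紹 Introduction to CDH 11
  2.3 分析工具介紹 Introduction to Analysis Tools 13
    2.3.1 Apache Hive 13
    2.3.2 Hive on Spark 15
      2.3.2.1 Apache Spark 15
    2.3.3 Cloudera Impala 17
    2.3.4 PostgreSQL 19
第三章 系統架構 Chapter 3 System Architecture 21
  3.1 硬體規格 Hardware Specifications 21
  3.2 軟體版本 Software Versions 22
    3.2.1 作業系統版本 Operating System Version 22
    3.2.2 Hadoop與相關套件之版本 Versions of Hadoop and Related Packages 22
  3.3 測試環境架構與服務配置 Test Environment Architecture and Service Deployment 23
    3.3.1 節點命名與服務配置 Node Naming and Service Deployment 23
    3.3.2 測試環境架構一 Test Environment Architecture 1 25
    3.3.3 測試環境架構二 Test Environment Architecture 2 25
    3.3.4 測試環境架構三 Test Environment Architecture 3 26
第四章 實驗流程與數據說明 Chapter 4 Experimental Procedure and Results 27
  4.1 實驗資料介紹 Experimental Data 27
    4.1.2 資料轉換 Data Conversion 28
    4.1.3 資料匯入 Data Import 29
    4.1.4 建立資料表 Table Creation 30
  4.2 全民健康保險研究資料庫查詢介紹 Queries on the National Health Insurance Research Database 33
    4.2.1 常用查詢介紹 Common Queries 34
    4.2.2 實驗查詢介紹 Experimental Queries 37
  4.3 實驗設計與數據說明 Experimental Design and Results 38
    4.3.1 實驗設計一：分析工具之效能比較 Experiment 1: Performance Comparison of Analysis Tools 39
    4.3.2 實驗設計二：Hadoop叢集節點數分析 Experiment 2: Analysis of Hadoop Cluster Node Count 41
    4.3.3 實驗設計三：輸出檔案大小比較 Experiment 3: Comparison of Output File Sizes 42
第五章 結論與未來展望 Chapter 5 Conclusion and Future Work 44
參考文獻 References 45
附錄 Appendix 50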


電子全文 Fulltext
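本電子全文僅授權使用者為學術研究之目的，進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定，切勿任意重製、散佈、改作、轉貼、播送，以免觸法。 This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without permission.
論文使用權限 Thesis access permission：自定論文開放時間 user define
開放時間 Available：校內 Campus：已公開 available；校外 Off-campus：已公開 available
etd-0624115-131244.pdf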
紙本論文 Printed copies
開放時間 Available：已公開 available
