博碩士論文 etd-0620116-134233 詳細資訊
Title page for etd-0620116-134233
論文名稱
Title
健保資料庫探勘:叢集網路規劃與自動查詢語言產生器
Data Mining of National Health Insurance Database: Network Design of Computing Cluster and Automatic Query Language Generator
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
102
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2016-07-22
繳交日期
Date of Submission
2016-08-22
關鍵字
Keywords
Openflow、LXC、SQL、Impala、健保資料庫、大數據
LXC, virtual machine, SQL, National Health Insurance, Impala, Openflow, Big Data
統計
Statistics
本論文已被瀏覽 5653 次,被下載 29 次
The thesis/dissertation has been browsed 5653 times, has been downloaded 29 times.
中文摘要
近年關於大數據的應用蓬勃發展,許多人都想要嘗試使用大數據作為某些問題的解決方案,但常常礙於伺服器設備昂貴及缺少專業的分析人員而失敗。
台灣自西元1995年起實施的全民健康保險制度至今,累積了台灣人民大量的健保資料,但在從前,醫師或相關研究人員只能以Excel等工具來分析這些資料,而因為資料的龐大導致分析的速度慢,甚至無法處理。
所以本篇論文利用Hadoop分散式運算的特性,提出一個解決方案,是以一般市售個人電腦所組成的運算叢集,透過使用Linux容器(container) LXC確保資料的安全及維護、擴充上的方便,另外可使用OpenFlow的設定確保運算時所需要的網路頻寬。
在軟體上,依照健保資料庫的特性,選擇Cloudera公司Impala作為運算的工具,Impala使用SQL作為查詢的語言,降低入門門檻,並使原本熟悉資料庫的工程師可以快速地上手。Impala的運算速度也較其他使用Map/Reduce的工具快上許多,在Client端使用HTML開發視覺化查詢描述介面,讓非資訊專長的使用者能夠快速跨平台的使用,利用圖形化勾選的方式就能即時編譯而轉換產生相對應的資料庫SQL查詢語言,使不熟悉SQL語法的非專業人士也可以使用此平台取得所需要的資料。
Abstract
Big data technologies and applications have flourished in recent years. However, big data solutions remain costly because they require expensive hardware and professional software tools for analysis.
Taiwan's National Health Insurance program, in operation since 1995, has accumulated treatment records for the entire population of Taiwan. In the past, Microsoft Excel and statistical packages such as SAS and SPSS were the main tools for analyzing these data. However, the data are so large that these tools either cannot process them at all or take too long to do so.
Therefore, the purpose of our research is to find a solution that can process big data at a moderate cost. In addition to the platform, we design and implement an automatic SQL code generator, which is useful for users who are not IT experts but need to mine data from the platform. Our proposed platform is a computing cluster built from many off-the-shelf personal computers. On this cluster we use Linux Containers (LXC), an operating-system-level virtualization tool, to ensure data security, system scalability, and utilization. We also use OpenFlow to ensure the network bandwidth required during data mining.
We choose Cloudera Impala as the data mining tool. Impala uses standard SQL as its query language, which narrows the gap between users and the database. Because its implementation takes an in-memory approach, Impala answers queries faster than tools based on MapReduce. Additionally, we use HTML5 to build the interface of the automatic SQL generator, so that non-IT users can quickly obtain correct SQL code and then execute it.
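The idea of generating SQL from choices made in a form can be illustrated with a minimal sketch. This is not the thesis implementation (which, according to the table of contents, uses an HTML5 client with a PHP and Java back end); it only shows the general technique in Python. The table name `visits`, the column names, the shape of the `selection` dictionary, and the operator whitelist are all hypothetical, chosen for illustration.

```python
import re

# Hypothetical whitelist of comparison operators the form may offer.
ALLOWED_OPS = {"=", "<>", "<", "<=", ">", ">=", "LIKE"}
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def check_identifier(name):
    """Reject anything that is not a plain table or column name."""
    if not IDENTIFIER.fullmatch(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def quote_literal(value):
    """Render a value as an SQL literal: numbers bare, strings quoted."""
    if isinstance(value, bool):
        raise TypeError("boolean literals are not handled in this sketch")
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    # Assumption: rather than guess at escaping rules, refuse risky input.
    if "'" in text or "\\" in text:
        raise ValueError("quotes and backslashes are not allowed in values")
    return f"'{text}'"


def build_query(selection):
    """Turn a form selection (a dict) into a single SELECT statement."""
    table = check_identifier(selection["table"])
    columns = [check_identifier(c) for c in selection["columns"]]
    if not columns:
        raise ValueError("at least one column must be selected")
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    conditions = []
    for cond in selection.get("filters", []):
        if cond["op"] not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {cond['op']!r}")
        column = check_identifier(cond["column"])
        conditions.append(f"{column} {cond['op']} {quote_literal(cond['value'])}")
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


# Hypothetical selection, as a form might submit it after the user ticks boxes.
example = {
    "table": "visits",
    "columns": ["patient_id", "visit_date"],
    "filters": [
        {"column": "diagnosis_code", "op": "LIKE", "value": "250%"},
        {"column": "visit_date", "op": ">=", "value": "2010-01-01"},
    ],
}
print(build_query(example))
```

Checking identifiers and operators against fixed rules, instead of pasting user input into the statement, keeps the generated text limited to the query shapes the form is meant to produce.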
目次 Table of Contents
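Thesis Approval Form i
Abstract (Chinese) ii
Abstract (English) iii
Table of Contents iv
List of Figures viii
List of Tables x
Chapter 1 Introduction 1
1.1. Research Motivation and Objectives 1
1.2. Thesis Organization 2
Chapter 2 Background 3
2.1. Software-Defined Networking (SDN) 3
2.2. OpenFlow 4
2.3. Linux Containers (LXC) 5
2.3.1. Overview 5
2.3.2. Installation and Startup 6
2.4. Apache Hadoop 7
2.5. Hadoop Distributed File System (HDFS) 7
2.6. MapReduce 8
2.7. Apache Hive 9
2.7.1. Hive Workflow 9
2.8. Cloudera Distribution of Apache Hadoop (CDH) 10
2.9. Apache Impala 11
2.9.1. Impala Architecture 12
2.9.2. Impala Workflow 12
2.10. Structured Query Language (SQL) 13
2.11. National Health Insurance Database 14
2.12. HTML5 17
2.13. JavaScript 18
2.14. PHP 18
2.15. EzoApp 19
Chapter 3 System Design and Architecture 20
3.1. System Requirements 20
3.1.1. Hardware Requirements 20
3.1.2. Software Requirements 21
3.2. System Architecture 21
3.2.1. Cluster Hardware and Network Architecture 21
3.2.2. Software Architecture 22
3.3. CDH5 Installation and Configuration 23
3.3.1. Setting Up the CDH5 Installation Environment in LXC 23
3.3.2. Installing CDH5 25
3.3.3. Introduction to the CDH5 Management Interface 26
3.3.4. Adding Machines and Failure Recovery 29
3.4. Introduction to the HUE Interface 30
Chapter 4 System Implementation 32
4.1. Client Side 32
4.1.1. Interactive Windows 32
4.1.2. Input Field Restrictions 33
4.1.3. File Upload 34
4.1.4. Listening for Input Field Changes 34
4.1.5. Adding or Removing DOM Attributes 35
4.1.6. Progress Bar 36
4.1.7. jQuery.post() 37
4.1.8. PHP Socket 37
4.2. Server Side 39
4.2.1. Java Socket 39
4.2.2. Parsing JSON 40
4.2.3. Connecting to Impala via JDBC 40
4.3. System Installation 42
Chapter 5 Introduction to Impala SQL and Performance Optimization 43
5.1. Impala SQL Statements 43
5.1.1. Data Types 43
5.1.2. Operators 43
5.1.3. View 44
5.1.4. Built-in Functions 47
5.1.5. Intersection and Union 48
5.2. Impala SQL Performance Analysis and Optimization 49
5.2.1. EXPLAIN 49
5.2.2. COMPUTE STATS 52
5.2.3. PROFILE 60
Chapter 6 System Results Demonstration 66
6.1. Client Side 66
6.1.1. First Diagnosis Records 66
6.1.2. Medication Records 71
6.1.3. Relationships Between Earlier and Later Diseases 73
6.1.4. View Operations 74
6.1.5. Copying SQL into HUE 75
6.2. Server Side 77
Chapter 7 Conclusion and Future Work 78
References 79
Appendix A 82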
電子全文 Fulltext
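This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast the text without authorization, to avoid violating the law.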
論文使用權限 Thesis access permission:自定論文開放時間 user define
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
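Public-access information for printed theses is relatively complete from academic year 102 onward. To look up public-access information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.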
開放時間 Available: 已公開 available
