國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,研究以大數據之雲端技術為基礎之健保資料庫探勘：以失智症之危險因子分析為例,The Study of Big Data Mining of Taiwan’s National Health Insurance Database Based on Cloud Technologies: the Case Study of Risk Factors for Dementia

論文名稱 Title	研究以大數據之雲端技術為基礎之健保資料庫探勘：以失智症之危險因子分析為例 The Study of Big Data Mining of Taiwan’s National Health Insurance Database Based on Cloud Technologies: the Case Study of Risk Factors for Dementia
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	103 學年度第 2 學期 The spring semester of Academic Year 103	語文別 Language	中文 Chinese
學位類別 Degree	碩士 Master	頁數 Number of pages	78
研究生 Author	劉淳楷 Chun-kai Liou
指導教授 Advisor	林俊宏 Chun-Hung Lin
召集委員 Convenor	吳信昇 Shihn-Sheng Wu
口試委員 Advisory Committee	蔡瑞修, 陳璽煌, 溫燕霞 Jui-Hsiu Tsai; Shi-Huang Chen; Yen-Hsia Wen
口試日期 Date of Exam	2015-07-17	繳交日期 Date of Submission	2015-07-25
關鍵字 Keywords	Java、Hive、Hadoop、健保資料庫、Big Data、分散式運算、雲端運算 National Health Insurance Research Database, Big Data, Cloud Computing, Java, Distributed Computing, Hive, Hadoop

中文摘要
近幾年來，雲端運算、分散式運算和大數據(Big Data)技術越來越成熟，也越來越重要，各式各樣的資料分析方法如雨後春筍般的出現，也應用在各個不同的行業領域，以前不會想到的地方，現在都用上了Big Data的分析。當然，在醫學上也不可以缺席，台灣全民健保從民國84年(西元1995年)實施到現在，完整記錄著台灣全國人民的就醫狀況，所以我們有歷時20年、並至少有2000萬人以上的健保資料，這龐大的資料很適合以大數據方式來進行分析，讓研究人員可以對某些病症更瞭解，也能透過大數據的分析，找到一點以前沒注意到的蛛絲馬跡，解決更多醫學上的未解之迷。本論文之研究主要是以研究人員的需求出發，實作一個符合這類使用者所需的使用介面之大數據分析平台，利用Hadoop分散式檔案系統做為資料倉儲，來儲存大量的健保資料，再用Hive或Impala工具做初步過濾，依條件來海撈符合條件的資料，然後進行細篩，此處以Java程式做分析工作，最後呈現一份數據報表供研究人員參考。例如，本論文以失智症風險因子探勘為應用，研究人員可以就失智症病人的年齡分佈、男女性別比例，進行特定族群分析。另外，大數據技術對於尋找新風險因子特別有用，而且對於多重因子共存的影響分析研究。這是過去研究人員所慣用的統計軟體，如：SPSS或SAS，所無法做到的，因為資料量太多，運算條件複雜，造成運算量過大無法單機進行運算，如今都可以用Big Data技術來克服。
Abstract
In recent years, cloud computing, distributed computing, and big data technologies have matured and become increasingly important. Data analysis methods of all kinds have emerged and are now applied across many industries, including areas where such analysis was once not considered. Medicine is no exception. Taiwan's National Health Insurance program has been in place since 1995 and holds a complete record of the population's medical visits, giving us health insurance data spanning 20 years and covering more than 20 million people. Data of this size is well suited to big data analysis, which can help researchers understand diseases better and uncover details that previously went unnoticed, helping to answer more open questions in medicine. This thesis starts from the needs of researchers and implements a big data analysis platform for the health insurance database with a user interface suited to them. The system uses the Hadoop Distributed File System (HDFS) as the data warehouse for the large volume of health insurance data, uses Hive or Impala for a first pass that retrieves records matching the given conditions, and then uses a Java program for finer filtering and analysis. Finally, it presents a data report for researchers to consult. As an application, this thesis mines risk factors for dementia: researchers can analyze specific groups of dementia patients by age distribution and sex ratio. Big data techniques are also particularly useful for finding new risk factors and for analyzing the effect of multiple coexisting factors. The statistical software that researchers have traditionally used, such as SPSS or SAS, could not do this, because the volume of data and the complexity of the conditions made the computation too large for a single machine. Big data technology can now overcome these limits.
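The filtering pipeline described in the abstract can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's code: the table name `claims`, its columns, the `Claim` class, the age cutoff of 65, and the use of the ICD-9 prefix 290 to pick dementia records are all hypothetical choices made for the example. The Hive/Impala step is represented only by the query text it might run, and the rows it would return are stubbed in memory so the sketch runs without a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TwoStageFilterSketch {

    /** One simplified claim row. All field names are hypothetical. */
    static final class Claim {
        final String patientId;
        final String icd9;
        final int age;
        final char sex; // 'M' or 'F'

        Claim(String patientId, String icd9, int age, char sex) {
            this.patientId = patientId;
            this.icd9 = icd9;
            this.age = age;
            this.sex = sex;
        }
    }

    /**
     * Stage 1 (coarse filter): builds the query text a Hive or Impala client
     * might run. Table and column names are assumptions for illustration.
     */
    static String coarseQuery(String icd9Prefix) {
        return "SELECT patient_id, icd9_code, age, sex FROM claims "
             + "WHERE icd9_code LIKE '" + icd9Prefix + "%'";
    }

    /** Stage 2 (fine filter): keep the first row per patient aged minAge or older. */
    static List<Claim> fineFilter(List<Claim> rows, int minAge) {
        Map<String, Claim> firstPerPatient = new LinkedHashMap<>();
        for (Claim c : rows) {
            if (c.age >= minAge) {
                firstPerPatient.putIfAbsent(c.patientId, c);
            }
        }
        return new ArrayList<>(firstPerPatient.values());
    }

    /** Report: number of patients in each 10-year age band. */
    static Map<String, Integer> ageBands(List<Claim> cohort) {
        Map<String, Integer> bands = new TreeMap<>();
        for (Claim c : cohort) {
            int low = (c.age / 10) * 10;
            bands.merge(low + "-" + (low + 9), 1, Integer::sum);
        }
        return bands;
    }

    /** Report: number of patients of the given sex. */
    static int countSex(List<Claim> cohort, char sex) {
        int n = 0;
        for (Claim c : cohort) {
            if (c.sex == sex) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // Stand-in for the rows the coarse query would return (made-up data).
        List<Claim> coarse = Arrays.asList(
            new Claim("P1", "290.0", 72, 'F'),
            new Claim("P1", "290.0", 73, 'F'),
            new Claim("P2", "290.4", 58, 'M'),
            new Claim("P3", "290.1", 81, 'M'));

        System.out.println(coarseQuery("290"));
        List<Claim> cohort = fineFilter(coarse, 65);
        System.out.println("Age bands: " + ageBands(cohort));
        System.out.println("F=" + countSex(cohort, 'F') + " M=" + countSex(cohort, 'M'));
    }
}
```

Splitting the work this way keeps the expensive scan on the cluster and leaves the per-patient logic in ordinary Java, where it is easier to change. A real system would run the query through a JDBC or Thrift client and should bind the filter value as a parameter rather than concatenating it into the query string.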

目次 Table of Contents
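Thesis Approval Form i
Thesis Public Release Authorization ii
Acknowledgments iii
Abstract (Chinese) iv
Abstract (English) v
Table of Contents vi
List of Figures ix
List of Tables xi
Chapter 1 Introduction 12
1.1 Research Motivation 12
1.2 Research Objectives 12
1.3 Current State of Research 13
1.4 Thesis Organization 13
Chapter 2 Research Background 15
2.1 Big Data Technologies 15
2.1.1 Apache Hadoop 15
2.1.2 Hadoop Distributed File System (HDFS) 15
2.1.3 Apache Spark 18
2.1.4 Apache Hive 19
2.1.5 Apache HBase 19
2.1.6 Impala 19
2.1.7 Cloudera Distribution Hadoop (CDH) 21
2.2 National Health Insurance Database 22
2.3 HyperSQL DataBase 23
2.4 Programming Languages 24
2.4.1 Java 24
2.4.2 SQL 25
2.5 Bayes' Theorem 25
Chapter 3 Design of the Health Insurance Database Mining System 27
3.1 System Overview 27
3.1.1 System Features 27
3.1.2 System Architecture 27
3.1.2.1 Web (PHP) 28
3.1.2.2 Server (Java) 28
3.1.2.3 HyperSQL DataBase 29
3.1.2.4 Hadoop 29
3.1.3 System Workflow 29
3.2 System Development Requirements 30
3.2.1 Hardware Requirements 30
3.2.2 Software Requirements 31
Chapter 4 Implementing Automated Mining of the Health Insurance Database 32
4.1 Web-Side Functions 32
4.1.1 Web Core 33
4.1.1.1 Core to User View 33
4.1.1.2 Core to Server 34
4.1.1.3 Core to HyperSQL DataBase 35
4.1.2 User View 35
4.1.3 Web Function Module 38
4.1.4 Adding and Removing Web Modules 38
4.2 Server-Side Functions 40
4.2.1 Server Core 40
4.2.1.1 Core to Web Side 40
4.2.1.2 Core to HyperSQL DataBase Side 42
4.2.2 Server Function Module 42
4.2.3 DataBase Module 43
4.2.3.1 HiveAccess 43
4.2.3.2 ImpalaAccess 44
4.2.4 Adding and Removing Server Modules 45
4.3 HyperSQL DataBase Functions 47
4.3.1 Using the Database 47
4.3.2 Table Descriptions 48
4.4 Hadoop-Side Functions 49
4.4.1 Hive 49
4.4.2 Impala 50
4.4.3 Steps and Methods for Loading the Health Insurance Database into Hadoop 50
Chapter 5 Demonstration of System Results 54
Chapter 6 Risk Factor Analysis for Dementia 59
6.1 Selecting Dementia Patients 59
6.2 Identifying Prior Diseases 62
6.3 Comparing Prior Diseases with Known Risk Factors 62
Chapter 7 Conclusions and Future Work 66
References 67
Appendix A 70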

參考文獻 References
[1] 余清祥、蘇維屏，全民健保資料庫分析：重大傷病及癌症之研究 (2014-06)
[2] 林麗芬、李宗道，全民健保資料庫之女性乳癌緩解與復發統計分析 (2013-06)
[3] 林麗芬教授，「全民健康保險資料庫在醫療保險之應用」研討會 (2013-08-29)
[4] Yung-Tai Yen; Chien-Yeh Hsu, Apply Grid Computation for Population-based Health Claims Analysis (IEEE, 2007)
[5] Chu-Cheng Kuo; Fang-Chi Yang; Meng-Han Yang; Ding-Dar Lee, Predicting the Onset of Bullous Pemphigoid with Co-morbidities: A Survey Based on a Nationwide Medical Database (IEEE, 2013)
[6] Wang, Hsing-I, The Preliminary Survey for the Design of Multipurpose Medical Databases for Taiwan – the Usage Oriented Approach (IEEE, 2011)
[7] Jingmin Li, Design of real-time data analysis system based on Impala (IEEE, 2014)
[8] Rawte, V.; Anuradha, G., Fraud detection in health insurance using data mining techniques (ICCICT, 2015)
[9] Sobhy, D.; El-Sonbaty, Y.; Abou Elnasr, M., MedCloud: Healthcare cloud computing system (ICITST, 2012)
[10] Apache Hadoop https://hadoop.apache.org (2015-07-23)
[11] Apache HBase http://hbase.apache.org (2015-07-23)
[12] Apache Hive https://hive.apache.org (2015-07-23)
[13] Apache Spark https://spark.apache.org/ (2015-07-23)
[14] Apache Impala http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html (2015-07-23)
[15] Impala comparison data against other software http://www.parallellabs.com/2013/08/25/impala-big-data-analytics/ (2015-07-23)
[16] HDFS http://hadoop.apache.org/docs/r1.0.4/cn/hdfs_design.html (2015-07-23)
[17] HSQL DataBase http://www.hsqldb.org (2015-07-23)
[18] Cloudera Distribution Hadoop http://www.cloudera.com/ (2015-07-23)
[19] National Health Insurance Research Database http://nhird.nhri.org.tw (2015-07-23)
[20] Java http://www.java.com (2015-07-23)
[21] Dremel http://www.yankay.com/google-dremel-rationale/ (2015-07-23)
[22] php-thrift-hive-client https://github.com/garamon/php-thrift-hive-client (2015-07-23)
[23] html5-menu https://github.com/fpmweb/html5-menu (2015-07-23)
[24] Hive syntax usage https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingvaluesintotablesfromSQL (2015-07-23)
[25] GSON https://code.google.com/p/google-gson/ (2015-07-23)
[26] jQuery AJAX usage http://expect7.pixnet.net/blog/post/37919326 (2015-07-23)
[27] Impala JDBC Code https://github.com/onefoursix/Cloudera-Impala-JDBC-Example (2015-07-23)
[28] Tao Jiang, Qianlong Zhang, Rui Hou, Lin Chai, Sally A. Mckee, Zhen Jia, Ninghui Sun. Understanding the Behavior of In-Memory Computing Workloads. Workload Characterization (IISWC), 2014 IEEE International Symposium on, pp.22-30, Oct. 2014.
[29] E. Sivaraman, Dr. R. Manickachezian. High Performance and Fault Tolerant Distributed File System for Big Data Storage and Processing using Hadoop. Intelligent Computing Applications (ICICA), 2014 International Conference on, pp.32-36, March 2014.
[30] Md. Wasi-ur-Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Jithin Jose, Hari Subramoni, Hao Wang, Dhabaleswar K. (DK) Panda. High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand. Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, pp.1908-1917, May 2013.
[31] Kala Karun. A, Chitharanjan. K. A Review on Hadoop – HDFS Infrastructure Extensions. Information & Communication Technologies (ICT), 2013 IEEE Conference on, pp.132-137, April 2013.
[32] Xiaofei Hou, Ashwin Kumar T K, Johnson P Thomas, Vijay Varadharajan. Dynamic Workload Balancing for Hadoop MapReduce. Big Data and Cloud Computing (BdCloud), 2014 IEEE Fourth International Conference on, pp.56-62, Dec. 2014.
[33] Kalpana Dwivedi, Sanjay Kumar Dubey. Analytical Review on Hadoop Distributed File System. Confluence The Next Generation Information Technology Summit (Confluence), 2014 5th International Conference, pp.174-181, Sept. 2014.

電子全文 Fulltext
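This electronic full text is licensed to users solely for academic research, for personal, non-profit searching, reading, and printing. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission：自定論文開放時間 user define
開放時間 Available：校內 Campus：已公開 available；校外 Off-campus：已公開 available
etd-0623115-104358.pdf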
紙本論文 Printed copies
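Public information on printed theses is relatively complete from Academic Year 102 onward. To look up printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services.
開放時間 available：已公開 available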

