National Sun Yat-sen University, thesis/dissertation, Data Mining of National Health Insurance Research Database: Design and Implementation of Visualized Automatic Query Language Generator

Title	Data Mining of National Health Insurance Research Database: Design and Implementation of Visualized Automatic Query Language Generator (健保資料大數據探勘之視覺化自動查詢語言產生器設計與實作)
Department	Department of Computer Science and Engineering
Year, semester	Fall semester of Academic Year 106	Language	Chinese
Degree	Master	Number of pages	106
Author	Hong-Yi Pu (蒲宏易)
Advisor	Chun-Hung Lin (林俊宏)
Convenor	Shi-Huang Chen (陳璽煌)
Advisory Committee	You-Chiun Wang (王友群); Wei-Kuang Lai (賴威光); Jain-shing Liu (劉建興)
Date of Exam	2017-08-11	Date of Submission	2017-08-25
Keywords	SQL, Impala, Hadoop, Big data, Healthcare database, JavaScript, HTML
Statistics	This thesis/dissertation has been browsed 5661 times and downloaded 18 times.

Abstract
Taiwan is one of the few countries that have implemented a national health insurance system. Since the program began in 1995, it has retained detailed records of medical visits for the entire population. Because these records are comprehensive and complete, they are a valuable research resource: they can be used to study relationships between diseases and drugs, and they can serve as a foundation for the development of preventive medicine. To mine and analyze the National Health Insurance data, this study, in cooperation with Kaohsiung Medical University, developed and deployed a big-data storage and mining platform for that data. Because the data volume is very large, the system does not use a traditional database. Instead, it manages the data with the Hadoop distributed file system, which offers excellent storage scalability and the distributed computing capacity needed for high-speed data mining. Impala, with its in-memory distributed computation and its use of SQL as the query language, makes it possible to filter specific research data out of a large volume of data quickly.
Medical researchers, however, are generally not familiar with writing SQL queries. To make the platform more accessible to them, this study designs and implements a SQL-generation web page intended specifically for medical researchers. The page is built with HTML and JavaScript and lets users enter their query conditions through a simple visual interface and obtain the corresponding SQL statement. The implementation covers two types of search. The first, the "General Searching Method", is a broad, general-purpose search that lets users query almost all columns of the health insurance database. It breaks the SQL statement into its parts and guides users step by step through setting conditions until the required SQL statement is produced. The second, the "Disease and Drug Searching Method", is a specialized search for disease codes and drug codes. It is simpler than the first method, using a single condition-setting page, and it provides condition settings and restrictions specific to disease and drug codes. The SQL it generates gives the user a person-level table, based on the one-million-person sample file, that contains each person's basic information together with the target disease and drug information. This table gives medical researchers a quick and convenient basis for subsequent analysis and research. The method also provides analysis charts, so that users can read the head counts for their query and the associations between diseases or drugs directly from a chart and then decide whether further analysis is worthwhile.
Finally, this thesis analyzes and runs optimization experiments on the SQL model used by the second search method, and discusses the conditions to watch for and the methods available when optimizing SQL in Impala.
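The step-by-step generation described for the General Searching Method can be illustrated with a minimal sketch. It is not the thesis's code: the thesis page is written in HTML and JavaScript, while this sketch uses Python for brevity. The clause order (FROM, SELECT, WHERE, GROUP BY, ORDER BY) follows the implementation steps listed in the table of contents; the `QuerySpec` structure, the `build_sql` function, and the table and column names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class QuerySpec:
    """Selections collected by a step-by-step form (hypothetical structure)."""
    table: str                                      # FROM step
    columns: list = field(default_factory=list)     # SELECT step
    conditions: list = field(default_factory=list)  # WHERE step: (column, operator, value)
    group_by: list = field(default_factory=list)    # GROUP BY step
    order_by: list = field(default_factory=list)    # ORDER BY step


ALLOWED_OPERATORS = {"=", "<>", "<", "<=", ">", ">=", "LIKE"}


def render_literal(value):
    """Render a value as a SQL literal; strings with quotes or backslashes are rejected."""
    if isinstance(value, (int, float)):
        return str(value)
    text = str(value)
    if "'" in text or "\\" in text:
        # Assumption: the sketch refuses such input instead of picking an escaping rule.
        raise ValueError("quotes and backslashes are not allowed in values")
    return "'" + text + "'"


def build_sql(spec):
    """Assemble a SELECT statement clause by clause from the collected selections."""
    parts = [
        "SELECT " + (", ".join(spec.columns) if spec.columns else "*"),
        "FROM " + spec.table,
    ]
    if spec.conditions:
        rendered = []
        for column, operator, value in spec.conditions:
            if operator not in ALLOWED_OPERATORS:
                raise ValueError("unsupported operator: " + operator)
            rendered.append(column + " " + operator + " " + render_literal(value))
        parts.append("WHERE " + " AND ".join(rendered))
    if spec.group_by:
        parts.append("GROUP BY " + ", ".join(spec.group_by))
    if spec.order_by:
        parts.append("ORDER BY " + ", ".join(spec.order_by))
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical table and column names, not taken from the thesis.
    example = QuerySpec(
        table="visits_2010",
        columns=["patient_id", "COUNT(*) AS n"],
        conditions=[("diagnosis_code", "=", "250")],
        group_by=["patient_id"],
        order_by=["n DESC"],
    )
    print(build_sql(example))
```

Keeping the user's selections in a plain data structure and rendering the statement only at the end means each form step fills in one field, and optional clauses are simply omitted when left empty.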

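Table of Contents
Thesis Approval Form
Acknowledgments
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Motivation and Objectives
  1.2 Thesis Organization
Chapter 2 Background
  2.1 National Health Insurance Database
  2.2 Apache Hadoop
  2.3 Apache Hive
  2.4 Cloudera CDH
  2.5 Apache Impala
    2.5.1 Impala and Hive
    2.5.2 Impala Architecture
  2.6 Structured Query Language (SQL)
  2.7 Nginx
  2.8 HTML
  2.9 JavaScript
  2.10 WebSocket
  2.11 CanvasJS
Chapter 3 Related Work
Chapter 4 System Design and Architecture
  4.1 Hadoop Storage Platform
    4.1.1 Hardware Configuration
    4.1.2 Software Configuration
  4.2 Web-Side System Architecture
Chapter 5 System Implementation
  5.1 System Overview
  5.2 General Searching Method
    5.2.1 Purpose
    5.2.2 Workflow
    5.2.3 Implementation
      5.2.3.1 Selecting the Server and the Health Insurance Database Year
      5.2.3.2 Selecting FROM
      5.2.3.3 Selecting SELECT
      5.2.3.4 Setting WHERE
      5.2.3.5 Setting GROUP BY and ORDER BY
      5.2.3.6 Generating SQL
  5.3 Disease and Drug Searching Method
    5.3.1 Purpose
    5.3.2 Workflow
    5.3.3 Implementation
      5.3.3.1 Preliminary System Configuration
      5.3.3.2 Generate SQL
      5.3.3.3 Analysis Function
  5.4 Intermediary Server
  5.5 WebSocket
Chapter 6 Impala SQL Performance Analysis and Optimization
  6.1 Compute Stats and the Impala Daemon Monitoring Web Page
    6.1.1 Compute Stats
    6.1.2 Impala Daemon Monitoring Web Page
  6.2 SQL Model Analysis and Optimization Experiments
    6.2.1 Initial Model Experiment
    6.2.2 Modified JOIN Model Experiment
    6.2.3 UNION-Removal Model Experiment
  6.3 Analysis and Discussion of Experimental Results
Chapter 7 System Operation and Results
  7.1 General Searching Method
  7.2 Disease and Drug Searching Method
    7.2.1 Web Page - Generate SQL
    7.2.2 Web Page - Analysis Function
Chapter 8 Conclusion and Future Work
References
Appendix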

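Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: user-defined availability date
Available: Campus: available; Off-campus: available
etd-0721117-150952.pdf
Printed copies
Public-access information for printed theses is relatively complete from academic year 102 onward. For public-access information on printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available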

Office of Library and Information Services, National Sun Yat-sen University │ Contact us: 2452, Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS