博碩士論文 etd-0411116-090441 詳細資訊
Title page for etd-0411116-090441
論文名稱
Title
有效聚合巨量資料以提供視覺化呈現之研究
Effectively Aggregating Big Data for Visualization
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
58
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2016-05-06
繳交日期
Date of Submission
2016-05-11
關鍵字
Keywords
巨量資料、資料探索分析、資料縮減、資料離散化、資料視覺化
Data Discretization, Exploratory Data Analysis, Big data, Data Reduction, Data Visualization
統計
Statistics
本論文已被瀏覽 5945 次,被下載 457
The thesis/dissertation has been browsed 5945 times, has been downloaded 457 times.
中文摘要
隨著科技的進步結合網路的發展,資料的收集越發容易,而這些巨量資料如何被妥善的運用端看企業或是個人如何從中獲取有用的資訊。在將資料作更進階的分析例如資料探勘前,會先做初步的了解,分析者可先對數據進行探索性資料分析(EDA),辨析數據的模式與特點,並且將其有序的發掘出來,便能靈活的選擇和調適適當的分析模型,常見的方式為作圖、製表、方程擬合等等探索數據的結構和規律。
論文中利用R非常適合探索數據以及擁有豐富的套件庫的特性來開發EDA軟體,並且針對大數據中的視覺化方式提出改善。數據量龐大的情況下,會利用資料縮減策略來降低資料量以及資料複雜度但仍保有資料原有的特性。而裝箱法為最常見的方法,大部分的文獻以及現有軟體中,常利用等距的方式聚合連續型的變數,等距的方式所縮減的資料雖然效率高但是對於偏態的資料分佈較不能有好的表現。因此,論文中比較了三種聚合方式: 等距、等高以及MHist,針對三者在縮減二維數據的效率以及精確度做評估。
評估結果發現等距的方法效率最高但是精確度卻是最低的,MHist的方法能夠在不同的分布情況下都有極高的精確度但是執行效率卻是最低,等高的方式在效率以及精確度上都有不錯的表現,因此將其運用至我們的軟體作為資料縮減技術。
Abstract
With the rapid development of internet technologies, data are easily generated and collected. How useful these data are depends on how well enterprises or individuals can derive valuable information from them. Before performing more complex analysis, analysts need to understand the data, preferably visually, which leads to the approach of Exploratory Data Analysis (EDA). With EDA, analysts can uncover the patterns and characteristics of the data and then choose an appropriate model for further analysis. Common EDA techniques include graphing, tabulation, and equation fitting, which help analysts explore the data and identify its regularities. Unfortunately, when the volume of data is huge, traditional EDA methods can be inefficient.
Our work uses R, which is well suited to data exploration and has a rich package library, to develop EDA software that visualizes big data efficiently. Data reduction strategies reduce large volumes of data to a smaller, less complex data set that still retains the characteristics of the original data. Specifically, we apply binning as the data reduction method. Equal-width binning is the most common method for aggregating continuous variables. Although equal-width binning is highly efficient, it performs poorly on skewed data distributions. In this thesis, we compare three aggregation approaches, equal-width, equal-depth, and MHist, by assessing their time efficiency and accuracy in reducing two-dimensional data.
Experimental results show that both equal-depth and MHist achieve much higher accuracy than equal-width, at some cost in efficiency. MHist is highly accurate across various data distributions but has the lowest efficiency. Equal-depth strikes a balance, with reasonable performance in both efficiency and accuracy, and is therefore adopted as the data reduction technique in our software.
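The contrast between equal-width and equal-depth binning described in the abstract can be illustrated with a small one-dimensional sketch. This is not the thesis's implementation: the thesis works in R and on two-dimensional data, whereas the example below uses Python for illustration only, the function names (`equal_width_edges`, `equal_depth_edges`, `bin_counts`) are hypothetical, and the synthetic exponential sample is an assumed stand-in for a skewed distribution. MHist is omitted.

```python
import random
from bisect import bisect_right


def equal_width_edges(values, k):
    """Return k + 1 edges that split [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k)] + [hi]


def equal_depth_edges(values, k):
    """Return k + 1 edges chosen so each bin holds roughly len(values) / k points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k] for i in range(k)] + [ordered[-1]]


def bin_counts(values, edges):
    """Count points per bin; bins are half-open [a, b) except the last, which is closed."""
    k = len(edges) - 1
    counts = [0] * k
    for v in values:
        idx = min(bisect_right(edges, v) - 1, k - 1)
        counts[idx] += 1
    return counts


# Assumed test data: a right-skewed (exponential) sample.
random.seed(0)
sample = [random.expovariate(1.0) for _ in range(10000)]

print("equal-width counts:", bin_counts(sample, equal_width_edges(sample, 10)))
print("equal-depth counts:", bin_counts(sample, equal_depth_edges(sample, 10)))
```

On skewed data, equal-width binning concentrates most points in the first few bins and leaves the tail bins nearly empty, while equal-depth binning spreads points evenly at the cost of a sort. That trade-off is consistent with the efficiency and accuracy comparison reported in the abstract.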
目次 Table of Contents
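Chapter 1. Introduction 1
Section 1. Research Background 1
Section 2. Motivation 2
Section 3. Thesis Organization 3
Chapter 2. Literature Review 5
Section 1. Exploratory Data Analysis 5
Section 2. EDA Software 8
Section 3. Big Data Reduction Methods 10
Chapter 3. System Architecture 14
Section 1. User Interface 14
Section 2. EDA Server 15
Section 3. Interactive Chart Library 16
Section 4. Database 16
Chapter 4. System Development and Design 17
Section 1. Uploading Files 17
Section 2. Selecting Filters 18
Section 3. Selecting Analysis Tools 19
Section 4. Interacting with Charts 24
Chapter 5. Data Reduction Techniques 25
Section 1. Sampling 25
Section 2. Aggregation 25
Chapter 6. Evaluation of Data Reduction Display Methods 33
Section 1. Experimental Data 33
Section 2. Accuracy Evaluation 35
Section 3. Efficiency Evaluation 42
Chapter 7. Conclusion 45
References 47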
參考文獻 References
Ahlberg, C., & Shneiderman, B. (1994, April). Visual information seeking: Tight coupling of dynamic query filters with starfield displays. Proceedings of the SIGCHI conference on Human factors in computing systems, 313-317.

Battle, L., Stonebraker, M., & Chang, R. (2013). Dynamic reduction of query result sets for interactive visualization. 2013 IEEE International Conference on Big Data, 1-8.

Beeley, C. (2013). Web application development with R using Shiny. Birmingham, UK: Packt Publishing.

Data Description. (2015). Data Description: Statistical Analysis Software, Exploratory Data Analysis, Data Visualization, Multimedia Training, ActivStats, Data Desk, Datadesk, Fundraising Analytics, Predictive Analytics, Data Desk 7. Retrieved December 10, 2015, from http://www.datadesk.com/.

Dix, A., & Ellis, G. (2002). By chance: Enhancing interaction with large data sets through statistical sampling. Proceedings of the Working Conference on Advanced Visual Interfaces - AVI '02, 167-176.

Dougherty, J., Kohavi, R., & Sahami, M. (1995, July). Supervised and unsupervised discretization of continuous features. Machine learning: proceedings of the twelfth international conference, 12, 194-202.

Elmqvist, N. & Fekete, J. (2010). Hierarchical aggregation for information visualization: Overview, techniques, and design guidelines. IEEE Transactions on Visualization and Computer Graphics, 16(3), 439–454.

Fayyad, U. M., & Irani, K. B. (1992). On the handling of continuous-valued attributes in decision tree generation. Machine learning, 8(1), 87-102.

Hartwig, F., & Dearing, B. E. (1979). Exploratory data analysis. Beverly Hills: Sage Publications.

Ho, K. M., & Scott, P. D. (1997). Zeta: A global method for discretization of continuous variables. Proceedings of 3rd International Conference of Knowledge Discovery and Data Mining (KDD97). Newport Beach, CA.

IBM. (2016). Guided and automated analytics from the cloud. Retrieved May 11, 2016, from http://www.ibm.com/analytics/watson-analytics/us-en/.

KDnuggets. (2015, May). Analytics, Data Mining software used.

Kerber, R. (1992, July). Chimerge: Discretization of numeric attributes. Proceedings of the tenth national conference on Artificial intelligence(AAAI'92), 123-128.

Liu, Z., Jiang, B., & Heer, J. (2013). imMens: Real-time Visual Querying of Big Data. Computer Graphics Forum, 32(3pt4), 421-430.

Lohr, S. (2009). Sampling: design and analysis. Cengage Learning.

Muralikrishna, M., & Dewitt, D. J. (1988). Equi-depth multidimensional histograms. Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data - SIGMOD '88, 17(3), 28-36.

Poosala, V., Haas, P. J., Ioannidis, Y. E., & Shekita, E. J. (1996). Improved histograms for selectivity estimation of range predicates. ACM SIGMOD Record, 25(2), 294-305.

Poosala, V., & Ioannidis, Y. E. (1997, August). Selectivity estimation without the attribute value independence assumption. VLDB, 97, 486-495.

RStudio Inc. Shiny: Web Application Framework for R. Retrieved December 10, 2015, from http://CRAN.R-project.org/package=shiny.

SAS Institute. (2015). JMP Software. Retrieved May 10, 2016, from http://www.jmp.com/en_us/software.html.

Swayne, D. F., Lang, D. T., Buja, A., & Cook, D. (2003). GGobi: Evolving from XGobi into an extensible framework for interactive data visualization. Computational Statistics & Data Analysis, 43(4), 423-444.

Tableau Software. (2015). Tableau. Retrieved December 10, 2015, from http://www.tableau.com/.

Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley Pub.

Velleman, P. F., & Hoaglin, D. C. (1981). Applications, basics, and computing of exploratory data analysis. Boston, MA: Duxbury Press.

Wickham, H. (2013). Bin-summarise-smooth: a framework for visualising large data.

Yu, C. H., & Ds, P. (2001). Exploratory data analysis and Data visualization.
電子全文 Fulltext
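This electronic full text is licensed to users only for personal, non-profit retrieval, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid breaking the law.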
論文使用權限 Thesis access permission:自定論文開放時間 user define
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
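Availability information for printed theses is relatively complete from academic year 102 onward. To inquire about the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.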
開放時間 available 已公開 available
