National Sun Yat-sen University (國立中山大學), thesis/dissertation: Time Series Analysis with Unsupervised Learning (結合無監督式學習的時間序列分析)

Title	Time Series Analysis with Unsupervised Learning (結合無監督式學習的時間序列分析)
Department	Department of Applied Mathematics
Year, semester	Spring semester of Academic Year 106
Language	English
Degree	Master
Number of pages	95
Author	Ke-jie Chen (陳可捷)
Advisor	Mei-Hui Guo (郭美惠)
Convenor	Mong-Na Lo (羅夢娜)
Advisory Committee	Chung Chang (張中); Liang-Ching Lin (林良靖); Shih-Feng Huang (黃士峰)
Date of Exam	2018-07-05
Date of Submission	2018-07-18
Keywords	principal component analysis, long short-term memory network, ARFIMA, B-spline, K-means clustering, hierarchical clustering, non-negative matrix factorization, SARIMA
Statistics	The thesis/dissertation has been browsed 5688 times and downloaded 1 time.

Abstract
This study consists of two parts, each combining unsupervised learning with time series analysis. The first part considers time series whose periodicity and trend change under external influences, making them unstable. We use unsupervised learning methods (hierarchical clustering analysis (HCA) and K-means clustering) to cluster the time series, and fit a B-spline trend curve to each cluster. After removing the trend, we fit an ARFIMA model to the residuals. We also apply a long short-term memory (LSTM) network to fit and forecast both the original data and the residuals, and compare the forecasts of the two models. In the empirical study, we analyze power demand data from a Dutch research facility covering the whole of 1997. The results show that ARFIMA forecasts holiday power demand better than the LSTM model, whereas the LSTM model performs better for weekday power demand. The second part addresses anomaly detection in multivariate, correlated time series. An example is the daily average measurements of 54 ozone precursors from the photochemical assessment monitoring stations of the Environmental Protection Administration (Taiwan), for which detecting anomalous pollution sources in real time is an important issue. We use unsupervised learning methods (principal component analysis (PCA) and non-negative matrix factorization (NMF)) to reduce the dimension of the multivariate time series and extract its main components. Because the time series of the first two principal components exhibit periodic effects, we fit SARIMA models to them and detect anomalies. The results show that NMF explains more of the variation than PCA and offers better interpretability with respect to the precursor measurements.
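As an illustration of the residual-modelling step, the sketch below shows the fractional differencing operator (1 - B)^d that underlies ARFIMA (Hosking, 1981). Its coefficients follow the recursion pi_0 = 1, pi_k = pi_{k-1} (k - 1 - d) / k. This is a minimal sketch under assumptions, not the thesis's implementation: the function names `frac_diff_weights` and `frac_diff`, the truncation of the filter at the start of the sample, the sample residual values, and the choice d = 0.4 are all hypothetical. The abstract does not state how d was estimated or which software was used.

```python
def frac_diff_weights(d, n_weights):
    """Return pi_0 .. pi_{n_weights-1}, the coefficients of (1 - B)^d.

    Hypothetical helper: uses the binomial-series recursion
    pi_0 = 1, pi_k = pi_{k-1} * (k - 1 - d) / k.
    """
    weights = [1.0]
    for k in range(1, n_weights):
        weights.append(weights[-1] * (k - 1 - d) / k)
    return weights


def frac_diff(series, d):
    """Apply the filter truncated at the sample start:
    y_t = sum_{k=0..t} pi_k * x_{t-k}.
    """
    weights = frac_diff_weights(d, len(series))
    return [
        sum(weights[k] * series[t - k] for k in range(t + 1))
        for t in range(len(series))
    ]


if __name__ == "__main__":
    # Made-up detrended residuals, for illustration only.
    residuals = [0.8, 1.1, 0.9, 1.4, 1.2, 1.0]
    print(frac_diff_weights(0.5, 4))
    print(frac_diff(residuals, 0.4))
```

With d = 1 the weights reduce to 1, -1, 0, 0, ..., so the filter becomes the ordinary first difference. A fractional d between 0 and 0.5 gives slowly decaying weights, which is how ARFIMA captures long memory in the detrended residuals.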

Table of Contents
Thesis Approval Form
Thesis Public-Release Authorization
Acknowledgements
Abstract (Chinese)
Abstract
1 Introduction
  1.1 Background
  1.2 Goals
  1.3 Thesis structure
2 Methodology
  2.1 Clustering
    2.1.1 Hierarchical clustering
    2.1.2 K-means clustering
  2.2 Classification
    2.2.1 Decision tree
    2.2.2 LDA
    2.2.3 Logistic regression
  2.3 Dimension reduction
    2.3.1 PCA
    2.3.2 NMF
3 Time series models
  3.1 Conventional time series models
    3.1.1 ARFIMA
    3.1.2 SARIMA
  3.2 LSTM model
4 Empirical studies
  4.1 Power demand data
    4.1.1 Data introduction
    4.1.2 Clustering and classification
    4.1.3 B-spline trend fitting
    4.1.4 Time series models
    4.1.5 Prediction with LSTM model
    4.1.6 Summary
  4.2 The O3 precursor data
    4.2.1 Data introduction and pre-processing
    4.2.2 Dimension reduction
    4.2.3 Time series models
    4.2.4 Feature extraction
    4.2.5 Summary
5 Conclusions and future work
6 References
7 Appendix
  7.1 Appendix A: Some common linkage criteria of HCA
  7.2 Appendix B: The details of PCA
  7.3 Appendix C: The dendrograms of power demand data using HCA with some linkages
  7.4 Appendix D: The tables for section 4.1
  7.5 Appendix E: The figures for section 4.1
  7.6 Appendix F: The tables for section 4.2
  7.7 Appendix G: The figures for section 4.2


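Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without permission, to avoid breaking the law.
Thesis access permission: user-defined release date
Available: Campus: available; Off-campus: available
etd-0617118-151942.pdf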
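Printed copies
Public information on printed copies is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available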


Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS