國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,提升矩陣雲端運算效率及效能之研究-以情境感知資料為例,Using Cloud Computing to Improve the Efficiency and Effectiveness of Matrix Factorization : A Case Study of Context-aware Data Set

論文名稱 Title	提升矩陣雲端運算效率及效能之研究-以情境感知資料為例 Using Cloud Computing to Improve the Efficiency and Effectiveness of Matrix Factorization : A Case Study of Context-aware Data Set
系所名稱 Department	資訊管理學系 Department of Information Management
畢業學年期 Year, semester	105 學年度第 2 學期 The spring semester of Academic Year 105	語文別 Language	中文 Chinese
學位類別 Degree	碩士 Master	頁數 Number of pages	60
研究生 Author	王士豪 Shih-hao Wang
指導教授 Advisor	李偉柏 Wei-Po Lee
召集委員 Convenor	蔡玉娟 Yuh-Jiuan Tsay
口試委員 Advisory Committee	鄭炳強 Bing-Chiang Jeng
口試日期 Date of Exam	2017-01-23	繳交日期 Date of Submission	2017-03-21
關鍵字 Keywords	情境感知、推薦系統、協同式過濾、矩陣分解、雲端運算 Context-aware, Recommender System, Collaborative Filtering, Matrix Factorization, Cloud Computing
統計 Statistics	本論文已被瀏覽 5898 次，被下載 184 次 The thesis/dissertation has been browsed 5898 times, has been downloaded 184 times.

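Chinese Abstract (English translation)
Recommender systems today use two types of methods. The first uses similarity calculation to find people similar to the user and then makes recommendations based on those similar people; the second is matrix factorization. Although the literature shows that recommender systems based on matrix factorization are more accurate than those based on similarity calculation, matrix factorization has one critical drawback: it takes a large amount of time to infer the missing values. The idea of cloud computing is to connect multiple computers over a network, split a problem that would take a long time to solve into many small problems, and assign them to the computers so that each one handles only a small problem, which speeds up the computation. For these reasons, this study proposes an approach that uses Hadoop's HDFS together with Spark and matrix factorization, applying machine learning so that the computer can efficiently filter out the noise users do not need and users can quickly find what they are looking for. By using cloud computing, the approach successfully reduces the large amount of time that matrix factorization requires, effectively improving the efficiency and effectiveness of matrix factorization.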
Abstract
Collaborative recommender systems are typically built with one of two types of methods: similarity calculation or matrix factorization. Although matrix factorization performs better than similarity-based methods, it is computationally time-consuming. Cloud computing can shorten long-running computations by distributing the work across multiple machines. This study develops an approach that uses Hadoop's HDFS and Spark to improve the performance of matrix factorization. With the presented approach, the computation time for matrix factorization is substantially reduced.
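The record does not include the thesis's Spark code, so the following is only a minimal single-process sketch of the blocking idea behind distributed SGD (DSGD) as described by Gemulla et al. [19]. All names (`make_strata`, `sgd_block`, `dsgd`, `rmse`), the modulo-based block assignment, and the hyperparameter values are assumptions made for illustration, not the author's implementation. The rating matrix is cut into blocks, and the blocks are grouped into strata whose members share no users and no items, so a cluster could process each block of a stratum on a different worker.

```python
import random

def make_strata(num_blocks):
    """Group (row_block, col_block) pairs into strata.

    Within one stratum no two blocks share a row block or a column block,
    which is what makes them safe to update independently.
    """
    return [[(b, (b + shift) % num_blocks) for b in range(num_blocks)]
            for shift in range(num_blocks)]

def sgd_block(ratings, P, Q, lr, reg):
    """Run plain SGD over one block of (user, item, rating) triples."""
    for u, i, r in ratings:
        err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
        for k in range(len(P[u])):
            pu, qi = P[u][k], Q[i][k]  # keep old values for a consistent update
            P[u][k] += lr * (err * qi - reg * pu)
            Q[i][k] += lr * (err * pu - reg * qi)

def rmse(ratings, P, Q):
    """Root-mean-square error of the factor model on the known ratings."""
    se = sum((r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
             for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5

def dsgd(ratings, num_users, num_items, rank=2, num_blocks=2,
         epochs=200, lr=0.02, reg=0.01, seed=0):
    """Factorize the rating matrix stratum by stratum (hypothetical settings)."""
    rng = random.Random(seed)
    P = [[rng.uniform(0, 0.5) for _ in range(rank)] for _ in range(num_users)]
    Q = [[rng.uniform(0, 0.5) for _ in range(rank)] for _ in range(num_items)]
    # Assumption: users and items are assigned to blocks by simple modulo.
    blocks = {}
    for u, i, r in ratings:
        blocks.setdefault((u % num_blocks, i % num_blocks), []).append((u, i, r))
    for _ in range(epochs):
        for stratum in make_strata(num_blocks):
            # On a cluster each block here could go to a different worker;
            # this sketch simply processes them one after another.
            for key in stratum:
                sgd_block(blocks.get(key, []), P, Q, lr, reg)
    return P, Q
```

Calling `dsgd(ratings, num_users, num_items)` on a list of `(user, item, rating)` triples returns the two factor matrices, and `rmse` can then be used to compare the fit before and after training. In a Spark setting, the inner loop over a stratum is the part that would be distributed.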

目次 Table of Contents
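Thesis Approval Form
Abstract (Chinese)
Abstract
Table of Contents
Chapter 1 Introduction
  1.1 Background and Motivation
  1.2 Research Objectives
    1.2.1 Parallelizing the Latent Factor Model Recommender System
    1.2.2 Finding the Best Way to Use Cloud Computing to Improve Collaborative Recommendation
Chapter 2 Literature Review
  2.1 Cloud Computing Technologies
    2.1.1 Hadoop
    2.1.2 HDFS
    2.1.3 Spark
    2.1.4 Comparison of Spark and Hadoop
  2.2 Collaborative Filtering Techniques
    2.2.1 Latent Factor Model
  2.3 Distributed Collaborative Filtering
    2.3.1 DSGD
    2.3.2 FPSG
Chapter 3 Methodology
  3.1 Overall System Architecture
    3.1.1 Multimedia Dataset
    3.1.2 Data Preprocessing
    3.1.3 Put File Into HDFS
    3.1.4 Collaborative Filtering
    3.1.5 A Cloud Computing Recommendation System
  3.2 Datasets and Processing
    3.2.1 LDOS-CoMoDa Dataset
    3.2.2 Netflix Dataset
    3.2.3 Movielens
  3.3 Implementation of Distributed Collaborative Filtering
    3.3.1 SVD System Implementation
    3.3.2 Implementing DSGD with Spark
    3.3.3 Implementation of the Latent Factor Model Combined with Contextual Factors
Chapter 4 Experimental Results
  4.1 Experimental Environment
  4.2 Adjustments to DSGD
    4.2.1 Results of Redistributing and Reordering the Matrix
    4.2.2 Results of Fixing Workers and Then Shuffling
    4.2.3 Results of Splitting into Finer Blocks
  4.3 Effect of the Number of Nodes on Time and Error
  4.4 Comparison with FPSG
  4.5 Results on the Context-aware Dataset
  4.6 Comparison of Methods
Chapter 5 Contributions and Future Work
  5.1 Summary
  5.2 Contributions
  5.3 Future Work
Chapter 6 References

List of Tables
Table 2-1 Comparison of Hadoop and Spark
Table 3-1 CoMoDa field descriptions
Table 3-2 Netflix field descriptions
Table 3-3 Movielens field descriptions
Table 3-4 Comparison of the datasets
Table 4-1 Computation time of FPSG versus DSGD

List of Figures
Figure 2-1 Hadoop ecosystem [27]
Figure 2-2 <Key, Value> illustration
Figure 2-3 Map/Reduce illustration [28]
Figure 2-4 RDD workflow illustration [29]
Figure 2-5 Latent factor model illustration [13]
Figure 2-6 DSGD illustration [19]
Figure 2-7 Data traversal order in ordinary matrix factorization [21]
Figure 2-8 Visiting order in FPSG [21]
Figure 2-9 FPSG illustration [21]
Figure 3-1 System architecture
Figure 3-2 SGD algorithm
Figure 3-3 SVD algorithm
Figure 3-4 Comparison of SGD and SVD
Figure 3-5 DSGD algorithm
Figure 3-6 Concept of matrix shuffling [21]
Figure 3-7 Pseudocode for shuffling the matrix
Figure 3-8 Difference in error between fixed and random matrices
Figure 3-9 Pseudocode for SVD with context
Figure 3-10 Pseudocode for DSGD with context
Figure 4-1 Test environment 1
Figure 4-2 Test environment 2
Figure 4-3 Relationship between number of blocks and time
Figure 4-4 Computation time versus number of nodes on ml-1m
Figure 4-5 Computation time versus number of nodes on ml-10m
Figure 4-6 Computation time versus number of nodes on Netflix
Figure 4-7 Error on ml-1m
Figure 4-8 Error on ml-10m
Figure 4-9 Error on Netflix
Figure 4-10 Time per iteration of FPSG at the machine's limit
Figure 4-11 Computation time of FPSG versus the proposed method
Figure 4-12 Error on 1x Netflix
Figure 4-13 Error on 2x Netflix
Figure 4-14 Error on 5x Netflix
Figure 4-15 Error on 8x Netflix
Figure 4-16 Error on 9x Netflix
Figure 4-17 Error on 10x Netflix
Figure 4-18 Effect of adding context
Figure 4-19 Error comparison of the three algorithms
Figure 4-20 Execution time of the three algorithms
Figure 4-21 Execution time of the three algorithms on big data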

參考文獻 References
[1] Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61-70.
[2] Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
[3] Koren, Y., & Bell, R. (2011). Advances in collaborative filtering. In Recommender Systems Handbook. Springer US, 145-186.
[4] Mobasher, B., Jin, X., & Zhou, Y. (2004). Semantically enhanced collaborative filtering on the web. In Web Mining: From Web to Semantic Web. Springer Berlin Heidelberg, 57-76.
[5] Su, X., & Khoshgoftaar, T. M. (2009). A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 4.
[6] Wu, M. (2007). Collaborative filtering via ensembles of matrix factorizations. In Proceedings of KDD Cup and Workshop (Vol. 2007).
[7] Koren, Y. (2008). Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426-434.
[8] Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30-37.
[9] Guan, L., & Lu, H. (2012). Recommend items for user in social networking services with CF. In International Conference on Computer Science & Service System (CSSS). IEEE, 1347-1350.
[10] Schelter, S., Boden, C., & Markl, V. (2012). Scalable similarity-based neighborhood methods with MapReduce. In Proceedings of the Sixth ACM Conference on Recommender Systems. ACM, 163-170.
[11] Elsayed, T., Lin, J., & Oard, D. W. (2008). Pairwise document similarity in large collections with MapReduce. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. Association for Computational Linguistics, 265-268.
[12] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. USENIX Association, Berkeley, CA, USA, 10-10.
[13] 曾冠宇. (2014). 結合多情境因素及協同過濾方法之多媒體推薦. 中山大學資訊管理學系研究所學位論文, 1-95.
[14] LDOS-CoMoDa dataset, University of Ljubljana, July 2012. http://212.235.187.145/spletnastran/raziskave/um/comoda/comoda.php [Accessed on 17.09.2015]
[15] Netflix Prize. http://www.netflixprize.com/index.html [Accessed on 17.09.2016]
[16] Li, B., Tata, S., & Sismanis, Y. (2013). Sparkler: Supporting large-scale matrix factorization. In Proceedings of the 16th International Conference on Extending Database Technology. ACM, 625-636.
[17] https://code.facebook.com/posts/861999383875667/recommending-items-to-more-than-a-billion-people/ [Accessed on 17.09.2015]
[18] Difference between Apache Spark and Apache Hadoop. https://www.quora.com/What-is-the-difference-between-Apache-Spark-and-Apache-Hadoop-Map-Reduce [Accessed on 01.10.2015]
[19] Gemulla, R., Nijkamp, E., Haas, P. J., & Sismanis, Y. (2011). Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 69-77.
[20] Parthasarathy, N., & Tea-mangkornpan, P. P. Low-rank matrix factorization using distributed SGD in Spark.
[21] Chin, W. S., Zhuang, Y., Juan, Y. C., & Lin, C. J. (2015). A fast parallel stochastic gradient method for matrix factorization in shared memory systems. ACM Transactions on Intelligent Systems and Technology (TIST), 6(1), 2.
[22] Spark Job Scheduling. https://spark.apache.org/docs/latest/job-scheduling.html [Accessed on 13.12.2016]
[23] Process & Thread Management. http://www.csie.ntnu.edu.tw/~swanky/os/chap4.htm [Accessed on 10.12.2016]
[24] Running Deep Learning on Distributed GPUs With Spark. https://deeplearning4j.org/spark-gpus [Accessed on 01.01.2017]
[25] GPU Computing. http://www.nvidia.com.tw/object/what-is-gpu-computing-tw.html [Accessed on 01.01.2017]
[26] Movielens. http://grouplens.org/datasets/movielens/ [Accessed on 10.01.2017]
[27] Hadoop Ecosystem. http://www.colfax-intl.com/nd/clusters/hadoop.aspx [Accessed on 25.01.2017]
[28] WorkCount. http://sls.weco.net/CollectiveNote20/MR [Accessed on 25.01.2017]
[29] Spark RDD. https://www.analyticsvidhya.com/blog/2016/09/comprehensive-introduction-to-apache-spark-rdds-dataframes-using-pyspark/ [Accessed on 21.10.2016]

電子全文 Fulltext
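This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as not to violate the law.
Thesis access permission: user-defined availability date
Available: Campus: available; Off-campus: available
etd-0221117-203746.pdf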
紙本論文 Printed copies
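Availability information for printed theses is relatively complete from Academic Year 102 onward. To look up availability information for printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available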

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS