國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,應用雲端運算於排序支援向量機之研究 ,Application of MapReduce to Ranking SVM for Large-Scale Datasets

論文名稱 Title	應用雲端運算於排序支援向量機之研究 Application of MapReduce to Ranking SVM for Large-Scale Datasets
系所名稱 Department	電機工程學系 Department of Electrical Engineering
畢業學年期 Year, semester	98 學年度第 2 學期 The spring semester of Academic Year 98	語文別 Language	中文 Chinese
學位類別 Degree	碩士 Master	頁數 Number of pages	87
研究生 Author	胡書嫻 Su-Hsien Hu
指導教授 Advisor	李錫智 Shie-Jue Lee
召集委員 Convenor	吳志宏 Chih-Hung Wu
口試委員 Advisory Committee	歐陽振森, 蔡賢亮, 賴智錦 Chen-Sen Ouyang; Hsien-Leing Tsai; Chih-Chin Lai
口試日期 Date of Exam	2010-07-21	繳交日期 Date of Submission	2010-08-10
關鍵字 Keywords	搜索引擎、雲端運算、MapReduce、Ranking SVM search engines, cloud computing, MapReduce, Ranking SVM
統計 Statistics	本論文已被瀏覽 5797 次，被下載 44 次 The thesis/dissertation has been browsed 5797 times, has been downloaded 44 times.

中文摘要
近年來，搜索引擎藉由機器學習技術與訓練查詢（training queries）來建立評等模型（ranking model），然後透過評等模型來進行網頁排序，其中訓練查詢來自過去使用者的查詢與點擊次數。目前，基於機器學習的排序學習方法非常多，其中以Raniking SVM最受矚目，然而此方法會因為大量的訓練資料而造成運算量過大。為了解決此問題，我們將評等支援向量機演算法修改成分散式的架構，並以雲端運算環境中的MapReduce來實現分散式演算法。MapReduce是由Google所提出，其隱藏平行處理的繁複細節、高容錯、資料分布、負載平衡，讓使用者可以簡單地使用大量的電腦來處理巨量資料集（large-scale datasets）。Ranking SVM主要分成訓練與測試兩個階段，在訓練階段，我們需要求解Ranking SVM的對偶子問題，而測試階段，則是藉由訓練後的評等模型對測試查詢作評分排名。因此，我們設計不同的MapReduce架構來滿足不同階段的需求。實驗結果證明，此分散式評等支援向量機可以有效地減少運算所耗費的時間。
Abstract
Nowadays, search engines are more relying on machine learning techniques to construct a model, using past user queries and clicks as training data, for ranking web pages. There are several learning to rank methods for information retrieval, and among them ranking support vector machine (SVM) attracts a lot of attention in the information retrieval community. One difficulty with Ranking SVM is that the computation cost is very high for constructing a ranking model due to the huge number of training data pairs when the size of training dataset is large. We adopt the MapReduce programming model to solve this difficulty. MapReduce is a distributed computing framework introduced by Google and is commonly adopted in cloud computing centers. It can deal easily with large-scale datasets using a large number of computers. Moreover, it hides the messy details of parallelization, fault-tolerance, data distribution, and load balancing from the programmer and allows him/her to focus on only the underlying problem to be solved. In this paper, we apply MapReduce to Ranking SVM for processing large-scale datasets. We specify the Map function to solve the dual sub problems involved in Ranking SVM and the Reduce function to aggregate all the outputs having the same intermediate key from Map functions of distributed machines. Experimental results show efficiency improvement on ranking SVM by our proposed approach.

目次 Table of Contents
摘要 iv Abstract v 圖目錄 viii 表目錄 ix 第一章簡介 10 1.1 研究背景 10 1.2 問題定義 12 1.3 研究目的 14 1.4 論文架構 15 第二章文獻探討 16 2.1 雲端運算 16 2.2 Hadoop 17 2.2.1 Hadoop Distributed File System (HDFS) 18 2.2.2 MapReduce 20 2.3 虛擬主機 22 2.4 排序學習 23 2.4.1 逐點式方法（Pointwise Approach） 25 2.4.2 成對式方法（Pairwise Approach） 26 2.4.3 序列式方法（Listwise Approach） 28 2.5 Ranking SVM 28 2.6 評估工具 30 2.6.1 正規化遞減累積效益（NDCG） 30 第三章研究方法 33 3.1 研究動機 33 3.2 方法概述 34 3.3 方法架構 36 3.3.1 架構一 36 3.3.2 架構二 39 3.3.3 架構三 42 3.3.4 架構四 45 第四章實驗結果與分析 50 4.1 實驗資料 50 4.2 特徵選取 54 4.3 實驗環境架設 59 4.4 實驗結果與分析 59 4.4.1 實驗一 60 4.4.2 實驗二 63 4.4.3 實驗三 65 4.4.4 實驗四 66 4.4.5 實驗五 68 4.4.6 實驗六 72 4.4.7 實驗七 76 第五章結果與未來展望 78 5.1 結論 78 5.2 未來研究方向 78 參考文獻 80 附錄1 82 Hadoop 0.20.2 叢集安裝 82

參考文獻 References
[1]. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addision Wesley, 1999. [2]. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender, “Learning to Rank Using Gradient Descant,” in Proceedings of the 22nd International Conference on Machine Learning, pp. 89-96, 2005. [3]. Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, H. Li, “Learning to Rank: From Pairwise Approach to Listwise Approach,” in Proceedings of the 24th International Conference on MachineLearning , pp. 129-136, 2007. [4]. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation, pp.10-10, 2004. [5]. Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, “An Efficient Boosting Algorithm for Combining Preferences,” Journal of Machine Learning Research, Vol. 4, pp. 933-969, 2003. [6]. X.-B. Geng, T.-Y. Liu, T. Qin, H. Li, and H.-Y. Shum, “Query-Dependent Ranking Using K-Nearest Neighbor,” in Proceedings of the 31s Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115-122, 2008. [7]. S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google File System,” in Proceedings of the 19th ACM Symposium on Operating Systems Principles, 2003. [8]. T. Joachims, “Optimizing Search Engines Using Clickthrough Data,” ACM Conference on Knowledge Discovery and Data Mining, pp. 133-142, 2002. [9]. J. Kleinbreg, “Authoritative Sources in a Hyperlinked Environment,” in Proceeding of the 9th ACM–SIAM SODA, 1998. [10]. P. Li, C.J.C. Burges, and Q.Wu, “McRank: Learning to Rank Using Multiple Classification and Gradient Boosting,” Neural Information Processing Systems(NIPS), pp. 897-904, 2007. [11]. T.-Y. Liu, T. Qin, J. Xu, W. Xiong and H. Li, “LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval,” in Proceedings of the Learning to Rank workshop in the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007. [12]. T.Y. Liu, “Learning to Rank for Information Retrieval,” Tutorial at 17th International World-Wide Web Conference (WWW), 2008. [13]. R. Nallapati, “Discriminant Models for Information Retrieval,” in Proceedings of the 27th Annual International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 64-71, 2004. [14]. L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank Citation Ranking: Bringing Order to the Web,” Technical Report, Stanford University, 1998. [15]. V. C. Raykar, R. Duraiswami, and B. Krishnapuram, “A Fast Algorithm for Learning a Ranking Function from Large-Scale Data Sets,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 7, pp. 1158-1170, July 2008. [16]. S. E. Robertson, “Overview of the Okapi Projects,” Journal of Documentation, Vol. 53, No. 1, pp. 3-7, 1997. [17]. J.E. Smith and R. Nair, "The Architecture of Virtual Machines," IEEE Computer Society, pp. 32-38, 2005. [18]. M.-F. Tsai, T.-Y. Liu, T. Qin, H.-H. Chen, W.-Y. Ma, “FRank: A Ranking Method with Fidelity Loss,” in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 383-390, 2007. [19]. T. White, “Hadoop: The Definitive Guide, “America: O’Reilly Media, 2009. [20]. J. Xu and H.Li, “AdaRank: A Boosting Algorithm for Information Retrieval,” in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 391-398, 2007. [21]. Y. Yue, T. Finley, F. Radlinski, and T. Joachims, “A Support Vector Method for Optimizing Average Precision,” in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 271-278, 2007. [22]. C. Zhai and J. Lafferty, “A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval,” in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 334-342, 2001. [23]. http://hadoop.apache.org/hdfs/ [24]. http://research.microsoft.com/en-us/um/beijing/projects/letor/yahoodata.aspx [25]. 陳瀅﹝2010﹞：雲端策略：雲端運算與虛擬化技術。台北市：天下雜誌。 [26]. 王鵬﹝2010﹞：雲端運算的關鍵技術與應用實例。台北市：加魁資訊。

電子全文 Fulltext
本電子全文僅授權使用者為學術研究之目的，進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定，切勿任意重製、散佈、改作、轉貼、播送，以免觸法。論文使用權限 Thesis access permission：校內公開，校外永不公開 restricted 開放時間 Available：校內 Campus：已公開 available 校外 Off-campus：永不公開 not available 您的 IP(校外) 位址是 3.142.197.198 論文開放下載的時間是校外不公開 Your IP address is 3.142.197.198 This thesis will be available to you on Indicate off-campus access is not available.
紙本論文 Printed copies
紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊，請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。開放時間 available 已公開 available

QR Code

國立中山大學圖書與資訊處 │ 諮詢服務：2452 論文審查小組 │ 服務信箱 │ 系統開發維運：圖資處知識創新組

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS