National Sun Yat-sen University, thesis/dissertation, Clustering Articles in a Literature Digital Library Based on Content and Usage

Title	結合文件內容和使用紀錄的文獻數位圖書館文件分群技術 (Clustering Articles in a Literature Digital Library Based on Content and Usage)
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 92 (ROC calendar, 2003-2004)	Language	English
Degree	Master's	Number of pages	52
Author	丁康迪 (Kang-Di Ting)
Advisor	黃三益 (San-Yih Hwang)
Convenor	魏志平 (Chih-Ping Wei)
Advisory Committee	林福仁 (Fu-Ren Lin)
Date of Exam	2004-07-26	Date of Submission	2004-08-10
Keywords	Digital library, Document categorization, Usage clustering, Document clustering, Content-based clustering
Statistics	The thesis/dissertation has been browsed 5925 times and downloaded 4625 times.

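Abstract (translated from the Chinese)
Literature digital libraries store literature in digital form, so researchers can conveniently search for articles over the network. When searching, however, users are often confronted with a large volume of results at once and cannot find the articles they actually want. To provide a more efficient search service, many systems offer a browsing interface in the hope of reducing the number of clicks users must make. In this study, we aim to build a browsing interface equipped with a subject directory, so that users can search for articles more conveniently and efficiently. In prior work, document classification and document clustering have both been applicable to this problem, but classification approaches mostly require expert assistance and a predefined set of subject categories or a directory. This study therefore tries to use the system's usage log to stand in for expert classification, reducing the cost of expert labor while still producing a browsing interface that meets users' needs. We propose two methods that combine document content with usage logs (document categorization-based and document clustering-based). These methods and a traditional content-based method are each evaluated by computing entropy against the experts' manual classification. The results show that, overall, the content-based method agrees more closely with the expert classification.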
Abstract
Literature digital libraries are among the most important resources for preserving the assets of civilization. To provide more effective and efficient information search, many systems are equipped with a browsing interface that aims to ease the article searching task. A browsing interface is associated with a subject directory, which guides users to articles that meet their information need. A subject directory contains a set (or a hierarchy) of subject categories, each containing a number of similar articles. How to group articles in a literature digital library is the theme of this thesis. Previous work used either document classification or document clustering approaches to dispatch articles into a set of article clusters based on their content. We observed that articles that meet a single user's information need may not necessarily fall in a single cluster. In this thesis, we propose to make use of both Web logs and article content in clustering articles. We propose two hybrid approaches, namely a document categorization based method and a document clustering based method. These alternatives were compared to other content-based methods. We found that the document categorization based method effectively reduces the number of required click-throughs at the expense of a slight increase in entropy, which measures the content heterogeneity of each generated cluster.
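The abstract uses entropy as a measure of the content heterogeneity of each generated cluster. The following is a minimal sketch of how such a measure could be computed, assuming the standard definition from document-clustering evaluation (Shannon entropy of the manual category labels inside each cluster, averaged with weights proportional to cluster size). The exact formula used in the thesis is not stated in this record, and the function names and sample data below are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Shannon entropy (in bits) of the category labels inside one cluster."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def weighted_entropy(clusters):
    """Size-weighted average entropy over all clusters (assumes at least one article)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)


# Hypothetical data: each inner list holds the manually assigned category
# of every article placed in one automatically generated cluster.
clusters = [
    ["db", "db", "db", "db"],  # homogeneous cluster
    ["ir", "ir", "db", "db"],  # evenly mixed cluster
]

pure = cluster_entropy(clusters[0])
mixed = cluster_entropy(clusters[1])
overall = weighted_entropy(clusters)
```

Under this definition a lower value means the cluster's articles fall mostly into one manual category, so a slight increase in entropy corresponds to slightly more mixed clusters.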

Table of Contents
Chapter 1 Introduction
  1.1 Research Background
  1.2 Research Motivations and Objectives
  1.3 Data Description
  1.4 Problem Description
  1.5 Thesis organization
Chapter 2 Literature review
  2.1 Converting an article to a set of vectors
  2.2 Keyword Selection
    2.2.1 CHI Square Statistics
    2.2.2 Information Gains
  2.3 Web Usage Clustering
    2.3.1 Data preparation for Web usage log
    2.3.2 Usage Clustering
      2.3.2.1 Based on frequent itemsets
      2.3.2.2 Based on Hyperclique Patterns
  2.4 Content-based Clustering
  2.5 Text Categorization
    2.5.1 Probabilistic Classifiers
    2.5.2 Neural Network Classifiers
    2.5.3 Support Vector Machines
Chapter 3 Content-based and hybrid approach
  3.1 Content-based clustering
    3.1.1 Article Clique Hypergraph Partitioning
    3.1.2 K-means
  3.2 Hybrid approach
    3.2.1 Document categorization based hybrid approach
    3.2.2 Document clustering based hybrid approach
Chapter 4 Performance Evaluation
  4.1 Performance Metrics
  4.2 Experimental Results
    4.2.1 Comparing usage coherence of various clustering
    4.2.2 Comparing automatic clusters with manual clusters
Chapter 5 Conclusions
Reference


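Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please observe the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid breaking the law. Thesis access permission: available on campus immediately; available off campus after one year. Available: Campus: available; Off-campus: available. etd-0810104-153712.pdf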
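Printed copies
Availability information for printed theses is relatively complete from Academic Year 102 onward. For availability information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available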

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS