國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,個人化與情境感知式文件分群,Personalized and Context-aware Document Clustering

論文名稱 Title	個人化與情境感知式文件分群 Personalized and Context-aware Document Clustering
系所名稱 Department	資訊管理學系 Department of Information Management
畢業學年期 Year, semester	95 學年度第 2 學期 The spring semester of Academic Year 95	語文別 Language	英文 English
學位類別 Degree	博士 Ph.D.	頁數 Number of pages	91
研究生 Author	楊錦生 Chin-Sheng Yang
指導教授 Advisor	魏志平 Chih-Ping Wei
召集委員 Convenor	簡立峰 Lee-Feng Chien
口試委員 Advisory Committee	李昇暾, 曾新穆, 劉敦仁, 楊傳智 Sheng-Tun Li; Vincent S. Tseng; Duen-Ren Liu; Christopher C. Yang
口試日期 Date of Exam	2007-07-02	繳交日期 Date of Submission	2007-07-15
關鍵字 Keywords	文件分群、個人化文件分群、情境感知式文件分群、文件探勘、知識管理 Context-aware document clustering, Personalized document clustering, Text mining, Document clustering, Knowledge management
統計 Statistics	本論文已被瀏覽 5713 次，被下載 1776 次 The thesis/dissertation has been browsed 5713 times and downloaded 1776 times.

中文摘要
為管理日益增加的文件資料，組織與個人通常採用類別(或類別階層)的概念來整理文件，以促成文件管理之工作與協助後續文件檢索與取用之需求。文件分群是一種由分群者的個人偏好主導的意識行為，其反應的是分群者對哪些類別是適當的與文件該如何歸類的主觀認知，而此主觀意識行為會隨著分群者當時所處的情境之不同而有所差異。因此良好的文件分群技術需將個人偏好或所處情境等因素納入考量。然而，現有的文件分群技術大多僅依文件的內容來進行分群，是以無法符合個人化或情境式分群的要求。為滿足使用者對個人化與情境式分群的需求，本論文提出三個個人化或情境感知式的分群技術，並實際評估所提三個技術的效能。首先，為克服PEC技術在部份分群(Partial Clustering)過小時面臨的分群效能快速下降的問題，本文修改PEC技術並提出了Collaborative Filtering–based personalized document Clustering (CFC)技術，CFC技術採用協同推薦的概念，藉由考慮與某一使用者偏好相似的其他使用者之部份分群結果，來擴大使用者的部份分群。其次，為支援情境式文件分群，本文提出一個Context-Aware document-Clustering (CAC)技術，CAC技術考慮使用者在某一情境下的分群偏好(由一組Anchoring Terms來表達此一偏好)，並利用搜尋引擎來檢索網際網路上的文件，以建構一個統計式辭典，達成情境式分群的需求。最後，CAC技術也可能面臨Anchoring Terms過少而效能快速下滑的問題，因此我們一樣採用協同推薦的概念來改善CAC技術，並提出了Collaborative Filtering-based Context-Aware document Clustering (CF-CAC)技術。根據本論文的實證評估結果，提出的三個文件分群技術在支援個人化或情境式分群時，都有相當不錯的分群效能且優於現存的文件分群技術。
Abstract
To manage the ever-increasing volume of documents, organizations and individuals typically organize documents into categories (or category hierarchies) to facilitate document management and support subsequent document retrieval and access. Document clustering is an intentional act that should reflect an individual’s preferences with regard to the semantic coherency or relevant categorization of documents and should conform to the context of the target task under investigation. Thus, effective document clustering techniques need to take into account a user’s categorization context defined by or relevant to the target task under consideration. However, existing document clustering techniques generally rely on pure content-based analysis and therefore cannot facilitate personalized or context-aware document clustering. In response, we design, implement, and empirically evaluate three document clustering techniques capable of facilitating personalized or contextual document clustering. First, we extend an existing document clustering technique, the partial-clustering-based personalized document-clustering (PEC) approach, and propose the Collaborative Filtering-based personalized document-Clustering (CFC) technique to overcome the problem of small-sized partial clusterings encountered by the PEC technique. In particular, the CFC technique expands a user’s partial clustering based on the partial clusterings of other users with similar categorization preferences. Second, to support contextual document clustering, we design and implement a Context-Aware document-Clustering (CAC) technique that takes into consideration a user’s categorization preference (expressed as a set of anchoring terms) relevant to the context of a target task, together with a statistical-based thesaurus constructed from the World Wide Web (WWW) via a search engine. Third, in response to the problem of a small set of anchoring terms, which can greatly degrade the effectiveness of the CAC technique, we extend CAC and propose a Collaborative Filtering-based Context-Aware document Clustering (CF-CAC) technique. Our empirical evaluation results suggest that the proposed CFC, CAC, and CF-CAC techniques support the need for personalized and contextual document clustering better than their benchmark techniques do.
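The collaborative expansion idea behind CFC can be illustrated with a small sketch. The abstract does not specify the similarity measure or the expansion rule, so everything below is an assumption made for illustration: a partial clustering is modeled as a mapping from document ID to a user-chosen cluster label, user similarity is a Rand-style pairwise agreement over shared documents, and the hypothetical function names (`pair_agreement`, `expand_partial_clustering`) and thresholds (`min_sim`, `min_vote`) are not taken from the thesis.

```python
from itertools import combinations


def pair_agreement(a, b):
    """Rand-style agreement between two partial clusterings (doc -> label),
    computed over the documents that both users have placed. (Assumed measure.)"""
    shared = sorted(set(a) & set(b))
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 0.0
    # A pair agrees when both users group it together or both keep it apart.
    agree = sum((a[x] == a[y]) == (b[x] == b[y]) for x, y in pairs)
    return agree / len(pairs)


def expand_partial_clustering(target, others, min_sim=0.5, min_vote=0.5):
    """Return a copy of `target` extended with documents borrowed from
    sufficiently similar users. (Assumed expansion rule, for illustration only.)"""
    votes = {}  # doc -> {target label: accumulated similarity weight}
    for other in others:
        sim = pair_agreement(target, other)
        if sim < min_sim:
            continue  # ignore users whose preferences differ too much
        for doc, other_label in other.items():
            if doc in target:
                continue
            # Each document the neighbour groups with `doc` and that the target
            # user has already placed casts a vote for the target's label.
            for mate, mate_label in other.items():
                if mate_label == other_label and mate in target:
                    bucket = votes.setdefault(doc, {})
                    bucket[target[mate]] = bucket.get(target[mate], 0.0) + sim
    expanded = dict(target)
    for doc, bucket in votes.items():
        label, weight = max(bucket.items(), key=lambda kv: kv[1])
        if weight / sum(bucket.values()) >= min_vote:
            expanded[doc] = label
    return expanded


# Toy data: `similar` agrees with the target on every shared pair,
# `dissimilar` agrees on only one pair in three and is filtered out.
target = {"d1": "A", "d2": "A", "d3": "B"}
similar = {"d1": "x", "d2": "x", "d3": "y", "d4": "x", "d5": "y"}
dissimilar = {"d1": "p", "d2": "q", "d3": "p", "d6": "p"}

print(expand_partial_clustering(target, [similar, dissimilar]))
# {'d1': 'A', 'd2': 'A', 'd3': 'B', 'd4': 'A', 'd5': 'B'}
```

In this toy run the target user's three-document partial clustering grows to five documents, which is the effect the abstract attributes to CFC; the expanded clustering would then feed the later feature construction and clustering phases. The same neighbour-weighting pattern could plausibly be applied to sets of anchoring terms for CF-CAC, but that too is an assumption.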

目次 Table of Contents
CHAPTER 1 INTRODUCTION
  1.1 Research Background
  1.2 Research Motivation
  1.3 Research Objectives
  1.4 Organization of the Dissertation
CHAPTER 2 LITERATURE REVIEW
  2.1 Content-based Document Clustering Techniques
  2.2 Non-content-based and Hybrid Document Clustering Approaches
  2.3 Partial-Clustering-Based Personalized Document-Clustering Technique
CHAPTER 3 DESIGN OF COLLABORATIVE FILTERING-BASED DOCUMENT CLUSTERING (CFC) TECHNIQUE
  3.1 Collaborative Clustering-Expansion Phase
  3.2 Feature Construction Phase
  3.3 Document Representation Phase
  3.4 Clustering Phase
CHAPTER 4 EVALUATION OF COLLABORATIVE FILTERING-BASED DOCUMENT-CLUSTERING (CFC) TECHNIQUE
  4.1 Data Collection
  4.2 Evaluation Criteria and Procedure
  4.3 Benchmark Techniques
  4.4 Parameter Tuning
  4.5 Comparative Evaluation
  4.6 Summary
CHAPTER 5 DESIGN OF CONTEXT-AWARE DOCUMENT-CLUSTERING (CAC) TECHNIQUE
  5.1 Feature Extraction and Selection Phase
  5.2 Anchoring Term Expansion Phase
  5.3 Document Representation Phase
  5.4 Clustering Phase
CHAPTER 6 EVALUATION OF CONTEXT-AWARE DOCUMENT CLUSTERING (CAC) TECHNIQUE
  6.1 Data Collection
  6.2 Evaluation Criteria
  6.3 Parameters Tuning
  6.4 Comparative Evaluation
  6.5 Effects of Different Approaches for Statistical-based Thesaurus Construction
  6.6 Analysis of Temporal Stability of CAC
  6.7 Summary
CHAPTER 7 DESIGN OF COLLABORATIVE FILTERING-BASED CONTEXT-AWARE DOCUMENT CLUSTERING (CF-CAC) TECHNIQUE
  7.1 Collaborative Context Expansion Phase
  7.2 Feature Extraction and Selection Phase
  7.3 Anchoring Term Expansion Phase
  7.4 Document Representation
  7.5 Clustering Phase
CHAPTER 8 EVALUATION OF COLLABORATIVE FILTERING-BASED CONTEXT-AWARE DOCUMENT CLUSTERING (CF-CAC) TECHNIQUE
  8.1 Data Collection
  8.2 Evaluation Criteria and Procedure
  8.3 Parameter Tuning
  8.4 Comparative Evaluation
  8.5 Summary
CHAPTER 9 CONCLUSION
  9.1 Summary and Research Contributions
  9.2 Future Research Directions
REFERENCES
APPENDIX A

電子全文 Fulltext
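This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission：校內校外完全公開 unrestricted
開放時間 Available：校內 Campus：已公開 available；校外 Off-campus：已公開 available
etd-0715107-184139.pdf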
紙本論文 Printed copies
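Availability information for printed theses is relatively complete from Academic Year 102 onward. To inquire about the availability of printed theses from Academic Year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available：已公開 available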

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452, Thesis Format Review Team │ Development and operations: Knowledge Innovation Division, Office of Library and Information Services