National Sun Yat-sen University, thesis/dissertation: Preference-Anchored Document Clustering Technique for Supporting Effective Knowledge and Document Management

Title	以偏好為導向之文件分群技術 Preference-Anchored Document Clustering Technique for Supporting Effective Knowledge and Document Management
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 93 (ROC calendar, i.e. 2004-2005)	Language	English
Degree	Master	Number of pages	48
Author	王歆 Shin Wang
Advisor	魏志平 Chih-Ping Wei
Convenor	吳昇暾 Sheng-Tun Li
Advisory Committee	盧文祥 Wen-Hsiang Lu
Date of Exam	2005-07-28	Date of Submission	2005-08-03
Keywords	Document clustering, Hierarchical agglomerative clustering (HAC), Knowledge map, Preference-based document clustering, Text mining
Statistics	This thesis has been viewed 5,845 times and downloaded 1,784 times.

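Chinese Abstract (English translation)
As the number of documents grows rapidly, managing a knowledge repository effectively is essential to sharing, reusing, and assimilating knowledge. Knowledge maps are often adopted to help users access the knowledge held in a repository, and a knowledge map is usually built through document clustering. However, current document clustering techniques cannot adapt to individuals' differing preferences, nor can they produce knowledge maps based on different perspectives. This thesis therefore proposes a preference-anchored document clustering technique that incorporates a user's preferential perspective to generate a knowledge map specific to that preference. The empirical results show that, at high cluster precision, the proposed method performs considerably better than the traditional content-based document clustering technique. In addition, compared with the Oracle categorizer built with Chi-square, the proposed method also performs better at high cluster precision. Overall, the empirical results indicate that the proposed method is feasible and has strong potential for further development.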
Abstract
Effective management of the proliferating volume of documents within a knowledge repository is vital to knowledge sharing, reuse, and assimilation. To facilitate access to the documents in a knowledge repository, a prevailing approach is to organize them with a knowledge map, and document clustering techniques are typically employed to produce such maps. However, existing document clustering techniques are not tailored to individuals’ preferences and are therefore unable to generate knowledge maps from different preferential perspectives. In response, we propose the Preference-Anchored Document Clustering (PAC) technique, which takes a user’s categorization preference (represented as a list of anchoring terms) into consideration to generate a knowledge map (or a set of document clusters) from that specific preferential perspective. Our empirical evaluation shows that the proposed technique outperforms the traditional content-based document clustering technique in the high cluster precision range. Furthermore, benchmarked against Oracle Categorizer, the proposed technique also achieves better clustering effectiveness in the high cluster precision range. Overall, our evaluation results demonstrate the feasibility and potential superiority of the proposed PAC technique.
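The idea of clustering the same documents from different preferential perspectives can be illustrated with a minimal sketch. It is not the thesis's PAC implementation: it assumes each document is represented only by the counts of the user's anchoring terms and that clusters are formed by average-link hierarchical agglomerative clustering (HAC appears in the keywords). The thesaurus construction and preference specialization steps listed in the table of contents are omitted, and all function names, documents, and anchoring terms are hypothetical.

```python
import math
from collections import Counter


def anchor_vector(text, anchors):
    """Represent a document by how often each anchoring term occurs in it."""
    counts = Counter(text.lower().split())
    return [counts[a] for a in anchors]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def hac(vectors, k):
    """Average-link agglomerative clustering until k clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None  # (similarity, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                sim = sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
    return clusters


def cluster_by(docs, anchors, k=2):
    """Cluster documents from the perspective given by the anchoring terms."""
    return hac([anchor_vector(d, anchors) for d in docs], k)


# Hypothetical documents that can be grouped by topic or by region.
docs = [
    "stock market report from japan",
    "stock market report from france",
    "football match report from japan",
    "football match report from france",
]

by_topic = cluster_by(docs, ["stock", "football"])   # topic perspective
by_region = cluster_by(docs, ["japan", "france"])    # region perspective
print(by_topic, by_region)
```

With the topic anchors the two stock documents fall into one cluster and the two football documents into the other; with the region anchors the same four documents are regrouped by country. That is the sense in which one corpus can yield different knowledge maps for different preferences.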

Table of Contents
CHAPTER 1 INTRODUCTION
  1.1 Background
  1.2 Research Motivation and Objectives
  1.3 Organization of the Thesis
CHAPTER 2 LITERATURE REVIEW
  2.1 Content-based Document Clustering Techniques
  2.2 Non-content-based and Hybrid Document Clustering Approaches
CHAPTER 3 DESIGN OF PREFERENCE-ANCHORED DOCUMENT CLUSTERING (PAC) TECHNIQUE
  3.1 Statistical-based Thesaurus Construction
  3.2 Preference Specialization
  3.3 Document Representation
  3.4 Clustering
CHAPTER 4 EMPIRICAL EVALUATION
  4.1 Evaluation Design
    4.1.1 Collection of Document Corpus
    4.1.2 Evaluation Criteria
    4.1.3 Evaluation Procedure
  4.2 Parameter Tuning Experiments and Results
    4.2.1 Tuning for Traditional Content-based Document Clustering Technique
    4.2.2 Tuning for the PAC Technique
  4.3 Comparative Evaluation Results
  4.4 Sensitivity to Size of Anchoring Terms
CHAPTER 5 CONCLUSION AND FUTURE RESEARCH DIRECTIONS
REFERENCES

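Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: available on campus immediately; available off campus after one year
Available: Campus: available; Off-campus: available
etd-0803105-150923.pdf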
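Printed copies
Availability information for printed theses is relatively complete from Academic Year 102 onward. For printed theses from Academic Year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available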
