國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,偏好引導的情境式文件分群技術：字詞關係及統計式字典之影響,Preference-Anchored Document Clustering Technique: Effects of Term Relationships and Thesaurus

論文名稱 Title	偏好引導的情境式文件分群技術：字詞關係及統計式字典之影響 Preference-Anchored Document Clustering Technique: Effects of Term Relationships and Thesaurus
系所名稱 Department	資訊管理學系 Department of Information Management
畢業學年期 Year, semester	94 學年度第 2 學期 The spring semester of Academic Year 94	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	50
研究生 Author	林浩翔 Hao-hsiang Lin
指導教授 Advisor	魏志平 Chih-Ping Wei
召集委員 Convenor	楊傳智 Christopher C. Yang
口試委員 Advisory Committee	胡仁華 Paul J. H. Hu
口試日期 Date of Exam	2006-07-18	繳交日期 Date of Submission	2006-08-30
關鍵字 Keywords	文字探勘、階層式文件分群、文件分群、個人化文件分群、以偏好為基礎的文件分群技術 Personalized document clustering, Document clustering, Preference-based document clustering, Text mining, Hierarchical agglomerative clustering (HAC)
統計 Statistics	本論文已被瀏覽 5709 次，被下載 6 次 The thesis/dissertation has been browsed 5709 times, has been downloaded 6 times.

中文摘要
根據情境式文件分群理論，個人的文件分群行為不單純只是考量文件的屬性(包含內容)，也取決於個人在什麼樣任務和情境之下進行分群。因此，有效的文件分群技術必須能夠考量使用者不同的偏好觀點，進而產生特定偏好的分群結果。偏好引導的情境式文件分群技術(PAC)支援以偏好為基礎的文件分群，並且考量使用者的分群偏好產生特定偏好的分群結果。而本文主要針對PAC探討兩個研究議題:(1)不同的字詞關係是否可以增進PAC的效能以及(2)不同的語料庫所建構出來的統計式字典是否可以增進PAC的效能。實證的結果顯示，在完整的群集標註詞(Anchoring terms)前提下，本文所提出來的方法和PAC具有相同的分群效能，然而隨著群集標註詞(Anchoring terms)的減少，並沒有辦法到達和PAC相同的分群效能，甚至產生較差的分群效能。實證的結果也顯示使用較大的語料庫所建構出來的統計式字典沒有辦法增進PAC的分群效能。
Abstract
According to the context theory of classification, an individual’s document-clustering behavior depends not only on the attributes (including contents) of documents but also on who is performing the task and in what context. Effective document-clustering techniques therefore need to take users’ categorization preferences into account so that they can generate document clusters from different preferential perspectives. The Preference-Anchored Document Clustering (PAC) technique was proposed to support preference-based document clustering. Specifically, PAC takes a user’s categorization preference into consideration and generates a set of document clusters from that specific preferential perspective. In this study, we investigate two research questions concerning the PAC technique. The first is “whether the incorporation of broader-term expansion (i.e., the PAC2 technique proposed in this study) will improve the effectiveness of preference-based document clustering,” and the second is “whether the use of a statistical-based thesaurus constructed from a larger document corpus will improve the effectiveness of preference-based document clustering.” Our empirical results show that, compared with PAC, the proposed PAC2 technique neither improves nor degrades the effectiveness of preference-based document clustering when the complete set of anchoring terms is used. However, when only a partial set of anchoring terms is provided, PAC2 fails to improve effectiveness and can even degrade it. As to the second research question, our empirical results suggest that the use of a statistical-based thesaurus constructed from a larger document corpus (i.e., the ACM corpus consisting of 14,729 documents) does not improve the effectiveness of PAC or PAC2 for preference-based document clustering.
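As an illustration only, the following Python sketch shows one way anchoring terms could be expanded through a co-occurrence thesaurus before hierarchical agglomerative clustering. It is not the PAC or PAC2 procedure from the thesis: the association weight (document co-occurrence divided by the anchoring term's document frequency), the 0.6 threshold, the cosine measure, the average-link criterion, the toy corpus, and all function names are assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def build_thesaurus(docs):
    """Assumed association weight: w[a][b] = df(a and b) / df(a)."""
    df, co = {}, {}
    for doc in docs:
        terms = set(doc)
        for t in terms:
            df[t] = df.get(t, 0) + 1
        for a, b in combinations(sorted(terms), 2):
            co[(a, b)] = co.get((a, b), 0) + 1
    weights = {t: {} for t in df}
    for (a, b), n in co.items():
        weights[a][b] = n / df[a]
        weights[b][a] = n / df[b]
    return weights


def expand(anchor, thesaurus, threshold):
    """Return {term: weight} for an anchoring term and its strongly related terms."""
    related = {t: w for t, w in thesaurus.get(anchor, {}).items() if w >= threshold}
    related[anchor] = 1.0
    return related


def represent(doc, expansions):
    """One dimension per anchoring term: weighted count of matching tokens."""
    return [sum(exp.get(tok, 0.0) for tok in doc) for exp in expansions]


def cosine(u, v):
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def hac(vectors, k):
    """Average-link agglomerative clustering down to k clusters of document indices."""
    clusters = [[i] for i in range(len(vectors))]

    def sim(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Merge the most similar pair; a < b, so deleting b keeps index a valid.
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


def preference_clusters(docs, anchors, k, threshold=0.6):
    thesaurus = build_thesaurus(docs)
    expansions = [expand(a, thesaurus, threshold) for a in anchors]
    return hac([represent(d, expansions) for d in docs], k)


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["clustering", "hierarchical"],
    ["hierarchical", "agglomerative"],
    ["retrieval", "query"],
    ["query", "ranking"],
]
print(preference_clusters(docs, ["clustering", "retrieval"], k=2))
# → [[0, 1], [2, 3]]
```

In this toy run the second document contains neither anchoring term, yet it is grouped with the first because "hierarchical" is pulled in when "clustering" is expanded. Without expansion its vector would be all zeros, which hints at why the completeness of the anchoring-term set matters in the experiments summarized above.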

目次 Table of Contents
CHAPTER 1 INTRODUCTION
  1.1 Background
  1.2 Research Motivation and Objectives
  1.3 Organization of the Thesis
CHAPTER 2 LITERATURE REVIEW
  2.1 Content-based Document Clustering Techniques
  2.2 Preference-Anchored Document Clustering (PAC) Technique
CHAPTER 3 DESIGN OF PREFERENCE-ANCHORED DOCUMENT CLUSTERING (PAC2) TECHNIQUE
  3.1 Statistical-based Thesaurus Construction
  3.2 Preference Expansion
  3.3 Document Representation
  3.4 Clustering
CHAPTER 4 EMPIRICAL EVALUATIONS
  4.1 Collection of Document Corpora
  4.2 Collection of Users' Preferred Clustering
  4.3 Evaluation Criteria and Procedure
  4.4 Experiment 1: Effectiveness of PAC2 vs. PAC
    4.4.1 Tuning of Traditional Content-based Document-Clustering Technique
    4.4.2 Tuning of PAC and PAC2 Techniques
    4.4.3 Comparative Evaluation
  4.5 Experiment 2: Effects of Thesaurus on PAC and PAC2
    4.5.1 Tuning of PAC and PAC2 Techniques
    4.5.2 Comparative Evaluation
CHAPTER 5 CONCLUSION AND FUTURE RESEARCH DIRECTIONS
REFERENCES

電子全文 Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: open on campus after one year, never open off campus. Availability: Campus: available; Off-campus: not available.
紙本論文 Printed copies
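Public information on printed theses is relatively complete from Academic Year 102 onward. To look up public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available.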
