National Sun Yat-sen University (國立中山大學), thesis/dissertation: Discovering Discussion Activity Flows in an On-line Forum Using Data Mining Techniques (使用資料探勘技術挖掘線上論壇討論活動型態)

Title	Discovering Discussion Activity Flows in an On-line Forum Using Data Mining Techniques (使用資料探勘技術挖掘線上論壇討論活動型態)
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 96	Language	English
Degree	Ph.D.	Number of pages	136
Author	Lu-shih Hsieh (謝祿適)
Advisor	Fu-Ren Lin (林福仁)
Convenor	San-Yih Hwang (黃三益)
Advisory Committee	Chao-Min Chiu (邱兆民); Sheng-Tun Li (李昇暾); Chih-Ping Wei (魏志平)
Date of Exam	2008-06-13	Date of Submission	2008-07-22
Keywords	Support Vector Machine (SVM), Content Management System (CMS), Text classification, Learning Management System (LMS), Decision tree, Data mining, Text mining, Hidden Markov Model (HMM)
Statistics	The thesis/dissertation has been browsed 5736 times and downloaded 1231 times.

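中文摘要 Chinese Abstract
With the arrival of the Internet era, more and more school courses use a course management system (CMS) or learning management system (LMS) for teaching or to support teaching. To help students learn effectively online, teachers must know which discussion activities students engage in on online forums and provide assistance when necessary. The spread of online teaching systems has increased the workload teachers carry in participating in online forums; to reduce that load, designing automated tools that help teachers understand discussion activities has become an important task. In response to this need, this research proposes an automated tool, which we call FAFT (Forum Activity Flow Tracer), that helps teachers track the flow of discussion activities in online forums within a CMS or LMS. This research uses data mining and text mining techniques to develop the FAFT system. By function, FAFT is divided into an activity classification (AC) subsystem and an activity flow discovery (AFD) subsystem. In general, a forum post can be classified as one of six activity types (announcement, questioning, clarification, interpretation, conflict, and assertion) or as other. The AC subsystem uses data (text) mining techniques to classify the activity of each post automatically. An empirical study using forum data from a senior high school earth science course shows that the AC subsystem can classify discussion activities effectively. The AFD subsystem uses a hidden Markov model (HMM) to discover discussion activity flows. Because an HMM can be conveniently presented as a diagram, it helps teachers understand students' discussion activities more easily. The HMM's property as a predictive model can also be used to distinguish whether a student discussion activity flow is a cognitive presence flow or a social presence flow. Such predictions help teachers take corresponding measures to guide students' learning activities. The empirical results show that the AFD subsystem can effectively distinguish students' activity flows. We therefore believe that the proposed FAFT system can help teachers track discussion activity flows in online forums.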
Abstract
In the Internet era, more and more courses are taught through a course management system (CMS) or learning management system (LMS). In an asynchronous virtual learning environment, an instructor needs to be aware of the progress of forum discussions and may intervene when necessary to facilitate students' learning. This research proposes a discussion forum activity flow tracking system, called FAFT (Forum Activity Flow Tracer), to automatically monitor the discussion activity flow of threaded forum postings in a CMS/LMS. As CMS/LMS platforms become more popular for supporting learning activities, the proposed FAFT can help instructors identify students' interaction types in discussion forums. FAFT adopts data and text mining techniques to discover the patterns of forum discussion activity flows, which instructors can use to facilitate online learning activities. FAFT consists of two subsystems: activity classification (AC) and activity flow discovery (AFD). A posting can be perceived as an announcement, questioning, clarification, interpretation, conflict, or assertion. AC adopts a cascade model to classify the activity types of posts in a discussion thread. An empirical evaluation on a repository of postings from senior high school earth science on-line courses shows that AC can effectively facilitate the coding process and that the cascade model can deal with the imbalanced distribution of discussion postings. AFD adopts a hidden Markov model (HMM) to discover the activity flows. A discussion activity flow can be presented as an HMM diagram, which an instructor can use to predict which type of discussion activity flow a thread may follow. Empirical results of the HMM on an online forum for a senior high school earth science subject show that FAFT can effectively predict the type of a discussion activity flow.
Thus, the proposed FAFT can be embedded in a course management system to automatically predict the activity flow type of a discussion thread and, in turn, reduce teachers' workload in managing online discussion forums.
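The prediction step described for AFD can be illustrated with a minimal sketch. It assumes one HMM per flow type and assigns a thread to the type whose model gives its sequence of activity types the highest likelihood, computed with the scaled forward algorithm. The flow-type labels ("cognitive", "social"), the two hidden states, and every probability below are hypothetical values chosen for illustration; they are not the models or parameters learned in the thesis.

```python
import math

# Observation symbols: the six activity types named in the abstract.
ACTIVITIES = ["announcement", "questioning", "clarification",
              "interpretation", "conflict", "assertion"]

# Hypothetical two-state HMMs, one per flow type (illustrative numbers only).
MODELS = {
    "cognitive": dict(
        start=[0.7, 0.3],
        trans=[[0.6, 0.4], [0.3, 0.7]],
        emit=[[0.05, 0.45, 0.30, 0.10, 0.05, 0.05],
              [0.05, 0.10, 0.15, 0.35, 0.20, 0.15]],
    ),
    "social": dict(
        start=[0.8, 0.2],
        trans=[[0.8, 0.2], [0.5, 0.5]],
        emit=[[0.60, 0.10, 0.05, 0.05, 0.05, 0.15],
              [0.30, 0.20, 0.10, 0.10, 0.10, 0.20]],
    ),
}


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: return log P(obs | model)."""
    n_states = len(start)
    # alpha[s] is proportional to P(o_1..o_t, state_t = s).
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate one step through the transitions, then weight by emission.
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_prob


def predict_flow_type(thread):
    """Return the best-scoring flow type and the per-model log-likelihoods."""
    obs = [ACTIVITIES.index(activity) for activity in thread]
    scores = {name: log_likelihood(obs, **model) for name, model in MODELS.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    thread = ["questioning", "clarification", "interpretation", "conflict"]
    label, scores = predict_flow_type(thread)
    print(label, scores)
```

In practice the model parameters would be estimated from coded threads (for example with Baum-Welch) rather than written by hand, and the thread's activity sequence would come from the AC subsystem's output.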

目次 Table of Contents
Abstract
Keywords
Abstract (in Chinese)
Keywords (in Chinese)
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
    1.1 Motivation
    1.2 The Proposed Approach Meeting the Need
    1.3 Organization of the Thesis
Chapter 2 Background
    2.1 Learning Activity Flow
        2.1.1 An activity flow example
        2.1.2 Learning in a computer media communication (CMC) environment
    2.2 Learning Management System (LMS)
    2.3 Text Mining Process
    2.4 Classification in Text Mining
        2.4.1 Decision tree classifiers
        2.4.2 Support vector machines (SVM) classifiers
        2.4.3 Imbalanced data distribution issue
    2.5 Mining Forum Activity Flows
Chapter 3 The Architecture of Forum Activity Flow Tracer (FAFT)
    3.1 FAFT Architecture
    3.2 Activity Classification (AC) subsystem
        3.2.1 AC implementation
    3.3 Activity Flow Discovery (AFD) subsystem
        3.3.1 AFD implementation
Chapter 4 Evaluation Design
    4.1 Data Set and Activity Type Coding
    4.2 Evaluation Criteria
        4.2.1 Evaluation criteria for AC
        4.2.2 Evaluation criteria for AFD
Chapter 5 Evaluation Results and Discussion
    5.1 Evaluation Results of AC Subsystem
        5.1.1 Results of decision tree classifiers
        5.1.2 Results of SVM classifiers
        5.1.3 Results of the cascade model classifier
        5.1.4 Discussion of AC subsystem
    5.2 Evaluation Results of AFD Subsystem
        5.2.1 Results of the discovery and prediction of activity flow types
        5.2.2 Discussion of the Results of AFD subsystem
Chapter 6 Conclusion and Research Limitations
    6.1 Conclusion
    6.2 Research Limitations
References
Appendix A. Examples of Forum Discussion Posts and Corresponding Activity Types
Appendix B. Evaluation Results of AC


電子全文 Fulltext
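This electronic full text is licensed only for personal, non-profit searching, reading, and printing for academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without permission. Thesis access permission: withheld for one year, then public on and off campus. Available: Campus: available; Off-campus: available. etd-0722108-155145.pdf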
紙本論文 Printed copies
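Public-access information for printed theses is relatively complete from Academic Year 102 onward. To look up public-access information for printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available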

