National Sun Yat-sen University (國立中山大學), thesis/dissertation, Quality Analysis of User Reviews Using Discourse Structure (基於修辭結構的評論品質分析)

Title	Quality Analysis of User Reviews Using Discourse Structure (基於修辭結構的評論品質分析)
Department	Department of Information Management (資訊管理學系)
Year, semester	Spring semester of Academic Year 104 (ROC calendar; 2015-2016)	Language	English
Degree	Master	Number of pages	57
Author	Pei-chi Lo (羅珮綺)
Advisor	San-Yih Hwang (黃三益)
Convenor	Chih-Ping Wei (魏志平)
Advisory Committee	Yihuang Kang (康藝晃)
Date of Exam	2016-07-08	Date of Submission	2016-07-28
Keywords	Text Mining, Rhetorical Structure Theory, User Review, Quality Analysis, Natural Language Processing
Statistics	This thesis/dissertation has been browsed 6049 times and downloaded 153 times.

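Chinese Abstract (English translation)
The rise of Web 2.0 has set off a new wave of public expression: more and more people post their thoughts, opinions, and even reviews of specific products on the Internet. Amid this large volume of text, alongside genuine and valuable information there may also be malicious spam or meaningless text. How to judge the quality of text has therefore become an important problem and a popular topic in text mining. Previous studies have usually analyzed only the characteristics of words, but this approach can observe only combinations of words or numerical data such as overall review length, and it cannot capture the overall structure and content of a document. Rhetorical Structure Theory (RST) analyzes the relations between clauses and converts a document into a hierarchical tree structure, giving a comprehensive view of the document's structure. We add RST to the analysis so that the results of text quality analysis are more accurate and better reflect the characteristics of human language. We propose a text quality analysis process that uses NLTK (Natural Language Toolkit) for natural language processing, identifies factors that may affect text quality, and uses different classification models to analyze real user reviews from the Amazon shopping site. We compare the results before and after adding RST features; the experiments show that the proposed model improves significantly over models that use only word-level features.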
Abstract
The emergence of Web 2.0 has led to a new era of user-generated content. More and more people share their thoughts, opinions, and even reviews of specific products online. This textual data includes not only useful passages worth considering but also malicious comments and spam. Determining the quality of text has therefore become an important problem and a popular topic in text mining. Most existing studies analyze only the characteristics of tokens, which capture combinations of words or statistics such as document length but cannot represent the overall structure of a document. Rhetorical Structure Theory (RST) transforms a document into a hierarchical tree according to the relations between text spans. By incorporating RST, we obtain more accurate results when analyzing text quality. We propose a process for analyzing text quality that uses NLTK (Natural Language Toolkit) to process the text, identifies potential predictors, and applies different classification models to real user reviews from Amazon.com. We also conduct an evaluation comparing results before and after adding RST features. Experiments show that our model outperforms previous models that consider only the structure of word tokens.
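The idea of combining token-level statistics with RST-derived structure can be illustrated with a minimal sketch. This is not the thesis's implementation: the `RSTNode` class, the feature names, and the hand-built example tree are all assumptions made for illustration. A real pipeline would obtain the tree from a discourse parser and the tokens from NLTK; here whitespace splitting stands in for tokenization so the example needs only the Python standard library.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RSTNode:
    """Hypothetical RST tree node: a leaf holds one text span (EDU),
    an internal node holds a relation label and its child spans."""
    relation: Optional[str] = None
    text: Optional[str] = None
    children: List["RSTNode"] = field(default_factory=list)


def tree_depth(node: RSTNode) -> int:
    """Number of levels in the tree (a single leaf has depth 1)."""
    if not node.children:
        return 1
    return 1 + max(tree_depth(child) for child in node.children)


def leaf_texts(node: RSTNode) -> List[str]:
    """Collect the text spans at the leaves, left to right."""
    if not node.children:
        return [node.text or ""]
    spans: List[str] = []
    for child in node.children:
        spans.extend(leaf_texts(child))
    return spans


def relation_counts(node: RSTNode) -> Counter:
    """Count how often each relation label appears on internal nodes."""
    counts: Counter = Counter()
    if node.children:
        counts[node.relation or "unknown"] += 1
        for child in node.children:
            counts.update(relation_counts(child))
    return counts


def review_features(tree: RSTNode) -> Dict[str, int]:
    """Merge token-level and structure-level features into one dict."""
    spans = leaf_texts(tree)
    # Whitespace split is a stand-in for a real tokenizer (assumption).
    n_tokens = sum(len(span.split()) for span in spans)
    features = {
        "n_tokens": n_tokens,            # token-level feature
        "n_edus": len(spans),            # structure-level features
        "tree_depth": tree_depth(tree),
    }
    for relation, count in relation_counts(tree).items():
        features[f"rel_{relation}"] = count
    return features


# Hand-built example tree for a three-span review (illustrative only).
tree = RSTNode(relation="Elaboration", children=[
    RSTNode(text="I bought this camera last month"),
    RSTNode(relation="Contrast", children=[
        RSTNode(text="The battery lasts two days"),
        RSTNode(text="but the lens is slow"),
    ]),
])
features = review_features(tree)
print(features)
```

The resulting dictionary could be passed to any classifier that accepts numeric feature vectors; which features and models the thesis actually uses is described in its Chapters 4 and 5, not here.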

Table of Contents
Chapter 1 - Introduction
1.1 Background and Motivation
1.2 Research Purpose
1.3 Expected Results and Contribution
1.4 Thesis Organization
Chapter 2 - Literature Review
2.1 Rhetorical Structure
2.2 Rhetorical Structure Theory in the Domain of Text Mining
2.3 Quality Analysis
Chapter 3 - Problem Definition
3.1 Preliminaries
3.2 Problem Description
Chapter 4 - Research Approach
4.1 Research Architecture
4.2 Data Preparation
4.3 Natural Language Processing
4.4 Discourse Parsing
4.5 Feature Building
4.6 Quality Determination
Chapter 5 - Experiments
5.1 Dataset
5.2 Experimental Design
5.3 Experiment Result
5.3.1 NLP/Discourse Parsing Result
5.3.2 Prediction of Quality
5.3.3 RST Weighting Scheme
Chapter 6 - Conclusion
Chapter 7 - References


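Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: user-defined availability. Available: Campus: available; Off-campus: available. etd-0627116-214010.pdf
Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience. Available: available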

