國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,以句子為特徵擷取單位結合機器學習應用於近似複本偵測之方法 ,Detecting Near-Duplicate Documents using Sentence-Level Features and Machine Learning

論文名稱 Title	以句子為特徵擷取單位結合機器學習應用於近似複本偵測之方法 Detecting Near-Duplicate Documents using Sentence-Level Features and Machine Learning
系所名稱 Department	電機工程學系 Department of Electrical Engineering
畢業學年期 Year, semester	101 學年度第 1 學期 The fall semester of Academic Year 101	語文別 Language	中文 Chinese
學位類別 Degree	碩士 Master	頁數 Number of pages	81
研究生 Author	廖庭億 Ting-Yi Liao
指導教授 Advisor	李錫智 Shie-Jue Lee
召集委員 Convenor	黃宗傳 Tsung-Chuan Huang
口試委員 Advisory Committee	侯俊良, 歐陽振森 Chun-Liang Hou; Chen-Sen Ouyang
口試日期 Date of Exam	2012-10-15	繳交日期 Date of Submission	2012-10-23
關鍵字 Keywords	近似複本文件、門檻值、試錯法、支持向量機、相似度函數、虛字、特徵擷取 Near-duplicate, threshold, trial-and-error, support vector machine, feature selection, stop words, similarity function
統計 Statistics	本論文已被瀏覽 5793 次，被下載 737 次 The thesis/dissertation has been browsed 5793 times, has been downloaded 737 times.

中文摘要
如何有效的從大量的文件資料中找出近似複本文件一直是很重要的議題。在本論文中，我們提出一個新的方法，從大量資料中有效的偵測出近似複本文件，我們的方法分為三個主要的部分，特徵擷取、相似度計算和辨別是否為近似複本的依據。特徵擷取之部分，在特徵擷取前，文件需進行前處理，去掉符號、stop words…等等，再計算所得到的詞彙權重，並且選擇句子中較為重要的詞彙作為該句子的特徵，而所要偵測的文件經過這些轉換得到該文件的特徵集合。相似度計算的部分，根據兩篇文件的特徵向量由相似度函數來計算兩篇文件的近似程度。辨別兩篇文件是否為近似複本文件的關係，以支持向量機來訓練分類器。支持向量機為機器學習的一種策略，根據訓練樣本的相似度向量來訓練分類器，輸入分類器的資料為兩篇文件的相似度向量，訓練後得到一個分類器，用以分辨兩篇文件是否為近似複本關係。以句子為特徵擷取單位的方法比起以詞彙特徵擷取單位的方法能更有效的表現出文件的特色。而辨別是否為近似複本文件關係，在傳統的方法中，需要有門檻值作為辨別文件關係的分水嶺，例如我們設門檻值為0.5，若文件相似度值大於等於0.5，則為近似複本文件關係，若小於則反，但是實際上無法確定門檻值為0.5能夠準確地分辨文件關係，因此需要由試錯法來找出最佳的門檻值，此方法需要消耗許多的計算成本，並且沒有可信的證明所得到的門檻值為最佳的偵測結果，因此在論文中以支持向量機來訓練分類器的方法，以使用者所定義的訓練樣本來訓練分類器，因為以訓練樣本為依據更有可信度，最後可以從實驗中得知，在近似複本偵測中，我們所提出的方法更為有效率。
Abstract
Efficiently finding near-duplicate documents in a large collection has long been an important problem. In this thesis, we propose a new method for detecting near-duplicate documents in large-scale datasets. The method has three parts: feature selection, similarity measurement, and discriminant derivation. Before feature selection, each document is preprocessed to remove punctuation, stop words, and so on. We then compute a weight for each term and select the higher-weighted terms in each sentence as that sentence's feature; the features of all sentences form the document's feature set. For similarity measurement, a similarity function computes the degree of similarity between the feature sets of two documents. For discriminant derivation, a support vector machine (SVM), a supervised learning method, is trained on the similarity vectors of user-defined training samples; the resulting classifier decides whether two documents are near-duplicates. Sentence-level features represent the characteristics of a document more effectively than term-level features. Conventional methods instead rely on a threshold to separate near-duplicates from other document pairs (for example, treating a pair as near-duplicates when its similarity is at least 0.5) and must find that threshold by trial and error, which is computationally expensive and gives no guarantee that the chosen threshold is optimal. Learning the discriminant with an SVM avoids this trial-and-error effort. Experiments show that our method detects near-duplicate documents more effectively than other methods.
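The first two stages described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the record does not state the term-weighting scheme or the similarity function, so the sketch assumes term frequency within the document as the weight, Jaccard overlap as the similarity, and a tiny made-up stop-word list. The names `tokenize`, `sentence_features`, `jaccard`, and the parameter `top_n` are hypothetical. The SVM stage is omitted; in the described method, similarity values like the one computed here would be the classifier's input.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def tokenize(sentence):
    """Lowercase the sentence, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def sentence_features(document, top_n=2):
    """Return the document's feature set: one keyword set per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", document) if s.strip()]
    tokenized = [tokenize(s) for s in sentences]
    # Assumed weighting: term frequency within the whole document.
    weights = Counter(w for tokens in tokenized for w in tokens)
    features = set()
    for tokens in tokenized:
        # Rank distinct terms by weight (ties broken alphabetically).
        ranked = sorted(set(tokens), key=lambda w: (-weights[w], w))
        if ranked:
            features.add(frozenset(ranked[:top_n]))
    return features


def jaccard(a, b):
    """Assumed similarity function: Jaccard overlap of two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "The cat sat on the mat. Dogs chase the cat."
doc_b = "The cat sat on the mat! Birds sing loudly."
print(round(jaccard(sentence_features(doc_a), sentence_features(doc_b)), 3))  # 0.333
```

The two example documents share one of three distinct sentence-level features, which gives the similarity of 1/3. A conventional detector would compare this value against a hand-tuned threshold, whereas the abstract's approach hands such similarity values to a trained classifier.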

目次 Table of Contents
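Abstract (Chinese)
Abstract (English)
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Background
  1.2 Research Motivation
  1.3 Problem Definition
  1.4 Research Objectives
  1.5 Thesis Organization
Chapter 2 Literature Review
  2.1 Near-Duplicate Document Models
  2.2 Feature Extraction Methods
  2.3 Similarity Functions
  2.4 Methods for Identifying Near-Duplicate Relationships
Chapter 3 Methodology
  3.1 Sentence-Based Feature Extraction (Keywords Set Based on Sentence-n)
    3.1.1 Sentence Keyword Sets as Features
    3.1.2 Preparing Training Samples
    3.1.3 Machine Learning as the Near-Duplicate Decision Method
    3.1.4 Identifying Near-Duplicate Relationships
  3.2 System Workflow
  3.3 Worked Example
Chapter 4 Experimental Results and Analysis
  4.1 Experiment 1
  4.2 Experiment 2
    4.2.1 Conventional Feature Vectors
    4.2.2 Binary Feature Vectors
  4.3 Experiment 3: Comparison of Required Resources
Chapter 5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Research Directions
References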


電子全文 Fulltext
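This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: user-defined release date. Availability: Campus: available; Off-campus: available. etd-1023112-100138.pdf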
