國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,植基於文字分類與分群的文本相似度量測,Measuring Document Similarity Based on Text Classification and Clustering

論文名稱 Title	植基於文字分類與分群的文本相似度量測 Measuring Document Similarity Based on Text Classification and Clustering
系所名稱 Department	電機工程學系 Department of Electrical Engineering
畢業學年期 Year, semester	101 學年度第 2 學期 The spring semester of Academic Year 101	語文別 Language	英文 English
學位類別 Degree	博士 Ph.D.	頁數 Number of pages	139
研究生 Author	林永申 Yung-Shen Lin
指導教授 Advisor	李錫智 Shie-Jue Lee
召集委員 Convenor	賴智錦 Chih-Chin Lai
口試委員 Advisory Committee	歐陽振森, 蔡賢亮, 侯俊良, 劉志峰 Chen-Sen Ouyang; Hsien-Liang Tsai; Chun-Liang Hou; Chih-Feng Liu
口試日期 Date of Exam	2013-04-18	繳交日期 Date of Submission	2013-04-25
關鍵字 Keywords	近似複本文件、特徵擷取、分類器、分群演算法、文件分類、熵、文件分群、準確度、相似度函數 similarity function, feature selection, entropy, document clustering, document classification, near-duplicate document, accuracy, classifiers, clustering algorithms
統計 Statistics	本論文已被瀏覽 5727 次，被下載 350 次 The thesis/dissertation has been browsed 5727 times, has been downloaded 350 times.

中文摘要
本論文提出一新的文本相似度量測演算法，並藉由應用於多個文本資料庫，來驗證所提方法之可行性。其次，探討近似複本文件的相似度量測，進而提出新的偵測方法。文本資料要進行處理時，通常擷取足以涵蓋文本內容的資訊當作特徵值，再比對各個特徵的相似程度，並以此做為量測相似度的依據。因此，考量兩份文本之間的相似度，可以經由判斷所擷取的特徵在兩份文本中有無出現的情況、各個特徵相似的程度，以及相似特徵的數量多寡等等因素，進而提出最佳的量測方法。在本論文中，我們提出一個以文字分類及分群技術為基礎的相似度量測法，同時設計出一個有效且可行的近似複本文本偵測法。文本處理為目前資訊檢索、資料探勘及網路搜尋引擎等應用上很重要的技術。文本資料進行分析處理時，通常採用足以代表文本的特徵值來進行運算。這些特徵可以是單一字母、單字、整句乃至整段文字，而目前最常被使用的即是袋字模型(bag-of-words model)，此模型以文本中各個特徵出現的頻率，建立一代表文本的向量，再以此向量分析文本資料。代表文本的特徵向量，其中的向量值可以是被選為特徵項在文本中出現的次數、特徵項出現次數與全部特徵項出現次數總和的比例或是特徵項在單一文本出現頻率與同時在全部文本出現頻率的組合比。本論文針對任一特徵項出現在兩份比對文本的情況，區分為特徵項同時出現在比對的兩份文本中、特徵項僅出現在其中一份文本，以及兩份文本均無此特徵項等三種情況進行探討，並提出一新的相似度量測方式。此量測是以一對稱的量測方式，可以具體就前述三種情況建立特徵向量進行比對，進而得到兩份文本的相似程度值。此方法並實際應用於單標籤分類、多標籤分類、k-means 相似分群及聚合式階層分群等多個文本資料的應用，演算結果證明新發表的方法確實可行。在本論文中亦以前述之文本相似度量測方法為基礎，設計出一偵測近似複本文本的演算法。現今電子文本氾濫的網路時代，任一文本可以經由臉書(Facebook)、部落格(Blog) 及電子郵件等等媒介的增加、刪改或轉發等方式而形成許多近似複本文件，而搜尋引擎根據使用者所下達的搜尋項檢索資料，傳送出檢索的結果。因為是以搜尋項為特徵項進行檢索，因此，所得結果必然包含許多重複或近似的文本，若能以有效的方法判別出這些近似複本文，必然可以降低檢索結果中的重複文本，連帶提升搜尋效能，所以，如何有效偵測出近似複本文件已是時下一大課題。本論文提出能從大量資料中有效偵測出近似複本文件的方法。本方法有別於現已發表的文章大多以選擇字詞為特徵項，而是以句子作為文本的特徵項。採用句子為特徵擷取單位的方法比起以字詞特徵作為擷取單位的方法，更能有效表現出文本的特色。而進行相似度量測以及推導適用的判別分類器時，在傳統的方法是以門檻值作為判別文本關係的分水嶺，再以試誤法找出最佳門檻值，耗時且成效不佳。我們的方法則是改採支持向量機來訓練建立分類器，由於依照使用者所定義的訓練樣本來訓練分類器，可以使結果更具有可信度，因此，本論文依此架構出一有效的方法，最後並以實驗驗證。從實驗過程中得知，我們所提出的方法確實更有效率。
Abstract
This thesis proposes a novel measure of similarity between documents and extends it to gauge the similarity between two sets of documents. It also applies similarity measurement to the detection of near-duplicate documents. Measuring the similarity between two documents is an important operation in text processing. Because document collections are large, it is important to select appropriate features to represent the documents. Document analysis usually extracts information sufficient to cover a document's content and uses it as the document's features. These features may be single letters, words, sentences, or even whole paragraphs, and the vector-space model is used to represent them. A similarity measure between two documents may therefore consider whether a feature appears in both documents, the degree of similarity between features, and the number of similar features. To compute the similarity between two documents with respect to a feature, the proposed measure takes three cases into account: (a) the feature appears in both documents, (b) the feature appears in only one document, and (c) the feature appears in neither document. The measure is then extended to gauge the similarity between two sets of documents.
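The three cases named in the abstract can be made concrete with a small sketch. This record does not give the thesis's actual formula, so the scoring rule below (a Gaussian-shaped reward when a feature is present in both documents, a fixed penalty when it is present in only one, and no contribution when it is absent from both) and the names `three_case_similarity`, `penalty`, and `sigma` are assumptions made purely for illustration.

```python
import math


def three_case_similarity(d1, d2, penalty=1.0, sigma=1.0):
    """Illustrative similarity in [0, 1] between two equal-length,
    non-negative feature vectors. Hypothetical scoring, not the thesis's."""
    if len(d1) != len(d2):
        raise ValueError("feature vectors must have the same length")
    score = 0.0
    counted = 0  # features present in at least one of the two documents
    for x, y in zip(d1, d2):
        if x > 0 and y > 0:
            # Case (a): present in both; closer values earn a larger reward.
            score += 0.5 * (1.0 + math.exp(-((x - y) / sigma) ** 2))
            counted += 1
        elif x > 0 or y > 0:
            # Case (b): present in only one document; apply a fixed penalty.
            score -= penalty
            counted += 1
        # Case (c): absent from both; the feature is ignored entirely.
    if counted == 0:
        return 0.0  # convention chosen here for two empty documents
    raw = score / counted              # lies in [-penalty, 1]
    return (raw + penalty) / (1.0 + penalty)  # rescaled to [0, 1]
```

Under these assumptions the function is symmetric in its two arguments, returns 1.0 for identical non-empty vectors, returns 0.0 when no feature is shared, and is unaffected by appending features that are absent from both documents.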
The effectiveness of the measure is evaluated on several real-world data sets for text classification and clustering problems, and the results are better than those achieved by other measures. As a further application of similarity measurement, we present a novel method for detecting near-duplicates in a large collection of documents. Identifying near-duplicate documents is extremely important in the Internet era: a search engine that can identify them effectively can reduce the number of duplicate documents it retrieves and thereby improve search performance. Our method involves three major parts: feature selection, similarity measure, and discriminant derivation. To find near-duplicates of an input document, each sentence of the input document is fetched and preprocessed, the weight of each term is calculated, and the heavily weighted terms are selected as the feature of the sentence. As a result, the input document is turned into a set of such features. A similarity measure is then applied to compute the degree of similarity between the input document and each document in the given collection. A support vector machine (SVM) is adopted to learn a discriminant function from a set of training patterns, and this function is then used to determine, from the similarity degree between them, whether a document is a near-duplicate of the input document. The sentence-level features we adopt can better reveal the characteristics of a document. In addition, learning the discriminant function with an SVM avoids the trial-and-error effort required by conventional methods. Experimental results show that our method is effective in near-duplicate document detection.
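The first two parts of the pipeline described in the abstract (sentence-level feature selection and a similarity degree between documents) can be sketched with the standard library alone. The record does not specify the term-weighting scheme or the similarity function, so this sketch assumes document-level term frequency as the weight and Jaccard overlap as the similarity; the stopword list and the names `sentence_features`, `jaccard`, and `top_k` are likewise hypothetical. The SVM step is deliberately left out.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the thesis's list).
STOPWORDS = frozenset({"a", "an", "the", "of", "to", "is", "in", "and"})


def sentence_features(text, top_k=2):
    """Turn a document into a set of sentence-level features.

    Each sentence is reduced to its top_k most heavily weighted terms.
    The weight used here is the term's frequency in the whole document,
    which stands in for whatever weighting the thesis actually uses.
    """
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"[a-z0-9]+", s) if w not in STOPWORDS]
        for s in sentences
    ]
    weights = Counter(w for sent in tokenized for w in sent)
    features = set()
    for sent in tokenized:
        if not sent:
            continue
        # Heaviest terms first; ties are broken alphabetically.
        ranked = sorted(set(sent), key=lambda w: (-weights[w], w))
        features.add(tuple(sorted(ranked[:top_k])))
    return features


def jaccard(a, b):
    """Similarity degree between two feature sets (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "The cat sat on the mat. The dog chased the cat."
forwarded = original + " Forwarded by Bob."
degree = jaccard(sentence_features(original), sentence_features(forwarded))
```

In the method the abstract describes, similarity degrees like `degree` would be the input to the learned discriminant: an SVM trained on labeled document pairs replaces a hand-tuned threshold when deciding whether a document is a near-duplicate.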

目次 Table of Contents
書名頁 Title Page	i
致謝辭 Acknowledgments	ii
論文口試委員審定書 Thesis Committee Approval Form	iii
授權書 Authorization	v
中文摘要 Chinese Abstract	vii
Abstract	ix
Contents	xii
List of Figures	xv
List of Tables	xvii
1 Introduction	1
  1.1 Text Processing	1
  1.2 Similarity Measure	2
  1.3 Detecting Near-Duplicates	6
  1.4 Overview	8
2 Similarity Measure	9
  2.1 Background	9
  2.2 Related Works	11
    2.2.1 Distance Measure	11
    2.2.2 Clustering Algorithm	14
      2.2.2.1 K-Means Clustering Algorithm	14
      2.2.2.2 HAC Algorithm	15
    2.2.3 Classification	19
      2.2.3.1 K-NN Single-Label Document Classification	20
      2.2.3.2 Multi-Label Document Classification	20
3 Detecting Near-Duplicate Documents	23
  3.1 Background	23
  3.2 Related Works	25
    3.2.1 Document Analysis	25
    3.2.2 Similarity Function	26
4 A Novel Similarity Measure	29
  4.1 Properties	29
  4.2 Similarity Between Two Documents	32
  4.3 Similarity Between Two Document Sets	38
5 A Novel of Detecting Near-Duplicate Documents	46
  5.1 Feature Sets Based on Sentences	47
  5.2 Preparing Training Patterns	50
  5.3 Discriminant Derivation	51
  5.4 Testing Phase	53
  5.5 System Operation	53
  5.6 Example	55
6 Experimental Results for SMTP	59
  6.1 Applications	60
  6.2 Data Sets	61
  6.3 Classification	64
    6.3.1 Single-Label Document Classification	64
    6.3.2 Multi-Label Document Classification	70
  6.4 Clustering	73
    6.4.1 K-Means Based Document Clustering	74
    6.4.2 Hierarchical Agglomerative Document Clustering	78
7 Experimental Results for Detecting Near-Duplicate Documents	84
  7.1 Data Sets	85
  7.2 Experiment I	86
  7.3 Experiment II	89
  7.4 Experiment III	91
  7.5 Experiment IV	93
  7.6 Experiment V	94
  7.7 Experiment VI	97
8 Conclusion	101
Bibliography	104

電子全文 Fulltext
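This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as not to violate the law.
論文使用權限 Thesis access permission: 自定論文開放時間 User-defined release time
開放時間 Available: 校內 Campus: 已公開 Available; 校外 Off-campus: 已公開 Available
etd-0425113-122452.pdf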
紙本論文 Printed copies
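Availability information for printed theses is relatively complete from academic year 102 onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available: 已公開 Available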
