National Sun Yat-sen University, thesis/dissertation, An ID-Tree Index Strategy for Information Filtering in Web-Based Systems

Title	An ID-Tree Index Strategy for Information Filtering in Web-Based Systems (一個用於網頁式資訊過濾系統下的鑑識樹檢索方法)
Department	Department of Computer Science and Engineering
Year, semester	Spring semester of Academic Year 94	Language	English
Degree	Master	Number of pages	110
Author	Yi-Siang Wang (王奕翔)
Advisor	Ye-In Chang (張玉盈)
Convenor	San-Yih Hwang (黃三益)
Advisory Committee	Gen-huey Chen (陳健輝); Chien-I Lee (李建億)
Date of Exam	2006-06-02	Date of Submission	2006-07-10
Keywords	Signature, Inexact Filtering, Information Filtering, Data Partition, Similarity Search

Abstract
With the explosive growth of the WWW, many users have experienced information overload, and many search engines have been developed to help them find useful information in large volumes of data. However, users may have different needs in different situations. In contrast to information retrieval, where users actively search a database, information filtering (IF) serves passive users: the server sends relevant information to them through broadcast media. Each user therefore has a profile stored in the database, which records a set of interest items representing his or her interests or habits. To store many user profiles efficiently and filter out irrelevant users, many signature-based index techniques have been applied in IF systems. With signatures, an IF system does not need to compare every item of every profile to rule out irrelevant ones. However, because a signature carries only partial information about a profile, it is very hard to answer complex queries using signatures alone. A critical issue for signature-based IF services is therefore how to extract the signatures of user profiles and index them so that filtering is efficient.

Signature-based IF systems typically handle two types of queries: inexact filtering and similarity search. In inexact filtering, the query is an incoming document, and the task is to find the profiles whose interest items are all included in that document. In similarity search, the query is a user profile, and the task is to find the users whose interest items are similar to those of the query user.

In this thesis, we propose an ID-tree index strategy, which indexes the signatures of user profiles by partitioning the profiles into subgroups with a binary tree, according to all of the items on which they differ. The ID-tree is a kind of signature tree. In an ID-tree, each path from the root to a leaf node is the signature of the profile that the leaf node points to. Because each profile is pointed to by exactly one leaf node, there are no collisions in the structure; in other words, no two distinct profiles are assigned the same signature. Moreover, only the items that differ among subgroups of profiles need to be checked at each step to filter out irrelevant profiles. Our strategy can therefore answer inexact filtering and similarity search queries while accessing fewer profiles than previous strategies. In addition, building the signature index in batch over a large number of profiles takes less time. Our simulation results show that our strategy accesses fewer profiles than Chen's signature tree strategy for inexact filtering and than Aggarwal et al.'s SG-table strategy for similarity search.
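The two query types and the idea of pruning on distinguishing items can be illustrated with a minimal Python sketch. This is not the thesis's ID-tree algorithm, whose construction and search procedures are not given in the abstract. It assumes that profiles are plain sets of items, that the tree splits each group on the alphabetically smallest item that some but not all of its profiles contain, and that similarity is measured with the Jaccard coefficient. All names (`Node`, `build`, `inexact_filter`, `most_similar`) and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional

Profile = FrozenSet[str]


@dataclass
class Node:
    item: Optional[str] = None         # item tested at an internal node
    without: Optional["Node"] = None   # profiles that lack `item`
    with_: Optional["Node"] = None     # profiles that contain `item`
    profile: Optional[Profile] = None  # set only at leaves
    users: Optional[List[str]] = None  # set only at leaves


def build(profiles: Dict[str, Profile]) -> Node:
    """Split a non-empty group on an item that distinguishes its members."""
    distinct = set(profiles.values())
    if len(distinct) == 1:
        return Node(profile=next(iter(distinct)), users=sorted(profiles))
    common = frozenset.intersection(*distinct)
    union = frozenset.union(*distinct)
    item = min(union - common)  # assumed split rule, for illustration only
    lacking = {u: p for u, p in profiles.items() if item not in p}
    having = {u: p for u, p in profiles.items() if item in p}
    return Node(item=item, without=build(lacking), with_=build(having))


def inexact_filter(node: Node, document: FrozenSet[str]) -> List[str]:
    """Return users whose interest items are all included in the document."""
    if node.users is not None:  # leaf: verify the full profile
        return list(node.users) if node.profile <= document else []
    hits = inexact_filter(node.without, document)
    if node.item in document:   # otherwise prune every profile needing `item`
        hits += inexact_filter(node.with_, document)
    return hits


def jaccard(a: Profile, b: Profile) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def most_similar(profiles: Dict[str, Profile], query_user: str) -> str:
    """Brute-force similarity search (no index), assuming Jaccard similarity."""
    query = profiles[query_user]
    others = [u for u in profiles if u != query_user]
    return max(others, key=lambda u: jaccard(profiles[u], query))


# Hypothetical sample data.
profiles = {
    "ann": frozenset({"music", "sports"}),
    "bob": frozenset({"music", "travel"}),
    "cat": frozenset({"sports"}),
    "dan": frozenset({"music", "sports", "travel"}),
}
tree = build(profiles)
document = frozenset({"music", "sports", "news"})
print(sorted(inexact_filter(tree, document)))  # ['ann', 'cat']
print(most_similar(profiles, "ann"))           # dan
```

In this sketch, a subtree is skipped as soon as its distinguishing item is missing from the document, so the profile of `dan` (which requires `travel`) is never accessed for the sample query. The similarity search here scans every profile; an index-based version would need a bound on the best achievable similarity per subtree, which this sketch does not attempt.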

Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
  1.1 Information Filtering
    1.1.1 Content-Based Information Filtering
    1.1.2 Collaborative Information Filtering
  1.2 Signatures
  1.3 Strategies of Information Filtering Based on Signatures
    1.3.1 Inexact Filtering Strategies
    1.3.2 Similarity Search Strategies
  1.4 Motivation
  1.5 Organization of Thesis
2. A Survey
  2.1 Inexact Filtering Index Strategies
    2.1.1 The Signature Files
    2.1.2 The Bit-Slice Files
    2.1.3 The S-Tree
    2.1.4 Signature Trees
  2.2 Similarity Search Index Strategies
    2.2.1 The Signature Table
    2.2.2 The S³B: Signature-based Similarity Search for Basket-data
3. The ID Tree Index Strategy
  3.1 The ID Tree Structure
  3.2 Construction of the ID Tree
    3.2.1 The Preprocessing Step
    3.2.2 The Extension Step
  3.3 Searching in the ID Tree
    3.3.1 Inexact Filtering in the ID Tree
    3.3.2 The Minimum Optimistic Bound
    3.3.3 Similarity Search in the ID Tree
4. Performance
  4.1 The Simulation Model
  4.2 Simulation Results of Inexact Filtering Strategies
  4.3 Simulation Results of Similarity Search Strategies
5. Conclusion
  5.1 Summary
  5.2 Future Research Directions
BIBLIOGRAPHY

References
[1] C. C. Aggarwal, J. L. Wolf, and P. S. Yu, "A New Method for Similarity Indexing of Market Basket Data," Proc. of 1999 ACM SIGMOD Int. Conf. on Management of Data, pp. 407-418, June 1999.
[2] M. Balabanovic and Y. Shoham, "Fab: Content-Based, Collaborative Recommendation," Communications of the ACM, Vol. 40, No. 3, pp. 66-72, March 1997.
[3] R. Bayer and K. Unterrauer, "Prefix B-Tree," ACM Trans. on Database Systems, Vol. 2, No. 1, pp. 11-26, March 1977.
[4] N. J. Belkin and W. B. Croft, "Information Filtering and Information Retrieval: Two Sides of the Same Coin?," Communications of the ACM, Vol. 35, No. 12, pp. 29-38, Dec. 1992.
[5] Y. Chen, "On the Signature Tree and Balanced Signature Trees," Proc. of the 21st IEEE Int. Conf. on Data Engineering, pp. 742-753, 2005.
[6] U. Deppisch, "S-Tree: A Dynamic Balanced Signature Index for Office Retrieval," Proc. of ACM Conf. on Research and Development in Information Retrieval, pp. 77-87, 1986.
[7] C. Faloutsos, "Access Methods for Text," ACM Computing Surveys (CSUR), Vol. 17, No. 1, pp. 49-74, March 1985.
[8] C. Faloutsos and D. W. Oard, "A Survey of Information Retrieval and Filtering Methods," Technical Report, University of Maryland, Aug. 1995.
[9] D. Goldberg, D. Nichols, B. Oki, and D. Terry, "Using Collaborative Filtering to Weave an Information Tapestry," Communications of the ACM, Vol. 35, No. 12, pp. 61-70, 1992.
[10] A. Guttman, "R-Tree: A Dynamic Index Structure for Spatial Searching," Proc. of 1984 ACM SIGMOD Int. Conf. on Management of Data, pp. 47-54, 1984.
[11] M. Hammami, Y. Chahir, and L. Chen, "Webguard: A Web Filtering Engine Combining Textual, Structural, and Visual Content-Based Analysis," IEEE Trans. on Knowledge and Data Engineering, Vol. 18, No. 2, pp. 272-284, Feb. 2006.
[12] W. Hill, L. Stead, M. Rosenstein, and G. Furnas, "Recommending and Evaluating Choices in a Virtual Community of Use," Proc. of ACM CHI'95 Conf. on Human Factors in Computing Systems, pp. 194-201, May 1995.
[13] G. R. Hjaltason and H. Samet, "Index-Driven Similarity Search in Metric Spaces," ACM Trans. on Database Systems, Vol. 28, No. 4, pp. 517-580, Dec. 2003.
[14] Y. Ishikawa, H. Kitagawa, and N. Ohbo, "Evaluation of Signature Files as Set Access Facilities in OODBs," Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 247-256, 1993.
[15] S. Jung, J. Kim, and J. L. Herlocker, "Applying Collaborative Filtering for Efficient Document Search," Proc. of IEEE/WIC/ACM Int. Conf. on Web Intelligence, pp. 640-643, Sep. 2004.
[16] N. Katayama and S. Satoh, "The SR-tree: An Index Structure for High Dimensional Nearest Neighbor Queries," Proc. of ACM SIGMOD Int. Conf. on Management of Data, pp. 369-380, May 1997.
[17] A. J. Kent, R. Sacks-Davis, and K. Ramamohanarao, "A Signature File Scheme Based on Multiple Organizations for Indexing Very Large Text Databases," Journal of the American Society for Information Science, Vol. 41, No. 7, pp. 508-534, Oct. 1990.
[18] T. Ku and P. Shoval, "User Profile Generation for Intelligent Information Agents-Research in Progress," Proc. of the Second Int. Workshop Agent-Oriented Information System, pp. 63-72, 2000.
[19] K. Lang, "Newsweeder: Learning to Filter Netnews," Proc. of the 12th Int. Conf. on Machine Learning, pp. 331-339, 1995.
[20] N. Mamoulis, D. W. Cheung, and L. Wang, "Similarity Search in Sets and Categorical Data Using the Signature Tree," Proc. of the 19th IEEE Int. Conf. on Data Engineering, pp. 75-86, 2003.
[21] D. R. Morrison, "Patricia: Practical Algorithm to Retrieve Information Coded in Alphanumeric," Journal of the ACM, Vol. 15, No. 4, pp. 514-534, Oct. 1968.
[22] A. Nanopoulos and Y. Manolopoulos, "Efficient Similarity Search for Market Basket Data," The Int. Journal on Very Large Data Bases, Vol. 11, No. 2, pp. 138-152, Oct. 2002.
[23] L. Page, S. Brin, R. Motwani, and T. Winograd, "The Pagerank Citation Ranking: Bringing Order to the Web," Technical Report, Stanford University, Stanford, CA, Jan. 1998.
[24] M. Pazzani, J. Muramatsu, and D. Billsus, "Syskill and Webert: Identifying Interesting Web Sites," Proc. of the 13th National Conf. on Artificial Intelligence, pp. 54-61, 1996.
[25] J. Salter and N. Antonopoulos, "Cinemascreen Recommender Agent: Combining Collaborative and Content-Based Filtering," IEEE Intelligent Systems, Vol. 21, No. 1, pp. 35-41, Feb. 2006.
[26] J. B. Schafer, J. Konstan, and J. Riedl, "Recommender Systems in E-Commerce," Proc. of ACM Conf. on Electronic Commerce, pp. 158-166, Nov. 1999.
[27] U. Shardanand and P. Maes, "Social Information Filtering: Algorithms for Automating 'Word of Mouth'," Proc. of ACM CHI'95 Conf. on Human Factors in Computing Systems, Vol. 1, pp. 210-217, 1995.
[28] E. Tousidou, P. Bozanis, and Y. Manolopoulos, "Signature-Based Structures for Objects with Set-Valued Attributes," Information Systems, Vol. 27, No. 2, pp. 93-121, April 2002.
[29] E. Tousidou, A. Nanopoulos, and Y. Manolopoulos, "Improved Methods for Signature-Tree Construction," The Computer Journal, Vol. 43, No. 4, pp. 301-314, June 2000.
[30] D. H. Widyantoro, T. R. Ioerger, and J. Yen, "An Adaptive Algorithm for Learning Changes in User Interests," Proc. of the 8th Int. Conf. on Information and Knowledge Management, pp. 405-412, 1999.
[31] Y. H. Wu, Y. C. Chen, and A. L. P. Chen, "Enabling Personalized Recommendation on the Web Based on User Interests and Behaviors," Proc. of IEEE Workshop Research Issues in Data Engineering, pp. 17-24, 2001.

Fulltext
The electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission	Not available on campus or off campus
Available	Campus: never available	Off-campus: never available
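Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. For information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services.
Available	Publicly available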


Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS