Responsive image
博碩士論文 etd-0725103-172838 詳細資訊
Title page for etd-0725103-172838
論文名稱
Title
支援網際網路個人化資訊過濾的索引之設計
Design of Index Structures for Supporting Personalized Information Filtering on the Internet
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
125
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2003-06-20
繳交日期
Date of Submission
2003-07-25
關鍵字
Keywords
個人化、索引、資料過濾、使用者檔案、網頁挖掘
index, information filtering, personalization, profile, web usage mining
統計
Statistics
本論文已被瀏覽 5670 次,被下載 0
The thesis/dissertation has been browsed 5670 times, has been downloaded 0 times.
中文摘要
由於網際網路的蓬勃發展,使得資料過濾遇到許多新的挑戰。資料過濾是一個研究的領域,其目的乃是發展出一個工具可用來在一堆資料中,過濾出有用的資料。資料過濾可以經由網頁和使用者需要的資訊比對,找出好的結果。首先,使用者先描述他們需要的是什麼,這叫做使用者檔案(profile)。將這些檔案建立出一個索引,一堆網頁將會和這個索引作比對。每一個網頁的表示方式也和使用者檔案的表示方式相同。經由網頁和使用者檔案的比對,使用者有興趣的網頁就可以被找出來。最後,根據比對的結果,過濾出來的網頁就可以推薦給使用者。因此,在資料過濾的服務中,我們所考慮的重點是如何建立一個索引可以做有效率的比對。當一個使用者檔案的索引被建立之後,我們可以減少修正使用者檔案所需花的空間和時間。在這篇論文中,首先,我們提出一個方法,稱為 count-based tree method,它會計算每一個關鍵字出現的次數,用來減少 the tree method 所需花費的大量空間,。接下來,我們提出三個以大的項目集合(large itemset)為基礎的方法,分別是 the count-major large itemset method、the weighted large itemset method 和 the hybrid method。在這三個以大的項目集合為基礎的方法中,我們首先將有相似興趣的使用者分在同一群。接下來,針對每一群,我們利用關連式法則的挖掘技術來幫助我們建立出索引。我們利用在關連式法則的挖掘中,其中之一個有名的方法 Apriori 演算法的概念,來設計這三個方法。但是,我們修改了在 Apriori 演算法中的決定最小的贊成值(support)和最後的目標。我們不一定總是得到最大的項目集合(Lk)。有時,我們只會選擇Lw,在這裡 w < k。總結來說,我們提出的這四個方法所花費的空間都比 Yan 和 Garcia-Molina 他們所提的 the tree method所花費的空間來的少。根據我們的模擬結果,我們的這四個方法會根據不同的資料而能提供最好的結果。接下來,我們提出一個儲存有不同輕重關鍵字的索引的方法,來減少更新的花費。當使用者的興趣常常在改變的時候,我們必須考慮如何提供一個索引結構,使得更新的花費比較少。在這個方法中,我們將每一個關鍵字的重量考慮進來。那就是每一個關鍵字可以被區分為 long-term interests和 short-term interests,當這個關鍵字的重量大於某個門檻,我們就稱它為 long-term interests,如果這個關鍵字的重量小於某個門檻,我們就稱它為 short--term interests。由於修改 short-term interests 的機會比修改 long-term interests 的機會來的大很多,因此我們可以更新的範圍是局部地。根據我們模擬的結果,我們的方法確實可以減少 Wu 和 Chen 他們的方法在更新上的花費。
Abstract
Owing to the booming development of the WWW, it creates many new challenges for information filtering. Information Filtering (IF) is an area of research that develops tools for discriminating between relevant and irrelevant information. IF can find good matches between the web pages and the users' information needs. Users first give descriptions about what they need, i.e., user profiles, to start the services. A profile index is built on these profiles. A series of incoming web pages will be put into the matching process. Each incoming web page is represented in the same form of the user profile. In this way, the users who are interested in an incoming web page can be identified by comparing the descriptions of the web page with each user profile. At last, the web page will be recommended to the users whose profiles belong to the filtered results. Therefore, a critical issue of the information filtering service is how to index the user profiles for an efficient matching process. When we index the user profile, we can reduce the costs of storage space and the processing time for modifying the user profiles. In this thesis, first, we propose a count-based tree method, which takes the count of each keyword into consideration, to reduce the large storage spaces as needed by the tree method. Next, three large-itemset-based methods are proposed to reduce the storage space, which are called the count-major large itemset method, the weighted large itemset method and the hybrid method. In these three large-itemset-based methods, we first cluster profiles with similar interests into the same group. Next, for each cluster, we apply the mining association rules techniques to help us to construct the index strategies. We design three methods by using
the idea of the Apriori algorithm which is one of well-known approaches in mining association rules. But, we modify the minimum support and the goal in the Apriori algorithm. We may not always output the large itemset Lk. That is, we may only use Lw, where w < k. In summary, the cost of storage spaces of our four methods are less than that of the tree method proposed by Yan and Garcia-Molina. According to our simulation results, each of our four methods may provide the best result when different input data sets. Next, we propose a large-itemset-based approach to the incremental update of the index structure for storing keywords to
reduce the update cost. When someone's interests are often changed, we must care about the way how to provide the low update cost of the index structure. We take the weight of each keyword into consideration. That is, each keyword can be distinguished the long-term interest which has weight above the threshold from the short-term interest which has weight below the threshold. Owing to that the probability of modifying the short-term interests is higher than that of modifying the long-term interests, we can update the short-term interests locally. According to our simulation results, our method really can reduce the update cost as needed by Wu and Chen' methods.

目次 Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Information Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Two Kinds of Filtering Approaches . . . . . . . . . . . . . . . 3
1.1.2 Search Engines . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Personalization . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Path Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Index Structures on the Internet . . . . . . . . . . . . . . . . . . . . . 9
1.4 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.5 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 20
2. A Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.1 The Vector Space Model (VSM) . . . . . . . . . . . . . . . . . . . . . 21
2.1.1 The Brute Force (BF) Method . . . . . . . . . . . . . . . . . . 22
2.1.2 The Pro le Indexing (PI) Method . . . . . . . . . . . . . . . . 22
2.1.3 The Selective Pro le Indexing (SPI) Method . . . . . . . . . . 23
2.2 The Boolean Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.1 Yan and Garcia-Molona's Methods Under the Boolean Model 27
2.2.2 Wu and Chen's Methods Under the Boolean Model . . . . . . 33
2.2.2.1 Index Path with Path Signatures . . . . . . . . . . . 33
2.2.2.2 Index Graph with Path Signatures . . . . . . . . . . 35
2.2.2.3 Index Path with Pro le Sets . . . . . . . . . . . . . . 37
2.2.2.4 Index Graph with Pro le Sets . . . . . . . . . . . . . 39
3. The Count-Based Tree Method . . . . . . . . . . . . . . . . . . . . . 41
3.1 The Count-Based Tree Method . . . . . . . . . . . . . . . . . . . . . 41
3.2 A Special Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 The Search Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4. The Largest-Itemset-Based Methods . . . . . . . . . . . . . . . . . . 51
4.1 A Survey of the Apriori Algorithm . . . . . . . . . . . . . . . . . . . 51
4.2 The Count-Major Large Itemset Method . . . . . . . . . . . . . . . . 53
4.3 The Weighted Large Itemset Method . . . . . . . . . . . . . . . . . . 64
4.4 The Hybrid Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5. The Large-Itemset-Based Approach to the Incremental Update . 84
5.1 The Update Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 The Update Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6. Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.1 The Simulation Model . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 Simulation Results of the Boolean Model-Based Methods . . . . . . . 106
6.3 Simulation Results of the Vector Space Model-Based Methods . . . . 115
7. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.2 Future Research Directions . . . . . . . . . . . . . . . . . . . . . . . . 122
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
參考文獻 References
[1] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases, pp. 490-501, Sept. 1994.
[2] Sergey Brin and Larry Page. Google Search Engine. URL: http://google.stanford.edu.
[3] E. Y. Chang, C. Li, H. Garcia-Molina and G. Wiederhold, Clustering for Approximate Similarity Search in High-Dimensional Spaces," IEEE Trans. on Knowledge and Data Engineering, Vol. 14, No. 4, pp. 792-808, July 2002.
[4] M. S. Chen, J. S. Park and P. S. Yu, E&AElig;cient Data Mining for Path Traversal Patterns," IEEE Trans. on Knowledge and Data Engineering, Vol. 10, No. 2, pp. 209-221, March 1998.
[5] J. Cho, N. Shivakumar and H. Garcia-Molina, Finding Replicated Web Collections," Proc. SIGMOD Conf., pp. 355-366, 2000.
[6] E. J. Glover, S. Lawrence, M. D. Gordon, W. P. Birmingham and C. L. Giles, Web Search-YourWay," Communications of the ACM, Vol. 44, No. 12, pp. 97-102, 2001.
[7] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon and J. Riedl, Applying Collaborative Filtering to Usenet News," Communications of the ACM, Vol. 40, No. 3, pp. 77-87, March 1997.
[8] T. Kuik and P. Shoval, User Pro le Generation for Intelligent Information Agents-Research in Progress," Proc. of the Second Int'l Workshop Agent-
Oriented Information Systems, pp. 63-72, 2000.
[9] S. Melnik, S. Raghavan, B. Yang and H. Garcia-Molina, Building a Distributed Full-Text Index for the Web," ACM Trans. on Information Systems, Vol. 19, No. 3, pp. 217-241, July 2001.
[10] L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web," URL: http://google.stanford.edu/backrub/pageranksub.ps, 1998.
[11] Y. W. Park and E. S. Lee, A New Generation Method of an User Pro le for
Information Filtering on the Internet," Proc. of Int. Conf. on Data Engineering, pp. 337-347, 1994.
[12] J. S. Park, M. S. Chen and P. S. Yu, An E&AElig;cient Hash-Based Algorithm for
Mining Association Rules," ACM SIGMOD Int. Conf. Management of Data, pp. 175-186, 1995.
[13] J. Picard and J. Savoy, Using Probailistic Argumentation Systems to Search and ClassifyWeb Sites," Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, Vol. 24, No. 3, pp. 33-41, Sep. 2001.
[14] C. Shahabi, A. M. Zarkesh, J. Adibi and V. Shah, Knowledge Discovery from
Users Web-Page Navigation," Proc. of IEEE Workshop Research Issues in Data
Engineering, pp. 20-29, 1997.
[15] D. H. Widyantoro, T. R. Ioerger and J. Yen, An Adaptive Algorithm for Learning Changes in User Interests," Proc. of the 8th Int. Conf. on Information and Knowledge Management, pp. 405-412, 1999.
[16] Y. H. Wu and Arbee L. P. Chen, Index Structures of User Pro les for E&AElig;-
cient Web Page Filtering Services," Proc. of the 20th Int. Conf. on Distributed
Computing Systems, pp. 644-653, 2000.
[17] Y. H. Wu, Y. C. Chen and Arbee L. P. Chen, Enabling Personalized Recom-
mendation on the Web Based on User Interests and Behaviors," Proc. of IEEE
Workshop Research Issues in Data Engineering, pp. 17-24, 2001.
[18] Yahoo, URL: http://www.yahoo.com, 1999.
[19] T. W. Yan and H. Garcia-Molina, Distributed Selective Dissemination of Information," Proc. of Parallel and Distributed Information Systems, pp. 89-98, 1994.
[20] T. W. Yan and H. Garcia-Molina, Index Structures for Information Filtering Under the Vector Space Model," Proc. of Int. Conf. on Data Engineering, pp. 337-347, 1994.
[21] T. W. Yan and H. Garcia-Molina, Index Structures for Selective Dissemination of Information Under the Boolean Model," ACM Trans. on Database Systems, Vol. 19, No. 2, pp. 322-364, Nov. 1994.
[22] T. W. Yan and H. Garcia-Molina, SIFT{A Tool for Wide-Area Information Dissemination," Proc. of the 1995 USENIX Technical Conf., pp. 177-186, 1995.
[23] O. Zamir, Fast and Intuitive Clustering of Web Documents," Proc. of the 3rd Int. Conf. on Knowledge Discovery and Data Mining, pp. 287-290, 1997.
[24] O. Zamir and O. Etzioni, Web Document Clustering: A Feasibility Demonstration," Proc. of the 21st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 46-54, 1998.
電子全文 Fulltext
本電子全文僅授權使用者為學術研究之目的,進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定,切勿任意重製、散佈、改作、轉貼、播送,以免觸法。
論文使用權限 Thesis access permission:校內校外均不公開 not available
開放時間 Available:
校內 Campus:永不公開 not available
校外 Off-campus:永不公開 not available

您的 IP(校外) 位址是 54.89.70.161
論文開放下載的時間是 校外不公開

Your IP address is 54.89.70.161
This thesis will be available to you on Indicate off-campus access is not available.

紙本論文 Printed copies
紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊,請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。
開放時間 available 已公開 available

QR Code