Thesis access permission: not available on campus or off campus
Available:
Campus: never available
Off-campus: never available
Title |
一個以有意義候選集來探勘全球資訊網上使用者雙向移動路徑的方法 A Meaningful Candidate Approach to Mining Bi-Directional Traversal Patterns on the WWW |
||
Department |
|||
Year, semester |
Language |
||
Degree |
Number of pages |
104 |
|
Author |
|||
Advisor |
|||
Convenor |
|||
Advisory Committee |
|||
Date of Exam |
2004-06-04 |
Date of Submission |
2004-07-27 |
Keywords |
Web Mining, Traversal Pattern, WWW, Association Rule, Data Mining |
||
Statistics |
The thesis/dissertation has been browsed 5733 times and downloaded 0 times. |
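Abstract (translated from Chinese) |
With the development of the World Wide Web, more and more useful information is available online. Finding this information is the main task of Web mining, which applies traditional data mining techniques to the Web. One important topic is finding the page-browsing sequences that users frequently follow, known as mining traversal patterns. Although traditional methods for mining association rules (such as the Apriori and DHP algorithms) can also be used to mine traversal patterns, they do not exploit properties that exist only on the Web, so they generate too many unnecessary candidate patterns and cannot deliver good performance. The SpeedTracer algorithm proposed by Wu et al. was the first to use this Web property to reduce the number of candidate patterns generated during mining. However, SpeedTracer does not use the property effectively to reduce the subset checks required when generating candidate patterns. In this thesis, our first algorithm, SpeedTracer*-I, improves on SpeedTracer. It uses the Web property to generate and count all candidate patterns directly from users' browsing records, which reduces the number of candidate patterns, and it further uses the property to speed up the checking of candidate-pattern subsets. Building on the first algorithm, we then propose two more, SpeedTracer*-II and SpeedTracer*-III. Both reduce the number of database scans and thus the time spent reading the database, which yields better performance. In SpeedTracer*-II, the user specifies a value n; we first use SpeedTracer*-I to find Ln and then use Ln to generate all candidate patterns Ck, k > n. One more database scan then yields the counts of all candidate patterns and determines which are frequent. In SpeedTracer*-III, the user likewise specifies a value n; we first use SpeedTracer*-I to find Ln and then use Ln to help generate and count all Ck, k > n, directly from the browsing records. Our simulation results show that SpeedTracer*-I takes less execution time than SpeedTracer. Moreover, because SpeedTracer*-II and SpeedTracer*-III reduce the number of database scans, their execution times are better than those of SpeedTracer and SpeedTracer*-I. The simulation results also show that all three proposed algorithms outperform algorithms based on traditional association-rule mining (such as the FS algorithm of Chen et al. and the FDLP algorithm of Yen et al.) in both memory usage and total execution time. |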
Abstract |
Since the advent of the World Wide Web (WWW), more and more useful information has become available online. Web mining, the application of data mining techniques to the WWW, has therefore become an increasingly important research area. One of its central topics is mining traversal patterns, that is, finding the sequences of Web pages that users frequently browse. Algorithms for mining association rules (e.g., Apriori and DHP) can be applied to this problem, but they do not exploit the properties of Web transactions and therefore generate too many invalid candidate patterns, which leads to poor performance. Wu et al. proposed SpeedTracer, an algorithm for mining traversal patterns that exploits one such property: traversal patterns are continuous in the Web structure. Although SpeedTracer reduces the number of candidate patterns generated during mining, it does not use this property efficiently to reduce the number of checks needed when examining the subsets of each candidate pattern. In this thesis, we design three algorithms for mining traversal patterns that improve on SpeedTracer. The first, SpeedTracer*-I, uses the property of Web transactions to generate and count all candidate patterns directly from user sessions, and uses the same property to speed up the subset-checking step performed when candidate patterns are generated. Building on SpeedTracer*-I, we propose SpeedTracer*-II and SpeedTracer*-III, which improve performance by reducing the number of database scans. In SpeedTracer*-II, given a parameter n, we first apply SpeedTracer*-I to find Ln and then use Ln to generate all candidate patterns Ck, where k > n. After all candidate patterns have been generated, a single database scan counts them and determines the frequent patterns. In SpeedTracer*-III, given a parameter n, we likewise apply SpeedTracer*-I to find Ln and then, based on Ln, generate and count Ck directly from user sessions, where k > n. Our simulation results show that SpeedTracer*-I requires less processing time than SpeedTracer. They also show that SpeedTracer*-II and SpeedTracer*-III outperform both SpeedTracer and SpeedTracer*-I because they need fewer database scans. Finally, all three proposed algorithms outperform Apriori-like algorithms (e.g., FS and FDLP) in terms of processing time. |
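The idea of generating and counting candidates directly from user sessions can be illustrated with a minimal sketch. This is not the thesis's SpeedTracer*-I algorithm, whose details are not given in this record. It assumes that a traversal pattern is a contiguous run of pages within a session, that support is the number of sessions containing the pattern, and that a candidate of length k only needs its two contiguous sub-patterns of length k - 1 (prefix and suffix) checked. The function name `mine_contiguous_patterns` and the sample sessions are hypothetical.

```python
from collections import Counter


def mine_contiguous_patterns(sessions, min_support):
    """Level-wise mining of contiguous page sequences (illustrative only).

    Assumptions, not taken from the thesis: a pattern is a contiguous run of
    pages inside a session, and its support is the number of sessions that
    contain it at least once.
    """
    frequent = {}       # pattern (tuple of pages) -> support count
    prev_level = None   # frequent patterns of length k - 1
    k = 1
    while True:
        counts = Counter()
        for session in sessions:  # one pass over the sessions per level
            seen = set()          # count each pattern once per session
            for i in range(len(session) - k + 1):
                cand = tuple(session[i:i + k])
                if k > 1:
                    # A contiguous k-pattern has only two contiguous
                    # (k-1)-subpatterns: its prefix and its suffix.
                    if cand[:-1] not in prev_level or cand[1:] not in prev_level:
                        continue
                seen.add(cand)
            counts.update(seen)
        level = {p: c for p, c in counts.items() if c >= min_support}
        if not level:
            return frequent
        frequent.update(level)
        prev_level = level
        k += 1


# Hypothetical sessions; the first one contains a backward move (C back to B).
sessions = [["A", "B", "C", "B"], ["A", "B", "C"], ["B", "C", "D"]]
patterns = mine_contiguous_patterns(sessions, min_support=2)
```

Because candidates are read off the sessions themselves, patterns that never occur contiguously are never generated, and each candidate needs two membership checks rather than one per subset. This sketch still makes one pass per pattern length; it does not attempt the scan reduction described for SpeedTracer*-II and SpeedTracer*-III.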
Table of Contents |
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
  1.1 Web Mining
  1.2 Pattern Discovery
    1.2.1 Pre-processing Tasks
    1.2.2 Discovery Techniques
  1.3 Mining Traversal Patterns
    1.3.1 Problem Description of Association Rule
    1.3.2 Related Work of Mining Traversal Patterns
  1.4 Motivations
  1.5 Organization of the Thesis
2. A Survey of Algorithms for Mining Traversal Patterns
  2.1 Mining Association Rules
    2.1.1 The Apriori Algorithm
    2.1.2 The DHP Algorithm
  2.2 Mining Traversal Patterns
    2.2.1 The MF Algorithm
    2.2.2 The FS Algorithm
    2.2.3 The SS Algorithm
    2.2.4 The SpeedTracer Algorithm
    2.2.5 Li and Shan’s Algorithms
    2.2.6 The FDLP Algorithm
3. SpeedTracer*-I, SpeedTracer*-II, and SpeedTracer*-III Algorithms
  3.1 The SpeedTracer*-I Algorithm
  3.2 The SpeedTracer*-II Algorithm
  3.3 The SpeedTracer*-III Algorithm
  3.4 A Comparison
4. Performance
  4.1 Generation of Synthetic Data
  4.2 Simulation Result
    4.2.1 SpeedTracer vs. SpeedTracer*-I
    4.2.2 SpeedTracer*-II vs. SpeedTracer*-III
  4.3 Performance Results for Real Web Logs
5. Conclusion
  5.1 Summary
  5.2 Future Research Directions
BIBLIOGRAPHY
A. An Example of the Generation of Synthetic Data
B. An Example of the Real Web Log |
References |
[1] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules Between Sets of Items in Large Databases,” Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pp. 207–216, 1993.
[2] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. of the 20th Int. Conf. on Very Large Data Bases, pp. 487–499, 1994.
[3] M. S. Chen, J. Han, and P. S. Yu, “Data Mining: An Overview from Database Perspective,” IEEE Trans. on Knowledge and Data Eng., Vol. 8, No. 5, pp. 866–882, Dec. 1996.
[4] M. S. Chen, J. S. Park, and P. S. Yu, “Efficient Data Mining for Path Traversal Patterns,” IEEE Trans. on Knowledge and Data Eng., Vol. 10, No. 2, pp. 209–221, March/April 1998.
[5] R. C. Chen and W. Y. Shen, “Data Mining User Sessions Based on Forward and Backward Reference Patterns,” Int. Computer Symp., pp. 207–214, 2000.
[6] W. H. Chen, Y. H. Wu, and A. L. P. Chen, “Web-Flow Mining Techniques, Applications and System Implementations,” Proc. of National Computer Symp., pp. 26–32, 1999.
[7] R. Cooley, B. Mobasher, and J. Srivastava, “Web Mining: Information and Pattern Discovery on the World Wide Web,” The 9th Int. Conf. on Tools with Artificial Intelligence, pp. 558–567, 1997.
[8] J. Han and K. Chang, “Data Mining for Web Intelligence,” Computer, Vol. 35, No. 11, pp. 64–70, Nov. 2002.
[9] H. F. Li and M. K. Shan, “Mining Non-Simple Traversal Paths from Web Access Logs,” Workshop on Internet and Distributed Systems, pp. 266–272, 2000.
[10] F. R. Lin and S. T. Chang, “Mining User Access Patterns from Network Flow on the Internet,” The 11th National Conf. on the Information Management, pp. 1–11, 2000.
[11] I. Y. Lin, X. M. Huang, and M. S. Chen, “Capturing User Access Patterns in the Web for Data Mining,” The 11th IEEE Int. Conf. on Tools with Artificial Intelligence, pp. 345–348, 1999.
[12] B. Mobasher, N. Jain, E. Han, and J. Srivastava, “Web Mining: Pattern Discovery from World Wide Web Transactions,” Technical Report TR96-050, Dept. of Computer Science, University of Minnesota, 1996.
[13] J. S. Park, M. S. Chen, and P. S. Yu, “An Effective Hash-Based Algorithm for Mining Association Rules,” Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pp. 175–186, 1995.
[14] J. S. Park, M. S. Chen, and P. S. Yu, “Using A Hash-Based Method with Transaction Trimming for Mining Association Rules,” IEEE Trans. on Knowledge and Data Eng., Vol. 9, No. 5, pp. 813–825, Sept./Oct. 1997.
[15] W. C. Peng and M. S. Chen, “Developing Data Allocation Schemes by Incremental Mining of User Moving Patterns in a Mobile Computing System,” IEEE Trans. on Knowledge and Data Eng., Vol. 15, No. 1, pp. 70–84, Jan./Feb. 2003.
[16] S. M. Tseng and W. C. Chan, “Mining Complete User Moving Paths in a Mobile Environment,” Int. Computer Symp., pp. 1–7, 2002.
[17] K. L. Wu, P. S. Yu, and A. Ballman, “SpeedTracer: A Web Usage Mining and Analysis Tool,” IBM Systems Journal, Vol. 37, No. 1, pp. 89–105, Jan. 1998.
[18] Y. H. Wu, Y. H. Chen, and A. L. P. Chen, “Querying and Browsing the Resources in Internet,” Int. Computer Symp., pp. 9–16, 1996.
[19] Y. Q. Xiao and M. H. Dunham, “Efficient Mining of Traversal Patterns,” Data and Knowledge Eng., Vol. 39, No. 2, pp. 191–214, Nov. 2001.
[20] D. L. Yang, S. H. Yang, and M. C. Hong, “An Efficient Web Mining Algorithm for Session Path Patterns,” Int. Computer Symp., pp. 107–112, 2000.
[21] S. J. Yen, Y. S. Lee, and C. H. Hsu, “Mining Frequent Traversal Patterns in a Web Training Environment,” Proc. of National Computer Symp., pp. 105–116, 2001.
[22] C. H. Yun and M. S. Chen, “Mining Web Transaction Patterns in an Electronic Commerce Environment,” Proc. of the 4th Pacific-Asia Conf. on Knowledge Discovery and Data Mining, pp. 216–219, 2000.
[23] C. H. Yun and M. S. Chen, “Using Pattern-Join and Purchase-Combination for Mining Web Transaction Patterns in an Electronic Commerce Environment,” The 24th Annual Int. Computer Software and Applications Conf., pp. 99–104, 2000.
[24] O. R. Zaiane, M. Xin, and J. Han, “Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs,” Proc. of Advances in Digital Libraries Conf., pp. 19–29, 1998. |
Full text |
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law. |
Printed copies |
Public-availability information for printed theses is relatively complete from academic year 102 onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk at the Office of Library and Information Services. We apologize for any inconvenience. Available: publicly available |