博碩士論文 etd-0727104-153403 詳細資訊
Title page for etd-0727104-153403
論文名稱
Title
一個以有意義候選集來探勘全球資訊網上使用者雙向移動路徑的方法
A Meaningful Candidate Approach to Mining Bi-Directional Traversal Patterns on the WWW
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
104
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2004-06-04
繳交日期
Date of Submission
2004-07-27
關鍵字
Keywords
瀏覽路徑、資料探勘、關聯式法則、全球資訊網、網路探勘
Web Mining, Traversal Pattern, WWW, Association Rule, Data Mining
中文摘要
隨著全球資訊網的發展,有愈來愈多有用的資訊,存在於網路上。如何找出這些有用的資訊,便是網路探勘(Web mining)的主要工作。所謂的網路探勘,它是將傳統資料探勘(data mining)的技術,應用在網路上。而其中一個很重要的主題,便是找出使用者常常會出現哪些網頁瀏覽順序,也就是所謂的探勘瀏覽樣式(mining traversal patterns)。雖然傳統探勘關聯式法則的方法(如Apriori及DHP演算法),也能用來探勘瀏覽樣式,但由於它們並沒有利用到在網路上才會擁有的特性,因此它們會產生過多不必要的候選樣式,以致於沒有辦法提供良好的效能。在Wu等人提出的SpeedTracer演算法中,他們則是首先利用了這個網路特性,來減少在探勘過程中所會產生候選樣式的個數。但是,在SpeedTracer演算法中,並沒有有效地利用之前所提到的網路特性,來減少產生候選樣式時所需的檢查子集動作。在這篇論文中,我們提出的第一個演算法—SpeedTracer*-I,它改進了SpeedTracer演算法。在SpeedTracer*-I演算法中,它除了利用了之前提過的網路特性,直接從使用者的瀏覽記錄中,產生及計數所有的候選樣式,因而減少了候選樣式的個數。並且,更進一步地利用了這個網路特性,來加速檢查候選樣式子集的動作。接下來,我們則是以第一個演算法為基礎,提出了SpeedTracer*-II及SpeedTracer*-III兩個演算法。在這兩個演算法中,我們藉由減少掃描資料庫的次數,來減少讀取資料庫時所花費的時間,因而提供了更好的效能。在SpeedTracer*-II演算法中,會由使用者指定一個n值,我們先利用SpeedTracer*-I演算法,找出Ln,再利用Ln來產生所有的候選樣式Ck,k > n。最後,再掃描一次資料庫,便能得到所有候選樣式的計數,並決定哪些是頻繁樣式。在SpeedTracer*-III演算法中,一樣由使用者指定一個n值,我們先利用SpeedTracer*-I演算法,找出Ln,再利用Ln來幫助我們直接從瀏覽記錄中,產生及計數所有的Ck,k > n。從我們的模擬結果中顯示出,我們所提出的SpeedTracer*-I演算法,在執行時所需花費的時間,會比SpeedTracer演算法要少。再者,我們提出的SpeedTracer*-II及SpeedTracer*-III演算法,因為它們減少了資料庫的掃描次數,在執行的時間上,又會優於SpeedTracer或SpeedTracer*-I演算法。另外在模擬結果中,我們所提出的三個演算法,無論是在執行時所需花費的記憶體空間或總執行時間,皆會優於以傳統探勘關聯式法則方法為基礎的演算法(如Chen等人的FS演算法或Yen等人的FDLP演算法)。
Abstract
Since the World Wide Web (WWW) appeared, more and more useful information has
become available on it. Web mining, the application of data mining techniques
to the WWW, has therefore become a research area of increasing importance.
Mining traversal patterns, one of the important topics in Web mining, focuses
on finding the Web page sequences that users frequently browse. Although
algorithms for mining association rules (e.g., the Apriori and DHP algorithms)
can be applied to mine traversal patterns, they do not utilize the property of
Web transactions and so generate too many invalid candidate patterns; thus,
they cannot provide good performance. Wu et al. proposed SpeedTracer, an
algorithm for mining traversal patterns that utilizes the property of Web
transactions, i.e., the continuous property of the traversal patterns in the
Web structure. Although SpeedTracer decreases the number of candidate patterns
generated in the mining process, it does not efficiently utilize this property
to decrease the number of checks made on the subsets of each candidate
pattern. In this thesis, we design three algorithms for mining traversal
patterns that improve on the SpeedTracer algorithm. The first algorithm,
SpeedTracer*-I, utilizes the property of Web transactions to generate and
count all candidate patterns directly from user sessions. Moreover, it
utilizes this property to improve the checking step performed when candidate
patterns are generated. Next, based on the SpeedTracer*-I algorithm, we
propose the SpeedTracer*-II and SpeedTracer*-III algorithms, which improve its
performance by decreasing the number of database scans. In the SpeedTracer*-II
algorithm, given a parameter n, we first apply the SpeedTracer*-I algorithm to
find Ln, the set of frequent patterns of length n, and then use Ln to generate
all candidate sets Ck, where k > n. After generating all candidate patterns,
we scan the database once to count them and then determine the frequent
patterns. In the SpeedTracer*-III algorithm, given a parameter n, we also
first apply the SpeedTracer*-I algorithm to find Ln, and then directly
generate and count Ck from user sessions based on Ln, where k > n. The
simulation results show that the SpeedTracer*-I algorithm requires less
processing time than the SpeedTracer algorithm. They also show that the
SpeedTracer*-II and SpeedTracer*-III algorithms outperform the SpeedTracer and
SpeedTracer*-I algorithms, because the former two scan the database fewer
times than the latter two. Moreover, our simulation results show that all of
our proposed algorithms provide better performance than Apriori-like
algorithms (e.g., the FS and FDLP algorithms) in terms of processing time.
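
The idea of generating and counting candidates directly from user sessions can be illustrated with a small sketch. This is not the thesis's pseudocode: the function names `windows` and `mine_contiguous_patterns` are hypothetical, and the sketch assumes that a traversal pattern is a contiguous run of pages within a session, that support is the number of sessions containing the pattern, and that the "checking step" reduces to testing the two contiguous sub-paths of length k-1.

```python
from collections import Counter


def windows(session, k):
    """Yield every contiguous run of k pages in one session."""
    for i in range(len(session) - k + 1):
        yield tuple(session[i:i + k])


def mine_contiguous_patterns(sessions, min_support):
    """Level-wise mining of contiguous traversal patterns (illustrative only).

    Assumption: support = number of sessions that contain the pattern.
    Returns a dict mapping length k to {pattern: support}.
    """
    frequent = {}
    prev = None  # frequent patterns of length k-1
    k = 1
    while True:
        counts = Counter()
        for session in sessions:  # one pass over the sessions per level
            seen = set()
            for cand in windows(session, k):
                # Assumed continuity check: only the prefix and the suffix of
                # length k-1 are tested, instead of every (k-1)-subset.
                if prev is not None and (
                    cand[:-1] not in prev or cand[1:] not in prev
                ):
                    continue
                seen.add(cand)
            counts.update(seen)  # each session counts a pattern at most once
        level = {p: c for p, c in counts.items() if c >= min_support}
        if not level:
            break
        frequent[k] = level
        prev = level
        k += 1
    return frequent


# Hypothetical sessions; the last one contains a backward move (B -> A).
sessions = [
    ["A", "B", "C", "D"],
    ["A", "B", "C"],
    ["B", "C", "D"],
    ["A", "B", "A", "B"],
]
result = mine_contiguous_patterns(sessions, min_support=2)
```

With a threshold of 2 on these made-up sessions, the frequent patterns of length 3 are ("A", "B", "C") and ("B", "C", "D"), while the backward step ("B", "A") appears in only one session and is dropped at length 2. Because candidates are read off the sessions themselves, no pattern that never occurs contiguously is ever generated.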
目次 Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
1.1 Web Mining
1.2 Pattern Discovery
1.2.1 Pre-processing Tasks
1.2.2 Discovery Techniques
1.3 Mining Traversal Patterns
1.3.1 Problem Description of Association Rule
1.3.2 Related Work of Mining Traversal Patterns
1.4 Motivations
1.5 Organization of the Thesis
2. A Survey of Algorithms for Mining Traversal Patterns
2.1 Mining Association Rules
2.1.1 The Apriori Algorithm
2.1.2 The DHP Algorithm
2.2 Mining Traversal Patterns
2.2.1 The MF Algorithm
2.2.2 The FS Algorithm
2.2.3 The SS Algorithm
2.2.4 The SpeedTracer Algorithm
2.2.5 Li and Shan’s Algorithms
2.2.6 The FDLP Algorithm
3. SpeedTracer*-I, SpeedTracer*-II, and SpeedTracer*-III Algorithms
3.1 The SpeedTracer*-I Algorithm
3.2 The SpeedTracer*-II Algorithm
3.3 The SpeedTracer*-III Algorithm
3.4 A Comparison
4. Performance
4.1 Generation of Synthetic Data
4.2 Simulation Result
4.2.1 SpeedTracer vs. SpeedTracer*-I
4.2.2 SpeedTracer*-II vs. SpeedTracer*-III
4.3 Performance Results for Real Web Logs
5. Conclusion
5.1 Summary
5.2 Future Research Directions
BIBLIOGRAPHY
A. An Example of the Generation of Synthetic Data
B. An Example of the Real Web Log
參考文獻 References
[1] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules Between
Sets of Items in Large Databases,” Proc. of the ACM SIGMOD Int. Conf. on
Management of Data, pp. 207–216, 1993.
[2] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,”
Proc. of the 20th Int. Conf. on Very Large Data Bases, pp. 487–499, 1994.
[3] M. S. Chen, J. Han, and P. S. Yu, “Data Mining: An Overview from Database
Perspective,” IEEE Trans. on Knowledge and Data Eng., Vol. 8, No. 5, pp. 866–
882, Dec. 1996.
[4] M. S. Chen, J. S. Park, and P. S. Yu, “Efficient Data Mining for Path Traversal
Patterns,” IEEE Trans. on Knowledge and Data Eng., Vol. 10, No. 2, pp. 209–
221, March/April 1998.
[5] R. C. Chen and W. Y. Shen, “Data Mining User Sessions Based on Forward and
Backward Reference Patterns,” Int. Computer Symp., pp. 207–214, 2000.
[6] W. H. Chen, Y. H. Wu, and A. L. P. Chen, “Web-Flow Mining Techniques,
Applications and System Implementations,” Proc. of National Computer Symp.,
pp. 26–32, 1999.
[7] R. Cooley, B. Mobasher, and J. Srivastava, “Web Mining: Information and Pattern
Discovery on the World Wide Web,” The 9th Int. Conf. on Tools with
Artificial Intelligence, pp. 558–567, 1997.
[8] J. Han and K. Chang, “Data Mining for Web Intelligence,” Computer, Vol. 35,
No. 11, pp. 64–70, Nov. 2002.
[9] H. F. Li and M. K. Shan, “Mining Non-Simple Traversal Paths from Web Access
Logs,” Workshop on Internet and Distributed Systems, pp. 266–272, 2000.
[10] F. R. Lin and S. T. Chang, “Mining User Access Patterns from Network Flow on
the Internet,” The 11th National Conf. on the Information Management, pp. 1–
11, 2000.
[11] I. Y. Lin, X. M. Huang, and M. S. Chen, “Capturing User Access Patterns in
the Web for Data Mining,” The 11th IEEE Int. Conf. on Tools with Artificial
Intelligence, pp. 345–348, 1999.
[12] B. Mobasher, N. Jain, E. Han, and J. Srivastava, “Web Mining: Pattern Discovery
from World Wide Web Transactions,” Technical Report TR96-050, Dept. of
Computer Science, University of Minnesota, 1996.
[13] J. S. Park, M. S. Chen, and P. S. Yu, “An Effective Hash-Based Algorithm for
Mining Association Rules,” Proc. of the ACM SIGMOD Int. Conf. on Management
of Data, pp. 175–186, 1995.
[14] J. S. Park, M. S. Chen, and P. S. Yu, “Using A Hash-Based Method with Transaction
Trimming for Mining Association Rules,” IEEE Trans. on Knowledge and
Data Eng., Vol. 9, No. 5, pp. 813–825, Sept./Oct. 1997.
[15] W. C. Peng and M. S. Chen, “Developing Data Allocation Schemes by Incremental
Mining of User Moving Patterns in a Mobile Computing System,” IEEE
Trans. on Knowledge and Data Eng., Vol. 15, No. 1, pp. 70–84, Jan./Feb. 2003.
[16] S. M. Tseng and W. C. Chan, “Mining Complete User Moving Paths in a Mobile
Environment,” Int. Computer Symp., pp. 1–7, 2002.
[17] K. L. Wu, P. S. Yu, and A. Ballman, “SpeedTracer: A Web Usage Mining and
Analysis Tool,” IBM Systems Journal, Vol. 37, No. 1, pp. 89–105, Jan. 1998.
[18] Y. H. Wu, Y. H. Chen, and A. L. P. Chen, “Querying and Browsing the Resources
in Internet,” Int. Computer Symp., pp. 9–16, 1996.
[19] Y. Q. Xiao and M. H. Dunham, “Efficient Mining of Traversal Patterns,” Data
and Knowledge Eng., Vol. 39, No. 2, pp. 191–214, Nov. 2001.
[20] D. L. Yang, S. H. Yang, and M. C. Hong, “An Efficient Web Mining Algorithm
for Session Path Patterns,” Int. Computer Symp., pp. 107–112, 2000.
[21] S. J. Yen, Y. S. Lee, and C. H. Hsu, “Mining Frequent Traversal Patterns in a
Web Training Environment,” Proc. of National Computer Symp., pp. 105–116,
2001.
[22] C. H. Yun and M. S. Chen, “Mining Web Transaction Patterns in an Electronic
Commerce Environment,” Proc. of the 4th Pacific-Asia Conf. on Knowledge Discovery
and Data Mining, pp. 216–219, 2000.
[23] C. H. Yun and M. S. Chen, “Using Pattern-Join and Purchase-Combination for
Mining Web Transaction Patterns in an Electronic Commerce Environment,”
The 24th Annual Int. Computer Software and Applications Conf., pp. 99–104,
2000.
[24] O. R. Zaiane, M. Xin, and J. Han, “Discovering Web Access Patterns and Trends
by Applying OLAP and Data Mining Technology on Web Logs,” Proc. of Advances
in Digital Libraries Conf., pp. 19–29, 1998.
電子全文 Fulltext
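This electronic full text is licensed to users solely for academic research, for personal, non-profit searching, reading, and printing. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.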
論文使用權限 Thesis access permission:校內校外均不公開 not available
開放時間 Available:
校內 Campus:永不公開 not available
校外 Off-campus:永不公開 not available


紙本論文 Printed copies
開放時間 available 已公開 available
