博碩士論文 etd-0117106-112531 詳細資訊
Title page for etd-0117106-112531
論文名稱
Title
利用探勘技術辨認企業網路外部環境
Using Mining Techniques to Identify External Web Environment of Companies
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
63
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2005-07-07
繳交日期
Date of Submission
2006-01-17
關鍵字
Keywords
貝氏分類器、網頁內容分類、鏈結分析、外部網路環境
Naïve Bayesian Classifiers, Web Content Classification, Hyperlink Analysis, External Web Environment
統計
Statistics
本論文已被瀏覽 5795 次,被下載 12 次
The thesis/dissertation has been browsed 5795 times, has been downloaded 12 times.
中文摘要
  全球資訊網的蓬勃發展使得企業逐漸傾向利用其迅速、便利的特質來散佈商業訊息;同樣地,人們也習慣於在全球資訊網中搜尋所需的資訊。對於企業而言,全球資訊網可以幫助宣傳其所提供產品或服務,亦可以成為與外部環境接觸的一項新興管道。企業建構之商業網站所包含之資訊被視為其所擁有之虛擬資產,而在全球資訊網中存取此項虛擬資產者,即為該企業的外部網路環境。辨識企業外部網路環境有助於企業從策略規劃的角度上創造其商業價值。
  因此,藉由全球資訊網辨識企業之外部網路環境成為對企業的重要議題之一,本研究即針對此議題作深入之探討。從文獻中得知,網頁之間的鏈結架構可以幫助分析其間所存在的關係,本研究據此提出一個結合內容分類與網路鏈結之分類方法: CNB-HI,來幫助企業自動辨識其外部環境。
  本研究以一個實例來說明所提之分類器如何實際應用,幫助特定企業辨識其外部網路環境中之顧客、合作夥伴、媒體、協會組織等四種角色。本研究接著提出二個實驗來驗證所提出之方法。在第一個實驗中,我們比較不同形式的貝氏分類器的分類效果;而第二個實驗則在分類過程中加入了網頁間的鏈結資訊以檢驗其對分類結果的改善。結果顯示加入了網頁間的鏈結資訊後可以大幅提昇分類的正確性,因而驗證本研究所提方法的適用性。
Abstract
With the rapid growth of the World Wide Web, many companies disseminate information, such as introductions to their products and services, through their commercial Web sites. A company’s Web site is regarded as one of its business assets. Customers, suppliers, partners, associations, and other outsiders who access these assets through the Web constitute the company’s external Web environment. From a strategic planning point of view, identifying a company’s external environment helps create business value.
Therefore, this research focuses on helping a company identify its external Web environment using mining techniques. Several studies have pointed out that the hyperlink structure among Web pages can contribute to classifying the relationships within a company’s external environment. We therefore propose a classifier, CNB-HI, that combines Web content mining with hyperlink structure for this purpose.
We apply the proposed approach to a real case to help identify four roles: customers, partners, media, and associations. Two experiments are conducted to examine its performance. In the first experiment, we compare CNB with other forms of Naïve Bayesian classifiers and conclude that CNB achieves better performance. However, even CNB does not perform satisfactorily when it relies exclusively on content classification. The second experiment examines the benefit of incorporating hyperlink information (CNB-HI). The results show that the performance of CNB-HI improves markedly, which supports the feasibility of the proposed approach in real applications.
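The abstract names the two ingredients of the approach (a Complement Naïve Bayes content classifier, and hyperlink information) but does not give the formulas, so the following is only a minimal sketch under stated assumptions. `train_cnb` and `classify` follow the commonly described Complement Naïve Bayes rule, in which each class is scored by how poorly a page fits the word statistics of all other classes. `classify_with_links` is a purely hypothetical illustration of how labels of linked pages might be blended in; it is not the thesis’s CNB-HI, and the names `link_weight`, `neighbor_labels`, `DOCS`, and `LABELS` are invented for this example.

```python
import math
from collections import Counter

# Toy training data (invented for illustration only).
DOCS = [
    ["buy", "order", "price"],
    ["order", "price", "discount"],
    ["press", "news", "report"],
    ["news", "interview", "report"],
]
LABELS = ["customer", "customer", "media", "media"]


def train_cnb(docs, labels, alpha=1.0):
    """Estimate log word probabilities from the complement of each class."""
    classes = sorted(set(labels))
    vocab = sorted({w for doc in docs for w in doc})
    per_class = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        per_class[label].update(doc)
    total = Counter()
    for c in classes:
        total.update(per_class[c])
    grand_total = sum(total.values())

    weights = {}
    for c in classes:
        # Word counts over every document that is NOT in class c.
        comp_total = grand_total - sum(per_class[c].values())
        denom = comp_total + alpha * len(vocab)  # Laplace-style smoothing
        weights[c] = {
            w: math.log((total[w] - per_class[c][w] + alpha) / denom)
            for w in vocab
        }
    return weights


def content_scores(weights, doc):
    """Higher score = worse fit to the complement of c = better fit to c."""
    tf = Counter(doc)
    return {
        c: -sum(n * w_c[w] for w, n in tf.items() if w in w_c)
        for c, w_c in weights.items()
    }


def classify(weights, doc):
    """Content-only decision: pick the class with the highest score."""
    scores = content_scores(weights, doc)
    return max(scores, key=scores.get)


def classify_with_links(weights, doc, neighbor_labels, link_weight=0.5):
    """Hypothetical blend of content scores with labels of linked pages."""
    scores = content_scores(weights, doc)
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    votes = Counter(neighbor_labels)
    n = sum(votes.values())
    lw = link_weight if n else 0.0  # no linked pages: content only
    blended = {
        c: (1 - lw) * exp[c] / z + lw * (votes[c] / n if n else 0.0)
        for c in scores
    }
    return max(blended, key=blended.get)


if __name__ == "__main__":
    w = train_cnb(DOCS, LABELS)
    print(classify(w, ["price", "order"]))
    print(classify_with_links(w, ["price", "news"], ["media", "media", "customer"]))
```

In this toy setup a page containing only "price" and "news" is ambiguous on content alone, so the labels of its linked pages decide the outcome. That mirrors, in a very simplified way, the abstract’s observation that content classification alone was not satisfactory.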
目次 Table of Contents
CHAPTER 1 INTRODUCTION
1.1 Overview
1.2 Research Objective
1.3 Organization of the Thesis
CHAPTER 2 LITERATURE REVIEW
2.1 Web Communities
2.1.1 Definitions of Web Communities
2.1.2 Web Communities in Business Applications
2.2 Web Mining
2.2.1 Hyperlink Analysis
2.2.2 Web Content Classification
2.3 Bayesian Probabilistic Classifier
CHAPTER 3 PROPOSED APPROACH
3.1 Collecting Web Pages
3.2 Processing Web Pages
3.2.1 Scrutinizing Web Pages
3.2.2 Analyzing Contents
3.2.3 Building Hyperlink Structure
3.3 Constructing Classifier
3.3.1 Complement Naïve Bayes Classifier
3.3.2 CNB-HI Classifier
CHAPTER 4 EMPIRICAL EVALUATION
4.1 Experimental Design
4.1.1 Target Background
4.1.2 Data Collection and Analysis
4.1.3 Evaluation Method
4.2 Experiment I
4.2.1 Experimental Results
4.2.2 An Illustrated Example
4.3 Experiment II
4.3.1 Hyperlink Structure
4.3.2 Experimental Results
4.3.3 Illustrated Examples
CHAPTER 5 CONCLUSIONS
5.1 Concluding Remarks
5.2 Future Works
APPENDIX A: Partial Hyperlink Structure of The Target
REFERENCES
電子全文 Fulltext
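This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China: do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law.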
論文使用權限 Thesis access permission:校內公開,校外永不公開 restricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus:永不公開 not available


紙本論文 Printed copies
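Availability information for printed theses is relatively complete from ROC academic year 102 onward. To inquire about printed theses from ROC academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.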
開放時間 available 已公開 available
