論文使用權限 Thesis access permission:校內公開,校外永不公開 restricted
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus:永不公開 not available
論文名稱 Title |
利用探勘技術辨認企業網路外部環境 Using Mining Techniques to Identify External Web Environment of Companies |
||
系所名稱 Department |
|||
畢業學年期 Year, semester |
語文別 Language |
||
學位類別 Degree |
頁數 Number of pages |
63 |
|
研究生 Author |
|||
指導教授 Advisor |
|||
召集委員 Convenor |
|||
口試委員 Advisory Committee |
|||
口試日期 Date of Exam |
2005-07-07 |
繳交日期 Date of Submission |
2006-01-17 |
關鍵字 Keywords |
貝氏分類器、網頁內容分類、鏈結分析、外部網路環境 Naïve Bayesian Classifiers, Web Content Classification, Hyperlink Analysis, External Web Environment |
||
統計 Statistics |
本論文已被瀏覽 5795 次,被下載 12 次 The thesis/dissertation has been browsed 5795 times, has been downloaded 12 times. |
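中文摘要 Chinese Abstract (English translation) |
The rapid growth of the World Wide Web has led companies to rely increasingly on its speed and convenience to disseminate business information; likewise, people have become accustomed to searching the Web for the information they need. For a company, the Web can help promote the products or services it offers, and it can also serve as a new channel for contact with the external environment. The information contained in a company's commercial Web site is regarded as a virtual asset that the company owns, and those who access this virtual asset on the Web constitute the company's external Web environment. Identifying this environment helps a company create business value from a strategic planning perspective. Identifying a company's external Web environment through the Web has therefore become an important issue for companies, and this research investigates it in depth. The literature shows that the hyperlink structure among Web pages can help analyze the relationships among them. On this basis, this research proposes CNB-HI, a classification method that combines content classification with hyperlink structure, to help companies automatically identify their external environment. A real case illustrates how the proposed classifier is applied in practice to help a specific company identify four roles in its external Web environment: customers, partners, media, and associations. Two experiments are then conducted to validate the proposed method. The first compares the classification performance of different forms of Bayesian classifiers; the second adds hyperlink information between Web pages to the classification process to examine how it improves the results. The results show that adding hyperlink information substantially increases classification accuracy, which confirms the applicability of the proposed method. |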
Abstract |
With the rapid growth of the World Wide Web, many companies disseminate information, such as introductions to their products and services, through their commercial Web sites. A company’s Web site is regarded as one of its business assets. Customers, suppliers, partners, associations, and other outsiders who access these assets through the Web constitute the company’s external Web environment. From a strategic planning point of view, identifying a company’s external environment helps it create business value. This research therefore addresses the problem of helping a company identify its external Web environment using mining techniques. Several studies have pointed out that the hyperlink structure among Web pages can help classify the relationships within a company’s external environment. We therefore propose CNB-HI, a classifier that combines Web content mining with hyperlink structure. We apply the proposed approach to a real case to identify four roles: customers, partners, media, and associations. Two experiments are conducted to examine its performance. In the first, we compare Complement Naïve Bayes (CNB) with other forms of Naïve Bayesian classifiers and find that CNB performs better. However, even CNB does not perform satisfactorily when it relies on content classification alone. The second experiment examines the benefit of incorporating hyperlink information (CNB-HI). The results show that CNB-HI improves performance markedly, which supports the feasibility of the proposed approach in real applications. |
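For readers unfamiliar with the content classifier named in the abstract, the following is a minimal sketch of a generic Complement Naïve Bayes text classifier, in which each class is scored by how poorly the word statistics of all other classes explain a document. It is an illustration under stated assumptions, not the thesis's implementation: the function names, the smoothing value `alpha`, and the toy tokens are hypothetical, and the hyperlink step of CNB-HI is not sketched because this record does not specify it.

```python
import math
from collections import Counter, defaultdict


def train_cnb(docs, labels, alpha=1.0):
    """Return, for each class, log-weights estimated from the OTHER classes.

    docs is a list of token lists; labels is a parallel list of class names.
    alpha is an assumed Laplace smoothing constant.
    """
    vocab = sorted({t for tokens in docs for t in tokens})
    class_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        class_counts[label].update(tokens)

    # Term counts over the whole training set.
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)

    weights = {}
    for label, counts in class_counts.items():
        # Complement counts: occurrences of each term outside `label`.
        comp = {t: total[t] - counts[t] for t in vocab}
        denom = sum(comp.values()) + alpha * len(vocab)
        weights[label] = {t: math.log((comp[t] + alpha) / denom) for t in vocab}
    return weights


def classify_cnb(weights, tokens):
    """Pick the class whose complement fits the document worst."""
    tf = Counter(tokens)

    def complement_score(label):
        w = weights[label]
        # Terms unseen in training are ignored.
        return sum(n * w[t] for t, n in tf.items() if t in w)

    return min(weights, key=complement_score)


# Hypothetical toy data: tokens taken from pages of three kinds of outsiders.
toy_docs = [
    ["buy", "order", "price"], ["order", "discount", "buy"],
    ["alliance", "joint", "supplier"], ["supplier", "contract", "joint"],
    ["news", "report", "press"], ["press", "news", "interview"],
]
toy_labels = ["customer", "customer", "partner", "partner", "media", "media"]
toy_weights = train_cnb(toy_docs, toy_labels)
print(classify_cnb(toy_weights, ["order", "price", "buy"]))
```

Estimating weights from the complement of each class draws on more training text per class than ordinary multinomial Naïve Bayes does, which is the usual motivation for the technique when classes are unevenly represented.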
目次 Table of Contents |
CHAPTER 1 INTRODUCTION
1.1 Overview
1.2 Research Objective
1.3 Organization of the Thesis
CHAPTER 2 LITERATURE REVIEW
2.1 Web Communities
2.1.1 Definitions of Web Communities
2.1.2 Web Communities in Business Applications
2.2 Web Mining
2.2.1 Hyperlink Analysis
2.2.2 Web Content Classification
2.3 Bayesian Probabilistic Classifier
CHAPTER 3 PROPOSED APPROACH
3.1 Collecting Web Pages
3.2 Processing Web Pages
3.2.1 Scrutinizing Web Pages
3.2.2 Analyzing Contents
3.2.3 Building Hyperlink Structure
3.3 Constructing Classifier
3.3.1 Complement Naïve Bayes Classifier
3.3.2 CNB-HI Classifier
CHAPTER 4 EMPIRICAL EVALUATION
4.1 Experimental Design
4.1.1 Target Background
4.1.2 Data Collection and Analysis
4.1.3 Evaluation Method
4.2 Experiment I
4.2.1 Experimental Results
4.2.2 An Illustrated Example
4.3 Experiment II
4.3.1 Hyperlink Structure
4.3.2 Experimental Results
4.3.3 Illustrated Examples
CHAPTER 5 CONCLUSIONS
5.1 Concluding Remarks
5.2 Future Works
APPENDIX A: Partial Hyperlink Structure of The Target
REFERENCES |
電子全文 Fulltext |
This electronic full text is licensed to users solely for academic research, for personal, non-profit searching, reading, and printing. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid breaking the law. |
紙本論文 Printed copies |
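Availability information for printed theses is relatively complete from academic year 102 onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services. We apologize for any inconvenience. 開放時間 Available: 已公開 available |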