National Sun Yat-sen University (國立中山大學), thesis/dissertation: Design and Implementation of Indexing Strategies for XML Documents (XML文件之索引方法之設計與製作)

Title	Design and Implementation of Indexing Strategies for XML Documents (XML文件之索引方法之設計與製作)
Department	Department of Computer Science and Engineering
Year, semester	Second (spring) semester of academic year 90 (ROC calendar, 2001-2002)
Language	English
Degree	Master
Number of pages	89
Author	Mao-Tong Lin (林茂桐)
Advisor	Ye-In Chang (張玉盈)
Convenor	Tei-Wei Kuo (郭大維)
Advisory Committee	Chien-I Lee (李建億)
Date of Exam	2002-06-21
Date of Submission	2002-07-07
Keywords	XML, Index, Data exchange, Relational database, WWW
Statistics	The thesis/dissertation has been browsed 5656 times and downloaded 1860 times.

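Chinese Abstract
In recent years, many people have used the World Wide Web and the Internet to find the information they want. HyperText Markup Language (HTML) is a document markup language for publishing hypertext, and it is also the target format for content developers around the world. The main contribution of HTML lies in describing how a data item is displayed. It is therefore hard to find useful information in HTML documents, because they mix content with presentation tags. The eXtensible Markup Language (XML), by contrast, is a format for data exchange between Internet and intra-enterprise applications. To facilitate data exchange, business partners define common Document Type Definitions (DTDs) for the XML documents that their applications exchange. Moreover, the popular WWW/EDI, electronic commerce, and a great deal of business data use XML to exchange data on the WWW. XML describes the meaning of its own data, and the content of an XML document is separate from its presentation format, so meaningful information is easy to find and to analyze further. When a large volume of business data exists, one way to support the management of XML documents is to use a relational database. For this approach, we must be able to transform XML documents into the relational database. In this thesis, we design and implement indexing methods to access XML documents efficiently. XML documents are fundamentally different from relational data. They are hierarchical and nested, which closely resembles the semistructured data model. Semistructured data is characterized by having no fixed schema and by possibly being irregular or incomplete. Because the semistructured data model is flexible and has no fixed schema, query processing requires a large search space. Indexing is one effective way to improve query performance. Given the semistructured nature of XML, we identify five query types: (1) complete single path, (2) specified leaf node, (3) specified internal path, (4) specified attribute/element (value), and (5) multiple paths at the same level. In this thesis, we classify all possible queries into these five types and then build indexes for the different query types. In addition, we design and implement query transformation from XML query statements to SQL statements, and we design an easy-to-use user interface for entering XML query statements. The whole system is implemented in the Java programming language, with SQL Server 2000 as the back-end database. Our experiments show that our indexing methods effectively improve XML query processing performance.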
Abstract
In recent years, many people have used the World Wide Web and the Internet to find the information they want. HTML is a document markup language for publishing hypertext on the WWW, and it has been the target format for content developers around the world. HTML tags serve primarily to describe how a data item is displayed. It is therefore difficult to find useful information in HTML documents, because they mix content with display tags. XML, on the other hand, is another data format, used for data exchange between inter-enterprise applications on the Internet. To facilitate data exchange, industry groups define public Document Type Definitions (DTDs) that specify the format of the XML documents to be exchanged between their applications. Moreover, WWW/EDI and electronic commerce are very popular, and a large amount of business data is exchanged as XML on the World Wide Web. XML tags describe the data itself, and the content (meaning) of an XML document is separate from its display format, so meaningful information can easily be found in XML documents and analyzed. When a large volume of business data (XML documents) exists, one way to support the management of the XML documents is to use a relational database. For such an approach, we must transform the XML documents into the relational database. In this thesis, we design and implement indexing strategies to access XML documents efficiently. An XML document is fundamentally different from relational data: it is hierarchical and nested, and thus very similar to the semistructured data model. Semistructured data is characterized by the fact that it may not have a fixed schema and may be irregular or incomplete. Although the semistructured data model is flexible in data modeling, it requires a large search space in query processing, since no schema is fixed in advance.
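As a rough illustration of the relational approach described in this paragraph, the following Python sketch stores each element of a small XML document as one row holding its parent, tag, root-to-node path, and text. The table layout, the `shred` helper, and the sample movie document are assumptions made for this example only, since the abstract does not give the thesis's mapping schema, and SQLite stands in for SQL Server 2000.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document; not taken from the thesis.
# The second movie has no director, which shows the irregular structure
# that semistructured data allows.
SAMPLE = """<movies>
  <movie year="1999">
    <title>The Matrix</title>
    <director>Wachowski</director>
  </movie>
  <movie>
    <title>Alien</title>
  </movie>
</movies>"""


def shred(xml_text, conn):
    """Store every element as one row: (id, parent_id, tag, path, text).

    Attributes are ignored to keep the sketch short.
    """
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
        "tag TEXT, path TEXT, text TEXT)"
    )
    next_id = 0

    def visit(elem, parent_id, parent_path):
        nonlocal next_id
        next_id += 1
        node_id = next_id
        path = parent_path + "/" + elem.tag
        # Whitespace-only text (indentation between tags) is stored as NULL.
        text = (elem.text or "").strip() or None
        conn.execute(
            "INSERT INTO node VALUES (?, ?, ?, ?, ?)",
            (node_id, parent_id, elem.tag, path, text),
        )
        for child in elem:
            visit(child, node_id, path)

    visit(ET.fromstring(xml_text), None, "")


conn = sqlite3.connect(":memory:")
shred(SAMPLE, conn)
rows = conn.execute("SELECT path, text FROM node ORDER BY id").fetchall()
for path, text in rows:
    print(path, text)
```

Keeping the full path in every row trades storage for simpler path lookups; a real mapping might instead normalize paths into a separate table.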
Indexing is one way to improve query performance. Owing to the special properties of semistructured data, we distinguish five types of queries: (1) complete single path, (2) specified leaf only, (3) specified intrapath, (4) specified attribute/element (value), and (5) multiple paths at the same level. In this thesis, we classify all possible queries into these five query types and create different indexes for different query types. Moreover, we design and implement the query transformation from XML query statements to SQL statements, and we provide a user-friendly interface for users to input XML query statements. The whole system is implemented in Java, with SQL Server 2000 as the back-end database. Our experiments show that our indexing strategies can effectively improve XML query processing performance.
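The five query types and the XML-to-SQL transformation can be pictured with a small sketch. It gives one plausible reading of each type over a hypothetical `node(id, parent_id, tag, path, text)` table, with ordinary B-tree indexes standing in for the thesis's index structures. The table, the sample rows, the `to_sql` and `run` helpers, and the query-type names are all assumptions for illustration, not the thesis's query language or implementation, and SQLite stands in for SQL Server 2000.

```python
import sqlite3

# Hypothetical rows (id, parent_id, tag, path, text); not data from the thesis.
ROWS = [
    (1, None, "movies", "/movies", None),
    (2, 1, "movie", "/movies/movie", None),
    (3, 2, "title", "/movies/movie/title", "The Matrix"),
    (4, 2, "director", "/movies/movie/director", "Wachowski"),
    (5, 1, "movie", "/movies/movie", None),
    (6, 5, "title", "/movies/movie/title", "Alien"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
    "tag TEXT, path TEXT, text TEXT)"
)
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", ROWS)
# Ordinary B-tree indexes, used here as stand-ins for path and value indexes.
conn.execute("CREATE INDEX idx_path ON node (path)")
conn.execute("CREATE INDEX idx_tag_text ON node (tag, text)")


def to_sql(kind, *args):
    """Translate one query of the given type into (sql, params)."""
    if kind == "complete_path":    # (1) full path from the root
        return "SELECT text FROM node WHERE path = ? ORDER BY id", args
    if kind == "leaf_only":        # (2) only the final tag is given
        return "SELECT text FROM node WHERE tag = ? ORDER BY id", args
    if kind == "intrapath":        # (3) a path fragment anywhere in the path
        return (
            "SELECT text FROM node WHERE (path || '/') LIKE ? ORDER BY id",
            ("%/" + args[0] + "/%",),
        )
    if kind == "element_value":    # (4) an element with a given value
        return "SELECT id FROM node WHERE tag = ? AND text = ?", args
    if kind == "same_level":       # (5) two paths under the same parent
        return (
            "SELECT a.text, b.text FROM node a JOIN node b "
            "ON a.parent_id = b.parent_id "
            "WHERE a.path = ? AND b.path = ? ORDER BY a.id",
            args,
        )
    raise ValueError("unknown query type: " + kind)


def run(kind, *args):
    """Build the SQL for one query and execute it."""
    sql, params = to_sql(kind, *args)
    return conn.execute(sql, params).fetchall()


print(run("complete_path", "/movies/movie/title"))
print(run("element_value", "title", "Alien"))
print(run("same_level", "/movies/movie/title", "/movies/movie/director"))
```

In this sketch the same-level query returns only the movie that has both a title and a director, because the join requires the two nodes to share a parent. Note that `LIKE` treats `_` as a wildcard, so tag names containing underscores would need escaping.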

Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
  1.1 XML
  1.2 Query Languages
  1.3 XML-QL
  1.4 XSL Patterns
  1.5 XQuery
  1.6 Motivations
  1.7 Organization of the Thesis
2. A Survey
  2.1 The Object Exchange Model
  2.2 Extracting Indexing Information from XML DTDs
    2.2.1 Key Ideas
    2.2.2 Classification of DTD Elements
    2.2.3 DTD Automata
  2.3 DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases
    2.3.1 Existence of Multiple DataGuides
  2.4 Four Different Types of Index Structures
    2.4.1 Value Index (Vindex)
    2.4.2 Text Index (Tindex)
    2.4.3 Link Index (Lindex)
    2.4.4 Path Index (Pindex)
3. Query Types
4. Query Transformation
  4.1 System Architecture
  4.2 System Flowchart
  4.3 DTD of Movie.xml
  4.4 Constructing an XML Query
  4.5 Constructing a SQL Query
  4.6 Examples
5. The Implementation of Index Strategies
  5.1 Index Constructing
    5.1.1 Constructing the Value Index
    5.1.2 Constructing the Text Index
    5.1.3 Constructing the Link Index
    5.1.4 Constructing the Path Index
  5.2 Query Processing by Indexes
  5.3 Performance Study
    5.3.1 The Performance Model
    5.3.2 Simulation Results
6. Conclusion
  6.1 Summary
  6.2 Further Research Work
BIBLIOGRAPHY


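Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the relevant provisions of the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid breaking the law.
Thesis access permission: withheld for one year, then open both on and off campus
Available: Campus: available; Off-campus: available
etd-0707102-171907.pdf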
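Printed copies
Availability information for printed theses is relatively complete from academic year 102 onward. For availability information on printed theses from academic year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available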
