博碩士論文 etd-0707102-171907 詳細資訊
Title page for etd-0707102-171907
論文名稱
Title
XML文件之索引方法之設計與製作
Design and Implementation of Indexing Strategies for XML Documents
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
89
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2002-06-21
繳交日期
Date of Submission
2002-07-07
關鍵字
Keywords
可延伸標示語言、資料交換、索引、全球資訊網、關聯式資料庫
XML, Index, Data exchange, Relational database, WWW
統計
Statistics
本論文已被瀏覽 5656 次,被下載 1860
The thesis/dissertation has been browsed 5656 times and downloaded 1860 times.
中文摘要
在最近幾年,很多人利用全球資訊網 (World Wide Web) 網際網路 (Internet) 去找他們所想要的資訊。超文件標示語言(HTML)是一種用於發行超文件的文件標示語言,同時它也是一種對於世界上的內容發展者的目標格式。基本上,超文件標示語言最主要的貢獻在於描述如何去展示一個資料的項目。因此,很難從HTML文件中找到有用的資訊。這是因為HTML文件是將內容與展示用標籤混雜在一起。可延伸標示語言(eXtensible Markup Language) 則是另一種在網際網路及企業內部應用之間作為資料交換和應用的格式。為了能夠幫助資料的交換,企業的夥伴定義了共同XML文件的文件型別定義(DTD)來做為他們的應用所需的文件交換。而且,受歡迎的WWW/EDI、電子商務和許許多多的商業資料都使用XML在WWW上做資料的交換。基本上,XML可以描述自身資料的意義,同時,XML文件的內容是和展示的格式分開的,所以很容易從中找到有意義的資訊並且能夠進一步去分析它。當大量商業資料存在時,對於支援XML文件的管理的方法之一,是運用關聯式資料庫。對於這種方法,我們必須要能夠將XML文件轉換到關聯式資料庫內。在這一篇論文中,我們設計與實作索引方法來有效的存取XML文件。XML文件在根本上不同於關聯式資料。XML是階層與巢狀的文件,它是非常類似於半結構化資料模型。半結構化資料的特性是在於它沒有固定的綱要和它可能是不具規則或是不完整的。因為,半結構化資料模型是彈性的,所以在查詢處理上它需要大量搜尋空間,因為它沒有固定的綱要。索引是有效改善查詢效能的方法之一。鑒於XML的半結構化資料特性,我們歸納出五種查詢的型態:(1)完整而單一路徑,(2)特定樹葉節點,(3)特定內部路徑,(4)特定屬性/元素(值),(5)在同一層中多個路徑。在這一篇論文中,我們將所有可能的查詢歸納成這五種查詢型態。接著,我們對於不同的查詢型態建立索引。除此之外,我們也設計與實作了從XML查詢語句到SQL語句的查詢轉換。我們設計了一個容易使用的使用者介面來輸入XML查詢語句。這整個系統是用JAVA程式語言實作而後端的資料庫使用 SQL Server 2000。從我們的實驗顯示中,我們的索引方法可以有效地改善XML查詢處理效能。
Abstract
In recent years, many people have used the World Wide Web and the
Internet to find the information they want. HTML is a document
markup language for publishing hypertext on the WWW, and it has
been the target format for content developers around the world.
HTML tags serve the primary purpose of describing how to display a
data item. It is therefore difficult to find useful information in
HTML documents, because they mix content with display tags. XML,
on the other hand, is another data format, used for data exchange
between inter-enterprise applications on the Internet. To
facilitate data exchange, industry groups define public Document
Type Definitions (DTDs) that specify the format of the XML
documents to be exchanged between their applications. Moreover,
WWW/EDI and electronic commerce are very popular, and a lot of
business data is exchanged in XML on the World Wide Web. XML tags
describe the data itself, and the content (meaning) of an XML
document is separated from its display format, so it is easy to
find meaningful information in XML documents and to analyze it.
When a large volume of business data (XML documents) exists, one
way to support the management of the XML documents is to use a
relational database. For such an approach, we must transform the
XML documents into relational data. In this thesis, we design and
implement indexing strategies to access XML documents efficiently.
An XML document is fundamentally different from relational data:
it is hierarchical and nested, and it is very similar to the
semistructured data model. The characteristic of semistructured
data is that it may not have a fixed schema and may be irregular
or incomplete. Although the semistructured data model is flexible
in data modeling, it requires a large search space in query
processing, since no schema is fixed in advance. Indexing is one
way to improve query performance. Given the special properties of
semistructured data, we identify five types of queries: (1)
complete single path, (2) specified leaf only, (3) specified
intrapath, (4) specified attribute/element (value), and (5)
multiple paths at the same level. In this thesis, we classify all
possible queries into these five query types and create different
indexes for the different query types. Moreover, we design and
implement the query transformation from XML query statements to
SQL statements, and we create a user-friendly interface for users
to input XML query statements. The whole system is implemented in
Java, with SQL Server 2000 as the back-end database. Our
experiments show that our indexing strategies can effectively
improve XML query processing performance.
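As a rough illustration of the approach the abstract describes (XML stored in a relational database, indexes chosen per query type, and path queries rewritten as SQL), the following is a minimal sketch and not the thesis's actual design. Everything in it is an assumption made for illustration: the node table and its columns, the two indexes, the sample document, the function names, the reading of query types (1), (2), and (4), and the use of Python with SQLite in place of Java and SQL Server 2000.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document; the thesis's Movie.xml is not reproduced here.
MOVIE_XML = """
<movies>
  <movie>
    <title>Alpha</title>
    <director>Lee</director>
  </movie>
  <movie>
    <title>Beta</title>
    <director>Chen</director>
  </movie>
</movies>
"""


def shred(xml_text, conn):
    """Store one row per element: id, parent id, root-to-node path, tag, text."""
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent INTEGER,"
        " path TEXT, tag TEXT, value TEXT)"
    )
    # Assumed indexes: one keyed on the full path, one on (tag, value).
    conn.execute("CREATE INDEX idx_path ON node (path)")
    conn.execute("CREATE INDEX idx_tag_value ON node (tag, value)")

    counter = [0]

    def walk(elem, parent_id, parent_path):
        counter[0] += 1
        node_id = counter[0]  # ids follow document order
        path = parent_path + "/" + elem.tag
        text = (elem.text or "").strip()
        conn.execute(
            "INSERT INTO node VALUES (?, ?, ?, ?, ?)",
            (node_id, parent_id, path, elem.tag, text),
        )
        for child in elem:
            walk(child, node_id, path)

    walk(ET.fromstring(xml_text), None, "")


def complete_path_query(path):
    """Assumed reading of query type (1): every step from the root is given."""
    return "SELECT value FROM node WHERE path = ? ORDER BY id", (path,)


def leaf_query(tag):
    """Assumed reading of query type (2): only the last element name is given."""
    return "SELECT value FROM node WHERE tag = ? ORDER BY id", (tag,)


def value_query(tag, value):
    """Assumed reading of query type (4): an element name together with its value."""
    return (
        "SELECT path FROM node WHERE tag = ? AND value = ? ORDER BY id",
        (tag, value),
    )


def run(conn, query):
    """Execute a (sql, params) pair and return the first column of each row."""
    sql, params = query
    return [row[0] for row in conn.execute(sql, params)]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    shred(MOVIE_XML, conn)
    print(run(conn, complete_path_query("/movies/movie/title")))
    print(run(conn, leaf_query("director")))
    print(run(conn, value_query("title", "Beta")))
```

In this sketch each query type becomes a different WHERE clause, so each can be served by a different index: the complete-path query uses the path index, while the leaf and value queries use the (tag, value) index. Query types (3) and (5) are left out because the abstract does not describe them in enough detail to sketch.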
目次 Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
1.1 XML
1.2 Query Languages
1.3 XML-QL
1.4 XSL Patterns
1.5 XQuery
1.6 Motivations
1.7 Organization of the Thesis
2. A Survey
2.1 The Object Exchange Model
2.2 Extracting Indexing Information from XML DTDs
2.2.1 Key Ideas
2.2.2 Classification of DTD Elements
2.2.3 DTD Automata
2.3 DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases
2.3.1 Existence of Multiple DataGuides
2.4 Four Different Types of Index Structures
2.4.1 Value Index (Vindex)
2.4.2 Text Index (Tindex)
2.4.3 Link Index (Lindex)
2.4.4 Path Index (Pindex)
3. Query Types
4. Query Transformation
4.1 System Architecture
4.2 System Flowchart
4.3 DTD of Movie.xml
4.4 Constructing an XML Query
4.5 Constructing a SQL Query
4.6 Examples
5. The Implementation of Index Strategies
5.1 Index Constructing
5.1.1 Constructing the Value Index
5.1.2 Constructing the Text Index
5.1.3 Constructing the Link Index
5.1.4 Constructing the Path Index
5.2 Query Processing by Indexes
5.3 Performance Study
5.3.1 The Performance Model
5.3.2 Simulation Results
6. Conclusion
6.1 Summary
6.2 Further Research Work
BIBLIOGRAPHY
參考文獻 References
BIBLIOGRAPHY
[1] "XSL Documentation: XQL Users Guide"
http://www.cuesoft.com/docs/cuexsl-activex/xql-users-guide.htm.
[2] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom, "Object Exchange
Across Heterogeneous Information Sources", Proceedings of the Eleventh
International Conference on Data Engineering, pp. 251-260, 1995.
[3] Jason McHugh, Jennifer Widom, Serge Abiteboul, Qingshan Luo, and Anand
Rajaraman, "Indexing Semistructured Data," http://www-db.standford.edu/lore.
[4] Chia-He Lee, "Design and Implementation of a Mapping Technique Between
XML Documents and Relational Databases," Master Thesis, Receipt of National
Sun Yat-sen University, June, 2001.
[5] Roy Goldman, and Jennifer Widom, "DataGuides: Enabling Query Formulation
and Optimization in Semistructured Databases," Proceedings of the 23rd VLDB
Conference, pp. 436-445, 1997.
[6] Alin Deutsch, Mary Fernandez, and Dan Suciu, "Storing Semistructured Data
with STORED," AT&T Labs-Research, 1998.
[7] J. McHugh and J. Widom, "Query optimization in semistructured data,"
Technical report, Stanford University Database Group, 1997. Available at
http://www-db.standford.edu/pub/papers/qo.ps.
[8] Don Chamberlin, James Clark, Daniela Florescu, Jonathan Robie, and Mugur
Stefanescu, "XQuery 1.0: An XML Query Language," W3C Working Draft, 7
June 2001, http://www.w3.org/TR/xquery.
[9] Don Chamberlin, Jonathan Robie, and Daniela Florescu, "Quilt: an XML
Query Language for Heterogeneous Data Sources," In Lecture Notes in
Computer Science, Springer-Verlag, pp. 199-234, Dec, 2001. Also available at
http://www.almaden.ibm.com/cs/people/chamberlin/quilt lncs.pdf.
[10] World Wide Web Consortium, "XML Path Language (XPath) Version 1.0," W3C
Recommendation, Nov. 16, 1999. See http://www.w3.org/TR/xpath.xml.
[11] J. Robie, J. Lapp, and D. Schach, "XML Query Language (XQL)," See
http://www.w3.org/TandS/QL/QL98/pp/xql.html.
[12] International Organization for Standardization (ISO). Information
Technology-Database Language SQL. Standard No. ISO/IEC 9075:1999. (Available
from American National Standards Institute, New York, NY 10036, (212) 642-4900.)
[13] Rick Cattell et al, "The Object Database Standard: ODMG-93, Release 1.2."
Morgan Kaufmann Publishers, San Francisco, 1996.
[14] Serge Abiteboul, Dallan Quass, Jason McHugh, Jennifer Widom, and Janet L.
Wiener. "The Lorel Query Language for Semistructured Data," International
Journal on Digital Libraries, Vol. 1, No. 1, pp. 68-88, April 1997. See
http://www-db.standford.edu/ widom/pubs.html.
[15] S. Cluet, S. Jacqmin, and J. Simeon. "The New YATL: Design and
Specifications," Technical Report, INRIA, 1999.
[16] Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu,
"XML-QL: A Query Language for XML,"
http://www.w3.org/TR/1998/NOTE-xml-ql-19980819.
[17] Y. Papakonstantinou, P. Velikhov, "Enhancing Semistructured Data Mediators
with Document Type Definitions," in: Proceedings of International Conference
on Data Engineering, pp. 251-260, 1999.
[18] Tae-Sun Chung and Hyoung-Joo Kim, "Extracting Indexing Information from
XML DTDs," Information Processing Letters, 81, pp. 97-103, 2002.
[19] S. Nestorov, J. Ullman, J. Wiener, and S. Chawathe, "Representative Objects:
Concise Representations of Semistructured, Hierarchical Data," Proceedings of
the Thirteenth International Conference on Data Engineering, pp. 79-90, 1997.
[20] J. Hopcroft, "An n log n Algorithm for Minimizing the States in a Finite
Automaton," The Theory of Machines and Computations, Academic Press, NY, pp.
189-196, 1971.
[21] Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu,
"XML-QL: A Query Language for XML," http://www.w3.org/TR/NOTE-xml-ql.
電子全文 Fulltext
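This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as not to violate the law.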
論文使用權限 Thesis access permission:校內外都一年後公開 withheld (released both on and off campus after one year)
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available


紙本論文 Printed copies
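Availability information for printed theses is relatively complete from academic year 102 onward. To inquire about the availability of printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.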
開放時間 available 已公開 available
