國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,新樹枝連結演算法結合結構對XML文件有效率的檢索,The Novel Twig-Join Algorithm with Structure for Efficient Retrieval of XML Documents

論文名稱 Title	新樹枝連結演算法結合結構對XML文件有效率的檢索 The Novel Twig-Join Algorithm with Structure for Efficient Retrieval of XML Documents
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	104 學年度第 2 學期 The spring semester of Academic Year 104	語文別 Language	英文 English
學位類別 Degree	博士 Ph.D.	頁數 Number of pages	75
研究生 Author	龔奕瑋 Yi-wei Kung
指導教授 Advisor	李宗南 Chungnan Lee
召集委員 Convenor	郭耀煌 Yau-Hwang Kuo
口試委員 Advisory Committee	陳培殷, 張旭光, 丁川康, 楊昌彪, 賴威光, 殷堂凱 Pei-Yin Chen; Hsu-Kuang Chang; Chuan-Kang Ting; Chang-Biau Yang; Wei-Kuang Lai; Tang-Kai Yin
口試日期 Date of Exam	2016-06-27	繳交日期 Date of Submission	2016-08-04
關鍵字 Keywords	AL關係列表、TJSwift、模糊不清的查詢、eXTJSwift、查詢多樣化、結構摘要樹、查詢樹、XML XML, ambiguity, QTP, structural summary tree, AL list, TJSwift, eXTJSwift, query diversification
統計 Statistics	本論文已被瀏覽 5742 次，被下載 31 次 The thesis/dissertation has been browsed 5742 times, has been downloaded 31 times.

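中文摘要 Chinese Abstract (English translation)
In recent years, eXtensible Markup Language (XML) has become the standard format for data representation and data exchange on the Internet, and the demand for querying XML documents keeps increasing. However, as data volumes keep growing and documents become ever larger, querying efficiently and obtaining correct results has become quite difficult. Research on XML documents mainly exploits their structural properties, using a query tree pattern (QTP) to find the matching nodes in a document. To obtain the matching results, however, many useless nodes are usually queried, which makes the computation time consuming. Moreover, the structural properties give rise to ambiguous queries, which blur the query results and reduce their correctness. How to improve query efficiency and obtain more precise results is therefore an important issue. To overcome the time-consumption problem, we propose a novel twig-join algorithm (TJSwift). It first optimizes the XML document with the structural summary tree (SST) algorithm, whose purpose is to eliminate unnecessary paths, including nested and duplicate path structures. It then builds lists of adjacent linked (AL) relationships from the optimized structure, fully preserving the nodes in the structure and the information of each related hierarchical level, and uses indexing to improve the efficiency of querying XML documents and thereby reduce time-consuming computation. At the same time, to the best of our knowledge, research is still insufficient on query diversification, target hierarchical level differences of a single node, and queries with ambiguous hierarchical levels. We believe these problems are crucial to the relevance and efficiency of query results. We therefore further propose the extended twig-join (eXTJSwift) algorithm and define these problems in order to achieve more effective query capability. With the eXTJSwift algorithm, queries over XML documents become more precise and richer. To evaluate the proposed TJSwift and eXTJSwift methods, we compare them with existing twig-join algorithms, namely TwigStack, TwigList and TJFast, in two different sets of experiments. The TJSwift experiments evaluate execution time, the change in time with document size, and the number of nodes read during path matching. The eXTJSwift experiments, in addition to execution time and number of matches, use rich query conditions to compare precision on query diversification, target hierarchical level differences of a single node, and ambiguous hierarchical levels. The TJSwift experiments show that, for the same query results, the proposed method not only outperforms the other algorithms in query time but is also more efficient than TJFast, an algorithm designed primarily for performance. Likewise, the eXTJSwift results show that our approach performs better in the precision comparison.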
Abstract
In recent years, XML (eXtensible Markup Language) has become the standard format for data representation and data exchange on the Internet. Demand for querying XML data has grown steadily; however, as the volume of data grows, it becomes harder to query efficiently and to obtain precisely the required results. The main operation in XML query processing is finding the nodes in a document that match a given query tree pattern (QTP). The problem is that matching a query pattern often accesses many useless nodes, which is very time consuming. Moreover, because XML documents are organized by structure, structural ambiguity in a query can produce results that lack clarity. Therefore, providing an efficient query service, based on a representation that supports query diversification and resolves ambiguity so as to achieve high-precision search, is an important issue. To overcome the time-consumption problem, we use the structural summary tree (SST) algorithm to optimize XML documents; the aim is to eliminate unnecessary paths, including nested structures and duplicate paths. We propose the novel twig-join Swift (TJSwift) algorithm, which is associated with adjacent linked (AL) lists, to provide efficient XML query services in which queries can be versatile in terms of predicates. It completely preserves hierarchical information, and the new index generated from the SST stores semantic information in order to provide template-based indexing for fast data searches. At the same time, to the best of our knowledge, research remains insufficient on three issues: query diversification, queries on a single node with hierarchical level differences, and intermediate nodes with ambiguous hierarchical levels. These issues are crucial to the relevance and effectiveness of query results.
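The idea of removing duplicate paths can be illustrated with a small sketch. This is not the SST algorithm from the thesis, whose details are not given in this record; it only collapses repeated root-to-node label paths into one entry each, and it does not attempt the nested-structure elimination mentioned above. The function name `summarize_paths` and the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def summarize_paths(xml_text):
    """Map each distinct root-to-node label path to its occurrence count.

    Illustrative only: a simple path summary, not the thesis's SST.
    """
    root = ET.fromstring(xml_text)
    seen = {}  # dicts keep insertion order, so paths appear as first seen

    def walk(node, prefix):
        path = prefix + (node.tag,)
        seen[path] = seen.get(path, 0) + 1
        for child in node:
            walk(child, path)

    walk(root, ())
    return seen

# Hypothetical sample document with repeated book/title/author paths.
DOC = """<bib>
  <book><title>A</title><author>X</author><author>Y</author></book>
  <book><title>B</title><author>Z</author></book>
</bib>"""

summary = summarize_paths(DOC)
for path, count in summary.items():
    print("/".join(path), count)
```

In this sample, eight element nodes reduce to four distinct label paths, which gives a rough sense of why a summary structure can shrink the search space for path matching.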
Hence, we further propose the extended twig-join Swift (eXTJSwift) algorithm, also associated with AL lists, to provide efficient XML query services in which queries can be versatile in terms of predicates. To compare the TJSwift and eXTJSwift approaches with TwigStack, TwigList and TJFast, we conducted two sets of performance evaluations. TJSwift was evaluated in terms of total execution time, scalability, and number of elements read, which indicates how many nodes must be read in a matching process. eXTJSwift was evaluated in terms of total execution time and number of paths matched, and various query criteria were added to compare precision with respect to query diversification, target hierarchical level, and the problem of ambiguity. The TJSwift experiments show that the algorithm not only satisfies the queries but is also more time-efficient than existing twig-join algorithms such as TJFast. Similarly, eXTJSwift achieved better accuracy than the other approaches in terms of query diversification, target hierarchical level, and the problem of ambiguity.
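To make the "number of elements read" metric concrete, the sketch below runs a deliberately naive twig-pattern matcher and counts every node it inspects. It is not TJSwift, eXTJSwift, TwigStack, TwigList or TJFast; the pattern encoding (a tag plus a list of `"/"` or `"//"` branches), the names `matches` and `find_twig`, and the sample document are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

class ReadCounter:
    """Counts how many element nodes the matcher inspects."""
    def __init__(self):
        self.reads = 0

def matches(node, pattern, counter):
    """True if `pattern` embeds in the document with its root at `node`.

    pattern = (tag, [(axis, subpattern), ...]); axis "/" means child,
    "//" means descendant. Naive brute-force search, for illustration.
    """
    tag, branches = pattern
    counter.reads += 1
    if node.tag != tag:
        return False
    for axis, sub in branches:
        if axis == "/":
            candidates = list(node)  # direct children only
        else:
            candidates = [d for d in node.iter() if d is not node]
        if not any(matches(c, sub, counter) for c in candidates):
            return False
    return True

def find_twig(root, pattern):
    """Return (nodes matching the pattern root, total nodes read)."""
    counter = ReadCounter()
    hits = [n for n in root.iter() if matches(n, pattern, counter)]
    return hits, counter.reads

# Hypothetical sample document.
DOC = """<bib>
  <book><title>A</title><author>X</author></book>
  <book><title>B</title></book>
  <article><title>C</title><author>Y</author></article>
</bib>"""

root = ET.fromstring(DOC)
# Twig query: book elements having both a title child and an author child.
query = ("book", [("/", ("title", [])), ("/", ("author", []))])
hits, reads = find_twig(root, query)
print([h.find("title").text for h in hits], "nodes read:", reads)
```

Because this matcher tries every node as a candidate root and re-reads children for each branch, its read count exceeds the number of nodes in the document. Index-based twig-join algorithms aim to keep that count low by skipping nodes that cannot contribute to a match.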

目次 Table of Contents
中文摘要 (Chinese Abstract)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1. Background and Motivation
1.2. Related Works
1.3. Contributions
1.4. Organization
Chapter 2 The Proposed Twig-Join Swift using Indexing AL SST
2.1. Structural Summary Tree for Optimal XML Document
2.2. The AL List of SST Representation
2.3. The Twig-Join Swift for AL Lists of SST
Chapter 3 Extended Query Problem related to the Structure Relationship
3.1. Query Diversification
3.2. Query in Single Node and Hierarchical Level Difference
3.3. Intermediate Nodes in Ambiguity of Hierarchical Level
Chapter 4 The Processing Data Model Matching Method
4.1. The eXTwig Join Swift for Processing the Three Issues
4.2. Dynamic Updates in Adjacent Linked (AL) Lists
Chapter 5 Experiments
5.1. Experiment Setup
5.2. Performance Comparison and Analysis
5.3. Diversification Comparison and Analysis
Chapter 6 Conclusions and Future Work
6.1. Conclusions
6.2. Future Work
Bibliography


電子全文 Fulltext
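This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission: user-defined availability date
開放時間 Available: 校內 Campus: available; 校外 Off-campus: available
etd-0704116-153701.pdf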
紙本論文 Printed copies
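Public information on printed theses is relatively complete from Academic Year 102 onward. To look up printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. 開放時間 Available: available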

