National Sun Yat-sen University, thesis/dissertation, NAAK-Tree: An Index for Querying Spatial Approximate Keywords

Title	NAAK-Tree: An Index for Querying Spatial Approximate Keywords (一個搜尋空間資料庫中相似關鍵字的九方區域關鍵字樹索引方法)
Department	Department of Computer Science and Engineering
Year, semester	Academic Year 100, second (spring) semester	Language	English
Degree	Master	Number of pages	78
Author	Yen-Guo Liou (劉彥國)
Advisor	Ye-In Chang (張玉盈)
Convenor	Gen-Huey Chen (陳健輝)
Advisory Committee	Tei-Wei Kuo (郭大維); Chien-I Lee (李建億)
Date of Exam	2012-06-29	Date of Submission	2012-07-11
Keywords	Signature, Index Structure, Approximate-Keyword, Spatial Database, Range Query
Statistics	The thesis/dissertation has been browsed 5676 times and downloaded 677 times.

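Abstract (translated from the Chinese)
In recent years, geographic information systems have developed rapidly and play an important role in many applications. Many of these applications let users query objects using spatial information and keywords together. Most research on spatial keyword search requires the query keywords to match the keywords in the database exactly. Because users may not know how to spell a keyword correctly, they issue queries with approximate keywords rather than exact ones. How to search for similar keywords in a spatial database has therefore gradually become an important research topic. Alsubaiee et al. proposed a location-based keyword index tree, the Location-Based-Approximate-Keyword-tree (LBAK-tree), to give a tree-based spatial index the ability to process approximate-keyword queries. However, the LBAK-tree is an R-tree structure. When a node overflows, some nodes must be reinserted into the tree, so keywords cannot be stored in a node at the time the node is created. Instead, the keywords must be stored level by level, upward from the leaf nodes, after the whole R-tree has been built. When an object is inserted or searched for at a node, the spatial relationships of all child nodes must be examined to decide which node to use. After the needed keywords are found through the approximate-keyword index, they must be intersected with the keywords stored in the subsequent child nodes. The higher a node is in the tree, the more keywords it stores, and the more time the intersection takes. Moreover, every intersection must be computed, even if one of them is already the empty set. In this thesis, we therefore propose an index structure called the Nine-Area-Approximate-Keyword-tree (NAAK-tree). We do not need to search the space to build the structure. We do not need to reinsert overflowing nodes, so we can store keywords at the same time a node is created. Based on the relationship of the query range to the whole space, we can directly find the nodes that need to be searched. We equip the NAAK-tree with signatures to speed up keyword search. By checking signatures, we can efficiently filter out unneeded nodes. If one of the intersections is the empty set, we do not compute all the intersections. Our experimental results show that the NAAK-tree is more efficient than the LBAK-tree in both construction and search.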
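As a rough illustration of what an approximate-keyword lookup involves, the sketch below finds dictionary keywords within a given edit distance of a misspelled query, using a q-gram count filter before the exact edit-distance check. The function names (`qgrams`, `edit_distance`, `similar_keywords`), the choice q = 2, the padding characters, and the sample keywords are assumptions made for illustration; they are not taken from the thesis or from the LBAK-tree paper.

```python
from collections import Counter


def qgrams(word, q=2):
    """Return the multiset of q-grams of `word`, padded with q-1 sentinels per side."""
    padded = "#" * (q - 1) + word + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similar_keywords(query, dictionary, max_dist=1, q=2):
    """Return the keywords in `dictionary` within `max_dist` edits of `query`."""
    query_grams = qgrams(query, q)
    result = []
    for kw in dictionary:
        # Count filter: two padded strings within k edits share at least
        # max(len) + q - 1 - k*q q-grams, so cheaper candidates are skipped early.
        shared = sum((query_grams & qgrams(kw, q)).values())
        needed = max(len(query), len(kw)) + q - 1 - max_dist * q
        if shared < needed:
            continue
        if edit_distance(query, kw) <= max_dist:
            result.append(kw)
    return result
```

For example, `similar_keywords("restaurnt", ["restaurant", "museum", "rest"])` keeps only `"restaurant"`, because the other two keywords share too few 2-grams with the query to pass the count filter.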
Abstract
In recent years, geographic information system (GIS) databases have developed quickly and play a significant role in many applications. Many of these applications allow users to find objects using keywords and spatial information at the same time. Most research on spatial keyword queries considers only exact matches between the textual information in the database and in the query. Because users may not know how to spell a keyword exactly, they may issue a query with an approximate keyword instead of the exact one. How to process approximate-keyword queries in a spatial database has therefore become an important research topic. Alsubaiee et al. proposed the Location-Based-Approximate-Keyword-tree (LBAK-tree) index structure, which augments a tree-based spatial index with approximate-string indexes such as a gram-based index. However, the LBAK-tree is an R-tree-based index structure. The nodes of the R-tree have to be split and reinserted when they become full. As a result, the LBAK-tree cannot index the spatial attribute and the textual attribute at the same time; it stores the keywords in the nodes only after the R-tree has been built. Because it is based on the R-tree, it has to search all the children of a node to insert a new item or to answer a query. Moreover, after the needed keywords are found by using the approximate index, the LBAK-tree probes the nodes by checking the intersection of the similar-keyword sets with the keywords stored in the nodes. However, the higher the level of a node, the larger the number of keywords stored in it, so checking the intersections takes a long time. The LBAK-tree also checks all the intersections even if one of them is already an empty set. Therefore, in this thesis, we propose the Nine-Area-Approximate-Keyword-tree (NAAK-tree) index structure to process spatial approximate-keyword queries. We do not have to partition the space to construct the spatial index. We do not have to reinsert the children when splitting nodes, so we can handle the keywords at the same time. We can use the spatial number to find the nodes that satisfy the spatial condition of the query. We also augment the NAAK-tree with signatures to speed up the evaluation of the textual condition. We use the union of the bit strings of the keywords in a node to represent those keywords in the node. Therefore, we can efficiently filter out the nodes that contain no keyword corresponding to the query by checking the signatures just once, without checking all the keywords stored in the nodes. In the NAAK-tree, if one of the similar-keyword sets is empty, we do not check all the similar-keyword sets. Our simulation results show that the NAAK-tree is more efficient than the LBAK-tree in building the index and in answering spatial approximate-keyword queries.
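The signature filter described above can be sketched as follows: each keyword is hashed to a short bit string, a node's signature is the bitwise OR (union) of its keywords' bit strings, and a node is pruned when no similar keyword's bit string is fully contained in that signature. This is a minimal sketch under stated assumptions. The signature width, the number of bits set per keyword, the use of SHA-256 as the hash, the requirement that every query keyword be matched, and all function names are illustrative choices, not the thesis's actual parameters or code.

```python
import hashlib

SIG_BITS = 64          # assumed signature width in bits
BITS_PER_KEYWORD = 3   # assumed number of bit positions set per keyword


def keyword_signature(keyword):
    """Map a keyword to a bit string with up to BITS_PER_KEYWORD bits set."""
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    sig = 0
    for i in range(BITS_PER_KEYWORD):
        # Use two digest bytes per bit position.
        pos = int.from_bytes(digest[2 * i:2 * i + 2], "big") % SIG_BITS
        sig |= 1 << pos
    return sig


def node_signature(keywords):
    """Superimpose (bitwise OR) the bit strings of all keywords in a node."""
    sig = 0
    for kw in keywords:
        sig |= keyword_signature(kw)
    return sig


def may_contain(node_sig, keyword):
    """True if the node might hold `keyword`; False means it certainly does not."""
    kw_sig = keyword_signature(keyword)
    return (node_sig & kw_sig) == kw_sig


def node_passes(node_sig, similar_sets):
    """Check a node against one set of similar keywords per query keyword.

    Assumes every query keyword must be matched by at least one similar keyword.
    """
    # Early termination: an empty similar-keyword set means no node can match,
    # so the remaining sets are not examined at all.
    if any(not s for s in similar_sets):
        return False
    return all(any(may_contain(node_sig, kw) for kw in s) for s in similar_sets)
```

A signature test can report false positives, since several keywords may set the same bits, but never false negatives, so nodes that pass still need their stored keywords checked, while nodes that fail can be skipped safely.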

Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
    1.1 Spatial Approximate-Keyword Query
    1.2 Spatial Access Methods
    1.3 Approximate String Match
    1.4 The Signature Technique
    1.5 The Related Works
    1.6 Motivation
    1.7 Organization of the Thesis
2. A Survey of Approximate Keyword Query Processing in the Spatial Database
    2.1 The 3-Level Hybrid Index Structure
        2.1.1 The Hybrid Index Structure
        2.1.2 ASK Query
    2.2 The MHR-Tree
        2.2.1 Edit distance and Q-Gram
        2.2.2 The Min-Wise Signature
        2.2.3 The MHR-Tree
    2.3 The LBAK-Tree
        2.3.1 The Basic Index and Search
        2.3.2 Placing Approximate Indexes at Variable Levels
        2.3.3 Exploiting Frequency Distribution of Keywords
3. The NAAK-Tree
    3.1 Data Structure
        3.1.1 The Partition Numbering Scheme
        3.1.2 The NA-Tree
        3.1.3 Three Categories of Nodes
    3.2 Spatial Approximate-Keyword Query Processing
4. Performance
    4.1 The Performance Model
    4.2 Simulation Results of the Index Structure
5. Conclusion
    5.1 Summary
    5.2 Future Work
BIBLIOGRAPHY

References
[1] S. Alsubaiee, A. Behm, and C. Li, “Supporting Location-Based Approximate-Keyword Queries,” Proc. of the 18th SIGSPATIAL Int. Conf. on Advances in Geographic Information Systems, pp. 61–70, 2010.
[2] S. Alsubaiee and C. Li, “Fuzzy Keyword Search on Spatial Data,” Proc. of the 15th Int. Conf. on Database Systems for Advanced Applications, pp. 464–467, 2010.
[3] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “Basic Local Alignment Search Tool,” Journal of Molecular Biology, Vol. 215, No. 3, pp. 403–410, Oct. 1990.
[4] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, “The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles,” Proc. of the 1990 ACM SIGMOD Int. Conf. on Management of Data, pp. 322–331, 1990.
[5] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, “Min-Wise Independent Permutations (Extended Abstract),” Proc. of the 30th Annual ACM Symposium on Theory of Computing, pp. 327–336, 1998.
[6] Y.-I. Chang, C. H. Liao, and H.-L. Chen, “NA-Trees: A Dynamic Index for Spatial Data,” Information Science and Engineering (SCI), Vol. 19, No. 1, pp. 103–139, 2003.
[7] I. D. Felipe, V. Hristidis, and N. Rishe, “Keyword Search on Spatial Databases,” Proc. of the 24th Int. Conf. on Data Engineering, pp. 656–665, 2008.
[8] A. Guttman, “R-Trees: A Dynamic Index Structure for Spatial Searching,” Proc. of the 1984 ACM SIGMOD Int. Conf. on Management of Data, pp. 47–57, 1984.
[9] R. Hariharan, B. Hore, C. Li, and S. Mehrotra, “Processing Spatial Keyword (SK) Queries in Geographic Information Retrieval (GIR) Systems,” Proc. of the 19th Int. Conf. on Scientific and Statistical Database Management, p. 16, 2007.
[10] M. S. Kim, K. Y. Whang, J. G. Lee, and M. J. Lee, “N-Gram/2L: A Space and Time Efficient Two-Level N-Gram Inverted Index Structure,” Proc. of the 31st Int. Conf. on Very Large Data Bases, pp. 1–16, 2005.
[11] A. Kumar, “G-Tree: A New Data Structure for Organizing Multidimensional Data,” IEEE Transactions on Knowledge and Data Engineering, Vol. 6, No. 2, pp. 341–347, 1994.
[12] S. Y. Lee, M. C. Yang, and J. W. Chen, “Signature File as a Spatial Filter for Iconic Image Database,” Visual Languages and Computing, Vol. 3, pp. 373–397, 1992.
[13] C. Li, J. Lu, and Y. Lu, “Efficient Merging and Filtering Algorithms for Approximate String Searches,” Proc. of the 2008 IEEE 24th Int. Conf. on Data Engineering, pp. 257–266, 2008.
[14] D. J. Lipman and W. R. Pearson, “Rapid and Sensitive Protein Similarity Searches,” Science, Vol. 227, No. 12, pp. 1435–1441, March 1985.
[15] A. Mazeika, M. H. Böhlen, N. Koudas, and D. Srivastava, “Estimating the Selectivity of Approximate String Queries,” ACM Transactions on Database Systems, Vol. 32, No. 2, pp. 12–52, 2007.
[16] G. Navarro, “A Guided Tour to Approximate String Matching,” ACM Computing Surveys, Vol. 33, No. 1, pp. 31–88, 2001.
[17] C. S. Roberts, “Partial-match Retrieval via the Method of Superimposed Codes,” Proc. of the IEEE, Vol. 67, No. 12, pp. 1624–1641, 1979.
[18] E. Sutinen and J. Tarhio, “On Using Q-Gram Locations in Approximate String Matching,” Proc. of the 3rd Annual European Symposium on Algorithms, pp. 327–340, 1995.
[19] E. Ukkonen, “Approximate String Matching with Q-Grams and Maximal Matches,” Theoretical Computer Science, Vol. 92, No. 1, pp. 191–211, 1992.
[20] S. Vaid, C. B. Jones, H. Joho, and M. Sanderson, “Spatio-Textual Indexing for Geographical Search on the Web,” Proc. of the 9th Symposium on Spatial and Temporal Databases, pp. 218–235, 2005.
[21] Z. Wang, M. Du, and J. Le, “GR-tree: An Index for Querying Approximate Keywords in Geographic Information System,” Proc. of the 2009 Int. Conf. on Information Engineering and Computer Science, pp. 1–4, 2009.
[22] Z. Wang, M. Du, X. Shi, and J. Le, “An Efficient Approach for Approximate Keyword Query in Geographic Information System,” Proc. of the 2009 IEEE Int. Conf. on Intelligent Computing and Intelligent Systems, pp. 603–607, 2009.
[23] H. Williams and J. Zobel, “Indexing and Retrieval for Genomic Databases,” IEEE Trans. on Knowledge and Data Eng., Vol. 14, No. 1, pp. 63–78, Jan./Feb. 2002.
[24] B. Yao, F. Li, M. Hadjieleftheriou, and K. Hou, “Approximate String Search in Spatial Databases,” Proc. of the 2010 IEEE 26th Int. Conf. on Data Engineering, pp. 545–556, 2010.
[25] D. Zhang, Y. M. Chee, A. Mondal, A. Tung, and M. Kitsuregawa, “Keyword Search in Spatial Databases: Towards Searching by Document,” Proc. of the 2009 IEEE Int. Conf. on Data Engineering, pp. 688–699, 2009.
[26] Y. Zhou, X. Xie, C. Wang, Y. Gong, and W. Ma, “Hybrid Index Structures for Location-Based Web Search,” Proc. of the 14th ACM Int. Conf. on Information and Knowledge Management, pp. 155–162, 2005.

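Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as not to violate the law. Thesis access permission: user-defined release time. Available: Campus: available; Off-campus: available. File: etd-0711112-160600.pdf
Printed copies
Public-availability information for printed theses is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available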

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS