Responsive image
博碩士論文 etd-0729104-165914 詳細資訊
Title page for etd-0729104-165914
論文名稱
Title
一個以計數分配為基礎來設計用於資料倉儲之動態區域查詢的位元索引方法
A Count-Based Partition Approach to the Design of the Range-Based Bitmap Indexes for Data Warehouses
系所名稱
Department
畢業學年期
Year, semester
語文別
Language
學位類別
Degree
頁數
Number of pages
98
研究生
Author
指導教授
Advisor
召集委員
Convenor
口試委員
Advisory Committee
口試日期
Date of Exam
2004-06-04
繳交日期
Date of Submission
2004-07-29
關鍵字
Keywords
位元索引、線上資料分析處理系統、壓縮、資料倉儲、區域查詢
bitmap index, range query, data warehouse, compress, OLAP
統計
Statistics
本論文已被瀏覽 5722 次,被下載 0
The thesis/dissertation has been browsed 5722 times, has been downloaded 0 times.
中文摘要
資料倉儲(Data Warehouse)提供了整合多個操作型資料庫(Operational Database)一致性正確歷史資料,因此適用於資料的分析應用。線上資料分析處理系統(OLAP)進一步提供了分析用的工具,以便從資料倉儲萃取有用的資訊。在線上支援決策系統中,快速的回應時間是必要的。而在以讀取為主的環境裡,為了達到此目的,位元索引是最常用的方法。當資料屬性有高的基數(cardinality),以區域為基礎的位元索引法(RBI)將資料分成數個部份(range-based),利用位元向量(bitmap vector)去記錄,是較有效率的做法。然而,RBI在每段範圍含的資料個數不平均,導致不同查詢硬碟搜尋次數不一致的問題。Wu等人提出的DBEC演算法中,已進一步考慮資料的分佈情況。但卻不能保證切割結果符合需要位元向量的個數。而且,對有相同值的不同筆紀錄,可能被切割至不同的位元向量,導致硬碟輸入輸出時間增加。因此,我們提出IPDF、CP、CP*三種演算法,這三種演算法考慮了具有高基數的資料屬性及非均勻的資料分佈的情況,用來建立以區域為基礎的位元索引。在IPDF演算法中,我們根據機率密度函數(p.d.f.)切割每個部份,使得每個切割的結果包含的資料量趨於一致。在CP演算法中,我們將資料排序,然後一定數量分成同一部份。CP*演算法為CP演算法之改進,它調整分割點使得有相同值之不同紀錄被分割至同一部份。在另一方面,我們考慮查詢的歷史。根據貪婪演算法,我們提出GreedyExt和GreedyRange演算法。GreedyExt演算法針對精確查詢所設計,而GreedyRange演算法則是針對區域查詢設計。這兩種演算法挑選最常用的查詢建立位元向量,使得回答查詢的平均回應時間降低。除此之外,位元索引包含了一組位元向量,而位元索引大小可能大於硬碟的容量。為了節省儲存空間,我們提出了FZ演算法,將每個位元向量壓縮。而且為了保持二元運算的快速性,也提出了有效的演算法,可直接將位元向量作二元運算,而不需解壓縮。最後,跟據我們的分析,在硬碟存取次數上,CP*演算法優於CP演算法。而跟據我們的模擬結果,IPDF和CP*切割出來的部份,比DBEC更加平均。GreedyExt和GreedyRange這兩個演算法,在大部份的情況,都可以得到更快速的回應時間。在壓縮方面,我們提出的FZ演算法,則可比WAH演算法節省更多的儲存空間。
Abstract
Data warehouses contain data consolidated from several operational databases and provide the historical, and summarized data which is more appropriate for analysis than detail, individual records. On-Line Analytical Processing (OLAP) provides advanced analysis tools to extract information from data stored in a data warehouse. Fast response time is essential for on-line decision support. A bitmap index could reach this goal in read-mostly environments. When data has high cardinality, we prefer to use the Range-Based Index (RBI), which divides the attributes values into several partitions and a bitmap vector is used to represent a range. With RBI, however, the number of records assigned to different ranges can be highly unbalanced, resulting in different search times of disk accesses for different queries. Wu et al proposed an algorithm for RBI, DBEC, which takes the data distribution into consideration. But the DBEC strategy could not guarantee to get the partition result with the given number of bitmap vectors, PN. Moreover, for different data records with the same value, they may be partitioned into different bitmap vectors which takes long disk I/O time. Therefore, we propose the IPDF, CP, CP* strategies for constructing the dynamic range-based indexes concerning with the case that data has high cardinality and is not uniformly distributed. The IPDF strategy decides each partition according to the Probability Density Function (p.d.f.). The CP strategy sorts the data and partitions them into PN groups for every w continuous records. The CP* strategy is an improved version of the CP strategy by adjusting the cutting points such that data records with the same value will be assigned into the same partition. On the other hand, we could take the history of users' queries into consideration. Based on the greedy approach, we propose the GreedyExt and GreedyRange strategies. The GreedyExt strategy is used for answering exact queries and the GreedyRange strategy is used for answering range queries. The two strategies decide the set of queries to construct the bitmap vectors such that the average response time of answering queries could be reduced. Moreover, a bitmap index consists of a set of bitmap vectors and the size of the bitmap index could be much larger than the capacity of the disk. We propose the FZ strategy to compress each bitmap vector to reduce the size of the storage space and provide efficient bitwise operations without decompressing these bitmap vectors. Finally, from our performance analysis, the performance of the CP* strategy could be better than the CP strategy in terms of the number of disk accesses. From our simulation, we show that the ranges divided by the IPDF and CP* strategies are more uniform than those divided by the DBEC strategy. The GreedyExt and GreedyRange strategies could provide fast response time in most of situations. Moreover, the FZ strategy could reduce the storage space more than the WAH strategy.
目次 Table of Contents
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 DataWarehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Decision Support Systems . . . . . . . . . . . . . . . . . . . . 2
1.1.2 The DimensionalModel . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 OLAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Query Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Variant Index Strategies . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.1 TheMultidimensional Array-Based Strategy . . . . . . . . . . 11
1.3.2 The Hierarchical Indexing Strategy . . . . . . . . . . . . . . . 12
1.3.3 TheMultidimensional Indices Strategy . . . . . . . . . . . . . 12
1.3.4 The Bitmap Index Strategy . . . . . . . . . . . . . . . . . . . 13
1.4 RelatedWork of the Bitmap Index . . . . . . . . . . . . . . . . . . . 13
1.5 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 24
2. A Survey of Strategies for Bitmap Indexing . . . . . . . . . . . . . 25
2.1 Simple Bitmap Index . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 (Binary) Bit-Sliced Index . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Encoded Bitmap Index . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Equality, Range, Interval Encoding . . . . . . . . . . . . . . . . . . . 31
2.5 Range-Based Bitmap Index . . . . . . . . . . . . . . . . . . . . . . . 35
2.6 Compressing Bitmap Index . . . . . . . . . . . . . . . . . . . . . . . . 37
2.6.1 Byte-Aligned Bitmap Codes . . . . . . . . . . . . . . . . . . . 38
2.6.2 Word-Aligned Hybrid Run-Length Code . . . . . . . . . . . . 39
3. Dynamic Range-Based Bitmap Indexing
Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.1 Strategies for Non-uniformData Distribution. . . . . . . . . . . . . . 42
3.1.1 The Integral Probability Density Function-Based Strategy (IPDF) 43
3.1.2 The Count-Based Partition Strategy (CP) . . . . . . . . . . . 46
3.1.3 The Count-Based Partition* Strategy (CP*) . . . . . . . . . . 48
3.2 Strategies for the History of Queries . . . . . . . . . . . . . . . . . . . 51
3.2.1 The GreedyExt Strategy for Exact Queries . . . . . . . . . . . 51
3.2.2 The GreedyRange Strategy for Range Queries . . . . . . . . . 54
3.3 The Filtering-Out-Zeros (FZ) Strategy for compressing the Bitmap
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4. Performance Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.1 The CP and CP* Strategies . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 The Performance of the DBEC, IPDF, and CP* Strategies . . . . . . 76
4.2.1 The Performance Model . . . . . . . . . . . . . . . . . . . . . 76
4.2.2 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Performance Analysis of the GreedyExt and GreedyRnage Strategies 82
4.4 The Performance of the WAH and FZ Strategies . . . . . . . . . . . . 84
4.4.1 The PerformanceModel . . . . . . . . . . . . . . . . . . . . . 84
4.4.2 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . 85
5. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2 FutureWork . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
A. Different Types of Data Distributions . . . . . . . . . . . . . . . . . 94
參考文獻 References
[1] R. Agrawal, A. Gupta, and S. Sarawagi, “Modeling Multidimensional
Databases,” Proc. of the 13th IEEE Int. Conf. on Data Eng., pp. 232–243, 1997.
[2] G. Antoshenkov, “Byte-Aligned Bitmap Compression,” tech. rep., Oracle Corp.,
1994.
[3] G. Casella and R. L. Berger, Statistical Inference. 511 Forest Lodge Road, Pacific
Grove, CA 93950, USA: the Wadsworth Group, second ed., 2001.
[4] C. Y. Chan and Y. E. Ioannidis, “Bitmap Index Design and Evaluation,” Proc. of
the ACM SIGMOD Int. Conf. on Management of Data, pp. 355–366, 1998.
[5] C. Y. Chan and Y. E. Ioannidis, “An Efficient Bitmap Encoding Scheme for
Selection Queries,” Proc. of the ACM SIGMOD Int. Conf. on Management of
Data, pp. 215–226, 1999.
[6] S. Chaudhuri and U. Dayal, “An Overview of Data Warehouse and OLAP Technology,”
ACM SIGMOD Record, Vol. 26, No. 1, pp. 65–74, 1997.
[7] E. F. Codd, “Providing OLAP (On-Line Analytical Processing) to User-Analysts:
An IT Mandate,” tech. rep., E. F. Codd and Associates, 1993.
[8] C. I. Ezeife, “Selecting and Materializing Horizontally Partitioned Warehouse
Views,” Data and Knowledge Eng., Vol. 36, No. 2, pp. 185–210, Jan. 2001.
[9] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, “Data Cube: A Relational
Aggregation Operator Generalizing Group-by, Cross-Tabs and Sub-
Totals,” Proc. of the 12th IEEE Int. Conf. on Data Eng., pp. 152–159, 1996.
[10] A. Gupta and I. S. Mumick, “Maintenance of Materialized Views: Problems,
Techniques, and Applications,” IEEE Data Eng. Bulletin, Vol. 18, No. 2, pp. 3–
18, June 1995.
[11] J. Han, “OLAP Mining: An Integration of OLAP with Data Mining,” Proc. of
IFIP Conf. on Data Semantics, pp. 1–11, 1997.
[12] V. Harinarayan, A. Rajaraman, and J. D. Ullman, “Implementing Data Cubes
Efficiently,” Proc. of the ACM SIGMOD Int. Conf. on Management of Data,
pp. 205–216, 1996.
[13] K. J. Hastings, Probability and Statistics. Addison Wesley Longman, Inc.,
first ed., 1997.
[14] B. Inmon, H. William, and C. Kelley, “The 12 Rules of Data Warehouse for A
Client/Server World,” Data Management Review, pp. 6–16, May 1994.
[15] M. Jurgens and H.-J. Lenz, “Tree Based Indexes vs. Bitmap Indexes: A Performance
Study,” Int. Journal of Cooperative Information Systems, Vol. 10, No. 1-2,
pp. 355–376, Mar. 2001.
[16] P. Kalnis, N. Mamoulis, and D. Papadias, “View Selection Using Randomized
Search,” Data and Knowledge Eng., Vol. 42, No. 1, pp. 89–11, July 2002.
[17] M. Levene and G. Loizou, “Why is the Snowflake Schema A Good Data Warehouse
Design?,” Proc. of the 10th Int. Conf. on Tools with Artificial Intelligence,
pp. 225–240, 2003.
[18] T. W. Ling and E. K. Sze, “Materialized View Maintenance Using Version Numbers,”
Proc. of the 6th Int. Conf. on Database Systems for Advanced Applications,
pp. 263–270, 1999.
[19] P. O’Neil, “Model 204 Architecture and Performance,” Springer-Verlag Lecture
Notes in Computer Science 359, Proc. of the 2nd Int. Workshop on High Performance
Transactions Systems, pp. 40–59, 1987.
[20] P. O’Neil and D. Quass, “Improved Query Performance with Variant Indexes,”
Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pp. 38–49, 1997.
[21] T. Palpanas, “Knowledge Discovery in Data Warehouses,” ACM SIGMOD
Record, Vol. 29, No. 3, pp. 88–100, Sept. 2000.
[22] R. Ramakrishnan and J. Gehrke, “Database Management Systems,” tech. rep.,
The McGraw-Hill Companies, 2000.
[23] S. Sarawagi, “Indexing OLAP Data,” IEEE Data Eng. Bulletin, Vol. 202, No. 1,
pp. 36–43, 1997.
[24] A. Shoshani, L. M. Bernardo, H. Nordberg, D. Rotem, and A. Sim, “Multidimensional
Indexing and Query Coordination for Tertiary Storage Management,”
Proc. of the 11th Int. Conf. on Scientific and Statistical Database Management,
pp. 214–225, 1999.
[25] A. Y. Sihem and T. Johnson, “Optimizing Queries on Compressed Bitmap,”
Proc. of the 26th Int. Conf. on Very Large Data Bases, pp. 329–338, 2000.
[26] N. Stefanovic, J. Han, and K. Koperski, “Object-Based Selective Materialization
for Efficient Implementation of Spatial Data Cubes,” IEEE Trans. on Knowledge
and Data Eng., Vol. 12, No. 6, pp. 938–958, Nov./Dec. 2000.
[27] K. Stockinger, “Design and Implementation of Bitmap Indices for Scientific
Data,” Proc. of the Int. Database Eng. Applications Symposium, pp. 47–57, 2001.
[28] K. Stockinger, “Bitmap Indices for Speeding Up High-Dimensional Data Analysis,”
Proc. of the 13th Int. Conf. on Database and Expert Systems Applications,
pp. 881–890, 2002.
[29] K. Stockinger, D. Dullmann, W. Hoschek, and E. Schikuta, “Improving the
Performance of High-Energy Physics Analysis Through Bitmap Indices,” Proc. of
the 11th Int. Conf. on Database and Expert Systems Applications, pp. 835–845,
2000.
[30] D. Theodoratos, “Detecting Redundant Materialized Views in Data Warehouse
Evolution,” Information Systems, Vol. 26, No. 5, pp. 363–381, July 2001.
[31] P. Westerman, Data Warehousing. Morgan Kaufmann, first ed., 2001.
[32] H. K. T. Wong, H.-F. Liz, F. Olken, D. Rotem, and L. Wong, “Bit Tranposed
Files,” Proc. of the 11th Int. Conf. on Very Large Data Bases, pp. 448–457, 1985.
[33] K. Wu, E. J. Otoo, and A. Shoshani, “Compressed Bitmap Indices for Efficient
Query Processing,” tech. rep., LBNL-47807, Lawrence Berkeley National Laboratory,
Berkeley, CA, 1999.
[34] K. Wu, E. J. Otoo, and A. Shoshani, “Compressing Bitmap Indexes for Faster
Search Operations,” Proc. of the 14th Int. Conf. on Scientific and Statistical
Database Management, pp. 625–658, 2002.
[35] K.-L. Wu and P. S. Yu, “Range-Based Bitmap Indexing for High-Cardinality Attributes
with Skew,” Proc. of the 22th Int. Computer Software and Applications
Conf., pp. 61–67, 1998.
[36] M. C. Wu, “Query Optimization for Selections Using Bitmaps,” Proc. of the
ACM SIGMOD Int. Conf. on Management of Data, pp. 227–238, 1999.
[37] M. C. Wu and A. P. Buchmann, “Encoded Bitmap Indexing for Data Warehouses,”
Proc. of the 14th Int. Conf. on Data Eng., pp. 220–230, 1998.
[38] W. S. Yang, Y. D. Chung, and M. H. Kim, “The RD-Tree: a Structure for Processing
Partial-MAX,MIN Queries in OLAP,” Information Sciences, Vol. 146,
No. 1-4, pp. 137–149, Oct. 2002.
電子全文 Fulltext
本電子全文僅授權使用者為學術研究之目的,進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定,切勿任意重製、散佈、改作、轉貼、播送,以免觸法。
論文使用權限 Thesis access permission:校內校外均不公開 not available
開放時間 Available:
校內 Campus:永不公開 not available
校外 Off-campus:永不公開 not available

您的 IP(校外) 位址是 3.238.5.144
論文開放下載的時間是 校外不公開

Your IP address is 3.238.5.144
This thesis will be available to you on Indicate off-campus access is not available.

紙本論文 Printed copies
紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊,請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。
開放時間 available 已公開 available

QR Code