國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,一個於資料網格中以不同臨界值方式之資料複製策略,A Different Threshold Approach to Data Replication in Data Grids

論文名稱 Title	一個於資料網格中以不同臨界值方式之資料複製策略 A Different Threshold Approach to Data Replication in Data Grids
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	96 學年度第 1 學期 The fall semester of Academic Year 96	語文別 Language	英文 English
學位類別 Degree	碩士 Master	頁數 Number of pages	68
研究生 Author	黃彥維 Yen-Wei Huang
指導教授 Advisor	張玉盈 Ye-In Chang
召集委員 Convenor	黃三益 San-Yi Huang
口試委員 Advisory Committee	李建億 Chien-I Lee
口試日期 Date of Exam	2007-12-19	繳交日期 Date of Submission	2008-01-21
關鍵字 Keywords	存取環境、臨界值、資料網格、資料複製 Data Grids, Access patterns, Data Replication, Thresholds

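中文摘要 Chinese Abstract
In a typical Data Grid, large amounts of data are stored in systems around the world, and access may take from seconds up to hours. Certain scientific application domains, such as High-Energy Physics or Earth Observation, are expected to produce several Petabytes (1 Petabyte = 2^20 Gigabytes) of data, to be analyzed and evaluated by scientists distributed all over the world. In Data Grid technology, replication is commonly used to reduce access time and bandwidth consumption. The goal of a replica management system is not only to keep track of replicated data but also to help applications locate the replica they can access with the shortest response time. In this thesis, we adopt the conventional Data Grid architecture with three types of nodes: server nodes, cache nodes, and client nodes. A server node represents a main data storage site. A client node represents a site where requests originate. A cache node is an intermediate storage site in the middle layers of the architecture. However, a hierarchical storage system usually encounters a major bottleneck: if some data still resides on tape that has not yet been mounted, the access time may range from seconds to hours. A static replication strategy achieves the benefits of replication, but its drawback is that it cannot adapt to changes in user demand, so it is not well suited to accessing large volumes of data. Dynamic data replication strategies for Data Grids were therefore proposed. Any replica placement strategy must address three basic questions: (1) when to create replicas, (2) which files to replicate, and (3) where to place the replicas. Different answers to these questions have led to many different replication strategies. Two well-known methods in this area, Fast-Spread and Cascading, each suit different user access patterns. For example, the Fast-Spread strategy suits random access, while the Cascading strategy suits access with locality. However, with so many access patterns, using a different replication strategy for each one would make the whole system very complex. Therefore, in this thesis, we propose a single strategy that suits different access patterns. We propose a replication strategy, the Different Threshold (DT) approach to data replication in Data Grids, which adapts dynamically to different access patterns and achieves the same performance as the Cascading and Fast-Spread strategies. In our approach, each layer of the architecture has a different threshold. By carefully adjusting the differences between the layer thresholds, we can even provide better performance than the two well-known strategies above. Furthermore, among a large number of different data files, there may be hot files, that is, the files requested most often in the system. To reduce the number of accesses to hot data, we propose the dynamic part of our approach. In the dynamic DT strategy, each data file has its own replication threshold, so that hot files are replicated earlier than ordinary files. In the environment we define, data is read-only, so no data consistency issues arise. Our simulation results show that our approach has a shorter response time than Cascading and Fast-Spread, and that the dynamic DT strategy performs better than the static DT strategy.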
Abstract
Certain scientific application domains, such as High-Energy Physics or Earth Observation, are expected to produce several Petabytes (1 Petabyte = 2^20 Gigabytes) of data to be analyzed and evaluated by scientists all over the world. In Data Grid technology, data replication is commonly used to reduce access latency and bandwidth consumption. In this thesis, we adopt the typical Data Grid architecture, which has three kinds of nodes: server, cache, and client nodes. A server node represents a main storage site, a client node represents a site where data access requests are generated, and a cache node represents an intermediate storage site. However, the access latency of a hierarchical storage system may range from seconds up to hours. A static replication strategy can reduce such long delays, but it cannot adapt to changes in users' behavior. Therefore, dynamic data replication strategies are used in Data Grids. Three fundamental design issues in a dynamic replication strategy are: (1) when to create the replicas, (2) which files to replicate, and (3) where to place the replicas. Two well-known replication strategies are Fast-Spread and Cascading, each of which works well for a different kind of access pattern. For example, the Fast-Spread strategy works well for random access patterns, and the Cascading strategy works well for patterns with locality. However, with so many different access patterns, using a separate strategy for each kind of pattern may make the system too complex. Therefore, in this thesis, we propose a single strategy that can work for any kind of access pattern: a Different Threshold (DT) approach to data replication in Data Grids, which can be dynamically adapted to several kinds of access patterns and provide even better performance than the Cascading and Fast-Spread strategies.
In our approach, different layers have different thresholds. Based on this approach, we first propose a static DT strategy in which the threshold at each layer is fixed. By carefully adjusting the differences between the thresholds Ti, where Ti is the threshold of the i-th layer of the tree structure, we can provide better performance than the above two well-known strategies. Moreover, among a large number of different data files, there may exist some hot data files, that is, the files that are requested most often. To reduce the number of requests for hot files, we next propose the dynamic DT strategy, in which each data file has its own threshold. We let hot files be replicated earlier than others by decreasing their thresholds earlier than those of normal files. Our simulation results show that the response time of our static DT strategy is less than that of the Cascading and Fast-Spread strategies, and that the dynamic DT strategy performs better than the static DT strategy.
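The per-layer and per-file thresholds described in the abstract can be illustrated with a small Python sketch. This is not the thesis's algorithm, whose details are not given in this record. It assumes that a node counts the requests it serves for each file and, once the count reaches the threshold of its layer, copies the file one hop toward the client. The names `Node`, `request`, `static_dt`, `dynamic_dt`, `hot_files`, and `discount` are all hypothetical.

```python
from collections import defaultdict


class Node:
    """One storage site in the tree; layer 0 is the server, larger layers sit closer to clients."""

    def __init__(self, name, layer, files=()):
        self.name = name
        self.layer = layer
        self.files = set(files)
        self.hits = defaultdict(int)  # per-file access counts seen at this node


def request(path, file_id, threshold_for):
    """Serve one read request and return the node that served it.

    `path` lists the nodes from the cache nearest the client up to the server.
    `threshold_for(layer, file_id)` returns the replication threshold to apply.
    """
    for pos, node in enumerate(path):
        if file_id in node.files:
            node.hits[file_id] += 1
            # Assumed rule: once the count reaches the layer's threshold,
            # copy the file one hop toward the client and restart the count.
            if pos > 0 and node.hits[file_id] >= threshold_for(node.layer, file_id):
                path[pos - 1].files.add(file_id)
                node.hits[file_id] = 0
            return node
    raise LookupError(f"{file_id} is not stored anywhere on the path")


def static_dt(layer_thresholds):
    """Static DT: one fixed threshold Ti per layer, shared by all files."""
    return lambda layer, file_id: layer_thresholds[layer]


def dynamic_dt(layer_thresholds, hot_files, discount=1):
    """Dynamic DT (assumed form): hot files get a lower threshold, never below 1."""
    def threshold_for(layer, file_id):
        base = layer_thresholds[layer]
        return max(1, base - discount) if file_id in hot_files else base
    return threshold_for


if __name__ == "__main__":
    server = Node("server", 0, files={"f1", "f2"})
    mid = Node("mid-cache", 1)
    edge = Node("edge-cache", 2)
    path = [edge, mid, server]  # client side first, server last

    policy = static_dt({0: 3, 1: 2, 2: 1})
    print([request(path, "f1", policy).name for _ in range(6)])
```

With thresholds 3 and 2 at layers 0 and 1, an ordinary file in this sketch reaches the cache nearest the client after five requests. Passing `dynamic_dt` with the same thresholds and the file marked as hot gets it there after three, which is the effect the dynamic DT strategy aims for. How hot files are detected and how far their thresholds are lowered are left open here.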

目次 Table of Contents
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. Introduction
  1.1 Grids
    1.1.1 Research Areas
    1.1.2 Grid Computing
    1.1.3 Resource Management Systems
  1.2 Data Grids
    1.2.1 Data Grid Architecture Elements
    1.2.2 Data Replication
    1.2.3 Strategies for Data Replication
  1.3 Motivation
  1.4 Organization of The Thesis
2. A Survey
  2.1 Dynamic Replication Strategies
    2.1.1 Replication/Caching Strategies
    2.1.2 Strategies on Access Patterns
  2.2 Hybrid Replica Structure
    2.2.1 Replica Distribution Topologies
    2.2.2 Replica Placement Algorithm
  2.3 Dynamic Replica Structure
    2.3.1 Replica Distribution Topologies
    2.3.2 The Dynamic Replica Placement Algorithm
3. A Different Threshold Approach
  3.1 Data Structure
  3.2 The Static DT Strategy
  3.3 Different Effects of Ti's
  3.4 The Dynamic DT Strategy
4. Performance
  4.1 The Performance Model
  4.2 Simulation Results
    4.2.1 The Static DT Strategy
    4.2.2 The Dynamic DT Strategy
5. Conclusion
  5.1 Summary
  5.2 Future Work
BIBLIOGRAPHY

