Data Replication in Data Intensive Scientific Applications - CiteSeerX

Fig. 4. Performance comparison of Greedy, Optimal, Local Greedy, and Random algorithms in large scale. Unless varied, the number of sites is 30, number of data files is 1,000, and storage capacity of each site is 50. Data file size is 1. Panels (y-axis: Total Access Cost): (a) Varying number of data files. (b) Varying storage capacity (GB). (c) Varying number of Grid sites.

Fig. 5. Performance comparison of Greedy, Optimal, Local Greedy, and Random algorithms with varying data file size. Unless varied, the number of sites is 10, number of data files is 10, and storage capacity of each site is 5. Data file size varies from 1 to 10. Panels (y-axis: Total Access Cost): (a) Varying number of data files. (b) Varying storage capacity (GB). (c) Varying number of Grid sites.

Fig. 6. Illustration of simulation topology. (a) Cascading Replication [39]. (b) Simulation Topology. Tier 3 has 32 sites; not all of them are shown due to space limits.

it shows that Greedy performs the best among the three algorithms.

Comparison with Varying Data File Size. Even though Theorem 1 is valid only for uniform data size, we experimentally show that our greedy algorithm also achieves good system performance when the data file size varies.
Figure 5 shows the performance comparison of Optimal, Greedy, Local Greedy, and Random, wherein each data file size is a random number between 1 and 10. Greedy again performs the best among the three algorithms. It can be seen that, due to the non-uniform data size, the access cost is no longer zero for all four algorithms when there are five files and each site has a storage capacity of five.

B. Distributed Algorithm versus Replication Strategies by Ranganathan and Foster [39]

In this section, we compare our distributed algorithm with the replication strategy proposed by Ranganathan and Foster [39]. First, we give an overview of their strategies, and then present the simulation environment and discuss the comparison simulation results.

1) Replication Strategies by Ranganathan and Foster: Ranganathan and Foster study data replication in a hierarchical Data Grid model (see Figure 6). The hierarchical model, represented as a tree topology, has been used in the LHCGrid project [2], which serves the scientific collaborations performing experiments at the Large Hadron Collider (LHC) at CERN. In this model, there is a tier 0 site at
CERN, and there are regional sites, which are from different institutes in the collaboration. For comparison, we also use this hierarchical model and assume that all data files are at the CERN site initially (footnote 4). Like [2], we assume that only the leaf sites execute jobs. When the needed data files are not stored locally, the site always goes one level higher for the data. If the data is found, it fetches the data back to the leaf site; otherwise, it goes one level higher until it reaches the CERN site.

Ranganathan and Foster present six different replication strategies: (1) No Replication or Caching; (2) Best Client, whereby a replica is created at the best client site (the site that has the largest number of requests for the file); (3) Cascading Replication, whereby once popularity exceeds the threshold for a file at a given time interval, a replica is created at the next level on the path to the best client; (4) Plain Caching, whereby the client that requests the file stores a copy of the file locally; (5) Caching plus Cascading Replication, which combines the Plain Caching and Cascading Replication strategies; and (6) Fast Spread, whereby replicas of the file are created at each site along its path to the client.

All of the above strategies are evaluated with three user access patterns: (1) Random Access: no locality in patterns; (2) Temporal Locality: data that contain a small degree of temporal locality (recently accessed files are likely to be accessed again); and (3) Geographical plus Temporal Locality: data containing a small degree of temporal and geographical locality (files recently accessed by a site are likely to be accessed again by a close-by site). Their simulation results show that Caching plus Cascading works better than the others in most cases. In our work, we compare our strategy with Caching plus Cascading (referred to as Cascading in the simulation) under Geographical plus Temporal Locality.

Cascading Replication Strategy.
Figure 6(a) shows in detail how Cascading works. There is a file F1 at the root site. When the number of requests for file F1 exceeds the threshold at the root, a replica of F1 is sent to the next layer on the path to the best client. Eventually the threshold is exceeded at the next layer as well, and a replica of F1 is sent to Client C.

TABLE I
SIMULATION PARAMETERS FOR HIERARCHICAL TOPOLOGY.

Description                             Value
Number of sites                         43
Number of files                         1,000 - 5,000
Size of each file                       2 GB
Available storage of each site          100 - 500 GB
Number of jobs executed by each site    400
Number of input files of each job       1 - 10
Link bandwidth                          100 MB/s

2) Simulation Setup: In the simulation, we use the GridSim simulator [43] (version 4.1). The GridSim toolkit allows modeling and simulation of entities in parallel and distributed computing environments: users, applications, resources, and resource brokers. Version 4.1 of GridSim incorporates a Data Grid extension to model and simulate Data Grids. In addition, GridSim offers extensive support for the design and evaluation of scheduling algorithms, which suits the needs of future work wherein a more comprehensive framework of data replication and job scheduling will be studied. As a result, we use GridSim in our simulation.

Footnote 4. However, our Data Grid model is not limited to a tree-like hierarchical model. In Section VI-C, we show a general graph model, and that our distributed replication algorithm performs better than LRU/LFU implemented in OptorSim [10].

Figure 6(b) shows the topology used in our simulation. Similar to the one in [39], there are four tiers in the grid, with all data being produced at the topmost tier 0 (the root). Tier 1 consists of the regional centers of different continents; here we consider two. The next tier is composed of different countries. The final tier consists of the participating institution sites. The Grid contains a total of 43 sites, 32 of them executing tasks and generating requests.
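The tree topology and the "go one level higher" lookup rule described above can be illustrated with a small sketch. This is not the GridSim setup used in the paper. The class `Site`, the helper names, and in particular the branching factors (2 regional centers, 4 countries per region, 4 institutions per country) are assumptions, chosen only so that the tree has 1 + 2 + 8 + 32 = 43 sites with 32 leaves.

```python
from typing import List, Optional, Set


class Site:
    """A node of the hierarchical Data Grid (hypothetical helper class)."""

    def __init__(self, name: str, tier: int, parent: Optional["Site"] = None):
        self.name = name
        self.tier = tier
        self.parent = parent
        self.children: List["Site"] = []
        self.files: Set[int] = set()  # IDs of the data files stored here


def build_tree(branching=(2, 4, 4)) -> Site:
    """Build a four-tier tree; the fan-out per tier is an assumed value."""
    root = Site("tier0-0", 0)
    level = [root]
    for tier, fanout in enumerate(branching, start=1):
        next_level = []
        for parent in level:
            for _ in range(fanout):
                child = Site(f"tier{tier}-{len(next_level)}", tier, parent)
                parent.children.append(child)
                next_level.append(child)
        level = next_level
    return root


def all_sites(root: Site) -> List[Site]:
    """Collect every site in the tree with an iterative traversal."""
    stack, found = [root], []
    while stack:
        site = stack.pop()
        found.append(site)
        stack.extend(site.children)
    return found


def hops_to_replica(leaf: Site, file_id: int) -> int:
    """Climb one level at a time until a site holding file_id is found.

    Returns the number of levels climbed (0 means a local hit).
    """
    site, hops = leaf, 0
    while site is not None:
        if file_id in site.files:
            return hops
        site, hops = site.parent, hops + 1
    raise LookupError(f"file {file_id} is not stored on the path to the root")


root = build_tree()
root.files = set(range(1000))  # all files start at the root (CERN) site
sites = all_sites(root)
leaves = [s for s in sites if not s.children]
client = leaves[0]
print(len(sites), len(leaves), hops_to_replica(client, 7))
```

Placing a replica on an intermediate site, for example `client.parent.files.add(7)`, shortens the climb for that leaf, which is the effect the replication strategies in this section aim for.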
Table I shows the simulation parameters, most of which are from [39] for the purpose of comparison. The CERN site originally generates and stores 1,000 to 5,000 different data files, each of which is 2 GB. Each site (except for the CERN site) dynamically generates and executes 400 jobs, and each job needs 1 to 10 files as input files. To execute each job, the site first checks if the input files are in its local storage. If not, it then goes to the nearest replica site to get the file. The available storage capacity at each site varies from 100 GB to 500 GB. The link bandwidth is 100 MB/s. Our simulation is run on a DELL desktop with a Core 2 Duo CPU and 1 GB RAM.

We emphasize that the above parameters take scaling into account. Usually, the amount of data in a Data Grid system is on the order of petabytes. To enable simulation of such a large amount of data, a scale of 1:1000 is considered. That is, the number of files in the system and the storage capacity of each site are both reduced by a factor of 1,000.

Data File Access Patterns. Each job has one to ten input files as noted above. For the input files of each job, we adopt two file access patterns: random access pattern and tempo-spatial access pattern. In the random access pattern, the input files of each job are randomly chosen from the entire set of data files available in the Data Grid. In the tempo-spatial access pattern, the input data files required by each job follow a Zipf distribution [51], [9] and a geographic distribution, as explained below.

• In Zipf distribution (the temporal access part), for the p data files, the probability for each job to access the j-th (1 ≤ j ≤ p) data file is represented by

$$P_j = \frac{1}{j^{\theta} \sum_{h=1}^{p} 1/h^{\theta}},$$

where 0 ≤ θ ≤ 1. When θ = 1, the above distribution follows the strict Zipf distribution, while for θ = 0, it follows the uniform distribution. We choose θ to be 0.8 based on real web trace studies [9].
• In geographic distribution (the spatial access part), the data file access frequencies of jobs at a site depend on its geographic location in the Grid, such that sites in proximity have similar data file access frequencies. In our case, the leaf sites close to each other have similar data file access patterns. Specifically,
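As a minimal sketch of the temporal (Zipf) part of the access pattern, the code below assumes the usual Zipf-like form in which the probability of accessing the j-th of p files is proportional to 1/j^θ, normalized by the sum of 1/h^θ over h = 1..p. The function names are hypothetical, and drawing a job's input files with replacement is an assumption made here for brevity, since the text does not say whether a job's input files must be distinct.

```python
import random
from typing import List


def zipf_probabilities(p: int, theta: float) -> List[float]:
    """Return [P_1, ..., P_p] with P_j = 1 / (j**theta * sum_h 1/h**theta)."""
    norm = sum(1.0 / h ** theta for h in range(1, p + 1))
    return [1.0 / (j ** theta * norm) for j in range(1, p + 1)]


def sample_job_inputs(probs: List[float], rng: random.Random) -> List[int]:
    """Pick 1 to 10 input file IDs (1-based) for one job.

    Assumption: files are drawn independently, so repeats are possible.
    """
    k = rng.randint(1, 10)
    return rng.choices(range(1, len(probs) + 1), weights=probs, k=k)


probs = zipf_probabilities(1000, 0.8)  # theta = 0.8 as chosen in the text
rng = random.Random(0)                 # fixed seed for repeatability
print(round(sum(probs), 6), sample_job_inputs(probs, rng))
```

With θ = 0 every file gets probability 1/p, and with θ = 1 the first file is exactly twice as likely as the second, matching the two limiting cases named in the text.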
