Data Replication in Data Intensive Scientific Applications - CiteSeerX

Fig. 10. Performance comparison between our distributed algorithm and Cascading in a typical cluster environment: (a) varying total number of files; (b) varying storage capacity of each node. Both panels plot the average file access time (seconds) of the distributed algorithm and Cascading under the Random, TS-Static, and TS-Dynamic access patterns. In (a), the storage capacity of each node is 500 GB; in (b), the number of data files in the cluster is 500,000. Each data file size is 1 GB.

Fig. 11. Performance comparison between our distributed algorithm and Cascading in a cluster environment with full connectivity: (a) varying total number of files; (b) varying storage capacity of each node. Both panels plot the average file access time (seconds) of the distributed algorithm and Cascading under the Random, TS-Static, and TS-Dynamic access patterns. In (a), the storage capacity of each node is 500 GB; in (b), the number of data files in the cluster is 500,000. Each data file size is 1 GB.

storage, and executing the next job. Otherwise, we observe the performance differences as shown in Figure 7.
Second, comparing performance across the different access patterns shows that, for our distributed algorithm, the percentage increase in access time due to the dynamic pattern is very small, around 5% to 8%. This shows that our algorithm adjusts well to the dynamic access pattern in a typical Grid environment.

Third, Figure 9 (b) shows that as the storage capacity of each site increases, the performance difference between our distributed algorithm and Cascading shrinks for all three access patterns. For example, when the storage capacity is 1 TB, Cascading yields more than twice the file access time of our distributed algorithm, while at 400 TB it incurs 40% more access time than our distributed algorithm. This shows that in the more stringent storage scenario, our caching algorithm is a better mechanism for reducing file access time and thus job execution time.

Comparison in Typical Cluster Environment. In a typical cluster environment, a site often has 1,000 to 10,000 nodes (note the difference between sites and nodes as explained in Section III), each with small local storage in the range of 10 GB to 1 TB. The network bandwidth within a cluster is typically 1 GB/s. We use the parameters in Table III to simulate the cluster environment. Figure 10 shows the performance comparison of our distributed algorithm and Cascading under different access patterns. We observe that for our distributed algorithm, the percentage increase in access time due to the dynamic pattern is relatively large, around 40% to 100%. This shows that our algorithm does not adjust well to the dynamic access pattern in a typical cluster environment. The reason is that in a cluster, the diameter of the network (the number of hops in the longest shortest path between two cluster nodes) is much larger than in the typical Grid environment, and the bandwidth in a cluster is much lower than in Grids. As a result, when the file access pattern of each site changes in the middle of the simulation, it takes longer to propagate this information to other nodes.
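The network diameter used in the argument above can be made concrete with a small sketch. It computes the longest shortest path, in hops, by running a breadth-first search from every node. The two topologies, `fully_connected` and `chain`, are hypothetical illustrations chosen for their extreme diameters; they are not the cluster or Grid topologies simulated in the paper.

```python
from collections import deque


def hop_distances(adjacency, source):
    """Breadth-first search: hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def diameter(adjacency):
    """Longest shortest path (in hops) over all node pairs of a connected graph."""
    return max(max(hop_distances(adjacency, s).values()) for s in adjacency)


def fully_connected(n):
    """Hypothetical topology: every node has a direct link to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def chain(n):
    """Hypothetical topology: nodes linked in a line, each to its neighbors."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


if __name__ == "__main__":
    print("fully connected:", diameter(fully_connected(6)))
    print("chain:", diameter(chain(6)))
```

In this sketch a fully connected graph has diameter 1 and a chain of n nodes has diameter n - 1. Under the simplifying assumption that an update advances one hop at a time, the number of hops it must cross to reach every node grows with the diameter.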
Because of this propagation delay, the nearest replica catalog maintained by each node can become outdated, causing frequent cache misses for needed input files.

Comparison in a Cluster Environment with Full Connectivity. We consider a cluster with a fully connected set of nodes and show the simulation results in Figure 11. There are two major observations. First, comparing Figure 9 with Figure 11, we observe that the average file access time in a cluster with full connectivity is lower than that of a Grid. This is because even though a typical Grid bandwidth might be higher than that of a cluster (4000 MB/s versus 100 MB/s, as shown in Tables II and III), there are far fewer point-to-point links in a Grid of 100 sites than in a cluster with a fully connected set of 1000 nodes (around 5 × 10^3 edges versus 5 × 10^5 edges). Therefore, the full network bisection bandwidth in a cluster is higher than that of a Grid (5 × 10^5 × 100 = 5 × 10^7 MB/s versus 5 × 10^3 × 4000 = 2 × 10^7 MB/s).

Second, we observe that for all three access patterns, our distributed algorithm performs the same as Cascading. This is because with full connectivity, each site can get the needed data files directly from the CERN site, so no site can observe the data access traffic of other sites to make intelligent caching decisions. Both our distributed caching algorithm and the Cascading algorithm reduce to the same LRU/LFU cache replacement algorithm. In a fully connected environment, our distributed algorithm loses its advantage as an effective caching algorithm, like any other distributed caching algorithm.

TABLE IV. Storage capacity of each site in EU Data Grid Testbed 1 [10]. IC is Imperial College.

Site         Storage (GB)
Bologna      30
Catania      30
CERN         10000
IC           80
Lyon         50
Milano       50
NIKHEF       70
NorduGrid    63
Padova       50
RAL          50
Torino       50

Fig. 12. EU Data Grid Testbed 1 [10]. Numbers indicate the bandwidth between two sites.

C. Distributed Algorithm versus LRU/LFU in General Topology

OptorSim [10] is a Grid simulator that simulates a general topology Grid testbed for data intensive high energy physics experiments. Three replication algorithms are implemented in OptorSim: (i) LRU (Least Recently Used), which deletes those files that have been used least recently; (ii) LFU (Least Frequently Used), which deletes those files that have been used least frequently in the recent past; and (iii) an economic model in which sites “buy” and “sell” files using an auction mechanism, and will only delete files if they are less valuable than the new file. We compare our distributed algorithm with LRU and LFU in a general topology because previous empirical research has found that even though the LRU/LFU strategies are very simple, it is very difficult to improve on them. The topology we use is the simulated topology of EU DataGrid TestBed 1 [10], as shown in Figure 12. The storage capacity of each site is given in Table IV.

Since the simulation results of LRU and LFU are quite similar, we only show the results of LRU. Figure 13 (a) and (b) show the performance comparison of our distributed algorithm and LRU when varying the total number of files and the storage capacity of each site, respectively. The results show that our distributed algorithm outperforms LRU in most cases. However, the performance difference is not as significant as that between our distributed algorithm and Cascading. In particular, under the TS-Dynamic access pattern, when the storage capacity of each site is 400 GB or 500 GB, the average file access time of LRU is even slightly smaller than that of our distributed algorithm.

VII. Conclusions and Future Work

In this article, we study how to replicate data files in data intensive scientific applications to reduce the file access time, taking into account the limited storage space of Grid sites. Our goal is to effectively reduce the access time of data files needed for job executions at Grid sites.
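For reference, the LRU and LFU baselines compared against in Section VI-C can be sketched as follows. This is a minimal illustration and not OptorSim's implementation: the class names `LRUCache` and `LFUCache` and the helper `count_hits` are hypothetical, capacity is counted in equal-size files, and the LFU variant counts accesses since a file entered the cache rather than over a "recent past" window.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the file whose last access is oldest (assumes equal-size files)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = OrderedDict()  # ordered from least to most recently used

    def access(self, name):
        """Return True on a hit; on a miss, store the file, evicting if full."""
        if name in self.files:
            self.files.move_to_end(name)  # mark as most recently used
            return True
        if len(self.files) >= self.capacity:
            self.files.popitem(last=False)  # drop the least recently used file
        self.files[name] = True
        return False


class LFUCache:
    """Evicts the file with the fewest accesses since it was cached."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}  # file name -> access count while cached

    def access(self, name):
        """Return True on a hit; on a miss, store the file, evicting if full."""
        if name in self.counts:
            self.counts[name] += 1
            return True
        if len(self.counts) >= self.capacity:
            # Ties are broken in favor of evicting the earliest-inserted file.
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
        self.counts[name] = 1
        return False


def count_hits(cache, trace):
    """Replay a sequence of file requests and count the cache hits."""
    return sum(cache.access(name) for name in trace)


if __name__ == "__main__":
    trace = ["a", "a", "b", "c", "a"]
    print("LRU hits:", count_hits(LRUCache(2), trace))
    print("LFU hits:", count_hits(LFUCache(2), trace))
```

On this short trace with room for two files, LRU evicts "a" when "c" arrives and then misses on the final request, while LFU keeps "a" because of its higher access count. Which policy does better depends on the access pattern.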
We propose a centralized greedy algorithm with a performance guarantee, and show that it performs comparably with the optimal algorithm. We also propose a distributed algorithm wherein Grid sites react closely to the Grid status and make intelligent caching decisions. Using GridSim, a distributed Grid simulator, we demonstrate that the distributed replication technique significantly outperforms a popular existing replication technique, and that it is more adaptive to dynamic changes of file access patterns in Data Grids. We plan to further develop our work in the following directions:
• As ongoing and future work, we are exploiting the synergies between data replication and job scheduling to achieve better system performance. Data replication and job scheduling are two different but complementary