
Data Replication in Data Intensive Scientific Applications With Performance Guarantee

Dharma Teja Nukarapu, Bin Tang, Liqiang Wang, and Shiyong Lu

Abstract— Data replication has been well adopted in data intensive scientific applications to reduce data file transfer time and bandwidth consumption. However, the problem of data replication in Data Grids, an enabling technology for data intensive applications, has proven to be NP-hard and even non-approximable, making this problem difficult to solve. Meanwhile, most of the previous research in this field is either theoretical investigation without practical consideration, or heuristics-based with little or no theoretical performance guarantee. In this paper, we propose a data replication algorithm that not only has a provable theoretical performance guarantee, but also can be implemented in a distributed and practical manner. Specifically, we design a polynomial time centralized replication algorithm that reduces the total data file access delay by at least half of that reduced by the optimal replication solution. Based on this centralized algorithm, we also design a distributed caching algorithm, which can be easily adopted in a distributed environment such as Data Grids. Extensive simulations are performed to validate the efficiency of our proposed algorithms. Using our own simulator, we show that our centralized replication algorithm performs comparably to the optimal algorithm and other intuitive heuristics under different network parameters. Using GridSim, a popular distributed Grid simulator, we demonstrate that the distributed caching technique significantly outperforms an existing popular file caching technique in Data Grids, and that it is more scalable and adaptive to dynamic changes of file access patterns in Data Grids.

Index Terms— Data intensive applications, Data Grids, data replication, algorithm design and analysis, simulations.

I. Introduction

Data intensive scientific applications, which mainly aim to answer some of the most fundamental questions facing human beings, are becoming increasingly prevalent in a wide range of scientific and engineering research domains. Examples include human genome mapping [40], high energy particle physics and astronomy [1], [26], and climate change modeling [32]. In such applications, large amounts of data sets are generated, accessed, and analyzed by scientists worldwide. The Data Grid [49], [5], [22] is an enabling technology for data intensive applications. It is composed of hundreds of geographically distributed computation, storage, and networking resources to facilitate data sharing and management in data intensive applications. One distinct feature of Data Grids is that they produce and manage very large amounts of data sets, on the order of terabytes and petabytes. For example, the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) near Geneva, Switzerland, is the largest scientific instrument on the planet. Since it began operation in August of 2008, it was expected to produce roughly 15 petabytes of data annually, which are accessed and analyzed by thousands of scientists around the world [3].

(Author affiliations: Dharma Teja Nukarapu and Bin Tang are with the Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, KS 67260; Liqiang Wang is with the Department of Computer Science, University of Wyoming, Laramie, WY 82071; and Shiyong Lu is with the Department of Computer Science, Wayne State University, Detroit, MI 48202.)

Replication is an effective mechanism to reduce file transfer time and bandwidth consumption in Data Grids: placing the most frequently accessed data at the right locations can greatly improve the performance of data access from a user's perspective. Meanwhile, even though the memory and storage capacities of modern computers are ever increasing, they still cannot keep up with the demand of storing the huge amounts of data produced in scientific applications. With each Grid site having limited memory/storage resources,¹ the data replication problem becomes more challenging. We find that most of the previous work on data replication under limited storage capacity in Data Grids occupies two extremes of a wide spectrum: it is either theoretical investigation without practical consideration, or heuristics-based experimentation without performance guarantee (please refer to Section II for a comprehensive literature review). In this article, we endeavor to bridge this gap by proposing an algorithm that has a provable performance guarantee and can also be implemented in a distributed manner.

Our model is as follows. Scientific data, in the form of data files, are produced and stored in the Grid sites as the result of scientific experiments, simulations, or computations. Each Grid site executes a sequence of scientific jobs submitted by its users. To execute each job, some scientific data are usually needed as input files. If these files are not in the local storage resource of the Grid site, they are accessed from other sites, and transferred to and replicated in the local storage of the site if necessary. Each Grid site can store such data files subject to its storage/memory capacity limitation. We study how to replicate the data files onto Grid sites with limited storage space in order to minimize the overall file access time for Grid sites to finish executing their jobs.

Specifically, we formulate this problem as a graph-theoretical problem and design a centralized greedy data replication algorithm, which provably gives a total data file access time reduction (compared to no replication) of at least half of that obtained by the optimal replication algorithm. We also design a distributed caching technique² based on the centralized replication algorithm, and show experimentally that it can be easily adopted in a distributed environment such as Data Grids. The main idea of our distributed algorithm is that when multiple replicas of a data file exist in a Data Grid, each Grid site keeps track of (and thus fetches the data file from) its closest replica site. This can dramatically improve Data Grid performance because transferring large files takes a tremendous amount of time and bandwidth [18]. The central part of our distributed algorithm is a mechanism for each Grid site to accurately locate and maintain such a closest replica site. Our distributed algorithm is also adaptive: each Grid site makes its file caching decisions (i.e., replica creation and deletion) by observing the recent data access traffic going through it. Our simulation results show that our caching strategy adapts better to dynamic changes of user access behavior, compared to another existing caching technique in Data Grids [39].

¹ In this article, we use memory and storage interchangeably.

Currently, for most Data Grid scientific applications, the massive amount of data is generated at a few centralized sites (such as CERN for high energy physics experiments) and accessed by all other participating institutions. We envision that, in the future, in a closely collaborative scientific environment upon which data-intensive applications are built, each participating institutional site could well be a data producer as well as a data consumer, generating data as a result of the scientific experiments or simulations it performs, and meanwhile accessing data from other sites to run its own applications. Therefore, data transfer is no longer from a few data-generating sites to all other "client" sites, but could take place between an arbitrary pair of Grid sites. It is a great challenge in such an environment to efficiently share all the widely distributed and dynamically generated data files in terms of computing resources, bandwidth usage, and file transfer time. Our network model reflects this vision (please refer to Section III for the detailed Data Grid model).

The main results and contributions of our paper are as follows:

1) We identify the limitation of current research on data replication in Data Grids: it is either theoretical investigation without practical consideration, or heuristics-based implementation without a provable performance guarantee.

2) To the best of our knowledge, we are the first to propose a data replication algorithm for Data Grids that not only has a provable theoretical performance guarantee, but can also be implemented in a distributed manner.

3) Via simulations, we show that our proposed replication strategies perform comparably with the optimal algorithm and significantly outperform an existing popular replication technique [39].

4) Via simulations, we show that our replication strategy adapts well to dynamic access pattern changes in Data Grids.

² We emphasize the difference between replication and caching in this article. Replication is a proactive and centralized approach wherein the server replicates files into the network to achieve a certain global performance optimum, whereas caching is reactive and takes place on the client side to achieve a certain local optimum; it depends on the locally observed network traffic, job and file distribution, etc.

Paper Organization. The rest of the paper is organized as follows. We discuss the related work in Section II. In Section III, we present our Data Grid model and formulate the data replication problem. Sections IV and V propose the centralized and distributed algorithms, respectively. We present and analyze the simulation results in Section VI. Section VII concludes the paper and points out some future work.

II. Related Work

Replication has been an active research topic for many years in the World Wide Web [35], peer-to-peer networks [4], ad hoc and sensor networking [25], [44], and mesh networks [28]. In Data Grids, enormous scientific data and complex scientific applications call for new replication algorithms, which have attracted much research recently.

The work most closely related to ours is by Čibej, Slivnik, and Robič [47]. The authors study data replication in Data Grids as a static optimization problem. They show that this problem is NP-hard and non-approximable, which means that there is no polynomial-time algorithm that provides an approximate solution if P ≠ NP. The authors discuss two solutions: integer programming and simplifications. They only consider static data replication for the purpose of formal analysis. The limitation of the static approach is that the replication cannot adjust to dynamically changing user access patterns. Furthermore, their centralized integer programming technique cannot be easily implemented in a distributed Data Grid. Moreover, Baev et al. [6], [7] show that if all the data files have uniform size, then this problem is indeed approximable, and they give 20.5-approximation and 10-approximation algorithms. However, their approach, which is based on rounding an optimal solution to the linear programming relaxation of the problem, cannot be easily implemented in a distributed way. In this work, we follow the same direction (i.e., uniform data size), but design a polynomial time approximation algorithm that can also be easily implemented in a distributed environment like Data Grids.

Raicu et al. [36], [37] study, both theoretically and empirically, resource allocation in data intensive applications. They propose a "data diffusion" approach that acquires computing and storage resources dynamically, replicates data in response to demand, and schedules computations close to the data. They give an O(NM)-competitive online algorithm, where N is the number of stores, each of which can store M objects of uniform size. However, their model does not allow keeping multiple copies of an object simultaneously in different stores. In our model, we assume each object can have multiple copies, each on a different site.

Several replication schemes based on economic models have been proposed. The authors in [11], [8] use an auction protocol to make the replication decision and to trigger long-term optimization by using file access patterns. You et al. [50] propose utility-based replication strategies. Lei et al. [30] and Schintke and Reinefeld [41] address data replication for availability in the face of unreliable components, which is different from our work.

Jiang and Zhang [27] propose a technique in Data Grids to measure how soon a file is to be re-accessed before being evicted, compared with other files. They consider both the consumed disk space and the disk cache size. Their approach can accurately rank the value of each file to be cached. Tang et al. [45] present a dynamic replication algorithm for multi-tier Data Grids. They propose two dynamic replica algorithms: Single Bottom Up and Aggregate Bottom Up. Performance results show that both algorithms reduce the average response time of data access compared to a static replication strategy in a multi-tier Data Grid. Chang and Chang [13] propose a dynamic data replication mechanism called Latest Access Largest Weight (LALW). LALW associates a different weight with each historical data access record: a more recent data access has a larger weight. LALW gives a more precise metric to determine a popular file for replication.

Park et al. [33] propose a dynamic replication strategy, called BHR, which benefits from "network-level locality" to reduce data access time by avoiding network congestion in a Data Grid. Lamehamedi et al. [29] propose a lightweight Data Grid middleware, at the core of which are some dynamic data and replica location and placement techniques.

Ranganathan and Foster [39] present six different replication strategies: No Replication or Caching, Best Client, Cascading Replication, Plain Caching, Caching plus Cascading Replication, and Fast Spread. All of these strategies are evaluated with three user access patterns (Random Access, Temporal Locality, and Geographical plus Temporal Locality). Via simulation, the authors find that Fast Spread performs best under Random Access, and Cascading works better than the others under Geographical plus Temporal Locality. Due to its wide popularity in the literature and its simplicity to implement, we compare our distributed replication algorithm to this work.

On the systems side, there are several real testbed and system implementations utilizing data replication. One system is the Data Replication Service (DRS) designed by Chervenak et al. [17]. DRS is a higher-level data management service for scientific collaborations in Grid environments. It replicates a specified set of files onto a storage system and registers the new files in the replica catalog of the site. Another system is the Physics Experimental Data Export (PheDEx) system [21]. PheDEx supports both hierarchical distribution and subscription-based transfer of data.

To execute the submitted jobs, each Grid site either gets the needed input data files to its local computing resource, schedules the job at sites where the needed input data files are stored, or transfers both the data and the job to a third site that performs the computation and returns the result. In this paper, we focus on the first approach. We leave the job scheduling problem [48], which studies how to map jobs onto Grid resources for execution, and its coupling with data replication, as our future research. We are aware of very active research studying the relationship between these two [12], [38], [16], [23], [14], [46], [19], [10]. The focus of our paper, however, is on a data replication strategy with a provable performance guarantee.

As demonstrated by the experiments of Chervenak et al. [17], the time to execute a scientific job is mainly the time it takes to transfer the needed input files from server sites to local sites. Similar to other work on replica management for Data Grids [30], [41], [10], we only consider the file transfer time (access time), not the job execution time in the processor or any other internal storage processing or I/O time. Since the data are read-only for many Data Grid applications [38], we do not consider consistency maintenance between the master file and the replica files. Readers who are interested in consistency maintenance in Data Grids may refer to [20], [42], [34].

Fig. 1. Data Grid Model.

III. Data Grid Model and Problem Formulation

We consider a Data Grid model as shown in Figure 1. A Data Grid consists of a set of sites. There are institutional sites, which correspond to the different scientific institutions participating in the scientific project. There is one top level site, which is the centralized management entity in the entire Data Grid environment; its major role is to manage the Centralized Replica Catalogue (CRC). The CRC provides location information about each data file and its replicas, and it is essentially a mapping between each data file and all the institutional sites where the data is replicated. Each site (top level site or institutional site) may contain multiple grid resources. A grid resource could be either a computing resource, which allows users to submit and execute jobs, or a storage resource, which allows users to store data files.³ We assume that each site has both computing and storage capacities, and that within each site the bandwidth is high enough that the communication delay inside the site is negligible.

³ We differentiate between site and node in this paper. We refer to a site at the Grid level and a node at the cluster level, where each node is a computing resource, a storage resource, or a combination of both.


For the data file replication problem addressed in this article, there are multiple data files, and each data file is produced by its source site (a top level site or an institutional site may act as the source site for more than one data file). Each Grid site has limited storage capacity and can cache/store multiple data files subject to its storage capacity constraint.

Data Grid Model. A Data Grid can be represented as an undirected graph $G(V, E)$, where the set of vertices $V = \{1, 2, \ldots, n\}$ represents the sites in the Grid, and $E$ is the set of weighted edges in the graph. The edge weight may represent a link metric such as loss rate, distance, delay, or transmission bandwidth. In this paper, the edge weight represents the bandwidth, and we assume all edges have the same bandwidth $B$ (in Section VI, we study a heterogeneous environment where different edges have different bandwidths). There are $p$ data files $D = \{D_1, D_2, \ldots, D_p\}$ in the Data Grid; $D_j$ is originally produced and stored at its source site $S_j \in V$. Note that a site can be the original source site of multiple data files. The size of data file $D_j$ is $s_j$. Each site $i$ has a storage capacity of $m_i$ (for a source site $i$, $m_i$ is the available storage space after storing its original data).

Users of the Data Grid submit jobs to their own sites, and the jobs are executed in FIFO order. Assume that Grid site $i$ has $n_i$ submitted jobs $\{t_{i1}, t_{i2}, \ldots, t_{in_i}\}$, and each job $t_{ik}$ ($1 \le k \le n_i$) needs a subset $F_{ik}$ of $D$ as its input files for execution. If we use $w_{ij}$ to denote the number of times that site $i$ needs $D_j$ as an input file, then $w_{ij} = \sum_{k=1}^{n_i} c_k$, where $c_k = 1$ if $D_j \in F_{ik}$ and $c_k = 0$ otherwise. The transmission time of sending data file $D_j$ along any edge is $s_j / B$. We use $d_{ij}$ to denote the number of transmissions needed to transmit a data file from site $i$ to site $j$ (which is equal to the number of edges between these two sites). We do not consider the file propagation time along the edge since, compared with the transmission time, it is quite small and thus negligible. The total data file access cost in the Data Grid before replication is the total transmission time spent to get all needed data files for each site:
$$\sum_{i=1}^{n} \sum_{j=1}^{p} w_{ij} \times d_{iS_j} \times s_j / B.$$
The objective of our file replication problem is to minimize the total data file access cost by replicating data files in the Data Grid. Below, we give a formal definition of the file replication problem addressed in this article.

Problem Formulation. The data replication problem in the Data Grid is to select a set of sets $M = \{A_1, A_2, \ldots, A_p\}$, where $A_j \subset V$ is the set of Grid sites that store a replica copy of $D_j$, to minimize the total access cost in the Data Grid:
$$\tau(G, M) = \sum_{i=1}^{n} \sum_{j=1}^{p} w_{ij} \times \min_{k \in \{S_j\} \cup A_j} d_{ik} \times s_j / B,$$
under the storage capacity constraint that $|\{A_j \mid i \in A_j\}| \le m_i$ for all $i \in V$, which means that each Grid site $i$ appears in at most $m_i$ sets of $M$. Here, each site accesses a data file from its closest site holding a replica copy. Čibej, Slivnik, and Robič show that this problem (with arbitrary data file sizes) is NP-hard and even non-approximable [47]. However, Baev et al. [6], [7] have shown that when the data files have uniform size (i.e., $s_i = s_j$ for any $D_i, D_j \in D$), the problem is approximable. Below we present a centralized greedy algorithm with a provable performance guarantee for uniform data size, and a distributed algorithm, respectively.
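To make the objective function concrete, the following Python sketch evaluates $\tau(G, M)$ directly from the definitions above; the function name and the list/set layout are our own illustrative choices, not part of the paper.

```python
def total_access_cost(w, d, s, B, source, replicas):
    """Compute tau(G, M).

    w[i][j]    : number of times site i needs file j (w_ij)
    d[i][k]    : hop count between sites i and k (d_ik)
    s[j]       : size of file j (s_j);  B : edge bandwidth
    source[j]  : source site S_j of file j
    replicas[j]: set A_j of sites holding a replica of file j
    """
    n, p = len(w), len(s)
    cost = 0.0
    for i in range(n):
        for j in range(p):
            holders = {source[j]} | set(replicas[j])      # {S_j} union A_j
            nearest = min(d[i][k] for k in holders)       # closest holder of D_j
            cost += w[i][j] * nearest * s[j] / B
    return cost
```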

IV. Centralized Data Replication Algorithm in Data Grids

Our centralized data replication algorithm is a greedy algorithm. Initially, all Grid sites have empty storage space (except for the sites that originally produce and store some files). Then, at each step, the algorithm places one data file into the storage space of one site such that the reduction of the total access cost in the Data Grid is maximized at that step. The algorithm terminates when all storage space at the sites has been filled with replicated data files, or the total access cost cannot be reduced further. The algorithm is given below.

Algorithm 1: Centralized Data Replication Algorithm
BEGIN
    M = A_1 = A_2 = ... = A_p = ∅ (empty sets);
    while (the total access cost can still be reduced by replicating data files in the Data Grid)
        Among all sites with available storage capacity and all data files, let replicating data file D_i on site m give the maximum
            τ(G, {A_1, A_2, ..., A_i, ..., A_p}) − τ(G, {A_1, A_2, ..., A_i ∪ {m}, ..., A_p});
        A_i = A_i ∪ {m};
    end while;
    RETURN M = {A_1, A_2, ..., A_p};
END. ♦
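A minimal Python sketch of Algorithm 1, assuming unit-size files and reusing the illustrative total_access_cost helper from Section III; variable names and tie-breaking are our own choices, not prescribed by the paper.

```python
def greedy_replication(w, d, s, B, source, capacity):
    """Greedy placement (Algorithm 1 sketch): repeatedly replicate the
    (site, file) pair whose placement maximizes the drop in total access
    cost, until storage is exhausted or no placement helps."""
    n, p = len(w), len(s)
    replicas = [set() for _ in range(p)]              # A_1, ..., A_p
    used = [0] * n                                    # storage pages used per site
    current = total_access_cost(w, d, s, B, source, replicas)
    while True:
        best_gain, best = 0.0, None
        for site in range(n):
            if used[site] >= capacity[site]:
                continue                              # this site is full
            for j in range(p):
                if site == source[j] or site in replicas[j]:
                    continue                          # site already holds D_j
                replicas[j].add(site)                 # tentative placement
                gain = current - total_access_cost(w, d, s, B, source, replicas)
                replicas[j].discard(site)
                if gain > best_gain:
                    best_gain, best = gain, (site, j)
        if best is None:                              # cost cannot be reduced further
            return replicas
        site, j = best
        replicas[j].add(site)
        used[site] += 1
        current -= best_gain
```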

The total running time of the greedy data replication algorithm is $O(p^2 n^3 m)$, where $n$ is the number of sites in the Data Grid, $m$ is the average number of memory pages per site, and $p$ is the total number of data files. Note that the number of iterations in the above algorithm is bounded by $nm$; at each iteration we need to compute at most $pn$ benefit values, and each benefit value computation may take $O(pn)$ time. Below we present two lemmas showing some properties of the above greedy algorithm. They are also needed to prove the theorem below, which shows that the greedy algorithm returns a solution with a near-optimal total access cost reduction.
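Spelling out the product of the three factors just mentioned:
$$\underbrace{nm}_{\text{iterations}} \times \underbrace{pn}_{\text{benefit values per iteration}} \times \underbrace{O(pn)}_{\text{time per benefit value}} = O(p^2 n^3 m).$$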

Lemma 1: The total access cost reduction is independent of the order of the data file replications.

Proof: This is because the reduction of total access cost depends only on the initial and final storage configurations of each site, which do not depend on the exact order of the data replication.

Lemma 2: Let $M$ be the set of cache sites at an arbitrary stage. Let $A_1$ and $A_2$ be two distinct sets of sites not included in $M$. Then the access cost reduction due to the selection of $A_2$ on top of the cache site set $M \cup A_1$, denoted as $a$, is less than or equal to the access cost reduction due to the selection of $A_2$ on top of $M$, denoted as $b$.

Proof: Let ClientSet$(A_2, M)$ be the set of sites that have their closest cache sites in $A_2$ when $A_2$ is selected as cache sites on top of $M$. With the selection of $A_1$ before $A_2$, some sites in ClientSet$(A_2, M)$ may find that they are closer to some cache sites in $A_1$ and then "deflect" to become elements of ClientSet$(A_1, M)$. Therefore, ClientSet$(A_2, M \cup A_1) \subseteq$ ClientSet$(A_2, M)$.

Since the selection of $A_1$ before $A_2$ does not change the closest sites in $M$ for the sites in ClientSet$(A_2, M \cup A_1)$, if ClientSet$(A_2, M \cup A_1) =$ ClientSet$(A_2, M)$, then $a = b$; if ClientSet$(A_2, M \cup A_1) \subset$ ClientSet$(A_2, M)$, then $a < b$.

The above lemma says that the total access cost reduction due to an arbitrary cache site never increases with the selection of other cache sites before it.

Fig. 2. Each site $i$ in the original graph $G$ has $m_i$ memory space; each site in the modified graph $G'$ has $2m_i$ memory space.

We now present a theorem that shows the performance guarantee of our centralized greedy algorithm. In the theorem and the following proof, we assume that all data files have the same uniform size (each occupying a unit of memory). The proof technique used here is similar to that used in [44] for the closely related problem of data caching in ad hoc networks.

Theorem 1: Under the Data Grid model we propose, the centralized data replication algorithm delivers a solution whose total access cost reduction is at least half of the optimal total access cost reduction.

Proof: Let $L$ be the total number of memory pages in the Data Grid. Without loss of generality, we assume $L$ is the total number of iterations of the greedy algorithm (note that it is possible that the total access cost cannot be reduced further even though sites still have available memory space). We denote the sequence of selections made by the greedy algorithm by $\{n_1^g f_1^g, n_2^g f_2^g, \ldots, n_L^g f_L^g\}$, where $n_i^g f_i^g$ indicates that at iteration $i$, data file $f_i^g$ is replicated at site $n_i^g$. We also denote the sequence of selections made by the optimal algorithm by $\{n_1^o f_1^o, n_2^o f_2^o, \ldots, n_L^o f_L^o\}$, where $n_i^o f_i^o$ indicates that at iteration $i$, data file $f_i^o$ is replicated at site $n_i^o$. Let $O$ and $C$ be the total access cost reductions of the optimal algorithm and the greedy algorithm, respectively.

Next, we consider a modified Data Grid $G'$, wherein each site $i$ doubles its memory capacity (now $2m_i$). We construct a cache placement solution for $G'$ such that, for site $i$, the first $m_i$ memory pages store the data files obtained by the greedy algorithm and the second $m_i$ memory pages store the data files selected by the optimal algorithm, as shown in Figure 2. Now we calculate the total access cost reduction $O'$ for $G'$. Obviously, $O' \ge O$, simply because each site in $G'$ caches extra data files beyond the data files cached at the same site in $G$.

According to Lemma 1, since the total access cost reduction is independent of the order of the individual selections, we consider the sequence of cache selections in $G'$ to be $\{n_1^g f_1^g, n_2^g f_2^g, \ldots, n_L^g f_L^g, n_1^o f_1^o, n_2^o f_2^o, \ldots, n_L^o f_L^o\}$, that is, the greedy selections followed by the optimal selections. Therefore, in $G'$, the total access cost reduction is due to the greedy selection sequence plus the optimal selection sequence. For the greedy selection sequence, after the selection of $n_L^g f_L^g$, the access cost reduction is $C$. For the optimal selection sequence, we need to calculate the access cost reduction due to the addition of each $n_i^o f_i^o$ ($1 \le i \le L$) on top of the already added selections $\{n_1^g f_1^g, n_2^g f_2^g, \ldots, n_L^g f_L^g, n_1^o f_1^o, n_2^o f_2^o, \ldots, n_{i-1}^o f_{i-1}^o\}$. From Lemma 2, we know that this is at most the access cost reduction due to the addition of $n_i^o f_i^o$ on top of the already added sequence $\{n_1^g f_1^g, n_2^g f_2^g, \ldots, n_{i-1}^g f_{i-1}^g\}$. Note that the latter is at most the access cost reduction due to the addition of $n_i^g f_i^g$ on top of the same sequence $\{n_1^g f_1^g, n_2^g f_2^g, \ldots, n_{i-1}^g f_{i-1}^g\}$, because the greedy algorithm picks the most beneficial placement at each iteration. Thus, the sum of the access cost reductions due to the selections $\{n_1^o f_1^o, n_2^o f_2^o, \ldots, n_L^o f_L^o\}$ is at most $C$ as well.

Thus, we conclude that $O$ is at most $O'$, which is at most two times $C$.
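In summary, writing $\Delta_i$ (our shorthand, not used in the paper) for the access cost reduction contributed by $n_i^o f_i^o$ in $G'$, the proof chains the inequalities
$$O \;\le\; O' \;=\; C + \sum_{i=1}^{L} \Delta_i \;\le\; C + C \;=\; 2C.$$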

V. Distributed Data Caching Algorithm in Data Grids

In this section, we design a localized distributed caching algorithm based on the centralized algorithm. In the distributed algorithm, each Grid site observes the local Data Grid traffic to make intelligent caching decisions. Our distributed caching algorithm is advantageous since it does not need global information such as the network topology of the Grid, and it is more reactive to network state such as the file distribution, user access pattern, and job distribution in the Data Grid. Therefore, our distributed algorithm can adapt well to such dynamic changes in the Data Grid.

The distributed algorithm is composed of two important components: a nearest replica catalog (NRC) maintained at each site and a localized data caching algorithm running at each site. Again, as stated in Section III, the top level site maintains a Centralized Replica Catalogue (CRC), which essentially maintains a replica site list $C_j$ for each data file $D_j$. The replica site list $C_j$ contains the set of sites (including the source site $S_j$) that have a copy of $D_j$.

Nearest Replica Catalog (NRC). Each site $i$ in the Grid maintains an NRC, and each entry in the NRC is of the form $(D_j, N_j)$, where $N_j$ is the nearest site that has a replica of $D_j$. When a site executes a job, it determines from its NRC the nearest replica site for each of its input data files and goes to that site directly to fetch the file (provided the input file is not in its local storage). In the initialization stage, the source sites send messages to the top level site informing it about their original data files; thus, the centralized replica catalog initially records each data file and its source site. The top level site then broadcasts the replica catalogue to the entire Data Grid, and each Grid site initializes its NRC entries to the source site of each data file. Note that if $i$ is the source site of $D_j$ or has cached $D_j$, then $N_j$ is interpreted as the second nearest replica site, i.e., the closest site (other than $i$ itself) that has a copy of $D_j$. The second nearest replica site information is helpful when site $i$ decides to remove the cached file $D_j$. If there is a cache miss, the request is redirected to the top level site, which sends back the replica site list for that data file. After receiving this information, the site updates its NRC table accordingly and sends the request to its nearest replica site for that data file. Therefore, a cache miss takes a much longer time. The above information is in addition to any information (such as routing tables) maintained by the underlying routing protocol in the Data Grid.
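As an illustration, the per-site NRC can be kept as a simple mapping from file IDs to site IDs; the class and method names below are our own and are not part of the paper.

```python
class NearestReplicaCatalog:
    """Per-site NRC: maps each data file ID to its nearest replica site.
    For files stored locally, the entry holds the *second*-nearest replica
    site, which is consulted if the local copy is later evicted."""

    def __init__(self, site_id, source_of):
        self.site_id = site_id
        # initialized from the CRC broadcast: every file maps to its source site
        self.nearest = dict(source_of)            # file_id -> site_id

    def fetch_target(self, file_id, local_files):
        """Return the site to fetch file_id from (the local site if cached)."""
        if file_id in local_files:
            return self.site_id
        # None signals a cache miss, to be resolved via the top level site
        return self.nearest.get(file_id)
```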

Maintenance of the NRC. As stated above, when a site $i$ caches a data file $D_j$, $N_j$ is no longer the nearest replica site, but rather the second-nearest replica site. Site $i$ sends a message to the top level site indicating that it is a new replica site of $D_j$. When the top level site receives the message, it updates its CRC by adding site $i$ to data file $D_j$'s replica site list $C_j$. It then broadcasts to the entire Data Grid a message containing the tuple $(i, D_j)$, indicating the ID of the new replica site and the ID of the newly made replica file. Consider a site $l$ that receives such a message $(i, D_j)$. Let $(D_j, N_j)$ be the NRC entry at site $l$, signifying that $N_j$ is the replica site for $D_j$ currently closest to $l$. If $d_{li} < d_{lN_j}$, then the NRC entry $(D_j, N_j)$ is updated to $(D_j, i)$. Note that the distance values, in terms of the number of hops, are available from the routing protocol.

When a site $i$ removes a data file $D_j$ from its local storage, its second-nearest replica site $N_j$ becomes its nearest replica site (note that the corresponding NRC entry does not change). In addition, site $i$ sends a message to the top level site indicating that it is no longer a replica site of $D_j$. The top level site updates its replica catalogue by deleting $i$ from $D_j$'s replica site list and then broadcasts a message with the information $(i, D_j, C_j)$ to the Data Grid, where $C_j$ is the updated replica site list for $D_j$. Consider a site $l$ that receives such a message, and let $(D_j, N_j)$ be its NRC entry. If $N_j = i$, then site $l$ updates its nearest replica site entry using $C_j$ (with the help of the routing table).
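The two maintenance rules can then be sketched as handlers for the top level site's broadcasts; hop_dist is assumed to be supplied by the underlying routing protocol, and all names are illustrative.

```python
def on_replica_added(nrc, new_site, file_id, hop_dist):
    """Broadcast (i, D_j): adopt the new replica site if it is closer."""
    current = nrc.nearest.get(file_id)
    if current is None or hop_dist(nrc.site_id, new_site) < hop_dist(nrc.site_id, current):
        nrc.nearest[file_id] = new_site

def on_replica_removed(nrc, gone_site, file_id, replica_list, hop_dist):
    """Broadcast (i, D_j, C_j): if our nearest replica site disappeared,
    re-select the closest site from the updated replica site list C_j."""
    if nrc.nearest.get(file_id) == gone_site:
        nrc.nearest[file_id] = min(replica_list, key=lambda s: hop_dist(nrc.site_id, s))
```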

Localized Data Caching Algorithm. Since each site has limited storage capacity, a good data caching algorithm that runs distributedly at each site is needed. To this end, each site observes the data access traffic locally for a sufficiently long time. The local access traffic observed by site $i$ includes its own local data requests, non-local data requests to data files cached at $i$, and the data request traffic that site $i$ forwards to other sites in the network.

Before we present the data caching algorithm, we give the following two definitions (written out as formulas after this list):

• Reduction in access cost of caching a data file. The reduction in access cost as the result of caching a data file at a site is given by: the access frequency for that file in the local access traffic observed by the site × the distance to the nearest replica site.

• Increase in access cost of deleting a data file. The increase in access cost as the result of deleting a data file at a site is given by: the access frequency for that file in the local access traffic observed by the site × the distance to the second-nearest replica site.
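Using the notation of Section III, and writing $a_{ij}$ (our shorthand) for the access frequency of $D_j$ in the local traffic observed by site $i$, these two quantities can be written as
$$\text{reduction}_i(D_j) = a_{ij} \times d_{i N_j} \times s_j / B, \qquad \text{increase}_i(D_j) = a_{ij} \times d_{i N'_j} \times s_j / B,$$
where $N_j$ is the nearest and $N'_j$ the second-nearest replica site of $D_j$ known to site $i$ (under uniform file sizes the factor $s_j / B$ is a constant, so hop counts suffice).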

For each data file $D_j$ not cached at site $i$, site $i$ calculates the reduction in access cost of caching $D_j$, while for each data file $D_j$ cached at site $i$, site $i$ computes the increase in access cost of deleting the file. With the help of the NRC, each site can compute such a reduction or increase of access cost in a distributed manner using only local information. Thus our algorithm is adaptive: each site makes a data caching or deletion decision by locally observing the most recent data access traffic in the Data Grid.

Cache Replacement Policy. With the above knowledge, a site always tries to cache the data files that fit in its local storage and give the most reduction in access cost. When the local storage capacity of a site is full, the following cache replacement policy is used. Let $|D|$ denote the size of a data file (or a set of data files) $D$. If the access cost reduction of caching a newly available data file $D_j$ is higher than the access cost increase of evicting some set $D$ of cached data files, where $|D| > |D_j|$, then the set $D$ is replaced by $D_j$.
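A sketch of this replacement decision in Python; how the evicted set $D$ is chosen is not prescribed by the paper, so the greedy pick of the cheapest-to-evict files below is our own assumption.

```python
def replacement_decision(new_file, reduction_of_new, cached, increase_of, size_of, free_space):
    """Return the set of cached files to evict for new_file, or None to skip caching.
    increase_of[f]: access cost increase of deleting f; size_of[f]: file size."""
    if size_of[new_file] <= free_space:
        return set()                                  # fits without any eviction
    evict, freed, penalty = set(), free_space, 0.0
    # free space by giving up the files that are cheapest to re-fetch later
    for f in sorted(cached, key=lambda f: increase_of[f]):
        evict.add(f)
        freed += size_of[f]
        penalty += increase_of[f]
        if freed >= size_of[new_file]:
            break
    # replace only if enough space is freed and the gain outweighs the penalty
    if freed >= size_of[new_file] and reduction_of_new > penalty:
        return evict
    return None
```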

Data File Consistency Maintenance. In this work, we assume that all the data are read-only, and thus no data consistency mechanism is needed. However, if consistency were to be considered, two mechanisms could be used. First, when the file at a site is modified, this can be treated as a data file removal in our distributed algorithm discussed above, with some necessary deviation: the site sends a message to the top level site indicating that it is no longer a replica site of the file, and the top level site broadcasts a message to the Data Grid so that each site can delete the corresponding nearest replica site entry. The second mechanism is Time to Live (TTL), which is the time until the replica copies are considered valid. The master copy and its replicas are considered outdated at the end of the TTL value and will not be used for job execution.

VI. Performance Evaluation

In this section, we demonstrate the performance of our designed algorithms via extensive simulations. First, we compare the centralized greedy algorithm (referred to as Greedy) with the centralized optimal algorithm (referred to as Optimal) and two other intuitive heuristics, using our own simulator written in Java (Section VI-A). Then, using GridSim [43], a well-known Grid simulator, we evaluate our distributed algorithm against an existing caching algorithm from the well-cited literature [39] (Section VI-B), which is based on a hierarchical Data Grid topology. Finally, we compare our distributed algorithm with the popular LRU/LFU algorithms implemented in OptorSim [10] on a general Data Grid topology.


Fig. 3. Performance comparison of the Optimal, Greedy, Local Greedy, and Random algorithms (y-axis: total access cost): (a) varying the number of data files; (b) varying the storage capacity; (c) varying the number of Grid sites. Unless varied, the number of sites is 10, the number of data files is 10, and the memory capacity of each site is 5. Data file size is 1.

A. Comparison of Centralized Algorithms

Optimal Replication Algorithm. Optimal is an exhaustive, and time-consuming, replication algorithm that yields an optimal solution. For a Data Grid with $n$ sites and $p$ data files of unit size, where each site has storage capacity $m$, the optimal algorithm enumerates all possible file replication placements in the Data Grid and finds the one with minimum total access cost. There are $(C_p^m)^n$ such file placements, where $C_p^m = \binom{p}{m}$ is the number of ways of selecting $m$ files out of $p$ files. Due to the high time complexity of the optimal algorithm, we experiment on Data Grids with 5 to 25 sites, where the average distance between two sites is 3 to 6 hops.
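The size of this search space can be computed directly; for example (the helper name is ours):

```python
from math import comb

def num_placements(n_sites, p_files, m_capacity):
    """Number of exhaustive file placements: (C(p, m))^n."""
    return comb(p_files, m_capacity) ** n_sites

# The small setting used below (n = 10 sites, p = 10 files, m = 5):
print(num_placements(10, 10, 5))   # 252**10, roughly 1e24 placements
```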

Local Greedy Algorithm and Random Algorithm. We also compare Greedy with two other intuitive replication algorithms: the Local Greedy algorithm and the Random algorithm. Local Greedy replicates each site's most frequently requested files in its local storage. Random proceeds in rounds; in each round, a random file is selected and placed in a randomly chosen non-full site that does not yet have a copy of that file. The algorithm stops when all the storage space of all sites is occupied.
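For concreteness, a minimal sketch of the Random baseline under unit-size files (Local Greedy is analogous, ranking files by each site's own request counts); terminating when no feasible placement remains is our addition to handle that corner case.

```python
import random

def random_replication(n_sites, p_files, source, capacity):
    """Random baseline: repeatedly place a random file at a random non-full
    site that does not already hold it, until all storage is occupied
    (or no feasible placement remains)."""
    replicas = [set() for _ in range(p_files)]
    used = [0] * n_sites
    while True:
        candidates = [(i, j) for i in range(n_sites) for j in range(p_files)
                      if used[i] < capacity[i] and i != source[j] and i not in replicas[j]]
        if not candidates:
            return replicas
        i, j = random.choice(candidates)
        replicas[j].add(i)
        used[i] += 1
```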

In most comparisons of the centralized algorithms, we assume that all file sizes are equal and all edge bandwidths are equal. Since the total data file access cost between sites is defined as the total transmission time spent to obtain the data files, which equals
$$\text{number of hops between two sites} \times \text{file size} / \text{bandwidth},$$
we can use the number of hops to approximate the file access cost. However, we also show that even in heterogeneous scenarios, where the file sizes and bandwidths are not equal, our centralized and distributed algorithms still perform very well (Figure 5 and Figure 13).

Recall that each data file has a uniform size of 1. For all algorithms, the $p$ files are initially placed randomly at some source sites; for the same comparison, the initial file placements are the same for all algorithms. Recall that for source sites, their storage capacities are those that remain after storing their original data files. Each data point in the plots below is an average over five different initial placements of the $p$ files. We assume that the number of times each site accesses each file is a random number between 1 and 10. In all plots, we show error bars indicating the 95 percent confidence interval.

Varying the Number of Data Files. First, we vary the number of data files in the Data Grid while fixing the number of sites and the storage capacity of each site. Due to the high time complexity of the optimal algorithm, we consider a small Data Grid with $n = 10$ and $m = 5$, and we vary $p$ over 5, 10, 15, and 20. The comparison of the four algorithms is shown in Figure 3 (a). In terms of total access cost, we observe that Optimal < Greedy < Local Greedy < Random. That is, among all the algorithms, Optimal performs the best since it exhausts all replication possibilities, but Greedy performs quite close to Optimal. Furthermore, we observe that Greedy performs better than Local Greedy and Random, which demonstrates that it is indeed a good replication algorithm. The figure also shows that when the storage capacity of each site is five and the total number of files is five, Greedy, like the other three algorithms, eventually replicates all five files at every Grid site, resulting in zero total access cost.

Varying Storage Capacity. Figure 3 (b) shows the results of varying the storage capacity of each site while keeping the number of data files and the number of sites unchanged. We choose $n = 10$ and $p = 10$, and vary $m$ over 1, 3, 5, 7, and 9 units. As expected, with the increase of the memory capacity of each Grid site, the total access cost decreases, since more file replicas can be placed into the Data Grid. Again, we observe that Greedy performs comparably to Optimal, and both perform better than Local Greedy and Random.

Varying the Number of Grid Sites. Figure 3 (c) shows the result of varying the number of sites in the Data Grid. We consider a small Data Grid, choose $p = 10$ and $m = 5$, and vary $n$ over 5, 10, 15, 20, and 25. The optimal algorithm performs only slightly better than the greedy algorithm, by 15% to 20%.

Comparison in Large Scale. We study the scalability of the different algorithms by considering a Data Grid at large scale, with 500 to 2,000 data files, the storage capacity of each site varying from 10 to 90, and the number of Grid sites varying from 10 to 50. Figure 4 shows the performance comparison of Greedy, Local Greedy, and Random.


Fig. 4. Performance comparison of the Greedy, Local Greedy, and Random algorithms at large scale (y-axis: total access cost): (a) varying the number of data files; (b) varying the storage capacity; (c) varying the number of Grid sites. Unless varied, the number of sites is 30, the number of data files is 1,000, and the storage capacity of each site is 50. Data file size is 1.

[Figure 5 — three panels plotting Total Access Cost versus (a) Number of Files, (b) Storage Capacity (GB), and (c) Number of Grid sites, each comparing Optimal, Greedy, Local Greedy, and Random.]

Fig. 5. Performance comparison of the Greedy, Optimal, Local Greedy, and Random algorithms with varying data file size. Unless varied, the number of sites is 10, the number of data files is 10, and the storage capacity of each site is 5. Data file sizes vary from 1 to 10.

Fig. 6. Illustration of the simulation topology: (a) Cascading Replication [39]; (b) simulation topology. Tier 3 has 32 sites; we do not show all of them due to space limits.

Again, it shows that Greedy performs the best among the three algorithms.

Comparison with Varying Data File Size. Even though Theorem 1 is valid only for uniform data sizes, we show experimentally that our greedy algorithm also achieves good system performance for varying data file sizes. Figure 5 shows the performance comparison of Optimal, Greedy, Local Greedy, and Random, wherein each data file size is a random number between 1 and 10. Among the three heuristics, Greedy again performs the best. It can also be seen that, due to the non-uniform data sizes, the access cost is no longer zero for any of the four algorithms when there are five files and each site has a storage capacity of five.

B. Distributed Algorithm versus Replication Strategies by Ranganathan and Foster [39]

In this section, we compare our distributed algorithm with the replication strategies proposed by Ranganathan and Foster [39]. We first give an overview of their strategies, then present the simulation environment, and finally discuss the simulation results of the comparison.

1) Replication Strategies by Ranganathan and Foster: Ranganathan and Foster study data replication in a hierarchical Data Grid model (see Figure 6). The hierarchical model, represented as a tree topology, has been used in the LHC Computing Grid project [2], which serves the scientific collaborations performing experiments at the Large Hadron Collider (LHC) at CERN. In this model, there is a tier 0 site at CERN, and there are regional sites belonging to the different institutes in the collaboration. For comparison, we also use this hierarchical model and assume that all data files are initially located at the CERN site.⁴ Like [2], we assume that only the leaf sites execute jobs. When the needed data files are not stored locally, a site always goes one level higher for the data. If the data is found there, it is fetched back to the leaf site; otherwise, the request goes one level higher still, until it reaches the CERN site.
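As a rough illustration, the following sketch (our own, with hypothetical names) walks up the tree one tier at a time until a replica is found; it assumes, as in the model above, that the CERN root stores every file, so the loop always terminates:

    def fetch(site, file_id, parent, holds):
        # parent[s]: the site one level higher than s (None at the CERN root)
        # holds[s]: set of file IDs currently stored at site s
        current, hops = site, 0
        while file_id not in holds[current]:
            current = parent[current]   # go one level higher
            hops += 1
        return current, hops            # site the file is fetched from, and its distance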

Ranganathan and Foster present six different replication strategies: (1) No Replication or Caching; (2) Best Client, whereby a replica is created at the best client site (the site that has issued the largest number of requests for the file); (3) Cascading Replication, whereby once the popularity of a file exceeds a threshold within a given time interval, a replica is created at the next level on the path to the best client; (4) Plain Caching, whereby the client that requests the file stores a copy of the file locally; (5) Caching plus Cascading Replication, which combines the Plain Caching and Cascading Replication strategies; and (6) Fast Spread, whereby replicas of the file are created at each site along its path to the client.

All of the above strategies are evaluated under three user access patterns: (1) Random Access: no locality in the patterns; (2) Temporal Locality: data accesses contain a small degree of temporal locality (recently accessed files are likely to be accessed again); and (3) Geographical plus Temporal Locality: data accesses contain a small degree of temporal and geographical locality (files recently accessed by a site are likely to be accessed again by a nearby site). Their simulation results show that Caching plus Cascading works better than the others in most cases. In our work, we therefore compare our strategy with Caching plus Cascading (referred to as Cascading in the simulation) under Geographical plus Temporal Locality.

Cascading Replication Strategy. Figure 6 (a) shows in detail how Cascading works. There is a file F1 at the root site. When the number of requests for F1 exceeds the threshold at the root, a replica of F1 is sent to the next layer on the path to the best client. Eventually the threshold is exceeded at that layer as well, and a replica of F1 is sent to Client C.
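A minimal sketch of this threshold rule is given below; the counter structure and helper names are ours (not from [39]), and tracking which child subtree is the "best client" is assumed to be available:

    from collections import defaultdict

    request_count = defaultdict(int)     # (site, file_id) -> requests seen in this interval

    def on_request(site, file_id, threshold, best_child, holds):
        # best_child(site, file_id): the child on the path to the best client (assumed given)
        request_count[(site, file_id)] += 1
        if file_id in holds[site] and request_count[(site, file_id)] > threshold:
            child = best_child(site, file_id)
            holds[child].add(file_id)               # push a replica one level down
            request_count[(site, file_id)] = 0      # start a new observation interval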

TABLE I
SIMULATION PARAMETERS FOR HIERARCHICAL TOPOLOGY.

Description                               Value
Number of sites                           43
Number of files                           1,000 - 5,000
Size of each file                         2 GB
Available storage of each site            100 - 500 GB
Number of jobs executed by each site      400
Number of input files of each job         1 - 10
Link bandwidth                            100 MB/s

2) Simulation Setup: In the simulation, we use the GridSim simulator [43] (version 4.1). The GridSim toolkit allows modeling and simulation of entities in parallel and distributed computing environments: system users, applications, resources, and resource brokers. Version 4.1 of GridSim incorporates a Data Grid extension to model and simulate Data Grids. In addition, GridSim offers extensive support for the design and evaluation of scheduling algorithms, which suits our future work, wherein a more comprehensive framework of data replication and job scheduling will be studied. For these reasons, we use GridSim in our simulation.

⁴ However, our Data Grid model is not limited to a tree-like hierarchical model. In Section VI-C, we show a general graph model, in which our distributed replication algorithm performs better than the LRU/LFU strategies implemented in OptorSim [10].

Figure 6 (b) shows the topology used in our simulation. Similar to the one in [39], there are four tiers in the Grid, with all data being produced at the topmost tier 0 (the root). Tier 1 consists of the regional centers of different continents; here we consider two. The next tier is composed of different countries, and the final tier consists of the participating institution sites. The Grid contains a total of 43 sites, 32 of which execute tasks and generate requests.

Table I shows the simulation parameters, most of which are taken from [39] for the purpose of comparison. The CERN site originally generates and stores 1,000 to 5,000 distinct data files, each of which is 2 GB. Each site (except the CERN site) dynamically generates and executes 400 jobs, and each job needs 1 to 10 files as input. To execute each job, the site first checks whether the input files are in its local storage; if not, it goes to the nearest replica site to get the file. The available storage capacity at each site varies from 100 GB to 500 GB, and the link bandwidth is 100 MB/s. Our simulation runs on a DELL desktop with a Core 2 Duo CPU and 1 GB of RAM.
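The per-job file fetching just described can be sketched as follows (an illustrative reading with our own names; capacity handling and the actual transfer-time model are omitted, and at least one site, the CERN root, is assumed to hold every file):

    def run_job(site, input_files, holds, dist):
        # holds[s]: files stored at site s; dist[a][b]: transfer cost between sites a and b
        delay = 0
        for f in input_files:
            nearest = min((s for s in range(len(holds)) if f in holds[s]),
                          key=lambda s: dist[site][s])
            delay += dist[site][nearest]   # zero if the file is already local
            holds[site].add(f)             # cache the received file locally
        return delay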

We emphasize that the above parameters take scaling into account. Usually, the amount of data in a Data Grid system is on the order of petabytes. To enable simulation of such a large amount of data, a scale of 1:1000 is used; that is, the number of files in the system and the storage capacity of each site are both reduced by a factor of 1,000.

Data File Access Patterns. Each job has one to ten input files, as noted above. For the input files of each job, we adopt two file access patterns: a random access pattern and a tempo-spatial access pattern. In the random access pattern, the input files of each job are chosen uniformly at random from the entire set of data files available in the Data Grid. In the tempo-spatial access pattern, the input data files required by each job follow a Zipf distribution [51], [9] and a geographic distribution, as explained below.

• In the Zipf distribution (the temporal part), for the p data files, the probability that a job accesses the j-th data file (1 ≤ j ≤ p) is $P_j = \frac{1/j^{\theta}}{\sum_{h=1}^{p} 1/h^{\theta}}$, where 0 ≤ θ ≤ 1. When θ = 1, this is the strict Zipf distribution, while for θ = 0 it reduces to the uniform distribution. We choose θ = 0.8 based on real web trace studies [9].

• In the geographic distribution (the spatial part), the data file access frequencies of jobs at a site depend on the site's geographic location in the Grid, such that sites in proximity have similar data file access frequencies. In our case, leaf sites close to each other have similar data file access patterns. Specifically,



[Figure 7 — two panels plotting Average file access time (seconds) versus (a) total number of files and (b) storage capacity (GB) of each site, comparing Distributed and Cascading under the Random, TS-Static, and TS-Dynamic access patterns.]

Fig. 7. Performance comparison between our distributed algorithm and Cascading in the hierarchical topology. In (a), the storage capacity of each site is 300 GB; in (b), the number of data files is 3,000. Each data file size is 2 GB.

[Figure 8 — two panels plotting Percentage Increase of File Access Time (%) versus (a) number of data files and (b) storage capacity (GB) of each Grid site, comparing Distributed and Cascading.]

Fig. 8. Percentage increase of the average file access time due to the dynamic access pattern.

[Figure 9 — two panels plotting Average file access time (seconds) versus (a) total number of files and (b) storage capacity (TB) of each site, comparing Distributed and Cascading under the Random, TS-Static, and TS-Dynamic access patterns.]

Fig. 9. Performance comparison between our distributed algorithm and Cascading in a typical Grid environment. In (a), the storage capacity of each site is 500 TB; in (b), the number of data files in the Grid is 500,000. Each data file size is 1 GB.



suppose there are n client sites, numbered 1, 2, ..., n, where client 1 is close to client 2, which is close to client 3, and so on, and suppose the total number of data files is p. If the accessed data ID is id according to the original Zipf-like access pattern, then for client i (1 ≤ i ≤ n) the new data ID is (id + (p/n) · i) mod p. In this way, neighboring clients have similar, but not identical, access patterns, while far-away clients have quite different access patterns (a sketch of generating this access pattern is given after this list).
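The tempo-spatial request stream described in the two items above can be generated roughly as follows (a sketch with illustrative names; the simulator's actual implementation may differ):

    import random

    def zipf_file_id(p, theta=0.8):
        # draw j in 0..p-1 with probability proportional to 1/(j+1)^theta
        weights = [1.0 / (j + 1) ** theta for j in range(p)]
        return random.choices(range(p), weights=weights)[0]

    def tempo_spatial_id(client_i, n, p, theta=0.8):
        # shift the Zipf-ranked ID by the client's position so that neighboring
        # clients favor overlapping, but not identical, sets of files
        return (zipf_file_id(p, theta) + (p // n) * client_i) % p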

To investigate the effect of dynamic user behavior on the different algorithms, we further divide the tempo-spatial access pattern into two variants: static and dynamic. In the dynamic access pattern, the file access pattern of each site changes in the middle of the simulation, while in the static access pattern it does not change. In summary, our simulation uses the following three file access patterns: the random access pattern (Random), the static tempo-spatial access pattern (TS-Static), and the dynamic tempo-spatial access pattern (TS-Dynamic).

3) Simulation Results and Analysis: Below we present our simulation results and analysis.

Varying Number of Data Files. In this part of the simulation, we vary the number of data files over 1,000, 2,000, 3,000, 4,000, and 5,000 files, while fixing the storage capacity of each Grid site at 300 GB. Figure 7 (a) shows the comparison of the average file access time of the two algorithms under all three access patterns. We note the following observations. First, for both algorithms under all three access patterns, the average file access time increases with the number of files. However, our distributed greedy data replication algorithm yields an average file access time three to four times lower than that obtained by Cascading under all three access patterns. Second, for both our distributed algorithm and Cascading, the average file access time under TS-Static is 20% to 30% smaller than that under the Random access pattern. This is because TS-Static incorporates both temporal and spatial locality, which benefits both replication algorithms. Figure 7 (a) also shows the performance of both algorithms under the TS-Dynamic access pattern, wherein the file access pattern of each site changes during the simulation. As shown, both our distributed replication algorithm and Cascading suffer from the file access pattern change, with increased average file access time compared with TS-Static. This is because it takes some time for both algorithms to obtain a good replica placement for the dynamic file access pattern. Under a static access pattern, the Data Grid eventually reaches a stable stage where the file placement at each site remains unchanged, thus giving a lower file access time. Figure 8 (a) further shows that our algorithm adapts to the access pattern change better than the Cascading algorithm: the percentage increase of the average file access time of our algorithm due to the dynamic access pattern is around 50%, while for Cascading it is around 100%.
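Here the percentage increase reported in Figure 8 is, presumably, computed relative to the static case:

$\text{percentage increase} = \frac{T_{\text{TS-Dynamic}} - T_{\text{TS-Static}}}{T_{\text{TS-Static}}} \times 100\%,$

where $T$ denotes the average file access time of the algorithm in question.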

Varying Storage Capacity of Each Site. We also vary the storage capacity of each Grid site over 100, 200, 300, 400, and 500 GB while fixing the number of files in the Grid at 3,000. The performance comparison is shown in Figure 7 (b). As expected, as the storage capacity increases, the file access time decreases for both algorithms, since a larger storage capacity allows each site to store more replicated files for local access. We make the same observation as before: our distributed replication algorithm performs much better than Cascading under the Random and TS-Static/Dynamic access patterns. Moreover, for both our distributed algorithm and Cascading, the average file access time under Random is about 20% to 30% higher than that under TS-Static. Again, compared with the TS-Static pattern, the file access time of both algorithms increases under the TS-Dynamic pattern. However, as shown in Figure 8 (b), our distributed replication algorithm evidently adapts to the dynamic access pattern better than Cascading does, giving a smaller percentage increase of the average file access time.

TABLE II
SIMULATION PARAMETERS FOR TYPICAL GRID ENVIRONMENT.

Description                               Value
Number of sites                           100
Number of files                           1,000 - 1,000,000
Size of each file                         1 GB
Available storage of each site            1 TB - 1 PB
Number of jobs executed by each site      10,000
Number of input files of each job         1 - 10
Link bandwidth                            4,000 MB/s

TABLE III
SIMULATION PARAMETERS FOR TYPICAL CLUSTER ENVIRONMENT.

Description                               Value
Number of nodes                           1,000
Number of files                           1,000 - 1,000,000
Size of each file                         1 GB
Available storage of each node            10 GB - 1 TB
Number of jobs executed by each node      1,000
Number of input files of each job         1 - 10
Link bandwidth                            100 MB/s

Comparison in Typical Grid Environment. In a typical Grid environment, there are usually only tens to hundreds of different sites, each with large storage (petabytes per site), and the network bandwidth between Grid sites can be 10 GB/s, or even as high as 100 GB/s. To simulate such an environment we use the scaled parameters shown in Table II. Figures 9 (a) and (b) show the performance comparison of our distributed algorithm and Cascading when varying the total number of files (while fixing the storage capacity at 500 TB) and the storage capacity of each site (while fixing the number of data files at 500,000), respectively. We make the following observations. First, when there are 1,000, 50,000, or 200,000 files in the Grid (Figure 9 (a)), or when the storage capacity is 600 TB, 800 TB, or 1,000 TB (Figure 9 (b)), the performance of both algorithms under the same access pattern, and the performance of the same algorithm under different access patterns, are very similar. This is because, in both cases, each site has enough space to store all of its requested files; therefore, no cache replacement is involved. As such, even under different access patterns, each site behaves the same way with both algorithms: requesting the input files for each job, storing the received input files in local storage, and executing the next job.



[Figure 10 — two panels plotting Average file access time (seconds) versus (a) total number of files and (b) storage capacity (GB) of each node, comparing Distributed and Cascading under the Random, TS-Static, and TS-Dynamic access patterns.]

Fig. 10. Performance comparison between our distributed algorithm and Cascading in a typical cluster environment. In (a), the storage capacity of each node is 500 GB; in (b), the number of data files in the cluster is 500,000. Each data file size is 1 GB.

[Figure 11 — two panels plotting Average file access time (seconds) versus (a) total number of files and (b) storage capacity (GB) of each node, comparing Distributed and Cascading under the Random, TS-Static, and TS-Dynamic access patterns.]

Fig. 11. Performance comparison between our distributed algorithm and Cascading in a cluster environment with full connectivity. In (a), the storage capacity of each node is 500 GB; in (b), the number of data files in the cluster is 500,000. Each data file size is 1 GB.

Otherwise, we observe performance differences similar to those shown in Figure 7. Second, for our distributed algorithm, the percentage increase of access time due to the dynamic pattern is very small, around 5% to 8%. This shows that our algorithm adjusts well to the dynamic access pattern in a typical Grid environment. Third, Figure 9 (b) shows that with increased storage capacity of each site, the performance difference between our distributed algorithm and Cascading becomes smaller for all three access patterns. For example, when the storage capacity is 1 TB, Cascading yields more than twice the file access time of our distributed algorithm, while at 400 TB it incurs only 40% more access time. This shows that in the more stringent storage scenarios, our caching algorithm is the better mechanism for reducing file access time and thus job execution time.

Comparison in Typical Cluster Environment. In a typical cluster environment, a site often has 1,000 to 10,000 nodes (note the difference between sites and nodes, as explained in Section III), each with small local storage in the range of 10 GB to 1 TB. The network bandwidth within a cluster is typically 1 GB/s. We use the parameters in Table III to simulate the cluster environment. Figure 10 shows the performance comparison of our distributed algorithm and Cascading under the different access patterns. We observe that for our distributed algorithm, the percentage increase of access time due to the dynamic pattern is relatively large, around 40% to 100%. This shows that our algorithm does not adjust as well to the dynamic access pattern in a typical cluster environment. The reason is that in a cluster, the diameter of the network (the number of hops in the longest shortest path between two cluster nodes) is much larger than that in a typical Grid environment, and the bandwidth in a cluster is much smaller than that in a Grid. As a result, when the file access pattern of each site changes in the middle of the simulation, it takes a longer time to propagate this information to other nodes.
longer time to propagate such <strong>in</strong>formation to other nodes.


Therefore, the nearest replica catalog maintained by each node can become outdated, causing frequent cache misses for the needed input files.

TABLE IV
STORAGE CAPACITY OF EACH SITE IN EU DATA GRID TESTBED 1 [10]. IC IS IMPERIAL COLLEGE.

Site          Bologna  Catania  CERN   IC   Lyon  Milano  NIKHEF  NorduGrid  Padova  RAL  Torino
Storage (GB)  30       30       10000  80   50    50      70      63         50      50   50

Comparison in a Cluster Environment with Full Connectivity. We also consider a cluster with a fully connected set of nodes; the simulation results are shown in Figure 11. There are two major observations. First, comparing Figure 9 with Figure 11, we observe that the average file access time in a cluster with full connectivity is lower than that in a Grid. This is because, even though a typical Grid link bandwidth is higher than that of a cluster (4,000 MB/s versus 100 MB/s, as shown in Tables II and III), there are far fewer point-to-point links in a Grid of 100 sites than in a fully connected cluster of 1,000 nodes (around 5 × 10³ edges versus 5 × 10⁵ edges). Therefore, the full network bisection bandwidth of the cluster is higher than that of the Grid (5 × 10⁵ × 100 = 5 × 10⁷ MB/s versus 5 × 10³ × 4,000 = 2 × 10⁷ MB/s). Second, we observe that under all three access patterns, our distributed algorithm performs the same as Cascading. This is because with full connectivity, each site can get the needed data files directly from the CERN site; therefore, no site is able to observe the data access traffic of other sites and make intelligent caching decisions. Both our distributed caching algorithm and the Cascading algorithm then boil down to the same LRU/LFU cache replacement algorithm. In a fully connected environment, our distributed algorithm loses its advantage as an effective caching algorithm, as would any other distributed caching algorithm.
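The back-of-the-envelope numbers above can be reproduced as follows (a sketch that simply redoes the arithmetic in the text, assuming full connectivity in both topologies and the per-link bandwidths of Tables II and III):

    grid_sites, cluster_nodes = 100, 1000
    grid_links = grid_sites * (grid_sites - 1) // 2           # about 5 x 10^3 links
    cluster_links = cluster_nodes * (cluster_nodes - 1) // 2  # about 5 x 10^5 links
    grid_total = grid_links * 4000         # MB/s per Grid link    -> about 2 x 10^7 MB/s
    cluster_total = cluster_links * 100    # MB/s per cluster link -> about 5 x 10^7 MB/s
    print(grid_links, cluster_links, grid_total, cluster_total)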

Fig. 12. EU Data Grid Testbed 1 [10]. Numbers indicate the bandwidth between two sites.

C. Distributed Algorithm versus LRU/LFU in General Topology

OptorSim [10] is a Grid simulator that simulates a general-topology Grid testbed for data intensive high energy physics experiments. Three replication algorithms are implemented in OptorSim: (i) LRU (Least Recently Used), which deletes the files that have been used least recently; (ii) LFU (Least Frequently Used), which deletes the files that have been used least frequently in the recent past; and (iii) an economic model in which sites "buy" and "sell" files using an auction mechanism and delete files only if they are less valuable than the new file.
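For reference, a generic LRU-style file cache of the kind these strategies build on might look like the following; this is our own sketch, not OptorSim's code, and the class and parameter names are illustrative:

    from collections import OrderedDict

    class LRUFileCache:
        def __init__(self, capacity_gb):
            self.capacity = capacity_gb
            self.used = 0.0
            self.files = OrderedDict()               # file_id -> size, least recent first

        def access(self, file_id, size_gb):
            if file_id in self.files:
                self.files.move_to_end(file_id)      # refresh recency on a hit
                return
            while self.files and self.used + size_gb > self.capacity:
                _, old = self.files.popitem(last=False)   # evict the least recently used file
                self.used -= old
            self.files[file_id] = size_gb
            self.used += size_gb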

We compare our distributed algorithm with LRU and LFU in a general topology, since previous empirical research has found that even though the LRU/LFU strategies are extremely simple, they are very difficult to improve upon. The topology we use is the simulated topology of EU DataGrid Testbed 1 [10], shown in Figure 12; the storage capacity of each site is given in Table IV. Since the simulation results of LRU and LFU are quite similar, we only show the results for LRU. Figures 13 (a) and (b) show the performance comparison of our distributed algorithm and LRU when varying the total number of files and the storage capacity of each site, respectively. Our distributed algorithm outperforms LRU in most cases. However, the performance difference is not as significant as that between the distributed algorithm and Cascading. In particular, under the TS-Dynamic access pattern, when the storage capacity of each site is 400 GB or 500 GB, the average file access time of LRU is even slightly smaller than that of our distributed algorithm.

VII. Conclusions and Future Work

In this article, we study how to replicate data files in data intensive scientific applications in order to reduce the file access time, taking into consideration the limited storage space of Grid sites. Our goal is to effectively reduce the access time of the data files needed for job executions at Grid sites. We propose a centralized greedy algorithm with a performance guarantee and show that it performs comparably with the optimal algorithm. We also propose a distributed algorithm wherein Grid sites react closely to the current Grid status and make intelligent caching decisions. Using GridSim, a distributed Grid simulator, we demonstrate that the distributed replication technique significantly outperforms a popular existing replication technique, and that it is more adaptive to dynamic changes of file access patterns in Data Grids.

We plan to further develop our work in the following directions:

• As ongoing and future work, we are exploiting the synergies between data replication and job scheduling to achieve better system performance. Data replication and job scheduling are two different but complementary



[Figure 13 — two panels plotting Average file access time (seconds) versus (a) total number of files and (b) storage capacity (GB) of each site, under the Random, TS-Static, and TS-Dynamic access patterns.]

Fig. 13. Performance comparison between our distributed algorithm and LRU in the general topology.

functions in Data Grids: one minimizes the total file access cost (and thus the total job execution time over all sites), and the other minimizes the makespan (the maximum job completion time among all sites). Optimizing both objective functions in the same framework is a very difficult (if not infeasible) task. There are two main challenges: first, how to formulate a problem that incorporates not only data replication but also job scheduling, and that addresses both the total access cost and the maximum access cost; and second, how to find an efficient algorithm that, if it cannot find optimal solutions minimizing the total/maximum access cost, gives near-optimal solutions for both objectives. These two challenges remain largely unanswered in the current literature.

• We plan to design and develop data replication strategies in scientific workflow [31] and large-scale cloud computing environments [24]. We will also investigate how provenance information [15], the derivation history of data files, can be exploited to improve the intelligence of data replication decision making.

• A more robust dynamic model and replication algorithm will be developed. Currently, each site observes the data access traffic for a sufficiently long time window, which is set in an ad hoc manner. In the future, a site should dynamically decide such an "observing window" based on the traffic it observes.

REFERENCES

[1] The Large Hadron Collider. http://public.web.cern.ch/Public/en/LHC/LHCen.html.
[2] LCG Computing Fabric. http://lcg-computing-fabric.web.cern.ch/lcgcomputing-fabric/.
[3] Worldwide LHC Computing Grid. http://lcg.web.cern.ch/LCG/.
[4] A. Aazami, S. Ghandeharizadeh, and T. Helmi. Near optimal number of replicas for continuous media in ad-hoc networks of wireless devices. In Proc. of the International Workshop on Multimedia Information Systems, 2004.
[5] B. Allcock, J. Bester, J. Bresnahan, A. L. Chervenak, C. Kesselman, S. Meder, V. Nefedova, D. Quesnel, S. Tuecke, and I. Foster. Secure, efficient data transport and replica management for high-performance data-intensive computing. In Proc. of the IEEE Symposium on Mass Storage Systems and Technologies, 2001.
[6] I. Baev and R. Rajaraman. Approximation algorithms for data placement in arbitrary networks. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (SODA 2001).
[7] I. Baev, R. Rajaraman, and C. Swamy. Approximation algorithms for data placement problems. SIAM J. Comput., 38(4):1411–1429, 2008.
[8] W. H. Bell, D. G. Cameron, R. Carvajal-Schiaffino, A. P. Millar, K. Stockinger, and F. Zini. Evaluation of an economy-based file replication strategy for a data grid. In Proc. of the International Workshop on Agent Based Cluster and Grid Computing (CCGrid 2003).
[9] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions: Evidence and implications. In Proc. of the IEEE International Conference on Computer Communications (INFOCOM 1999).
[10] D. G. Cameron, A. P. Millar, C. Nicholson, R. Carvajal-Schiaffino, K. Stockinger, and F. Zini. Analysis of scheduling and replica optimisation strategies for data grids using OptorSim. Journal of Grid Computing, 2(1):57–69, 2004.
[11] M. Carman, F. Zini, L. Serafini, and K. Stockinger. Towards an economy-based optimization of file access and replication on a data grid. In Proc. of the International Workshop on Agent Based Cluster and Grid Computing (CCGrid 2002).
[12] A. Chakrabarti and S. Sengupta. Scalable and distributed mechanisms for integrated scheduling and replication in data grids. In Proc. of the 10th International Conference on Distributed Computing and Networking (ICDCN 2008).
[13] R.-S. Chang and H.-P. Chang. A dynamic data replication strategy using access-weights in data grids. Journal of Supercomputing, 45:277–295, 2008.
[14] R.-S. Chang, J.-S. Chang, and S.-Y. Lin. Job scheduling and data replication on data grids. Future Generation Computer Systems, 23(7):846–860.
[15] A. Chebotko, X. Fei, C. Lin, S. Lu, and F. Fotouhi. Storing and querying scientific workflow provenance metadata using an RDBMS. In Proc. of the IEEE International Conference on e-Science and Grid Computing (e-Science 2007).
[16] A. Chervenak, E. Deelman, M. Livny, M.-H. Su, R. Schuler, S. Bharathi, G. Mehta, and K. Vahi. Data placement for scientific applications in distributed environments. In Proc. of the IEEE/ACM International Conference on Grid Computing (Grid 2007).
[17] A. Chervenak, R. Schuler, C. Kesselman, S. Koranda, and B. Moe. Wide area data replication for scientific collaboration. In Proc. of the IEEE/ACM International Workshop on Grid Computing (Grid 2005).
[18] A. Chervenak, R. Schuler, M. Ripeanu, M. A. Amer, S. Bharathi, I. Foster, and C. Kesselman. The Globus replica location service: Design and experience. IEEE Transactions on Parallel and Distributed Systems, 20:1260–1272, 2009.
[19] N. N. Dang and S. B. Lim. Combination of replication and scheduling in data grids. International Journal of Computer Science and Network Security, 7(3):304–308, March 2007.
[20] D. Düllmann and B. Segal. Models for replica synchronisation and consistency in a data grid. In Proc. of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC 2001).
[21] J. Rehn et al. PhEDEx: High-throughput data transfer management system. In Proc. of Computing in High Energy and Nuclear Physics (CHEP 2006).
[22] I. Foster. The Grid: A new infrastructure for 21st century science. Physics Today, 55:42–47, 2002.
[23] I. Foster and K. Ranganathan. Decoupling computation and data scheduling in distributed data-intensive applications. In Proc. of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC 2002).
[24] I. Foster, Y. Zhao, I. Raicu, and S. Lu. Cloud computing and grid computing 360-degrees compared. In Proc. of IEEE Grid Computing Environments, in conjunction with IEEE/ACM Supercomputing, pages 1–10, 2008.
[25] C. Intanagonwiwat, R. Govindan, and D. Estrin. Directed diffusion: A scalable and robust communication paradigm for sensor networks. In Proc. of the ACM International Conference on Mobile Computing and Networking (MOBICOM 2000).
[26] J. C. Jacob, D. S. Katz, T. Prince, G. B. Berriman, J. C. Good, A. C. Laity, E. Deelman, G. Singh, and M.-H. Su. The Montage architecture for grid-enabled science processing of large, distributed datasets. In Proc. of the Earth Science Technology Conference, 2004.
[27] S. Jiang and X. Zhang. Efficient distributed disk caching in data grid management. In Proc. of the IEEE International Conference on Cluster Computing (Cluster 2003).
[28] S. Jin and L. Wang. Content and service replication strategies in multi-hop wireless mesh networks. In Proc. of the ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM 2005).
[29] H. Lamehamedi, B. K. Szymanski, and B. Conte. Distributed data management services for dynamic data grids. Unpublished.
[30] M. Lei, S. V. Vrbsky, and X. Hong. An on-line replication strategy to increase availability in data grids. Future Generation Computer Systems, 24:85–98, 2008.
[31] C. Lin, S. Lu, X. Fei, A. Chebotko, D. Pai, Z. Lai, F. Fotouhi, and J. Hua. A reference architecture for scientific workflow management systems and the VIEW SOA solution. IEEE Transactions on Services Computing, 2(1):79–92, 2009.
[32] M. Mineter, C. Jarvis, and S. Dowers. From stand-alone programs towards grid-aware services and components: A case study in agricultural modelling with interpolated climate data. Environmental Modelling and Software, 18(4):379–391, 2003.
[33] S. M. Park, J. H. Kim, Y. B. Lo, and W. S. Yoon. Dynamic data grid replication strategy based on internet hierarchy. In Proc. of the Second International Workshop on Grid and Cooperative Computing (GCC 2003).
[34] J. Pérez, F. García-Carballeira, J. Carretero, A. Calderón, and J. Fernández. Branch replication scheme: A new model for data replication in large scale data grids. Future Generation Computer Systems, 26(1):12–20, 2010.
[35] L. Qiu, V. N. Padmanabhan, and G. M. Voelker. On the placement of web server replicas. In Proc. of the IEEE Conference on Computer Communications (INFOCOM 2001).
[36] I. Raicu, I. Foster, Y. Zhao, P. Little, C. Moretti, A. Chaudhary, and D. Thain. The quest for scalable support of data intensive workloads in distributed systems. In Proc. of the ACM International Symposium on High Performance Distributed Computing (HPDC 2009).
[37] I. Raicu, Y. Zhao, I. Foster, and A. Szalay. Accelerating large-scale data exploration through data diffusion. In Proc. of the International Workshop on Data-aware Distributed Computing (DADC 2008).
[38] A. Ramakrishnan, G. Singh, H. Zhao, E. Deelman, R. Sakellariou, K. Vahi, K. Blackburn, D. Meyers, and M. Samidi. Scheduling data-intensive workflows onto storage-constrained distributed resources. In Proc. of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007).
[39] K. Ranganathan and I. T. Foster. Identifying dynamic replication strategies for a high-performance data grid. In Proc. of the Second International Workshop on Grid Computing (GRID 2001).
[40] A. Rodriguez, D. Sulakhe, E. Marland, N. Nefedova, M. Wilde, and N. Maltsev. Grid enabled server for high-throughput analysis of genomes. In Proc. of the Workshop on Case Studies on Grid Applications, 2004.
[41] F. Schintke and A. Reinefeld. Modeling replica availability in large data grids. Journal of Grid Computing, 2(1):219–227, 2003.
[42] H. Stockinger, A. Samar, K. Holtman, B. Allcock, I. Foster, and B. Tierney. File and object replication in data grids. In Proc. of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC 2001).
[43] A. Sulistio, U. Cibej, S. Venugopal, B. Robic, and R. Buyya. A toolkit for modelling and simulating data grids: An extension to GridSim. Concurrency and Computation: Practice and Experience, 20(13):1591–1609, 2008.
[44] B. Tang, S. R. Das, and H. Gupta. Benefit-based data caching in ad hoc networks. IEEE Transactions on Mobile Computing, 7(3):289–304, March 2008.
[45] M. Tang, B.-S. Lee, C.-K. Yeo, and X. Tang. Dynamic replication algorithms for the multi-tier data grid. Future Generation Computer Systems, 21:775–790, 2005.
[46] M. Tang, B.-S. Lee, C.-K. Yeo, and X. Tang. The impact of data replication on job scheduling performance in the data grid. Future Generation Computer Systems, 22:254–268, 2006.
[47] U. Čibej, B. Slivnik, and B. Robič. The complexity of static data replication in data grids. Parallel Computing, 31(8-9):900–912, 2005.
[48] S. Venugopal and R. Buyya. An SCP-based heuristic approach for scheduling distributed data-intensive applications on global grids. Journal of Parallel and Distributed Computing, 68:471–487, 2008.
[49] S. Venugopal, R. Buyya, and K. Ramamohanarao. A taxonomy of data grids for distributed data sharing, management, and processing. ACM Computing Surveys, 38(1), 2006.
[50] X. You, G. Chang, X. Chen, C. Tian, and C. Zhu. Utility-based replication strategies in data grids. In Proc. of the Fifth International Conference on Grid and Cooperative Computing (GCC 2006).
[51] G. K. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, 1949.



Dharma Teja Nukarapu received the BE degree in Information Technology from Jawaharlal Nehru Technological University, India, in 2007, and the MS degree from the Department of Electrical Engineering and Computer Science, Wichita State University, in 2009. He is currently a PhD student in the Department of Electrical Engineering and Computer Science at Wichita State University. His research interests are data caching, replication, and job scheduling in data intensive scientific applications.

Bin Tang received the BS degree in physics from Peking University, China, in 1997, the MS degrees in materials science and computer science from Stony Brook University in 2000 and 2002, respectively, and the PhD degree in computer science from Stony Brook University in 2007. He is currently an Assistant Professor in the Department of Electrical Engineering and Computer Science at Wichita State University. His research interest is the algorithmic aspects of data intensive sensor networks. He is a member of the IEEE.

Liqiang Wang is currently an Assistant Professor in the Department of Computer Science at the University of Wyoming. He received the BS degree in mathematics from Hebei Normal University, China, in 1995, the MS degree in computer science from Sichuan University, China, in 1998, and the PhD degree in computer science from Stony Brook University in 2006. His research interest is the design and analysis of parallel computing systems. He is a member of the IEEE.

Shiyong Lu received the PhD degree in Computer Science from the State University of New York at Stony Brook in 2002, the ME degree from the Institute of Computing Technology of the Chinese Academy of Sciences, Beijing, in 1996, and the BE degree from the University of Science and Technology of China, Hefei, in 1993. He is currently an Associate Professor in the Department of Computer Science, Wayne State University, and the Director of the Scientific Workflow Research Laboratory (SWR Lab). His research interests include scientific workflows and databases. He has published more than 90 papers in refereed international journals and conference proceedings. He is the founder and currently a program co-chair of the IEEE International Workshop on Scientific Workflows (2007-2010), and an editorial board member of the International Journal of Semantic Web and Information Systems and the International Journal of Healthcare Information Systems and Informatics. He is a Senior Member of the IEEE.
