
Data Replication in Data Intensive Scientific Applications With Performance Guarantee

Dharma Teja Nukarapu, Bin Tang, Liqiang Wang, and Shiyong Lu

Abstract— Data replication has been well adopted in data intensive scientific applications to reduce data file transfer time and bandwidth consumption. However, the problem of data replication in Data Grids, an enabling technology for data intensive applications, has proven to be NP-hard and even non-approximable, making this problem difficult to solve. Meanwhile, most of the previous research in this field is either theoretical investigation without practical consideration, or heuristics-based with little or no theoretical performance guarantee. In this paper, we propose a data replication algorithm that not only has a provable theoretical performance guarantee, but also can be implemented in a distributed and practical manner. Specifically, we design a polynomial time centralized replication algorithm that reduces the total data file access delay by at least half of that reduced by the optimal replication solution. Based on this centralized algorithm, we also design a distributed caching algorithm, which can be easily adopted in a distributed environment such as Data Grids. Extensive simulations are performed to validate the efficiency of our proposed algorithms. Using our own simulator, we show that our centralized replication algorithm performs comparably to the optimal algorithm and other intuitive heuristics under different network parameters. Using GridSim, a popular distributed Grid simulator, we demonstrate that the distributed caching technique significantly outperforms an existing popular file caching technique in Data Grids, and that it is more scalable and adaptive to dynamic changes of file access patterns in Data Grids.

Index Terms— Data intensive applications, Data Grids, data replication, algorithm design and analysis, simulations.

I. Introduction

Data intensive scientific applications, which mainly aim to answer some of the most fundamental questions facing human beings, are becoming increasingly prevalent in a wide range of scientific and engineering research domains. Examples include human genome mapping [40], high energy particle physics and astronomy [1], [26], and climate change modeling [32]. In such applications, large amounts of data sets are generated, accessed, and analyzed by scientists worldwide. The Data Grid [49], [5], [22] is an enabling technology for data intensive applications. It is composed of hundreds of geographically distributed computation, storage, and networking resources to facilitate data sharing and management in data intensive applications. One distinct feature of Data Grids is that they produce and manage very large amounts of data sets, on the order of terabytes and petabytes. For example, the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) near Geneva, Switzerland, is the largest scientific instrument on the planet. Since it began operation in August of 2008, it was expected to produce roughly 15 petabytes of data annually, which are accessed and analyzed by thousands of scientists around the world [3].

(Author affiliations: Dharma Teja Nukarapu and Bin Tang are with the Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, KS 67260; Liqiang Wang is with the Department of Computer Science, University of Wyoming, Laramie, WY 82071; and Shiyong Lu is with the Department of Computer Science, Wayne State University, Detroit, MI 48202.)

Replication is an effective mechanism to reduce file transfer time and bandwidth consumption in Data Grids: placing the most frequently accessed data at the right locations can greatly improve the performance of data access from a user's perspective. Meanwhile, even though the memory and storage capacities of modern computers are ever increasing, they still cannot keep up with the demand of storing the huge amounts of data produced in scientific applications. With each Grid site having limited memory/storage resources,¹ the data replication problem becomes more challenging. We find that most of the previous work on data replication under limited storage capacity in Data Grids occupies two extremes of a wide spectrum: it is either theoretical investigation without practical consideration, or heuristics-based experimentation without performance guarantee (please refer to Section II for a comprehensive literature review). In this article, we endeavor to bridge this gap by proposing an algorithm that has a provable performance guarantee and can also be implemented in a distributed manner.

Our model is as follows. Scientific data, in the form of data files, are produced and stored in the Grid sites as the result of scientific experiments, simulations, or computations. Each Grid site executes a sequence of scientific jobs submitted by its users. To execute each job, some scientific data are usually needed as input files. If these files are not in the local storage resource of the Grid site, they are accessed from other sites, and transferred to and replicated in the local storage of the site if necessary. Each Grid site can store such data files subject to its storage/memory capacity limitation. We study how to replicate the data files onto Grid sites with limited storage space in order to minimize the overall file access time for Grid sites to finish executing their jobs.

Specifically, we formulate this problem as a graph-theoretical problem and design a centralized greedy data replication algorithm, which provably gives a total data file access time reduction (compared to no replication) of at least half of that obtained by the optimal replication algorithm. We also design a distributed caching technique² based on the centralized replication algorithm, and show experimentally that it can be easily adopted in a distributed environment such as Data Grids. The main idea of our distributed algorithm is that when multiple replicas of a data file exist in a Data Grid, each Grid site keeps track of (and thus fetches the data file from) its closest replica site. This can dramatically improve Data Grid performance because transferring large files takes a tremendous amount of time and bandwidth [18]. The central part of our distributed algorithm is a mechanism for each Grid site to accurately locate and maintain such a closest replica site. Our distributed algorithm is also adaptive: each Grid site makes its file caching decisions (i.e., replica creation and deletion) by observing the recent data access traffic going through it. Our simulation results show that our caching strategy adapts better to dynamic changes of user access behavior, compared to another existing caching technique in Data Grids [39].

¹ In this article, we use memory and storage interchangeably.

Currently, for most Data Grid scientific applications, the massive amount of data is generated at a few centralized sites (such as CERN for high energy physics experiments) and accessed by all other participating institutions. We envision that, in the future, in a closely collaborative scientific environment upon which data-intensive applications are built, each participating institutional site could well be a data producer as well as a data consumer, generating data as a result of the scientific experiments or simulations it performs, and meanwhile accessing data from other sites to run its own applications. Therefore, data transfer is no longer from a few data-generating sites to all other "client" sites, but could take place between an arbitrary pair of Grid sites. It is a great challenge in such an environment to efficiently share all the widely distributed and dynamically generated data files in terms of computing resources, bandwidth usage, and file transfer time. Our network model reflects this vision (please refer to Section III for the detailed Data Grid model).

The main results and contributions of our paper are as follows:

1) We identify the limitation of current research on data replication in Data Grids: it is either theoretical investigation without practical consideration, or heuristics-based implementation without a provable performance guarantee.

2) To the best of our knowledge, we are the first to propose a data replication algorithm for Data Grids that not only has a provable theoretical performance guarantee, but can also be implemented in a distributed manner.

3) Via simulations, we show that our proposed replication strategies perform comparably with the optimal algorithm and significantly outperform an existing popular replication technique [39].

4) Via simulations, we show that our replication strategy adapts well to dynamic access pattern changes in Data Grids.

² We emphasize the difference between replication and caching in this article. Replication is a proactive and centralized approach wherein the server replicates files into the network to achieve a certain global performance optimum, whereas caching is reactive and takes place on the client side to achieve a certain local optimum; it depends on the locally observed network traffic, job and file distribution, etc.

Paper Organization. The rest of the paper is organized as follows. We discuss the related work in Section II. In Section III, we present our Data Grid model and formulate the data replication problem. Sections IV and V propose the centralized and distributed algorithms, respectively. We present and analyze the simulation results in Section VI. Section VII concludes the paper and points out some future work.

II. Related Work

Replication has been an active research topic for many years in the World Wide Web [35], peer-to-peer networks [4], ad hoc and sensor networking [25], [44], and mesh networks [28]. In Data Grids, enormous scientific data and complex scientific applications call for new replication algorithms, which have attracted much research recently.

The work most closely related to ours is by Čibej, Slivnik, and Robič [47]. The authors study data replication in Data Grids as a static optimization problem. They show that this problem is NP-hard and non-approximable, which means that there is no polynomial-time algorithm that provides an approximate solution if P ≠ NP. The authors discuss two solutions: integer programming and simplifications. They only consider static data replication for the purpose of formal analysis. The limitation of the static approach is that the replication cannot adjust to dynamically changing user access patterns. Furthermore, their centralized integer programming technique cannot be easily implemented in a distributed Data Grid. Moreover, Baev et al. [6], [7] show that if all the data files have uniform size, then this problem is indeed approximable, and they give 20.5-approximation and 10-approximation algorithms. However, their approach, which is based on rounding an optimal solution to the linear programming relaxation of the problem, cannot be easily implemented in a distributed way. In this work, we follow the same direction (i.e., uniform data size), but design a polynomial time approximation algorithm that can also be easily implemented in a distributed environment like Data Grids.

Raicu et al. [36], [37] study, both theoretically and empirically, resource allocation in data intensive applications. They propose a "data diffusion" approach that acquires computing and storage resources dynamically, replicates data in response to demand, and schedules computations close to the data. They give an O(NM)-competitive online algorithm, where N is the number of stores, each of which can store M objects of uniform size. However, their model does not allow keeping multiple copies of an object simultaneously in different stores. In our model, we assume each object can have multiple copies, each on a different site.

Several replication schemes based on economic models have been proposed. The authors in [11], [8] use an auction protocol to make the replication decision and to trigger long-term optimization by using file access patterns. You et al. [50] propose utility-based replication strategies. Lei et al. [30] and Schintke and Reinefeld [41] address data replication for availability in the face of unreliable components, which is different from our work.

Jiang and Zhang [27] propose a technique in Data Grids to measure how soon a file is to be re-accessed before being evicted, compared with other files. They consider both the consumed disk space and the disk cache size. Their approach can accurately rank the value of each file to be cached. Tang et al. [45] present a dynamic replication algorithm for multi-tier Data Grids. They propose two dynamic replica algorithms: Single Bottom Up and Aggregate Bottom Up. Performance results show that both algorithms reduce the average response time of data access compared to a static replication strategy in a multi-tier Data Grid. Chang and Chang [13] propose a dynamic data replication mechanism called Latest Access Largest Weight (LALW). LALW associates a different weight with each historical data access record: a more recent data access has a larger weight. LALW gives a more precise metric to determine a popular file for replication.

Park et al. [33] propose a dynamic replication strategy, called BHR, which benefits from "network-level locality" to reduce data access time by avoiding network congestion in a Data Grid. Lamehamedi et al. [29] propose a lightweight Data Grid middleware, at the core of which are some dynamic data and replica location and placement techniques.

Ranganathan and Foster [39] present six different replication strategies: No Replication or Caching, Best Client, Cascading Replication, Plain Caching, Caching plus Cascading Replication, and Fast Spread. All of these strategies are evaluated with three user access patterns (Random Access, Temporal Locality, and Geographical plus Temporal Locality). Via simulation, the authors find that Fast Spread performs best under Random Access, and Cascading works better than the others under Geographical plus Temporal Locality. Due to its wide popularity in the literature and its simplicity to implement, we compare our distributed replication algorithm to this work.

On the systems side, there are several real testbed and system implementations utilizing data replication. One system is the Data Replication Service (DRS) designed by Chervenak et al. [17]. DRS is a higher-level data management service for scientific collaborations in Grid environments. It replicates a specified set of files onto a storage system and registers the new files in the replica catalog of the site. Another system is the Physics Experimental Data Export (PheDEx) system [21]. PheDEx supports both hierarchical distribution and subscription-based transfer of data.

To execute the submitted jobs, each Grid site either gets the needed input data files to its local computing resource, schedules the job at sites where the needed input data files are stored, or transfers both the data and the job to a third site that performs the computation and returns the result. In this paper, we focus on the first approach. We leave the job scheduling problem [48], which studies how to map jobs onto Grid resources for execution, and its coupling with data replication, as our future research. We are aware of very active research studying the relationship between these two [12], [38], [16], [23], [14], [46], [19], [10]. The focus of our paper, however, is on a data replication strategy with a provable performance guarantee.

As demonstrated by the experiments of Chervenak et al. [17], the time to execute a scientific job is mainly the time it takes to transfer the needed input files from server sites to local sites. Similar to other work on replica management for Data Grids [30], [41], [10], we only consider the file transfer time (access time), not the job execution time in the processor or any other internal storage processing or I/O time. Since the data are read-only for many Data Grid applications [38], we do not consider consistency maintenance between the master file and the replica files. Readers who are interested in consistency maintenance in Data Grids may refer to [20], [42], [34].

Fig. 1. Data Grid Model.

III. Data Grid Model and Problem Formulation

We consider a Data Grid model as shown in Figure 1. A Data Grid consists of a set of sites. There are institutional sites, which correspond to the different scientific institutions participating in the scientific project. There is one top level site, which is the centralized management entity in the entire Data Grid environment; its major role is to manage the Centralized Replica Catalogue (CRC). The CRC provides location information about each data file and its replicas, and it is essentially a mapping between each data file and all the institutional sites where the data is replicated. Each site (top level site or institutional site) may contain multiple grid resources. A grid resource could be either a computing resource, which allows users to submit and execute jobs, or a storage resource, which allows users to store data files.³ We assume that each site has both computing and storage capacities, and that within each site the bandwidth is high enough that the communication delay inside the site is negligible.

³ We differentiate between site and node in this paper. We refer to a site at the Grid level and a node at the cluster level, where each node is a computing resource, a storage resource, or a combination of both.


For the data file replication problem addressed in this article, there are multiple data files, and each data file is produced by its source site (a top level site or an institutional site may act as the source site for more than one data file). Each Grid site has limited storage capacity and can cache/store multiple data files subject to its storage capacity constraint.

Data Grid Model. A Data Grid can be represented as an undirected graph $G(V, E)$, where the set of vertices $V = \{1, 2, \ldots, n\}$ represents the sites in the Grid, and $E$ is the set of weighted edges in the graph. The edge weight may represent a link metric such as loss rate, distance, delay, or transmission bandwidth. In this paper, the edge weight represents the bandwidth, and we assume all edges have the same bandwidth $B$ (in Section VI, we study a heterogeneous environment where different edges have different bandwidths). There are $p$ data files $D = \{D_1, D_2, \ldots, D_p\}$ in the Data Grid; $D_j$ is originally produced and stored at its source site $S_j \in V$. Note that a site can be the original source site of multiple data files. The size of data file $D_j$ is $s_j$. Each site $i$ has a storage capacity of $m_i$ (for a source site $i$, $m_i$ is the available storage space after storing its original data).

Users of the Data Grid submit jobs to their own sites, and the jobs are executed in FIFO order. Assume that Grid site $i$ has $n_i$ submitted jobs $\{t_{i1}, t_{i2}, \ldots, t_{in_i}\}$, and each job $t_{ik}$ ($1 \le k \le n_i$) needs a subset $F_{ik}$ of $D$ as its input files for execution. If we use $w_{ij}$ to denote the number of times that site $i$ needs $D_j$ as an input file, then $w_{ij} = \sum_{k=1}^{n_i} c_k$, where $c_k = 1$ if $D_j \in F_{ik}$ and $c_k = 0$ otherwise. The transmission time of sending data file $D_j$ along any edge is $s_j / B$. We use $d_{ij}$ to denote the number of transmissions needed to transmit a data file from site $i$ to site $j$ (which is equal to the number of edges between these two sites). We do not consider the file propagation time along the edge since, compared with the transmission time, it is quite small and thus negligible. The total data file access cost in the Data Grid before replication is the total transmission time spent to get all needed data files for each site:
$$\sum_{i=1}^{n} \sum_{j=1}^{p} w_{ij} \times d_{iS_j} \times s_j / B.$$
The objective of our file replication problem is to minimize the total data file access cost by replicating data files in the Data Grid. Below, we give a formal definition of the file replication problem addressed in this article.

Problem Formulation. The data replication problem in the Data Grid is to select a set of sets $M = \{A_1, A_2, \ldots, A_p\}$, where $A_j \subset V$ is the set of Grid sites that store a replica copy of $D_j$, to minimize the total access cost in the Data Grid:
$$\tau(G, M) = \sum_{i=1}^{n} \sum_{j=1}^{p} w_{ij} \times \min_{k \in \{S_j\} \cup A_j} d_{ik} \times s_j / B,$$
under the storage capacity constraint that $|\{A_j \mid i \in A_j\}| \le m_i$ for all $i \in V$, which means that each Grid site $i$ appears in at most $m_i$ sets of $M$. Here, each site accesses a data file from its closest site holding a replica copy. Čibej, Slivnik, and Robič show that this problem (with arbitrary data file sizes) is NP-hard and even non-approximable [47]. However, Baev et al. [6], [7] have shown that when the data files have uniform size (i.e., $s_i = s_j$ for any $D_i, D_j \in D$), the problem is approximable. Below we present a centralized greedy algorithm with a provable performance guarantee for uniform data size, and a distributed algorithm, respectively.
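To make the objective function concrete, the following Python sketch evaluates $\tau(G, M)$ directly from the definitions above; the function name and the list/set layout are our own illustrative choices, not part of the paper.

```python
def total_access_cost(w, d, s, B, source, replicas):
    """Compute tau(G, M).

    w[i][j]    : number of times site i needs file j (w_ij)
    d[i][k]    : hop count between sites i and k (d_ik)
    s[j]       : size of file j (s_j);  B : edge bandwidth
    source[j]  : source site S_j of file j
    replicas[j]: set A_j of sites holding a replica of file j
    """
    n, p = len(w), len(s)
    cost = 0.0
    for i in range(n):
        for j in range(p):
            holders = {source[j]} | set(replicas[j])      # {S_j} union A_j
            nearest = min(d[i][k] for k in holders)       # closest holder of D_j
            cost += w[i][j] * nearest * s[j] / B
    return cost
```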

IV. Centralized Data Replication Algorithm in Data Grids

Our centralized data replication algorithm is a greedy algorithm. Initially, all Grid sites have empty storage space (except for the sites that originally produce and store some files). Then, at each step, the algorithm places one data file into the storage space of one site such that the reduction of the total access cost in the Data Grid is maximized at that step. The algorithm terminates when all storage space at the sites has been filled with replicated data files, or the total access cost cannot be reduced further. The algorithm is given below.

Algorithm 1: Centralized Data Replication Algorithm
BEGIN
    M = A_1 = A_2 = ... = A_p = ∅ (empty sets);
    while (the total access cost can still be reduced by replicating data files in the Data Grid)
        Among all sites with available storage capacity and all data files, let replicating data file D_i on site m give the maximum
            τ(G, {A_1, A_2, ..., A_i, ..., A_p}) − τ(G, {A_1, A_2, ..., A_i ∪ {m}, ..., A_p});
        A_i = A_i ∪ {m};
    end while;
    RETURN M = {A_1, A_2, ..., A_p};
END. ♦
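A minimal Python sketch of Algorithm 1, assuming unit-size files and reusing the illustrative total_access_cost helper from Section III; variable names and tie-breaking are our own choices, not prescribed by the paper.

```python
def greedy_replication(w, d, s, B, source, capacity):
    """Greedy placement (Algorithm 1 sketch): repeatedly replicate the
    (site, file) pair whose placement maximizes the drop in total access
    cost, until storage is exhausted or no placement helps."""
    n, p = len(w), len(s)
    replicas = [set() for _ in range(p)]              # A_1, ..., A_p
    used = [0] * n                                    # storage pages used per site
    current = total_access_cost(w, d, s, B, source, replicas)
    while True:
        best_gain, best = 0.0, None
        for site in range(n):
            if used[site] >= capacity[site]:
                continue                              # this site is full
            for j in range(p):
                if site == source[j] or site in replicas[j]:
                    continue                          # site already holds D_j
                replicas[j].add(site)                 # tentative placement
                gain = current - total_access_cost(w, d, s, B, source, replicas)
                replicas[j].discard(site)
                if gain > best_gain:
                    best_gain, best = gain, (site, j)
        if best is None:                              # cost cannot be reduced further
            return replicas
        site, j = best
        replicas[j].add(site)
        used[site] += 1
        current -= best_gain
```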

The total running time of the greedy data replication algorithm is $O(p^2 n^3 m)$, where $n$ is the number of sites in the Data Grid, $m$ is the average number of memory pages per site, and $p$ is the total number of data files. Note that the number of iterations in the above algorithm is bounded by $nm$; at each iteration we need to compute at most $pn$ benefit values, and each benefit value computation may take $O(pn)$ time. Below we present two lemmas showing some properties of the above greedy algorithm. They are also needed to prove the theorem below, which shows that the greedy algorithm returns a solution with a near-optimal total access cost reduction.
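Spelling out the product of the three factors just mentioned:
$$\underbrace{nm}_{\text{iterations}} \times \underbrace{pn}_{\text{benefit values per iteration}} \times \underbrace{O(pn)}_{\text{time per benefit value}} = O(p^2 n^3 m).$$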

Lemma 1: The total access cost reduction is independent of the order of the data file replications.

Proof: This is because the reduction of total access cost depends only on the initial and final storage configurations of each site, which do not depend on the exact order of the data replication.

Lemma 2: Let $M$ be the set of cache sites at an arbitrary stage. Let $A_1$ and $A_2$ be two distinct sets of sites not included in $M$. Then the access cost reduction due to the selection of $A_2$ on top of the cache site set $M \cup A_1$, denoted as $a$, is less than or equal to the access cost reduction due to the selection of $A_2$ on top of $M$, denoted as $b$.

Proof: Let ClientSet$(A_2, M)$ be the set of sites that have their closest cache sites in $A_2$ when $A_2$ is selected as cache sites on top of $M$. With the selection of $A_1$ before $A_2$, some sites in ClientSet$(A_2, M)$ may find that they are closer to some cache sites in $A_1$ and then "deflect" to become elements of ClientSet$(A_1, M)$. Therefore, ClientSet$(A_2, M \cup A_1) \subseteq$ ClientSet$(A_2, M)$.

Since the selection of $A_1$ before $A_2$ does not change the closest sites in $M$ for the sites in ClientSet$(A_2, M \cup A_1)$, if ClientSet$(A_2, M \cup A_1) =$ ClientSet$(A_2, M)$, then $a = b$; if ClientSet$(A_2, M \cup A_1) \subset$ ClientSet$(A_2, M)$, then $a < b$.

The above lemma says that the total access cost reduction due to an arbitrary cache site never increases with the selection of other cache sites before it.

Fig. 2. Each site $i$ in the original graph $G$ has $m_i$ memory space; each site in the modified graph $G'$ has $2m_i$ memory space.

We now present a theorem that shows the performance guarantee of our centralized greedy algorithm. In the theorem and the following proof, we assume that all data files have the same uniform size (each occupying a unit of memory). The proof technique used here is similar to that used in [44] for the closely related problem of data caching in ad hoc networks.

Theorem 1: Under the Data Grid model we propose, the centralized data replication algorithm delivers a solution whose total access cost reduction is at least half of the optimal total access cost reduction.

Proof: Let $L$ be the total number of memory pages in the Data Grid. Without loss of generality, we assume $L$ is the total number of iterations of the greedy algorithm (note that it is possible that the total access cost cannot be reduced further even though sites still have available memory space). We denote the sequence of selections made by the greedy algorithm by $\{n_1^g f_1^g, n_2^g f_2^g, \ldots, n_L^g f_L^g\}$, where $n_i^g f_i^g$ indicates that at iteration $i$, data file $f_i^g$ is replicated at site $n_i^g$. We also denote the sequence of selections made by the optimal algorithm by $\{n_1^o f_1^o, n_2^o f_2^o, \ldots, n_L^o f_L^o\}$, where $n_i^o f_i^o$ indicates that at iteration $i$, data file $f_i^o$ is replicated at site $n_i^o$. Let $O$ and $C$ be the total access cost reductions of the optimal algorithm and the greedy algorithm, respectively.

Next, we consider a modified Data Grid $G'$, wherein each site $i$ doubles its memory capacity (now $2m_i$). We construct a cache placement solution for $G'$ such that, for site $i$, the first $m_i$ memory pages store the data files obtained by the greedy algorithm and the second $m_i$ memory pages store the data files selected by the optimal algorithm, as shown in Figure 2. Now we calculate the total access cost reduction $O'$ for $G'$. Obviously, $O' \ge O$, simply because each site in $G'$ caches extra data files beyond the data files cached at the same site in $G$.

According to Lemma 1, since the total access cost reduction is independent of the order of the individual selections, we consider the sequence of cache selections in $G'$ to be $\{n_1^g f_1^g, n_2^g f_2^g, \ldots, n_L^g f_L^g, n_1^o f_1^o, n_2^o f_2^o, \ldots, n_L^o f_L^o\}$, that is, the greedy selections followed by the optimal selections. Therefore, in $G'$, the total access cost reduction is due to the greedy selection sequence plus the optimal selection sequence. For the greedy selection sequence, after the selection of $n_L^g f_L^g$, the access cost reduction is $C$. For the optimal selection sequence, we need to calculate the access cost reduction due to the addition of each $n_i^o f_i^o$ ($1 \le i \le L$) on top of the already added selections $\{n_1^g f_1^g, n_2^g f_2^g, \ldots, n_L^g f_L^g, n_1^o f_1^o, n_2^o f_2^o, \ldots, n_{i-1}^o f_{i-1}^o\}$. From Lemma 2, we know that this is at most the access cost reduction due to the addition of $n_i^o f_i^o$ on top of the already added sequence $\{n_1^g f_1^g, n_2^g f_2^g, \ldots, n_{i-1}^g f_{i-1}^g\}$. Note that the latter is at most the access cost reduction due to the addition of $n_i^g f_i^g$ on top of the same sequence $\{n_1^g f_1^g, n_2^g f_2^g, \ldots, n_{i-1}^g f_{i-1}^g\}$, because the greedy algorithm picks the most beneficial placement at each iteration. Thus, the sum of the access cost reductions due to the selections $\{n_1^o f_1^o, n_2^o f_2^o, \ldots, n_L^o f_L^o\}$ is at most $C$ as well.

Thus, we conclude that $O$ is at most $O'$, which is at most two times $C$.
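In summary, writing $\Delta_i$ (our shorthand, not used in the paper) for the access cost reduction contributed by $n_i^o f_i^o$ in $G'$, the proof chains the inequalities
$$O \;\le\; O' \;=\; C + \sum_{i=1}^{L} \Delta_i \;\le\; C + C \;=\; 2C.$$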

V. Distributed Data Caching Algorithm in Data Grids

In this section, we design a localized distributed caching algorithm based on the centralized algorithm. In the distributed algorithm, each Grid site observes the local Data Grid traffic to make intelligent caching decisions. Our distributed caching algorithm is advantageous since it does not need global information such as the network topology of the Grid, and it is more reactive to network state such as the file distribution, user access pattern, and job distribution in the Data Grid. Therefore, our distributed algorithm can adapt well to such dynamic changes in the Data Grid.

The distributed algorithm is composed of two important components: a nearest replica catalog (NRC) maintained at each site and a localized data caching algorithm running at each site. Again, as stated in Section III, the top level site maintains a Centralized Replica Catalogue (CRC), which essentially maintains a replica site list $C_j$ for each data file $D_j$. The replica site list $C_j$ contains the set of sites (including the source site $S_j$) that have a copy of $D_j$.

Nearest Replica Catalog (NRC). Each site $i$ in the Grid maintains an NRC, and each entry in the NRC is of the form $(D_j, N_j)$, where $N_j$ is the nearest site that has a replica of $D_j$. When a site executes a job, it determines from its NRC the nearest replica site for each of its input data files and goes to that site directly to fetch the file (provided the input file is not in its local storage). In the initialization stage, the source sites send messages to the top level site informing it about their original data files; thus, the centralized replica catalog initially records each data file and its source site. The top level site then broadcasts the replica catalogue to the entire Data Grid, and each Grid site initializes its NRC entries to the source site of each data file. Note that if $i$ is the source site of $D_j$ or has cached $D_j$, then $N_j$ is interpreted as the second nearest replica site, i.e., the closest site (other than $i$ itself) that has a copy of $D_j$. The second nearest replica site information is helpful when site $i$ decides to remove the cached file $D_j$. If there is a cache miss, the request is redirected to the top level site, which sends back the replica site list for that data file. After receiving this information, the site updates its NRC table accordingly and sends the request to its nearest replica site for that data file. Therefore, a cache miss takes a much longer time. The above information is in addition to any information (such as routing tables) maintained by the underlying routing protocol in the Data Grid.
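As an illustration, the per-site NRC can be kept as a simple mapping from file IDs to site IDs; the class and method names below are our own and are not part of the paper.

```python
class NearestReplicaCatalog:
    """Per-site NRC: maps each data file ID to its nearest replica site.
    For files stored locally, the entry holds the *second*-nearest replica
    site, which is consulted if the local copy is later evicted."""

    def __init__(self, site_id, source_of):
        self.site_id = site_id
        # initialized from the CRC broadcast: every file maps to its source site
        self.nearest = dict(source_of)            # file_id -> site_id

    def fetch_target(self, file_id, local_files):
        """Return the site to fetch file_id from (the local site if cached)."""
        if file_id in local_files:
            return self.site_id
        # None signals a cache miss, to be resolved via the top level site
        return self.nearest.get(file_id)
```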

Maintenance of the NRC. As stated above, when a site $i$ caches a data file $D_j$, $N_j$ is no longer the nearest replica site, but rather the second-nearest replica site. Site $i$ sends a message to the top level site indicating that it is a new replica site of $D_j$. When the top level site receives the message, it updates its CRC by adding site $i$ to data file $D_j$'s replica site list $C_j$. It then broadcasts to the entire Data Grid a message containing the tuple $(i, D_j)$, indicating the ID of the new replica site and the ID of the newly made replica file. Consider a site $l$ that receives such a message $(i, D_j)$. Let $(D_j, N_j)$ be the NRC entry at site $l$, signifying that $N_j$ is the replica site for $D_j$ currently closest to $l$. If $d_{li} < d_{lN_j}$, then the NRC entry $(D_j, N_j)$ is updated to $(D_j, i)$. Note that the distance values, in terms of the number of hops, are available from the routing protocol.

When a site $i$ removes a data file $D_j$ from its local storage, its second-nearest replica site $N_j$ becomes its nearest replica site (note that the corresponding NRC entry does not change). In addition, site $i$ sends a message to the top level site indicating that it is no longer a replica site of $D_j$. The top level site updates its replica catalogue by deleting $i$ from $D_j$'s replica site list and then broadcasts a message with the information $(i, D_j, C_j)$ to the Data Grid, where $C_j$ is the updated replica site list for $D_j$. Consider a site $l$ that receives such a message, and let $(D_j, N_j)$ be its NRC entry. If $N_j = i$, then site $l$ updates its nearest replica site entry using $C_j$ (with the help of the routing table).
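The two maintenance rules can then be sketched as handlers for the top level site's broadcasts; hop_dist is assumed to be supplied by the underlying routing protocol, and all names are illustrative.

```python
def on_replica_added(nrc, new_site, file_id, hop_dist):
    """Broadcast (i, D_j): adopt the new replica site if it is closer."""
    current = nrc.nearest.get(file_id)
    if current is None or hop_dist(nrc.site_id, new_site) < hop_dist(nrc.site_id, current):
        nrc.nearest[file_id] = new_site

def on_replica_removed(nrc, gone_site, file_id, replica_list, hop_dist):
    """Broadcast (i, D_j, C_j): if our nearest replica site disappeared,
    re-select the closest site from the updated replica site list C_j."""
    if nrc.nearest.get(file_id) == gone_site:
        nrc.nearest[file_id] = min(replica_list, key=lambda s: hop_dist(nrc.site_id, s))
```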

Localized Data Caching Algorithm. Since each site has limited storage capacity, a good data caching algorithm that runs distributedly at each site is needed. To this end, each site observes the data access traffic locally for a sufficiently long time. The local access traffic observed by site $i$ includes its own local data requests, non-local data requests to data files cached at $i$, and the data request traffic that site $i$ forwards to other sites in the network.

Before we present the data caching algorithm, we give the following two definitions (written out as formulas after this list):

• Reduction in access cost of caching a data file. The reduction in access cost as the result of caching a data file at a site is given by: the access frequency for that file in the local access traffic observed by the site × the distance to the nearest replica site.

• Increase in access cost of deleting a data file. The increase in access cost as the result of deleting a data file at a site is given by: the access frequency for that file in the local access traffic observed by the site × the distance to the second-nearest replica site.
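Using the notation of Section III, and writing $a_{ij}$ (our shorthand) for the access frequency of $D_j$ in the local traffic observed by site $i$, these two quantities can be written as
$$\text{reduction}_i(D_j) = a_{ij} \times d_{i N_j} \times s_j / B, \qquad \text{increase}_i(D_j) = a_{ij} \times d_{i N'_j} \times s_j / B,$$
where $N_j$ is the nearest and $N'_j$ the second-nearest replica site of $D_j$ known to site $i$ (under uniform file sizes the factor $s_j / B$ is a constant, so hop counts suffice).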

For each data file $D_j$ not cached at site $i$, site $i$ calculates the reduction in access cost of caching $D_j$, while for each data file $D_j$ cached at site $i$, site $i$ computes the increase in access cost of deleting the file. With the help of the NRC, each site can compute such a reduction or increase of access cost in a distributed manner using only local information. Thus our algorithm is adaptive: each site makes a data caching or deletion decision by locally observing the most recent data access traffic in the Data Grid.

Cache Replacement Policy. With the above knowledge, a site always tries to cache the data files that fit in its local storage and give the most reduction in access cost. When the local storage capacity of a site is full, the following cache replacement policy is used. Let $|D|$ denote the size of a data file (or a set of data files) $D$. If the access cost reduction of caching a newly available data file $D_j$ is higher than the access cost increase of evicting some set $D$ of cached data files, where $|D| > |D_j|$, then the set $D$ is replaced by $D_j$.
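A sketch of this replacement decision in Python; how the evicted set $D$ is chosen is not prescribed by the paper, so the greedy pick of the cheapest-to-evict files below is our own assumption.

```python
def replacement_decision(new_file, reduction_of_new, cached, increase_of, size_of, free_space):
    """Return the set of cached files to evict for new_file, or None to skip caching.
    increase_of[f]: access cost increase of deleting f; size_of[f]: file size."""
    if size_of[new_file] <= free_space:
        return set()                                  # fits without any eviction
    evict, freed, penalty = set(), free_space, 0.0
    # free space by giving up the files that are cheapest to re-fetch later
    for f in sorted(cached, key=lambda f: increase_of[f]):
        evict.add(f)
        freed += size_of[f]
        penalty += increase_of[f]
        if freed >= size_of[new_file]:
            break
    # replace only if enough space is freed and the gain outweighs the penalty
    if freed >= size_of[new_file] and reduction_of_new > penalty:
        return evict
    return None
```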

Data File Consistency Maintenance. In this work, we assume that all the data are read-only, and thus no data consistency mechanism is needed. However, if consistency were to be considered, two mechanisms could be used. First, when the file at a site is modified, this can be treated as a data file removal in our distributed algorithm discussed above, with some necessary deviation: the site sends a message to the top level site indicating that it is no longer a replica site of the file, and the top level site broadcasts a message to the Data Grid so that each site can delete the corresponding nearest replica site entry. The second mechanism is Time to Live (TTL), which is the time until the replica copies are considered valid. The master copy and its replicas are considered outdated at the end of the TTL value and will not be used for job execution.

VI. Performance Evaluation

In this section, we demonstrate the performance of our designed algorithms via extensive simulations. First, we compare the centralized greedy algorithm (referred to as Greedy) with the centralized optimal algorithm (referred to as Optimal) and two other intuitive heuristics, using our own simulator written in Java (Section VI-A). Then, using GridSim [43], a well-known Grid simulator, we evaluate our distributed algorithm against an existing caching algorithm from the well-cited literature [39] (Section VI-B), which is based on a hierarchical Data Grid topology. Finally, we compare our distributed algorithm with the popular LRU/LFU algorithms implemented in OptorSim [10] on a general Data Grid topology.


Fig. 3. Performance comparison of the Optimal, Greedy, Local Greedy, and Random algorithms (y-axis: total access cost): (a) varying the number of data files; (b) varying the storage capacity; (c) varying the number of Grid sites. Unless varied, the number of sites is 10, the number of data files is 10, and the memory capacity of each site is 5. Data file size is 1.

A. Comparison of Centralized Algorithms

Optimal Replication Algorithm. Optimal is an exhaustive, and time-consuming, replication algorithm that yields an optimal solution. For a Data Grid with $n$ sites and $p$ data files of unit size, where each site has storage capacity $m$, the optimal algorithm enumerates all possible file replication placements in the Data Grid and finds the one with minimum total access cost. There are $(C_p^m)^n$ such file placements, where $C_p^m = \binom{p}{m}$ is the number of ways of selecting $m$ files out of $p$ files. Due to the high time complexity of the optimal algorithm, we experiment on Data Grids with 5 to 25 sites, where the average distance between two sites is 3 to 6 hops.
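The size of this search space can be computed directly; for example (the helper name is ours):

```python
from math import comb

def num_placements(n_sites, p_files, m_capacity):
    """Number of exhaustive file placements: (C(p, m))^n."""
    return comb(p_files, m_capacity) ** n_sites

# The small setting used below (n = 10 sites, p = 10 files, m = 5):
print(num_placements(10, 10, 5))   # 252**10, roughly 1e24 placements
```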

Local Greedy Algorithm and Random Algorithm. We also compare Greedy with two other intuitive replication algorithms: the Local Greedy algorithm and the Random algorithm. Local Greedy replicates each site's most frequently requested files in its local storage. Random proceeds in rounds; in each round, a random file is selected and placed in a randomly chosen non-full site that does not yet have a copy of that file. The algorithm stops when all the storage space of all sites is occupied.
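For concreteness, a minimal sketch of the Random baseline under unit-size files (Local Greedy is analogous, ranking files by each site's own request counts); terminating when no feasible placement remains is our addition to handle that corner case.

```python
import random

def random_replication(n_sites, p_files, source, capacity):
    """Random baseline: repeatedly place a random file at a random non-full
    site that does not already hold it, until all storage is occupied
    (or no feasible placement remains)."""
    replicas = [set() for _ in range(p_files)]
    used = [0] * n_sites
    while True:
        candidates = [(i, j) for i in range(n_sites) for j in range(p_files)
                      if used[i] < capacity[i] and i != source[j] and i not in replicas[j]]
        if not candidates:
            return replicas
        i, j = random.choice(candidates)
        replicas[j].add(i)
        used[i] += 1
```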

In most comparisons of the centralized algorithms, we assume that all file sizes are equal and all edge bandwidths are equal. Since the total data file access cost between sites is defined as the total transmission time spent to obtain the data files, which equals
$$\text{number of hops between two sites} \times \text{file size} / \text{bandwidth},$$
we can use the number of hops to approximate the file access cost. However, we also show that even in heterogeneous scenarios, where the file sizes and bandwidths are not equal, our centralized and distributed algorithms still perform very well (Figure 5 and Figure 13).

Recall that each data file has a uniform size of 1. For all algorithms, the $p$ files are initially placed randomly at some source sites; for the same comparison, the initial file placements are the same for all algorithms. Recall that for source sites, their storage capacities are those that remain after storing their original data files. Each data point in the plots below is an average over five different initial placements of the $p$ files. We assume that the number of times each site accesses each file is a random number between 1 and 10. In all plots, we show error bars indicating the 95 percent confidence interval.

Varying the Number of Data Files. First, we vary the number of data files in the Data Grid while fixing the number of sites and the storage capacity of each site. Due to the high time complexity of the optimal algorithm, we consider a small Data Grid with $n = 10$ and $m = 5$, and we vary $p$ over 5, 10, 15, and 20. The comparison of the four algorithms is shown in Figure 3 (a). In terms of total access cost, we observe that Optimal < Greedy < Local Greedy < Random. That is, among all the algorithms, Optimal performs the best since it exhausts all replication possibilities, but Greedy performs quite close to Optimal. Furthermore, we observe that Greedy performs better than Local Greedy and Random, which demonstrates that it is indeed a good replication algorithm. The figure also shows that when the storage capacity of each site is five and the total number of files is five, Greedy, like the other three algorithms, eventually replicates all five files at every Grid site, resulting in zero total access cost.

Varying Storage Capacity. Figure 3 (b) shows the results of varying the storage capacity of each site while keeping the number of data files and the number of sites unchanged. We choose $n = 10$ and $p = 10$, and vary $m$ over 1, 3, 5, 7, and 9 units. As expected, with the increase of the memory capacity of each Grid site, the total access cost decreases, since more file replicas can be placed into the Data Grid. Again, we observe that Greedy performs comparably to Optimal, and both perform better than Local Greedy and Random.

Varying the Number of Grid Sites. Figure 3 (c) shows the result of varying the number of sites in the Data Grid. We consider a small Data Grid, choose $p = 10$ and $m = 5$, and vary $n$ over 5, 10, 15, 20, and 25. The optimal algorithm performs only slightly better than the greedy algorithm, by 15% to 20%.

Comparison in Large Scale. We study the scalability of the different algorithms by considering a Data Grid at large scale, with 500 to 2,000 data files, the storage capacity of each site varying from 10 to 90, and the number of Grid sites varying from 10 to 50. Figure 4 shows the performance comparison of Greedy, Local Greedy, and Random.


Fig. 4. Performance comparison of the Greedy, Local Greedy, and Random algorithms at large scale (y-axis: total access cost): (a) varying the number of data files; (b) varying the storage capacity; (c) varying the number of Grid sites. Unless varied, the number of sites is 30, the number of data files is 1,000, and the storage capacity of each site is 50. Data file size is 1.

[Figure 5 — three panels plotting Total Access Cost versus (a) Number of Files, (b) Storage Capacity (GB), and (c) Number of Grid sites, each comparing Optimal, Greedy, Local Greedy, and Random.]

Fig. 5. Performance comparison of the Greedy, Optimal, Local Greedy, and Random algorithms with varying data file size. Unless varied, the number of sites is 10, the number of data files is 10, and the storage capacity of each site is 5. Data file sizes vary from 1 to 10.

Fig. 6. Illustration of the simulation topology: (a) Cascading Replication [39]; (b) simulation topology. Tier 3 has 32 sites; we do not show all of them due to space limits.

Again, it shows that Greedy performs the best among the three algorithms.

Comparison with Varying Data File Size. Even though Theorem 1 is valid only for uniform data sizes, we show experimentally that our greedy algorithm also achieves good system performance for varying data file sizes. Figure 5 shows the performance comparison of Optimal, Greedy, Local Greedy, and Random, wherein each data file size is a random number between 1 and 10. Among the three heuristics, Greedy again performs the best. It can also be seen that, due to the non-uniform data sizes, the access cost is no longer zero for any of the four algorithms when there are five files and each site has a storage capacity of five.

B. Distributed Algorithm versus Replication Strategies by Ranganathan and Foster [39]

In this section, we compare our distributed algorithm with the replication strategies proposed by Ranganathan and Foster [39]. We first give an overview of their strategies, then present the simulation environment, and finally discuss the simulation results of the comparison.

1) Replication Strategies by Ranganathan and Foster: Ranganathan and Foster study data replication in a hierarchical Data Grid model (see Figure 6). The hierarchical model, represented as a tree topology, has been used in the LHC Computing Grid project [2], which serves the scientific collaborations performing experiments at the Large Hadron Collider (LHC) at CERN. In this model, there is a tier 0 site at CERN, and there are regional sites belonging to the different institutes in the collaboration. For comparison, we also use this hierarchical model and assume that all data files are initially located at the CERN site.⁴ Like [2], we assume that only the leaf sites execute jobs. When the needed data files are not stored locally, a site always goes one level higher for the data. If the data is found there, it is fetched back to the leaf site; otherwise, the request goes one level higher still, until it reaches the CERN site.
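As a rough illustration, the following sketch (our own, with hypothetical names) walks up the tree one tier at a time until a replica is found; it assumes, as in the model above, that the CERN root stores every file, so the loop always terminates:

    def fetch(site, file_id, parent, holds):
        # parent[s]: the site one level higher than s (None at the CERN root)
        # holds[s]: set of file IDs currently stored at site s
        current, hops = site, 0
        while file_id not in holds[current]:
            current = parent[current]   # go one level higher
            hops += 1
        return current, hops            # site the file is fetched from, and its distance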

Ranganathan and Foster present six different replication strategies: (1) No Replication or Caching; (2) Best Client, whereby a replica is created at the best client site (the site that has issued the largest number of requests for the file); (3) Cascading Replication, whereby once the popularity of a file exceeds a threshold within a given time interval, a replica is created at the next level on the path to the best client; (4) Plain Caching, whereby the client that requests the file stores a copy of the file locally; (5) Caching plus Cascading Replication, which combines the Plain Caching and Cascading Replication strategies; and (6) Fast Spread, whereby replicas of the file are created at each site along its path to the client.

All of the above strategies are evaluated under three user access patterns: (1) Random Access: no locality in the patterns; (2) Temporal Locality: data accesses contain a small degree of temporal locality (recently accessed files are likely to be accessed again); and (3) Geographical plus Temporal Locality: data accesses contain a small degree of temporal and geographical locality (files recently accessed by a site are likely to be accessed again by a nearby site). Their simulation results show that Caching plus Cascading works better than the others in most cases. In our work, we therefore compare our strategy with Caching plus Cascading (referred to as Cascading in the simulation) under Geographical plus Temporal Locality.

Cascading Replication Strategy. Figure 6 (a) shows in detail how Cascading works. There is a file F1 at the root site. When the number of requests for F1 exceeds the threshold at the root, a replica of F1 is sent to the next layer on the path to the best client. Eventually the threshold is exceeded at that layer as well, and a replica of F1 is sent to Client C.
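A minimal sketch of this threshold rule is given below; the counter structure and helper names are ours (not from [39]), and tracking which child subtree is the "best client" is assumed to be available:

    from collections import defaultdict

    request_count = defaultdict(int)     # (site, file_id) -> requests seen in this interval

    def on_request(site, file_id, threshold, best_child, holds):
        # best_child(site, file_id): the child on the path to the best client (assumed given)
        request_count[(site, file_id)] += 1
        if file_id in holds[site] and request_count[(site, file_id)] > threshold:
            child = best_child(site, file_id)
            holds[child].add(file_id)               # push a replica one level down
            request_count[(site, file_id)] = 0      # start a new observation interval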

TABLE I
SIMULATION PARAMETERS FOR HIERARCHICAL TOPOLOGY.

Description                               Value
Number of sites                           43
Number of files                           1,000 - 5,000
Size of each file                         2 GB
Available storage of each site            100 - 500 GB
Number of jobs executed by each site      400
Number of input files of each job         1 - 10
Link bandwidth                            100 MB/s

2) Simulation Setup: In the simulation, we use the GridSim simulator [43] (version 4.1). The GridSim toolkit allows modeling and simulation of entities in parallel and distributed computing environments: system users, applications, resources, and resource brokers. Version 4.1 of GridSim incorporates a Data Grid extension to model and simulate Data Grids. In addition, GridSim offers extensive support for the design and evaluation of scheduling algorithms, which suits our future work, wherein a more comprehensive framework of data replication and job scheduling will be studied. For these reasons, we use GridSim in our simulation.

⁴ However, our Data Grid model is not limited to a tree-like hierarchical model. In Section VI-C, we show a general graph model, in which our distributed replication algorithm performs better than the LRU/LFU strategies implemented in OptorSim [10].

Figure 6 (b) shows the topology used in our simulation. Similar to the one in [39], there are four tiers in the Grid, with all data being produced at the topmost tier 0 (the root). Tier 1 consists of the regional centers of different continents; here we consider two. The next tier is composed of different countries, and the final tier consists of the participating institution sites. The Grid contains a total of 43 sites, 32 of which execute tasks and generate requests.

Table I shows the simulation parameters, most of which are taken from [39] for the purpose of comparison. The CERN site originally generates and stores 1,000 to 5,000 distinct data files, each of which is 2 GB. Each site (except the CERN site) dynamically generates and executes 400 jobs, and each job needs 1 to 10 files as input. To execute each job, the site first checks whether the input files are in its local storage; if not, it goes to the nearest replica site to get the file. The available storage capacity at each site varies from 100 GB to 500 GB, and the link bandwidth is 100 MB/s. Our simulation runs on a DELL desktop with a Core 2 Duo CPU and 1 GB of RAM.
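The per-job file fetching just described can be sketched as follows (an illustrative reading with our own names; capacity handling and the actual transfer-time model are omitted, and at least one site, the CERN root, is assumed to hold every file):

    def run_job(site, input_files, holds, dist):
        # holds[s]: files stored at site s; dist[a][b]: transfer cost between sites a and b
        delay = 0
        for f in input_files:
            nearest = min((s for s in range(len(holds)) if f in holds[s]),
                          key=lambda s: dist[site][s])
            delay += dist[site][nearest]   # zero if the file is already local
            holds[site].add(f)             # cache the received file locally
        return delay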

We emphasize that the above parameters take scaling into account. Usually, the amount of data in a Data Grid system is on the order of petabytes. To enable simulation of such a large amount of data, a scale of 1:1000 is used; that is, the number of files in the system and the storage capacity of each site are both reduced by a factor of 1,000.

Data File Access Patterns. Each job has one to ten input files, as noted above. For the input files of each job, we adopt two file access patterns: a random access pattern and a tempo-spatial access pattern. In the random access pattern, the input files of each job are chosen uniformly at random from the entire set of data files available in the Data Grid. In the tempo-spatial access pattern, the input data files required by each job follow a Zipf distribution [51], [9] and a geographic distribution, as explained below.

• In the Zipf distribution (the temporal part), for the p data files, the probability that a job accesses the j-th data file (1 ≤ j ≤ p) is $P_j = \frac{1/j^{\theta}}{\sum_{h=1}^{p} 1/h^{\theta}}$, where 0 ≤ θ ≤ 1. When θ = 1, this is the strict Zipf distribution, while for θ = 0 it reduces to the uniform distribution. We choose θ = 0.8 based on real web trace studies [9].

• In the geographic distribution (the spatial part), the data file access frequencies of jobs at a site depend on the site's geographic location in the Grid, such that sites in proximity have similar data file access frequencies. In our case, leaf sites close to each other have similar data file access patterns. Specifically,



[Figure 7 — two panels plotting Average file access time (seconds) versus (a) total number of files and (b) storage capacity (GB) of each site, comparing Distributed and Cascading under the Random, TS-Static, and TS-Dynamic access patterns.]

Fig. 7. Performance comparison between our distributed algorithm and Cascading in the hierarchical topology. In (a), the storage capacity of each site is 300 GB; in (b), the number of data files is 3,000. Each data file size is 2 GB.

[Figure 8 — two panels plotting Percentage Increase of File Access Time (%) versus (a) number of data files and (b) storage capacity (GB) of each Grid site, comparing Distributed and Cascading.]

Fig. 8. Percentage increase of the average file access time due to the dynamic access pattern.

[Figure 9 — two panels plotting Average file access time (seconds) versus (a) total number of files and (b) storage capacity (TB) of each site, comparing Distributed and Cascading under the Random, TS-Static, and TS-Dynamic access patterns.]

Fig. 9. Performance comparison between our distributed algorithm and Cascading in a typical Grid environment. In (a), the storage capacity of each site is 500 TB; in (b), the number of data files in the Grid is 500,000. Each data file size is 1 GB.



suppose there are n client sites, numbered 1, 2, ..., n, where client 1 is close to client 2, which is close to client 3, and so on, and suppose the total number of data files is p. If the accessed data ID is id according to the original Zipf-like access pattern, then for client i (1 ≤ i ≤ n) the new data ID is (id + (p/n) · i) mod p. In this way, neighboring clients have similar, but not identical, access patterns, while far-away clients have quite different access patterns (a sketch of generating this access pattern is given after this list).
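The tempo-spatial request stream described in the two items above can be generated roughly as follows (a sketch with illustrative names; the simulator's actual implementation may differ):

    import random

    def zipf_file_id(p, theta=0.8):
        # draw j in 0..p-1 with probability proportional to 1/(j+1)^theta
        weights = [1.0 / (j + 1) ** theta for j in range(p)]
        return random.choices(range(p), weights=weights)[0]

    def tempo_spatial_id(client_i, n, p, theta=0.8):
        # shift the Zipf-ranked ID by the client's position so that neighboring
        # clients favor overlapping, but not identical, sets of files
        return (zipf_file_id(p, theta) + (p // n) * client_i) % p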

To investigate the effect of dynamic user behavior on the different algorithms, we further divide the tempo-spatial access pattern into two variants: static and dynamic. In the dynamic access pattern, the file access pattern of each site changes in the middle of the simulation, while in the static access pattern it does not change. In summary, our simulation uses the following three file access patterns: the random access pattern (Random), the static tempo-spatial access pattern (TS-Static), and the dynamic tempo-spatial access pattern (TS-Dynamic).

3) Simulation Results and Analysis: Below we present our simulation results and analysis.

Varying Number of Data Files. In this part of the simulation, we vary the number of data files over 1,000, 2,000, 3,000, 4,000, and 5,000 files, while fixing the storage capacity of each Grid site at 300 GB. Figure 7 (a) shows the comparison of the average file access time of the two algorithms under all three access patterns. We note the following observations. First, for both algorithms under all three access patterns, the average file access time increases with the number of files. However, our distributed greedy data replication algorithm yields an average file access time three to four times lower than that obtained by Cascading under all three access patterns. Second, for both our distributed algorithm and Cascading, the average file access time under TS-Static is 20% to 30% smaller than that under the Random access pattern. This is because TS-Static incorporates both temporal and spatial locality, which benefits both replication algorithms. Figure 7 (a) also shows the performance of both algorithms under the TS-Dynamic access pattern, wherein the file access pattern of each site changes during the simulation. As shown, both our distributed replication algorithm and Cascading suffer from the file access pattern change, with increased average file access time compared with TS-Static. This is because it takes some time for both algorithms to obtain a good replica placement for the dynamic file access pattern. Under a static access pattern, the Data Grid eventually reaches a stable stage where the file placement at each site remains unchanged, thus giving a lower file access time. Figure 8 (a) further shows that our algorithm adapts to the access pattern change better than the Cascading algorithm: the percentage increase of the average file access time of our algorithm due to the dynamic access pattern is around 50%, while for Cascading it is around 100%.
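Here the percentage increase reported in Figure 8 is, presumably, computed relative to the static case:

$\text{percentage increase} = \frac{T_{\text{TS-Dynamic}} - T_{\text{TS-Static}}}{T_{\text{TS-Static}}} \times 100\%,$

where $T$ denotes the average file access time of the algorithm in question.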

Varying Storage Capacity of Each Site. We also vary the storage capacity of each Grid site over 100, 200, 300, 400, and 500 GB while fixing the number of files in the Grid at 3,000. The performance comparison is shown in Figure 7 (b). As expected, as the storage capacity increases, the file access time decreases for both algorithms, since a larger storage capacity allows each site to store more replicated files for local access. We make the same observation as before: our distributed replication algorithm performs much better than Cascading under the Random and TS-Static/Dynamic access patterns. Moreover, for both our distributed algorithm and Cascading, the average file access time under Random is about 20% to 30% higher than that under TS-Static. Again, compared with the TS-Static pattern, the file access time of both algorithms increases under the TS-Dynamic pattern. However, as shown in Figure 8 (b), our distributed replication algorithm evidently adapts to the dynamic access pattern better than Cascading does, giving a smaller percentage increase of the average file access time.

TABLE II
SIMULATION PARAMETERS FOR TYPICAL GRID ENVIRONMENT.

Description                               Value
Number of sites                           100
Number of files                           1,000 - 1,000,000
Size of each file                         1 GB
Available storage of each site            1 TB - 1 PB
Number of jobs executed by each site      10,000
Number of input files of each job         1 - 10
Link bandwidth                            4,000 MB/s

TABLE III
SIMULATION PARAMETERS FOR TYPICAL CLUSTER ENVIRONMENT.

Description                               Value
Number of nodes                           1,000
Number of files                           1,000 - 1,000,000
Size of each file                         1 GB
Available storage of each node            10 GB - 1 TB
Number of jobs executed by each node      1,000
Number of input files of each job         1 - 10
Link bandwidth                            100 MB/s

Comparison in Typical Grid Environment. In a typical Grid environment, there are usually only tens to hundreds of different sites, each with large storage (petabytes per site), and the network bandwidth between Grid sites can be 10 GB/s, or even as high as 100 GB/s. To simulate such an environment we use the scaled parameters shown in Table II. Figures 9 (a) and (b) show the performance comparison of our distributed algorithm and Cascading when varying the total number of files (while fixing the storage capacity at 500 TB) and the storage capacity of each site (while fixing the number of data files at 500,000), respectively. We make the following observations. First, when there are 1,000, 50,000, or 200,000 files in the Grid (Figure 9 (a)), or when the storage capacity is 600 TB, 800 TB, or 1,000 TB (Figure 9 (b)), the performance of both algorithms under the same access pattern, and the performance of the same algorithm under different access patterns, are very similar. This is because, in both cases, each site has enough space to store all of its requested files; therefore, no cache replacement is involved. As such, even under different access patterns, each site behaves the same way with both algorithms: requesting the input files for each job, storing the received input files in local storage, and executing the next job.



[Figure 10 — two panels plotting Average file access time (seconds) versus (a) total number of files and (b) storage capacity (GB) of each node, comparing Distributed and Cascading under the Random, TS-Static, and TS-Dynamic access patterns.]

Fig. 10. Performance comparison between our distributed algorithm and Cascading in a typical cluster environment. In (a), the storage capacity of each node is 500 GB; in (b), the number of data files in the cluster is 500,000. Each data file size is 1 GB.

[Figure 11 — two panels plotting Average file access time (seconds) versus (a) total number of files and (b) storage capacity (GB) of each node, comparing Distributed and Cascading under the Random, TS-Static, and TS-Dynamic access patterns.]

Fig. 11. Performance comparison between our distributed algorithm and Cascading in a cluster environment with full connectivity. In (a), the storage capacity of each node is 500 GB; in (b), the number of data files in the cluster is 500,000. Each data file size is 1 GB.

Otherwise, we observe performance differences similar to those shown in Figure 7. Second, for our distributed algorithm, the percentage increase of access time due to the dynamic pattern is very small, around 5% to 8%. This shows that our algorithm adjusts well to the dynamic access pattern in a typical Grid environment. Third, Figure 9 (b) shows that with increased storage capacity of each site, the performance difference between our distributed algorithm and Cascading becomes smaller for all three access patterns. For example, when the storage capacity is 1 TB, Cascading yields more than twice the file access time of our distributed algorithm, while at 400 TB it incurs only 40% more access time. This shows that in the more stringent storage scenarios, our caching algorithm is the better mechanism for reducing file access time and thus job execution time.

Comparison in Typical Cluster Environment. In a typical cluster environment, a site often has 1,000 to 10,000 nodes (note the difference between sites and nodes, as explained in Section III), each with small local storage in the range of 10 GB to 1 TB. The network bandwidth within a cluster is typically 1 GB/s. We use the parameters in Table III to simulate the cluster environment. Figure 10 shows the performance comparison of our distributed algorithm and Cascading under the different access patterns. We observe that for our distributed algorithm, the percentage increase of access time due to the dynamic pattern is relatively large, around 40% to 100%. This shows that our algorithm does not adjust as well to the dynamic access pattern in a typical cluster environment. The reason is that in a cluster, the diameter of the network (the number of hops in the longest shortest path between two cluster nodes) is much larger than that in a typical Grid environment, and the bandwidth in a cluster is much smaller than that in a Grid. As a result, when the file access pattern of each site changes in the middle of the simulation, it takes a longer time to propagate this information to other nodes.
longer time to propagate such <strong>in</strong>formation to other nodes.


Therefore, the nearest replica catalog maintained by each node can become outdated, causing frequent cache misses for the needed input files.

TABLE IV
STORAGE CAPACITY OF EACH SITE IN EU DATA GRID TESTBED 1 [10]. IC IS IMPERIAL COLLEGE.

Site          Bologna  Catania  CERN   IC   Lyon  Milano  NIKHEF  NorduGrid  Padova  RAL  Torino
Storage (GB)  30       30       10000  80   50    50      70      63         50      50   50

Comparison in a Cluster Environment with Full Connectivity. We also consider a cluster with a fully connected set of nodes; the simulation results are shown in Figure 11. There are two major observations. First, comparing Figure 9 with Figure 11, we observe that the average file access time in a cluster with full connectivity is lower than that in a Grid. This is because, even though a typical Grid link bandwidth is higher than that of a cluster (4,000 MB/s versus 100 MB/s, as shown in Tables II and III), there are far fewer point-to-point links in a Grid of 100 sites than in a fully connected cluster of 1,000 nodes (around 5 × 10³ edges versus 5 × 10⁵ edges). Therefore, the full network bisection bandwidth of the cluster is higher than that of the Grid (5 × 10⁵ × 100 = 5 × 10⁷ MB/s versus 5 × 10³ × 4,000 = 2 × 10⁷ MB/s). Second, we observe that under all three access patterns, our distributed algorithm performs the same as Cascading. This is because with full connectivity, each site can get the needed data files directly from the CERN site; therefore, no site is able to observe the data access traffic of other sites and make intelligent caching decisions. Both our distributed caching algorithm and the Cascading algorithm then boil down to the same LRU/LFU cache replacement algorithm. In a fully connected environment, our distributed algorithm loses its advantage as an effective caching algorithm, as would any other distributed caching algorithm.
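The back-of-the-envelope numbers above can be reproduced as follows (a sketch that simply redoes the arithmetic in the text, assuming full connectivity in both topologies and the per-link bandwidths of Tables II and III):

    grid_sites, cluster_nodes = 100, 1000
    grid_links = grid_sites * (grid_sites - 1) // 2           # about 5 x 10^3 links
    cluster_links = cluster_nodes * (cluster_nodes - 1) // 2  # about 5 x 10^5 links
    grid_total = grid_links * 4000         # MB/s per Grid link    -> about 2 x 10^7 MB/s
    cluster_total = cluster_links * 100    # MB/s per cluster link -> about 5 x 10^7 MB/s
    print(grid_links, cluster_links, grid_total, cluster_total)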

Fig. 12. EU Data Grid Testbed 1 [10]. Numbers indicate the bandwidth between two sites.

C. Distributed Algorithm versus LRU/LFU in General Topology

OptorSim [10] is a Grid simulator that simulates a general-topology Grid testbed for data intensive high energy physics experiments. Three replication algorithms are implemented in OptorSim: (i) LRU (Least Recently Used), which deletes the files that have been used least recently; (ii) LFU (Least Frequently Used), which deletes the files that have been used least frequently in the recent past; and (iii) an economic model in which sites "buy" and "sell" files using an auction mechanism and delete files only if they are less valuable than the new file.
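For reference, a generic LRU-style file cache of the kind these strategies build on might look like the following; this is our own sketch, not OptorSim's code, and the class and parameter names are illustrative:

    from collections import OrderedDict

    class LRUFileCache:
        def __init__(self, capacity_gb):
            self.capacity = capacity_gb
            self.used = 0.0
            self.files = OrderedDict()               # file_id -> size, least recent first

        def access(self, file_id, size_gb):
            if file_id in self.files:
                self.files.move_to_end(file_id)      # refresh recency on a hit
                return
            while self.files and self.used + size_gb > self.capacity:
                _, old = self.files.popitem(last=False)   # evict the least recently used file
                self.used -= old
            self.files[file_id] = size_gb
            self.used += size_gb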

We compare our distributed algorithm with LRU and LFU in a general topology, since previous empirical research has found that even though the LRU/LFU strategies are extremely simple, they are very difficult to improve upon. The topology we use is the simulated topology of EU DataGrid Testbed 1 [10], shown in Figure 12; the storage capacity of each site is given in Table IV. Since the simulation results of LRU and LFU are quite similar, we only show the results for LRU. Figures 13 (a) and (b) show the performance comparison of our distributed algorithm and LRU when varying the total number of files and the storage capacity of each site, respectively. Our distributed algorithm outperforms LRU in most cases. However, the performance difference is not as significant as that between the distributed algorithm and Cascading. In particular, under the TS-Dynamic access pattern, when the storage capacity of each site is 400 GB or 500 GB, the average file access time of LRU is even slightly smaller than that of our distributed algorithm.

VII. Conclusions and Future Work

In this article, we study how to replicate data files in data intensive scientific applications in order to reduce the file access time, taking into consideration the limited storage space of Grid sites. Our goal is to effectively reduce the access time of the data files needed for job executions at Grid sites. We propose a centralized greedy algorithm with a performance guarantee and show that it performs comparably with the optimal algorithm. We also propose a distributed algorithm wherein Grid sites react closely to the current Grid status and make intelligent caching decisions. Using GridSim, a distributed Grid simulator, we demonstrate that the distributed replication technique significantly outperforms a popular existing replication technique, and that it is more adaptive to dynamic changes of file access patterns in Data Grids.

We plan to further develop our work in the following directions:

• As ongoing and future work, we are exploiting the synergies between data replication and job scheduling to achieve better system performance. Data replication and job scheduling are two different but complementary



[Figure 13 — two panels plotting Average file access time (seconds) versus (a) total number of files and (b) storage capacity (GB) of each site, under the Random, TS-Static, and TS-Dynamic access patterns.]

Fig. 13. Performance comparison between our distributed algorithm and LRU in the general topology.

functions in Data Grids: one minimizes the total file access cost (and thus the total job execution time over all sites), and the other minimizes the makespan (the maximum job completion time among all sites). Optimizing both objective functions in the same framework is a very difficult (if not infeasible) task. There are two main challenges: first, how to formulate a problem that incorporates not only data replication but also job scheduling, and that addresses both the total access cost and the maximum access cost; and second, how to find an efficient algorithm that, if it cannot find optimal solutions minimizing the total/maximum access cost, gives near-optimal solutions for both objectives. These two challenges remain largely unanswered in the current literature.

• We plan to design and develop data replication strategies in scientific workflow [31] and large-scale cloud computing environments [24]. We will also investigate how provenance information [15], the derivation history of data files, can be exploited to improve the intelligence of data replication decision making.

• A more robust dynamic model and replication algorithm will be developed. Currently, each site observes the data access traffic for a sufficiently long time window, which is set in an ad hoc manner. In the future, a site should dynamically decide such an "observing window" based on the traffic it observes.

REFERENCES

[1] The Large Hadron Collider. http://public.web.cern.ch/Public/en/LHC/LHCen.html.
[2] LCG Computing Fabric. http://lcg-computing-fabric.web.cern.ch/lcgcomputing-fabric/.
[3] Worldwide LHC Computing Grid. http://lcg.web.cern.ch/LCG/.
[4] A. Aazami, S. Ghandeharizadeh, and T. Helmi. Near optimal number of replicas for continuous media in ad-hoc networks of wireless devices. In Proc. of the International Workshop on Multimedia Information Systems, 2004.
[5] B. Allcock, J. Bester, J. Bresnahan, A. L. Chervenak, C. Kesselman, S. Meder, V. Nefedova, D. Quesnel, S. Tuecke, and I. Foster. Secure, efficient data transport and replica management for high-performance data-intensive computing. In Proc. of the IEEE Symposium on Mass Storage Systems and Technologies, 2001.
[6] I. Baev and R. Rajaraman. Approximation algorithms for data placement in arbitrary networks. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (SODA 2001).
[7] I. Baev, R. Rajaraman, and C. Swamy. Approximation algorithms for data placement problems. SIAM J. Comput., 38(4):1411–1429, 2008.
[8] W. H. Bell, D. G. Cameron, R. Carvajal-Schiaffino, A. P. Millar, K. Stockinger, and F. Zini. Evaluation of an economy-based file replication strategy for a data grid. In Proc. of the International Workshop on Agent Based Cluster and Grid Computing (CCGrid 2003).
[9] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions: Evidence and implications. In Proc. of the IEEE International Conference on Computer Communications (INFOCOM 1999).
[10] D. G. Cameron, A. P. Millar, C. Nicholson, R. Carvajal-Schiaffino, K. Stockinger, and F. Zini. Analysis of scheduling and replica optimisation strategies for data grids using OptorSim. Journal of Grid Computing, 2(1):57–69, 2004.
[11] M. Carman, F. Zini, L. Serafini, and K. Stockinger. Towards an economy-based optimization of file access and replication on a data grid. In Proc. of the International Workshop on Agent Based Cluster and Grid Computing (CCGrid 2002).
[12] A. Chakrabarti and S. Sengupta. Scalable and distributed mechanisms for integrated scheduling and replication in data grids. In Proc. of the 10th International Conference on Distributed Computing and Networking (ICDCN 2008).
[13] R.-S. Chang and H.-P. Chang. A dynamic data replication strategy using access-weights in data grids. Journal of Supercomputing, 45:277–295, 2008.
[14] R.-S. Chang, J.-S. Chang, and S.-Y. Lin. Job scheduling and data replication on data grids. Future Generation Computer Systems, 23(7):846–860.
[15] A. Chebotko, X. Fei, C. Lin, S. Lu, and F. Fotouhi. Storing and querying scientific workflow provenance metadata using an RDBMS. In Proc. of the IEEE International Conference on e-Science and Grid Computing (e-Science 2007).
[16] A. Chervenak, E. Deelman, M. Livny, M.-H. Su, R. Schuler, S. Bharathi, G. Mehta, and K. Vahi. Data placement for scientific applications in distributed environments. In Proc. of the IEEE/ACM International Conference on Grid Computing (Grid 2007).
[17] A. Chervenak, R. Schuler, C. Kesselman, S. Koranda, and B. Moe. Wide area data replication for scientific collaboration. In Proc. of the IEEE/ACM International Workshop on Grid Computing (Grid 2005).
[18] A. Chervenak, R. Schuler, M. Ripeanu, M. A. Amer, S. Bharathi, I. Foster, and C. Kesselman. The Globus replica location service: Design and experience. IEEE Transactions on Parallel and Distributed Systems, 20:1260–1272, 2009.
[19] N. N. Dang and S. B. Lim. Combination of replication and scheduling in data grids. International Journal of Computer Science and Network Security, 7(3):304–308, March 2007.
[20] D. Düllmann and B. Segal. Models for replica synchronisation and consistency in a data grid. In Proc. of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC 2001).
[21] J. Rehn et al. PhEDEx: High-throughput data transfer management system. In Proc. of Computing in High Energy and Nuclear Physics (CHEP 2006).
[22] I. Foster. The Grid: A new infrastructure for 21st century science. Physics Today, 55:42–47, 2002.
[23] I. Foster and K. Ranganathan. Decoupling computation and data scheduling in distributed data-intensive applications. In Proc. of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC 2002).
[24] I. Foster, Y. Zhao, I. Raicu, and S. Lu. Cloud computing and grid computing 360-degrees compared. In Proc. of IEEE Grid Computing Environments, in conjunction with IEEE/ACM Supercomputing, pages 1–10, 2008.
[25] C. Intanagonwiwat, R. Govindan, and D. Estrin. Directed diffusion: A scalable and robust communication paradigm for sensor networks. In Proc. of the ACM International Conference on Mobile Computing and Networking (MOBICOM 2000).
[26] J. C. Jacob, D. S. Katz, T. Prince, G. B. Berriman, J. C. Good, A. C. Laity, E. Deelman, G. Singh, and M.-H. Su. The Montage architecture for grid-enabled science processing of large, distributed datasets. In Proc. of the Earth Science Technology Conference, 2004.
[27] S. Jiang and X. Zhang. Efficient distributed disk caching in data grid management. In Proc. of the IEEE International Conference on Cluster Computing (Cluster 2003).
[28] S. Jin and L. Wang. Content and service replication strategies in multi-hop wireless mesh networks. In Proc. of the ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM 2005).
[29] H. Lamehamedi, B. K. Szymanski, and B. Conte. Distributed data management services for dynamic data grids. Unpublished.
[30] M. Lei, S. V. Vrbsky, and X. Hong. An on-line replication strategy to increase availability in data grids. Future Generation Computer Systems, 24:85–98, 2008.
[31] C. Lin, S. Lu, X. Fei, A. Chebotko, D. Pai, Z. Lai, F. Fotouhi, and J. Hua. A reference architecture for scientific workflow management systems and the VIEW SOA solution. IEEE Transactions on Services Computing, 2(1):79–92, 2009.
[32] M. Mineter, C. Jarvis, and S. Dowers. From stand-alone programs towards grid-aware services and components: A case study in agricultural modelling with interpolated climate data. Environmental Modelling and Software, 18(4):379–391, 2003.
[33] S. M. Park, J. H. Kim, Y. B. Lo, and W. S. Yoon. Dynamic data grid replication strategy based on internet hierarchy. In Proc. of the Second International Workshop on Grid and Cooperative Computing (GCC 2003).
[34] J. Pérez, F. García-Carballeira, J. Carretero, A. Calderón, and J. Fernández. Branch replication scheme: A new model for data replication in large scale data grids. Future Generation Computer Systems, 26(1):12–20, 2010.
[35] L. Qiu, V. N. Padmanabhan, and G. M. Voelker. On the placement of web server replicas. In Proc. of the IEEE Conference on Computer Communications (INFOCOM 2001).
[36] I. Raicu, I. Foster, Y. Zhao, P. Little, C. Moretti, A. Chaudhary, and D. Thain. The quest for scalable support of data intensive workloads in distributed systems. In Proc. of the ACM International Symposium on High Performance Distributed Computing (HPDC 2009).
[37] I. Raicu, Y. Zhao, I. Foster, and A. Szalay. Accelerating large-scale data exploration through data diffusion. In Proc. of the International Workshop on Data-aware Distributed Computing (DADC 2008).
[38] A. Ramakrishnan, G. Singh, H. Zhao, E. Deelman, R. Sakellariou, K. Vahi, K. Blackburn, D. Meyers, and M. Samidi. Scheduling data-intensive workflows onto storage-constrained distributed resources. In Proc. of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007).
[39] K. Ranganathan and I. T. Foster. Identifying dynamic replication strategies for a high-performance data grid. In Proc. of the Second International Workshop on Grid Computing (GRID 2001).
[40] A. Rodriguez, D. Sulakhe, E. Marland, N. Nefedova, M. Wilde, and N. Maltsev. Grid enabled server for high-throughput analysis of genomes. In Proc. of the Workshop on Case Studies on Grid Applications, 2004.
[41] F. Schintke and A. Reinefeld. Modeling replica availability in large data grids. Journal of Grid Computing, 2(1):219–227, 2003.
[42] H. Stockinger, A. Samar, K. Holtman, B. Allcock, I. Foster, and B. Tierney. File and object replication in data grids. In Proc. of the 10th IEEE International Symposium on High Performance Distributed Computing (HPDC 2001).
[43] A. Sulistio, U. Cibej, S. Venugopal, B. Robic, and R. Buyya. A toolkit for modelling and simulating data grids: An extension to GridSim. Concurrency and Computation: Practice and Experience, 20(13):1591–1609, 2008.
[44] B. Tang, S. R. Das, and H. Gupta. Benefit-based data caching in ad hoc networks. IEEE Transactions on Mobile Computing, 7(3):289–304, March 2008.
[45] M. Tang, B.-S. Lee, C.-K. Yeo, and X. Tang. Dynamic replication algorithms for the multi-tier data grid. Future Generation Computer Systems, 21:775–790, 2005.
[46] M. Tang, B.-S. Lee, C.-K. Yeo, and X. Tang. The impact of data replication on job scheduling performance in the data grid. Future Generation Computer Systems, 22:254–268, 2006.
[47] U. Čibej, B. Slivnik, and B. Robič. The complexity of static data replication in data grids. Parallel Computing, 31(8-9):900–912, 2005.
[48] S. Venugopal and R. Buyya. An SCP-based heuristic approach for scheduling distributed data-intensive applications on global grids. Journal of Parallel and Distributed Computing, 68:471–487, 2008.
[49] S. Venugopal, R. Buyya, and K. Ramamohanarao. A taxonomy of data grids for distributed data sharing, management, and processing. ACM Computing Surveys, 38(1), 2006.
[50] X. You, G. Chang, X. Chen, C. Tian, and C. Zhu. Utility-based replication strategies in data grids. In Proc. of the Fifth International Conference on Grid and Cooperative Computing (GCC 2006).
[51] G. K. Zipf. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, 1949.



Dharma Teja Nukarapu received the BE degree in Information Technology from Jawaharlal Nehru Technological University, India, in 2007, and the MS degree from the Department of Electrical Engineering and Computer Science, Wichita State University, in 2009. He is currently a PhD student in the Department of Electrical Engineering and Computer Science at Wichita State University. His research interests are data caching, replication, and job scheduling in data intensive scientific applications.

Bin Tang received the BS degree in physics from Peking University, China, in 1997, the MS degrees in materials science and computer science from Stony Brook University in 2000 and 2002, respectively, and the PhD degree in computer science from Stony Brook University in 2007. He is currently an Assistant Professor in the Department of Electrical Engineering and Computer Science at Wichita State University. His research interest is the algorithmic aspects of data intensive sensor networks. He is a member of the IEEE.

Liqiang Wang is currently an Assistant Professor in the Department of Computer Science at the University of Wyoming. He received the BS degree in mathematics from Hebei Normal University, China, in 1995, the MS degree in computer science from Sichuan University, China, in 1998, and the PhD degree in computer science from Stony Brook University in 2006. His research interest is the design and analysis of parallel computing systems. He is a member of the IEEE.

Shiyong Lu received the PhD degree in Computer Science from the State University of New York at Stony Brook in 2002, the ME degree from the Institute of Computing Technology of the Chinese Academy of Sciences, Beijing, in 1996, and the BE degree from the University of Science and Technology of China, Hefei, in 1993. He is currently an Associate Professor in the Department of Computer Science, Wayne State University, and the Director of the Scientific Workflow Research Laboratory (SWR Lab). His research interests include scientific workflows and databases. He has published more than 90 papers in refereed international journals and conference proceedings. He is the founder and currently a program co-chair of the IEEE International Workshop on Scientific Workflows (2007-2010), and an editorial board member of the International Journal of Semantic Web and Information Systems and the International Journal of Healthcare Information Systems and Informatics. He is a Senior Member of the IEEE.
