A Replication Strategy Based on Swarm Intelligence ... - IEEE Xplore

klmp.pku.edu.cn

A Replication Strategy Based on Swarm Intelligence ... - IEEE Xplore

A ong>Replicationong> ong>Strategyong> ong>Basedong> on Swarm Intelligence

in Spatial Data Grid

Ping Zhang, Kunqing Xie * , Xiujun Ma * ,Xiong Li, Yixian Sun

Key Laboratory of Machine Perception (Minister of Education)

Peking University,

Beijing, China

*Corresponding author: kunqing@cis.pku.edu.cn; maxj@cis.pku.edu.cn

Abstract—In the spatial data grid, the distribution of query and

the data is unevenly some resource become hotspot and the

hotspots are changing over time, which may cause the global load

unbalanced, this dynamic problem becomes a key challenge in

Data Grid. Data replication is a way to deal with this problem,

which improves data availability, reduces latency and increases

throughput. In this paper, we present a new replication approach

which is adaptive, completely decentralized, and based on swarm

intelligence which is intrinsically a bottom-up approach. Every

site in the grid system has a single agent, which is serving as

containers for data, following simple rules of behavior and

without knowing any global information. The strategy that agents

follow includes which data to create replica and where the replica

is locating. The local interactions and simple action between

agents give a fairly optimal replication location solution globally.

We carried the experiments using OptorSim for the EU Data

Grid Testbed 1. Experimental results show that our approach

performs better than No replication and when the scale of jobs is

big, our method will outperform the Economic Model, but the

space consumption is proportional.

Keywords-distribute system; spatial data grid; data replication;

swarm intelligence; agent

I. INTRODUCTION

As traditional GIS has many flaws in managing distributed,

large-scale, heterogeneous spatial data, lots of countries build

their own spatial data grid to provide abundant spatial data

services. As one class of Data Grid, there may exist millions of

files and replicas distributed on hundreds of geographically

dispersed storage resources, which are organized to provide an

integrated service [1].

In this system, the distribution of query and the data is

unevenly, some resource become hotspot and the hotspots are

changing over time, which may cause the global load

unbalanced, this dynamic problem becomes a key challenge in

Data Grid [2]. Data replication is a way to deal with this

problem, which improves data availability, reduces latency and

increases throughput. Data replication mainly concerns the

decision of when and where to create physical copies of logical

data fragments and when and how to update them to maintain

data consistency. In this paper, we focus on data replication

location problem because the spatial resource is not updated

frequently.

It is not a trivial matter to design a replication location

service which is reliable and can provide good performance to

all the Grid users. Globus and EU Data Grid set projects for

finding optimal replication location service in dynamic

environment [3]. However, the problem of finding an optimal

replication scheme in Grid system has been shown to be NPhard

for the static case [4]. So it is unlikely to find a

convergent-optimal algorithm in this dynamic environment, all

these replication algorithms are heuristic. On the other side,

most of the distributed applications exclude any form of

centralized structure, requiring control to be completely

decentralized which also make the traditional replication

algorithms hard to deploy.

In order to deal with the decentralization and dynamism, we

present a new replication approach which is adaptive,

completely decentralized, and based on swarm intelligence,

which is intrinsically a bottom-up approach. The local

interactions and simple actions among agents lead to the

emergence of intelligent global behavior, our approach is to

find a local optimal solution, with agents’ local interactions,

and the local solution will spread to the global and give a fairly

optimal replication location solution globally. Our work is as

following:

A replication approach based on swarm intelligence is

proposed. Every site in the grid system has a single agent,

which is serving as manager of file data, following simple rules

of behavior and without knowing any global information. The

agent will follow two strategies when it is going to take a

replication action: a strategy that selecting which data to create

a replica. The agent computes “degree of heat” of every data

file on the site according to the local access information. This

quota helps the agent select which data replica should be

copied and sent to other site, so as to balance the load of local

site and reduce the global responding time of that data. And

another strategy that finding where the replica is locating.

ong>Basedong> on every site’s access information and its’ neighbors’

information, the agent estimates the load which may be

transferred to another site with replicate a data cell, and then

choosing the fittest site to locate the replica.

To evaluate our approach we use a simulation package

called OptorSim [5]. As part of the EU Data Grid project,

OptorSim is a Data Grid simulator designed to allow

experiments and evaluations of various replication optimization

This work was supported by The National High Technology Research and

Development Program of China (863 Key Program) No.2007AA120502

and the National Natural Science Foundation of China under Grant

No.40801046 and No. 60874082


strategies in a Grid environment. We carry the experiments on

a model of the EU Data Grid Testbed 1 sites and their

associated network geometry [3]. The query in the simulation is

submitted with a fixed probability such that some queries are

more popular than others. In order to evaluate our approach, we

contrast the No replication and the Economic Model method

built in OptorSim by using the total job execution time and

storage usage as evaluation indicators.

This paper is organized as follows. Section 2 reviews

related works. Section 3 describes the replica strategy we

designed. Section 4 shows the experiment scenario. And

Section 5 gives experiment result. Section 6 concludes.

II. RELATED WORKS

The replication strategies in Data Grid are classified as

bottom-up approach and top down approach [2]. And an initial

work on dynamic file replication in Data Grid was done by K.

Ranganathan and I. Foster [2], they proposing six replication

strategies: No ong>Replicationong>, Best Client, Cascading, Plain

Caching, Cascading plus Caching and Fast Spread. Among all

these top-down schemes, Fast Spread strategy gives a relative

high performance through various access patterns. However,

these approach is based on a Hierarchical Gird, and they

cannot apply into a P2P Gird directly.

S. Naseera, T.Vivekanandan and Dr. K.V. Madhu Murthy

[6] demonstrated that the adopt replication approach in Data

Grid can reduce the file access cost effectively and improve

the data availability in many applications. And maintaining a

minimum number of replicas in an arbitrary topology is an

NP-complete problem[4]. So it is unlikely to find a

convergent-optimal algorithm in a dynamic environment just

like Data Grid, and the time cost of the regular method is

intolerable.

Rahman et. al. [7] proposed a dynamic replica maintenance

algorithm in a static replica placement environment which reallocates

replicas to new sites if performance reduces over last

k time periods. Xin Sun, Kan Li, Yushu Liu [8] present the

bidirectional linked list based replica location service (BLL-

RLS) on tree-based hierarchical unstructured overlay networks.

William H. Bell and David G. Cameron give a summary in [9]

that the replication algorithms proposed now are based on the

assumption that popular files in one site are also popular in

other sites. The replication action is triggered when the

popularity of a file overcomes a threshold.

To study the replication and management service in Data

Grid, various Grid simulation projects have been undertaken

in recent years, such as GridSim [10], and OptorSim [5].

David G. Cameron , Rub´ en Carvajal-Schiaffino have done

a work about OptorSim[11],they give a conclusion about the

main advantage of OptorSim with respect to the previous

simulators. And in [9] William H.Bell and David G.Cameron

detail the design and implementation of OptorSim and analyze

various replication algorithms based on different Grid

workloads.

III. REPLICATION APPROACH BASED ON SWARM

INTELLIGENCE

To solve the replication problem, we propose a replication

approach based on swarm intelligence. Inspired by natural bio

systems such as ant colonies and bird flocking, a swarm

Intelligence system is typically made up of a huge number of

simple agents interacting locally with one another and with

their environment. Although the agents follow very simple

rules, and there is no centralized control dictating how a single

agent should behave, local interactions among them lead to the

emergence of "intelligent" global behavior.

Agents have the property of self-governed, adaptive of

environmental change, and coordination among themselves,

which make them very suitable of cooperating complex spatial

task in a dynamic and complex distributed system. In the

spatial data grid we studied, every site has a single agent,

which is serving as manager of data at local site. The agent’s

behavior is simple and it has nothing global information. ong>Basedong>

on sensation of environment and local interactions among them,

the local solution will spread to the global and give a fairly

optimal replication location solution globally.

The behaviors of an agent at each site are just to select the

right data to create replica at the right time and send the replica

to the right place. As mentioned above, the two actions of an

agent should lead a local solution. So we propose two strategies

for these two issues.

A. A strategy that selecting which data to create replica

The first step of a replication strategy is choosing which

data to create replica, in our method, the decision is made by

the agent on a site, so we define a quota Di represents the

“degree of heat” of data file i of this site. The larger Di means

the file i has much more requests. This quota helps agent on

local site select which data replica should be copied and sent to

other site, so as to balance the load of local site and reduce the

global responding time of that data.

The degree of heat Di is defined as:



D lnt

1


The Di rises and falls with requests, and tj is the time since

the jth request of this data file. This equation is based on the

rational analysis of Anderson and Schooler [12]. That is to say,

the heat of degree of a file not only considers the recent

requests, but also considers the history access log. And 0.5 has

emerged as the default value for the time-based decay

parameter γ.

B. A strategy that finding where the replica is locating

Before an agent create a replica and chooses where the

replica is locating, an agent must know when to take these

actions is appropriate. So we define a parameter Li represents

the threshold of the CPU load that every site i can bear. Li is in

the interval [0,1]. The Li is set at 0.8 in our experiment, which

means if an agent has found the CPU load exceeded 80%, it

will choose “the hottest” file to make a replica.


As a local solution, the decision of where to place replica is

made based on local file access information and neighbor list.

The algorithm is like this:

1 Choose the hottest file h and get the access list of file

alist(h),as well as neighbor list of local agent l,nlist(l);

2 j=0;

3 While (j


• Storage Resources Usage: The percentage of total

storage resources used during the simulation in Grid

sites can also be a valuable source of information [5].

And in order to evaluate our approach better, we contrast

two replication strategies built in OptorSim:

• No ong>Replicationong>: This algorithm never replicates a file.

The distribution of initial file replicas is decided at the

beginning of the simulation and does not change

during its execution [5].

• Economic Model based on Zipf-like distribution: In

this algorithm, data files are “sold” by SEs to either

CEs or other SEs. A file is purchased by Computing

Elements (CE) for running a job and by Storage

Elements (SE) to make an investment. The purchasing

action of a SE means the SE gets a replica of the file.

CEs try to minimize the file purchase cost, while SEs

attempt to maximize their profits [5].

• Our simulations are run varying the number of jobs

from 100 to 1000 and to 10000.

• The Total Job Time of three strategies for increasing

number of jobs can be seen in Figure 2 and Figure 3. In

Figure 2, with no optimization, the jobs take so much

longer than the other two optimization algorithms as all

the files requested by jobs have to be transferred from

CERN. The single resource site leads a huge time on

queue. And when the scale of jobs is small, the time

difference is not obvious, but when the number of jobs

comes to 10000, the result shows a significant

difference.

Total Job Time (s)

140000

120000

100000

80000

60000

40000

20000

0

100 1000 10000

Number of Jobs

Economic Model

Swarm

Intelligence

No ong>Replicationong>

Figure 2. Total Job Time for increasing number of jobs

To further explain our method, in Figure 3, we focus on the

Economic Model and our method. When number of jobs is

small, we find the swarm intelligence strategy costs a little

more time than Economic Model, this is because the scale of

job is small and the CERN can endure the access pressure for a

longer time, which leads agents to take a replica action late.

That is to say, when the agents start to replica a file and spread

it along the Grid, the total jobs will have been done in a short

time. However, with the expansion of the jobs scale, our

method gets a better and better performance than the Economic

Model. When the number of jobs is 10000, the total job time is

about 10% faster using our method than Economic Model. That

is mainly because the speed of spreading replica over the Grid

of our method is faster than Economic Model. The more

replicas bring a shorter time. Contrast Figure 3 to Figure 4, we

can see, when we make a full use of storage resource, we will

get a faster job execution speed, that is trading space for speed.

Total Job Time (s)

12000

10000

8000

6000

4000

2000

0

100 1000 10000

Number of Jobs

Economic

Model

Swarm

Intelligence

Figure 3. Total Job Time for increasing number of jobs (Compare Economic

Model and Swarm Intelligence ong>Replicationong> Strategies)

Storage Resource Usage (%)

100

90

80

70

60

50

40

30

20

10

0

100 1000 10000

Number of Jobs

Economic

Model

Swarm

Intelligence

No ong>Replicationong>

Figure 4. Storage Resources Usage for increasing number of jobs

VI. CONCLUSION

In this paper, we propose a replication strategy based on

swarm intelligence in Spatial Data Grid. Every site in the grid

system has a single agent, which is serving as manager of data,

following simple rules of behavior and without knowing any

global information. The local interactions among them lead to

the emergence of "intelligent" global behavior. And we design

two strategies that an agent will use when taking a replication

action: A strategy for selecting which data should be copied.

The degree of heat of every data file helps agent on local site

select which data replica should be copied and sent to other

site. A strategy for finding where the replica should be located.

ong>Basedong> on every site’s access information and its’ neighbors’

information, the agent on local site estimates the load which

may be transferred to another site with replicate a data cell.

And then the agent chooses the fittest site to locate the replica.

In order to evaluate our approach, we contrast the No

replication and the Economic Model method by using the total

job execution time and storage usage as evaluation indicators.

Experimental results show that our approach performs better

than No replication and when the scale of jobs is big, our

method will outperform the Economic Model, but the space

consumption is proportional.


In future, we plan to improve the strategies that agents used,

and make our approach performs better and at the same time

makes a more economic usage of the Grid storage.

REFERENCES

[1] Spatial Data Repositories / GRID Systems

http://isferea.jrc.ec.europa.eu/ACTIVITIES/TECHNOLOGIES/Pages/Sp

atialDataRepositoriesGRIDSystems.aspx

[2] Kavitha R., and Foster I., “Identifying Dynamic ong>Replicationong> Strategies

for a HighPerformance Data Grid”, in Proceedings of the Second

International Workshop on Grid Computing, 2001, pp 75-86.

[3] The Data Grid Project. http://www.eu-datagrid.org.

[4] Garey, M. and Johnson, D. Computers and Intractabdity, 1979.

[5] OptorSim-A Replica Optimiser Simulation. http://griddatamanagement.web.cern.ch/grid-datamanagement/optimisation/optor/.

[6] S. Naseera, T.Vivekanandan and Dr. K.V. Madhu Murthy, “Trust ong>Basedong>

Data ong>Replicationong> ong>Strategyong> in a Data Grid”, In proceedings of 2nd

International conference on Information Processing, ICIP,Bangalore,

June 2008, pp. 467-473

[7] R.M. Rahman, K. Baker and R. Alhajj, “Replica Placement Design with

Static Optimality and Dynamic Maintainability”, Proceedings of the

IEEE/ACM International Conference on Cluster Computing and Grid

(CCGRID 06), Singapore, May 2006.

[8] Xin Sun, Kan Li, Yushu Liu , “An Efficient Replica Location Method in

Hierarchical P2P Networks ,” in Eigth IEEE/ACIS International

Conference on Computer and Information Science (ICCI’09), 2009

[9] William H. Bell, David G. Cameron, Ruben Carvajal-Schiaffino,A. Paul

Millar, Kurt Stockinger, Floriano Zini, “Evaluation of an Economy-

ong>Basedong> File ong>Replicationong> ong>Strategyong> for a Data Grid,” in Proceedings of the

3rd IEEE/ACM International Symposium on Cluster Computing and the

Grid (CCGrid’03), 2003

[10] P. Crosby. EDGSim. http://www.hep.ucl.ac.uk/˜pac/EDGSim/.

[11] David G. Cameron, Rub´ en Carvajal-Schiaffino, A. Paul Millar,

Caitriana Nicholson, Kurt Stockinger, Floriano Zini, “Evaluating

Scheduling and Replica Optimisation Strategies in OptorSim,” In 4th

International Workshop on Grid Computing (Grid’03), 2003

[12] Anderson,J.R.,&Schooler,L.J.(1991).Reflections of the environment in

memory. Psychological Science, 2, 396–408.

[13] Summary of CDF5858 for Metadata Group Steven Hanlon, May 2004,

http://www.gridpp.ac.uk/datamanagement/metadata/SubGroups/UseCas

es/summaries/cdf5858_summ.html

[14] Qaisar Rasool, Jianzhong Li, George S. Oreku, Shuo Zhang, Donghua

Yang, “A Load Balancing Replica Placement ong>Strategyong> in Data Grid,” in

Third International Conference on Digital Information Management

(ICDIM’ 08), London,United Kingdom, 2008

[15] Naseera, S. and Murthy, K.V.M., “Agent ong>Basedong> Replica Placement in a

Data Grid Environement,” in First International Conference on

Computational Intelligence, Communication Systems and Networks,

(CICSYN '09), 2009.

[16] Kavitha R., and Foster I., “Design and Evaluation of ong>Replicationong>

Strategies for a High Performance Data Grid”, Proceedings of

Computing and High Energy and Nuclear Physics, 2001.

[17] M. Young, The Technical Writer's Handbook. Mill Valley, CA:

University Science, 1989.

More magazines by this user
Similar magazines