20.01.2014 Views

Workshop proceeding - final.pdf - Faculty of Information and ...

First, a good replication strategy can guarantee fast data access for the cloud workflow system. In scientific workflows, many parallel tasks will simultaneously access the same dataset in one data centre. The limited computing capacity and bandwidth of that data centre would then become a bottleneck for the whole cloud workflow system. If we have several replicas in different data centres, this bottleneck can be avoided.

Second, a good replication strategy can reduce data movement between data centres. For example, if tasks in one data centre always need to retrieve data from the same dataset in a remote data centre, it is better to replicate that dataset in the local data centre to reduce the data movement.
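
The trade-off in this second point can be made concrete with a small calculation. The sketch below is illustrative only: the function name, the cost model (every remote read transfers the whole dataset, and creating a local replica costs one full transfer) and the numbers in the usage note are assumptions, not values from this paper.

```python
def data_moved_gb(dataset_size_gb: float, remote_reads: int,
                  replicate_locally: bool) -> float:
    """Return total traffic between data centres in GB.

    Hypothetical cost model (an assumption for illustration):
    - without a local replica, each read transfers the whole dataset;
    - with a local replica, the dataset is transferred once, and all
      later reads are served locally.
    """
    if dataset_size_gb < 0 or remote_reads < 0:
        raise ValueError("size and read count must be non-negative")
    if remote_reads == 0:
        return 0.0  # nothing is read, so nothing needs to move
    if replicate_locally:
        return dataset_size_gb  # one-off copy to the local data centre
    return dataset_size_gb * remote_reads
```

Under these assumptions, a 10 GB dataset read five times from a remote data centre moves five times the volume that a single local replica would; the benefit grows with the number of reads and disappears if the dataset is read only once.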

Third, a good replication strategy can guarantee data reliability for the cloud workflow system. Because data centres in cloud workflow systems are built from large amounts of cheap commodity hardware, some of that hardware could break down at any time. It is essential to keep several copies of each dataset in different data centres for reliability.

However, at present, the data replication strategies utilised in cloud data management systems are usually static. For example, in Hadoop, users can manually set the number of replicas, and the system will automatically replicate the application data in different places (racks or clusters, depending on the scale of the system). Static replication can guarantee data reliability, but in a cloud environment different application data have different usage rates, so we should have a dynamic strategy that replicates the application data based on their usage rates.

2) Methodology

The basic strategy for the replication could be as follows:

a) Always keep a fixed number of copies of each dataset in different data centres to guarantee reliability, and dynamically add new replicas for each dataset to guarantee data availability.

b) Where to place the replicas is based on data dependency.

c) How many replicas a dataset should have is based on the usage rate of this dataset.
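
The three rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Dataset`, `replica_count` and `place_replicas`, the constants `MIN_REPLICAS` and `ACCESSES_PER_REPLICA`, and the representation of data dependency as a per-data-centre score are all hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

MIN_REPLICAS = 3              # assumed fixed number of copies (rule a)
ACCESSES_PER_REPLICA = 100.0  # assumed load that one replica can serve


@dataclass
class Dataset:
    name: str
    usage_rate: float  # accesses per unit time (assumed metric)
    # Assumed dependency score: data centre name -> how strongly the
    # tasks and datasets located there depend on this dataset.
    dependency: Dict[str, int] = field(default_factory=dict)


def replica_count(ds: Dataset, n_centres: int) -> int:
    """Rules a and c: never fewer than MIN_REPLICAS, more when usage is high.

    The count is capped at the number of data centres, because each
    replica must live in a different data centre.
    """
    by_usage = math.ceil(ds.usage_rate / ACCESSES_PER_REPLICA)
    return min(max(MIN_REPLICAS, by_usage), n_centres)


def place_replicas(ds: Dataset, centres: List[str]) -> List[str]:
    """Rule b: prefer the data centres with the highest dependency score.

    Assumes `centres` contains distinct names; ties are broken by name
    so the result is deterministic.
    """
    ranked = sorted(centres, key=lambda c: (-ds.dependency.get(c, 0), c))
    return ranked[:replica_count(ds, len(centres))]
```

With six data centres, a rarely used dataset keeps only the fixed three copies, while a dataset with a usage rate of 450 receives five replicas, placed first in the data centres whose tasks depend on it most. A real system would also have to decide when to remove replicas as usage drops, which this sketch does not attempt.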

References:

[1] "Hadoop, http://hadoop.apache.org/", accessed on 25 November 2009.<br />

[2] I. Adams, D. D. E. Long, E. L. Miller, S. Pasupathy, and M. W. Storer, "Maximizing Efficiency By Trading Storage for Computation," in Workshop on Hot Topics in Cloud Computing (HotCloud'09), pp. 1-5, 2009.

[3] I. Altintas, O. Barney, and E. Jaeger-Frank, "Provenance Collection Support in the Kepler Scientific Workflow System," in International Provenance and Annotation Workshop, pp. 118-132, 2006.

[4] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing," University of California at Berkeley, http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf, Technical Report UCB/EECS-2009-28, accessed on 25 November 2009.

[5] M. D. d. Assuncao, A. d. Costanzo, and R. Buyya, "Evaluating the cost-benefit of using cloud computing to extend the capacity of clusters," in 18th ACM International Symposium on High Performance Distributed Computing, Garching, Germany, pp. 1-10, 2009.

[6] Z. Bao, S. Cohen-Boulakia, S. B. Davidson, A. Eyal, and S. Khanna, "Differencing Provenance in Scientific Workflows," in 25th IEEE International Conference on Data Engineering, ICDE '09, pp. 808-819, 2009.

[7] R. Barga and D. Gannon, "Scientific versus Business Workflows," in Workflows for e-Science, pp. 9-16, 2007.

[8] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," Future Generation Computer Systems, vol. in press, pp. 1-18, 2009.

[9] E. Deelman and A. Chervenak, "Data Management Challenges of Data-Intensive Scientific Workflows," in IEEE International Symposium on Cluster Computing and the Grid, pp. 687-692, 2008.

