Workshop proceeding - final.pdf - Faculty of Information and ...
First, a good replication strategy can guarantee fast data access for the cloud workflow system. In scientific workflows, many parallel tasks may simultaneously access the same dataset in one data centre. The limited computing capacity and bandwidth of that data centre would then become a bottleneck for the whole cloud workflow system. If several replicas are kept in different data centres, this bottleneck is eliminated.
Second, a good replication strategy can reduce data movement between data centres. For example, if tasks in one data centre repeatedly retrieve the same dataset from a remote data centre, it is better to replicate that dataset in the local data centre to reduce the data movement.
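One way to make this trade-off concrete is to replicate a remote dataset locally once the cumulative transfer cost would exceed the cost of storing a local copy. The sketch below is illustrative only; the cost parameters and the function name are assumptions, not part of the original text:

```python
def should_replicate_locally(expected_accesses: int,
                             dataset_size_gb: float,
                             transfer_cost_per_gb: float,
                             storage_cost_per_gb: float) -> bool:
    """Replicate when repeated remote transfers would cost more than
    storing one local copy (an illustrative cost model, not from the text)."""
    total_transfer_cost = expected_accesses * dataset_size_gb * transfer_cost_per_gb
    local_storage_cost = dataset_size_gb * storage_cost_per_gb
    return total_transfer_cost > local_storage_cost

# A dataset read 50 times justifies a local replica; one read once does not.
print(should_replicate_locally(50, 10.0, 0.1, 0.5))  # → True
print(should_replicate_locally(1, 10.0, 0.1, 0.5))   # → False
```

Real systems would also weigh bandwidth contention and replica-consistency overheads, but the inequality captures the break-even intuition.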
Third, a good replication strategy can guarantee data reliability for the cloud workflow system. Because data centres in cloud workflow systems are built from large quantities of cheap commodity hardware, hardware failures can happen at any time. It is therefore essential to keep several copies of each dataset in different data centres for reliability.
However, at present, the data replication strategies used in cloud data management systems are usually static. For example, in Hadoop, users can manually set the number of replicas, and the system automatically replicates the application data in different places (racks or clusters, depending on the scale of the system). Static replication can guarantee data reliability, but in a cloud environment different application data have different usage rates, so a dynamic strategy is needed that replicates application data based on their usage rates.
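To contrast with a static replica count, a dynamic strategy could derive the replica count of each dataset from its usage rate. The following is a minimal sketch; the minimum of three replicas, the cap, and the scaling threshold are assumed parameters rather than values from the text:

```python
MIN_REPLICAS = 3            # fixed minimum kept for reliability (assumed value)
MAX_REPLICAS = 10           # cap to bound storage cost (assumed value)
ACCESSES_PER_REPLICA = 100  # one extra replica per this many daily accesses (assumed)

def dynamic_replica_count(accesses_per_day: int) -> int:
    """Scale the number of replicas with the dataset's usage rate,
    never dropping below the reliability minimum."""
    extra = accesses_per_day // ACCESSES_PER_REPLICA
    return min(MAX_REPLICAS, MIN_REPLICAS + extra)

print(dynamic_replica_count(0))    # rarely used dataset  → 3
print(dynamic_replica_count(450))  # heavily used dataset → 7
```

The fixed minimum preserves the reliability guarantee of static replication, while the usage-dependent term adds availability for hot datasets.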
2) Methodology<br />
The basic replication strategy could be as follows:
a) Always keep a fixed number of copies of each dataset in different data centres to guarantee reliability, and dynamically add new replicas of each dataset to guarantee data availability.
b) Where to place the replicas is decided by data dependency.
c) How many replicas a dataset should have is decided by the usage rate of that dataset.
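The placement rule above can be sketched as well. Reading "placement based on data dependency" as ranking data centres by how often their tasks access the dataset is one plausible interpretation; the heuristic and the data-centre names below are assumptions:

```python
def place_replicas(access_counts: dict[str, int], num_replicas: int) -> list[str]:
    """Place replicas in the data centres whose tasks access the
    dataset most often (a simple dependency-based heuristic)."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    return ranked[:num_replicas]

# Hypothetical per-data-centre access counts for one dataset.
accesses = {"dc_eu": 120, "dc_us": 40, "dc_asia": 300, "dc_au": 5}
print(place_replicas(accesses, 3))  # → ['dc_asia', 'dc_eu', 'dc_us']
```

In practice the ranking could also account for inter-centre bandwidth and free storage, but access frequency alone already keeps data close to the tasks that depend on it.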
References:
[1] Hadoop, http://hadoop.apache.org/, accessed on 25 November 2009.
[2] I. Adams, D. D. E. Long, E. L. Miller, S. Pasupathy, and M. W. Storer, "Maximizing Efficiency By Trading Storage for Computation," in Workshop on Hot Topics in Cloud Computing (HotCloud'09), pp. 1-5, 2009.
[3] I. Altintas, O. Barney, and E. Jaeger-Frank, "Provenance Collection Support in the Kepler Scientific Workflow System," in International Provenance and Annotation Workshop, pp. 118-132, 2006.
[4] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing," University of California at Berkeley, Technical Report UCB/EECS-2009-28, http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.pdf, accessed on 25 November 2009.
[5] M. D. d. Assuncao, A. d. Costanzo, and R. Buyya, "Evaluating the cost-benefit of using cloud computing to extend the capacity of clusters," in 18th ACM International Symposium on High Performance Distributed Computing, Garching, Germany, pp. 1-10, 2009.
[6] Z. Bao, S. Cohen-Boulakia, S. B. Davidson, A. Eyal, and S. Khanna, "Differencing Provenance in Scientific Workflows," in 25th IEEE International Conference on Data Engineering (ICDE '09), pp. 808-819, 2009.
[7] R. Barga and D. Gannon, "Scientific versus Business Workflows," in Workflows for e-Science, pp. 9-16, 2007.
[8] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility," Future Generation Computer Systems, vol. in press, pp. 1-18, 2009.
[9] E. Deelman and A. Chervenak, "Data Management Challenges of Data-Intensive Scientific Workflows," in IEEE International Symposium on Cluster Computing and the Grid, pp. 687-692, 2008.