
workflows. By comparing the generation cost and the storage cost, our strategy can automatically decide whether an intermediate dataset should be stored or deleted in the cloud system, regardless of whether that intermediate dataset is a new, regenerated or stored dataset in the system.

Data provenance is important to this strategy in building the IDG. Fortunately, due to the importance of data provenance in scientific applications, much research has been done on recording the data provenance of a system, some of it specifically for scientific workflow systems [6]. Some popular scientific workflow systems, such as Kepler [16], have their own mechanisms to record provenance during workflow execution [3]. In [19], Osterweil et al. present how to generate a Data Derivation Graph (DDG) for the execution of a scientific workflow, where one DDG records the data provenance of one execution. Similar to the DDG, our IDG is also based on scientific workflow data provenance, but it depicts the dependency relationships of all the intermediate data in the system. Hence we can build our IDG by taking advantage of these related works.
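To make the store-or-delete decision concrete, the following is a minimal sketch of the kind of cost comparison described above. The cost model is an assumption for illustration only: generation cost is approximated by compute price times regeneration time times expected reuse, and storage cost by storage price times size times retention time. All names (Dataset, should_store, the price constants) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    size_gb: float              # size of the intermediate dataset
    regen_time_hours: float     # time to regenerate it from its ancestors in the IDG
    expected_accesses: int      # how often later tasks are expected to reuse it

# Assumed cloud prices for illustration; real systems would query the provider.
COMPUTE_PRICE_PER_HOUR = 0.10
STORAGE_PRICE_PER_GB_MONTH = 0.025

def should_store(ds: Dataset, retention_months: float) -> bool:
    """Store the dataset only if keeping it is cheaper than regenerating it on demand."""
    generation_cost = ds.regen_time_hours * COMPUTE_PRICE_PER_HOUR * ds.expected_accesses
    storage_cost = ds.size_gb * STORAGE_PRICE_PER_GB_MONTH * retention_months
    return storage_cost < generation_cost
```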

B. Data placement strategy

1) Problem Analysis<br />

Scientific applications are data intensive and usually require collaboration among scientists from different institutions [7]; hence the application data in scientific workflows are usually distributed and very large. When one task needs to process data from different data centres, moving the data becomes a challenge. Some application data are too large to be moved efficiently, some may have fixed locations from which it is not feasible to move them, and some may have to be located at fixed data centres for processing, but these are only one aspect of the challenge. Even the application data that can be moved cannot be moved whenever and wherever we want, since in a cloud computing platform the data centres may belong to different cloud service providers, so that data movement incurs costs. Furthermore, the infrastructure of cloud computing systems is hidden from their users; the systems simply offer the computation and storage resources required by users for their applications. Users do not know the exact physical locations where their data are stored. This kind of model is very convenient for users, but it remains a big challenge for data management in scientific cloud workflow systems.

2) Methodology<br />

In cloud computing systems, the infrastructure is hidden from users. Hence, for most of the application data, the system decides where to store them. Dependencies exist among these data. In this paper, we adapt clustering algorithms for data movement based on data dependency. Clustering algorithms have been used in pattern recognition since the 1980s to classify patterns into groups without supervision. Today they are widely used to process data streams. In many scientific workflow applications, intermediate data movement is in data stream format and newly generated data must be moved to their destination in real time. We adapt the k-means clustering algorithm for data placement in scientific cloud workflow systems. Scientific workflows can be very complex: one task might require many datasets for execution, and one dataset might also be required by many tasks. If some datasets are always used together by many tasks, we say that these datasets are dependent on each other. In our strategy, we try to keep these datasets in one data centre, so that when tasks are scheduled to this data centre, most, if not all, of the data they need are stored locally.
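As an illustration of the dependency measure implied above, the sketch below counts two datasets as dependent whenever the same task uses both of them. The counting scheme and all names are assumptions made for this example, not the paper's exact definition.

```python
from itertools import combinations
from collections import defaultdict

def dependency_counts(tasks):
    """tasks: iterable of sets of dataset ids used by each task.
    Returns pairwise dependency counts keyed by sorted (dataset, dataset) tuples."""
    dep = defaultdict(int)
    for datasets in tasks:
        for d1, d2 in combinations(sorted(datasets), 2):
            dep[(d1, d2)] += 1   # one more task that needs both d1 and d2
    return dep

# Example: d1 and d2 are used together by two tasks, so they should
# preferably be placed in the same data centre.
tasks = [{"d1", "d2"}, {"d1", "d2", "d3"}, {"d3"}]
print(dict(dependency_counts(tasks)))
# {('d1', 'd2'): 2, ('d1', 'd3'): 1, ('d2', 'd3'): 1}
```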

Our data placement strategy has two algorithms, one for the build-time stage and one for the runtime stage of scientific workflows. In the build-time stage algorithm, we construct a dependency matrix for all the application data, which represents the dependencies between all the datasets, including the datasets that may have fixed locations. Then we use the BEA algorithm [17] to cluster the matrix and partition it so that the datasets in every partition are highly dependent on each other. We distribute the partitions into k data centres, where the partitions containing fixed-location datasets are placed in the corresponding data centres. These k data centres serve as the initial partitions of the k-means algorithm at the runtime stage. At runtime, our clustering algorithm deals with newly generated data that will be needed by other tasks. For every newly generated dataset, we calculate its dependency with each of the k data centres and move the data to the data centre with which it has the highest dependency. By placing data according to their dependencies, our strategy attempts to minimise the total data movement during the execution of workflows. Furthermore, by pre-allocating data to other data centres, our strategy prevents data from gathering at a single data centre and reduces the time spent waiting for data by ensuring that relevant data are stored locally.
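The following is a hedged sketch of the runtime placement step just described: a newly generated dataset is sent to the data centre whose resident datasets it depends on most. It reuses the pairwise dependency counts sketched earlier; the function and parameter names are illustrative assumptions, not the paper's implementation.

```python
def place_new_dataset(new_ds, centre_contents, dep):
    """new_ds: id of the newly generated dataset.
    centre_contents: {centre_id: set of dataset ids already stored there}.
    dep: pairwise dependency counts keyed by sorted (dataset, dataset) tuples."""
    def dependency_with_centre(centre_datasets):
        # Aggregate dependency of the new dataset with everything already in the centre.
        return sum(dep.get(tuple(sorted((new_ds, d))), 0) for d in centre_datasets)

    # Move the new dataset to the centre with the highest aggregate dependency.
    best_centre = max(centre_contents,
                      key=lambda c: dependency_with_centre(centre_contents[c]))
    centre_contents[best_centre].add(new_ds)
    return best_centre
```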

C. Data replication strategy

1) Problem Analysis<br />

Data replication is also an important issue for cloud workflow systems, as shown by the two points below:
