Workshop proceeding - final.pdf - Faculty of Information and ...
Data Management in Cloud Scientific Workflow Systems<br />
Dong Yuan 1 *<br />
* Presenter<br />
1. Faculty of Information and Communication Technologies, Swinburne University of<br />
Technology, Hawthorn, Melbourne, Australia 3122<br />
Data-intensive scientific applications are posing many challenges in distributed<br />
computing systems. In the scientific field, application data are expected to double every<br />
year over the next decade and beyond. With this continuing data explosion, high performance<br />
computing systems are needed to store and process data efficiently, and workflow<br />
technologies are used to automate these scientific applications. Scientific workflows are<br />
typically very complex. They usually have a large number of tasks and need a long time for<br />
execution. Running scientific workflow applications usually needs not only high performance<br />
computing resources but also massive storage. The emergence of cloud computing<br />
technologies offers a new way to develop scientific workflow systems. Scientists can upload<br />
their data and launch their applications on scientific cloud workflow systems from<br />
anywhere in the world via the Internet, and they only need to pay for the resources that they<br />
use for their applications. As all the data are managed in the cloud, it is easy to share data<br />
among scientists. This kind of model is very convenient for users, but it poses a big challenge<br />
for the system. This seminar will discuss several research topics in data management for<br />
scientific cloud workflow systems.<br />
1. Introduction<br />
Data-intensive scientific applications are posing many challenges in distributed computing systems.<br />
In many scientific research fields, such as astronomy, high-energy physics and bioinformatics, scientists<br />
need to analyse terabytes of data, either from existing data resources or collected from physical<br />
devices. During these processes, similar amounts of new data might also be generated as intermediate<br />
or final products [9]. According to [20], application data in the scientific field are expected to<br />
double every year over the next decade and beyond. With this continuing data explosion, high<br />
performance computing systems are needed to store and process data efficiently, and workflow<br />
technologies are used to automate these scientific applications. Scientific workflows are typically<br />
very complex. They usually have a large number of tasks and need a long time for execution. Running<br />
scientific workflow applications usually needs not only high performance computing resources but also<br />
massive storage [9].<br />
Nowadays, popular scientific workflows are often deployed in grid systems [16], which offer<br />
high performance and massive storage. However, building a grid system is extremely expensive, and such a<br />
system is normally not open to scientists all over the world. The emergence of cloud computing technologies<br />
offers a new way to develop scientific workflow systems.<br />
Since the concept of cloud computing was proposed in late 2007 [22], it has been utilised in many<br />
areas with some success. Cloud computing is regarded as the next generation of IT platforms that can<br />
deliver computing as a kind of utility [8]. Foster et al. made a comprehensive comparison of grid<br />
computing and cloud computing [11]. Some features of cloud computing also meet the requirements<br />
of scientific workflow systems. Cloud computing systems provide the high performance and massive<br />
storage required for scientific applications in the same way as grid systems, but with a lower<br />
infrastructure construction cost, among many other features, because cloud computing systems are<br />
composed of data centres, which can be clusters of commodity hardware [22]. Research into running<br />
scientific and data-intensive applications on the cloud has already commenced [18], with early<br />
experiences such as the Nimbus [14] and Cumulus [21] projects. The work by Deelman et al. [10] shows that<br />