
Applications must be designed to dynamically adjust to compute nodes entering and leaving the resource pool. They must be capable of dealing with failures and rescheduling work. In traditional clusters, batch systems are routinely used to manage workflows. Batch systems such as Torque, Sun GridEngine, and Condor can be and have been deployed in virtualized cloud environments [71, 49]. However, these deployments typically require system administration expertise with batch systems and an understanding of how to best configure them for the cloud environment. Grid tools can also play a role, but they require the user to understand and manage certificates and deploy tools like Globus. To lower the entry barrier for scientific users, Magellan personnel have developed and deployed Torque and Globus-based system images.
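
As a concrete illustration of the batch-oriented approach, the short Python sketch below submits a set of independent jobs to a Torque resource manager by shelling out to qsub. The job script contents, resource requests, and the process_sample executable are illustrative assumptions rather than the actual Magellan configuration.

```python
"""Minimal sketch: submitting independent tasks to a Torque batch system
running in a virtualized cloud environment by calling qsub."""
import subprocess

def submit_task(task_id, input_file):
    # Build a small PBS job script on the fly; the resource requests are assumptions.
    job_script = (
        f"#PBS -N task_{task_id}\n"
        "#PBS -l nodes=1:ppn=1\n"
        "#PBS -l walltime=01:00:00\n"
        "cd $PBS_O_WORKDIR\n"
        f"./process_sample {input_file} > task_{task_id}.out\n"
    )
    # With no script argument, qsub reads the job script from standard input
    # and prints the identifier of the queued job.
    result = subprocess.run(["qsub"], input=job_script,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    # Submit ten independent tasks; the input file names are placeholders.
    job_ids = [submit_task(i, f"inputs/sample_{i}.fastq") for i in range(10)]
    print("Submitted jobs:", job_ids)
```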

Each of our applications has used different mechanisms to distribute work. MG-RAST and the IMG task farmer are examples of locally built tools that handle this problem. Hadoop might be run on top of virtual machines for this purpose; however, it then suffers from a lack of knowledge of data locality. STAR jobs are embarrassingly parallel (i.e., non-MPI) applications, where each job fits on one core and uses custom scripts to handle workflow and data management.
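
The sketch below illustrates the general task-farmer pattern described above using a standard Python process pool; it is not the MG-RAST or IMG implementation, and the analyze_sequence worker command and file layout are assumptions.

```python
"""Illustrative task farmer for embarrassingly parallel, single-core jobs."""
from concurrent.futures import ProcessPoolExecutor, as_completed
import subprocess

def run_task(input_path):
    # Each task is a self-contained, single-core, non-MPI job.
    out_path = input_path + ".out"
    subprocess.run(["./analyze_sequence", input_path, out_path], check=True)
    return out_path

def farm(inputs, workers=8):
    """Run independent tasks and report which ones need rescheduling."""
    results, failed = [], []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_task, path): path for path in inputs}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception:
                # Tasks are independent, so a failure only requires
                # rescheduling that one input on another worker.
                failed.append(futures[fut])
    return results, failed
```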

A majority of our Magellan domain science users, despite previous experience with clouds and virtualization, ranked their experience with workflow management in virtualized cloud environments as medium to difficult.

11.4.6 Data Management

The last significant challenge is managing the data for the workload. This includes both input and output data. For the bioinformatic workloads, the input data includes both the new sequence data and the reference data. Some consideration has to be given to where this data will be stored and read from, how it will be transported, and how this will scale with many worker nodes. In a traditional batch cluster, users would typically have access to a cluster-wide file system. However, EC2-style cloud systems offer a different set of building blocks: volatile local storage, persistent block storage associated with a single instance (EBS), and a scalable put/get storage system (S3). These are different from the traditional cluster and HPC offerings that users are accustomed to, and hence our users made limited use of these storage systems. The virtual machine image can also be used to store static data. Each of these options has different performance characteristics that depend on the application. Thus, the choice of storage components depends on the volume of data and the access patterns (and cost, in the case of EC2). Similar to workflow management, a majority of our users noted that data management in these environments took some effort.
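
As an illustration of the put/get model, the following sketch stages reference data into S3 and moves per-sample inputs and outputs between object storage and a worker's local scratch space. It uses the boto3 Python library; the bucket name, object keys, and local paths are placeholders, and credentials are assumed to be configured in the environment.

```python
"""Sketch of the put/get (S3-style) storage pattern for a worker node."""
import boto3

s3 = boto3.client("s3")

# Stage reference data into object storage once, before launching workers.
s3.upload_file("reference/genome.fasta", "my-project-bucket",
               "reference/genome.fasta")

# Each worker pulls the input it needs onto local (volatile) storage...
s3.download_file("my-project-bucket", "inputs/sample_001.fastq",
                 "/scratch/sample_001.fastq")

# ...and pushes its results back so they survive instance termination.
s3.upload_file("/scratch/sample_001.out", "my-project-bucket",
               "results/sample_001.out")
```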

11.4.7 Performance, Reliability and Portability

Scientific applications need to consider the performance and reliability of these environments. Our benchmarks (Chapter 9) show that the performance of an application is largely dependent on the type of application.

Cloud resources have also gained popularity because resources are immediately available to handle spikes in load. However, starting a set of virtual machines can take a few minutes, and startup times can be highly variable when requesting a large number of VMs [66]. In the case of synchronous applications, it is necessary to wait until all the virtual machines are up, and in public clouds the user is charged for the time the machines are up even if they are not yet being used.
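
The following sketch illustrates the startup-latency issue for a synchronous run: a pool of instances is requested and the job blocks until every instance reports as running, during which time charges are already accruing. The AMI identifier, instance type, and pool size are placeholders, and the example uses the boto3 Python library rather than any tooling from the Magellan project.

```python
"""Launch a pool of VMs and wait for all of them before a synchronous run."""
import boto3

ec2 = boto3.client("ec2")

resp = ec2.run_instances(ImageId="ami-00000000", InstanceType="m1.large",
                         MinCount=32, MaxCount=32)
instance_ids = [inst["InstanceId"] for inst in resp["Instances"]]

# For a synchronous (e.g., MPI-style) run, no useful work can start until the
# slowest instance is up, yet billing for the early instances has already begun.
waiter = ec2.get_waiter("instance_running")
waiter.wait(InstanceIds=instance_ids)
print("All %d instances running; starting the synchronous job." % len(instance_ids))
```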

High availability is often mentioned as one of the advantages of cloud resources. It is true that a number of cloud computing vendors such as Amazon provide various fault tolerance mechanisms, such as availability zones and regions, to enable highly fault-tolerant applications. However, the burden is largely on the end users to design their applications to be fault-tolerant, usually at a higher cost. Failures can occur at the hardware or software level, and good design practices to survive failures must be used when building software and services. The Amazon outage in April 2011 was triggered by an incorrect network configuration change, made during maintenance, that affected multiple services. Applications that used multiple regions were less impacted by the event; however, the shared EBS control plane still impacted EBS activities. Thus, cloud computing has many of the same availability challenges that impact HPC centers.
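
One example of the kind of fault-tolerant design left to the end user is spreading identical worker instances across several availability zones, so that the loss of a single zone does not remove the entire pool. The sketch below shows this pattern with the boto3 Python library; the zone names, AMI identifier, and instance counts are assumptions, not values from the report.

```python
"""Spread worker instances across availability zones for fault tolerance."""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
instances_per_zone = 4

launched = []
for zone in zones:
    resp = ec2.run_instances(
        ImageId="ami-00000000", InstanceType="m1.large",
        MinCount=instances_per_zone, MaxCount=instances_per_zone,
        Placement={"AvailabilityZone": zone},  # pin each group to one zone
    )
    launched.extend(inst["InstanceId"] for inst in resp["Instances"])

print("Launched %d workers across %d zones." % (len(launched), len(zones)))
```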
