
method to the Open Science Grid (OSG). An important aspect of porting Einstein@Home to run on OSG was strengthening fault tolerance and implementing automatic recovery from errors. Fixing problems manually at scale simply is not practical, so Einstein@OSG eventually automated the process.
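The recovery logic amounts to detecting failed work units and resubmitting them without operator intervention. The following is a minimal sketch of such a retry loop in Python; the job-status and resubmission hooks are hypothetical placeholders for site-specific grid interfaces, not part of the Einstein@OSG tooling.

import time

def monitor(jobs, poll_status, resubmit, max_retries=3, poll_interval=300):
    """Resubmit failed work units automatically instead of fixing them by hand.

    poll_status(job_id) and resubmit(job_id) are hypothetical, site-specific
    callables that query and act on the grid batch system.
    """
    retries = {job_id: 0 for job_id in jobs}
    pending = set(jobs)
    while pending:
        for job_id in list(pending):
            state = poll_status(job_id)          # e.g. 'running', 'done', 'failed'
            if state == "done":
                pending.discard(job_id)
            elif state == "failed":
                if retries[job_id] < max_retries:
                    retries[job_id] += 1
                    resubmit(job_id)             # put the work unit back in the queue
                else:
                    pending.discard(job_id)      # retries exhausted; leave for manual triage
        if pending:
            time.sleep(poll_interval)
    return retries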

LIGO used the Magellan serial queue and the Eucalyptus setup at NERSC to understand the advantages of running in cloud environments. A virtual cluster was set up in the NERSC Eucalyptus deployment to give LIGO access to the virtual resources. The cluster served an OSG gatekeeper on a head virtual machine, providing users a seamless transition between traditional batch queue submission and the virtual cluster.

Our work with LIGO shows how center staff might set up virtual clusters for use by specific groups.

To keep the image size small, the gatekeeper stack (approximately 600 MB) was installed on a block store volume. To work around the virtual nature of the head node, it was always booted with a private IP, and a reserved static public IP was assigned after the instance had booted. The gsiftp server had to be configured to handle the private/public duality of the IP addressing. No other changes were required to the LIGO software stack. LIGO was able to switch from running through the NERSC Globus gatekeeper to running through the virtual gatekeeper without major difficulties, and ran successfully.
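Because Eucalyptus exposes an EC2-compatible API, the boot-then-assign sequence used for the head node can be scripted. The sketch below uses the boto EC2 bindings against a Eucalyptus endpoint; the endpoint, credentials, image ID, and reserved address are illustrative placeholders rather than values from the Magellan deployment.

import time
from boto.ec2.connection import EC2Connection
from boto.ec2.regioninfo import RegionInfo

# Illustrative endpoint and credentials; not the actual NERSC Eucalyptus values.
region = RegionInfo(name="eucalyptus", endpoint="euca.example.org")
conn = EC2Connection(aws_access_key_id="ACCESS_KEY",
                     aws_secret_access_key="SECRET_KEY",
                     region=region, is_secure=False, port=8773,
                     path="/services/Eucalyptus")

RESERVED_PUBLIC_IP = "198.51.100.10"   # hypothetical pre-allocated static address

# Boot the head node; it initially comes up with only a private IP.
reservation = conn.run_instances("emi-12345678", instance_type="m1.large",
                                 key_name="gatekeeper-key")
head = reservation.instances[0]
while head.state != "running":
    time.sleep(10)
    head.update()

# Assign the reserved static public IP after boot so the gatekeeper
# has a stable, externally reachable address.
conn.associate_address(instance_id=head.id, public_ip=RESERVED_PUBLIC_IP)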

11.2.4 ATLAS<br />

Figure 11.6: ATLAS CloudCRV job flow.<br />

ATLAS is a particle physics experiment at the Large Hadron Collider (LHC) at CERN. The ATLAS detector is searching for new discoveries in head-on collisions of protons of extraordinarily high energy. ATLAS used Magellan resources at NERSC to build a highly scalable compute cluster on the Eucalyptus cloud. The configuration of the cluster is similar to a normal ATLAS compute cluster: Panda is used as the primary job distributor, through which users submit jobs into a queue. A storage gateway is installed in the cluster to move data into and out of the cluster. The Hadoop Distributed File System (HDFS) is installed on the worker nodes so that each worker node also acts as a storage node.

The key difference between the Magellan cluster and a conventional ATLAS cluster is that it fully utilizes the agility provided by the cloud. The cluster can automatically adjust the number of workers according to the number of jobs waiting in the queue. When many jobs are waiting in the queue, new worker nodes are added to the cluster automatically; when the jobs are finished, idle workers are terminated automatically to save cost.
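A scheduler-driven scaling loop of this kind is straightforward to sketch. The example below illustrates the idea only and is not CloudCRV's implementation; the queue query and the launch/terminate hooks are hypothetical placeholders for the Panda and Eucalyptus interfaces.

import time

JOBS_PER_WORKER = 4      # target ratio of queued jobs to worker nodes
MAX_WORKERS = 50
POLL_INTERVAL = 60       # seconds between scaling decisions

def scale_cluster(count_queued_jobs, list_idle_workers, launch_worker, terminate_worker):
    """Grow or shrink the worker pool to track the queue depth.

    All four arguments are hypothetical callables wrapping the job queue
    (e.g. Panda) and the cloud API (e.g. Eucalyptus).
    """
    workers = set()
    while True:
        queued = count_queued_jobs()
        target = min(MAX_WORKERS, -(-queued // JOBS_PER_WORKER))   # ceiling division

        # Scale up: jobs are waiting and the pool is below the target size.
        while len(workers) < target:
            workers.add(launch_worker())

        # Scale down: terminate only idle workers so running jobs are not killed.
        if len(workers) > target:
            for worker in list_idle_workers():
                if len(workers) <= target:
                    break
                if worker in workers:
                    terminate_worker(worker)
                    workers.discard(worker)

        time.sleep(POLL_INTERVAL)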

To address many of the challenges in working with virtual clusters that we discussed in earlier case studies, the ATLAS project developed and deployed a tool called CloudCRV. CloudCRV can automatically deploy and configure individual VMs to work as a cluster. It can also scale the size of the cluster automatically according to demand. As shown in Figure 11.6, the lifetimes of all core services and worker nodes are controlled by CloudCRV.
