Magellan Final Report - Office of Science - U.S. Department of Energy


by the networking service, and are then ’attached’ to the requested virtual machine via additional network address translation rules.

Because all networking is handled by a single service, this introduces a performance bottleneck for communication-intensive applications. Furthermore, the networking setup is quite complex and requires significant effort to understand. Some inflexibility with regard to support of physical network architectures was also encountered in the design.
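The address-translation step described above can be sketched as follows. This is a minimal illustration, not OpenStack's actual implementation: the hypothetical helper simply generates the kind of iptables DNAT/SNAT rules a single networking service might install to map a public (floating) IP onto an instance's private address.

```python
# Minimal sketch (not OpenStack's actual code): generate the iptables
# rules a single networking service might install to "attach" a public
# floating IP to a virtual machine's private address via NAT.

def floating_ip_rules(public_ip: str, private_ip: str) -> list[str]:
    """Return DNAT/SNAT rule strings mapping public_ip <-> private_ip."""
    return [
        # Inbound: rewrite the destination of traffic aimed at the public IP.
        f"iptables -t nat -A PREROUTING -d {public_ip} "
        f"-j DNAT --to-destination {private_ip}",
        # Outbound: make the instance's traffic appear to come from the public IP.
        f"iptables -t nat -A POSTROUTING -s {private_ip} "
        f"-j SNAT --to-source {public_ip}",
    ]

rules = floating_ip_rules("198.51.100.10", "10.0.0.5")
for rule in rules:
    print(rule)
```

Since every such rule lives on the one host running the networking service, all translated traffic must traverse that host, which is the bottleneck noted above.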

Scalability. Overall, OpenStack scales well, and performance on the 420-node cloud at ALCF was reliable up to per-project instance counts of more than 200. Instance counts of 300 or greater have been achieved successfully, though at that scale the single API node becomes overloaded and instance launch times increase dramatically.

6.3 Nimbus<br />

Nimbus is an open-source IaaS cloud-computing framework created by a team of developers at Argonne National Laboratory and the University of Chicago. It has been successfully deployed at a number of sites around the country, one of which is an 82-node cloud running on Magellan project hardware at Argonne.

Internals. Nimbus supports the creation of virtual machines using both Xen and KVM, the latter being used on the Magellan cloud. The core services include the Workspace Service, which handles all virtual machine management and user interaction via grid credentials or the Amazon EC2 API. The configuration clients run on individual virtual machines and control the setup of services, user and virtual-cluster credentials, and image metadata acquisition. The cloudinit.d script is one of these clients, so user images can easily be modified for compatibility with Amazon EC2, similar to Eucalyptus and OpenStack.
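As a rough illustration of the kind of lifecycle bookkeeping a virtual-machine manager such as the Workspace Service must perform, the following sketch tracks instances through a simplified propagate → running → terminated state machine. The states, class, and method names here are hypothetical simplifications for illustration, not Nimbus internals.

```python
# Hedged sketch: a toy VM-lifecycle registry, loosely modeled on what any
# workspace/VM-management service must track. Not Nimbus's actual code.

PROPAGATING, RUNNING, TERMINATED = "propagating", "running", "terminated"

# Legal state transitions: image propagation, then running, then terminated.
VALID = {PROPAGATING: {RUNNING}, RUNNING: {TERMINATED}, TERMINATED: set()}

class WorkspaceRegistry:
    def __init__(self):
        self._vms = {}  # vm_id -> current state

    def launch(self, vm_id: str) -> None:
        # A new request starts in the image-propagation phase.
        self._vms[vm_id] = PROPAGATING

    def transition(self, vm_id: str, new_state: str) -> None:
        current = self._vms[vm_id]
        if new_state not in VALID[current]:
            raise ValueError(f"illegal transition {current} -> {new_state}")
        self._vms[vm_id] = new_state

    def state(self, vm_id: str) -> str:
        return self._vms[vm_id]
```

A real service layers image transfer, hypervisor control, and credential setup on top of this bookkeeping; the sketch only shows the state tracking itself.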

Running Nimbus on Magellan. The two biggest challenges encountered running Nimbus on Magellan hardware involved networking issues and software dependencies. The network issues revolved primarily around the security measures in place at the site. Difficulties with ingress firewall settings caused user ssh sessions to time out prematurely and complicated the assignment of public IP addresses to virtual machines; this prompted one user to eschew externally routable IPs altogether. Taken together, these issues made management of the cloud difficult, and even simple administrative tasks required exorbitant effort.

To alleviate the aforementioned difficulties, an image propagation protocol called LANTorrent was developed. LANTorrent is a multicast system that allows researchers to quickly distribute VM images across the cloud at one time; tests on Magellan achieved 1,000 simultaneous boots. LANTorrent forms a store-and-forward chain of recipients that allows an image to be sent at a much faster rate than can be achieved with http or ssh transfers, significantly reducing the time a user's cluster required to boot.
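The advantage of a store-and-forward chain can be shown with a small timing sketch. Assuming every link carries data at the same bandwidth, each recipient forwards blocks onward while still receiving, so the whole chain finishes in roughly one transfer time plus a short pipeline-fill delay, whereas sequential unicast from a single server takes one full transfer time per node. The model below is a simplification for illustration, not LANTorrent's implementation.

```python
# Toy bandwidth model (not LANTorrent itself): compare sequential unicast
# from one server against a store-and-forward chain in which each recipient
# forwards blocks to the next node as soon as it has stored them.

def sequential_time(n_nodes: int, image_gb: float, gb_per_s: float) -> float:
    # The server sends the whole image to each node, one after another.
    return n_nodes * image_gb / gb_per_s

def chain_time(n_nodes: int, image_gb: float, gb_per_s: float,
               block_gb: float = 0.1) -> float:
    # Pipelined chain: after a one-block delay per hop, all links run in
    # parallel, so total time ~ one transfer plus the pipeline fill.
    transfer = image_gb / gb_per_s
    pipeline_fill = (n_nodes - 1) * block_gb / gb_per_s
    return transfer + pipeline_fill

# Hypothetical numbers: 100 nodes, a 10 GB image, 1 GB/s links.
print(sequential_time(100, 10.0, 1.0))  # 1000.0 seconds
print(chain_time(100, 10.0, 1.0))       # 19.9 seconds
```

Under these assumed numbers the chain is roughly 50x faster, which is the effect that shortened cluster boot times on Magellan.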

6.4 Discussion<br />

In the previous section, we detailed current cloud offerings. Here we examine some of the gaps and challenges in using existing cloud offerings directly for scientific computing.

Resources. Cloud computing carries with it the assumption of an unlimited supply of resources available on demand. While this was true in the early days of cloud computing, when demand for resources was still ramping up, users have more recently found that their requests on providers such as Amazon EC2 were not satisfied due to insufficient capacity. This situation is similar to present-day supercomputing and grid resources, which are often over-subscribed and have long wait queues. Thus, for the scientific user, the promise of cloud computing as an unlimited supply of cycles is less compelling in practice. There is a need for differentiated levels of service similar to Amazon's current offerings, but with advanced resource request interfaces with

