Magellan Final Report - Office of Science - U.S. Department of Energy
by the networking service, and are then ‘attached’ to the requested virtual machine via additional network<br />
address translation rules.<br />
As all networking is handled by one service, this introduces a performance bottleneck for communication-intensive<br />
applications. Furthermore, the networking setup is complex and requires a<br />
significant amount of effort to understand. Some inflexibility with regard to support of physical network<br />
architectures was also encountered in the design.<br />
Scalability. Overall, OpenStack scales well, and performance on the 420-node cloud at ALCF was reliable<br />
up to per-project instance counts of more than 200. Instance counts of 300 or greater have been<br />
achieved, though at that scale the single API node becomes overloaded and instance launch times<br />
increase dramatically.<br />
6.3 Nimbus<br />
Nimbus is an open-source IaaS cloud-computing framework, created by a team of developers at Argonne<br />
National Laboratory and the University of Chicago. It has been deployed at a number of sites around<br />
the country, one of which is an 82-node cloud running on Magellan project hardware at Argonne.<br />
Internals. Nimbus supports the creation of virtual machines using both Xen and KVM, the latter being<br />
used on the Magellan cloud. The core services include the Workspace Service, which handles all virtual machine<br />
management and user interaction via a grid-credential or Amazon EC2 API. The configuration clients<br />
run on individual virtual machines, and control setup of services, user and virtual cluster credentials, and<br />
image metadata acquisition. The cloudinit.d script is one of these clients, so user images can easily be<br />
modified for compatibility with Amazon EC2, similar to Eucalyptus and OpenStack.<br />
Running Nimbus on Magellan. The two biggest challenges encountered running Nimbus on Magellan<br />
hardware involved networking and software dependencies. The network issues revolved primarily<br />
around the security measures present on site. Difficulties with ingress firewall settings caused user ssh<br />
sessions to time out prematurely, and complicated the assignment of public IP addresses to virtual machines;<br />
this prompted one user to eschew the use of externally routable IPs altogether. Taken together, these issues<br />
made management of the cloud difficult, and even simple administrative tasks required exorbitant effort.<br />
In order to alleviate the aforementioned networking difficulties, an image propagation protocol called<br />
LANTorrent was developed. LANTorrent is a multicast system that allows researchers to quickly distribute<br />
VM images across the cloud at one time; tests on Magellan resulted in 1000 simultaneous boots. LANTorrent<br />
forms a store-and-forward chain of recipients that allows the image to be sent at a rate much faster than can<br />
be accomplished with HTTP or SSH transfers. This significantly lessened the time a user’s cluster required to<br />
boot.<br />
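The benefit of a store-and-forward chain can be illustrated with a rough timing model. The sketch below is not LANTorrent's implementation; it assumes, for illustration only, that every link has the same bandwidth, that the image is cut into fixed-size chunks, and that each recipient forwards a chunk to the next recipient as soon as it has received it. The function names and parameters are hypothetical.

```python
import math


def sequential_time(image_bytes, link_bytes_per_s, recipients):
    """Seconds for one server to send the whole image to each recipient in turn."""
    return recipients * image_bytes / link_bytes_per_s


def chain_time(image_bytes, link_bytes_per_s, recipients, chunk_bytes):
    """Seconds for a pipelined store-and-forward chain (assumed model).

    The last recipient finishes once every chunk has crossed every hop.
    With pipelining, that takes (chunks + recipients - 1) chunk-transfer
    slots. Each chunk is counted as full-size, so a partial final chunk
    makes this a slight overestimate.
    """
    chunks = math.ceil(image_bytes / chunk_bytes)
    seconds_per_chunk = chunk_bytes / link_bytes_per_s
    return (chunks + recipients - 1) * seconds_per_chunk


if __name__ == "__main__":
    # Hypothetical numbers chosen only to make the arithmetic easy to follow.
    print(sequential_time(1000, 100, 10))   # one-by-one transfers
    print(chain_time(1000, 100, 10, 100))   # pipelined chain
```

Under these assumptions the sequential cost grows with the product of image size and recipient count, while the chain cost grows with their sum, which is why adding recipients to the chain costs comparatively little.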
6.4 Discussion<br />
In the previous section, we detailed current cloud offerings. Here we examine some of the gaps and challenges<br />
in using existing cloud offerings directly for scientific computing.<br />
Resources. Cloud computing carries with it the assumption of an unlimited supply of resources available<br />
on demand. While this was true in the early days of cloud computing when demand for resources was still<br />
ramping up, more recently users have noticed that their requests have not been satisfied on providers such<br />
as Amazon EC2 due to insufficient capacity. This situation is similar to current-day supercomputing and<br />
grid resources that are often over-subscribed and have long wait queues. Thus, for the user, scientific cloud<br />
computing as an unlimited supply of cycles looks less promising. There is a need for differentiated<br />
levels of service similar to Amazon’s current offerings but with advanced resource request interfaces with<br />