Magellan Final Report - Office of Science - U.S. Department of Energy
Hadoop and Eucalyptus users. These sessions primarily provided users with an overview of the basic steps<br />
for using these technologies and an opportunity to discuss some of the details of how to port applications<br />
to these environments. For example, ALCF held a “Welcome to Magellan Day” workshop that ran<br />
a full day and included 4 hours of intensive hands-on training in using Eucalyptus, building and managing<br />
images, and debugging images. This workshop demonstrated the difficulties in training scientific users to<br />
become, in essence, system administrators.<br />
Community Support Model. For advanced user activities, we relied on a community support model.<br />
For community support to work, it needs a critical mass of experienced users willing to share their<br />
expertise. Building a community of experienced users requires significant outreach and training, as well<br />
as an extended period of time to develop this community support base. In addition, it needs to be clear<br />
whether an issue is a system problem or a user issue. Early in the project, the cloud software stacks were<br />
sufficiently immature that this distinction was difficult to make. Aside from these issues, the community support model<br />
worked reasonably well. Successful groups often had a primary user who was either already experienced<br />
or gained the expertise to build the infrastructure for the rest of the group to use. While this approach is<br />
not always feasible, it worked particularly well in many cases. For example, one Argonne scientific team was<br />
able to build a complete cluster inside the ALCF Magellan public cloud space, including the Sun Grid<br />
Engine (SGE) batch queueing system and storage, which also had secure access to kerberized NFS resources<br />
outside the system. They were able to route jobs into the system for execution. The team was able to<br />
achieve this complicated configuration because one member of their team had extensive system<br />
administration expertise.<br />
7.3 Discussion<br />
Many of the problems users faced on both the virtual and hardware-as-a-service clusters involved Linux<br />
administrative tasks. Most HPC users have little systems administration experience, so even relatively<br />
minor issues frequently required support from Magellan staff. Often, this could be accomplished by sharing<br />
advice or links to online documentation via email, but at other times it required launching rescue images to<br />
diagnose and repair a damaged system. Installing software and debugging system errors were often necessary<br />
to kick-start a user’s efforts. This process was complicated further when the users chose Linux distributions<br />
outside of the baseline images which the Magellan staff had pre-configured.<br />
Magellan staff also handled challenges related to access to commercial software or other common tools<br />
or infrastructure. For example, some users were accustomed to using proprietary software which could not<br />
always be bundled into a virtual machine without violating license agreements. If a user had the appropriate<br />
license, they had the freedom to install the software and customize the images to include it, but this often<br />
led to issues with supporting modified images.<br />
Security-related challenges were also commonplace. Many of these issues revolved around user credentials.<br />
We discuss additional security challenges in Chapter 8. On the virtual cluster, there were typically problems<br />
with the installation of the users’ client-side tools or with SSL libraries. For example, only certain versions of the client-side<br />
EC2 tools work with Eucalyptus. At ALCF, almost all of the users had some initial difficulty using the<br />
cryptographic tokens which were required to access the bare-metal provisioning service. While these were<br />
often easily resolved, they constituted a significant portion of the total support requests for this service.<br />
Another common issue was with externally-routable IPs and firewall settings on the instances, which by<br />
default block all incoming connections. Again, these were easily resolved, but increased the support load.<br />
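Connectivity problems of this kind can often be triaged from the client side before a support request is filed. The following is a minimal Python sketch, not part of the Magellan tooling, that checks whether a TCP port on an instance accepts connections. The function name `port_is_open` and the example address (taken from the reserved documentation range 203.0.113.0/24) are illustrative assumptions.<br />

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Refused connections, timeouts, and name-resolution failures all raise
    subclasses of OSError, so each of them is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical instance address; replace with the instance's external IP.
    host = "203.0.113.10"
    for port in (22, 80, 443):
        state = "reachable" if port_is_open(host, port) else "blocked or closed"
        print(f"{host}:{port} is {state}")
```

A negative result for port 22 on a freshly started instance suggests, but does not prove, that a firewall rule or a missing externally-routable IP is the cause rather than the user's credentials.<br />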
The final challenges that users often encountered were related to managing data and workflows (discussed<br />
in Chapter 11). The solutions to these challenges are often application-specific, requiring a detailed<br />
understanding of the data and workflow. Furthermore, fully leveraging the advantages of the cloud model,<br />
such as elasticity, often requires the user to fundamentally rethink the execution model. All of this makes it<br />
difficult to offer a cost-effective support model.<br />