29.12.2014 Views

Magellan Final Report - Office of Science - U.S. Department of Energy

<strong>Magellan</strong> <strong>Final</strong> <strong>Report</strong><br />

Hadoop and Eucalyptus users. These sessions primarily provided users with an overview <strong>of</strong> the basic steps<br />

to using these technologies and provided an opportunity to discuss some <strong>of</strong> the details <strong>of</strong> how to port applications<br />

to these environments. For example, ALCF held a "Welcome to <strong>Magellan</strong> Day" workshop that ran<br />

a full day and included 4 hours <strong>of</strong> intensive hands-on training in using Eucalyptus, building and managing<br />

images, and debugging images. This workshop demonstrated the difficulties in training scientific users to<br />

become, in essence, system administrators.<br />

Community Support Model. For advanced user activities, we relied on a community support model.<br />

In order for community support to work, it needs a critical mass <strong>of</strong> experienced users willing to share their<br />

expertise. Building a community <strong>of</strong> experienced users requires significant outreach and training, as well<br />

as an extended period <strong>of</strong> time to develop this community support base. In addition, it needs to be clear<br />

whether an issue is a system problem or a user issue. Early in the project, the cloud s<strong>of</strong>tware stacks were<br />

sufficiently immature that this process was difficult. Aside from these issues, the community support model<br />

worked reasonably well. Successful groups <strong>of</strong>ten had a primary user who was either already experienced<br />

or gained the expertise to build the infrastructure for the rest <strong>of</strong> the group to use. While this approach is<br />

not always feasible, it worked particularly well in many cases. For example, one Argonne scientific team was<br />

able to build a complete cluster inside <strong>of</strong> the ALCF <strong>Magellan</strong> public cloud space, including the Sun Grid<br />

Engine (SGE) batch queueing system and storage, which also had secure access to kerberized NFS resources<br />

outside the system. They were able to route jobs into the system for execution. The team was able to<br />

successfully achieve this complicated configuration because one member <strong>of</strong> their team had extensive system<br />

administration expertise.<br />

7.3 Discussion<br />

Many <strong>of</strong> the problems users faced on both the virtual and hardware-as-a-service clusters involved Linux<br />

administrative tasks. Most HPC users have little systems administration experience, so even relatively<br />

minor issues frequently required support from <strong>Magellan</strong> staff. Often, this could be accomplished by sharing<br />

advice or links to online documentation via email, but at other times it required launching rescue images to<br />

diagnose and repair a damaged system. Installing s<strong>of</strong>tware and debugging system errors were <strong>of</strong>ten necessary<br />

to kick-start a user’s efforts. This process was complicated further when the users chose Linux distributions<br />

outside <strong>of</strong> the baseline images that the <strong>Magellan</strong> staff had pre-configured.<br />

<strong>Magellan</strong> staff also handled challenges related to access to commercial s<strong>of</strong>tware or other common tools<br />

or infrastructure. For example, some users were accustomed to using proprietary s<strong>of</strong>tware which could not<br />

always be bundled into a virtual machine without violating license agreements. If a user had the appropriate<br />

license, they were free to customize the images to include the s<strong>of</strong>tware, but this <strong>of</strong>ten<br />

led to issues with supporting modified images.<br />

Security-related challenges were also commonplace. Many <strong>of</strong> these issues revolved around user credentials.<br />

We discuss additional security challenges in Chapter 8. On the virtual cluster, there were typically problems<br />

with the users’ client-side tools installation or SSL libraries. For example, only certain versions <strong>of</strong> the client-side<br />

ec2 tools work with Eucalyptus. At ALCF, almost all <strong>of</strong> the users had some initial difficulty using the<br />

cryptographic tokens which were required to access the bare-metal provisioning service. While these were<br />

<strong>of</strong>ten easily resolved, they constituted a significant portion <strong>of</strong> the total support requests for this service.<br />

Another common issue was with externally-routable IPs and firewall settings on the instances, which by<br />

default block all incoming connections. Again, these were easily resolved, but they increased the support load.<br />
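As a rough illustration of the firewall issue described above, a user can check from outside whether a given TCP port on an instance accepts connections at all before looking for other causes. The following is a minimal sketch using only the Python standard library; the function name `is_port_open` and its parameters are hypothetical and are not part of any Magellan or Eucalyptus tooling.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `is_port_open("203.0.113.10", 22)` (a placeholder documentation address) would return `False` if the instance's default rules still block incoming SSH. A `False` result alone does not distinguish a firewall rule from a service that is not running.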

The final challenges that users <strong>of</strong>ten encountered were related to managing data and workflows (discussed<br />

in Chapter 11). The solutions to these challenges are <strong>of</strong>ten application specific, requiring a detailed<br />

understanding <strong>of</strong> the data and workflow. Furthermore, fully leveraging the advantages <strong>of</strong> the cloud model,<br />

such as elasticity, <strong>of</strong>ten requires the user to fundamentally rethink the execution model. All <strong>of</strong> this makes it<br />

difficult to <strong>of</strong>fer a cost-effective support model.<br />
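To make the point about rethinking the execution model more concrete: a traditional batch job assumes a fixed node count, whereas an elastic workflow sizes its worker pool from the current backlog. The sketch below shows one simple way such a sizing rule might look. The function `workers_needed`, its parameters, and the notion of a per-user instance cap are assumptions made for illustration and do not describe how any Magellan user actually scaled their workloads.

```python
import math


def workers_needed(pending_tasks: int, tasks_per_worker: int, max_workers: int) -> int:
    """Return how many worker instances to run for the current backlog."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    # Round up so a partial batch of tasks still gets a worker.
    wanted = math.ceil(pending_tasks / tasks_per_worker)
    # Stay within the cap (a hypothetical instance quota) and never go below zero.
    return max(0, min(wanted, max_workers))
```

With 25 pending tasks, 10 tasks per worker, and a cap of 8, this rule asks for 3 workers; with an empty queue it asks for none, which is the behavior a fixed-size allocation cannot express.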
