Routing IP packets from the virtual machine (VM) to the hypervisor (HV) and then using IP over InfiniBand is the most readily available approach today. In this approach, the VM communicates through the HV using a virtual network interface, and standard IP routing then carries the packets over the InfiniBand interface. This approach provides the lowest performance of the various options: since each packet must traverse multiple TCP stacks (in the VM and in the HV), the overhead is high. There are also numerous challenges in integrating this approach with cloud management software like Eucalyptus and OpenStack.
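To make the cost of this layered path concrete, a small latency probe run inside the VM can be instructive. The sketch below is illustrative only: the peer address and port are placeholders, and it assumes a simple echo server is listening on the far end; none of these are Magellan-specific values.

```python
import socket
import time

# Hypothetical peer reachable over the IPoIB-routed path; the address,
# port, and echo server on the other end are assumptions for this sketch.
PEER = ("10.10.0.2", 5001)
ROUNDS = 1000

def measure_rtt():
    """Time small TCP round trips; each one traverses the VM's TCP stack,
    the hypervisor's virtual interface, and IP over InfiniBand on the host."""
    with socket.create_connection(PEER) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(ROUNDS):
            s.sendall(b"x")
            s.recv(1)  # assumes the peer echoes each byte back
        elapsed = time.perf_counter() - start
    return elapsed / ROUNDS

if __name__ == "__main__":
    print(f"mean round-trip time: {measure_rtt() * 1e6:.1f} us")
```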
Utilizing Ethernet over InfiniBand (EoIB) is another approach. In EoIB, first proposed by Mellanox, Ethernet frames are encapsulated in InfiniBand packets, and the InfiniBand stack presents a virtual Ethernet NIC. This virtual NIC would be presented on the HV and then bridged to the VM. This approach has several advantages. VLANs should be supported in this model, which would provide some support for isolating networks. Bridging of the virtual Ethernet interface is similar to how Ethernet NICs are commonly bridged for a virtual machine, which means that packages like Eucalyptus could more easily support it. Since the communication would be TCP-based, some performance would be lost compared with the remote memory access offered by native InfiniBand. However, the packets would only pass through a single TCP stack (on the VM) on each side, so performance should exceed that of the IP over InfiniBand method. While this approach looks promising, the current EoIB implementation does not support Linux bridging. Another disadvantage is that only Mellanox currently plans to support EoIB, and it requires a special bridge device connected to the InfiniBand network; however, these devices are relatively low cost.
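For context, the sketch below shows the standard Linux bridging flow that cloud management packages rely on, expressed with the iproute2 commands a hypervisor would issue. The interface names are hypothetical ("eoib0" stands in for the virtual Ethernet NIC an EoIB stack would present); this is the step the EoIB implementation available at the time did not support.

```python
import subprocess

# Illustrative names: "br0" for the bridge and "eoib0" for the virtual
# Ethernet NIC presented on the hypervisor. Requires root.
BRIDGE = "br0"
VNIC = "eoib0"

def run(*args):
    """Run an iproute2 command, raising if it fails."""
    subprocess.run(["ip", *args], check=True)

# Create a bridge, enslave the virtual NIC to it, and bring both up;
# a VM's tap interface would then be attached to the same bridge.
run("link", "add", "name", BRIDGE, "type", "bridge")
run("link", "set", "dev", VNIC, "master", BRIDGE)
run("link", "set", "dev", VNIC, "up")
run("link", "set", "dev", BRIDGE, "up")
```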
Full virtualization of the InfiniBand adapter is just starting to emerge. This method would provide the best performance, since it would support full RDMA. However, it would be difficult to isolate networks and protect the network from actions by the VM. Furthermore, images would have to include and configure the InfiniBand stack, which would dramatically complicate the process of creating and maintaining images. Finally, only the latest adapters from Mellanox are expected to support full virtualization; the adapters in Magellan will likely not support this method. We are exploring options for evaluating virtualization of the NIC and looking at alternative methods.
High-performance networks like InfiniBand can dramatically improve performance, but integrating them into virtual machine-based systems requires further research and development. Alternatively, it may be more attractive to look at how some of the benefits of virtualization can be provided to users without using true virtualization. For example, our early experiments with MOAB/xCAT show that dynamic imaging of nodes or software overlay methods allow users to tailor the environment to their application without incurring the cost of virtualization.
9.7.2 I/O on Virtual Machines<br />
The I/O performance results clearly highlight that I/O can be one of the causes of performance bottlenecks in virtualized cloud environments. Performance in VMs is lower than on physical machines, which may be attributed to the additional level of abstraction between the VM and the hardware. It is also evident from the results that local disks perform better than block store volumes. Block store volumes can be mounted on multiple instances and used as a shared file system to run MPI programs, but performance degrades significantly; thus they may not be suitable for all of the MPI applications that compose the HPC workload. Our large-scale tests also capture the variability associated with virtual machine performance on Amazon.
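A simple way to observe the local-disk versus block-store gap on a given instance is a sequential write probe against each mount point. The sketch below is a minimal illustration; the mount points and transfer size are placeholders, not the benchmark configuration used in our tests.

```python
import os
import time

# Hypothetical mount points: adjust to the instance's local disk and
# attached block store volume.
TARGETS = {"local disk": "/mnt/local/testfile",
           "block store": "/mnt/blockstore/testfile"}
SIZE_MB = 256
CHUNK = b"\0" * (1 << 20)  # 1 MiB per write

def write_throughput(path):
    """Write SIZE_MB of data, fsync to force it to the device,
    and report throughput in MB/s."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(CHUNK)
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    os.remove(path)
    return SIZE_MB / elapsed

if __name__ == "__main__":
    for name, path in TARGETS.items():
        print(f"{name}: {write_throughput(path):.1f} MB/s")
```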
HPC systems typically provide a high-performance scratch file system that is used for reading initial conditions, saving simulation output, or writing checkpoint files. This capability is typically provided by parallel file systems such as GPFS, Lustre, or Panasas. Providing this capability for virtual machine-based systems presents several challenges, including security, scalability, and stability.
The security challenges center on the level of control a user would have within the virtual machine. Most parallel file systems use a host-based trust model, i.e., a server trusts that a client will enforce access rules and permissions. Since the user would likely have root privileges on the VM, they would be capable of viewing and modifying other users’ data on the file system.
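The weakness is easy to state concretely: because the server accepts whatever uid the client kernel reports, root inside the VM can simply assume another user's identity before touching the mount. The sketch below is purely illustrative; the uid and path are hypothetical.

```python
import os

# Illustrative only: under a host-based trust model, the file server
# cannot distinguish a legitimate user from a root user in the VM who
# has assumed that user's uid.
VICTIM_UID = 1001                             # hypothetical uid of another user
TARGET = "/mnt/parallel_fs/other_user/data"   # hypothetical path on the mount

os.setuid(VICTIM_UID)            # requires root inside the VM
with open(TARGET, "rb") as f:    # access is granted as if we were uid 1001
    data = f.read()
```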
Accessing shared file systems from user-controlled virtual machines can also create scalability and stability issues.