Magellan Final Report - Office of Science - U.S. Department of Energy
[Figure: two bar charts comparing the TCP over IB, TCP over Ethernet, and VM cases against native. (a) Percentage performance relative to native (0-100 scale) for DGEMM, STREAM, Ping Pong Bandwidth, RandRing Bandwidth, HPL, FFTE, PTRANS, and RandAccess. (b) Percentage performance relative to native (0-12 scale) for Ping Pong Latency and RandRing Latency.]

Figure 9.3: Performance of the HPCC challenge benchmarks relative to the native hardware configuration for each of the cases examined. (a) Shows the performance for all of the components of the benchmark except the latency measurements, which are shown in (b) for clarity. Note the very different vertical scale on (b).
performance of each of the benchmarks relative to that obtained on the native host using the InfiniBand interconnect. First we consider Figure 9.3a. The first two measurements, DGEMM and STREAM, measure floating-point performance and memory bandwidth, respectively. Since they do not depend on the network, they serve here as probes of the impact of virtualization on compute performance. In both cases, performance inside the virtual machine is about 30% lower than on the native hardware, in line with previous measurements [46]. There appear to be two reasons: the overhead of translating memory address instructions, and the KVM virtual environment not exposing all of the CPU's instructions, such as SSE. Subsequent to this study, a later version of KVM that supports the advanced CPU instructions in the virtual machine was installed on the Magellan testbed. After this upgrade, the CPU overhead was observed to be negligible.
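As an illustration of the kind of compute probe DGEMM provides, the following is a minimal sketch (using NumPy rather than the HPCC harness; all names are illustrative, not from this study) that times a double-precision matrix multiply and checks which SIMD instruction sets the possibly-virtual CPU advertises:

```python
import time
import numpy as np


def dgemm_gflops(n=512, trials=3):
    """Time a double-precision matrix multiply (the core of DGEMM) and
    return the best rate in GFLOP/s. A large native-vs-VM gap in this
    number points at compute virtualization overhead, not the network."""
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n ** 3 / best / 1e9  # a matrix multiply costs ~2*n^3 flops


def simd_flags():
    """Report which SIMD instruction sets the CPU advertises; a VM that
    hides SSE/AVX will run DGEMM well below native speed. Linux-specific:
    parses /proc/cpuinfo and returns an empty set elsewhere."""
    try:
        with open("/proc/cpuinfo") as f:
            flags = set(f.read().split())
    except OSError:
        return set()
    return {t for t in ("sse", "sse2", "avx", "avx2") if t in flags}


if __name__ == "__main__":
    print(f"DGEMM: {dgemm_gflops():.1f} GFLOP/s, SIMD: {sorted(simd_flags())}")
```

Running the same probe natively and inside the guest, and comparing both the achieved rate and the advertised flags, separates the two causes discussed above.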
Next we consider the network bandwidth measurements, of which we made two kinds: ping-pong and random ring. Ping-pong measures point-to-point bandwidth, whereas in random ring each task simultaneously sends to a randomly selected partner, with the intent of inducing network contention. The ping-pong results show a decrease in performance relative to the native measurement for each of the cases examined. The largest drop relative to native performance comes from using TCP over IB, with additional decreases observed for TCP over Ethernet and virtualization. The random ring bandwidth clearly shows the inability of the Ethernet connection to cope with significant network contention. In this case, in contrast to the ping-pong case, virtualization does not significantly decrease performance further.

The network latency measurements, both ping-pong and random ring (Figure 9.3b), show qualitatively different trends from the bandwidths, as well as a significantly larger overall performance decrease, even at the TCP over InfiniBand level. The principal increase in latency occurs when switching to TCP over IB, with a further increase inside the virtual machine. The TCP over IB and TCP over Ethernet latencies are approximately the same, which indicates that the latency is primarily a function of the transport mechanism, as one would expect. The performance of the HPL benchmark is primarily a function of the DGEMM performance and the ping-pong bandwidth; the exact ratio depends upon the balance point of the machine, so as the network capabilities become relatively worse, the HPL performance decreases. Both FFTE and PTRANS are sensitive to random ring bandwidth and mirror the trends observed there. RandomAccess is primarily sensitive to network latency and likewise mimics the trends observed in the latency measurements.
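To make the ping-pong probe concrete, here is a minimal sketch of the pattern using plain TCP over loopback with Python sockets (not the MPI-based HPCC code; the message sizes and helper names are illustrative). One side sends a message, the other echoes it back; half the round-trip time approximates one-way latency for small messages, and large messages give a rough bandwidth estimate:

```python
import socket
import threading
import time


def echo_server(srv, size):
    """Accept one connection, receive exactly `size` bytes, echo them back."""
    conn, _ = srv.accept()
    with conn:
        buf = bytearray()
        while len(buf) < size:
            buf += conn.recv(65536)
        conn.sendall(buf)


def ping_pong(size):
    """One ping-pong exchange of `size` bytes over TCP loopback.
    Returns the approximate one-way time in seconds (half the round trip)."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    t = threading.Thread(target=echo_server, args=(srv, size))
    t.start()
    cli = socket.create_connection(srv.getsockname())
    cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    msg = b"x" * size
    t0 = time.perf_counter()
    cli.sendall(msg)
    got = bytearray()
    while len(got) < size:
        got += cli.recv(65536)
    elapsed = time.perf_counter() - t0
    cli.close()
    t.join()
    srv.close()
    return elapsed / 2.0


if __name__ == "__main__":
    lat = ping_pong(8)       # small message: dominated by latency
    big = 1 << 20            # 1 MiB message: dominated by bandwidth
    bw_time = ping_pong(big)
    print(f"latency ~ {lat * 1e6:.1f} us, bandwidth ~ {big / bw_time / 1e6:.0f} MB/s")
```

Run over loopback this measures only software-stack overhead, but the same small-message/large-message distinction is why latency and bandwidth in Figure 9.3 respond differently to the TCP and virtualization layers.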
The performance of our application benchmarks relative to native using the various machine config-