
[Figure 9.3 appears here as two bar charts. Panel (a): percentage performance with respect to native for TCP over IB, TCP over Ethernet, and VM across DGEMM, STREAM, ping-pong bandwidth, random ring bandwidth, HPL, FFTE, PTRANS, and RandomAccess. Panel (b): percentage performance relative to native for TCP over IB, TCP over Ethernet, and IB/VM for ping-pong latency and random ring latency, on a much smaller vertical scale.]

Figure 9.3: Performance of the HPC Challenge (HPCC) benchmarks relative to the native hardware configuration for each of the cases examined. (a) shows the performance for all of the components of the benchmark except the latency measurements, which are shown in (b) for clarity. Note the very different vertical scale on (b).

performance of each of the benchmarks relative to that obtained on the native host using the InfiniBand interconnect. First we consider Figure 9.3a. The first two measurements, DGEMM and STREAM, are measures of the floating point performance and memory bandwidth, respectively. As they do not depend on the network, they serve in this case as a probe of the impact of virtualization on compute performance. In both cases, the performance inside the virtual machine is about 30% slower than on the native hardware, which is in line with previous measurements [46]. There appear to be two reasons: the overhead of translating memory address instructions, and the KVM virtual environment not exposing all of the CPU instructions, such as SSE. Subsequent to this study, a later version of KVM that supports the advanced CPU instructions in the virtual machine was installed on the Magellan testbed. After this upgrade, the CPU overhead was observed to be negligible.
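To make concrete what the memory-bandwidth probe measures, the sketch below times a triad-style vector kernel in the spirit of STREAM. It is an illustrative sketch rather than the actual STREAM benchmark; the array size, repetition count, and timing approach are arbitrary choices for this example.

```c
/* Triad-style memory-bandwidth sketch, in the spirit of STREAM (not the benchmark itself).
 * Compile with: cc -O2 stream_sketch.c (add -lrt on older glibc). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)1 << 24)   /* ~16.8M doubles per array, well beyond cache */
#define NTIMES 10

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    const double scalar = 3.0;
    if (!a || !b || !c) return 1;
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int k = 0; k < NTIMES; k++)
        for (size_t i = 0; i < N; i++)
            a[i] = b[i] + scalar * c[i];   /* triad: two loads and one store per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 3.0 * sizeof(double) * (double)N * NTIMES;
    /* Print a result element so the compiler cannot discard the loop. */
    printf("a[0] = %.1f, sustained bandwidth ~ %.2f GB/s\n", a[0], bytes / secs / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```

Because a kernel like this touches only local memory, any slowdown observed inside the virtual machine reflects the cost of virtualizing the CPU and memory system rather than the network.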

Next we consider the network bandwidth measurements, of which we made two kinds, ping-pong and random ring. Ping-pong is a measure of point-to-point bandwidth, whereas random ring consists of each task simultaneously sending to a randomly selected partner, with the intent to induce network contention.
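A minimal ping-pong measurement between two MPI ranks looks roughly like the sketch below. It is an illustrative example in the spirit of the HPCC ping-pong test, not the benchmark code itself; the message size and repetition count are arbitrary choices.

```c
/* Minimal MPI ping-pong sketch (illustrative, not the HPCC code).
 * Rank 0 sends a buffer to rank 1, which echoes it back; bandwidth and latency
 * are derived from the round-trip time. Run with at least two ranks (mpicc/mpirun). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int nbytes = 1 << 22;   /* 4 MB message */
    const int reps   = 100;
    char *buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0) {
        /* Two messages of nbytes cross the link per iteration. */
        double gbps = 2.0 * nbytes * reps / elapsed / 1e9;
        printf("Ping-pong bandwidth ~ %.2f GB/s, one-way time ~ %.1f us\n",
               gbps, elapsed / (2.0 * reps) * 1e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```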

The ping-pong results show a decrease in performance relative to the native measurement for each of the cases examined. The biggest drop relative to the native performance comes from using TCP over IB, with additional decreases in performance observed for TCP over Ethernet and virtualization. The random ring bandwidth clearly shows the inability of the Ethernet connection to cope with significant amounts of network contention. In this case, in contrast to the ping-pong case, virtualization does not significantly decrease the performance further. The network latency measurements, both ping-pong and random ring (Figure 9.3b), show qualitatively different trends from the bandwidths, as well as a significantly larger overall performance decrease, even at the TCP over InfiniBand level. The principal increase in latency occurs when switching to TCP over IB, with a further increase from within the virtual machine. The TCP over IB and TCP over Ethernet latencies are approximately the same, which indicates that the latency is primarily a function of the transport mechanism, as one would expect. The performance of the HPL benchmark is primarily a function of the DGEMM performance and the ping-pong bandwidth; the exact ratio depends upon the balance point of the machine, so as the network capabilities become relatively worse, the HPL performance decreases. Both FFT and PTRANS are sensitive to random ring bandwidth and mirror the trends observed there. RandomAccess is primarily sensitive to network latency and likewise mimics the trends observed in the network latency measurements.

The performance of our application benchmarks relative to the native using the various machine config-
