
Ilya Raykhel

CS 684

Hot Plate OpenMP, PThreads and MPI performance comparison.

Introduction.

There are multiple ways to implement a parallel solution to the problem of calculating a stable temperature state on a "Hot Plate" system. The "Hot Plate" is a matrix whose elements are temperature values. Some of these values are fixed, while the rest evolve toward a stable state according to the laws of thermodynamics. The problem is to calculate a stable state for the whole system, in which no temperature values change anymore. The solution is straightforward in the single-threaded case, but there are multiple ways to implement it in a multiprocessor environment. This paper presents a comparison of three such approaches: an OpenMP implementation, a PThreads implementation, and an MPI implementation.

Approach.

The OpenMP implementation of the hot plate is based on shared memory and on the OpenMP library of C compiler directives, which takes care of thread spawning and thread-to-processor allocation. The advantage of this approach is lower memory usage and less direct inter-thread communication. The disadvantage of using OpenMP is the cost of communication barriers and synchronization: since the main requirement of the problem is that the entire system must reach a stable state, a check of the whole hot plate is required every nth iteration. This means that threads have to be synchronized and joined, which significantly increases the program's execution time.
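As an illustration, a minimal sketch of what the OpenMP update and check might look like is given below. The stencil, array names, plate size, check interval, and convergence threshold are assumptions made for the sketch, not details taken from the original code.

    #include <omp.h>
    #include <math.h>
    #include <string.h>

    #define N 768            /* assumed plate size, matching the 768x768 problem  */
    #define CHECK_EVERY 10   /* every-10th-iteration check described in the paper */
    #define EPS 0.1f         /* assumed convergence threshold                     */

    /* cur/nxt hold the plate; fixed marks cells whose temperature never changes */
    void hotplate_omp(float cur[N][N], float nxt[N][N], const char fixed[N][N])
    {
        int steady = 0;
        for (int iter = 1; !steady; iter++) {
            /* threads update disjoint blocks of rows directly in shared memory */
            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    nxt[i][j] = fixed[i][j] ? cur[i][j]
                              : (cur[i-1][j] + cur[i+1][j] +
                                 cur[i][j-1] + cur[i][j+1] + 4.0f * cur[i][j]) / 8.0f;

            /* the implicit join at the end of the parallel region is the
               synchronization cost discussed above; the whole-plate check
               runs only every CHECK_EVERY iterations */
            if (iter % CHECK_EVERY == 0) {
                steady = 1;
                for (int i = 1; i < N - 1 && steady; i++)
                    for (int j = 1; j < N - 1; j++)
                        if (fabsf(nxt[i][j] - cur[i][j]) > EPS) { steady = 0; break; }
            }
            memcpy(cur, nxt, sizeof(float) * N * N);
        }
    }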

The PThreads implementation follows the OpenMP approach and has the same advantages and disadvantages. In essence, it is OpenMP at a lower level.
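A sketch of what "a lower level" means in practice is shown below: one iteration spawns workers explicitly and joins them all, which is the synchronization that OpenMP performs implicitly. The row partitioning, thread count, and buffer names are assumptions; the real code may differ.

    #include <pthread.h>

    #define N 768
    #define NUM_THREADS 16   /* assumed; matches the largest processor count used */

    static float cur[N][N], nxt[N][N];   /* shared plate buffers */

    /* each worker updates its own block of rows, reading shared memory directly */
    static void *update_rows(void *arg)
    {
        long id = (long)arg;
        int lo = 1 + (int)(id * (N - 2) / NUM_THREADS);
        int hi = 1 + (int)((id + 1) * (N - 2) / NUM_THREADS);
        for (int i = lo; i < hi; i++)
            for (int j = 1; j < N - 1; j++)
                nxt[i][j] = (cur[i-1][j] + cur[i+1][j] +
                             cur[i][j-1] + cur[i][j+1] + 4.0f * cur[i][j]) / 8.0f;
        return NULL;
    }

    /* one iteration: spawn the workers, then join them all */
    void hotplate_pthreads_iteration(void)
    {
        pthread_t tid[NUM_THREADS];
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_create(&tid[t], NULL, update_rows, (void *)t);
        for (long t = 0; t < NUM_THREADS; t++)
            pthread_join(tid[t], NULL);
    }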

MPI, or Message Passing Interface, is a fully distributed approach that runs on multiple processors and uses direct inter-process communication via messages. No memory is shared among the processes, and all information has to be exchanged explicitly. This approach eliminates the synchronization requirement, but it introduces the need to send blocks of memory that would simply have been shared under OpenMP.
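As an illustration of that explicit exchange, neighbouring ranks might swap boundary ("ghost") rows each iteration as sketched below. The row-block decomposition, buffer layout, and message tags are assumptions made for the sketch.

    #include <mpi.h>

    #define N 768   /* assumed plate size */

    /* local holds this rank's rows plus one ghost row above (row 0) and one
       below (row local_rows + 1); the real rows are 1 .. local_rows */
    void exchange_boundaries(float *local, int local_rows, int rank, int nprocs)
    {
        MPI_Status st;

        /* send my top real row to the rank above, receive its bottom row
           into my top ghost row */
        if (rank > 0)
            MPI_Sendrecv(local + N, N, MPI_FLOAT, rank - 1, 0,
                         local,     N, MPI_FLOAT, rank - 1, 0,
                         MPI_COMM_WORLD, &st);

        /* send my bottom real row to the rank below, receive its top row
           into my bottom ghost row */
        if (rank < nprocs - 1)
            MPI_Sendrecv(local + local_rows * N,       N, MPI_FLOAT, rank + 1, 0,
                         local + (local_rows + 1) * N, N, MPI_FLOAT, rank + 1, 0,
                         MPI_COMM_WORLD, &st);
    }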

Experimental setup.

The implementations were made to match each other in as many details as possible, to minimize discrepancies due to algorithmic inconsistencies. This included using the same serial optimizations with the same parameters: in every implementation the state of the whole hot plate was checked every 10th iteration, although different mechanisms had to be used for that check (thread joining in OpenMP/PThreads and a reduce in MPI). It also included resetting the fixed temperature values on every iteration instead of checking for them.
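For instance, the every-10th-iteration check under MPI might be expressed with a reduction like the one below. The paper says only "MPI reduce"; using MPI_Allreduce with a logical AND is an assumption about the exact call.

    #include <mpi.h>

    #define CHECK_EVERY 10   /* check interval used in the paper */

    /* local_steady is 1 if this rank's block of rows has stopped changing;
       the result is 1 only when every rank reports a steady block */
    int plate_is_steady(int iter, int local_steady)
    {
        int global_steady = 0;
        if (iter % CHECK_EVERY != 0)
            return 0;
        MPI_Allreduce(&local_steady, &global_steady, 1,
                      MPI_INT, MPI_LAND, MPI_COMM_WORLD);
        return global_steady;
    }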

All programs were run on Marylou0, an SGI symmetric multiprocessing supercomputer at BYU, using 1, 2, 4, 8, and 16 processors. All programs were compiled with the cc compiler at optimization level 3.

Since there is no "standard" or "known best" algorithm for the hot plate, an absolute speedup metric is hard to define for performance measurements; instead, wall clock time and relative speedup are used.
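Relative speedup is taken here in its usual sense (this definition is standard and is not quoted from the original text): for p processors, S(p) = T(1) / T(p), where T(p) is the wall clock time of the same implementation on p processors.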

Results.


It is clear that the MPI implementation performs considerably better than OpenMP or PThreads. MPI also appears to scale significantly better, with an almost linear relationship between time and processor count. The speedup only looks superlinear because the horizontal axis of the plot is logarithmic; it is not actually superlinear.

Conclusion.

The experiment clearly demonstrated that the MPI implementation of the hot plate performs much faster than OpenMP or PThreads. Apparently, the thread spawning and joining overhead adversely affects overall performance, while MPI's reduce and inter-process communication proceed much faster. One thing to note is that, since the number of processors used (16) is negligible compared to the overall problem size (768x768), the amount of direct inter-process communication is small compared to the time spent computing temperature values. MPI's performance is therefore expected to become less scalable beyond 16 processors, ultimately yielding no further speedup, in line with Amdahl's Law.
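For reference, Amdahl's Law (standard statement, not quoted from the original text) bounds speedup as S(p) <= 1 / ((1 - f) + f / p), where f is the parallelizable fraction of the run time; as p grows the bound approaches 1 / (1 - f), so serial work and communication overhead eventually cap any further speedup.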
