
COPYRIGHT 2008, PRINCETON UNIVERSITY PRESS


D.4 Parallel Tuning

Recall the Tune program with which we experimented in Chapter 14, "High-Performance Computing Hardware, Tuning, and Parallel Computing," to determine how memory access for a large matrix affects the running time of programs. You may also recall that as the size of the matrix was made larger, the execution time increased more rapidly than the number of operations the program had to perform, with the increase coming from the time it took to transfer the needed matrix elements in and out of central memory.

Because parallel programming on a multiprocessor also involves a good deal of data transfer, the Tune program is also a good teaching tool for seeing how communication costs affect parallel computations. Listing D.7 gives the program TuneMPI.c, a modified version of the Tune program in which each row of the large-matrix multiplication is performed on a different processor using MPI:

\[
[H]_{N\times N} \times [\Psi]_{N\times 1} =
\begin{bmatrix}
\Rightarrow\ \text{rank 1}\ \Rightarrow \\
\Rightarrow\ \text{rank 2}\ \Rightarrow \\
\Rightarrow\ \text{rank 3}\ \Rightarrow \\
\Rightarrow\ \text{rank 1}\ \Rightarrow \\
\Rightarrow\ \text{rank 2}\ \Rightarrow \\
\Rightarrow\ \text{rank 3}\ \Rightarrow \\
\Rightarrow\ \text{rank 1}\ \Rightarrow \\
\vdots
\end{bmatrix}_{N\times N}
\times
\begin{bmatrix}
\psi_1 \Downarrow \\
\psi_2 \Downarrow \\
\psi_3 \Downarrow \\
\psi_4 \Downarrow \\
\psi_5 \Downarrow \\
\psi_6 \Downarrow \\
\psi_7 \Downarrow \\
\vdots
\end{bmatrix}_{N\times 1}.
\tag{D.4.1}
\]

Here the arrows indicate how each row of H is multiplied by the single column of Ψ, with the multiplication of each row performed on a different processor (rank). The assignment of rows to processors continues until we run out of processors, and then it starts all over again. Since this multiplication is repeated for a number of iterations, it is the most computationally intensive part of the program, and so it makes sense to parallelize it.

On the right in Figure D.3 is the speedup curve we obtained by running TuneMPI.c. However, even if the matrix is large, the Tune program is not computationally intensive enough to overcome the cost of communication among nodes inherent in parallel computing. Consequently, to increase computing time we have inserted an inner for loop over k that takes up time but accomplishes nothing (we've all had days like that).
Slowing down the program should help make the speedup curve more realistic. The listing begins:

/* TuneMPI.c: a matrix algebra program to be tuned for performance
   N x N matrix speed tests using MPI */

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
