
An Introduction to Parallel Programming with MPI
Dr. David Cronk
Innovative Computing Lab
University of Tennessee


Day 1
Morning - Lecture
  Course Outline
  Parallel Processing
  Parallel Programming
  Introduction to MPI
  Getting started with MPI
Afternoon - Lab
  Hands on exercises building and executing basic MPI programs


Day 2
Course Outline (cont)
Morning - Lecture
  Communication Modes
  Collective communication
Afternoon - Lab
  Hands on exercises using various communication methods in MPI


Course Outline (cont)
Day 3
Morning - Lecture
  Derived Datatypes
  What else is there
    Groups and communicators
    MPI-2
      Dynamic process management
      One-sided communication
      MPI-I/O
Afternoon - Lab
  Hands on exercises creating and using derived datatypes


Day 1
Morning - Lecture
  Course Outline
  Parallel Processing
  Parallel Programming
  Introduction to MPI
  Getting started with MPI
Afternoon - Lab
  Hands on exercises building and executing basic MPI programs


Parallel Processing
Outline
› What is parallel processing
› An example of parallel processing
› Why use parallel processing


Parallel Processing
What is parallel processing
› Using multiple processors to solve a single problem
  • Task parallelism
    – The problem consists of a number of independent tasks
    – Each processor or group of processors can perform a separate task
  • Data parallelism
    – The problem consists of dependent tasks
    – Each processor works on a different part of the data


Parallel Processing

$$\int_0^1 \frac{4}{1+x^2}\,dx = \pi$$

We can approximate the integral as a sum of rectangles:

$$\sum_{i=0}^{N} F(x_i)\,\Delta x$$


Parallel Processing
(figures)


Parallel Processing
Why parallel processing
› Faster time to completion
  • Computation can be performed faster with more processors
› Able to run larger jobs or at a higher resolution
  • Larger jobs can complete in a reasonable amount of time on multiple processors
  • Data for larger jobs can fit in memory when spread out across multiple processors


Day 1
Morning - Lecture
  Course Outline
  Parallel Processing
  Parallel Programming
  Introduction to MPI
  Getting started with MPI
Afternoon - Lab
  Hands on exercises building and executing basic MPI programs


Parallel Programming
Outline
› Programming models
› Message passing issues
  • Data distribution
  • Flow control


Parallel Programming
Programming models
› Shared memory
  • All processes have access to global memory
› Distributed memory (message passing)
  • Processes have access to only local memory. Data is shared via explicit message passing
› Combination shared/distributed
  • Groups of processes share access to "local" data while data is shared between groups via explicit message passing


Parallel Programming
Message Passing Issues
› Data Distribution
  • Minimize overhead
    – Latency (message start-up time)
      » Few large messages is better than many small ones
    – Memory movement
  • Maximize load balance
    – Less idle time waiting for data or synchronizing
    – Each process should do about the same work
› Flow Control
  • Minimize waiting


Data Distribution
(figures)


Flow Control
(figure: processes 0, 1, 2, 3, 4, 5, ...; process 0 sends to 1; each later process receives from its left neighbor and then sends to its right neighbor)


Day 1
Morning - Lecture
  Course Outline
  Parallel Processing
  Parallel Programming
  Introduction to MPI
  Getting started with MPI
Afternoon - Lab
  Hands on exercises building and executing basic MPI programs


Introduction to MPI
Outline
› What is MPI
  • History
  • Origins
  • What MPI is not
  • Specifications
  • What's new and in the future


Introduction to MPI
What is MPI
› MPI stands for "Message Passing Interface"
› In ancient times (late 1980s, early 1990s) each vendor had its own message passing library
  • Non-portable code
  • Not enough people doing parallel computing due to lack of standards


What is MPI
April 1992 was the beginning of the MPI forum
› Organized at SC92
› Consisted of hardware vendors, software vendors, academicians, and end users
› Held 2-day meetings every 6 weeks
› Created drafts of the MPI standard
› This standard was to include all the functionality believed to be needed to make the message passing model a success
› Final version released May 1994


What is MPI
A standard library specification!
› Defines syntax and semantics of an extended message passing model
› It is not a language or compiler specification
› It is not a specific implementation
› It does not give implementation specifics
  • Hints are offered, but implementers are free to do things however they want
  • Different implementations may do the same thing in a very different manner
› http://www.mpi-forum.org


What is MPI
A library specification designed to support parallel computing in a distributed memory environment
› Routines for cooperative message passing
  • There is a sender and a receiver
  • Point-to-point communication
  • Collective communication
› Routines for synchronization
› Derived data types for non-contiguous data access patterns
› Ability to create sub-sets of processors
› Ability to create process topologies


What is MPI
Continuing to grow!
› New routines have been added to replace old routines
› New functionality has been added
  • Dynamic process management
  • One sided communication
  • Parallel I/O


TAKE A BREAK


Day 1
Morning - Lecture
  Course Outline
  Parallel Processing
  Parallel Programming
  Introduction to MPI
  Getting started with MPI
Afternoon - Lab
  Hands on exercises building and executing basic MPI programs


Getting Started with MPI
Outline
› Introduction
› 6 basic functions
› Basic program structure
› Groups and communicators
› A very simple program
› Message passing
› A simple message passing example
› Types of programs
  • Traditional
  • Master/Slave
  • Examples
› Unsafe communication


Getting Started with MPI
MPI contains 125 routines (more with the extensions)!
Many programs can be written with just 6 MPI routines!
Upon startup, all processes can be identified by their rank, which goes from 0 to N-1 where there are N processes


6 Basic Functions
MPI_INIT: Initialize MPI
MPI_FINALIZE: Finalize MPI
MPI_COMM_SIZE: How many processes are running
MPI_COMM_RANK: What is my process number
MPI_SEND: Send a message
MPI_RECV: Receive a message


MPI_INIT (ierr)
ierr: Integer error return value. 0 on success, non-zero on failure.
This MUST be the first MPI routine called in any program.
Can only be called once
Sets up the environment to enable message passing
This is NOT how you run an MPI program


MPI_FINALIZE (ierr)
ierr: Integer error return value. 0 on success, non-zero on failure.
This routine must be called by each process before it exits
This call cleans up all MPI state
No other MPI routines may be called after MPI_FINALIZE
All pending communication must be completed (locally) before a call to MPI_FINALIZE
There must be no unmatched communication calls when the last process finalizes
Only the process with rank 0 is guaranteed to return


Basic Program Structure

Fortran:
      program main
      include 'mpif.h'
      integer ierr
      call MPI_INIT (ierr)
      ...
      Do some work
      ...
      call MPI_FINALIZE (ierr)
      Maybe do some additional serial computation if rank 0
      ...
      end

C:
      #include "mpi.h"
      int main (int argc, char **argv)
      {
         MPI_Init (&argc, &argv);
         ...
         Do some work
         ...
         MPI_Finalize ();
         Maybe do some additional serial computation if rank 0
         ...
      }


Groups and communicators
We will not be using this, but it is important to understand so you understand other routines
Groups can be thought of as sets of processes
These groups are associated with what are called "communicators"
A communicator defines a "communication domain"
Upon startup, there is a single set of processes associated with the communicator MPI_COMM_WORLD
Groups can be created which are sub-sets of this original group, also associated with communicators


MPI_COMM_RANK (comm, rank, ierr)
comm: Communicator (integer in Fortran, MPI_Comm in C). We will always use MPI_COMM_WORLD
rank: Returned rank of calling process. Between 0 and n-1 with n processes
ierr: Integer error return code
This routine returns the relative rank of the calling process within the group associated with comm.


MPI_COMM_SIZE (comm, size, ierr)
comm: Communicator handle
size: Upon return, the number of processes in the group associated with comm. For our purposes, always the total number of processes
This routine returns the number of processes in the group associated with comm


A Very Simple Program
Hello World

      program main
      include 'mpif.h'
      integer ierr, size, rank
      call MPI_INIT (ierr)
      call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierr)
      print *, 'Hello World from process', rank, 'of', size
      call MPI_FINALIZE (ierr)
      end


Hello World

> mpirun -np 4 a.out
>
> Hello World from 2 of 4
> Hello World from 0 of 4
> Hello World from 3 of 4
> Hello World from 1 of 4

> mpirun -np 4 a.out
>
> Hello World from 3 of 4
> Hello World from 1 of 4
> Hello World from 2 of 4
> Hello World from 0 of 4


Message Passing
Message passing is the transfer of data from one process to another
› This transfer requires cooperation of the sender and the receiver, but is initiated by the sender
› There must be a way to "describe" the data
› There must be a way to identify specific processes
› There must be a way to identify kinds of messages


Message Passing
Data is described by a triple
1. Address: Where is the data stored
2. Datatype: What is the type of the data
   › Basic types (integers, reals, etc)
   › Derived types (good for non-contiguous data access)
3. Count: How many elements make up the message


Message Passing
Processes are specified by a double
1. Communicator: We will always use MPI_COMM_WORLD
2. Rank: The relative rank of the specified process within the group associated with the communicator
Messages are identified by a single tag
   This can be used to differentiate between different kinds of messages


MPI_SEND (buf, cnt, dtype, dest, tag, comm, ierr)
buf: The address of the beginning of the data to be sent
cnt: The number of elements to be sent
dtype: datatype of each element
dest: The rank of the destination
tag: Integer message tag
   0 <= tag <= MPI_TAG_UB (at least 32767)
comm: The communicator


MPI_STATUS
Since a receive call only specifies the maximum number of elements to receive, there must be a way to find out the number of elements actually received
MPI_Get_count (status, datatype, count, ierr)
   This routine returns in count the actual number of elements received by a call that returned status
MPI_STATUS_IGNORE can be used to avoid the overhead of filling in the status


Send/Recv Example

      program main
      include 'mpif.h'
      CHARACTER*20 msg
      integer ierr, rank, tag, status(MPI_STATUS_SIZE)
      tag = 99
      call MPI_INIT (ierr)
      call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierr)
      if (rank .eq. 0) then
         msg = 'Hello there'
         call MPI_SEND (msg, 11, MPI_CHARACTER, 1, tag, &
                        MPI_COMM_WORLD, ierr)
      else if (rank .eq. 1) then
         call MPI_RECV (msg, 20, MPI_CHARACTER, 0, tag, &
                        MPI_COMM_WORLD, status, ierr)
         write (*,*) msg
      endif
      call MPI_FINALIZE (ierr)
      end


Types <strong>of</strong> data parallel MPI<br />

Programs<br />

Peer/SPMD<br />

› Break the problem up into about even sized parts<br />

and distribute across all processors<br />

› What if problem is such that you cannot tell how<br />

much work must be done on each part<br />

Master/Slave<br />

› Break the problem up into many more parts than<br />

there are processors<br />

› Master sends work to slaves<br />

› Parts may be all the same size or the size may<br />

vary<br />

<strong>David</strong> <strong>Cronk</strong> 45


Data Distribution
(figures)
1-D block distribution
2-D block-block distribution
2-D block cyclic distribution


Peer/SPMD Example
Compute the sum of a large array of N integers

      comm = MPI_COMM_WORLD
      call MPI_COMM_RANK (comm, rank, ierr)
      call MPI_COMM_SIZE (comm, npes, ierr)
      stride = N/npes
      start = (stride * rank) + 1
      sum = 0
      DO I = start, start+stride-1
         sum = sum + array(I)
      ENDDO
      IF (rank .eq. 0) THEN
         DO I = 1, npes-1
            call MPI_RECV (tmp, 1, MPI_INTEGER, &
                           I, 2, comm, status, ierr)
            sum = sum + tmp
         ENDDO
      ELSE
         call MPI_SEND (sum, 1, MPI_INTEGER, &
                        0, 2, comm, ierr)
      ENDIF


Master/Slave Example
Do some computation for each of N elements in a large array.
The amount of computation depends on the value of each element.

Main:
      call MPI_INIT (ierr)
      comm = MPI_COMM_WORLD
      call MPI_COMM_RANK (comm, rank, ierr)
      call MPI_COMM_SIZE (comm, size, ierr)
      stride = N/(10*(size-1))
      IF (rank .eq. 0) THEN
         call master ()
      ELSE
         call slave ()
      ENDIF
      call MPI_FINALIZE (ierr)


Master/Slave Example
Master code:

      current = 1
      DO I = 1, size-1
         call MPI_SEND (info(current), stride, MPI_INTEGER, &
                        I, 2, comm, ierr)
         current = current + stride
      ENDDO
      DO I = 1, 9*(size-1)
         call MPI_RECV (tmp, 0, MPI_BYTE, &
                        MPI_ANY_SOURCE, 3, comm, status, ierr)
         source = status(MPI_SOURCE)
         call MPI_SEND (info(current), stride, MPI_INTEGER, &
                        source, 2, comm, ierr)
         current = current + stride
      ENDDO
      DO I = 1, size-1
         call MPI_SEND (info, 0, MPI_INTEGER, &
                        I, 4, comm, ierr)
      ENDDO


Master/Slave Example
Slave code:

 100  call MPI_RECV (info, stride, MPI_INTEGER, 0, MPI_ANY_TAG, comm, &
                     status, ierr)
      if (status(MPI_TAG) .eq. 4) goto 200
      do I = 1, stride
         call do_compute (info(I))
      enddo
      call MPI_SEND (tmp, 0, MPI_BYTE, 0, 3, comm, ierr)
      goto 100
 200  continue


Master/Slave Example
Master code:

      current = 1
      DO I = 1, size-1
         call MPI_SEND (info(current), stride, MPI_INTEGER, &
                        I, 2, comm, ierr)
         current = current + stride
      ENDDO
      DO I = 1, 9*(size-1)
         call MPI_RECV (tmp, 0, MPI_BYTE, &
                        MPI_ANY_SOURCE, 3, comm, status, ierr)
         source = status(MPI_SOURCE)
         call MPI_SEND (info(current), stride, MPI_INTEGER, &
                        source, 2, comm, ierr)
         current = current + stride
      ENDDO
      DO I = 1, size-1
         call MPI_RECV (tmp, 0, MPI_BYTE, &
                        I, 3, comm, status, ierr)
         call MPI_SEND (info, 0, MPI_INTEGER, &
                        I, 4, comm, ierr)
      ENDDO


Unsafe Communication Patterns
› Process 0 and process 1 must exchange data
› Process 0 sends data to process 1 and then receives data from process 1
› Process 1 sends data to process 0 and then receives data from process 0
› If there is not enough system buffer space for either message, this will deadlock
› Any communication pattern that relies on system buffers is unsafe
› Any pattern that includes a cycle of blocking sends is unsafe


Unsafe Communication Patterns
(figure: two patterns in which every process issues a blocking send before its receive - both have potential deadlock)


Jacobi <strong>It</strong>eration<br />

0<br />

0<br />

M+1<br />

A<br />

exchange<br />

exchange<br />

exchange<br />

…….<br />

N+1<br />

<strong>David</strong> <strong>Cronk</strong> 54


Jacobi <strong>It</strong>eration<br />

Initializations<br />

.<br />

.<br />

DO WHILE (.NOT. Converged(A))<br />

.<br />

do iteration<br />

.<br />

.<br />

! Perform exchange<br />

IF (myrank .GT. 0) THEN<br />

CALL MPI_SEND (A(1,1), n, MPI_REAL, &<br />

myrank-1, tag, comm, ierr)<br />

ENDIF<br />

IF (myrank .LT. nprocs-1) THEN<br />

CALL MPI_SEND (A(1,m), n, MPI_REAL, &<br />

myrank+1, tag, comm, ierr)<br />

ENDIF<br />

IF (myrank .GT. 0) THEN<br />

CALL MPI_RECV (A(1,0), n, MPI_REAL, &<br />

myrank-1, tag, comm, status, ierr)<br />

ENDIF<br />

IF (myrank .LT. nprocs-1) THEN<br />

CALL MPI_RECV (A(1,m+1), n, MPI_REAL, &<br />

myrank+1, tag, comm, status, ierr)<br />

ENDIF<br />

ENDDO<br />

POTENTIAL DEADLOCK!<br />

<strong>David</strong> <strong>Cronk</strong> 55


Jacobi <strong>It</strong>eration<br />

! Perform exchange<br />

IF (MOD(myrank,2) .EQ. 1) THEN<br />

CALL MPI_SEND (A(1,1), n, MPI_REAL, &<br />

myrank-1, tag, comm, ierr)<br />

IF (myrank .LT. Nprocs-1) THEN<br />

CALL MPI_SEND (A(1,m), n, MPI_REAL, &<br />

myrank+1, tag, comm, ierr)<br />

ENDIF<br />

CALL MPI_RECV (A(1,0), n, MPI_REAL, &<br />

myrank-1, tag, comm, status, ierr)<br />

IF (myrank .LT. Nprocs-1) THEN<br />

CALL MPI_RECV (A(1,m+1), n, MPI_REAL, &<br />

myrank+1, tag, comm, status, ierr)<br />

ENDIF<br />

ENDIF<br />

ELSE<br />

IF (myrank .GT. 0) THEN<br />

CALL MPI_RECV (A(1,0), n, MPI_REAL, &<br />

myrank-1, tag, comm, status, ierr)<br />

ENDIF<br />

IF (myrank .LT. nprocs-1) THEN<br />

CALL MPI_RECV (A(1,m+1), n, MPI_REAL, &<br />

myrank+1, tag, comm, status, ierr)<br />

ENDIF<br />

IF (myrank .GT. 0) THEN<br />

CALL MPI_SEND (A(1,1), n, MPI_REAL, &<br />

myrank-1, tag, comm, ierr)<br />

ENDIF<br />

IF (myrank .LT. nprocs-1) THEN<br />

CALL MPI_SEND (A(1,m), n, MPI_REAL, &<br />

myrank+1, tag, comm, ierr)<br />

ENDIF<br />

ENDIF<br />

<strong>David</strong> <strong>Cronk</strong> 56


Day 2
Course Outline (cont)
Morning - Lecture
  Communication Modes
  Collective communication
Afternoon - Lab
  Hands on exercises using various communication methods in MPI


Communication Modes
Outline
› Standard mode
  • Blocking
  • Non-blocking
› Non-standard mode
  • Buffered
  • Synchronous
  • Ready
› Performance issues


Point-to-Point Communication Modes
Standard Mode:
Blocking:
   MPI_SEND (buf, count, datatype, dest, tag, comm, ierr)
   MPI_RECV (buf, count, datatype, source, tag, comm, status, ierr)
Generally ONLY use if you cannot call earlier AND there is no other work that can be done!
The standard ONLY states that buffers can be used once the calls return. It is implementation dependent when blocking calls return.
Blocking sends MAY block until a matching receive is posted. This is not required behavior, but the standard does not prohibit this behavior either. Further, a blocking send may have to wait for system resources such as system managed message buffers.
Be VERY careful of deadlock when using blocking calls!


Point-to-Point Communication Modes (cont)
Standard Mode:
Non-blocking (immediate) sends/receives:
   MPI_ISEND (buf, count, datatype, dest, tag, comm, request, ierr)
   MPI_IRECV (buf, count, datatype, source, tag, comm, request, ierr)
   MPI_WAIT (request, status, ierr)
   MPI_WAITANY (count, requests, index, status, ierr)
   MPI_WAITALL (count, requests, statuses, ierr)
   MPI_TEST (request, flag, status, ierr)
   MPI_TESTANY (count, requests, index, flag, status, ierr)
   MPI_TESTALL (count, requests, flag, statuses, ierr)
Allows communication calls to be posted early, which may improve performance.
   Overlap computation and communication
   Latency tolerance
   Less (or no) buffering
* MUST either complete these calls (with wait or test) or call MPI_REQUEST_FREE


MPI_ISEND (buf, cnt, dtype, dest, tag, comm, request, ierr)
Same syntax as MPI_SEND with the addition of a request handle
Request is a handle (integer in Fortran, MPI_Request in C) used to check for completeness of the send
This call returns immediately
Data in buf may not be accessed until the user has completed the send operation
The send is completed by a successful call to MPI_TEST or a call to MPI_WAIT


MPI_IRECV (buf, cnt, dtype, source, tag, comm, request, ierr)
Same syntax as MPI_RECV except status is replaced with a request handle
Request is a handle (integer in Fortran, MPI_Request in C) used to check for completeness of the recv
This call returns immediately
Data in buf may not be accessed until the user has completed the receive operation
The receive is completed by a successful call to MPI_TEST or a call to MPI_WAIT


MPI_WAIT (request, status, ierr)
Request is the handle returned by the non-blocking send or receive call
Upon return, status holds source, tag, and error code information
This call does not return until the non-blocking call referenced by request has completed
Upon return, the request handle is freed
If request was returned by a call to MPI_ISEND, return of this call indicates nothing about the destination process


MPI_WAITANY (count, requests, index, status, ierr)
Requests is an array of handles returned by non-blocking send or receive calls
Count is the number of requests
This call does not return until a non-blocking call referenced by one of the requests has completed
Upon return, index holds the index into the array of requests of the call that completed
Upon return, status holds source, tag, and error code information for the call that completed
Upon return, the request handle stored in requests[index] is freed


MPI_WAITALL (count, requests, statuses, ierr)
requests is an array of handles returned by non-blocking send or receive calls
count is the number of requests
This call does not return until all non-blocking calls referenced by requests have completed
Upon return, statuses hold source, tag, and error code information for all the calls that completed
Upon return, the request handles stored in requests are all freed


MPI_TEST (request, flag, status, ierr)
request is a handle returned by a non-blocking send or receive call
Upon return, flag will have been set to true if the associated non-blocking call has completed. Otherwise it is set to false
If flag returns true, the request handle is freed and status contains source, tag, and error code information
If request was returned by a call to MPI_ISEND, return with flag set to true indicates nothing about the destination process


MPI_TESTANY (count, requests, index, flag, status, ierr)
requests is an array of handles returned by non-blocking send or receive calls
count is the number of requests
Upon return, flag will have been set to true if one of the associated non-blocking calls has completed. Otherwise it is set to false
If flag returns true, index holds the index of the call that completed, the request handle is freed, and status contains source, tag, and error code information


MPI_TESTALL (count, requests, flag, statuses, ierr)
requests is an array of handles returned by non-blocking send or receive calls
count is the number of requests
Upon return, flag will have been set to true if ALL of the associated non-blocking calls have completed. Otherwise it is set to false
If flag returns true, all the request handles are freed, and statuses contains source, tag, and error code information for each operation


Non-blocking Communication

Clearly unsafe (cycle of blocking sends):

 100  continue
      if (err .lt. delta) goto 200
      do some computation
      do I = 0, npes-1
         if (I .ne. myrank) then
            set up data to send
            call MPI_SEND (sdata(I), cnt, dtype, &
                           I, tag, comm, ierr)
         endif
      enddo
      do I = 0, npes-1
         if (I .ne. myrank) then
            set up data to recv
            call MPI_RECV (rdata(I), cnt, dtype, &
                           I, tag, comm, status, ierr)
         endif
      enddo
      goto 100

May run out of handles (every MPI_ISEND overwrites the same request):

 100  continue
      if (err .lt. delta) goto 200
      do some computation
      do I = 0, npes-1
         if (I .ne. myrank) then
            set up data to send
            call MPI_ISEND (sdata(I), cnt, dtype, &
                            I, tag, comm, request, ierr)
         endif
      enddo
      do I = 0, npes-1
         if (I .ne. myrank) then
            set up data to recv
            call MPI_RECV (rdata(I), cnt, dtype, &
                           I, tag, comm, status, ierr)
         endif
      enddo
      goto 100


Non-blocking Communication

Unsafe again (waits for all sends before receiving):

 100  continue
      .......
      J = 0
      do I = 0, npes-1
         if (I .ne. myrank) then
            J = J+1
            set up data to send
            call MPI_ISEND (sdata(I), cnt, dtype, &
                            I, tag, comm, request(J), ierr)
         endif
      enddo
      call MPI_WAITALL (npes-1, request, &
                        statuses, ierr)
      .......
      receive data
      .......

Safe.......but (receives before waiting on the sends):

 100  continue
      .......
      J = 0
      do I = 0, npes-1
         if (I .ne. myrank) then
            J = J+1
            set up data to send
            call MPI_ISEND (sdata(I), cnt, dtype, &
                            I, tag, comm, request(J), ierr)
         endif
      enddo
      .......
      receive data
      .......
      call MPI_WAITALL (npes-1, request, &
                        statuses, ierr)


Non-blocking communication

Safe, and pretty good:

      J = 0
      do I = 0, npes-1
         if (I .ne. myrank) then
            J = J+1
            call MPI_ISEND (sdata(I), cnt, dtype, &
                            I, tag, comm, s_request(J), ierr)
         endif
      enddo
      J = 0
      do I = 0, npes-1
         if (I .ne. myrank) then
            J = J+1
            call MPI_IRECV (rdata(I), cnt, dtype, &
                            I, tag, comm, r_request(J), ierr)
         endif
      enddo
      call MPI_WAITALL (..., s_request, ...)
      call MPI_WAITALL (..., r_request, ...)


Jacobi <strong>It</strong>eration<br />

0<br />

0<br />

M+1<br />

A<br />

exchange<br />

exchange<br />

exchange<br />

…….<br />

N+1<br />

<strong>David</strong> <strong>Cronk</strong> 72


Jacobi <strong>It</strong>eration<br />

IF (myrank .EQ. 0) THEN<br />

left = MPI_PROC_NULL<br />

ELSE<br />

left = myrank-1<br />

ENDIF<br />

IF (myrank .EQ. nprocs) THEN<br />

right = MPI_NULL_PROC<br />

ELSE<br />

right = myrank+1<br />

.<br />

.<br />

Compute boundary columns<br />

.<br />

.<br />

! Perform exchange<br />

CALL MPI_ISEND (A(1,1), n, MPI_REAL, &<br />

left, tag, comm, requests(1), ierr)<br />

CALL MPI_ISEND (A(1,m), n, MPI_REAL, &<br />

right, tag, comm, requests(2), ierr)<br />

CALL MPI_IRECV (A(1,0), n, MPI_REAL, &<br />

left, tag, comm, requests(3), ierr)<br />

CALL MPI_IRECV (A(1,m+1), n, MPI_REAL, &<br />

right, tag, comm, requests(4), ierr)<br />

.<br />

.<br />

Compute interior<br />

.<br />

.<br />

CALL MPI_WAITALL (4, requests, statuses, ierr)<br />

<strong>David</strong> <strong>Cronk</strong> 73


Point-to-Point Communication Modes (cont)
Non-standard mode communication
› Only used by the sender! (MPI uses the push communication model)
› Buffered mode - A buffer must be provided by the application
› Synchronous mode - Completes only after a matching receive has been posted
› Ready mode - May only be called when a matching receive has already been posted


Point-to-Point Communication Modes: Buffered
   MPI_BSEND (buf, count, datatype, dest, tag, comm, ierr)
   MPI_IBSEND (buf, count, dtype, dest, tag, comm, req, ierr)
   MPI_BUFFER_ATTACH (buff, size, ierr)
   MPI_BUFFER_DETACH (buff, size, ierr)
Buffered sends do not rely on system buffers
The user supplies a buffer that MUST be large enough for all messages
User need not worry about calls blocking, waiting for system buffer space
The buffer is managed by MPI
The user MUST ensure there is no buffer overflow


Buffered Sends

Seg violation (buffer pointer never allocated):

      #define BUFFSIZE 1000
      char *buff;
      char b1[500], b2[500];
      MPI_Buffer_attach (buff, BUFFSIZE);

Buffer overflow (no room for two messages plus overhead):

      #define BUFFSIZE 1000
      char *buff;
      char b1[500], b2[500];
      buff = (char*) malloc (BUFFSIZE);
      MPI_Buffer_attach (buff, BUFFSIZE);
      MPI_Bsend (b1, 500, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
      MPI_Bsend (b2, 500, MPI_CHAR, 2, 1, MPI_COMM_WORLD);

Safe:

      int size;
      char *buff;
      char b1[500], b2[500];
      MPI_Pack_size (500, MPI_CHAR, MPI_COMM_WORLD, &size);
      size += MPI_BSEND_OVERHEAD;
      size = size * 2;
      buff = (char*) malloc (size);
      MPI_Buffer_attach (buff, size);
      MPI_Bsend (b1, 500, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
      MPI_Bsend (b2, 500, MPI_CHAR, 2, 1, MPI_COMM_WORLD);
      MPI_Buffer_detach (&buff, &size);


Point-to-Point Communication Modes: Synchronous
   MPI_SSEND (buf, count, datatype, dest, tag, comm, ierr)
   MPI_ISSEND (buf, count, dtype, dest, tag, comm, req, ierr)
Can be started (called) at any time.
Does not complete until a matching receive has been posted and the receive operation has been started
   * Does NOT mean the matching receive has completed
Can be used in place of sending and receiving acknowledgements
Can be more efficient when used appropriately
   buffering may be avoided


Point-to-Point Communication Modes: Ready Mode
   MPI_RSEND (buf, count, datatype, dest, tag, comm, ierr)
   MPI_IRSEND (buf, count, dtype, dest, tag, comm, req, ierr)
May ONLY be started (called) if a matching receive has already been posted.
If a matching receive has not been posted, the results are undefined
May be most efficient when appropriate
   Removal of handshake operation
Should only be used with extreme caution


Ready Mode

MASTER:
      while (!done) {
         MPI_Recv (NULL, 0, MPI_INT, MPI_ANY_SOURCE,
                   1, MPI_COMM_WORLD, &status);
         source = status.MPI_SOURCE;
         get_work (.....);
         MPI_Rsend (buff, count, datatype, source, 2,
                    MPI_COMM_WORLD);
         if (no more work) done = TRUE;
      }

SLAVE (UNSAFE):
      while (!done) {
         MPI_Send (NULL, 0, MPI_INT, MASTER,
                   1, MPI_COMM_WORLD);
         MPI_Recv (buff, count, datatype, MASTER,
                   2, MPI_COMM_WORLD, &status);
         ...
      }

SLAVE (SAFE):
      while (!done) {
         MPI_Irecv (buff, count, datatype, MASTER,
                    2, MPI_COMM_WORLD, &request);
         MPI_Ssend (NULL, 0, MPI_INT, MASTER,
                    1, MPI_COMM_WORLD);
         MPI_Wait (&request, &status);
         ...
      }


Point-to-Point Communication Modes: Performance Issues
› Non-blocking calls are almost always the way to go
  • Communication can be carried out during blocking system calls
  • Computation and communication can be overlapped if there is special purpose communication hardware
  • Less likely to have errors that lead to deadlock
› Standard mode is usually sufficient - but buffered mode can offer advantages
  • Particularly if there are frequent, large messages being sent
  • If the user is unsure the system provides sufficient buffer space
› Synchronous mode can be more efficient if acks are needed
  • Also tells the system that buffering is not required


TAKE A BREAK


Day 2
Course Outline (cont)
Morning - Lecture
  Communication Modes
  Collective communication
Afternoon - Lab
  Hands on exercises using various communication methods in MPI


Collective Communication
OUTLINE
› Introduction
› Synchronization
› Global Collective Communication
› Reduction
  • Reduction routines
  • User defined operations
› Performance issues


Collective Communication
Amount of data sent must exactly match the amount of data received
Collective routines are collective across an entire communicator and must be called in the same order from all processors within the communicator
Collective routines are all blocking
   This simply means buffers can be re-used upon return
Collective routines may return as soon as the calling process' participation is complete
   Does not say anything about the other processors
Collective routines may or may not be synchronizing
No mixing of collective and point-to-point communication


Collective Communication
Synchronization
Barrier: MPI_BARRIER (comm, ierr)
Only collective routine which provides explicit synchronization
Returns at any processor only after all processes have entered the call
Barrier can be used to ensure all processes have reached a certain point in the computation


Collective Communication
Global Communication Routines:
› Except broadcast, each routine has 2 variants:
  • Standard variant: All messages are the same size
  • Vector variant: Each item is a vector of possibly varying length
› If there is a single origin or destination, it is referred to as the root
› Each routine (except broadcast) has distinct send and receive arguments
› Send and receive buffers must be disjoint
› Each can use MPI_IN_PLACE, which allows the user to specify that data contributed by the caller is already in its final location.


Collective Communication: Bcast
MPI_BCAST (buffer, count, datatype, root, comm, ierr)
This routine copies count elements of type datatype from buffer on the process with rank root into the location specified by buffer on all other processes in the group referenced by the communicator comm
All processes must supply the same value for the argument root
MPI_BCAST is strictly in-place; no data is moved at the root
REMEMBER: A broadcast need not be synchronizing. Returning from a broadcast tells you nothing about the status of the other processes involved in a broadcast. Furthermore, though MPI does not require MPI_BCAST to be synchronizing, neither does it prohibit synchronous behavior.


BCAST

OOPS! (the broadcast is only called on the root):

      if (myrank == root) {
         fp = fopen (filename, "r");
         fscanf (fp, "%d", &iters);
         fclose (fp);
         MPI_Bcast (&iters, 1, MPI_INT,
                    root, MPI_COMM_WORLD);
      }
      else {
         MPI_Recv (&iters, 1, MPI_INT,
                   root, tag, MPI_COMM_WORLD,
                   &status);
      }

THAT'S BETTER (every process calls the broadcast):

      if (myrank == root) {
         fp = fopen (filename, "r");
         fscanf (fp, "%d", &iters);
         fclose (fp);
      }
      MPI_Bcast (&iters, 1, MPI_INT,
                 root, MPI_COMM_WORLD);


Collective Communication: Gather
A Gather operation has data from all processes collected, or "gathered", at a central process, referred to as the "root"
Even the root process contributes data
The root process can be any process, it does not have to have any particular rank
Every process must pass the same value for the root argument
Each process must send the same amount of data
<strong>David</strong> <strong>Cronk</strong> 89


Collective Communication:<br />

Gather<br />

MPI_GATHER (sendbuf, sendcount, sendtype, recvbuf,<br />

recvcount, recvtype, root, comm, ierr)<br />

› Receive arguments are only meaningful at the root<br />

› recvcount is the number <strong>of</strong> elements received from each<br />

process<br />

› Data received at root in rank order<br />

› Root can use MPI_IN_PLACE for sendbuf:<br />

• data is assumed to be in the correct place in the recvbuf<br />

P1 = root<br />

P2<br />

P3<br />

P2<br />

MPI_GATHER<br />

root<br />

<strong>David</strong> <strong>Cronk</strong> 90


MPI_Gather

WORKS:

      for (i = 0; i < 20; i++) {
         do some computation
         tmp[i] = some value;
      }
      MPI_Gather (tmp, 20, MPI_INT, res,
                  20, MPI_INT, 0, MPI_COMM_WORLD);
      if (myrank == 0)
         write out results

A OK (root contributes in place):

      int tmp[20];
      int res[320];
      for (i = 0; i < 20; i++) {
         do some computation
         if (myrank == 0)
            res[i] = some value;
         else
            tmp[i] = some value;
      }
      if (myrank == 0) {
         MPI_Gather (MPI_IN_PLACE,
                     20, MPI_INT, res, 20, MPI_INT,
                     0, MPI_COMM_WORLD);
         write out results
      }
      else
         MPI_Gather (tmp, 20, MPI_INT,
                     tmp, 320, MPI_REAL, 0,
                     MPI_COMM_WORLD);


Collective Communication: Gatherv
MPI_GATHERV (sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm, ierr)
› Vector variant of MPI_GATHER
› Allows a varying amount of data from each proc
› Allows root to specify where data from each proc goes
› No portion of the receive buffer may be written more than once
› The amount of data specified to be received at the root must match the amount of data sent by non-roots
› Displacements are in terms of recvtype
› MPI_IN_PLACE may be used by root.


Collective Communication: Gatherv (cont)
(figure: MPI_GATHERV from P1 (root), P2, P3, P4 with counts = 1 2 3 4 and displs = 9 7 4 0; each block lands in the root's receive buffer at its displacement)


Collective Communication: Gatherv (cont)
(figure: send buffers and the strided receive buffer on the root)

      stride = 105;
      root = 0;
      for (i = 0; i < nprocs; i++) {
         displs[i] = i*stride;
         counts[i] = 100;
      }
      MPI_Gatherv (sbuff, 100, MPI_INT,
                   rbuff, counts, displs, MPI_INT,
                   root, MPI_COMM_WORLD);


Collective Communication: Gatherv (cont)
(figure)

      CALL MPI_COMM_SIZE (MPI_COMM_WORLD, nprocs, ierr)
      CALL MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierr)
      scount = myrank+1
      displs(1) = 0
      rcounts(1) = 1
      DO I = 2, nprocs
         displs(I) = displs(I-1) + I - 1
         rcounts(I) = I
      ENDDO
      CALL MPI_GATHERV (sbuff, scount, MPI_INTEGER, rbuff, rcounts, &
                        displs, MPI_INTEGER, root, MPI_COMM_WORLD, ierr)


Collective Communication: Scatter
MPI_SCATTER (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm, ierr)
› Opposite of MPI_GATHER
› Send arguments only meaningful at root
› Root can use MPI_IN_PLACE for recvbuf
(figure: MPI_SCATTER distributing buffer B on the root to P1, P2, P3, P4)


MPI_SCATTER

      IF (MYPE .EQ. ROOT) THEN
         OPEN (25, FILE='filename')
         READ (25, *) nprocs, nboxes
         READ (25, *) ((mat(I,J), I=1,nboxes), J=1,nprocs)
         CLOSE (25)
      ENDIF
      CALL MPI_BCAST (nboxes, 1, MPI_INTEGER, &
                      ROOT, MPI_COMM_WORLD, ierr)
      CALL MPI_SCATTER (mat, nboxes, MPI_INTEGER, &
                        lboxes, nboxes, MPI_INTEGER, ROOT, &
                        MPI_COMM_WORLD, ierr)


Collective Communication: Scatterv
MPI_SCATTERV (sendbuf, scounts, displs, sendtype, recvbuf, recvcount, recvtype, root, comm, ierr)
› Opposite of MPI_GATHERV
› Send arguments only meaningful at root
› Root can use MPI_IN_PLACE for recvbuf
› No location of the sendbuf can be read more than once
<strong>David</strong> <strong>Cronk</strong> 98


Collective Communication:<br />

Scatterv (cont)<br />

P1<br />

MPI_SCATTERV<br />

P2<br />

B (on root)<br />

P3<br />

counts<br />

1 2 3 4<br />

dislps<br />

9 7 4 0<br />

P4<br />

<strong>David</strong> <strong>Cronk</strong> 99


MPI_SCATTERV

C     mnb = max number of boxes
      IF (MYPE .EQ. ROOT) THEN
         OPEN (25, FILE='filename')
         READ (25, *) nprocs
         READ (25, *) (nboxes(I), I=1,nprocs)
         READ (25, *) ((mat(I,J), I=1,nboxes(J)), J=1,nprocs)
         CLOSE (25)
         DO I = 1, nprocs
            displs(I) = (I-1)*mnb
         ENDDO
      ENDIF
      CALL MPI_SCATTER (nboxes, 1, MPI_INTEGER, &
                        nb, 1, MPI_INTEGER, ROOT, MPI_COMM_WORLD, ierr)
      CALL MPI_SCATTERV (mat, nboxes, displs, MPI_INTEGER, &
                         lboxes, nb, MPI_INTEGER, ROOT, MPI_COMM_WORLD, ierr)


Collective Communication: Allgather
MPI_ALLGATHER (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)
› Same as MPI_GATHER, except all processors get the result
› MPI_IN_PLACE may be used for sendbuf of all processors
› Equivalent to a gather followed by a bcast


Collective Communication: Allgatherv
MPI_ALLGATHERV (sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, comm, ierr)
› Same as MPI_GATHERV, except all processes get the result
› MPI_IN_PLACE may be used for sendbuf of all processors
› Similar to a gatherv followed by a bcast
  • If there are "holes" in the receive buffer, bcast would overwrite the "holes"
  • displs need not be the same on each PE


Allgatherv

      int mycount;      /* inited to # local ints */
      int counts[4];    /* inited to proper values */
      int displs[4];
      displs[0] = 0;
      for (i = 1; i < 4; i++)
         displs[i] = displs[i-1] + counts[i-1];
      MPI_Allgatherv (sbuff, mycount, MPI_INT,
                      rbuff, counts, displs, MPI_INT,
                      MPI_COMM_WORLD);


Allgatherv
(figure: send buffers and the strided receive buffer)

      stride = 105;
      root = 0;
      for (i = 0; i < nprocs; i++) {
         displs[i] = i*stride;
         counts[i] = 100;
      }
      MPI_Allgatherv (sbuff, 100, MPI_INT,
                      rbuff, counts, displs, MPI_INT,
                      MPI_COMM_WORLD);


Allgatherv
(figure: per-process send buffers on p1, p2, p3, p4 gathered into every process' receive buffer)

      MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
      MPI_Comm_size (MPI_COMM_WORLD, &nprocs);
      for (i=0; i


Collective Communication: Alltoall (scatter/gather)
MPI_ALLTOALL (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm, ierr)


Collective Communication: Alltoallv
MPI_ALLTOALLV (sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, rdispls, recvtype, comm, ierr)
› Same as MPI_ALLTOALL, but the vector variant
  • Can specify how many blocks to send to each processor, location of blocks to send, how many blocks to receive from each processor, and where to place the received blocks
  • No location in the sendbuf can be read more than once and no location in the recvbuf can be written to more than once


Collective Communication: Alltoallw
MPI_ALLTOALLW (sendbuf, sendcounts, sdispls, sendtypes, recvbuf, recvcounts, rdispls, recvtypes, comm, ierr)
› Same as MPI_ALLTOALLV, except different datatypes can be specified for data scattered as well as data gathered
  • Can specify how many blocks to send to each processor, location and type of blocks to send, how many blocks to receive from each processor, the type of the blocks received, and where to place the received blocks
  • Displacements are now in terms of bytes rather than types
  • No location in the sendbuf can be read more than once and no location in the recvbuf can be written to more than once


Example

      subroutine System_Change_init ( dq, ierror)
!------------------------------------------------------------------------------
! Subroutine for data exchange on all six boundaries
!
! This routine initiates the operations; it has to be finished by a matching
! Wait/Waitall
!------------------------------------------------------------------------------
      USE globale_daten
      USE comm
      implicit none
      include 'mpif.h'
      double precision, dimension (0:n1+1,0:n2+1,0:n3+1,nc) :: dq
! local variables
      integer :: handnum, info, position
      integer :: j, k, n, ni
      integer :: ierror
      integer :: size2, size4, size6
! /* Fortran90 */
      integer, dimension (MPI_STATUS_SIZE,6) :: sendstatusfeld
      integer, dimension (MPI_STATUS_SIZE)   :: status
      double precision, allocatable, dimension (:) :: global_dq

      size2 = (n2+2)*(n3+2)*nc * SIZE_OF_REALx
      size4 = (n1+2)*(n3+2)*nc * SIZE_OF_REALx
      size6 = (n1+2)*(n2+2)*nc * SIZE_OF_REALx

      if (.not. rand_ab) then
         call MPI_IRECV(dq(1,1,1,1), recvcount(tid_io), recvtype(tid_io), &
                        tid_io, 10001, MPI_COMM_WORLD, recvhandle(1), info)
      else
         recvhandle(1) = MPI_REQUEST_NULL
      endif
      if (.not. rand_sing) then
         call MPI_IRECV(dq(1,1,1,1), recvcount(tid_iu), recvtype(tid_iu), &
                        tid_iu, 10002, MPI_COMM_WORLD, recvhandle(2), info)
      else
         recvhandle(2) = MPI_REQUEST_NULL
      endif


Example (cont)

      if (.not. rand_zu) then
         call MPI_IRECV(dq(1,1,1,1), recvcount(tid_jo), recvtype(tid_jo), &
                        tid_jo, 10003, MPI_COMM_WORLD, recvhandle(3), info)
      else
         recvhandle(3) = MPI_REQUEST_NULL
      endif
      if (.not. rand_festk) then
         call MPI_IRECV(dq(1,1,1,1), recvcount(tid_ju), recvtype(tid_ju), &
                        tid_ju, 10004, MPI_COMM_WORLD, recvhandle(4), info)
      else
         recvhandle(4) = MPI_REQUEST_NULL
      endif
      if (.not. rand_symo) then
         call MPI_IRECV(dq(1,1,1,1), recvcount(tid_ko), recvtype(tid_ko), &
                        tid_ko, 10005, MPI_COMM_WORLD, recvhandle(5), info)
      else
         recvhandle(5) = MPI_REQUEST_NULL
      endif
      if (.not. rand_symu) then
         call MPI_IRECV(dq(1,1,1,1), recvcount(tid_ku), recvtype(tid_ku), &
                        tid_ku, 10006, MPI_COMM_WORLD, recvhandle(6), info)
      else
         recvhandle(6) = MPI_REQUEST_NULL
      endif
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      if (.not. rand_sing) then
         call MPI_ISEND(dq(1,1,1,1), sendcount(tid_iu), sendtype(tid_iu), &
                        tid_iu, 10001, MPI_COMM_WORLD, sendhandle(1), info)
      else
         sendhandle(1) = MPI_REQUEST_NULL
      endif
      if (.not. rand_ab) then
         call MPI_ISEND(dq(1,1,1,1), sendcount(tid_io), sendtype(tid_io), &
                        tid_io, 10002, MPI_COMM_WORLD, sendhandle(2), info)
      else
         sendhandle(2) = MPI_REQUEST_NULL
      endif


Example (cont)

      if (.not. rand_festk) then
         call MPI_ISEND(dq(1,1,1,1), sendcount(tid_ju),
     &        sendtype(tid_ju), tid_ju, 10003,
     &        MPI_COMM_WORLD, sendhandle(3), info)
      else
         sendhandle(3) = MPI_REQUEST_NULL
      endif
      if (.not. rand_zu) then
         call MPI_ISEND(dq(1,1,1,1), sendcount(tid_jo),
     &        sendtype(tid_jo), tid_jo, 10004,
     &        MPI_COMM_WORLD, sendhandle(4), info)
      else
         sendhandle(4) = MPI_REQUEST_NULL
      endif
      if (.not. rand_symu) then
         call MPI_ISEND(dq(1,1,1,1), sendcount(tid_ku),
     &        sendtype(tid_ku), tid_ku, 10005,
     &        MPI_COMM_WORLD, sendhandle(5), info)
      else
         sendhandle(5) = MPI_REQUEST_NULL
      endif
      if (.not. rand_symo) then
         call MPI_ISEND(dq(1,1,1,1), sendcount(tid_ko),
     &        sendtype(tid_ko), tid_ko, 10006,
     &        MPI_COMM_WORLD, sendhandle(6), info)
      else
         sendhandle(6) = MPI_REQUEST_NULL
      endif

!... Waitall for the Isends / we have to force them to finish ...
      call MPI_WAITALL(6, sendhandle, sendstatusfeld, info)

      ierror = 0
      return
      end


Example (Alltoallw)

      subroutine System_Change ( dq, ierror)
!------------------------------------------------------------------------------
!
! Subroutine for data exchange on all six boundaries
!
! uses the MPI-2 function MPI_Alltoallw
!------------------------------------------------------------------------------
!
      USE globale_daten
      implicit none
      include 'mpif.h'
      double precision, dimension (0:n1+1,0:n2+1,0:n3+1,nc) :: dq
      integer :: ierror

      call MPI_Alltoallw ( dq(1,1,1,1), sendcount, senddisps, sendtype, &
                           dq(1,1,1,1), recvcount, recvdisps, recvtype, &
                           MPI_COMM_WORLD, ierror )

      end subroutine System_Change


Collective Communication: Reduction

› Global reduction across all members of a group
› Can use predefined operations or user-defined operations
› Can be used on single elements or arrays of elements
› Counts and types must be the same on all processors
› Operations are assumed to be associative
› All pre-defined operations are commutative
› User-defined operations can be different on each processor, but this is not recommended


Collective Communication: Reduction (reduce)

MPI_REDUCE (sendbuf, recvbuf, count, datatype, op, root, comm, ierr)
› recvbuf is only meaningful on root
› Combines elements (on an element-by-element basis) in sendbuf according to op
› Results of the reduction are returned to root in recvbuf
› MPI_IN_PLACE can be used for sendbuf on root


MPI_REDUCE

      REAL a(n), b(n,m), c(m)
      REAL sum(m)

      DO j=1,m
         sum(j) = 0.0
         DO i = 1,n
            sum(j) = sum(j) + a(i)*b(i,j)
         ENDDO
      ENDDO
      CALL MPI_REDUCE(sum, c, m, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierr)


Collective Communication: Reduction (cont)

› MPI_ALLREDUCE (sendbuf, recvbuf, count, datatype, op, comm, ierr)
› Same as MPI_REDUCE, except all processors get the result
› MPI_REDUCE_SCATTER (sendbuf, recvbuf, recvcounts, datatype, op, comm, ierr)
› Acts like a reduce followed by a scatterv


MPI_ALLREDUCE

      REAL a(n), b(n,m), c(m)
      REAL sum(m)

      DO j=1,m
         sum(j) = 0.0
         DO i = 1,n
            sum(j) = sum(j) + a(i)*b(i,j)
         ENDDO
      ENDDO
      CALL MPI_ALLREDUCE(sum, c, m, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)


MPI_REDUCE_SCATTER

      DO j=1,nprocs
         counts(j) = n
      ENDDO
      DO j=1,m
         sum(j) = 0.0
         DO i = 1,n
            sum(j) = sum(j) + a(i)*b(i,j)
         ENDDO
      ENDDO
      CALL MPI_REDUCE_SCATTER(sum, c, counts, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)


Collective Communication: Prefix Reduction

MPI_SCAN (sendbuf, recvbuf, count, datatype, op, comm, ierr)
› Performs an inclusive element-wise prefix reduction
MPI_EXSCAN (sendbuf, recvbuf, count, datatype, op, comm, ierr)
› Performs an exclusive prefix reduction
› Results are undefined at process 0


MPI_SCAN

MPI_Scan (sbuf, rbuf, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
MPI_Exscan (sbuf, rbuf, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

A worked example follows.
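To make the difference concrete, here is a minimal, self-contained C sketch (not from the original slides): each process contributes the value rank+1, so with four processes the inclusive scan yields 1, 3, 6, 10 and the exclusive scan yields (undefined), 1, 3, 6.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, val, incl = 0, excl = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    val = rank + 1;

    /* Inclusive: process i receives val_0 + ... + val_i */
    MPI_Scan(&val, &incl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Exclusive: process i receives val_0 + ... + val_(i-1);
       the result on process 0 is undefined */
    MPI_Exscan(&val, &excl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: inclusive %d exclusive %d\n", rank, incl, excl);
    MPI_Finalize();
    return 0;
}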


Collective Communication: Reduction - user defined ops

MPI_OP_CREATE (function, commute, op, ierr)
› If commute is true, the operation is assumed to be commutative
› function is a user-defined function with 4 arguments
• invec: input vector
• inoutvec: input and output vector
• len: number of elements
• datatype: MPI_DATATYPE
• Computes inoutvec[i] = invec[i] op inoutvec[i], for i = 0..len-1
› MPI_OP_FREE (op, ierr)


Create a non-commutative sum

void my_add (void *in, void *out, int *len, MPI_Datatype *dtype)
{
    int i;
    int *iarray1, *iarray2;
    float *farray1, *farray2;

    if (*dtype == MPI_INT) {
        iarray1 = in; iarray2 = out;
        for (i = 0; i < *len; i++)
            iarray2[i] = iarray2[i] + iarray1[i];
    } else {
        farray1 = in; farray2 = out;
        for (i = 0; i < *len; i++)
            farray2[i] = farray2[i] + farray1[i];
    }
}
…
MPI_Op myop;
MPI_Op_create ((MPI_User_function*)my_add, 0 /* commute = false */, &myop);
MPI_Reduce (inbuf, outbuf, num, MPI_FLOAT, myop, 0, MPI_COMM_WORLD);


Collective Communication: Performance Issues

Collective operations should have much better performance than simply sending messages directly
› Broadcast may make use of a broadcast tree (or other mechanism)
› All collective operations can potentially make use of a tree (or other) mechanism to improve performance
It is important to use the simplest collective operations which still achieve the needed results
Use MPI_IN_PLACE whenever appropriate and supported
› Reduces unnecessary memory usage and redundant data movement


Course Outline (cont)

Day 3
Morning – Lecture
Derived Datatypes
What else is there
Groups and communicators
MPI-2
Dynamic process management
One-sided communication
MPI-I/O
Afternoon – Lab
Hands on exercises creating and using derived datatypes


Derived Datatypes

› Introduction and definitions
› Datatype constructors
› Committing derived datatypes
› Pack and Unpack
› Performance issues


Derived Datatypes

› A derived datatype is a sequence of primitive datatypes and displacements
› Derived datatypes are created by building on primitive datatypes
› A derived datatype's typemap is the sequence of (primitive type, disp) pairs that defines the derived datatype
› These displacements need not be positive, unique, or increasing
› A datatype's type signature is just the sequence of primitive datatypes
› A message's type signature is the type signature of the datatype being sent, repeated count times


Derived Datatypes (cont)

Typemap =
(MPI_INT, 0) (MPI_INT, 12) (MPI_INT, 16) (MPI_INT, 20) (MPI_INT, 36)
Type Signature =
{MPI_INT, MPI_INT, MPI_INT, MPI_INT, MPI_INT}

In collective communication, the type signature of data sent must match the type signature of data received!


Derived Datatypes (cont)

Lower Bound: The lowest displacement of an entry of this datatype
Upper Bound: Relative address of the last byte occupied by entries of this datatype, rounded up to satisfy alignment requirements
Extent: The span from lower to upper bound

MPI_TYPE_GET_EXTENT (datatype, lb, extent, ierr)
MPI_TYPE_SIZE (datatype, size, ierr)
MPI_GET_ADDRESS (location, address, ierr)
lb, extent, and address are of type MPI_Aint in C and INTEGER(KIND=MPI_ADDRESS_KIND) in Fortran


Datatype Constructors

MPI_TYPE_DUP (oldtype, newtype, ierr)
› Simply duplicates an existing type
› Not useful to regular users
MPI_TYPE_CONTIGUOUS (count, oldtype, newtype, ierr)
› Creates a new type representing count contiguous occurrences of oldtype (uses extent(oldtype))
› ex: MPI_TYPE_CONTIGUOUS (2, MPI_INT, 2INT)
• creates a new datatype 2INT which represents an array of 2 integers
Datatype 2INT


CONTIGUOUS DATATYPE

P1 sends 100 integers to P2

P1:
int buff[100];
MPI_Datatype dtype;
...
MPI_Type_contiguous (100, MPI_INT, &dtype);
MPI_Type_commit (&dtype);
MPI_Send (buff, 1, dtype, 2, tag, MPI_COMM_WORLD);

P2:
int buff[100];
MPI_Recv (buff, 100, MPI_INT, 1, tag, MPI_COMM_WORLD, &status);


Datatype Constructors (cont)

MPI_TYPE_VECTOR (count, blocklength, stride, oldtype, newtype, ierr)
› Creates a datatype representing count regularly spaced occurrences of blocklength contiguous oldtypes
› stride is in terms of elements of oldtype (may be negative)
› ex: MPI_TYPE_VECTOR (4, 2, 3, 2INT, AINT)
Datatype AINT

A short C sketch follows.
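As an illustration (not from the original slides), a common C use of MPI_TYPE_VECTOR is sending one column of a row-major matrix with a single call; dest and tag are assumed to be set elsewhere:

double a[10][10];
MPI_Datatype coltype;

/* 10 blocks of 1 double, stride of 10 doubles: one column of a */
MPI_Type_vector (10, 1, 10, MPI_DOUBLE, &coltype);
MPI_Type_commit (&coltype);
MPI_Send (&a[0][3], 1, coltype, dest, tag, MPI_COMM_WORLD);  /* column 3 */
MPI_Type_free (&coltype);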


Datatype Constructors (cont)

MPI_TYPE_CREATE_HVECTOR (count, blocklength, stride, oldtype, newtype, ierr)
› Identical to MPI_TYPE_VECTOR, except stride is given in bytes rather than elements
› In C, stride is of type MPI_Aint
› In Fortran, stride is integer(kind=MPI_ADDRESS_KIND)
› ex: MPI_TYPE_CREATE_HVECTOR (4, 2, 20, 2INT, BINT)
Datatype BINT


EXAMPLE

      REAL a(100,100), b(100,100)
      CALL MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierr)
      CALL MPI_TYPE_SIZE (MPI_REAL, sizeofreal, ierr)
      CALL MPI_TYPE_VECTOR (100, 1, 100, MPI_REAL, rtype, ierr)
      CALL MPI_TYPE_CREATE_HVECTOR (100, 1, sizeofreal, rtype, xpose, ierr)
      CALL MPI_TYPE_COMMIT (xpose, ierr)
      CALL MPI_SENDRECV (a, 1, xpose, myrank, 0, b, 100*100, MPI_REAL, myrank, 0, MPI_COMM_WORLD, status, ierr)


HVECTOR

struct particles
{
    char class;
    double coord[4];
    char other[5];
};

void sendparts ()
{
    struct particles parts[1000];
    int dest, tag;
    MPI_Datatype loctype;
    MPI_Aint size;

    size = sizeof (struct particles);
    /* 1000 blocks of 2 doubles, one block per particle */
    MPI_Type_create_hvector (1000, 2, size, MPI_DOUBLE, &loctype);
    MPI_Type_commit (&loctype);
    MPI_Send (parts[0].coord, 1, loctype, dest, tag, comm);
}


Datatype Constructors (cont)

MPI_TYPE_INDEXED (count, blocklengths, displs, oldtype, newtype, ierr)
• Allows specification of a non-contiguous data layout
• Good for irregular problems
• ex: MPI_TYPE_INDEXED (3, lengths, displs, 2INT, CINT)
• lengths = (2, 4, 3), displs = (0, 3, 8)
Datatype CINT
• Most often, block sizes are all the same (typically 1); a short C sketch follows
• MPI-2 introduced a new constructor
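To ground the call above, here is a hypothetical C sketch (not from the slides) that sends the upper triangle, including the diagonal, of a row-major 4x4 matrix; dest and tag are assumed:

double m[4][4];
int lens[4]   = {4, 3, 2, 1};     /* row i keeps 4-i elements  */
int displs[4] = {0, 5, 10, 15};   /* start of each row's tail  */
MPI_Datatype upper;

MPI_Type_indexed (4, lens, displs, MPI_DOUBLE, &upper);
MPI_Type_commit (&upper);
MPI_Send (m, 1, upper, dest, tag, MPI_COMM_WORLD);
MPI_Type_free (&upper);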


Datatype Constructors (cont)

MPI_TYPE_CREATE_INDEXED_BLOCK (count, blocklength, displs, oldtype, newtype, ierr)
› Same as MPI_TYPE_INDEXED, except all blocks are the same length (blocklength)
› ex: MPI_TYPE_CREATE_INDEXED_BLOCK (7, 1, displs, MPI_INT, DINT)
• displs = (1, 3, 4, 6, 9, 13, 14)
Datatype DINT
• displs = (14, 3, 4, 9, 6, 13, 1)
Datatype DINT


Datatype Constructors (cont)

MPI_TYPE_CREATE_HINDEXED (count, blocklengths, displs, oldtype, newtype, ierr)
› Identical to MPI_TYPE_INDEXED except displacements are in bytes rather than elements
› displs are of type MPI_Aint in C and integer(KIND=MPI_ADDRESS_KIND) in Fortran
MPI_TYPE_CREATE_STRUCT (count, lengths, displs, types, newtype, ierr)
› Used mainly for sending arrays of structures
› count is the number of fields in the structure
› lengths is the number of elements in each field
› displs should be calculated (for portability) and are of type MPI_Aint in C and integer(KIND=MPI_ADDRESS_KIND) in Fortran


MPI_TYPE_CREATE_STRUCT

struct s1 {
    char class;
    double d[6];
    char b[7];
};
struct s1 sarray[100];

Non-portable (assumes alignment):

MPI_Datatype stype;
MPI_Datatype types[3] = {MPI_CHAR, MPI_DOUBLE, MPI_CHAR};
int lens[3] = {1, 6, 7};
MPI_Aint displs[3] = {0, sizeof(double), 7*sizeof(double)};
MPI_Type_create_struct (3, lens, displs, types, &stype);

Semi-portable (measured displacements):

int i;
MPI_Datatype stype;
MPI_Datatype types[3] = {MPI_CHAR, MPI_DOUBLE, MPI_CHAR};
int lens[3] = {1, 6, 7};
MPI_Aint displs[3];
MPI_Get_address (&sarray[0].class, &displs[0]);
MPI_Get_address (&sarray[0].d, &displs[1]);
MPI_Get_address (&sarray[0].b, &displs[2]);
for (i = 2; i >= 0; i--) displs[i] -= displs[0];
MPI_Type_create_struct (3, lens, displs, types, &stype);


MPI_TYPE_CREATE_STRUCT

int i;
char c[100];
float f[3];
int a;
MPI_Aint disp[4];
int lens[4] = {1, 100, 3, 1};
MPI_Datatype types[4] = {MPI_INT, MPI_CHAR, MPI_FLOAT, MPI_INT};
MPI_Datatype stype;

MPI_Get_address (&i, &disp[0]);
MPI_Get_address (c, &disp[1]);
MPI_Get_address (f, &disp[2]);
MPI_Get_address (&a, &disp[3]);
MPI_Type_create_struct (4, lens, disp, types, &stype);
MPI_Type_commit (&stype);
/* the typemap holds absolute addresses, so the buffer is MPI_BOTTOM */
MPI_Send (MPI_BOTTOM, 1, stype, ……..);


Derived Datatypes (cont)

MPI_TYPE_CREATE_RESIZED (oldtype, lb, extent, newtype, ierr)
› Sets a new lower bound and extent for oldtype
› Does NOT change the amount of data sent in a message
• only changes the data access pattern
› lb and extent are of type MPI_Aint in C and integer(KIND=MPI_ADDRESS_KIND) in Fortran


MPI_TYPE_CREATE_RESIZED

struct s1 {
    char class;
    double d[2];
    char b[3];
};
struct s1 sarray[100];

int i;
MPI_Datatype stype, ntype;
MPI_Datatype types[3] = {MPI_CHAR, MPI_DOUBLE, MPI_CHAR};
int lens[3] = {1, 2, 3};
MPI_Aint displs[3];

MPI_Get_address (&sarray[0].class, &displs[0]);
MPI_Get_address (&sarray[0].d, &displs[1]);
MPI_Get_address (&sarray[0].b, &displs[2]);
for (i = 2; i >= 0; i--) displs[i] -= displs[0];
MPI_Type_create_struct (3, lens, displs, types, &stype);
/* resize so consecutive array elements line up with the struct */
MPI_Type_create_resized (stype, 0, sizeof(struct s1), &ntype);

Really Portable


MPI_TYPE_CREATE_RESIZED

[Figure: a 36-element integer array on P0; elements 0-3 and 6-9 are destined for P1, 12-15 and 18-21 for P2, and 24-27 and 30-33 for P3]

MPI_Type_vector (2, 4, 6, MPI_INT, &my_type);

What if I want to send all three regions at once?
MPI_Send (outbuff, 3, my_type, target, tag, comm);
This would send elements 0, 1, 2, 3, 6, 7, 8, 9, 10, 11, 12, 13, 16, 17, 18, 19, ... because the extent of my_type is only 10 integers

MPI_Type_create_resized (my_type, 0, 12*sizeof(int), &new_type);
We can now send all elements in one call
We could even use a scatter now to send to P1, P2, and P3


Datatype Constructors (cont)

MPI_TYPE_CREATE_SUBARRAY (ndims, sizes, subsizes, starts, order, oldtype, newtype, ierr)
› Creates a newtype which represents a contiguous subsection of an array with ndims dimensions
• This sub-array is only contiguous conceptually; it may not be stored contiguously in memory!
› ndims is the number of array dimensions
› sizes is an integer array giving the magnitude of the array in each dimension
› subsizes is an integer array giving the magnitude of the subarray in each dimension
› starts is an integer array specifying the starting coordinates of the subarray
› Arrays are assumed to be indexed starting at zero!!!
› order must be MPI_ORDER_C or MPI_ORDER_FORTRAN
• C programs may specify Fortran ordering, and vice-versa
A matching C fragment follows.
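For reference, here is a small hypothetical C fragment (not from the slides) matching the Fortran-ordered example on the next two slides:

int sizes[2]    = {10, 10};
int subsizes[2] = {6, 6};
int starts[2]   = {2, 2};   /* element (3,3) in 1-based Fortran terms */
MPI_Datatype sarray;

MPI_Type_create_subarray (2, sizes, subsizes, starts,
                          MPI_ORDER_FORTRAN, MPI_INT, &sarray);
MPI_Type_commit (&sarray);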


Datatype Constructors: Subarrays

[Figure: a 10x10 array, elements (1,1) through (10,10), with the selected 6x6 subarray shaded]

MPI_TYPE_CREATE_SUBARRAY (2, sizes, subsizes, starts, MPI_ORDER_FORTRAN, MPI_INT, sarray)
sizes = (10, 10)
subsizes = (6, 6)
starts = (3, 3)


Datatype Constructors: Subarrays

[Figure: the same 10x10 array, elements (1,1) through (10,10), with the 6x6 subarray shaded]

MPI_TYPE_CREATE_SUBARRAY (2, sizes, subsizes, starts, MPI_ORDER_FORTRAN, MPI_INT, sarray)
sizes = (10, 10)
subsizes = (6, 6)
starts = (2, 2)
Because starts are zero-based, starts = (2, 2) selects the subarray whose first element is (3, 3) in Fortran's 1-based indexing


Datatype Constructors: Darrays

MPI_TYPE_CREATE_DARRAY (size, rank, dims, gsizes, distribs, dargs, psizes, order, oldt, newtype, ierr)
› Used with arrays that are distributed in HPF-like fashion on Cartesian process grids
› Generates datatypes corresponding to the sub-arrays stored on each processor
› Returns in newtype a datatype specific to the sub-array stored on process rank


Datatype Constructors (cont)

Derived datatypes must be committed before they can be used
› MPI_TYPE_COMMIT (datatype, ierr)
› Performs a “compilation” of the datatype description into an efficient representation
Derived datatypes should be freed when they are no longer needed
› MPI_TYPE_FREE (datatype, ierr)
› Does not affect datatypes derived from the freed datatype or communication currently in progress


Pack and Unpack

MPI_PACK (inbuf, incount, datatype, outbuf, outsize, position, comm, ierr)
MPI_UNPACK (inbuf, insize, position, outbuf, outcount, datatype, comm, ierr)
MPI_PACK_SIZE (incount, datatype, comm, size, ierr)
› Packed messages must be sent with the type MPI_PACKED
› Packed messages can be received with any matching datatype
› Unpacked messages can be received with the type MPI_PACKED
› Receives must use type MPI_PACKED if the messages are to be unpacked


Pack and Unpack

Sender using a derived datatype:
int i;
char c[100];
MPI_Aint disp[2];
int lens[2] = {1, 100};
MPI_Datatype types[2] = {MPI_INT, MPI_CHAR};
MPI_Datatype type1;
MPI_Get_address (&i, &disp[0]);
MPI_Get_address (c, &disp[1]);
MPI_Type_create_struct (2, lens, disp, types, &type1);
MPI_Type_commit (&type1);
MPI_Send (MPI_BOTTOM, 1, type1, 1, 0, MPI_COMM_WORLD);

Sender using pack:
int i;
char c[100];
char buf[110];
int pos = 0;
MPI_Pack (&i, 1, MPI_INT, buf, 110, &pos, MPI_COMM_WORLD);
MPI_Pack (c, 100, MPI_CHAR, buf, 110, &pos, MPI_COMM_WORLD);
MPI_Send (buf, pos, MPI_PACKED, 1, 0, MPI_COMM_WORLD);

Receiver using a derived datatype:
char c[100];
MPI_Status status;
int i;
MPI_Comm comm;
MPI_Aint disp[2];
int lens[2] = {1, 100};
MPI_Datatype types[2] = {MPI_INT, MPI_CHAR};
MPI_Datatype type1;
MPI_Get_address (&i, &disp[0]);
MPI_Get_address (c, &disp[1]);
MPI_Type_create_struct (2, lens, disp, types, &type1);
MPI_Type_commit (&type1);
comm = MPI_COMM_WORLD;
MPI_Recv (MPI_BOTTOM, 1, type1, 0, 0, comm, &status);

Receiver using unpack:
int i;
char c[100];
MPI_Status status;
MPI_Comm comm;
char buf[110];
int pos = 0;
comm = MPI_COMM_WORLD;
MPI_Recv (buf, 110, MPI_PACKED, 1, 0, comm, &status);
MPI_Unpack (buf, 110, &pos, &i, 1, MPI_INT, comm);
MPI_Unpack (buf, 110, &pos, c, 100, MPI_CHAR, comm);


Derived Datatypes: Performance Issues

› May allow the user to send fewer or smaller messages
› How well this works is system-dependent
› May be able to significantly reduce memory copies
› Not all implementations have done a good job
› Can make I/O much more efficient
› Data packing may be more efficient if it reduces the number of send operations by packing meta-data at the front of the message
› This is often possible (and advantageous) for data layouts that are runtime-dependent


TAKE A BREAK


Course Outline (cont)

Day 3
Morning – Lecture
Derived Datatypes
What else is there
Groups and communicators
MPI-2
Dynamic process management
One-sided communication
MPI-I/O
Afternoon – Lab
Hands on exercises creating and using derived datatypes


Communicators and Groups

Many MPI users are only familiar with MPI_COMM_WORLD
A communicator can be thought of as a handle to a group
A group is an ordered set of processes
Each process is associated with a rank
Ranks are contiguous and start from zero
For many applications (dual-level parallelism), maintaining different groups is appropriate
Groups allow collective operations to work on a subset of processes
Information can be added onto communicators to be passed into routines


Communicators and Groups (cont)

While we think of a communicator as spanning processes, it is actually unique to a process
A communicator can be thought of as a handle to an object (group attribute) that describes a group of processes
An intracommunicator is used for communication within a single group
An intercommunicator is used for communication between 2 disjoint groups


Communicators and Groups (cont)

[Figure: processes P1, P2, and P3 inside MPI_COMM_WORLD, with overlapping subgroups Comm1 and Comm2]


Communicators and Groups (cont)

Refer to the previous slide
There are 3 distinct groups
These are associated with MPI_COMM_WORLD, comm1, and comm2
P3 is a member of all 3 groups and may have different ranks in each group (say 0, 3, and 4)
If P2 wants to send a message to P1 it must use MPI_COMM_WORLD
If P2 wants to send a message to P3 it can use MPI_COMM_WORLD (send to rank 0) or comm1 (send to rank 3)


Group Management

You can access information about groups
Size, rank within, translate ranks between 2 groups, compare 2 groups
You can create new groups
Get a group associated with a comm
Set operations, include or exclude lists


Communicator management

You can access information about communicators
Size of the group associated with a communicator, rank within that group, compare 2 communicators
You can create new communicators
Create a new comm for a group
Split a group into a disjoint set of groups, and create new comms
Can create intercommunicators
A short C sketch follows
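As an illustration (not part of the original slides), a minimal C sketch of both approaches; the choice of ranks 0 and 1 for the subgroup is arbitrary:

MPI_Group world_group, sub_group;
MPI_Comm sub_comm, half_comm;
int rank;
int ranks[2] = {0, 1};

MPI_Comm_rank (MPI_COMM_WORLD, &rank);

/* A new communicator containing only world ranks 0 and 1
   (assumes at least 2 processes) */
MPI_Comm_group (MPI_COMM_WORLD, &world_group);
MPI_Group_incl (world_group, 2, ranks, &sub_group);
/* collective over MPI_COMM_WORLD; returns MPI_COMM_NULL on
   processes outside the new group */
MPI_Comm_create (MPI_COMM_WORLD, sub_group, &sub_comm);

/* Or: split the world into even-ranked and odd-ranked comms */
MPI_Comm_split (MPI_COMM_WORLD, rank % 2, rank, &half_comm);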


Course Outline (cont)

Day 3
Morning – Lecture
Derived Datatypes
What else is there
Groups and communicators
MPI-2
One-sided communication
Dynamic process management
MPI-I/O
Afternoon – Lab
Hands on exercises creating and using derived datatypes


One Sided Communication

One sided communication allows shmem-style gets and puts
Only one process need actively participate in one sided operations
With sufficient hardware support, remote memory operations can offer greater performance and functionality than the message passing model
MPI remote memory operations do not make use of a shared address space


One Sided Communication

By requiring only one process to participate, significant performance improvements are possible
› No implicit ordering of data delivery
› No implicit synchronization
Some programs are more easily written with the remote memory access (RMA) model
› Global counter


One Sided Communication

RMA operations require 3 steps:
1. Define an area of memory that can be used for RMA operations (window)
2. Specify the data to be moved and where to move it
3. Specify a way to know the data is available


One Sided Communication

[Figure: two address spaces with exposed windows; a Get pulls data from the remote window and a Put pushes data into it]


One Sided Communication

Memory Windows
A memory window defines an area of memory that can be used for RMA operations
A memory window must be a contiguous block of memory
Described by a base address and a number of bytes
Window creation is collective across a communicator
A window object is returned; this window object is used for all subsequent RMA calls


One Sided Communication

Data movement
› MPI_PUT
› MPI_GET
› MPI_ACCUMULATE
All data movement routines are nonblocking
A synchronization call is required to ensure operation completion (see the sketch below)
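A minimal hypothetical sketch (not from the slides), assuming fence synchronization, which ties the three steps together:

int myval, remote = -1, rank, nprocs;
MPI_Win win;

MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &nprocs);
myval = rank;

/* Step 1: expose one integer of local memory as a window */
MPI_Win_create (&remote, sizeof(int), sizeof(int),
                MPI_INFO_NULL, MPI_COMM_WORLD, &win);

/* Step 2: put my value into my right neighbor's window */
MPI_Win_fence (0, win);
MPI_Put (&myval, 1, MPI_INT, (rank + 1) % nprocs,
         0, 1, MPI_INT, win);

/* Step 3: the closing fence guarantees the data has arrived */
MPI_Win_fence (0, win);

MPI_Win_free (&win);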


Dynamic Process Management

Standard manner of starting processes in PVM
Useful in assembling complex distributed applications
Of questionable value with batch queuing systems and resource managers
Processes are started during runtime


Dynamic Process Management

Spawning new processes
Collective over a group
New processes form their own group with their own MPI_COMM_WORLD
Spawning processes are returned an intercommunicator, with the spawning processes part of the local group and the spawned processes part of the remote group
Spawned processes can get this intercommunicator with a call to MPI_Comm_get_parent (a sketch follows)
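A minimal sketch (not from the slides), assuming a separate executable named "worker":

MPI_Comm children, parent;
int errcodes[4];

/* In the parent: start 4 copies of the worker binary */
MPI_Comm_spawn ("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                0, MPI_COMM_WORLD, &children, errcodes);

/* In the worker: recover the intercommunicator to the parents
   (MPI_COMM_NULL if this process was not spawned) */
MPI_Comm_get_parent (&parent);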


Dynamic Process Management

Two running MPI programs may connect with each other
One program announces a willingness to accept connections while another attempts to establish a connection
Once a connection has been established, the two parallel programs share an intercommunicator


MPI-I/O

Introduction
› What is parallel I/O
› Why do we need parallel I/O
› What is MPI-I/O




What is parallel I/O

INTRODUCTION
› Multiple processes accessing a single file
› Often, both data and file access are non-contiguous
• Ghost cells cause non-contiguous data access
• Block or cyclic distributions cause non-contiguous file access


Non-Contiguous Access

[Figure: non-contiguous regions of local memory mapping to non-contiguous locations in the file layout]


What is parallel I/O

INTRODUCTION
› Multiple processes accessing a single file
› Often, both data and file access are non-contiguous
• Ghost cells cause non-contiguous data access
• Block or cyclic distributions cause non-contiguous file access
› Want to access data and files with as few I/O calls as possible


INTRODUCTION (cont)

Why use parallel I/O
› Many users do not have time to learn the complexities of I/O optimization


INTRODUCTION (cont)

      Integer dim
      parameter (dim=10000)
      Integer*4 out_array(dim)

! sequential, unformatted: one element at a time in the I/O list
      OPEN (fh, filename, UNFORMATTED)
      WRITE (fh) (out_array(I), I=1,dim)

! direct access: the whole array as a single record
      rl = 4*dim
      OPEN (fh, filename, DIRECT, RECL=rl)
      WRITE (fh, REC=1) out_array




INTRODUCTION (cont)

Why use parallel I/O
› Many users do not have time to learn the complexities of I/O optimization
› Use of parallel I/O can simplify coding
• Single read/write operation vs. multiple read/write operations
› Parallel I/O potentially offers significant performance improvement over traditional approaches


INTRODUCTION (cont)

Traditional approaches
› Each process writes to a separate file
• Often requires an additional post-processing step
• Without post-processing, restarts must use the same number of processors
› Results sent to a master processor, which collects them and writes out to disk
› Each processor calculates its position in the file and writes individually


INTRODUCTION (cont)

What is MPI-I/O
› MPI-I/O is a set of extensions to the original MPI standard
› This is an interface specification: it does NOT give implementation specifics
› It provides routines for file manipulation and data access
› Calls to MPI-I/O routines are portable across a large number of architectures


MPI-I/O

Makes extensive use of derived datatypes
Allows non-contiguous memory and/or disk data to be read/written with a single I/O call
Can simplify programming and maintenance
May allow for overlap of computation and I/O
Collective I/O allows for performance optimization through marshaling
Significant performance improvement is possible (see the sketch below)
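As a flavor of the interface, a minimal hypothetical C sketch (not from the slides) in which each process writes its own block of a shared file collectively; the filename "data.out" and the block size are assumptions:

#define BLOCK 1000

int buf[BLOCK];               /* assume buf has been filled */
int rank;
MPI_File fh;
MPI_Offset offset;

MPI_Comm_rank (MPI_COMM_WORLD, &rank);
offset = (MPI_Offset) rank * BLOCK * sizeof(int);

MPI_File_open (MPI_COMM_WORLD, "data.out",
               MPI_MODE_CREATE | MPI_MODE_WRONLY,
               MPI_INFO_NULL, &fh);

/* Collective write: every process writes its own block, and the
   library is free to marshal the requests for performance */
MPI_File_write_at_all (fh, offset, buf, BLOCK, MPI_INT,
                       MPI_STATUS_IGNORE);

MPI_File_close (&fh);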


Recommended references

MPI - The Complete Reference, Volume 1: The MPI Core
MPI - The Complete Reference, Volume 2: The MPI Extensions
Using MPI: Portable Parallel Programming with the Message-Passing Interface
Using MPI-2: Advanced Features of the Message-Passing Interface


Other Useful Information

https://okc.erdc.hpc.mil
Computational Environments (CE)
Under “Papers/Pubs”
– Using MPI-I/O
– Guidelines for writing portable MPI programs

MPI-CHECK
A decent, free debugger for MPI programs
http://andrew.ait.iastate.edu/HPC/MPI-CHECK.htm
