The CINECA high-performance computing system: evolution of ...
<strong>CINECA</strong><br />
Casalecchio di Reno (BO)<br />
www.cineca.it<br />
Via Magnanelli 6/3, 40033 Casalecchio di Reno | 051 6171411 | www.cineca.it<br />
DEISA-TeraGrid Summer School on<br />
HPC Challenges in Computational Sciences<br />
Oct. 4-7, 2010, Acireale, Catania, Italy<br />
Overview on Hybrid Programming<br />
Giovanni Erbacci<br />
HPC Division, <strong>CINECA</strong><br />
g.erbacci@cineca.it
Outline<br />
- Evolution of modern HPC Architectures<br />
- Programming paradigms<br />
- Introduction to hybrid programming<br />
www.cineca.it 2
Parallel Architectures<br />
[Figure: six SMP nodes (processors P sharing memory M) connected by a network]<br />
Up to some years ago:<br />
- mainly MPI applications<br />
- efforts to improve MPI scalability<br />
MPI applications are easy to port also to shared memory architectures:<br />
pure MPI: one process per core<br />
De facto standards<br />
MPI:<br />
- distributed memory systems<br />
- message passing<br />
- data distribution model<br />
- processes<br />
- Version 2.1 (09/08)<br />
- API for C/C++ and Fortran<br />
OpenMP:<br />
- shared memory systems<br />
- thread creation<br />
- relaxed-consistency model<br />
- threads<br />
- Version 3.0 (05/08)<br />
- compiler directives for C and Fortran<br />
… until some time ago, pure MPI was implicitly assumed to be as efficient<br />
as a well-implemented hybrid MPI/OpenMP code using MPI for inter-node<br />
communication and OpenMP for intra-node parallelisation.<br />
Process<br />
A process is created by the operating system, and requires a fair amount of<br />
"overhead".<br />
Processes contain information about program resources and program execution<br />
state, including:<br />
- Process ID, process group ID, user ID, and group ID<br />
- Environment<br />
- Working directory<br />
- Program instructions<br />
- Registers<br />
- Stack<br />
- Heap<br />
- File descriptors<br />
- Signal actions<br />
- Shared libraries<br />
- Inter-process communication tools (such as message queues, pipes,<br />
semaphores, or shared memory)<br />
Thread<br />
A thread is defined as an independent stream of instructions that can be scheduled to run as<br />
such by the operating system.<br />
Threads use and exist within the process resources:<br />
- they can be scheduled by the operating system<br />
- they run as independent entities<br />
- they duplicate only the bare essential resources that enable them to exist as<br />
executable code.<br />
This independent flow of control is accomplished because a thread maintains its own:<br />
- Stack pointer<br />
- Registers<br />
- Scheduling properties (such as policy or priority)<br />
- Set of pending and blocked signals<br />
- Thread-specific data.<br />
Threads share the process resources with other threads that act equally independently (and<br />
interdependently).<br />
Reading and writing the same memory locations is possible, and therefore requires<br />
explicit synchronization by the programmer.<br />
Threads die if the parent process dies.<br />
A thread is "lightweight" because most of the overhead has already been incurred through the<br />
creation of its process.<br />
MPI Execution Model<br />
- Single Program Multiple Data<br />
- A copy of the code is executed<br />
by each process<br />
- the execution flow is different<br />
depending on the context<br />
(process id, local data, etc.)<br />
OpenMP Execution Model<br />
- A single thread starts executing<br />
sequentially<br />
- When a parallel region is<br />
reached, several slave threads are<br />
forked to run in parallel<br />
- At the end of the parallel region,<br />
all the slave threads die<br />
- Only the master thread continues<br />
the sequential execution<br />
Processors: trends<br />
[Figure: aggregate number of cores for the Top500 supercomputers, plotted by year from June 1993 to June 2008 (y-axis: 0 to 3,000,000 cores)]<br />
MPI inter-process communication<br />
MPI on multi-core CPUs<br />
[Figure: MPI_BCAST among four nodes across the network]<br />
1 MPI process / core:<br />
- stresses the network<br />
- stresses the OS<br />
Many MPI codes heavily use<br />
ALLTOALL communication:<br />
messages = processes * processes<br />
We need to exploit the hierarchy:<br />
re-design applications to mix message passing<br />
and multi-threading<br />
Hybrid Model<br />
Multiple SMP (Symmetric Multiprocessor) nodes connected by an<br />
interconnection network.<br />
Each node is mapped to (at least) one MPI process and several OpenMP threads<br />
Hybrid model: benefits<br />
Collective communication is often a bottleneck.<br />
Hybrid implementation:<br />
- decreases the number of messages<br />
by a factor of (# threads)^2<br />
- increases the length of messages<br />
by a factor of (# threads)<br />
Domain decomposition<br />
MPI implementation:<br />
- each process has to exchange ghost cells<br />
- even if the two processes are within the<br />
same node (two different processes do<br />
not share the same memory).<br />
Domain decomposition /1<br />
The hybrid approach allows more cells<br />
to be shared.<br />
Each thread accesses all the cells<br />
within the node:<br />
- communication decreases<br />
- the size of MPI messages<br />
increases.<br />
MPI vs. OpenMP<br />
Pure MPI Pros:<br />
- High scalability<br />
- High portability<br />
- No false sharing<br />
- Scalability within the node<br />
Pure MPI Cons:<br />
- Not easy to develop and debug<br />
- Explicit communication<br />
- Coarse granularity<br />
- Not easy to guarantee good<br />
load balancing<br />
Pure OpenMP Pros:<br />
- Easy to implement (in general)<br />
- Low latency<br />
- Implicit communication<br />
- Coarse and fine granularity<br />
- Dynamic load balancing<br />
Pure OpenMP Cons:<br />
- Only on shared memory architectures<br />
- Scalability only within the node<br />
- Waiting for data to be unlocked<br />
- No specific ordering of threads<br />
- Cache consistency and false sharing<br />
MPI plus OpenMP<br />
Pros:<br />
- Better use of the memory hierarchy<br />
- Better use of the interconnect<br />
- Using OpenMP within a node avoids the overhead of calling the<br />
MPI library<br />
- Improved scalability<br />
Cons:<br />
- Overhead in thread management<br />
- Greater attention to memory access required<br />
- Worse performance (in some cases)<br />
False sharing in OpenMP<br />
#pragma omp parallel for shared(A) schedule(static,1)<br />
for (int i = 0; i < n; i++)<br />
    A[i] += i;  /* with schedule(static,1), adjacent elements of A are<br />
                   updated by different threads: false sharing */<br />
Hybrid pseudo code<br />
call MPI_INIT (ierr)<br />
call MPI_COMM_RANK (…)<br />
call MPI_COMM_SIZE (…)<br />
… some computation and MPI communication<br />
call OMP_SET_NUM_THREADS(4)<br />
!$OMP PARALLEL<br />
!$OMP DO<br />
do i=1,n<br />
… computation<br />
enddo<br />
!$OMP END DO<br />
!$OMP END PARALLEL<br />
… some computation and MPI communication<br />
call MPI_FINALIZE (ierr)<br />
Example<br />
!$omp parallel do<br />
DO I = 1,N<br />
   A(I) = B(I) + C(I)<br />
END DO<br />
CALL MPI_BSEND(A(N),1,.....)<br />
CALL MPI_RECV(A(0),1,.....)<br />
!$omp parallel do<br />
DO I = 1,N<br />
   D(I) = A(I-1) + A(I)<br />
END DO<br />
Notes:<br />
- an implicit OpenMP barrier is added at the end of each parallel loop<br />
- the other threads are idle while the master does the MPI calls<br />
- a single MPI task may not use all the network bandwidth<br />
- cache misses occur when accessing the received data<br />
- cache misses occur when accessing the message data<br />
Mixed paradigm<br />
MPI_INIT_THREAD support (MPI-2)<br />
MPI_INIT_THREAD (required, provided, ierr)<br />
IN: required, level of thread support requested (integer)<br />
OUT: provided, level of thread support actually granted (integer);<br />
provided may be less than required.<br />
Four levels are supported:<br />
MPI_THREAD_SINGLE: only one thread will run. Equivalent to MPI_INIT.<br />
MPI_THREAD_FUNNELED: processes may be multithreaded,<br />
- all communication is done by one thread,<br />
- but this may occur inside parallel regions<br />
- other threads may be computing during communication<br />
MPI_THREAD_SERIALIZED: processes may be multithreaded. Multiple<br />
threads may make MPI calls, but only one at a time.<br />
MPI_THREAD_MULTIPLE: multiple threads may make MPI calls, with no<br />
restrictions.<br />
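In Fortran, requesting a level and checking what the library actually granted might look like this (a sketch; the thread-level constants are ordered, so a simple comparison works):

```fortran
integer :: required, provided, ierr

required = MPI_THREAD_FUNNELED
call MPI_INIT_THREAD(required, provided, ierr)

if (provided < required) then
   ! the library granted less than requested:
   ! fall back to master-only communication, or abort
   call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
end if
```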
MPI_THREAD_SINGLE (master only)<br />
!$OMP PARALLEL DO<br />
do i=1,10000<br />
a(i)=b(i)+f*d(i)<br />
enddo<br />
!$OMP END PARALLEL DO<br />
call MPI_Xxx(...)<br />
!$OMP PARALLEL DO<br />
do i=1,10000<br />
x(i)=a(i)+f*b(i)<br />
enddo<br />
!$OMP END PARALLEL DO<br />
#pragma omp parallel for<br />
for (i = 0; i < 10000; i++)<br />
    a[i] = b[i] + f * d[i];<br />
MPI_Xxx(...);<br />
#pragma omp parallel for<br />
for (i = 0; i < 10000; i++)<br />
    x[i] = a[i] + f * b[i];<br />
MPI_THREAD_FUNNELED<br />
The process may be multithre<strong>ad</strong>ed but only the main thre<strong>ad</strong> will<br />
make MPI calls.<br />
MPI_THREAD_FUNNELED<br />
- calls outside the parallel region<br />
- inside the parallel region, with "omp master"<br />
!$OMP BARRIER<br />
!$OMP MASTER<br />
call MPI_Xxx(...)<br />
!$OMP END MASTER<br />
!$OMP BARRIER<br />
#pragma omp barrier<br />
#pragma omp master<br />
MPI_Xxx(...);<br />
#pragma omp barrier<br />
There is no synchronization implied by "omp master",<br />
so a barrier is needed before and after, to ensure that the data and<br />
buffers are available before and/or after the MPI calls.<br />
MPI_THREAD_SERIALIZED<br />
The process may be multithreaded and multiple threads may make<br />
MPI calls, but only one at a time:<br />
MPI calls are not made concurrently from two distinct threads.<br />
MPI_THREAD_SERIALIZED<br />
• Outside the parallel region<br />
• Inside the parallel region, with "omp master"<br />
• Inside the parallel region, with "omp single"<br />
!$OMP BARRIER<br />
!$OMP SINGLE<br />
call MPI_Xxx(...)<br />
!$OMP END SINGLE<br />
#pragma omp barrier<br />
#pragma omp single<br />
MPI_Xxx(...);<br />
MPI_THREAD_MULTIPLE<br />
- Multiple thre<strong>ad</strong>s may call MPI anytime with no restrictions.<br />
- Less restrictive and very flexible, but the application becomes<br />
very complicated to handle<br />
HPC Evolution<br />
From Herb Sutter:<br />
Moore's law is holding, in<br />
the number of transistors<br />
– transistors on an ASIC are still<br />
doubling every 18 months at<br />
constant cost<br />
– 15 years of exponential clock<br />
rate growth have ended<br />
Moore's Law reinterpreted<br />
– performance improvements are<br />
now coming from the increase in<br />
the number of cores on a<br />
processor (ASIC)<br />
– #cores per chip doubles every<br />
18 months instead of the clock rate<br />
– 64-512 threads per node will<br />
soon be common<br />
HPC: Evolution / 1<br />
Million-core systems are on the horizon.<br />
Current status (120K-295K cores): BGP@Juelich 295K, XT5@ORNL 224K, BGL@LLNL 213K, BGP@ANL 164K,<br />
Roadrunner@LANL 122K, Nebulae@NSCS-Shenzhen 120K<br />
2011-2013: the 200K-1.6M core range will be achieved<br />
"Serial computing is dead, and the parallel computing revolution has begun:<br />
Are you part of the solution, or part of the problem?"<br />
Dave Patterson, UC Berkeley, Usenix conference, June 2008<br />
Amdahl’s law still holds and implies dramatic problems in the range of 100K - 1M cores.<br />
Power issues:<br />
For multi-Petaflop/s to Exaflop/s systems a real disruption in technology is needed.<br />
With the current type of technology we can estimate the following situation:<br />
2012: 10 Pflop/s, 9.3 MW for IT, 14 MW total, 12.3 M€/year<br />
2018: 1 Eflop/s, 154 MW for IT, 213 MW total, 187 M€/year<br />
(assuming PUE = 1.5 and €0.10 per kWh)<br />
Real HPC Crisis is with Software<br />
Supercomputer applications and software usually live much longer than the<br />
hardware:<br />
- hardware life is typically four-five years at most<br />
- Fortran and C are still the main programming models<br />
Programming is stuck:<br />
- arguably it hasn't changed much since the 70's<br />
Software is a major cost component of modern technologies:<br />
- the tradition in HPC system procurement is to assume that the software is free<br />
It's time for a change:<br />
- complexity is rising dramatically<br />
- challenges for the applications on Petaflop systems<br />
- improvement of existing codes will become complex and partly impossible<br />
- the use of O(100K) cores implies a dramatic optimization effort<br />
- new paradigms such as the support of a hundred threads in one node imply new<br />
parallelization strategies<br />
- implementing new parallel programming methods in existing large<br />
applications does not always have a promising perspective<br />
There is a need for new community codes<br />
Modern Languages and Models<br />
Current programming models:<br />
- MPI<br />
- OpenMP<br />
Hybrid programming models:<br />
- OpenMP + MPI (or MPI + multithreading)<br />
- MPI + stream processing (CUDA, OpenCL)<br />
Partitioned Global Address Space (PGAS) programming languages:<br />
- Unified Parallel C (UPC), Coarray Fortran, Titanium<br />
Next-generation programming languages and models:<br />
- Chapel, X10, Fortress, Transactional Memory<br />
Languages, paradigms and environments for hardware accelerators:<br />
- Cell programming<br />
- CUDA<br />
- OpenCL<br />
- StarSs<br />
- Cn (FPGA)<br />
- CAPS HMPP<br />
- RapidMind<br />
- API support for compilers<br />
Some Current Unmet Needs<br />
Performance / Portability<br />
Fault tolerance<br />
Better programming mo<strong>del</strong>s<br />
- Global shared address space<br />
- Visible locality<br />
Maybe coming soon (since incremental, yet offering real benefits):<br />
- Partitioned Global Address Space (PGAS) languages:<br />
. “Minor” extensions to existing languages<br />
. More convenient than MPI<br />
. Have performance transparency via explicit remote memory references<br />
The critical cycle of prototyping, assessment, and commercialization must be a<br />
long-term, sustaining investment, not a one time, crash program.<br />