
CINECA
Casalecchio di Reno (BO)
www.cineca.it
Via Magnanelli 6/3, 40033 Casalecchio di Reno | 051 6171411 | www.cineca.it

DEISA-TeraGrid Summer School on
HPC Challenges in Computational Sciences
Oct. 4-7, 2010, Acireale, Catania, Italy

Overview on Hybrid Programming

Giovanni Erbacci
HPC Division, CINECA
g.erbacci@cineca.it


Outline

- Evolution of modern HPC architectures
- Programming paradigms
- Introduction to hybrid programming


Parallel Architectures

[Figure: an SMP node (processors P sharing a memory M) and a cluster of SMP nodes (Node 1 ... Node 6) connected by a network]

Up to some years ago:
- mainly MPI applications
- efforts to improve MPI scalability
- MPI applications easy to port also to shared memory architectures
- pure MPI: one process per core


Standard de facto

MPI (distributed memory systems):
- message passing
- data distribution model
- processes
- Version 2.1 (09/08)
- API for C/C++ and Fortran

OpenMP (shared memory systems):
- thread creation
- relaxed-consistency model
- threads
- Version 3.0 (05/08)
- compiler directives for C and Fortran

... till some time ago, pure MPI was implicitly assumed to be as efficient as a well-implemented hybrid MPI/OpenMP code using MPI for inter-node communication and OpenMP for intra-node parallelisation.


Process

A process is created by the operating system and requires a fair amount of "overhead".

Processes contain information about program resources and program execution state, including:
- Process ID, process group ID, user ID, and group ID
- Environment
- Working directory
- Program instructions
- Registers
- Stack
- Heap
- File descriptors
- Signal actions
- Shared libraries
- Inter-process communication tools (such as message queues, pipes, semaphores, or shared memory)
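As an illustration of the point above (a minimal sketch added here, not part of the original slides): creating a process with fork() duplicates the parent's resources, so an update made in the child is invisible to the parent.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int counter = 0;
    pid_t pid = fork();          /* the OS creates a new process: its own copy of
                                    the address space, descriptors, signal actions */
    if (pid == 0) {              /* child process */
        counter++;               /* updates only the child's private copy */
        printf("child:  counter = %d\n", counter);
        return 0;
    }
    wait(NULL);                  /* parent waits for the child to finish */
    printf("parent: counter = %d\n", counter);   /* still 0: nothing is shared */
    return 0;
}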


Thread

A thread is defined as an independent stream of instructions that can be scheduled to run as such by the operating system.

Threads use and exist within the process resources:
- they are able to be scheduled by the operating system
- they run as independent entities
- they duplicate only the bare essential resources that enable them to exist as executable code.

This independent flow of control is accomplished because a thread maintains its own:
- Stack pointer
- Registers
- Scheduling properties (such as policy or priority)
- Set of pending and blocked signals
- Thread-specific data

Threads may share the process resources with other threads that act equally independently (and dependently).
Reading and writing the same memory locations is possible, and therefore requires explicit synchronization by the programmer.
Threads die if the parent process dies.
A thread is "lightweight" because most of the overhead has already been paid when its process was created.
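The complementary sketch with POSIX threads (again an illustrative addition): the threads run in the same address space, so shared data is visible to all of them and needs explicit synchronization.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                             /* shared by all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                   /* explicit synchronization */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);              /* 400000: all threads saw the same memory */
    return 0;
}

Compile with the threading flag of your compiler (e.g. cc -pthread).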


MPI Execution Model

- Single Program Multiple Data (SPMD)
- A copy of the code is executed by each process
- the execution flow is different depending on the context (process id, local data, etc.)
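A minimal SPMD example in C (illustrative addition): every process executes the same program, and the rank obtained from MPI_Comm_rank is the context that differentiates the execution flows.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* the context: process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0)                          /* only one process takes this branch */
        printf("%d processes started\n", size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}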


OpenMP Execution Model

- A single thread starts executing sequentially
- When a parallel region is reached, several slave threads are forked to run in parallel
- At the end of the parallel region, all the slave threads die
- Only the master thread continues the sequential execution
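A minimal fork-join sketch in C (illustrative addition): only the master thread exists outside the parallel region; the slave threads live between the fork and the join.

#include <omp.h>
#include <stdio.h>

int main(void) {
    printf("sequential part: only the master thread\n");
    #pragma omp parallel                    /* fork: slave threads are created */
    {
        printf("parallel region: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                       /* join: slave threads die here */
    printf("sequential part again: only the master thread continues\n");
    return 0;
}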


Processors: trends

[Figure: aggregate number of cores of the Top500 supercomputers, June 1993 - June 2008; the vertical axis runs from 0 to 3,000,000 cores]


MPI inter-process communication on multi-core CPUs

[Figure: several multi-core nodes connected by a network; an MPI_BCAST involves every process on every node]

1 MPI process per core:
- stresses the network
- stresses the OS

Many MPI codes heavily use ALLTOALL communication: messages = processes * processes.

We need to exploit the hierarchy: re-design applications to mix message passing and multi-threading.


Hybrid Model

Multiple SMP (Symmetric Multiprocessor) nodes connected by an interconnection network.
Each node is mapped to (at least) one MPI process and several OpenMP threads.


Hybrid model: benefits

Collective communication is often a bottleneck.
A hybrid implementation:
- decreases the number of messages by a factor of (# threads)^2
- increases the length of messages by a factor of (# threads)
For example, with 8 threads per MPI process, an all-to-all among 64 cores involves 8 x 8 messages instead of 64 x 64.




Domain decomposition

MPI implementation:
- each process has to exchange ghost cells
- even if the two processes are within the same node (two different processes do not share the same memory)


Domain decomposition /1

The hybrid approach allows more cells to be shared: each thread can access all the cells within the node (see the sketch below).
- communication decreases
- the size of MPI messages increases
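A minimal 1-D sketch of the two slides above (an illustrative addition, with arbitrary sizes and a trivial update rule): MPI exchanges one ghost cell per neighbouring process, while inside the node the OpenMP threads simply share the local array, so no intra-node messages are needed.

#include <mpi.h>
#include <omp.h>

#define NLOC 1000                                /* local cells per MPI process */

int main(int argc, char *argv[]) {
    int provided, rank, size;
    double u[NLOC + 2], unew[NLOC + 2];          /* +2 ghost cells */

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int i = 0; i < NLOC + 2; i++) u[i] = rank;      /* some initial data */

    /* ghost-cell exchange: one message per neighbouring process,
       regardless of how many threads will later work on the data */
    MPI_Sendrecv(&u[NLOC], 1, MPI_DOUBLE, right, 0,
                 &u[0],    1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  1,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* inside the node the threads share u[] directly: no messages */
    #pragma omp parallel for
    for (int i = 1; i <= NLOC; i++)
        unew[i] = 0.5 * (u[i - 1] + u[i + 1]);

    MPI_Finalize();
    return 0;
}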


MPI vs. OpenMP

Pure MPI pros:
- High scalability
- High portability
- No false sharing
- Scalability also within the node

Pure MPI cons:
- Not easy to develop and debug
- Explicit communication
- Coarse granularity
- Not easy to guarantee a good load balancing

Pure OpenMP pros:
- Easy to implement (in general)
- Low latency
- Implicit communication
- Coarse and fine granularity
- Dynamic load balancing

Pure OpenMP cons:
- Only on shared memory architectures
- Scalability only within the node
- Waiting for data to be unlocked
- No specific execution order for threads
- Cache consistency and false sharing


MPI plus OpenMP

Pros:
- Better use of the memory hierarchy
- Better use of the interconnect
- Use of OpenMP within a node avoids the overhead of calling the MPI library
- Improved scalability

Cons:
- Overhead in thread management
- Greater attention needed to memory access
- Worse performance (in some cases)


False sharing in OpenMP

#pragma omp parallel for shared(A) schedule(static,1)
for (int i = 0; i < n; i++)
    A[i] = A[i] + 1.0;        /* loop body reconstructed for illustration:
                                 the original slide's body was truncated */

With schedule(static,1) consecutive iterations go to different threads, so neighbouring elements of A, which sit on the same cache line, are written by different threads: the cache line bounces between cores (false sharing) even though no single element is actually shared.


Hybrid pseudo code

call MPI_INIT (ierr)
call MPI_COMM_RANK (...)
call MPI_COMM_SIZE (...)
... some computation and MPI communication
call OMP_SET_NUM_THREADS(4)
!$OMP PARALLEL
!$OMP DO
do i=1,n
   ... computation
enddo
!$OMP END DO
!$OMP END PARALLEL
... some computation and MPI communication
call MPI_FINALIZE (ierr)
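For comparison, a compilable C version of the same skeleton (illustrative addition; MPI_Init_thread and the thread-support levels are discussed in the following slides):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, provided;
    double a[1000];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* ... some computation and MPI communication ... */

    omp_set_num_threads(4);
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        a[i] = i + rank;                    /* ... computation ... */

    /* ... some computation and MPI communication ... */

    printf("rank %d of %d: a[0] = %f\n", rank, size, a[0]);
    MPI_Finalize();
    return 0;
}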


Example

!$omp parallel do
DO I = 1,N
   A(I) = B(I) + C(I)
END DO
CALL MPI_BSEND(A(N),1,.....)
CALL MPI_RECV(A(0),1,.....)
!$omp parallel do
DO I = 1,N
   D(I) = A(I-1) + A(I)
END DO

- an implicit OpenMP barrier is added at the end of the first loop
- the other threads are idle while the master does the MPI calls
- a single MPI task may not use all the network bandwidth
- cache miss to access the received data
- cache miss to access the message data


Mixed paradigm


MPI_INIT_THREAD support (MPI-2)

MPI_INIT_THREAD (required, provided, ierr)
IN:  required, level of thread support requested (integer)
OUT: provided, level of thread support granted (integer); may be less than required

Four levels are supported:

MPI_THREAD_SINGLE: only one thread will run. Equivalent to MPI_INIT.
MPI_THREAD_FUNNELED: processes may be multithreaded,
- all communication is done by the master thread,
- but this may occur inside parallel regions
- other threads may be computing during communication
MPI_THREAD_SERIALIZED: processes may be multithreaded. Several threads may make MPI calls, but only one at a time.
MPI_THREAD_MULTIPLE: multiple threads may make MPI calls, with no restrictions.
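In C the corresponding call is MPI_Init_thread; a minimal sketch of requesting a level and checking what the library actually provides (illustrative addition):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided;
    /* ask for FUNNELED support; the library answers with what it can provide */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "insufficient thread support (provided = %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* ... hybrid MPI + OpenMP code ... */
    MPI_Finalize();
    return 0;
}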


MPI_THREAD_SINGLE (master only)

MPI is called only outside the parallel regions:

!$OMP PARALLEL DO
do i=1,10000
   a(i)=b(i)+f*d(i)
enddo
!$OMP END PARALLEL DO

call MPI_Xxx(...)

!$OMP PARALLEL DO
do i=1,10000
   x(i)=a(i)+f*b(i)
enddo
!$OMP END PARALLEL DO

/* the same pattern in C (the slide's C fragment, completed to mirror the Fortran) */
#pragma omp parallel for
for (i=0; i<10000; i++)
   a[i] = b[i] + f*d[i];

MPI_Xxx(...);

#pragma omp parallel for
for (i=0; i<10000; i++)
   x[i] = a[i] + f*b[i];


MPI_THREAD_FUNNELED

The process may be multithreaded, but only the main thread will make MPI calls.


MPI_THREAD_FUNNELED

MPI calls can be placed:
- outside the parallel regions, or
- inside a parallel region, within an "omp master" construct

!$OMP BARRIER
!$OMP MASTER
call MPI_Xxx(...)
!$OMP END MASTER
!$OMP BARRIER

#pragma omp barrier
#pragma omp master
MPI_Xxx(...);
#pragma omp barrier

"omp master" implies no synchronization, so a barrier is needed before and after it, to ensure that the data and buffers are available before and/or after the MPI calls.
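A self-contained C sketch of the FUNNELED pattern above (illustrative addition; the ring of neighbours and the buffer size are arbitrary choices for the example):

#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int provided, rank, size;
    double sendbuf[100], recvbuf[100];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    #pragma omp parallel
    {
        #pragma omp for                      /* all threads fill the send buffer */
        for (int i = 0; i < 100; i++)
            sendbuf[i] = rank + 0.01 * i;
                                             /* implicit barrier: sendbuf is complete */
        #pragma omp master                   /* only the master thread calls MPI */
        MPI_Sendrecv(sendbuf, 100, MPI_DOUBLE, right, 0,
                     recvbuf, 100, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        #pragma omp barrier                  /* recvbuf ready before anyone uses it */

        #pragma omp for                      /* all threads consume the received data */
        for (int i = 0; i < 100; i++)
            sendbuf[i] += recvbuf[i];
    }

    MPI_Finalize();
    return 0;
}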


MPI_THREAD_SERIALIZED

The process may be multithreaded and multiple threads may make MPI calls, but only one at a time: MPI calls are never made concurrently from two distinct threads.


MPI_THREAD_SERIALIZED

MPI calls can be placed:
- outside the parallel regions
- inside a parallel region, within an "omp master" construct
- inside a parallel region, within an "omp single" construct

!$OMP BARRIER
!$OMP SINGLE
call MPI_Xxx(...)
!$OMP END SINGLE

#pragma omp barrier
#pragma omp single
MPI_Xxx(...);
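A self-contained sketch of another SERIALIZED pattern (illustrative addition, assuming each process really gets the four requested threads): every thread issues its own non-blocking send, and an "omp critical" section guarantees that no two MPI calls overlap.

#include <mpi.h>
#include <omp.h>

#define NT 4                                     /* threads per process (assumed granted) */

int main(int argc, char *argv[]) {
    int provided, rank, size;
    double sendv[NT], recvv[NT];
    MPI_Request req[NT];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    for (int t = 0; t < NT; t++) req[t] = MPI_REQUEST_NULL;

    #pragma omp parallel num_threads(NT)
    {
        int tid = omp_get_thread_num();
        sendv[tid] = rank * 10.0 + tid;          /* each thread's own data */
        #pragma omp critical                     /* MPI calls never overlap */
        MPI_Isend(&sendv[tid], 1, MPI_DOUBLE, right, tid,
                  MPI_COMM_WORLD, &req[tid]);
    }                                            /* implicit barrier */

    for (int t = 0; t < NT; t++)                 /* matching receives, done serially */
        MPI_Recv(&recvv[t], 1, MPI_DOUBLE, left, t,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Waitall(NT, req, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}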


MPI_THREAD_MULTIPLE

- Multiple threads may call MPI at any time, with no restrictions
- The least restrictive and most flexible level, but the application becomes more complicated to handle
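A self-contained sketch for MPI_THREAD_MULTIPLE (illustrative addition, again assuming four threads per process): every thread communicates concurrently, and the per-thread tag keeps the message streams apart.

#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);            /* the library cannot provide this level */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        double out = rank + 0.1 * tid, in;
        /* all threads call MPI at the same time; tag = tid matches the
           message of the thread with the same id on the left neighbour */
        MPI_Sendrecv(&out, 1, MPI_DOUBLE, right, tid,
                     &in,  1, MPI_DOUBLE, left,  tid,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}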


HPC Evolution

From Herb Sutter:

Moore's law is holding, in the number of transistors
- transistors on an ASIC are still doubling every 18 months at constant cost
- 15 years of exponential clock rate growth have ended

Moore's law reinterpreted
- performance improvements now come from the increase in the number of cores on a processor (ASIC)
- # cores per chip doubles every 18 months instead of the clock rate
- 64-512 threads per node will become visible soon


HPC: Evolution / 1

Million-core systems are on the horizon.
Current status (120K-295K cores): BGP@Juelich 295K, XT5@ORNL 224K, BGL@LLNL 213K, BGP@ANL 164K, Roadrunner@LANL 122K, Nebulae@NSCS-Shenzhen 120K
Status 2011-2013: the 200K - 1.6M core range will be achieved.

"Serial computing is dead, and the parallel computing revolution has begun: are you part of the solution, or part of the problem?"
Dave Patterson, UC Berkeley, Usenix conference, June 2008

Amdahl's law exists and implies dramatic problems in the range of 100K - 1M cores.

Power issues: for multi-Petaflop/s to Exaflop/s systems a real disruption in technology is needed. With the current type of technology we can estimate the following situation:

                 2012          2018
Peak             10 Pflop/s    1 Eflop/s
Power for IT     9.3 MW        154 MW
Total power      14 MW         213 MW
Energy cost      12.3 M€/Y     187 M€/Y   (assuming PUE = 1.5 and 0.10 €/kWh)


Real HPC Crisis is with Software

A supercomputer application and its software are usually much longer-lived than the hardware
- hardware life is typically four to five years at most
- Fortran and C are still the main programming models

Programming is stuck
- arguably it hasn't changed much since the 70's

Software is a major cost component of modern technologies
- the tradition in HPC system procurement is to assume that the software is free

It's time for a change
- complexity is rising dramatically
- challenges for the applications on Petaflop systems
- improvement of existing codes will become complex and partly impossible
- the use of O(100K) cores implies a dramatic optimization effort
- new paradigms such as the support of a hundred threads in one node imply new parallelization strategies
- implementation of new parallel programming methods in existing large applications does not always have a promising perspective

There is the need for new community codes


Modern Languages and Models

Current programming models:
- MPI
- OpenMP

Hybrid programming models:
- OpenMP + MPI (or MPI + multithreading)
- MPI + stream processing (CUDA, OpenCL)

Partitioned Global Address Space (PGAS) programming languages:
- Unified Parallel C (UPC), Coarray Fortran, Titanium

Next generation programming languages and models:
- Chapel, X10, Fortress, Transactional Memory

Languages, paradigms and environments for hardware accelerators:
- Cell programming
- CUDA
- OpenCL
- StarSs
- Cn (FPGA)
- CAPS HMPP
- RapidMind
- API support for compilers


Some Current Unmet Needs

- Performance / portability
- Fault tolerance
- Better programming models
  - global shared address space
  - visible locality

Maybe coming soon (since incremental, yet offering real benefits):
- Partitioned Global Address Space (PGAS) languages:
  - "minor" extensions to existing languages
  - more convenient than MPI
  - performance transparency via explicit remote memory references

The critical cycle of prototyping, assessment, and commercialization must be a long-term, sustaining investment, not a one-time crash program.
