The CINECA high-performance computing system: evolution of ...
<strong>CINECA</strong><br />
Casalecchio di Reno (BO)<br />
www.cineca.it<br />
Via Magnanelli 6/3, 40033 Casalecchio di Reno | 051 6171411 | www.cineca.it<br />
DEISA-TeraGrid Summer School on<br />
HPC Challenges in Computational Sciences<br />
Oct. 4-7, 2010, Acireale, Catania, Italy<br />
Overview on Hybrid Programming<br />
Giovanni Erbacci<br />
HPC Division, <strong>CINECA</strong><br />
g.erbacci@cineca.it
Outline<br />
- Evolution of modern HPC Architectures<br />
- Programming paradigms<br />
- Introduction to hybrid programming<br />
www.cineca.it 2
Parallel Architectures<br />
[Figure: six SMP nodes (processors P sharing memory M) connected by a network]<br />
Up to some years ago:<br />
- mainly MPI applications<br />
- efforts to improve MPI scalability<br />
MPI applications are easy to port also to shared memory architectures:<br />
pure MPI: one process per core<br />
De facto standards<br />
MPI:<br />
- distributed memory systems<br />
- message passing<br />
- data distribution model<br />
- processes<br />
- Version 2.1 (09/08)<br />
- API for C/C++ and Fortran<br />
OpenMP:<br />
- shared memory systems<br />
- thread creation<br />
- relaxed-consistency model<br />
- threads<br />
- Version 3.0 (05/08)<br />
- compiler directives for C and Fortran<br />
… until some time ago, pure MPI was implicitly assumed to be as efficient<br />
as a well-implemented hybrid MPI/OpenMP code using MPI for inter-node<br />
communication and OpenMP for intra-node parallelisation.<br />
Process<br />
A process is created by the operating system, and requires a fair amount of<br />
"overhead".<br />
Processes contain information about program resources and program execution<br />
state, including:<br />
- Process ID, process group ID, user ID, and group ID<br />
- Environment<br />
- Working directory<br />
- Program instructions<br />
- Registers<br />
- Stack<br />
- Heap<br />
- File descriptors<br />
- Signal actions<br />
- Shared libraries<br />
- Inter-process communication tools (such as message queues, pipes,<br />
semaphores, or shared memory)<br />
Thread<br />
A thread is defined as an independent stream of instructions that can be scheduled to run as<br />
such by the operating system.<br />
Threads use and exist within the process resources:<br />
- they can be scheduled by the operating system<br />
- they run as independent entities<br />
- they duplicate only the bare essential resources that enable them to exist as<br />
executable code.<br />
This independent flow of control is accomplished because a thread maintains its own:<br />
- Stack pointer<br />
- Registers<br />
- Scheduling properties (such as policy or priority)<br />
- Set of pending and blocked signals<br />
- Thread-specific data.<br />
Threads share the process resources with other threads that act equally independently (and<br />
interdependently).<br />
Reading and writing the same memory locations is possible, and therefore requires<br />
explicit synchronization by the programmer.<br />
Threads die if the parent process dies.<br />
A thread is "lightweight" because most of the overhead has already been incurred through the<br />
creation of its process.<br />
MPI Execution Model<br />
- Single Program Multiple Data<br />
- A copy of the code is executed<br />
by each process<br />
- the execution flow is different<br />
depending on the context<br />
(process id, local data, etc.)<br />
OpenMP Execution Model<br />
- A single thread starts executing<br />
sequentially<br />
- When a parallel region is<br />
reached, several slave threads are<br />
forked to run in parallel<br />
- At the end of the parallel region,<br />
all the slave threads die<br />
- Only the master thread continues<br />
the sequential execution<br />
Processors: trends<br />
[Figure: aggregate number of cores for the Top500 supercomputers, plotted by year from June 1993 to June 2008 (y-axis: 0 to 3,000,000 cores)]<br />
MPI inter-process communication<br />
MPI on multi-core CPUs<br />
[Figure: MPI_BCAST among four nodes across the network]<br />
1 MPI process / core:<br />
- stresses the network<br />
- stresses the OS<br />
Many MPI codes heavily use<br />
ALLTOALL communication:<br />
messages = processes * processes<br />
We need to exploit the hierarchy:<br />
re-design applications to mix message passing<br />
and multi-threading<br />
Hybrid Model<br />
Multiple SMP (Symmetric Multiprocessor) nodes connected by an<br />
interconnection network.<br />
Each node is mapped to (at least) one MPI process and several OpenMP threads<br />
Hybrid model: benefits<br />
Collective communication is often a bottleneck.<br />
Hybrid implementation:<br />
- decreases the number of messages<br />
by a factor of (# threads)^2<br />
- increases the length of messages<br />
by a factor of (# threads)<br />
Domain decomposition<br />
MPI implementation:<br />
- each process has to exchange ghost cells<br />
- even if the two processes are within the<br />
same node (two different processes do<br />
not share the same memory).<br />
Domain decomposition /1<br />
The hybrid approach allows more cells<br />
to be shared.<br />
Each thread accesses all the cells<br />
within the node:<br />
- communication decreases<br />
- the size of MPI messages<br />
increases.<br />
MPI vs. OpenMP<br />
Pure MPI Pros:<br />
- High scalability<br />
- High portability<br />
- No false sharing<br />
- Scalability within the node<br />
Pure MPI Cons:<br />
- Not easy to develop and debug<br />
- Explicit communication<br />
- Coarse granularity<br />
- Not easy to guarantee good<br />
load balancing<br />
Pure OpenMP Pros:<br />
- Easy to implement (in general)<br />
- Low latency<br />
- Implicit communication<br />
- Coarse and fine granularity<br />
- Dynamic load balancing<br />
Pure OpenMP Cons:<br />
- Only on shared memory architectures<br />
- Scalability only within the node<br />
- Waiting for data to be unlocked<br />
- No specific ordering of threads<br />
- Cache consistency and false sharing<br />
MPI plus OpenMP<br />
Pros:<br />
- Better use of the memory hierarchy<br />
- Better use of the interconnect<br />
- Using OpenMP within a node avoids the overhead of calling the<br />
MPI library<br />
- Improved scalability<br />
Cons:<br />
- Overhead in thread management<br />
- Greater attention to memory access required<br />
- Worse performance (in some cases)<br />
False sharing in OpenMP<br />
#pragma omp parallel for shared(A) schedule(static,1)<br />
for (int i = 0; i < n; i++)<br />
    A[i] += i;  /* with schedule(static,1), adjacent elements of A are<br />
                   updated by different threads: false sharing */<br />
Hybrid pseudo code<br />
call MPI_INIT (ierr)<br />
call MPI_COMM_RANK (…)<br />
call MPI_COMM_SIZE (…)<br />
… some computation and MPI communication<br />
call OMP_SET_NUM_THREADS(4)<br />
!$OMP PARALLEL<br />
!$OMP DO<br />
do i=1,n<br />
… computation<br />
enddo<br />
!$OMP END DO<br />
!$OMP END PARALLEL<br />
… some computation and MPI communication<br />
call MPI_FINALIZE (ierr)<br />
Example<br />
!$omp parallel do<br />
DO I = 1,N<br />
   A(I) = B(I) + C(I)<br />
END DO<br />
CALL MPI_BSEND(A(N),1,.....)<br />
CALL MPI_RECV(A(0),1,.....)<br />
!$omp parallel do<br />
DO I = 1,N<br />
   D(I) = A(I-1) + A(I)<br />
END DO<br />
Notes:<br />
- an implicit OpenMP barrier is added at the end of each parallel loop<br />
- the other threads are idle while the master does the MPI calls<br />
- a single MPI task may not use all the network bandwidth<br />
- cache misses occur when accessing the received data<br />
- cache misses occur when accessing the message data<br />
Mixed paradigm<br />
MPI_INIT_THREAD support (MPI-2)<br />
MPI_INIT_THREAD (required, provided, ierr)<br />
IN: required, level of thread support requested (integer)<br />
OUT: provided, level of thread support actually granted (integer);<br />
provided may be less than required.<br />
Four levels are supported:<br />
MPI_THREAD_SINGLE: only one thread will run. Equivalent to MPI_INIT.<br />
MPI_THREAD_FUNNELED: processes may be multithreaded,<br />
- all communication is done by one thread,<br />
- but this may occur inside parallel regions<br />
- other threads may be computing during communication<br />
MPI_THREAD_SERIALIZED: processes may be multithreaded. Multiple<br />
threads may make MPI calls, but only one at a time.<br />
MPI_THREAD_MULTIPLE: multiple threads may make MPI calls, with no<br />
restrictions.<br />
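In Fortran, requesting a level and checking what the library actually granted might look like this (a sketch; the thread-level constants are ordered, so a simple comparison works):

```fortran
integer :: required, provided, ierr

required = MPI_THREAD_FUNNELED
call MPI_INIT_THREAD(required, provided, ierr)

if (provided < required) then
   ! the library granted less than requested:
   ! fall back to master-only communication, or abort
   call MPI_ABORT(MPI_COMM_WORLD, 1, ierr)
end if
```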
MPI_THREAD_SINGLE (master only)<br />
!$OMP PARALLEL DO<br />
do i=1,10000<br />
a(i)=b(i)+f*d(i)<br />
enddo<br />
!$OMP END PARALLEL DO<br />
call MPI_Xxx(...)<br />
!$OMP PARALLEL DO<br />
do i=1,10000<br />
x(i)=a(i)+f*b(i)<br />
enddo<br />
!$OMP END PARALLEL DO<br />
#pragma omp parallel for<br />
for (i = 0; i < 10000; i++)<br />
    a[i] = b[i] + f * d[i];<br />
MPI_Xxx(...);<br />
#pragma omp parallel for<br />
for (i = 0; i < 10000; i++)<br />
    x[i] = a[i] + f * b[i];<br />
MPI_THREAD_FUNNELED<br />
The process may be multithre<strong>ad</strong>ed but only the main thre<strong>ad</strong> will<br />
make MPI calls.<br />
MPI_THREAD_FUNNELED<br />
- calls outside the parallel region<br />
- inside the parallel region, with "omp master"<br />
!$OMP BARRIER<br />
!$OMP MASTER<br />
call MPI_Xxx(...)<br />
!$OMP END MASTER<br />
!$OMP BARRIER<br />
#pragma omp barrier<br />
#pragma omp master<br />
MPI_Xxx(...);<br />
#pragma omp barrier<br />
There is no synchronization implied by "omp master",<br />
so a barrier is needed before and after, to ensure that the data and<br />
buffers are available before and/or after the MPI calls.<br />
MPI_THREAD_SERIALIZED<br />
The process may be multithreaded and multiple threads may make<br />
MPI calls, but only one at a time:<br />
MPI calls are not made concurrently from two distinct threads.<br />
MPI_THREAD_SERIALIZED<br />
• Outside the parallel region<br />
• Inside the parallel region, with "omp master"<br />
• Inside the parallel region, with "omp single"<br />
!$OMP BARRIER<br />
!$OMP SINGLE<br />
call MPI_Xxx(...)<br />
!$OMP END SINGLE<br />
#pragma omp barrier<br />
#pragma omp single<br />
MPI_Xxx(...);<br />
MPI_THREAD_MULTIPLE<br />
- Multiple thre<strong>ad</strong>s may call MPI anytime with no restrictions.<br />
- Less restrictive and very flexible, but the application becomes<br />
very complicated to handle<br />
HPC Evolution<br />
From Herb Sutter:<br />
Moore's law is holding, in<br />
the number of transistors<br />
– transistors on an ASIC are still<br />
doubling every 18 months at<br />
constant cost<br />
– 15 years of exponential clock<br />
rate growth have ended<br />
Moore's Law reinterpreted<br />
– performance improvements are<br />
now coming from the increase in<br />
the number of cores on a<br />
processor (ASIC)<br />
– #cores per chip doubles every<br />
18 months instead of the clock rate<br />
– 64-512 threads per node will<br />
soon be common<br />
HPC: Evolution / 1<br />
Million-core systems are on the horizon.<br />
Current status (120K-295K cores): BGP@Juelich 295K, XT5@ORNL 224K, BGL@LLNL 213K, BGP@ANL 164K,<br />
Roadrunner@LANL 122K, Nebulae@NSCS-Shenzhen 120K<br />
2011-2013: the 200K-1.6M core range will be achieved<br />
"Serial computing is dead, and the parallel computing revolution has begun:<br />
Are you part of the solution, or part of the problem?"<br />
Dave Patterson, UC Berkeley, Usenix conference, June 2008<br />
Amdahl’s law still holds and implies dramatic problems in the range of 100K - 1M cores.<br />
Power issues:<br />
For multi-Petaflop/s to Exaflop/s systems a real disruption in technology is needed.<br />
With the current type of technology we can estimate the following situation:<br />
2012: 10 Pflop/s, 9.3 MW for IT, 14 MW total, 12.3 M€/year<br />
2018: 1 Eflop/s, 154 MW for IT, 213 MW total, 187 M€/year<br />
(assuming PUE = 1.5 and €0.10 per kWh)<br />
Real HPC Crisis is with Software<br />
Supercomputer applications and software usually live much longer than the<br />
hardware:<br />
- hardware life is typically four-five years at most<br />
- Fortran and C are still the main programming models<br />
Programming is stuck:<br />
- arguably it hasn't changed much since the 70's<br />
Software is a major cost component of modern technologies:<br />
- the tradition in HPC system procurement is to assume that the software is free<br />
It's time for a change:<br />
- complexity is rising dramatically<br />
- challenges for the applications on Petaflop systems<br />
- improvement of existing codes will become complex and partly impossible<br />
- the use of O(100K) cores implies a dramatic optimization effort<br />
- new paradigms such as the support of a hundred threads in one node imply new<br />
parallelization strategies<br />
- implementing new parallel programming methods in existing large<br />
applications does not always have a promising perspective<br />
There is a need for new community codes<br />
Modern Languages and Models<br />
Current programming models:<br />
- MPI<br />
- OpenMP<br />
Hybrid programming models:<br />
- OpenMP + MPI (or MPI + multithreading)<br />
- MPI + stream processing (CUDA, OpenCL)<br />
Partitioned Global Address Space (PGAS) programming languages:<br />
- Unified Parallel C (UPC), Coarray Fortran, Titanium<br />
Next-generation programming languages and models:<br />
- Chapel, X10, Fortress, Transactional Memory<br />
Languages, paradigms and environments for hardware accelerators:<br />
- Cell programming<br />
- CUDA<br />
- OpenCL<br />
- StarSs<br />
- Cn (FPGA)<br />
- CAPS HMPP<br />
- RapidMind<br />
- API support for compilers<br />
Some Current Unmet Needs<br />
Performance / Portability<br />
Fault tolerance<br />
Better programming mo<strong>del</strong>s<br />
- Global shared address space<br />
- Visible locality<br />
Maybe coming soon (since incremental, yet offering real benefits):<br />
- Partitioned Global Address Space (PGAS) languages:<br />
. “Minor” extensions to existing languages<br />
. More convenient than MPI<br />
. Have performance transparency via explicit remote memory references<br />
The critical cycle of prototyping, assessment, and commercialization must be a<br />
long-term, sustaining investment, not a one time, crash program.<br />