
PCS - Part 2: Multiprocessor Architectures

Peter Sobe
Institute of Computer Engineering
University of Lübeck, Germany
Baltic Summer School, Tartu 2009



Part 2 - Contents

Multicore and symmetrical multiprocessors
Cache Coherency
Distributed Shared Memory
Programming Models for Multiprocessor Systems:
    Multiple processes
    Multithreading
    OpenMP
Graphic Processing Units (GPU)



Multiprocessor Systems

Structure:

[Diagram: processors P0 ... P(p-1), each with a private cache, connected via a communication network to memory modules MEM.]

Shared memory for all processors, i.e. a common address space
Coordination and cooperation using shared variables in memory
Computer runs with a single instance of the operating system



Shared Memory Multiprocessors

Processors potentially operate independently and asynchronously
Cooperative operation is obtained by software

[Diagram: processors 1 ... N, each with a control unit and an arithmetic-logical unit processing instructions and data, connected via a communication network to global memory units 1 ... M.]



Symmetrical Multiprocessors (SMP)

SMP - symmetry in terms of the same processor type and equal cost for memory access, independent of the originating processor and of the accessed physical address
Example: Intel SMP servers using the Front-Side Bus
SMPs scale up to about 32 processors; bigger systems do not perform well due to the bottleneck of a common bus.



MultiCore

Many processor cores on a single chip
Used like a symmetrical multiprocessor system
Processors with private L1 cache
Cache coherency by a MESI-like protocol (MOESI)
Private or shared L2 cache
Cores share the memory interface

[Diagram: dual-core chip with CPU1 and CPU2, each with a private L1 cache, a shared L2 cache and a common bus interface.]



Distributed Shared Memory (1)

Mixed concept: Distributed Shared Memory
Hardware structure looks like a distributed memory architecture
OS techniques combined with hardware acceleration provide a virtual shared memory

[Diagram: Processor-Memory-Units 1 ... N, each with a control unit, an arithmetic-logical unit and local memory for instructions and data, connected via a communication network.]



Distributed Shared Memory (2)

Processor-Memory-Units are connected to form a multiprocessor system
Communication network is mostly a hierarchical switched network
Asymmetric structure: different memory access cost, depending on the referenced address and on the processor that originates the access
Non-Uniform Memory Access (NUMA; ccNUMA when cache coherent)



Example: SUN SF15K (1)

Sun SF15K:
ccNUMA multiprocessor system
72 Sun UltraSPARC III processors at 900 MHz
18 system boards with 4 processors and 4 memory modules each
Within a system board: UMA/SMP, cache coherency by snooping
Across different system boards: directory-based cache coherency, implemented by SSM agents



Example: SUN SF15K (2)

Memory access times (750 MHz UltraSPARC III):
    same CPU:        216 ns
    same board:      235 ns
    different board: 375 ns

Communication network:
    18x18 crossbar for addresses and cache coherency control signals
    18x18 crossbar for data transfer



Shared Memory and Caching

Caches are used to relieve the network and main memory from frequent data transfers
Non-shared data can be kept in caches for a long time without interaction with main memory
This improves the scalability of the system, but introduces a consistency problem. This problem is solved by cache coherency protocols.

Consistency problem:

[Diagram: consistent state - the caches of P1 and P2 and main memory all hold the value 1000 at address 23. After P1 writes 2500 to address 23, only P1's cache holds 2500, while P2's cache and main memory still hold 1000; the memory is updated only after the write-back (inconsistent state).]



Cache Coherency

Coherency:
Ensures that no stale copies of data are used
Weaker than consistency, i.e. inconsistencies are allowed, but they are kept track of

Protocols:
Invalidation: invalidate a copy when another processor writes to the address (snooping); write-through is always necessary
MESI: keeps track of the usage of data, snooping, write-back only when necessary
Directory-based cache coherency: for systems without a shared address bus



Cache Coherency: MESI (1)

Motivation for MESI: allow the write-back strategy as long as no other processor is accessing the cached address
Protocols similar to MESI also exist for DSM systems without a shared snooping medium ⇒ directory-based caches
The term 'MESI' comes from the 4 states 'M', 'E', 'S' and 'I'



Cache Coherency: MESI (2)

M  Exclusive Modified    The line is exclusively in this cache and got modified (written)
E  Exclusive Unmodified  The line is exclusively in this cache but was not modified, i.e. was only accessed by read operations
S  Shared Unmodified     The line is also present in another processor's cache, but was not modified
I  Invalid               The line was modified by another processor; the cache entry may not be used



Cache Coherency: MESI (3)

States and transitions:

Local events:
    RM ... read miss
    RH ... read hit
    WM ... write miss
    WH ... write hit

Distant events:
    SHR ... shared read
    SHW ... shared write

Bus actions annotating the transitions: dirty line copy back, invalidate, read with intent to modify, cache line fill

[State diagram: transitions between the states Invalid, Shared unmodified, Exclusive unmodified and Exclusive modified, triggered by the local events RH/RM/WH/WM and the distant events SHR/SHW.]

Figure taken from: T. Ungerer, 'Parallelrechner und Parallele Programmierung'
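As an illustrative sketch (not part of the original slides), the local and distant events can be modeled as a small state machine for one cache line in C; the transitions follow the standard MESI behaviour, with the bus actions indicated as comments:

/* Illustrative sketch of MESI state transitions for one cache line.
   Event names follow the slide: RH/RM/WH/WM are local, SHR/SHW are distant. */
typedef enum { M_MODIFIED, E_EXCLUSIVE, S_SHARED, I_INVALID } mesi_state;
typedef enum { RH, RM, WH, WM, SHR, SHW } mesi_event;

mesi_state mesi_next(mesi_state s, mesi_event e, int other_cache_has_copy)
{
    switch (e) {
    case RH:  return s;            /* read hit: state unchanged                        */
    case RM:  /* cache line fill; exclusive only if no other cache holds the line      */
              return other_cache_has_copy ? S_SHARED : E_EXCLUSIVE;
    case WH:  /* write hit; from S an invalidate is sent to the other caches           */
              return M_MODIFIED;
    case WM:  /* write miss: read with intent to modify, then write                    */
              return M_MODIFIED;
    case SHR: /* another processor reads; from M the dirty line is copied back         */
              return (s == I_INVALID) ? I_INVALID : S_SHARED;
    case SHW: /* another processor writes: the local copy becomes invalid              */
              return I_INVALID;
    }
    return s;
}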



Cache Coherency: MESI-like protocols

MESI requires an address bus visible to all caches, thus MESI is only appropriate for MultiCore and SMP systems

DSM system:
No shared address bus, instead a decentralized network for address and data transfer
Directory-based cache coherence protocols: each memory line is tagged with information about which caches hold a copy of the line
A distributed protocol is invoked each time a memory line or a cache line is accessed
An SSM agent runs the coherency protocol on behalf of the local caches.



Example: SUN SF15K (3)

Within a system board: cache coherency by snooping and MESI
Additionally, each system board contains an SSM agent, working according to a directory-based cache coherency algorithm

Principle:
Cache coherency interactions remain local within a board, as long as no board-distant processor is recorded in the presence vector.
If a copy of a memory line is stored in a board-distant cache/processor, then the SSM agent runs the distributed protocol.



Example: SUN SF15K (4)

Example - invalidate cache copies after altering a memory line:
The SSM agent initiates the transfer of the address across the 18x18 address crossbar with the control wires set to 'Invalidate'
The destination board is contained in a part of the address
The SSM agent of the destination board receives the address; the address is then transferred via the local address bus with the control signal set to 'Shared Write'



Example: SUN SF15K (5)



Programming Models (1)

The choice is mainly influenced by the aspects of shared memory and cache coherency

Options:
Multiple processes (using fork) and communication via shmem segments (a minimal sketch follows after this list)
Multithreading: threads run on different nodes and utilize the parallel machine; threads run in a shared address space
OpenMP - a set of compiler directives for controlling multi-threaded execution over a divided iteration space

In addition, more general programming models work on shared memory computers:
Explicit message passing among multiple processes: Unix pipes/sockets, MPI
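A minimal sketch of the first option (not part of the original slides), assuming POSIX shared memory and an illustrative segment name: a parent and a child process created with fork cooperate via a shared counter.

/* Sketch: parent and child process share a counter via a POSIX shmem segment. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = shm_open("/pcs_demo", O_CREAT | O_RDWR, 0600);  /* illustrative name */
    ftruncate(fd, sizeof(int));
    int *counter = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    *counter = 0;

    if (fork() == 0) {      /* child: writes into the shared segment */
        *counter = 42;
        return 0;
    }
    wait(NULL);             /* parent: waits for the child, then reads */
    printf("counter written by child: %d\n", *counter);

    munmap(counter, sizeof(int));
    shm_unlink("/pcs_demo");
    return 0;
}

Compile with cc demo.c (older Linux systems additionally need -lrt); with System V shmget/shmat the structure is analogous.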



Programming Models: Multithreading

Several threads run on several processors under control of the operating system
OS-specific thread functions, e.g. Solaris threads
Portability standard: POSIX threads, pthread library

Basic functions:

int pthread_create(pthread_t *thread,
                   const pthread_attr_t *attr,
                   void *(*start_routine)(void *),
                   void *arg);

void pthread_exit(void *value_ptr);

int pthread_join(pthread_t thread, void **value_ptr);
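A minimal usage sketch of these three functions (illustrative, not part of the original slides): four threads are created, each reporting the index range it would handle, and the main thread joins them.

/* Sketch: create 4 POSIX threads and wait for their completion. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

void *work(void *arg)
{
    long id = (long)arg;
    printf("thread %ld handles indices %ld..%ld\n", id, id * 25, id * 25 + 24);
    pthread_exit(NULL);             /* same effect as returning from the function */
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);   /* wait until thread i has terminated */
    return 0;
}

Compile with cc -pthread demo.c.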



Programming Models: OpenMP

OpenMP:

Example for loop parallelization: a parallel for loop of the form for (i = 0; i < n; i++) ...; a complete sketch is given below.
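A minimal, compilable sketch of OpenMP loop parallelization (the element-wise vector addition used as the loop body is an illustrative assumption, not necessarily the body shown on the original slide); compile e.g. with gcc -fopenmp:

/* Sketch: OpenMP loop parallelization of an element-wise vector addition. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

    /* the iteration space is divided among the threads of a parallel team */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[N-1] = %f, computed with up to %d threads\n",
           a[N - 1], omp_get_max_threads());
    return 0;
}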


Graphic Processing Units (GPU)

Usage of programmable shader units in modern graphics cards for non-graphics computations

Data-parallel execution model:
The same function is applied to several data portions
Programmed in a MIMD style
Executed partly in SIMD mode

C programming language extensions for GPUs:
CUDA extension (NVidia)
Brook+ (ATI)
OpenCL (platform independent)



Graphic Processing Units

Architecture:
Multiprocessors that operate in SIMD mode (but also execute MIMD code in a sequentialized fashion)
A set of multiprocessors per device
Host and device memory
Device memory is shared by all multiprocessors
Input data must be copied from host to device memory, output data goes the reverse way
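A hedged sketch of this host/device data flow using the CUDA runtime API (not part of the original slides; the function name and buffer handling are illustrative):

#include <cuda_runtime.h>

/* Sketch: copy input data to device memory, run kernels, copy results back. */
void process_on_gpu(float *h_x, int n)
{
    float *d_x;                                       /* pointer into device memory */
    cudaMalloc((void **)&d_x, n * sizeof(float));     /* allocate on the device */
    cudaMemcpy(d_x, h_x, n * sizeof(float),
               cudaMemcpyHostToDevice);               /* host -> device */
    /* ... launch kernels that operate on d_x ... */
    cudaMemcpy(h_x, d_x, n * sizeof(float),
               cudaMemcpyDeviceToHost);               /* device -> host */
    cudaFree(d_x);
}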



Cuda - Programming (1)

NVidia - Compute Unified Device Architecture
Programming and execution model - oriented on, but not fixed to, GPU structures
A kernel function is executed by many threads that commonly operate on different portions of the input data
Blocks of threads - run in parallel and may use shared memory
Grid of blocks - blocks are executed either in a batch mode or in parallel, depending on the device capabilities



Cuda - Programming (2)

Programming interface:
Supports 1-, 2- and 3-dimensional blocks and grids
Addressing of data elements within the kernel functions by blockIdx (block in a grid) and threadIdx (thread in a block)

Example: parallel computation y[] = a*x[] + y[]

__global__
void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
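A hedged host-side launch sketch (not part of the original slides): the kernel covers n elements with 256 threads per block; d_x and d_y are assumed to be device pointers prepared as in the copy sketch above.

/* Sketch: host-side launch of saxpy_parallel, 256 threads per block.
   d_x and d_y are device pointers already filled with the input data. */
void run_saxpy(int n, float a, float *d_x, float *d_y)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;   /* enough blocks to cover n elements */
    saxpy_parallel<<<blocks, threads>>>(n, a, d_x, d_y);
    cudaDeviceSynchronize();                     /* wait until the kernel has finished */
}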


Summary Part 2

Multiprocessor systems with many processors, connected by a shared memory
SMP, MultiCore and DSM
Such systems scale up to 32 processors (SMP) and a few hundred processors (DSM)
Cache coherency allows using a shared memory transparently, without caring for cache consistency problems
Programming models are based on shared memory, e.g. multiple threads

