PCS - Part 2: Multiprocessor Architectures
Peter Sobe
Institute of Computer Engineering
University of Lübeck, Germany
Baltic Summer School, Tartu 2009
Part 2 - Contents

- Multicore and symmetrical multiprocessors
- Cache coherency
- Distributed shared memory
- Programming models for multiprocessor systems:
  - multiple processes
  - multithreading
  - OpenMP
- Graphics processing units (GPU)
Multiprocessor Systems

Structure:

(Figure: processors P0 ... P(p-1), each with a private cache, connected through a communication network to memory modules MEM.)

- Shared memory for all processors, i.e. a common address space
- Coordination and cooperation through shared variables in memory
- The computer runs a single instance of the operating system
Shared Memory Multiprocessors

- Processors potentially operate independently and asynchronously
- Cooperative operation is achieved in software

(Figure: N processors, each with a control unit and an arithmetic-logical unit, exchange instructions and data with M global memory units over a communication network.)
Symmetrical Multiprocessors (SMP)

- SMP - symmetry in terms of the same processor type and equal cost for memory access, independent of the originating processor and of the accessed physical address
- Example: Intel SMP servers using a front-side bus
- SMPs scale up to about 32 processors; larger systems do not perform well because the common bus becomes a bottleneck.
MultiCore

- Many processor cores on a single chip
- Used like a symmetrical multiprocessor system
- Processors with a private L1 cache
- Cache coherency by a MESI-like protocol (MOESI)
- Private or shared L2 cache
- Cores share the memory interface

(Figure: CPU1 and CPU2, each with a private L1 cache, share an L2 cache and the bus interface.)
Distributed Shared Memory (1)

Hybrid concept: distributed shared memory

- The hardware structure looks like a distributed memory architecture
- OS techniques combined with hardware acceleration provide a virtual shared memory

(Figure: N processor-memory units, each with a control unit, an arithmetic-logical unit and local memory, connected by a communication network.)
Distributed Shared Memory (2)

- Processor-memory units connected into a multiprocessor system
- The communication network is typically a hierarchical switched network
- Asymmetric structure: memory access cost differs, depending on the referenced address and on the processor that originates the access
- Non-Uniform Memory Access (NUMA; ccNUMA when cache coherent)
Example: Sun SF15K (1)

Sun SF15K:

- ccNUMA multiprocessor system
- 72 Sun UltraSPARC III processors at 900 MHz
- 18 system boards with 4 processors and 4 memory modules each
- Within a system board: UMA/SMP, cache coherency by snooping
- Across system boards: directory-based cache coherency, implemented by SSM agents
Example: Sun SF15K (2)

Memory access times (750 MHz UltraSPARC III):

  same CPU          216 ns
  same board        235 ns
  different board   375 ns

Communication network:

- 18x18 crossbar for addresses and cache coherency control signals
- 18x18 crossbar for data transfer
Shared Memory and Caching

- Caches are used to relieve the network and main memory from frequent data transfers
- Non-shared data can be kept in caches for a long time without interaction with main memory
- This improves the scalability of the system, but introduces a consistency problem, which is solved by cache coherency protocols.

Consistency problem:

(Figure: P1 writes the value 2500 to address 23 in its cache, using write-back. Main memory and P2's cache still hold the old value 1000: the copies are consistent before the write, and inconsistent until the modified line is written back.)
Cache Coherency

Coherency:

- Ensures that no stale copies of data are used
- Weaker than consistency, i.e. inconsistencies are allowed, as long as they are kept track of

Protocols:

- Invalidation: a copy is invalidated when another processor writes to its address (detected by snooping); requires the write-through strategy
- MESI: keeps track of how data is used (snooping); writes back only when necessary
- Directory-based cache coherency: for systems without a shared address bus
Cache Coherency: MESI (1)

- Motivation for MESI: allow the write-back strategy as long as no other processor accesses the cached address
- Protocols similar to MESI also exist for DSM systems without a shared snooping medium ⇒ directory-based caches
- The term 'MESI' comes from the four states 'M', 'E', 'S' and 'I'
Cache Coherency: MESI (2)

M  Exclusive Modified    The line is exclusively in this cache and was modified (written).
E  Exclusive Unmodified  The line is exclusively in this cache but was not modified, i.e. it was only accessed by read operations.
S  Shared Unmodified     The line is also present in another processor's cache, but was not modified.
I  Invalid               The line was modified by another processor; the cache entry may not be used.
Cache Coherency: MESI (3)

States and transitions:

Local events:
  RM ... read miss
  RH ... read hit
  WM ... write miss
  WH ... write hit

Distant (snooped) events:
  SHR ... shared read
  SHW ... shared write

Bus actions: dirty line copy back, invalidate, read with intent to modify, cache line fill.

(Figure: state diagram over the states Invalid, Shared unmodified, Exclusive unmodified and Exclusive modified. A read miss fills the line into Shared or Exclusive unmodified, depending on whether another cache holds a copy; write hits and write misses lead to Exclusive modified; a snooped SHR demotes the exclusive states to Shared unmodified; a snooped SHW invalidates the line; read hits leave the state unchanged.)

Figure taken from: T. Ungerer, 'Parallelrechner und Parallele Programmierung'
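The transitions above can be sketched as a small state machine. The code below is an illustrative model, not production coherency logic: the function and event names are mine, and the read miss is split into a "shared" and an "exclusive" variant depending on the snoop response, as in the diagram.

```c
/* Toy model of the MESI transitions shown in the state diagram. */
typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_state;

/* local events; RM is split by the snoop response */
typedef enum { EV_RH, EV_RM_SHARED, EV_RM_EXCLUSIVE, EV_WH, EV_WM } local_event;

/* snooped (distant) events */
typedef enum { EV_SHR, EV_SHW } remote_event;

/* apply a local processor event to a line's state */
mesi_state mesi_local(mesi_state st, local_event ev) {
    switch (ev) {
    case EV_RH:           return st;      /* read hit: state unchanged */
    case EV_RM_SHARED:    return MESI_S;  /* line fill, another cache holds a copy */
    case EV_RM_EXCLUSIVE: return MESI_E;  /* line fill, no other copy exists */
    case EV_WH:           return MESI_M;  /* E->M silently; S->M after invalidating others */
    case EV_WM:           return MESI_M;  /* read with intent to modify, then write */
    }
    return st;
}

/* apply a snooped event caused by another processor on the same address */
mesi_state mesi_remote(mesi_state st, remote_event ev) {
    if (ev == EV_SHW)                     /* another processor writes: our copy is stale */
        return MESI_I;
    if (st == MESI_M || st == MESI_E)     /* SHR: dirty line is copied back, then shared */
        return MESI_S;
    return st;
}
```

A write hit while in Shared is the case where the invalidate bus action is broadcast, so that all other copies move to Invalid before this cache enters Exclusive Modified.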
Cache Coherency: MESI-like protocols

- MESI requires an address bus visible to all caches; thus MESI alone is only appropriate for MultiCore and SMP systems

DSM systems:

- No shared address bus; instead, a decentralized network for address and data transfer
- Directory-based cache coherency protocols: each memory line is tagged with information about which caches hold a copy of the line
- A distributed protocol is invoked each time a memory line or a cache line is accessed
- An SSM agent runs the coherency protocol on behalf of the local caches.
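A directory entry of this kind can be sketched as a presence bit vector per memory line. The names and layout below are illustrative, not the SF15K's actual data structures:

```c
/* Toy directory entry for one memory line: a presence bit vector records
   which boards hold a cached copy (illustrative sketch, 18 boards as in
   the SF15K example). */
#define NBOARDS 18

typedef struct {
    unsigned int presence;  /* bit b set: board b caches this line */
    int dirty_owner;        /* board holding a modified copy, or -1 */
} dir_entry;

/* a board reads the line: record its copy in the presence vector */
void dir_read(dir_entry *d, int board) {
    d->presence |= 1u << board;
}

/* a board writes the line: invalidation messages would be sent to every
   other board whose presence bit is set; here we just clear those bits */
void dir_write(dir_entry *d, int board) {
    d->presence = 1u << board;   /* all other copies invalidated */
    d->dirty_owner = board;
}

/* coherency traffic can stay board-local when no remote board holds a copy */
int dir_is_local_only(const dir_entry *d, int board) {
    return (d->presence & ~(1u << board)) == 0;
}
```

`dir_is_local_only` is the check behind the SF15K principle described on the next slide: as long as it returns true, no distributed protocol round is needed.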
Example: Sun SF15K (3)

- Within a system board: cache coherency by snooping and MESI
- Additionally, each system board contains an SSM agent, working according to a directory-based cache coherency algorithm

Principle:

- Cache coherency interactions remain local to a board, as long as the presence vector records no processor on a distant board.
- If a copy of a memory line is stored in a cache on a distant board, the SSM agent runs the distributed protocol.
Example: Sun SF15K (4)

Example - invalidating cache copies after altering a memory line:

- The SSM agent initiates the transfer of the address across the 18x18 address crossbar, with the control wires set to 'Invalidate'
- The destination board is encoded in a part of the address
- The SSM agent of the destination board receives the address, which is then transferred over the local address bus with the control signal set to 'Shared Write'
Example: Sun SF15K (5)

(Figure only.)
Programming Models (1)

The choice is mainly influenced by the shared memory and its cache coherency.

Options:

- Multiple processes (created with fork), communicating via shared memory segments
- Multithreading: threads run on different processors and utilize the parallel machine; all threads share a common address space
- OpenMP - a set of compiler directives for controlling multi-threaded execution

More general programming models also work on shared memory computers:

- Explicit message passing among multiple processes: Unix pipelines/sockets, MPI
Programming Models: Multithreading

Several threads run on several processors under control of the operating system

- OS-specific thread functions, e.g. Solaris threads
- Portability standard: POSIX threads, the pthread library

Basic functions:

int pthread_create(pthread_t *thread,
                   const pthread_attr_t *attr,
                   void *(*start_routine)(void *),
                   void *arg);
void pthread_exit(void *value_ptr);
int pthread_join(pthread_t thread, void **value_ptr);
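A minimal usage sketch of these functions (the names worker and square_sum are illustrative): each thread squares its argument and hands the result back through pthread_join.

```c
#include <pthread.h>
#include <stdint.h>

/* worker thread: receives its index via arg, returns its square */
static void *worker(void *arg) {
    intptr_t n = (intptr_t)arg;
    return (void *)(n * n);          /* result is passed back via pthread_join */
}

/* create nthreads workers and sum their results: 0^2 + 1^2 + ... */
long square_sum(int nthreads) {
    pthread_t tid[nthreads];
    long sum = 0;
    for (intptr_t i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < nthreads; i++) {
        void *res;
        pthread_join(tid[i], &res);  /* blocks until thread i has exited */
        sum += (intptr_t)res;
    }
    return sum;
}
```

Passing the index as the `void *` argument avoids sharing a loop variable between threads; results that do not fit in a pointer would instead be returned through memory.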
Programming Models: OpenMP

OpenMP:

Example for loop parallelization:

#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = b[i] + c[i];
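A complete, compilable form of such a loop parallelization (a sketch; the function and array names are illustrative). Compiled with -fopenmp the iterations are distributed over the threads of a team; without the flag the pragma is ignored and the loop runs serially, with the same result:

```c
/* OpenMP loop parallelization: element-wise vector addition.
   The loop variable of a "parallel for" loop is implicitly private,
   and each thread processes a disjoint chunk of the index range. */
void vector_add(const float *b, const float *c, float *a, int n) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

The loop qualifies for this directive because its iterations are independent: no iteration reads an element that another iteration writes.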
Graphics Processing Units (GPU)

- Usage of the programmable shader units in modern graphics cards for non-graphics computations
- Data-parallel execution model:
  - the same function on several data portions
  - programmed in a MIMD style
  - executed partly in SIMD mode
- C-based programming languages:
  - CUDA extension (NVIDIA)
  - Brook+ (ATI)
  - OpenCL (platform independent)
Graphics Processing Units

Architecture:

- Multiprocessors that operate in SIMD mode (but can also execute MIMD code in a sequentialized fashion)
- A set of such multiprocessors forms the device
- Host memory and device memory
- Device memory is shared by all multiprocessors
- Input data must be copied from host to device memory; output data travels the reverse way
CUDA - Programming (1)

- NVIDIA - Compute Unified Device Architecture
- Programming and execution model - oriented on, but not fixed to, GPU structures
- A kernel function is executed by many threads that commonly operate on different portions of the input data
- Blocks of threads run in parallel and may use shared memory
- Grid of blocks - blocks are executed either in batch mode or in parallel, depending on the device capabilities
CUDA - Programming (2)

Programming interface:

- Supports 1-, 2- and 3-dimensional blocks and grids
- Addressing of data elements within the kernel functions by blockIdx (block in a grid) and threadIdx (thread in a block)

Example: parallel computation y[] = a*x[] + y[]

__global__
void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
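The host side of this example can be sketched as follows (error checking omitted; `h_x` and `h_y` are assumed host arrays of `n` floats, and the block size 256 is a typical but arbitrary choice). It performs the host-to-device and device-to-host copies described on the architecture slide:

```c
// Host-side launch for saxpy_parallel (illustrative sketch)
size_t bytes = n * sizeof(float);
float *d_x, *d_y;
cudaMalloc(&d_x, bytes);                               // allocate device memory
cudaMalloc(&d_y, bytes);
cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // input: host -> device
cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);
int threads = 256;
int blocks = (n + threads - 1) / threads;              // enough blocks to cover n elements
saxpy_parallel<<<blocks, threads>>>(n, 2.0f, d_x, d_y);
cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);   // output: device -> host
cudaFree(d_x);
cudaFree(d_y);
```

The `if (i < n)` guard in the kernel is needed because the last block may contain threads with indices beyond the array length.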
Summary Part 2

- Multiprocessor systems contain many processors connected by a shared memory
- SMP, MultiCore and DSM
- Such systems scale up to about 32 processors (SMP) and a few hundred processors (DSM)
- Cache coherency makes it possible to use the shared memory transparently, without caring about cache consistency problems
- Programming models are based on shared memory, e.g. multiple threads