PCS - Part 2: Multiprocessor Architectures
Peter Sobe
Institute of Computer Engineering
University of Lübeck, Germany
Baltic Summer School, Tartu 2009
Part 2 - Contents

- Multicore and symmetrical multiprocessors
- Cache coherency
- Distributed shared memory
- Programming models for multiprocessor systems:
  - multiple processes
  - multithreading
  - OpenMP
- Graphics processing units (GPU)
Multiprocessor Systems

Structure:

(Figure: processors P0 ... P(p-1), each with a private cache, connected through a communication network to memory modules MEM.)

- Shared memory for all processors, i.e. a common address space
- Coordination and cooperation through shared variables in memory
- The computer runs a single instance of the operating system
Shared Memory Multiprocessors

- Processors potentially operate independently and asynchronously
- Cooperative operation is achieved in software

(Figure: N processors, each with a control unit and an arithmetic-logical unit, exchange instructions and data with M global memory units over a communication network.)
Symmetrical Multiprocessors (SMP)

- SMP - symmetry in terms of the same processor type and equal cost for memory access, independent of the originating processor and of the accessed physical address
- Example: Intel SMP servers using a front-side bus
- SMPs scale up to about 32 processors; larger systems do not perform well because the common bus becomes a bottleneck.
MultiCore

- Many processor cores on a single chip
- Used like a symmetrical multiprocessor system
- Processors with a private L1 cache
- Cache coherency by a MESI-like protocol (MOESI)
- Private or shared L2 cache
- Cores share the memory interface

(Figure: CPU1 and CPU2, each with a private L1 cache, share an L2 cache and the bus interface.)
Distributed Shared Memory (1)

Hybrid concept: distributed shared memory

- The hardware structure looks like a distributed memory architecture
- OS techniques combined with hardware acceleration provide a virtual shared memory

(Figure: N processor-memory units, each with a control unit, an arithmetic-logical unit and local memory, connected by a communication network.)
Distributed Shared Memory (2)

- Processor-memory units connected into a multiprocessor system
- The communication network is typically a hierarchical switched network
- Asymmetric structure: memory access cost differs, depending on the referenced address and on the processor that originates the access
- Non-Uniform Memory Access (NUMA; ccNUMA when cache coherent)
Example: Sun SF15K (1)

Sun SF15K:

- ccNUMA multiprocessor system
- 72 Sun UltraSPARC III processors at 900 MHz
- 18 system boards with 4 processors and 4 memory modules each
- Within a system board: UMA/SMP, cache coherency by snooping
- Across system boards: directory-based cache coherency, implemented by SSM agents
Example: Sun SF15K (2)

Memory access times (750 MHz UltraSPARC III):

  same CPU          216 ns
  same board        235 ns
  different board   375 ns

Communication network:

- 18x18 crossbar for addresses and cache coherency control signals
- 18x18 crossbar for data transfer
Shared Memory and Caching

- Caches are used to relieve the network and main memory from frequent data transfers
- Non-shared data can be kept in caches for a long time without interaction with main memory
- This improves the scalability of the system, but introduces a consistency problem, which is solved by cache coherency protocols.

Consistency problem:

(Figure: P1 writes the value 2500 to address 23 in its cache, using write-back. Main memory and P2's cache still hold the old value 1000: the copies are consistent before the write, and inconsistent until the modified line is written back.)
Cache Coherency

Coherency:

- Ensures that no stale copies of data are used
- Weaker than consistency, i.e. inconsistencies are allowed, as long as they are kept track of

Protocols:

- Invalidation: a copy is invalidated when another processor writes to its address (detected by snooping); requires the write-through strategy
- MESI: keeps track of how data is used (snooping); writes back only when necessary
- Directory-based cache coherency: for systems without a shared address bus
Cache Coherency: MESI (1)

- Motivation for MESI: allow the write-back strategy as long as no other processor accesses the cached address
- Protocols similar to MESI also exist for DSM systems without a shared snooping medium ⇒ directory-based caches
- The term 'MESI' comes from the four states 'M', 'E', 'S' and 'I'
Cache Coherency: MESI (2)

M  Exclusive Modified    The line is exclusively in this cache and was modified (written).
E  Exclusive Unmodified  The line is exclusively in this cache but was not modified, i.e. it was only accessed by read operations.
S  Shared Unmodified     The line is also present in another processor's cache, but was not modified.
I  Invalid               The line was modified by another processor; the cache entry may not be used.
Cache Coherency: MESI (3)

States and transitions:

Local events:
  RM ... read miss
  RH ... read hit
  WM ... write miss
  WH ... write hit

Distant (snooped) events:
  SHR ... shared read
  SHW ... shared write

Bus actions: dirty line copy back, invalidate, read with intent to modify, cache line fill.

(Figure: state diagram over the states Invalid, Shared unmodified, Exclusive unmodified and Exclusive modified. A read miss fills the line into Shared or Exclusive unmodified, depending on whether another cache holds a copy; write hits and write misses lead to Exclusive modified; a snooped SHR demotes the exclusive states to Shared unmodified; a snooped SHW invalidates the line; read hits leave the state unchanged.)

Figure taken from: T. Ungerer, 'Parallelrechner und Parallele Programmierung'
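The transitions above can be sketched as a small state machine. The code below is an illustrative model, not production coherency logic: the function and event names are mine, and the read miss is split into a "shared" and an "exclusive" variant depending on the snoop response, as in the diagram.

```c
/* Toy model of the MESI transitions shown in the state diagram. */
typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } mesi_state;

/* local events; RM is split by the snoop response */
typedef enum { EV_RH, EV_RM_SHARED, EV_RM_EXCLUSIVE, EV_WH, EV_WM } local_event;

/* snooped (distant) events */
typedef enum { EV_SHR, EV_SHW } remote_event;

/* apply a local processor event to a line's state */
mesi_state mesi_local(mesi_state st, local_event ev) {
    switch (ev) {
    case EV_RH:           return st;      /* read hit: state unchanged */
    case EV_RM_SHARED:    return MESI_S;  /* line fill, another cache holds a copy */
    case EV_RM_EXCLUSIVE: return MESI_E;  /* line fill, no other copy exists */
    case EV_WH:           return MESI_M;  /* E->M silently; S->M after invalidating others */
    case EV_WM:           return MESI_M;  /* read with intent to modify, then write */
    }
    return st;
}

/* apply a snooped event caused by another processor on the same address */
mesi_state mesi_remote(mesi_state st, remote_event ev) {
    if (ev == EV_SHW)                     /* another processor writes: our copy is stale */
        return MESI_I;
    if (st == MESI_M || st == MESI_E)     /* SHR: dirty line is copied back, then shared */
        return MESI_S;
    return st;
}
```

A write hit while in Shared is the case where the invalidate bus action is broadcast, so that all other copies move to Invalid before this cache enters Exclusive Modified.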
Cache Coherency: MESI-like protocols

- MESI requires an address bus visible to all caches; thus MESI alone is only appropriate for MultiCore and SMP systems

DSM systems:

- No shared address bus; instead, a decentralized network for address and data transfer
- Directory-based cache coherency protocols: each memory line is tagged with information about which caches hold a copy of the line
- A distributed protocol is invoked each time a memory line or a cache line is accessed
- An SSM agent runs the coherency protocol on behalf of the local caches.
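A directory entry of this kind can be sketched as a presence bit vector per memory line. The names and layout below are illustrative, not the SF15K's actual data structures:

```c
/* Toy directory entry for one memory line: a presence bit vector records
   which boards hold a cached copy (illustrative sketch, 18 boards as in
   the SF15K example). */
#define NBOARDS 18

typedef struct {
    unsigned int presence;  /* bit b set: board b caches this line */
    int dirty_owner;        /* board holding a modified copy, or -1 */
} dir_entry;

/* a board reads the line: record its copy in the presence vector */
void dir_read(dir_entry *d, int board) {
    d->presence |= 1u << board;
}

/* a board writes the line: invalidation messages would be sent to every
   other board whose presence bit is set; here we just clear those bits */
void dir_write(dir_entry *d, int board) {
    d->presence = 1u << board;   /* all other copies invalidated */
    d->dirty_owner = board;
}

/* coherency traffic can stay board-local when no remote board holds a copy */
int dir_is_local_only(const dir_entry *d, int board) {
    return (d->presence & ~(1u << board)) == 0;
}
```

`dir_is_local_only` is the check behind the SF15K principle described on the next slide: as long as it returns true, no distributed protocol round is needed.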
Example: Sun SF15K (3)

- Within a system board: cache coherency by snooping and MESI
- Additionally, each system board contains an SSM agent, working according to a directory-based cache coherency algorithm

Principle:

- Cache coherency interactions remain local to a board, as long as the presence vector records no processor on a distant board.
- If a copy of a memory line is stored in a cache on a distant board, the SSM agent runs the distributed protocol.
Example: Sun SF15K (4)

Example - invalidating cache copies after altering a memory line:

- The SSM agent initiates the transfer of the address across the 18x18 address crossbar, with the control wires set to 'Invalidate'
- The destination board is encoded in a part of the address
- The SSM agent of the destination board receives the address, which is then transferred over the local address bus with the control signal set to 'Shared Write'
Example: Sun SF15K (5)

(Figure only.)
Programming Models (1)

The choice is mainly influenced by the shared memory and its cache coherency.

Options:

- Multiple processes (created with fork), communicating via shared memory segments
- Multithreading: threads run on different processors and utilize the parallel machine; all threads share a common address space
- OpenMP - a set of compiler directives for controlling multi-threaded execution

More general programming models also work on shared memory computers:

- Explicit message passing among multiple processes: Unix pipelines/sockets, MPI
Programming Models: Multithreading

Several threads run on several processors under control of the operating system

- OS-specific thread functions, e.g. Solaris threads
- Portability standard: POSIX threads, the pthread library

Basic functions:

int pthread_create(pthread_t *thread,
                   const pthread_attr_t *attr,
                   void *(*start_routine)(void *),
                   void *arg);
void pthread_exit(void *value_ptr);
int pthread_join(pthread_t thread, void **value_ptr);
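A minimal usage sketch of these functions (the names worker and square_sum are illustrative): each thread squares its argument and hands the result back through pthread_join.

```c
#include <pthread.h>
#include <stdint.h>

/* worker thread: receives its index via arg, returns its square */
static void *worker(void *arg) {
    intptr_t n = (intptr_t)arg;
    return (void *)(n * n);          /* result is passed back via pthread_join */
}

/* create nthreads workers and sum their results: 0^2 + 1^2 + ... */
long square_sum(int nthreads) {
    pthread_t tid[nthreads];
    long sum = 0;
    for (intptr_t i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < nthreads; i++) {
        void *res;
        pthread_join(tid[i], &res);  /* blocks until thread i has exited */
        sum += (intptr_t)res;
    }
    return sum;
}
```

Passing the index as the `void *` argument avoids sharing a loop variable between threads; results that do not fit in a pointer would instead be returned through memory.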
Programming Models: OpenMP

OpenMP:

Example for loop parallelization:

#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = b[i] + c[i];
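A complete, compilable form of such a loop parallelization (a sketch; the function and array names are illustrative). Compiled with -fopenmp the iterations are distributed over the threads of a team; without the flag the pragma is ignored and the loop runs serially, with the same result:

```c
/* OpenMP loop parallelization: element-wise vector addition.
   The loop variable of a "parallel for" loop is implicitly private,
   and each thread processes a disjoint chunk of the index range. */
void vector_add(const float *b, const float *c, float *a, int n) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

The loop qualifies for this directive because its iterations are independent: no iteration reads an element that another iteration writes.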
Graphics Processing Units (GPU)

- Usage of the programmable shader units in modern graphics cards for non-graphics computations
- Data-parallel execution model:
  - the same function on several data portions
  - programmed in a MIMD style
  - executed partly in SIMD mode
- C-based programming languages:
  - CUDA extension (NVIDIA)
  - Brook+ (ATI)
  - OpenCL (platform independent)
Graphics Processing Units

Architecture:

- Multiprocessors that operate in SIMD mode (but can also execute MIMD code in a sequentialized fashion)
- A set of such multiprocessors forms the device
- Host memory and device memory
- Device memory is shared by all multiprocessors
- Input data must be copied from host to device memory; output data travels the reverse way
CUDA - Programming (1)

- NVIDIA - Compute Unified Device Architecture
- Programming and execution model - oriented on, but not fixed to, GPU structures
- A kernel function is executed by many threads that commonly operate on different portions of the input data
- Blocks of threads run in parallel and may use shared memory
- Grid of blocks - blocks are executed either in batch mode or in parallel, depending on the device capabilities
CUDA - Programming (2)

Programming interface:

- Supports 1-, 2- and 3-dimensional blocks and grids
- Addressing of data elements within the kernel functions by blockIdx (block in a grid) and threadIdx (thread in a block)

Example: parallel computation y[] = a*x[] + y[]

__global__
void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
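The host side of this example can be sketched as follows (error checking omitted; `h_x` and `h_y` are assumed host arrays of `n` floats, and the block size 256 is a typical but arbitrary choice). It performs the host-to-device and device-to-host copies described on the architecture slide:

```c
// Host-side launch for saxpy_parallel (illustrative sketch)
size_t bytes = n * sizeof(float);
float *d_x, *d_y;
cudaMalloc(&d_x, bytes);                               // allocate device memory
cudaMalloc(&d_y, bytes);
cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // input: host -> device
cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);
int threads = 256;
int blocks = (n + threads - 1) / threads;              // enough blocks to cover n elements
saxpy_parallel<<<blocks, threads>>>(n, 2.0f, d_x, d_y);
cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);   // output: device -> host
cudaFree(d_x);
cudaFree(d_y);
```

The `if (i < n)` guard in the kernel is needed because the last block may contain threads with indices beyond the array length.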
Summary Part 2

- Multiprocessor systems contain many processors connected by a shared memory
- SMP, MultiCore and DSM
- Such systems scale up to about 32 processors (SMP) and a few hundred processors (DSM)
- Cache coherency makes it possible to use the shared memory transparently, without caring about cache consistency problems
- Programming models are based on shared memory, e.g. multiple threads