332
Advanced Computer Architecture
Chapter 1
Introduction and review of
Pipelines, Performance, Caches, and Virtual Memory
January 2009
Paul H J Kelly
These lecture notes are partly based on the course text,
Hennessy and Patterson's Computer Architecture: A Quantitative
Approach (4th ed), and on the lecture slides of David Patterson's
Berkeley course (CS252)
Course materials online at
http://www.doc.ic.ac.uk/~phjk/AdvancedCompArchitecture.html
This is a textbook-based course
Advanced Computer Architecture Chapter 1. p1
Computer Architecture: A Quantitative Approach (4th Edition)
John L. Hennessy, David A. Patterson
~580 pages. Morgan Kaufmann (2007); ISBN: 978-0-12-370490-0
with substantial additional material on CD
Price: £37.99 (Amazon.co.uk, Nov 2006)
Publisher's companion web site:
http://textbooks.elsevier.com/0123704901/
Textbook includes some vital introductory material as appendices:
Appendix A: tutorial on pipelining (read it NOW)
Appendix C: tutorial on caching (read it NOW)
Further appendices (some in book, some on CD) cover more advanced
material (some very relevant to parts of the course), eg
Networks
Parallel applications
Implementing Coherence Protocols
Embedded systems
VLIW
Computer arithmetic (esp floating point)
Historical perspectives
Advanced Computer Architecture Chapter 1. p3
Pre-requisites
This is a third-level computer architecture course
The usual path would be to take this course after following a
course based on a textbook like "Computer Organization and
Design" (Patterson and Hennessy, Morgan Kaufmann)
This course is based on the more advanced book by the same
authors (see next slide)
You can take this course provided you're prepared to catch up if necessary
Read chapters 1 to 8 of "Computer Organization and Design" (COD) if
this material is new to you
If you have studied computer architecture before, make sure COD
Chapters 2, 6, 7 are familiar
See also "Appendix A Pipelining: Basic and Intermediate Concepts" of
the course textbook
FAST review today of Pipelining, Performance, Caches, and Virtual Memory
Advanced Computer Architecture Chapter 1. p2
Who are these guys anyway and why should I read their book?
John Hennessy:
Founder, MIPS Computer Systems
President, Stanford University
(Condoleezza Rice was his predecessor as Stanford provost)
RAID-I (1989) consisted of a Sun 4/280 workstation with 128 MB of
DRAM, four dual-string SCSI controllers, 28 5.25-inch SCSI disks and
specialized disk striping software.
David Patterson:
Leader, Berkeley RISC project (led to Sun's SPARC)
RAID (redundant arrays of inexpensive disks)
Professor, University of California, Berkeley
Current president of the ACM
Served on Information Technology Advisory Committee to the US President
RISC-I (1982) contains 44,420 transistors, fabbed in 5 micron NMOS,
with a die area of 77 mm2, ran at 1 MHz. This chip is probably the
first VLSI RISC.
http://www.cs.berkeley.edu/~pattrsn/Arch/prototypes2.html
Advanced Computer Architecture Chapter 1. p4
Administration details
Course web site:
http://www.doc.ic.ac.uk/~phjk/AdvancedCompArchitecture.html
Course textbook: H&P 4th ed
Read Appendix A right away
Background for 2008 context…
See Workshop on Trends in Computing Performance
http://www7.nationalacademies.org/CSTB/project_computingperformance_workshop.html
Course overview (plan)
Ch1
Review of pipelined, in-order processor architecture and simple
cache structures
Ch2
Caches in more depth
Software techniques to improve cache performance
Virtual memory
Benchmarking
Fab
Ch3
Instruction-level parallelism
Dynamic scheduling, out-of-order
Register renaming
Speculative execution
Branch prediction
Limits to ILP
Ch4
Compiler techniques – loop nest transformations
Loop parallelisation, interchange, tiling/blocking, skewing
Advanced Computer Architecture Chapter 1. p5
Ch5
Multithreading, hyperthreading, SMT
Static instruction scheduling
Software pipelining
EPIC/IA-64; instruction-set support for speculation and register renaming
Ch6
GPUs, GPGPU, and manycore
Ch7
Shared-memory multiprocessors
Cache coherency
Large-scale cache-coherency; ccNUMA, COMA
Lab-based coursework exercise:
Simulation study "challenge"
Using performance analysis tools
Exam:
Partially based on a recent processor architecture article, which we
will study in advance (see past papers)
Advanced Computer Architecture Chapter 1. p7
Course organisation
Lecturer:
Paul Kelly – Leader, Software Performance Optimisation research group
Tutorial helper:
Anton Lokhmotov – postdoctoral researcher: PhD from Cambridge on optimisation
and algorithms for SIMD. Industry experience with Broadcom (VLIW hardware),
Clearspeed (massively-multicore SIMD hardware), Codeplay (compilers for games),
ACE (compilers)
3 hours per week
Nominally two hours of lectures, one hour of classroom tutorials
We will use the time more flexibly
Assessment:
Exam
For CS M.Eng. class, exam will take place in last week of term
For everyone else, exam will take place early in the summer term
The goal of the course is to teach you how to think about computer
architecture
The exam usually includes some architectural ideas not presented in the
lectures
Coursework
You will be assigned a substantial, laboratory-based exercise
You will learn about performance tuning for computationally-intensive kernels
You will learn about using simulators, and experimentally evaluating
hypotheses to understand system performance
You are encouraged to bring laptops to class to get started and get help
during tutorials
Please do not use computers for anything else during classes
Advanced Computer Architecture Chapter 1. p6
A "Typical" RISC
32-bit fixed format instruction (3 formats, see next slide)
32 32-bit general-purpose registers
(R0 contains zero, double-precision/long operands occupy a pair)
Memory access only via load/store instructions
No instruction both accesses memory and does arithmetic
All arithmetic is done on registers
3-address, reg-reg arithmetic instruction
Subw r1,r2,r3 means r1 := r2 - r3
register identifiers always occupy same bits of instruction encoding
Single addressing mode for load/store:
base + displacement
ie register contents are added to a constant from the instruction word, and
used as the address, eg "lw r2,100(r1)" means "r2 := Mem[100+r1]"
no indirection
Simple branch conditions
Delayed branch
See: SPARC, MIPS, ARM, HP PA-RISC, DEC Alpha, IBM PowerPC,
CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Not: Intel IA-32, IA-64 (?), Motorola 68000, DEC VAX, PDP-11,
IBM 360/370
Eg: VAX matchc, IA32 scas instructions!
Advanced Computer Architecture Chapter 1. p8
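The base + displacement mode can be made concrete with a runnable sketch (the register and memory contents below are illustrative, not from the slides):

```python
# Sketch of the single RISC load/store addressing mode: base + displacement.
# "lw r2,100(r1)" computes the address 100 + r1 and loads from that address.
# The register file and memory here are illustrative toys.

reg = {1: 0x1000, 2: 0}        # r1 holds a base address
mem = {0x1064: 42}             # word stored at 0x1000 + 100 (100 = 0x64)

def lw(rd, displacement, base_reg):
    """r[rd] := Mem[displacement + r[base_reg]] -- no indirection."""
    address = displacement + reg[base_reg]
    reg[rd] = mem[address]

lw(2, 100, 1)
print(reg[2])                  # -> 42
```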
Example: MIPS (note register location)
Register-Register
  31-26: Op | 25-21: Rs1 | 20-16: Rs2 | 15-11: Rd | 10-0: Opx
Register-Immediate
  31-26: Op | 25-21: Rs1 | 20-16: Rd | 15-0: immediate
Branch
  31-26: Op | 25-21: Rs1 | 20-16: Rs2/Opx | 15-0: immediate
Jump / Call
  31-26: Op | 25-0: target
Q: What is the largest signed immediate operand for "subw r1,r2,X"?
Q: What range of addresses can a conditional branch jump to?
Advanced Computer Architecture Chapter 1. p9
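Both questions turn on the 16-bit immediate field. A small sketch of two's-complement sign extension (my own helper, not part of the slides):

```python
# The Register-Immediate and Branch formats carry a 16-bit signed immediate,
# so its value ranges from -2**15 to 2**15 - 1.

def sign_extend_16(value):
    """Interpret a 16-bit field as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

print(sign_extend_16(0x7FFF))   # ->  32767  (largest immediate for subw r1,r2,X)
print(sign_extend_16(0x8000))   # -> -32768
```

For conditional branches, the 16-bit offset is PC-relative and in MIPS is shifted left two bits (it counts instructions) before being added to PC+4, giving a reach of roughly ±128 KB around the branch.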
A machine to execute these instructions
To execute this instruction set we need a machine that fetches
them and does what each instruction says
A "universal" computing device – a simple digital circuit that, with
the right code, can compute anything
Something like:
Instr = Mem[PC]; PC += 4;
rs1 = Reg[Instr.rs1];
rs2 = Reg[Instr.rs2];
imm = SignExtend(Instr.imm);
Operand1 = if (Instr.op == BRANCH) then PC else rs1;
Operand2 = if (immediateOperand(Instr.op)) then imm else rs2;
res = ALU(Instr.op, Operand1, Operand2);
switch (Instr.op) {
  case BRANCH:
    if (rs1 == 0) then PC = PC + imm; continue;
  case STORE:
    Mem[res] = rs1; continue;
  case LOAD:
    lmd = Mem[res];
}
Reg[Instr.rd] = if (Instr.op == LOAD) then lmd else res;
Advanced Computer Architecture Chapter 1. p11
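As a concrete (if simplified) rendering of the fetch/decode/execute loop above, here is a runnable sketch; the tuple encoding, the toy ALU, and the choice of rs2 as the value stored by STORE are my own illustrative assumptions, not the slides' machine:

```python
# Minimal interpreter for the fetch/decode/execute loop sketched above.
# Instructions are (op, rd, rs1, rs2, imm) tuples; this encoding is
# illustrative, not the real MIPS format.

BRANCH, STORE, LOAD, ADD = range(4)

def run(program, reg, mem):
    pc = 0
    while pc // 4 < len(program):
        op, rd, rs1, rs2, imm = program[pc // 4]   # Instr = Mem[PC]
        pc += 4                                    # PC += 4
        if op == BRANCH:                           # branch if rs1 == 0
            if reg[rs1] == 0:
                pc += imm                          # PC-relative target
            continue
        res = reg[rs1] + (imm if op in (LOAD, STORE) else reg[rs2])
        if op == STORE:
            mem[res] = reg[rs2]                    # store rs2 at rs1 + imm
            continue
        reg[rd] = mem[res] if op == LOAD else res  # write-back

reg = {0: 0, 1: 0, 2: 8, 3: 0}
mem = {108: 5}
# r1 := Mem[100 + r2]; r3 := r1 + r1
run([(LOAD, 1, 2, 0, 100), (ADD, 3, 1, 1, 0)], reg, mem)
print(reg[1], reg[3])   # -> 5 10
```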
So where do I find a MIPS processor?
MIPS licensees shipped more than 350 million units during fiscal
year 2007
(http://www.mips.com/company/about-us/milestones/)
HP 4100 multifunction printer
Sony PS2 and PSP
Digimax L85 digital camera (http://www.zoran.com/COACH-9)
Linksys WRT54G Router (Linux-based)
[Figure: instruction-fetch portion of the MIPS datapath – Next PC adder
and instruction memory. Figure 3.1, Page 130, CA:AQA 2e]
Advanced Computer Architecture Chapter 1. p10
5 Steps of MIPS Datapath
[Figure: the five-stage datapath – Instruction Fetch; Instr. Decode /
Reg. Fetch (RS1, RS2, RD, Imm, Reg File, Sign Extend); Execute /
Addr. Calc (operand MUXes, Zero?, ALU); Memory Access (Data Memory,
LMD); Write Back (MUX, WB Data)]
Advanced Computer Architecture Chapter 1. p12
Pipelining the MIPS datapath
[Figure: the same five-stage datapath – Instruction Fetch; Instr. Decode /
Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back.
Figure 3.1, Page 130, CA:AQA 2e]
We will see more complex pipeline structures later.
For example, the Pentium 4 "Netburst" architecture has 31 stages.
[Figure: pipeline timing diagram – instructions in program order down the
page, time (clock cycles 1–7) across; each instruction flows through
Ifetch, Reg, ALU, DMem, Reg one cycle behind its predecessor]
Advanced Computer Architecture Chapter 1. p13
Visualizing Pipelining
[Figure: pipeline diagram – successive instructions each occupy Ifetch,
Reg, ALU, DMem, Reg in successive cycles. Figure 3.3, Page 133, CA:AQA 2e]
Pipelining doesn't help latency of a single instruction
it helps throughput of the entire workload
Pipeline rate limited by slowest pipeline stage
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to "fill" pipeline and time to "drain" it reduces speedup
Speedup comes from parallelism
For free – no new hardware
Advanced Computer Architecture Chapter 1. p15
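The fill/drain cost can be made concrete: n instructions on a k-stage pipeline take n + k - 1 cycles instead of n·k, so the speedup only approaches the stage count for long instruction runs. A quick sketch (my own arithmetic, not from the slides):

```python
# n instructions through a k-stage pipeline: the first takes k cycles to
# fill the pipe, then one instruction completes per cycle, so the total is
# n + k - 1 cycles versus n * k cycles unpipelined.

def pipeline_speedup(n, k):
    return (n * k) / (n + k - 1)

print(pipeline_speedup(4, 5))      # -> 2.5  (fill/drain dominates short runs)
print(pipeline_speedup(1000, 5))   # approaches the 5-stage limit
```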
5-stage MIPS pipeline with pipeline buffers
[Figure: the five-stage datapath with pipeline latches IF/ID, ID/EX,
EX/MEM, MEM/WB between the stages; the RD field is carried forward
through each latch to the Write Back stage.
Figure 3.4, Page 134, CA:AQA 2e]
• Data stationary control
– local decode for each instruction phase / pipeline stage
Advanced Computer Architecture Chapter 1. p14
It's Not That Easy for Computers
Limits to pipelining: hazards prevent the next instruction from
executing during its designated clock cycle
Structural hazards: HW cannot support this combination of instructions
Data hazards: instruction depends on result of prior instruction
still in the pipeline
Control hazards: caused by delay between the fetching of instructions
and decisions about changes in control flow (branches and jumps)
Advanced Computer Architecture Chapter 1. p16
One Memory Port / Structural Hazards
[Figure: pipeline diagram, cycles 1–7 – Load, Instr 1, Instr 2, Instr 3,
Instr 4 in program order; in cycle 4 the Load's DMem access and Instr 3's
Ifetch both need the single memory port. Figure 3.6, Page 142, CA:AQA 2e]
Eg if there is only one memory for both instructions and data
Two different stages may need access at same time
Example: IBM/Sony/Toshiba Cell processor
Advanced Computer Architecture Chapter 1. p17
Data Hazard on R1
IF ID/RF EX MEM WB
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
[Figure: pipeline diagram – the add writes r1 in WB (cycle 5), but the
sub, and, and or read r1 in ID/RF in cycles 3, 4, and 5.
Figure 3.9, page 147, CA:AQA 2e]
Advanced Computer Architecture Chapter 1. p19
One Memory Port / Structural Hazards
[Figure: pipeline diagram, cycles 1–7 – Load, Instr 1, Instr 2, then a
row of Bubbles; Instr 3's Ifetch is delayed by one cycle.
Figure 3.7, Page 143, CA:AQA 2e]
Instr 3 cannot be loaded in cycle 4
ID stage has nothing to do in cycle 5
EX stage has nothing to do in cycle 6, etc. "Bubble" propagates
Advanced Computer Architecture Chapter 1. p18
Three Generic Data Hazards
Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
Caused by a "Dependence" (in compiler nomenclature).
This hazard results from an actual need for communication.
Advanced Computer Architecture Chapter 1. p20
Three Generic Data Hazards
Write After Read (WAR)
InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers.
This results from reuse of the name "r1".
Can't happen in MIPS 5-stage pipeline because:
All instructions take 5 stages, and
Reads are always in stage 2, and
Writes are always in stage 5
Advanced Computer Architecture Chapter 1. p21
Forwarding to Avoid Data Hazard
[Figure: pipeline diagram – the add's result is forwarded from its EX/MEM
and MEM/WB latches directly to the ALU inputs of sub, and, or r8,r1,r9,
and xor r10,r1,r11, avoiding the stall. Figure 3.10, Page 149, CA:AQA 2e]
Advanced Computer Architecture Chapter 1. p23
Three Generic Data Hazards
Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an "output dependence" by compiler writers.
This also results from the reuse of name "r1".
Can't happen in MIPS 5-stage pipeline because:
All instructions take 5 stages, and
Writes are always in stage 5
Will see WAR and WAW in later more complicated pipes
Advanced Computer Architecture Chapter 1. p22
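All three hazard classes can be detected mechanically by intersecting the read and write sets of an instruction pair; a small sketch (the encoding is mine, not the course's):

```python
# Classify data hazards between an earlier instruction I and a later
# instruction J, given the register sets each reads and writes.

def hazards(i_reads, i_writes, j_reads, j_writes):
    found = []
    if i_writes & j_reads:
        found.append("RAW")   # true dependence: J needs I's result
    if i_reads & j_writes:
        found.append("WAR")   # anti-dependence: reuse of a name
    if i_writes & j_writes:
        found.append("WAW")   # output dependence: reuse of a name
    return found

# I: add r1,r2,r3   J: sub r4,r1,r3  -> RAW on r1
print(hazards({2, 3}, {1}, {1, 3}, {4}))   # -> ['RAW']
# I: sub r4,r1,r3   J: add r1,r2,r3  -> WAR on r1
print(hazards({1, 3}, {4}, {2, 3}, {1}))   # -> ['WAR']
```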
HW Change for Forwarding
Add forwarding ("bypass") paths
Add multiplexors to select where ALU operand should come from
Determine mux control in ID stage
If source register is the target of an instrn that will not WB in time
[Figure: datapath with forwarding muxes – ID/EX, EX/MEM, MEM/WR latches;
muxes before the ALU select between the register file, the immediate,
the EX/MEM result, and the MEM/WR result.
Figure 3.20, Page 161, CA:AQA 2e]
Advanced Computer Architecture Chapter 1. p24
Data Hazard Even with Forwarding
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
[Figure: pipeline diagram – the lw's data is not available until the end
of its DMem stage, too late to forward to the sub's ALU stage in the next
cycle. Figure 3.12, Page 153, CA:AQA 2e]
Advanced Computer Architecture Chapter 1. p25
Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f in memory.
Slow code:           Fast code:
LW   Rb,b            LW   Rb,b
LW   Rc,c            LW   Rc,c
STALL                LW   Re,e
ADD  Ra,Rb,Rc        ADD  Ra,Rb,Rc
SW   a,Ra            LW   Rf,f
LW   Re,e            SW   a,Ra
LW   Rf,f            SUB  Rd,Re,Rf
STALL                SW   d,Rd
SUB  Rd,Re,Rf
SW   d,Rd
10 cycles (2 stalls)    8 cycles (0 stalls)
Show the stalls explicitly
Advanced Computer Architecture Chapter 1. p27
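One way to see where the cycle counts come from: with forwarding, the only stall left in this code is the load-use case, so counting adjacent load-use pairs reproduces the 10 vs 8 cycles. A sketch with an illustrative instruction encoding (not the course's simulator):

```python
# Count load-use stalls: even with forwarding, a loaded value cannot be
# used by the immediately following instruction, costing one stall cycle.
# Instructions are (op, dest, sources) tuples -- an illustrative encoding.

def cycles(code):
    stalls = 0
    for prev, curr in zip(code, code[1:]):
        if prev[0] == "LW" and prev[1] in curr[2]:
            stalls += 1               # load-use pair -> one bubble
    return len(code) + stalls

slow = [("LW","Rb",()), ("LW","Rc",()), ("ADD","Ra",("Rb","Rc")),
        ("SW",None,("Ra",)), ("LW","Re",()), ("LW","Rf",()),
        ("SUB","Rd",("Re","Rf")), ("SW",None,("Rd",))]
fast = [("LW","Rb",()), ("LW","Rc",()), ("LW","Re",()),
        ("ADD","Ra",("Rb","Rc")), ("LW","Rf",()), ("SW",None,("Ra",)),
        ("SUB","Rd",("Re","Rf")), ("SW",None,("Rd",))]
print(cycles(slow), cycles(fast))   # -> 10 8
```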
Data Hazard Even with Forwarding
lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
[Figure: pipeline diagram – a bubble is inserted; the sub's EX stage
waits a cycle for the loaded operand. Figure 3.13, Page 154, CA:AQA 2e]
EX stage waits in cycle 4 for operand
Following instruction ("and") waits in ID stage
Missed instruction issue opportunity…
Advanced Computer Architecture Chapter 1. p26
Control Hazard on Branches
Three Stage Stall
10: beq r1,r3,36
14: and r2,r3,r5
18: or  r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
[Figure: pipeline diagram – the three instructions after the beq are
fetched before the branch outcome is known late in the pipeline]
Advanced Computer Architecture Chapter 1. p28
Pipelined MIPS Datapath with early branch determination
[Figure: five-stage datapath in which the branch target adder and the
Zero? test are moved into the Instr. Decode / Reg. Fetch stage, so the
Next PC is selected at the end of ID.
Figure 3.22, page 163, CA:AQA 2/e]
Advanced Computer Architecture Chapter 1. p29
Four Branch Hazard Alternatives
#4: Delayed Branch
Define branch to take place AFTER a following instruction
branch instruction
sequential successor1
sequential successor2
........
sequential successorn
branch target if taken
Branch delay of length n
1 slot delay allows proper decision and branch target
address in 5-stage pipeline
MIPS uses this; eg
If (R1==0)
  X=100
Else
  { X=100; X=200 }
R5 = X
compiles to:
  LW   R3, #100
  LW   R4, #200
  BEQZ R1, L1
  SW   R3, X
  SW   R4, X
L1:
  LW   R5, X
"SW R3, X" instruction is executed regardless (it fills the delay slot)
"SW R4, X" instruction is executed only if R1 is non-zero
Advanced Computer Architecture Chapter 1. p31
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
(wasteful – the next instruction is being fetched during ID)
#2: Predict Branch Not Taken
Execute successor instructions in sequence
"Squash" instructions in pipeline if branch actually taken
With MIPS we have advantage of late pipeline state update
47% of MIPS branches are not taken on average
PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
53% of MIPS branches are taken on average
But in MIPS instruction set we haven't calculated branch target address
yet (because branches are relative to the PC)
MIPS still incurs 1 cycle branch penalty
With some other machines, branch target is known before branch condition
Advanced Computer Architecture Chapter 1. p30
Delayed Branch
Where to get instructions to fill branch delay slot?
From before the branch instruction
From the target address: only valuable when branch taken
From fall-through: only valuable when branch not taken
Compiler effectiveness for single branch delay slot:
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots
useful in computation
About 50% (60% x 80%) of slots usefully filled
Delayed branch downside: 7-8 stage pipelines,
multiple instructions issued per clock (superscalar)
Canceling branches
Branch delay slot instruction is executed but write-back is
disabled if it is not supposed to be executed
Two variants: branch "likely taken", branch "likely not-taken"
allows more slots to be filled
[Figure: code layout – "before", "Blt R1,L1", "fallthru", …, "L1: target"]
Advanced Computer Architecture Chapter 1. p32
Eliminating hazards with simultaneous multi-threading
If we had no stalls we could finish one instruction every cycle
If we had no hazards we could do without forwarding – and
decode/control would be simpler too
[Figure: five-stage pipeline with two program counters (PC0, PC1) and
two register files (Thread 0 regs, Thread 1 regs) feeding the same
Ifetch, Reg, ALU, DMem, Reg stages]
IF maintains two Program Counters
Even cycle – fetch from PC0
Odd cycle – fetch from PC1
Thread 0 reads and writes thread-0 registers
No register-to-register hazards between adjacent pipeline stages
Example: PowerPC processing element (PPE) in the Cell Broadband
Engine (Sony PlayStation 3)
Advanced Computer Architecture Chapter 1. p33
Life used to be so easy
Processor-DRAM Memory Gap (latency)
[Figure: log-scale performance vs. time, 1980–2000 – CPU ("Moore's Law")
improving at 60%/yr (2X/1.5yr); DRAM at 9%/yr (2X/10 yrs); the
Processor-Memory Performance Gap grows 50% / year]
In 1980 a large RAM's access time was close to the CPU cycle time. 1980s
machines had little or no need for cache. Life is no longer quite so simple.
Advanced Computer Architecture Chapter 1. p35
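The "grows 50% / year" figure is just the ratio of the two trend lines; a one-line check (my own arithmetic):

```python
# CPU performance grows 60%/yr, DRAM 9%/yr, so the gap between them
# widens by the ratio of the two growth factors each year.
ratio = 1.60 / 1.09
print(round(ratio - 1, 2))   # -> 0.47, i.e. roughly 50% per year
```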
So – how fast can this design go?
A simple 5-stage pipeline can run at >3GHz
Limited by critical path through slowest pipeline stage logic
Tradeoff: do more per cycle? Or increase clock rate?
Or do more per cycle, in parallel…
At 3GHz, clock period is 330 picoseconds
The time light takes to go about four inches
About 10 gate delays
for example, the Cell BE is designed for 11 FO4 ("fanout=4")
gates per cycle:
www.fe.infn.it/~belletti/articles/ISSCC2005-cell.pdf
Pipeline latches etc account for 3-5 FO4 delays, leaving
only 5-8 for actual work
How can we build a RAM that can implement our MEM stage in
5-8 FO4 delays?
Advanced Computer Architecture Chapter 1. p34
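The numbers on this slide are easy to sanity-check (the rounding is mine; the slide rounds 333 ps to 330 ps):

```python
# Sanity-check the 3 GHz figures: period = 1/f, and the distance light
# travels in one period.
f = 3e9
period_s = 1 / f                         # ~3.3e-10 s
inches = 3e8 * period_s / 0.0254         # speed of light * time, in inches
print(round(period_s * 1e12), round(inches, 1))   # -> 333 3.9
# i.e. ~330 ps, during which light covers about four inches
```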
Memory Hierarchy: Terminology
Hit: data appears in some block X in the upper level
Hit Rate: the fraction of memory accesses found in the upper level
Hit Time: time to access the upper level, which consists of
RAM access time + time to determine hit/miss
Miss: data needs to be retrieved from a block Y in the lower level
Miss Rate = 1 - (Hit Rate)
Miss Penalty: time to replace a block in the upper level +
time to deliver the block to the processor
Hit Time << Miss Penalty
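These terms combine into the average memory access time (AMAT); a quick sketch with illustrative numbers of my own choosing:

```python
# Average memory access time from the definitions above:
#   AMAT = hit_time + miss_rate * miss_penalty

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# eg a 1-cycle cache with a 5% miss rate and a 100-cycle miss penalty:
print(amat(1, 0.05, 100))   # -> 6.0 cycles on average
```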
Levels of the Memory Hierarchy
[Figure: hierarchy pyramid – capacity, access time, and cost at each
level; staging/transfer unit at each boundary; CPU registers at the top
(100s of bytes)]
Direct-mapped cache – read access
1 KB Direct Mapped Cache, 32B blocks
For a 2^N byte cache:
The uppermost (32 - N) bits are always the Cache Tag
The lowest M bits are the Byte Select (Block Size = 2^M)
[Figure: 32-bit address split into Cache Tag (bits 31–9, eg 0x50),
Cache Index (bits 8–5, eg 0x01), and Byte Select (bits 4–0, eg 0x00).
The tag is stored as part of the cache "state"; the indexed entry's tag
and Valid bit are compared against the address tag to produce Hit, while
the Cache Data array (Byte 0 … Byte 1023, 32 bytes per row) supplies
the data]
Advanced Computer Architecture Chapter 1. p41
Direct-mapped Cache – structure
Capacity: C bytes (eg 1KB)
Blocksize: B bytes (eg 32)
Byte select bits: 0..log(B)-1 (eg 0..4)
Number of blocks: C/B (eg 32)
Address size: A (eg 32 bits)
Cache index size: I=log(C/B) (eg log(32)=5)
Tag size: A-I-log(B) (eg 32-5-5=22)
[Figure: address divided into Adr Tag, Cache Index, and byte select;
the Cache Index selects a block, whose stored Cache Tag is compared
with the Adr Tag to produce Hit, and whose Cache Data supplies the
Cache Block]
Advanced Computer Architecture Chapter 1. p43
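The field sizes above can be checked with a short sketch (the example address is mine, chosen to reproduce the tag 0x50, index 0x01, byte 0x00 example from the previous slide):

```python
# Split a 32-bit address for the example cache: C = 1 KB capacity,
# B = 32-byte blocks => 5 byte-select bits, C/B = 32 blocks => 5 index
# bits, leaving a 22-bit tag.

C, B, A = 1024, 32, 32
byte_bits  = B.bit_length() - 1          # log2(32) = 5
index_bits = (C // B).bit_length() - 1   # log2(32) = 5
tag_bits   = A - index_bits - byte_bits  # 32 - 5 - 5 = 22

def split(addr):
    byte  = addr & (B - 1)
    index = (addr >> byte_bits) & ((C // B) - 1)
    tag   = addr >> (byte_bits + index_bits)
    return tag, index, byte

print(tag_bits)           # -> 22
print(split(0x14020))     # -> (80, 1, 0), ie tag 0x50, index 0x01, byte 0x00
```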
1 KB Direct Mapped Cache, 32B blocks
[Figure: main memory locations 0, 1, 2, … alongside the cache's 32-row
data array (Byte 0 … Byte 1023, 32 bytes per row)]
Cache location 0 can be occupied by data from main memory
location 0, 32, 64, … etc.
Cache location 1 can be occupied by data from main memory
location 1, 33, 65, … etc.
In general, all locations with the same low-order address bits map to
the same location in the cache
Which one should we place in the cache?
How can we tell which one is in the cache?
Advanced Computer Architecture Chapter 1. p42
Two-way Set Associative Cache<br />
N-way set associative: N entries for each Cache Index<br />
N direct mapped caches operated in parallel (N typically 2 to 4)<br />
Example: Two-way set associative cache<br />
Cache Index selects a “set” from the cache<br />
The two tags in the set are compared in parallel<br />
Data is selected based on the tag result<br />
[Diagram: two-way set-associative cache – the Cache Index selects one set; both ways’ tags are compared with Adr Tag in parallel, the per-way hits are ORed into Hit, and Sel0/Sel1 drive a mux that picks the Cache Block]<br />
Advanced Computer Architecture Chapter 1. p44
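The lookup described above can be sketched as follows (an illustrative software model of the parallel tag compare and output mux, not a hardware description; the `Way` and `lookup` names are ours):

```python
# Model of a 2-way set-associative lookup: in hardware both ways of the
# selected set are probed in parallel; here we model the compare + mux.
class Way:
    def __init__(self):
        self.valid, self.tag, self.data = False, None, None

def lookup(sets, tag, index):
    s = sets[index]                                   # Cache Index selects a set
    hits = [w.valid and w.tag == tag for w in s]      # per-way tag compares
    hit = any(hits)                                   # OR of the per-way hits
    data = None
    for w, h in zip(s, hits):                         # Sel0/Sel1 drive the mux
        if h:
            data = w.data
    return hit, data

sets = [[Way(), Way()] for _ in range(4)]             # 4 sets x 2 ways
sets[1][1].valid, sets[1][1].tag, sets[1][1].data = True, 0x2A, "block"
print(lookup(sets, 0x2A, 1))   # -> (True, 'block')
print(lookup(sets, 0x07, 1))   # -> (False, None)
```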
D<strong>is</strong>advantage of Set Associative Cache<br />
N-way Set Associative Cache vs. Direct Mapped Cache:<br />
N comparators vs. 1<br />
Extra MUX delay for the data<br />
Data comes AFTER Hit/Miss<br />
In a direct mapped cache, the Cache Block is available BEFORE<br />
Hit/Miss:<br />
Possible to assume a hit and continue. Recover later if miss.<br />
[Diagram: two-way set-associative cache, as on the previous slide – the data emerges from the output mux only after the tag compares have resolved Hit/Miss]<br />
Advanced Computer Architecture Chapter 1. p45<br />
4 Questions for Memory Hierarchy<br />
Q1: Where can a block be placed in the upper level?<br />
(Block placement)<br />
Q2: How is a block found if it is in the upper level?<br />
(Block identification)<br />
Q3: Which block should be replaced on a miss?<br />
(Block replacement)<br />
Q4: What happens on a write?<br />
(Write strategy)<br />
Advanced Computer Architecture Chapter 1. p47<br />
Basic cache terminology<br />
Example: Intel Pentium 4 Level-1 cache (pre-Prescott)<br />
Capacity: 8K bytes (total amount of data cache can store)<br />
Block: 64 bytes (so there are 8K/64=128 blocks in the cache)<br />
Ways: 4 (addresses with same index bits can be placed in one of 4 ways)<br />
Sets: 32 (=128/4, that is, each RAM array holds 32 blocks)<br />
Index: 5 bits (since 2^5=32 and we need the index to select one of the 32 sets)<br />
Tag: 21 bits (=32 minus 5 for index, minus 6 to address byte within block)<br />
Access time: 2 cycles (0.6ns at 3GHz; pipelined, dual-ported [load+store])<br />
[Diagram: set-associative cache lookup – the Cache Index selects the set, the stored Cache Tags are compared with Adr Tag, Hit is the OR of the per-way compares, and a mux selects the Cache Block]<br />
Advanced Computer Architecture Chapter 1. p46<br />
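The arithmetic above can be checked mechanically (variable names are ours; the figures are the slide's Pentium 4 L1 parameters: 8KB, 64-byte blocks, 4 ways, 32-bit addresses):

```python
# Derive the Pentium 4 L1 cache geometry quoted above.
capacity, block, ways, addr_bits = 8 * 1024, 64, 4, 32

blocks = capacity // block             # 8K/64 = 128 blocks in total
sets   = blocks // ways                # 128/4 = 32 sets (one RAM-array row each)
index_bits  = sets.bit_length() - 1    # 5, since 2**5 == 32
offset_bits = block.bit_length() - 1   # 6 bits to address a byte within a block
tag_bits    = addr_bits - index_bits - offset_bits   # 32 - 5 - 6 = 21

print(blocks, sets, index_bits, tag_bits)   # -> 128 32 5 21
```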
Q1: Where can a block be placed in<br />
the upper level?<br />
In a fully-associative cache, block 12 can be placed in any location in the cache<br />
In a two-way set-associative cache, the set is determined by its low-order address bits – (12 mod 4) = 0. Block 12 can be placed in either of the two cache locations in set 0<br />
In a direct-mapped cache, block 12 can only be placed in one cache location, determined by its low-order address bits – (12 mod 8) = 4<br />
[Diagram: the same 8-block cache shown three ways – fully associative, two-way set-associative (sets 0–3), and direct-mapped (locations 0–7) – with the candidate positions for memory block 12 highlighted]<br />
Advanced Computer Architecture Chapter 1. p48
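The three placement rules above reduce to modular arithmetic on the block address (a trivial check; nothing here beyond the slide's own numbers):

```python
# Where can main-memory block 12 go, in an 8-block cache?
block = 12

direct_mapped = block % 8        # exactly one candidate location: 4
two_way_set   = block % 4        # set 0: either of its two ways
fully_assoc   = list(range(8))   # any of the 8 locations

print(direct_mapped, two_way_set, fully_assoc)
```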
Q2: How is a block found if it is in the upper<br />
level?<br />
[Diagram: set-associative lookup, as before – the address tag is compared against the stored Cache Tags to generate Hit and select the Cache Block]<br />
Tag on each block<br />
No need to check index or block offset<br />
Address format: | Tag | Index | Block Offset | (Tag and Index together form the Block Address)<br />
Increasing associativity shrinks index, expands tag<br />
Advanced Computer Architecture Chapter 1. p49<br />
Q4: What happens on a write?<br />
Write through—The information is written to both the block in the cache and to the block in the lower-level memory<br />
Write back—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.<br />
is the block clean or dirty?<br />
Pros and Cons of each?<br />
WT: read m<strong>is</strong>ses cannot result in writes<br />
WB: no repeated writes to the same location<br />
WT is always combined with write buffers so that the processor doesn’t wait for the lower-level memory<br />
Advanced Computer Architecture Chapter 1. p51<br />
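The two strategies can be contrasted in a small sketch (illustrative only: the `Line` class, the dirty-bit handling and the `memory` dict are our stand-ins for a real cache controller):

```python
# Write-through vs. write-back on a store hit.
memory = {}                        # stand-in for the lower-level memory

class Line:
    def __init__(self, addr, data):
        self.addr, self.data, self.dirty = addr, data, False

def store_write_through(line, value):
    line.data = value
    memory[line.addr] = value      # lower level updated on every write

def store_write_back(line, value):
    line.data = value
    line.dirty = True              # memory touched only when the line leaves

def evict_write_back(line):
    if line.dirty:                 # dirty: must write back; clean: just drop
        memory[line.addr] = line.data

wt_line = Line(0x100, 0)
store_write_through(wt_line, 7)    # memory sees the store immediately

wb_line = Line(0x200, 0)
store_write_back(wb_line, 9)       # memory still stale here...
evict_write_back(wb_line)          # ...updated only on eviction
```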
Q3: Which block should be replaced on a<br />
m<strong>is</strong>s?<br />
Easy for Direct Mapped<br />
Set Associative or Fully Associative:<br />
Random<br />
LRU (Least Recently Used)<br />
Assoc: 2-way 4-way 8-way<br />
Size LRU Ran LRU Ran LRU Ran<br />
16 KB 5.2% 5.7% 4.7% 5.3% 4.4% 5.0%<br />
64 KB 1.9% 2.0% 1.5% 1.7% 1.4% 1.5%<br />
256 KB 1.15% 1.17% 1.13% 1.13% 1.12% 1.12%<br />
Benchmark studies show that LRU beats random only with small caches<br />
Advanced Computer Architecture Chapter 1. p50<br />
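The two victim-selection policies compared in the table can be sketched as follows (illustrative; the timestamp-based realisation of LRU is one common choice, not the only one):

```python
# Choosing a victim way within a set, under the two policies above.
import random

def lru_victim(last_used):
    # last_used[w] = time way w was last touched; evict the oldest
    return min(range(len(last_used)), key=lambda w: last_used[w])

def random_victim(n_ways):
    # no state to maintain: any way is fair game
    return random.randrange(n_ways)

print(lru_victim([5, 2, 9, 7]))   # -> 1 (way 1 was used longest ago)
```

Random's appeal is that it needs no per-set state, which matters as associativity grows; the table shows its miss-rate penalty is tiny for large caches.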
Write Buffer for Write Through<br />
[Diagram: the Processor writes into both the Cache and a Write Buffer; the Write Buffer drains to DRAM]<br />
A Write Buffer is needed between the Cache and<br />
Memory<br />
Processor: writes data into the cache and the write buffer<br />
Memory controller: write contents of the buffer to memory<br />
Write buffer is just a FIFO:<br />
Typical number of entries: 4<br />
Works fine if: store frequency (w.r.t. time) is much less than 1 / DRAM write cycle<br />
Write buffer saturation<br />
Advanced Computer Architecture Chapter 1. p52
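The FIFO behaviour described above can be sketched as follows (illustrative: the 4-entry depth matches the slide, but everything else, including the stall-by-draining policy on saturation, is our assumption):

```python
# Write-through with a small FIFO write buffer: a store completes as soon
# as it enters the buffer; the memory controller drains it in the background.
from collections import deque

BUFFER_ENTRIES = 4        # typical depth quoted above

buffer = deque()
memory = {}

def cpu_store(addr, value):
    if len(buffer) >= BUFFER_ENTRIES:
        drain_one()       # buffer saturated: processor must wait for a drain
    buffer.append((addr, value))

def drain_one():
    addr, value = buffer.popleft()   # FIFO order preserves store ordering
    memory[addr] = value

for a in range(6):        # 6 stores through a 4-entry buffer
    cpu_store(a, a * 10)
while buffer:             # controller eventually drains everything
    drain_one()
```

Saturation is exactly the case the slide warns about: once stores arrive faster than DRAM can retire them, the buffer fills and the processor stalls on every store.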
http://www.karo.com<br />
A Modern Memory Hierarchy<br />
By taking advantage of the principle of locality:<br />
Present the user with as much memory as is available in the<br />
cheapest technology.<br />
Provide access at the speed offered by the fastest technology.<br />
[Diagram: Processor (Control, Datapath, Registers) → On-Chip Cache → Second-Level Cache (SRAM) → Main Memory (DRAM) → Secondary Storage (Disk) → Tertiary Storage (Disk/Tape)]<br />
Speed (ns): 1s → 10s → 100s → 10,000,000s (10s ms) → 10,000,000,000s (10s sec)<br />
Size (bytes): 100s → Ks → Ms → Gs → Ts<br />
Advanced Computer Architecture Chapter 1. p53<br />
Can we live without cache?<br />
Interesting exception: the Cray/Tera MTA,<br />
first delivered June 1999:<br />
www.cray.com/products/systems/mta/<br />
Each CPU switches every cycle between<br />
128 threads<br />
Each thread can have up to 8<br />
outstanding memory accesses<br />
3D toroidal mesh interconnect<br />
Memory accesses hashed to spread load<br />
across banks<br />
MTA-1 fabricated using Gallium<br />
Arsenide, not silicon<br />
“nearly un-manufacturable” (wikipedia)<br />
Third-generation Cray XMT:<br />
http://www.cray.com/Products/XMT.aspx<br />
Advanced Computer Architecture Chapter 1. p55<br />
StorageTek STK 9310<br />
(“Powderhorn”)<br />
2,000, 3,000, 4,000,<br />
5,000 or 6,000<br />
cartridge slots per<br />
library storage module<br />
(LSM)<br />
Up to 24 LSMs per<br />
library (144,000<br />
cartridges)<br />
120 TB (1 LSM) to<br />
28,800 TB capacity (24<br />
LSM)<br />
Each cartridge holds<br />
300GB, readable up to<br />
40 MB/sec<br />
Up to 28.8 petabytes<br />
Ave 4s to load tape<br />
Large-scale storage<br />
http://www.b2net.co.uk/storagetek/storagetek_powderhorn_9310_tape_library.htm<br />
http://en.wikipedia.org/wiki/Tape_library<br />
http://www.ibm.qassociates.co.uk/storage-tape-enterpr<strong>is</strong>e-tape-drive-J1A-specifications.htm<br />
Advanced Computer Architecture Chapter 1. p54<br />
Where we are going…<br />
Ch1<br />
Review of pipelined, in-order<br />
processor architecture and simple<br />
cache structures<br />
Ch2<br />
Caches in more depth<br />
Software techniques to improve<br />
cache performance<br />
Virtual memory<br />
Benchmarking<br />
Fab<br />
Ch3<br />
Instruction-level parallelism<br />
Dynamic scheduling, out-of-order<br />
Register renaming<br />
Speculative execution<br />
Branch prediction<br />
Limits to ILP<br />
Ch4<br />
Compiler techniques – loop nest<br />
transformations<br />
Loop parallelisation, interchange,<br />
tiling/blocking, skewing<br />
Ch5<br />
Multithreading, hyperthreading, SMT<br />
Static instruction scheduling<br />
Software pipelining<br />
EPIC/IA-64; instruction-set support for<br />
speculation and reg<strong>is</strong>ter renaming<br />
Ch6<br />
GPUs, GPGPU, and manycore<br />
Ch7<br />
Shared-memory multiprocessors<br />
Cache coherency<br />
Large-scale cache-coherency; ccNUMA.<br />
COMA<br />
Lab-based coursework exercise:<br />
Simulation study<br />
“challenge”<br />
Using performance analys<strong>is</strong> tools<br />
Exam:<br />
Partially based on a recent processor<br />
architecture article, which we will study in<br />
advance (see past papers)<br />
Advanced Computer Architecture Chapter 1. p56