
332
Advanced Computer Architecture

Chapter 1
Introduction and review of Pipelines, Performance, Caches, and Virtual Memory

January 2009
Paul H J Kelly

These lecture notes are partly based on the course text, Hennessy and Patterson's Computer Architecture: A Quantitative Approach (4th ed), and on the lecture slides of David Patterson's Berkeley course (CS252).

Course materials online at http://www.doc.ic.ac.uk/~phjk/AdvancedCompArchitecture.html

This is a textbook-based course

Advanced Computer Architecture Chapter 1. p1

Computer Architecture: A Quantitative Approach (4th Edition)
John L. Hennessy, David A. Patterson
~580 pages. Morgan Kaufmann (2007); ISBN: 978-0-12-370490-0
With substantial additional material on CD
Price: £37.99 (Amazon.co.uk, Nov 2006)
Publisher's companion web site: http://textbooks.elsevier.com/0123704901/

Textbook includes some vital introductory material as appendices:
  Appendix A: tutorial on pipelining (read it NOW)
  Appendix C: tutorial on caching (read it NOW)
Further appendices (some in the book, some on the CD) cover more advanced material (some very relevant to parts of the course), eg:
  Networks
  Parallel applications
  Implementing coherence protocols
  Embedded systems
  VLIW
  Computer arithmetic (esp floating point)
  Historical perspectives

Advanced Computer Architecture Chapter 1. p3

Pre-requisites

This is a third-level computer architecture course. The usual path would be to take this course after following a course based on a textbook like "Computer Organization and Design" (Patterson and Hennessy, Morgan Kaufmann). This course is based on the more advanced book by the same authors (see next slide).

You can take this course provided you're prepared to catch up if necessary:
  Read chapters 1 to 8 of "Computer Organization and Design" (COD) if this material is new to you
  If you have studied computer architecture before, make sure COD Chapters 2, 6, 7 are familiar
  See also "Appendix A: Pipelining: Basic and Intermediate Concepts" of the course textbook

FAST review today of Pipelining, Performance, Caches, and Virtual Memory

Who are these guys anyway, and why should I read their book?

Advanced Computer Architecture Chapter 1. p2

John Hennessy:
  Founder, MIPS Computer Systems
  President, Stanford University (previous president: Condoleezza Rice)

David Patterson:
  Leader, Berkeley RISC project (led to Sun's SPARC)
  RAID (redundant arrays of inexpensive disks)
  Professor, University of California, Berkeley
  Current president of the ACM
  Served on the Information Technology Advisory Committee to the US President

[Photo caption: RAID-I (1989) consisted of a Sun 4/280 workstation with 128 MB of DRAM, four dual-string SCSI controllers, 28 5.25-inch SCSI disks and specialized disk striping software.]

[Photo caption: RISC-I (1982) contains 44,420 transistors, fabbed in 5 micron NMOS, with a die area of 77 mm2, and ran at 1 MHz. This chip is probably the first VLSI RISC.]

http://www.cs.berkeley.edu/~pattrsn/Arch/prototypes2.html

Advanced Computer Architecture Chapter 1. p4


Administration details

Course web site: http://www.doc.ic.ac.uk/~phjk/AdvancedCompArchitecture.html
Course textbook: H&P 4th ed
  Read Appendix A right away
Background for 2008 context: see Workshop on Trends in Computing Performance
http://www7.nationalacademies.org/CSTB/project_computingperformance_workshop.html

Course overview (plan)

Ch1: Review of pipelined, in-order processor architecture and simple cache structures
Ch2: Caches in more depth
  Software techniques to improve cache performance
  Virtual memory
  Benchmarking
  Fab
Ch3: Instruction-level parallelism
  Dynamic scheduling, out-of-order
  Register renaming
  Speculative execution
  Branch prediction
  Limits to ILP
Ch4: Compiler techniques – loop nest transformations
  Loop parallelisation, interchange, tiling/blocking, skewing

Advanced Computer Architecture Chapter 1. p5

Ch5: Multithreading, hyperthreading, SMT
  Static instruction scheduling
  Software pipelining
  EPIC/IA-64; instruction-set support for speculation and register renaming
Ch6: GPUs, GPGPU, and manycore
Ch7: Shared-memory multiprocessors
  Cache coherency
  Large-scale cache-coherency; ccNUMA. COMA

Lab-based coursework exercise:
  Simulation study "challenge"
  Using performance analysis tools
Exam:
  Partially based on a recent processor architecture article, which we will study in advance (see past papers)

Advanced Computer Architecture Chapter 1. p7

Course organisation

Lecturer:
  Paul Kelly – Leader, Software Performance Optimisation research group
Tutorial helper:
  Anton Lokhmotov – postdoctoral researcher; PhD from Cambridge on optimisation and algorithms for SIMD. Industry experience with Broadcom (VLIW hardware), Clearspeed (massively-multicore SIMD hardware), Codeplay (compilers for games), ACE (compilers)

3 hours per week
  Nominally two hours of lectures, one hour of classroom tutorials
  We will use the time more flexibly

Assessment:
  Exam
    For the CS M.Eng. class, the exam will take place in the last week of term
    For everyone else, the exam will take place early in the summer term
    The goal of the course is to teach you how to think about computer architecture
    The exam usually includes some architectural ideas not presented in the lectures
  Coursework
    You will be assigned a substantial, laboratory-based exercise
    You will learn about performance tuning for computationally-intensive kernels
    You will learn about using simulators, and experimentally evaluating hypotheses to understand system performance

You are encouraged to bring laptops to class to get started and get help during tutorials
Please do not use computers for anything else during classes

Advanced Computer Architecture Chapter 1. p6

A "Typical" RISC

32-bit fixed-format instructions (3 formats, see next slide)
32 32-bit general-purpose registers
  (R0 contains zero; double-precision/long operands occupy a pair)
Memory access only via load/store instructions
  No instruction both accesses memory and does arithmetic
  All arithmetic is done on registers
3-address, reg-reg arithmetic instructions
  "subw r1,r2,r3" means r1 := r2 - r3
  Register identifiers always occupy the same bits of the instruction encoding
Single addressing mode for load/store: base + displacement
  ie register contents are added to a constant from the instruction word and used as the address, eg "lw r2,100(r1)" means "r2 := Mem[100+r1]"
  No indirection
Simple branch conditions
Delayed branch

See: SPARC, MIPS, ARM, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Not: Intel IA-32, IA-64 (?), Motorola 68000, DEC VAX, PDP-11, IBM 360/370
  Eg: the VAX matchc and IA32 scas instructions!

Advanced Computer Architecture Chapter 1. p8


Example: MIPS (note register location)

Register-Register:
  Op (31-26) | Rs1 (25-21) | Rs2 (20-16) | Rd (15-11) | Opx (10-0)
Register-Immediate:
  Op (31-26) | Rs1 (25-21) | Rd (20-16) | immediate (15-0)
Branch:
  Op (31-26) | Rs1 (25-21) | Rs2/Opx (20-16) | immediate (15-0)
Jump / Call:
  Op (31-26) | target (25-0)

Q: What is the largest signed immediate operand for "subw r1,r2,X"?
Q: What range of addresses can a conditional branch jump to?

Advanced Computer Architecture Chapter 1. p9
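A hedged sketch of the answers to the two questions above, assuming the 16-bit immediate field shown in the formats, and the usual MIPS convention (from the textbook, not stated on this slide) that a branch immediate is a word offset shifted left by 2 before being added to the PC:

```python
# Ranges implied by a 16-bit, sign-extended (two's-complement) immediate field.

def signed_range(bits):
    """(min, max) of a two's-complement field of the given width."""
    return (-(1 << (bits - 1)), (1 << (bits - 1)) - 1)

# Register-immediate ALU ops such as "subw r1,r2,X": 16-bit immediate.
IMM_MIN, IMM_MAX = signed_range(16)        # -32768 .. 32767

# Conditional branch reach, assuming the offset is scaled by 4 (word-aligned):
BR_MIN, BR_MAX = (IMM_MIN << 2, IMM_MAX << 2)   # about +/- 128 KB around the PC
```

So the largest signed immediate would be 32767, and a branch could reach roughly 128 KB in either direction, under these assumptions.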

A machine to execute these instructions

To execute this instruction set we need a machine that fetches instructions and does what each instruction says. A "universal" computing device – a simple digital circuit that, with the right code, can compute anything.

Something like:

  Instr = Mem[PC]; PC += 4;
  rs1 = Reg[Instr.rs1];
  rs2 = Reg[Instr.rs2];
  imm = SignExtend(Instr.imm);
  Operand1 = if (Instr.op == BRANCH) then PC else rs1;
  Operand2 = if (immediateOperand(Instr.op)) then imm else rs2;
  res = ALU(Instr.op, Operand1, Operand2);
  switch (Instr.op) {
    case BRANCH:
      if (rs1 == 0) then PC = PC + imm; continue;
    case STORE:
      Mem[res] = rs1; continue;
    case LOAD:
      lmd = Mem[res];
  }
  Reg[Instr.rd] = if (Instr.op == LOAD) then lmd else res;

Advanced Computer Architecture Chapter 1. p11
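The pseudocode above can be turned into a small runnable interpreter. This is an illustrative sketch, not a real MIPS encoding: instructions are pre-decoded tuples, and the opcode names and operand layout are assumptions made for this example.

```python
def run(program, reg, mem, max_steps=100):
    """Fetch/decode/execute loop over pre-decoded (op, rd, rs1, x) tuples,
    where x is a second source register (ADD) or an immediate (others)."""
    pc = 0
    for _ in range(max_steps):
        if pc // 4 >= len(program):
            break
        op, rd, rs1, x = program[pc // 4]   # Instr = Mem[PC]
        pc += 4                             # PC += 4
        if op == "ADD":                     # reg-reg arithmetic: rd := rs1 + x
            reg[rd] = reg[rs1] + reg[x]
        elif op == "LOAD":                  # rd := Mem[rs1 + imm]
            reg[rd] = mem[reg[rs1] + x]
        elif op == "STORE":                 # Mem[rs1 + imm] := rd
            mem[reg[rs1] + x] = reg[rd]
        elif op == "BRANCH":                # if rs1 == 0, branch PC-relative
            if reg[rs1] == 0:
                pc += x
    return reg, mem
```

For example, a four-instruction program that loads two words, adds them, and stores the sum exercises every path except the branch.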

So where do I find a MIPS processor?

MIPS licensees shipped more than 350 million units during fiscal year 2007
(http://www.mips.com/company/about-us/milestones/)

Examples:
  HP 4100 multifunction printer
  Sony PS2 and PSP
  Digimax L85 digital camera (http://www.zoran.com/COACH-9)
  Linksys WRT54G Router (Linux-based)

Advanced Computer Architecture Chapter 1. p10

5 Steps of MIPS Datapath

[Figure 3.1, Page 130, CA:AQA 2e: the five stages – Instruction Fetch, Instruction Decode / Register Fetch, Execute / Address Calculation, Memory Access, Write Back – built from the PC and its adder, instruction memory, register file, sign-extend unit, ALU with zero test, data memory, and the multiplexors that route operands and write-back data.]

Advanced Computer Architecture Chapter 1. p12


Pipelining the MIPS datapath

[Figure 3.1, Page 130, CA:AQA 2e: the same five-stage datapath, now overlapped in time. A pipeline diagram shows successive instructions (instruction order down, clock cycles 1-7 across) each occupying Ifetch, Reg, ALU, DMem, and Reg in consecutive cycles.]

We will see more complex pipeline structures later. For example, the Pentium 4 "Netburst" architecture has 31 stages.

Advanced Computer Architecture Chapter 1. p13

Visualizing Pipelining

[Figure 3.3, Page 133, CA:AQA 2e: pipeline diagram – each instruction passes through Ifetch, Reg, ALU, DMem, Reg in successive cycles.]

Pipelining doesn't help the latency of a single instruction – it helps the throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to "fill" the pipeline and time to "drain" it reduce speedup
Speedup comes from parallelism
  For free – no new hardware

Advanced Computer Architecture Chapter 1. p15
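The fill/drain point can be quantified with a standard back-of-envelope model (assuming perfectly balanced stages and no hazards): a k-stage pipeline takes k + (n - 1) cycles for n instructions, so the speedup over an unpipelined machine approaches k only for long instruction streams.

```python
def pipelined_cycles(n, k):
    """Cycles for n instructions on an ideal k-stage pipeline:
    k cycles to fill, then one instruction completes per cycle."""
    return k + (n - 1)

def speedup(n, k):
    """Speedup vs. an unpipelined machine taking k cycles per instruction."""
    return (n * k) / pipelined_cycles(n, k)
```

For n = 1000 and k = 5 the speedup is already about 4.98, but for short runs the fill/drain overhead keeps it well below 5.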

5-stage MIPS pipeline with pipeline buffers

[Figure 3.4, Page 134, CA:AQA 2e: the pipelined datapath with latches IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages; Next SEQ PC and the RD field are carried forward through the buffers to the write-back mux.]

Data stationary control
  Local decode for each instruction phase / pipeline stage

Advanced Computer Architecture Chapter 1. p14

It's Not That Easy for Computers

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  Structural hazards: HW cannot support this combination of instructions
  Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
  Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

Advanced Computer Architecture Chapter 1. p16


One Memory Port / Structural Hazards

[Figure 3.6, Page 142, CA:AQA 2e: pipeline diagram for Load, Instr 1, Instr 2, Instr 3, Instr 4 over cycles 1-7; in cycle 4 the Load's DMem access and Instr 3's Ifetch both need the single memory port.]

Eg if there is only one memory for both instructions and data
  Two different stages may need access at the same time
  Example: IBM/Sony/Toshiba Cell processor

Advanced Computer Architecture Chapter 1. p17

Data Hazard on R1

  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

[Figure 3.9, page 147, CA:AQA 2e: pipeline diagram (IF, ID/RF, EX, MEM, WB) – the add writes r1 in WB, while the following sub, and, and or read r1 in ID/RF before that write completes.]

Advanced Computer Architecture Chapter 1. p19

One Memory Port / Structural Hazards

[Figure 3.7, Page 143, CA:AQA 2e: the same sequence with a stall – Instr 3's fetch is delayed one cycle, and a bubble travels down the pipeline.]

Instr 3 cannot be loaded in cycle 4
ID stage has nothing to do in cycle 5
EX stage has nothing to do in cycle 6, etc. The "bubble" propagates

Advanced Computer Architecture Chapter 1. p18

Three Generic Data Hazards

Read After Write (RAW)
  InstrJ tries to read an operand before InstrI writes it

    I: add r1,r2,r3
    J: sub r4,r1,r3

  Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Advanced Computer Architecture Chapter 1. p20


Three Generic Data Hazards

Write After Read (WAR)
  InstrJ writes an operand before InstrI reads it

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

  Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".

  Can't happen in the MIPS 5-stage pipeline because:
    All instructions take 5 stages, and
    Reads are always in stage 2, and
    Writes are always in stage 5

Advanced Computer Architecture Chapter 1. p21

Forwarding to Avoid Data Hazard

[Figure 3.10, Page 149, CA:AQA 2e: the add/sub/and/or/xor sequence again, with forwarding paths from the ALU output and the memory stage back to the ALU inputs, so each dependent instruction receives r1 without stalling.]

Advanced Computer Architecture Chapter 1. p23

Three Generic Data Hazards

Write After Write (WAW)
  InstrJ writes an operand before InstrI writes it

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

  Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".

  Can't happen in the MIPS 5-stage pipeline because:
    All instructions take 5 stages, and
    Writes are always in stage 5

  We will see WAR and WAW in later, more complicated pipes

Advanced Computer Architecture Chapter 1. p22
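The three hazard kinds can be detected mechanically from each instruction's destination and source registers. The following sketch illustrates the definitions only – instructions are modelled as (dest, sources) pairs, not real encodings, and this is not a hardware interlock.

```python
def hazards(instr_i, instr_j):
    """Hazards between an earlier instruction I and a later instruction J.

    Each instruction is (dest_register_or_None, [source_registers]).
    """
    i_dst, i_srcs = instr_i
    j_dst, j_srcs = instr_j
    found = set()
    if i_dst is not None and i_dst in j_srcs:
        found.add("RAW")          # J reads what I writes (true dependence)
    if j_dst is not None and j_dst in i_srcs:
        found.add("WAR")          # J writes what I reads (anti-dependence)
    if i_dst is not None and i_dst == j_dst:
        found.add("WAW")          # both write the same name (output dependence)
    return found
```

Applied to the slide's examples, add r1,r2,r3 followed by sub r4,r1,r3 gives a RAW hazard; reversing the pair gives a WAR.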

HW Change for Forwarding

Add forwarding ("bypass") paths
Add multiplexors to select where each ALU operand should come from
Determine mux control in the ID stage
  If a source register is the target of an instrn that will not WB in time

[Figure 3.20, Page 161, CA:AQA 2e: the datapath with muxes in front of the ALU, fed from the ID/EX, EX/MEM, and MEM/WR latches and the data memory.]

Advanced Computer Architecture Chapter 1. p24


Data Hazard Even with Forwarding

  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

[Figure 3.12, Page 153, CA:AQA 2e: pipeline diagram – the lw produces r1 only at the end of its MEM stage, too late to forward to the sub's EX stage in the same cycle.]

Advanced Computer Architecture Chapter 1. p25

Software Scheduling to Avoid Load Hazards

Try producing fast code for

  a = b + c;
  d = e - f;

assuming a, b, c, d, e, and f are in memory.

Slow code:              Fast code:
  LW   Rb,b               LW   Rb,b
  LW   Rc,c               LW   Rc,c
  STALL                   LW   Re,e
  ADD  Ra,Rb,Rc           ADD  Ra,Rb,Rc
  SW   a,Ra               LW   Rf,f
  LW   Re,e               SW   a,Ra
  LW   Rf,f               SUB  Rd,Re,Rf
  STALL                   SW   d,Rd
  SUB  Rd,Re,Rf
  SW   d,Rd

10 cycles (2 stalls)    8 cycles (0 stalls)

(The stalls are shown explicitly.)

Advanced Computer Architecture Chapter 1. p27
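The cycle counts above can be checked with a tiny model: on this 5-stage pipeline with forwarding, a load's result is not available to the immediately following instruction, costing one stall cycle. The tuple format and the one-cycle load-use rule are modelling assumptions for this sketch.

```python
def count_stalls(seq):
    """Count load-use stalls: one stall whenever an LW's destination is
    read by the very next instruction. seq holds (op, dest, sources)."""
    stalls = 0
    for prev, cur in zip(seq, seq[1:]):
        op, dest, _ = prev
        if op == "LW" and dest in cur[2]:
            stalls += 1
    return stalls

slow = [("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
        ("SW", None, ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]

fast = [("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
        ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []),
        ("SW", None, ["Ra"]), ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]
```

Eight instructions plus the stall count reproduces the slide's 10-cycle vs 8-cycle comparison.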

Data Hazard Even with Forwarding

  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

[Figure 3.13, Page 154, CA:AQA 2e: pipeline diagram with a one-cycle bubble – the sub's EX stage is delayed until the lw's loaded value can be forwarded.]

EX stage waits in cycle 4 for its operand
Following instruction ("and") waits in the ID stage
Missed instruction issue opportunity…

Advanced Computer Architecture Chapter 1. p26

Control Hazard on Branches – Three Stage Stall

  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11

[Pipeline diagram: the three instructions after the beq are fetched before the branch outcome is known, giving a three-cycle stall before the xor at the branch target can be fetched.]

Advanced Computer Architecture Chapter 1. p28


Pipelined MIPS Datapath with early branch determination

[Figure 3.22, page 163, CA:AQA 2/e: the pipelined datapath with an extra adder and the zero test moved into the Instruction Decode / Register Fetch stage, so the branch condition and target address are known after ID rather than after Execute.]

Advanced Computer Architecture Chapter 1. p29

Four Branch Hazard Alternatives

#4: Delayed Branch
  Define the branch to take place AFTER a following instruction

    branch instruction
    sequential successor1
    sequential successor2
    ........
    sequential successorn
    branch target if taken

  Branch delay of length n
  A 1-slot delay allows a proper decision and branch target address in a 5-stage pipeline
  MIPS uses this; eg in:

    LW   R3, #100
    LW   R4, #200
    BEQZ R1, L1
    SW   R3, X
    SW   R4, X
  L1:
    LW   R5, X

  "SW R3, X" is executed regardless (it sits in the delay slot)
  "SW R4, X" is executed only if R1 is non-zero
  So: if R1 == 0, X = 100; else X = 100 and then X = 200. Finally R5 = X.

Advanced Computer Architecture Chapter 1. p31

Four Branch Hazard Alternatives

#1: Stall until the branch direction is clear
  (wasteful – the next instruction is being fetched during ID)

#2: Predict Branch Not Taken
  Execute successor instructions in sequence
  "Squash" instructions in the pipeline if the branch is actually taken
  With MIPS we have the advantage of late pipeline state update
  47% of MIPS branches are not taken on average
  PC+4 is already calculated, so use it to get the next instruction

#3: Predict Branch Taken
  53% of MIPS branches are taken on average
  But in the MIPS instruction set we haven't calculated the branch target address yet (because branches are relative to the PC)
    MIPS still incurs a 1-cycle branch penalty
  With some other machines, the branch target is known before the branch condition

Advanced Computer Architecture Chapter 1. p30
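The taken/not-taken percentages above feed a simple cost model for predict-not-taken: with a 1-cycle penalty on taken branches, the extra CPI is the branch frequency times the taken fraction times the penalty. (The 20% branch frequency used in the check below is an assumed figure for illustration; only the 53% taken fraction comes from the slide.)

```python
def branch_cpi_overhead(branch_freq, taken_frac, penalty_cycles=1):
    """Extra cycles per instruction lost to taken branches under
    a predict-not-taken scheme with the given misfetch penalty."""
    return branch_freq * taken_frac * penalty_cycles
```

With an assumed 20% branch frequency and MIPS's 53% taken rate, the overhead is about 0.106 CPI.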

Delayed Branch

Where to get instructions to fill the branch delay slot?
  From before the branch instruction
  From the target address: only valuable when the branch is taken
  From fall-through: only valuable when the branch is not taken

Compiler effectiveness for a single branch delay slot:
  Fills about 60% of branch delay slots
  About 80% of instructions executed in branch delay slots are useful in computation
  About 50% (60% x 80%) of slots are usefully filled

Delayed branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

Canceling branches
  The branch delay slot instruction is executed, but its write-back is disabled if it was not supposed to be executed
  Two variants: branch "likely taken", branch "likely not-taken"
  Allows more slots to be filled

Advanced Computer Architecture Chapter 1. p32


Eliminating hazards with simultaneous multi-threading

If we had no stalls we could finish one instruction every cycle
If we had no hazards we could do without forwarding – and decode/control would be simpler too

IF maintains two Program Counters (PC0 and PC1):
  Even cycle – fetch from PC0
  Odd cycle – fetch from PC1
Thread 0 reads and writes the thread-0 registers; thread 1 has its own register set
No register-to-register hazards between adjacent pipeline stages

Example: the PowerPC processing element (PPE) in the Cell Broadband Engine (Sony PlayStation 3)

Advanced Computer Architecture Chapter 1. p33

Life used to be so easy

Processor-DRAM Memory Gap (latency)

[Chart, 1980-2000: processor performance ("Moore's Law") grows at about 60%/yr (2X/1.5yr), while DRAM performance grows at about 9%/yr (2X/10 yrs); the processor-memory performance gap grows about 50% per year.]

In 1980 a large RAM's access time was close to the CPU cycle time. 1980s machines had little or no need for cache. Life is no longer quite so simple.

Advanced Computer Architecture Chapter 1. p35

So – how fast can this design go?

A simple 5-stage pipeline can run at >3GHz
  Limited by the critical path through the slowest pipeline stage's logic
Tradeoff: do more per cycle? Or increase the clock rate? Or do more per cycle, in parallel…
At 3GHz, the clock period is about 330 picoseconds
  The time light takes to go about four inches
  About 10 gate delays
  For example, the Cell BE is designed for 11 FO4 ("fanout=4") gates per cycle:
  www.fe.infn.it/~belletti/articles/ISSCC2005-cell.pdf
  Pipeline latches etc account for 3-5 FO4 delays, leaving only 5-8 for actual work
How can we build a RAM that can implement our MEM stage in 5-8 FO4 delays?

Advanced Computer Architecture Chapter 1. p34
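A quick check of the arithmetic above, using the exact speed of light:

```python
C_M_PER_S = 299_792_458            # speed of light in vacuum, m/s
period_s = 1 / 3e9                 # one cycle at 3 GHz: ~333 ps
distance_m = C_M_PER_S * period_s  # ~0.1 m, i.e. about four inches
```

So one clock period at 3 GHz corresponds to roughly 10 cm of light travel, matching the slide's "about four inches".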

Memory Hierarchy: Terminology

Hit: the data appears in some block X in the upper level
  Hit Rate: the fraction of memory accesses found in the upper level
  Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: the data needs to be retrieved from a block Y in the lower level
  Miss Rate = 1 - (Hit Rate)
  Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty
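These terms combine into the standard average memory access time (AMAT) identity, built from exactly the quantities defined above:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in cycles:
    every access pays hit_time; misses additionally pay miss_penalty."""
    return hit_time + miss_rate * miss_penalty
```

For example, a 1-cycle hit time, 5% miss rate, and 100-cycle miss penalty give an average of 6 cycles per access.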


Levels of the Memory Hierarchy

[Figure: the levels of the memory hierarchy, starting from CPU registers (100s of bytes), each annotated with capacity, access time, cost, and the staging/transfer unit between levels.]


Direct-mapped cache – read access

1 KB Direct Mapped Cache, 32B blocks
For a 2^N byte cache:
  The uppermost (32 - N) bits are always the Cache Tag
  The lowest M bits are the Byte Select (Block Size = 2^M)

Example address split (bits 31..10 / 9..5 / 4..0): Cache Tag 0x50, Cache Index 0x01, Byte Select 0x00
The tag is stored as part of the cache "state", alongside a Valid Bit

[Figure: the cache data array – block 0 holds Byte 0 to Byte 31, block 1 (index 0x01, tag 0x50) holds Byte 32 to Byte 63, and so on up to Byte 1023. The stored tag is compared with the address tag to generate Hit, and the Byte Select picks the data byte.]

Advanced Computer Architecture Chapter 1. p41

Direct-mapped Cache – structure

Capacity: C bytes (eg 1KB)
Blocksize: B bytes (eg 32)
Byte select bits: 0..log(B)-1 (eg 0..4)
Number of blocks: C/B (eg 32)
Address size: A (eg 32 bits)
Cache index size: I = log(C/B) (eg log(32) = 5)
Tag size: A - I - log(B) (eg 32-5-5 = 22)

[Figure: the address split into Adr Tag, Cache Index, and byte select; the index selects a cache block, whose stored tag is compared against the address tag to produce Hit.]

Advanced Computer Architecture Chapter 1. p43
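The field sizes above can be checked in code. A sketch using the slide's example parameters (1 KB capacity, 32-byte blocks, both powers of two, 32-bit addresses):

```python
def split_address(addr, capacity=1024, block=32):
    """Split an address into (tag, index, byte_select) for a
    direct-mapped cache of the given capacity and block size."""
    m = block.bit_length() - 1                 # byte-select bits: log2(B)
    i = (capacity // block).bit_length() - 1   # index bits: log2(C/B)
    byte_sel = addr & (block - 1)
    index = (addr >> m) & ((1 << i) - 1)
    tag = addr >> (m + i)                      # the remaining upper bits
    return tag, index, byte_sel
```

Address 0x14020 reproduces the example split on the previous slide: tag 0x50, index 0x01, byte select 0x00.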

1 KB Direct Mapped Cache, 32B blocks<br />

Cache location 0 can be occupied<br />

by data from main memory<br />

location 0, 32, 64, … etc.<br />

Cache location 1 can be occupied<br />

by data from main memory<br />

location 1, 33, 65, … etc.<br />

In general, all memory locations with the same<br />

low-order address bits map to the same<br />

location in the cache. Which one should<br />

we place in the cache?<br />

How can we tell which one is in<br />

the cache?<br />
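The mapping is simply the memory block number modulo the number of cache blocks, which can be sketched as (32 cache locations, as in the slide):<br />

```python
# In a direct-mapped cache, memory block b maps to cache location b mod num_blocks.
def cache_location(mem_block, num_blocks=32):
    return mem_block % num_blocks

# Memory blocks 0, 32 and 64 all contend for cache location 0;
# blocks 1, 33 and 65 contend for location 1.
print([cache_location(b) for b in (0, 32, 64)])   # [0, 0, 0]
print([cache_location(b) for b in (1, 33, 65)])   # [1, 1, 1]
```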


Two-way Set Associative Cache<br />

N-way set associative: N entries for each Cache<br />

Index<br />

N direct mapped caches operated in parallel (N typically 2 to 4)<br />

Example: Two-way set associative cache<br />

Cache Index selects a “set” from the cache<br />

The two tags in the set are compared in parallel<br />

Data is selected based on the tag result<br />

[Figure: Two-way set associative cache: the Cache Index selects one Cache Block from each way's Cache Data array; both stored Cache Tags (with their valid bits) are compared with the address tag (Adr Tag) in parallel, the compare results are ORed to form Hit, and Sel1/Sel0 drive a mux that selects the matching way's Cache Block.]<br />
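A minimal behavioural sketch of the two-way lookup (the data-structure layout and names are invented for the example; hardware compares the ways in parallel, software here checks them in turn):<br />

```python
# Two-way set-associative lookup: the index selects a set; each way's tag is
# compared against the address tag, and the data is selected by whichever hits.
def lookup(sets, tag, index):
    for valid, stored_tag, data in sets[index]:  # each way: (valid, tag, data)
        if valid and stored_tag == tag:
            return True, data                    # hit: return the selected block
    return False, None                           # miss in both ways

sets = [[(True, 0x50, "blockA"), (True, 0x7F, "blockB")]]
print(lookup(sets, 0x7F, 0))  # (True, 'blockB')
print(lookup(sets, 0x12, 0))  # (False, None)
```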




Disadvantage of Set Associative Cache<br />

N-way Set Associative Cache vs. Direct Mapped Cache:<br />

N comparators vs. 1<br />

Extra MUX delay for the data<br />

Data comes AFTER Hit/Miss<br />

In a direct mapped cache, the Cache Block is available BEFORE<br />

Hit/Miss:<br />

Possible to assume a hit and continue; recover later if miss.<br />


4 Questions for Memory Hierarchy<br />

Q1: Where can a block be placed in the upper level?<br />

(Block placement)<br />

Q2: How is a block found if it is in the upper level?<br />

(Block identification)<br />

Q3: Which block should be replaced on a miss?<br />

(Block replacement)<br />

Q4: What happens on a write?<br />

(Write strategy)<br />


Basic cache terminology<br />

Example: Intel Pentium 4 Level-1 cache (pre-Prescott)<br />

Capacity: 8K bytes (total amount of data the cache can store)<br />

Block: 64 bytes (so there are 8K/64=128 blocks in the cache)<br />

Ways: 4 (addresses with the same index bits can be placed in one of 4 ways)<br />

Sets: 32 (=128/4; that is, each RAM array holds 32 blocks)<br />

Index: 5 bits (since 2^5 = 32 and we need the index to select one of the 32 sets)<br />

Tag: 21 bits (=32 minus 5 for index, minus 6 to address a byte within a block)<br />

Access time: 2 cycles (0.6ns at 3GHz; pipelined, dual-ported [load+store])<br />
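The parameters above are mutually consistent, which a few lines of arithmetic confirm (the capacity, block size, ways and address width come from the slide):<br />

```python
import math

# Check the Pentium 4 L1 parameters from the slide are self-consistent.
capacity, block, ways, addr_bits = 8 * 1024, 64, 4, 32
blocks = capacity // block           # 128 blocks
sets = blocks // ways                # 32 sets
index_bits = int(math.log2(sets))    # 5 index bits
offset_bits = int(math.log2(block))  # 6 bits to address a byte within a block
tag_bits = addr_bits - index_bits - offset_bits  # 21 tag bits
print(blocks, sets, index_bits, tag_bits)  # 128 32 5 21
```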


Q1: Where can a block be placed in<br />

the upper level?<br />

In a fully-associative cache, block 12 can be placed in any location in the cache.<br />

In a direct-mapped cache, block 12 can only be placed in one cache location, determined by its low-order address bits: (12 mod 8) = 4.<br />

In a two-way set-associative cache, the set is determined by its low-order address bits, (12 mod 4) = 0; block 12 can be placed in either of the two cache locations in set 0.<br />
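The three placements can be computed uniformly: with S sets, block b may go anywhere in set (b mod S); direct mapped is the one-block-per-set case and fully associative is the S = 1 case (a sketch matching the slide's 8-block cache):<br />

```python
# With num_sets sets, memory block b may be placed anywhere in set (b mod num_sets).
# Direct mapped: num_sets = num_blocks; fully associative: num_sets = 1.
def placement_set(block, num_sets):
    return block % num_sets

print(placement_set(12, 8))  # direct mapped, 8 blocks: location 4
print(placement_set(12, 4))  # two-way, 8 blocks -> 4 sets: set 0
print(placement_set(12, 1))  # fully associative: the single set 0
```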



Q2: How is a block found if it is in the upper<br />

level?<br />

Tag on each block<br />

No need to check index or block offset<br />

Address layout: | Tag | Index | Block Offset |, where the Tag and Index together form the Block Address<br />

Increasing associativity shrinks index, expands tag<br />


Q4: What happens on a write?<br />

Write through—The information is written<br />

to both the block in the cache and to the<br />

block in the lower-level memory<br />

Write back—The information is written only<br />

to the block in the cache. The modified<br />

cache block is written to main memory only<br />

when it is replaced.<br />

is the block clean or dirty?<br />

Pros and Cons of each?<br />

WT: read misses cannot result in writes<br />

WB: no repeated writes to the same location<br />

WT is always combined with write buffers so<br />

that we don’t wait for the lower-level memory<br />
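The “no repeated writes” advantage of write-back shows up when counting memory writes for a simple store stream (a toy model: block granularity only, and every dirty block is eventually written back once on eviction; real caches interleave evictions with the stream):<br />

```python
# Toy comparison of memory-write traffic: write-through writes memory on every
# store; write-back only writes a dirty block back when it is evicted.
def memory_writes(stores, policy):
    if policy == "WT":
        return len(stores)   # one memory write per store
    dirty = set(stores)      # WB: each dirtied block is written back
    return len(dirty)        # once when it is eventually evicted

stores = [0, 0, 0, 4, 4, 0]            # repeated stores to blocks 0 and 4
print(memory_writes(stores, "WT"))     # 6 memory writes
print(memory_writes(stores, "WB"))     # 2 (blocks 0 and 4 written back once each)
```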


Q3: Which block should be replaced on a<br />

miss?<br />

Easy for Direct Mapped<br />

Set Associative or Fully Associative:<br />

Random<br />

LRU (Least Recently Used)<br />

Miss rates:<br />

Assoc:    2-way          4-way          8-way<br />

Size      LRU    Ran     LRU    Ran     LRU    Ran<br />

16 KB     5.2%   5.7%    4.7%   5.3%    4.4%   5.0%<br />

64 KB     1.9%   2.0%    1.5%   1.7%    1.4%   1.5%<br />

256 KB    1.15%  1.17%   1.13%  1.13%   1.12%  1.12%<br />

Benchmark studies show that LRU beats random only with small caches<br />
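A common way to model LRU for one set is an ordered list, moving a block to the most-recently-used end on every hit and evicting from the other end (a sketch for a single set of a 2-way cache; the block numbers are arbitrary):<br />

```python
from collections import OrderedDict

# LRU replacement for one cache set: on a hit, move the block to the MRU end;
# on a miss with a full set, evict the block at the LRU end.
def access(set_state, block, ways=2):
    if block in set_state:
        set_state.move_to_end(block)     # hit: mark most recently used
        return True
    if len(set_state) >= ways:
        set_state.popitem(last=False)    # miss: evict least recently used
    set_state[block] = None
    return False

s = OrderedDict()
hits = [access(s, b) for b in [1, 2, 1, 3, 2]]
print(hits)      # [False, False, True, False, False]
print(list(s))   # [3, 2]: blocks resident at the end, MRU last
```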


Write Buffer for Write Through<br />

Processor → Cache → Write Buffer → DRAM<br />

A Write Buffer is needed between the Cache and<br />

Memory<br />

Processor: writes data into the cache and the write buffer<br />

Memory controller: writes contents of the buffer to memory<br />

Write buffer is just a FIFO:<br />

Typical number of entries: 4<br />

Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle<br />

Otherwise: write buffer saturation<br />




A Modern Memory Hierarchy<br />

By taking advantage of the principle of locality:<br />

Present the user with as much memory as is available in the<br />

cheapest technology.<br />

Provide access at the speed offered by the fastest technology.<br />

Processor (control + datapath) → Registers → On-Chip Cache → Second-Level Cache (SRAM) → Main Memory (DRAM) → Secondary Storage (Disk) → Tertiary Storage (Disk/Tape)<br />

Speed (ns):   1s     10s    100s   10,000,000s (10s ms)   10,000,000,000s (10s sec)<br />

Size (bytes): 100s   Ks     Ms     Gs                     Ts<br />


Can we live without cache?<br />

Interesting exception: Cray/Tera MTA,<br />

first delivered June 1999:<br />

www.cray.com/products/systems/mta/<br />

Each CPU switches every cycle between<br />

128 threads<br />

Each thread can have up to 8<br />

outstanding memory accesses<br />

3D toroidal mesh interconnect<br />

Memory accesses hashed to spread load<br />

across banks<br />

MTA-1 fabricated using Gallium<br />

Arsenide, not silicon<br />

“nearly un-manufacturable” (wikipedia)<br />

Third-generation Cray XMT:<br />

http://www.cray.com/Products/XMT.aspx<br />


Large-scale storage<br />

StorageTek STK 9310 (“Powderhorn”)<br />

2,000, 3,000, 4,000, 5,000 or 6,000 cartridge slots per library storage module (LSM)<br />

Up to 24 LSMs per library (144,000 cartridges)<br />

120 TB (1 LSM) to 28,800 TB capacity (24 LSMs)<br />

Each cartridge holds 300GB, readable at up to 40 MB/sec<br />

Up to 28.8 petabytes<br />

Ave 4s to load tape<br />

http://www.b2net.co.uk/storagetek/storagetek_powderhorn_9310_tape_library.htm<br />

http://en.wikipedia.org/wiki/Tape_library<br />

http://www.ibm.qassociates.co.uk/storage-tape-enterpr<strong>is</strong>e-tape-drive-J1A-specifications.htm<br />


Ch1<br />

Review of pipelined, in-order<br />

processor architecture and simple<br />

cache structures<br />

Ch2<br />

Caches in more depth<br />

Software techniques to improve<br />

cache performance<br />


Virtual memory<br />

Benchmarking<br />

Fab<br />

Ch3<br />

Instruction-level parallel<strong>is</strong>m<br />

Dynamic scheduling, out-of-order<br />

Reg<strong>is</strong>ter renaming<br />

Speculative execution<br />

Branch prediction<br />

Limits to ILP<br />

Ch4<br />

Compiler techniques – loop nest<br />

transformations<br />

Loop parallel<strong>is</strong>ation, interchange,<br />

tiling/blocking, skewing<br />

Where we are going…<br />

Ch5<br />

Multithreading, hyperthreading, SMT<br />

Static instruction scheduling<br />

Software pipelining<br />

EPIC/IA-64; instruction-set support for<br />

speculation and reg<strong>is</strong>ter renaming<br />

Ch6<br />

GPUs, GPGPU, and manycore<br />

Ch7<br />

Shared-memory multiprocessors<br />

Cache coherency<br />

Large-scale cache-coherency; ccNUMA.<br />

COMA<br />

Lab-based coursework exercise:<br />

Simulation study<br />

“challenge”<br />

Using performance analys<strong>is</strong> tools<br />

Exam:<br />

Partially based on a recent processor<br />

architecture article, which we will study in<br />

advance (see past papers)<br />

