05.11.2014 Views

Appendix A Pipelining: Basic and Intermediate Concepts


EEF011 Computer Architecture (計算機結構)

Appendix A

Pipelining: Basic and Intermediate Concepts

吳俊興

Department of Computer Science and Information Engineering, National University of Kaohsiung (高雄大學資訊工程學系)

October 2004


Outline

• Basic concept of Pipelining
• The Basic Pipeline for MIPS
• The Major Hurdles of Pipelining: Pipeline Hazards


Laundry Example: What Is Pipelining?

• Ann, Betty, Cathy, and Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes


What Is Pipelining?

[Figure: sequential laundry timeline from 6 PM to midnight. Loads A, B, C, and D (task order) run one after another; each load takes 30 + 40 + 20 minutes.]

• Sequential laundry takes 6 hours for 4 loads
• Want to reduce the time? Pipelining!


What Is Pipelining?

[Figure: pipelined laundry timeline starting at 6 PM. Loads A, B, C, and D (task order) overlap; the timeline segments are 30, 40, 40, 40, 40, and 20 minutes.]

• Start work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
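The two laundry totals can be checked with a short sketch. It assumes (this is not stated on the slide) that each machine handles one load at a time and that a finished load can wait between machines; the function names are illustrative only.

```python
def sequential_time(stage_minutes, loads):
    """Total minutes when each load finishes every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_time(stage_minutes, loads):
    """Total minutes when loads overlap, with one load per machine at a time.

    finish[s] is the time at which the most recent load left stage s.
    A load enters a stage once it has left the previous stage and the
    stage itself is free.
    """
    finish = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(stage_minutes):
            start = max(ready, finish[s])
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]


stages = [30, 40, 20]  # washer, dryer, "folder" (minutes)
print(sequential_time(stages, 4) / 60)  # 6.0 hours
print(pipelined_time(stages, 4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 × 40 + 20 = 210 minutes: after the first wash, the 40-minute dryer is the bottleneck.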


What Is Pipelining?

‣ Pipelining is an implementation technique whereby multiple instructions are overlapped in execution
‣ It takes advantage of parallelism that exists among instructions (instruction-level parallelism)
‣ It is the key implementation technique used to make fast CPUs

• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
  – Unbalanced lengths of pipe stages reduce speedup
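A rough sketch of the last two bullets, assuming identical tasks and stages that each hold one task at a time. The function `pipeline_speedup` is a hypothetical helper for illustration, not something from the lecture.

```python
def pipeline_speedup(stage_times, tasks):
    """Speedup of a pipeline over running the same tasks one after another."""
    unpipelined = tasks * sum(stage_times)
    # The first task needs the full latency; each later task completes one
    # slowest-stage time after the previous one, because the pipeline rate
    # is limited by the slowest stage.
    pipelined = sum(stage_times) + (tasks - 1) * max(stage_times)
    return unpipelined / pipelined


print(pipeline_speedup([1, 1, 1, 1, 1], 1000))  # balanced: close to 5, the number of stages
print(pipeline_speedup([30, 30, 30], 4))        # balanced laundry stages
print(pipeline_speedup([30, 40, 20], 4))        # unbalanced stages give a smaller speedup
```

With balanced stages and many tasks the speedup approaches the number of stages; with unbalanced stages it stays below that.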


MIPS Without Pipelining

‣ The execution of instructions is controlled by the CPU clock: one specific function is performed in each clock cycle.
‣ Every MIPS instruction takes at most 5 clock cycles, one for each of five different stages.
‣ Several temporary registers are introduced to implement the 5-stage structure.


MIPS Functions<br />

Only consider load-store, BEQZ, and integer ALU instructions

Passed To Next Stage<br />

IR


MIPS Functions<br />

Passed To Next Stage<br />

A


MIPS Functions<br />

Passed To Next Stage<br />

ALUOutput


MIPS Functions<br />

Passed To Next Stage<br />

LMD = Mem[ALUOutput]<br />

or<br />

Mem[ALUOutput] = B;<br />

If (cond) PC


MIPS Functions<br />

Passed To Next Stage<br />

Regs[rd]


The classic five-stage pipeline for MIPS

• We can pipeline the execution with almost no changes by simply starting a new instruction on each clock cycle.
• Each clock cycle becomes a pipe stage (a cycle in the pipeline), which results in the execution pattern typical of a pipeline.
• Although each instruction takes 5 clock cycles to complete, the hardware initiates a new instruction during each clock cycle and executes some part of the five different instructions already in the pipeline.
• It may be hard to believe that pipelining is as simple as this.

Instruction number   Clock number
                     1    2    3    4    5    6    7    8    9
Instruction i        IF   ID   EX   MEM  WB
Instruction i+1           IF   ID   EX   MEM  WB
Instruction i+2                IF   ID   EX   MEM  WB
Instruction i+3                     IF   ID   EX   MEM  WB
Instruction i+4                          IF   ID   EX   MEM  WB
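The staggered pattern can be generated mechanically. The sketch below assumes an ideal pipeline with no stalls; `pipeline_chart` and `render` are illustrative names only.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_chart(num_instructions):
    """Return {instruction index: {clock cycle: stage}} for an ideal pipeline."""
    chart = {}
    for i in range(num_instructions):
        # Instruction i+k enters IF in clock cycle k + 1 and then advances
        # one stage per clock cycle.
        chart[i] = {i + 1 + s: stage for s, stage in enumerate(STAGES)}
    return chart


def render(chart):
    """Format the chart as text, one row per instruction."""
    last_cycle = max(max(row) for row in chart.values())
    lines = []
    for i, row in chart.items():
        cells = [f"{row.get(c, ''):<5}" for c in range(1, last_cycle + 1)]
        lines.append(f"i+{i}  " + "".join(cells).rstrip())
    return "\n".join(lines)


print(render(pipeline_chart(5)))
```

In clock cycle 5 every stage is busy with a different instruction, which is the steady state of the pipeline.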


Figure A.2 The pipeline can be thought of as a series of data paths shifted in time


Simple MIPS Pipeline

• The MIPS pipeline data path must deal with problems that pipelining introduces in a real implementation.
• It is critical to ensure that instructions in different stages of the pipeline do not attempt to use the same hardware resource at the same time (in the same clock cycle), for example by performing different operations with the same functional unit, such as the ALU, in the same clock cycle.
• Instruction and data memories are separated into different caches (IM/DM).
• The register file is used in two stages: for reading in ID and for writing in WB. To handle a read and a write to the same register, we perform the register write in the first half of the clock cycle and the read in the second half.


Pipeline implementation for MIPS

To ensure that instructions in different stages of the pipeline do not interfere with each other, the data path is pipelined by adding a set of registers, one between each pair of pipe stages.

The registers serve to convey values and control information from one stage to the next.

Most of the data paths flow from left to right, that is, from earlier in time to later. The paths flowing from right to left (which carry the register write-back information and PC information on a branch) introduce complications into the pipeline.


Events on Pipe Stages of the MIPS Pipeline<br />

Stage Any instruction<br />

Figure A.19<br />

IF<br />

ID<br />

IF/ID.IR


<strong>Basic</strong> Performance Issues for <strong>Pipelining</strong><br />

Example: Assume that an unpipelined processor has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution time will we gain from the pipeline implementation?

Solution:

Avg. instr. exec time unpipelined = Clock cycle time × Avg. CPI
= 1 ns × (40% × 4 + 20% × 4 + 40% × 5) = 4.4 ns

In the ideal situation without any latency impact, the average CPI is 1 cycle for all kinds of instructions and the clock cycle time is 1.0 ns + 0.2 ns = 1.2 ns, so
Avg. instr. exec time pipelined = 1.2 ns × 1 = 1.2 ns

The speedup from pipelining is then 4.4 ns / 1.2 ns, or about 3.7 times.

What is the result if there is no overhead when implementing pipelining?
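The arithmetic of this example can be reproduced in a few lines. The variable names are illustrative, and the last line only applies the same formula with the overhead set to zero.

```python
def average_cpi(mix):
    """Weighted average cycles per instruction for (frequency, cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)


# (relative frequency, cycles): ALU operations, branches, memory operations
mix = [(0.4, 4), (0.2, 4), (0.4, 5)]

unpipelined_time = 1.0 * average_cpi(mix)  # ns per instruction
pipelined_time = (1.0 + 0.2) * 1           # ns per instruction, CPI = 1

print(round(unpipelined_time, 2))                   # 4.4
print(round(unpipelined_time / pipelined_time, 2))  # 3.67
print(round(unpipelined_time / 1.0, 2))             # 4.4 with no clock overhead
```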


A.2 The Major Hurdle of Pipelining: Pipeline Hazards

• Limits to pipelining: there are situations, called hazards, that prevent the next instruction from executing during its designated clock cycle and thus reduce performance from the ideal speedup. The three classes of hazards are:
  – Structural hazards: arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution (two different instructions use the same hardware in the same cycle).
  – Data hazards: arise when an instruction depends on the result of a prior instruction still in the pipeline (RAW, WAR, and WAW).
  – Control hazards: arise from the pipelining of branches and other instructions that change the PC.

‣ A common solution is to stall the pipeline until the hazard is cleared, i.e., to insert one or more “bubbles” into the pipeline.


Performance of Pipelining with Stalls

• The pipelined CPI:

$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall cycles per instr.} = 1 + \text{Pipeline stall cycles per instr.}$$

• Ignoring the cycle time overhead of pipelining, and assuming the stages are perfectly balanced (all occupy one clock cycle) and all instructions take the same number of cycles, the speedup from pipelining is:

$$\text{Speedup} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}}}{1 + \text{Pipeline stall cycles per instr.}} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instr.}}$$
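As a quick numeric illustration of the speedup formula, under the same assumptions (balanced stages, no cycle time overhead): the helper name and the 0.25 stall figure below are made up for the example.

```python
def speedup_with_stalls(pipeline_depth, stall_cycles_per_instr):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)."""
    return pipeline_depth / (1 + stall_cycles_per_instr)


print(speedup_with_stalls(5, 0.0))   # 5.0: no stalls, the ideal speedup equals the depth
print(speedup_with_stalls(5, 0.25))  # 4.0: a quarter of a stall cycle per instruction
```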


Structural Hazards

• Occur when two or more different instructions want to use the same hardware resource in the same cycle
• E.g., MEM uses the same memory port as IF, as shown in this slide
• Solution: stall

[Figure: pipeline diagram over clock cycles 1 to 7 for Load, Instr 1, Instr 2, Instr 3, and Instr 4 in instruction order; each instruction passes through Ifetch, Reg, ALU, DMem, and Reg, starting one cycle after the previous instruction.]
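The conflict in this figure can be found by checking which cycles need the memory port twice. This is a minimal sketch assuming a single memory port shared by IF and MEM, one instruction issued per cycle, and no stalls; `memory_port_conflicts` is a hypothetical helper.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def memory_port_conflicts(instructions):
    """List (cycle, memory instruction, fetched instruction) port conflicts.

    instructions: list of (name, uses_data_memory) in program order.
    Instruction k (0-based) is fetched in clock cycle k + 1.
    """
    conflicts = []
    for i, (name, uses_memory) in enumerate(instructions):
        if not uses_memory:
            continue
        mem_cycle = i + 1 + STAGES.index("MEM")
        fetch_index = mem_cycle - 1  # instruction whose IF falls in mem_cycle
        if fetch_index < len(instructions):
            conflicts.append((mem_cycle, name, instructions[fetch_index][0]))
    return conflicts


program = [("Load", True), ("Instr 1", False), ("Instr 2", False),
           ("Instr 3", False), ("Instr 4", False)]
print(memory_port_conflicts(program))  # [(4, 'Load', 'Instr 3')]
```

The load reaches MEM in cycle 4, the same cycle in which Instr 3 would be fetched, so one of them has to wait.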


Structural Hazards

This is another way of looking at the effect of a stall.

[Figure: pipeline diagram over clock cycles 1 to 7 for Load, Instr 1, Instr 2, Stall, and Instr 3 in instruction order; the Stall row consists of five bubbles and is followed by Instr 3 (Ifetch, Reg, ALU, DMem, Reg).]


Structural Hazards<br />

This is another way to represent the stall.<br />

23


Dealing With Structural Hazards

• Stall
  – low cost, simple
  – increases CPI
  – use for rare cases, since stalling hurts performance
• Replicate resource
  – good performance
  – increases cost (+ maybe interconnect delay)
  – useful for cheap or divisible resources
  – e.g., we use separate instruction and data memories in the MIPS pipeline


Data Hazards

• Data hazards occur when the pipeline changes the order of read/write accesses to operands (registers) so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor.
• The real trouble arises when we have:
    instruction A
    instruction B
  and B manipulates (reads or writes) data before A does. This violates the order of the instructions, since the architecture implies that A completes entirely before B is executed.


Data Hazards

Execution order is: Instr I, then Instr J

Read After Write (RAW)
Instr J tries to read an operand before Instr I writes it

I: dadd r1,r2,r3
J: dsub r4,r1,r3

• Caused by a “dependence” (in compiler nomenclature). This hazard results from an actual need for communication.


Data Hazards

Execution order is: Instr I, then Instr J

Write After Read (WAR)
Instr J tries to write an operand before Instr I reads it
– Instr I gets the wrong operand

I: dsub r4,r1,r3
J: dadd r1,r2,r3
K: mul r6,r1,r7

– Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.

• Can’t happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5


Data Hazards

Execution order is: Instr I, then Instr J

Write After Write (WAW)
Instr J tries to write an operand before Instr I writes it
– Leaves the wrong result (that of Instr I, not Instr J)

I: dsub r1,r4,r3
J: dadd r1,r2,r3
K: mul r6,r1,r7

– Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.

• Can’t happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5

• We will see WAR and WAW hazards in later, more complicated pipeline implementations
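The three cases differ only in which register names overlap, which a small sketch can make concrete. It reports the potential hazard classes between two instructions from their register names; whether a real hazard occurs still depends on the pipeline, as noted above for WAR and WAW. The `(dest, sources)` encoding is an assumption made for illustration.

```python
def classify_hazards(instr_i, instr_j):
    """Return the set of potential hazards when J follows I in program order.

    Each instruction is (destination register, list of source registers).
    """
    dest_i, sources_i = instr_i
    dest_j, sources_j = instr_j
    hazards = set()
    if dest_i in sources_j:
        hazards.add("RAW")  # J reads what I writes (true dependence)
    if dest_j in sources_i:
        hazards.add("WAR")  # J writes what I reads (anti-dependence)
    if dest_i == dest_j:
        hazards.add("WAW")  # both write the same register (output dependence)
    return hazards


# The I/J pairs from the three slides above
print(classify_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r3"])))  # {'RAW'}
print(classify_hazards(("r4", ["r1", "r3"]), ("r1", ["r2", "r3"])))  # {'WAR'}
print(classify_hazards(("r1", ["r4", "r3"]), ("r1", ["r2", "r3"])))  # {'WAW'}
```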


Solutions to Data Hazards

• Simple solution to RAW
  • Hardware detects the RAW hazard and stalls until the result is written into the register
    + low cost to implement, simple
    – reduces the number of instructions executed per cycle
• Minimizing RAW stalls: forwarding (also called bypassing)
  • Key insight: the result is not really needed by the current instruction until after the previous instruction actually produces it.
  • The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs.
  • If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
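The selection step performed by the control logic might be sketched as follows. This is a simplified model, not the actual MIPS control equations: it handles one ALU input at a time and assumes each pipeline register records only the destination register of the ALU result it holds (or `None`).

```python
def forward_select(source_reg, ex_mem_dest, mem_wb_dest):
    """Pick where one ALU input should come from.

    ex_mem_dest / mem_wb_dest: destination register of the ALU result held in
    the EX/MEM or MEM/WB pipeline register, or None if there is none.
    """
    if ex_mem_dest == source_reg:
        return "EX/MEM"   # result of the immediately preceding instruction
    if mem_wb_dest == source_reg:
        return "MEM/WB"   # result of the instruction two ahead
    return "REGFILE"      # no forwarding needed


# dadd r1,r2,r3 followed by dsub r4,r1,r3 and then and r6,r1,r7
print(forward_select("r1", "r1", None))  # dsub in EX: EX/MEM
print(forward_select("r1", "r4", "r1"))  # and in EX: MEM/WB
print(forward_select("r7", "r4", "r1"))  # REGFILE
```

EX/MEM is checked first so that the most recent write to a register wins when both pipeline registers hold a value for it.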


Data Hazards

[Figure: pipeline diagram over clock cycles CC1 to CC9 with stages IF, ID, EX, MEM, and WB (drawn as Ifetch, Reg, ALU, DMem, Reg) for the following instructions in order, each starting one cycle after the previous one:]

dadd r1,r2,r3
dsub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

The use of the result of the DADD instruction in the next two instructions causes a hazard, since the register is not written until after those instructions read it.


Forwarding to Avoid Data Hazards

Forwarding is the concept of making data available to the input of the ALU for subsequent instructions, even though the generating instruction hasn’t yet reached WB to write the memory or registers.

[Figure: pipeline diagram over clock cycles CC1 to CC9 (Ifetch, Reg, ALU, DMem, Reg) for the following instructions in order, each starting one cycle after the previous one:]

dadd r1,r2,r3
dsub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11


Data Hazards Requiring Stalls

[Figure: pipeline diagram over clock cycles CC1 to CC8 (Ifetch, Reg, ALU, DMem, Reg) for the following instructions in order:]

LD R1,0(R2)
DSUB R4,R1,R6
AND R6,R1,R7
OR R8,R1,R9

There are some instances where hazards occur even with forwarding, e.g., when the data isn’t loaded until after the MEM stage.


Data Hazards Requiring Stalls

[Figure: the same sequence (LD R1,0(R2); DSUB R4,R1,R6; AND R6,R1,R7; OR R8,R1,R9) over clock cycles CC1 to CC8, with a bubble inserted into each of the DSUB, AND, and OR rows.]

The stall is necessary in this case.


Another Representation of the Stall

LD   R1, 0(R2)     IF    ID    EX    MEM   WB
DSUB R4, R1, R5          IF    ID    EX    MEM   WB
AND  R6, R1, R7                IF    ID    EX    MEM   WB
OR   R8, R1, R9                      IF    ID    EX    MEM   WB

LD   R1, 0(R2)     IF    ID    EX    MEM   WB
DSUB R4, R1, R5          IF    ID    stall EX    MEM   WB
AND  R6, R1, R7                IF    stall ID    EX    MEM   WB
OR   R8, R1, R9                      stall IF    ID    EX    MEM   WB

In the top table, we can see why a stall is needed: the MEM cycle of the load produces a value that is needed in the EX cycle of the DSUB, which occurs at the same time. This problem is solved by inserting a stall, as shown in the bottom table.
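The timing of the stalled sequence can be reproduced with a small model. This is a sketch under simplifying assumptions: forwarding is in place, so only a load followed immediately by an instruction that uses its result costs one stall cycle, and instructions execute in order with one EX per cycle. The tuple encoding is illustrative.

```python
def ex_cycles(program):
    """Return the clock cycle in which each instruction is in EX.

    program: list of (destination register, source registers, is_load).
    """
    cycles = []
    for i, (dest, sources, is_load) in enumerate(program):
        # The first instruction is in EX in cycle 3 (IF = 1, ID = 2);
        # later instructions follow the previous EX by one cycle.
        ex = 3 if i == 0 else cycles[-1] + 1
        if i > 0:
            prev_dest, _, prev_is_load = program[i - 1]
            if prev_is_load and prev_dest in sources:
                ex += 1  # load-use stall: the value only exists after MEM
        cycles.append(ex)
    return cycles


program = [
    ("R1", ["R2"], True),         # LD   R1, 0(R2)
    ("R4", ["R1", "R5"], False),  # DSUB R4, R1, R5
    ("R6", ["R1", "R7"], False),  # AND  R6, R1, R7
    ("R8", ["R1", "R9"], False),  # OR   R8, R1, R9
]
print(ex_cycles(program))  # [3, 5, 6, 7]
```

The result matches the bottom table: DSUB executes in cycle 5 instead of 4, and AND and OR are each pushed back by the same single cycle.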


Control Hazards

• A control hazard happens when we need to find the destination of a branch and can’t fetch any new instructions until we know that destination.
  – If instruction i is a taken branch, then the PC is normally not changed until the end of ID
• Control hazards can cause a greater performance loss than data hazards do.


Control Hazard on Branches: Three-Cycle Stall

[Figure: pipeline diagram over clock cycles CC1 to CC9 (Ifetch, Reg, ALU, DMem, Reg) for the following instructions, listed with their addresses:]

12: beq r1,r3,36
16: and r2,r3,r5
20: or r6,r1,r7
24: add r8,r1,r9
36: xor r10,r1,r11


Branch Stall Impact

• If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles, then the new CPI = 1.9!
• Two-part solution to this dramatic increase:
  – Determine whether the branch is taken or not sooner, AND
  – Compute the target address earlier
• MIPS branch tests whether a register = 0 or ≠ 0
• MIPS solution:
  – Move the zero test to the ID stage
  – Add an adder to calculate the target address in the ID stage
  – 1 clock cycle penalty for a branch versus 3
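The CPI figures on this slide follow from one line of arithmetic, sketched below with an illustrative helper name. It assumes every branch pays the full penalty.

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, branch_penalty):
    """CPI when every branch adds a fixed number of stall cycles."""
    return base_cpi + branch_fraction * branch_penalty


print(round(cpi_with_branch_stalls(1.0, 0.30, 3), 2))  # 1.9 with a 3-cycle stall
print(round(cpi_with_branch_stalls(1.0, 0.30, 1), 2))  # 1.3 with a 1-cycle penalty
```

Moving the zero test and the target adder into ID cuts the branch overhead from 0.9 to 0.3 cycles per instruction in this example.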


The Pipeline of 1-Cycle Stall for Branch<br />

38


Four Solutions to Branch Hazards

#1: Stall until branch direction is clear
– Simple for both software and hardware
– Branch penalty is fixed (1-cycle penalty for revised MIPS)

[Figure: pipeline timing (IF, ID, EX, MEM, WB) for the branch instruction, the branch successor, branch successor+1, and branch successor+2.]


Four Solutions to Branch Hazards

#2: Predict Branch Not Taken
– Continue to fetch instructions as if the branch were a normal instruction.
– If the branch is taken, turn the fetched instruction into a no-op and restart the fetch at the target address.

Untaken branch instr.   IF    ID    EX    MEM   WB
Branch successor              IF    ID    EX    MEM   WB
Branch successor+1                  IF    ID    EX    MEM   WB
Branch successor+2                        IF    ID    EX    MEM   WB
Branch successor+3                              IF    ID    EX    MEM   WB

Taken branch instr.     IF    ID    EX    MEM   WB
Branch successor              IF    idle  idle
Branch target                       IF    ID    EX    MEM   WB
Branch successor+1                        IF    ID    EX    MEM   WB
Branch successor+2                              IF    ID    EX    MEM   WB
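With predict-not-taken, only taken branches pay the penalty, so the effective CPI depends on how often branches are taken. The sketch below illustrates this; the 60% taken rate is a hypothetical figure chosen for the example, not a number from the lecture.

```python
def cpi_predict_not_taken(base_cpi, branch_fraction, taken_fraction, taken_penalty):
    """CPI when untaken branches cost nothing and taken branches stall."""
    return base_cpi + branch_fraction * taken_fraction * taken_penalty


# 30% branches, 1-cycle penalty, assumed 60% of branches taken (hypothetical)
print(round(cpi_predict_not_taken(1.0, 0.30, 0.60, 1), 2))  # 1.18
# If every branch were taken, this is the same as always stalling one cycle
print(round(cpi_predict_not_taken(1.0, 0.30, 1.00, 1), 2))  # 1.3
```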


Four Solutions to Branch Hazards

#3: Predict Branch Taken
– As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target.
– But in MIPS the target address is not calculated before the branch outcome is known
  • MIPS still incurs a 1-cycle branch penalty
  • Useful for other machines on which the target address is known before the branch outcome


Four Solutions to Branch Hazards

#4: Delayed Branch
– The execution cycle with a branch delay of one is:
    branch instruction
    sequential successor 1
    branch target if taken
– The sequential successor is in the branch delay slot.
– The instruction in the branch delay slot is executed whether or not the branch is taken (for zero cycle penalty)

• Where to get instructions to fill the branch delay slot?
  – From before the branch instruction
  – From the target address: only valuable when the branch is taken
  – From fall-through: only valuable when the branch is not taken
  – Canceling or nullifying branches allow more slots to be filled (nonzero cycle penalty; its value depends on the rate of correct prediction)
    – the delay-slot instruction is turned into a no-op if incorrectly predicted


Four Solutions to Branch Hazards<br />

43


<strong>Pipelining</strong> Introduction Summary<br />

• Just overlap tasks; this is easy if the tasks are independent
• Speedup vs. pipeline depth; if the ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock cycle}_{\text{unpipelined}}}{\text{Clock cycle}_{\text{pipelined}}}$$

• Hazards limit performance on computers:
  – Structural: need more hardware resources
  – Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  – Control: delayed branch, prediction
