Appendix A Pipelining: Basic and Intermediate Concepts
EEF011 Computer Architecture
Appendix A
Pipelining: Basic and Intermediate Concepts
吳俊興
Department of Computer Science and Information Engineering, National University of Kaohsiung
October 2004
Outline
• Basic concept of Pipelining
• The Basic Pipeline for MIPS
• The Major Hurdles of Pipelining – Pipeline Hazards
Laundry Example
What Is Pipelining?
• Ann, Betty, Cathy, Dave (A, B, C, D) each has one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
What Is Pipelining?
[Figure: sequential laundry timeline, 6 PM to midnight – tasks A, B, C, D each run wash (30), dry (40), fold (20) back to back]
Sequential laundry takes 6 hours for 4 loads
Want to reduce the time? – Pipelining!!!
What Is Pipelining?
[Figure: pipelined laundry timeline, 6 PM to about 9:30 PM – tasks A, B, C, D overlap; stage sequence 30, 40, 40, 40, 40, 20 minutes]
• Start work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
What Is Pipelining?
‣ Pipelining is an implementation technique whereby multiple instructions are overlapped in execution
‣ It takes advantage of parallelism that exists among instructions => instruction-level parallelism
‣ It is the key implementation technique used to make fast CPUs
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
  – Unbalanced lengths of pipe stages reduce the speedup
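The laundry arithmetic above can be checked with a short sketch (the function names here are ours, purely illustrative): sequential time is the per-load total times the number of loads, while a buffered pipeline finishes the first load at full latency and then completes one load per slowest-stage time.

```python
def sequential_time(stage_times, n_tasks):
    # Each task runs all stages to completion before the next starts.
    return n_tasks * sum(stage_times)

def pipelined_time(stage_times, n_tasks):
    # With buffering between stages, the slowest stage sets the rate:
    # the first task takes the full pipeline latency, and each later
    # task adds one slowest-stage time.
    return sum(stage_times) + (n_tasks - 1) * max(stage_times)

stages = [30, 40, 20]               # washer, dryer, folder (minutes)
print(sequential_time(stages, 4))   # 360 minutes = 6 hours
print(pipelined_time(stages, 4))    # 210 minutes = 3.5 hours
```

Note how the 40-minute dryer, not the 90-minute total, dominates the pipelined case: unbalanced stages reduce the speedup below the stage count, exactly as the last bullet says.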
MIPS Without Pipelining
‣ The execution of instructions is controlled by the CPU clock: one specific function in each clock cycle.
‣ Every MIPS instruction takes 5 clock cycles, one per stage of a five-stage structure.
‣ Several temporary registers are introduced to implement the 5-stage structure.
MIPS Functions
Only consider load/store, BEQZ, and integer ALU instructions.

IF – passed to next stage:
  IR ← Mem[PC]; NPC ← PC + 4
ID – passed to next stage:
  A ← Regs[rs]; B ← Regs[rt]; Imm ← sign-extended immediate field of IR
EX – passed to next stage:
  ALUOutput ← A op B, A + Imm, or NPC + (Imm << 2); cond ← (A == 0)
MEM – passed to next stage:
  LMD ← Mem[ALUOutput] or Mem[ALUOutput] ← B; if (cond) PC ← ALUOutput
WB:
  Regs[rd] ← ALUOutput, or Regs[rt] ← ALUOutput or LMD
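As a rough illustration of these stage-to-stage transfers, an ALU instruction can be walked through them in a toy model (this is not the real datapath: registers are a plain dict, and fetch/decode are abstracted away):

```python
def run_alu_instruction(regs, rs, rt, rd, op):
    """Walk one register-register ALU instruction through the stage
    transfers listed above, carrying values forward as the temporary
    registers would."""
    # IF: IR <- Mem[PC]; NPC <- PC + 4  (fetch is abstracted away here)
    # ID: read the register file into A and B
    A, B = regs[rs], regs[rt]
    # EX: ALUOutput <- A op B
    alu_output = op(A, B)
    # MEM: no memory access for an ALU instruction
    # WB: Regs[rd] <- ALUOutput
    regs[rd] = alu_output
    return regs

regs = {"r1": 0, "r2": 7, "r3": 5, "r4": 0}
run_alu_instruction(regs, "r2", "r3", "r1", lambda a, b: a + b)  # dadd r1,r2,r3
print(regs["r1"])  # 12
```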
The classic five-stage pipeline for MIPS
We can pipeline the execution with almost no changes by simply starting a new instruction on each clock cycle.
Each clock cycle becomes a pipe stage – a cycle in the pipeline – which results in the typical pipelined execution pattern.
Although each instruction takes 5 clock cycles to complete, the hardware initiates a new instruction during each clock cycle and executes some part of each of the five different instructions already in the pipeline.
It may be hard to believe that pipelining is as simple as this.
Instruction number   Clock number
                     1    2    3    4    5    6    7    8    9
Instruction i        IF   ID   EX   MEM  WB
Instruction i+1           IF   ID   EX   MEM  WB
Instruction i+2                IF   ID   EX   MEM  WB
Instruction i+3                     IF   ID   EX   MEM  WB
Instruction i+4                          IF   ID   EX   MEM  WB
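The staggered pattern in the table above can be generated mechanically; this small sketch (names are illustrative) emits one row per instruction, with its stages offset by one cycle:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_chart(n_instructions):
    """Return one row per instruction: which stage it occupies in each
    clock cycle, with each instruction starting one cycle later."""
    total_cycles = n_instructions - 1 + len(STAGES)
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage   # instruction i reaches stage s in cycle i+s
        rows.append(row)
    return rows

for row in pipeline_chart(5):
    print(" ".join(f"{cell:>3}" for cell in row))
```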
Figure A.2 The pipeline can be thought of as a series of data paths shifted in time
Simple MIPS Pipeline
The MIPS pipeline data path must deal with problems that pipelining introduces in a real implementation.
It is critical to ensure that instructions at different stages in the pipeline do not attempt to use the same hardware resources at the same time (in the same clock cycle) – e.g., perform different operations with the same functional unit, such as the ALU, in the same clock cycle.
Instruction and data memories are separated into different caches (IM/DM).
The register file is used in two stages: reading in ID and writing in WB. To handle a read and a write to the same register, we perform the register write in the first half of the clock cycle and the read in the second half.
Pipeline implementation for MIPS
In order to ensure that instructions in different stages of the pipeline do not interfere with each other, the data path is pipelined by adding a set of registers, one between each pair of pipe stages.
The registers serve to convey values and control information from one stage to the next.
Most of the data paths flow from left to right, which is from earlier in time to later.
The paths flowing from right to left (which carry the register write-back information and PC information on a branch) introduce complications into the pipeline.
Events on Pipe Stages of the MIPS Pipeline (Figure A.19)
Stage  Any instruction
IF     IF/ID.IR ← Mem[PC]; IF/ID.NPC, PC ← (if (EX/MEM.cond) {EX/MEM.ALUOutput} else {PC + 4})
ID     ID/EX.A ← Regs[IF/ID.IR[rs]]; ID/EX.B ← Regs[IF/ID.IR[rt]]; ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR; ID/EX.Imm ← sign-extend(IF/ID.IR[immediate field])
(The EX, MEM, and WB events depend on the instruction type; see Figure A.19.)
Basic Performance Issues for Pipelining
Example: Assume that an unpipelined processor has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution time will we gain from the pipeline implementation?
Solution:
Avg. instr. exec time unpipelined = Clock cycle time x Avg. CPI
= 1 ns x (40%x4 + 20%x4 + 40%x5) = 4.4 ns
In the ideal situation without any latency, the avg. CPI is just 1 cycle for all kinds of instructions, and the clock cycle time is equal to 1.0 ns + 0.2 ns (1.2 ns), so Avg. instr. exec time pipelined = 1.2 ns x 1 = 1.2 ns
Then, the speedup from pipelining is 4.4 ns / 1.2 ns, or 3.7 times.
What is the result if there is no overhead when implementing pipelining?
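The solution can be checked numerically; this sketch (the helper name is ours) reproduces the 4.4 ns average and both speedups, including the no-overhead follow-up question:

```python
def avg_exec_time_unpipelined(cycle_time_ns, mix):
    # mix: (frequency, cycles-per-instruction) pairs per instruction class
    return cycle_time_ns * sum(freq * cycles for freq, cycles in mix)

mix = [(0.40, 4), (0.20, 4), (0.40, 5)]            # ALU, branch, memory
unpipelined = avg_exec_time_unpipelined(1.0, mix)  # 4.4 ns
pipelined_with_overhead = (1.0 + 0.2) * 1          # CPI = 1, 1.2 ns clock
pipelined_no_overhead = 1.0 * 1                    # the follow-up question
print(round(unpipelined / pipelined_with_overhead, 1))  # 3.7
print(round(unpipelined / pipelined_no_overhead, 1))    # 4.4
```

Without the 0.2 ns clock overhead, the speedup equals the unpipelined CPI itself (4.4), since the pipelined machine then completes one instruction per 1 ns cycle.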
A.2 The Major Hurdle of Pipelining – Pipeline Hazards
‣ Limits to pipelining: there are situations, called hazards, that prevent the next instruction from executing during its designated clock cycle, thus reducing the performance from the ideal speedup. The three classes of hazards are:
  – Structural hazards: arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution – two different instructions use the same hardware in the same cycle.
  – Data hazards: arise when an instruction depends on the result of a prior instruction still in the pipeline (RAW, WAR, and WAW).
  – Control hazards: arise from the pipelining of branches and other instructions that change the PC.
‣ The common solution is to stall the pipeline until the hazard is cleared, i.e., inserting one or more “bubbles” into the pipeline.
Performance of Pipelining with Stalls
• The pipelined CPI:
  CPI pipelined = Ideal CPI + Pipeline stall cycles per instr.
                = 1 + Pipeline stall cycles per instr.
• Ignoring the cycle time overhead of pipelining, and assuming the stages are perfectly balanced (all occupy one clock cycle) and all instructions take the same number of cycles, we have the speedup from pipelining:
  Speedup = CPI unpipelined / CPI pipelined
          = CPI unpipelined / (1 + Pipeline stall cycles per instr.)
          = Pipeline depth / (1 + Pipeline stall cycles per instr.)
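As a quick sketch of this formula (the function name is illustrative), under the stated assumptions the unpipelined CPI equals the pipeline depth:

```python
def pipeline_speedup(pipeline_depth, stall_cycles_per_instr):
    # Speedup = Pipeline depth / (1 + Pipeline stall cycles per instr.),
    # assuming balanced stages and no clock overhead.
    return pipeline_depth / (1 + stall_cycles_per_instr)

print(pipeline_speedup(5, 0.0))   # 5.0: the ideal 5-stage case
print(pipeline_speedup(5, 1.0))   # 2.5: one stall per instruction halves it
```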
Structural Hazards
When two or more different instructions want to use the same hardware resource in the same cycle.
[Figure: pipeline diagram, Cycles 1-7 – Load followed by Instr 1-4, each flowing through Ifetch, Reg, ALU, DMem, Reg]
E.g., MEM uses the same memory port as IF, as shown in this slide: the Load’s DMem access in cycle 4 coincides with Instr 3’s Ifetch.
Solution: stall
Structural Hazards
This is another way of looking at the effect of a stall.
[Figure: pipeline diagram, Cycles 1-7 – Load, Instr 1, Instr 2, then a stall: bubbles fill the slot and Instr 3’s fetch is delayed by one cycle]
Structural Hazards
This is another way to represent the stall.
Dealing With Structural Hazards
• Stall
  – low cost, simple
  – increases CPI
  – use for rare cases, since stalling has a performance effect
• Replicate the resource
  – good performance
  – increases cost (+ maybe interconnect delay)
  – useful for cheap or divisible resources
  – e.g., we use separate instruction and data memories in the MIPS pipeline
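The memory-port conflict above follows a simple pattern: with IF in cycle i and MEM in cycle i+3, a memory instruction collides with the fetch issued three cycles later. A small sketch (assuming one instruction issued per cycle, a single memory port, and no other stalls):

```python
def count_mem_if_conflicts(instr_types):
    """Count collisions between a load/store's MEM stage (cycle i+3 for
    an instruction issued in cycle i) and the IF of the instruction
    issued three cycles later. Each would force a one-cycle stall."""
    conflicts = 0
    for i, kind in enumerate(instr_types):
        uses_mem = kind in ("load", "store")
        fetch_behind = i + 3 < len(instr_types)   # someone is in IF then
        if uses_mem and fetch_behind:
            conflicts += 1
    return conflicts

# A load followed by three ALU instructions: the load's MEM stage
# overlaps Instr 3's IF, as in the diagram above.
print(count_mem_if_conflicts(["load", "alu", "alu", "alu"]))  # 1
```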
Data Hazards
• Data hazards occur when the pipeline changes the order of read/write accesses to operands (registers) so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor.
• Where there’s real trouble is when we have:
    instruction A
    instruction B,
  and B manipulates (reads or writes) data before A does. This violates the order of the instructions, since the architecture implies that A completes entirely before B is executed.
Data Hazards
Read After Write (RAW): with execution order Instr I then Instr J, Instr J tries to read an operand before Instr I writes it
    I: dadd r1,r2,r3
    J: dsub r4,r1,r3
• Caused by a “dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
Data Hazards
Write After Read (WAR): with execution order Instr I then Instr J, Instr J tries to write an operand before Instr I reads it
  – Instr I gets the wrong operand
    I: dsub r4,r1,r3
    J: dadd r1,r2,r3
    K: mul r6,r1,r7
  – Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5
Data Hazards
Write After Write (WAW): with execution order Instr I then Instr J, Instr J tries to write an operand before Instr I writes it
  – Leaves the wrong result (Instr I’s, not Instr J’s)
    I: dsub r1,r4,r3
    J: dadd r1,r2,r3
    K: mul r6,r1,r7
  – Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
• We will see WAR and WAW in later, more complicated pipeline implementations
Solutions to Data Hazards
• Simple solution to RAW: hardware detects the RAW hazard and stalls until the result is written into the register
  + low cost to implement, simple
  – reduces the number of instructions executed per cycle
• Minimizing RAW stalls: forwarding (also called bypassing)
  – Key insight: the result is not really needed by the current instruction until after the previous instruction actually produces it.
  – The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs.
  – If the forwarding hardware detects that a previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
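The forwarding control logic described above can be sketched as follows (a simplified model: we track only the destination registers currently sitting in EX/MEM and MEM/WB, and prefer the younger EX/MEM result when both match):

```python
def forward_sources(ex_mem_dest, mem_wb_dest, rs, rt):
    """For each ALU source register, decide where its value should come
    from: the EX/MEM pipeline register, the MEM/WB pipeline register,
    or the register file. Dest arguments are None when the instruction
    in that pipeline register writes no register."""
    def pick(src_reg):
        if ex_mem_dest is not None and ex_mem_dest == src_reg:
            return "EX/MEM"      # most recent result wins
        if mem_wb_dest is not None and mem_wb_dest == src_reg:
            return "MEM/WB"
        return "REGFILE"
    return pick(rs), pick(rt)

# dadd r1,r2,r3 ; dsub r4,r1,r3 — while dsub is in EX, dadd's result
# sits in EX/MEM, so the first ALU input is forwarded.
print(forward_sources("r1", None, "r1", "r3"))  # ('EX/MEM', 'REGFILE')
```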
Data Hazards
[Figure: pipeline diagram, CC1-CC9 – dadd r1,r2,r3 followed by dsub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11]
The use of the result of the dadd instruction in the next two instructions causes a hazard, since the register is not written until after those instructions read it.
Forwarding to Avoid Data Hazards
Forwarding is the concept of making data available to the input of the ALU for subsequent instructions, even though the generating instruction hasn’t gotten to WB in order to write the memory or registers.
[Figure: the same dadd/dsub/and/or/xor sequence, CC1-CC9, with forwarding paths from the EX/MEM and MEM/WB registers back to the ALU inputs]
Data Hazards Requiring Stalls
[Figure: pipeline diagram, CC1-CC8 – LD R1,0(R2) followed by DSUB R4,R1,R6; AND R6,R1,R7; OR R8,R1,R9]
There are some instances where hazards occur even with forwarding, e.g., the data isn’t loaded until after the MEM stage.
Data Hazards Requiring Stalls
[Figure: the same sequence with a bubble inserted – DSUB, AND, and OR are each delayed by one cycle]
The stall is necessary in this case.
Another Representation of the Stall

LD   R1, 0(R2)     IF   ID   EX    MEM   WB
DSUB R4, R1, R5         IF   ID    EX    MEM   WB
AND  R6, R1, R7              IF    ID    EX    MEM   WB
OR   R8, R1, R9                    IF    ID    EX    MEM   WB

LD   R1, 0(R2)     IF   ID   EX    MEM   WB
DSUB R4, R1, R5         IF   ID    stall EX    MEM   WB
AND  R6, R1, R7              IF    stall ID    EX    MEM   WB
OR   R8, R1, R9                    stall IF    ID    EX    MEM   WB

In the top table, we can see why a stall is needed: the MEM cycle of the load produces a value that is needed in the EX cycle of the DSUB, which occurs at the same time. This problem is solved by inserting a stall, as shown in the bottom table.
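The load-use condition in these tables reduces to a one-line check; this sketch (the instruction records are an illustrative encoding, not a real ISA decoder) flags when the one-cycle stall is required:

```python
def needs_load_use_stall(prev, curr):
    """A one-cycle stall is required when the previous instruction is a
    load and the current instruction reads the loaded register: the
    data only becomes available after MEM, too late for EX even with
    forwarding."""
    return prev["op"] == "load" and prev["dest"] in curr["sources"]

ld   = {"op": "load", "dest": "R1", "sources": ["R2"]}
dsub = {"op": "alu",  "dest": "R4", "sources": ["R1", "R5"]}
and_ = {"op": "alu",  "dest": "R6", "sources": ["R1", "R7"]}
print(needs_load_use_stall(ld, dsub))    # True: stall one cycle
print(needs_load_use_stall(dsub, and_))  # False: ALU result can be forwarded
```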
Control Hazards
• A control hazard happens when we need to find the destination of a branch, and can’t fetch any new instructions until we know that destination.
  – If instruction i is a taken branch, then the PC is normally not changed until the end of ID
• Control hazards can cause a greater performance loss than data hazards do.
Control Hazard on Branches: Three-Cycle Stall
[Figure: pipeline diagram, CC1-CC9 – 12: beq r1,r3,36; 16: and r2,r3,r5; 20: or r6,r1,r7; 24: add r8,r1,r9; 36: xor r10,r1,r11 – the three instructions after the branch are fetched before the branch outcome is known]
Branch Stall Impact
• If CPI = 1 and 30% of instructions are branches with a 3-cycle stall => new CPI = 1.9!
• Two solutions to this dramatic increase:
  – Determine whether the branch is taken or not sooner, AND
  – Compute the target address earlier
• MIPS branches test whether a register = 0 or ≠ 0
• MIPS solution:
  – Move the zero test to the ID stage
  – Add an adder to calculate the target address in the ID stage
  – 1 clock cycle penalty for a branch versus 3
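The CPI arithmetic above is a one-liner; this sketch reproduces both the 3-cycle penalty and the improved 1-cycle penalty:

```python
def cpi_with_branch_stalls(base_cpi, branch_freq, branch_penalty):
    # Each branch adds `branch_penalty` stall cycles on average.
    return base_cpi + branch_freq * branch_penalty

print(cpi_with_branch_stalls(1.0, 0.30, 3))  # 1.9 with a 3-cycle stall
print(cpi_with_branch_stalls(1.0, 0.30, 1))  # 1.3 after moving the test to ID
```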
The Pipeline with a 1-Cycle Stall for Branches
Four Solutions to Branch Hazards
#1: Stall until the branch direction is clear
– Simple for both software and hardware
– Branch penalty is fixed (1-cycle penalty for the revised MIPS)

Branch instr.        IF   ID   EX   MEM  WB
Branch successor          IF   IF   ID   EX   MEM  WB
Branch successor+1                  IF   ID   EX   MEM  WB
Branch successor+2                       IF   ID   EX   MEM  WB
Four Solutions to Branch Hazards
#2: Predict Branch Not Taken
– Continue to fetch instructions as if the branch were a normal instruction.
– If the branch is taken, turn the fetched instruction into a no-op and restart the fetch at the target address.

Untaken branch instr.   IF   ID   EX   MEM  WB
Branch successor             IF   ID   EX   MEM  WB
Branch successor+1                IF   ID   EX   MEM  WB
Branch successor+2                     IF   ID   EX   MEM  WB
Branch successor+3                          IF   ID   EX   MEM  WB

Taken branch instr.     IF   ID   EX   MEM  WB
Branch successor             IF   idle idle idle idle
Branch target                     IF   ID   EX   MEM  WB
Branch target+1                        IF   ID   EX   MEM  WB
Branch target+2                             IF   ID   EX   MEM  WB
Four Solutions to Branch Hazards
#3: Predict Branch Taken
– As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target.
– But MIPS hasn’t calculated the target address before it knows the branch outcome
  • MIPS still incurs a 1-cycle branch penalty
  • Useful for other machines on which the target address is known before the branch outcome
Four Solutions to Branch Hazards
#4: Delayed Branch
– The execution cycle with a branch delay of one is:
    branch instruction
    sequential successor 1
    branch target if taken
– The sequential successor is in the branch delay slot.
– The instruction in the branch delay slot is executed whether or not the branch is taken (for a zero-cycle penalty)
• Where to get instructions to fill the branch delay slot?
  – From before the branch instruction
  – From the target address: only valuable when the branch is taken
  – From fall-through: only valuable when the branch is not taken
  – Canceling or nullifying branches allow more slots to be filled (nonzero cycle penalty; its value depends on the rate of correct prediction)
    • the delay-slot instruction is turned into a no-op if incorrectly predicted
Pipelining Introduction Summary
• Just overlap tasks; easy if the tasks are independent
• Speedup vs. pipeline depth; if the ideal CPI is 1, then:
  Speedup = (Pipeline depth / (1 + Pipeline stall CPI)) x (Clock cycle unpipelined / Clock cycle pipelined)
• Hazards limit performance on computers:
  – Structural: need more hardware resources
  – Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  – Control: delayed branch, prediction