Appendix A Pipelining: Basic and Intermediate Concepts
EEF011 Computer Architecture
Appendix A
Pipelining: Basic and Intermediate Concepts
吳俊興
Department of Computer Science and Information Engineering, National University of Kaohsiung
October 2004
Outline
• Basic concept of Pipelining
• The Basic Pipeline for MIPS
• The Major Hurdles of Pipelining – Pipeline Hazards
Laundry Example
What Is Pipelining?
• Ann, Betty, Cathy, Dave (A, B, C, D) each has one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
What Is Pipelining?
[Figure: sequential laundry timeline, 6 PM to midnight – tasks A, B, C, D each run wash (30), dry (40), fold (20) back to back]
Sequential laundry takes 6 hours for 4 loads
Want to reduce the time? – Pipelining!!!
What Is Pipelining?
[Figure: pipelined laundry timeline, 6 PM to about 9:30 PM – tasks A, B, C, D overlap; stage sequence 30, 40, 40, 40, 40, 20 minutes]
• Start work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads
What Is Pipelining?
‣ Pipelining is an implementation technique whereby multiple instructions are overlapped in execution
‣ It takes advantage of parallelism that exists among instructions => instruction-level parallelism
‣ It is the key implementation technique used to make fast CPUs
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
  – Unbalanced lengths of pipe stages reduce the speedup
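The laundry arithmetic above can be checked with a short sketch (the function names here are ours, purely illustrative): sequential time is the per-load total times the number of loads, while a buffered pipeline finishes the first load at full latency and then completes one load per slowest-stage time.

```python
def sequential_time(stage_times, n_tasks):
    # Each task runs all stages to completion before the next starts.
    return n_tasks * sum(stage_times)

def pipelined_time(stage_times, n_tasks):
    # With buffering between stages, the slowest stage sets the rate:
    # the first task takes the full pipeline latency, and each later
    # task adds one slowest-stage time.
    return sum(stage_times) + (n_tasks - 1) * max(stage_times)

stages = [30, 40, 20]               # washer, dryer, folder (minutes)
print(sequential_time(stages, 4))   # 360 minutes = 6 hours
print(pipelined_time(stages, 4))    # 210 minutes = 3.5 hours
```

Note how the 40-minute dryer, not the 90-minute total, dominates the pipelined case: unbalanced stages reduce the speedup below the stage count, exactly as the last bullet says.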
MIPS Without Pipelining
‣ The execution of instructions is controlled by the CPU clock: one specific function in each clock cycle.
‣ Every MIPS instruction takes 5 clock cycles, one per stage of a five-stage structure.
‣ Several temporary registers are introduced to implement the 5-stage structure.
MIPS Functions
Only consider load/store, BEQZ, and integer ALU instructions.

IF – passed to next stage:
  IR ← Mem[PC]; NPC ← PC + 4
ID – passed to next stage:
  A ← Regs[rs]; B ← Regs[rt]; Imm ← sign-extended immediate field of IR
EX – passed to next stage:
  ALUOutput ← A op B, A + Imm, or NPC + (Imm << 2); cond ← (A == 0)
MEM – passed to next stage:
  LMD ← Mem[ALUOutput] or Mem[ALUOutput] ← B; if (cond) PC ← ALUOutput
WB:
  Regs[rd] ← ALUOutput, or Regs[rt] ← ALUOutput or LMD
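As a rough illustration of these stage-to-stage transfers, an ALU instruction can be walked through them in a toy model (this is not the real datapath: registers are a plain dict, and fetch/decode are abstracted away):

```python
def run_alu_instruction(regs, rs, rt, rd, op):
    """Walk one register-register ALU instruction through the stage
    transfers listed above, carrying values forward as the temporary
    registers would."""
    # IF: IR <- Mem[PC]; NPC <- PC + 4  (fetch is abstracted away here)
    # ID: read the register file into A and B
    A, B = regs[rs], regs[rt]
    # EX: ALUOutput <- A op B
    alu_output = op(A, B)
    # MEM: no memory access for an ALU instruction
    # WB: Regs[rd] <- ALUOutput
    regs[rd] = alu_output
    return regs

regs = {"r1": 0, "r2": 7, "r3": 5, "r4": 0}
run_alu_instruction(regs, "r2", "r3", "r1", lambda a, b: a + b)  # dadd r1,r2,r3
print(regs["r1"])  # 12
```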
The classic five-stage pipeline for MIPS
We can pipeline the execution with almost no changes by simply starting a new instruction on each clock cycle.
Each clock cycle becomes a pipe stage – a cycle in the pipeline – which results in the typical pipelined execution pattern.
Although each instruction takes 5 clock cycles to complete, the hardware initiates a new instruction during each clock cycle and executes some part of each of the five different instructions already in the pipeline.
It may be hard to believe that pipelining is as simple as this.
Instruction number   Clock number
                     1    2    3    4    5    6    7    8    9
Instruction i        IF   ID   EX   MEM  WB
Instruction i+1           IF   ID   EX   MEM  WB
Instruction i+2                IF   ID   EX   MEM  WB
Instruction i+3                     IF   ID   EX   MEM  WB
Instruction i+4                          IF   ID   EX   MEM  WB
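The staggered pattern in the table above can be generated mechanically; this small sketch (names are illustrative) emits one row per instruction, with its stages offset by one cycle:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_chart(n_instructions):
    """Return one row per instruction: which stage it occupies in each
    clock cycle, with each instruction starting one cycle later."""
    total_cycles = n_instructions - 1 + len(STAGES)
    rows = []
    for i in range(n_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage   # instruction i reaches stage s in cycle i+s
        rows.append(row)
    return rows

for row in pipeline_chart(5):
    print(" ".join(f"{cell:>3}" for cell in row))
```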
Figure A.2 The pipeline can be thought of as a series of data paths shifted in time
Simple MIPS Pipeline
The MIPS pipeline data path must deal with problems that pipelining introduces in a real implementation.
It is critical to ensure that instructions at different stages in the pipeline do not attempt to use the same hardware resources at the same time (in the same clock cycle) – e.g., perform different operations with the same functional unit, such as the ALU, in the same clock cycle.
Instruction and data memories are separated into different caches (IM/DM).
The register file is used in two stages: reading in ID and writing in WB. To handle a read and a write to the same register, we perform the register write in the first half of the clock cycle and the read in the second half.
Pipeline implementation for MIPS
In order to ensure that instructions in different stages of the pipeline do not interfere with each other, the data path is pipelined by adding a set of registers, one between each pair of pipe stages.
The registers serve to convey values and control information from one stage to the next.
Most of the data paths flow from left to right, which is from earlier in time to later.
The paths flowing from right to left (which carry the register write-back information and PC information on a branch) introduce complications into the pipeline.
Events on Pipe Stages of the MIPS Pipeline (Figure A.19)
Stage  Any instruction
IF     IF/ID.IR ← Mem[PC]; IF/ID.NPC, PC ← (if (EX/MEM.cond) {EX/MEM.ALUOutput} else {PC + 4})
ID     ID/EX.A ← Regs[IF/ID.IR[rs]]; ID/EX.B ← Regs[IF/ID.IR[rt]]; ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR; ID/EX.Imm ← sign-extend(IF/ID.IR[immediate field])
(The EX, MEM, and WB events depend on the instruction type; see Figure A.19.)
Basic Performance Issues for Pipelining
Example: Assume that an unpipelined processor has a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and 5 cycles for memory operations. Assume that the relative frequencies of these operations are 40%, 20%, and 40%, respectively. Suppose that, due to clock skew and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much speedup in the instruction execution time will we gain from the pipeline implementation?
Solution:
Avg. instr. exec time unpipelined = Clock cycle time x Avg. CPI
= 1 ns x (40%x4 + 20%x4 + 40%x5) = 4.4 ns
In the ideal situation without any latency, the avg. CPI is just 1 cycle for all kinds of instructions, and the clock cycle time is equal to 1.0 ns + 0.2 ns (1.2 ns), so Avg. instr. exec time pipelined = 1.2 ns x 1 = 1.2 ns
Then, the speedup from pipelining is 4.4 ns / 1.2 ns, or 3.7 times.
What is the result if there is no overhead when implementing pipelining?
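The solution can be checked numerically; this sketch (the helper name is ours) reproduces the 4.4 ns average and both speedups, including the no-overhead follow-up question:

```python
def avg_exec_time_unpipelined(cycle_time_ns, mix):
    # mix: (frequency, cycles-per-instruction) pairs per instruction class
    return cycle_time_ns * sum(freq * cycles for freq, cycles in mix)

mix = [(0.40, 4), (0.20, 4), (0.40, 5)]            # ALU, branch, memory
unpipelined = avg_exec_time_unpipelined(1.0, mix)  # 4.4 ns
pipelined_with_overhead = (1.0 + 0.2) * 1          # CPI = 1, 1.2 ns clock
pipelined_no_overhead = 1.0 * 1                    # the follow-up question
print(round(unpipelined / pipelined_with_overhead, 1))  # 3.7
print(round(unpipelined / pipelined_no_overhead, 1))    # 4.4
```

Without the 0.2 ns clock overhead, the speedup equals the unpipelined CPI itself (4.4), since the pipelined machine then completes one instruction per 1 ns cycle.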
A.2 The Major Hurdle of Pipelining – Pipeline Hazards
‣ Limits to pipelining: there are situations, called hazards, that prevent the next instruction from executing during its designated clock cycle, thus reducing the performance from the ideal speedup. The three classes of hazards are:
  – Structural hazards: arise from resource conflicts when the hardware cannot support all possible combinations of instructions simultaneously in overlapped execution – two different instructions use the same hardware in the same cycle.
  – Data hazards: arise when an instruction depends on the result of a prior instruction still in the pipeline (RAW, WAR, and WAW).
  – Control hazards: arise from the pipelining of branches and other instructions that change the PC.
‣ The common solution is to stall the pipeline until the hazard is cleared, i.e., inserting one or more “bubbles” into the pipeline.
Performance of Pipelining with Stalls
• The pipelined CPI:
  CPI pipelined = Ideal CPI + Pipeline stall cycles per instr.
                = 1 + Pipeline stall cycles per instr.
• Ignoring the cycle time overhead of pipelining, and assuming the stages are perfectly balanced (all occupy one clock cycle) and all instructions take the same number of cycles, we have the speedup from pipelining:
  Speedup = CPI unpipelined / CPI pipelined
          = CPI unpipelined / (1 + Pipeline stall cycles per instr.)
          = Pipeline depth / (1 + Pipeline stall cycles per instr.)
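As a quick sketch of this formula (the function name is illustrative), under the stated assumptions the unpipelined CPI equals the pipeline depth:

```python
def pipeline_speedup(pipeline_depth, stall_cycles_per_instr):
    # Speedup = Pipeline depth / (1 + Pipeline stall cycles per instr.),
    # assuming balanced stages and no clock overhead.
    return pipeline_depth / (1 + stall_cycles_per_instr)

print(pipeline_speedup(5, 0.0))   # 5.0: the ideal 5-stage case
print(pipeline_speedup(5, 1.0))   # 2.5: one stall per instruction halves it
```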
Structural Hazards
When two or more different instructions want to use the same hardware resource in the same cycle.
[Figure: pipeline diagram, Cycles 1-7 – Load followed by Instr 1-4, each flowing through Ifetch, Reg, ALU, DMem, Reg]
E.g., MEM uses the same memory port as IF, as shown in this slide: the Load’s DMem access in cycle 4 coincides with Instr 3’s Ifetch.
Solution: stall
Structural Hazards
This is another way of looking at the effect of a stall.
[Figure: pipeline diagram, Cycles 1-7 – Load, Instr 1, Instr 2, then a stall: bubbles fill the slot and Instr 3’s fetch is delayed by one cycle]
Structural Hazards
This is another way to represent the stall.
Dealing With Structural Hazards
• Stall
  – low cost, simple
  – increases CPI
  – use for rare cases, since stalling has a performance effect
• Replicate the resource
  – good performance
  – increases cost (+ maybe interconnect delay)
  – useful for cheap or divisible resources
  – e.g., we use separate instruction and data memories in the MIPS pipeline
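The memory-port conflict above follows a simple pattern: with IF in cycle i and MEM in cycle i+3, a memory instruction collides with the fetch issued three cycles later. A small sketch (assuming one instruction issued per cycle, a single memory port, and no other stalls):

```python
def count_mem_if_conflicts(instr_types):
    """Count collisions between a load/store's MEM stage (cycle i+3 for
    an instruction issued in cycle i) and the IF of the instruction
    issued three cycles later. Each would force a one-cycle stall."""
    conflicts = 0
    for i, kind in enumerate(instr_types):
        uses_mem = kind in ("load", "store")
        fetch_behind = i + 3 < len(instr_types)   # someone is in IF then
        if uses_mem and fetch_behind:
            conflicts += 1
    return conflicts

# A load followed by three ALU instructions: the load's MEM stage
# overlaps Instr 3's IF, as in the diagram above.
print(count_mem_if_conflicts(["load", "alu", "alu", "alu"]))  # 1
```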
Data Hazards
• Data hazards occur when the pipeline changes the order of read/write accesses to operands (registers) so that the order differs from the order seen by sequentially executing instructions on an unpipelined processor.
• Where there’s real trouble is when we have:
    instruction A
    instruction B,
  and B manipulates (reads or writes) data before A does. This violates the order of the instructions, since the architecture implies that A completes entirely before B is executed.
Data Hazards
Read After Write (RAW): with execution order Instr I then Instr J, Instr J tries to read an operand before Instr I writes it
    I: dadd r1,r2,r3
    J: dsub r4,r1,r3
• Caused by a “dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
Data Hazards
Write After Read (WAR): with execution order Instr I then Instr J, Instr J tries to write an operand before Instr I reads it
  – Instr I gets the wrong operand
    I: dsub r4,r1,r3
    J: dadd r1,r2,r3
    K: mul r6,r1,r7
  – Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Reads are always in stage 2, and
  – Writes are always in stage 5
Data Hazards
Write After Write (WAW): with execution order Instr I then Instr J, Instr J tries to write an operand before Instr I writes it
  – Leaves the wrong result (Instr I’s, not Instr J’s)
    I: dsub r1,r4,r3
    J: dadd r1,r2,r3
    K: mul r6,r1,r7
  – Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
  – All instructions take 5 stages, and
  – Writes are always in stage 5
• We will see WAR and WAW in later, more complicated pipeline implementations
Solutions to Data Hazards
• Simple solution to RAW: hardware detects the RAW hazard and stalls until the result is written into the register
  + low cost to implement, simple
  – reduces the number of instructions executed per cycle
• Minimizing RAW stalls: forwarding (also called bypassing)
  – Key insight: the result is not really needed by the current instruction until after the previous instruction actually produces it.
  – The ALU result from both the EX/MEM and MEM/WB pipeline registers is always fed back to the ALU inputs.
  – If the forwarding hardware detects that a previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.
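The forwarding control logic described above can be sketched as follows (a simplified model: we track only the destination registers currently sitting in EX/MEM and MEM/WB, and prefer the younger EX/MEM result when both match):

```python
def forward_sources(ex_mem_dest, mem_wb_dest, rs, rt):
    """For each ALU source register, decide where its value should come
    from: the EX/MEM pipeline register, the MEM/WB pipeline register,
    or the register file. Dest arguments are None when the instruction
    in that pipeline register writes no register."""
    def pick(src_reg):
        if ex_mem_dest is not None and ex_mem_dest == src_reg:
            return "EX/MEM"      # most recent result wins
        if mem_wb_dest is not None and mem_wb_dest == src_reg:
            return "MEM/WB"
        return "REGFILE"
    return pick(rs), pick(rt)

# dadd r1,r2,r3 ; dsub r4,r1,r3 — while dsub is in EX, dadd's result
# sits in EX/MEM, so the first ALU input is forwarded.
print(forward_sources("r1", None, "r1", "r3"))  # ('EX/MEM', 'REGFILE')
```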
Data Hazards
[Figure: pipeline diagram, CC1-CC9 – dadd r1,r2,r3 followed by dsub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11]
The use of the result of the dadd instruction in the next two instructions causes a hazard, since the register is not written until after those instructions read it.
Forwarding to Avoid Data Hazards
Forwarding is the concept of making data available to the input of the ALU for subsequent instructions, even though the generating instruction hasn’t gotten to WB in order to write the memory or registers.
[Figure: the same dadd/dsub/and/or/xor sequence, CC1-CC9, with forwarding paths from the EX/MEM and MEM/WB registers back to the ALU inputs]
Data Hazards Requiring Stalls
[Figure: pipeline diagram, CC1-CC8 – LD R1,0(R2) followed by DSUB R4,R1,R6; AND R6,R1,R7; OR R8,R1,R9]
There are some instances where hazards occur even with forwarding, e.g., the data isn’t loaded until after the MEM stage.
Data Hazards Requiring Stalls
[Figure: the same sequence with a bubble inserted – DSUB, AND, and OR are each delayed by one cycle]
The stall is necessary in this case.
Another Representation of the Stall

LD   R1, 0(R2)     IF   ID   EX    MEM   WB
DSUB R4, R1, R5         IF   ID    EX    MEM   WB
AND  R6, R1, R7              IF    ID    EX    MEM   WB
OR   R8, R1, R9                    IF    ID    EX    MEM   WB

LD   R1, 0(R2)     IF   ID   EX    MEM   WB
DSUB R4, R1, R5         IF   ID    stall EX    MEM   WB
AND  R6, R1, R7              IF    stall ID    EX    MEM   WB
OR   R8, R1, R9                    stall IF    ID    EX    MEM   WB

In the top table, we can see why a stall is needed: the MEM cycle of the load produces a value that is needed in the EX cycle of the DSUB, which occurs at the same time. This problem is solved by inserting a stall, as shown in the bottom table.
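The load-use condition in these tables reduces to a one-line check; this sketch (the instruction records are an illustrative encoding, not a real ISA decoder) flags when the one-cycle stall is required:

```python
def needs_load_use_stall(prev, curr):
    """A one-cycle stall is required when the previous instruction is a
    load and the current instruction reads the loaded register: the
    data only becomes available after MEM, too late for EX even with
    forwarding."""
    return prev["op"] == "load" and prev["dest"] in curr["sources"]

ld   = {"op": "load", "dest": "R1", "sources": ["R2"]}
dsub = {"op": "alu",  "dest": "R4", "sources": ["R1", "R5"]}
and_ = {"op": "alu",  "dest": "R6", "sources": ["R1", "R7"]}
print(needs_load_use_stall(ld, dsub))    # True: stall one cycle
print(needs_load_use_stall(dsub, and_))  # False: ALU result can be forwarded
```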
Control Hazards
• A control hazard happens when we need to find the destination of a branch, and can’t fetch any new instructions until we know that destination.
  – If instruction i is a taken branch, then the PC is normally not changed until the end of ID
• Control hazards can cause a greater performance loss than data hazards do.
Control Hazard on Branches: Three-Cycle Stall
[Figure: pipeline diagram, CC1-CC9 – 12: beq r1,r3,36; 16: and r2,r3,r5; 20: or r6,r1,r7; 24: add r8,r1,r9; 36: xor r10,r1,r11 – the three instructions after the branch are fetched before the branch outcome is known]
Branch Stall Impact
• If CPI = 1 and 30% of instructions are branches with a 3-cycle stall => new CPI = 1.9!
• Two solutions to this dramatic increase:
  – Determine whether the branch is taken or not sooner, AND
  – Compute the target address earlier
• MIPS branches test whether a register = 0 or ≠ 0
• MIPS solution:
  – Move the zero test to the ID stage
  – Add an adder to calculate the target address in the ID stage
  – 1 clock cycle penalty for a branch versus 3
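The CPI arithmetic above is a one-liner; this sketch reproduces both the 3-cycle penalty and the improved 1-cycle penalty:

```python
def cpi_with_branch_stalls(base_cpi, branch_freq, branch_penalty):
    # Each branch adds `branch_penalty` stall cycles on average.
    return base_cpi + branch_freq * branch_penalty

print(cpi_with_branch_stalls(1.0, 0.30, 3))  # 1.9 with a 3-cycle stall
print(cpi_with_branch_stalls(1.0, 0.30, 1))  # 1.3 after moving the test to ID
```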
The Pipeline with a 1-Cycle Stall for Branches
Four Solutions to Branch Hazards
#1: Stall until the branch direction is clear
– Simple for both software and hardware
– Branch penalty is fixed (1-cycle penalty for the revised MIPS)

Branch instr.        IF   ID   EX   MEM  WB
Branch successor          IF   IF   ID   EX   MEM  WB
Branch successor+1                  IF   ID   EX   MEM  WB
Branch successor+2                       IF   ID   EX   MEM  WB
Four Solutions to Branch Hazards
#2: Predict Branch Not Taken
– Continue to fetch instructions as if the branch were a normal instruction.
– If the branch is taken, turn the fetched instruction into a no-op and restart the fetch at the target address.

Untaken branch instr.   IF   ID   EX   MEM  WB
Branch successor             IF   ID   EX   MEM  WB
Branch successor+1                IF   ID   EX   MEM  WB
Branch successor+2                     IF   ID   EX   MEM  WB
Branch successor+3                          IF   ID   EX   MEM  WB

Taken branch instr.     IF   ID   EX   MEM  WB
Branch successor             IF   idle idle idle idle
Branch target                     IF   ID   EX   MEM  WB
Branch target+1                        IF   ID   EX   MEM  WB
Branch target+2                             IF   ID   EX   MEM  WB
Four Solutions to Branch Hazards
#3: Predict Branch Taken
– As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target.
– But MIPS hasn’t calculated the target address before it knows the branch outcome
  • MIPS still incurs a 1-cycle branch penalty
  • Useful for other machines on which the target address is known before the branch outcome
Four Solutions to Branch Hazards
#4: Delayed Branch
– The execution cycle with a branch delay of one is:
    branch instruction
    sequential successor 1
    branch target if taken
– The sequential successor is in the branch delay slot.
– The instruction in the branch delay slot is executed whether or not the branch is taken (for a zero-cycle penalty)
• Where to get instructions to fill the branch delay slot?
  – From before the branch instruction
  – From the target address: only valuable when the branch is taken
  – From fall-through: only valuable when the branch is not taken
  – Canceling or nullifying branches allow more slots to be filled (nonzero cycle penalty; its value depends on the rate of correct prediction)
    • the delay-slot instruction is turned into a no-op if incorrectly predicted
Pipelining Introduction Summary
• Just overlap tasks; easy if the tasks are independent
• Speedup vs. pipeline depth; if the ideal CPI is 1, then:
  Speedup = (Pipeline depth / (1 + Pipeline stall CPI)) x (Clock cycle unpipelined / Clock cycle pipelined)
• Hazards limit performance on computers:
  – Structural: need more hardware resources
  – Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  – Control: delayed branch, prediction