
CPE432: Computer Design Fall 2010 

Homework 3; Due Thursday, December 16 

Solution 

Problem 1: Loop Unrolling [18 points] 

In this problem, we will use the pipeline shown in Figure A.31 on page A.50 of your book. Its characteristics are:

• If unspecified, its properties are like those in the MIPS pipeline.

• There is 1 integer functional unit, taking 1 cycle to perform integer addition (including effective address calculation for loads/stores), subtraction, logic operations and branch operations.

• There is 1 FP/integer multiplier, taking 8 cycles to perform multiplication. It is pipelined.

• There is 1 FP adder, taking 3 cycles to perform FP additions and subtractions. It is pipelined.

• There is 1 FP/integer divider, taking 24 cycles. It is NOT pipelined.

• There is full forwarding and bypassing, including forwarding from the end of an FU to the MEM stage for stores.

• Loads and stores complete in one cycle. That is, they spend one cycle in the MEM stage after the effective address calculation.

• There are as many registers, both FP and integer, as you need.

• There is one branch delay slot.

• While the hardware has full forwarding and bypassing, it is the responsibility of the compiler to schedule such that the operands of each instruction are available when needed by each instruction.

Loop: L.D     F4, 0(R1)
      MUL.D   F8, F4, F0
      L.D     F6, 0(R2)
      ADD.D   F10, F6, F2
      ADD.D   F12, F8, F10
      S.D     F12, 0(R3)
      DADDUI  R1, R1, 8
      DADDUI  R2, R2, 8
      DADDUI  R3, R3, 8
      DSUB    R5, R4, R1
      BNEZ    R5, Loop

Part A. [6 points] 

Consider the role of the compiler in scheduling the code. Rewrite this loop, but let every row take a cycle. If an instruction can’t be issued on a given cycle (because the current instruction has a dependency that will not be resolved in time), write STALL instead, and move on to the next cycle to see if it can be issued then. Assume that a NOP is scheduled in the branch delay slot (effectively stalling 1 cycle after the branch). Explain all stalls, but don’t reorder instructions. How many cycles elapse before the second iteration begins? Show your work.

Loop:
L.D     F4, 0(R1)
stall           RAW F4
MUL.D   F8, F4, F0
L.D     F6, 0(R2)
stall           RAW F6
ADD.D   F10, F6, F2
stall           RAW F8, F10
stall           RAW F8, F10
stall           RAW F8
stall           RAW F8
ADD.D   F12, F8, F10
stall           RAW F12
S.D     F12, 0(R3)
DADDUI  R1, R1, #8
DADDUI  R2, R2, #8
DADDUI  R3, R3, #8
DSUB    R5, R4, R1
stall           RAW R5
BNEZ    R5, Loop
NOP

20 cycles.
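The stall count above can be cross-checked with a small in-order issue model. The following Python sketch is only an illustration, not part of the original solution: the function names, the instruction "kinds", and the producer-to-consumer distances in `min_distance` are assumptions read off the stall pattern in the listing (for example, one stall after a load, two cycles from an FP add to a dependent store). It also assumes the loop body increments R2 and R3 with two further DADDUI instructions, as the 20-cycle total implies.

```python
def min_distance(producer_kind, consumer_kind):
    """Fewest cycles between issuing a producer and a dependent consumer.

    Only the producer/consumer pairs that occur in this loop are modelled;
    the numbers are assumptions taken from the stall pattern in the listing.
    """
    if producer_kind == "load":
        return 2                       # one stall after L.D
    if producer_kind == "fmul":
        return 8                       # 8-cycle pipelined multiplier
    if producer_kind == "fadd":
        # stores pick the value up in MEM, so they can follow one cycle sooner
        return 2 if consumer_kind == "store" else 3
    if producer_kind == "int":
        return 2 if consumer_kind == "branch" else 1
    return 1


def schedule(instrs):
    """Issue instructions in program order, one per cycle, stalling on RAW."""
    last_writer = {}                   # register -> (issue cycle, producer kind)
    cycle = 0
    issued = []
    for text, kind, dest, srcs in instrs:
        earliest = cycle + 1
        for reg in srcs:
            if reg in last_writer:
                when, producer_kind = last_writer[reg]
                earliest = max(earliest, when + min_distance(producer_kind, kind))
        cycle = earliest
        issued.append((cycle, text))
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return issued


# (text, kind, destination register, source registers)
LOOP = [
    ("L.D F4, 0(R1)",      "load",   "F4",  ["R1"]),
    ("MUL.D F8, F4, F0",   "fmul",   "F8",  ["F4", "F0"]),
    ("L.D F6, 0(R2)",      "load",   "F6",  ["R2"]),
    ("ADD.D F10, F6, F2",  "fadd",   "F10", ["F6", "F2"]),
    ("ADD.D F12, F8, F10", "fadd",   "F12", ["F8", "F10"]),
    ("S.D F12, 0(R3)",     "store",  None,  ["F12", "R3"]),
    ("DADDUI R1, R1, #8",  "int",    "R1",  ["R1"]),
    ("DADDUI R2, R2, #8",  "int",    "R2",  ["R2"]),
    ("DADDUI R3, R3, #8",  "int",    "R3",  ["R3"]),
    ("DSUB R5, R4, R1",    "int",    "R5",  ["R4", "R1"]),
    ("BNEZ R5, Loop",      "branch", None,  ["R5"]),
    ("NOP",                "nop",    None,  []),
]

if __name__ == "__main__":
    for cycle, text in schedule(LOOP):
        print(f"cycle {cycle:2d}: {text}")
```

Under these assumptions the NOP in the delay slot issues in cycle 20, matching the count above. Every cycle number the model skips corresponds to one stall row in the listing.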

Part B. [6 points] 

Now reschedule the loop. You can change immediate values and memory offsets. You can reorder instructions, but don’t change anything else. Show any stalls that remain. How many cycles elapse before the second iteration begins? Show your work.


L.D     F4, 0(R1)
L.D     F6, 0(R2)
MUL.D   F8, F4, F0
ADD.D   F10, F6, F2
DADDUI  R1, R1, #8
DADDUI  R2, R2, #8
DADDUI  R3, R3, #8
DSUB    R5, R4, R1
stall           RAW F8
stall           RAW F8
ADD.D   F12, F8, F10
BNEZ    R5, Loop
S.D     F12, -8(R3)

13 cycles.

Part C. [6 points] 

Now unroll and reschedule the loop the minimum number of times needed to eliminate all stalls. You can remove redundant instructions. How many times did you unroll the loop? How many cycles elapse before the next iteration of the loop begins? Don’t worry about clean-up code. Show your work.


L.D     F4, 0(R1)
L.D     F6, 0(R2)
MUL.D   F8, F4, F0
L.D     F14, 8(R1)
L.D     F16, 8(R2)
MUL.D   F18, F14, F0
ADD.D   F10, F6, F2
ADD.D   F20, F16, F2
DADDUI  R1, R1, #16
DADDUI  R2, R2, #16
DADDUI  R3, R3, #16
ADD.D   F12, F8, F10
DSUB    R5, R4, R1
ADD.D   F22, F18, F20
S.D     F12, -16(R3)
BNEZ    R5, Loop
S.D     F22, -8(R3)

17 cycles for 2 iterations, 8.5 cycles per iteration.
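To put the three answers side by side, the short Python sketch below computes cycles per original iteration and the speedup relative to Part A. It uses only the cycle counts stated in the solutions (20, 13, and 17 for two iterations). The dictionary keys and the `speedup` helper are illustrative names, not part of the assignment.

```python
# Cycles per pass and original iterations covered per pass, from the answers above.
RESULTS = {
    "Part A (no reordering)": (20, 1),
    "Part B (rescheduled)": (13, 1),
    "Part C (unrolled)": (17, 2),
}


def cycles_per_iteration(label):
    """Cycles spent per original loop iteration for the given schedule."""
    cycles, iterations = RESULTS[label]
    return cycles / iterations


def speedup(label, baseline="Part A (no reordering)"):
    """How many times faster a schedule is than the baseline, per iteration."""
    return cycles_per_iteration(baseline) / cycles_per_iteration(label)


if __name__ == "__main__":
    for label in RESULTS:
        print(f"{label}: {cycles_per_iteration(label):g} cycles/iteration, "
              f"speedup {speedup(label):.2f}x")
```

Rescheduling alone gives roughly a 1.54x improvement over Part A, and unrolling plus rescheduling gives roughly 2.35x.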

Problem 2: Tomasulo's algorithm [12 points]

This exercise examines Tomasulo’s algorithm on a simple loop operation. Consider the following code fragment:

LOOP: L.D     F2, 0(R1)
      L.D     F4, 8(R1)
      DIV.D   F6, F2, F4
      MUL.D   F8, F6, F6
      ADD.D   F6, F2, F4
      MUL.D   F10, F6, F6
      S.D     F8, 0(R1)
      S.D     F10, 8(R1)
      DADDI   R1, R1, 16
      BNEZ    R1, LOOP

1. The pipeline functional units are described by the following table:

FU type              Cycles in EX   # of FUs   # of Reservation Stations
Integer              1              1          5
FP add/subtract      4              1          4
FP multiply/divide   15             2          4

2. Functional units are NOT pipelined (i.e., if one instruction is using the functional unit, another instruction cannot enter it).

3. All stages except EX take one cycle to complete.

4. There is no forwarding between functional units. Both integer and floating point results are communicated through the CDB.

5. Memory accesses use the integer functional unit to perform effective address calculation. All loads and stores will access memory during the EX stage. Pipeline stage EX does both the effective address calculation and memory access for loads/stores.

6. There are unlimited load/store buffers and an infinite instruction queue.

7. Loads and stores take one cycle to execute. Loads and stores share a memory access unit.

8. If an instruction is in the WR stage in cycle x, then an instruction that is waiting on the same functional unit (due to a structural hazard) can begin execution in cycle x, unless it needs to read the CDB, in which case it can only start executing on cycle x + 1.

9. Only one instruction can write to the CDB in a clock cycle.

10. Branches and stores do not need the CDB since they don’t have a WR stage.

11. Whenever there is a conflict for a functional unit or the CDB, assume program order.

12. When an instruction is done executing in its functional unit and is waiting for the CDB, it is still occupying the functional unit and its reservation station (meaning no other instruction may enter).

13. Treat the BNEZ instruction as an Integer instruction. Assume the L.D instruction after the BNEZ can be issued the cycle after the BNEZ instruction is issued, due to branch prediction.

14. Initially, R1 < -16.

Fill in the execution profile for the first two iterations of the above code fragment in Table 2, including

• The reservation station used by each instruction. This should include both the functional unit type and the number of the reservation station. If multiple reservation stations of a particular type are available, associate early program order with lower cardinality.

• The cycles that each instruction occupies in the IS, EX, and WR stages.

• Comments to justify your answer such as type of hazards and the registers involved.

Instruction       Reservation Station   IS   EX      WR   Comments (if appropriate)
L.D F2, 0(R1)     Integer 1             1    2       3
L.D F4, 8(R1)     Integer 2             2    3       4
DIV.D F6,F2,F4    FP Mul/Div 1          3    5-19    20   RAW on F4
MUL.D F8,F6,F6    FP Mul/Div 2          4    26-40   41   RAW on F6, structural hazard on FU
ADD.D F6,F2,F4    FP Add 1              5    6-9     10
MUL.D F10,F6,F6   FP Mul/Div 3          6    11-25   26   RAW on F6
S.D F8, 0(R1)     Integer 1             7    42           RAW on F8
DADDI R1,R1,16    Integer 3             9    10      11
BNEZ R1, LOOP     Integer 4             10   12           RAW on R1
L.D F2, 0(R1)     Integer 3             11   13      14   Structural hazard on FU
L.D F4, 8(R1)     Integer 5             12   14      15   Structural hazard on FU
DIV.D F6,F2,F4    FP Mul/Div 4          13   20-34   35   RAW on F2 and F4, structural hazard on FU
MUL.D F8,F6,F6    FP Mul/Div 1          20   41-55   56   Structural hazard on reservation station + FU
ADD.D F6,F2,F4    FP Add 1              21   22-25   27   CDB conflict
MUL.D F10,F6,F6   FP Mul/Div 3          26   35-49   50   RAW on F6, structural hazard on reservation station + FU
DADDI R1,R1,16    Integer 4             29   30      31
BNEZ R1, LOOP     Integer 5             30   32           RAW on R1

Table 2. Execution profile using Tomasulo’s algorithm. 

Problem 3 [8 points] 

Part A [3 points] 

What technique does Tomasulo’s algorithm employ to eliminate WAR and WAW hazards? Why does it work? Why doesn’t it also eliminate RAW?

Tomasulo’s algorithm uses register renaming to avoid these hazards. It works because, once the destination registers are renamed, the two instructions that caused the hazard no longer access the same register, so the hazard disappears. It does not eliminate RAW dependencies because, no matter what the register is renamed to, the reading instruction still needs the result of the writing instruction before it can execute.
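The effect of renaming can be illustrated on the four FP instructions from Problem 2, where ADD.D reuses F6 as its destination. The Python sketch below is a simplified model, not Tomasulo's hardware: the tags T0, T1, ... are hypothetical stand-ins for reservation-station names, and `hazards` is a deliberately naive pairwise check that reports which hazard types appear anywhere in the sequence.

```python
from itertools import count


def rename(instrs):
    """Give every destination a fresh tag; sources read the latest tag."""
    tags = (f"T{i}" for i in count())
    latest = {}                        # architectural register -> current tag
    renamed = []
    for op, dest, srcs in instrs:
        new_srcs = [latest.get(reg, reg) for reg in srcs]
        tag = next(tags)
        latest[dest] = tag
        renamed.append((op, tag, new_srcs))
    return renamed


def hazards(instrs):
    """Return the set of hazard types between any earlier/later pair."""
    found = set()
    for i, (_, dest_early, srcs_early) in enumerate(instrs):
        for _, dest_late, srcs_late in instrs[i + 1:]:
            if dest_early in srcs_late:
                found.add("RAW")       # later instruction reads an earlier result
            if dest_late in srcs_early:
                found.add("WAR")       # later instruction overwrites an earlier source
            if dest_late == dest_early:
                found.add("WAW")       # both write the same register
    return found


# (opcode, destination, sources) from the Problem 2 loop body
CODE = [
    ("DIV.D", "F6",  ["F2", "F4"]),
    ("MUL.D", "F8",  ["F6", "F6"]),
    ("ADD.D", "F6",  ["F2", "F4"]),
    ("MUL.D", "F10", ["F6", "F6"]),
]

if __name__ == "__main__":
    print("before renaming:", sorted(hazards(CODE)))
    print("after renaming: ", sorted(hazards(rename(CODE))))
```

Before renaming, all three hazard types are present. After renaming, only RAW remains, because each MUL.D still has to wait for the tag that produces its operand.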

Part B [2 points]

What is the difference between reservation stations and reorder buffers?

Reorder buffers store the state of an instruction temporarily until it is time for that instruction to commit. Reservation stations store the instruction and its operands until it can execute.

Part C [3 points] 

Reservation stations and reorder buffers both have value fields to store the result of an instruction. Why do we still need this value field in the reservation station if it is available in the reorder buffer?

In the reorder buffer, a value is available only until its instruction commits. After the commit stage, the entry in the reorder buffer is freed, so an instruction that needed that value after the commit would not find it there.
