CPE432: Computer Design Fall 2010

Homework 3; Due Thursday, December 16

Solution

Problem 1: Loop Unrolling [18 points]

In this problem, we will use the pipeline shown in Figure A.31 on page A.50 of your book. Its characteristics are:

• If unspecified, its properties are like those in the MIPS pipeline.
• There is 1 integer functional unit, taking 1 cycle to perform integer addition (including effective address calculation for loads/stores), subtraction, logic operations and branch operations.
• There is 1 FP/integer multiplier, taking 8 cycles to perform multiplication. It is pipelined.
• There is 1 FP adder, taking 3 cycles to perform FP additions and subtractions. It is pipelined.
• There is 1 FP/integer divider, taking 24 cycles. It is NOT pipelined.
• There is full forwarding and bypassing, including forwarding from the end of an FU to the MEM stage for stores.
• Loads and stores complete in one cycle. That is, they spend one cycle in the MEM stage after the effective address calculation.
• There are as many registers, both FP and integer, as you need.
• There is one branch delay slot.
• While the hardware has full forwarding and bypassing, it is the responsibility of the compiler to schedule such that the operands of each instruction are available when needed by each instruction.

Loop:   L.D     F4, 0(R1)
        MUL.D   F8, F4, F0
        L.D     F6, 0(R2)
        ADD.D   F10, F6, F2
        ADD.D   F12, F8, F10
        S.D     F12, 0(R3)
        DADDUI  R1, R1, 8
        DADDUI  R2, R2, 8
        DADDUI  R3, R3, 8
        DSUB    R5, R4, R1
        BNEZ    R5, Loop

Part A. [6 points]

Consider the role of the compiler in scheduling the code. Rewrite this loop, but let every row take a cycle. If an instruction can’t be issued on a given cycle (because the current instruction has a dependency that will not be resolved in time), write STALL instead, and move on to the next cycle to see if it can be issued then. Assume that a NOP is scheduled in the branch delay slot (effectively stalling 1 cycle after the branch). Explain all stalls, but don’t reorder instructions. How many cycles elapse before the second iteration begins? Show your work.


Loop:   L.D     F4, 0(R1)
        stall   RAW F4
        MUL.D   F8, F4, F0
        L.D     F6, 0(R2)
        stall   RAW F6
        ADD.D   F10, F6, F2
        stall   RAW F8, F10
        stall   RAW F8, F10
        stall   RAW F8
        stall   RAW F8
        ADD.D   F12, F8, F10
        stall   RAW F12
        S.D     F12, 0(R3)
        DADDUI  R1, R1, #8
        DADDUI  R2, R2, #8
        DADDUI  R3, R3, #8
        DSUB    R5, R4, R1
        stall   RAW R5
        BNEZ    R5, Loop
        NOP

20 cycles.
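
The stall pattern above can be reproduced with a small in-order issue model. The sketch below is in Python rather than MIPS and is only an illustration. The `LATENCY` table restates the functional-unit latencies from the problem statement as issue-to-use distances. Two adjustments are assumptions inferred from the solution, not stated in the assignment: store data is needed one stage late (in MEM), and the branch condition is read one stage early (in ID). All names (`LATENCY`, `ready_row`, `schedule`, `PART_A`) are hypothetical.

```python
# Assumed model: rows from a producer's issue until a dependent
# instruction that needs the value at the start of EX may issue.
LATENCY = {"L.D": 2, "MUL.D": 8, "ADD.D": 3, "DADDUI": 1, "DSUB": 1}


def ready_row(producer_op, producer_row, consumer_op, is_store_data):
    """Earliest row at which the consumer may issue after the producer."""
    row = producer_row + LATENCY[producer_op]
    if is_store_data:
        row -= 1  # assumption: store data is forwarded to MEM, one stage after EX
    elif consumer_op == "BNEZ":
        row += 1  # assumption: the branch reads its register in ID, one stage before EX
    return row


def schedule(program):
    """Issue in program order, one row per cycle, inserting stalls for RAW hazards.

    program: list of (op, dest or None, [sources]); for S.D the first
    source is the stored value and the rest are address registers.
    """
    rows = []
    producers = {}  # register -> (op, issue row) of its latest writer
    for op, dest, srcs in program:
        earliest = len(rows) + 1
        for i, reg in enumerate(srcs):
            if reg in producers:
                p_op, p_row = producers[reg]
                store_data = op == "S.D" and i == 0
                earliest = max(earliest, ready_row(p_op, p_row, op, store_data))
        while len(rows) + 1 < earliest:
            rows.append("stall")
        rows.append(op)
        if dest is not None:
            producers[dest] = (op, len(rows))
    return rows


PART_A = [
    ("L.D", "F4", ["R1"]),
    ("MUL.D", "F8", ["F4", "F0"]),
    ("L.D", "F6", ["R2"]),
    ("ADD.D", "F10", ["F6", "F2"]),
    ("ADD.D", "F12", ["F8", "F10"]),
    ("S.D", None, ["F12", "R3"]),
    ("DADDUI", "R1", ["R1"]),
    ("DADDUI", "R2", ["R2"]),
    ("DADDUI", "R3", ["R3"]),
    ("DSUB", "R5", ["R4", "R1"]),
    ("BNEZ", None, ["R5"]),
    ("NOP", None, []),  # branch delay slot
]

rows = schedule(PART_A)
print(len(rows), "cycles,", rows.count("stall"), "stalls")
```

Under these assumptions the model yields the same 20 rows as the hand schedule, with 8 of them stalls.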

Part B. [6 points]

Now reschedule the loop. You can change immediate values and memory offsets. You can reorder instructions, but don’t change anything else. Show any stalls that remain. How many cycles elapse before the second iteration begins? Show your work.

Loop:   L.D     F4, 0(R1)
        L.D     F6, 0(R2)
        MUL.D   F8, F4, F0
        ADD.D   F10, F6, F2
        DADDUI  R1, R1, #8
        DADDUI  R2, R2, #8
        DADDUI  R3, R3, #8
        DSUB    R5, R4, R1
        stall   RAW F8
        stall   RAW F8
        ADD.D   F12, F8, F10
        BNEZ    R5, Loop
        S.D     F12, -8(R3)

13 cycles.
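
Because the store now sits in the branch delay slot, after `DADDUI R3, R3, #8`, its offset changes from 0 to -8. The following Python sketch checks that both orderings write to the same addresses. The function names and the starting address 1000 are made up for illustration.

```python
def original_store_addresses(r3, iterations):
    """S.D F12, 0(R3) executes before DADDUI R3, R3, #8."""
    addresses = []
    for _ in range(iterations):
        addresses.append(r3 + 0)   # store first
        r3 += 8                    # then advance the pointer
    return addresses


def rescheduled_store_addresses(r3, iterations):
    """DADDUI R3, R3, #8 executes first; the store then uses -8(R3)."""
    addresses = []
    for _ in range(iterations):
        r3 += 8                    # pointer already advanced
        addresses.append(r3 - 8)   # compensate with the -8 offset
    return addresses


print(original_store_addresses(1000, 4))
print(rescheduled_store_addresses(1000, 4))
```

Both calls list the same four addresses, which is why changing the offset is enough to keep the loop's behavior unchanged.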


Part C. [6 points]

Now unroll and reschedule the loop the minimum number of times needed to eliminate all stalls. You can remove redundant instructions. How many times did you unroll the loop? How many cycles elapse before the next iteration of the loop begins? Don’t worry about clean-up code. Show your work.

Loop:   L.D     F4, 0(R1)
        L.D     F6, 0(R2)
        MUL.D   F8, F4, F0
        L.D     F14, 8(R1)
        L.D     F16, 8(R2)
        MUL.D   F18, F14, F0
        ADD.D   F10, F6, F2
        ADD.D   F20, F16, F2
        DADDUI  R1, R1, #16
        DADDUI  R2, R2, #16
        DADDUI  R3, R3, #16
        ADD.D   F12, F8, F10
        DSUB    R5, R4, R1
        ADD.D   F22, F18, F20
        S.D     F12, -16(R3)
        BNEZ    R5, Loop
        S.D     F22, -8(R3)

17 cycles for 2 iterations, 8.5 cycles per iteration.
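
A stall-free schedule can be checked mechanically by confirming that every consumer sits far enough below its producer. The Python sketch below does that for the unrolled loop. It assumes a latency model read off the problem statement (load-to-use 2 rows, MUL.D 8, ADD.D 3, integer 1), plus two assumptions that are not spelled out in the assignment: store data may arrive one row later, and a branch needs its register one row earlier. The names `LATENCY`, `PART_C` and `raw_violations` are hypothetical.

```python
# Assumed issue-to-use distances, in rows (one row = one cycle).
LATENCY = {"L.D": 2, "MUL.D": 8, "ADD.D": 3, "DADDUI": 1, "DSUB": 1}

# (op, dest or None, [sources]); for S.D the first source is the stored value.
PART_C = [
    ("L.D", "F4", ["R1"]),
    ("L.D", "F6", ["R2"]),
    ("MUL.D", "F8", ["F4", "F0"]),
    ("L.D", "F14", ["R1"]),
    ("L.D", "F16", ["R2"]),
    ("MUL.D", "F18", ["F14", "F0"]),
    ("ADD.D", "F10", ["F6", "F2"]),
    ("ADD.D", "F20", ["F16", "F2"]),
    ("DADDUI", "R1", ["R1"]),
    ("DADDUI", "R2", ["R2"]),
    ("DADDUI", "R3", ["R3"]),
    ("ADD.D", "F12", ["F8", "F10"]),
    ("DSUB", "R5", ["R4", "R1"]),
    ("ADD.D", "F22", ["F18", "F20"]),
    ("S.D", None, ["F12", "R3"]),
    ("BNEZ", None, ["R5"]),
    ("S.D", None, ["F22", "R3"]),
]


def raw_violations(program):
    """Return (row, op, register, earliest legal row) for each operand used too early."""
    violations = []
    producers = {}  # register -> (op, row) of its latest writer
    for row, (op, dest, srcs) in enumerate(program, start=1):
        for i, reg in enumerate(srcs):
            if reg not in producers:
                continue  # value comes from before the loop body
            p_op, p_row = producers[reg]
            ready = p_row + LATENCY[p_op]
            if op == "S.D" and i == 0:
                ready -= 1  # assumption: store data is needed in MEM
            elif op == "BNEZ":
                ready += 1  # assumption: branch register is read in ID
            if row < ready:
                violations.append((row, op, reg, ready))
        if dest is not None:
            producers[dest] = (op, row)
    return violations


print(raw_violations(PART_C))       # empty list means no stall is required
print(len(PART_C) / 2)              # rows per original iteration
```

With this model the 17-row schedule reports no violations, consistent with 17 / 2 = 8.5 cycles per original iteration.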

Problem 2: Tomasulo's algorithm (12 points)

This exercise examines Tomasulo’s algorithm on a simple loop operation. Consider the following code fragment:

LOOP:   L.D     F2, 0(R1)
        L.D     F4, 8(R1)
        DIV.D   F6, F2, F4
        MUL.D   F8, F6, F6
        ADD.D   F6, F2, F4
        MUL.D   F10, F6, F6
        S.D     F8, 0(R1)
        S.D     F10, 8(R1)
        DADDI   R1, R1, 16
        BNEZ    R1, LOOP

1. The pipeline functional units are described by the following table:

FU type             Cycles in EX   # of FUs   # of Reservation Stations
Integer             1              1          5
FP add/subtract     4              1          4
FP multiply/divide  15             2          4

2. Functional units are NOT pipelined (i.e., if one instruction is using the functional unit, another instruction cannot enter it).

3. All stages except EX take one cycle to complete.

4. There is no forwarding between functional units. Both integer and floating point results are communicated through the CDB.

5. Memory accesses use the integer functional unit to perform effective address calculation. All loads and stores will access memory during the EX stage. Pipeline stage EX does both the effective address calculation and memory access for loads/stores.

6. There are unlimited load/store buffers and an infinite instruction queue.

7. Loads and stores take one cycle to execute. Loads and stores share a memory access unit.

8. If an instruction is in the WR stage in cycle x, then an instruction that is waiting on the same functional unit (due to a structural hazard) can begin execution in cycle x, unless it needs to read the CDB, in which case it can only start executing on cycle x + 1.

9. Only one instruction can write to the CDB in a clock cycle.

10. Branches and stores do not need the CDB since they don’t have a WR stage.

11. Whenever there is a conflict for a functional unit or the CDB, assume program order.

12. When an instruction is done executing in its functional unit and is waiting for the CDB, it is still occupying the functional unit and its reservation station (meaning no other instruction may enter).

13. Treat the BNEZ instruction as an Integer instruction. Assume the L.D instruction after the BNEZ can be issued the cycle after the BNEZ instruction is issued, due to branch prediction.

14. Initially, R1 < -16.

Fill in the execution profile for the first two iterations of the above code fragment in Table 2, including:

• The reservation station used by each instruction. This should include both the functional unit type and the number of the reservation station. If multiple reservation stations of a particular type are available, associate early program order with lower cardinality.
• The cycles that each instruction occupies in the IS, EX, and WR stages.
• Comments to justify your answer such as type of hazards and the registers involved.


Instruction        Reservation Station  IS   EX      WR   Comments (if appropriate)
L.D F2, 0(R1)      Integer 1            1    2       3
L.D F4, 8(R1)      Integer 2            2    3       4
DIV.D F6,F2,F4     FP Mul/Div 1         3    5-19    20   RAW on F4
MUL.D F8,F6,F6     FP Mul/Div 2         4    26-40   41   RAW on F6, Structural Hazard on FU
ADD.D F6,F2,F4     FP Add 1             5    6-9     10
MUL.D F10,F6,F6    FP Mul/Div 3         6    11-25   26   RAW on F6
S.D F8, 0(R1)      Integer 1            7    42      -    RAW on F8
S.D F10, 8(R1)     Integer 2            8    27      -    RAW on F10
DADDI R1,R1,16     Integer 3            9    10      11
BNEZ R1, LOOP      Integer 4            10   12      -    RAW on R1
L.D F2, 0(R1)      Integer 3            11   13      14   Structural Hazard on FU
L.D F4, 8(R1)      Integer 5            12   14      15   Structural Hazard on FU
DIV.D F6,F2,F4     FP Mul/Div 4         13   20-34   35   RAW on F2 and F4, Str. Haz. on FU
MUL.D F8,F6,F6     FP Mul/Div 1         20   41-55   56   Str. Haz. on Reservation Station + FU
ADD.D F6,F2,F4     FP Add 1             21   22-25   27   CDB Conflict
MUL.D F10,F6,F6    FP Mul/Div 3         26   35-49   50   RAW on F6, Str. Haz. on Res. Sta. + FU
S.D F8, 0(R1)      Integer 3            27   57      -    RAW on F8
S.D F10, 8(R1)     Integer 2            28   51      -    RAW on F10
DADDI R1,R1,16     Integer 4            29   30      31
BNEZ R1, LOOP      Integer 5            30   32      -    RAW on R1

Table 2. Execution profile using Tomasulo’s algorithm.
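
Two of the rules can be checked against the filled-in table: rule 9 (one CDB write per cycle) and the functional unit counts combined with rules 8 and 12 (a unit stays occupied until the cycle before WR). The Python sketch below transcribes the EX and WR columns of Table 2 and tests only those two properties. It is not a Tomasulo simulator, and the labels and function names are invented for this illustration.

```python
from collections import Counter

# (label, unit type, first EX cycle, last EX cycle, WR cycle or None), from Table 2.
PROFILE = [
    ("L.D F2 #1", "int", 2, 2, 3),
    ("L.D F4 #1", "int", 3, 3, 4),
    ("DIV.D #1", "muldiv", 5, 19, 20),
    ("MUL.D F8 #1", "muldiv", 26, 40, 41),
    ("ADD.D #1", "add", 6, 9, 10),
    ("MUL.D F10 #1", "muldiv", 11, 25, 26),
    ("S.D F8 #1", "int", 42, 42, None),
    ("S.D F10 #1", "int", 27, 27, None),
    ("DADDI #1", "int", 10, 10, 11),
    ("BNEZ #1", "int", 12, 12, None),
    ("L.D F2 #2", "int", 13, 13, 14),
    ("L.D F4 #2", "int", 14, 14, 15),
    ("DIV.D #2", "muldiv", 20, 34, 35),
    ("MUL.D F8 #2", "muldiv", 41, 55, 56),
    ("ADD.D #2", "add", 22, 25, 27),
    ("MUL.D F10 #2", "muldiv", 35, 49, 50),
    ("S.D F8 #2", "int", 57, 57, None),
    ("S.D F10 #2", "int", 51, 51, None),
    ("DADDI #2", "int", 30, 30, 31),
    ("BNEZ #2", "int", 32, 32, None),
]

# Number of functional units of each type, from the table in item 1.
UNITS = {"int": 1, "add": 1, "muldiv": 2}


def cdb_conflicts(profile):
    """WR cycles claimed by more than one instruction (rule 9 forbids any)."""
    counts = Counter(entry[4] for entry in profile if entry[4] is not None)
    return sorted(cycle for cycle, n in counts.items() if n > 1)


def peak_unit_usage(profile):
    """Largest number of simultaneously occupied units of each type.

    A unit is held from the first EX cycle through the cycle before WR
    (rules 8 and 12); stores and branches hold it only during EX.
    """
    peaks = {}
    for unit in UNITS:
        busy = Counter()
        for _, kind, start, end, wr in profile:
            if kind != unit:
                continue
            last = end if wr is None else wr - 1
            for cycle in range(start, last + 1):
                busy[cycle] += 1
        peaks[unit] = max(busy.values())
    return peaks


print(cdb_conflicts(PROFILE))
print(peak_unit_usage(PROFILE))
```

For the table as given, no WR cycle is shared and the peak occupancy never exceeds the number of units of each type.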

Problem 3 [8 points]

Part A [3 points]

What technique does Tomasulo’s employ to eliminate WAR and WAW hazards? Why does it work? Why doesn’t it also eliminate RAW?

Tomasulo’s uses register renaming to avoid these hazards. It works because, by changing the destination registers, the 2 instructions which cause the hazard no longer need to access the same register, thus eliminating the hazards. It doesn’t work on RAW dependencies because no matter what we rename the register to, the ’reading’ instruction still needs the result from the ’writing’ instruction in order to execute.
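
The effect of renaming can be seen on the four FP instructions from Problem 2, where ADD.D F6 has a WAW hazard with DIV.D F6 and a WAR hazard with MUL.D F8, F6, F6. The Python sketch below is a simplified illustration, not Tomasulo's actual bookkeeping: the tags `T1`, `T2`, ... stand in for reservation station names, and the function `rename` is hypothetical.

```python
from itertools import count


def rename(program):
    """Give every destination a fresh tag; sources read the latest tag of their register."""
    fresh = count(1)
    latest = {}   # architectural register -> tag of its most recent writer
    renamed = []
    for op, dest, srcs in program:
        # A source keeps pointing at its producer's tag, so RAW links survive.
        new_srcs = [latest.get(reg, reg) for reg in srcs]
        tag = f"T{next(fresh)}"
        latest[dest] = tag   # later readers of dest see this tag, not older ones
        renamed.append((op, tag, new_srcs))
    return renamed


PROGRAM = [
    ("DIV.D", "F6", ["F2", "F4"]),
    ("MUL.D", "F8", ["F6", "F6"]),
    ("ADD.D", "F6", ["F2", "F4"]),   # WAW with DIV.D, WAR with the first MUL.D
    ("MUL.D", "F10", ["F6", "F6"]),
]

for op, dest, srcs in rename(PROGRAM):
    print(op, dest, srcs)
```

After renaming, DIV.D and ADD.D write different tags, so the WAW and WAR conflicts on F6 disappear. Each MUL.D still names its producer's tag as a source, which is the RAW dependence that renaming cannot remove.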

Part B [2 points]

What is the difference between reservation stations and reorder buffers?

Reorder buffers store the state of an instruction temporarily until it is time for that instruction to commit. Reservation stations store the instruction and its operands until it can execute.

Part C [3 points]

Reservation stations and reorder buffers both have the value fields to store the result of an instruction. Why do we still need this value field in the reservation station if it is available in the reorder buffer?

In the reorder buffer, a value is only available until its instruction commits. After the commit stage, the entry in the reorder buffer is freed, so if an instruction needed that value after the commit, the value would not be there.
