CPE432: Computer Design Fall 2010

Homework 1 Solution

Assigned: October 10

Due in class: October 17

Total points: 40

1. Amdahl's law [8 points]

Three enhancements with the following speedups are proposed for a new architecture:

Speedup 1 = 30

Speedup 2 = 20

Speedup 3 = 10

Only one enhancement is usable at a time.

a) [4 points] If enhancements 1 and 2 are each usable for 30% of the time, what fraction of the time must enhancement 3 be used to achieve an overall speedup of 10?

Solution:

To solve this problem, we first need to develop a generalized form of Amdahl's Law that can handle multiple enhancements when only one enhancement is usable at a time. We simply change the terms involving the fraction of time an enhancement can be used into summations:

Speedup = [ 1 - (FE1 + FE2 + FE3) + (FE1/SE1 + FE2/SE2 + FE3/SE3) ]^-1

If we plug in the numbers, we get:

10 = [ 1 - (0.30 + 0.30 + FE3) + (0.30/30 + 0.30/20 + FE3/10) ]^-1

FE3 ≈ 0.36

Therefore, the third enhancement must be usable in the enhanced system about 36% of the time to achieve an overall speedup of 10.
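
As a quick numerical check of the algebra above (a small Python sketch, not part of the original solution):

# Solve 10 = [ 1 - (FE1 + FE2 + FE3) + (FE1/SE1 + FE2/SE2 + FE3/SE3) ]^-1 for FE3.
SE1, SE2, SE3 = 30, 20, 10
FE1, FE2 = 0.30, 0.30
target = 10

# Rearranging: 1/target = (1 - FE1 - FE2 + FE1/SE1 + FE2/SE2) - FE3 * (1 - 1/SE3)
base = 1 - FE1 - FE2 + FE1 / SE1 + FE2 / SE2
FE3 = (base - 1 / target) / (1 - 1 / SE3)
print(round(FE3, 4))  # -> 0.3611, i.e. enhancement 3 must be usable about 36% of the time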

Grading:

3 points for correctly setting up the equation.

1 point for using the correct values in the equation and getting the final answer.

b) [4 points] Assume that for some benchmark, the fraction of use is 15% for each of enhancements 1 and 2 and 70% for enhancement 3. We want to maximize performance. If only one enhancement can be implemented, which should it be? If two enhancements can be implemented, which should be chosen?


Solution:

Here we will again use Amdahl's law to compute speedups.

Speedup for one enhancement only = [ 1 - FE1 + FE1/SE1 ]^-1

Speedup for two enhancements = [ 1 - (FE1 + FE2) + (FE1/SE1 + FE2/SE2) ]^-1

If we plug in the numbers, we get:

Speedup 1 = (1 - 0.15 + 0.15/30)^-1 = 1.169

Speedup 2 = (1 - 0.15 + 0.15/20)^-1 = 1.166

Speedup 3 = (1 - 0.70 + 0.70/10)^-1 = 2.703

Therefore, if we are allowed to select a single enhancement, we would choose E3.

Speedup 12 = [ (1 - 0.15 - 0.15) + (0.15/30 + 0.15/20) ]^-1 = 1.4035

Speedup 13 = [ (1 - 0.15 - 0.70) + (0.15/30 + 0.70/10) ]^-1 = 4.4444

Speedup 23 = [ (1 - 0.15 - 0.70) + (0.15/20 + 0.70/10) ]^-1 = 4.3956

Therefore, if two enhancements can be implemented, we would choose E1 and E3.
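
These speedups can also be checked mechanically; below is a small Python sketch (the helper name speedup is just for illustration) that applies the multi-enhancement form of Amdahl's law from part a:

def speedup(fractions, speedups):
    # fractions[i] of the time enhancement i is usable, giving speedup speedups[i]
    return 1 / (1 - sum(fractions) + sum(f / s for f, s in zip(fractions, speedups)))

SE = {1: 30, 2: 20, 3: 10}
FE = {1: 0.15, 2: 0.15, 3: 0.70}

for i in (1, 2, 3):
    print(f"E{i} alone: {speedup([FE[i]], [SE[i]]):.4f}")
for i, j in ((1, 2), (1, 3), (2, 3)):
    print(f"E{i}+E{j}: {speedup([FE[i], FE[j]], [SE[i], SE[j]]):.4f}")
# E3 alone is the best single choice (about 2.70); E1+E3 is the best pair (about 4.44).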

Grading:

2 points for correctly calculating the single-enhancement speedups.

2 points for correctly calculating the two-enhancement speedups.


2. Measuring processor time [8 points]

After graduating, you are asked to become the lead computer designer at Hyper Computer, Inc. Your study of the usage of high-level language constructs suggests that procedure calls are one of the most expensive operations. You have invented a new architecture with an ISA that reduces the loads and stores normally associated with procedure calls and returns. The first thing you do is run some experiments with and without this optimization. Your experiments use the same state-of-the-art optimizing compiler that will be used with either version of the computer. These experiments reveal the following information:

- The clock cycle time of the optimized version is 5% lower than that of the unoptimized version.

- Thirty percent of the instructions in the unoptimized version are loads or stores.

- The optimized version executes two-thirds as many loads and stores as the unoptimized version. For all other instructions the dynamic execution counts are unchanged.

- Every instruction (including load and store) in the unoptimized version takes one clock cycle.

- Due to the optimization, the procedure call and return instructions take one extra cycle in the optimized version, and these instructions account for 5% of the total instruction count in the optimized version.

Which is faster? Justify your decision quantitatively.

Solution: To decide which is faster, we need to compare the CPU times:

CPU Time = IC * CPI * Clk

For the unoptimized case we have:

CPU_un = IC_un * CPI_un * Clk_un

Because CPI_un = 1.0, this becomes:

CPU_un = IC_un * 1.0 * Clk_un

Since 30% of the instructions are loads and stores, and the optimized version executes only 2/3 of them, the optimization removes 30% * 1/3 = 10% of the instructions, so:

IC_new = 0.9 * IC_un

CPI_new = 0.95 * 1 + 0.05 * 2 = 1.05

Clk_new = 0.95 * Clk_un

So we have:

CPU_new = IC_new * CPI_new * Clk_new
        = 0.9 * IC_un * 1.05 * 0.95 * Clk_un
        = 0.89775 * IC_un * Clk_un
        = 0.89775 * CPU_un

So we should use the optimized version.
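
A short sketch that reproduces the comparison, normalizing IC_un and Clk_un to 1 (not part of the original solution):

IC_un, CPI_un, Clk_un = 1.0, 1.0, 1.0
cpu_un = IC_un * CPI_un * Clk_un

IC_new = 0.9 * IC_un           # 30% loads/stores, one third of them eliminated
CPI_new = 0.95 * 1 + 0.05 * 2  # calls/returns (5% of instructions) take 2 cycles
Clk_new = 0.95 * Clk_un        # 5% shorter clock cycle
cpu_new = IC_new * CPI_new * Clk_new

print(cpu_new / cpu_un)  # -> 0.89775: the optimized version takes about 10% less time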

Grading: 2 points for pointing out that we should use the CPU time formula to compare. 2 points for each correctly calculated component (IC, CPI, and Clk).


3. Basic Pipelining [16 points]

Consider the following code fragment:

Loop:
    LW    R1, 0(R2)
    DADDI R1, R1, 1
    SW    R1, 0(R2)
    DADDI R2, R2, 4
    DADDI R4, R4, -4
    BNEZ  R4, Loop

Consider the standard 5-stage pipeline machine (IF ID EX MEM WB). Assume the initial value of R4 is 396 and that all memory accesses hit in the cache.

a. [5 points] Show the timing of the above code fragment for one iteration as well as for the load of the second iteration. For this part, assume there is no forwarding or bypassing hardware. Assume a register write occurs in the first half of the cycle and a register read occurs in the last half of the cycle. Also assume that branches are resolved in the memory stage and are handled by flushing the pipeline. Use a pipeline timing chart like the one below to show the timing (expand the chart if you need more cycles). How many cycles does this loop take to complete (for all iterations, not just one iteration)?

Instruction       | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 |
LW R1, 0(R2)      | F  | D  | X  | M  | W  |
DADDI R1, R1, 1   |
SW R1, 0(R2)      |
DADDI R2, R2, 4   |
DADDI R4, R4, -4  |
BNEZ R4, Loop     |
LW R1, 0(R2)      |

Solution:

It is evident that the loop iterates 99 times. To calculate the total time the loop takes, we look at the length of the first 98 iterations, then factor in the 99th iteration, which takes a bit longer to execute.

The pipeline diagram:

Instruction       | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | C20 |
LW R1, 0(R2)      | F  | D  | X  | M  | W  |
DADDI R1, R1, 1   |    | F  | D  | S  | S  | X  | M  | W  |
SW R1, 0(R2)      |    |    | F  | S  | S  | D  | S  | S  | X  | M   | W   |
DADDI R2, R2, 4   |    |    |    |    |    | F  | S  | S  | D  | X   | M   | W   |
DADDI R4, R4, -4  |    |    |    |    |    |    |    |    | F  | D   | X   | M   | W   |
BNEZ R4, Loop     |    |    |    |    |    |    |    |    |    | F   | D   | S   | S   | X   | M   | W   |
LW R1, 0(R2)      |    |    |    |    |    |    |    |    |    |     |     |     |     |     |     | F   | D   | X   | M   | W   |

Here, "S" indicates a stall. The last cycle of an iteration is overlapped with the first cycle of the next, so it is not counted until the end. Therefore, the first 98 iterations take 15 cycles each, while the last iteration takes 16 cycles. The total time for the code to execute is therefore 98 x 15 + 16 = 1486 clock cycles.
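
A quick check of the iteration count and cycle total (a sketch, not part of the original solution):

iterations = 396 // 4                           # R4 counts down from 396 by 4, so 99 iterations
total_no_forwarding = (iterations - 1) * 15 + 16
print(iterations, total_no_forwarding)          # -> 99 1486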

Grading: 1 point each for rows 2, 3, and 6 of the diagram;
0.5 points each for rows 1, 4, 5, and 7.

b. [5 points] Show the timing for the same instruction sequence on a pipeline with full forwarding and bypassing hardware (as discussed in class). Assume that branches are resolved in the MEM stage and are predicted as not taken. How many cycles does this loop take to complete?

Solution:

Instruction       | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 |
LW R1, 0(R2)      | F  | D  | X  | M  | W  |
DADDI R1, R1, 1   |    | F  | D  | S  | X  | M  | W  |
SW R1, 0(R2)      |    |    | F  | S  | D  | X  | M  | W  |
DADDI R2, R2, 4   |    |    |    |    | F  | D  | X  | M  | W  |
DADDI R4, R4, -4  |    |    |    |    |    | F  | D  | X  | M   | W   |
BNEZ R4, Loop     |    |    |    |    |    |    | F  | D  | X   | M   | W   |
LW R1, 0(R2)      |    |    |    |    |    |    |    |    |    |     | F   | D   | X   | M   | W   |

The last cycle of an iteration is overlapped with the first cycle of the next, so it is not counted until the end. Therefore, the first 98 iterations take 10 cycles each, while the last iteration takes 11 cycles. The total time for the code to execute is therefore 98 x 10 + 11 = 991 clock cycles.
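
The same check for the forwarding case, including the rough speedup that forwarding buys on this loop:

iterations = 396 // 4
total_forwarding = (iterations - 1) * 10 + 11
print(total_forwarding)                   # -> 991
print(round(1486 / total_forwarding, 2))  # -> 1.5x speedup over the no-forwarding pipeline of part a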

Grading: Same as problem (a).

(c) [3 points] How does the branch delay slot improve performance? Point out where in your solution for part b it would be beneficial.

Solution:

The branch delay slot is a place after the branch instruction for an instruction that will be executed regardless of whether the branch is taken or not. By placing such an instruction after the branch and always executing it, we can do useful work while we are still determining whether the branch is taken and what the target address is.

More specifically, in part b the LW instruction of the next iteration enters the IF stage when the branch is in the WB stage (the LW cannot enter the IF stage until this point because we don't know the branch target until after the MEM stage). If we had a branch delay slot, we could have fit an extra instruction between the two with no penalty.

Grading:

2 points for the explanation.

1 point for pointing out where in part b it is useful.

(d) [3 points] Why does static branch prediction improve performance over no branch prediction?

Solution:

It allows the processor to fetch the predicted instruction and put it into the pipeline immediately after a branch, rather than waiting for the branch to be resolved. If the prediction is correct, nothing further needs to be done and we save a few clock cycles. If it mispredicts, we flush the pipeline and fetch the correct instruction, which is no worse than not predicting at all.

Grading:

3 points for the explanation.


4. Hazards [8 points]

Consider a pipeline with the following structure: IF ID EX MEM WB. Assume that the EX stage is 1 cycle long for all ALU operations, loads, and stores; 3 cycles long for the FP add; and 6 cycles long for the FP multiply. The pipeline supports full forwarding. All other stages in the pipeline take one cycle each. The branch is resolved in the ID stage. WAW hazards are resolved by stalling the later instruction. For the following code, list all the data hazards that cause stalls. State the type of each data hazard and give a brief explanation of why each hazard occurs.

(A quick inspection should be ok. You don't need to do a thorough pipeline diagram like in question 3.)

loop:
    L.D    F0, 0(R1)     #1
    L.D    F2, 8(R1)     #2
    L.D    F4, 16(R1)    #3
    L.D    F6, 24(R1)    #4
    MULT.D F8, F6, F0    #5
    ADD.D  F10, F4, F0   #6
    ADD.D  F8, F2, F0    #7
    S.D    0(R2), F8     #8
    DADDI  R2, R2, 8     #9
    S.D    8(R2), F10    #10
    DSUBI  R1, R1, 32    #11
    BNEZ   R1, loop      #12

Solution:

(a) RAW hazard between instructions 4 and 5: instruction 5 (MULT.D) needs the result of instruction 4 (L.D) before it is available.

(b) WAW hazard between instructions 5 and 7: instruction 7 (ADD.D) would write back to the same register (F8) before instruction 5 (MULT.D) writes back.

(c) RAW hazard between instructions 7 and 8: instruction 8 (S.D) stores the value computed by instruction 7 (ADD.D).

(d) RAW hazard between instructions 11 and 12: instruction 12 (BNEZ) uses the result of instruction 11 (DSUBI) to determine the branch outcome. This is a hazard because branches are resolved in the ID stage: the BNEZ ID stage occurs in the same cycle as the DSUBI EX stage, so forwarding cannot eliminate it.
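
As a rough cross-check (a sketch, not part of the original solution), the register dependences in the loop body can be enumerated mechanically. The scan below lists every RAW and WAW dependence; as argued above, only the four in (a)-(d) actually turn into stalls once the EX latencies and forwarding paths are taken into account.

# Each entry: (number, opcode, destination registers, source registers)
instrs = [
    (1,  "L.D",    ["F0"],  ["R1"]),
    (2,  "L.D",    ["F2"],  ["R1"]),
    (3,  "L.D",    ["F4"],  ["R1"]),
    (4,  "L.D",    ["F6"],  ["R1"]),
    (5,  "MULT.D", ["F8"],  ["F6", "F0"]),
    (6,  "ADD.D",  ["F10"], ["F4", "F0"]),
    (7,  "ADD.D",  ["F8"],  ["F2", "F0"]),
    (8,  "S.D",    [],      ["R2", "F8"]),
    (9,  "DADDI",  ["R2"],  ["R2"]),
    (10, "S.D",    [],      ["R2", "F10"]),
    (11, "DSUBI",  ["R1"],  ["R1"]),
    (12, "BNEZ",   [],      ["R1"]),
]

for i, (ni, opi, dests, _) in enumerate(instrs):
    for r in dests:
        for nj, opj, dj, sj in instrs[i + 1:]:
            if r in sj:
                print(f"RAW on {r}: #{ni} ({opi}) -> #{nj} ({opj})")
            if r in dj:
                print(f"WAW on {r}: #{ni} ({opi}) -> #{nj} ({opj})")
                break  # register redefined; later readers depend on the newer value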

Grading:

2 points per hazard (1 point for type, 1 point for reason).
