Appendix G - Clemson University
Appendix G - Clemson University
Appendix G - Clemson University
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
Exercises ■ G-51<br />
that Tloop of overhead is incurred for each iteration of the outer loop. What<br />
limits the performance?<br />
c. [20] Rewrite the VMIPS code to reduce the performance limitation;<br />
show the resulting inner loop in VMIPS vector instructions. (Hint: Think<br />
about what establishes Tchime ; can you affect it?) Find the total time for the<br />
resulting sequence.<br />
d. [20] Estimate the performance of your new version, using the<br />
techniques of Section G.6 and finding T64. G.7 [15/15/25] Consider the following code.<br />
do 10 i=1,64<br />
if (B(i) .ne. 0) then<br />
A(i) = A(i) / B(i)<br />
10 continue<br />
Assume that the addresses of A and B are in Ra and Rb, respectively, and that F0<br />
contains 0.<br />
a. [15] Write the VMIPS code for this loop using the vector-mask capability.<br />
b. [15] Write the VMIPS code for this loop using scatter-gather.<br />
c. [25] Estimate the performance (T100 in clock cycles) of these two vector<br />
loops, assuming a divide latency of 20 cycles. Assume that all vector<br />
instructions run at one result per clock, independent of the setting of the<br />
vector-mask register. Assume that 50% of the entries of B are 0. Considering<br />
hardware costs, which would you build if the above loop were typical?<br />
G.8 [15/20/15/15] In “Fallacies and Pitfalls” of Chapter 1, we saw that<br />
the difference between peak and sustained performance could be large: For one<br />
problem, a Hitachi S810 had a peak speed twice as high as that of the Cray X-MP,<br />
while for another more realistic problem, the Cray X-MP was twice as fast as the<br />
Hitachi processor. Let’s examine why this might occur using two versions of<br />
VMIPS and the following code sequences:<br />
C Code sequence 1<br />
do 10 i=1,10000<br />
A(i) = x * A(i) + y * A(i)<br />
10 continue<br />
C Code sequence 2<br />
do 10 i=1,100<br />
A(i) = x * A(i)<br />
10 continue<br />
Assume there is a version of VMIPS (call it VMIPS-II) that has two copies of<br />
every floating-point functional unit with full chaining among them. Assume that<br />
both VMIPS and VMIPS-II have two load-store units. Because of the extra func-