Appendix G - Clemson University
Appendix G - Clemson University
Appendix G - Clemson University
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
G-20 ■ <strong>Appendix</strong> G Vector Processors<br />
The value of T start is the sum of<br />
■ The vector load start-up of 12 clock cycles<br />
■ A 7-clock-cycle start-up for the multiply<br />
■ A 12-clock-cycle start-up for the store<br />
Thus, the value of T start is given by<br />
So, the overall value becomes<br />
T start = 12 + 7 + 12 = 31<br />
T 200 = 660 + 4 × 31= 784<br />
The execution time per element with all start-up costs is then 784/200 = 3.9,<br />
compared with a chime approximation of three. In Section G.4, we will be more<br />
ambitious—allowing overlapping of separate convoys.<br />
Figure G.9 shows the overhead and effective rates per element for the previous<br />
example (A = B × s) with various vector lengths. A chime counting model<br />
would lead to 3 clock cycles per element, while the two sources of overhead add<br />
0.9 clock cycles per element in the limit.<br />
The next few sections introduce enhancements that reduce this time. We will<br />
see how to reduce the number of convoys and hence the number of chimes using<br />
a technique called chaining. The loop overhead can be reduced by further overlapping<br />
the execution of vector and scalar instructions, allowing the scalar loop<br />
overhead in one iteration to be executed while the vector instructions in the previous<br />
instruction are completing. Finally, the vector start-up overhead can also be<br />
eliminated, using a technique that allows overlap of vector instructions in separate<br />
convoys.<br />
Vector Stride<br />
The second problem this section addresses is that the position in memory of adjacent<br />
elements in a vector may not be sequential. Consider the straightforward<br />
code for matrix multiply:<br />
do 10 i = 1,100<br />
do 10 j = 1,100<br />
A(i,j) = 0.0<br />
do 10 k = 1,100<br />
10 A(i,j) = A(i,j)+B(i,k)*C(k,j)<br />
At the statement labeled 10 we could vectorize the multiplication of each row of B<br />
with each column of C and strip-mine the inner loop with k as the index variable.