01.09.2013 Views

Appendix G - Clemson University

Appendix G - Clemson University

Appendix G - Clemson University

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

G-20 ■ <strong>Appendix</strong> G Vector Processors<br />

The value of T start is the sum of<br />

■ The vector load start-up of 12 clock cycles<br />

■ A 7-clock-cycle start-up for the multiply<br />

■ A 12-clock-cycle start-up for the store<br />

Thus, the value of T start is given by<br />

So, the overall value becomes<br />

T start = 12 + 7 + 12 = 31<br />

T 200 = 660 + 4 × 31= 784<br />

The execution time per element with all start-up costs is then 784/200 = 3.9,<br />

compared with a chime approximation of three. In Section G.4, we will be more<br />

ambitious—allowing overlapping of separate convoys.<br />

Figure G.9 shows the overhead and effective rates per element for the previous<br />

example (A = B × s) with various vector lengths. A chime counting model<br />

would lead to 3 clock cycles per element, while the two sources of overhead add<br />

0.9 clock cycles per element in the limit.<br />

The next few sections introduce enhancements that reduce this time. We will<br />

see how to reduce the number of convoys and hence the number of chimes using<br />

a technique called chaining. The loop overhead can be reduced by further overlapping<br />

the execution of vector and scalar instructions, allowing the scalar loop<br />

overhead in one iteration to be executed while the vector instructions in the previous<br />

instruction are completing. Finally, the vector start-up overhead can also be<br />

eliminated, using a technique that allows overlap of vector instructions in separate<br />

convoys.<br />

Vector Stride<br />

The second problem this section addresses is that the position in memory of adjacent<br />

elements in a vector may not be sequential. Consider the straightforward<br />

code for matrix multiply:<br />

do 10 i = 1,100<br />

do 10 j = 1,100<br />

A(i,j) = 0.0<br />

do 10 k = 1,100<br />

10 A(i,j) = A(i,j)+B(i,k)*C(k,j)<br />

At the statement labeled 10 we could vectorize the multiplication of each row of B<br />

with each column of C and strip-mine the inner loop with k as the index variable.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!