01.09.2013 Views

Appendix G - Clemson University


G.3 Two Real-World Issues: Vector Length and Stride

[Plot: clock cycles (vertical axis) versus vector size (horizontal axis), with two curves: total time per element and total overhead per element.]

Figure G.9 The total execution time per element and the total overhead time per element versus the vector length for the example on page G-19. For short vectors the total start-up time is more than one-half of the total time, while for long vectors it reduces to about one-third of the total time. The sudden jumps occur when the vector length crosses a multiple of 64, forcing another iteration of the strip-mining code and execution of a set of vector instructions. These operations increase T_n by T_loop + T_start.
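The jumps described in the caption can be reproduced with a small cost model. The sketch below assumes the common strip-mining form T_n = ceil(n / MVL) * (T_loop + T_start) + n * T_chime with MVL = 64. The function names and the numeric costs (15, 49, and 3 cycles) are illustrative assumptions, not values stated in this excerpt.

```python
import math

MVL = 64  # maximum vector length; the caption's jumps occur at multiples of 64


def overhead_time(n, t_loop=15, t_start=49):
    """Cycles spent on loop and start-up overhead alone (assumed default costs)."""
    strips = math.ceil(n / MVL)  # one strip-mined iteration per block of up to MVL elements
    return strips * (t_loop + t_start)


def total_time(n, t_loop=15, t_start=49, t_chime=3):
    """Total cycles for n elements under the assumed model."""
    return overhead_time(n, t_loop, t_start) + n * t_chime


if __name__ == "__main__":
    # Print total and overhead cycles per element around the multiples of 64.
    for n in (10, 64, 65, 128, 129, 190):
        print(n, round(total_time(n) / n, 2), round(overhead_time(n) / n, 2))
```

With these assumed costs, going from 64 to 65 elements adds one full T_loop + T_start on top of the per-element cost, which is the kind of step the figure shows.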

To do so, we must consider how adjacent elements in B and adjacent elements in C are addressed. As we discussed in Section 5.5, when an array is allocated memory, it is linearized and must be laid out in either row-major or column-major order. This linearization means that either the elements in the row or the elements in the column are not adjacent in memory. For example, if the preceding loop were written in FORTRAN, which allocates column-major order, the elements of B that are accessed by iterations in the inner loop are separated by the row size times 8 (the number of bytes per entry) for a total of 800 bytes. In Chapter 5, we saw that blocking could be used to improve the locality in cache-based systems. For vector processors without caches, we need another technique to fetch elements of a vector that are not adjacent in memory.
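The 800-byte separation can be checked with a short address calculation. This is a minimal sketch assuming 100 x 100 matrices of 8-byte entries (a size inferred from the 800-byte figure) and zero-based indices; the function names are hypothetical.

```python
ENTRY_BYTES = 8  # bytes per entry (one double word), as stated in the text
N = 100          # assumed matrix dimension, inferred from the 800-byte figure


def column_major_offset(row, col, n_rows=N):
    """Byte offset of element (row, col) in column-major (FORTRAN) order."""
    return (col * n_rows + row) * ENTRY_BYTES


def row_major_offset(row, col, n_cols=N):
    """Byte offset of element (row, col) in row-major order."""
    return (row * n_cols + col) * ENTRY_BYTES


# Distance between neighbors along a row (second index changes by one).
along_row = column_major_offset(0, 1) - column_major_offset(0, 0)
# Distance between neighbors down a column (first index changes by one).
down_column = column_major_offset(1, 0) - column_major_offset(0, 0)

if __name__ == "__main__":
    print(along_row, down_column)
```

In column-major order only the elements of a column are adjacent; stepping along a row skips a whole column of entries each time. Row-major order reverses the two cases.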

This distance separating elements that are to be gathered into a single register is called the stride. In the current example, using column-major layout for the matrices means that matrix C has a stride of 1, or 1 double word (8 bytes), separating successive elements, and matrix B has a stride of 100, or 100 double words (800 bytes).
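To make the definition concrete, the sketch below emulates a load that gathers elements a fixed stride apart from a flat memory array. The helper `strided_load`, the 100 x 100 size, and the test values stored in memory are all hypothetical, and the stride is counted in double words (array elements) rather than bytes.

```python
def strided_load(memory, base, stride, length):
    """Gather `length` elements starting at index `base`, `stride` elements apart."""
    return [memory[base + i * stride] for i in range(length)]


N = 100  # assumed matrix dimension, matching the stride of 100 in the text

# Flat column-major storage of a hypothetical N x N matrix:
# element (r, c) lives at index c * N + r and holds the value r * 1000 + c.
memory = [r * 1000 + c for c in range(N) for r in range(N)]

# Nonunit stride: walk along row 3, jumping one full column (N elements) each step.
row_3 = strided_load(memory, base=3, stride=N, length=4)
# Unit stride: walk down column 0, where elements are adjacent in memory.
col_0 = strided_load(memory, base=0, stride=1, length=4)

if __name__ == "__main__":
    print(row_3)
    print(col_0)
```

The returned list plays the role of a vector register: whatever the stride in memory, the gathered elements sit side by side once loaded.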

Once a vector is loaded into a vector register it acts as if it had logically adjacent elements. Thus a vector-register processor can handle strides greater than one, called nonunit strides, using only vector-load and vector-store operations with stride capability. This ability to access nonsequential memory locations and
