
Appendix G Vector Processors

The ability to access nonsequential memory elements and to reshape them into a dense structure is one of the major advantages of a vector processor over a cache-based processor. Caches inherently deal with unit-stride data, so while increasing the block size can help reduce miss rates for large scientific data sets with unit stride, it can have a negative effect for data that is accessed with nonunit stride. While blocking techniques can solve some of these problems (see Section 5.5), the ability to access noncontiguous data efficiently remains an advantage for vector processors on certain problems.

On VMIPS, where the addressable unit is a byte, the stride for our example would be 800. The value must be computed dynamically, since the size of the matrix may not be known at compile time, or, just like vector length, may change for different executions of the same statement. The vector stride, like the vector starting address, can be put in a general-purpose register. Then the VMIPS instruction LVWS (load vector with stride) can be used to fetch the vector into a vector register. Likewise, when a nonunit-stride vector is being stored, SVWS (store vector with stride) can be used. In some vector processors the loads and stores always have a stride value stored in a register, so that only a single load and a single store instruction are required. Unit strides occur much more frequently than other strides and can benefit from special-case handling in the memory system, and so are often separated from nonunit-stride operations, as in VMIPS.

Complications in the memory system can occur from supporting strides greater than one. In Chapter 5 we saw that memory accesses could proceed at full speed if the number of memory banks was at least as large as the bank busy time in clock cycles. Once nonunit strides are introduced, however, it becomes possible to request accesses from the same bank more frequently than the bank busy time allows. When multiple accesses contend for a bank, a memory bank conflict occurs and one access must be stalled. A bank conflict, and hence a stall, will occur if

Least common multiple (Stride, Number of banks) / Stride < Bank busy time

The left-hand side is the number of clock cycles between successive accesses to the same bank, so a stall occurs whenever that gap is shorter than the bank busy time.
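Because lcm(Stride, Number of banks) / Stride equals Number of banks / gcd(Stride, Number of banks), the condition is easy to evaluate in code. The helper below is a minimal sketch, not part of the text; the function names are made up for illustration.

```c
#include <stdbool.h>

/* Greatest common divisor, used to evaluate the bank-conflict condition. */
static unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* A strided access stream revisits the same bank every
 * lcm(stride, banks) / stride = banks / gcd(stride, banks) accesses.
 * A conflict (stall) occurs when that gap is shorter than the bank busy time. */
static bool bank_conflict(unsigned stride, unsigned banks, unsigned busy_time)
{
    unsigned revisit_gap = banks / gcd(stride, banks);
    return revisit_gap < busy_time;
}
```

With the parameters of the example that follows (8 banks, bank busy time 6), bank_conflict(1, 8, 6) returns false and bank_conflict(32, 8, 6) returns true.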

Example Suppose we have 8 memory banks with a bank busy time of 6 clocks and a total memory latency of 12 cycles. How long will it take to complete a 64-element vector load with a stride of 1? With a stride of 32?

Answer Since the number of banks is larger than the bank busy time, for a stride of 1 the load will take 12 + 64 = 76 clock cycles, or 1.2 clocks per element. The worst possible stride is a value that is a multiple of the number of memory banks, as in this case with a stride of 32 and 8 memory banks. Every access to memory (after the first one) will collide with the previous access and will have to wait for the 6-clock-cycle bank busy time. The total time will be 12 + 1 + 6 × 63 = 391 clock cycles, or 6.1 clocks per element.
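The arithmetic above can be reproduced with a small simulation. The sketch below assumes the timing model implied by the answer: at most one access issues per clock, a bank cannot accept a new access until the bank busy time has elapsed since its last one, and an element returns a fixed memory latency after its access issues. The function and its parameters are illustrative, not from the text.

```c
#include <stdio.h>

/* Simulate a strided vector load over interleaved memory banks.
 * Assumes at most one access issues per clock, each bank is busy for
 * `busy` clocks after accepting an access, and data returns `latency`
 * clocks after an access issues.  Supports up to 64 banks. */
static unsigned long vector_load_cycles(unsigned n, unsigned stride,
                                        unsigned banks, unsigned busy,
                                        unsigned latency)
{
    unsigned long bank_free[64] = {0};   /* cycle when each bank is free again */
    unsigned long cycle = 0;             /* earliest cycle the next access may issue */
    unsigned long last_return = 0;

    for (unsigned i = 0; i < n; i++) {
        unsigned bank = (unsigned)((unsigned long)i * stride % banks);
        unsigned long issue = cycle > bank_free[bank] ? cycle : bank_free[bank];
        bank_free[bank] = issue + busy;  /* bank is busy for `busy` clocks */
        last_return = issue + latency;   /* element i returns at this cycle */
        cycle = issue + 1;               /* at most one issue per clock */
    }
    return last_return + 1;              /* cycles 0 .. last_return inclusive */
}

int main(void)
{
    printf("stride 1:  %lu cycles\n", vector_load_cycles(64, 1, 8, 6, 12));
    printf("stride 32: %lu cycles\n", vector_load_cycles(64, 32, 8, 6, 12));
    return 0;
}
```

Run as written, this prints 76 cycles for a stride of 1 and 391 cycles for a stride of 32, matching the answer above.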
