01.09.2013 Views

Appendix G - Clemson University

Appendix G - Clemson University

Appendix G - Clemson University

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

G-18 ■ <strong>Appendix</strong> G Vector Processors<br />

The inner loop of the preceding code is vectorizable with length VL, which is<br />

equal to either (n mod MVL) or MVL. The VLR register must be set twice—once<br />

at each place where the variable VL in the code is assigned. With multiple vector<br />

operations executing in parallel, the hardware must copy the value of VLR to the<br />

vector functional unit when a vector operation issues, in case VLR is changed for<br />

a subsequent vector operation.<br />

Several vector ISAs have been developed that allow implementations to have<br />

different maximum vector-register lengths. For example, the IBM vector extension<br />

for the IBM 370 series mainframes supports an MVL of anywhere between<br />

8 and 512 elements. A “load vector count and update” (VLVCU) instruction is<br />

provided to control strip-mined loops. The VLVCU instruction has a single scalar<br />

register operand that specifies the desired vector length. The vector-length<br />

register is set to the minimum of the desired length and the maximum available<br />

vector length, and this value is also subtracted from the scalar register, setting<br />

the condition codes to indicate if the loop should be terminated. In this way,<br />

object code can be moved unchanged between two different implementations<br />

while making full use of the available vector-register length within each stripmined<br />

loop iteration.<br />

In addition to the start-up overhead, we need to account for the overhead of<br />

executing the strip-mined loop. This strip-mining overhead, which arises from the<br />

need to reinitiate the vector sequence and set the VLR, effectively adds to the<br />

vector start-up time, assuming that a convoy does not overlap with other instructions.<br />

If that overhead for a convoy is 10 cycles, then the effective overhead per<br />

64 elements increases by 10 cycles, or 0.15 cycles per element.<br />

There are two key factors that contribute to the running time of a strip-mined<br />

loop consisting of a sequence of convoys:<br />

1. The number of convoys in the loop, which determines the number of chimes.<br />

We use the notation Tchime for the execution time in chimes.<br />

2. The overhead for each strip-mined sequence of convoys. This overhead consists<br />

of the cost of executing the scalar code for strip-mining each block,<br />

Tloop , plus the vector start-up cost for each convoy, Tstart .<br />

There may also be a fixed overhead associated with setting up the vector<br />

sequence the first time. In recent vector processors this overhead has become<br />

quite small, so we ignore it.<br />

The components can be used to state the total running time for a vector<br />

sequence operating on a vector of length n, which we will call T n :<br />

Tn =<br />

n<br />

------------ × ( Tloop + Tstart) + n × Tchime<br />

MVL<br />

The values of T start , T loop , and T chime are compiler and processor dependent. The<br />

register allocation and scheduling of the instructions affect both what goes in a<br />

convoy and the start-up overhead of each convoy.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!