Appendix G - Clemson University
G-18 ■ Appendix G Vector Processors
The inner loop of the preceding code is vectorizable with length VL, which is
equal to either (n mod MVL) or MVL. The VLR register must be set twice, once
at each place where the variable VL in the code is assigned. With multiple vector
operations executing in parallel, the hardware must copy the value of VLR to the
vector functional unit when a vector operation issues, in case VLR is changed for
a subsequent vector operation.
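The strip-mining scheme described above can be sketched in Python as follows. This is a minimal illustration, not the book's code: the loop body (a DAXPY-style update), the function name `strip_mined_daxpy`, and the default MVL of 64 are assumptions made for the example. The inner Python loop stands in for a single vector operation whose length would be held in VLR.

```python
def strip_mined_daxpy(a, x, y, mvl=64):
    """Update y[i] = a * x[i] + y[i] in strips of at most mvl elements.

    Returns the list of strip lengths (the successive values of VL).
    The loop body and all names here are illustrative assumptions.
    """
    n = len(x)
    low = 0
    vl = n % mvl                      # first assignment of VL: the odd-size piece
    strip_lengths = []
    for _ in range(n // mvl + 1):     # outer (strip-mining) loop
        # Inner loop: models one vector operation of length vl.
        for i in range(low, low + vl):
            y[i] = a * x[i] + y[i]
        strip_lengths.append(vl)
        low += vl                     # start of the next strip
        vl = mvl                      # second assignment of VL: full length
    return strip_lengths
```

For n = 200 and MVL = 64, the strips have lengths 8, 64, 64, and 64: the odd-size piece runs first, and VL is then reset to MVL for the remaining iterations.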
Several vector ISAs have been developed that allow implementations to have<br />
different maximum vector-register lengths. For example, the IBM vector extension<br />
for the IBM 370 series mainframes supports an MVL of anywhere between<br />
8 and 512 elements. A “load vector count and update” (VLVCU) instruction is<br />
provided to control strip-mined loops. The VLVCU instruction has a single scalar<br />
register operand that specifies the desired vector length. The vector-length<br />
register is set to the minimum of the desired length and the maximum available<br />
vector length, and this value is also subtracted from the scalar register, setting<br />
the condition codes to indicate if the loop should be terminated. In this way,<br />
object code can be moved unchanged between two different implementations<br />
while making full use of the available vector-register length within each strip-mined
loop iteration.
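The behavior described for VLVCU can be modeled with a short Python sketch. It follows only the prose above (take the minimum of the desired and maximum lengths, subtract it from the scalar register, signal termination); the function names `vlvcu` and `strip_lengths` are hypothetical, and the real instruction's operand encoding and condition-code values are not modeled.

```python
def vlvcu(remaining, mvl):
    """Model one 'load vector count and update' step.

    remaining: scalar register holding the desired vector length.
    mvl: maximum vector length of this (hypothetical) implementation.
    Returns (vector_length, updated_remaining, done), where done
    stands in for the condition code that ends the loop.
    """
    vl = min(remaining, mvl)          # vector-length register value
    remaining -= vl                   # scalar register is updated in place
    return vl, remaining, remaining == 0


def strip_lengths(n, mvl):
    """Run a strip-mined loop over n elements and record each strip's length."""
    lengths = []
    remaining = n
    while remaining > 0:
        vl, remaining, _done = vlvcu(remaining, mvl)
        lengths.append(vl)            # a vector operation of length vl would run here
    return lengths
```

The same loop adapts to different register lengths: `strip_lengths(200, 64)` yields strips of 64, 64, 64, and 8 elements, while `strip_lengths(200, 512)` handles all 200 elements in one strip.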
In addition to the start-up overhead, we need to account for the overhead of<br />
executing the strip-mined loop. This strip-mining overhead, which arises from the<br />
need to reinitiate the vector sequence and set the VLR, effectively adds to the<br />
vector start-up time, assuming that a convoy does not overlap with other instructions.<br />
If that overhead for a convoy is 10 cycles, then the effective overhead per
64 elements increases by 10 cycles, or roughly 0.16 cycles per element (10/64 ≈ 0.156).
There are two key factors that contribute to the running time of a strip-mined<br />
loop consisting of a sequence of convoys:<br />
1. The number of convoys in the loop, which determines the number of chimes.<br />
We use the notation T_chime for the execution time in chimes.
2. The overhead for each strip-mined sequence of convoys. This overhead consists<br />
of the cost of executing the scalar code for strip-mining each block,<br />
T_loop, plus the vector start-up cost for each convoy, T_start.
There may also be a fixed overhead associated with setting up the vector<br />
sequence the first time. In recent vector processors this overhead has become<br />
quite small, so we ignore it.<br />
The components can be used to state the total running time for a vector
sequence operating on a vector of length n, which we will call T_n:
$$T_n = \left\lceil \frac{n}{\text{MVL}} \right\rceil \times (T_{\text{loop}} + T_{\text{start}}) + n \times T_{\text{chime}}$$
The values of T_start, T_loop, and T_chime are compiler and processor dependent. The
register allocation and scheduling of the instructions affect both what goes in a<br />
convoy and the start-up overhead of each convoy.
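As a rough check on the running-time model, the formula can be evaluated directly. The sketch below assumes the n/MVL term is rounded up to a whole number of strips, since each strip pays the loop and start-up overhead once. The function name `running_time` and the sample values are hypothetical, chosen only to show the arithmetic.

```python
def running_time(n, mvl, t_loop, t_start, t_chime):
    """Evaluate T_n = ceil(n / MVL) * (T_loop + T_start) + n * T_chime.

    n and mvl are element counts; the result is in the same units
    as t_loop and t_start. All parameter values are supplied by the
    caller and are assumptions of the example.
    """
    strips = (n + mvl - 1) // mvl     # integer ceiling of n / mvl
    return strips * (t_loop + t_start) + n * t_chime


# Hypothetical values: 200 elements, MVL = 64, 15 cycles of loop
# overhead, 49 cycles of start-up, 3 chimes.
total = running_time(200, 64, 15, 49, 3)
per_element = total / 200
```

With these sample values there are 4 strips, so the overhead term is 4 × 64 = 256 and the chime term is 600, for a total of 856, or 4.28 per element.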