Appendix G - Clemson University
Appendix G - Clemson University
Appendix G - Clemson University
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
G-40 ■ <strong>Appendix</strong> G Vector Processors<br />
Tn lim ⎛----- ⎞<br />
n → ∞⎝<br />
n ⎠<br />
n + 64<br />
= lim ⎛-------------- ⎞<br />
n → ∞⎝<br />
n ⎠<br />
= 1<br />
R∞ =<br />
2 × 500 MHz<br />
-------------------------------<br />
1<br />
= 1000 MFLOPS<br />
Adding the extra memory pipelines and more flexible issue logic yields an<br />
improvement in peak performance of a factor of 4. However, T 66 = 130, so for<br />
shorter vectors, the sustained performance improvement is about 326/130 = 2.5<br />
times.<br />
In summary, we have examined several measures of vector performance.<br />
Theoretical peak performance can be calculated based purely on the value of<br />
T chime as<br />
Number of FLOPS per iteration × Clock rate<br />
----------------------------------------------------------------------------------------------------------<br />
Tchime By including the loop overhead, we can calculate values for peak performance for<br />
an infinite-length vector (R ∞ ) and also for sustained performance, Rn for a vector<br />
of length n, which is computed as<br />
Number of FLOPS per iteration × n × Clock rate<br />
Rn = --------------------------------------------------------------------------------------------------------------------<br />
Tn Using these measures we also can find N 1/2 and Nv, which give us another way of<br />
looking at the start-up overhead for vectors and the ratio of vector to scalar speed.<br />
A wide variety of measures of performance of vector processors are useful in<br />
understanding the range of performance that applications may see on a vector<br />
processor.<br />
G.7 Fallacies and Pitfalls<br />
Pitfall Concentrating on peak performance and ignoring start-up overhead.<br />
Early memory-memory vector processors such as the TI ASC and the CDC<br />
STAR-100 had long start-up times. For some vector problems, N v could be<br />
greater than 100! On the CYBER 205, derived from the STAR-100, the start-up<br />
overhead for DAXPY is 158 clock cycles, substantially increasing the break-even<br />
point. With a single vector unit, which contains 2 memory pipelines, the CYBER<br />
205 can sustain a rate of 2 clocks per iteration. The time for DAXPY for a vector<br />
of length n is therefore roughly 158 + 2n. If the clock rates of the Cray-1 and the<br />
CYBER 205 were identical, the Cray-1 would be faster until n > 64. Because<br />
the Cray-1 clock is also faster (even though the 205 is newer), the crossover<br />
point is over 100. Comparing a four-lane CYBER 205 (the maximum-size processor)<br />
with the Cray X-MP that was delivered shortly after the 205, the 205<br />
has a peak rate of two results per clock cycle—twice as fast as the X-MP. However,<br />
vectors must be longer than about 200 for the CYBER 205 to be faster.