01.09.2013 Views

Appendix G - Clemson University

Appendix G - Clemson University

Appendix G - Clemson University

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

G-40 ■ <strong>Appendix</strong> G Vector Processors<br />

Tn lim ⎛----- ⎞<br />

n → ∞⎝<br />

n ⎠<br />

n + 64<br />

= lim ⎛-------------- ⎞<br />

n → ∞⎝<br />

n ⎠<br />

= 1<br />

R∞ =<br />

2 × 500 MHz<br />

-------------------------------<br />

1<br />

= 1000 MFLOPS<br />

Adding the extra memory pipelines and more flexible issue logic yields an<br />

improvement in peak performance of a factor of 4. However, T 66 = 130, so for<br />

shorter vectors, the sustained performance improvement is about 326/130 = 2.5<br />

times.<br />

In summary, we have examined several measures of vector performance.<br />

Theoretical peak performance can be calculated based purely on the value of<br />

T chime as<br />

Number of FLOPS per iteration × Clock rate<br />

----------------------------------------------------------------------------------------------------------<br />

Tchime By including the loop overhead, we can calculate values for peak performance for<br />

an infinite-length vector (R ∞ ) and also for sustained performance, Rn for a vector<br />

of length n, which is computed as<br />

Number of FLOPS per iteration × n × Clock rate<br />

Rn = --------------------------------------------------------------------------------------------------------------------<br />

Tn Using these measures we also can find N 1/2 and Nv, which give us another way of<br />

looking at the start-up overhead for vectors and the ratio of vector to scalar speed.<br />

A wide variety of measures of performance of vector processors are useful in<br />

understanding the range of performance that applications may see on a vector<br />

processor.<br />

G.7 Fallacies and Pitfalls<br />

Pitfall Concentrating on peak performance and ignoring start-up overhead.<br />

Early memory-memory vector processors such as the TI ASC and the CDC<br />

STAR-100 had long start-up times. For some vector problems, N v could be<br />

greater than 100! On the CYBER 205, derived from the STAR-100, the start-up<br />

overhead for DAXPY is 158 clock cycles, substantially increasing the break-even<br />

point. With a single vector unit, which contains 2 memory pipelines, the CYBER<br />

205 can sustain a rate of 2 clocks per iteration. The time for DAXPY for a vector<br />

of length n is therefore roughly 158 + 2n. If the clock rates of the Cray-1 and the<br />

CYBER 205 were identical, the Cray-1 would be faster until n > 64. Because<br />

the Cray-1 clock is also faster (even though the 205 is newer), the crossover<br />

point is over 100. Comparing a four-lane CYBER 205 (the maximum-size processor)<br />

with the Cray X-MP that was delivered shortly after the 205, the 205<br />

has a peak rate of two results per clock cycle—twice as fast as the X-MP. However,<br />

vectors must be longer than about 200 for the CYBER 205 to be faster.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!