01.09.2013 Views

Appendix G - Clemson University

Appendix G - Clemson University

Appendix G - Clemson University

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

G-36 ■ <strong>Appendix</strong> G Vector Processors<br />

load-store units, allow convoys to overlap to reduce the impact of start-up overheads,<br />

and decrease the number of loads required by vector-register allocation.<br />

We will examine the first two extensions in this section. The last optimization is<br />

actually used for the Cray-1, VMIPS’s cousin, to boost the performance by 50%.<br />

Reducing the number of loads requires an interprocedural optimization; we<br />

examine this transformation in Exercise G.6. Before we examine the first two<br />

extensions, let’s see what the real performance, including overhead, is.<br />

The Peak Performance of VMIPS on DAXPY<br />

First, we should determine what the peak performance, R ∞ , really is, since we<br />

know it must differ from the ideal 333 MFLOPS rate. For now, we continue to<br />

use the simplifying assumption that a convoy cannot start until all the instructions<br />

in an earlier convoy have completed; later we will remove this restriction. Using<br />

this simplification, the start-up overhead for the vector sequence is simply the<br />

sum of the start-up times of the instructions:<br />

Tstart = 12+ 7+ 12+ 6+ 12 = 49<br />

Using MVL = 64, T loop = 15, T start = 49, and T chime = 3 in the performance<br />

equation, and assuming that n is not an exact multiple of 64, the time for an nelement<br />

operation is<br />

T n<br />

n = ----- × ( 15 + 49)<br />

+ 3n<br />

64<br />

≤ ( n + 64)<br />

+ 3n<br />

= 4n + 64<br />

The sustained rate is actually over 4 clock cycles per iteration, rather than the<br />

theoretical rate of 3 chimes, which ignores overhead. The major part of the difference<br />

is the cost of the start-up overhead for each block of 64 elements (49 cycles<br />

versus 15 for the loop overhead).<br />

We can now compute R ∞ for a 500 MHz clock as<br />

Operations per iteration × Clock rate<br />

R∞ = lim ⎛--------------------------------------------------------------------------------------- ⎞<br />

n → ∞⎝<br />

Clock cycles per iteration ⎠<br />

The numerator is independent of n, hence<br />

R∞ =<br />

Operations per iteration × Clock rate<br />

--------------------------------------------------------------------------------------lim<br />

( Clock cycles per iteration)<br />

n → ∞<br />

Tn lim ( Clock cycles per iteration)<br />

= lim ⎛----- ⎞<br />

n → ∞<br />

n → ∞⎝<br />

n ⎠<br />

4n + 64<br />

= lim ⎛----------------- ⎞<br />

n → ∞⎝<br />

n ⎠<br />

= 4<br />

R∞ =<br />

2 × 500 MHz<br />

-------------------------------<br />

4<br />

=<br />

250 MFLOPS

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!