Appendix G - Clemson University
G-36 ■ Appendix G Vector Processors
load-store units, allow convoys to overlap to reduce the impact of start-up overheads,
and decrease the number of loads required by vector-register allocation.
We will examine the first two extensions in this section. The last optimization is
actually used for the Cray-1, VMIPS’s cousin, to boost the performance by 50%.
Reducing the number of loads requires an interprocedural optimization; we
examine this transformation in Exercise G.6. Before we examine the first two
extensions, let’s see what the real performance, including overhead, is.
The Peak Performance of VMIPS on DAXPY
First, we should determine what the peak performance, R∞, really is, since we
know it must differ from the ideal 333 MFLOPS rate. For now, we continue to
use the simplifying assumption that a convoy cannot start until all the instructions
in an earlier convoy have completed; later we will remove this restriction. Using
this simplification, the start-up overhead for the vector sequence is simply the
sum of the start-up times of the instructions:
\[
T_{\text{start}} = 12 + 7 + 12 + 6 + 12 = 49
\]
Using MVL = 64, T_loop = 15, T_start = 49, and T_chime = 3 in the performance
equation, and assuming that n is not an exact multiple of 64, the time for an
n-element operation is
\[
T_n = \left\lceil \frac{n}{64} \right\rceil \times (15 + 49) + 3n \le (n + 64) + 3n = 4n + 64
\]
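The bound above can be checked numerically. The following is a minimal sketch rather than anything from the original text: the function name `strip_mined_time` and its parameter names are hypothetical, and the defaults are simply the values quoted above (MVL = 64, T_loop = 15, T_start = 49, T_chime = 3).

```python
def strip_mined_time(n, mvl=64, t_loop=15, t_start=49, t_chime=3):
    """Clock cycles for an n-element operation in the strip-mined timing model.

    Hypothetical helper: evaluates ceil(n / mvl) * (t_loop + t_start) + n * t_chime.
    """
    blocks = (n + mvl - 1) // mvl  # integer ceiling of n / mvl
    return blocks * (t_loop + t_start) + n * t_chime


# Start-up times of the five vector instructions, as summed in the text.
startup_times = [12, 7, 12, 6, 12]
t_start = sum(startup_times)

# Verify T_n <= 4n + 64 over a range of vector lengths.
bound_holds = all(strip_mined_time(n) <= 4 * n + 64 for n in range(1, 10_001))

print(t_start, strip_mined_time(1000), bound_holds)
```

The inequality holds because rounding n up to the next multiple of 64 adds fewer than 64 elements, so the per-block overhead of 64 cycles contributes at most n + 64 cycles in total.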
The sustained rate is actually over 4 clock cycles per iteration, rather than the
theoretical rate of 3 chimes, which ignores overhead. The major part of the difference
is the cost of the start-up overhead for each block of 64 elements (49 cycles
versus 15 for the loop overhead).
We can now compute R∞ for a 500 MHz clock as
\[
R_\infty = \lim_{n \to \infty} \left( \frac{\text{Operations per iteration} \times \text{Clock rate}}{\text{Clock cycles per iteration}} \right)
\]
The numerator is independent of n, hence
\[
R_\infty = \frac{\text{Operations per iteration} \times \text{Clock rate}}{\lim_{n \to \infty} (\text{Clock cycles per iteration})}
\]
\[
\lim_{n \to \infty} (\text{Clock cycles per iteration}) = \lim_{n \to \infty} \left( \frac{T_n}{n} \right) = \lim_{n \to \infty} \left( \frac{4n + 64}{n} \right) = 4
\]
\[
R_\infty = \frac{2 \times 500\ \text{MHz}}{4} = 250\ \text{MFLOPS}
\]
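To see how the achieved rate approaches this limit as vectors grow, the same timing model can be evaluated for a few lengths. This is an illustrative sketch only: `cycles_per_iteration` and `mflops` are hypothetical helper names, and the constants (2 floating-point operations per DAXPY iteration, a 500 MHz clock, MVL = 64, T_loop = 15, T_start = 49, T_chime = 3) are taken from the text.

```python
def cycles_per_iteration(n, mvl=64, t_loop=15, t_start=49, t_chime=3):
    """T_n / n for the strip-mined loop (hypothetical helper)."""
    blocks = (n + mvl - 1) // mvl  # integer ceiling of n / mvl
    return (blocks * (t_loop + t_start) + n * t_chime) / n


def mflops(n, flops_per_iteration=2, clock_mhz=500):
    """Achieved MFLOPS for an n-element DAXPY under the model above."""
    return flops_per_iteration * clock_mhz / cycles_per_iteration(n)


# Short vectors pay heavily for start-up; long vectors approach 250 MFLOPS.
for n in (1, 10, 64, 1000, 1_000_000):
    print(f"n = {n:>9}: {mflops(n):6.1f} MFLOPS")
```

When n is an exact multiple of 64 the model gives exactly 4 cycles per iteration; for other lengths the figure is slightly higher, which is why 4n + 64 serves as an upper bound on T_n.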