Appendix G - Clemson University
G.4 Enhancing Vector Performance ■ G-29

Time1 = 5n
Time2 = 4n + 4fn

We want Time1 ≥ Time2, so

5n ≥ 4n + 4fn
1/4 ≥ f

That is, the second method is faster if less than one-quarter of the elements are nonzero. In many cases the frequency of execution is much lower. If the index vector can be reused, or if the number of vector statements within the if statement grows, the advantage of the scatter-gather approach will increase sharply.
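The crossover can be checked numerically. Below is a minimal sketch of the cost model above — the masked version costs 5n cycles regardless of sparsity, while the scatter-gather version costs 4n + 4fn cycles for a fraction f of nonzero elements; the function names are illustrative, not from the text:

```python
# Cost model from the text: the masked version always runs over all n
# elements; scatter-gather pays fixed overhead plus work on the f*n nonzeros.
def time_masked(n):
    return 5 * n

def time_scatter_gather(n, f):
    return 4 * n + 4 * f * n

n = 1000
for f in (0.10, 0.25, 0.50):
    sg_no_slower = time_scatter_gather(n, f) <= time_masked(n)
    print(f"f = {f:.2f}: scatter-gather no slower? {sg_no_slower}")
# Setting 5n = 4n + 4fn gives f = 1/4: the two methods break even exactly
# at one-quarter nonzero elements, matching the inequality above.
```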
Multiple Lanes
One of the greatest advantages of a vector instruction set is that it allows software to pass a large amount of parallel work to hardware using only a single short instruction. A single vector instruction can include tens to hundreds of independent operations yet be encoded in the same number of bits as a conventional scalar instruction. The parallel semantics of a vector instruction allow an implementation to execute these elemental operations using a deeply pipelined functional unit, as in the VMIPS implementation we've studied so far, an array of parallel functional units, or a combination of parallel and pipelined functional units. Figure G.11 illustrates how vector performance can be improved by using parallel pipelines to execute a vector add instruction.
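To make the parallel-pipelines point concrete, here is a minimal sketch (not VMIPS itself; names are illustrative) of a vector add executed across multiple lanes, where each cycle retires one element group of one element per lane:

```python
# Hypothetical multi-lane vector add: each cycle processes one element
# group (one element per lane), so n elements take ceil(n / num_lanes)
# cycles, ignoring pipeline startup latency.
def lane_parallel_add(a, b, num_lanes):
    n = len(a)
    result = [0] * n
    cycles = 0
    for start in range(0, n, num_lanes):  # one element group per cycle
        for lane in range(num_lanes):
            i = start + lane
            if i < n:
                result[i] = a[i] + b[i]  # each lane handles its own element
        cycles += 1
    return result, cycles

# 8 elements on 4 lanes finish in 2 element groups, i.e., 2 cycles
# instead of 8 on a single pipeline.
res, cycles = lane_parallel_add(list(range(8)), list(range(8)), num_lanes=4)
```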
The VMIPS instruction set has been designed with the property that all vector arithmetic instructions only allow element N of one vector register to take part in operations with element N from other vector registers. This dramatically simplifies the construction of a highly parallel vector unit, which can be structured as multiple parallel lanes. As with a traffic highway, we can increase the peak throughput of a vector unit by adding more lanes. The structure of a four-lane vector unit is shown in Figure G.12.
Each lane contains one portion of the vector-register file and one execution pipeline from each vector functional unit. Each vector functional unit executes vector instructions at the rate of one element group per cycle using multiple pipelines, one per lane. The first lane holds the first element (element 0) for all vector registers, and so the first element in any vector instruction will have its source and destination operands located in the first lane. This allows the arithmetic pipeline local to the lane to complete the operation without communicating with other lanes. Interlane wiring is needed only to access main memory. This lack of interlane communication reduces the wiring cost and register file ports required to build a highly parallel execution unit, and helps explain why current vector supercomputers can complete up to 64 operations per cycle (2 arithmetic units and 2 load-store units across 16 lanes).
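The lane-ownership rule and the peak-throughput figure can be sketched as follows; the 16-lane, 4-unit numbers are the supercomputer figures quoted above, and the helper name is illustrative:

```python
# Element i of every vector register lives in lane i % num_lanes, so an
# element-wise operation reads its sources and writes its destination
# entirely within one lane -- no interlane communication is needed.
num_lanes = 16

def lane_of(i):
    return i % num_lanes

# Elements 0, 16, 32, ... all belong to lane 0: the add for element 16
# needs no data from any other lane.
assert lane_of(0) == lane_of(16) == lane_of(32) == 0

# Peak throughput quoted in the text: 2 arithmetic + 2 load-store units,
# each contributing one pipeline per lane.
pipelines_per_lane = 2 + 2
peak_ops_per_cycle = num_lanes * pipelines_per_lane  # 64 operations/cycle
```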