Appendix G - Clemson University

Time1 = 5n
Time2 = 4n + 4fn

We want Time1 ≥ Time2, so

5n ≥ 4n + 4fn
1/4 ≥ f

That is, the second method is faster if less than one-quarter of the elements are nonzero. In many cases the frequency of execution is much lower. If the index vector can be reused, or if the number of vector statements within the if statement grows, the advantage of the scatter-gather approach will increase sharply.
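To make the crossover concrete, here is a minimal C sketch that evaluates the two timing models quoted above (the vector length n = 64 and the sample densities are illustrative choices, not from the text): at f = 0.25 the two methods tie, and for smaller f the scatter-gather method wins.

    #include <stdio.h>

    /* Timing models quoted in the text, for n elements of which a
       fraction f are nonzero: Time1 is the first method, Time2 the
       scatter-gather method. */
    static double time1(double n)           { return 5.0 * n; }
    static double time2(double n, double f) { return 4.0 * n + 4.0 * f * n; }

    int main(void) {
        double n = 64.0;                     /* illustrative vector length */
        double densities[] = { 0.10, 0.25, 0.50 };
        for (int i = 0; i < 3; i++) {
            double f = densities[i];
            printf("f = %.2f: Time1 = %5.1f  Time2 = %5.1f  -> %s\n",
                   f, time1(n), time2(n, f),
                   time2(n, f) <= time1(n) ? "scatter-gather no slower"
                                           : "first method faster");
        }
        return 0;
    }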

Multiple Lanes

One of the greatest advantages of a vector instruction set is that it allows software to pass a large amount of parallel work to hardware using only a single short instruction. A single vector instruction can include tens to hundreds of independent operations yet be encoded in the same number of bits as a conventional scalar instruction. The parallel semantics of a vector instruction allows an implementation to execute these elemental operations using a deeply pipelined functional unit, as in the VMIPS implementation we have studied so far, an array of parallel functional units, or a combination of parallel and pipelined functional units. Figure G.11 illustrates how vector performance can be improved by using parallel pipelines to execute a vector add instruction.
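As a back-of-the-envelope illustration of the point Figure G.11 makes, the sketch below uses a simplified clock model (assumed here, not given in the appendix) in which each pipeline starts one element per clock: with four parallel pipelines, the same vector add issues its elements in roughly a quarter of the clocks.

    #include <stdio.h>

    /* Deliberately simplified model (an assumption, not from the
       appendix): a fully pipelined functional unit starts one element
       per clock, so `pipes` parallel pipelines start `pipes` elements
       per clock.  Ignoring start-up latency, issuing an n-element
       vector add takes roughly ceil(n / pipes) clocks. */
    static int issue_clocks(int n, int pipes) {
        return (n + pipes - 1) / pipes;
    }

    int main(void) {
        int n = 64;                          /* VMIPS maximum vector length */
        for (int pipes = 1; pipes <= 4; pipes *= 2)
            printf("%d pipeline(s): about %2d clocks to issue %d element adds\n",
                   pipes, issue_clocks(n, pipes), n);
        return 0;
    }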

The VMIPS instruction set has been designed with the property that all vector arithmetic instructions only allow element N of one vector register to take part in operations with element N from other vector registers. This dramatically simplifies the construction of a highly parallel vector unit, which can be structured as multiple parallel lanes. As with a traffic highway, we can increase the peak throughput of a vector unit by adding more lanes. The structure of a four-lane vector unit is shown in Figure G.12.
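Stated as code, the element-N restriction means a vector add behaves like the short C loop below (a sketch of the semantics only, not VMIPS code): because element i never pairs with a different element index of the other operand, no lane needs an operand held by another lane.

    #include <stdio.h>

    /* Elementwise semantics of a vector add: element i of the result
       depends only on element i of each source register, never on a
       different element index of the other operand. */
    static void vector_add(const double *a, const double *b, double *c, int vl) {
        for (int i = 0; i < vl; i++)
            c[i] = a[i] + b[i];
    }

    int main(void) {
        double a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
        vector_add(a, b, c, 4);
        for (int i = 0; i < 4; i++)
            printf("c[%d] = %g\n", i, c[i]);
        return 0;
    }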

Each lane contains one portion of the vector-register file and one execution pipeline from each vector functional unit. Each vector functional unit executes vector instructions at the rate of one element group per cycle using multiple pipelines, one per lane. The first lane holds the first element (element 0) for all vector registers, and so the first element in any vector instruction will have its source and destination operands located in the first lane. This allows the arithmetic pipeline local to the lane to complete the operation without communicating with other lanes. Interlane wiring is only required to access main memory. This lack of interlane communication reduces the wiring cost and register file ports required to build a highly parallel execution unit, and helps explain why current vector supercomputers can complete up to 64 operations per cycle (2 arithmetic units and 2 load-store units across 16 lanes).
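The sketch below spells out the two numbers in this paragraph. The interleaved element-to-lane mapping (element i in lane i mod 16) is the conventional scheme and is an assumption here; the appendix states only that the first lane holds element 0 of every register. The peak-rate arithmetic, 4 units times 16 lanes, is taken directly from the text.

    #include <stdio.h>

    int main(void) {
        int lanes = 16;

        /* Element-to-lane mapping: with element 0 held by lane 0, the
           usual interleaving (assumed here) places element i in lane
           i % lanes. */
        int examples[] = { 0, 1, 15, 16, 31 };
        for (int k = 0; k < 5; k++)
            printf("element %2d -> lane %2d\n", examples[k], examples[k] % lanes);

        /* Peak throughput: each functional unit completes one element
           per lane per clock, so 2 arithmetic + 2 load-store units
           across 16 lanes gives 4 * 16 = 64 operations per clock. */
        int units = 2 + 2;
        printf("peak operations per clock = %d units x %d lanes = %d\n",
               units, lanes, units * lanes);
        return 0;
    }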
