01.09.2013 Views

Appendix G - Clemson University


G-30 ■ Appendix G Vector Processors

[Figure G.11: (a) a single add pipeline producing C[0], with element pairs A[1] to A[9] and B[1] to B[9] queued behind it; (b) four add pipelines producing C[0] to C[3] together, with the remaining element pairs interleaved across the pipelines.]

Figure G.11 Using multiple functional units to improve the performance of a single vector add instruction, C = A + B. The machine shown in (a) has a single add pipeline and can complete one addition per cycle. The machine shown in (b) has four add pipelines and can complete four additions per cycle. The elements within a single vector add instruction are interleaved across the four pipelines. The set of elements that move through the pipelines together is termed an element group. (Reproduced with permission from Asanovic [1998].)
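As a minimal sketch of the interleaving described in the caption, the Python below assumes that element i is assigned to pipeline i mod 4 and that an element group is the set of elements issued together in one cycle. The names `lane_of` and `element_groups` are illustrative and do not come from the text.

```python
def lane_of(index, lanes):
    """Pipeline (lane) that handles element `index`, assuming round-robin interleaving."""
    return index % lanes


def element_groups(vector_length, lanes):
    """Element indices grouped by the cycle in which they enter the pipelines together."""
    return [
        list(range(start, min(start + lanes, vector_length)))
        for start in range(0, vector_length, lanes)
    ]


# Ten-element vectors on four pipelines, as drawn in panel (b) of the figure.
groups = element_groups(10, 4)
print(groups)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Elements handled by pipeline 1: indices 1, 5, and 9.
print([i for i in range(10) if lane_of(i, 4) == 1])  # [1, 5, 9]
```

With a single pipeline, as in panel (a), every element group contains exactly one element.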

Adding multiple lanes is a popular technique to improve vector performance, as it requires little increase in control complexity and does not require changes to existing machine code. Several vector supercomputers are sold as a range of models that vary in the number of lanes installed, allowing users to trade price against peak vector performance. The Cray SV1 allows four two-lane CPUs to be ganged together using operating system software to form a single larger eight-lane CPU.
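The price and performance trade-off can be made concrete with a rough model. The sketch below assumes each lane completes one element per cycle, so a vector of n elements occupies ceil(n / lanes) cycles, and it ignores start-up latency entirely. The function names and the vector lengths used are assumptions for illustration only.

```python
def element_cycles(vector_length, lanes):
    """Cycles needed to push all elements through the lanes (ceiling division)."""
    return (vector_length + lanes - 1) // lanes


def lane_utilization(vector_length, lanes):
    """Fraction of lane slots doing useful work; assumes vector_length > 0."""
    return vector_length / (lanes * element_cycles(vector_length, lanes))


# A 64-element vector on machines with 1, 2, 4, and 8 lanes.
for lanes in (1, 2, 4, 8):
    print(lanes, element_cycles(64, lanes), lane_utilization(64, lanes))

# A short 10-element vector on 4 lanes leaves the last element group half empty.
print(element_cycles(10, 4), lane_utilization(10, 4))
```

Under these assumptions, doubling the lane count halves the element cycles only when the vector is long enough to keep every lane busy.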

Pipelined Instruction Start-Up

Adding multiple lanes increases peak performance but does not change start-up latency, so it becomes critical to reduce start-up overhead by allowing the start of one vector instruction to overlap with the completion of preceding vector instructions. The simplest case to consider is when two vector instructions access different sets of vector registers. For example, in the code sequence

ADDV.D V1,V2,V3
ADDV.D V4,V5,V6
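To see why start-up overhead matters more as lanes are added, here is a rough timing sketch for a sequence of independent vector instructions such as the two adds above. It assumes a fixed start-up latency per instruction (the value of 6 cycles is made up for illustration), ceil(n / lanes) cycles to issue the elements, and an idealized overlap in which only the first instruction pays the start-up cost. None of these numbers or function names come from the text.

```python
def element_cycles(vector_length, lanes):
    """Cycles needed to issue all elements, one element per lane per cycle."""
    return (vector_length + lanes - 1) // lanes


def cycles_without_overlap(instructions, vector_length, lanes, startup):
    """Each instruction waits for the previous one to finish completely."""
    return instructions * (startup + element_cycles(vector_length, lanes))


def cycles_with_overlap(instructions, vector_length, lanes, startup):
    """Idealized case: start-up of later instructions is hidden behind earlier ones."""
    return startup + instructions * element_cycles(vector_length, lanes)


STARTUP = 6  # assumed start-up latency in cycles, chosen only for illustration

# Two independent 64-element adds on machines with 1, 4, and 8 lanes.
for lanes in (1, 4, 8):
    serial = cycles_without_overlap(2, 64, lanes, STARTUP)
    overlapped = cycles_with_overlap(2, 64, lanes, STARTUP)
    print(lanes, serial, overlapped, round(1 - overlapped / serial, 3))
```

In this model the absolute saving from overlap is the same for every lane count, but it is a larger fraction of the total as lanes increase, because the element cycles shrink while the start-up latency does not.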

