01.09.2013 Views

Appendix G - Clemson University

Appendix G - Clemson University

Appendix G - Clemson University

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

G-44 ■ <strong>Appendix</strong> G Vector Processors<br />

Cray, who worked on the 6600 and the 7600 at CDC, founded Cray Research<br />

and introduced the Cray-1 in 1976 (see Russell [1978]). The Cray-1 used a<br />

vector-register architecture to significantly lower start-up overhead and to reduce<br />

memory bandwidth requirements. He also had efficient support for nonunit stride<br />

and invented chaining. Most importantly, the Cray-1 was the fastest scalar processor<br />

in the world at that time. This matching of good scalar and vector performance<br />

was probably the most significant factor in making the Cray-1 a success.<br />

Some customers bought the processor primarily for its outstanding scalar performance.<br />

Many subsequent vector processors are based on the architecture of this<br />

first commercially successful vector processor. Baskett and Keller [1977] provide<br />

a good evaluation of the Cray-1.<br />

In 1981, CDC started shipping the CYBER 205 (see Lincoln [1982]). The<br />

205 had the same basic architecture as the STAR, but offered improved performance<br />

all around as well as expandability of the vector unit with up to four lanes,<br />

each with multiple functional units and a wide load-store pipe that provided multiple<br />

words per clock. The peak performance of the CYBER 205 greatly exceeded<br />

the performance of the Cray-1. However, on real programs, the performance difference<br />

was much smaller.<br />

The CDC STAR processor and its descendant, the CYBER 205, were<br />

memory-memory vector processors. To keep the hardware simple and support the<br />

high bandwidth requirements (up to three memory references per floating-point<br />

operation), these processors did not efficiently handle nonunit stride. While most<br />

loops have unit stride, a nonunit stride loop had poor performance on these processors<br />

because memory-to-memory data movements were required to gather<br />

together (and scatter back) the nonadjacent vector elements; these operations<br />

used special scatter-gather instructions. In addition, there was special support for<br />

sparse vectors that used a bit vector to represent the zeros and nonzeros and a<br />

dense vector of nonzero values. These more complex vector operations were slow<br />

because of the long memory latency, and it was often faster to use scalar mode for<br />

sparse or nonunit stride operations. Schneck [1987] described several of the early<br />

pipelined processors (e.g., Stretch) through the first vector processors, including<br />

the 205 and Cray-1. Dongarra [1986] did another good survey, focusing on more<br />

recent processors.<br />

In 1983, Cray Research shipped the first Cray X-MP (see Chen [1983]). With<br />

an improved clock rate (9.5 ns versus 12.5 ns on the Cray-1), better chaining support,<br />

and multiple memory pipelines, this processor maintained the Cray<br />

Research lead in supercomputers. The Cray-2, a completely new design configurable<br />

with up to four processors, was introduced later. A major feature of the<br />

Cray-2 was the use of DRAM, which made it possible to have very large memories.<br />

The first Cray-2 with its 256M word (64-bit words) memory contained more<br />

memory than the total of all the Cray machines shipped to that point! The Cray-2<br />

had a much faster clock than the X-MP, but also much deeper pipelines; however,<br />

it lacked chaining, had an enormous memory latency, and had only one memory<br />

pipe per processor. In general, the Cray-2 is only faster than the Cray X-MP on<br />

problems that require its very large main memory.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!