Appendix G - Clemson University
Appendix G - Clemson University
Appendix G - Clemson University
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
G-44 ■ <strong>Appendix</strong> G Vector Processors<br />
Cray, who worked on the 6600 and the 7600 at CDC, founded Cray Research<br />
and introduced the Cray-1 in 1976 (see Russell [1978]). The Cray-1 used a<br />
vector-register architecture to significantly lower start-up overhead and to reduce<br />
memory bandwidth requirements. He also had efficient support for nonunit stride<br />
and invented chaining. Most importantly, the Cray-1 was the fastest scalar processor<br />
in the world at that time. This matching of good scalar and vector performance<br />
was probably the most significant factor in making the Cray-1 a success.<br />
Some customers bought the processor primarily for its outstanding scalar performance.<br />
Many subsequent vector processors are based on the architecture of this<br />
first commercially successful vector processor. Baskett and Keller [1977] provide<br />
a good evaluation of the Cray-1.<br />
In 1981, CDC started shipping the CYBER 205 (see Lincoln [1982]). The<br />
205 had the same basic architecture as the STAR, but offered improved performance<br />
all around as well as expandability of the vector unit with up to four lanes,<br />
each with multiple functional units and a wide load-store pipe that provided multiple<br />
words per clock. The peak performance of the CYBER 205 greatly exceeded<br />
the performance of the Cray-1. However, on real programs, the performance difference<br />
was much smaller.<br />
The CDC STAR processor and its descendant, the CYBER 205, were<br />
memory-memory vector processors. To keep the hardware simple and support the<br />
high bandwidth requirements (up to three memory references per floating-point<br />
operation), these processors did not efficiently handle nonunit stride. While most<br />
loops have unit stride, a nonunit stride loop had poor performance on these processors<br />
because memory-to-memory data movements were required to gather<br />
together (and scatter back) the nonadjacent vector elements; these operations<br />
used special scatter-gather instructions. In addition, there was special support for<br />
sparse vectors that used a bit vector to represent the zeros and nonzeros and a<br />
dense vector of nonzero values. These more complex vector operations were slow<br />
because of the long memory latency, and it was often faster to use scalar mode for<br />
sparse or nonunit stride operations. Schneck [1987] described several of the early<br />
pipelined processors (e.g., Stretch) through the first vector processors, including<br />
the 205 and Cray-1. Dongarra [1986] did another good survey, focusing on more<br />
recent processors.<br />
In 1983, Cray Research shipped the first Cray X-MP (see Chen [1983]). With<br />
an improved clock rate (9.5 ns versus 12.5 ns on the Cray-1), better chaining support,<br />
and multiple memory pipelines, this processor maintained the Cray<br />
Research lead in supercomputers. The Cray-2, a completely new design configurable<br />
with up to four processors, was introduced later. A major feature of the<br />
Cray-2 was the use of DRAM, which made it possible to have very large memories.<br />
The first Cray-2 with its 256M word (64-bit words) memory contained more<br />
memory than the total of all the Cray machines shipped to that point! The Cray-2<br />
had a much faster clock than the X-MP, but also much deeper pipelines; however,<br />
it lacked chaining, had an enormous memory latency, and had only one memory<br />
pipe per processor. In general, the Cray-2 is only faster than the Cray X-MP on<br />
problems that require its very large main memory.