Appendix G - Clemson University
Appendix G Vector Processors
Typically, start-up penalties are higher for load-store units than for
arithmetic functional units (over 100 clock cycles on some processors). For
VMIPS we will assume a start-up time of 12 clock cycles, the same as the Cray-1.
Figure G.6 summarizes the start-up penalties for VMIPS vector operations.
To maintain an initiation rate of one word fetched or stored per clock, the memory
system must be capable of producing or accepting one word every clock cycle. This is
usually done by creating multiple memory banks, as discussed in Section 5.8. As
we will see in the next section, having a significant number of banks is useful for
dealing with vector loads or stores that access rows or columns of data.
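The effect of the bank count on the initiation rate can be illustrated with a small simulation. This is only a sketch under stated assumptions: words are interleaved across banks by word index, at most one access is issued per clock, and the function name `cycles_to_issue` and the example values (64 elements, a bank busy time of 6 clocks) are hypothetical, not taken from VMIPS or any real machine.

```python
def cycles_to_issue(n_elements, n_banks, bank_busy):
    """Clock cycles needed to issue n_elements sequential word accesses.

    Assumptions (illustrative only): word i maps to bank i % n_banks,
    at most one access is issued per clock, and a bank cannot accept a
    new access until bank_busy clocks after its previous one started.
    """
    free_at = [0] * n_banks  # first clock at which each bank is free again
    clock = 0                # next clock at which an access may be issued
    for i in range(n_elements):
        bank = i % n_banks
        start = max(clock, free_at[bank])  # stall if the bank is still busy
        free_at[bank] = start + bank_busy
        clock = start + 1
    return clock


# With at least as many banks as the busy time, one word issues per clock.
print(cycles_to_issue(64, 8, 6))
# With fewer banks than the busy time, accesses stall on busy banks.
print(cycles_to_issue(64, 4, 6))
```

In this model, sequential accesses sustain one word per clock only when the number of banks is at least the bank busy time measured in clocks.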
Most vector processors use memory banks rather than simple interleaving for
three primary reasons:
1. Many vector computers support multiple loads or stores per clock, and the
memory bank cycle time is often several times larger than the CPU cycle
time. To support multiple simultaneous accesses, the memory system needs to
have multiple banks and be able to control the addresses to the banks
independently.
2. As we will see in the next section, many vector processors support the ability
to load or store data words that are not sequential. In such cases, independent
bank addressing, rather than interleaving, is required.
3. Many vector computers support multiple processors sharing the same memory
system, and so each processor will be generating its own independent
stream of addresses.
In combination, these features lead to a large number of independent memory
banks, as shown by the following example.
Example
The Cray T90 has a CPU clock cycle of 2.167 ns and in its largest configuration
(Cray T932) has 32 processors, each capable of generating four loads and two
stores per CPU clock cycle. The cycle time of the SRAMs used in the memory
system is 15 ns. Calculate the minimum number of memory banks required to
allow all CPUs to run at full memory bandwidth.
Answer
The maximum number of memory references each cycle is 192 (32 CPUs times 6
references per CPU). Each SRAM bank is busy for 15/2.167 = 6.92 clock cycles,
which we round up to 7 CPU clock cycles. Because every reference occupies its
bank for 7 cycles while 192 new references arrive each cycle, we require a
minimum of 192 × 7 = 1344 memory banks!
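The arithmetic in this answer can be reproduced with a short sketch. The function name `min_banks` and its parameter names are hypothetical, chosen only for illustration; the numbers passed in are those given in the example.

```python
import math


def min_banks(n_cpus, refs_per_cpu_per_clock, cpu_cycle_ns, bank_cycle_ns):
    """Minimum number of banks needed to sustain full memory bandwidth.

    Each reference keeps a bank busy for the bank cycle time, rounded up
    to a whole number of CPU clocks, and a new batch of references
    arrives every CPU clock.
    """
    refs_per_clock = n_cpus * refs_per_cpu_per_clock
    busy_clocks = math.ceil(bank_cycle_ns / cpu_cycle_ns)
    return refs_per_clock * busy_clocks


# Cray T932 figures from the example: 32 CPUs, 4 loads + 2 stores per clock,
# 2.167 ns CPU clock, 15 ns SRAM cycle time.
print(min_banks(32, 4 + 2, 2.167, 15.0))  # 1344
```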
The Cray T932 actually has 1024 memory banks, and so the early models
could not sustain full bandwidth to all CPUs simultaneously. A subsequent memory
upgrade replaced the 15 ns asynchronous SRAMs with pipelined synchronous
SRAMs that more than halved the memory cycle time, thereby providing
sufficient bandwidth.
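A quick check suggests why halving the cycle time is enough. The exact cycle time of the upgraded SRAMs is not stated here, so the sketch below assumes 7.5 ns (exactly half of 15 ns) as an upper bound; the function name `banks_needed` is hypothetical.

```python
import math


def banks_needed(refs_per_clock, cpu_cycle_ns, bank_cycle_ns):
    """Banks required when each reference occupies a bank for the bank
    cycle time, rounded up to whole CPU clocks."""
    return refs_per_clock * math.ceil(bank_cycle_ns / cpu_cycle_ns)


REFS_PER_CLOCK = 32 * 6      # 32 CPUs, 6 references each per clock
CPU_CYCLE_NS = 2.167
AVAILABLE_BANKS = 1024       # banks in the Cray T932

original = banks_needed(REFS_PER_CLOCK, CPU_CYCLE_NS, 15.0)
# Assumed upper bound: "more than halved" implies a cycle time below 7.5 ns.
upgraded = banks_needed(REFS_PER_CLOCK, CPU_CYCLE_NS, 7.5)

print(original, original <= AVAILABLE_BANKS)  # 1344 False
print(upgraded, upgraded <= AVAILABLE_BANKS)  # 768 True
```

Under this assumption each bank is busy for at most 4 CPU clocks, so 768 banks suffice and the existing 1024 banks can sustain full bandwidth.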