

Since the demise of the HTMT project, processor-in-memory technology has advanced as a result of the HPCS program; the other technologies (RSFQ processors and memory, the optical Data Vortex, and holographic storage) have languished due to limited funding.

Latency, Bandwidth, Parallelism

Attaining high performance in large parallel systems is a challenge. Communications must be balanced with computation; with insufficient bandwidth, nodes are often stalled while waiting for input. Excess bandwidth would be wasted, but this is rarely a practical problem: as the number of "hops" between nodes increases, so does the bandwidth consumed by each message. The result is that aggregate system bandwidth should increase not linearly with the number of nodes but as N log N (Clos, hypercube) or N² (toroidal mesh) to sustain the same level of random node-to-node messages per node. Large systems tend to suffer from insufficient bandwidth from the typical application perspective.
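As an illustrative sketch (not from the assessment), the scaling can be seen by multiplying each node's injection rate by the average number of hops a message traverses; for a binary hypercube the average path length is log2(N)/2, so total link-bandwidth demand grows as N log N rather than linearly. The node counts and injection rate below are assumed values.

    import math

    # Illustrative sketch: aggregate link-bandwidth demand under uniform random
    # traffic on a binary hypercube. All parameter values are assumed.
    def hypercube_aggregate_demand(nodes, injection_rate=1.0):
        """Total link traffic = nodes * injection rate * average hop count."""
        avg_hops = math.log2(nodes) / 2   # mean Hamming distance between random nodes
        return nodes * injection_rate * avg_hops

    for n in (64, 1024, 16384):
        print(n, hypercube_aggregate_demand(n))   # grows as N log N, not as N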

Physical size is another limitation. Burton Smith likes to cite Little's Law:

latency × bandwidth = concurrency

[Figure: HTMT block diagram. Compute intensive (flow control driven) portion: HSPs with associated CRAM and SRAM.]

in communications systems which transport messages from input to output without either creating or destroying them. High bandwidth contributes to high throughput; thus, high latencies are tolerated only if large numbers of messages can be generated and processed concurrently. In practice, there are limits to the degree of concurrency supportable by a given application at any one time. Low latency is desirable, but latency is limited by speed-of-light considerations, so the larger the system, the higher the latency between randomly selected nodes. As a result, applications must be capable of high degrees of parallelism to take advantage of physically large systems.
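A minimal worked example of this relationship, using assumed numbers rather than figures from the assessment: with a remote latency of 1,000 cycles and a link that accepts one message per cycle, a node must keep roughly 1,000 messages in flight to avoid stalling.

    # Minimal sketch of Little's Law: concurrency = latency x bandwidth.
    # The latency and bandwidth values are assumed for illustration only.
    def required_concurrency(latency_cycles, bandwidth_msgs_per_cycle):
        """Messages that must be outstanding to keep the link fully utilized."""
        return latency_cycles * bandwidth_msgs_per_cycle

    print(required_concurrency(1000, 1.0))   # -> 1000.0 messages in flight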

[Figure, continued: data intensive (data flow driven) portion: smart main memory (processor-memory, P-M, nodes) connected through the Data Vortex network.]

Latencies (in processor cycles):
Inter-HSP: 400 - 1,000
Intra-execution pipeline: 10 - 100
To CRAM: 40 - 400
To SRAM: 400 - 1,000
To DRAM: 10,000 - 40,000
DRAM to HRAM: 1×10^6 - 4×10^6
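Applying Little's Law to these figures (an illustrative calculation, assuming a node can issue one request per cycle, which is an assumption rather than a figure from the assessment) shows how quickly the required concurrency grows as references move down the hierarchy.

    # Illustrative only: concurrency needed to hide each latency level,
    # assuming an issue bandwidth of one request per cycle.
    latency_cycles = {
        "intra-execution pipeline": (10, 100),
        "to CRAM": (40, 400),
        "to SRAM / inter-HSP": (400, 1_000),
        "to DRAM": (10_000, 40_000),
        "DRAM to HRAM": (1_000_000, 4_000_000),
    }
    requests_per_cycle = 1.0
    for level, (low, high) in latency_cycles.items():
        # Little's Law: outstanding requests = latency x bandwidth
        print(f"{level}: {low * requests_per_cycle:.0f} - {high * requests_per_cycle:.0f} outstanding requests")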

