A Scalable Computer Architecture for<br />
On-line Pulsar Search on the SKA<br />
- Draft Version -<br />
G. Knittel, A. Horneffer, MPI for Radio Astronomy, Bonn<br />
with help from:<br />
M. Kramer, B. Klein, R. Eatough
GPU-Based Pulsar Timing<br />
FFT – DeDisp – IFFT – Full Stokes – Folding<br />
[Block diagram; recoverable labels: From ADC, NIC, DMA, System Memory, User-Mem, CPU, Fast Bit-Reversal, GPU (two), DISK]
GPU-Processing<br />
Processing<br />
FFT<br />
Coherent Dedispersion<br />
DM<br />
IFFT<br />
Full Stokes Parameters<br />
Folding<br />
Pulse Period
Performance:<br />
610M Samples/s<br />
(2 GPUs)<br />
GPU-Based Pulsar Timing
Pulsar Search<br />
FFT<br />
Coherent Dedispersion<br />
DM<br />
IFFT<br />
Full Stokes Parameters<br />
Folding<br />
Pulse Period
Pulsar Search<br />
FFT<br />
Coherent Dedispersion<br />
Trial DM<br />
Loop<br />
IFFT<br />
Stokes I Parameter<br />
Folding<br />
Trial<br />
Pulse Period
Binary Systems Search<br />
FFT<br />
Coherent Dedispersion<br />
Trial DM<br />
Loop<br />
IFFT<br />
Stokes I Parameter<br />
Folding<br />
Trial<br />
Pulse Period,<br />
Orbital<br />
Parameters
Pulsar Search<br />
• Idea:<br />
Pulsar Search by Massively-Parallel<br />
Folding (in Time Domain)<br />
• Binary Systems:<br />
Make Length of Phase Bins variable<br />
(similar to Time Sequence Resampling)<br />
• High-Dimensional Search Space:<br />
Complete Coverage not possible.
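The folding idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `fold`, `peak_score` and `search`, the 16 phase bins, and the peak-minus-mean score are all hypothetical choices made here. Periods are given in units of samples. In the proposed machine each trial period would be handled by its own folding unit in parallel; the loop over trial periods stands in for that parallelism.

```python
def fold(samples, period, nbins=16):
    """Fold a time series at one trial period (in samples) into phase bins."""
    sums = [0.0] * nbins
    counts = [0] * nbins
    for i, value in enumerate(samples):
        phase = (i / period) % 1.0                 # pulse phase in [0, 1)
        b = min(int(phase * nbins), nbins - 1)     # phase bin index
        sums[b] += value
        counts[b] += 1
    # Average per bin so that bins with more samples are not favoured.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def peak_score(profile):
    """Hypothetical detection score: height of the peak above the mean."""
    return max(profile) - sum(profile) / len(profile)


def search(samples, trial_periods, nbins=16):
    """Return the trial period whose folded profile has the strongest peak."""
    return max(trial_periods, key=lambda p: peak_score(fold(samples, p, nbins)))
```

With a synthetic pulse train of period 100 samples, `search(samples, [80, 90, 100, 110, 120])` picks 100, because only the matching period keeps the pulses in one phase bin.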
Pulsar Search<br />
• Frequency Domain:<br />
Pulsar Search by Massively-Parallel<br />
Harmonic Summation<br />
• Binary Systems:<br />
Process Range of neighboring Frequency<br />
Bins<br />
• Use same Hardware!<br />
• Not completely worked out yet.
Pulsar Search on the SKA<br />
• Add Coherent Beamforming
Pulsar Search on the SKA<br />
Polyphase Filterbank – FFT – Coherent Beamforming – Coherent Dedispersion – IFFT – Stokes I Parameter – Folding<br />
Polyphase Filterbank – FFT – Coherent Beamforming – Coherent Dedispersion – Power Spectrum – Harmonic Sum
Pulsar Search on the SKA<br />
Polyphase Filterbank (FPGA) – FFT (CPU) – Coherent Beamforming (CPU) – Coherent Dedispersion (GPU) – IFFT (GPU) – Stokes I Parameter (GPU) – Folding (ASIC)<br />
Polyphase Filterbank – FFT – Coherent Beamforming – Coherent Dedispersion – Power Spectrum – Harmonic Sum
Mode of Operation (Time Domain)<br />
Telescope 0<br />
Telescope 1<br />
Telescope 127
Parallelization: Timeslicing<br />
Telescope 0<br />
Telescope 1<br />
Telescope 127<br />
Timeslice to „Rank 0“ | Timeslice to „Rank 1“ | Timeslice to „Rank 0“
Parallelization<br />
Telescope 0<br />
Telescope 1<br />
Telescope 127<br />
Required Processing Time defines Number of Ranks
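The relation between processing time and rank count can be written down directly. The sketch below assumes that consecutive timeslices are handed out round-robin and that a rank must be free again by the time its next timeslice arrives; the function names and the example durations are hypothetical and do not come from the slides.

```python
import math


def ranks_needed(processing_time_s, timeslice_s):
    """Smallest number of ranks so that each rank finishes one timeslice
    before its next one arrives (assumes round-robin distribution)."""
    return math.ceil(processing_time_s / timeslice_s)


def rank_for_timeslice(slice_index, num_ranks):
    """Round-robin assignment of a timeslice to a rank (an assumption)."""
    return slice_index % num_ranks
```

For example, if processing one 1 s timeslice took 64 s (an invented figure), `ranks_needed(64.0, 1.0)` gives 64 ranks, and timeslice 65 would go to rank 1.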
Parallelization<br />
Data from all Telescopes<br />
Chain Network<br />
Rank 0 Rank 1 Rank 2 Rank 63<br />
Architecture scales endlessly
Parallelization<br />
Telescope n<br />
Polyphase Filterbank<br />
16 Subbands<br />
Compute Node 15<br />
Compute Node 1<br />
Compute Node 0
Parallelization<br />
One Rank<br />
Compute Node 0<br />
Data Capture Phase<br />
Compute Node 1<br />
Compute Node 15<br />
8 Telescopes each
Parallelization<br />
One Rank<br />
Compute Node 0<br />
Filtering and<br />
Subband Distribution<br />
Compute Node 1<br />
Ring Network<br />
Compute Node 15
Coherent Beamforming<br />
[Diagram of Compute Node k; recoverable labels: FPGA (Polyphase Filterbank, Data Exchange) with Local Mem; System Mem holding Tel 0 … Tel 127 Subb k and Beam 0 … Beam n Subb k; two CPUs (FFT, Coherent Beamforming)]
Coherent Beamforming on CPUs<br />
• Performance using AVX:<br />
128 Input Spectra, single-precision float,<br />
256k Elements<br />
• 128 Beams of same Size:<br />
2.1 s per Core @ 3.5 GHz (preliminary Results)
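As an illustration of what coherent beamforming in the frequency domain computes, here is a plain-Python sketch. It assumes the simplest form, a phase rotation per telescope and channel followed by a sum over telescopes; the function name `beamform`, its argument layout and the delay-based steering are hypothetical, and the real code on the slides uses AVX and is not shown there.

```python
import cmath


def beamform(spectra, freqs, delays):
    """Form beams from per-telescope spectra.

    spectra[t][k]: complex spectrum of telescope t at channel k
    freqs[k]:      channel frequency in Hz
    delays[b][t]:  steering delay in seconds of telescope t for beam b
    Returns beams[b][k].
    """
    beams = []
    for beam_delays in delays:
        beam = []
        for k, f in enumerate(freqs):
            acc = 0j
            for t, tau in enumerate(beam_delays):
                # Undo the geometric delay by a phase rotation, then add coherently.
                acc += spectra[t][k] * cmath.exp(2j * cmath.pi * f * tau)
            beam.append(acc)
        beams.append(beam)
    return beams
```

For scale: a direct sum over 128 inputs for 128 beams and 256k channels is 128 × 128 × 262,144 ≈ 4.3 × 10^9 complex multiply-accumulates, so 2.1 s per core would correspond to roughly 2 × 10^9 of them per second, if the measurement used such a direct sum.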
GPU-Processing<br />
[Diagram of Compute Node k; recoverable labels: GPU0 … GPUn with Video Mem (DeDisp, IFFT, SI); System Mem holding Tel 0 … Tel 127 Subb k and Beam 0 … Beam n Subb k; FPGA (Polyphase Filterbank, Data Exchange) with Local Mem; two CPUs (FFT, Coherent Beamforming)]
GPU-Processing<br />
Processing<br />
• Performance:<br />
2 Spectra, horz/vert, single-precision float,<br />
4M Elements<br />
• Total Power Time Sequence of same Size:<br />
5 ms
GPU-Processing<br />
Processing<br />
• How to output the Results to the ASICs?<br />
• All PCIe-Slots are already taken (GPUs,<br />
FPGAs)<br />
• Write to Screen Buffer, to be output via<br />
Monitor Cable
GPU-Processing<br />
Processing<br />
Mini‐DisplayPort<br />
17.28 Gbit/s<br />
~ 70 Gbit/s<br />
Equiv. 1 PCIe x16 Slot
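The figures on this slide are consistent with standard link arithmetic. The sketch below rests on three assumptions that the slide does not state: DisplayPort 1.2 (HBR2) signalling, four video outputs per GPU, and a PCIe 2.0 slot as the comparison.

```python
# DisplayPort 1.2 (HBR2): 4 lanes at 5.4 Gbit/s each, 8b/10b line coding.
DP_LANES = 4
HBR2_GBIT_PER_LANE = 5.4
CODING_EFFICIENCY = 8 / 10

dp_payload_gbit = DP_LANES * HBR2_GBIT_PER_LANE * CODING_EFFICIENCY

# Assumption: four outputs per GPU, which would explain "~ 70 Gbit/s".
OUTPUTS_PER_GPU = 4
gpu_video_out_gbit = OUTPUTS_PER_GPU * dp_payload_gbit

# PCIe 2.0 x16: 16 lanes at 5 GT/s each, 8b/10b line coding, one direction.
pcie2_x16_gbit = 16 * 5 * 8 // 10

print(f"one Mini-DisplayPort: {dp_payload_gbit:.2f} Gbit/s")
print(f"four outputs:         {gpu_video_out_gbit:.2f} Gbit/s")
print(f"PCIe 2.0 x16:         {pcie2_x16_gbit} Gbit/s")
```

Under these assumptions one output carries 17.28 Gbit/s, four carry 69.12 Gbit/s, and a PCIe 2.0 x16 slot carries 64 Gbit/s per direction, which matches the slide's comparison.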
GPU-Processing<br />
Processing<br />
• Does it work? Yes, but...<br />
GPU Kernel<br />
Screen<br />
Via DVI:<br />
2.7 Gbit/s<br />
(Video)
Massively-Parallel Folding<br />
[Diagram; recoverable labels: GPU0 with Video Mem; SI Time Sequence; Monitor Cable; ASIC0 … ASICn, each with Local Mem; Massively-Parallel Folding]
Massively-Parallel Folding<br />
Compute Node 0<br />
Compute Node 1<br />
ASIC‐PC<br />
Up to 16 Monitor Cables<br />
Compute Node 15<br />
GPU‐PC
Folding – Time Domain<br />
Hypothetical<br />
Pulse Period P<br />
[Figure; axis: time]<br />
Detects Solitary Pulsars<br />
having P ± small ΔP
Folding – Acceleration Search<br />
Hypothetical<br />
Acceleration<br />
Variable Bin Length<br />
(# of Samples per Bin)<br />
[Figure; axis: time]<br />
Equiv. to Time Sequence Resampling
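Folding with a variable bin length can be sketched as follows. The model is an assumption made for illustration: the pulse period is taken to grow linearly with the sample index, P(i) = P0 · (1 + stretch · i), as it would under a constant line-of-sight acceleration. The names `phase`, `fold_accelerated`, `peak_score` and the parameter `stretch` are hypothetical.

```python
import math


def phase(i, p0, stretch):
    """Accumulated pulse phase (in turns) at sample i for a period that
    grows linearly: P(i) = p0 * (1 + stretch * i)."""
    if stretch == 0.0:
        return i / p0
    # Integral of 1 / P(i) from 0 to i.
    return math.log1p(stretch * i) / (stretch * p0)


def fold_accelerated(samples, p0, stretch, nbins=16):
    """Fold with phase bins whose length in samples follows the period drift."""
    sums = [0.0] * nbins
    counts = [0] * nbins
    for i, value in enumerate(samples):
        b = min(int((phase(i, p0, stretch) % 1.0) * nbins), nbins - 1)
        sums[b] += value
        counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def peak_score(profile):
    """Hypothetical detection score: height of the peak above the mean."""
    return max(profile) - sum(profile) / len(profile)
```

Because the bin boundaries follow the drifting period, later bins cover more samples than earlier ones, which has the same effect as resampling the time sequence before folding at a fixed period.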
Harmonic Summation<br />
[Figure; axis: freq; labels: Δf, 2Δf, 3Δf, 4Δf and f0, 2f0, 3f0, 4f0]<br />
Detects Solitary Pulsars<br />
between f0 and f0 + Δf
Harmonic Sum - Acceleration Search<br />
[Figure; axis: freq; labels: Δf, 2Δf, 3Δf, 4Δf and f0, 2f0, 3f0, 4f0]<br />
Detects Binary Systems<br />
with max. Acceleration Δf
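Harmonic summation over a power spectrum, with and without a range of neighboring bins, can be sketched as below. This is one possible reading of the slides, not a confirmed design: the function name `harmonic_sum`, the parameter `spread`, and the choice to widen the window in proportion to the harmonic number are assumptions.

```python
def harmonic_sum(power, k, nharm=4, spread=0):
    """Sum spectral power at the harmonics of fundamental bin k.

    spread=0 is the plain harmonic sum over bins k, 2k, 3k, ...
    spread>0 also collects the neighboring bins h*k .. h*k + h*spread at
    harmonic h, so that power smeared by a frequency drift is included.
    """
    total = 0
    for h in range(1, nharm + 1):
        start = h * k
        stop = min(start + h * spread + 1, len(power))
        total += sum(power[start:stop])
    return total


def best_fundamental(power, candidates, nharm=4, spread=0):
    """Return the candidate bin with the largest harmonic sum."""
    return max(candidates, key=lambda k: harmonic_sum(power, k, nharm, spread))
```

The window at harmonic h is h times wider than at the fundamental because a drift of the fundamental by one bin moves its h-th harmonic by h bins.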
Folding Processor<br />
Broadcast Bus<br />
SI Time Sequence or Power Spectrum<br />
Accumulator<br />
Programmable Set of Counters and Incrementers<br />
Memory 64 x 32 bits<br />
To / from local Memory
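A behavioral model helps to see how such a processor could work. Only the 64 x 32-bit memory, the accumulator and the broadcast bus are taken from the slide; everything else is an assumption for illustration: the class name, the 32-bit phase accumulator that stands in for the "counters and incrementers", and the use of its top 6 bits as the memory address.

```python
class FoldingProcessor:
    """Hypothetical model: one trial period, 64 phase bins of 32 bits each."""

    NBINS = 64
    WORD_MASK = 0xFFFFFFFF

    def __init__(self, increment):
        self.increment = increment & self.WORD_MASK  # phase step per sample
        self.phase = 0                               # 32-bit phase accumulator
        self.memory = [0] * self.NBINS               # 64 x 32-bit accumulators

    def clock(self, sample):
        """Accumulate one broadcast sample into the current phase bin."""
        bin_index = self.phase >> 26                 # top 6 bits select 1 of 64 bins
        self.memory[bin_index] = (self.memory[bin_index] + sample) & self.WORD_MASK
        self.phase = (self.phase + self.increment) & self.WORD_MASK


def increment_for_period(period_samples):
    """Phase step so that the accumulator wraps once per trial period."""
    return round(2**32 / period_samples)


def broadcast(processors, samples):
    """Every sample on the bus is seen by all processors at once."""
    for sample in samples:
        for proc in processors:
            proc.clock(sample)
```

Each processor differs only in its programmed increment, so one broadcast stream can serve any number of trial periods.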
Pulsar Detector ASIC<br />
10^3 - 10^5 Folding Processors
ASIC Network<br />
Ring Network<br />
Rank 0 Rank 1 Rank 2<br />
Rank 63<br />
ASIC‐PC
RFI Mitigation<br />
• Subband-relative:<br />
Accumulation is per Subband<br />
• Beam-relative
The Pulsar Search Machine<br />
• 128 Telescopes<br />
• 128 Beams<br />
• 100 DMs<br />
• 50,000 Orbits<br />
• 6.4 × 10^8 hypothetical Pulsars
The Pulsar Search Machine<br />
• PC Cluster<br />
• Switch-less Design, helps Scalability<br />
• GPU-PC + ASIC-PC = Compute Node<br />
• 64 Ranks of 16 Compute Nodes<br />
• 2048 PCs, 4096 CPUs, 8192 GPUs,<br />
16384 ASICs<br />
• 32.5 M€
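The counts on the two "Pulsar Search Machine" slides can be cross-checked with a few lines of arithmetic. The totals are taken from the slides; the per-PC figures are derived here under the assumption that CPUs, GPUs and ASICs are spread evenly over the machines, which the slides do not state.

```python
# Search space: one hypothesis per (beam, DM, orbit) combination.
BEAMS, DMS, ORBITS = 128, 100, 50_000
hypotheses = BEAMS * DMS * ORBITS

# Cluster layout: each Compute Node is one GPU-PC plus one ASIC-PC.
RANKS, NODES_PER_RANK = 64, 16
compute_nodes = RANKS * NODES_PER_RANK
pcs = 2 * compute_nodes

# Totals from the slide, divided evenly (an assumption).
TOTAL_CPUS, TOTAL_GPUS, TOTAL_ASICS = 4096, 8192, 16384
cpus_per_pc = TOTAL_CPUS // pcs
gpus_per_gpu_pc = TOTAL_GPUS // compute_nodes
asics_per_asic_pc = TOTAL_ASICS // compute_nodes

print(f"hypotheses: {hypotheses:.1e}")
print(f"compute nodes: {compute_nodes}, PCs: {pcs}")
print(f"per PC: {cpus_per_pc} CPUs, {gpus_per_gpu_pc} GPUs, {asics_per_asic_pc} ASICs")
```

The product 128 × 100 × 50,000 gives 6.4 × 10^8 hypotheses, and 64 × 16 nodes of two PCs each gives the 2048 PCs on the slide. An even split would mean 2 CPUs per PC, 8 GPUs per GPU-PC and 16 ASICs per ASIC-PC.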
Thanks!
The Pulsar Search Machine
Folding Processor
Costs
Supercomputer Costs 2005<br />
• Sandia National Laboratories Red Storm: $90<br />
million<br />
• Los Alamos National Laboratory ASCI Q: $215<br />
million<br />
• Earth Simulator Center, Japan: $250 million<br />
• IBM Blue Gene/L: $290 million<br />
various (unreliable) Internet Sources