04.02.2015 Views

A Scalable Computer Architecture for On-line Pulsar Search on the ...


A Scalable Computer Architecture for On-line Pulsar Search on the SKA

- Draft Version -

G. Knittel, A. Horneffer, MPI for Radio Astronomy Bonn

with help from: M. Kramer, B. Klein, R. Eatough


GPU-Based Pulsar Timing

FFT – DeDisp – IFFT – Full Stokes – Folding

[Block diagram. Labels: From ADC, NIC, System Memory, CPU, DMA, User-Mem, Fast Bit-Reversal, GPU, GPU, DISK]


GPU-Processing

[Pipeline diagram: FFT → Coherent Dedispersion (input: DM) → IFFT → Full Stokes Parameters → Folding (input: Pulse Period)]


GPU-Based Pulsar Timing

Performance: 610M Samples/s (2 GPUs)


Pulsar Search

[Pipeline diagram: FFT → Coherent Dedispersion (input: DM) → IFFT → Full Stokes Parameters → Folding (input: Pulse Period)]


Pulsar Search

[Pipeline diagram: FFT → Coherent Dedispersion (input: Trial DM, in a Loop) → IFFT → Stokes I Parameter → Folding (input: Trial Pulse Period)]


Binary Systems Search

[Pipeline diagram: FFT → Coherent Dedispersion (input: Trial DM, in a Loop) → IFFT → Stokes I Parameter → Folding (input: Trial Pulse Period, Orbital Parameters)]


Pulsar Search

• Idea:
Pulsar Search by Massively-Parallel Folding (in Time Domain)
• Binary Systems:
Make Length of Phase Bins variable (similar to Time Sequence Resampling)
• High-Dimensional Search Space:
Complete Coverage not possible.


Pulsar Search

• Frequency Domain:
Pulsar Search by Massively-Parallel Harmonic Summation
• Binary Systems:
Process Range of neighboring Frequency Bins
• Use same Hardware!
• Not completely worked out yet.


Pulsar Search on the SKA

• Add Coherent Beamforming


Pulsar Search on the SKA

Folding pipeline: Polyphase Filterbank → FFT → Coherent Beamforming → Coherent Dedispersion → IFFT → Stokes I Parameter → Folding

Harmonic-sum pipeline: Polyphase Filterbank → FFT → Coherent Beamforming → Coherent Dedispersion → Power Spectrum → Harmonic Sum


Pulsar Search on the SKA

Folding pipeline, with the hardware assigned to each stage: Polyphase Filterbank (FPGA) → FFT (CPU) → Coherent Beamforming (CPU) → Coherent Dedispersion (GPU) → IFFT (GPU) → Stokes I Parameter (GPU) → Folding (ASIC)

Harmonic-sum pipeline: Polyphase Filterbank → FFT → Coherent Beamforming → Coherent Dedispersion → Power Spectrum → Harmonic Sum


Mode of Operation (Time Domain)

[Diagram: data streams from Telescope 0, Telescope 1, ..., Telescope 127]


Parallelization: Timeslicing

[Diagram: data streams from Telescope 0, Telescope 1, ..., Telescope 127, cut into consecutive timeslices: timeslice to "Rank 0", timeslice to "Rank 1", timeslice to "Rank 0"]
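The timeslicing scheme can be sketched as a round-robin assignment. This is a minimal illustration, assuming that timeslice i simply goes to rank i modulo the number of ranks; the function name `rank_for_timeslice` is hypothetical and does not come from the slides.

```python
def rank_for_timeslice(slice_index: int, num_ranks: int) -> int:
    """Return the rank that receives a given timeslice (round-robin, an assumption)."""
    if num_ranks <= 0:
        raise ValueError("num_ranks must be positive")
    return slice_index % num_ranks


# Two ranks, as in the diagram: slices go to Rank 0, Rank 1, Rank 0, ...
schedule = [rank_for_timeslice(i, num_ranks=2) for i in range(3)]
print(schedule)
```

With two ranks, three consecutive timeslices land on ranks 0, 1 and 0, which matches the labels in the diagram.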


Parallelization

[Diagram: data streams from Telescope 0, Telescope 1, ..., Telescope 127]

Required Processing Time defines Number of Ranks
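One way to read this statement: a new timeslice arrives every slice duration, so enough ranks are needed that one is free whenever a slice arrives. The sketch below assumes each rank works on exactly one timeslice at a time; the function name and the example numbers are hypothetical.

```python
import math


def required_ranks(processing_time_s: float, timeslice_s: float) -> int:
    """Ranks needed so that a free rank exists for every incoming timeslice."""
    if processing_time_s <= 0 or timeslice_s <= 0:
        raise ValueError("times must be positive")
    return math.ceil(processing_time_s / timeslice_s)


# Hypothetical numbers: 64 s of processing per 1 s timeslice.
print(required_ranks(processing_time_s=64.0, timeslice_s=1.0))
```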


Parallelization

[Diagram: Data from all Telescopes enters a Chain Network: Rank 0 → Rank 1 → Rank 2 → ... → Rank 63]

Architecture scales endlessly


Parallelization

[Diagram: Telescope n → Polyphase Filterbank → 16 Subbands → Compute Node 0, Compute Node 1, ..., Compute Node 15]


Parallelization

One Rank: Data Capture Phase

[Diagram: Compute Node 0, Compute Node 1, ..., Compute Node 15, 8 Telescopes each]


Parallelization

One Rank: Filtering and Subband Distribution

[Diagram: Compute Node 0, Compute Node 1, ..., Compute Node 15, connected by a Ring Network]
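The capture and distribution phases can be pictured as a transpose: each node starts with all subbands of its 8 telescopes and ends with one subband of all 128 telescopes. The sketch below only tracks which data ends up where. It does not model the ring hops, and the contiguous telescope-to-node mapping is an assumption.

```python
NUM_NODES = 16            # compute nodes per rank (from the slides)
TELESCOPES_PER_NODE = 8   # telescopes captured by each node (from the slides)
NUM_SUBBANDS = 16         # polyphase filterbank outputs (from the slides)


def captured_telescopes(node: int) -> range:
    """Telescopes captured by a node (contiguous blocks are an assumption)."""
    start = node * TELESCOPES_PER_NODE
    return range(start, start + TELESCOPES_PER_NODE)


def distribute_subbands() -> dict:
    """Return {node: [(telescope, subband), ...]} after the exchange."""
    holdings = {node: [] for node in range(NUM_NODES)}
    for source in range(NUM_NODES):
        for telescope in captured_telescopes(source):
            for subband in range(NUM_SUBBANDS):
                # Subband k of every telescope is sent to compute node k.
                holdings[subband].append((telescope, subband))
    return holdings


holdings = distribute_subbands()
print(len(holdings[0]))  # number of (telescope, subband) pairs on node 0
```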


Coherent Beamforming

[Diagram of Compute Node k: FPGA with Local Mem (Polyphase Filterbank, Data Exchange) → System Mem holding Tel 0 Subb k, Tel 1 Subb k, Tel 2 Subb k, ..., Tel 127 Subb k → two CPUs (FFT, Coherent Beamforming) → Beam 0 Subb k, Beam 1 Subb k, Beam 2 Subb k, ..., Beam n Subb k]


Coherent Beamforming on CPUs

• Performance using AVX:
128 Input Spectra, single-precision float, 256k Elements
• 128 Beams of same Size:
2.1 s per Core @ 3.5 GHz (preliminary Results)
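As a rough illustration of what beamforming on spectra involves, the sketch below forms each beam as a phase-weighted sum of the telescope spectra. It is plain Python rather than AVX code, and the delay-based weights, the function name `beamform` and the example values are assumptions, not the authors' implementation.

```python
import cmath


def beamform(spectra, freqs_hz, delays_s):
    """Form beams from telescope spectra.

    spectra[t][k]  : complex spectrum of telescope t at frequency bin k
    freqs_hz[k]    : frequency of bin k
    delays_s[b][t] : delay to compensate for beam b and telescope t
    Returns beams[b][k].
    """
    beams = []
    for beam_delays in delays_s:
        beam = []
        for k, f in enumerate(freqs_hz):
            acc = 0j
            for t, tau in enumerate(beam_delays):
                # A delay tau in time is a phase ramp exp(-2*pi*i*f*tau) in
                # frequency, so multiplying by the conjugate ramp undoes it.
                acc += spectra[t][k] * cmath.exp(2j * cmath.pi * f * tau)
            beam.append(acc)
        beams.append(beam)
    return beams


# Two telescopes; telescope 1 sees the same signal 1 ms later.
freqs_hz = [100.0, 200.0]
tau = 1e-3
x0 = [1 + 2j, 3 - 1j]
x1 = [x * cmath.exp(-2j * cmath.pi * f * tau) for x, f in zip(x0, freqs_hz)]
beams = beamform([x0, x1], freqs_hz, delays_s=[[0.0, tau]])
print(beams[0])
```

If every beam is a weighted sum of all 128 inputs and "256k" means 262,144 elements, the benchmark above amounts to 128 × 128 × 262,144 = 2^32 complex multiply-adds, or roughly 2 × 10^9 per second at 2.1 s per core. Both readings are assumptions.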


GPU-Processing

[Diagram of Compute Node k: FPGA with Local Mem (Polyphase Filterbank, Data Exchange) → System Mem holding Tel 0 Subb k, Tel 1 Subb k, Tel 2 Subb k, ..., Tel 127 Subb k → two CPUs (FFT, Coherent Beamforming) → Beam 0 Subb k, Beam 1 Subb k, Beam 2 Subb k, ..., Beam n Subb k → GPU0 ... GPUn, each with Video Mem (DeDisp, IFFT, SI)]


GPU-Processing

• Performance:
2 Spectra (horizontal/vertical), single-precision float, 4M Elements
• Total Power Time Sequence of same Size:
5 ms


GPU-Processing

• How to output the Results to the ASICs?
• All PCIe-Slots are already taken (GPUs, FPGAs)
• Write to Screen Buffer, to be output via Monitor Cable


GPU-Processing

Mini-DisplayPort: 17.28 Gbit/s

~ 70 Gbit/s, equiv. 1 PCIe x16 Slot


GPU-Processing

• Does it work? Yes, but...

[Diagram: GPU Kernel → Screen, via DVI: 2.7 Gbit/s (Video)]


Massively-Parallel Folding

[Diagram: GPU0 with Video Mem (SI Time Sequence) → Monitor Cable → ASIC0 ... ASICn, each with Local Mem (Massively-Parallel Folding)]


Massively-Parallel Folding

[Diagram: Compute Node 0, Compute Node 1, ..., Compute Node 15; in each node the GPU-PC is connected to the ASIC-PC by up to 16 Monitor Cables]


Folding – Time Domain

[Diagram: samples along the time axis are accumulated (Σ) into phase bins for a Hypothetical Pulse Period P]

Detects Solitary Pulsars having P ± small ΔP
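Time-domain folding for one hypothetical period can be sketched as follows. This is a minimal software illustration, not the ASIC design; the function name `fold` and the synthetic test signal are assumptions.

```python
def fold(samples, period_samples: float, num_bins: int):
    """Fold a time series at a trial period and return the summed profile."""
    if period_samples <= 0 or num_bins <= 0:
        raise ValueError("period and bin count must be positive")
    profile = [0.0] * num_bins
    for i, x in enumerate(samples):
        phase = (i / period_samples) % 1.0          # pulse phase in [0, 1)
        profile[int(phase * num_bins) % num_bins] += x
    return profile


# Synthetic signal: one pulse every 8 samples.
samples = [1.0 if i % 8 == 0 else 0.0 for i in range(64)]
matched = fold(samples, period_samples=8.0, num_bins=8)
smeared = fold(samples, period_samples=7.0, num_bins=8)
print(matched)
print(smeared)
```

At the matching trial period all pulses pile up in one phase bin, while at a wrong period they spread over several bins. That contrast is what makes a peak in the folded profile usable as a detection statistic.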


Folding – Acceleration Search

[Diagram: samples along the time axis are accumulated (Σ) into phase bins for a Hypothetical Acceleration, using a Variable Bin Length (# of Samples per Bin)]

Equiv. to Time Sequence Resampling
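A variable bin length can be sketched with a running counter that decides when the next phase bin starts. The linear growth of the bin length used here is only an assumption for illustration; the slides do not state how the number of samples per bin is derived from a trial acceleration.

```python
def fold_variable_bins(samples, first_bin_len: float, bin_len_step: float, num_bins: int):
    """Fold with a bin length (in samples) that changes by a fixed step per bin."""
    if first_bin_len <= 0 or num_bins <= 0:
        raise ValueError("bin length and bin count must be positive")
    profile = [0.0] * num_bins
    bin_index = 0
    bin_len = first_bin_len
    bin_end = bin_len                    # sample position where the current bin ends
    for i, x in enumerate(samples):
        while i >= bin_end:              # advance to the bin containing sample i
            bin_index += 1
            bin_len += bin_len_step
            if bin_len <= 0:
                raise ValueError("bin length dropped to zero or below")
            bin_end += bin_len
        profile[bin_index % num_bins] += x
    return profile


# With a step of zero this reduces to ordinary folding.
samples = [1.0 if i % 8 == 0 else 0.0 for i in range(64)]
print(fold_variable_bins(samples, first_bin_len=1.0, bin_len_step=0.0, num_bins=8))
```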


Harmonic Summation

[Diagram: power spectrum along the freq axis; the bin ranges of width Δf, 2Δf, 3Δf, 4Δf starting at f0, 2f0, 3f0, 4f0 are summed (Σ)]

Detects Solitary Pulsars between f0 and f0 + Δf
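In its simplest form, harmonic summation adds the power at integer multiples of a trial fundamental. The sketch below works on spectrum bin indices and picks the best fundamental in a range; the function names and the toy spectrum are assumptions.

```python
def harmonic_sum(power, fundamental_bin: int, num_harmonics: int) -> float:
    """Sum the power at the fundamental bin and its integer multiples."""
    total = 0.0
    for h in range(1, num_harmonics + 1):
        k = h * fundamental_bin
        if k >= len(power):
            break                        # harmonic lies beyond the spectrum
        total += power[k]
    return total


def best_fundamental(power, first_bin: int, last_bin: int, num_harmonics: int) -> int:
    """Trial fundamental (bin index) with the largest harmonic sum."""
    return max(range(first_bin, last_bin + 1),
               key=lambda k: harmonic_sum(power, k, num_harmonics))


# Toy spectrum: a pulsar with harmonics at bins 5, 10, 15, 20 and one noise spike.
power = [0.0] * 64
for k in (5, 10, 15, 20):
    power[k] = 1.0
power[7] = 2.5
print(best_fundamental(power, first_bin=1, last_bin=15, num_harmonics=4))
```

The noise spike is the strongest single bin, but the pulsar wins once four harmonics are summed.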


Harmonic Sum - Acceleration Search

[Diagram: power spectrum along the freq axis; the bin ranges of width Δf, 2Δf, 3Δf, 4Δf starting at f0, 2f0, 3f0, 4f0 are summed (Σ)]

Detects Binary Systems with max. Acceleration Δf
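For an accelerated pulsar the power of harmonic h is smeared over a range of bins. The sketch below assumes that this range runs from h·f0 to h·(f0 + Δf), which is one reading of the diagram, and sums every bin in it. The function name and the toy spectrum are hypothetical.

```python
def harmonic_sum_range(power, f0_bin: int, df_bins: int, num_harmonics: int) -> float:
    """Sum power over the bin range [h*f0, h*(f0+df)] for each harmonic h."""
    total = 0.0
    for h in range(1, num_harmonics + 1):
        lo = h * f0_bin
        hi = min(h * (f0_bin + df_bins), len(power) - 1)
        total += sum(power[lo:hi + 1])   # empty slice if lo is past the end
    return total


# Toy spectrum of a drifting signal: harmonics shifted to bins 5, 11 and 17.
power = [0.0] * 64
for k in (5, 11, 17):
    power[k] = 1.0
print(harmonic_sum_range(power, f0_bin=5, df_bins=1, num_harmonics=3))
```

A plain sum over bins 5, 10 and 15 would recover only the fundamental here, while the ranged sum collects all three harmonics.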


Folding Processor

[Diagram: Broadcast Bus carrying the SI Time Sequence or Power Spectrum → Accumulator, controlled by a Programmable Set of Counters and Incrementers → Memory (64 x 32 bits), to / from local Memory]


Pulsar Detector ASIC

10^3 - 10^5 Folding Processors


ASIC Network

[Diagram: the ASIC-PCs of Rank 0, Rank 1, Rank 2, ..., Rank 63 are connected by a Ring Network]


RFI Mitigation

• Subband-relative:
Accumulation is per Subband
• Beam-relative


The Pulsar Search Machine

• 128 Telescopes
• 128 Beams
• 100 DMs
• 50,000 Orbits
• 6.4 × 10^8 hypothetical Pulsars
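The number of hypothetical pulsars is consistent with the product of beams, DM trials and orbit trials. Reading it as that product is an assumption, since the slide only lists the figures.

```python
beams = 128
dm_trials = 100
orbit_trials = 50_000

# Every combination of beam, DM and orbit is one hypothetical pulsar (assumed).
hypotheses = beams * dm_trials * orbit_trials
print(f"{hypotheses:.1e}")
```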


The Pulsar Search Machine

• PC Cluster
• Switch-less Design, helps Scalability
• GPU-PC + ASIC-PC = Compute Node
• 64 Ranks of 16 Compute Nodes
• 2048 PCs, 4096 CPUs, 8192 GPUs, 16384 ASICs
• 32.5 M€
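The totals can be cross-checked with a few lines of arithmetic. The per-unit figures below are derived by assuming that CPUs are spread evenly over all PCs, GPUs over the GPU-PCs and ASICs over the ASIC-PCs. They are not stated on the slide, and the cost per PC is simply the total divided by the PC count.

```python
ranks = 64
nodes_per_rank = 16
pcs_per_node = 2                              # one GPU-PC plus one ASIC-PC

compute_nodes = ranks * nodes_per_rank
pcs = compute_nodes * pcs_per_node

cpus_per_pc = 4096 // pcs                     # derived, assuming an even split
gpus_per_gpu_pc = 8192 // compute_nodes       # derived, one GPU-PC per node
asics_per_asic_pc = 16384 // compute_nodes    # derived, one ASIC-PC per node
cost_per_pc_eur = 32.5e6 / pcs                # total cost divided by PC count

print(compute_nodes, pcs, cpus_per_pc, gpus_per_gpu_pc, asics_per_asic_pc)
print(round(cost_per_pc_eur))
```

Under these assumptions each ASIC-PC holds 16 ASICs, which matches the "up to 16 Monitor Cables" between the GPU-PC and the ASIC-PC.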


Thanks!


The Pulsar Search Machine


Folding Processor


Costs


Supercomputer Costs 2005

• Sandia National Laboratories Red Storm: $90 million
• Los Alamos National Laboratory ASCI Q: $215 million
• Earth Simulator Center, Japan: $250 million
• IBM Blue Gene/L: $290 million

Source: various (unreliable) Internet Sources
