GTC 2012 - GPU Technology Conference
Signal Processing on GPUs for Radio Telescopes
John W. Romein
Netherlands Institute for Radio Astronomy (ASTRON)
Dwingeloo, the Netherlands
GTC'12, May 14-17, 2012

Overview
- radio telescopes
- six radio-telescope algorithms on GPUs
- part 1: real-time processing of telescope data
  1) FIR filter
  2) FFT
  3) delay compensation
  4) bandpass correction
  5) correlator
- part 2: creation of sky images
  6) gridding (a new GPU algorithm!)

Intro: Radio Telescopes

LOFAR Radio Telescope
- largest low-frequency radio telescope
- distributed sensor network
- ~85,000 sensors

LOFAR: A Software Telescope
- different observation modes require flexibility:
  - standard imaging
  - pulsar survey
  - known pulsar
  - epoch of reionization
  - transients
  - ultra-high-energy particles
  - …
- needs a supercomputer, in real time

LOFAR Data Processing
- Blue Gene/P supercomputer

Square Kilometre Array
- future radio telescope
- huge processing requirements:

  telescope        TFLOPS
  LOFAR (2012)     ~30
  SKA 10% (2016)   ~30,000
  Full SKA (2020)  ~1,000,000

Part 1: Real-Time Processing of Telescope Data

Rationale
- 2005: LOFAR needed a supercomputer
- 2012: can GPUs do this work?

Blue Gene/P Algorithms on GPUs
- the BG/P software is complex: several processing pipelines
- try the imaging pipeline on a GPU
  - computational kernels only
  - other pipelines + control software: later

CUDA or OpenCL?
- OpenCL advantages:
  - vendor independent
  - runtime compilation: easier programming (parameters become compile-time constants)
    float2 samples[NR_STATIONS][NR_CHANNELS][NR_TIMES][NR_POLARIZATIONS];
- OpenCL disadvantages:
  - less mature, e.g. poor support for FFTs
  - cannot use all GPU features
- decision: go for OpenCL

Poly-Phase Filter (PPF) Bank
- splits a frequency band into channels, like a prism
- trades time resolution for frequency resolution

Poly-Phase Filter (PPF) Bank
- implemented as a FIR filter followed by an FFT

1) Finite Impulse Response (FIR) Filter
- history & weights kept in registers
- no physical shift of the history buffer
- many FMAs
- operational intensity: 32 ops / 5 bytes

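The FIR structure the slide describes (per-channel history and weights kept in registers, no physical shift of the history) can be sketched in plain Python; `NR_TAPS = 16` and the circular-index trick are illustrative assumptions, not the actual LOFAR kernel.

```python
# Minimal sketch of the FIR step, assuming a 16-tap filter per channel;
# the history buffer is indexed circularly, so samples are overwritten
# in place and never physically shifted.
NR_TAPS = 16

def fir_filter(samples, weights):
    """Filter one channel's time series with a 16-tap FIR filter."""
    history = [0.0] * NR_TAPS          # would live in registers on the GPU
    out = []
    for t, x in enumerate(samples):
        history[t % NR_TAPS] = x       # overwrite the oldest sample
        acc = 0.0
        for tap in range(NR_TAPS):     # 16 fused multiply-adds
            acc += weights[tap] * history[(t - tap) % NR_TAPS]
        out.append(acc)
    return out
```

With an impulse as the weight vector the filter passes its input through unchanged, which makes the indexing easy to check.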
Performance Measurements
- maximum foreseen LOFAR load:
  - ≤ 77 stations
  - 488 subbands @ 195 kHz
  - dual polarization
  - 2x8 bits/sample
  - ≤ 240 Gb/s
- GPUs tested: GTX 580, GTX 680, HD 6970, HD 7970
- (Tesla-quality hardware would be needed for real use)

FIR Filter Performance
- GTX 580 performs best
- restricted by memory bandwidth

2) FFT
- 1D, complex ➜ complex, 16-256 points
- tweaked "Apple" FFT library
  - 64 work items: 1 FFT
  - 256 work items: 4 FFTs

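The transform in question is a small power-of-two complex-to-complex FFT. A textbook radix-2 sketch in Python shows the structure (this is not the tweaked Apple OpenCL library the slide refers to, just the algorithm it implements):

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT for power-of-two sizes
    (the pipeline uses 16..256 points)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])                  # transform of even-index samples
    odd = fft(x[1::2])                   # transform of odd-index samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle       # butterfly: combine halves
        out[k + n // 2] = even[k] - twiddle
    return out
```

An impulse input transforms to a flat spectrum, and a constant input to a single DC bin, which exercises both butterfly branches.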
FFT Performance
- N = 256
- tweaked library
- FLOP count based on the standard 5 n log(n) estimate

Clock Correction
- corrects cable-length errors
- merged with the next step (phase delay)

3) Delay Compensation (a.k.a. Tracking)
- tracks the observed source by delaying the telescope data
- the delay changes due to the earth's rotation
- integer part: shift samples
- remainder: rotate the phase (a complex multiplication)
- 18 FLOPs / 32 bytes

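The "remainder: rotate phase" step comes from the Fourier shift theorem: a residual delay τ in the time domain is a multiplication by exp(-2πi·f·τ) at frequency f. A minimal sketch, with argument names chosen for illustration:

```python
import cmath

def delay_compensate(samples, channel_freqs, delay):
    """Apply a residual (sub-sample) delay to one station's channelized
    samples by rotating the phase of each channel: one complex multiply
    per sample, as on the slide."""
    rotated = []
    for sample, freq in zip(samples, channel_freqs):
        phase = cmath.exp(-2j * cmath.pi * freq * delay)
        rotated.append(sample * phase)
    return rotated
```

A zero delay leaves samples untouched; a delay of half a period at a given frequency flips the sample's sign.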
4) BandPass Correction
- powers in the channels are unequal: an artifact of the station processing
- multiply by channel-dependent weights
- 1 FLOP / 8 bytes

Transpose
- reorders the data for the next step (the correlator)
- goes through local memory
- see talk S0514

Combined Kernel
- combines delay compensation, bandpass correction, and the transpose
- reduces global-memory accesses
- 18 FLOPs / 32 bytes

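The point of the combined kernel is that one pass over the data performs all three steps before writing back to global memory. A Python sketch of the fused operation (array layouts and names are assumptions for illustration):

```python
import cmath

def delay_bandpass_transpose(samples, freqs, delays, weights):
    """One fused pass: rotate the phase (delay compensation), scale by
    the channel weight (bandpass correction), and write the result
    transposed from [station][channel] to [channel][station] order, so
    the data is read and written only once."""
    nr_stations, nr_channels = len(samples), len(freqs)
    out = [[0j] * nr_stations for _ in range(nr_channels)]
    for s in range(nr_stations):
        for c in range(nr_channels):
            phase = cmath.exp(-2j * cmath.pi * freqs[c] * delays[s])
            out[c][s] = samples[s][c] * phase * weights[c]
    return out
```

With zero delays and unit weights the fused kernel reduces to a pure transpose, which is a convenient sanity check.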
Delay / BandPass Performance
- poor operational intensity
- 156 GB/s!

5) Correlator
- see the previous talk (S0347)
- multiplies samples from each pair of stations
- integrates over ~1 s

Correlator Implementation
- global memory ➜ local memory
- 1 thread computes 2x2 stations (dual polarization)
- 4 float4 loads ➜ 64 FMAs
- 32 accumulator registers per thread

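The correlation itself is simple: for every station pair, multiply one station's sample by the complex conjugate of the other's and integrate over time. A scalar Python sketch (the GPU version tiles this into 2x2-station blocks per thread, which is omitted here):

```python
def correlate(samples):
    """Correlate every pair of stations.  samples[station][time] holds
    complex voltages; returns visibilities[(s1, s2)] for the lower
    triangle (s2 <= s1), including autocorrelations."""
    nr_stations = len(samples)
    nr_times = len(samples[0])
    visibilities = {}
    for s1 in range(nr_stations):
        for s2 in range(s1 + 1):
            acc = 0j                    # accumulate in a "register"
            for t in range(nr_times):   # integrate over ~1 s of samples
                acc += samples[s1][t] * samples[s2][t].conjugate()
            visibilities[(s1, s2)] = acc
    return visibilities
```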
Correlator #Threads
- (chart: #threads, 0-1024, vs. #stations, 20-77)
- max #threads per work-group:

  GTX 580   1024
  GTX 680   1024
  HD 6970   256
  HD 7970   256

- HD 6970 / HD 7970 need multiple passes!

Correlator Performance
- HD 7970: multiple passes
- register usage ➜ low occupancy

Combined Pipeline
- full pipeline
- 2 host threads, each with its own queue and buffers
- overlaps I/O and computation (H➜D copy, FIR, FFT, delay & bandpass, correlate, D➜H copy)
- an easy model!

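The two-host-thread scheme can be sketched as a double-buffered loop: while one batch is being computed, the next batch's host-to-device transfer is already in flight. `transfer` and `compute` stand in for the OpenCL enqueue calls; this is a structural sketch, not the actual pipeline code.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(batches, transfer, compute):
    """Double-buffered pipeline: overlap each batch's transfer with the
    previous batch's computation, using two worker threads with their
    own queues, as on the slide."""
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = None
        for batch in batches:
            staged = pool.submit(transfer, batch)     # H->D for this batch
            if pending is not None:
                results.append(pending.result())      # drain previous compute
            pending = pool.submit(compute, staged.result())
        if pending is not None:
            results.append(pending.result())
    return results
```

Results come out in order even though transfer and compute of adjacent batches overlap.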
Overall Performance of the Imaging Pipeline
- #GPUs needed for LOFAR
- GTX 680 (marginally) fastest: ~13 GPUs
- HD 7970 a real improvement over the HD 6970

Performance Breakdown: GTX 580
- dominated by the correlator
- correlator: compute bound; others: memory-I/O bound
- PCIe I/O overlapped

Performance Breakdown: GTX 680
- ~20% faster than the GTX 580

Performance Breakdown: HD 7970
- multiple correlator passes are visible
- poor I/O overlap

Performance Breakdown: HD 6970
- ≤ 2.7x slower

Are GPUs Efficient?
- % of FPU peak performance:

  kernel            GTX 680   Blue Gene/P
  FIR filter        ~21%      85%
  FFT               ~17%      44%
  Delay / BandPass  ~2.6%     26%
  Correlator        ~35%      96%

- Blue Gene/P: better compute-I/O balance & integrated network
- a few tens of GPUs are as powerful as 2 BG/P racks

Feasible?
- imaging pipeline: ~13 GTX 680s (≈ 8 Tesla K10s)
  - plus RFI detection
- other pipelines
- 240 Gb/s FDR InfiniBand transpose

Future Optimizations
- combine more kernels ➜ fewer passes over global memory
- FFT: difficult; requires invoking the FFT from a GPU kernel rather than from the CPU

Conclusions Part 1
- OpenCL is OK; FFT support is minimal
- GTX 680 (Kepler) marginally faster than HD 7970 (GCN)

Part 2: Creation of Sky Images

Context
- after an observation:
  - remove RFI
  - calibrate
  - create a sky image
- the calibration/imaging loop is possibly repeated

Creating a Sky Image
- convolve the correlations and add them to a grid
- 2D FFT ➜ sky image

Gridding
- for all correlations: convolve the correlation (conv. matrix ~100x100) and add it to the grid (~4096x4096)

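The gridding step the slide describes is, in its naive form, a small dense update: one visibility multiplied by every entry of its ~100x100 convolution matrix, added into the ~4096x4096 grid. A minimal sketch:

```python
def grid_correlation(grid, conv, vis, u, v):
    """Naive gridding of one visibility: add vis * conv[dy][dx] to the
    grid around integer position (u, v).  Every product is a separate
    add to grid memory -- exactly the cost the next slides attack."""
    size = len(conv)
    for dy in range(size):
        for dx in range(size):
            grid[v + dy][u + dx] += vis * conv[dy][dx]
```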
Two Problems
1. lots of FLOPS
2. adding to (grid) memory is slow!

Two Solutions
1. lots of FLOPS ➜ use GPUs
2. adding to memory is slow ➜ avoid it

This Is A Hard Problem
- literature: 4 other GPU gridders
- performance estimated on a GTX 680, compensated for faster hardware (bandwidth difference + 50%)
- (chart: GFLOPS and giga-pixel-updates-per-second vs. conv. matrix size, 16x16 to 256x256, for gridders 1-4)

This Is A Hard Problem
1) MWA (Edgar et al. [CPC'11]): searches correlations
2) Cell BE (Varbanescu [PhD '10]): local store
3) van Amesfoort et al. [CF'09]: private grid per block ➜ very small grids
4) Humphreys & Cornwell [SKA memo 132, '11]: adds directly to the grid in memory
- (same chart as the previous slide)

This Is A Hard Problem
- ~3% of FPU peak performance!
- SKA: exascale
- (same chart as the previous slides)

W-Projection Gridding
- each correlation has associated (u,v,w) coordinates
- (u,v) are not exact grid points
- use different convolution matrices, depending on frac(u), frac(v), and w
- choose the most appropriate one; add at (int(u), int(v))

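The matrix selection can be sketched as an index computation: the fractional parts of (u, v) pick one of the oversampled, shifted convolution matrices, and w picks a W-plane. The index layout below is an assumption for illustration, not LOFAR's actual one.

```python
def choose_conv_matrix(conv_matrices, u, v, w, oversampling, nr_w_planes, max_w):
    """Pick the convolution matrix for one visibility, in the spirit of
    W-projection: frac(u) and frac(v) select one of
    oversampling x oversampling shifted matrices, w selects a W-plane,
    and the visibility is added at the integer grid point."""
    frac_u = int((u - int(u)) * oversampling)
    frac_v = int((v - int(v)) * oversampling)
    w_plane = min(int(w / max_w * nr_w_planes), nr_w_planes - 1)
    return conv_matrices[w_plane][frac_v][frac_u], (int(u), int(v))
```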
Where Is The Data<br />
corr<br />
conv<br />
(~100x100)<br />
grid: device memory<br />
conv. matrices: texture<br />
correlations + (u,v,w) coords: shared (local) memory<br />
grid<br />
(~4096x4096)<br />
<strong>GTC</strong>'12<br />
May 14-17, <strong>2012</strong> 49
Placement Movement
- per baseline: (u,v,w) changes slowly over time
- ➜ grid locality

Use Locality
- reduce the number of memory accesses
- X: one thread
- accumulate additions in a register until the conv. matrix slides off

But How?
- 1 thread / grid point: which correlations contribute?
- severe load imbalance

An Unintuitive Approach
- conceptual blocks over the grid, of the conv. matrix size

An Unintuitive Approach
- 1 thread monitors all X
- at any time: exactly 1 X is covered by the conv. matrix!

An Unintuitive Approach
- the thread computes its current:
  - X grid point
  - X conv.-matrix entry

An Unintuitive Approach
- the (u,v) coordinates change

An Unintuitive Approach
- the (u,v) coordinates change more

An Unintuitive Approach
- a thread (atomically) adds its data when switching to another X

An Unintuitive Approach
- #threads = conv. block size
- too many threads ➜ process in parts

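The slides above can be condensed into a sequential sketch of the work distribution: each "thread" owns one fixed offset inside conceptual blocks of the convolution-matrix size, so for every visibility exactly one grid point falls to it; it accumulates in a register and only touches grid memory when its target grid point changes. This is a reconstruction of the idea under those assumptions, not the published kernel.

```python
def grid_visibilities(grid, visibilities, conv, block):
    """Grid visibilities with one accumulator per (ox, oy) thread
    offset; grid memory is updated only when the thread's covered
    grid point changes (the rare "atomic add" on the slide)."""
    for oy in range(block):
        for ox in range(block):             # one iteration = one "thread"
            acc, target = 0j, None
            for vis, u, v in visibilities:
                cx = (ox - u) % block       # conv.-matrix column for this thread
                cy = (oy - v) % block       # conv.-matrix row
                point = (u + cx, v + cy)    # the one grid point this thread covers
                if point != target:
                    if target is not None:
                        grid[target[1]][target[0]] += acc   # flush accumulator
                    acc, target = 0j, point
                acc += vis * conv[cy][cx]
            if target is not None:
                grid[target[1]][target[0]] += acc           # final flush
    return grid
```

Because (u,v) moves slowly per baseline, consecutive visibilities usually hit the same grid point per thread, so well under 1% of the pixel updates reach memory.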
(Dis)Advantages
- ☹ some overhead
- ☺ < 1% of grid-point updates go to memory

Performance Measurements

Performance Test Setup

  #stations          44
  #channels          16
  integration time   10 s
  observation time   6 h
  conv. matrix size  ≤ 256x256
  oversampling       8x8
  #W-planes          128
  grid size          2048x2048

- (u,v,w) taken from a real 6-hour LOFAR observation

GTX 680 Performance (CUDA)
- 75.1-95.6 Gpixels/s; 25% of FPU peak
- overhead: index computations
- most additions stay in registers; only 0.23%-0.55% end in an atomic add
  - yet those atomic adds take 26% of the total run time!
- occupancy: 0.694-0.952
- texture hit rate: > 0.872
- (chart: GFLOPS and giga-pixel-updates-per-second vs. conv. matrix size, 16x16 to 256x256)

GTX 680 Performance (OpenCL)
- OpenCL slower than CUDA
- no atomic floating-point add! ➜ use atomic cmpxchg instead
- OpenCL 1.1: no 1D images (added in 1.2); a 2D image is slower
- (chart: GTX 680 CUDA vs. GTX 680 OpenCL, GFLOPS and giga-pixel-updates-per-second vs. conv. matrix size)

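The cmpxchg workaround mentioned above is the classic compare-and-exchange retry loop: reread the word, compute the sum, and retry until no other thread changed it in between. The sketch below models the logic in Python with a pluggable `cmpxchg` primitive; in real OpenCL, `atomic_cmpxchg` operates on the float's integer bit pattern, which is omitted here.

```python
def atomic_add_float(memory, index, value, cmpxchg):
    """Emulate a floating-point atomic add via an atomic
    compare-and-exchange loop.  cmpxchg(memory, index, old, new) must
    return the value actually found at memory[index]."""
    while True:
        old = memory[index]
        found = cmpxchg(memory, index, old, old + value)
        if found == old:            # nobody interfered: the add took effect
            return

def cmpxchg(memory, index, expected, new):
    """Sequential stand-in for the atomic primitive."""
    found = memory[index]
    if found == expected:
        memory[index] = new
    return found
```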
HD 7970 Performance (OpenCL)
- medium & large conv. matrix sizes: outperforms the GTX 680
  - ~25% more bandwidth, FPU throughput, and power
- small conv. matrix sizes: poor computation-I/O overlap
  - workaround: map host memory into the device
- (chart: GTX 680 CUDA, GTX 680 OpenCL, HD 7970; GFLOPS and giga-pixel-updates-per-second vs. conv. matrix size)

2 x Xeon E5-2680 Performance (C++/AVX)
- C++ with AVX vector intrinsics
- adds directly to the grid; relies on the L1 cache
  - works well on the CPU; GPUs have insufficient cache for this
- 48-79% of FPU peak
- (chart: GTX 680 CUDA, GTX 680 OpenCL, HD 7970, 2 x E5-2680; GFLOPS and giga-pixel-updates-per-second vs. conv. matrix size)

Multi-GPU Scaling
- eight Nvidia GTX 580s
- 131,072 threads!
- scales well
- (chart: GFLOPS vs. nr. GPUs, 0-8, for 16x16, 64x64, and 256x256 conv. matrices)

Green Computing
- up to 1.94 GFLOP/W (with previous-generation hardware!)
- (charts: power consumption (kW) and power efficiency (GFLOP/W) vs. nr. GPUs, 0-8, for 16x16, 64x64, and 256x256 conv. matrices)

Compared To Other GPU Gridders
1) MWA (Edgar et al. [CPC'11])
2) Cell BE (Varbanescu [PhD '10])
3) van Amesfoort et al. [CF'09]
4) Humphreys & Cornwell [SKA memo 132, '11]
- the new method is ~10x faster
- (chart: new method vs. gridders 1-4; GFLOPS and giga-pixel-updates-per-second vs. conv. matrix size)

See Also
- An Efficient Work-Distribution Strategy for Gridding Radio-Telescope Data on GPUs, John W. Romein, ACM International Conference on Supercomputing (ICS'12), June 25-29, 2012, Venice, Italy

Future Work
- LOFAR gridder
- combine with A-projection: time-dependent conv. function ➜ compute it on the GPU

Conclusions Part 2
- an efficient GPU gridding algorithm that minimizes memory accesses
- OpenCL lacks an atomic floating-point add
- ~10x faster than other gridders
- scales well on 8 GPUs
- energy efficient