GTC 2012 - GPU Technology Conference
Signal Processing on GPUs for Radio Telescopes
John W. Romein
Netherlands Institute for Radio Astronomy (ASTRON)
Dwingeloo, the Netherlands
GTC'12, May 14-17, 2012

Overview
- radio telescopes
- six radio-telescope algorithms on GPUs
- part 1: real-time processing of telescope data
  1) FIR filter
  2) FFT
  3) delay compensation
  4) bandpass correction
  5) correlator
- part 2: creation of sky images
  6) gridding (a new GPU algorithm!)

Intro: Radio Telescopes

LOFAR Radio Telescope
- largest low-frequency radio telescope
- distributed sensor network
- ~85,000 sensors

LOFAR: A Software Telescope
- different observation modes require flexibility:
  - standard imaging
  - pulsar survey
  - known pulsar
  - epoch of reionization
  - transients
  - ultra-high-energy particles
  - …
- needs a supercomputer, in real time

LOFAR Data Processing
- Blue Gene/P supercomputer

Square Kilometre Array
- future radio telescope
- huge processing requirements:

  telescope        TFLOPS
  LOFAR (2012)     ~30
  SKA 10% (2016)   ~30,000
  Full SKA (2020)  ~1,000,000

Part 1: Real-Time Processing of Telescope Data

Rationale
- 2005: LOFAR needed a supercomputer
- 2012: can GPUs do this work?

Blue Gene/P Algorithms on GPUs
- the BG/P software is complex: several processing pipelines
- try the imaging pipeline on a GPU
  - computational kernels only
  - other pipelines + control software: later

CUDA or OpenCL?
- OpenCL advantages:
  - vendor independent
  - runtime compilation: easier programming (parameters become compile-time constants)
    float2 samples[NR_STATIONS][NR_CHANNELS][NR_TIMES][NR_POLARIZATIONS];
- OpenCL disadvantages:
  - less mature, e.g. poor support for FFTs
  - cannot use all GPU features
- decision: go for OpenCL

Poly-Phase Filter (PPF) Bank
- splits a frequency band into channels, like a prism
- trades time resolution for frequency resolution

Poly-Phase Filter (PPF) Bank
- implemented as a FIR filter followed by an FFT

1) Finite Impulse Response (FIR) Filter
- history & weights kept in registers
- no physical shift of the history buffer
- many FMAs
- operational intensity: 32 ops / 5 bytes

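The FIR structure the slide describes (per-channel history and weights kept in registers, no physical shift of the history) can be sketched in plain Python; `NR_TAPS = 16` and the circular-index trick are illustrative assumptions, not the actual LOFAR kernel.

```python
# Minimal sketch of the FIR step, assuming a 16-tap filter per channel;
# the history buffer is indexed circularly, so samples are overwritten
# in place and never physically shifted.
NR_TAPS = 16

def fir_filter(samples, weights):
    """Filter one channel's time series with a 16-tap FIR filter."""
    history = [0.0] * NR_TAPS          # would live in registers on the GPU
    out = []
    for t, x in enumerate(samples):
        history[t % NR_TAPS] = x       # overwrite the oldest sample
        acc = 0.0
        for tap in range(NR_TAPS):     # 16 fused multiply-adds
            acc += weights[tap] * history[(t - tap) % NR_TAPS]
        out.append(acc)
    return out
```

With an impulse as the weight vector the filter passes its input through unchanged, which makes the indexing easy to check.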
Performance Measurements
- maximum foreseen LOFAR load:
  - ≤ 77 stations
  - 488 subbands @ 195 kHz
  - dual polarization
  - 2x8 bits/sample
  - ≤ 240 Gb/s
- GPUs tested: GTX 580, GTX 680, HD 6970, HD 7970
- (Tesla-quality hardware would be needed for real use)

FIR Filter Performance
- GTX 580 performs best
- restricted by memory bandwidth

2) FFT
- 1D, complex ➜ complex, 16-256 points
- tweaked "Apple" FFT library
  - 64 work items: 1 FFT
  - 256 work items: 4 FFTs

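The transform in question is a small power-of-two complex-to-complex FFT. A textbook radix-2 sketch in Python shows the structure (this is not the tweaked Apple OpenCL library the slide refers to, just the algorithm it implements):

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT for power-of-two sizes
    (the pipeline uses 16..256 points)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])                  # transform of even-index samples
    odd = fft(x[1::2])                   # transform of odd-index samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle       # butterfly: combine halves
        out[k + n // 2] = even[k] - twiddle
    return out
```

An impulse input transforms to a flat spectrum, and a constant input to a single DC bin, which exercises both butterfly branches.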
FFT Performance
- N = 256
- tweaked library
- FLOP count based on the standard 5 n log(n) estimate

Clock Correction
- corrects cable-length errors
- merged with the next step (phase delay)

3) Delay Compensation (a.k.a. Tracking)
- tracks the observed source by delaying the telescope data
- the delay changes due to the earth's rotation
- integer part: shift samples
- remainder: rotate the phase (a complex multiplication)
- 18 FLOPs / 32 bytes

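The "remainder: rotate phase" step comes from the Fourier shift theorem: a residual delay τ in the time domain is a multiplication by exp(-2πi·f·τ) at frequency f. A minimal sketch, with argument names chosen for illustration:

```python
import cmath

def delay_compensate(samples, channel_freqs, delay):
    """Apply a residual (sub-sample) delay to one station's channelized
    samples by rotating the phase of each channel: one complex multiply
    per sample, as on the slide."""
    rotated = []
    for sample, freq in zip(samples, channel_freqs):
        phase = cmath.exp(-2j * cmath.pi * freq * delay)
        rotated.append(sample * phase)
    return rotated
```

A zero delay leaves samples untouched; a delay of half a period at a given frequency flips the sample's sign.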
4) BandPass Correction
- powers in the channels are unequal: an artifact of the station processing
- multiply by channel-dependent weights
- 1 FLOP / 8 bytes

Transpose
- reorders the data for the next step (the correlator)
- goes through local memory
- see talk S0514

Combined Kernel
- combines delay compensation, bandpass correction, and the transpose
- reduces global-memory accesses
- 18 FLOPs / 32 bytes

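The point of the combined kernel is that one pass over the data performs all three steps before writing back to global memory. A Python sketch of the fused operation (array layouts and names are assumptions for illustration):

```python
import cmath

def delay_bandpass_transpose(samples, freqs, delays, weights):
    """One fused pass: rotate the phase (delay compensation), scale by
    the channel weight (bandpass correction), and write the result
    transposed from [station][channel] to [channel][station] order, so
    the data is read and written only once."""
    nr_stations, nr_channels = len(samples), len(freqs)
    out = [[0j] * nr_stations for _ in range(nr_channels)]
    for s in range(nr_stations):
        for c in range(nr_channels):
            phase = cmath.exp(-2j * cmath.pi * freqs[c] * delays[s])
            out[c][s] = samples[s][c] * phase * weights[c]
    return out
```

With zero delays and unit weights the fused kernel reduces to a pure transpose, which is a convenient sanity check.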
Delay / BandPass Performance
- poor operational intensity
- 156 GB/s!

5) Correlator
- see the previous talk (S0347)
- multiplies samples from each pair of stations
- integrates over ~1 s

Correlator Implementation
- global memory ➜ local memory
- 1 thread computes 2x2 stations (dual polarization)
- 4 float4 loads ➜ 64 FMAs
- 32 accumulator registers per thread

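The correlation itself is simple: for every station pair, multiply one station's sample by the complex conjugate of the other's and integrate over time. A scalar Python sketch (the GPU version tiles this into 2x2-station blocks per thread, which is omitted here):

```python
def correlate(samples):
    """Correlate every pair of stations.  samples[station][time] holds
    complex voltages; returns visibilities[(s1, s2)] for the lower
    triangle (s2 <= s1), including autocorrelations."""
    nr_stations = len(samples)
    nr_times = len(samples[0])
    visibilities = {}
    for s1 in range(nr_stations):
        for s2 in range(s1 + 1):
            acc = 0j                    # accumulate in a "register"
            for t in range(nr_times):   # integrate over ~1 s of samples
                acc += samples[s1][t] * samples[s2][t].conjugate()
            visibilities[(s1, s2)] = acc
    return visibilities
```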
Correlator #Threads
- (chart: #threads, 0-1024, vs. #stations, 20-77)
- max #threads per work-group:

  GTX 580   1024
  GTX 680   1024
  HD 6970   256
  HD 7970   256

- HD 6970 / HD 7970 need multiple passes!

Correlator Performance
- HD 7970: multiple passes
- register usage ➜ low occupancy

Combined Pipeline
- full pipeline
- 2 host threads, each with its own queue and buffers
- overlaps I/O and computation (H➜D copy, FIR, FFT, delay & bandpass, correlate, D➜H copy)
- an easy model!

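The two-host-thread scheme can be sketched as a double-buffered loop: while one batch is being computed, the next batch's host-to-device transfer is already in flight. `transfer` and `compute` stand in for the OpenCL enqueue calls; this is a structural sketch, not the actual pipeline code.

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(batches, transfer, compute):
    """Double-buffered pipeline: overlap each batch's transfer with the
    previous batch's computation, using two worker threads with their
    own queues, as on the slide."""
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = None
        for batch in batches:
            staged = pool.submit(transfer, batch)     # H->D for this batch
            if pending is not None:
                results.append(pending.result())      # drain previous compute
            pending = pool.submit(compute, staged.result())
        if pending is not None:
            results.append(pending.result())
    return results
```

Results come out in order even though transfer and compute of adjacent batches overlap.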
Overall Performance of the Imaging Pipeline
- #GPUs needed for LOFAR
- GTX 680 (marginally) fastest: ~13 GPUs
- HD 7970 a real improvement over the HD 6970

Performance Breakdown: GTX 580
- dominated by the correlator
- correlator: compute bound; others: memory-I/O bound
- PCIe I/O overlapped

Performance Breakdown: GTX 680
- ~20% faster than the GTX 580

Performance Breakdown: HD 7970
- multiple correlator passes are visible
- poor I/O overlap

Performance Breakdown: HD 6970
- ≤ 2.7x slower

Are GPUs Efficient?
- % of FPU peak performance:

  kernel            GTX 680   Blue Gene/P
  FIR filter        ~21%      85%
  FFT               ~17%      44%
  Delay / BandPass  ~2.6%     26%
  Correlator        ~35%      96%

- Blue Gene/P: better compute-I/O balance & integrated network
- a few tens of GPUs are as powerful as 2 BG/P racks

Feasible?
- imaging pipeline: ~13 GTX 680s (≈ 8 Tesla K10s)
  - plus RFI detection
- other pipelines
- 240 Gb/s FDR InfiniBand transpose

Future Optimizations
- combine more kernels ➜ fewer passes over global memory
- FFT: difficult; requires invoking the FFT from a GPU kernel rather than from the CPU

Conclusions Part 1
- OpenCL is OK; FFT support is minimal
- GTX 680 (Kepler) marginally faster than HD 7970 (GCN)

Part 2: Creation of Sky Images

Context
- after an observation:
  - remove RFI
  - calibrate
  - create a sky image
- the calibration/imaging loop is possibly repeated

Creating a Sky Image
- convolve the correlations and add them to a grid
- 2D FFT ➜ sky image

Gridding
- for all correlations: convolve the correlation (conv. matrix ~100x100) and add it to the grid (~4096x4096)

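The gridding step the slide describes is, in its naive form, a small dense update: one visibility multiplied by every entry of its ~100x100 convolution matrix, added into the ~4096x4096 grid. A minimal sketch:

```python
def grid_correlation(grid, conv, vis, u, v):
    """Naive gridding of one visibility: add vis * conv[dy][dx] to the
    grid around integer position (u, v).  Every product is a separate
    add to grid memory -- exactly the cost the next slides attack."""
    size = len(conv)
    for dy in range(size):
        for dx in range(size):
            grid[v + dy][u + dx] += vis * conv[dy][dx]
```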
Two Problems
1. lots of FLOPS
2. adding to (grid) memory is slow!

Two Solutions
1. lots of FLOPS ➜ use GPUs
2. adding to memory is slow ➜ avoid it

This Is A Hard Problem
- literature: 4 other GPU gridders
- performance estimated on a GTX 680, compensated for faster hardware (bandwidth difference + 50%)
- (chart: GFLOPS and giga-pixel-updates-per-second vs. conv. matrix size, 16x16 to 256x256, for gridders 1-4)

This Is A Hard Problem
1) MWA (Edgar et al. [CPC'11]): searches correlations
2) Cell BE (Varbanescu [PhD '10]): local store
3) van Amesfoort et al. [CF'09]: private grid per block ➜ very small grids
4) Humphreys & Cornwell [SKA memo 132, '11]: adds directly to the grid in memory
- (same chart as the previous slide)

This Is A Hard Problem
- ~3% of FPU peak performance!
- SKA: exascale
- (same chart as the previous slides)

W-Projection Gridding
- each correlation has associated (u,v,w) coordinates
- (u,v) are not exact grid points
- use different convolution matrices, depending on frac(u), frac(v), and w
- choose the most appropriate one; add at (int(u), int(v))

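The matrix selection can be sketched as an index computation: the fractional parts of (u, v) pick one of the oversampled, shifted convolution matrices, and w picks a W-plane. The index layout below is an assumption for illustration, not LOFAR's actual one.

```python
def choose_conv_matrix(conv_matrices, u, v, w, oversampling, nr_w_planes, max_w):
    """Pick the convolution matrix for one visibility, in the spirit of
    W-projection: frac(u) and frac(v) select one of
    oversampling x oversampling shifted matrices, w selects a W-plane,
    and the visibility is added at the integer grid point."""
    frac_u = int((u - int(u)) * oversampling)
    frac_v = int((v - int(v)) * oversampling)
    w_plane = min(int(w / max_w * nr_w_planes), nr_w_planes - 1)
    return conv_matrices[w_plane][frac_v][frac_u], (int(u), int(v))
```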
Where Is The Data<br />
corr<br />
conv<br />
(~100x100)<br />
grid: device memory<br />
conv. matrices: texture<br />
correlations + (u,v,w) coords: shared (local) memory<br />
grid<br />
(~4096x4096)<br />
<strong>GTC</strong>'12<br />
May 14-17, <strong>2012</strong> 49
Placement Movement
- per baseline: (u,v,w) changes slowly over time
- ➜ grid locality

Use Locality
- reduce the number of memory accesses
- X: one thread
- accumulate additions in a register until the conv. matrix slides off

But How?
- 1 thread / grid point: which correlations contribute?
- severe load imbalance

An Unintuitive Approach
- conceptual blocks over the grid, of the conv. matrix size

An Unintuitive Approach
- 1 thread monitors all X
- at any time: exactly 1 X is covered by the conv. matrix!

An Unintuitive Approach
- the thread computes its current:
  - X grid point
  - X conv.-matrix entry

An Unintuitive Approach
- the (u,v) coordinates change

An Unintuitive Approach
- the (u,v) coordinates change more

An Unintuitive Approach
- a thread (atomically) adds its data when switching to another X

An Unintuitive Approach
- #threads = conv. block size
- too many threads ➜ process in parts

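The slides above can be condensed into a sequential sketch of the work distribution: each "thread" owns one fixed offset inside conceptual blocks of the convolution-matrix size, so for every visibility exactly one grid point falls to it; it accumulates in a register and only touches grid memory when its target grid point changes. This is a reconstruction of the idea under those assumptions, not the published kernel.

```python
def grid_visibilities(grid, visibilities, conv, block):
    """Grid visibilities with one accumulator per (ox, oy) thread
    offset; grid memory is updated only when the thread's covered
    grid point changes (the rare "atomic add" on the slide)."""
    for oy in range(block):
        for ox in range(block):             # one iteration = one "thread"
            acc, target = 0j, None
            for vis, u, v in visibilities:
                cx = (ox - u) % block       # conv.-matrix column for this thread
                cy = (oy - v) % block       # conv.-matrix row
                point = (u + cx, v + cy)    # the one grid point this thread covers
                if point != target:
                    if target is not None:
                        grid[target[1]][target[0]] += acc   # flush accumulator
                    acc, target = 0j, point
                acc += vis * conv[cy][cx]
            if target is not None:
                grid[target[1]][target[0]] += acc           # final flush
    return grid
```

Because (u,v) moves slowly per baseline, consecutive visibilities usually hit the same grid point per thread, so well under 1% of the pixel updates reach memory.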
(Dis)Advantages
- ☹ some overhead
- ☺ < 1% of grid-point updates go to memory

Performance Measurements

Performance Test Setup

  #stations          44
  #channels          16
  integration time   10 s
  observation time   6 h
  conv. matrix size  ≤ 256x256
  oversampling       8x8
  #W-planes          128
  grid size          2048x2048

- (u,v,w) taken from a real 6-hour LOFAR observation

GTX 680 Performance (CUDA)
- 75.1-95.6 Gpixels/s; 25% of FPU peak
- overhead: index computations
- most additions stay in registers; only 0.23%-0.55% end in an atomic add
  - yet those atomic adds take 26% of the total run time!
- occupancy: 0.694-0.952
- texture hit rate: > 0.872
- (chart: GFLOPS and giga-pixel-updates-per-second vs. conv. matrix size, 16x16 to 256x256)

GTX 680 Performance (OpenCL)
- OpenCL slower than CUDA
- no atomic floating-point add! ➜ use atomic cmpxchg instead
- OpenCL 1.1: no 1D images (added in 1.2); a 2D image is slower
- (chart: GTX 680 CUDA vs. GTX 680 OpenCL, GFLOPS and giga-pixel-updates-per-second vs. conv. matrix size)

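The cmpxchg workaround mentioned above is the classic compare-and-exchange retry loop: reread the word, compute the sum, and retry until no other thread changed it in between. The sketch below models the logic in Python with a pluggable `cmpxchg` primitive; in real OpenCL, `atomic_cmpxchg` operates on the float's integer bit pattern, which is omitted here.

```python
def atomic_add_float(memory, index, value, cmpxchg):
    """Emulate a floating-point atomic add via an atomic
    compare-and-exchange loop.  cmpxchg(memory, index, old, new) must
    return the value actually found at memory[index]."""
    while True:
        old = memory[index]
        found = cmpxchg(memory, index, old, old + value)
        if found == old:            # nobody interfered: the add took effect
            return

def cmpxchg(memory, index, expected, new):
    """Sequential stand-in for the atomic primitive."""
    found = memory[index]
    if found == expected:
        memory[index] = new
    return found
```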
HD 7970 Performance (OpenCL)
- medium & large conv. matrix sizes: outperforms the GTX 680
  - ~25% more bandwidth, FPU throughput, and power
- small conv. matrix sizes: poor computation-I/O overlap
  - workaround: map host memory into the device
- (chart: GTX 680 CUDA, GTX 680 OpenCL, HD 7970; GFLOPS and giga-pixel-updates-per-second vs. conv. matrix size)

2 x Xeon E5-2680 Performance (C++/AVX)
- C++ with AVX vector intrinsics
- adds directly to the grid; relies on the L1 cache
  - works well on the CPU; GPUs have insufficient cache for this
- 48-79% of FPU peak
- (chart: GTX 680 CUDA, GTX 680 OpenCL, HD 7970, 2 x E5-2680; GFLOPS and giga-pixel-updates-per-second vs. conv. matrix size)

Multi-GPU Scaling
- eight Nvidia GTX 580s
- 131,072 threads!
- scales well
- (chart: GFLOPS vs. nr. GPUs, 0-8, for 16x16, 64x64, and 256x256 conv. matrices)

Green Computing
- up to 1.94 GFLOP/W (with previous-generation hardware!)
- (charts: power consumption (kW) and power efficiency (GFLOP/W) vs. nr. GPUs, 0-8, for 16x16, 64x64, and 256x256 conv. matrices)

Compared To Other GPU Gridders
1) MWA (Edgar et al. [CPC'11])
2) Cell BE (Varbanescu [PhD '10])
3) van Amesfoort et al. [CF'09]
4) Humphreys & Cornwell [SKA memo 132, '11]
- the new method is ~10x faster
- (chart: new method vs. gridders 1-4; GFLOPS and giga-pixel-updates-per-second vs. conv. matrix size)

See Also
- An Efficient Work-Distribution Strategy for Gridding Radio-Telescope Data on GPUs, John W. Romein, ACM International Conference on Supercomputing (ICS'12), June 25-29, 2012, Venice, Italy

Future Work
- LOFAR gridder
- combine with A-projection: time-dependent conv. function ➜ compute it on the GPU

Conclusions Part 2
- an efficient GPU gridding algorithm that minimizes memory accesses
- OpenCL lacks an atomic floating-point add
- ~10x faster than other gridders
- scales well on 8 GPUs
- energy efficient