
The Process of Parallelizing the Conjunction Prediction Algorithm of ESA's SSA Conjunction Prediction Service using GPGPU

ESAC Trainee Project
September 2012 – March 2013

Marius Fehr
Mentors: Vicente Navarro and Luis Martin


Space Situational Awareness

> SWE – Space Weather
> NEO – Near Earth Objects
> SST – Space Surveillance and Tracking


Space Surveillance and Tracking

> Catalog (JSpOC, US Air Force): 16,000 objects > 10 cm
> Estimates: 600,000 objects > 1 cm


The Conjunction Prediction System


All vs. All Conjunction Analysis

Example: objects 1–7 form the pairs [1,2] [1,3] [1,4] [1,5] ... [5,6] [5,7] [6,7]

> The number of pairs grows quadratically with the number of objects
> The analyses of all object pairs are independent → huge potential for parallelism
> 10k objects could theoretically be analyzed in millions of parallel threads
> CPUs usually launch no more than a dozen threads
> How can we exploit that? (see the sketch below)
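
A minimal CUDA sketch of the one-thread-per-pair idea (not the CPS code; analyse_pair() is a hypothetical placeholder for the per-pair analysis): a 2D grid covers all index combinations and each thread keeps only the upper-triangle pair i < j, so every pair is handled exactly once.

__device__ void analyse_pair(int i, int j);   // hypothetical per-pair analysis

__global__ void all_vs_all(int n_objects)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // first object of the pair
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // second object of the pair

    // Keep only the upper triangle (i < j) so each pair is analyzed exactly once.
    if (i >= n_objects || j >= n_objects || i >= j)
        return;

    analyse_pair(i, j);
}

// Possible launch: 16 x 16 blocks and a grid large enough to cover n x n indices.
// dim3 block(16, 16);
// dim3 grid((n + 15) / 16, (n + 15) / 16);
// all_vs_all<<<grid, block>>>(n);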


GPU – NVIDIA's Fermi Architecture

http://benchmarkreviews.com/images/reviews/processor/NVIDIA_Fermi/nvidia-fermi-gf100-gpu-shader-model-block-diagram-full.png
http://benchmarkreviews.com/images/reviews/processor/NVIDIA_Fermi/nvidia-fermi-gf100-gpu-block-diagram-benchmarkreviews.png


CUDA – Grid, Blocks and Threads

Example: multiplication of two 4 x 4 matrices A and B, organized as a grid of 4 blocks with 4 threads each, 16 threads in total (sketch below).

> Grid, blocks and threads are an abstraction of the GPU's multiprocessors and cores
> The GPU distributes blocks to idle multiprocessors
> Idle/waiting threads are swapped out instantly
> Up to 65k x 65k x 65k blocks
> Up to 1024 threads per block
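
A sketch of how that 4 x 4 example could look in code, assuming row-major matrices and one thread per element of the result (the slide's figure may map rows and columns differently):

__global__ void matmul4(const float *A, const float *B, float *C)
{
    int row = blockIdx.x;    // one block per row of C
    int col = threadIdx.x;   // one thread per column of C

    float sum = 0.0f;
    for (int k = 0; k < 4; ++k)
        sum += A[row * 4 + k] * B[k * 4 + col];
    C[row * 4 + col] = sum;
}

// 4 blocks with 4 threads each, 16 threads in total:
// matmul4<<<4, 4>>>(d_A, d_B, d_C);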


CAN – Conjunction Analysis

Inputs: list of objects and prediction period (10k objects → 50M pairs); ephemeris data and conjunctions are read from and written to files/DB. A rough outline follows below.

> Apogee-Perigee Filter: 50M → 20M pairs
> Loop over epochs:
>   Load ephemeris data into memory if necessary
>   Smart Sieve: 20M → 40k pairs
>   Linear Search
>   Find time and distance of closest approach
>   Calculate collision risk and write conjunctions to files/DB
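
A rough host-side outline of this pipeline (all function names and the loop bounds are illustrative, not the actual CPS interfaces):

void build_object_pairs(void);             /* 10k objects -> ~50M pairs              */
void apogee_perigee_filter(void);          /* ~50M -> ~20M pairs                     */
void load_ephemeris_if_needed(int epoch);  /* from files/DB into memory              */
void smart_sieve(int epoch);               /* ~20M -> ~40k potential pairs           */
void linear_search(int epoch);             /* bracket the closest approaches in time */
void find_tca(int epoch);                  /* time and distance of closest approach  */
void compute_risk_and_store(int epoch);    /* write conjunctions to files/DB         */

void conjunction_analysis(int n_epochs)
{
    build_object_pairs();
    apogee_perigee_filter();

    for (int epoch = 0; epoch < n_epochs; ++epoch) {
        load_ephemeris_if_needed(epoch);
        smart_sieve(epoch);
        linear_search(epoch);
        find_tca(epoch);
        compute_risk_and_store(epoch);
    }
}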


Identifying Parallelizable Functions

Chart: runtime [s] (0–900) for 10,000 objects over 8 days, broken down by function: Apogee-Perigee Filter, Loading Ephemerides, Interpolating in Smart Sieve (with an OpenMP variant), Smart Sieve, Interpolating in Linear Search, Linear Search, Find TCA, Conjunction Definition, Penetration Factor and Remaining Operations.


Smart Sieve Kernel

Object pairs that passed the Apogee-Perigee Filter ([1,2] [1,3] [1,4] [1,5] ... [5,7] [6,7]) are run through Filter 1 ... Filter N; survivors increase the potential-pair counter with atomicAdd() (the critical section, see the sketch below).

> In general: having a large number of threads compete for a resource is expensive
> BUT: only about 0.2% of all threads actually reach the critical section
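
A minimal sketch of that critical section (not the actual Smart Sieve; passes_all_filters() is a hypothetical stand-in for the filter chain): each thread tests one pair and survivors reserve a slot in the output list with atomicAdd().

__device__ bool passes_all_filters(int2 pair);   // hypothetical filter chain

__global__ void smart_sieve(const int2 *pairs, int n_pairs,
                            int2 *potential_pairs, int *n_potential)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n_pairs)
        return;

    int2 pair = pairs[k];

    if (passes_all_filters(pair)) {              // only ~0.2% of threads get here
        int slot = atomicAdd(n_potential, 1);    // contended, but rarely reached
        potential_pairs[slot] = pair;
    }
}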


Linear Search

At each epoch the sign of the relative velocity of a pair is recorded (e.g. -1 -1 -1 -1 -1 +1); a sign change brackets a closest approach.

> Kernel 1: 1 thread = 1 object:
>   Interpolating the state vectors needed for the current time step
> Kernel 2: 1 thread = 1 potential pair:
>   Searching for a sign change in the relative velocity (sketch below)


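
A sketch of this two-kernel split, assuming a flat state array of 6 doubles (position and velocity) per object; interpolate() and the data layout are placeholders, not the CPS implementation.

__device__ void interpolate(const double *ephemeris, int obj, double t,
                            double *state);   // hypothetical: writes 6 doubles

// Kernel 1: one thread per object, interpolate its state vector at epoch t.
__global__ void interpolate_states(double t, int n_objects,
                                   const double *ephemeris, double *states)
{
    int obj = blockIdx.x * blockDim.x + threadIdx.x;
    if (obj < n_objects)
        interpolate(ephemeris, obj, t, &states[6 * obj]);
}

// Kernel 2: one thread per potential pair, record the sign of the relative
// range rate; a sign change between two epochs brackets a closest approach.
__global__ void relative_velocity_sign(const double *states, const int2 *pairs,
                                       int n_pairs, int *sign)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n_pairs)
        return;

    const double *si = &states[6 * pairs[k].x];
    const double *sj = &states[6 * pairs[k].y];

    double dot = 0.0;                       // r_rel . v_rel, proportional to d|r_rel|/dt
    for (int d = 0; d < 3; ++d)
        dot += (si[d] - sj[d]) * (si[d + 3] - sj[d + 3]);

    sign[k] = (dot < 0.0) ? -1 : +1;
}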


Find TCA Kernel

Potential pairs → sign change? → zero finder

> Kernel: 1 thread = 1 potential pair:
>   Check if the Linear Search found a time step with a sign change
>   Start the zero finder (regula falsi), see the sketch below
>   Requires a state vector interpolation for every intermediate step

http://commons.wikimedia.org/wiki/File:Regula_falsi.gif
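
A sketch of a regula falsi iteration on the relative range rate f(t) between the two bracketing epochs t0 and t1 (range_rate() is a hypothetical device function that interpolates both state vectors; the actual CPS zero finder is not shown here).

__device__ double range_rate(int2 pair, double t);   // hypothetical, needs interpolation

__device__ double find_tca(int2 pair, double t0, double t1,
                           int max_iter, double tol)
{
    double f0 = range_rate(pair, t0);   // negative: objects approaching
    double f1 = range_rate(pair, t1);   // positive: objects separating
    double t  = t0;

    for (int it = 0; it < max_iter; ++it) {
        // Secant through (t0, f0) and (t1, f1); its zero is the new estimate.
        t = t1 - f1 * (t1 - t0) / (f1 - f0);

        double f = range_rate(pair, t);  // one interpolation per intermediate step
        if (fabs(f) < tol)
            break;
        if (f * f0 < 0.0) { t1 = t; f1 = f; }   // root lies in [t0, t]
        else              { t0 = t; f0 = f; }   // root lies in [t, t1]
    }
    return t;
}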


GPU Timeline

> Goal: minimize memory transfers and (re)allocations

Timeline: the CPU loads the ephemeris while the GPU runs the Apogee-Perigee Filter; for each epoch the GPU loads ephemeris, pairs and constants, resizes the potential-pairs allocation if necessary, and runs Smart Sieve, alternating interpolation (I) and Linear Search (LS) kernels, and Find TCA; the CPU retrieves the results of each epoch before the next one starts. A buffer-reuse sketch follows below.
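
A sketch of the "resize only if necessary" idea for the potential-pairs buffer, which is kept on the GPU across epochs (names and the growth factor are illustrative, not the CPS code).

#include <cuda_runtime.h>

static int2  *d_potential = NULL;   // lives on the GPU across epochs
static size_t capacity    = 0;      // currently allocated number of pairs

void ensure_potential_capacity(size_t needed)
{
    if (needed <= capacity)
        return;                     // reuse the existing allocation

    if (d_potential != NULL)
        cudaFree(d_potential);

    capacity = needed + needed / 4; // grow with some headroom
    cudaMalloc((void **)&d_potential, capacity * sizeof(int2));
}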


Test Environment

> CPU: Intel Xeon E5620, 4 cores @ 2.4 GHz
> Memory: 6 GB
> GPU: NVIDIA GeForce GTX 580
>   1.5 GB of memory
>   512 CUDA cores
>   Compute Capability 2.0
> Double precision calculations are faster on the GeForce 500 series than on the newer 600 series


Computation Time - Results I

Chart: runtime [s] (0–1400) over the number of objects (313, 625, 1250, 2500, 5000, 10000) for the Fortran, C, OpenMP and CUDA versions, with the algorithm optimization marked; the GPU version reduces the runtime by 88%.


Computation Time - Results II

Chart: runtime [s] (0–500) for 10,000 objects over 8 days, comparing Fortran, C and C + CUDA, broken down into the CUDA kernels (interpolation in Smart Sieve and Linear Search, Smart Sieve, Find TCA, Conjunction Definition), Linear Search, Penetration Factor, Apogee-Perigee Filter, remaining operations and CUDA memory operations, with the algorithm optimizations marked.


Conclusion

8 days, 10,000 objects, all vs. all: algorithm optimization -41%, CUDA -88%

> Considerable improvement of the computation time
>   Parallelization with CUDA
>   Other optimizations
> Bottleneck: I/O
>   Reading ephemeris data from file/DB
>   Writing conjunctions to file/DB
> Future work
>   Parallelize other parts of the CPS: computation of conjunction risk, orbit propagation, ...
>   Recompute ephemeris instead of loading it from file/DB


What about your program?

> Can your program be divided into thousands of parallel (and equal) computations?
> Is any communication or cooperation between threads necessary?
>   Only efficient between threads of the same block (< 1024)
> Is the computational effort huge compared to the size of the data?
> Does your program use libraries like BLAS or FFT?
>   Try cuBLAS and cuFFT (see the sketch below)
> Be aware: the GPU has a very flat memory hierarchy / small caches
>   64 KB L1 cache, 0 – 768 KB L2 cache
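
As an example of the cuBLAS point, a minimal single-precision matrix multiplication C = A * B with the matrices already in device memory (column-major, n x n); this is a generic sketch, not code from the CPS.

#include <cublas_v2.h>

void gpu_matmul(const float *d_A, const float *d_B, float *d_C, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;   // C = alpha * A * B + beta * C
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, d_A, n, d_B, n,
                &beta,  d_C, n);

    cublasDestroy(handle);
}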


Questions?

> Marius Fehr – marius.b.fehr@gmail.com
> Vicente Navarro – vicente.navarro@esa.int
> Luis Martin – luis.martin@esa.int


The CUDA C Extension

> All threads execute the same piece of code, the kernel
> The kernel replaces the loop
> The index of each thread is computed from the block id, the thread id and the block dimension
> Global memory can be accessed by every thread
> Make sure there are no race conditions

__global__ void add( int *a, int *b, int *c )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[ idx ] = a[ idx ] + b[ idx ];
}

int vector_addition( )
{
    int U = 1000; int V = 256;
    int N = U * V;
    int a[N], b[N], c[N];
    // fill a and b

    int *cuda_a, *cuda_b, *cuda_c;
    cudaMalloc( (void**) &cuda_a, N * sizeof(int) );
    cudaMalloc( (void**) &cuda_b, N * sizeof(int) );
    cudaMalloc( (void**) &cuda_c, N * sizeof(int) );

    cudaMemcpy( cuda_a, a, N * sizeof(int), cudaMemcpyHostToDevice );
    cudaMemcpy( cuda_b, b, N * sizeof(int), cudaMemcpyHostToDevice );

    add<<< U, V >>>( cuda_a, cuda_b, cuda_c );   // U blocks with V threads each

    cudaMemcpy( c, cuda_c, N * sizeof(int), cudaMemcpyDeviceToHost );

    cudaFree( cuda_a );
    cudaFree( cuda_b );
    cudaFree( cuda_c );
    return 0;
}


CUDA GPUs

Chart: FP32 [Gflops], FP64 [Gflops], price [$] and memory [GB] for the Quadro 2000, Quadro 4000, Quadro 6000, GeForce GTX 580, GeForce GTX 680 and Tesla C2075.


> Creates a GPU-enabled version of the Fortran code
> Works either using directives (like OpenMP) or automatic analysis and optimization
> 30-day trial available


Warps and Threads

> Threads are grouped in warps: 1 warp = 32 threads
> 1 core executes 1 instruction in 1 cycle, BUT all cores in the same group execute the same instruction at the same time
> Single Instruction Multiple Thread (SIMT), see the divergence example below
> Example: Fermi architecture: 2 groups x 16 cores execute 1 warp in 1 cycle
> BUT not every core has its own Special Function Unit
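
A small illustration of the SIMT point (a generic example, not from the project): within one warp both sides of a divergent branch are executed one after the other, with the inactive lanes masked out.

__global__ void divergent(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x % 2 == 0)     // even lanes take this path first ...
        x[i] = 2.0f * x[i];
    else                          // ... then the odd lanes take this one
        x[i] = x[i] + 1.0f;

    // Branching on warp-aligned quantities (e.g. whole blocks) keeps the 32
    // threads of a warp on the same path and avoids this serialization.
}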


Shared Memory

Data is staged from global GPU memory into per-block shared memory and back (sketch below).

> Copy from global to shared memory and back
> Typically 16 KB per block
> Very low latency
> Shared memory is only visible to the threads inside the block
> CUDA provides tools for synchronization
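
A sketch of the copy-to-shared-memory pattern (a generic block-wise sum, not CPS code): each block stages a tile of the input in shared memory, synchronizes with __syncthreads(), and then reduces the fast on-chip copy. Launch with 256 threads per block.

#define TILE 256   // must match the block size

__global__ void block_sums(const float *in, float *sums, int n)
{
    __shared__ float tile[TILE];                  // visible only inside this block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // global -> shared
    __syncthreads();

    // Tree reduction entirely in low-latency shared memory.
    for (int stride = TILE / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        sums[blockIdx.x] = tile[0];               // shared -> global
}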


Texture Memory

> Global memory accessed by dedicated hardware that is otherwise used for textures
> Read cache of ~6 – 8 KB per multiprocessor, optimized for spatial locality in texture coordinates
> Serves as a read-through cache and supports multiple simultaneous reads through hardware-accelerated filtering (sketch below)
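
A sketch of reading a plain device buffer through the texture cache using the texture reference API that was current on Fermi hardware (newer CUDA releases prefer texture objects and have since deprecated references); the names are illustrative.

texture<float, 1, cudaReadModeElementType> data_tex;   // file-scope texture reference

__global__ void read_through_texture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(data_tex, i);   // read served by the texture cache
}

// Host side: bind a linear device buffer before the launch, unbind afterwards.
// cudaBindTexture(NULL, data_tex, d_data, n * sizeof(float));
// read_through_texture<<<blocks, threads>>>(d_out, n);
// cudaUnbindTexture(data_tex);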
