
Vectorizing the forward mode of ADOL-C on a GPU using CUDA

Kshitij Kulshreshtha
joint work with Alina Koniaeva
Universität Paderborn

13th European AD Workshop, 10.06.2013


GPU Computing: general purpose computing on the GPU

CPU
• Optimised for low-latency access to cached data
• Control logic for out-of-order and speculative execution
• Low to moderate number of cores for computation

GPU
• Optimised for data-parallel computations
• Architecture tolerates higher memory latency
• Large number of cores dedicated to computation

[Diagram: CPU vs. GPU block layout — control logic, cache, ALUs, and DRAM]


NVIDIA's CUDA architecture

• CUDA: runtime environment for offloading computations onto supported NVIDIA GPUs
• An extension of the C language as well as libraries
• The level of support depends on the device driver in the OS
• Computations are offloaded to GPUs that have a large number of computing cores

[Diagram: software stack — Application, CUDA Libraries, CUDA Runtime, CUDA Driver — spanning CPU and GPU]


NVIDIA's CUDA architecture

• Code consists of kernel functions executed on different cores
• A kernel performs the same action on different data (SIMD)
• Kernel execution is distributed over grids of blocks of threads
• Threads are grouped into blocks and blocks into grids
• E.g. an NVIDIA Quadro 4000 with CUDA Runtime version 4.2 has
  maximum number of threads per block: 1024
  maximum sizes of each dimension of a block: 1024 x 1024 x 64
  maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
  (these limits can be queried at runtime, as sketched below)
• Several kernels may be started in parallel by distributing them on the grid

The host issues a succession of kernel invocations to the device. Each kernel is executed as a batch of threads organized as a grid of thread blocks.

[Diagram: Host and Device — Kernel 1 launched on Grid 1 of blocks, Kernel 2 on Grid 2; Block (1,1) expanded into a 5 x 3 array of threads]
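The limits quoted above can be obtained programmatically. The following is a minimal sketch (not from the slides) using the standard CUDA runtime call cudaGetDeviceProperties and fields of the cudaDeviceProp structure:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        std::printf("threads per block: %d\n", prop.maxThreadsPerBlock);
        std::printf("block dims: %d x %d x %d\n",
                    prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        std::printf("grid dims:  %d x %d x %d\n",
                    prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        return 0;
    }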


NVIDIA's CUDA architecture

• Memory access is bidirectional
• Data can be gathered from various memory locations to each core
• Each core may scatter computed data across various memory locations
• Some GPUs provide on-device DRAM as a buffer between system memory and execution cores

[Diagram: gather and scatter — groups of ALUs with control logic and cache reading from and writing to DRAM locations d0 … d7]


Vector forward mode

F : R^n → R^m,  x ∈ R^n,  Ẋ ∈ R^{n×p}
Ẏ = F′(x) Ẋ ∈ R^{m×p}
Ḟ(x, Ẋ) := [F(x), F′(x) Ẋ]

• Amortises the overhead by reusing intermediate values for the various directions instead of recomputing them (see the worked example below)
• Serial: TIME(Ḟ)/TIME(F) ∈ [1 + p, 1 + 1.5p]
• Problematic if p is large
• Can easily be parallelised if derivatives are required at a large number of evaluation points x
• Can also be parallelised to propagate several directions simultaneously
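As an illustration of the amortisation (this example is not from the slides): take F(x1, x2) = sin(x1·x2) with n = 2, m = 1 and the seed matrix Ẋ = I, so p = 2. One forward sweep computes

    v = x1·x2,    v̇_j = ẋ1,j·x2 + x1·ẋ2,j    (j = 1, 2)
    y = sin(v),   ẏ_j = cos(v)·v̇_j

The intermediates v and cos(v) are computed once and reused for both directions, so the two columns of Ẏ = cos(x1·x2)·(x2, x1) only cost the cheap linear tangent updates rather than a full re-evaluation of F per direction.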


Vector forward mode: in parallel

F : R^n → R^m,  x ∈ R^n,  Ẋ ∈ R^{n×p}
Ẏ = F′(x) Ẋ ∈ R^{m×p}
Ḟ(x, Ẋ) := [F(x), F′(x) Ẋ]

• Can easily be parallelised if derivatives are required at a large number of evaluation points x
• Can also be parallelised to propagate several directions simultaneously
• Parallel evaluation at different points:
  • scatter the evaluation points in the grid or among blocks
• Parallel evaluation of different directional derivatives:
  • scatter directions among threads
• Both these parallelisations can be done simultaneously on the GPU


Multipoint, multidirection GPU evaluation

• All input and output data pointers are provided to the kernel
• The kernel is copied to each thread in each block on the grid
• The CUDA Runtime provides the following structures to a kernel:
  • gridDim — dimensions of the grid (∈ N^3)
  • blockIdx — index of the block in the grid (∈ N_0^3)
  • blockDim — dimensions of the block (∈ N^3)
  • threadIdx — index of the thread in the block (∈ N_0^3)
• The indices are used to select the data on which the thread will work (see the indexing sketch below)
• The result is stored in the correct location using the indices
• The dimensions of grid and block are set by the caller of the kernel
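A sketch of how these built-in structures can be linearised into a point index and a direction index (not shown on the slides; the helper name my_indices is made up for illustration):

    __device__ void my_indices(size_t& point, size_t& dir, size_t& p)
    {
        // block position in the grid selects the evaluation point
        point = blockIdx.x + (size_t)gridDim.x
                * (blockIdx.y + (size_t)gridDim.y * blockIdx.z);
        // thread position in the block selects the tangent direction
        dir   = threadIdx.x + (size_t)blockDim.x
                * (threadIdx.y + (size_t)blockDim.y * threadIdx.z);
        // number of directions handled per point
        p     = (size_t)blockDim.x * blockDim.y * blockDim.z;
    }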


Implementation

• namespace adtlc encapsulates the implementation
• class adtlc::adouble is analogous to the scalar traceless forward mode
• Member functions and operators are annotated __device__ to indicate callability inside GPU kernels (a minimal sketch of this pattern follows below)
• The evaluation routine using adtlc::adouble objects is completely analogous to the traceless CPU-based implementation
• A GPU kernel function can have a signature like

    __global__ void kernel_func(double* inx, size_t n,
                                double* outy, size_t m, double* dery);

• The annotation __global__ indicates that all GPU threads have access to this code, and the distribution (blocks, threads) must be specified
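A minimal sketch of that pattern (assumed for illustration, not taken from the ADOL-C sources; the real adtlc::adouble supports the full operator set and differs in detail): a value plus a single tangent per object, with every member annotated __device__ so it can be used inside kernels, one direction per thread.

    namespace adtlc {
        class adouble {
        public:
            double val;     // function value
            double adval;   // one tangent: each thread propagates one direction
            __device__ adouble(double v = 0.0, double d = 0.0) : val(v), adval(d) {}
            __device__ adouble operator*(const adouble& b) const {
                // product rule for value and tangent
                return adouble(val * b.val, adval * b.val + val * b.adval);
            }
            __device__ adouble operator+(const adouble& b) const {
                return adouble(val + b.val, adval + b.adval);
            }
        };
        __device__ inline adouble sin(const adouble& a) {
            // chain rule for an elementary function
            return adouble(::sin(a.val), ::cos(a.val) * a.adval);
        }
    }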


Implementation

    __global__ void kernel_func(double* inx, size_t n,
                                double* outy, size_t m, double* dery);

    __device__ void evalf(adtlc::adouble* x, size_t n,
                          adtlc::adouble* y, size_t m);

• Say derivatives are evaluated at N = (N_x × N_y × N_z) points in p = (p_x × p_y × p_z) directions simultaneously
• Use Nb = {N_x, N_y, N_z} blocks of Nt = {p_x, p_y, p_z} threads each
• inx has n × N entries
• outy has m × N entries
• dery has m × p × N entries
• The kernel call looks as follows:

    kernel_func<<<Nb, Nt>>>(inx, n, outy, m, dery);

A possible body for kernel_func is sketched below.
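The slides do not show the body of kernel_func, so the following is only an illustrative sketch under stated assumptions: each block handles one evaluation point and each thread one direction (1-D indices for brevity; the 3-D case linearises as sketched earlier), the tangents are seeded with Cartesian directions, and n and m stay below small fixed bounds MAXN and MAXM.

    #define MAXN 16
    #define MAXM 16

    __device__ void evalf(adtlc::adouble* x, size_t n,
                          adtlc::adouble* y, size_t m);

    __global__ void kernel_func(double* inx, size_t n,
                                double* outy, size_t m, double* dery)
    {
        size_t point = blockIdx.x;    // which evaluation point
        size_t dir   = threadIdx.x;   // which tangent direction
        size_t p     = blockDim.x;    // directions per point

        adtlc::adouble x[MAXN], y[MAXM];

        // gather the inputs for this point, seed the tangent of this direction
        for (size_t i = 0; i < n; ++i)
            x[i] = adtlc::adouble(inx[point * n + i], i == dir ? 1.0 : 0.0);

        evalf(x, n, y, m);

        // scatter function values (once per point) and directional derivatives
        for (size_t j = 0; j < m; ++j) {
            if (dir == 0) outy[point * m + j] = y[j].val;
            dery[(point * p + dir) * m + j] = y[j].adval;
        }
    }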


Runtimes

[Plot: option pricing example — time in msec (log scale, 1 to 10000) vs. number of points (5 to 320) for Traceless Vector Forward, Traced Vector Forward, GPU Vector Forward, and Reverse]


Runtimes

[Plot: option pricing example — time in msec (log scale, 1 to 10000) vs. number of points (5 to 80) for the same four variants]


Runtimes

[Plot: multibody mechanical system — time in msec (log scale, 1 to 100000) vs. number of points (5 to 5120) for Traceless Vector Forward, Traced Vector Forward, GPU Vector Forward, and Vector Reverse]


Summary, Issues & Outlook

• GPU forward mode handily beats CPU-based traceless and traced computations in ADOL-C
• The CUDA implementation of the forward mode is more or less straightforward
• The user needs to concern themselves with extra issues like the distribution of the kernel function
• Some GPUs have internal DRAM, so data needs to be explicitly transferred to and from it
• CUDA provides the cudaMalloc() and cudaMemcpy() routines for data transfer (see the host-side sketch below)
• Several future development directions and questions remain open:
  • second-order derivatives
  • Taylor polynomial propagation [distribution of threads? d-dimensional operations? concurrency?]
  • portability to other hardware [OpenCL? GLSL?]
  • GPU-based trace interpreter [memory allocation for the trace?]
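An illustrative host-side driver (assumed, not from the slides; the function name drive and the 1-D launch shape are made up): allocate device buffers with cudaMalloc, copy the inputs in, launch kernel_func over N blocks of p threads, and copy the results back with cudaMemcpy. Error checking is omitted for brevity.

    #include <cuda_runtime.h>

    __global__ void kernel_func(double* inx, size_t n,
                                double* outy, size_t m, double* dery);

    void drive(const double* x_host, size_t n, size_t m,
               size_t N, size_t p, double* y_host, double* dy_host)
    {
        double *inx, *outy, *dery;
        cudaMalloc((void**)&inx,  N * n * sizeof(double));
        cudaMalloc((void**)&outy, N * m * sizeof(double));
        cudaMalloc((void**)&dery, N * p * m * sizeof(double));

        // transfer the evaluation points to the on-device DRAM
        cudaMemcpy(inx, x_host, N * n * sizeof(double), cudaMemcpyHostToDevice);

        dim3 Nb((unsigned)N);   // one block per evaluation point
        dim3 Nt((unsigned)p);   // one thread per direction
        kernel_func<<<Nb, Nt>>>(inx, n, outy, m, dery);

        // copy function values and directional derivatives back
        cudaMemcpy(y_host,  outy, N * m * sizeof(double),     cudaMemcpyDeviceToHost);
        cudaMemcpy(dy_host, dery, N * p * m * sizeof(double), cudaMemcpyDeviceToHost);

        cudaFree(inx);  cudaFree(outy);  cudaFree(dery);
    }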
