
Vectorizing the forward mode of ADOL-C on a GPU using CUDA

Kshitij Kulshreshtha
joint work with Alina Koniaeva
Universität Paderborn

13th European AD Workshop, 10.06.2013


GPU Computing: general purpose computing on the GPU

CPU
• Optimised for low-latency access to cached data
• Control logic for out-of-order and speculative execution
• Low to moderate number of cores for computation

GPU
• Optimised for data-parallel computations
• Architecture tolerates higher memory latency
• Large number of cores dedicated to computation

[Diagram: CPU vs. GPU block layout — control logic, cache, ALUs, and DRAM]


NVIDIA's CUDA architecture

• CUDA: runtime environment for offloading computations onto supported NVIDIA GPUs
• An extension of the C language as well as libraries
• The level of support depends on the device driver in the OS
• Computations are offloaded to GPUs that have a large number of computing cores

[Diagram: software stack — Application, CUDA Libraries, CUDA Runtime, CUDA Driver — spanning CPU and GPU]


NVIDIA's CUDA architecture

• Code consists of kernel functions executed on different cores
• A kernel performs the same action on different data (SIMD)
• Kernel execution is distributed over grids of blocks of threads
• Threads are grouped into blocks and blocks into grids
• E.g. an NVIDIA Quadro 4000 with CUDA Runtime version 4.2 has
  maximum number of threads per block: 1024
  maximum sizes of each dimension of a block: 1024 x 1024 x 64
  maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
  (these limits can be queried at runtime, as sketched below)
• Several kernels may be started in parallel by distributing them on the grid

The host issues a succession of kernel invocations to the device. Each kernel is executed as a batch of threads organized as a grid of thread blocks.

[Diagram: Host and Device — Kernel 1 launched on Grid 1 of blocks, Kernel 2 on Grid 2; Block (1,1) expanded into a 5 x 3 array of threads]
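The limits quoted above can be obtained programmatically. The following is a minimal sketch (not from the slides) using the standard CUDA runtime call cudaGetDeviceProperties and fields of the cudaDeviceProp structure:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        std::printf("threads per block: %d\n", prop.maxThreadsPerBlock);
        std::printf("block dims: %d x %d x %d\n",
                    prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        std::printf("grid dims:  %d x %d x %d\n",
                    prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        return 0;
    }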


NVIDIA's CUDA architecture

• Memory access is bidirectional
• Data can be gathered from various memory locations to each core
• Each core may scatter computed data across various memory locations
• Some GPUs provide on-device DRAM as a buffer between system memory and execution cores

[Diagram: gather and scatter — groups of ALUs with control logic and cache reading from and writing to DRAM locations d0 … d7]


Vector forward mode

F : R^n → R^m,  x ∈ R^n,  Ẋ ∈ R^{n×p}
Ẏ = F′(x) Ẋ ∈ R^{m×p}
Ḟ(x, Ẋ) := [F(x), F′(x) Ẋ]

• Amortises the overhead by reusing intermediate values for the various directions instead of recomputing them (see the worked example below)
• Serial: TIME(Ḟ)/TIME(F) ∈ [1 + p, 1 + 1.5p]
• Problematic if p is large
• Can easily be parallelised if derivatives are required at a large number of evaluation points x
• Can also be parallelised to propagate several directions simultaneously
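As an illustration of the amortisation (this example is not from the slides): take F(x1, x2) = sin(x1·x2) with n = 2, m = 1 and the seed matrix Ẋ = I, so p = 2. One forward sweep computes

    v = x1·x2,    v̇_j = ẋ1,j·x2 + x1·ẋ2,j    (j = 1, 2)
    y = sin(v),   ẏ_j = cos(v)·v̇_j

The intermediates v and cos(v) are computed once and reused for both directions, so the two columns of Ẏ = cos(x1·x2)·(x2, x1) only cost the cheap linear tangent updates rather than a full re-evaluation of F per direction.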


Vector forward mode: in parallel

F : R^n → R^m,  x ∈ R^n,  Ẋ ∈ R^{n×p}
Ẏ = F′(x) Ẋ ∈ R^{m×p}
Ḟ(x, Ẋ) := [F(x), F′(x) Ẋ]

• Can easily be parallelised if derivatives are required at a large number of evaluation points x
• Can also be parallelised to propagate several directions simultaneously
• Parallel evaluation at different points:
  • scatter the evaluation points in the grid or among blocks
• Parallel evaluation of different directional derivatives:
  • scatter directions among threads
• Both these parallelisations can be done simultaneously on the GPU


Multipoint, multidirection GPU evaluation

• All input and output data pointers are provided to the kernel
• The kernel is copied to each thread in each block on the grid
• The CUDA Runtime provides the following structures to a kernel:
  • gridDim — dimensions of the grid (∈ N^3)
  • blockIdx — index of the block in the grid (∈ N_0^3)
  • blockDim — dimensions of the block (∈ N^3)
  • threadIdx — index of the thread in the block (∈ N_0^3)
• The indices are used to select the data on which the thread will work (see the indexing sketch below)
• The result is stored in the correct location using the indices
• The dimensions of grid and block are set by the caller of the kernel
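A sketch of how these built-in structures can be linearised into a point index and a direction index (not shown on the slides; the helper name my_indices is made up for illustration):

    __device__ void my_indices(size_t& point, size_t& dir, size_t& p)
    {
        // block position in the grid selects the evaluation point
        point = blockIdx.x + (size_t)gridDim.x
                * (blockIdx.y + (size_t)gridDim.y * blockIdx.z);
        // thread position in the block selects the tangent direction
        dir   = threadIdx.x + (size_t)blockDim.x
                * (threadIdx.y + (size_t)blockDim.y * threadIdx.z);
        // number of directions handled per point
        p     = (size_t)blockDim.x * blockDim.y * blockDim.z;
    }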


Implementation

• namespace adtlc encapsulates the implementation
• class adtlc::adouble is analogous to the scalar traceless forward mode
• Member functions and operators are annotated __device__ to indicate callability inside GPU kernels (a minimal sketch of this pattern follows below)
• The evaluation routine using adtlc::adouble objects is completely analogous to the traceless CPU-based implementation
• A GPU kernel function can have a signature like

    __global__ void kernel_func(double* inx, size_t n,
                                double* outy, size_t m, double* dery);

• The annotation __global__ indicates that all GPU threads have access to this code, and the distribution (blocks, threads) must be specified
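A minimal sketch of that pattern (assumed for illustration, not taken from the ADOL-C sources; the real adtlc::adouble supports the full operator set and differs in detail): a value plus a single tangent per object, with every member annotated __device__ so it can be used inside kernels, one direction per thread.

    namespace adtlc {
        class adouble {
        public:
            double val;     // function value
            double adval;   // one tangent: each thread propagates one direction
            __device__ adouble(double v = 0.0, double d = 0.0) : val(v), adval(d) {}
            __device__ adouble operator*(const adouble& b) const {
                // product rule for value and tangent
                return adouble(val * b.val, adval * b.val + val * b.adval);
            }
            __device__ adouble operator+(const adouble& b) const {
                return adouble(val + b.val, adval + b.adval);
            }
        };
        __device__ inline adouble sin(const adouble& a) {
            // chain rule for an elementary function
            return adouble(::sin(a.val), ::cos(a.val) * a.adval);
        }
    }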


Implementation

    __global__ void kernel_func(double* inx, size_t n,
                                double* outy, size_t m, double* dery);

    __device__ void evalf(adtlc::adouble* x, size_t n,
                          adtlc::adouble* y, size_t m);

• Say derivatives are evaluated at N = (N_x × N_y × N_z) points in p = (p_x × p_y × p_z) directions simultaneously
• Use Nb = {N_x, N_y, N_z} blocks of Nt = {p_x, p_y, p_z} threads each
• inx has n × N entries
• outy has m × N entries
• dery has m × p × N entries
• The kernel call looks as follows:

    kernel_func<<<Nb, Nt>>>(inx, n, outy, m, dery);

A possible body for kernel_func is sketched below.
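The slides do not show the body of kernel_func, so the following is only an illustrative sketch under stated assumptions: each block handles one evaluation point and each thread one direction (1-D indices for brevity; the 3-D case linearises as sketched earlier), the tangents are seeded with Cartesian directions, and n and m stay below small fixed bounds MAXN and MAXM.

    #define MAXN 16
    #define MAXM 16

    __device__ void evalf(adtlc::adouble* x, size_t n,
                          adtlc::adouble* y, size_t m);

    __global__ void kernel_func(double* inx, size_t n,
                                double* outy, size_t m, double* dery)
    {
        size_t point = blockIdx.x;    // which evaluation point
        size_t dir   = threadIdx.x;   // which tangent direction
        size_t p     = blockDim.x;    // directions per point

        adtlc::adouble x[MAXN], y[MAXM];

        // gather the inputs for this point, seed the tangent of this direction
        for (size_t i = 0; i < n; ++i)
            x[i] = adtlc::adouble(inx[point * n + i], i == dir ? 1.0 : 0.0);

        evalf(x, n, y, m);

        // scatter function values (once per point) and directional derivatives
        for (size_t j = 0; j < m; ++j) {
            if (dir == 0) outy[point * m + j] = y[j].val;
            dery[(point * p + dir) * m + j] = y[j].adval;
        }
    }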


Runtimes

[Plot: option pricing example — time in msec (log scale, 1 to 10000) vs. number of points (5 to 320) for Traceless Vector Forward, Traced Vector Forward, GPU Vector Forward, and Reverse]


Runtimes

[Plot: option pricing example — time in msec (log scale, 1 to 10000) vs. number of points (5 to 80) for the same four variants]


Runtimes

[Plot: multibody mechanical system — time in msec (log scale, 1 to 100000) vs. number of points (5 to 5120) for Traceless Vector Forward, Traced Vector Forward, GPU Vector Forward, and Vector Reverse]


Summary, Issues & Outlook

• GPU forward mode handily beats CPU-based traceless and traced computations in ADOL-C
• The CUDA implementation of the forward mode is more or less straightforward
• The user needs to concern themselves with extra issues like the distribution of the kernel function
• Some GPUs have internal DRAM, so data needs to be explicitly transferred to and from it
• CUDA provides the cudaMalloc() and cudaMemcpy() routines for data transfer (see the host-side sketch below)
• Several future development directions and questions remain open:
  • second-order derivatives
  • Taylor polynomial propagation [distribution of threads? d-dimensional operations? concurrency?]
  • portability to other hardware [OpenCL? GLSL?]
  • GPU-based trace interpreter [memory allocation for the trace?]
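An illustrative host-side driver (assumed, not from the slides; the function name drive and the 1-D launch shape are made up): allocate device buffers with cudaMalloc, copy the inputs in, launch kernel_func over N blocks of p threads, and copy the results back with cudaMemcpy. Error checking is omitted for brevity.

    #include <cuda_runtime.h>

    __global__ void kernel_func(double* inx, size_t n,
                                double* outy, size_t m, double* dery);

    void drive(const double* x_host, size_t n, size_t m,
               size_t N, size_t p, double* y_host, double* dy_host)
    {
        double *inx, *outy, *dery;
        cudaMalloc((void**)&inx,  N * n * sizeof(double));
        cudaMalloc((void**)&outy, N * m * sizeof(double));
        cudaMalloc((void**)&dery, N * p * m * sizeof(double));

        // transfer the evaluation points to the on-device DRAM
        cudaMemcpy(inx, x_host, N * n * sizeof(double), cudaMemcpyHostToDevice);

        dim3 Nb((unsigned)N);   // one block per evaluation point
        dim3 Nt((unsigned)p);   // one thread per direction
        kernel_func<<<Nb, Nt>>>(inx, n, outy, m, dery);

        // copy function values and directional derivatives back
        cudaMemcpy(y_host,  outy, N * m * sizeof(double),     cudaMemcpyDeviceToHost);
        cudaMemcpy(dy_host, dery, N * p * m * sizeof(double), cudaMemcpyDeviceToHost);

        cudaFree(inx);  cudaFree(outy);  cudaFree(dery);
    }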
