Recitation 6: Monte-Carlo Integration - Caltech
CS179 GPU Programming:
Recitation 6: Monte-Carlo Integration
Lecture originally by Luke Durant and Tamas Szalay
Integration
● Oftentimes, we can integrate a function analytically
▸ $f(x) = e^x$
▸ $\int_0^1 f(x)\,dx = e^1 - e^0 = e - 1$
● Other times, we can't
▸ $f(x) = e^{x^x}$
▸ $F(x) = {?}$
Integration
● We can use discrete Riemann integration: approximate the area under the curve by a sum of thin rectangles
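As a rough illustration of discrete Riemann integration, the sketch below approximates the integral of f(x) = e^x over [0, 1] with a midpoint sum. The names `riemann_midpoint` and `exp_integrand` are hypothetical helpers chosen for this example, and the midpoint rule is just one of several possible choices of sample point.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Midpoint Riemann sum of f over [a, b] using n equal-width slices.
template <typename F>
double riemann_midpoint(F f, double a, double b, std::size_t n) {
    assert(n > 0 && "need at least one slice");
    const double width = (b - a) / static_cast<double>(n);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // Sample the function at the middle of slice i.
        const double x_mid = a + (static_cast<double>(i) + 0.5) * width;
        sum += f(x_mid);
    }
    return sum * width;  // total height times the common slice width
}

// The analytically integrable example from the slides: f(x) = e^x.
inline double exp_integrand(double x) { return std::exp(x); }
```

Calling `riemann_midpoint(exp_integrand, 0.0, 1.0, 1000)` should land close to the analytic value e − 1; increasing the slice count tightens the approximation.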
Integration
● But what if we don't have a defined function?
● e.g. find the area of the union of several shapes (figure not preserved)
Monte-Carlo Integration
● Solution: Monte-Carlo Integration
▸ Saturate the space with a lot of random points
▸ If a point is in any one of the shapes, it's in their union
▸ Calculate the ratio of the number of points in the union to the total number of points
▸ Area ≈ (# points in union / # total points in space) × area of space
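The recipe above can be sketched on the CPU for a union of discs. This is a minimal illustration, not the lab's method: the `Disc` struct, the `in_union` and `estimate_union_area` names, and the use of `std::mt19937` are all assumptions made for the example, and the caller is assumed to pass a sampling box that encloses every disc.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

struct Disc { double cx, cy, r; };  // hypothetical shape type: centre and radius

// True if (x, y) lies inside at least one disc, i.e. inside the union.
inline bool in_union(const std::vector<Disc>& discs, double x, double y) {
    for (const Disc& d : discs) {
        const double dx = x - d.cx;
        const double dy = y - d.cy;
        if (dx * dx + dy * dy <= d.r * d.r) return true;
    }
    return false;
}

// Estimate the area of the union by sampling the box [x0, x1] x [y0, y1].
double estimate_union_area(const std::vector<Disc>& discs,
                           double x0, double x1, double y0, double y1,
                           std::size_t num_points, unsigned seed) {
    assert(num_points > 0 && x1 > x0 && y1 > y0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> ux(x0, x1);
    std::uniform_real_distribution<double> uy(y0, y1);
    std::size_t hits = 0;
    for (std::size_t i = 0; i < num_points; ++i) {
        const double x = ux(rng);
        const double y = uy(rng);
        if (in_union(discs, x, y)) ++hits;
    }
    const double box_area = (x1 - x0) * (y1 - y0);
    // Area ~= (points in union / total points) * area of the sampled space.
    return box_area * static_cast<double>(hits) / static_cast<double>(num_points);
}
```

Because a point counts once no matter how many discs contain it, overlapping shapes are not double-counted; the estimate for a single unit disc sampled over [-1, 1] × [-1, 1] should approach π as the number of points grows.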
Lab 6
● Given an arbitrary union of N spheres, find the volume
▸ Very difficult, or even impossible, to do analytically
▸ Use Monte-Carlo Integration
▸ Generate lots of randomized points
▸ Find which points are contained in any sphere
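For checking results, a small CPU reference can be handy. The sketch below is one possible version under stated assumptions: the `Sphere` struct and `estimate_union_volume` function are hypothetical names, points are sampled from the axis-aligned bounding box of all spheres, and `std::mt19937` stands in for whatever random source the lab actually uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

struct Sphere { double x, y, z, r; };  // hypothetical type: centre and radius

// Monte-Carlo estimate of the volume of a union of spheres.
double estimate_union_volume(const std::vector<Sphere>& spheres,
                             std::size_t num_points, unsigned seed) {
    assert(!spheres.empty() && num_points > 0);

    // Axis-aligned bounding box enclosing every sphere.
    double lo[3] = {spheres[0].x - spheres[0].r,
                    spheres[0].y - spheres[0].r,
                    spheres[0].z - spheres[0].r};
    double hi[3] = {spheres[0].x + spheres[0].r,
                    spheres[0].y + spheres[0].r,
                    spheres[0].z + spheres[0].r};
    for (const Sphere& s : spheres) {
        lo[0] = std::min(lo[0], s.x - s.r);  hi[0] = std::max(hi[0], s.x + s.r);
        lo[1] = std::min(lo[1], s.y - s.r);  hi[1] = std::max(hi[1], s.y + s.r);
        lo[2] = std::min(lo[2], s.z - s.r);  hi[2] = std::max(hi[2], s.z + s.r);
    }

    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    std::size_t hits = 0;
    for (std::size_t i = 0; i < num_points; ++i) {
        // Random point inside the bounding box.
        const double px = lo[0] + u01(rng) * (hi[0] - lo[0]);
        const double py = lo[1] + u01(rng) * (hi[1] - lo[1]);
        const double pz = lo[2] + u01(rng) * (hi[2] - lo[2]);
        for (const Sphere& s : spheres) {
            const double dx = px - s.x, dy = py - s.y, dz = pz - s.z;
            if (dx * dx + dy * dy + dz * dz <= s.r * s.r) {
                ++hits;   // inside at least one sphere: count once and stop
                break;
            }
        }
    }
    const double box_volume =
        (hi[0] - lo[0]) * (hi[1] - lo[1]) * (hi[2] - lo[2]);
    return box_volume * static_cast<double>(hits) / static_cast<double>(num_points);
}
```

A single unit sphere should come out near 4π/3. A tight bounding box matters: the emptier the sampled space, the more points are wasted and the noisier the estimate for a given point count.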
Randomness on the GPU
● Remember Lab 3
▸ Randomness on the GPU is hard
▸ We just used some weird “pseudo”-random function
▸ Not that great
▸ Biased towards some values, like zero
● How can we get better, unbiased random data on the GPU quickly?
Randomness on the GPU
● Naive approach:
▸ Allocate arrays on the host and device
▸ Generate random data on host
▸ Copy to device
● Problem: this is slow!
▸ Even using multiple threads, the CPU cannot generate random data as quickly as the GPU
▸ Also, we are copying lots of data from host to device
Randomness on the GPU
● Solution: CURAND
▸ NVIDIA's library for random number generation in CUDA
● Unlike most libraries, CURAND can be called from the host and the device
▸ Although the APIs are a bit different
CURAND Host API
● CURAND Host API provides functions callable on the host to generate random data in GPU global memory
▸ Can create multiple pseudorandom generators using different algorithms
▸ Can sample from a few different distributions
CURAND Host API
● Pretty easy to use:
▸ curandCreateGenerator()
▸ curandSetPseudoRandomGeneratorSeed()
▸ curandGenerate()
▸ curandDestroyGenerator()
● Can generate random numbers on the host too:
▸ curandCreateGeneratorHost()
▸ Don't really need to do this though, since we have standard C++ random functions
CURAND Host API
● Example:
curandGenerator_t r;
// argument tells which algorithm to use
curandCreateGenerator(&r, CURAND_RNG_PSEUDO_DEFAULT);
curandSetStream(r, stream); // optional
curandSetPseudoRandomGeneratorSeed(r, seed);
// data is a device pointer with room for numElems floats
curandGenerateUniform(r, data, numElems);
curandDestroyGenerator(r);
● Seed value can be anything
CURAND Device API
● What if you can't allocate memory for all the random data your kernel needs?
● Solution: Device API
▸ Supports generation of random data within kernels
▸ Don't need to generate all of it before running the kernel
CURAND Device API
● Now, RNG states are stored entirely on GPU
▸ Still need to allocate space
▸ So, on the host we need to do:
curandState* devStates;
cudaMalloc(&devStates, numThreads * sizeof(curandState));
// numBlocks and threadsPerBlock are placeholder names (assumed);
// numBlocks * threadsPerBlock should equal numThreads
kernel_func<<<numBlocks, threadsPerBlock>>>(devStates);
cudaFree(devStates);
// don't free devStates if you want to use
// them again in another kernel
CURAND Device API
● When states are allocated, initialize and use them in kernel:
int x = threadIdx.x + blockIdx.x * blockDim.x;
// same seed for every thread, sequence number x, offset 0
curand_init(seed, x, 0, &states[x]);
// generate uniform float in (a, b]; curand_uniform returns values in (0, 1]
v[x] = curand_uniform(&states[x]) * (b - a) + a;
● Don't need to destroy states when done, just call cudaFree
CURAND Overview
● Generate random numbers on device from either host or device
● Can sample different distributions (uniform, normal, log-normal)
● See CURAND user guide for more detailed information
Linear Algebra
● Many libraries available for matrix algebra
▸ GSL, CBLAS, LAPACK
● Most matrix/vector operations are very parallelizable
▸ Perfect for CUDA acceleration!
▸ Recall matrix multiplication example
CUBLAS
● NVIDIA's CUBLAS library provides many basic linear algebra functions:
▸ BLAS1 – vector functions: min, max, sum, add, scale, dot, etc.
▸ BLAS2 – matrix/vector functions: multiplication, transposition, system solvers
▸ BLAS3 – matrix/matrix functions: multiplication
▸ See CUBLAS docs for more detailed information
▸ You'll need a vector sum function for this lab
Reduction
● Recall our reduction from GLSL
▸ Each iteration reduces a set of elements to one element through some function (e.g. addition)
[Figure: reduction of the elements 8, 0, 5, 3 to their sum, 16]
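To make the iteration structure concrete, here is a sequential C++ sketch of a pairwise (tree) sum. It is only an illustration of the access pattern, not the optimized GPU kernel from the lecture; `tree_reduce_sum` is a hypothetical name, and the inner loop stands in for what would be one parallel step across threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pairwise (tree) sum: each pass folds the upper half of the active range
// onto the lower half, so the active range shrinks by half every iteration.
int tree_reduce_sum(std::vector<int> data) {
    assert(!data.empty());
    std::size_t n = data.size();
    while (n > 1) {
        const std::size_t half = (n + 1) / 2;  // rounds up so odd sizes work
        // On a GPU, each value of i would be handled by its own thread.
        for (std::size_t i = 0; i + half < n; ++i) {
            data[i] += data[i + half];
        }
        n = half;
    }
    return data[0];
}
```

With the input 8, 0, 5, 3 the first pass leaves 13, 3 in the front of the array and the second pass leaves 16. Folding the upper half onto the lower half keeps each pass's reads and writes contiguous, which is the access pattern the optimization notes favor.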
Reduction
● We can optimize reductions a lot!
● See the previous lecture for some examples
▸ Contiguous memory accesses
▸ Avoid shared memory bank conflicts
▸ Avoid thread divergence
▸ Advanced: Templates and unrolling loops
● Extra Credit: Get the fastest runtime on minuteman!