
CS179 GPU Programming:
Recitation 6: Monte-Carlo Integration
Lecture originally by Luke Durant and Tamas Szalay


Integration

● Oftentimes, we can integrate a function analytically:
  ▸ f(x) = e^x
  ▸ ∫₀¹ f(x) dx = e^1 − e^0 = e − 1
● Other times, we can't:
  ▸ f(x) = e^(x^x)
  ▸ F(x) = ?


Integration

● We can use discrete Riemann integration:
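A minimal CPU-side sketch of the idea (the function and variable names here are illustrative, not from the lab): chop [a, b] into n slices, sum f at each slice's midpoint, and multiply by the slice width.

    #include <cmath>
    #include <cstdio>

    // Midpoint Riemann sum: approximate the integral of f over [a, b]
    // by summing f at the midpoint of n equal-width slices.
    double riemann(double (*f)(double), double a, double b, int n) {
        double dx = (b - a) / n;
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += f(a + (i + 0.5) * dx);  // height of slice i
        return sum * dx;                   // slice width * summed heights
    }

    double f(double x) { return std::exp(x); }

    int main() {
        // Should approach e - 1 ~ 1.71828 as n grows
        printf("%f\n", riemann(f, 0.0, 1.0, 1000));
        return 0;
    }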


Integration

● But what if we don't have a defined function?
● e.g. find the area of the union of shapes below:


Monte-Carlo Integration

● Solution: Monte-Carlo Integration
  ▸ Saturate the space with a lot of random points
  ▸ If a point is in one of the shapes, it's in the union of them
  ▸ Calculate the ratio of # of points in the union to total points
  ▸ Area = (# points in union / # total points in space) * area of space
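As a hedged, CPU-only illustration of the recipe (using C's rand() for brevity; the GPU version uses CURAND, covered below): estimate the area of the unit circle by sampling its bounding square.

    #include <cstdio>
    #include <cstdlib>

    int main() {
        // Estimate the area of the unit circle (pi) by sampling
        // uniform points in the bounding square [-1, 1] x [-1, 1].
        const int total = 1000000;
        int hits = 0;
        for (int i = 0; i < total; i++) {
            double x = 2.0 * rand() / RAND_MAX - 1.0;
            double y = 2.0 * rand() / RAND_MAX - 1.0;
            if (x * x + y * y <= 1.0)  // point landed inside the shape
                hits++;
        }
        // Area = (# points inside / # total points) * area of square (4)
        printf("pi ~ %f\n", 4.0 * hits / total);
        return 0;
    }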


Lab 6

● Given an arbitrary union of N spheres, find the volume
  ▸ Very difficult, or even impossible, to do analytically
  ▸ Use Monte-Carlo Integration
  ▸ Generate lots of randomized points
  ▸ Find which points are contained in any sphere (a sketch of the test follows)
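The per-point containment test is the core of the kernel. A hedged sketch (the Sphere struct and function name are assumptions, not the lab's starter code):

    struct Sphere { float x, y, z, r; };

    // A point is in the union iff it is inside at least one sphere.
    __host__ __device__ bool inAnySphere(const Sphere* s, int n,
                                         float px, float py, float pz) {
        for (int i = 0; i < n; i++) {
            float dx = px - s[i].x, dy = py - s[i].y, dz = pz - s[i].z;
            // compare squared distances to avoid a sqrt
            if (dx * dx + dy * dy + dz * dz <= s[i].r * s[i].r)
                return true;
        }
        return false;
    }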


Randomness on the GPU

● Remember Lab 3?
  ▸ Randomness on the GPU is hard
  ▸ We just used some weird "pseudo"-random function
  ▸ Not that great
  ▸ Biased towards some values, like zero
● How can we get better, unbiased random data on the GPU quickly?


Randomness on the GPU

● Naive approach:
  ▸ Allocate arrays on the host and device
  ▸ Generate random data on the host
  ▸ Copy to the device
● Problem: this is slow!
  ▸ Even using multiple threads, the CPU cannot generate random data as quickly as the GPU
  ▸ Also, we are copying lots of data...
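For concreteness, a hedged sketch of the naive approach (the function name is illustrative):

    #include <cuda_runtime.h>
    #include <random>
    #include <vector>

    // Fill an array on the CPU with C++ <random>, then copy it over.
    // Correct, but serial generation plus the cudaMemcpy is slow.
    float* naiveRandomToDevice(size_t n) {
        std::vector<float> host(n);
        std::mt19937 rng(42);
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        for (size_t i = 0; i < n; i++)
            host[i] = dist(rng);                         // serial generation
        float* dev;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host.data(), n * sizeof(float),  // lots of copying
                   cudaMemcpyHostToDevice);
        return dev;
    }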


Randomness on the GPU

● Solution: CURAND
  ▸ NVIDIA's library for random number generation in CUDA
● Unlike most libraries, CURAND can be called from the host and the device
  ▸ Although the APIs are a bit different


CURAND Host API

● The CURAND Host API provides functions callable on the host to generate random data in GPU global memory
  ▸ Can create multiple pseudorandom generators using different algorithms
  ▸ Can sample from a few different distributions


CURAND Host API

● Pretty easy to use:
  ▸ curandCreateGenerator()
  ▸ curandSetPseudoRandomGeneratorSeed()
  ▸ curandGenerate()
  ▸ curandDestroyGenerator()
● Can generate random numbers on the host too:
  ▸ curandCreateGeneratorHost()
  ▸ No real need to do this, though, since we have standard C++ random functions


CURAND Host API

● Example:

    curandGenerator_t r;
    // argument tells which algorithm to use
    curandCreateGenerator(&r, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetStream(r, stream);  // optional
    curandSetPseudoRandomGeneratorSeed(r, seed);
    curandGenerateUniform(r, data, numElems);
    curandDestroyGenerator(r);

● Seed value can be anything
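Putting the slide's calls into a complete, compilable program (assuming data is a float buffer in GPU global memory; build with nvcc -lcurand):

    #include <cuda_runtime.h>
    #include <curand.h>
    #include <cstdio>

    int main() {
        const size_t numElems = 1 << 20;
        float* data;
        cudaMalloc(&data, numElems * sizeof(float));

        curandGenerator_t r;
        curandCreateGenerator(&r, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(r, 1234ULL);
        curandGenerateUniform(r, data, numElems);  // fills device memory
        curandDestroyGenerator(r);

        // spot-check one value on the host
        float first;
        cudaMemcpy(&first, data, sizeof(float), cudaMemcpyDeviceToHost);
        printf("first value: %f\n", first);

        cudaFree(data);
        return 0;
    }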


CURAND Device API

● What if you can't allocate memory for all the random data your kernel needs?
● Solution: Device API
  ▸ Supports generation of random data within kernels
  ▸ No need to generate all of it before running the kernel


CURAND Device API

● Now, RNG states are stored entirely on the GPU
  ▸ Still need to allocate space
  ▸ So, on the host we need to do:

    curandState* devStates;
    cudaMalloc(&devStates, numThreads * sizeof(curandState));
    kernel_func<<<numBlocks, threadsPerBlock>>>(devStates);
    cudaFree(devStates);
    // don't free devStates if you want to use
    // them again in another kernel


CURAND Device API

● Once states are allocated, initialize and use them in the kernel:

    int x = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, x, 0, &states[x]);
    // generate uniform float in (a, b]
    // (curand_uniform returns a value in (0, 1])
    v[x] = curand_uniform(&states[x]) * (b - a) + a;

● No need to destroy states when done; just call cudaFree
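The same two steps as a complete kernel (a hedged sketch; the kernel name and parameters are illustrative):

    #include <curand_kernel.h>

    // Each thread seeds its own state, then draws one uniform
    // float in (a, b]. Using x as the sequence number gives each
    // thread a statistically independent stream from one seed.
    __global__ void generateUniform(curandState* states, float* v,
                                    float a, float b,
                                    unsigned long long seed) {
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        curand_init(seed, x, 0, &states[x]);
        v[x] = curand_uniform(&states[x]) * (b - a) + a;
    }

In practice you would typically initialize the states once in a separate kernel and reuse them, since curand_init is relatively expensive; that is why the previous slide keeps devStates alive between kernels.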


CURAND Overview

● Generate random numbers on the device from either the host or the device
● Can sample different distributions (uniform, normal, log-normal)
● See the CURAND user guide for more detailed information


Linear Algebra

● Many libraries available for matrix algebra
  ▸ GSL, CBLAS, LAPACK
● Most matrix/vector operations are very parallelizable
  ▸ Perfect for CUDA acceleration!
  ▸ Recall the matrix multiplication example


CUBLAS

● NVIDIA's CUBLAS library provides many basic linear algebra functions:
  ▸ BLAS1 – vector functions: min, max, sum, add, scale, dot, etc.
  ▸ BLAS2 – matrix/vector functions: multiplication, transposition, system solvers
  ▸ BLAS3 – matrix/matrix functions: multiplication
  ▸ See the CUBLAS docs for more detailed information
  ▸ You'll need a vector sum function for this lab (one option is sketched below)
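One option, hedged: CUBLAS has no plain "sum" routine, but the BLAS1 routine cublasSasum (sum of absolute values) gives the plain sum whenever the entries are non-negative, e.g. 0/1 hit flags from the Monte-Carlo test. The setup below is a sketch, not the lab's required approach.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        // pretend these are hit flags: 1.0f = point was inside a sphere
        std::vector<float> h_x(n, 1.0f);
        float* d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMemcpy(d_x, h_x.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // sum of |x_i|; equals the plain sum for non-negative data
        float sum = 0.0f;
        cublasSasum(handle, n, d_x, 1, &sum);
        printf("sum = %f\n", sum);  // expect 1048576

        cublasDestroy(handle);
        cudaFree(d_x);
        return 0;
    }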


Reduction

● Recall our reduction from GLSL
  ▸ Each iteration reduces a set of elements to one element through some function (e.g. addition)

[Diagram: tree reduction summing the elements 8, 0, 5, 3 down to 16]


Reduction

● We can optimize reductions a lot!
● See the previous lecture for some examples (one is sketched below):
  ▸ Contiguous memory accesses
  ▸ Avoid shared memory bank conflicts
  ▸ Avoid thread divergence
  ▸ Advanced: templates and unrolling loops
● Extra Credit: Get the fastest runtime on minuteman!
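A hedged sketch of a block-level sum reduction with sequential addressing, which keeps shared-memory accesses contiguous and bank-conflict-free; names and launch details are illustrative, not the lecture's exact code:

    // launch: reduceSum<<<blocks, threads, threads * sizeof(float)>>>(...)
    __global__ void reduceSum(const float* in, float* out, int n) {
        extern __shared__ float sdata[];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // load one element per thread (0 if past the end of the array)
        sdata[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // halve the number of active threads each iteration;
        // active threads form a contiguous prefix, so a warp is
        // either fully active or fully idle (little divergence)
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        // thread 0 writes this block's partial sum; reduce `out`
        // again (or finish on the host) for the final total
        if (tid == 0)
            out[blockIdx.x] = sdata[0];
    }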
