
CS179 GPU Programming:
Recitation 6: Monte-Carlo Integration
Lecture originally by Luke Durant and Tamas Szalay


Integration

● Oftentimes, we can integrate a function analytically:
  ▸ f(x) = e^x
  ▸ ∫₀¹ f(x) dx = e^1 − e^0 = e − 1
● Other times, we can't:
  ▸ f(x) = e^(x^x)
  ▸ F(x) = ?


Integration

● We can use discrete Riemann integration:
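A minimal CPU-side sketch of the idea (the function and variable names here are illustrative, not from the lab): chop [a, b] into n slices, sum f at each slice's midpoint, and multiply by the slice width.

    #include <cmath>
    #include <cstdio>

    // Midpoint Riemann sum: approximate the integral of f over [a, b]
    // by summing f at the midpoint of n equal-width slices.
    double riemann(double (*f)(double), double a, double b, int n) {
        double dx = (b - a) / n;
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += f(a + (i + 0.5) * dx);  // height of slice i
        return sum * dx;                   // slice width * summed heights
    }

    double f(double x) { return std::exp(x); }

    int main() {
        // Should approach e - 1 ~ 1.71828 as n grows
        printf("%f\n", riemann(f, 0.0, 1.0, 1000));
        return 0;
    }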


Integration

● But what if we don't have a defined function?
● e.g. find the area of the union of shapes below:


Monte-Carlo Integration

● Solution: Monte-Carlo Integration
  ▸ Saturate the space with a lot of random points
  ▸ If a point is in one of the shapes, it's in the union of them
  ▸ Calculate the ratio of # of points in the union to total points
  ▸ Area = (# points in union / # total points in space) * area of space
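As a hedged, CPU-only illustration of the recipe (using C's rand() for brevity; the GPU version uses CURAND, covered below): estimate the area of the unit circle by sampling its bounding square.

    #include <cstdio>
    #include <cstdlib>

    int main() {
        // Estimate the area of the unit circle (pi) by sampling
        // uniform points in the bounding square [-1, 1] x [-1, 1].
        const int total = 1000000;
        int hits = 0;
        for (int i = 0; i < total; i++) {
            double x = 2.0 * rand() / RAND_MAX - 1.0;
            double y = 2.0 * rand() / RAND_MAX - 1.0;
            if (x * x + y * y <= 1.0)  // point landed inside the shape
                hits++;
        }
        // Area = (# points inside / # total points) * area of square (4)
        printf("pi ~ %f\n", 4.0 * hits / total);
        return 0;
    }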


Lab 6

● Given an arbitrary union of N spheres, find the volume
  ▸ Very difficult, or even impossible, to do analytically
  ▸ Use Monte-Carlo Integration
  ▸ Generate lots of randomized points
  ▸ Find which points are contained in any sphere (a sketch of the test follows)
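The per-point containment test is the core of the kernel. A hedged sketch (the Sphere struct and function name are assumptions, not the lab's starter code):

    struct Sphere { float x, y, z, r; };

    // A point is in the union iff it is inside at least one sphere.
    __host__ __device__ bool inAnySphere(const Sphere* s, int n,
                                         float px, float py, float pz) {
        for (int i = 0; i < n; i++) {
            float dx = px - s[i].x, dy = py - s[i].y, dz = pz - s[i].z;
            // compare squared distances to avoid a sqrt
            if (dx * dx + dy * dy + dz * dz <= s[i].r * s[i].r)
                return true;
        }
        return false;
    }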


Randomness on the GPU

● Remember Lab 3?
  ▸ Randomness on the GPU is hard
  ▸ We just used some weird "pseudo"-random function
  ▸ Not that great
  ▸ Biased towards some values, like zero
● How can we get better, unbiased random data on the GPU quickly?


Randomness on the GPU

● Naive approach:
  ▸ Allocate arrays on the host and device
  ▸ Generate random data on the host
  ▸ Copy to the device
● Problem: this is slow!
  ▸ Even using multiple threads, the CPU cannot generate random data as quickly as the GPU
  ▸ Also, we are copying lots of data...
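For concreteness, a hedged sketch of the naive approach (the function name is illustrative):

    #include <cuda_runtime.h>
    #include <random>
    #include <vector>

    // Fill an array on the CPU with C++ <random>, then copy it over.
    // Correct, but serial generation plus the cudaMemcpy is slow.
    float* naiveRandomToDevice(size_t n) {
        std::vector<float> host(n);
        std::mt19937 rng(42);
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        for (size_t i = 0; i < n; i++)
            host[i] = dist(rng);                         // serial generation
        float* dev;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host.data(), n * sizeof(float),  // lots of copying
                   cudaMemcpyHostToDevice);
        return dev;
    }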


Randomness on the GPU

● Solution: CURAND
  ▸ NVIDIA's library for random number generation in CUDA
● Unlike most libraries, CURAND can be called from the host and the device
  ▸ Although the APIs are a bit different


CURAND Host API

● The CURAND Host API provides functions callable on the host to generate random data in GPU global memory
  ▸ Can create multiple pseudorandom generators using different algorithms
  ▸ Can sample from a few different distributions


CURAND Host API

● Pretty easy to use:
  ▸ curandCreateGenerator()
  ▸ curandSetPseudoRandomGeneratorSeed()
  ▸ curandGenerate()
  ▸ curandDestroyGenerator()
● Can generate random numbers on the host too:
  ▸ curandCreateGeneratorHost()
  ▸ No real need to do this, though, since we have standard C++ random functions


CURAND Host API

● Example:

    curandGenerator_t r;
    // argument tells which algorithm to use
    curandCreateGenerator(&r, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetStream(r, stream);  // optional
    curandSetPseudoRandomGeneratorSeed(r, seed);
    curandGenerateUniform(r, data, numElems);
    curandDestroyGenerator(r);

● Seed value can be anything
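Putting the slide's calls into a complete, compilable program (assuming data is a float buffer in GPU global memory; build with nvcc -lcurand):

    #include <cuda_runtime.h>
    #include <curand.h>
    #include <cstdio>

    int main() {
        const size_t numElems = 1 << 20;
        float* data;
        cudaMalloc(&data, numElems * sizeof(float));

        curandGenerator_t r;
        curandCreateGenerator(&r, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(r, 1234ULL);
        curandGenerateUniform(r, data, numElems);  // fills device memory
        curandDestroyGenerator(r);

        // spot-check one value on the host
        float first;
        cudaMemcpy(&first, data, sizeof(float), cudaMemcpyDeviceToHost);
        printf("first value: %f\n", first);

        cudaFree(data);
        return 0;
    }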


CURAND Device API

● What if you can't allocate memory for all the random data your kernel needs?
● Solution: Device API
  ▸ Supports generation of random data within kernels
  ▸ No need to generate all of it before running the kernel


CURAND Device API

● Now, RNG states are stored entirely on the GPU
  ▸ Still need to allocate space
  ▸ So, on the host we need to do:

    curandState* devStates;
    cudaMalloc(&devStates, numThreads * sizeof(curandState));
    kernel_func<<<numBlocks, threadsPerBlock>>>(devStates);
    cudaFree(devStates);
    // don't free devStates if you want to use
    // them again in another kernel


CURAND Device API

● Once states are allocated, initialize and use them in the kernel:

    int x = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, x, 0, &states[x]);
    // generate uniform float in (a, b]
    // (curand_uniform returns a value in (0, 1])
    v[x] = curand_uniform(&states[x]) * (b - a) + a;

● No need to destroy states when done; just call cudaFree
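The same two steps as a complete kernel (a hedged sketch; the kernel name and parameters are illustrative):

    #include <curand_kernel.h>

    // Each thread seeds its own state, then draws one uniform
    // float in (a, b]. Using x as the sequence number gives each
    // thread a statistically independent stream from one seed.
    __global__ void generateUniform(curandState* states, float* v,
                                    float a, float b,
                                    unsigned long long seed) {
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        curand_init(seed, x, 0, &states[x]);
        v[x] = curand_uniform(&states[x]) * (b - a) + a;
    }

In practice you would typically initialize the states once in a separate kernel and reuse them, since curand_init is relatively expensive; that is why the previous slide keeps devStates alive between kernels.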


CURAND Overview

● Generate random numbers on the device from either the host or the device
● Can sample different distributions (uniform, normal, log-normal)
● See the CURAND user guide for more detailed information


Linear Algebra

● Many libraries available for matrix algebra
  ▸ GSL, CBLAS, LAPACK
● Most matrix/vector operations are very parallelizable
  ▸ Perfect for CUDA acceleration!
  ▸ Recall the matrix multiplication example


CUBLAS

● NVIDIA's CUBLAS library provides many basic linear algebra functions:
  ▸ BLAS1 – vector functions: min, max, sum, add, scale, dot, etc.
  ▸ BLAS2 – matrix/vector functions: multiplication, transposition, system solvers
  ▸ BLAS3 – matrix/matrix functions: multiplication
  ▸ See the CUBLAS docs for more detailed information
  ▸ You'll need a vector sum function for this lab (one option is sketched below)
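One option, hedged: CUBLAS has no plain "sum" routine, but the BLAS1 routine cublasSasum (sum of absolute values) gives the plain sum whenever the entries are non-negative, e.g. 0/1 hit flags from the Monte-Carlo test. The setup below is a sketch, not the lab's required approach.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        // pretend these are hit flags: 1.0f = point was inside a sphere
        std::vector<float> h_x(n, 1.0f);
        float* d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        cudaMemcpy(d_x, h_x.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // sum of |x_i|; equals the plain sum for non-negative data
        float sum = 0.0f;
        cublasSasum(handle, n, d_x, 1, &sum);
        printf("sum = %f\n", sum);  // expect 1048576

        cublasDestroy(handle);
        cudaFree(d_x);
        return 0;
    }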


Reduction

● Recall our reduction from GLSL
  ▸ Each iteration reduces a set of elements to one element through some function (e.g. addition)

[Diagram: tree reduction summing the elements 8, 0, 5, 3 down to 16]


Reduction

● We can optimize reductions a lot!
● See the previous lecture for some examples (one is sketched below):
  ▸ Contiguous memory accesses
  ▸ Avoid shared memory bank conflicts
  ▸ Avoid thread divergence
  ▸ Advanced: templates and unrolling loops
● Extra Credit: Get the fastest runtime on minuteman!
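A hedged sketch of a block-level sum reduction with sequential addressing, which keeps shared-memory accesses contiguous and bank-conflict-free; names and launch details are illustrative, not the lecture's exact code:

    // launch: reduceSum<<<blocks, threads, threads * sizeof(float)>>>(...)
    __global__ void reduceSum(const float* in, float* out, int n) {
        extern __shared__ float sdata[];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // load one element per thread (0 if past the end of the array)
        sdata[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // halve the number of active threads each iteration;
        // active threads form a contiguous prefix, so a warp is
        // either fully active or fully idle (little divergence)
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        // thread 0 writes this block's partial sum; reduce `out`
        // again (or finish on the host) for the final total
        if (tid == 0)
            out[blockIdx.x] = sdata[0];
    }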
