<strong>Lecture</strong> 2 – <strong>Threads</strong><br />
Graham Pullan<br />
Department of Engineering
Overview<br />
• <strong>Threads</strong>, thread blocks and shared memory<br />
• Example – 2D heat conduction simulation<br />
• CPU – in C<br />
• GPU – in CUDA (without shared memory)<br />
• GPU – in CUDA (with shared memory)<br />
• Summary – concepts covered
<strong>Threads</strong>, thread blocks<br />
and shared memory
<strong>Threads</strong><br />
• The example in <strong>Lecture</strong> 1 made no use of the parallel processing capability of the GPU!<br />
• We can launch multiple copies of a kernel – one per thread<br />
• A thread executes an independent instance of the kernel<br />
• Alternatively: a kernel is a program written for an individual thread to<br />
execute
Kernel launch – multiple threads<br />
// launch kernel (ntot threads) <br />
vector_add_kernel<<<1, ntot>>>(a_d, b_d, c_d);
Kernel for c = a + b<br />
__global__ void vector_add_kernel (float *a, float *b,<br />
float *c)<br />
{<br />
int i;<br />
i = threadIdx.x;<br />
// add i’th elements <br />
c[i] = a[i] + b[i]; <br />
}
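For context, a minimal host-side sketch of how such a kernel is allocated for, launched and read back (the host array names `a_h`, `b_h`, `c_h` are assumptions, and error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Kernel as on the slide: one thread per element, single block
__global__ void vector_add_kernel(float *a, float *b, float *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main(void)
{
    int ntot = 256;                       // must fit in one block here
    size_t bytes = ntot * sizeof(float);

    // host arrays (hypothetical names)
    float *a_h = (float *)malloc(bytes);
    float *b_h = (float *)malloc(bytes);
    float *c_h = (float *)malloc(bytes);
    for (int i = 0; i < ntot; i++) { a_h[i] = i; b_h[i] = 2 * i; }

    // device arrays
    float *a_d, *b_d, *c_d;
    cudaMalloc((void **)&a_d, bytes);
    cudaMalloc((void **)&b_d, bytes);
    cudaMalloc((void **)&c_d, bytes);

    // copy inputs, launch one block of ntot threads, copy result back
    cudaMemcpy(a_d, a_h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, bytes, cudaMemcpyHostToDevice);
    vector_add_kernel<<<1, ntot>>>(a_d, b_d, c_d);
    cudaMemcpy(c_h, c_d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    free(a_h); free(b_h); free(c_h);
    return 0;
}
```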
More on threads<br />
• Many thousands of concurrent threads are allowed<br />
• “Instant” switching between threads hides memory latency:<br />
• While a thread is waiting for data – switch to another thread which<br />
has data available<br />
• To help the programmer keep track of threads (and to guide their<br />
scheduling on the GPU) threads are organised into thread blocks<br />
• Blocks are further organised into a grid
<strong>Threads</strong>, blocks and grid<br />
Block is shown as a 1D array of threads – but could also be 2D or 3D<br />
Grid is shown as a 1D array of blocks – but could also be 2D
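With a grid of several blocks, each thread combines its block index and its index within the block to obtain a unique global index. A hedged sketch (the extra `n` parameter and the bounds check are additions, not on the slides):

```cuda
__global__ void vector_add_kernel(float *a, float *b, float *c, int n)
{
    // global index = offset of this block + index within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard: n need not be a multiple of blockDim.x
        c[i] = a[i] + b[i];
}
```

A matching launch would round the grid size up, e.g. `vector_add_kernel<<<(n + 255) / 256, 256>>>(a_d, b_d, c_d, n);`.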
More on thread blocks<br />
• Thread blocks are very important because:<br />
• All threads in a block can access the same (fast) local data store:<br />
shared memory<br />
• All threads in a block can be synchronised (i.e. the thread waits until<br />
all threads in the block reach the same point)<br />
• A typical strategy is, therefore:<br />
1. Each thread loads data from global device mem. to shared mem.<br />
2. Synchronise threads<br />
3. Process data and write result back to global memory
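The three-step strategy can be sketched as a kernel skeleton. This is a minimal illustration only: the smoothing operation, the kernel name and `BLOCK_SIZE` are hypothetical, not from the slides.

```cuda
#define BLOCK_SIZE 128   // hypothetical block size

__global__ void smooth_kernel(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK_SIZE];       // per-block fast data store
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. each thread loads one element from global to shared memory
    if (i < n)
        tile[threadIdx.x] = in[i];

    // 2. synchronise: wait until the whole block has loaded
    __syncthreads();

    // 3. process (here: average with the right-hand neighbour, provided
    //    it lies in the same tile) and write the result back to global
    if (i < n - 1 && threadIdx.x < BLOCK_SIZE - 1)
        out[i] = 0.5f * (tile[threadIdx.x] + tile[threadIdx.x + 1]);
}
```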
What is happening on the hardware?<br />
• The <strong>core</strong>s (Scalar Processors – SPs) of the CUDA GPU are <strong>group</strong>ed<br />
into Streaming Multiprocessors – SMs – each with 8 SPs.<br />
• Each SM has 16kB of shared memory<br />
• All threads of a thread block will be scheduled on the same SM<br />
• More than one block can reside on the same SM at the same time –<br />
provided there is enough memory on the SM
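These hardware limits can be queried at runtime through the CUDA runtime API, for example (a small host-side sketch):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // query properties of device 0
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```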
Streaming multiprocessors and shared memory
Example – 2D heat conduction
Governing equation<br />
• Heat conduction is governed by a single PDE:<br />
∂T/∂t = α ∇²T<br />
• where T is temperature<br />
t is time<br />
α is the thermal diffusivity
2D heat conduction<br />
• In 2D:<br />
∂T/∂t = α ( ∂²T/∂x² + ∂²T/∂y² )<br />
• For which a possible finite difference approximation is:<br />
ΔT/Δt = α [ (T_{i+1,j} − 2T_{i,j} + T_{i−1,j}) / Δx² + (T_{i,j+1} − 2T_{i,j} + T_{i,j−1}) / Δy² ]<br />
where ΔT is the temperature change over a time Δt and i,j are indices into<br />
a uniform structured grid (see next slide)
Stencil<br />
Update red point using data from blue points (and red point)
Domain
Update kernel - CPU<br />
// loop over all points in domain (not boundary points)<br />
for (j=1; j < nj-1; j++) {<br />
for (i=1; i < ni-1; i++) {<br />
// find indices into linear memory for this point and neighbours<br />
i00 = I2D(ni, i, j);<br />
im10 = I2D(ni, i-1, j);<br />
... Similarly for others ...<br />
// evaluate derivatives<br />
d2tdx2 = temp_in[im10] - 2*temp_in[i00] + temp_in[ip10];<br />
d2tdy2 = temp_in[i0m1] - 2*temp_in[i00] + temp_in[i0p1];<br />
// update temperatures<br />
temp_out[i00] = temp_in[i00] + tfac*(d2tdx2 + d2tdy2);<br />
}<br />
}
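The `I2D` macro used above is not defined on the slides; a plausible row-major definition, consistent with the `i2d = i + ni*j` indexing used in the later shared-memory kernel, would be:

```c
/* Hypothetical definition: map 2D index (i, j) on an ni-wide grid
   to an offset into linear (row-major) storage */
#define I2D(ni, i, j)  ((i) + (ni) * (j))
```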
Results<br />
Initial field<br />
After 50000 steps
Performance<br />
• CPU (1 <strong>core</strong>, Intel Xeon 2.33 GHz)<br />
• 1.83 E-8 seconds per point per step
GPU strategy 1<br />
• Start a thread for each point in the domain<br />
• Use 2D thread blocks and a 2D grid<br />
• Read all Temperatures from global device memory<br />
• Write updated Temperature back to global device memory
GPU strategy 1 – threads and blocks
GPU strategy 1 – kernel<br />
// find i and j indices of this thread<br />
ti = threadIdx.x;<br />
tj = threadIdx.y;<br />
i = blockIdx.x*(NI_TILE) + ti;<br />
j = blockIdx.y*(NJ_TILE) + tj;<br />
// find indices into linear memory <br />
i00 = I2D(ni, i, j);<br />
im10 = I2D(ni, i-1, j); ...<br />
// check that compute is required for this thread<br />
if (i > 0 && i < ni-1 && j > 0 && j < nj-1) {<br />
// evaluate derivatives <br />
d2tdx2 = temp_in[im10] - 2*temp_in[i00] + temp_in[ip10];<br />
d2tdy2 = temp_in[i0m1] - 2*temp_in[i00] + temp_in[i0p1];<br />
// update temperature<br />
temp_out[i00] = temp_in[i00] + tfac*(d2tdx2 + d2tdy2);<br />
}
GPU strategy 1 – kernel launch code<br />
// set thread blocks and grid<br />
grid_dim=dim3(DIVIDE_INTO(ni,NI_TILE),DIVIDE_INTO(nj,NJ_TILE),1);<br />
block_dim=dim3(NI_TILE, NJ_TILE, 1);<br />
// launch kernel <br />
step_kernel_gpu<<<grid_dim, block_dim>>>(ni, nj, tfac, temp1_d,<br />
temp2_d);<br />
// swap the temp pointers <br />
temp_tmp = temp1_d;<br />
temp1_d = temp2_d;<br />
temp2_d = temp_tmp;
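In context, the launch and pointer swap would typically sit inside the time-stepping loop (a sketch; `nstep` and the surrounding setup are assumed):

```cuda
for (int step = 0; step < nstep; step++) {
    // one explicit time step: read temp1_d, write temp2_d
    step_kernel_gpu<<<grid_dim, block_dim>>>(ni, nj, tfac,
                                             temp1_d, temp2_d);
    // swap so this step's output becomes the next step's input
    float *temp_tmp = temp1_d;
    temp1_d = temp2_d;
    temp2_d = temp_tmp;
}
```

Swapping pointers avoids copying the whole temperature field between device arrays each step.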
Results
Performance<br />
• CPU – 1 <strong>core</strong>, Intel Xeon 2.33 GHz: 1.83 E-8 s/point/step<br />
• GPU v1 (no shared mem) – GTX280: 2.32 E-10 s/point/step (80x speedup)
GPU strategy 2<br />
• Use 2D thread blocks and a 2D grid<br />
• Each thread in a block reads a Temperature from global memory into<br />
shared memory<br />
• Synchronise the threads in the block<br />
• Update Temperature using data from shared memory (cannot update<br />
Temperature on block boundary – stencil not complete – so blocks must<br />
overlap)<br />
• Write updated Temperature back to global device memory
GPU strategy 2 – threads and blocks
GPU strategy 2 – kernel (part 1)<br />
// allocate an array in shared memory<br />
__shared__ float temp[NI_TILE][NJ_TILE];<br />
// find i and j indices of current thread<br />
ti = threadIdx.x;<br />
tj = threadIdx.y;<br />
i = blockIdx.x*(NI_TILE-2) + ti;<br />
j = blockIdx.y*(NJ_TILE-2) + tj;<br />
// index into linear memory for current thread<br />
i2d = i + ni*j;<br />
// if thread is in domain, read from global to shared memory<br />
if (i2d < ni*nj) {<br />
temp[ti][tj] = temp_in[i2d];<br />
}<br />
// make sure all threads have read in data<br />
__syncthreads();
GPU strategy 2 – kernel (part 2)<br />
// only compute if (a) thread is within the whole domain<br />
if (i > 0 && i < ni-1 && j > 0 && j < nj-1) {<br />
// and (b) thread is not on boundary of a block<br />
if ((threadIdx.x > 0) && (threadIdx.x < NI_TILE-1) &&<br />
(threadIdx.y > 0) && (threadIdx.y < NJ_TILE-1)) {<br />
//evaluate derivatives<br />
d2tdx2 = (temp[ti+1][tj] - 2*temp[ti][tj] + temp[ti-1][tj]);<br />
d2tdy2 = (temp[ti][tj+1] - 2*temp[ti][tj] + temp[ti][tj-1]);<br />
// update temperature<br />
temp_out[i2d] = temp_in[i2d] + tfac*(d2tdx2 + d2tdy2);<br />
}<br />
}
Performance<br />
• CPU – 1 <strong>core</strong>, Intel Xeon 2.33 GHz: 1.83 E-8 s/point/step<br />
• GPU v1 (no shared mem) – GTX280: 2.32 E-10 s/point/step (80x speedup)<br />
• GPU v2 (shared mem - 1) – GTX280: 3.34 E-10 s/point/step (55x speedup)
GPU strategy 2 – what went wrong?<br />
• The shared memory kernel performed worse than expected mainly<br />
because <strong>many</strong> threads do not compute (just load into shared memory):<br />
• For a 16x16 block, 60 of the 256 threads only load data (23%)<br />
• Larger tiles would dilute this overhead, but the maximum is 512<br />
threads per block, i.e. at most a 22x22 square tile (√512 ≈ 22.6)<br />
(Also – the stencil is small, so there is little data reuse)
GPU strategy 3<br />
• We can use larger blocks (a higher fraction of compute threads) if:<br />
• For each block, we start a single line of threads (in the i direction)<br />
• Load three lines into shared memory, then compute one line<br />
• Then load the next line into shared memory, and proceed in the j direction
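The rolling-line idea above can be sketched as follows. This is a hedged reconstruction, not the lecture's actual kernel: the kernel name, `NI_TILE` value and rotating-buffer scheme are assumptions.

```cuda
#define NI_TILE 256   /* hypothetical line length */

__global__ void step_kernel_lines(int ni, int nj, float tfac,
                                  const float *temp_in, float *temp_out)
{
    // three rows of temperature (j-1, j, j+1), rotated as j advances
    __shared__ float row[3][NI_TILE];

    int ti = threadIdx.x;
    int i  = blockIdx.x * (NI_TILE - 2) + ti;   // blocks overlap in i
    bool in_domain = (i < ni);

    // preload the first two rows (j = 0 and j = 1)
    if (in_domain) {
        row[0][ti] = temp_in[i + ni * 0];
        row[1][ti] = temp_in[i + ni * 1];
    }

    for (int j = 1; j < nj - 1; j++) {
        // load the next row, overwriting the row no longer needed
        if (in_domain)
            row[(j + 1) % 3][ti] = temp_in[i + ni * (j + 1)];
        __syncthreads();

        // compute one row; skip domain and block boundaries
        if (in_domain && i > 0 && i < ni - 1 &&
            ti > 0 && ti < NI_TILE - 1) {
            float d2tdx2 = row[j % 3][ti + 1] - 2.0f * row[j % 3][ti]
                         + row[j % 3][ti - 1];
            float d2tdy2 = row[(j + 1) % 3][ti] - 2.0f * row[j % 3][ti]
                         + row[(j - 1) % 3][ti];
            temp_out[i + ni * j] = row[j % 3][ti]
                                 + tfac * (d2tdx2 + d2tdy2);
        }
        __syncthreads();   // don't overwrite a row still being read
    }
}
```

Only the two threads at the ends of each line are load-only, so the idle fraction drops from 23% (16x16 tile) to 2/NI_TILE.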
GPU strategy 3
Performance<br />
• CPU – 1 <strong>core</strong>, Intel Xeon 2.33 GHz: 1.83 E-8 s/point/step<br />
• GPU v1 (no shared mem) – GTX280: 2.32 E-10 s/point/step (80x speedup)<br />
• GPU v2 (shared mem - 1) – GTX280: 3.34 E-10 s/point/step (55x speedup)<br />
• GPU v3 (shared mem - 2) – GTX280: 2.05 E-10 s/point/step (90x speedup)
<strong>Lecture</strong> 2 summary
Covered in <strong>Lecture</strong> 2<br />
• <strong>Threads</strong>, thread blocks, the grid of blocks, shared memory<br />
• New aspects of CUDA:<br />
• Thread indices (threadIdx)<br />
• Block indices (blockIdx)<br />
• Shared memory declaration (__shared__)<br />
• Synchronising threads (__syncthreads())