<strong>Lecture</strong> 2 – <strong>Threads</strong><br />
Graham Pullan<br />
Department of Engineering
Overview<br />
• <strong>Threads</strong>, thread blocks and shared memory<br />
• Example – 2D heat conduction simulation<br />
• CPU – in C<br />
• GPU – in CUDA (without shared memory)<br />
• GPU – in CUDA (with shared memory)<br />
• Summary – concepts covered
<strong>Threads</strong>, thread blocks<br />
and shared memory
<strong>Threads</strong><br />
• The example in <strong>Lecture</strong> 1 made no use of the parallel processing capability of the GPU!<br />
• We can launch multiple copies of a kernel – one per thread<br />
• A thread executes an independent instance of the kernel<br />
• Alternatively: a kernel is a program written for an individual thread to<br />
execute
Kernel launch – multiple threads<br />
// launch kernel (ntot threads) <br />
vector_add_kernel<<<1, ntot>>>(a_d, b_d, c_d);
Kernel for c = a + b<br />
__global__ void vector_add_kernel (float *a, float *b,<br />
float *c)<br />
{<br />
int i;<br />
i = threadIdx.x;<br />
// add i’th elements <br />
c[i] = a[i] + b[i]; <br />
}
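For context, a minimal host-side sketch of how such a kernel is allocated for, launched and read back (the host array names `a_h`, `b_h`, `c_h` are assumptions, and error checking is omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Kernel as on the slide: one thread per element, single block
__global__ void vector_add_kernel(float *a, float *b, float *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}

int main(void)
{
    int ntot = 256;                       // must fit in one block here
    size_t bytes = ntot * sizeof(float);

    // host arrays (hypothetical names)
    float *a_h = (float *)malloc(bytes);
    float *b_h = (float *)malloc(bytes);
    float *c_h = (float *)malloc(bytes);
    for (int i = 0; i < ntot; i++) { a_h[i] = i; b_h[i] = 2 * i; }

    // device arrays
    float *a_d, *b_d, *c_d;
    cudaMalloc((void **)&a_d, bytes);
    cudaMalloc((void **)&b_d, bytes);
    cudaMalloc((void **)&c_d, bytes);

    // copy inputs, launch one block of ntot threads, copy result back
    cudaMemcpy(a_d, a_h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, bytes, cudaMemcpyHostToDevice);
    vector_add_kernel<<<1, ntot>>>(a_d, b_d, c_d);
    cudaMemcpy(c_h, c_d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    free(a_h); free(b_h); free(c_h);
    return 0;
}
```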
More on threads<br />
• Many thousands of concurrent threads are allowed<br />
• “Instant” switching between threads hides memory latency:<br />
• While a thread is waiting for data – switch to another thread which<br />
has data available<br />
• To help the programmer keep track of threads (and to guide their<br />
scheduling on the GPU) threads are organised into thread blocks<br />
• Blocks are further organised into a grid
<strong>Threads</strong>, blocks and grid<br />
Block is shown as a 1D array of threads – but could also be 2D or 3D<br />
Grid is shown as a 1D array of blocks – but could also be 2D
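With a grid of several blocks, each thread combines its block index and its index within the block to obtain a unique global index. A hedged sketch (the extra `n` parameter and the bounds check are additions, not on the slides):

```cuda
__global__ void vector_add_kernel(float *a, float *b, float *c, int n)
{
    // global index = offset of this block + index within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard: n need not be a multiple of blockDim.x
        c[i] = a[i] + b[i];
}
```

A matching launch would round the grid size up, e.g. `vector_add_kernel<<<(n + 255) / 256, 256>>>(a_d, b_d, c_d, n);`.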
More on thread blocks<br />
• Thread blocks are very important because:<br />
• All threads in a block can access the same (fast) local data store:<br />
shared memory<br />
• All threads in a block can be synchronised (i.e. the thread waits until<br />
all threads in the block reach the same point)<br />
• A typical strategy is, therefore:<br />
1. Each thread loads data from global device mem. to shared mem.<br />
2. Synchronise threads<br />
3. Process data and write result back to global memory
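The three-step strategy can be sketched as a kernel skeleton. This is a minimal illustration only: the smoothing operation, the kernel name and `BLOCK_SIZE` are hypothetical, not from the slides.

```cuda
#define BLOCK_SIZE 128   // hypothetical block size

__global__ void smooth_kernel(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK_SIZE];       // per-block fast data store
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. each thread loads one element from global to shared memory
    if (i < n)
        tile[threadIdx.x] = in[i];

    // 2. synchronise: wait until the whole block has loaded
    __syncthreads();

    // 3. process (here: average with the right-hand neighbour, provided
    //    it lies in the same tile) and write the result back to global
    if (i < n - 1 && threadIdx.x < BLOCK_SIZE - 1)
        out[i] = 0.5f * (tile[threadIdx.x] + tile[threadIdx.x + 1]);
}
```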
What is happening on the hardware?<br />
• The <strong>core</strong>s (Scalar Processors – SPs) of the CUDA GPU are <strong>group</strong>ed<br />
into Streaming Multiprocessors – SMs – each with 8 SPs.<br />
• Each SM has 16kB of shared memory<br />
• All threads of a thread block will be scheduled on the same SM<br />
• More than one block can reside on the same SM at the same time –<br />
provided there is enough memory on the SM
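These hardware limits can be queried at runtime through the CUDA runtime API, for example (a small host-side sketch):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    // query properties of device 0
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```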
Streaming multiprocessors and shared memory
Example – 2D heat conduction
Governing equation<br />
• Heat conduction is governed by a single PDE:<br />
∂T/∂t = α ∇²T<br />
• where T is temperature<br />
t is time<br />
α is the thermal diffusivity
2D heat conduction<br />
• In 2D:<br />
∂T/∂t = α ( ∂²T/∂x² + ∂²T/∂y² )<br />
• For which a possible finite difference approximation is:<br />
ΔT/Δt = α [ (T_{i+1,j} − 2T_{i,j} + T_{i−1,j}) / Δx² + (T_{i,j+1} − 2T_{i,j} + T_{i,j−1}) / Δy² ]<br />
where ΔT is the temperature change over a time Δt and i,j are indices into<br />
a uniform structured grid (see next slide)
Stencil<br />
Update red point using data from blue points (and red point)
Domain
Update kernel - CPU<br />
// loop over all points in domain (not boundary points)<br />
for (j=1; j < nj-1; j++) {<br />
for (i=1; i < ni-1; i++) {<br />
// find indices into linear memory for this point and neighbours<br />
i00 = I2D(ni, i, j);<br />
im10 = I2D(ni, i-1, j);<br />
... Similarly for others ...<br />
// evaluate derivatives<br />
d2tdx2 = temp_in[im10] - 2*temp_in[i00] + temp_in[ip10];<br />
d2tdy2 = temp_in[i0m1] - 2*temp_in[i00] + temp_in[i0p1];<br />
// update temperatures<br />
temp_out[i00] = temp_in[i00] + tfac*(d2tdx2 + d2tdy2);<br />
}<br />
}
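The `I2D` macro used above is not defined on the slides; a plausible row-major definition, consistent with the `i2d = i + ni*j` indexing used in the later shared-memory kernel, would be:

```c
/* Hypothetical definition: map 2D index (i, j) on an ni-wide grid
   to an offset into linear (row-major) storage */
#define I2D(ni, i, j)  ((i) + (ni) * (j))
```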
Results<br />
Initial field<br />
After 50000 steps
Performance<br />
• CPU (1 <strong>core</strong>, Intel Xeon 2.33 GHz)<br />
• 1.83 E-8 seconds per point per step
GPU strategy 1<br />
• Start a thread for each point in the domain<br />
• Use 2D thread blocks and a 2D grid<br />
• Read all Temperatures from global device memory<br />
• Write updated Temperature back to global device memory
GPU strategy 1 – threads and blocks
GPU strategy 1 – kernel<br />
// find i and j indices of this thread<br />
ti = threadIdx.x;<br />
tj = threadIdx.y;<br />
i = blockIdx.x*(NI_TILE) + ti;<br />
j = blockIdx.y*(NJ_TILE) + tj;<br />
// find indices into linear memory <br />
i00 = I2D(ni, i, j);<br />
im10 = I2D(ni, i-1, j); ...<br />
// check that compute is required for this thread<br />
if (i > 0 && i < ni-1 && j > 0 && j < nj-1) {<br />
// evaluate derivatives <br />
d2tdx2 = temp_in[im10] - 2*temp_in[i00] + temp_in[ip10];<br />
d2tdy2 = temp_in[i0m1] - 2*temp_in[i00] + temp_in[i0p1];<br />
// update temperature<br />
temp_out[i00] = temp_in[i00] + tfac*(d2tdx2 + d2tdy2);<br />
}
GPU strategy 1 – kernel launch code<br />
// set thread blocks and grid<br />
grid_dim=dim3(DIVIDE_INTO(ni,NI_TILE),DIVIDE_INTO(nj,NJ_TILE),1);<br />
block_dim=dim3(NI_TILE, NJ_TILE, 1);<br />
// launch kernel <br />
step_kernel_gpu<<<grid_dim, block_dim>>>(ni, nj, tfac, temp1_d,<br />
temp2_d);<br />
// swap the temp pointers <br />
temp_tmp = temp1_d;<br />
temp1_d = temp2_d;<br />
temp2_d = temp_tmp;
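In context, the launch and pointer swap would typically sit inside the time-stepping loop (a sketch; `nstep` and the surrounding setup are assumed):

```cuda
for (int step = 0; step < nstep; step++) {
    // one explicit time step: read temp1_d, write temp2_d
    step_kernel_gpu<<<grid_dim, block_dim>>>(ni, nj, tfac,
                                             temp1_d, temp2_d);
    // swap so this step's output becomes the next step's input
    float *temp_tmp = temp1_d;
    temp1_d = temp2_d;
    temp2_d = temp_tmp;
}
```

Swapping pointers avoids copying the whole temperature field between device arrays each step.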
Results
Performance<br />
• CPU – 1 <strong>core</strong>, Intel Xeon 2.33 GHz: 1.83 E-8 s/point/step<br />
• GPU v1 (no shared mem) – GTX280: 2.32 E-10 s/point/step (80x speedup)
GPU strategy 2<br />
• Use 2D thread blocks and a 2D grid<br />
• Each thread in a block reads a Temperature from global memory into<br />
shared memory<br />
• Synchronise the threads in the block<br />
• Update Temperature using data from shared memory (cannot update<br />
Temperature on block boundary – stencil not complete – so blocks must<br />
overlap)<br />
• Write updated Temperature back to global device memory
GPU strategy 2 – threads and blocks
GPU strategy 2 – kernel (part 1)<br />
// allocate an array in shared memory<br />
__shared__ float temp[NI_TILE][NJ_TILE];<br />
// find i and j indices of current thread<br />
ti = threadIdx.x;<br />
tj = threadIdx.y;<br />
i = blockIdx.x*(NI_TILE-2) + ti;<br />
j = blockIdx.y*(NJ_TILE-2) + tj;<br />
// index into linear memory for current thread<br />
i2d = i + ni*j;<br />
// if thread is in domain, read from global to shared memory<br />
if (i2d < ni*nj) {<br />
temp[ti][tj] = temp_in[i2d];<br />
}<br />
// make sure all threads have read in data<br />
__syncthreads();
GPU strategy 2 – kernel (part 2)<br />
// only compute if (a) thread is within the whole domain<br />
if (i > 0 && i < ni-1 && j > 0 && j < nj-1) {<br />
// and (b) thread is not on boundary of a block<br />
if ((threadIdx.x > 0) && (threadIdx.x < NI_TILE-1) &&<br />
(threadIdx.y > 0) && (threadIdx.y < NJ_TILE-1)) {<br />
//evaluate derivatives<br />
d2tdx2 = (temp[ti+1][tj] - 2*temp[ti][tj] + temp[ti-1][tj]);<br />
d2tdy2 = (temp[ti][tj+1] - 2*temp[ti][tj] + temp[ti][tj-1]);<br />
// update temperature<br />
temp_out[i2d] = temp_in[i2d] + tfac*(d2tdx2 + d2tdy2);<br />
}<br />
}
Performance<br />
• CPU – 1 <strong>core</strong>, Intel Xeon 2.33 GHz: 1.83 E-8 s/point/step<br />
• GPU v1 (no shared mem) – GTX280: 2.32 E-10 s/point/step (80x speedup)<br />
• GPU v2 (shared mem - 1) – GTX280: 3.34 E-10 s/point/step (55x speedup)
GPU strategy 2 – what went wrong?<br />
• The shared memory kernel performed worse than expected mainly<br />
because <strong>many</strong> threads do not compute (just load into shared memory):<br />
• For a 16x16 block, 60 of the 256 threads only load data (23%)<br />
• Larger tiles would dilute this overhead, but the maximum is 512<br />
threads per block, i.e. at most a 22x22 square tile (√512 ≈ 22.6)<br />
(Also – the stencil is small, so there is little data reuse)
GPU strategy 3<br />
• We can use larger blocks (a higher fraction of compute threads) if:<br />
• For each block, we start a single line of threads (in the i direction)<br />
• Load three lines into shared memory, then compute one line<br />
• Then load the next line into shared memory, and proceed in the j direction
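The rolling-line idea above can be sketched as follows. This is a hedged reconstruction, not the lecture's actual kernel: the kernel name, `NI_TILE` value and rotating-buffer scheme are assumptions.

```cuda
#define NI_TILE 256   /* hypothetical line length */

__global__ void step_kernel_lines(int ni, int nj, float tfac,
                                  const float *temp_in, float *temp_out)
{
    // three rows of temperature (j-1, j, j+1), rotated as j advances
    __shared__ float row[3][NI_TILE];

    int ti = threadIdx.x;
    int i  = blockIdx.x * (NI_TILE - 2) + ti;   // blocks overlap in i
    bool in_domain = (i < ni);

    // preload the first two rows (j = 0 and j = 1)
    if (in_domain) {
        row[0][ti] = temp_in[i + ni * 0];
        row[1][ti] = temp_in[i + ni * 1];
    }

    for (int j = 1; j < nj - 1; j++) {
        // load the next row, overwriting the row no longer needed
        if (in_domain)
            row[(j + 1) % 3][ti] = temp_in[i + ni * (j + 1)];
        __syncthreads();

        // compute one row; skip domain and block boundaries
        if (in_domain && i > 0 && i < ni - 1 &&
            ti > 0 && ti < NI_TILE - 1) {
            float d2tdx2 = row[j % 3][ti + 1] - 2.0f * row[j % 3][ti]
                         + row[j % 3][ti - 1];
            float d2tdy2 = row[(j + 1) % 3][ti] - 2.0f * row[j % 3][ti]
                         + row[(j - 1) % 3][ti];
            temp_out[i + ni * j] = row[j % 3][ti]
                                 + tfac * (d2tdx2 + d2tdy2);
        }
        __syncthreads();   // don't overwrite a row still being read
    }
}
```

Only the two threads at the ends of each line are load-only, so the idle fraction drops from 23% (16x16 tile) to 2/NI_TILE.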
GPU strategy 3
Performance<br />
• CPU – 1 <strong>core</strong>, Intel Xeon 2.33 GHz: 1.83 E-8 s/point/step<br />
• GPU v1 (no shared mem) – GTX280: 2.32 E-10 s/point/step (80x speedup)<br />
• GPU v2 (shared mem - 1) – GTX280: 3.34 E-10 s/point/step (55x speedup)<br />
• GPU v3 (shared mem - 2) – GTX280: 2.05 E-10 s/point/step (90x speedup)
<strong>Lecture</strong> 2 summary
Covered in <strong>Lecture</strong> 2<br />
• <strong>Threads</strong>, thread blocks, the grid of blocks, shared memory<br />
• New aspects of CUDA:<br />
• Thread indices (threadIdx)<br />
• Block indices (blockIdx)<br />
• Shared memory declaration (__shared__)<br />
• Synchronising threads (__syncthreads())