
Lecture 2 – Threads

Graham Pullan

Department of Engineering


Overview

• Threads, thread blocks and shared memory
• Example – 2D heat conduction simulation
  • CPU – in C
  • GPU – in CUDA (without shared memory)
  • GPU – in CUDA (with shared memory)
• Summary – concepts covered


Threads, thread blocks and shared memory


Threads

• Example in L1 made no use of the parallel processing capability of the GPU!
• We can launch multiple copies of a kernel – one per thread
• A thread executes an independent instance of the kernel
• Alternatively: a kernel is a program written for an individual thread to execute


Kernel launch – multiple threads

// launch kernel (ntot threads)
vector_add_kernel<<<1, ntot>>>(a_d, b_d, c_d);


Kernel for c = a + b

__global__ void vector_add_kernel(float *a, float *b,
                                  float *c)
{
  int i;

  i = threadIdx.x;

  // add i'th elements
  c[i] = a[i] + b[i];
}
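For context, a minimal host-side sketch around this launch might look as follows. The size ntot, the host arrays a_h, b_h, c_h and the omitted error checking are assumptions for illustration, not the lecture's code:

// assumed host code: allocate, copy, launch, copy back
int ntot = 256;                        // one block, so must not exceed 512 threads
size_t nbytes = ntot*sizeof(float);
float *a_d, *b_d, *c_d;

// allocate device memory
cudaMalloc((void **) &a_d, nbytes);
cudaMalloc((void **) &b_d, nbytes);
cudaMalloc((void **) &c_d, nbytes);

// copy inputs to the device (a_h, b_h are host arrays of ntot floats)
cudaMemcpy(a_d, a_h, nbytes, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b_h, nbytes, cudaMemcpyHostToDevice);

// launch one block of ntot threads, as on the previous slide
vector_add_kernel<<<1, ntot>>>(a_d, b_d, c_d);

// copy the result back to the host array c_h
cudaMemcpy(c_h, c_d, nbytes, cudaMemcpyDeviceToHost);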


More on threads

• Many thousands of concurrent threads are allowed
• "Instant" switching between threads hides memory latency:
  • While a thread is waiting for data, switch to another thread which has data available
• To help the programmer keep track of threads (and to guide their scheduling on the GPU), threads are organised into thread blocks
• Blocks are further organised into a grid


Threads, blocks and grid

Block is shown as a 1D array of threads – but could also be 2D or 3D
Grid is shown as a 1D array of blocks – but could also be 2D
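As an aside (not on the slides), block and grid shapes are declared with dim3; a hypothetical 2D configuration, using the variable names that appear in the launch code later:

// hypothetical 16x16 block and a 2D grid of such blocks
dim3 block_dim = dim3(16, 16, 1);   // a block can be 1D, 2D or 3D
dim3 grid_dim  = dim3(8, 8, 1);     // a grid can be 1D or 2D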


More on thread blocks

• Thread blocks are very important because:
  • All threads in a block can access the same (fast) local data store: shared memory
  • All threads in a block can be synchronised (i.e. each thread waits until all threads in the block reach the same point)
• A typical strategy is, therefore (see the sketch below):
  1. Each thread loads data from global device mem. to shared mem.
  2. Synchronise threads
  3. Process data and write result back to global memory
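A minimal sketch of this load–synchronise–compute pattern (illustrative only; the kernel name, block size and computation are assumptions – the heat-conduction kernels later in the lecture follow the same shape):

#define BLOCK 256

// illustrative kernel: each output is an element plus its right-hand neighbour
__global__ void pattern_kernel(float *in, float *out)
{
  __shared__ float buf[BLOCK];
  int ti = threadIdx.x;

  // 1. load from global memory into shared memory
  buf[ti] = in[blockIdx.x*BLOCK + ti];

  // 2. synchronise so all loads have completed
  __syncthreads();

  // 3. compute from shared memory, write result to global memory
  // (last thread skipped: its neighbour lies outside this block's tile)
  if (ti < BLOCK-1)
    out[blockIdx.x*BLOCK + ti] = buf[ti] + buf[ti+1];
}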


What is happening on the hardware?

• The cores (Scalar Processors – SPs) of the CUDA GPU are grouped into Streaming Multiprocessors – SMs – each with 8 SPs
• Each SM has 16 kB of shared memory
• All threads of a thread block will be scheduled on the same SM
• More than one block can reside on the same SM at the same time – provided there is enough memory on the SM
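These figures can be checked at runtime (an aside, not from the slides) via cudaGetDeviceProperties:

#include <stdio.h>

int main(void)
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // query device 0

  printf("%s: %d SMs, %d bytes shared memory per block\n",
         prop.name, prop.multiProcessorCount, (int) prop.sharedMemPerBlock);
  return 0;
}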


Streaming multiprocessors and shared memory


Example – 2D heat conduction


Governing equation

• Heat conduction is governed by a single PDE:

  $$\frac{\partial T}{\partial t} = \alpha \nabla^2 T$$

• where T is temperature, t is time, and α is the thermal diffusivity

2D heat conduction

• In 2D:

  $$\frac{\partial T}{\partial t} = \alpha \left( \frac{\partial^2 T}{\partial x^2} + \frac{\partial^2 T}{\partial y^2} \right)$$

• For which a possible finite difference approximation is:

  $$\frac{\Delta T}{\Delta t} = \alpha \left[ \frac{T_{i+1,j} - 2T_{i,j} + T_{i-1,j}}{\Delta x^2} + \frac{T_{i,j+1} - 2T_{i,j} + T_{i,j-1}}{\Delta y^2} \right]$$

• where ΔT is the temperature change over a time Δt and i, j are indices into a uniform structured grid (see next slide)
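Rearranged for the update itself (an added step here, assuming a square grid with Δx = Δy), this gives the rule implemented in the kernels below, with tfac = αΔt/Δx²:

$$T_{i,j}^{\mathrm{new}} = T_{i,j} + \mathtt{tfac}\left[\left(T_{i+1,j} - 2T_{i,j} + T_{i-1,j}\right) + \left(T_{i,j+1} - 2T_{i,j} + T_{i,j-1}\right)\right], \qquad \mathtt{tfac} = \frac{\alpha\,\Delta t}{\Delta x^2}$$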


Stencil

Update red point using data from blue points (and red point)


Domain


Update kernel – CPU

// loop over all points in domain (not boundary points)
for (j=1; j < nj-1; j++) {
  for (i=1; i < ni-1; i++) {

    // find indices into linear memory for this point and its neighbours
    i00  = I2D(ni, i, j);
    im10 = I2D(ni, i-1, j);
    // ... similarly for ip10, i0m1 and i0p1 ...

    // evaluate derivatives
    d2tdx2 = temp_in[im10] - 2*temp_in[i00] + temp_in[ip10];
    d2tdy2 = temp_in[i0m1] - 2*temp_in[i00] + temp_in[i0p1];

    // update temperatures
    temp_out[i00] = temp_in[i00] + tfac*(d2tdx2 + d2tdy2);
  }
}
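The I2D macro is not defined on the slides; given that the strategy-2 kernel later computes i2d = i + ni*j, a consistent definition would be:

// maps 2D indices (i, j) on an ni-wide grid to a linear offset (assumed definition)
#define I2D(ni, i, j)  ((i) + (ni)*(j))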


Results

Initial field
After 50000 steps


Performance

• CPU (1 core, Intel Xeon 2.33 GHz)
  • 1.83 E-8 seconds per point per step


GPU strategy 1

• Start a thread for each point in the domain
• Use 2D thread blocks and a 2D grid
• Read all temperatures from global device memory
• Write the updated temperature back to global device memory


GPU strategy 1 – threads and blocks



GPU strategy 1 – kernel

// find i and j indices of this thread
ti = threadIdx.x;
tj = threadIdx.y;
i = blockIdx.x*NI_TILE + ti;
j = blockIdx.y*NJ_TILE + tj;

// find indices into linear memory
i00 = I2D(ni, i, j);
im10 = I2D(ni, i-1, j); // ... similarly for ip10, i0m1 and i0p1 ...

// check that compute is required for this thread
if (i > 0 && i < ni-1 && j > 0 && j < nj-1) {

  // evaluate derivatives
  d2tdx2 = temp_in[im10] - 2*temp_in[i00] + temp_in[ip10];
  d2tdy2 = temp_in[i0m1] - 2*temp_in[i00] + temp_in[i0p1];

  // update temperature
  temp_out[i00] = temp_in[i00] + tfac*(d2tdx2 + d2tdy2);
}


GPU strategy 1 – kernel launch code

// set thread blocks and grid
grid_dim = dim3(DIVIDE_INTO(ni,NI_TILE), DIVIDE_INTO(nj,NJ_TILE), 1);
block_dim = dim3(NI_TILE, NJ_TILE, 1);

// launch kernel
step_kernel_gpu<<<grid_dim, block_dim>>>(ni, nj, tfac, temp1_d, temp2_d);

// swap the temp pointers
temp_tmp = temp1_d;
temp1_d = temp2_d;
temp2_d = temp_tmp;
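DIVIDE_INTO is not defined on the slides either; it evidently rounds the division up so that the grid of blocks covers the whole domain. A plausible definition:

// ceiling division: how many tiles of size y are needed to cover x (assumed definition)
#define DIVIDE_INTO(x, y)  (((x) + (y) - 1)/(y))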


Results


Performance

• CPU – 1 core, Intel Xeon 2.33 GHz: 1.83 E-8 s/point/step
• GPU v1 (no shared mem) – GTX280: 2.32 E-10 s/point/step (80x speedup)


GPU strategy 2

• Use 2D thread blocks and a 2D grid
• Each thread in a block reads a temperature from global memory into shared memory
• Synchronise the threads in the block
• Update the temperature using data from shared memory (points on the block boundary cannot be updated – their stencil is not complete – so blocks must overlap)
• Write the updated temperature back to global device memory


GPU strategy 2 – threads and blocks



GPU strategy 2 – kernel (part 1)

// allocate an array in shared memory
__shared__ float temp[NI_TILE][NJ_TILE];

// find i and j indices of current thread
ti = threadIdx.x;
tj = threadIdx.y;
i = blockIdx.x*(NI_TILE-2) + ti;   // blocks overlap by two points
j = blockIdx.y*(NJ_TILE-2) + tj;

// index into linear memory for current thread
i2d = i + ni*j;

// if thread is in domain, read from global to shared memory
if (i2d < ni*nj) {
  temp[ti][tj] = temp_in[i2d];
}

// make sure all threads have read in data
__syncthreads();


GPU strategy 2 – kernel (part 2)

// only compute if (a) thread is within the whole domain
if (i > 0 && i < ni-1 && j > 0 && j < nj-1) {

  // and (b) thread is not on the boundary of its block
  if ((threadIdx.x > 0) && (threadIdx.x < NI_TILE-1) &&
      (threadIdx.y > 0) && (threadIdx.y < NJ_TILE-1)) {

    // evaluate derivatives
    d2tdx2 = (temp[ti+1][tj] - 2*temp[ti][tj] + temp[ti-1][tj]);
    d2tdy2 = (temp[ti][tj+1] - 2*temp[ti][tj] + temp[ti][tj-1]);

    // update temperature
    temp_out[i2d] = temp_in[i2d] + tfac*(d2tdx2 + d2tdy2);
  }
}
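The launch code for this kernel is not shown; since blocks overlap by two points in each direction, the grid presumably advances NI_TILE-2 (and NJ_TILE-2) interior points per block. A hedged sketch, with the kernel name step_kernel_gpu2 assumed:

// assumed launch: each block updates (NI_TILE-2) x (NJ_TILE-2) interior points
grid_dim = dim3(DIVIDE_INTO(ni-2, NI_TILE-2), DIVIDE_INTO(nj-2, NJ_TILE-2), 1);
block_dim = dim3(NI_TILE, NJ_TILE, 1);
step_kernel_gpu2<<<grid_dim, block_dim>>>(ni, nj, tfac, temp1_d, temp2_d);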


Performance

• CPU – 1 core, Intel Xeon 2.33 GHz: 1.83 E-8 s/point/step
• GPU v1 (no shared mem) – GTX280: 2.32 E-10 (80x speedup)
• GPU v2 (shared mem - 1) – GTX280: 3.34 E-10 (55x speedup)


GPU strategy 2 – what went wrong?

• The shared memory kernel performed worse than expected, mainly because many threads only load into shared memory and never compute:
  • For a 16x16 block, only the 14x14 interior threads compute, so 256 − 196 = 60 threads do not compute (23%)
  • Larger blocks would reduce this fraction, but the maximum is 512 threads per block (sqrt(512) = 22.6, so at best roughly 22x22)
• (Also – the stencil is small, so there is little data reuse)


GPU strategy 3

• Can use larger blocks (more compute threads) if, as sketched below:
  • For each block, start a line of threads (in the i direction)
  • Load three lines into shared memory, then compute one line
  • Then load the next line into shared memory, and proceed in the j direction
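The lecture's kernel for strategy 3 is not reproduced on these slides; a minimal sketch of the idea, assuming the NI_TILE, I2D and tfac conventions above and a hypothetical kernel name step_kernel_line:

__global__ void step_kernel_line(int ni, int nj, float tfac,
                                 float *temp_in, float *temp_out)
{
  // three lines of the domain, reused cyclically as the block marches in j
  __shared__ float line[3][NI_TILE];

  int ti = threadIdx.x;
  int i  = blockIdx.x*(NI_TILE-2) + ti;   // blocks overlap by two points in i
  int in_domain = (i < ni);

  // preload lines j = 0 and j = 1
  if (in_domain) {
    line[0][ti] = temp_in[I2D(ni, i, 0)];
    line[1][ti] = temp_in[I2D(ni, i, 1)];
  }

  // march in j, computing one line per iteration
  for (int j = 1; j < nj-1; j++) {

    // load line j+1 into the slot whose line is no longer needed
    if (in_domain) line[(j+1)%3][ti] = temp_in[I2D(ni, i, j+1)];
    __syncthreads();

    // interior threads (not on the block's i-boundary) update line j
    if (i > 0 && i < ni-1 && ti > 0 && ti < NI_TILE-1) {
      float d2tdx2 = line[j%3][ti+1] - 2.0f*line[j%3][ti] + line[j%3][ti-1];
      float d2tdy2 = line[(j+1)%3][ti] - 2.0f*line[j%3][ti] + line[(j-1)%3][ti];
      temp_out[I2D(ni, i, j)] = line[j%3][ti] + tfac*(d2tdx2 + d2tdy2);
    }
    __syncthreads();   // protect line j-1 until everyone has finished with it
  }
}

Only one line of threads per block means nearly every thread computes, and each loaded line is reused by three successive updates.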



Performance

• CPU – 1 core, Intel Xeon 2.33 GHz: 1.83 E-8 s/point/step
• GPU v1 (no shared mem) – GTX280: 2.32 E-10 (80x speedup)
• GPU v2 (shared mem - 1) – GTX280: 3.34 E-10 (55x speedup)
• GPU v3 (shared mem - 2) – GTX280: 2.05 E-10 (90x speedup)


Lecture 2 summary


Covered in Lecture 2

• Threads, thread blocks, the block grid, shared memory
• New aspects of CUDA:
  • Thread indices (threadIdx)
  • Block indices (blockIdx)
  • Shared memory declaration (__shared__)
  • Synchronising threads (__syncthreads())
