Tiled Matrix Multiplication lecture 21 jan 2013.pdf

G80 Shared Memory and Threading
• Each SM in G80 has 16KB of shared memory
  – Shared memory size is implementation dependent!
  – For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
  – Shared memory can therefore hold up to 8 thread blocks actively executing
    • This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
    • The threading model limits the number of thread blocks to 3, so shared memory is not the limiting factor here
  – The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only up to two thread blocks to be active at the same time
• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
  – The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
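
The arithmetic on this slide can be checked with a short script. This is only a sketch: the function names are made up for illustration, and the limit of 768 resident threads per SM is an assumption that is not stated on the slide (it is consistent with the slide's figure of 3 blocks of 256 threads).

```python
# Resource arithmetic for a tiled matrix-multiply kernel on G80.
BYTES_PER_FLOAT = 4
SHARED_MEM_PER_SM = 16 * 1024    # from the slide: 16KB of shared memory per SM
GLOBAL_BANDWIDTH_GBS = 86.4      # from the slide: global memory bandwidth in GB/s
MAX_THREADS_PER_SM = 768         # assumption, not on the slide (see lead-in)


def shared_bytes_per_block(tile_width):
    """Two tiles (one of Md, one of Nd), each tile_width x tile_width floats."""
    return 2 * tile_width * tile_width * BYTES_PER_FLOAT


def blocks_by_shared_memory(tile_width):
    """How many thread blocks fit in one SM's shared memory."""
    return SHARED_MEM_PER_SM // shared_bytes_per_block(tile_width)


def blocks_by_threads(tile_width):
    """How many blocks of tile_width^2 threads fit under the thread limit."""
    return MAX_THREADS_PER_SM // (tile_width * tile_width)


def bandwidth_bound_gflops(tile_width):
    """The slide's estimate: (bandwidth / 4 bytes per float) * reuse factor."""
    return GLOBAL_BANDWIDTH_GBS / BYTES_PER_FLOAT * tile_width


if __name__ == "__main__":
    print(shared_bytes_per_block(16))             # bytes per block, TILE_WIDTH = 16
    print(blocks_by_shared_memory(16))            # blocks allowed by shared memory
    print(blocks_by_threads(16))                  # blocks allowed by the thread limit
    print(shared_bytes_per_block(32))             # bytes per block, TILE_WIDTH = 32
    print(round(bandwidth_bound_gflops(16), 1))   # bandwidth-bound GFLOPS estimate
```

For TILE_WIDTH = 16 the thread limit (3 blocks) is tighter than the shared memory limit (8 blocks), which is why shared memory is not the limiting factor in that configuration.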

Tiling Size Effects
[Figure: bar chart of GFLOPS (y-axis, 0 to 100) for untiled kernels and for 4x4, 8x8, 12x12, and 16x16 tiles, comparing "tiled only" with "tiled & unrolled".]
