11.07.2015 Views

A Compiler for Parallel Exeuction of Numerical Python Programs on ...

A Compiler for Parallel Exeuction of Numerical Python Programs on ...

A Compiler for Parallel Exeuction of Numerical Python Programs on ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

2.3 Hardware executi<strong>on</strong> model2.3.1 Pixel shadersThe programmer can specify milli<strong>on</strong>s <str<strong>on</strong>g>of</str<strong>on</strong>g> threads to be executed. However not all threadsare executed at the same time <strong>on</strong> the GPU. In pixel-shader mode, the hardware dividesthe two-dimensi<strong>on</strong>al grid <str<strong>on</strong>g>of</str<strong>on</strong>g> threads into groups <str<strong>on</strong>g>of</str<strong>on</strong>g> 8×8 thread groups called wavefr<strong>on</strong>ts.An SIMD runs <strong>on</strong>e or more wavefr<strong>on</strong>ts in parallel. The wavefr<strong>on</strong>t sizes vary from GPUto GPU: it is 64 <strong>on</strong> the RV770 but it is 32 <strong>on</strong> the RV730. A wavefr<strong>on</strong>t starts executing<strong>on</strong> an SIMD and c<strong>on</strong>tinues executing until it encounters a n<strong>on</strong>-ALU instructi<strong>on</strong> (such as atexture related instructi<strong>on</strong>). Then the wavefr<strong>on</strong>t is swapped out and replaced with anotherwavefr<strong>on</strong>t while the original n<strong>on</strong>-ALU wavefr<strong>on</strong>t is sent back to be scheduled <strong>on</strong> the unitrequired to complete the instructi<strong>on</strong> that it is waiting <strong>on</strong>. Thus the GPU uses the massiveparallelism to hide the latency <str<strong>on</strong>g>of</str<strong>on</strong>g> memory loads.The number <str<strong>on</strong>g>of</str<strong>on</strong>g> wavefr<strong>on</strong>ts that can be run in parallel <strong>on</strong> an SIMD in pixel-shader modeis decided by the availability <str<strong>on</strong>g>of</str<strong>on</strong>g> registers. If R denotes the number <str<strong>on</strong>g>of</str<strong>on</strong>g> registers required perthread, W denotes the wavefr<strong>on</strong>t size (64 <strong>on</strong> RV770), N is the number <str<strong>on</strong>g>of</str<strong>on</strong>g> wavefr<strong>on</strong>ts runningin parallel and P is the number <str<strong>on</strong>g>of</str<strong>on</strong>g> physical registers in the register file, then N = P/(W ∗R).The driver is resp<strong>on</strong>sible <str<strong>on</strong>g>for</str<strong>on</strong>g> partiti<strong>on</strong>ing the physical register file am<strong>on</strong>g all the runningwavefr<strong>on</strong>ts. When a wavefr<strong>on</strong>t is swapped out from an ALU to a texture unit, the registersbeing used by the wavefr<strong>on</strong>t are still retained in the register file until the wavefr<strong>on</strong>t finishesexecuti<strong>on</strong>.For RV770, the wavefr<strong>on</strong>t size <str<strong>on</strong>g>of</str<strong>on</strong>g> 64 is four times the number <str<strong>on</strong>g>of</str<strong>on</strong>g> thread processors perSIMD. Thus each thread processor is running four threads in parallel and is likely to hidethe latency <str<strong>on</strong>g>of</str<strong>on</strong>g> ALU units. The register file is divided into banks <str<strong>on</strong>g>of</str<strong>on</strong>g> 64 and thus I speculatethat each thread in a wavefr<strong>on</strong>t reads and writes to exactly <strong>on</strong>e bank.2.3.2 Compute shadersCompute shaders are similar to pixel shaders in most respects. To execute a compute shader,64 c<strong>on</strong>tinuous thread-ids in a thread are chosen as a wavefr<strong>on</strong>t. The maximum number <str<strong>on</strong>g>of</str<strong>on</strong>g>wavefr<strong>on</strong>ts that can run in parallel in a compute shader is determined by the number <str<strong>on</strong>g>of</str<strong>on</strong>g>registers used per thread as well as the amount <str<strong>on</strong>g>of</str<strong>on</strong>g> local data share owned by each thread.Shared registersThe register file in a compute shader is used not <strong>on</strong>ly <str<strong>on</strong>g>for</str<strong>on</strong>g> the per-thread registers butalso <str<strong>on</strong>g>for</str<strong>on</strong>g> shared registers. Shared registers are registers shared am<strong>on</strong>gst threads in a threadgroup. However shared registers have the restricti<strong>on</strong> that the register is actually <strong>on</strong>ly sharedbetween threads n, n+64, n+128.. and so <strong>on</strong> within a thread group. Thus thread 0 andthread 1 within a thread group do not actually see the same register. Instead there are 6412

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!