and optimizations detailed later in Chapters 5 and 6.

The JIT compiler is not visible to the programmer. As far as the programmer is concerned, an additional library (the JIT compiler) is linked into the application, but the programmer does not need to know about the function or inner workings of the JIT compiler. The JIT compiler also makes no difference to code that does not target GPUs.

4.2.2 GPU code generation

Now consider the full process of generating GPU code. When unPython encounters a loop that the programmer has marked to be compiled for execution on the GPU, it outputs two versions of code for that loop. First, unPython outputs a typed AST representation of the parallel loop to be used by jit4GPU for analysis and code generation. The representation is printed in a Lisp-like S-expression format read by jit4GPU, along with a symbol table and a table of the array references inside the loop. However, jit4GPU is not guaranteed to produce GPU code and returns failure if any of its analysis phases fails. Therefore, unPython also generates a C++ and OpenMP code path for the parallel loop, which is used as a fallback when jit4GPU fails to generate GPU code.

Jit4GPU is implemented as a multi-phase compiler operating on typed ASTs. An overview of jit4GPU is provided in Figure 4.2. In the first phase, jit4GPU performs the array-access analysis detailed in Chapter 5 to determine the data to be transferred to the GPU. The array-access analysis produces two outputs: a data structure representing the data transfers to be performed and a tree of code to be executed on the GPU. If jit4GPU cannot complete the access analysis, it returns failure. In the next phase, jit4GPU optimizes the typed AST using the loop optimizations detailed in Chapter 6. The loop optimization phases can also alter the data layout chosen on the GPU and can therefore change the parameters of the data transfer.
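To make the unPython-to-jit4GPU handoff concrete, the following is a minimal sketch of a marked parallel loop and of the kind of Lisp-like S-expression that could be printed for it. The prange annotation and the S-expression node names are illustrative assumptions for this sketch, not necessarily unPython's literal syntax.

    # Hypothetical input: the programmer marks the loop as parallel so that
    # unPython emits a GPU version of it. 'prange' is an illustrative
    # stand-in for unPython's parallel-loop annotation.
    prange = range  # placeholder so the sketch runs as ordinary Python

    def scale(a, n):
        for i in prange(n):      # loop marked for GPU execution
            a[i] = 2.0 * a[i]

    # A Lisp-like S-expression that could represent the typed AST of this
    # loop, emitted together with a symbol table and a table of array
    # references (node names are invented for illustration):
    #
    # (for (i 0 n 1)
    #   (assign (index a (i))
    #           (mul (const 2.0) (index a (i)))))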
In the third phase, jit4GPU generates AMD IL assembly from the AST and passes this assembly code to AMD's compiler, which converts the AMD IL code into a GPU binary. The AMD IL-to-binary compiler performs extensive low-level code transformations, such as physical register allocation and instruction scheduling across the various units of the GPU. The exact optimizations performed by this compiler are not published by AMD, but I have observed, by inspecting the disassembler output, that the AMD CAL compiler does perform such code transformations. In the final phase, jit4GPU calls a supporting runtime library to perform the actual data transfers and to execute the code on the GPU. Once the execution is finished and the data has been transferred back to the CPU, jit4GPU reports success and execution of the program resumes on the CPU from the program point just after the parallel loop.
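The overall control flow of jit4GPU can be summarized in a short sketch. Everything below is a hedged outline of the four phases described in this section, with invented names; the phase functions are passed in as parameters because their actual implementations are the subject of Chapters 5 and 6.

    # Hypothetical driver illustrating the jit4GPU phase ordering.
    # A failure in the analysis phase means the application falls back to
    # the C++/OpenMP version of the loop that unPython generated ahead of time.
    def run_jit4gpu(typed_ast, array_refs,
                    access_analysis, loop_optimize,
                    emit_amd_il, cal_compile, runtime_execute):
        result = access_analysis(typed_ast, array_refs)   # phase 1 (Chapter 5)
        if result is None:                                # analysis failed:
            return False                                  # OpenMP fallback runs
        transfers, gpu_tree = result
        gpu_tree, transfers = loop_optimize(gpu_tree, transfers)  # phase 2 (Chapter 6)
        il_code = emit_amd_il(gpu_tree)      # phase 3: emit AMD IL assembly
        binary = cal_compile(il_code)        # vendor compiler: IL -> GPU binary
        runtime_execute(binary, transfers)   # phase 4: transfer data, run kernel
        return True                          # resume on the CPU after the loop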
