11.07.2015 Views

A Compiler for Parallel Exeuction of Numerical Python Programs on ...

A Compiler for Parallel Exeuction of Numerical Python Programs on ...

A Compiler for Parallel Exeuction of Numerical Python Programs on ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

float4 temp1 = i1[i][threadId.y/4];float4 temp2 = i1[i+1][threadId.y/4];float4 temp1 = i1[i+2][threadId.y/4];float4 temp1 = i1[i+3][threadId.y/4];sum.x += temp.x*temp1.x;sum.x += temp.y*temp2.x;sum.x += temp.z*temp3.x;sum.x += temp.w*temp4.x;sum.y += temp.x*temp1.y;sum.y += temp.y*temp2.y;sum.y += temp.z*temp3.y;sum.y += temp.w*temp4.y;sum.z += temp.x*temp1.z;sum.z += temp.y*temp2.z;sum.z += temp.z*temp3.z;sum.z += temp.w*temp4.z;sum.w += temp.x*temp1.w;sum.w += temp.y*temp2.w;sum.w += temp.z*temp3.w;sum.w += temp.w*temp4.w;}The code above c<strong>on</strong>tains sixteen ALU operati<strong>on</strong>s and five load instructi<strong>on</strong>s resulting ina ALU:TEX ratio <str<strong>on</strong>g>of</str<strong>on</strong>g> 3.2. This yields a peak theoretical per<str<strong>on</strong>g>for</str<strong>on</strong>g>mance <str<strong>on</strong>g>of</str<strong>on</strong>g> 172 Gflops <str<strong>on</strong>g>for</str<strong>on</strong>g> theRade<strong>on</strong> 4870. We can further increase the tile size to further increase the ALU:TEX ratio.The tile size is restricted by the number <str<strong>on</strong>g>of</str<strong>on</strong>g> registers available per thread. To hide memorylatency, let’s assume we run 8 wavefr<strong>on</strong>ts per SIMD giving 5120 threads <strong>on</strong> the chip. Thenwe get 32 registers available per thread, each <str<strong>on</strong>g>of</str<strong>on</strong>g> size 128 bits. 32 registers allows <str<strong>on</strong>g>for</str<strong>on</strong>g> a tile<str<strong>on</strong>g>of</str<strong>on</strong>g> up to 5×8 giving a ALU:TEX ratio <str<strong>on</strong>g>of</str<strong>on</strong>g> 12 with a predicted per<str<strong>on</strong>g>for</str<strong>on</strong>g>mance <str<strong>on</strong>g>of</str<strong>on</strong>g> 720 GFLOPs.If the memory access latency is not completely hidden, then the actual per<str<strong>on</strong>g>for</str<strong>on</strong>g>mance will beless than the predicted value.From the matrix multiplicati<strong>on</strong> example, it can be c<strong>on</strong>cluded that increasing the ratio <str<strong>on</strong>g>of</str<strong>on</strong>g>ALU instructi<strong>on</strong>s to texture instructi<strong>on</strong>s is essential <str<strong>on</strong>g>for</str<strong>on</strong>g> achieving maximum per<str<strong>on</strong>g>for</str<strong>on</strong>g>mance <strong>on</strong>the GPU. Loop trans<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong>s such as unrolling and tiling and the use <str<strong>on</strong>g>of</str<strong>on</strong>g> packed 128-bitdatatypes wherever possible can substantially reduce the number <str<strong>on</strong>g>of</str<strong>on</strong>g> texture instructi<strong>on</strong>sexecuted.2.5 On-chip bandwidth bottlenecksA simple way to understand the ALU:TEX bottleneck is to understand the <strong>on</strong>-chip bandwidthavailable compared to the bandwidth available <strong>on</strong> a modern CPU. Lets compare thebandwidth from the L1 cache and the bandwidth from the register file <strong>on</strong> the Rade<strong>on</strong> 4870to an AMD Phenom II X4 940 CPU. Phenom II X4 940 is a relatively high end x86 3.0 GHzquad-core. We compare the bandwidths available to the x4 940 to the bandwidths available16

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!