11.07.2015 Views

A Compiler for Parallel Exeuction of Numerical Python Programs on ...

A Compiler for Parallel Exeuction of Numerical Python Programs on ...

A Compiler for Parallel Exeuction of Numerical Python Programs on ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

2.1.1 SIMD unitsEach SIMD unit is composed <str<strong>on</strong>g>of</str<strong>on</strong>g> the following:1. 16 thread processors (TP). Each TP is a VLIW unit that can execute <strong>on</strong>e 5-wide VeryL<strong>on</strong>g Instructi<strong>on</strong> Word (VLIW) instructi<strong>on</strong>s each cycle. Each TP c<strong>on</strong>tains five ALUs.AMD calls each ALU as stream processor (SP). Each SP can execute <strong>on</strong>e fp32 or <strong>on</strong>eint32 multiply-and-add (MAD) operati<strong>on</strong> per cycle. One <str<strong>on</strong>g>of</str<strong>on</strong>g> the five SPs in a TP isa “fat” SP and <strong>on</strong>ly the fat SP can execute transcendental functi<strong>on</strong>s such as sin orcos. Thus, the peak per<str<strong>on</strong>g>for</str<strong>on</strong>g>mance <str<strong>on</strong>g>of</str<strong>on</strong>g> a TP is ten fp32 FLOPs per cycle. For fp64, thescenario is more complex. The four n<strong>on</strong>-fat ALUs co-operate to execute a single fp64MAD instructi<strong>on</strong> each cycle and thus the fp64 peak <str<strong>on</strong>g>of</str<strong>on</strong>g> each TP is two FLOPs/cycleor 1/5th <str<strong>on</strong>g>of</str<strong>on</strong>g> the fp32 peak. If the compiler is unable to pack 5 instructi<strong>on</strong>s into <strong>on</strong>epacket, then the peak is not achieved because each TP is a VLIW processor.2. Each SIMD has a huge register file with 64×256 float4/int4 registers. The registersare untyped with four 32-bit comp<strong>on</strong>ents. Each comp<strong>on</strong>ent <str<strong>on</strong>g>of</str<strong>on</strong>g> a register can eitherstore a floating-point value or an integer value. Two comp<strong>on</strong>ents <str<strong>on</strong>g>of</str<strong>on</strong>g> a register can becombined to store a double-precisi<strong>on</strong> value. The register file is arranged in sixteenblocks <str<strong>on</strong>g>of</str<strong>on</strong>g> four banks each. This gives a total <str<strong>on</strong>g>of</str<strong>on</strong>g> 64 banks and each bank can do <strong>on</strong>efloat4 transfer per cycle. Thus the bandwidth to the register file is 1024 bytes/cycle.The peak transfer rate from the register file is less than the peak demand <str<strong>on</strong>g>of</str<strong>on</strong>g> the ALUs.This is explained by the fact that there exists a <str<strong>on</strong>g>for</str<strong>on</strong>g>warding path where a TP can usea result computed in a previous cycle.3. A 16-Kbyte shared memory called the local data share (LDS). The LDS is arranged infour banks, each with 256 128-bit entries. Each bank can do <strong>on</strong>e 128-bit transfer percycle. Thus the LDS can transfer 64 bytes/cycle. There<str<strong>on</strong>g>for</str<strong>on</strong>g>e, the bandwidth from theLDS is much smaller than the bandwidth from the register file. The size and layout <str<strong>on</strong>g>of</str<strong>on</strong>g>the register file suggests that it is very similar to <strong>on</strong>e block <str<strong>on</strong>g>of</str<strong>on</strong>g> a register file but withadditi<strong>on</strong>al indexing hardware.2.1.2 Texture unitsEach texture unit can compute four addresses per cycle. There<str<strong>on</strong>g>for</str<strong>on</strong>g>e, the ten texture unitscan calculate a total <str<strong>on</strong>g>of</str<strong>on</strong>g> 40 addresses per cycle. Each texture unit has a dedicated L1 cache.The size <str<strong>on</strong>g>of</str<strong>on</strong>g> the L1 cache has not been published by AMD. Each texture unit is aligned withand services exactly <strong>on</strong>e SIMD unit. Given that an SIMD can execute 800 ALU operati<strong>on</strong>sor equivalently 1600 FLOPs per cycle, the asymmetry in the ALU:TEX ratio is apparent.The texture unit provides many facilities, such as filtering, but those are typically not usedin compute workloads and are thus not discussed in this thesis.7

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!