
Lecture Notes in Computer Science 4917


H. Vandierendonck et al.

[Figure 2 placeholder. Labels recoverable from the figure: i-loop, HH[j]; j-loop, f, e, s; "vectorize by 4 i-loop iterations".]

Fig. 2. Data dependencies between distinct iterations of the inner loop body of two nested loops

Vectorization of prfscore(). The prfscore() function called from the PA loops
computes a vector dot product. The vector length is input-dependent but it
cannot exceed 32 elements by definition of the data structures. Furthermore, the
vector length remains constant during the whole PA phase.

We completely unroll the loop assuming 32 iterations of the loop and we
perform a 4-way vectorization. This removes all control flow at the cost of code
size increase. To take the loop iteration limit into account, we use the spu_sel()
primitive together with a pre-computed mask array. The spu_sel() primitive
selects only those words for which the mask contains ones, so it allows us to sum
over only those values that are required.
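The masked summation can be sketched in portable C as follows. This is an illustration under stated assumptions, not the authors' SPU code: the element type (32-bit integers), the helper names sel_word(), make_mask() and masked_dot(), and the scalar emulation of the 4-word vector are all hypothetical. On an SPU, sel_word() would correspond to spu_sel() applied to a whole vector at once, and the fixed-count loops would be unrolled completely.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LEN 32  /* upper bound on the vector length given in the text */
#define LANES   4   /* 4-way vectorization: four 32-bit words per vector */

/* Scalar stand-in for spu_sel(a, b, mask) on one 32-bit word:
   bits come from b where the mask bit is 1, from a where it is 0. */
static uint32_t sel_word(uint32_t a, uint32_t b, uint32_t mask) {
    return (a & ~mask) | (b & mask);
}

/* Build the mask array once per vector length (hypothetical helper):
   all ones for the first len elements, zero for the rest. Because the
   length is constant during the PA phase, this runs outside the hot loop. */
static void make_mask(uint32_t mask[MAX_LEN], int len) {
    assert(len >= 0 && len <= MAX_LEN);
    for (int k = 0; k < MAX_LEN; ++k)
        mask[k] = (k < len) ? 0xFFFFFFFFu : 0u;
}

/* Dot product over a fixed 32 elements. The trip counts are compile-time
   constants, so there is no data-dependent control flow; elements beyond
   the real length contribute zero through the select. Arithmetic is done
   on unsigned words to keep wrap-around well defined. */
static int32_t masked_dot(const int32_t x[MAX_LEN], const int32_t y[MAX_LEN],
                          const uint32_t mask[MAX_LEN]) {
    uint32_t acc[LANES] = {0u, 0u, 0u, 0u};  /* one partial sum per lane */
    for (int k = 0; k < MAX_LEN; k += LANES) {
        for (int l = 0; l < LANES; ++l) {
            uint32_t prod = (uint32_t)x[k + l] * (uint32_t)y[k + l];
            acc[l] += sel_word(0u, prod, mask[k + l]);
        }
    }
    /* Horizontal reduction of the four lane sums. */
    return (int32_t)(acc[0] + acc[1] + acc[2] + acc[3]);
}
```

A caller would invoke make_mask() once when the vector length becomes known and then pass the same mask array to every masked_dot() call.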

Control Flow Optimization. The inner loop body of the backward and forward
loops contains a significant amount of control flow, related to finding
the maximum of a variable over all loop iterations. In the PW phase, the
code also remembers the i-loop and j-loop iteration numbers where that maximum
occurs. It is important to avoid this control flow, since mispredicted
branch instructions have a high penalty on the SPUs. Updating the running
maximum value (if(b > a) a=b;) can be avoided by using the SPU's
compare and select instructions to turn control flow into data flow
(a=spu_sel(a,b,spu_cmpgt(b,a));). In the same vein, it is also possible to remember
the i-loop and j-loop iteration numbers of the maximum
(imax=spu_sel(imax,i,spu_cmpgt(b,a));).
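The compare-and-select pattern can be illustrated with a scalar C sketch. The helpers cmpgt_word() and sel_word() are hypothetical stand-ins for spu_cmpgt() and spu_sel() acting on a single 32-bit word, and the max_pos struct, the running_max() function, and the row-major score array are assumptions made only for this example. One detail worth noting is that the comparison mask has to be computed once, before the maximum is overwritten, so that the index updates and the value update see the same decision.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for spu_cmpgt on one word: all ones if b > a, otherwise zero. */
static uint32_t cmpgt_word(int32_t b, int32_t a) {
    return (uint32_t)0 - (uint32_t)(b > a);
}

/* Stand-in for spu_sel on one word: bits from b where mask is 1, else a. */
static uint32_t sel_word(uint32_t a, uint32_t b, uint32_t mask) {
    return (a & ~mask) | (b & mask);
}

/* Hypothetical result type: the maximum and the iteration pair where it occurs. */
typedef struct {
    int32_t max;
    int32_t imax;
    int32_t jmax;
} max_pos;

/* Track the running maximum of a rows x cols array without an if statement
   in the loop body. The strict comparison keeps the first occurrence. */
static max_pos running_max(const int32_t *score, int rows, int cols) {
    assert(rows > 0 && cols > 0);
    max_pos r = { score[0], 0, 0 };
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            int32_t b = score[i * cols + j];
            /* Compute the mask before r.max changes. */
            uint32_t gt = cmpgt_word(b, r.max);
            r.imax = (int32_t)sel_word((uint32_t)r.imax, (uint32_t)i, gt);
            r.jmax = (int32_t)sel_word((uint32_t)r.jmax, (uint32_t)j, gt);
            r.max  = (int32_t)sel_word((uint32_t)r.max,  (uint32_t)b, gt);
        }
    }
    return r;
}
```

For a 2 x 3 array holding 1, 5, 3 in the first row and 9, 2, 9 in the second, this sketch reports the maximum 9 at i = 1, j = 0, since a later equal value does not replace an earlier one.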

Vectorization. The forward and backward loops in the PW and PA phases have
the same data dependencies, which are depicted in Figure 2. There are two
nested loops, with the j-loop nested inside the i-loop. Every box in the figure
depicts one execution of the inner loop body corresponding to one pair of i and
j loop indices. The execution of the inner loop body has data dependencies with
previous executions of the inner loop body, as indicated by edges between the
boxes. Data dependencies carry across iterations of the j-loop, in which case
the dependencies are carried through scalar variables. Also, data dependencies
carry across iterations of the i-loop, in which case the dependencies are carried
