Lecture Notes in Computer Science 4917

Experiences with Parallelizing a Bio-informatics Program on the Cell BE

barriers, which force the execution of DMAs in launch order. We use this feature when copying a buffer from one SPU to another, followed by sending a mailbox message² to notify the destination SPU that the buffer has arrived. By using a barrier, we can launch the message without waiting for completion of the buffer DMA. Furthermore, tag queues can be specified such that the barrier applies only to the DMAs in the specified tag queue. Thus, other DMAs (e.g., to stream data structures) are not affected by the barrier.
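As an illustration, the following is a minimal sketch of this pattern using the MFC intrinsics from spu_mfcio.h. The effective addresses, the tag number, and the notification word are hypothetical stand-ins for the program's actual variables, and (per footnote ²) the mailbox write is assumed to be a small DMA targeting the destination SPU's inbound-mailbox register.

    #include <spu_mfcio.h>
    #include <stdint.h>

    /* Hypothetical notification word; aligned so it can serve as the
     * source of a small DMA transfer. */
    static volatile uint32_t notify_word __attribute__((aligned(16))) = 1;

    /* Copy a local-store buffer to another SPU and notify it, without
     * waiting for the buffer DMA to complete.  dst_buf_ea and dst_mbox_ea
     * are assumed effective addresses of the destination SPU's receive
     * buffer and of its inbound-mailbox register. */
    void send_buffer_and_notify(volatile void *buffer_ls, uint64_t dst_buf_ea,
                                uint64_t dst_mbox_ea, uint32_t size)
    {
        const uint32_t tag = 5;   /* tag queue used for SPU-to-SPU copies */

        /* 1. Launch the buffer DMA. */
        mfc_put(buffer_ls, dst_buf_ea, size, tag, 0, 0);

        /* 2. Launch the mailbox message with a barrier on the same tag
         *    queue: it cannot execute before the buffer DMA above, so no
         *    explicit wait for completion is needed, and DMAs in other
         *    tag queues (e.g., streamed data structures) are unaffected. */
        mfc_putb(&notify_word, dst_mbox_ea, sizeof(notify_word), tag, 0, 0);
    }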

We experienced some properties of the Cell BE as limitations. For example, 32-bit integer multiplies are not supported in hardware; instead, the compiler generates multiple 16-bit multiplies. This is an important limitation for the prfscore() function, which uses 32-bit multiplies extensively. Also, the DMA scatter/gather functionality was not useful to us, as we needed 8-byte scatter/gather operations but the Cell requires that the data elements are at least 128 bytes large.
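To make the cost concrete, the sketch below shows, in scalar C for clarity, roughly the decomposition into 16-bit products that the compiler must generate for every 32-bit multiply; on the SPU the real code uses the mpy/mpyu/mpyh instructions on vector registers, but the structure is the same.

    #include <stdint.h>

    /* A 32-bit multiply rebuilt from 16 x 16 -> 32 bit products, the only
     * multiply width the SPU supports in hardware.  The high*high partial
     * product only affects bits >= 32 and is dropped mod 2^32. */
    static inline uint32_t mul32_via_16bit(uint32_t a, uint32_t b)
    {
        uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
        uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

        return (a_lo * b_lo)                 /* low  x low  */
             + ((a_lo * b_hi) << 16)         /* low  x high */
             + ((a_hi * b_lo) << 16);        /* high x low  */
    }

Three hardware multiplies (plus shifts and adds) thus replace every single 32-bit multiply in prfscore().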

Finally, although mailbox communication is easy to understand, it is a very raw device with which to implement parallel primitives. We find that mailboxes do not provide enough programmer abstraction and are, in the end, hard to use. One problem results from sending all communication through a single mailbox. This makes it impossible to develop separately the functionality to communicate with a single SPU, as this functionality can receive unexpected messages from a different SPU and must know how to deal with these messages. An interesting solution could be the use of tag queues in the incoming mailbox, such that one can select only a particular type or source of message.
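The sketch below illustrates the demultiplexing that a single shared mailbox forces on the receiver; the message encoding and the handle_unexpected() helper are hypothetical, while spu_read_in_mbox() is the blocking inbound-mailbox read from spu_mfcio.h.

    #include <spu_mfcio.h>
    #include <stdint.h>

    /* Hypothetical encoding: source SPU id in the top byte, payload below. */
    #define MSG_SRC(m)      ((m) >> 24)
    #define MSG_PAYLOAD(m)  ((m) & 0x00FFFFFFu)

    extern void handle_unexpected(uint32_t msg);   /* hypothetical handler */

    /* Wait for a message from one particular SPU.  Because every SPU writes
     * into the same inbound mailbox, this code must still be prepared to
     * receive and dispatch messages from any other SPU, which is what makes
     * separately developed per-SPU communication modules impossible. */
    uint32_t wait_for_message_from(uint32_t wanted_src)
    {
        for (;;) {
            uint32_t msg = spu_read_in_mbox();  /* blocks until a word arrives */
            if (MSG_SRC(msg) == wanted_src)
                return MSG_PAYLOAD(msg);
            handle_unexpected(msg);
        }
    }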

7 Related Work

The Cell BE architecture promises high performance at low power consumption. Consequently, several researchers have investigated the utility of the Cell BE for particular application domains.

Williams et al. [13] measure performance and power consumption of the Cell BE when executing scientific computing kernels. They compare these numbers to other architectures and find potential speedups in the 10-20x range. However, the low double-precision floating-point performance of the Cell is a major downside for scientific applications. A high-performance FFT is described in [14].

Héman et al. [15] port a relational database to the Cell BE. Only some database operations (such as projection, selection, etc.) are executed on the SPUs. The authors point out the importance of avoiding branches and of properly preparing the layout of data structures to enable vectorization.

Bader et al. [16] develop a list ranking algorithm for the Cell BE. List ranking is a combinatorial application with highly irregular memory accesses. As memory accesses are hard to predict in this application, the authors propose the use of software-managed threads on the SPUs. At any one time, only one thread is running. When it initiates a DMA request, the thread blocks and control switches to another thread. This results in a kind of software fine-grain multi-threading and yields speedups of up to 8.4 for this application.

² SPU-to-SPU mailbox communication is implemented using DMA commands.
