Architecture of Computing Systems (Lecture Notes in Computer ...

A Tightly Coupled Accelerator Infrastructure for Exact Arithmetics

possibly write accesses for the results, and for future extensions that modify individual blocks of the accumulation units.

From the client side, it is important that after requesting a medium-grained operation (inner product, matrix multiplication) the relevant data is sent. This ideally happens by writing operand data to distinct memory-mapped addresses of the HTX board while respecting the data order according to Sect. 3.
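The client-side protocol can be sketched as follows. All concrete details here are illustrative assumptions, not the paper's actual register map: the base offsets, the 8-byte operand slots, and the row-major/column-major traversal stand in for whatever data order Sect. 3 actually prescribes.

```python
# Hypothetical sketch of the client side: after requesting a medium-grained
# operation, operand data is written to distinct memory-mapped addresses of
# the HTX board. BASE_A, BASE_B, and the traversal orders are assumptions.
import struct

BASE_A = 0x0000          # assumed offset region for operands of matrix A
BASE_B = 0x1000          # assumed offset region for operands of matrix B

def stream_operands(mmio, A, B):
    """Write A row-major and B column-major into the memory-mapped region,
    mimicking a fixed operand order for a matrix-multiplication request."""
    off = BASE_A
    for row in A:
        for x in row:
            mmio[off:off + 8] = struct.pack('<d', x)  # one double per slot
            off += 8
    off = BASE_B
    for col in zip(*B):                               # column-major traversal
        for x in col:
            mmio[off:off + 8] = struct.pack('<d', x)
            off += 8
```

In place of the real memory-mapped window, any writable buffer (e.g. a `bytearray` or an `mmap` object) can be passed as `mmio` for testing.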

Recalling the capabilities of the hardware platform and the resulting mode of operation, reading the accumulated result can be accomplished in one of two basic ways: explicitly offered read operations in the accumulator that are started upon read requests on the HyperTransport posted queue, or streamed, uninterrupted write-out of the accumulator to main memory. Explicit read operations in the accumulator pipeline violate the concept of pipelining, making it even more complicated and potentially decreasing maximum execution speed. The second approach, i.e., always computing the accumulator's value in double precision and uninterruptedly sending these values to the system's main memory, using the IDs as the least-significant part of the address, looks fine at first sight. While it does not interfere with the read bandwidth of the EAU on the HTX board, it does not allow sending the results of more than 3 EAUs, as each EAU produces a new result every 3 cycles.
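The 3-EAU ceiling follows from a simple rate argument, sketched below. The single-result-per-cycle capacity of the write-out channel is an assumption consistent with the text, not a figure stated in the paper.

```python
# Back-of-the-envelope check of the streamed write-out limit: assume the
# write channel to main memory carries at most one converted result per
# cycle, while each EAU emits a new double-precision result every 3 cycles.
from fractions import Fraction

RESULTS_PER_CYCLE_PER_EAU = Fraction(1, 3)  # one new value each 3 cycles
WRITE_SLOTS_PER_CYCLE = 1                   # assumed single write port

def max_streaming_eaus():
    """Largest EAU count whose combined output rate still fits the channel."""
    n = 0
    while (n + 1) * RESULTS_PER_CYCLE_PER_EAU <= WRITE_SLOTS_PER_CYCLE:
        n += 1
    return n
```

Three EAUs exactly saturate the assumed channel (3 × 1/3 = 1 result per cycle); a fourth would overrun it.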

To circumvent the aforementioned limitations, we suggest combining the two approaches: each EAU converts its accumulation values and returns them together with the IDs that led to these results. A separate controller handles incoming read requests, buffers the read address that contains the accumulation unit and the ID, and returns the next value of the specified EAU that has the same ID. This task fits perfectly with the controller required for initializing the local memory. Still, there is a choice between continuing execution while waiting for the result, or stalling the accumulation unit. When choosing the latter, no problems due to subsequent modifications can occur. While the number of bits for the ID has already been set to 4 for a sufficient amount of flexibility, the number of bits for different EAUs could be chosen arbitrarily. According to our experience with a single-precision implementation [10], a total of 16 EAUs will be enough, so that 4 bits of the address should suffice for choosing a specific EAU. This scheme can be enhanced further by having the controller automatically write back the final results of each completed accumulation.
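The address fields described above can be illustrated with a small encoding sketch. The text fixes only the field widths (4 ID bits, 4 EAU-select bits); the field positions and the word-aligned shift below are assumptions for illustration.

```python
# Illustrative read-address layout: 4 bits select one of up to 16 EAUs and
# 4 bits carry the accumulation ID. Field ordering and the 8-byte alignment
# shift are assumptions; the paper does not fix a concrete address layout.
ID_BITS = 4
EAU_BITS = 4
WORD_SHIFT = 3                    # assumed 8-byte (double) word alignment

def encode_read_addr(eau, acc_id):
    """Pack an EAU index and accumulation ID into a read address."""
    assert 0 <= eau < (1 << EAU_BITS) and 0 <= acc_id < (1 << ID_BITS)
    return ((eau << ID_BITS) | acc_id) << WORD_SHIFT

def decode_read_addr(addr):
    """Recover (eau, acc_id) from a read address, as the controller would."""
    word = addr >> WORD_SHIFT
    return (word >> ID_BITS) & ((1 << EAU_BITS) - 1), word & ((1 << ID_BITS) - 1)
```

The controller described in the text would apply a decoding like `decode_read_addr` to each buffered read request to select the EAU and match the ID of the next returned value.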

4.5 Putting It All Together

In Sect. 4.2, we saw that the accumulation unit accepts operands only every 3 cycles due to otherwise conflicting accesses. By demultiplexing the products of the multipliers onto different accumulation units as in Figure 6, we can circumvent this disadvantage, allowing uninterrupted computation. Considering further that upon each cycle we can obtain a value of the left input matrix A, we arrive at a throughput of 3 exact MAC operations per cycle, which is 300M exact MAC operations per second (eMACs/sec) when running at a frequency of 100 MHz.
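The stated throughput follows directly from the two figures given in the text, as the following arithmetic check shows:

```python
# Sanity check of the throughput claim: the demultiplexing scheme sustains
# 3 exact MAC operations per cycle, and the design runs at 100 MHz.
EMACS_PER_CYCLE = 3            # throughput after demultiplexing (from text)
FREQ_HZ = 100 * 10**6          # 100 MHz operating frequency (from text)

def emacs_per_second():
    """Exact MAC operations completed per second."""
    return EMACS_PER_CYCLE * FREQ_HZ
```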
