Architecture of Computing Systems (Lecture Notes in Computer ...

232 F. Nowak and R. Buchty

5.3 Comparison

In order to evaluate our architecture, we performed 2^20 multiply-accumulates of the value v = 2^-20 onto an init value of 2^20 using our hardware and two software solutions. For the latter, we employed an implementation based on the GNU Multiprecision Library (GMP) using a 256-bit-wide internal floating-point representation and one based on C-XSC, where a 4288-bit-wide fixed-point accumulator is employed. Individual runtimes were measured on an Intel Pentium 4 at 3.2 GHz over 128 runs. All programs were compiled with -O2. Subtracting the initialization value afterwards, we obtained the correct result 2^-20, which fails when using regular double-precision floating-point operations.
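The loss of accuracy described above can be reproduced in software. The following sketch is our own illustration, not the measured setup: it assumes each multiply-accumulate contributes v * v = 2^-40 (which is consistent with the stated exact result of 2^-20 after 2^20 operations), and it mimics the idea behind C-XSC's wide fixed-point accumulator with a scaled Python integer.

```python
from fractions import Fraction

N = 2 ** 20          # number of multiply-accumulate operations
v = 2.0 ** -20       # the accumulated value; assumed increment per MAC is v * v = 2^-40
init = 2.0 ** 20     # initialization value of the accumulator

# Regular double precision: adding 2^-40 to 2^20 would need 60 mantissa
# bits, but doubles have only 52, so every increment is rounded away.
acc = init
for _ in range(N):
    acc += v * v
double_result = acc - init            # 0.0 instead of the exact 2^-20

# Fixed-point accumulator in the spirit of C-XSC's long accumulator:
# values are integers scaled by 2^40, so every addition is exact.
SCALE = 40
acc_fx = (2 ** 20) << SCALE           # the init value, exactly representable
for _ in range(N):
    acc_fx += 1                        # v * v == 2^-40 == one unit at this scale
exact_result = Fraction(acc_fx - ((2 ** 20) << SCALE), 2 ** SCALE)

print(double_result)                         # 0.0 -- all increments lost to rounding
print(exact_result == Fraction(1, 2 ** 20))  # True -- exactly 2^-20
```

The fixed-point variant never rounds, which is precisely why a wide accumulator (4288 bits in C-XSC, or the hardware unit described here) recovers the correct result.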

The measurements can be found in Table 7. We additionally provide the execution time with regular double precision, which is about 100 times lower than that of the highly accurate libraries. Our architecture concept is capable of executing exact multiply-accumulate operations in about the same order of time.

Based on the maximum achievable throughput for HT400, we can derive a theoretical maximum speedup of 400/5.237 = 76.38 over state-of-the-art software implementations of exact arithmetic. Following Sect. 5.1, limitations of the target platform still allow a maximum speedup of 88/5.237 = 16.80.
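The speedup figures follow directly from the quoted throughput numbers; a quick sanity check of the arithmetic (our own snippet, taking 5.237 as the software baseline figure from the text, whose units are not restated in this excerpt):

```python
software_baseline = 5.237   # reference figure for the software exact-arithmetic libraries
ht400_throughput = 400.0    # maximum achievable throughput for HT400
platform_limit = 88.0       # limit imposed by the target platform (Sect. 5.1)

print(round(ht400_throughput / software_baseline, 2))  # 76.38
print(round(platform_limit / software_baseline, 2))    # 16.8
```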

6 Results and Future Work<br />

As of now, our system allows processing two MAC operations every 3 cycles while running at a frequency of 100 MHz, which leads to a theoretical processing speed of 66M MAC operations per second. Note that these operations are carried out in exact arithmetic, avoiding rounding and windowing errors. We showed in Section 3 how a multitude of accumulation units can help to constantly process streamed data and how using the DDR memory can further increase the overall performance to up to 400M MAC operations per second. A technology scaling showed that this architecture is a viable approach to exploiting the available bandwidth of the HyperTransport bus. For better results, the implementation of the accumulator unit will be revised, promising a latency of 1 cycle and even lower resource usage.
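The quoted 66M MAC/s follows from the issue rate; a minimal check of that arithmetic (our own snippet, with hypothetical variable names):

```python
clock_hz = 100e6        # 100 MHz system clock
macs_per_issue = 2      # two MAC operations ...
cycles_per_issue = 3    # ... every 3 cycles

macs_per_second = clock_hz * macs_per_issue / cycles_per_issue
print(macs_per_second / 1e6)   # ~66.7, matching the 66M MAC/s stated above
```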

Our future work consists of integrating the DDR memory and DMA controllers in order to carry out detailed measurements with numerical software, where computation with exact arithmetic enables solving more complex problems. We also plan to extend our system infrastructure with numerical error measurements so that functions can be directed to their exact-arithmetic implementation automatically, sacrificing speed for robustness and numerical stability.

Similar to [1], the approach will be extended towards microprogramming capabilities so that (parts of) algorithms can be loaded completely onto the accelerator, thus supporting sparse matrix multiplications. Finally, the architecture is designed to deliver coarse-grained function acceleration of numerical algorithms such as the Lanczos algorithm.
