20.11.2014 Views

PPKE ITK PhD and MPhil Thesis Classes


1. INTRODUCTION

accelerator can be implemented on multi-processor VLSI ASIC digital emulators (e.g. CASTLE [26]), on DSP-, SPE-, and GPU-based hardware accelerator boards (e.g. CNN-HAC [27], Cell Broadband Engine Architecture [28], and Nvidia CUDA [29], respectively), or on FPGA-based reconfigurable computing architectures (e.g. FALCON [30]). In general, these accelerators outperform software simulators, but they are slower than the analog/mixed-signal CNN-UM implementations.

A special Hardware Accelerator Board (HAB) was developed for simulating arrays of up to one million pixels (held in on-board memory) using four 16-bit fixed-point DSP chips. In a digital HAB, each DSP calculates the dynamics of one partition of the whole CNN array. Because calculating the CNN dynamics leaves a major part of the DSP's capability unused, special-purpose chips have been developed.
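The partitioning idea can be illustrated with a minimal sketch. It is not the HAB firmware: it assumes the standard CNN state equation dx/dt = -x + A*y + B*u + z with the piecewise-linear output y = 0.5(|x + 1| - |x - 1|), a forward-Euler step, 3×3 templates, a zero boundary outside the array, and a split into contiguous row strips. It also uses floating-point values instead of 16-bit fixed-point arithmetic. All function names and parameters are hypothetical.

```python
def output(x):
    # Standard CNN piecewise-linear output: y = 0.5 * (|x + 1| - |x - 1|)
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def euler_step_rows(x, u, A, B, z, h, row_start, row_stop):
    """Advance rows [row_start, row_stop) by one forward-Euler step.

    The worker reads neighbouring rows of the full state x (its halo)
    but produces new values only for its own strip.
    """
    rows, cols = len(x), len(x[0])
    new_rows = []
    for i in range(row_start, row_stop):
        new_row = []
        for j in range(cols):
            acc = -x[i][j] + z
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    # Assumption: cells outside the array contribute zero.
                    if 0 <= ni < rows and 0 <= nj < cols:
                        acc += A[di + 1][dj + 1] * output(x[ni][nj])
                        acc += B[di + 1][dj + 1] * u[ni][nj]
            new_row.append(x[i][j] + h * acc)
        new_rows.append(new_row)
    return new_rows


def partition_rows(rows, parts):
    """Split range(rows) into `parts` contiguous strips of near-equal size."""
    base, extra = divmod(rows, parts)
    bounds, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def partitioned_step(x, u, A, B, z, h, parts=4):
    """One Euler step of the whole array, computed strip by strip.

    Each strip stands in for the share of one DSP; here the strips are
    processed sequentially and stitched back together.
    """
    result = []
    for start, stop in partition_rows(len(x), parts):
        result.extend(euler_step_rows(x, u, A, B, z, h, start, stop))
    return result
```

Because every strip reads the same old state and writes only its own rows, the stitched result matches a single-worker update of the full array. The only data the workers would need to exchange are the boundary rows between adjacent strips.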

The first emulated-digital, custom ASIC VLSI CNN-UM processor, called CASTLE.v1, was developed at the Analogical and Neural Computing Laboratory of MTA-SZTAKI between 1998 and 2001 for processing binary images [26], [31]. By using a full-custom VLSI design methodology, this specialized systolic CNN array architecture greatly reduced the area requirements of the processor and made it possible to implement multiple processing elements (with distributed ALUs) on the same silicon die. The second version of the CASTLE processor was designed with variable computing precision (1-bit 'logical' and 6/12-bit 'bitvector' processing modes), and its structure can be expanded into an array of CASTLE processors. Moreover, it can process 240×320 images or video at 25 fps in real time with low power dissipation (in the mW range). The emulated-digital approach can also benefit from technology scaling, since newer manufacturing technologies yield smaller and faster circuits with reduced power dissipation.
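To put the real-time figure in perspective, the required pixel throughput follows from simple arithmetic on the numbers quoted above. The sketch below is only illustrative. The `iterations_per_frame` parameter is a hypothetical addition, since the text does not state how many CNN iterations are run per frame.

```python
def pixel_updates_per_second(width, height, fps, iterations_per_frame=1):
    # Cell updates needed per second to keep up with the frame rate.
    return width * height * fps * iterations_per_frame


rate = pixel_updates_per_second(320, 240, 25)
print(rate)        # 1920000 cell updates per second for one iteration per frame
print(1e9 / rate)  # time budget per cell update in nanoseconds (about 520.8)
```

With more than one iteration per frame the budget per cell update shrinks proportionally, which is one reason several processing elements on the same die are useful.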

Several fundamental attributes of the Falcon architecture [30] are based on the CASTLE emulated-digital CNN-UM array processor architecture. However, the most important features that were greatly improved in this FPGA-based implementation are the flexibility of programming, the scalable accuracy of CNN computations, and the configurable template size. Therefore, the majority of these
