PPKE ITK PhD and MPhil Thesis Classes
10 1. INTRODUCTION<br />
accelerator can be implemented on multi-processor VLSI ASIC digital emulators<br />
(e.g. CASTLE [26]), on DSP-, SPE-, and GPU-based hardware accelerator<br />
boards (e.g. CNN-HAC [27], Cell Broadband Engine Architecture [28], NVIDIA<br />
CUDA [29], respectively), or on FPGA-based reconfigurable computing architectures<br />
(e.g. FALCON [30]). Generally, these accelerators are faster than the software<br />
simulators, but they are slower than the analog/mixed-signal<br />
CNN-UM implementations.<br />
A special Hardware Accelerator Board (HAB) with four 16-bit fixed-point DSP<br />
chips and on-board memory was developed for simulating arrays of up to one<br />
million pixels. In a digital HAB, each DSP calculates the dynamics<br />
of one partition of the whole CNN array. Since calculating the CNN<br />
dynamics leaves a major part of the DSP capability unused, special-purpose chips have<br />
been developed.<br />
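The partitioning idea can be illustrated with a minimal Python sketch. It is not the HAB firmware: it assumes the standard Chua-Yang CNN state equation with 3×3 templates, a forward-Euler step, a zero boundary condition, and floating-point arithmetic instead of 16-bit fixed point. The function names and the template values in the usage lines are hypothetical. Each row strip reads only the old state, so the four strips could be computed by four independent processors.

```python
def output(x):
    """Piecewise-linear CNN output: y = 0.5 * (|x + 1| - |x - 1|)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def euler_step_rows(x, u, A, B, z, h, row_start, row_stop):
    """Return the updated rows [row_start, row_stop) of the state x.

    x, u : full state and input arrays (lists of lists); cells outside
           the array contribute 0 (assumed zero boundary condition).
    A, B : 3x3 feedback and control templates; z: bias; h: time step.
    """
    rows, cols = len(x), len(x[0])
    new_rows = []
    for i in range(row_start, row_stop):
        new_row = []
        for j in range(cols):
            acc = z
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    k, l = i + di, j + dj
                    if 0 <= k < rows and 0 <= l < cols:
                        acc += A[di + 1][dj + 1] * output(x[k][l])
                        acc += B[di + 1][dj + 1] * u[k][l]
            # Forward Euler: x_new = x + h * (-x + sum(A*y) + sum(B*u) + z)
            new_row.append(x[i][j] + h * (-x[i][j] + acc))
        new_rows.append(new_row)
    return new_rows


def partitioned_step(x, u, A, B, z, h, n_parts=4):
    """Split the rows into n_parts strips and update each one separately."""
    rows = len(x)
    bounds = [rows * p // n_parts for p in range(n_parts + 1)]
    result = []
    for p in range(n_parts):
        # A strip needs one row of old state from each neighbouring strip.
        result.extend(
            euler_step_rows(x, u, A, B, z, h, bounds[p], bounds[p + 1])
        )
    return result


# Hypothetical templates and data, for illustration only.
A = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
B = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]
u = [[1.0 if (i + j) % 2 else -1.0 for j in range(8)] for i in range(8)]
x0 = [[0.0] * 8 for _ in range(8)]
x1 = partitioned_step(x0, u, A, B, z=-1.0, h=0.1, n_parts=4)
```

Because every strip is computed from the same old state, the partitioned result matches a single-processor update of the whole array; only the border rows of neighbouring strips have to be exchanged between steps.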
The first emulated-digital, custom ASIC VLSI CNN-UM processor, called<br />
CASTLE.v1, was developed at MTA-SZTAKI in the Analogical and Neural Computing<br />
Laboratory between 1998 and 2001 for processing binary images [26], [31].<br />
By using a full-custom VLSI design methodology, this specialized systolic CNN<br />
array architecture greatly reduced the area requirements of the processor and<br />
made it possible to implement multiple processing elements (with distributed<br />
ALUs) on the same silicon die. The second version of the CASTLE processor<br />
was designed with variable computing precision (1-bit ’logical’ and 6/12-bit<br />
’bitvector’ processing modes), and its structure can be expanded into an array of<br />
CASTLE processors. Moreover, it is capable of processing 240×320-pixel images<br />
or video at 25 fps in real time with low power dissipation (in the mW range).<br />
The emulated-digital approach can also benefit from technology scaling: new<br />
manufacturing technologies allow smaller and faster circuits with reduced<br />
power dissipation.<br />
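To put the real-time figure in perspective, the required cell-update rate can be estimated with a few lines of Python. This is only back-of-the-envelope arithmetic, not a model of the CASTLE pipeline; the function name and the iterations_per_frame parameter are assumptions introduced for illustration and do not come from the cited work.

```python
def cell_update_rate(width, height, fps, iterations_per_frame=1):
    """Cell updates per second needed to keep up with a video stream.

    iterations_per_frame is a hypothetical parameter: the number of
    CNN iteration steps applied to every frame.
    """
    return width * height * fps * iterations_per_frame


# 240x320 frames at 25 fps, one iteration per frame.
rate = cell_update_rate(320, 240, 25)
print(rate)  # 1920000
```

With one iteration per frame this gives 1.92 million cell updates per second; the requirement grows linearly with the number of iterations applied to each frame.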
Several fundamental attributes of the Falcon architecture [30] are based on<br />
the CASTLE emulated-digital CNN-UM array processor architecture. However, the<br />
most important features that were greatly improved in this FPGA-based implementation<br />
are the flexibility of programming, the scalable accuracy of CNN<br />
computations, and the configurable template size. Therefore, the majority of these