15.01.2013 Views

U. Glaeser


have disparities approaching infinity, but one of the major applications of stereo vision is collision avoidance,⁷ in which it is possible to put a lower bound on the distances of objects from the camera. In practical camera systems, this results in a need to consider objects with disparities from 0 pixels (i.e., at infinity) to on the order of 10–100 pixels at the closest permissible approach. Thus this problem has all of the required attributes for an efficient pipeline-parallel implementation:

• Parallelism of 10–100 or more<br />

• Simple calculations (comparing pixel intensities)<br />

• Regular computation (the same correlation operators are applied to each pixel)<br />

Woodfill et al., using the census transform to reduce problems caused by intensity variations and depth discontinuities, programmed a PARTS engine [17] to calculate object depths from pairs of 320 × 240 pixel images. With a maximum disparity of 32, their system was able to compute depth at 42 frames per second [18]. They estimated that it was performing about 2.3 × 10⁹ RISC-equivalent operations per second.

Piacentino et al. have built a video processing system (Sarnoff Vision Front End 200) in which reconfigurable processing elements are used not only for stereo computations but also for motion estimation and warping [19]. They estimate that the VFE-200 can provide ~500 GOPS of processing power.
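The census-based matching described above can be illustrated with a minimal Python sketch. It is not Woodfill et al.'s implementation: it assumes a 3 × 3 census window, clamping at the image border, and a single-pixel Hamming cost (practical systems usually sum the cost over a window). The function names `census_transform` and `disparity_map` are illustrative only.

```python
def census_transform(img, radius=1):
    """Give each pixel a bit signature: one bit per neighbour darker than the centre."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            bits = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx == 0:
                        continue
                    # Clamp at the image border (an assumption of this sketch).
                    ny = min(max(y + dy, 0), h - 1)
                    nx = min(max(x + dx, 0), w - 1)
                    bits = (bits << 1) | (img[ny][nx] < img[y][x])
            out[y][x] = bits
    return out


def disparity_map(left, right, max_disparity):
    """For each left-image pixel, pick the disparity whose census codes differ least."""
    cl, cr = census_transform(left), census_transform(right)
    h, w = len(left), len(left[0])
    disp = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best_d, best_cost = 0, None
            # Candidate disparities are independent of one another, so
            # hardware can evaluate all of them in parallel.
            for d in range(min(max_disparity, x) + 1):
                cost = bin(cl[y][x] ^ cr[y][x - d]).count("1")  # Hamming distance
                if best_cost is None or cost < best_cost:
                    best_d, best_cost = d, cost
            disp[y][x] = best_d
    return disp
```

The inner loop over `d` is where the "parallelism of 10–100 or more" arises: each candidate disparity needs only an XOR and a bit count, and the same operation is repeated for every pixel.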

Encryption/Decryption<br />

Shand and Vuillemin have used RSA cryptography as a benchmark for their PAM machines; they were able to demonstrate an order of magnitude improvement in performance relative to the best software implementations of the time. In 1992, PAM achieved over 1 Mb/s for 512-bit keys compared to 56 kb/s on a 150 MHz Alpha processor [20]. This relative performance will not change; state-of-the-art FPGAs can now fit the entire PAM system in a single device, giving the reconfigurable hardware system additional speed as it no longer needs to use slower inter-device links or external memory.

Symmetric encryption algorithms are easily and efficiently implemented in FPGAs; they require a number of “rounds” of application of simple operations. Each round can be implemented as a pipeline stage. Thus, as an example, TwoFish [21] requires 16 rounds of lookup table accesses, which can be implemented as a 16-stage pipeline. This allows a stream of 32-bit input data words to be encrypted at very high input frequencies with a latency of 16 cycles. In a study of four AES candidates, Elbirt et al. report an order of magnitude difference between FPGA-based implementations and the best software ones [22]; however, they also note that for one AES candidate, CAST-256, FPGA implementations were slower than their software counterparts. This result highlights the fact that the performance advantage of commodity processors can only be overcome when the problem matches the capabilities of FPGA-based custom processors. By adding further pipeline stages within each round (24 for TwoFish, for example), Chodowiec et al. were able to achieve throughputs greater than 10 Gb/s for five of the AES candidate algorithms (12 Gb/s using a 95 MHz internal clock for Rijndael, the eventual winner of the AES competition) [23].
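The round-per-stage idea can be sketched as a small cycle-by-cycle simulation in Python. This is an illustration under stated assumptions: `toy_round` is a made-up stand-in for a cipher round (it is not TwoFish and offers no security), and `run_pipeline` is a hypothetical helper that models one pipeline register per round.

```python
def toy_round(word, stage):
    """Stand-in for one cipher round (NOT TwoFish): rotate left by 3, mix in a stage constant."""
    rotated = ((word << 3) | (word >> 29)) & 0xFFFFFFFF
    return rotated ^ ((0x9E3779B9 * (stage + 1)) & 0xFFFFFFFF)


def run_pipeline(words, stages=16):
    """Clock a `stages`-deep pipeline and return (cycle, output_word) pairs."""
    regs = [None] * stages  # regs[i] holds the result of round i, or None if empty
    pending = list(words)
    results = []
    cycle = 0
    while pending or any(r is not None for r in regs):
        cycle += 1
        # Update from the last stage backwards so every word advances
        # exactly one stage per clock.
        for i in range(stages - 1, 0, -1):
            regs[i] = None if regs[i - 1] is None else toy_round(regs[i - 1], i)
        regs[0] = toy_round(pending.pop(0), 0) if pending else None
        if regs[-1] is not None:
            results.append((cycle, regs[-1]))
    return results


if __name__ == "__main__":
    for cycle, word in run_pipeline([1, 2, 3]):
        print(f"cycle {cycle}: {word:#010x}")
```

With 16 stages the first result appears on cycle 16 and one further result follows on each subsequent cycle, which is the latency and throughput behavior the text describes. As a rough cross-check on the quoted Rijndael figure, 12 Gb/s at 95 MHz works out to about 126 bits per clock, close to one 128-bit block per cycle.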

Secure communications systems require encryption hardware; placing the encryption subsystem in hardware makes it less susceptible to tampering and enables keys to be hidden in “write-only” registers. Reconfigurable hardware provides an additional capability, algorithm agility [24]. This not only enables an encryption algorithm which has become insecure to be replaced with a secure one, but permits an algorithm-independent security protocol to use the hardware effectively, loading the appropriate algorithm on a transaction-by-transaction basis.
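Algorithm agility can be pictured with a short Python model. Everything here is hypothetical: `ReconfigurableCryptoUnit` only imitates a device that holds one configuration at a time, and `toy_xor` and `toy_add` are placeholder transforms, not real ciphers.

```python
class ReconfigurableCryptoUnit:
    """Models a device that holds one cipher configuration at a time."""

    def __init__(self, configurations):
        self.configurations = dict(configurations)  # name -> callable(key, data)
        self.loaded = None
        self.reloads = 0

    def process(self, algorithm, key, data):
        if algorithm not in self.configurations:
            raise KeyError(f"no configuration available for {algorithm!r}")
        if self.loaded != algorithm:
            # Stands in for downloading a new configuration to the FPGA.
            self.loaded = algorithm
            self.reloads += 1
        return self.configurations[algorithm](key, data)

    def retire(self, algorithm):
        """Drop a configuration, e.g. once its algorithm is considered insecure."""
        self.configurations.pop(algorithm, None)
        if self.loaded == algorithm:
            self.loaded = None


def toy_xor(key, data):
    """Placeholder transform: XOR every byte with a one-byte key."""
    return bytes(b ^ key for b in data)


def toy_add(key, data):
    """Placeholder transform: add a one-byte key to every byte, modulo 256."""
    return bytes((b + key) % 256 for b in data)
```

A protocol layer would call `process` with whichever algorithm a transaction negotiates; the unit reconfigures only when the requested algorithm differs from the one currently loaded, and `retire` models replacing an algorithm that is no longer trusted.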

Compression<br />

Using a systolic-array-style implementation of the LZ algorithm, Huang et al. were able to obtain throughputs 30 times greater than those achievable with commodity processors [25]. This speedup was obtained even though their FPGAs (Xilinx XC4036s) were clocked at 16 MHz versus 450 MHz for the fastest software implementation. Huang et al. believe that even better relative performance would be obtained

⁷ The vehicle carrying the camera system is expected to move away before this bound is violated.

© 2002 by CRC Press LLC
