
automata (GCA); an example of traffic system simulation is described—digital circuits and forest fire development are further systems which have suitable characteristics [13].

Petri net models are also used extensively in simulation studies; as with cellular automata, there is abundant low-level parallelism to be exploited—the firability of each transition can be evaluated simultaneously. Petri net models are based on simple units—places and transitions. It is possible to create generic models in VHDL for these units [14], paving the way to automatic generation of VHDL code from a natural visual representation of Petri nets, which can be compiled and downloaded to suitable hardware. A single Achilles stack is able to accommodate a model containing of the order of 200 transitions [14].
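As an illustration only (it is not the generic place and transition unit of [14]), the following VHDL sketch models a single transition with two input places and one output place; the entity and signal names, the 4-bit token counters, and the initial marking are assumptions made for the example. The firability test is purely combinational, so in a larger net every transition could be evaluated in the same clock cycle.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity pn_transition is
  port (
    clk, reset : in  std_logic;
    fire       : out std_logic
  );
end entity;

architecture rtl of pn_transition is
  -- Token counts for two input places and one output place (4-bit, assumed).
  signal p_in1, p_in2, p_out : unsigned(3 downto 0) := (others => '0');
  signal enabled             : std_logic;
begin
  -- A transition is firable when every input place holds at least one token.
  enabled <= '1' when (p_in1 > 0) and (p_in2 > 0) else '0';
  fire    <= enabled;

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        p_in1 <= to_unsigned(1, 4);  -- example initial marking
        p_in2 <= to_unsigned(1, 4);
        p_out <= (others => '0');
      elsif enabled = '1' then
        -- Firing: consume one token from each input place and
        -- deposit one token in the output place.
        p_in1 <= p_in1 - 1;
        p_in2 <= p_in2 - 1;
        p_out <= p_out + 1;
      end if;
    end if;
  end process;
end architecture;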

Reconfigurable Processors vs. Commodity Processors

Any special-purpose hardware has to compete with the rapid increase in performance of commodity processors. Despite the relative inefficiency of a general-purpose processor for many applications, if the special-purpose hardware only provides a speedup of, say, 2, then Moore's Law will ensure that the advantage⁸ of the special-purpose hardware is lost in a year. When assessing whether an application will benefit from the use of a reconfigurable processor, one has to keep the following points in mind.
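Before turning to those points, the break-even arithmetic behind the claim above can be sketched, assuming (as footnote 8 does) that commodity performance doubles every 18 months, so that a fixed hardware speedup S is overtaken after

\[
t_{\text{break-even}} = 18\,\log_2 S \ \text{months},
\qquad
S = 2 \;\Rightarrow\; t_{\text{break-even}} = 18\ \text{months},
\]

which footnote 8 shortens to roughly a year once the extra cost of the additional hardware is taken into account.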

Raw Performance

The raw performance of FPGA-based solutions will always lag behind that of commodity processors. This is superficially reflected in maximum clock speeds: an FPGA's maximum clock speed will typically be one-third or less of that of a commodity processor at the same point in time. This is inevitable and will continue: the reconfiguration circuitry loads a circuit and requires space, increasing its propagation delay and reducing the maximum clock speed.

Parallelism

Thus, to realize a benefit from a reconfigurable system, the application must have a considerable degree of inherent parallelism which can be used effectively. The parallelism may be exploited simply by deploying multiple processing blocks—each processing a separate data element at the same time—followed by some "aggregation" circuitry, which reduces results from the individual processing blocks in some way.
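A minimal VHDL sketch of this replicate-and-aggregate pattern follows; the entity name, the choice of squaring as the per-block operation, and the operand widths are assumptions made for the illustration. Four processing blocks work on separate data elements in one cycle, and the aggregation stage sums their results in the next.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity par_aggregate is
  port (
    clk            : in  std_logic;
    x0, x1, x2, x3 : in  unsigned(7 downto 0);
    sum            : out unsigned(17 downto 0)
  );
end entity;

architecture rtl of par_aggregate is
  type prod_array is array (0 to 3) of unsigned(15 downto 0);
  signal prod : prod_array;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: four independent processing blocks, one data element each.
      prod(0) <= x0 * x0;
      prod(1) <= x1 * x1;
      prod(2) <= x2 * x2;
      prod(3) <= x3 * x3;
      -- Stage 2: aggregation circuitry reduces the partial results.
      sum <= resize(prod(0), 18) + resize(prod(1), 18)
           + resize(prod(2), 18) + resize(prod(3), 18);
    end if;
  end process;
end architecture;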

Long Pipelines

Alternatively, a long pipeline may be employed in which the same data element transits multiple processing blocks in successive clock cycles. This approach trades latency for throughput: it may take many cycles—the latency—for the first result to appear, but after that new processed data are available on each clock cycle, giving high throughput. Many signal processing tasks can effectively use long pipelines.
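The trade can be seen in the following toy VHDL sketch of a three-stage pipeline computing a*b + c (the entity name and operand widths are assumptions): the first result appears only after three clock cycles, but a new result then emerges on every cycle.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity pipe3 is
  port (
    clk     : in  std_logic;
    a, b, c : in  unsigned(7 downto 0);
    y       : out unsigned(16 downto 0)
  );
end entity;

architecture rtl of pipe3 is
  signal a_r, b_r  : unsigned(7 downto 0);
  signal c_r, c_rr : unsigned(7 downto 0);
  signal prod      : unsigned(15 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Stage 1: register the operands.
      a_r  <= a;  b_r <= b;  c_r <= c;
      -- Stage 2: multiply; carry c along so it stays aligned with its product.
      prod <= a_r * b_r;
      c_rr <= c_r;
      -- Stage 3: accumulate.
      y    <= resize(prod, 17) + resize(c_rr, 17);
    end if;
  end process;
end architecture;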

Memory

FPGA devices do not provide large amounts of memory efficiently: recent devices (e.g., Altera's APEX 20K devices [2]) do attempt to address this deficiency and provide significant dedicated memory resources; however, the total number of memory bits remains relatively small and is insufficient to support applications which require large amounts of randomly accessible data. This means that, although preprocessing an image which is available as a pixel stream from a camera for edge detection is feasible, subsequent processing of the image in order to segment it is considerably more difficult. In the first case, to apply a 3 × 3 mask to the pixel stream, only the two preceding rows of the image need be stored. The application of the 3 × 3 mask requires a maximum of nine basic multiply-accumulate operations. Thus, it can be handled effectively in a 9-stage pipeline—easily allowing an edge-detected image to be produced at the same rate as the original image is streamed into the FPGA (allowing for an 8-pixel clock latency before the first result is available). In the second case, the whole image needs to be stored and be available for random access. Although an FPGA with auxiliary memory might handle this task, it is less likely to offer a significant advantage over a general-purpose processor.
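As a rough sketch of how the two stored rows can be arranged (the entity name, the generic IMG_W, and 8-bit pixels are assumptions, and a real design would map the row buffers onto dedicated block RAM rather than registers), the following VHDL forms the nine pixels of the 3 × 3 window from the incoming stream; the window outputs would then feed the multiply-accumulate pipeline described above.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity window_3x3 is
  generic (IMG_W : positive := 640);        -- assumed image width
  port (
    clk    : in  std_logic;
    pix_in : in  unsigned(7 downto 0);      -- streaming pixel
    w00, w01, w02,
    w10, w11, w12,
    w20, w21, w22 : out unsigned(7 downto 0) -- 3 x 3 window
  );
end entity;

architecture rtl of window_3x3 is
  type row_t is array (0 to IMG_W-1) of unsigned(7 downto 0);
  signal row1, row2 : row_t;                -- the two preceding image rows
  signal c00, c01, c02, c10, c11, c12, c20, c21, c22 : unsigned(7 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- Shift the two row buffers: the pixel leaving row1 enters row2.
      row2 <= row2(1 to IMG_W-1) & row1(0);
      row1 <= row1(1 to IMG_W-1) & pix_in;
      -- Three-pixel shift registers tap each row to form the window.
      c02 <= pix_in;   c01 <= c02;  c00 <= c01;
      c12 <= row1(0);  c11 <= c12;  c10 <= c11;
      c22 <= row2(0);  c21 <= c22;  c20 <= c21;
    end if;
  end process;
  w00 <= c00; w01 <= c01; w02 <= c02;
  w10 <= c10; w11 <= c11; w12 <= c12;
  w20 <= c20; w21 <= c21; w22 <= c22;
end architecture;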

⁸ The author has (somewhat arbitrarily) shortened the “break-even” point from the 18 months of Moore's Law, because the extra cost of additional hardware needs to be factored in versus using cheap commodity hardware.

