
For speed, one starts with the native speed of compiled C++. Software virtual platforms use sophisticated technology like just-in-time cross-compilation to deliver impressive execution speeds, for example booting an entire guest operating system in seconds. TLM models avoid the overheads of event-driven simulation (by sticking to SC_METHOD and avoiding SC_THREAD and SC_CTHREAD). If they must use events, they play tricks like “temporal decoupling” to amortize the overhead.
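The SC_METHOD-versus-SC_THREAD trade-off can be seen in a minimal SystemC sketch (the module and signal names here are illustrative, not taken from any particular virtual platform): an SC_METHOD process is a plain callback that runs to completion, whereas an SC_THREAD keeps a coroutine stack and pays a context switch at every wait().

    // Minimal SystemC sketch: all work happens in an SC_METHOD callback,
    // so the simulation kernel never pays thread-context-switch overhead.
    #include <systemc.h>

    SC_MODULE(FastTarget) {
        sc_in<bool>          req;
        sc_out<sc_uint<32>>  count;

        void handle_req() {
            // Runs to completion on every change of 'req'; no wait(), no stack.
            count.write(count.read() + 1);
        }

        SC_CTOR(FastTarget) {
            SC_METHOD(handle_req);     // cheap callback
            sensitive << req;
            dont_initialize();
            // An SC_THREAD with wait() calls would model timing more directly,
            // but every wait() costs a coroutine context switch.
        }
    };

    int sc_main(int, char**) {
        sc_signal<bool>         req("req");
        sc_signal<sc_uint<32>>  count("count");
        FastTarget t("t");
        t.req(req);
        t.count(count);
        sc_start(SC_ZERO_TIME);
        req.write(true);               // trigger one "transaction"
        sc_start(10, SC_NS);
        return 0;
    }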

While all these tricks are laudable (and perhaps adequate for some modeling purposes), performance unfortunately drops steeply as one adds almost any architectural detail, such as actual parallelism, interconnect contention, memory or I/O bottlenecks, cycle-accurate device models, and so on. Performance also drops steeply when one adds any instrumentation at all (which is necessary for performance modeling). Further, since the basic substrate is C++, it is extraordinarily difficult to achieve any speedup by exploiting multi-core hosts.

In summary, as you add architectural detail (for accuracy), you lose speed. Development time increases sharply as programmers contort the code, applying tricks to restore speed.

One attempted solution is to use FPGA-based execution for the architecturally detailed component(s) of the system model. This remains problematic: there are the logistical difficulties of dealing with multiple languages and paradigms for different components; the software side may still run too slowly; bandwidth constraints in communicating between the software and FPGA sides may limit speed in any case; and visibility into the FPGA-side components is limited. Further, conventional approaches have problems in creating the FPGA-side components at all (see the synthesis discussion below).

High-Level Synthesis (HLS) from C++ and SystemC is often held out in the industry as a potential way forward. Unfortunately, there are serious problems in using C++ and SystemC as a starting point. A central issue is architecture, architecture, architecture.

It is well understood that the key to high-performance software is good algorithms. Further, good algorithms are designed by people with training and experience, not by tools (a compiler only optimizes an algorithm we write; it does not improve the algorithm itself). Similarly, the key to high-performance hardware (measured in area, speed, and power) is good architecture. Hence, it is crucial to give designers the maximum power to express architecture; all the rest can be left to tools.

Unfortunately, HLS from C++ does exactly the opposite: it obscures architecture. These HLS tools try to choose an architecture based on internal heuristics. Designers have indirect, second-order controls on this activity, expressed as “constraints” such as bill-of-materials limits for hardware resources and pragmas for loop transformations (unrolling, peeling, fusion, etc.). It is hard (and often involves guesswork) to “steer” these tools towards a good architecture.
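As a concrete illustration of how indirect these controls are, here is a small, purely illustrative C kernel annotated in the style of one commercial HLS tool (the pragma spellings follow Xilinx Vitis HLS conventions; other tools use different directives with similar effect). The designer can only hint at unrolling and pipelining; the actual datapath, schedule, and resource sharing are chosen by the tool's heuristics, and the hints may simply not be honored if the tool's analysis cannot support them.

    // Illustrative HLS kernel: the architecture is requested via pragmas,
    // not expressed directly by the designer.
    void scale(const int in[64], int out[64], int k) {
        for (int i = 0; i < 64; ++i) {
    #pragma HLS UNROLL factor=4     // hint: replicate the loop body 4 ways
    #pragma HLS PIPELINE II=1       // hint: start a new iteration every cycle
            // Whether these hints are honored depends on array ports,
            // dependences, and the tool's internal heuristics.
            out[i] = in[i] * k;
        }
    }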

HLS from C++ has another serious limitation. It is fundamentally an exercise in “automatic parallelization”, i.e., in automatically transforming a sequential program into a parallel implementation. This problem has been well studied since the days of the earliest supercomputers such as the CDC 6600 and the Cray-1, and even there it works only moderately well, and only for “loop-and-array” codes. This is because the key technologies needed for automatic parallelization, namely dependence analysis, alias analysis, etc., on the CDFG (Control/Data Flow Graph), have been shown to be feasible only for simple loops, often statically bounded, working on arrays indexed by simple (affine, linear) combinations of the loop indexes. Such loop-and-array computations of course exist in a few components of an SoC, notably the signal-processing ones, but are absent in the majority of the SoC's components, which are more “control-oriented” (processors, memory systems, interconnects, DMAs and data movers, I/O peripherals, etc.). Even in the DSP space, modern algorithms are becoming more control-oriented in order to adapt to noise (in signal processing) and to actual content (in audio and video codecs).
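The distinction can be made concrete with two tiny, purely illustrative C++ fragments: the first is the kind of statically bounded, affine-indexed loop that dependence analysis handles well; the second is the kind of pointer-chasing, data-dependent code that dominates control-oriented SoC components and defeats automatic parallelization.

    // (a) Loop-and-array code: fixed trip count, and each iteration touches
    //     only element i (an affine function of the loop index), so the
    //     iterations can be proven independent and parallelized automatically.
    void affine_kernel(float a[1024], const float b[1024]) {
        for (int i = 0; i < 1024; ++i)
            a[i] = 0.5f * a[i] + 0.5f * b[i];
    }

    // (b) Control-oriented code: the trip count and memory-access pattern
    //     depend on run-time data, so alias and dependence analysis cannot
    //     bound what each iteration touches, and parallelization gives up.
    struct Node { int key; Node* next; };

    int count_matches(const Node* head, int key) {
        int n = 0;
        for (const Node* p = head; p != nullptr; p = p->next)
            if (p->key == key)
                ++n;
        return n;
    }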

Finally, HLS from C++ has yet another serious limitation. It is well known that, for the same computation, one may choose different algorithms depending on the computation model and the associated cost model of the platform. This is why there are entire textbooks on parallel algorithms: the best parallel algorithm for a problem is often quite different from the best sequential one.
