Xcell Journal: The authoritative journal for programmable ... - Xilinx


| Functional Blocks    | % of Run-Time Total Cycles |
|----------------------|----------------------------|
| mv_search.c          | 67.31 %                    |
| block.c              | 8.19 %                     |
| refbuf.c             | 6.95 %                     |
| macroblock.c         | 3.48 %                     |
| rdopt.c              | 3.37 %                     |
| biariencode.c        | 3.21 %                     |
| cabac.c              | 2.98 %                     |
| memcpy.asm*          | 2.91 %                     |
| abs.c*               | 0.57 %                     |
| image.c              | 0.54 %                     |
| rdopt_coding_state.c | 0.46 %                     |
| loopFilter.c         | 0.03 %                     |

Table 1 – H.264/AVC encoder complexity profile by files
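To get a rough feel for what the profile in Table 1 implies, the sketch below applies Amdahl's law to the run-time shares. It is only an illustration: the acceleration factors are hypothetical, the function name `overall_speedup` is made up for this example, and the model ignores data-transfer and synchronization costs between the processor and any hardware block.

```python
# Run-time share of each file in percent, copied from Table 1.
PROFILE = {
    "mv_search.c": 67.31,
    "block.c": 8.19,
    "refbuf.c": 6.95,
    "macroblock.c": 3.48,
    "rdopt.c": 3.37,
    "biariencode.c": 3.21,
    "cabac.c": 2.98,
    "memcpy.asm": 2.91,
    "abs.c": 0.57,
    "image.c": 0.54,
    "rdopt_coding_state.c": 0.46,
    "loopFilter.c": 0.03,
}


def overall_speedup(offloaded, accel_factor):
    """Amdahl's-law speedup of the whole encoder when the listed files
    run accel_factor times faster and everything else is unchanged.

    accel_factor is a hypothetical number, not a measured result.
    """
    fraction = sum(PROFILE[name] for name in offloaded) / 100.0
    return 1.0 / ((1.0 - fraction) + fraction / accel_factor)


if __name__ == "__main__":
    # Assumed acceleration factors for the motion-search code only.
    for factor in (2, 10, 100):
        gain = overall_speedup(["mv_search.c"], factor)
        print(f"mv_search.c x{factor}: overall speedup {gain:.2f}")
```

Under this simplified model, accelerating `mv_search.c` alone cannot make the encoder much more than about three times faster, because the remaining third of the cycles still runs in software.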

However, computational complexity alone does not determine whether a functional module should be mapped to hardware or remain in software. To evaluate the viability of software and hardware partitioning of an H.264/AVC coding standard implementation on a platform that consists of a mixture of FPGAs, programmable DSPs, or general-purpose host processors, we need to look at a number of architectural issues that influence the overall design decision.

• Data locality. In a synchronous design, the ability to access memory in a particular order and granularity while minimizing the number of clock cycles lost to latency, bus contention, alignment, DMA transfer rate, and the types of memory used (such as ZBT memory, SDRAM, and SRAM) is very important. The data locality issue is primarily dictated by the physical interfaces between the data unit and the arithmetic unit (or the processing engine).

• Data parallelism. Most signal processing algorithms operate on data that is highly parallelizable (such as FIR filtering). Single instruction multiple data (SIMD) and vector processors are particularly efficient for data that can be parallelized or made into a vector format (or long data width).

FPGA fabric exploits this by providing a large amount of block RAM to meet very high aggregate bandwidth requirements. In the new Xilinx Virtex-4 SX device family, the amount of block RAM closely matches the number of XtremeDSP slices (SX25: 128 block RAMs, 128 DSP slices; SX35: 192 block RAMs, 192 DSP slices; SX55: 320 block RAMs, 512 DSP slices).

• Signal processing algorithm parallelism. In a typical programmable DSP or a general-purpose processor, signal processing algorithm parallelism is often referred to as instruction level parallelism (ILP). A very long instruction word (VLIW) processor is an example of a machine that exploits ILP by grouping multiple instructions (ADD, MULT, and BRA) to be executed in a single cycle. A heavily pipelined execution unit in the processor is also an excellent example of hardware that exploits this parallelism. Modern programmable DSPs, including the Texas Instruments TMS320C64x, have adopted this architecture.

However, not all algorithms can exploit such parallelism. Recursive algorithms like IIR filtering, variable-length coding (VLC) in MPEG-1/2/4, and context-adaptive variable length coding (CAVLC) and context-adaptive binary arithmetic coding (CABAC) in H.264/AVC are particularly inefficient when mapped to these programmable DSPs. This is because data recursion, where each result depends on the previous one, prevents ILP from being used effectively. Instead, dedicated hardware engines can be built efficiently in the FPGA fabric.

• Computational complexity. A programmable DSP is bounded in the computational complexity it can handle, as measured by the clock rate of the processor. Signal processing algorithms implemented in the FPGA fabric are typically computationally intensive. Some examples of these are the sum of absolute differences (SAD) engine in motion estimation and video scaling. By mapping these modules onto the FPGA fabric, the host processor or the programmable DSP has extra cycles for other algorithms. Furthermore, FPGAs can have multiple clock domains in the fabric, so selected hardware blocks can run at separate clock speeds based on their computational requirements.

• Theoretical optimality in quality. Any theoretically optimal solution based on the rate-distortion curve can be achieved if and only if the complexity is unbounded. In a programmable DSP or general-purpose processor, the computational complexity is always bounded by the clock cycles available. FPGAs, on the other hand, offer much more flexibility by exploiting data and algorithm parallelism by means of multiple instantiations of the hardware engines, or increased use of block RAM and register banks in the fabric. A programmable DSP or general-purpose processor is often limited by the number of instruction issues per cycle, the depth of pipelining in the execution unit, or the maximum data width available to fully feed the execution units. Video quality is often compromised as a result of the limited cycles available per task in a programmable DSP, whereas hardware resources can be fully allocated in FPGA fabric (three-step search versus full-search motion estimation, for example).
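The trade-off between three-step and full-search motion estimation mentioned in the last bullet, and the SAD engine named under computational complexity, can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the reference encoder's `mv_search.c`: the function names, the 8 x 8 block size, the ±7 search range, and the synthetic test frames are all chosen here for the example.

```python
import random


def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of cur at
    (bx, by) and the block of ref displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, r=7):
    """Test every displacement in [-r, r] x [-r, r]; returns the best
    (cost, dx, dy) and the number of SAD evaluations performed."""
    best = None
    evals = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            cost = sad(cur, ref, bx, by, dx, dy, n)
            evals += 1
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best, evals


def three_step_search(cur, ref, bx, by, n):
    """Three-step search: test the 8 neighbours of the current centre at
    step sizes 4, 2, 1, moving the centre to the best point each time."""
    cx, cy = 0, 0
    best_cost = sad(cur, ref, bx, by, 0, 0, n)
    evals = 1
    for step in (4, 2, 1):
        nx, ny = cx, cy
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if dx == 0 and dy == 0:
                    continue  # centre was already evaluated
                cost = sad(cur, ref, bx, by, cx + dx, cy + dy, n)
                evals += 1
                if cost < best_cost:
                    best_cost, nx, ny = cost, cx + dx, cy + dy
        cx, cy = nx, ny
    return (best_cost, cx, cy), evals


def make_frames(size=32, shift=(3, -2), seed=1):
    """Synthetic test data (an assumption of this sketch): a random
    reference frame and a current frame that is the reference shifted
    by `shift`, with edge pixels clamped."""
    rng = random.Random(seed)
    ref = [[rng.randrange(256) for _ in range(size)] for _ in range(size)]
    sx, sy = shift

    def clamp(v):
        return min(max(v, 0), size - 1)

    cur = [[ref[clamp(y + sy)][clamp(x + sx)] for x in range(size)]
           for y in range(size)]
    return cur, ref


if __name__ == "__main__":
    cur, ref = make_frames()
    print("full search:", full_search(cur, ref, 12, 12, 8))
    print("three-step :", three_step_search(cur, ref, 12, 12, 8))
```

With these settings, the full search evaluates 225 candidate positions per block and the three-step search only 25, but the three-step result can never be better than the full-search result and may be worse. The full-search SAD evaluations are independent of one another, which is what makes them a natural fit for multiple parallel hardware engines.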

Implementing Functional Modules onto FPGAs<br />

Figure 1 shows the overall H.264/AVC macroblock level encoder with major functional blocks and data flows defined. One of the primary successes of the H.264/AVC standard is its ability to predict the values of the content of a picture to be encoded by exploiting pixel redundancy in ways and directions not exploited in previous standards. Unfortunately, compared with previous standards, this increases the complexity and memory access bandwidth approximately four-fold.

Winter 2004 Xcell Journal 41
