The Softer Side of Software Defined Radio - ICCAD

The Softer Side of Software Defined Radio

Copyright © MediaTek Inc. All rights reserved.

By Yuan Lin


Mobile Computing

▪ In 2011, world-wide mobile telephone subscriptions: 5.6 billion
  – ~79% of the population
  – Some countries have mobile penetration over 100%
  – The largest consumer electronics category by volume
▪ Multi-media anywhere at anytime
▪ Wireless communication everywhere


Software Defined Radio

[Figure: mobile phone architecture: an analog frontend (TD-SCDMA, LTE, WCDMA) feeds a baseband processor, while application processors drive the camera, keypad, display, speaker, and microphone]




Software Defined Radio

[Figure: the same architecture with the baseband processor expanded: a GPP runs the MAC, link, network, and transport layers, while DSPs + ASICs handle the PHY layer]


Software Defined Radio

Software Defined Radio (SDR): the use of software routines instead of ASICs for wireless protocols' physical-layer processing.

[Figure: the same architecture, with the PHY layer now running on DSPs rather than ASICs, beneath the GPP's MAC, link, network, and transport layers]


Mobile SDR Design Challenges

[Figure: peak performance (Gops) versus power (W) on log scales, with 1, 10, and 100 Mops/mW efficiency contours; mobile SDR requires roughly 100 Mops/mW, well beyond embedded DSPs (TI C6x), high-end DSPs, the IBM Cell, and general-purpose processors. Better power efficiency lies toward the upper left.]

Reference: Lin et al., "SODA: A High Performance DSP Architecture for Software Defined Radio," IEEE Micro Top Picks, 2007




SDR Processors: Common Trends

▪ Converging DSP architecture model
  – scalar + DSP
  – VLIW on top of SIMD
  – Algorithm-specific accelerations
▪ Higher performance through
  – wider SIMD
  – multiple units
  – multiple cores

[Figure: a scalar core alongside multiple DSP cores, each with local memory and algorithm-specific accelerators]


Mobile SDR Design Challenges

[Figure: the same performance-versus-power chart, with SDR DSP processors now reaching the mobile target of roughly 100 Mops/mW]


Trials and Tribulations of SDR Programming

▪ High performance
▪ Portability/productivity
  – Updates within the same processor family
  – Different processor platforms
▪ The situation is getting worse

[Figure: ease of coding versus performance; compiled C/C++ is easy to write but slow, processor-specific assembly code is fast but hard to write, and the "programming gap" in between (high performance and high ease of coding) is where we want to be]


Outline

▪ Language Extension for Wireless
▪ Compilation Challenges
▪ Software Synthesis
▪ DSP Programming Practices


Why C Isn't Enough

▪ C is not designed for describing DSP algorithms
  – Algorithm prototyping
    • Vector/matrix operations are common
    • C obfuscates these operations
  – DSP firmware development
    • No good way to represent specialized DSP processor architectural features

for (i = 0; i < N/2; i++) {
  out[2*i] = in[i];
  out[2*i+1] = in[i+N/2];
}

Q: What is this vector operation?
A: A basic vector perfect-shuffle operation.


Language Extensions for Wireless Algorithms

▪ A list of "nice-to-have" language features for wireless algorithms
  – Vector/matrix
    • Vector/matrix arithmetic operations
    • Vector/matrix permutation operations, e.g. perm_out(out) = op(perm_in0(in0), perm_in1(in1), …)
  – Fixed-point and complex fixed-point
    • e.g. 4-bit complex arithmetic
  – Algorithm toolbox
    • e.g. multi-precision FFT/FIR library functions
  – System programming considerations
▪ We have languages that provide one or more of these features, but none with all of them
  – OpenCL: no fixed-point arithmetic or algorithm toolbox
  – MATLAB: no fixed-point arithmetic or explicit type declarations
  – SystemC: no vector/matrix arithmetic or algorithm toolbox


Compiler Challenges

▪ Compilers don't understand algorithms
  – Algorithm-specific parallelization techniques
  – Algorithm-specific optimization techniques
  – Algorithm-specific approximation techniques
▪ There is no single best algorithm implementation
  – Multiple implementations
  – Design trade-offs (e.g. memory versus performance)
    • e.g. how to determine the optimal sort algorithm?
▪ Multi-core, multi-level scratchpad memories, and HW-SW co-design make the problem even more challenging

Everywhere you look, there are optimization problems.


Algorithm-Specific Compilers

[Figure: an FFT node expands into candidate implementations FFT Alg1, FFT Alg2, FFT Alg3]

▪ Exploit algorithm-specific optimizations
  – Special parallelization techniques
    • Sliding-window technique for Turbo decoding
  – Different implementations
    • Table lookup or run-time computation
  – Special processor accelerations
    • Radix-4 FFT instruction
  – Different BER
    • 16-bit fixed-point versus floating-point


Algorithm-Specific Compilers

▪ Implementation templates
  – Multi-core software pipelining
  – Multi-core parallelization
  – Input/output vector ordering

> fft_compiler -points 64 -arch dsp.arch

[Figure: an FFT specification (1024-point, complex 16-bit) and a HW specification (512-bit SIMD, 2 cores) feed the FFT compiler, which selects among FFT implementation templates (FFT Alg1, Alg2, Alg3) and applies FFT-specific optimizations to emit high-performance C + intrinsics code]


Last Thought on DSP Programming

▪ How do we know if we achieved the optimal performance?
  – The code is fast enough because…
    • Um, my boss is happy with it?
▪ How much of the performance is limited by our own perceived upper bound?
  – Performance code checker
  – Performance estimator


Performance Code Checker

▪ Every DSP processor comes with its own programmer's manual
  – A list of good and bad coding practices
  – Some are universal, some are processor/compiler-specific
▪ An analysis tool that checks for bad programming practices
  – Different sets of analysis rules for different processors & compilers
  – e.g. non-vectorizable loop structures, wrong intrinsics, etc.
▪ Lint for DSP C code


Why Use a Code Checker?

What's wrong with this code?

void foo()
{
    ...
    char index;
    for (i = 0; i < 1000; i++) {
        array[index++] = vec[i] / 10;
    }
}

On the Cognovo platform: only unsigned types should be used for array indexing, because of an AGU + compiler quirk.

On TI C64x: there is no integer divide unit, so the division turns into a function call and disables software pipelining.


Performance Estimator

▪ Provide performance estimations during algorithm prototyping

[Figure: the algorithm is fed to simulators for candidate SoCs (SoC 1, SoC 2, SoC 3), each returning estimated cycles, power, and area]


Thanks


Building Predictable Cyber-Physical Systems from Dynamic Applications and Platforms

Sander Stuijk
Department of Electrical Engineering, Electronic Systems

Based on joint work with Twan Basten, Marc Geilen, Bart Theelen, and many others


Embedded streaming systems

Application trends: uncertainty, concurrency, dynamism


Model-based design

Modeling, analysis, implementation, and run-time management, combined in a design flow.

[Figure: the design flow maps applications (App 1 + App 2) onto a platform and produces a Gantt chart]


WLAN application

A new OFDM symbol arrives every 4.0 μs.
The Sync, Header, and Payload scenarios each process one symbol; the CRC scenario processes no symbol.
Ports may have rates (rate one is omitted for clarity).
Initial tokens return to their original distribution after one iteration.


WLAN application

Each scenario may be modeled with a different scenario graph.
Persistent token names relate the initial tokens across different scenario graphs.


WLAN application

[Gantt chart: actors Src, Shift, Sync, Hdem, Hdec, and Pars executing the sync scenario over 0-12 μs]


WLAN application

[Gantt chart: the same actors over 0-12 μs, now covering the sync and header scenarios]


Analyzing SADF graphs

[Gantt chart for one scenario sequence: actors Src, Shift, Sync, Hdem, Hdec, and Pars executing the sync and header scenarios over 0-12 μs]

Execution is a sequence of vector shapes: after the Sync and Header scenarios, the initial tokens on src, shift, and pars carry time stamps such as 0 ns, 4000 ns, and 5940 ns.

Token time stamps (vector shapes) provide constraints for the next iteration.


Analyzing SADF graphs

[Figure: a max-plus automaton with states for the Sync, Header, Payload, and CRC scenarios; each edge carries a delay and a progress count (e.g. "4000 ns, 1"), and each state carries the token time-stamp vector over src, shift, payload, and pars]


Analyzing SADF graphs

Throughput: maximum cycle mean / maximum cycle ratio (MCM/MCR) analysis on the max-plus automaton.
Latency: longest-path analysis.

[Figure: the same max-plus automaton as on the previous slide]


Predictable scenario-aware design-flow

[Figure: the platform is a set of tiles (one with an ARM, one with a software codec (SWC), one with an EVP), each containing IMEM, DMEM, and a network interface (NI), connected by an interconnect; the flow must map the application onto it]

Design-flow steps: compute buffer constraints, unified resource binding, static-order scheduling, TDMA time-slice allocation, ...


Compute buffer constraints

A buffer size constraint is modeled with a back-edge carrying initial tokens: start with a buffer size of 1 token per edge and run throughput analysis; if the throughput requirement is not met, enlarge the buffer (e.g. to 2 tokens) and repeat the analysis.


Predictable scenario-aware design-flow

[Figure: the application actors bound to the tiles (EVP, SWC, ARM) of the platform]

Unified mapping avoids the need for a run-time reconfiguration mechanism.
Static-order schedules may change between scenarios.


Modeling timing impact of platform

[Figure: a connection over the interconnect has delay; the binding-aware dataflow graph models each connection delay with an added dataflow actor D]


Predictable scenario-aware design-flow

[Figure: a resource (the SWC tile) is shared; it is scheduled with a static-order (SO) schedule only]


WLAN application

[Gantt chart: actors src, shift, payload, and pars mapped onto the evp, swc, and arm; an OFDM symbol arrives every 4 μs, and the chart shows the sync, header, payload, and crc scenarios over 0-20 μs together with the active periods of each processor]

Initial tokens of resources capture resource availability.
The timing requirements seem to be met...


Model-based design

Modeling, analysis, implementation, and run-time management, combined in a design flow.


Run-time reconfiguration

[Figure: the same multi-tile platform (EVP, SWC, ARM tiles)]

DVFS changes actor execution times.
DVFS settings are modeled with system scenarios.


WLAN application

[Gantt chart: the same mapping over 0-36 μs, with the system reconfiguring between DVFS configurations c1 and c2 (scenarios sync-c1 through crc-c1, reconfiguration c1 → c2, sync-c2 through crc-c2, and back)]

The latency between reception of an OFDM symbol and its processing increases.
Processing cannot keep up with frequent reconfiguration.


Summary

A strategy for designing predictable systems running dynamic applications:
Scenarios capture dynamic (application and system) behavior
Resource- and energy-efficient implementations
Predictable implementations

The SADF model of computation:
Provides many analysis techniques
Provides an implementation trajectory

Analysis and implementation techniques are implemented in the SDF3 tool kit: www.es.ele.tue.nl/sdf3


Methodology and Tools for Design of Energy Efficient Multi-Core Chips

Nagu Dhanwada
IBM Electronic Design Automation, Systems and Technology Group


Outline

Challenges in Multi-Core Chip Design
Reference Power Aware Design Methodology for Multi-Core Chips
Tools and Use Cases
  Early Analysis Tool for Multi-Core Designs
  Power Management Exploration and Design
  Architecture and Algorithm Exploration in Embedded Systems


Introduction: Multi-Core Chip Design Challenges

Time to Market
  IP Integration
  Heterogeneity
  Validation
Performance
  Cache Coherency
Energy Efficiency
  Complex Power Management
Productivity and Quality
  Design with third-party IP
  Designing with potentially unreliable components


Introduction: Multi-Core Chip Design Challenges

Complex power management for energy efficiency:
  Global dynamic voltage and frequency scaling
  Power capping
  Guard-band reduction
  Power throttling
  Power budgeting


What do we need?

Standards-based power-aware system-to-silicon flows
Tools within these flows supporting early analysis and design


Reference Power Aware Design Methodology for Multi-Core Chips


System-to-Silicon Power Aware Design Flow

[Figure: the flow runs from ESL design (design & mapping, analysis & optimization, validation) through RTL design (design & integration, analysis & optimization, verification & test) to implementation (optimization & closure, analysis, verification & test); power models and power intent content span the entire flow. Annotations: power models have no standard today; power intent is supported by Liberty today via the LPC contribution to LTAB]

A comprehensive end-to-end low-power design flow, with high-level power modeling and power models that span the entire flow.


ESL Design Phase

[Figure: from an initial application specification, a high-level hardware architecture (function + communication architecture) and a software specification feed ESL co-design and mapping; ESL hardware design & optimization and ESL embedded software design then produce a high-level hardware/software integration and an embedded software image, with analysis on one side of the flow and validation on the other]


RTL Design Phase

[Figure: design and integration: IP and chip specifications drive module coding and chip integration (global clock gating, power domains, DFT), producing module-level and chip-level power formats; analysis and optimization iterate over the chip specification against power constraints]


Implementation Phase

[Figure: partitioned RTL, power constraints, tool directives, the library, and physical data feed design optimization and closure: synthesis, floorplan, placement, clock tree, power optimization & closure, and route, with analysis steps including power-rule verification and IR analysis]


ESL Design Phase: Analysis and Optimization

[Figure: workloads/stimulus (benchmark programs/traces, random stimulus generators) and architecture parameters (initial configuration) drive a SystemC simulation of the ESL design description; power calculation combines power models with the power intent format to produce power reports, which feed optimization and refinement and, ultimately, RTL design and simulation]


RTL Design Phase: Analysis and Optimization

[Figure: the same loop at RTL: a VHDL/Verilog simulator runs the RTL description; power calculation with power models and the power intent format produces power reports, feeding optimization and refinement and implementation-level power models]




Power Intent Formats: Overview

Functional design description: assumes power is always on, running at a constant voltage.
Power intent formats: capture the variation of power over time (the power management specification).
Power intent + functional description = a representation of an actively power-managed design.


Power Intent Formats: Overview

Structural aspects:
  Interaction between design elements having time-varying power characteristics
  State restoration, re-initialization
  Examples: power domains, switches, level shifters, isolation cells, state retention logic, power control signals

Behavioral aspects:
  Effect of power variation on the computation model
  Enumerate the state of the simulation model for each set of design elements driven by the same voltage


Power Intent Specification Example

[Figure: an image subsystem: a switchable core (bayer → jpeg) with core read/write transactors in power domain PD1 (voltage scaling, 0.8-1.0 V), and a slave interface, memory-mapped register bank, master read/write interfaces, and DMA read/write control in PD2 (switchable, 1.0 V)]

set_design img_subsys
create_power_domain -name PD1 -instances {img_core core_read core_write} -default
create_power_domain -name PD2 -instances {slave_intf master_read master_write}
create_nominal_condition -name standby -voltage 0.8 -state standby
create_nominal_condition -name low -voltage 0.9 -state on
create_nominal_condition -name high -voltage 1.0 -state on
create_nominal_condition -name off -voltage 0 -state off
create_mode -name full_speed -conditions {PD1@high && PD2@high}
create_mode -name low_speed -conditions {PD1@low && PD2@high}
create_mode -name sleep -conditions {PD1@standby && PD2@high}
create_mode -name core_off -conditions {PD1@off && PD2@high}
create_mode -name all_off -conditions {PD1@off && PD2@off}
end_design

Slide courtesy: Si2 LPC Format Working Group




Power Modeling of Complex IP: Issues

Current standards (Liberty) are oriented toward small IP:
  Can include complete states & transitions
  Dependence on input slew and output load

Complex IP:
  Several inputs, states, power modes, and internal parts
  Much internal power dissipation
  Completeness of the model is key to enabling high-level design
  Adequate parameterization to capture sensitivity to design decisions

Power states / modes may be:
  Intentionally designed, e.g. power modes in power intent specifications, power domain switching
  A result of behavior / activity, e.g. clock gating, CPF functional modes


Power Modeling of Complex IP: Issues

State power modeling challenges:
  Exponential state explosion
  Requires non-mutually-exclusive power states
  Being addressed in current standards

Transition energy modeling challenges:
  Exponential state-transition explosion
  Separation of internal transition energy from pin power
  Transitions between mutually exclusive states

Proposals for efficient and uniform modeling of complex IP within current standards are being developed in the Si2 Low Power Coalition.


Tools and Use Cases




SLATE: Early Analysis Tool for Multi-Core Designs

SLATE (System-Level Analysis Tool for Early Exploration): a tool for early power, performance, physical, and thermal characterization of multi-core designs.

[Figure: SLATE's graphic front-end around a chip block diagram (AXU/C3/L2 core clusters with L3, MC, and IO on an ASIC); surrounding analyses include performance/functional models, interconnect analysis (utilization versus number of connections), power analysis (power in watts versus packet size, for CPU and Rx with and without Tahoe accelerators), chip floorplanning, thermal analysis, chip integration, implementation, import of 3rd-party IP, and industry-standard models]


Use Scenarios

Power management design exploration in high-performance servers:
  the early-analysis system configured to use trace-driven performance models for POWER4-based processor cores

Architecture and algorithm exploration in embedded systems:
  the early-analysis system configured to run embedded PowerPC and CoreConnect models in execution-driven mode, running real software


Power Management Design Exploration: ESL Analysis Use Scenario

[Figure: a system simulation model built from performance simulation models (processor model, arbiter, bus, PLB master, PLB slave (HSMC), PLB_OPB and OPB_PLB bridges, OPB arbiter, UIC, OPB master and slave) plus power models, configured by an architecture configuration, a power management specification, and power and clock domain definitions; the power management unit (algorithms, control modes) issues Vdd and frequency changes, and the resulting performance and power numbers drive power management policy optimization (manual or tool-based)]


Power Management Studies in SLATE

Per-core vs. chip-wide DVFS; fetch throttling (I-cache throttling).

Parameterizable: frequency-change penalty; number of discrete V/F levels, or continuous mode.
Easy to change the PMU algorithm; easy to try different configurations.

[Figure: a PMU on a PMU bus controls cores core0-core3, each with its own clock (clk0-clk3) from a multiple-clock generator; the PMU interface includes set_mode(), set_frequency(), set_throttling_factor(), get_cpi(), get_temperature(), get_num_commits(), get_num_decodes(), and get_power_modes()]


Power Management Studies: DVFS Algorithms<br />

Discrete MaxBIPS<br />

Assumes a set of discrete power modes (Vdd–frequency pairs) for each core, which the Power Manager can control individually.<br />

Goal: maximize overall chip performance under a given power budget.<br />

Chip performance: total number of completed instructions by all cores per time period.<br />

Continuous Approach (CPM)<br />

Non-linear programming to model the same DVFS problem with continuous power modes.
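As a concrete illustration of the discrete formulation, the per-core mode selection can be sketched as an exhaustive search over (power, performance) modes. This is a minimal sketch with hypothetical mode tables, not the MaxBIPS implementation used in SLATE:

```python
from itertools import product

def maxbips(modes, budget):
    """Pick one (power, BIPS) mode per core so that total power stays
    within the budget and total BIPS is maximized. Exhaustive search,
    feasible for a handful of cores and modes."""
    best = None
    for combo in product(*modes):
        power = sum(p for p, _ in combo)
        bips = sum(b for _, b in combo)
        if power <= budget and (best is None or bips > best[0]):
            best = (bips, combo)
    return best

# two cores with three hypothetical (power, BIPS) modes each
modes = [[(10, 100), (6, 80), (3, 50)],
         [(10, 120), (6, 90), (3, 60)]]
```

With a budget of 12 the search settles on the middle mode for both cores; the continuous (CPM) variant replaces this search with a non-linear program over continuous Vdd–frequency settings.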


[Figure: "Chip Performance for Decreasing Power Budgets" — relative chip performance (80–100) plotted against power budgets shrinking from 100 to 45, for four schemes: chip-wide discrete, chip-wide continuous, per-core continuous, and per-core discrete.]


Individual Core Performance under Per-Core DVFS<br />

[Figure: relative performance (80–100) of individual benchmarks (eon, eon-c, twolf, twolf-c, perl, perl-c) against power budgets shrinking from 100 to 45 under per-core DVFS.]


Use Scenario 2: Example Embedded System<br />

Embedded PowerPC 4xx & CoreConnect IP cores, RISCWatch debugger, enabled for the GCC tool chain<br />

[Figure: platform block diagram — a 405 CPU and MAL on a 64-bit PLB with PLB arbiter, HSMC, EBC, DMA, and a PLB-OPB bridge to a 32-bit OPB (OPB arbiter, GPIO, UART0, UART1, IIC, GPT); a private OPB hosts the UIC; an EMAC with RX/TX FIFOs provides Ethernet; clocks and resets come from a CLOCK/RESET generator, PLL, and PM block.]


Architecture Exploration: Ethernet Packet Processing<br />

PPC405 platform-based design; Ethernet subsystem: 1 EMAC, 1 Madmal<br />

Real embedded application executing on TLM models<br />

Measure effects on performance and power<br />

Change to improve performance: added an extra EMAC + FIFOs<br />

Two-EMAC mode of the application works with packets being transmitted from one EMAC and received by the other<br />

[Figure: platform diagram with two EMACs (each with RX/TX FIFOs) on the PLB/OPB fabric, and three charts versus packet size (64–4096 bytes) comparing the 1-EMAC and 2-EMAC configurations: system throughput (KBytes/sec, up to ~25000), CPU utilization (% of total time, up to ~100), and power (mW, ~100–350).]


Architecture Exploration in Embedded Systems<br />

Speed-up with order-16 matrix, data set 1:<br />

# Cores | Execution Time | Speed-up | Efficiency (%)<br />

1 | 520573 | 1 | 100<br />

2 | 2632011 | 1.977 | 98.89<br />

4 | 1338610 | 3.889 | 97.22<br />

6 | 9178616 | 5.671 | 94.53<br />

8 | 695988 | 7.479 | 93.49<br />

Speed-up with order-16 matrix, data set 2:<br />

# Cores | Execution Time | Speed-up | Efficiency (%)<br />

1 | 17648256 | 1 | 100<br />

2 | 8886615 | 1.986 | 98.89<br />

4 | 4488740 | 3.931 | 97.22<br />

6 | 3023774 | 5.836 | 94.53<br />

8 | 2293580 | 7.695 | 93.49<br />

[Figure: speed-up (0–8) versus number of processor cores (0–9), with and without on-chip memory.]


Algorithm Exploration in Embedded Systems<br />

Multicore matrix multiplication using a parallel algorithm with caches turned on.<br />

Using an eight-core CoreConnect-based SOC (Data-Cache 4KB, Inst.-Cache 32KB).<br />

[Figure: two charts of time in ns versus number of processor cores (0–9) — order-16 matrix multiplication (up to ~6,000,000 ns) and order-32 matrix multiplication (up to ~45,000,000 ns), each showing execution time, duration, and idle time.]


Load Balancing across Cores in a Multi-core SOC<br />

[Figure: efficiency (%) of the system as the order of the matrix (8, 16, 32) and the number of processor cores (0–10) in the SOC vary.]<br />

Load breakup among the various processor cores in the design, in percentage: CPU0 12%, CPU1 12%, CPU2 12%, CPU3 12%, CPU4 12%, CPU5 14%, CPU6 13%, CPU7 13%.<br />

Activity comparison among the various processor cores; can be used for exploring load-balancing strategies.


Accuracy of Transaction Level Models<br />

Comparisons between simulated models and real hardware demonstrate the accuracy of transaction-level models for early analysis and design space exploration.<br />

Errors below 15% in timing accuracy.<br />

Errors below 11% in power estimation.


Summary<br />

Standards-based power-aware flows are key for achieving energy-efficient multi-core designs<br />

Flows need common power models and power intent descriptions across levels of abstraction<br />

Integrated pre-RTL analysis and exploration in power-aware flows is needed for efficient design of advanced system architectures


Acknowledgements<br />

John Darringer, David Hathaway, Arun Joseph,<br />

Jerry Frenkil, Rhett Davis, Qi Wang


References<br />

N. Dhanwada, R. Bergamaschi, W. Dungan, I. Nair, P. Gramann, W. Dougherty, I. Lin, "Transaction-level modeling for architectural and power analysis of PowerPC and CoreConnect-based systems", Des Autom Embed Syst (2006) 10:105–125.<br />

Si2 Low Power Coalition, "Si2 High Level Power Modeling Requirements", Jun. 2011. [Online] http://si2.org/openeda.si2.org/project/showfiles.php?group_id=76#p115v1.2<br />

R. Bergamaschi, I. Nair, G. Dittmann, H. Patel, G. Janssen, N. Dhanwada, A. Buyuktosunoglu, E. Acar, G. Nam, G. Han, D. Kucar, P. Bose, J. Darringer, "Performance Modeling for Early Analysis of Multi-Core Systems", Proceedings of CODES+ISSS 2007.<br />

R. Bergamaschi, G. Han, A. Buyuktosunoglu, H. Patel, I. Nair, G. Dittmann, G. Janssen, N. Dhanwada, Z. Hu, P. Bose, J. Darringer, "Exploring Power Management in Multi-Core Systems", Proceedings of ASP-DAC 2008.


Dynamic Behavior Specification and<br />

Dynamic Mapping for Real-time<br />

Embedded Systems in HOPES<br />

Nov. 8, 2012<br />

Soonhoi Ha, w/ Hanwoong Jung and Chanhee Lee<br />

Seoul National University<br />

1


Contents<br />

1. Introduction<br />

2. Dynamic Behavior Specification in HOPES<br />

3. Self-Adaptive Mapping<br />

4. Preliminary Experimental Results<br />

5. Discussion and Conclusion<br />

2 HOPES project, SNU


Parallel Embedded SW Design Challenge<br />

Target-independent parallel programming for non-trivial heterogeneous systems with diverse design constraints (time, power, temperature, cost, and so on)<br />

Problem: model-based design of parallel embedded systems<br />

• Parallelism extraction (multi-mode multi-tasking apps.)<br />

• Functional parallelism & data parallelism<br />

• Partitioning and mapping<br />

• Parallel code generation<br />

• Performance estimation and verification<br />

• Design space exploration


Key Idea<br />

Programming platform<br />

• Meet-in-the-middle approach<br />

• Role of the "execution model"<br />

[Figure: applications are mapped onto the software and hardware platforms either by manual design or, through the programming platform (CIC), by model-based design.]


HOPES Design Flow<br />

[Figure: design flow — specifications in a dataflow model, UML, or KPN are translated automatically (or written manually) into the Common Intermediate Code: task codes (algorithm) plus an XML file (architecture). Task mapping, guided by performance libraries/constraints and static analysis, feeds CIC translation, which generates C code for various targets and a virtual prototyping system.]


CIC (Common Intermediate Code)<br />

Basically an actor-oriented model (or extended dataflow model)<br />

• Execution model of a parallel architecture<br />

• Defines the semantics for task scheduling and task interaction<br />

OS-level task model<br />

• Large granularity – thread/function (atomic mapping unit)<br />

• Implicitly assumes the existence of an OS or a run-time system<br />

[Figure: CIC task codes for tasks T1–T4 with a control task; channel types are FIFO or array, where an array channel models a shared memory with indexed slots. The algorithm model captures available parallelism; architecture information, mapping, control, and profile data complete the specification.]


CIC task model<br />

3 types of tasks<br />

• Computation task: data parallelism is expressed<br />

• Control task: defines the execution mode of computation tasks<br />

• Library task: expresses vertically-layered or server-client SW<br />

Execution semantics<br />

• Time-driven or data-driven<br />

3 types of ports<br />

• Data port: communicate messages between CIC tasks<br />

• System port: communication with the OS or run-time system<br />

• Library port: call a library task<br />

Channel semantics<br />

• FIFO channel<br />

• Array channel: indexed access for data-parallel execution<br />

• Buffer channel


CIC Task Code: Definition<br />

A CIC task is defined by three methods<br />

• TASK_INIT: before the main loop<br />

• TASK_GO: in the main loop<br />

• TASK_WRAPUP: after the main loop<br />

Uses generic APIs for target independence<br />

TASK_INIT { /* task initialization code */ };<br />

TASK_GO {<br />

MQ_RECEIVE("mq0", (char *)(ld_106->rdbfr), 2048);<br />

...<br />

//task_body()<br />

MQ_SEND("output", (char *)(st_107->buf), 2048);<br />

}<br />

TASK_WRAPUP { /* task wrapup code */ };


CIC Translation<br />

CIC to multi-thread codes for functional simulation<br />

• Generated codes are run on a host machine<br />

CIC to target C codes<br />

• Target-specific code generation: for virtual prototyping, for MPCore, for the Cell processor, for GPGPU; [planned] DSP array, reconfigurable hardware<br />

• Per-processor code generation based on mapping information<br />

• Multi-threaded task codes<br />

• Interface code generation<br />

• Scheduler code generation<br />

[Figure: target architectures — SMP or heterogeneous cores, a DSP array, HW IP, and reconfigurable HW connected by a communication network.]


Challenges<br />

Lane detection algorithm on GPU<br />

• CPU+GPU heterogeneous platform<br />

• Multicore CPU: multithreading<br />

• Support multiple GPUs<br />

• CIC translation with asynchronous communication between CPU and GPU<br />

[Figure: lane detection pipeline — Load Image, YUV to RGB, Gaussian, Sobel, denoising filters (KNN, NLM), Non-Maximum Suppression, Blending, Sharpen, lane detection filters (Hough Transform), Draw Lane, Merge, RGB to YUV, Store Image.]


Experimental Results<br />

Processor | Tasks<br />

CPU | LoadImage, Draw Lane, StoreImage<br />

GPU 0 | YUVtoRGB, Gaussian, Sobel, Non-Maximum, Hough, Merge<br />

GPU 1 | KNN, NLM, Blending, Sharpen, RGBtoYUV<br />

Configuration | Time<br />

CPU | 2109.5 sec<br />

1 GPU, Sync | 15.0266 sec<br />

1 GPU, Async with 2 streams | 11.9998 sec<br />

1 GPU, Async with 3 streams | 12.3378 sec<br />

1 GPU, Async with 4 streams | 12.0846 sec<br />

2 GPUs, Sync | 11.332 sec<br />

2 GPUs, Async with 2 streams | 10.247 sec<br />

2 GPUs, Async with 3 streams | 9.7842 sec<br />

2 GPUs, Async with 4 streams | 9.7798 sec


Contents<br />

1. Introduction<br />

2. Dynamic Behavior Specification in HOPES<br />

3. Self-Adaptive Mapping<br />

4. Preliminary Experimental Results<br />

5. Discussion and Conclusion


Dynamic Behavior<br />

At the system level<br />

• The set of user tasks running concurrently may change (user demand)<br />

At the application level<br />

• An algorithm may have multiple modes of operation<br />

• Execution times of tasks may vary<br />

At the OS level<br />

• QoS requirement changes the mode of operation<br />

At the hardware level<br />

• Unpredictable resource availability<br />

• Temporary or permanent failure of processing elements


Two-level Specification in HOPES<br />

At the top level<br />

• A control task manages the execution state of computation tasks<br />

• Each mode of operation (or use case) is defined by a set of CIC tasks that run concurrently.<br />

• The mode of operation may change dynamically.<br />

• The control task specifies the mode change by an FSM.<br />

At the task level<br />

• A CIC task may have an SADF (scenario-aware dataflow) graph inside.<br />

• The behavior of a task may change dynamically.<br />

• Finite number of scenarios of operation<br />

• Each scenario is specified by an SDF graph


Control Task<br />

Dynamic behavior modeling<br />

• A control task can control the execution of computation tasks by using predefined control APIs<br />

• Triggered by data inputs from computation tasks (or by checking task state)<br />

• Sends control messages to the OS via a system port<br />

• Similar to a statechart in STATEMATE or an fFSM in PeaCE<br />

[Figure: CIC computation tasks supervised by a CIC control task.]


Internal Specification of a Control Task<br />

Internal behavior is specified with an FSM model<br />

• Assumes an implicit timer in the system, which may generate real-time events<br />

• Code template is automatically generated


Code Example<br />

while(1){<br />

MQ_AVAILABLE(all_ports); // 1-1. Check the existence of a new event<br />

SYS_REQ(CHECK_TASK_STATE, "task_name", ...); // 1-2. Check the termination of a task<br />

if(available) MQ_RECEIVE(selected_port); // 2. Read the new event<br />

if(some event or task state is triggered) break; // 3. Break the loop to make a transition<br />

}<br />

switch( current_state )<br />

{<br />

case ID_STATE_S1:<br />

if(selected_port==1 && input_data==2){ // 4. Check the transition condition<br />

current_state = ID_STATE_S2;<br />

SYS_REQ(SET_PARAM_INT, "FloatGroup", "FloatVar", input_data, 0, 0);<br />

} // 5. Send the control message through the system port<br />

break;<br />

case ID_STATE_S2:<br />

if(...){<br />

...<br />

}<br />

...<br />

}


PC + NXT Robot Example<br />

Control the NXT robot from both a PC and the robot itself.<br />

1. The SensorDetect task reads sensor values and sends them to two control tasks: ControlPC and ControlNXT.<br />

2. The KeyDetect task reads the key input value and sends it to the ControlPC task.<br />

3. Controlled by the ControlPC and ControlNXT tasks, the Move task and Grab task run the motors.<br />

4. The LCD task displays the current status of the NXT.<br />

[Figure: task graph — KeyDetect and SensorDetect feed ControlPC and ControlNXT, which drive Move, Grab, and LCD.]


PC + NXT Robot Example<br />

Control NXT Task<br />

1. The ControlNXT task controls the NXT robot itself.<br />

2. The ControlNXT task includes several scenarios for the robot (decisions of the ControlNXT task).<br />

Condition> the NXT robot senses a black line on the floor<br />

1) The robot stops immediately.<br />

2) After 3 seconds,<br />

2-1) if the current motion of the robot is forward, the robot starts to go backward.<br />

2-2) if the current motion of the robot is backward, the robot starts to go forward.<br />

3) After 2 seconds from the above action, the robot stops and starts to spin.<br />

Condition> the NXT robot hears a loud sound<br />

1) The robot immediately folds/unfolds its arm.<br />

1-1) if the arm is currently folded, the robot unfolds its arm.<br />

1-2) if the arm is currently unfolded, the robot folds its arm.<br />


PC + NXT Robot Example<br />

Control NXT Task<br />

1. The ControlNXT task is specified in an FSM manner.<br />

TASK_GO {<br />

switch(stateLight) {<br />

case 0: //INIT<br />

break;<br />

…<br />

case 4: // BACKWARD<br />

SYS_REQ(SET_PARAM_INT, "Move","motion",BACKWARD,id1,0);<br />

SYS_REQ(RUN_TASK, "Move",id1,0);<br />

set_time_base = SYS_REQ(GET_CURRENT_TIME_BASE);<br />

timer_id1 = SYS_REQ(SET_TIMER, set_time_base, 2);<br />

stateLight = 6;<br />

break;<br />

case 5: // FORWARD<br />

SYS_REQ(SET_PARAM_INT, "Move","motion",FORWARD,id1,0);<br />

SYS_REQ(RUN_TASK, "Move",id1,0);<br />

set_time_base = SYS_REQ(GET_CURRENT_TIME_BASE);<br />

timer_id1 = SYS_REQ(SET_TIMER, set_time_base, 2);<br />

stateLight = 6;<br />

break;<br />

…<br />

}<br />

switch(stateSound) {<br />

case 0:<br />

if( AVAILABLE(port_sound) ) {<br />

prev_sound = sound_val;<br />

BUF_RECEIVE(port_sound, &sound_val, sizeof(U16));<br />

if(prev_sound >= 400 && sound_val < 400) stateSound = 1;<br />

}<br />

break;<br />

…<br />

}<br />

}


PC + NXT Robot Example<br />

Processing control commands<br />

1. Each control command includes time information:<br />

unsigned int time_base = SYS_REQ(GET_CURRENT_TIME_BASE);<br />

SYS_REQ(SET_PARAM_INT, task_name, param_name, param_value, time_base, time_offset);<br />

2. An internal control scheduler processes control commands from control tasks based on the time information of each command (time_base + time_offset).<br />

[Figure: control tasks 1–3 send commands into a command queue sorted by time information; the control scheduler pops and executes the due command.]
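A minimal sketch of such a time-sorted command queue follows (illustrative Python, not the HOPES run-time; the class and method names are assumptions):

```python
import heapq

class ControlScheduler:
    """Commands become due at time_base + time_offset and are executed
    in due-time order; ties are served in arrival order."""
    def __init__(self):
        self._q = []
        self._seq = 0  # tie-breaker: preserves insertion order for equal times

    def send(self, time_base, time_offset, command):
        heapq.heappush(self._q, (time_base + time_offset, self._seq, command))
        self._seq += 1

    def run_due(self, now):
        """Pop and return all commands whose due time has been reached."""
        done = []
        while self._q and self._q[0][0] <= now:
            done.append(heapq.heappop(self._q)[2])
        return done
```

The heap keeps the earliest-due command at the front, so the scheduler never scans the whole queue per tick.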


SADF Specification in HOPES<br />

An SADF subgraph is regarded as a computation task from the outside<br />

• A control task can control the execution status of the entire subgraph<br />

• It resembles the hierarchical model composition in Ptolemy<br />

• For each scenario, the application is specified by a decidable dataflow graph (for now, we use an SDF graph)<br />

[Figure: the IntGroup task contains an SADF subgraph.]


SADF Subgraph<br />

Motivation<br />

• For static analysis, we may want to use SDF or its extended model as much as possible in application specification.<br />

• We explicitly specify whether a CIC subgraph is an SADF graph.<br />

SADF (scenario-aware dataflow) model<br />

• Assumes that the number of scenarios is finite.<br />

• It is associated with an MTM (Mode Transition Machine).<br />

• Each task has a different definition for each mode of operation.<br />

• Sample rates can be changed.


MTM Specification<br />



CIC Task in an SADF<br />

Its behavior depends on the mode<br />

• Sample rates<br />

• Task body<br />

By default, it is an SDF task


Mode Transition in SADF<br />

API for mode transition request<br />

• SYS_REQ(SET_MTM_PARAM_INT, Task Name, Var Name, Value, 0, 0)<br />

• The controller task calls this API to change the mode of an SADF<br />

• The argument variable of the MTM is changed immediately when the API is called<br />

Mode transition mechanism<br />

• Mode transition occurs at the iteration boundary by default. Immediate transition can be enforced.<br />

• At the start of each iteration, a task checks the current mode and changes its sample rates and function body.<br />

• Note that each task knows how many times it should be fired in each iteration via static scheduling performed for each mode at compile time.
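The iteration-boundary mechanism can be sketched as follows (illustrative Python, not the CIC run-time; `current_mode`, `bodies`, and `reps` are assumed names):

```python
class SadfTask:
    """A task re-reads the MTM mode once per graph iteration and switches
    its firing count (sample rates) and function body accordingly."""
    def __init__(self, mtm, bodies, reps):
        self.mtm = mtm        # object exposing current_mode()
        self.bodies = bodies  # mode -> task body (function)
        self.reps = reps      # mode -> firings per iteration (static schedule)

    def run_iteration(self):
        mode = self.mtm.current_mode()  # checked at the iteration boundary only
        return [self.bodies[mode]() for _ in range(self.reps[mode])]

class DemoMtm:
    """Stand-in for the MTM: holds the current mode set by a control task."""
    def __init__(self):
        self.mode = "S1"
    def current_mode(self):
        return self.mode
```

A mode change requested mid-iteration therefore takes effect at the next iteration, matching the default transition semantics above.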


A Toy Example<br />

[Figure: intGen_1 and intGen_2 feed the IntGroup subgraph (actors A and B) into Mix and I_Display; FloatGen feeds the FloatGroup subgraph (actor C) into F_Display; a Counter drives a Control task with states S1 and S2.]


A Toy Example – cont'd<br />

IntGroup<br />

• The IntGroup task has two modes.<br />

• S1 mode: mix two 8-digit integers.<br />

num_1: xxxxxxxx, num_2 = yyyyyyyy -> Result = xxxxyyyy<br />

• S2 mode: mix four 8-digit integers.<br />

num_1: xxxxxxxx, num_2 = yyyyyyyy, num_3 = zzzzzzzz, num_4 = wwwwwwww -> Result = xxzzyyww<br />

MTM<br />

Mode: S1, S2; Variable: IntVar<br />

currentState | Condition | nextState<br />

S1 | IntVar == 2 | S2<br />

S2 | IntVar == 1 | S1<br />

• The IntGen_1 task sets the mode value (IntVar) in the MTM depending on a randomly generated integer:<br />

if(rand_1 % 3 == 0)<br />

SYS_REQ(SET_MTM_PARAM_INT, "IntGen_1", "IntVar", 2, 0, 0);<br />

• The mode transition occurs internally at the iteration boundary of the SADF graph, handled by the scheduler.<br />

[Figure: the IntGroup subgraph with mode-dependent rates — intGen_1 and intGen_2 each produce 1 token; A and B use rate 1 in S1 and rate 2 in S2; Mix feeds I_Display.]


Contents<br />

1. Introduction<br />

2. Dynamic Behavior Specification in HOPES<br />

3. Self-Adaptive Mapping<br />

4. Preliminary Experimental Results<br />

5. Discussion and Conclusion


Target Architecture<br />

Heterogeneous architecture<br />

• Multicore control processor<br />

• Many-core accelerator: processor arrays<br />

[Figure: CARD architecture — C (SMP control cores with memory and interface), A (many-core accelerator: a NoC of processor tiles with local memories, shared memory modules, an input queue, and I/O), R (reconfigurable HW), and D (HW IPs IP1, IP2), plus an external memory interface.]


Many-core Accelerator<br />

Tile-based NoC architecture<br />

• Homogeneous processors<br />

• Processor tiles + memory tiles<br />

• Distributed shared memory<br />

• Some assumptions for experiments: task code is stored in a shared memory tile; mesh architecture<br />

SPM-based architecture<br />

• Local memory size is given (hundreds of kilobytes at best)<br />

Central Manager (CM)<br />

• Maps the tasks onto tiles dynamically<br />

• Moves the task code to the processor tile if needed


Problem Statement<br />

Input<br />

• HOPES specification<br />

• Dynamic behavior is specified with control tasks + SADF subgraphs<br />

• Requests for system status change are delivered to the CM<br />

• Objects of task mapping:<br />

• Computation tasks with internal parallelism (considered in the future)<br />

• Tasks in SADF subgraphs<br />

Constraint<br />

• Each CIC task has a throughput (or latency) constraint<br />

Problem<br />

• How to map the tasks dynamically while satisfying the real-time constraints?<br />

• Maximize the aggregate throughput surplus (ATS)


Self-Adaptive Mapping Technique<br />

Hybrid Technique: The Basic Idea<br />

• Compile-time mapping of SADF subgraphs (& CIC tasks)<br />

• For a varying number of processors, based on the WCET of each task, perform scheduling to find the real-time performance<br />

• Store the mapping information into the shared memory: a set of (number of processors, {task mapping info.}, throughput) tuples for each mode<br />

• From the minimum # of processors that satisfies the throughput constraint<br />

• To the maximum # of processors beyond which no performance improvement is obtained<br />

• Run-time mapping by the CM (Central Manager)<br />

• Allocate the processors to each task running concurrently<br />

• Bind the (virtual) processors to physical processors<br />

• Map the tasks according to the stored mapping information


Self-Adaptive Mapping Procedure<br />

[Figure: flowchart — compile-time analysis takes the task graphs and WCET profile as input, performs per-task analysis, and stores throughput-maximized mappings for various numbers of processors. At run time, a system status change (task arrival/finish/operation mode change) triggers an initial task-to-virtual-processor mapping from the virtual processor pool; if the mapping fails, a task is dropped from the active task set and mapping is retried; otherwise virtual-processor-to-physical-processor binding completes the mapping.]


A Simple Example<br />

Map the following 4 task graphs onto a 3x3 NoC<br />

Throughput constraint: 1/130<br />

[Figure: four task graphs A (A1–A3), B (B1–B4), C (C1–C3), and D (D1–D2) with node execution times between 25 and 80; candidate static schedules on 1–3 virtual processors achieve throughputs of 1/60, 1/70, 1/80, 1/90, 1/95, and 1/120 depending on the processor count.]


Run-time Mapping<br />

Objective: maximize the aggregate throughput surplus (ATS)<br />

• m(T): current execution mode of T<br />

• th(.): throughput obtained from the static mapping result<br />

• Th_min: throughput constraint of T<br />

• V(T): number of processors allocated to task graph T<br />
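The ATS formula itself did not survive extraction; from the symbols defined above, a plausible reconstruction (an assumption, not necessarily the authors' exact expression) is:

```latex
\mathrm{ATS} \;=\; \sum_{T \,\in\, \text{active tasks}} \Bigl( th\bigl(m(T),\, V(T)\bigr) \;-\; Th_{\min}(T) \Bigr)
```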

Meaning<br />

• If the throughput surplus is large, we may lower the power consumption of the allocated processor tiles<br />

Two-level Mapping<br />

• Node-to-(Virtual) Processor Mapping<br />

• VP-to-PP (Physical Processor) Binding


Node-to-VP Mapping<br />

Objective<br />

• Allocate virtual processors to the active tasks<br />

• Input: 1) active tasks, 2) results of the compile-time analysis, 3) set of virtual processors<br />

• Output: mapping of tasks to virtual processors<br />

Key ideas<br />

• Allocate the minimum number of processors for each task to satisfy its constraints.<br />

• Allocate the remaining processors so as to maximize ATS.<br />

Time complexity<br />

• O(PA^2), where P is the number of processors and A is the number of tasks to be mapped. ATS computation takes O(A).
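The two key ideas can be sketched as a greedy allocator (illustrative Python under assumed data shapes — per task, a throughput constraint plus a table from processor count to statically analyzed throughput; not the HOPES implementation):

```python
def allocate(tasks, total):
    """tasks: {name: (th_min, table)} where table maps a processor count
    to the throughput from compile-time analysis.
    Returns {name: n_procs} or None if the constraints cannot be met."""
    alloc = {}
    # step 1: minimum processors per task that satisfy its constraint
    for name, (th_min, table) in tasks.items():
        feasible = [n for n in sorted(table) if table[n] >= th_min]
        if not feasible:
            return None
        alloc[name] = feasible[0]
    left = total - sum(alloc.values())
    if left < 0:
        return None
    # step 2: hand out remaining processors to the largest throughput gain
    for _ in range(left):
        best, gain = None, 0
        for name, (th_min, table) in tasks.items():
            n = alloc[name]
            if n + 1 in table and table[n + 1] - table[n] > gain:
                best, gain = name, table[n + 1] - table[n]
        if best is None:
            break
        alloc[best] += 1
    return alloc

# hypothetical profiles: throughput (scaled integers) per processor count
tasks = {"A": (10, {1: 8, 2: 12, 3: 15}), "B": (5, {1: 6, 2: 9})}
```

Each extra processor goes to the task whose statically profiled throughput gain is largest, which is what maximizing ATS reduces to once all constraints are met.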


VP to PP Binding<br />

Objective<br />

• Determine the tile positions of virtual processors in the NoC, minimizing the overall communication cost<br />

• Input: 1) set of virtual processors, 2) list of unmapped physical processors (tiles)<br />

• Output: mapping of virtual to physical processors<br />

Key ideas<br />

• Select the next virtual processor to be mapped as the one that has the largest communication volume to the last-mapped virtual processor<br />

• Select the next physical processor that minimizes the sum of MD (Manhattan Distance) from the already-bound physical processors<br />

Time complexity<br />

• O(P^2)
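The two selection rules can be sketched greedily (illustrative Python on an assumed mesh coordinate model; not the HOPES implementation):

```python
def bind(vps, comm, tiles):
    """vps: list of virtual processors; comm[(a, b)]: communication volume
    between VPs a and b; tiles: list of free (x, y) mesh coordinates.
    Returns {vp: tile}."""
    # rule 1: order VPs by communication volume to the last-mapped VP
    order = [vps[0]]
    rest = set(vps[1:])
    while rest:
        last = order[-1]
        nxt = max(rest, key=lambda v: comm.get((last, v), 0) + comm.get((v, last), 0))
        order.append(nxt)
        rest.remove(nxt)
    # rule 2: place each VP on the free tile minimizing total Manhattan
    # distance to the tiles already bound
    placed, free = {}, list(tiles)
    for v in order:
        if not placed:
            tile = free[0]
        else:
            tile = min(free, key=lambda t: sum(abs(t[0] - p[0]) + abs(t[1] - p[1])
                                               for p in placed.values()))
        placed[v] = tile
        free.remove(tile)
    return placed

# hypothetical example: v0 talks mostly to v1, a little to v2
vps = ["v0", "v1", "v2"]
comm = {("v0", "v1"): 5, ("v0", "v2"): 1, ("v1", "v2"): 3}
tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Heavily communicating VPs therefore end up on adjacent tiles, which is the intent of the Manhattan-distance heuristic above.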


Experiments<br />

4 sets of application scenarios<br />

• Set 1: random graphs (G1 ~ G5) with 7x7 NoC<br />

• Set 2: random graphs (G1 ~ G10) with 11x11 NoC<br />

• Set 3: 4 real-life examples with 3x3 NoC<br />

• Set 4: 4 real-life examples with 4x4 NoC<br />

Task graph | # nodes | # virtual proc (min, max) | 1/throughput (min, max) | 1/throughput constraint | Node execution time (us)<br />

(random) | 10 | 4,7 | 1960,1420 | 2520 | [400,1600]<br />

(random) | 26 | 11,16 | 2510,2080 | 2830 | [200,1800]<br />

MPEG2 decoder | 14 | 3,5 | 4378,2562 | 5473 | [300,1954]<br />

MP3 decoder | 7 | 2,4 | 5233,3709 | 6541 | [382,3709]<br />

H263 decoder | 5 | 2,4 | 1143,577 | 1443 | [11,577]<br />

Beamformer | 3 | 1,2 | 1956,994 | 2445 | [481,962]


Throughput gain<br />

Comparison with a dynamic mapping technique, varying the CCR (communication-to-computation ratio)<br />

• The dynamic technique aims to minimize the communication overhead. It finds the minimum initiation interval of task graphs without violating the resource constraints (buffer size).<br />

Assumptions<br />

• Communication overhead is proportional to the hop distance<br />

• Run 100 iterations for each task graph<br />

[Figure: ATS (0–5) versus CCR (1%–4%) for the dynamic and proposed techniques on Sets 1 and 2 (left) and Sets 3 and 4 (right).]


Latency Comparison<br />
Baseline: the latency achieved by the proposed technique<br />
The dynamic approach has longer latency<br />
• It pays a huge code-migration cost.<br />
• Tasks with lower priorities may be delayed for too long.<br />

[Bar charts: latency (avg. per iteration, normalized to the proposed technique) vs. CCR (1%–4%) for Dynamic_Set1–2 and Dynamic_Set3–4]<br />



Communication Overhead<br />
The proposed technique minimizes the code-migration overhead<br />
• Node mapping is preserved unless the system status changes.<br />
• Virtual processors (VPs) are bound to physical processors (PPs) so as to minimize the communication overhead.<br />

Overhead (us)  | Technique | CCR 1% | 2%    | 3%    | 4%<br />
Communication  | Proposed  | 277    | 568   | 864   | 1157<br />
Communication  | Dynamic   | 441    | 888   | 1334  | 1812<br />
Code migration | Proposed  | 230    | 472   | 715   | 959<br />
Code migration | Dynamic   | 20037  | 40620 | 56700 | 76794<br />



Future Work: Processor Sharing<br />
Rationale<br />
• Static mapping may underutilize some processors<br />
• Underutilized processors can be shared between multiple tasks<br />
Determining the “sharable” processors<br />
• Make a pessimistic assumption: the task execution time on a sharable processor is lengthened by the sharing degree<br />
• If the resulting throughput degradation is no worse than removing one processor, the processor is regarded as “sharable”<br />
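The pessimistic sharability test can be sketched numerically. This is a hedged reading of the slide's rule, not the project's actual analysis: the function name, the interval arithmetic, and the interpretation of "lengthened by the sharing degree" (extra delay proportional to the work on the candidate processor) are assumptions.<br />

```python
def is_sharable(inv_throughput_with, inv_throughput_minus_one,
                node_time_on_proc, sharing_degree):
    """Pessimistic sharability test (illustrative sketch).

    inv_throughput_with: 1/throughput (us) with the full allocation.
    inv_throughput_minus_one: 1/throughput (us) with one processor removed.
    node_time_on_proc: per-iteration execution time (us) mapped onto the
        candidate processor.
    sharing_degree: number of tasks that would share the processor.
    """
    # Pessimistically assume the shared work is slowed by the sharing degree.
    slowdown = node_time_on_proc * (sharing_degree - 1)
    shared_interval = inv_throughput_with + slowdown
    # Sharable iff no worse than simply giving up one processor.
    return shared_interval <= inv_throughput_minus_one
```

With these assumed semantics, sharing is accepted only when the pessimistically slowed iteration interval still beats the (P-1)-processor schedule.<br />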



Some Mapping Results<br />

BN(Best Neighbor) Proposed w/o<br />

processor sharing<br />

A Simple Example<br />

Proposed w/<br />

processor sharing<br />



Contents<br />
1. Introduction<br />
2. Dynamic Behavior Specification in HOPES<br />
3. Self-Adaptive Mapping<br />
4. A Preliminary Experiment<br />
5. Discussion and Conclusion<br />



A Simple Smart-phone<br />
Applications (Use Cases)<br />
• Video Player: H.264 Decoder, MP3 Decoder<br />
• Music Player: MP3 Decoder<br />
• Video Phone: x264 Encoder, H.264 Decoder, G.723 Decoder/Encoder<br />
• Menu<br />
Typical Scenario<br />
• Display a menu and receive a user input.<br />
• Depending on the user input, execute the proper application.<br />
• When a call arrives during application execution, the running application is suspended and the video-phone application is executed.<br />
• When the call is finished, the previously suspended application is resumed.<br />
• Return to the menu when the application terminates.<br />



Experimental Application<br />
Task Graph<br />
Control Task<br />
• Controls the applications<br />
• Includes an FSM<br />
• Triggered by two input tasks<br />
UserInput Task<br />
• Displays a menu<br />
• Sends the user input to the control task<br />
Interrupt Task<br />
• Models asynchronous event arrivals (e.g., a phone call)<br />
• Sends a signal to the control task<br />



Control Task Specification<br />
FSM specified in the Control task:<br />

Wait for termination of the current application or an asynchronous signal (phone call);<br />
switch (current_state) {<br />
case MENU:<br />
&nbsp;&nbsp;if (input == 1) execute VideoPlay;<br />
&nbsp;&nbsp;else if (input == 2) execute VideoPhone;<br />
&nbsp;&nbsp;else if (input == 3) execute MusicPlay;<br />
&nbsp;&nbsp;else exit;<br />
case VideoPlay:<br />
&nbsp;&nbsp;if (signal == On)<br />
&nbsp;&nbsp;&nbsp;&nbsp;Suspend VideoPlay & Execute VideoPhone;<br />
&nbsp;&nbsp;else<br />
&nbsp;&nbsp;&nbsp;&nbsp;Execute Menu;<br />
case VideoPhone:<br />
&nbsp;&nbsp;if (previous_state == MusicPlay)<br />
&nbsp;&nbsp;&nbsp;&nbsp;Stop VideoPhone & Resume MusicPlay;<br />
&nbsp;&nbsp;else if (previous_state == VideoPlay)<br />
&nbsp;&nbsp;&nbsp;&nbsp;Stop VideoPhone & Resume VideoPlay;<br />
&nbsp;&nbsp;…<br />
}<br />
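The control-task FSM above can be made concrete as executable code. This is a sketch, not the HOPES control task: state names mirror the pseudocode, but the event interface (`user_input`, `call_signal`, `app_done`) and the recorded action strings are illustrative assumptions.<br />

```python
# Illustrative state names mirroring the slide's pseudocode.
MENU, VIDEO_PLAY, VIDEO_PHONE, MUSIC_PLAY = "Menu", "VideoPlay", "VideoPhone", "MusicPlay"

class ControlTask:
    """Sketch of the control task's FSM (interface is an assumption)."""

    def __init__(self):
        self.state = MENU
        self.previous = None
        self.actions = []  # records issued mapping actions for inspection

    def _do(self, action):
        self.actions.append(action)

    def on_event(self, user_input=None, call_signal=False, app_done=False):
        if self.state == MENU:
            target = {1: VIDEO_PLAY, 2: VIDEO_PHONE, 3: MUSIC_PLAY}.get(user_input)
            if target is None:
                self._do("exit")
                return
            self._do(f"execute {target}")
            self.previous, self.state = self.state, target
        elif self.state in (VIDEO_PLAY, MUSIC_PLAY):
            if call_signal:  # incoming call: suspend playback, take the call
                self._do(f"suspend {self.state}")
                self._do(f"execute {VIDEO_PHONE}")
                self.previous, self.state = self.state, VIDEO_PHONE
            elif app_done:   # playback finished: back to the menu
                self._do("execute Menu")
                self.previous, self.state = self.state, MENU
        elif self.state == VIDEO_PHONE and app_done:
            self._do(f"stop {VIDEO_PHONE}")
            self._do(f"resume {self.previous}")  # resume what the call interrupted
            self.state, self.previous = self.previous, VIDEO_PHONE
```

A run of the typical scenario (play video, take a call, resume video) is then just a sequence of `on_event` calls.<br />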



H.264 Decoder Task<br />
MTM<br />
• Two modes: I-Frame / P-Frame<br />
• Mode variable: FrameVar<br />
• Two transitions: I-Frame ↔ P-Frame<br />
• Sample rates change depending on the mode: some ports have sample rate 0 in I_Frame mode, others have sample rate 0 in P_Frame mode<br />
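The mode-dependent sample rates can be sketched as a tiny mode transition machine. This is an illustrative sketch only: the port names (`intra_pred`, `inter_pred`) and the rate values are assumptions standing in for the real H.264 task graph's ports, and the transition condition on the mode variable is simplified.<br />

```python
class ModeTransitionMachine:
    """Two-mode MTM sketch for the H.264 decoder task.

    In each mode a different subset of ports is active; a port that is
    unused in the current mode effectively has sample rate 0, as on the
    slide. Port names and rates here are illustrative assumptions.
    """

    def __init__(self):
        self.mode = "I_Frame"
        # Per-mode sample rates (0 = port unused in that mode).
        self.rates = {
            "I_Frame": {"intra_pred": 1, "inter_pred": 0},
            "P_Frame": {"intra_pred": 0, "inter_pred": 1},
        }

    def update(self, frame_var):
        """Transition on the mode variable; return the active rates."""
        self.mode = "I_Frame" if frame_var == "I" else "P_Frame"
        return self.rates[self.mode]
```

Because the mode set is finite and the rates per mode are fixed, each mode can be analyzed statically, which is the point made in the conclusion about SADF subgraphs.<br />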



Other SDF Task Graphs<br />
• x264 encoder (single mode)<br />
• MP3 Player (single mode)<br />



Profile Results<br />
Profiling with an Intel i7 machine<br />
• Obtain the WCET of each task node.<br />

H.264 Decoder (usec/frame):<br />
ReadFile 117.04 | Decode 2154.26 | InterPredY 1584 | InterPredU 506.88 | InterPredV 514.8<br />
IntraPredY 1552.32 | IntraPredU 297 | IntraPredV 360.36 | Deblock 623.51 | WriteFile 1245.11<br />

x264 Encoder (usec/frame):<br />
Init 211.27 | ME 6730.02 | Encoder 2953.17 | Deblock 517.77 | VLC 4984.65<br />



Profile Results<br />
Profiling information<br />
• MP3 Decoder (usec/iter.):<br />
VLDStream 175.03 | DeQ 398.22 | Stereo 34.73 | Reorder 33.13 | Antialias 40.01 | Hybrid 823.25 | Subband 632.4 | Writefile 29.62<br />
• G.723 Decoder (usec/iter.): G723Dec 1.85<br />
• G.723 Encoder (usec/iter.): G723Enc 1.38<br />



Self-Adaptive Mapping<br />
Compile-time Analysis<br />
• Slow down the processor speed by x6<br />

Task graph | # nodes | # virtual proc (min, max) | 1/throughput (min, max) | 1/throughput constraint<br />
H.264 decoder (I-frame) | 10 | 3, 4 | 23393, 20415 us | Video: 30 frame/sec<br />
H.264 decoder (I-frame) | 10 | 1, 4 | 53462, 20415 us | Phone: 15 frame/sec<br />
H.264 decoder (P-frame) | 7  | 3, 4 | 21711, 16206 us | Video: 30 frame/sec<br />
H.264 decoder (P-frame) | 7  | 1, 4 | 40204, 16206 us | Phone: 15 frame/sec<br />
MP3 decoder  | 8 | 2, 3 | 8912, 4940 us   | 78 frame/sec<br />
x264 encoder | 5 | 2, 3 | 50734, 41708 us | 15 frame/sec<br />



Test Scenario<br />
1) Play a video clip.<br />
2) While the video is playing, a call arrives.<br />
3) When the call is finished, continue playing the video clip.<br />
4) Return to the menu when the video clip is finished.<br />

Run-time Mapping Result<br />
Throughput gain & latency comparison<br />
[Bar charts: ATS (0–6) vs. CCR (1%–4%) for Dynamic_Video/Proposed_Video and Dynamic_Phone/Proposed_Phone; latency (avg. per iteration, 1–1.25) vs. CCR for Dynamic_Video and Dynamic_Phone]<br />


Run-time Mapping Result<br />
Communication overhead<br />
• The proposed technique minimizes the code-migration overhead<br />

Scenario   | Overhead (us)  | Technique | CCR 1% | 2%    | 3%     | 4%<br />
Video play | Communication  | Dynamic   | 34611  | 67514 | 103160 | 136129<br />
Video play | Communication  | Proposed  | 23053  | 46119 | 69184  | 92248<br />
Video play | Code migration | Dynamic   | 2893   | 6101  | 9200   | 11780<br />
Video play | Code migration | Proposed  | 62     | 124   | 187    | 249<br />
Phone      | Communication  | Dynamic   | 32282  | 61469 | 88928  | 119777<br />
Phone      | Communication  | Proposed  | 23116  | 46245 | 69371  | 92496<br />
Phone      | Code migration | Dynamic   | 4926   | 9414  | 14270  | 18965<br />
Phone      | Code migration | Proposed  | 141    | 283   | 425    | 567<br />



Contents<br />
1. Introduction<br />
2. Dynamic Behavior Specification in HOPES<br />
3. Self-Adaptive Mapping<br />
4. A Preliminary Experiment<br />
5. Discussion and Conclusion<br />



Conclusion<br />
HOPES provides two ways to express dynamic behavior<br />
• At the top level: a control task changes the system-level behavior<br />
&nbsp;&nbsp;• Comparable to RPN (Reactive Process Networks)<br />
• At the lower level: an SADF subgraph with an MTM<br />
&nbsp;&nbsp;• Finite number of modes<br />
&nbsp;&nbsp;• Good for static analysis<br />
We developed a hybrid mapping technique to satisfy the real-time constraints (Self-Adaptive Mapping)<br />
• Compile-time analysis<br />
&nbsp;&nbsp;• For each mode of a CIC task (including SADF subgraphs)<br />
&nbsp;&nbsp;• Static mapping information with a varying number of processors<br />
• Run-time mapping<br />
&nbsp;&nbsp;• Determine the number of (virtual) processors to allocate to each task<br />
&nbsp;&nbsp;• Bind virtual processors (VPs) to physical processors (PPs)<br />
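The run-time half of the hybrid scheme can be sketched as a table lookup over the compile-time results. This is an illustrative sketch, not the HOPES run-time manager: the table shape (`{processor count: analyzed 1/throughput}`) and the smallest-allocation-first policy are assumptions, with example numbers taken from the compile-time analysis table earlier in the deck.<br />

```python
def choose_allocation(static_table, inv_throughput_constraint):
    """Run-time allocation step of the hybrid mapping (sketch).

    static_table: {num_virtual_procs: analyzed 1/throughput in us},
        produced at compile time for the task's current mode.
    Returns the smallest processor count whose analyzed iteration
    interval meets the constraint, or None if no entry does.
    """
    for n in sorted(static_table):
        if static_table[n] <= inv_throughput_constraint:
            return n
    return None
```

For example, using the H.264 I-frame figures (23393 us on 3 processors, 20415 us on 4) against the 30 frame/sec video constraint (about 33333 us), 3 processors already suffice, freeing the fourth for other tasks.<br />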



Thank you!<br />

