The Softer Side of Software Defined Radio - ICCAD

The Softer Side of Software Defined Radio

Copyright © MediaTek Inc. All rights reserved.

By Yuan Lin


Mobile Computing

▪ In 2011, world-wide mobile telephone subscriptions: 5.6 billion
  – ~79% of the population
  – Some countries have mobile penetration over 100%
  – The largest consumer electronics category by volume
▪ Multi-media anywhere at anytime
▪ Wireless communication everywhere


Software Defined Radio

[Figure: mobile phone architecture: an analog frontend (TD-SCDMA, LTE, WCDMA) feeds a baseband processor, while application processors drive the camera, keypad, display, speaker, and microphone]




Software Defined Radio

[Figure: the same architecture with the baseband processor expanded: a GPP runs the MAC, link, network, and transport layers, while DSPs + ASICs handle the PHY layer]


Software Defined Radio

Software Defined Radio (SDR): the use of software routines instead of ASICs for wireless protocols' physical-layer processing.

[Figure: the same architecture, with the PHY layer now running on DSPs rather than ASICs, beneath the GPP's MAC, link, network, and transport layers]


Mobile SDR Design Challenges

[Figure: peak performance (Gops) versus power (W) on log scales, with 1, 10, and 100 Mops/mW efficiency contours; mobile SDR requires roughly 100 Mops/mW, well beyond embedded DSPs (TI C6x), high-end DSPs, the IBM Cell, and general-purpose processors. Better power efficiency lies toward the upper left.]

Reference: Lin et al., "SODA: A High Performance DSP Architecture for Software Defined Radio," IEEE Micro Top Picks, 2007




SDR Processors: Common Trends

▪ Converging DSP architecture model
  – scalar + DSP
  – VLIW on top of SIMD
  – Algorithm-specific accelerations
▪ Higher performance through
  – wider SIMD
  – multiple units
  – multiple cores

[Figure: a scalar core alongside multiple DSP cores, each with local memory and algorithm-specific accelerators]


Mobile SDR Design Challenges

[Figure: the same performance-versus-power chart, with SDR DSP processors now reaching the mobile target of roughly 100 Mops/mW]


Trials and Tribulations of SDR Programming

▪ High performance
▪ Portability/productivity
  – Updates within the same processor family
  – Different processor platforms
▪ The situation is getting worse

[Figure: ease of coding versus performance; compiled C/C++ is easy to write but slow, processor-specific assembly code is fast but hard to write, and the "programming gap" in between (high performance and high ease of coding) is where we want to be]


Outline

▪ Language Extension for Wireless
▪ Compilation Challenges
▪ Software Synthesis
▪ DSP Programming Practices


Why C Isn't Enough

▪ C is not designed for describing DSP algorithms
  – Algorithm prototyping
    • Vector/matrix operations are common
    • C obfuscates these operations
  – DSP firmware development
    • No good way to represent specialized DSP processor architectural features

for (i = 0; i < N/2; i++) {
  out[2*i] = in[i];
  out[2*i+1] = in[i+N/2];
}

Q: What is this vector operation?
A: A basic vector perfect-shuffle operation.


Language Extensions for Wireless Algorithms

▪ A list of "nice-to-have" language features for wireless algorithms
  – Vector/matrix
    • Vector/matrix arithmetic operations
    • Vector/matrix permutation operations, e.g. perm_out(out) = op(perm_in0(in0), perm_in1(in1), …)
  – Fixed-point and complex fixed-point
    • e.g. 4-bit complex arithmetic
  – Algorithm toolbox
    • e.g. multi-precision FFT/FIR library functions
  – System programming considerations
▪ We have languages that provide one or more of these features, but none with all of them
  – OpenCL: no fixed-point arithmetic or algorithm toolbox
  – MATLAB: no fixed-point arithmetic or explicit type declarations
  – SystemC: no vector/matrix arithmetic or algorithm toolbox


Compiler Challenges

▪ Compilers don't understand algorithms
  – Algorithm-specific parallelization techniques
  – Algorithm-specific optimization techniques
  – Algorithm-specific approximation techniques
▪ There is no single best algorithm implementation
  – Multiple implementations
  – Design trade-offs (e.g. memory versus performance)
    • e.g. how to determine the optimal sort algorithm?
▪ Multi-core, multi-level scratchpad memories, and HW-SW co-design make the problem even more challenging

Everywhere you look, there are optimization problems.


Algorithm-Specific Compilers

[Figure: an FFT node expands into candidate implementations FFT Alg1, FFT Alg2, FFT Alg3]

▪ Exploit algorithm-specific optimizations
  – Special parallelization techniques
    • Sliding-window technique for Turbo decoding
  – Different implementations
    • Table lookup or run-time computation
  – Special processor accelerations
    • Radix-4 FFT instruction
  – Different BER
    • 16-bit fixed-point versus floating-point


Algorithm-Specific Compilers

▪ Implementation templates
  – Multi-core software pipelining
  – Multi-core parallelization
  – Input/output vector ordering

> fft_compiler -points 64 -arch dsp.arch

[Figure: an FFT specification (1024-point, complex 16-bit) and a HW specification (512-bit SIMD, 2 cores) feed the FFT compiler, which selects among FFT implementation templates (FFT Alg1, Alg2, Alg3) and applies FFT-specific optimizations to emit high-performance C + intrinsics code]


Last Thought on DSP Programming

▪ How do we know if we achieved the optimal performance?
  – The code is fast enough because…
    • Um, my boss is happy with it?
▪ How much of the performance is limited by our own perceived upper bound?
  – Performance code checker
  – Performance estimator


Performance Code Checker

▪ Every DSP processor comes with its own programmer's manual
  – A list of good and bad coding practices
  – Some are universal, some are processor/compiler-specific
▪ An analysis tool that checks for bad programming practices
  – Different sets of analysis rules for different processors & compilers
  – e.g. non-vectorizable loop structures, wrong intrinsics, etc.
▪ Lint for DSP C code


Why Use a Code Checker?

What's wrong with this code?

void foo()
{
    ...
    char index;
    for (i = 0; i < 1000; i++) {
        array[index++] = vec[i] / 10;
    }
}

On the Cognovo platform: only unsigned types should be used for array indexing, because of an AGU + compiler quirk.

On TI C64x: there is no integer divide unit, so the division turns into a function call and disables software pipelining.


Performance Estimator

▪ Provide performance estimations during algorithm prototyping

[Figure: the algorithm is fed to simulators for candidate SoCs (SoC 1, SoC 2, SoC 3), each returning estimated cycles, power, and area]


Thanks


Building Predictable Cyber-Physical Systems from Dynamic Applications and Platforms

Sander Stuijk
Department of Electrical Engineering, Electronic Systems

Based on joint work with Twan Basten, Marc Geilen, Bart Theelen, and many others


Embedded streaming systems

Application trends: uncertainty, concurrency, dynamism


Model-based design

Modeling, analysis, implementation, and run-time management, combined in a design flow.

[Figure: the design flow maps applications (App 1 + App 2) onto a platform and produces a Gantt chart]


WLAN application

A new OFDM symbol arrives every 4.0 μs.
The Sync, Header, and Payload scenarios each process one symbol; the CRC scenario processes no symbol.
Ports may have rates (rate one is omitted for clarity).
Initial tokens return to their original distribution after one iteration.


WLAN application

Each scenario may be modeled with a different scenario graph.
Persistent token names relate the initial tokens across different scenario graphs.


WLAN application

[Gantt chart: actors Src, Shift, Sync, Hdem, Hdec, and Pars executing the sync scenario over 0-12 μs]


WLAN application

[Gantt chart: the same actors over 0-12 μs, now covering the sync and header scenarios]


Analyzing SADF graphs

[Gantt chart for one scenario sequence: actors Src, Shift, Sync, Hdem, Hdec, and Pars executing the sync and header scenarios over 0-12 μs]

Execution is a sequence of vector shapes: after the Sync and Header scenarios, the initial tokens on src, shift, and pars carry time stamps such as 0 ns, 4000 ns, and 5940 ns.

Token time stamps (vector shapes) provide constraints for the next iteration.


Analyzing SADF graphs

[Figure: a max-plus automaton with states for the Sync, Header, Payload, and CRC scenarios; each edge carries a delay and a progress count (e.g. "4000 ns, 1"), and each state carries the token time-stamp vector over src, shift, payload, and pars]


Analyzing SADF graphs

Throughput: maximum cycle mean / maximum cycle ratio (MCM/MCR) analysis on the max-plus automaton.
Latency: longest-path analysis.

[Figure: the same max-plus automaton as on the previous slide]


Predictable scenario-aware design-flow

[Figure: the platform is a set of tiles (one with an ARM, one with a software codec (SWC), one with an EVP), each containing IMEM, DMEM, and a network interface (NI), connected by an interconnect; the flow must map the application onto it]

Design-flow steps: compute buffer constraints, unified resource binding, static-order scheduling, TDMA time-slice allocation, ...


Compute buffer constraints

A buffer size constraint is modeled with a back-edge carrying initial tokens: start with a buffer size of 1 token per edge and run throughput analysis; if the throughput requirement is not met, enlarge the buffer (e.g. to 2 tokens) and repeat the analysis.


Predictable scenario-aware design-flow

[Figure: the application actors bound to the tiles (EVP, SWC, ARM) of the platform]

Unified mapping avoids the need for a run-time reconfiguration mechanism.
Static-order schedules may change between scenarios.


Modeling timing impact of platform

[Figure: a connection over the interconnect has delay; the binding-aware dataflow graph models each connection delay with an added dataflow actor D]


Predictable scenario-aware design-flow

[Figure: a resource (the SWC tile) is shared; it is scheduled with a static-order (SO) schedule only]


WLAN application

[Gantt chart: actors src, shift, payload, and pars mapped onto the evp, swc, and arm; an OFDM symbol arrives every 4 μs, and the chart shows the sync, header, payload, and crc scenarios over 0-20 μs together with the active periods of each processor]

Initial tokens of resources capture resource availability.
The timing requirements seem to be met...


Model-based design

Modeling, analysis, implementation, and run-time management, combined in a design flow.


Run-time reconfiguration

[Figure: the same multi-tile platform (EVP, SWC, ARM tiles)]

DVFS changes actor execution times.
DVFS settings are modeled with system scenarios.


WLAN application

[Gantt chart: the same mapping over 0-36 μs, with the system reconfiguring between DVFS configurations c1 and c2 (scenarios sync-c1 through crc-c1, reconfiguration c1 → c2, sync-c2 through crc-c2, and back)]

The latency between reception of an OFDM symbol and its processing increases.
Processing cannot keep up with frequent reconfiguration.


Summary

A strategy for designing predictable systems running dynamic applications:
Scenarios capture dynamic (application and system) behavior
Resource- and energy-efficient implementations
Predictable implementations

The SADF model of computation:
Provides many analysis techniques
Provides an implementation trajectory

Analysis and implementation techniques are implemented in the SDF3 tool kit: www.es.ele.tue.nl/sdf3


Methodology and Tools for Design of Energy Efficient Multi-Core Chips

Nagu Dhanwada
IBM Electronic Design Automation, Systems and Technology Group


Outline

Challenges in Multi-Core Chip Design
Reference Power Aware Design Methodology for Multi-Core Chips
Tools and Use Cases
  Early Analysis Tool for Multi-Core Designs
  Power Management Exploration and Design
  Architecture and Algorithm Exploration in Embedded Systems


Introduction: Multi-Core Chip Design Challenges

Time to Market
  IP Integration
  Heterogeneity
  Validation
Performance
  Cache Coherency
Energy Efficiency
  Complex Power Management
Productivity and Quality
  Design with third-party IP
  Designing with potentially unreliable components


Introduction: Multi-Core Chip Design Challenges

Complex power management for energy efficiency:
  Global dynamic voltage and frequency scaling
  Power capping
  Guard-band reduction
  Power throttling
  Power budgeting


What do we need?

Standards-based power-aware system-to-silicon flows
Tools within these flows supporting early analysis and design


Reference Power Aware Design Methodology for Multi-Core Chips


System-to-Silicon Power Aware Design Flow

[Figure: the flow runs from ESL design (design & mapping, analysis & optimization, validation) through RTL design (design & integration, analysis & optimization, verification & test) to implementation (optimization & closure, analysis, verification & test); power models and power intent content span the entire flow. Annotations: power models have no standard today; power intent is supported by Liberty today via the LPC contribution to LTAB]

A comprehensive end-to-end low-power design flow, with high-level power modeling and power models that span the entire flow.


ESL Design Phase

[Figure: from an initial application specification, a high-level hardware architecture (function + communication architecture) and a software specification feed ESL co-design and mapping; ESL hardware design & optimization and ESL embedded software design then produce a high-level hardware/software integration and an embedded software image, with analysis on one side of the flow and validation on the other]


RTL Design Phase

[Figure: design and integration: IP and chip specifications drive module coding and chip integration (global clock gating, power domains, DFT), producing module-level and chip-level power formats; analysis and optimization iterate over the chip specification against power constraints]


Implementation Phase

[Figure: partitioned RTL, power constraints, tool directives, the library, and physical data feed design optimization and closure: synthesis, floorplan, placement, clock tree, power optimization & closure, and route, with analysis steps including power-rule verification and IR analysis]


ESL Design Phase: Analysis and Optimization

[Figure: workloads/stimulus (benchmark programs/traces, random stimulus generators) and architecture parameters (initial configuration) drive a SystemC simulation of the ESL design description; power calculation combines power models with the power intent format to produce power reports, which feed optimization and refinement and, ultimately, RTL design and simulation]


RTL Design Phase: Analysis and Optimization

[Figure: the same loop at RTL: a VHDL/Verilog simulator runs the RTL description; power calculation with power models and the power intent format produces power reports, feeding optimization and refinement and implementation-level power models]




Power Intent Formats: Overview

Functional design description: assumes power is always on, running at a constant voltage.
Power intent formats: capture the variation of power over time (the power management specification).
Power intent + functional description = a representation of an actively power-managed design.


Power Intent Formats: Overview

Structural aspects:
  Interaction between design elements having time-varying power characteristics
  State restoration, re-initialization
  Examples: power domains, switches, level shifters, isolation cells, state retention logic, power control signals

Behavioral aspects:
  Effect of power variation on the computation model
  Enumerate the state of the simulation model for each set of design elements driven by the same voltage


Power Intent Specification Example

[Figure: an image subsystem: a switchable core (bayer → jpeg) with core read/write transactors in power domain PD1 (voltage scaling, 0.8-1.0 V), and a slave interface, memory-mapped register bank, master read/write interfaces, and DMA read/write control in PD2 (switchable, 1.0 V)]

set_design img_subsys
create_power_domain -name PD1 -instances {img_core core_read core_write} -default
create_power_domain -name PD2 -instances {slave_intf master_read master_write}
create_nominal_condition -name standby -voltage 0.8 -state standby
create_nominal_condition -name low -voltage 0.9 -state on
create_nominal_condition -name high -voltage 1.0 -state on
create_nominal_condition -name off -voltage 0 -state off
create_mode -name full_speed -conditions {PD1@high && PD2@high}
create_mode -name low_speed -conditions {PD1@low && PD2@high}
create_mode -name sleep -conditions {PD1@standby && PD2@high}
create_mode -name core_off -conditions {PD1@off && PD2@high}
create_mode -name all_off -conditions {PD1@off && PD2@off}
end_design

Slide courtesy: Si2 LPC Format Working Group




Power Modeling of Complex IP: Issues

Current standards (Liberty) are oriented toward small IP:
  Can include complete states & transitions
  Dependence on input slew and output load

Complex IP:
  Several inputs, states, power modes, and internal parts
  Much internal power dissipation
  Completeness of the model is key to enabling high-level design
  Adequate parameterization to capture sensitivity to design decisions

Power states / modes may be:
  Intentionally designed, e.g. power modes in power intent specifications, power domain switching
  A result of behavior / activity, e.g. clock gating, CPF functional modes


Power Modeling of Complex IP: Issues

State power modeling challenges:
  Exponential state explosion
  Requires non-mutually-exclusive power states
  Being addressed in current standards

Transition energy modeling challenges:
  Exponential state-transition explosion
  Separation of internal transition energy from pin power
  Transitions between mutually exclusive states

Proposals for efficient and uniform modeling of complex IP within current standards are being developed in the Si2 Low Power Coalition.


Tools and Use Cases




SLATE: Early Analysis Tool for Multi-Core Designs

SLATE (System-Level Analysis Tool for Early Exploration): a tool for early power, performance, physical, and thermal characterization of multi-core designs.

[Figure: SLATE's graphic front-end around a chip block diagram (AXU/C3/L2 core clusters with L3, MC, and IO on an ASIC); surrounding analyses include performance/functional models, interconnect analysis (utilization versus number of connections), power analysis (power in watts versus packet size, for CPU and Rx with and without Tahoe accelerators), chip floorplanning, thermal analysis, chip integration, implementation, import of 3rd-party IP, and industry-standard models]


Use Scenarios

Power management design exploration in high-performance servers:
  the early-analysis system configured to use trace-driven performance models for POWER4-based processor cores

Architecture and algorithm exploration in embedded systems:
  the early-analysis system configured to run embedded PowerPC and CoreConnect models in execution-driven mode, running real software


Power Management Design Exploration: ESL Analysis Use Scenario

[Figure: a system simulation model built from performance simulation models (processor model, arbiter, bus, PLB master, PLB slave (HSMC), PLB_OPB and OPB_PLB bridges, OPB arbiter, UIC, OPB master and slave) plus power models, configured by an architecture configuration, a power management specification, and power and clock domain definitions; the power management unit (algorithms, control modes) issues Vdd and frequency changes, and the resulting performance and power numbers drive power management policy optimization (manual or tool-based)]


Power Management Studies in SLATE

Per-core vs. chip-wide DVFS; fetch throttling (I-cache throttling).

Parameterizable: frequency-change penalty; number of discrete V/F levels, or continuous mode.
Easy to change the PMU algorithm; easy to try different configurations.

[Figure: a PMU on a PMU bus controls cores core0-core3, each with its own clock (clk0-clk3) from a multiple-clock generator; the PMU interface includes set_mode(), set_frequency(), set_throttling_factor(), get_cpi(), get_temperature(), get_num_commits(), get_num_decodes(), and get_power_modes()]


Power Management Studies: DVFS Algorithms<br />

Discrete MaxBIPS<br />

Assumes a set of discrete power modes (Vdd–frequency pairs) for each core, which the Power Manager can control individually.<br />

Goal: maximize overall chip performance under a given power budget.<br />

Chip performance: total number of completed instructions by all cores per time period.<br />

Continuous Approach (CPM)<br />

Non-linear programming to model the same DVFS problem with continuous power modes.
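As a concrete illustration of the discrete formulation, the per-core mode selection can be sketched as an exhaustive search over (power, performance) modes. This is a minimal sketch with hypothetical mode tables, not the MaxBIPS implementation used in SLATE:

```python
from itertools import product

def maxbips(modes, budget):
    """Pick one (power, BIPS) mode per core so that total power stays
    within the budget and total BIPS is maximized. Exhaustive search,
    feasible for a handful of cores and modes."""
    best = None
    for combo in product(*modes):
        power = sum(p for p, _ in combo)
        bips = sum(b for _, b in combo)
        if power <= budget and (best is None or bips > best[0]):
            best = (bips, combo)
    return best

# two cores with three hypothetical (power, BIPS) modes each
modes = [[(10, 100), (6, 80), (3, 50)],
         [(10, 120), (6, 90), (3, 60)]]
```

With a budget of 12 the search settles on the middle mode for both cores; the continuous (CPM) variant replaces this search with a non-linear program over continuous Vdd–frequency settings.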


[Figure: "Chip Performance for Decreasing Power Budgets" — relative chip performance (80–100) plotted against power budgets shrinking from 100 to 45, for four schemes: chip-wide discrete, chip-wide continuous, per-core continuous, and per-core discrete.]


Individual Core Performance under Per-Core DVFS<br />

[Figure: relative performance (80–100) of individual benchmarks (eon, eon-c, twolf, twolf-c, perl, perl-c) against power budgets shrinking from 100 to 45 under per-core DVFS.]


Use Scenario 2: Example Embedded System<br />

Embedded PowerPC 4xx & CoreConnect IP cores, RISCWatch debugger, enabled for the GCC tool chain<br />

[Figure: platform block diagram — a 405 CPU and MAL on a 64-bit PLB with PLB arbiter, HSMC, EBC, DMA, and a PLB-OPB bridge to a 32-bit OPB (OPB arbiter, GPIO, UART0, UART1, IIC, GPT); a private OPB hosts the UIC; an EMAC with RX/TX FIFOs provides Ethernet; clocks and resets come from a CLOCK/RESET generator, PLL, and PM block.]


Architecture Exploration: Ethernet Packet Processing<br />

PPC405 platform-based design; Ethernet subsystem: 1 EMAC, 1 Madmal<br />

Real embedded application executing on TLM models<br />

Measure effects on performance and power<br />

Change to improve performance: added an extra EMAC + FIFOs<br />

Two-EMAC mode of the application works with packets being transmitted from one EMAC and received by the other<br />

[Figure: platform diagram with two EMACs (each with RX/TX FIFOs) on the PLB/OPB fabric, and three charts versus packet size (64–4096 bytes) comparing the 1-EMAC and 2-EMAC configurations: system throughput (KBytes/sec, up to ~25000), CPU utilization (% of total time, up to ~100), and power (mW, ~100–350).]


Architecture Exploration in Embedded Systems<br />

Speed-up with order-16 matrix, data set 1:<br />

# Cores | Execution Time | Speed-up | Efficiency (%)<br />

1 | 520573 | 1 | 100<br />

2 | 2632011 | 1.977 | 98.89<br />

4 | 1338610 | 3.889 | 97.22<br />

6 | 9178616 | 5.671 | 94.53<br />

8 | 695988 | 7.479 | 93.49<br />

Speed-up with order-16 matrix, data set 2:<br />

# Cores | Execution Time | Speed-up | Efficiency (%)<br />

1 | 17648256 | 1 | 100<br />

2 | 8886615 | 1.986 | 98.89<br />

4 | 4488740 | 3.931 | 97.22<br />

6 | 3023774 | 5.836 | 94.53<br />

8 | 2293580 | 7.695 | 93.49<br />

[Figure: speed-up (0–8) versus number of processor cores (0–9), with and without on-chip memory.]


Algorithm Exploration in Embedded Systems<br />

Multicore matrix multiplication using a parallel algorithm with caches turned on.<br />

Using an eight-core CoreConnect-based SOC (Data-Cache 4KB, Inst.-Cache 32KB).<br />

[Figure: two charts of time in ns versus number of processor cores (0–9) — order-16 matrix multiplication (up to ~6,000,000 ns) and order-32 matrix multiplication (up to ~45,000,000 ns), each showing execution time, duration, and idle time.]


Load Balancing across Cores in a Multi-core SOC<br />

[Figure: efficiency (%) of the system as the order of the matrix (8, 16, 32) and the number of processor cores (0–10) in the SOC vary.]<br />

Load breakup among the various processor cores in the design, in percentage: CPU0 12%, CPU1 12%, CPU2 12%, CPU3 12%, CPU4 12%, CPU5 14%, CPU6 13%, CPU7 13%.<br />

Activity comparison among the various processor cores; can be used for exploring load-balancing strategies.


Accuracy of Transaction Level Models<br />

Comparisons between simulated models and real hardware demonstrate the accuracy of transaction-level models for early analysis and design space exploration.<br />

Errors below 15% in timing accuracy.<br />

Errors below 11% in power estimation.


Summary<br />

Standards-based power-aware flows are key for achieving energy-efficient multi-core designs<br />

Flows need common power models and power intent descriptions across levels of abstraction<br />

Integrated pre-RTL analysis and exploration in power-aware flows is needed for efficient design of advanced system architectures


Acknowledgements<br />

John Darringer, David Hathaway, Arun Joseph,<br />

Jerry Frenkil, Rhett Davis, Qi Wang


References<br />

N. Dhanwada, R. Bergamaschi, W. Dungan, I. Nair, P. Gramann, W. Dougherty, I. Lin, "Transaction-level modeling for architectural and power analysis of PowerPC and CoreConnect-based systems", Des Autom Embed Syst (2006) 10:105–125.<br />

Si2 Low Power Coalition, "Si2 High Level Power Modeling Requirements", Jun. 2011. [Online] http://si2.org/openeda.si2.org/project/showfiles.php?group_id=76#p115v1.2<br />

R. Bergamaschi, I. Nair, G. Dittmann, H. Patel, G. Janssen, N. Dhanwada, A. Buyuktosunoglu, E. Acar, G. Nam, G. Han, D. Kucar, P. Bose, J. Darringer, "Performance Modeling for Early Analysis of Multi-Core Systems", Proceedings of CODES+ISSS 2007.<br />

R. Bergamaschi, G. Han, A. Buyuktosunoglu, H. Patel, I. Nair, G. Dittmann, G. Janssen, N. Dhanwada, Z. Hu, P. Bose, J. Darringer, "Exploring Power Management in Multi-Core Systems", Proceedings of ASP-DAC 2008.


Dynamic Behavior Specification and<br />

Dynamic Mapping for Real-time<br />

Embedded Systems in HOPES<br />

Nov. 8, 2012<br />

Soonhoi Ha, w/ Hanwoong Jung and Chanhee Lee<br />

Seoul National University<br />

1


Contents<br />

1. Introduction<br />

2. Dynamic Behavior Specification in HOPES<br />

3. Self-Adaptive Mapping<br />

4. Preliminary Experimental Results<br />

5. Discussion and Conclusion<br />

2 HOPES project, SNU


Parallel Embedded SW Design Challenge<br />

Target-independent parallel programming for non-trivial heterogeneous systems with diverse design constraints (time, power, temperature, cost, and so on)<br />

Problem: model-based design of parallel embedded systems<br />

• Parallelism extraction (multi-mode multi-tasking apps.)<br />

• Functional parallelism & data parallelism<br />

• Partitioning and mapping<br />

• Parallel code generation<br />

• Performance estimation and verification<br />

• Design space exploration


Key Idea<br />

Programming platform<br />

• Meet-in-the-middle approach<br />

• Role of the "execution model"<br />

[Figure: applications are mapped onto the software and hardware platforms either by manual design or, through the programming platform (CIC), by model-based design.]


HOPES Design Flow<br />

[Figure: design flow — specifications in a dataflow model, UML, or KPN are translated automatically (or written manually) into the Common Intermediate Code: task codes (algorithm) plus an XML file (architecture). Task mapping, guided by performance libraries/constraints and static analysis, feeds CIC translation, which generates C code for various targets and a virtual prototyping system.]


CIC (Common Intermediate Code)<br />

Basically an actor-oriented model (or extended dataflow model)<br />

• Execution model of a parallel architecture<br />

• Defines the semantics for task scheduling and task interaction<br />

OS-level task model<br />

• Large granularity – thread/function (atomic mapping unit)<br />

• Implicitly assumes the existence of an OS or a run-time system<br />

[Figure: CIC task codes for tasks T1–T4 with a control task; channel types are FIFO or array, where an array channel models a shared memory with indexed slots. The algorithm model captures available parallelism; architecture information, mapping, control, and profile data complete the specification.]


CIC task model<br />

3 types of tasks<br />

• Computation task: data parallelism is expressed<br />

• Control task: defines the execution mode of computation tasks<br />

• Library task: expresses vertically-layered or server-client SW<br />

Execution semantics<br />

• Time-driven or data-driven<br />

3 types of ports<br />

• Data port: communicate messages between CIC tasks<br />

• System port: communication with the OS or run-time system<br />

• Library port: call a library task<br />

Channel semantics<br />

• FIFO channel<br />

• Array channel: indexed access for data-parallel execution<br />

• Buffer channel


CIC Task Code: Definition<br />

A CIC task is defined by three methods<br />

• TASK_INIT: before the main loop<br />

• TASK_GO: in the main loop<br />

• TASK_WRAPUP: after the main loop<br />

Uses generic APIs for target independence<br />

TASK_INIT { /* task initialization code */ };<br />

TASK_GO {<br />

MQ_RECEIVE("mq0", (char *)(ld_106->rdbfr), 2048);<br />

...<br />

//task_body()<br />

MQ_SEND("output", (char *)(st_107->buf), 2048);<br />

}<br />

TASK_WRAPUP { /* task wrapup code */ };


CIC Translation<br />

CIC to multi-thread codes for functional simulation<br />

• Generated codes are run on a host machine<br />

CIC to target C codes<br />

• Target-specific code generation: for virtual prototyping, for MPCore, for the Cell processor, for GPGPU; [planned] DSP array, reconfigurable hardware<br />

• Per-processor code generation based on mapping information<br />

• Multi-threaded task codes<br />

• Interface code generation<br />

• Scheduler code generation<br />

[Figure: target architectures — SMP or heterogeneous cores, a DSP array, HW IP, and reconfigurable HW connected by a communication network.]


Challenges<br />

Lane detection algorithm on GPU<br />

• CPU+GPU heterogeneous platform<br />

• Multicore CPU: multithreading<br />

• Support multiple GPUs<br />

• CIC translation with asynchronous communication between CPU and GPU<br />

[Figure: lane detection pipeline — Load Image, YUV to RGB, Gaussian, Sobel, denoising filters (KNN, NLM), Non-Maximum Suppression, Blending, Sharpen, lane detection filters (Hough Transform), Draw Lane, Merge, RGB to YUV, Store Image.]


Experimental Results<br />

Processor | Tasks<br />

CPU | LoadImage, Draw Lane, StoreImage<br />

GPU 0 | YUVtoRGB, Gaussian, Sobel, Non-Maximum, Hough, Merge<br />

GPU 1 | KNN, NLM, Blending, Sharpen, RGBtoYUV<br />

Configuration | Time<br />

CPU | 2109.5 sec<br />

1 GPU, Sync | 15.0266 sec<br />

1 GPU, Async with 2 streams | 11.9998 sec<br />

1 GPU, Async with 3 streams | 12.3378 sec<br />

1 GPU, Async with 4 streams | 12.0846 sec<br />

2 GPUs, Sync | 11.332 sec<br />

2 GPUs, Async with 2 streams | 10.247 sec<br />

2 GPUs, Async with 3 streams | 9.7842 sec<br />

2 GPUs, Async with 4 streams | 9.7798 sec


Contents<br />

1. Introduction<br />

2. Dynamic Behavior Specification in HOPES<br />

3. Self-Adaptive Mapping<br />

4. Preliminary Experimental Results<br />

5. Discussion and Conclusion


Dynamic Behavior<br />

At the system level<br />

• The set of user tasks running concurrently may change (user demand)<br />

At the application level<br />

• An algorithm may have multiple modes of operation<br />

• Execution times of tasks may vary<br />

At the OS level<br />

• QoS requirement changes the mode of operation<br />

At the hardware level<br />

• Unpredictable resource availability<br />

• Temporary or permanent failure of processing elements


Two-level Specification in HOPES<br />

At the top level<br />

• A control task manages the execution state of computation tasks<br />

• Each mode of operation (or use case) is defined by a set of CIC tasks that run concurrently.<br />

• The mode of operation may change dynamically.<br />

• The control task specifies the mode change by an FSM.<br />

At the task level<br />

• A CIC task may have an SADF (scenario-aware dataflow) graph inside.<br />

• The behavior of a task may change dynamically.<br />

• Finite number of scenarios of operation<br />

• Each scenario is specified by an SDF graph


Control Task<br />

Dynamic behavior modeling<br />

• A control task can control the execution of computation tasks by using predefined control APIs<br />

• Triggered by data inputs from computation tasks (or by checking task state)<br />

• Sends control messages to the OS via a system port<br />

• Similar to a statechart in STATEMATE or an fFSM in PeaCE<br />

[Figure: CIC computation tasks supervised by a CIC control task.]


Internal Specification of a Control Task<br />

Internal behavior is specified with an FSM model<br />

• Assumes an implicit timer in the system, which may generate real-time events<br />

• Code template is automatically generated


Code Example<br />

while(1){<br />

MQ_AVAILABLE(all_ports); // 1-1. Check the existence of a new event<br />

SYS_REQ(CHECK_TASK_STATE, "task_name", ...); // 1-2. Check the termination of a task<br />

if(available) MQ_RECEIVE(selected_port); // 2. Read the new event<br />

if(some event or task state is triggered) break; // 3. Break the loop to make a transition<br />

}<br />

switch( current_state )<br />

{<br />

case ID_STATE_S1:<br />

if(selected_port==1 && input_data==2){ // 4. Check the transition condition<br />

current_state = ID_STATE_S2;<br />

SYS_REQ(SET_PARAM_INT, "FloatGroup", "FloatVar", input_data, 0, 0);<br />

} // 5. Send the control message through the system port<br />

break;<br />

case ID_STATE_S2:<br />

if(...){<br />

...<br />

}<br />

...<br />

}


PC + NXT Robot Example<br />

Control the NXT robot from both a PC and the robot itself.<br />

1. The SensorDetect task reads sensor values and sends them to two control tasks: ControlPC and ControlNXT.<br />

2. The KeyDetect task reads the key input value and sends it to the ControlPC task.<br />

3. Controlled by the ControlPC and ControlNXT tasks, the Move task and Grab task run the motors.<br />

4. The LCD task displays the current status of the NXT.<br />

[Figure: task graph — KeyDetect and SensorDetect feed ControlPC and ControlNXT, which drive Move, Grab, and LCD.]


PC + NXT Robot Example<br />

Control NXT Task<br />

1. The ControlNXT task controls the NXT robot itself.<br />

2. The ControlNXT task includes several scenarios for the robot (decisions of the ControlNXT task).<br />

Condition> the NXT robot senses a black line on the floor<br />

1) The robot stops immediately.<br />

2) After 3 seconds,<br />

2-1) if the current motion of the robot is forward, the robot starts to go backward.<br />

2-2) if the current motion of the robot is backward, the robot starts to go forward.<br />

3) After 2 seconds from the above action, the robot stops and starts to spin.<br />

Condition> the NXT robot hears a loud sound<br />

1) The robot immediately folds/unfolds its arm.<br />

1-1) if the arm is currently folded, the robot unfolds its arm.<br />

1-2) if the arm is currently unfolded, the robot folds its arm.<br />


PC + NXT Robot Example<br />

Control NXT Task<br />

1. The ControlNXT task is specified in an FSM manner.<br />

TASK_GO {<br />

switch(stateLight) {<br />

case 0: //INIT<br />

break;<br />

…<br />

case 4: // BACKWARD<br />

SYS_REQ(SET_PARAM_INT, "Move","motion",BACKWARD,id1,0);<br />

SYS_REQ(RUN_TASK, "Move",id1,0);<br />

set_time_base = SYS_REQ(GET_CURRENT_TIME_BASE);<br />

timer_id1 = SYS_REQ(SET_TIMER, set_time_base, 2);<br />

stateLight = 6;<br />

break;<br />

case 5: // FORWARD<br />

SYS_REQ(SET_PARAM_INT, "Move","motion",FORWARD,id1,0);<br />

SYS_REQ(RUN_TASK, "Move",id1,0);<br />

set_time_base = SYS_REQ(GET_CURRENT_TIME_BASE);<br />

timer_id1 = SYS_REQ(SET_TIMER, set_time_base, 2);<br />

stateLight = 6;<br />

break;<br />

…<br />

}<br />

switch(stateSound) {<br />

case 0:<br />

if( AVAILABLE(port_sound) ) {<br />

prev_sound = sound_val;<br />

BUF_RECEIVE(port_sound, &sound_val, sizeof(U16));<br />

if(prev_sound >= 400 && sound_val < 400) stateSound = 1;<br />

}<br />

break;<br />

…<br />

}<br />

}


PC + NXT Robot Example<br />

Processing control commands<br />

1. Each control command includes time information:<br />

unsigned int time_base = SYS_REQ(GET_CURRENT_TIME_BASE);<br />

SYS_REQ(SET_PARAM_INT, task_name, param_name, param_value, time_base, time_offset);<br />

2. An internal control scheduler processes control commands from control tasks based on the time information of each command (time_base + time_offset).<br />

[Figure: control tasks 1–3 send commands into a command queue sorted by time information; the control scheduler pops and executes the due command.]
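A minimal sketch of such a time-sorted command queue follows (illustrative Python, not the HOPES run-time; the class and method names are assumptions):

```python
import heapq

class ControlScheduler:
    """Commands become due at time_base + time_offset and are executed
    in due-time order; ties are served in arrival order."""
    def __init__(self):
        self._q = []
        self._seq = 0  # tie-breaker: preserves insertion order for equal times

    def send(self, time_base, time_offset, command):
        heapq.heappush(self._q, (time_base + time_offset, self._seq, command))
        self._seq += 1

    def run_due(self, now):
        """Pop and return all commands whose due time has been reached."""
        done = []
        while self._q and self._q[0][0] <= now:
            done.append(heapq.heappop(self._q)[2])
        return done
```

The heap keeps the earliest-due command at the front, so the scheduler never scans the whole queue per tick.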


SADF Specification in HOPES<br />

An SADF subgraph is regarded as a computation task from the outside<br />

• A control task can control the execution status of the entire subgraph<br />

• It resembles the hierarchical model composition in Ptolemy<br />

• For each scenario, the application is specified by a decidable dataflow graph (for now, we use an SDF graph)<br />

[Figure: the IntGroup task contains an SADF subgraph.]


SADF Subgraph<br />

Motivation<br />

• For static analysis, we may want to use SDF or its extended model as much as possible in application specification.<br />

• We explicitly specify whether a CIC subgraph is an SADF graph.<br />

SADF (scenario-aware dataflow) model<br />

• Assumes that the number of scenarios is finite.<br />

• It is associated with an MTM (Mode Transition Machine).<br />

• Each task has a different definition for each mode of operation.<br />

• Sample rates can be changed.


MTM Specification<br />



CIC Task in an SADF<br />

Its behavior depends on the mode<br />

• Sample rates<br />

• Task body<br />

By default, it is an SDF task


Mode Transition in SADF<br />

API for mode transition request<br />

• SYS_REQ(SET_MTM_PARAM_INT, Task Name, Var Name, Value, 0, 0)<br />

• The controller task calls this API to change the mode of an SADF<br />

• The argument variable of the MTM is changed immediately when the API is called<br />

Mode transition mechanism<br />

• Mode transition occurs at the iteration boundary by default. Immediate transition can be enforced.<br />

• At the start of each iteration, a task checks the current mode and changes its sample rates and function body.<br />

• Note that each task knows how many times it should be fired in each iteration via static scheduling performed for each mode at compile time.
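The iteration-boundary mechanism can be sketched as follows (illustrative Python, not the CIC run-time; `current_mode`, `bodies`, and `reps` are assumed names):

```python
class SadfTask:
    """A task re-reads the MTM mode once per graph iteration and switches
    its firing count (sample rates) and function body accordingly."""
    def __init__(self, mtm, bodies, reps):
        self.mtm = mtm        # object exposing current_mode()
        self.bodies = bodies  # mode -> task body (function)
        self.reps = reps      # mode -> firings per iteration (static schedule)

    def run_iteration(self):
        mode = self.mtm.current_mode()  # checked at the iteration boundary only
        return [self.bodies[mode]() for _ in range(self.reps[mode])]

class DemoMtm:
    """Stand-in for the MTM: holds the current mode set by a control task."""
    def __init__(self):
        self.mode = "S1"
    def current_mode(self):
        return self.mode
```

A mode change requested mid-iteration therefore takes effect at the next iteration, matching the default transition semantics above.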


A Toy Example<br />

[Figure: intGen_1 and intGen_2 feed the IntGroup subgraph (actors A and B) into Mix and I_Display; FloatGen feeds the FloatGroup subgraph (actor C) into F_Display; a Counter drives a Control task with states S1 and S2.]


A Toy Example – cont'd<br />

IntGroup<br />

• The IntGroup task has two modes.<br />

• S1 mode: mix two 8-digit integers.<br />

num_1: xxxxxxxx, num_2 = yyyyyyyy -> Result = xxxxyyyy<br />

• S2 mode: mix four 8-digit integers.<br />

num_1: xxxxxxxx, num_2 = yyyyyyyy, num_3 = zzzzzzzz, num_4 = wwwwwwww -> Result = xxzzyyww<br />

MTM<br />

Mode: S1, S2; Variable: IntVar<br />

currentState | Condition | nextState<br />

S1 | IntVar == 2 | S2<br />

S2 | IntVar == 1 | S1<br />

• The IntGen_1 task sets the mode value (IntVar) in the MTM depending on a randomly generated integer:<br />

if(rand_1 % 3 == 0)<br />

SYS_REQ(SET_MTM_PARAM_INT, "IntGen_1", "IntVar", 2, 0, 0);<br />

• The mode transition occurs internally at the iteration boundary of the SADF graph, handled by the scheduler.<br />

[Figure: the IntGroup subgraph with mode-dependent rates — intGen_1 and intGen_2 each produce 1 token; A and B use rate 1 in S1 and rate 2 in S2; Mix feeds I_Display.]


Contents<br />

1. Introduction<br />

2. Dynamic Behavior Specification in HOPES<br />

3. Self-Adaptive Mapping<br />

4. Preliminary Experimental Results<br />

5. Discussion and Conclusion


Target Architecture<br />

Heterogeneous architecture<br />

• Multicore control processor<br />

• Many-core accelerator: processor arrays<br />

[Figure: CARD architecture — C (SMP control cores with memory and interface), A (many-core accelerator: a NoC of processor tiles with local memories, shared memory modules, an input queue, and I/O), R (reconfigurable HW), and D (HW IPs IP1, IP2), plus an external memory interface.]


Many-core Accelerator<br />

Tile-based NoC architecture<br />

• Homogeneous processors<br />

• Processor tiles + memory tiles<br />

• Distributed shared memory<br />

• Some assumptions for experiments: task code is stored in a shared memory tile; mesh architecture<br />

SPM-based architecture<br />

• Local memory size is given (hundreds of kilobytes at best)<br />

Central Manager (CM)<br />

• Maps the tasks onto tiles dynamically<br />

• Moves the task code to the processor tile if needed


Problem Statement<br />

Input<br />

• HOPES specification<br />

• Dynamic behavior is specified with control tasks + SADF subgraphs<br />

• Requests for system status change are delivered to the CM<br />

• Objects of task mapping:<br />

• Computation tasks with internal parallelism (considered in the future)<br />

• Tasks in SADF subgraphs<br />

Constraint<br />

• Each CIC task has a throughput (or latency) constraint<br />

Problem<br />

• How to map the tasks dynamically while satisfying the real-time constraints?<br />

• Maximize the aggregate throughput surplus (ATS)


Self-Adaptive Mapping Technique<br />

Hybrid Technique: The Basic Idea<br />

• Compile-time mapping of SADF subgraphs (& CIC tasks)<br />

• For a varying number of processors, based on the WCET of each task, perform scheduling to find the real-time performance<br />

• Store the mapping information into the shared memory: a set of (number of processors, {task mapping info.}, throughput) tuples for each mode<br />

• From the minimum # of processors that satisfies the throughput constraint<br />

• To the maximum # of processors beyond which no performance improvement is obtained<br />

• Run-time mapping by the CM (Central Manager)<br />

• Allocate the processors to each task running concurrently<br />

• Bind the (virtual) processors to physical processors<br />

• Map the tasks according to the stored mapping information


Self-Adaptive Mapping Procedure<br />

[Figure: flowchart — compile-time analysis takes the task graphs and WCET profile as input, performs per-task analysis, and stores throughput-maximized mappings for various numbers of processors. At run time, a system status change (task arrival/finish/operation mode change) triggers an initial task-to-virtual-processor mapping from the virtual processor pool; if the mapping fails, a task is dropped from the active task set and mapping is retried; otherwise virtual-processor-to-physical-processor binding completes the mapping.]


A Simple Example<br />

Map the following 4 task graphs onto a 3x3 NoC<br />

Throughput constraint: 1/130<br />

[Figure: four task graphs A (A1–A3), B (B1–B4), C (C1–C3), and D (D1–D2) with node execution times between 25 and 80; candidate static schedules on 1–3 virtual processors achieve throughputs of 1/60, 1/70, 1/80, 1/90, 1/95, and 1/120 depending on the processor count.]


Run-time Mapping<br />

Objective: maximize the aggregate throughput surplus (ATS)<br />

• m(T): current execution mode of T<br />

• th(.): throughput obtained from the static mapping result<br />

• Th_min: throughput constraint of T<br />

• V(T): number of processors allocated to task graph T<br />
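The ATS formula itself did not survive extraction; from the symbols defined above, a plausible reconstruction (an assumption, not necessarily the authors' exact expression) is:

```latex
\mathrm{ATS} \;=\; \sum_{T \,\in\, \text{active tasks}} \Bigl( th\bigl(m(T),\, V(T)\bigr) \;-\; Th_{\min}(T) \Bigr)
```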

Meaning<br />

• If the throughput surplus is large, we may lower the power consumption of the allocated processor tiles<br />

Two-level Mapping<br />

• Node-to-(Virtual) Processor Mapping<br />

• VP-to-PP (Physical Processor) Binding


Node-to-VP Mapping<br />

Objective<br />

• Allocate virtual processors to the active tasks<br />

• Input: 1) active tasks, 2) results of the compile-time analysis, 3) set of virtual processors<br />

• Output: mapping of tasks to virtual processors<br />

Key ideas<br />

• Allocate the minimum number of processors for each task to satisfy its constraints.<br />

• Allocate the remaining processors so as to maximize ATS.<br />

Time complexity<br />

• O(PA^2), where P is the number of processors and A is the number of tasks to be mapped. ATS computation takes O(A).
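The two key ideas can be sketched as a greedy allocator (illustrative Python under assumed data shapes — per task, a throughput constraint plus a table from processor count to statically analyzed throughput; not the HOPES implementation):

```python
def allocate(tasks, total):
    """tasks: {name: (th_min, table)} where table maps a processor count
    to the throughput from compile-time analysis.
    Returns {name: n_procs} or None if the constraints cannot be met."""
    alloc = {}
    # step 1: minimum processors per task that satisfy its constraint
    for name, (th_min, table) in tasks.items():
        feasible = [n for n in sorted(table) if table[n] >= th_min]
        if not feasible:
            return None
        alloc[name] = feasible[0]
    left = total - sum(alloc.values())
    if left < 0:
        return None
    # step 2: hand out remaining processors to the largest throughput gain
    for _ in range(left):
        best, gain = None, 0
        for name, (th_min, table) in tasks.items():
            n = alloc[name]
            if n + 1 in table and table[n + 1] - table[n] > gain:
                best, gain = name, table[n + 1] - table[n]
        if best is None:
            break
        alloc[best] += 1
    return alloc

# hypothetical profiles: throughput (scaled integers) per processor count
tasks = {"A": (10, {1: 8, 2: 12, 3: 15}), "B": (5, {1: 6, 2: 9})}
```

Each extra processor goes to the task whose statically profiled throughput gain is largest, which is what maximizing ATS reduces to once all constraints are met.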


VP to PP Binding<br />

Objective<br />

• Determine the tile positions of virtual processors in the NoC, minimizing the overall communication cost<br />

• Input: 1) set of virtual processors, 2) list of unmapped physical processors (tiles)<br />

• Output: mapping of virtual to physical processors<br />

Key ideas<br />

• Select the next virtual processor to be mapped as the one that has the largest communication volume to the last-mapped virtual processor<br />

• Select the next physical processor that minimizes the sum of MD (Manhattan Distance) from the already-bound physical processors<br />

Time complexity<br />

• O(P^2)
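The two selection rules can be sketched greedily (illustrative Python on an assumed mesh coordinate model; not the HOPES implementation):

```python
def bind(vps, comm, tiles):
    """vps: list of virtual processors; comm[(a, b)]: communication volume
    between VPs a and b; tiles: list of free (x, y) mesh coordinates.
    Returns {vp: tile}."""
    # rule 1: order VPs by communication volume to the last-mapped VP
    order = [vps[0]]
    rest = set(vps[1:])
    while rest:
        last = order[-1]
        nxt = max(rest, key=lambda v: comm.get((last, v), 0) + comm.get((v, last), 0))
        order.append(nxt)
        rest.remove(nxt)
    # rule 2: place each VP on the free tile minimizing total Manhattan
    # distance to the tiles already bound
    placed, free = {}, list(tiles)
    for v in order:
        if not placed:
            tile = free[0]
        else:
            tile = min(free, key=lambda t: sum(abs(t[0] - p[0]) + abs(t[1] - p[1])
                                               for p in placed.values()))
        placed[v] = tile
        free.remove(tile)
    return placed

# hypothetical example: v0 talks mostly to v1, a little to v2
vps = ["v0", "v1", "v2"]
comm = {("v0", "v1"): 5, ("v0", "v2"): 1, ("v1", "v2"): 3}
tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Heavily communicating VPs therefore end up on adjacent tiles, which is the intent of the Manhattan-distance heuristic above.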


Experiments<br />

4 sets of application scenarios<br />

• Set 1: random graphs (G1 ~ G5) with 7x7 NoC<br />

• Set 2: random graphs (G1 ~ G10) with 11x11 NoC<br />

• Set 3: 4 real-life examples with 3x3 NoC<br />

• Set 4: 4 real-life examples with 4x4 NoC<br />

Task graph | # nodes | # virtual proc (min, max) | 1/throughput (min, max) | 1/throughput constraint | Node execution time (us)<br />

(random) | 10 | 4,7 | 1960,1420 | 2520 | [400,1600]<br />

(random) | 26 | 11,16 | 2510,2080 | 2830 | [200,1800]<br />

MPEG2 decoder | 14 | 3,5 | 4378,2562 | 5473 | [300,1954]<br />

MP3 decoder | 7 | 2,4 | 5233,3709 | 6541 | [382,3709]<br />

H263 decoder | 5 | 2,4 | 1143,577 | 1443 | [11,577]<br />

Beamformer | 3 | 1,2 | 1956,994 | 2445 | [481,962]


Throughput gain<br />

Comparison with a dynamic mapping technique, varying the CCR (communication-to-computation ratio)<br />

• The dynamic technique aims to minimize the communication overhead. It finds the minimum initiation interval of task graphs without violating the resource constraints (buffer size).<br />

Assumptions<br />

• Communication overhead is proportional to the hop distance<br />

• Run 100 iterations for each task graph<br />

[Figure: ATS (0–5) versus CCR (1%–4%) for the dynamic and proposed techniques on Sets 1 and 2 (left) and Sets 3 and 4 (right).]


Latency Comparison<br />
Baseline: the latency achieved by the proposed technique<br />
The dynamic approach has longer latency<br />
• It pays a huge code-migration cost.<br />
• Tasks with lower priorities may be delayed for too long.<br />

[Bar charts: latency (avg. per iteration, normalized to the proposed technique) vs. CCR (1%–4%) for Dynamic_Set1–2 and Dynamic_Set3–4]<br />



Communication Overhead<br />
The proposed technique minimizes the code-migration overhead<br />
• Node mapping is preserved unless the system status changes.<br />
• Virtual processors (VPs) are bound to physical processors (PPs) so as to minimize the communication overhead.<br />

Overhead (us)  | Technique | CCR 1% | 2%    | 3%    | 4%<br />
Communication  | Proposed  | 277    | 568   | 864   | 1157<br />
Communication  | Dynamic   | 441    | 888   | 1334  | 1812<br />
Code migration | Proposed  | 230    | 472   | 715   | 959<br />
Code migration | Dynamic   | 20037  | 40620 | 56700 | 76794<br />



Future Work: Processor Sharing<br />
Rationale<br />
• Static mapping may underutilize some processors<br />
• Underutilized processors can be shared between multiple tasks<br />
Determining the “sharable” processors<br />
• Make a pessimistic assumption: the task execution time on a sharable processor is lengthened by the sharing degree<br />
• If the resulting throughput degradation is no worse than removing one processor, the processor is regarded as “sharable”<br />
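The pessimistic sharability test can be sketched numerically. This is a hedged reading of the slide's rule, not the project's actual analysis: the function name, the interval arithmetic, and the interpretation of "lengthened by the sharing degree" (extra delay proportional to the work on the candidate processor) are assumptions.<br />

```python
def is_sharable(inv_throughput_with, inv_throughput_minus_one,
                node_time_on_proc, sharing_degree):
    """Pessimistic sharability test (illustrative sketch).

    inv_throughput_with: 1/throughput (us) with the full allocation.
    inv_throughput_minus_one: 1/throughput (us) with one processor removed.
    node_time_on_proc: per-iteration execution time (us) mapped onto the
        candidate processor.
    sharing_degree: number of tasks that would share the processor.
    """
    # Pessimistically assume the shared work is slowed by the sharing degree.
    slowdown = node_time_on_proc * (sharing_degree - 1)
    shared_interval = inv_throughput_with + slowdown
    # Sharable iff no worse than simply giving up one processor.
    return shared_interval <= inv_throughput_minus_one
```

With these assumed semantics, sharing is accepted only when the pessimistically slowed iteration interval still beats the (P-1)-processor schedule.<br />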



Some Mapping Results<br />

BN(Best Neighbor) Proposed w/o<br />

processor sharing<br />

A Simple Example<br />

Proposed w/<br />

processor sharing<br />



Contents<br />
1. Introduction<br />
2. Dynamic Behavior Specification in HOPES<br />
3. Self-Adaptive Mapping<br />
4. A Preliminary Experiment<br />
5. Discussion and Conclusion<br />



A Simple Smart-phone<br />
Applications (Use Cases)<br />
• Video Player: H.264 Decoder, MP3 Decoder<br />
• Music Player: MP3 Decoder<br />
• Video Phone: x264 Encoder, H.264 Decoder, G.723 Decoder/Encoder<br />
• Menu<br />
Typical Scenario<br />
• Display a menu and receive a user input.<br />
• Depending on the user input, execute the proper application.<br />
• When a call arrives during application execution, the running application is suspended and the video-phone application is executed.<br />
• When the call is finished, the previously suspended application is resumed.<br />
• Return to the menu when the application terminates.<br />



Experimental Application<br />
Task Graph<br />
Control Task<br />
• Controls the applications<br />
• Includes an FSM<br />
• Triggered by two input tasks<br />
UserInput Task<br />
• Displays a menu<br />
• Sends the user input to the control task<br />
Interrupt Task<br />
• Models asynchronous event arrivals (e.g., a phone call)<br />
• Sends a signal to the control task<br />



Control Task Specification<br />
FSM specified in the Control task:<br />

Wait for termination of the current application or an asynchronous signal (phone call);<br />
switch (current_state) {<br />
case MENU:<br />
&nbsp;&nbsp;if (input == 1) execute VideoPlay;<br />
&nbsp;&nbsp;else if (input == 2) execute VideoPhone;<br />
&nbsp;&nbsp;else if (input == 3) execute MusicPlay;<br />
&nbsp;&nbsp;else exit;<br />
case VideoPlay:<br />
&nbsp;&nbsp;if (signal == On)<br />
&nbsp;&nbsp;&nbsp;&nbsp;Suspend VideoPlay & Execute VideoPhone;<br />
&nbsp;&nbsp;else<br />
&nbsp;&nbsp;&nbsp;&nbsp;Execute Menu;<br />
case VideoPhone:<br />
&nbsp;&nbsp;if (previous_state == MusicPlay)<br />
&nbsp;&nbsp;&nbsp;&nbsp;Stop VideoPhone & Resume MusicPlay;<br />
&nbsp;&nbsp;else if (previous_state == VideoPlay)<br />
&nbsp;&nbsp;&nbsp;&nbsp;Stop VideoPhone & Resume VideoPlay;<br />
&nbsp;&nbsp;…<br />
}<br />
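The control-task FSM above can be made concrete as executable code. This is a sketch, not the HOPES control task: state names mirror the pseudocode, but the event interface (`user_input`, `call_signal`, `app_done`) and the recorded action strings are illustrative assumptions.<br />

```python
# Illustrative state names mirroring the slide's pseudocode.
MENU, VIDEO_PLAY, VIDEO_PHONE, MUSIC_PLAY = "Menu", "VideoPlay", "VideoPhone", "MusicPlay"

class ControlTask:
    """Sketch of the control task's FSM (interface is an assumption)."""

    def __init__(self):
        self.state = MENU
        self.previous = None
        self.actions = []  # records issued mapping actions for inspection

    def _do(self, action):
        self.actions.append(action)

    def on_event(self, user_input=None, call_signal=False, app_done=False):
        if self.state == MENU:
            target = {1: VIDEO_PLAY, 2: VIDEO_PHONE, 3: MUSIC_PLAY}.get(user_input)
            if target is None:
                self._do("exit")
                return
            self._do(f"execute {target}")
            self.previous, self.state = self.state, target
        elif self.state in (VIDEO_PLAY, MUSIC_PLAY):
            if call_signal:  # incoming call: suspend playback, take the call
                self._do(f"suspend {self.state}")
                self._do(f"execute {VIDEO_PHONE}")
                self.previous, self.state = self.state, VIDEO_PHONE
            elif app_done:   # playback finished: back to the menu
                self._do("execute Menu")
                self.previous, self.state = self.state, MENU
        elif self.state == VIDEO_PHONE and app_done:
            self._do(f"stop {VIDEO_PHONE}")
            self._do(f"resume {self.previous}")  # resume what the call interrupted
            self.state, self.previous = self.previous, VIDEO_PHONE
```

A run of the typical scenario (play video, take a call, resume video) is then just a sequence of `on_event` calls.<br />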



H.264 Decoder Task<br />
MTM<br />
• Two modes: I-Frame / P-Frame<br />
• Mode variable: FrameVar<br />
• Two transitions: I-Frame ↔ P-Frame<br />
• Sample rates change depending on the mode: some ports have sample rate 0 in I_Frame mode, others have sample rate 0 in P_Frame mode<br />
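The mode-dependent sample rates can be sketched as a tiny mode transition machine. This is an illustrative sketch only: the port names (`intra_pred`, `inter_pred`) and the rate values are assumptions standing in for the real H.264 task graph's ports, and the transition condition on the mode variable is simplified.<br />

```python
class ModeTransitionMachine:
    """Two-mode MTM sketch for the H.264 decoder task.

    In each mode a different subset of ports is active; a port that is
    unused in the current mode effectively has sample rate 0, as on the
    slide. Port names and rates here are illustrative assumptions.
    """

    def __init__(self):
        self.mode = "I_Frame"
        # Per-mode sample rates (0 = port unused in that mode).
        self.rates = {
            "I_Frame": {"intra_pred": 1, "inter_pred": 0},
            "P_Frame": {"intra_pred": 0, "inter_pred": 1},
        }

    def update(self, frame_var):
        """Transition on the mode variable; return the active rates."""
        self.mode = "I_Frame" if frame_var == "I" else "P_Frame"
        return self.rates[self.mode]
```

Because the mode set is finite and the rates per mode are fixed, each mode can be analyzed statically, which is the point made in the conclusion about SADF subgraphs.<br />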



Other SDF Task Graphs<br />
• x264 encoder (single mode)<br />
• MP3 Player (single mode)<br />



Profile Results<br />
Profiling with an Intel i7 machine<br />
• Obtain the WCET of each task node.<br />

H.264 Decoder (usec/frame):<br />
ReadFile 117.04 | Decode 2154.26 | InterPredY 1584 | InterPredU 506.88 | InterPredV 514.8<br />
IntraPredY 1552.32 | IntraPredU 297 | IntraPredV 360.36 | Deblock 623.51 | WriteFile 1245.11<br />

x264 Encoder (usec/frame):<br />
Init 211.27 | ME 6730.02 | Encoder 2953.17 | Deblock 517.77 | VLC 4984.65<br />



Profile Results<br />
Profiling information<br />
• MP3 Decoder (usec/iter.):<br />
VLDStream 175.03 | DeQ 398.22 | Stereo 34.73 | Reorder 33.13 | Antialias 40.01 | Hybrid 823.25 | Subband 632.4 | Writefile 29.62<br />
• G.723 Decoder (usec/iter.): G723Dec 1.85<br />
• G.723 Encoder (usec/iter.): G723Enc 1.38<br />



Self-Adaptive Mapping<br />
Compile-time Analysis<br />
• Slow down the processor speed by x6<br />

Task graph | # nodes | # virtual proc (min, max) | 1/throughput (min, max) | 1/throughput constraint<br />
H.264 decoder (I-frame) | 10 | 3, 4 | 23393, 20415 us | Video: 30 frame/sec<br />
H.264 decoder (I-frame) | 10 | 1, 4 | 53462, 20415 us | Phone: 15 frame/sec<br />
H.264 decoder (P-frame) | 7  | 3, 4 | 21711, 16206 us | Video: 30 frame/sec<br />
H.264 decoder (P-frame) | 7  | 1, 4 | 40204, 16206 us | Phone: 15 frame/sec<br />
MP3 decoder  | 8 | 2, 3 | 8912, 4940 us   | 78 frame/sec<br />
x264 encoder | 5 | 2, 3 | 50734, 41708 us | 15 frame/sec<br />



Test Scenario<br />
1) Play a video clip.<br />
2) While the video is playing, a call arrives.<br />
3) When the call is finished, continue playing the video clip.<br />
4) Return to the menu when the video clip is finished.<br />

Run-time Mapping Result<br />
Throughput gain & latency comparison<br />
[Bar charts: ATS (0–6) vs. CCR (1%–4%) for Dynamic_Video/Proposed_Video and Dynamic_Phone/Proposed_Phone; latency (avg. per iteration, 1–1.25) vs. CCR for Dynamic_Video and Dynamic_Phone]<br />


Run-time Mapping Result<br />
Communication overhead<br />
• The proposed technique minimizes the code-migration overhead<br />

Scenario   | Overhead (us)  | Technique | CCR 1% | 2%    | 3%     | 4%<br />
Video play | Communication  | Dynamic   | 34611  | 67514 | 103160 | 136129<br />
Video play | Communication  | Proposed  | 23053  | 46119 | 69184  | 92248<br />
Video play | Code migration | Dynamic   | 2893   | 6101  | 9200   | 11780<br />
Video play | Code migration | Proposed  | 62     | 124   | 187    | 249<br />
Phone      | Communication  | Dynamic   | 32282  | 61469 | 88928  | 119777<br />
Phone      | Communication  | Proposed  | 23116  | 46245 | 69371  | 92496<br />
Phone      | Code migration | Dynamic   | 4926   | 9414  | 14270  | 18965<br />
Phone      | Code migration | Proposed  | 141    | 283   | 425    | 567<br />



Contents<br />
1. Introduction<br />
2. Dynamic Behavior Specification in HOPES<br />
3. Self-Adaptive Mapping<br />
4. A Preliminary Experiment<br />
5. Discussion and Conclusion<br />



Conclusion<br />
HOPES provides two ways to express dynamic behavior<br />
• At the top level: a control task changes the system-level behavior<br />
&nbsp;&nbsp;• Comparable to RPN (Reactive Process Networks)<br />
• At the lower level: an SADF subgraph with an MTM<br />
&nbsp;&nbsp;• Finite number of modes<br />
&nbsp;&nbsp;• Good for static analysis<br />
We developed a hybrid mapping technique to satisfy the real-time constraints (Self-Adaptive Mapping)<br />
• Compile-time analysis<br />
&nbsp;&nbsp;• For each mode of a CIC task (including SADF subgraphs)<br />
&nbsp;&nbsp;• Static mapping information with a varying number of processors<br />
• Run-time mapping<br />
&nbsp;&nbsp;• Determine the number of (virtual) processors to allocate to each task<br />
&nbsp;&nbsp;• Bind virtual processors (VPs) to physical processors (PPs)<br />
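The run-time half of the hybrid scheme can be sketched as a table lookup over the compile-time results. This is an illustrative sketch, not the HOPES run-time manager: the table shape (`{processor count: analyzed 1/throughput}`) and the smallest-allocation-first policy are assumptions, with example numbers taken from the compile-time analysis table earlier in the deck.<br />

```python
def choose_allocation(static_table, inv_throughput_constraint):
    """Run-time allocation step of the hybrid mapping (sketch).

    static_table: {num_virtual_procs: analyzed 1/throughput in us},
        produced at compile time for the task's current mode.
    Returns the smallest processor count whose analyzed iteration
    interval meets the constraint, or None if no entry does.
    """
    for n in sorted(static_table):
        if static_table[n] <= inv_throughput_constraint:
            return n
    return None
```

For example, using the H.264 I-frame figures (23393 us on 3 processors, 20415 us on 4) against the 30 frame/sec video constraint (about 33333 us), 3 processors already suffice, freeing the fourth for other tasks.<br />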



Thank you!<br />

