The Softer Side of Software Defined Radio - ICCAD
The Softer Side of Software Defined Radio
Copyright © MediaTek Inc. All rights reserved.
By Yuan Lin
Mobile Computing
▪ In 2011, world-wide mobile telephone subscriptions: 5.6 billion
– ~79% of the population
– Some countries have mobile penetration over 100%
– Largest consumer electronic device in terms of volume
▪ Multi-media anywhere at any time
▪ Wireless communication everywhere
Software Defined Radio
[Block diagram: an analog frontend feeds a baseband processor handling TD-SCDMA, LTE, and WCDMA; application processors drive the camera, keypad, display, speaker, and microphone]
Software Defined Radio
[Block diagram: the baseband processor is split into a GPP running the transport, network, link, and MAC layers, and DSPs + ASICs running the PHY]
Software Defined Radio
Software Defined Radio (SDR): the use of software routines instead of ASICs for wireless protocols' physical-layer processing
[Block diagram: the baseband processor comprises a GPP running the transport, network, link, and MAC layers, and DSPs running the PHY]
Mobile SDR Design Challenges
[Chart: peak performance (1-1000 Gops) versus power (0.1-100 W), with power-efficiency contours at 1, 10, and 100 Mops/mW. Mobile SDR requirements sit near 100 Mops/mW, beyond embedded DSPs (TI C6x), high-end DSPs, the IBM Cell, and general-purpose processors]
Reference: Lin et al., "SODA: A High Performance DSP Architecture for Software Defined Radio," IEEE Micro Top Picks, 2007
SDR Processors: Common Trends
▪ Converging DSP architecture model
– scalar + DSP
– VLIW on top of SIMD
– Algorithm-specific accelerations
▪ Higher performance
– wider SIMD
– multiple units
– multiple cores
[Diagram: a scalar core beside multiple DSP cores, each with local memory and algorithm-specific accelerators]
Mobile SDR Design Challenges
[Chart: the same peak-performance-versus-power plot, with SDR DSP processors now reaching the 10-100 Mops/mW region]
Trials and Tribulations of SDR Programming
[Chart: performance (low to high) versus ease of coding (low to high). Compiled C/C++ offers ease of coding but low performance; processor-specific assembly code offers high performance but poor ease of coding. The "programming gap" between them is where we want to be]
▪ High performance
▪ Portability/productivity
– Updates within the same processor family
– Different processor platforms
▪ The situation is getting worse
Sigmatix Confidential
Outline
▪ Language Extension for Wireless
▪ Compilation Challenges
▪ Software Synthesis
▪ DSP Programming Practices
Why C Isn't Enough
▪ C is not designed for describing DSP algorithms
– Algorithm prototyping
• Vector/matrix operations are common
• C obfuscates these operations
– DSP firmware development
• No good way to represent specialized DSP processor architectural features

for (i = 0; i < N/2; i++) {
    out[2*i] = in[i];
    out[2*i+1] = in[i+N/2];
}

Q: What is this vector operation?
A: A basic vector perfect shuffle operation.
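To make the answer concrete, here is a minimal runnable C check of the loop's behavior (the length 8 and the sample values are illustrative, not from the slides):

```c
#include <assert.h>

#define N 8  /* illustrative vector length; must be even */

/* The loop from the slide: a perfect shuffle interleaves the first
 * half of `in` with the second half, element by element. */
void perfect_shuffle(const int *in, int *out)
{
    for (int i = 0; i < N / 2; i++) {
        out[2 * i]     = in[i];
        out[2 * i + 1] = in[i + N / 2];
    }
}
```

With in = {0,1,2,3,4,5,6,7} the output is {0,4,1,5,2,6,3,7}: the two halves interleaved, which a vector permutation instruction could express in a single operation instead of a loop the compiler has to decipher.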
Language Extensions for Wireless Algorithms
▪ A list of "nice-to-have" language features for wireless algorithms
– Vector/matrix
• Vector/matrix arithmetic operations
• Vector/matrix permutation operations
– Fixed-point and complex fixed-point
• e.g., 4-bit complex arithmetic
– Algorithm toolbox
• e.g., multi-precision FFT/FIR library functions
– System programming considerations
▪ We have languages that provide one or more of these features, but none with all of them
– OpenCL: no fixed-point arithmetic or algorithm toolbox
– Matlab: no fixed-point arithmetic or explicit type declarations
– SystemC: no vector/matrix arithmetic or algorithm toolbox

perm_out(out) = op(perm_in0(in0), perm_in1(in1), …)
Compiler Challenges
▪ Compilers don't understand algorithms
– Algorithm-specific parallelization techniques
– Algorithm-specific optimization techniques
– Algorithm-specific approximation techniques
▪ There is no single best algorithm implementation
– Multiple implementations
– Design trade-offs (e.g., memory versus performance)
• e.g., how do we determine the optimal sort algorithm?
▪ Multi-core, multi-level scratchpad memories, and HW-SW co-design make the problem even more challenging
Everywhere you look, there are optimization problems.
Algorithm-Specific Compilers
[Diagram: an FFT specification maps to multiple implementations (FFT Alg1, FFT Alg2, FFT Alg3)]
▪ Exploit algorithm-specific optimizations
– Special parallelization techniques
• Sliding-window technique for Turbo decoding
– Different implementations
• Table lookup or run-time computation
– Special processor accelerations
• Radix-4 FFT instruction
– Different BER
• 16-bit fixed-point versus floating-point
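The table-lookup-versus-run-time-computation trade-off can be sketched with FFT bit-reversal indexing (the function names and 64-point size are illustrative, not output of any actual SDR compiler): either recompute the reversed index on every access, or spend NFFT words of scratchpad memory on a table.

```c
#include <assert.h>

#define LOG2N 6
#define NFFT  (1 << LOG2N)   /* illustrative 64-point FFT */

/* Option 1: run-time computation -- recompute the bit-reversed index
 * (used to reorder FFT inputs/outputs) on every access; no memory cost. */
unsigned bitrev_runtime(unsigned k)
{
    unsigned r = 0;
    for (int b = 0; b < LOG2N; b++) {
        r = (r << 1) | (k & 1u);
        k >>= 1;
    }
    return r;
}

/* Option 2: table lookup -- trade NFFT entries of scratchpad memory
 * for a single load per access. */
static unsigned table[NFFT];

void bitrev_init(void)
{
    for (unsigned k = 0; k < NFFT; k++)
        table[k] = bitrev_runtime(k);
}

unsigned bitrev_lookup(unsigned k) { return table[k]; }
```

Which option wins depends on the target: a DSP with tight scratchpad favors recomputation, while one with spare memory and an expensive bit-manipulation path favors the table; this is exactly the choice an algorithm-specific compiler can make per platform.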
Algorithm-Specific Compilers
▪ Implementation templates
– Multi-core software pipelining
– Multi-core parallelization
– Input/output vector ordering

> fft_compiler –points 64 –arch dsp.arch

[Diagram: an FFT specification (1024-point, complex 16-bit) and a HW specification (512-bit SIMD, 2 cores) feed the FFT compiler, which selects among implementation templates (FFT Alg1, FFT Alg2, FFT Alg3), applies FFT-specific optimizations, and emits high-performance C + intrinsics code]
Last Thought on DSP Programming
▪ How do we know if we have achieved optimal performance?
– The code is fast enough because…
• Um, my boss is happy with it?
▪ How much of the performance is limited by our own perceived upper bound?
– Performance code checker
– Performance estimator
Performance Code Checker
▪ Every DSP processor comes with its own programmer's manual
– A list of good and bad coding practices
– Some are universal; some are processor/compiler-specific
▪ An analysis tool that checks for bad programming practices
– A different set of analysis rules for each processor and compiler
– e.g., non-vectorizable loop structures, wrong intrinsics, etc.
▪ Lint for DSP C code
Why Use a Code Checker?
What's wrong with this code?

void foo()
{
    ...
    char index;
    for (i = 0; i < 1000; i++) {
        array[index++] = vec[i] / 10;
    }
}

On the Cognovo platform: only unsigned types should be used for array indexing, because of an AGU + compiler quirk.
On TI C64x: there is no integer divide unit, so the division becomes a function call and disables software pipelining.
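As a sketch of how the loop might be rewritten to satisfy both manuals' advice (the helper name is hypothetical, and the reciprocal trick assumes non-negative inputs that fit in 16 bits): the index becomes unsigned, and the divide-by-10 becomes a multiply by a fixed-point reciprocal so no divide routine is called and the loop stays software-pipelinable.

```c
#include <assert.h>

#define LEN 1000

/* Reworked version of the slide's loop (illustrative, not vendor code):
 * - unsigned loop index, satisfying the AGU-friendly indexing rule
 *   (a char index would also wrap at 256 and corrupt the addressing);
 * - x / 10 replaced by (x * 52429) >> 19, a precomputed fixed-point
 *   reciprocal that is exact for 0 <= x <= 32767, avoiding the
 *   integer-divide function call that blocks software pipelining. */
void scale_by_tenth(const int *vec, int *array)
{
    const unsigned recip = 52429u;  /* round(2^19 / 10) */
    for (unsigned i = 0; i < LEN; i++) {
        array[i] = (int)(((unsigned)vec[i] * recip) >> 19);
    }
}
```

A checker that knows both rules can flag the original loop automatically, which is the point of the "Lint for DSP C code" idea above.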
Performance Estimator
▪ Provide performance estimates during algorithm prototyping
[Diagram: the prototype feeds SoC 1, SoC 2, and SoC 3 simulations, each reporting estimated cycles, power, and area (x0/y0/z0, x1/y1/z1, x2/y2/z2)]
Thanks
Building Predictable Cyber-Physical Systems from Dynamic Applications and Platforms
Department of Electrical Engineering, Electronic Systems
Sander Stuijk
Based on joint work with Twan Basten, Marc Geilen, Bart Theelen, and many others
Embedded streaming systems
Application trends
Uncertainty
Concurrency
Dynamism
Model-based design
[Diagram: modeling, analysis, implementation, and run-time management combine into a design flow that maps applications (App 1, App 2, …) onto a platform; the resulting execution is visualized as a Gantt chart]
WLAN application
New OFDM symbol every 4.0 μs
The Sync, Header, and Payload scenarios each process one symbol; the CRC scenario processes no symbol
Ports may have rates; a rate of one is omitted for clarity
Initial tokens return to their original distribution after one iteration
WLAN application
Each scenario may be modeled with a different scenario graph
Persistent token names relate the initial tokens in different scenario graphs
WLAN application
[Gantt chart: actors Src, Shift, Sync, Hdem, Hdec, and Pars on resource rows src, shift, and pars over 0-12 μs, showing the sync scenario]
WLAN application
[Gantt chart: the same actors over 0-12 μs, now showing the sync and header scenarios]
Analyzing SADF graphs
Gantt chart for one scenario sequence
[Gantt chart: actors Src, Shift, Sync, Hdem, Hdec, and Pars on rows src, shift, and pars over 0-12 μs for the sync and header scenarios]
Execution is a sequence of vector shapes
[Vector shapes for Sync and Header: token time stamps of 0 ns, 4000 ns, and 5940 ns on src, shift, and pars]
Token time stamps (vector shapes) provide constraints for the next iteration
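The "constraints for the next iteration" are naturally expressed in max-plus algebra: the next vector of token time stamps is x' = A ⊗ x, where (A ⊗ x)_i = max_j (A_ij + x_j). A small C sketch of one such iteration (the 3×3 delay matrix below is purely illustrative, not the actual WLAN scenario matrix):

```c
#include <assert.h>

#define N 3
#define NEG_INF (-1.0e18)  /* stands in for "no dependence" */

/* One max-plus matrix-vector product: out[i] = max_j (A[i][j] + x[j]).
 * A[i][j] is the delay from token j to token i within one graph
 * iteration; entries equal to NEG_INF mean token i does not depend
 * on token j. */
void maxplus_mul(double A[N][N], const double x[N], double out[N])
{
    for (int i = 0; i < N; i++) {
        double best = NEG_INF;
        for (int j = 0; j < N; j++) {
            if (A[i][j] > NEG_INF && A[i][j] + x[j] > best)
                best = A[i][j] + x[j];
        }
        out[i] = best;
    }
}
```

Iterating this product from the initial time stamps yields the sequence of vector shapes; the growth rate of the vectors along cycles is what the throughput analysis on the next slides extracts.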
Analyzing SADF graphs
[Max-plus automaton: states for the Sync, Header, Payload, and CRC scenarios, with edges labeled by a timing contribution and symbol count (e.g., "4000 ns, 1"; "0 ns, 0"), and each state carrying a vector shape over the tokens src, shift, payload, and pars]
Analyzing SADF graphs
Throughput: maximum cycle mean/ratio (MCM/MCR) analysis on the max-plus automaton
Latency: longest-path analysis
[Max-plus automaton: the same Sync/Header/Payload/CRC automaton as on the previous slide]
Predictable scenario-aware design-flow
[Platform diagram: tiles containing ARM, SWC, and EVP processors, each with IMEM, DMEM, and a network interface (NI), connected by an interconnect; the question is how to map the application onto it]
Design-flow steps: compute buffer constraints, unified resource binding, static-order scheduling, TDMA time-slice allocation, …
Compute buffer constraints
Model buffer-size constraints with a back-edge carrying initial tokens
[Flow: start with a buffer size of 1 token per edge, run throughput analysis, enlarge the buffer to 2 tokens, and run throughput analysis again]
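The enlarge-and-re-analyze loop can be mimicked with a tiny self-timed simulation of a two-actor pipeline (the actor execution times of 3 and 5 time units are invented for illustration, not taken from the WLAN graph): with a 1-token buffer the producer stalls on back-pressure, and enlarging the buffer to 2 tokens leaves the slower consumer as the only bottleneck.

```c
#include <assert.h>

#define FIRINGS 50  /* enough firings to reach steady state */

/* Self-timed execution of producer (3 time units) -> bounded FIFO ->
 * consumer (5 time units). pe[k]/ce[k] hold the finish time of the
 * k-th producer/consumer firing; the back-edge with `bufsize` initial
 * tokens shows up as the producer waiting for the consumer to free
 * a buffer slot. Returns the steady-state iteration period. */
int steady_period(int bufsize)
{
    int pe[FIRINGS + 1], ce[FIRINGS + 1];
    pe[0] = ce[0] = 0;
    for (int k = 1; k <= FIRINGS; k++) {
        /* slot for token k frees when token k - bufsize is consumed */
        int space = (k > bufsize) ? ce[k - bufsize] : 0;
        int pstart = (pe[k - 1] > space) ? pe[k - 1] : space;
        pe[k] = pstart + 3;
        int cstart = (ce[k - 1] > pe[k]) ? ce[k - 1] : pe[k];
        ce[k] = cstart + 5;
    }
    return ce[FIRINGS] - ce[FIRINGS - 1];
}
```

Here the period drops from 8 to 5 when the buffer grows from 1 to 2 tokens and does not improve further, which is exactly the kind of throughput-versus-buffer-size trade-off the analysis explores.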
Predictable scenario-aware design-flow
[Platform diagram: the application bound to the EVP, SWC, and ARM tiles across the interconnect]
Design-flow steps: compute buffer constraints, unified resource binding, static-order scheduling, TDMA time-slice allocation, …
Unified mapping avoids the need for a runtime reconfiguration mechanism
Static-order schedules may change between scenarios
Modeling timing impact of platform
[Platform diagram: the mapped application, where a connection over the interconnect has a delay]
The binding-aware dataflow graph models each connection delay with an added delay actor (D)
Predictable scenario-aware design-flow
[Platform diagram: the SWC resource is shared, so only a static-order (SO) schedule is used on it]
WLAN application
[Gantt chart: rows src, shift, payload, and pars plus evp, swc, and arm activity over 0-20 μs, covering the sync, header, payload, and crc scenarios, with one OFDM symbol every 4 μs]
Initial tokens of resources capture resource availability
Timing requirements seem to be met…
Model-based design
[Diagram: modeling, analysis, implementation, and run-time management combine into a design flow that maps applications onto a platform; the resulting execution is visualized as a Gantt chart]
Run-time reconfiguration
[Platform diagram: the mapped application on the EVP, SWC, and ARM tiles]
DVFS changes actor execution times
DVFS settings are modeled with system scenarios
WLAN application
[Gantt chart: rows src, shift, payload, and pars plus evp, swc, and arm activity over 0-36 μs; the sync/header/payload/crc scenarios alternate between DVFS configurations c1 and c2, with reconfigurations c1 → c2 and c2 → c1]
The latency between reception of an OFDM symbol and its processing increases
Processing cannot keep up with frequent reconfiguration
Summary
Strategy for designing predictable systems running dynamic applications
Scenarios capture dynamic (application and system) behavior
Resource- and energy-efficient implementations
Predictable implementations
SADF model-of-computation
Provides many analysis techniques
Provides an implementation trajectory
Analysis and implementation techniques are implemented in the SDF3 tool kit: www.es.ele.tue.nl/sdf3
Methodology and Tools for Design of Energy Efficient Multi-Core Chips
Nagu Dhanwada
IBM Electronic Design Automation, Systems and Technology Group
Outline
Challenges in Multi-Core Chip Design
Reference Power Aware Design Methodology for Multi-Core Chips
Tools and Use Cases
Early Analysis Tool for Multi-Core Designs
Power Management Exploration and Design
Architecture and Algorithm Exploration in Embedded Systems
Introduction: Multi-Core Chip Design Challenges
Time to Market
IP Integration
Heterogeneity
Validation
Performance
Cache Coherency
Energy Efficiency
Complex Power Management
Productivity and Quality
Design with third-party IP
Designing with potentially unreliable components
Introduction: Multi-Core Chip Design Challenges
Complex Power Management for Energy Efficiency
Global Dynamic Voltage and Frequency Scaling
Power Capping
Guard-Band Reduction
Power Throttling
Power Budgeting
What do we need?
Standards-based power-aware system-to-silicon flows
Tools within these flows supporting early analysis and design
Reference Power Aware Design Methodology for Multi-Core Chips
System-to-Silicon Power Aware Design Flow
[Flow diagram: ESL design (design & mapping, analysis & optimization, validation), RTL design (design & integration, analysis & optimization, verification & test), and implementation (optimization & closure, analysis, verification & test), with power models, power intent, and content spanning the flow. Power models: no standard today. Power intent: supported by Liberty today, an LPC contribution to LTAB]
A comprehensive end-to-end low-power design flow: high-level power modeling, with power models that span the entire flow.
ESL Design Phase
[Flow diagram: the initial application specification splits into a high-level hardware architecture (function + communication architecture) and a software specification. ESL hardware design & optimization and ESL embedded software design meet in ESL co-design and mapping; high-level hardware/software integration then produces the embedded software image. Analysis and validation span the whole phase]
RTL Design Phase
[Flow diagram, design and integration: IP and chip specifications drive module coding and chip integration (global clock gating, power domains, DFT), producing module- and chip-level power formats with power constraints. Analysis and optimization run against the chip specification]
Implementation Phase
[Flow diagram: partitioned RTL, power constraints, tool directives, libraries, and physical data feed design optimization and closure: synthesis, floorplan, placement, clock tree, power optimization & closure, route, power rule verification, and IR analysis]
ESL Design Phase: Analysis and Optimization
[Flow diagram: workloads/stimulus (benchmark programs/traces, random stimulus generators) and architecture parameters (initial configuration) drive a SystemC simulator over the ESL design description. Power calculation combines power models with the power intent format to produce power reports, which feed optimization and refinement and, ultimately, RTL design and simulation]
RTL Design Phase: Analysis and Optimization
[Flow diagram: the same analysis loop at RTL, with a VHDL/Verilog simulator over the RTL description and power calculation against implementation-level power models producing power reports for optimization and refinement]
Power Intent Formats: Overview
Functional design description
Assumes power is always on, running at constant voltage
Power intent formats
Capture the variation of power over time (power management specification)
Power intent + functional description = representation of an actively power-managed design
Power Intent Formats: Overview
Structural aspects
Interaction between design elements having time-varying power characteristics
State restoration, re-initialization
Examples: power domains, switches, level shifters, isolation cells, state-retention logic, power control signals
Behavioral aspects
Effect of power variation on the computation model
Enumerate the state of the simulation model for each set of design elements driven by the same voltage
Power Intent Specification Example
[Block diagram: an image subsystem with power domain PD1 (0.8-1.0 V, switchable: the image core with bayer data_in and jpeg data_out, plus core read/write transactors handling core read/write requests and data) and PD2 (1.0 V, switchable: slave interface (read/write), memory-mapped register bank, master read/write interfaces, buffer, and DMA read/write control)]

set_design img_subsys
create_power_domain –name PD1 –instances {img_core core_read core_write} -default
create_power_domain –name PD2 –instances {slave_intf master_read master_write}
create_nominal_condition –name standby –voltage 0.8 –state standby
create_nominal_condition –name low –voltage 0.9 –state on
create_nominal_condition –name high –voltage 1.0 –state on
create_nominal_condition –name off –voltage 0 –state off
create_mode –name full_speed –conditions {PD1@high && PD2@high}
create_mode –name low_speed –conditions {PD1@low && PD2@high}
create_mode –name sleep –conditions {PD1@standby && PD2@high}
create_mode –name core_off –conditions {PD1@off && PD2@high}
create_mode –name all_off –conditions {PD1@off && PD2@off}
end_design

Slide courtesy: Si2 LPC Format Working Group
Power Modeling of Complex IP: Issues
Current standards (Liberty) are oriented toward small IP
Can include complete states & transitions
Dependence on input slew, output load
Complex IP
Several inputs, states, power modes, and internal parts
Much internal power dissipation
Completeness of the model is key to enabling high-level design
Adequate parameterization to capture sensitivity to design decisions
Power states/modes may be:
Intentionally designed, e.g., power modes in power intent specifications, power-domain switching
A result of behavior/activity, e.g., clock gating, CPF functional modes
Power Modeling of Complex IP: Issues
State power modeling challenges
Exponential state explosion
Requires non-mutually-exclusive power states
Being addressed in current standards
Transition energy modeling challenges
Exponential state-transition explosion
Separation of internal transition energy from pin power
Transitions between mutually exclusive states
Proposals to address efficient and uniform modeling of complex IP within current standards are being developed in the Si2 Low Power Coalition
Tools and Use Cases
SLATE: Early Analysis Tool for Multi-Core
SLATE: System-Level Analysis Tool for Early Exploration
A tool for early power, performance, physical, and thermal characteristics of multi-core designs
[Screens: a graphic front-end for chip integration (block diagram of AXU/C3/L2 cores with L3, MC, IO, accelerators, and an ASIC; third-party IP imported via industry-standard models); a performance/functional model (utilization, 0-100, versus number of connections, 2-100, for CPU and Rx with and without Tahoe); power analysis (power, 0-2 W, versus packet size, 64-8192 bytes); chip floorplan; interconnect analysis; implementation; thermal analysis]
Use Scenarios
Power management design exploration in high-performance servers
Early analysis system configured to use trace-driven performance models for POWER4-based processor cores
Architecture and algorithm exploration in embedded systems
Early analysis system configured to run embedded PowerPC and CoreConnect models in execution-driven mode, running real software
Power Management Design Exploration: ESL Analysis Use Scenario
[Diagram: a system simulation model (processor model, PLB with arbiter, PLB master, PLB slave (HSMC), PLB-OPB and OPB-PLB bridges, OPB bus with arbiter, UIC, OPB master and slave) with a power management unit (algorithms, control modes) applying Vdd and frequency changes; power management policy optimization (manual or tool-driven) consumes the resulting performance and power numbers. Inputs: architecture configuration, power management specification, power and clock domain definitions, and power models]
Power Management Studies in SLATE
Per-core vs. chip-wide DVFS
Fetch throttling (I-cache throttling)
Parameterizable: frequency-change penalty; number of discrete V/F levels, or continuous mode
Easy to change the PMU algorithm; easy to try different configurations
[Diagram: a PMU on a PMU bus controls core0-core3 and a multiple-clock generator (clk0-clk3) through calls such as set_mode(), set_frequency(), set_throttling_factor(), get_cpi(), get_temperature(), get_num_commits(), get_num_decodes(), and get_power_modes()]
Power Management Studies: DVFS Algorithms
Discrete MaxBIPS
Assumes a set of discrete power modes (Vdd-frequency pairs) for each core, which the power manager can control individually.
Goal: maximize overall chip performance under a given power budget.
Chip performance: the total number of instructions completed by all cores per time period.
Continuous approach (CPM)
Non-linear programming to model the same DVFS problem with continuous power modes.
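A toy version of discrete MaxBIPS (the per-mode BIPS and power numbers below are invented for illustration; the slides do not show SLATE's actual implementation) simply enumerates one mode assignment per core and keeps the best total performance that fits the budget:

```c
#include <assert.h>

#define CORES 3
#define MODES 3

/* Illustrative per-core performance (BIPS) and power (W) for each
 * Vdd-frequency pair; mode 0 is fastest and hungriest. */
static const double mode_bips[MODES]  = {1.0, 0.8, 0.5};
static const double mode_power[MODES] = {4.0, 2.5, 1.2};

/* Exhaustive discrete MaxBIPS: try all MODES^CORES assignments
 * (feasible for small core counts), return the best total BIPS
 * within `budget`, or -1.0 if even the lowest modes exceed it.
 * The winning assignment is written to best_mode[]. */
double maxbips(double budget, int best_mode[CORES])
{
    double best = -1.0;
    for (int code = 0; code < MODES * MODES * MODES; code++) {
        int m[CORES], c = code;
        double bips = 0.0, power = 0.0;
        for (int i = 0; i < CORES; i++) {
            m[i] = c % MODES;
            c /= MODES;
            bips  += mode_bips[m[i]];
            power += mode_power[m[i]];
        }
        if (power <= budget && bips > best) {
            best = bips;
            for (int i = 0; i < CORES; i++) best_mode[i] = m[i];
        }
    }
    return best;
}
```

Real power managers replace the exhaustive search with heuristics or, in the continuous (CPM) case, non-linear programming, since the search space grows as MODES^CORES.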
Chip Performance for Decreasing Power Budgets
[Chart: relative chip performance (80-100) versus power budget (100 down to 45) for chip-wide discrete, chip-wide continuous, per-core continuous, and per-core discrete DVFS]
Individual Core Performance under Per-Core DVFS
[Chart: relative performance (80-100) versus power budget (100 down to 45) for the eon, twolf, and perl workloads in discrete and continuous (-c) variants]
Use Scenario 2: Example Embedded System
[Block diagram: a 405 CPU on a 64-bit PLB with PLB arbiter, HSMC, EBC, MAL, and DMA; a PLB-OPB bridge to a 32-bit OPB with OPB arbiter, UIC, GPIO, UART0, UART1, IIC, and GPT; an EMAC with RX/TX FIFOs; clock/reset generation with PLL and PM]
Embedded PowerPC 4xx & CoreConnect IP cores, RISCWatch debugger, enabled for the GCC tool chain
Architecture Exploration: Ethernet Packet Processing<br />
PPC405 platform-based design; Ethernet subsystem with 1 EMAC and 1 MadMAL.<br />
A real embedded application executes on the TLM models to measure the effects on performance and power.<br />
Change to improve performance: an extra EMAC plus FIFOs was added. In the two-EMAC mode of the application, packets are transmitted from one EMAC and received by the other.<br />
[Charts: system throughput (KBytes/sec, up to ~30000), CPU utilization (% of total time), and power (mW, ~100-350) vs. packet size (64-4096 bytes), each comparing the 1-EMAC and 2-EMAC configurations]
Architecture Exploration in Embedded Systems<br />
Speed-up and efficiency for an order-16 matrix multiplication, in two configurations (with and without on-chip memory):<br />
# Cores | Execution Time | Speed-up | Efficiency (%)<br />
1 | 5205730 | 1 | 100<br />
2 | 2632011 | 1.977 | 98.89<br />
4 | 1338610 | 3.889 | 97.22<br />
6 | 917861 | 5.671 | 94.53<br />
8 | 695988 | 7.479 | 93.49<br />
# Cores | Execution Time | Speed-up | Efficiency (%)<br />
1 | 17648256 | 1 | 100<br />
2 | 8886615 | 1.986 | 98.89<br />
4 | 4488740 | 3.931 | 97.22<br />
6 | 3023774 | 5.836 | 94.53<br />
8 | 2293580 | 7.695 | 93.49<br />
[Chart: speed-up vs. number of processor cores (1-8), with and without on-chip memory]
Algorithm Exploration in Embedded Systems<br />
Multicore matrix multiplication using a parallel algorithm with caches turned on, on an eight-core CoreConnect-based SOC (4 KB data cache, 32 KB instruction cache).<br />
[Charts: execution time, duration, and idle time (in ns) vs. number of processor cores (1-8), for order-16 (up to ~6,000,000 ns) and order-32 (up to ~45,000,000 ns) matrix multiplication]
Load Balancing across Cores in a Multi-core SOC<br />
[Chart: efficiency (%) of the system vs. number of processor cores (0-10) as the order of the matrix varies (order 8, 16, 32)]<br />
Load breakup among the processor cores, in percentage: CPU0-CPU4 12% each, CPU5 14%, CPU6 13%, CPU7 13%.<br />
Activity comparison among the processor cores; can be used for exploring load-balancing strategies.
Accuracy of Transaction Level Models<br />
Comparisons between simulated models and real hardware demonstrate the accuracy of transaction-level models for early analysis and design-space exploration:<br />
errors below 15% in timing accuracy and below 11% in power estimation.
Summary<br />
Standards-based power-aware flows are key to achieving energy-efficient multi-core designs.<br />
Flows need common power models and power-intent descriptions across levels of abstraction.<br />
Integrated pre-RTL analysis and exploration in power-aware flows is needed for efficient design of advanced system architectures.
Acknowledgements<br />
John Darringer, David Hathaway, Arun Joseph,<br />
Jerry Frenkil, Rhett Davis, Qi Wang
References<br />
N. Dhanwada, R. Bergamaschi, W. Dungan, I. Nair, P. Gramann, W. Dougherty, I. Lin, "Transaction-level modeling for architectural and power analysis of PowerPC and CoreConnect-based systems", Design Automation for Embedded Systems (2006) 10:105-125.<br />
Si2 Low Power Coalition, "Si2 High Level Power Modeling Requirements", Jun. 2011. [Online] http://si2.org/openeda.si2.org/project/showfiles.php?group_id=76#p115v1.2<br />
R. Bergamaschi, I. Nair, G. Dittmann, H. Patel, G. Janssen, N. Dhanwada, A. Buyuktosunoglu, E. Acar, G. Nam, G. Han, D. Kucar, P. Bose, J. Darringer, "Performance Modeling for Early Analysis of Multi-Core Systems", Proceedings of CODES+ISSS 2007.<br />
R. Bergamaschi, G. Han, A. Buyuktosunoglu, H. Patel, I. Nair, G. Dittmann, G. Janssen, N. Dhanwada, Z. Hu, P. Bose, J. Darringer, "Exploring Power Management in Multi-Core Systems", Proceedings of ASP-DAC 2008.
Dynamic Behavior Specification and<br />
Dynamic Mapping for Real-time<br />
Embedded Systems in HOPES<br />
Nov. 8, 2012<br />
Soonhoi Ha, w/ Hanwoong Jung and Chanhee Lee<br />
Seoul National University<br />
1
1. Introduction<br />
2. Dynamic Behavior Specification in HOPES<br />
3. Self-Adaptive Mapping<br />
4. Preliminary Experimental Results<br />
5. Discussion and Conclusion<br />
Contents<br />
2 HOPES project, SNU
Parallel Embedded SW Design Challenge<br />
Target-independent parallel programming for non-trivial<br />
heterogeneous systems with diverse design constraints<br />
(time, power, temperature, cost, and so on)<br />
Problem: model-based design of parallel embedded systems<br />
• Parallelism extraction (multi-mode multi-tasking apps.)<br />
• Functional parallelism & data-parallelism<br />
• Partitioning and mapping<br />
• Parallel code generation<br />
• Performance estimation and verification<br />
• Design space exploration<br />
3 HOPES project, SNU
Key Idea<br />
Programming platform (CIC)<br />
• Meet-in-the-middle approach<br />
• Role of the "execution model"<br />
[Diagram: applications map onto the programming platform (CIC) via model-based design; the CIC maps onto the software and hardware platforms via manual design]<br />
4 HOPES project, SNU
HOPES Design Flow<br />
[Flow diagram: dataflow model, UML, and KPN specifications are automatically translated into the Common Intermediate Code, which can also be written manually; task codes capture the algorithm and an XML file describes the architecture; after task mapping, CIC translation generates C code for various targets, guided by performance libraries/constraints, static analysis, and a virtual prototyping system]<br />
5 HOPES project, SNU
CIC (Common Intermediate Code)<br />
Basically an actor-oriented model (or extended dataflow model)<br />
• Execution model of a parallel architecture<br />
• Defines the semantics for task scheduling and task interaction<br />
OS-level task model<br />
• Large granularity: thread/function (atomic mapping unit)<br />
• Implicitly assumes the existence of an OS or a run-time system<br />
[Diagram: CIC task codes T1-T4 connected by channels of type FIFO or array (an array channel models a shared memory with indexed slots), plus a control task; the algorithm model exposes the available parallelism, and mapping combines it with architecture information, control, and profile data]<br />
6 HOPES project, SNU
CIC Task Model<br />
3 types of tasks<br />
• Computation task: data parallelism is expressed<br />
• Control task: defines the execution mode of computation tasks<br />
• Library task: expresses vertically-layered or server-client SW<br />
Execution semantics<br />
• Time-driven or data-driven<br />
3 types of ports<br />
• Data port: communicates messages between CIC tasks<br />
• System port: communicates with the OS or run-time system<br />
• Library port: calls a library task<br />
Channel semantics<br />
• FIFO channel<br />
• Array channel: indexed access for data-parallel execution<br />
• Buffer channel<br />
7 HOPES project, SNU
CIC Task Code: Definition<br />
A CIC task is defined by three methods:<br />
• TASK_INIT: before the main loop<br />
• TASK_GO: in the main loop<br />
• TASK_WRAPUP: after the main loop<br />
Generic APIs are used for target independence.<br />
TASK_INIT { /* task initialization code */ };<br />
TASK_GO {<br />
    MQ_RECEIVE("mq0", (char *)(ld_106->rdbfr), 2048);<br />
    ...<br />
    //task_body()<br />
    MQ_SEND("output", (char *)(st_107->buf), 2048);<br />
}<br />
TASK_WRAPUP { /* task wrapup code */ };<br />
8 HOPES project, SNU
CIC Translation<br />
CIC to multi-threaded code for functional simulation<br />
• Generated code runs on a host machine<br />
CIC to target C code<br />
• Target-specific code generation: for virtual prototyping, MPCore, the Cell processor, and GPGPU; planned: DSP arrays and reconfigurable hardware<br />
• Per-processor code generation based on mapping information<br />
• Multi-threaded task codes<br />
• Interface code generation<br />
• Scheduler code generation<br />
[Diagram: SMP or heterogeneous cores, a DSP array, HW IP, and reconfigurable HW connected by a communication network]<br />
9 HOPES project, SNU
Challenges<br />
Lane detection algorithm on GPU<br />
• CPU+GPU heterogeneous platform<br />
• Multicore CPU: multithreading<br />
• Support for multiple GPUs<br />
• CIC translation with asynchronous communication between CPU and GPU<br />
[Task graph: Load Image → YUV to RGB → Gaussian → Sobel; denoising filters (KNN, NLM) feed Blending and Sharpen; lane detection filters (Non-Maximum Suppression, Hough Transform) feed Draw Lane; the branches Merge, then RGB to YUV → Store Image]<br />
10 HOPES project, SNU
Experimental Results<br />
Processor | Tasks<br />
CPU | LoadImage, Draw Lane, StoreImage<br />
GPU 0 | YUVtoRGB, Gaussian, Sobel, Non-Maximum, Hough, Merge<br />
GPU 1 | KNN, NLM, Blending, Sharpen, RGBtoYUV<br />
Configuration | Time<br />
CPU only | 2109.5 sec<br />
1 GPU, sync | 15.0266 sec<br />
1 GPU, async with 2 streams | 11.9998 sec<br />
1 GPU, async with 3 streams | 12.3378 sec<br />
1 GPU, async with 4 streams | 12.0846 sec<br />
2 GPUs, sync | 11.332 sec<br />
2 GPUs, async with 2 streams | 10.247 sec<br />
2 GPUs, async with 3 streams | 9.7842 sec<br />
2 GPUs, async with 4 streams | 9.7798 sec<br />
11 HOPES project, SNU
1. Introduction<br />
2. Dynamic Behavior Specification in HOPES<br />
3. Self-Adaptive Mapping<br />
4. Preliminary Experimental Results<br />
5. Discussion and Conclusion<br />
Contents<br />
12 HOPES project, SNU
Dynamic Behavior<br />
At the system level<br />
• The set of user tasks running concurrently may change with user demand<br />
At the application level<br />
• The algorithm may have multiple modes of operation<br />
• Execution times of tasks may vary<br />
At the OS level<br />
• The QoS requirement changes the mode of operation<br />
At the hardware level<br />
• Unpredictable resource availability<br />
• Temporary or permanent failure of processing elements<br />
13 HOPES project, SNU
Two-level Specification in HOPES<br />
At the top level<br />
• A control task manages the execution state of computation tasks<br />
• Each mode of operation (or use case) is defined by a set of CIC tasks that run concurrently<br />
• The mode of operation may change dynamically<br />
• The control task specifies the mode change by an FSM<br />
At the task level<br />
• A CIC task may have an SADF (scenario-aware dataflow) graph inside<br />
• The behavior of a task may change dynamically<br />
• Finite number of scenarios of operation<br />
• Each scenario is specified by an SDF graph<br />
14 HOPES project, SNU
Control Task<br />
Dynamic behavior modeling<br />
• A control task can control the execution of computation tasks by using predefined control APIs<br />
• Triggered by data inputs from computation tasks (or by checking task state)<br />
• Sends control messages to the OS via a system port<br />
• Similar to a statechart in STATEMATE or an fFSM in PeaCE<br />
[Diagram: a CIC control task observing and controlling the CIC computation tasks]<br />
15 HOPES project, SNU
Internal Specification of a Control Task<br />
Internal behavior is specified with an FSM model<br />
• Assumes an implicit timer in the system, which may generate real-time events<br />
• A code template is automatically generated<br />
16 HOPES project, SNU
Code Example<br />
while (1) {<br />
    MQ_AVAILABLE(all_ports);                          // 1-1. Check for a new event<br />
    SYS_REQ(CHECK_TASK_STATE, "task_name", ...);      // 1-2. Check the termination of a task<br />
    if (available) MQ_RECEIVE(selected_port);         // 2. Read the new event<br />
    if (some event or task state is triggered) break; // 3. Break the loop to make a transition<br />
}<br />
switch (current_state)<br />
{<br />
case ID_STATE_S1:<br />
    if (selected_port == 1 && input_data == 2) {      // 4. Check the transition condition<br />
        current_state = ID_STATE_S2;<br />
        SYS_REQ(SET_PARAM_INT, "FloatGroup", "FloatVar", input_data, 0, 0);<br />
    }                                                 // 5. Send the control message through the system port<br />
    break;<br />
case ID_STATE_S2:<br />
    if (...) {<br />
        ...<br />
    }<br />
    ...<br />
}<br />
17 HOPES project, SNU
PC + NXT Robot Example<br />
Control the NXT robot from both a PC and the robot itself.<br />
1. The SensorDetect task reads sensor values and sends them to two control tasks: ControlPC and ControlNXT.<br />
2. The KeyDetect task reads the key input value and sends it to the ControlPC task.<br />
3. Controlled by the ControlPC and ControlNXT tasks, the Move and Grab tasks run the motors.<br />
4. The LCD task displays the current status of the NXT.<br />
[Task graph: KeyDetect → ControlPC; SensorDetect → ControlPC and ControlNXT; ControlNXT → Move, Grab, LCD]<br />
18 HOPES project, SNU
PC + NXT Robot Example<br />
Control NXT Task<br />
1. The ControlNXT task controls the NXT robot itself.<br />
2. The ControlNXT task includes several scenarios for the robot (decisions of the ControlNXT task).<br />
Condition: the NXT robot senses a black line on the floor<br />
1) The robot stops immediately.<br />
2) After 3 seconds,<br />
2-1) if the current motion of the robot is forward, the robot starts to go backward;<br />
2-2) if the current motion of the robot is backward, the robot starts to go forward.<br />
3) 2 seconds after the above action, the robot stops and starts to spin.<br />
Condition: the NXT robot hears a loud sound<br />
1) The robot immediately folds/unfolds its arm:<br />
1-1) if the arm is currently folded, the robot unfolds it;<br />
1-2) if the arm is currently unfolded, the robot folds it.<br />
19 HOPES project, SNU
PC + NXT Robot Example<br />
Control NXT Task<br />
1. The ControlNXT task is specified in an FSM manner.<br />
TASK_GO {<br />
switch(stateLight) {<br />
case 0: //INIT<br />
break;<br />
…<br />
case 4: // BACKWARD<br />
SYS_REQ(SET_PARAM_INT, "Move","motion",BACKWARD,id1,0);<br />
SYS_REQ(RUN_TASK, "Move",id1,0);<br />
set_time_base = SYS_REQ(GET_CURRENT_TIME_BASE);<br />
timer_id1 = SYS_REQ(SET_TIMER, set_time_base, 2);<br />
stateLight = 6;<br />
break;<br />
case 5: // FORWARD<br />
SYS_REQ(SET_PARAM_INT, "Move","motion",FORWARD,id1,0);<br />
SYS_REQ(RUN_TASK, "Move",id1,0);<br />
set_time_base = SYS_REQ(GET_CURRENT_TIME_BASE);<br />
timer_id1 = SYS_REQ(SET_TIMER, set_time_base, 2);<br />
stateLight = 6;<br />
break;<br />
….<br />
}<br />
switch(stateSound) {<br />
case 0:<br />
if( AVAILABLE(port_sound) ) {<br />
prev_sound = sound_val;<br />
BUF_RECEIVE(port_sound, &sound_val, sizeof(U16));<br />
if(prev_sound >= 400 && sound_val < 400) stateSound = 1;<br />
break;<br />
}<br />
…<br />
}<br />
}<br />
20 HOPES project, SNU
PC + NXT Robot Example<br />
Processing control commands<br />
1. Each control command includes time information:<br />
unsigned int time_base = SYS_REQ(GET_CURRENT_TIME_BASE);<br />
SYS_REQ(SET_PARAM_INT, task name, param name, param value, time base, time offset);<br />
2. An internal control scheduler processes commands from the control tasks based on the time information of each command (time base + time offset).<br />
[Diagram: control tasks 1-3 send commands into a command queue sorted by time information; the control scheduler pops and executes each command at its time]<br />
21 HOPES project, SNU
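The control scheduler's ordering rule can be sketched in Python (a toy model with hypothetical names, not the HOPES run-time itself): commands carry a time base plus offset and are kept in a priority queue ordered by their execution time.

```python
import heapq

class ControlScheduler:
    """Toy model of the internal control scheduler: commands are
    executed in order of (time_base + time_offset)."""
    def __init__(self):
        self.queue, self.seq = [], 0
    def send(self, command, time_base, time_offset):
        # seq breaks ties so equal-time commands keep arrival order
        heapq.heappush(self.queue, (time_base + time_offset, self.seq, command))
        self.seq += 1
    def run_until(self, now):
        done = []
        while self.queue and self.queue[0][0] <= now:
            done.append(heapq.heappop(self.queue)[2])
        return done

sched = ControlScheduler()
sched.send("RUN_TASK Move", time_base=100, time_offset=2)
sched.send("SET_PARAM_INT Move motion", time_base=100, time_offset=0)
print(sched.run_until(102))  # parameter is set before the task runs
```

Sorting by absolute time rather than arrival order is what lets several independent control tasks issue commands concurrently without racing each other.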
SADF Specification in HOPES<br />
An SADF subgraph is regarded as a computation task from the outside<br />
• A control task can control the execution status of the entire subgraph<br />
• It resembles hierarchical model composition in Ptolemy<br />
• For each scenario, the application is specified by a decidable dataflow graph (for now, we use an SDF graph)<br />
[Diagram: the IntGroup task is internally an SADF subgraph]<br />
22 HOPES project, SNU
SADF Subgraph<br />
Motivation<br />
• For static analysis, we want to use SDF or its extended models as much as possible in application specification.<br />
• We explicitly specify whether a CIC subgraph is an SADF graph.<br />
SADF (scenario-aware dataflow) model<br />
• Assumes that the number of scenarios is finite.<br />
• It is associated with an MTM (Mode Transition Machine).<br />
• Each task has a different definition for each mode of operation; sample rates can change.<br />
23 HOPES project, SNU
MTM Specification<br />
24 HOPES project, SNU
CIC Task in an SADF<br />
Its behavior depends on the mode<br />
• Sample rates<br />
• Task body<br />
By default, it is an SDF task.<br />
25 HOPES project, SNU
Mode Transition in SADF<br />
API for a mode transition request<br />
• SYS_REQ(SET_MTM_PARAM_INT, Task Name, Var Name, Value, 0, 0)<br />
• The controller task calls this API to change the mode of an SADF.<br />
• The argument variable of the MTM is changed immediately when the API is called.<br />
Mode transition mechanism<br />
• A mode transition occurs at the iteration boundary by default; immediate transition can be enforced.<br />
• At the start of each iteration, a task checks the current mode and changes its sample rates and function body.<br />
• Each task knows how many times it should be fired in each iteration via static scheduling performed for each mode at compile time.<br />
26 HOPES project, SNU
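The iteration-boundary rule can be sketched in Python (an illustration, not the CIC run-time): a mode request is latched immediately, but the repetition count and task body only switch when the current iteration completes.

```python
class SadfTask:
    """Toy SADF task: per-mode repetition counts and bodies.
    A requested mode takes effect only at the iteration boundary."""
    def __init__(self, modes, start):
        self.modes, self.mode, self.pending = modes, start, start
    def request_mode(self, mode):      # SET_MTM_PARAM_INT analogue
        self.pending = mode
    def run_iteration(self):
        reps, body = self.modes[self.mode]
        fired = [body() for _ in range(reps)]  # static schedule per mode
        self.mode = self.pending               # boundary: apply transition
        return fired

task = SadfTask({"S1": (1, lambda: "mix2"), "S2": (2, lambda: "mix4")}, "S1")
task.request_mode("S2")          # latched immediately...
print(task.run_iteration())      # ...but this iteration still runs in S1
print(task.run_iteration())      # next iteration runs in S2
```

Deferring the switch to the boundary is what keeps each iteration consistent with one statically computed schedule, so token counts on the channels always balance.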
A Toy Example<br />
[Dataflow graph: intGen1 and intGen2 feed the IntGroup subgraph (actors A and B), which feeds Mix and I_Display; the FloatGroup subgraph (FloatGen, actor C) feeds F_Display; a Counter task and a Control task with states S1 and S2 complete the graph]<br />
27 HOPES project, SNU
A Toy Example - cont'd<br />
IntGroup<br />
• The IntGroup task has two modes (MTM modes S1 and S2, variable IntVar):<br />
currentState | Condition | nextState<br />
S1 | IntVar == 2 | S2<br />
S2 | IntVar == 1 | S1<br />
• S1 mode: mix two 8-digit integers. num_1 = xxxxxxxx, num_2 = yyyyyyyy -> result = xxxxyyyy<br />
• S2 mode: mix four 8-digit integers. num_1 = xxxxxxxx, num_2 = yyyyyyyy, num_3 = zzzzzzzz, num_4 = wwwwwwww -> result = xxzzyyww<br />
[Graph: intGen1 and intGen2 each produce one token; IntGroup consumes 1 token per input in S1 and 2 tokens in S2, then feeds Mix and I_Display]<br />
• The IntGen_1 task sets the mode variable (IntVar) in the MTM depending on a randomly generated integer:<br />
if (rand_1 % 3 == 0)<br />
    SYS_REQ(SET_MTM_PARAM_INT, "IntGen_1", "IntVar", 2, 0, 0);<br />
• The mode transition occurs internally at the iteration boundary of the SADF graph, handled by the scheduler.<br />
28 HOPES project, SNU
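The two mixing modes described above can be written out directly (an illustrative Python version of the digit patterns on the slide, not code from HOPES):

```python
def mix(nums, mode):
    """Toy version of the IntGroup mixing: numbers are 8-digit strings;
    S1 keeps 4 leading digits of each of two inputs (xxxxyyyy), S2 keeps
    2 leading digits of each of four inputs in x/z/y/w order (xxzzyyww)."""
    if mode == "S1":
        a, b = nums
        return a[:4] + b[:4]              # xxxxyyyy
    a, b, c, d = nums                     # S2
    return a[:2] + c[:2] + b[:2] + d[:2]  # xxzzyyww

print(mix(["12345678", "abcdefgh"], "S1"))
print(mix(["11111111", "22222222", "33333333", "44444444"], "S2"))
```

Note how S2 consumes twice as many inputs per firing, which is exactly the sample-rate change the SADF graph above expresses.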
1. Introduction<br />
2. Dynamic Behavior Specification in HOPES<br />
3. Self-Adaptive Mapping<br />
4. Preliminary Experimental Results<br />
5. Discussion and Conclusion<br />
Contents<br />
29 HOPES project, SNU
Target Architecture<br />
Heterogeneous architecture (CARD architecture)<br />
• Multicore control processor<br />
• Many-core accelerator: processor arrays<br />
[Block diagram: an SMP control cluster C, a many-core accelerator A of core+memory tiles on a NoC with shared memory modules and an input queue, reconfigurable HW R, HW IPs D (IP1, IP2), an external memory interface, and I/O]<br />
30 HOPES project, SNU
Many-core Accelerator<br />
Tile-based NoC architecture<br />
• Homogeneous processors<br />
• Processor tiles + memory tiles<br />
• Distributed shared memory<br />
• Assumptions for the experiments: task code is stored in a shared memory tile; mesh architecture<br />
SPM-based architecture<br />
• Local memory size is given (hundreds of kilobytes at best)<br />
Central Manager (CM)<br />
• Maps the tasks onto tiles dynamically<br />
• Moves the task code to the processor tile if needed<br />
31 HOPES project, SNU
Problem Statement<br />
Input<br />
• HOPES specification: dynamic behavior is specified with control tasks + SADF subgraphs<br />
• Requests for system status changes are delivered to the CM<br />
• Objects of task mapping: computation tasks with internal parallelism (considered in the future); tasks in SADF subgraphs<br />
Constraint<br />
• Each CIC task has a throughput (or latency) constraint<br />
Problem<br />
• How to map the tasks dynamically while satisfying the real-time constraints?<br />
• Maximize the aggregate throughput surplus (ATS)<br />
32 HOPES project, SNU
Self-Adaptive Mapping Technique<br />
Hybrid technique: the basic idea<br />
• Compile-time mapping of SADF subgraphs (and CIC tasks)<br />
• For a varying number of processors, based on the WCET of each task, perform scheduling to find the real-time performance<br />
• Store the mapping information in shared memory: a set of (number of processors, {task mapping info.}, throughput) tuples for each mode<br />
• From the minimum number of processors that satisfies the throughput constraint, up to the maximum number beyond which no performance improvement is obtained<br />
• Run-time mapping by the CM (Central Manager)<br />
• Allocate processors to each task running concurrently<br />
• Bind the (virtual) processors to physical processors<br />
• Map the tasks according to the stored mapping information<br />
33 HOPES project, SNU
Self-Adaptive Mapping Procedure<br />
Compile-time analysis<br />
• Input: task graphs and WCET profiles<br />
• Per-task compile-time analysis produces throughput-maximized mappings for various numbers of processors<br />
Run-time mapping<br />
• Input: the virtual processor pool<br />
• On a system status change (task arrival/finish/operation mode change), perform the initial task-to-virtual-processor mapping<br />
• If the mapping fails, drop a task from the active task set and retry<br />
• If it succeeds, finish with virtual-processor-to-physical-processor binding<br />
34 HOPES project, SNU
A Simple Example<br />
Map the following 4 task graphs onto a 3x3 NoC, with throughput constraint 1/130.<br />
[Task graphs A (nodes A1-A3), B (B1-B4), C (C1-C3), and D (D1-D2), with node execution times between 25 and 80]<br />
[Gantt charts: for each graph, two alternative schedules on two or three processors (P1-P3), yielding throughputs such as 1/60, 1/70, 1/80, 1/90, 1/95, and 1/120]<br />
35 HOPES project, SNU
Run-time Mapping<br />
Objective: maximize the aggregate throughput surplus (ATS)<br />
• m(T): current execution mode of T<br />
• th(.): throughput obtained from the static mapping result<br />
• Th_min(T): throughput constraint of T<br />
• V(T): number of processors allocated to task graph T<br />
Meaning<br />
• If the throughput surplus is large, we may lower the power consumption of the allocated processor tiles<br />
Two-level mapping<br />
• Node-to-(virtual) processor mapping<br />
• VP-to-PP (physical processor) binding<br />
36 HOPES project, SNU
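The ATS objective itself appeared as an equation image on the slide. From the symbol definitions above, a plausible reconstruction (an assumption, not necessarily the authors' exact formula) is:

```latex
\mathrm{ATS} \;=\; \sum_{T \,\in\, \text{active tasks}}
  \Big( th\big(m(T),\, V(T)\big) \;-\; Th_{\min}(T) \Big)
```

That is, for each active task graph the surplus is its achieved throughput in its current mode with its allocated processors, minus its throughput constraint, summed over all active task graphs.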
Node-to-VP Mapping<br />
Objective<br />
• Allocate virtual processors to the active tasks<br />
• Input: 1) active tasks, 2) results of the compile-time analysis, 3) set of virtual processors<br />
• Output: mapping of tasks to virtual processors<br />
Key ideas<br />
• Allocate the minimum number of processors to each task to satisfy its constraint.<br />
• Allocate the remaining processors so as to maximize ATS.<br />
Time complexity<br />
• O(PA^2), where P is the number of processors and A is the number of tasks to be mapped; the ATS computation takes O(A).<br />
37 HOPES project, SNU
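The two key ideas can be sketched in Python (a simplified illustration with a hypothetical data layout; `th_by_procs[p]` stands in for the compile-time throughput table): first give each task its minimum feasible processor count, then hand out the remaining processors one at a time to whichever task gains the most throughput.

```python
def node_to_vp(tasks, total_procs):
    """tasks: {name: (th_by_procs, th_min)} where th_by_procs[p] is the
    throughput with p processors (index 0 unused). Returns an allocation
    {name: procs} or None if the constraints cannot all be met."""
    alloc = {}
    for name, (th, th_min) in tasks.items():
        p = next((p for p in range(1, len(th)) if th[p] >= th_min), None)
        if p is None:
            return None          # the real flow would drop a task and retry
        alloc[name] = p
    left = total_procs - sum(alloc.values())
    if left < 0:
        return None
    for _ in range(left):        # greedy: maximize aggregate surplus
        def gain(name):
            th, _ = tasks[name]
            p = alloc[name]
            return th[p + 1] - th[p] if p + 1 < len(th) else 0.0
        best = max(alloc, key=gain)
        if gain(best) <= 0:
            break                # no task improves with more processors
        alloc[best] += 1
    return alloc

tasks = {"A": ([0, 0.5, 0.9, 1.3], 0.8),   # needs at least 2 processors
         "B": ([0, 0.7, 1.0, 1.1], 0.6)}   # needs at least 1 processor
print(node_to_vp(tasks, 4))
```

Since every extra processor goes to the task with the largest marginal throughput gain, each greedy step raises the ATS sum by the maximum possible amount.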
VP-to-PP Binding<br />
Objective<br />
• Determine the tile positions of the virtual processors in the NoC, minimizing the overall communication cost<br />
• Input: 1) set of virtual processors, 2) list of unmapped physical processors (tiles)<br />
• Output: mapping of virtual to physical processors<br />
Key ideas<br />
• Select the next virtual processor to be mapped as the one with the largest communication volume to the most recently mapped virtual processor.<br />
• Select the next physical processor that minimizes the sum of Manhattan distances (MD) from the already-bound physical processors.<br />
Time complexity<br />
• O(P^2)<br />
38 HOPES project, SNU
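A Python sketch of the binding heuristic (illustrative; the tile coordinates and communication volumes are made-up): repeatedly pick the virtual processor that communicates most with the last one mapped, then place it on the free tile minimizing the summed Manhattan distance to the tiles already bound.

```python
def bind_vp_to_pp(comm, tiles):
    """comm[(a, b)] = communication volume between virtual processors a, b.
    tiles: list of free (x, y) tile coordinates. Returns {vp: tile}."""
    vps = sorted({v for pair in comm for v in pair})
    free, bound = list(tiles), {}
    last = vps[0]                       # seed with an arbitrary VP
    order = [last]
    while len(order) < len(vps):        # VP ordering by communication volume
        rest = [v for v in vps if v not in order]
        last = max(rest, key=lambda v: comm.get((last, v), 0)
                                       + comm.get((v, last), 0))
        order.append(last)
    for vp in order:
        def cost(tile):                 # sum of Manhattan distances
            return sum(abs(tile[0] - t[0]) + abs(tile[1] - t[1])
                       for t in bound.values())
        tile = min(free, key=cost)
        bound[vp] = tile
        free = [t for t in free if t != tile]
    return bound

comm = {("vp0", "vp1"): 10, ("vp1", "vp2"): 4}
tiles = [(0, 0), (0, 1), (2, 2)]
print(bind_vp_to_pp(comm, tiles))      # heavy talkers land on adjacent tiles
```

Because hop distance is the cost model on the next slides, clustering heavily communicating VPs on adjacent tiles directly reduces the reported communication overhead.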
Experiments<br />
4 sets of application scenarios<br />
• Set 1: random graphs (G1-G5) with a 7x7 NoC<br />
• Set 2: random graphs (G1-G10) with an 11x11 NoC<br />
• Set 3: 4 real-life examples with a 3x3 NoC<br />
• Set 4: 4 real-life examples with a 4x4 NoC<br />
Task graph | # nodes | # virtual proc (min, max) | 1/throughput (min, max) | 1/throughput constraint | Node execution time (us)<br />
Random graph | 10 | 4, 7 | 1960, 1420 | 2520 | [400, 1600]<br />
Random graph | 26 | 11, 16 | 2510, 2080 | 2830 | [200, 1800]<br />
MPEG2 decoder | 14 | 3, 5 | 4378, 2562 | 5473 | [300, 1954]<br />
MP3 decoder | 7 | 2, 4 | 5233, 3709 | 6541 | [382, 3709]<br />
H263 decoder | 5 | 2, 4 | 1143, 577 | 1443 | [11, 577]<br />
Beamformer | 3 | 1, 2 | 1956, 994 | 2445 | [481, 962]<br />
39 HOPES project, SNU
Throughput Gain<br />
Comparison with a dynamic mapping technique, varying the CCR (communication-to-computation ratio)<br />
• The dynamic technique aims to minimize the communication overhead. It finds the minimum initiation interval of the task graphs without violating the resource constraints (buffer size).<br />
Assumptions<br />
• Communication overhead is proportional to the hop distance<br />
• Run 100 iterations for each task graph<br />
[Charts: ATS (0-5) vs. CCR (1%-4%) for the dynamic and proposed techniques, on Sets 1-2 and Sets 3-4]<br />
40 HOPES project, SNU
Latency Comparison<br />
Baseline: the latency achieved by the proposed technique<br />
The dynamic approach has longer latency<br />
• It pays a huge code migration cost.<br />
• Tasks with lower priorities may be delayed too long.<br />
[Charts: average latency per iteration, normalized to the proposed technique, vs. CCR (1%-4%) for Dynamic_Set1/Set2 (up to ~2.5x) and Dynamic_Set3/Set4 (up to ~1.5x)]<br />
41 HOPES project, SNU
Communication Overhead<br />
The proposed technique minimizes the code migration overhead<br />
• The node mapping is preserved unless the system status changes.<br />
• VP-to-PP binding minimizes the communication overhead.<br />
CCR | 1% | 2% | 3% | 4%<br />
Communication (us), Proposed | 277 | 568 | 864 | 1157<br />
Communication (us), Dynamic | 441 | 888 | 1334 | 1812<br />
Code migration (us), Proposed | 230 | 472 | 715 | 959<br />
Code migration (us), Dynamic | 20037 | 40620 | 56700 | 76794<br />
42 HOPES project, SNU
Future Work: Processor Sharing<br />
Rationale<br />
• Static mapping may underutilize some processors<br />
• We share underutilized processors among multiple tasks<br />
Determining the "sharable" processors<br />
• Make a pessimistic assumption: the task execution time on a sharable processor is lengthened by the sharing degree<br />
• If the resulting throughput degradation is no worse than removing one processor, the processor is regarded as sharable<br />
43 HOPES project, SNU
Some Mapping Results: A Simple Example<br />
[Mapping snapshots: BN (Best Neighbor), the proposed technique without processor sharing, and the proposed technique with processor sharing]<br />
44 HOPES project, SNU
1. Introduction<br />
2. Dynamic Behavior Specification in HOPES<br />
3. Self-Adaptive Mapping<br />
4. A Preliminary Experiment<br />
5. Discussion and Conclusion<br />
Contents<br />
45 HOPES project, SNU
Applications (Use Cases)<br />
• Video Player : H.264 Decoder, MP3 Decoder<br />
• Music Player : MP3 Decoder<br />
A Simple Smart-phone<br />
• Video Phone : x264 Encoder, H.264 Decoder, G.723 Decoder/Encoder<br />
• Menu<br />
Typical Scenario<br />
• Display a menu and receive a user input.<br />
• Depending on the user input, execute the corresponding application.<br />
• When a call arrives during application execution, the running<br />
application is suspended and the video phone application is executed.<br />
• When the call is finished, the previously suspended application<br />
is resumed.<br />
• Return to the menu when the application terminates.<br />
Experimental Application<br />
Task Graph<br />
Control Task<br />
• Controls the applications<br />
• Includes an FSM<br />
• Triggered by two input tasks<br />
UserInput Task<br />
• Displays a menu<br />
• Sends the user input to the<br />
control task<br />
Interrupt Task<br />
• Models asynchronous event<br />
arrivals (e.g. a phone call)<br />
• Sends a signal to the control<br />
task<br />
Control Task Specification<br />
FSM specified in the Control task:<br />
Wait for termination of the current application or an<br />
asynchronous signal (phone call);<br />
switch (current_state) {<br />
case MENU:<br />
if (input == 1) execute VideoPlay;<br />
else if (input == 2) execute VideoPhone;<br />
else if (input == 3) execute MusicPlay;<br />
else exit;<br />
break;<br />
case VideoPlay:<br />
if (signal == On)<br />
Suspend VideoPlay & Execute VideoPhone;<br />
else<br />
Execute Menu;<br />
break;<br />
case VideoPhone:<br />
if (previous_state == MusicPlay)<br />
Stop VideoPhone & Resume MusicPlay;<br />
else if (previous_state == VideoPlay)<br />
Stop VideoPhone & Resume VideoPlay;<br />
break;<br />
…<br />
}<br />
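The switch-based specification above can be sketched as an explicit state machine. The execute/suspend/resume actions are modeled here as log strings, placeholders for the real HOPES runtime calls, which are not shown on the slide:<br />

```python
MENU, VIDEO_PLAY, VIDEO_PHONE, MUSIC_PLAY = "Menu", "VideoPlay", "VideoPhone", "MusicPlay"

class ControlTask:
    """Sketch of the control task's FSM from the slide above."""
    def __init__(self):
        self.state = MENU
        self.previous = None
        self.log = []

    def on_event(self, user_input=None, signal=None):
        if self.state == MENU:
            target = {1: VIDEO_PLAY, 2: VIDEO_PHONE, 3: MUSIC_PLAY}.get(user_input)
            if target is None:
                self.log.append("exit")
                return
            self.log.append("execute " + target)
            self.previous, self.state = self.state, target
        elif self.state == VIDEO_PLAY:
            if signal == "On":       # a call arrives during playback
                self.log.append("suspend VideoPlay; execute VideoPhone")
                self.previous, self.state = self.state, VIDEO_PHONE
            else:                    # playback finished
                self.log.append("execute Menu")
                self.state = MENU
        elif self.state == VIDEO_PHONE:  # the call has finished
            if self.previous == MUSIC_PLAY:
                self.log.append("stop VideoPhone; resume MusicPlay")
                self.state = MUSIC_PLAY
            elif self.previous == VIDEO_PLAY:
                self.log.append("stop VideoPhone; resume VideoPlay")
                self.state = VIDEO_PLAY
```

Driving it with the test scenario (start video, call arrives, call ends) walks the FSM through Menu, VideoPlay, VideoPhone, and back to VideoPlay.<br />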
H.264 Decoder Task<br />
[Figure: SADF task graph whose sample rates change depending on the mode:<br />
some edge sample rates are 0 in I_Frame mode, others are 0 in P_Frame mode]<br />
MTM (Mode Transition Machine)<br />
• Two modes: I-Frame / P-Frame<br />
• Mode variable: FrameVar<br />
• Two transitions: I-Frame ↔ P-Frame<br />
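A possible sketch of such a two-mode MTM, where the mode variable FrameVar drives the transitions and the sample rates depend on the current mode. The port names and rate values are illustrative assumptions, not the actual H.264 task graph:<br />

```python
# Mode transitions of the MTM: (current mode, FrameVar value) -> next mode.
TRANSITIONS = {
    ("I_Frame", "P"): "P_Frame",
    ("P_Frame", "I"): "I_Frame",
}

# Mode-dependent sample rates: in each mode, the ports of the inactive
# prediction path have rate 0 (port names are made up for illustration).
SAMPLE_RATES = {
    "I_Frame": {"IntraPred": 1, "InterPred": 0},
    "P_Frame": {"IntraPred": 0, "InterPred": 1},
}

def step(mode, frame_var):
    """Fire one iteration: return the next mode and the rates
    that were in effect for the current mode."""
    next_mode = TRANSITIONS.get((mode, frame_var), mode)  # stay if no transition
    return next_mode, SAMPLE_RATES[mode]
```

Because the set of modes is finite and each mode is a plain SDF graph, each mode can be analyzed and mapped statically, which is exactly what makes SADF attractive here.<br />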
Other SDF Task Graphs<br />
• x264 encoder (single mode)<br />
• MP3 Player (single mode)<br />
Profiling with an Intel i7 machine<br />
• Obtain the WCET of each task node.<br />
Profile results<br />
• H.264 Decoder (usec/frame)<br />
ReadFile      117.04     IntraPredY    1552.32<br />
Decode        2154.26    IntraPredU    297<br />
InterPredY    1584       IntraPredV    360.36<br />
InterPredU    506.88     Deblock       623.51<br />
InterPredV    514.8      WriteFile     1245.11<br />
• x264 Encoder (usec/frame)<br />
Init          211.27     Deblock       517.77<br />
ME            6730.02    VLC           4984.65<br />
Encoder       2953.17<br />
Profile Results<br />
• MP3 Decoder (usec/iter.)<br />
VLDStream     175.03     Antialias     40.01<br />
DeQ           398.22     Hybrid        823.25<br />
Stereo        34.73      Subband       632.4<br />
Reorder       33.13      Writefile     29.62<br />
• G.723 Decoder (usec/iter.)<br />
G723Dec       1.85<br />
• G.723 Encoder (usec/iter.)<br />
G723Enc       1.38<br />
Self-Adaptive Mapping<br />
Compile-time Analysis<br />
• Slow down the processor speed by 6x<br />
Task graph                # nodes   # virtual proc   1/throughput      1/throughput<br />
                                    (min, max)       (min, max)        constraint<br />
H.264 decoder (I-frame)   10        3, 4             23393, 20415 us   Video: 30 frame/sec<br />
                                    1, 4             53462, 20415 us   Phone: 15 frame/sec<br />
H.264 decoder (P-frame)   7         3, 4             21711, 16206 us   Video: 30 frame/sec<br />
                                    1, 4             40204, 16206 us   Phone: 15 frame/sec<br />
MP3 decoder               8         2, 3             8912, 4940 us     78 frame/sec<br />
x264 encoder              5         2, 3             50734, 41708 us   15 frame/sec<br />
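At run time, the number of virtual processors for a task can be chosen from such a table as the smallest count whose analyzed 1/throughput meets the constraint. A sketch using the H.264 decoder (P-frame) numbers above; the 2-processor value is an assumed intermediate point for illustration, not taken from the table:<br />

```python
def choose_num_procs(periods, constraint_us):
    """Pick the fewest virtual processors whose analyzed 1/throughput
    meets the constraint; 'periods' maps processor count -> 1/throughput
    (us) from compile-time analysis."""
    for n in sorted(periods):
        if periods[n] <= constraint_us:
            return n
    return max(periods)  # fall back to best effort with the most processors

# H.264 decoder, P-frame mode: 40204 us with 1 processor and 16206 us
# with 4 come from the table; the 2-processor value is assumed.
periods = {1: 40204, 2: 35000, 3: 21711, 4: 16206}
video_constraint = 1_000_000 / 30   # 30 frame/sec -> ~33333 us per frame
phone_constraint = 1_000_000 / 15   # 15 frame/sec -> ~66667 us per frame
```

Under these numbers the video-play constraint needs 3 processors while the video-phone constraint is met with 1, matching the (min, max) columns of the table.<br />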
Test Scenario<br />
1) Play a video clip.<br />
2) During video playback, a call arrives.<br />
3) When the call is finished, continue playing<br />
the video clip.<br />
4) Return to the menu when the video clip is finished.<br />
Run-time Mapping Results<br />
Throughput gain & latency comparison<br />
[Chart: ATS (throughput gain) vs. CCR (1%, 2%, 3%, 4%) for<br />
Dynamic_Video, Proposed_Video, Dynamic_Phone, and Proposed_Phone]<br />
[Chart: average latency per iteration vs. CCR (1%, 2%, 3%, 4%) for<br />
Dynamic_Video and Dynamic_Phone]<br />
Run-time Mapping Results: Communication Overhead<br />
• The proposed technique minimizes the code migration overhead.<br />
Scenario     Overhead              Mapping    CCR: 1%   2%      3%       4%<br />
Video play   Communication (us)    Dynamic    34611     67514   103160   136129<br />
                                   Proposed   23053     46119   69184    92248<br />
             Code migration (us)   Dynamic    2893      6101    9200     11780<br />
                                   Proposed   62        124     187      249<br />
Phone        Communication (us)    Dynamic    32282     61469   88928    119777<br />
                                   Proposed   23116     46245   69371    92496<br />
             Code migration (us)   Dynamic    4926      9414    14270    18965<br />
                                   Proposed   141       283     425      567<br />
Contents<br />
1. Introduction<br />
2. Dynamic Behavior Specification in HOPES<br />
3. Self-Adaptive Mapping<br />
4. A Preliminary Experiment<br />
5. Discussion and Conclusion<br />
Conclusion<br />
HOPES provides two ways to express dynamic behavior:<br />
• At the top level: control task changes the system-level behavior<br />
• Comparable to RPN (Reactive Process Network)<br />
• At the lower level: SADF subgraph with MTM<br />
• Finite number of modes<br />
• Good for static analysis<br />
We developed a hybrid mapping technique to satisfy the<br />
real-time constraints (Self-Adaptive Mapping)<br />
• Compile-time analysis<br />
• For each mode of a CIC task (including SADF subgraphs)<br />
• Static mapping information w/ varying number <strong>of</strong> processors<br />
• Run-time mapping<br />
• Determine the number <strong>of</strong> (virtual) processors to allocate for each<br />
task<br />
• Bind the VP to PP (physical processor)<br />
Thank you!<br />