Proceedings embedded world Conference 2018

Conference Chair: Prof. Dr. Matthias Sturm, HTWK Leipzig

Project Manager: Renate Ester, P +49 (0)89 255 56-1349, E-Mail: REster@weka-fachmedien.de

Coordinator Conference Attendees: Juliane Heger, P +49 (0)89 255 56-1155, E-Mail: JHeger@weka-fachmedien.de

WEKA FACHMEDIEN GmbH, Richard-Reitzner-Allee 2, 85540 Haar, Germany
www.weka-fachmedien.de

ISBN 978-3-645-50173-6
www.embedded-world.eu



Copyright

©2018 WEKA FACHMEDIEN GmbH, Richard-Reitzner-Allee 2, 85540 Haar, Germany, phone: +49 (0)89 255 56-1000

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying, scanning, duplicating or transmitting electronically, without the written permission of the copyright holder, application for which should be addressed to the publisher. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

The publisher, its employees and agents exercise the customary degree of care in accepting and checking advertisement texts and conference papers, but are not liable for misleading or deceptive conduct by the client.

The company addresses contained in these proceedings are subject to data protection law. The use of this information for advertising purposes is prohibited.



Emulation and Rich, Non-intrusive Analytics Address Verification Complexity

Rupert Baines, UltraSoC, UK, Rupert.Baines@ultrasoc.com
Russell Klein, Mentor, a Siemens Company, Wilsonville, OR, U.S.A., Russell.Klein@Siemens.com

Breaking down the pre-silicon/post-silicon divide by combining hardware-based on-chip analytics with leading-edge emulation technology enables a consistent debug approach through all stages of the design cycle. Tracing capabilities added to SoC devices provide debug visibility into systems running at full speed and under typical operating conditions. However, even this type of advanced tracing cannot deliver complete visibility into the hardware. Diagnosing certain problems requires visibility into any register or net in the design. Emulation systems deliver this complete debug visibility, but have traditionally presented the data in a manner inconsistent with post-silicon debug tools, which disrupts the debug process. Combining these technologies delivers the optimal combination of performance and visibility.

I. INTRODUCTION

Software plays an increasing role in the functionality of modern SoCs. As such, verification of SoCs must include software execution. Many of the most challenging bugs are seen only when the SoC is run as a complete system of both hardware and software, operating at full speed and under typical operating conditions. These bugs are often sensitive to the smallest change in operating conditions: adding even a single-character logging statement can dramatically change the behavior of the bug, or mask it completely. Diagnosing this type of bug requires a non-intrusive tracing technique, capable of working while the SoC is running at full speed and processing typical datasets.

Sometimes trace data from the processors, bus fabrics, and surrounding logic can provide the insight needed to understand the bug. Other times additional debug data is needed. This additional data is usually obtained by reverting to a pre-silicon execution platform, where hardware debug visibility is greater than is possible in post-silicon environments. Typically this occurs when the root of the problem is in hardware, so deeper introspection into the hardware is required.

Traditionally, pre-silicon and post-silicon debug approaches have presented data in very different ways, as the nature of the data was quite different. Because the time scales of operation can be orders of magnitude apart, common data collection and presentation is the exception rather than the rule. This results in a discontinuity in the debug process: as developers move from a characterization of the problem in post-silicon, they need to re-capture and re-characterize the problem in the pre-silicon environment. When the bug is sensitive to minor timing or operating-condition changes this can prove both time consuming and frustrating, just when debug efficiency is needed most.

II. THE CHALLENGE

One of the biggest challenges in SoC design today is systemic complexity. Block-level verification means we can be confident in the test coverage of individual blocks; but when these are integrated into a whole system the complexity increases and problems slip through. This is especially the case for heterogeneous, multi-core systems or those with many different IP blocks.

The problems are worsened because the hardware blocks themselves may come from many different sources. Some will be designed in-house; others will be licensed-in from external vendors. It is a hard job to bring together these various CPUs, GPUs and accelerators, particularly in the absence of a unifying tool-chain that can deal with IP from many vendors.

The number, complexity and interaction of IP blocks is by no means the end of the story. The software that runs on a large chip will be every bit as complex as the hardware: verifying the functionality of the software itself, and its interaction with the underlying hardware, brings yet another level of complexity.

However, the problems for the SoC team do not end there. Still more complexity is revealed when we understand the end goal, which is to produce an SoC that functions correctly, and with the expected performance, in real-life situations. It is not uncommon to encounter issues that reveal themselves in the field only on a timescale of days or weeks of continuous running. Such issues cannot practically be found in simulation and verification, because of the time involved.

The endgame is that many of today's chips are so complex that it is impossible for the design team that created them to fully understand their operation in-life or in-field.

III. CURRENT APPROACHES

The modern SoC might have a billion transistors and more than 100 individual IP blocks. Such a system obviously presents a challenge to simulate, validate or verify. Typically the design flow moves from simulation to emulation and FPGA prototyping, then tape-out and post-silicon integration, bring-up, system-level test and finally deployment.

This has created two distinct parts of the flow, demarcated at tape-out: the pre- and post-silicon worlds have traditionally been completely separate, with little connection between the two.

A. Pre-Silicon

The pre-silicon world of simulation, emulation and prototyping has typically been considered a single domain served by EDA vendors. That domain is virtual, which has huge advantages in terms of flexibility and scope.

Today's simulators and similar tools are highly competent: we can have great confidence that a given block will work as designed. But there are difficulties: it is challenging to test software at anything approaching real-time speeds; it is very difficult to model systems that depend on real-world inputs; and the systemic complexity of the hardware may require extended run times to achieve anything like acceptable coverage.

In particular, it is not sufficient to model only hardware states: it is also necessary to include the execution of software. Software activity is typically dependent on real-world contexts and external inputs, making the challenge even greater. Tools like Mentor's Veloce emulation platform help address this problem by enabling pre-silicon testing and debug at hardware speeds, using real-world data, while both hardware and software designs are still fluid.

B. Post-Silicon

Bring-up, integration, verification and validation are done by systems developers, typically with very few specific tools to help them address their problems. Traditional debug tools are processor-centric: the ARM CoreSight system may be of little use in spotting an interaction issue with a CEVA DSP.

The situation is typified by the fact that perhaps the only "standard" tool that can be of help here (aside from free or open source debug software like GDB and commercial tools like Lauterbach's TRACE32) is JTAG, a 30-year-old technology which is very limited in its scope.

In recent years there have been some steps forward in solving this problem. UltraSoC, for example, provides analytics IP that allows the construction of a universal on-chip debug infrastructure, independent of the main system. The CPU vendors themselves have responded to such developments with improvements to their embedded debug capabilities, though these remain by no means "universal".

UltraSoC's IP integrates a system of monitors and analytics modules into the device itself, at hardware level. This gives the post-silicon team a system-wide view of their SoC, as a complete entity and under real-world operating conditions. Importantly, this includes both hardware and software, enabling full insight into processor performance and into how processes interact with each other and with hardware elements inside the system.

IV. EMBEDDED ANALYTICS

The embedded analytics concept puts hard-wired, non-intrusive, vendor-neutral, 'smart' monitoring and analysis capabilities into the chip itself. It includes local intelligence, again hard-wired, for filtering and statistics, reducing the amount of information that needs to be brought off-chip.

Such a capability makes it very much easier to bring up and debug the chip, even if there are subtle issues or interactions involved.

UltraSoC's embedded analytics capability includes protocol-aware bus monitors for major interconnects and buses, including AMBA 5 CHI, AXI, OCP and AHB. It supports all of the common processors (including ARM, MIPS, CEVA, Xtensa, and RISC-V), co-processors and custom logic (with sophisticated logic-analyzer functionality), and delivers the development team an integrated view of their chip. Within the RISC-V ecosystem, UltraSoC is the only company with commercial products for run-control, debug and processor trace.

The offering also includes analytics software that supports engineers in developing and optimizing their products.

V. EMULATION TECHNOLOGY

The Mentor Veloce emulation platform enables high-speed emulation of complex SoCs to quickly identify design issues under critical traffic conditions, and enables users to improve device performance and reduce time to market. It has both the performance and the capacity to execute an SoC design in the pre-silicon phase and run it under typical operating conditions, including a full software payload.

Veloce delivers full SoC hardware visibility. That is, the developer can trace all register states in the complete design, and all wires (or nets) that connect those registers. These traces can be collected for any time period through the execution of the SoC. Veloce uses hardware and software that enable this data collection to be performed even when the design is running in the emulator at the emulator's maximum operating speed.

In a typical debug session with Veloce, the developer will not collect traces for all registers for all time, but will trace a limited set of signals and registers for a selected period of time, strategically chosen to provide insight into the problem being diagnosed. Examination of the collected traces may lead the developer to want additional traces, perhaps from other parts of the design, and possibly earlier or later in the run of the system. Through a sophisticated system of hardware and software that can save and restore the design's state in the emulator, and effectively reconstruct the design's state at any time in a past emulation, Veloce is able to deliver traces for any part of the design, and for any time in the emulation run.

This level of debug visibility into the hardware state ensures that hardware problems can be quickly and confidently diagnosed and resolved.



As stated earlier, the activity of the design is dependent on the inputs to the system and the context of the operation. Veloce has a complete library of virtual peripherals and traffic generators, as well as a large number of connections to physical devices. Both virtual and physical peripherals can be time-synchronized with the time domain of the design running in the emulator, allowing realistic timing and performance characterization.

VI. COMBINING ANALYTICS AND EMULATION CREATES A POWERFUL PLATFORM

Bringing together on-chip analytics and hardware-based emulation tools like UltraSoC and Veloce allows designers using RISC-V not only to improve the effectiveness of their emulation efforts, but also to take an important step towards bringing together the currently disparate pre- and post-silicon worlds.

UltraSoC IP incorporated within the device can gather information about the real-world environment the chip will encounter, information that can be gathered both in the lab and even after deployment in the field.

This real-world traffic can be used effectively to extend the use of the Veloce platform to prototyping.

Uniting the pre- and post-silicon worlds in this way gives the design team access to visualizations and statistics based on real-world behavior, and allows them to compare modeled/predicted behavior with actual/captured behavior to identify discrepancies. Those discrepancies might be bugs (for example deadlocks). They may also be more subtle phenomena, for example contention or underutilization, that can impact long-term performance or affect power dissipation without creating a catastrophic bug.

Many of these issues may not have been observable in the slow virtual world of traditional simulation, but with this combination they can easily be observed and addressed.

This approach also enables an easy transition from emulation to post-silicon implementation, giving a seamless flow from virtual to physical, and on to in-life/in-field deployment and optimization.



Efficiency of the RISC-V ISA-Level Custom Extension for AES Standard Acceleration: a Case Study

Pavel Smirnov, Grigory Okhotnikov, Dmitry Pavlov
Syntacore, Saint Petersburg, Russia
{sps|go|dp}@syntacore.com

In this work, we present an example of a custom RISC-V ISA extension, using the familiar AES cryptographic standard as a target workload.

The recently introduced RISC-V ISA standard is flexible and extensible by design. We utilize the standard extensibility features offered by the ISA and stay within its basic capacity. Based on an analysis of the algorithm, a custom AES instruction extension is designed and implemented in HW. The proposed extension includes 6 new instructions, operates over the standard RV32GC FPU register file, and supports all the variations of the AES algorithm defined by the AES standard, both for encryption and decryption.

The proposed custom instruction set extension was implemented in hardware. The demonstrated results are based on a real-time FPGA implementation and end-to-end benchmarking using an accelerated SW library with extension support. The prototype setup includes a RISC-V RV32GC based processor core with a GCC toolchain modified to support the designed custom extension. The resulting implementation demonstrates more than 50x cipher speed-up vs the base, SW-only implementation for the RISC-V RV32GC system, at the expense of a modest additional HW footprint (~30 kgates).

The resulting extension is then compared with AES extensions from other contemporary commercial CPUs currently available on the market, and proves to be competitive in both code density and performance.

Keywords—RISC-V; ISA; AES; custom instructions

I. INTRODUCTION

Recent slowdowns in semiconductor technology scaling [8] and "traditional" CPU performance growth [2] have established platform heterogeneity and HW specialization as a fundamental trend in computer architecture and design. Contemporary SoCs are increasingly heterogeneous, but the dominant use case is the addition of workload-specific accelerators with a driver model for SW deployment, which limits resource reuse and the efficiency of the resulting solution.

The recently introduced open RISC-V ISA [9] includes support for user-defined extensions. Although technologies based on ISA extensibility are well known, and have been both extensively explored in academia and successfully applied in industry in several cases, RISC-V for the first time enables such technologies in the main system sockets for a wide range of applications.

In this work, we explore the efficiency of ISA-level RISC-V extensibility for acceleration of the familiar AES algorithm suite and compare it with contemporary alternatives. We intend to study RISC-V ISA suitability and limitations based on a practical case, provide qualitative and quantitative comparison with functional equivalents in other contemporary ISAs, and measure the performance and efficiency of the developed custom extension using a real-time FPGA prototype and an end-to-end SW stack.

II. APPLICATION ANALYSIS

A. AES Overview

AES [1] is arguably one of the most widely adopted encryption standards. For this work, we consider all the applicable key lengths defined by the AES specification.

Without going into full details of the algorithm itself, we would like to start from a high-level algorithm overview, which will be useful for the following ISA mapping. We also note a few aspects important for the following analysis.

AES operates over 128-bit blocks that are processed by a sequence of encryption rounds using round keys. The number of rounds and, accordingly, of round keys depends on the length of the master key; keys of 128, 192 or 256 bits are defined by the standard. Depending on the key length, 10, 12 or 14 rounds are used, each round key being 128 bits.
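The relationship between key length and round count follows the FIPS-197 formula Nr = Nk + 6, where Nk is the key length in 32-bit words. A quick sketch (the helper name is ours, not from the paper):

```python
def aes_round_count(key_bits: int) -> int:
    """Number of AES transformation rounds Nr for a given master key length.

    FIPS-197 defines Nr = Nk + 6, where Nk is the key length in 32-bit words.
    """
    if key_bits not in (128, 192, 256):
        raise ValueError("AES defines only 128-, 192- and 256-bit keys")
    return key_bits // 32 + 6
```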

The input message block (called the "state") is sequentially turned into a cipher block of the same length as a result of the transformation rounds. Each AES round transformation depends on the results of the previous one, which restricts parallel rounds execution.

1) AES Encryption

Overall, the AES encryption operation sequence consists of the following steps (Fig. 1).



Fig 1. AES encryption algorithm

AddRoundKey is a 128-bit bitwise 'exclusive or' operation on the round key and the state.

SubBytes is a nonlinear bijective byte substitution.

ShiftRows is a shuffling of bytes by a fixed permutation.

MixColumns is a linear transform: a multiplication of a fixed 4×4 coefficient matrix by the 4×4 state byte matrix, in the Galois field defined by a fixed irreducible polynomial.
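The ShiftRows and MixColumns steps can be illustrated with a short Python sketch (helper names are ours, not part of the paper's extension; the state is a flat 16-byte list in FIPS-197 column-major order):

```python
def shift_rows(state):
    """ShiftRows: row r of the 4x4 state is rotated left by r positions.

    The state is a flat 16-byte list with byte s[r][c] stored at index r + 4*c.
    """
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]

def xtime(a):
    """Multiply by x (i.e. by 2) in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    return ((a << 1) ^ (0x1B if a & 0x80 else 0)) & 0xFF

def mix_single_column(col):
    """MixColumns on one 4-byte column: the fixed 2-3-1-1 circulant matrix."""
    a0, a1, a2, a3 = col
    mul3 = lambda v: xtime(v) ^ v  # 3*v = 2*v XOR v in GF(2^8)
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]
```

The column transform matches the standard test vector: mixing the column [0xDB, 0x13, 0x53, 0x45] yields [0x8E, 0x4D, 0xA1, 0xBC].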

2) AES Decryption

Block decryption is a sequence of inverse operations. The standard [1] defines two algorithms: Inverse Cipher and Equivalent Inverse Cipher.

The main differences between these algorithms are the sequence in which the operations are applied and the InvMixColumns transform, which must be applied to the round key for the Equivalent Inverse Cipher (Fig. 2).

Fig 3. AES decryption algorithms

III. RISC-V ISA EXTENSIBILITY OVERVIEW

The RISC-V ISA [9] supports variable-length instructions, where the length unit is 16 bits. The length is encoded in the first (least significant) bits of the instruction (Fig. 4).

Fig 4. Possible lengths of instruction
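This length encoding can be sketched in Python from the spec's rules (our own helper, covering only the 16- and 32-bit cases relevant to RV32GC):

```python
def instruction_length_bits(parcel: int) -> int:
    """Length of a RISC-V instruction, judged from its first 16-bit parcel.

    Per the RISC-V spec:
      bits [1:0] != 11                        -> 16-bit (compressed)
      bits [1:0] == 11 and bits [4:2] != 111  -> 32-bit
    Longer encodings (48-, 64-bit, ...) are left out of this sketch.
    """
    if parcel & 0b11 != 0b11:
        return 16
    if parcel & 0b11100 != 0b11100:
        return 32
    raise NotImplementedError("48-bit and longer encodings not modeled")
```

For example, the low parcel of a base ADDI encoding (0x0013) decodes as 32-bit, while a compressed parcel such as 0x4501 decodes as 16-bit.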

Base RV32I ISA instructions have a length of 32 bits. The specification defines several instruction formats (Fig. 5): R-type for register-only operations, I-type for instructions with immediate operands (including loads), S-type for store instructions, B-type for branches, U-type for upper-bits immediate operands, and J-type for jump/call instructions. Such unification significantly simplifies decoding.

Fig 2. Order of operations in Inverse Cipher and Equivalent Inverse Cipher

The AES decryption operation sequence consists of the following steps (Fig. 3).

Fig 5. Types of instructions



All RISC-V standard extensions use these instruction formats and can define their own types/subtypes. For example, the single-precision (F) and double-precision (D) floating-point extensions define multiply-add operations with 3 source and 1 destination register operands as the so-called R4-type subtype of the R-type format (Fig. 6).

Fig. 6. R- and R4-type of instruction

The base opcode map for 32-bit instructions is represented in Fig. 7.

Fig. 7. Base opcode map

Already in the basic ISA, some opcode space is reserved for custom instructions. For example, the RV32 and RV64 architectures can use the so-called custom-0 and custom-1 opcodes.

A. Staying Compliant with Standard RISC-V Extensibility Features

For the initial evaluation exercise, we intentionally stay within the capabilities provided by the standard basic instruction set and its extensibility features. We intend to explore all the supported basic XLEN values, but this initial work focuses specifically on the RV32GC basic set.

In addition, for this work we do not use advanced features like additional I/O ports, non-standard registers, and load-store units. Instead, the proposed implementation utilizes part of the standard opcode space reserved for custom extensions, the so-called "custom0/custom1" opcodes.

IV. CUSTOM AES EXTENSION DESIGN

A. Custom AES Instructions Functional Description

The algorithm described in [1] consists of transformations, namely SubBytes, ShiftRows, MixColumns, AddRoundKey, and their inverses. Each operation transforms a single 128-bit block, also named the "state". The AddRoundKey transform requires an additional 128-bit argument called the "round key". Each operand is split into a high-bits part and a low-bits part. The detailed description of these transformations can be found in Chapter 5 of [1].

Based on the analysis of the algorithm, the following instruction basis can be proposed:

• XOR128;
• AESENC;
• AESENCLAST;
• AESDEC;
• AESDECLAST;
• KEYGENASSIST.

The rest of this section contains a functional description of the basic operations in pseudo-code.

1) XOR128

The AddRoundKey transformation from [1] is basically a 128-bit XOR operation over the operands.

Input: state_high[64], state_low[64], key_high[64], key_low[64]
state_high[63:0] := state_high[63:0] ^ key_high[63:0]
state_low[63:0]  := state_low[63:0] ^ key_low[63:0]
Output: state_high[64], state_low[64]
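The same semantics in Python (a behavioral model of the pseudo-code, not the RTL):

```python
MASK64 = (1 << 64) - 1

def xor128(state_high, state_low, key_high, key_low):
    """XOR128/AddRoundKey: two independent 64-bit XORs over the operand pair."""
    return (state_high ^ key_high) & MASK64, (state_low ^ key_low) & MASK64
```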

2) AESENC

A composition of the ShiftRows, SubBytes, MixColumns, and AddRoundKey transformations, which performs a single round of encryption.

Input: state_high[64], state_low[64], key_high[64], key_low[64]
tmp[63:0]   := state_high[63:0]
tmp[127:64] := state_low[63:0]
tmp[127:0]  := ShiftRows(tmp[127:0])
tmp[127:0]  := SubBytes(tmp[127:0])
tmp[127:0]  := MixColumns(tmp[127:0])
state_high[63:0], state_low[63:0] := XOR128(tmp[63:0], tmp[127:64], key_high[63:0], key_low[63:0])
Output: state_high[64], state_low[64]

3) AESENCLAST

A special kind of AESENC instruction that is performed during the last stage of encryption. This instruction combines the ShiftRows, SubBytes, and AddRoundKey transformations.

Input: state_high[64], state_low[64], key_high[64], key_low[64]
tmp[63:0]   := state_high[63:0]
tmp[127:64] := state_low[63:0]
tmp[127:0]  := ShiftRows(tmp[127:0])
tmp[127:0]  := SubBytes(tmp[127:0])
state_high[63:0], state_low[63:0] := XOR128(tmp[63:0], tmp[127:64], key_high[63:0], key_low[63:0])
Output: state_high[64], state_low[64]

4) AESDEC

Performs the transformation of a single decryption round, which is a composition of the InvShiftRows, InvSubBytes, AddRoundKey, and InvMixColumns transformations. It should be noted that this instruction follows the Inverse Cipher procedure described in Chapter 5.3 of [1], not the Equivalent Inverse Cipher.

Input: state_high[64], state_low[64], key_high[64], key_low[64]
tmp[63:0]   := state_high[63:0]
tmp[127:64] := state_low[63:0]
tmp[127:0]  := InvShiftRows(tmp[127:0])
tmp[127:0]  := InvSubBytes(tmp[127:0])
tmp[63:0], tmp[127:64] := XOR128(tmp[63:0], tmp[127:64], key_high[63:0], key_low[63:0])
tmp[127:0]  := InvMixColumns(tmp[127:0])
state_high[63:0] := tmp[63:0]
state_low[63:0]  := tmp[127:64]
Output: state_high[64], state_low[64]



5) AESDECLAST

This instruction, similar to AESENCLAST, performs the last stage of decryption. It combines the InvShiftRows, InvSubBytes, and AddRoundKey transformations.

Input: state_high[64], state_low[64], key_high[64], key_low[64]
tmp[63:0]   := state_high[63:0]
tmp[127:64] := state_low[63:0]
tmp[127:0]  := InvShiftRows(tmp[127:0])
tmp[127:0]  := InvSubBytes(tmp[127:0])
state_high[63:0], state_low[63:0] := XOR128(tmp[63:0], tmp[127:64], key_high[63:0], key_low[63:0])
Output: state_high[64], state_low[64]

6) KEYGENASSIST

This is a helper instruction, used in the process of round key expansion.

Two auxiliary functions are introduced to simplify the notation. The SubWord transformation performs byte substitution sequentially, similar to SubBytes. RotWord performs a byte-wise rotation of a 32-bit word: for a word of bytes [a0, a1, a2, a3], RotWord returns [a1, a2, a3, a0].

Input: state_high[64], state_low[64], rcon[8]
tmp1[31:0] := state_high[31:0]
tmp2[31:0] := state_high[63:32]
tmp3[31:0] := state_low[31:0]
tmp4[31:0] := state_low[63:32]
state_high[31:0]  := RotWord(SubWord(tmp1[31:0])) ^ rcon[7:0]
state_high[63:32] := SubWord(tmp1[31:0])
state_low[31:0]   := RotWord(SubWord(tmp3[31:0])) ^ rcon[7:0]
state_low[63:32]  := SubWord(tmp3[31:0])
Output: state_high[64], state_low[64]
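RotWord on its own is easy to model. Note the byte-order assumption is ours: taking a0 as the most significant byte of the word, the rotation is a 32-bit left rotate by 8 bits.

```python
def rot_word(w: int) -> int:
    """RotWord: rotate a 32-bit word by one byte, [a0,a1,a2,a3] -> [a1,a2,a3,a0].

    With a0 taken as the most significant byte, this is a left rotate by 8 bits.
    """
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF
```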

V. CUSTOM AES ISA EXTENSION

A. Operand Format and FP Register File Reuse

Each of the instructions in the proposed AES extension takes two 128-bit input operands, performs the required sequence of transformations, and returns one 128-bit output.

In this initial implementation, the proposed AES extension uses pairs of 64-bit floating-point registers as aliases for 128-bit operands. The half that contains the most significant bits of an argument is called the "high part" (state_high) and the other half is called the "low part" (state_low).

This approach is practical (vs dedicated 128-bit storage) and helps minimize the additional area required to support the extension.
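The register-pairing convention can be modeled simply (helper names are ours, purely illustrative):

```python
MASK64 = (1 << 64) - 1

def split128(x: int):
    """Split a 128-bit operand into its (high, low) 64-bit register-pair halves."""
    return (x >> 64) & MASK64, x & MASK64

def join128(high: int, low: int) -> int:
    """Reassemble a 128-bit value from the (high, low) register pair."""
    return ((high & MASK64) << 64) | (low & MASK64)
```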

B. AES Operational Basis Interface Requirements

The functional basis proposed in the previous section allows the encryption, decryption and key expansion procedures to be specified in terms of rounds. This simplifies the programming process because the number of instructions stacked together corresponds exactly to the number of transformation rounds. Thus, only 6 custom instructions are required for the AES encryption/decryption algorithms.

The data flow of the AES algorithm allows accumulative updating of the 128-bit state block. Correspondingly, when operating over a register file with 64-bit entries, every instruction needs four input registers, and two of those registers should be updated as the result of the transformation.

The proposed basis fits into the standard R4-type instruction format, but does not strictly comply with the standard R4-type instruction template semantics. An ordinary R4-type instruction takes three source registers and returns its result to one destination register, while the described basis requires two destination registers. This suggests a specific modification of the R4-type instructions, as described in the following section.

C. AES Custom Extension Instruction Format

To implement the instruction-level interface of the described functionality, we introduce an additional format as a subset of the custom0/custom1 R4 (Fig. 6) format encoding. In this newly introduced format, we define instructions with funct3 values of '001' and '101' within the R4-type. The funct2 field encodes the different AES transformations.

Each instruction of the AES custom extension set takes 4 FPU registers as operands. Every such instruction has one important change in the operation semantics: the 'rd' and 'rs3' arguments are both source and destination registers, while other opcodes in this format have 3 source operands and a single destination operand. To prevent data corruption, we do not allow instructions in which a single register is used for both the 'rd' and 'rs3' destinations.

A full listing of the proposed custom AES extension instructions is given in Table I.<br />

TABLE I. AES EXTENSION INSTRUCTIONS<br />
Instruction | rs3 (31:27) | funct2 (26:25) | rs2 (24:20) | rs1 (19:15) | funct3 (14:12) | rd (11:7) | opcode (6:0)<br />
AESENC | any | 00 | any | any | 001 | any | 0001011<br />
AESENCLAST | any | 01 | any | any | 001 | any | 0001011<br />
AESDEC | any | 10 | any | any | 001 | any | 0001011<br />
AESDECLAST | any | 11 | any | any | 001 | any | 0001011<br />
XOR128 | any | 00 | any | any | 101 | any | 0001011<br />
AESKEYGENASSIST | any | 11 | any | any | 101 | any | 0001011<br />
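The field layout in Table I can be turned into a small encoder sketch. The bit positions come directly from the table header; the helper names below are ours, not part of the authors' toolchain.

```cpp
#include <cassert>
#include <cstdint>

// Packs a 32-bit R4-type instruction word from the fields of Table I:
// rs3[31:27], funct2[26:25], rs2[24:20], rs1[19:15], funct3[14:12],
// rd[11:7], opcode[6:0].
uint32_t encode_r4(uint32_t rs3, uint32_t funct2, uint32_t rs2,
                   uint32_t rs1, uint32_t funct3, uint32_t rd,
                   uint32_t opcode) {
    return (rs3 << 27) | (funct2 << 25) | (rs2 << 20) |
           (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode;
}

// AESENC per Table I: funct2 = 00, funct3 = 001, opcode = 0001011 (custom0).
uint32_t encode_aesenc(uint32_t rd, uint32_t rs1, uint32_t rs2, uint32_t rs3) {
    return encode_r4(rs3, 0b00, rs2, rs1, 0b001, rd, 0b0001011);
}
```

For example, `encode_aesenc(1, 2, 3, 4)` yields the word `0x2031108B`.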

Overall, if the implementation allows full operand-register flexibility (which is preferable from<br />
the instrumentation point of view, but not required), the proposed extension occupies around 6M<br />
opcodes (mainly due to the operand address fields), leaving more than 90% (~56 million<br />
combinations) of the RV32 opcode space available (Table II below).<br />
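The opcode-space figures can be sanity-checked with simple arithmetic. This rough check ignores the small correction for excluded rd == rs3 encodings, so it reproduces the exact percentage in Table II rather than the slightly smaller adjusted count.

```cpp
#include <cassert>
#include <cstdint>

// Back-of-the-envelope check of the custom opcode space. Each custom major
// opcode fixes the 7 opcode bits, leaving 25 free bits (2^25 ~ 32M encodings
// each for custom0 and custom1). An AES instruction also fixes funct3 and
// funct2, leaving only the four 5-bit register fields free: 2^20 encodings
// per instruction, times six instructions.
constexpr uint64_t kPerMajorOpcode = 1ull << 25;           // custom0 or custom1
constexpr uint64_t kTotalSpace     = 2 * kPerMajorOpcode;  // 64M combinations
constexpr uint64_t kPerAesInsn     = 1ull << 20;           // 4 x 5 register bits
constexpr uint64_t kAesUsed        = 6 * kPerAesInsn;      // six instructions
```

The used fraction is 6 x 2^20 / 2^26 = 9.375%, matching Table II.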

TABLE II. CUSTOM INSTRUCTIONS OPCODE SPACE<br />
  | custom0 | custom1 | Total | %<br />
Full opcode space | 2^25 ~32M | 2^25 ~32M | 64M | 100<br />
Number of possible opcodes used for AES instructions (a) | 6x2^20 - 2^15 ~6M | 0 | 6M | 9.375<br />
Opcode space still available | 3x2^23 + 2^21 + 2^15 >26M | 32M | 56M | 90.625<br />
a. With full 4-operand register flexibility<br />



VI.<br />

AES FUNCTIONAL UNIT IMPLEMENTATION<br />

This section provides a high-level overview of the AES functional unit which implements the<br />
proposed ISA extension. The full RTL code of the module is published in [6].<br />

A. AES Module Block Diagram<br />

The high-level AES functional unit diagram is shown in Fig. 8.<br />

As this evaluation exercise is ISA-centric, the AES unit implementation is straightforward. The<br />
unit has four 64-bit input operands and two 64-bit results. The unit is implemented as a<br />
single-stage pipeline, where the input operands and the command are latched in the input register<br />
for subsequent processing. The computational part of the current unit is implemented as<br />
combinational logic. Correspondingly, the latency of the initial AES unit implementation is just a<br />
single clock in the current prototype. The data path is rather simple, however, and can be<br />
pipelined if necessary. The “Substitution” and “Inverse Substitution” units are simple 256-entry<br />
LUTs. The “Keygen” unit executes a bit interleaver as a logic data path. The “Shift rows” and<br />
“Mix columns” units execute row and column interleaving, correspondingly.<br />
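The contents of the 256-entry substitution LUT are fixed by FIPS-197. As an illustration of what the LUT holds (this is standard AES, not the authors' RTL), the table can be generated programmatically from the GF(2^8) inverse plus the affine transform:

```cpp
#include <cassert>
#include <cstdint>

// GF(2^8) multiplication modulo the AES polynomial x^8 + x^4 + x^3 + x + 1.
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    for (int i = 0; i < 8; ++i) {
        if (b & 1) p ^= a;
        bool hi = a & 0x80;
        a <<= 1;
        if (hi) a ^= 0x1b;
        b >>= 1;
    }
    return p;
}

// Multiplicative inverse as a^254 (the group has order 255); inv(0) := 0.
static uint8_t gf_inv(uint8_t a) {
    uint8_t r = 1;
    for (int i = 0; i < 254; ++i) r = gf_mul(r, a);
    return a ? r : 0;
}

static uint8_t rotl8(uint8_t x, int n) {
    return static_cast<uint8_t>((x << n) | (x >> (8 - n)));
}

// One S-box entry: affine transform of the field inverse, per FIPS-197.
uint8_t sbox(uint8_t x) {
    uint8_t b = gf_inv(x);
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63;
}
```

In hardware, of course, the 256 precomputed bytes are simply stored as a LUT rather than derived on the fly.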

Fig. 9. AES block interface<br />

B. Implementation Complexity<br />

The described implementation has been synthesized at<br />

TSMC 28nm library. Synthesis results are included in Table III<br />

below.<br />

TABLE III. IMPLEMENTATION COMPLEXITY<br />
Module | Basic core complexity, kGates | Extended core complexity, kGates | Diff, %<br />
Core total, logic only | 275 | 307 | 11.5<br />
AES module | — | 20.8 | 7.5<br />
Other core modifications | — | 10.9 | 4<br />

Fig. 8. AES functional unit diagram<br />
AES block interfacing with the RISC-V pipeline is shown in Fig. 9.<br />
VII. EXPERIMENTAL RESULTS AND COMPARISON<br />

A. Software Packages Used<br />

The AES algorithm described in [1] was implemented in C++. This implementation is based on the<br />
Botan library [4]. Conventional implementations of AES benefit from well-known optimization<br />
tricks, for example, combining the ShiftRows and MixColumns transformations into a single<br />
operation. The Botan library supports cryptographic extensions for various architectures,<br />
including Intel AES-NI and the ARMv8 cryptographic extension.<br />

The correctness of the resulting application was tested using reference vectors from [1]. A small<br />
cross-platform benchmarking library was implemented in C++11 for correctness verification and<br />
performance testing. The implementation was compiled with the MSVC 2015 compiler for Windows 10<br />
(64-bit), g++ 5.4.1 for Linux (64-bit), g++ 5.2.0 for RISC-V Linux, and g++ 6 for Raspberry Pi 3.<br />

B. Reference Implementations<br />

The Intel AES-NI extension supports an AES instruction subset with 128-bit register/memory<br />
operands [5]:<br />

• AESENC. This instruction performs a single round<br />

of encryption. The instruction combines the four<br />

transformations of the AES algorithm, namely:<br />

ShiftRows, SubBytes, MixColumns &<br />

AddRoundKey (as described in [1]) into a single<br />

instruction.<br />



• AESENCLAST. Instruction for the last round of<br />

encryption. Combines the ShiftRows, SubBytes,<br />

and AddRoundKey steps into one instruction.<br />

• AESDEC. Instruction for a single round of<br />

decryption. This combines the four steps of AES<br />

transformations InvShiftRows, InvSubBytes,<br />

InvMixColumns, AddRoundKey into a single<br />

instruction.<br />

• AESDECLAST. Performs last round of decryption.<br />

It combines InvShiftRows, InvSubBytes,<br />

AddRoundKey into a single instruction.<br />

• AESKEYGENASSIST assists in generating the round keys used for encryption.<br />

• AESIMC is used for converting the encryption<br />

round keys to a form suitable for decryption using<br />

the Equivalent Inverse Cipher.<br />

ARMv8 also has a cryptographic extension with 128-bit coprocessor SIMD register operands [3]:<br />

• AESD – AES single round decryption.<br />

• AESE – AES single round encryption.<br />

• AESIMC – AES inverse MixColumns<br />

transformation.<br />

• AESMC – AES MixColumns transformation.<br />

C. AES Operation-to-Instruction Mapping<br />

In this section, we compare mappings of the AES cipher operations in the designed extension with<br />
functionally similar extensions in selected contemporary ISAs.<br />

The AES encryption operation mapping is shown in Fig. 10. One can observe that Intel AES-NI and<br />
the proposed custom AES extension for RISC-V have similar instruction functionality. The number of<br />
executed instructions is equal to the number of round keys in both cases. The corresponding ARMv8<br />
extension is different and requires twice as many instructions per round.<br />

For decryption, both AES-NI and ARMv8 use an equivalent inverse cipher algorithm, which requires a<br />
separate set of round keys that differs from the encryption keys (Fig. 11). Again, as in the<br />
cipher algorithm, the number of executed instructions in the RISC-V extension is equal to the<br />
number of round keys. It equals the number of instructions in the AES-NI instruction set, whereas<br />
ARMv8 requires twice as many instructions to be executed.<br />
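The instruction-count relationship described above can be written down as a rough model (loads and stores excluded; the function names are ours). For Nr rounds there are Nr + 1 round keys; AES-NI and the proposed RISC-V extension use one data-processing instruction per round key, while ARMv8 pairs an AESE with an AESMC for all but the final round and closes with an EOR.

```cpp
#include <cassert>

// Data-processing instructions per encrypted block, per the mappings in
// Fig. 10. Nr is the number of AES rounds (10 for AES-128, 14 for AES-256).
int insns_aesni_or_rv(int nr) { return nr + 1; }  // xor + (nr-1) enc + enclast
int insns_armv8(int nr)       { return 2 * nr; }  // (nr-1) x (aese+aesmc) + aese + eor
```

For AES-128 this gives 11 versus 20 instructions, which is the "twice as many per round" difference visible in the listings.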

The advantages of a hardware implementation of the AES transformations are described in detail in<br />
the chapter “Software Side Channels and the AES Instructions” of [5].<br />
However, the proposed custom RISC-V extension for AES, in contrast to AES-NI, decomposes the main<br />
AES Inverse Cipher algorithm, which uses the same set of round keys as the Cipher algorithm does.<br />

D. Instruction Scheduling Comparison<br />

In this section, we compare disassembled code<br />

corresponding to the main encryption and decryption loops. A<br />

well-scheduled sequence of instructions may significantly<br />

improve computational performance of an algorithm.<br />

Fig. 10. AES encryption instructions for different ISAs<br />

Fig. 11. AES decryption instructions for different ISAs<br />



The following samples contain a block of C++ code followed by its disassembly. Round keys are<br />
assumed to be stored in the m_EK array. Each code fragment applies 11 round keys to each 128-bit<br />
block.<br />
Listing 1 contains a code fragment that uses the Intel AES-NI instruction set for single-block<br />
encryption. We can see instructions loading round keys from memory; the number of registers is not<br />
sufficient to preload the keys into registers outside the main loop. The total number of<br />
instructions in the loop is 28, and the total number of used ‘xmm’ registers is 3 out of 8 for the<br />
AES-128 example below.<br />

Listing 1. Block encryption using Intel AES-NI instruction set<br />

void<br />

encrypt(block_type const& plain_text_block,<br />

block_type& cipher_block) const<br />

{<br />

__m128i B =<br />

_mm_loadu_si128(std::addressof(plain_text_block));<br />

B = _mm_xor_si128(B, _mm_loadu_si128(&m_EK[0]));<br />

for (int i = 1; i < number_of_round_keys - 1; ++i)<br />

{<br />

B = _mm_aesenc_si128(B,<br />

_mm_loadu_si128(&m_EK[i]));<br />

}<br />

B = _mm_aesenclast_si128(B,<br />

_mm_loadu_si128(&m_EK[number_of_round_keys - 1]));<br />

_mm_storeu_si128(std::addressof(cipher_block), B);<br />

}<br />

...<br />

.L217:<br />

movdqu (%rdx), %xmm0<br />

addq $16, %rdx<br />

addq $16, %r8<br />

movdqu (%rax), %xmm2<br />

pxor %xmm2, %xmm0<br />

movdqu 16(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 32(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 48(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 64(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 80(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 96(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 112(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 128(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 144(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 160(%rax), %xmm1<br />

aesenclast %xmm1, %xmm0<br />

movups %xmm0, -16(%r8)<br />

cmpq %rsi, %rdx<br />

jne .L217<br />

The code fragment in Listing 2 uses the ARMv8 cryptographic instructions. Again, we can see<br />
instructions loading round keys from memory; the number of registers is again not sufficient to<br />
preload the keys outside of the main loop. The total number of instructions in the loop is 40, and<br />
the total number of used ‘q’ registers is 7 (14 ‘d’ registers out of 16) for AES-128.<br />

Listing 2. Block encryption using ARMv8 cryptographic instructions<br />

void<br />

encrypt(block_type const& plain_text_block,<br />

block_type& cipher_block) const<br />

{<br />

auto const p_inp = reinterpret_cast&lt;uint8_t const*&gt;(&plain_text_block);<br />

auto const p_out = reinterpret_cast&lt;uint8_t*&gt;(&cipher_block);<br />

auto data = vld1q_u8(p_inp);<br />

{<br />

auto const& keys = m_EK;<br />

for (int i = 0; i < number_of_round_keys - 2;<br />

++i) {<br />

auto const p_key = reinterpret_cast&lt;uint8_t const*&gt;(&keys[i]);<br />

auto const rkey = vld1q_u8(p_key);<br />

data = vaeseq_u8(data, rkey);<br />

data = vaesmcq_u8(data);<br />

}<br />

{<br />

auto const p_key = reinterpret_cast&lt;uint8_t const*&gt;(&keys[number_of_round_keys - 2]);<br />

auto rkey = vld1q_u8(p_key);<br />

data = vaeseq_u8(data, rkey);<br />

}<br />

{<br />

auto const p_key = reinterpret_cast&lt;uint8_t const*&gt;(&keys[number_of_round_keys - 1]);<br />

auto const rkey = vld1q_u8(p_key);<br />

data = veorq_u8(data, rkey);<br />

}<br />

}<br />

vst1q_u8(p_out, data);<br />

}<br />

...<br />

.L164:<br />

sub r0, fp, #1012<br />

vld1.8 {d20-d21}, [r0:64]<br />

ldr r0, [fp, #-1060]<br />

vld1.8 {d18-d19}, [r0:64]<br />

ldr r0, [fp, #-1076]<br />

vld1.8 {d16-d17}, [r3:128]!<br />

cmp r3, ip<br />

aese.8 q8, q10<br />

vld1.8 {d22-d23}, [r1:64]<br />

vld1.8 {d20-d21}, [r2:64]<br />

aesmc.8 q8, q8<br />

vld1.8 {d28-d29}, [lr:64]<br />

aese.8 q8, q9<br />

vld1.8 {d26-d27}, [r10:64]<br />

vld1.8 {d18-d19}, [r8:64]<br />

aesmc.8 q8, q8<br />

vld1.8 {d24-d25}, [r4:64]<br />

aese.8 q8, q11<br />

vld1.8 {d22-d23}, [r5:64]<br />

aesmc.8 q8, q8<br />

aese.8 q8, q10<br />

vld1.8 {d20-d21}, [r6:64]<br />

aesmc.8 q8, q8<br />

aese.8 q8, q9<br />

vld1.8 {d18-d19}, [r7:64]<br />

aesmc.8 q8, q8<br />

aese.8 q8, q14<br />

aesmc.8 q8, q8<br />



aese.8 q8, q13<br />

aesmc.8 q8, q8<br />

aese.8 q8, q12<br />

aesmc.8 q8, q8<br />

aese.8 q8, q11<br />

aesmc.8 q8, q8<br />

aese.8 q8, q10<br />

veor q8, q9, q8<br />

vst1.8 {d16-d17}, [r0:128]!<br />

str r0, [fp, #-1076]<br />

bne .L164<br />

For AES-256:<br />

.L185:<br />

sub r3, fp, #1012<br />

vld1.8 {d16-d17}, [r2:128]<br />

.L184:<br />

vld1.8 {d18-d19}, [r3:64]!<br />

cmp r9, r3<br />

aese.8 q8, q9<br />

aesmc.8 q8, q8<br />

bne .L184<br />

add r2, r2, #16<br />

cmp r2, r0<br />

vld1.8 {d20-d21}, [r9:64]<br />

vld1.8 {d18-d19}, [r10:64]<br />

aese.8 q8, q10<br />

veor q8, q9, q8<br />

vst1.8 {d16-d17}, [r1:128]!<br />

bne .L185<br />

Listing 3 demonstrates efficient unrolling of the main encryption loop. It can be noticed that no<br />
key-loading operations interfere with the round transformation sequence. All keys are preloaded<br />
into the ‘f’ registers as loop-invariant variables. The total number of instructions is 18, and<br />
the total number of used ‘f’ registers is 24 out of 32 in the AES-128 example below.<br />

Listing 3. Block encryption using the proposed RISC-V AES custom instruction<br />

set extension<br />

void<br />

encrypt(block_type const& plain_text_block,<br />

block_type& cipher_block) const<br />

{<br />

uint128_type B = reinterpret_cast&lt;uint128_type const&&gt;(plain_text_block);<br />

{<br />

B ^= m_EK[0];<br />

for (int i = 1; i < number_of_round_keys - 1; ++i)<br />

B.aesenc(m_EK[i]);<br />

}<br />

B.aesenclast(m_EK[number_of_round_keys - 1]);<br />

reinterpret_cast&lt;uint128_type&&gt;(cipher_block) = B;<br />

}<br />

...<br />

.L183:<br />

fld fa3,0(a5)<br />

fld fa5,8(a5)<br />

add a4,a4,16<br />

add a5,a5,16<br />

sc_xor128 fa3, fa5, fs6, fs8<br />

sc_aesenc fa3, fa5, ft6, fs1<br />

sc_aesenc fa3, fa5, fs2, fs4<br />

sc_aesenc fa3, fa5, ft4, ft8<br />

sc_aesenc fa3, fa5, fs3, fs0<br />

sc_aesenc fa3, fa5, ft2, ft10<br />

sc_aesenc fa3, fa5, fs7, fs5<br />

sc_aesenc fa3, fa5, ft0, fs10<br />

sc_aesenc fa3, fa5, fs11, fs9<br />

sc_aesenc fa3, fa5, fa2, fa6<br />

sc_aesenclast fa3, fa5, fa0, fa4<br />

fsd fa3,-16(a4)<br />

fsd fa5,-8(a4)<br />

bne a5,a2,.L183<br />

Tables IV and V summarize the number of instructions and registers needed for block<br />
encryption/decryption with 128-bit and 256-bit keys, respectively.<br />

TABLE IV. AES-128 RESOURCES USED<br />
ISA | Instructions per block | Number of used registers | Load/store memory operations (per loop)<br />
Intel x86 AES-NI | 28 | 3 xmm (from 8) | 12 load + 1 store<br />
ARMv8 aarch64-crypto | 39 | 7 vector ‘q’ (or 14 FPU ‘d’ from 16) | 15 load + 1 store<br />
Custom RV32 AES extension (single invocation (a)) | 40 | 4 from 32 FPU ‘d’ | 24 load + 2 store<br />
Custom RV32 AES extension (inlined in a loop, round keys are preloaded (b)) | 18 | 24 from 32 FPU ‘d’ | 2 load + 2 store<br />
a. Use case: standalone “deep” invocation.<br />
b. Use case: steady-state block encryption/decryption with the same set of round keys.<br />

TABLE V. AES-256 RESOURCES USED<br />
ISA | Instructions per block | Number of used registers | Load/store memory operations (per loop)<br />
Intel x86 AES-NI | 36 | 3 xmm (from 8) | 16 load + 1 store<br />
ARMv8 aarch64-crypto | 75 | 3 vector ‘q’ (or 6 FPU ‘d’ from 16) | 16 load + 1 store<br />
Custom RV32 AES extension (single invocation) | 52 | 4 from 32 FPU ‘d’ | 32 load + 2 store<br />
Custom RV32 AES extension (inlined in a loop, round keys are preloaded) | 22 | 32 from 32 FPU ‘d’ | 2 load + 2 store<br />

As can be seen from the tables above, the compiler-produced code generated for block encryption<br />
using the Intel AES-NI and ARMv8 crypto extensions does not depend on the invocation type and<br />
stays the same for both inlined functions and “deep” standalone calls. The number of available<br />
registers is a known limiting factor on these platforms: it is not enough to hold all the<br />
loop-invariant variables in registers, which forces key reloading for every new data block.<br />

For the proposed RISC-V custom extension, key loading is only required for standalone,<br />
context-free “deep” calls. In contrast, for block encryption operations called in a loop, the<br />
compiler hoists the key-loading sequence out of the loop as loop-invariant code, so it is executed<br />
only once. This reduces the number of instructions required per block by more than 2x and removes<br />
unnecessary key loads.<br />
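The hoisting effect described above can be sketched in plain C++ (simplified to 64-bit XOR "rounds"; names and structure are ours, not the paper's code). When the key material fits in registers, only the per-block load and store remain inside the loop.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// keys[] stands in for preloaded, loop-invariant round keys.
uint64_t encrypt_block(uint64_t block, const uint64_t keys[4]) {
    for (int r = 0; r < 4; ++r) block ^= keys[r];
    return block;
}

// The keys argument is read once per call site; a compiler with enough
// registers keeps it register-resident across all loop iterations instead
// of reloading it for every block.
std::vector<uint64_t> encrypt_all(const std::vector<uint64_t>& blocks,
                                  const uint64_t keys[4]) {
    std::vector<uint64_t> out;
    out.reserve(blocks.size());
    for (uint64_t b : blocks)
        out.push_back(encrypt_block(b, keys));
    return out;
}
```

This is the shape the RISC-V compiler achieves in Listing 3: key loads outside the loop, 2 loads + 2 stores per block inside it.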

It should also be noted that the experimental RISC-V CPU implementation only supports 64-bit loads<br />
and stores, while the other considered CPUs can operate with 128-bit operands. We intend to extend<br />
this exercise to the RV64 and RV128 platforms upon availability.<br />

E. Performance Comparison<br />

In this section, we compare the resulting instruction sequences for 128-bit block encryption using<br />
the two ISAs. Summary results are included in Table VI below; reference implementations in C++ for<br />
128-bit block encryption and the corresponding assembly listings are included in Section VII.D,<br />
and the full sources are published in [6].<br />



TABLE VI. AES-128 ENCRYPTION BENCHMARKS<br />
Platform | MHz | Without extensions MB/MHz | Without extensions clocks/block | Using extensions MB/MHz | Using extensions clocks/block | Speedup<br />
Ubuntu 16.04, Intel 6800K | 3400 | 0.100 | 80.06 | 1.140 | 7.02 | 11.4<br />
Windows 10, Intel Core i7 | 2500 | 0.086 | 186.42 | 0.943 | 16.97 | 10.99<br />
Ubuntu 16.04, Intel Core i3 | 2300 | 0.092 | 173.727 | 0.312 | 51.293 | 3.39<br />
SC RISC-V FPGA implementation | 20 | 0.015 | 1077.44 | 0.725 | 22.07 | 48.82<br />

VIII. CONCLUSION AND FUTURE WORK<br />

In this work, we considered the case of a custom instruction-level RISC-V ISA extension for the<br />
familiar AES cryptography suite.<br />

We review the AES algorithms and propose a functional decomposition and a possible operational<br />
basis for AES acceleration, and then describe the design of the corresponding ISA extension as<br />
well as the execution unit, which covers the full requirements of the AES standard [1].<br />

The proposed custom AES ISA is implemented as an extension of the SCR5 RV32GC processor core [7]<br />
using the basic extensibility features of the RISC-V ISA. The extended core exists as a real-time<br />
FPGA prototype with a full end-to-end AES-based SW stack integrated.<br />

The resulting implementation has been benchmarked using the open-source SW library, and the<br />
results have been compared with the AES extensions of other modern ISAs, based on contemporary CPU<br />
implementations.<br />

We have demonstrated that, even with the basic extensibility features provided by the RISC-V ISA,<br />
staying “compatible” with the standard instruction formats and using only the standard<br />
architectural register file, it is possible to produce a high-quality ISA extension even for the<br />
RV32GC baseline architecture, comparable in all aspects with modern commercial systems.<br />

The proposed extension is minimalistic and includes only 6 new instructions, which can be easily<br />
deployed by the standard RISC-V compiler to produce high-quality results, competitive with the<br />
best contemporary AES implementations on other ISAs, both in code density and performance. It<br />
should be noted that the RISC-V extension performance results have been obtained using a 32-bit<br />
baseline architecture over 64-bit register files, while the referenced architectures are both<br />
64-bit with 128-bit register capabilities.<br />

Although the AES algorithm is not memory-bound, some additional benefits can be expected from<br />
deploying wider, 64-bit or 128-bit baseline architectures (although, for the latter, support in<br />
the C/C++ languages is somewhat behind, as none of the current standards has introduced portable<br />
int128_t or uint128_t integer types). This is part of the follow-up work: the authors intend to<br />
advance the proposed custom AES extension implementation to the RV64 and RV128 baseline RISC-V<br />
architecture cases to evaluate the incremental benefits of wider architectures. Additionally, the<br />
recently published V standard extension proposal [10] is currently under review for potential<br />
applicability.<br />

As a minor side result, we have noticed some improvement opportunities in the extensibility-related<br />
aspects of the current SW infrastructure. We have accumulated these and communicated them to the<br />
RISC-V GCC toolchain maintainers.<br />

REFERENCES<br />

[1] NIST FIPS Publication 197, “Advanced Encryption Standard (AES),” Federal Information<br />
Processing Standards, 2001.<br />

[2] C. Moore, “Data processing in exascale class computer systems,” The Salishan Conference on<br />
High Speed Computing, 2011. Available:<br />
http://www.lanl.gov/conferences/salishan/salishan2011/3moore.pdf<br />

[3] ARM® Cortex®-A57 MPCore Processor Cryptography Extension.<br />

Technical Reference Manual (2015). Available:<br />

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0514g/DDI0514G<br />

_cortex_a57_mpcore_cryptography_trm.pdf .<br />

[4] Botan: Crypto and TLS for C++11 URL: https://botan.randombit.net/<br />

[5] S. Gueron, “Intel® Advanced Encryption Standard (AES) New Instructions Set.” Available:<br />
https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf<br />

[6] Software and hardware implementation of AES using custom instructions.<br />

URL: https://github.com/syntacore/aes-paper<br />

[7] Syntacore | custom cores and tools. URL: https://syntacore.com/<br />

[8] T. N. Theis and H.-S. P. Wong, “The end of Moore’s Law: A new<br />

beginning for information technology,” Computing in Science &<br />

Engineering, vol. 19, no. 2, pp. 41–50, 2017.<br />

[9] “The RISC-V Instruction Set Manual. Volume 1: User-Level ISA,<br />

Document Version 2.2”, Editors: A. Waterman, K. Asanović, RISC-V<br />

Foundation, May 2017.<br />

[10] K. Asanovic, R. Espasa, “The RISC-V Vector ISA”. Available:<br />

https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-<br />

RISCVRogerEspasaVEXT-v4.pdf<br />



A RISC-V Based Open Hardware Platform for<br />

Wearable Smart Sensors<br />

Manuel Eggimann*, Stefan Mach*, Michele Magno*°, Luca Benini*°<br />

*Dept. of Information Technology and Electrical Engineering, ETH Zurich, Switzerland<br />

°Dept. of Electrical, Electronic, and Information Engineering, Università di Bologna, Italia<br />

Abstract—Wearable smart sensing is a promising technology to enhance user experience in<br />
sport/fitness, as well as health monitoring. Wearable sensing systems, like Internet of Things<br />
(IoT) systems, not only provide continuous data monitoring and acquisition but are also expected<br />
to filter, process, and extract meaningful information from the acquired data in similar ways as<br />
human experts do. Supporting continuous “smart” operation on ultra-small batteries poses unique<br />
challenges in energy efficiency. In this work, we present an ultra-low-power open embedded<br />
platform that hosts a scalable array of analogue-to-digital converters for biomedical and inertial<br />
sensors, and can parallel-process data on-board with machine learning algorithms (i.e. SVM, KNN,<br />
Neural Networks) with orders-of-magnitude higher computing power and energy efficiency compared to<br />
standard state-of-the-art microcontrollers. The platform’s compute engine is a heterogeneous<br />
multi-core parallel ultra-low power (PULP) processor based on RISC-V, capable of delivering up to<br />
2.5 GOPS. These features are provided under 55 mW power consumption, which makes the platform<br />
ideal for the battery-powered activities typical of wearable applications, with a peak 38.3x<br />
energy efficiency increase (0.7 V, 85 MHz) compared to standard microcontrollers (MCUs) with<br />
similar power budgets. The wearable platform can be interfaced with “electronic skin” (e-skin)<br />
arrays of tactile sensors with up to 64 channels and ECG/EMG sensors with up to 8 channels.<br />
Moreover, the platform includes a Bluetooth Low Energy 5.0 module for energy-efficient wireless<br />
connectivity.<br />

Keywords—Wearable Devices, Ultra Low Power, Parallel Architecture, RISC-V, Energy Efficiency,<br />
Smart Sensing.<br />

I. INTRODUCTION<br />

Recent advances in electronics performance and miniaturization, in combination with the<br />
availability of sensors and reliable connectivity, are enabling the design of more intelligent and<br />
miniaturized devices that embed a huge amount of computational performance compared to only a few<br />
years ago. Among other smart devices, wearable devices are tightly coupled to the human body [1].<br />
Concurrently, there has been a significant increase of interest in monitoring human health during<br />
daily activities and using smart wearable devices for medical applications, aimed at reducing the<br />
hospitalization of users with chronic diseases. One of the most desirable features of wearable<br />
devices is their capability to autonomously process data from the sensors and take action through<br />
actuators [1].<br />

Machine Learning (ML) methods are playing an important role in wearable devices, and they are<br />
particularly interesting tools for various emerging applications [2]. In fact, they are used for<br />
data analysis in many domains, and they are the core of the perception units in our<br />
integrating-intelligence era. Embedding machine learning methods is expected to enable machine<br />
intelligence in lightweight wearable devices as well as robotics and prosthetics. Researchers in<br />
embedded machine learning are putting emphasis on designing specialized architectures to deal with<br />
the demand for large computational and storage capability [3]. Due to algorithm complexity and<br />
large datasets, embracing typical machine learning methods for machine intelligence is still a<br />
challenging task in battery-powered wearable devices with limited hardware capability, especially<br />
when dealing with real-time functionality.<br />

Today, microcontrollers, especially from the ARM Cortex-M family, achieve a good tradeoff between<br />
power consumption (in the order of mW) and computational resources (in the order of MOPS) [4]. A<br />
power consumption of a few tens of mW, or lower, is a must for operating battery-powered devices<br />
without too-frequent battery recharges; however, the computational resources of microcontrollers<br />
tend to be insufficient to perform on-board processing for the complex algorithms and sensors<br />
targeting biomedical/wearable applications [2]. In recent years, there have been many research<br />
efforts to design new processors that match the computational resources required for on-board data<br />
processing with state-of-the-art machine learning algorithms, as required for smart wearables.<br />
Two approaches to improve the performance of ultra-low-power processors have shown promise [5][6].<br />
The first one is to exploit parallelism as much as possible. Parallel architectures for<br />
near-threshold operation, based on multi-core clusters, have been explored in recent years with<br />
different application workloads for an implementation in a 90nm technology [3]. A second very<br />
prolific research area is exploiting low-power fixed-function hardware accelerators coupled with<br />
programmable parallel processors to retain flexibility while improving energy efficiency for<br />
specific workloads [7]. Such near-threshold parallel heterogeneous computing approaches hold great<br />
promise.<br />

In this work, we present a hardware platform that includes a heterogeneous parallel ultra-low<br />
power (PULP) processor based on four RISC-V cores. RISC-V is an open-source instruction<br />



set architecture achieving core density and performance similar to commercial and academic<br />
microprocessors based on a proprietary ISA, such as the ARM Cortex-M. The version of PULP employed<br />
in this work is capable of delivering up to 2.5 GOPS with power consumption in the range of mW.<br />
The whole platform is designed for battery-powered applications such as wearable electronics and<br />
to be interfaced with tactile sensors [7]. Evaluations in terms of power consumption,<br />
functionality, and energy efficiency are presented with experimental measurements.<br />

The rest of the paper is organized as follows: Section 2 describes the proposed wearable platform<br />
and presents the whole system architecture with the ultra-low-power multicore parallel platform<br />
(PULP) SoC, Section 3 shows the experimental results, and Section 4 concludes the paper.<br />
II. SYSTEM ARCHITECTURE<br />
Fig. 1 shows the block diagram of the proposed wearable and wireless device that targets e-health<br />
applications. In this work, we focus on the platform architecture exploiting an ultra-low-power<br />
and energy-efficient parallel processor (PULP) based on RISC-V. The platform includes two sensor<br />
interfaces: an 8-channel ECG/EMG analog front-end and a 64-channel current-input ADC intended to<br />
interface a piezoelectric tactile sensor matrix for e-skin applications [9]. Communication between<br />
the wearable platform and external devices (i.e. smartphones) is enabled by a Bluetooth Low Energy<br />
5.0 SoC with an ARM Cortex-M4F core embedded on the SoC.<br />
Fig. 1 Overview of the Platform<br />
A. Honey Bunny: A RISC-V Parallel Ultra-Low Power Processor (PULP)<br />
The core of the designed platform is a Parallel Ultra-Low Power (PULP) processor 1 with four<br />
RISC-V cores. PULP is an open-hardware platform effort by ETH Zurich and the University of Bologna<br />
featuring near-threshold multicore processing with tightly coupled memory, as presented in [7].<br />
The PULP SoC employed here, called Honey Bunny, features four RISC-V cores and is manufactured in<br />
GlobalFoundries’ 28nm CMOS process. Fig. 2 shows the block diagram of the processor, including the<br />
4 RISC-V cores and the shared memory. Honey Bunny offers a range of standard peripheral interfaces<br />
such as SPI, I2C, or UART that allow interfacing it with commercial sensors and MCUs.<br />
Fig. 2 Block Diagram of PULP Honey Bunny<br />

B. Wireless Interface<br />

The wearable platform presented here can communicate with<br />

external devices through Bluetooth Low Energy 5.0. The<br />

NRF52832 from Nordic Semiconductor was selected due to its<br />

ultra-low power consumption in transmission mode (7.1 mA<br />

current during 1 Mbit/s transmission at 0dBm) and a reasonable<br />

number of peripherals for sensor connection. A Bluetooth Low<br />

Energy custom service has been implemented to send the<br />

processed data to a smartphone where it can be stored, plotted in<br />

real time, or forwarded to the cloud. In this version of the platform, we also use the NRF52832, whose embedded ARM Cortex-M4F core interfaces both sensor front-ends via SPI. In<br />

this preliminary work, we selected this configuration to save<br />

energy when we need to acquire data, as PULP can stay in deep<br />

sleep when its processing power is not needed, and the data just<br />

needs to be transmitted directly to the BTLE client. However, a<br />

configuration where the PULP processor is directly interfaced<br />

with the sensors can also be envisioned.<br />

C. Sensor subsystems<br />

As mentioned, the platform targets e-health applications, so<br />

two analog-to-digital front-ends are included in the design. For<br />

the ECG/EMG electrodes, an ADS1298 from Texas Instruments<br />

has been chosen. The front-end integrated circuit (IC) supports eight channels, which enable conventional 12-lead ECG measurement at up to 32 kSPS per channel. Aside from the eight<br />

24-bit sigma-delta converters the chip also includes<br />

programmable gain amplifiers and circuitry to generate the<br />

necessary reference voltage for the unipolar leads.<br />

Besides ECG measurements, the platform is intended to<br />

interface tactile sensor matrices based on piezoelectric polymers<br />

for e-skin applications. This kind of sensor separates electrical charge proportionally to the applied force. It has been shown that<br />

the DDC current integrator ADC family from Texas Instruments<br />

1<br />

http://www.pulp-platform.org/<br />



is capable of providing sufficiently high sensitivity (20-bit and a<br />

maximum charge range of 150pC) and frequency response (up<br />

to 3.1 kSPS per channel) for tactile sensing applications [9]. Our<br />

platform uses the DDC264 which contains 64 channels. The<br />

DDC264 uses two integrators per channel to allow continuous<br />

integration of the input current. While one of them is<br />

accumulating charge, the other one is connected to the ADC. An<br />

external signal multiplexes between both integrators. To meet the strict timing constraints of the DDC264 multiplex signal and<br />

the SPI transactions despite the frequent high priority interrupts<br />

of the Bluetooth stack, a special feature of the NRF52832 was<br />

used. The NRF52832 can trigger and perform SPI<br />

transactions, GPIO transitions and other tasks without the need<br />

of an interrupt service routine. The tasks can be triggered by<br />

events from other peripherals, e.g. a timer compare event. In<br />

conjunction with DMA, this made it possible to generate the necessary<br />

clock signals and SPI transactions without any involvement of<br />

the core.<br />
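The ping-pong behavior of the two integrators can be modeled in software. The sketch below is ours and purely illustrative (names, units and the reset-after-conversion behavior are assumptions, not taken from the DDC264 datasheet): one side integrates while the other is read out, so charge accumulation is continuous.<br />

```c
#include <assert.h>
#include <math.h>

/* Hypothetical per-channel model of the DDC264 double-integrator scheme:
 * two integrators alternate under an external multiplex signal. */
typedef struct {
    double integ[2];   /* accumulated charge of side A and side B, in pC */
    int    active;     /* side currently integrating */
} ddc_channel_t;

/* One multiplex period: accumulate the input current on the active side,
 * toggle the mux, and return the charge of the side handed to the ADC. */
double ddc_mux_cycle(ddc_channel_t *ch, double current_pA, double t_int_s)
{
    int side = ch->active;
    ch->integ[side] += current_pA * t_int_s; /* Q = I * t (pA * s = pC) */
    ch->active = side ^ 1;                   /* other side integrates next */
    double q = ch->integ[side];              /* finished side goes to the ADC */
    ch->integ[side] = 0.0;                   /* cleared for its next turn */
    return q;
}
```

With a constant 100 pA input and a 1 ms integration window, every period yields 0.1 pC, regardless of which side is being read.<br />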

III. EXPERIMENTAL RESULTS<br />

To demonstrate the computational capabilities and energy efficiency of the data processing unit of the platform, a comparison of Honey Bunny and an ARM Cortex-M4F microcontroller has been conducted. Furthermore, an application utilizing the whole system pipeline has been tested in practice to highlight the functionality of the whole platform.<br />

A. Exploration of Honey Bunny PULP<br />

Honey Bunny operates at a nominal core supply voltage of 1 V and an I/O voltage of 1.8 V. In this regime, up to 2.5 GOPS can be achieved as the four cores can run at up to 625 MHz, greatly exceeding the computational capabilities of low-power microcontrollers. The STM32F407 from STMicroelectronics is one of the most popular ARM Cortex-M4F microcontrollers, commonly used for the processing of medical sensor data, and is our target microcontroller for performance comparisons. The STM32F407 performs 168 MOPS at its fastest operating point.<br />

TABLE I. FILTERING KERNEL UNDER NOMINAL CONDITIONS<br />

                               STM32F407  Honey Bunny PULP<br />
                               168 MHz    625 MHz    40 MHz<br />
Power                          79.7 mW    54.6 mW    5.0 mW<br />
Speed (normalized)             1.00       15.8       1.01<br />
Energy efficiency (normalized) 1.00       23.0       16.13<br />

To compare the two processors, a kernel used in FIR filtering applications for ECG data has been executed on both processors. For this comparison, the performance (application speed in ms) of the STM32F407 when operating at 168 MHz – its fastest operating point – serves as the baseline. Table I lists the comparison of power, normalized application speed and normalized energy efficiency of the ARM Cortex-M4F microcontroller and Honey Bunny PULP. Honey Bunny delivers almost 16x the performance at only two thirds of the power drawn, resulting in an energy efficiency that is 23x higher than the microcontroller's. As PULP achieves a much higher application speed, power can be saved by reducing the frequency of Honey Bunny to match the application speed of the STM32F407. This shrinks power draw to merely 5 mW. However, energy efficiency suffers when a processor runs below the maximum possible frequency at a given supply voltage.<br />

Energy efficiency can be increased by scaling down the core supply voltage – a feature not available on the STM32F407 due to its fixed built-in core supply – while keeping the operating frequency as high as possible. Fig. 3 shows the effects of the applied voltage scaling on processor power, application speed and energy efficiency. As such, we are able to boost PULP's energy efficiency compared to the microcontroller by up to 38x while still handling the filtering workload over 2x faster. The resulting operating point at 0.7 V core supply voltage dissipates 4.5 mW, which confirms PULP's suitability for battery-operated scenarios. Thus, utilizing PULP as a processing unit enables handling much more complex workloads under a much tighter power envelope than the microcontroller examined.<br />

Fig. 3 Processor power dissipation and application characteristics of Honey Bunny PULP during scaling of supply voltage. The fastest possible operating frequency was used for each supply voltage point.<br />
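The normalized energy-efficiency figures above can be checked by hand: efficiency is relative application speed divided by relative power, with the STM32F407 row as the baseline. A minimal check (the helper name is ours):<br />

```c
#include <assert.h>
#include <math.h>

/* Normalized energy efficiency as in Table I: relative speed divided by
 * relative power, baselined on the STM32F407 at 168 MHz (79.7 mW). */
double norm_efficiency(double speed_norm, double power_mw)
{
    const double base_power_mw = 79.7;   /* STM32F407 baseline power */
    return speed_norm * (base_power_mw / power_mw);
}
```

Plugging in the Honey Bunny columns reproduces the table: 15.8 * 79.7 / 54.6 is about 23.0, and 1.01 * 79.7 / 5.0 is about 16.1.<br />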

B. ECG Monitoring and Processing<br />

One application of the platform tested in practice is ECG<br />

measurement: electrodes were attached to the left arm (LA) and<br />

right arm (RA) to measure the corresponding ECG signal Lead<br />

I (LA-RA).<br />

The data is then sent to PULP, where power-line noise and baseline wander are removed through FIR filtering. A simple band-pass<br />

filter with a lower cut-off frequency of 0.5 Hz, an upper<br />

cut-off frequency of 30 Hz and a filter order of 1138 was used.<br />
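The filtering step can be sketched as a standard fixed-point FIR convolution. The paper's 1138 band-pass taps are not listed, so the function below accepts any Q15 tap set; the function name and calling convention are ours.<br />

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Illustrative Q15 fixed-point FIR step in the spirit of the ECG filtering
 * described above. Called once per incoming sample. */
int16_t fir_q15(const int16_t *x, const int16_t *h, size_t ntaps)
{
    /* x points at the newest sample; x[-(ntaps-1)] is the oldest. */
    int32_t acc = 0;
    for (size_t k = 0; k < ntaps; k++)
        acc += (int32_t)h[k] * x[-(ptrdiff_t)k]; /* Q15 multiply-accumulate */
    return (int16_t)(acc >> 15);                 /* renormalize to Q15 */
}
```

With the real 1138-tap band-pass, the same call runs once per sample; the parallel version splits the tap loop across the four cores.<br />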

The filtered samples are sent back to the NRF52832 and are<br />

then transmitted via Bluetooth to a connected smartphone<br />

together with the unprocessed samples. The Android<br />

application decodes the incoming data and plots it in real time<br />

(see Fig. 2). In future applications the preprocessed data could<br />

also be used to perform further ECG analysis (e.g. QRS-<br />



complex detection, atrial fibrillation detection) directly on PULP.<br />

Fig. 2 Realtime ECG plot on the connected smartphone. The upper plot shows the unprocessed raw sensor data. The lower plot shows the same signal FIR-filtered by PULP.<br />

IV. CONCLUSIONS<br />

We presented an energy-efficient wireless platform to process data from ECG and EMG sensors. The core of the platform is a Honey Bunny PULP processor with four RISC-V cores that boost the available computational resources within a power envelope of a few milliwatts. Thanks to this parallel ultra-low power processor, it is possible to achieve the up to 2.5 GOPS needed for emerging smart wearable applications while keeping power in the mW range. In this work we compared the energy efficiency of the novel parallel processor against an ARM Cortex-M4F microcontroller and showed that up to 38.3x higher energy efficiency is achievable. The developed platform targets long-lasting wearable e-health applications, so the design needs to meet stringent computational requirements to perform the classification algorithms while coping with limited energy resources when battery powered.<br />

ACKNOWLEDGMENT<br />

This work was in part funded by the Swiss National Science Foundation project 'MicroLearn: Micropower Deep Learning' (Nr. 162524).<br />

REFERENCES<br />

[1] Soh, Ping Jack, et al. "Wearable wireless health monitoring: Current<br />

developments, challenges, and future trends." IEEE Microwave Magazine<br />

16.4 (2015): 55-70.<br />

[2] Conti, Francesco, et al. "Accelerated visual context classification on a<br />

low-power smartwatch." IEEE Transactions on Human-Machine Systems<br />

47.1 (2017): 19-30.<br />

[3] Cavigelli, L., Magno, M. and Benini, L., 2015, June. Accelerating real-time embedded scene labeling with convolutional networks. In<br />

Proceedings of the 52nd Annual Design Automation Conference (p. 108).<br />

ACM.<br />

[4] Magno, M., Pritz, M., Mayer, P. and Benini, L., 2017, June. DeepEmote:<br />

Towards multi-layer neural networks in a low power wearable multi-sensors bracelet. In Advances in Sensors and Interfaces (IWASI), 2017 7th<br />

IEEE International Workshop on (pp. 32-37). IEEE.<br />

[5] Z. Wang, Y. Liu, Y. Sun, Y. Li, D. Zhang and H. Yang, "An energy-efficient heterogeneous dual-core processor for Internet of Things," 2015<br />

IEEE International Symposium on Circuits and Systems (ISCAS),<br />

Lisbon, 2015.<br />

[6] Ghasemzadeh, H.; Jafari, R.; "Ultra low-power signal processing in<br />

wearable monitoring systems: A tiered screening architecture with<br />

optimal bit resolution." ACM Transactions on Embedded Computing<br />

Systems (TECS), 2013<br />

[7] M. Gautschi et al., "Near-Threshold RISC-V Core With DSP Extensions<br />

for Scalable IoT Endpoint Devices," in IEEE Transactions on Very Large<br />

Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2700-2713, Oct.<br />

2017.<br />

[8] L. Seminara et al., "Electronic skin and electrocutaneous stimulation to<br />

restore the sense of touch in hand prosthetics," 2017 IEEE International<br />

Symposium on Circuits and Systems (ISCAS), Baltimore, MD, 2017, pp.<br />

1-4.<br />

[9] L. Pinna, A. Ibrahim, and M. Valle, “Interface Electronics for Tactile<br />

Sensors Based on Piezoelectric Polymers,” IEEE Sens. J., vol. 17, no. 18,<br />

pp. 5937–5947, Sep. 2017<br />



Using RISC-V in high computing, ultra-low power,<br />

programmable circuits for inference on battery<br />

operated edge devices<br />

Eric Flamand<br />

CTO<br />

GreenWaves Technologies<br />

Villard-Bonnot, France<br />

eric.flammand@greenwaves-technologies.com<br />

Abstract— Current ultra-low power edge devices operating<br />

for years on a battery are limited to relatively data-poor sensors<br />

such as temperature and pressure. Allowing the next generation<br />

of edge devices to process data from richer sensors such as audio,<br />

image or motion/vibration enables many exciting new<br />

applications but also poses some serious challenges.<br />

1) How does one transform a large amount of input data into<br />

something that is several orders of magnitude smaller?<br />

2) Much more input data implies much more processing<br />

capability. How does one support algorithms requiring multi-giga operations per second while keeping power consumption in<br />

the mW range?<br />

3) Edge devices tend to have irregular activity patterns. How<br />

does one remain energy efficient in a range of workload that goes<br />

from 0 to multi GOPs?<br />

4) Finally, and just as importantly, how to do all of this while<br />

retaining a simple programming model in a context where, for<br />

the sake of energy efficiency, hardware complexity must be kept<br />

minimal?<br />

In this paper we show how a combination of architectural innovation, design trade-offs and tools innovation makes it possible to tackle these challenges.<br />

We will show how the RISC-V’s extendable ISA allows<br />

specific optimizations for energy efficiency and enables<br />

architectural innovation.<br />

We will use several real-life examples from the image and<br />

audio domain to illustrate how an actual multi-core RISC-V<br />

processor implementation can perform on these applications and<br />

what the path is to efficient implementation.<br />

Keywords—RISC-V; PULP; GAP8; CNN; IoT;<br />

I. INTRODUCTION<br />

During the last few years we have seen rapid progress in<br />

the field of data analytics thanks to a large variety of robust<br />

learning techniques combined with the availability of training<br />

sets. Common sources of data are those produced from sensors<br />

probing the environment for data such as images, sounds,<br />

vibrations. It is possible to connect sensors directly to cloud servers that carry out the analysis; however, if the device needs wireless operation, state-of-the-art data links do not deliver sufficient energy efficiency to allow for battery operation as soon as the data volume becomes significant.<br />

Performing all or part of data analytics on the edge device<br />

can dramatically reduce the amount of data transmitted over<br />

the air since what needs to be carried is a qualitative view of<br />

the raw data, for example the presence of a given object, that is<br />

at least 5 orders of magnitude smaller than the original image.<br />

The challenge becomes how to deliver peak processing capabilities well above 1 giga operations per second while operating on a battery with a reasonable battery lifetime expectation (a year or more).<br />

We introduce GAP8, a multi-core programmable device derived from the PULP open-source project [1][2]; the PULP project itself is built on top of the RISC-V project [3]. We examine how GAP8 uses the flexible attributes of the RISC-V ISA to deliver a state-of-the-art microcontroller (MCU), rich peripherals, ease of programming and security, associated with a powerful programmable parallel processing structure for heavy-duty workloads, which includes a dedicated hardware accelerator to offload the compute-intensive part of convolutional neural networks (CNNs). These two key building blocks are supported by aggressive on-chip power management to minimize the amount of energy needed for a given task. The architecture can deliver up to 8 fully software-programmable giga operations per second (GOPS), or 12 GOPS when the CNN accelerator is used, while consuming only 1 milliwatt for 0.17 GOPS.<br />
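The quoted low-power operating point implies an efficiency figure that is easy to verify: 0.17 GOPS at 1 mW corresponds to 170 GOPS per watt. A one-line check (the helper name is ours):<br />

```c
#include <assert.h>
#include <math.h>

/* Compute operations-per-watt from a GOPS figure and a power figure in mW:
 * GOPS / (mW * 1e-3 W/mW) = GOPS per watt. */
double gops_per_watt(double gops, double power_mw)
{
    return gops / (power_mw * 1e-3);
}
```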

Ease of programming is a challenge in a context where<br />

several compromises in hardware architecture have to be made<br />

to keep power under strict control. To alleviate the impact of<br />

these tradeoffs we examine appropriate software automation<br />



tools that greatly simplify the development of highly optimized computing kernels by automating the generation of the glue code that sits between a compute kernel and its data, allocated across an un-cached data memory hierarchy.<br />

II. ARCHITECTURE<br />

Figure 1 provides a top-level view of the GAP8 architecture.<br />

Fig. 1. The GAP8 Architecture<br />

A. Fabric Controller<br />

The Fabric Controller, a micro controller unit (MCU), is<br />

located on the left side of figure 1. The fabric controller<br />

operates in its own independent power and frequency domain.<br />

It contains one RISC-V ISA programmable core equipped with an instruction cache and a fast access-time data memory. It includes a set of peripherals enabling parallel capture of images, sounds and vibrations, as well as connectivity to an external radio transceiver through an LVDS link, plus a 4-channel PWM interface for motor control in applications such as domestic robotics. Most of the peripherals are shielded by a multi-channel micro DMA to minimize the number of interactions with the controlling core when performing I/O. This is illustrated in Figure 2.<br />

Fig. 2. uDMA and I/O Architecture<br />

The L2 memory is located within the fabric controller perimeter. It is dimensioned to 512 kilobytes, optionally extendable via a DDR HyperBus interface. This area also contains a ROM holding the primary boot code, including secure-boot support through eFUSE-stored keys. The last important block is dedicated to power management, including an on-chip programmable DC/DC converter, an LDO regulator, internal clock generation and a real-time clock.<br />

B. Cluster<br />

On the right side of figure 1 is the cluster domain. The cluster is in a separate voltage and frequency domain and is turned on and adjusted to the right voltage and frequency only when the software application running on the fabric controller needs it. It contains 8 cores based on the RISC-V ISA, identical to the core used in the fabric controller. This allows the SoC to run the same binary code on either the fabric controller or the cluster. These 8 cores are served by a shared data memory, making the cluster friendly to all variants of shared-memory parallel programming models, OpenMP being a good example. The shared data memory can serve all memory access requests in parallel with a very short access-time latency that is completely absorbed by the core pipeline, and with a very low contention rate. This is enabled by a highly optimized interconnect located between the cores' load/store units and the memory banks. The program cache is also shared, to benefit from the high occurrence of situations where all cores execute instructions within a relatively small window of code, with the result that a fetched instruction is very likely to be used by several cores at different points in the same time window. Event servicing, parallel thread dispatching, and synchronization are supported by a dedicated hardware block (HW Sync). Fast event servicing is one of the key parameters for efficient parallel execution, since any cycles wasted in forking and joining tasks on the cluster add to the serial part of the application being run, limiting the cluster's ability to scale performance linearly with the number of cores involved in the task. Ultra-low overhead parallel dispatching and synchronization also allows very fine-grained parallelism. The HW Sync block controls the top-level clock gating of every single core in the cluster. A core waiting for an event (attached to a synchronization barrier or general event) is instantly brought into a fully clock-gated state, zeroing its dynamic power consumption. Figure 3 illustrates dispatch from the master core of a C function Foo(Arg) on all 8 cores.<br />

Fig. 3. Dispatch on cluster cores<br />
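Conceptually, dispatching one function over the cluster is a shared-memory fork/join. The sketch below uses OpenMP, which the text names as a supported model; it is a generic illustration, not GAP8's native dispatch API, and it falls back to a correct serial run when OpenMP is not enabled.<br />

```c
#include <stddef.h>
#include <assert.h>

/* Fork/join sketch: each worker processes a disjoint slice of a shared
 * buffer. On the cluster, HW Sync makes the fork and the implicit join
 * at the end of the parallel region nearly free. */
void square_all(int *data, size_t n)
{
    #pragma omp parallel for   /* ignored (serial) without -fopenmp */
    for (long i = 0; i < (long)n; i++)
        data[i] = data[i] * data[i];  /* per-element work */
}
```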

GAP8’s memory hierarchy is organized as a single name<br />

space: every single core in the chip can see all memory<br />

locations, unless they are protected by the Memory Protection<br />

Unit (MPU), with an access time which increases when the<br />



target address is in L2 memory or in external memory (L3). To<br />

hide the access cost of L3 and L2 memory the cluster contains<br />

a multi-channel DMA capable of 1D and 2D memory accesses.<br />
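A 2D transfer copies a rectangular tile out of a larger row-major buffer in one descriptor. The following is a software model of that access pattern only (the DMA's actual programming interface is not described here):<br />

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Software model of a 2D DMA transfer: copy a width x height tile out of
 * a larger row-major image (src_stride = full source row length in bytes),
 * packing it contiguously, as when staging an L2/L3 tile into L1. */
void dma_copy_2d(uint8_t *dst, const uint8_t *src,
                 int width, int height, int src_stride)
{
    for (int row = 0; row < height; row++)
        memcpy(dst + row * width, src + row * src_stride, width);
}
```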

When the cluster runs CNN based applications it can<br />

offload compute intense convolutional layers to a dedicated<br />

accelerator, the Hardware Convolution Engine (HWCE) [4].<br />

This block can evaluate a full 5x5 convolution or three 3x3<br />

convolutions on 16-bit operands in a single cycle. It is directly<br />

connected to the cluster's shared L1 memory through several<br />

load store units similar to the ones used in the cluster’s<br />

programable cores. Since the HWCE shares its memory with<br />

the cores and has access to the synchronization resources, a<br />

HW accelerated convolution can be freely mixed with activities<br />

running on the cores. Besides boosting performance, the<br />

HWCE plays an essential role in improving the energy<br />

efficiency of the overall system when running CNN based<br />

applications. The fact that it internally maximizes data and<br />

coefficient reuse leads to a 4 to 5 times energy efficiency<br />

improvement compared with a pure software parallel and<br />

vectorized implementation.<br />
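For reference, the operation the HWCE evaluates in a single cycle is a plain 25-term multiply-accumulate. In portable C (this is our reference formulation, not the accelerator's interface):<br />

```c
#include <stdint.h>
#include <assert.h>

/* Reference for one HWCE output: a full 5x5 convolution on 16-bit
 * operands, i.e. 25 multiply-accumulates into a 32-bit accumulator. */
int32_t conv5x5(int16_t in[5][5], int16_t coeff[5][5])
{
    int32_t acc = 0;
    for (int r = 0; r < 5; r++)
        for (int c = 0; c < 5; c++)
            acc += (int32_t)in[r][c] * coeff[r][c];
    return acc;
}
```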

C. Processor<br />

One of the key building blocks of GAP8 is its cores. The elementary core is a simple in-order, 4-stage pipeline, compliant with the RISC-V ISA subsets I, M, C. Since the<br />

RISC-V ISA is architected to be extendable we have used<br />

extended instructions [5] to boost performance for DSP centric<br />

kernels manipulating integer or complex numbers represented<br />

as vectors of short integers. Dedicated support has been added for zero-overhead hardware loops, pointer post-modified load and store, single-cycle multiply-accumulate, single-cycle complex multiplication, as well as dedicated instructions for efficient rounding, normalization and clipping. To increase instruction-level parallelism (ILP), single instruction multiple data (SIMD) support has been added, enabling vectors of four 8-bit elements or two 16-bit elements. SIMD operations can produce<br />

either vectors or scalars in operations such as dot products or<br />

accumulation of dot products. Finally, some bit-manipulation<br />

oriented instructions, like bit insertion or extraction, are also<br />

added to make control-oriented code more compact and more<br />

cycle efficient. The elementary core complies with the RISC-V<br />

privileged instructions specification to enable the execution of secured code, assisted by a built-in programmable memory protection unit (MPU).<br />
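A typical DSP inner loop that these extensions target looks, in portable C, like the following. On the extended ISA the loop becomes a zero-overhead hardware loop with SIMD dot products on pairs of 16-bit elements, and the final clipping maps to the dedicated clip support; that mapping is our reading of the text, while the C itself is generic.<br />

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Q15 dot product with renormalization and saturation, the kind of kernel
 * the hardware loop, MAC, SIMD and clip extensions accelerate. */
int16_t dot_q15_sat(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];   /* multiply-accumulate */
    acc >>= 15;                        /* renormalize Q30 -> Q15 */
    if (acc >  32767) acc =  32767;    /* clip to the int16 range */
    if (acc < -32768) acc = -32768;
    return (int16_t)acc;
}
```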

Fig. 4. Baseline RISC-V versus extended ISA code comparison<br />

Fig. 4 gives an example of the difference between native<br />

RISC-V assembly code (on the right) and extended RISC-V<br />

assembly code (bottom left) for the same C code. In both cases<br />

the code is automatically generated by the compiler.<br />

Fig. 5. Baseline RISC-V versus extended ISA<br />

In Fig. 5 the performance improvement of the extended<br />

ISA versus the RISC-V base ISA is illustrated. ISA extensions<br />

are organized in 2 groups: V2 is DSP centric, V3 is V2 plus<br />

SIMD, bit manipulation and more DSP instructions. The<br />

performance figures are obtained on a group of representative<br />

kernels containing convolution, FFT, filtering, and ciphering.<br />

Beyond performance improvements these extensions bring<br />

additional energy efficiency since for the same performance on<br />

a given size of workload, the clock frequency and, in some<br />

cases, the supply voltage can be lowered.<br />

D. Power Management<br />

To achieve power efficiency and to minimize the number of<br />

external components, the SoC contains an internal DC/DC converter that can be directly connected to an external battery.<br />

It can deliver voltages in the range of 1.0V to 1.2V when the<br />

circuit is active. When the circuit is in sleep mode this<br />

regulator is turned off and a low-dropout (LDO) regulator is used to power the real-time clock, which controls programmed wake<br />

up and, optionally, part of the L2 memory allowing retention of<br />

application state for fast wakeup. When in deep sleep the<br />

current consumption is reduced to 70nA (assuming the real<br />

time clock is active and no data retention). The two main<br />

domains have their own separate clocks. Special attention has<br />

been paid to the time needed to turn on and turn off the cluster.<br />

The typical turn-around time is between 100 µs and 150 µs,<br />

allowing for agile power state transitions.<br />
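Under duty cycling, the deep-sleep figure dominates average consumption. A back-of-the-envelope model (only the 70 nA figure comes from the text; the function name and example numbers are ours):<br />

```c
#include <assert.h>
#include <math.h>

/* Average current of a node active for t_active out of every t_period
 * seconds and otherwise in the 70 nA deep-sleep state. Currents in uA. */
double avg_current_ua(double active_ua, double t_active_s, double t_period_s)
{
    const double sleep_ua = 0.07;              /* 70 nA deep sleep */
    double duty = t_active_s / t_period_s;     /* fraction of time active */
    return duty * active_ua + (1.0 - duty) * sleep_ua;
}
```

For example, a hypothetical 10 mA burst for 1 ms every second averages to roughly 10.07 µA, so the sleep floor contributes almost nothing.<br />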

E. Software development flow<br />

Code generation is supported by GCC (7.1). The mainstream RISC-V GCC compiler and binutils suite have been modified to natively support all ISA extensions described earlier,<br />

including vector support. Applications are written in C/C++.<br />

To exploit parallelism and since the cluster is designed for<br />

shared memory programming models OpenMP can be used but<br />

a simple and efficient API is also available providing very low<br />

overhead access to task dispatch and synchronization<br />

resources. Platform resource management (peripherals, time,<br />

memory, power) is supported either by the PULP-OS (event<br />

based) or through an ARM Mbed port (thread or event based).<br />

Other RTOS ports are planned in the future. In both cases the<br />



cluster is used with a component-based model: a software<br />

component is installed onto the cluster after it has been turned<br />

on and configured. Interfaces are negotiated between the RTOS<br />

running on the MCU and the component in the cluster.<br />

Synchronization and a small number of system services are<br />

local to the cluster. More sophisticated services, involving<br />

peripherals for example, are delegated to the MCU part of the<br />

system. This approach makes the software part running on the<br />

cluster independent of the RTOS running on the MCU.<br />

Programs are always fetched from L2 memory through the<br />

previously described instruction caches and can be statically or<br />

dynamically linked. In the latter case dynamic relocation is<br />

performed through a light weight model to limit code<br />

expansion in L2. Combining dynamic relocation with the<br />

user/machine mode support in the core and memory protection<br />

(MPU) eases the deployment of secured kernels and<br />

applications.<br />

Data is not cached to avoid the significant power penalty<br />

associated with data caches. To use the architecture at its best<br />

the preferred approach is to always try to keep data as close as<br />

possible to the cores using it. Data is efficiently moved from<br />

and to L2 and L1 by the DMA or uDMA engines, but code<br />

restructuring and organization can prove to be error prone and<br />

time consuming. To ease development a tool has been<br />

developed to automate this process. Basic kernels are first<br />

written without taking into consideration where data is located.<br />

These kernels can be optimized and parallelized without the<br />

programmer being concerned with data placement. The basic<br />

kernels are then combined into what we call user kernels. A<br />

user kernel is described as a multi-dimensional iteration space.<br />

It contains a collection of connected basic kernels that can be<br />

inserted into the user kernel iteration space at pre-defined<br />

locations (depth, prologue, body, epilogue). The user kernel<br />

definition contains a series of argument definitions for the basic<br />

kernel arguments. Each argument defines which sub space of<br />

the kernel iteration space it is concerned with, a tiling direction,<br />

a home location of the data (L2 memory, external memory) and<br />

a set of properties. Using this model and a given L1 memory<br />

budget the tool infers a tiling structure for each argument<br />

fitting within the L1 memory budget and satisfying the set of<br />

constraints put on all the arguments. Once the tiling structure<br />

has been computed it generates a C program that takes care of<br />

providing tiles to the basic kernels in a pipelined manner to<br />

keep all the cores continuously working. The tool is delivered<br />

as a C library which exposes an API to create models. The<br />

models themselves are written in C and once linked with the<br />

library become executable. Running a compiled model<br />

produces C code wrappers containing calls to the basic kernels<br />

(either sequential or parallel) as well as DMA transactions.<br />

With this approach it is possible to build generators for specific<br />

algorithms and then combine these together. We have<br />

developed generators that, for example, automatically generate<br />

different types of CNN layer. Adding a layer to a CNN graph is<br />

simplified to defining the nature of the layer (convolution,<br />

pooling, rectification, and so on), dimensionality and location<br />

of the coefficients (L2, external memory). These parameters<br />

are all captured in a single call to a generator. Since generators<br />

can be combined, the whole network can be easily built and<br />

executing the model will produce all the wrappers calling the<br />

optimized basic kernels and managing memory movements.<br />
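The tile-size inference the generator performs can be illustrated with a deliberately simplified model; the real tool handles multi-dimensional iteration spaces and per-argument constraints, and the function name and double-buffering factor here are our assumptions.<br />

```c
#include <assert.h>

/* Simplified tiling inference: the largest number of image rows whose
 * double-buffered input and output tiles fit a given L1 budget.
 * All sizes in bytes; two buffers per direction let the DMA refill one
 * tile while the cores compute on the other. */
int rows_per_tile(int row_in_bytes, int row_out_bytes, int l1_budget)
{
    int per_row = 2 * (row_in_bytes + row_out_bytes);
    int rows = l1_budget / per_row;
    return rows > 0 ? rows : 0;   /* 0 means the budget is too small */
}
```

For a hypothetical 320-pixel-wide 16-bit image with same-size output and a 40 KB L1 budget, this yields 16 rows per tile.<br />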

Fig. 6. Input to code generation tool (MNIST CNN)<br />

In Fig. 6 we show how a CNN generator is instantiated for<br />

a MNIST network. Offloading to the convolutional accelerator<br />

(HWCE) fits nicely within this approach since the accelerator<br />

consumes and produces data from and to shared L1 memory<br />

and as such it conforms perfectly to the definition of a basic<br />

kernel. Beyond the domain of CNNs other generators have<br />

been developed: 2D FFT, various feature extractors (Histogram<br />

of gradients, difference of gradients, various estimators, image<br />

resize, HOG + weak-predictor based object recognition, and so<br />

on).<br />

This automatic tiling tool is used as a backend for higher-level tools that support direct export from CNN frameworks such as TensorFlow into optimized C code ready to be compiled for the chip.<br />

Fig. 7. Software development flow<br />

Fig. 7 summarizes the key modules involved in the<br />

software development flow.<br />



F. Application Results<br />

TABLE I.<br />

APPLICATION RESULTS<br />

Application<br />

Cores<br />

1 2 4 8<br />

1D FFT1024<br />

Radix4<br />

28.2 14.3 7.8 4.7<br />

2D FFT 256 x<br />

256 Radix4<br />

78.9 41.9 22.6 13.3<br />

Byte 5x5 Conv 18.5 9.3 4.7 2.2<br />

Short 5x5 Conv 37.8 18.9 9.5 4.6<br />

Binary 5x5<br />

Conv<br />

20.8 10.5 5.3 2.8<br />

Short<br />

MaxPool2x2<br />

8.2 4.2 2.1 1.1<br />

Short MatMult<br />

32x32<br />

41.9 20.9 14.0 5.2<br />

Short 2048 to 1<br />

Fully<br />

3112.0 1616.0 847.0 495.0<br />

Connected<br />

CannyEdge 99.5 50.9 26.2 12.7<br />

AES-CTR 128b 15.3 7.7 4.0 2.1<br />

64 Mel<br />

Coefficients<br />

542.7 299.4 176.7 101.3<br />

HoG, 8x8<br />

Cells,<br />

2x2Blocks, 9<br />

Bins<br />

65.0 35.0 18.0 9.0<br />

In Table I, cycles per produced elementary output are given as a function of the number of cores used when running<br />
sample test applications. It should be noted that the exact same binary is used when running on 1, 2, 4 or 8 cores: the code is<br />
automatically dispatched onto the number of cores dynamically passed to the hardware dispatcher. All cycle counts are<br />
obtained using a hardware timer to capture time before and after the application, so they capture all activity,<br />
not just CPU time but also DMA and distant-level memory accesses. The list below provides a detailed description of<br />
each test application. The test applications have been implemented to optimally benefit from the parallelization and<br />
vectorization opportunities provided by the architecture.<br />
<br />
1. 1D FFT 1024 Radix 4: a fixed-point (real and imaginary in Q15) single-dimension radix-4 FFT;<br />
1024 outputs are produced, and cycles are reported for 1 output.<br />
2. 2D FFT 256x256 Radix 4: a fixed-point (real and imaginary in Q15) bi-dimensional radix-4 FFT;<br />
65536 outputs are produced, and cycles are reported for 1 output.<br />
3. 5x5 convolutions: key kernel for CNNs and variants; 25 sums of products. The byte variant handles byte inputs,<br />
short handles 16-bit inputs, and binary performs binary convolution. A single output is produced, and cycles<br />
are reported for it.<br />
4. Short MaxPool 2x2: key kernel for CNNs; operands are 16 bits, one output is produced, and cycles<br />
are reported for it.<br />
5. Short MatMult 32x32: performs a matrix multiplication between two 16-bit 32x32 matrices; 1024<br />
outputs are produced, and the cycle count is for 1 output.<br />
6. Short 2048 to 1 Fully Connected: a CNN fully connected layer with 2048 inputs, 2048 coefficients, and 1<br />
output, all 16-bit. The cycle count is for 1 output.<br />
7. Canny Edge Detector: Gaussian smoothing (5x5), gradient magnitude and orientation, non-max<br />
suppression, blob extraction. Average cycles per output image pixel is reported.<br />
8. AES-CTR 128: AES encryption/decryption, CTR mode, 128-bit key; cycles are reported for 1 output bit.<br />
9. 64 Mel Coefficients: sub-band analysis on 64 bands (64 mel coefficients). Input is 16 kHz, 16-bit PCM;<br />
frame: 400 samples; frame overlap: 10 ms. Steps: pre-emphasis, Hamming window, radix-2 FFT 512, mel and<br />
mel-derivative extraction. Cycles are reported for 1 mel-and-derivatives coefficient.<br />

Fig. 8. Baseline RISC-V versus extended ISA<br />

Fig. 8 shows the factor of performance increase as a<br />

function of the number of cores used. The speedup factor<br />

indicates the architecture’s ability to efficiently scale in<br />

performance without being impaired by elements such as<br />

synchronization overhead and memory contention. For all the<br />

reported applications the geometric mean for the speedup<br />



factor when using 8 cores compared to a single core is 7.1. This<br />

shows that for an application set with enough diversity the<br />

architecture scales very well.<br />

TABLE II. CNN TOPOLOGIES<br />
<br />
Layer              In     W     H   Out   Arithmetic Ops<br />
CIFAR10<br />
Conv5x5/1           1    32    32     8           313600<br />
MaxPool2x2/2        8    28    28     8             4704<br />
Conv5x5/1           8    14    14    12           480000<br />
MaxPool2x2/2       12    10    10    12              900<br />
FullyConnected    300     1     1    10             6000<br />
Total                                             805204<br />
MNIST<br />
Conv5x5/1           1    28    28    32           921600<br />
ReLU               32    24    24    32            18432<br />
MaxPool2x2/2       32    24    24    32            13824<br />
Conv5x5/1          32    12    12    64          6553600<br />
ReLU               64     8     8    64             4096<br />
MaxPool2x2/2       64     8     8    64             3072<br />
FullyConnected   1024     1     1    10            20480<br />
Total                                            7535104<br />
TEXT RECO<br />
Conv3x3/1           1   128   128    32          9144576<br />
ReLU               32   126   126    32           508032<br />
MaxPool2x2/2       32   126   126    32           381024<br />
Conv3x3/1          32    63    63    32         68585472<br />
ReLU               32    61    61    32           119072<br />
MaxPool2x2/2       32    61    61    32            89304<br />
Conv3x3/1          32    30    30    32         14450688<br />
ReLU               32    28    28    32            25088<br />
MaxPool2x2/2       32    28    28    32            18816<br />
FullyConnected   6272     1     1    64           802816<br />
ReLU               64     1     1    64               64<br />
FullyConnected     64     1     1    13             1664<br />
Total                                           94126616<br />

To give more insight into the architecture’s performance on more complex applications involving<br />
Convolutional Neural Networks (CNNs), we provide performance evaluations on three networks. The first two,<br />
CIFAR10 and MNIST, are well known. The third one is significantly larger, with 421,263 trainable parameters and<br />
1,511,904 neurons; it is used to perform text recognition from images. The characteristics of the networks are<br />
provided in Table II. The Arithmetic Ops column gives the total number of dyadic arithmetic operations needed when<br />
performing inference, excluding memory accesses.<br />

TABLE III. CNN PERFORMANCE (TOTAL CYCLES PER INFERENCE)<br />
<br />
Network       1 core      2 cores     4 cores     8 cores   8 cores + HWCE   Speedup<br />
CIFAR10        711042      415838      254988      178458            65033      10.9<br />
MNIST         7620725     4099793     2359415     1559166           816731       9.3<br />
TextReco     97823730    51461201    28198788    17274720          8325727      11.7<br />

Table III shows the total number of cycles required when running a full inference on these three networks. Cycle counts here<br />
include all operations; for example, for the TEXT RECO network, access to coefficients stored in an external memory is<br />
included. Cycles are reported for five configurations: 1, 2, 4 and 8 cores, and 8 cores with the HWCE accelerator running the<br />
convolutional layers while all the other layers run on the 8 cores. For the 8-cores-plus-HWCE configuration, comparing<br />
total cycles against total arithmetic operations gives a good measure of how the architecture behaves.<br />

G. Conclusion<br />

We have presented an ultra-low-power programmable platform derived from two major open-source initiatives: PULP<br />
and RISC-V. Content understanding, and in particular CNN-based solutions, are the primary focus of this platform. We<br />
have provided evidence that when this platform operates on real networks, the combination of parallelism and a hardware<br />
accelerator leads to a 10x improvement versus a single-core model while also improving energy efficiency. Through a set<br />
of kernels and real-life applications we have shown the capability of this platform to scale efficiently with the number<br />
of cores used. We have explained how the architecture is organized and, in particular, the trade-offs we have chosen to<br />
improve energy efficiency. We have shown the importance of high-level tools for efficiently mapping complex applications onto a<br />
parallel architecture which, by necessity, is not equipped with hardware assistance to hide the complexity of explicit<br />
memory-hierarchy management.<br />

The architecture has been taped out in GAP8 using the TSMC 55LP process. It shares power and size characteristics<br />
with state-of-the-art ultra-low-power MCUs but, at the same time, thanks to aggressive parallel and vector computing<br />
capabilities, is capable of delivering several giga-operations per second within a very small power envelope. We have shown how this<br />
architecture enables new applications for battery-operated edge devices with rich-data sensing capabilities.<br />

REFERENCES<br />

[1] Davide Rossi, Francesco Conti, Andrea Marongiu, Antonio Pullini, Igor<br />

Loi, Michael Gautschi, Giuseppe Tagliavini, Philippe Flatresse, Luca<br />

Benini “PULP: A Parallel Ultra-Low-Power Platform for Next<br />

Generation IoT Applications”<br />

[2] http://www.pulp-platform.org<br />

[3] https://riscv.org/<br />



[4] Francesco Conti, Luca Benini, “A Ultra-Low-Energy Convolution<br />

Engine for Fast Brain-Inspired Vision in Multicore Clusters”,<br />

Proceedings of the 2015 Design, Automation & Test in Europe<br />

Conference & Exhibition, 2015<br />

[5] Michael Gautschi, Pasquale Davide Schiavone, Andreas Traber, Igor<br />

Loi, Antonio Pullini, Davide Rossi, Eric Flamand, Frank K. Gurkaynak,<br />

Luca Benini, “Near-Threshold RISC-V Core With DSP Extensions for<br />

Scalable IoT Endpoint Devices”, IEEE Transactions on Very Large<br />

Scale Integration Systems (TVLSI)<br />



Precisely engineered RISC-V embedded processors in 30 days<br />

Keith A. Graham<br />

Electrical, Computer & Energy Engineering<br />

University of Colorado<br />

Boulder, USA<br />

Keith.A.Graham@Colorado.EDU<br />

Abstract— RISC-V can transform embedded systems by providing precisely engineered processors that satisfy performance,<br />
cost, and power requirements unavailable in conventional IP processor core designs. These precisely designed embedded<br />
processors meet the application’s requirements by including only the necessary hardware resources. To make this<br />
realization happen, today’s embedded processor designer must become a core processor engineer. In this paper, we explore<br />
engineering, in 30 days, an embedded processor that can process data in real time<br />
at 80 megasamples per second (MSPS), utilizing a highly abstracted CPU processor development strategy.<br />

Keywords—RISC-V, Embedded Processor Design, Vector<br />

Processor, Custom CPU, Soft Processor Core, IP core processor<br />

I. INTRODUCTION<br />

Over the last two to three decades, custom CPU design has been replaced by IP processor cores. A major motivation<br />
towards these IP cores has been the availability of open-source software that minimizes development costs as well as long-term<br />
support costs. Today, if you design a system using an ARM processor, there are open-source operating systems,<br />
drivers, and applications available to the embedded solution. RISC-V is changing the balance towards custom<br />
solutions by providing an open-source Instruction Set Architecture (ISA) as well as a platform for open-source code.<br />
By removing the requirement of custom software for non-critical application code, as well as long-term support costs,<br />
RISC-V design teams can focus on custom CPU resources and software to provide a unique customer experience.<br />

RISC-V is a modular ISA in which specific extensions are defined. The benefit of modularity is that a<br />
solution only needs to support the hardware resources required for a specific ISA extension or solution, thus minimizing<br />
power, die area, and cost [1]. The RISC-V extensions can be considered base starting points for a custom embedded<br />
solution. For example, the 80 MSPS signal processing application only requires 32-bit integer math and minimal<br />
division, so it will be based on an RV32I ISA standard core. There are extensions that define 64-bit and 128-bit<br />
address widths, multiply/divide, and atomic instructions, as examples. It should be noted that, to be RISC-V certified, each<br />
hardware instantiation must support all base RISC-V instructions, either via direct hardware execution or through software<br />
exception handling.<br />

In today’s IP-core-based embedded solutions, if an application does not specifically match an IP core, choosing an<br />
IP core is analogous to shopping in a grocery store where you must purchase the next larger box to make your<br />
favorite recipe. The next larger box leaves leftovers that you paid for, in terms of power and silicon. Most designs<br />
must balance development time and risk management. You would expect IP cores to have an advantage in both<br />
time and risk management, but modern, highly abstracted CPU processor development tool chains can change the balance<br />
towards customized cores. If application requirements increase at the later stages of a project, the ability to<br />
provide an altered CPU core reduces risk compared to a static IP core solution.<br />

For this paper, we will develop a customized RISC-V processor to meet an 80 MSPS real-time processing project.<br />
The design is based on a 5-stage RV32I core designed at the University of Colorado at Boulder for instructional<br />
and research purposes. The CPU processor development environment, Codasip Studio 7.0, enables the development of<br />
an optimized processor by adding CPU resources while generating a C compiler to access these additional resources.<br />
Tools like Codasip Studio release the power of RISC-V by enabling RISC-V cores that execute open-source RISC-V<br />
binaries to also execute application-specific code compiled to utilize the optimized CPU.<br />

II. GOAL SETTING<br />

Like all projects, goal setting defines the end objective of the embedded solution. In our project, the instrument<br />
development team defined a goal of 80 MSPS of 14-bit incoming data that must be processed in real time. Breaking<br />
down the algorithm into its sub-components, we will focus on the Kurtosis algorithm. The goal is to process<br />
10,000 sets of 8,192 samples in 0.10 seconds. As a further goal, based on limiting the FPGA in size, power, and cost, these<br />
10,000 Kurtosis operations can be spread over a maximum of 40 CPUs on a single FPGA, resulting in a Kurtosis<br />
computation completing every 400 microseconds.<br />



Fig. 1. Kurtosis algorithm<br />

Each RISC-V core will have a local buffer, and the average of each incoming data block will be calculated as it is stored<br />
from the ADC: the incoming data stream generates the sum of each incoming block in hardware as it is brought into<br />
the RISC-V local memory buffer. With the Kurtosis block data average done in hardware, the number of cycles to<br />
calculate the Kurtosis algorithm will be the difference between the CPU cycle count upon exiting the Kurtosis routine and the<br />
cycle count upon entering it.<br />
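For reference, the Kurtosis computation being benchmarked can be sketched as below. This is an illustrative C version (the function and variable names are ours, not the project's code), with the block mean assumed to be already available from the hardware summation described above:<br />

```c
#include <stdint.h>

/* Sample kurtosis of one block: n * sum((x-mean)^4) / (sum((x-mean)^2))^2.
 * The block mean is assumed to be pre-computed by the streaming hardware. */
double kurtosis(const int16_t *x, int n, int32_t mean) {
    double m2 = 0.0, m4 = 0.0;
    for (int i = 0; i < n; i++) {
        int32_t d  = x[i] - mean;          /* 14-bit samples: difference fits easily */
        double  d2 = (double)d * (double)d;
        m2 += d2;                          /* running sum of squared deviations */
        m4 += d2 * d2;                     /* running sum of fourth powers */
    }
    return (double)n * m4 / (m2 * m2);
}
```

The inner loop is dominated by multiplications, which is exactly the "hot spot" the profiling in the following sections targets.<br />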

III. EXECUTION TIME<br />

There are three key variables that make up the execution time of the Kurtosis algorithm, or any CPU execution time:<br />
the number of instructions, the average cycles per instruction, and the clock period [2].<br />

Time(secs) = # of Instructions * Cycles per Instruction * Clock Period(secs/cycle)<br />
<br />
Time(secs) = Total Cycles * (1 / Frequency(cycles/sec))   (1)<br />

Using data-driven techniques in building the embedded processor, all three of these variables will be considered. Adding<br />
CPU resources to the base RV32I core will have the greatest impact on instruction count and cycles per instruction,<br />
but how the resources are architected can also impact the frequency, i.e., the clock period.<br />
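Equation (1) is simple enough to express directly in C; the small helper below (an illustrative sketch, not the project's code) is the calculation used for every design point in this paper:<br />

```c
#include <stdint.h>

/* Equation (1): execution time in microseconds from the total cycle
 * count and the clock frequency in Hz. */
double exec_time_us(uint64_t total_cycles, double freq_hz) {
    return (double)total_cycles / freq_hz * 1e6;
}
/* For example, the baseline of Section IV, 2,191,817 cycles at 75 MHz,
 * gives approximately 29,224 us. */
```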

IV. GETTING A BASELINE<br />

First, let’s determine whether the base RV32I core can achieve the desired Kurtosis performance without<br />
modification. All simulations will be done using the Codasip Studio C compiler generated for the CU RISC-V<br />
implementation, with an optimization setting of -O1, on a cycle-accurate model. This optimization was chosen for the<br />
ease of setting breakpoints upon entering and exiting the Kurtosis routine for benchmarking. To more closely<br />
emulate -O3 optimization, the algorithm is written with the loop unrolling that -O3 would otherwise perform.<br />
The cycle-accurate model represents a 5-stage pipeline hardware design from which RTL can be generated and<br />
then synthesized into an FPGA.<br />
<br />
The CU RV32I FPGA design is expected to operate at 75 MHz; we will use this frequency as our<br />
benchmark frequency.<br />

Time(secs) = Total Cycles * (1 / Frequency)<br />
Time(secs) = 2,191,817 * (1 / 75,000,000)<br />
Time = 29,224 us   (2)<br />

The base core executed the Kurtosis routine in a simulated 2,191,817 cycles, representing 29,224 us, which does not meet<br />
the application’s requirement of 400 us. To achieve the goal of minimal CPU resources satisfying the application requirements,<br />
resources will be added one at a time, starting with reducing instruction count through the addition of an instruction. To<br />
determine a possible instruction to add, the integrated Codasip Studio profiler tool highlights the routine’s “hot spots” in red.<br />
These “hot spots” indicate the largest concentration of clock cycles per line of C code.<br />

Fig. 2: CU RISC-V cycle accurate Kurtosis profile<br />

Fig. 3: Profiled assembly instructions<br />

Based on the Kurtosis routine’s profiled “hot spot,” replacing the multiply math function with a multiply instruction may<br />
significantly reduce the number of instructions, which should reduce the execution time per equation (1).<br />

V. MULTIPLY<br />

The RISC-V ISA defines an “M” extension which includes multiply and divide operations. This extension defines two<br />
multiply instructions: the first returns the lower 32 bits of a 32-bit by 32-bit multiply, and the second returns the upper 32<br />
bits. The ISA allows a microarchitecture to fuse these instructions into a single operation to obtain the full 64-bit result [3].<br />
In the Kurtosis application, the “hot spot” is the multiply instruction and not the divide. With the focus on minimal CPU<br />
resources to provide the required performance, only the multiply operation will be added to the CPU. To further refine the<br />
requirement, the incoming data stream will not result in a multiply whose return value is wider than 32 bits, which<br />
narrows the desired multiplication to the RISC-V MUL instruction.<br />

Fig. 4: RISC-V MUL instruction format<br />

For a two-source-operand and destination-register operation (R-type), the complete instruction opcode comprises three<br />
fields: the opcode, bits 6-0; funct3, bits 14-12; and funct7, bits 31-25. The plan is to implement the RISC-V MUL<br />
instruction: opcode = 0b0110011, funct3 = 0b000, and funct7 = 0b0000001 [3].<br />
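The R-type field layout can be checked with a few lines of C. This encoder is illustrative (it is not part of the Codasip flow), but it produces the exact 32-bit MUL encoding defined by the ISA:<br />

```c
#include <stdint.h>

/* Assemble a RISC-V R-type MUL: funct7=0b0000001, funct3=0b000, opcode=0b0110011. */
uint32_t encode_mul(uint32_t rd, uint32_t rs1, uint32_t rs2) {
    return (0x01u << 25)   /* funct7 = 0000001         */
         | (rs2  << 20)    /* source register 2        */
         | (rs1  << 15)    /* source register 1        */
         | (0x0u << 12)    /* funct3 = 000             */
         | (rd   << 7)     /* destination register     */
         |  0x33u;         /* opcode = 0110011 (OP)    */
}
/* encode_mul(5, 6, 7) yields 0x027302B3, i.e. "mul x5, x6, x7". */
```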

Now, the power of a highly abstracted processor<br />

development tool begins to become apparent. To add the<br />

Multiply instruction, only four segments of code need to be<br />

added. First, adding the MUL opcode to the CPU model in<br />

opcodes.hcodal.<br />

Fig. 5: MUL definitions in opcodes.hcodal<br />
<br />
Second, the instruction must be added to the instruction accurate model so that the compiler becomes aware of the<br />
MUL operation. To add this instruction, two code segments are required in the instruction accurate model to enable the<br />
assembler, disassembler, and C compiler generation. In the isa.codal file, we use DEF_OPC to define mnemonics for the<br />
assembler and the associated complete opcode, OPC_MUL. By adding opc_mul to the set of i_alu, we have added the<br />
MUL to the R-type instructions.<br />
<br />
Fig. 6. Adding MUL in isa.codal<br />
<br />
For readability in the instruction accurate model, we have created a routine, alu(opc, src1, src2), that the instruction<br />
accurate model calls to determine which ALU operation to complete based on the incoming opcode [4]. The ALU routine<br />
is a large switch statement that returns the result of the operation based on the complete opcode provided to the<br />
routine’s switch statement. The MUL implementation is a signed multiplication based on casting both src1 and src2 to<br />
(int32).<br />

Fig. 7. MUL define in ia_utils.codal<br />

It took five minutes to add the MUL to the instruction accurate model. To verify correct operation at the<br />
instruction-accurate level, Codasip generates an assembler(ia) and simulator(ia), respectively, from the Codasip<br />
Tasks window. Compiling took a total of 2.11 minutes: 1.11 minutes for the assembler(ia) and 1.00 minute for the<br />
simulator(ia).<br />



Fig. 8: Codasip Task window<br />

Using standard RISC-V assembly format, the following<br />

test code was added to an assembly routine that performs a<br />

regression test on all CPU instructions. The regression test<br />

includes tests for data forwarding and load data hazard<br />

detection as well.<br />

Fig. 9: MUL added to regression test<br />

In under 15 minutes, an instruction was added to the CPU model and verified through simulation. With the instruction<br />
integrated into the model, it must now be added to the cycle accurate model, from which RTL can be generated and synthesized<br />
into an FPGA or ASIC. Like the instruction accurate model, minimal code is required to add the MUL to the 5-stage<br />
pipeline physical model. First, the MUL must be added to the instruction decoder and then to the ALU. It can be<br />
added to the general ALU or as a separate execution unit; for this paper, it will be added to the general ALU.<br />

Fig. 11: MUL added to execute pipeline stage<br />

With these two files updated, we can now generate a cycle accurate simulation by clicking on simulator(ca) in the<br />
Codasip Tasks window. Generation of the cycle accurate simulator took 1.81 minutes, after which the regression<br />
test was run on the cycle accurate model, verifying correct execution of the MUL instruction. In under 30<br />
minutes, the multiply instruction was added and verified in a 5-stage cycle accurate model.<br />

VI. MUL SPEED UP<br />

With the cycle accurate model completed, the cycle count to complete the Kurtosis algorithm went from<br />
2,191,817 to 68,211 cycles. At 75 MHz operation, that would appear to get close to the target: 909 us versus the goal<br />
of 400 us. This speed-up is due to instruction count reduction. We now need to add the impact, on the clock period, of<br />
placing a multiplier in the ALU. At the time of this paper, the FPGA has not been chosen, and we will use an estimated<br />
clock frequency, set by the multiply instruction, of 25 MHz. With the MUL instruction now setting the clock frequency<br />
(the clock period), going back to the execution time in equation (1), the simulated Kurtosis time is 2,728 us. There is<br />
a dramatic increase in performance, but with the MUL operation now dictating the clock period of all the<br />
instructions, the non-MUL instructions are negatively impacted by the addition of the MUL instruction.<br />

Fig. 10: MUL added to cycle accurate decoder<br />



Rerunning the Kurtosis simulation with the common case sped up, the cycle count increased from 68,211<br />
to 109,181 clock cycles, but with the clock frequency back at 75 MHz instead of 25 MHz, the execution time is<br />
now 1,456 us, a speed-up of 183% over the implementation where the MUL instruction set the clock period (frequency).<br />

VII. MULTIPLY ACCUMULATE (MAC)<br />

Through data analysis, the CPU performance has been increased, in a few hours, by 20 times, but it still does not achieve the<br />
goal of 400 us execution time. Going back to the profiler, the instruction sequence for the first line inside the Kurtosis loop<br />
is a MUL followed by an ADD. There is a possibility of further instruction count reduction by realizing this sequence<br />
of MUL and ADD as a multiply-accumulate function: the first line of code would drop from five<br />
to four instructions.<br />
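Semantically, the fusion replaces a two-instruction sequence with a single operation; in C terms (an illustrative sketch, not the generated code):<br />

```c
#include <stdint.h>

/* What the compiler currently emits for acc += a * b: a MUL then an ADD. */
int32_t mul_then_add(int32_t acc, int32_t a, int32_t b) {
    int32_t prod = a * b;   /* MUL */
    return acc + prod;      /* ADD */
}

/* What a fused multiply-accumulate (MAC) instruction would compute as
 * one operation, eliminating the separate ADD. */
int32_t mac(int32_t acc, int32_t a, int32_t b) {
    return acc + a * b;
}
```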

Fig. 12: Kurtosis loop disassembled<br />

From the screenshot of the Kurtosis loop, only 8 of the 23 instructions are MUL instructions; the other 15 instructions<br />
could be operating at 75 MHz. Going back to one of the Great Ideas of Computer Architecture, “make the common case<br />
fast,” let’s change the cycle accurate model so that the MUL operation is a multiple-cycle instruction taking three clock<br />
cycles (25 MHz) while all other instructions take one clock cycle [2]. By adding these resources, the performance of the non-<br />
MUL instructions will be back at 75 MHz.<br />

Fig. 13: Adding resource to define multiple cycle instructions<br />

Fig. 15: Kurtosis loop profile with MUL instruction<br />

Learning from the previous performance optimization, both instruction count and clock frequency impact execution<br />
time. If the MAC operation reduced the instruction count but did not decrease the cycle count, no performance increase would<br />
be realized. Going back to the 5-stage pipeline implementation in Fig. 16, the pipeline stage after the execute<br />
stage is the memory stage; for a non-load or non-store operation, this stage is a pass-through. To gain the benefit of<br />
reducing the instruction count, the accumulate portion of the MAC instruction will therefore be performed in the memory pipeline<br />
stage. Adding the MAC instruction is like adding the MUL instruction, with the addition of an accumulate<br />
operation in the memory pipeline stage. It took 60 minutes to implement and verify the addition of the MAC routine.<br />

Fig. 16: CU RISC-V 5-stage pipeline<br />

Fig. 14: Multi-cycle delay function in execute stage<br />

Adding these changes to the cycle accurate model took 15 minutes, and another 15 minutes were needed to regenerate the software<br />
development kit: the assembler, the compiler, the C libraries, etc. In 30 minutes, the performance data on speeding up the<br />
common case was available.<br />

The penalty of completing an instruction over two stages is that the result cannot be forwarded back to the ALU for an<br />
additional clock cycle. The hardware must ensure proper execution by inserting a bubble if the instruction immediately<br />
following the MAC requires the result of the MAC operation. If the compiler generated this data-dependent series of<br />
instructions, the benefit of moving the accumulate operation to the memory stage would be lost. To instruct the compiler to<br />
organize the instructions so that the instruction immediately following the MAC has no data dependency,<br />



codasip_compiler_schedule_class(sc_mac) is set to two instructions in the instruction accurate model.<br />
<br />
Fig. 17: Adding instruction scheduling<br />
<br />
The cycle count for the MAC implementation increased by 15,752 to 124,933 clock cycles over the multi-cycle MUL<br />
implementation, increasing execution time to 1,666 us. In analyzing the data, this increase stems from the code generated<br />
for the Kurtosis loop not always being able to insert a useful operation immediately following the MAC operation. When<br />
the compiler cannot find a useful instruction, it inserts a NOP (addi x0, x0, 0) instruction.<br />
<br />
The Codasip Studio profiler includes a relative area and power estimator. With the design not synthesized at this<br />
time, the estimates are only relative. Adding the MAC resources increased the estimated area by 17% and the power<br />
by 23% over the multi-cycle MUL implementation. With no realized performance benefit from adding the MAC, further<br />
design optimizations will focus on the multi-cycle multiplier implementation.<br />
<br />
At this point, I have completed my first day of the project: I have sped up the performance by 20.08 times, increased<br />
estimated area by 24%, and decreased power by 95% compared to the base core implementation.<br />

VIII. IMPROVING BRANCHES AND JUMPS<br />

Going back to the basic 5-stage pipeline design, branches and jumps are decided in the execute pipeline stage [2]. If the<br />
branch is false, the code flows linearly without a penalty; by default, branches are predicted not taken. In a loop, however,<br />
branches are normally taken. In the basic model, if a branch is taken, the instructions in the decode and fetch stages must be<br />
discarded, effectively making a taken branch cost three clock cycles instead of one. To eliminate the performance penalty<br />
of a taken branch, a dynamic branch prediction buffer will be added to the fetch pipeline stage. This buffer acts<br />
as a cache: if a branch has been previously taken, the instruction fetched after the branch will be from the branch target address<br />
instead of the address immediately following the branch, effectively making taken branches one cycle. When the<br />
branch is then not taken, there is now a penalty to recover from the misprediction. As in the above examples, adding<br />
resources for the branch prediction buffer is quite easy. I chose to implement the buffer using the Codasip register file<br />
component, broken into three elements: the valid bit, the tag, and the branch target address.<br />
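The mechanics of such a buffer can be sketched in plain C (our approximation of the structure described above; the field and function names are illustrative, not the CodAL source):<br />

```c
#include <stdint.h>
#include <stdbool.h>

#define BPB_SIZE       2   /* depth found sufficient for the Kurtosis loop */
#define BPB_INDEX_BITS 1   /* log2(BPB_SIZE), sets the tag width */

/* Direct-mapped dynamic branch prediction buffer: valid bit, tag, and
 * branch target address, mirroring the three register-file elements. */
typedef struct {
    bool     valid[BPB_SIZE];
    uint32_t tag[BPB_SIZE];      /* upper PC bits */
    uint32_t target[BPB_SIZE];   /* predicted branch target address */
} bpb_t;

/* In fetch: on a hit, redirect the next fetch to the stored target. */
bool bpb_lookup(const bpb_t *b, uint32_t pc, uint32_t *next_pc) {
    uint32_t idx = (pc >> 2) & (BPB_SIZE - 1);
    uint32_t tag = pc >> (2 + BPB_INDEX_BITS);
    if (b->valid[idx] && b->tag[idx] == tag) {
        *next_pc = b->target[idx];
        return true;             /* predict taken */
    }
    return false;                /* predict not taken */
}

/* On a taken branch resolved in execute: record it for next time. */
void bpb_update(bpb_t *b, uint32_t pc, uint32_t target) {
    uint32_t idx   = (pc >> 2) & (BPB_SIZE - 1);
    b->valid[idx]  = true;
    b->tag[idx]    = pc >> (2 + BPB_INDEX_BITS);
    b->target[idx] = target;
}
```

Keeping BPB_SIZE a #define is what makes the size experiments described below cheap to run.<br />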

Fig. 18: Kurtosis disassembly with MAC instruction<br />

Hand assembly could generate more efficient code in this example. The compiler used an additional MAC operation per<br />
line of code that could have been an ADD instruction; with the MAC instruction equating to a three-cycle instruction and<br />
the ADD to a single-cycle instruction, the cycle count increased.<br />

Fig. 19: Dynamic Branch Prediction Buffer resources<br />



The use of #define statements enables the power of the high-level abstraction design flow for your processor<br />
design. By changing the definition of BPB_SIZE, you can quickly recompile and experiment to find an optimal<br />
branch prediction buffer size. To allow the cycle accurate model to automatically reconfigure for a change in the branch<br />
prediction buffer size, BPB_INDEX_SIZE is used to determine the width of the tag, per Fig. 20.<br />

Fig. 20: Dynamic Branch Prediction Buffer in cycle accurate model<br />

Obtaining execution data for buffer sizes of 2, 4, 8, and 16 took 45 minutes. It was demonstrated that<br />
the Kurtosis routine did not experience any performance increase for a buffer size greater than 2. With the dynamic<br />
branch buffer, each Kurtosis block is now executed in 1,347 us. The overall test program had a performance benefit<br />
at a buffer size of 16. For the analysis of this test loop, we will set the dynamic branch prediction buffer to a depth of 2,<br />
while realizing that the buffer size may change as additional elements of the overall algorithm are developed.<br />
<br />
At this point, I have completed my second day of the project: I have sped up the performance by 21.69 times,<br />
increased estimated area by 44%, and decreased power by 95% compared to the base design.<br />

IX. VECTOR PROCESSOR<br />

A significant tuning of the base processor was accomplished in two days, but to reach sub-400 us per<br />
Kurtosis block of 8,192 elements, the test data for the optimized processor indicates that a performance increase of<br />
between 3x and 4x is required. To obtain this level of performance increase, data-level<br />
parallelism will be implemented using a 4-lane vector processor. Adding a vector processor is a significantly greater<br />
effort than the resources added up to this point, but still very manageable. Using the same CPU design methodology,<br />
the instruction accurate model should be developed first, to verify the instructions in an assembly routine as well as to<br />
develop the compiler. After verifying the instruction accurate model, the cycle accurate model is created based on the<br />
defined instruction accurate model. From Fig. 21, you will notice that many files are replicated for the vector processor<br />
additions, indicated by simd (Single Instruction, Multiple Data).<br />

Fig. 21: Vector processor files indicated by simd<br />

To further aid vector processor development, Codasip Studio has many built-in functions that ease development of the vector processor instruction set, such as codasip_select_v4u32(v_cond, v_src1, v_src2). This routine selects, per lane, which vector data is stored in the destination vector register based on the value in the v_cond vector register. This built-in is one example of the many highly abstracted functions available to the designer.
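Assuming the per-lane semantics just described (take the src1 lane where the condition lane is nonzero, otherwise the src2 lane), the built-in can be modeled in scalar C as follows. This sketch models the behavior only; it is not the Codasip implementation.

```c
#include <stdint.h>

/* 4-lane vector of 32-bit unsigned values. */
typedef struct { uint32_t lane[4]; } v4u32;

/* Scalar model of per-lane select: where the condition lane is nonzero,
 * take src1's lane, otherwise src2's. */
v4u32 select_v4u32(v4u32 v_cond, v4u32 v_src1, v4u32 v_src2) {
    v4u32 dst;
    for (int i = 0; i < 4; i++)
        dst.lane[i] = v_cond.lane[i] ? v_src1.lane[i] : v_src2.lane[i];
    return dst;
}
```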

When developing the vector processor, its instruction set must include both the functions required by the target application and the instructions the compiler needs to generate code. For example, the compiler requires a vector move instruction to copy the value of one vector register to another. This move can be a unique instruction or an alias instruction that effectively performs the move (Fig. 22).



Fig. 22: Vector MV alias / pseudo instruction<br />

Supporting the development of the compiler will enable the<br />

compiler to automatically take advantage of the vector<br />

processor in loops when appropriate. The effort to support the<br />

compiler will be worth the portability and performance<br />

enhancements of the algorithm.<br />

At the time of this paper, I am working on the compiler to optimize the use of a secondary buffer memory, a dedicated vector memory. To obtain simulation results, I utilized the compiled inner loop in an assembly routine. The execution time of the Kurtosis test loop is now 389 µs, achieving the target of sub-400 µs.
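For reference, the computation being tuned can be sketched in plain C as below. This is an illustrative formulation (population kurtosis over one block), not the project's optimized assembly routine; the second loop is the hot accumulation that a vectorizing compiler can spread across the 4 lanes.

```c
#include <math.h>
#include <stddef.h>

/* Population kurtosis of one block: m4 / m2^2 over n samples. */
double kurtosis(const float *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    double mean = sum / (double)n;

    double m2 = 0.0, m4 = 0.0;
    for (size_t i = 0; i < n; i++) {   /* vectorizable accumulation */
        double d  = (double)x[i] - mean;
        double d2 = d * d;
        m2 += d2;
        m4 += d2 * d2;
    }
    m2 /= (double)n;
    m4 /= (double)n;
    return m4 / (m2 * m2);
}
```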

The effort to design, implement, and verify the vector<br />

processor took ten days. At the end of the twelfth day, I have<br />

sped up the performance by 75 times, increased estimated area<br />

by 612% and decreased power by 95% compared to the base<br />

design.<br />

X. CACHES<br />

Once the FPGA selection has been narrowed, a cache may need to be implemented to reach the target frequency of operation, either for performance or to remove the structural hazard that arises when an instruction fetch and a data access contend for the same memory in a single cycle [2]. In a highly abstracted design environment, cache implementation can be reduced to six lines of code specifying size, latencies, number of sets, block size, replacement policy, and non-cacheable addresses. The complexity of comparing and managing the cache tags, valid bit, and dirty bit is handled by the Cache component. As with evaluating different depths of the dynamic branch prediction buffer, experiments on the cache size can be turned around in 10 to 15 minutes. The ability to obtain data easily and quickly enables the design to be developed and optimized on schedule with minimal risk.
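The six cache parameters listed above can be pictured as a simple configuration record. The struct and field names below are hypothetical stand-ins chosen for this sketch, not the Codasip Cache component's actual syntax.

```c
#include <stdint.h>

/* Hypothetical configuration record mirroring the six cache parameters
 * listed above. */
typedef struct {
    uint32_t size_bytes;      /* total cache size                 */
    uint32_t hit_latency;     /* cycles on a hit                  */
    uint32_t miss_latency;    /* cycles on a miss                 */
    uint32_t num_sets;        /* associativity                    */
    uint32_t block_size;      /* line size in bytes               */
    enum { REPL_LRU, REPL_FIFO, REPL_RANDOM } replacement;
    struct { uint32_t lo, hi; } non_cacheable;  /* uncached range */
} cache_cfg_t;

/* Example: a small 2-way data cache with one uncached IO window. */
static const cache_cfg_t dcache_cfg = {
    .size_bytes    = 4096,
    .hit_latency   = 1,
    .miss_latency  = 10,
    .num_sets      = 2,
    .block_size    = 32,
    .replacement   = REPL_LRU,
    .non_cacheable = { 0x40000000u, 0x4FFFFFFFu },
};
```

Sweeping a cache-size experiment then amounts to editing one field and regenerating the model.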

Fig. 23: Codasip Data Cache component<br />

XI. CONCLUSION

An open hardware architecture, RISC-V, provides a platform on which the designer can customize a solution to match the end application while retaining the ability to execute open source software. Highly abstracted development tools provide the mechanism to tailor the solution to the end application's performance, power, area, and cost goals through simulation and data analysis. The ability to optimize the processor at any point in the development cycle reduces project risk and enhances time to market. The goal of the design is realization in silicon; these development tools generate an RTL soft core that can be synthesized into an FPGA or ASIC. Fig. 24 shows, as an example, the time Codasip Studio took to generate RTL for the RISC-V 4-lane vector processor.

Fig. 24: CU RISC-V Vector Processor RTL generation<br />

Based on these data-driven techniques, the project's embedded processor is designed with a multi-cycle MUL instruction, no MAC instruction, a 2-deep dynamic branch prediction buffer, and a 4-lane vector processor. The same techniques can be used to continually optimize the processor as the signal processing algorithm is developed, minimizing the risk of choosing the wrong IP core processor or of paying for unused capacity in power and silicon area.



XII. RESULTS<br />

ACKNOWLEDGMENTS<br />

I would like to acknowledge the efforts of Pavan Suresh<br />

Dhareshwar, Aravind Venkitasubramony, and Shivasankar<br />

Gunasekaran, who have made this project possible. I would<br />

like to give special gratitude to Zdenek Prikryl for his teaching<br />

and assistance with Codasip Studio.<br />

REFERENCES<br />

[1] David Patterson, Andrew Waterman, The RISC-V Reader, An Open<br />

Architecture Atlas, Berkeley, CA: Strawberry Canyon, 2017.<br />

[2] David A. Patterson, John L. Hennessy, Computer Organization and<br />

Design, The Hardware/Software Interface, RISC-V Edition, Cambridge,<br />

MA: Morgan Kaufmann, 2017.<br />

[3] The RISC-V Instruction Set Manual, Volume I: User-Level ISA,<br />

Document Version 2.2, RISC-V standard, 2017<br />

[4] Codasip Instruction Accurate Model Tutorial, Version 7.0.0, 2017<br />



Securing RISC-V Machines Dynamically<br />

with Hardware-Enforced Metadata Policies<br />

Steve Milburn<br />

Chief Technical Officer<br />

Dover Microsystems, Inc.<br />

Waltham, MA, USA<br />

steve@dovermicrosystems.com<br />

Greg Sullivan<br />

Chief Scientist<br />

Dover Microsystems, Inc.<br />

Waltham, MA, USA<br />

gregs@dovermicrosystems.com<br />

I. INTRODUCTION<br />

At the highest, most simplified level, the ideal approach to<br />

securing a system (hardware or software) consists of three<br />

inter-related activities:<br />

1. Specification: Precise definition of the required behavior<br />

of the system, including both acceptable and disallowed<br />

behaviors. This specification subsumes strictly “security-related”<br />

concerns and includes basic functional<br />

correctness. Groups within the RISC-V community are<br />

currently working in this area.<br />

2. Implementation: Implementation of hardware in some<br />

HDL (hardware description language), e.g. Verilog, and<br />

software in programming languages, such as C/C++, Java,<br />

etc. Presumably the implementation will be done with<br />

requirements, including security, in mind.<br />

3. Verification: Proof, or argument, that the implementation<br />

satisfies the specification. For simplicity, we will focus on<br />

verification of security properties, which can be viewed as<br />

the definition of “bad” behaviors and the proof that any<br />

“bad” behaviors cannot happen at runtime.<br />

The verification process can be divided into two<br />

timeframes:<br />

1. Static analysis: Examine the implementation artifacts and<br />

prove (through reasoning about possible runtime<br />

behaviors given the semantics of the implementation<br />

languages) that disallowed behaviors can never occur.<br />

The form of a proof statement is generally “for all<br />

possible inputs, the system will never exhibit this bad<br />

behavior.”<br />

2. Dynamic analysis: While executing the system, monitor<br />

for bad behavior and interrupt execution and recover if<br />

bad behavior is detected.<br />

Testing is a type of dynamic analysis in that it demonstrates<br />

the absence of bad behavior in some finite number of system<br />

executions. Testing can be used to increase confidence in the<br />

correctness of a system, but, for complex systems, testing is far<br />

short of a proof of correctness.<br />

In summary, a system cannot be considered fully trusted<br />

unless every component of the system has been subjected to<br />

either formal static verification or non-subvertible,<br />

comprehensive dynamic analysis. We summarize this<br />

methodology as follows:<br />

1. Prove what you can, before deployment: Using formal<br />

verification (described more fully later), verify as many<br />

properties as possible, for as many elements of the system as possible. It is beyond our current capabilities to

prove complete correctness and security for all hardware<br />

and software elements involved in complex systems. But<br />

we can prove many critical facts about many critical<br />

elements of systems, and we should.<br />

2. Enforce at runtime any properties not completely<br />

proven: For remaining misbehaviors that we cannot<br />

formally rule out, we need to detect those misbehaviors at<br />

runtime, before they cause harm, and recover.<br />

In the following sections, we focus on formal specification<br />

of security policies using micro-policies, and hardware-based<br />

dynamic enforcement of those micro-policies.<br />

It should be noted that the current state of the art in formal<br />

verification of hardware and software implementations falls far<br />

short of complete system-wide verification. However, great<br />

strides are being made in this area every year, especially on<br />

RISC-V.<br />

It is a daunting task to formalize the correctness and<br />

security of a complex system. It is even more difficult to<br />

attempt to prove that a large, complex implementation<br />

precisely matches the formal specification to which it is<br />

supposed to adhere. Indeed, what constitutes a “complete”<br />

specification of a system is not well-defined. Nonetheless, our<br />

goal is to promote an incremental approach where every<br />

additional specification of some of the properties of a part of<br />

the overall system increases overall trust and security of the<br />

system.<br />



II. FORMAL SPECIFICATION OF SECURITY POLICIES USING MICRO-POLICIES

As previously described, our goals are:<br />

1. Enable incremental, formal specification of desired<br />

security properties of a system,<br />

2. Formally prove, when possible, whether some elements<br />

(ideally, all) of the system adhere to some (ideally, all) of<br />

the security properties, and<br />

3. Dynamically enforce any policies, on any elements of the<br />

system, where static compliance (goal #2) has not been<br />

proven. There is ongoing research on the interaction<br />

between statically verified code and untrusted code [1].<br />

We want the same security specifications to apply to both formal proof and dynamic checking.

Our general approach, which we call micro-policies,<br />

formalizes the collection, propagation, and combination of<br />

metadata during program execution. The semantics of micro-policies<br />

leads to a natural implementation as runtime monitors.<br />

A. Micro-Policies<br />

Micro-policies were introduced in [2]. The formal model of<br />

a micro-policy is a function of five arguments (inputs) that<br />

returns three values (outputs), and that is invoked once for<br />

every instruction executed on the host machine.<br />

Inputs. The five inputs of a micro-policy capture the<br />

instantaneous state relevant to the execution of a single<br />

instruction on a host computer. Note that the values below are<br />

metadata about the corresponding elements of the host<br />

computer state:<br />

1. PC (Program Counter): Metadata associated with the<br />

distinguished PC register in the machine, that is updated<br />

at every cycle to point to the next instruction word to be<br />

executed. The PC metadata is a convenient place to track<br />

dynamic information flow properties, such as “have we<br />

branched on sensitive data?” or “are we in the process of<br />

a control flow transfer?”<br />

2. CI (Current Instruction): Metadata associated with the<br />

word containing the instruction currently being executed.<br />

At compile time, static analysis infers security-relevant<br />

properties of instructions, and that metadata is stored in<br />

the binary along with the instructions. When an<br />

application binary is loaded into memory, the information<br />

collected at compile time is associated, as metadata, with<br />

instruction words.<br />

3. OP1 (1st operand): Metadata associated with the first<br />

argument/operand of the current instruction. For a<br />

STORE instruction, this might be the register containing<br />

the address to be dereferenced. For an ADD instruction,<br />

OP1 will be the metadata associated with the first of two<br />

summands.<br />

4. OP2 (2nd operand): Metadata for the second operand.<br />

For a STORE instruction, this might be the register<br />

containing the data to write to memory. For an ADD<br />

instruction this will be the second summand.<br />

5. MEM (referenced memory): For memory operations<br />

(LOAD, STORE), metadata associated with the word in<br />

memory being referenced.<br />

Outputs: A policy first indicates whether the five inputs above represent an allowed instruction or a policy violation. If the inputs are allowed by the policy, then a

policy can update the metadata associated with any outputs of<br />

the instruction. The PC is always an output, but different<br />

instructions have different updates. For example, a STORE<br />

instruction will update the value at the referenced address,<br />

whereas an ADD instruction will update the value in a<br />

destination register. For our current purposes, we will assume<br />

that each instruction updates the PC and a single location<br />

(either a register or a memory address). Thus, the outputs of a<br />

policy, per instruction, are:<br />

1. Allowed?: If false, an interrupt is thrown on the host<br />

processor.<br />

2. PC’ (updated program counter): Updated value of<br />

metadata associated with the PC.<br />

3. RES (result): Updated value of metadata associated with<br />

a register or memory location modified by the current<br />

instruction.<br />

A policy can therefore be considered as having two<br />

components:<br />

1. A predicate that determines whether the current<br />

instruction, based on metadata associated with relevant<br />

state, is allowed, and<br />

2. A flow rule that updates the metadata on the PC and any<br />

data referenced by the instruction.<br />

B. Metadata Initialization<br />

The abstract model of micro-policies assumes that every<br />

word and register in a system has metadata associated with it.<br />

But where does this metadata come from? There are two<br />

possible scenarios for data entering the live memory of a<br />

running system:<br />

1. Unknown / untrusted. With no other information, we<br />

can only label data according to the route through which it<br />

entered the system. That is, we can label data according to<br />

which IO mechanism brought it into main memory (serial,<br />

network, DMA region, etc.)<br />

2. Labeled. We can devise trusted mechanisms for data to<br />

arrive in a system already labeled with some metadata. An<br />

example of this process is loading an application binary<br />

from untrusted persistent storage. A trusted analysis of<br />

source code, during the compilation process, produces a<br />

signed, encrypted collection of metadata indexed to the<br />

executable produced by the compilation process. When<br />

the operating system loader loads instruction words into<br />

memory, policy code concurrently loads metadata<br />



describing the application instruction words. Policy<br />

metadata loading checks the signature to verify that:<br />

a) The metadata was produced by a trusted analysis, and<br />

b) The binary described by the metadata has not been<br />

altered since the metadata was generated.<br />

With a mechanism for labeling words entering the system<br />

with initial metadata, we outline the sorts of properties that can<br />

be expressed and enforced using micro-policies.<br />

C. Expressible Micro-Policies<br />

Given the general framework outlined above, what sorts of<br />

properties can be enforced using micro-policies? Since the<br />

description so far of micro-policies has been abstract, the<br />

following are example policies that show how to state both<br />

metadata update rules and security predicates against metadata.<br />

Taint Tracking Policy: “Taint tracking” can be used to enforce both confidentiality (where data is allowed to travel) and integrity (tracking trusted data) policies. For this example,

imagine that we want to ensure that confidential data must go<br />

through a designated encryption routine before being copied to<br />

a memory-mapped IO region. As a first step, we define the<br />

type of metadata maintained by taint tracking:<br />

• Tainted?: Every word in memory will have a Boolean<br />

metadata attribute indicating whether it is tainted. In our<br />

example, tainted corresponds to being considered<br />

confidential. For simplicity, we assume that all confidential data in an application has its Tainted? metadata field set to True, and all other data has its Tainted? metadata field set to False.

• Outflow: Memory-mapped IO locations to which we<br />

want to apply our taint tracking policy will have<br />

Outflow set to True; all other words will have Outflow<br />

set to False.<br />

• UntaintInstr: A particular instruction 1 within the<br />

designated encryption routine will have the metadata<br />

field UntaintInstr set to True. All other words will have<br />

UntaintInstr as False.<br />

After defining the fields of metadata being maintained, we<br />

define the Flow Rules for propagating and updating those<br />

fields:<br />

• If any input to an operation is tainted, the output of the<br />

operation becomes tainted.<br />

• If the current instruction has UntaintInstr set to True,<br />

the output of the operation is untainted.<br />

Finally, we define the Security Predicate for the taint<br />

tracking policy:<br />

• If the output location of a memory operation has<br />

Outflow set to True, and the input value for the memory<br />

operation has Tainted set to True, then raise a security<br />

policy violation exception.<br />

1 In fact, we want to ensure that the entire encryption routine is executed,<br />

but we will keep it simple here and designate a single instruction as<br />

representing execution of the entire function.<br />
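The taint-tracking micro-policy above can be sketched as a single C function over metadata bit-flags that returns the allowed?/PC'/RES triple. The flag names, encoding, and function shape are illustrative assumptions for this sketch, not Dover's policy language.

```c
#include <stdbool.h>
#include <stdint.h>

/* Metadata bit-flags (hypothetical encoding). */
#define TAINTED      (1u << 0)  /* Tainted?     */
#define OUTFLOW      (1u << 1)  /* Outflow      */
#define UNTAINT_INST (1u << 2)  /* UntaintInstr */

/* The three micro-policy outputs: allowed?, PC', RES. */
typedef struct { bool allowed; uint32_t pc_out; uint32_t res; } policy_out_t;

/* Taint-tracking policy over the five metadata inputs. is_store flags a
 * STORE, where op2 carries the data value's metadata and mem the
 * referenced word's metadata. */
policy_out_t taint_policy(uint32_t pc, uint32_t ci, uint32_t op1,
                          uint32_t op2, uint32_t mem, bool is_store) {
    policy_out_t out = { true, pc, 0 };

    /* Security predicate: tainted data may not be written to an
     * Outflow (memory-mapped IO) location. */
    if (is_store && (mem & OUTFLOW) && (op2 & TAINTED)) {
        out.allowed = false;          /* raises a policy violation */
        return out;
    }
    /* Flow rules: the untaint instruction clears taint on its output;
     * otherwise taint propagates from any tainted input. */
    if (ci & UNTAINT_INST)
        out.res = 0;
    else if ((op1 | op2 | mem) & TAINTED)
        out.res = TAINTED;
    return out;
}
```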

Every micro-policy has these three components: metadata<br />

type structure, flow rules, and a predicate. We can define flow<br />

rules that are used by multiple micro-policy predicates.<br />

Control Flow Integrity Policy: Another example of a<br />

micro-policy is a simple control flow integrity (CFI) policy 2 .<br />

This CFI policy will label all instructions in a loaded<br />

application as either legal targets of control flow instructions<br />

(branches, jumps), or as not targets. The security predicate<br />

simply checks that all control flow instructions land at a legal<br />

target. We will assume that there is metadata on each<br />

instruction word indicating whether or not it is a control flow<br />

instruction (i.e. a branch, jump, or call). The metadata fields<br />

are:<br />

• InstrJump? – True for words containing instructions that<br />

are control flow instructions; False otherwise.<br />

• Target? – True for words containing instructions that<br />

are legitimate targets of control flow instructions; False<br />

otherwise.<br />

• Jumping? – This field exists only in the metadata for the PC (Program Counter) and is True when the preceding instruction was a control flow transfer instruction.

There is only a single Flow Rule:<br />

• If the current instruction has InstrJump? set to True, set<br />

the PC Jumping? field to True. Otherwise, set the PC<br />

Jumping? field to False.<br />

The Security Predicate is:<br />

• If the PC Jumping? field is True, and the current<br />

instruction metadata does not have Target? set to True,<br />

raise a Policy Violation exception.<br />
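Under the same assumed bit-flag encoding as before, this CFI micro-policy reduces to a few lines of C; the names are again illustrative, not a real policy implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Metadata bit-flags (hypothetical encoding). */
#define INSTR_JUMP (1u << 0)  /* InstrJump?             */
#define TARGET     (1u << 1)  /* Target?                */
#define JUMPING    (1u << 2)  /* Jumping? (PC metadata) */

/* Returns false on a policy violation. Otherwise applies the single
 * flow rule to the PC metadata and returns true. */
bool cfi_policy(uint32_t *pc_meta, uint32_t ci_meta) {
    /* Security predicate: a just-taken control transfer must land on
     * an instruction marked as a legal target. */
    if ((*pc_meta & JUMPING) && !(ci_meta & TARGET))
        return false;
    /* Flow rule: record whether this instruction transfers control. */
    if (ci_meta & INSTR_JUMP)
        *pc_meta |= JUMPING;
    else
        *pc_meta &= ~JUMPING;
    return true;
}
```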

Hopefully at this point the reader gets a sense of the process<br />

of defining flow rules and security predicates, even though we<br />

have elided many technical details.<br />

D. Composition of Micro-Policies<br />

Multiple micro-policies can be combined to create a<br />

Composed Policy. Flow rules compose in a straightforward<br />

manner, as long as they are concerned with updating disjoint<br />

metadata fields. There can be dependencies between flow<br />

rules; one flow rule may depend on metadata calculated by<br />

another flow rule, inducing an execution order on rules.<br />

Composing security predicates is also straightforward: each<br />

micro-policy has “veto power.” That is, if any individual<br />

security predicate raises an exception, then the composed<br />

policy raises that exception.<br />
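The veto-power composition of security predicates can be sketched as iterating over predicates until one rejects. The five-input predicate signature follows the micro-policy model above; the two helper predicates are toy examples invented for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Predicate signature following the five-input micro-policy model. */
typedef bool (*predicate_fn)(uint32_t pc, uint32_t ci, uint32_t op1,
                             uint32_t op2, uint32_t mem);

/* Toy example predicates (illustrative only). */
static bool always_allow(uint32_t pc, uint32_t ci, uint32_t op1,
                         uint32_t op2, uint32_t mem) {
    (void)pc; (void)ci; (void)op1; (void)op2; (void)mem;
    return true;
}
static bool deny_tainted_op1(uint32_t pc, uint32_t ci, uint32_t op1,
                             uint32_t op2, uint32_t mem) {
    (void)pc; (void)ci; (void)op2; (void)mem;
    return (op1 & 1u) == 0;   /* bit 0 = "tainted" in this toy encoding */
}

/* Composed predicate: every micro-policy has veto power. */
bool composed_allowed(const predicate_fn preds[], size_t n,
                      uint32_t pc, uint32_t ci, uint32_t op1,
                      uint32_t op2, uint32_t mem) {
    for (size_t i = 0; i < n; i++)
        if (!preds[i](pc, ci, op1, op2, mem))
            return false;     /* one rejection vetoes the instruction */
    return true;
}
```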

III. HARDWARE-BASED DYNAMIC ENFORCEMENT OF MICRO-POLICIES

Micro-policy enforcement at runtime can be viewed as a<br />

sort of Reference Monitor 3 and the form (data type, flow rules,<br />

predicate) outlined in the previous section lends itself to<br />

2 We are not implying that this simple control flow integrity policy will block<br />

sophisticated control flow hijacking attacks.<br />

3 Reference Monitor on Wikipedia:<br />

https://en.wikipedia.org/wiki/Reference_monitor (checked August 2017)<br />



implementation as an Inline Reference Monitor [3].<br />

Unfortunately, inline reference monitors suffer from two<br />

serious drawbacks in practice:<br />

1. The monitoring code happens in-line with the application<br />

code, adding substantial runtime overhead.<br />

2. If the attacker is assumed to have some control over the<br />

control flow of the target application, the attacker may be<br />

able to jump around or otherwise subvert the inline<br />

reference monitors.<br />

Dover’s CoreGuard solution implements dynamic<br />

enforcement of security policies in hardware, with<br />

configuration of that hardware done using updatable software<br />

running on an independent processor. For each instruction the<br />

host processor executes, CoreGuard hardware gathers the<br />

relevant metadata tags for the instruction’s inputs, checks a<br />

hardware policy rule cache, and applies the appropriate security<br />

predicate flow rule to the instruction’s outputs. If a policy rule<br />

for the inputs is not found in the hardware policy rule cache,<br />

CoreGuard’s included RISC-V processor core checks the<br />

installed policy software to determine the needed rule, and<br />

provides it to the hardware for application and caching. This<br />

hybrid approach enables CoreGuard to run at-speed with the<br />

host processor checking every single executed instruction,<br />

while also allowing for the benefits of software-defined<br />

policies, such as updateability, arbitrary complexity, and<br />

composition of multiple micro-policies.<br />

CoreGuard’s hardware uses three main components: A)<br />

Hardware Interlock, B) Rule Cache, and C) Policy Executor<br />

(PEX).<br />

A. Hardware Interlock<br />

The hardware interlock controls communication between<br />

the host processor and the rest of the system to ensure that<br />

nothing is written to external memory or peripherals without<br />

first flowing through CoreGuard.<br />

B. Rule Cache<br />

The CoreGuard hardware uses a rule cache to optimize<br />

performance by storing rule processing data so that future<br />

requests for that data can be served faster. The rule cache stores<br />

a number of metadata combinations for allowed instructions—<br />

that is, for instructions that complied with micro-policies and<br />

were therefore allowed to execute. Each rule cache entry<br />

corresponds to a unique set of metadata tags for the<br />

instruction's inputs and outputs (described in the earlier Micro-<br />

Policies section). When CoreGuard processes the current<br />

instruction, it looks to see if the instruction's input metadata tag<br />

combination exists in the rule cache for that instruction.<br />

The rule cache is a multi-way skew associative cache,<br />

which aims at reducing misses by using different indices<br />

through hashing. By default, CoreGuard will evict the least<br />

recently added cache entry when it needs to make room for a<br />

new tag combination; this eviction policy, however, is<br />

configurable via micro-policies. With each instruction that it<br />

processes, CoreGuard updates the output metadata for that<br />

instruction.<br />
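The lookup-miss-install flow just described can be sketched as below. The real CoreGuard rule cache is multi-way skew associative with a configurable eviction policy; this illustrative version is direct-mapped and shows only the flow, with all names invented for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RC_ENTRIES 256

typedef struct {
    bool     valid;
    uint32_t key[5];        /* PC, CI, OP1, OP2, MEM metadata tags */
    bool     allowed;
    uint32_t pc_out, res;   /* cached flow-rule outputs            */
} rc_entry_t;

static rc_entry_t rc[RC_ENTRIES];

/* FNV-1a hash over the five input metadata tags. */
static uint32_t rc_hash(const uint32_t k[5]) {
    uint32_t h = 2166136261u;
    for (int i = 0; i < 5; i++) { h ^= k[i]; h *= 16777619u; }
    return h % RC_ENTRIES;
}

/* Hit: copy the cached decision into *hit and return true.
 * Miss: return false; the PEX then evaluates the installed policy
 * software and calls rc_install() with the computed rule. */
bool rc_lookup(const uint32_t k[5], rc_entry_t *hit) {
    const rc_entry_t *e = &rc[rc_hash(k)];
    if (e->valid && memcmp(e->key, k, sizeof e->key) == 0) {
        *hit = *e;
        return true;
    }
    return false;
}

void rc_install(const uint32_t k[5], bool allowed,
                uint32_t pc_out, uint32_t res) {
    rc_entry_t *e = &rc[rc_hash(k)];
    e->valid = true;
    memcpy(e->key, k, sizeof e->key);
    e->allowed = allowed; e->pc_out = pc_out; e->res = res;
}
```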

C. Policy Executor (PEX)<br />

The Policy Executor (PEX) is the RISC-V processor core<br />

included with CoreGuard to execute micro-policy code. Having<br />

a separate processor enables a clean separation between policy<br />

processing and host processing, which gives CoreGuard greater<br />

control of rule processing and better ability to optimize<br />

performance.<br />

When the system is first initialized, the PEX initializes<br />

metadata for all the words in memory available to the system.<br />

It then loads the application and sets up all application-specific<br />

metadata in memory.<br />

When there is a rule cache miss, the PEX crosschecks the<br />

metadata for the current instruction against the micro-policies<br />

installed on the system. Based on input metadata, the PEX<br />

updates and creates new output metadata.<br />

For complete protection, Dover Microsystems also recommends a mechanism in the SoC that prevents the host processor from accessing the metadata and policy software regions of memory used by the CoreGuard hardware. This ensures that host software cannot manipulate the metadata or policy software to bypass the protection. Such enforcement can be realized with technologies typically already present in SoC network fabrics.

IV. CONCLUSION

By combining formal verification of system components<br />

where the technology exists to enable it, and dynamic analysis<br />

with Dover CoreGuard utilizing Software Defined Metadata<br />

Processing, an execution environment can approach complete<br />

protection. Formal verification of the metadata policies<br />

running in the CoreGuard system completes the security<br />

solution, and is the subject of ongoing research.<br />

REFERENCES<br />

[1] Y. Juglaret, C. Hritcu, A. Azevedo, B. C. Pierce, A.<br />

Spector-Zabusky and A. Tolmach, "Towards a fully<br />

abstract compiler using Micro-Policies: Secure<br />

compilation for mutually distrustful components," arXiv,<br />

2015.<br />

[2] A. Azevedo de Amorim, M. Denes, N. Giannarakis, C.<br />

Hritcu, B. C. Pierce, A. Spector-Zabusky and A. Tolmach,<br />

"Micro-Policies: Formally Verified, Tag-Based Security<br />

Monitors," in IEEE Symposium on Security and Privacy,<br />

SP 2015, San Jose, CA, USA, 2015.<br />

[3] U. Erlingsson, "The inlined reference monitor approach to<br />

security policy enforcement.," 2003.<br />



Cycle Approximate Simulation of RISC-V<br />

Processors<br />

Lee Moore, Duncan Graham, Simon Davidmann<br />

Imperas Software Ltd.<br />

Oxford, United Kingdom<br />

simond@imperas.com<br />

Felipe Rosa<br />

Universidad Federal Rio Grande Sul<br />

Brazil<br />

Abstract— Historically, architectural estimation, analysis and optimization for SoCs and embedded systems has been done using manual spreadsheets, hardware emulators, FPGA prototypes, or cycle approximate and cycle accurate simulators. The precision of these approaches comes at the cost of performance and modeling flexibility. Instruction accurate simulation models in virtual platforms have the speed necessary to cover the range of system scenarios, can be available much earlier in the project, and are typically an order of magnitude less expensive than cycle approximate or cycle accurate simulators. Previously, because of a lack of timing information, virtual platforms could not be used for timing estimation. We report here on a technique for dynamically annotating timing information onto instruction accurate software simulation results. This has achieved an accuracy of better than +/-10%, which is appropriate for early architectural exploration and system analysis. This Instruction Accurate + Estimation (IA+E) approach is constructed from Open Virtual Platforms (OVP) processor models plus a library that can introspect the running system and calculate an estimate of the cycles taken to execute the current instruction. Not only can these add-on libraries dynamically inspect the running system and estimate timing effects, they can also annotate the calculated instruction cycle timing back into the simulation and affect its timing.

Keywords—RISC-V, virtual platform, instruction accurate,<br />

processor models, timing estimation<br />

I. INTRODUCTION<br />

Performance and power consumption are two key attributes<br />

of any SoC and embedded system. Systems often have hard<br />

timing requirements that must be met, for example in safety<br />

critical systems where reaction time is of paramount<br />

importance. Other systems, particularly battery powered<br />

systems, have power consumption limitations.<br />

Because of the importance of these characteristics, many<br />

techniques have been developed for estimation of performance<br />

and power consumption. Recently, with the explosion of<br />

system scenarios that must be considered, this job has become<br />

much more difficult.<br />

Instruction accurate simulation has previously not been<br />

considered as a potential technique for timing and power<br />

estimation, because it is instruction accurate and does not<br />

model processor microarchitecture details: there is no<br />

information about timing or power consumption of instructions<br />

and actions in instruction accurate models and simulators.<br />

Recently some universities, using the Open Virtual Platforms<br />

(OVP) models and OVPsim simulator [1], have experimented<br />

with adding this information into the instruction accurate<br />

simulation environment as libraries, with no changes to the<br />

models or simulation engines [2]. These efforts have shown<br />

great promise, with timing estimation results within +/- 10% of<br />

the actual timing results for the hardware for limited cases.<br />

We report here on the further development of this<br />

technique, and the extension of this technique for RISC-V ISA<br />

based processors. This is critical for the RISC-V ecosystem,<br />

since for RISC-V semiconductor vendors to win embedded<br />

system sockets, their customers are going to want to know<br />

about the timing and power consumption of those SoCs when<br />

running different application software.<br />

II. CURRENT STATE OF THE ART<br />

Historically, SoC architectural estimation, analysis and<br />

optimization has been done using either manual spreadsheets,<br />

hardware emulators, FPGA prototypes, cycle approximate<br />

simulators, cycle accurate simulators, or performance<br />

simulators such as Gem5 [3]. These all have significant<br />

drawbacks: insufficient accuracy, high cost, dependence on RTL<br />

availability (meaning the technique only becomes usable late in<br />

the project, once the RTL design is complete), low performance,<br />

limited ability to support a wide range of system scenarios, or<br />

great complexity in use and in obtaining good results. Table 1 provides a<br />

summary of the strengths and weaknesses of each technique.<br />

www.embedded-world.eu<br />



TABLE I. STRENGTHS AND WEAKNESSES OF CURRENTLY USED<br />

TECHNIQUES FOR TIMING AND POWER ESTIMATION<br />

Technique | Strength | Weaknesses<br />

Manual spreadsheets | Ease of use | Lack of accuracy; inability to support estimations with real software<br />

Hardware emulators | Cycle accurate | High cost (millions USD); needs RTL; < 5 MIPS performance<br />

FPGA prototypes | Cycle accurate | High cost (hundreds of thousands USD); needs RTL<br />

Cycle approximate simulation | Good performance | Lack of accuracy; lack of availability of models<br />

Cycle accurate simulation | Cycle accurate | High cost (hundreds of thousands of USD); lack of availability of models<br />

Gem5 | Microarchitectural detail | A lot of work to develop a model of the specific microarchitecture and to get realistic traces of the SoC<br />

III. INSTRUCTION ACCURATE SIMULATION<br />

Instruction set simulators (ISSs) have long been used by<br />

software engineers as a vehicle for software development.<br />

Over the last 20 years, this technique has been extended to<br />

support not only modeling of the processor core, but also<br />

modeling of the peripherals and other components on the SoC.<br />

The advantages of these simulators are their performance,<br />

typically hundreds of millions of instructions per second<br />

(MIPS), and the relative ease of building the necessary models.<br />

However, the simulator engines and models are instruction<br />

accurate, and are not built to support timing and power<br />

estimation.<br />

The performance of these simulators comes from the use of<br />

Just-In-Time (JIT) binary translation engines, which translate<br />

the instructions of the target processor (e.g. Arm) to<br />

instructions on the host x86 PC. This enables users to run the<br />

same executables on the instruction accurate simulator as on<br />

the real hardware, such that the software does not know that it<br />

is not running on hardware. Peak performance with these<br />

simulators can reach billions of instructions per second. A<br />

more typical use case, such as booting SMP Linux on a<br />

multicore Arm processor, takes less than 10 seconds on a<br />

desktop x86 machine.<br />
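The translation-cache mechanism behind this JIT speed can be illustrated with a toy sketch: blocks of target code are translated once on a cache miss, then re-executed many times from the cache. All names here (`tcache_fetch`, `run_loop`) and the block granularity are invented for the example; this is not the OVPsim implementation.

```c
#include <stdint.h>

/* Illustrative sketch of a JIT translation cache mapping target PCs to
 * translated blocks. Structure and names are invented for this example;
 * a real JIT simulator such as OVPsim is far more sophisticated. */

#define TCACHE_SIZE 256

typedef struct {
    uint32_t pc;      /* target PC this entry translates */
    int      valid;
} tcache_entry_t;

static tcache_entry_t tcache[TCACHE_SIZE];
static long translations;  /* blocks translated (slow path, runs once)  */
static long executions;    /* blocks executed (fast path, runs often)   */

/* Fetch a translated block, translating on a cache miss. */
static tcache_entry_t *tcache_fetch(uint32_t pc)
{
    tcache_entry_t *e = &tcache[(pc >> 2) % TCACHE_SIZE];
    if (!e->valid || e->pc != pc) {  /* miss: translate once */
        e->pc = pc;
        e->valid = 1;
        translations++;
    }
    executions++;
    return e;
}

/* Run a 3-block "loop" n times: each block is translated once, then
 * re-executed from the cache -- the source of JIT performance. */
static long run_loop(int n)
{
    for (int i = 0; i < n; i++)
        for (uint32_t pc = 0x1000; pc < 0x1030; pc += 0x10)
            (void)tcache_fetch(pc);
    return executions;
}
```

Iterating the loop 1000 times yields 3000 block executions but only 3 translations, which is why translated simulation approaches native speed for hot code.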

There are also significant libraries of models available, and<br />

it is easier to build instruction accurate models than models<br />

with timing or power consumption information, or real<br />

implementation details. One such library and modeling<br />

technology is available from OVP. The OVP processor model<br />

library includes models of over 200 separate processors (e.g.<br />

Arm, MIPS, Power, Renesas, RISC-V), plus a similar number<br />

of peripheral models. Most of these models are available as<br />

open source. The C APIs for building these models are also<br />

freely available as an open standard from OVP.<br />

IV. INSTRUCTION ACCURATE SIMULATION PLUS<br />

ESTIMATION<br />

Instruction accurate simulation holds the promise of faster<br />

simulation performance to support examination of more system<br />

scenarios, plus lower cost and earlier availability. With the<br />

Imperas APIs and dynamic model introspection it is easy to<br />

add timing and power estimation capabilities to the<br />

instruction accurate simulation environment.<br />

The approach of adding these capabilities as libraries combines<br />

annotation techniques with the binary interception<br />

libraries used with JIT simulation engines. Annotation<br />

techniques can be imagined as a full instruction trace which is<br />

then annotated with the timing or power information.<br />

However, just using annotation requires significant host PC<br />

memory, and can slow the simulation.<br />

Binary interception libraries are used with the Imperas JIT<br />

simulators to enable the non-intrusive addition of tools, such as<br />

code coverage and profiling, to the simulation environment.<br />

Combining these techniques maintains the high simulator<br />

performance with minimal memory costs. This combined<br />

technique is being called Instruction Accurate + Estimation<br />

(IA+E).<br />

In the Imperas simulation products, which require the use<br />

of OVP models, it is possible to create a standalone library<br />

module with entry points that are called when instructions are<br />

executed. This library can introspect the running system and<br />

calculate an estimate for the cycles taken to execute the current<br />

instruction, and can take into account overhead of different<br />

memory and peripheral component latencies. Not only can<br />

these add-on libraries dynamically inspect the running system<br />

and estimate timing effects, they can annotate calculated<br />

instruction cycle timing back into the simulation and affect (i.e.<br />

stretch) timing of the simulation. An overview of the<br />

simulation architecture is shown in Figure 1.<br />

Fig. 1. Overview of the Imperas IA+E simulation environment.<br />

For processors, the instruction estimation algorithm<br />

includes:<br />

• a mixture of table look ups for simple instructions<br />

• dynamic calculations for data dependent instructions<br />

• adjustments due to code branches taken<br />

• taking into account effects of memory and register<br />

accesses<br />

A view of the timing estimation mechanism is shown in<br />

Figure 2.<br />

40


Fig. 2. Simplified view of the timing estimation mechanism.<br />
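The estimation algorithm listed above — table look-ups for simple instructions, dynamic calculations for data-dependent ones, taken-branch adjustments, and memory access effects — can be sketched in C. This is a hand-written illustration: the cycle counts, instruction classes and structure names are invented for the example, not taken from the Imperas library.

```c
#include <stdint.h>

/* Illustrative cycle estimator combining the mechanisms described in the
 * text: a base-cycle table for simple instructions, a dynamic term for
 * data-dependent ones, a taken-branch penalty, and back-annotated memory
 * latency. All cycle numbers are invented for the example. */

typedef enum { OP_ALU, OP_MUL, OP_DIV, OP_LOAD, OP_STORE, OP_BRANCH } opclass_t;

static const unsigned base_cycles[] = {
    [OP_ALU] = 1, [OP_MUL] = 3, [OP_DIV] = 0,   /* DIV computed dynamically */
    [OP_LOAD] = 1, [OP_STORE] = 1, [OP_BRANCH] = 1,
};

typedef struct {
    opclass_t op;
    uint32_t  operand;     /* for data-dependent timing (e.g. the divisor) */
    int       taken;       /* for branches                                 */
    unsigned  mem_latency; /* extra cycles charged by the memory model     */
} insn_info_t;

/* Called once per executed instruction by the interception layer. */
unsigned estimate_cycles(const insn_info_t *ii)
{
    unsigned c = base_cycles[ii->op];            /* simple table look-up */
    if (ii->op == OP_DIV)                        /* data-dependent: one
                                                    cycle per result bit */
        c = 2 + (32 - __builtin_clz(ii->operand | 1));
    if (ii->op == OP_BRANCH && ii->taken)
        c += 2;                                  /* pipeline refill penalty */
    if (ii->op == OP_LOAD || ii->op == OP_STORE)
        c += ii->mem_latency;                    /* back-annotated latency */
    return c;
}
```

Summing these estimates over a run gives the cycle count, which the library can also annotate back into the simulation to stretch simulated time.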

For memory subsystems and peripheral components, table<br />

lookup and dynamic estimation can be made, and timing<br />

back-annotated into the simulation to simulate the delay effects of<br />

slow memories and other components.<br />

With this Instruction Accurate + Estimation (IA+E)<br />

approach, there is a separation of processor model functionality<br />

and timing estimation. This means while building a functional<br />

model there is no need to worry about any timing or cycle<br />

complexity. It is only when more detailed timing is needed<br />

that the extra timing data must be added to enable the<br />

Imperas IA+E timing tools to provide cycle approximate<br />

timing simulation for the RISC-V processors.<br />

This extra timing data is added in two steps. First, the cycle<br />

information is added to the library. Second, the time per cycle,<br />

which is dependent upon the specific semiconductor process<br />

and physical implementation details, is added.<br />
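As a worked illustration of this two-step split, the cycle counts from step one become time once the step-two clock period is supplied. The function and the numbers below are hypothetical examples, not values from the paper.

```c
#include <stdint.h>

/* Step 1 supplies cycle counts from the timing library; step 2 supplies
 * the implementation-specific time per cycle, which depends on process
 * and physical design. Example numbers are hypothetical. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t period_ps)
{
    return cycles * period_ps / 1000;  /* picoseconds -> nanoseconds */
}

/* e.g. 1,000,000 cycles at 1 GHz (1000 ps per cycle) -> 1,000,000 ns */
```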

The approach of providing the timing data as a separately<br />

linked dynamic program enables RISC-V processor designers<br />

to create a cycle approximate timing simulation for their<br />

specific processor implementation, without sharing any<br />

internal information.<br />

IA+E simulation runs slower than normal instruction<br />

accurate simulation, with a typical overhead of about 50%.<br />

Still, this puts IA+E simulation<br />

performance at 100-500 MIPS.<br />

IA+E does have some limitations. This technique has<br />

currently been proven only for simple processors with a single<br />

core, no cache, and an in-order pipeline.<br />

V. RESULTS<br />

This IA+E technique was first tested with Arm Cortex-M4<br />

based processors. The results were much better than expected,<br />

with an average estimation error of +/- 5% as compared to the<br />

actual device. The device was an STMicroelectronics<br />

STM32F on a standard development board, running the<br />

FreeRTOS real time operating system, with 39 different<br />

benchmark applications used. Almost all timing estimation<br />

errors were within +/- 10% of actual timing values. Figure 3<br />

shows these results.<br />

Fig. 3. Timing estimation results for IA+E simulation show average errors of<br />

better than +/- 5% over 39 different benchmarks for Arm Cortex-M4.<br />

IA+E was recently extended to support RISC-V processors,<br />

by using publicly available information (from the processor<br />

vendors' data books) to build the cycle data libraries.<br />

In the data below, showing processor implementations from<br />

Andes Technology, Microsemi and SiFive, only the cycle data<br />

is presented, since comparing timing for the various<br />

implementations would not be an accurate comparison. Also,<br />

in keeping with this theme, different benchmark applications<br />

were used for each of the different processors. All benchmarks<br />

were run with a range of compiler optimization settings, and<br />

estimated cycles were reported first assuming 1 cycle per<br />

instruction, i.e. using IA, then using the IA+E technique.<br />

These results are shown in Figs. 4-6.<br />

VI. CONCLUSIONS<br />

The Instruction Accurate + Estimation (IA+E) technique<br />

developed here has shown excellent results for timing<br />

estimation of in-order processors. It also has the benefits of<br />

easy model building, high performance to enable examination<br />

of multiple benchmarks and system scenarios, and lower cost<br />

than other techniques. In this paper, the IA+E technique has<br />

been extended to support RISC-V processors. Further work is<br />

needed to apply this technique to power estimation, and to<br />

more complex processors.<br />

ACKNOWLEDGMENTS<br />

The authors would like to thank Andes Technology,<br />

Microsemi, and SiFive for access to their processor datasheets<br />

and/or databooks.<br />

REFERENCES<br />

[1] Open Virtual Platforms (OVP), www.OVPworld.org<br />

[2] Felipe Da Rosa, Luciano Ost, Ricardo Reis, Gilles Sassatelli.<br />

Instruction-Driven Timing CPU Model for Efficient Embedded Software<br />

Development Using OVP. ICECS: International Conference on<br />

Electronics, Circuits, and Systems, Dec 2013, Abu Dhabi, United Arab<br />

Emirates.<br />

[3] Gem5, www.gem5.org<br />



Fig. 4. IA+E cycle estimation results for the Andes N25 processor.<br />

Fig. 5. IA+E cycle estimation results for the Microsemi Mi-V RV32IMA processor.<br />

Fig. 6. IA+E cycle estimation results for the SiFive E31 processor.<br />



A RISC-V Based Heterogeneous Cluster with<br />

Reconfigurable Accelerator for Energy Efficient<br />

Near-Sensor Data Analytics<br />

Davide Rossi<br />

DEI, University of Bologna<br />

Bologna, Italy<br />

davide.rossi@unibo.it<br />

Abstract- The end-nodes of the IoT require high performance<br />

and energy efficiency to match stringent constraints of complex<br />

near-sensor data analytics algorithms. Processing on multiple<br />

near-threshold processors is an emerging paradigm which<br />

combines the energy efficiency of low-voltage operation with the<br />

performance of parallel execution. In this work, we present a<br />

near-threshold heterogeneous architecture which extends a<br />

RISC-V based parallel processor cluster with a reconfigurable<br />

Integrated Programmable Array (IPA) accelerator. While the<br />

homogeneous cluster delivers high-performance when executing<br />

data-parallel kernels, offloading control-intensive kernels to the<br />

IPA leads to much higher system-level performance and energy<br />

efficiency, thanks to the exploitation of instruction-level<br />

parallelism rather than data-level parallelism. Results show that<br />

the heterogeneous architecture outperforms an 8-core cluster by<br />

up to 4.8x in performance and 4.5x in energy efficiency when<br />

executing a mix of control-intensive and data-intensive kernels<br />

typical of near-sensor data analytics applications.<br />

Keywords-RISC-V processor, parallel architecture, near-threshold<br />

computing, heterogeneous computing, reconfigurable<br />

computing.<br />

I. INTRODUCTION<br />

High performance and extreme energy efficiency are strict<br />

requirements for many deeply embedded near-sensor<br />

processing applications such as wireless sensor networks,<br />

end-nodes of the Internet of Things (IoT) and wearables. One of the<br />

most traditional approaches to improving the energy efficiency of<br />

deeply embedded computing systems is to exploit<br />

architectural heterogeneity by coupling general-purpose<br />

processors with application- or domain-specific accelerators in<br />

a single computing fabric [1][2]. On the other hand, most<br />

recent ultra-low power designs exploit multiple homogeneous<br />

programmable processors operating in near-threshold [3]. Such<br />

an approach, which joins parallelism with low-voltage<br />

computing, is emerging as an attractive way to combine<br />

performance scalability with high energy efficiency.<br />

In this paper, we present a heterogeneous architecture<br />

which integrates a near-threshold tightly-coupled cluster of<br />

processors [3] augmented with the Integrated Programmable<br />

Array (IPA) presented in [4]. This approach joins the<br />

programming legacy of instruction processors with the flexible<br />

performance and efficiency boost of Coarse Grain<br />

Reconfigurable Arrays [4][5] (CGRA). A similar approach was<br />

adopted in [6], which introduced an ultra-low power<br />

heterogeneous system featuring a Single Instruction Multiple<br />

Data (SIMD) CGRA as reconfigurable accelerator for biosignal<br />

analysis. With respect to this domain-specific<br />

architecture, where the computational kernels are mapped<br />

manually on the CGRA, the system proposed in this work is<br />

meant for general-purpose near-sensor data analytics, also<br />

relying on an automated compilation flow that allows<br />

generating the configuration bitstream for the CGRA starting<br />

from a general-purpose ANSI-C code [4].<br />

We synthesized the architecture in a 28nm FD-SOI<br />

technology, and we carried out a quantitative exploration<br />

combining physical synthesis results (i.e. frequency, area, and<br />

power) and benchmarking of a set of signal processing kernels<br />

typical of end-nodes IoT applications. Two interesting findings<br />

of the proposed exploration show that (1) the performance of<br />

the IPA is much less sensitive to memory bandwidth than<br />

parallel processor clusters and that (2) the simpler nature of its<br />

architecture allows the IPA to run twice as fast as the rest of the<br />

system. Exploiting these two features of the architecture, we<br />

show that the heterogeneous cluster achieves significant<br />

performance and energy improvement for both compute and<br />

control intensive benchmarks with respect to the 8 core<br />

homogeneous cluster, achieving up to 4.8x speed-up and up to<br />

4.4x better energy efficiency.<br />

II. HETEROGENEOUS CLUSTER ARCHITECTURE<br />

The proposed heterogeneous cluster architecture is based<br />

on the PULP (Parallel Ultra Low Power) platform [3],<br />

featuring a configurable number of RI5CY processors [7]. The<br />

cores are based on an in-order pipeline with four balanced<br />

stages optimized for energy efficient operation, which share a<br />

multi-banked scratchpad memory through a low-latency<br />

logarithmic interconnect [8]. The original RISC-V ISA is<br />

extended with instructions targeting energy efficient digital<br />

signal processing, such as hardware loops, memory accesses<br />

with automatic pointer increment, SIMD operations, and bit<br />

manipulation instructions. The cores share a latch-based<br />



Fig. 1. Heterogeneous PULP Cluster Architecture.<br />

instruction cache to boost performance and energy-efficiency<br />

over traditional SRAM-based private instruction caches. A<br />

lightweight multichannel DMA optimized for energy-efficient<br />

operation manages data transfers between the L1 memory and<br />

the off-cluster L2 memory. Both the I$ and the DMA<br />

converge on an AXI4 cluster bus connected to dual-clock<br />

FIFOs featuring level shifters, enabling the cluster to operate at<br />

the desired voltage and frequency independently of the rest of<br />

the SoC. A peripheral interconnect connects the processors to<br />

the cluster peripherals such as timers, an event unit, and other<br />

memory mapped peripherals or accelerators integrated into the<br />

cluster, such as the IPA.<br />

The IPA is built around an array of 16 processing elements<br />

(PEs) communicating through a 2D torus interconnect. Each<br />

PE features a 32-bit ALU, supporting a reduced instruction set<br />

that includes arithmetic and logic operations, 16-bit to 32-bit<br />

multiplications and control flow operations such as jumps and<br />

branches. The PEs fetch instructions from the Instruction<br />

Register File (IRF), which stores the program. A Regular<br />

Register File (RRF) stores temporary variables, while a<br />

Constant Register File (CRF) stores immediates. The ALU<br />

features two input operands coming from neighboring PEs or the<br />

internal register files (RRF and CRF). A parametric number of<br />

PEs, defined at design time, can be instrumented with a<br />

load-store unit employing the request-grant protocol of the PULP<br />

logarithmic interconnect [8]. This protocol allows the<br />

integration of the IPA into the heterogeneous cluster just as any<br />

other programmable processor, sharing the same multi-banked<br />

memory. The configuration bitstream for the IPA is generated<br />

automatically by a compilation flow starting from ANSI-C<br />

description of the computational kernels [4]. Since PEs may<br />

not all operate at the same time, to reduce dynamic power<br />

consumption in idle mode, the IPA integrates a tiny Power<br />

Management Unit (PMU) responsible for clock gating PEs<br />

when idle.<br />

The heterogeneous PULP cluster described in this work is<br />

based on 8 RI5CY processors, 64kB of shared data memory,<br />

4kB of shared instruction memory, and is extended with the<br />

Integrated Programmable Array accelerator (Fig. 1). Fig. 2<br />

shows a detailed block diagram of the subsystem including the<br />

Fig. 2. Block Diagram of the IPA subsystem.<br />

IPA array. The configuration bitstream is stored into a global<br />

context memory (GCM) loaded into the IPA PEs through a<br />

dedicated controller (IPAC). The GCM is connected through a<br />

DMA-capable AXI-4 port to the cluster bus, enabling<br />

prefetching of IPA contexts from L2 memory. The GCM is<br />

sized at twice the worst-case size of the IPA configuration<br />

bitstream, allowing a ping-pong buffering policy in which a new<br />

bitstream is fetched from L2 while the current one<br />

is being loaded on the array, completely hiding the<br />

reconfiguration time. A set of memory-mapped control<br />

registers is used to<br />

load a new context to the IPA array, trigger execution of kernels<br />

and synchronize with the programmable processors in the<br />

cluster.<br />
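The ping-pong policy can be sketched as follows. The buffer size, structure and function names are invented for illustration, and the copy loop stands in for the DMA transfer from L2 into the global context memory (GCM).

```c
/* Illustrative ping-pong buffering of IPA configuration contexts: the
 * GCM holds two bitstream slots, so the next context can be prefetched
 * into the idle slot while the other is active on the array. Sizes and
 * names are invented for the example. */

#define CONTEXT_WORDS 64

typedef struct {
    unsigned slot[2][CONTEXT_WORDS];
    int active;  /* slot currently loaded on the array */
} gcm_t;

/* Prefetch the next context into the idle slot (stand-in for the DMA
 * transfer from L2 memory). */
void gcm_prefetch(gcm_t *g, const unsigned *bitstream)
{
    int idle = 1 - g->active;
    for (int i = 0; i < CONTEXT_WORDS; i++)
        g->slot[idle][i] = bitstream[i];
}

/* Swap buffers: the prefetched context becomes active with no wait,
 * hiding the reconfiguration time behind the previous kernel. */
const unsigned *gcm_swap(gcm_t *g)
{
    g->active = 1 - g->active;
    return g->slot[g->active];
}
```

Because the swap is just an index flip, reconfiguration costs no array idle time as long as the prefetch completes before the previous kernel finishes.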

As opposed to many CGRA architectures, the IPA is<br />

capable of accessing a multi-banked shared memory through 8<br />

master ports connected to the low-latency interconnect. This<br />

eases data sharing with the other processors of the cluster,<br />

following the computational model described in [4]. The<br />

optimal number of ports has been chosen to optimize the<br />

tradeoff between the size of the interconnect and the bandwidth<br />

requirements of the IPA. Since the IPA can operate twice as<br />

fast as the processors, we have extended the architecture of the<br />

cluster so that the IPA can run at twice the<br />

frequency of the rest of the cluster. This approach allows<br />

each component in the cluster to operate at its optimal frequency,<br />

without paying the overhead of dual-clock FIFOs, which require a<br />

significant amount of logic and synchronization overhead. On<br />

the contrary, the hardware support for the dual-frequency mode<br />

includes a clock divider to generate the two different<br />

edge-aligned clocks, and two modules needed to adapt the<br />

request-grant protocol of the low-latency interconnect [8] to deal with<br />

the frequency domain crossing.<br />



TABLE I. EXECUTION TIME OF KERNELS RUNNING ON THE HETEROGENEOUS CLUSTER (NS)<br />

Kernel | Single core | Multi core | IPA | Gain<br />

MatMul | 3.3 M | 435 K | 432 K | 1.0x<br />

Conv. | 9.7 M | 1.5 M | 1.5 M | 1.0x<br />

FFT | 767 K | 142 K | 94 K | 1.5x<br />

FIR | 182 K | 33 K | 33 K | 1.0x<br />

Sep. Filter | 39 M | 6.4 M | 6.3 M | 1.0x<br />

Sobel Filter | 117 M | 40 M | 28 M | 1.4x<br />

GCD | 2.9 M | 2.9 M | 610 K | 4.8x<br />

Cordic | 9 K | 7 K | 3.6 K | 1.9x<br />

Manh. Dist. | 244 K | 164 K | 70 K | 2.3x<br />

III. EXPERIMENTAL RESULTS<br />

In this section, we present the implementation and<br />

benchmarking results of the heterogeneous PULP cluster. The<br />

SoC was synthesized with Synopsys Design Compiler<br />

2013.12-SP3 on a STMicroelectronics 28nm UTBB FD-SOI<br />

technology library, while Synopsys PrimePower 2013.12-SP3<br />

was used for timing and power analysis at the supply voltage of<br />

0.6V, 25°C temperature, in typical process conditions. The<br />

benchmarks are implemented in fully portable C, using the<br />

OpenMP programming model to parallelize the applications on<br />

the PULP cluster. The three operating modes considered in<br />

these comparisons are: (a) single-core: running applications on<br />

a single core, (b) multicore: running applications on 8 parallel<br />

cores (c), IPA: running applications in the IPA.<br />

Table I reports the execution time of several near-sensor<br />

processing kernels running on a single-core, on 8 cores and on<br />

the IPA. Compared to single-core execution,<br />

the accelerator achieves a maximum speed-up of 8x. The<br />

performance gain in the accelerator for the compute intensive<br />

kernels like matrix multiplication, convolution, FIR and<br />

separable filters is limited if compared to the performance of<br />

parallel-cores. However, the relatively small performance gain<br />

compared to the parallel cluster is compensated by the gain in<br />

energy efficiency as shown in Table II. The gain in energy<br />

efficiency is mainly due to (i) the simpler nature of the<br />

compute units of the IPA with respect to full processors, (ii)<br />

the smaller number of power-hungry load/store operations, and<br />

(iii) the fine-grained power management architecture that<br />

allows clock gating of the inactive PEs during execution. On the<br />

other hand, a control-intensive kernel like GCD does not<br />

exhibit significant data-level parallelism, hence parallel<br />

execution over multiple cores does not improve performance of the<br />

homogeneous cluster. In contrast, execution on the IPA<br />

improves the performance by almost 5x and energy efficiency<br />

by almost 4.5x, by exploiting instruction-level parallelism<br />

rather than only the data-level parallelism available to the homogeneous<br />

processor cluster. More precisely, although data-parallel<br />

applications can be effectively parallelized on homogeneous<br />

clusters, the exploitation of the IPA results in a more efficient<br />

utilization of the hardware resources for control-intensive<br />

kernels, which otherwise cause a huge performance bottleneck in several<br />

near-sensor analytics applications.<br />

TABLE II. ENERGY OF KERNELS RUNNING ON THE HETEROGENEOUS CLUSTER (µJ)<br />

Kernel | Single-core | Multi-core | IPA | Gain<br />

MatMul | 1.2 | 0.3 | 0.2 | 1.5x<br />

Convolution | 2.8 | 1.1 | 0.65 | 1.7x<br />

FFT | 0.3 | 0.09 | 0.04 | 2.25x<br />

FIR | 0.08 | 0.03 | 0.025 | 1.2x<br />

Sep. Filter | 16.6 | 4.6 | 4.3 | 1.1x<br />

Sobel Filter | 51.5 | 29.5 | 12.7 | 2.3x<br />

GCD | 1.1 | 1.1 | 0.25 | 4.4x<br />

Cordic | 0.004 | 0.003 | 0.001 | 3x<br />

Manh. Dist. | 0.1 | 0.1 | 0.03 | 3.3x<br />

IV. CONCLUSION<br />

In this paper, we present a novel approach towards<br />

heterogeneous computing, augmenting the PULP multi-core<br />

cluster with an ultra-low-power reconfigurable accelerator. The<br />

experiments integrating the IPA in the PULP platform suggest<br />

that architectural heterogeneity is a powerful approach to<br />

improving the energy profile of computing systems. We have<br />

presented three possible executions of the benchmarks on the<br />

IPA-integrated PULP platform. The heterogeneous cluster<br />

achieves up to 4.8x speed-up and up to 4.4x better<br />

energy efficiency with respect to an 8-core homogeneous<br />

cluster.<br />

REFERENCES<br />

[1] F. Conti, A. Marongiu, and L. Benini. Synthesis-friendly techniques for<br />

tightly-coupled integration of hardware accelerators into shared-memory<br />

multi-core clusters. CODES+ISSS ’13, pages 5:1–5:10, Piscataway, NJ,<br />

USA, 2013. IEEE Press.<br />

[2] M. B. Taylor. Is dark silicon useful? harnessing the four horsemen of the<br />

coming dark silicon apocalypse, In Design Automation Conference<br />

(DAC), 2012 49th ACM/EDAC/IEEE , pages 1131–1136. IEEE, 2012.<br />

[3] D. Rossi, A. Pullini, I. Loi, M. Gautschi, F. K. Grkaynak, A. Teman, J.<br />

Constantin, A. Burg, I. Miro-Panades, E. Beign, F. Clermidy, P.<br />

Flatresse, and L. Benini. Energy-efficient near-threshold parallel<br />

computing: The pulpv2 cluster. IEEE Micro, 37(5):20–31, September<br />

2017.<br />

[4] S. Das, K. J. M. Martin, P. Coussy, D. Rossi, and L. Benini. Efficient<br />

mapping of cdfg onto coarse-grained reconfigurable array architectures.<br />

In 2017 22nd Asia and South Pacific Design Automation Conference<br />

(ASP-DAC), pages 127–132, Jan 2017.<br />

[5] B. De Sutter, P. Raghavan, and A. Lambrechts. Coarse-grained<br />

reconfigurable array architectures. In S. S. Bhattacharyya, E. F.<br />

Deprettere, R. Leupers, and J. Takala, editors, Handbook of Signal<br />

Processing Systems, pages 449–484. Springer US, 2010.<br />

[6] L. Duch, S. Basu, R. Braojos, G. Ansaloni, L. Pozzi, and D. Atienza.<br />

Heal-wear: An ultra-low power heterogeneous system for bio-signal<br />

analysis. IEEE Transactions on Circuits and Systems I: Regular Papers,<br />

2017.<br />

[7] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi, A. Pullini, D. Rossi, E.<br />

Flamand, F. K. Grkaynak, and L. Benini. Near-threshold risc-v core with<br />

dsp extensions for scalable iot endpoint devices. IEEE Transactions on<br />

Very Large Scale Integration (VLSI) Systems , PP(99):1–14, 2017.<br />

[8] A. Rahimi, I. Loi, M. R. Kakoee, and L. Benini. A fully-synthesizable<br />

single-cycle interconnection network for shared-l1 processor clusters. In<br />

2011 Design, Automation & Test in Europe , pages 1–6. IEEE, 2011.<br />



OpenWrt 101: How to Build a Linux Embedded<br />

System in Just 30 Minutes<br />

Cesare Garlati<br />

prpl Foundation<br />

Santa Clara, CA USA<br />

cesare@prplFoundation.org<br />

Luka Perkov<br />

Sartura<br />

Zagreb, Croatia<br />

luka.perkov@sartura.hr<br />

Abstract — OpenWrt is the de-facto standard Linux<br />

distribution for embedded devices. Originally developed for<br />

Internet routing devices, such as home gateways and wireless<br />

routers, OpenWrt is now widely used in many applications<br />

ranging from laptops to mobile devices to the ever increasing<br />

number of IoT devices. The core design philosophy of minimal<br />

footprint, broad platform support and ease of customization has<br />

made OpenWrt the option of choice for developers, end users,<br />

and commercial service providers. This paper offers a practical<br />

introduction to OpenWrt: it is the basis for a 30-minute class<br />

teaching how to set up, compile, and run a complete OpenWrt<br />

system on a commercial router of choice.<br />

Keywords—OpenWrt; embedded; Linux; operating system;<br />

software distribution; open source software; router; home gateway;<br />

Wi-Fi; IoT; Internet of Things.<br />

I. INTRODUCTION<br />

OpenWrt is a highly extensible open source GNU/Linux<br />

distribution for embedded devices. Although primarily targeted<br />

to home gateways, OpenWrt runs on wireless routers, pocket<br />

computers, laptops and many classes of IoT devices. Since its<br />

inception in 2007, the goal of the OpenWrt project has been to<br />

provide free open source tools to build and customize firmware<br />

images for numerous embedded platforms. In creating a novel<br />

embedded distribution for networking applications, the<br />

OpenWrt project followed the three main principles of small<br />

footprint, portability and customizability.<br />

These elements have made OpenWrt a desirable choice for<br />

a vast amount of projects and products varying in size and<br />

applications. A 2017 survey presented at OpenWrt/LEDE<br />

Summit 2017 [1] points out that OpenWrt is used across a<br />

broad variety of commercial Wi-Fi routers including TP-Link’s<br />

Archer C7, TL-WR1043ND, TL-WR841ND, TL-WDR3600,<br />

Ubiquiti Networks NanoStation AC and PicoStation, VPN<br />

appliances, IoT development boards, wireless printers, TOR<br />

servers, file and media sharing appliances, mesh network nodes<br />

and even wind turbines.<br />

Table 1 shows a functional comparison between OpenWrt<br />

and two other popular embedded distributions: Yocto and<br />

Buildroot.<br />

TABLE 1: COMPARISON - OPENWRT, BUILDROOT, YOCTO<br />

Component | OpenWrt | Buildroot | Yocto<br />

menuconfig | Kconfig | Kconfig | Kconfig<br />

C libraries | uClibc, glibc, musl | glibc, uClibc-ng, musl | EGLIBC<br />

File Systems | OverlayFS, tmpfs, SquashFS, JFFS2, UBIFS, ext* | cramfs, JFFS2, romfs, cloop, ISO 9660, cpio, UBI, UBIFS, SquashFS, ext* | Btrfs, cpio*, cramfs, ELF, ext*, ISO, JFFS2, multiubi, SquashFS, UBI, UBIFS<br />

Root Necessary | Yes | No | Yes<br />

Init Systems | procd, BusyBox | systemV, BusyBox, systemd | SysVinit, systemd<br />

Package Manager | opkg | - | smart<br />

A. Small footprint<br />

Internet routing devices share many constraints of typical<br />

embedded devices: small processors, tiny memory and low<br />

power. The OpenWrt system architecture is optimized in size to<br />

generate firmware images that fit the limited memory available<br />

in most commercial routers with little or no overhead.<br />

Traditional Linux distributions require large software<br />

libraries that introduce many additional dependencies, such as<br />

C standard library glibc, D-Bus inter-process communications<br />

facilities and heavyweight network-management applications. By<br />
contrast, OpenWrt relies on simpler – and more robust –<br />



general-purpose components such as the musl C standard library<br />
(which replaced uClibc [2]), the ubus RPC daemon, which is similar to D-<br />

Bus but has a more user-friendly API, and an RPC-capable<br />

daemon netifd to manage complex network interface<br />

configurations. Through these and many other lightweight<br />

components, a heavily-stripped OpenWrt build can run even on<br />

devices with 16 MB or even 8 MB of main memory – in fact the<br />

authors were able to experiment with images as small as 4 MB.<br />

Along with preserving a small footprint, OpenWrt also<br />

implements a single configuration and access point philosophy<br />

through UCI, or Unified Configuration Interface. UCI<br />

centralizes and eases the configuration of crucial system<br />

settings and stores them in a single configuration directory<br />

(/etc/config/). In addition, a large number of third-party<br />

libraries have also been made UCI-compatible to facilitate the<br />

management of their configuration files within OpenWrt.<br />
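For illustration, every UCI file shares the same section/option syntax; a minimal, hypothetical /etc/config/network fragment might look like this (interface name and addresses are example values only):<br />

```
config interface 'lan'
        option ifname 'eth0'
        option proto 'static'
        option ipaddr '192.168.1.1'
        option netmask '255.255.255.0'
```

Settings can then be read and changed uniformly from the command line, e.g. with uci set network.lan.ipaddr='192.168.2.1' followed by uci commit network.<br />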

B. Portability<br />

A key element of OpenWrt is the wide support of many<br />

hardware platforms. CPU architectures supported by OpenWrt<br />

include ARM, MIPS, x86, x86_64 and many SoC platforms<br />

produced by Broadcom, Atheros/Qualcomm, Lantiq/Intel,<br />

Marvell and others.<br />

As of this writing, the official OpenWrt Table of Hardware [3]<br />

(Fig 1.) lists 680 devices, including devices from leading<br />

networking device manufacturers such as Zyxel, Netgear,<br />

Linksys, TP-Link, Ubiquiti Networks, D-Link and others.<br />

Fig 1. OpenWrt - Table of Hardware<br />

C. Customizability<br />

OpenWrt firmware images can be obtained in two ways:<br />

either by downloading pre-built firmware through the OpenWrt<br />

build infrastructure – see https://downloads.openwrt.org - or by<br />

configuring and building the image with the help of the<br />

OpenWrt Build System. Pre-built images are easy to obtain and<br />

install and are the option of choice for standard applications<br />

targeting specific devices. On the other hand, images can be<br />

built manually using OpenWrt’s Build System, a set of<br />

Makefiles and patches for generating a cross-compilation<br />

toolchain and a root file system for embedded systems.<br />

Configuring and building firmware images using OpenWrt<br />

Build System provides users with menuconfig, the OpenWrt<br />
Build System's default configuration interface. This user-friendly<br />

application allows the configuration of a wide array of options<br />

including: chipset architecture, router hardware model, root file<br />

system, application packages, and kernel options. The menu-driven<br />
interface allows quick and painless configuration and<br />

generation of the firmware image according to user<br />

requirements.<br />

Once OpenWrt is successfully booted, its package manager<br />

opkg enables users to install and remove several thousand<br />

packages from the OpenWrt package repository. As opposed to<br />

traditional Linux-based firmware relying on read-only file<br />

systems, opkg gives users the possibility to modify the installed<br />

software without rebuilding and reflashing a completely new<br />

image. By using the large amount of available packages, users<br />

can utilize their OpenWrt routers for a wide variety of<br />

networking applications. To name a few, these include setting<br />

up a VPN, installing a BitTorrent client, performing traffic<br />
shaping and quality of service, creating guest networks,<br />

running server software, and many other popular applications<br />

that are not typically bundled with commercial off-the-shelf<br />

devices.<br />
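As an illustration, a typical opkg session on a running device looks like the following (the package name is just an example):<br />

```
# opkg update                  # refresh the package lists
# opkg install luci-app-sqm    # install a package (example name)
# opkg remove luci-app-sqm     # remove it again
```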

II. BUILDING OPENWRT FIRMWARE<br />

To show how to build an OpenWrt firmware image in just<br />

30 minutes, we are going to use a pre-compiled OpenWrt Build<br />

System environment tailored for the Marvell ESPRESSObin<br />

[4], an ARMADA 88F3700 SoC-powered commercial router<br />

designed primarily for computing, storage and networking<br />

applications. The prebuilt environment is available as a<br />

downloadable Docker container based on Ubuntu 16.04. It<br />

allows instant setup and generation of an installable OpenWrt<br />

firmware for the ESPRESSObin board.<br />

First we need to setup the Docker container on the local<br />

machine. The Docker platform runs natively on Linux x86-64,<br />

Linux/ARM and Windows x86-64. The Docker image is pulled<br />

with:<br />

$ docker pull sartura/build_openwrt_ubuntu_16.04:espressobin<br />

The large size of this Docker image (around 12 GB) is to<br />

accommodate the space needed for downloading the OpenWrt<br />

build system, OpenWrt feeds and source packages, and finally<br />

for building (cross-compiling) OpenWrt and generating the<br />

OpenWrt firmware image.<br />

After the image is pulled, a local folder needs to be<br />

prepared where the build artifacts will be copied to:<br />

$ mkdir ~/espressobin<br />

$ chmod 777 ~/espressobin<br />

The downloaded Docker image is run with:<br />

$ docker run -it -v ~/espressobin:/opt/espressobin --name espressobin sartura/build_openwrt_ubuntu_16.04:espressobin<br />

Inside the Docker container, Marvell-specific repositories<br />

required for building OpenWrt firmware are openwrt-dd/ and<br />



openwrt-kernel/, both located under the /home/build directory.<br />
Already-built OpenWrt firmware images for the<br />
ESPRESSObin board are located in the /home/build/openwrt-dd/bin/mvebu64/<br />
directory:<br />

$ cd /home/build/openwrt-dd/<br />

$ ls -1 bin/mvebu64/<br />

armada-3720-community.dtb<br />

openwrt-armada-ESPRESSObin-Image<br />

openwrt-armada-ESPRESSObin-Image-initramfs<br />

openwrt-armada-ESPRESSObin-Image.gz<br />

openwrt-mvebu64-armada-espressobin-rootfs.tar.gz<br />

openwrt-mvebu64-vmlinux.elf<br />

packages<br />

sha256sums<br />

Modifying and rebuilding images simply requires running<br />

standard make commands. Within Docker, remember to issue<br />

all OpenWrt Build System commands as a non-root user and to<br />
issue them in the directory where the OpenWrt sources have<br />
been cloned, in this case /home/build/openwrt-dd. First,<br />

invoke the menuconfig interface with:<br />

$ make menuconfig<br />

This menu-driven interface (Fig 2.) is OpenWrt’s main<br />

configuration interface.<br />

Fig 2. OpenWrt - make menuconfig<br />

Here it is necessary to configure the target system, the<br />

target profile and target images, and lastly to set the path of the<br />

openwrt-kernel directory as an external kernel tree:<br />

Target System ---><br />
    Marvell 64b Boards<br />
Target Profile ---><br />
    ESPRESSObin (Marvell Armada 3700 Community Board)<br />
Target Images ---><br />
    [x] ramdisk ---><br />
        * Root filesystem archives *<br />
        [x] tar.gz<br />
        * Root filesystem images *<br />
        [x] ext4 ---><br />
[x] Advanced configuration options (for developers) ---><br />
    (/home/build/openwrt-kernel) Use external kernel tree<br />

OpenWrt also features a highly modular Web User<br />

Interface called LuCI, which can be enabled by selecting the<br />

luci package:<br />

LuCI ---><br />

1. Collections ---><br />

luci<br />

Once these options are set, save your configuration and exit<br />

the interface. Now issue the rebuild with make:<br />

$ make<br />

Multiple cores can be utilized to speed up the build process:<br />

$ make -j$(($(nproc)+1))<br />

Again, the build artifacts are stored in<br />

/home/build/openwrt-dd/bin/mvebu64/, so copy the needed<br />

contents of this directory (device tree file, OpenWrt image and<br />

root file system) to the local directory and exit the container:<br />

$ cp bin/mvebu64/*armada*<br />

/opt/espressobin/<br />

$ exit<br />

The ESPRESSObin board uses a micro SD card as its<br />

main storage and booting environment. The last step of booting<br />

OpenWrt on ESPRESSObin consists of preparing the micro<br />

SD card, transferring the build files and setting the U-Boot<br />

parameters for the board to boot from micro SD card. After<br />

inserting the microSD card (listed here as /dev/sdX), first clear<br />

everything from it:<br />

$ sudo dd if=/dev/zero of=/dev/sdX bs=1M<br />

count=100<br />

Then create a new partition (sdX1 in our example):<br />

$ (echo n; echo p; echo 1; echo ''; echo<br />

''; echo w) | sudo fdisk /dev/sdX<br />

Followed by formatting this partition as ext4 with:<br />

$ sudo mkfs.ext4 /dev/sdX1<br />

Mount the micro SD card on your Linux machine (e.g.<br />

to /mnt) and change into that directory:<br />

$ sudo mount /dev/sdX1 /mnt<br />

$ cd /mnt<br />



Once here, transfer the necessary ESPRESSObin build files<br />

from the ~/espressobin/ directory. First, extract the root file<br />

system:<br />

$ sudo tar -xzf ~/espressobin/openwrt-mvebu64-armada-espressobin-rootfs.tar.gz -C .<br />

Then create a boot directory where the device tree file and<br />

OpenWrt image will be copied to:<br />

$ sudo mkdir -p boot/<br />

$ sudo cp ~/espressobin/armada-3720-community.dtb boot/<br />
$ sudo cp ~/espressobin/openwrt-armada-ESPRESSObin-Image boot/<br />

Exit the mounted directory and unmount the micro SD card<br />
from the local machine. Plug the micro SD card into the SD card<br />
slot on the ESPRESSObin and connect to the board via a micro<br />
USB cable. Using serial-connection software of choice (e.g.<br />
C-Kermit, Minicom), access the console on the ESPRESSObin.<br />

Once the boot starts, hit any key to stop autoboot and to<br />

access the Marvell U-Boot prompt:<br />

Hit any key to stop autoboot:<br />

Marvell>><br />

Now set the necessary U-Boot parameters for the name and<br />

location of the device tree file and OpenWrt image:<br />

Marvell>> setenv fdt_name 'boot/armada-3720-community.dtb'<br />
Marvell>> setenv image_name 'boot/openwrt-armada-ESPRESSObin-Image'<br />

Finally, set the bootmmc variable which will be used to<br />

boot from the micro SD card, save the defined environment<br />

parameters and boot using run bootmmc:<br />

Marvell>> setenv bootmmc 'mmc dev 0; ext4load mmc 0:1 $kernel_addr $image_name; ext4load mmc 0:1 $fdt_addr $fdt_name; setenv bootargs $console root=/dev/mmcblk0p1 rw rootwait; booti $kernel_addr - $fdt_addr'<br />

Marvell>> save<br />

Marvell>> run bootmmc<br />

OpenWrt should now successfully boot on the<br />

ESPRESSObin. Once connected to the ESPRESSObin, access<br />

LuCI through the browser by typing the IP address of the board<br />

(set to 192.168.1.1 by default) in the URL bar.<br />

III. CONCLUSION<br />

Throughout this brief paper – and the accompanying 30-minute<br />
class – we have shown how to configure, build, install and run<br />

a typical OpenWrt system. The OpenWrt project is supported<br />

by a vibrant global community of open source developers,<br />

industry leaders and non-profit organizations [7]. We invite the<br />

reader to explore the many opportunities to be involved with<br />

OpenWrt to help shape the technology that powers the<br />

embedded devices for the Internet of Things (IoT) and the<br />

smart society of the future.<br />

REFERENCES<br />

[1] OpenWrt/LEDE 2017 Summit Survey: https://openwrtsummit.files.wordpress.com/2017/11/summit-survey-2017.pdf<br />
[2] Transitioning From uClibc to musl for Embedded Development: https://elinux.org/images/e/eb/Transitioning_From_uclibc_to_musl_for_Embedded_Development.pdf<br />
[3] OpenWrt Table of Hardware: https://wiki.openwrt.org/toh/start<br />
[4] ESPRESSObin website: http://espressobin.net/<br />
[5] OpenWrt Wiki: https://wiki.openwrt.org/doc/techref/architecture<br />
[6] OpenWrt website: https://openwrt.org/<br />
[7] prpl Foundation: https://prplfoundation.org/prplwrt/<br />



Live Hacking: Hardware-enforced Virtualization of a<br />

Linux Home Gateway<br />

Michael Hohmuth, Adam Lackorzynski<br />

Kernkonzept GmbH<br />

Dresden, Germany<br />

michael.hohmuth@kernkonzept.com<br />

Cesare Garlati<br />

prpl Foundation<br />

Santa Clara, CA, USA<br />

cesare@prplFoundation.com<br />

Abstract — Trust and security are central to embedded computing<br />

as network devices - such as home gateways - have become<br />

the first line of defense for the IoT devices connected to the smart<br />

home. In this paper, we present a virtualization-based approach<br />

to securing home gateway while preserving functionality and<br />

performance.<br />

Keywords—home gateway; router; virtualization; security; live<br />
hacking; Linux; hypervisor; microkernel; IoT; Internet<br />

I. INTRODUCTION<br />

Trust and security have never been more important to the<br />

embedded computing world, especially when it comes to network<br />

devices, such as home gateways, that are the first line of<br />

defense for the IoT devices connected to the smart home [4]. In<br />

2017, a plethora of incidents confirmed that these devices<br />

are fundamentally broken from a security perspective.<br />

At embedded world 2017, we hosted a successful live<br />

demonstration showing attendees how the prpl Foundation’s<br />

new approach to embedded computing security works in an<br />

industrial Internet scenario – that is, secure remote control of a<br />

robotic arm. We are back this year with a new demonstration<br />

designed to show the application of the new capabilities of the<br />

prplSecurity Framework 2.0 – as implemented in the<br />
open-source L4Re hypervisor – to a different real-world scenario: a<br />

typical Linux-based Internet router, deployed as a home gateway,<br />

that connects home computers, smartphones, IoT devices<br />

and other smart devices to the Internet.<br />

Linux is the dominant operating system for Internet<br />

routing devices. Optimized Linux distributions, like OpenWrt,<br />

add to the vanilla kernel a configuration system, additional<br />

applications including IP-telephony, network-printing services,<br />

VPN, media streaming and a browser-based administrative UI.<br />

Although optimized for minimal system footprint, many components<br />

of the resulting software stack are complex and inevitably<br />

enlarge the attack surface. The Linux kernel alone is<br />

composed of millions of lines of code. And a large part of the<br />

code runs in privileged CPU mode or with elevated OS rights.<br />

This is a major security concern especially because many<br />

home-gateway vendors have shown marginal attention to securing<br />

devices in the field. Availability of security updates is<br />

sporadic and the patching process is not fully automated, as it<br />

typically requires end-user intervention. As a result, home<br />

routers present a large attack surface and many exploitable<br />

vulnerabilities. This puts the security of personal data, smart-home<br />
applications and IoT devices at risk. Given the sheer<br />

number of connected devices, it also represents a great risk for<br />

the Internet infrastructure itself, as shown by recent DDoS<br />
attacks such as those launched by the Mirai botnet.<br />

This unsatisfying state of home-router security has led the<br />

telecom industry to look for solutions that guarantee availability,<br />

security, and remote patching of home routers independently<br />

of the Linux operating system itself. One such approach is<br />

to use a software partition that can restart or even update the<br />

main OS from a clean state, and that is isolated from the main<br />

OS kernel and software. This isolation can be implemented in<br />

hardware using a separate CPU to run the update/restart process,<br />

or more cost-effectively in software using an array of<br />

hardware/software virtualization technologies.<br />

This paper shows the application of a light-weight type-1<br />

hypervisor to isolate the router software, including the Linux<br />

kernel and the user-land applications, into a virtual machine<br />

(VM). The secure update/restart process runs in a separate VM<br />

completely isolated from the rest of the system. Our work is<br />

based on the open-source L4Re hypervisor. This hypervisor<br />

leverages the hardware virtualization support of modern CPUs<br />

to provide isolation and efficiency and, most importantly, the<br />

ability to run unmodified Linux Software.<br />

The live demonstration starts by downloading the necessary<br />

code from various open source repositories. We then configure,<br />

build and install a new hardened firmware image to create<br />

multi-domain security via hardware virtualization. We then<br />

launch in real time several network attacks to exploit known<br />

vulnerabilities of the Linux instance. This shows how the<br />

breach is contained to the target VM, while the system critical<br />

components remain unaffected. This session is aimed at security<br />

architects, penetration testers and anyone who wants to see<br />



how a real-world attack is conducted and how hardware virtualization<br />

can effectively mitigate the overall impact on the<br />

system.<br />

This paper is organized as follows. In Section II, we discuss<br />

virtualization as a security mechanism and introduce virtualization<br />

concepts such as full virtualization,<br />

paravirtualization, and containerization. Section III introduces<br />

the open source L4Re hypervisor. Section IV explains the<br />

home-router setup referenced throughout this paper. Section V<br />

outlines the live-hacking scenario we present during the live<br />

demonstration, before we conclude the paper in Section VI.<br />

II. VIRTUALIZATION AND SECURITY<br />

In general, virtualization is the concept of abstracting away<br />

from the specifics of an underlying hardware mechanism. For<br />

example, most OSes offer virtual memory as an abstraction of<br />

physical memory, providing programs with the illusion of an<br />

abstract computer with a separate, isolated memory space. In<br />

this paper, we use virtualization in a narrower, more specific<br />

sense: as a mechanism for providing virtual machines<br />

(VMs) that provide user software with the illusion of running<br />

on a separate physical computer.<br />

Virtualization can be provided at various levels. The Linux<br />
kernel already comes with several virtualization mechanisms, including<br />

control groups and containers, which provide the illusion<br />

of several isolated Linux instances although all instances share<br />

the same Linux kernel, and the Kernel-based Virtual Machine (KVM),<br />
which provides VMs that look like physical computers and that<br />

run unmodified OS kernels (such as another Linux kernel) as<br />

unprivileged VM guests. These mechanisms all share the property<br />

that all VMs must trust the hosting Linux kernel and<br />

Linux-based OS, which can be undesirable from a security<br />

perspective.<br />

The alternative solution is to deprivilege all Linux instances<br />

by running them on top of a small hypervisor such as the L4Re<br />

hypervisor. Depending on which critical services these Linux<br />

guests provide, it is possible to remove the Linux OS from the<br />

critical trusted path, or trusted computing base (TCB), of a<br />

security-sensitive application. If the hypervisor is small, the<br />

critical application’s TCB can be several orders of magnitude<br />

smaller than the Linux kernel alone.<br />

Full virtualization (allowing unmodified OS kernels to run<br />

in VMs) can greatly benefit from virtualization assists provided<br />

in hardware by the platform’s CPU. Fortunately, modern server,<br />

desktop, and embedded CPUs all provide hardware-assisted<br />

virtualization features (i.e., nested paging, interrupt-controller<br />

virtualization, and I/O-MMUs). Where these hardware features<br />

are not available, either hypervisors need to resort to costly<br />

emulation of VM features that allow unmodified guest OSes to<br />

run; or, guest OSes need to be modified to be able to run as<br />

VM guests on top of the hypervisor. In the latter case, the guest<br />

OS kernel is said to be paravirtualized; of course, this is feasible<br />

only for OSes for which source code is available. There is<br />

also a hybrid approach in which the OS kernel proper runs<br />

unmodified, using hardware-assisted virtualization, but certain<br />

device drivers use hypervisor APIs directly (instead of emulated<br />

device interfaces) for performance reasons. The VirtIO set<br />

of APIs is a well-known example for such a paravirtualized<br />

device API.<br />

Hardware-assisted virtualization and paravirtualization are<br />

conceptually similar when implemented in the same software<br />

layer. Minor differences in complexity and attack surface mostly<br />

stem from additional emulation layers needed to provide the<br />

physical-machine illusion for full, hardware-assisted virtualization.<br />

III. THE L4RE HYPERVISOR AND OPERATING SYSTEM<br />

The L4Re system is a light-weight, microkernel-based, real-time<br />

operating system with support for hardware-assisted<br />

virtualization and paravirtualization [1,2]. The system components<br />

include:<br />

- The L4Re Hypervisor<br />
- The L4Re Runtime Environment, a POSIX-like programming environment for implementing native, trusted L4Re microapps<br />
- A VMM component for hardware-assisted virtualization of unmodified guest OSes (e.g., Linux and FreeRTOS)<br />
- L4Linux, a paravirtualized Linux kernel<br />

The L4Re system supports many platforms including x86,<br />

ARM and MIPS architectures in 32-bit and 64-bit mode.<br />

Hardware-assisted virtualization and device-memory virtualization<br />

(IOMMUs) are also supported if available. Additionally,<br />

experimental support is available for PowerPC and Sparc.<br />

L4Re is easily portable: developing a board-support package<br />

(BSP) for a new platform usually takes a few developer-days.<br />

L4Re is a mature OS that has been in development since<br />

1997. Originally developed at TU Dresden, it has recently<br />

seen broad commercial uptake and support. The L4Re software<br />

is licensed under GNU GPLv2 and it is available for<br />

download at https://www.kernkonzept.com. A dual-licensing<br />

schema is available for commercial applications if desired.<br />

The L4Re system aims at minimizing each application's or<br />

VM's TCB. The hypervisor is a classic L4 microkernel as it<br />

implements only those OS mechanisms that are required to<br />

securely implement isolation (i.e. address spaces/VMs,<br />

threads/virtual CPUs, scheduling/clocks, and inter-process<br />

communication) and leaves implementing all other typical<br />

operating-system services (such as resource or file management)<br />

to user-level programs.<br />

One such user-level component is L4Re's Virtual Machine<br />

Monitor (VMM), which is used to emulate the virtual platform<br />

that is made available to (hardware-assisted) virtual machines.<br />

Components that do not need virtualization do not have to<br />

depend on the VMM which then yields a smaller TCB. In fact,<br />

each VM can have its own (custom) VMM, further reducing<br />
the trust relationships among different, mutually untrusting<br />
VMs.<br />

For virtualization-friendly guest OS kernels such as Linux,<br />

the L4Re system provides a special, tiny VMM called uvmm.<br />

This VMM does little more than providing a boot API for guest<br />

OSes, providing virtual interrupts and CPUs, connecting the<br />



VM to VirtIO-based virtual devices (such as a virtual network<br />

interface), and passing through physical devices the VM is<br />

allowed to access.<br />

Apart from the VMM, the L4Re system provides components<br />

for memory and CPU management, for setting up VM<br />

and application resources such as physical memory and communications<br />

relationships, for bus virtualization and platform<br />

and device management, and for securely multiplexing a GUI.<br />

The loader component can be scripted in the Lua language and<br />

allows static or dynamic device (pass-through), memory, and<br />

communication-relation assignments.<br />

For more information on the L4Re system, please refer to<br />

our EW2016 paper [3].<br />

IV. VIRTUALIZATION OF A LINUX-BASED ROUTER OS<br />

Our demonstration and evaluation vehicle for running a<br />

router OS in a virtual machine has been implemented on the<br />

NXP FRDM platform and uses two hardware-assisted VMs.<br />

NXP’s QorIQ FRDM-LS1021A board uses an LS1021A<br />

SoC that provides two ARM Cortex-A7 cores. These<br />

CPUs provide ARM’s virtualization technology, which allows<br />

using hardware-assisted full virtualization on this board. To<br />

provide Wi-Fi routing capability, we have attached a Wi-Fi<br />

dongle to the board’s USB interface.<br />

Using uvmm, we run the following two VMs on top of the<br />

L4Re hypervisor:<br />

Router OS—This VM runs a copy of OpenWrt with an unmodified<br />

Linux kernel. This VM drives the Wi-Fi device,<br />

which is passed through into this VM, and exposes its configuration<br />

interface over the Wi-Fi interface. As its Internet uplink,<br />

Router OS has a virtual network connection to Telco OS:<br />

Telco OS—This VM runs a simple, FreeRTOS-based system<br />

with two main functions: It has pass-through access to one<br />

Ethernet interface that serves as the uplink port and passes all<br />

traffic on to the Router OS via the virtual-network interface—<br />

except that it accepts commands on a “telco” service it implements<br />

itself. This service is intended for use by the Internet<br />

provider (or telco operator) to trigger reboots of the Router OS<br />

from its boot image, which is invisible to, and unmodifiable by,<br />

the Router OS, and therefore always represents a clean state<br />

from which the first VM can restart. Reboots of the Router-OS<br />

VM do not require a platform reset or reboot. (In the future,<br />

this service may also update the Router OS’s boot image.)<br />

This architecture provides only a minimal attack surface for<br />

the Telco OS because it does not inspect data intended for the<br />

Router OS, and does not implement any application or configuration<br />

services.<br />

This architecture has the property that any compromises of<br />

the Router OS, initiated either externally (from the Internet) or<br />

internally (by a rogue or cracked IoT device) can be undone<br />

from within Telco OS, without having to trust Router OS at all.<br />
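The paper does not publish the telco service's wire protocol; as a purely hypothetical sketch of how such a command interface could be structured, the names handle_command and restart_router_vm below are invented for illustration:<br />

```python
# Hypothetical sketch of the Telco OS command service described above.
# The real protocol and APIs are not published; all names are invented.

def restart_router_vm() -> str:
    """Stub: would ask the hypervisor/VMM to reset the Router-OS VM
    and reload it from the read-only, tamper-proof boot image."""
    return "router-vm: rebooted from clean image"

def handle_command(cmd: str) -> str:
    """Dispatch a single command received on the telco service port."""
    cmd = cmd.strip().upper()
    if cmd == "REBOOT":
        return restart_router_vm()
    if cmd == "PING":
        return "pong"
    # Anything else is rejected, keeping the attack surface minimal.
    return "error: unknown command"

print(handle_command("reboot"))
```

The key design point mirrored here is that the service parses only a handful of fixed commands and never inspects Router-OS traffic, which keeps its attack surface small.<br />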

V. LIVE DEMONSTRATION<br />

In our live demo session, we will run an exploit against a<br />

known bug in OpenWrt.<br />

At first, we will demonstrate how an instance of regular<br />

OpenWrt running natively (without a hypervisor) will become<br />

unresponsive once the attack is performed.<br />

Then, we’ll run OpenWrt in a virtual machine as described<br />

in the preceding section. We will show that attacks on<br />

OpenWrt are still possible, but can be mitigated by the telco by<br />

remotely rebooting OpenWrt from a clean state, and possibly<br />

even updating OpenWrt from Telco OS.<br />

VI. CONCLUSION<br />

The security benefits of virtualization are no longer confined<br />

to big iron datacenter applications. Virtualization can<br />

effectively be implemented in resource-constrained embedded<br />

systems such as home routers. It makes it possible to separate complex<br />

operating system software, such as the Linux-based OpenWrt,<br />

from the trusted computing base in critical applications.<br />

In preparation for the hands-on workshop, please download<br />

the software from https://l4re.org/download.html. The authors<br />

will provide additional download links and instructions for the<br />

demo during the class.<br />

REFERENCES<br />

[1] The L4Re System, https://l4re.org/<br />

[2] A. Lackorzynski and A. Warg. Taming subsystems: Capabilities as Universal Resource Access Control in L4. In Proceedings of the Second Workshop on Isolation and Integration in Embedded Systems (IIES ’09), Eurosys affiliated workshop, pages 25–30. ACM, March 2009. ISBN 978-1-60558-464-5.<br />
[3] M. Röder, M. Hohmuth and A. Lackorzynski. Tux Airborne: Encapsulating Linux — real-time, safety and security with a trusted microhypervisor. In Proceedings of the Embedded World Conference 2016.<br />
[4] Security Guidance for Critical Areas of Embedded Computing – prpl Foundation, January 2016, https://prpl.works/security-guidance/<br />



Achieving Ultra Low Power in Embedded Systems<br />

Understand where your power goes and what you can do to make things better<br />

Herman Roebbers<br />

Embedded Systems<br />

Altran Netherlands B.V.<br />

Eindhoven, The Netherlands<br />

Herman.Roebbers@altran.com<br />

Abstract— Over the course of the last years the need to reduce<br />
energy consumption has grown. This article focuses on the<br />

possibilities for reduction of energy consumption in embedded<br />

systems. We argue that energy consumption is a system issue and<br />

therefore a matter of making compromises. Energy consumption<br />

can be reduced by software, but only so far as hardware allows.<br />

There are many things that can be done to reduce energy<br />

consumption. The goal is to define an approach for achieving less<br />

energy consumption. Also criteria for the selection of an<br />

appropriate MCU are presented. Conclusion: Many (unexpected)<br />

things can have a big impact on your achievable battery lifetime.<br />

Look beyond just the CPU/processor and software in order to<br />

achieve better results.<br />

Keywords— Ultra Low Power; approach; embedded; system<br />

issue; reducing energy consumption<br />

I. INTRODUCTION<br />

In recent years the need to reduce energy consumption has been<br />
growing. On the one hand this is instigated by governments<br />

(e.g. EnergyStar), on the other hand by the need to do more with<br />

the same or less energy (think mobile telephone battery lifetime,<br />

Internet-of-Things node battery lifetime). In this article we will<br />

focus on the backgrounds of energy consumption in embedded<br />

systems and how to reduce this consumption (or its effect). This<br />

article covers a part of a two-day Ultra Low Power workshop<br />

about this subject which is available via the High Tech Institute<br />

(http://www.hightechinstitute.nl), T2prof and Altran.<br />

The fact that energy consumption is an important issue is<br />

illustrated by the fact that chip manufacturers heavily promote<br />
their energy-efficient chips. There are even benchmarks<br />

for the energy efficiency of embedded processors: the EEMBC<br />

ULPMark TM (http://www.eembc.org/ulpbench) CP (Core<br />

Profile) and PP (Peripheral Profile), IoTMark-BLE and the<br />

soon-to-be-released SecureMark.<br />

Energy consumption is an important point in all sorts of systems. It gets more and more important in the IoT world, where the biggest consumer is usually the radio. All sorts of solutions are tried to keep the radio active for as short a time as possible. This has led to non-standard protocols that use much less energy than standard protocols.

It is important to realize that energy consumption is a system issue, and a matter of weighing one thing against another and making compromises. Energy consumption can be reduced by software, but only as far as the hardware allows. It is also a multidisciplinary effort, because both the software and the hardware disciplines must be involved in the design in order to achieve the desired goal.

For this article we limit ourselves to smaller embedded systems like sensor nodes. These systems are typically asleep for a large proportion of the time. Depending on what functionality is required during sleep and how fast the system must wake up, the system can sleep lighter or deeper.

There are many measures that can reduce energy consumption. The goal is to define an approach that should lead to less energy consumption. That approach is detailed in this article as well as in the workshop.

II. CATEGORIES OF MECHANISMS FOR ENERGY REDUCTION

The mechanisms for energy reduction fall into three main categories. TABLE 1 lists commonly used mechanisms per category. This list is not exhaustive. Different vendors may use different names for the same mechanism.

A. Software only (includes compiler)

The energy reduction mechanism is implemented solely in the software domain.

B. Software and hardware combined

Hardware and software together implement an energy reduction mechanism.

C. Hardware only

The energy reduction mechanism is implemented at the hardware level.

Each of the hardware mechanisms mentioned in the table below may or may not be available in your system. If the hardware does not support a mechanism, then software cannot use it.



Overview of power management mechanisms: power management works at all of these levels.

TABLE 1. POWER MANAGEMENT MECHANISMS

Level            | Mechanism                                                  | Category (Domain)
Application      | Event driven architecture; use low power modes;            | A (Software)
                 | select radio protocol; ...                                 |
Operating System | Power API; Operating Performance Points API;               | A (Software)
                 | tickless operation                                         |
Driver           | Use DMA; use HW event mechanisms; suspend / resume API     | B (Software & Hardware)
Board            | Power gating via I/O pin; controlling voltage regulator    | B (Software & Hardware)
                 | via I/O pin; controlling device shutdown pins via I/O pin  |
Chip             | Dynamic Voltage and Frequency Scaling; clock frequency     | B (Software & Hardware)
                 | management; power gating; offer low energy modes           |
IP block / chip  | (Automatic) clock gating; clock frequency management;      | C (Hardware)
                 | Dynamic Power Switching; Adaptive Voltage Scaling; Static  |
                 | Leakage Management; Power Gating State Retention           |
IP block / RTL   | Automatic power / clock gating                             | C (Hardware)
Transistor       | Body bias; FinFET; TriGate FET; sub-threshold operation    | C (Hardware)
Substrate        | SOI, FD-SOI                                                | C (Hardware)

III. SIMPLE THINGS TO DO

A. Look at the OS configuration (if there is an OS)

Operating systems use a periodic scheduler invocation ('tick') to check whether the currently executing process is still allowed to use the processor or whether it should be descheduled in favor of some other process. This periodic invocation can take quite some time, and it also happens when no processes are ready for execution. In that case a so-called idle task is executed, which usually consists of a simple while (1) {}; loop, just burning energy.

Some operating systems (e.g. Linux and FreeRTOS) offer what is known as a tickless configuration to make the CPU sleep until either a timer expires or an interrupt occurs. The standard scheduler tick timer (default 100 Hz for Linux versions prior to 3.10) is then no longer necessary. In versions before 3.10 the #define CONFIG_NO_HZ configures this behavior; in later versions it is the #define CONFIG_NO_HZ_IDLE. For FreeRTOS to be used in this way, the #define configUSE_TICKLESS_IDLE must be set. When applicable, this is a very simple way to (possibly substantially) reduce power.
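For FreeRTOS, enabling tickless mode is a build-time setting. A minimal FreeRTOSConfig.h fragment might look as follows (configEXPECTED_IDLE_TIME_BEFORE_SLEEP is optional and shown here with its default value of 2):

```c
/* FreeRTOSConfig.h (fragment): enable the built-in tickless idle mode.
   The port then calls vPortSuppressTicksAndSleep() from the idle task
   instead of letting the tick interrupt fire while nothing is ready. */
#define configUSE_TICKLESS_IDLE                1

/* Only enter tickless sleep when the expected idle time is at least
   this many ticks (optional; 2 is the FreeRTOS default). */
#define configEXPECTED_IDLE_TIME_BEFORE_SLEEP  2
```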

B. Look at the architecture of the application

If we look at the architecture of the application software we can distinguish two major types: super loop or event driven. The super loop goes around one big loop all of the time, often never sleeping. In order to reduce energy consumption we would like the system to sleep as long as possible between successive passes through the loop. It depends on the application whether sleeping is allowed at all and what the maximum sleeping time can be. It may, however, be quite possible to do some sleeping at the end of the loop without causing any problem, and in doing so save substantial energy.
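As a sketch of this idea, the loop below sleeps whenever no event is pending. The names get_event() and enter_sleep() are placeholders invented for illustration; on real hardware enter_sleep() would map to something like a WFI instruction, but here both are simulated so the sketch is self-contained:

```c
#include <stdbool.h>

/* Simulated environment: three events are pending, then the queue is empty. */
static int sleep_count = 0;
static int events_left = 3;

static bool get_event(void) {
    if (events_left > 0) { events_left--; return true; }
    return false;
}

static void enter_sleep(void) {
    sleep_count++;            /* real code: __WFI() or a low power mode entry */
}

/* Super loop that sleeps at the end of every pass with nothing to do. */
int run_super_loop(int iterations) {
    int handled = 0;
    for (int i = 0; i < iterations; i++) {
        if (get_event())
            handled++;        /* do the work for this event */
        else
            enter_sleep();    /* nothing pending: sleep instead of spinning */
    }
    return handled;
}
```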

IV. APPROACH FOR OBTAINING ULTRA LOW POWER

We will now describe our approach toward achieving ultra-low power in a step-by-step fashion. Basically the strategy is: use the facilities the hardware offers. We can do this in steps, roughly in the order these features were offered over time.

A. In the beginning

In the beginning there was only one bus master in the system: the CPU. It could read data from instruction memory and read from and write data to data memory and peripherals. In order to check for an event the CPU had to resort to polling:

    while (!event_occurred())
    {};

This piece of code keeps the CPU busy, as well as the code memory and the bus. Both the CPU and the code memory (usually flash) are big contributors to the total energy consumption, especially when code memory isn't cached.

B. Phase 2: Introducing Direct Memory Access (DMA)

At some point in time a second bus master is introduced: the DMA unit. After being programmed by the CPU, it is capable of accessing memory and peripherals autonomously. It can also generate an interrupt to the CPU to signal completion of its task, e.g. copying of peripheral data to memory or vice versa. The DMA unit can operate in parallel with the CPU, but they cannot access the bus simultaneously. While the DMA is copying data, the CPU can check a variable in memory for DMA completion. Pseudocode of the Interrupt Service Routine (ISR):

    void ISR_DMA_done(void)
    {
        ... /* clear interrupt */
        ready = true;
    }



The main program:

    volatile bool ready = false;

    setup_peripherals_and_DMA();
    start_DMA();
    while ( ! ready )
    {
        __delay_cycles(CHECK_INTERVAL);
    }

Here we check another variable, but not continuously. The __delay_cycles() function executes NOP instructions during CHECK_INTERVAL. This keeps the data bus free so that the DMA unit isn't hindered by the CPU's data accesses, and so it may complete its assignment quicker. The CPU is still fetching code from instruction memory, though.

C. Stop the CPU clock when possible

A relatively recent addition to the CPU's capabilities is stopping the CPU clock until an interrupt occurs, saving power by doing so. This can be in the form of a WAIT_FOR_INTERRUPT instruction, which removes the clock from the CPU core until an interrupt occurs. ARM CPU cores offer the WFI instruction for this purpose; others, such as the MSP430, set a special bit in the processor status register to achieve the same effect. This does not affect our interrupt service routine. Our main program code changes thus:

    volatile bool ready = false;

    setup_peripherals_and_DMA();
    start_DMA();
    while ( ! ready )
    {
        __WFI(); /* special insn, CPU sleeps */
    }

In the new situation the CPU is stopped by disabling its clock until the interrupt occurs. This saves energy in several ways: the CPU is not active, instruction memory is not read, and both the data bus and the instruction bus are completely available for the DMA unit to use. Most new processors know this trick.

D. Events

Later CPUs have the notion of events, which can also be used to wake the CPU from sleep. This mechanism is quite similar to using the interrupt, except that no ISR gets invoked. This saves some overhead if the ISR didn't have to do anything other than wake the CPU. Using this mechanism requires that the CPU have a wait-for-event instruction. ARM Cortex processors have the WFE instruction; others, such as the MSP430, don't have it.

E. Passing events around: Event router

When this event mechanism is coupled with peripherals that can produce and consume events via some programmable event connection matrix ('event router'), a very powerful system emerges. In the case of the Silabs EFM32 series the mechanism is referred to as the Peripheral Reflex System; Nordic has another name for it. The MSP430 has something a bit simpler than the other two.

This mechanism allows quite complex interactions between peripherals to take place without CPU involvement. This allows the CPU to go into a deeper sleep mode and save more energy. As an example we can configure a system to do the following without any CPU interaction: on a rising edge on a given I/O pin an ADC conversion is started. The conversion-done event triggers the DMA to read the conversion result and store it into memory, incrementing the memory address after each store. After 100 conversions the DMA transfer is done, generating an event to the CPU to start a new acquisition series and to process the buffered data.
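To make the control flow of this chain concrete, it can be modeled in plain C. Everything below is simulated (the function and variable names are ours, and adc_convert() just returns a fixed value); on real silicon the chain is set up once in the event-router and DMA registers and then runs with the CPU asleep:

```c
#include <stdint.h>
#include <stdbool.h>

#define N_SAMPLES 100

static uint16_t buf[N_SAMPLES];       /* DMA destination buffer            */
static int      dma_index = 0;        /* DMA write position                */
static bool     cpu_event = false;    /* set when the CPU should wake up   */

static uint16_t adc_convert(void) { return 42; }  /* stand-in for real ADC */

/* Models the hardware chain: pin edge -> ADC -> DMA store -> CPU event. */
static void on_pin_rising_edge(void) {
    uint16_t sample = adc_convert();  /* the edge event starts a conversion  */
    buf[dma_index++] = sample;        /* 'conversion done' triggers the DMA  */
    if (dma_index == N_SAMPLES) {     /* transfer complete after 100 samples */
        cpu_event = true;             /* event wakes the CPU to process data */
        dma_index = 0;
    }
}

/* Drive n edges through the chain; returns whether the CPU was woken. */
bool simulate_edges(int n) {
    for (int i = 0; i < n; i++)
        on_pin_rising_edge();
    return cpu_event;
}
```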

F. Controlling power modes

The latest ULP processors have a special hardware block that manages the system's energy modes and the transitions between them, combined with managing clocks and power gating peripherals in certain energy modes: the Energy Control Unit in the EFM32, or the Power Management Module in the MSP430, for instance. These blocks can save a lot of time otherwise required to program many registers when going to or coming out of sleep. They can also manage retaining peripheral register content at a retention voltage (lower than the operational voltage), such that the peripheral can immediately resume operation when power is restored. This hardware mechanism is called State Retention Power Gating.

The main program is now:

    setup_hw_for_event_generation();
    configure_sleep();  /* This is the extra */
    start_DMA();
    __WFE();            /* CPU sleeps, low power mode */

Using a deeper sleep can make a difference of more than a factor of a thousand!

We have just seen what stepwise refinements we can implement to reduce energy consumption. Each step can be implemented as a logical successor to the previous one.

V. WHAT TO LOOK FOR WHEN SELECTING AN MCU

There are a number of parameters that one can look at and compare to select the best MCU for the application at hand. Here is one set of parameters:

1) What is the active current (µA/MHz), and at what voltage?
2) What is the performance of the CPU (CoreMark/MHz)?
3) What is the sleep current in each of the low power modes intended to be used?
4) What is the wake-up time from each of these low power modes?
5) What is the power consumption of each of the peripherals used?
6) Which peripherals are available in which low power modes?
7) Can peripherals operate autonomously (e.g. be controlled by a DMA engine)?
8) Is there a hardware event mechanism to orchestrate hardware-based event production and consumption?
9) Do the available low power modes fit well with the application?
10) Are the peripherals designed for ultra low power operation (e.g. Low Energy UART, Low Power Timer)?
11) Can sensors be operated with low energy consumption (e.g. Low Energy sensor interfaces)?
12) Are there "on-demand oscillators"?

The answers to these questions serve as a guide to an informed selection of the MCU type to use for the best performance in the given application. They can be used as input for a power model of the application and, together with a battery model, can help predict the battery/charge lifetime for the application.
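As a sketch of such a power model, the helpers below average the current over a duty cycle and divide the battery capacity by the result. The numbers in the usage note are purely illustrative (a node drawing 5 mA for 2 ms every second, 2 µA asleep, on a 220 mAh cell):

```c
/* First-order battery-life estimate from a duty-cycled current profile.
   Self-discharge, regulator efficiency and temperature effects are
   deliberately ignored in this sketch. */
double avg_current_uA(double active_uA, double t_active_s,
                      double sleep_uA, double period_s) {
    double t_sleep_s = period_s - t_active_s;  /* time asleep per period */
    return (active_uA * t_active_s + sleep_uA * t_sleep_s) / period_s;
}

double battery_life_h(double capacity_uAh, double avg_uA) {
    return capacity_uAh / avg_uA;              /* hours until the cell is empty */
}
```

With the illustrative numbers, avg_current_uA(5000, 0.002, 2, 1) comes out near 12 µA, i.e. roughly two years on a 220 mAh cell; note how the sleep current, not the active burst, dominates the result.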

VI. WHAT ELSE CAN ONE DO?

There are still many more factors that can all play a role in the overall energy consumption. These are factors not obvious to many people, such as:

- Regulator efficiency
- Switching sensors off when not in use: prepare your hardware to be able to do so
- Clocks: how to set them for the lowest energy consumption
- Voltages: lower is better, the fewer the better
- Compiler: can make a 50 % difference
- Compiler settings: can make a 50 % difference
- Where to locate critical code / data
- How to measure the consumption
- I/O pin settings
- Battery properties in relation to the energy consumption profile
- Possibilities to make use of energy harvesting to prolong battery lifetime

During the workshop many of these issues and others will be addressed and illustrated through hands-on sessions.

VII. CONCLUSIONS

Ultra-low power is a system thing. Hardware alone or software alone cannot achieve the lowest consumption. We have shown a stepwise approach to reducing energy consumption. In order to realize the maximum energy reduction one has to understand the details of the hardware and write the software to use the available features. Energy savings can be found in unexpected places. It is possible to reduce consumption by more than a factor of a thousand in certain scenarios.

ACKNOWLEDGMENT

The author wishes to thank Altran for the opportunity to investigate this subject matter, and his colleagues for helpful feedback during the development of the workshop and for reviewing related publications [1].

REFERENCES

[1] H. Roebbers, "Hoe spaar je energie in een embedded systeem?" ("How do you save energy in an embedded system?"), Bits & Chips 08, pp. 34-39, October 2015.



Top Misunderstandings about Functional Safety

Christian Dirmeier, Claudio Gregorio
Rail Automation
TÜV SÜD Rail
Munich, Germany
christian.dirmeier@tuev-sued.de

Abstract—TÜV SÜD has more than 20 years of experience in testing and certifying Functional Safety related systems and components. The presentation summarizes the issues most often experienced during this time due to misunderstandings of key concepts in Functional Safety.

Keywords—functional safety, certification, safety systems

I. MOTIVATION

TÜV SÜD has more than 20 years of experience in testing and certifying Functional Safety related systems and components. During this time our employees have observed recurring issues arising from misunderstandings of some concepts of functional safety. The sections below highlight some of the most interesting errors. The presentation addresses people approaching functional safety for the first time, as well as experienced safety engineers, safety managers and project leaders who are familiar with Functional Safety topics and would like to see some unconventional aspects of functional safety.

II. PFH/PFD: NECESSARY BUT NOT SUFFICIENT FOR A SIL

Often manufacturers just calculate a PFD/PFH value for their system or subsystem and afterwards claim a SIL for it. The PFH/PFD express the probability of a dangerous failure of a safety related system (or subsystem) per hour (PFH) or on demand (PFD). Both values address random hardware faults and are usually calculated with the use of FMEDAs. Fulfilling a specific Safety Integrity Level (SIL) requires not only the control of random hardware failures but also the avoidance and control of systematic failures in hardware and software. The latter is expressed as Systematic Capability (SC, values from 1 to 4, corresponding to the four SIL values) and reflects the methods and techniques used during development of the safety related system. Therefore a SIL always consists of both: a PFD/PFH and a determination of the robustness of the development process, i.e. the SC.
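For the random-hardware half of this picture, a widely used simplified formula for a 1oo1 (single-channel) architecture in low-demand mode is PFDavg ≈ λ_DU · T_proof / 2. The sketch below applies it with illustrative numbers; as this section stresses, landing in a SIL band this way is necessary but not sufficient, since the Systematic Capability must be demonstrated separately:

```c
/* PFDavg for a 1oo1 architecture (simplified IEC 61508 formula):
   lambda_du_per_h = dangerous undetected failure rate per hour,
   t_proof_h       = proof-test interval in hours. */
double pfd_avg_1oo1(double lambda_du_per_h, double t_proof_h) {
    return lambda_du_per_h * t_proof_h / 2.0;
}

/* IEC 61508 low-demand SIL bands: SIL n covers 10^-(n+1) <= PFDavg < 10^-n.
   Returns 0 above the SIL 1 band; values below 1e-5 are reported as 4 here. */
int sil_band_low_demand(double pfd) {
    if (pfd >= 1e-1) return 0;
    if (pfd >= 1e-2) return 1;
    if (pfd >= 1e-3) return 2;
    if (pfd >= 1e-4) return 3;
    return 4;
}
```

For example, λ_DU = 1e-7/h with a yearly proof test (8760 h) gives PFDavg ≈ 4.4e-4, which falls in the SIL 3 band; without SC 3 the subsystem still cannot claim SIL 3.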

III. SIL DOES NOT MEAN RELIABILITY OF THE CONTROL SYSTEM

Sometimes system integrators and plant manufacturers require their suppliers to deliver control systems for normal operation with a SIL, assuming that this will ensure a certain reliability of the control system and/or utility. The aim of a safety function (which is performed by a safety related system) is to put an Equipment Under Control (EUC) into a safe state.

The safe state of an EUC is a result of the hazard and risk analysis and depends on its different operational modes. In this context we frequently observe misunderstandings about strategies and concepts regarding fail-safe scenarios (e.g. shutting down the EUC in case of a failure) and fail-operational scenarios (e.g. keeping the EUC in operation as much as possible).

Random hardware failures (and reliability) are calculated based on failure rates. In terms of Functional Safety, failure rates are split into safe and dangerous failures. Only dangerous failures (which prevent the safety function from performing as intended) are considered in the calculation of the PFD/PFH values. A SIL therefore is only a degree of reliability that the safety function will perform as intended when it is required to put the EUC into a safe state.

IV. WATCHDOGS AND MICROCONTROLLERS

Microcontroller (µC) watchdogs often just reset the controller but do not control any outputs in a direct, independent way. In case of a fault inside the µC, deterministic behavior of the outputs is required. It is usually not possible to prove that a defective microcontroller can trigger a watchdog in a correct way.

V. PROVEN IN USE SOFTWARE

We still observe approaches claiming systematic capability for existing SW based on operational experience (Route 2S in IEC 61508). Since IEC TS 61508-3-1, all individual failures of the SW need to be detected and reported during the observation period, and all combinations of input data, sequences of execution and timing relations must also be documented. This approach is usually not possible from a practical point of view.



Verification of Memory Interferences in Automotive Software: A Practical Approach

Ludovic Pintard, Abdelillah Ymlahi-Ouazzani
VALEO
GEEDS Safety Department
Créteil, France
[name].[surname]@valeo.com

Abstract— Freedom From Interferences (FFI) is a main concern in software safety. According to the ISO26262 standard on automotive functional safety, FFI means that a fault in a software component shall not propagate into a more critical one. Many projects involve mixed-criticality architectures, i.e., they run applications with different Automotive Safety Integrity Levels (ASIL) on the same microcontroller.

While architectural solutions are known to ensure FFI – using software partitioning, hardware memory protection, or other safety mechanisms – verification and debugging can be difficult. Indeed, the implementation of a Memory Protection Unit (MPU) often reveals the weaknesses of the design, and it is time consuming to understand all the exceptions late in the process.

This paper discusses how the design of a complex application can be verified with regard to memory interferences, and illustrates it with a case study on an Advanced Driving Assistance System. The focus is on the process to verify FFI at different steps of the development cycle and how to improve it using tooling.

The results obtained on the project have demonstrated that memory interferences can be efficiently detected in the early phase of architectural design.

Keywords— Functional Safety; Freedom From Interferences; Mixed Criticality; Memory Interferences; ISO26262

INTRODUCTION

With the introduction of new Advanced Driver-Assistance Systems (ADAS) and autonomous vehicles, the complexity of automotive systems has increased. In automotive software development, safety is a critical objective, and the introduction of the ISO26262 standard for functional safety of road vehicles has helped the industry to adopt a common state of the art and improve practices.

One of the important properties for ensuring safety at the software architectural level is Freedom From Interference (FFI). This property is important in automotive systems as it enables developing software with mixed ASILs on one microcontroller, instead of having a monolithic ASIL solution. Hence, the effort is put on the development of the modules with a direct impact on safety requirements, and not on all the modules.

However, even if solutions are now known to ensure FFI – such as safety analyses, safety mechanisms (software partitioning, Memory Protection Unit (MPU), OS timing protection, watchdog, etc.) and testing with fault injection – these solutions lead to big regression loops as errors are found during testing.

The contribution of the paper is to propose an efficient way to implement and verify memory interferences on a project, with the help of the TASKING Safety Checker tool [5]. This approach is exemplified with the results of an evaluation done by Valeo.

Section I introduces fundamental notions of ISO26262 and FFI. Section II describes the current state of the art of the software process in the automotive industry. Section III points out drawbacks identified in the process described in Section II, and then describes a more robust process. Section IV gives the results of our experimentation with the TASKING Safety Checker tool on several internal projects. Finally, Section V is a summary of the strengths and limitations of the approach.

I. BACKGROUND

A. ISO26262

The ISO26262 standard for functional safety in road vehicles, introduced in 2011, has helped the industry by presenting state-of-the-art methods in development as opposed to current practices. ISO26262 pushes recommendations toward methods and techniques to ensure that "no unreasonable risk is due to hazards caused by malfunctioning behavior of electrical and electronic systems". The standard has 10 parts, but we focus on Part 6: "product development at the software level". The standard follows the well-known V model for engineering.

B. Freedom From Interferences

With the increase in processing capabilities of the microcontrollers used for automotive systems, software designs now integrate applications with different criticality levels, i.e., different Automotive Safety Integrity Levels (ASIL).



According to ISO26262, one method consists in developing all the modules at the highest ASIL. Hence, lower-ASIL applications would also have to be developed with a higher effort, as higher-ASIL modules require applying more techniques and methods.

The alternative method is to integrate modules with different ASILs, but this requires that the Freedom From Interferences (FFI) is ensured. FFI is defined as the "absence of cascading failures between two or more elements that could lead to the violation of a safety requirement". If a lower-ASIL component fails, the failure should not propagate to a higher-ASIL component and make it fail. As an example, in a software design, a Quality Management (QM) module can be integrated with ASIL-C software modules only if it can be proven that it does not interfere with the ASIL-C modules through any source like:

- Memory interferences: these correspond to corruption of content, or read/write access to memory allocated to another element.
- Exchange of information interferences: these are errors between a sender and a receiver caused by repetition of information, loss of information, delay of information, insertion of information, blocking of a communication channel, etc.
- Timing and execution interferences: these occur during runtime if a safety-relevant software element is blocked by another software element, or the system is in a deadlock, livelock, etc.

II. CURRENT SOFTWARE PROCESS AND ARCHITECTURE TO ENSURE SAFETY

In this paper, we take as a hypothesis that we are developing a system with safety-critical requirements at the software level and mixed-ASIL software modules. In this section, we describe the different activities performed in the software development process.

A. Preliminary Software Architecture

1) Software Requirements

To design the preliminary software architecture, the process starts with the allocation of system requirements to the software and hardware levels. Both functional and safety requirements are used to define a first allocation and definition of the components needed to implement the functionalities. Today, most automotive projects follow the AUTOSAR architecture depicted hereafter.

2) AUTOSAR: AUTomotive Open System ARchitecture

AUTOSAR [2] is a standard for automotive E/E software architecture developed by major OEMs and suppliers. It is a major enhancement to software development in the automotive industry. AUTOSAR brings the realization of an application-specific approach to automotive software development, as opposed to an ECU-specific one. The AUTOSAR architecture mainly encompasses an application layer (comprising Software Components (SWC)), a Run-Time Environment (RTE) and the Basic Software (BSW).

B. Software Safety Analyses

Based on the Technical Safety Concept (TSC) produced at system level and on the preliminary software architecture, it is possible to start software safety analyses, such as:

- Software Fault Tree Analysis (SwFTA)
- Software Failure Mode and Effect Analysis (SwFMEA)
- Software Critical Path Analysis (SwCPA)

These analyses aim at determining how faults propagate through the software architecture, in order to allocate an ASIL to each software component in the critical path. With this allocation of critical modules, the safety mechanisms to ensure FFI can be defined. The state-of-the-art measures to ensure this property are to implement:

- End-to-End protections, in order to protect sender/receiver communications, with Cyclic Redundancy Checks (CRC), timeouts, counters, etc.
- Timing protections from the OS: task and interrupt execution budget protection, etc. Timing protection for FFI can also be performed with the AUTOSAR Watchdog Manager (WdgM) module. The WdgM module [4] is a key SW module in AUTOSAR-based architectures to ensure the application works safely, detecting violations of timing and logical constraints. The WdgM is part of the System Services layer and is responsible for error detection, isolation and recovery. It provides three supervision mechanisms: alive supervision, deadline supervision, and control flow supervision.
- Software partitioning and use of a Memory Protection Unit on the microcontroller, in order to mitigate memory interferences.
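As an illustration of the end-to-end idea, the sketch below guards a 2-byte payload with a rolling counter and a CRC-8 over the SAE J1850 polynomial 0x1D. This is our own minimal example, not the actual AUTOSAR E2E profile, whose frame layouts and checks are more elaborate:

```c
#include <stdint.h>

/* Illustrative E2E-style frame: a counter detects repetition/loss,
   the CRC detects corruption of the counter or payload. */
typedef struct {
    uint8_t counter;
    uint8_t crc;
    uint8_t data[2];
} e2e_frame;

/* CRC-8 with the SAE J1850 polynomial 0x1D, init value 0xFF. */
static uint8_t crc8(const uint8_t *p, int n) {
    uint8_t crc = 0xFF;
    for (int i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (uint8_t)((crc & 0x80) ? (crc << 1) ^ 0x1D : crc << 1);
    }
    return crc;
}

/* Sender side: stamp the frame with a checksum over counter + payload. */
void e2e_protect(e2e_frame *f) {
    uint8_t buf[3] = { f->counter, f->data[0], f->data[1] };
    f->crc = crc8(buf, 3);
}

/* Receiver side: the checksum must match and the counter must be the
   one expected next, otherwise the frame is corrupted or repeated. */
int e2e_check(const e2e_frame *f, uint8_t expected_counter) {
    uint8_t buf[3] = { f->counter, f->data[0], f->data[1] };
    return f->crc == crc8(buf, 3) && f->counter == expected_counter;
}
```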

C. Software Partitioning and Memory Protection Unit

In order to implement the partitioning (see Fig. 1), the following AUTOSAR concepts and modules are used.

OS-Application: The AUTOSAR OS offers the possibility to group different OS objects (tasks, ISRs, alarms, schedule tables, counters, etc.) into so-called OS-Applications. All objects within one OS-Application share the same memory protection scheme and access rights. According to the AUTOSAR OS specifications [3], OS-Applications can either be trusted or non-trusted. Trusted OS-Applications are allowed to run in CPU Supervisor Mode without restrictions, while non-trusted ones run in CPU User Mode with limited access to OS and HW resources.



[Figure: AUTOSAR software stacks of a HeadLights Switch ECU, showing ASIL B modules (Light Switch and HeadLights runnables, I/O and CAN communication stacks, E2E, OS) separated from QM modules (Dashboard) by partitioning, with a shared memory area between the partitions.]

Fig. 1. Software Partitioning

MMU/MPU: The basic memory protection requirement that shall be fulfilled by the OS is to protect the data, code and stack sections of each OS-Application. In the AUTOSAR OS standard, this protection is activated during the execution of non-trusted OS-Applications in order to prevent corruption of the trusted OS-Applications' memory sections. Moreover, it can also be used to protect private data and stack within the same OS-Application if necessary.<br />
Memory protection requires hardware support in the microcontroller in the form of a Memory Protection Unit (MPU) and/or a Memory Management Unit (MMU).<br />
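The principle can be sketched in C (the types and names below are illustrative, not the AUTOSAR OS API): each OS-Application is granted a set of memory regions with associated rights, and any access outside them would raise an MPU exception.<br />

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Access rights encoded as bit flags, as on a typical MPU. */
typedef enum { MPU_R = 1u, MPU_W = 2u, MPU_X = 4u } MpuRights;

/* One protection region: [start, end) with a rights mask. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
    uint8_t   rights;
} MpuRegion;

/* Returns true if the access is covered by at least one region
 * granted to the currently running OS-Application. */
bool mpu_access_allowed(const MpuRegion *regions, size_t n,
                        uintptr_t addr, uint8_t requested)
{
    for (size_t i = 0; i < n; ++i) {
        if (addr >= regions[i].start && addr < regions[i].end &&
            (regions[i].rights & requested) == requested) {
            return true;
        }
    }
    return false; /* no region matches: the MPU would raise an exception */
}
```

On a real microcontroller these regions live in the MPU descriptor registers and the check is done in hardware; the sketch only illustrates the semantics.<br />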

AUTOSAR Inter OS-Application Communicator (IOC): The communication between two OS-Applications must also be protected. Since OS-Applications create memory protection boundaries, dedicated communication mechanisms are needed to cross them. This feature is implemented in the AUTOSAR OS and called the IOC. It is the dedicated communication means between OS-Applications, whether or not they are allocated to the same core (the communication can take place between two OS-Applications on the same core, or between OS-Applications allocated to different cores in multicore architectures). Its main function is to ensure the integrity of the messages transmitted via a buffer. These messages can be data structures or notifications (activation of a task, a callback, etc.).<br />
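As an illustration, a minimal last-is-best channel of this kind can be modeled as follows (the names and buffer layout are ours; the real IOC API is generated from the AUTOSAR configuration):<br />

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* One last-is-best channel: a buffer owned by the trusted side
 * that both OS-Applications reach only through these two calls. */
typedef struct {
    uint8_t data[8];
    uint8_t len;
    bool    updated;
} IocChannel;

bool Ioc_Send(IocChannel *ch, const void *msg, uint8_t len)
{
    if (len > sizeof ch->data) return false; /* length check preserves integrity */
    memcpy(ch->data, msg, len);
    ch->len = len;
    ch->updated = true;  /* a real IOC could also activate a task or callback */
    return true;
}

bool Ioc_Receive(IocChannel *ch, void *msg, uint8_t len)
{
    if (!ch->updated || len < ch->len) return false;
    memcpy(msg, ch->data, ch->len);
    ch->updated = false;
    return true;
}
```

Because only these two functions touch the buffer, the lower-ASIL partition never obtains a raw pointer into the higher-ASIL partition's memory.<br />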

D. Verification and Validation<br />

The final part of the process is the verification and<br />

validation testing.<br />

It is important to note that this whole process is incremental: if errors are detected in one of the activities, the previous ones may be impacted. Indeed, testing may reveal that a software module should be redesigned, or reallocated and designed with a higher ASIL.<br />

Fig. 2. MPU Functional Description<br />

III. IMPROVEMENT OF SOFTWARE PROCESS<br />

In the previous section, we described the automotive software development process for the implementation of memory protection. In this section, we explain the limitations and difficulties linked to this process and propose solutions.<br />

A. Identification of Gaps in the Development of Memory Protection<br />

This process is implemented on most automotive projects deployed today. However, it often requires several iterations to obtain a final, stable version of the software.<br />



Indeed, the development of such systems leads to the following observations:<br />
• The preliminary software architecture is a starting input for software safety, but it often needs modification; therefore, software safety analysis is performed directly on the code to save time.<br />

In many projects, it is a challenge to keep the software safety design completely in line with the actual code implementation. As the code best reflects the requirements, the software safety engineer performs the analysis directly on the code. This makes the analysis difficult to perform systematically, because the code alone is more complex. Hence, the main problem is to find a systematic way to perform the analysis.<br />

• Software safety analyses are performed by hand.<br />

Today, no tools are widely adopted by safety practitioners. Even if internal tools have been developed based on company know-how and chosen solutions, there is no known alternative to performing the software safety analysis by hand.<br />

• Verifying that the code ensures FFI before integration tests is very difficult.<br />

Once the safety analysis has been performed and a first version of the code is available, it is not easy to verify by hand that the code fulfills the FFI requirements. Indeed, the call graph of every lower-ASIL function should be inspected to verify that it cannot corrupt the data of a higher-ASIL module.<br />
Hence, we rely either on the expertise of the analyst each time the analysis is performed, or on late integration tests that raise MPU exceptions due to architectural faults.<br />

• MPU integration leads to a lot of debugging and may require new architectural design.<br />

Anybody who has integrated an MPU in a project knows that it will expose remaining architectural problems and can take a long time to debug. It may also require significant modifications to the software architecture.<br />

• MPU integration needs high testing coverage.<br />

In particular, for the verification of memory interferences, the objective is to perform complete fault injection tests to verify that the MPU protects the data of the higher-ASIL partition from corruption by lower-ASIL modules. Indeed, in most implementations the MPU description registers are configured dynamically by the OS on context switches (whenever a new task or interrupt starts, the MPU configuration is changed).<br />

To improve testing coverage, the objective is to verify in each context (execution of each task or interrupt) that the correct access rights are activated. In particular, for the verification of FFI, a fault injection test makes a task in a lower-ASIL partition attempt to write into the higher-ASIL data sections and checks that the write is blocked (see Fig. 3).<br />

Fig. 3. Fault Injection tests to verify FFI<br />
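The principle of such a fault injection step can be sketched as follows (a host-side simulation with illustrative names; on the target, the write would be trapped by a real MPU exception instead of the software check):<br />

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated MPU state for the running context: the QM task's
 * writable window. A real test would program the MPU and expect
 * a hardware exception instead of setting a flag. */
typedef struct { uintptr_t w_start, w_end; bool violation; } Context;

/* Attempted write: allowed inside the window, trapped outside. */
void try_write(Context *ctx, uintptr_t addr)
{
    if (addr < ctx->w_start || addr >= ctx->w_end)
        ctx->violation = true; /* the MPU exception handler would run here */
}

/* Fault injection step: from the QM context, write into the
 * ASIL section and check that the protection actually fired. */
bool inject_write_to_asil(Context *qm_ctx, uintptr_t asil_addr)
{
    qm_ctx->violation = false;
    try_write(qm_ctx, asil_addr);
    return qm_ctx->violation;  /* the test passes only if trapped */
}
```

Running one such injection per task and interrupt context gives the coverage of the dynamic MPU reconfiguration described above.<br />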

B. Importance of Tooling<br />

In automotive software, another challenge comes from the number of parties involved in the development of one Electronic Control Unit (ECU). Indeed, the development requires interactions between OEMs (Renault, PSA, Daimler, GM, BMW, etc.), Tier 1 suppliers (Bosch, Delphi, Continental, Valeo, etc.), hardware suppliers that provide the microcontroller and the Microcontroller Abstraction Layer (MCAL) (NXP, Renesas, Infineon, etc.), and the suppliers of the real-time operating system (ETAS, Vector, Elektrobit, etc.). As different suppliers develop parts of the final code, it is a challenge to integrate everything and ensure FFI.<br />

Also, new hardware technologies are based on complex heterogeneous multicore architectures, which makes it even more complicated to implement and verify that the correct access rights are granted for the execution of the software (different shared memory regions, cores shared between different applications). All this may introduce even more memory interferences.<br />

One of the major breakthroughs in software development was the introduction of automatic static code analyzers to improve code robustness and quality. These tools check rules from standards such as MISRA C [6]; Polyspace, for example, also verifies dynamic properties such as stack usage.<br />

Hence, the development of such tools for software safety purposes is needed.<br />

C. TASKING Safety Checker<br />

The TASKING Safety Checker [5] is an automated analysis tool that statically analyzes the components' source code for FFI verification and for access violations that would trigger MPU exceptions. Indeed, it detects wrong accesses as if the code were executed with a well-configured MPU.<br />

The Safety Checker framework is similar to a compiler environment, which eases the integration of the tool into any project. As input, the analyst provides, on the one hand, all the sources of the project (C files and header files) and, on the other hand, a file describing the allocation of all the files to different classes. The definition of the classes is up to the analyst, who can create the ones needed; as a starting point for the analysis, they reflect the ASILs of the modules of the project: QM, ASIL_A, ASIL_B, ASIL_C, ASIL_D.<br />

The definition of such classes then makes it possible to define the access rights between two classes, as on an MPU when you restrict access to a region. Hence, you have to define the allowed Read, Write, and Execute accesses from one class to another.<br />
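Conceptually, this configuration is an access-rights matrix between classes, as in the following sketch (illustrative C; the Safety Checker's actual configuration syntax differs):<br />

```c
#include <stdbool.h>

typedef enum { CLS_QM, CLS_ASIL_A, CLS_ASIL_B, CLS_COUNT } SafetyClass;
typedef enum { ACC_R = 1, ACC_W = 2, ACC_X = 4 } Access;

/* rights[from][to]: what code in class `from` may do with objects
 * allocated to class `to`. Writes up the ASIL ladder are forbidden;
 * reads down the ladder are allowed but must be justified. */
static const unsigned char rights[CLS_COUNT][CLS_COUNT] = {
    /* to:          QM                   ASIL_A               ASIL_B            */
    /* QM     */ { ACC_R | ACC_W | ACC_X, ACC_X,               ACC_X },
    /* ASIL_A */ { ACC_R | ACC_X,         ACC_R | ACC_W | ACC_X, ACC_X },
    /* ASIL_B */ { ACC_R | ACC_X,         ACC_R | ACC_X,       ACC_R | ACC_W | ACC_X },
};

bool access_allowed(SafetyClass from, SafetyClass to, Access a)
{
    return (rights[from][to] & (unsigned)a) == (unsigned)a;
}
```

Every access found in the call graph is then checked against this matrix, and each entry that is denied corresponds to a reported violation.<br />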

The last configuration step of the tool is the allocation of functions, global variables, and local variables to the classes. This is done by assigning C files, or parts of them, to a class, much like a linker script does in a build process. The same allocation can also be performed on static memory regions by giving a start and end address and assigning the region to a class.<br />

The complete process of the tool is shown in Fig. 5.<br />

The Safety Checker can then be run and produces the following outputs:<br />
• The call graph of all functions, with the variables or addresses accessed.<br />
• The list of all read/write/execute violations found by the tool, based on the configuration.<br />

D. Process for Robust Implementation<br />

Consequently, to improve our software development process for memory protection integration, the following activities are recommended (see Fig. 4).<br />

In the early phases of development of a trial project, once the safety analyses had been performed and a first implementation of the code was available, we decided to perform a first-round analysis with the Safety Checker in order to verify the architecture that had been designed. The tool first checks the ASIL allocation of the software modules and then verifies the interferences. In addition, it can help adjust the MPU configuration for certain sections of the code by helping to decide whether the Read/Write/Execute rights of the current implementation are suitable with regard to the safety requirements.<br />

Moreover, we decided to use the tool on every new software release of this project to verify that the code modifications did not introduce any new interference. It should be highlighted that, from a quality perspective, it is important to verify each version of the software so as to cover the maximum number of systematic faults.<br />

[Fig. 4 depicts the proposed flow: Preliminary Software Architecture → Software Safety Analyses → Verification of ASIL allocation → Software Partitioning & MPU Implementation → Verification of Implementation on each release → Verification & Validation → MPU testing with Fault Injection. The legend distinguishes standard software development activities, new activities with the Safety Checker, and new fault injection activities.]<br />
Fig. 4. Proposed Development Process for Memory Protection<br />

The main objective is to start this activity as soon as possible in order to detect errors early. The analysis can then be rerun more quickly on new versions, and the new results can be analyzed more easily by focusing on new findings.<br />

In spite of all these features, using the tool does not replace the integration of an MPU to catch interferences at run time. When an MPU is implemented, the protection of the memory sections must still be tested via fault injection. These tests should also make it possible to evaluate the error detection time, the error reaction time, and the reaction mechanisms.<br />

IV. EXPERIENCE RESULTS ON VALEO PROJECTS<br />

We started integrating this new process nine months ago on several projects.<br />

A. Characteristics of Targeted Projects<br />

We decided to target mainly complex ADAS projects, such as automatic parking assistance and front camera systems. These projects are mostly based on AUTOSAR; we also evaluated our approach on a non-AUTOSAR project. The highest ASIL of these projects at software level is ASIL B. One of the projects has mixed ASIL requirements, with three parts (ASIL A, ASIL B and QM), while the other projects are a mix of ASIL B and QM software modules.<br />

Fig. 5. TASKING Safety Checker Process<br />



The different projects where this process has been applied are composed of around 100 software modules (operating system, basic software, and application layer) comprising around 300 to 600 C files.<br />

The chosen projects also target several microcontrollers representative of the automotive domain: PowerPC microcontrollers from STMicroelectronics or the RH850 from Renesas. There are also different build environments, with Green Hills or Wind River toolchains.<br />

B. Tool Experimentation<br />

The first evaluation criterion for the Safety Checker, in the described process, is the configuration time. After a first integration on a mockup project to understand the tool, the different projects have shown that about one week is needed to configure the tool on a new project.<br />

This step has two challenges:<br />
• The allocation of all variables and functions to the correct class is done manually; thus, the allocation of hundreds of C files has to be done by hand.<br />
• The tool does not analyze assembly code, so this part of the code must be removed in order to run the tool (this is discussed in the following section).<br />

Once the project has been configured, a new version of the software (with minor modifications, if the analysis is done regularly) can be assessed; about three days are needed to rerun the analysis on the project.<br />

When the tool is configured, it is easy to obtain different results by testing different allocations of the software modules, in order to verify the preliminary allocation as well as variants of it. Even for such complex software, the tool runs in less than one hour and reports all read, write, and execute safety violations.<br />

On the projects, to verify FFI, we focused on the detection of write accesses from a lower-ASIL module to a higher-ASIL module, and also on read accesses from a higher-ASIL module to a lower-ASIL module. Read accesses are not prohibited, but the objective is to be able to justify each of them by means of checks (CRC, range check, plausibility check…).<br />
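For example, a read of a QM vehicle-speed signal used in an ASIL B context could be justified by a check of this kind (the values, limits, and names are illustrative):<br />

```c
#include <stdbool.h>
#include <stdint.h>

/* Vehicle speed read from a QM module, used in an ASIL B context.
 * The raw value is accepted only after range and rate-of-change
 * (plausibility) checks; otherwise the last valid value is kept. */
#define SPEED_MAX_KMH      300u
#define SPEED_MAX_STEP_KMH  20u  /* max plausible change per cycle */

uint16_t validate_speed(uint16_t raw, uint16_t last_valid, bool *ok)
{
    bool in_range = raw <= SPEED_MAX_KMH;
    uint16_t step = raw > last_valid ? raw - last_valid : last_valid - raw;
    *ok = in_range && step <= SPEED_MAX_STEP_KMH;
    return *ok ? raw : last_valid; /* fall back to the last valid value */
}
```

Such a check, traced to a software safety requirement, is what justifies the read access reported by the tool.<br />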

The next section describes and analyzes the results obtained with the tool on these projects.<br />

C. Results<br />

a) Tool Findings<br />

In the different projects, the Safety Checker found several weaknesses in the preliminary architectures; this was taken into account to modify the architecture and the implementation wherever write accesses from a lower ASIL to a higher ASIL were detected. In most of the projects, the tool found a few dozen such violations.<br />

In addition, the tool checks the code for read accesses from a higher ASIL to a lower one. These accesses can then be traced against the software safety requirements to see whether they were designed, followed by a manual check in the code to verify that the plausibility checks or safety mechanisms have been implemented. This is important feedback for the software safety analyst, who obtains an automatic evaluation of the current implementation of safety on the project.<br />

Example of a finding:<br />
Error: ["xxx.c" 3845] safety violation writing "pos" (ASIL_B) from "XXX_GetGlobalPosition32" (ASIL_A)<br />
Info 1: ["yyy.c" 951] the address of "pos" is passed from function "CalcTravelDist" as parameter #1 to function "ZZZ_GetGlobalVehiclePosition"<br />
Info 2: ["zzz.c" 9430] parameter #1 containing the address of "pos" is passed from function "ZZZ_GetGlobalVehiclePosition" as parameter #1 to function "XXX_GetGlobalPosition32"<br />

In this write access violation, the error is that the ASIL B data "pos" is accessed, via pointers, from the ASIL A function "XXX_GetGlobalPosition32". In this case, the correction is to read the position value directly from the ASIL B context instead of providing it via ASIL A, and to add a plausibility check on the global position value in the ASIL B context.<br />

Next, all generated findings must be analyzed to assess the severity of each access and its real impact. The Safety Checker only reports access violations, without severity assessment, so the results must be analyzed: each error must be justified, and/or a ticket must be created in the chosen bug-tracking tool if the architecture must be modified.<br />

After the modifications have been incorporated in the final version, good practice is to compare the old and new files in order to verify that the errors linked to tickets have been fixed and no longer exist, and to check for regressions, i.e., new access violations. The justified errors that remain can be kept if their justification is still valid.<br />

b) Safety Management<br />

One of the interesting results of the process is that it eases communication between the software safety analyst and the software development team. It is a systematic way to perform part of the safety analysis, and it is more exhaustive and quicker than analyzing safety by reading the code for each new version. Moreover, the tool provides the call graphs and the lines of code causing each error. Besides being helpful for the software safety engineer, it also helps software engineers get more involved in reviewing the code and understanding FFI requirements.<br />

c) Misuse and Limitations<br />

However, this process may be misused and has limitations:<br />
• Misuse: Using such a tool does not remove the need to implement software partitioning and an MPU.<br />



• Limitation 1: The tool does not analyze assembly code. It focuses on C code, which provides enough abstraction, but since the instructions depend on the microcontroller architecture, the tool does not go into these details. Hence, assembly code should be peer-reviewed to make sure it does not lead to memory interferences.<br />

• Limitation 2: The tool is not a replacement for the software safety analyst, as it only tackles memory interferences. A safety engineer is still needed to allocate ASILs to components based on the results of a safety analysis (SwFMEA, SwCPA, SwFTA). Moreover, the tool does not help with other interferences: communication, execution, or timing. It does not replace the complete process, but it eases it.<br />

V. CONCLUSION<br />

This paper first describes the state-of-the-art methods used for the development of mixed-criticality automotive systems, for which, according to ISO 26262, FFI must be ensured. This process relies on software safety analysis performed by hand, which is error prone and causes inconsistencies depending on the analyst. The integration of memory access protection, i.e., an MPU, is also very difficult, as it exposes all the wrong accesses in the code in the final phase of integration, when it is most difficult to modify the software architecture.<br />

The paper then defines improvements to this development process: on the one hand, the use of the TASKING Safety Checker tool as a systematic and automatic way to assess FFI on software code; on the other hand, the use of fault injection techniques to achieve higher testing coverage of the MPU.<br />

Finally, the paper discusses the results obtained on different projects inside Valeo where this new process has been used. These results are promising: with minimal effort, the tool assesses FFI in the early phases of development, thereby reducing the MPU debugging time caused by software architectural errors and systematic faults. The introduction of fault injection for the MPU also makes the implementation of such mechanisms more robust.<br />

ACKNOWLEDGMENT<br />

The authors would like to thank their colleagues from<br />

Valeo for their insightful comments.<br />

REFERENCES<br />

[1] ISO 26262 – Road Vehicles – Functional Safety, 10 November 2011. http://www.iso.org/iso/home/news index/news archive/news.htm?refid=Ref1499 [Online; accessed 17-Jan-2018].<br />
[2] AUTOSAR Development Cooperation, http://www.autosar.org [Online; accessed 17-Jan-2018].<br />
[3] AUTOSAR Specification of Operating System, Release 4.3.1, https://www.autosar.org/fileadmin/user_upload/standards/classic/4-3/AUTOSAR_SWS_OS.pdf [Online; accessed 17-Jan-2018].<br />
[4] AUTOSAR Specification of Watchdog Manager, Release 4.3.1, https://www.autosar.org/fileadmin/user_upload/standards/classic/4-3/AUTOSAR_SWS_WatchdogManager.pdf [Online; accessed 17-Jan-2018].<br />
[5] TASKING Safety Checker, http://www.tasking.com/content/safetychecker-asil-verification [Online; accessed 17-Jan-2018].<br />
[6] MISRA C, https://www.misra.org.uk/Activities/MISRAC/tabid/160/Default.aspx [Online; accessed 17-Jan-2018].<br />



Functional Safety in AI-controlled Vehicles<br />

If not ISO 26262, then what?<br />

Joseph Dailey<br />

Global functional safety manager<br />

Mentor, a Siemens Business<br />

Phoenix, Arizona, USA<br />

joe_dailey@mentor.com<br />

Abstract— Since its establishment in 2011, the ISO 26262<br />

international functional safety standard has rapidly emerged as<br />

the definitive guideline for automotive engineers looking to<br />

optimize the safety of electrical and/or electronic (E/E) automotive<br />

systems. But because ISO 26262 implies strict adherence to<br />

analyzing architecture and its effect on safety, the shift toward<br />

machine learning for critical driving decisions in self-driving cars<br />

threatens to break the standard’s direct link between functional<br />

safety and the requirements for how these new concepts should be<br />

fulfilled.<br />

After presenting a short history of automotive functional safety<br />

standards up to ISO 26262, this paper outlines the standard’s<br />

specific deficiencies related to artificial intelligence (AI)-controlled<br />

vehicles. Particular attention will be paid to challenges faced by<br />

existing standards bodies grappling with a more autonomous<br />

future. Whereas ISO 26262 specifies requirements for eliminating<br />

safety hazards in the presence of an E/E system fault, this paper<br />

suggests that new standards for the AI era must address so-called<br />

safety of the intended functionality (SOTIF), which means helping<br />

to validate that advanced automotive functionalities are<br />

engineered into the vehicle to avoid safety hazards even in the<br />

absence of a fault. The paper will draw on my experience working<br />

on the world’s first SOTIF standard, ISO/WD PAS 21448, which<br />

is under development.<br />

Keywords— autonomous vehicles, ISO 26262, functional safety,<br />

self-driving cars<br />

I. STANDARDS: A SHORT HISTORY<br />

Since the beginning of the Industrial Revolution, standards<br />

have been helping commerce by breaking down trade barriers.<br />

The first standard was created in 1841 and concerned screw<br />

thread measurements. In 1900 during the Paris International<br />

Electrical Congress, discussions between the British and<br />

American electrical engineering professional associations<br />

established the International Electrotechnical Commission. The<br />

IEC held its first meeting on June 26, 1906, and is still operating<br />

today. In 1926, the International Federation of the National<br />

Standardizing Associations (ISA) was established, focusing<br />

heavily on mechanical engineering. During World War Two<br />

(WWII), the ISA became the International Organization for<br />

Standardization (ISO).<br />

While standardizing products facilitated commerce within a<br />

country, there was no legal requirement between countries to<br />

adopt the same standards. Differing national approaches to<br />

standards soon created barriers to international commerce.<br />

After WWII there was a need to promote international trade<br />

by reducing or eliminating trade barriers. As a result, with the<br />

help of the newly formed United Nations, the General<br />

Agreement on Tariffs and Trade (GATT) was signed by 23<br />

nations on October 30, 1947, and went into effect January 1,<br />

1948. In the 1994 Uruguay Round Agreements, the World Trade<br />

Organization (WTO) was established as GATT’s successor.<br />

More than 120 countries took part in the Uruguay Round,<br />

producing the binding Agreement on Technical Barriers to<br />

Trade (TBT), administered by the WTO.<br />

The TBT agreement aimed to further the 1994 GATT<br />

objectives by:<br />

Recognizing the important contribution that<br />

international standards and conformity assessment systems<br />

can make in this regard by improving efficiency of<br />

production and facilitating the conduct of international<br />

trade [1]<br />

In one of its opening paragraphs, the agreement goes on to<br />

give a rationale for embracing standards as a means to ensure<br />

safety, defined as the protection of human, animal or plant life<br />

or health:<br />

Recognizing that no country should be prevented from<br />

taking measures necessary to ensure the quality of its<br />

exports, or for the protection of human, animal or plant life<br />

or health, of the environment, or for the prevention of<br />

deceptive practices, at the levels it considers appropriate,<br />

subject to the requirement that they are not applied in a<br />

manner which would constitute a means of arbitrary or<br />

unjustifiable discrimination between countries where the<br />

same conditions prevail or a disguised restriction on<br />

international trade, and are otherwise in accordance with<br />

the provisions of this Agreement [1]<br />



Since the WTO was formed, member nations have generally<br />

adhered to TBT’s original intent in creating mandatory and<br />

voluntary standards and guidelines with international standards<br />

organizations like ISO, IEEE and IEC. Standards now exist for<br />

everything from electrical and mechanical hardware, software<br />

and systems, to manufacturing processes and occupational<br />

hazards for hundreds of different applications and products.<br />

These standards furthered goals like consistency and reliability,<br />

but assumptions that a reliable product was also safe were called<br />

into question, particularly as systems became larger and more<br />

complex.<br />

In the 1990s, to create a guild of functional safety-minded<br />

engineers, the IEC led comprehensive studies of the process of<br />

creating both hardware- and software-based systems. The<br />

objective was to provide a standard so that hardware and<br />

software developers could claim their systems were safe. These<br />

studies led to the release of IEC 1508, which after public<br />

comments and a few years of further revisions, became the<br />

world’s first functional safety standard, IEC 61508, in 1998; the last four parts of the standard were released in 2000. IEC 61508 has<br />

spawned similar standards in a range of industries, including<br />

automotive (ISO 26262), rail software (IEC 62279), process<br />

industries (IEC 61511), nuclear power plants (IEC 61513),<br />

machinery (IEC 62061) and more.<br />

Though our deliberations are by necessity private, as a<br />

member of the committee working on the update, I can say that<br />

we are struggling with the typical normative approach of<br />

standards, which simply doesn’t make sense in addressing the<br />

coming revolution of ever more powerful advanced driver<br />

assistance systems (ADAS) technologies and autonomous<br />

vehicles. As a result a new group was formed to address safety<br />

of the intended functionality (SOTIF) for autonomous vehicles.<br />

Despite our best efforts, standardizing ADAS and autonomous<br />

systems will remain controversial given the challenge of<br />

identifying all possible driving scenarios, and creating relevant<br />

normative standards, and then validating those standards.<br />

II. QUESTIONS ACCOMPANY THE RISE OF AI<br />

How is it possible to standardize autonomous automotive<br />

outcomes which rely heavily on machine learning and AI? How<br />

does a standards committee factor in all of the unsafe conditions<br />

and possible AI-based responses that may occur? Today ISO is<br />

wrestling with a new standard, ISO/WD PAS 21448, to address<br />

the rise of ADAS and AI in vehicles [2]. The basic goal of<br />

functional safety in ADAS/AI is the same as always — to avoid<br />

unintended system behavior, in the absence of a fault, resulting<br />

from technological and system shortcomings and reasonably<br />

foreseeable misuse.<br />
Figure 1. A Google self-driving car in 2013. Waymo (formerly Google) says it logs 25,000 miles of road tests weekly for its autonomous fleet. In 2016, the company drove a billion miles in simulation, as well. Still, no matter how deep the pockets or good the methodology, testing will always be beset by inherent limitations. Image courtesy Becky Stern on Flickr under terms of Creative Commons (CC BY-SA 2.0).<br />
A joint effort of ISO and the Society of Automotive Engineers, ISO 26262 addresses E/E systems in passenger cars with a maximum gross weight of up to 3,500 kg. Though less than a decade old, ISO 26262 has become one of the most important standards in the automotive industry today. And this evolution of E/E functional safety is not over; the ISO 26262 committee is back at work developing the next revision, expected to be released in the third quarter of 2018. The revision adds motorcycles, and it addresses trucks and buses, which eliminates any weight restriction. Now the only exclusion is for mopeds.<br />
But even if an autonomous system is built<br />

from technologies aligned with this goal, does that guarantee<br />

driver and public safety? Or does terminology like ‘foreseeable’<br />

emphasize that safety is always a relative notion? Perhaps the<br />

proper goal is just to be safer than yesterday, or as Elon Musk<br />

and other AI proponents would say, safer than notoriously bad<br />

human drivers. Will society eventually accept entirely<br />

unforeseen accidents of robot cars? The issue with any standard<br />

is that real life situations never restrict themselves to foreseeable<br />

events. So the struggle and deficiency in any forthcoming ADAS<br />

and autonomous standards is how to make the safety of the<br />

66


intended functionality (SOTIF) — which is all that can<br />

reasonably be addressed given the basic facts about the<br />

underlying technologies — actually safe enough for societal<br />

acceptance.<br />

So let’s start by looking at some of the common ADAS<br />

technologies and AI techniques.<br />

III. ADAS AND AI

It's helpful to recall that, whether it's an algorithm or an adolescent at the wheel, driving is a learned behavior. I remember my dad always telling me to keep my distance and that you should always be able to see the rear tires of the car in front of you. Such instruction is fine, but all humans learn by confronting actual experiences while driving, say merging into freeway traffic, then using the brain's prodigious capabilities to quickly evaluate scenarios and decide on the most likely safe outcome, and finally taking action to accomplish that outcome. (In getting onto the freeway, perhaps the answer is to accelerate to get in front of the doddering senior in the merge lane; or if it's a barreling semi instead, maybe it's best to hang back until the truck has passed.)

However, even experienced drivers come across situations they have never observed or expected. The reaction to these events must be quick. Have you ever hit the brakes to avoid an animal on the road? Did you look behind you? If not, did you get hit from behind? Did you swerve? How about coming up to a truck blocking the road: did you break the law by crossing the double yellow lines to go around? Any attempt to consider all scenarios, including road and weather conditions, driver error or misuse, or acts of God, in all combinations, would result in a massive list of driving rules, one that inevitably would also be incomplete. Whether for human or computer, the only approach is to 'look' or 'sense' the environment around a vehicle, predict the most likely safe outcome and then act. AI increasingly is making such predictions, both for individual ADAS technologies and for entire autonomous systems.

A. A look back at AI

A media darling today, AI has been around since the 1950s and simply means a machine that can perform an intelligent task generally performed by humans. The criteria for what comprises intelligence include the ability to reason, represent knowledge, plan, learn, communicate and integrate these skills toward a common goal.

AI has progressed through the years. In the 1980s, machine learning started with supervised learning techniques, which used known data structures to make decisions. Among the first examples many of us encountered were email rules, alerts, and spam filters. These rules recognized specific words, websites, addresses, or other known information, and then placed suspect emails into a specific folder. Users could change and improve the rules by either identifying something new to add to the suspect list or marking certain data that was causing messages to be flagged by mistake. (Of course, Google's application of deep-learning AI across its mail services has by now mostly rendered such manual maintenance unnecessary, at least when it comes to Gmail, the world's largest email service.)

Figure 2. How a convolutional neural network distinguishes between a lion and a tiger. For a more detailed description, see http://bit.ly/2EPuPA7, a blog post by Facebook software engineer Shafeen Tejani.

Next came unsupervised learning, which used unknown data sets and determined actions based on probabilities. These techniques were used in anomaly detection algorithms and regression analysis. Machines got smarter, reinforcing their decision-making processes with previous outcomes and probabilities. By the 2010s, advanced neural networks were making possible surprising feats of deep learning, such as the Google AI triumph over one of the world's best Go players.
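The progression from hand-written rules to learned behavior can be illustrated with a toy spam filter. Everything below is invented for illustration (the word list, the scoring scheme, the function names); it simply contrasts a fixed rule with a model that learns from labeled examples.

```python
# Hypothetical rule list for the hand-written filter (illustrative only).
SUSPECT_WORDS = {"winner", "prize", "free"}

def rule_based_is_spam(message: str) -> bool:
    """Supervised-era email rule: flag messages containing known suspect words."""
    return bool(set(message.lower().split()) & SUSPECT_WORDS)

def train_word_scores(labeled):
    """Learn per-word spam/ham counts from (text, is_spam) examples."""
    scores = {}
    for text, is_spam in labeled:
        for word in set(text.lower().split()):
            counts = scores.setdefault(word, [0, 0])  # [spam count, ham count]
            counts[0 if is_spam else 1] += 1
    return scores

def learned_is_spam(message: str, scores) -> bool:
    """Flag a message if its words were seen more often in spam than in ham."""
    spam = ham = 0
    for word in message.lower().split():
        if word in scores:
            spam += scores[word][0]
            ham += scores[word][1]
    return spam > ham
```

The rule-based filter never changes unless a human edits `SUSPECT_WORDS`; the learned filter improves automatically as more labeled examples arrive, which is the shift the paragraph above describes.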

Still, the ability of AI to solve any problem with a near-human level of intelligent action is far off in the future and may stay in the realm of science fiction forever. But since increasingly autonomous vehicles are here now, the question for standards committees is how to make AI workable for an entire industry, not just its deep-pocketed giants. Doing so means understanding how AI does what it does, at least at a cursory level.

B. Deep learning<br />

Deep learning uses multiple layers of nonlinear processing<br />

units for feature learning and classification of its inputs. Each layer is trained under supervision and, once it has learned, is placed in an unsupervised state. This learning technology has

proved successful in natural language processing, computer<br />

vision, speech recognition, audio recognition and social network<br />

filtering.<br />



Deep learning techniques consist of artificial neural<br />

networks (ANNs) and deep neural networks (DNNs). ANNs and<br />

DNNs are systems inspired by the neural circuitry in the brain;<br />

such systems progressively improve their ability to do tasks<br />

through accumulated examples of rewards from distinct actions<br />

done correctly. A DNN uses hidden layers that can model complex non-linear relationships.

The challenge with this technique is that it determines outcomes by fitting a pattern. If the pattern does not fit all of the available data, then a system based on such a DNN will probably make the wrong decision. These systems also take a lot of computing power, which is costly in competitive markets like the relatively low-margin automotive industry.
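As a rough sketch of "multiple layers of nonlinear processing units," here is a two-layer network in plain Python. The weights and the tiny two-input example are arbitrary stand-ins chosen for illustration, not trained values from any real system.

```python
import math

def relu(x):
    """Nonlinear activation: rectified linear unit."""
    return max(0.0, x)

def layer(inputs, weights, biases, act):
    """One fully connected layer: act(W @ x + b), computed by hand."""
    return [act(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x):
    """Forward pass: hidden ReLU layer, then a sigmoid output unit."""
    hidden = layer(x, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], relu)
    out = layer(hidden, [[1.0, 1.0]], [0.0],
                lambda v: 1.0 / (1.0 + math.exp(-v)))
    return out[0]  # probability-like score in (0, 1)
```

Stacking such layers (and learning the weights from data rather than writing them down) is what gives deep learning its representational power, and also what makes it compute-hungry at automotive scale.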

C. Deep reinforcement learning (DRL)<br />

Deep reinforcement learning (DRL) is another type of<br />

machine learning algorithm. This technique uses its experience<br />

to determine the ideal behavior. Instead of rewards, DRL<br />

classifies what amount to punishments and pain into what is<br />

called a Q value, which it seeks to optimize as it makes<br />

predictions from its inputs. Accordingly, DRL learns from<br />

examples to create new and ever more sophisticated models.<br />

(DRL systems for visually analyzing imagery are often based on<br />

so-called convolutional neural networks, which have proven<br />

specifically useful in creating ADAS technologies that sense and<br />

act on the world from the point of view of a driver.)<br />
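The reward/penalty ("Q value") bookkeeping behind reinforcement learning can be sketched with tabular Q-learning on a toy two-lane road where the agent must change lanes to avoid an obstacle. The road layout, rewards and hyperparameters are invented for illustration; deep RL replaces this small table with a neural network.

```python
import random

LENGTH, OBSTACLE = 5, (2, 0)          # obstacle at step 2 in lane 0
ACTIONS = ("keep_lane", "change_lane")

def step(pos, lane, action):
    """Advance one road step; return (next state, reward, done)."""
    lane = 1 - lane if action == "change_lane" else lane
    pos += 1                                   # the car always moves forward
    if (pos, lane) == OBSTACLE:
        return (pos, lane), -10.0, True        # "pain": crash ends the episode
    if pos == LENGTH - 1:
        return (pos, lane), +10.0, True        # reached the end of road safely
    return (pos, lane), -0.1, False            # small cost per step

def train(episodes=300, alpha=0.5, gamma=0.9, eps=0.2, seed=1):
    """Learn Q values by epsilon-greedy exploration of the toy road."""
    rng = random.Random(seed)
    q = {(p, l): [0.0, 0.0] for p in range(LENGTH) for l in (0, 1)}
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            a = (rng.randrange(2) if rng.random() < eps
                 else q[state].index(max(q[state])))
            nxt, reward, done = step(*state, ACTIONS[a])
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])   # Q-value update
            state = nxt
    return q
```

After training, the Q value for "change lane" just before the obstacle exceeds the value for "keep lane," which is exactly the learned-from-experience behavior the paragraph describes.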

D. Cognitive computing<br />

Cognitive computing is a technique where computing<br />

systems try to simulate human thought processes. Whereas<br />

typical AI systems follow complex algorithms to solve a<br />

problem, cognitive computing tries to mimic the very social<br />

human brain that is always interacting with humans, be they<br />

passengers, other drivers or pedestrians.<br />

The potential is huge if AI, deep learning, reinforcement<br />

learning and cognitive computing can be applied together in an<br />

automotive context. But one fact still remains a problem for standards committees: AI actions do not follow a deterministic and normalized model. In other words, AI-driven actions are not directly linked to a system's inputs, parameters, initial conditions, or prescribed rules. There simply is no way to define a standard reaction, since most actions are now based on probabilistic models and reinforced with previous actions and outcomes.

The path forward for a rule-making committee may be<br />

creating standards that themselves adapt to the way we think and<br />

act in the world, though perhaps this is an impossible task as it’s<br />

akin to defining those processes in the human brain. (The<br />

committees, of course, are generally full of engineers, not<br />

neuroscientists.) We need to provide guidelines that fit the way<br />

AI is evolving — increasingly, self-driving cars will use<br />

accumulated previous knowledge to determine actions in<br />

specific scenarios. That is, cars will teach themselves, and as<br />

long as we provide the correct guidance, our collective safety<br />

should be enhanced as the vehicles convey that learning into new<br />

actions in the world.<br />

IV. THE RESPONSE OF REGULATORS, LAWMAKERS AND STANDARDS BODIES (SO FAR)

The rise of AI doesn’t obviate the need for existing standards<br />

regimes, which will remain the same for supporting processes,<br />

hardware and software development. However, the looming<br />

stumbling block is dealing with the SOTIF idea, particularly<br />

relevant in ADAS and autonomous applications. Engineering<br />

teams across the supply chain, especially at relatively small<br />

companies, will be looking for guidance on how to conduct a<br />

state of the art development cycle that includes advanced<br />

concepts in artificial intelligence.<br />

As the ISO 26262 committee struggles with the new SOTIF standard, ISO/PAS 21448, organizations like the U.S. National Highway Traffic Safety Administration (NHTSA)

are similarly working through the new issues posed by<br />

autonomy. On September 12, 2017, NHTSA released the new<br />

federal guidelines for Automated Driving Systems called A<br />

Vision for Safety 2.0, which includes a normative statement that<br />

“the process shall describe design redundancies and safety<br />

strategies for handling automated driving system (ADS)<br />

malfunctions.” [3] But there is no guidance on how to determine those redundancies and strategies; the document instead refers readers to ISO 26262.

The NHTSA publication is non-regulatory and covers<br />

autonomous safety elements from autonomous driving levels<br />

three to five, placing significant emphasis on software<br />

development, verification and validation. It also provides<br />

guidance on ADS safety elements covering system safety,<br />

operational design domain (road type, geographic area, speed<br />

range and other constraints), object and event detection and<br />

response (crash avoidance capability), fallback or minimal risk<br />

conditions, validation methods and cybersecurity. Though it<br />

provides a means for self-assessment, the document is not meant<br />

to be a legal requirement.<br />

In the United States, aside from the NHTSA guidance,<br />

momentum is apparent in state legislatures and governors’<br />

offices when it comes to passing laws on self-driving. Since<br />

Nevada authorized the first autonomous vehicles in 2011,<br />

twenty additional states have passed similar legislation related<br />

to autonomous vehicles. At least a dozen more have introduced<br />

laws on the subject, while governors in five states have bypassed<br />

the legislative process altogether and issued executive orders<br />

related to autonomous vehicles. [4]<br />

The international situation is just as scattershot. Rules differ<br />

by country, though generally are based on the U.N. Economic<br />

Commission for Europe (UNECE) Convention on Road Traffic,<br />

commonly known as the Vienna Convention on Road Traffic.<br />

First agreed upon during the Global Forum for Road Traffic<br />

Safety in November 1968, the agreement has 36 signatories and<br />

has been ratified by 75 parties, including most of Europe, and<br />

parts of the Americas, Asia, the Middle East, Africa, Russia and<br />

Indonesia. The major countries that are not a part of this<br />

agreement include the United States, Canada, China, Japan,<br />

Australia and India.<br />

In March 2016, the UNECE passed a regulatory milestone<br />

towards the deployment of automated vehicle technologies with<br />

an amendment to the 1968 rules, allowing automated driving<br />

technologies that transfer driving responsibilities to the vehicle<br />



in traffic, provided these technologies can be overridden or<br />

switched off by the driver. The amendment includes discussions<br />

on self-steering systems that take over the control of the vehicle<br />

under the permanent supervision of the driver, like lane-assist,

self-parking or highway autopilots.<br />

Despite this flurry of activity, there is a conspicuous lack of<br />

focus on what it means to achieve ‘safety.’ Regulations describe<br />

at great length what and how a vehicle is to be tested, and who<br />

can conduct those tests, and even the road and environmental<br />

conditions during testing. California’s Senate Bill No 1298 is<br />

representative of most other legislation in the amorphous way it<br />

references safety, with a stated goal to “… [create] appropriate<br />

rules intended to ensure that the testing and operation of<br />

autonomous vehicles in the state are conducted in a safe<br />

manner.” Little guidance is given on defining ‘a safe manner’<br />

except to ensure that a human driver can take over when<br />

necessary. The law states that “[t]he autonomous vehicle shall<br />

allow the operator to take control in multiple manners, including,<br />

without limitation, through the use of the brake, the accelerator<br />

pedal, or the steering wheel, and it shall alert the operator that<br />

the autonomous technology has been disengaged.” [5]<br />

This guidance is problematic on many levels. A driver is<br />

required to be the fallback to dynamic driving tasks, which<br />

assumes that the driver is aware of all unsafe situations and able<br />

to recover into a safe state when pressed to do so. Already a<br />

significant percentage of accidents are caused by distracted<br />

drivers, a group unlikely to be ready to take over in a crisis and<br />

that seems poised to grow in number as technology mediates<br />

more of the driving experience and life in general. The more<br />

glaring problem is that the human fallback option is only applicable

for level three and below; the language doesn’t seem to include<br />

fully autonomous level four and five vehicles, where the system<br />

itself is the ‘fallback’ option.<br />

V. THE LIMITS OF BRUTE-FORCE TESTING<br />

Another solution for providing safe autonomous systems is<br />

an abundance of testing. Waymo (previously Google) is the<br />

flagship example here, logging approximately 25,000 miles<br />

every week with its fleet of autonomous test vehicles operating in

four U.S. cities. Granted, testing will be critical, requiring both<br />

actual road miles and simulation. Waymo, arguably the leader in<br />

real-world testing, notes that it also drove a billion miles in<br />

simulation in 2016 alone. (Tass International, a Siemens<br />

business, offers a range of simulation and validation solutions<br />

for automated driving, including a platform called PreScan for<br />

simulating traffic and road environments, and support for<br />

hardware in the loop testing for various sensor and<br />

communication systems.)<br />

Still, no matter how deep the pockets or how good the<br />

technology, testing will always be beset by inherent limitations.<br />

The most obvious one is the difficulty in testing all edge cases.<br />

Even after millions or billions of city-street miles and virtual<br />

testing, an autonomous system might still react in an unsafe<br />

manner when confronted with an unforeseen set of inputs. (And<br />

yes, driving, like all tech-mediated human activity, will always

involve unforeseen inputs.)<br />

A second and related problem is determining what is correct<br />

in terms of autonomous decision-making. Invariably, robot car<br />

decisions can only be measured in degrees of correctness. Who<br />

determines what is correct enough? If the car is going to hit<br />

something on the road, does it run over the object? Stop and risk<br />

a rear-end collision? Swerve? Cross the double yellow lines? Or<br />

do we allow a simpler result — that is, an autonomous

vehicle “passes” the test if it responds to a situation so that no<br />

one gets hurt? In functional safety, there is much attention given<br />

to testing and evidence that builds confidence that a given<br />

system is safe, but this notion is murky when confidence is not<br />

tied to yes/no outcomes but instead to degrees of probability or<br />

correctness.<br />

The only certainty is that a combination of simulation,<br />

laboratory testing and real-world testing is the only practical<br />

method, so it’s up to committees to normalize or standardize<br />

these methods.<br />

VI. LIABILITY AND INSURANCE

The question of liability shadows the work of all standards<br />

committees. If deterministic requirements for autonomous<br />

vehicles are published, and a developer follows that standard to<br />

the strictest degree and an accident still occurs, who assumes<br />

liability?<br />

Standards, of course, are generally not legally binding<br />

though are often used by the courts to settle legal disputes,<br />

especially in product liability cases. Invariably accountability<br />

for accidents will shift from drivers and their insurance<br />

companies to carmakers, and the tech vendors increasingly<br />

prominent in the auto supply chain. (An example: California<br />

recently scrapped a planned rule that would have let carmakers off the hook when their autonomous vehicle crashed, if it was determined that the car hadn't been maintained to spec. That is,

a carmaker might have been spared liability if a vehicle in a<br />

fender bender had muddy sensors, even if the accident actually<br />

stemmed from sloppy code. Regulators said no way.) Standards<br />

committees comprised mostly of carmakers and their<br />

hardware/software suppliers might be disinclined to accelerate<br />

this tectonic shift in how blame is apportioned for accidents,<br />

particularly in environments still mostly devoid of consistent<br />

national regulation.<br />

What makes the most sense is for these committees to<br />

provide requirements on a V-cycle development process,<br />

including historical reconstruction of such processes. Indeed,<br />

this is already required in ISO 26262. But when it comes to<br />

autonomous architecture and AI development, the committee<br />

might need only to provide information or guidelines in detailing<br />

the ongoing operational parameters, algorithm use cases and<br />

probability-based evaluation of outcomes which lead up to an<br />

unsafe situation.<br />

VII. AUTONOMOUS STANDARDS TODAY AND BEYOND<br />

The good news is that we have robust standards, notably ISO<br />

26262, for processes, and hardware/software development. And<br />

governments are increasingly issuing guidelines and laws<br />

concerning self-driving cars for real-world testing, slowly<br />

clarifying or at least populating the previously barren regulatory<br />

landscape.<br />

But since AI development points the way to more and more<br />

nondeterministic applications, instead of focusing on proving<br />



inputs, parameters, and logical paths that lead to a safe self-driving application, we must shift our focus to proving the safety

of both the process of creating the AI in the first place and then<br />

to its eventual outcomes when it makes decisions and mistakes<br />

on public roads. It won’t be easy given all the vagaries around<br />

the notion of safety which, according to the philosophers, is<br />

often fundamentally in conflict with freedom and free will.<br />

For now, a standard like the forthcoming ISO/PAS 21448 will likely be the best guideline our industry is capable of producing. The guidance cannot fully pin down the nondeterministic nature of AI, but it will at least provide some

normalization and information. The committee members I’m<br />

serving with represent a breadth of impressive technological<br />

expertise in the ADS field. Among other outcomes, expect the<br />

standard to provide a much needed common vocabulary so we<br />

can all begin communicating effectively on autonomous safety.<br />

And it’s clear that the standard will provide guidance even on<br />

slightly opaque issues such as how to consider known and<br />

unknown use cases, dependencies, the limitation of<br />

countermeasures, automation authority and warning strategies.<br />

There also will be information on more conventional topics like<br />

verification and validation.<br />

VIII. CONCLUSION<br />

Starting more than a century ago with an effort to make screw

threads uniform, standards have boosted commerce by breaking<br />

down trade barriers between companies and countries. The<br />

existing ISO 26262 and forthcoming ISO/PAS 21448 standards<br />

are no different. But as products and even the design flows<br />

themselves become more autonomous and non-deterministic,<br />

standards committees will need to wrestle with outcomes<br />

determined from probabilities.<br />

Standards bodies and society at-large will need to accept the<br />

reality that there will be unsafe situations, accidents, injuries and<br />

deaths in the autonomous future. The goal of functional safety is<br />

not to eliminate these events since they will always happen;<br />

instead, the goal is to limit their likelihood.<br />

More to the point, in our era of measuring and optimizing<br />

metrics of all kinds, can we make unsafe situations significantly<br />

less likely than they are today? The answer is, of course! Despite<br />

the difficulty in describing precisely how a robot car will make<br />

decisions, standardization efforts like ISO 26262 and the<br />

forthcoming SOTIF work by ISO provide much needed<br />

guidance on how to determine how safe is safe enough as these<br />

cars proliferate.<br />

And that’s an excellent outcome, any way you describe it.<br />

ACKNOWLEDGMENT<br />

Thanks to Robert Bates and Andrew Macleod for their<br />

review of this paper, and to Geoff Koch for editing and layout<br />

assistance.<br />

REFERENCES<br />

[1] “Agreement on Technical Barriers to Trade,” accessed January 2018,<br />

http://bit.ly/2FIDA0n.<br />

[2] ISO/WD PAS 21448, "Road vehicles -- Safety of the intended<br />

functionality," ISO Standards Catalogue,<br />

www.iso.org/standard/70939.html.<br />

[3] "Automated Driving Systems 2.0: A Vision for Safety," NHTSA,<br />

accessed January 2018, www.nhtsa.gov/manufacturers/automated-driving-systems.

[4] "Autonomous Vehicles | Self-Driving Vehicles Enacted Legislation,"<br />

National Conference of State Legislatures (NCSL), accessed January<br />

2018, http://bit.ly/2ELZ4YI.<br />

[5] California SB-1298, "Vehicles: autonomous vehicles: safety and<br />

performance requirements," (2011-2012), http://bit.ly/2EPXDc1.

FURTHER READING<br />

[1] Robert Bates, “Is it Possible to Know How Safe We Are in a World of<br />

Autonomous Cars,” 2017 Mentor Graphics whitepaper,<br />

http://go.mentor.com/4VxbP.<br />

[2] A. G. Foord and W. G. Gulland, 4-Sight Consulting, UK; C. R. Howard,<br />

Istech Consulting Ltd, UK, "Ten Years of IEC 61508; Has It Made Any<br />

Difference?" IChemE Symposium Series No. 156, 2011,<br />

http://bit.ly/2FIGBxD.<br />

[3] "An Introduction to Functional Safety and IEC 61508," MTL Instruments<br />

Group plc, 2002, http://bit.ly/2FKpj3a.<br />

[4] "Global status report on road safety," World Health Organization, updated<br />

July 2017, http://bit.ly/2FLmcb4.<br />



Designing Embedded Systems for Autonomous<br />

Driving with Functional Safety and Reliability<br />

David Lopez<br />

NXP Semiconductors,<br />

Marketing and application manager,<br />

Safety and power management<br />

Toulouse, France<br />

Jean-Philippe Meunier<br />

NXP Semiconductors,<br />

Functional safety architect,<br />

Safety and power management<br />

Toulouse, France<br />

Maxime Clairet<br />

NXP Semiconductors,<br />

Systems and applications engineer,<br />

Safety and power management<br />

Toulouse, France<br />

Abstract—Societal changes and policy regulations are driving

automotive requirements for electrification, connectivity and<br />

autonomy. Embedded systems need safety-defined and safety-designed solutions, as well as extended robustness, to accompany this

transition. This paper addresses a methodology applied to power<br />

management devices that require the highest level of functional<br />

safety; how those processes extend robustness and are<br />

instrumental to reducing and preventing systematic failures with<br />

dedicated hardware management strategies. Extended<br />

qualification tests that assess reliability robustness demonstrate<br />

how a device can operate under different environments, with<br />

different grade levels representing the qualification. This paper<br />

will discuss the results of some Grade 0 tests performed on power<br />

management solutions, to secure the extended temperature operating range.

Keywords—functional safety, power management architectures,<br />

fail silent, fault tolerant, grade 0, robustness.<br />

I. INTRODUCTION<br />

Today’s society is faced with considerable challenges<br />

towards the impending energy transition. Concerns about<br />

climate change, urbanization and austerity measures due to<br />

shortage of resources are combining with the need for<br />

increasing safety on the roads. The automotive industry’s role<br />

in contributing to this transition is clear-cut. The dominant<br />

drivers will be electrification, connectivity and autonomy<br />

enabling a driverless system that improves the mobility<br />

experience and helps to reduce fatalities.<br />

However, other industries, such as aeronautics, are also dominant, and collaboration will be essential to the rapid

development of future systems. In fact, the aeronautics industry<br />

set the precedent for embedded systems with full redundancy to<br />

reach higher levels of autonomy. This is what the automotive<br />

industry aspires to, to assist or replace drivers. Systems such as<br />

these, albeit adapted to the automotive market, would still<br />

require high dependability, with cost effective architectures and<br />

solutions. This article will introduce the evolution of system<br />

architectures from fail safe to fail operational, with a highlight<br />

on power management solutions developed to simplify system<br />

design and secure safety assessments. Each market uses its own<br />

standard, methodologies and certification, but the evolution of<br />

embedded systems required for future mobility (road or air)<br />

requires closer collaboration.<br />

Finally, the consumer market is permeating the automotive<br />

market with solutions for artificial intelligence. However, these<br />

solutions need to be adapted to the constraints of the automotive<br />

environment. The international SAE standard sets out different<br />

levels of automation. In this model, the vehicle operates in fail<br />

safe mode in levels 0 through 2 and in fail operational mode in<br />

levels 3 through 5. These levels are essential for setting out<br />

minimum requirements in terms of functional safety for<br />

autonomous vehicles.<br />
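The mapping just described, fail safe for SAE levels 0 through 2 and fail operational for levels 3 through 5, can be written as a trivial lookup. The function name and error handling are ours; the levels and labels come from the text.

```python
def required_failure_mode(sae_level: int) -> str:
    """Return the failure-handling mode the text associates with an SAE level."""
    if sae_level not in range(6):
        raise ValueError("SAE J3016 defines levels 0 through 5")
    # Levels 0-2: driver is the fallback; levels 3-5: the system must be.
    return "fail-safe" if sae_level <= 2 else "fail-operational"
```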

Robustness, lifetime, quality and reliability are the key<br />

challenges to adapting these technologies to the new mobility<br />

requirements. The next section will discuss the features of a failsafe<br />

architecture with examples of typical system<br />

implementations.<br />

II. FAIL-SAFE ARCHITECTURE

The safety architecture for a system such as electrical power steering (EPS) is illustrated in "Fig. 1." This architecture, traditionally based on a fail-safe topology, aims to deactivate the driver assistance functionality in the case of a failure.

Fig. 1. EPS fail-safe architecture. [Block diagram: a battery-supplied (VBAT) FS6500 ASIL D safety power-management device provides the core, VDDIO, ADC-reference and sensor supplies to an ASIL D safety MCU; the MCU is monitored via voltage monitors (VCOREMON, VMON1, VMON2), FCCU1/FCCU2 inputs and a watchdog challenger over SPI, while RSTb and the FS0b/FS1b (delayed) safety outputs drive a safety switch and the motor gate driver; CAN connects the ECU to the vehicle.]


In this specific implementation, when a failure occurs, the safety switches are opened so that the system remains controllable. A functional safety system basis chip, such as the FS6500, plays an important role here because it is the only component able to reset the microcontroller and transition the system into a safe state in case of hardware or software issues. Microcontroller hardware failures are monitored by the FS6500 via the Fault Collection and Control Unit (FCCU) inputs, and software plus temporal aspects are monitored via the watchdog challenger. However, this architecture limits availability: if a failure occurs in the system, availability is reduced or even lost, forcing the system into a safe state and therefore losing the driver assistance functionality.
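The challenge-response ("challenger") watchdog idea can be sketched as a small simulation. The 8-bit LFSR question generator, the error limit and all names below are invented for illustration; they do not reflect the actual FS6500 register map or protocol details.

```python
class ChallengerWatchdog:
    """Safety-chip side of a question/answer watchdog (toy model)."""

    def __init__(self, seed=0xB2, error_limit=3):
        self.lfsr = seed              # current challenge ("question")
        self.error_limit = error_limit
        self.errors = 0
        self.safe_state = False

    @staticmethod
    def _next(value):
        # Toy 8-bit LFSR standing in for the real question generator.
        bit = ((value >> 7) ^ (value >> 5) ^ (value >> 4) ^ (value >> 3)) & 1
        return ((value << 1) | bit) & 0xFF

    def challenge(self):
        return self.lfsr

    def answer(self, response):
        """Verify the MCU's answer; escalate to safe state on repeated errors."""
        expected = self._next(self.lfsr)
        if response == expected:
            self.errors = 0                  # good answer: reset error counter
        else:
            self.errors += 1
            if self.errors >= self.error_limit:
                self.safe_state = True       # e.g. assert FS0b / reset the MCU
        self.lfsr = expected                 # advance to the next challenge

def mcu_compute_answer(challenge):
    """MCU side: proves liveness and sanity by computing the same function."""
    return ChallengerWatchdog._next(challenge)
```

A healthy MCU keeps answering correctly and the error counter stays at zero; a hung or corrupted MCU fails the exchange repeatedly, and the safety chip drives the system to its safe state.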

The Automotive Safety Integrity Level (ASIL) of the FS6500, as defined by ISO 26262, the functional safety standard for road vehicles, is "ASIL D".

“Fig 2,” shows a simple representation of a fail operational<br />

architecture.<br />

TABLE II.<br />

Fig. 2. Fail-Operational Unit<br />

This “ASIL D” level dictates the highest integrity<br />

requirements which are basically reported on the individual<br />

components that compose the system including the Power<br />

Supply. Technical safety assumptions were taken during the<br />

development phase because the System Basis Chip was<br />

developed as a Safety Element out of Context (SEooC).<br />

Several technical safety requirements derived from these<br />

safety assumptions, that highlighted the independencies of the<br />

safety monitoring unit of the product. This includes the<br />

independencies and redundancies of the reference voltages,<br />

current references, clocks and state machines compared to the<br />

power management domain (SMPS, LDOs, system features).<br />

This independent safety monitoring unit rated “ASIL D”<br />

provides a set of fully configurable built-in safety mechanisms<br />

to the system integrator.<br />

In this architecture, two fail-silent units are used. The independence of the power sources is ensured by VBAT1 and VBAT2. Redundant and independent power supplies (FS65) and processing (MCU) are provided, and both ECUs can drive the actuator. In case of a failure in one of the units, the second, backup ECU is able to take control. In addition to the full redundancy, both ECUs can cross-check critical information with each other to increase the global diagnostic coverage of the system.<br />

A. Fail operational concept applied to the Electrical Power<br />

Steering use case.<br />

If we apply this high-level concept to the electrical power steering architecture, we end up with the implementation shown in “Fig. 3.”<br />

III.<br />

FAIL OPERATIONAL ARCHITECTURES<br />

In accordance with the SAE standard, the highest levels of automated driving (Level 4/Level 5) require new fail-operational architecture implementations to deliver functionality in the vehicle.<br />
Fail-operational systems guarantee the full or degraded operation of a function in case of failure, so a single fail-safe ECU can no longer be used, for the reasons explained above.<br />
To satisfy the requirements of a very high-availability system, redundant fail-silent units are envisaged. However, a complete safety analysis must be done to ensure diversity in the information channel and to eliminate common-cause failures.<br />




Fig. 5. Fault tolerant central fusion<br />


Each sensor fusion module can be decomposed as shown in “Fig. 6.” The FS6500 is rated “fit for ASIL D” and is an ideal companion to a safety MCU in the ASIL D(D) domain.<br />

Fig. 3. Fail operational EPS (degraded operation)<br />

In this configuration, two FS6500 devices supply two different MCUs. The full chain is independent, from the power (VBAT1 and VBAT2) to the gate drivers (GD1, GD2). Only one electrical motor is used (6 phases). If a failure occurs in one of the channels, the relevant channel is switched off and operation continues using the backup channel. This option loses roughly 50% of the torque assist, but the system continues to work in degraded operation.<br />

B. Fail operational concept applied to central fusion use<br />

case.<br />

The central fusion system takes data from various sensors in the vehicle, such as radar, camera and lidar, then merges and computes that information to command actuators such as braking and steering.<br />
“Fig. 4” represents a high-level block diagram of a central fusion system; the ASIL allocation for each function is also represented. As a general comment, in the central fusion module it is very common to find ASIL D(D) elements coexisting with ASIL B(D) elements on the same module.<br />


Fig. 4. High Level Central Fusion Block Diagram<br />


Then, with a view to a fault-tolerant system for driving automation starting at Level 3, all the different parts that compose the system must be redundant and independent, as described in “Fig. 5.”<br />

Fig. 6. Central Fusion Unit<br />


FS6500 integrates a “safety island” where all safety mechanisms are designed. This safety island is based on a fully redundant architecture, completely independent from the power management side. A specific focus was also put on isolation, eliminating perturbations from switchers (e.g. negative substrate injection).<br />

To achieve these safety architectures, different measures are taken to assess the failure probability of the IC. The attach strategy, pairing power management ICs with microcontrollers, means that products are defined from the outset to go together. This greatly simplifies the safety strategy and design of the system and means that integrated measurements can be implemented. However, in order to analyze the risk at system level, it is necessary to be able to quantify the risk of each individual IC failure.<br />



IV.<br />

QUANTITATIVE ANALYSIS: FROM<br />

RELIABILITY TO FUNCTIONAL SAFETY<br />

Functional safety metrics are calculated based on the Failure In Time (FIT) metric, which quantifies the risk of failure during the lifetime of an application according to the IEC TR 62380 standard [2]. The FIT rate is calculated with (1) below:<br />
FIT = λdie + λpackage + λoverstress    (1)<br />
where λdie, λpackage and λoverstress are respectively the risk of failure related to the integrated circuit die, to all the parts constituting the package, and to system stress during operation.<br />
The parameter λdie is calculated with (2) below:<br />
λdie = (λ1 · N · e^(−0.35·a) + λ2) · [ Σ i=1..y (πt)i · τi ] / (τon + τoff)    (2)<br />
where λ1 is the per-transistor base failure rate of the integrated circuit family, λ2 is the failure rate related to the technology mastering of the integrated circuit, N is the number of transistors of the integrated circuit, a is the year of manufacturing minus 1998, (πt)i is the i-th temperature factor related to the i-th junction temperature of the integrated circuit mission profile, τi is the i-th working time ratio of the integrated circuit for the i-th junction temperature of the mission profile, τon is the total working time ratio of the integrated circuit and τoff is the time ratio for the integrated circuit being in storage.<br />
The parameter λpackage is calculated with (3) below:<br />
λpackage = 2.75·10⁻³ · πα · [ Σ i=1..z (πn)i · (ΔTi)^0.68 ] · λ3    (3)<br />
where πα is the influence factor related to the difference in thermal expansion coefficients between the mounting substrate and the package material, (πn)i is the i-th influence factor related to the annual number of thermal variation cycles seen by the package with the amplitude ΔTi, ΔTi is the i-th thermal amplitude variation of the mission profile and λ3 is the base failure rate of the integrated circuit package.<br />
Based on hardware deterioration, the FIT rate calculation helps to determine the following ISO 26262 metrics. The FIT rate of the device is distributed over the device functions based on their representative die size, and for each function it is equally distributed over all possible failure modes. If a failure mode of a safety-related function violates one of the application safety goals, a safety mechanism is required to detect it. One FIT represents one failure in 10⁹ device-hours (for example, 1,000 devices operating for roughly 114 years).<br />
This FIT rate is an input to a SafeAssure tool developed by NXP. The Dynamic FMEDA (Failure Mode Effect and Diagnostic Analysis) calculates three ISO 26262 [1] metrics required to qualify for an ASIL level.<br />
The SPFM (Single Point Fault Metric) represents the coverage of the failure rate which violates an application safety goal: >99% for ASIL D. Depending on the diagnostic coverage of the safety mechanism (low: 60%, medium: 90%, high: 99%), the residual FIT of the undetected failure mode enters (4):<br />
SPFM = 1 − Σ λRF / Σ λSR    (4)<br />
where λSR is equal to the FIT rate of the safety-related functions and λRF is the residual FIT of the undetected failure modes.<br />
The LFM (Latent Fault Metric) covers failures in the safety detection mechanism (also called monitoring) that can lead to the violation of the application safety goal in conjunction with a single point fault: >90% for ASIL D. The same approach is applied to the LFM, and the residual FIT of the undetected failure mode (for example, undetected by BIST) is calculated with (5):<br />
LFM = 1 − Σ λMPF / Σ (λSR − λRF)    (5)<br />
where λMPF is equal to the residual FIT of latent faults.<br />
The PMHF (Probabilistic Metric for random Hardware Failures) concerns the residual probability of breaching a safety goal (&lt; 10⁻⁸ per hour, i.e. &lt; 10 FIT, for ASIL D).<br />


TABLE III.<br />

LEVEL OF METRICS FOR TYPE OF ASIL<br />

V. APPLICATION MISSION PROFILE AND<br />

AUTOMOTIVE QUALIFICATION REQUIREMENT<br />

The automotive market is moving towards the convergence of electrification and autonomous driving to lower emissions, optimize traffic congestion and reduce other hazards. This trend requires more and more electronic systems capable of acting in place of a human driver, such as for steering, braking or transmission. However, these systems also need to manage efficient monitoring and charging of the battery to optimize vehicle autonomy and battery lifetime over 15 years. In addition to these automotive mission profiles, there are qualification requirements for standard products that need to be validated for use in an automotive context.<br />

A. Grade 0 requirement<br />

New steering, braking and transmission powertrain applications are increasingly highly integrated and are sometimes combined with lower housing thermal performance to reduce production cost. The consequence is a higher working temperature range, requiring an AEC-Q100 Rev. H [3] Grade 0 qualification (Ta = 150°C and Tj = 175°C). The +25°C delta on both Ta and Tj compared to a standard Grade 1 qualification is an important gap to satisfy, with additional qualification stress to perform.<br />

Table IV below shows an automotive mission profile<br />

requiring Grade 0 qualification.<br />

TABLE IV.<br />

GRADE 0 MISSION PROFILE<br />

Ambient temperature Ta (°C) | Junction temperature Tj (°C) | Operation time (Hrs)<br />
-40 | -15 | 260<br />
-15 | 10 | 450<br />
5 | 30 | 550<br />
45 | 70 | 700<br />
75 | 100 | 800<br />
85 | 110 | 900<br />
95 | 120 | 1200<br />
105 | 130 | 3100<br />
115 | 140 | 2100<br />
125 | 150 | 1600<br />
135 | 160 | 330<br />
145 | 170 | 10<br />
Average Ta: 73°C | Average Tj: 98°C | Total operation time: 12000 Hrs<br />

The FS6500 fit-for-ASIL-D system basis chip family has been qualified for Grade 0 according to the above mission profile: 2000 hours of High Temperature Operating Life test (HTOL) performed at Tj = 175°C, 3000 Temperature Cycles (TC) performed from -55°C to +150°C, and 2000 hours of High Temperature Storage Life test (HTSL) performed at +150°C. The FS6500 portfolio, extended with the Grade 0 MC35FS6500 family, offers outstanding reliability performance to support the high-temperature applications required by the harshest automotive environments and market trends.<br />

B. Extended Grade 1 requirement<br />

New battery management applications for electrical vehicles require longer device operation time, up to 30% of the device lifetime. Indeed, compared to a traditional Internal Combustion Engine (ICE) vehicle, batteries for an Electrical Vehicle (EV) require the electronics in charge of battery management to be active even during the charging phase, while the car is parked.<br />

Table V below shows an Electrical Vehicle mission profile, where we can see the long operation time around 60°C. This takes into account the charging phase of the batteries, with a total operation time of 40,000 hours, between three and four times longer than a typical ICE mission profile.<br />

TABLE V.<br />

EV MISSION PROFILE<br />

Ambient temperature Ta (°C) | Junction temperature Tj (°C) | Operation time (Hrs)<br />
-40 | -25 | 2400<br />
63 | 78 | 19703<br />
70 | 85 | 2656<br />
90 | 105 | 5201<br />
100 | 115 | 6440<br />
105 | 120 | 2407<br />
115 | 130 | 1093<br />
125 | 140 | 100<br />
Average Ta: 79°C | Average Tj: 94°C | Total operation time: 40000 Hrs<br />

The FS6500 fit-for-ASIL-D family developed by NXP has successfully passed 4200 hours of High Temperature Operating Life test (HTOL) performed at Tj = 150°C, covering the mission profile above.<br />

C. FIT rate impact<br />

These mission profiles, extended in temperature range<br />

or in operation time, have to be carefully analyzed by the<br />

semiconductor supplier to determine the appropriate reliability<br />

stress conditions and durations to be performed during the<br />

qualification of the device.<br />

Moreover, the Failure In Time (FIT) rate of the electronic devices selected for safety automotive applications is calculated with the mission profile as input. As described in Section IV, the application mission profile influences the total FIT rate of the device and consequently proportionally affects the PMHF metric, the output of the FMEDA analysis, as can be seen in Table VI below.<br />



TABLE VI.<br />

FS6500 FIT AND PMHF COMPARISON<br />

               | Grade 1 | EV   | Grade 0<br />
FIT rate (FIT) | 53.6    | 75.3 | 70.1<br />
PMHF (FIT)     | 0.72    | 1.02 | 0.96<br />

A tough mission profile increases the FIT rate and the PMHF. This is where a redundant device architecture between the function and its monitoring makes the difference. The FS6500 safety-related functions and their monitoring are physically and electrically independent, limiting the PMHF impact of such a mission profile and facilitating the development of safety automotive applications up to ASIL D compliance.<br />

On the other hand, for the final pillar of automotive qualification requirements, where consumer-grade semiconductors are used in vehicles, ZVEI and other industry partners developed a framework to handle this scenario. The guidelines help facilitate the use of products created without automotive processes in applications that require more stringent reliability and robustness. Several systems from the consumer, gaming and networking markets are crossing over to the automotive market, making mobility connected, efficient and autonomous. These components require the adaptation of technologies to the automotive environment, or at least an evolution of the design and qualification process.<br />

VI.<br />

CONCLUSION<br />

Convergence between different markets means that embedded systems not specifically designed for automotive environment conditions are being used in vehicles, with the associated guarantees and performance, mostly to respond to demands for high-performance infotainment, radar and camera driver assistance technologies.<br />

With the move towards electrification and automation, more<br />

stringent test and reliability stresses need to be performed to<br />

ensure the high level of quality and robustness required at the<br />

component level for specific environment conditions.<br />

As this paper has demonstrated, the combination of functional safety measures with IC robustness improvements applied at component and system level helps to reduce the risk of failure in vehicles, and complies with the evolution of the automotive environment.<br />

At a system level, the safety architecture and system design<br />

aim to enable full redundancy and therefore facilitate higher<br />

levels of autonomous driving to achieve fault tolerance in the<br />

case of failure.<br />

VII.<br />

REFERENCES<br />

[1] ISO 26262, “Road vehicles – Functional safety”, International Organization for Standardization, 2011<br />
[2] IEC TR 62380:2004, “Reliability data handbook – Universal model for reliability prediction of electronics components, PCBs and equipment”<br />
[3] AEC-Q100 Rev. H, “Failure mechanism based stress test qualification for integrated circuits”<br />



Safety Architectures on Multicore Processors –<br />

Mastering the Time Domain<br />

Thomas Barth<br />

Department of Electrical Engineering<br />

Hochschule Darmstadt – University of Applied Sciences<br />

Darmstadt, Germany<br />

thomas.barth@h-da.de<br />

Prof. Dr.-Ing. Peter Fromm<br />
Department of Electrical Engineering<br />
Hochschule Darmstadt – University of Applied Sciences<br />
Darmstadt, Germany<br />
peter.fromm@h-da.de<br />
Abstract— A key principle for building safe architectures is a strict separation of normal application code (also referred to as QM code) and safety function code, considering separation not only in the memory and peripheral domains but also in the time domain. Whereas hardware features like memory or bus protection units allow a comparably simple protection of the memory domain, supervision of the time domain is a lot more complex. Race conditions on a multicore system are far more likely and complex than on a single-core system, as there is true parallel execution of code and more asynchronous architectural patterns. Most safety standards, such as IEC 61508 [1] and ISO 26262 [2], require:<br />
• Alive monitoring<br />
• Real-time monitoring<br />
• Control flow monitoring<br />
In this paper we describe a typical signal flow on a multicore safety system and, based on this architecture, introduce an innovative second-level monitoring layer which supervises the real-time constraints of the safety and functional monitoring functions. We demonstrate the use of selected hardware features of the Infineon AURIX and the TLF watchdog chip together with the SafetyOS PxROS from the company HighTec, and show how they can be used in the context of a safety architecture. Furthermore, we demonstrate the use of a combined watchdog / smart power module, which supports not only an emergency switch-off but also the control of multiple power domains and defined reboot sequences in case of system errors.<br />
Keywords—Timing, Control Flow, Functional Safety, Safety Architectures, Multicore, Runtime Environment<br />
I. SAFETY ARCHITECTURE<br />
A very common design pattern for the implementation of a safety architecture is the use of redundant and independent channels. By monitoring and comparing the input, control and output values of both channels, single errors can be detected and the system can be switched into a safe state, as shown in Figure 1.<br />
Figure 1 - Dual channel fail safe architecture<br />
Transferring this architecture onto a multicore system by simply replacing the ECUs with cores will not lead to the same level of reliability, as the probability of common cause failures is higher compared to the discrete setup, due to shared resources, a common power supply and similar factors [3]. Using a safe operating system providing separation techniques will help, but the risks caused by a wrong configuration still remain.<br />
A possible approach to overcome these weaknesses is the introduction of a multi-layer monitoring architecture [4]. The first layer of monitoring functions supervises the coherency of the sensor input data, the calculated control variables and the correct transfer to the actuators, which still can and should be implemented in a multi-channel structure. As long as the monitoring functions work as intended, the system can be assumed to be in a safe operational state [5].<br />



Figure 2 – Multi-layer monitoring architecture<br />

However, what happens if a bug in one of the units impacts the functionality of the monitors? In this case, the system might end up in a dangerous state, as the correct operation of the Input-Logic-Output channel is no longer supervised. Therefore, a second layer of monitoring functions is introduced, which monitors the health state of the two layers below. This health state needs to be actively and periodically reported to an external safety device such as a watchdog chip, which in case of a failure will bring the system into a safe state [6].<br />
The external device is required because, in case of a system error, the main controller might not be able to reach the safe state by itself, e.g. due to an output task which is acting incorrectly or due to misconfigured or frozen safety ports.<br />

The health-monitoring layer can be divided into four major blocks:<br />
• System supervision covers hardware errors reported e.g. by a lockstep core, memory bit flips and similar.<br />
• Memory and bus supervision focuses on the separation in the memory domain, by detecting illegal memory accesses reported by the memory protection unit or access violations on shared busses.<br />
• Timing supervision ensures that critical software components are executed in predefined intervals. Furthermore, possible violations of the agreed real-time constraints are checked, as well as the correct execution order of safety functions. This block is the focus of this paper.<br />
• Last but not least, the peripheral supervision block ensures that all peripheral modules work as expected. Often, access violations can be detected and handled by the core’s safety logic using the MPU and bus protection. In addition, the physical operation of a pin can be checked by reading it back or by using external supervision modules.<br />

II. WHY TIMING SUPERVISION?<br />

The supervision of the time domain is a typical requirement<br />

in most safety standards in order to detect system malfunctions<br />

and to take corrective actions before a system failure might<br />

harm humans or the environment.<br />

Alive monitoring checks if critical functions are executed at all. This is typically done by introducing a watchdog, which needs to be triggered in predefined intervals. If the watchdog is not triggered, the system responds with a hardware reset or a similar action. This supervision is comparably easy to implement; however, the error handling scenarios are limited and usually quite harsh. This technique is often used in fail-safe systems: a failure of the alive monitoring check leads to a transition into a safe state, e.g. by using an external emergency shut-off unit.<br />

Real-time monitoring measures the execution time of safety functions and checks if the defined timing gates are met. This approach can be used to detect the exceeded runtime of a function, caused e.g. by a buggy algorithm, resulting in a late update of data required by a following process and a subsequent system failure.<br />

Control flow monitoring addresses the correct execution order of code. On the level of single functions, this is ensured to a certain extent by using a qualified compiler. In the following code sequence, we can assume that the assignment in line 1 will be executed before the if-statement in line 2, which is followed by the function call in line 4 in case funcA returned the value 4.<br />

1 a = funcA(); //a is global<br />

2 if (4 == a)<br />

3 {<br />

4 funcB();<br />

5 }<br />

Figure 3 - Code snippet control flow<br />

However, what happens if an interrupt occurs between lines 1 and 2, modifying the value of the global variable a? The behavior is probably not as expected. The same might happen if we use a preemptive operating system: here, a higher-priority task may interrupt a lower-priority task when it is activated. This can lead to wrong behavior if not all critical sections are correctly identified and secured. It becomes even more of an issue on multicore systems, where the cores execute code completely independently but share memory and other resources.<br />



III.<br />

ALIVE MONITORING<br />

Alive monitoring is the most basic check: it detects whether a system is alive or not. Being alive does not mean that a system is operational; it simply means that user code is being executed and that the system is not locked inside an infinite loop, ISR, trap or similar.<br />
As alive monitoring aims to check whether software is executed at all, the monitor itself cannot be implemented in software; hardware features have to be utilized to ensure that errors can be detected even if no code is executed. A common hardware feature for alive monitoring is the watchdog timer, which needs to be triggered in predefined intervals. If a watchdog timer is not triggered as expected, it causes a hardware event which can be used for error escalation.<br />

Hardware vendors provide different watchdog timers. The most basic is a hardware counter which automatically counts down from a timeout value; if it reaches zero, a hardware event is triggered. Software has to ensure that the counter value is reset before the counter reaches zero, making it possible to detect whether software has reached the retriggering sequence within a certain interval. However, it is not possible to check whether the watchdog has been triggered multiple times during an interval. Hence, it is only possible to check whether user code is executed within a maximum time, but not whether it is executed with the correct frequency.<br />

A window watchdog features a time window in which it expects to be triggered. Only if it is triggered within the window is it properly reset; if it is triggered outside of the window, it reports an error. Window watchdogs therefore not only allow checking whether user code is executed within a maximum time, but also introduce a minimum time. With window watchdogs, it becomes possible to monitor whether software is executed within defined timing constraints.<br />

On bare-metal systems with a super-loop architecture, the watchdog can be triggered in each iteration of the super loop. However, most systems run an operating system where the alive state is only given if certain tasks are executed periodically. In this case, every periodic task has to be monitored. This can be solved by defining a background or watchdog task, which triggers the watchdog only if all tasks report execution. With this approach, the task triggering the watchdog needs to have a longer cycle time and a lower priority than any of the monitored tasks.<br />

As a major drawback, only cyclic tasks can be monitored using alive monitoring. Event-driven tasks do not have a constant start time; more advanced concepts like deadline monitoring or control flow monitoring are required to secure such tasks.<br />

While the alive state of a single-core controller can be defined quite easily, the alive state of a multicore controller with multiple independent CPUs might be more complex. However, shared resources and similar mechanisms can be used for inter-core communication. In this scenario, a watchdog task on one core can gather information about all the tasks executed, even those on remote cores. Alive monitoring on multicore controllers is manageable, but requires a well-designed overall architecture which considers alive monitoring and inter-core communication.<br />

IV. REALTIME MONITORING<br />

Real-time monitoring measures the execution time of a<br />

software function and compares it against a given design goal.<br />

The following picture shows the most important timing gates<br />

of a software function. The release time R of a function<br />

determines the earliest point of time a function can start. The<br />

start time S is the true time the function will start and the end<br />

time E is the true time the function does end. The deadline D is<br />

the latest time the function may end. The computation time C is<br />

the time, the function is active. As long as the execution of the<br />

function is not interrupted, C=E-S.<br />

Figure 4 - Timing gates<br />

Functional Watchdogs extend the trigger mechanism by<br />

introducing a protocol. Only if the protocol is adhered the<br />

watchdog is triggered, otherwise an error is reported. An<br />

example for a functional watchdog is the question and answer<br />

watchdog, where the watchdog provides a question, which has<br />

to be answered by software. In the most basic fashion, there is<br />

a limited set of questions and the answers are stored in a<br />

constant table. Functional watchdogs add a certain complexity<br />

and allow checking not only if the watchdog is triggered but<br />

also if basic mechanisms of the system are operational.<br />

As long as the conditions R ≤ S and E ≤ D hold for all functions, the system fulfills its real-time requirements.<br />

A very simple and commonly used solution is to measure the idle time of a low-priority background task. As long as the background task is executed, the system is not working at full capacity, at least if all runnables that should have been called in this cycle have been activated.<br />
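The idle-time measurement can be illustrated as follows (a sketch under the assumption of a spinning lowest-priority task and a calibrated idle baseline; all names are our own):

```cpp
#include <cstdint>

// Counter incremented by the lowest-priority task, which runs only
// when no other task needs the CPU.
static volatile uint32_t g_idleCounter = 0;

// Body of the background (idle) task.
void idleTaskTick() { g_idleCounter++; }

// Called once per monitoring cycle. Returns the estimated CPU load in
// percent, given the idle count an otherwise unloaded system reaches
// in one cycle (determined by calibration).
uint8_t cpuLoadPercent(uint32_t idleBaseline) {
    uint32_t idle = g_idleCounter;
    g_idleCounter = 0;  // restart the measurement for the next cycle
    if (idle >= idleBaseline) return 0;  // fully idle (or calibration off)
    return static_cast<uint8_t>(100u - (100u * idle) / idleBaseline);
}
```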

To get a more detailed picture, we can also start a measurement at the entry point of the function and stop it at the end, which yields values for the timing gates S and E. Safety operating systems like PXROS-HR [7] provide special services to abort a function if a timing gate is missed.<br />

www.embedded-world.eu<br />

79


1 abortEventMask = PxExpectAbort(ev, func);<br />

2 if (abortEventMask.events != 0)<br />

3 {<br />

4 //Do some error handling<br />

5 }<br />

Figure 5 - Realtime monitoring using abort functions<br />

In line 1, the function func is called together with an event that will terminate func if the event is activated. A possible configuration would be, for example, a 1 ms timing event. If the function requires more than 1 ms of runtime, it is terminated and error handling can be initiated. Compared to traditional timing measurement, this approach has the advantage that the worst-case execution time of the function is known, allowing accurate detection of timing violations. If a timing violation is caught at the function level, aggressive error escalation at the system level can be avoided.<br />

Another advantage is that we can start the aborting event at the release time R and set it to the maximum cycle time (D - R) to protect all runnables that will be called during this period.<br />

V. CONTROL FLOW MONITORING<br />

In a multicore environment that introduces asynchronous, non-blocking messaging mechanisms between the cores, deadline monitoring alone reaches its limits, as the following example shows.<br />

In order to avoid unintended data corruption on communication channels, only one task in the system is allowed to physically access the communication ports, e.g. CAN. All other tasks that want to use this port send a message containing the data to be transmitted to this service task. The service task queues and transmits the data over the bus and returns an answer protocol to the requesting task using another message.<br />

For data transmission, the requester task thus requires two runnables: one for sending the message and one that is activated upon message receipt to process the return data. Let us assume the following valid sequence of operations:<br />

Figure 6 - Using abort event for deadline monitoring<br />

A more data-centric approach is to store the age of the data together with the data payload. The age metadata is set to 0 whenever a new value is written and is incremented at cyclic intervals, e.g. every 1 ms. Before using the data, the age, and implicitly the call that updated the data, can be verified.<br />

template <class T><br />
class data {<br />
    uint32_t m_age;<br />
    T m_data;<br />
};<br />

The disadvantage of this solution is the comparatively high runtime overhead required for the cyclic data update. This can be limited by increasing the update cycle time, but doing so decreases the precision of the age information.<br />

An alternative approach is to store the absolute time whenever the data is updated. When using the data, the current time is subtracted to obtain the age. As this typically happens only a few times per cycle, the overhead is comparably low and a higher timer resolution can be applied. Furthermore, the absolute time can be used to analyze the control flow to a certain extent.<br />
Figure 7 - Asynchronous communication<br />

Runnable run2() transfers data to a service task on core 2 using an asynchronous message. The service is executed in runnable run3(), and the return value is sent back using another message, which activates runnable run4() on core 1. The received data is stored in shared memory and is used by runnable run5(). Runnable run5() assumes that the data has been updated in the current cycle.<br />

What happens if the service runnable run3() is delayed? If the delay is short, the timeout event will fire, and a correct detection and handling of the error, as shown in Figure 8, is possible.<br />

80


Figure 8 - Asynchronous messaging and deadline monitoring<br />

If the delay increases, we might run into the situation where the update of the data in run4() happens in the next cycle. Core 2 might detect the deadline violation of run3() (at least if the service is executed in a blocking way), but core 1 might not be aware of what has happened, because the timing event supervising the execution of the runnables has been reset at the deadline D. As no code has been executed at that time, the behavior appears to be correct.<br />

Figure 9 - Undetected error with asynchronous messaging<br />

What can be done? One option would be to escalate a possible deadline violation on core 2 to core 1, but this results in a complex error-handling hierarchy.<br />

The key problem is the following: whereas the runnables run1() and run2() are called synchronously, the event-driven runnables run4() and run5() have a rather stochastic start time. Obviously, we have to verify that all runnables are executed in the expected order and within the expected timeline to ensure proper operation of the system.<br />

A comparatively easy approach, which is a first step toward control flow monitoring, is to use the update-time metadata concept introduced in the previous section. By comparing the update time of the message data in run4() with the request time of run2(), the significant delay becomes visible.<br />
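The update-time metadata concept can be sketched like this (hypothetical names; the age-counter variant described earlier stores an incrementing counter instead of an absolute timestamp):

```cpp
#include <cstdint>

// Payload stored together with the absolute time of its last update.
// Field names and the millisecond interpretation are assumptions.
template <typename T>
struct TimedData {
    uint32_t updateTime = 0;  // absolute time of last write, e.g. in ms
    T        payload{};
};

// Writing costs only one additional store per update.
template <typename T>
void writeData(TimedData<T>& d, const T& value, uint32_t now) {
    d.payload = value;
    d.updateTime = now;
}

// Before using the data, verify that it was produced recently enough,
// i.e. that the implicit update call actually happened in this cycle.
template <typename T>
bool isFresh(const TimedData<T>& d, uint32_t now, uint32_t maxAge) {
    return (now - d.updateTime) <= maxAge;
}
```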

Alive and deadline monitoring mainly focus on runtime constraints, whereas control flow monitoring checks whether code is executed in a valid sequence. How can we describe a valid sequence? In the example above, the flow 1-2-3-4-5 is obviously valid. This, however, is only true if all runnables are executed in the same cycle, which adds another condition. Furthermore, runnable run1() is independent of the other runnables and may be executed at any time, i.e. 2-1-3-4-5 would also be a valid sequence. This trivial example shows that describing and validating all rules for a real system becomes very complex and time-consuming.<br />

A compromise is to monitor only those critical sequences and conditions which cannot be detected using the simpler and more robust alive and real-time monitoring approaches.<br />

An alternative implementation is to use token passing. Tokens can be sent from one runnable to another to ensure the correct order of execution. However, tokens also have limitations if there are multiple valid sequences or if the data path is reconfigured at runtime.<br />
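A token-passing check for a single expected chain might look as follows (illustrative only; the runnable IDs and the single-chain layout are assumptions):

```cpp
#include <cstdint>

// Token currently held; 0 marks the start of a cycle.
static uint8_t g_token = 0;

// Called at the entry of runnable `id`, where `pred` is its required
// predecessor. Returns false (a control-flow error) if the token held
// does not match, i.e. the expected execution order was broken.
bool passToken(uint8_t pred, uint8_t id) {
    if (g_token != pred) return false;
    g_token = id;
    return true;
}

// Reset at the start of every cycle.
void resetTokenCycle() { g_token = 0; }
```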

For a multicore system, a key requirement for control flow monitoring is synchronous operation of the cores, which is typically hard to achieve, as the cores might have different boot times. Using a common timer to store the update time of data signals is a possible way to solve this problem.<br />

VI. ERROR ESCALATION AND REACHING THE SAFE STATE<br />

Occasional violations detected by real-time or control flow monitoring might be handled locally without negative impact on the safety function. However, frequent timing violations, as well as violations of the alive monitoring, indicate a severe malfunction of the system and need to be escalated.<br />

One approach to satisfy these needs is the introduction of a warning counter with a threshold. A detected timing violation increases the warning counter, while correct timing decreases it. If a predefined threshold is reached, the timing violation is escalated.<br />
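The warning counter with threshold can be sketched directly (the threshold value and the names are illustrative assumptions):

```cpp
#include <cstdint>

// Warning counter: a timing violation increments the count, a correct
// cycle decrements it (never below zero), and reaching the threshold
// escalates the violation.
class WarningCounter {
public:
    explicit WarningCounter(uint32_t threshold) : m_threshold(threshold) {}

    // Returns true if the violation must be escalated.
    bool reportViolation() {
        ++m_count;
        return m_count >= m_threshold;
    }

    void reportCorrectTiming() {
        if (m_count > 0) --m_count;
    }

    uint32_t count() const { return m_count; }

private:
    uint32_t m_count = 0;
    uint32_t m_threshold;
};
```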

The escalation of critical timing violations needs to be handled in hardware, as it can no longer be guaranteed that software is executed properly. Microcontrollers used for safety applications, such as the Infineon AURIX, feature a Safety Management Unit (SMU), which collects hardware error signals and defines the system reaction to an error. All watchdog error events cause so-called SMU alarms. The SMU reaction to an alarm is configurable; it is possible to send an interrupt/NMI request, to stop certain cores, or to perform a reset. To implement a safety architecture, the SMU needs to be combined with an external watchdog such as the Infineon TLF35584, a multiple-output power supply for safety-relevant applications. In addition to power supply functionality, it provides functional safety features such as voltage monitoring, external watchdogs, and error monitoring.<br />

A companion chip reduces the probability of common-cause failures, as it is equipped with its own vital components such as power supply and clock generation. By creating its own time domain on an external chip, the reliability of the watchdog concept is increased. Standards such as ISO 26262 require the utilization of an external monitor in order to reach higher safety integrity levels.<br />



Furthermore, all power domains are permanently monitored for over-voltage and over-current conditions. In our architecture, the module also performs alive monitoring for all CANopen nodes. The implementation of the logic is based on a Cypress PSoC, where the safety functions are realized in software and programmable hardware. This allows easy adaptation of the system to different user requirements.<br />

Figure 10 - AURIX Microcontroller with TLF companion chip<br />

With the presented methods, it is possible to detect and escalate timing violations within the scope of the microcontroller. However, as the controller is usually part of a larger system, it must be ensured that errors detected within the scope of the controller do not lead to unintended behavior of attached modules, such as actuators.<br />

A possible solution is the implementation of a safe power supply unit. A prototype of such a system has been developed within the publicly funded ZIM project "Future Technology Multicore", which focuses on providing design patterns and solutions for safe multicore applications. The safe power supply "SmartPower" provides the supply voltage as well as boot-up, reboot, and shutdown sequences for three safety domains: ECU, logic modules, and actuators. "SmartPower" is connected to the safe-state pin of the TLF and turns off the system as a last line of defense in case of fatal errors.<br />

VII. REFERENCES<br />

[1] IEC 61508, Part 3: Software requirements / Functional safety of electrical/electronic/programmable electronic safety-related systems, VDE, 2001 (3 July 2001).<br />
[2] ISO 26262, Part 6: Product development: software level / Road vehicles - Functional safety, Geneva, 2011.<br />

[3] J. Barth, et al., 10 Schritte zum Performance Level,<br />

Bosch Rexroth Group, 2014.<br />

[4] Thomas Barth, Peter Fromm, A Monitoring Based Safety<br />

Architecture for Multicore Microcontrollers, Nürnberg,<br />

2017.<br />

[5] Thomas Barth, Peter Fromm, Functional Safety on<br />

Multicore Microcontrollers for Industrial Applications,<br />

Nürnberg, 2016.<br />

[6] Prof. Dr.-Ing. Peter Fromm, Thomas Barth, Mario<br />

Cupelli, Sicherheit auf allen Kernen - Entwicklung einer<br />

Safety Architektur auf dem AURIX TC27x,<br />

Sindelfingen, 2015.<br />

[7] HighTec EDV Systeme GmbH, Tricore Development<br />

Platform User Guide v4.6.5.0, Saarbrücken, 2015.<br />

Figure 11 - Complete safety architecture<br />



Developing Medical Device Software to be<br />

compliant with IEC 62304-Amendment 1:2015<br />

Mark A. Pitchford<br />

Technical Specialist<br />

LDRA<br />

Wirral, UK<br />

mark.pitchford@ldra.com<br />

I. INTRODUCTION<br />

Paraphrasing European Union Directive 2007/47/EC of the European Parliament and of the Council 1 , a medical device can be defined as:<br />

“Any instrument, apparatus, appliance, software, material or<br />

other article, whether used alone or in combination … to be<br />

used for human beings for the purpose of:<br />

• Diagnosis, prevention, monitoring, treatment, or<br />

alleviation of disease<br />

• Diagnosis, monitoring, treatment, alleviation of, or<br />

compensation for an injury or [disability]<br />

• Investigation, replacement, or modification of the<br />

anatomy or of a physiological process<br />

• Control of conception”<br />

Given that such definitions encompass a large majority of<br />

medical products other than drugs, it is small wonder that<br />

medical device software now permeates a huge range of<br />

diagnostic and delivery systems. The reliability of the<br />

embedded software used in these devices and the risk<br />

associated with it has been an ever-increasing concern as that<br />

software becomes ever more prevalent.<br />

As an initial response to that concern, the functional safety<br />

standard IEC 62304 3 “Medical device software – Software life<br />

cycle processes” emerged in 2006 as an internationally<br />

recognized mechanism for the demonstration of compliance<br />

with the relevant local legal requirements 4 . The set of<br />

processes, activities, and tasks described in this standard<br />

established a common framework for medical device software<br />

life cycle processes as shown in Figure 1.<br />

FDA's Center for Devices and Radiological Health (CDRH) is<br />

responsible for regulating firms who manufacture, repackage,<br />

relabel, and/or import medical devices sold in the United<br />

States. The FDA’s introduction to its rules for medical device<br />

regulation states 2 :<br />

“Medical devices are classified into Class I, II, and III.<br />

Regulatory control increases from Class I to Class III. The<br />

device classification regulation defines the regulatory<br />

requirements for a general device type. Most Class I devices<br />

are exempt from Premarket Notification 510(k); most Class II<br />

devices require Premarket Notification 510(k); and most Class<br />

III devices require Premarket Approval.”<br />

Figure 1: Overview of software development processes and<br />

activities according to IEC 62304:2006 +AMD1:2015 5<br />

1<br />

"Directive 2007/47/ec of the European parliament and of the council".<br />

Eur-lex Europa. 5 September 2007.<br />

2 https://www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/<br />

Overview/<br />

3<br />

IEC 62304 International Standard Medical device software – Software<br />

life cycle processes Edition 1 2006-05<br />

4<br />

IEC 62304 International Standard Medical device software – Software life<br />

cycle processes Consolidated Version Edition 1.1 2015-06<br />

5<br />

IEC 62304:2006/AMD1:2015 Amendment 1 - Medical device software<br />

- Software life cycle processes Figure 1 – Overview of software<br />

development PROCESSES and ACTIVITIES<br />



On June 15, 2015, the International Electrotechnical<br />

Commission, IEC, published Amendment 1:2015 to the IEC<br />

62304 standard “Medical device software – software life cycle<br />

processes” 6 . The amendment complements the 1st edition<br />

from 2006 by adding and amending various requirements,<br />

including those relating to safety classification, the handling of<br />

legacy software, and software item separation.<br />

In practice, for all but the most trivial applications, compliance<br />

with IEC 62304 can only be demonstrated efficiently with a<br />

comprehensive suite of automated tools. This paper describes<br />

the key software development and verification processes of<br />

the standard, and shows how automation both minimizes the<br />

cost of development and verification, and provides a sound<br />

foundation for an effective maintenance system once the<br />

product is in the field.<br />

Work on the second, updated edition of IEC 62304 is ongoing.<br />

The 2nd edition will possibly be published in 2018. It seems<br />

very likely that the changed requirements included in<br />

Amendment 1:2015 will be integrated into the updated edition.<br />

II. CLASSIFICATION<br />

One of the more significant changes concerns the new risk-based approach to the safety classification of medical device software. The previous concept was based exclusively on the severity of the resulting harm. Downgrading the safety classification of medical device software from C to B, or from B to A, used to be possible by adopting hardware-based risk mitigation measures external to the software. The amendment replaces this concept with the safety classification shown in the decision tree of Figure 2.<br />

The three classes are defined in the standard as follows:<br />

Class A<br />

The software system cannot contribute to a hazardous<br />

situation, or the software system can contribute to a hazardous<br />

situation which does not result in unacceptable risk after<br />

consideration of risk control measures external to the software<br />

system.<br />

Class B<br />

The software system can contribute to a hazardous situation<br />

which results in unacceptable risk after consideration of risk<br />

control measures external to the software system, but the<br />

resulting possible harm is non-serious injury.<br />

Class C<br />

The software system can contribute to a hazardous situation<br />

which results in unacceptable risk after consideration of risk<br />

control measures external to the software system, and the<br />

resulting possible harm is death or serious injury.<br />

III. PARTITIONING OF SOFTWARE ITEMS<br />

The classification assigned to any medical device software has a tremendous impact on the code development process, from planning, developing, testing, and verification through to release and beyond. It is therefore in the interest of medical device manufacturers to invest the effort to get it right the first time, minimizing unnecessary overhead by resisting over-classification, but also avoiding expensive and time-consuming rework resulting from under-classification.<br />

IEC 62304:2006 +AMD1:2015 helps to minimise<br />

development overhead by permitting software items to be<br />

segregated. In doing so, it requires that “The software<br />

ARCHITECTURE should promote segregation of software<br />

items that are required for safe operation and should describe<br />

the methods used to ensure effective segregation of those<br />

SOFTWARE ITEMS”<br />

Figure 2: Safety classification according to IEC 62304:2006<br />

+AMD1:2015 7<br />

Amendment 1 clarifies the position on software segregation by stating that segregation is not restricted to physical separation, but instead permits "any mechanism that prevents one SOFTWARE ITEM from negatively affecting another", suggesting that separation in software is similarly valid.<br />

6 IEC 62304:2006/AMD1:2015 AMENDMENT 1 - MEDICAL DEVICE SOFTWARE<br />

- SOFTWARE LIFE CYCLE PROCESSES<br />

7<br />

IEC 62304:2006/AMD1:2015 Amendment 1 - Medical device software<br />

- Software life cycle processes Figure 3 – Assigning software safety<br />

classification<br />




Figure 3: Example of partitioning of software items according<br />

to IEC 62304:2006 +AMD1:2015 Figure B.1 8<br />

Figure 3 shows the example used in the standard. In it, a software system has been designated Class C. That system can be segregated into one software item that deals with functionality of limited safety implications (software item X), and another that handles the highly safety-critical aspects of the system (software item Y).<br />

That principle can be repeated in a hierarchical manner, such<br />

that software item Y can itself be segregated into software<br />

items W and Z, and so on – always on the basis that no<br />

segregated software item can negatively affect another. At the<br />

bottom of the hierarchy, software items such as X, W and Z<br />

that are divided no further are defined as software units.<br />

IV. CLAUSE 5. SOFTWARE DEVELOPMENT PROCESS<br />

In practice, any company developing medical device software will carry out verification, integration, and system testing on all software regardless of the safety classification, but the depth to which each of those activities is performed varies considerably. Figure 4, which is based on Table A.1 of the standard, gives an overview of what is involved.<br />

For example, sub-clause 5.4.2 of the standard states that "The MANUFACTURER shall document a design with enough detail to allow correct implementation of each SOFTWARE UNIT." Reference to Figure 4 shows that this requirement applies only to Class C code.<br />

8 IEC 62304:2006/AMD1:2015 Amendment 1 - Medical device<br />

software - Software life cycle processes Figure B.1 – Example of<br />

partitioning of SOFTWARE ITEMS<br />

Software Development PROCESS requirements by software safety CLASS<br />
<br />
Clause | Sub-clauses | Class A | Class B | Class C<br />
5.1 Software development planning | 5.1.1, 5.1.2, 5.1.3, 5.1.6, 5.1.7, 5.1.8, 5.1.9 | X | X | X<br />
5.1 (cont.) | 5.1.5, 5.1.10, 5.1.11, 5.1.12 | - | X | X<br />
5.1 (cont.) | 5.1.4 | - | - | X<br />
5.2 Software requirements analysis | 5.2.1, 5.2.2, 5.2.4, 5.2.5, 5.2.6 | X | X | X<br />
5.2 (cont.) | 5.2.3 | - | X | X<br />
5.3 Software ARCHITECTURAL design | 5.3.1, 5.3.2, 5.3.3, 5.3.4, 5.3.6 | - | X | X<br />
5.3 (cont.) | 5.3.5 | - | - | X<br />
5.4 Software detailed design | 5.4.1 | - | X | X<br />
5.4 (cont.) | 5.4.2, 5.4.3, 5.4.4 | - | - | X<br />
5.5 SOFTWARE UNIT implementation and verification | 5.5.1 | X | X | X<br />
5.5 (cont.) | 5.5.2, 5.5.3, 5.5.5 | - | X | X<br />
5.5 (cont.) | 5.5.4 | - | - | X<br />
5.6 Software integration and integration testing | All requirements | - | X | X<br />
5.7 SOFTWARE SYSTEM testing | All requirements | X | X | X<br />
5.8 Software release | 5.8.1, 5.8.2, 5.8.4, 5.8.7, 5.8.8 | X | X | X<br />
5.8 (cont.) | 5.8.3, 5.8.5, 5.8.6 | - | X | X<br />

Figure 4: Summary of the software safety classes assigned to each requirement of the development lifecycle, with clause 5.4.2 as an example 9 .<br />

IEC 62304 is essentially an amalgam of existing best practice<br />

in medical device software engineering, and the functional<br />

safety principles recommended by the more generic functional<br />

safety standard IEC 61508 10 , which has been used as a basis for<br />

industry specific interpretations in a host of sectors as diverse<br />

9<br />

Based on IEC 62304:2006/AMD1:2015 Amendment 1 - Medical<br />

device software - Software life cycle processes Table A.1 – Summary<br />

of requirements by software safety class<br />

10<br />

IEC 61508:2010 Functional safety of<br />

electrical/electronic/programmable electronic safety-related systems<br />



as the rail industry, the process industries, and earth moving<br />

equipment manufacture.<br />

A process-wide, proven tool suite has been shown to help ensure compliance with such software safety standards (in addition to security standards) by automating both the analysis of the code from a software quality perspective and the required validation and verification work. Equally important, such a tool suite enables life-cycle transparency and traceability into and throughout the development and verification activities, facilitating audits by both internal and external entities.<br />

The V diagram in Figure 5 illustrates how tools can help through<br />

the software development process described by IEC 62304. The<br />

tools also provide critical assistance through the software<br />

maintenance process (clause 6) and the risk management process<br />

(clause 7). Clause 5 of IEC 62304 details the software<br />

development process through eight stages ending in release.<br />

Notice that the elements of Clause 5 map to those in Figure 1<br />

and Figure 5.<br />


Software Requirements Analysis (Sub-clause 5.2) involves<br />

deriving and documenting the software requirements based on<br />

the system requirements.<br />

Achieving a format that lends itself to bidirectional traceability will help in demonstrating compliance with the standard. Bigger projects, perhaps with contributors in geographically diverse locations, are likely to benefit from an application lifecycle management tool such as IBM ® Rational ® DOORS ®11 or Siemens ® Polarion ® PLM ®12 . Smaller projects can cope admirably with carefully worded Microsoft ® Word ® or Microsoft ® Excel ® documents, written to facilitate links up and down the development process model.<br />

This bidirectional traceability of requirements 13 (Figure 6) would be easily achieved in an ideal world. But most projects suffer from unexpected changes of requirements imposed by a customer. What is then impacted? Which requirements need re-writing? Which elements of the code design? What code needs to be revised? And which parts of the software will require re-testing?<br />

Figure 5: Mapping the capabilities of the LDRA tool suite to<br />

the guidelines of IEC 62304:2006 +AMD1:2015<br />

Sub-clause 5.1 Software Development Planning outlines the<br />

first objective in the software development process, which is<br />

to plan the tasks needed for development of the software in<br />

order to reduce risks and communicate procedures and goals<br />

to members of the development team.<br />

The foundations for an efficient development cycle can be<br />

established by using tools that can facilitate structured<br />

requirements definition, such that those requirements can be<br />

confirmed as met by means of automated document (or<br />

“artefact”) generation.<br />

The preparation of a mechanism to demonstrate that the requirements have been met will involve the development of detailed plans. A prominent example would be the software verification plan, which includes the tasks to be performed during software verification and their assignment to specific resources.<br />
Figure 6: An Illustration of the principles of Bidirectional<br />

Traceability<br />

Requirements rarely remain unchanged throughout the lifetime of a project, and that can turn the maintenance of a traceability matrix into an administrative nightmare. Furthermore, connected systems extend that headache into the maintenance phase, requiring revision whenever a vulnerability is exposed.<br />

A requirements traceability tool alleviates this concern by<br />

automatically maintaining the connections between<br />

requirements, development, and testing artefacts and activities.<br />

Any changes in the associated documents or software code are<br />

automatically highlighted such that any tests required to be<br />

revisited can be dealt with accordingly (Figure 7).<br />

11<br />

http://www-03.ibm.com/software/products/en/ratidoor<br />

12<br />

https://polarion.plm.automation.siemens.com/<br />

13<br />

http://www.compaid.com/caiinternet/ezine/westfall-bidirectional.pdf<br />

Bidirectional Requirements Traceability, Linda Westfall<br />



Figure 7: Automating requirements traceability with the<br />

TBmanager component of the LDRA tool suite<br />

Software Architectural Design (Sub-clause 5.3) requires the<br />

manufacturer to define the major structural components of the<br />

software, their externally visible properties, and the<br />

relationships between them. Any software component<br />

behaviour that can affect other components should be<br />

described in the software architecture, such that all software<br />

requirements can be implemented by the specified software<br />

items. This is generally verified by technical evaluation.<br />

Developing the architecture means defining the interfaces<br />

between the software items that will implement the<br />

requirements. Any third-party software integration must be in<br />

accordance with Sub-clause 4.4, “Legacy Software”.<br />

If a model-based approach is taken to software architectural<br />

design using tools such as MathWorks ® Simulink ®14 , IBM ®<br />

Rational ® Rhapsody ®15 , or ANSYS ® SCADE 16 , then their<br />

integration with test tools will make for seamless analysis of<br />

generated code and ensure traceability to the models.<br />

Software Detailed Design (Sub-clause 5.4) involves the<br />

specification of algorithms, data representations, and<br />

interfaces between different software units and data structures<br />

to implement the verified requirements and architecture.<br />

Later in the development cycle, tools can help by generating<br />

graphical artefacts suited to the review of the implemented<br />

design by means of walkthroughs or inspections. One<br />

approach is to prototype the software architecture in an<br />

appropriate programming language, which can also help to<br />

find any anomalies in the design. Graphical artefacts like call graphs and flow graphs are well suited for use in the review of the implemented design by visual inspection (Figure 8).<br />

Figure 8: Diagrammatic representations of control and data<br />

flow generated from source code by the LDRA tool suite aid<br />

verification of software architectural and detailed design<br />

Software Unit Implementation and Verification (Sub-clause 5.5) involves the translation of the detailed design into source code. To consistently achieve the desirable code characteristics, coding standards should be used to specify a preferred coding style, aid understandability, apply language usage rules or restrictions, and manage complexity. The code for each unit should be verified using a static analysis tool to ensure that it complies in a timely and cost-effective manner.<br />

Verification tools offer support for a range of coding standards such as MISRA C and C++, JSF++ AV, HIS, CERT C, and CWE. The better tools will be able to confirm adherence to a very high percentage of the rules dictated by each standard, and will also support the creation of, and adherence to, in-house standards from both user-defined and industry-standard rule sets.<br />

IEC 62304 also requires strategies, methods, and procedures<br />

for verifying each software unit. Amongst the acceptance<br />

criteria are considerations such as the verification of the<br />

proper event sequence, data and control flow, fault handling,<br />

memory management and initialization of variables, memory<br />

overflow detection and checking of all software boundary<br />

conditions.<br />

Unit test tools offer a graphical user interface for the<br />

specification of requirements-based tests and for presenting a list<br />

of all such defined test cases with appropriate pass/fail status.<br />

By extending the process to the automatic generation of test<br />

vectors, such tools provide a straightforward means to analyse<br />

boundary values without creating each test case manually.<br />

Test sequences and test cases are retained so that they can be<br />

repeated (“regression tested”), and the results compared with<br />

those generated when they were first created.<br />
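As a sketch of what such requirements-based boundary tests look like (the unit under test and its limits are invented for illustration), consider a function that clamps a reading to the range 0–100; the test cases sit at the limits and one step either side:<br />

```c
#include <assert.h>

/* Hypothetical unit under test: clamp a raw reading into [0, 100]. */
static int clamp_percent(int raw)
{
    if (raw < 0)   { return 0; }
    if (raw > 100) { return 100; }
    return raw;
}

/* Boundary-value cases: each limit, just inside, and just outside. */
static void test_clamp_boundaries(void)
{
    assert(clamp_percent(-1)  == 0);    /* just below lower bound */
    assert(clamp_percent(0)   == 0);    /* lower bound            */
    assert(clamp_percent(1)   == 1);    /* just above lower bound */
    assert(clamp_percent(99)  == 99);   /* just below upper bound */
    assert(clamp_percent(100) == 100);  /* upper bound            */
    assert(clamp_percent(101) == 100);  /* just above upper bound */
}
```

A tool generating such vectors automatically would derive the same six cases from the declared range without manual effort.<br />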

14 https://uk.mathworks.com/products/simulink.html<br />

15 http://www-03.ibm.com/software/products/en/ratirhapfami<br />

16 http://www.ansys.com/products/embedded-software/ansys-scade-suite<br />


Thorough verification also requires static and dynamic data<br />

and control flow analysis. Static data flow analysis produces a<br />

cross reference table of variables, which documents their type,<br />

and where they are utilized within the source file(s) or system<br />

under test. It also provides details of data flow anomalies,<br />

procedure interface analysis and data flow standards<br />

violations.<br />

Dynamic data flow analysis builds on that accumulated<br />

knowledge, mapping coverage information onto each variable<br />

entry in the table for current and combined datasets and<br />

populating flow graphs to illustrate the control flow of the unit<br />

under test.<br />
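For example (a contrived C fragment, not taken from any real code base), the kinds of anomaly such analysis reports, and a corrected unit, might look like this:<br />

```c
/* Anomalous fragment, as a static data flow analyser would report it:
 *
 *   int limit;            UR anomaly: 'limit' Referenced while Undefined
 *   return limit + 1;
 *
 *   int cached = x * 2;   DU anomaly: value Defined, then never Used
 */

/* Corrected unit: every variable is defined before use,
 * and there are no dead stores.                          */
static int bounded_increment(int limit)
{
    int next = limit + 1;          /* define, then use          */
    return (next > 100) ? 100 : next;  /* single, live data path */
}
```

The cross reference table would then show `limit` and `next` with their types, definition sites, and every use within the unit.<br />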

Software Integration and Integration Testing (Sub-clause<br />

5.6) focuses on the transfer of data and control across a<br />

software module’s internal interfaces and external interfaces<br />

such as those associated with medical device hardware,<br />

operating systems, and third party software applications and<br />

libraries. This activity requires the manufacturer to plan and<br />

execute integration of software units into ever larger<br />

aggregated software items, ultimately verifying that the<br />

resulting integrated system behaves as intended.<br />

Integration testing can also be used to demonstrate program<br />

behaviour at the boundaries of its input and output domains<br />

and to confirm program responses to invalid, unexpected, and<br />

special inputs. The program’s actions are revealed when given<br />

combinations of inputs or unexpected sequences of inputs are<br />

received, or when defined timing requirements are violated.<br />

The test requirements in the plan should include, as<br />

appropriate, the types of white box testing and black box<br />

testing to be performed as part of integration testing.<br />

To show which parts of the code base have been exercised<br />

during testing, the LDRA tool suite has the capability to<br />

perform dynamic structural coverage analysis, both at system<br />

test level and at unit test level. Mechanisms for structural<br />

coverage such as statement, branch, condition,<br />

procedure/function call, and data flow coverage vary in<br />

intensity, and so are specified by the standard depending on<br />

classification.<br />
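The difference in intensity between these coverage criteria can be seen on a small decision (a hypothetical interlock, invented here for illustration): branch coverage only needs the decision to evaluate both true and false, while condition-level criteria such as MC/DC additionally require each operand to be shown to independently affect the outcome.<br />

```c
#include <stdbool.h>

/* Hypothetical interlock: heater may run only if power is good
 * and either the door is closed or an override is active.      */
static bool heater_allowed(bool power_ok, bool door_closed, bool override)
{
    return power_ok && (door_closed || override);
}

/* Branch coverage: two tests suffice (decision true once, false once).
 * MC/DC: a minimal set also varies one operand at a time, e.g.
 *   (T,T,F) vs (F,T,F)  -> power_ok independently flips the result
 *   (T,T,F) vs (T,F,F)  -> door_closed flips it
 *   (T,F,T) vs (T,F,F)  -> override flips it
 */
```

A tool reporting condition coverage would show which of these operand combinations a given test run actually exercised.<br />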

A common approach is to operate unit and system test in<br />

tandem, so that (for instance) coverage can be generated for<br />

most of the source code through a dynamic system test, and<br />

complemented using unit tests to exercise constructs such as defensive<br />

code. It is advisable to re-run (or “regression test”) these test<br />

cases as a matter of course and perhaps automatically, to<br />

ensure that any changed code has not affected proven<br />

functionality elsewhere.<br />

Software System Testing (Sub-clause 5.7) requires the<br />

manufacturer to verify that the requirements for the software<br />

have been successfully implemented in the system as it will be<br />

deployed, and that the performance of the program is as<br />

specified.<br />

V. CLAUSE 6. SOFTWARE MAINTENANCE PROCESS<br />

With the advent of the connected device and the Internet of<br />

Things, system maintenance takes on a new significance.<br />

For any connected systems, requirements don’t just change<br />

in an orderly manner during development. They change<br />

without warning - whenever some smart Alec finds a new<br />

vulnerability, develops a new hack, compromises the<br />

system. And they keep on changing throughout the lifetime<br />

of the device.<br />

For that reason, the ability of next-generation automated<br />

management and requirements traceability tools and<br />

techniques to create relationships between requirements,<br />

code, static and dynamic analysis results, and unit- and<br />

system-level tests is especially valuable for connected<br />

systems. Linking these elements already enables the entire<br />

software development cycle to become traceable, making it<br />

easy for teams to identify problems and implement solutions<br />

faster and more cost effectively. But they are perhaps even<br />

more important after product release, presenting a vital<br />

competitive advantage in the ability to respond quickly and<br />

effectively whenever security is compromised.<br />

Many software modifications will require changes to the<br />

existing software functionality – perhaps with regards to<br />

additional utilities in the software. In such circumstances, it is<br />

important to ensure that any changes made or additions to the<br />

software do not adversely affect the existing code.<br />

Automatically maintaining the connections between the<br />

requirements, development, and testing artefacts and activities<br />

helps alleviate this concern – not just during development, but<br />

onwards into deployment and the maintenance phase.<br />

VI. CONCLUSION<br />

A software functional safety standard such as that<br />

prescribed by IEC 62304 with its many sections, clauses<br />

and sub-clauses may at first seem intimidating. However,<br />

once broken down into digestible pieces, its guiding<br />

principles offer sound guidance in the establishment of a<br />

high quality software development process, not only<br />

leading up to initial product release but into maintenance<br />

and beyond. Such a process is paramount for the assurance<br />

of true reliability and quality—and above all the safety and<br />

effectiveness of medical devices. When used with a<br />

complementary and comprehensive suite of tools for<br />

analysis and testing, it can smooth the way for development<br />

teams to work together to effectively develop and maintain<br />

large projects with confidence in their quality.<br />

VII. WORKS CITED<br />

"Directive 2007/47/ec of the European parliament and of the<br />

council". Eur-lex Europa. 5 September 2007.<br />

US Food and Drug Administration website<br />

https://www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/Overview/<br />



IEC 62304 International Standard Medical device software –<br />

Software life cycle processes Edition 1 2006-05<br />

IEC 62304 International Standard Medical device software –<br />

Software life cycle processes Consolidated Version Edition<br />

1.1 2015-06<br />

IEC 61508:2010 Functional safety of<br />

electrical/electronic/programmable electronic safety-related<br />

systems<br />

IBM Rational DOORS website<br />

http://www-03.ibm.com/software/products/en/ratidoor<br />

Siemens Polarion ALM website<br />

https://polarion.plm.automation.siemens.com/<br />

Object Management Group Requirements Interchange Format<br />

website http://www.omg.org/spec/ReqIF/<br />

Bidirectional Requirements Traceability, Linda Westfall<br />

http://www.compaid.com/caiinternet/ezine/westfallbidirectional.pdf<br />

MathWorks SIMULINK website<br />

https://uk.mathworks.com/products/simulink.html<br />

IBM Rational Rhapsody family website<br />

http://www-03.ibm.com/software/products/en/ratirhapfami<br />

ANSYS SCADE Suite website<br />

http://www.ansys.com/products/embedded-software/ansys-scade-suite<br />

COMPANY DETAILS<br />

LDRA<br />

Portside<br />

Monks Ferry<br />

Wirral<br />

CH41 5LH<br />

United Kingdom<br />

Tel: +44 (0)151 649 9300<br />

Fax: +44 (0)151 649 9666<br />

E-mail: info@ldra.com<br />

Presentation Co-ordination<br />

Mark James<br />

Marketing Manager<br />

E:mark.james@ldra.com<br />

Presenter<br />

Mark Pitchford<br />

Technical Writer<br />

E:mark.pitchford@ldra.com<br />



Certifying Linux: Lessons Learned in Three Years of<br />

SIL2LinuxMP<br />

Andreas Platschek<br />

OpenTech EDV Research GmbH<br />

Augasse 21<br />

2193 Bullendorf, AUSTRIA<br />

andi@opentech.at<br />

Nicholas Mc Guire<br />

OSADL eG<br />

Am Neuenheimer Feld 583<br />

D-69120 Heidelberg, GERMANY<br />

hofrat@osadl.org<br />

Lukas Bulwahn<br />

BMW Car IT GmbH<br />

Moosacher Straße 86<br />

80809 Munich, GERMANY<br />

Lukas.Bulwahn@bmw-carit.de<br />

Abstract—When the SIL2LinuxMP project was started about<br />

three years ago, many non-safety-critical systems using Linux<br />

were already built and in operation. Industry chose this design<br />

mostly due to Linux’s tremendous security capabilities<br />

as well as its unmatched support for modern hardware. Both<br />

requirements are important for modern industrial applications<br />

and can be met using Linux on contemporary multi-core CPUs.<br />

However, the question of whether a safety argumentation for systems<br />

based on Linux can be provided and maintained was still open.<br />

While the ultimate goal of certifying a system based on Linux<br />

has still not been achieved as of today, it definitely is in reach<br />

for the basic components (Linux, glibc, busybox).<br />

The SIL2LinuxMP project was started as an industrial<br />

research project with the goal to find out whether or not it<br />

is possible to build complex software-intensive safety-related<br />

systems using the Linux operating system as its foundation.<br />

During the course of those last years, a number of potential<br />

issues that were seen in the early days turned out to be mostly<br />

manageable, while other problems took us by surprise. The most<br />

striking one is that, to this day, no certified multi-core<br />

CPU (with four or more cores) seems to be available.<br />

This paper not only presents the issues encountered and<br />

status achieved during the last three years, it also discusses the<br />

approaches currently being proposed to resolve them.<br />

These approaches cover all aspects of the system safety lifecycle.<br />

At the system engineering level, we devised appropriate<br />

processes to tailor the safety process, moving from a development<br />

to a controlled selection process. We developed a layered<br />

system hazard analysis to systematically derive adequate safety<br />

properties and demonstrated the capabilities of this analysis on<br />

a use case. We covered the Linux development process with an<br />

assessment for which we devised data mining methods, to quantify<br />

software quality, utilizing the available development data. Based<br />

on this data, we derived statistical arguments to demonstrate the<br />

suitability of the development process. To address residual uncertainty<br />

in the area of Linux source code and its assessment, we<br />

combined the quality assessments for multiple semi-independent<br />

Linux features, capable of mitigating the same systematic fault,<br />

into a single safety argumentation. This multilayer handling of<br />

residual faults is based on a software layers-of-protection analysis.<br />

These approaches provide the necessary means for a Linux<br />

qualification route suitable for up to safety integrity level 2<br />

according to IEC 61508 (SIL2).<br />

Keywords—Linux, Safety, Qualification of Pre-Existing Software<br />

I. INTRODUCTION<br />

Over recent years, industries have announced new developments<br />

that rely on highly complex systems. Examples of this<br />

are autonomous vehicles or shared working environments for<br />

industrial robots and humans. While these new developments<br />

promise great improvements for everyone’s private and work<br />

life, they have so far mostly been tackled from a functional<br />

side. However, beyond that functional side, it will also<br />

be necessary to consider the non-functional properties,<br />

such as safety and security, before these systems can<br />

be used by the general public.<br />

This paper focusses on the safety properties of such highly<br />

complex systems and gives insight into the experience that was<br />

gained during the first three years of the SIL2LinuxMP [1]<br />

project, managed by the Open Source Automation Development<br />

Lab (OSADL) and a number of partner companies from<br />

various industries.<br />

The goal of the SIL2LinuxMP project is to create a framework<br />

that can be used for providing a safety argumentation of<br />

a mainline Linux kernel that guides the certification process<br />

and reduces the effort of the qualification process as far as<br />

possible. We achieve this reduction by automating as much of<br />

the process as possible and by making the quality assessment of<br />

Linux repeatable for the continuously evolving Linux kernel.<br />

In order to verify that the framework is actually usable, we<br />

perform the qualification of Linux for a specific use case.<br />

Before investigating solutions, this paper shows the implications<br />

of the huge step up on the complexity scale, compared<br />

to previously existing safety-critical systems. These<br />

implications, discussed in Section II, justify the need for the<br />

SIL2LinuxMP project and the (for a safety-critical system)<br />

excessively large software stack that it involves.<br />

Then, this paper introduces the most important problems<br />

that are tackled in the project and presents the currently<br />

intermediate results. We split the discussion of these problems<br />

into two sections, depending on whether they have been<br />



anticipated at the beginning of the project (Section III), or<br />

whether they were unexpectedly discovered in the course of<br />

the project (Section IV).<br />

Note that, while this paper presents the more interesting<br />

and novel approaches, a significant part of the work still<br />

involves traditional safety engineering activities, which are not<br />

mentioned explicitly here.<br />

A. Goals and Common Misunderstandings of SIL2LinuxMP<br />

Before we dive into discussion of the technical aspects<br />

in Sections II-IV, we present the goals of the SIL2LinuxMP<br />

project. These goals are described here as clear counter<br />

statements to recurring misconceptions that have caused<br />

confusion with discussion partners. This makes clear what to<br />

expect and what not to expect and it summarizes how the<br />

SIL2LinuxMP project handles the particular issue.<br />

This clarification of goals is essential as it requires a major<br />

paradigm shift for companies that are used to buying and treating the<br />

operating system as a black box. While Linux comes with the<br />

advantage of royalty-free licensing for each deployed system,<br />

the costs are shifted towards building up knowledge about<br />

proper safety engineering in the respective companies.<br />

• Goal: Establish a framework that gives guidance on<br />

how to build Linux-based safety-related systems.<br />

Common Misconception: Some people perceive that<br />

SIL2LinuxMP is creating a product that will be available<br />

in a shrink-wrapped package without additional<br />

engineering effort.<br />

The idea that SIL2LinuxMP creates a packaged product<br />

is a very common misunderstanding, and unfortunately<br />

one that seems to keep some companies from<br />

actively participating in the SIL2LinuxMP project.<br />

The authors do not think that there can and will ever<br />

be a shrink-wrapped package with a Linux version<br />

that can be executed on arbitrary hardware without<br />

any restrictions and still be used for safety-critical<br />

applications.<br />

This is a dream many seem to have (some try to name<br />

that dream SEooC), but unfortunately, this dream is<br />

unrealistic for a variety of reasons. The most important<br />

one is that the interface to the Linux kernel is<br />

rather big. It is just under 1500 API calls, and this<br />

does not even include other interfaces into the kernel,<br />

e.g. the proc or sys pseudo filesystems.<br />

Analyzing it down to the last system call may be<br />

doable in theory, but it definitely is not maintainable<br />

and thus not economical. At the time of this writing,<br />

the list of system calls used in the SIL2LinuxMP use<br />

case is about 30-35 API calls (system calls as well as<br />

library calls such as memset()), with additional<br />

restrictions on their parameters, i.e., only certain flags<br />

are allowed, and on their usage, i.e., some calls may<br />

only be used during system initialization. All of these<br />

API calls are considered commonly used in most of<br />

the existing applications that run on Linux, none of<br />

the more esoteric and rarely used calls are used.<br />

In addition, their usage is restricted to specific combinations,<br />

e.g., allocation of memory requires a combination<br />

of malloc(), mlockall() and memset()<br />

and is further only allowed during system initialization.<br />

Furthermore, we only implement mechanisms to<br />

counter-act use-case specific faults that were revealed<br />

during the hazard analysis.<br />

In contrast to the non-implementable care-free shrink-wrapped<br />

pre-certified Linux package, the goal of the<br />

SIL2LinuxMP project is to create a framework that<br />

guides the safety engineer through the process of<br />

certifying a system based on Linux.<br />

That means it will be necessary to do a specific<br />

analysis for every Linux-based safety-critical system.<br />

Of course with time progressing, there will be certain<br />

patterns that emerge and using Linux in safety-critical<br />

systems will become easier, but nevertheless, an operating<br />

system of this complexity can never be expected<br />

to be resilient against all credible faults in all current<br />

and future applications.<br />
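The restricted allocation pattern mentioned above (malloc(), mlockall() and memset(), permitted only during system initialization) can be sketched as follows; the pool size, function name and error codes are invented for illustration and are not the project's actual coding rules:<br />

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define POOL_BYTES (64u * 1024u)   /* illustrative pool size */

static unsigned char *g_pool = NULL;

/* Called exactly once during system initialization, never at run time. */
static int init_memory(void)
{
    g_pool = malloc(POOL_BYTES);
    if (g_pool == NULL) {
        return -1;
    }
    /* Touch every byte now so no demand-paging faults occur later. */
    memset(g_pool, 0, POOL_BYTES);

    /* Pin all current and future pages into RAM; this may legitimately
     * fail without sufficient privileges or RLIMIT_MEMLOCK headroom.  */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        return -2;  /* in a deployed system: fatal initialization error */
    }
    return 0;
}
```

After initialization completes, only the pre-allocated, locked pool is used, so no allocation call appears on any safety-relevant execution path.<br />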

• Goal: Qualify Linux as pre-existing software element<br />

following Route 3S in IEC 61508-3 [2].<br />

Common Misconception: Since Linux is widely in<br />

use, a Proven-in-use strategy can be done.<br />

A proven-in-use argument (named Route 2S in<br />

IEC 61508-3 [2]) seems to be the first naive attempt<br />

for everyone who has never thought about the<br />

issue of Linux certification before. Unfortunately,<br />

proven-in-use qualifies as unusable as soon as one<br />

studies the pre-requisites for the collected historic<br />

data of such an argument in the relevant standards<br />

(e.g. in IEC 61508-7, C.2.10.1 [2]).<br />

In contrast, the SIL2LinuxMP project provides the<br />

safety qualification argument for the pre-existing software<br />

elements (Linux kernel, glibc, busybox) following<br />

IEC 61508, Route 3S. This means we provide an<br />

argument explaining why the development process of<br />

those pre-existing software elements satisfies the high<br />

standards of IEC 61508.<br />

Nevertheless, the SIL2LinuxMP project makes use of<br />

the popularity and wide usage of Linux by considering<br />

it in the selection process (see Section III-A for<br />

details)—but only as an additional parameter for (de-)selection<br />

and not as the sole or main argument.<br />

• Goal: Provide a minimal run-time environment for<br />

safety-critical applications up to a level of SIL 2.<br />

Common Misconception: A full-fledged distribution,<br />

e.g. Debian, Yocto, will be available.<br />

Unfortunately, running a full-fledged distribution with<br />

bells and whistles is not seen as doable (or practical,<br />

for that matter), as the code base of all packages in a<br />

distribution is just too big.<br />

Therefore the SIL2LinuxMP project is restricted to<br />

the Linux kernel, a minimum set of standard libraries<br />

and a minimum run-time environment (based on<br />

busybox). While some out there may still hope for<br />

a SIL2-certified Android or Yocto-based system,<br />

including graphics stack and everything—this is<br />

certainly not our goal.<br />



Co-location of a SIL0 container in an overall mixed-criticality system is being considered. While this will allow somewhat more non-safety functionality to run in parallel, it also will not provide a full-fledged Linux distribution without any restrictions.<br />

The SIL2LinuxMP approach is to keep those software elements in the QM/SIL0 1 container, where all kinds of non-safety applications can be executed (see Figure 3), and to keep the safety-critical application to a bare minimum. Integration of applications with mixed criticality is done using isolation mechanisms in the Linux kernel, as described later in Section III-C.<br />

1 Without going into detail, please note that QM/SIL0 does not mean arbitrarily crappy code may be executed in the QM/SIL0 container!<br />

II. IMPLICATIONS OF THE COMPLEXITY INCREASE<br />

As already mentioned in the introduction, a significant increase in complexity is happening in many industries. This increase in complexity is driven by the applications that are being developed for the near (and not so near) future. We derive a number of implications from these anticipated applications that have a significant impact on the non-functional properties of the systems.<br />

1) Computing Performance – The performance needs of these highly complex applications are much higher than in traditional systems. This implies not only much higher energy consumption [3], but also that CPUs that are to date not used in industrial applications but, e.g., in the server market, need to be used, as otherwise the necessary performance cannot be provided.<br />

2) Concurrent Computation Capabilities – The above-mentioned need for state-of-the-art processors due to performance requirements also mandates an operating system that is able to manage such modern multi-core CPUs and to take advantage of as many performance-enhancing features as possible.<br />

3) Security – Another common need across all industries is the need to connect systems to the outside world—often through the internet. This inevitably leads to security issues. While SIL2LinuxMP does not focus on security, the project has set itself the goal of checking every design decision on the architecture in order to assure that the design does not conflict with generally applied security concepts.<br />

These properties of a computing platform for up-coming high-complexity applications make Linux the prime suspect as the basis for such a computing platform, since Linux has been used in high-performance as well as security-demanding applications for many years. Therefore it provides a number of mechanisms that allow an optimal utilization of the given resources while providing outstanding security capabilities (protection and monitoring).<br />

However, the usage of Linux for safety-critical systems has so far been restricted to some very specific cases [4], [5], and a general approach to certifying an unmodified mainline kernel is not available, even though it has been discussed in the past [6]. To close this gap and allow the usage of Linux-based computing platforms that satisfy the performance as well as safety needs of future applications, the SIL2LinuxMP project was started.<br />

III. ANTICIPATED RESEARCH QUESTIONS<br />

This section gives a summary of the research questions that we anticipated from the start of the project. This does not mean that the solutions were fully clear, only that the problems were recognized in principle, with some concepts in place on how to tackle them.<br />

A. From Implementation to Selection<br />

The main difference between the SIL2LinuxMP platform and the approach commonly used is that the basic software elements are pre-existing software that has not been subject to dedicated development.<br />

This means that there are no provisions in the lifecycle for fault elimination, and this implies a strong concentration on fault mitigation—a fundamentally wrong approach to system safety. Thus the safety lifecycle is adjusted to mitigate this flaw. Specifically, the V-model is split into two parts, where the upper part is the system specification and architecture. This part is developed for this particular system and thus follows the regular Route 1S development as defined in IEC 61508 [2]. The bottom part is where, usually, the dedicated software would be designed, developed and integrated. Since pre-existing software elements are used, this bottom part is replaced by a software selection process, as shown in Figure 1.<br />

Fig. 1. Safety Lifecycle: Selection Process for Pre-Existing Elements.<br />

This adapted lifecycle model describes the workflow of how elements are selected. Depending on the element, the possible selection items vary. For the elements in the SIL2LinuxMP project, the following variables are up for selection:<br />

• Kernel Version – A new stable version of the Linux kernel is released approximately every two months. In<br />


addition about once a year one of these stable versions<br />

is made a long-term stable (LTS) version.<br />

Not every version is as stable as the others, thus an<br />

important part of the SIL2LinuxMP projects selection<br />

process is to select stable kernels (see Section III-D<br />

for details on how to use development data to identify<br />

a stable version), that are long-term supported (LTS)<br />

and ideally used in a very broad context (e.g. used by<br />

major distributions) and thus well tested.<br />

• Kernel Configuration – The goal of the<br />

SIL2LinuxMP project is not to provide a certificate<br />

of the full kernel and allow it to be configured<br />

however one might want to, but rather to establish<br />

a framework that allows the certification of one<br />

specific configuration, the assumption being that this<br />

configuration is reduced to functions that are:<br />

◦ Needed by the application to satisfy functional<br />

or non-functional requirements, while their selection<br />

is driven by the maturity and quality of<br />

the respective candidate, or<br />

◦ Established as de-facto standard, i.e., a kernel<br />

configuration without that configuration item<br />

would be far away from every other kernel<br />

configuration in use.<br />

Thus the kernel configuration also allows the (de-)selection<br />

of certain subsystems based on criteria such<br />

as their development history, novelty of design, size,<br />

and known bug rate.<br />

• Non-Kernel Elements – In addition to the kernel,<br />

other pre-existing elements will be used, e.g., C library,<br />

and math library. Usually, different variants<br />

for these libraries are available. For these non-kernel<br />

elements, the selection process thus starts with the selection<br />

of the variant that shall be used. This selection<br />

can—amongst the criteria used for the kernel itself—<br />

also include the extent of deployment and the activity<br />

of the development. Using a C library that is only<br />

deployed in a few systems and was not maintained<br />

for some years just does not make any sense. Instead<br />

the advantage of a broad and active community has<br />

to be considered by selecting an active and healthy<br />

project.<br />

B. Adapting Methods<br />

One issue in many safety-related projects is tailoring<br />

methods or arguing entirely new methods. The SIL2LinuxMP<br />

project with the outlined selection process of pre-existing<br />

software elements (see Section III-A) requires that not only<br />

the software (i.e. source code) itself be considered, but<br />

also the development environment, including all tools that are<br />

used in the development of the pre-existing software elements.<br />

This investigation entails an analysis<br />

of the contribution to safety by the tools (if any).<br />

Furthermore, the increased complexity requires that methods<br />

not suitable for such applications are replaced by state-of-the-art<br />

methods that can handle this increased complexity.<br />

Unfortunately the methods suggested by standards, such as<br />

IEC 61508 [2] are in large parts rather antiquated and proper<br />

replacement needs to be argued.<br />

Finally, the application of methods to pre-existing elements<br />

also changes the properties of applied established methods,<br />

e.g., the retrospective specification of a pre-existing element<br />

cannot provide a contribution to “Freedom from intrinsic<br />

specification faults, ...” as suggested in the measures and<br />

techniques properties of Annex C, i.e., 61508-3 Table C.1.<br />

Thus the list of methods that need to be tailored or newly<br />

introduced is quite significant and we considered it as good<br />

practice to not try and wildly argue methods, but rather build<br />

up a systematic process that not only allows the argumentation<br />

of tailored as well as new methods, but also shows that<br />

the original intent of IEC 61508 [2] is addressed with regard<br />

to the covered properties by the new set/combination of used<br />

methods.<br />

Fig. 2. Example of Tailoring a Method to the Context of Modern Computing Platforms.<br />

In Figure 2, the workflow for tailoring a method is outlined.<br />

Basically, it shows the transition (represented by the black<br />

arrow with the dotted line) from a current method that shall<br />

be tailored/replaced by a new/tailored method.<br />

The approach taken (see the red path through area (1)<br />

in Figure 2) is to reverse-engineer the rationale in IEC 61508-7<br />

in order to find out what the intent of the original method was. Then<br />

IEC 61508-3, Annex C is used to retrieve the properties that<br />

the method contributes to the system development. Now the<br />

contribution to those properties by the newly introduced (or<br />

tailored) method is evaluated.<br />

Next (following the green path going through area 2○ in Figure 2), the contribution to the properties by the new methods is compared to the contribution of the original method. Note that at this point, multiple new methods could be used to replace one original method. If this is the case, the contributions of all the (relevant) new methods are compared to the contribution of the original method, at least at a semi-quantitative level.<br />

The last step (following the blue path going through area 3○ in Figure 2) is to perform a gap analysis between the contributions to the properties by the new and original method(s) and, if necessary, to adjust the method set to cover those gaps. This iterative process is repeated until appropriate coverage is reached.<br />
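The tailoring loop described above can be sketched as a simple set computation: each method is represented by the set of Annex C properties it contributes to, and the gap analysis is the set difference between the original method's contribution and the combined contribution of its replacements. This is a minimal illustration only; the property names are hypothetical, and the real comparison is at least semi-quantitative rather than a pure yes/no set membership.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

using PropertySet = std::set<std::string>;

// Union of the Annex C properties contributed by a set of new/tailored methods.
PropertySet combinedContribution(const std::vector<PropertySet>& methods) {
    PropertySet all;
    for (const auto& m : methods)
        all.insert(m.begin(), m.end());
    return all;
}

// Gap analysis: properties of the original method not yet covered by the
// combination of new/tailored methods; the iteration stops once this is empty.
PropertySet coverageGap(const PropertySet& original,
                        const std::vector<PropertySet>& replacements) {
    const PropertySet covered = combinedContribution(replacements);
    PropertySet gap;
    std::set_difference(original.begin(), original.end(),
                        covered.begin(), covered.end(),
                        std::inserter(gap, gap.begin()));
    return gap;
}
```

A non-empty result of `coverageGap` corresponds to area 3○ in Figure 2: the method set has to be adjusted and the analysis repeated.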

C. Isolation Properties<br />

One of the key capabilities of the Linux kernel that is widely used is its set of isolation mechanisms. Previously, these were mainly used in security-aware systems; but nowadays, with the rise of containers used for simplified system deployment, system administration and application development, these isolation mechanisms are used in virtually every running system.<br />

These isolation mechanisms are used to build containers that are isolated from the rest of the system, in order to make the failure of the contained applications independent of the core system, to limit the impact of a security breach to the container, and to assure that unrelated applications cannot crash each other or the core system.<br />

The availability of multiple independently designed isolation and protection mechanisms makes it possible to build up layered isolation architectures. Figure 3 illustrates how independent applications of mixed criticality are isolated using multiple layers of protection.<br />

Fig. 3. SIL2LinuxMP – One Possible Architecture (SIL 2 safety applications with glibc and seccomp on dedicated CPUs and RAM banks, a SIL 0 Debian container, and a monitored busybox/glibc base system).<br />

Notably, there are two boundaries on which these isolation<br />

properties are used in SIL2LinuxMP:<br />

• to achieve isolation between independent applications<br />

(of different criticality), and<br />

• to assure that API constraints specified by the result<br />

of the hazard analysis (see Section IV-B) are honored.<br />

While this immediately sounds intriguing (having unrelated applications, even of mixed criticality, without the worry of inter-dependence between those applications), the old problem that appears when investigating the usability for safety is always how to verify that this is adequately safe, and that the isolation properties that allow independence of application failures are trustworthy.<br />

In this particular case, it was realized that this problem is similar to one well known in the process industry, well formulated by Audrey Canning:<br />

”Yet further concerns relate to whether a consequence can<br />

be so severe that the frequency of the hazardous situation<br />

should not be taken into account, thus negating the concept<br />

for ’risk’ in selecting the appropriate set of implementation<br />

techniques. In order to address this concern IEC 61511 formalized<br />

the concept of ’layers of protection’ requiring diversity<br />

between the different layers.” [7]<br />

The situation in this particular case is similar insofar as the risk cannot be evaluated because the frequency of the hazardous situation cannot be obtained, at least not with reasonable effort. For that reason, our solution leans on the basics of layers of protection analysis (LOPA). The intention is to assign multiple layers of protection for each class of hazards. The (usually already very unlikely) event of the hazardous situation will then only happen undetected if all the layers of protection fail at the same time, making the event of the hazardous situation even less likely, arguably extremely unlikely.<br />

In order to employ a LOPA and truthfully conclude the previous assertion, the isolation mechanisms must satisfy the basic properties of independent protection layers (IPL), cf. [8, Section 1.3]:<br />

• Independence: a LOPA only comes to a correct risk assessment if the protection layers are sufficiently independent from each other. If physical isolation is the target, then this might be problematic with software; therefore the focus is shifted towards logical isolation.<br />

• Effectiveness: It needs to be assured that the functionality of the layer protects against or mitigates the studied consequences and works even if the hazard happens. Each layer of protection shall provide sufficient protection or mitigation on its own; the use of multiple layers is meant to add an additional safety net for weaknesses in the argumentation or analysis of the individual layers. This does not mean that the developers are allowed to simply skip an analysis; it only refers to situations where the exact quantitative frequency of failure is simply not available or carries an uncertainty that is too high for a proper argumentation.<br />

• Auditability: It shall be possible to inspect the design<br />

and development of the IPLs as well as the IPLs<br />

themselves to assure the safety of the individual IPLs.<br />

Since the SIL2LinuxMP project uses only free/libre<br />

open-source software (FLOSS) the source code of<br />

the IPLs themselves is available for analysis, as is<br />

a plethora of documentation, and the development<br />

history (revision control system, mailing lists, bug<br />

reports, etc.).<br />

With this LOPA executed, we show that the isolation<br />

mechanisms are sufficient to provide a proper logical isolation<br />

of the different applications running on the same, shared,<br />

Linux-based system.<br />
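Under these independence assumptions, the quantitative core of a LOPA is simple: the frequency of an undetected hazardous event is the initiating frequency multiplied by the probability of failure on demand (PFD) of each independent protection layer. The sketch below is illustrative only; the numbers in the usage note are hypothetical and not taken from the project.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Under the independence assumption of a LOPA, an event passes undetected
// only if every protection layer fails at the same time, so the resulting
// frequency is the product of the initiating frequency and all layer PFDs.
double undetectedEventFrequency(double initiatingFrequencyPerYear,
                                const std::vector<double>& layerPfds) {
    double f = initiatingFrequencyPerYear;
    for (double pfd : layerPfds)
        f *= pfd;  // each independent layer must fail for the event to pass
    return f;
}
```

With a hypothetical initiating frequency of 1e-3 per year and two layers with a PFD of 1e-2 each, the undetected frequency drops to 1e-7 per year, which is what makes the hazardous event "arguably extremely unlikely".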



D. Statistical Modelling<br />

Traditionally, safety-critical software development follows a rigorous development process guided by the relevant standard. The assumption is that this rigorous development process leads to a residual bug rate that satisfies the targeted SIL. Alternatively, we can ask the question: What process is responsible for the presence of a software fault? We anticipate answering this question with statistical methods in SIL2LinuxMP. In other words, we aim to provide statistical statements about faults introduced by a stochastic (human) process in the development lifecycle activities.<br />

Traditional safety-related systems had, not too surprisingly, no means of quantifying systematic faults in software, due to the small software size and the statistically low number of iterative re-designs. Instead, a qualitative defense with deeper analysis was considered. Essentially, this pans out to taking sets of methods, having these sets applied by qualified teams of engineers, and wrapping all of this into a controlled process for which metrics can serve as indicators of the systematic capabilities bestowed on the software elements.<br />

Provided adequate trace data and development metadata are available for such a pre-existing element, we are able to infer adequate process compliance on a statistical basis through an indirect metric on the non-compliant development. The basic model is depicted in Figure 4.<br />
Fig. 4. Principle of statistical modelling using development data (CMMI-style process: review, testing, audit, ... versus defects over time; FLOSS process: review, testing, usage, ... versus defects over time).<br />
Establishing process compliance for a highly complex software element, e.g., the Linux kernel, is a two-step process:<br />
1) Establish the principle existence of mandatory activities; essentially this is what route 3S in IEC 61508-3 Ed 2 7.4.2.13 a-i encodes, and<br />
2) Establish the actual effectiveness of these methods based on statistical analysis of process metadata.<br />
The Linux kernel developers define a quite rigorous development process [9] which, in principle, can address most of the requirements for a structured and managed process set forth in IEC 61508 Ed 2 part 3. But clearly this FLOSS project lacks the safety management structure to claim any particular rigor of the applied methodology, even though rigor R1 (see IEC 61508-3 Ed 2 Annex C.1.1: ”R1: without objective acceptance criteria, or with limited objective acceptance criteria. E.g., black-box testing based on judgement, field trials.”) might seem to call for very little. If we statistically establish a clear relation between the activities called for and effective findings, we can establish an overall claim of achievement, in principle, of the objectives of IEC 61508; specifically, the fifth objective of clause 7.4 is being addressed here [2, Part 3, 7.4]: ”The fifth objective of the requirements of this sub-clause is to verify that the requirements for safety-related software (in terms of the required software safety functions and the software systematic capability) have been achieved.”<br />
Such a predictive model is an evaluation of the presumed stable underlying process of development, not an assessment of the systematic faults in a particular version itself. To achieve this, we model successive cycles of the kernel DLC and deduce continuity and trends of improvement for the overall kernel as well as for a particular selected configuration. At the heart, these are regression analyses of the -stable releases for Long-Term-Stable (LTS) kernels using negative-binomial regression models bootstrapped on the development trace data of the Linux kernel.<br />

Fig. 5. Linux 4.4 patches development over SUBVERSIONS (Use-Case config).<br />

Modeling the development of patches over stable kernel releases, based on the development of patches shown in Figure 5 and on the analysis of the specific hunks of applicable patches shown in Figure 6 for the selected kernel feature set, allows estimating the residual bugs in the kernel as well as making an overall judgement of process robustness. The goal of such models is not to imply that we know the number of yet-to-be-discovered bugs in the kernel; rather, they allow judging whether the development is comparable to a bespoke development and, equally important, whether the rate of reported bugs can be managed in a safety lifecycle.<br />

Fig. 6. Linux 4.4 applicable hunks development over SUBVERSIONS (Use-Case config).<br />
From the current, still quite limited set of root-cause analysis data, in which bug fixes in -stable kernels were analyzed, we estimate that ≤ 1/30 of the bugs are safety-related for our specific use case, and thus the expected number of bugs is manageable.<br />
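As a back-of-envelope illustration of this estimate (only the ≤ 1/30 fraction comes from the analysis above; the fix counts in the usage are hypothetical):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Scale the observed bug-fix counts of the -stable subversions by the
// estimated fraction of safety-related bugs (<= 1/30 in the use case above)
// to obtain an expected number of safety-related bugs to be managed.
double expectedSafetyRelatedBugs(const std::vector<int>& fixesPerSubversion,
                                 double safetyFraction) {
    const int totalFixes = std::accumulate(fixesPerSubversion.begin(),
                                           fixesPerSubversion.end(), 0);
    return totalFixes * safetyFraction;
}
```

For example, 300 fixes across the considered subversions with a 1/30 safety-related fraction yield on the order of ten bugs that the safety lifecycle actually has to handle.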

The regressions, though, assume that bugs are field findings and thus discovered by a time-dependent process; naturally, this is not true for all bugs. The recently emerging Meltdown and Spectre bugs, which led to a very significant update of critical kernel elements, demonstrate that such predictions have their limits. Nevertheless, this is a first quantification of potential impact and thus an important metric for selecting a particular kernel version and configuration.<br />

IV. UNEXPECTED TROUBLES<br />

While the topics presented above were known from the beginning, there were some further issues that we did not anticipate at the start of the SIL2LinuxMP project.<br />

A. Impact Analysis<br />

A question that is raised quite often in discussions is: How do you plan to certify an operating system with 19 million lines of code? The simple answer is: this was never the plan.<br />

As already mentioned in Section III-A, the selection of the configuration items is part of the safety development lifecycle. Essentially, the selection is not only a step to eliminate faults and minimize residual faults, but also a step that dramatically reduces the code base.<br />

The Linux kernel configuration ensures that only a fraction of the code base is actually used. In addition, every analysis that is done (e.g., the statistical models presented in Section III-D) should ultimately focus on those commits that have an impact on the specific configuration.<br />

In order to be able to do this, the SIL2LinuxMP project uses two tools developed in the context of the project:<br />

• The minimization [10] tool was developed by Hitachi in the context of the SIL2LinuxMP project. Based on the kernel configuration, it produces a code base from which all unused code is stripped, based on the C macros used for configuration.<br />

• The patch impact tester (PIT) is a tool that is currently still under development (but shows very promising results). In contrast to the minimization tool, it does not work on a single version of the kernel; rather, it tests whether a given patch has an impact in a given configuration. This way, the development data of the individual changes is preserved, and the number of commits that have to be considered is reduced. While this problem may seem trivial at first, there are quite a number of cases where it is not.<br />

The PIT itself is based on a GCC plugin. This plugin is used when compiling the kernel with the configuration that shall be used. The information provided by this plugin is collected in a database, which can then be used to check whether a given patch has an impact on the configuration or not.<br />
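The core question the PIT answers can be sketched as follows. This is a deliberate simplification: the real tool works on the preprocessor-resolved hunks recorded by the GCC plugin in its database, not on whole files, and the file names below are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// A patch is considered to have impact on a configuration if it touches
// at least one source file that the compiler plugin recorded as being
// part of the configured build.
bool patchHasImpact(const std::set<std::string>& filesInBuild,
                    const std::vector<std::string>& filesTouchedByPatch) {
    for (const auto& file : filesTouchedByPatch)
        if (filesInBuild.count(file) != 0)
            return true;  // the patch changes code that is actually compiled
    return false;
}
```

Patches with no impact can be dropped from the analyses described in Section III-D, which is what shrinks the set of commits that have to be considered.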

B. HD³ – Hazard-driven Decomposition, Design and Development<br />

While the complexity of the use case considered in the SIL2LinuxMP project is far below the complexity of intended future applications, e.g., autonomous driving, it became obvious during hazard analysis that this kind of complexity is not controllable with traditional hazard analysis methods. For that reason, a new approach based on the hazard and operability study (HAZOP) method was investigated. This new approach is called Hazard-driven Decomposition, Design and Development (HD³).<br />

The primary premise of HD³ is that the design of the system shall be driven by the identified hazards. The idea is to use the hazards as design input and to eliminate them at the design level where possible. This way, the need for mitigation mechanisms is minimized, preventing the system complexity from increasing unnecessarily.<br />

The general procedure of HD³ is to start with the basic functionality of the system. In the SIL2LinuxMP use case, this was ”Measure the quality of water.” Based on this basic functionality, a technology-agnostic process was derived, i.e., the process as if done by a biochemist in a laboratory setup. This technology-agnostic process is then subjected to the first hazard analysis. A traditional HAZOP is conducted on the technology-agnostic process, revealing the hazards at this highly abstract level. Elimination conditions and mitigation capabilities are recorded in the form of Safety Application Conditions (SACs) for each of the analysis levels. These SACs are then consolidated into a set of derived items, still at a technology-agnostic level.<br />

The results of the hazard analysis at the technology-agnostic level are used as input for a technology-aware design while still staying technology-unspecific. That means that an automated system is designed by allocating unspecific devices (motors, pumps, valves, sensors, etc.) to perform the actions that are performed by the biochemist in the technology-agnostic process. At this level, no specific device is yet allocated (e.g., only a ”pump” is used, without knowing whether it is a diaphragm pump, radial pump, peristaltic pump, etc.).<br />

This technology-aware unspecific design is then fed into yet another hazard analysis. The result of this second round of hazard analysis is used to create a more detailed technology-aware design using specific devices. The important part here is that, based on the hazards at the higher level of abstraction, it was possible to select specific devices that are inherently safe against a number of the specifically identified faults. This leaves a limited (ideally minimized) set of hazards that cannot be eliminated this way. Only for this limited set does a mitigation mechanism have to be introduced into the system design to assure a low residual probability of failure.<br />

The actual allocation of mitigations can finally be at the level of the specific safety-related application or at the unspecific (generic) level of selected elements (see LOPA).<br />

The result of the second hazard analysis is then used to<br />

go into a third level of hazard analysis where the unspecific<br />

devices are replaced by the selected specific devices.<br />

It is important to note that the HD³ approach is then completed with further intermediate layers of derived requirements that are necessary to obtain the hazard information needed for the next level of design.<br />

Furthermore, each hazard analysis also emits Safety Application Conditions (SACs) that put conditions on the system that have to be met. The HD³ approach results in these SACs showing hierarchical properties, as SACs from higher levels of abstraction map to more fine-grained SACs at the detailed level. For example, the SAC ”Critical data must be verified after write.” at the technology-agnostic level is refined as follows: at the unspecific technology-aware level, it is reflected in the form of ”Grey-channel the storage media.”, and at the specific technology-aware level it becomes: ”Written data must be read back.”, ”Individual measurement values shall be timestamped.”, ”A CRC shall be stored for individual measurement values.”<br />
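The refined SACs at the specific level map almost directly to code. The sketch below uses hypothetical types, and the CRC-32 merely stands in for whatever checksum the real system prescribes; it shows a stored measurement that carries its own timestamp and CRC so that written data can be verified on read-back:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Bitwise CRC-32 (reflected, polynomial 0xEDB88320), as a stand-in checksum.
std::uint32_t crc32(const unsigned char* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

// "Individual measurement values shall be timestamped." and
// "A CRC shall be stored for individual measurement values."
struct StoredMeasurement {
    double value;
    std::uint64_t timestampUs;
    std::uint32_t crc;  // CRC over value and timestamp only
};

std::uint32_t measurementCrc(const StoredMeasurement& m) {
    unsigned char buf[sizeof m.value + sizeof m.timestampUs];
    std::memcpy(buf, &m.value, sizeof m.value);
    std::memcpy(buf + sizeof m.value, &m.timestampUs, sizeof m.timestampUs);
    return crc32(buf, sizeof buf);
}

StoredMeasurement makeMeasurement(double value, std::uint64_t timestampUs) {
    StoredMeasurement m{value, timestampUs, 0};
    m.crc = measurementCrc(m);
    return m;
}

// "Written data must be read back.": re-verify the CRC after reading back.
bool readBackVerified(const StoredMeasurement& readBack) {
    return measurementCrc(readBack) == readBack.crc;
}
```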

This crude example shows how SACs are refined while conducting the hazard analysis at the various levels of abstraction. Similarly, the required API manifests itself. What cannot be seen in this example is that the SACs and the used API calls that are part of the result constitute<br />

• a minimum set of parameter-constrained API calls,<br />

and<br />

• a maximum set of constraints.<br />

This reduces the functional subset that needs to be analyzed, so that the complexity of the software components is handled with minimized effort.<br />

While the experience with HD³ is still limited due to its novelty, the first results look very promising, and a thorough analysis from the high-level design down to the implementation was possible for the SIL2LinuxMP use case. A traditional hazard analysis might have been possible for this level of complexity, but from our experience the effort to do so would have been significantly higher; more importantly, honoring the important rule of ”first eliminate, then mitigate” would not have been possible to the same extent.<br />

V. CONCLUSION<br />

While the goal of a SIL2-certified platform has not been<br />

reached within the first three years of the SIL2LinuxMP<br />

project, partly due to the lack of certified multi-core CPU<br />

hardware, it has been shown that this goal is not out of reach,<br />

especially for our main investigation subject, the Linux kernel.<br />

The above sections present the progress that has been made in various parts of the safety lifecycle. The biggest issue in the project's endeavor was to find ways to handle the complexity. First, this was achieved using the HD³ method (introduced in Section IV-B) to perform the hazard analysis and the system design. Second, the software LOPA (Section III-C) provides argumentation for the partitioning of the problem by separating applications of the same and of different criticality and allowing them to be handled separately. Furthermore, the impact analysis described in Section IV-A allows the automatic reduction of the Linux kernel code base to those lines of code that have a direct impact on the specific configuration in use, reducing the effort since everything else can be discarded.<br />

For the re-use of pre-existing open-source elements, the most important steps were the transition from a traditional V-model for software development to the selection process outlined in Section III-A, as well as the systematic process for arguing the use of new methods and the tailoring of existing methods, discussed in Section III-B.<br />

In summary, the overall progress of the SIL2LinuxMP<br />

project is at a point where the authors are confident that the<br />

goal can be achieved by completing these sketched activities.<br />

VI. ACKNOWLEDGEMENTS<br />

The SIL2LinuxMP project is organized by the Open Source Automation Development Lab (OSADL), as well as the SIL2LinuxMP partner companies. We thank them for their support and their funding.<br />

REFERENCES<br />

[1] OSADL, SIL2LinuxMP Webpage, https://www.osadl.org/SIL2LinuxMP.sil2-linux-project.0.html, 2016<br />

[2] IEC 61508 Edition 2, Functional safety of electrical/electronic/programmable electronic safety-related systems, IEC, 2010<br />

[3] Bloomberg, Driverless Cars Giving Engineers a Fuel Economy<br />

Headache, https://www.bloomberg.com/news/articles/2017-10-<br />

11/driverless-cars-are-giving-engineers-a-fuel-economy-headache,<br />

October 2017<br />

[4] Andreas Gerstinger, Heinz Kantz and Christoph Scherrer, TAS Control<br />

Platform: A Platform for Safety-Critical Railway Applications,<br />

https://publik.tuwien.ac.at/files/PubDat 167529.pdf<br />

[5] Peter Sieverding and Detlef John, SICAS ECC – die Plattform für Siemens-ESTW für den Nahverkehr, Signal und Draht, May 2008<br />

[6] CSE International Limited for the Health and Safety Executive, RESEARCH REPORT 011: Preliminary assessment of Linux for safety related systems, 2002<br />

[7] Audrey Canning, Functional Safety: Where have we come from? Where are we going?, in Proceedings of the Twenty-fifth Safety-critical Systems Symposium, Bristol, UK, 2017<br />



[8] Guidelines for Initiating Events and Independent Protection Layers<br />

in Layer of Protection Analysis, Center for Chemical Process Safety<br />

(CCPS), 2015, Published by Wiley&Sons<br />

[9] A guide to the Kernel Development Process, https://www.kernel.org/doc/html/latest/process/development-process.html, 2018<br />

[10] GIT Repository of the Minimization Tool, https://github.com/Hitachi-India-Pvt-Ltd-RD/minimization, 2017<br />



A Multi-Platform Modern C++ Framework for<br />

Safety-Critical Embedded Software<br />

Daniel Tuchscherer (Author)<br />

Automotive Systems Engineering<br />

Hochschule Heilbronn, Germany<br />

daniel.tuchscherer@gmail.com<br />

Ingmar Troniarsky (Author)<br />

ITronic GmbH<br />

Erdmannhausen, Germany<br />

i.tron@itgroup-europe.com<br />

Markus Hinse (Author)<br />

ITronic GmbH<br />

Erdmannhausen, Germany<br />

m.hinse@itgroup-europe.com<br />

Frank Tränkle (Author)<br />

Automotive Systems Engineering<br />

Hochschule Heilbronn, Germany<br />

frank.traenkle@hs-heilbronn.de<br />

Abstract—The choice of a programming language and its<br />

idioms have a critical impact on reliability, safety and efficiency<br />

of the embedded software under development. In the automotive<br />

and robotics domains, the C programming language as well as<br />

model-driven tools are well established for safety-critical<br />

software. However, automated driving and innovative robotics applications are both examples of the emerging complexity of safety-critical software. Both domains contribute to the increasing popularity of modern approaches alongside the established ones to increase flexibility, such as Modern C++ with the ISO standards C++11 and C++14.<br />

This paper discusses experiences in applying Modern C++ as efficiently and as effectively as possible for developing safety-critical software. A multi-platform and simple-to-use framework for safety-critical software in Modern C++ is developed and applied to a concrete industrial application in the area of human-robot collaboration. On the one hand, Modern C++ is used to realize the speed control of the collaborative robotic system, which includes a proximity sensor system that measures distances between the robot and humans. On the other hand, safety mechanisms are realized with Modern C++ in order to monitor system entities and communication channels for failures. In case of real-time violations or failures, the safety-control software in Modern C++ must ensure safety stops in order to protect humans from hazards and resulting injuries. In concrete terms, this paper discusses in which way Modern C++ enhances usability, reliability and safety for the implementation of a bus-independent safety-communication protocol, which is used to provide message-based real-time monitoring, dual-channel utilities and actuation monitoring in a maintainable, extensible way.<br />

Keywords—Modern C++, embedded safety software, reliability, human-robot collaboration, reliable communication, IEC 61508, ROS<br />

I. INTRODUCTION<br />

Cooperating and collaborating robots interacting with humans without any barrier can be classified as so-called safety-critical systems. Safety-critical systems are embedded systems in which malfunctions may lead to hazards that potentially result in severe injuries for humans [17], [18]. In case of a fault that may escalate into a failure, harm to humans must be prevented in the first place. This is why measures to<br />

detect, avoid and handle malfunctions play a crucial role from<br />

the earliest phase of development. These measures relate to the<br />

topic of functional safety and include in-depth work with safety<br />

standards [28]. In conformance with these safety standards, it<br />

shall be verified and validated that safety-critical systems such<br />

as applications in the fields of human-robot collaboration<br />

(HRC) work as specified and maintain their intended<br />

functionality [7].<br />

At the same time, every safety project is limited by factors<br />

such as budget, time and resources. The programming language<br />

and tools along the development process have significant<br />

impact on these factors as Binkley states in an article about<br />

C++ for safety-critical software [1]. The efficiency of the development process and the reliability of safety-critical software depend on the programming language and the programming idioms utilized; this is especially the case for safety-critical software with additional safety requirements.<br />

Modern programming languages and tools shall support embedded software developers in building reliable, safe, maintainable and simple code with the highest efficiency. One example of a programming language that is gaining popularity in the domains mentioned is Modern C++, including the ISO standards C++11, C++14 and C++17. C++ is powered by these modern standards to facilitate time- and cost-effective development of high-quality software for features such as<br />



communication protocols and control functions, by providing<br />

paradigms for holistic views on the system and embedded<br />

software under development. A tool like Robot Operating<br />

System (ROS) supports developers to maximize flexibility in<br />

addition. ROS is a middleware for the development of<br />

autonomous systems and robots. It provides reusable utilities to<br />

visualize, communicate, test, simulate, trace as well as control<br />

robots to speed up development. All these utilities are<br />

accessible via C++ APIs.<br />

However, despite its popularity, Modern C++ for safety-critical software leaves room for discussion as to if and how it is applicable in detail. In this work, Modern C++ and ROS are utilized for an application in the area of HRC. This shall demonstrate and discuss in which way Modern C++ can be applied as efficiently and effectively as possible for the development of safety-critical embedded software in general.<br />

MODBAS-Safe is presented in this work: a multi-platform and simple-to-use framework for safety-critical embedded software written in Modern C++. The framework is applied to<br />

a concrete industrial application in the area of HRC as a<br />

demonstration use-case with respect to safety standards such as<br />

ISO 10218, ISO 13849 and IEC 61508. Modern C++ is used to<br />

implement the functional requirements of the collaborative<br />

robotic system, which includes a speed control of the robot in<br />

collaborative mode. For the speed control, a proximity sensor<br />

system measures the distances between the robot's tool center<br />

point (TCP) and the human worker. The embedded software in<br />

Modern C++ evaluates these distances. If the distance is less than a given tolerance, the software shall stop any motion by driving the robot's actuators, protecting humans from harm. The<br />

robotics system, including the system architecture and the<br />

speed control is described in Section III while the safety<br />

architecture is presented in Section IV.<br />
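The decision just described can be sketched as a small pure function (hypothetical names and thresholds, not the actual MODBAS-Safe API): below the tolerance, any motion is stopped, and between the tolerance and a slowdown radius the programmed speed is scaled down linearly.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

enum class MotionCommand { Run, SafetyStop };

struct SpeedDecision {
    MotionCommand command;
    double speedFactor;  // 0.0 .. 1.0 scaling of the programmed speed
};

// distanceM: measured distance between the robot's TCP and the human (meters).
SpeedDecision decideSpeed(double distanceM, double toleranceM, double slowdownM) {
    if (distanceM < toleranceM)
        return {MotionCommand::SafetyStop, 0.0};  // stop any motion
    if (distanceM < slowdownM) {                  // collaborative slowdown band
        const double f = (distanceM - toleranceM) / (slowdownM - toleranceM);
        return {MotionCommand::Run, std::min(f, 1.0)};
    }
    return {MotionCommand::Run, 1.0};             // full programmed speed
}
```

Keeping the decision a pure function of the measured distance makes it straightforward to unit-test exhaustively, which matters for the verification activities discussed later.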

In addition, safety mechanisms are also realized in Modern<br />

C++ in order to monitor system entities and communication<br />

channels for faults and failures. In case of real-time violations<br />

or failures, the safety-control software in Modern C++ must<br />

ensure safety-stops to protect humans from hazards and resulting injuries. The safety-control software makes extensive use of the MODBAS-Safe framework features, including the bus-independent safety-communication protocol presented in Section V. The safety-communication protocol provides means<br />

for the realization of real-time violation monitoring, dual-channel functionality or actuation monitoring. The safety-control software and the MODBAS-Safe framework are the topic of Section VI. This section demonstrates in which way features of C++11 and C++14 can be used to boost reliability and prevent incorrect usage by utilizing compile-time checks,<br />

computations and transformations. Code examples of this<br />

section show in which way the multi-paradigm approach of<br />

Modern C++ helps to reduce the overall complexity and makes<br />

it simple to transform mental models, functional and safety<br />

requirements directly into code.<br />

The C++ programming language is only a part of the<br />

toolchain: ROS as a tool is used for visualization, verification<br />

and validation within this project. In this work, ROS is clearly<br />

separated from the embedded software deployed. None of the<br />

ROS components used for testing are part of the production<br />

embedded code. In other words: The embedded software in<br />

Modern C++ is fully functional without the ROS ecosystem<br />

and its dependencies on third-party software. The embedded code shall not contain any build-time or runtime dependencies on ROS, which reduces certification effort and makes the<br />

embedded software portable. Instead, MODBAS-Safe provides a decoupling mechanism: the safety-communication protocol and a dedicated ROS gateway transfer runtime data from the target to the ROS ecosystem for visualization and verification purposes, so that the possibilities that come with ROS can still be exploited.<br />

Section VII gives a conclusion about the usage of Modern C++ for the application in HRC and, from the experiences made, about how Modern C++ can be used for deploying high-quality embedded software in general. Section VIII gives a brief outlook on MODBAS-Safe and the open challenges of applying Modern C++ to safety-critical software.<br />

II. RELATED WORK<br />

Writing safety-critical software in C++ is often still limited to the use of older language standards up to the 2003 release (C++03). One example of the use of C++03 is the Joint Strike Fighter (JSF) development program [8]. A helpful compilation on the use of C++ and the intentions for using this programming language within the JSF projects is presented by<br />

Stroustrup and Carroll [32]. From these projects, the JSF AV<br />

C++ coding standard emerged. Based on JSF++, additional<br />

programming guidelines such as MISRA C++ were published<br />

in 2008. Other publications that relate to writing safety-critical embedded software in C++ are by Binkley [1] and Williams [34]. Binkley targets design patterns for the safe handling of fixed-point and floating-point arithmetic by using C++ classes. Reinhardt provides a detailed and extensive view on C++ for safety-critical systems [27]. The interesting point about<br />

this work is the close and direct relation to the relevant safety<br />

standard IEC 61508 and the comparison of C++ with other<br />

programming languages for safety-critical software<br />

development.<br />

Still, there are concerns and issues raised when writing embedded software in C++. These concerns are remnants of a time when no modern C++ language features existed and C++ tool support was limited; in particular, (cross-)compilers were not available in the wide variety that they are today. A rebuttal of common concerns and guidelines<br />

on how to write efficient embedded software in C++ is<br />

provided by Goldthwaite in the Technical Report on C++<br />

Performance [10] and by Stroustrup [31]. The recommendations in the technical report represent the basis for the general usage of C++ in this work. This includes recommendations on which C++ paradigms are applicable to boost efficiency and effectiveness for embedded systems where real-time constraints<br />

matter.<br />

Individual reports and books point to the modern standards<br />

C++11 and C++14 in the context of embedded software<br />

programming. Extensive examples about C++ for real-time<br />

systems are presented in the book written by Kormanyos, with<br />

reference to the automotive industry and AUTOSAR [19]. A<br />

compact overview is given by Grimm [11]. Our work heavily<br />

relies on the experiences documented in the related work. This<br />

includes the interaction with new language features like<br />



`constexpr` for compile-time computations,<br />

`static_assert()` for compile-time checks and the<br />

effective use of the C++ STL in parts suitable for safety-related<br />

software.<br />

Figure 1: System Architecture of the HRC application.<br />

III. APPLICATION<br />

This work is located in the innovative field of human-robot collaboration (HRC), which relates to Industry 4.0 - the so-called next industrial revolution [29], [20]. In the context of HRC, human and robot work together without any locking guards [25]. A compact overview of the current status of HRC is given by Huelke [13]. There are a number of robots on the market designed for collaboration. Known examples are the robot series UR3, UR5 and UR10 by the Danish company Universal Robots, the KUKA LBR iiwa 4 or the collaborative robot CR-35iA by FANUC. In order to avoid or mitigate severe injuries, all these robotic systems have an embedded collision detection. The collision detection is realized either by torque monitoring or, as in the case of the FANUC CR-35iA, by force monitoring. On exceedance of a mostly configurable torque limit, a collision is detected and the robot executes a safety-stop. However, this implies a collision before the system even stops. Thus, these systems are currently limited to a tool center point (TCP) speed of 250 mm/s by the safety standard ISO 10218. Due to the reduced speed, severe injuries are mitigated. For higher speeds exceeding the current limit, a locking guard is still mandatory [6]. The limited use-cases under speed limitation lead to an increasing interest in collision-free HRC, in order to be able to operate collaborative systems at higher speeds. Proximity sensors support a collision-free collaboration. If the distance between human and robot is less than a given tolerance, a sensor system shall help to execute a safety-halt before robot and human collide. Once the distance is greater than the critical limit again, the robot shall continue its work without confirmation. The basis for a sensor-based, collision-free HRC is presented by Ostermann [24], [25].<br />

The proximity sensor system (sensorhead in Figure 1 and Figure 2) for contactless object detection has been developed by the ITSoft company (member of ITGroup). This sensorhead is attached to the last joint of the robot and is designed to be used with the Universal Robots UR3, UR5 and UR10 or the FANUC robot family. The sensing range is from 0 to 1500 mm, based on the ultrasonic reflection principle. It contains 12 detection sensors composed of 2 coupled sensor modules for complete redundancy. For fail-safe reasons a crosscheck between the proximity sensor modules is implemented, checking signal integrity two times for each sensor module. Also, a sensor check is performed in every measurement. Interferences of the measurements are detected by the sensorhead and reported to the evaluation unit.<br />

The sensorhead power supply is 24 volts DC. Status LEDs in the sensorhead indicate obstacle recognition, operation mode and fail-signaling. The detection spread is 360 degrees in the vertical axis and 60 degrees in the horizontal axis at all times. For the development of the sensorhead software a TÜV-certified compiler (armcc.exe V5.04 update 2 build82) is applied.<br />

Figure 2: Ultrasonic sensorhead of the safety system.<br />

The following work is based on the idea of a completely<br />

collision-free collaboration by measuring distances with this<br />

ultrasonic proximity sensor system. To prevent collisions as effectively as possible, the intended function is to ensure the execution of a safety-halt (SS2) if a certain minimum allowed distance between human and robot is violated [4]. As a collaborative robot, the Universal Robots UR3 (3 kg payload)<br />

is used. Figure 1 shows the general architecture of the system<br />

under development in a block diagram. The collaborative<br />

system consists of the proximity sensor system, an evaluation<br />

unit for safe speed control and the UR3. The proximity sensor<br />



system is mounted on the UR3's tool head and measures the<br />

distances to environmental objects within a certain range based<br />

on the system's performance. The evaluation unit is an<br />

embedded system used to realize the speed control of the<br />

robot's TCP during collaboration mode based on the proximity<br />

sensor distances. The distance samples are sent from the sensor<br />

system to the evaluation unit via CAN; the robot's current state<br />

(including actual pose and speed) is sent periodically every 8 ms<br />

via Ethernet TCP/IP.<br />

On violation of certain minimum distances, the evaluation<br />

unit shall either reduce the robot's speed or even stop any<br />

motion of the robot, as soon as the distance between robot and human is too low. This paper proposes two identical instances of this evaluation unit with cross-monitoring, using two CAN bus<br />

channels for communication. Missing, wrong or implausible<br />

events in the measurement or in the communication between<br />

the sensor system and the evaluation units are detected and<br />

published on all channels in order to trigger a safety-stop of the<br />

complete system.<br />

The processing of sampled sensor distances and the safe<br />

speed control is realized by an evaluation unit application<br />

software executed on the target. This embedded software is<br />

completely developed in Modern C++. Its development and execution take place under a Linux OS with the PREEMPT_RT patch to meet the required real-time constraints. Developing under Linux allows the execution of the compiled and linked software on the development PC for rapid prototyping and fast<br />

feedback. At the same time, it is possible to deploy the same<br />

software to an embedded target running an Embedded Linux<br />

(including PREEMPT_RT) without any adaptations needed<br />

(e.g. no specific build defines / settings necessary). From the<br />

earliest phase of development, the software is executed on the<br />

development PC and is later deployed on an embedded target<br />

without additional effort. One possibility is to execute the<br />

evaluation software for the safe speed control in Software-in-the-Loop (SiL) simulation mode, in which the sensor system<br />

and UR3 are simulated. It is also possible to run the software<br />

on the development PC and communicate with the real system<br />

entities, the sensor system and the UR3. For the realization of the SiL test, ROS and V-REP are applied. ROS rviz is used for<br />

visualization. ROS rviz provides means to display sensor<br />

distances and the robot's pose in a three-dimensional view for<br />

rapid-prototyping and fast feedback on testing new features. In<br />

order to simulate the robot's environment for SiL tests of the<br />

evaluation unit application software, V-REP as a multi-rigid-body simulation tool is used to simulate the robot's dynamics,<br />

the behavior of the proximity sensor system and the dynamic<br />

obstacles such as humans. The evaluation unit software either<br />

communicates directly via virtual TCP/IP and virtual<br />

SocketCAN with the simulation or with the real system<br />

entities. A switch between virtual and physical interfaces does<br />

not require any adaptations to the software under test.<br />

The collaboration space / working area is split into three spatial zones as depicted. These zones can be imagined as fixed to the tool head's frame, moving together with the robot's sensor system. The distances to adhere to are calculated according to DIN EN ISO 13855 and ISO TS 15066.<br />

If a dynamic obstacle occurs in the permissible zone, the evaluation unit application software may operate the robot with a maximum speed of v_TCP = 1 m/s. Within the tolerance zone the robot's TCP speed is limited to the safety limited speed (SLS) of v_TCP = 250 mm/s. In the safe zone, a dynamic obstacle within the specified distance must lead to a safety-halt triggered by the evaluation unit to stop any motion of the robot as long as the obstacle remains in this range. For the<br />

evaluation unit application software to be able to detect dynamic obstacles like humans, the evaluation unit must distinguish between static and dynamic objects. Before any collaboration, a reference drive to record the static environment is mandatory. Within this automated teach mode, the robot is driven at constant speed along the paths it will take during collaboration, without any dynamic objects present in the collaboration space - only static objects of the environment exist during teach mode. The distances measured by the sensor<br />

system and the robot pose are sampled periodically in fixed<br />

time steps. The evaluation unit application software stores<br />

every record consisting of the distances of the sensor system<br />

and the robot's pose in a table-based, application-specific data<br />

model. Over each record entry a CRC is computed to ensure<br />

data integrity. In collaboration mode, this persistent data model<br />

is accessed to compare the reference samples with the current<br />

sensor distances at a specific robot pose. The deviation from the current dynamic environment can be evaluated. In collaboration mode, however, the challenge is that the speed is not constant as it is in teach mode; in other words, the time information from the teach mode is lost, because the robot is slowed down or accelerated during collaboration. In order to relate corresponding record entries of the collaboration and teach modes at a given robot pose, Dynamic Time Warping [23] is applied.<br />

This section specifies the general concept of the system to<br />

be developed under the use of Modern C++ in order to realize<br />

the functional requirements and some of the safety<br />

requirements. The intent is not to give an in-depth look on<br />

functional safety of the system, but to provide a basic<br />

understanding of the features needed to be realized in Modern<br />

C++. In the following section, the safety architecture is<br />

described. This represents the basis for the safety control<br />

software that monitors the evaluation unit for real-time<br />

violations, for instance. This clear separation of the application<br />

logic of the evaluation unit application software and the safety<br />

control software, including fault-detection and safety mechanisms, is obligatory. This partitioning is highly<br />

recommended according to the literature [2], [5]. Avoiding<br />

mix-ups and the clear distinction between the application<br />

software and the safety-related software enable easier testing<br />

and verification of the individual software components. In<br />

addition, this also leads to smaller, simpler and thus<br />

maintainable software units.<br />



IV. SAFETY ARCHITECTURE<br />

The safety-limited speed function in the tolerance zone and the safety-halt for the safe zone cover aspects of functional safety. The safety mechanism for the HRC application in case of a malfunction is accomplished by transferring the system to a safe state. The system is not required to be fail-operational. In case of a deviation from the intended functionality, the complete system is transferred into this safe state. The system will remain in this safe state until there is a manual reset by an operator. In this context, a safe state means to stop any robot motion to reduce risks for the worker that collaborates with the robot. As a recap, the following safety functions are realized by the evaluation unit application software:<br />

- If there is a dynamic object within the tolerance zone, the application software shall reduce the robot's TCP speed to the safety limited speed (SLS) of v_TCP = 250 mm/s.<br />
- If there is a dynamic object within the safe zone, the application software shall execute a safety-halt (SS2) as long as the object remains in the zone.<br />

These two safety functions are sufficient to reduce risks for the human. However, this is only the case if the system operates with its intended functionality. Since no system is completely free of runtime faults, additional safety functions need to be specified. These safety functions are executed if the intended functionality of the system cannot be guaranteed. First, possible faults need to be identified. Examples of faults and failures that need to be detected are:<br />

- Real-time violations of sensor or robot communication: e.g., the periodical update of the robot state or sensor data is not transmitted in time.<br />
- Failure of the evaluation unit<br />
- Invalid sensor distances<br />
- Invalid robot state<br />
- Actuator command from the evaluation unit not executed by the robot<br />
- Invalid record entries from the teach mode<br />

Figure 3: Safety architecture category 3 required according to ISO 10218-1.<br />

For each of these faults or failures the complete system<br />

shall be transferred into a safe state by a safety function the<br />

UR3 robot provides. The UR3 provides the following built-in<br />

safety functions: Safety Limited Speed (SLS), Safety-Halt<br />

(SS2) and Safety-Stop (SS1). All of these can be triggered<br />

externally through digital inputs of the robot. In order to detect and handle the faults and failures described above, an appropriate safety architecture must be chosen. According to ISO 10218-1, safety-relevant entities of the robot's safety control shall reach Performance Level d (PL d) with a category 3 architecture [15]. According to category 3, an HRC system must provide two independent channels. In addition, the architecture must provide means for both logic devices of the two-channel system to cross-monitor computations and the actuator command. This cross-monitoring is used to determine any deviation from the intended functionality of one of the logic devices. In this case one logic device is represented by one evaluation unit.<br />

According to the category 3 architecture, actuator monitoring is<br />

recommended. For instance, this monitoring ensures that an<br />

executed safety-stop is really maintained as long as the system<br />

is exposed to a hazard.<br />

Figure 3 depicts the safety architecture of the HRC system<br />

to achieve category 3. With this architecture malfunctions<br />

described above can be detected. The safety architecture<br />

represents an extension of the initial system architecture shown<br />

in Section III that meets the recommended category 3<br />

architecture from ISO 13849 [26]. In direct comparison to the<br />

reference from ISO 13849, the one for the HRC is also based<br />

on two channels. An inter-process communication (IPC)<br />

mechanism is used to send safety-relevant data from the<br />

application process to the safety control software for<br />



monitoring. This IPC-mechanism uses the safety protocol<br />

specifically developed for the demonstration of Modern C++ in<br />

the context of safety-critical systems. Both evaluation units<br />

cross-monitor each other's results using the same safety<br />

protocol to detect deviations. Actuator monitoring can be<br />

achieved by feeding back the current robot state to the<br />

evaluation units. The proximity sensor system is redundant. In<br />

each evaluation unit the application software and the safety<br />

control software (SafetyMaster) are executed. The SafetyMaster is responsible for monitoring malfunctions like real-time violations and incorrect actuator commands, as well as failures of<br />

individual system entities such as the robot or the proximity<br />

sensor system.<br />

V. SAFETY PROTOCOL<br />

A reliable inter-process communication (IPC) system forms the basis for the correct operation of this HRC application and for achieving the safety measures of Section IV. The IPC mechanism allows the data exchange between the application software process and the safety control process. For a reliable and safe communication, a safety protocol is specified and implemented in Modern C++. This safety protocol is an elementary part of the Modern C++ safety framework<br />

MODBAS-Safe. The safety protocol's design is focused on the<br />

detection and control of common transmission errors, because<br />

there is no guarantee for an error-free data transmission. The<br />

detection of transmission errors and an adequate reaction are important to prevent violations of safety integrity. Important measures for error detection and control in the context of reliable communication for safety-critical systems are presented in DIN 61784-3 and in the literature [3], [12]. These<br />

measures are also applied to the safety protocol developed. A<br />

safety header within a safety frame of the protocol consists of<br />

the following elements: a 32-bit CRC, the message length of<br />

the payload in bytes, a timestamp, a numeric identifier to<br />

distinguish between safety-related and non-safety-related<br />

messages and a message counter. The identifier is also used as<br />

the priority of a frame.<br />

For the transmission of safety-relevant data, established,<br />

standardized protocols such as Ethernet UDP/IP or field bus<br />

systems like CAN or FlexRay are used. These communication<br />

channels are not safe in themselves. For the exchange of safety-relevant data, additional measures on higher OSI layers are<br />

mandatory. The majority of real-time communication protocols<br />

utilized for safety-critical systems like FlexRay, EtherCAT,<br />

Ethernet POWERLINK or PROFINET are bound to specific<br />

field bus systems. In contrast to this, bus-independent safety<br />

protocols like openSAFETY or End-to-End-Protection (E2E)<br />

known from AUTOSAR are available. The safety protocol in<br />

development is also bus-independent by applying the black-channel principle. The black-channel principle allows the<br />

transmission of safety-relevant data and non-safety data over<br />

the same communication channel. In general, the safety<br />

protocol in development shall support the following use-cases:<br />

- Reliable and safe transmission of sensor data and actuator commands.<br />
- Monitoring of sensors and actuators for real-time violations and connection losses.<br />
- Cross-monitoring of computed results for system entities in multi-channel architectures.<br />
- Rapid extension with additional communication nodes for extended visualization and diagnosis (passive read-only nodes).<br />

Based on these use-cases, the Modern C++ implementation shall fulfill the following requirements:<br />

- Easy-to-use API: The API for the application developer that uses the safety protocol shall be as simple as possible to prevent incorrect usage. The creation, transmission and reception of message objects shall be unambiguous.<br />
- Compile-time checks: Frame and packet lengths shall be configurable only at compile time. Used types and configurations shall be checked at compile time to maximize reliability and prevent errors from the earliest phase of application development.<br />

Several nodes use the safety protocol and communicate<br />

with each other over the same communication channel, sharing<br />

safety-critical and non-safety-critical data. Nodes can be either<br />

individual processes running within the operating system that<br />

communicate over a virtual network device (IPC) or<br />

distributed, embedded systems that access the same physical<br />

communication channel. In this work, the safety protocol is<br />

based on Linux SocketCAN. SocketCAN allows the usage of<br />

CAN Flexible Datarate (CAN FD) both for communication via<br />

virtual interfaces as well as for real CAN channels. A switch between virtual interfaces, in a simulation for instance, and real physical interfaces is possible without code adaptations.<br />

The safety control software assembles information about the system entities it interacts with by using the safety protocol and<br />

thus is able to detect malfunctions. The messages sent from the<br />

application process to the safety control process are event-based and time-based. An example of event-based messages is<br />

the cyclic transmission of the robot's state. As soon as the<br />

application receives the current robot's state from the UR3, an<br />

event message is sent to the safety control. If the application<br />

process does not receive data, no event message is transmitted<br />

to the safety control. Consequently, the safety control process<br />

will detect a timeout / real-time violation if a specified deadline<br />

is not met. The same applies to the cyclic transmission of the<br />

distances to obstacles measured by the proximity sensor<br />

system. If no samples are received in the application process,<br />

no event message is sent to the safety control process - a real-time violation is detected by the safety control process.<br />

VI. APPLYING MODERN C++<br />

Modern C++ as a programming language is used to realize<br />

both the evaluation unit application software and the safety<br />

control software. Before the development of the concrete<br />

application and safety control, patterns for the development of<br />

safety-critical software are assembled into one Modern C++<br />

framework named MODBAS-Safe. MODBAS-Safe is a safety<br />

framework implemented in Modern C++ (C++11 and C++14).<br />

This framework is developed under consideration of<br />

programming and development guidelines for safety-critical<br />



software. Particularly this includes guidelines from IEC 61508-3 as a basis and recommendations from established<br />

programming guidelines such as MISRA C++, JSF AV C++,<br />

HIC++, the NASA JPL guidelines as well as CERT C++ to<br />

develop a high-quality safety-framework in Modern C++. Also,<br />

current developments in the Modern C++ community like the<br />

C++ Core Guidelines and the methodology of defining a<br />

superset instead of a language subset are considered [33], [31].<br />

Keeping in mind these guidelines during development helps to<br />

maximize reliability, maintainability, readability, portability<br />

and robustness.<br />

Figure 4: Modern C++ STL features and idioms that facilitate high-quality embedded software development.<br />

This framework provides the following generic features independent of any application:<br />

- An easy-to-use safety protocol to exchange safety-critical and non-safety-relevant data.<br />
- Application monitoring: By utilizing the developed message-based monitoring and safety protocol, applications can be monitored.<br />
- Real-time monitoring: Timeout / deadline violation detection.<br />
- Fault management: Persistent fault storage, logging, fault handling, safety-function execution and fault-detection messaging.<br />

In relation to the concrete application these features can be used and configured. The intent of MODBAS-Safe is to provide a collection of proven-in-use and verified solutions in C++11 and C++14, applicable for safety-critical software applications in the domains of automated driving and human-robot collaboration, that speed up development and at the same time boost quality measures such as reliability and flexibility. Recurring challenges for safety-critical software development, such as developing reliable communication, modeling real-time constraints and fault handling in code, or transforming safety requirements directly into C++ code, can be solved efficiently by using the modules of MODBAS-Safe.<br />

The central point during development was to create a simple-to-use safety framework completely written in C++, focusing on specific language features and generic parts from the current standards and the STL implementation that maximize readability, maintainability and thus effectiveness. At the same time, MODBAS-Safe hinders the use of idioms that may lower readability or raise overall complexity and verification effort (static code analysis and testing), by providing reusable generic, compile-time configurable software modules. In a nutshell, the design rule of making interfaces easy to use correctly and hard to use incorrectly is promoted [22]. This makes certain unsafe C++ features for embedded programming seem less relevant. MODBAS-Safe was developed with the following constraints and requirements kept in mind, to maximize reliability in the first place and to lower verification effort in consequence:<br />

- No RTTI (`-fno-rtti` option for gcc and clang) and no virtual methods: Virtual methods can lower efficiency, especially when called with a high frequency. In addition, RTTI raises difficulties for WCET estimations. Dynamic polymorphism in general undermines certain aspects of static code analysis [27]. From the experience made during research and development, the additional complexity introduced by RTTI is not worth the benefit of using it.<br />

No exceptions (`-fno-exceptions` option for<br />

gcc and clang): The same applies for exceptions as for<br />

RTTI. WCET estimations are not that simple and rely<br />

on the application. From the experiences made during<br />

development of the HRC system the benefit<br />

introduced by using C++ exceptions in embedded<br />

software is not worth the additional effort needed for<br />

verification and WCET estimation [10]. But<br />

exceptions are not only a problem because of the<br />

hidden control path. Using exceptions raises memory<br />

consumption [21].<br />



No compiler optimizations: Admittedly, the full performance and power of the C++ programming language rely on compiler optimizations. At the same time, optimizations are a major challenge for developing safety-critical software. For the demonstration use-case, compiler optimizations are disabled because it is easier to inspect and verify what the compiler generates.


No dynamic memory used within the application: During runtime, no dynamic memory allocations and de-allocations shall occur within the code [10].

Full, pedantic warnings and warnings as errors (`-Wall -Wextra -Wpedantic -Werror` options): From the earliest stage of development, the complete code is compiled with full warnings and warnings-as-errors enabled, which maximizes reliability by detecting, for instance, unused or uninitialized variables. The warnings-as-errors flag (`-Werror`) can also help detect faulty optimizations in case compiler optimizations are enabled.

In general, meeting the constraints above is not mandatory for every embedded software project developed in C++; they should be seen as guidelines recommended on the basis of the experience gained and the literature cited. Potentially, everything can be used. However, it must always be clear which additional effort, complexity and side-effects the use of a certain idiom introduces, and whether the benefit is worth the verification effort. MODBAS-Safe instead focuses on language features that provide the most effectiveness. Such language features with a small footprint but high effectiveness for safety-critical software development in the HRC domain are depicted in Figure 4. MODBAS-Safe is based on the three C++ paradigms shown in Figure 4 that make the framework effective: generic programming using templates, object-orientation, and the use of C++11 STL features such as `std::array`, `std::chrono`, `std::tuple` and `<type_traits>`.

MODBAS-Safe itself consists of generic parts based on these C++ language features and idioms shown in Figure 4. For a concrete application, these generic parts must be instantiated and configured accordingly before compilation. Depending on the application, three steps must be configured for the integration:

1. Definition of a periodic update rate as well as the scheduling priority and policy of module `SafetyMaster`. With this period the safety control software is invoked periodically.

2. Implementation of individual monitoring units that fulfill a certain monitoring function in conformance with the safety specification. These monitoring units are collected within the `SafetyModules` container and called for monitoring in each update step.

3. Definition of possible malfunctions in the module `FaultManager` and of an appropriate reaction (e.g. execution of a safety-function or fault-handling).

In reference to the first point, Listing 1 shows the instantiation, initialization and cyclic update of one `safety_master` object to realize fault and failure monitoring by utilizing the safety protocol. First, the CAN FD interface `"vcan1"` is passed at construction. With a call to `SafetyMaster::init()` the POSIX scheduling priority and policy are set. Both parameters are specified by template parameters in the backend and checked for valid ranges of the parametrized policy and priority at compile-time with the C++11 feature `static_assert()`.

    SafetyMaster safety_master{"vcan1"};

    int main() noexcept {
      AR::boolean ok = safety_master.init();
      while (ok) {
        safety_master.update();
        ok = safety_master.is_ok();
        safety_master.idle();
      }
      return 0;
    }

Listing 1: Initialization of the SafetyMaster node for monitoring.

During startup, the

communication binding is also being initialized. In the specific<br />

use-case SocketCAN is used to transmit and receive messages.<br />

Within a loop `SafetyMaster::update()` accesses the<br />

communication binding first to receive incoming safety-related<br />

messages and routes them to the registered monitoring units. On an incoming message, the `safety_master` object broadcasts the message to all registered monitoring units of this application; each monitoring unit decides on its own whether the message is relevant and, if so, processes it further. If no messages are pending to route or process, the `safety_master` is still in charge of triggering all registered monitoring units periodically. With a call to `SafetyMaster::idle()` the single-threaded process sleeps for the configured period to achieve a fixed time-step. The sleep is internally implemented with `clock_nanosleep()`, using `CLOCK_MONOTONIC` as the clock.

Each monitoring unit is based on the reception of messages using the safety protocol of the safety framework. With event-based messages, timeouts can be detected. Periodic messages holding the current actuator command as data can be used to detect data-integrity violations. Application messages are modeled as C++ objects containing data; for the user it shall be as simple as possible to create messages for transmission and reception.

Listing 2 shows the description of a safety-related message as an example. The message `RobotDataUpdateMsg` is derived from the base class `SafetyMessage`. This signals to the backend / stack implementation of the safety protocol that this is a safety-related message and which checks must be performed on the structure / class with the help of the C++11 header library `<type_traits>`, which allows type checks and transformations at compile-time. First, the backend of the safety protocol implementation checks with the help of the C++11 feature `static_assert()` that a



given type of the message object represents a *concrete* object and not, for instance, a pointer. Thus, passing a pointer to the `SafetyProtocol::Send()` method will raise an error at compile-time. Also, the priority `kId` can be checked at compile-time to be within the valid range for safety-critical

messages. All safety-relevant messages shall have a priority in the range of 0 (highest) to 100 (lowest), for instance. In a further step, `std::conditional`, available in `<type_traits>`, is used to select the type of frame to be sent. If the application-specific message is derived from `SafetyMessage`, `std::conditional` yields the type `SafetyFrame`, which includes the `SafetyHeader`.

    struct RobotDataUpdateMsg : public SafetyMessage {
      /// High priority of safety-related message
      static constexpr auto kId{10U};
      /// Send the robot joints for visualization purposes
      std::array<float32, 6U> joints_;
      /// Send the TCP speed for actuation-monitoring
      float32 tcp_speed_;
    };

Listing 2: Description of safety message objects. The safety protocol implementation checks for constraints at compile-time.

If the message type is not derived from `SafetyMessage`, a `StdFrame` (non-safety) is selected as the frame to be sent. This selection happens automatically at compile-time in the backend of the safety protocol. Based on the application-specific message declared, `static_assert()` provides a compile-time check that the message does not exceed the maximum transmission unit (MTU) of the communication binding used by the safety protocol to send and receive frames.

    // monitoring the robot update interval for real-time violations
    RTMonitor<FaultManager> robot_data_rt{RobotDataUpdateMsg(), 50ms, kUpdatePeriod};

    // monitoring the sensorhead update interval for real-time violations
    RTMonitor<FaultManager> sensorhead_rt{SensorheadUpdateMsg(), 100ms, kUpdatePeriod};

Listing 3: Description of monitoring objects in C++.

Application messages such as the example in Listing 2 can be used to realize real-time violation / timeout detection. Listing 3 shows two examples of describing real-time monitoring objects in C++. Real-time constraints are directly transformed into C++ with the template class `RTMonitor` provided by MODBAS-Safe. The first template parameter of `RTMonitor` specifies the module to inform if a violation occurs; in this case, the `FaultManager` is the one to inform about a real-time violation, and it handles any violation as configured by the user. As the first constructor parameter, the message to listen for is passed to `RTMonitor`. As soon as this message is received by `safety_master`, the timeout timer is reset. The second constructor parameter specifies the deadline within which a message of this type must be received. The third constructor argument is a constant expression (constexpr) value specifying the period at which the `RTMonitor` object is updated. As can be seen from the listing, all time units are defined with C++11 `std::chrono` to make time conversions and measurements simple. In addition, C++14 `std::chrono_literals` allows time units to be used directly in C++ (s, ms, us), which enhances readability.

Individual monitoring objects, as shown in Listing 3, are assembled into a safety-modules collection and accessed by the `safety_master` object in each update step and in case of a notification. The safety-modules collection is easily extensible through the use of C++11 variadic templates. Within this variadic template class, named `SafetyModules`, a C++11 `std::tuple` contains all the monitoring units defined. Besides the real-time monitoring units depicted in the example, an additional unit for actuation monitoring could be registered with one line of code. The C++11 STL template class `std::tuple` is a heterogeneous container of static size. The `SafetyMaster` accesses this tuple object on each update call and iterates over all monitoring units within the container. The same pattern of having a tuple of units to manage is also applied to the `FaultManager`, which is in charge of executing safety functions in case of a fault / failure reported by one of the monitoring units and of transmitting fault messages over the safety protocol in order to inform other nodes about an active safety function.

VII. CONCLUSION<br />

Modern C++ for innovative safety-critical applications in the fields of human-robot collaboration and highly automated driving enables simple, holistic views of the system under development by leveraging modern language features and the multi-paradigm methodology. In these domains of research and development, it is of great importance to get fast feedback and to get things done. Within a very short amount of time, the embedded application code and the safety control in the HRC domain were completely written in Modern C++, demonstrating its efficiency. Driven by the superset-of-a-subset methodology, the safety framework MODBAS-Safe provides generic, simple solutions in Modern C++. On the one side, this leads to flexibility; on the other side, C++ allows close alignment with safety standards like IEC 61508 and ISO 13849. Generic programming / C++ templates and C++11/C++14 (STL) features such as static_assert(), chrono, variadic templates, std::tuple, std::array, decltype, type_traits and user-defined literals are used to transfer mental models, functional and



safety requirements directly into C++ code. Moreover, the toolchain of and around the C++ programming language supports this efficient development workflow for producing safety-critical embedded software rapidly. Compilers like clang or gcc, as well as tools such as clang-tidy, act as a first stage of static code analysis and enhance reliability from the earliest development phase. Tools like the ROS middleware, along with C++, support the developer in rapid prototyping. ROS, while not part of the productive embedded code, is used for SiL and HiL simulation, verification, tracing and replay with rosbag, and visualization in ROS rviz. The result of all these measures is simple, readable, maintainable and thus high-quality embedded code for an HRC application in the shortest amount of time, even under the additional load of satisfying safety requirements.

VIII. FUTURE WORK<br />

MODBAS-Safe is currently used for the development of control and monitoring functions in the field of highly automated driving. However, two major challenges need to be solved when using C++ for developing productive safety-critical embedded code. First, an extensive analysis of C++ compiler optimizations is needed. Second, template instantiations must somehow be made clearly visible to the application developer, to simplify verification and to avoid unintended behavior: which template instantiation is called at which time. Another challenge that arises when developing embedded software in Modern C++ is security. Secure code in Modern C++ will take a major role during development, since growing connectivity and security issues are likely to have an impact on functional safety.

REFERENCES<br />

[1] Binkley, David W. (1997). "C++ in safety critical systems". In: Annals of Software Engineering 4.
[2] DGUV (2008). BGIA-Report 2/2008 - Funktionale Sicherheit von Maschinensteuerungen - Anwendung der DIN EN ISO 13849. Ed. BGIA.
[3] DGUV (2014). Grundsätze für die Prüfung und Zertifizierung von "Bussystemen für die Übertragung sicherheitsbezogener Nachrichten". Tech. rep. Deutsche Gesetzliche Unfallversicherung.
[4] DGUV (2015). DGUV Information 209-074 - Industrieroboter. Tech. rep. DGUV.
[5] Douglass, Bruce P. (1998). Safety-Critical Systems Design. Tech. rep. i-Logix.
[6] Dürr, Klaus and Jochen Vetter (2014). Auf die Applikation kommt es an.
[7] Dunn, William R. (2003). "Designing Safety-Critical Computer Systems". In: IEEE Computer Society. URL: https://pld.ttu.ee/IAF0530/01244533.pdf.
[8] Emshoff, Bill (2014). Using C++ on Mission and Safety Critical Platforms. CppCon. URL: https://channel9.msdn.com/Events/CPP/CPP-Con-2014/010-Using-C-on-Mission-and-Safety-Critical-Platforms.
[9] Holzmann, Gerard J. (2006). "The Power of 10: Rules for Developing Safety-Critical Code". In: Computer 39.6, pp. 95-97. DOI: 10.1109/MC.2006.212.
[10] Goldthwaite, Lois (2004). Technical Report on C++ Performance. Tech. rep. ISO/IEC. URL: http://www.open-std.org/Jtc1/SC22/WG21/docs/papers/2004/n1666.pdf.
[11] Grimm, Rainer (2014). Embedded programming with C++11.
[12] Hannen, Heinrich-Theodor (2012). "Beitrag zur Analyse sicherer Kommunikationsprotokolle im industriellen Einsatz". PhD thesis. University of Kassel.
[13] Huelke, Michael (2014). Kollaborierende Roboter - Zum Stand von Forschung, Normung und Validierung. URL: http://www.suqr.uni-wuppertal.de/fileadmin/site/suqr/Kolloquium_Download/Huelke_2014-01-14.pdf.
[14] IEC (2010). IEC 61508 - Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 3: Software requirements. Tech. rep. IEC.
[15] ISO (2011). Industrieroboter - Sicherheitsanforderungen - Teil 1: Roboter (ISO 10218-1:2011). Tech. rep. International Organization for Standardization.
[16] ISO (2016). ISO/TS 15066 - Robots and robotic devices - Collaborative robots. Tech. rep. ISO.
[17] Kalinsky, David (2005). "Architecture of safety-critical systems". Embedded Systems Programming.
[18] Knight, J. C. (2002). "Safety critical systems: challenges and directions". In: IEEE Software Engineering.
[19] Kormanyos, Christopher (2013). Real-Time C++. Efficient Object-Oriented and Template Microcontroller Programming. DOI: 10.1007/978-3-642-34688-0.
[20] KUKA Aktiengesellschaft (2015). Hello Industrie 4.0 - we go digital. KUKA Robots. URL: https://www.kuka.com/-/media/kuka-corporate/documents/press/broschuereindustrie40de.pdf.
[21] LLVM Compiler Infrastructure (2016). LLVM Coding Standards.
[22] Meyers, Scott (2014). The Most Important Design Guideline. URL: https://www.youtube.com/watch?v=5tg1ONG18H8&t=1729s.
[23] Müller, Meinard (2007). Information Retrieval for Music and Motion.
[24] Ostermann, Björn (2014). "Entwicklung eines Konzepts zur sicheren Personenerfassung als Schutzeinrichtung an kollaborierenden Robotern". PhD thesis. Bergische Universität Wuppertal.
[25] Ostermann, Björn, Michael Huelke and Anke Kahl (2010). Von Zäunen befreit - Industrieroboter mit Ultraschall absichern.
[26] Pilz GmbH & Co. KG (2017). EN ISO 13849-1: Performance Level (PL). URL: https://www.pilz.com/de-DE/knowhow/law-standards-norms/functional-safety/en-iso-13849-1.
[27] Reinhardt, Derek W. (2004). "Use of the C++ Programming Language in Safety Critical Systems". Thesis. University of York. URL: https://pdfs.semanticscholar.org/c7d1/ca2b4aade2c7d5a8784dddaf401f17e06853.pdf.
[28] Rolle, Ingo (2013). "Funktionale Sicherheit programmierbarer elektronischer Systeme". In: Funktionale Sicherheit - Echtzeit 2013.
[29] Rossi, Ben (2017). The Fourth Industrial Revolution: Technology alliances lead the charge. URL: http://www.information-age.com/fourth-industrial-revolution-technology-alliances-lead-charge-123465633/.
[30] Schwan, Ben (2013). Kollege Roboter: BMW testet Zusammenarbeit von Mensch und Roboter. URL: https://www.heise.de/newsticker/meldung/Kollege-Roboter-BMW-testet-Zusammenarbeit-von-Mensch-und-Roboter-1972138.html.
[31] Stroustrup, Bjarne (2005). "A rationale for semantically enhanced library languages". In: LCSD. URL: http://www.stroustrup.com/SELLrationale.pdf.
[32] Stroustrup, Bjarne and Kevin Carroll (2006). C++ in Safety-Critical Applications: The JSF++ Coding Standard.
[33] Stroustrup, Bjarne and Herb Sutter (2017). C++ Core Guidelines.
[34] Williams, Stephen (1997). "Embedded Programming with C++". In: Third USENIX Conference on Object-Oriented Technologies and Systems.



Challenges in Virtualizing Safety-Critical<br />

Cyber-Physical Systems<br />

Alessandro Biondi, Mauro Marinoni,<br />

and Giorgio Buttazzo<br />

Scuola Superiore Sant’Anna<br />

Pisa, Italy<br />

{alessandro.biondi, mauro.marinoni,<br />

giorgio.buttazzo}@santannapisa.it<br />

Claudio Scordino and Paolo Gai<br />

Evidence SRL<br />

Pisa, Italy<br />

{claudio, pj}@evidence.eu.com<br />

Abstract — Embedded computing platforms are evolving towards heterogeneous architectures that require new software support for simplifying their usage, optimizing the available resources, and providing predictable runtime behavior for managing concurrent safety-critical applications. This paper describes the main challenges in providing such software support through virtualization techniques, while taking into account safety requirements, security issues, and real-time performance. An automotive application is considered as a case study to illustrate some of the presented concepts.

Keywords — Heterogeneous platforms, embedded computing,<br />

real-time systems, virtualization, hypervisor.<br />

I. INTRODUCTION<br />

The design of computing infrastructures for modern cyber-physical systems faces two major trends that are significantly steering the development process of embedded software. On the one hand, recent years have been characterized by a continuous increase in software complexity to meet ever richer functional requirements and to support new technologies. At the same time, computing platforms are evolving toward heterogeneous designs that integrate multiple components such as multicore processors, general-purpose graphics processing units (GPGPUs), and field programmable gate arrays (FPGAs), which allow power-efficient parallel execution of multiple software systems at the cost of a paradigm shift in their development.

These two trends are increasingly pushing software designers to integrate a higher number of functions on the same hardware platform, typically resorting to methodologies such as component-based software design (CBSD) and also facing the problem of incorporating legacy software. Furthermore, in many industrial fields, integration is considered the most affordable solution to problems related to space, weight, power, and cost (SWaP-C).

Virtualization of computational resources has established itself as a de-facto technique to address these needs while efficiently exploiting the processing power of modern platforms.

Virtualization is typically achieved via hypervisors (also called<br />

virtual machine monitors), which allow executing multiple<br />

software domains upon the same platform, each of them<br />

possibly executing a different operating system (OS). The<br />

domains benefit from the illusion of disposing of a dedicated<br />

computing platform, while in reality the access to the shared<br />

computational resources is regulated by the hypervisor, which<br />

typically offers to the domains sets of virtualized memory<br />

address spaces, CPUs, and possibly peripherals. Nowadays, this<br />

technology is increasingly adopted to realize multi-OS<br />

solutions [22] for mixed-criticality systems, integrating a<br />

mission-critical real-time operating system (e.g., to perform<br />

sensing, control, and actuation tasks), with rich, non-critical<br />

operating systems such as Linux, which exploit a large<br />

availability of drivers, libraries, and connectivity stacks.<br />

Realistic designs possibly also include the integration of legacy<br />

software systems as-a-whole, i.e., with their original operating<br />

system, drivers, and configurations, thus favoring the evolution<br />

of cyber-physical systems towards centralized schemes with<br />

few but powerful computing platforms.<br />

Orthogonally to these major trends, designers of new-generation embedded software cannot neglect safety and security needs, which inevitably affect the functionality provided by virtualization stacks. The former are driven by increasingly stringent legal regulations and certifiability requirements, while the latter are becoming of paramount importance due to the exposure of embedded computing platforms through network connections. The integration of components with different safety and security levels (also known as MILS systems) may pose hazards in guaranteeing key requirements of the critical software such as timing constraints and data integrity and confidentiality. For instance, if no proper isolation mechanisms are provided by the hypervisor, a malfunction or an attack affecting a low-criticality domain may arbitrarily delay the execution of critical tasks, thus compromising the system behavior or strongly jeopardizing its performance.

The joint consideration of all these aspects poses several challenges in the development of suitable virtualization layers. The scope of this short paper is to discuss some of these challenges, with a particular focus on temporal and spatial isolation of software domains, timing predictability, resource



contention, and the management of hardware-based security<br />

technologies.<br />

II. BACKGROUND

A. Hypervisors<br />

The concept of hypervisors dates back to the 1960s [13], but it became significant in the last decade as a fundamental solution to harness the complexity of modern hardware platforms and of the multiple applications executing concurrently on top of them. This need for isolation can take different forms depending on the specific application requirements and the underlying platform.

The platform on which the hypervisor executes is denoted as the host machine, and each virtual machine managed by the hypervisor is called a guest. The two main features on which the classification of a hypervisor is based concern the type of implementation and the abstraction provided to the guest virtual machine. There are two types of hypervisor:

● Type-1, also called native or bare-metal, which runs directly on the host hardware to control it and to handle guest operating systems;

● Type-2, also called hosted, where the hypervisor is provided as an extension of an operating system executing on the host, while the guests run as tasks.

Another element of distinction comes from the API exposed by the host to the generic guest OS:

● In fully virtualized solutions the guest executes transparently and without software modifications, while the hypervisor provides the API to emulate the underlying platform;

● In a paravirtualized implementation the guest is aware of the presence of virtualization; thus it uses an API similar, but not identical, to that of the underlying hardware. This makes it possible to create specific solutions and reduce the overhead.

Due to the advantages of higher flexibility and of requiring no modifications in the guest domains, hardware manufacturers started providing virtualization extensions to support full virtualization, which minimize the overheads resulting from the emulation of the underlying platform.

B. Existing solutions<br />

The wide range of application scenarios and platforms<br />

fostered the creation of a significant number of hypervisors,<br />

each of them with a focus on a subset of the several issues<br />

concerning virtualization. Moreover, the profound interaction<br />

between the hypervisor and the hardware platform leads to a<br />

considerable effort when porting the hypervisor to a new<br />

architecture, also due to the extensive use of specific platform<br />

features to improve performance. The result is a reduced set of<br />

hypervisors available for each particular platform.<br />

Since some application fields, like mainframes, cloud infrastructures, and virtualized network infrastructures, highly benefit from virtualization and rely massively on Linux, several hypervisors pivoting on the latter have been developed. Among the first and most famous is Xen [14], which executes Linux in a privileged domain called dom0. Its wide range of supported platforms is considered one of its main advantages, but also a drawback, because it has led to a considerable codebase. A similar approach is followed by KVM [15], a virtualization infrastructure available in the mainline Linux kernel that turns it into a type-1 hypervisor. Jailhouse [16] is a type-1 partitioning hypervisor, more concerned with isolation than with virtualization, aiming at a small and lightweight hypervisor targeting industrial-grade applications. Like Xen, Jailhouse requires Linux to provide the management interface, which allowed keeping the size of the source code small. Like KVM, it is loaded from a regular Linux system, but once started it takes full control of the hardware and splits the hardware resources into isolated compartments (called cells) that are entirely dedicated to guest software programs (called inmates). One cell runs the Linux OS and is known as the root cell; it is similar to dom0 in Xen, but does not assert full control over hardware resources as dom0 does.

When dealing with embedded systems and their possible safety and security requirements, it is essential to exploit solutions characterized by a small codebase, both for SWaP and for certification issues. Xvisor [17] is a type-1 hypervisor aiming at providing an entirely monolithic, lightweight and portable virtualization solution. Its most appealing characteristic is that it provides full virtualization, and therefore supports a wide range of unmodified guest operating systems. NOVA [18] is an academic hypervisor designed at TU Dresden. It follows the micro-kernel approach and has been developed in the C++ programming language. Another significant feature is its fixed-priority preemptive scheduler with execution-time budgets and priority inheritance. XtratuM [19] is a hypervisor specially designed for real-time embedded systems, providing fixed-priority scheduling and relying on paravirtualization. Fiasco [20] is a hypervisor based on the L4 ABI and is implemented in the C++ programming language. The Fiasco kernel is enriched by a broad set of user-space components, collectively called the L4 Runtime Environment (L4Re). Attempts have been made to exploit the TrustZone security features available on modern ARM processors in hypervisors; an example is the SierraVisor [21] hypervisor. Despite all the effort from these and other projects, there are still significant issues to be addressed before a considerable level of isolation and virtualization can be provided for modern heterogeneous platforms. The next section outlines some of the most significant ones.

III. MAJOR CHALLENGES<br />

A. Achieving effective isolation on multicores<br />

Isolation capabilities are of paramount importance for a<br />

hypervisor to be used within a mixed-criticality system. Two<br />

types of isolation can be identified: spatial and temporal. Most<br />

(if not all) solutions provide support for spatial isolation of<br />

memory spaces, which is typically achieved by means of<br />

memory virtualization leveraging memory management units<br />

(MMU). Temporal isolation is generally realized by reserving<br />



dedicated CPUs to a domain, or by implementing bandwidth<br />

reservation schemes for the CPU time, e.g., by reserving a<br />

budget of execution time that is periodically provided to a<br />

domain by the hypervisor scheduler.<br />
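The budget-based CPU reservation just described can be sketched as follows. This is an illustrative model only (names and the replenishment policy are not taken from any particular hypervisor): each domain receives a budget of execution time per period, and the scheduler suspends the domain once the budget is exhausted.<br />

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of a periodic CPU budget server: each domain is
 * granted `budget_us` microseconds of CPU time every `period_us`
 * microseconds; when the budget is exhausted the hypervisor scheduler
 * suspends the domain until the next replenishment. */
typedef struct {
    uint64_t period_us;      /* replenishment period */
    uint64_t budget_us;      /* budget granted per period */
    uint64_t remaining_us;   /* budget left in the current period */
    uint64_t next_replenish; /* absolute time of next replenishment */
} domain_budget;

/* Charge `used_us` of execution to the domain; returns true if the
 * domain may keep running, false if it must be suspended until the
 * next replenishment. */
bool budget_charge(domain_budget *d, uint64_t now_us, uint64_t used_us)
{
    if (now_us >= d->next_replenish) {          /* a new period began */
        d->remaining_us = d->budget_us;
        d->next_replenish += d->period_us *
            ((now_us - d->next_replenish) / d->period_us + 1);
    }
    d->remaining_us = (used_us >= d->remaining_us)
                        ? 0 : d->remaining_us - used_us;
    return d->remaining_us > 0;
}
```
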

Although these features are fundamental, and in fact are widely supported by open-source and commercial hypervisors, they are not enough to guarantee effective isolation on commercial<br />

off-the-shelf (COTS) multicore platforms. Indeed, even if the<br />

domains access separate memory regions, and execute upon<br />

disjoint sets of CPUs, mutual interference is still possible due<br />

to the implicit contention of architectural resources such as<br />

caches and memory banks. These resources are typically not<br />

under the control of the hypervisor, but rather they are<br />

transparently managed by chip subsystems (e.g., the memory<br />

controller) that in most cases are not conceived to enforce<br />

isolation nor to guarantee timing predictability [5][6].<br />
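Cache coloring, one of the software techniques surveyed in this section, partitions a physically-indexed shared cache by restricting which page frames each domain may receive: the "color" of a frame is derived from the physical-address bits that select the cache set. A minimal sketch follows; the cache geometry values are illustrative, not taken from any specific platform.<br />

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative geometry: 2 MiB 16-way shared L2 with 64-byte lines
 * and 4 KiB pages.  Set-index bits above the page offset determine
 * the page "color"; frames of different colors can never conflict
 * in the shared cache. */
#define LINE_SIZE    64u
#define NUM_SETS     (2u * 1024u * 1024u / (LINE_SIZE * 16u)) /* 2048 */
#define PAGE_SIZE    4096u

#define SET_BITS     11u  /* log2(NUM_SETS)  */
#define OFFSET_BITS  6u   /* log2(LINE_SIZE) */
#define PAGE_BITS    12u  /* log2(PAGE_SIZE) */

/* Colors = set-index bits not covered by the page offset. */
#define NUM_COLORS   (1u << (SET_BITS + OFFSET_BITS - PAGE_BITS)) /* 32 */

/* Color of a physical page frame. */
static inline unsigned page_color(uint64_t paddr)
{
    return (unsigned)((paddr >> PAGE_BITS) & (NUM_COLORS - 1));
}

/* A colored allocator hands a domain only frames whose color falls in
 * its reserved range, statically partitioning the shared cache. */
static inline int color_allowed(uint64_t paddr,
                                unsigned first, unsigned count)
{
    unsigned c = page_color(paddr);
    return c >= first && c < first + count;
}
```

With this mapping, a domain confined to colors 0-3 can never evict lines belonging to a domain confined to colors 4-31, regardless of its memory access pattern.<br />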

Figure 1 - Inter-core interference in accessing a shared level of cache<br />

For instance, consider a quad-core platform with private level-1 caches for each core and a shared level-2 cache, as illustrated in Figure 1. Suppose that a critical real-time operating system is executing upon the first core, while the remaining three cores are dedicated to a general-purpose Linux domain. The execution of the critical domain results in fetching data and code from the main memory, consequently populating the level-2 shared cache (green box in the figure). In parallel, the Linux domain can also populate the same cache, with the result that the content stored by the critical domain can be evicted, provoking cache misses at the next access. This phenomenon may generate large and unpredictable interference across domains, thus breaking isolation by introducing a strong coupling of their timing properties. Conversely, if the Linux domain is subject to an attack or a malfunction such that it floods the system with memory transactions, proper isolation mechanisms should shield the critical domain.<br />

To further complicate the problem, inter-domain interference can also arise when accessing the main memory, e.g., upon cache misses. The access to DRAM memories is subject to highly variable delays that depend on the actual memory location to be accessed and on simultaneous pending memory transactions. Furthermore, DRAM memory controllers generally resort to scheduling algorithms that reorder memory accesses with the aim of improving throughput. While these algorithms provide benefits in the average case, they leave room for pathological scenarios that lead to high worst-case latencies, hence harming system predictability.<br />

In the literature, several clever solutions have been proposed to solve these kinds of issues in non-virtualized multicore systems. Software-based approaches such as cache coloring or cache lockdown [7] can be employed to partition the amount of cache used by a core or, more generally, by a set of software tasks. Reservation of memory bandwidth [5] and bank-aware memory allocators [6] have also been proposed to control the contention in accessing the main memory. Nevertheless, to the best of our knowledge, adequate support for such techniques is limited in commercial hypervisors.<br />

Modica et al. [8] realized effective isolation mechanisms for shared caches and main memories in an open-source hypervisor targeting ARM platforms. The authors developed a new virtual memory allocator that employs cache coloring to statically isolate the amount of shared cache reserved to each domain. Furthermore, a bandwidth reservation mechanism for accesses to main memory has been integrated with the hypervisor scheduler. Their experimental results showed that inter-domain interference can increase the execution time of state-of-the-art benchmarks by up to 50%, while the realized mechanisms can restore isolation at the price of degrading average-case performance.<br />

B. Virtualization of FPGAs and GPGPUs<br />

Heterogeneous platforms that include FPGAs and/or<br />

GPGPUs represent very attractive and powerful solutions to<br />

implement modern cyber-physical systems, but at the same time<br />

they introduce new problems in terms of resource management.<br />

Concerning virtualized systems, FPGAs and GPGPUs should<br />

also be controlled by the hypervisor and made available to<br />

domains in a controlled manner.<br />

Modern FPGAs offer dynamic partial reconfiguration<br />

(DPR) capabilities, which allow reprogramming a portion of the<br />

FPGA area while the rest continues to operate. This feature may be used to virtualize the FPGA area, supporting in time sharing several hardware modules and accelerators whose overall area consumption exceeds the area actually available on the platform. A framework [11] has also been proposed to ensure that the reconfiguration and area contention delays are predictable, thus making the adoption of this technique realistic in the context of critical systems. Static FPGA<br />

virtualization is also possible by controlling its configuration<br />

phases. Unfortunately, no integration within a hypervisor is<br />

today available.<br />

Similarly, work has also been dedicated to the development of<br />

software mechanisms to integrate the advantages of GPGPU<br />

into the virtualization paradigm. Hong et al. [23] provided an<br />

overview of the state-of-the-art of virtualization techniques,<br />

hardware supports, and scheduling mechanisms for multiple<br />

concurrent requests. They also outlined a list of challenges that<br />

still need to be addressed to improve the exploitation of<br />



GPGPUs, ranging from overhead reduction to energy<br />

management, from scalability and space optimization to<br />

security.<br />

Another issue is that modules deployed<br />

onto the FPGA and GPGPUs can typically act as memory<br />

masters on the system bus, hence (i) generating additional<br />

memory interference (e.g., see [10]) that complicates the<br />

problems discussed in the previous section, and (ii) potentially<br />

exposing memories to uncontrolled accesses that may bypass<br />

the spatial isolation. The first problem needs to be addressed<br />

with adequate support, such as specialized software-based<br />

memory bandwidth controllers, or in the case of FPGAs with<br />

the development of hardware bandwidth controllers deployed<br />

onto the FPGA and managed by the hypervisor. The second<br />

problem requires dealing with virtualization techniques and<br />

components such as I/O MMUs.<br />

C. Supporting hardware-based security technologies<br />

Due to the external exposure by means of network and bus<br />

connections, security issues have become central aspects in the<br />

design and development of modern embedded computing<br />

systems. Although a rich set of software-based techniques has<br />

been developed to increase the security level of a software<br />

system, cyber attacks are becoming increasingly complex, defeating most attack mitigation techniques<br />

and/or exploiting incorrect software configurations. With the<br />

intent of providing a robust support to implement security<br />

features, chip makers are moving towards architectures that<br />

offer hardware-based solutions to realize trusted execution<br />

environments (TEEs). TEEs must be strictly isolated from the<br />

normal execution environment and should also have<br />

dedicated computing resources.<br />

One of the most popular of such technologies is TrustZone<br />

developed by ARM. TrustZone provides hardware-based<br />

isolation of two execution worlds: secure, conceived to support<br />

the execution of a TEE, and non-secure, which is provided to<br />

host the execution of a rich (classical) operating system.<br />

TrustZone-enabled chips may also include support for secure<br />

boot, i.e., cryptographic validation of the firmware to be<br />

executed, and cryptographic hardware accelerators. The<br />

introduction of such features poses new challenges when<br />

realizing a security-aware virtualization stack.<br />

First, there is the need to virtualize such hardware-based<br />

security technologies to allow the coexistence of multiple<br />

domains each potentially comprising a TEE running in a<br />

virtualized secure world. Initial attempts in this direction have<br />

been made by Cicero et al. [9], who proposed an open-source<br />

dual-hypervisor solution where two jointly-configured<br />

hypervisors are employed to virtualize secure and non-secure<br />

worlds, respectively, both orchestrated by a monitor firmware<br />

that handles world switches and dispatches interrupt signals.<br />

This solution avoids the existence of a single point of failure<br />

and aims at containing the run-time overhead. Remarkable<br />

efforts have also been spent by Hua et al. [12], who proposed<br />

a centralized solution to virtualize TrustZone by building upon<br />

the Xen hypervisor.<br />

Second, hypervisors should offer the virtualization of<br />

cryptographic hardware resources, possibly guaranteeing strict<br />

integrity and confidentiality of data even in the presence of side-channel attacks. Built-in support for software-based attack<br />

mitigation techniques such as data execution prevention (DEP),<br />

address-space layout randomization (ASLR), and control flow<br />

integrity (CFI) is also desirable. These techniques require careful<br />

attention when integrated with virtualization mechanisms.<br />

Third, in order to support component-based software<br />

design and possibly open environments, hypervisors should<br />

provide software authentication mechanisms also at the level of<br />

domains, paying particular attention to rollback-based attacks.<br />

The authors believe that the list is not limited to the above-mentioned challenges and that security-related aspects will<br />

likely steer the design of future virtualization software.<br />

IV. THE AUTOMOTIVE CASE<br />

As a proof of concept, this section describes a realistic<br />

scenario related to the automotive domain in which<br />

virtualization is applied. The described solution, from the<br />

RETINA project [1], aims at providing an AUTOSAR-compliant<br />

software stack for next-generation automotive<br />

systems. The stack allows the integration of components with<br />

different criticality levels onto modern multi-core SoCs,<br />

reducing the overall time-to-market and manufacturing costs.<br />

At the lowest level, the stack consists of a hypervisor to<br />

enforce isolation (thus, reliability and safety) between the guest<br />

operating systems. The RETINA project relies on Jailhouse [2],<br />

a small and lightweight type-1 hypervisor developed by<br />

Siemens and released as open-source software. The hypervisor<br />

supports both x86-64 and ARM-based platforms, provided that hardware virtualization support is available. Rather than<br />

providing resource virtualization and scheduling (like the<br />

Xen hypervisor), Jailhouse focuses on isolation and resource<br />

partitioning. For this reason, there is no intra-core scheduling<br />

(i.e., each core cannot run more than one guest OS) and<br />

resources are statically assigned to only one guest. This static<br />

approach makes it possible to:<br />

● provide average latencies and jitter similar to bare-metal<br />

solutions, due to the low run-time overhead;<br />

● ease potential certification processes in the future,<br />

thanks to a very small codebase.<br />
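The static-partitioning idea behind this approach can be made concrete with a small sketch. Note that this is NOT the actual Jailhouse configuration format, only a conceptual illustration: every CPU and memory region belongs to exactly one cell, and the hypervisor only has to check that the cells are pairwise disjoint.<br />

```c
#include <assert.h>
#include <stdint.h>

/* Conceptual sketch of static partitioning in the style of Jailhouse:
 * CPUs and memory are assigned to exactly one cell at configuration
 * time, so no intra-core scheduling is ever needed.  This is not the
 * real Jailhouse configuration format, only an illustration. */
typedef struct {
    const char *name;
    uint32_t    cpu_mask;  /* CPUs owned exclusively by this cell */
    uint64_t    mem_base;  /* start of the cell's memory region   */
    uint64_t    mem_size;
} cell_config;

/* Two cells must share neither CPUs nor memory. */
static int cells_disjoint(const cell_config *a, const cell_config *b)
{
    int cpus_overlap = (a->cpu_mask & b->cpu_mask) != 0;
    int mem_overlap  = a->mem_base < b->mem_base + b->mem_size &&
                       b->mem_base < a->mem_base + a->mem_size;
    return !cpus_overlap && !mem_overlap;
}
```
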

On top of the hypervisor, the RETINA project runs two<br />

guest OSs with different criticality levels. The real-time and<br />

safety-critical tasks are run by the ERIKA Enterprise RTOS [3].<br />

ERIKA Enterprise is a tiny RTOS (i.e., a footprint of a few KB)<br />

designed and certified for the automotive market. It is<br />

developed by Evidence Srl and released as open-source<br />

software under a dual licensing model.<br />

The less critical tasks (e.g., HMI, logging, etc.), instead, are<br />

executed on a Linux guest, enhanced with the<br />

PREEMPT_RT real-time patch [4] when needed. The<br />

communication between the two OSs is done by means of a<br />

library exposing an API similar to the one specified by the<br />

AUTOSAR COM standard. The library is meant to be used by<br />

an AUTOSAR Run-Time Environment (RTE) generator<br />

developed by Evidence Srl for its RTOS. The most critical tasks are<br />

run using the SCHED_DEADLINE Linux scheduler [17].<br />
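A SCHED_DEADLINE reservation is configured through the Linux sched_setattr(2) system call; the sketch below only builds the attribute block (the struct layout follows the man page, since glibc does not export it; actually applying it requires the syscall and appropriate privileges, which is why the call itself is left as a comment).<br />

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* SCHED_DEADLINE policy number as defined by the Linux kernel. */
#define SCHED_DEADLINE 6

/* Layout of struct sched_attr as documented in sched_setattr(2);
 * glibc does not export it, so programs define it themselves. */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* ns of budget per period  */
    uint64_t sched_deadline;  /* relative deadline in ns  */
    uint64_t sched_period;    /* reservation period in ns */
};

/* Fill a deadline reservation: runtime <= deadline <= period.
 * Applying it would be: syscall(SYS_sched_setattr, 0, &attr, 0),
 * which needs CAP_SYS_NICE or root. */
void make_deadline_attr(struct sched_attr *a, uint64_t runtime_ns,
                        uint64_t deadline_ns, uint64_t period_ns)
{
    memset(a, 0, sizeof *a);
    a->size           = sizeof *a;
    a->sched_policy   = SCHED_DEADLINE;
    a->sched_runtime  = runtime_ns;
    a->sched_deadline = deadline_ns;
    a->sched_period   = period_ns;
}
```
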



Figure 2 summarizes the main components of the automotive software stack described above.<br />

Figure 2 - Multi-OS automotive software stack developed for the RETINA project.<br />

V. CONCLUSIONS<br />

This paper presented some of the major challenges in providing software support for exploiting modern heterogeneous platforms in complex safety-critical systems consisting of several interacting components with real-time requirements. Virtualization techniques, successfully used to isolate the behavior of software components running on the same processor, are being considered for extension to the management of other architectural resources, such as shared memories, and other computational units, such as FPGAs and GPUs. Issues concerning safety, security, and real-time performance were also discussed and illustrated using a case study taken from the automotive domain.<br />

REFERENCES<br />

[1] RETINA EUROSTARS project, http://retinaproject.eu/<br />
[2] Siemens, Jailhouse hypervisor, https://github.com/siemens/jailhouse<br />
[3] Evidence Srl, ERIKA Enterprise RTOS, http://www.erika-enterprise.com/<br />
[4] The Linux Foundation, Real-Time collaborative project, https://wiki.linuxfoundation.org/realtime<br />
[5] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha, “MemGuard: Memory bandwidth reservation system for efficient performance isolation in multi-core platforms,” in Proc. of the 19th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2013, pp. 55-64.<br />
[6] H. Yun, R. Mancuso, Z. P. Wu, and R. Pellizzoni, “PALLOC: DRAM bank-aware memory allocator for performance isolation on multicore platforms,” in Proc. of the IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), April 2014.<br />
[7] G. Gracioli, A. Alhammad, R. Mancuso, A. A. Fröhlich, and R. Pellizzoni, “A survey on cache management mechanisms for real-time embedded systems,” ACM Comput. Surv., vol. 48, no. 2, Nov. 2015.<br />
[8] P. Modica, A. Biondi, G. Buttazzo, and A. Patel, “Supporting temporal and spatial isolation in a hypervisor for ARM multicore platforms,” in Proc. of the IEEE International Conference on Industrial Technology (ICIT 2018), Feb. 2018.<br />
[9] G. Cicero, A. Biondi, G. Buttazzo, and A. Patel, “Reconciling Security with Virtualization: A Dual-Hypervisor Design for ARM TrustZone,” in Proc. of the IEEE International Conference on Industrial Technology (ICIT 2018), Feb. 2018.<br />
[10] B. Forsberg, A. Marongiu, and L. Benini, “GPUguard: Towards supporting a predictable execution model for heterogeneous SoC,” in Proc. of Design, Automation & Test in Europe (DATE 2017), Lausanne, 2017, pp. 318-321.<br />
[11] A. Biondi, A. Balsini, M. Pagani, E. Rossi, M. Marinoni, and G. Buttazzo, “A framework for supporting real-time applications on dynamic reconfigurable FPGAs,” in Proc. of the IEEE Real-Time Systems Symposium (RTSS 2016), December 2016, pp. 1-12.<br />
[12] Z. Hua, J. Gu, Y. Xia, H. Chen, B. Zang, and H. Guan, “vTZ: Virtualizing ARM TrustZone,” in Proc. of the 26th USENIX Security Symposium, 2017.<br />
[13] R. Adair, R. Bayles, L. Comeau, and R. Creasy, “A virtual machine system for the 360/40,” Technical Report 320-2007, IBM Corporation, Cambridge Scientific Center, May 1966.<br />
[14] Xen project, https://www.xenproject.org/<br />
[15] Linux Kernel Virtual Machine, http://www.linux-kvm.org/page/Main_Page<br />
[16] Jailhouse project page, https://github.com/siemens/jailhouse<br />
[17] J. Lelli, C. Scordino, L. Abeni, and D. Faggioli, “Deadline scheduling in the Linux kernel,” Software: Practice and Experience, 46(6): 821-839, June 2016.<br />
[18] NOVA hypervisor, http://www.hypervisor.org<br />
[19] XtratuM project page, http://www.xtratum.org<br />
[20] Fiasco project page, https://l4re.org/fiasco/<br />
[21] SierraVisor, http://www.openvirtualization.org<br />
[22] PikeOS hypervisor, https://www.sysgo.com/products/pikeos-hypervisor/<br />
[23] C.-H. Hong, I. Spence, and D. Nikolopoulos, “GPU Virtualization and Scheduling Methods: A Comprehensive Survey,” ACM Computing Surveys, vol. 50, pp. 1-37, 2017.<br />



Security In Manufacturing<br />

Closing the Backdoor in IoT Products<br />

Josh Norem<br />

Ass. Staff Systems Engineer<br />

Silicon Labs<br />

joshua.norem@silabs.com<br />

Abstract— It is common for system developers to devote a lot of<br />

time and attention to developing secure products and ensuring<br />

that their devices are difficult to exploit in the field. Unfortunately,<br />

security in the build process and supply chain receives much less<br />

consideration. In this paper, we discuss the various attack vectors<br />

present in the process of designing, building and testing IoT<br />

systems as well as methods for preventing these attacks.<br />

Keywords— IoT, Security, Manufacturing, Assembly<br />

I. INTRODUCTION<br />

It’s well understood that any secure system is only as strong<br />

as its weakest component. Unfortunately, it’s all too common<br />

to forget that every step in the manufacturing process is a<br />

component in that system. While much has been written about<br />

the security of wireless protocols, ICs, and deployed systems,<br />

securing the manufacturing process for those systems is often<br />

forgotten.<br />

To illustrate this, let’s examine how we might attack an<br />

embedded system. In our case we’ll use a smart lock as an<br />

example. First, it’s important to note that if we are serious about<br />

attacking this system, we probably don’t want to compromise<br />

just one lock. We want to create a systematic exploit that can<br />

be used against any lock and then sold to others who want to<br />

bypass one specific lock in the field.<br />

The manufacturer of this lock has anticipated our attack and<br />

spared no expense creating a secure product. From multiple<br />

code reviews, to anti-side-channel-attack hardware, to<br />

extensive penetration testing, the product is well designed and<br />

well protected. This would be a problem if we were going to<br />

attack the lock itself, but luckily, we have another option.<br />

Instead, we’re going to attack the contract manufacturer (CM)<br />

that assembles and tests the lock. It is almost universally<br />

required for firmware images to be transferred, stored, and<br />

programmed in plain text. All we need to do is bribe one of the<br />

CM employees to give us the image, and then swap it out with<br />

an image we modified. The firmware will be nearly identical,<br />

but with a backdoor we can exploit whenever we wish. The CM<br />

will then be manufacturing fundamentally compromised<br />

devices for us.<br />

Our exploit requires no special hardware and only a<br />

moderate amount of sophistication to develop, which makes it<br />

extremely cheap to create. It also completely bypasses all the<br />

time and effort the manufacturer spent to make their product<br />

secure.<br />

A. Protecting Firmware Integrity<br />

The fundamental problem in manufacturing is that with<br />

current embedded processors it’s very difficult to guarantee the<br />

integrity of a firmware image. If the firmware is programmed<br />

in plain text, we can easily modify it on the test system as shown<br />

in block diagram 1 of Figure 1, where the red marker indicates<br />

code vulnerable to attack.<br />

If the manufacturer decides to encrypt their code and load it<br />

via a secure boot loader, we attack the boot loader, which had<br />

to be stored and programmed in plain text. This is shown in<br />

block diagram 2 of Figure1. If the manufacturer uses external<br />

test hardware to verify the firmware after it’s programmed, we<br />

attack both the firmware and the code that checks it, as shown<br />

in block diagram 3 of Figure 1. No matter how many layers are<br />

added, we ultimately reach something that had to be<br />

programmed in plain text and can be attacked.<br />

Figure 1: Points of attack in firmware programming<br />



It’s also worth noting that manufacturing is not the only time<br />

code can be modified. For example, an exploit that results in<br />

arbitrary code execution becomes much more valuable if it can<br />

permanently install itself by reprogramming the device. A<br />

complete solution to the problem of code integrity in<br />

manufacturing also addresses other sources of firmware image<br />

corruption.<br />

B. Protecting Firmware Confidentiality<br />

In addition to ensuring that a system is programmed with the<br />

intended firmware, it may sometimes be necessary to protect<br />

the confidentiality of that firmware. For example, if there is a<br />

proprietary algorithm we want to ensure competitors don’t have<br />

access to, we need to ensure that the code can’t be obtained by<br />

simply copying a file from our CM test/programming system.<br />

Implementing firmware confidentiality can be done in a<br />

variety of ways and benefits from other hardware-based<br />

security features. However, any confidential boot loading<br />

process that takes place at an untrusted CM will ultimately<br />

follow the same pattern. First the device is locked so that an<br />

untrusted manufacturing site can no longer access or modify the<br />

contents of the device. Then the device performs a key<br />

exchange with a trusted server using a private key that the<br />

manufacture never has access to, normally generated on the<br />

device after it is locked. Once the key exchange is complete,<br />

information can be passed confidentially between the trusted<br />

server and device.<br />

Confidentiality requires integrity. If an attacker can modify<br />

the device’s firmware to generate a known private key, then<br />

they can trivially decrypt the image sent to that device.<br />

While this paper focuses on providing firmware integrity,<br />

which is only one of the components needed to provide<br />

firmware confidentiality, more information on firmware<br />

confidentiality can be obtained from many sources including<br />

the author of this paper.<br />

C. Secure Debugging<br />

Another historic issue in the manufacturing process is the<br />

ability to diagnose issues in the field or when products are<br />

returned. For both the IC manufacturer and system developer,<br />

there is a need to gain access to locked devices to perform this<br />

analysis. Historically this has been done by introducing<br />

backdoor access, which is by definition a security hole.<br />

The most common approach to this problem is to allow<br />

unlock + erase such that a device can be unlocked but all flash<br />

is erased during the unlock process. This process has several<br />

drawbacks. First, in some cases access to the current contents<br />

of flash may be needed for debug purposes and will not be<br />

available. Second, this opens a security hole for attacks centered<br />

on erasing and reprogramming the device with modified code.<br />

Other approaches provide an unrestricted back door that<br />

unlocks without erasing, or offer a permanent lock that will<br />

protect the part but makes debug of failure impossible. Both<br />

options have some well understood drawbacks.<br />

II. FIRMWARE INTEGRITY HALF MEASURES<br />

There are some things we can do today to address this<br />

problem and make attacking our manufacturing process more<br />

difficult and less profitable.<br />

A. Sampling<br />

The simplest solution is to implement a sampling<br />

authentication program in another site. For example, we could<br />

pick systems at random (say, one out of every 1000 we build)<br />

and have them sent to an engineering/development site where<br />

we read out the firmware and validate it. If someone tampers<br />

with our CM, this sampling will indicate that this has happened.<br />

To circumvent this check, the attacker must either compromise<br />

our engineering site in addition to the CM, or be able to know<br />

which devices will be sent for verification and exclude them<br />

from the attack.<br />

There is still a technical problem here. To authenticate the<br />

code at our engineering site, we need to be able to read that code<br />

out. Typically, MCUs are locked after production to ensure that<br />

memory cannot be modified or read out, which will also prevent<br />

us from checking that the contents are correct.<br />

It’s important to remember that our method of checking needs<br />

to assume any code on the device may be compromised. For<br />

example, one option is to have a verification function that<br />

computes a simple checksum or hash of the image that we can<br />

read out through a standard interface (UART, I2C).<br />

Unfortunately, that option relies on code that may be<br />

compromised to generate the hash. If an attacker has replaced<br />

our image, they can also replace our hashing function to return<br />

the expected value for a good image instead of re-computing it<br />

based on the contents of flash.<br />

To make this authentication work, we need to find an<br />

operation that can only be accomplished if the entire correct<br />

image is present in the device. One way of doing this would be<br />

to have our verification function simply dump out all the code.<br />

An even better idea is to have our function generate a hash of<br />

the image based on a seed the test system randomly generates<br />

and passes in. Now the attacker can’t simply store a<br />

precomputed hash because the hash value changes based on the<br />

seed. To respond with the correct result, the attacker’s code<br />

must now have access to the entire original image and correctly<br />

compute the hash.<br />
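The seeded-hash check described above can be sketched as follows. FNV-1a is used here purely for illustration (a production design would use a keyed cryptographic MAC such as HMAC-SHA-256), and the function names are ours, not a vendor API: the tester picks a random seed, the device hashes its entire flash image starting from that seed, and the tester compares the reply against the same hash computed over the golden image.<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Device side: hash the whole flash image, mixed with the seed, so a
 * precomputed answer is useless -- the correct reply requires access
 * to the entire original image. */
uint64_t seeded_image_hash(uint64_t seed, const uint8_t *image, size_t len)
{
    uint64_t h = seed ^ 0xcbf29ce484222325ull;  /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= image[i];
        h *= 0x100000001b3ull;                  /* FNV prime */
    }
    return h;
}

/* Tester side: recompute over the golden image and compare. */
int image_authentic(uint64_t seed, uint64_t device_reply,
                    const uint8_t *golden, size_t len)
{
    return device_reply == seeded_image_hash(seed, golden, len);
}
```
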

B. Dual Site Manufacturing<br />

Similar to the sampling program, board assembly and<br />

programming could be carried out at one site and the boards<br />

tested at another. This has the benefit of catching an attack<br />

immediately and preventing any compromised units from being<br />

shipped. It also has all the drawbacks of the sampling method<br />

since it requires some way to authenticate the firmware during<br />

the test phase. It also has a higher cost to implement than the<br />

sampling method.<br />

It may be tempting to program but not lock the device<br />

during manufacturing and then lock after test. This would<br />

eliminate the need for special verification code since the<br />

contents of the device can simply be read out. However, for<br />

most embedded processors, leaving debug unlocked also leaves<br />

programming unlocked. In this case, attackers could<br />



compromise only the second (test) site and simply program<br />

their modified firmware there.<br />

C. Over the Air/Field Updates<br />

Another way to mitigate an attack on a connected system,<br />

as well as several other unrelated security issues, is to<br />

implement and use over-the-air (OTA) updates or some other<br />

periodic style of firmware update.<br />

In most OTA systems, any manufacturing-time<br />

modifications will be discovered or overwritten with the next<br />

OTA update. For a system that regularly rolls out updates,<br />

quarterly for example, the value of a factory compromise is<br />

greatly reduced if it’s only available for that short time. This is<br />

an excellent example of the value of in-field updates for secure<br />

systems.<br />

III. A FULL SOLUTION FOR FIRMWARE INTEGRITY<br />

The fully secure solution to this problem relies on hardware.<br />

Specifically, hardware must contain a hard-coded public<br />

authentication key and hard-coded instructions to use it. For this<br />

purpose, ROM is an excellent solution. Though ROM is<br />

notoriously easy to read through physical analysis, it is difficult<br />

to modify in a controlled and non-destructive way.<br />

Any firmware loaded into the device must then be signed.<br />

Out of reset, the CPU begins execution of ROM and can<br />

validate that the contents of flash are properly signed using the<br />

public authentication key, which is also stored in ROM. If an<br />

attacker attempts to load a modified version of the firmware,<br />

authentication will fail, and the part will not boot. To get a<br />

modified image to boot, the attacker would need to provide a<br />

valid signature for their modified firmware, which can only be<br />

generated using a well-protected private key.<br />

With a real IC, security measures are a bit more complex to<br />

support numerous use cases and to avoid security holes. The<br />

hard-coded public key (Manufacturer Public Key) will be the<br />

same for all devices since it is not modifiable. This makes it<br />

incredibly valuable, providing the root of trust for all devices.<br />

The associated private key (Manufacturer Private Key) must be<br />

closely guarded by the IC manufacturer and will never be<br />

provided to users to sign their own code.<br />

When booted, the Manufacturer Public Key will be used to<br />

validate any code provided by the IC manufacturer that resides<br />

in flash. This gives the ability to ensure that code or other<br />

information provided by the manufacturer is not tampered with<br />

as shown in step 1 of Figure 2.<br />

Users of the device will need to have their own key pair (User<br />

Private Key and User Public Key) for signing and<br />

authenticating their firmware images. To link the User Public<br />

Key into the root of trust, the IC manufacturer must sign the<br />

User Public Key with the Manufacturer Private Key creating a<br />

User Certificate. A certificate is simply a public key and some<br />

associated metadata that has been signed. When booted, the part<br />

authenticates the User Certificate using the Manufacturer<br />

Public Key, as shown in step 2 of Figure 2.<br />

Finally, the user firmware can be authenticated with the User<br />

Public Key in the known-valid User Certificate. This is shown<br />

in step 3 of Figure 2.<br />

Figure 2: A secure boot process<br />

It’s important to note that one additional step is required to<br />

lock a device to a specific user. The system described in steps<br />

1-3 can only ensure that the User Certificate was signed by the<br />

IC manufacturer. This does prevent a random person from<br />

reprogramming the part, but another legitimate customer of the<br />

IC manufacturer could write their code and their legitimately<br />

signed certificate onto the part, and it would boot. This<br />

effectively means that if an attacker can convince the IC<br />

manufacturer they are a legitimate customer and can generate a<br />

signed User Certificate, they can get the device to boot their<br />

code.<br />

To lock a part to a specific end user, the User Certificate<br />

needs to contain not only the User Public Key but also a user<br />

ID so that changing either the key or the ID will invalidate the<br />

certificate. The IC manufacturer will program the user ID into<br />

the Manufacturer Code area where it is protected by the<br />

Manufacturer Public Key. Finally, at boot time, in addition to<br />

verifying the signature of the User Certificate, the boot process<br />

also compares the user ID in the User Certificate against the one<br />

in the Manufacturer Code, as shown in step 4 of Figure 2.<br />



Let’s see what happens when someone attempts to modify<br />

each part of the system.<br />

• If the user identifier in the Manufacturer Code is changed, the signature of that space is no longer valid and the part does not boot.<br />

• If the user ID in the User Certificate is changed, it will not match the one in the Manufacturer Code and the part will not boot.<br />

• If either the User Public Key or the user ID in the User Certificate is modified, the certificate will be invalid and the part will not boot.<br />

• Finally, if the user firmware image is changed, the signature will be invalid and the part will not boot.<br />

We now have a system that will only boot firmware properly signed by the customer who ordered the part from the IC manufacturer. Furthermore, this entire system relies on only two secrets, the Manufacturer Private Key and the User Private Key, both of which are only ever accessed to sign new images and can be extremely well protected due to the infrequency of that process.<br />
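The boot-time checks described above can be sketched in C. This is a structural illustration only: `toy_sign` is a keyed checksum standing in for a real asymmetric signature scheme (e.g. ECDSA), so a single key value here represents a public/private key pair, and all names (`toy_sign`, `user_cert_t`, `secure_boot_ok`) are hypothetical, not from any real boot ROM.<br />

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy "signature": a keyed FNV-1a-style checksum standing in for a real
 * asymmetric signature. It only illustrates the structure of the checks. */
static uint32_t toy_sign(const uint8_t *data, size_t len, uint32_t key)
{
    uint32_t h = 2166136261u ^ key;
    for (size_t i = 0; i < len; i++)
        h = (h ^ data[i]) * 16777619u;
    return h;
}

/* A certificate binds a public key and a user ID to a signature made
 * with the Manufacturer Private Key. */
typedef struct {
    uint32_t user_public_key;  /* hypothetical key material */
    uint32_t user_id;
    uint32_t signature;
} user_cert_t;

/* Boot-time validation mirroring steps 1-4: check the User Certificate
 * against the ROM-resident Manufacturer Public Key, compare the user ID
 * with the one stored in the Manufacturer Code area, then check the
 * firmware image against the User Public Key. */
static int secure_boot_ok(const user_cert_t *cert, uint32_t manufacturer_key,
                          uint32_t expected_user_id,
                          const uint8_t *fw, size_t fw_len, uint32_t fw_sig)
{
    uint8_t body[8];
    memcpy(body, &cert->user_public_key, 4);
    memcpy(body + 4, &cert->user_id, 4);

    if (toy_sign(body, sizeof body, manufacturer_key) != cert->signature)
        return 0;  /* certificate tampered or not from the manufacturer */
    if (cert->user_id != expected_user_id)
        return 0;  /* part is locked to a different customer */
    if (toy_sign(fw, fw_len, cert->user_public_key) != fw_sig)
        return 0;  /* firmware image tampered */
    return 1;      /* all checks passed: boot the firmware */
}
```

Changing any byte of the firmware, the certificate, or the stored user ID makes one of the three checks fail, matching the failure cases listed above.<br />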

A. Additional Mitigations<br />

Of course, even this system requires correct construction of the certificate hierarchy. As defined above, the Manufacturer Private Key is extremely valuable, as it applies to every device the IC manufacturer ever builds. It is also accessed far too frequently, since it is constantly being used to sign User Certificates.<br />

This can be addressed by creating a different Manufacturer<br />

Public/Private Key pair for each die so that compromising one<br />

Manufacturer Private Key only exposes that die. Similarly,<br />

instead of directly signing User Certificates with the<br />

Manufacturer Private Key, a hierarchy of sub-keys can be<br />

developed and used for that operation such that a sub-key can be<br />

revoked by a Manufacturer Code update if compromised. The<br />

details of such schemes are beyond the scope of this paper, but<br />

many things are possible with the fundamental hardware root of<br />

trust established.<br />

IV. PROVIDE SECURE UNLOCK<br />

Providing secure debug unlock turns out to be a simple task.<br />

First, each system developer generates a key pair for debug<br />

access and programs the Public Debug Key onto the device. The<br />

integrity of that key can be established in the same manner as<br />

the user’s firmware, preventing anyone from tampering with the<br />

Public Debug Key. This is shown in step 5 of Figure 3. Each<br />

device is also provided with a unique ID, which is almost<br />

universally available on MCUs today.<br />

To unlock the part, its unique ID is read out (1) and signed<br />

with the Private Debug Key (2), creating an Unlock Certificate.<br />

The Unlock Certificate is then fed into the device for<br />

authentication against the Public Debug Key (3). If it<br />

authenticates, the part is unlocked. This ensures only those with<br />

access to the Private Debug Key may generate an Unlock<br />

Certificate, and only those with an Unlock Certificate may<br />

unlock the part.<br />

Figure 3: A method of secure debug unlock<br />

The Private Debug Key can be stored on a secure server and<br />

be extremely well protected. Note that since the ID used by the<br />

device does not change, the process of generating an Unlock<br />

Certificate happens only once, and then that certificate may be<br />

used to unlock the part as long as is desired.<br />

A benefit of this method is that it generates Unlock<br />

Certificates on a per-device basis. That means it’s possible to<br />

grant unlock privileges to field service personnel or the IC<br />

manufacturer on only the device they are trying to diagnose.<br />

A drawback to this method is that once a valid Unlock<br />

Certificate is created, anyone with access to that certificate may<br />

unlock the device. To mitigate the risk of a valid Unlock Certificate being obtained by an attacker, a counter can be<br />

added to the end of the unique ID so that after an Unlock<br />

Certificate is no longer needed, it can be revoked by<br />

incrementing the counter via a debugger command. This will<br />

cause a new ID to be generated, and the old certificate will no<br />

longer be valid.<br />
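The unlock and revocation flow can be sketched as follows. As before, a keyed checksum (`toy_sign64`) stands in for the real asymmetric signature, so one key value represents the Debug Key pair here, and all function names are hypothetical.<br />

```c
#include <assert.h>
#include <stdint.h>

/* Toy keyed checksum standing in for an asymmetric signature. */
static uint32_t toy_sign64(uint64_t message, uint32_t key)
{
    uint32_t h = 2166136261u ^ key;
    for (int i = 0; i < 8; i++)
        h = (h ^ (uint32_t)((message >> (8 * i)) & 0xff)) * 16777619u;
    return h;
}

/* The effective unlock ID is the device's fixed unique ID with a
 * revocation counter appended; bumping the counter invalidates every
 * previously issued Unlock Certificate. */
static uint64_t effective_id(uint32_t unique_id, uint32_t counter)
{
    return ((uint64_t)unique_id << 32) | counter;
}

/* Server side: sign the effective ID to create an Unlock Certificate. */
static uint32_t make_unlock_cert(uint32_t unique_id, uint32_t counter,
                                 uint32_t debug_key)
{
    return toy_sign64(effective_id(unique_id, counter), debug_key);
}

/* Device side: unlock only if the certificate matches the current
 * effective ID under the Debug Key. */
static int unlock_ok(uint32_t unique_id, uint32_t counter,
                     uint32_t debug_key, uint32_t cert)
{
    return toy_sign64(effective_id(unique_id, counter), debug_key) == cert;
}
```

In a real device, the check corresponding to `unlock_ok` would run against the authenticated Public Debug Key, and the counter would live in non-volatile memory so that revocation survives a reset.<br />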

Finally, it’s important to note that the more devices a private<br />

key gives access to, the more valuable it becomes. As a result,<br />

system developers may want to change debug unlock keys<br />

periodically to limit the number of devices affected in the event<br />

that a Private Unlock Key is compromised.<br />

V. OTHER MANUFACTURING CONSIDERATIONS<br />

A. Test-based Security Holes<br />

It is extremely common for the needs of manufacturing to<br />

result in the intentional or unintentional introduction of security<br />

holes.<br />

An example of an unintentional hole is when a system<br />

manufacturer forgets to disable the debug interface as part of<br />

their board test and ships units with a wide-open debug port.<br />

Even more common are intentional security holes. For<br />

example, a developer may want to provide a way to reopen<br />

debug access after locking and put in a ‘secret’ command or pin<br />

state to unlock the part. If discovered, this allows any attacker<br />

the same unlock capability in the field.<br />

Developers should always take care to implement<br />

manufacturing and test processes in a secure way. This includes<br />



avoiding any intentional security holes and conducting reviews<br />

to catch unintentional ones.<br />

B. The Offshore Process<br />

Any system that firmware images or test programs pass through as part of the manufacturing flow may be vulnerable to attack.<br />

attack. Having a secure system and secure manufacturing<br />

process won’t help if files are transferred to the CM through an<br />

FTP or email server that hasn’t been patched in ten years. Every<br />

place files are stored should be considered part of the system<br />

and secured.<br />

C. Product Development<br />

Just as product manufacturing is often an afterthought, product development is also often overlooked. The<br />

measures discussed in this paper will not be helpful if an<br />

attacker can commit unnoticed changes to the source code<br />

repository. Sometimes this takes the form of an external<br />

penetration (electronically or physically walking into the<br />

building), and sometimes by compromising an employee.<br />

Standard IT system security practices and standard coding<br />

practices play a huge role in preventing this type of attack.<br />

These practices include ensuring all PCs automatically lock<br />

when not in use, requiring user logins to access code<br />

repositories, performing code reviews on all repository<br />

commits, and performing test regressions on release candidates.<br />

VI. CONCLUSION<br />

Security is increasingly important in embedded systems.<br />

Products that were once standalone are now part of a network,<br />

increasing both their vulnerability and value. Much has been<br />

published in the past few years about securing IoT devices<br />

themselves, but not enough attention has been focused on<br />

ensuring security throughout the design and manufacturing<br />

processes.<br />

This paper has demonstrated how historical manufacturing<br />

processes can be easily compromised and has explored some<br />

simple steps that can be taken today in both design and<br />

manufacturing to make attacking a CM or engineering site more<br />

difficult and less profitable. In addition, we have presented<br />

some hardware improvements that can ensure firmware<br />

integrity and provide secure access for failure analysis and field<br />

debugging.<br />

Effective security requires everyone, from silicon vendors to<br />

design firms to OEMs, to work together to ensure that supply<br />

chain security receives the time and attention it deserves. The<br />

good news is that not only are new hardware features being<br />

developed to address these issues, but there are some simple<br />

measures system developers can start implementing today to<br />

create a more secure manufacturing environment.<br />



Rowhammer - a survey assessing the severity of<br />

this attack vector<br />

Norbert Wiedermann ∗ , Sven Plaga ∗<br />

∗ Fraunhofer Institute AISEC, Garching bei München, Germany<br />

{norbert.wiedermann, sven.plaga}@aisec.fraunhofer.de<br />

Abstract—Dynamic random access memory (DRAM) is a cheaply manufacturable main memory architecture widely used in consumer and professional Information Technology (IT) systems. In March 2015, Seaborn et al. presented sample code [1] demonstrating how an already known technical issue of this memory architecture can be exploited by making use of insights from Kim et al. [2]. This work proved that the issue can be abused to compromise current IT systems. Using this knowledge as a starting point, other research teams continued the work; a JavaScript-based approach, for example, was presented by Gruss et al. in [3]. As the presented exploits gained high media attention in the non-scientific press, the rowhammer bug [4] and mitigation strategies [5] are still objects of research. In this paper, the hardware-related circumstances are reviewed and analysed to provide an understanding of the technical aspects which led to this bug. Based on related work, our own test setup was used to comprehend the steps of the attack. The challenges in creating this independent and functional setup based on x86 and Linux are introduced. Additionally, constraints in mounting an attack on current Linux distributions and possible mitigation strategies are presented. This paper summarises the current state of the art and provides insight into this severe though complex attack vector. With the presented results it is possible to estimate future refinements of rowhammer and identify mitigation strategies for one's own designs.<br />

I. Introduction<br />

The term “rowhammer bug” refers to a hardware-related flaw which can be utilised to create bit flips in a computer's main memory. This issue was first discussed in the paper by Yoongu Kim et al. in 2014 [2] and gained great public attention after a blog post by Mark Seaborn [1]. Together with Thomas Dullien, he presented two working exploits and gave a talk at the Black Hat conference in 2015 [6].<br />

The bug can be exploited by performing high-frequency, uncached read accesses to dynamic random access memory (DRAM), which eventually cause a bit flip in a memory area next to the accessed one. This flaw can be abused to gain full memory access, even to privileged areas, and thereby obtain full control over the attacked system. As Seaborn demonstrated with his exploits, this hardware-related flaw can be abused even from user-space applications to escape sandboxes and obtain kernel privileges. From that point, the work was continued by other researchers. One example of these follow-up investigations are the results of Daniel Gruss and Clémentine Maurice, who presented a JavaScript-based approach published in a paper [3] and also presented in a talk at the 32c3 in December 2015.<br />

Due to the fact that DRAM is widely used in consumer as well as professional IT systems, and the presented exploits had a high impact, the topic dominated publications in the technical and even non-scientific press for a while. Nevertheless, the circumstances of this bug are quite complex and require good in-depth knowledge of IT systems, such as the Central Processing Unit (CPU) architecture and its associated chip-set, which are closely linked to main memory management. Although numerous high-level summaries based on the corresponding scientific publications were published by the specialised press, these often showed a certain level of simplification in order to meet the readers' level of knowledge. On the one hand, this led to wide awareness of the topic. On the other hand, however, these simplifications also caused uncertainty on how to classify the related risks.<br />

This motivated the current work, in which the bug was<br />

studied in more detail to comprehend its circumstances and<br />

effects. The following Section II gives a short overview of memory architectures; with this background information, the hardware-related reasons behind the bug are clarified. Further on, related work is discussed, providing context to other research results (Section III). The theory behind the rowhammer attack is discussed in detail in Section IV. Thereafter, Section V describes the test setup used to replicate the results and outlines the test approaches. Building on the results summarised in the preceding section, possible mitigation strategies are presented in Section VI. In Section VII the paper concludes by putting the rowhammer attack into context with other common issues. Finally, a small guideline for risk estimation is provided, followed by ideas for possible follow-up work.<br />

II. Background<br />

This section provides basic background information on how main memory is organised in current IT systems. The following Section II-A describes physical characteristics and how memory cells are organised in hardware. Thereafter, Section II-B outlines the basic logical approaches used by the operating system (OS) to map memory to real hardware.<br />

A. Physical Memory Organisation<br />

Since modern IT systems are quite complex and the executed tasks require large amounts of main memory [7], the market demands cheap memory modules providing sufficient storage capacity. Addressing these requirements, manufacturing processes are further refined, enabling smaller semiconductor scales. By shrinking the semiconductor structures, more components can be placed on a chip, thus providing more memory capacity. On the other hand, the transistors and capacitors in these structures also get smaller, resulting in digital information being represented by tiny electrical capacitances in the range of a few femtofarads [2].<br />



The main memory of computer systems is organised in<br />

a multi-level hierarchy which is employed to manage the<br />

high quantity of memory cells. Each cell is made from a<br />

transistor (T) and a capacitor (C) which is the actual store for<br />

a single bit. Such a basic DRAM memory cell is depicted in<br />

Figure 1.<br />

Figure 1. Basic DRAM memory cell circuitry: a transistor T, gated by the wordline, connects the bitline to the storage capacitor C_Sp (cell voltage U_Sp).<br />

Several of these basic cells are then arranged in rows which<br />

then form matrices. This structure is referred to as a memory<br />

bank. Additional hardware components are necessary to<br />

translate a memory address to select a specific row in such a<br />

bank. A simplified memory bank is illustrated by Figure 2.<br />


Interference between these signals occurs when the signals are toggled between the “low” and “high” states. This causes weak magnetic induction in adjacent wordlines, resulting in further electrical leakage from capacitors attached to transistors controlled by the affected wordlines [8]. To preserve the saved information, the memory rows are refreshed periodically; typical refresh cycles are performed every 64 ms.<br />

To retrieve information, the memory cell is discharged and restored after each access. For management reasons, it is only possible to access the rows of a memory bank by address. As a result, reading information causes the discharge of all capacitors of a memory row. To maintain the stored information during the process, it is written to the row buffer; afterwards, the electrical charge is restored to the original row. This process of discharging and charging can interfere with memory cells of adjacent rows. Consequently, high-frequency read operations on the same memory row within the refresh timeframe can cause significant electrical leakage in the memory cells of adjacent rows. When the amount of charge falls below a certain threshold, this leads to misinterpretations by the row buffer, resulting in flipped bits which are subsequently provided to the requesting software layer. Since the physical fundamentals of the attack require high-frequency accesses to certain memory rows, the technique is referred to as “hammering”, which is also the origin of the attack's name.<br />
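At its core, the hammering access pattern is a tight loop of reads and cache flushes. The following sketch assumes an x86 system where the clflush instruction is available via the `_mm_clflush` intrinsic (with a no-op fallback elsewhere); whether bit flips actually occur depends on the DRAM module, on the two addresses mapping to different rows of the same bank, and on enough iterations completing within a refresh interval.<br />

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#define FLUSH(p) _mm_clflush(p)
#else
#define FLUSH(p) ((void)(p))  /* no cache flush available; illustration only */
#endif

/* Classic two-aggressor hammer loop: read both addresses, then flush
 * them from the cache so that the next iteration hits DRAM again. */
static void hammer(volatile uint8_t *a, volatile uint8_t *b, long iters)
{
    for (long i = 0; i < iters; i++) {
        (void)*a;                 /* read of aggressor row 1 */
        (void)*b;                 /* read of aggressor row 2 */
        FLUSH((const void *)a);
        FLUSH((const void *)b);
    }
}
```

A real exploit would additionally scan victim rows between hammer rounds for flipped bits.<br />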

B. Logical Memory Organisation<br />

Main memory is organised using virtual addresses on the software layer. Virtual address spaces are an abstraction mechanism to separate the memory areas of each application being executed. The mapping from virtual to physical memory addresses is handled by the underlying OS and the Memory Management Unit (MMU) of the hardware. In cases where the physically available memory is smaller than the provided virtual memory, these two instances resolve the mismatch using memory mapping and swapping to other storage, such as hard drives. This mapping is illustrated in Figure 3.<br />


Figure 2. Simplified DRAM memory bank.<br />

Memory chips containing eight memory banks are then placed on the Printed Circuit Board (PCB) of the respective DRAM module. Usually, eight chips are used on a single-rank DRAM module, or 16 chips for a double-rank module.<br />

Due to the small semiconductor scales and physics, the transistors in a memory cell tend to leak the electrical charge of the attached capacitors. Because of the manufacturing process, it is unavoidable that some of the wordlines used to access the rows are routed parallel to each other, which causes interference between them.<br />


Figure 3. Simplified view of memory mapping. Virtual memory is mapped by the operating system (OS) and Memory Management Unit (MMU) to the physically available memory.<br />

Virtual memory layout is organised in different hierarchies,<br />

such as page directories (PDs), page tables (PTs), and pages.<br />



To access this structure, virtual addresses are used with fixed offsets to gain access to each level of this memory management hierarchy. On the lowest level, the physical address of the value is stored. These aspects are depicted in Figure 4, using a 32-bit architecture as reference. On current 64-bit IT systems, memory hierarchies are organised using analogous concepts but with more deeply nested structures.<br />


Figure 4. Mapping of virtual to physical addresses. The management hierarchy uses page directories (PD), page tables (PT), and pages. Sample based on a 32-bit architecture.<br />
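The 10/10/12-bit split of the 32-bit scheme can be reproduced with a few shifts and masks; the type and function names below are illustrative, not taken from any OS:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Decompose a 32-bit virtual address into the indices of the two-level
 * paging scheme: 10 bits of page-directory index, 10 bits of page-table
 * index, and a 12-bit offset within the 4 KiB page. */
typedef struct {
    uint32_t pd_index;  /* bits 31..22: entry in the page directory */
    uint32_t pt_index;  /* bits 21..12: entry in the page table */
    uint32_t offset;    /* bits 11..0:  byte offset within the page */
} va_parts_t;

static va_parts_t split_va(uint32_t va)
{
    va_parts_t p;
    p.pd_index = (va >> 22) & 0x3ffu;
    p.pt_index = (va >> 12) & 0x3ffu;
    p.offset   = va & 0xfffu;
    return p;
}
```

On 64-bit systems the same idea applies with more levels, e.g. four 9-bit indices plus a 12-bit offset for x86-64 4-level paging.<br />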

The final mapping of physical addresses to actual cells on the memory module depends on the architecture of the CPU. Each generation features different approaches to optimise this mapping. Furthermore, the underlying CPU architecture also specifies the applied caching schemes. The algorithms implementing this mapping and caching are not publicly documented by the CPU vendors; hence, reverse engineering is inevitable to understand them.<br />

Seaborn started off with Intel's Ivy Bridge and Sandy Bridge CPU architectures as test platforms. While the Ivy Bridge architecture contained somewhat more complexity, he successfully reverse engineered the mapping for Sandy Bridge and documented his results in [9]. Further insights on how the memory mapping is conducted are documented in [10], also covering the Ivy Bridge architecture. Additional CPU architectures were reverse engineered by Pessl et al. in [11]. Approaches to circumvent the various levels of caches on the CPU were researched by Hund et al. in [12]: by measuring timing differences of memory accesses, the actual mapping can be deduced. Invoking the clflush instruction to clear the caches forces the hardware to reload the accessed row. An algorithm describing such timing-based analysis is presented by Liu in [13].<br />

This detailed knowledge of how to find the location of specific physical addresses in DRAM is essential in order to mount a rowhammer attack. Based on the documented memory mappings and the algorithms to circumvent caches, the physical addresses of memory rows located adjacent to each other can be retrieved and correlated to virtual addresses. Thus, it is possible to identify so-called aggressor rows next to the targeted victim row. The aggressor rows must satisfy the following requirement: same bank, but a different row than the victim row. Without the reverse engineered memory mapping, or without sufficient knowledge about the target platform's memory organisation, the rowhammer attack can only be performed by randomly selecting virtual addresses. Applying algorithms like the one presented by Liu [13], these aspects can be reverse engineered, making the attack more efficient. For example, after retrieving the specific mapping of virtual addresses to physical memory rows, attacking a victim row with an adjacent and a subjacent aggressor row becomes possible.<br />

III. Related Work<br />

Corruption of information stored in random access memory (RAM) is not a new phenomenon. These issues have been well known since Intel introduced its first commercial DRAM chips [14]. Traditionally, these drawbacks were understood and handled as a reliability issue. Usually, memory corruption occurs on a random basis, caused by environmental influences such as radiation or significant variations in temperature [15], [16]. The reliability of main memory can be increased by using specialised memory modules: employing error correction code (ECC) memory can help to reduce the risk of data corruption, since it is capable of correcting single-bit errors. An error caused by two corrupt bits can at least be detected and usually results in a system crash.<br />
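The single-bit-correcting behaviour can be illustrated with the classic Hamming(7,4) code, a toy relative of the codes used in ECC modules (which typically protect 64-bit words and add an overall parity bit, SECDED, so that double-bit errors are detected rather than miscorrected):<br />

```c
#include <assert.h>
#include <stdint.h>

/* Hamming(7,4): encode 4 data bits (d1..d4 = LSB..bit 3) into a 7-bit
 * codeword laid out as positions 1..7 = p1 p2 d1 p3 d2 d3 d4. */
static uint8_t hamming74_encode(uint8_t d)
{
    uint8_t d1 = d & 1, d2 = (d >> 1) & 1, d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;   /* covers codeword positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;   /* covers codeword positions 2,3,6,7 */
    uint8_t p3 = d2 ^ d3 ^ d4;   /* covers codeword positions 4,5,6,7 */
    return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) |
                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

/* Decode a 7-bit codeword, correcting at most one flipped bit. The
 * parity checks form a syndrome that equals the position of the error. */
static uint8_t hamming74_decode(uint8_t c)
{
    uint8_t b[8];
    for (int i = 1; i <= 7; i++)
        b[i] = (c >> (i - 1)) & 1;
    int syndrome = (b[1] ^ b[3] ^ b[5] ^ b[7])
                 | ((b[2] ^ b[3] ^ b[6] ^ b[7]) << 1)
                 | ((b[4] ^ b[5] ^ b[6] ^ b[7]) << 2);
    if (syndrome)
        b[syndrome] ^= 1;        /* flip the erroneous bit back */
    return (uint8_t)(b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3));
}
```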

Coupling effects between bitlines and wordlines were researched by Redeker et al. in a paper presented in [8] (2002). The effects described in this work are also present in the rowhammer bug, where coupling effects between wordlines add to the charge leakage. In extreme conditions, such as several thousand row accesses, bit flips are caused.<br />

Environmental influences have also been exploited to affect the reliability of memory. Govindavajhala et al. presented a paper [17] (2003) in which they applied strong temperature variations to memory modules. They used a custom-designed adapter to influence and control the temperature, resulting in flipped bits. This work demonstrated how to abuse randomly flipped bits to escape virtual machine environments, such as Microsoft .NET or the Java Virtual Machine (JVM).<br />

Until then, errors in DRAM had been investigated predominantly in laboratory setups with controlled environments. Further research revealed, however, that data corruptions are day-to-day business for the operators of large-scale data centers. Schroeder et al. used measurement data from commodity server operators collected over a period of 2.5 years to research DRAM errors under real-world conditions. The findings are presented in [18] (2009). Schroeder was able to clarify two aspects of the state of the art at the time. First, it was identified that temperature changes have less effect on the integrity of main memory than assumed. Second, error rates in DRAM are significantly higher for real-world applications than documented by scientific research.<br />

Another large study on the reliability of DRAM was conducted by Sridharan [19] (2012). Using a dataset comprising 11 months of measurements for a data cluster, he was able to identify typical reasons why memory modules stop working. Nevertheless, both studies emphasise the reliability of main memory as an essential requirement for IT systems.<br />

Researching a wide range of different DRAM memory modules from three large manufacturers led to a publication with significant insights. In their work, Kim et al. [2] (2014) identified how bits in adjacent memory cells can be affected on purpose. Using certain access patterns, they were able to cause targeted memory corruptions and hypothesised how to apply this as an attack vector. The name “rowhammer attack” was derived from the high-frequency access pattern on adjacent rows in RAM.<br />



Building on these research results, Seaborn presented Proof-of-Concept (PoC) implementations of two exploits. The first PoC showcased how to use the rowhammer attack to cause targeted bit flips to escape Google's isolated Native Client (NaCl) sandboxing environment. The second PoC is capable of gaining root privileges on a system [1] (2015). The PoCs by Seaborn made use of instructions to circumvent caching in modern CPUs. Gruss et al. continued this work and ported the PoCs presented by Seaborn to JavaScript [3] (2015). In such higher-level languages there is no possibility to directly influence caching. The presented JavaScript approach proved that cache eviction is sufficiently fast if the RAM access patterns are adapted and optimised.<br />

The combination of smaller semiconductor scales and malicious access patterns opened a new field of possible attacks on current IT systems. The basics have been researched and documented by various research teams, and finding novel schemes preventing or mitigating such attacks was identified as a new challenge. The work by Seyedzadeh [20] (2017) addressed this open issue by proposing approaches to identify and mitigate crosstalk in DRAM. Several encoding techniques were researched and compared in this paper. By applying them in upcoming memory modules, the influence of the malicious access patterns of the rowhammer attack could be limited.<br />

On the other hand, there are still a lot of legacy systems in operation. Providing a suitable fix for them is the objective of the research findings published by Brasser et al. [5] (2017). This work follows an idea proposed by Gruss and suggests a separation of memory into areas with different privileges. By that, user-space applications causing rowhammer-typical accesses to main memory are not able to gain access to memory areas of other applications or areas reserved for higher privileged processes such as the OS kernel.<br />

IV. Attack Details in Theory<br />

In order to utilise the rowhammer bug for a successful attack on a target system, several preconditions have to be fulfilled by the attacker. The most essential one is the ability to execute code on the target platform. Further requirements are discussed in Section IV-A.<br />

If an attacker satisfies all preconditions, the rowhammer bug can be exploited. The actual attack can be separated into different stages, which are outlined in Section IV-B.<br />

A. Required Preconditions<br />

First of all, the attacker needs to be able to execute their own code on the target platform. Documented working samples are available in C/C++, but languages like JavaScript have also been shown to be suitable for performing this kind of attack [21], [22].<br />

Since this attack is very specific to the hardware it is executed on, detailed knowledge about the target is essential to fine-tune the memory accesses. Specifically, the attacker has to have information about the CPU architecture used and which kind of main memory is installed. Architectures such as Intel's Sandy Bridge, Ivy Bridge or Haswell were found to be affected by the rowhammer bug. Successful attacks were also conducted on AMD's Piledriver-based systems [2]. The underlying CPU architecture influences the implemented algorithms used for memory access optimisations and caching approaches. Since the rowhammer bug requires direct access to the main memory, varying solutions for memory access need to be considered in the attacking application.<br />

Currently, only DRAM based memory is known to be<br />

affected by this bug. Comparing DRAM to other memory<br />

architectures, such as Static Random Access Memory (SRAM),<br />

there is a significant difference in structure. For instance, SRAM-cell-based systems are not affected by rowhammer. However, since SRAM cells are composed of larger structures on the die, the available memory capacity is reduced and manufacturing<br />

costs are increased. Consequently, it is not economical to<br />

replace DRAM cells with other memory architectures for a<br />

straightforward mitigation approach.<br />

Most of these technical insights are not publicly documented. The necessary documents are only available from the hardware manufacturers after signing Non-Disclosure Agreements (NDAs) and/or paying large amounts of money. Examples of this practice are the notorious, individually watermarked, yellow- and red-covered specifications issued by Intel only to a small, selected circle of partners.<br />

To circumvent this issue, most researchers reverse engineer the target systems to understand their inner workings. Their documentation, though, is not necessarily complete, correct, or easy to understand for someone starting to work in this area. Therefore, having code fragments or some released samples is often not enough to get started. Tedious rework and guesswork are needed to complete the code at hand in order to catch up to the state of the art.<br />

B. Attack Stages<br />

The rowhammer attack exploits the memory organisation<br />

with PDs, PTs, and pages of current OSs. An example is used<br />

to illustrate the attack stages. The assumed and simplified<br />

memory hierarchy is illustrated in an abstract tree structure<br />

which is shown in Figure 5. There are two applications,<br />

each having a PD, PTs and pages assigned. Access paths are<br />

indicated by arrows.<br />

Figure 5. Simplified memory hierarchy as abstract tree before a rowhammer attack is conducted. (Figure not reproduced; it shows CR3 pointing to PD 0 and PD 1, with PD 0 → PT 0:0 → Pg 0:0:0 and PD 1 → PT 1:0, PT 1:1 → Pg 1:1:0.)<br />

Step 1): The hierarchy from Figure 5 is translated into a representation describing a simplified view of the DRAM<br />

layout, illustrated in Figure 6. The unallocated memory is<br />

highlighted by dotted areas. Analogous to Figure 5, access<br />

paths are indicated by arrows.<br />



Since each user space application has its own virtual memory area, applications are separated and access is only possible to pages of their own area. A direct write access to PTs is not possible for an application. A key aspect of the rowhammer attack is to allocate memory using the mmap syscall. By iteratively invoking mmap across all possible virtual addresses, each call results in a newly generated PT. This process is denoted as memory spraying.<br />

It is assumed that the application belonging to the memory structure described by PD 1 has conducted this spraying. In the subsequent step, it accesses its memory areas with high frequency. Basically, it is now performing the rowhammer attack.<br />

Figure 7. The rowhammer attack caused a bit flip in PT 1:1. The manipulated address points to PT 1:0. By that, PT 1:0 is treated like a page and write access is possible. (Figure not reproduced; it shows the DRAM address layout containing Pg 0:0:0, PT 0:0, PD 0, PT 1:0, PT 1:1, PD 1, Pg 1:1:0, and CR3.)<br />

Figure 6. Simplified memory layout with two applications organized in PD 0 and PD 1. Dotted areas are unallocated memory. Arrows indicate access paths. (Figure not reproduced.)<br />

Step 2): The first step generated many PTs, which are now present in memory. As a result of the high-frequency access to the application's allocated memory, it is assumed that eventually a bit of PT 1:1 is flipped. Since memory is full of PTs, there is some probability that this bit flip results in an address pointing to another PT. This situation is illustrated in Figure 7.<br />

Accessing this address treats PT 1:0 like a normal page of<br />

the application. Thereby, PT 1:0 becomes writeable for the<br />

malicious user space application.<br />

Step 3): This manipulation of PT 1:1 enables the user space application to write any address to PT 1:0. With this acquired privilege, the system can be exploited. Writing an arbitrary address to PT 1:0 allows full access to the complete main memory. The numbers in Figure 8 indicate the sequence of the access path.<br />

Figure 8. PT 1:0 is treated like a page. Thereby, it is writeable and any address can be used to access other memory areas. (Figure not reproduced.)<br />

The situation after a successful rowhammer attack can be summarized in an abstract tree model, which is illustrated by Figure 9. The black arrows indicate the access path utilising PT 1:0 to access any desired memory area, symbolised by the gray box. For an adversary, full memory access is a very comfortable situation, as it can be used to systematically search the RAM for sensitive information. By seeking characteristic patterns, it is possible to identify cryptographic key material such as private Secure Shell (SSH) keys or other kinds of sensitive information.<br />

Figure 9. After a successful rowhammer attack, PT 1:0 is treated like a page. By writing any address to PT 1:0, full memory access from user space is possible. (Figure not reproduced.)<br />

V. Result Replication<br />

This section documents the insights gained while working on replicating the findings documented by other researchers [1]–[3]. For the test-bed, a laptop with a “rowhammer-friendly” hardware configuration was used. For the identification of appropriate components, the documents referenced in related work provided a good orientation.<br />

The employed test-bed is described in Section V-A. Thereafter, available software tools to test a given hardware for potential vulnerability are discussed. The section concludes with a discussion of the findings.<br />

A. Test-Bed<br />

Based on related work, laptop computers were identified to work well as a hardware platform for researching the rowhammer issue. Testing different laptops from various manufacturers with a selection of different RAM modules is a tedious task. In order to identify whether a given configuration is potentially affected by the rowhammer bug, the corresponding test in the memory testing software MemTest86 [23] was conducted.<br />

A Lenovo X230 laptop based on the Ivy Bridge architecture was identified as an affected platform. The used configuration included an Intel Core i5-3322M CPU in combination with Hynix RAM modules (PC3-10600 @ 1333 MHz). This confirmed the findings of Seaborn that Ivy Bridge based systems are affected by the rowhammer bug.<br />

Cross-checking with DRAM manufacturers and their market shares yields the insight that Hynix is one of the top three producers of DRAM. An additional comparison of these findings with the research on affected memory modules conducted by Kim et al. in [2] further supported the plausibility of the created test-bed.<br />

B. Test Approach<br />

Identifying whether a given hardware configuration is affected by the rowhammer bug is not trivial. Running software dedicated to testing for this issue does not necessarily report a result, whereas other test applications may indicate a vulnerability.<br />

In his initial blog post, Seaborn published example C code to demonstrate the inner workings of the identified issue [1]. This code was published on GitHub [21] for others to use, refine, and continue the work.<br />

Gruss et al. used this sample code in their research and adapted the available C code to other CPU architectures, such as Intel's Haswell and Skylake. They further developed a JavaScript-based implementation as a PoC to demonstrate how to make use of the bug even without direct influence on the caches of the CPU. This ported C code and the JavaScript version are also publicly available on GitHub [22] to be used for further research.<br />

Finally, a test for rowhammer was also included in the well-known application MemTest86 [23] by the company PassMark Software.<br />

As part of this research, the adapted C code versions from Gruss were used. This sample application served as a starting point, e.g., to understand the platform-specific instructions to circumvent caches. Establishing detailed knowledge of how memory is organized in hardware as well as managed by the OS required many resources. Based on the C code and the available documentation of the Linux kernel, the relationship between virtual memory management using Page Directories, Page Tables, and pages was reworked [24]–[26]. These structures are very complex but hold interesting aspects for further work.<br />

C. Discussion on Test Results<br />

This experience yields the insight that the ability to cause a bit to flip does not automatically result in an exploit or even root privileges. The provided sample source code can be used as a first starting point to test a system for this bug. It is not yet a fully working exploit. An attacker has to invest considerably more resources to develop malicious applications that make use of rowhammer. The fact that very hardware-specific aspects are used and that detailed knowledge about memory management is necessary increases the difficulty of mounting such an attack.<br />

Current software is patched, e.g., calls to clflush are now restricted to privileged users only. But a final solution needs to be included in upcoming hardware versions.<br />

VI. Mitigation Strategies<br />

As stated by different results of related work, there are some recommendations to mitigate the rowhammer attack vector. Shortening the timing between refresh cycles in DRAM modules is one approach often proposed. But the results of Kim et al. [2] clarify that, even with this approach, memory corruptions are still possible. Even at half of the default refresh time (32 ms instead of the 64 ms specified by the DRAM standard), flipped bits have been documented for one-sided hammering of a target row. Taking a double-sided attack on a row into account, the results get even worse. On the other hand, shorter refresh cycles also cause the memory module to spend even more time on a maintenance task (refreshing rows), impairing the system's performance. Finally, not all currently deployed Basic Input/Output System (BIOS) versions support the configuration of significantly shorter refresh cycles, necessitating updates. Some hardware vendors provide these updates [27],<br />



but applying them to the respective platform is potentially error-prone.<br />

In the work of Seaborn and Dullien [1], the command clflush was used to clear cache entries and force the system to access the values directly from RAM. The utilization of this low-level command enabled the first implementations of the rowhammer attack. Restricting calls of this command to privileged users is ineffective, however, as the follow-up work by Gruss demonstrated [3] that the same behaviour can be achieved using higher-level languages. Therefore, it can be concluded that even platforms without support for such cache-sanitizing functionality are vulnerable to rowhammer attacks. Instead of actively cleaning the cache entries, the caching mechanisms are outperformed, resulting in cache eviction, which was shown to be fast enough for the rowhammer attack. For optimizing the generation of cache misses, the related work by Oren et al. demonstrated suitable approaches [28].<br />

More sophisticated solutions have been presented by Kim et al. in [2]. Probabilistic adjacent row activation (PARA) is proposed to refresh adjacent rows with a very low probability each time an access is performed. In the case of rowhammer, one row is accessed several thousand times within a short time frame. Over time, this probability-based solution causes the neighboring rows to be refreshed. Something similar is discussed by Seaborn in [6], called “Target Row Refresh (TRR)”. This solution refreshes adjacent rows based on an access counter. However, an update to the memory controller is necessary, or the mechanism can be included in future memory generations, such as DDR4 chips. In his talk, Gruss proposed a software-based approach utilizing memory hierarchies with different access privileges. This would restrict the impact to the performing application. This idea could be incorporated into the already available memory organisation using PDs and PTs.<br />

One insight gained in the course of this ongoing research is that mitigation strategies can be applied more easily to architectures using some runtime environment. There, the hardware access is abstracted and patches can be realised to restrict access to certain functions, e.g., clflush. However, related work showed various ways to circumvent such limitations.<br />

VII. Conclusions<br />

Corrupt memory is, in general, not a new phenomenon. The issue has been known to the hardware developers of DRAM modules since the introduction of the first commercial DRAM module by Intel [14].<br />

For a long time, the issue of flipping bits in main memory was seen as just a reliability issue. In reaction to these issues, developers included error correction codes and combined them with redundancy (e.g., ECC RAM) or memory remapping, as is the case when faulty RAM rows are detected and compensated.<br />

The findings of Kim et al. [2] indicate that memory corruptions can also be used to specifically manipulate a computer's main memory and affect executed software. With a PoC, the exploitability was demonstrated by Seaborn [1].<br />

This proves that hardware reliability issues influence software on an IT system. In the context of IT security, such influences undermine basic security algorithms utilised to implement the Confidentiality, Integrity, Availability (CIA) triad for certain applications. Recent findings show how performance optimizations for code execution on CPUs can be abused to gain access to sensitive data [29]. In this attack, branch prediction is exploited to retrieve data the CPU preprocessed in expectation of a soon-occurring access. In cases where this preprocessed branch is not required, potentially sensitive data is mapped into caches, from which it can be extracted by malicious applications.<br />
extracted by malicious applications.<br />

Related work shows that hindrances such as closed source, Non-Disclosure Agreements (NDAs), hidden or undocumented functions, or the lack of working code samples do not stop people from developing PoC exploits. In the case of rowhammer, released code is used to illustrate the issue and to test hardware for potential vulnerabilities. This helps developers and end users to establish mitigation strategies, but a malicious attacker still has to invest resources to fill essential gaps.<br />

Following the methodology of responsible disclosure allows vendors to develop patches and notify their customers, e.g., through security advisories [30]. This reduces the overall risk to end users, since they can prepare before the issue is publicly released. But such findings again emphasise the importance of maintained and supported platforms. Especially among embedded systems, even recent devices are built upon legacy OS versions, such as Linux with kernel 2.6, or unsupported and outdated libraries [31].<br />

Yet unclear is the situation for embedded systems based on closed source OSs, such as Windows Embedded. As these OSs often practice security by obscurity, it is not known to the public how they organise the RAM. Therefore, it is hard to estimate whether an attack like rowhammer affects platforms using a proprietary OS. Without the possibility of producer-independent research, this issue and the possible impact of rowhammer on these platforms are hard to assess.<br />

A. Am I affected?<br />

Assessing whether a certain hardware configuration is potentially affected by the rowhammer bug requires detailed investigation. Considering embedded platforms, one can perform a rough assessment to get an idea of whether this specific attack is of relevance for further research.<br />

It needs to be clarified whether the essential preconditions of Section IV-A are fulfilled. Essential questions are:<br />

1) What kind of memory architecture is installed?<br />

2) Does an attacker have the opportunity to execute their own code?<br />

Since this kind of attack is very hardware specific, the possibility of influencing main memory is not yet a working exploit. Rather, it should be seen as a reliability issue with impact on the integrity of an IT system. Furthermore, memory corruptions affect the availability of a system. Finally, such an attack might have an impact on confidentiality. Nevertheless, in combination with vulnerabilities in outdated libraries, the rowhammer attack can be used to gain higher privileges [1]. This emphasises the need for maintained software and hardware components that are kept up to date with the latest patch level.<br />

As another aspect, the potential attacker model needs to be considered. What resources is an attacker assumed to be capable of investing to mount such a specific attack? Are there any other attack paths which might be easier to realize? By establishing an attacker model describing the resources and capabilities of an expected adversary, attacks like rowhammer can be put into proportion. Hardware-based security issues are significant, but often there are other, less complex attack vectors with the same or even higher impact.<br />

VIII. Future Work<br />

This work can be continued by developing a user-friendly tool to test a given hardware setting for the rowhammer issue. Current solutions focus on expert usage; e.g., it is necessary to compile the test tool from C code samples. This demands knowledge about the target platform the test tool is used on, since hardware-specific adjustments need to be considered while compiling.<br />

Additionally, the theoretical basics of memory management were identified as a valuable topic for further work. Based on insights gained from this research, hardware-related attacks like rowhammer can be better understood. Recent findings like Meltdown [29] or Spectre [32] also originate from hardware characteristics. Here, the optimisation in branch prediction for preprocessing likely required statements can be exploited to retrieve sensitive data from caches. Understanding the underlying concepts supports comprehending such issues and makes finding secure solutions easier.<br />

Project Funding<br />

The presented work is part of the German national IT-<br />

Security reference project IUNO (https://www.iuno-projekt.de).<br />

The project is funded by the German Federal Ministry of<br />

Education and Research, funding № KIS4ITS0001. IUNO<br />

aims to research and provide building-blocks for IT-Security<br />

in the emerging field of Industry 4.0.<br />

References<br />

[1] M. Seaborn and T. Dullien. (Mar. 2015). Exploiting<br />

the DRAM rowhammer bug to gain kernel privileges,<br />

[Online]. Available: https://googleprojectzero.blogspot.de/2015/03/exploiting-dram-rowhammer-bug-to-gain.html (visited on 01/13/2018).<br />

[2] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee,<br />

C. Wilkerson, K. Lai, and O. Mutlu, “Flipping Bits in<br />

Memory Without Accessing Them: An Experimental<br />

Study of DRAM Disturbance Errors”, in Proceeding of<br />

the 41st Annual International Symposium on Computer Architecture, ser. ISCA ’14, Minneapolis, Minnesota,<br />

USA: IEEE Press, 2014, pp. 361–372, isbn: 978-1-<br />

4799-4394-4. [Online]. Available: http://dl.acm.org/<br />

citation.cfm?id=2665671.2665726.<br />

[3] D. Gruss, C. Maurice, and S. Mangard, “Rowhammer.Js:<br />

A Remote Software-Induced Fault Attack in JavaScript”,<br />

in Proceedings of the 13th International Conference on<br />

Detection of Intrusions and Malware, and Vulnerability<br />

Assessment - Volume 9721, ser. DIMVA 2016, San<br />

Sebastián, Spain: Springer-Verlag New York, Inc., 2016,<br />

pp. 300–321, isbn: 978-3-319-40666-4. doi: 10.1007/<br />

978-3-319-40667-1_15. [Online]. Available: http:<br />

//dx.doi.org/10.1007/978-3-319-40667-1_15.<br />

[4] K. S. Yim, “The Rowhammer Attack Injection Methodology”,<br />

in Proceedings of the IEEE Symposium on<br />

Reliable Distributed Systems (SRDS), 2016, pp. 1–10.<br />

[5] F. Brasser, L. Davi, D. Gens, C. Liebchen, and A.-R.<br />

Sadeghi, “CAn’t Touch This: Software-only Mitigation<br />

against Rowhammer Attacks targeting Kernel Memory”,<br />

in 26th USENIX Security Symposium (USENIX Security<br />

17), Vancouver, BC: USENIX Association, 2017,<br />

pp. 117–130, isbn: 978-1-931971-40-9. [Online].<br />

Available: https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/brasser.<br />

[6] M. Seaborn and T. Dullien. (2015). Exploiting the<br />

DRAM rowhammer bug to gain kernel privileges,<br />

[Online]. Available: https://www.blackhat.com/docs/us-<br />

15/materials/us-15-Seaborn-Exploiting-The-DRAM-<br />

Rowhammer-Bug-To-Gain-Kernel-Privileges.pdf.<br />

[7] R. Isaac, “The Remarkable Story of the DRAM Industry”,<br />

IEEE Solid-State Circuits Society Newsletter,<br />

vol. 13, no. 1, pp. 45–49, Winter 2008, issn: 1098-<br />

4232. doi: 10.1109/N-SSC.2008.4785692.<br />

[8] M. Redeker, B. F. Cockburn, and D. G. Elliott, “An<br />

investigation into crosstalk noise in DRAM structures”,<br />

in Proceedings of the 2002 IEEE International Workshop<br />

on Memory Technology, Design and Testing<br />

(MTDT2002), 2002, pp. 123–129. doi: 10.1109/MTDT.<br />

2002.1029773.<br />

[9] M. Seaborn. (Apr. 2015). L3 cache mapping on<br />

Sandy Bridge CPUs, [Online]. Available: http://lackingrhoticity.blogspot.de/2015/04/l3-cache-mapping-on-sandy-bridge-cpus.html (visited on 01/14/2018).<br />

[10] ——, (May 2015). How physical addresses map to<br />

rows and banks in DRAM, [Online]. Available: http://lackingrhoticity.blogspot.de/2015/05/how-physical-addresses-map-to-rows-and-banks.html (visited on 01/13/2018).<br />

[11] P. Pessl, D. Gruss, C. Maurice, and S. Mangard,<br />

“Reverse engineering intel DRAM addressing and<br />

exploitation”, CoRR abs/1511.08756, 2015.<br />

[12] R. Hund, C. Willems, and T. Holz, “Practical Timing<br />

Side Channel Attacks Against Kernel Space ASLR”, in<br />

Proceedings of the 2013 IEEE Symposium on Security<br />

and Privacy, ser. SP ’13, Washington, DC, USA: IEEE<br />

Computer Society, 2013, pp. 191–205, isbn: 978-0-<br />

7695-4977-4. doi: 10.1109/ SP.2013.23. [Online].<br />

Available: http://dx.doi.org/10.1109/SP.2013.23.<br />

[13] L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and<br />

C. Wu, “A Software Memory Partition Approach<br />

for Eliminating Bank-level Interference in Multicore<br />

Systems”, in Proceedings of the 21st International<br />

Conference on Parallel Architectures and Compilation<br />

Techniques, ser. PACT ’12, Minneapolis, Minnesota,<br />

USA: ACM, 2012, pp. 367–376, isbn: 978-1-4503-<br />

1182-3. doi: 10.1145/ 2370816.2370869. [Online].<br />

Available: http://doi.acm.org/10.1145/2370816.2370869.<br />

[14] J. H. Saltzer and M. F. Kaashoek, Principles of Computer<br />

System Design: An Introduction. Morgan Kaufmann,<br />

2009, isbn: 978-0123749574. [Online]. Available:<br />

https://booksite.elsevier.com/9780123749574/<br />

casestudies/00~All_Chapters(7-11).pdf.<br />



[15] S. Khan, D. Lee, Y. Kim, A. R. Alameldeen, C.<br />

Wilkerson, and O. Mutlu, “The efficacy of error<br />

mitigation techniques for DRAM retention failures: A<br />

comparative experimental study”, in ACM SIGMET-<br />

RICS Performance Evaluation Review, ACM, vol. 42,<br />

2014, pp. 519–532. [Online]. Available: https://dl.acm.<br />

org/citation.cfm?id=2592000.<br />

[16] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu,<br />

“An experimental study of data retention behavior in<br />

modern DRAM devices: Implications for retention time<br />

profiling mechanisms”, in ACM SIGARCH Computer<br />

Architecture News, ACM, vol. 41, 2013, pp. 60–71.<br />

[Online]. Available: https://dl.acm.org/citation.cfm?id=<br />

2485928.<br />

[17] S. Govindavajhala and A. W. Appel, “Using memory<br />

errors to attack a virtual machine”, in 2003 Symposium<br />

on Security and Privacy, 2003., May 2003, pp. 154–165.<br />

doi: 10.1109/SECPRI.2003.1199334.<br />

[18] B. Schroeder, E. Pinheiro, and W.-D. Weber, “DRAM<br />

Errors in the Wild: A Large-Scale Field Study”, in<br />

SIGMETRICS, 2009. [Online]. Available: https : / /<br />

research.google.com/pubs/pub35162.html.<br />

[19] V. Sridharan and D. Liberty, “A Study of DRAM Failures<br />

in the Field”, in Proceedings of the International<br />

Conference on High Performance Computing, Networking,<br />

Storage and Analysis, ser. SC ’12, Salt Lake City,<br />

Utah: IEEE Computer Society Press, 2012, 76:1–76:11,<br />

isbn: 978-1-4673-0804-5. [Online]. Available: http:<br />

//dl.acm.org/citation.cfm?id=2388996.2389100.<br />

[20] S. M. Seyedzadeh, D. Kline Jr, A. K. Jones, and<br />

R. Melhem, “Mitigating Bitline Crosstalk Noise in<br />

DRAM Memories”, in Proceedings of the International<br />

Symposium on Memory Systems, ser. MEMSYS ’17,<br />

Alexandria, Virginia: ACM, 2017, pp. 205–216, isbn:<br />

978-1-4503-5335-9. doi: 10.1145/3132402.3132410.<br />

[Online]. Available: http://doi.acm.org/10.1145/3132402.3132410.<br />

[21] M. Seaborn. (Jun. 2015). Test DRAM for bit<br />

flips caused by the rowhammer problem., [Online].<br />

Available: https://github.com/google/rowhammer-test<br />

(visited on 01/19/2018).<br />

[22] D. Gruss. (Jul. 2015). Rowhammer.js - A Remote<br />

Software-Induced Fault Attack in JavaScript, [Online].<br />

Available: https://github.com/IAIK/rowhammerjs<br />

(visited on 01/16/2018).<br />

[23] P. Software. (Jul. 2017). MemTest86 V7.4 Free<br />

Edition Download, [Online]. Available: https://www.<br />

memtest86.com/download.htm (visited on 01/16/2018).<br />

[24] W. R. Stevens and S. A. Rago, Advanced programming<br />

in the UNIX environment. Addison-Wesley, 2013, isbn:<br />

978-0321637734.<br />

[25] M. Kerrisk, The Linux programming interface. No<br />

Starch Press, 2010, isbn: 978-1593272203.<br />

[26] M. Gorman, Understanding the Linux virtual memory<br />

manager. Prentice Hall Upper Saddle River, 2004,<br />

isbn: 978-0131453487.<br />

[27] Lenovo. (Sep. 2016). BIOS Update Utility, [Online].<br />

Available: https://download.lenovo.com/ibmdl/pub/pc/<br />

pccbbs/mobiles/8duj26us.txt (visited on 01/17/2018).<br />

[28] Y. Oren, V. P. Kemerlis, S. Sethumadhavan, and<br />

A. D. Keromytis, “The Spy in the Sandbox: Practical<br />

Cache Attacks in JavaScript and Their Implications”, in<br />

Proceedings of the 22Nd ACM SIGSAC Conference on<br />

Computer and Communications Security, ser. CCS ’15,<br />

Denver, Colorado, USA: ACM, 2015, pp. 1406–1418,<br />

isbn: 978-1-4503-3832-5. doi: 10.1145/ 2810103.<br />

2813708. [Online]. Available: http://doi.acm.org/10.<br />

1145/2810103.2813708.<br />

[29] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas,<br />

S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and<br />

M. Hamburg, “Meltdown”, ArXiv e-prints, Jan. 2018.<br />

arXiv: 1801.01207.<br />

[30] (Sep. 2016). Row Hammer Privilege Escalation Vulnerability,<br />

[Online]. Available: https://tools.cisco.com/<br />

security/center/content/CiscoSecurityAdvisory/cisco-sa-20150309-rowhammer<br />

(visited on 01/17/2018).<br />

[31] T. Roth. (Dec. 2017). Gateway to (s)hell, [Online].<br />

Available: https://media.ccc.de/v/34c3-8956-scada_-<br />

_gateway_to_s_hell (visited on 01/17/2018).<br />

[32] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg,<br />

M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and<br />

Y. Yarom, “Spectre Attacks: Exploiting Speculative<br />

Execution”, ArXiv e-prints, Jan. 2018. arXiv: 1801.<br />

01203.<br />

Norbert Wiedermann, MSc. has been employed since 2013 as a scientific researcher at the Fraunhofer Institute for Applied and Integrated Security (AISEC). In his research projects he focuses on IT security aspects for embedded and industrial hardware. By performing risk analyses and developing security concepts he contributes to increasing the protection level of the considered systems.<br />

Sven Plaga received the Dipl.-Ing. (FH) and M. Eng. degrees in electrical engineering and computer science from the Deggendorf University of Applied Sciences (Germany) in 2007 and from the University of Limerick (Ireland) in 2010, respectively. From 2007 to 2013 he was a research fellow at Deggendorf University of Applied Sciences, where he continuously participated in research projects regarding x86 embedded systems. Furthermore, he lectured on Embedded Systems and C Programming. Currently, he is a research fellow at the Fraunhofer Institute for Applied and Integrated Security (AISEC), working toward the Ph.D. degree in the field of secure industrial communications in the context of embedded systems. Additionally, he assists clients with risk analyses, security concepts and secure implementations within the scope of contracted industrial research projects. In his spare time, he loves to share his knowledge and experiences and likes to discuss his findings with others.<br />



You’ve been hacked! Now what?<br />

Haydn Povey<br />

Founder and CTO<br />

Secure Thingz<br />

Cambridge, UK<br />

haydn@securethingz.com<br />

Abstract — You’ve seen the headlines. Whether it's bots<br />

infecting home networks, the destruction of industrial<br />

systems, or the ability to take remote control of<br />

automobiles, the horror stories around Internet of Things<br />

security are starting to mount, like bodies in a bad movie.<br />

The bad guys will keep coming with malicious<br />

intent. The attacks on connected devices are only going to<br />

get worse and more sophisticated. Hardware, software, communications and communication protocols, device commissioning, application layers and other system considerations are just some of the many entry points through which a device can be compromised, fall victim to malware, and lead to data breaches or weaponization. Boundary<br />

protection can be too porous. Systems that may seem<br />

secure today may have weaknesses that will lead to failure<br />

in the future. Failure at some point is almost an<br />

inevitability.<br />

Privacy, corporate reputations and even lives can depend<br />

on the ability to ensure the security of a device.<br />

So it’s time to face reality. There’s a good chance<br />

your device will get hacked. The question is: what are you<br />

going to do about it? How will you recover? And what can<br />

you do to prepare?<br />

This presentation focuses on what you need to do in the<br />

aftermath of an IoT compromise and how you get back to<br />

a trusted system.<br />

Keywords—security; hack; secure boot; secure element; IoT;<br />

IP; attack; system; architecture; cyber security<br />

I. INTRODUCTION<br />

We are becoming increasingly used to the headlines of<br />

hacking across our IT systems. Barely a week goes by where<br />

major flaws aren’t found in the computer infrastructure that<br />

surrounds our digital domain, and whilst these are all<br />

concerning attacks, there are two major corrosive effects.<br />

First, while the global press may publish details on the attacks,<br />

there is a consumer fatigue that “yet another attack has<br />

happened,” minimising the importance of implementing<br />

countermeasures and instilling hygiene amongst users.<br />

Secondly, and more egregiously, there is perhaps a<br />

hopelessness creeping into organizations, both on what they<br />

should do to prevent the impact of an attack on their<br />

businesses and how they can increase the security of their own<br />

products. While there is little we can do about the first, there is<br />

plenty we can do as an industry to solve the second by<br />

providing solid frameworks for responding to attacks, and<br />

building systems which are intrinsically resilient to these<br />

evolving attacks.<br />

The reality for any complex system is that there will always<br />

be flaws in its design, implementation, and management.<br />

We are all human, and we all have to get systems in market<br />

rapidly due to competitive and business pressures. As such we<br />

will never have the time or budget to get any system beyond a<br />

small kernel that’s technically correct and certified by a group<br />

of peers. Specifically, we see industry challenges arising in<br />

three focused areas: 1) architectural specification; 2)<br />

technological inheritance; and, 3) system integration.<br />

A. Architectural Specification<br />

System architectures are the bedrock of technology,<br />

and they are often open to group review and public domain<br />

investigation. And yet we continue to see many fundamental<br />

flaws appearing. We have seen this in the BlueBorne attacks,<br />

where flaws in the Bluetooth specification led to multiple<br />

attack vectors being discovered many years after definition,<br />

and of course we have seen the recent Meltdown and Spectre<br />

compromises, where incorrect architectural definition has led<br />

to multiple side-channel attacks.<br />

B. Technological Inheritance<br />

We all stand on the shoulders of giants, and<br />

leveraging standard hardware and software components is the<br />

bedrock of modern computing. However, we are all at risk<br />

from compromises within these building blocks, creating a<br />

bubbling-up of issues which we are ill prepared to manage. A<br />

recent example of this was the compromises identified in low-level Transport Layer Security (TLS) communication drivers,<br />

which themselves are meant to be highly secure. The nature of<br />

these drivers is they are buried deep within the application and<br />

are not necessarily easy to patch or manage. In fact, the nature<br />

of many communication-level security flaws is that they hold<br />

privileged status within the system and present an Achilles<br />



heel when they go wrong. In this case, the<br />

vendor of the drivers produced a patch when they identified<br />

the flaw, and correctly notified the OEMs who had built on<br />

this software. However, there is little guarantee that the OEMs<br />

and end users had the knowledge or capability to remediate<br />

systems in the field.<br />

C. Systems Integration<br />

Similar to technological inheritance, the integration of<br />

components from numerous vendors is fraught with issues. It<br />

is not always clear how the system will fit together, and if<br />

compromises are introduced when systems are built. A classic<br />

example of this has occurred in the mobile telephone world<br />

where an innocuous incompatibility between baseband<br />

chipsets and application processors led to a major compromise<br />

in 2017, subsequently patched by the baseband chip vendor.<br />

As we build increasingly complex systems, there is a demand<br />

that every system must be protected individually, and that the<br />

system should implement a “zero trust” model between<br />

modules. However, this will necessitate additional cost, and<br />

additional effort.<br />

II. DEVELOPING AN INCIDENT RESPONSE PLAN<br />

As mentioned earlier, every complex system will have<br />

multiple flaws, and hence we must assume that every system<br />

can, and will, become compromised at some point. At some<br />

point, you will be hacked. So what are we going to do about<br />

it? This paper outlines an initial best practice approach and<br />

longer-term mitigation strategy.<br />

Given that being hacked is inevitable, it becomes<br />

imperative that every vendor within the value chain<br />

understand their role within the industry; that they prepare for<br />

having components or systems that become compromised; that<br />

they have a pre-planned response mechanism; and that they<br />

follow through correctly to ensure the industry’s trust in them<br />

is maintained if they do not wish to lose brand value. It is also<br />

imperative that this incident response plan is not only in place<br />

for current products, but that it also covers systems that have<br />

already been released. To assist in this, a number of groups,<br />

including the Internet of Things Security Foundation<br />

(www.iotsecurityfoundation.org), are creating Best Practice<br />

Guidelines that IoT vendors can integrate into their own<br />

processes.<br />

III. BEING PREPARED<br />

Preparing for the inevitable compromises is a multi-faceted<br />

process; however, the following five components represent a good starting point:<br />

1. Scope organizational impact<br />

2. Define internal cyber security policy<br />

3. Clear communications policy<br />

4. Develop a bug-bounty policy<br />

5. Execute deep threat analysis<br />

A. Scope Organizational Impact<br />

It is difficult to judge how much to invest in protection<br />

against abstract threats, when compared against known budget<br />

constraints. Hence the first step is for any organization to judge what the damage to it would be in a worst-case scenario.<br />

For many organizations, this will certainly include brand<br />

and reputational damage, but this is being exacerbated by other market forces, including the willingness of the industry<br />

to sue for impacts in their supply chain. A vulnerability in a<br />

communications stack may leave a process control system,<br />

and subsequently a large processing plant, at risk. Hence, a<br />

relatively small code fix could potentially protect a massive<br />

capital structure. Similarly, where there is the potential for<br />

customer data to leak, the balance of additional time and effort<br />

to minimise any threats is obvious when compared to fines of up to €20M, or 4% of global annual turnover, for a data breach under the GDPR.<br />

Best practice: To enable a comprehensive impact assessment, a specialist organization, external to the business unit, should often be employed to challenge assumptions and ensure the corner cases are explored. This team may be a<br />

central function within large organizations, or may be an<br />

external contractor within smaller operations. Generally, this<br />

team will need to be supported by internal experts running an<br />

insurgency attack to see how they would best compromise<br />

their own products.<br />

B. Define Cyber Security Policy<br />

No battle plan ever survives the first encounter of war, but<br />

you would never wish to go into battle without one. The same<br />

is true of defining a cyber security policy around products for<br />

the IoT. We don’t know exactly when, where or how the<br />

compromises or exploits will be found, but we must have the<br />

framework to deal with them.<br />

• Assuming compromises will occur is the first step in<br />

defining a policy. We have to expect every device<br />

will be attacked, and every device will be<br />

compromised at some point in its life. Every complex<br />

device containing firmware or software will have<br />

exploits. Every device that has a communications<br />

stack will have exploits. Every system containing an<br />

active microcontroller or microprocessor will have<br />

exploits.<br />

Best practice: Assume all devices will be exploited at<br />

some point. Ensure active patch management is<br />

possible, and ensure functionality is available within<br />

the product to update it. Ensure that patch<br />

management is built into the project costs and<br />

lifecycle management, and that development tools<br />

and firmware can support ongoing releases.<br />

• Executive Ownership is critical in developing a cyber<br />

security policy, and responding to an exploit.<br />



However, it is also important to have executive<br />

overview of security within the products being<br />

created. Security must be supported holistically in the<br />

organization, not relegated to the IT department.<br />

Best practice: A main board member (CxO) should<br />

report fortnightly to the board on any cyber incidents and on cyber security policy implementation.<br />

• Engineering leadership engagement is a key<br />

requirement in producing secure IoT devices.<br />

Engineering must decide what level of security is<br />

required for a product with justification, commit to<br />

maintaining software for the lifecycle of the product,<br />

and ensure that security test frameworks are part of<br />

the sign-off criteria.<br />

Best practice: Security must move from being an afterthought to being woven into the fabric of the engineering process. All<br />

architectures, components and sub-components must<br />

be validated against an evolving threat model, and<br />

existing products should be verified against major<br />

new threats.<br />

Active patch and update management are critical in<br />

supporting devices across their lifecycles, with the<br />

ability to recover to a known good state.<br />

• Product leadership must continue to own its products<br />

over their entire lifecycle, and must integrate security<br />

into the fabric of the offering. As such, the transition<br />

from selling commodities to services and lifecycle<br />

support is incredibly important, but also an important<br />

source of additional revenue.<br />

Best practice: Ensure the product mix includes<br />

ongoing lifetime maintenance, or the ability to<br />

manage the device to mitigate exploits. To achieve<br />

this, low-level security services must be instilled in<br />

the device, providing a secure kernel which can be<br />

utilised to support ongoing interaction.<br />

• Rapid escalation of issues is critical in gaining<br />

support from executives for necessary actions, and in ensuring the industry retains confidence in the<br />

organization. This is true whether a flaw has been<br />

identified internally, a compromise has been found<br />

externally, or an ethical hacker approaches the<br />

organization with a new exploit.<br />

Best practice: Implement a flattened management<br />

structure group for exploit investigation and<br />

responsiveness.<br />

• Communication policy is critical, and will be covered<br />

in more depth later. However, no Best Practice would<br />

be complete without a clear communication strategy<br />

for when attacks are found, how they are solved in<br />

partnership with clients, and how they are<br />

communicated publicly.<br />

Best practice: A set of policies should be<br />

implemented ensuring rapid and formalised<br />

communications which can convey urgency while<br />

demonstrating a managed and resolving situation.<br />

• Proper triage of an incoming compromise is critical<br />

to ensure the organization reacts positively without<br />

becoming overwhelmed by every single flaw in its<br />

products. A formal process of evaluation,<br />

investigation, mitigation and communication is<br />

required.<br />

Best practice: A written process for initial engagement with an issue is important, including the<br />

creation of a template to record all aspects of the<br />

incident, ensuring a consistent approach. This enables<br />

simple recording and comparison of events, and<br />

ensures all stakeholders, including technical teams,<br />

suppliers, legal resources and human resources,<br />

alongside business management, are educated<br />

rapidly.<br />

Figure 1. Triage steps.<br />

• Process management should already be a key function of the organization. It is important that version<br />

management and product evolution are managed<br />

aggressively within an organization. This covers both<br />

aspects of versioning, where previous versions may<br />

have known exploits which have been subsequently<br />

fixed. Similarly, product evolution will invariably<br />

enable flaws to creep into the system, and there<br />

should be the ability to roll back to a known-good<br />

version. Identifying where a flaw occurred and how it<br />

was introduced are key mechanisms in developing<br />

better internal processes to manage compromises<br />

over many years.<br />

Best practice: Maintaining clear records of releases<br />

within a mastering system is important in being able<br />

to understand when and how flaws were introduced,<br />

and who holds ultimate responsibility for these.<br />

These records are important from a technical<br />

perspective, but also critical from a legal perspective<br />

if the organization finds itself in court for a GDPR<br />

breach or lawsuit.<br />

• Patching and update management are important<br />

outputs from any cyber security policy, as this will be<br />



the primary response to any compromise, unless a<br />

full recall is required. Ensuring that all systems are manageable requires that low-level services have been fully certified and are small enough to have no known attack vectors; this constrains the update process.<br />

Best practice: Ensure development tools support<br />

version management and interim releases, and that<br />

patches can be signed and encrypted for specific<br />

target devices, or product groups. Assume that an<br />

update mechanism can also be exploited by an<br />

attacker, and therefore provide development<br />

frameworks which enable constrained mastering<br />

releases of software, with authentication and<br />

authorisation. Further, keep updates as small as possible to minimise the impact on network bandwidth and battery power.<br />

• Minimising attack impact through authentication is a<br />

clear goal of any cyber security policy. It should be<br />

the case that all devices have a strong cryptographic<br />

identity associated with them, and that all<br />

communications to and from the device have implicit<br />

authentication. This way, we should be able to both<br />

manage the communications to reduce the attack<br />

surface of the device, where an attacker must shape<br />

their attack to the unique device; and additionally<br />

create a zero-trust framework where devices cannot<br />

easily propagate attacks as they do not have<br />

authentication capability for other devices.<br />

Best practice: Ensure all devices have<br />

cryptographically strong certificates and<br />

authentication mechanisms, including PKI<br />

frameworks, the ability to hold secret information<br />

securely, and unique addressability, such as X.509<br />

(or equivalent).<br />
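As an illustration of the zero-trust, per-device authentication described above, the following C sketch accepts a message only when its tag verifies under a unique per-device key; the toy MAC is a stand-in for a real primitive such as HMAC-SHA256, and all names are hypothetical.<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy MAC standing in for a real primitive (e.g. HMAC-SHA256); illustrative only. */
static uint32_t toy_mac(const uint8_t *msg, size_t len, uint32_t device_key) {
    uint32_t m = device_key;
    for (size_t i = 0; i < len; i++)
        m = (m * 31u) + msg[i];
    return m;
}

/* Zero-trust check: accept a message only if its tag verifies under this
 * device's unique key, so one compromised node cannot address another. */
int message_authentic(const uint8_t *msg, size_t len,
                      uint32_t tag, uint32_t device_key) {
    return toy_mac(msg, len, device_key) == tag;
}
```

Because every device holds a different key, a tag forged for one device fails verification on every other, which is what limits lateral propagation of an attack.<br />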

C. Clear communications policy<br />

As mentioned previously, a clear communication policy is<br />

the bedrock of a successful Incident Response Plan. However,<br />

this is often easier stated than done, especially given the<br />

embarrassment and sensitivities traditionally associated with<br />

flaws.<br />

There are three stages to a successful incident communications policy:<br />

1. Confirmation.<br />

In many cases a compromise may be identified by an<br />

ethical hacker. We will come onto bug-bounties in a<br />

moment, however, whether you offer one or not, it is<br />

always best to engage positively when approached. In<br />

the past, companies often buried their heads in the<br />

sand, but today that approach will bring industry<br />

condemnation and far greater reputational damage, as<br />

it gives the impression that they are both ignoring the<br />

issue and disrespecting their customers. Engage<br />

positively and strongly, and if possible, attain the use<br />

of the hacker to confirm the flaw in as much detail as<br />

possible.<br />

2. Notification<br />

Following initial triage, it is important to notify<br />

customers as soon as possible that there is a potential<br />

issue and that the organization is working on a fix. If<br />

the fix is simple, it may be that the organization will<br />

quickly make a patch available when they notify the<br />

clients. However, if the fix is complex, it is important<br />

to notify clients as early as possible as they will need<br />

to clarify how to integrate the workaround into their<br />

products. If the issue is catastrophic in nature, it is<br />

also important to flag this early to clients to ensure<br />

they, and their customers, are aware of any potential<br />

impacts to their businesses. A critical flaw in an<br />

automotive component may mean vehicles are at risk,<br />

and subsequently so are human lives. In this case, a<br />

failure to notify will raise the highest legislative<br />

impact and could put your company’s survival at risk.<br />

3. Publication<br />

Traditionally potential flaws were seen as<br />

embarrassments. Today that approach is changing,<br />

and the open publication of flaws and subsequent<br />

fixes is a prerequisite for trust within the industry. If<br />

you are not publishing flaws, you are seen as trying<br />

to hide mistakes, and this itself is a good reason for<br />

not doing business with an organization. Publication<br />

does not mean self-flagellation, and unless a flaw is<br />

critical, you do not have to publish it on the front<br />

page of your website. However, there should be a<br />

specific publication mechanism, with subscription<br />

notification, for users.<br />

D. Develop a bug-bounty policy<br />

We have touched on bug-bounty in the communication<br />

policy, but it is worthy of additional focus. A company should,<br />

as part of its Best Practice, engage with ethical hackers, and<br />

we are seeing this encoded into the new laws currently coming<br />

through the US Congress, where this practice has been<br />

legitimized.<br />

The nature of ethical hacking is that a third-party<br />

independently finds flaws and effectively ransoms the<br />

information to the organization. This approach is obviously<br />

distasteful, although better than a third party finding the flaws<br />

and actively exploiting the issue. A better approach is to be<br />

aggressive in engaging with the community, to build a<br />

predefined set of rules of engagement, and to predetermine<br />

where specific value is attributed to finding compromises.<br />

This approach also ensures that the bug-bounty explorers are<br />

always operating within the law, and they are far more likely<br />

to be true white-hats.<br />

Best practice: It is suggested that an engagement framework be published on the organization’s website, as part of the Cyber<br />



Response Initiative, and that the executive leader who is<br />

tasked with managing product security has a clear<br />

responsibility for engaging with this process and paying the<br />

hackers. After all, the cost of an exploit going wild is far<br />

higher than the cost of managing it internally.<br />

E. Execute deep threat analysis<br />

Of course, all of the above is great for developing a<br />

process to manage exploits. However, it is only through<br />

running an internal deep threat analysis that the organization will really start to understand how exposed it is.<br />

Best practice: In the first instance, the organization should<br />

create an incident response policy and dry run the process.<br />

This should involve senior business and technical leadership instigating a tear-down attack on one of their own products, identifying all the suspected compromises, and reviewing the potential consequences of a successful attack. In most cases,<br />

organizations receive a nasty shock.<br />

Identifying and ranking any exploits that emerge will help<br />

build a set of priorities which engineering can start to work on.<br />

As this approach is targeted at the first set of products, it is<br />

likely that common threats emerge, such as communication<br />

stacks, lack of prescribed identity, limited patching, and poor<br />

version management. These common threats subsequently build a backlog of issues to mitigate, but also ensure that new<br />

products under development learn from these issues, creating a<br />

positive engagement across the organization.<br />

Following this initial phase, the policy should be<br />

readdressed based on feedback, and then the organization is<br />

probably best tasked with bringing in a third-party security<br />

analysis organization to review the process and focus on<br />

additional areas which may have been missed or which lie outside the organization’s knowledge base.<br />

IV. RESPONDING TO AN ATTACK<br />

If a full Incident Response Policy has been implemented,<br />

then the Response phase to a new attack vector should flow<br />

easily. However, each one will be a learning experience and<br />

the focus should be on investigation of the attack,<br />

identification of the flaw, and ongoing mitigation in both existing and new products through updates and patches.<br />

The following steps are suggested:<br />

• Identify cyber security incident<br />

Although obvious, it is imperative that the<br />

organization understand quickly whether the<br />

compromises found are new issues, or whether these<br />

have been found before and are either unpatched<br />

exploits or a novel leveraging of existing issues.<br />

To achieve this the team must be able to replicate the<br />

system and rapidly understand the impact and<br />

pathology of the attack. This necessitates the use of a<br />

“clean room” system disassociated from the main IT system, where the attack can be replicated without<br />

impacting the company if it propagates.<br />

• Define a clear set of objectives<br />

The outcome of identifying the pathology of the<br />

attack may be manifold.<br />

First, it may be that the attack is leveraging a known<br />

compromise in a new way. This obviously needs to<br />

be understood further. However, it may be that the<br />

organization needs to warn customers to more rapidly<br />

update or apply patches. In many industrial<br />

organizations, the adoption of patches is slow as there<br />

may be unknown consequences which the client<br />

needs to mitigate against.<br />

Secondly, the attack may be a new variant of a<br />

known attack, compromising a new area of the<br />

codebase or system. In this case it may be that a<br />

patch can be rapidly reworked to cover the new hole,<br />

and future products can be coded to mitigate against<br />

that attack.<br />

Thirdly, the attack may be a completely new “zero-day” compromise. In this case the organization needs<br />

to rapidly identify the consequences of the attack, and<br />

develop a mitigation capability. They will need to<br />

judge the impact and potential scale of the exploit,<br />

and how quickly to alert their customer base.<br />

• Recover and remediate<br />

The primary goal of any attack response is to<br />

minimise the impact on the user.<br />

As such, in the first instance the system should be<br />

able to enter a quiescent state where advanced<br />

functionality is switched off, to reduce the attack<br />

surface and inhibit the propagation of any attack.<br />

This capability may be a feature of the Real-Time Operating System (RTOS), or may be reliant on<br />

underlying security services, such as a Secure Boot<br />

Manager. In the worst case, this functionality may be<br />

implemented by a quarantine system at the user’s site,<br />

but this should act as a last resort.<br />

The second phase should be to recover the device to a<br />

known good state. For advanced devices, this<br />

probably means recovering the Secure Boot Manager,<br />

which can be achieved through a soft reset. In this<br />

domain, the system should be able to isolate any<br />

possible attacks and inhibit them, while holding the<br />

main system in safe mode or shut down.<br />

The third phase is to remediate with the release of<br />

patches and upgrades which should be applied<br />

through the low-level security services, again<br />

leveraging the secure boot manager functionality to<br />

ensure the patches are signed and encrypted, and to<br />

ensure that these are version managed to stop any<br />

roll-back attacks.<br />
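The quiesce, recover and remediate phases above can be sketched as a simple state machine; the states and events below are illustrative, not part of any specific product.<br />

```c
#include <assert.h>

/* Illustrative recovery state machine mirroring the three phases described:
 * quiesce, recover to a known good state, then remediate via a patch. */
typedef enum {
    STATE_NORMAL,
    STATE_QUIESCENT,   /* advanced functionality off, attack surface reduced */
    STATE_RECOVERING,  /* secure boot manager restoring a known good image   */
    STATE_PATCHED      /* remediated with a signed, version-checked update   */
} device_state_t;

typedef enum {
    EVT_ATTACK_DETECTED,
    EVT_RESET_DONE,
    EVT_PATCH_VERIFIED
} event_t;

/* Advance only along the legal recovery path; all other events are ignored. */
device_state_t next_state(device_state_t s, event_t e) {
    switch (s) {
    case STATE_NORMAL:     return e == EVT_ATTACK_DETECTED ? STATE_QUIESCENT  : s;
    case STATE_QUIESCENT:  return e == EVT_RESET_DONE      ? STATE_RECOVERING : s;
    case STATE_RECOVERING: return e == EVT_PATCH_VERIFIED  ? STATE_PATCHED    : s;
    default:               return s;
    }
}
```

Keeping the transitions this constrained means an attacker cannot skip the recovery step and move straight from a compromised state to an updated one.<br />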



V. BUILDING RESILIENT SYSTEMS<br />

Previously, we introduced many aspects necessary to<br />

recover from an attack, focusing on executing from a known<br />

good position, the ability to patch identified compromises, and<br />

the desire to update with versioning.<br />

The mechanism to do this within a microcontroller is a<br />

Secure Boot Manager, a small and lightweight security kernel<br />

operating at the lowest execution level, which is capable of<br />

being certified and flexible enough to support the long<br />

lifecycles needed for IoT-centric devices.<br />

The Secure Boot Manager, outlined below, is a modular<br />

framework which can be configured within a commercial IDE,<br />

such as the IAR Embedded Workbench, to deliver a range of<br />

solutions that stretch from a very small codebase to a feature-rich set of security functions. The Secure Boot Manager itself leverages the device boot framework of security-oriented microcontrollers, such as the STMicroelectronics STM32H7 and the Renesas Synergy S5 families, to create a secured domain operating below the RTOS, enabling even low-level drivers to be managed and updated over the lifecycle of the devices.<br />
devices.<br />

A modern SBM, such as that shown in Figure 2, should<br />

implement a rich set of functions to support the next<br />

generation of ultra-long-life devices.<br />

Figure 2. Citadel Edge Secure Boot Manager. (Secure Thingz)<br />

Firstly, it is important that the SBM support the secure key<br />

storage and management resident in the microcontroller.<br />

Fundamentally, to enable secure services, the devices must<br />

have sufficiently secure storage in which to store the private<br />

keys upon which the first secure communications over PKI<br />

asymmetric cryptography rely. Once a secure channel to the<br />

device is established, it is possible to program the device<br />

securely.<br />
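The boot-time measurement this implies can be sketched as follows; the FNV-1a digest is purely a stand-in for a real cryptographic hash such as SHA-256, and the function names are hypothetical.<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy digest standing in for a real hash (e.g. SHA-256) over the firmware. */
static uint32_t toy_digest(const uint8_t *img, size_t len) {
    uint32_t d = 2166136261u;               /* FNV-1a, for illustration only */
    for (size_t i = 0; i < len; i++) {
        d ^= img[i];
        d *= 16777619u;
    }
    return d;
}

/* Boot-time check: only jump to the application if its measured digest
 * matches the value provisioned into secure storage. */
int boot_allowed(const uint8_t *img, size_t len, uint32_t provisioned_digest) {
    return toy_digest(img, len) == provisioned_digest;
}
```

The reference digest lives in the secure storage discussed above, so an attacker who modifies the application image cannot also forge the value it is checked against.<br />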

Through leveraging the secure channel, it is possible to<br />

start building the secure foundations a product requires. First,<br />

the identity must be provisioned. The identity may be a<br />

certificate structure, such as the standard X.509 format, or a<br />

more bespoke form tailored to the IoT. The certificate relies<br />

on additional private keys being provisioned into the device to<br />

lock the device to the certificate. This may be extended with<br />

other forms of identity, such as physically unclonable<br />

functions (PUFs) or additional ownership keys. Once the<br />

device has been provisioned, it can be securely programmed.<br />

Secure programming enables OEMs to develop a master of their application, ensuring the code is both signed for specific devices and encrypted so that it remains secure in transit.<br />

Through a secure programming system, it is then possible to<br />

validate the device, via its certificate infrastructure, and<br />

deliver the encrypted image to the device, block by block. The<br />

key generated at mastering is also then exposed to the device,<br />

unlocking the code and enabling a raw image to be written to<br />

flash.<br />
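A minimal sketch of this block-by-block delivery, assuming a trivial XOR keystream in place of a real cipher such as AES, might look like this (all names are illustrative):<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* XOR keystream stands in for a real cipher (e.g. AES); illustrative only. */
static void decrypt_block(uint8_t *dst, const uint8_t *src, uint8_t key) {
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        dst[i] = src[i] ^ key;
}

/* Receive an encrypted image block by block, decrypt each block with the key
 * exposed at mastering time, and write the raw image into (simulated) flash. */
void program_image(uint8_t *flash, const uint8_t *enc,
                   size_t nblocks, uint8_t key) {
    for (size_t b = 0; b < nblocks; b++)
        decrypt_block(flash + b * BLOCK_SIZE, enc + b * BLOCK_SIZE, key);
}
```

Processing one block at a time keeps RAM requirements small, which matters on the constrained microcontrollers the paper targets.<br />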

The update framework within the SBM further extends this<br />

capability to enable patches generated by the IDE to be<br />

targeted to groups of devices or individual devices, based on<br />

the identity provisioned into the device, and on the definition<br />

of the Security World within the IDE itself. The update can<br />

take many forms including a single line code fix, a module, or<br />

an entire image depending on the fix being undertaken. In this<br />

way we can keep the impact of updates to a minimum and speed the adoption of patch infrastructure.<br />
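Targeting a patch to a device group or an individual device, as described above, might be modelled like this; the descriptor fields are hypothetical, not the SBM's actual format.<br />

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical patch descriptor; field names are illustrative. */
typedef struct {
    uint32_t group_id;   /* product group the patch is mastered for */
    uint64_t device_id;  /* non-zero when aimed at a single device  */
} patch_target_t;

/* Apply a patch only if it is addressed to this device's group, or to
 * this exact device, matching the identity provisioned at manufacture. */
int patch_targets_me(const patch_target_t *t,
                     uint32_t my_group, uint64_t my_id) {
    if (t->device_id != 0)
        return t->device_id == my_id;
    return t->group_id == my_group;
}
```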

Additionally, as mentioned earlier in the paper, it is important to integrate version management into our IoT devices. Whilst signed images prove they came from a valid source, they may unfortunately contain known exploits, and as such, it is important that attackers cannot explicitly roll back software to a known-bad version and take control of the<br />

device. This solution obviously requires integration into both<br />

the target device (MCU) and the development environment to<br />

function correctly, alongside the mastering tool which injects<br />

and manages keys. Hence a holistic and well integrated<br />

system is required.<br />
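As a sketch, the anti-rollback check reduces to a monotonic version comparison. accept_update and the plain stored_version variable are illustrative stand-ins; in a real Secure Boot Manager the last-accepted version would live in protected non-volatile storage and the candidate version would come from the image's validated, signed header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a counter held in protected non-volatile storage. */
static uint32_t stored_version = 7;

/* Reject any image whose version is lower than the last one accepted,
 * so an attacker cannot roll back to a signed but known-bad build. */
bool accept_update(uint32_t image_version)
{
    if (image_version < stored_version)
        return false;               /* rollback attempt: refuse */
    stored_version = image_version; /* monotonic update */
    return true;
}
```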

Modular updates are traditionally outside of the scope of<br />

microcontrollers due to the monolithic nature of their memory<br />

systems. However, with the advent of the TrustZone for<br />

Cortex-M devices (ARMv8-M architecture), we are seeing<br />

this change, with multiple modules now lying across<br />

protected and open memory. In this context, we can now focus<br />

on delivering smaller modules, but obviously the system needs<br />

to be compiled and built with this in mind. The advantage to<br />

this approach is that we can more easily constrain modules,<br />

increasing security, but also ensure that we minimise<br />

bandwidth impact alongside conserving battery power, as we<br />

do not have to program so much flash.<br />

VI. SUMMARY<br />

The reality of modern connected systems is that every<br />

system will be compromised, and as such, the impact on the<br />

industry will be long-term and pervasive. The solution to this<br />

inherent insecurity is to provide both business and technical<br />

frameworks that accept flaws and compromises, embrace them<br />



as a way of improving the product; and ultimately drive<br />

patches and updates across the lifecycle of the device.<br />

The end perspective is that while we cannot stop flaws we<br />

can, and should, continually improve the solution. To achieve<br />

this, a Secure Boot Manager delivering low-level secure<br />
services for patching, updates, remediation and foundational<br />
recovery mechanisms is key. Secure<br />
microcontrollers, leveraging advanced technologies such as<br />
ARM TrustZone for Cortex-M, lend themselves to dynamic<br />
remediation and recovery and, coupled with advanced Secure<br />
Boot Managers, provide the foundation for a secure future.<br />

VII. ACKNOWLEDGMENT<br />

All trademarks are the property of their respective<br />

holders.<br />



Hack-Proofing Your C/C++ Code<br />

Copyright 2018 by Greg Davis<br />

Introduction<br />

We are good at working with unreliable machines. At home, I lease a DVR from my<br />

cable company. It often locks up when I fast forward through commercials, so I have<br />

learned to hit a different key that skips rather than runs fast forward, and the DVR<br />

behaves better. When it comes time for the manufacturer of the DVR to release a<br />

product, they are under time pressure, and they focus on fixing the most critical bugs<br />

from a usability standpoint. A company that is not concerned about security is only<br />
concerned about “good enough”.<br />

But, if we are to focus on making our product hack-proof, we must hold ourselves to a<br />

higher standard. Hackers are known to use extreme testing, fuzzing, and static analysis to<br />

look for reliability problems in the product. When these reliability problems are found,<br />

they analyze them to see if they are commonly exploited problems, such as buffer<br />

overflows. The most promising bugs are pushed further as the hackers search for an<br />

exploit that will allow them to take control of the system. It is worth noting that source<br />

code is not necessary for a hacker to do these things. Source code makes their job easier,<br />

but it is not a requirement. The same thing can be said of security mechanisms such as<br />

ASLR, execute bits in the MMU, or stack canaries; their presence merely makes a<br />

hacker’s job harder. So to a hacker, “good enough” is not good enough to keep them out.<br />

You need to achieve a much higher level of reliability.<br />

The architecture of your software is definitely important, but when it comes to<br />

security, the way you write your code is just as important. Try looking over the recent<br />

security vulnerabilities in your browser; by my count, 75% of the critical problems are<br />

due to typical C/C++ program flaws (such as array overruns, use after free, etc.) as opposed<br />

to architectural flaws (such as privilege escalation or security bypasses). Thus, this paper<br />

focuses on tools and techniques that can be used to help prevent these coding flaws.<br />

Coding Standards<br />

An effective tool is to restrict C and C++ to avoid the problematic areas of the language.<br />

Coding standards do just this. When many people think of coding standards, they think<br />

of naming conventions, indentation styles, commenting, and the like. While these things<br />

are important, they’re also a religious issue. These issues are also just as applicable in<br />

other “safer” programming languages.<br />

The coding standards I’d like to discuss make the language safer and easier to<br />

understand. A number of standards exist such as MISRA C, MISRA C++, “The Power<br />

of Ten”, the Joint Strike Fighter C++ Coding Standard, and the CERT standard. I’ll give<br />

a couple of examples of what kinds of rules these standards use.<br />



For example, can you spot the problem in this code?<br />

line_a |= 256; /* set bit 8 */<br />

line_b |= 128; /* set bit 7 */<br />

line_c |= 064; /* set bit 6 */<br />

The problem is that in C and C++, any constant that starts with a 0 is an octal constant.<br />

So, while 64 == 0x40 and is a valid representation of bit 6, 064 == 52 or 0x34. Many<br />

coding standards avoid this problem by making it illegal to use octal constants at all. So,<br />

you’d have to express that last line as either:<br />

line_c |= 64; /* set bit 6 */<br />

or<br />

line_c |= 0x40; /* set bit 6 */<br />

As another example, can you spot the problem in this code? (The code will execute OK,<br />

but not in the way the programmer imagined)<br />

int round_to_nearest(float num)<br />

{<br />

if (num >= 0.0) {<br />

return (int)(num + 0.5);<br />

} else {<br />

return (int)(num - 0.5);<br />

}<br />

}<br />

The problem is that the constants 0.0 and 0.5 are expressed in double precision, while the<br />

input number is just in single precision. This means that all of the floating point in the<br />

function needs to be converted to double precision before it is operated on. While I<br />

specifically chose an example that will run OK, this misunderstanding between the<br />

programmer and the compiler is exactly the sort of thing that can cause reliability and<br />

performance problems later on. Coding standards may prevent this sort of thing by<br />

disallowing implicit casts between types.<br />
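One compliant rewrite is simply to give the constants float suffixes, so no promotion to double occurs; the function is renamed round_to_nearest_f here only to keep it distinct from the original:

```c
/* Same rounding logic, but 0.0f and 0.5f are single-precision
 * constants, so the arithmetic stays in float throughout. */
int round_to_nearest_f(float num)
{
    if (num >= 0.0f) {
        return (int)(num + 0.5f);
    } else {
        return (int)(num - 0.5f);
    }
}
```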

An important distinction when it comes to coding standards is to decide what can be<br />

automatically enforced. It’s one thing to catch problems during code reviews, but it’s<br />

another thing to have the problem pointed out immediately when you first try to compile<br />

the code. Manual code reviews also introduce room for human error or omission; who<br />

would have noticed the implicit conversion in the round_to_nearest() function,<br />

above?<br />



My recommendation when it comes to coding standards is twofold. First, start by reading<br />

up on some of the standards that I mentioned above. Hopefully they’ll give you some<br />

background and will pique your interest. Then, look at some of the tools that are<br />

available. Some are freely available, while others cost anywhere from $100 to thousands<br />

of dollars. What will your budget allow for? Look to configure the tools in a way that<br />

allows you to pick and choose the rules according to what you believe makes sense. You<br />

may agree with some rules in principle, but you may find that they are just too hard to<br />

address in your current code base. Start where you can, and improve your position over<br />

time.<br />

Static Analysis<br />

Static analysis is often seen as the big brother to coding standards. While coding<br />

standards look at code from a syntactic point of view, static analysis works during the<br />

compilation or building stage by simulating the effects of executing the code.<br />

As an example of what static analysis can detect, can you see the problem in the<br />

following code?<br />

int write_it(int dest /*fd*/, uintptr_t srcAddr,<br />

size_t len)<br />

{<br />

unsigned char *buf = (unsigned char *)srcAddr;<br />

int ret;<br />

while (len > 0 && (ret = my_write(dest, buf,<br />
len)) > 0)<br />
{<br />
buf += ret;<br />
len -= ret;<br />
}<br />
return ret;<br />
}<br />

The problem is that if “len” is zero, the value of “ret” will<br />
not have been initialized, so you’ll be returning an essentially random number.<br />

(Technically speaking, the behavior is much worse than this 1 .)<br />

1<br />

Although in many cases, the practical result of reading an uninitialized automatic variable is that the read<br />

may result in an indeterminate value, the result is considered undefined in the C and C++ standards.<br />

Examples exist where compilers will inadvertently optimize away sections of code due to a read of an<br />

uninitialized automatic.<br />



It’s worth noting that some of the coding standards might have prevented this bug by<br />

disallowing side effects on the right-hand side of a short-circuit operator. Forcing the<br />

user to write this code without the short-circuit operator might have been enough for the<br />

programmer to spot the error.<br />
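Independent of coding-standard rules, a defensive fix is to give ret a defined initial value. In this sketch, my_write() is stubbed out (as a stand-in that pretends every byte was written) purely so the fragment is self-contained:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the paper's my_write(): behaves like a write() call
 * that always succeeds; present only to make the sketch runnable. */
static int my_write(int fd, const void *buf, size_t len)
{
    (void)fd; (void)buf;
    return (int)len;
}

int write_it_fixed(int dest, uintptr_t srcAddr, size_t len)
{
    const unsigned char *buf = (const unsigned char *)srcAddr;
    int ret = 0; /* defined result even when the loop body never runs */

    while (len > 0 && (ret = my_write(dest, buf, len)) > 0) {
        buf += (size_t)ret;
        len -= (size_t)ret;
    }
    return ret;
}
```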

That said, static analysis has a number of advantages over coding standards:<br />

1. Static analysis doesn’t prohibit any coding constructs. You can keep your<br />

existing code.<br />

2. Static analysis points out bugs that are likely to arise in practice, while coding<br />

standards require many changes in cases where there wasn’t actually an existing<br />

problem in the code.<br />

3. Modern static analysis tools look at code globally, allowing for detection of<br />

problems that only occur across procedure boundaries.<br />

Still, static analysis suffers from a number of limitations.<br />

1. Static analysis tools employ a relatively limited number of rules that they check<br />

for. Many sources of bugs are not covered by these rules.<br />

2. Whereas a false positive for a coding standard suggests an obvious rewrite to the<br />

code that will silence the diagnostic, a false positive for static analysis may be<br />

more cumbersome to work around.<br />

3. Some static analysis vendors have found it counterproductive to point out<br />

problems that cannot be easily explained. While this is certainly pragmatic, one<br />

worries about the omitted issues.<br />

Automatic Run-Time Error Checking<br />

Static analysis is called “static” because it works during the compilation process.<br />

Automatic run-time error checking (RTEC for short) works in an entirely different<br />

manner. RTEC looks for specific problems during the execution of your code.<br />

RTEC may be implemented in a number of ways:<br />

1. A compiler may add the checks automatically.<br />

2. Uninstrumented code may be run in a simulation environment, where the<br />

environment performs checks. The open-source “valgrind” project is an example<br />

of this.<br />
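For a concrete, minimal illustration: both approaches observe the program as it runs, so they report only faults that actually execute. The build and run commands in the comments are standard GCC/Clang and valgrind usage; the function itself is a deliberately tiny example.

```c
/* Sketch: with GCC or Clang, building this file with
 *   cc -g -fsanitize=address demo.c
 * makes the compiler insert validity checks around memory accesses;
 * alternatively, the unmodified binary can be run under
 *   valgrind ./a.out
 * Either way, an error is reported only when a faulty access
 * actually executes -- RTEC observes; it does not guess. */
static int array[5];

int get_elem(int index)
{
    /* An out-of-range index here is reported at run time by ASan
     * or valgrind, not at compile time. */
    return array[index];
}
```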

To see the difference between static analysis and RTEC, consider the following fragment<br />

of code:<br />

int array[5];<br />

int get_and_set1(int index, int value)<br />

{<br />

int ret = array[index];<br />

array[index] = value;<br />



}<br />

return ret;<br />

Obviously, any call to get_and_set1() will be invalid when the “index”<br />

argument is not between the values of 0 and 4, inclusive. A static analysis tool won’t<br />

report a problem unless it is reasonably sure that a value outside this range will be used.<br />

RTEC doesn’t guess. It essentially treats the code as if it were written:<br />

int array[5];<br />

int get_and_set1(int index, int value)<br />

{<br />

int ret;<br />

if (index < 0 || index > 4) {<br />
report_error();<br />
}<br />
ret = array[index];<br />
array[index] = value;<br />
return ret;<br />
}<br />

RTEC has the advantage of being able to detect error cases that are not apparent at<br />

compile time. On the other hand, it requires greater run-time resources, while static<br />

analysis runs solely at compile time.<br />

Not all classes of errors require RTEC. For example, consider the code:<br />

int get_and_set2(int index, int value)<br />

{<br />

static int *ptr = NULL;<br />

static int len = 0;<br />

int ret;<br />

if (index >= len) {<br />

int *nptr = calloc(index + 1, sizeof(int));<br />
memcpy(nptr, ptr, len*sizeof(int));<br />
ptr = nptr;<br />
len = index + 1;<br />
}<br />
ret = ptr[index];<br />
ptr[index] = value;<br />

return ret;<br />

}<br />



This code takes care of the array overrun that plagued get_and_set1(), but it<br />

doesn’t check for a NULL pointer return from the calloc() function. For all<br />

practical purposes, any use of a pointer returned from a C memory allocation routine<br />

before checking for a NULL pointer is an error. Static analysis is perfectly capable of<br />

detecting this class of problem. RTEC focuses on problems that are dynamic. RTEC<br />

complements coding standards and static analysis because of its dynamism, but it is<br />

only as good as your test vectors.<br />
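As a sketch of the missing check, using a helper named grow_table (not part of the original code): the pattern is simply to test the calloc() result before any other use of the pointer.

```c
#include <stdlib.h>
#include <string.h>

/* Grow an int table, propagating allocation failure to the caller
 * instead of dereferencing a possibly-NULL pointer. */
int *grow_table(int *old, size_t old_len, size_t new_len)
{
    int *nptr = calloc(new_len, sizeof *nptr);
    if (nptr == NULL) {
        return NULL;    /* the check get_and_set2() forgot */
    }
    if (old != NULL) {
        memcpy(nptr, old, old_len * sizeof *old);
        free(old);      /* also avoid leaking the old buffer */
    }
    return nptr;
}
```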

Assertions<br />

Another pillar of high reliability software is the frequent use of assertions.<br />

A static assertion is an assertion that must be checked at compile-time. It can be used for<br />

defensive programming and to make explicit the assumptions in code.<br />

// The following function assumes that a pointer<br />

// and an int are the same size.<br />

static_assert(sizeof(int) == sizeof(void *), "");<br />

void do_sketchy_pointer_arithmetic(void)<br />

{<br />

// ...<br />

A static assertion can only be used to check compile-time constants. For example:<br />

static_assert(sizeof(header) <= 64, "");<br />

// ...<br />



A run-time assertion, by contrast, generates code that is checked during execution. Typically, assertions are<br />

enabled during development to find as many problems as possible. Then when a product<br />

enters a testing phase, assertions are disabled so that the product will run as fast as<br />

possible.<br />

Most development projects will implement their own form of assertions rather than using<br />

the standard implementation. Since the standard assert.h requires a lot of<br />

strings for each dynamic assert, a custom assert macro may be desired in order to produce<br />

smaller code. Some aspects to consider include:<br />

1. What happens when an assertion fails? Is there some kind of console where the<br />

error can be printed? Other assertion systems will go into an infinite loop when<br />

an assertion fails, expecting the programmer to find the problem in a debugger.<br />

2. What happens to a run-time assertion when run-time assertions are disabled?<br />
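A minimal custom assert macro along these lines might look like the following sketch. The handler here counts failures so the behavior can be observed; a real embedded port might instead print to a console, spin in a loop for the debugger, or reset the device, and the macro compiles away entirely under NDEBUG.

```c
#include <stdio.h>

/* Test hook; a real port might loop forever or reset the device. */
static int g_assert_failures;

static void my_assert_fail(const char *file, int line)
{
    fprintf(stderr, "ASSERT %s:%d\n", file, line);
    g_assert_failures++;   /* or: for (;;) {} to wait for a debugger */
}

/* Only __FILE__/__LINE__ are stored per site -- no stringized
 * condition text -- which keeps the code smaller than assert.h. */
#ifdef NDEBUG
#define MY_ASSERT(cond) ((void)0)
#else
#define MY_ASSERT(cond) \
    ((cond) ? (void)0 : my_assert_fail(__FILE__, __LINE__))
#endif
```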

You should not write code that relies on an assertion in order to be correct. For example:<br />

// Bad example: We need save_to_flash to be<br />

// executed even when the run-time assertion<br />

// macro is set to not do anything.<br />

assert(save_to_flash(data) == err_none);<br />

// Better example:<br />

err_t result = save_to_flash(data);<br />

assert(result == err_none);<br />

It is OK if assertions use conditions involving function calls, so long as these function<br />

calls are not doing anything that the system might come to rely on. For example:<br />

assert(dictionary.size() > 10000); // OK<br />

Like compile-time assertions, run-time assertions are most valuable to you when they<br />

break your system, because fixing the assertion will be easier than finding the problem<br />

when it manifests itself as a downstream glitch.<br />

Conclusion<br />

We have explored a number of tools and techniques that you can use to help your project<br />

become more secure.<br />



What Is an IoT OS?<br />

Christian Légaré<br />

Silicon Labs Inc.<br />

Montréal, Québec, Canada<br />

christian.legare@silabs.com<br />

Abstract— A lot of attention in the Internet of Things (IoT) is<br />

given to the cloud: data, analytics, networking (fog computing),<br />

and mobile devices (tablets and smartphones). Unfortunately, IoT<br />

devices – the devices that produce the data – are often the<br />

neglected element in this system. Little attention is given to the<br />

architecture, design, and implementation of IoT devices. This<br />

paper will cover one aspect of the development of IoT devices: The<br />

IoT OS.<br />

First, we need to differentiate between a real-time kernel and<br />

an operating system. In the embedded space, a kernel is often<br />

referred to as an RTOS (real-time operating system), but that’s<br />

something of a misnomer: it’s not actually a full-fledged operating<br />

system. Rather, a kernel is the basis of a complete operating<br />

system.<br />

The embedded OS, or IoT OS, is composed of an RTOS (real-time<br />
kernel) plus multiple services and middleware stacks that<br />

provide connectivity and security.<br />

Keywords—IoT, OS, RTOS, MCU, Connectivity, Security,<br />

Modularity, Scalability, Machine Learning, Blockchain<br />

Yet at the same time, these devices may require multiple<br />

networking protocols, security (multiple encryption and<br />

decryption algorithms), and the ability for remote firmware<br />

updates (Firmware Over the Air, FOTA). All this requires<br />

resources that push the limits of the average microcontroller.<br />

And we haven’t even mentioned new technologies such as<br />

machine learning, blockchain, and others. So, how do we<br />

address these requirements with such limited resources?<br />

We must architect the IoT system in such a way so that we can<br />

achieve all these requirements and still meet an ownership cost<br />

that makes the system commercially viable.<br />

First, the use of a gateway in many cases is a virtual necessity.<br />

It is impossible to run all the software that an IoT system needs<br />

on a sensor/actuator device. So, which operating system do we<br />

run on the device, and which functions do we enable? And<br />

similarly, which operating system do we run on the gateway,<br />

and which services do we provide?<br />

I. INTRODUCTION<br />

The IoT is often presented as information technology (IT) being<br />
pushed into the realm of operational technology (OT). In practice,<br />
however, the typical set of technologies employed by<br />
IT – especially web technologies – cannot simply be applied to<br />
building IoT edge devices.<br />

Looking at the IoT this way presents a major problem:<br />

specifically, the cost of the hardware used for IoT devices.<br />

Because IoT devices are produced in enormous quantities, we<br />

need them to be produced as cheaply as possible to make the<br />

system affordable and to make the business case ROI-positive.<br />

The average IoT device microcontroller runs between 50 MHz<br />

and 200 MHz, contains between 64 KB and 1 MB of flash<br />

memory (for code space), and has 4 to 512 KB of RAM. By<br />

comparison, processors that run smartphones, tablets, or cloud<br />

servers run at gigahertz speeds, have terabytes of available<br />
storage, and gigabytes of RAM. So, the average IoT device is not<br />

capable of running typical IT software.<br />

Figure 1: Generic IoT system architecture<br />

Typically, the gateway will be running on an application<br />

processor (Cortex-A, Intel Quark or similar) and will use a<br />

general-purpose OS (GPOS) such as Android or Linux.<br />

The majority of the predicted billions of IoT devices will use<br />

microcontrollers (typically a Cortex-M), and so a GPOS is out<br />

of the question. Some software developers may prefer to use<br />

fine-tuned bare metal code to maximize the amount of code that<br />



can run on the device. This can be a solution, but it typically<br />

takes a lot of time to develop a reasonable IoT application using<br />

a single-threaded (bare metal) approach, which increases time<br />

to market.<br />

The essence of IoT is connectivity. Connectivity stacks<br />

(TCP/IP, Wi-Fi, Bluetooth, Thread, Zigbee, Wireless Hart, and<br />

many others) are large pieces of code, and are time-sensitive<br />

(protocols rely on timeouts). Therefore, the use of a real-time<br />

kernel is a good practice. It simplifies the software architecture,<br />

helps achieve performance, and reduces maintenance costs.<br />

Designing a product is all about the system requirements. A<br />

bare-metal (single threaded, super-loop) approach might be<br />

satisfactory for the design. In other situations, an open-source<br />

solution might be the right choice, based on the desired<br />

functions and features.<br />

An IoT-specific OS will increasingly play a significant role in<br />

device design. While there are GPOSs out there that provide<br />

connectivity, and do find their way into embedded systems,<br />

those GPOS-based systems do not meet real-time requirements.<br />

And there are other types of requirements where an IoT OS is a<br />

much better match.<br />

Safety certification is also a concern for IoT systems, as there<br />

are industries where devices and the software running on them<br />

must meet safety-critical regulations. Using an IoT OS (or at<br />

least a kernel) that already has a validation suite available will<br />

save time and money in these markets.<br />

As of today, there is still no industry definition of an IoT OS.<br />

The following sections will lay the foundation of what such an<br />

OS must be. For the industry to build and ship the forecasted<br />

billions of deployed devices, such a definition is mandatory.<br />

II. INDUSTRIAL VS COMMERCIAL<br />

The software requirements for industrial and consumer IoT<br />

devices can differ quite a bit. Although they might share a<br />

common kernel and low-level services, the middleware<br />

required by their applications can be radically different.<br />

Figure 2: A low-power industrial IoT device (left)<br />
and a consumer IoT device (right)<br />
In Figure 2, the left side depicts the software stack for an<br />
industrial IoT device such as a wireless sensor node. This is a<br />
low-power, low-cost device that may run entirely on battery.<br />
Such a device might typically use a Cortex-M0 or<br />
Cortex-M3/M4 MCU. It would use a highly efficient wireless network<br />
protocol such as Zigbee, Thread or Z-Wave to reduce<br />
transmission time and save power. And it would communicate<br />
over short distances wirelessly using Bluetooth or low-power<br />
Wi-Fi, or else use Ethernet, Sigfox, LoRa or NB-IoT when used<br />
as an edge node. This kind of industrial product must never fail<br />
once deployed, and it would be in service for decades.<br />
The right side of Figure 2 illustrates the software stack for a<br />
consumer IoT device. In a consumer environment, web<br />
technologies are more common, and this example includes a<br />
Java virtual machine. Consumer products may also make use of<br />
specific vertical market protocols such as AllJoyn, HomeKit,<br />
HomePlug/HomeGrid, Continua Health Alliance, or 2net. Such<br />
a device typically might use a Cortex-M3/M4 or a Cortex-A<br />
processor.<br />
Consumer products typically have shorter lifespans than<br />
industrial products, and tend to be replaced more frequently.<br />
Consumers also are more accepting of product failures; for<br />
example, how often do you reboot your smartphone? The fact<br />
is that failures in consumer products are tolerable, whereas in<br />
industrial products, failures can endanger people’s lives. And<br />
the lengthy validation process to establish the reliability of<br />
embedded software takes more time and money than the makers<br />
of consumer devices are willing (or even need) to spend.<br />
These requirements will drive your choice of operating system,<br />
as platform choice shouldn't dictate a device's functionality.<br />
III. MODULARITY<br />
IoT devices will also require a modular operating system that<br />
separates the core kernel from middleware, protocols, and<br />
applications. The reasons are ease of development and keeping<br />
the memory footprint of the software to a minimum.<br />
Figure 3: Modular IoT OS.<br />
Common modules in orange, optional stacks in blue.<br />
Using a modular OS simplifies the development process,<br />
especially when developing a family of devices with different<br />
capabilities. Relying on a common core allows the entire family<br />
of devices to share a common code base, while each device is<br />



customized with only the middleware and protocol stacks<br />

required by the application.<br />

This approach also allows for a smaller memory footprint in the<br />

device. Unlike a monolithic operating system that bundles an<br />

entire suite of software together, a modular operating system<br />

allows for tailoring the embedded software for the device,<br />

requiring less RAM and flash memory and reducing costs.<br />

A real-time kernel, used as a simple scheduler, requires about<br />

4–5 KB of code space. In a multithreaded environment, we’ll<br />

need most of the kernel services (semaphores, mutexes,<br />

message queues, event flags, and so on), and the kernel will<br />

require something like 20–25 KB of flash memory for code and<br />

about 2 KB of RAM. Depending on the number of tasks the<br />

kernel needs to manage, RAM usage will grow because each<br />

task requires a stack. We can estimate an average of 1 KB of<br />

RAM per task; the total depends on the application complexity<br />

and the call stack depth.<br />

The total required code space depends on how many stacks are<br />

involved. For example, a TCP/IP stack can require about 50 KB<br />

of code and tens of KB of RAM, depending on how many<br />

connections are opened and the desired performance. Similarly,<br />

the code for Bluetooth requires about 120 KB, and its RAM<br />

usage is about 20 KB.<br />

So, you need to sum up all the code and RAM usage and<br />

evaluate whether the hardware can support the required<br />

software. With an IoT OS, the total is still very small when<br />

compared to a GPOS, but the average microcontroller does not<br />

have the resources of an application processor.<br />
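As a worked example of this bookkeeping, using the rough flash figures above plus an assumed 40 KB application on an assumed 256 KB part (both of those numbers are illustrative, not measurements):

```c
/* Back-of-the-envelope flash budget from the figures in the text:
 * kernel ~25 KB, TCP/IP ~50 KB, Bluetooth ~120 KB. APP_KB and
 * FLASH_KB are assumptions for the sake of the example. */
enum {
    KERNEL_KB = 25,
    TCPIP_KB  = 50,
    BLE_KB    = 120,
    APP_KB    = 40,   /* assumed application size */
    FLASH_KB  = 256,  /* assumed mid-range Cortex-M part */
};

int flash_budget_kb(void)
{
    /* 25 + 50 + 120 + 40 = 235 KB: fits in 256 KB, with little slack */
    return KERNEL_KB + TCPIP_KB + BLE_KB + APP_KB;
}
```

The same exercise must be repeated for RAM (task stacks, socket buffers, and so on) before committing to a part.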

The other important part of Figure 3 is the yellow box, the<br />

common API. If you are using a commercial OS, the API for all<br />

the various functions are likely to have commonalities and be<br />

easy to use, something that is not the case when using an open<br />

source operating system. A common API between different<br />

communication stacks makes it easier and faster for you to<br />

develop, test and validate your application.<br />

IV. SCALABILITY<br />

A flexible, scalable RTOS can help increase return on<br />

investment, cut development costs, and reduce time to market.<br />

Although deeply embedded systems have historically been built<br />

entirely around 8- and 16-bit MCUs, the price of 32-bit MCUs<br />

has been dropping rapidly. As they have become commodity<br />

products, their popularity for embedded devices has<br />

skyrocketed.<br />

A common engineering solution for networked sensor systems<br />

is to use two processors in the device. In this arrangement, an<br />

8- or 16-bit MCU is used for the sensor or actuator, while a 32-<br />

bit processor is used for the network interface. That second<br />

processor runs an IoT OS.<br />

Sales of 32-bit MCUs have exploded in the last decade and have<br />

become the largest segment of the MCU market. The 32-bit<br />

MCU segment alone is expected to grow to 70% of the MCU<br />

market, forecasted to total 100 billion MCUs by 2024. (For<br />

more information: https://goo.gl/PZ9QhY.)<br />

IoT devices will still contain a mixture of small and large MCUs<br />

for years to come. A scalable IoT OS that runs on a variety of<br />

16- and 32-bit MCUs will meet tight memory requirements,<br />

reduce processor demands, and save money.<br />

Even so, it is difficult to run all the security and<br />
market-specific IoT software (AI, blockchain, vertical-market<br />
protocols, and more) on a single microcontroller. The business<br />

case will drive the design. When using a larger microcontroller<br />

is not cost effective, you will need to centralize tasks on a<br />

gateway.<br />

Designing an IoT system to include a gateway has two main<br />

advantages:<br />

• The ability to scale down nodes to have a lighter software<br />

load<br />

• The ability to take cloud software to the edge (i.e., run part<br />

of your device lifecycle management locally)<br />

Depending on the target cost for the system, the gateway<br />

processor could be a microcontroller or an application<br />

processor. It could run either an IoT OS or heavier software<br />

such as Linux or Android, but it should be said that an IoT OS<br />

will allow you to squeeze every MIPS out of the processor.<br />

Using the same operating system for both the edge devices and<br />

the gateway will simplify your software development, ease the<br />

learning curve, and reduce product maintenance costs.<br />

Over time, the preferred CPU for the gateway will likely settle<br />

on an application processor. The edge nodes, on the other hand,<br />

will continue to require highly specialized components. To<br />

reach the desired target cost, the device designers will need to<br />

implement only the features that are strictly required for a given<br />

application. And it is crucial that the IoT OS is scalable across<br />

many varieties of devices, so that code can be portable.<br />

V. RELIABILITY<br />

Many IoT systems will be deployed in safety-critical<br />

environments, or in locations where repair and replacement<br />

are difficult. IoT devices will need to be faultlessly reliable.<br />

In these situations, an IoT OS must have safety-critical<br />

certification. This kind of certification is vital to demonstrate<br />

the reliability and safety of your device. Certifications that you<br />

may require include:<br />

• DO-178B for avionics systems<br />



• IEC 61508 for industrial control systems<br />

• IEC 62304 for medical devices<br />

• IEC SIL3/SIL4 for transportation and nuclear systems<br />

Certifying code is an expensive proposition, but the entire<br />

IoT OS may not require complete certification. The best<br />

practice would be to certify the kernel, its memory protection<br />

unit, and any task that is deemed safety-critical. Other<br />
non-safety-critical tasks can run in separate memory regions,<br />

allowing them to be isolated from the safety-critical parts of the<br />

application.<br />

When building products for use in a safety-critical environment,<br />

software that is already certified can reduce certification time<br />

for a device, and reduce costs. Every safety-critical part of the<br />

device will require certification and extensive documentation.<br />

Validation suites and certification kits, typically available from<br />

third parties, provide thousands of pages of documentation. To<br />

be clear, it is not the components that are certified; it is the<br />

complete product. But using modules that have existing<br />

validation suites or certification kits will save time and reduce<br />

cost.<br />

Even if certification isn't required for the device, knowing that<br />

the OS running within it has been certified can provide<br />

confidence and peace of mind that your product will perform<br />

reliably.<br />

VI. CONNECTIVITY<br />

Network connectivity is essential to the Internet of Things.<br />

Whether we are talking about wireless sensor nodes in a factory,<br />

or networked medical devices in a hospital, the industry now<br />

expects embedded devices to be connected to each other and to<br />

communicate with corporate or public networks.<br />

This fact changes how you think about product development:<br />

there will be less emphasis during development on<br />

multithreading. Your chosen platform must be easy to use and<br />

already hardened, feature robust connectivity out of the box,<br />

and must work with your chosen hardware. Essentially, you<br />

want to focus on your application and trust that the software<br />

platform is usable, rugged, and stable.<br />

To achieve these goals, the IoT OS must support<br />

communications standards and protocols such as Ethernet,<br />

IEEE 802.15.4, Wi-Fi, and Bluetooth. The device must be able<br />

to connect to IP networks using bandwidth-efficient protocols<br />

such as Thread. In Figure 3, the blue vertical stacks show many<br />

of the available connectivity technologies. The list is not<br />

exhaustive.<br />

An IoT OS will allow you to select only the specific protocol<br />

stacks you need, which again means saving memory on the<br />

device, and reducing cost. And it can help retrofit existing<br />

devices with new connectivity options without reworking the<br />

core of the embedded software.<br />
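As a sketch of how such per-device stack selection might look at build time, the fragment below gates each stack behind a compile-time flag so that disabled protocols never reach the image. The CFG_* macro names are invented for illustration; real IoT OSes typically drive this from a configuration system such as Kconfig.<br />

```c
#include <assert.h>

/* Compile-time stack selection: only the protocols enabled below are
 * compiled and linked into the image, saving flash and RAM. The macro
 * names are illustrative, not from any particular IoT OS. */
#define CFG_STACK_ETHERNET 0
#define CFG_STACK_802154   1
#define CFG_STACK_THREAD   1
#define CFG_STACK_BLE      0

int enabled_stack_count(void)
{
    int n = 0;
#if CFG_STACK_ETHERNET
    n++;  /* eth_init() would be called here */
#endif
#if CFG_STACK_802154
    n++;  /* ieee802154_init() */
#endif
#if CFG_STACK_THREAD
    n++;  /* thread_init(); Thread runs over 802.15.4 */
#endif
#if CFG_STACK_BLE
    n++;  /* ble_init() */
#endif
    return n;
}
```

With this scheme, retrofitting a device with a new connectivity option is a matter of enabling one flag and relinking, without touching the core of the application.<br />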

VII. POWER MANAGEMENT<br />

Referring again to Figure 3, an IoT system that uses two or more<br />

stacks with a power management strategy will require a power<br />

management service. Such a service receives the sleep signals<br />

(which specify when the system can enter a sleep mode),<br />

manages the peripherals going into a sleep mode, and wakes up<br />

the processor and peripherals when a wake signal (time-based<br />
or event-based) is detected.<br />

Power management cannot be delegated to any single protocol stack.<br />

Power management must be centralized so that all peripherals<br />

and memory are managed properly when the system is entering<br />

or exiting sleep mode.<br />
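A minimal sketch of such a centralized service is shown below: peripheral drivers register sleep and wake hooks, and only the power-management service sequences entry to and exit from sleep mode. All names are illustrative rather than taken from a particular kernel.<br />

```c
#include <assert.h>

/* Minimal sketch of a centralized power-management service. Peripherals
 * register sleep/wake hooks; only this service sequences sleep entry/exit,
 * so no individual stack owns the transition. */
#define PM_MAX_CLIENTS 8

typedef struct {
    void (*sleep)(void);  /* quiesce the peripheral */
    void (*wake)(void);   /* restore the peripheral */
} pm_client_t;

static pm_client_t pm_clients[PM_MAX_CLIENTS];
static int pm_count;

int pm_register(void (*sleep)(void), void (*wake)(void))
{
    if (pm_count >= PM_MAX_CLIENTS) return -1;
    pm_clients[pm_count].sleep = sleep;
    pm_clients[pm_count].wake = wake;
    pm_count++;
    return 0;
}

/* Called when a sleep signal says the system may enter sleep mode. */
void pm_enter_sleep(void)
{
    for (int i = 0; i < pm_count; i++) pm_clients[i].sleep();
    /* here the kernel would execute WFI or a vendor low-power entry */
}

/* Called on a time-based or event-based wake signal. */
void pm_exit_sleep(void)
{
    for (int i = pm_count - 1; i >= 0; i--) pm_clients[i].wake();
}

/* Example client: a hypothetical UART driver that tracks its state. */
static int uart_active = 1;
static void uart_sleep(void) { uart_active = 0; }
static void uart_wake(void)  { uart_active = 1; }
```

The vendor-specific part, the actual low-power instruction and register writes, stays behind the pm_enter_sleep/pm_exit_sleep boundary, which is exactly the porting surface described above.<br />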

Not so long ago, a real-time kernel did not have to worry about<br />

this kind of feature; it was left to the designer to implement it.<br />

But now, with the growing complexity of edge devices and the<br />

multiplication of battery-operated devices, power management<br />

is becoming a commodity. This service must now be a standard<br />

component of a real-time kernel.<br />

This presents a problem for portability. Each silicon vendor<br />

implements sleep modes differently. Even if the real-time<br />

kernel implements sleep and wake-up functions, these functions<br />

must be ported to each unique hardware architecture.<br />

VIII. SECURITY<br />

Security is the hot-button topic in media coverage of IoT today.<br />

The average lifetime of a consumer IoT device may be years,<br />

but industrial IoT devices must function for decades. We need<br />

to rethink how to protect these devices. Consider the increasing<br />

lifespan of IoT devices, combined with the huge number of such<br />

devices producing an IoT blanket covering the globe, plus the<br />

rapid advances in knowledge and tools used by attackers. It is<br />

simply not feasible to build end-node devices that are supposed<br />

to remain secure throughout their practical lifetimes.<br />

To help protect the software running on these devices, silicon<br />

vendors are adding hardware security features to their<br />

processors. Memory protection units (MPUs) have been used for<br />

a long time in safety-critical applications to isolate code and<br />

data. Now, MPUs are being applied to security. But MPUs<br />

alone are not sufficient. When it comes to cryptography and<br />

security key management, we need to guarantee that the keys<br />

will not be tampered with, and are stored securely.<br />

A complete trusted execution environment (TEE), which is<br />

usually found on application processors, is now available on<br />

microcontrollers. ARM-based microcontrollers now feature<br />

hardware components such as a secure element (which contains<br />

encrypted keys) and/or TrustZone. The inclusion of these new<br />

hardware features requires additional software to configure and<br />

control them. The configuration and management of these<br />



hardware components – MPU, secure elements and TrustZone<br />

– are the responsibility of the real-time kernel and so become<br />

new services available to the application tasks.<br />

IX. AI AND MACHINE LEARNING<br />

With the introduction of artificial intelligence (AI)<br />

technologies, end-users will start perceiving IoT systems as<br />

having human-like qualities. When a user wants to get or set<br />

data on a device, the response time (often referred to as latency)<br />

will be crucial. And if most of the AI processing is done in the<br />

cloud, it only adds to the response time. This is one reason AI<br />

technologies such as deep learning are moving to the edge of<br />

the network. Edge devices will increasingly handle some<br />

portion of the AI processing.<br />

In digital assistant services such as the ones offered by Amazon,<br />

Apple, and Google, cloud computing is being used for natural<br />

language processing. As this technology evolves, the<br />

algorithms could be transferred to the edge, allowing a faster<br />

response time. Applying the same principle to other forms of<br />

AI in devices, we can see how to build larger, connected<br />

systems. As AI decision making moves closer to the edge,<br />

automation becomes faster and therefore applicable to more<br />

scenarios.<br />

As AI and deep learning algorithms improve, and as processors<br />

become more powerful and optimized to run such software,<br />

decision making at the edge will become a commodity service<br />

and provide the framework for a new generation of<br />

applications.<br />

An IoT OS provides the software architecture for these new<br />

algorithms to coexist with sensor/actuator and communication<br />

software. AI algorithms would be implemented as tasks that run<br />

concurrently with other system tasks.<br />
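As a toy illustration of that coexistence, the sketch below runs an "AI inference" task in turn with a sensor task under a trivial cooperative scheduler. A real IoT OS would use preemptive, priority-based scheduling; the round-robin loop here only shows the structural idea of AI work packaged as just another task.<br />

```c
#include <assert.h>

/* Toy cooperative scheduler: an "AI inference" task coexists with a
 * sensor task, each run to completion in turn. The task names and the
 * run-to-completion model are illustrative only. */
#define MAX_TASKS 4

typedef void (*task_fn)(void);
static task_fn tasks[MAX_TASKS];
static int task_count;

int scheduler_add(task_fn fn)
{
    if (task_count >= MAX_TASKS) return -1;
    tasks[task_count++] = fn;
    return 0;
}

/* One scheduling round: give every registered task a slice. */
void scheduler_run_round(void)
{
    for (int i = 0; i < task_count; i++) tasks[i]();
}

static int sensor_reads, inferences;
static void sensor_task(void) { sensor_reads++; }
static void ai_task(void)     { inferences++; }  /* model inference would run here */
```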

X. BLOCKCHAIN<br />

There are many obstacles today slowing down the adoption of<br />

IoT. First, the market for IoT devices and platforms is<br />

fragmented, with many standards and many vendors. The<br />

uncertainty about the technology, the vendors, and the solutions<br />

adds to these obstacles.<br />

Second, there are also concerns about interoperability, and the<br />

solutions implemented often tend to create new data silos.<br />

As I mentioned above in the Security section, data is often<br />

stored in the cloud securely, but these cloud-based security<br />

implementations cannot protect devices against compromised<br />

integrity, nor against tampering with data at the source.<br />

Finally, the centralized architecture of most IoT solutions<br />

means that there are potentially serious issues with resiliency.<br />

When all transactions are processed in the cloud, unavailability<br />

of cloud resources will freeze your business operations.<br />

Blockchain is a technology that could help with system<br />

resiliency. The basic concept of blockchain is quite simple: it is<br />

a distributed database that maintains a continuously growing<br />

list of ordered records. But the term “blockchain” is usually tied<br />

to transactions, smart contracts, or cryptocurrencies. This is<br />

why we need to dissociate blockchain from specific<br />

implementations such as Bitcoin and Ethereum. In fact, the<br />

convergence of blockchain and the IoT is on the agenda for<br />

many companies. And there are existing implementations,<br />

solutions, and initiatives in several areas outside of IoT and<br />

financial services.<br />

The blockchain community believes that KSI (Keyless<br />

Signature Infrastructure) blockchain is a technology that can be<br />

used to provide integrity for all assets, including the cloud and<br />

edge devices.<br />

According to IBM, the three benefits of blockchain for IoT are:<br />

building trust, cost reduction and the acceleration of<br />

transactions. Specifically:<br />

• Building trust between the parties and devices with<br />

blockchain cryptography and reducing the risk of<br />

collusion and tampering<br />

• Reducing cost by removing the overhead associated with<br />

middlemen and intermediaries<br />

• Accelerating transactions by reducing the settlement time<br />

from days to nearly instantaneous<br />

Blockchain can add a commercial dimension to IoT. A block<br />

contains the transaction, but can also contain the contract. So,<br />

an IoT device could buy or sell data from/to another device or<br />

system.<br />

The blending of blockchain and IoT devices is not for the near<br />

future. Blockchain processing tasks are computationally<br />

difficult and time-consuming, and IoT devices are still<br />

relatively underpowered, lacking the processing power to<br />

directly participate in a blockchain. This is for good reason: the<br />

heavy computational load helps protect integrity. As Salil<br />

Kanhere, an associate professor and researcher at the University<br />

of New South Wales presents it: “Standard IoT devices can’t do<br />

this kind of heavy computational work, just like you can’t mine<br />

bitcoins on a standard laptop anymore”. So this type of<br />
application will be seen on high-end gateways first.<br />

But as hardware and software technologies evolve, the<br />

architecture provided by an IoT OS would allow the device<br />

blockchain function to integrate well with all the other system<br />

tasks.<br />

XI. TOOLS<br />

An IoT OS has many more layers of complexity than an RTOS.<br />

And any state-of-the-art commercial offering ought to include<br />

its own customized development environment.<br />



Of course, a toolchain (compiler, linker, loader) is mandatory<br />

for software development. Commercial toolchains, including<br />

IDEs and other advanced debugging tools, are commodity<br />

products, and every developer has his or her own favorite tool.<br />

But from the point of view of integration with the OS itself, they<br />

are not strongly differentiated.<br />

Modern developer tools should provide a GUI-driven interface.<br />

And ideally, they should work in a stand-alone fashion so that<br />

they can be integrated into the customer’s development/test<br />

environment. This requires that they provide command-line<br />

interfaces and perhaps even run-time APIs (i.e., a server/client<br />

interface) for easy integration.<br />

The tools must be as easy to use as possible. Admittedly, it is a difficult design challenge to bring all these tools into a simple, streamlined workflow.<br />

The following are recommended requirements for any development environment that supports an IoT OS:<br />

• It should run on the most popular desktop platforms (Windows, Linux, macOS).<br />
• It should be context-aware, based on the selected processor.<br />
• It should use an attractive, intuitive, and responsive developer portal so that you can easily pull all the code and configuration data you require.<br />
• It should use a secure delivery method for its software and tools.<br />
• It should provide access to online training, such as videos, user guides, and API references.<br />
• The development environment should be agnostic with respect to the toolchain, and so it should provide seamless interoperation with various IDEs.<br />
• It should be web- and cloud-friendly.<br />
• It should be easy to update, as software is a living entity, and is always evolving.<br />

In summary, the IoT OS development tool should allow you to build a platform in a few minutes. It should resolve the software/hardware and software/software dependencies for you, and create a code base that compiles without warnings and errors. It has always been the goal of the embedded industry to provide a comprehensive tool that allows you to concentrate on your application and not the platform. With an IoT OS and proper tooling, this is an achievable goal.<br />

XII. CONCLUSION<br />

Commercial-grade real-time operating systems (RTOSs) are being deployed more and more widely for their determinism, flexibility, portability, scalability, and support.<br />

The use of a kernel, especially an RTOS, plus all the connectivity, security, and market-specific middleware, provides the software architecture and efficiency required for designing IoT devices.<br />

A modern IoT system can feature tiny sensor nodes running on small, low-power MCUs, as well as large gateways running on powerful application processors. A single OS that can run on all these types of processors is a vital component for any IoT system design. While the RTOS has been a commodity component for embedded devices for many years, we are now looking at a full-fledged operating system for a new class of devices: the IoT OS.<br />



Software Architectures for IoT<br />

Rob Oshana<br />

Vice President, Software Engineering<br />

Microcontrollers, NXP Semiconductors<br />

Austin, TX, USA<br />

robert.oshana@nxp.com<br />

Abstract—In this paper we will introduce a reference<br />

software architecture for the Internet of Things. Based on real<br />

world examples, we will define the software architecture<br />

requirements and constraints, discuss the fundamental<br />

structure of the software architecture, talk about the relevant<br />

cross-cutting concepts for IoT architectures such as logging,<br />

error handling, security and recovery, IoT operating systems,<br />

and connectivity stacks. We will discuss what software<br />

components are available in the open source community and<br />

how to leverage the software ecosystem to develop software<br />

architectures for IoT systems. We will discuss the three<br />

software stack architectures for device, gateway and cloud that<br />

together make up the IoT system architecture required for<br />

today's IoT systems. We will demonstrate this with some real<br />

world examples showing IoT software architectures for<br />

specific vertical markets.<br />

Keywords—IoT, Software, Architecture<br />

I. INTRODUCTION<br />

In this paper we will introduce a reference software<br />

architecture for the Internet of Things. Based on real world<br />

examples, we will define the software architecture<br />

requirements and constraints, discuss the fundamental structure<br />

of the software architecture, talk about the relevant cross-cutting<br />
concepts for IoT architectures such as logging, error<br />

handling, security and recovery, IoT operating systems, and<br />

connectivity stacks. We will discuss what software components<br />

are available in the open source community and how to<br />

leverage the software ecosystem to develop software<br />

architectures for IoT systems. We will discuss the three<br />

software stack architectures for device, gateway and cloud that<br />

together make up the IoT system architecture required for<br />

today's IoT systems. We will demonstrate this with some real<br />

world examples showing IoT software architectures for<br />

specific vertical markets.<br />

II. ACCUMULATE LEGO BLOCKS<br />

A. IoT Software Architectures<br />

IoT software architectures should be scalable across<br />

different categories:<br />

• Smart things<br />

• Connected things<br />

• Secure things<br />

• Safe things<br />

We need this Lego block approach because there are<br />

several IoT software architectures:<br />

1. Software stacks for constrained devices<br />

• Lightweight RTOS or bare metal, hardware<br />
abstractions, communications, remote<br />
management<br />

2. Software stacks for gateways<br />

• General purpose operating systems like<br />

Linux, communications/connectivity, data<br />

management, messaging, remote management<br />

3. Software stacks for cloud IoT<br />

• Analytics and applications<br />

We also need to consider cross-cutting concerns:<br />

• Security<br />

• Tools and Software development kits<br />

• Ontologies<br />

B. Key characteristics for IoT<br />

Key characteristics involved in developing software<br />

architectures for IoT include:<br />

• Loosely coupled: IoT stacks exist for small<br />

microcontrollers, edge microprocessors, and cloud<br />

providers. It should be possible to use software<br />

stacks for these different nodes independently,<br />

from different internal and external vendors.<br />

• Modular: when developing a software<br />

architecture for the different levels of IoT, it<br />

should be possible to use different components<br />

(e.g. a security stack) from different vendors to put<br />

together a solution.<br />



• Platform independent: use the appropriate<br />

hardware abstraction layers (HAL) to separate the<br />

software stack from the underlying hardware<br />

device. A mature Software Development Kit<br />

(SDK) should support this for multiple device<br />

families.<br />

• Based on open standards: consider using an open<br />

approach instead of an internal approach when<br />

possible. For example, OpenThread should be<br />

considered for a Thread stack instead of a<br />

proprietary internal Thread stack unless there is a<br />

compelling advantage in performance,<br />

functionality, etc.<br />

• Defined APIs: although IoT standards are evolving, it is important to use standard APIs when available.<br />

III. ENSURE LEGO BLOCKS WORK WELL TOGETHER<br />

Once you have a scalable software architecture for IoT that supports device, edge, and cloud platforms, the next step is to ensure these blocks work well together. Figure 1 below is an example of a software development kit for low cost microcontrollers.<br />

Figure 1. SDK for low cost IoT microcontroller<br />

A. Open Source components<br />

Leverage open source components in your software IoT architecture for increased community support, interoperability, and scalability. Two examples of this are:<br />

• Zephyr (Figure 2): Zephyr is a small real-time operating system for connected, resource-constrained devices supporting multiple architectures and released under the Apache License 2.0.<br />

Figure 2. Zephyr IoT operating system<br />

• Mbed from ARM (Figure 3): The Arm® Mbed IoT Device Platform provides the operating system, cloud services, tools and developer ecosystem to make the creation and deployment of commercial, standards-based IoT solutions possible at scale.<br />

Figure 3. ARM mbed IoT platform<br />

B. Challenges with IoT integration<br />

There are many challenges in getting an IoT system to work effectively “out of the box”. These include:<br />

• The connectivity framework can be difficult to learn, internalize and use<br />
• Documentation is often insufficient<br />
• It can take weeks/months of effort to get an integration to work<br />
• IoT frameworks are architected to reflect a “connectivity centric” view, not an application view. Apps need to be twisted and turned to align with the framework (not simply installed on it).<br />
• IoT integrations can lead to bastardized implementations that end up as unique one-offs suited only to the specific app being built<br />
• End-to-end security framework for IoT configurations<br />

To overcome these challenges, interoperability and stress testing are required based on multiple industry and customer use cases.<br />

IV. BUILD YOUR LEGO BLOCK CASTLE<br />

To build your own “Lego castle” the first step is to select<br />

the platform of choice: device end node, gateway, or cloud.<br />

For example, Figure 4 shows an end-node microcontroller that<br />
supports IoT with connectivity, processing, and security. The<br />



appropriate HAL will abstract the device specifics from the<br />

software architecture.<br />
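One common way to realize such a HAL is an operations table of function pointers that the portable stack calls through, with each device family supplying its own implementation. The sketch below uses an invented radio interface and a stub implementation to show the pattern; the ops and names are not from a specific SDK.<br />

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a HAL boundary: the portable stack calls through an ops
 * table, and each device family supplies its own implementation.
 * The operations shown are illustrative, not from a specific SDK. */
typedef struct {
    int (*init)(void);
    int (*send)(const uint8_t *buf, int len);
    int (*recv)(uint8_t *buf, int max);
} radio_hal_ops_t;

/* A stub "device family" implementation standing in for real hardware. */
static int stub_init(void) { return 0; }
static int stub_send(const uint8_t *buf, int len) { (void)buf; return len; }
static int stub_recv(uint8_t *buf, int max) { (void)buf; (void)max; return 0; }

static const radio_hal_ops_t stub_radio = { stub_init, stub_send, stub_recv };

/* Portable upper-layer code: depends only on the ops table, never on
 * device-specific registers. */
int stack_send_beacon(const radio_hal_ops_t *hal)
{
    static const uint8_t beacon[] = { 0xB0, 0x01 };
    if (hal->init() != 0) return -1;
    return hal->send(beacon, (int)sizeof beacon);
}
```

Porting to a new device family then means writing one new ops table, while the stack above the HAL is reused unchanged.<br />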

Figure 4. Microcontroller device for IoT<br />

An IoT-based software development kit (SDK) would<br />

support an IoT device like this with a standard set of<br />

enablement software:<br />

• CMSIS-CORE compatible software drivers<br />

• Single driver for each peripheral<br />

• Transactional APIs w/ optional DMA support for<br />

communication peripherals<br />

Integrated RTOS:<br />

• RTOS-native driver wrappers<br />

Integrated Stacks and Middleware:<br />

• USB Host, Device and OTG<br />

• lwIP, FatFS<br />

• Crypto acceleration plus wolfSSL & mbedTLS<br />

• SD and eMMC card support<br />

• Multicore - eRPC<br />

Ecosystems:<br />

• Mbed<br />

Reference Software:<br />

• Peripheral driver usage examples<br />

• Application demos<br />

• FreeRTOS usage demos<br />

License:<br />

• BSD 3-clause for startup, drivers, USB stack<br />

Toolchains:<br />

• KDS, IAR®, ARM® Keil®, GCC w/ CMake<br />

• + MCUXpresso IDE<br />

Quality:<br />

• Production-grade software<br />

• MISRA 2004 compliance<br />

• Checked with Coverity® static analysis tools<br />

Once you have a scalable device SDK, it makes it easier to<br />

integrate a device into a cloud SDK as shown in Figure 5.<br />

Figure 5. Integrating a device SDK with a Cloud SDK<br />

ACKNOWLEDGMENTS<br />

I would like to thank Jason Martin and Constantin Enascuta<br />

for contributing material used in this paper.<br />

REFERENCES<br />

[1] Srivaths Ravi, Anand Raghunathan, Paul Kocher, Sunil Hattangady,<br />
"Security in embedded systems: Design challenges", ACM Transactions on<br />
Embedded Computing Systems (TECS), vol. 3, no. 3, August 2004.<br />
[2] Ala Al-Fuqaha, Mohsen Guizani, Mehdi Mohammadi, Mohammed<br />
Aledhari, Moussa Ayyash, "Internet of Things: A Survey on Enabling<br />
Technologies, Protocols, and Applications", IEEE Communications Surveys &<br />
Tutorials, vol. 17, pp. 2347-2376, 2015, ISSN 1553-877X.<br />



Implementation of a Web Development Platform for<br />

Embedded System Designers<br />

Milan Raj<br />

IoT Software Technologies<br />

National Instruments<br />

Austin, Texas, United States<br />

milan.raj@ni.com<br />

Abstract—Distributed intelligent embedded devices rely on<br />

continuous feature improvements and on-demand user<br />

interactions to meet the evolving requirements of distributed<br />

systems. Technologies like HTML, CSS, and JavaScript are<br />

quickly becoming the de facto standard for developing cross-platform<br />
software. However, most embedded systems developers,<br />

despite proficiency in low-level programming, find web<br />

development unapproachable due to unfamiliar programming<br />

patterns, standards, frameworks and tools. The use of<br />

proprietary technologies or constrained platforms limits the<br />

ability of embedded developers to keep up with changing<br />

requirements in distributed applications. This paper explores the<br />

implementation of a next generation, open HMI development<br />

platform based on web standards and open source codebases.<br />

Keywords—HTML; JavaScript; CSS; WYSIWYG; asm.js; HTTP; HMI; Web<br />

I. INTRODUCTION<br />

Despite the introduction of many new user interface technologies and application development practices over the years, the web browser has maintained consistent availability across practically every device with rich graphical user interfaces. Active efforts by browser developers to improve interoperability [1] and focus on development of low-level primitives [2] have shifted the role of the browser from an interactive document scripting environment to a high-performance application virtual machine.<br />

We demonstrate how to leverage open standards and community driven developments to minimize risk of platform divergence and build an adaptable code base capable of adopting new standards as they stabilize. We present a WYSIWYG web-based HMI development platform for embedded system designers and discuss our strategies for web technology selection.<br />

II. SYSTEM ARCHITECTURE OVERVIEW<br />

A development environment hosting a modern embedded web browser is used to enable the WYSIWYG creation of HTML user interfaces. A user of the development environment can write logic for the corresponding HTML UI using a graphical dataflow programming language to read from and write values to HTML UI elements. The user is also capable of using the graphical programming language to perform HTTP network requests and perform parsing and analysis of plain text and JSON responses. The HTML UI and dataflow programming diagram are both contained in a source file format operating as a functional unit and referred to as a Web Virtual Instrument. Multiple Web Virtual Instrument files can reference each other and are compiled together to generate a complete web application consisting of an application-specific HTML file and a Virtual Instrument Assembly file containing a text representation of the transformed and merged dataflow programming diagrams. In addition to application-specific files, static resources are generated containing JavaScript, CSS, and other media assets that are consistent between each generated web application.<br />

Fig. 1. Compilation of multiple Web Virtual Instrument source files to a<br />
standalone web application.<br />

As an open platform, the development environment creates<br />

web applications capable of communicating with arbitrary<br />

HTTP-based web services. Included with the development<br />

environment is a data services platform to facilitate<br />

communication between generated web applications and<br />

embedded devices. The data services platform consists of<br />

HTTP endpoints queryable from generated web applications<br />

and allows for reading and writing named variables with latest<br />

values and subscribing to named queues of messages. An<br />

embedded device on a network can use HTTP or AMQP to<br />

communicate with the data services platform to publish the<br />

current device state or subscribe to message queues.<br />



A complete HMI solution may consist of a web application<br />

generated by the development environment that communicates<br />

with the data services platform for monitoring and control of<br />

network connected embedded systems. An operator can use<br />

commodity devices with modern web browsers to access the<br />

generated web applications. The focus of this paper will be on<br />

the architecture of the generated web applications which utilize<br />

open standards to implement modern user interfaces and<br />

implement client-side logic driven by a high-level graphical<br />

programming language.<br />

Fig. 2. Example configuration for a complete HMI solution.<br />

III. GENERATED WEB APPLICATION ARCHITECTURE<br />

A. Overview of Deployed Files<br />

Each built web application consists of an HTML file with<br />

configuration used for the user interface of the application and<br />

a Virtual Instrument Assembly text file containing a low-level<br />

text representation of the user-developed dataflow<br />

programming diagram. In addition, there is a static resources<br />

directory containing the JavaScript files for implementing the<br />

HTML UI controls, CSS for HTML UI control theming,<br />

JavaScript implementing a dataflow programming language<br />

runtime, and other resources used such as images and<br />

localization files.<br />

The primary entry point of the web application is the<br />

HTML file. The HTML file contains references to the<br />

resources previously described, an inline stylesheet<br />

corresponding to the HTML UI controls the user interactively<br />

placed in the WYSIWYG editor, and a series of custom HTML<br />

elements with attributes that represent the configured state of<br />

the HTML UI controls.<br />

B. Custom Element Utilization<br />

Custom Elements are a technology that allows user-defined<br />

custom HTML tags to be registered in a web application. These<br />

custom HTML tags enable new types of user interface<br />

elements to be added to a web page by inserting the newly<br />

defined tag in the HTML. The author of a Custom Element can<br />

observe when the element is inserted or removed from the<br />

HTML or when the element's HTML attributes are modified<br />

[3]. Custom Elements behave like other HTML Elements in the<br />

Document Object Model (DOM) of a web page. Like other<br />

HTML Elements natively supported by a web browser, each<br />

Custom Element has HTML attributes, JavaScript properties<br />

and methods, and the ability to fire and listen for events<br />

triggered in the DOM hierarchy.<br />

For web applications generated by the development<br />

environment, the entire state of each HTML UI control is<br />

captured in the respective HTML attributes of their Custom<br />

Element. The benefit of this approach over having HTML UI<br />

control configuration stored in generated JavaScript or other<br />

data formats is improved embeddability and styling<br />

customizability. The Custom Elements can be moved around,<br />

styled, and manipulated like any other HTML element and hold<br />

their configuration information. These properties also make<br />

Custom Element-based HTML UI controls highly reusable in<br />

other web applications.<br />

C. Modeling and Framework<br />

During the startup of a generated web application the<br />

lifecycle events of Custom Elements are monitored by a<br />

JavaScript framework that implements semantics like the<br />

Model-View-ViewModel (MVVM) pattern. After a Custom<br />

Element is connected to the DOM the framework triggers the<br />

creation of a Model object and ViewModel object that<br />

corresponds to the View represented by the Custom Element.<br />

In the MVVM framework, the Model provides an<br />

abstraction over the representation of the properties for an<br />
HTML UI control. The Model representation of a control’s<br />
properties provides a consistent interface for other agents such<br />

as the desktop environment or embedded dataflow runtime to<br />

send updates. When an update occurs to a Model, the<br />

corresponding ViewModel updates a Render object<br />

representing a mutation action for the corresponding HTML UI<br />

control. The Render objects are queued into the Render engine<br />

that is serviced on a requestAnimationFrame [4] callback for<br />

the web page. The net effect is that high frequency Model<br />

updates can be performed which are collated and serviced at<br />

optimal rendering times across all Custom Elements managed<br />

by the MVVM framework.<br />

Fig. 3. Flow of a property update through the MVVM framework to apply to<br />

a Custom Element.<br />
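The update path of Fig. 3 can be sketched roughly as follows. The class names are illustrative, not the framework's real API, and `requestAnimationFrame` is stubbed with a timer so the sketch runs outside a browser.

```javascript
// Sketch of coalesced rendering: each Model update overwrites any pending
// Render action for the same control, and the queue is flushed once per
// animation frame. requestAnimationFrame is stubbed for Node.
const requestAnimationFrame = (cb) => setTimeout(cb, 16);

class RenderEngine {
  constructor() {
    this.queue = new Map();    // at most one pending action per control
    this.scheduled = false;
  }
  enqueue(controlId, applyFn) {
    this.queue.set(controlId, applyFn);   // later updates replace earlier ones
    if (!this.scheduled) {
      this.scheduled = true;
      requestAnimationFrame(() => this.flush());
    }
  }
  flush() {
    this.scheduled = false;
    for (const apply of this.queue.values()) apply();
    this.queue.clear();
  }
}

// The Model holds property values; its "ViewModel" role is reduced here to
// turning each property change into a Render action for the element.
class Model {
  constructor(controlId, element, engine) {
    Object.assign(this, { controlId, element, engine, props: {} });
  }
  set(prop, value) {
    this.props[prop] = value;
    this.engine.enqueue(this.controlId, () => {
      Object.assign(this.element, this.props);   // mutate the "DOM" element
    });
  }
}
```

Because the queue keys by control, a burst of high-frequency `set` calls collapses into a single mutation applied at the next frame, which is the collation behavior described above.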

D. Application Update Service Management<br />

In addition to having Custom Elements to capture the state<br />

and initialize modeling for HTML UI controls, there are also<br />

nonvisible Custom Elements used to store the configuration for<br />

update service management. The update services are state<br />

machines tasked with transitioning the state of a web<br />

application from page load through completion. Different<br />

update services are implemented for the different environments<br />

and expected behaviors of the web application.<br />

Two of the most significant update services are the editor<br />

update service and the local update service:<br />

The editor update service is utilized when the web<br />

application is running inside the development environment and<br />

is expected to respond to WYSIWYG editing operations. As<br />

the user performs editing operations such as changing size,<br />

position, or configuration of a control, messages are sent<br />

asynchronously from the development environment to the<br />

embedded web browser hosting the web application. The editor<br />



update service receives the messages and applies the updates to<br />

the Models targeted by the development environment.<br />

The local update service is used when a user tests the<br />

execution of their web application in the development<br />

environment or when the web application is deployed and<br />

running in a standalone web browser. In this configuration the<br />

local update service has the responsibility of fetching the<br />

application-specific Virtual Instrument Assembly file, passing<br />

the Virtual Instrument Assembly file contents to the bundled<br />

dataflow runtime environment, and mediating control updates<br />

between the dataflow runtime environment and the Models.<br />

E. Virtual Instrument Runtime Engine Object (Vireo)<br />

Vireo is an open source dataflow runtime used in the web<br />

application to execute the instructions stored in Virtual<br />

Instrument Assembly files. The runtime is a compact C++<br />

project capable of managing memory and scheduling execution<br />

of the low-level dataflow programs created by the user in the<br />

development environment. The runtime has been designed for<br />

execution in resource constrained embedded systems giving it a<br />

small size and memory footprint suitable for web applications.<br />

To make Vireo executable in the browser environment we<br />

leveraged the open source Emscripten toolchain [5] to compile<br />

C++ source to a subset of the JavaScript language known as<br />

asm.js [6]. The asm.js subset of JavaScript makes an efficient<br />

target for compilers by restricting code to use primarily math<br />

operations and to perform those operations on one large shared<br />

JavaScript ArrayBuffer. Benchmarking has shown that C++<br />

projects compiled to asm.js and running on modern JavaScript<br />

browser runtimes execute within two-thirds the speed, or<br />

better, of the same C++ projects compiled using Clang or GCC<br />

and executing natively [7].<br />
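A hand-written toy in the asm.js style gives the flavor of the subset: explicit integer (`|0`) and double (`+`) coercions, and reads from one shared heap. Real asm.js is emitted by Emscripten rather than written by hand, and this sketch may not pass strict asm.js validation, but it still executes as ordinary JavaScript.

```javascript
// Toy module in the asm.js style. The module takes the standard library,
// a foreign-function object (unused here), and one shared ArrayBuffer
// heap, and returns its exported functions.
function SumModule(stdlib, foreign, heap) {
  "use asm";
  var f64 = new stdlib.Float64Array(heap);
  function sum(n) {
    n = n | 0;                       // coerce the argument to int32
    var total = 0.0;
    var i = 0;
    for (i = 0; (i | 0) < (n | 0); i = (i + 1) | 0) {
      total = total + +f64[i << 3 >> 3];   // double read from the heap
    }
    return +total;
  }
  return { sum: sum };
}
```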

IV. PERFORMANCE CHARACTERISTICS OF WEB-BASED CONTROLS<br />

A common concern based on historical JavaScript<br />

execution behavior and conflated with poor usage patterns of<br />

the browser DOM API is that HTML UI controls may be<br />

unable to maintain fast and responsive user interfaces. We<br />
have observed that we can achieve desirable<br />
performance characteristics when we approach development of<br />

HTML UI elements with the same rigor we would use in other<br />

user interface environments.<br />

The best demonstration of performance characteristics for<br />

HTML UI controls comes from the graphing and charting<br />

Custom Element implementations. These Custom Elements<br />

were implemented by leveraging the existing open source Flot<br />

charting library [8] and creating an open source fork, known as<br />

engineering-flot, with features and optimizations well-suited<br />

for engineering and scientific applications [9].<br />

In the Custom Elements built using the engineering-flot<br />

codebase we utilize the HTML5 canvas element for drawing,<br />

prevent unnecessary copies of buffers, and implement data<br />

decimation algorithms to avoid unnecessary drawing<br />

operations. Benchmarking of the graph Custom Elements on<br />

modern desktop web browsers has shown the capability of<br />

rendering over 500,000 data points per frame at sixty frames<br />

per second [10].<br />
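Min/max decimation of this general kind reduces each bucket of raw samples to its extremes, so narrow peaks survive while the number of drawn points stays bounded. This is an assumed shape of the technique; the engineering-flot implementation will differ in detail.

```javascript
// Illustrative min/max decimation: split the samples into `buckets`
// buckets and keep only each bucket's minimum and maximum, emitted in
// time order so the drawn polyline does not zig-zag backwards.
function decimateMinMax(samples, buckets) {
  if (samples.length <= 2 * buckets) return samples.slice();
  const out = [];
  const size = samples.length / buckets;
  for (let b = 0; b < buckets; b++) {
    const start = Math.floor(b * size);
    const end = Math.min(samples.length, Math.floor((b + 1) * size));
    let min = Infinity, max = -Infinity, minIdx = start, maxIdx = start;
    for (let i = start; i < end; i++) {
      if (samples[i] < min) { min = samples[i]; minIdx = i; }
      if (samples[i] > max) { max = samples[i]; maxIdx = i; }
    }
    if (minIdx <= maxIdx) out.push(min, max);
    else out.push(max, min);
  }
  return out;
}
```

Rendering then draws at most 2 × buckets points per frame regardless of how many raw samples arrived, which is what keeps high sample rates affordable on a canvas.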

V. CONSIDERATIONS FOR ADOPTION OF NEW WEB<br />

TECHNOLOGIES<br />

Selecting features of the modern web platform to adopt in a<br />

new project has additional considerations compared to<br />

traditional desktop application development or even traditional<br />

web development. Historically, web browser versioning was<br />
highly coupled to the operating system platform and operating<br />
system version for which the browsers were designed. This<br />
coupling made it common to use the browser version of the target<br />

audience as the primary consideration for choosing which<br />

features to adopt during web application development. Many<br />

modern browsers release frequently and can have releases<br />

performed independently from their hosted operating system.<br />

These continuously updated browsers are referred to as<br />

"evergreen" browsers and can result in benefits such as<br />

improved interoperability with other browsers [11].<br />

With browsers updating frequently, decoupled from the<br />

underlying operating system, and containing increasingly<br />

interoperable sets of shared features, it becomes possible to<br />

change the web platform feature selection process for new<br />

application development. Instead of choosing a browser and<br />

opting into all the web platform features that browser<br />

implements, it is possible to choose a feature and see if it is<br />

implemented in all the browsers you choose to support.<br />

If there is a subset of browsers that do not support a feature,<br />

it may be possible to utilize a polyfill for the feature. A polyfill<br />

is code that attempts to implement a feature that might be<br />

missing from a browser where existing browser features can be<br />

used to replicate or closely approximate the missing feature<br />

[12]. From our experience, a well-designed and low-risk<br />

polyfill may have some, or all, of the following characteristics:<br />

• Compact in code size in a shipping web application<br />

• Comparable performance to the feature as natively<br />

implemented in a browser<br />

• Closely implements the native browser feature with few<br />

polyfill-specific exceptions<br />

• Makes well-understood changes to the browser global<br />

environment<br />

• Represents a specification that browsers follow or have<br />

committed to follow in the future<br />

• Delegates execution of the feature to the native<br />

implementation if available<br />

• Removable with little to no changes in source code as<br />

browsers enable the feature natively<br />
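The first and last of these characteristics, delegating to the native implementation and being removable, follow from the standard install pattern, sketched here with `Array.prototype.includes` as a stand-in feature. This is a simplified sketch, not a spec-complete polyfill.

```javascript
// Delegate-if-available polyfill pattern: install only when the native
// feature is missing, so the polyfill becomes a no-op (and can later be
// deleted) once browsers ship the feature.
function installIncludes(proto) {
  if (typeof proto.includes === 'function') return;   // native wins
  Object.defineProperty(proto, 'includes', {
    configurable: true, writable: true, enumerable: false,
    value: function includes(searchElement, fromIndex) {
      const len = this.length >>> 0;
      for (let i = Math.max(fromIndex | 0, 0); i < len; i++) {
        const x = this[i];
        // SameValueZero comparison, so NaN matches NaN
        if (x === searchElement || (x !== x && searchElement !== searchElement)) {
          return true;
        }
      }
      return false;
    },
  });
}

installIncludes(Array.prototype);   // no-op wherever the runtime ships it
```

Because the install is a no-op wherever the native feature exists, removing the polyfill later requires no changes to the code that calls the feature.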

As opposed to consuming libraries that try to abstract over<br />

differences between browsers by providing a nonstandard API<br />

on top of those differences, polyfills attempt to bring up the<br />

baseline usable set of features across browsers. While custom<br />

libraries lead to an increasingly siloed ecosystem of libraries<br />

interdependent on nonstandard APIs, utilizing polyfills that are<br />

backed by open standards leads to the polyfills becoming<br />

removable as they are backed by native browser<br />

implementations over time.<br />



VI. SUMMARY<br />

We have discussed an architecture for enabling<br />

development of web applications supporting WYSIWYG<br />

manipulation using an MVVM style framework to schedule<br />

HTML UI control updates in a performant manner. We<br />
described how Custom Elements are used to hold UI<br />

configuration and form the basis of reusable HTML UI<br />

controls that are highly customizable by users in a deployed<br />

application. In addition, nonvisible Custom Elements also<br />

maintain configuration in HTML attributes creating a<br />

consistent interface for all configuration of the web application.<br />

We demonstrated the ability to use open source tooling to<br />

leverage existing C++ code through compilation to the asm.js<br />

JavaScript subset and described results of highly performant<br />

and responsive HTML UI graph controls. Finally, we presented<br />

an approach for selecting web platform features for use in new<br />

web application development by selecting features backed by<br />

open standards with native browser implementations or with<br />

implementations that can be backed by well-designed polyfills.<br />

REFERENCES<br />

[1] About, Mozilla, webcompat.com/about.<br />

[2] The Extensible Web Manifesto, Extensibleweb,<br />

extensiblewebmanifesto.org/.<br />

[3] Denicola, Domenic. “Custom Elements”. W3C, 13 Oct. 2016,<br />

w3.org/TR/custom-elements/#custom-element-reactions.<br />

[4] “Window.requestAnimationFrame().” Mozilla Developer Network,<br />

Mozilla, 28 Nov. 2017, developer.mozilla.org/en-<br />

US/docs/Web/API/window/requestAnimationFrame.<br />

[5] “Emscripten.” GitHub, github.com/kripken/emscripten.<br />

[6] Herman, David, Alon Zakai, and Luke Wagner. Asm.js, Mozilla, 18<br />

Aug. 2014, asmjs.org/spec/latest/.<br />

[7] Zakai, Alon, and Robert Nyman. “Gap between asm.Js and native<br />

performance gets even narrower with float32 optimizations – Mozilla<br />

Hacks - the Web developer blog.” Mozilla Hacks – the Web developer<br />

blog, Mozilla, 20 Dec. 2013, hacks.mozilla.org/2013/12/gap-between-<br />

asm-js-and-native-performance-gets-even-narrower-with-float32-<br />

optimizations/.<br />

[8] “Flot.” GitHub, github.com/flot/flot.<br />

[9] “Engineering-flot.” GitHub, github.com/ni-kismet/engineering-flot.<br />

[10] “Creating Web Enabled HMIs with LabVIEW NXG.” Performance by<br />

Mark Black, Eli Kerry, and Omid Sojoodi, Creating Web Enabled HMIs<br />

with LabVIEW NXG, National Instruments, 23 May 2017,<br />

youtube.com/watch?v=N4XCNfGapc4&t=1m17s.<br />

[11] Beeman, Hadley. “The evergreen Web.” W3C, 9 Feb. 2017,<br />

w3.org/2001/tag/doc/evergreen-web/.<br />

[12] Lawson, Bruce, and Remy Sharp. Introducing HTML5. New Riders,<br />

2012, pp. 276-277.<br />



Make your industrial device smart using a SaaS IoT<br />

platform<br />

Stefan Vaillant<br />

CTO<br />

Cumulocity GmbH<br />

Dusseldorf, Germany<br />

cumulocity@piabo.net<br />


I. ABSTRACT<br />

Today, and increasingly in the future, machines are being<br />
transformed into smart machines. Pumps, compressors, bikes,<br />
transformers, industrial vehicles, and more need to get smart.<br />
Smart machines provide remote access, preventive and<br />
predictive maintenance, pay-per-use, and other services. The<br />
fastest and lowest-risk approach to making machines "smart"<br />
is to connect them to a Software-as-a-Service (SaaS) IoT<br />
platform. This presentation presents the overall approach and<br />
its advantages, along with many industrial examples from<br />
real-world customers.<br />



Which IoT Protocol Should I Use for My System?<br />

Christian Légaré<br />

Silicon Labs Inc.<br />

Montréal, Québec, Canada<br />

Abstract—Embedded systems using sensors and connectivity<br />

are not new to embedded developers. However, using these<br />

elements with multiple additional internet technologies is.<br />

Internet protocols (IPs) are not new, but dedicated IPs for the<br />

IoT are, and they are used to help shape system capabilities.<br />

There are multiple IP application layer protocols that are<br />

above the TCP/IP sockets. Each one has its advantages and<br />

constraints. Knowing them helps developers make the best<br />

design choices for a product. Bandwidth requirements, real-time<br />
performance, and memory footprint are some of the main<br />

criteria to use in selecting an IoT protocol. Many IoT projects<br />

are being driven by CIOs and IT departments, which are<br />

pushing developers to use the technologies and protocols they<br />

know in IoT devices. However, IoT devices are often closer to<br />

operational technologies (OTs), so, pushing IT technologies<br />

into the OT domain is often not an optimal choice.<br />

I. INTRODUCTION<br />

Developers need to be educated that there are better choices<br />

for IoT devices than IT technologies.<br />

There are multiple categories of IP:<br />

• Consumer vs. industrial<br />

• Web services<br />

• IoT services<br />

• Publish/Subscribe<br />

• Request/Response<br />

All these factors must be considered when designing a new<br />
system. Let’s look at IPs for the IoT and define the selection<br />
criteria.<br />

II. THE INTERNET<br />

The internet is the sum of all network equipment used to route<br />
IP packets from a source to a destination. The world wide<br />
web, by comparison, is an application system that runs on the<br />
internet. The web is a tool built for people to exchange<br />
information, and over the years the web has been developed<br />
and refined so that ordinary, nontechnical people can use the<br />
internet easily and productively. For example, the human<br />
interface for the internet now includes email, search engines,<br />
browsers, mobile apps, Facebook and Twitter, among other<br />
popular social media.<br />

By comparison, in the IoT, the idea is for electronic devices to<br />
exchange information over the internet. But these devices<br />
don’t yet have the machine equivalent of browsers and social<br />
media to facilitate communication. The IoT is also different<br />
from the web because of the speeds, scales, and capabilities<br />
that IoT devices require in order to work together. These<br />
requirements are far beyond what people need or use. We are<br />
at the beginning of the development of these new tools and<br />
services, and this is one of the reasons why a definition for IoT<br />
is difficult to lock down. Many visions about what it can, or<br />
could be, collide.<br />

III. TCP/IP PROTOCOL STACK<br />

The TCP/IP protocol stack is at the heart of the internet and<br />
the web. It can be represented using the OSI seven-layer<br />
reference model, as illustrated below (Figure 1). The top three<br />
layers are grouped together, which simplifies the model.<br />

Figure 1. OSI Seven-layer reference model.<br />



Figure 1. TCP/IP Stack Reference Model<br />

The following is a quick description of the important layers<br />

from the perspective of embedded system integration:<br />

1. Physical and Data Link Layers<br />

The most common physical layer protocols used by<br />

embedded systems are:<br />

• Ethernet (10, 100, 1G)<br />

• Wi-Fi (802.11b, g, n)<br />

• Serial with PPP (point-to-point protocol)<br />

• GSM, 3G, LTE, 4G<br />

2. Network Layer<br />

This is where the internet lives. The internet—short for<br />

inter-network—is named so because it provides<br />

connections between networks, between the physical<br />

layers. This is where we find the ubiquitous IP address.<br />

3. Transport Layer<br />

Above IP, we have TCP and UDP, the two transport<br />

protocols. Because TCP is used for most of our human<br />

interactions with the web (email, web browsing, etc.), it is<br />

widely believed that TCP should be the only protocol<br />

used at the transport layer. TCP provides the notion of a<br />

logical connection, acknowledgment of packets<br />

transmitted, retransmission of packets lost and flow<br />

control—all of which are great things. But for an<br />

embedded system, TCP can be overkill. Therefore, UDP,<br />

even if it has long been relegated to network services such<br />

as DNS and DHCP, is now finding its place in the<br />

domains of sensor acquisition and remote control. If you<br />

need some type of management of your data, you can<br />

even write your own lightweight protocol on top of UDP<br />

to avoid the overhead imposed by TCP.<br />

UDP is also better suited than TCP for real-time data<br />

applications such as voice and video. The reason is that<br />

TCP’s packet acknowledgment and retransmission<br />

features are useless overhead for those applications. If a<br />

piece of data (such as a bit of spoken audio) does not<br />

arrive at its destination in time, there is no point in<br />

retransmitting the packet, as it would arrive out of<br />

sequence and would garble the message.<br />

TCP is sometimes preferred to UDP, because it provides a<br />

persistent connection. So, to do the same thing with UDP,<br />

you must implement this feature yourself in a protocol<br />

layer above UDP.<br />
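One hypothetical shape for such a lightweight layer over UDP is stop-and-wait: a sequence number per datagram, an acknowledgment from the receiver, and sender-side retransmission. All names here are illustrative, and the lossy channel is simulated in-process so the sketch needs no real sockets.

```javascript
// Receiver side: deliver each datagram exactly once, in order, and ACK
// the highest in-order sequence number seen so far. Duplicates are
// re-ACKed but not re-delivered.
class Receiver {
  constructor() { this.expected = 0; this.delivered = []; }
  onDatagram({ seq, payload }) {
    if (seq === this.expected) {
      this.delivered.push(payload);
      this.expected++;
    }
    return { ack: this.expected - 1 };
  }
}

// Sender side: retransmit each datagram until it is acknowledged, up to
// maxRetries attempts. sendOnce models one unreliable UDP send and may
// return null to simulate a lost datagram or lost ACK.
function sendReliably(payloads, receiver, sendOnce, maxRetries = 5) {
  payloads.forEach((payload, seq) => {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      const reply = sendOnce({ seq, payload }, receiver);
      if (reply && reply.ack === seq) return;   // acknowledged, next datagram
    }
    throw new Error(`datagram ${seq} lost after ${maxRetries} retries`);
  });
}
```

A real deployment would put timers behind the retransmission loop and carry the datagrams over `dgram` sockets, but the protocol logic above is the whole of what TCP's overhead is being traded away for.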

When you are deciding how to move data from the<br />

“thing’s” local network onto an IP network, you have<br />

several choices. Because the technologies used are<br />

familiar and available from a wide range of sources, you<br />

can link the two networks via a gateway, or you can build<br />

this functionality into the “thing” itself. Many MCUs now<br />

have an Ethernet controller on chip, which makes this an<br />

easier task.<br />

IV. IOT PROTOCOLS<br />

It is possible to build an IoT system with existing web<br />

technologies, even if it is not as efficient as the newer<br />

protocols. HTTP(S) and WebSockets are common standards,<br />

together with XML or JavaScript Object Notation (JSON) in<br />

the payload. When using a standard web browser (HTTP<br />

client), JSON provides an abstraction layer for web developers<br />

to create a stateful web application with a persistent duplex<br />

connection to a web server (HTTP server) by holding two<br />

HTTP connections open.<br />

HTTP<br />

HTTP is the foundation of the client-server model used for the<br />

web. The safest method with which to implement HTTP in<br />

your IoT device is to include only a client, not a server. In<br />

other words, it is safer when the IoT device can initiate<br />

connections to a web server but is not able to receive<br />

connection requests: We don’t want to allow outside machines<br />

to have access to the local network where the IoT devices are<br />

installed.<br />

WebSocket<br />

WebSocket is a protocol that provides full-duplex<br />

communication over a single TCP connection over which<br />

messages can be sent between client and server. It is part of<br />

the HTML 5 specification. The WebSocket standard simplifies<br />

much of the complexity around bidirectional web<br />

communication and connection management.<br />

XMPP<br />

Extensible messaging and presence protocol (XMPP) is a<br />

good example of an existing web technology finding new use<br />

in the IoT space.<br />

XMPP has its roots in instant messaging and presence<br />

information, and has expanded into voice and video calls,<br />

collaboration, lightweight middleware, content syndication,<br />

and generalized routing of XML data. It is a contender for<br />

mass scale management of consumer white goods such as<br />

washers, dryers, refrigerators and so on.<br />

XMPP strengths are its addressing, security and scalability.<br />

This makes it ideal for consumer-oriented IoT applications.<br />

HTTP, WebSocket and XMPP are examples of technologies<br />

being pressed into service for IoT. Other groups are also<br />



working furiously to develop solutions for the new challenges<br />

IoT is presenting us.<br />

Wannabe Generic Protocols<br />

Many IoT experts refer to IoT devices as constrained systems,<br />

because they believe IoT devices should be as inexpensive as<br />

possible and use the smallest MCUs available, while still<br />

running a communication stack.<br />

Table 1. Constrained systems standardization work<br />

Currently, adapting the internet for the IoT is one of the main<br />
priorities for many of the global standardization bodies. Table<br />
1 contains a short summary of the current activities.<br />

If your system does not require the features of TCP, and can<br />
function with the more limited UDP capabilities, removing the<br />
TCP module significantly helps reduce the size of the total<br />
code footprint of your product. This is what 6LoWPAN (for<br />
WSN) and CoAP (light internet protocol) bring to the IoT<br />
universe.<br />

CoAP<br />

Although the web infrastructure is available and usable for IoT<br />
devices, it is too heavy for most IoT applications. In July<br />
2013, IETF released the constrained application protocol<br />
(CoAP) for use with low-power and lossy (constrained) nodes<br />
and networks (LLNs). CoAP, like HTTP, is a RESTful<br />
protocol.<br />

It is semantically aligned with HTTP, and even has a one-to-one<br />
mapping to and from HTTP. Network devices are<br />
constrained by smaller microcontrollers with small quantities<br />
of flash memory and RAM, while the constraints on local<br />
networks such as 6LoWPAN are due to high packet error rates<br />
and a low throughput (tens of kilobits per second). CoAP can<br />
be a good protocol for devices operating on battery or energy<br />
harvesting.<br />

Features of CoAP:<br />

• Because CoAP uses UDP, some of the TCP<br />
functionalities are replicated directly in CoAP. For example,<br />
CoAP distinguishes between confirmable (requiring an<br />
acknowledgment) and nonconfirmable messages.<br />

• Requests and responses are exchanged<br />
asynchronously over CoAP messages (unlike HTTP, where an<br />
existing TCP connection is used).<br />

• All the headers, methods and status codes are binary<br />
encoded, which reduces the protocol overhead. However, this<br />
requires the use of a protocol analyzer to troubleshoot network<br />
issues.<br />

• Unlike HTTP, the ability to cache CoAP responses<br />
does not depend on the request method, but on the response<br />
code.<br />

CoAP fully addresses the need for an extremely light protocol<br />
exhibiting a behavior similar to a permanent connection. It has<br />
semantic familiarity with HTTP and is RESTful (resources,<br />
resource identifiers and manipulation of those resources via a<br />
uniform application programming interface (API)). If you<br />
have a web background, using CoAP is relatively easy.<br />
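The binary header encoding mentioned above can be illustrated with the fixed 4-byte CoAP header from RFC 7252. Options, token bytes, and payload are omitted; this is a sketch, not a full CoAP implementation.

```javascript
// Encoder for the fixed 4-byte CoAP header (RFC 7252): version and type
// share the first byte with the token length, the second byte is the
// request/response code, and the last two bytes are the message ID.
// Nothing here is human-readable text, which is why a protocol analyzer
// is needed for troubleshooting.
const COAP_VERSION = 1;
const TYPE = { CON: 0, NON: 1, ACK: 2, RST: 3 };   // confirmable vs. nonconfirmable etc.

function encodeCoapHeader({ type, tokenLength = 0, codeClass, codeDetail, messageId }) {
  const buf = new Uint8Array(4);
  buf[0] = (COAP_VERSION << 6) | (type << 4) | (tokenLength & 0x0f);
  buf[1] = (codeClass << 5) | (codeDetail & 0x1f);   // e.g. 0.01 = GET
  buf[2] = (messageId >> 8) & 0xff;                  // message ID, big-endian
  buf[3] = messageId & 0xff;
  return buf;
}
```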

MQTT<br />

MQ telemetry transport (MQTT) is an open source protocol<br />

that was developed and optimized for constrained devices and<br />

low-bandwidth, high-latency or unreliable networks. It is a<br />

publish/subscribe messaging transport that is extremely<br />

lightweight and ideal for connecting small devices to networks<br />

with minimal bandwidth. MQTT is bandwidth efficient, data<br />

agnostic and has continuous session awareness, as it uses TCP.<br />

It is intended to minimize device resource requirements while<br />

also attempting to ensure reliability and some degree of<br />

assurance of delivery with grades of service.<br />

MQTT targets large networks of small devices that need to be<br />

monitored or controlled from a back-end server on the<br />

internet. It is not designed for device-to-device transfer.<br />

Neither is it designed to “multicast” data to many receivers.<br />

MQTT is simple, offering few control options. Applications<br />

using MQTT are generally slow, in the sense that the<br />

definition of “real time” in this case is typically measured in<br />

seconds.<br />
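MQTT's publish/subscribe model rests on topic filters, where `+` matches exactly one topic level and `#` matches all remaining levels. The matcher below is a simplified sketch; a real broker per MQTT 3.1.1 adds rules for `$`-prefixed topics and filter validation.

```javascript
// Simplified MQTT topic-filter matching: topics are '/'-separated levels,
// '+' matches any single level, '#' matches the rest of the topic.
function topicMatches(filter, topic) {
  const f = filter.split('/');
  const t = topic.split('/');
  for (let i = 0; i < f.length; i++) {
    if (f[i] === '#') return true;                 // multi-level wildcard
    if (i >= t.length) return false;               // topic ran out of levels
    if (f[i] !== '+' && f[i] !== t[i]) return false;
  }
  return f.length === t.length;                    // no trailing topic levels
}
```

A broker evaluates every subscription's filter against each published topic this way, which is what lets one sensor publication fan out to many monitoring back-ends without the sensor knowing about any of them.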

MQTT vs. CoAP<br />
MQTT publish/subscribe scales well, and MQTT has<br />
demonstrated the advantages of this architecture. The latest<br />
IETF CoAP RFCs have introduced publish/subscribe support<br />
for CoAP.<br />
The lightweight CoAP payload is well-suited for wireless sensor<br />
networks, and MQTT-SN has taken that idea and reproduced it.<br />

So, the two main IoT dedicated protocols are borrowing ideas<br />

from each other. Will these two protocols remain mainstream?<br />

We believe so, for at least five to 10 years.<br />

V. COMPARISON OF POTENTIAL IOT PROTOCOLS<br />

Cisco is at the heart of the internet; its IP equipment is<br />

everywhere. Cisco is now actively participating in the<br />

evolution of IoT. It sees the potential for connecting physical<br />

objects, getting data from our environment and processing this<br />

data to improve our living standards.<br />

Table 2 is drawn from Cisco’s work in IoT standards.<br />

Figure 2. Comparison of web and IoT protocols<br />

Table 2. Beyond MQTT: A Cisco View on IoT Protocols by Paul<br />

Duffy, April 30, 2013<br />

These internet-specific IoT protocols have been developed to<br />

meet the requirements of devices with small amounts of<br />

memory, and networks with low bandwidth and high latency.<br />

Figure 2 provides another good summary of the performance<br />

benefit that these protocols bring to IoT. The source is Zach<br />

Shelby in his presentation “Standards Drive the Internet of<br />

Things.”<br />

VI. CONCLUSION<br />

Connecting sensors and objects opens up an entirely new<br />

world of possible use cases—and it’s precisely those use cases<br />

that will determine when to use the right protocols for the right<br />

applications.<br />

The high-level positioning for each of these protocols is<br />

similar. Apart from HTTP, all these protocols are positioned<br />

as real-time publish/subscribe IoT protocols with support for<br />

millions of devices. Depending on how you define “real time”<br />

(seconds, milliseconds or microseconds) and “things” (WSN<br />

node, multimedia device, personal wearable device, medical<br />

scanner, engine control, etc.) the protocol selection for your<br />

product is critical. Fundamentally, these protocols are very<br />

different.<br />

Today, the web runs on hundreds of protocols. The IoT will<br />

support hundreds more. What you need to do when designing<br />

your system is to define the system requirements very<br />

precisely, and chose the right protocol set to address these<br />

requirements.<br />

The internet protocol is a carrier; it can encapsulate just as<br />

many protocols for the IoT as it does today for the web. Many<br />

industry pundits are asking for protocol standardization. But if<br />

there are so many protocols for the web, why wouldn’t there<br />

be just as many for the IoT? You choose the protocols that<br />

meet your requirements. The only difference is that the IoT<br />

protocols are still young and must demonstrate their reliability.<br />

Remember that when the internet became a reality, IP version<br />

4 was what made it possible. We are now massively deploying<br />

IP version 6, and IoT is the killer application that<br />

telecommunication carriers have been waiting for to justify the<br />

investment required.<br />



Predictive maintenance using a fully compound<br />
material-integrated measuring system<br />

Sven Grunwald, Andy Batzdorf, Steffen Kutter, Bernard Bäker<br />

Chair in Automotive Mechatronics, Dresden University of Technology<br />

George-Baehr-Straße 1C, Germany<br />

Sven.Grunwald@tu-dresden.de, Andy.Batzdorf@tu-dresden.de, Steffen.Kutter@tu-dresden.de, Bernard.Baeker@tu-dresden.de<br />

Abstract— This paper presents the integration of a<br />
measurement system which enables Internet of Things (IoT)<br />
driven predictive maintenance in an industrial environment<br />
without the need for externally mounted sensors. The main<br />
methodology for predictive wear detection is illustrated using the<br />
example of an electromagnetically operated spring-pressure brake.<br />
First real-world application results are shown within this paper.<br />
Furthermore, an outlook on the resulting possibilities of a future<br />
industrial IoT application with a focus on low cost is provided.<br />

For first practical application, a braking rotor with a<br />

corresponding measuring system has been designed. The<br />

measurement system consists of a low-power ARM-based<br />
processor with an integrated wireless interface and additional<br />
MEMS-based sensors. This system is integrated in the friction material<br />

itself by using the hot-pressing method. With the help of this<br />

method an encapsulated sensor and measurement system can be<br />

fabricated turning the conventional brake disc into a rotating<br />

Industrial IoT (IIoT) sensor node which is powered wirelessly over<br />

a resonant inductive link.<br />

Keywords—Industrial IoT, Encapsulated sensor platform,<br />
Rotating sensor node, Electromagnetic safety brake,<br />
Fiber-reinforced material<br />

I. MOTIVATION AND BACKGROUND<br />

The digital network capability of industrial machines enables<br />
the monitoring of the entire plant condition and gives the<br />
possibility to combine this information with additional sensors<br />
installed in the machine. In the predicted expansion stage, such<br />
as Factory 4.0, these machines can communicate via networks<br />
over the Internet. The IoT is also moving into the industrial<br />
sector, with networks that can collect data to monitor and<br />
control production lines, inventories and energy consumption<br />
to ensure sustainable and reliable production [1]. As part of the<br />
Industrial IoT (IIoT), these models are based on nodes such as<br />
thermostats or optical sensors at the edges of the network,<br />
which receive and send data that a central system can analyze<br />
and respond to, for example, to optimize a manufacturing<br />
process.<br />

Today, sensors are typically mounted on the outside of<br />
equipment where access to the device is easier. But another<br />
conceivable approach would be to embed the sensors in the<br />
actual moving parts of the machine, usable during manufacture<br />
or maintenance. The data collected by an embedded sensor<br />
would provide much more detailed information about what is<br />
happening in the machine, in real time, which would<br />
dramatically improve process control and machine<br />
maintenance. With today’s compact, energy-saving and<br />
inexpensive wireless sensors, it is already possible to integrate<br />
such a device into the fabric of a machine and to expect it to<br />
transmit reliable information over months or even years. The<br />
feasibility was investigated in a research project at the<br />
Technische Universität Dresden (TUD).<br />

Fig. 1 Smart rotor concept suitable for industrial electromagnetic safety brakes<br />

The focus was to encapsulate the sensors and the<br />

microprocessor into the brake disc as part of the<br />

electromagnetic safety brake. Those elements are critical for<br />

safety relevant applications in industrial equipment. One<br />

example is in elevators, where the discs are used to control the<br />

speed of the ascending car, protect against unintended<br />

movement, and maintain the car’s position when it stops at each<br />

floor.<br />

Measuring specific system parameters, such as vibration or<br />

wear, traditionally requires the addition of expensive hardware,<br />

such as torque shafts. The usage of embedded wireless sensors<br />

can overcome these disadvantages. The disc is manufactured<br />

out of phenolic-resin-based composite material which is far<br />

superior to conventional, metallic materials in terms of load<br />

www.embedded-world.eu<br />

160


resistance and wear [2]. The following picture points out the basic setup of the braking system within an industrial environment.<br />

Fig. 2 Conventional brake disc within an electromagnetic safety brake (figure labels: Self-Aligning Coupling, Anchor plate, Conventional Rotor, Electromagnetical Safety Brake)<br />

II. PRELIMINARY RESEARCH<br />

The core idea in this project was to integrate the components necessary for monitoring an industrial device into the actual industrial component itself. Before the integration process is started with a highly integrated system equipped with these sensors, however, it is important to point out that no test data is available to verify the components with respect to the hot-pressure method, and to validate the sensor data as well as the entire system behavior for defects after the integration step. Generally speaking, a System on Chip (SoC) is a<br />

complex device consisting of analog and digital circuit<br />

elements that interact on a single silicon chip. A complete<br />

structural test of digital components within the integrated<br />

circuit (IC) is not possible at this point, because on the one hand<br />

a verified Verilog or VHDL netlist of the entire IC is not<br />

available to the customer and on the other hand the accessibility<br />

of the IC after integration into the material is not given.<br />

Fig. 4 Technology carrier Smart-rotor<br />

The sensors include devices for measuring acceleration in all<br />

three axes, three-axis gyroscopes for measuring the angle of<br />

rotation, magnetic field sensors and classic temperature sensors.<br />

These sensors are well suited for the task, as they are cost-effective, small and highly integrated in a compact housing, and<br />

thus provide an ideal choice for integration and measurement<br />

directly from the material.<br />

The data can be recorded from inside the raw braking material<br />

and compressed after the actual measurement and sent to a host<br />

system, e. g. a wireless router, for further processing within the<br />

network. The system properties are highly customizable by the user; e.g. changing the sensor fusion or the filter implementation is still possible even after the integration process.<br />

(Figure labels: Electric engine, Wireless Power Transmitter, Self-Aligning Coupling, Smart-Rotor, Anchor plate, Electromechanical Braking System)<br />

Fig. 3 Oscillation of the analog test structure to validate integration<br />

In order to make an initial assessment of the extent to which it<br />

is possible to integrate an electronic system into fiber composite<br />

material without damaging the electronics, simple and<br />

manageable digital and analog elements (figure 3) were used<br />

for experiments [3], [4]. By using these elements, parasitic<br />

effects caused by the fiber-reinforced material itself can be<br />

taken into account. Using the described method of error-model-driven test structures, the usable component sizes of the passive and active elements were identified. These results are further<br />

documented in an earlier work [6], [7].<br />

Fig. 5 Electric drive equipped with the Smart-rotor<br />

This customizability is achieved by the implemented over-the-air (OTA) update capability. Because the SoC supports Bluetooth 4.2,<br />

it is possible to equip the embedded system with an IP address<br />

via which a remote connection to the embedded system can be<br />

established.<br />

This enables access to the Smart Rotor (figures 4, 5) via the Internet, which allows remote diagnosis of the brake system without the need for a technician to check the system and stop<br />

the machine. Knowing the optimum component sizes and<br />

materials that can withstand the hot-pressure process, it was<br />

possible to successfully integrate the embedded system itself.<br />

The following picture shows parts of the system inside the<br />

fiber-reinforced material after the manufacturing process. This<br />

also includes the post-curing process required for the fiber-reinforced material.<br />



III. MEASUREMENT AND PERFORMANCE RESULTS<br />

The whole integration process is achieved with the hot-pressure method for fiber-reinforced materials. This<br />

methodology is a well-known manufacturing process for this<br />

kind of material. Nevertheless, this method stresses the<br />

components because of the high temperature and the high<br />

pressure within the actual fabrication. Therefore, it is necessary<br />

to analyze the system and sensor behavior after the actual<br />

integration took place.<br />

It is not sufficient merely to integrate the components free of defects; the quality of the sensor signals and the overall system behavior must also be measured and tested. This approach is necessary because MEMS-based sensors are very sensitive to mechanical stress. Even the mounting on the printed circuit board plays an important role. The mechanical and thermal stress to which the elements are exposed can cause a performance degradation leading to insufficient measurements in a future application.<br />

The examined results showed nearly no degradation in<br />

the signal quality attributed to the manufacturing process [8].<br />

One method to analyze a MEMS-based inertial sensor is the<br />

Allan variance, introduced by David W. Allan to measure the<br />

frequency stability in oscillators [10]. This method can be<br />

adopted to characterize MEMS-based sensors and to analyze a<br />

sequence of data in the time domain [9]. In the present work,<br />

the calculation rules for the evaluation of a MEMS-based sensor<br />

on the basis of the Allan variance and their graphical contexts<br />

were applied to determine and quantify the different noise terms<br />

that exist in inertial sensor data.<br />

In general, the Allan variance analysis of a signal in the time<br />

domain consists of computing its Allan deviation as a function<br />

of different averaging times and then analyzing the<br />

characteristic regions and log-log scale slopes of the Allan<br />

deviation curves to identify the different noise modes [8]. The<br />

major noise relevant terms within this example are the<br />

Acceleration-Random-Walk (1), the Bias instability (2) and the<br />

Rate-Random-Walk (3).<br />

σ_ARW = σ(τ₀) ⋅ √τ₀   (1)<br />

σ_Bias = σ(τ₁) ⋅ √(π / (2 ⋅ ln(2)))   (2)<br />
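The Allan deviation computation described in this section can be sketched in a few lines. The following is an illustrative, non-overlapping implementation (not the authors' code), assuming raw sensor samples at a fixed sampling rate:<br />

```python
import numpy as np

def allan_deviation(samples, fs, cluster_sizes):
    """Non-overlapping Allan deviation of a raw sample stream.

    samples: 1-D array of sensor readings
    fs: sampling rate in Hz
    cluster_sizes: cluster lengths m; the averaging time is tau = m / fs
    """
    taus, adevs = [], []
    for m in cluster_sizes:
        n_clusters = len(samples) // m
        if n_clusters < 2:
            continue  # not enough data for this averaging time
        # average each cluster of m consecutive samples
        means = samples[: n_clusters * m].reshape(n_clusters, m).mean(axis=1)
        # Allan variance: half the mean squared difference of successive cluster means
        avar = 0.5 * np.mean(np.diff(means) ** 2)
        taus.append(m / fs)
        adevs.append(float(np.sqrt(avar)))
    return taus, adevs
```

Plotting the deviations over the averaging times on a log-log scale yields curves like those in figures 6 to 8; a slope of −1/2 marks the random-walk region.<br />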

Therefore, the signal quality needs to be evaluated. The plots show that the integration process leads to a slight degradation, especially in the z-axis, due to the hot-pressure integration process. But the sensor readings are still usable and produce valid values, which can further be used for measurement tasks. Those measurement tasks will be introduced in the following.<br />

Fig. 6 Allan deviation plot of the X-axis (σx(τ) in |g| over τ in s, before and after the integration)<br />

Fig. 7 Allan deviation plot of the Y-axis (σy(τ) in |g| over τ in s, before and after the integration)<br />

σ_RRW = σ(τ₂) ⋅ √(3 / τ₂)   (3)<br />
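Equations (1) to (3) translate directly into code. A small sketch follows; the σ readings and τ values in the example below are hypothetical, chosen only to illustrate the calculation:<br />

```python
import math

def noise_terms(sigma_tau0, tau0, sigma_tau1, sigma_tau2, tau2):
    """Noise coefficients from Allan deviation readings per eqs. (1)-(3).

    sigma_tau0 is read in the -1/2-slope region at tau0, sigma_tau1 at the
    flat minimum tau1, and sigma_tau2 in the +1/2-slope region at tau2.
    """
    arw = sigma_tau0 * math.sqrt(tau0)                          # eq. (1): Acceleration-Random-Walk
    bias = sigma_tau1 * math.sqrt(math.pi / (2 * math.log(2)))  # eq. (2): Bias instability
    rrw = sigma_tau2 * math.sqrt(3 / tau2)                      # eq. (3): Rate-Random-Walk
    return arw, bias, rrw
```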

The figures point out a degradation regarding the noise-relevant terms after the integration step as a sensor quality indicator. A defective sensor behavior could be identified via this method. In the worst case, this could introduce problems and wrong readings concerning the sensor resolution in a further, more demanding application.<br />

Fig. 8 Allan deviation plot of the Z-axis (σz(τ) in |g| over τ in s, before and after the integration)<br />

The relevant monitoring parameters of industrial braking<br />

systems are the braking torque over lifetime and also<br />

parameters such as temperature, speed and critical failure<br />

effects, e.g. a broken power supply, which are recorded by the<br />


device. Using the embedded system encapsulated in the fiber-reinforced material, the following measurements were carried out with a custom test bench to simulate degrading effects and long-term damage to the system. Various case studies were carried out with the help of the test stand.<br />

For this purpose, use cases were defined which allow the<br />

methodology to be applied to the field of electromagnetic safety<br />

brakes and industrial electric drives. Simple scenarios such as<br />

the detection of the rotation direction and position of the rotor<br />

were considered and implemented with the help of MEMS-based sensors. Thus, it is possible to detect and quantify the direction of rotation, the angular speed and the position of the rotor without any external sensors by evaluating (4) and (5), with a_r being the radial acceleration, r_a the distance between the sensor and the center of the printed circuit board, and n the rotational speed.<br />

Exceeding or falling below the limits can be detected; when a limit value is reached, a warning about the wear of the rotor-hub connection can be generated. Therefore, it is possible to detect the defect very early while the machine is running.<br />

a_r = (2 ⋅ π ⋅ n)² ⋅ r_a   (4)<br />

n = (1 / (2 ⋅ π)) ⋅ √(a_r / r_a)   (5)<br />

Fig. 9 Smart-rotor based detection of a good rotor hub connection<br />
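Equations (4) and (5) form a simple round trip. A sketch follows; the sensor radius of 50 mm in the usage example is an assumed value, not taken from the paper:<br />

```python
import math

def radial_acceleration(n, r_a):
    """Eq. (4): radial acceleration at speed n (revolutions per second) and radius r_a."""
    return (2 * math.pi * n) ** 2 * r_a

def rotational_speed(a_r, r_a):
    """Eq. (5): speed recovered from the measured radial acceleration."""
    return math.sqrt(a_r / r_a) / (2 * math.pi)
```

For example, rotational_speed(radial_acceleration(10.0, 0.05), 0.05) recovers the assumed 10 rev/s.<br />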

In addition, much more demanding tasks, such as the state<br />

monitoring of the rotor-hub connection, were also considered.<br />

Two of those representative examples and the results will be<br />

presented in the following using only the brake disc equipped<br />

with sensors and microcontroller.<br />

A. Connection between the rotor and the hub<br />

If the braking system is used as an active deceleration component, the rotor-hub connection can deflect due to the transmission of the mechanical torque. According to the current state of technology, the rotors are replaced upon reaching the limit for the air gap; a critical condition within the rotor-hub connection cannot yet be detected.<br />

If the connection fails, the drive shafts can no longer be stopped<br />

by the brake which could result in major damage.<br />

a_t = r_a ⋅ dω/dt   (6)<br />

The radial acceleration a_r, shown as the x-component in the following figure, indicates the loose connection as a shift in the radial direction. The tangential acceleration a_t, the y-component, can be calculated according to (6). It indicates the change of the angular velocity over time and obviously depends on the distance of the sensor from the center.<br />
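Equation (6) can be approximated from sampled angular-velocity data with a finite difference. A minimal sketch, assuming uniformly timestamped ω samples (the function name and sample values are illustrative):<br />

```python
def tangential_acceleration(omega, t, r_a):
    """Eq. (6): a_t = r_a * d(omega)/dt, via finite differences over sampled omega(t)."""
    return [
        r_a * (omega[i + 1] - omega[i]) / (t[i + 1] - t[i])
        for i in range(len(omega) - 1)
    ]
```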

Noticeable for an occurring defect are the peaks in the acceleration values, marked with arrows in Figure 10. The<br />

loose connection during the rotation leads to a radial<br />

displacement, which causes a measurable acceleration as a<br />

peak.<br />

For the detection of this defect, an upper and a lower limit can be taken into account and a tolerance band for the expected acceleration values can be defined. A simple counter then registers how often the measured values leave this band.<br />
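The tolerance-band check with a counter can be sketched as follows; the band limits in the example are hypothetical values, not taken from the measurement:<br />

```python
def count_band_violations(accel_samples, lower, upper):
    """Count acceleration samples outside the tolerance band.

    A growing count over time indicates increasing wear of the rotor-hub
    connection, so a warning threshold can be applied to the counter.
    """
    return sum(1 for a in accel_samples if a < lower or a > upper)
```

Applied to an acceleration trace, this would count peaks like those marked in Figure 10.<br />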

Fig. 10 Smart-rotor based detection of a defective rotor hub connection<br />

The greater the wear occurring, the greater the number of peaks.<br />

The measured clearance between the rotor and the brake disc<br />

was around 0.1 mm during the measurement. With the<br />

presented method, the connection between the rotor and the<br />

attached hub could be monitored on the shaft by using the<br />

smart-rotor. For this safety-critical component, this option is a great added value, and the necessary form fit for optimal deceleration can be monitored during main operation.<br />

B. Status Detection of the brake system<br />

In addition to the speed measurement, the state of the anchor plate is already detectable using state-of-the-art external sensors. So far, only the two states, brake closed or brake released, can be detected, e.g. by using micro buttons. A disadvantage is the wear of the mechanical components inside the button; furthermore, the button itself must be adjusted very accurately to correctly detect the end position of the anchor plate. In comparison, the advantage of the<br />

condition detection via the measuring rotor with a<br />

magnetometer is shown in the following figure. By<br />

continuously evaluating the measured magnetic flux density in<br />

the z-direction B z, it is possible to make further statements<br />

about the spring-applied brake system in addition to the status<br />

determination.<br />



For the measurement, the mounting direction of the rotor was<br />

chosen so that only positive values are recorded. Pictured is the<br />

z-component of the magnetometer for the single release of the<br />

brake over a period of 12 seconds under the influence of<br />

different supply voltages.<br />

This variation of the voltage is intended to simulate a fault<br />

occurring in the energy supply of the brake. The curve for the<br />

rated voltage of 24 V is used as a reference. When opening the<br />

brake, an increase in the magnetic flux density is expected. The<br />

coil in the magnet housing of the spring-pressure brake builds<br />

up a magnetic field when connected to a power supply, which can be detected. After switching off, the field degrades again.<br />

Fig. 11 Smart-rotor based detection of a broken power supply<br />

Fig. 12 Smart-rotor based detection of a jammed anchor plate<br />

For a reliable state determination, the evaluation of the z-component of the magnetometer must satisfy the condition 400 µT < B_z < 500 µT in this case. The nominal<br />

voltage of 24 V was used as a reference for this purpose. The<br />

400 µT result from releasing the brake against the preload force<br />

of the spring. With the upper limit of 500 µT, the jamming of<br />

the brake disc is taken into account so that the digital button<br />

does not generate a wrong status. If the detected flux density is<br />

below the limit of 400 µT, there must be a fault in the power<br />

supply.<br />

A jammed anchor plate can be detected by exceeding the limit.<br />
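The window comparison described above maps to a three-way classification. A minimal sketch using the limits from the text (400 µT and 500 µT); the function and state names are illustrative:<br />

```python
def brake_state(b_z_uT):
    """Classify the brake state from the magnetometer z-component in microtesla.

    Limits per the text: below 400 uT the coil field is missing (power-supply
    fault), above 500 uT the anchor plate is jammed, in between the brake is
    properly released against the spring preload.
    """
    if b_z_uT < 400:
        return "power supply fault"
    if b_z_uT > 500:
        return "anchor plate jammed"
    return "brake released"
```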

In direct comparison to the prior techniques, this method has<br />

the advantage of monitoring the coil and the power supply of<br />

the spring pressure brake in addition to the state detection. If an<br />

error occurs, the cause can be identified. With a simple<br />

pushbutton, it is only possible to check in conjunction with a<br />

control whether the desired control pulse has been converted by<br />

the brake.<br />

By evaluating the absolute values of the magnetic flux density,<br />

this system can determine the cause of the fault. A distinction can be made between a short circuit or a defect in the coil and, additionally, the mechanical clamping of the anchor plate. Using this methodology, a fault tree analysis could be derived that supports the user while diagnosing the system.<br />

IV. CONCLUSION AND OUTLOOK<br />

The presented work shows the successful integration of<br />

microelectronic circuitry into a fiber-reinforced material with<br />

the help of the hot-pressure manufacturing method. Selected measurement tasks for predictive maintenance were presented using the wirelessly powered brake rotor. It should be mentioned that the presented measurement tasks are only an excerpt and are tailored to the area of industrial brake systems.<br />

By using and installing them in a machine, the repair times and<br />

maintenance costs could be optimized instead of constantly<br />

struggling with unplanned repairs of failed machines.<br />

As a vision, companies can reduce costs through scheduled<br />

maintenance when wear and tear is reported by wireless<br />

sensors [11]. In addition, cloud servers could use sensor data in<br />

algorithms to predict high and low demand periods, and the<br />

operation of the system could be adapted to save energy. The<br />

research shows these models can be refined by moving the<br />

sensors into the very heart of the machine, accelerating the<br />

progress of the smart factory.<br />

With cost-effective, wireless SoCs and sensors embedded<br />

directly in components and connected to the cloud via a mesh<br />

network, high-performance servers can, for example, determine<br />

the location of underutilized devices and identify bottlenecks in<br />

processes to adjust process speed accordingly.<br />

This vision will also evolve within the automotive sector for<br />

upcoming important applications turning conventional<br />

automotive parts like wheels into "smart" devices with<br />

integrated sensors - enabling real-time monitoring of proper<br />

functionality - a basic requirement for getting safety relevant<br />

“conventional” hardware of highly automated driving vehicles<br />

compliant with the requirements of functional safety according to<br />

ISO 26262.<br />

Annotation<br />

The work presented in the context of this paper is funded by the<br />

Central Innovation Program for Small and Medium-Sized<br />

Businesses (ZIM).<br />



V. REFERENCES<br />

[1] Caroline Hayes, “IoT extends its reach into the industrial landscape”,<br />

Nordic Semiconductor ULP WQ Summer 2017 page 20.<br />

[2] Gehard: „Verzahnung bis 25 Millionen Lastzyklen verschleißfrei“, Iris<br />

Gehard, freie Journalistin in München, im Auftrag der Rex Industrie-<br />

Produkte Graf von Rex GmbH, Vellberg. Konstruktion 3-2013, s.l.:<br />

Springer VDI Verlag.<br />

[3] H.J. Wunderlich: „Models in Hardware Testing“, Springer Netherlands,<br />

2010.<br />

[4] G. Huertas Sánchez, D. Vázquez García de la Vega, A. Rueda Rueda, J.<br />

Huertas Díaz: “Oscillation-Based Test in Mixed-Signal Circuits”,<br />

Springer Netherlands, 2006.<br />

[5] M.L. Bushnell and V.D. Agrawal: "Essentials of Electronic Testing for Digital, Memory and Mixed-Signal VLSI Circuits", Springer, New York, 2010.<br />

[6] S. Grunwald, B. Bäker: „Integration eines Mess- und Sensorsystems in<br />

einen Integralrotor für elektrische Antriebe unter Verwendung des<br />

Heißpressverfahrens“, 12. VDI/VDE Mechatronik-Tagung, Dresden, 09.-<br />

10. March 2017.<br />

[7] S. Grunwald, B. Bäker: „Integrated measurement units and sensor<br />

systems for harsh industrial applications“, AMA Sensor Conference 2017,<br />

Nuremberg, 30.5.2017 – 1.6.2017.<br />

[8] N. El-Sheimy, H. Hou, X. Niu: „Analysis and modeling of inertial sensors<br />

using allan variance”. IEEE Trans. Instrum. Meas. 2008, 57, 140–149.<br />

[9] A. A. Hussen, I. N. Jleta: „Low-Cost Inertial Sensors Modeling Using<br />

Allan Variance”, World Academy of Science, Engineering and<br />

Technology International Journal of Computer, Electrical, Automation,<br />

Control and Information Engineering Vol: 9, No: 5, 2015.<br />

[10] D. W. Allan: „Statistics of atomic frequency standards”, Proceedings of<br />

the IEEE, vol. 54, no. 2, pp. 221–230, 1966<br />

[11] PricewaterhouseCoopers (PwC), “Industry 4.0: Building the digital<br />

enterprise”, Global Industry 4.0 Survey 2016.<br />



IoT Integration in Machines and Production Facilities<br />

Made Simple!<br />

Dipl.-Ing. (FH) Robert Schachner<br />

CEO<br />

RST Industrie Automation GmbH<br />

Ottobrunn-Riemerling, Germany<br />

r.schachner@rst-automation.com<br />

Abstract— Companies are beginning to retrofit their<br />

production sites to meet the requirements of the future. But still<br />

there are more questions than answers. How does a cloud work,<br />

what does service oriented mean? Are big players always the go-to<br />

solution or do we need SMEs? This presentation intends to<br />

encourage SME machine builders and software providers to<br />

broach the subject and get involved.<br />

Keywords—middleware; production; machine control; IoT;<br />

communication; integration; PLC; digitalization<br />

I. INTRODUCTION<br />

The realization of machines and production facilities in<br />

accordance with methods described in the Reference<br />

Architecture Model Industry 4.0 (RAMI4.0) or the parallel IoT<br />

development towards Industrial Internet Reference Architecture<br />

(IIRA) is currently only actively pursued by a small number of<br />

companies. In a lot of cases, a simple lack of know-how prevents<br />

a successful digitalization.<br />

But who has the necessary skills to implement this kind of<br />

project? The obvious answer would be a big player like IBM<br />

with Watson or Siemens and their Mindsphere program. And<br />

yes, the global players are crucial on the Office Floor level. But<br />

the most important players are the facility owners themselves<br />

with their service personnel and IT departments. They know the<br />

often decades old established (manual) procedures and methods<br />

by heart.<br />

But there is also a need for a third player: small and medium-sized businesses that are tightly networked (like in the German<br />

trade organization Embedded4You e.V.) who, in close<br />

cooperation with the owners, are able to integrate old and new<br />

requirements and processes machine by machine, service by<br />

service, on the Shop Floor.<br />

Combining all three creates the development team that is<br />

needed for a successful implementation.<br />

II. REFERENCE ARCHITECTURE MODEL INDUSTRY 4.0<br />

(RAMI4.0)<br />

Before delving deeper into implementation itself, it is useful<br />

to have a look at the Reference Architecture Model Industry 4.0<br />

as described in DIN SPEC 91345 in order to define the basic<br />

terms and establish a common understanding of reference<br />

models and architectures.<br />

Fig. 1. Participants in the Digitalization of Production Facilities<br />

Fig. 2. RAMI4.0 – Basic Structure of a Modern Production Facility<br />

As the structure diagram shows, at the center of production<br />

is not a (possibly internet based) cloud but two network tiers<br />

spanning a common semantic model between their endpoints.<br />

The so called Enterprise Network, or Office Floor, comprises<br />

the business processes that indirectly also control production<br />

through order management. This is where the main focus is on<br />

the big players like IBM or SAP. When digitalizing existing<br />

production facilities, it is best to start by modernizing the big<br />

player technology that's already in place. We're not going to<br />

linger on this subject as our focus is on successfully<br />

implementing the Real Time Network, or Shop Floor, which is<br />

usually the harder part of modernization.<br />



When digitalizing a factory, a consistent administration shell<br />

has to emerge that is able to integrate the countless different<br />

controls of assets from different manufacturers, often originating<br />

from different technological ages. This necessitates both a<br />

flexible middleware platform and an innovative SME partner<br />

capable of developing that AAS with its respective Shop Floor<br />

services and then integrating the machines one by one.<br />

Fig. 3. Reference Architecture Model Industry 4.0 (RAMI4.0)<br />

The layer model shown above is probably the most well<br />

known representation of the reference architecture, as it<br />

classifies all major parts of a production system, arranges them<br />

on three axes and assigns the relevant terms.<br />

The layer axis classifies the layers of a production system in<br />

more detail, starting with the machines or assets on the Shop<br />

Floor all through the business layer on the Office Floor. The<br />

hierarchy axis on the other hand classifies items ranging from<br />

product to internet.<br />

Of special interest is the life cycle or value stream axis. It<br />

describes procedures starting with receipt of parts all the way<br />

through to the finished product being delivered, or e.g. the sequence of error management. It is interesting to note that error<br />

management is moving towards diagnosing and rectifying<br />

potential errors even before they happen (predictive<br />

maintenance). These procedures are then implemented by<br />

defining appropriate service processes.<br />

Going another layer deeper, we arrive at the so called<br />

"industry 4.0 component". This comprises an often already<br />

present asset, the machine with its existing control and the so<br />

called administration shell (or asset administration shell, or AAS<br />

for short). The AAS surrounds the legacy control and is directly<br />

connected to the Shop Floor communication layer through<br />

service processes.<br />

Fig. 4. Industry 4.0 Component<br />

III. COMMUNICATION METHODS<br />

Now that we have a general overview, let's go another layer<br />

deeper and look at the technologies that are necessary for the<br />

implementation. In most cases control of existing mechanical<br />

assets is handled by PLC control systems as described by IEC<br />

61131-3. They are part of their respective machines and should<br />

generally not be changed due to the enormous effort and loss of<br />

manufacturer support that come with such a change. Here,<br />

process data is cyclically read, processed and output. At the core<br />

of this form of communication is a common memory area<br />

containing all pertinent values. Since these are cyclically<br />

overwritten only the latest values are available at any time.<br />

Fig. 5. Synchronous Communication Model<br />

This model is the optimal implementation for cycle based<br />

and synchronous processes like regulators. On the other hand,<br />

since any data that gets overwritten is lost, its usefulness for<br />

handling status information like error messages or commands is<br />

extremely limited. Nevertheless, we need this model to build a<br />

so called "digital twin" that will be used to mirror data from the<br />

PLC control. More on that later.<br />
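The synchronous process-data model can be pictured as a shared memory area in which each cycle overwrites the previous values. A minimal sketch; the class and method names are illustrative, not from any specific middleware:<br />

```python
class ProcessImage:
    """Cyclically overwritten process data area: only the latest values survive."""

    def __init__(self):
        self._values = {}

    def write_cycle(self, values):
        # each cycle overwrites the previous readings in place; history is lost
        self._values.update(values)

    def read(self, name):
        return self._values.get(name)
```

Writing {"temp": 71.2} and then {"temp": 71.5} leaves only 71.5 readable, which is exactly why status messages and commands need the asynchronous model instead.<br />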

As mentioned, the implementation of flexible and persistent<br />

sequences necessitates the use of another form of<br />

communication. This is ideally solved using serialized<br />

communication of data and commands via brokers that distribute<br />

and organize the messages. This way, any data remains in the<br />

system until fetched by a client. Most middleware based<br />

technologies use this as their only means of communication.<br />

Often they do not support the synchronous process data model<br />

even though it is imperative for cyclical processing and<br />

establishing the digital twin.<br />

Information can now be sent to specific clients through<br />

explicit addressing. Alternatively, messages can be published<br />

to so called topics (semantically addressed virtual channels).<br />

Clients can then subscribe to topics, and any message published<br />

to the subscribed topic will automatically be delivered to them.<br />

Powerful wildcard functions facilitate choosing relevant topics.<br />

Publish / Subscribe is the only communication model that does<br />



not necessitate a direct link between sender and receiver. This is<br />

currently the most important innovation in process<br />

communications since it is the indispensable basis for building a<br />

cloud.<br />
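The publish/subscribe pattern with topic wildcards can be sketched with a tiny in-memory broker. This is illustrative only (real middleware such as MQTT uses `+`/`#` topic wildcards, persistent queues and network transport; here shell-style wildcards stand in), and all topic names are hypothetical:<br />

```python
import fnmatch

class Broker:
    """Minimal publish/subscribe broker: senders and receivers never link directly."""

    def __init__(self):
        self._subscriptions = []  # list of (topic pattern, callback) pairs

    def subscribe(self, pattern, callback):
        self._subscriptions.append((pattern, callback))

    def publish(self, topic, message):
        # deliver to every subscriber whose pattern matches the topic
        for pattern, callback in self._subscriptions:
            if fnmatch.fnmatch(topic, pattern):
                callback(topic, message)
```

A subscription to "shopfloor/press1/*" then receives every message published under that machine's topics, regardless of which client sent it.<br />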

Fig. 6. Asynchronous Message Based Communication<br />

This form of communication isn't slow, either. Control<br />

programs based on "state machines", which are commonly used<br />

in automation systems, can be implemented faster and more<br />

efficiently with messaging services.<br />

The message based model also offers additional quality of<br />

service functions like "last will": a predefined message that is<br />

automatically distributed when a client fails or disconnects from<br />

the network.<br />

A deciding factor in choosing a middleware is the<br />

availability of a multi broker architecture (as illustrated in fig.<br />

7). The clients inside a system can communicate via their own<br />

dedicated broker, which is much faster and independent from<br />

network access. System spanning messages are handled via<br />

broker-to-broker traffic and distributed locally.<br />

Communications become redundant and Industry 4.0<br />

components retain their functionality even if the network goes<br />

down.<br />

Though the Shop Floor is usually not connected to the<br />

internet, one should still take heed to implement adequate<br />

security including certificate based identification and encryption<br />

of transmitted data. It is further recommended to restrict<br />

available communication functions through user roles.<br />

Fig. 7. Multi Broker System with Security Functions<br />

Communication models like the ones described above are<br />

found in numerous products that are collectively called<br />

"middleware". Especially the IoT sphere offers a wide range of<br />

available solutions. Let us compare some exemplary products<br />

and their communications functionality:<br />

Functionality<br />

OPC<br />

UA<br />

Middleware Products<br />

DDS Gamma V MQTT<br />

Asynchronous / Message Based coming yes yes yes<br />

Publish / Subscribe coming yes yes yes<br />

Synchronous / Cyclical yes yes yes no<br />

Real Time Capable coming yes yes yes<br />

Multi Broker no yes yes no<br />

Integrated Security yes yes yes yes<br />

Simple Programming yes no yes yes<br />

IV. SHOP FLOOR SERVICES<br />

Now that we have had a look at the different available<br />

communication strategies and their uses, let's examine how they<br />

work with our Industry 4.0 component.<br />

Fig. 8. Schematic of an Industry4.0 Component Based on Middleware<br />

Architecture<br />

In this example, the PLC stands for the existing control<br />

systems of a machine. It is irrelevant whether we're discussing<br />

designing a new production facility or retrofitting a legacy site<br />

for the digital age. Machines sourced from external suppliers<br />

often come with all kinds of different control systems, interfaces<br />

and field buses. Changing these is not advisable, as doing so risks the loss of<br />
warranty and support (not to mention the effort required).<br />

Therefore we create a digital twin of the PLC control based on<br />

our synchronous communications model. The twin is connected<br />

to the real system through appropriate interfaces (usually field<br />

buses) and process data from the PLC is mirrored in the semantic<br />

process data model. Now all data can be managed and processed<br />

without hindrance and the services of the administration shell<br />

can be connected.<br />
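As a minimal illustration of such a mirror, the sketch below cyclically copies raw PLC values into a semantically named process data model. The field-bus read function, addresses, signal names and scaling factors are hypothetical stand-ins for the real interfaces:

```python
# Minimal digital-twin sketch: cyclically mirror raw PLC process values into a
# semantic data model. read_fieldbus() is a stand-in for the real interface.
def read_fieldbus():
    # In a real system this would poll the PLC via e.g. PROFINET or Modbus.
    return {"DB10.W0": 412, "DB10.W2": 75}

# Hypothetical mapping from raw PLC addresses to semantic names and scaling.
SIGNAL_MAP = {
    "DB10.W0": ("spindle_speed_rpm", 1.0),
    "DB10.W2": ("coolant_temp_c", 0.1),
}

def update_twin(twin: dict) -> dict:
    """One mirror cycle: read raw values and store them under semantic names."""
    raw = read_fieldbus()
    for address, (name, scale) in SIGNAL_MAP.items():
        twin[name] = raw[address] * scale
    return twin

twin = update_twin({})
```

After each cycle the twin holds engineering-unit values under semantic names, and the administration-shell services can operate on it without knowing the PLC's address layout.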



When developing new machines, the same concept can be<br />
used. Instead of a PLC control, a field bus is adapted that<br />
connects I/O signals directly to the process data model. The<br />
control software can then coordinate processes while the overarching<br />
services of the administration shell process the data.<br />

A good middleware should not only provide a diverse range<br />

of tools for control and service development but also for test and<br />

simulation. On-site programming and troubleshooting, which<br />

used to shut down entire production lines until all errors were<br />

eliminated, is no longer a desirable option. Therefore, one<br />

should strive to implement Continuous Integration techniques<br />

with automatic code rollouts.<br />

V. ADMINISTRATION SHELL SERVICES<br />

The services shown in Fig. 8 are utility programs that pool<br />
and process information from one or several machines. In an<br />
Industry 4.0 component, such services form the administration<br />
shell and communicate with services on a higher level through<br />
the resource manager. Thus, information is collected and<br />
coordinated across numerous machines.<br />

An important example for the use of services in a production<br />

facility is error management. A service in the administrative<br />

shell identifies an error based on the available information and<br />

publishes the error state to a predefined topic. Modern<br />

"predictive maintenance" services are able to deduce from the<br />
available information that an error will shortly occur and<br />
publish relevant warnings. Services running on a higher<br />
system level subscribe to these topics and can log the events in a<br />
database or inform a service technician.<br />
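The flow can be illustrated with a deliberately minimal in-process publish/subscribe sketch; the topic name and error record are invented, and a real system would use a broker such as MQTT instead:

```python
from collections import defaultdict

# In-process publish/subscribe sketch standing in for a real message broker.
subscribers = defaultdict(list)

def subscribe(topic, callback):
    subscribers[topic].append(callback)

def publish(topic, message):
    for callback in subscribers[topic]:
        callback(message)

# Higher-level service: log error states (it could equally notify a technician).
error_log = []
subscribe("plant/press1/errors", error_log.append)

# Administration-shell service detects an error state and publishes it.
publish("plant/press1/errors", {"code": "E042", "severity": "warning",
                                "detail": "bearing temperature rising"})
```

The publisher does not know who consumes the error state; adding a database logger or a notification service is just another subscription on the same topic.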

A protocol converter translates data between Office Floor and Shop Floor. This results<br />
in connectivity between arbitrary endpoints and an overarching,<br />
common semantic model.<br />

Fig. 10. Exemplary Overview of a Networked Production Facility<br />

The successful implementation of a digital production site is<br />
only possible when owners, SMEs and global players work hand<br />
in hand to pool their strengths. Implementation strategies as well<br />
as migration strategies have to be developed. Single-handed<br />
attempts or rushed implementations without proper planning, as<br />
are regrettably occurring at the moment, pose a considerable<br />
threat to both the projects themselves and the owners running them.<br />

Fig. 9. Error Management Services<br />

VI. FURTHER IMPLEMENTATION<br />

Once all necessary workflows have been implemented in<br />
accordance with the reference system, all machines on a<br />
production site can be transferred into the new homogeneous<br />
service landscape.<br />

In addition to the message-based communication structures<br />
between machines and their synchronous controls, it is often<br />

necessary to also establish a synchronous network. This<br />

facilitates transferring cyclical process parameters in real time.<br />

Only now does the connection to the overarching Office<br />
Floor become possible and make sense.<br />


Dotdot Unifies IoT Device Networks<br />

Jason Rock<br />

IoT Products<br />

Silicon Labs<br />

Austin, TX USA<br />

jason.rock@silabs.com<br />

Abstract— Dotdot is the universal language of the Internet of<br />

Things (IoT). Today, devices are being connected all around us.<br />

Most of these connected devices, however, cannot interoperate<br />

with each other due to their unique skills, rules and systems. If<br />

cost, power and RF regulatory compliance were not factors, the<br />

simplest solution would be to replace or retrofit every device and ensure<br />

they use a single physical protocol such as Wi-Fi. In reality, each<br />

device’s chosen wireless technology is optimized for its particular<br />

application. Therefore, the better solution to achieve device<br />

interoperability across an array of devices with various protocols<br />

is to embed and use an application library such as Dotdot. This<br />
allows each connected device to be physically optimal for its<br />
intended purpose and enables devices in multiple<br />
ecosystems to communicate with each other.<br />

Keywords—IOT, Dotdot, Zigbee, Thread, Bluetooth, multiprotocol,<br />

software, edge, cloud<br />

I. INTRODUCTION<br />

Devices are being connected all around us due to the growth<br />

and evolution of the IoT. In fact, Gartner predicts, “By 2020, IoT<br />

technology will be in 95% of electronics for new product<br />

designs” [1]. Most of these connected devices, however, cannot<br />

interoperate with each other because each ecosystem requires<br />
unique device attributes and behaviors, which in turn means<br />
these devices, by default, cannot communicate with<br />
devices outside their intended ecosystem.<br />
This effectively locks consumers into branded ecosystems<br />
based on their desired device selection, and vice versa.<br />

One challenge with the walled gardens created by branded<br />

device ecosystems is that it makes it difficult for device<br />

manufacturers to develop end devices or manage product<br />

variants that can communicate across these multiple ecosystems.<br />

Additionally, the desire for a single wireless protocol such as<br />

Wi-Fi will not be the solution as some may predict because<br />

multiple physical protocols will continue to coexist for the<br />

foreseeable future.<br />

Rather than seeking a single physical protocol, a common<br />

application language residing above these protocols will become<br />

the preferred option. Of these, an optimal option is Dotdot.<br />

Developed by the Zigbee Alliance, Dotdot can instead be<br />

installed or upgraded into each set of connected devices to<br />

process interaction commands between each system. The<br />

presence of Dotdot’s device descriptors presents a common<br />

language allowing each device to operate optimally for its<br />

purpose while enabling connected devices to communicate with<br />

other devices within or outside of its branded ecosystem.<br />

II. BACKGROUND<br />

A. Before IoT, mobile phones became smart<br />

Today, the IoT faces a scenario similar to what the mobile phone<br />
industry experienced in the late 1990s when mobile operators<br />

had to choose a protocol such as CDMA or GSM. Based on that<br />

selection, international travelers discovered the challenges of<br />

getting phone service as they roamed into networks with<br />

unregulated cost structures. Consumers choosing a different<br />

service found themselves with phones unable to interoperate on<br />

the chosen network. While some predicted one standard would<br />

prevail, the eventual outcome was the existence of multiple<br />

technologies with multiple physical wireless standards<br />

interoperating with each other. This caused mobile phone<br />

manufacturers to introduce multiband phones that could<br />

interoperate over an array of deployed wireless technologies,<br />

allowing mobile phone operators to negotiate favorable terms<br />

to share their networks. These actions to work together to<br />

contend with what was available and establish a way for mobile<br />

phones to work across competing networks ultimately benefited<br />

the consumer. It also helped facilitate the next evolution of the<br />

smart phone, with Android at the center of this evolution, which<br />

led to the creation of the mobile app industry [2]. The Internet<br />

of Things is beginning to use the same approach.<br />

B. Walled gardens don’t make good networks<br />

Concannon Business Consultant Michael Dorazio<br />

poignantly states the connected device conundrum: “The<br />

Internet of Things will weave a seamless tapestry of connected<br />

devices into your life. Except that it won’t … if things keep<br />

going the way they are” [3]. This means that ‘walled gardens’<br />

that only work with branded devices will likely leave the<br />

consumer in a struggle to determine which new devices work<br />

with the devices they currently own.<br />

C. The woes of the modern building manager<br />

To understand the problem, consider commercial buildings<br />

being managed by today’s building managers. A modern<br />

building will contain connected LED lighting, a state-of-the-art<br />

security system and an efficient HVAC with remote sensors.<br />



Without a standardized application layer, it is likely that none of<br />

these systems can easily communicate with each other.<br />

Further, what if a building manager overseeing a dozen<br />

buildings located in multiple regions needs to apply a uniform<br />

policy to:<br />

• Pre-condition every building starting at 6:00 am,<br />
• Activate all lights at 7:00 am,<br />
• At 7:00 pm, set all buildings to an energy-efficient mode for lighting and HVAC systems, and<br />
• Provide exceptions for weekends and holidays.<br />
To do so, this building manager would likely have to log into 24<br />
different accounts (12 for lighting, 12 for HVAC), set up 12 sets<br />
of lighting policy rules as well as the same number of individual<br />
rules for the HVAC. To complete the change, he/she will then<br />
ask local staff to help verify that each system operates as<br />
programmed.<br />

D. Contending with regulations<br />

While this scenario may seem to be a temporary problem that<br />

only occurs during installation, maintenance or infrequent policy<br />

changes, government and regulatory entities at the federal, state,<br />

and municipal level are taking a more active role by mandating<br />
energy policies that building managers must contend with across<br />
the buildings under their<br />
administration. For example, the State of California’s Title 24’s<br />

2013 standard requires buildings greater than 10,000 square feet<br />

to be capable of automatically reducing lighting power in<br />

reaction to a demand response signal by a minimum of 15%<br />

below the total installed lighting power [4]. These regulations<br />

additionally may require the submission of monthly or yearly<br />

reports. If these systems have no way to<br />
intercommunicate, compiling such reports can be challenging, as<br />
the building manager has to gather the necessary<br />
data manually. Failure to adhere to specific regulations could lead to<br />

penalties such as fines, loss of tax incentives, or possibly<br />

decertification of the building’s occupancy permit.<br />

E. Consider the array of physical protocols<br />

Connected LED lighting best represents the challenges faced<br />
within the IoT evolution due to the variety of adopted physical<br />
protocols. Connected lighting, especially in a commercial setting,<br />
tends to use wired protocols such as DALI (Digital Addressable<br />

Lighting Interface), DMX, Power over Ethernet and Power Line<br />

Communications. Wireless protocols, however, are rapidly<br />

becoming a presence within connected lighting using Zigbee,<br />

Bluetooth, Wi-Fi, EnOcean, Li-Fi, and recently 6LoWPAN and<br />

Thread, to name a few.<br />

HVAC, however, does not have as many deployed protocols,<br />
but these systems use some of the same protocols as lighting, along<br />
with others such as BACnet, Modbus, LonWorks and KNX.<br />

How does a building manager decide which protocol to<br />

deploy, especially when faced with having to choose a specific<br />

brand for lighting and a different brand for HVAC? Each brand<br />

may have its own control software, connected to different cloud<br />

management solutions requiring multiple accounts and multiple<br />

passwords to maintain.<br />

F. Using cloud APIs to patch device interoperability<br />

One method to address these challenges is to use some of<br />

the new cloud-capable platform services being offered by<br />

startups and established companies. Purchasing additional<br />
equipment such as IoT-enabled gateways, together with subscription services,<br />
offers building managers a way to connect their deployed<br />

systems and control them within a unified cloud-centric<br />

platform. In effect, these platforms take advantage of the walled<br />

garden problem by offering a paid service to help alleviate the<br />

struggle building managers are facing. In other cases such as<br />

smart cities and transportation, there are many emerging<br />

companies offering to help via paid services to solve the same<br />

sorts of challenges as more connected devices, which mostly are<br />

not interoperable, appear within their particular industry.<br />

III. MULTI-PROTOCOL WIRELESS IS HERE TO STAY<br />

A. One protocol to rule them all<br />

As in the case of connected lighting, network operators, as<br />

well as device makers hoping to solve their customers’<br />

connectivity problems, find themselves forced to offer<br />

connected devices in multiple physical protocols. If cost, power<br />

or other RF aspects such as operating frequency and range were<br />

not factors, manufacturers could standardize on a single protocol<br />

like Wi-Fi. This is not realistic and would require all device<br />

makers to arrive at the same decision across all devices within<br />

the respective ecosystems.<br />

B. Wi-Fi is great for data, bad for batteries<br />

The reality is each device’s chosen technology is in fact<br />

optimal for the end application, and to select a single protocol<br />

requires a much more complex decision that outweighs the<br />

technological tradeoffs. Replacing or retrofitting every device to<br />

ensure they use a single protocol is not realistic. The reality is a<br />

device’s chosen connectivity is optimal for its intended<br />

application within its targeted industry.<br />

TABLE I. COMMON WIRELESS RADIO STANDARDS [5]<br />
                         Wi-Fi            Z-Wave           Zigbee           Thread          BLE<br />
Launched to the Market   1997             2003             2003             2015            2010<br />
PHY/MAC Standard         IEEE 802.11      ITU-T G.9959     IEEE 802.15.4    IEEE 802.15.4   IEEE 802.15.1<br />
Frequency Band           2.4 GHz          868/900 MHz      2.4 GHz          2.4 GHz         2.4 GHz<br />
Maximum Data Rate        > 1 Gbit/s       40-100 kbit/s    250 kbit/s       250 kbit/s      2 Mbit/s<br />
Topology                 Star             Mesh             Mesh             Mesh            P2P/Mesh<br />
Power Usage              High             Low              Low              Low             Low<br />
Alliance                 Wi-Fi Alliance   Z-Wave Alliance  ZigBee Alliance  Thread Group    Bluetooth SIG<br />

Wi-Fi was designed as a way to wirelessly connect personal<br />

computers within a local network environment. The protocol<br />
allows connected devices to connect securely<br />
as they move nomadically within the given local network.<br />

Given the data transfer requirements, it uses higher overall<br />

energy consumption than other radio technology options<br />

currently available. Combined with an on-board SoC as in the<br />



case of computers and now mobile phones, these Wi-Fi enabled<br />

devices use rechargeable lithium-polymer batteries versus small<br />

disposable batteries. In essence, one would be tossing out<br />

numerous batteries if these devices were not built to be<br />

rechargeable.<br />

On the other hand, wireless protocols such as Zigbee and<br />
Z-Wave are being adopted by home automation and security device<br />

manufacturers as they are better suited for battery-efficient<br />

applications. Bluetooth is widely adopted for direct mobile<br />

phone connectivity across a variety of devices such as headsets,<br />

peripherals, fitness trackers, light bulbs, beacons and other asset<br />

trackers.<br />

IV. DOTDOT: A COMMON DEVICE LANGUAGE<br />

While there is a desire for a single wireless protocol spurred<br />

by the need for device interoperability, each protocol was<br />

conceived to satisfy a particular physical aspect of a specific<br />

market application. As a result, the best way to satisfy the desire<br />

for interoperability is to provide a method for each technology<br />

to be able to communicate with each other in a language<br />

designed for connected devices.<br />

The most widely deployed use of a common device language<br />

is the Zigbee Cluster Library (ZCL) developed by the Zigbee<br />

Alliance, which is running on hundreds of millions of IoT<br />

devices. In 2017, the Zigbee Alliance introduced Dotdot, which<br />

is effectively a rebranding and extension of the ZCL to operate<br />

across other wireless protocols. As one journalist noted: “The<br />

ZCL is mature and comprehensive, representing years of work<br />

defining and cataloging how ‘things’ will interoperate.” [6]<br />
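To give a flavor of the ZCL's cluster model: the Level Control cluster (ID 0x0008) defines, among others, a CurrentLevel attribute and a MoveToLevel command. The dictionary layout and handler below are a simplified illustration of that structure, not the specification's actual encoding:

```python
# Simplified representation of a ZCL-style cluster definition. The Level
# Control cluster and its CurrentLevel attribute exist in the published ZCL;
# this dictionary layout is purely illustrative.
LEVEL_CONTROL_CLUSTER = {
    "cluster_id": 0x0008,          # Level Control
    "attributes": {
        0x0000: {"name": "CurrentLevel", "type": "uint8", "range": (0, 254)},
    },
    "commands": {
        0x00: {"name": "MoveToLevel", "fields": ["level", "transition_time"]},
    },
}

def handle_move_to_level(state: dict, level: int) -> dict:
    """Apply a MoveToLevel command, clamping to the attribute's valid range."""
    lo, hi = LEVEL_CONTROL_CLUSTER["attributes"][0x0000]["range"]
    state["CurrentLevel"] = max(lo, min(hi, level))
    return state

light = handle_move_to_level({"CurrentLevel": 0}, 300)  # clamped to 254
```

Because every device implementing the cluster agrees on these identifiers and semantics, a dimmer and a lamp from different vendors can interpret the same command identically.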

For new devices using Wi-Fi or Thread, Dotdot residing<br />

within these devices can easily control and respond to<br />

commands as Dotdot was designed for IP-centric transport<br />

protocols that support UDP. For IoT legacy systems, Dotdot can<br />

be deployed via firmware upgrade into these devices to give<br />

them the language to communicate with ecosystems that have<br />

existing or future products that use Dotdot.<br />

In the case of Z-Wave and Bluetooth, small language<br />
translation APIs can be added to their embedded applications to<br />
translate protocol commands into the Dotdot language. Such a<br />
translation API requires only a few kilobytes of memory<br />
and minimal processor usage. Further adaptations, such as translating<br />
UDP to TCP or to a RESTful protocol, may require a little more<br />
planning, but the complexity is hardly difficult for most<br />
firmware-savvy engineers to implement on a low-cost wireless<br />
connected device.<br />
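Such a translation layer can be sketched as a small mapping from native commands to a generic application-layer message. The command names and the (cluster, command, payload) message shape below are hypothetical illustrations, not the actual Dotdot wire format:

```python
# Hypothetical translation shim: map a native Z-Wave-style command onto a
# generic Dotdot/ZCL-style (cluster, command, payload) message. The message
# shape is illustrative, not the actual Dotdot wire format.
NATIVE_TO_DOTDOT = {
    "SWITCH_BINARY_SET_ON":  {"cluster": "OnOff", "command": "On",  "payload": {}},
    "SWITCH_BINARY_SET_OFF": {"cluster": "OnOff", "command": "Off", "payload": {}},
}

def translate(native_command: str) -> dict:
    """Translate a native command into the common application language."""
    try:
        return NATIVE_TO_DOTDOT[native_command]
    except KeyError:
        raise ValueError(f"no translation for {native_command!r}")

msg = translate("SWITCH_BINARY_SET_ON")
```

A lookup table of this kind is essentially static data, which is why the memory footprint of such a shim stays small.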

V. SUMMARY<br />

As more devices become connected thanks to advances<br />

within IoT, these connected devices will need to interoperate<br />

with each other regardless of ecosystem to enable the next<br />

generation of services to arise. This was the case when mobile<br />

phones evolved into smart phones, fostering the introduction of<br />

mobile apps.<br />

While the walled garden approach by branded ecosystems<br />

allowed the connected device industry to emerge, the reality is<br />

device manufacturers need to leverage the right wireless<br />

protocol without having to contend with supporting multiple and<br />

ever-changing APIs to ensure their products can operate across<br />

an array of ecosystems.<br />

Given its ability to describe device behavior, Dotdot, with<br />

the backing of a consortium of established IoT vendors, can<br />

provide a way for device manufacturers, with minimal coding<br />
effort, to augment their connected devices to process interaction<br />

commands between each system. The presence of Dotdot’s<br />

device descriptors presents a common language allowing each<br />

device to be optimized for its intended purpose while enabling<br />

devices in multiple ecosystems to communicate with each other.<br />

Fig. 1. Example of a Zigbee attribute specification for level control<br />
Instead of finding the perfect transport protocol, Dotdot can<br />
be augmented into devices as a language to unify the many<br />
products leveraging the varied protocols. It enables an array of<br />
innovations for device control agnostic to the underlying IoT<br />
protocols within each deployed device. Using Dotdot can give a<br />
connected lighting unit a lighting profile, whether it uses Wi-Fi,<br />
DALI or Zigbee as its connectivity technology. Its device behavior<br />
and attributes are specified so that messages from a Wi-Fi<br />
lighting dimmer panel using Dotdot can be properly understood<br />
by a Zigbee light using Dotdot connected to the same local<br />
network via a protocol translation bridge. Adding<br />
Dotdot to each system is relatively simple as its memory<br />
footprint for the particular device is quite small.<br />
REFERENCES<br />

[1] D.C. Plummer et al., “Top Strategic Predictions for 2018 and Beyond: Pace Yourself, for Sanity’s Sake”, Gartner, September 29, 2017, https://www.gartner.com/doc/3803530?srcId=1-6595640685<br />
[2] The Verge staff, “Android: A visual history”, December 7, 2011, https://www.theverge.com/2011/12/7/2585779/android-history<br />
[3] M. Dorazio, “Walled Gardens are Killing the Internet of Things”, December 21, 2015, http://www.concannonbc.com/walled-gardens-are-killing-the-internet-of-things/<br />
[4] California Energy Commission, “2013 Building Energy Efficiency Standards for Residential and Nonresidential Buildings”, November 25, 2013, Section 130.1(e), http://www.energy.ca.gov/2012publications/CEC-400-2012-004/CEC-400-2012-004-CMF-REV2.pdf<br />
[5] C. Liew, “The Smart Home radio protocols war”, August 10, 2015, https://www.iot-now.com/2015/08/10/35653-the-smart-home-radio-protocols-war/<br />
[6] D. Ewing, “Delving deeper into Dotdot -- ZigBee’s new ‘Universal Language for the IoT’”, April 4, 2017, https://www.embedded.com/electronics-blogs/say-what-/4458281/Delving-deeper-into----ZigBee-s-new--Universal-Language-for-the-IoT-<br />



Localizing Analytics<br />

for Speed, Reliability and Reduced Power Consumption<br />

John Milios<br />

Sendyne Corp.<br />

New York, NY, USA<br />

jmilios@sendyne.com<br />

Nicolas Clauvelin<br />

Sendyne Corp.<br />

New York, NY, USA<br />

nclauvelin@sendyne.com<br />

Abstract --- New tools make physics-based analytics possible<br />
in an embedded environment. By computing locally – performing<br />

predictive and prescriptive analytics at the edge of the IoT –<br />

significantly less data must be directed to the cloud. Further, the<br />

data sent are more informative and they are available in serverless<br />

situations. This improves reliability, speeds computation<br />

time and reduces power consumption. In addition, physics-based<br />

models have the ability to assess the internal state of an observed<br />

system. This makes their predictions more accurate. By enabling<br />

physics-based models to operate in real time in small footprint<br />

embedded devices, the resultant robust predictive ability can lead<br />

to a reduction of needed, and often expensive, system monitoring<br />

sensors. To illustrate how embedded model-driven analytics can<br />

be implemented, a real-world example will be demonstrated: an<br />

electric motor health monitor. Each step in the implementation<br />

process will be shown, from model design to the utilization<br />

of embedded scientific computing tools, final real-time model<br />

optimization, and system predictions.<br />

Keywords --- analytics; physical analytics; Edge of the IoT;<br />

model-based analytics<br />

I. Introduction<br />

The vast amount of data generated by Internet of Things (IoT)<br />

devices and sensors is threatening to disrupt the current Internet<br />

infrastructure. The prospect of billions of interconnected devices<br />

and sensors ceaselessly generating data creates unsustainable<br />

requirements in storage, energy and bandwidth. According to<br />

CISCO, Machine to Machine (M2M) traffic alone is growing<br />

at a 49% CAGR and is projected to generate 14 exabytes/month,<br />
up from 3 exabytes/month in 2017 [1]. For<br />
comparison, one exabyte is 10^18 bytes, and some educated guesses<br />
place the storage capacity of Google<br />
somewhere around 10-15 exabytes of data [2]. Industry,<br />

academia and even governments are becoming aware of this<br />

threat and are investigating multiple approaches to address the<br />

impending very big data crisis [3],[4]. It is beyond the scope<br />

of this paper to delve into every aspect of the problem in depth.<br />

Instead we will focus on the role of analytics in reducing the<br />

traffic between IoT devices and the cloud.<br />

II. The Role of IoT Analytics<br />

All IoT generated data can be valuable but only if they are<br />

interpreted in a useful way. This is the role of analytics – one<br />

Figure 1: IoT generated data is growing faster than social & computer generated<br />
data<br />
Figure 2: Performing analytics at the edge reduces storage, energy and bandwidth<br />
requirements<br />



of the most important applications in the IoT, which can be<br />

categorized as predictive and prescriptive analytics.<br />

Predictive analytics aims to identify potential issues<br />

before they occur. The benefits are immediate; for example,<br />

unscheduled down time in a production line can be significantly<br />

reduced or eliminated.<br />

Prescriptive analytics goes one step further by acting on the<br />

data through a feedback system that optimizes a process.<br />

Given the storage, energy and bandwidth concerns that<br />
the very big IoT data is creating, it is optimal to perform the<br />
data analysis as close as possible to its source, thus reducing the<br />
transmission of unnecessarily large amounts of data, as illustrated<br />
in Fig. 2. Prescriptive analytics performed at the edge provides an<br />
added benefit by enabling local operational functionality during<br />
scheduled or unscheduled server-less conditions.<br />
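The bandwidth saving can be made concrete with a deliberately simple sketch: instead of streaming raw samples, the edge device reduces each measurement window to a few summary metrics before transmitting. The metrics chosen here are arbitrary illustrations:

```python
import statistics

def summarize_window(samples):
    """Reduce a window of raw sensor samples to a few summary metrics."""
    return {
        "n": len(samples),
        "mean": statistics.fmean(samples),
        "min": min(samples),
        "max": max(samples),
    }

# 1000 raw temperature samples collapse to one small record per window,
# so only the summary needs to cross the network to the cloud.
window = [20.0 + 0.01 * i for i in range(1000)]
report = summarize_window(window)
```

Here a thousand raw readings shrink to a four-field record, a reduction of more than two orders of magnitude in transmitted data per window.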


III. The Power of Physics in IoT Analytics<br />

Big data generated by IoT devices and sensors are different<br />

than data created by social networks, financial or business<br />

transactions. Most analytics in the latter category are statistical,<br />

dealing for example with frequency of appearance of keywords<br />

or with relating healthcare protocols to patient outcomes. In<br />

contrast, data generated by IoT sensors are measurements of<br />

natural or man-made physical systems (e.g., temperature of a<br />

specific location or velocity of a motor shaft).<br />

These physical systems are by nature deterministic, and their<br />

analysis traditionally has been physics-based. The purpose of<br />
physical systems analytics is primarily to extract information<br />
about the internal state of an observed system. Measurements<br />
are often limited to the observable behavior of the system. In<br />
most cases, though, what we are interested in is the hidden<br />
information behind these data: the internal state of the system.<br />

For example, we can measure the surface temperature of a<br />

battery but what we may be interested in is its core temperature<br />

which we cannot measure directly.<br />

The association between the observables and the hidden<br />

state information is accomplished best through physics-based<br />

and mathematical models which by design relate the inputs<br />

to the outputs through a description of the system’s internal<br />

dynamics. Physics models, once derived and formulated, do not<br />
change, and they are data-independent. If they contain<br />
enough detail they can, at least in theory, accurately predict the<br />
response of an observed system to changing input conditions.<br />

Figure 4: IoT physics-based analytics (model complexity & processing power<br />
versus number of data points) lies between traditional physics models and big<br />
data analytics<br />

So in theory if there is a physical model for a given system and<br />

we know its inputs we could predict its outputs. In practice,<br />

many physical systems are too complicated to solve accurately<br />
within the available time and computing resources.<br />
In addition, there are physical phenomena that escape<br />

first principle approaches, or they are too complicated to model.<br />

Finally, real world observable data are noisy. These are some of<br />

the issues that IoT physical analytics can address.<br />

In the IoT world, observed physical systems reside in the<br />
analog of a continuous experiment. Just as experimentation in<br />
traditional science drives the dynamic process of knowledge<br />
advancement, the co-existence of physics models<br />
and big data can create dynamic, data-dependent models with<br />

predictive power that can potentially provide better physical<br />

insights and advance knowledge. The mixing of physical<br />

models and experimental data is not novel. It has been used<br />

extensively in the semiconductor and other industries for the<br />

creation of compact physical models. In these models physics<br />

laws are combined with experimentally derived parameters and<br />

relationships in order to create small, fast and accurate model<br />

units that can scale well in very large simulations. In the big data<br />

IoT this combination of physics based models and experimental<br />

data can occur dynamically.<br />
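This dynamic combination can be sketched with a one-parameter example of the model-adjustment loop: a model predicts the plant output, compares it with the measurement, and nudges its parameter with an LMS-style step. The "plant" here is a simulated resistor with a known true value; everything about the setup is an illustration:

```python
# One-parameter sketch of the model-adjustment loop: the "plant" is a resistor
# with true resistance 2.0 ohm; the model estimates conductance g so that
# i_model = g * v, and an LMS-style step corrects g from each observation.
def plant_current(v, r_true=2.0):
    return v / r_true            # simulated plant output Y_plant

g = 0.1                          # initial model parameter (1/ohm)
mu = 0.01                        # adaptation step size
for step in range(2000):
    v = 1.0 + (step % 10)        # varying input u_c
    i_model = g * v              # model prediction Y_model
    error = plant_current(v) - i_model
    g += mu * error * v          # adjust the parameter toward the plant
# g converges to the true conductance 1/2.0 = 0.5
```

With noise-free data the parameter converges quickly; with real, noisy measurements the same loop averages the noise out over time, which is exactly the dynamic model/data combination described above.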

Figure 3: Physical systems analysis relates the observables with the hidden<br />
states<br />
Figure 5: The model reads the inputs of the plant u_c and predicts the output<br />
Y_model. After comparing its predictions with the actual outputs Y_plant, the<br />
model adjusts its parameters for a more accurate estimation of the current plant state.<br />





It occupies up to 300 KB of memory with all features enabled.<br />

Benchmark testing against automatically-generated C code<br />

from Matlab Embedded Coder exhibits an order of magnitude<br />

faster execution and similar memory usage when solving the<br />

Van der Pol equations on an ARM Cortex-M4 MCU.<br />

VI. Decentralized Scientific Processing<br />

An example of how this technology can be deployed in the<br />

IoT is the monitoring of a factory floor. In this scenario, multiple<br />
motors are operating, generating a constant flow of voltage,<br />

current, rotor position, angular velocity and acceleration data.<br />

It is desirable to derive from these data, through analytics, the<br />
health condition of each motor in order to schedule maintenance<br />

and avoid unscheduled down time. Instead of transmitting all<br />

these data to a central processing location, a local MCU can<br />

utilize a model to associate the observed electrical and motion<br />

measurements with the device parameters of interest. Such a<br />

simple model is shown in Fig. 8 for a DC motor. Utilizing<br />

this model and measurement data the numerical solver and<br />

optimizer can fit the model parameters to the incoming data<br />

dynamically monitoring the health of the motor.<br />
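A minimal sketch of this kind of on-device parameter fitting (using a generic discretized first-order motor model and synthetic data, not the paper's Fig. 8 model) might look as follows; a drift in the fitted parameters over time is the health signal:<br />

```python
# Fit the parameters of a discretized first-order motor model
#     w[k+1] = a*w[k] + b*u[k]
# from streaming (input u, speed w) samples by solving the 2x2 normal equations.
def fit_motor(u, w):
    """Least-squares estimate of (a, b) from input u[k] and speed w[k]."""
    s_ww = s_uu = s_wu = s_wy = s_uy = 0.0
    for k in range(len(w) - 1):
        x1, x2, y = w[k], u[k], w[k + 1]
        s_ww += x1 * x1; s_uu += x2 * x2; s_wu += x1 * x2
        s_wy += x1 * y;  s_uy += x2 * y
    det = s_ww * s_uu - s_wu * s_wu
    a = (s_wy * s_uu - s_wu * s_uy) / det   # Cramer's rule, 2x2 system
    b = (s_ww * s_uy - s_wy * s_wu) / det
    return a, b

# Synthetic "plant" with true a = 0.90 (friction/inertia) and b = 0.50 (gain).
u = [1.0] * 50
w = [0.0]
for k in range(49):
    w.append(0.90 * w[k] + 0.50 * u[k])

a_hat, b_hat = fit_motor(u, w)
# A gradual drop in a_hat would hint at rising friction (e.g. gearbox wear).
print(round(a_hat, 2), round(b_hat, 2))
```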

A typical scenario in this example would be to detect changes in the parameters describing the motor specifications and therefore detect the onset of faulty behaviors: for example, changes in parameters related to friction and inertia of the shaft could indicate wear in the gearbox, and changes in the motor inductance could indicate that the windings are deteriorating or overheating.<br />

Moreover, within such a setup it would make sense to transmit only metrics related to the health condition of the system rather than the entire set of observed data. The rate at which the health-condition metrics are transmitted can in turn be adapted to the health of the system itself: if the system is operating normally, data can be transmitted at a slow rate, and if a faulty behavior is detected, data can be transmitted at a higher rate so that the central processing location can accurately flag the system.<br />
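The health-adaptive reporting policy described above can be sketched in a few lines (the thresholds and intervals here are purely illustrative):<br />

```python
# Health-adaptive reporting: healthy motors report rarely, suspect motors
# report often. Score scale and thresholds are made up for illustration.
def report_interval_s(health_score):
    """Map a 0.0 (failed) .. 1.0 (healthy) score to a transmit interval."""
    if health_score >= 0.9:
        return 600      # normal operation: one metrics packet per 10 minutes
    if health_score >= 0.6:
        return 60       # degradation suspected: one packet per minute
    return 5            # fault developing: near-real-time updates

print(report_interval_s(0.95), report_interval_s(0.70), report_interval_s(0.20))
```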

There are numerous other applications and methods that can benefit from fast and accurate scientific computing on small MCUs at the edge.<br />

VII. Conclusion<br />

Decentralized scientific processing in the IoT provides a method for reducing the flow of sensor data by processing them right at the source. This IoT platform requires compact models and compact numerical solvers that can operate within today's MCU memory and speed constraints. Through physical analytics, big IoT data can advance our understanding of the physical world and lead to applications that turn big data into bigger returns.<br />

References<br />

[1] CISCO, "The Zettabyte Era: Trends and Analysis," 2017. [Online]. Available: https://www.cisco.com/ [Accessed: 12-Jan-2018].<br />

[2] What-if, "Google's Datacenters on Punch Cards." [Online]. Available: https://what-if.xkcd.com/63/ [Accessed: 12-Jan-2018].<br />

[3] S. Pappas, "How Big Is the Internet, Really?," Live Science, 2016. [Online]. Available: https://www.livescience.com/54094-how-big-is-the-internet.html [Accessed: 12-Jan-2018].<br />

[4] S. Matsuoka et al., "Extreme Big Data (EBD): Next Generation Big Data Infrastructure Technologies Towards Yottabyte/Year," Supercomputing Frontiers and Innovations, vol. 1, no. 2, pp. 89-107, Sep. 2014. ISSN 2313-8734. Available at: <…/article/view/24/120>. Date accessed: 12 Jan. 2018. doi: http://dx.doi.org/10.14529/jsfi140206.<br />

[5] A. Bieswanger, H. F. Hamann, and H. D. Wehle, "Energy efficient data center," IT-Information Technology: Methoden und innovative Anwendungen der Informatik und Informationstechnik, vol. 54, no. 1, pp. 17-23, 2012.<br />

[6] R. Melville, N. Clauvelin, and J. Milios, "A high-performance model solver for 'in-the-loop' battery simulations," in American Control Conference (ACC), 2016, pp. 3119-3125.<br />


Build an Industrial-Strength<br />

Device-to-Cloud IoT Application in 30 Minutes –<br />

No Smoke and Mirrors involved<br />

Delivering IoT Solutions with a Hybrid IoT Platform Approach<br />

Terrence Barr<br />

Head of Solutions Engineering<br />

Electric Imp<br />

Los Altos, CA, USA<br />

Hugo Fiennes<br />

CEO and Co-Founder<br />

Electric Imp<br />

Los Altos, CA, USA<br />

Abstract— Internet of Things (IoT) platforms can provide<br />

critical assistance to companies looking to deliver connected<br />

products and services by reducing risk and time to market for<br />

their IoT solutions while providing the security and flexibility to<br />

adapt to evolving market demands. Outsourcing specialist areas<br />

such as security, operational management, and long-term<br />

maintenance allows companies to focus their resources on<br />

maximizing the business value of the connected products and services<br />

themselves, rather than struggling with the underlying<br />

technology complexity.<br />

This paper discusses key requirements of end-to-end (device-to-cloud)<br />

IoT solutions and the importance and critical benefits<br />

of choosing the right IoT architecture and platform. In the<br />

corresponding conference session the attendees will learn how to<br />

build a secure, scalable, and customizable device-to-cloud IoT<br />

application in 30 minutes by integrating a managed IoT device<br />

connectivity platform with a popular IoT application cloud<br />

service.<br />

Keywords—Internet of Things, IoT, Hybrid Platform, Security,<br />

Connectivity, Devices, Device Management, Edge Intelligence,<br />

Software Platform, Security Maintenance, UL 2900-2-2, Cloud<br />

Applications, Enterprise Integration, End-to-End, Time-to-Market,<br />

Risk Minimization<br />

I. THE CHALLENGE: IOT SOLUTION COMPLEXITY<br />

Many companies now understand the business benefits of<br />

connected products and services in the Internet of Things (IoT).<br />

The value of connected products and services is driven by<br />

IoT business applications, which in turn depend on trustworthy,<br />

accurate, and reliable data from devices (products) in the field.<br />

Therefore, any IoT solution must be created and delivered end-to-end,<br />

from the device to the business application, to generate<br />

the expected business benefits.<br />

However, implementing, deploying, and supporting end-to-end<br />

IoT solutions can be complex and challenging – in<br />

particular meeting the increasing commercial or industrial<br />

requirements for security, reliability, scalability, and longevity.<br />

This complexity is a key factor that is slowing down or holding<br />

back many IoT deployments today. Among some of the<br />

technical challenges are:<br />

• Security (from device hardware through<br />

communications to cloud and management)<br />

• Hardware selection and product design<br />

• Software complexity (device and cloud)<br />

• Multitude of communication technologies<br />

• Device manufacturing and deployment at scale<br />

• Integration of legacy systems<br />

• Protocol conversion and data integration<br />

• Cloud infrastructure and scalability<br />

Many product and services companies do not possess the<br />

technical expertise, resources, or appetite for risk to build and<br />

deliver IoT solutions by themselves. For such companies, it<br />

makes more sense to focus on their core competencies and<br />

leverage IoT offerings from specialized vendors who abstract<br />

much of that complexity away from the customer.<br />

Unfortunately, choosing the right IoT offerings itself is a<br />

challenge. A dizzying number of IoT offerings exist in the<br />

market today, from low-level device hardware components on<br />

one end to powerful cloud-based IoT platforms on the other<br />

end, and any number of components, technologies, standards,<br />

protocols, tools, and services in between.<br />

When approaching the IoT solutions market, there are two<br />

extremes visible:<br />

1. Bespoke IoT solutions, which are customized<br />

from a number of different technology and<br />

service components, and assembled, integrated,<br />

delivered, and supported by the vendor for a<br />

specific customer<br />

2. Off-the-shelf IoT Solutions, which are pre-integrated<br />

technology and services packaged as<br />

© Electric Imp, 2018<br />

177<br />

www.embedded-world.eu


IoT solutions targeting specific markets and use<br />

cases, offered by the vendor with limited<br />

customization<br />

#1 (Bespoke) offers maximum flexibility for IoT solutions<br />

at the expense of complexity, cost, and time-to-market.<br />

Building bespoke IoT solutions from components requires<br />

substantial expertise, incurs technology and execution risk, and<br />

carries the burden of having to support the bespoke IoT<br />

solution over its entire lifetime. In our experience, this burden<br />

is almost always underestimated, especially with regard to security, resulting in projects going over time and over budget.<br />

#2 (Off-The-Shelf) focuses on narrow functionality for<br />

specific applications and market segments, typically trading<br />

fast time-to-market and simplicity against flexibility.<br />

While off-the-shelf offerings may initially seem attractive,<br />

companies often find that these are not flexible enough to<br />

integrate well with existing products and business models, do<br />

not scale across product or business lines, and don’t evolve<br />

well with business needs. The difference between a Proof-of-Concept and a shipping IoT product is often about corner cases<br />

that occur in the real world, and a rigid IoT solution is often not<br />

capable of handling these corner cases efficiently.<br />

Neither extreme is viable for the majority of the companies looking for IoT solutions today. Many companies<br />

are looking for a straightforward but flexible approach to IoT –<br />

the key business requirements can be summarized as follows:<br />

• Fast time-to-market, low execution risk, and<br />

predictable, bounded expenditure<br />

• Well-designed, fully integrated security from<br />

device hardware to cloud, and maintained for the<br />

lifetime of the product<br />

• Flexibility to address unique and evolving<br />

technology and business needs<br />

• Easy integration with, and support for, existing<br />

and future products<br />

• Simplified procurement, integration, and delivery<br />

without sacrificing functionality or flexibility<br />

• Low upfront investment, and ability to<br />

incrementally invest as connected business scales<br />

• Cost effective, timely long-term support of<br />

solution<br />

II. THE SOLUTION: HYBRID IOT PLATFORM APPROACH<br />

In our experience, these requirements are best met with a<br />

Hybrid IoT Platform approach, which combines a<br />

comprehensive IoT device connectivity and management<br />

platform with the IoT application cloud platform that is best<br />

suited for the customer’s needs.<br />

1. IoT Device Connectivity and Management Platform<br />

The IoT Device Connectivity and Management Platform<br />

connects devices to the cloud securely, reliably, and at scale.<br />

Device connectivity and management is a highly specialized<br />

field which requires expertise in device hardware, security<br />

from device to cloud and all layers in-between, robust bi-directional<br />

connectivity (data and control), device<br />

management, software provisioning and OTA updates, protocol<br />

integration and data conversion, cloud integration, massive<br />

scalability, and more. The security, flexibility, and scale of the<br />

IoT Device Connectivity and Management Platform is a<br />

prerequisite to getting trustworthy IoT data into the application<br />

cloud. Without trusted device data there can be no IoT business<br />

value.<br />

2. IoT Application Cloud Platform<br />

The IoT Application Cloud Platform provides massive-scale<br />

device data ingestion, processing and storage, business<br />

applications, and enterprise orchestration. IoT cloud platforms<br />

typically rely on external mechanisms to provide device<br />

security, connectivity, and management, and this is where the<br />

IoT Device Connectivity and Management Platform comes in.<br />

The ease and flexibility of integration between the two<br />

platforms is critically important for real-world IoT solutions as<br />

it enables the organization to optimize the complete solution to<br />

the customer’s requirements.<br />

A Hybrid IoT Platform approach provides important<br />

benefits to companies looking to build IoT solutions:<br />

• Strikes an optimal balance between flexibility and<br />

time-to-market for many companies and their IoT<br />

use cases<br />

• Leverages proven platform implementations for<br />

common IoT functionality while enabling<br />

customization of the IoT solution to meet the<br />

customer’s unique needs<br />

• Simplifies procurement and support of the IoT<br />

solution with only two key vendors and well-defined<br />

responsibilities and integration points<br />

• Provides flexibility to evolve with the customer’s<br />

business needs, both on the device and on the<br />

cloud side, including expanding the solution with<br />

additional connectivity options and a wider range<br />

of cloud services<br />

III. DEMONSTRATION<br />

To demonstrate the Hybrid IoT Platform approach, we<br />

combine the Electric Imp Device Connectivity and<br />

Management Platform with the Microsoft Azure Cloud<br />

Platform to rapidly create a secure, industrial-grade,<br />

customizable device-to-cloud IoT application.<br />

The Electric Imp platform provides fully integrated edge<br />

device-to-cloud security, bi-directional low-latency<br />

connectivity, end-to-end management, device and cloud<br />

application platforms, ongoing security maintenance, and<br />

ready-to-use enterprise integrations to a range of popular IoT<br />

cloud platforms such as Microsoft Azure, Amazon Web<br />

Services, Salesforce IoT, GE Predix, and many others.<br />

The integration between the Electric Imp platform and<br />

Microsoft Azure is accomplished via the unique and powerful<br />

Electric Imp cloud middleware container, while the device<br />

integration, edge processing and data model is implemented via<br />



the Electric Imp managed device application container. This<br />

design makes processing the device data in the IoT business<br />

application straightforward because the data is trustworthy,<br />

appropriately processed, and correctly formatted when it<br />

reaches the IoT application cloud, avoiding impedance<br />

mismatches which are common in real-world IoT.<br />
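The kind of edge-side data shaping described above can be pictured as follows. The field names, schema tag, and sensor scaling in this sketch are purely hypothetical; they are not the Electric Imp or Azure data model:<br />

```python
import json

# Hypothetical edge-side shaping: a raw ADC reading is converted to
# engineering units and a stable record schema before upload, so the
# cloud application never sees device-specific quirks.
def to_cloud_record(device_id, raw_adc, ts):
    temperature_c = raw_adc * 0.125 - 40.0   # made-up sensor scaling
    return json.dumps({
        "deviceId": device_id,
        "ts": ts,
        "temperatureC": round(temperature_c, 2),
        "schema": "telemetry/v1",            # versioned, agreed-upon format
    }, sort_keys=True)

record = to_cloud_record("imp-001", 512, "2018-02-27T10:00:00Z")
print(record)
```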

For the demonstration, these are the high-level steps to build the device-to-cloud IoT application (with approximate time in parentheses):<br />

1. Securely connect and enroll the edge device into the Electric Imp platform (2 min)<br />

2. Create the IoT application template in the Microsoft Azure IoT application builder (5 min)<br />

3. Define the device template, device properties, visualizations, and rules in the Microsoft Azure IoT application builder (10 min)<br />

4. Authenticate the device with the Microsoft Azure IoT application via the Electric Imp platform (1 min)<br />

5. Deploy the Microsoft Azure integration into the Electric Imp cloud middleware container (2 min)<br />

6. Deploy the edge application into the Electric Imp device application container (2 min)<br />

7. Device and cloud applications now execute, and device data is sent to the Azure IoT application, where it is stored, visualized, and rules are applied (1 min)<br />

Total time to create, deploy, and execute: under 30 minutes.<br />

The demonstration can easily be expanded further. Additional data models, bi-directional communication and control, custom processing and filtering, and almost any other application functionality can be implemented by updating and simply re-deploying the business logic to the edge device and cloud (at the push of a button, using the Electric Imp OTA software provisioning).<br />

It is noteworthy that the resulting IoT solution inherits the industrial-grade properties of the underlying platforms, including full security, scalability, and manageability. The demonstration is built on a single 'shard' of a scalable system; scaling is horizontal from this point on, and the architecture of the Electric Imp connectivity platform allows scaling to many concurrent shards to support millions of devices, growing with the business needs of the company and minimizing engineering effort as the solution scales out.<br />

IV. CONCLUSION<br />

Creating, deploying, and supporting device-to-cloud IoT<br />

business applications can be complex and challenging.<br />

Bespoke IoT solutions provide maximum flexibility but are<br />

costly and time-consuming and are not a viable option for<br />

many companies. Off-the-shelf IoT solutions offer simplicity<br />

and fast time-to-market, but often lack the necessary flexibility<br />

to adapt and grow with business needs.<br />

A hybrid IoT platform approach can deliver the best option<br />

for many companies by combining a comprehensive IoT<br />

device connectivity and management platform with the<br />

customer’s preferred IoT application cloud platform. This<br />

approach integrates two proven and ready-to-use platforms to<br />

deliver the necessary common IoT functionality while<br />

providing the flexibility to easily adapt and extend the IoT<br />

solution to the customer’s needs.<br />

The result is a customized device-to-cloud IoT solution that<br />

can be delivered to market quickly and with low risk and which<br />

can evolve over time as the needs of the customer grow. This is<br />

what most companies need to successfully extend their core<br />

product and service business into the connected world of the Internet of Things.<br />

The Electric Imp IoT Edge-to-Enterprise Platform helps<br />

more than 100 customers around the world to build, ship, and<br />

manage their IoT solutions securely, effectively, and at<br />

massive scale, with more than 1 million devices, UL 2900-2-2<br />

Cybersecurity Certification, and pre-built integrations to<br />

leading cloud services like Microsoft Azure, Amazon Web<br />

Services, Salesforce IoT, GE Predix, and more.<br />

For more information, please see www.electricimp.com<br />



Agile Development and ISO26262<br />

Irwin Fletcher<br />

Quality Management, OpenSynergy GmbH<br />

Berlin, Germany<br />

irwin.fletcher@opensynergy.com<br />

Abstract—The Harvard Business Review recently praised Agile<br />

methods saying they have greatly increased success rates in<br />

software development, improved quality and boosted productivity.<br />

Yet there is a persistent belief that Agile is unsuitable for automotive software development, and that it serves as an excuse for undisciplined behavior when applied to developing software for embedded systems in regulated industries. This paper challenges this belief and shows how Agile can be combined with conformance<br />

to standards such as ISO26262, to deliver real business benefits.<br />

However, balancing Agile practices with conformance expectations<br />

is not simple and effortless. It requires a deep understanding of the<br />

intent behind both approaches. This paper describes where Agile<br />

techniques can be applied most usefully to software development,<br />

and importantly, where they should not be used. The primary focus<br />

of this paper is directed towards the development of a Safety<br />

Element out of Context Software Component.<br />

Keywords—Scrum; Safety-critical; Agile; ISO26262;<br />

Embedded Software Development; SEooC; Traceability; User<br />

Stories; Requirements<br />

I. SOFTWARE IN THE DRIVING SEAT<br />

Software is a major component of today’s<br />

automobile. The features that it delivers are an ever<br />

increasing factor in purchasing decisions. The<br />

Boston Consulting Group state that “Consumers<br />

want to purchase cars from companies that bring<br />

new technologies to market and do so quickly” [1].<br />

At the same time there is rising concern about<br />

relinquishing control to software. Many people do<br />

not fully trust the technology and express doubts<br />

about the safety of self-driving cars [2].<br />

The automotive industry is having to satisfy two<br />

differing demands. On the one hand to deliver<br />

innovative technology rapidly and, on the other, to<br />

ensure that software intensive systems are<br />

demonstrably safe and dependable.<br />

Software developers favor Agile methods, which should enable rapid product development, thus satisfying the first demand. Working to ISO26262, meanwhile, is recognized as guaranteeing functional safety, which should satisfy the second [3,4]. Given<br />

that applying ISO26262 raises the cost of<br />

development by a factor of around three to five<br />

times, any way this can be reduced is welcome.<br />

Agile and ISO26262 both bring benefits but since<br />

they are based on different premises a considerable<br />

amount of thought and adaptation is required for<br />

them to work together efficiently. Agile can be<br />

thought of as ‘organic’, whereas ISO26262 is<br />

‘mechanical’.<br />

As an innovation, Agile has “greatly increased<br />

success rates in software development, improved<br />

quality and speed to market and boosted motivation<br />

and productivity in IT teams” [5]. For safety, the<br />

ISO26262 standard for Functional Safety in Road<br />

Vehicles is another advance, but in a different<br />

direction. While Agile methods and ISO26262 both<br />

aim to deliver high quality software that performs<br />

appropriately, the methodologies implied by the<br />

standard and Agile are not interchangeable [6]. This<br />

is highlighted by the fact that the most popular Agile<br />

method, Scrum, is defined in the 20-page Scrum<br />

Guide whereas the ISO26262 standard spans 450<br />

pages and details 600 practice requirements. It is<br />

clear that they approach software development quite<br />

differently [7,8].<br />

The question is, can Agile practices be used to<br />

speed software development, while at the same time<br />

complying with ISO26262, and if so, how?<br />

II. WHAT IS THE DIFFERENCE?<br />

The fundamental difference between ISO26262<br />

and Agile is that ISO26262 can be considered to be<br />



based on traditional mechanical engineering type<br />

practices applied to software development. In this<br />

case using techniques such as top down traceability<br />

provides ‘belt and braces’ safety. Agile, on the other<br />

hand, develops software in an emerging, organic<br />

fashion, which dispenses with some of the<br />

traditional practices that are considered<br />

cumbersome, wasteful and time consuming.<br />

The removal of the ‘braces’ from the ‘belt and<br />

braces’ approach is not a problem in, say, an<br />

infotainment system where failure would not be life<br />

threatening.<br />

When considering Agile in safety-critical projects<br />

there are some Agile techniques that will not<br />

compromise safety and can be used. These are<br />

described in the next section. However, some Agile<br />

techniques are not directly compatible with<br />

ISO26262 but once adapted can yield benefits.<br />

These are described in subsequent sections and<br />

further illustrated by two Use Case studies.<br />

III. WORKING SMART WITH AGILE<br />

Automotive organizations have been slow in<br />

exploiting Agile techniques when compared with<br />

others, such as medical device manufacturers or the<br />

US Department of Defense. The latter two have had<br />

regulator approved guidance on using Agile for<br />

some years [9,10]. These organizations recognize<br />

that Agile methods provide a number of proven<br />

techniques which speed development and improve quality [11,12].<br />

There are some practices, such as pair<br />

programming, that simply offer an alternative way<br />

of working and so are not discussed here. The<br />

following, however, have been demonstrated to<br />

offer major advantages [13,14]. This has also been the experience of the author in diverse regulated<br />

industries.<br />

• Time Boxed Iterations: Short iterations where<br />

each produces a demonstrable update to the<br />

product. Usually 2-4 weeks in duration.<br />

Iterations begin with a planning session and<br />

finish with a demonstration to stakeholders of<br />

what has been completed. Stakeholder<br />

feedback then helps to optimize any future<br />

work.<br />

• No Change Rule: This is a mechanism to<br />

manage the persistent issue of ‘feature creep.’<br />

The rule states that during an increment no<br />

changes are allowed that would endanger the<br />

goal of the increment. Applying this rule<br />

results in more considered change demands,<br />

that are then dealt with during subsequent<br />

increment planning sessions.<br />

• The Product Owner Role: Sometimes called<br />

an Initiative Owner this role provides a single<br />

voice representing all of the project’s<br />

stakeholders. To work effectively the holder<br />

of the role needs to be given the authority to<br />

decide precisely the content of the final<br />

product. A single, clearly defined Product<br />

Owner with real authority buffers developers<br />

from the conflicting demands of multiple<br />

stakeholders, thus enabling the development<br />

team to focus on their job - developing code.<br />

• Daily Stand-ups: These are short, focused,<br />

regular, daily communications meetings which<br />

keep the development team informed of each<br />

member’s progress and issues. Note that<br />

discipline is required to avoid these meetings<br />

drifting into solution discussions.<br />

• Continuous Integration and Regression<br />

Testing: Here developers integrate code into a<br />

shared repository several times daily. Each<br />

check-in is then verified by an automated<br />

build, allowing teams to detect any problems<br />

as they arise.<br />

• Code Refactoring: To ensure safety, it is not<br />

sufficient to simply provide code that works,<br />

rather the objective is to deliver code that is<br />

dependable. It is therefore necessary to<br />

reconsider the specific implementation of code<br />

modules and their designs, as the project<br />

proceeds and understanding evolves. Code<br />

Refactoring, must be planned and controlled.<br />

It can then strengthen the robustness and<br />

safety of a system.<br />

• Test First Development: Here unit tests are<br />

developed alongside every code unit. This<br />

complements Continuous Integration and<br />

builds confidence in the state and stability of<br />

the overall codebase, as development<br />

proceeds. It does not, however, replace the<br />

need for Requirements Based System Testing<br />

to validation the Requirements.<br />

181


IV. AGILE VS ISO26262<br />

The previous section outlined several practices<br />

that can be helpful in improving the speed of<br />

development. Equally important is to appreciate the<br />

fundamental areas of difference between Agile<br />

thinking and that behind ISO26262. Acknowledging<br />

and understanding these differences is necessary<br />

before a blended solution can be considered.<br />

This section highlights five key areas where the<br />

Agile methods and ISO26262 could be considered in<br />

conflict and proposes potential adaptations. This<br />

discussion is particularly relevant to the<br />

development of a Software Component as a Safety<br />

Element out of Context (SEooC). In this case real-life Requirements are not available, and Safety and<br />

functional Requirements are assumed (invented) by<br />

the engineering team.<br />

A. Analysis vs Emergence<br />

In the Agile method decisions as to how much<br />

documentation a project requires are made by the<br />

team itself. This contrasts with ISO26262, which prescribes the types and the content of the<br />

documents it requires. This difference need not of<br />

itself create a problem, as in theory, Agile would<br />

also allow a team to define and create an ISO26262<br />

compliant set of documents.<br />

The challenge, however, arises due to major<br />

differences in underlying philosophy. In Agile the<br />

thinking is that the needs (requirements) of a project<br />

cannot be fully captured at the start of a project,<br />

because at this stage people do not know exactly<br />

what they need. Thus, creating specifications and<br />

upfront documentation is a waste of time, since it is<br />

not known what will, and what will not, work<br />

[15,16]. The Agile way is to allow details to emerge<br />

during iterative development. Agile gurus dictate<br />

that “Scrum projects do not have an upfront analysis<br />

or design phase” [17].<br />

For a safety development, ISO26262 rightly<br />

presumes that a safety analysis and safety<br />

requirements are defined before further development<br />

occurs. Nevertheless, there is merit in the Agile<br />

concern that effort can be wasted defining details<br />

that will inevitably change. In this case, the spirit of<br />

the Agile thinking can be applied by creating the<br />

initial safety concepts iteratively. The key here is to<br />

cover the complete scope of the project in the<br />

documents but to limit the level of detail specified to<br />

help bring stakeholders to consensus as to what is to<br />

be delivered. Detail can be added to the areas as they<br />

are selected for further elaboration and development.<br />

The use of facilitated workshops can be<br />

particularly useful here because the safety concepts<br />

and requirements are based on expert judgement<br />

which can be rapidly explored in workshop settings<br />

[18].<br />

B. Requirements vs User Stories<br />

In the author's experience, Agile can cease to be<br />

Agile when it is elevated to dogma. This can lead to<br />

good Agile practices in one context being<br />

inappropriately applied elsewhere. This is true of<br />

User Stories. As Per Lundholm, an Agile coach,<br />

correctly says “Elephants are not giraffes and User<br />

Stories are not Requirements” [19]. Yes, the animals<br />

in this case are both large four-legged mammals that<br />

live in Africa, but they are not the same beast. In the<br />

same way User Stories are useful, but they are not<br />

Requirements.<br />

User Stories are particularly suitable when<br />

defining Human Computer Interactions (HCI) or for<br />

detailing how diagnostic tools should work for<br />

developers and integrators. They have clearly helped<br />

generate many appropriate and user-friendly<br />

applications.<br />

It is also important to acknowledge that User<br />

Stories are not the only form of Requirements<br />

definition necessary especially in the case of<br />

automotive software development. During<br />

architecture and design new requirements tend to<br />

emerge. These relate to how the software interacts<br />

with the other components and hardware and are<br />

‘Engineering Requirements’ that represent the shift<br />

from the problem in the outer world to the inner world<br />

of the machine [20]. That is, Engineering<br />

Requirements typically define the behaviors of subsystems<br />

and components, such as actuator and<br />

sensor operations or timing constraints for operating<br />

systems.<br />

Recognizing the difference between these<br />

Engineering Requirements and User Stories is<br />

critical to avoid confusion. User Stories, as typically<br />

derived from Agile techniques such as User<br />

Journeys and Personas, are often stated in the form<br />

of “As a … I need…in order to…”. This format does<br />

not work for Engineering Requirements, which are best<br />

specified using the “The system shall…” format<br />

www.embedded-world.eu<br />

advocated by Requirements Engineering standards<br />

such as IEEE 29148.<br />

As an illustration, the author has encountered the<br />

following kind of ‘User Story’ “As an interrupt<br />

handler, I need to prioritize and pass on interrupts<br />

according to the timing criteria, so that the system<br />

can react to interrupts”. This misguided thinking is<br />

what happens if User Stories are applied in a<br />

doctrinaire fashion. Rather than gaining clarity the<br />

opposite is achieved. The ‘story’ has become ‘story<br />

telling’. In fact the appropriate technique is to use<br />

the Engineering Requirements format.<br />

A further problem experienced relates to Design.<br />

If Requirements define ‘what’ the system behavior<br />

should be, then Design defines ‘how’ the<br />

Requirement will be achieved. A valid User Story<br />

might say “As a car owner I want to securely park<br />

my car using my smartphone”. In the Agile method<br />

this User Story is then broken down into smaller<br />

stories, one of which might be “to start the parking<br />

app I need to enter two randomly selected digits of<br />

my password”. Here developers have added a ‘how’<br />

disguised as a ‘what’, and created a User Story that<br />

is highly unlikely to have been requested by any<br />

user.<br />

This arises partially because, as mentioned<br />

before, Scrum does not have a design phase. Given<br />

that architecture and design are where safety<br />

countermeasures are defined, using this ‘pure’ Agile<br />

philosophy is clearly dangerous in a safety-critical<br />

development, and would be considered professional<br />

malpractice [13].<br />

With safety-critical projects, the overall Design<br />

does need to be documented early in the<br />

development. Where Agile thinking can be useful,<br />

just as with Safety Requirements, is in limiting<br />

detail during initial Design. By creating just enough<br />

of a Design Specification at the outset to span the<br />

full scope of the project, without covering the detail,<br />

it is possible to get agreement on the overall<br />

adequacy of the safety countermeasures. Details are<br />

then worked out as the project evolves.<br />

C. Traceability vs Frequent Feedback<br />

What is traceability? Traceability is being able to<br />

demonstrate completeness and go back through the<br />

steps used to develop the solution and thereby<br />

manage changes successfully. Normally Agile<br />

developers do not concern themselves with<br />

traceability at all. They rely on the Stakeholder<br />

Feedback Sessions at the end of iterations to ensure<br />

an optimal solution (which will often differ from the<br />

original conception). When changes are required<br />

User Stories describing the change are created and<br />

implemented. For products such as apps for mobile<br />

phones this is perfectly adequate. However for<br />

ISO26262 traceability is mandatory.<br />

One way of understanding how Agile<br />

development works is to compare it to a fluid, which<br />

is allowed to flow. In order to apply traceability, the<br />

fluid must stabilize and change state from liquid to<br />

solid, from water to ice. Increments can build parts<br />

of the solution but only once a part has reached a<br />

stable state can traceability be applied. Note that not<br />

all parts of an artifact (such as a Design<br />

Specification) will stabilize at the same time. To<br />

apply this in a blended solution requires clear<br />

criteria to be defined in order to know when<br />

stabilization has actually occurred for each type of<br />

artifact or part of an artifact, such as a design<br />

component [21].<br />

Attempting to apply traceability when<br />

development is still at the fluid stage, before<br />

stabilization has occurred, or retrospectively once<br />

the development is finished, expends time and effort<br />

without any appreciable benefit to safety.<br />

D. Agile Tools<br />

Because Agile has been so incredibly successful<br />

in certain applications it has created what might be<br />

termed ‘Agile Dazzle’. In the rush to adopt Agile<br />

organizations tend to buy in Agile toolsets without<br />

fully analyzing the different requirements of the<br />

software in their own field.<br />

Understandably, Agile tools, such as the popular<br />

Atlassian JIRA, are designed to support an Agile<br />

way of working [22,23]. Generally, these tools<br />

provide for the management of User Stories (and<br />

Bugs) but have no separate mechanisms for Design.<br />

The author has experienced how complex<br />

adapting such tools to regulated environments can<br />

be, often requiring additional commercial<br />

applications and plugins and subsequent writing of<br />

scripts to join it all together. In one organization<br />

modification to the JIRA product itself was needed<br />

to create the necessary management reports.<br />

Consequently, Agile toolsets will not work<br />

‘straight out of the box’ for software conforming to<br />

ISO26262. These additions also complicate the<br />

required Tool Confidence Analysis.<br />

Agile tools, used intelligently, do have a useful<br />

place in conformant developments. They are<br />

particularly suitable in managing the allocation of<br />

tasks in iterative development, as well as for defect<br />

resolution. Nevertheless, it is only once<br />

understanding has been acquired of the uses and<br />

limitations of Agile that these tools can be<br />

successfully integrated and applied. Experience<br />

shows that, until this is achieved, such tools risk<br />

complicating rather than simplifying the<br />

developer's task.<br />

E. Process vs People<br />

Automotive organizations have traditionally been<br />

hierarchical in structure, with accountability for<br />

success and failure resting solely on the shoulders of<br />

Management. In contrast, Agile methods promote<br />

flatter structures and push accountability down to<br />

development teams. In other words, Agile places its<br />

trust in people, whilst ISO26262 places it in<br />

processes. But processes don’t think. “Safety is<br />

demonstrated not by compliance with prescribed<br />

processes, but by addressing the hazards, mitigating<br />

those hazards and showing that the residual risk is<br />

acceptable” [25]. True safety arises out of acquired<br />

skills and experience of the Safety Engineers and<br />

Development Teams.<br />

Moving to an Agile way of working requires<br />

people to change how they work. This applies both<br />

to Management and to Development Teams.<br />

Managers need to relinquish some aspects of control<br />

and developers need to take on more responsibility.<br />

This is a significant shift in behavior and not<br />

surprisingly can give rise to resistance. In the<br />

author’s experience dealing with this factor cannot<br />

be overlooked when an automotive organization<br />

wishes to adopt Agile practices.<br />

V. APPROACHING A SOLUTION<br />

The following two examples are use cases<br />

demonstrating how a mindful blending of Agile<br />

methods and ISO26262 can be achieved that<br />

exploits the advantages of both. Firstly, an upgrade<br />

to an established product is examined and secondly a<br />

case of a new development within a known field of<br />

expertise.<br />

A. Upgrading an established product<br />

Here the situation considered is that of an already<br />

developed conformant product which is to be<br />

updated. In this case there is therefore ample<br />

knowledge of the original requirements and designs,<br />

traceability is in place, and the direction of the<br />

changes is already clear.<br />

The development could proceed without using<br />

Agile methods but an Agile approach provides<br />

appreciable gains. By running two cycles<br />

simultaneously the release date can be brought<br />

forward. Additionally, with careful planning,<br />

working interim releases can be made before the full<br />

change is implemented.<br />

In a two-cycle approach both cycles are managed<br />

using time-boxed iterations, stand-ups and show<br />

and tells. See Figure 1.<br />

• Change Management: Initial change planning<br />

begins before starting the update.<br />

• Safety Case: The Safety Case is required by<br />

ISO26262 and provides the evidence of<br />

conformance to the standard. It also contains<br />

technical arguments explaining how the<br />

technical approach delivers safe operation.<br />

• Design Cycle: There can be several design<br />

cycles each providing a stable group of<br />

designs, traceable to requirements that can be<br />

passed to the implementation cycle. Planning<br />

is crucial to create a workflow where these<br />

design groups are independent.<br />

• Implementation Cycle: This takes groups of<br />

finalized designs and constructs the product<br />

iteratively. This cycle includes updates to the<br />

Safety Case and eventually the final release.<br />

B. A new development<br />

When a development undertakes the creation of a<br />

new software product the direction of the final<br />

solution will be subject to uncertainties and stability<br />

will take time to materialize. This is then well suited<br />

to Agile methods that allow requirements, design<br />

and construction to influence each other. (Figure 2)<br />

By running Development Increments interspersed<br />

every so often with Stabilization Increments, a<br />

balance can be struck between Agility and<br />

Conformance. The key activities are:<br />

• Startup: A time-boxed stage usually 3-4 weeks<br />

in duration. Several facilitated workshops are<br />

held to scope and plan the work for the<br />

release. The initial topics such as failure<br />

analysis, requirements and architectures are<br />

created complete in scope but not in detail.<br />

• Development Increments: Each increment is<br />

typically 2-3 weeks in duration. Increments<br />

start with a planning session setting the<br />

increment goal and selecting the work to be<br />

undertaken and end with a demonstration of<br />

the functions developed in a show and tell.<br />

Daily stand-ups are held to provide frequent<br />

team communication with a retrospective<br />

completing the activities before starting the<br />

next iteration.<br />

• Stabilization Increments: When a sufficient<br />

amount of functionality is considered to be<br />

"done," a stabilization increment delivers the<br />

updated set of materials and assets for the<br />

work to date. These materials and assets are<br />

then both detailed and consistent with the<br />

implementation of the system and provide any<br />

conformant deliverables in an efficient<br />

manner. A stabilization increment would be<br />

expected between the fourth and sixth<br />

development increments.<br />

• Consolidation: This is where final testing is<br />

completed. Safety documents such as the<br />

safety manual, test completion report and<br />

safety case are finalized.<br />

• Release and Closure: This short step is where<br />

a decision to release or not is made. Following<br />

release the project materials and assets are<br />

baselined ready for further development and<br />

change management if required.<br />

VI. CONCLUSION<br />

Deep Shift, the World Economic Forum report,<br />

states that “The seamless integration of the physical<br />

and digital worlds through networked sensors,<br />

actuators, embedded hardware and software will<br />

change industrial models. In short, the world is<br />

about to experience an exponential rate of change<br />

through the rise of software and services” [26].<br />

This shift brings both opportunities and dangers<br />

for us all. Those of us working on embedded<br />

automotive software systems are at the forefront of<br />

bringing about these changes that societies will<br />

adapt to over the coming decades.<br />

This paper is an initial response to the challenge<br />

of meeting consumer demands during a time of<br />

accelerating change, whilst maintaining<br />

safety. It proposes that this is possible when it is<br />

recognized that Agile methods are flexible as indeed<br />

is ISO26262. Just as ice and water are one<br />

substance in two different states, in a similar way it<br />

is possible for the two approaches to complement<br />

each other.<br />

People also need to change patterns of behavior<br />

and thinking. Evangelical Agilists will have to let<br />

go of Agile as dogma, automotive engineers of long<br />

standing to recognize the advantages of Agile as<br />

well as the responsibility it entails, managers to<br />

replace reliance on process with trust in their<br />

developers.<br />

Furthermore, the question could usefully be<br />

asked as to whether ISO26262 itself needs to be<br />

more Agile. Meanwhile, using the adaptations outlined<br />

here, it is already possible to derive Agile benefits<br />

while remaining true to ISO26262.<br />

REFERENCES<br />

[1] Xavier Mosquet, Massimo Russo, Kim Wagner, Hadi Zablit,<br />

and Aakash Arora. Boston Consulting Group. Accelerating Innovation:<br />

New Challenges for Automakers, January 22, 2014.<br />

[2] Hillary Abraham, Bryan Reimer, Bobbie Seppelt, Craig Fitzgerald,<br />

Bruce Mehler & Joseph F. Coughlin, Massachusetts Institute of<br />

Technology, Consumer Interest in Automation: Preliminary<br />

Observations. Exploring a Year’s Change. 2017<br />

[3] C. Binder (Microsoft GmbH), T.Hemmer (conolement AG), S.Kukn<br />

(Porsche Consulting GmbH) and C.Mies (Electrobit Automotive<br />

GmbH), Microsoft White Paper. Adaptive Automotive Development.<br />

[4] Sergej Weber. Kugler Maag CIE. May 2015 Agile in Automotive – State<br />

of Practice 2015.<br />

[5] Darrell. K. Rigby, Jeff Sutherland, Hirotaka Takeuchi, Harvard Business<br />

Review. May 2016 Embracing Agile.<br />

[6] Steve Palmquist, Mary Ann Lapham, Suzanne Garcia-Miller, Timothy<br />

A. Chick, Ipek Ozkaya, Software Engineering Institute Parallel Worlds:<br />

Agile and Waterfall Differences and Similarities CMU/SEI-2013-TN-<br />

021<br />

[7] Ken Schwaber and Jeff Sutherland November 2017 The Scrum Guide.<br />

Scrum.org<br />

[8] ISO 26262:2011 Road vehicles -- Functional safety, parts 1-10,<br />

International Standards Organisation.<br />

[9] IR45:2012. Association for the Advancement of Medical<br />

Instrumentation. Guidance on the use of agile practices in the<br />

development of medical device software. ISBN 1570204454.<br />

[10] Kathleen Mayfield, Robert Benito, Michelle Casagni. 2010 Mitre<br />

Corporation, Handbook for Implementing Agile in Department of<br />

Defence, Information Technology Acquisition.<br />

[11] QSM. 2009 Beyond the hype: Measuring and Evaluating Agile<br />

Development, white paper.<br />

[12] Standish Group. 2015 Chaos report<br />

[13] Bertrand Meyer. Agile! The Good, the Hype and the Ugly Springer.<br />

ISBN 9783319051543<br />

[14] Scott. W. Ambler, Mark Lines. Disciplined Agile Delivery: A<br />

Practitioner's Guide to Agile Software Delivery in the Enterprise, 2012<br />

IBM Press, ISBN 9780132810135<br />

[15] Ken Schwaber; Jeff Sutherland . Software in 30 Days: How Agile<br />

Managers Beat the Odds, Delight Their Customers, And Leave<br />

Competitors In the Dust., John Wiley & Sons, 2012<br />

[16] D. Snowden, M.Boone. A leader’s framework for decision making.<br />

Harvard Business Review. November 2007<br />

[17] M. Cohn. User Stories Applied. Addison-Wesley Professional, 2004<br />

[18] Roman Pichler Agile Product Management with Scrum: Creating<br />

Products that Customers Love. Addison-Wesley Professional March 22,<br />

2010<br />

[19] Per Lundholm. User Stories are not Requirements. Blog post, 2016.<br />

Chrisp.se<br />

[20] M Jackson Problem Frames., Proceedings 14th Asia-Pacific Software<br />

Engineering Conference (APSEC 2007)<br />

[21] K. Collyer, J. Manzio Being agile while still being compliant, INCOSE<br />

2013. Annual Systems Engineering Conference 2013 (ASEC2013).<br />

[22] Scrum, documentation and the IEC 61508-3:2010 software standard<br />

[23] Thomas E. Murphy, Mike West, Keith James Mann . Magic Quadrant<br />

for Enterprise Agile Planning Tools. April 2017 Gartner Inc.<br />

[24] VersionOne. 11th Annual State of AgileTM Report. April 2017.<br />

[25] A. Ray Acceptable and residual risk. Quoted in C.Hobbs Embedded<br />

Software Development for Safety Critical systems, CRC Press 2016.<br />

[26] Deep Shift. Technology Tipping Points and Societal Impact. World<br />

Economic Forum. Report. September 201<br />



AUTOSAR –<br />

Development of a New C++ Standard<br />

Dr. Frank van den Beuken<br />

Senior Technical Consultant, Programming Research<br />

Ashley Park House, 42-50 Hersham Road, Walton on Thames<br />

Surrey, KT12 1RZ, United Kingdom<br />

Frank_van_den_Beuken@prqa.com<br />

Abstract—MISRA C++ [1], the most widely adopted C++<br />

coding standard in the automotive industry, has not been updated<br />

since its publication in 2008. It does not cover C++ language<br />

features introduced in the later ISO C++ standards published in<br />

2011 (C++11 [7]) and 2014 (C++14 [2]). In March 2017, the<br />

AUTOSAR (Automotive Open System Architecture) partnership<br />

released a new coding standard, “Guidelines for the use of the<br />

C++14 language in critical and safety-related systems”, which was<br />

updated in October 2017 [4].<br />

The AUTOSAR standard incorporates existing rules and<br />

guidelines from other standards such as MISRA C++ and High<br />

Integrity C++ [5]. These have been reviewed, modified and<br />

extended. The AUTOSAR standard provides a comprehensive set<br />

of guidelines for using the modern C++ language in safety-critical<br />

systems. We will compare the standard with other popular C++<br />

coding standards and explain its relationship with the automotive<br />

functional safety standard ISO 26262 [6].<br />

Keywords—Software engineering; AUTOSAR; MISRA; safety;<br />

coding standards; compliance<br />

I. INTRODUCTION<br />

Software development is increasingly important for<br />

automotive applications. Increasingly demanding safety,<br />

environmental, and convenience requirements have sharply<br />

increased the number of electronic systems found in vehicles.<br />

Ninety percent of all innovations are based on software-driven<br />

electronic components. These components account for up to<br />

forty percent of a vehicle’s development cost. The pace of<br />

development and the continual need to integrate more functions<br />

and control units, pose a significant challenge for vehicle<br />

manufacturers. This paper gives a brief overview of the new<br />

AUTOSAR Coding Guidelines and offers guidance on how to<br />

comply with them.<br />

II. WHAT IS AUTOSAR?<br />

AUTOSAR (AUTomotive Open System ARchitecture)<br />

aims to standardize and future-proof basic software elements,<br />

interfaces and bus systems, to help vehicle manufacturers<br />

manage growing system complexity while keeping costs down.<br />

It develops standardized open software architectures for<br />

automotive Electronic Control Units (ECUs).<br />

As a partnership of over 180 automotive manufacturers,<br />

automotive suppliers, tool vendors and semiconductor vendors,<br />

AUTOSAR’s core members include: BMW, Bosch,<br />

Continental, Daimler, Ford, GM, PSA, Toyota and Volkswagen.<br />

The first open architecture developed by AUTOSAR, the<br />

‘Classic Platform’, is intended for vehicle functions with strict<br />

real-time requirements and safety criticality, implemented on<br />

basic microcontrollers. Now, AUTOSAR has developed a new<br />

standard called the ‘Adaptive Platform’ for connected and<br />

autonomous vehicles. This is intended to meet the rapidly<br />

growing market needs for connected vehicle and highly<br />

autonomous driving technologies. Examples of technologies<br />

driving the adaptive platform standard include: high-powered<br />

32-/64-bit microprocessors with external memory, parallel<br />

processing and high bandwidth communications.<br />



Fig. 1 AUTOSAR platforms – classic and adaptive<br />

Software developed according to the Adaptive Platform<br />

standard can integrate with existing systems built according to<br />

the AUTOSAR ‘Classic Platform’ standard.<br />

The Classic Platform explicitly allowed for implementations<br />

in C, C++ and Java, but C was the dominant programming<br />

language used. Now, the APIs within the Adaptive Platform are<br />

defined in C++, suggesting that AUTOSAR views C++ as the<br />

language of choice for new Adaptive Platform components.<br />

C and C++ are the dominant programming languages used<br />

for automotive embedded systems. This is largely because they<br />

permit direct, deterministic control of hardware, and give<br />

flexibility to the developer. This also brings risk. It is possible to<br />

compile code that has undefined behavior, or code that is not<br />

guaranteed to behave the same way when compiled and run on<br />

different target hardware. Even the most experienced developer<br />

can introduce defects inadvertently.<br />

III. WHAT ARE THE AUTOSAR CODING GUIDELINES?<br />

In order to help ensure the safety and security of the code<br />

written by implementers of AUTOSAR software, AUTOSAR<br />

invited PRQA to become a development partner, and join the<br />

working group to develop the “Guidelines for the use of the<br />

C++14 language in critical and safety-related systems” (the<br />

‘Guidelines’). As the exclusive static analysis<br />

partner in AUTOSAR we have contributed our expertise in the<br />

C++ programming language and best-practice software<br />

development gained over the last 30 years.<br />

The AUTOSAR Guidelines specify 342 coding rules. 154 of<br />

these are adopted directly from the widely adopted MISRA C++<br />

standard. 131 are based on rules defined in other well-known<br />

coding standards, such as PRQA’s High Integrity C++. 57 are<br />

based on research or other resources. The Guidelines permit<br />

some of the language features prohibited by some previous<br />

standards. Examples include dynamic memory management,<br />

exceptions, inheritance, templates and virtual functions. There<br />

are rules to ensure that these language features are used only in<br />

a safe manner.<br />

One of the principles of AUTOSAR development is to<br />

validate specifications in parallel with the standardization. The<br />

Adaptive Platform is validated through an AUTOSAR internal<br />

implementation, written in C++, known as the Adaptive<br />

Platform Demonstrator. AUTOSAR used the advanced<br />

QA·C++ analysis tool from PRQA, the exclusive static analysis<br />

development partner for AUTOSAR, to ensure the quality of the<br />

Demonstrator source code and verify compliance with the<br />

coding guidelines.<br />

IV. WHY ARE THE AUTOSAR CODING GUIDELINES NEEDED?<br />

Prior to the AUTOSAR Guidelines, there was no appropriate<br />

coding standard available for the use of modern C++ standards<br />

(C++11 and C++14) in safety-critical software. Available<br />

standards were either incomplete, written for legacy C++<br />

standards, or were not applicable for safety-critical applications.<br />

The most widespread C++ coding standard in the automotive<br />

industry, MISRA C++:2008 [1] was written for C++03 [7],<br />

which is over 14 years old.<br />

There have been a number of changes since the introduction<br />

of C++03 which have reduced the relevance of the MISRA<br />

standard for the AUTOSAR project:<br />

1. Evolution of C++<br />

2. Compiler improvements<br />

3. Improvements to testing, verification and analysis tools<br />

4. Creation of the ISO 26262 Vehicle Functional Safety<br />

Standard<br />

5. Assimilation of a broader base of safety and security<br />

expertise into additional standards such as:<br />

High Integrity C++ [5]<br />

Joint Strike Fighter Air Vehicle C++ [8]<br />

CERT C++ [9]<br />

C++ Core Guidelines [10]<br />

AUTOSAR designed the Guidelines to be used as an<br />

extension to the existing MISRA C++ standard. It specifies new<br />

rules and updates to MISRA rules as well as stating which<br />

MISRA rules are obsolete.<br />

V. WHO WILL USE THE AUTOSAR CODING GUIDELINES?<br />

The Objectives section of the Guidelines states: “The main<br />

application sector is automotive, but it can be used in other<br />

embedded application sectors. The AUTOSAR C++14 Coding<br />

Guidelines addresses high-end embedded microcontrollers that<br />

provide efficient and full C++14 language support, on 32- and<br />

64-bit microcontrollers, using POSIX or similar operating<br />

systems.”<br />

PRQA recommends, therefore, that any organization<br />

developing embedded software in C++14 should consider using<br />

these Guidelines.<br />

VI. HOW DO THE AUTOSAR CODING GUIDELINES COMPARE TO OTHER CODING STANDARDS?<br />

A. Traceability to existing standards<br />

Appendix A of the AUTOSAR Coding Guidelines document<br />

gives details about the traceability of the guidelines to five<br />

widely adopted C++ coding standards: MISRA C++, High<br />

Integrity C++ 4.0, JSF, SEI CERT C++ and the C++ Core<br />

Guidelines.<br />

For each rule of these standards it is established how it relates<br />

to the AUTOSAR Guidelines. A rule can be categorized as:<br />

1. Identical (only for MISRA C++): the rule text,<br />

rationale, exceptions, code example are identical. Only<br />

the rule classification can be different. There can be<br />

also an additional note with clarifications.<br />

2. Small differences: the content of the rule is included by<br />

AUTOSAR Guidelines rules with minor differences.<br />

3. Significant differences: the content of the rule is<br />

included by AUTOSAR Guidelines with significant<br />

differences.<br />

4. Rejected: the rule in the referred document is rejected<br />

by AUTOSAR Guidelines.<br />

5. Not yet analyzed: at the time of release of the<br />

Guidelines, the review of all standards was incomplete,<br />

so a number of rules are still to be analyzed.<br />

The chart below gives a summary of the comparison.<br />

TABLE 1 SUMMARY OF TRACEABILITY TO EXISTING STANDARDS<br />

[Chart: per-standard counts of rules across the traceability categories 1 - Identical, 2 - Small differences, 3 - Significant differences, 4 - Rejected, 5 - Not yet analyzed. C++ Core Guidelines: 160, 44, 120, 81; SEI CERT C++: 36, 24, 29, 62; JSF: 105, 20, 47, 53; High Integrity C++: 101, 20, 20, 11; MISRA C++: 146, 35, 26, 20.]<br />



Because the Guidelines are based on MISRA C++, it could<br />

be expected that this is where the largest overlap can be seen.<br />

The second largest overlap is with High Integrity C++ followed<br />

by JSF, C++ Core Guidelines and finally SEI CERT C++. It<br />

must be noted, however, that CERT C++ has the largest portion<br />

of rules that still need to be analyzed which may change its<br />

position relative to the other standards. In the following sections,<br />

we will discuss the comparison in more detail for each standard<br />

and also how the AUTOSAR Guidelines relate to ISO 26262.<br />

B. Traceability to MISRA C++<br />

The AUTOSAR Guidelines disagree with MISRA C++ on a<br />

number of topics. One significant topic is “single point of exit”.<br />

The Guidelines argue that this rule often leads to code that is<br />

harder to read, maintain or test. Furthermore, the C++ language<br />

provides exceptions, and throwing an exception that is not<br />

caught in the same function, also stops execution of that function<br />

which constitutes an exit point. MISRA C++ treats this as an<br />

exception to the rule. AUTOSAR Guidelines take the view that<br />

multiple exit points are acceptable, not least because the introduction<br />

of extra variables for result values can cause dataflow anomalies.<br />
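The contrast can be sketched in a few lines of C++14; the function names and logic here are illustrative only, not taken from either standard.<br />

```cpp
#include <string>

// Early-return style (acceptable under the AUTOSAR Guidelines):
// each failed precondition exits the function immediately.
int parse_digit(const std::string& s) {
    if (s.size() != 1) { return -1; }             // exit point 1
    if (s[0] < '0' || s[0] > '9') { return -1; }  // exit point 2
    return s[0] - '0';                            // exit point 3
}

// Single-exit equivalent (MISRA C++ style): the extra `result`
// variable must be tracked through every branch -- the kind of
// dataflow anomaly the Guidelines warn about.
int parse_digit_single_exit(const std::string& s) {
    int result = -1;
    if ((s.size() == 1) && (s[0] >= '0') && (s[0] <= '9')) {
        result = s[0] - '0';
    }
    return result;
}
```

Both functions behave identically; the difference is only in how the control flow is expressed.<br />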

Another topic where the Guidelines deviate from MISRA<br />

C++ is dynamic memory management. This was forbidden by<br />

MISRA C++, but is allowed by the Guidelines. Instead it<br />

introduces new rules to prevent issues that may arise from using<br />

dynamic memory, such as memory leaks, memory<br />

fragmentation, invalid memory access, erroneous memory<br />

allocations and non-deterministic execution time of memory<br />

allocation and deallocation.<br />
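As a hedged illustration of the RAII-based style that such rules point toward (the types and names below are invented for this example, not drawn from the Guidelines), smart pointers address leaks and double-frees, while up-front reservation is one common tactic against non-deterministic allocation timing:<br />

```cpp
#include <memory>
#include <vector>

struct Sample { int value; };

// Ownership via std::unique_ptr: the object is freed automatically
// when the pointer goes out of scope, preventing memory leaks.
std::unique_ptr<Sample> make_sample(int v) {
    return std::make_unique<Sample>(Sample{v});
}

// Reserving capacity up front confines allocation to one known point,
// so the push_back loop performs no further allocations.
std::vector<int> collect(int n) {
    std::vector<int> out;
    out.reserve(static_cast<std::size_t>(n));  // single allocation here
    for (int i = 0; i < n; ++i) {
        out.push_back(i);                      // no allocation in the loop
    }
    return out;
}
```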

C. Traceability to High Integrity C++ 4.0<br />

HIC++ has a number of rules about code metrics and coding<br />

style, on which the Guidelines impose no limitation. In addition,<br />

the Guidelines allow two levels of pointer indirection where<br />

HIC++ only allows one. On the other hand, the Guidelines<br />

require use of noexcept, protected and =delete in more places.<br />
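A minimal sketch of those constructs (the class is hypothetical, chosen only to show the syntax):<br />

```cpp
#include <utility>

// =delete makes copying ill-formed at compile time rather than a
// runtime surprise; noexcept documents and enforces that construction,
// moves and queries cannot throw.
class Buffer {
public:
    Buffer() noexcept = default;
    explicit Buffer(int n) noexcept : size_(n) {}
    Buffer(const Buffer&) = delete;             // copying forbidden
    Buffer& operator=(const Buffer&) = delete;
    Buffer(Buffer&& other) noexcept : size_(other.size_) { other.size_ = 0; }
    int size() const noexcept { return size_; }
private:
    int size_{0};
};
```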

HIC++ heavily restricts use of the preprocessor; it may only<br />

be used for file inclusion and include guards. The Guidelines<br />

allow conditional file inclusion and use of path specifiers in<br />

include statements.<br />
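The permitted uses can be shown in a short header sketch (the file name, guard macro and constant are invented for illustration):<br />

```cpp
// sensor_config.hpp (hypothetical): the include guard and #include
// lines are the only preprocessor uses HIC++ permits; the conditional
// inclusion below is additionally allowed by the AUTOSAR Guidelines.
#ifndef SENSOR_CONFIG_HPP
#define SENSOR_CONFIG_HPP

#if defined(TARGET_POSIX)      // conditional file inclusion
#include <ctime>
#endif

constexpr int kMaxSensors = 8; // constants as constexpr, not #define

#endif // SENSOR_CONFIG_HPP
```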

D. Traceability to JSF<br />

JSF contains a number of rules about source code layout and<br />

naming conventions, which the Guidelines do not address. JSF<br />

also does not allow use of exceptions, for which the Guidelines<br />

provide rules. On the other hand, JSF does allow some form of<br />

multiple inheritance, where the Guidelines only allow it for<br />

implementing multiple interfaces. JSF does not allow type<br />

casting on pointer types, whereas the Guidelines only forbid<br />

casting pointer types to integral types. JSF allows two<br />

levels of pointer indirection where the Guidelines only allow<br />

one.<br />

E. Traceability to CERT C++<br />

CERT C++ provides rules for use of C memory allocation<br />

and IO functions, errno, variadic arguments, and other<br />

functions that are all prohibited by the Guidelines.<br />

CERT C++ allows defining function macros as long as there are<br />

no side effects in the arguments, whereas the Guidelines forbid<br />

defining function macros.<br />
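The hazard behind that prohibition can be demonstrated in a few lines (the macro and function names are the author's own, not taken from either standard): a function-like macro pastes its argument into each occurrence in the expansion, so a side effect in the argument fires more than once, whereas a constexpr function evaluates its argument exactly once and is type-checked.<br />

```cpp
// Macro form: the argument expands twice inside the body.
#define SQUARE_MACRO(x) ((x) * (x))

// Safe replacement: one evaluation, full type checking.
constexpr int square(int x) { return x * x; }

int call_count = 0;

int next_value() {
    ++call_count;   // observable side effect
    return 3;
}

int macro_result()    { return SQUARE_MACRO(next_value()); } // runs next_value() twice
int function_result() { return square(next_value()); }       // runs next_value() once
```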

F. Traceability to C++ Core Guidelines<br />

The Core Guidelines allow the use of multiple inheritance to<br />

represent the union of implementation attributes, where the<br />

AUTOSAR Guidelines prohibit multiple inheritance. The<br />

AUTOSAR Guidelines are also stricter on the use of virtual<br />
inheritance, which they allow only for diamond hierarchies. The<br />
Core Guidelines include rules for the use of concepts, which are<br />
not covered by the Guidelines because concepts are not part of<br />
any ISO C++ language standard.<br />

G. Relationship with ISO 26262<br />

ISO 26262 is a Functional Safety standard, entitled “Road<br />

vehicles – Functional Safety”. The standard is derived from the<br />

Functional Safety standard IEC 61508 titled “Functional safety<br />

of electrical/electronic/programmable electronic safety-related<br />

systems”. As such, it covers all aspects of system development,<br />

and is not a coding standard. Part 6 [6] exclusively covers<br />
software; it does not prescribe the use of any specific<br />
programming language, but merely specifies compliance tables<br />

with recommendations for the use of certain methods in<br />

software development for each automotive safety integrity level<br />

(ASIL). The ASIL can range from A to D where D has the<br />

strictest requirements (highest integrity level) and A has the<br />

least. The ASIL is determined by performing a risk analysis of<br />

a potential hazard by looking at the Severity, Exposure and<br />

Controllability of the vehicle operating scenario. For each ASIL<br />

the recommendation is one of:<br />
“o”: there is no recommendation for or against use of the method;<br />
“+”: the method is recommended;<br />
“++”: the method is highly recommended.<br />

Each compliance table method is identified by a number and<br />
a letter. For each number, a suitable combination of methods<br />
with that number needs to be implemented. So it is possible to be compliant while not (fully)<br />

implementing each listed method, but then a rationale shall be<br />

given that the selected combination of methods complies with<br />

the corresponding requirement.<br />

A number of the compliance table methods can be<br />

implemented by following coding standard rules, so enforcing<br />

a coding standard is an effective means in complying with the<br />

ISO 26262 safety standard. The AUTOSAR Guidelines cover<br />

four compliance tables:<br />

Table 1 — Topics to be covered by modelling and<br />

coding guidelines<br />

Table 3 — Principles for software architectural design<br />

Table 8 — Design principles for software unit design<br />

<br />

and implementation<br />

Table 9 — Methods for the verification of software<br />

unit design and implementation<br />

The coverage is most apparent for table 8 of which all<br />

methods correspond with one or more rules from the<br />

Guidelines:<br />

190


TABLE 2 ISO 26262-6 TABLE 8 — DESIGN PRINCIPLES FOR<br />
SOFTWARE UNIT DESIGN AND IMPLEMENTATION<br />
<br />
Method | ASIL A | B | C | D<br />
1a. One entry and one exit point in subprograms and functions | ++ | ++ | ++ | ++<br />
1b. No dynamic objects or variables, or else online test during their creation | + | ++ | ++ | ++<br />
1c. Initialization of variables | ++ | ++ | ++ | ++<br />
1d. No multiple use of variable names | + | ++ | ++ | ++<br />
1e. Avoid global variables or else justify their usage | + | + | ++ | ++<br />
1f. Limited use of pointers | o | + | + | ++<br />
1g. No implicit type conversions | + | ++ | ++ | ++<br />
1h. No hidden data flow or control flow | + | ++ | ++ | ++<br />
1i. No unconditional jumps | ++ | ++ | ++ | ++<br />
1j. No recursions | + | + | ++ | ++<br />

Note that in the table above, for a number of methods there<br />
are some differences; method 1a requires one exit point, where<br />
the Guidelines allow more, but they do forbid the use of setjmp and<br />
longjmp to bypass the normal function call mechanism. Also,<br />

the Guidelines allow dynamic memory management under<br />

some conditions whereas method 1b forbids dynamic objects,<br />

but it can be argued that the rules provided implement “online<br />

test during creation”. In the other compliance tables there are<br />

also methods regarding limiting complexity and restricting<br />

hierarchy, size and dependencies, but the Guidelines impose<br />

no limitations on code metrics. Similarly, there are methods<br />

recommending use of style guides and naming conventions,<br />

which is not required by the Guidelines.<br />

VII. HOW DO I ENSURE MY CODE COMPLIES WITH THE AUTOSAR GUIDELINES?<br />

Traditionally, engineers conducted laborious manual code<br />

reviews to ensure code had been written according to their<br />

chosen standard. This process was error-prone and did not scale<br />

to handle today’s large, complex code bases. Fortunately, these<br />

checks can now be automated using tools. A ‘static analyzer’ is<br />

a tool designed for this purpose. A static analyzer not only<br />

reports violations of coding rules, but also performs a deep code<br />
inspection to highlight any undefined, unspecified, or compiler-dependent<br />
behavior. It analyzes all the possible execution paths<br />

of the program to flag potential runtime issues. Often it can find<br />

issues that are not found by testing because it is rarely practical<br />

for tests to cover all possible execution paths. A static analyzer<br />

is an essential component of the toolset used for the development<br />

of safe, secure and reliable software.<br />

AUTOSAR’s use of PRQA’s static analysis tool, QA·C++,<br />

to ensure quality of its Demonstrator source code and its<br />

compliance to the coding guidelines has provided valuable<br />

insights. These insights, combined with PRQA’s contribution to<br />

the Guidelines, have enabled the development of the only static<br />

analysis solution that is optimized for AUTOSAR-compliant<br />

software development.<br />

PRQA’s AUTOSAR Compliance Module extends QA·C++<br />

for out-of-the-box compliance with the AUTOSAR Guidelines.<br />

For medium to large development teams the solution may be<br />

further enhanced with PRQA’s code quality management<br />

control center, QA·Verify. This guarantees that all team<br />

members consistently apply the coding guidelines in addition to<br />

tracking and reporting code quality for the duration of the<br />

project.<br />

VIII. SUMMARY<br />

The AUTOSAR standard will serve as a platform upon<br />

which future vehicle applications will be implemented by<br />

minimizing the current barriers between functional domains.<br />

The standard will achieve this by making it possible to map<br />

functions and functional networks to different control nodes in<br />

the system, almost independently from the associated hardware.<br />

Although developed for the automotive industry, these<br />

guidelines can also be used by any other organization or sector<br />

which uses C++14 to develop embedded software. In any<br />

application, the use of the PRQA static analysis tool, QA·C++,<br />

will ensure that the code is error-free and that it complies with<br />

the coding guidelines.<br />

REFERENCES<br />

[1] MISRA C++:2008 Guidelines for the use of the C++ language in critical<br />

systems, The Motor Industry Software Reliability Association, June 2008<br />

[2] ISO/IEC 14882:2011, ISO International Standard ISO/IEC<br />

14882:2011(E) — Programming Language C++, International<br />

Organization for Standardization, September 2011<br />

[3] ISO/IEC 14882:2014, ISO International Standard ISO/IEC<br />

14882:2014(E) — Programming Language C++, International<br />

Organization for Standardization, December 2014<br />

[4] Guidelines for the use of the C++14 language in critical and safety-related<br />

systems, Automotive Open System Architecture, October 2017<br />

[5] High Integrity C++ Coding Standard Version 4.0,<br />

http://www.codingstandard.com, PRQA, October 2013.<br />

[6] ISO 26262-6, Road vehicles — Functional safety - Part 6: Product<br />

development at the software level, International Organization for<br />

Standardization, November 2011.<br />

[7] ISO/IEC 14882:2003, ISO International Standard ISO/IEC<br />
14882:2003(E) — Programming Language C++, International<br />
Organization for Standardization, October 2003<br />

[8] Joint Strike Fighter Air Vehicle C++ Coding Standards for the System<br />

Development and Demonstration Program, Document Number<br />

2RDU00001 Rev C, Lockheed Martin Corporation, December 2005.<br />

[9] SEI CERT C++ Coding Standard,<br />

https://www.securecoding.cert.org/confluence/pages/viewpage.action?pageId=637,<br />
Software Engineering Institute Division at Carnegie Mellon<br />

University, 2017<br />

[10] Bjarne Stroustrup, Herb Sutter, C++ Core Guidelines,<br />

http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines,<br />

December 2017<br />



Automotive Software Solutions for<br />

Complex Safety-Certified Designs of the Future<br />

Daniel Bernal<br />

Automotive Business Segment<br />

Arm, Inc.<br />

Chandler, AZ, U.S.A.<br />

Daniel.Bernal@arm.com<br />

Abstract — Automotive Original Equipment Manufacturers<br />

(OEMs) and Tier 1 suppliers have recognized that they are in the<br />

middle of a technology revolution. They spend over $100 billion<br />

in R&D including the training required for highly skilled<br />

software development resources. This paper describes the<br />

necessary elements of a mature software development and runtime<br />

software stack required to meet the strict demands of<br />

functional safety (ISO 26262) and also meet the standards for a<br />

common software infrastructure (AUTOSAR). The breadth of<br />

today’s standards-based, safety-certifiable solutions is well<br />

positioned to support even the most complex automotive<br />

Electronic Control Unit (ECU) use-cases, including the trend to<br />

support ECU consolidation (mixed-criticality).<br />

I. INTRODUCTION<br />

The modern automobile is transforming into a complex<br />

System of Systems (SoS) with many sensors, actuators, and<br />

intelligent compute platforms. Automotive system<br />

architectures are increasing in complexity as OEMs add<br />

Advanced Driver Assistance System (ADAS) features, which<br />

are now approaching autonomous drive functionality.<br />

International functional safety standards, such as ISO 26262,<br />

define the key components of qualifying hardware and<br />

software in automotive equipment. These apply throughout the<br />

defined lifecycle of all the automotive electronics and<br />

electrical safety-related systems. Standards such as<br />

AUTOSAR provide for a common software infrastructure for<br />

automotive systems to achieve modularity, scalability,<br />

transferability, and reusability. This helps OEMs and Tier 1<br />

suppliers preserve their investment in software.<br />

I. COMPLEXITY OF ECU DESIGNS<br />

Vehicle designs are rapidly approaching 100+ ECUs.<br />

This presents a huge challenge in software development and<br />

systems integration. Similarly, over a decade ago the<br />

commercial avionics industry began to standardize the<br />

consolidation of applications of differing criticality levels.<br />

This is referred to as Integrated Modular Avionics (IMA).<br />

The automotive industry is moving in the same direction with<br />

mixed-criticality platform designs. Vehicle cockpit controller<br />

platform designs and autonomous drive compute platforms are<br />

taking advantage of the sophisticated features in hardware and<br />

software to support mixed-criticality computing. This trend of<br />

consolidation will inevitably continue as long as modern<br />

hardware features and the software ecosystem support the ability<br />

to easily consolidate multiple applications. This paper will<br />

explain how today's automotive software ecosystem solutions<br />

are well positioned to support the evolving requirements in the<br />

automotive industry. It will detail how the breadth and depth<br />

of the automotive software ecosystem, including safety-separation<br />
solutions, real-time operating systems (RTOS), and<br />

software tools can support traditional automotive ECU and<br />

consolidated ECU platform (mixed-criticality systems)<br />

designs. [1][2][3]<br />

II. FUNCTIONAL SAFETY AND SECURITY ENGINEERING<br />

ISO 26262 outlines a systems engineering approach for<br />

the design of functionally safe electronic systems in road<br />

vehicles. OEMs and Tier 1 suppliers must account for this as<br />

systems are redesigned or new features are included in new<br />

models through the years. One simple example of this is the<br />

trend to replace side view mirrors with side view cameras and<br />

displays. This type of redesign makes for a cleaner exterior<br />

design but also comes at the expense of additional safety<br />

requirements on the vehicle electronics. These are the type of<br />

trade-offs that manufacturers must consider in new designs.<br />

ISO 26262 outlines a classification scheme for Automotive<br />

Safety Integrity Levels (ASIL). During the hazard analysis and<br />

risk assessment for a new ECU, an OEM will capture and<br />

classify safety goals. A safety goal is determined for every<br />

possible hazardous event. Hazardous events are classified<br />

with an ASIL.<br />

The ISO 26262 functional safety standard adopts a<br />

systems development lifecycle commonly referred to as the<br />

“V” model. Fig. 1 shows the “V” model with classes of tools<br />

and pre-qualified software elements that can be leveraged at<br />

different stages of a safety-qualified design.<br />



Fig. 1 “V”-Model Project Development Lifecycle<br />

The automotive industry, just like the industrial Internet<br />

of Things (IoT), is trending toward an engineering approach<br />

that integrates functional safety and security for product<br />

development throughout the lifecycle of a hardware and<br />

software integrated system. Functionally safe system design and<br />

secure architecture system design are similar in process<br />

discipline. Both are based on a systems engineering approach<br />

with a configuration-managed set of evolving requirements,<br />

models, analysis, designs, and test/validation plans. These<br />

configuration-managed products that are produced as a result<br />

of the process are often referred to as “artifacts”. Both safety<br />

engineering and security engineering organizations follow a<br />

disciplined approach to identify, analyze, reuse, specify, verify<br />

and validate goals and requirements. This is referred to as<br />

requirements engineering. [4]<br />

Having tools to support the development of these artifacts<br />

is instrumental to the efficiency of a process driven<br />

organization. These tools are referred to as “requirements<br />

engineering” or “requirements management tools”. ISO<br />

26262 mandates traceability of testing back to requirements.<br />

The ability to automatically generate reports that show testing<br />

coverage back to requirements helps tremendously in process<br />

efficiency. Tooling can be used to improve efficiency not<br />
only for requirements engineering but also for the<br />
design, test, and validation phases.<br />

III. SIMULATION FOR A “SHIFT-LEFT” STRATEGY<br />

OEMs are increasingly challenged by more complex<br />

electrical/electronic (E/E) vehicle designs and more feature rich<br />

ECUs. In addition, competitive pressure has forced OEMs to<br />

shorten development schedules. As a result, Tier 1 suppliers<br />

have started to rely more on a simulation strategy to reduce risk<br />

in software development. This allows for software<br />

development ahead of silicon and platform availability.<br />

Several levels of simulation models that support software<br />

development are listed below:<br />

• Instruction Set Simulator (ISS) models: ISS models are<br />

cycle accurate models that are compiled directly from<br />

RTL and retain complete functional and cycle accuracy.<br />

This enables users to confidently make architectural<br />

decisions, optimize performance or develop bare metal<br />

software.<br />

• SoC Models: SoC and subsystem models provide an<br />

accurate, flexible programmer's-view model of an SoC’s<br />

design allowing software development such as drivers,<br />

firmware, OS and applications prior to silicon availability.<br />

These models allow full control over the simulation,<br />

including profiling, debug and trace. These models can<br />

typically be exported to allow integration into the wider<br />

SoC design process and Electronic Design Automation<br />

(EDA) tool platforms.<br />

• Virtual Platforms: Virtual platforms allow for software<br />

development without a hardware target. Although<br />

generally not cycle accurate, virtual platforms run at<br />

speeds comparable to the real hardware. Virtual platforms<br />

are complete simulations of physical paltform including<br />

processor, memory and peripherals. Virtual platforms are<br />

much more than just an instruction set simulator. A<br />

processor, memory and peripheral(s) model provides a<br />

good indication how software will execute on the physical<br />

device. In cases where a large team is working on a<br />

“generic” device support, virtual models remove the need<br />

for a large number of hardware targets. A virtual platform<br />

allows for OS bring-up, driver and firmware development<br />

significantly in advance of silicon.<br />



• Software in the Loop (SIL) testing describes a test methodology where<br />

executable code is tested within a modelling environment<br />

that can help test software functionality. It is possible for a<br />

SIL simulation to run faster than real-time. This allows for<br />

comprehensive logic testing with faster than normal testing<br />

times.<br />

• Hardware in the Loop (HIL) testing is used in the<br />
development and test of complex real-time<br />
embedded systems. HIL simulation provides an<br />

effective platform by adding the complexity of the system<br />

under control (vehicle network) to the test platform.<br />

Virtual platform solutions are available that support<br />

embedded real-time system simulation. Simulation technology<br />

is necessary but often not sufficient. SIL and HIL testing can<br />

complement the testing strategy for an ECU development.<br />

Avionics systems development in accordance with DO-<br />

178C, Software Considerations in Airborne Systems and<br />

Equipment Certification, has set a precedent that testing on<br />

virtual platforms is viable for safety use-cases. In the cases<br />

where hardware/software integration testing has been run for<br />

formal credit on a virtual platform, the virtual platform has<br />

been qualified as a test tool. Since this type of tool could fail to<br />

detect an error while testing the hardware/software integration,<br />

DO-178C indicates a specific tool qualification process to be<br />

followed. It is common on a virtual platform that hardware<br />

emulation is not equivalent to the fidelity of the real hardware.<br />

In this case, only software requirements related to the fully<br />

emulated and qualified hardware features can be formally<br />

tested in this test environment for credit. It is likely that we will<br />

see OEMs and Tier 1 suppliers use this type of verification for<br />

credit using virtual platform environments. [5]<br />

Virtual platforms for verification and test have several<br />

advantages:<br />

• Software development in advance of silicon and<br />

platform availability.<br />

• Increased code coverage, functional coverage,<br />

assertion checking.<br />

• Stimulation of the design at different points, exercising<br />
software paths and conditions that are<br />
difficult to replicate on real hardware platforms.<br />

• Fault injection at the hardware level, which makes it easier to<br />
characterize how effectively software handles<br />
special conditions, e.g. radiation-induced bit flips.<br />

Just like a compiler can be a qualified tool to generate<br />

executable code for a safety critical platform, a simulation<br />

model/environment can be qualified as a tool to support an<br />

ASIL development testing effort. [6][7]<br />

IV. SAFETY-QUALIFIED PLATFORM – THE BASIS FOR A<br />

SAFETY-QUALIFIED ECU<br />

A vehicle E/E architecture can be characterized as an SoS.<br />

A Commercial off the Shelf (COTS) ECU will have the<br />

requirement to be compliant with the ISO 26262 functional<br />

safety standard if its malfunctions are deemed to be safety<br />

relevant in a specific vehicle platform. This dictates that the<br />

hardware was developed as a Safety Element out of Context<br />

(SEooC) in accordance with the standard. This indicates that<br />

the hardware safety claims stand on their own merits. The<br />

platform will serve as the basis for the software stack which<br />

will also be required to meet the applicable safety objectives.<br />

The safety-qualified ECU hardware platform provider will<br />

provide a functional safety package with a safety manual that<br />

details the design and verification process, fault detection and<br />

control and assumptions of use. The platform fault detection<br />

and control may include pre-qualified software test libraries<br />

(STL). An STL is an executable program that is periodically<br />

executed by a processing unit to detect that the processing unit<br />

or another part of the integrated circuit is operating as<br />

designed. STLs can be used by safety engineers as a fault<br />

detection mechanism. STLs are typically run as a built-in self-test<br />
(BIST) at power-on or at runtime to detect failure modes that<br />

must be mitigated. STLs are part of the solution to detect a<br />

failure mode in hardware that may require mitigation based on<br />

the safety objectives of the ECU. [8][9]<br />

V. SOFTWARE STACK TO SUPPORT SAFETY-QUALIFIED<br />

ECU DESIGNS<br />

The ECU software stack must support the item safety<br />

case. This is a structured argument, supported with evidence<br />

(artifacts), which justifies why the system meets its safety<br />

goals in its specified operating environment. Safety cases may<br />

be represented as hierarchical. The software safety case<br />
assembled by an ECU developer is composed of input from<br />

each software supplier that provides a COTS software element<br />

that is integrated into the ECU. Each of these software<br />

elements must meet the safety case as a Safety Element out of<br />

Context (SEooC). [10]<br />

A. Composing the Safety Case<br />

Assuming there is an ASIL ECU requirement, designers<br />

must make design decisions on how best to meet the safety<br />

requirements. Often, the best approach is to compose the<br />

ECU software stack from previously safety-qualified software<br />

elements. The elements in the runtime safety software stack<br />

must adhere to the process rigor mandated by ISO 26262. Fig.<br />

2 shows the typical software elements in a runtime software<br />

stack of a safety case.<br />

Automotive ECU engineers have many options in an<br />

ecosystem to help identify and mitigate the failure modes that<br />

must be mitigated to support the safety claims of an ECU. A<br />

comprehensive ecosystem of products (Silicon Device, Board<br />

Level, Software Stacks) and Services will make the job of an<br />

automotive safety engineer easier. Some of the safety related<br />

design patterns supported by the ecosystem of hardware and<br />

software suppliers include:<br />

• Isolation for safety separation in software.<br />

• Redundancy in hardware and software<br />

• Pre-qualified software elements.<br />

• Boot and run-time consistency checks, e.g.,<br />
LBIST, BIST, etc.<br />
• Solutions and tools leveraged from adjacent safety<br />
industries (e.g., DO-178B/C, IEC 61508).<br />



Fig. 2 Software Elements of a Certified System<br />

B. AUTOSAR Compliant SW Elements<br />

ECUs vary in hardware resources and software<br />

architecture. Many OEMs have chosen to standardize their<br />

application interface by complying to the AUTOSAR Classic<br />

platform interface. This allows for greater application<br />

portability from one ECU platform to another. The<br />

AUTOSAR Classic platform for many years has been the<br />

application interface supporting the modular E/E vehicle<br />

architectures where every function has its own ECU. The<br />

evolving need for safety and security features is demanding<br />

that 8/16-bit ECU designs migrate to 32/64-bit. The upgrade<br />

in ECU platform hardware has allowed OEMs and Tier 1s to<br />

consider consolidating similar ECU applications onto one<br />

ECU. AUTOSAR Classic compliant runtimes have typically<br />

run on microcontroller unit (MCU) based ECUs. The ability<br />

to support consolidation of portable ECU functional blocks<br />

provides the greatest flexibility in an E/E vehicle architecture.<br />

Having a software stack that supports multiple AUTOSAR<br />
Classic runtimes on one MCU-based SoC was previously<br />
possible, but until recently it was supported only through safety<br />
separation in software, with no hardware support. This places a<br />

greater burden on the hypervisor safety separation layer.<br />

Features like a 2-level Memory Protection Unit (MPU) allow<br />

MCUs to support multiple guest OSes on the same platform.<br />

This allows vehicle designers flexibility to integrate classic<br />

ECU functional blocks onto one platform. This supports the<br />

ECU consolidation trend to reduce the cost of infrastructure<br />

including wiring and the number of domain gateways.[11]<br />

C. Complex ECU Architectures<br />

Modern vehicle designs with ADAS features require<br />

much more complex safety certifiable platforms. ADAS<br />

platforms have a much more demanding architecture and<br />

platform processing requirement. For this reason, the<br />

AUTOSAR consortium has decided to standardize the API<br />

and services to allow vehicle manufacturers the flexibility of<br />

using a service oriented architecture for their designs. The<br />

requirements that drive these more complex systems include<br />

mixed-criticality, real-time, security and safety. Cockpit<br />

Controller, ADAS and fully autonomous drive ECU platforms<br />

are all examples of this class of use-case. The logical choice is<br />

for OEMs and Tier 1 suppliers to leverage safety separation as<br />

part of their designs. Safety separation with a hypervisor<br />

layer allows a safety engineer to compose a mixed-criticality<br />

ASIL ECU. Fig. 3 is an example of a mixed-criticality<br />

autonomous drive platform.<br />

195


Fig. 3 Autonomous Drive Mixed-Criticality Platform<br />

The autonomous drive platform leverages the features<br />

provided by the AUTOSAR Adaptive software stack for<br />

sensor input and processing but also allows integration of a<br />

much more deterministic real-time application such as steering<br />

and accelerator control provided by an AUTOSAR Classic<br />

software stack. The safety separation in this use-case is<br />

provided by a pre-qualified ISO 26262 ASIL D hypervisor.<br />

An ASIL D safety embedded hypervisor is required to<br />

maintain the ASIL D safety claim of the steering and<br />

accelerator control guest operating systems. The ecosystem of<br />

automotive software solutions provides many options for<br />

qualified embedded hypervisor solutions that support designs<br />

up to ASIL D. Assuming that the integrator follows the<br />

assumptions of use of the qualified hypervisor, this will<br />

guarantee freedom from interference between partitions in<br />
time and space. An additional benefit of using a modular<br />

architecture that is composed of mixed-criticality partitions is<br />

the ability for an OEM or Tier 1 to introduce unqualified<br />

software, such as Linux or other open source software in a<br />

partition that does not have a safety requirement. A mixed<br />

AUTOSAR Classic and Adaptive platform provides the<br />

foundation and flexibility to support different system<br />

architectures. This is required for a software defined vehicle<br />

where OEMs and Tier 1s will have the flexibility to move<br />

encapsulated ECU functionality from ECU to ECU.<br />

VI. ECOSYSTEM SUPPLIER SOLUTIONS<br />

There are many automotive suppliers that offer technologies<br />

which can help meet the safety engineering development and<br />

runtime compute requirements for a vehicle ECU. This<br />

includes complex ECU designs such as digital cockpit,<br />

advanced driver assistance and autonomous drive systems.<br />

These suppliers provide solutions for silicon, software and<br />

services facilitating efficient development of automotive<br />

solutions.<br />

• Silicon Devices and Board Products<br />

• Safety-Qualified Compilers and Tools<br />

• Embedded Virtualization (hypervisors)<br />

• Operating Systems, AUTOSAR Classic/Adaptive<br />

• Testing and Simulation Platforms<br />

• Collaborative Open Source Projects<br />

• Human Machine Interface (HMI) Tools<br />

• Security Software Frameworks<br />

• Middleware and Software Frameworks<br />

• Software Development Tools<br />

VII. CONCLUSION<br />

It is important that the ecosystem of suppliers that support<br />

automotive safety designs provides a breadth of products to<br />

support safety separation, security frameworks, and operating<br />

environments compliant with AUTOSAR Classic and<br />

AUTOSAR Adaptive. OEMs and Tier 1s will use these<br />

technologies to integrate real-time safety-critical ECU<br />
designs with newer, more complex ECU designs such as<br />

ADAS and Autonomous drive ECUs. The pressure to shorten<br />

development times places a strong emphasis on starting<br />

software development early. Software tooling to help improve<br />

process efficiency in addition to high-fidelity simulation<br />

environments help mitigate risk. Lastly, having ecosystem<br />

solution options at all levels of the ECU system architecture<br />

including hardware and software stack helps reduce the risk in<br />

a new design.<br />



REFERENCES<br />

[1] http://ieeexplore.ieee.org/document/7994957/<br />

[2] https://en.wikipedia.org/wiki/Integrated_modular_avionics<br />

[3] https://en.wikipedia.org/wiki/Mixed_criticality<br />

[4] https://resources.sei.cmu.edu/asset_files/Presentation/2010_017_001_23266.pdf<br />

[5] RTCA/DO-178C Software Considerations in Airborne Systems and<br />

Equipment Certification. RTCA, Inc. 2011<br />

[6] https://developer.arm.com/products/system-design/fast-models<br />

[7] https://www.psware.com/using-virtual-environments-for-formalverification-credit/<br />

[8] http://www.armtechforum.com.tw/upload/2017/Hsinchu/B6_Functional<br />

_Safety_What_is_Arm_Doing_to_Support_this_Critical_Capability_HS<br />

U.pdf<br />

[9] http://yogitech.com/sites/default/files/documents/frstl_white_paper_rev1<br />

.2en.pdf<br />

[10] https://community.arm.com/processors/b/blog/posts/white-paper-thefunctional-safety-imperative-in-automotive-design<br />

[11] https://developer.arm.com/products/processors/cortex-r/cortex-r52<br />

197


Achieving ISO 26262 Compliance to Ensure Safety<br />
and Security<br />
Mark W. Richardson<br />
Lead Field Application Engineer<br />
LDRA<br />
Wirral, UK<br />
mark.richardson@ldra.com<br />
I. INTRODUCTION<br />

The bits and bolts of the automotive industry have kept pace<br />

with the latest advances in computing technology. A typical<br />

new car includes over one gigabyte of executable software,<br />

and automotive electronics accounts for about 25% of its<br />

capital cost.<br />

Adapting to such a dramatic increase in software has spawned<br />
process errors that have resulted in the unfortunate loss of<br />
lives, countless recalls, and expensive litigation. Added to<br />

this, the advent of the connected car has served to<br />

dramatically increase the number of possible points of<br />

failure. To curb this level of risk and to enforce software<br />

quality, automotive OEMs increasingly demand ISO 26262 1<br />

compliance from their supply chain.<br />

This ISO 26262 Functional Safety Standard defines a product<br />

safety life cycle based on a risk-oriented approach,<br />

encapsulated by the assignment of an Automotive Safety<br />

Integrity Level (ASIL) to each system or subsystem under<br />

development. For every ASIL, the standard defines the<br />

processes related to requirement definition, implementation,<br />

verification, and validation, with traceability between each of<br />

these phases also being a key factor in the achievement of<br />

compliance.<br />

It is incumbent on suppliers seeking compliance to document<br />

how their mature and safety focused development<br />

environment is in accordance with the standard throughout<br />

this development lifecycle.<br />

This paper describes a tools-based methodology that can be<br />

used to cost-effectively manage an ISO 26262 compliant<br />

product development life cycle by providing an interactive<br />

compliance roadmap to help manage the software planning,<br />

development, verification, and regulatory activities of ISO<br />

26262 Part 6, Product Development: Software Level (ISO<br />

26262-6) 2 .<br />

The methodology guides development teams through the<br />

generation of fully compliant plans, document checklists,<br />

transition checklists, standards and other required lifecycle<br />

documents. When integrated with software verification tools,<br />

the compliance management system can also streamline<br />

verification processes, further reducing thousands of hours of<br />

documentation effort and achieving significant reductions in<br />

planning costs.<br />

II. ISO 26262 PROCESS OBJECTIVES<br />

A key element of ISO 26262-4:2011 3 is the practice of<br />

allocating technical safety requirements in the system design<br />

specification, and developing that design further to derive an<br />

item integration and testing plan. It applies to all aspects of<br />

the system including software, with the explicit subdivision<br />

of hardware and software development practices being dealt<br />

with further through the lifecycle.<br />

The relationship between the system-wide ISO 26262-4:2011<br />

and the software specific sub-phases found in ISO 26262-<br />

6:2011 can be represented in a V-model. Each of those steps<br />

is explained further in the following discussion (Figure 1).<br />

1 International standard ISO 26262 Road vehicles —<br />
Functional safety<br />
2 International standard ISO 26262 Road vehicles —<br />
Functional safety — Part 6: Product development at the<br />
software level<br />
3 International standard ISO 26262 Road vehicles —<br />
Functional safety — Part 4: Product development at the<br />
system level<br />



Software architectural design (ISO 26262-6:2011 section 7)<br />

There are many tools available for the generation of the<br />

software architectural design, with graphical representation<br />

of that design an increasingly popular approach. Appropriate<br />

tools are exemplified by MathWorks ® Simulink ®4 , IBM ®<br />

Rational ® Rhapsody ®5 , and ANSYS ® SCADE. 6<br />

Figure 1 - Software-development V-model with cross-references<br />
to ISO 26262 and standard development tools<br />

System design (ISO 26262-4:2011 section 7)<br />

The products of this system-wide design phase potentially<br />

include CAD drawings, spreadsheets, textual documents and<br />

many other artefacts, and clearly a variety of tools can be<br />

involved in their production. This phase also sees the<br />

technical safety requirements refined and allocated to<br />

hardware and software. Maintaining traceability between<br />

these requirements and the products of subsequent phases<br />

generally causes a project management headache.<br />

The ideal tools for requirements management can range from<br />

a simple spreadsheet or Microsoft Word document to<br />

a purpose-designed requirements management tool such as<br />
IBM Rational DOORS Next Generation or Siemens Polarion<br />
REQUIREMENTS. The selection of the appropriate tools<br />

will help in the maintenance of bi-directional traceability<br />

between phases of development, as discussed later.<br />

Specification of software safety requirements (ISO 26262-6:2011 section 6)<br />

This sub-phase focuses on the specification of software<br />

safety requirements to support the subsequent design phases,<br />

bearing in mind any constraints imposed by the hardware.<br />

It provides the interface between the product-wide system<br />

design of ISO 26262-4:2011 and the software specific ISO<br />

26262-6:2011, and details the process of evolution of lower<br />

level, software related requirements. It will most likely<br />

involve the continued leveraging of the requirements<br />

management tools discussed in relation to the System Design<br />

sub-phase.<br />

Figure 2 - Graphical representation of Control and Data<br />

Flow as depicted in the LDRA tool suite<br />

Static analysis tools contribute to the verification of the<br />

design by means of the control and data flow analysis of the<br />

code derived from it, providing graphical representations of<br />

the relationship between code components for comparison<br />

with the intended design (Figure 2).<br />

A similar approach can also be used to generate a graphical<br />

representation of legacy system code, providing a path for<br />

additions to it to be designed and proven in accordance with<br />

ISO 26262 principles.<br />

Software unit design and implementation (ISO 26262-6:2011 section 8)<br />

Coding rules: The illustration in Figure 3 is a typical example<br />

of a table from ISO 26262-6:2011. It shows the coding and<br />

modelling guidelines to be enforced during implementation,<br />

superimposed with an indication of where compliance can be<br />

confirmed using automated tools.<br />

4 MathWorks ® Simulink<br />
https://uk.mathworks.com/products/simulink.html<br />
5 IBM ® Rational ® Rhapsody ® family<br />
http://www-03.ibm.com/software/products/en/ratirhapfami<br />
6 ANSYS ® SCADE Suite<br />
http://www.ansys.com/products/embedded-software/ansys-scade-suite<br />



These guidelines combine to make the resulting code more<br />
reliable, less prone to error, easier to test, and/or easier to<br />
maintain. Peer reviews represent a traditional approach to<br />
enforcing adherence to such guidelines, and while they still<br />
have an important part to play, automating the more tedious<br />
checks using tools is far more efficient, less prone to error,<br />
repeatable, and demonstrable.<br />
Figure 3 - Mapping the capabilities of the LDRA tool suite to<br />
“Table 6: Methods for the verification of the software<br />
architectural design” specified by ISO 26262-6:2011 7<br />
ISO 26262-6:2011 highlights the MISRA 8 coding guidelines<br />
language subsets as an example of what could be used. There<br />
are many different sets of coding guidelines available, but it<br />
is entirely permissible to use an in-house set, or to adapt and<br />
extend one of the standard sets to make it more appropriate<br />
for a particular application (Figure 4).<br />
Figure 4 - Highlighting violated coding guidelines in the<br />
LDRA tool suite<br />
Software architectural design and unit implementation:<br />
Establishing appropriate project guidelines for coding,<br />
architectural design and unit implementation are clearly three<br />
discrete tasks, but software developers responsible for<br />
implementing the design need to be mindful of them all<br />
concurrently.<br />
These guidelines are also founded on the notion that they<br />
make the resulting code more reliable, less prone to error,<br />
easier to test and/or easier to maintain. For example,<br />
architectural guidelines include:<br />
• Restricted size of software components and restricted<br />
size of interfaces, recommended not least because large,<br />
rambling functions are difficult to read, maintain, and<br />
test – and hence more susceptible to error.<br />
• High cohesion within each software component,<br />
resulting from close linking between the elements<br />
within a module, so that each component performs a<br />
single, well-defined task.<br />
Figure 5 - Output from control and data coupling analysis as<br />
represented in the LDRA tool suite<br />
Static analysis tools can provide metrics to ensure compliance<br />
with the standard, such as complexity metrics as a product of<br />
interface analysis, cohesion metrics evaluated through data<br />
object analysis, and coupling metrics via data and control<br />
coupling analysis (Figure 5).<br />

More generally, static analysis can help to ensure that the<br />

good practices required by ISO 26262:2011 are adhered to<br />

whether they are coding rules, design principles, or principles<br />

for software architectural design.<br />
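By way of illustration, the following C fragment shows the style of construct that MISRA-like guidelines and static analysis typically flag, together with a compliant alternative. It is a hedged sketch: the function names are invented and no specific rule numbers are claimed.<br />

```c
#include <stdint.h>

/* Illustrative only: a construct MISRA-style guidelines tend to flag. */

/* Questionable style: implicit narrowing conversion and a magic number. */
uint8_t scale_raw(uint32_t raw)
{
    return raw / 17;             /* uint32_t silently truncated to uint8_t */
}

/* Cleaner style: named constant, range guard, explicit conversion. */
#define SCALE_DIVISOR 17U

uint8_t scale_checked(uint32_t raw)
{
    uint32_t scaled = raw / SCALE_DIVISOR;
    if (scaled > 255U) {
        scaled = 255U;           /* saturate rather than truncate */
    }
    return (uint8_t)scaled;      /* conversion is now explicit and bounded */
}
```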

In practice, for developers who are newcomers to ISO 26262,<br />

the role of such a tool often evolves from a mechanism for<br />

highlighting violations, to a means to confirm that there are<br />

none.<br />

Software unit testing (ISO 26262-6:2011 section 9) and<br />
Software integration and testing (ISO 26262-6:2011 section 10)<br />

Just as static analysis techniques (an automated “inspection”<br />

of the source code) are applicable across the sub-phases of<br />

coding, architectural design and unit implementation, dynamic<br />

analysis techniques (involving the execution of some or all of<br />

the code) are applicable to unit, integration and system<br />

testing. Unit testing is designed to focus on particular software<br />

procedures or functions in isolation, whereas integration<br />

testing ensures that safety and functional requirements are met<br />

7 Based on table 6 from ISO 26262-6:2011, Copyright © 2015 IEC, Geneva, Switzerland. All<br />
rights acknowledged<br />
8 MISRA – The Motor Industry Software Reliability Association<br />
https://www.misra.org.uk/<br />

200


when units are working together in accordance with the<br />

software architectural design.<br />

ISO 26262-6:2011 tables list techniques and metrics for<br />

performing unit and integration tests on target hardware to<br />

ensure that the safety and functional requirements are met and<br />

software interfaces are verified at the unit and integration<br />

levels. Fault injection and resource tests further prove<br />

robustness and resilience and, where applicable, back-to-back<br />

testing of model and code helps to prove the correct<br />

interpretation of the design. Artefacts associated with these<br />

techniques provide both a reference for their management and<br />

evidence of their completion. They include the software unit<br />

design specification, test procedures, verification plan and<br />

verification specification. On completing each test procedure,<br />

pass/fail results are reported and compliance with<br />

requirements verified appropriately.<br />

ISO 26262:2011 does not require that any of the tests it<br />

promotes deploy software test tools. However, just as for<br />

static analysis, dynamic analysis tools help to make the test<br />

process far more efficient, especially for substantial projects.<br />

Figure 7 - Examples of representations of structural<br />

coverage within the LDRA tool suite<br />

Structural coverage metrics: In addition to showing that the<br />

software functions correctly, dynamic analysis is used to<br />

generate structural coverage metrics. In conjunction with the<br />

coverage of requirements at the software unit level, these<br />

metrics provide the necessary data to evaluate the<br />

completeness of test cases and to demonstrate that there is no<br />

unintended functionality (Figure 7).<br />
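The mechanism behind such metrics can be reduced to a few lines of C: the tool inserts probes into the source, and the proportion of probes hit during test execution yields the coverage figure. This is a deliberately simplified, hand-written sketch; real tools instrument and report automatically.<br />

```c
/* Hand-written sketch of statement-coverage instrumentation. */
#define N_PROBES 3
static int probe_hit[N_PROBES];
#define PROBE(i) (probe_hit[(i)] = 1)

/* Example function with probes on each statement path. */
int clamp_positive(int x)
{
    PROBE(0);
    if (x < 0) {
        PROBE(1);                /* only hit when a negative input is tested */
        x = 0;
    }
    PROBE(2);
    return x;
}

/* Coverage = percentage of probes that were hit during testing. */
int coverage_percent(void)
{
    int hit = 0;
    for (int i = 0; i < N_PROBES; ++i) {
        hit += probe_hit[i];
    }
    return (hit * 100) / N_PROBES;
}
```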

Figure 6 - Performing requirement based unit-testing using<br />

the LDRA tool suite<br />

The example in Figure 6 shows how the software interface is<br />

exposed at the function scope, allowing the user to enter inputs<br />

and expected outputs to form the basis of a test harness. The<br />

harness is then compiled and executed on the target hardware,<br />

and actual and expected outputs compared.<br />
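The table-driven principle behind such a harness can be sketched as follows; the unit under test, its name, and the test values are all invented for illustration and are not taken from any real project or from the tool itself.<br />

```c
/* Invented unit under test: clamps a demand value to a maximum. */
int limit_demand(int demand, int max_demand)
{
    return (demand > max_demand) ? max_demand : demand;
}

/* Minimal harness: tabulated inputs and expected outputs are executed,
   and actual results compared against expectations. */
typedef struct {
    int demand;
    int max_demand;
    int expected;
} test_case_t;

int run_tests(void)
{
    const test_case_t cases[3] = {
        {  50, 100,  50 },   /* nominal value passes through */
        { 150, 100, 100 },   /* clamped at the maximum */
        {   0, 100,   0 },   /* lower boundary */
    };
    int failures = 0;
    for (int i = 0; i < 3; ++i) {
        if (limit_demand(cases[i].demand, cases[i].max_demand)
                != cases[i].expected) {
            ++failures;
        }
    }
    return failures;             /* 0 means every test case passed */
}
```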

Unit tests become integration tests as units are introduced as<br />

part of a call tree, rather than being “stubbed”. Exactly the<br />

same test data can be used to validate the code in both cases.<br />

Boundary values can be analysed by automatically generating<br />

a series of unit test cases, complete with associated input data.<br />

The same facility also allows the definition of<br />

equivalence boundary values such as minimum value, value<br />

below lower partition value, lower partition value, upper<br />

partition value and value above upper partition boundary.<br />
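The derivation of these values can be sketched for a hypothetical parameter constrained to the range [10, 90]; the bounds are invented purely for illustration.<br />

```c
#include <limits.h>

/* Hypothetical valid partition for a parameter: [LOWER, UPPER]. */
#define LOWER 10
#define UPPER 90

/* Fills out[] with the five boundary values enumerated in the text. */
void boundary_values(int out[5])
{
    out[0] = INT_MIN;     /* minimum value of the type */
    out[1] = LOWER - 1;   /* value below lower partition value */
    out[2] = LOWER;       /* lower partition value */
    out[3] = UPPER;       /* upper partition value */
    out[4] = UPPER + 1;   /* value above upper partition boundary */
}
```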

Should changes become necessary – perhaps as a result of a<br />

failed test, or in response to a requirement change from a<br />

customer – then all impacted unit and integration tests would<br />

need to be re-run (regression tested), automatically reapplying<br />

those tests through the tool to ensure that the<br />

changes do not compromise any established functionality.<br />

Metrics recommended by ISO 26262:2011 include<br />

functional, call, statement, branch and MC/DC coverage.<br />

Unit and system test facilities can operate in tandem, so that<br />

(for instance) coverage data can be generated for most of the<br />

source code through a dynamic system test, and then be<br />
complemented using unit tests to exercise constructs – such as<br />
defensive code – which are inaccessible during normal system<br />
operation.<br />
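The difference between these criteria is easiest to see on a small decision. For the invented interlock below, statement coverage needs one test and branch coverage two, whereas MC/DC requires each condition to be shown to independently affect the outcome; one adequate vector set is enumerated in the comment. Everything here is illustrative, not tool output.<br />

```c
#include <stdbool.h>

/* Invented decision with three conditions, for illustration only. */
bool interlock_open(bool door_closed, bool speed_zero, bool override)
{
    return (door_closed && speed_zero) || override;
}

/* One MC/DC-adequate set (4 vectors for 3 conditions):
 *  door_closed  speed_zero  override  ->  result
 *      T            T           F          T   baseline
 *      F            T           F          F   door_closed flips the outcome
 *      T            F           F          F   speed_zero flips the outcome
 *      T            F           T          T   override flips the outcome
 */
```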

Software test and model based development: There are<br />

several vendors of model based development tools, such as<br />

MathWorks Simulink, IBM Rational Rhapsody, and ANSYS<br />

SCADE, many of which are deservedly popular in the<br />

automotive industry. Their integration with test tools becomes<br />

pertinent once source code has been auto-generated from<br />
those models.<br />

Using the MathWorks product as an example, “back-to-back”<br />

testing is approached by first developing and verifying<br />

design models within Simulink. Code is then generated from<br />

Simulink, instrumented by the dynamic test tool, and executed in<br />

either Software in the Loop (SIL or host) mode, or Processor<br />

In the Loop (PIL or target) mode. Structural coverage reports<br />

are presented at the source code level.<br />
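The principle of back-to-back testing can be reduced to a sketch: a reference implementation standing in for the simulated model, and a second function standing in for the generated code, are driven with identical inputs and their outputs compared. Both functions below are invented placeholders, not generated code.<br />

```c
/* "Model": reference implementation of the specified behaviour. */
int model_saturate(int x)
{
    if (x > 100)  return 100;
    if (x < -100) return -100;
    return x;
}

/* Stand-in for auto-generated code implementing the same design. */
int generated_saturate(int x)
{
    return (x > 100) ? 100 : ((x < -100) ? -100 : x);
}

/* Back-to-back comparison over a shared input vector. */
int back_to_back_mismatches(const int *inputs, int n)
{
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (model_saturate(inputs[i]) != generated_saturate(inputs[i])) {
            ++mismatches;
        }
    }
    return mismatches;           /* 0 = model and code agree on all inputs */
}
```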

In addition to “back-to-back” testing, such an integration<br />

provides facilities to ensure that generated source code<br />

complies with an appropriate coding standard, performs<br />

additional dynamic testing at the source code level, and<br />



complies with requirements. The same facilities can also be<br />

used to ensure that any hand-written additions to the auto-<br />
generated code are adequately tested.<br />

Bi-directional traceability (ISO 26262-4:2011 and ISO 26262-6:2011)<br />

Bi-directional traceability runs as a principle throughout<br />

ISO 26262:2011, with each development phase required to<br />

accurately reflect the one before it. In theory, if the exact<br />

sequence of the V-model is adhered to, then the requirements<br />

will never change and tests will never throw up a problem.<br />

But life’s not like that.<br />

Consider, then, what happens if there is a code change in<br />

response to a failed integration test, perhaps because the<br />

requirements are inconsistent or there is a coding error. What<br />

other software units were dependent on the modified code?<br />

Such scenarios can quickly lead to situations where the<br />

traceability between the products of software development<br />

falls down. Once again, while it is possible to maintain<br />
traceability manually, automation helps a great deal.<br />

Software unit design can take many forms – perhaps in the<br />

form of a natural language detailed design document, or<br />

perhaps model based. Either way, these design elements need<br />

to be bi-directionally traceable to both software safety<br />

requirements and the software architecture. The software<br />

units must then be implemented as specified and then be<br />

traceable to their design specification.<br />

Automated requirements traceability tools are used to<br />
establish links between requirements and test cases of different<br />
scopes, which allows test coverage to be assessed (Figure 8).<br />

The impact of failed test cases can be assessed and addressed,<br />

as can the impact of requirements changes and gaps in<br />

requirements coverage. And artefacts such as traceability<br />

matrices can be automatically generated to present evidence<br />

of compliance to ISO 26262:2011.<br />

Figure 8 - Performing requirement based testing. Test cases<br />

are linked to requirements and executed within the LDRA<br />

tool suite<br />
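The underlying data model of such traceability links is simple, as this sketch suggests; the requirement and test-case identifiers are invented. Each requirement is linked to test cases, and an unlinked requirement is reported as a coverage gap.<br />

```c
#include <stddef.h>

/* Sketch of a traceability link: a requirement tied to a test case.
   A NULL test_id marks a requirement with no linked test yet. */
typedef struct {
    const char *req_id;
    const char *test_id;
} trace_link_t;

/* Counts requirements lacking a linked test case - the gaps that a
   traceability report would highlight for attention. */
int count_untraced(const trace_link_t *links, int n)
{
    int gaps = 0;
    for (int i = 0; i < n; ++i) {
        if (links[i].test_id == NULL) {
            ++gaps;
        }
    }
    return gaps;
}
```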

In practice, initial structural coverage is usually accrued as<br />
part of this holistic process from the execution of functional<br />
tests on instrumented code, leaving unexecuted portions of<br />
code which require further analysis. That ultimately results in<br />

the addition or modification of test cases, changes to<br />

requirements, and/or the removal of dead code. Typically, an<br />

iterative sequence of review, correct and analyse ensures that<br />

design specifications are satisfied.<br />

During the development of a traditional, isolated system, that<br />

is clearly useful enough. But connectivity demands the ability<br />

to respond to vulnerabilities identified in the field. Each<br />

newly discovered vulnerability implies a changed or new<br />

requirement, and one to which an immediate response is<br />

needed – even though the system itself may not have been<br />

touched by development engineers for quite some time. In<br />

such circumstances, being able to isolate what is needed and<br />

automatically test only the impacted code becomes<br />

something much more significant.<br />

Connectivity changes the notion of the development process<br />

ending when a product is launched, or even when its<br />

production is ended. Whenever a new vulnerability is<br />

discovered in the field, there is a resulting change of<br />

requirement to cater for it, coupled with the additional<br />

pressure of knowing that in such circumstances, a speedy<br />

response to requirements change has the potential to both<br />

save lives and enhance reputations. Such an obligation shines<br />

a whole new light on automated requirements traceability.<br />

Confidence in the use of software tools (ISO 26262-8:2011 section 11)<br />

This supporting process defines a mechanism to provide<br />

evidence that the software tool chain is competent for the job.<br />

The required level of confidence in a software tool depends<br />

upon the circumstances of its deployment, both in terms of<br />

the possibility that a malfunctioning software tool can<br />

introduce or fail to detect errors in a safety-related element<br />

being developed, and the likelihood that such errors can be<br />

prevented or detected.<br />

Tool qualification by a TÜV organization (“Technischer<br />
Überwachungsverein”, or “Technical Inspection<br />
Association”) for use in ISO 26262 compliant systems<br />

removes considerable user overhead in providing alternative<br />

evidence of that confidence.<br />

Depending on the user’s assessment of their application, test<br />
tools are generally assigned a “Tool Confidence Level” of<br />

either TCL1 or TCL2. In all cases except where the tool suite<br />

is assigned TCL2 and the product is designated ASIL D, the<br />

existence of a TÜV certificate is sufficient to establish<br />

sufficient confidence in the tool. Otherwise, the tool is<br />

required to be subjected to a validation process, to show that<br />

the tool is capable of analysing sample software in the<br />

appropriate target environment.<br />
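The rule as stated in the text reduces to a single predicate, sketched here. The enumerations are simplified for illustration; TCL3 and the detailed tool-confidence determination tables of ISO 26262-8 are deliberately omitted.<br />

```c
#include <stdbool.h>

typedef enum { TCL1 = 1, TCL2 = 2 } tcl_t;
typedef enum { ASIL_A, ASIL_B, ASIL_C, ASIL_D } asil_t;

/* Encodes the rule as stated in the text: a TUV certificate suffices
   except for a TCL2 tool used on an ASIL D product, which additionally
   requires validation in the target environment. */
bool certificate_sufficient(tcl_t tcl, asil_t asil)
{
    return !((tcl == TCL2) && (asil == ASIL_D));
}
```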

III. CONCLUSIONS<br />

There is an ever-widening range of automotive electrical<br />
and/or electronic (E/E/PE) systems such as advanced driver<br />
assistance systems, anti-lock braking systems, steering and<br />



airbags. Their increasing levels of integration and connectivity<br />

provide almost as many challenges as their proliferation, with<br />

non-critical systems such as entertainment systems sharing the<br />

same communications infrastructure as steering, braking and<br />

control systems. The net result is a necessity for exacting<br />

functional safety development processes, from requirements<br />

specification, design, implementation, integration,<br />

verification, validation, and through to configuration.<br />

ISO 26262 “Road vehicles – Functional safety” was published<br />

in response to this explosion in automotive E/E/PE system<br />

complexity, and the associated risks to public safety 9 . Like the<br />

rail, medical device and process industries before it, the<br />

automotive sector based its functional safety standard on the<br />
industry-agnostic standard IEC 61508 10 . The<br />

resulting ISO 26262 has become the dominant automotive<br />

functional safety standard, and its requirements and processes<br />

are becoming increasingly familiar throughout the industry.<br />

Although the standard has a significant contribution to make to<br />

both safety and security, there is no doubt that it brings with it<br />

considerable overhead. The application of automated tools<br />

throughout the development lifecycle can help considerably to<br />

minimize that overhead, whilst removing much of the<br />

potential for human error from the process.<br />

Never has that been more significant than now. Connectivity<br />

changes the notion of the development process ending when<br />

a product is launched, and whenever a new vulnerability is<br />

discovered in the field, there is a resulting change of<br />

requirement to cater for it. Responding to those requirements<br />

places new emphasis on the need for an automated solution,<br />

both during the development lifecycle and beyond.<br />

COMPANY DETAILS<br />

LDRA<br />

Portside<br />

Monks Ferry<br />

Wirral<br />

CH41 5LH<br />

United Kingdom<br />

Tel: +44 (0)151 649 9300<br />

Fax: +44 (0)151 649 9666<br />

E-mail:info@ldra.com<br />

CONTACT DETAILS<br />

Presentation Co-ordination<br />

Mark James<br />

Marketing Manager<br />

E:mark.james@ldra.com<br />

Presenter<br />

Mark Richardson<br />

Lead Field Applications Engineer<br />

E:mark.richardson@ldra.com<br />

9 https://www.iso.org/news/2012/01/Ref1499.html<br />
10 IEC 61508:2010 Functional safety of electrical/electronic/programmable electronic safety-related<br />
systems<br />



A lean process for Ariane 6 Flight Software<br />

Development<br />

Philippe Gast<br />

Avionics Architecture & Flight Software<br />

ArianeGroup<br />

Les Mureaux, France<br />

philippe.gast@ariane.group<br />

Abstract—This paper addresses the method deployed in<br />

ArianeGroup to master the transition from System Functional<br />

engineering to Flight Software development.<br />

Keywords—Critical Software; realtime; Model Based System<br />

Engineering; SysML; Functional Design; Ada2012<br />

I. INTRODUCTION<br />

The current challenging context of Space Systems<br />
development (more competitors, less budget) requires putting in<br />
place innovative engineering processes and methods in order to<br />
be efficient throughout the development, while keeping these<br />
systems at the necessary quality level. Efficiency means:<br />
To develop right first time,<br />
To reduce the development duration and hence the cost.<br />
This paper shows how ArianeGroup has defined a Model-<br />
Based System Engineering (MBSE) method which eases the<br />
transition from System to Software, improves the consistency<br />
of the system definition and enables early detection of errors.<br />

The objective of this work is to generate, from a single model<br />

shared between the System and the Software teams, parts of<br />

documents (System Definition Files, Flight Software<br />

Specification) and a part of the Flight Software code.<br />

After this introduction, Section 2 will quickly present the<br />

Ariane 6 launcher. Section 3 will present the engineering<br />

method. Section 4 will describe the main Software Design<br />

choices consistent with the System definition concepts. Finally,<br />

Section 5 will show how the method has been implemented<br />

through the usage of “off-the-shelf” and “in-house” tools. Section<br />
6, as a conclusion, will present the current feedback from using<br />

this approach.<br />

II. ARIANE 6 LAUNCHER PRESENTATION AND SPECIFICITIES<br />

Ariane 6 [1] is a 3-stage launcher, versatile (2 or 4 boosters)<br />

for various missions (mono/multi boost on different types of<br />

orbits). Ariane 6 is Fail Operational (FO).<br />

Fig. 1. Ariane 6 overview<br />

The main functions embedded in the Flight Software are:<br />
Flight Control<br />
Propulsion (engine management, tank management)<br />
Mission management<br />
Service functions: pyrotechnic igniter, measurement<br />
acquisition system, telemetry, valve control<br />

with the following specificities:<br />
Mix of cyclic processing (e.g. Flight Control commands)<br />
and sequential processing (engine ignition, stage<br />
separation),<br />
Highest control reactivity less than 10 milliseconds,<br />
Highest system reconfiguration on error less than 20<br />
milliseconds,<br />
Level B Software<br />



III. FROM SYSTEM TO SOFTWARE<br />

The perimeter covered by this paper is the functional<br />
definition of the System, as shown in the figure below.<br />

Fig. 2. Perimeter of “System to Software” engineering approach<br />

The approach is based on our own experience<br />
(development of complex systems such as Ariane 5 and the<br />
Automated Transfer Vehicle (ATV)) and on the results [2] of<br />
ESA analyses related to the software crisis (beginning of the<br />
2000s). The main objective we targeted when putting in place<br />
the “System to Software” engineering process was the<br />
following: to improve the capture of system requirements<br />
allocated to the software. The process is based on the Functional Unit concept<br />
software. The process is based on the Functional Unit concept<br />

as a support to the System functional design activity.<br />

A. The Functional Unit approach<br />

The Functional Unit concept is to perform a breakdown of<br />

the functional architecture into a number of clearly defined<br />

parts, with high internal coherence and low coupling between<br />

them. Each part, designated as “Functional Unit”, represents a<br />

set of Hardware and Software products mutually coherent to<br />

provide dedicated functionalities and services. Functional Unit<br />

are managed by the Launcher Management Software function.<br />

Their control & command is based on “Finite State Machines”<br />

defining modes and configurations of each Functional Unit.<br />

The Functional Unit approach makes it possible:<br />

To map functions on products in a coherent way,<br />

To trace software functional requirements directly<br />

from Functional Unit requirements (System need),<br />

To design the related software with a clear system<br />

design definition,<br />

To manage the development in a modular way,<br />

To build a verification approach based on Functional<br />

Unit requirements with a clear identification of test<br />

objectives,<br />

To facilitate the definition of the related integration<br />

tests and operations at system level based on<br />

Functional Unit design.<br />

B. Functional Design<br />

The functional architecture of the launcher is mapped onto<br />

the Hardware and Software products (physical architecture) as<br />

a result of the Launcher system design activities. This mapping<br />

is organized in the following way:<br />

A set of Functional Units, each of which gathers the<br />
equipment and software items that provide specific<br />
services and capabilities, with an objective of high<br />
consistency within a Functional Unit perimeter and low<br />
coupling (interfaces) between Functional Units.<br />

In addition to the Functional Units, a Launcher<br />

Management function is in charge of the management<br />

of those Functional Units according to the on-going<br />

operation (on-ground control, flight mission). Launcher<br />
Management is in charge of the on-board contribution<br />
to the management of the Launcher operational life, its<br />

hardware and software configuration, together with the<br />

management of the functional interfaces with the<br />

ground.<br />

The Launcher Management acts as “a conductor” for the<br />

different Functional Units.<br />

Fig. 3. General Launcher functional architecture<br />

C. Launcher Management Details<br />

The Launcher Management is itself a specific software<br />
item: it provides specific services and is the conductor of the<br />
Functional Units. It is composed only of software. The<br />
Launcher Management function is in charge of the on-board<br />
contribution to the management of the operations and of the<br />
configuration of the launcher. It is made up of:<br />

1) The Mission Management provides the on board<br />

capability of executing sequences of automatic operations as a<br />

set of pre-defined and scheduled commands stored in the<br />

Mission Plan, meaning:<br />

On board management of the Mission Plan according<br />

to ground commands (if any) for: enabling/disabling a<br />

plan.<br />



Detecting the occurrence of the mission events that<br />
enables the execution of the Mission Plan commands.<br />

Executing the current active plan (nominal,<br />

contingency), generating commands towards the<br />

Launcher Mode Management for:<br />

o Changes of launcher modes (Launcher<br />

commands), as needed by the execution of<br />

specific operations (e.g.: switch between<br />

Hardware capacities)<br />

o Functional Unit commands: to set<br />

Functional Unit modes and configuration<br />

according to the command<br />

o Jump to an alternative plan on request from<br />

Launcher Mode Management for Alarm<br />

recovery or from the TC path if any.<br />

2) The Launcher Mode Management, which provides the capability to:

- Set the launcher mode to the one required by the current operations, according to the launcher commands from the Mission Management when in AUTONOMOUS mode, or from the ground when in MANUAL mode,
- Execute launcher commands: sequencing of the different commands (Functional Unit commands) defined in the launcher sequence related to the launcher mode transition. A launcher command execution cannot be interrupted by another command (e.g. a failure recovery command),
- Execute Functional Unit commands: sequencing of the different steps of the commanding sequence related to the Functional Unit mode transition. The execution of a Functional Unit command cannot be interrupted by another command (e.g. a failure recovery command),
- Monitor the configuration of the launcher.

The Launcher Management and the Functional Unit services together provide all the services required to fulfil the mission, at two different levels, as shown in the figure below.

Fig. 4. Launcher Management<br />
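The atomic, step-by-step command execution described above can be sketched as follows. This is an illustrative sketch in Python (the flight software itself is written in Ada, and none of these names come from the real system): a launcher command runs its Functional Unit commands to completion, one per cycle, and competing requests are queued rather than allowed to interrupt it.

```python
# Hypothetical sketch of non-interruptible launcher command sequencing.
class LauncherCommand:
    def __init__(self, name, fu_commands):
        self.name = name
        self.steps = list(fu_commands)  # ordered Functional Unit commands
        self.index = 0

    def done(self):
        return self.index >= len(self.steps)

    def step(self):
        """Execute one Functional Unit command of the sequence."""
        cmd = self.steps[self.index]
        self.index += 1
        return cmd


class LauncherModeManagement:
    def __init__(self):
        self.active = None   # the launcher command currently executing
        self.pending = []    # commands waiting (e.g. failure recovery)

    def request(self, command):
        # A running launcher command is atomic: new requests are queued,
        # never allowed to interrupt the current sequence.
        if self.active is None:
            self.active = command
        else:
            self.pending.append(command)

    def cycle(self):
        """Called once per basic task cycle; executes at most one step."""
        if self.active is None and self.pending:
            self.active = self.pending.pop(0)
        if self.active is not None:
            executed = self.active.step()
            if self.active.done():
                self.active = None
            return executed
        return None
```

A recovery command requested while a mode transition is running is thus simply deferred until the running sequence completes, which is the behaviour the paper specifies for both launcher commands and Functional Unit commands.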

D. Functional Unit details

Functional Unit features consist of:

- A hardware architecture, meaning equipment or parts of equipment (sensors, avionics processing, actuators) supporting the related Functional Unit services.
- A software part in charge of the Functional Unit management and of the Functional Unit algorithms interfacing with the hardware part (for commands and acquisitions). The Functional Unit also provides software services which are run only when the Functional Unit is in a steady-state mode; these are not part of the Launcher Management, but their scope is nevertheless recalled hereafter for the sake of completeness:
  o Processing of measurements generated by the Functional Unit hardware,
  o Internal regulation loops for the Functional Unit,
  o Processing of commands generated by the Functional Unit,
  o Detection of mission events used by the Mission Management in the execution of the Mission Plan.

Each Functional Unit software part provides the following functionalities:

- "Execute Functional Unit commands", which gathers the acyclic processes of the Functional Unit; it settles the mode/configuration of the Functional Unit on receipt of commands from the Launcher Mode Management,
- "Execute Functional Unit processing", which gathers all the Functional Unit cyclical functions, such as:
  o Processing measurements generated by the Functional Unit hardware,
  o Internal regulation loops for the Functional Unit (e.g. pressurization regulation, thrust positioning, etc.),
  o Processing commands generated by the Functional Unit,
  o Monitoring the Functional Unit (e.g. hardware status, voltage, current, etc.); it generates alarms in case it detects a failure occurring to a piece of equipment or to a software algorithm being part of the function it manages,
  o Generating mission events to the Mission Management,
  o Providing the electrical and functional status of equipment (if any) related to the Functional Unit,
  o Providing telemetry data related to the Telemetry Functional Unit.



Fig. 5. Functional unit<br />

IV. MAIN SOFTWARE DESIGN CHOICES RELATED TO SYSTEM DEFINITION APPROACH

The Ariane 6 Flight Software is built on strong design principles (already applied on the Automated Transfer Vehicle system), fully consistent with the Functional Unit approach applied for the system functional definition.

A. Synchronous design with time-triggered communication system

The multi-tasking scheduling of the Ariane 6 Flight Software uses a Rate Monotonic Scheduling (RMS) policy; this permits a synchronous software design. More precisely, the Ariane 6 Flight Software is composed of a main task (the lowest-priority task, acting as the background task), one basic cyclic task (the task with the lowest activation period and highest priority, whose cyclical activation is synchronized with the communication system) and a set of harmonic cyclic tasks (higher activation period / lower priority tasks whose cyclical activation is controlled by the basic cyclic task). Each task shall fulfil its deadline (processing terminated before the end of the period).

Notice that all acyclic actions (i.e. launcher commands, Functional Unit commands, failure recovery) are executed in the basic cyclic task in a discrete way (one step of the acyclic action is executed in one basic task cycle), with a unique specified maximum execution time for one step (this greatly eases the mastering of the Central Processing Unit (CPU) budget). Flight Software reactivity to an asynchronous event is defined by specifying the maximum allowed number of steps related to an acyclic action.
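The discrete execution of acyclic actions can be illustrated with a short sketch (Python for illustration only; every name here is invented, not taken from the flight software): the basic cycle runs its cyclic processing every time, then executes at most one bounded step of the pending acyclic action, so the worst-case cost of a cycle, and the reactivity in cycles, stay bounded by construction.

```python
# Illustrative sketch: one acyclic step per basic cycle, with a declared
# maximum number of steps acting as the reactivity budget.
class AcyclicAction:
    def __init__(self, name, steps, max_steps):
        # A step list longer than the budget would violate the specified
        # reactivity, so it is rejected at construction time.
        assert len(steps) <= max_steps, "action exceeds its step budget"
        self.name = name
        self.steps = list(steps)


def basic_cyclic_task(cyclic_jobs, action_queue, n_cycles):
    """Run n_cycles of the basic cycle; return the per-cycle acyclic log."""
    log = []
    for _ in range(n_cycles):
        for job in cyclic_jobs:      # cyclic processing, every cycle
            job()
        if action_queue:             # at most ONE acyclic step per cycle
            action = action_queue[0]
            log.append(action.steps.pop(0))
            if not action.steps:
                action_queue.pop(0)  # action completed
        else:
            log.append(None)         # idle slot: margin in the CPU budget
    return log
```

Because each cycle executes a fixed amount of cyclic work plus at most one bounded acyclic step, the CPU budget per cycle can be verified statically, which is the point the paper makes about budget mastering.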

The communication system of Ariane 6 is based on Time-Triggered Ethernet (TTE) technology. The figure below shows the principle of Ariane 6 communication frame building and its synchronisation with the Flight Software to manage avionics input/output:

- The TTE cluster cycle is designed to contain only one occurrence of each possible TT message. In other words, each TT message is supposed to have the same period, equal to the duration of the TTE cluster cycle, even if oversampling may be used (the period of some TT messages will then indeed be a multiple of the TTE cluster cycle). This makes it possible to have a simpler TTE communication network configuration, which is also independent from the definition of the bus frame. Thus, modifications in the bus frame definition will only impact the configuration of the Ariane 6 middleware and/or the configuration of the multi-task sequencer (harmonic cyclic task definitions).
- A major frame made of a finite number of minor frames will be defined, each minor frame being made from a subset of all the possible TT messages, according to the required period and reactivity of the various functions. Moreover, the duration of a minor frame will be equal to the duration of the TTE cluster cycle.
- Thanks to the TTE start-of-cluster-cycle event, the Ariane 6 Flight Software Basic Cyclic Task (BCT) will be activated once per minor frame (strong synchronization between software and hardware). The basic cyclic task will in turn activate a finite number of harmonic cyclic tasks whose activation period will be defined according to the bus frame definition.

A cyclic task will receive/send TT messages according to the Last frame In / Next frame Out (LINO) principle. This principle can be summarized as:

- Every TT input message will be taken from the preceding minor cycle.
- Every TT output message will be addressed to the next minor cycle.
- No operation will be performed on the TT messages transmitted in the current minor cycle.

Note that any cyclic task may access these TT input/output messages using a dedicated task-by-task data access mechanism.
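The LINO principle amounts to a triple-slot buffer per TT message; a minimal sketch, with invented names (the real mechanism is part of the Ariane 6 middleware and is not described in this paper):

```python
# Illustrative sketch of LINO: each TT message has three slots - the frame
# received last cycle (read side), the frame on the bus this cycle (never
# touched by the application), and the frame to send next cycle (write side).
class TTMessagePort:
    def __init__(self):
        self.last_in = None      # input taken from the preceding minor cycle
        self.in_transit = None   # current minor cycle: neither read nor written
        self.next_out = None     # output addressed to the next minor cycle

    def read(self):
        """Application-side read: always the previous cycle's input."""
        return self.last_in

    def write(self, value):
        """Application-side write: goes out in the next cycle."""
        self.next_out = value

    def rotate(self, received):
        """Called by the bus driver at each start-of-cluster-cycle event."""
        self.last_in = received          # what arrived in the previous frame
        self.in_transit = self.next_out  # what goes onto the bus now
        self.next_out = None
        return self.in_transit
```

The benefit is that application tasks never race with the bus: reads and writes always target slots that the communication system is not using in the current minor cycle.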

In the example below:

- One minor frame = one TTE cluster cycle; one major frame = 2 minor frames,
- The Basic Cyclic Task is activated once per minor frame, and there is one Harmonic Cyclic Task (HCT) activated by the BCT once per major frame,
- Inside the BCT minor cycle n, the BCT will take the TT input message D transmitted in the BCT minor cycle n-1 and will address the TT output message A transmitted in the BCT minor cycle n+1,
- Inside the HCT minor cycle n, the HCT will take the TT input message B transmitted in the HCT minor cycle n-1 and will address the TT output message C transmitted in the HCT minor cycle n+1.



Fig. 8. An example of SysML statechart<br />

Fig. 6. Software scheduling and synchronization with communication bus<br />

V. IMPLEMENTATION OF THE METHOD

Several modelling languages and associated tools have been selected or developed to support the engineering process from functional definition to code.

A. Functional definition

The language chosen for the formalization of the Functional Unit definition is SysML [5] (using the Rhapsody tool; cf. [4]). The figure below shows the perimeter covered by SysML: the language is used to formalize the functional design related to the software part of the Functional Unit.

Fig. 7. Models perimeter

More precisely, the following are formalized in SysML:

- The Functional Unit modes and configuration (SysML statechart),
- The Functional Unit commands and associated transitions (SysML statechart),
- The Functional Unit processing/monitoring (SysML blocks) and the associated activation conditions (Domain Specific Language in the SysML model),
- The dataflows (SysML ports, interface blocks + flow properties) between software blocks.

Fig. 9. SysML Internal Block Diagram

The language used to specify the mission is a Domain Specific Language. This textual language permits, at functional level, the specification of mission plans and launcher sequences.

A Mission Plan is built using instructions which permit to:

- Execute commands,
- Jump to another plan,
- Monitor events,
- Wait (for a duration, an event raising, or a Boolean condition),
- Execute "if … then … else" statements.


Fig. 10. A mission plan

A launcher sequence (example provided below) is built using instructions which permit to:

- Execute commands (in parallel or not),
- Wait (for a duration, or for the end of a command execution).

Fig. 11. A launcher sequence

Notice that a launcher sequence looks like a mission plan; the main differences are:

- In a mission plan, commands cannot be executed in parallel, while this is authorized in a launcher sequence,
- A launcher sequence is considered atomic: it cannot be interrupted by another command request (e.g. a failure recovery command).

B. Design

Real-time design is also supported by the VASCO Domain Specific Language. At this level of the engineering process, VASCO is used to formalize the tasking and the order of subprogram calls, as shown in the figure below.

Fig. 12. Real time design model

C. Coding

An in-house suite of tools takes data from the different functional and design models (SysML and DSL) and generates part of the Flight Software code; this covers:

- Implementation of the data flows using data buffers (taking into account the multi-threaded real-time design of the Flight Software),
- Implementation of the different threads and of the associated sequencer of subprograms,
- Instantiation of in-house building blocks (which implement the Functional Unit and Launcher Management generic mechanisms) to implement:
  o Mission plans and launcher sequences,
  o And, for each Functional Unit: the finite state machine (execution of commands and related transitions), the processing/monitoring activation conditions, the interfaces with hardware, and telemetry.

The figure below provides an overview of the automatic code generation chain.

Fig. 13. Automatic Code Generation chain

Finally, the code which shall be hand-written is:

- The algorithms for each processing/monitoring,
- The algorithms for transitions between modes/configurations (sequence of commands to configure hardware, setting of software data).
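The generic finite-state-machine building block that such a generator instantiates for each Functional Unit might look like the following sketch. This is a hypothetical illustration in Python (the generated flight code is Ada, and all names here are invented):

```python
# Illustrative sketch of a generated Functional Unit state machine: command
# execution drives mode transitions, and illegal commands are rejected.
class FunctionalUnitFSM:
    def __init__(self, name, transitions, initial):
        self.name = name
        self.state = initial
        # transitions: (current_mode, command) -> new_mode
        self.transitions = transitions

    def execute(self, command):
        """Execute a Functional Unit command; reject it if no transition is
        defined from the current mode (configuration consistency check)."""
        key = (self.state, command)
        if key not in self.transitions:
            raise ValueError(f"{self.name}: {command} illegal in {self.state}")
        self.state = self.transitions[key]
        return self.state
```

Because the transition table is generated directly from the SysML statechart, the mode logic in the code stays consistent with the functional definition by construction; only the algorithms attached to each mode remain hand-written.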

VI. FIRST FEEDBACK

The method presented in this paper has been applied on the Ariane 6 project for more than one year. It has confirmed that:

- Co-engineering between the system and software teams is a success: the functional design is much more mature and the interfaces are well mastered when starting product development. Several definition problems have been detected early.
- Modifications in the system definition can be quickly implemented, thanks to modelling and automatic code generation.



The lessons learned are the following:

- Training the team on the method is essential, as is providing support all along the development.
- Modelling guidelines and rules shall be defined before starting the project, and tools dedicated to rule checking shall be developed; it is very important to maintain the modelling rules and the associated tools all along the development. Do not underestimate the workload of these activities.
- Considering the long life of the Flight Software (several decades), it is important to implement the method in a way that is as independent as possible of commercial products (e.g. the modelling tool).
- Models shall be managed in configuration control.

To conclude, the method applied on Ariane 6 for Flight Software development has shown its efficiency; and, as this method uses sufficiently general concepts for functional engineering formalization, it can be used for any type of software-intensive system development.

REFERENCES

[1] Ariane 6, https://en.wikipedia.org/wiki/Ariane_6
[2] Software crisis, ESA Board for Software Standardisation and Control (BSSC), ESTEC, 10-11/02/2005
[3] Ada 2012, http://www.ada2012.org/
[4] Rhapsody, http://www-03.ibm.com/software/products/en/ratirhapfami
[5] SysML, www.omgsysml.org



The Infinite Software Development Lifecycle of Connected Systems

Mark W. Richardson
Lead Field Application Engineer
LDRA
Wirral, UK
mark.richardson@ldra.com

I. INTRODUCTION

Anyone familiar with functional safety standards such as DO-178C 1, IEC 61508 2, ISO 26262 3, or IEC 62304 4 will know all about the concept of bi-directional traceability of requirements, and the need to ensure that the design reflects the requirements, that the software implementation reflects the design, and that the test processes confirm the correct implementation of that software.

Anyone used to developing safety-critical software applications will also be familiar with how painful a change of requirements can be, because of the need to identify the code to be changed, and to then identify any testing to be repeated.

Until now, that cycle has concluded with product release. Sure, there might be tweaks in response to field conditions, but the business of development is then essentially over.

Then came the connected car, the Industrial Internet of Things, and the remote monitoring of medical devices. For these or any other connected systems, requirements don't just change in an orderly manner during development. They change without warning - whenever some smart alec finds a new vulnerability, develops a new hack, or puts your car into a ditch. And they keep on changing, not just throughout development but for as long as the product is out in the field, changing the significance and emphasis of the product maintenance phase.

This paper outlines how next-generation automated management and requirements traceability tools and techniques can create relationships between requirements, code, static and dynamic analysis results, and unit- and system-level tests. It demonstrates how linking these elements enables the entire software development cycle to become traceable, making it easy for teams to identify problems and implement solutions faster and more cost-effectively. And it highlights how such linked elements are even more important after product release, presenting a vital competitive advantage in dealing with the sinking feeling that starts with the message "We've been hacked".

II. PROCESS OBJECTIVES

The ISO 26262 automotive functional safety standard serves as an example, but the principles discussed apply equally to the other safety-critical industries and standards. Although terminology varies, a key element common to all such standards is the practice of allocating technical safety requirements in the system design specification, and developing that design further to derive an item integration and testing plan. This applies to all aspects of the system, with the explicit subdivision of hardware and software development practices being dealt with as the lifecycle progresses.

The relationship between the system-wide ISO 26262-4:2011 and the software-specific sub-phases found in ISO 26262-6:2011 can be represented in a V-model (Figure 1). Each of those steps is explained further in the following discussion.

1 RTCA DO-178C "Software Considerations in Airborne Systems and Equipment Certification", http://www.rtca.org
2 IEC 61508-1:2010, Functional safety of electrical/electronic/programmable electronic safety-related systems
3 ISO 26262:2011, Road vehicles - Functional safety
4 IEC 62304, Medical device software - Software life cycle processes, Consolidated Version, Edition 1.1, 2015-06



Figure 1 - Software-development V-model with cross-references to ISO 26262 and standard development tools

System design (ISO 26262-4:2011 section 7)

The products of this system-wide design phase potentially include CAD drawings, spreadsheets, textual documents and many other artefacts, and clearly a variety of tools can be involved in their production. This phase also sees the technical safety requirements refined and allocated to hardware and software. Maintaining traceability between these requirements and the products of subsequent phases generally causes a project management headache.

The tools for requirements management can range from a simple spreadsheet or Microsoft Word document to a purpose-designed requirements management tool such as IBM Rational DOORS Next Generation 5 or Siemens Polarion REQUIREMENTS 6. The selection of appropriate tools will help in the maintenance of bi-directional traceability between phases of development, as discussed later.

Specification of software safety requirements (ISO 26262-6:2011 section 6)

This sub-phase focuses on the specification of software safety requirements to support the subsequent design phases, bearing in mind any constraints imposed by the hardware. It provides the interface between the product-wide system design of ISO 26262-4:2011 and the software-specific ISO 26262-6:2011, and details the process of evolution of the lower-level, software-related requirements. It will most likely involve the continued use of the requirements management tools discussed in relation to the system design phase.

Software architectural design (ISO 26262-6:2011 section 7)

There are many tools available for the generation of the software architectural design, with graphical representation of that design an increasingly popular approach. Appropriate tools are exemplified by MathWorks® Simulink®, IBM® Rational® Rhapsody®, and ANSYS® SCADE.

Static analysis tools contribute to the verification of the design by means of control and data flow analysis of the code derived from it, providing graphical representations of the relationships between code components for comparison with the intended design (Figure 2).

Figure 2 - Graphical representation of Control and Data Flow as depicted in the LDRA tool suite

A similar approach can also be used to generate a graphical representation of legacy system code, providing a path for additions to it to be designed and proven in accordance with ISO 26262 principles.

Software unit design and implementation (ISO 26262-6:2011 section 8)

Coding rules: The illustration in Figure 3 is a typical example of a table from ISO 26262-6:2011. It shows the coding and modelling guidelines to be enforced during implementation, superimposed with an indication of where compliance can be confirmed using automated tools.

5 IBM® Rational® DOORS®, http://www-03.ibm.com/software/products/en/ratidoor
6 Siemens Polarion® REQUIREMENTS™, https://polarion.plm.automation.siemens.com/products/polarion-requirements



These guidelines combine to make the resulting code more reliable, less prone to error, easier to test, and/or easier to maintain. Peer reviews represent a traditional approach to enforcing adherence to such guidelines, and although they still have an important part to play, automating the more tedious checks using tools is far more efficient, less prone to error, repeatable, and demonstrable.

Figure 3 - Mapping the capabilities of the LDRA tool suite to "Table 6: Methods for the verification of the software architectural design" specified by ISO 26262-6:2011 7

ISO 26262-6:2011 highlights the MISRA 8 coding guidelines language subsets as an example of what could be used. There are many different sets of coding guidelines available, but it is entirely permissible to use an in-house set, or to manipulate, adjust and add to one of the standard sets to make it more appropriate for a particular application (Figure 4).

Figure 4 - Highlighting violated coding guidelines in the LDRA tool suite

Software architectural design and unit implementation: Establishing appropriate project guidelines for coding, architectural design and unit implementation are clearly three discrete tasks, but software developers responsible for implementing the design need to be mindful of them all concurrently.

As for the coding guidelines before them, the guidelines relating to software architectural design and unit implementation are founded on the notion that they make the resulting code more reliable, less prone to error, easier to test, and/or easier to maintain. For example, architectural guidelines include:

- Restricted size of software components and restricted size of interfaces, recommended not least because large, rambling functions are difficult to read, maintain, and test - and hence more susceptible to error.
- High cohesion within each software component. High cohesion results from the close linking between the modules of a software program, which in turn impacts how rapidly it can perform the different tasks assigned to it.

Figure 5 - Output from control and data coupling analysis as represented in the LDRA tool suite

Static analysis tools can provide metrics to ensure compliance with the standard, such as complexity metrics as a product of interface analysis, cohesion metrics evaluated through data object analysis, and coupling metrics via data and control coupling analysis (Figure 5).

More generally, static analysis can help to ensure that the good practices required by ISO 26262:2011 are adhered to, whether they are coding rules, design principles, or principles for software architectural design. In practice, for developers who are newcomers to ISO 26262, the role of such a tool often evolves from a mechanism for highlighting violations to a means of confirming that there are none.

Software unit testing (ISO 26262-6:2011 section 9) and software integration and testing (ISO 26262-6:2011 section 10)

Just as static analysis techniques (involving an automated "inspection" of the source code) are applicable across the sub-phases of coding, architectural design and unit implementation, dynamic analysis techniques (involving the execution of some or all of the code) are applicable to unit, integration and system testing. Unit testing is designed to focus on particular software procedures or functions in isolation, whereas integration testing ensures that safety and functional requirements are met when units are working together in accordance with the software architectural design.

7 Based on table 6 from ISO 26262-6:2011, Copyright © 2015 IEC, Geneva, Switzerland. All rights acknowledged.
8 MISRA - The Motor Industry Software Reliability Association, https://www.misra.org.uk/
design.<br />

The ISO 26262-6:2011 tables list techniques and metrics for performing unit and integration tests on target hardware, to ensure that the safety and functional requirements are met and that software interfaces are verified at the unit and integration levels. Fault injection and resource tests further prove robustness and resilience and, where applicable, back-to-back testing of model and code helps to prove the correct interpretation of the design. The artefacts associated with these techniques provide both a reference for their management and evidence of their completion. They include the software unit design specification, test procedures, verification plan and verification specification. On completing each test procedure, pass/fail results are reported and compliance with requirements is verified appropriately.

Should changes become necessary - perhaps as a result of a failed test, or in response to a requirement change from a customer - then all impacted unit and integration tests would need to be re-run (regression tested), automatically re-applying those tests through the tool to ensure that the changes do not compromise any established functionality.

ISO 26262:2011 does not require that any of the tests it promotes deploy software test tools. However, just as for static analysis, dynamic analysis tools help to make the test process far more efficient, especially for substantial projects.

Figure 6 - Performing requirements-based unit testing using the LDRA tool suite

The example in Figure 6 shows how the software interface is exposed at the function scope, allowing the user to enter inputs and expected outputs to form the basis of a test harness. The harness is then compiled and executed on the target hardware, and the actual and expected outputs are compared. Unit tests become integration tests as units are introduced as part of a call tree, rather than being "stubbed". Exactly the same test data can be used to validate the code in both cases.

Boundary values can be analysed by automatically generating a series of unit test cases, complete with associated input data. The same facility also allows the definition of equivalence boundary values such as the minimum value, the value below the lower partition value, the lower partition value, the upper partition value, and the value above the upper partition boundary.

Figure 7 - Examples of representations of structural coverage within the LDRA tool suite
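The idea behind such automatic boundary-value generation can be sketched in a few lines. This is a simplified illustration (the function and parameter names are assumptions, not the LDRA tool suite's API): for an integer input partition [lower, upper], it emits the classic boundary candidates named above.

```python
# Illustrative sketch: generating equivalence-partition boundary test values
# for an integer input range, as a unit test facility might.
def boundary_values(lower, upper, minimum=None):
    """Return the named boundary candidates for the partition [lower, upper]."""
    values = {
        "below lower partition value": lower - 1,
        "lower partition value": lower,
        "upper partition value": upper,
        "above upper partition value": upper + 1,
    }
    if minimum is not None:
        values["minimum value"] = minimum
    return values
```

Each generated value then seeds one unit test case, with the values just outside the partition exercising the defensive handling of out-of-range inputs.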

Structural coverage metrics: In addition to showing that the software functions correctly, dynamic analysis is used to generate structural coverage metrics. In conjunction with the coverage of requirements at the software unit level, these metrics provide the necessary data to evaluate the completeness of test cases and to demonstrate that there is no unintended functionality (Figure 7). Metrics recommended by ISO 26262:2011 include function, call, statement, branch and MC/DC coverage.

Unit and system test facilities can operate in tandem, so that (for instance) coverage data can be generated for most of the source code through a dynamic system test, and then be complemented using unit tests to exercise, for example, any defensive constructs which are inaccessible during normal system operation.

Bi-directional traceability (ISO 26262-4:2011 and ISO 26262-6:2011)

Bi-directional traceability runs as a principle throughout ISO 26262:2011, with each development phase required to accurately reflect the one before it. In theory, if the exact sequence of the V-model is adhered to, then the requirements will never change and the tests will never throw up a problem. But life's not like that.

Consider, then, what happens if there is a code change in response to a failed integration test, perhaps because the requirements are inconsistent or there is a coding error. What other software units were dependent on the modified code?

Such scenarios can quickly lead to situations where the<br />

traceability between the products of software development<br />

falls down. Once again, while it is possible to maintaining<br />

traceability manually, automation helps a great deal.<br />

Software unit design can take many forms – perhaps in the<br />

form of a natural language detailed design document, or<br />

perhaps model based. Either way, these design elements<br />

need to be bi-directionally traceable to both software safety<br />

requirements and the software architecture. The software<br />

units must then be implemented as specified and then be<br />

traceable to their design specification.<br />

Automated requirements traceability tools are used to<br />

establish traceability between requirements and tests cases<br />

of different scopes, which allows test coverage to be<br />

assessed (Figure 8). The impact of failed test cases can be<br />

assessed and addressed, as can the impact of requirements<br />

changes and gaps in requirements coverage. And artefacts<br />

such as traceability matrices can be automatically<br />

generated to present evidence of compliance to ISO<br />

26262:2011.<br />

Figure 8 - Performing requirement based testing. Test<br />

cases are linked to requirements and executed within the<br />

LDRA tool suite<br />
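The generated traceability matrix essentially tabulates requirement-to-test links and flags gaps. A minimal sketch of the idea follows; the identifiers and the hand-written table are invented, whereas a real tool derives the links from the project database.<br />

```python
# Minimal sketch of auto-generating a traceability matrix from
# requirement-to-test-case links (all data is hypothetical).

links = {
    "SYS-001": ["TC-101", "TC-102"],
    "SYS-002": ["TC-103"],
    "SYS-003": [],            # gap: no test case traces here
}

def traceability_matrix(links):
    """Return (requirement, linked tests) rows, flagging gaps."""
    rows = []
    for req in sorted(links):
        tests = ", ".join(links[req]) or "UNCOVERED"
        rows.append((req, tests))
    return rows

for req, tests in traceability_matrix(links):
    print(f"{req:8} -> {tests}")
```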

In practice, initial structural coverage is usually accrued as<br />
part of this holistic process from the execution of<br />
functional tests on instrumented code, leaving unexecuted<br />

portions of code which require further analysis. That<br />

ultimately results in the addition or modification of test<br />

cases, changes to requirements, and/or the removal of dead<br />

code. Typically, an iterative sequence of review, correct<br />

and analyse ensures that design specifications are satisfied.<br />

III.<br />

THE INFINITE DEVELOPMENT LIFECYCLE<br />

When such changes become necessary, revised code needs<br />

to be reanalysed statically, and all impacted unit and<br />

integration tests need to be re-run (regression tested).<br />

Although that can result in a project management nightmare<br />

at the time, in an isolated application the need to support<br />

such occurrences lasts little longer than the time the product<br />

is under development.<br />

But connectivity demands the ability to respond to<br />

vulnerabilities identified in the field. Each newly discovered<br />

vulnerability implies a changed or new requirement, and<br />

one to which an immediate response is needed – even<br />

though the system itself may not have been touched by<br />

development engineers for quite some time. In such<br />

circumstances, being able to isolate what is needed and<br />

automatically test only the functions implemented becomes<br />

something much more significant.<br />

Whenever a new vulnerability is discovered, there is a<br />

resulting change of requirement to cater for it, coupled with<br />

the additional pressure of knowing that a speedy response<br />

could be critically important if products are not to be<br />

compromised in the field.<br />

Automated bi-directional traceability links requirements<br />

from a host of different sources through to design, code<br />

and test. The impact of any requirements changes – or,<br />

indeed, of failed test cases - can be assessed by means of<br />

impact analysis, and addressed accordingly. And artefacts<br />

can be automatically re-generated to present evidence of<br />

continued compliance to the functional safety standard of<br />

choice.<br />
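The impact analysis described above amounts to walking the trace links downstream from a change. A deliberately small sketch, with an invented trace graph, shows the mechanics:<br />

```python
# Sketch of impact analysis over trace links: starting from a changed
# requirement, walk the links to find every dependent artefact.
# The graph below is illustrative only.

trace = {
    "HLR-1": ["LLR-1", "LLR-2"],
    "LLR-1": ["src/flap.c", "TC-11"],
    "LLR-2": ["src/flap.c", "TC-12"],
    "HLR-2": ["LLR-3"],
    "LLR-3": ["src/lamp.c", "TC-21"],
}

def impacted(artefact, graph):
    """Return all artefacts reachable downstream of a change."""
    seen, stack = set(), [artefact]
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# A change to HLR-1 touches both low-level requirements, one source
# file and two test cases; HLR-2's subtree is unaffected.
print(sorted(impacted("HLR-1", trace)))
```

Only the tests in the impacted set need re-running, which is what makes an automated response practical.<br />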

IV.<br />

CONCLUSIONS<br />

Functional safety standards such as DO-178C, IEC 61508,<br />
ISO 26262 and IEC 62304 have made bi-directional<br />
traceability of requirements a familiar concept<br />
to anyone working in those industries: it ensures that the<br />
design reflects the requirements, that the software<br />
implementation reflects the design, and that the test<br />
processes confirm the correct implementation of that<br />
software.<br />

Anyone used to developing safety-critical software<br />
applications will also be familiar with how painful a<br />
change of requirements can be, with its resulting changes to<br />
design and code, and the consequent retesting.<br />

Although functional safety standards have significant<br />

contributions to make to both safety and security, there is no<br />

doubt that they bring considerable overhead with them. The<br />

application of automated tools throughout the development<br />

lifecycle can help considerably to minimize that overhead,<br />

whilst removing much of the potential for human error from<br />

the process.<br />

Never has that been more significant than now. Connectivity<br />

changes the notion of the development process ending when<br />

a product is launched, and whenever a new vulnerability is<br />

discovered in the field, there is a resulting change of<br />

requirement to cater for it. Responding to those requirements<br />

places new emphasis on the need for an automated solution,<br />

both during the development lifecycle and beyond.<br />



COMPANY DETAILS<br />

LDRA<br />

Portside<br />

Monks Ferry<br />

Wirral<br />

CH41 5LH<br />

United Kingdom<br />

Tel: +44 (0)151 649 9300<br />

Fax: +44 (0)151 649 9666<br />

E-mail:info@ldra.com<br />

CONTACT DETAILS<br />

Presentation Co-ordination<br />

Mark James<br />

Marketing Manager<br />

E:mark.james@ldra.com<br />

Presenter<br />

Mark Richardson<br />

Lead Field Applications Engineer<br />

E:mark.richardson@ldra.com<br />



Automating the maintenance of bi-directional<br />

requirements traceability<br />

Mark A. Pitchford<br />

Technical Specialist<br />

LDRA<br />

Wirral, UK<br />

mark.pitchford@ldra.com<br />

I. INTRODUCTION<br />

Although the ever improving techniques in safety- and<br />

mission-critical software development and test are proven to<br />

yield significant improvements in software quality, they come<br />

to naught if the resulting application fails to perform as<br />

expected by the stakeholders – not just functionally but also<br />

with adequate regard for safety.<br />

Small wonder then that depending on the criticality of the<br />

application, requirements traceability is obligatory for<br />

certifiable, safety-critical applications to ensure that all<br />

requirements are implemented, and that all development<br />

artefacts can be traced back to one or more requirements.<br />

When requirements are managed well, traceability can be<br />

established between each development phase. For example,<br />

the resulting bi-directional traceability demonstrates that all<br />

system level requirements have been completely addressed by<br />

high level requirements, and that all high level requirements<br />

can be traced to a valid system level requirement.<br />

Requirements traceability also encompasses the relationships<br />

between entities such as intermediate and final work products,<br />

changes in design documentation, and test plans.<br />

The principle of bi-directional traceability has been<br />

established in the avionics community since no later than 1992<br />

when the DO-178B document 1 was introduced (since<br />

succeeded by DO-178C 2 ), and the introduction of other<br />

functional safety standards such as ISO 26262 3 in the<br />

automotive industry, IEC 62304 4 in the medical device sector,<br />

and the more generic IEC 61508 5 has seen that principle<br />

embraced more widely.<br />

Although it is both a logical and laudable principle, last<br />

minute changes of requirements or code made to correct<br />

problems identified during test put such ideals into disarray.<br />

Despite good intentions, many projects fall into a pattern of<br />

disjointed software development in which requirements,<br />

design, implementation, and testing artefacts are produced<br />

from isolated phases. Such isolation results in tenuous links<br />

between requirements, the development stages, and/or the<br />

development teams.<br />

The answer to this conundrum lies in the “trace data” between<br />

development processes which sits at the heart of any project.<br />

Whether or not the links are physically recorded and managed,<br />

they still exist. For example, a developer creates a link simply<br />

by reading a design specification and using that to drive the<br />

implementation. The collective relationships between these<br />

processes and their associated data artefacts can be viewed as<br />

a Requirements Traceability Matrix, or RTM. When the RTM<br />

becomes the centre of the development process, it impacts on<br />

all stages of safety-critical application development from high-level<br />

requirements through to target-based testing.<br />

Beyond that, the nature of connectivity calls into question<br />

when the development process itself comes to an end. The<br />

advent of the connected car, the interactive medical device<br />

and Industrial IoT means that requirements can change at any<br />

time – not just during the traditional development lifecycle,<br />

but after it has been completed and even after a product’s<br />

production life is over. Any newly discovered vulnerability or<br />

actual compromise of a system implies an additional<br />

1 RTCA DO-178B, "Software Considerations in Airborne Systems and Equipment Certification", http://www.rtca.org<br />
2 RTCA DO-178C, "Software Considerations in Airborne Systems and Equipment Certification", http://www.rtca.org<br />
3 ISO 26262, "Road vehicles - Functional safety"<br />
4 IEC 62304, "Medical device software - Software life cycle processes", Consolidated Version, Edition 1.1, 2015-06<br />
5 IEC 61508-1:2010, "Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 1: General requirements"<br />

requirement to counter it, bringing with it a new emphasis on<br />

traceability even into the product maintenance phase.<br />

II.<br />

REQUIREMENTS ARE AN ONGOING COMMITMENT<br />

How often is the requirements specification baselined and then<br />

never referred to again? Without constant reference, how can a<br />

development team be sure that it is delivering a system which<br />

meets those requirements? By constructing trace links between<br />

requirements and development components from the very<br />

beginning of a project, problems such as missing or non-required<br />

functionality will be discovered earlier and, thus, will<br />

be easier and less costly to remedy. Unfortunately there are<br />

many factors which make it difficult or tiresome to maintain<br />

reference to a project’s set of requirements.<br />

At the start of a contract, the stakeholder sets out their vision<br />

for what they want from the delivered application. The project<br />

team then works to represent that vision as a set of<br />

requirements from which development can begin. The<br />

requirements should act as a blueprint for development.<br />

However, all too often, the team’s efforts diverge from this<br />

blueprint resulting in an application which does not align with<br />

the requirements. At best the stakeholder is disappointed. At<br />

worst, the company opens itself up to litigation and costly<br />

remedial work.<br />

The key to preventing the emergence of this “requirements<br />

gap” is to place the requirements at the forefront of<br />

development. To achieve this successfully, the process should<br />

not be too intrusive and the aim should be for it to help all<br />

participants equally, avoiding bias towards any particular<br />

disciplines or development phases.<br />

As a basis for all validation and verification tasks, all high<br />

quality software must start with a definition of requirements.<br />

Each high level software requirement must map to a lower<br />

level requirement, design and implementation. The objective<br />

is to ensure that the complete system has been implemented as<br />

defined - a fundamental element of sound software<br />

engineering practice.<br />

Simply ensuring that high level requirements map to<br />

something tangible in the requirements decomposition tree,<br />

design and implementation is not enough. The complete set of<br />

system requirements comes from multiple sources, including<br />

high level requirements, low level requirements and derived<br />

requirements. As illustrated in Figure 1 below, there is<br />

seldom a 1:1 mapping from high level requirements to source<br />

code, so a traceability mechanism is required to map and<br />

record the dependency relationships of requirements<br />

throughout the requirements decomposition tree.<br />

Figure 1 - Example of “1:Many” mapping from high level<br />

requirement through to requirements decomposition tree<br />
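Because the forward links already define the decomposition, the reverse direction of each link can be derived rather than recorded twice. A brief sketch with invented identifiers:<br />

```python
# Sketch: with 1:many decomposition, bi-directional navigation can be
# derived from a single recorded forward mapping (identifiers invented).

forward = {
    "HLR-1": ["LLR-1", "LLR-2", "LLR-3"],   # one high-level, many low-level
    "HLR-2": ["LLR-3"],                     # LLR-3 satisfies two parents
}

def invert(links):
    """Build the child -> parents mapping from parent -> children."""
    back = {}
    for parent, kids in links.items():
        for k in kids:
            back.setdefault(k, []).append(parent)
    return back

print(invert(forward)["LLR-3"])
```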

To complicate matters further, each level of requirements<br />

might be captured using a different mechanism. For instance,<br />

a formal requirements capture tool might be used for the high<br />

level requirements while the low level requirements are<br />

captured in PDF and the derived requirements captured in a<br />

spreadsheet.<br />

III.<br />

THE WATERFALL PROCESS AND OTHER STORIES<br />

With the initial requirements specified, development can<br />

proceed in accordance with the specified process for the<br />

project. It is useful to consider what impact the requirements<br />

have on the chosen process, and vice versa.<br />

Back in the 80s and 90s the “Waterfall” process dominated<br />

software development. Waterfall development processes are<br />

generally broken down into four, distinct phases, as illustrated<br />

in Figure 2.<br />

Figure 2 - The real- life implementation of a project is rarely<br />

as simple as the “Waterfall” process suggests<br />

Each phase is performed in isolation with the output of one<br />

being the input for the next. The final output, in theory<br />

anyway, is a working system which passes all tests.<br />

The purpose of the analysis phase is to refine the stakeholder’s<br />

vision of the system and produce a list of requirements, with<br />

those for software being itemised in the Software<br />

218


Requirements Specification (SRS). However much a project<br />

manager may wish for the SRS to be error-free, it rarely is, and<br />

the change log begins increasing in size until a new version<br />

becomes inevitable.<br />

Contemporary software development processes and practices,<br />

such as the Iterative process, address many of the deficiencies<br />

found in the Waterfall process.<br />

Figure 3 - The "Iterative" process is just one example of how the "Waterfall"<br />
process has been refined to reflect the dynamic nature of projects and their requirements<br />
<br />
Requirements will change during the life of a project, whether due to stakeholders<br />
altering their vision or requested features proving to be unfeasible. Iterative<br />
processes (Figure 3) embrace this fact by splitting development into a number of<br />
phases (iterations) and considering only a subset of requirements during each<br />
iteration. Thus, the number of requirements subject to revision through feedback is<br />
significantly reduced; meanwhile development and refinement of those requirements<br />
not yet marked for implementation may proceed and benefit from any quality<br />
improvements applied to those requirements being implemented.<br />
<br />
Iterative processes retain the Waterfall phases, perhaps with altered names and<br />
additional disciplines added. With reference again to Figure 2 and Figure 3, the<br />
'Requirements' discipline is analogous to the 'Analysis' phase. However, the key<br />
difference is that, although most effort is invested during early iterations as we<br />
would expect, effort continues over the life of the project (albeit at a gradually<br />
reducing level).<br />
<br />
Iterations can be thought of as mini Waterfall projects. A subset of the envisioned<br />
system is selected for implementation and then taken through the phases of analysis,<br />
design, construction and test. At the end of the iteration, the subset is expanded to<br />
include additional features and a new mini Waterfall begins. This process ensures<br />
that the requirements are continually being revisited and refined, keeping them in<br />
focus and using them to drive development.<br />
<br />
IV.<br />
<br />
THE ART OF REQUIREMENTS<br />
<br />
Requirements need to be high quality. If they are too complex or cannot be easily<br />
understood, in later phases they will be difficult to follow and a lot of time will<br />
be wasted requesting modifications and refinements. Confidence in the requirements<br />
needs to be kept high, otherwise the willingness to work with them will diminish,<br />
risking divergence from the stakeholder's vision of the application. If the starting<br />
point of a project is of poor quality, then low-quality software is sure to follow.<br />
<br />
Given the overwhelming complexity of many projects, if all stakeholders are to share<br />
a common commitment to requirements then they must be understandable, unambiguous<br />
and precise. Such an environment will help alleviate scope creep and requirements<br />
churn, and will ensure that the delivered solution meets the stakeholder needs. It<br />
will also provide a mechanism to ensure adherence to any applicable software and<br />
industry standards.<br />
<br />
A. Textual Specifications<br />
<br />
Textual specifications remain a popular way to capture requirements, and although<br />
they can be highly effective there are some disadvantages. For example, the<br />
stakeholder may prefer layman's language whereas the contractor naturally leans<br />
towards technical jargon and their plans for the implementation. Furthermore, in<br />
conversational form, spoken language is inherently imprecise and prone to ambiguity.<br />
However, if a high degree of rigour is applied then such pitfalls can be overcome.<br />
One approach is to apply rules when writing requirements in much the same way as the<br />
MISRA standards are applied to C and C++ code; for example:<br />
<br />
• Use paragraph formatting to distinguish requirements from non-requirement text<br />
• List only one requirement per paragraph<br />
• Use the verb "shall"<br />
• Avoid "and" in a requirement<br />
o Consider refactoring as multiple requirements or specifying in more general terms<br />
• Avoid conditional language such as "unless" or "only if"<br />
o Such terms are likely to lead to ambiguous interpretation<br />
<br />
The use of such key words also helps if some members of the development team are<br />
less fluent in the chosen requirements language than others.<br />
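Rules like these lend themselves to automated checking, in the same spirit as running a MISRA checker over code. The sketch below is a deliberately naive requirement "linter" using simple substring matching; the rule names are invented.<br />

```python
# Hypothetical sketch of automating the requirement-writing rules above.
# Naive substring checks only; a real tool would parse properly.

RULES = [
    ("missing 'shall'",      lambda t: "shall" not in t.lower()),
    ("contains 'and'",       lambda t: " and " in t.lower()),
    ("conditional language", lambda t: any(w in t.lower()
                                           for w in ("unless", "only if"))),
]

def check_requirement(text):
    """Return the names of all rules the requirement text violates."""
    return [name for name, broken in RULES if broken(text)]

print(check_requirement("The controller shall close the valve."))
print(check_requirement("Close the valve and log the event unless idle"))
```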



B. Use Cases<br />

Use Cases 6 or User Stories offer another way to organise<br />

requirements, and to reduce ambiguous or imprecise<br />

specification. The example in figure 4 clearly shows what is<br />

expected to happen under a particular set of circumstances.<br />

The reduced dependence on natural language is particularly<br />

beneficial to international companies that do not share a<br />

common spoken language. Graphical representation of<br />

requirements switches the angle of analysis from a line-by-line,<br />

itemised list of desired features (perhaps spreading over<br />

many pages) to a user-focused view of how the system will<br />

interact with external elements and what value it will deliver.<br />

On the other hand, a disadvantage to this approach is that,<br />
whereas precise written language is sure to be universally understood<br />
by anyone fluent in it, not everyone involved in the project,<br />
particularly at its periphery, will have the inclination to learn<br />
the nuances of Use Case diagrams.<br />

Each Use Case or User Story comprises several scenarios.<br />

The first scenario as illustrated in Figure 4 is always the “basic<br />

path” or “sunny day scenario” in which the actor and system<br />

interact in a normal, error-free way.<br />

Figure 4 - This example of a "Sunny day" scenario from an "Allow Authorised Access"<br />
Use Case shows how a system is expected to behave when a valid key card is swiped<br />
<br />
As the list of scenarios is established via this end-to-end analysis, the<br />
stakeholder's vision is rigorously exercised, allowing ambiguities and problems to<br />
be ironed out. Each scenario is assigned a priority enabling the complete set to be<br />
ranked, allowing the project team to plan each iteration and select which subset of<br />
the system will be implemented.<br />
<br />
V. REQUIREMENTS MANAGEMENT AND TRACEABILITY<br />
<br />
However well the requirements are specified and their place in the development<br />
process established, a mechanism is required to ensure that they are reflected in<br />
the implementation of the project. Requirements traceability is widely accepted as a<br />
development best practice to ensure that all requirements are implemented and that<br />
all development artefacts can be traced back to one or more requirements. Like IEC<br />
61508, ISO 26262 and IEC 62304 amongst others, the DO-178C standard requires<br />
bi-directional traceability and has a constant emphasis on the need for the<br />
derivation of one development tier from the one above it. Paragraph 5.5 c typifies<br />
this when it states:<br />
<br />
"Trace Data, showing the bi-directional association between low-level requirements<br />
and Source Code, is developed. The purpose of this Trace Data is to:<br />
<br />
1. Enable verification that no Source Code implements an undocumented function.<br />
<br />
2. Enable verification of the complete implementation of the low-level<br />
requirements."<br />
<br />
The level of traceability required by standards such as this varies with the<br />
criticality of the application. For example, less critical avionics applications<br />
designated DO-178C Level D (or "DAL D") are known as "black box", meaning that there<br />
is no focus on how the software has been developed. That means there is no need to<br />
have any traceability to the source code or software architecture. It is only<br />
required that the System Software requirements are traced to the High-Level<br />
Requirements and then to the test cases, test procedures and test results.<br />
<br />
For the more demanding DO-178C levels B and C, the source code development process<br />
is considered significant and so evidence of bi-directional traceability is required<br />
from the High Level requirements to the Low Level Requirements and then to the<br />
source code.<br />
<br />
Ultimately, for level A projects, there is a need to trace beyond the source code<br />
down to the executable object code.<br />
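The two verifications that DO-178C expects Trace Data to enable reduce to set comparisons over the recorded links. A sketch over an invented requirement-to-function mapping:<br />

```python
# Sketch of the two checks enabled by Trace Data, over a toy mapping
# between low-level requirements and source functions (names invented).

llr_to_code = {
    "LLR-1": ["extend_flap"],
    "LLR-2": ["retract_flap"],
    "LLR-3": [],                      # implemented nowhere
}
functions_in_source = {"extend_flap", "retract_flap", "debug_dump"}

traced = {fn for fns in llr_to_code.values() for fn in fns}

# Check 1: no source code implements an undocumented function.
undocumented = functions_in_source - traced

# Check 2: every low-level requirement is completely implemented.
unimplemented = [r for r, fns in llr_to_code.items() if not fns]

print("undocumented:", sorted(undocumented))
print("unimplemented:", unimplemented)
```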

While bi-directional traceability is and always has been a<br />

laudable principle, last minute changes of requirements or<br />

code made to correct problems identified during test tend to<br />

put such ideals in disarray. Despite good intentions, many<br />

projects fall into a pattern of disjointed software development<br />

in which requirements, design, implementation, and testing<br />

artefacts are produced from isolated development phases.<br />

Such isolation results in tenuous links between the<br />

requirements stage and / or the development team.<br />

Processes like the waterfall and iterative examples show each<br />

phase flowing into the next, perhaps with feedback to earlier<br />

6 TechTarget Definition: use case, http://searchsoftwarequality.techtarget.com/definition/use-case<br />



phases. Traceability is assumed to be part of the relationships<br />

between phases; however, the mechanism by which trace links<br />

are recorded is seldom stated. The reality is that, while each<br />

individual phase may be conducted efficiently thanks to<br />

investment in up-to-date tool technology, these tools seldom<br />

contribute automatically to any traceability between the<br />

development tiers. As a result, the links between them become<br />

increasingly poorly maintained over the duration of projects.<br />

The Requirements Traceability Matrix (RTM) provides the<br />

solution to this problem and represents the logical extension of<br />

the required traceability between different phases. The links<br />

between phases can be ignored, or they can be acknowledged<br />

and properly managed. Either way, they are critical.<br />

Figure 5 - The RTM sits at the heart of the project, defining and describing the<br />
interaction between the design, code, test and verification stages of development.<br />
<br />
Figure 5 illustrates this alternative view of the development landscape, reflecting<br />
the importance that should be attached to the RTM. Due to this fundamental<br />
centrality, it is vital that project managers place the same priority on RTM<br />
construction and maintenance as they do on requirements management, version control,<br />
change management, modelling and testing. The RTM must be represented explicitly in<br />
any lifecycle model to emphasise its importance, as illustrated in Figure 6. With<br />
this elevated focus, it becomes the centre of the development process, impacting on<br />
all stages of design from high-level requirements through to target-based<br />
deployment.<br />
<br />
Figure 6 - The requirements traceability matrix (RTM) plays a central role in a<br />
development lifecycle model. Artefacts at all stages of development are linked<br />
directly to the requirements matrix, and changes within each phase automatically<br />
update the RTM.<br />
<br />
At the highest level, Requirements Management and Traceability tools can initially<br />
provide the ability to capture the requirements specified by standards such as the<br />
DO-178C standard. These requirements (or "objectives") can then be traced to Tier 1<br />
- the application-specific software and system requirements.<br />

These Tier 1 high-level requirements might consist of a<br />

definitive statement of the system to be developed (an<br />
aircraft flap control module, for instance) and the functional<br />

criteria it must meet (e.g., extending the flap to raise the lift<br />

coefficient). This tier may be subdivided depending on the<br />

scale and complexity of the system.<br />

Tier 2 describes the design of the system level defined by Tier<br />

1. With our flap example, the low-level requirements might<br />

discuss how the flap extension is varied, building on the need<br />

to do so established in Tier 1.<br />

Tier 3’s implementation refers to the source/assembly code<br />

developed in accordance with Tier 2. In our example, it is<br />

clear that the management of the flap extension is likely to<br />

involve several functions. Traceability of those functions back<br />

to Tier 2 requirements includes many-to-few relationships. It<br />

is very easy to overlook one or more of these relationships in a<br />

manually managed matrix.<br />

In Tier 4 host-based verification, formal verification begins.<br />

Using a test strategy that may be top-down, bottom-up or a<br />

combination of both, software stimulation techniques help<br />

create automated test harnesses and test case generators as<br />

necessary. Test cases should be repeatable at Tier 5 if<br />

required.<br />

At this stage, we confirm that the example software managing<br />

the flap position is functioning as intended within its<br />

development environment, even though there is no guarantee<br />

it will work when in its target environment. DO-178C<br />



acknowledges this and calls for the testing “to verify correct<br />

operation of the software in the target computer environment”.<br />

However, testing in the host environment first allows the<br />

target test (which is often more time consuming) to merely<br />

confirm that the tests remain sound in the target environment.<br />

In our example, we ensure in the host environment that<br />

function calls to the software associated with the flap control<br />

system return the values required of them in accordance with<br />

the requirements they are fulfilling. That information is then<br />

updated in the RTM.<br />

Our flap control system is now retested in the target<br />

environment, ensuring that the test results are consistent with<br />

those performed on the host. A further RTM layer shows that<br />

the tests have been confirmed.<br />
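Confirming host/target consistency reduces to comparing the two result sets test case by test case before the RTM records the tests as confirmed. A sketch with invented test identifiers and outcomes:<br />

```python
# Sketch of confirming target-environment results against the earlier
# host-environment run (test IDs and outcomes are illustrative).

host_results   = {"TC-11": "pass", "TC-12": "pass", "TC-13": "pass"}
target_results = {"TC-11": "pass", "TC-12": "fail", "TC-13": "pass"}

def inconsistent(host, target):
    """Test cases whose target outcome differs from the host outcome."""
    return sorted(t for t in host if target.get(t) != host[t])

print(inconsistent(host_results, target_results))
```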

VI.<br />

MAINTAINING THE REQUIREMENTS TRACEABILITY<br />

MATRIX<br />

A Requirements Traceability Matrix is a laudable aim<br />

irrespective of whether a standard insists on it. However,<br />

maintaining an RTM in a set of spreadsheets is a logistical<br />

nightmare, fraught with the risk of error and permanently<br />

lagging the actual project status.<br />

Constructing the RTM in a suitable tool not only maintains it<br />

automatically, but also opens up possibilities for filtering,<br />

quality checks, progress monitoring and metrics generation<br />

(Figure 7). The RTM is no longer a tedious, time-consuming<br />

task reluctantly carried out at the end of a project; instead it is<br />

a powerful utility which can contribute to its efficient running.<br />

The requirements become usable artefacts that are able to<br />

drive implementation and testing. Furthermore, many of the<br />

trace links may be captured simply by doing the day-to-day<br />

work of development, accelerating RTM construction and<br />

improving the quality of its contents.<br />

Modern requirements traceability solutions enable the<br />

extension of the requirements mapping down to the<br />

verification tasks associated with the source code. The<br />

screenshot below shows one such example of this. Using this<br />

type of requirements traceability tool, the 100% requirements<br />

coverage metric objective can be clearly measured, no matter<br />

how many layers of requirements, design and implementation<br />

decomposition are used. This makes monitoring system<br />

completion progress an extremely straightforward activity.<br />

Figure 7 - Traceability from high level requirements down to<br />

source code and verification tasks.<br />
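Measuring that coverage metric across arbitrary layers of decomposition can be sketched as a recursive roll-up: a parent requirement counts as covered only when every child does. The requirement tree and verification states below are invented.<br />

```python
# Sketch of rolling a coverage metric up through layers of requirements
# decomposition (tree and verification states are hypothetical).

children = {
    "SYS-1": ["HLR-1", "HLR-2"],
    "HLR-1": ["LLR-1"],
    "HLR-2": ["LLR-2", "LLR-3"],
}
verified_leaves = {"LLR-1": True, "LLR-2": True, "LLR-3": False}

def covered(req):
    kids = children.get(req)
    if kids is None:                      # leaf: covered if verified
        return verified_leaves.get(req, False)
    return all(covered(k) for k in kids)  # parent: all children covered

print(covered("HLR-1"))
print(covered("SYS-1"))   # one unverified leaf fails the whole tree
```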

VII.<br />

CONNECTIVITY AND THE INFINITE DEVELOPMENT<br />

LIFECYCLE<br />

During the development of a traditional, isolated system, that<br />

is clearly useful enough. But connectivity demands the ability<br />

to respond to vulnerabilities identified in the field. Each newly<br />

discovered vulnerability implies a changed or new<br />

requirement, and one to which an immediate response is<br />

needed – even though the system itself may not have been<br />

touched by development engineers for quite some time. In<br />

such circumstances, being able to isolate what is needed and<br />

automatically test only the functions implemented becomes<br />

something much more significant.<br />

Connectivity changes the notion of the development process<br />

ending when a product is launched, or even when its<br />

production is ended. Whenever a new vulnerability is<br />

discovered in the field, there is a resulting change of<br />

requirement to cater for it, coupled with the additional<br />

pressure of knowing that in such circumstances, a speedy<br />

response to requirements change has the potential to both save<br />

lives and enhance reputations. Such an obligation shines a<br />

whole new light on automated requirements traceability.<br />

VIII.<br />

CONCLUSION<br />

The delivery of a Requirements Traceability Matrix (RTM) is<br />

often contractually imposed on suppliers. Even when not<br />

required, many development teams recognise that an RTM is an<br />
important 'best practice' for successful projects. However, the<br />

creation of a useful and error-free RTM can only happen when<br />

the requirements are of sufficient quality and the process is<br />

taken seriously. This paper has outlined several areas which<br />

have the capability to limit or undermine the RTM and has<br />

proposed a series of solutions:<br />

• Ensure that requirements embrace functional, safety and<br />

security related issues<br />

• Accept that requirements will change over the life of the<br />

project<br />

• Employ a development process which embraces and<br />

responds to change<br />

• Manage the quality of requirements<br />

• Let the requirements drive development<br />

• Build an RTM from the start of the project<br />



• Use the RTM to manage progress and improve project<br />

quality<br />

• Use the RTM to respond quickly and effectively to<br />

newly-discovered security vulnerabilities after product<br />

deployment<br />

Implementing these improvements undoubtedly takes effort<br />

but the end result will be a project that finishes on time, on<br />

budget, avoids any gap between the stakeholder’s vision of the<br />

application and what is ultimately delivered, and results in an<br />

effective support vehicle for deployed connected systems.<br />
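As a minimal illustration of the bookkeeping an RTM automates, the sketch below (plain Python; requirement and test identifiers are invented, and this is not LDRA's implementation) maps requirements to verification tasks and reports coverage gaps:<br />

```python
# Toy RTM: each requirement is linked to the tests that verify it.
requirements = {
    "REQ-1": ["TEST-1", "TEST-2"],   # fully covered
    "REQ-2": ["TEST-3"],             # covered, but a test fails
    "REQ-3": [],                     # no verification task linked yet
}
results = {"TEST-1": "pass", "TEST-2": "pass", "TEST-3": "fail"}

def rtm_status(requirements, results):
    """Classify each requirement from the linked test results."""
    status = {}
    for req, tests in requirements.items():
        if not tests:
            status[req] = "uncovered"
        elif all(results.get(t) == "pass" for t in tests):
            status[req] = "verified"
        else:
            status[req] = "failing"
    return status
```

Run against a changed or newly added requirement, such a report immediately shows which verification tasks still need attention - the property the conclusion above argues for.<br />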

COMPANY DETAILS<br />

LDRA<br />

Portside<br />

Monks Ferry<br />

Wirral CH41 5LH<br />

United Kingdom<br />

Tel: +44 (0)151 649 9300<br />

Fax: +44 (0)151 649 9666<br />

E-mail:info@ldra.com<br />

Presentation Co-ordination<br />

Mark James<br />

Marketing Manager<br />

E:mark.james@ldra.com<br />

Presenter<br />

Mark Pitchford<br />

Technical Specialist<br />

E:mark.pitchford@ldra.com<br />



Change based Requirements Management<br />

Bernd Röser (Author)<br />

agosense GmbH<br />

Kornwestheim, Germany<br />

bernd.roeser@agosense.com<br />

Ralf Klimpke (Author)<br />

agosense GmbH<br />

Kornwestheim, Germany<br />

ralf.klimpke@agosense.com<br />

Today, requirements management only rarely begins „from scratch“. Instead, existing product versions and variations are defined and developed at the same time.<br />

It is often at this point that the planned changes depart<br />

from the provisions contained in the requirements and<br />

specifications, making it difficult to ascertain which<br />

individual requirement resulted in which documented<br />

change. While it may be possible to show the differences<br />

between two milestones or baselines, identifying a direct<br />

correlation between specific provisions in a document and<br />

the actual changes implemented becomes very difficult, if<br />

not impossible. This article shows how a tool-based methodology makes this connection, completely integrating requirements management into the modern development process.<br />

New approaches to change-based requirements management, with clearly distributed responsibilities and an increased level of traceability, aim to improve communication and coordination between the different roles in requirements management and the development process over the long term.<br />

INTRODUCTION<br />

Software and software development are hardly imaginable today without a methodical approach - particularly given the<br />

importance of product security, quality and the need to predict<br />

activities and results. Application Lifecycle Management<br />

(ALM) tools and platforms help by supporting various<br />

practices and methods by visually representing the<br />

interdependencies between development artefacts and<br />

activities. But is that enough? Examining how development is<br />

organised more closely, it becomes clear that products today<br />

are rarely planned „from scratch“. In many cases, existing products are further developed, improved or used as the model<br />

for further product variation. Looking at development in this<br />

way, it becomes clear that the major proportion of development<br />

activities can be characterised as changes to existing material.<br />

Changes, and change management do not begin with<br />

implementation or in the production stage, but rather much<br />

earlier in the product development process - for example,<br />

during requirements management.<br />

Managed effectively, this process can foster the re-use and<br />

alteration of existing artefacts or documents and provide<br />

optimal support across all the ALM tools used as part of the<br />

development process. This approach has long been known in<br />

the field of version management and source code<br />

administration using terminology like checkpoint, baselines,<br />

variants, versions, change lists etc. But why can we not also<br />

use these terms in the practice of requirements management?<br />

The challenges and the issues are almost identical. How can<br />

baselines or variants be used for planning requirements? How<br />

can intended changes to the specifications for a new product<br />

iteration be extracted and applied to an existing requirements<br />

document? How can it be shown, quickly and easily, which<br />

task and which change request is based on which specific<br />

change to a specification?<br />

Until now, establishing a controlled and most importantly<br />

comprehensive change process within requirements<br />

management required a considerable amount of manual effort.<br />

This article describes new opportunities to efficiently control<br />

development processes for „change based requirements<br />

management“ from the beginning using agosense.fidelia from<br />

agosense GmbH.<br />

agosense.fidelia is an independent web-based system for<br />

requirements management that unites popular RM functions<br />

with a specialised support for requirements management,<br />

integrating the development process.<br />

CHANGE TRACKING AND RELEASE<br />

Integration of requirements and change management is<br />

mostly limited to a simple linking of requirements and change<br />

requests. These links are usually manually created, and have no<br />

binding or lasting character. But how to demonstrate that<br />

requirements without a change request cannot be changed or<br />

whether the changes actually carried out match the intended<br />

changes?<br />

www.embedded-world.eu<br />



The uncertainty ends with agosense.requirements. The<br />

methodological approach of this tool allows every change to a document/requirement to be listed and ensures that a<br />

document can only be edited in connection with an allocated<br />

task or change request.<br />

This approach is examined in detail in the following sections, which also show how to make the most of it as part of planning in the development process.<br />

LISTING CHANGES<br />

As mentioned at the beginning, in many regulated industries every individual step in a development process must be recorded in order to exactly replicate the development process<br />

at a later time. To make this possible, developments and<br />

changes are described, planned and distributed to the person or<br />

group responsible. These individual steps are later tested and<br />

released as part of the ongoing process.<br />

To make this process as efficient as possible, and most<br />

importantly, to ensure that the changes carried out truly<br />

represent the intended outcome, all granular changes to the<br />

document (Sheet) are recorded in agosense.requirements. The<br />

system creates a „Change Set“ for every task and change which<br />

automatically records granular alterations to the document -<br />

similar to the approach used in software version administration<br />

tools (see Fig. 1).<br />

The change requests or tasks linked to the Change Sets<br />

usually originate in existing Change Management Systems<br />

(e.g. IBM RTC, Atlassian Jira etc) which can be directly<br />

integrated into agosense.requirements, thanks to agosense<br />

interface technology. This integration enables the allocated<br />

planning artefacts to be selected by the user in<br />

agosense.requirements. It goes without saying that information<br />

- including that regarding changes, change sets, release, etc -<br />

can be transferred back to the change management system.<br />

Change sets present a distinct list of changes to a<br />

document that can be allocated to a task or a change request.<br />

They can then flow into the document according to a specific<br />

order, for example using a regulated release process. This<br />

ensures that the document version is an accurate representation<br />

of the planning process, and the changes can be traced back to<br />

the appropriate layer at any time.<br />

So how does this work in detail? In agosense.fidelia, the current released version of a document is labelled the „base“<br />

version. If this document (or an older version of the document -<br />

then in a branch) is edited, the user simply selects the allocated<br />

task and begins working. This automatically results in the<br />

generation of a „Tentative“ version of the document, which<br />

serves as the working copy for processing.<br />

From this point, all the changes in the document version are<br />

automatically listed in the change set. After the work is<br />

completed, depending on the process implemented, the user<br />

can choose to either present the change sets for review, or enter<br />

them directly to the document. This operation results in the<br />

changes being officially carried over into the base version<br />

(„Apply“ see Fig. 2). The status of all dependent tasks and<br />

change requests in the linked system is automatically updated.<br />

Advantages:<br />

• Specifications and documents contain exact<br />

changes, which have been previously planned and<br />

approved<br />

• The user is guided through this process by<br />

agosense.fidelia without adding any extra effort<br />

• Changes are automatically documented in the<br />

background.<br />
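The base/tentative mechanism described above can be sketched as follows; the class and method names are illustrative only and do not represent the agosense.fidelia API:<br />

```python
# Sketch: a document may only be edited against an allocated task, every
# granular change is recorded in that task's change set, and "apply"
# carries the change set over into the released base version.
class Document:
    def __init__(self, base):
        self.base = base          # released "base" version: {req_id: text}
        self.tentative = None     # working copy, exists only while editing
        self.change_sets = {}     # task_id -> list of (req_id, new_text)

    def edit(self, task_id, req_id, new_text):
        if task_id is None:
            raise ValueError("editing requires an allocated task/change request")
        if self.tentative is None:
            self.tentative = dict(self.base)   # create the tentative version
        self.tentative[req_id] = new_text
        self.change_sets.setdefault(task_id, []).append((req_id, new_text))

    def apply(self, task_id):
        # carry the reviewed change set over into the base version
        for req_id, text in self.change_sets[task_id]:
            self.base[req_id] = text
        self.tentative = None
```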

Fig. 2. Concept Base/Tentative View and Change Sets<br />

Fig. 1. Diagram of Changes and Classification<br />




Fig. 3 Change Sets and Reviews<br />

RELEASE PROCESS<br />

Depending on the level of maturity or formality of the<br />

defined operation processes, the specifications may be subject<br />

to a review process before being released for implementation.<br />

For our example, that means that every change request is<br />

reviewed according to the relevant change set.<br />

As shown in Fig. 3, the reviewer is shown detailed<br />

information regarding the specific change in an integrated diff<br />

view. Naturally, in this view comments can be created and<br />

corrections or follow-ups requested before the change set is<br />

released. After all the change sets for a specific release level<br />

have been accepted, a baseline for the document can be<br />

generated, representing the released version.<br />

This release process can also be controlled using the linked<br />

change management system. The reviewer can, for example, go<br />

straight to the change set view in agosense.fidelia by clicking<br />

on a hyperlink in the change request - depending on the user‘s<br />

operational preference.<br />

Advantages:<br />

• The user is guided through the whole process<br />

• Every step in the release process is documented<br />

and traceable.<br />
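The release gate described above reduces to a simple invariant: a baseline may only be generated once every change set for the release level has been accepted. A hypothetical sketch (function names invented):<br />

```python
# A baseline represents the released version; it may only be created when
# no change set for the release level is still open or under review.
def can_baseline(change_sets):
    """change_sets: mapping change-set id -> review status string."""
    return all(status == "accepted" for status in change_sets.values())

def create_baseline(version, change_sets):
    if not can_baseline(change_sets):
        pending = [c for c, s in change_sets.items() if s != "accepted"]
        raise RuntimeError(f"change sets awaiting acceptance: {pending}")
    return f"baseline of version {version}"
```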

BRANCHES AND VARIATIONS<br />

As previously described, branches are central functions<br />

which allow product variations to be defined, or to generate<br />

deviations from existing product definitions. This creates a<br />

need for these functions to exist as an integral part of<br />

requirements management.<br />

There are only very few tools on the market that offer true<br />

branching, and those that do often do so via add-ons with<br />

inefficient copy mechanisms that unnecessarily inflate data<br />

storage needs and slow down the application over time.<br />

All the functions described here are also used within the<br />

branches, and every document version (whether baseline,<br />

variant, tentative or base) can of course be compared with<br />

another in any combination using the diff-view.<br />
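The kind of pairwise comparison such a diff view performs can be approximated with Python's standard difflib; the requirement texts below are invented examples, and agosense.fidelia's actual diff works per requirement rather than per line:<br />

```python
import difflib

# Two versions of a (toy) requirements document, one line per requirement.
baseline = [
    "REQ-1 The actuator shall extend within 2 s.",
    "REQ-2 The system shall report sensor faults.",
]
tentative = [
    "REQ-1 The actuator shall extend within 1.5 s.",
    "REQ-2 The system shall report sensor faults.",
]

# Unified diff: changed requirements appear as -/+ pairs, unchanged ones
# only as context lines.
diff = list(difflib.unified_diff(baseline, tentative,
                                 fromfile="base", tofile="tentative",
                                 lineterm=""))
```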

Where special customer requests require more than<br />

straightforward branching, for example a description of<br />

function trees with corresponding dependencies that must be<br />

specifically managed in the creation of product variations,<br />

agosense.fidelia offers the ability to directly integrate variant<br />

management systems like „pure::variants“ from pure-systems<br />

GmbH.<br />

INTEGRATION<br />

As mentioned in the sections above, embedding<br />

requirements management into the whole development process<br />

is very important. The following section aims to show where<br />



the usual linkage points are, and how these can best be used to<br />

provide the traceability standards of today.<br />

To what extent is it necessary to expand the domain of requirements management towards planned change management? The following examples examine that question.<br />

Assuming that a requirements document has been released<br />

and the product developed according to those<br />

specifications: what is now required to best<br />

incorporate subsequently changed requirements? The requirements cannot simply be changed; they must also be<br />

managed as part of a planned change process:<br />

• Changes must first be evaluated according to a<br />

range of criteria (e.g. costs, risks,...)<br />

• Dependencies must be checked: test cases, use-case models, ... may also need changing<br />

• The specification itself must be altered and<br />

released<br />

• Changed requirements must be communicated and<br />

passed on for implementation.<br />
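The steps above amount to a small lifecycle for each change request. One way to picture it (state names invented for illustration) is as a table of allowed status transitions:<br />

```python
# Allowed transitions of a change request through the planned change
# process described above; anything else is rejected as out of order.
TRANSITIONS = {
    "submitted":       ["evaluated"],                   # assess cost, risk, ...
    "evaluated":       ["impact-analysed", "rejected"], # check dependent artefacts
    "impact-analysed": ["specified"],                   # alter and release the spec
    "specified":       ["communicated"],                # hand over for implementation
}

def advance(state, target):
    """Move a change request to `target`, enforcing the process order."""
    if target not in TRANSITIONS.get(state, []):
        raise ValueError(f"cannot go from {state!r} to {target!r}")
    return target
```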

Efficient resolution of all these tasks cannot be achieved<br />

without an extensive integration of requirements management<br />

into all other adjacent domains, and global change management<br />

in particular.<br />

DEVELOPMENT PROCESS INTEGRATION<br />

Requirements management has thus become an integral<br />

part of the development process and can no longer be<br />

considered in isolation. It is therefore important for most<br />

companies that the RM tools being used provide sufficient<br />

relevant open interfaces and are able to be perfectly integrated<br />

into the broader tool environment (e.g. with test management,<br />

modelling, change management, etc). However, only a very<br />

small number of providers have an integration strategy that<br />

goes beyond the scope of their current product portfolio. This<br />

often leads to significant costs and frustration on the part of the<br />

client, particularly when trying to administer integration<br />

(usually specifically developed for the client) and provide users<br />

with the appropriate level of support. Additionally, companies<br />

often forget or underestimate the fact that tool integration<br />

requires profound knowledge and expertise of all tools to be<br />

integrated.<br />

Ideally, change management tool providers should at least<br />

offer open interfaces for all potential integration technologies,<br />

with the best case scenario also featuring a clear strategy and<br />

the relevant expertise in this area.<br />

agosense has a clear strategy, many years of experience in<br />

tool integration and offers an integration platform that links<br />

almost all popular tools in the ALM industry as an optimal<br />

addition. agosense.symphony is the central technology for all<br />

integration within agosense.requirements, and it can also be<br />

used as a stand-alone product to integrate a heterogeneous tool<br />

chain in a way that is process oriented and specific to your<br />

needs.<br />

Fig. 4 Baselines and Variants in Document Selection Window<br />

CROSS DOMAIN TRACEABILITY<br />

What are the most important motivations for tool<br />

integration in general?<br />

• Presentation of information from other tools or<br />

databases<br />

• Enable users to work in their domain specific tools<br />

without constantly having to change between tools<br />

(due to compatibility, license or cost<br />

considerations)<br />

• Close the breaks in media and process that arise<br />

from the use of different tools for different tasks<br />

• Present dependent information<br />

• Use dependent information: see the effect of a<br />

planned change to a specification on other<br />

domains (e.g. tests, models, etc.).<br />

Fig. 5. agosense.symphony as the central integration platform for all development domains and data transfer<br />



These individual motivations could be summarised under<br />

the term „traceability“ - the ability to locate data from a broad<br />

range of sources in relation to each other and to make this<br />

information network visible. In particular with regard to<br />

products where safety is key, for example in the automobile<br />

industry, there is a statutory requirement to ensure a certain<br />

level of maturity in the product creation process. This should<br />

result in an improved product quality, but also allow for an<br />

error to be reconstructed and followed back to its source, the<br />

product requirement.<br />

Ultimately, this is only possible through integration, as the<br />

majority of companies have a very heterogeneous tool<br />

environment and data storage.<br />

agosense.fidelia provides the optimum support (see Fig. 6<br />

for a presentation of the „Split Screen“ in conjunction with test<br />

data) in that trace information on the data from other tools can<br />

be presented, generated or processed directly in a view.<br />

Traceability is therefore possible across different tools, and<br />

able to be presented using the dashboards in agosense.fidelia<br />

for reports in real time.<br />

SUMMARY<br />

For the first time, organisations which are characterized by<br />

tight, strictly organised planning and development processes<br />

now have the opportunity to ensure continuous traceability for<br />

all activities all the way up to requirements management.<br />

Fig. 6 “Split Screen” in conjunction with test data<br />



Certification Testing Process with Full Traceability<br />

Michael Wittner<br />

Razorcat Development GmbH<br />

Berlin, Germany<br />

www.razorcat.com<br />

Abstract—The certification of safety critical systems<br />

especially in avionic systems requires extensive testing of the<br />

complete system functionality. Time and cost for the certification<br />

can be reduced significantly by a proven tool-supported testing<br />

process. The necessary documentation will then be generated<br />

automatically based on the testing results for each system<br />

requirement. The testing process from requirements analysis to<br />

evaluation of test results for each requirement will be<br />

demonstrated with a real-life project example.<br />

Within each cycle of the testing process, tests will be<br />

defined, linked to requirements, executed and evaluated. The<br />

testing process needs to keep track of the results of all testing<br />

activities in order to provide reporting and traceability between<br />

test results and requirements. With the integrated testing<br />

environment ITE, users can import and manage all<br />

requirements, select the appropriate test means and plan testing<br />

campaigns.<br />

Keywords—Testing, standards, requirement, certification,<br />

reporting<br />

I. INTRODUCTION<br />

This paper describes a well-structured testing process<br />

which is based on a dedicated test specification language<br />

(CCDL) and a test management tool (ITE) for requirements<br />

linking and traceability reporting. By means of a successfully<br />

completed test campaign for an avionics system component,<br />

the method, the step by step process and the available tool<br />

support will be demonstrated.<br />

II. THE TEST PROCESS AT A GLANCE<br />

For the validation and verification of a system it is<br />

necessary to provide evidence that the requirements are both<br />

correctly implemented as system functions and that the system<br />

functions do exactly what they are expected to do. For this<br />

purpose, a suite of tests needs to be created that will be<br />

executed against the system under test (SUT). Testing will take<br />

place at different levels down from unit testing up to<br />

integration and system testing. Other test means like reviews<br />

may also be used to validate non-technical requirements.<br />

Fig. 1 provides an overview of the testing process and<br />

its entities. The assignment of testing tools for each<br />

requirement will be done using the verification and validation<br />

(VxV) matrix. According to this planning activity, the progress<br />

in defining and executing tests linked to requirements can<br />

easily be measured throughout the whole testing process. At<br />

the end of each testing cycle, certification-ready reports<br />

present the current testing status.<br />

Fig. 1. Overview of the testing process<br />

A. Preparation of requirements<br />

The base for all certification activities are well-defined<br />

requirements for the system under test. Requirements are<br />

usually structured within one or several documents and they<br />

can be further refined into sub-requirements linked to<br />

their main requirements. Requirements stemming from external<br />

tools need to be imported into ITE to be able to detect changes<br />

and handle linking to other test entities.<br />
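One common way to detect such changes on re-import (a sketch only, not the ITE import mechanism) is to compare content fingerprints of each requirement:<br />

```python
import hashlib

def fingerprint(text):
    """Stable fingerprint of a requirement's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_changes(stored, imported):
    """stored: {req_id: fingerprint} from the last import;
    imported: {req_id: text} from the external tool.
    Returns (changed, new, removed) requirement ids."""
    changed = [r for r, text in imported.items()
               if r in stored and stored[r] != fingerprint(text)]
    new = [r for r in imported if r not in stored]
    removed = [r for r in stored if r not in imported]
    return changed, new, removed
```

Changed requirements would then mark their linked test entities as suspicious, as described later in this paper.<br />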

B. Selecting the test means<br />

The next step after requirement definition is the selection of<br />

testing tools. These test means will be used for testing on<br />

different levels like unit, integration and system testing. There<br />

may be different tools that can be used for verification of a<br />

single requirement and it is the responsibility of the test<br />

229


engineer to select the most appropriate and effective one for<br />

each requirement.<br />

Fig. 2. Verification and validation matrix<br />

The assignment of test means to requirements takes place<br />

within the VxV matrix. For each requirement, there is one row<br />

containing the test mean assignments. This step of the process<br />

just decides about the testing tools or methods being applied to<br />

validate each requirement. Fig. 2 shows a VxV matrix for<br />

several requirements being tested with different test means.<br />
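Conceptually, the VxV matrix is one row per requirement listing its assigned test means; a minimal sketch (requirement ids and tool names are invented):<br />

```python
# One row per requirement; the cell content is the set of test means
# (tools or methods) assigned to verify that requirement.
vxv = {
    "REQ-10": {"unit test", "system test rig"},
    "REQ-11": {"review"},
    "REQ-12": set(),            # not yet planned
}

def unplanned(vxv):
    """Requirements that still have no test mean assigned."""
    return sorted(r for r, means in vxv.items() if not means)
```

Measuring planning progress then becomes a query over the matrix rather than a manual audit.<br />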

C. Test definitions and test procedures<br />

When testing a specific requirement, it is important to<br />

first identify all relevant testing aspects to fully cover the<br />

functionality described within the requirement. A testing<br />

program for a complex system consists of a large number of test<br />

definitions that need to be linked to the requirements being<br />

tested.<br />

1) Defining the test specification<br />

Writing test definitions in purely textual form has the<br />

disadvantage that it gets harder to distinguish different tests for<br />

the same objective once a number of tests have already been<br />

defined. Especially when maintaining and extending existing<br />

tests due to new requirements, systematic approaches to test<br />

specification such as the classification tree method are highly<br />

recommended. Fig. 3 shows tests defined using the<br />

classification tree method at a high abstraction level. These test<br />

definitions just outline the necessary setup and checks to be<br />

implemented on the respective test mean (i.e. a system testing<br />

tool in this case).<br />
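The classification tree method partitions the test-relevant aspects into classifications with mutually exclusive classes; a concrete test definition combines one class per classification. A sketch using the classes of Fig. 3:<br />

```python
from itertools import product

# Classifications and classes as in the actuator example of Fig. 3.
classifications = {
    "redundancy": ["both systems", "system 1 only", "system 2 only"],
    "supply voltage": ["normal", "reduced"],
    "hydraulic pressure": ["normal", "reduced"],
    "moving direction": ["extend", "retract"],
    "operating load": ["none", "normal", "max"],
    "ambient temperature": ["normal", "low (-40 °C)"],
}

# The full cross product spans the input domain; in practice only selected
# combinations (e.g. "Normal extend", "Degraded voltage") become tests.
all_combinations = list(product(*classifications.values()))
```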

[Figure content: the tree combines the classifications Redundancy (both systems / system 1 only / system 2 only), Supply voltage (normal / reduced), Hydraulic pressure (normal / reduced), Moving direction (extend / retract), Op load (none / normal / max) and Ambient temperature (normal / low, -40 °C) into the test definitions 1: Normal no load, 2: Normal extend, 3: Normal retract, 4: Degraded voltage, 5: Degraded hydraulic, 6: Degraded low temperature.]<br />

Fig. 3. Classification tree for the test of an aircraft actuator component<br />

The exact sequence of test steps is not defined within each test definition. It is up to the test engineer to write a useful test procedure that copes with the given testing challenge. It can be useful to have a set of initial conditions defined for all possible test setups of the system. These initial conditions can be reused for all applicable test definitions.<br />

2) Writing test scripts<br />

The CCDL language, as an example of a system level testing language, provides all required real-time testing functionality while remaining readable and understandable for external auditors without further training. It is a test specification language consisting of only a few syntactical elements in a human-readable format that can be learned easily. It provides means for the definition of complex chronological as well as event-driven test procedures. Expected reactions of the system under test can be specified and evaluated in parallel to the test control flow. Requirements can be linked directly to test script commands using the syntax of the CCDL language.<br />

Fig. 4 shows an example test procedure for testing of a safety monitoring function of an aircraft actuator system. The example provides a short overview of the CCDL scripting possibilities: the SUT gets stimulated and the test control flow waits for the SUT to be running in normal operating mode. Now the test simulates a sensor fault and checks for the expected reaction of the SUT.<br />

Fig. 4. CCDL sample test procedure<br />

This simple example already shows the power of the CCDL language: the trigger condition defines the point in time where the SUT runs in normal operating mode. Based on this trigger, the fault situation is applied (within the “when” script expression) and in parallel the expected reaction of the SUT is checked (using the “within” script expression). The operator => checks if the given signal changes exactly once within the given interval to the value provided. Fig. 5 shows the graphical execution flow for a test run of the example test procedure. The red dot within the expected reaction check indicates that the failure warning signal did not change to its expected value of “1”.<br />

Fig. 5. Execution flow with timing and test results<br />

The SUT will be stimulated and checked exactly as defined<br />

within the test in a precisely definable time frame. The usage of<br />

trigger conditions in conjunction with time offsets allows<br />

specifying precise time intervals for real time checking of SUT<br />

behavior. The use of CCDL within a certification test campaign<br />



resulted in a highly increased productivity of the test team<br />

compared with the former test scripting based on the<br />

programming language Python.<br />
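The stimulate/trigger/check pattern described above can be emulated in plain Python as a discrete-time check; this illustrates the semantics only (it is not CCDL syntax), and the signal names and traces are invented:<br />

```python
# After a trigger condition fires, the named signal must change to the
# expected value exactly once within `window` ticks - the semantics of
# the => check described above, in a discrete-time emulation.
def check_reaction(trace, trigger, window, signal, expected):
    """trace: list of per-tick signal dicts; trigger: predicate on a tick."""
    start = next((i for i, s in enumerate(trace) if trigger(s)), None)
    if start is None:
        return False                      # SUT never reached the trigger state
    changes = 0
    prev = trace[start][signal]
    for s in trace[start + 1:start + 1 + window]:
        if s[signal] != prev and s[signal] == expected:
            changes += 1
        prev = s[signal]
    return changes == 1
```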

D. Evaluation of test results<br />

The next step after completion of tests is the evaluation of<br />

the test results with focus on each requirement. The initial<br />

linking of test definitions to requirements is the basis for this<br />

evaluation, but it may also be necessary to link additional<br />

requirements and their respective evaluation results to executed<br />

tests. Since the CCDL provides linking of requirements to<br />

actual results being checked during test execution, the<br />

evaluation process can be highly automated. Manually<br />

overriding the automatic results must be possible as well to<br />

include human expertise in test result evaluation into the testing<br />

process.<br />

Each test result needs to be acknowledged as “closed” to be<br />

included into the current test statistics. This feature allows<br />

immediate reporting about the current testing status without<br />

taking into account work in progress results.<br />
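This "closed"-only counting rule can be expressed compactly; a sketch with invented result data:<br />

```python
# Only results acknowledged as "closed" count toward the published
# statistics; everything else is reported as work in progress.
def statistics(results):
    """results: list of (verdict, acknowledged) pairs, verdict 'pass'/'fail'."""
    closed = [v for v, acked in results if acked]
    return {
        "closed": len(closed),
        "passed": closed.count("pass"),
        "failed": closed.count("fail"),
        "in progress": len(results) - len(closed),
    }
```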

E. Test reporting<br />

The final result of a certification process is the creation of<br />

reports about the achieved test results and the traceability<br />

between requirements and their respective test results. Because<br />

the overall testing process is usually an iterative process with<br />

several testing campaigns, it is essential that the reporting<br />

provides a quick overview as well as detailed insights into<br />

problems detected while testing. Fig. 6 shows an excerpt of an<br />

overview report. For each requirement, the planned test<br />

coverage defined within the VxV matrix together with the<br />

achieved execution results is shown. The number of tests are<br />

based on the number of test definitions and their respective<br />

number of executions (i.e. test runs). Filtering may be applied<br />

to show the results for specific test campaigns or part of the<br />

whole testing program.<br />

Fig. 6. Overview report for requirement based test results<br />

With the integrated testing environment ITE, users can<br />

generate customizable reports for traceability between<br />

requirements and executed tests which can directly be used for<br />

certifications and assessments.<br />

III. ANOTHER TESTING CYCLE<br />

Because testing of complex systems is normally not a one-shot operation, it is necessary to support effective handling<br />

of changes within the testing process. Changes to requirements<br />

may cause updates or extensions of tests, require additional<br />

tests or result in obsolete tests that can be removed. The<br />

existing relationships between all test process artefacts make it<br />

possible to highlight any suspicious elements (i.e. elements that<br />

may be affected by changes of their linked elements).<br />

Fig. 7. Relationships between test artefacts<br />

Fig. 7 shows the links between test artefacts that can be<br />

followed when exploring the impact of changes. Each<br />

dependent element will be marked as suspicious in order to be<br />

updated or acknowledged. Such a tagging of suspicious<br />

elements guides the test engineer throughout the necessary<br />

adaptions within a highly dynamic testing process.<br />

After resolving all suspicious artefacts, the test engineer can<br />

be sure to have taken into account all necessary changes and<br />

updates of tests that were to be done due to requirement<br />

changes.<br />
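The suspicious-marking described above is essentially a reachability search over the artefact link graph; a minimal sketch with invented artefact identifiers:<br />

```python
from collections import deque

# artefact -> artefacts that depend on it (requirement -> test definition
# -> test procedure -> test run, as in Fig. 7)
links = {
    "REQ-5": ["TESTDEF-9"],
    "TESTDEF-9": ["PROC-3"],
    "PROC-3": ["RUN-17"],
}

def mark_suspicious(changed, links):
    """Everything reachable from the changed artefact over dependency
    links is marked suspicious until updated or acknowledged."""
    suspicious, queue = set(), deque([changed])
    while queue:
        for dep in links.get(queue.popleft(), []):
            if dep not in suspicious:
                suspicious.add(dep)
                queue.append(dep)
    return suspicious
```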

IV. USAGE IN LARGER TESTING PROJECTS<br />

The testing process outlined above has been used<br />

successfully in various avionics certification testing programs<br />

for safety critical aircraft components. One of these projects<br />

had the following key indicators:<br />

• Time span over 3 years, up to 25 test engineers<br />

• Number of requirements: 1270, plus 3024 derived sub-requirements<br />

• Number of test cases: 3014<br />

• Number of test runs: 4118, thereof 1196 repeated execution cycles of test procedures<br />

The execution of test procedures could be conducted either<br />

on an expensive hardware test rig with all other relevant<br />

original components in place or on a simulator for the tested<br />

aircraft component only. Therefore it was important to select<br />

the most relevant tests and the best matching test tool within<br />

each test campaign.<br />

The number of testing cycles and retests is especially<br />
noteworthy, as it underlines the need for powerful<br />
versioning and change management to be able to follow any<br />
suspicious paths along the test artefact link chains.<br />

231


1000x in Three Years: How Embedded Vision is<br />

Transitioning from Exotic to Ubiquitous<br />

Jeff Bier<br />

Embedded Vision Alliance<br />

Walnut Creek, CA USA<br />

bier@embedded-vision.com<br />

Abstract—Just a few years ago, it was inconceivable that<br />

everyday devices would incorporate visual intelligence. Now it’s<br />

clear that visual intelligence will be ubiquitous soon. How soon?<br />

Faster than you might think, thanks to three key accelerating<br />

factors. In the next few years, we’ll see roughly a 10X<br />

improvement in cost-performance and energy efficiency at each<br />

of three layers: algorithms, software techniques, and processor<br />

architecture. Combined, this means that we can expect roughly a<br />

1000X improvement. So, tasks that today require hundreds of<br />

watts of power and hundreds of dollars’ worth of silicon will soon<br />

require less than a watt of power and less than a dollar’s worth of<br />

silicon. This will be world-changing, enabling even very<br />
cost-sensitive devices, like toys, to incorporate sophisticated visual<br />

perception. In this talk, I’ll explain how innovators across the<br />

industry are delivering this 1,000X improvement very rapidly.<br />

I’ll also highlight end-products that are showing us what’s<br />

possible in this new era, and important challenges that remain.<br />

Note: A paper is not being published for this presentation.<br />

The presentation slides are available upon request.<br />



Embedded Vision Solutions – State of the Art,<br />

Options and Applications<br />

Jan-Erik Schmitt<br />

Vision Components GmbH<br />

Ettlingen, Germany<br />

schmitt@vision-components.com<br />

Miriam Schreiber<br />

Vision Components GmbH<br />

Ettlingen, Germany<br />

miriam.schreiber@vision-components.com<br />

Abstract—This document gives a brief overview of<br />
embedded vision solutions: their definitions, options and limits, as<br />
well as typical applications.<br />

Keywords—embedded vision components<br />

I. INTRODUCTION<br />

With this paper we would like to contribute to an ongoing<br />
discussion regarding embedded components, and more specifically<br />
Embedded Vision Solutions. To begin with, we find it<br />
necessary to define what an Embedded Vision Solution is,<br />
especially since experience shows that certain ideas are<br />
associated with this term, but they are usually not very<br />
precise and cannot be assumed to be generally accepted. In a<br />
second step we give a quick overview of the technology's<br />
development to date and, last but not least, explain some<br />
typical application fields for embedded vision systems.<br />

II. DEFINITION: WHAT IS AN EMBEDDED VISION SOLUTION?<br />

Today, Embedded Vision and Embedded Vision Solutions<br />
are widely used buzzwords, but unfortunately there is no established<br />
definition that specifies exactly what they mean. Thus,<br />
any contribution to this topic needs to begin with a short<br />
definition outlining the basics.<br />
Still, this paper can only give a brief survey of the current<br />
use of these terms. All of them are additionally subject<br />
to constant change through professional use as well as, due to<br />
increasing awareness, through nonstandard use in everyday<br />
language.<br />

The term Embedded Vision Solutions derives from three<br />

different terms: Embedded System, Machine Vision System, and<br />

Vision Solution.<br />

A. Embedded System<br />

What is an Embedded System?<br />

Here too, several terms are in use. For better<br />
understanding, we treat the term embedded system as a<br />
short form of embedded computing system and thus as<br />
synonymous.<br />

“Embedded computing systems (ECSs) are dedicated<br />

systems which have computer hardware with embedded<br />

software as one of their most important components. Hence,<br />

ECSs are dedicated computer-based systems for an application<br />

or a product, which explains how they are different from the<br />

more general systems […]. As implementation technology<br />

continues to improve, the design of ECS becomes more<br />

challenging due to increasing system complexity, as well as<br />

relentless time-to-market pressure. Moreover, ECSs may be<br />

independent, part of a larger system, or a part of a<br />

heterogeneous system. They perform dedicated functions in a<br />

huge variety of applications, although these are not usually<br />

visible to the user.“[1]<br />

In general, we can summarize that embedded systems consist<br />
of two main components:<br />

· Hardware, which consists of the processor,<br />

program and data memory, interfaces, inputs,<br />

outputs, etc.<br />

· Firmware/Operating system<br />

B. Machine Vision System<br />

What is a Machine Vision System?<br />

A setup qualifies as a Machine Vision System when it<br />
consists of several components:<br />

· Lighting<br />

· Lens/Optics<br />

· Image sensor/Camera<br />

www.embedded-world.eu<br />

233


· Hardware/electronics (Processing Unit)<br />

· Software<br />

C. Vision Solution<br />

In the field of Machine Vision & Imaging, a Vision<br />

Solution is most often used as a system combining all<br />

necessary hardware and software components that are needed<br />

for the particular task. This can differ from application to<br />

application, since a wide range of components, both regarding<br />

hardware and software, can be used.<br />

Based on these clarifications, we can conclude that an<br />
Embedded Vision Solution is an embedded system providing<br />
specific hardware and software components for a particular<br />
vision inspection task that does not use an external processing<br />
device but instead processes all collected data onboard.<br />

III. EMBEDDED VISION TECHNOLOGY AND ITS DEVELOPMENT UNTIL TODAY<br />

The Apollo Guidance Computer, developed in the 1960s, is<br />

considered to be one of the first modern embedded systems [2].<br />

It had approx. 4 KB of RAM, could achieve 40,000 additions per<br />
second, and had a clock frequency of about 100 kHz.<br />

The first so-called Smart Camera, by then the most<br />

common term for an embedded vision system, for industrial<br />

use was brought to market in 1995. It was the VC11 from<br />

Vision Components GmbH, a DSP-based system with 32MHz<br />

clock frequency, 2MB DRAM and a 512KB Flash-EPROM.<br />

Today, this is considered as the beginning of embedded vision<br />

technology and soon other companies followed with similar<br />

products to join the quickly growing market. The first<br />
generations of Smart Cameras were homogeneous systems based<br />
on DSP only, as illustrated in Fig. 1.<br />

A few years later, in 2000, the rollout of the first Vision<br />
Sensor (another term not exactly defined) followed. It was also a<br />
homogeneous system based on DSP technology, clocked at<br />
75 MHz.<br />

Today, typical Embedded Vision Systems are based on<br />
heterogeneous architectures like a quad-core ARM at 1.2 GHz<br />
combined with FPGA or GPU modules. A typical<br />
heterogeneous system is, for example, the Zynq SoC from<br />
Xilinx [3], as shown in Fig. 2.<br />

In recent years, processor technology has advanced rapidly,<br />
driven by related technologies in the consumer and automotive<br />
markets: cell phones, tablet computers, autonomously driving<br />
vehicles and many more.<br />

IV. CORE QUESTION: WHY AND WHEN ARE EMBEDDED VISION SYSTEMS USED?<br />

There are many reasons for using embedded vision systems<br />
instead of conventional PC-based vision systems:<br />

· Like no other machine vision system, embedded<br />
vision systems are reduced to their basic<br />
components. Thus, they can easily be optimized for<br />
a good cost/performance ratio.<br />

· Embedded vision systems consist of fewer<br />
components than PC-based systems and are generally<br />
much smaller. Thus, they also consume less<br />
power than PC-based systems.<br />

· Embedded vision systems operate absolutely<br />

stand-alone.<br />

· Due to their minimal hardware design, they are<br />
extremely low-maintenance.<br />

Conclusion: assuming equal performance, all arguments<br />
favor using Embedded Vision Systems.<br />

Fig. 1. Block diagram of a DSP-based camera. © Vision Components GmbH<br />
Fig. 2. System topology of a dual-core ARM combined with an FPGA. ©<br />
Vision Components GmbH<br />



V. APPLICATION AREAS OF EMBEDDED VISION SYSTEMS<br />

Thanks to technological achievements, there are hardly any limits<br />
to the applications that can be realized with Embedded Vision<br />
Systems:<br />

· General quality control like e.g. glass inspection or<br />

electronic parts inspection.<br />

· 1D and 2D code reading as well as Optical<br />

Character Recognition/OCR.<br />

· Pick & place applications as used for assembling<br />

robots.<br />

· In logistics, stereo vision for 3D applications and<br />

general ID reading tasks, parcel sorting, and<br />

general warehouse automation.<br />

· 3D laser triangulation, e.g. for weld inspection.<br />

· Motion analysis in sports, for medical use or use<br />

in virtual realities.<br />

· 3D stereo vision cameras, e.g. used in sports or<br />

entertainment industries for ball tracking in Golf<br />

simulators, or for people counting tasks.<br />

· License plate reading/LPR for automated access<br />

systems.<br />

· Biometrics like fingerprint scanners.<br />

· Special tasks like surface inspection with standalone<br />

interferometer.<br />

· Robot guidance systems for autonomous assembly<br />

robots.<br />

and many more.<br />

[1] D. P. F. Möller, Guide to Computing Fundamentals in Cyber-Physical<br />
Systems, Computer Communications and Networks, Springer International<br />
Publishing Switzerland, 2016, DOI 10.1007/978-3-319-25178-3_2, p. 37f.<br />
[2] Wikipedia: https://en.wikipedia.org/wiki/Apollo_Guidance_Computer,<br />
retrieved 19.01.2018.<br />
[3] ZYNQ and XILINX are registered trademarks of Xilinx, Inc.<br />



Shifting Advanced Image Processing from<br />

Embedded Boards to Future Camera Modules<br />

A Paradigm-Change for Embedded Designers?<br />

Paul Maria Zalewski<br />

Product Management<br />

Allied Vision Technologies GmbH<br />

Stadtroda, Germany<br />

Abstract— Today’s embedded designers can choose from a<br />

broad multitude of possibilities when it comes to embedded vision.<br />

Most of them choose so called CMOS (Complementary Metal<br />

Oxide Semiconductor) camera modules, which are integrated in<br />

our smartphones, tablets and laptops. These modules are the<br />

preferred choice for designers since these modules deliver an<br />

acceptable image quality, are small, cheap and, most importantly,<br />
easy to integrate into an embedded system thanks to the<br />
standard MIPI CSI-2 (MIPI Camera Serial Interface 2) interface.<br />

Besides these modules, other cameras are available with LVDS<br />

(Low Voltage Differential Signaling) or parallel interfaces, which<br />
make the integration more challenging for designers.<br />

These camera options for Embedded Vision have something<br />

very important in common: poor image processing capability<br />
inside the camera. The embedded community has worked around<br />
these limitations by running important<br />
image processing tasks on the CPU (Central Processing<br />
Unit) or on dedicated ISPs (Image Signal Processors) on the<br />
embedded board. Overall, cameras have played a subordinate role in<br />

the context of the whole embedded system. Their major role was<br />

just to collect photons, convert them into electrons, create an<br />

acceptable image and transfer it to the host.<br />

In the past years, we have seen tremendous development of<br />
embedded boards and their capability to handle complex tasks for<br />
the embedded world. This includes more powerful CPUs, GPUs and<br />
dedicated processing units, e.g. their own ISPs, which offer embedded<br />
designers more flexibility in designing their embedded systems.<br />

Simultaneously, embedded designers are confronted with a<br />

major question when it comes to embedded vision: Where can I<br />

perform image processing to get the best and most efficient results<br />

for my application?<br />

This paper gives the reader an overview of the kinds of image<br />
processing that exist and of the options for running these<br />
algorithms in embedded systems for vision applications.<br />

Keywords— embedded, embedded vision, vision, camera, camera<br />

module, CMOS, Sony CMOS, ONSEMI CMOS, advanced image<br />

processing, MIPI CSI-2, Video 4 Linux 2, USB3<br />

I. IMAGE PROCESSING<br />

Image processing is a general term for methods that<br />
perform algorithms on an image. By applying algorithms to<br />
the image, embedded designers aim to enhance the image itself,<br />
extract helpful information or simply reduce the volume of<br />
the image data to enable faster processing afterwards on the<br />
embedded board.<br />
The term image processing can be further divided into<br />
different subcategories.<br />


· First on the list is pre-processing, the first step<br />
after an image has been captured, e.g. by a CMOS<br />
camera. Such an image is also called a RAW image,<br />
since no algorithms have touched it yet. This<br />
image may contain unwanted effects like defective<br />
pixels or an uncalibrated white balance. This is<br />
where pre-processing algorithms help embedded<br />
designers improve the image data by suppressing<br />
distortions or removing defects and abnormalities.<br />
Typical examples are algorithms that perform<br />
defective pixel correction or noise reduction.<br />

In this stage, the image is already characterized<br />

with a specific level of acceptable image quality.<br />

However, it can be further improved, and additional<br />

processing can be applied.<br />

· The next block of processing is advanced image<br />
processing, sometimes just called image<br />
processing. Within this step, higher-level<br />
operations or image enhancements are performed to<br />
facilitate additional processing. A good example is<br />



sharpening: a 3x3 filter matrix is applied to the<br />
image to increase its sharpness.<br />

Another example is a Look-Up Table (LUT), which<br />

is an array that replaces runtime computation with<br />

a simpler array indexing operation. This can reduce<br />

the computing time significantly. Typically, LUTs<br />

are used to enhance contrast, brightness or color<br />

reproduction.<br />

· The final step is post-processing. It comprises<br />
techniques that automate the identification of features<br />
in a scene to produce a decision. A commonly used<br />
algorithm in this step is face detection,<br />
as implemented in smartphones.<br />
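To make the stages above concrete, here is a small Python sketch of two of the advanced-processing operations just described: a 3x3 sharpening filter and a brightness/contrast look-up table on 8-bit grayscale values. The gain, offset and kernel weights are common illustrative choices, not values prescribed by any particular camera:<br />

```python
# LUT: precompute the output for all 256 possible 8-bit inputs once,
# then every pixel becomes a cheap array lookup instead of a multiply-add.
def apply_lut(pixels, gain=1.2, offset=10):
    lut = [min(255, max(0, int(v * gain + offset))) for v in range(256)]
    return [lut[p] for p in pixels]

# Sharpening: convolve with a classic 3x3 kernel (border pixels left as-is).
def sharpen3x3(img):
    k = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = sum(k[j][i] * img[y + j - 1][x + i - 1]
                      for j in range(3) for i in range(3))
            out[y][x] = min(255, max(0, acc))   # clamp to 8-bit range
    return out

print(apply_lut([0, 100, 250]))   # → [10, 130, 255]
```

The LUT turns a per-pixel computation into a single table lookup, which is exactly why this operation maps well onto camera-side hardware.<br />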

II. SYSTEM ARCHITECTURE FOR IMAGE PROCESSING<br />

There are four key questions for engineers to answer when<br />
it comes to embedded vision applications.<br />


First, the selection of the necessary image<br />

processing tasks / algorithms for the specific<br />

application.<br />

Second, the selection of the right hardware platform<br />

including various options like main processor<br />

CPUs, or co-processors like GPUs, dedicated ISPs,<br />

video processors, DSPs (Digital Signal Processor),<br />

FPGAs (Field Programmable Gate Array) and so<br />

on.<br />

Third, the selection of the software platform;<br />
much of the image processing can, for example, be performed in<br />
software, with its own pros and cons.<br />

Fourth, the camera selection with preferred sensor<br />

resolution and especially image quality capabilities.<br />

Overall, a smart and careful consideration of all four<br />
questions and their options is key, especially in the embedded<br />
environment, where it is all about cost, power consumption and<br />
simply getting the maximum performance out of the selected<br />
(cost-sensitive) components.<br />

Today's preferred cameras for embedded vision are<br />
so-called CMOS camera modules. First introduced in mobile<br />
phones, they found their way into other applications besides<br />
mobile because of their attractive price, small size and low<br />
power consumption. They are still mainly driven by the mobile<br />
industry in terms of new resolutions and functionalities.<br />

Regarding image processing, most of them are equipped with a basic<br />
set of image processing algorithms. This includes, for<br />

example a set of automatic image control functions like auto<br />

exposure, white balance or black level calibration. Most<br />

importantly, they make use of simple algorithms to enhance the<br />

image quality by applying for example sharpness, lens<br />

correction, defect pixel correction or noise canceling. These are<br />

standard functionalities in state-of-the-art CMOS camera<br />

modules and designers configure them to get an acceptable<br />

image to the embedded board. The camera is connected via a<br />

MIPI CSI-2 connector and can be controlled by I2C to configure<br />

the registers of the camera.<br />
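As a rough illustration of this register-based control path (a hedged sketch, not any real sensor's register map): on Linux one would typically write the registers through /dev/i2c-* (e.g. via the smbus2 package) or through V4L2 controls; here a stand-in class simulates the device so the flow is visible. The device address and register numbers below are invented for the example:<br />

```python
# Simulated I2C-addressable camera module; a dict stands in for the
# device's register file so the example runs without hardware.
class FakeI2CCamera:
    def __init__(self, address=0x36):      # hypothetical 7-bit I2C address
        self.address = address
        self.registers = {}

    def write_reg(self, reg, value):
        self.registers[reg] = value & 0xFF  # registers hold one byte

    def read_reg(self, reg):
        return self.registers.get(reg, 0x00)

cam = FakeI2CCamera()
cam.write_reg(0x01, 0x40)   # hypothetical exposure register
cam.write_reg(0x02, 0x10)   # hypothetical analog gain register
print(cam.registers)  # → {1: 64, 2: 16}
```

The image data itself never travels over I2C; it flows over the MIPI CSI-2 lanes, while I2C carries only this kind of low-bandwidth configuration traffic.<br />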

Before the camera is fully functional, a CSI-2 driver must be<br />

written or reused for a specific embedded board and Linux<br />

distribution. These drivers are typically provided by the vendor<br />

of the camera module or are written by embedded solution<br />

providers themselves. Each camera module type needs its own<br />

driver. Therefore, once a decision for a camera is made, it will<br />

stay in the system for a longer time to avoid adjusting the driver<br />

environment. Before any other image processing<br />
question on the embedded board arises, designers need to<br />
decide whether to use an off-the-shelf driver for a specific<br />
supported embedded board or to program a driver of their own to<br />
have the flexibility of choosing a preferred embedded board.<br />

If the set of image processing algorithms provided by<br />
the CMOS camera module is sufficient for the application,<br />

embedded boards do not need to provide additional image<br />

processing functionalities. This is not the case in most<br />

applications. Therefore, todays embedded boards are equipped<br />

with many co-processors besides the main CPU.<br />

As stated earlier, co-processors like GPUs, dedicated<br />

ISPs, video processors, DSPs or FPGAs can be found on<br />

embedded boards. Each of them has advantages in specific<br />

areas of image processing. That means they can process specific<br />

algorithms very efficiently. Video processors for example have<br />

integrated hardware IP for image compression, which is the<br />

preferred choice over other co-processors. If the use case<br />

requires for example some 3D operations or geometric<br />

transformation, pure CPU based embedded boards would reach<br />

their limits very fast. In this case, a board with integrated GPU<br />

is recommended, because it can process these operations much<br />

faster due to its architecture. On the other hand, when it comes to<br />
pattern matching, the CPU is the preferred choice over the GPU<br />
since it can process these types of algorithms more efficiently.<br />

When it comes to more advanced algorithms and image<br />

processing, like special filters, pixel and signal processing,<br />

CPUs as well as GPUs reach their limits at an early stage. This<br />

is where FPGAs and DSPs are getting interesting for the<br />

embedded designers. Besides the fact that they can perform such<br />

tasks more efficiently, they can be re-programmed, which offers<br />

the designers even more flexibility. Some drawbacks are that the<br />

overall cost of the embedded design and its complexity are<br />

increased.<br />

Another option for embedded designers are dedicated ISPs<br />

on embedded boards. They contain a specific set of image<br />

processing algorithms and enjoy considerable popularity among<br />

designers. In most cases, it is no longer necessary to<br />
use all the image processing algorithms provided by the CMOS<br />
camera module. Instead, the RAW image out of the<br />

sensor is directly transmitted to the ISP via MIPI CSI-2 and<br />



processed on the embedded board. From the hardware point of<br />

view, this option has major advantages for the embedded vision<br />

application. Nevertheless, the embedded designers are<br />

confronted with two disadvantages with this option. First, to get<br />

the full control of the ISP, the software environment must be<br />

designed around the vendor specific software and therefore,<br />

lacks flexibility. Second, such an option has its price. State of<br />

the art system on modules like the NVIDIA Tegra TX2 with<br />

dedicated ISP start with a list price of $499, with a carrier board<br />

not included.<br />

III. ALTERNATIVE OPTION FOR IMAGE PROCESSING<br />

In the previous chapter, different options were described to<br />

perform image processing algorithms from the camera itself to<br />

the embedded board with its CPU and optional co-processors.<br />

An alternative option is to shift some of the image processing<br />
tasks back to the camera. The cornerstone of this alternative<br />

approach is a new kind of camera module, which is powered by<br />

an Application-Specific Integrated Circuit (ASIC) with an<br />

integrated ISP and image processing library. It performs a<br />

similar kind of image processing like state of the art CMOS<br />

camera modules, but extends it with processing usually<br />

performed in high-end embedded boards with dedicated ISPs or<br />

integrated FPGAs and DSPs. This includes Pre-processing, as<br />

well as advanced image processing functionality for example<br />

filters, pixel operations, signal processing or color space<br />
conversion.<br />

One major advantage of performing more image processing<br />
in the camera instead of on the embedded board is the reduction<br />
of processor and co-processor load. This frees up resources<br />
for other tasks. It also helps the embedded designer<br />
to accelerate the development phase and answers the question of<br />
where specific image processing tasks are best performed. An<br />

FPGA or DSP may still be required on the embedded board for<br />

a specific application, but due to the shifting of some image<br />

processing tasks to the camera, less logic cells on the FPGA or<br />

DSP part are needed. This reduces the overall system costs of<br />

the design.<br />

Furthermore, ultra-low-cost embedded boards can be<br />

considered as potential alternatives to mainstream and high-end<br />

equipped boards. So far, designers rely on additional processing<br />

power on these higher cost embedded boards to perform their<br />

required image processing tasks. Since some of the processing<br />

tasks can be shifted and operated on the new camera module,<br />

costs can be reduced by selecting a less performant embedded<br />

board.<br />

Another challenge designers are confronted with, which is<br />
not directly related to image processing but worth a mention, is<br />
the need for a camera driver.<br />

Each CMOS camera module needs its own camera driver.<br />

Every time an embedded designer wants to change the module<br />

for example to implement a higher resolution sensor, he or she<br />

needs to rewrite and configure the driver again. This is not the<br />

case with the new camera module. Once the driver is configured<br />
for a specific SoC (System-on-Chip) and embedded board,<br />
designers can easily replace the camera module with one of higher<br />
sensor resolution without touching the overall architecture of the<br />

camera driver. The driver itself is provided by the camera vendor<br />

for selected SoCs and embedded boards. Alternatively, the<br />

camera driver is planned to be open source to enable the<br />
highest flexibility possible for embedded designers, should a<br />
specific SoC or embedded board not be supported by the camera<br />
provider.<br />

IV. CONCLUSION<br />

As of today, embedded system designers are facing design<br />

challenges when it comes to image processing and embedded<br />

vision. Different options are available on the market with<br />

individual advantages and disadvantages, which need careful<br />

consideration during the design phase in terms of performance<br />

and cost. Looking at the camera, designers have commonly used<br />
CMOS camera modules to add vision to their embedded<br />

design. When it comes to image processing, they have two basic<br />

options regarding the camera. First, using the image processing<br />
functionality implemented in the camera and accepting that<br />
additional pre-processing and advanced image processing are<br />
performed on the embedded board if necessary. Second, skipping<br />
most of the image processing functionality in the camera and<br />
instead making use of more sophisticated image processing<br />

algorithms on the embedded board. Both scenarios fulfil the<br />

requirements of the embedded vision designer in terms of image<br />

processing. What they do not solve satisfactorily are the<br />
overall system costs and the fact that each CMOS<br />
camera module needs its individual camera driver every time the<br />
embedded vision system is upgraded.<br />

As an alternative option, the camera vendor Allied Vision<br />

developed a new kind of camera module for embedded vision<br />

designers. The 1 Product Line, with its embedded camera<br />
modules of the 130 C Family and 140 C Family, is<br />
designed not only to perform pre-processing and advanced<br />
image processing, but also to meet a specific price point, starting<br />
with a list price of 99€ for a single camera module. Both families<br />

are equipped with the most common used interface in the<br />

embedded system environment: MIPI CSI-2. In the first phase,<br />

camera drivers will be available for embedded boards powered<br />

by the NXP i.MX6 and NVIDIA Tegra TX1 and TX2 SoC.<br />

More driver support is planned for boards with the upcoming<br />

i.MX8 SoC. Embedded designers can easily upgrade their<br />

camera module within the same SoC architecture, without<br />

reprogramming the whole CSI-2 camera driver like in the past.<br />

Furthermore, it will be open source to give designers a<br />
maximum amount of flexibility.<br />

In summary, embedded designers will get an attractive<br />

camera module option for their next embedded vision system<br />

design as an alternative to the widely used CMOS camera<br />

modules driven by the mobile industry.<br />



Image Data Compression with a System-on-a-Chip<br />

Joerg Mohr<br />

Imaging Department<br />

Solectrix GmbH<br />

Nuremberg, Germany<br />

Abstract— The exponentially growing use of cameras and<br />

other image data acquisition devices requires efficient methods<br />

for image data compression. Not all image compression<br />

requirements can be fulfilled with common multimedia<br />

standards, some need special compression methods. If such a<br />

method shall be implemented in an embedded device, the<br />

realization in an FPGA (field-programmable gate array) or SoC<br />

(System-on-a-Chip) is a valid option. In this paper we briefly<br />

introduce reasons for and fundamentals of image compression<br />

methods. We present solutions for implementing compression<br />

techniques in an FPGA and give a brief overview of SoCs.<br />

Finally, we describe a specific realization of image data<br />

compression in an SoC.<br />

Keywords—image data compression; System-on-a-chip; FPGA<br />

I. INTRODUCTION<br />

A. Image Data Compression — Why?<br />

Semiconductor technology is getting ever more powerful<br />

and affordable. On the one hand, this helps lower costs for<br />

memory and data rates. For example, according to [1], costs for<br />

NAND flash memory are expected to decrease by a factor of<br />

10 within six years. So one could ask why image data<br />

compression methods are still needed in an environment with<br />

affordable data storage costs. On the other hand, this broad<br />

availability of powerful and affordable semiconductor<br />

technology allows for the creation of new and more products<br />

with high-resolution image sensors and powerful processors. A<br />

study expects that video (i.e. image content) will account for up to<br />
75% of mobile data traffic [2]. In order to handle this<br />
increasing amount of data, advanced techniques that reduce the<br />
data rate while keeping image quality within expectations are<br />
crucial.<br />

B. Special Image Compression Requirements Demand Special Solutions

Many image compression standards are targeted at common applications like still or video cameras, TV sets or mobile phones. These multimedia applications are widely used and, due to the large number of devices, they allow the creation of optimized software libraries for generic processors or the implementation of specific image data compression co-processors in silicon. Even an extra-low-cost Internet of Things (IoT) system like the widely used Raspberry Pi is equipped with a complex H.264 video codec, enabling low-delay video streaming [3].

Although these codecs for multimedia applications are by now available for 4K resolutions at up to 60 frames per second (fps), they often lack properties required by applications with special image compression requirements. Example applications that require properties beyond common multimedia codecs are high-end acquisition devices (professional cameras, film scanners), medical imaging systems (computed tomography scanners, ultrasound scanners), etc.

For these special applications, there are software libraries for generic processors, but silicon co-processors are not available due to the low quantities of the applications. As a result, if these methods are needed in compact embedded systems, the implementation of image compression methods in an FPGA (field-programmable gate array) or in a System-on-a-Chip (SoC) becomes attractive.

II. BASICS OF IMAGE DATA COMPRESSION

A. Generic Types of Data Compression

The goal of data compression is to reduce the amount of data. Basically, two types of compression can be distinguished:

- Mathematically lossless compression, which encodes the data in a more efficient manner but is completely restorable. An often distracting feature of lossless data compression is that the data rate depends on the entropy of the input signal and cannot be efficiently controlled.

- "Lossy" compression, which reduces the amount of data by controlled loss of content that is considered to be negligible. The original cannot be completely restored from the compressed data.
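The entropy dependence of lossless coding is easy to demonstrate with a general-purpose lossless codec (zlib here, standing in for any entropy coder; the two synthetic inputs are illustrative, not image data):

```python
import random
import zlib

random.seed(0)
repetitive = bytes([0, 1] * 50_000)                           # low-entropy signal
noisy = bytes(random.randrange(256) for _ in range(100_000))  # high-entropy signal

ratio_rep = len(repetitive) / len(zlib.compress(repetitive, 9))
ratio_noise = len(noisy) / len(zlib.compress(noisy, 9))

# The same lossless coder yields a huge ratio on the repetitive input and
# roughly 1:1 (no gain at all) on the noise-like input.
print(f"repetitive input: {ratio_rep:.1f}:1, noisy input: {ratio_noise:.2f}:1")
```

The achievable rate is a property of the input, which is exactly why lossless-only schemes cannot guarantee a constant bit rate.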

B. Methods of Image Data Compression

www.embedded-world.eu

Since with lossless methods only compression ratios of approx. 2:1 can be expected on natural images, most image data compression methods combine both types. To be more specific, many image compression methods combine the following techniques: (1) transformation, to reduce correlation of the input data, (2) quantization, to reduce information, and (3) encoding, to reduce entropy.

An example of an image data compression scheme that combines different techniques is JPEG (Joint Photographic Experts Group), named after the group that standardized the format [4]. It comprises:

1) Color space transformation: To transform the primary RGB (Red/Green/Blue) color components to luminance and chrominance values.

2) Subsampling of chrominance components: Because the human eye is more sensitive to luminance than to chrominance, the latter can be spatially subsampled without affecting the visual perception of the reconstructed image. This step is optional in the standard, but it is most commonly used as it improves the compression performance for most natural images.
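Steps 1) and 2) can be sketched in a few lines (a minimal NumPy sketch using the full-range BT.601 matrix that JPEG employs; the flat gray test patch is an arbitrary choice):

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """JPEG (full-range BT.601) color transform; rgb is an HxWx3 array in 0..255."""
    m = np.array([[ 0.299,     0.587,     0.114    ],
                  [-0.168736, -0.331264,  0.5      ],
                  [ 0.5,      -0.418688, -0.081312]])
    ycc = rgb @ m.T
    ycc[..., 1:] += 128.0      # center the chrominance channels around 128
    return ycc

def subsample_420(chroma):
    """4:2:0 subsampling: average each 2x2 block of a chrominance channel."""
    h, w = chroma.shape
    return chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rgb = np.full((4, 4, 3), 200.0)     # flat gray patch: R = G = B = 200
ycc = rgb_to_ycbcr(rgb)
cb = subsample_420(ycc[..., 1])     # 4x4 chrominance -> 2x2 samples
print(ycc[0, 0], cb.shape)          # gray maps to Y = 200, Cb = Cr = 128
```

Subsampling both chrominance channels this way halves the amount of data before any coding has taken place.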

3) Discrete Cosine Transform (DCT): The color-converted and optionally subsampled input image is divided into many 8x8 blocks. During encoding each block is transformed into another 8x8 block of DCT coefficients. The mathematical definition of the forward DCT is:

F(u,v) = \frac{1}{4} C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16},  (1)

where C(0) = 1/\sqrt{2} and C(k) = 1 for k > 0.

The DCT has been chosen as it concentrates the energy of the transformed signal in the low-frequency range. Because the human eye is less sensitive to high-frequency content, this contribution can be reduced.

4) Quantization: This reduction of the high-frequency contribution can be achieved by quantization with a carefully selected quantization matrix: each coefficient in the 8x8 DCT block is divided by a corresponding quantization value and the result is rounded, so data is lost:

F_Q(u,v) = \mathrm{round}\!\left(\frac{F(u,v)}{Q(u,v)}\right)  (2)

5) Entropy Encoding: After quantization, the first coefficient is treated differently from the other 63 coefficients. The latter are often zero, and by applying a zigzag scan, an efficient zero-run-length coding can be applied. The results can be represented even more efficiently by coding them via a Huffman table. See [5] for further details.

For correct interpretation of the compressed data, the results of all these steps are ordered into a file and supplied with descriptive headers.
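Steps 3) to 5) can be condensed into a short numerical sketch (a NumPy implementation of Eq. (1) and Eq. (2) plus the zigzag scan; the gradient test block and the flat quantization matrix are illustrative choices, not values from the standard):

```python
import numpy as np

# 8x8 DCT-II basis matrix; row u holds (C(u)/2) * cos((2x+1) u pi / 16), so
# fdct2 below evaluates Eq. (1) as a pair of matrix products.
u = np.arange(8).reshape(-1, 1)
x = np.arange(8).reshape(1, -1)
C = 0.5 * np.cos((2 * x + 1) * u * np.pi / 16)
C[0, :] /= np.sqrt(2)

def fdct2(block):
    """Forward 2-D DCT of one 8x8 block, Eq. (1)."""
    return C @ block @ C.T

def quantize(coeffs, q):
    """Eq. (2): divide by the quantization matrix and round."""
    return np.rint(coeffs / q).astype(int)

# Zigzag order: walk the anti-diagonals, alternating direction, so that the
# high-frequency zeros end up in one long run at the end of the sequence.
zigzag = sorted(((r, c) for r in range(8) for c in range(8)),
                key=lambda rc: (rc[0] + rc[1],
                                rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

block = 8.0 * np.add.outer(np.arange(8), np.arange(8))  # smooth gradient block
q = np.full((8, 8), 16.0)                               # flat quantization matrix
coeffs = quantize(fdct2(block - 128.0), q)              # JPEG-style level shift
scan = [coeffs[r, c] for r, c in zigzag]
print("DC:", scan[0], "nonzero AC:", sum(v != 0 for v in scan[1:]))
```

On this smooth block the energy concentrates in a handful of low-frequency coefficients, which is exactly what makes the subsequent zero-run-length and Huffman coding effective.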

This diversification shows that there is currently no universal image compression method. Instead, it is rather an art to select the best-suited compression method for the expected content, expected use, tolerable artifacts and other side parameters.

III. IMAGE DATA COMPRESSION IN THE FPGA

As mentioned in previous chapters, requirements may forbid pure software or hardwired co-processor realizations of image data compression. Modern FPGAs are equipped with enough logic resources to allow complex calculations. Nevertheless, some challenges exist in FPGA realizations that have to be dealt with:

- Floating-point operations in an FPGA are inefficient.
- FPGA-internal memory is expensive and scarce.
- External memory needs extra logic for signal control and arbitration. Access to it is often a bottleneck.
- Complex decisions, e.g., nested if-else decisions and loops, are difficult to implement.

On the other hand, FPGAs offer advantages to the implementer that a generic processor cannot provide, like massively parallel processing and deep pipelining architectures. Additionally, modern FPGAs are equipped with many dedicated DSP (Digital Signal Processing) resources that allow over 5,000 GMAC/s of peak performance [6].

Systems-on-a-Chip provide an additional advantage: as the FPGA and processor subsystems are connected via high-performance buses, they can be programmed to work on a closely connected basis. This allows for combining the advantages of both architectures.

In this paper we present some example challenges of implementing image compression methods in an FPGA or SoC and the solutions to overcome them:

1) DCT Implementation: Equation (1) gives the mathematical definition, but obviously a direct implementation of cosine functions and loops would be inefficient in an FPGA. Other authors have already developed efficient DCT implementations. The Arai-Agui-Nakajima (AAN) algorithm [7] is among the fastest known DCTs. As a solution to the DCT implementation challenge, the algorithm was analyzed and optimized for parallel operation into an efficient realization. The following flow chart gives an overview of the internal pipelining structure:

C. Other Image Compression Methods

Although JPEG is still widely used, other image compression methods have evolved: on the one hand to provide better compression results, on the other hand to better cope with specific input data types, or to provide different tolerable compression artifacts or side parameters that derive from the target application. Other image compression methods include, e.g., GIF, PNG, JPEG-LS, JPEG 2000, ProRes®, DNxHD®, AVC-Intra®, HEIF, etc.


The FPGA logic should be used for parallel pre-processing where possible, while complex decisions should be handled in the processor. In comparison to an RTL-described "soft core" within an FPGA implementation, a processor in an SoC offers much higher performance.

Fig. 1. Flow chart of DCT implementation

2) Quantization implementation: The definition of the quantization is mathematically simple, see Eq. (2), but difficult to implement in an FPGA. A division operation needs the combination of several logic and DSP operations, thus increasing the complexity and lowering the processing speed of the overall design. As a solution to the quantization implementation challenge, all possible divisor values were converted to their reciprocals in 18-bit precision. The pre-calculated values are stored in internal memory. As the divisor values range from 1 to 256, the memory consumption for these reciprocal values is quite manageable. By simple multiplications following a predefined order, this realization allows one quantization operation per clock cycle.
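The reciprocal trick can be checked numerically (a Python model of the fixed-point scheme; the 18-bit fraction width follows the text, while the rounding convention and table layout are assumptions of this sketch):

```python
# Replace division by a stored 18-bit reciprocal multiply, as in the FPGA quantizer.
FRAC = 18                                                     # fixed-point fraction bits
recip = {d: round((1 << FRAC) / d) for d in range(1, 257)}    # precomputed table

def quant_div(x, d):
    """Reference: round(x / d), round-half-up for non-negative x."""
    return (x + d // 2) // d

def quant_mul(x, d):
    """FPGA-style: one multiply, one add and one shift per coefficient."""
    return (x * recip[d] + (1 << (FRAC - 1))) >> FRAC

worst = max(abs(quant_mul(x, d) - quant_div(x, d))
            for x in range(0, 4096, 3) for d in range(1, 257))
print("worst-case error vs. true division:", worst)
```

For 12-bit coefficient magnitudes the 18-bit reciprocals reproduce the rounded division to within one LSB, which is why the table-plus-multiplier realization is acceptable in place of a true divider.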

3) Optimization strategies: While image compression standards mostly describe only the techniques, there is a wide area for implementers to optimize their parameters. For example, the selection of quantization matrices in JPEG to achieve either a constant quality or a constant bit rate (CBR) over several images is the responsibility of the programmer. Modern image compression methods offer a vast set of other options that should be carefully selected in order to achieve optimal compression performance. The following image shows an example of a complex mode decision algorithm in a video transcoding implementation:

Fig. 2. Complex mode decision algorithm. From [8]

Although complex decisions can be described in register transfer languages (RTL) like VHDL or Verilog, the effort for programming, maintaining and changing the code is far higher than in programming languages for a processor. As a solution to overcome this challenge, we propose a closely coupled system of FPGA logic and processor.

IV. SYSTEMS-ON-A-CHIP

A. Definition and Market Overview

Generally, a System-on-a-Chip (SoC) is an integrated circuit that consists of several components like analog or digital interfaces, processing or logic functions. A typical application is in the field of embedded systems.

Within this paper, the term SoC is used for silicon devices:

- with processor and FPGA subsystems on one die,
- both connected via internal buses,
- with a predefined set of processor interfaces like UART, SPI, memory interfaces, etc.

Other functionality can be programmed in the FPGA subsystem.

By now, all major FPGA manufacturers offer SoCs that combine their programmable FPGA fabric with a hard-coded processor based on the ARM architecture. The following table shows some of the SoCs on the market:

TABLE I. MARKET OVERVIEW SOCS

Manufacturer | Microsemi® | Intel® FPGA (Altera®) | Xilinx®
Name | SmartFusion®, SmartFusion®2 | Cyclone® V SoC, Arria® V SoC, Arria® 10 SoC, Stratix® 10 SoC | Zynq®-7000 SoC, Zynq® UltraScale+ MPSoC
CPU subsystem: Core | 1× ARM® Cortex®-M3 | 1..4× ARM® Cortex®-A9, A53 | 1..4× ARM® Cortex®-A9, A53 + 2× R5
CPU subsystem: Memory | SRAM, Flash | Cache, SRAM | Cache, SRAM
CPU subsystem: Interface | Serial, EMC, 10/100 Eth, ... | Serial, EMAC, USB, ... | Serial, EMS, GbE, USB, PCIe, SATA, ...
FPGA subsystem: Logic | 700...150K logic elements | 25K...5500K logic elements | 23K...1143K logic cells
FPGA subsystem: Memory | 36...4488 Kbits | 1.4...229 Mbits | 1.8...70 Mbits
FPGA subsystem: Interfaces | ADCs, DACs, SerDes 3G, PCIe Gen2, ... | SerDes 3...28G, PCIe Gen3, ... | ADCs, SerDes 3...32G, PCIe Gen4, ...

Although FPGA manufacturers provide various tools for parallel FPGA and processor design, example designs, and other support, the development of embedded systems with SoCs is still a complex task. To overcome this challenge and to allow a faster time to market, independent manufacturers have started the development of SoC modules and base boards. Examples are available at [9].



B. The SXoM MS2-K7 module

This specific image data compression implementation was planned for an SXoM MS2-K7 module by Solectrix GmbH, Germany. It is a module in the SMARC® format (Smart Mobility ARChitecture), which has been defined by the Standardization Group for Embedded Technologies e.V. [10]. It is primarily designed for the development of extremely compact low-power systems. Its edge finger connector provides 281 signal lines, with a mixture of typical energy-saving interfaces like SPI and I²C alongside classical computer interfaces such as USB, SATA and PCI Express.

Generally, SMARC® modules are based on ARM processors. They can, however, also be fitted with other SoC architectures. The SXoM MS2-K7 module by Solectrix GmbH is equipped with a Xilinx® Zynq® Z-7030/35/45 SoC.

Fig. 3. Block diagram of a Xilinx® Zynq®-7000 series SoC. From [11]

The SXoM MS2-K7 module provides two banks of DDR3 memory, USB 2.0 host/client functionality, Ethernet PHYs and non-volatile memory, as shown in the following diagram:

Fig. 4. Block diagram of the SXoM MS2-K7 module. From [12]

Using the SMARC® format offers the choice of various base boards, as well as the option of starting development with an ARM processor-only module first or in parallel.

V. APPLICATION EXAMPLE: IMAGE DATA COMPRESSION ON A SXOM MS2-K7 MODULE

A. Requirements

Within this paper we describe an example application of image data compression on an SoC. The exact image compression method cannot be disclosed, but the requirements for an encoder-only implementation were:

- Image resolution: 4096 x 2160 pixels (4K)
- Data format: RGB with 10 bit per color component
- Frame rate: up to 60 fps
- Target compression rate: 4:1 up to 20:1
- Quality: similar to the given software reference
- Other compression properties:
  o Optional chrominance subsampling
  o Block-based DCT
  o CBR with content-optimized quantization
  o Optimized for parallel decoding on a processor, i.e. a header with information about the compressed data organization is mandatory
- Architecture: Xilinx® Zynq®-7000 SoC
- Realization: as an IP core with AXI (Advanced eXtensible Interface) input/output

As the system will not only perform image compression but also other tasks, e.g., networking, user interaction, etc., an implementation method was preferred that allows parallel development and verification of these tasks.

B. Chosen Course of Action

Based on these requirements the following course of action was planned:

1) Evaluation of the compression format in software: In order to evaluate the achievable quality, a bit-exact variant of the reference software was created. This included replacing all floating-point operations with appropriate fixed-point representations and also implementing the quantization as mentioned above. Only the color conversion from RGB to luminance/chrominance, the subsampling, and a framebuffer in external memory were implemented in the FPGA. The software variant was compiled for the processor subsystem, with access to the framebuffered data. Although this implementation needs several seconds per image to compress and is far from the required frame rate, both the AXI interfaces and the image quality could be verified on a system prototype. By using an SXoM with preconfigured memories for the FPGA and processor subsystems, the implementation time could be shortened.



2) Realization of a simplified compression scheme in FPGA: Within this step, the DCT, the quantizer and the entropy encoder were implemented as a module in the FPGA in order to verify the implementation speed. This module was called the "compressor". The module was implemented in a parallel architecture, processing two pixels at the same time at a 280 MHz clock speed. No optimization strategies were included in this step, thus delivering compressed data without CBR properties and, due to the nature of the compression method, without a header. Nevertheless, together with a modified reference software, the reconstruction of the compressed image data was possible in order to compare the achieved quality.

3) Final step: The optimization strategy in the software reference requires evaluation of all DCT blocks with different quantization parameters before choosing a final set of parameters. Within this step this requirement was solved by n-fold implementation of the "compressor" module without data output, delivering a database of compression side information to an optimization algorithm. The latter is implemented on one of the two processor cores of the SXoM module. In order to achieve close coupling to the FPGA modules without software overhead, this core needs no operating system and runs as a so-called "bare-metal application". This different treatment of processor cores is called asymmetric multiprocessing (AMP). After performing an optimization algorithm in order to achieve content-optimized quantization and CBR, the application controls the final quantization process in a separate "compressor" module in the FPGA. Additionally, it creates the header with the information about the compressed data organization. The FPGA modules and the software were designed for close coupling without complex synchronization overhead, and they take advantage of the high-performance internal buses in the SoC.

C. Results

An image compression method that was originally targeted at generic processors was implemented on a Xilinx® Zynq® SoC. A compression performance for 4K images of up to 60 fps was achieved. All requirements regarding compression parameters, image quality, and implementation details were met. The consumed SoC resources for the encoder IP core are:

TABLE II. SOC RESOURCES

Encoder IP core | Slice LUTs | Slice Registers | Block RAM | DSP slices | Processor cores
Absolute numbers | 68,128 | 71,574 | 177 | 98 | 1
Relative usage of a 7035 device | 39.6 % | 20.8 % | 35.4 % | 10.8 % | 50 %

In parallel to the implementation of the image data compression system, other embedded system tasks could be developed.

The usage of an SXoM MS2-K7 module in a standardized format and with predefined memory resources accelerates the development process. Stepwise integration of the image compression components enables verification on the target platform.

REFERENCES

[1] D. Floyer, D. Vellante, B. Latamore and R. Finos, "The Emergence of a New Architecture for Long-term Data Retention," Wikibon.org, July 2014.
[2] R. Möller, P. Jonsson, S. Carson et al., "Ericsson Mobility Report," Ericsson AB, November 2017.
[3] U. Jennehag, S. Forsstrom and F. V. Fiordigigli, "Low Delay Video Streaming on the Internet of Things Using Raspberry Pi," MDPI (Multidisciplinary Digital Publishing Institute), September 2016.
[4] Joint Photographic Experts Group, "Information technology -- Digital compression and coding of continuous-tone still images: Requirements and guidelines" (ISO/IEC IS 10918-1 / ITU-T T.81), February 1994.
[5] W.-Y. Wei, "An Introduction to Image Compression," Graduate Institute of Communication Engineering, National Taiwan University, 2008.
[6] T. Hill, "Accelerating Design Productivity with 7 Series FPGAs and DSP Platforms," Xilinx Inc., February 2013.
[7] Y. Arai, T. Agui and M. Nakajima, "A Fast DCT-SQ Scheme for Images," Trans. IEICE, vol. E-71, no. 11, pp. 1095-1097, November 1988.
[8] K. Lee, G. Jeon and J. Jeong, "Fast mode decision algorithm in MPEG-2 to H.264/AVC transcoding including group of picture structure conversion," Opt. Eng. 48(5), 057003, doi:10.1117/1.3127198, May 2009.
[9] Solectrix GmbH, "SXoM Modules," https://www.solectrix.de/en/sxom-modules
[10] SGET Standardization Group for Embedded Technology e.V., "Smart Mobility ARChitecture," Version 2.0, June 2016.
[11] Xilinx Inc., "Zynq-7000 All Programmable SoC Product Advantages," https://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html
[12] M. Schetter, "SXoM MS2-K7 System-on-Module compliant with SMARC 2.0 specification," Solectrix GmbH, December 2017.



CPU or FPGA for Image Processing: Choosing the Best Tool for the Job

Kevin Kleine
Vision Product Manager
National Instruments
Austin, Texas

Abstract—This paper is a comparison of the considerations and advantages involved in image processing on CPUs and FPGAs. It explores common architectures and core software implementation constraints.

Keywords—Image Processing; FPGA; CPU; Co-Processing; Preprocessing

I. INTRODUCTION

Machine vision has long been used in industrial automation systems to improve production quality and throughput by replacing manual inspection traditionally conducted by humans. We've all witnessed the mass proliferation of cameras in our daily lives in computers, mobile devices, and automobiles. However, the biggest advancement in machine vision has been processing power. With processor performance doubling every two years and a continued focus on parallel processing technologies like multicore CPUs, GPUs, and FPGAs, vision system designers can now apply highly sophisticated algorithms to visualize data and create more intelligent systems.

This increase in performance means designers can achieve higher data throughput to conduct faster image acquisition, use higher resolution sensors, and take full advantage of some of the latest cameras on the market that offer the highest dynamic ranges. An increase in performance helps designers not only acquire images faster but also process them faster. Preprocessing algorithms such as thresholding and filtering or processing algorithms such as pattern matching can execute much more quickly. This ultimately gives designers the ability to make decisions based on visual data faster than ever.

As more vision systems that include the latest generations of multicore CPUs and powerful FPGAs reach the market, vision system designers need to understand the benefits and trade-offs of using these processing elements. They need to know not only the right algorithms to use on the right target but also the best architectures to serve as the foundations of their designs.

II. INLINE VS. CO-PROCESSING ARCHITECTURE

A. Co-Processing

Before investigating which types of algorithms are best suited for each processing element, you should understand which types of architectures are best suited for each application. When developing a vision system based on the heterogeneous architecture of a CPU and an FPGA, you need to consider two main use cases: inline processing and co-processing. With FPGA co-processing, the FPGA and CPU work together to share the processing load. This architecture is most commonly used with GigE Vision and USB3 Vision cameras because their acquisition logic is best implemented using a CPU. The image is acquired on the CPU and sent to the FPGA via direct memory access (DMA). The FPGA can then perform operations such as filtering or morphology. The image can then be sent back to the CPU for more advanced operations such as optical character recognition (OCR) or pattern matching. In some cases, the entire algorithm can be implemented on the FPGA, sending only the results back to the CPU. This allows the CPU to devote more resources to other operations (e.g. motion control, network communication, image display) and can increase total system performance.

Fig. 1. In FPGA co-processing, images are acquired using the CPU and then sent to the FPGA via DMA so the FPGA can perform operations.

B. Inline Processing

In an inline FPGA processing architecture, you connect the camera interface to the I/O pins of the FPGA, enabling the pixels to stream directly from the camera to the FPGA. This architecture is commonly used with Camera Link and CoaXPress cameras because their acquisition logic is well suited for FPGA implementation. This architecture has two main benefits. First, just like with co-processing, you can use inline processing to move some of the work from the CPU to the FPGA by performing preprocessing functions on the FPGA. For example, you can use the FPGA for high-speed preprocessing functions such as filtering or thresholding before sending pixels to the CPU. This also reduces the amount of data that the CPU must process because it implements logic to only capture the pixels from regions of interest, which increases overall system throughput. Traditional Camera Link and CoaXPress frame grabbers use an FPGA to decode the pixel bus and transmit images to a CPU (typically via PCIe). Some of these frame grabbers offer fixed functionality such as onboard de-mosaicing or convolution with a set kernel size. A more advanced approach is to enable the FPGA onboard the frame grabber to implement custom user logic. In this architecture, systems can leverage the user's choice of prebuilt IP from specific software toolchains or enable users to design custom algorithms to run onboard the FPGA. The second benefit of this architecture is that it allows high-speed control operations to occur directly within the FPGA without using the CPU. FPGAs are ideal for control applications because they can run extremely fast, highly deterministic loop rates. An example of this is high-speed sorting, during which the FPGA sends pulses to an actuator that then ejects or sorts parts as they pass by.

Fig. 2. In the inline FPGA processing architecture, the camera interface is connected directly to the pins of the FPGA, passing the pixels directly from the camera.

III. CPU VS. FPGA VISION ALGORITHMS

With a basic understanding of the different ways to architect heterogeneous vision systems, you can look at the best algorithms to run on the FPGA. First, you should understand how CPUs and FPGAs operate. To illustrate this concept, consider a theoretical algorithm that performs four different operations on an image and examine how each of these operations runs when implemented on a CPU and an FPGA.

A. Hypothetical FPGA vs. CPU Processing Example

Many image processing algorithms are inherently parallel and hence suitable for FPGA implementations. These algorithms, which involve operations on pixels, lines, and regions of interest, do not need high-level image information such as patterns. You can perform these functions on small regions of bits as well as on multiple regions of an image simultaneously.

CPUs operate sequentially; the first operation must run on the entire image before the second one can start [for the sake of high-level discussion, this overlooks modern CPU optimization techniques such as pipelining and multithreading]. In this example, assume that each step in the algorithm takes 6 ms to run on the CPU. This results in a total processing time of 24 ms.

Now consider the same algorithm running on the FPGA. Since FPGAs are massively parallel in nature, each of the four operations in this algorithm can operate on different pixels in the image at the same time. In this example, the latency for the first operation to finish processing the initial pixels is 2 ms. From this point on, the operations run in parallel. This parallelism enables a significantly reduced processing time of 6 ms, which is substantially faster than the CPU implementation. Even in an FPGA co-processing architecture, factoring in the image transfer latency, the total processing time is still improved.
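The arithmetic above can be captured in a toy timing model (the figures come from the hypothetical example; the model deliberately ignores pipeline fill details beyond the 2 ms latency mentioned):

```python
STAGES = 4    # operations in the algorithm
PASS_MS = 6   # time for one operation to stream over the whole image

def cpu_total(stages=STAGES, pass_ms=PASS_MS):
    """Sequential CPU model: each operation finishes before the next starts."""
    return stages * pass_ms

def fpga_total(pass_ms=PASS_MS, fill_ms=2):
    """Pipelined FPGA model: all stages process different pixels concurrently,
    so the total time approaches a single pass over the image; the pipeline
    fill latency overlaps with that pass."""
    return max(pass_ms, fill_ms)

print(cpu_total(), "ms on the CPU vs.", fpga_total(), "ms on the FPGA")
```

The key point is that the CPU total grows with the number of stages, while the pipelined total does not.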

Fig. 3. Since FPGAs are massively parallel in nature, they can offer significant performance improvements over CPUs.

B. Practical FPGA vs. CPU Benchmarked Example

Now consider a real-world example in which an image is preprocessed for particle counting. First, you apply a convolution filter to sharpen the image. Next, you run the image through a threshold to produce a binary image. This reduces the amount of data in the image by converting from an 8-bit monochrome to a binary representation, enabling a more efficient morphology algorithm. The last step is to use morphology to apply the close function, which removes any holes in the binary particles. This algorithm executed on a CPU suffers the performance limitation discussed above. In practice, it takes 166.7 ms when using the NI Vision Development Module for LabVIEW and the cRIO-9068 CompactRIO Controller based on a Xilinx Zynq-7020 All Programmable SoC. However, the same algorithm run on the FPGA executes every step in parallel as each pixel completes the previous step. This results in the FPGA taking 8 ms to complete the processing. This 8 ms benchmark includes the DMA transfer time to send the image from the CPU to the FPGA. In some applications, you may need to send the processed image back to the CPU for use in other parts of the application. Factoring in time for that, the entire process takes only 8.5 ms. In total, the FPGA can execute this algorithm nearly 20 times faster than the CPU.

Fig. 4. Running this vision algorithm using an FPGA co-processing architecture yields 20 times more performance than a CPU-only implementation.
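The three-step pipeline (sharpen, threshold, close) can be sketched as follows (a plain NumPy sketch for illustration only; the kernel, the threshold of 128 and the 3x3 structuring element are arbitrary choices, not the benchmarked NI implementation):

```python
import numpy as np

def conv3(img, k):
    """3x3 convolution with zero padding (written for clarity, not speed)."""
    p = np.pad(img, 1)
    out = np.zeros_like(img)
    for dr in range(3):
        for dc in range(3):
            out += k[dr, dc] * p[dr:dr + img.shape[0], dc:dc + img.shape[1]]
    return out

def dilate(b):
    """Binary dilation with a 3x3 structuring element."""
    p = np.pad(b, 1)
    return np.any([p[dr:dr + b.shape[0], dc:dc + b.shape[1]]
                   for dr in range(3) for dc in range(3)], axis=0)

def erode(b):
    """Binary erosion with a 3x3 structuring element."""
    p = np.pad(b, 1)
    return np.all([p[dr:dr + b.shape[0], dc:dc + b.shape[1]]
                   for dr in range(3) for dc in range(3)], axis=0)

def preprocess(img):
    """Sharpen -> threshold -> close (dilate then erode), per the example above."""
    sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
    binary = conv3(img.astype(np.int32), sharpen) > 128
    return erode(dilate(binary))

img = np.zeros((16, 16), dtype=np.int32)
img[4:12, 4:12] = 200      # one bright particle ...
img[7:9, 7:9] = 0          # ... with a dark hole inside it
out = preprocess(img)
print("hole closed:", bool(out[7, 7]))
```

Every step here visits pixels in raster order with only a small neighborhood of context, which is precisely the access pattern that pipelines well on an FPGA.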

C. FPGA Algorithm Characteristics<br />

So why not run every algorithm on the FPGA? Though the<br />

FPGA has benefits for vision processing over CPUs, those<br />

benefits come with trade-offs. For example, consider the raw<br />

clock rates of a CPU versus an FPGA. FPGA clock rates are<br />

typically on the order of 40 MHz to 200 MHz. These rates are<br />

significantly lower than those of a CPU, which can run over 3<br />

245


GHz. Image processing algorithms can include functions that<br />

rely on the entire output of a previous step to be valid before the<br />

next step can begin. These algorithms cannot leverage the<br />

parallelism of an FPGA and thus are more efficient on a CPU.<br />

Additionally, algorithms that require random access to pixel<br />

data throughout the image can be challenging to implement on<br />

FPGAs. While many modern FPGAs include integrated memory<br />

such as Dynamic RAM (DRAM), this is typically in lower<br />

quantities than the RAM available to the CPU. Due to this memory<br />

limitation, it can also be difficult to run algorithms that require a<br />

template to be stored in accessible memory space (e.g. OCR or<br />

pattern matching). Since these algorithms typically cannot raster<br />

scan through an image, they suffer from the difficulties in<br />

parallelization discussed above.<br />

A high-level rule of thumb is that if an algorithm can operate by<br />

raster scanning an image, it is typically well suited for FPGA<br />

implementation. If it cannot, deeper consideration and<br />

potentially complex design are required.<br />

IV. OVERCOMING PROGRAMMING COMPLEXITY<br />

The advantages of an FPGA for image processing depend on<br />

each use case, including the specific algorithms applied, latency<br />

or jitter requirements, I/O synchronization, and power<br />

utilization. Often using an architecture featuring both an FPGA<br />

and a CPU presents the best of both worlds and provides a<br />

competitive advantage in terms of performance, cost, and<br />

reliability. Unfortunately, one of the biggest challenges to<br />

implementing an FPGA-based vision system is overcoming the<br />

programming complexity of FPGAs. Vision algorithm<br />

development is, by its very nature, an iterative process.<br />

Typically, several approaches need to be tried to determine<br />

which works best for a given application.<br />

To maximize productivity, you need to get immediate<br />

feedback and benchmarking information on your algorithms<br />

regardless of the processing platform you are using. Seeing<br />

algorithm results in real time is a huge time-saver when you are<br />

using an iterative, exploratory approach. What is the right<br />

threshold value? How big or small are the particles to reject with<br />

a binary morphology filter? Which image preprocessing<br />

algorithm and algorithm parameters can best clean up an image?<br />

These are all common questions when developing a vision<br />

algorithm, and having the ability to make changes and see the<br />

results quickly is key. However, the traditional approach to<br />

FPGA development can slow down innovation due to the<br />

compilation times required between each design change of the<br />

algorithm. One way to overcome this is to use an algorithm<br />

development tool that helps you develop for both CPUs and<br />

FPGAs from the same environment, while not getting bogged<br />

down by FPGA compilation times. The NI Vision Assistant is<br />

an algorithm engineering tool that simplifies vision system<br />

design by helping you develop algorithms for deployment on<br />

either the CPU or FPGA. You also can use the Vision Assistant<br />

to test the algorithm before compiling and running it on the<br />

target hardware, while easily accessing throughput and resource<br />

utilization information. Extensive development and testing<br />

efforts ensure that the results are identical between the CPU- and<br />
FPGA-executed versions of the algorithm.<br />

Fig. 5. Developing an algorithm in a configuration-based tool for FPGA targets<br />

with integrated benchmarking cuts down on the time spent waiting for code to<br />

compile and accelerates development.<br />

When considering whether a CPU or an FPGA is best for<br />

image processing, the answer is, “It depends.” You need to<br />

understand the goals of your application and use the processing<br />

element that is best suited to that design. However, regardless of<br />

your application, CPU- and FPGA-based architectures and their<br />

many inherent benefits are poised to take machine vision<br />

applications to the next level.<br />



Vision Applications Continuum from High-<br />

Performance and Desktop toward Embedded<br />

Computing made easy by an Efficient OpenCL™<br />

Runtime Environment<br />

Bogdan Ditu, Ciprian Arbone, Fred Peterson<br />

Automotive Compiler Group<br />

NXP Semiconductors<br />

Abstract— As the embedded world seems the perfect place for<br />

exploring very specific hardware acceleration technologies, one of<br />

the biggest challenges that comes along is the complexity of programming these architectures. Defining a programming model for each type of accelerator device is very resource-consuming, considering all the time investments required in tools development, software porting and deployment, not to mention the specific optimizations needed for exploiting all the accelerator features.<br />

Even though some may argue that it is not very suited for<br />

embedded environments, OpenCL might be the perfect solution<br />

for providing a unified programming model for these<br />

acceleration technologies. By definition, OpenCL provides a<br />

standardized and portable approach for using any multi-core<br />

capabilities. The portability characteristic is the one that should<br />

allow algorithm development on high-level targets (Desktop or<br />

even HPC environments) followed by direct deployment on<br />

embedded systems.<br />

In order to achieve this level of portability, the embedded<br />

systems should be powered by an Efficient OpenCL Runtime<br />

Environment (i.e. OpenCL system implementation) which would<br />

support all the embedded targets (including acceleration devices)<br />

and would make the continuum between targets as seamless as<br />

possible.<br />

This paper presents how such an OpenCL Runtime Environment needs to be designed and implemented and how it would help achieve the goal of application portability (with a main focus on vision applications) across the whole spectrum of computing architectures. Using OpenCL, vision<br />

algorithm development should focus only on algorithm details<br />

and should not consider any device architecture characteristics<br />

for the algorithm functionality, while the performance should be<br />

at decent levels for the out-of-the-box / portable application. All<br />

the architecture details should be handled by the OpenCL<br />

runtime environment support for that target in the limits<br />

required by the standard, as efficiently as possible.<br />

Exploration of the very specific device capabilities would be<br />

possible by using custom extensions made available by the<br />

Runtime Environment (also detailed in the paper). The<br />

continuum and portability in both directions could also be<br />

maintained by the Runtime Environment, which can ensure that<br />

the custom extensions are available / emulated on all supported<br />

targets.<br />

The story of using this Efficient OpenCL Runtime<br />

Environment is also backed up by various experiments with<br />
use cases of real-life, out-of-the-box OpenCL applications<br />

(developed using Desktop Environments) implementing Vision<br />

algorithms, which were easily deployed on different types of<br />

embedded multi-core systems (including the usage of some<br />

specific, custom extensions).<br />

Keywords—vision applications continuum; OpenCL<br />

I. INTRODUCTION<br />

Current and future embedded computing systems are becoming more and more complex: besides a main, general-purpose computing unit, they usually include one or more domain-specific computing units. These domain-specific computing units are used for offloading computations from the main cores and even accelerating certain operations (e.g.<br />

graphic accelerators, compression accelerators, cryptographic<br />

accelerators, packet processing accelerators, and many others).<br />

As the embedded world seems the perfect place for<br />

exploring very specific hardware acceleration technologies,<br />

one of the biggest challenges that comes along is the complexity of programming these architectures. Defining a programming model for each type of accelerator device is very resource-consuming, considering all the time investments required in tools development, software porting and deployment, not to mention the specific optimizations needed for exploiting all the accelerator features.<br />

Besides programming each of the accelerating cores, another<br />

major challenge is that of coordinating, synchronizing and<br />

www.embedded-world.eu<br />



partitioning the workload each of the cores needs to do. Even<br />

though each of the accelerators might have its own efficient<br />

programming model (regardless of its complexity and ease of<br />

enablement), it would cover at most the interaction between the<br />

main computing units and one specific accelerator (or class of<br />

accelerators). The specificity and efficiency of each particular<br />

programming model would probably make it unsuitable for<br />
cores other than the ones it was intended for. This is why<br />

coordinating multiple accelerators would usually involve the interaction of multiple programming models, which were most likely not designed with interoperability and collaboration as their strongest points.<br />

OpenCL is a standard that enables a parallel programming<br />

paradigm that can support homogeneous as well as<br />

heterogeneous multi-core and many-core systems. Besides<br />

providing complex means for multi-core parallel programming<br />

(including collaboration, coordination, synchronization and<br />

workload partitioning for all the cores in the system, either<br />

homogeneous or heterogeneous), OpenCL also provides a<br />

unified programming model for all entities involved in the<br />

system, as well as a high-level of application portability. In<br />

other words, OpenCL enables ease-of-use for multi-core<br />

architectures programming (with significant impact especially<br />

around the heterogeneous area) as well as assuring a<br />

standardized and portable approach for using multi-core<br />

capabilities.<br />

Even though some may argue that it is not very suited for<br />

embedded environments, OpenCL might be the perfect solution<br />

for providing a unified programming model for these<br />

acceleration technologies. By definition, OpenCL provides a<br />

standardized and portable approach for using any multi-core<br />

capabilities. The portability characteristic is the one that should<br />

provide a great usability advantage and should allow algorithm<br />

development on high-level targets (Desktop or even HPC<br />

environments) followed by direct deployment on embedded<br />

systems.<br />

In order to achieve this level of portability, the embedded<br />

systems should be powered by an Efficient OpenCL Runtime<br />

Environment (i.e. OpenCL system implementation) which<br />

would support all the embedded targets (including acceleration<br />

devices) and would make the continuum between targets as<br />

seamless as possible.<br />

This paper presents how such an OpenCL Runtime<br />

Environment needs to be designed and implemented and how it<br />

would help achieve the goal of application portability<br />
(with a main focus on vision applications) across the whole<br />
spectrum of computing architectures. Using OpenCL, vision<br />

algorithm development should focus only on algorithm details<br />

and should not consider any device architecture characteristics<br />

for the algorithm functionality, while the performance should<br />

be at decent levels for the out-of-the-box / portable application.<br />

All the architecture details should be handled by the OpenCL<br />

runtime environment support for that target in the limits<br />

required by the standard, as efficiently as possible.<br />

Exploration of the very specific device capabilities would<br />

also be possible by using custom extensions made available by<br />

the Runtime Environment. These extensions are detailed in the<br />

paper, including examples and performance evaluation of using<br />

such extensions. In this context, the continuum and portability<br />

in both directions (from high-level computing toward<br />

embedded computing, and the other way around) could also be<br />

maintained by the Runtime Environment, which can ensure<br />

that the custom extensions are available / emulated on all<br />

supported targets.<br />

The story of using this Efficient OpenCL Runtime<br />

Environment as the enabler of the application continuum is<br />

also backed up by various experiments with use cases of<br />
real-life, out-of-the-box OpenCL applications (developed using<br />

Desktop Environments) implementing Vision algorithms,<br />

which were easily deployed on different types of embedded<br />

multi-core systems (including the usage of some specific,<br />

custom extensions, as mentioned before).<br />

The rest of the paper is structured following the logical line<br />

presented in the introduction so far: the next section gives<br />
a brief overview of the OpenCL programming paradigm<br />

(section II). Once the reader gets an idea about how OpenCL<br />

can be used, we will detail the OpenCL usability story for<br />

embedded systems, with a main focus on how OpenCL<br />

provides applications continuum from desktop development<br />

toward embedded computing systems (section III). Since one<br />

of the key elements in supporting the usability story is having<br />

an efficient OpenCL runtime environment implementation, the<br />

next section will present our proposal for a portable OpenCL<br />

system suitable for multi-core embedded systems (section IV).<br />

Next, we are also proposing custom extensions for exploring<br />

and exploiting very specific device capabilities (section V).<br />

The next two sections are presenting the embedded systems the<br />

proposed OpenCL implementation was targeted for together<br />

with the experimentation and performance evaluation of using<br />

this system efficiently for the usability scenario described so<br />

far (which involves the applications continuum from high-level<br />

development toward embedded systems deployment) (section<br />

VI and VII). Finally, we draw some conclusions on the proposed<br />
solutions and usability scenarios, and outline future work that can<br />
be developed from the presented ideas (section VIII).<br />

II. OPENCL OVERVIEW<br />

As mentioned before, OpenCL is a standard that enables a<br />

parallel programming paradigm that could be used for any type<br />

of multi-core or many-core system (containing either<br />

homogeneous or, most important in the presented context,<br />

heterogeneous cores).<br />

All the details about the OpenCL standard and paradigm<br />

can be found in the OpenCL Specification (provided and<br />
maintained by the Khronos™ Group) [1]. In this overview, we<br />
only highlight the aspects of the standard that are<br />
of great interest in sustaining our usability scenario and can<br />
provide a better programming model for the heterogeneous<br />

multi-core embedded systems described before. Our previous<br />

work ([4], [5]), as well as other overviews on OpenCL ([2],<br />

[3]), help us in this direction.<br />

The main aspect of OpenCL that sustains our usability<br />

scenario (and the application continuum from high-level targets<br />

toward embedded computing) is the portability of the<br />

application. The same application is guaranteed by the standard<br />



to run on any system that supports the OpenCL paradigm.<br />

The portability is assured by developing the application for an<br />

abstract system, without any concern for the physical system<br />

that runs beneath. It is the OpenCL Runtime System (the<br />

OpenCL implementation) that is responsible for the<br />
mapping of the abstract system onto the target architecture.<br />

First of all, what are the means that the OpenCL<br />

programming paradigm provides for handling a multi-core<br />

system? The OpenCL programming paradigm consists of a<br />

standardized API (which provides access to and allows<br />

controlling the OpenCL runtime environment) and a<br />

programming language derived from the C/C++ language<br />

(which allows programming the accelerating cores, by focusing<br />

on the solving mechanisms of the parallel problem).<br />

To understand how these means can be used for handling a<br />

multi-core system, we also need to clarify the logical entities<br />

that are involved in an OpenCL system. These entities are<br />

directly derived from the OpenCL abstract system platform<br />

definition: an OpenCL system consists of one host and one or<br />

more compute devices. To continue the hierarchy provided by<br />

the OpenCL abstract platform, any compute device can consist<br />

of multiple compute units, which in turn can contain multiple<br />

processing elements.<br />

Once we become aware of the logical entities that are part<br />

of the OpenCL system, we will try to explain how the OpenCL<br />

paradigm (the means described above) can be applied for<br />

programming these entities and how they can work together as<br />

a whole for the complete programming model of the multi-core<br />

system. An OpenCL application will have to program both the<br />

host and the compute devices for controlling the complete<br />

system; from this perspective, the application will consist of:<br />

• one main application – running on the host of the<br />

OpenCL system – the host code can be programmed<br />

using C/C++ (but not limited to it) and it interacts with<br />

the OpenCL System through the mentioned<br />

standardized API. This application provides the<br />

context and the control board for solving one or more<br />

parallel problems by defining the scenario on how the<br />

problems will be solved (which entities can collaborate,<br />

work in parallel, synchronize, or partition and balance<br />

their workloads to solve pieces of parallel problem, or<br />

multiple parallel problems).<br />

• one or more OpenCL kernels – running on compute<br />

devices as defined and controlled by the main<br />

application scenario. Each OpenCL kernel is used for<br />

solving a parallel problem by defining the base<br />

algorithm for it. This kernel will be executed in as many<br />

instances as required by the problem iteration space.<br />

Instances of the same kernel can be run on compute<br />

devices as instructed by the application scenario, while<br />

the work partitioning and workload balance can be<br />

smartly handled in an automatic and transparent manner<br />

by the OpenCL Runtime System implementation. The<br />

kernel itself defines only the core mechanism for<br />

solving the parallel problem (one instance), while the<br />

OpenCL Runtime Engine is responsible for<br />
ensuring that the kernel is executed as many times as the<br />

iteration space requires.<br />

Now that we have provided some information about how<br />
OpenCL can be used as a programming paradigm for multi-core<br />
systems, we can take a step further and see what types of<br />
parallel problems OpenCL can be used to solve.<br />

Considering that an OpenCL application (powered by the<br />

OpenCL programming paradigm / model) can be used as<br />

control board for a multi-core system, there are many types of<br />

scenarios that can be imagined for defining such an application:<br />

• the most common type of parallel problem that can be<br />

solved is that of repeatedly applying the same<br />

mechanism to a large iteration space, using large<br />

amounts of data (similar with the single program –<br />

multiple data (SPMD) computing paradigm) (Fig. 1)<br />

o in this context, the problem will be solved by compute devices or subsets of them<br />
o as mentioned before, the solving mechanism is called a kernel; one instance of it is called a work-item and will be executed by the OpenCL abstract entity called a processing element<br />
o work-items are grouped together in work-groups and executed by compute units – this abstract level of work partitioning comes with programming mechanisms that allow some level of concurrency and synchronization<br />
o the complete iteration space of the parallel problem should be covered by the execution of all work-groups<br />

• such parallel problems can be used as the basic building<br />

blocks of an OpenCL application, while the controlling<br />

application can resolve one or more such problems,<br />

using one or more compute devices, partitioning and<br />

balancing the workloads as needed for efficiency of the<br />

problem in a certain configuration of the targeted<br />

system<br />

• the OpenCL standard also provides the means for<br />

solving more problems in parallel, compute devices<br />

being used either in collaboration or concurrently to<br />

cover all intended needs – one can define different sets<br />

of problems or tasks from them, execute them in any<br />

order, define dependencies between them, synchronize<br />

them as needed<br />

Fig. 1. The most common type of parallel problem solved by the OpenCL<br />
paradigm.<br />




• tasks or problems can be defined that are not even<br />
necessarily parallel problems – they can be defined as<br />

sequential tasks based on the capabilities of the compute<br />

devices involved in the system<br />

• on the other end, the system can be configured for a high<br />
degree of parallelism (even higher than the OpenCL<br />
standard defines), different compute units in the<br />

system being able to provide intrinsic parallel<br />

capabilities which can be smartly exploited by the<br />

OpenCL system, either automatically or with some<br />

custom extensions provided at the application level<br />

• also, the workload partitioning and load balancing can<br />

be smartly and dynamically handled by the OpenCL<br />

Runtime System, either implicitly or by exposing some<br />

custom control mechanisms to the main application<br />

As one can see, the OpenCL programming paradigm can be<br />

a very useful and meaningful way of programming a complex<br />

multi-core system involving several types of heterogeneous<br />

cores. Even though some may argue that it is not very suited<br />

for embedded environments, in our opinion OpenCL can be the<br />

perfect solution for providing a unified programming model for<br />

the described acceleration technologies and combining them in<br />

a single system. By definition, OpenCL provides a<br />

standardized and portable approach for using any multi-core<br />

capabilities. The portability characteristic is the one that should<br />

provide a great usability advantage and should allow algorithm<br />

development on high-level targets followed by direct<br />

deployment on embedded systems.<br />

Even if the usability scenario is the best-case scenario, in<br />

most of the cases it provides only the starting conditions for<br />

defining an efficient application that properly uses and<br />
exploits all of the capabilities available in the<br />

targeted multi-core system. Many of the features described<br />

before are useful only in the context of specializing an OpenCL<br />

application for the targeted multi-core system. These features<br />

were presented only to give the reader an informed<br />
idea of the powerful means that OpenCL can provide. One<br />

should be aware that any step further in the specialization<br />

direction gains efficiency, but loses from the portability (and<br />

application continuum) perspective.<br />

III. OPENCL USABILITY STORY FOR EMBEDDED SYSTEMS –<br />

APPLICATIONS CONTINUUM TOWARD EMBEDDED COMPUTING<br />

As described both in the Introduction (Section I) and the<br />

OpenCL Overview (Section II), we can agree that OpenCL can<br />

be a very good candidate for a unified programming model that<br />

can be used for complex multi-core embedded systems. The<br />

focus so far was on the multi-core programming capabilities<br />

the OpenCL paradigm can provide, what types of problems can<br />

it be used for and how suited it is for exploiting the<br />

performance provided by specialized acceleration technologies.<br />

As the purpose of this paper is that of defining a usability<br />

scenario that can allow the application continuum from high-level<br />
targets toward embedded computing systems, this section<br />

will focus on this aspect.<br />

To better understand how this scenario will work, we can<br />

use Fig. 2 to explain the most convenient approach to<br />
deploying multi-core applications on embedded systems,<br />
the advantages of using this approach, and how the<br />
drawbacks of this scenario can be overcome.<br />

Fig. 2. OpenCL Usability Scenario for Embedded Systems – Applications Continuum toward Embedded Computing<br />



The OpenCL usability scenario for embedded systems, the<br />

one that can assure the applications continuum from high-level<br />

development toward embedded computing systems, should<br />

follow these steps:<br />

• the application developer should be mainly focused on<br />

developing domain specific algorithms (vision, linear<br />

algebra, scientific, medical, etc.) using general aspects<br />

of the parallel programming paradigm (defining<br />

problems that could be solved using SPMD principles,<br />

described before)<br />

• there is no doubt that OpenCL can be used as the<br />

programming model / paradigm for developing the<br />

application the domain specific algorithms will be<br />

integrated in<br />

• for convenience, the application would be developed in<br />

a high-level development environment, with main focus<br />

on the algorithm and application correctness, the<br />

environment being suited for running, debugging and<br />

preliminary evaluating performance of the application<br />

• since the application is developed using the OpenCL<br />

programming model, there will be direct deployment of<br />

the application toward embedded computing systems<br />

that are supported by an OpenCL Runtime Environment<br />

(and have an OpenCL system implementation targeted<br />

for those multi-core systems)<br />

• both the application development and the embedded<br />

system deployment would not require any knowledge of<br />

the targeted embedded system<br />

• in this phase the developer should be able to run the<br />

OpenCL application (and assess its correctness) as well<br />

as conducting an initial performance evaluation<br />

• by providing an Efficient OpenCL Runtime<br />

Environment implementation, at this point the out-of-the-box<br />
performance should be at decent levels, with<br />
room for improvement<br />

• based on the performance requirements for the parallel<br />

application in the embedded context, some iterations of<br />

specialization of the OpenCL application can occur,<br />

mainly based on the specificities of the targeted multi-core<br />
system that can be exploited:<br />

o either by application specialization using OpenCL standard mechanisms, but taking into consideration the multi-core system resources (available compute units and their capabilities)<br />
o or by specializing the application using custom extensions specific to certain accelerating target architectures<br />

• the specialization and re-deployment iteration could happen<br />
at the embedded system level, including possible<br />

correctness re-assessment (and maybe embedded<br />

system debugging) OR could complete the continuum<br />

of the application by moving back to the high-level<br />

development. In this case, some extensions might be<br />

needed on the OpenCL system implementation for the<br />

high-level system used for application development<br />

Even though the scenario might look straightforward for<br />

some of the readers and maybe unrealistic for others, it has<br />

many advantages:<br />

• the focus of the algorithm developer on the algorithm<br />

functionality, not on the specific characteristics and<br />

capabilities of the targeted multi-core embedded system<br />

• better running and debugging conditions in the<br />

developing phase<br />

• using out-of-the-box OpenCL applications (or just<br />

algorithms) already available as a starting point for their<br />

embedded deployment and performance specialization<br />

(with minimum investment for the first phases of the<br />

embedded application)<br />

Especially for vision applications, this type of scenario is<br />

very well suited, since most vision algorithm developers come<br />
from high-level development environments (including<br />

desktop, high-performance or graphical units development).<br />

Another advantage in the vision applications context is<br />
the large existing code base of vision algorithms, including<br />

implementations using OpenCL or similar parallel<br />

programming paradigms.<br />

In order to achieve this scenario, there are a few aspects that<br />

one should keep in mind, aspects that would also be detailed in<br />

the following sections of the paper:<br />

• the need of an Efficient OpenCL Runtime Environment<br />

implementation available for the targeted embedded<br />

system (with a main focus on its efficiency, ease of<br />
porting toward new target systems, and ease of adding<br />
specific custom extensions)<br />

• the first direct deployment should offer decent<br />

performance (due to the efficient OpenCL system<br />

implementation), but it would be only the starting point<br />

for the real performance the system can achieve<br />

• application specialization can be done either by using<br />

OpenCL standard mechanisms or by using custom<br />

extensions specific to targeted systems<br />

• the more specialization an application gets, the more it loses<br />

from its portability (and reduces the application<br />

continuum) toward other OpenCL systems<br />

IV. PROPOSED EFFICIENT OPENCL RUNTIME<br />

ENVIRONMENT FOR MULTI-CORE EMBEDDED SYSTEMS<br />

As mentioned several times before, one of the key elements<br />

in the usability scenario (application continuum) is the OpenCL<br />

Runtime Environment. The implementation of the OpenCL<br />

system influences the usability scenario in at least a few aspects:<br />

• it makes it possible to apply the scenario in the first phase<br />

(since it makes the OpenCL available for the targeted<br />

multi-core embedded system)<br />




• it contributes to the decent out-of-the-box performance<br />

of the application deployed on the targeted system<br />

• it allows application specialization by adding custom<br />

extensions for specific capabilities of the targeted<br />

system<br />

To meet these expectations, the central point of our usability scenario is the OpenCL system implementation generated from the Efficient OpenCL Runtime Environment for Multi-Core Embedded Systems, based on our previous work described in [4].<br />

We will not go into too much detail here; the interested reader can consult the mentioned paper. We only reiterate those aspects of the design that are relevant to this discussion, together with the challenges the design had to meet.<br />

Later sections give details on the multi-core embedded systems this OpenCL Runtime Environment was ported to (Section VI), as well as on the experiments we conducted on the vision application continuum (Section VII).<br />

Our previous work focused on the design of an efficient OpenCL Runtime Environment for Multi-Core Embedded Systems, keeping in mind the portability of the runtime system across many types of hardware configurations, including different underlying operating systems and even bare-metal systems. The design addressed the following considerations:<br />
• the challenges of supporting OpenCL on multi-core embedded systems<br />
• OpenCL runtime architecture decisions that ease porting of the OpenCL implementation to new embedded systems<br />
• OpenCL runtime configuration considerations for the best mapping of the abstract OpenCL system onto the physical target system<br />
• the efficiency of the workload partitioning and balancing strategies, including dynamic adjustments based on runtime conditions of the overall system<br />
• easy integration of custom extensions without compromising the portability of the design and implementation<br />

V. PROPOSED OPENCL CUSTOM EXTENSIONS<br />

An important aspect of specializing an OpenCL application toward the targeted multi-core embedded system is the ability to provide custom OpenCL extensions for the specific capabilities the targeted systems may have.<br />
In this section, we detail some of the custom extensions that may be needed in different contexts. The list is not exhaustive and mostly covers the types of extensions we came across on some of our targeted systems:<br />

• OpenCL C language extensions that expose accelerator-specific instructions (intrinsics). This type of extension allows writing C-like code that generates very specific instructions (similar to writing assembly code, but within a high-level language enhanced by OpenCL). It is highly target-specific and requires re-implementation for each new target that uses this feature; most of the changes are confined to the OpenCL C compiler for the targeted system.<br />
• OpenCL extensions that allow calls to native accelerator functions. This type of extension allows calling accelerator-native code from the OpenCL host application. Note that such native code does not meet the characteristics of kernel code and thus cannot be invoked through standard OpenCL functionality. This extension is not specific to a particular target; it is specific to the OpenCL system implementation and should be easy to reuse for any target.<br />

• OpenCL C vector language extensions that enable the use of vector units within an OpenCL compute unit. This type of extension allows exploiting the vector capabilities of the compute unit without relying on target-specific vector configuration details and without adapting the algorithm / kernel to specific vector partitions. It should be straightforward to adapt from target to target, the only target-specific implementation details being in the vector code generation part of the OpenCL C compiler.<br />

• OpenCL extensions for cascading kernels on the device without transferring control back to the host. This type of extension is very useful for preserving data locality and reducing data transfers: when consecutive computations on the same data set each require moving control back and forth between host and device, caching in particular suffers. The extension is not specific to a particular target, and is especially handy on systems with limited device memory and costly access to OpenCL global memory.<br />

• OpenCL extensions that allow explicit partitioning of workloads between different devices, as well as explicit dynamic workload balancing. These extensions give the OpenCL application better control over the executed tasks and enable better workload balancing between devices and compute units based on dynamic runtime conditions. Some of these aspects can be handled implicitly without breaking any OpenCL standard rules, but in certain situations explicit control is required.<br />
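The dynamic balancing idea behind the last extension can be illustrated with a small self-scheduling sketch in plain C. This is not code from the described OpenCL implementation; the chunk size and the 2:1 device speed ratio are illustrative assumptions.<br />

```c
#include <stddef.h>

#define N_ITEMS 100
#define CHUNK     8

/* Shared chunk counter: each device claims the next unprocessed
 * chunk, so a faster device automatically takes a larger share
 * of the iteration space (dynamic balancing). */
static size_t next_item = 0;
static size_t processed[2];   /* items handled per device */
static int    done[N_ITEMS];  /* how often each item ran  */

/* Claim up to CHUNK items; returns the count, 0 when exhausted. */
static size_t grab_chunk(size_t *start) {
    if (next_item >= N_ITEMS) return 0;
    *start = next_item;
    size_t n = N_ITEMS - next_item;
    if (n > CHUNK) n = CHUNK;
    next_item += n;
    return n;
}

static void run_on_device(int dev, size_t start, size_t n) {
    for (size_t i = start; i < start + n; i++) done[i]++;
    processed[dev] += n;
}

/* Device 0 is modeled as twice as fast as device 1: per round it
 * claims two chunks while device 1 claims one. */
static void run_devices(void) {
    size_t start, n, claimed;
    do {
        claimed = 0;
        for (int r = 0; r < 2; r++)
            if ((n = grab_chunk(&start)) > 0) { run_on_device(0, start, n); claimed += n; }
        if ((n = grab_chunk(&start)) > 0)     { run_on_device(1, start, n); claimed += n; }
    } while (claimed > 0);
}
```

A real implementation would protect the shared counter with atomics or locks; the sketch only shows why self-scheduling balances load between devices of unequal speed.<br />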

Most of the mentioned extensions are in various stages of implementation. The OpenCL C vector language extension in particular has already been explored, implemented and published ([5]); some of its details are given below. The specialization enabled by this extension was also evaluated on one of our targeted systems, in the context of the out-of-the-box experience. The results of this experiment are also available in [5] and are briefly presented in a later section.<br />

The motivation behind the OpenCL C vector language extension is that some target systems offer vector capabilities within the device compute units. These compute units can run a work-group faster by executing several work-items in parallel in vector fashion. Although ideally the vector capabilities would be exploited automatically by the OpenCL C compiler (using auto-vectorization), there are situations where the application benefits from being written in a vector manner (the actual vector extensions are applied in the OpenCL kernels). To provide this functionality, we defined a language extension (which we called the OpenCL Vector Language Extension) that helps the user design and implement kernels in a vector manner, without worrying about partitioning data into particular vector sizes (as the native OpenCL vector types would require). Using this extension is roughly equivalent to the user telling the compiler which pieces of code would benefit from vector execution.<br />

The mechanism for extending the OpenCL language is not specific to a particular device architecture and can be used generically, regardless of the vector architecture and vector specifics of the device itself. It is the concern of the OpenCL system (including the OpenCL compiler) to handle the vector language extensions for the supported devices.<br />

This solution can be used for any device with vector execution, regardless of the vector unit size, since it does not require specifying a vector size. Applications and kernels using it remain portable (and keep the continuum) as long as the OpenCL system supports the extension for the targeted device. The custom OpenCL Vector Language Extension consists of generic vector types (similar to the existing OpenCL vector types, but more generic, without a fixed vector size). The OpenCL compiler detects the generic vector types and generates vector code for the device.<br />

There are a few aspects the application should take into account, all of which are detailed in the paper, including an example of transforming an existing application into a vector application using the language extension. We mention some of them here so the reader is aware of the impact this extension can have on the application itself and on its execution in the OpenCL system:<br />

• replacing scalar types with generic vector types in the OpenCL kernels wherever vector execution can be enabled<br />
• the OpenCL system handles (and automatically adjusts) the traversal of the iteration space so that vector execution is taken into account<br />
• vector memory accesses need to be performed explicitly using library functions<br />
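The second point above — the runtime adjusting the iteration-space traversal — can be sketched in plain C. This is an illustrative lowering only, with VW standing in for a vector width the real runtime would pick; it is not actual compiler output.<br />

```c
#include <stddef.h>

#define GLOBAL_SIZE 64
#define VW 4   /* illustrative vector width; GLOBAL_SIZE % VW == 0 assumed */

/* Reference: one work-item per element (scalar kernel body). */
static void kernel_scalar(const int *a, const int *b, int *out) {
    for (size_t gid = 0; gid < GLOBAL_SIZE; gid++)
        out[gid] = a[gid] * b[gid] + 1;
}

/* After the iteration-space adjustment: the runtime steps the
 * global index by VW, and each step executes VW work-items as
 * one vector operation (modeled here by the inner lane loop). */
static void kernel_vector(const int *a, const int *b, int *out) {
    for (size_t gid = 0; gid < GLOBAL_SIZE; gid += VW)
        for (size_t lane = 0; lane < VW; lane++)
            out[gid + lane] = a[gid + lane] * b[gid + lane] + 1;
}
```

On a real target, the inner lane loop would be a single vector instruction emitted by the OpenCL C compiler rather than a scalar loop, but the results are identical.<br />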

Regardless of the extensions an application uses for specialization, one should be aware that the portability of the application decreases with the number of target-specific aspects it takes into account. Some extensions are generic and rely little on target specifics; others are conditioned on the same extension being supported by the OpenCL implementations of the other targets in the spectrum; and some are so target-specific that they cannot be ported at all.<br />

VI. OPENCL SYSTEM IMPLEMENTATION AND THE<br />

TARGETED MULTI-CORE EMBEDDED SYSTEMS<br />

As the main enabler for the application continuum, we use an implementation of the Efficient OpenCL Runtime Environment for Multi-Core Embedded Systems mentioned before, starting from the design and implementation described in [4].<br />
The implementation of the design (as detailed in that paper) started as a proof of concept that the design is more than a theoretical approach, and it raised many questions and challenges. It began with multi-core systems that have large-scale applicability in various areas, such as automotive, infotainment, industrial, medical and communications. Both the types of systems and the types of applications were well suited to the OpenCL programming paradigm.<br />

Our initial implementations ([4]) targeted multi-core systems that were Linux enabled, mostly using the CPU-CPU OpenCL paradigm:<br />
• i.MX6Q family ([6]) (quad-core ARM® CPU) – used for automotive and infotainment<br />
• 32-core Power Architecture® CPU – used for communication applications<br />
• QorIQ® T4240 series ([7]) (24-virtual-core Power Architecture® CPU complemented by several communication and signal processing accelerators) – used for communication applications<br />

Since then, the OpenCL system implementation has passed the proof-of-concept phase and matured toward a product. The number of supported multi-core embedded systems has also increased, as has the diversity of the systems, the targeted applications and the goals:<br />
• both homogeneous and heterogeneous multi-core and many-core systems<br />
• Linux-based as well as bare-board systems<br />
• both the CPU-CPU and CPU-accelerator paradigms<br />
• multi-threading and multi-process execution, as well as multi-processor / multi-OS inter-connectivity<br />
• the goal of using OpenCL was both to evaluate the performance scalability of the system (including physical target, OpenCL system and application) and to demonstrate the ease of use and portability of the OpenCL application (the application continuum)<br />



(including real-life and real-time demonstrations of the<br />

system functionality)<br />

• the OpenCL-focused applications were in the areas of vision (image and video processing), networking and, not least, artificial intelligence (deep learning / neural networks)<br />

A brief list of systems the OpenCL implementation was targeted for, mostly in the embedded computing area, but not limited to it:<br />

• x86/64 – configurations up to 64 compute units (mostly<br />

used for prototyping and validation)<br />

• i.MX6Q family ([6]) – quad-core ARM® CPU<br />

• QorIQ® T4240 series ([7]) – 24 Power Architecture®<br />

virtual cores<br />

• Layerscape2 communication processor – LS2080A/LS2084A ([8]) – 8 ARM® Cortex® A57 / A72 cores<br />

• S32V234 automotive vision processor ([9]) – 4 ARM®<br />

Cortex® A53 cores<br />

• NXP BlueBox autonomous driving platform ([10]) –<br />

S32V234 + LS2084A (4 + 8 ARM® cores)<br />

• S32V234 automotive vision processor ([9]) using the<br />

APEX-2 vision accelerator<br />

We mention all of these achievements to give the reader an impression of the maturity of the proposed OpenCL Runtime Environment, of its portability (by design), and of the support it provides for the application continuum from high-level application development toward embedded computing systems.<br />

VII. EXPERIMENTATIONS AND PERFORMANCE EVALUATION<br />

OF VISION APPLICATIONS CONTINUUM EXPERIENCE<br />

As mentioned in the previous section, besides the increased number of multi-core embedded systems we targeted, the goal of this portfolio is experimentation and performance evaluation, more precisely:<br />
• performance evaluation of system scalability – including the hardware architecture, the OpenCL system and the application viewed as a whole<br />
• demonstration of the ease of use and portability of the OpenCL application (the application continuum) – including real-life and real-time demonstrations of the system functionality<br />
• demonstration of how the proposed custom extensions work in real life, and of the performance impact that application specialization can have (with minimum effort)<br />

A. System Scalability<br />

Most of the mentioned target systems were used to evaluate the performance scalability, especially the embedded systems with a larger number of cores (8+). The applications used covered all the mentioned areas:<br />

• vision applications (including image and video<br />

processing)<br />

• networking applications<br />

• neural networks<br />

We will not go into too much detail, as the purpose of this paper is a different one, but it is worth mentioning that the overall system performance scaled properly with the number of cores until it reached the scalability limit of the application. Any performance increase beyond this limit usually requires redesigning the application.<br />
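This scalability limit is Amdahl's law in action: if a fraction p of the application runs in parallel, the speedup on n cores is bounded by 1/(1 − p) no matter how many cores are added. A small sketch (the value of p used in the test is illustrative, not measured data from this paper):<br />

```c
/* Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n).  As n grows,
 * the p/n term vanishes and the speedup plateaus at 1 / (1 - p),
 * which is the application's scalability limit. */
static double speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

Redesigning the application, as mentioned above, corresponds to raising p and thereby lifting the plateau.<br />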

In the context of this paper, it is important to mention that the applications used were out-of-the-box OpenCL applications, with a special focus on vision applications.<br />

For the sake of the argument, we show the performance evaluation of an out-of-the-box OpenCL vision application (a complex combination of image filters from [12]) on a system using up to 64 cores (Fig. 3). The OpenCL application's performance was compared with a scalar implementation of the same application (no parallelism at all) and with an implementation using a programming model specific to that system. As the figure shows, the out-of-the-box OpenCL performance is consistently comparable with that of the specialized application implemented using the system-specific programming model, which demonstrates the efficiency of the OpenCL system for out-of-the-box, portable applications.<br />

B. Usability Story – Vision Application Continuum<br />

The next step is demonstrating the usability story or, as we call it, the vision application continuum, from high-level development environments (an OpenCL application developed on a desktop) toward an embedded computing system.<br />

For this demonstration, we recount the actual story of the real-life, real-time demo application we built for one of the targeted multi-core embedded systems mentioned above:<br />

• we started by porting the OpenCL Runtime Environment to a new multi-core embedded system, which at that point prototyped the combination of two existing systems (the S32V234 automotive vision processor [9] and the Layerscape2 communication processor [8]), pairing the automotive rigor of the S32V234 multi-core with the computational power of the Layerscape2 multi-core system<br />
• this system became an important success in the automotive area and is now called BlueBox (the NXP BlueBox autonomous driving platform [10])<br />



Fig. 3. Performance Evaluation of a Vision Application using up to 64 cores<br />

• we ported the OpenCL system to this platform in a multi-OS configuration (each multi-core system running its separate version of Linux), with the OpenCL host on one of the S32V234's ARM® cores and the 8 Layerscape2 ARM® cores as compute units<br />
• to demonstrate the performance and functionality of this system we needed a very resource-hungry application, to make sure the computational power brought by the Layerscape2 was seriously challenged<br />
• we found a very interesting application: an OpenCL implementation of the dense optical flow algorithm (based on the Lucas-Kanade method – [11], [12])<br />
• we took the out-of-the-box OpenCL application and adapted it to our input and graphical feed in a desktop environment, using a desktop-dedicated OpenCL implementation<br />
• the next step was to move the application to our OpenCL system targeted at the desktop environment (x86/64 Linux) – the transition went smoothly<br />
• to prove the vision application continuum, we moved forward and deployed the application directly to the BlueBox multi-core embedded system – as before, the deployment happened seamlessly<br />
• we obtained decent performance out of this port and were satisfied with it, mostly because the compute units involved are fairly general-purpose and not much specialization could be applied<br />
• we also assembled a real-time demonstration using the optical flow application with a live feed from a camera – the main purpose of the demo was to show the BlueBox's capabilities for Advanced Driver Assistance Systems usage of the vision application<br />

As a proof of concept, we evaluated the performance scalability by using different numbers of Layerscape2 ARM® cores as OpenCL compute units, as shown in Fig. 4.<br />

A similar real story unfolded when targeting the S32V234 multi-core embedded system with the APEX-2 vision accelerator. The main difference was the algorithm used: in this case an edge detection vision application, with the sole purpose of demonstrating the vision application continuum. We started the same way, working on an out-of-the-box OpenCL application with no specialization of the algorithm (running only on the scalar unit of the APEX-2 core).<br />
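For reference, the core of such an edge detection application is essentially a 3×3 Sobel convolution per pixel. A minimal plain-C sketch of that per-pixel operation (not the actual application code, which is OpenCL):<br />

```c
/* 3x3 Sobel operator at pixel (x, y); img is a w-wide row-major
 * image.  Valid only for interior pixels (1 <= x < w-1, same for y).
 * |gx| + |gy| (or the gradient magnitude) gives the edge strength. */
static void sobel_at(const int *img, int w, int x, int y,
                     int *gx, int *gy) {
    const int kx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    const int ky[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    *gx = *gy = 0;
    for (int j = -1; j <= 1; j++)
        for (int i = -1; i <= 1; i++) {
            int p = img[(y + j) * w + (x + i)];
            *gx += kx[j + 1][i + 1] * p;
            *gy += ky[j + 1][i + 1] * p;
        }
}
```

Running this loop over every pixel on a single scalar unit is exactly the kind of workload that the vector specialization discussed next is meant to accelerate.<br />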

Fig. 4. OpenCL Optical Flow application scaling with different numbers of cores on BlueBox<br />



C. Vector Language Extension – Vision Application<br />

Specialization<br />

Although it was already mentioned, and the reader can check [5] for detailed information, for the sake of performance evaluation it is worth summarizing in this section the experiment we conducted on specializing the vision application using the custom OpenCL vector language extensions we proposed.<br />
The story of this experiment (together with the motivation for defining and implementing such an extension) goes like this:<br />

• we started with another out-of-the-box OpenCL vision application implementing a complex image filtering algorithm (derived from [12]) and deployed it directly to the S32V234 ARM® + APEX-2 multi-core embedded system<br />
• since the out-of-the-box application ran on the APEX-2 device using only its scalar unit, without even considering its internal vector capabilities (see [9] for details on the APEX vector capabilities), we came up with the idea of the vector language extension and implemented it in our OpenCL Runtime Environment<br />
• the next step was specializing the vision application toward vector execution of the kernel code<br />

The effect of the specialization can be clearly observed in Fig. 5 (extracted from our previous work [5]). The details of the configurations in which the performance evaluation was conducted can also be found in the original paper. The variation is in the number of vector units the APEX-2 core is configured to use, as well as in the memory areas considered for vector specialization.<br />
The experiments we conducted, and the performance evaluated in each case, support the claim that the usability story (based on the application continuum) can be achieved with decent results for the out-of-the-box application, proving the system's scalability, while specialization of the application brings a real performance boost.<br />

Fig. 5. Performance Evaluation of the vision application specialization using custom OpenCL vector language extensions<br />



VIII. CONCLUSION AND FUTURE WORK<br />

One of the goals of this paper was to provide a general solution, a viable alternative among the programming models for multi-core embedded computing systems. Once this solution was identified as the OpenCL parallel programming paradigm, we had to come up with a usability story that enables using the OpenCL system on embedded systems. The strongest argument for the OpenCL programming paradigm, together with an efficient OpenCL Runtime Environment, is the high degree of application portability it provides. The OpenCL usability scenario sustains the application continuum from high-level / high-performance computing toward embedded computing systems, covering the whole computing spectrum. The greatest advantage of the OpenCL programming paradigm and OpenCL applications is that algorithms can be developed in high-level development environments, with the main focus on algorithm functionality and no concern for the targeted system, followed by direct deployment on embedded systems.<br />

We brought many strong arguments for why OpenCL can be successfully used as a viable programming paradigm for multi-core embedded systems. We also defined a usability story that eases the deployment of applications on embedded systems (with minimum knowledge of the target architecture). The success of the usability story arguably lies in the performance of the out-of-the-box application deployed in the embedded environment. The goal of our scenario is decent performance in the out-of-the-box experience, providing a good starting point for the application in the embedded environment, followed by specialization of the OpenCL application. The specialization can be explored in two main directions: either by using standard OpenCL and adapting the application to better exploit the embedded system's capabilities, or by using custom OpenCL extensions very specific to the targeted system.<br />

The experiments and performance evaluations we conducted back up the usability story of the application continuum using the Efficient OpenCL Runtime Environment implementation: decent out-of-the-box performance, followed by specialization of the application toward the targeted embedded system for a further performance boost.<br />

As future work, we plan to make more OpenCL custom extensions available, especially those that are not target-specific, to provide more specialization opportunities for out-of-the-box OpenCL applications. A major focus that can be explored on the already targeted embedded systems is kernel cascading, as well as access to accelerator-specific instructions. Another interesting specialization area would be explicit workload partitioning and balancing.<br />

Another development direction would be exploring more vision applications with immediate impact in the ADAS area, and applying the complete usability scenario to more and more OpenCL applications, targeting new multi-core embedded systems.<br />

REFERENCES<br />

[1] Khronos Group. The open standard for parallel programming of<br />

heterogeneous systems, http://www.khronos.org/opencl/<br />

[2] J. Tompson and K. Schlachter, “An Introduction to the OpenCL<br />

Programming Model”, Khronos Group, 2008<br />

[3] B. Dipert, “OpenCL Eases Development of Computer Vision Software<br />

for Heterogeneous Processors”, Embedded Vision Alliance, 2015<br />

[4] B. Ditu, I. Romaniuc, C. Arbone, M. Oprea, D. Vasile, “Design and<br />

Portability of an Efficient OpenCL Runtime Environment for Multi-<br />

Core Embedded Systems”, Embedded World Conference, 2015<br />

[5] B. Ditu, F. Peterson, C. Arbone, "Experimentation of Vision Algorithm Performance using Custom OpenCL™ Vector Language Extensions for a Graphical Accelerator with Vector Architecture", 2017 IEEE 13th International Conference on Intelligent Computer Communication and Processing, 2017<br />

[6] NXP i.MX 6Quad Processors - Quad Core, High Performance,<br />

Advanced 3D Graphics, HD Video, Advanced Multimedia, ARM®<br />

Cortex®-A9 Core, https://www.nxp.com/products/processors-andmicrocontrollers/applications-processors/i.mx-applicationsprocessors/i.mx-6-processors/i.mx-6quad-processors-high-performance-<br />

3d-graphics-hd-video-arm-cortex-a9-core:i.MX6Q<br />

[7] NXP QorIQ® T4240 Multicore Communications Processors,<br />

https://www.nxp.com/products/processors-andmicrocontrollers/applications-processors/qoriq-platforms/t-series/qoriqt4240-t4160-t4080-multicore-communications-processors:T4240<br />

[8] NXP QorIQ® Layerscape 2084A and 2044A Multicore<br />

Communications Processors, https://www.nxp.com/products/processorsand-microcontrollers/arm-based-processors-and-mcus/qoriq-layerscapearm-processors/qoriq-layerscape-2084a-and-2044a-multicorecommunications-processors:LS2084A<br />

[9] NXP S32V234: Vision Processor for Front and Surround View Camera,<br />

Machine Learning and Sensor Fusion Applications,<br />

https://www.nxp.com/products/processors-and-microcontrollers/armbased-processors-and-mcus/s32-automotive-platform/vision-processorfor-front-and-surround-view-camera-machine-learning-and-sensorfusion-applications:S32V234<br />

[10] NXP BlueBox: Autonomous Driving Development Platform,<br />

https://www.nxp.com/products/processors-and-microcontrollers/armbased-processors-and-mcus/s32-automotive-platform/nxp-blueboxautonomous-driving-development-platform:BLBX<br />

[11] OpenCL Imaging on The GPU: Optical Flow,<br />

https://www.khronos.org/assets/uploads/developers/library/2011_GDC_<br />

OpenCL/NVIDIA-OpenCL-Optical-Flow_GDC-Mar11.pdf<br />

[12] Open Source Computer Vision, http://opencv.org/<br />



A modern approach to developing software in the<br />

growing embedded vision sector<br />

Christoph Wagner, Julian Beitzel<br />

Christoph Wagner<br />

Product Manager Embedded Vision<br />

MVTec Software GmbH<br />

Munich, Germany<br />

christoph.wagner@mvtec.com<br />

Julian Beitzel<br />

Application Engineer<br />

MVTec Software GmbH<br />

Munich, Germany<br />

julian.beitzel@mvtec.com<br />

I. INTRODUCTION<br />

The current trend of increasing performance in the embedded sector continues and is entering areas that were unthinkable five years ago. One driving force behind this is the rapid development of the "mobile sector", the resulting widespread success of smartphones, and the associated unit volumes. This area mainly comprises mobile phones, but also smart watches and tablets. Meanwhile, even mobile computing is seeing its first releases of Arm-based laptop computers that are no longer based on the well-tried x86 Intel architecture.<br />

The Arm processor architecture originated in 1983 at the British computer company Acorn and was first used in the predecessors of today's desktop PCs [1]. At the latest with the introduction of the 64-bit architecture (ARMv8 series) in 2013, and its use in iOS as well as Android devices, Arm started to strengthen its position in the embedded market. One major reason for this increasing market share is that the processors fully exploit their advantages over the x86 architecture in terms of power consumption and low power dissipation [2].<br />

Independently of the processor used, the computing power<br />

of modern mobile devices is up to sixty times higher than that<br />

of common desktop PCs of 2004. For an indication of how<br />

this power dynamic has shifted, see the comparison between<br />

the computing speed of the Pentium IV desktop processor and<br />

the current iPhone 8/X mobile phone in Figure 1.<br />

The fastest embedded platforms (e.g. iPhone 8 in Figure 1)<br />

have reached about a third of the computing power of a<br />

modern standard desktop CPU [3]. Especially in the last few<br />

years, the market has been experiencing a veritable boost in<br />

the integrated computing power of embedded devices.<br />

Figure 1: Computing performance for Single-Precision General Matrix Multiply (SGEMM) in GFLOPS. Image courtesy of MVTec Software GmbH, data by Geekbench.com [4]<br />
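SGEMM, the benchmark behind Figure 1, computes C = αAB + βC in single precision. A naive scalar reference in C, for illustration only — the benchmark itself runs tuned, vectorized implementations:<br />

```c
#include <stddef.h>

/* Naive single-precision GEMM: C = alpha * A * B + beta * C,
 * with A (m x k), B (k x n), C (m x n), all row-major.  The
 * benchmark measures how fast platforms execute this operation. */
static void sgemm(size_t m, size_t n, size_t k, float alpha,
                  const float *A, const float *B,
                  float beta, float *C) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```

The 2·m·n·k floating-point operations of this triple loop, divided by the runtime, give the GFLOPS figures plotted in Figure 1.<br />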

Due to increased performance and the associated<br />

application possibilities, embedded vision has become a<br />

guiding theme of Industry 4.0. The aim of this initiative is to<br />

network devices with each other, and thus enable the<br />

development of even more flexible production systems [5].<br />

In the consumer market, for example, the networking of a<br />

refrigerator with an online shopping account makes it easier to<br />

order food. A similar example from the industrial sector would<br />

be embedded vision systems communicating and exchanging<br />

results with each other easily and flexibly via e.g. OPC UA.<br />

This means that, especially with the help of embedded<br />

vision systems, modern production plants can be made even<br />

more flexible and even small batch sizes can be produced fully<br />

automatically. Identification technologies such as code reading,<br />

matching, OCR, etc. play a decisive role in this respect. All<br />

this with a focus on being able to deliver as quickly as<br />



possible, without increasing inventories and tying up massive<br />

amounts of capital.<br />

Due to the many advantages of embedded vision over<br />

conventional PC-based machine vision systems, such as<br />

simplification, miniaturization, and the associated savings<br />

potential, embedded vision is increasingly being used to teach<br />

production processes how to “see”, making it the “eye of<br />

production”.<br />

II. MARKET OVERVIEW<br />

The embedded market is a growth market that still has<br />

massive potential left to tap within the next few years, not<br />

least due to technical innovations. This is especially<br />

true in the automotive segment with autonomous driving, in<br />

logistics with drones as a delivery service provider, or in the<br />

area of collaborative robotics where embedded technologies<br />

help to make cooperation between robots and humans possible<br />

and safe.<br />

Even today, there are already enough successful examples<br />

relying on embedded vision. Whether in the form of simple<br />

vision sensors, which are often equipped with "only" one<br />

function, but in return are highly optimized while keeping the<br />

smallest possible form factor, or in the form of a freely<br />

programmable smart camera that offers full flexibility, but is<br />

much more compact than a PC-based vision system. Another<br />

example is the powerful "single-board computer" (SBC),<br />

which offers flexibility comparable to PC systems, with cost<br />

reduction clearly in the foreground.<br />

They are already among us - almost every well-known<br />

sensor manufacturer has embedded systems in its portfolio.<br />

There are very good reasons for this, among them the market<br />

forecasts, the undeniable advantages and the predicted market<br />

potential (see Figure 2).<br />

III. SIGNIFICANCE OF PROFESSIONAL VISION SOFTWARE<br />

Due to the ever-increasing proportion of embedded vision<br />

systems and the associated process relevance, the demands on<br />

the reliability of these systems are also increasing. As failures<br />

or false decisions lead to production stops, defective work, and<br />

subsequently substantial costs, inline vision-based control<br />

gains significance.<br />

In order to fulfil this need, the software must be very<br />

robust, i.e. it must be able to recognize the characteristics<br />

reliably, even under difficult conditions such as<br />

contamination, vibrations or interfering light and<br />

simultaneously manage the limited resources of embedded<br />

devices as sparingly as possible. At the same time, in order to<br />

avoid burdening cycle times unnecessarily, the maximum<br />

evaluation speed should always be the goal.<br />

This requires efficient programming and optimization of<br />

the relevant algorithms, which exploit hardware and methods<br />

such as GPUs, instruction set extensions like Neon (Arm), or<br />

automatic parallelization across CPU cores.<br />
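As a toy illustration of that last point (a generic sketch, not tied to any vision library; `threshold_row` and `threshold_parallel` are names invented for this example), a per-row operation can be fanned out over a pool of workers:<br />

```python
# Toy sketch of parallelizing a per-row image operation across workers.
from concurrent.futures import ThreadPoolExecutor

def threshold_row(row, t=128):
    # Mark pixels brighter than t as foreground (255), others as 0.
    return [255 if px > t else 0 for px in row]

def threshold_parallel(image, t=128, workers=4):
    # map() preserves row order, so the result assembles correctly.
    # (Pure-Python work is GIL-bound; with NumPy/C kernels or process
    # pools the rows would genuinely run in parallel.)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: threshold_row(r, t), image))

print(threshold_parallel([[10, 200, 130], [255, 0, 129]]))
# [[0, 255, 255], [255, 0, 255]]
```

Vision libraries typically hide this distribution of work behind a single optimized operator call; the sketch only shows the structure.<br />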

IV. EFFICIENT DEVELOPMENT OF EMBEDDED VISION<br />

APPLICATIONS<br />

A. Software development IDEs vs. vision applications<br />

For the development of programs on embedded platforms,<br />

the use of a cross-compiler, as shown in Figure 3, is a<br />

common approach. This compensates for the usually limited<br />

resources of the target compared to a standard PC platform.<br />

Figure 3: Classic application development for embedded platforms<br />

(target) with a development platform (host).<br />

Figure 2: European embedded system market size, by application, 2012-2023<br />

(USD Billion) [6]<br />

The PC is used to develop the application within a suitable<br />

integrated development environment (IDE). Using a<br />

configured toolchain, an executable for the embedded<br />

platform (target) can be built on the PC (host). This program<br />

can now be transferred to the target in order to run.<br />

Image processing functionality is usually provided by<br />

third-party libraries for programming languages, like OpenCV for<br />

C++, C, Java, Python, or scikit-image for Python. The<br />

handling is no different from other libraries – they are<br />

included in the project and can be used in the source code. The<br />

advantage of this approach is that developers use a familiar<br />

workflow including the IDE and programming language for<br />

the creation of vision applications.<br />



Figure 4: Particle image in byte format and corresponding histogram<br />

Considering the kind of graphical data that should be<br />

processed in such an application, most IDEs are not well<br />

equipped for this purpose. Vision-based algorithms may become<br />

complex very quickly and frequently require prototyping. Usually,<br />

there is uncertainty about the setup, and the appearance of<br />

objects in the images affects the chosen approach.<br />

For example, the image shown in Figure 4 demonstrates a<br />

simple vision task. The particle image should be segmented into<br />

fore- and background. Despite the simplicity of the<br />

assignment, many different approaches are possible. Some<br />

exemplary approaches:<br />

1. Fixed, manually set intensity threshold separating the<br />

fore- and background. Advantages: Very fast, easy.<br />

Disadvantages: Does not deal well with<br />

inhomogeneous background, unable to handle global<br />

exposure changes.<br />

2. Dynamic global threshold based on automatic<br />

histogram analysis, e.g. check for a global minimum<br />

in the histogram function. Advantages: Fast, handles<br />

global exposure changes. Disadvantages: Unable to<br />

handle inhomogeneous background.<br />

3. Variational threshold [7] using a sliding-window approach<br />

to determine local intensity differences.<br />

Advantages: Deals with inhomogeneous background,<br />

handles global exposure changes. Disadvantages:<br />

Slow compared to the fixed thresholds.<br />
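The three approaches can be sketched in plain Python (illustrative stand-ins, not the paper's implementations: Otsu's method is used here as one concrete form of automatic histogram analysis, and the local variant follows Niblack's mean-plus-k-times-deviation idea [7]):<br />

```python
# Illustrative stand-ins for the three approaches, operating on a
# list-of-lists gray-value image (0..255); the names are ours.

def fixed_threshold(img, t):
    """1. Fixed, manually set threshold: very fast and easy, but blind to
    inhomogeneous background and global exposure changes."""
    return [[1 if px > t else 0 for px in row] for row in img]

def otsu_threshold(img):
    """2. Dynamic global threshold from histogram analysis; Otsu's method
    is one concrete choice (the text mentions searching for a global
    histogram minimum instead)."""
    hist, n = [0] * 256, 0
    for row in img:
        for px in row:
            hist[px] += 1
            n += 1
    total = sum(i * c for i, c in enumerate(hist))
    best_t, best_var, w0, sum0 = 0, -1.0, 0, 0.0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0 or w0 == n:
            continue
        m0, m1 = sum0 / w0, (total - sum0) / (n - w0)
        between_var = w0 * (n - w0) * (m0 - m1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t

def niblack_threshold(img, half=1, k=-0.2):
    """3. Local (Niblack-style [7]) threshold t = mean + k*stddev over a
    (2*half+1)^2 window: copes with inhomogeneous background and
    exposure changes, but is slower than the global variants."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - half), min(h, y + half + 1))
                    for xx in range(max(0, x - half), min(w, x + half + 1))]
            mean = sum(vals) / len(vals)
            std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
            row.append(1 if img[y][x] > mean + k * std else 0)
        out.append(row)
    return out

print(fixed_threshold([[10, 200]], 128))  # [[0, 1]]
```

Production implementations would of course use optimized operators on array data rather than pure Python loops; the sketch only contrasts the three strategies.<br />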

Furthermore, the requirements of the image processing<br />

task may shift; if new parts should be inspected, the lighting<br />

conditions or other factors of the setup could change.<br />

Therefore, the software should be designed in a way that the<br />

acquired images and results can be easily inspected. Additionally, if<br />

the image processing part should be developed for multiple<br />

programming languages, a re-implementation is usually<br />

required.<br />

Lastly, debugging related to image processing on a<br />

platform for which the code is cross-compiled may be difficult<br />

using console-based debuggers like GDB.<br />

B. Speeding up prototyping with IDEs for image processing<br />

To avoid long development and maintenance cycles,<br />

IDEs with support for image processing can be used.<br />

Following the considerations of the previous section, these<br />

basic features should be supported:<br />

- Immediate and easy-to-use display of images and<br />

variables containing graphical content (lines, regions,<br />

geometrical primitives).<br />

- Interaction with the displayed content, such as<br />

zooming or moving.<br />

- Display of numerical arrays as plots.<br />

- Tools/assistants covering typical tasks, such as<br />

showing the histogram, performing interactive<br />

measurements in the image, or displaying gray values<br />

at a certain spot, would also be desirable.<br />

An IDE containing these features is shown in Figure 5 as<br />

an example.<br />

In order to speed up the development even more, it would<br />

be desirable to use a simple, script-based language which<br />

allows the user to reduce the lines of code needed for<br />

performing simple actions. For this reason, Matlab and Python<br />

are well known in the scientific community as being<br />

suitable for fast evaluations [8] [9] [10] [11]. To get the best<br />

performance out of the system, however, compiled languages are<br />

preferred or even required on an embedded platform.<br />

IDEs like MVTec Software GmbH’s HDevelop, which can<br />

be seen in Figure 5, enable the export of the developed<br />

script code to various programming languages. This enables a<br />

smooth transition from the prototyping stage to the production<br />

environment.<br />

On the other hand, changes in the code have to be made on<br />

the host system, exported, included in the framework,<br />

cross-compiled, and transferred again to the target platform in order<br />

to check the fix. This workflow leads to a longer development<br />

time. This issue is addressed in the following section.<br />

Figure 5: Exemplary vision IDE with the output image for debugging<br />

(left top), additional tools (center and right top), variable view (left<br />

bottom), programming window (right bottom). The screenshot shows<br />

HDevelop of MVTec Software GmbH.<br />



C. Easier development and maintenance with interpreted<br />

image processing functionality including remote<br />

debugging<br />

To address the issue of maintainability and debugging, a<br />

different approach is needed. Using classic approaches as<br />

described above, it is cumbersome to visualize graphical<br />

variables at runtime.<br />

As stated in the previous part, there are various<br />

requirements for the IDE in order to efficiently develop image<br />

processing algorithms. It would be ideal if this IDE could also<br />

be used for debugging on the platform.<br />

A concept for doing this is shown in Figure 6. The target<br />

platform starts the vision script using an interpreter within the<br />

application. In the code, a debugging server has to be<br />

configured and explicitly activated – it is worth mentioning<br />

that this is a potential security threat. A host may utilize the<br />

vision IDE to log on to the running application of the target<br />

platform, set breakpoints similar to a local application, and<br />

inspect intermediate results directly.<br />

This concept allows maintenance even for a remote target<br />

system which is available via Internet connection, e.g. at a<br />

customer site.<br />

Figure 6: Remote debugging using the vision IDE of the host to debug an<br />

interpreted vision script inside the application of the target platform.<br />

Since the vision code is interpreted, the update process is<br />

rather straightforward: The script can be fixed on the host<br />

platform before copying it back to the target platform. In this<br />

concept, there is no need for cross-compiling the complete<br />

application, a restart is sufficient as long as the signature of<br />

the vision functions does not change, see Figure 7.<br />
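The update concept can be illustrated with a generic sketch (hypothetical names; real vision-script interpreters such as IDE runtimes work analogously but differ in detail): the host application only ever calls a `process(image)` entry point, so replacing the script file is enough as long as that signature is stable:<br />

```python
# Generic sketch: the compiled host application stays untouched; only
# the interpreted vision script file is replaced to deploy a fix.
import os
import tempfile

def run_vision_script(path, image):
    # Load and interpret the current script, then call its entry point.
    # The process(image) signature is the stable contract.
    scope = {}
    with open(path) as f:
        exec(f.read(), scope)
    return scope["process"](image)

# Deploy an initial (say, buggy) script ...
script = tempfile.NamedTemporaryFile("w", suffix=".py", delete=False)
script.write("def process(image):\n    return [px * 2 for px in image]\n")
script.close()
print(run_vision_script(script.name, [1, 2]))  # [2, 4]

# ... then "fix" it by overwriting only the script file - no
# cross-compilation, no rebuild of the host application.
with open(script.name, "w") as f:
    f.write("def process(image):\n    return [px + 1 for px in image]\n")
print(run_vision_script(script.name, [1, 2]))  # [2, 3]
os.remove(script.name)
```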

V. THE ECOSYSTEM OF MODERN EMBEDDED VISION<br />

SOFTWARE<br />

Due to the current development of the market, the<br />

requirements and expectations for professional embedded<br />

vision software are growing. Subsequently, competitive<br />

pressure in this segment is increasing. As a result, it is<br />

becoming increasingly important for hardware vendors to be<br />

able to implement the requirements faster and more<br />

efficiently.<br />

It is no longer sufficient to use a high-performance and<br />

robust software solution, because this is only one part of the<br />

development process. It is becoming increasingly important to<br />

be able to access an extensive development package that<br />

includes additional services around the software. Another<br />

important element is a convenient development environment<br />

especially developed for use in the vision sector. This can save<br />

a lot of time when creating the "Workbench" and therefore it<br />

is crucial for efficient and fast development of vision<br />

applications.<br />

Another essential aspect is the legal situation. When a<br />

(embedded) vision application is created and used on a<br />

commercial basis, it is essential for the creator to be protected<br />

legally. This means that it must be ensured that there is no<br />

patent infringement in the use of the vision algorithm, as this<br />

can lead to serious damage claims.<br />

These are typical areas that are difficult to cover with open<br />

source software and thus speak for the use of professional<br />

commercial image processing software.<br />

VI. FUTURE OUTLOOK<br />

Extrapolating the aforementioned developments, a shift<br />

between embedded- and PC-based vision becomes apparent<br />

quickly. Embedded vision is unlikely to replace<br />

desktop-based applications, but it will continue to gain significance,<br />

market share, and usage in industrial scenarios. If the last<br />

decades’ developments in the consumer electronics market are<br />

any indication, the end of the line for the embedded computing<br />

market is not yet in sight.<br />

VII. REFERENCES<br />

[1] S. B. Furber, "ARM System Architecture", Addison-Wesley, 1996, p. 36.<br />

[2] M. Hachman, "ARM Cores Climb Into 3G Territory," 14 10 2002.<br />

[Online]. Available: http://www.extremetech.com/extreme/52180-arm-cores-climb-into-3g-territory.<br />

[Accessed 18 01 2018].<br />

[3] Geekbench, "Geekbench 4.1.3 Tryout for Mac OS X x86 (64-bit),"<br />

Geekbench, 2018. [Online]. Available:<br />

https://browser.geekbench.com/v4/cpu/4687004. [Accessed 13 1 2018].<br />

[4] Geekbench, "Geekbench 4 CPU Search," Geekbench, 2018. [Online].<br />

Available: https://browser.geekbench.com/v4/cpu/search. [Accessed 18 1<br />

2018].<br />

[5] L. Goasduff, "What Is Industrie 4.0 and What Should CIOs Do About<br />

It?," Gartner, 18 5 2015. [Online]. Available:<br />

https://www.gartner.com/newsroom/id/3054921. [Accessed 12 1 2017].<br />

Figure 7: Fixing an erroneous script on the target platform without<br />

recompiling.<br />



[6] Global Market Insights, "Embedded System Market Size by Application<br />

[...]," Global Market Insights, 2016. [Online]. Available:<br />

https://www.gminsights.com/industry-analysis/embedded-system-market.<br />

[Accessed 09 1 2018].<br />

[7] W. Niblack, in "An Introduction to Digital Image Processing",<br />

Englewood Cliffs, N.J., Prentice Hall, 1986, pp. 115-116.<br />

[8] K. J. Millman and M. Aivazis, "Python for Scientists and Engineers,"<br />

Computing in Science & Engineering, vol. 13, no. 2, pp. 9-12, 2011.<br />

[9] Z. Boyan, "Application of MatLab in Science and Engineering<br />

Calculation," 01 01 2001. [Online]. Available:<br />

http://en.cnki.com.cn/Article_en/CJFDTOTAL-DLXZ200101019.htm.<br />

[Accessed 12 02 2018].<br />

[10] F. Perez, B. E. Granger and J. D. Hunter, "Python: An Ecosystem for<br />

Scientific Computing," Computing in Science & Engineering, vol. 13,<br />

no. 2, pp. 13-21, 2011.<br />

[11] T. E. Oliphant, "Python for Scientific Computing," Computing in<br />

Science & Engineering, pp. 10-20, 2007.<br />



Strategies for facilitating reuse of code in embedded<br />

vision applications<br />

Frank Karstens<br />

Marketing Module Business<br />

Basler AG<br />

Ahrensburg, Germany<br />

frank.karstens@baslerweb.com<br />

Abstract – More and more companies active in the field of<br />

computerized machine vision (which is still mostly based on a<br />

classic PC setup) are recognizing the benefits of an embedded<br />

approach (lower power consumption, less space and, most<br />

notably, significant cost savings). However, transferring existing<br />

software code from a classic PC setup to an embedded target can<br />

pose a number of challenges: different operating systems (e.g.<br />

standard Windows vs. hardware-specific Linux), different<br />

processor architectures (e.g. x86 vs. ARM, MIPS, etc.), different<br />

camera interfaces (e.g. GigE vs. MIPI CSI-2), and so on.<br />

Well-defined standards, which are able to set frameworks for<br />

these differences, can help tremendously in the reuse of existing<br />

code. With GenICam, the industrial machine vision industry<br />

established such standards years ago, and it still maintains them<br />

to keep them up-to-date with new technologies. GenICam<br />

standardizes camera configuration and image data transmission<br />

and provides standardized APIs for software developers.<br />

GenICam reference implementations exist for various operating<br />

systems and processor architectures.<br />

In addition, there are camera-vendor-specific SDKs available<br />

which are based on GenICam technology and which add even<br />

more user convenience to the camera APIs. The broader the<br />

choice of the SDKs’ supported operating systems, processor<br />

architectures and camera interface technologies, the more<br />

flexibility is offered to the user to move from one technology to<br />

another and the easier it is to port existing code to the new target.<br />

The MIPI CSI-2 interface, which is of particular interest for<br />

embedded vision applications, has not been covered by the<br />

GenICam standard so far. Different camera vendors are already<br />

working on a GenICam-like abstraction layer on top of CSI-2,<br />

which in turn will make migration to this interface similar to<br />

other GenICam camera interfaces. It is likely that CSI-2 will play<br />

an important role in the world of GenICam in the near future<br />

too.<br />

I. INTRODUCTION<br />

According to the Aspencore Embedded Markets Study<br />

2017 [1], the vast majority of software developers (81%) reuse code<br />

created in-house. This is not surprising, as this code represents<br />

the core intellectual property of an innovative enterprise.<br />

Reusing existing code that has been proven to perform the task<br />

it was written for and is already well tested reduces<br />

technological risks, speeds up development, improves software<br />

quality, and overall reduces development costs.<br />

Many well-established techniques are available to help<br />

reuse code in new applications. Providers of machine vision<br />

software solutions, however, have to deal with the fact that as<br />

years and technological progress advance, the in-house<br />

developed code is not only evolving and progressing itself, but<br />

also needs to be adapted to hardware changes on the camera<br />

side. Over the last 15 years, the machine vision industry has<br />

gone through two industry-changing trends:<br />

- The move from analog to digital, where analog frame<br />

grabbers were replaced by digital hardware, ranging from<br />

costly and specialized high-end (Camera Link, CoaXPress, etc.)<br />

down to common consumer-level hardware such as FireWire,<br />

Gigabit Ethernet or USB2/USB3.<br />

- The trend from CCD to CMOS, which offered<br />

higher-resolution sensors with higher speed and better sensitivity.<br />

These trends enabled better camera product offerings for a<br />

more affordable price, and this in turn forced machine-vision<br />

solution providers to integrate these new camera products into<br />

their setup to stay competitive. The adaptation of the<br />

camera-side hardware changes always requires changes in the software<br />

stack. New proprietary drivers with proprietary APIs need to be<br />

integrated; new camera features or properties need to be<br />

matched by code modifications and so on. The industry has<br />

found an answer to these challenges: the GenICam standard<br />

(see below).<br />

However, the most recent trend–which is about to<br />

fundamentally change the machine vision industry–is<br />

“embedded vision”. The technological progress on both the<br />

sensor and processing sides allows in many cases the design of<br />

computer vision setups (previously requiring a high-end<br />

camera, cable and PC) now with embedded technologies like<br />

camera or sensor modules, embedded processing boards etc.<br />

An embedded approach offers lower power and space<br />

requirements and most notably, significantly lower costs<br />

compared to a classic PC-based setup. An important driver of<br />



this trend has been the mobile industry, with consumer<br />

products requiring vision for smartphones, tablets and so on.<br />

But it is not only the “classic” machine vision industry<br />

adopting embedded technologies. We also see vendors of<br />

typical mobile processors (like Qualcomm, Rockchip, Samsung<br />

etc.) discovering the industry as an interesting market segment.<br />

They have started to offer typical mobile processors (which are<br />

equipped with multiple CSI-2 interfaces) with long-term<br />

availability for industrial applications.<br />

II. THE GENICAM STANDARD<br />

To reduce, at least partly, the burden of repeatedly rewriting code,<br />

to speed up development cycles, and to control development<br />

costs, about 15 years ago, the machine vision industry<br />

(consisting of software, camera, cable and frame-grabber<br />

vendors) organized into the GenICam (Generic Interface for<br />

Cameras) standards group [2] and started to develop standards<br />

to offer unified basic functions for digital cameras:<br />

- Camera configuration - this function supports a range of<br />

camera features such as frame size, acquisition speed,<br />

pixel format, gain, image offset, etc.<br />

- Grabbing images - this function creates access channels<br />

between the camera and the user interface, and initiates<br />

the receiving of images.<br />

- Transmitting meta-information - this function enables<br />

cameras to send extra data on top of the image data.<br />

Typical examples could be histogram information, time<br />

stamps, area of interest in the frame, etc.<br />

- Delivering events - this function enables cameras to talk to<br />

the application through an event channel.<br />

These functions rely on three different modules provided by<br />

GenICam:<br />

GenAPI [3] specifies a methodology for generating a<br />

standardized camera API. This is achieved by an XML file<br />

provided by the camera device. The XML file expresses all<br />

characteristics, properties and features of the camera device in<br />

a standardized way (e.g. if a device provides a feature like<br />

“Blacklevel”, the XML describes this feature in all its properties:<br />

Name=Blacklevel, Type=IInteger, AccessMode=RW,<br />

Value=128, Min=10, Max=255, and so on). The XML file can be<br />

used to generate a static API in various programming<br />

languages or it can even be used to dynamically generate a<br />

generic API at runtime. This approach guarantees that an API<br />

which was generated out of the camera’s XML reflects all the<br />

most recent camera features and properties.<br />
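A much-simplified sketch of the idea (the XML element names below are illustrative, not the actual GenAPI schema, which is considerably richer): the device's XML self-description is parsed at runtime into feature objects the application can work with:<br />

```python
# Simplified sketch of generating feature objects from a device's XML
# self-description; this is NOT the real GenAPI schema.
import xml.etree.ElementTree as ET

CAMERA_XML = """
<Device>
  <Integer Name="BlackLevel">
    <AccessMode>RW</AccessMode>
    <Value>128</Value>
    <Min>10</Min>
    <Max>255</Max>
  </Integer>
</Device>
"""

def load_features(xml_text):
    # Build a feature dictionary from the camera's description, so the
    # API always reflects what the connected device actually provides.
    features = {}
    for node in ET.fromstring(xml_text):
        features[node.get("Name")] = {
            "type": node.tag,
            "access": node.findtext("AccessMode"),
            "value": int(node.findtext("Value")),
            "min": int(node.findtext("Min")),
            "max": int(node.findtext("Max")),
        }
    return features

feats = load_features(CAMERA_XML)
print(feats["BlackLevel"]["min"], feats["BlackLevel"]["max"])  # 10 255
```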

Standard Feature Naming Convention (SFNC) [4]<br />

specifies and standardizes the name and behavior of a given<br />

feature. If a camera vendor uses a standard feature name for a<br />

feature, then it must behave precisely as defined in the SFNC.<br />

GenTL [5] specifies a standardized transport layer<br />

interface for enumerating cameras, grabbing images from the<br />

camera, and moving them to the user application.<br />

Associated with GenICam are interface standards such as<br />

GigE Vision [6] or USB3 Vision [7], which specify protocols<br />

for reliable data transfer on established physical layers (e.g.<br />

GigE Vision on Gigabit Ethernet or USB3 Vision on USB).<br />

These standards are not actually part of GenICam; still, it is<br />

mandatory for such interface-standard-compliant devices to be<br />

GenICam-compliant too.<br />

III. SPECIFIC CHALLENGES FOR EMBEDDED VISION<br />

In a classic machine vision setup - consisting of camera,<br />

cable and (in most cases) a Windows PC - GenICam helped a<br />

lot to provide a stable interface which defines camera or<br />

interface specifics and provides plug-and-play functionality.<br />

Code, which was written for a specific GigE Vision camera<br />

supplied by vendor A, could be reused with only minor<br />

modifications for a USB3 Vision camera from vendor B.<br />

However, things are different in the world of embedded<br />

vision. The environment is filled with variables. The vision<br />

sensor is not necessarily a camera; it can be a camera module<br />

or even a naked CMOS sensor. In addition, the available<br />

processing platforms represent a wide variety: many different<br />

classical CPU architectures (x86, ARM, MIPS, PowerPC etc.)<br />

compete with FPGA, GPU or DSP-based approaches etc.<br />

Moving from one sensor or camera to another, or changing the<br />

processing system architecture quite likely requires the<br />

software developer to rewrite significant parts of the vision<br />

software.<br />

However, the degree of difficulty that may arise when<br />

migrating code from a non-embedded setup to an embedded<br />

one depends on the camera interface chosen.<br />

Embedded with GigE/USB is not actually different from a<br />

non-embedded approach. If the camera interfacing code for the<br />

non-embedded system was written for a GenICam-compliant<br />

API, and as long as the targeted processing platform provides<br />

these interfaces, the existing code can be reused without any<br />

modification when ported to an embedded target. Most camera<br />

vendors offer their GenICam-based camera API for Windows<br />

and Linux (or even macOS) operating systems for x86 or<br />

ARM-based processing architectures. Recompiling the code for<br />

a new target is usually all that is required. The plug-and-play<br />

behavior of USB3 Vision or GigE Vision works the same way<br />

as on a desktop PC.<br />

Embedded with proprietary camera interfaces is – as<br />

the term “proprietary” suggests – not standardized. There are<br />

plenty of proprietary embedded camera interfaces available,<br />

parallel or serial ones (e.g. LVDS). Quite often drivers for a<br />

specific embedded processing platform do not exist and need to<br />

be developed first. Some vendors of embedded processing<br />

systems (e.g. SOM vendors) offer camera modules as an<br />

accessory. In this case, they typically make sure that their BSP<br />

and SDK provides support for their camera modules too.<br />

Camera-interfacing code written for a non-embedded target<br />

must be rewritten extensively for such embedded devices. This<br />

can be an expensive task, which might involve creating kernel<br />

mode drivers, implementing memory management, task<br />

scheduling etc. If a new camera module needs to be designed<br />

in, it is likely that parts of this work need to be done again, in<br />



particular when the new camera module comes from a new<br />

vendor or uses another interface technology.<br />

Embedded with MIPI CSI-2: In 2003, vendors of mobile<br />

devices or components formed the MIPI (Mobile Industry<br />

Processor Interface) Alliance [8] as it became more and more<br />

obvious that standards for connecting peripherals (e.g. all kinds<br />

of sensors or displays) are required to speed up development<br />

and product release cycles. The CSI-2 specification [9]<br />

(Camera Serial Interface, 2nd generation) is today’s number<br />

one standard for connecting vision sensors or camera modules<br />

to mobile processors or SoCs respectively. This standard,<br />

however, can be regarded more as a hardware standard<br />

specification. The physical layer is described in the D-PHY or<br />

C-PHY specification; in addition, the CSI-2 standard specifies<br />

the packet-oriented protocol for transmitting image data “on<br />

the wire”, but it does not standardize driver architectures,<br />

software stacks or a camera API. In addition, camera<br />

configuration and camera features or feature register layouts<br />

are not standardized at all. This implies two important facts to<br />

be considered:<br />

- Each individual mobile SoC comes with its proprietary<br />

camera framework (if there is any at all). This is<br />

particularly true for Linux (on Android we see camera<br />

APIs which provide an abstraction from lower software<br />

and hardware layers, like Google’s Camera API 3 etc.)<br />

Moving an embedded application from one mobile SoC to<br />

another typically requires rewriting the related software<br />

layers and might even require modifying the existing<br />

software architecture.<br />

- With CSI-2, camera configuration is done based on the CCI<br />

(Camera Control Interface) specification, which is essentially an<br />

I²C subset. Each CSI-2-compliant sensor or camera<br />

module however has a different register layout and<br />

different sensor/camera features. Switching from one<br />

sensor/camera to another will require the software<br />

developer to adapt the code to the different sensor-specific<br />

features.<br />
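The problem and a possible mitigation can be sketched as follows (the register addresses and sensor names are invented for illustration, NOT taken from any datasheet): a thin mapping layer confines the sensor-specific register layout to one table, so application code refers only to feature names:<br />

```python
# Invented register maps illustrating why raw CCI access ties code to
# one sensor, and how a mapping layer restores a feature-style interface.
SENSOR_REGMAPS = {
    "sensor_a": {"exposure": 0x3500, "gain": 0x3509},
    "sensor_b": {"exposure": 0x0202, "gain": 0x0205},
}

class CciCamera:
    def __init__(self, sensor):
        self.regmap = SENSOR_REGMAPS[sensor]
        self.registers = {}  # stands in for real I2C register writes

    def set_feature(self, name, value):
        # Application code uses the feature name only; the mapping
        # layer resolves the sensor-specific register address.
        self.registers[self.regmap[name]] = value

cam = CciCamera("sensor_a")
cam.set_feature("exposure", 1000)
# Switching sensors changes only the table, not the application code:
cam_b = CciCamera("sensor_b")
cam_b.set_feature("exposure", 1000)
```

This is essentially the role a GenICam-like abstraction (or the MIPI CCS register map, for basic features) would play at a larger scale.<br />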

For vendors of consumer-grade mobile applications, these<br />

standard restrictions do not actually pose a big problem. The<br />

development efforts needed for interfacing an individual sensor<br />

with an individual mobile SoC normally pay off as part of<br />

regular practice: the primary goal of a consumer product sold<br />

in very high numbers is to keep production costs under control,<br />

which often already includes an individual, highly<br />

cost-optimized full-custom software design.<br />

For providers of embedded vision applications, which<br />

create more specialized, industrial solutions (which are<br />

typically not sold in such high numbers as smart phone apps),<br />

things are different: development costs must be kept under<br />

control and additional efforts for sensor or camera integration<br />

have a direct impact on the development budget and delay the<br />

time-to-market. Any approach that would offer any kind of<br />

generic API for any combination of CSI-2 camera and SoC<br />

would be highly welcome.<br />

This requirement is now also reflected by efforts of the<br />

MIPI alliance, which released in October 2017 the new MIPI<br />

CCS (Camera Control Set) specification [10]. The primary goal<br />

of CCS is to “Enable rapid integration of basic image sensor<br />

functionalities without device-specific drivers”. The CCS<br />

standard specifies a set of basic functions a CCS-compliant<br />

device must provide (along with a related register map). In<br />

theory, the integration efforts for a new camera device should<br />

be reduced to a minimum at least as long as only basic camera<br />

features are needed (“…such as resolution, frame rate and<br />

exposure time, as well as advanced features such as phase<br />

detection auto focus (PDAF), single frame HDR, or fast<br />

bracketing…”) [10]. The future will show whether the CCS<br />

specification is able to penetrate the market. Until now, many<br />

CSI-2 compliant devices have not even complied with the CCI<br />

specification. In addition, it is questionable whether the basic<br />

functions are sufficient for typical machine vision tasks, which<br />

usually require differentiated and real-time-like control over<br />

more complex features such as single frame capture, fast<br />

changing ROI, exotic pixel formats etc.<br />

IV. A GENICAM-LIKE INTERFACE ABSTRACTION FOR MIPI CSI-2?<br />

The key for reusing code is a stable API, which is able to<br />

hide the specifics of lower software and hardware layers. For<br />

industrial machine vision, GenICam already showed how such<br />

an approach can look. Given that GenICam is already the<br />

dominant camera-interfacing standard in the non-embedded<br />

world of computer vision, it is obvious that a GenICam-like<br />

interface for MIPI CSI-2 would be a perfect abstraction layer<br />

for migrating and reusing code that was already written against<br />

a GenICam-conforming API. Leading machine vision camera<br />

vendors like Allied Vision, Basler, FLIR Systems, and IDS have joined the MIPI Alliance; it can be expected that they (along with the GenICam organization) will start to work on a GenICam-like abstraction for MIPI CSI-2. It is, however,

unpredictable when the first results will be available. Until<br />

then, there remain at least two more proprietary approaches that individual camera vendors may offer to ease the reuse of existing code:

Putting CSI-2 under the hood of the existing camera<br />

SDK. Even though the GenICam organization specifies a<br />

reference implementation, almost all software developers do<br />

not use this reference implementation directly; rather, they use<br />

the proprietary camera SDK of an individual camera vendor.<br />

Such camera APIs (e.g. Basler's pylon, Allied Vision's Vimba, or FLIR's Spinnaker SDK) are essentially vendor-specific implementations of GenICam. They typically add vendor-specific functions and a higher level of convenience to GenICam, along with programming samples, documentation, drivers, etc.; together they form a complete, easy-to-use SDK. For a CSI-2 target, a camera vendor would still have to provide an individual driver stack for a given camera/SoC combination. However, the vendor can offer the common vendor-specific camera API (e.g. the Basler pylon API) on top of the CSI-2 driver stack.

From the developer's perspective, the MIPI camera in this case behaves like other cameras from the same vendor, so existing camera-interfacing code can easily be reused. From the camera vendor's perspective, this means a huge development effort, as the existing API must be adapted to each individual camera/SoC combination. Basler, for example, made

www.embedded-world.eu<br />

265


a first step in this direction and now offers a MIPI camera module for Qualcomm Snapdragon processors together with the pylon API.

Putting CSI-2 under the hood of GenTL. A less<br />

proprietary approach would be to offer GenTL as a camera<br />

interface. GenTL (Generic Transport Layer) – a GenICam<br />

substandard – consists of two parts:<br />

The GenTL Producer, which is usually provided by the<br />

hardware vendor of a GenTL-compliant device and which<br />

exposes a standardized API for all required camera<br />

functions, including enumeration, configuration and image<br />

acquisition.<br />

The GenTL Consumer, which is the interface a piece of software needs to implement in order to interface with (utilize) any GenTL Producer.

This means that any software providing a GenTL<br />

Consumer would be able to interface with any device that<br />

exposes a GenTL Producer, regardless of the actual hardware,<br />

interface technology, driver, etc. Again, a camera vendor might offer a GenTL Producer on top of the CSI-2 driver stack. This, too, would be a huge development effort, as the GenTL Producer must be adapted to each individual camera/SoC combination.
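The Producer/Consumer split can be modeled in a few lines of illustrative code. The class and method names are invented for this sketch and are not the real GenTL C API; the point is that Consumer-side code is written once against the Producer interface, so any transport that ships a Producer, a CSI-2 stack included, works without change.

```python
# Conceptual model of the GenTL Producer/Consumer split. Names are
# illustrative only; the real GenTL standard defines a C interface.

class Csi2Producer:
    """Hypothetical Producer a vendor might ship on top of a CSI-2 stack."""
    def enumerate_devices(self):
        return ["csi2-cam0"]
    def acquire_frame(self, device):
        return bytes(16)  # placeholder image payload

class GigEProducer:
    """Producer for a different transport exposing the same interface."""
    def enumerate_devices(self):
        return ["gige-cam0", "gige-cam1"]
    def acquire_frame(self, device):
        return bytes(16)

def consumer_app(producer):
    """Consumer-side code: transport-agnostic by construction."""
    frames = [producer.acquire_frame(d) for d in producer.enumerate_devices()]
    return len(frames)

print(consumer_app(Csi2Producer()))  # -> 1
print(consumer_app(GigEProducer()))  # -> 2
```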

GenTL offers a higher level of abstraction, as it does not<br />

bind the software developer to a vendor-specific SDK. The<br />

disadvantage of GenTL is its relatively high complexity, which<br />

comes with significant internal overhead, and so it might not be<br />

the best solution for an embedded platform with limited CPU<br />

power and memory.<br />

V. CONCLUSION<br />

MIPI CSI-2 will become the most important camera interface<br />

for embedded machine vision applications. The lack of a<br />

standardized API (like GenICam) makes it difficult to reuse<br />

code that was written for other GenICam-compliant camera<br />

hardware.<br />

Camera vendors are now starting to put MIPI CSI-2 driver and<br />

software stacks under the hood of their existing camera SDK,<br />

or are offering a GenTL Producer which abstracts CSI-2<br />

specifics. Unless the MIPI Alliance itself integrates GenICam into the CSI-2 specification, the vendor-specific approaches will remain proprietary solutions for specific camera/SoC combinations.

For the software developer this means: watch for those camera SDKs which offer the broadest support for both non-embedded and embedded processing platforms, operating systems and interface technologies, including MIPI CSI-2. Having one unified camera API enables the reuse of significant amounts of existing code, and offers the user more flexibility to move from one technology to another and to port existing code to a new target.

REFERENCES<br />

[1] https://m.eet.com/media/1246048/2017-embedded-market-study.pdf<br />

[2] http://www.emva.org/standards-technology/genicam/<br />

[3] http://www.emva.org/wpcontent/uploads/GenICam_Standard_v2_1_1.pdf<br />

[4] http://www.emva.org/wp-content/uploads/GenICam_SFNC_2_3.pdf<br />

[5] http://www.emva.org/wp-content/uploads/GenICam_GenTL_1_5.pdf<br />

[6] https://www.visiononline.org/vision-standards-details.cfm?type=5<br />

[7] https://www.visiononline.org/vision-standards-details.cfm?type=11<br />

[8] https://mipi.org/about-us<br />

[9] https://mipi.org/specifications/csi-2<br />

[10] https://mipi.org/specifications/camera-command-set<br />



Accelerating the Development of Intelligent, Vision-<br />

Enabled Devices at the Edge<br />

Rapid Prototyping with an Embedded Vision Development Kit<br />

Dirk Seidel<br />

Lattice Semiconductor: Senior Marketing Manager, Industrial<br />

San Jose, CA, U.S.<br />

dirk.seidel@latticesemi.com<br />

Abstract— The future looks promising for embedded vision<br />

systems. Exciting new applications are coming to market. One<br />

key to their success will be designers’ ability to continually<br />

improve performance and utility. Today, mobile platforms have<br />

expanded and gone beyond smartphones and tablets. Often they<br />

are used in industrial display systems for M2M applications and<br />

Industry 4.0 implementations, Advanced Driver Assistance<br />

Systems (ADAS) and Infotainment applications for automotive<br />

markets, DSLR cameras, drones, virtual reality (VR) systems<br />

and medical equipment. What today’s embedded vision system<br />

designers are looking for is flexible connectivity to address<br />

evolving interface requirements, energy-efficient image signal<br />

processing, and hardware acceleration. This paper will review the<br />

tools available to embedded vision designers for rapid<br />

prototyping and describe embedded vision technology and how<br />

it’s being used.<br />

Keywords— Artificial Intelligence, AI, machine learning,<br />

Intelligence at the Edge, Edge Intelligence, Embedded Vision Kit,<br />

Embedded Vision Technology, CrossLink, ECP5, NanoVesta,<br />

FPGA, ASSP, HDMI<br />

I. INTRODUCTION<br />

Ten years ago, embedded vision technology was primarily<br />

used in relatively obscure, highly specialized applications.<br />

Today, designers are finding exciting new use cases for

embedded vision applications in a growing array of industrial,<br />

automotive, and consumer applications. Specifically, the<br />

emergence of advanced robotics and machine learning, as well<br />

as the migration to the Industry 4.0 manufacturing model,<br />

promise to create new applications for embedded vision.<br />

Driven by the rapid rise of mobile-influenced technologies, designers are faced with increasing their pace when designing

new products such as machine vision, Advanced Driver<br />

Assistance Systems (ADAS), drones, gaming systems,<br />

surveillance and security systems, virtual reality (VR) systems,<br />

medical equipment and AI solutions. All these applications<br />

benefit greatly from the accessibility and simplicity of embedded vision technology.

II. TECHNOLOGICAL CHANGE

What has changed? First and foremost, many of the key<br />

components and tools crucial to the rapid deployment of low<br />

cost embedded vision solutions have finally emerged. Now<br />

designers can choose from a wide range of lower cost<br />

processors and programmable logic capable of delivering<br />

higher performance in a compact footprint, all while<br />

consuming minimal power. At the same time, thanks to the<br />

rapidly growing mobile market, designers are benefiting from<br />

the proliferation of cameras and sensors. In the meantime,<br />

improvements in software and hardware tools are helping to<br />

simplify development and shorten time to market.<br />

The rapid rise in the number of sensors being integrated into the current generation of embedded designs, as well as the integration of low-cost cameras and displays, has opened the door to a wide range of exciting new intelligence and vision applications.

At the same time, this embedded vision revolution has<br />

forced designers to carefully re-evaluate their processing needs.<br />

New, data-rich video applications are driving designers to<br />

reconsider their decision to use a particular Applications<br />

Processor (AP), ASIC or ASSP. In some cases, however, large<br />

software investments in existing APs, ASICs or ASSPs and the<br />

high startup costs for new devices prohibit replacement. In this<br />

situation, designers are looking for co-processing solutions that<br />

can provide the added horsepower required for these new, data-rich applications without violating stringent system cost and power limits.

III. MOBILE INFLUENCE

While embedded vision solutions in one form or another<br />

have been around for many years, the growth rate of the<br />

technology has been limited by a number of factors. First and<br />

foremost, key elements of the technology have not been<br />

available at low cost. In particular, compute engines capable of<br />

processing HD digital video streams in real-time have not been<br />

widely available within the power and cost budget. Limitations in high-capacity solid-state storage and in advanced analytic algorithms have also presented challenges.




Three recent developments promise to radically change<br />

market conditions for embedded vision systems. First, the rapid<br />

development of the mobile market has given embedded vision<br />

designers a wide selection of processors that deliver relatively<br />

high performance at low power. Second, the recent success of<br />

the Mobile Industry Processor Interface (MIPI) specified by the<br />

MIPI Alliance, offers designers effective alternatives, using<br />

compliant hardware and software components to build<br />

innovative and cost-effective embedded vision solutions.<br />

Finally, the proliferation of low-cost sensors and cameras for mobile applications has helped embedded vision system designers drive adoption up and cost down.

IV. THE NEED FOR MORE PROCESSING POWER<br />

By definition, embedded vision systems include virtually<br />

any device or system that executes image signal processing<br />

algorithms or vision system control software. The key elements<br />

in an intelligent vision system typically include high<br />

performance compute engines capable of processing HD digital<br />

video streams in real-time, high capacity solid state storage,<br />

smart cameras or sensors, and advanced analytic algorithms.<br />

Processors in these systems can perform a wide range of<br />

functions from image acquisition, lens correction and image<br />

pre-processing, to segmentation, object analysis and AI.<br />

Designers of embedded vision systems employ a wide range of<br />

processor types including general purpose CPUs, Graphics<br />

Processing Units (GPUs), Digital Signal Processors (DSPs),<br />

Field Programmable Gate Arrays (FPGAs) and Application<br />

Specific Standard Products (ASSPs) designed specifically for<br />

vision applications. Each of these processor architectures offers<br />

distinct advantages and challenges. In many cases, designers<br />

combine multiple processor types in a heterogeneous<br />

computing environment. Other times, the processors are<br />

integrated into a single component. Moreover, some processors<br />

use dedicated hardware to maximize performance on vision<br />

algorithms. Programmable platforms such as FPGAs offer<br />

designers both a highly parallel architecture for compute-intensive applications and the ability to serve other purposes, such as expanding I/O resources.

V. IMAGE CAPTURE<br />

Designers of embedded vision systems can select from a<br />

wide variety of analog cameras and digital image sensors.<br />

Digital image sensors are usually CCD or CMOS sensor arrays<br />

that operate with visible light. Embedded vision systems can<br />

also be used to sense other types of energy, such as infrared, ultrasound, radar and LIDAR. Complex embedded vision systems require sensor fusion, in which data from several different sensors are "fused" to compute something that could not be determined by any one sensor alone.
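As a minimal illustration of such fusing, consider two noisy range estimates of the same object, say from a depth camera and a radar, combined by inverse-variance weighting; the fused estimate has lower variance than either sensor alone. The numbers are made up for illustration.

```python
# Inverse-variance fusion of two independent measurements of one quantity.

def fuse(est_a, var_a, est_b, var_b):
    w_a, w_b = 1.0 / var_a, 1.0 / var_b       # weight = inverse variance
    fused = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    fused_var = 1.0 / (w_a + w_b)             # always < min(var_a, var_b)
    return fused, fused_var

# Camera says 10.4 m, radar says 10.0 m, both with variance 0.25 m^2:
est, var = fuse(10.4, 0.25, 10.0, 0.25)
print(round(est, 2), round(var, 3))  # -> 10.2 0.125
```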

Designers are moving increasingly to “smart cameras” that<br />

use the camera or sensor housing to serve as a chassis for all<br />

edge electronics in the vision system. Other systems transmit<br />

the sensor data to the cloud to reduce processing overhead on<br />

the system processor and, in the process, minimize system<br />

power, footprint and cost. However, this approach faces issues when low latency and critical decision-making based on the image sensor data are required.

VI. THE DESIGNER'S CHALLENGE

The widespread adoption of low cost, mobile-influenced<br />

MIPI peripherals has created new connectivity challenges.<br />

Designers want to take advantage of the economies of scale<br />

that the latest generation of MIPI cameras and displays offer.<br />

But they also want to preserve their existing investment in<br />

legacy devices. The main challenge the designers are faced<br />

with is creating customized prototypes quickly and costeffectively,<br />

while reusing their existing designs.<br />

What designers need is a highly flexible solution that offers the logic resources of a high-performance, "best-in-class" co-processor capable of the highly parallel computation required in vision and intelligence applications, while adding high levels of connectivity and support for a wide range of I/O standards and protocols. Moreover, this solution should offer a highly scalable architecture and support the use of mainstream, low-cost external DDR DRAM at high data rates. The device should be optimized for both low-power and low-cost operation, and offer designers the opportunity to use industry-leading, highly compact packages.

VII. USE CASES

The rapid evolution of embedded vision, driven by the availability of low-cost, mobile-influenced image sensors and displays, has led to exciting new commercial embedded vision applications. In some cases, however, large software investments in existing APs, ASICs or ASSPs and the high startup costs for new devices prohibit replacement. In this situation, designers are looking for co-processing solutions that can provide the added horsepower required for these new, data-rich applications without violating stringent system cost and power limits.

A. Machine Vision<br />

One of the most promising applications for embedded vision is in the industrial arena: machine vision systems. Machine vision technology is one of the most mature and highest-volume applications for embedded vision. As a result, it is widely used

in the manufacturing process and quality management<br />

applications. Typically, in these applications manufacturers use<br />

compact vision systems that combine one or more smart<br />

cameras with a processor module.<br />

Today, designers are finding a seemingly endless array of<br />

new applications for this technology. For example, a machine vision smart camera (Figure 1) is ideally suited to monitoring the production floor of a manufacturing facility. Designers can use an FPGA to serve as a sensor bridge, act as a complete camera Image Signal Processing (ISP) pipeline, and supply connectivity such as GigE Vision or USB3 Vision.

Figure 1: Machine Vision Smart Camera<br />

Another example is an FPGA-based video grabber (Figure 2), which aggregates data from multiple cameras and performs image pre-processing before sending it over a PCIe interface to a host processor.




Figure 2: Video Grabber<br />

B. Automotive<br />

Given the rapid rise in its use of electronics, the automotive market offers high growth potential for embedded vision

applications. The introduction of Advanced Driver Assistance<br />

Systems and infotainment features are expected to drive<br />

growth quickly. The embedded vision product most commonly<br />

used in these applications is the camera module. Vendors either develop analytics and algorithms in-house or embed third-party IP from external developers. One emerging automotive

application is a driver monitoring system which uses vision to<br />

track driver head and body movement to identify fatigue.<br />

Another one is a vision system that can monitor potential driver<br />

distractions, such as texting or eating, increasing vehicle<br />

operational safety.<br />

But vision systems in cars can do far more than monitor<br />

what happens inside the vehicle. Starting in 2018, regulations<br />

will require that new cars must feature back-up cameras to help<br />

drivers see behind the car. And new applications like lane<br />

departure warning systems combine video with lane detection<br />

algorithms to estimate the position of the car. In addition,<br />

demand is building for features that read warning signs,<br />

mitigate collisions, offer blind spot detection and automatically<br />

handle parking and park reverse assistance. All of these<br />

features promise to make driving safer and are required to<br />

make decisions right at the edge.<br />

Together, advances in vision and sensor systems for<br />

automobiles are laying the groundwork for the development of<br />

true autonomous driving capabilities. In 2018, for example,<br />

Cadillac will integrate a number of embedded vision<br />

subsystems into its CT6 sedan to deliver SuperCruise, one of<br />

the industry’s first hands-free driving technologies. This new<br />

technology will make driving safer by continuously analyzing<br />

both the driver and the road while a precision LIDAR database<br />

provides details of the road and advanced cameras, sensors and<br />

GPS react in real-time to dynamic roadway conditions.<br />

Overall, automakers are already anticipating that ADAS for modern vehicles will require forward-facing cameras for lane detection, pedestrian detection, traffic sign recognition and emergency braking. Side- and rear-facing cameras will be needed to support parking assistance, blind spot detection and cross-traffic alert functions.

One challenge auto manufacturers face is limited I/Os in<br />

existing electronic devices. Typically, processors today feature<br />

two camera interfaces. Yet many ADAS systems require as<br />

many as eight cameras to meet image quality requirements.<br />

Designers need a solution that gives them the co-processing<br />

resources to stitch together multiple video streams from<br />

multiple cameras or perform image processing functions such<br />

as white balance, fish-eye correction and defogging, on the<br />

camera inputs and pass the data to the Application Processor<br />

(AP) in a single stream. For example, many auto manufacturers<br />

offer as part of their ADAS system a bird’s-eye view capability<br />

that gives the driver a live video view from 20 feet above the<br />

car looking down. The ADAS system accomplishes this by<br />

stitching together data from four or more cameras with a wide<br />

Field-of-View (FoV).<br />

Historically designers have used a single processor to drive<br />

each display. Instead, designers can now use a single FPGA to<br />

replace multiple processors, aggregate all the camera data,<br />

stitch the images together, perform pre- and post-processing<br />

and send the image to the system processor. Figure 3 shows the<br />

simplified architecture of a Bird’s-eye-view 360 Automotive<br />

Camera system, which collects data from four cameras located<br />

around the car (front, back, and side). A single FPGA is used<br />

for various pre- and post-processing functions and stitches<br />

together the video data to provide a 360-degree view of the<br />

vehicle surroundings. In this case, a single FPGA replaces<br />

numerous processors.<br />

Figure 3: Bird's-eye-view 360 Automotive System
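The aggregation-and-stitch step can be sketched with toy data: four tiny single-value "frames" (front, right, left, back) composited into one surround view. A real bird's-eye system would also de-fisheye and perspective-warp each frame before blending; this sketch shows only the many-streams-into-one-image idea.

```python
# Toy surround-view composition: four camera frames tiled into one image.

def make_frame(value, h=2, w=2):
    """A tiny constant 'frame' standing in for one camera's video."""
    return [[value] * w for _ in range(h)]

def stitch_2x2(top_left, top_right, bottom_left, bottom_right):
    """Place four frames into the quadrants of one output image."""
    top = [a + b for a, b in zip(top_left, top_right)]
    bottom = [a + b for a, b in zip(bottom_left, bottom_right)]
    return top + bottom

front, back = make_frame(1), make_frame(2)
left, right = make_frame(3), make_frame(4)
surround = stitch_2x2(front, right, left, back)
for row in surround:
    print(row)
# -> [1, 1, 4, 4]
#    [1, 1, 4, 4]
#    [3, 3, 2, 2]
#    [3, 3, 2, 2]
```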

C. Consumer<br />

Drones, Augmented Reality/Virtual Reality (AR/VR) and<br />

other consumer applications offer tremendous opportunities for<br />

developers of embedded vision solutions. Today, drone designers are finding it cheaper to synchronize six or more cameras on a drone to create a panoramic view than to build a mechanical solution that takes two cameras and moves them 180 degrees. Similarly, AR/VR designers are converting a single video stream and splitting the content across dual displays. They make use of low-cost, mobile-influenced technology with two MIPI DSI displays, one for each eye, providing low-latency performance with minimal power consumption, enhanced depth perception and a more immersive user experience.




Figure 5: Customizable FPGA-based Prototyping Platform


Figure 4: FPGA-based Virtual Reality System<br />

VIII. HOW TO MASTER THE CHALLENGE

To overcome the designer's challenge while striving for fast product development cycles and the lowest-cost, lowest-power products with superior performance, it is highly recommended to take a modular approach. This method allows designers to customize their prototyping system based on existing, field-proven hardware and software and to reuse existing elements in their design.

Many hardware platforms, tailored to specific functions<br />

like sensor bridging, image processing or networking<br />

connectivity are available off-the-shelf from numerous<br />

vendors. Semiconductor manufacturers provide reference<br />

platforms or development boards incorporating their own<br />

products, while specialized design houses provide modular<br />

systems including semiconductor products from several<br />

vendors. Often these design houses also offer a commercial-grade ISP, which can simply be used with their hardware and quickly included in a prototype. Furthermore, many boards and

systems are available through electronic distributors or through<br />

online stores.<br />

One important factor when selecting the prototyping system<br />

of choice is the availability of the right connector and the<br />

capability to seamlessly connect multiple boards together.<br />

Numerous connector options exist, such as standard header pins, PMC mezzanine connectors, Milli-Grid connectors, or ERM5 rugged high-speed headers. The number of available connectors is nearly endless.

Header pins usually provide the most flexible option for wiring up several boards. However, the drawback is that it often becomes extremely difficult to connect several high-speed signals that require synchronization, as in video applications. Enormous amounts of time and money must be spent connecting development boards from different semiconductor vendors.

A smart modular solution for embedded vision prototyping is provided by Lattice Semiconductor with its Embedded Vision Development Kit. This kit is part of Lattice's Video Interface Platform (VIP), which allows for easy interchange of input and output interconnect boards through a simple snap-on concept. The kit incorporates a nanoVesta connector, allowing easy connection of a variety of different image sensors, which are available from third-party vendors like HelionVision.

This modular three-board set simplifies the implementation of highly flexible, cost-effective embedded vision solutions for mobile-influenced systems in industrial, automotive, and consumer markets. The development kit is built around a stackable three-board set that combines the CrossLink video bridge input board for sensor aggregation, an ECP5 FPGA processor board used for the ISP, and an HDMI output board, allowing easy connectivity to a standard HDMI display. The kit is complemented by an evaluation version of Helion's commercial-grade IONOS ISP. This ISP is sensor-independent and can easily be customized to any specific need. The IONOS ISP provides HD image signal processing for superior image quality and provides algorithms for pixel correction, white balance, debayering, color space conversion, gamma correction and more.

To help embedded vision developers rapidly build<br />

prototypes for a growing array of applications and shorten<br />

time-to-market, a modular system approach is highly<br />

beneficial. Lattice recognizes these benefits and plans to offer a<br />

variety of additional input and output boards in the near future.<br />

A newly released HDMI input board is available now and can be used as an alternative to the sensor bridge for camera aggregation.

The Embedded Vision Development Kit allows developers to take advantage of existing hardware building blocks and to customize their design functionality by easily mixing and matching new boards to meet the needs of industrial, automotive and consumer vision applications.



Selecting Cellular LPWAN Technology<br />

for the IoT<br />

Brent Nelson<br />

Sr. Product Manager – Long Range RF and Gateway Products<br />

Digi International<br />

Minnetonka, MN USA<br />

Abstract— For many Internet of Things (IoT) applications,<br />

high-throughput standards such as LTE-Advanced, with its<br />

throughput of 300Mbps, are overkill, since the amounts of<br />

data are relatively small. What’s more, devices and sensors<br />

are often deployed in far-flung, remote areas that often lack<br />

access to power, making a high-powered router unfeasible.<br />

To address this segment’s low-power, low-bandwidth<br />

requirements, the 3GPP, the cellular-standards body, is<br />

putting forth new “narrowband” standards. LTE Cat 1, LTE-<br />

M, and NB-IoT are designed to connect devices and sensors<br />

that dribble data and operate at very low power, allowing them<br />

to last multiple years on a battery.<br />

LTE Cat 1 networks are available in North America,<br />

Australia, and Japan, and are an excellent option for IoT<br />

devices that require cellular connectivity. With throughput<br />

speeds capped at 10 Mbps, this standard is significantly less complex and less power-hungry than Cat 3 or Cat 4 technologies, which support throughputs of 100 and 150 Mbps.

In the fall of 2017, carriers will activate their networks to<br />

support LTE-M in North America and NB-IoT in Europe.<br />

LTE-M has a maximum speed of about 1 Mbps and NB-IoT caps out at 144 Kbps, making them ideal for low-power, low-data-rate applications.

Keywords: Cat 1; LTE; NB-IoT; IoT; routers; 3G

I. INTRODUCTION<br />

Existing barriers to entry<br />

For several years, market research has suggested that the global IoT will drive exponential growth in the number of connected devices, with predictions going as high as 50 billion new connected devices. While the number of connected devices has grown, we have yet to see the exponential growth that was widely predicted.

There are many factors that have slowed the growth of the IoT. Perhaps the most challenging barrier is the cost and difficulty of getting data from edge devices like sensors and machines. No matter how interesting or potentially revolutionary a new technology is, every IoT solution must pass a Return on Investment (ROI) analysis before the investment will be made.

Historically, there have been three typical barriers to connecting edge devices when deploying an IoT solution that requires mass deployment of remote monitoring:

- Device cost
- Recurring cost
- Battery life
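These three barriers can be folded into a toy break-even calculation: a deployment pays for itself only when the monthly benefit outruns recurring connectivity cost plus amortized battery replacement. All figures below are made-up illustrations, not numbers from this paper.

```python
# Toy ROI check for a single remote-monitoring device.

def breakeven_months(device_cost, monthly_recurring, monthly_savings,
                     battery_cost=0.0, battery_life_months=None):
    """Months until cumulative net benefit covers the device cost."""
    monthly_battery = (battery_cost / battery_life_months
                       if battery_life_months else 0.0)
    net = monthly_savings - monthly_recurring - monthly_battery
    if net <= 0:
        return None  # the deployment never pays back
    return device_cost / net

# $50 device, $2/month airtime, $7/month savings, $6 battery lasting a year:
m = breakeven_months(50.0, 2.0, 7.0, battery_cost=6.0, battery_life_months=12)
print(round(m, 1))  # -> 11.1
```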

New wireless WAN technologies like Low-Power (LP) Cellular and LoRaWAN were developed specifically to address these barriers and have the potential to drive the exponential growth that has been predicted. This paper will discuss those cellular technologies in depth.

II. TECHNOLOGY OVERVIEW<br />

LTE‐CAT 1<br />

The first step on the roadmap of cellular for IoT devices was LTE Category (CAT) 1. LTE CAT 1 required only a software update to existing LTE networks, so it could be deployed quickly and now has pervasive coverage in the US and Canada, as well as across much of Europe.

LTE CAT 1 was designed to operate at 3G speeds, with a maximum downlink/uplink of 10 Mbps/5 Mbps. The main advantage of LTE CAT 1 was cost. LTE modules were priced at a premium versus 3G/2G devices, so IoT devices that used LTE were often priced out of the market when they did not require the bandwidth of LTE. This slowed the deployment of LTE even when the shutdown of 3G and 2G networks was on the near-term horizon.

LTE CAT 1 modules were priced at similar levels to 3G/2G<br />

models. This meant there was no incentive to continue to<br />

deploy 3G/2G modules since those networks are nearing EOL<br />



and there was no longer a cost advantage. LTE CAT 1, however, did not implement any power-saving features versus LTE CAT 3/4, so battery life continued to be a challenge for remotely deployed assets.

LTE Cat 1 networks are available in North America, Australia,<br />

and Japan.<br />

LTE-M and NB-IoT<br />

LTE-M and NB-IoT are the next step on the cellular IoT roadmap. Carriers have been deploying these networks throughout 2017, and that will accelerate in 2018.<br />

LTE-M and NB-IoT are market-disruptive technologies thanks to their low cost, increased link budget and low power. These are the two technologies that will lead to the inflection point in the deployment of IoT.<br />

These technologies are often referred to as narrowband cellular. The typical 20 MHz bandwidth of LTE is shrunk to 1.4 MHz for LTE-M and 200 kHz for NB-IoT. While narrowing the band would be considered a bad thing for a high-throughput application, for these technologies a narrower band leads to lower cost, better receive sensitivity and lower power.<br />

A quick summary of the technical specifications of the different standards is shown below.<br />

Technology | Release | Downlink speed    | Uplink speed      | Antennas | Duplex mode | Receive bandwidth | Transmit power | Modem complexity<br />
CAT4       | 8       | 150 Mbps          | 50 Mbps           | 2        | Full duplex | 20 MHz            | 23 dBm         | 100%<br />
CAT1       | 8       | 10 Mbps           | 5 Mbps            | 2 or 1   | Full duplex | 20 MHz            | 23 dBm         | 80%<br />
LTE-M      | 13      | 200 kbps - 1 Mbps | 200 kbps - 1 Mbps | 1        | Half duplex | 1.4 MHz           | 20 dBm         | 20%<br />
NB-IOT     | 13      | 200 kbps          | 144 kbps          | 1        | Half duplex | 200 kHz           | 23 dBm         | -<br />


IV. TRADEOFFS<br />

Since nothing in life is free, moving to either the NB-IoT or LTE-M standard involves tradeoffs. Understanding these is critical to picking the technology that will meet your application requirements.<br />

Mobile Initiated vs Mobile Terminated<br />

One of the first things engineers need to understand when looking at these LP-WAN technologies is that they are designed for mobile-initiated calls (meaning the end device initiates the data connection to the application). To support the ultra-low power modes, these modules go into a sleep state and cannot receive traffic from the network. The networks are not guaranteed to store network-initiated messages that cannot be received (as they would with SMS), so data may be lost on network-initiated calls. Note that keeping the device in a state where it can receive messages would defeat the purpose of the technology and make it only slightly better than standard LTE or 3G devices.<br />

LP-Cellular technologies like LTE-M and NB-IoT have the potential to grow the number of deployed IoT devices exponentially. They offer significant benefits in terms of device cost, recurring cost and battery life. While they are an ideal technology for many IoT devices, there are tradeoffs, and any engineer or company looking to deploy these technologies should study the tradeoffs fully to ensure they meet the needs of their application.<br />

Latency<br />

The second major tradeoff is latency. When an LP-WAN device goes into low power mode, latency increases significantly. In Discontinuous Reception mode (DRX or eDRX), the device only listens for incoming traffic every x seconds in DRX mode and every x seconds in eDRX mode.<br />

In Power Save Mode (PSM), mobile-terminated connections are not possible, so the latency is defined by how often the remote device wakes up and connects to the network; it could easily be measured in hours or days.<br />
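The impact of these power saving modes on worst-case downlink latency can be sketched with a short calculation. The default cycle lengths below are illustrative assumptions (the text leaves the exact intervals open), not values from any specific network.<br />

```python
# Worst-case downlink latency for an LP-WAN device in different power modes.
# The default cycle lengths below are illustrative assumptions only.

def worst_case_latency_s(mode: str,
                         edrx_cycle_s: float = 81.92,
                         psm_wake_interval_s: float = 24 * 3600.0) -> float:
    """Longest time a mobile-terminated message may wait before delivery."""
    if mode == "eDRX":
        # The device listens once per eDRX cycle, so a message can wait a full cycle.
        return edrx_cycle_s
    if mode == "PSM":
        # In PSM the device is unreachable until it wakes and contacts the network.
        return psm_wake_interval_s
    raise ValueError(f"unknown mode: {mode!r}")
```

With a daily wake-up interval, the PSM case yields a worst-case latency of a full day (86 400 s), matching the observation that PSM latency can be measured in hours or days.<br />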

Throughput<br />

The third tradeoff is throughput. This one is obvious; these networks were not designed for high-bandwidth applications like video. The typical throughput on an LTE-M network is 300 kbps, while the throughput of an NB-IoT network is typically<br />


Sigfox – connecting the world with one LPWAN<br />

a global low power wide area UNB non-slotted Aloha transmission IoT approach<br />

Alexander Lehmann, M.Eng.<br />

Principal Engineer<br />

Sigfox Germany<br />

Munich, Germany<br />

Alexander.Lehmann@sigfox.com<br />

Abstract – Efficient transmission of small data packets in a shared spectrum is challenging. This paper gives a brief summary of using sub-GHz general purpose transmitters to transmit low power wide area UNB frames without signaling, using Sigfox as an example.<br />

Keywords – Sigfox; UNB; IoT; Sub-Ghz; non-slotted; Aloha;<br />

spectrum frugality; low power; wide area<br />

I. INTRODUCTION<br />

Spectrum is a scarce resource, which makes it important to use it as efficiently as possible. Therefore, a UNB approach with high spectral efficiency is essential, even more so in a license-free and shared band. Sigfox, as a widely deployed network (35 countries and counting), shall be used as an example to discuss the current state-of-the-art Ultra Narrow Band network. An outlook will be given on possible future features, advanced processing methods and updates.<br />

II. RANDOM TIME AND FREQUENCY<br />

A. Design choices<br />

As a starting point, an Aloha-approach network was chosen. Each device can emit randomly (in compliance with the duty cycle, the maximum transmitted power and the frequency band, e.g. in the ETSI Zone 1) so that there is no signaling, for power saving and lifetime predictability reasons.<br />

B. Resulting Consequences<br />

This also brings some consequences for reliability: to compensate for the random transmissions in time and frequency, retransmissions and a cooperative reception approach were chosen. Each message is repeated 2 times on different frequencies with < 50 µs in between; the complete message therefore consists of 3 frames. All base stations in range pick up the signal and forward it to the backend. This redundancy is enough, and represents a tradeoff between keeping the spectrum utilization as minimal as possible and having the desired Quality of Service (QoS), as shown in Section III.<br />

III. QUALITY OF RECEPTION<br />

When messages arrive randomly, a simulation can give a first indication of the performance of this concept; the desired quality is that 99.99% of the received messages have to be decoded:<br />

Figure 1: Maximal BS Load for 99.99% QoS [Simulation by Sigfox]<br />

When the success rate of 99.99% is required, the maximal input load is 14% of the channel capacity; a Sigfox message is 100 Hz wide and the channel is 192 kHz:<br />

192 kHz / 100 Hz × 14% ≈ 269 (1)<br />

So 269 concurrent messages per second should be decodable. Practical tests have shown that this number holds up. This results in a capacity of more than 10 million frames per day. Should this ever become insufficient, there is always the possibility to decrease the sensitivity of the base stations and then add more in the desired area. The cooperative approach (all base stations pick up all the signals they receive and relay them to the Sigfox cloud) makes it easy to enhance the network in terms of capacity and coverage.<br />
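The capacity figures above can be reproduced with a short calculation:<br />

```python
# Base-station capacity estimate from the text: a 192 kHz channel,
# 100 Hz messages, and a 14% maximal load for 99.99% QoS.

def concurrent_messages(channel_hz: float = 192_000.0,
                        message_hz: float = 100.0,
                        max_load: float = 0.14) -> int:
    # Number of 100 Hz message slices the channel can carry at the given load.
    return round(channel_hz / message_hz * max_load)

def frames_per_day(messages_per_second: int) -> int:
    # Daily capacity if that concurrency is sustained every second.
    return messages_per_second * 86_400
```

This yields 269 concurrent messages per second and a daily capacity above 23 million frames, comfortably over the quoted 10 million frames per day.<br />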

1<br />

“ETSI EN 300 220”: Short Range Devices (SRD) operating<br />

in the frequency range 25 MHz to 1 000 MHz<br />



IV. MINIMISING THE MESSAGE OVERHEAD<br />

A. Uplink Frame Characteristics<br />

Besides the payload, a device has to have a unique ID, the message must be checked to see whether it was received and decoded correctly, the base stations must be able to tune in on a message, and so on. Thus, a message consists primarily of:<br />

+----------+---------------+-----------+---------+-----+-----+<br />
| Preamble | Sync & Header | Device Id | Payload | MAC | CRC |<br />
+----------+---------------+-----------+---------+-----+-----+<br />

• Preamble: 19 bits, always 0b1010101010101010101<br />
• Sync & Header: 17 bits and a 12 bit counter<br />
• Device Id: 32 bits<br />
• Payload: 0 – 96 bits<br />
• MAC (Authentication): 16 – 40 bits<br />
• CRC: 16 bits<br />

B. Downlink Frame Characteristics<br />

For downlink messages, the format is similar, but additionally a forward error correction was introduced. One of the reasons is that there is no repetition on the downlink. Below is the structure of a downlink message:<br />

+----------+------+-----+---------+-----+-----+<br />
| Preamble | Sync | ECC | Payload | MAC | CRC |<br />
+----------+------+-----+---------+-----+-----+<br />

• Preamble: 91 bits, always 0b1010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101<br />
• Sync: 13 bits, always 0b1001000100111<br />
• ECC (Forward Error Correction): 32 bits<br />
• Payload: 0 – 64 bits<br />
• MAC (Authentication): 16 bits<br />
• CRC: 8 bits<br />

The downlink is always 600 bps in all zones and occupies 1600 Hz. Gaussian FSK was chosen as the modulation. As the base stations operate in listen-before-talk mode (as opposed to the devices), they must obey a different duty cycle and are also not limited to the e.g. 14 dBm for devices in the ETSI Zone; here 27 dBm can be used. For frequency stability, the drift of the downlink (base station sends, device receives) is lower, and the center frequency is chosen in correspondence to the received frequency of the preceding uplink message. The base station that sends the downlink is the one with the best reception quality of the corresponding uplink.<br />
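From the field sizes listed above, the minimum and maximum uplink frame lengths can be sketched. Treating "Sync & Header" as 17 + 12 bits is an interpretation of the bullet list, not something the text states explicitly.<br />

```python
# Uplink frame length bounds from the listed field sizes (in bits).
# "sync_header" as 17 + 12 bits is an assumption based on the bullet above.

UPLINK_FIELDS = {
    "preamble": 19,
    "sync_header": 17 + 12,   # sync/header plus the 12-bit counter
    "device_id": 32,
    "payload": (0, 96),       # variable-length
    "mac": (16, 40),          # variable-length authentication tag
    "crc": 16,
}

def frame_bits(fields):
    """Return (min_bits, max_bits) over all field size choices."""
    lo = hi = 0
    for size in fields.values():
        a, b = (size, size) if isinstance(size, int) else size
        lo += a
        hi += b
    return lo, hi
```

This gives a range of 112 to 232 bits, i.e. at the 100 bps uplink rate a single frame lasts roughly 1 to 2.3 seconds, consistent with three frames fitting into roughly 6 seconds as described later.<br />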

V. RADIO CHARACTERISTICS<br />

The occupied bandwidth in the uplink (device transmits, base stations receive) was already given above: it is 100 Hz in the ETSI Zone (RC1) and 600 Hz in other zones (FCC, RC2). Corresponding to the bandwidth, the data rate is 100 bps when occupying 100 Hz and 600 bps when using 600 Hz. The maximum allowed emission power is regulated by the corresponding regulatory organizations. To be able to manufacture cheap devices, the frequency drift is also not too critical, as long as it is not too significant; a regular crystal can achieve this. For resilience reasons, a differential PSK is used in binary mode:<br />

Figure 3: 2GFSK-Modulation [Illustration by Sigfox]<br />

VI. POWER CONSIDERATIONS<br />

A. Initial Power Planning<br />

Not only sending, but also receiving a message and the listening time beforehand put considerable strain on the battery. On average, sending can be considered 10,000 times more power hungry than idle / deep sleep for the ICs used. Listening and receiving use around half the power of transmitting (but this depends heavily on the transceiver used!).<br />

B. Lack of Signaling<br />

As there is no signaling at all in the network and the radiated power is always the same, a very good lifetime / power usage prediction can be created for different IoT customer scenarios. Depending on the transceiver, the implementation and the number of messages sent and received, a lifetime of over 5 years on a standard AA battery has been shown, with battery aging being the biggest concern.<br />
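A back-of-the-envelope lifetime estimate illustrates why such predictions are possible when there is no signaling. Every current and capacity value below is an illustrative assumption, not a figure from the text.<br />

```python
# Rough battery lifetime estimate for a device that only wakes to transmit.
# All electrical values are illustrative assumptions.

def lifetime_years(battery_mah: float = 2500.0,      # AA-class cell capacity
                   sleep_ua: float = 5.0,            # deep-sleep current
                   tx_ma: float = 50.0,              # transmit current
                   tx_seconds_per_msg: float = 6.0,  # ~3 uplink frames per message
                   msgs_per_day: int = 4) -> float:
    # Charge drawn per day while sleeping and while transmitting, in mAh.
    sleep_mah_per_day = sleep_ua / 1000.0 * 24.0
    tx_mah_per_day = tx_ma * tx_seconds_per_msg * msgs_per_day / 3600.0
    return battery_mah / (sleep_mah_per_day + tx_mah_per_day) / 365.0
```

With these assumptions the model predicts well over 5 years; real lifetimes depend heavily on the transceiver, the number of messages and battery aging, as noted above.<br />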

VII. DEVICE INITIATED COMMUNICATIONS<br />

Figure 4 gives a qualitative overview of a complete downlink cycle. After the three uplink frames are sent in roughly 6 seconds, depending on the payload, the device goes into sleep mode. 20 seconds after the end of the first frame, it wakes up and tunes in on the expected downlink frequency. When the downlink is received, within a maximum timeout of 25 s, the transceiver sends, 1.4 s after the end of the downlink frame, a control frame as an acknowledgement containing the temperature, voltage (VDD idle & VDD tx) and the RSSI. Downlink messages do not have to be sent, but when the bit for a downlink is set, the device will listen for 25 s anyway. An upload-only cycle, in comparison, ends after the first three peaks.<br />

Figure 2: DBPSK-Modulation [Illustration by Sigfox]<br />

Figure 4: Power consumption over time for a DL cycle<br />



A. Downlink Cycle Considerations<br />

As the transceivers would waste considerable power in an always-on listen mode, the resulting battery life is not manageable from an economic point of view; that is why bidirectional communication can only be established from the device.<br />

B. Battery Strain<br />

Over time, the battery will have problems supplying the peak<br />

power consumption in transmission or reception mode. The<br />

specified waiting time in between reduces the strain on the<br />

battery, as it can recover for a brief period of ~16 seconds.<br />

C. Security<br />

An additional advantage of device-initiated communication is hardened security: without a previous uplink, the device cannot receive a downlink, as it is simply not in listening mode. That hardens the device against malicious frames intended for misuse.<br />

VIII. OUTLOOK<br />

To further enhance the successful reception of messages, either brute-force methods or insights, possibly gained together with the customers, can be used. The points stated below only give a hint of possible advanced recovery methods to come; they are neither complete nor fully described.<br />

ACKNOWLEDGMENT<br />

The author wants to thank his technical colleagues in France for deep and insightful discussions over the last 18 months.<br />

REFERENCES<br />

[1] “ETSI EN 300 220”: Short Range Devices (SRD) operating<br />

in the frequency range 25 MHz to 1 000 MHz<br />

Alexander Lehmann received his B.Eng. in<br />

electrical engineering from the University of Applied<br />

Sciences in Munich and his M.Eng. from the<br />

Deggendorf Institute of Technology. Before joining<br />

Sigfox, he worked at Advanced Micro Devices. His<br />

main interests are in the fields of video compression<br />

and advanced communications and lecturing about<br />

the latter to customers.<br />

A. Forward ECC<br />

As already specified for the downlink, a forward error correction can also be considered for the uplink.<br />

B. Combined Reception and Recombination<br />

When a decoded message at the base station has a CRC mismatch, it is currently discarded. If more than one frame was received and/or the message was received at more than one base station, there are multiple copies of the frame; these can be compared and recombined until the CRC matches.<br />
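A minimal sketch of such recombination, assuming a simple bitwise majority vote across three received copies and a generic CRC-16 check. The polynomial used here is an assumption for illustration; the actual Sigfox CRC is not specified in the text.<br />

```python
# Recombining corrupted copies of a frame until the CRC matches.
# crc16 uses a generic CCITT-style polynomial; the real Sigfox CRC is an assumption.

def crc16(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def majority_vote(copies):
    """Bitwise majority vote across equal-length corrupted copies of one frame."""
    result = bytearray(len(copies[0]))
    for i in range(len(copies[0])):
        for bit in range(8):
            ones = sum((c[i] >> bit) & 1 for c in copies)
            if ones * 2 > len(copies):
                result[i] |= 1 << bit
    return bytes(result)
```

If each copy is corrupted in different bit positions, the vote recovers the original frame, which can then be confirmed against the transmitted CRC.<br />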

C. Expecting (parts of) the Message / Payload<br />

When working together with the customer, insight can be gained into what the payload (or parts of it) should contain. The more payload bits are known, the fewer combinations must be tried to get a CRC match. The counter is also known at the Sigfox cloud. Static devices with transmission history are another example, as the device IDs are then known at the usual receiving base stations. In its current protocol implementation, the header can only take a significantly smaller number of values than the 17 bits could represent. All these known, or easily achieved, educated guesses can significantly reduce the number of tries needed for a CRC success.<br />

D. Deep learning<br />

To automate this process further, patterns in the messages can be recognized, and over a significant number of messages these patterns can then predict what the coming messages should look like. If there are deviations and the CRC fails, an approach leaner than brute force can be used.<br />



Why LTE Cat M1 and NB-IoT make perfect sense for<br />

digitization and smart predictive infrastructure<br />

Ludger Boeggering<br />

Market Development Manager<br />

u-blox AG<br />

Thalwil, Switzerland<br />

Ludger.Boeggering@u-blox.com<br />

Abstract— The newly implemented LPWA technologies<br />

Narrowband-IoT (NB-IoT or Cat NB1) and LTE Cat M1<br />

combine the advantage of using commonly available infrastructure with low cost, low power consumption, deep in-building penetration and high numbers of simultaneously operating units.<br />

NB-IoT provides its advantages in efficient battery operation<br />

and in-building penetration, opening the possibility for coverage<br />

underground and deep within factory buildings. By design, NB-<br />

IoT considers cost-optimized deployment essential due to the<br />

“clean slate” approach. NB-IoT does not need the intelligence to<br />

coexist with classic 4G traffic, hence there is a fair chance to<br />

reach customer cost expectations.<br />

Due to the fact that LTE Cat M1 technology is much closer to<br />

the standard LTE network, this technology allows applications<br />

which need more bandwidth and LTE-like latency. Nevertheless,<br />

LTE Cat M1 also provides extensive power saving modes and<br />

improved coverage.<br />

In combination with the security concept for the whole communication module, security by design addresses upcoming security requirements for the industrial IoT.<br />

A further aspect of IoT is the remote management of a large<br />

and growing category of connected devices; those with limited<br />

bandwidth and those viable only at very low production costs.<br />

For these types of devices a standard has been introduced by the Open Mobile Alliance (OMA), called Lightweight M2M (LwM2M).<br />

Keywords—LTE; Cat M1; Cat NB1; 4G; 3G; 2G; low-power<br />

wide-area; LPWA; IoT; Internet of Things; NarrowbandIoT; NB-<br />

IoT; eMTC; 3GPP; licensed spectrum; smart; electricity; utility;<br />

gas; water; heat; power; city; building; environmental; agriculture;<br />

security; connected health<br />

I. INTRODUCTION AND TECHNOLOGY BRIEF<br />

Different wireless LPWA technologies have been invented in the past few years with the target of being specifically optimized for the Internet of Things (IoT). It is expected that the IoT will allow for numerous services in different industrial and consumer application areas. Examples of applications are VIP/pet/bike tracking, assisted living/medical, security systems and detectors, agriculture tracking, water/gas/heat/electricity metering, vending, fleet management, waste-bin/tank monitoring, lighting, parking and traffic management, etc.<br />

In general, the invented technologies operate in either a licensed or an unlicensed frequency environment. There are current proprietary low-power wide-area (LPWA) technologies like Sigfox, LoRa or RPMA, but also the fast-approaching 3GPP-standardized cellular IoT technologies LTE Cat NB1 (or Narrowband-IoT, NB-IoT) and LTE Cat M1. According to Figure 1, each of the technologies has its specific features; hence technology selection should always be driven by the application requirements and by technology preference itself.<br />

u-blox has been involved in early stage trials and proof of<br />

concepts for NB-IoT, LwM2M and LTE Cat M1, covering a<br />

range of use cases.<br />

The presentation will give deep dive insights into these<br />

communication infrastructure technologies. Part of the<br />

presentation will cover each technology in detail by using<br />

dedicated use cases from the energy market and from predictive<br />

maintenance.<br />

Mobile network operator involvement will also be outlined, with an outlook into the challenges of handling the future millions of devices on their networks. Finally, this presentation will outline future 3GPP standards for the IoT.<br />

Fig. 1. Technology overview<br />

NB-IoT is a clean slate technology that can be implemented<br />

into radio and core networks of existing LTE or 2G cellular<br />

networks. The radio network supports work with simple, low<br />

cost devices. The transmission and higher layer protocols help<br />

devices consume less power with the aim of achieving a battery<br />



life of over ten years. Finally, extended coverage is provided<br />

for deep indoor penetration and rural areas.<br />

Fig. 2. NB-IoT overview<br />

LTE Cat M1 is a low‐power wide‐area (LPWA) air<br />

interface that lets you connect IoT and M2M devices with<br />

medium data rate requirements (375 kb/s upload and download<br />

speeds in half duplex mode). It enables longer battery<br />

lifecycles and greater in‐building range as compared to<br />

standard cellular technologies such as 2G, 3G or LTE Cat 1.<br />

Key features include:<br />

Support of voice functionality via VoLTE<br />

Full mobility and in‐vehicle hand‐over<br />

Low power consumption<br />

Extended in‐building range<br />

Fig. 3. LTE Cat M1 overview<br />

II. EXCEEDING EXPECTATIONS<br />

LTE Cat M1 is part of the same 3GPP Release 13 standard that also defined Narrowband-IoT (NB-IoT or LTE Cat NB1); both are LPWA technologies in the licensed spectrum. With uplink and downlink speeds of 375 kb/s in half duplex mode, Cat M1 specifically supports IoT applications with low to medium data rate needs. At these speeds, LTE Cat M1 can deliver remote firmware updates over-the-air (FOTA) within reasonable timeframes, making it well-suited for critical applications running on devices that may be deployed in the field for extended periods of time.<br />

Battery life of up to 10 years on a single charge in some use cases also contributes to lower maintenance costs for deployed devices, even in locations where end devices may not be connected directly to the power grid.<br />

As compared to NB-IoT, LTE Cat M1 is ideal for mobile use cases, because it handles hand-over between cell towers similarly to high speed LTE. For example, if a vehicle moves from point A to point B, crossing several different network cells, a Cat M1 device behaves like a cellular phone and never drops the connection. An NB-IoT device needs to re-establish a new connection after a new network cell is reached.<br />

Another benefit is the support of voice functionality via VoLTE (voice over LTE) for applications requiring a level of human interaction, such as certain health and security applications (e.g. stay-in-place solutions and alarm panels).<br />

III. LANDSCAPE OF APPLICATIONS<br />

The IoT targets the connectivity of things to autonomously exchange status information and sensor data with each other and with cloud-based information platforms. Different technologies are currently in use or will be used in the future, including wired and short range wireless technologies.<br />

There are some applications which are best operated over cellular IoT technologies, especially LPWA. Such applications include smart metering and battery powered sensors.<br />

Fig. 4. NB-IoT applications<br />

Out of these applications, a number of key technology requirements can be identified to support massive deployment:<br />

• Lowest power consumption<br />
• Low device cost<br />
• Low deployment cost<br />
• In-building penetration and extended coverage<br />
• High scalability and a massive number of devices<br />



Fig. 5. LTE Cat M1 applications<br />

A. Automotive & transportation<br />

LTE Cat M1 supports full hand-over between network cells from a moving vehicle and is therefore well-suited for mobile use cases with low to medium data rate needs, such as vehicle tracking, asset tracking, telematics, fleet management and usage-based insurance.<br />

B. Smart metering<br />

Cat M1 is also ideal for monitoring metering and utility applications via regular, small data transmissions. Network coverage is a key issue in smart metering rollouts. Since meters are commonly located inside buildings or basements, Cat M1's extended range leads to better coverage in hard-to-reach areas.<br />

NB-IoT is well suited for monitoring gas and water meters via regular, small data transmissions. Meters have a very strong tendency to turn up in difficult locations, such as in cellars, deep underground or in remote rural areas. NB-IoT has excellent coverage and penetration to address this issue.<br />

C. Smart buildings<br />

Cat M1 can easily provide basic building management functionality, such as HVAC, lighting and access control, with its enhanced indoor range. Since it also features voice functionality via VoLTE, it is well-suited for critical applications like security systems and alarm panels.<br />

NB-IoT connected sensors can send alerts about building maintenance issues and perform automated tasks, such as light and heat control. NB-IoT can also act as a backup for the building broadband connection. Some security solutions may even use LPWA networks to connect sensors directly to the monitoring system, as this configuration is more difficult for an intruder to disable, as well as easier to install and maintain.<br />

D. Smart cities<br />

Within smart cities, Cat M1 can meet a variety of needs and effectively control street lighting, determine waste management pickup schedules, identify free parking spaces, monitor environmental conditions, and survey the condition of roads in a matter of milliseconds. NB-IoT can likewise help local government control street lighting, determine when waste bins need emptying, identify free parking spaces, monitor environmental conditions, and survey the condition of roads.<br />

E. Consumers<br />

NB‐IoT will provide wearable devices with their own<br />

long‐range connectivity, which is particularly beneficial for<br />

people and animal tracking. Similarly, NB‐IoT can also be<br />

used for health monitoring of those suffering from chronic or<br />

age‐related conditions.<br />

F. Agricultural and environmental<br />

NB‐IoT connectivity will offer farmers tracking<br />

possibilities, so that a sensor containing a u‐blox NB‐IoT<br />

module can send an alert if an animal’s movement is out of the<br />

ordinary. Such sensors could be used to monitor the<br />

temperature and humidity of soil, and in general to keep track<br />

of attributes of land, pollution, noise, rain, etc.<br />

G. Connected health<br />

Due to its extended in‐building range, voice support and<br />

mobility, Cat M1 is also a well‐matched air interface choice<br />

for connected health applications, such as outpatient<br />

monitoring and stay‐in‐place solutions.<br />

IV. TECHNOLOGY HIGHLIGHTS<br />

NB-IoT optimizes coverage, device battery life / power consumption and costs, as well as capacity for a massive number of connected devices, and supports a scalable solution for low data rates. It can be deployed either in shared spectrum together with legacy LTE, or stand-alone, e.g. on a re-farmed 2G carrier, with a narrow bandwidth of about 180 kHz. NB-IoT can coexist with LTE and, in standalone operation, also coexists with 2G, 3G and 4G. Due to changes in synchronization signal design and simplified physical layer procedures, the complexity of NB-IoT devices is even lower than that of 2G devices.<br />

One of the drivers for inventing this new technology is the requirement of extended coverage. Smart meters are a simple example, because they are often installed in the basements of buildings and surrounded by concrete. Methods to optimize indoor coverage use a combination of techniques, such as power boosting of data and reference signals, repetition, retransmission, more tolerant modulation schemes and accepting lower signal strength levels.<br />

Together, these methods provide an increased link budget of up to 23 dB compared with 2G or 3G technology, at the trade-off of increased latency in extreme conditions due to repeated transmissions.<br />
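The coverage gain from repetition can be approximated as an ideal combining gain of 10·log10(N) dB for N repetitions. This is a simplification that ignores implementation losses, used here only to show the order of magnitude.<br />

```python
import math

def repetition_gain_db(n_repetitions: int) -> float:
    # Ideal combining gain from repeating a transmission n times.
    return 10.0 * math.log10(n_repetitions)
```

Under this approximation, roughly 200 repetitions are needed for a 23 dB gain, which also illustrates why latency grows so sharply under extreme coverage conditions.<br />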



nomadic and globally usable products allow more flexibility<br />

for OEMs and customers, reducing cost. Especially for mobile<br />

IoT applications, such as wearables and tracking devices, a<br />

technology providing global roaming increases the usability.<br />

Due to 3GPP standardization, NB-IoT equipped products can<br />

be used everywhere a NB-IoT network is available.<br />

Fig. 6. Optimized coverage<br />

Compared with mesh radio solutions and other technologies<br />

in unlicensed spectrum, NB-IoT and LTE Cat M1 are operating<br />

in a controlled/licensed frequency spectrum (see Figure 7).<br />

Operating in a licensed environment allows the management of<br />

interference and offering of quality of service.<br />

V. ARCHITECTURE<br />

NB-IoT technology is designed such that it can be used in<br />

areas beyond the radio coverage of current cellular standards<br />

(GSM, UTMS or LTE) and in applications which typically<br />

require low power consumption, especially when run from<br />

battery power for many years. A corollary of this is that the<br />

devices will generally send small amounts of data infrequently;<br />

a typical usage scenario might be 100 bytes sent twice per day.<br />

Even higher data volume and number of transmissions are<br />

possible, but with an impact on power consumption.<br />

By design NB-IoT is dedicated to initiate the connection<br />

and data transfer. The system operation is analogous to SMS in<br />

that it is a datagram-oriented, stored-and-forward system,<br />

rather than a GPRS-like IP pipe. This is because NB-IoT<br />

devices can spend most of their time asleep, making possible to<br />

operate over long time from a battery.<br />

The NB-IoT standard specifies three different<br />

communication options: IP data, Non-IP data and SMS, as<br />

shown in the Figure 8 and Figure 9. Among these, IP data<br />

transmission using UDP/IP is currently the most used method,<br />

fully supported worldwide by network operators. IP data<br />

communication UDP/IP allows the use of the existing core<br />

network and supports efficient power consumption to allow<br />

long battery life time.<br />

Use of TCP/IP in a battery operated application is not the<br />

best possible way to reach longest battery life time due to<br />

higher power consumption for retransmissions and<br />

acknowledgement messages.<br />

Fig. 7. Technology Comparison<br />

Another key advantage of NB-IoT is its support for two-way communication. Depending on the selected power saving modes, an NB-IoT equipped device can communicate bidirectionally, with lower latency for control purposes and with higher latency for pure data transfer. This bi-directional capability can be used for any kind of interaction with the connected device, e.g. data collection, remote management and control, as well as firmware updates over the air (FOTA). FOTA is an extremely important function for adapting devices to changing security requirements in the future.<br />

The cost of implementation (CAPEX) and operation (OPEX) of a network must also be taken into consideration. With NB-IoT, in most cases a simple software upgrade allows easy implementation and deployment of this technology in the field, without the need to set up a completely new network, including the search for the best positions for the base stations.<br />

GSM was unique among the global standards for mobile communications, as it operated in a majority of the world. NB-IoT has the chance to be as pragmatic and globally usable as 2G was in the past. Even most of the usable applications are<br />

Fig. 8. NB-IoT communication methods<br />

Fig. 9. Protocol structure of a NB-IoT network<br />



Using IP data, an NB-IoT module is able to send raw data through UDP sockets to a destination IP address. The data sent over the socket via AT commands is not wrapped in any additional layer by the module: the data provided is exactly the data that is sent.<br />
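This behavior can be sketched with plain sockets: whatever bytes the application hands to the UDP socket arrive unchanged at the receiver. The host, port and payload below are illustrative assumptions, not part of any NB-IoT specification.

```python
import socket

# Minimal sketch of the receiving (IoT platform) side: a UDP server that
# gets the raw bytes exactly as the module wrote them to its socket.

def open_udp_receiver(host="127.0.0.1"):
    """Bind a UDP socket on an ephemeral port and return it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, 0))
    return sock

server = open_udp_receiver()

# Stand-in for the NB-IoT module: no extra framing is added by the module,
# so what the application hands over is exactly what is transmitted.
module = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
module.sendto(b"temp=21.5;batt=98", server.getsockname())

payload, _ = server.recvfrom(2048)
print(payload)  # the datagram arrives unwrapped
```

The same symmetry holds in the real system: the payload handed to the module's socket AT command is the payload the platform stores.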

In a typical NB-IoT application, as shown in Figure 10, at<br />

the far left there is the end device that contains a u-blox NB-<br />

IoT module. The module communicates over the radio network<br />

with a cell tower that supports NB-IoT. The cellular network<br />

links the cell tower with an IoT platform that stores uplink<br />

messages from the module. The server, on the right,<br />

communicates with the IoT platform to retrieve uplink<br />

messages and to send downlink messages to the module.<br />

Fig. 10. Typical NB-IoT application<br />

The end device, the IoT platform and the cloud server application can use additional protocols on top of UDP/IP, but these are transparent to the NB-IoT module and have to be implemented outside the module. A possible IoT scenario is shown in Figure 11, where the CoAP protocol is implemented on both the end device (outside the module) and the IoT platform. In this example, the IoT cloud server uses the AMQP interface to send and receive data to/from the IoT platform.<br />

Fig. 11. End-to-end protocol architecture<br />

NB-IoT technology is not session oriented; depending on the application, UDP/IP is usually preferred over TCP/IP, and the latency between packets in uplink and downlink may vary from milliseconds to minutes. Latency depends on the coverage conditions: in poor coverage conditions the latency increases. UDP sockets do not create connections to the servers, since UDP is a connection-less datagram protocol, and messages sent by the end device (by the NB-IoT module) may never be received by the server. An NB-IoT application should take all these aspects into account. Message acknowledgement is required for a range of use cases, such as grid control and near real-time monitoring.<br />
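Since the transport gives no delivery guarantee, such acknowledgement has to live in the application layer. The sketch below shows one simple pattern: tag each datagram with a message ID and resend until a matching ACK arrives. The 2-byte ID, the `b"ACK"` reply format and the retry count are illustrative assumptions.

```python
import socket
import threading

def send_with_ack(sock, payload, msg_id, dest, retries=3, timeout=1.0):
    """Send a datagram and wait for a matching ACK; resend on timeout."""
    frame = msg_id.to_bytes(2, "big") + payload
    sock.settimeout(timeout)
    for _ in range(retries):
        sock.sendto(frame, dest)
        try:
            reply, _ = sock.recvfrom(64)
            if reply == b"ACK" + frame[:2]:
                return True
        except socket.timeout:
            continue  # uplink or ACK lost: try again
    return False

# Loopback demo: a stand-in server that acknowledges one message.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))

def ack_once():
    frame, addr = server.recvfrom(256)
    server.sendto(b"ACK" + frame[:2], addr)

t = threading.Thread(target=ack_once)
t.start()
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
delivered = send_with_ack(client, b"grid-status", 7, server.getsockname())
t.join()
print(delivered)  # True
```

On a real NB-IoT link the retry interval would be chosen with the latency figures above in mind, since a downlink ACK may take minutes in poor coverage.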

An NB-IoT module is designed to enter a power save mode, called deep-sleep mode, whenever the network activities allow it, in order to limit the end device power consumption. A module in deep-sleep mode can reach an extremely low current consumption of about 3 µA (at 3.6 V). After the module connects to a base station to send a message, it stays connected to the base station for a period of time after the last communication; this time is based on a network-defined timer called the Radio Resource Control (RRC) release timer. The device then returns to power saving mode as soon as possible (see Figure 12 for more details).<br />

An NB-IoT network presents several timers (the most relevant ones are shown in Figure 12), and all of these contribute to the overall energy balance of an IoT application:<br />
• The RRC release timer defines the time a module remains connected to the network after a Tracking Area Update (TAU) or after an uplink message is sent.<br />
• The timer T3324 is the time window used by the network to send paging events. The NB-IoT module enters power saving mode after T3324 has elapsed.<br />
• The timer T3412 defines the time interval between two consecutive Tracking Area Updates.<br />

Fig. 12. NB-IoT network activities and timers<br />
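The way these timers shape the energy budget can be sketched as a back-of-the-envelope calculation over one uplink cycle: transmit, stay connected until the RRC release timer expires, listen for paging until T3324 expires, then sleep. All current figures and timer values below are illustrative assumptions, not numbers from the NB-IoT specification.

```python
def avg_current_ma(tx_s, rrc_release_s, t3324_s, period_s,
                   i_tx=220.0, i_connected=40.0, i_paging=6.0, i_psm=0.003):
    """Average current in mA over one uplink period of period_s seconds."""
    sleep_s = period_s - tx_s - rrc_release_s - t3324_s
    charge_mas = (tx_s * i_tx + rrc_release_s * i_connected
                  + t3324_s * i_paging + sleep_s * i_psm)
    return charge_mas / period_s

# Two uplinks per day (one every 43200 s): the awake phases dominate the
# budget even though the module sleeps more than 99.8% of the time.
print(round(avg_current_ma(tx_s=5, rrc_release_s=20, t3324_s=60, period_s=43200), 3))  # 0.055
```

Shortening the RRC release timer or T3324 directly shrinks the two largest terms, which is why these network-side settings matter so much for battery life.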

An NB-IoT module is able to send an uplink message whenever requested by the application. In contrast, downlink messages (i.e. those coming from the IoT platform) can be received only while the RRC release timer is running, i.e. after the transmission of an uplink message. To handle this, when a message is sent from the end device to the network, the cloud server knows the module is active and connected, and should then send any pending downlink messages.<br />

An application that needs to send many messages throughout the day should keep the NB-IoT module powered on and set to idle mode, without using PSM or eDRX. This means that when the module becomes active again there is no need to re-attach or re-establish PDN connections; the module is always connected to a base station. This approach limits the signalling operations required to (re)register the module with the network and in turn reduces the application's power consumption.<br />

VI. CONCLUSION<br />

The newly implemented LPWA technologies Narrowband-IoT (NB-IoT or Cat NB1) and LTE Cat M1 combine the advantage of using commonly available infrastructure with low cost, low power consumption, deep in-building penetration and high numbers of simultaneously operating units.<br />



NB-IoT offers advantages in efficient battery operation and in-building penetration, opening the possibility of coverage underground and deep within factory buildings. By design, NB-IoT treats cost-optimized deployment as essential thanks to its “clean slate” approach: NB-IoT does not need the intelligence to coexist with classic 4G traffic, so there is a fair chance of meeting customer cost expectations.<br />

Because LTE Cat M1 technology is much closer to the standard LTE network, it allows applications that need more bandwidth and LTE-like latency. Nevertheless, LTE Cat M1 also provides extensive power saving modes and improved coverage.<br />

In combination with the security concept for the whole communication module, security by design addresses upcoming security requirements for the industrial IoT.<br />

A further aspect of IoT is the remote management of a large and growing category of connected devices: those with limited bandwidth and those viable only at very low production costs. For these types of devices, the Open Mobile Alliance (OMA) has introduced a standard called Lightweight M2M (LwM2M).<br />

u-blox has been involved in early stage trials and proof of<br />

concepts for NB-IoT, LwM2M and LTE Cat M1, covering a<br />

range of use cases.<br />



Enabling firmware updates over LPWANs<br />

Jan Jongboom<br />

Internet Services Group, Arm<br />

Amsterdam, the Netherlands<br />

jan.jongboom@arm.com<br />

Johan Stokking<br />

The Things Industries<br />

Amsterdam, the Netherlands<br />

johan@thethingsindustries.com<br />

Abstract—Firmware updates are essential for large-scale<br />

deployment of connected devices. Security patches protect<br />

customer and business data, and new functionality, optimization<br />

and specialization extend the lifetime of devices. This paper<br />

discusses firmware updates over the most challenging type of<br />

networks: low-power and long-range networks.<br />

Keywords—LPWAN, LoRaWAN, Sigfox, NB-IoT, firmware<br />

updates<br />

I. INTRODUCTION<br />

Most Internet of Things (IoT) devices require both long range and low power consumption, with a battery life that can last years. Traditional wireless network technologies, such as cellular and Wi-Fi, cannot accommodate these needs. To meet the requirements of these devices, new network technologies called “Low-Power Wide Area Networks” (LPWANs) have emerged in the past few years. Networks such as LoRaWAN, Sigfox and NB-IoT are deployed using low-cost radio chips with kilometers of range and low battery consumption.<br />

A downside of these networks is that the data rates are much lower than those of traditional radio networks. Data rates in LPWANs are measured in bits per second, rather than megabytes per second. Additionally, many of these networks operate in the unlicensed spectrum (ISM band), which requires devices to adhere to duty cycle limitations, allowing them to transmit only a fraction of the time while also suffering from interference. These characteristics make it difficult to support firmware updates over the air. As a consequence, devices deployed in the field cannot be updated easily: especially devices deployed in places that are almost impossible to reach, or where the cost of sending a technician is prohibitive for thousands of devices in a variety of places.<br />

Not being able to update the firmware on IoT devices easily is an extreme challenge when deploying at scale. First, software is never 100% secure; we saw many examples of this in 2017. Second, these devices may have to last up to ten years, so keeping them up to date with the latest standards and protocols is important. Lastly, the ability to add functionality or specialize devices throughout their lifetime, from manufacturing and distribution to transfer of ownership or change of purpose, is critical in many business cases.<br />

The key requirements for firmware updates are the abilities to:<br />
1. Send data to multiple devices at the same time (so-called multicast) in a manner that is efficient in terms of power consumption and channel utilization.<br />
2. Recover from lost packets.<br />
3. Verify the authenticity and integrity of the firmware while following standards end-to-end.<br />

This article discusses these challenges one by one and<br />

presents a solution.<br />

II. MULTICAST<br />

Unlike cellular or Wi-Fi, where a device maintains a connection with the network at all times, most LPWANs are uplink oriented: sending data (uplink) is more important than receiving data (downlink). It is only possible to send a downlink message at set times, which LPWANs refer to as RX windows. These RX windows open only shortly after a transmission, which is great for battery life because the device does not need to maintain a connection with the network and can stay in sleep mode as much as possible. LoRaWAN Class A, Sigfox and LTE-M in Power Save Mode (PSM) follow this model.<br />

For sending firmware images, however, this is terrible, because you need downlink-oriented transmission of many packets. With a payload size of 115 bytes taking 615 ms of air time (a typical transmission speed for LoRaWAN [1]), you need to exchange 891 messages to send a 100 KB firmware image. Because of the 1% duty cycle requirement in many markets (including Europe), this takes over 9 hours for a single device, assuming no packet loss. In addition, the gateways may cover hundreds or thousands of devices that are also subject to duty cycle limitations, which means it may take weeks to update a fleet of devices. Finally, every received packet first requires a transmission, which consumes a lot of energy (a typical LPWAN radio draws ~50 mA of current in TX and ~9 mA in RX) and uses a lot of the available spectrum.<br />
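The arithmetic in this paragraph can be checked with a short script; reading 100 KB as 102 400 bytes is an assumption on our part.

```python
import math

# A 100 KB image split into 115-byte fragments at 615 ms air time per
# packet, stretched out by a 1% duty cycle.
IMAGE_BYTES = 100 * 1024
FRAGMENT_BYTES = 115
AIRTIME_S = 0.615
DUTY_CYCLE = 0.01

packets = math.ceil(IMAGE_BYTES / FRAGMENT_BYTES)
airtime_total_s = packets * AIRTIME_S              # ~548 s of pure air time
wallclock_h = airtime_total_s / DUTY_CYCLE / 3600  # duty cycle stretches it out

print(packets)               # 891
print(round(wallclock_h, 1)) # well over 9 hours for a single device
```

The duty cycle, not the air time, dominates: roughly half an hour of radio time turns into the better part of a day per device.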

To enable more efficient firmware update capabilities, you<br />

need to implement two features:<br />




1. A way to send the firmware image without requiring the device to transmit first, optimizing the device's duty cycle and power consumption.<br />
2. Multicast support, for updating multiple devices at the same time, optimizing the gateway duty cycle.<br />

The first step is to get all the devices that you need to update to listen at the same time, on the same frequency, at the same data rate and in the same security session. If you load the same keys onto the devices, all devices can receive and decrypt the same packets as if they were one device. Once you are certain that the devices are listening, you can start broadcasting the firmware image without the devices needing to transmit first. This means that you need to schedule firmware updates, typically hours or days in advance, depending on the sleep behavior of the devices requiring the update.<br />

Because the network can send messages continuously, you can transmit the 891 packets (100 KByte) in about nine minutes (at 600 ms time on air per packet). Multicast firmware updates are only achievable for stationary devices, because the selected gateways and the data rate need to be fixed when you create the multicast session, which can be days before the session starts.<br />

III. NETWORK RELIABILITY<br />

This approach affects network reliability. The network still needs to adhere to the duty cycle limitation of the gateways that send the packets, which means a firmware update can render a gateway useless for relaying downlink messages for a while. One way of mitigating this is to have coverage from multiple gateways and to round-robin [2] between gateways during the update. Also, because most LPWAN gateways are only half-duplex, devices cannot use the frequency that the update uses. This is not such a problem on licensed spectrum (NB-IoT / LTE-M1), or in areas with wide unlicensed spectrum available (U.S.), but it is in Europe, where LPWANs are deployed in limited spectrum (LoRaWAN has only eight channels in the EU). A way of mitigating this is to implement frequency hopping, as Weightless-N does.<br />

Another way is for a technician to drive to the deployment<br />

site with a separate gateway and use it for the update. Although<br />

this seems unnatural, it has some benefits when deploying on a<br />

constrained site. Because the reception is more predictable, you<br />

can use a higher data rate when sending the update, which<br />

causes less congestion on the network. Additionally, the update<br />

does not affect normal gateways. This could be an option for<br />

hard-to-reach sensors. This method is also an option for<br />

LPWANs that have a smaller link budget for downlink<br />

messages than for uplink messages, such as Sigfox.<br />

IV. SECURITY FOR FIRMWARE UPDATE OVER MULTICAST<br />

When instructing multiple devices to join a temporary multicast session in which all the devices share the same session keys, there is a potential security risk if one of the devices is compromised: packet injection. This is possible because most LPWANs use a symmetric authentication mechanism; with the multicast session keys, an attacker can send packets as if they came from the server. While this is a serious issue when using multicast without additional security measures, such as when controlling lights simultaneously, a firmware update mechanism requires three additional measures to secure the update process.<br />

First, once the device receives the file, it calculates the<br />

checksum of the data it received. The device sends this<br />

checksum to the network using its private secure session. The<br />

server compares this checksum with the checksum of the data<br />

that it sent. This check fails if the data has been tampered with.<br />

The server responds whether the checksum is correct to each<br />

device individually, on its private secure sessions.<br />

Second, as part of the server's response to indicate the<br />

correctness of the checksum, the server sends the message<br />

integrity code (MIC), which guarantees data integrity to the<br />

device. No one who does not know the device’s private secure<br />

session keys can forge this MIC: only the device and the server<br />

can calculate the same MIC. So the server checks the device's<br />

checksum, and the device checks the server's MIC,<br />

communicating on the device's private secure session.<br />

Third, when an attacker injects random packets, the device may not be able to reconstruct the original image. To prevent devices from running out of power because they keep listening for error correction packets, as presented in the next section, the multicast session should have a lifetime: a fixed limit on the number of messages. When this limit is reached, the device switches back to its private secure session and power-efficient operating mode, and discards all data.<br />
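The first two measures can be sketched in a few lines. The checksum algorithm (SHA-256) and the MIC construction (HMAC-SHA256, truncated to 4 bytes) are illustrative assumptions; a real LPWAN stack defines its own MIC.

```python
import hashlib
import hmac

def device_checksum(firmware):
    """Checksum the device reports over its private secure session."""
    return hashlib.sha256(firmware).digest()

def server_mic(session_key, checksum):
    """MIC the server returns; only holders of the private session key
    can compute it, so the device can authenticate the response."""
    return hmac.new(session_key, checksum, hashlib.sha256).digest()[:4]

firmware = b"\x00" * 1024          # stand-in for the reassembled image
key = b"per-device-session-key"    # never shared with the multicast group

cs = device_checksum(firmware)
mic = server_mic(key, cs)

# Device side: recompute the MIC and compare in constant time.
print(hmac.compare_digest(mic, server_mic(key, cs)))  # True
```

The crucial point is that both checks run over each device's private session, so a compromised multicast key gains an attacker nothing here.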

V. SENDING LARGE BINARY PACKETS OVER A LOSSY NETWORK<br />

In the scheme proposed above, there is no communication between the device and the network while the multicast transmission is in progress; thus, it is not possible to determine which device received which fragments of the firmware update. This is intentional, to minimize spectrum usage, similar to how UDP works. On LPWAN networks there is often no guaranteed quality of service, and packet loss can occur when the device is moving. To deal with high packet loss, you should implement an error-correcting algorithm that does not require communication from the device to the network. One such algorithm is Low-Density Parity Check coding [3].<br />

In the first step, the network sends the firmware as is, fragmented into packets. Next, the network starts sending error correction packets, which the device XORs with what it has already received. Because the fragments have an increasing frame number, the device knows which fragments are missing and can use the correction packets to reconstruct them. The network keeps sending correction packets until all devices confirm that they have reconstructed all the fragments of the firmware update or, in case of extreme packet loss, until the update server has sent all correction packets. With a good error correction algorithm, you need up to five correction packets to correct for three missed fragments.<br />
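The XOR mechanism can be illustrated in its simplest form: one parity packet formed by XORing a group of fragments lets the device rebuild one missing fragment of that group, with no uplink needed. (Real LDPC schemes use many overlapping groups; this is a deliberately reduced sketch.)

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def parity(fragments):
    """Parity packet covering a group of equal-length fragments."""
    p = bytes(len(fragments[0]))
    for f in fragments:
        p = xor_bytes(p, f)
    return p

fragments = [b"frag-0", b"frag-1", b"frag-2", b"frag-3"]
p = parity(fragments)

# Suppose fragment 2 was lost: XORing the parity packet with the
# surviving fragments yields exactly the missing one.
received = fragments[:2] + fragments[3:]
rebuilt = parity(received + [p])
print(rebuilt == b"frag-2")  # True
```

Because XOR is its own inverse, the device only needs to know which frame numbers are missing; everything else is local computation.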

After the device reconstructs the full firmware, the device<br />

switches back to its private secure session and operating mode.<br />

After successfully testing the device's checksum and the<br />

server's message integrity code as presented above, the device<br />

performs the firmware update.<br />



DELTA UPDATES<br />

To minimize the amount of data sent, it is advisable to implement delta updates, in which the network sends only the changed parts of the firmware image instead of the full image. This can reduce the data by 90%. For constrained devices, it is important to choose a linear patch format, which can be applied using little memory. In addition, it is important to verify the authenticity of the firmware after patching, to avoid bricking devices.<br />
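A naive block-level delta illustrates the idea: send only the blocks that changed, each tagged with its index, so the patch can be applied front to back with very little RAM. Production systems use dedicated diff formats; this sketch assumes images of equal length and a toy block size.

```python
BLOCK = 4  # tiny block size for the example

def make_delta(old, new):
    """List of (block index, new bytes) for every changed block."""
    delta = []
    for i in range(0, len(new), BLOCK):
        block = new[i:i + BLOCK]
        if old[i:i + BLOCK] != block:
            delta.append((i // BLOCK, block))
    return delta

def apply_delta(old, delta):
    """Linear, low-memory application of the delta to the old image."""
    image = bytearray(old)
    for idx, block in delta:
        image[idx * BLOCK:idx * BLOCK + len(block)] = block
    return bytes(image)

old = b"AAAABBBBCCCCDDDD"
new = b"AAAABxBBCCCCDDDD"
d = make_delta(old, new)
print(len(d))                      # only 1 of 4 blocks changed
print(apply_delta(old, d) == new)  # True
```

After patching, the device would still verify the manifest hash of the full reconstructed image, as the next section describes.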

VI. CRYPTOGRAPHIC VERIFICATION OF THE FIRMWARE<br />

The protocol devised above only handles raw data integrity of the firmware: it covers timing and message-level security and accounts for packet loss. However, a good firmware update process also requires additional security on top of the network layer, because hijacking the firmware update mechanism is a big attack vector.<br />

To protect against these attacks, you need to program extra<br />

properties into the end device:<br />

• A public key of the owner who is authorized to update<br />

firmware on the device.<br />

• A manufacturer universally unique identifier (UUID).<br />

• A device type UUID.<br />

The firmware update should contain a manifest consisting of the cryptographic hash of the update and the manufacturer and device type that the update applies to, all signed with the manufacturer's private key. Whenever the device receives the update, it can verify that a trusted authority signed it, and that it was meant for this device, because the device holds the manufacturer's public key.<br />

For the cryptographic verification, we suggest using at least single-curve ECDSA with SHA-256 [4], which can be implemented efficiently on constrained devices while still providing adequate security.<br />
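The manifest and the device-side checks can be sketched as follows. Field names and the UUID derivation are illustrative assumptions, and the ECDSA signature is left as a placeholder comment, since the Python standard library has no ECDSA primitive.

```python
import hashlib
import uuid

MANUFACTURER_UUID = str(uuid.uuid5(uuid.NAMESPACE_DNS, "vendor.example"))
DEVICE_TYPE_UUID = str(uuid.uuid5(uuid.NAMESPACE_DNS, "sensor-v2.example"))

def make_manifest(firmware):
    return {
        "hash": hashlib.sha256(firmware).hexdigest(),
        "manufacturer": MANUFACTURER_UUID,
        "device_type": DEVICE_TYPE_UUID,
        # "signature": ECDSA/SHA-256 over the fields above, made with the
        # manufacturer's private key in a real system
    }

def device_accepts(manifest, firmware):
    """Checks the device performs before applying an update (after the
    manifest signature has been verified with the stored public key)."""
    return (manifest["manufacturer"] == MANUFACTURER_UUID
            and manifest["device_type"] == DEVICE_TYPE_UUID
            and manifest["hash"] == hashlib.sha256(firmware).hexdigest())

fw = b"new firmware image"
m = make_manifest(fw)
print(device_accepts(m, fw))         # True
print(device_accepts(m, fw + b"!"))  # False: tampered image is rejected
```

Binding the hash, manufacturer and device type together in one signed structure is what prevents a valid update for one product from being replayed onto another.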

VII. CONCLUSION<br />

Firmware updates are an essential requirement before devices that use LPWANs for connectivity hit the market in volume. By implementing the requirements in this paper, device manufacturers can ship products while assuring their customers of security updates, new functionality, optimizations and specialization throughout the device's lifetime.<br />

To demonstrate that multicast firmware updates are possible, Arm and The Things Industries have developed a reference implementation on top of LoRaWAN that implements the suggestions from this paper. The result is a secure, fast and efficient method of updating constrained devices (under 32K RAM required) in the field. The reference implementation can be ported to other LPWANs, too, and it is licensed under the permissive Apache 2.0 (device firmware) and MIT (network server) licenses. More information can be found at https://mbed.com/fota-lora.<br />

REFERENCES<br />
[1] https://www.thethingsnetwork.org/forum/t/spreadsheet-for-lora-airtime-calculation/1190<br />
[2] https://en.wikipedia.org/wiki/Round-robin_scheduling<br />
[3] https://en.wikipedia.org/wiki/Low-density_parity-check_code<br />
[4] https://en.wikipedia.org/wiki/Elliptic_Curve_Digital_Signature_Algorithm<br />




Real-Time Position Tracking and Finish<br />

Detection with LoRa<br />

Juan-Mario Gruber<br />

Institute of Embedded Systems (InES)<br />

Zurich University of Applied Sciences (ZHAW)<br />

8401Winterthur, Switzerland<br />

gruj@zhaw.ch<br />

Benjamin Brossi<br />

Institute of Embedded Systems (InES)<br />

Zurich University of Applied Sciences (ZHAW)<br />

8401Winterthur, Switzerland<br />

broi@zhaw.ch<br />

Abstract—For outdoor sport events, it is often necessary to track the position of participants or equipment within a defined area in real time. Global navigation satellite systems (GNSS) can determine the position with great accuracy. In addition, using the LoRa radio technology, the data can be transmitted over a distance of up to 4 km under optimal conditions.<br />

Keywords—GPS; LoRa; Real-Time Position Tracking; Low Energy; Energy Harvesting<br />

I. MOTIVATION<br />

Sports events like sailing regattas often require real-time<br />

position data of the participants. Since the participants are often<br />

spread over a wide area, the data must be able to be transmitted<br />

over a large distance.<br />

Fig. 1. Sailing regatta<br />

The Institute of Embedded Systems at Zurich University of Applied Sciences developed a system that transmits real-time position data from a global navigation satellite system (GNSS) live over a long distance to a base station. The base station displays the data on a map and can define a finish line. The aim is to detect the crossing of the finish line. The LoRa standard is implemented for data transmission. In addition, the system is to be optimized for low energy consumption, so that it can be operated for several hours from rechargeable batteries or energy harvesting. In summary, the following objectives can be defined:<br />
• Real-time GNSS data<br />
• Data transfer via LoRa<br />
• Low energy<br />
• Lowest possible cost<br />
• Base station with real-time visualization and finish detection<br />

II. CONCEPT<br />

The new position tracking system developed at the Institute of Embedded Systems at Zurich University of Applied Sciences measures the position of up to 255 independent objects in a large-scale area. The system consists of multiple position trackers, a central base station and PC software (the Position Tracker Manager).<br />

Fig. 2. System block diagram [1]<br />

The system communicates bidirectionally. This makes it<br />

possible to configure and control the trackers from the base<br />

station. The system can automatically perform position-based<br />

evaluations such as crossing the finish line. The data package is kept small, consisting only of object number, position data and time stamps, because only a limited amount of data per hour may be sent in the LoRa band.<br />

The system implements standard LoRa functionality and uses the 868 MHz band. The position trackers are built with the latest very-low-power components and are optimized for low power consumption, which means they are ready to be powered by energy harvesting.<br />

A. Tracker device<br />

The tracker devices determine time and position data from a GNSS and send them via LoRa. The GNSS module L86-M33 from Quectel is used to determine the time and position; GPS, GLONASS and QZSS can be used with this module. The LoRa module iM880B-L from IMST is used for the data transfer. This is a certified module for wireless communication via the LoRa radio standard in the 868 MHz frequency band. The module features an STM32L151 microcontroller with an ARM Cortex-M3 core and an SX1272 LoRa chip from Semtech.<br />

Fig. 3 shows the software state diagram for the tracker device. When the tracker device is switched on, the required peripheral modules of the microcontroller, the LoRa radio and the GNSS module are initialized first. After initialization, the software waits for a valid time from the GNSS module and then synchronizes the RTC to this time. This may take a few minutes, depending on the quality of the satellite signals. When the RTC has been successfully synchronized, the log and TDMA counters are started and the software changes to the idle state.<br />

Fig. 3. Tracker device state diagram<br />

The software changes to the log state on an event, in which the UART is set to receive the data of the GNSS module. If the data has been successfully received, it is buffered, and the data in the buffer is sent to the base station via LoRa. When transmission is complete, the software switches to receive mode, in which it is possible to receive configuration data from the base station. The reception time window is 30 milliseconds, according to the TDMA protocol. If configuration data is received, the new log interval is set and the log and TDMA counters are restarted.<br />

If a low battery event occurs, the log and TDMA counters are deactivated. An attempt is made to determine a last position within 30 seconds; if successful, this position is transmitted in Low Power Mode. All components are then switched to low power mode in order to consume as little power as possible. The tracker device must then be recharged and restarted with the switch.<br />

For the transmission, the smallest possible log interval and the frequency of data packets had to be determined. In order to increase the reliability of data transmission, the current and previous position data are transmitted during each transmission. It turned out that the best results are achieved with a 3-second log interval and a 9-second transmission interval.<br />
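This redundancy scheme can be sketched as follows: positions are logged every 3 s and transmitted every 9 s, and each packet repeats the fixes from the previous window, so one lost packet loses no data. The packet layout itself is an illustrative assumption, not the actual protocol.

```python
LOG_INTERVAL_S = 3
TX_INTERVAL_S = 9
FIXES_PER_WINDOW = TX_INTERVAL_S // LOG_INTERVAL_S  # 3 fixes per transmission

def build_packet(log, tx_index):
    """Fixes logged since the previous transmission plus the window before."""
    start = max(0, (tx_index - 1) * FIXES_PER_WINDOW)
    return log[start:(tx_index + 1) * FIXES_PER_WINDOW]

positions = list(range(12))                      # four windows of three fixes
packets = [build_packet(positions, i) for i in range(4)]

received = packets[:2] + packets[3:]             # packet 2 lost over the air
recovered = sorted({fix for pkt in received for fix in pkt})
print(recovered == positions)  # True: neighbors repeated the lost fixes
```

Doubling each fix roughly doubles the payload but tolerates the loss of any single packet, which matches the field observation that a 3 s / 9 s combination gives the best results.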

Fig. 4. Tracker device<br />

B. Base station<br />

An STM32L152 Nucleo Development Kit from STMicroelectronics is used for the base station. It has an ARM Cortex-M3, the same controller as on the LoRa module iM880B-L from IMST. The LoRa shield SX1272MB2DAS from Semtech is used for the radio connection. The shield is compatible with the Development Kit and is connected via SPI.<br />

When the base station is switched on, it is in the idle state. If data is received via the UART, an interrupt is triggered and the system checks whether the data is valid. If so, the new log interval is saved so that it can be configured via LoRa when data is received from the corresponding tracker device.<br />

The system communicates bidirectionally, so if data has been received, the software switches to transmit mode, in which the new log interval is sent to the tracker device if one has to be configured. If there is no new log interval, the software switches to the data processing mode, where the data received via LoRa is stored. In this state the times of the positions are reconstructed, assigned to the corresponding tracker device ID and stored. The system then uses the time to check whether the position has already been received; if not, the position is sent via the UART interface. The software then changes back to the idle state.<br />
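Because each tracker packet repeats earlier positions, the base station has to filter duplicates before forwarding over UART. A minimal sketch of such a filter, keyed on the (tracker ID, timestamp) pair, might look like this; the data layout is an illustrative assumption.

```python
class PositionStore:
    """Forward each (tracker, timestamp) position exactly once."""

    def __init__(self):
        self.seen = set()

    def accept(self, tracker_id, timestamp, position):
        """Return True (and remember the key) only for unseen positions."""
        key = (tracker_id, timestamp)
        if key in self.seen:
            return False          # already forwarded via UART
        self.seen.add(key)
        return True

store = PositionStore()
print(store.accept(1, 1000, (47.5, 8.7)))  # True: new position
print(store.accept(1, 1000, (47.5, 8.7)))  # False: repeated in next packet
print(store.accept(2, 1000, (47.5, 8.8)))  # True: different tracker
```

Keying on time rather than on the coordinates themselves means a stationary tracker reporting the same position twice is still logged twice, as the text requires.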



C. PC software (Position Tracker Manager)<br />
The Position Tracker Manager is connected to the base<br />
station via a virtual COM port over USB; the data is<br />
transferred to the computer via UART. The Position Tracker<br />
Manager evaluates the data and displays the tracker devices on<br />
an embedded map. In the software, two trackers can be defined<br />
as the start and end points of the finish line. The software<br />
automatically calculates the distance to the finish line and<br />
detects when it is crossed. The transferred raw data and the<br />
ranking list can be exported from the software into a CSV file<br />
for further use.<br />
Fig. 5. Position Tracker Manager<br />
III. ENERGY CONSUMPTION<br />
The power supply of the tracker device is ensured by a Li-<br />
Ion battery. Additionally, a charging circuit with an integrated<br />
buck converter and a 3.3 VDC output is used to charge this<br />
battery. The charging circuit is supplied with 5 VDC from a<br />
plug-in power supply unit.<br />
To determine the current consumption of the LoRa module<br />
iM880B-L, the following values of the current consumption<br />
are taken from the data sheet [2]:<br />
Current consumption iM880B-L<br />
State: Current<br />
Idle: 5 mA<br />
Transmit: 90 mA<br />
Receive: 11.22 mA<br />
Low Power Mode RTC ON: 1.85 uA<br />
Low Power Mode RTC OFF: 0.8 uA<br />
In the worst case, it is assumed that the module transmits<br />
for a maximum of 36 seconds per hour at 14 dBm. This<br />
corresponds to the maximum permitted values issued by the<br />
Federal Office of Communications (OFCOM). The time in<br />
the receiving state is shorter than that of the sending process<br />
and in this case is specified as 10 seconds per hour. For the<br />
remaining time the module is in Low Power Mode RTC ON.<br />
These values result in an average current of 0.933 mA.<br />
To calculate the total power consumption of the tracker<br />
device, the current consumptions of the individual components<br />
are added together. Based on the capacity of the battery used, a<br />
battery life of 16.6 h can be estimated [1].<br />
Current consumption tracker device<br />
Component: Current<br />
Quectel L86-M33: 26 mA [3]<br />
iM880B-L: 0.933 mA<br />
LEDs: 2 mA<br />
LTC4080: 1.9 mA [4]<br />
Total: 30.833 mA<br />
Estimated battery life: 500 mAh / 30.1 mA = 16.6 h<br />
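The duty-cycle arithmetic can be reproduced in a few lines. Values are taken from the paper's two tables; the receive figure is interpreted as 11.22 mA, which reproduces the stated 0.933 mA average:

```python
# Check of the duty-cycle arithmetic above (values from the iM880B-L and
# tracker-device tables; only the arithmetic itself is new here).
TX_MA, RX_MA, SLEEP_MA = 90.0, 11.22, 1.85e-3   # transmit, receive, LPM RTC ON
tx_s, rx_s, hour_s = 36.0, 10.0, 3600.0          # worst-case seconds per hour

avg_lora = (TX_MA * tx_s + RX_MA * rx_s
            + SLEEP_MA * (hour_s - tx_s - rx_s)) / hour_s
total = 26.0 + avg_lora + 2.0 + 1.9              # GNSS + LoRa + LEDs + LTC4080

print(round(avg_lora, 3))   # 0.933 mA, as stated in the paper
print(round(total, 3))      # 30.833 mA total draw
```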

IV. TEST RESULTS<br />

Field tests with sailing ships have shown that the device<br />

trackers work reliably. The positions are resolved with an<br />

accuracy of a few meters or less and the data transfer works in<br />

good conditions up to 3 km. The battery life of over 16 hours<br />

is sufficient for most sports and competitions. It should also be<br />

possible to operate the device trackers with an outdoor solar<br />

panel. The device trackers are ideal for use in open terrain. In<br />

built-up areas, both the accuracy of the position and the<br />

maximum transmission distance are reduced.<br />

V. CONCLUSION AND OUTLOOK<br />

It has been shown that it is possible to operate the device<br />

trackers with a very small energy budget. The accuracy of<br />

position and transmission distance is sufficient for most<br />

applications.<br />

In future, the base station will be replaced by a Raspberry<br />

Pi with a LoRa hat. The Raspberry Pi should be able to<br />

perform the data evaluation without the need for a PC. In<br />

addition, a web server and a JavaScript application should<br />

replace the Position Tracker Manager.<br />

REFERENCES<br />
[1] T. Eigenmann, R. Gubler, Echtzeit-Positionstracker mit LoRa (Bachelor<br />
Thesis, advisor J. Gruber), Zurich University of Applied Sciences, 2017<br />

[2] iM880B Datasheet, v1.3, IMST GmbH Wireless Solutions, 2016<br />

[3] L86 Hardware Design, Rev. V1.0, Quectel Wireless Solutions, 2014<br />

[4] LTC4080 Datasheet, Rev C, Linear Technology, 2015<br />



Battery-Free Wireless Sensors<br />

Enabling the growth of the Internet of Things with a unique sensor architecture<br />

Greg Rice<br />

Technical Marketing and Applications Manager<br />

ON Semiconductor<br />

Protection and Signals Division<br />

Phoenix, Arizona, USA<br />

greg.rice@onsemi.com<br />

Abstract—The Internet of Things (IoT) is a phrase used to<br />

describe a network of connected devices that send data back<br />

and forth. Sensors are used to create data that is used within the<br />

IoT, typically consisting of a sensing block, a power block, and a<br />

processing block. This paper presents a new sensor design that<br />

untethers the sensing block from the power and processing<br />

blocks, resulting in a battery-free, wireless sensor that<br />

complements traditional sensor designs.<br />

Keywords—IoT, Internet of Things, battery-free, wireless<br />

sensor, RFID, energy harvesting, temperature sensor, moisture<br />

sensor, passive sensor<br />

I. INTRODUCTION<br />

The Internet of Things (IoT) is a phrase that is widely used<br />

when referring to emerging and growing technologies.<br />

Although the IoT is a common term, the definition of what the<br />

IoT means will change depending on how a particular person<br />

interacts with the IoT. For the purposes of this paper, the IoT<br />

is defined as a network of connected electronic devices, where<br />

data is transferred back and forth through a standard<br />

communications interface. The communication between the<br />

devices that comprise the IoT commonly takes place across a<br />

wireless interface, and will often incorporate a connection to<br />

the cloud for data storage and processing.<br />

For many people, their first experience with the IoT occurred<br />

when personal pagers were used to send basic messages across<br />

a wireless network. When smartphones were widely<br />

introduced approximately 10 years ago, the IoT began to<br />

evolve into something that almost every household was<br />

exposed to. Today, many homes include connected devices<br />

such as security cameras and smart thermostats that enable<br />

homeowners to monitor and control their home from<br />

thousands of miles away. Ultimately, data is being sent from<br />

one device to another across the IoT. The data being sent can<br />

include basic communication to send messages from person to<br />

person, and the data can also include information about the<br />

physical condition of something such as temperature or<br />

moisture. In many cases, some type of sensor is used to<br />

convert a physical parameter such as temperature or moisture<br />

into electronic data that can be sent across the IoT.<br />

II. SIZE AND GROWTH PREDICTIONS FOR THE IOT<br />

The estimates for the size and projected growth of the IoT will<br />

vary depending on who is providing the estimate, the rate of<br />

adoption for new IoT technologies, and how quickly the IoT<br />

ecosystem can be built and expanded. According to IHS,<br />

there are currently approximately 15 billion connected devices<br />

in the world, and the number of connected devices is expected<br />

to expand to 31 billion in 2020 and 75 billion in 2025. Other<br />

estimates project up to 200 billion connected devices in the<br />

next 10 years. Regardless of the final number of connected<br />

devices in the coming years, the projected growth is expected to<br />

be exponential.<br />

What is expected to fuel the tremendous growth in the IoT?<br />

Historically, consumer technology such as smartphones,<br />

personal computers, and tablet computers have been the<br />

primary drivers in technology growth. With the IoT, growth is<br />

anticipated to come from non-traditional applications.<br />

Fig. 1. IoT Growth is expected to come from multiple industries<br />

www.embedded-world.eu<br />



Connected cars are expected to double as autonomous driving<br />

and other safety features are included in new vehicles. The<br />

number of connected devices within a typical home is projected<br />

to grow from approximately 9 connected devices per home in<br />

2017, to approximately 500 devices per home in 2025.<br />

Additional applications such as healthcare, smart cities, and<br />

digital farming will also contribute to the tremendous growth<br />

of the IoT in the coming years, driving the global data traffic<br />

from approximately 2 Exabytes per day in 2017 to over 120<br />

Exabytes per day. To support the increased demand for data,<br />

new technologies need to be developed which may replace or<br />

complement existing technology, particularly when it comes<br />

to electronic sensors.<br />

III. SENSOR ARCHITECTURES<br />

Within an IoT network, there are multiple sensors used at the<br />

edge of the network to convert a physical parameter such as<br />

temperature or pressure into electronic data. Traditionally,<br />

electronic sensors are used to perform this function within the<br />

IoT. A traditional electronic sensor can be viewed as three<br />

primary technology blocks.<br />

A. Traditional Electronic Sensor Architecture<br />

At the core of the sensor design is the sensing element. The<br />

sensing element is something that reacts to the physical<br />

environment around the sensor. For a temperature sensor, the<br />

sensing element is something that has a predictable change in<br />

response to the temperature of the element. For a gas sensor,<br />

the sensing element will change in the presence of gas, etc.<br />

A power supply is needed for electronic sensors, and is<br />

typically designed into each sensor to provide stable voltage<br />

and current to power the circuits that comprise the sensor.<br />

Power can be supplied either through a wired AC power<br />
connection or, in some cases, through DC power provided by<br />
batteries.<br />

In addition to power and sensing, a traditional electronic<br />

sensor also incorporates a block to perform data processing<br />

and connectivity. The data processing is used to control the<br />

power and sensing sections of the sensor, and also to provide<br />

communication of the electrical data that correlates to the<br />

physical parameter that is being monitored by the sensor.<br />

It is common for the sensing, power, and data processing<br />

elements to be included in every electronic sensor. This<br />

approach works well for many applications where sensors are<br />

used, but for certain applications the number of components<br />

required for each technology block results in limitations on the<br />

physical size of a sensor as well as the cost to scale when<br />

multiple sensor nodes are required for a system. In some<br />
applications, these constraints have limited the number of<br />
sensors deployed; more would be installed if another sensor<br />
technology were available.<br />

Fig. 2. Block diagram of a traditional electronic sensor<br />

B. Battery-Free Wireless Sensor Architecture<br />

As a complement to traditional electronic sensors, a sensor<br />

architecture has been developed that separates the sensing<br />

element from the power and data processing blocks required<br />

for a complete sensor system. This approach results in an<br />

ecosystem that incorporates shared power and data processing<br />

in a single system, along with wireless, battery free sensors<br />

that are designed in small form factors, with a low cost to<br />

scale when multiple sensing nodes are needed. The Smart<br />

Passive Sensors™ (SPS) are able to sense parameters<br />

including temperature, moisture, pressure, and proximity with<br />

additional functionality in development.<br />

Fig. 3. Smart Passive Sensor block diagram<br />

Using smart passive sensors, each sense node is designed to<br />

operate in conjunction with a sensor hub, which provides a<br />

wireless interface to the sensor using the standard RAIN UHF<br />

RFID protocol and also incorporates connectivity to the IoT.<br />

With this approach, the sensors are designed to be powered by<br />

harvesting RF energy supplied by the sensor hub, and<br />

to communicate sensor information to the hub wirelessly. One<br />

sensor hub can communicate with a large number of sensor<br />

nodes, provided that the sense nodes are within range of the<br />

RF antenna from the sensor hub. Read ranges of up to 10m<br />

have been achieved in ideal conditions, where typical<br />

applications support a range of 3-5m between the sensor and<br />

the RF antenna. Figure 4 shows the IoT architecture using<br />

SPS and a connected Sensor Hub.<br />



Fig. 4. IoT Architecture using SPS and Sensor Hub<br />
IV. PRACTICAL APPLICATIONS<br />
A. Industrial Predictive Maintenance<br />
Monitoring the condition of equipment within industrial<br />
factories is critical in order to manage equipment maintenance,<br />
reduce factory downtime, and avoid physical injuries to<br />
factory personnel. Busbars within power switchgear are used<br />
to transfer thousands of watts of power throughout a facility.<br />
The busbars connect the inputs and outputs of high-current<br />
circuit breakers that are capable of passing thousands of<br />
amperes of current through three phases. If a connection to<br />
a high-current busbar becomes corroded or loose, the<br />
increased resistance in the connection will result in a<br />
temperature rise on the busbar. If not corrected, a degraded<br />
high-power busbar connection can result in catastrophic failure<br />
due to arc flash. An arc flash can cause significant damage to<br />
factory equipment and buildings, and physical injury or death<br />
to humans. Due to the high current within power switchgear,<br />
wired sensors cannot be used to monitor busbar temperature,<br />
and battery-powered sensors are not desired due to the<br />
physical injury risk associated with replacing batteries within<br />
a sensor.<br />
Fig. 5. Smart Passive Sensors installed in power switchgear<br />
For this application, using smart passive sensor technology to<br />
monitor individual busbar temperatures within power<br />
switchgear is a good fit. The sensor hub can be used to<br />
aggregate data from multiple temperature sensors, and send<br />
the data through a MODBUS interface for integration into<br />
standard facility SCADA management software.<br />
B. Smart Healthcare<br />
As healthcare facilities become more connected, the need for<br />
advanced sensing capabilities grows. Smart passive sensors<br />
can be used to indicate whether a patient is in a hospital bed,<br />
automatically notifying nursing staff if the patient is<br />
unexpectedly out of the bed, which could suggest that the<br />
patient has fallen and requires immediate assistance.<br />
Fig. 6. Smart Passive Sensors used for occupant detection in hospital bed<br />
Passive sensors can also be used to monitor fluid levels within<br />
a hospital room, such as the amount of fluid in an IV or<br />
catheter bag. Moisture sensing can also be integrated into bed<br />
linens and hospital garments to automatically and<br />
unobtrusively detect incontinence events without the need for<br />
manual monitoring. All of these functions are performed<br />
effectively and in a cost-efficient manner, resulting in an<br />
improved overall experience for medical patients and staff.<br />
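As an illustration of the kind of logic a hub-side host application might run on aggregated readings, consider the busbar case from the predictive-maintenance application. The function name, sensor IDs, and the 15 °C threshold below are assumptions for illustration, not part of the SPS product:

```python
# Illustrative only: flag a possible degraded busbar connection when one
# phase runs hot relative to the others. Names and the 15 degC threshold
# are assumptions, not part of the SPS product.
def check_busbars(temps_c, delta_limit_c=15.0):
    """temps_c: {sensor_id: temperature in degC} read from the sensor hub."""
    baseline = min(temps_c.values())          # coolest phase as reference
    return {sid: t for sid, t in temps_c.items()
            if t - baseline > delta_limit_c}  # candidates for inspection

alerts = check_busbars({'L1': 41.0, 'L2': 62.5, 'L3': 43.0})
print(alerts)   # {'L2': 62.5}
```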

C. Digital Farming<br />

Smart Passive Sensor technology can also be used in farming<br />

and agriculture applications. For livestock management,<br />

passive wireless temperature sensors can be used to monitor<br />

the health of cattle and pigs. As the livestock industry moves<br />

to reduce the amount of preventative medication that is<br />

administered to animals, it is expected that there will be an<br />

increase in illness for livestock. If certain illnesses are not<br />

detected early, the disease can spread to multiple animals and<br />

can result in increased cost to manage the illness and death in<br />

some cases. In addition, temperature sensing can be used to<br />

monitor the temperature of breeding animals to better predict<br />

female ovulation in pigs and cows and improve breeding<br />

efficiency.<br />



Finally, mobile sensor hubs can be integrated into drones and<br />
flown over crop fields to monitor the soil moisture<br />
throughout a farm and optimize watering for different sections<br />
within a farm.<br />
Fig. 7. Smart Passive Sensor used to monitor livestock temperature<br />
V. CONCLUDING REMARKS<br />
As the Internet of Things continues to grow, new sensor<br />
technologies must also be developed to enable billions of new<br />
types of devices to connect to the IoT. A new sensor<br />
architecture has been developed that complements traditional<br />
electronic sensors. The new smart passive sensor technology<br />
untethers the sensing element from the power and<br />
communication blocks needed for traditional sensors, resulting<br />
in a shared power and communication function in a sensor hub<br />
and distributed wireless, passive sensing nodes. Practical<br />
applications for this technology include industrial predictive<br />
maintenance, smart healthcare, and digital farming, among<br />
others. Additional information regarding this sensor system<br />
and associated applications is available on request.<br />
ACKNOWLEDGMENT<br />
Thanks to RF Micron for developing the smart passive sensor<br />
technology and for their continued collaboration. Magnus-S<br />
technology was invented by and is owned by RF Micron, Inc.<br />

REFERENCES<br />

[1] Ali Abedi, “Battery Free Wireless Sensor Networks: Theory and<br />

Applications,” Proceedings of IEEE’s 2014 Int. Conf. on Computing,<br />

Networking and Communications<br />

[2] I. Zalbide, et al., “Battery-free wireless sensors for industrial<br />
applications based on UHF RFID technology,” Proceedings of the<br />
IEEE Sensors 2014 Conference, Spain, 2014<br />

[3] AND9213: “Reading Battery Free Wireless Sensors,” application note,<br />

ON Semiconductor<br />

[4] AND9211: “Battery Free Wireless Sensor Measurements,” application<br />

note, ON Semiconductor<br />

[5] http://www.rfmicron.com/<br />

292


Understanding Advanced Bluetooth Angle Estimation<br />

Techniques for Real-Time Locationing<br />

Sauli Lehtimäki<br />

Silicon Labs<br />

Espoo, Finland<br />

Abstract — Bluetooth Angle of Arrival (AoA) and Angle of<br />

Departure (AoD) are techniques used for real-time locationing.<br />

These techniques are relatively new concepts in Bluetooth. The<br />

basic idea behind these techniques is to measure the phase<br />

differences between received radio frequency signals and<br />

numerically compute AoA or AoD based on these differences. By<br />

using the angle readings, it is possible to build systems that track<br />

people, mobile devices and other assets, usually in indoor<br />

environments. These new techniques can enhance the utility and<br />

functionality of Bluetooth beaconing applications. Antenna arrays<br />

and angle-of-arrival algorithms play a significant role in properly<br />

functioning Real-Time Locationing Systems (RTLS).<br />

Keywords— Angle estimation; Bluetooth; Direction Finding;<br />

Indoor locationing; Real-Time Locationing; RTLS<br />

I. INTRODUCTION<br />

Locationing technologies have many useful applications,<br />

one example being GPS, which is widely used all over the world.<br />

Unfortunately, GPS does not work very well indoors, so there is<br />

a real need for better indoor positioning technologies. Our goal<br />

is to track the locations (or angles) of individual objects with an<br />

external tracking system or for a device to track its own location<br />

in an indoor environment. This kind of locationing system can<br />

be used to track assets in a warehouse or people in a shopping<br />

mall, or people can use locationing for their own wayfinding.<br />

Bluetooth Angle of Arrival and Angle of Departure are new<br />

technologies that establish a standardized framework for indoor<br />

locationing. In these technologies, the fundamental problem of<br />

locationing comes down to solving for the arrival and departure<br />

angles of radio frequency signals. In this paper, we explain the<br />

basics of these technologies and give some theory for estimating<br />

direction of arrival. Currently the Bluetooth AoA/AoD<br />

specifications are in a mature state but not yet public. Because<br />

of this, this paper will only cover the general concepts without<br />

going into the details of the specification. Finally, we will<br />
briefly compare these with two other locationing technologies.<br />

II. BLUETOOTH AOA AND AOD<br />

A. AoA<br />

Let's consider a receiver device with a multi-antenna linear<br />
array and a transmitter device with a single antenna.<br />

Also, assume that the radio wave travels as a planar wave front<br />

rather than spherically, which we can safely assume when<br />

looking from a distance. If the transmitter, which is sending a<br />

sine wave through the air, lies on the normal line perpendicular<br />

to the array line, every antenna (channel) of the array will see<br />

the incoming signal in the same phase. If the transmitter does not<br />

lie on the normal line, then the receiving antennas will see phase<br />

differences between the channels. This phase difference<br />

information can be used to calculate the angle of arrival. In<br />

practice, the receiver device will need to have multiple ADC<br />

channels or use an RF switch to be able to take samples from<br />

each individual channel. The samples are called “IQ-samples”<br />

since a sample pair of “In-phase” and “Quadrature-phase”<br />

readings is taken from the same input signal. These samples<br />

have a 90 degree phase difference in the sampling. When this<br />

pair is considered to be a complex value, each complex value<br />

contains both phase and amplitude information and can be an<br />

input for the arrival angle estimation algorithm.<br />
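Treating an I/Q pair as one complex sample can be illustrated in two lines; the numeric values below are made up:

```python
import cmath

# An I/Q pair interpreted as one complex sample: the magnitude is the
# signal amplitude, the argument is its phase (values here are made up).
i, q = 0.70, 0.70
z = complex(i, q)
amplitude = abs(z)
phase_rad = cmath.phase(z)          # phase usable by an AoA estimator
print(round(phase_rad, 3))          # 0.785 rad (45 degrees)
```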

Radio waves travel at the speed of light, approximately<br />
300,000 km/s. When using frequencies around 2.4 GHz, the<br />

corresponding wavelengths are about 0.125 m. The maximum<br />

distance between two adjacent antennas for most estimation<br />

algorithms is a half wavelength. Many algorithms require this;<br />

otherwise, we get effects similar to aliasing. There is no<br />

theoretical minimum distance limitation, but, in practice, the<br />

minimum size is limited by the mechanical dimensions of the<br />

array plus, for example, mutual coupling between the antenna<br />

elements.<br />

B. AoD<br />

For Angle of Departure, the fundamental idea of measuring<br />

phase differences is the same, but device roles are swapped. In<br />

AoD, the device being tracked uses only one antenna, and the<br />

transmitter devices use multiple antennas. The transmitter<br />

device sequentially switches the transmitting antenna, and the<br />

receiving side knows the antenna array configuration and<br />

switching sequence.<br />

When considering this from an application point of view, we<br />

can see a clear difference between these two techniques. In AoD,<br />

the receiving device is able to calculate its own position in space<br />

using angles from multiple beacons and their positions (by<br />

triangulation). In AoA, the receiving device tracks arrival angles<br />

for individual objects. Still, it is good to note that different<br />



combinations of these can be performed; so, these techniques do<br />

not limit what can be done at the application level. Both in<br />

Bluetooth AoA and AoD, the AoA/AoD related control data is<br />

sent over a traditional data channel. Typically, these techniques<br />

can achieve a couple of degrees angular accuracy and around 0.5<br />

m locationing accuracy, but these figures are highly dependent<br />

on the implementation of the locationing system.<br />

III. CHALLENGES<br />

One of the biggest and perhaps most obvious challenges in<br />

this subject is answering the question: “How are angle estimates<br />

calculated based on the sample data?” It is not enough that we<br />

are able to calculate angle estimates in an ideal environment; we<br />

must also be able to calculate them in environments with very<br />

heavy multi-path in which signals are highly correlated or<br />

coherent. By a coherent signal, we mean a signal that is a delayed<br />

and scaled version of some other signal. This can be the case<br />

when radio waves are reflected from walls, for example.<br />

Other challenges include signal polarization. In most cases,<br />

we cannot control the polarization of the mobile device, so the<br />

system has to take this into account. Also signal noise, clock<br />

jitter and signal propagation delays add their own variables to<br />

the problem. Depending on the system scale, the RAM and<br />

especially CPU requirements can be demanding for an<br />

embedded system. Many of the well performing angle<br />

estimation algorithms require a significant amount of processing<br />

power from the CPU.<br />

In the next section, we will cover some theory on antenna<br />

arrays and angle of arrival estimations. Angle of departure can<br />

be derived from the angle of arrival theory.<br />

IV. ANGLE OF ARRIVAL THEORY<br />

Angle estimation methods and antenna arrays are essential<br />

for the locationing system to work properly. The history of<br />

direction finding theory goes back over 100 years, to when the first<br />

attempts to solve this problem were made using directional<br />

antennas and, obviously, purely analog systems. In the years<br />

following, test methods moved to the digital world, but the basic<br />

principles are still quite the same. These direction-finding<br />

methods are already used in many applications, such as medical<br />

equipment, security and military devices. In this section, we will<br />

discuss the basics of some typical antenna arrays and estimation<br />

algorithms. By direction finding, we refer to the general problem<br />

of estimating arrival and departure angles.<br />

A. Antenna Arrays<br />

Antenna arrays for direction finding are usually divided into<br />

categories. The most common ones discussed here are Uniform<br />

Linear Array (ULA), Uniform Rectangular Array (URA) and<br />

Uniform Circular Array (UCA). The linear array is a<br />
one-dimensional array, meaning that all the antennas in the array lie<br />

on a single line, whereas the rectangular and circular arrays are<br />

two-dimensional arrays, meaning that the antennas are spread in<br />

two dimensions (on a plane). By using a one-dimensional<br />

antenna array, one can reliably measure only the azimuth angle,<br />

assuming the tracked device moves consistently on the same<br />

plane. Furthermore, with two-dimensional arrays, one can<br />

reliably measure both azimuth and elevation angles in the 3D<br />

half-space. If the array is extended to a full 3D array (antennas<br />

spread on all three Cartesian coordinates), then we will be able<br />

to measure the full 3D space.<br />

Designing an antenna array for direction finding is not a<br />

straightforward task. When antennas are placed in an array, they<br />

start affecting each other’s responses; this is called mutual<br />

coupling. We also have to keep in mind that, in most cases, we<br />

cannot control the polarization of the transmitting end. This<br />

creates an additional challenge for the designer. In IoT<br />

applications, the devices are often expected to be small and even<br />

work in very high frequency bands. Estimation algorithms often<br />

require certain properties of the array. For example, the<br />
estimation algorithm ESPRIT relies on the mathematical<br />
assumption that the array can be divided into two identical<br />
subarrays [3].<br />

B. Angle Estimation Algorithms<br />

Let's look at the mathematical/algorithmic problem of<br />

estimating the angle of arrival based on the input IQ-data. The<br />

problem definition itself is simple: “Estimate the arrival angle of<br />

an emitted (narrowband) signal arriving at the receiving array”.<br />

While the problem statement sounds very trivial, a robust<br />

solution (that works in real life) for this problem is not easy and<br />

can require much processing power from the hardware.<br />

Next, we will present two different approaches for solving<br />

this problem. The first one is a basic one and is called a classical<br />

beamformer. The second is a more advanced technique called<br />

Multiple Signal Classification (MUSIC). We will not go through<br />

proofs of any theorems or reasons why these methods work but<br />

rather give a high-level view of how the algorithms work.<br />

Deeper studies about these estimation algorithms can be found<br />

from [1] and [2].<br />

Classical Beamformer<br />

Let's begin with a mathematical model of a uniform linear<br />

array. We are given a data vector of IQ-samples for each<br />

antenna. Let this vector be called x. Now, there is a phase shift<br />

seen by each antenna (which can be 0) plus some noise, n, in the<br />

measurements, so x can be written as a function of time t:<br />

x(t) = a(θ)s(t) + n(t), (1)<br />

where s is the signal sent over the air, and a is the steering<br />

vector of the antenna array:<br />

a(θ) = [1, e^(−j2πd sin(θ)/λ), ..., e^(−j2π(m−1)d sin(θ)/λ)], (2)<br />

where d is the distance between adjacent antennas; λ is the<br />

wavelength of the signal; m is the number of elements in the<br />

antenna array, and θ stands for the angle of arrival.<br />

Steering vector (2) describes how signals on each antenna<br />

are phase shifted because of the varying distances to the<br />

transmitter. By using (1), we can calculate an approximation of<br />

the so-called sample covariance matrix, R_xx, by calculating<br />
R_xx ≈ (1/N) Σ_{t=1}^{N} x(t) x^H(t), (3)<br />

where H stands for the Hermitian transpose of a matrix.<br />

The sample covariance matrix (3) will be used as an input<br />

for the estimator algorithm as we will see.<br />

The idea of the classical beamformer is to maximize the<br />

output power as a function of the angle, similar to how a<br />

mechanical radar works. If we attempt to maximize the power,<br />

we end up with the next formula:<br />

P(θ) = (a^H(θ) R_xx a(θ)) / (a^H(θ) a(θ)). (4)<br />

Now, to find the arrival angle, we loop over candidate<br />
angles θ and find the maximum of the output power P. The<br />
angle θ that produces the maximum power corresponds to the<br />
angle of arrival. While this approach is quite simple, its accuracy<br />

is not generally very good. So, let's introduce another method,<br />

which is a bit better in terms of accuracy. See, for example, [4]<br />

for an algorithm accuracy comparison.<br />
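The classical beamformer of equations (1)-(4) can be sketched for a ULA as follows; the synthetic scenario (antenna count, source angle, noise level) is an assumption for illustration:

```python
import numpy as np

def steering(theta, m, d_over_lam):
    # Steering vector (2) for an m-element uniform linear array.
    return np.exp(-2j * np.pi * np.arange(m) * d_over_lam * np.sin(theta))

def beamformer_spectrum(X, d_over_lam, thetas):
    # X: m x N matrix of IQ snapshots. Sample covariance (3), power (4).
    R = X @ X.conj().T / X.shape[1]
    P = []
    for th in thetas:
        a = steering(th, X.shape[0], d_over_lam)
        P.append(np.real(a.conj() @ R @ a) / np.real(a.conj() @ a))
    return np.array(P)

# Synthetic check: one source at 20 degrees, 8 antennas, lambda/2 spacing.
rng = np.random.default_rng(0)
m, N, true_th = 8, 200, np.deg2rad(20.0)
s = np.exp(1j * 2 * np.pi * rng.random(N))            # unit-power signal
X = np.outer(steering(true_th, m, 0.5), s)
X += 0.05 * (rng.standard_normal((m, N)) + 1j * rng.standard_normal((m, N)))

thetas = np.deg2rad(np.linspace(-90, 90, 361))
est = np.rad2deg(thetas[np.argmax(beamformer_spectrum(X, 0.5, thetas))])
print(round(est, 1))   # close to 20.0
```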

MUSIC (Multiple Signal Classification)<br />

One type of estimation algorithm is the so-called subspace<br />

estimator, and one popular algorithm of that category is called<br />

MUSIC (Multiple Signal Classification). The idea of this<br />

algorithm is to perform an eigendecomposition of the<br />
covariance matrix R_xx:<br />
R_xx = V A V^(−1), (5)<br />

where A is a diagonal matrix containing the eigenvalues and V<br />
contains the corresponding eigenvectors of R_xx. Assume we<br />

are trying to estimate the angle of arrival for one transmitter with<br />

an n-antenna linear array. It can be shown that the eigenvectors<br />
of R_xx belong either to the so-called noise subspace or to the<br />
signal subspace. If the eigenvalues are sorted in ascending order,<br />
the eigenvectors corresponding to the n − 1 smallest eigenvalues<br />
span the noise subspace,<br />

which is orthogonal to the signal subspace. Based on the<br />

orthogonality information, we can calculate the pseudo spectrum<br />

P:<br />

P(θ) = 1 / (a^H(θ) V V^H a(θ)), (6)<br />
where V here contains the noise-subspace eigenvectors.<br />

As in a classical beamformer, we loop through the desired<br />

values of θ and find the maximum peak value of P, which<br />

corresponds to the angle of arrival (the argument θ) we wish to<br />

measure.<br />

In the ideal case, MUSIC has very good resolution and accuracy in a high-SNR environment. On the other hand, its performance degrades when the input signals are highly correlated, which is especially the case in indoor environments. Multipath effects distort the pseudo-spectrum, causing it to have maxima at the wrong locations. More information about the conventional beamformer and MUSIC estimators can be found in [3].<br />
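As a sketch, the pseudo-spectrum scan can be written in a few lines of NumPy. The uniform linear array, half-wavelength element spacing and the simulated source below are illustrative assumptions, not details taken from this paper:<br />

```python
import numpy as np

def music_spectrum(snapshots, n_ant, n_src=1, d=0.5):
    """MUSIC pseudo-spectrum for a uniform linear array.

    snapshots: (n_ant, n_samples) complex received samples.
    d: element spacing in wavelengths (0.5 = half wavelength).
    """
    angles = np.linspace(-90.0, 90.0, 361)
    # Sample covariance matrix R_xx
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]
    # Eigendecomposition; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(R)
    Vn = eigvecs[:, : n_ant - n_src]        # noise-subspace eigenvectors
    P = np.empty_like(angles)
    for i, theta in enumerate(np.deg2rad(angles)):
        a = np.exp(2j * np.pi * d * np.arange(n_ant) * np.sin(theta))
        P[i] = 1.0 / np.real(a.conj() @ Vn @ Vn.conj().T @ a)
    return angles, P

# Simulated single source at +20 degrees on an 8-element array
rng = np.random.default_rng(0)
n_ant = 8
a0 = np.exp(2j * np.pi * 0.5 * np.arange(n_ant) * np.sin(np.deg2rad(20.0)))
s = rng.standard_normal(500) + 1j * rng.standard_normal(500)
noise = 0.05 * (rng.standard_normal((n_ant, 500))
                + 1j * rng.standard_normal((n_ant, 500)))
angles, P = music_spectrum(np.outer(a0, s) + noise, n_ant)
print(angles[np.argmax(P)])   # peak of the pseudo-spectrum, near 20
```

Looping over the angle grid and picking the argument of the largest pseudo-spectrum value mirrors the scan described in the text.<br />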


Spatial Smoothing<br />

Spatial smoothing is a method for solving problems caused<br />

by multipathing (when coherent signals are present). It can be<br />

proven that the signal covariance matrix can be "decorrelated"<br />

by calculating an averaged covariance matrix using subarrays of<br />

the original covariance matrix. For a two-dimensional array, this<br />

can be written as follows:<br />
$$R = \frac{1}{M_s N_s} \sum_{m=1}^{M_s} \sum_{n=1}^{N_s} R_{mn}, \qquad (7)$$<br />

where $M_s$ and $N_s$ are the numbers of subarrays in the x- and y-directions, respectively, and $R_{mn}$ stands for the $(m, n)$:th subarray covariance matrix. An example proof of this formula and more information can be found in [2].<br />

The resulting covariance matrix can now be used as a<br />

"decorrelated" version of the covariance matrix and fed to the<br />

MUSIC algorithm to produce correct results. The downside of spatial smoothing is that it reduces the size of the covariance matrix, which in turn reduces the accuracy of the estimate.<br />
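A minimal sketch of the idea, shown for the one-dimensional (linear array) analogue of eq. (7); the array size, subarray size and angles are illustrative assumptions:<br />

```python
import numpy as np

def spatially_smooth(R, sub_size):
    """Average the covariance matrices of all overlapping subarrays.

    R: (n, n) full-array covariance matrix.
    sub_size: number of elements per subarray.
    Returns the (sub_size, sub_size) smoothed covariance matrix,
    the 1-D analogue of eq. (7).
    """
    n = R.shape[0]
    n_sub = n - sub_size + 1                 # number of subarrays
    R_bar = np.zeros((sub_size, sub_size), dtype=R.dtype)
    for m in range(n_sub):
        R_bar += R[m:m + sub_size, m:m + sub_size]
    return R_bar / n_sub

# Two coherent (fully correlated) sources: the full covariance matrix
# stays rank 1, but the smoothed matrix recovers rank 2, so MUSIC can
# again separate the two arrivals.
n = 8
a1 = np.exp(2j * np.pi * 0.5 * np.arange(n) * np.sin(np.deg2rad(10)))
a2 = np.exp(2j * np.pi * 0.5 * np.arange(n) * np.sin(np.deg2rad(-30)))
s = a1 + a2                                  # same waveform on both paths
R = np.outer(s, s.conj())
print(np.linalg.matrix_rank(R))                       # 1
print(np.linalg.matrix_rank(spatially_smooth(R, 5)))  # 2 after smoothing
```

The shrink from an 8x8 to a 5x5 covariance matrix in this example is exactly the accuracy trade-off mentioned above.<br />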

V. OTHER LOCATIONING TECHNOLOGIES<br />

In this section, we briefly present two other locationing<br />

technologies for comparison. These two methods use different<br />

kinds of algorithms/methods for locationing than those<br />

presented in this paper.<br />

RSSI<br />

With Received Signal Strength Indicator (RSSI), the basic idea is to measure the strength of the received signal to obtain an approximation of the distance between RX and TX. This<br />

information can be used to trilaterate the position of a receiver<br />

device based on multiple distance measurements from different<br />

transmitter points. This technology requires only one antenna<br />

per device but is not usually very accurate in an indoor<br />

environment.<br />
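A hedged sketch of the idea: a log-distance path-loss model (the 1 m reference RSSI and path-loss exponent below are assumed, illustrative constants) converts RSSI to a distance, and a linearised least-squares step trilaterates the position from three or more anchors:<br />

```python
import numpy as np

def rssi_to_distance(rssi, rssi_at_1m=-45.0, path_loss_exp=2.0):
    """Invert the log-distance path-loss model (illustrative constants)."""
    return 10 ** ((rssi_at_1m - rssi) / (10 * path_loss_exp))

def trilaterate(anchors, dists):
    """Least-squares position from >= 3 anchor positions and distances.

    Linearised by subtracting the first anchor's circle equation
    from the others, then solved with ordinary least squares.
    """
    anchors = np.asarray(anchors, float)
    dists = np.asarray(dists, float)
    x0, d0 = anchors[0], dists[0]
    A = 2 * (anchors[1:] - x0)
    b = (d0**2 - dists[1:]**2
         + np.sum(anchors[1:]**2, axis=1) - np.sum(x0**2))
    return np.linalg.lstsq(A, b, rcond=None)[0]

print(rssi_to_distance(-65.0))   # 10.0 m under the assumed model
pos = trilaterate([(0, 0), (10, 0), (0, 10)], [5.0, 65**0.5, 45**0.5])
print(pos)                       # ≈ [3. 4.]
```

In practice indoor fading makes the RSSI-to-distance step noisy, which is why the text notes the limited indoor accuracy of this approach.<br />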

ToA / TDoA<br />


With Time of Arrival / Time of Flight (ToA/ToF), we<br />

measure the travel time of a signal between RX and TX and use that to calculate the distance between the two devices. This distance is<br />

then used to trilaterate the position of the receiver. In ToA, all<br />

devices are time-synchronized. This technology also requires<br />

only one antenna per device, but, on the other hand, it requires<br />

very high clock accuracy to get reasonable positioning<br />

accuracies. There is also a variant of this technology called<br />

TDoA (Time Difference of Arrival), where only the receiver devices need to be time-synchronized,<br />

and the estimation algorithms use the time<br />

difference for calculating position estimates.<br />
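The arithmetic behind the clock-accuracy requirement is simple: distance is the speed of light times the measured flight time, so every nanosecond of timing error already corresponds to roughly 0.3 m of range error:<br />

```python
C = 299_792_458.0  # speed of light, m/s

def toa_distance(t_flight_s):
    """Distance from a measured one-way time of flight (ToA/ToF)."""
    return C * t_flight_s

def tdoa_range_difference(t1_s, t2_s):
    """Range difference to two receivers from their arrival-time
    difference, the quantity a TDoA estimator works with."""
    return C * (t1_s - t2_s)

# 1 ns of clock error shifts the distance estimate by ~0.3 m,
# which is why ToA needs very tight synchronisation.
print(toa_distance(1e-9))       # ≈ 0.3 m
print(toa_distance(33.3e-9))    # a ~10 m link
```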

VI. SUMMARY<br />

Bluetooth Angle of Arrival and Angle of Departure are emerging technologies that can be used to track assets as well as<br />



for indoor positioning and way-finding. These are phase-based<br />

direction finding systems that require an antenna array, RF<br />

switches (or a multi-channel ADC) and processing power to run<br />

the estimation algorithms. Designing a proper antenna array and a suitable angle estimation algorithm is essential for an RTLS system. Well-performing estimation algorithms are often computationally expensive. Other positioning technologies include (but are not limited to) RSSI-based and ToA-based methods, but only phase-based AoA/AoD currently has a standardized framework in Bluetooth.<br />

REFERENCES<br />

[1] H. Krim, M. Viberg, “Two Decades of Array Signal Processing”, IEEE<br />

Signal Processing Magazine, July 1996, pp. 67-94<br />

[2] Y.-M. Chen, “On Spatial Smoothing for Two-Dimensional Direction-of-<br />

Arrival Estimation of Coherent Signals”, IEEE Transactions on Signal<br />

Processing, Vol. 45, No. 7, July 1997<br />

[3] Z. Chen, G. Gokeda, Y. Yu, “Introduction to Direction-of-Arrival<br />

Estimation”, Artech House, 2010<br />

[4] N. A. Baig, M. B. Malik, “Comparison of Direction of Arrival (DOA) Estimation Techniques for Closely Spaced Targets”, International Journal of Future Computer and Communication, Vol. 2, No. 6, December 2013<br />



Bluetooth Mesh Networking<br />

Martin Woolley<br />

Bluetooth SIG<br />

UK<br />

Twitter: @bluetooth_mdw<br />

Abstract— Mesh is a new network topology option available<br />

for Bluetooth Low Energy (LE) adopted in the summer of 2017.<br />

It represents a major advance which positions Bluetooth to be the<br />

dominant low power wireless communications technology in a<br />

wide variety of new sectors and use cases, including Smart<br />

Buildings and Industrial IoT.<br />

Keywords—Bluetooth, mesh, IoT, smart buildings<br />

I. INTRODUCTION<br />

Bluetooth has been actively developed since its initial<br />

release in 2000, when it was originally intended to act as a<br />

cable replacement technology. It soon came to dominate<br />

wireless audio products and computer peripherals, such as<br />

wireless mice and keyboards.<br />

In 2010, Bluetooth LE provided the next, major step<br />

forward. Its impact has been substantial and widely felt, most<br />

notably in smartphones and tablets, as well as in Health and<br />

Fitness, Smart Home and Wearables categories.<br />

Wireless communications systems based around mesh<br />

network topologies have proved themselves to offer an<br />

effective approach to providing coverage of large areas,<br />

extending range and providing resilience. However, until now<br />

they have been based upon niche technologies, incompatible<br />

with most computer, smartphone and accessory devices owned<br />

by consumers or used in the enterprise.<br />

120 Bluetooth SIG member companies participated in the<br />

work required to bring mesh networking support to Bluetooth.<br />

This is significantly more than is typically the case, and is<br />

representative of the demand for a global, industry standard for<br />

a Bluetooth mesh networking capability.<br />

The addition of mesh networking support represents a<br />

change of a type, and of such magnitude that it warrants being<br />

described as a paradigm shift for Bluetooth technology.<br />

II. TAKING CONTROL<br />
A. Smart Buildings Get Truly Smart<br />
Imagine arriving at the office in your car, early one dark, winter morning. The security system lets you in and a parking bay is automatically allocated to you. The bay number over the parking space lights up so you can drive easily to it. The parking bay allocation system is updated to note that this space is now occupied.<br />
Entering the building, occupancy sensors note your arrival and identify you from the wearable technology about your person. You take the elevator to the 2nd floor and exit. You’re the first to arrive, as usual. As the lift doors open, the lights from the elevator to your office and the kitchen come on. Coffee is deemed of strategic significance in your company! Other areas are left in darkness to save power.<br />
You walk to your office and enter, closing the door behind you. The LED downlights and your desk lamp are already on and at exactly the level you prefer. You notice the temperature is a little warmer than the main office space, reflecting your personal preference. Proximity with your computer automatically logs you in.<br />
Your day started well, with the building responding to your needs and taking your preferences into account. It’s clear that systems are being used efficiently. What made this possible?<br />
Your company installed a Bluetooth mesh network some months ago, starting with a mesh lighting system. Later, the mesh was extended with occupancy sensors, environmental sensors, a wireless heating control system and a mesh-based car park management system. The company is saving money on electricity and heating, and work environments have become personalized, boosting personal productivity. Maintenance costs are going down, since adding items like additional light switches no longer requires expensive and disruptive physical wiring. Data is allowing the building management team to learn about the building, its services and how people act within it, and the team is using this data to make optimizations.<br />
Figure 1 - A Bluetooth mesh network could span the office and car park<br />
The Bluetooth mesh network has made it easier and cheaper to be in control of building services, to wirelessly interact with them and to automate their behaviors. You wonder how you ever lived without such advanced building technology in the past!<br />
III. BLUETOOTH MESH - THE BASICS<br />
A. Concepts and Terminology<br />
Understanding Bluetooth mesh networking requires the reader to learn a series of new technical terms and concepts not found in the world of Bluetooth LE. In this section, we’ll explore the most fundamental of these terms and concepts.<br />
B. Mesh vs Point-to-Point<br />
Most Bluetooth LE devices communicate with each other using a simple point-to-point network topology enabling one-to-one device communications. In the Bluetooth core specification, this is called a ‘piconet.’<br />
Imagine a smartphone that has established a point-to-point connection to a heart rate monitor over which it can transfer data. One nice aspect of Bluetooth is that it enables devices to set up multiple connections. That same smartphone can also establish a point-to-point connection with an activity tracker. In this case, the smartphone can communicate directly with each of the other devices, but the other devices cannot communicate directly with each other.<br />
In contrast, a mesh network has a many-to-many topology, with each device able to communicate with every other device in the mesh (we’ll examine that statement more closely later on, in the section entitled “Bluetooth mesh in action”). Communication is achieved using messages, and devices are able to relay messages to other devices so that the end-to-end communication range is extended far beyond the radio range of each individual node.<br />

Figure 2 - A many to many topology with message relaying<br />

C. Devices and Nodes<br />

Devices which are part of a mesh network are called nodes<br />

and those which are not are called “unprovisioned devices”.<br />

The process which transforms an unprovisioned device into<br />

a node is called “provisioning”. Consider purchasing a new<br />

Bluetooth light with mesh support, bringing it home and setting<br />

it up. To make it part of your mesh network, so that it can be<br />

controlled by your existing Bluetooth light switches and<br />

dimmers, you would need to provision it.<br />

Provisioning is a secure procedure which results in an<br />

unprovisioned device possessing a series of encryption keys<br />

and being known to the Provisioner device, typically a tablet or<br />

smartphone. One of these keys is called the network key or<br />

NetKey for short. You can read more about mesh security in<br />

the Security section, below.<br />

All nodes in a mesh network possess at least one NetKey<br />

and it is possession of this key which makes a device a member<br />

of the corresponding network and as such, a node. There are<br />

other requirements that must be satisfied before a node can<br />

become useful, but securely acquiring a NetKey through the<br />

provisioning process is a fundamental first step. We’ll review<br />

the provisioning process in more detail in a later section of this<br />

paper.<br />

D. Elements<br />

Some nodes have multiple constituent parts, each of which<br />

can be independently controlled. In Bluetooth mesh<br />

terminology, these parts are called elements. Figure 3 shows an<br />



LED lighting product which, if added to a Bluetooth mesh<br />

network, would form a single node with three elements, one for<br />

each of the individual LED lights.<br />

Figure 3 - Lighting node consisting of three elements<br />

E. Messages<br />

When a node needs to query the status of other nodes or<br />

needs to control other nodes in some way, it sends a message<br />

of a suitable type. If a node needs to report its status to other<br />

nodes, it sends a message. All communication in the mesh<br />

network is message-oriented and many message types are<br />

defined, each with its own unique opcode.<br />

Messages fall into one of two broad categories:<br />

Acknowledged messages require a response from nodes<br />

that receive them. The response serves two purposes: it<br />

confirms that the message it relates to was received, and it<br />

returns data relating to the message recipient to the message<br />

sender.<br />

The sender of an acknowledged message may resend the<br />

message if it does not receive the expected response(s) and<br />

therefore, acknowledged messages must be idempotent. This<br />

means that the effect of a given acknowledged message,<br />

arriving at a node multiple times, will be the same as if it had<br />

only been received once.<br />

Unacknowledged messages do not require a response.<br />

F. Addresses<br />

Messages must be sent from and to an address. Bluetooth<br />

mesh defines three types of address.<br />

A unicast address uniquely identifies a single element.<br />

Unicast addresses are assigned to devices during the<br />

provisioning process.<br />

A group address is a multicast address which represents one<br />

or more elements. Group addresses are either defined by the<br />

Bluetooth SIG and are known as SIG Fixed Group Addresses<br />

or are assigned dynamically. Four SIG Fixed Group Addresses<br />

have been defined. These are named All-proxies, All-friends,<br />

All-relays and All-nodes. The terms Proxy, Friend and Relay<br />

will be explained later in this paper.<br />

It is expected that dynamic group addresses will be<br />

established by the user via a configuration application and that<br />

they will reflect the physical configuration of a building, such<br />

as defining group addresses which correspond to each room in<br />

the building.<br />

A virtual address is an address which may be assigned to<br />

one or more elements, spanning one or more nodes. It takes the<br />

form of a 128-bit UUID value with which any element can be<br />

associated and is much like a label. Virtual addresses will<br />

likely be preconfigured at the point of manufacture and be used<br />

for scenarios such as allowing the easy addressing of all<br />

meeting room projectors made by a given manufacturer.<br />

G. Publish / Subscribe<br />

The act of sending a message is known as publishing. Nodes<br />

are configured to select messages sent to specific addresses for<br />

processing, and this is known as subscribing.<br />

Typically, messages are addressed to group or virtual<br />

addresses. Group and virtual address names will have readily<br />

understood meaning to the end user, making them easy and<br />

intuitive to use.<br />

In Figure 4, we can see that the node “Switch 1” is<br />

publishing to the group address Kitchen. Nodes Light 1, Light<br />

2 and Light 3 each subscribe to the Kitchen address and<br />

therefore receive and process messages published to this<br />

address. In other words, Light 1, Light 2 and Light 3 can be<br />

switched on or off using Switch 1.<br />

Switch 2 publishes to the group address Dining Room.<br />

Light 3 alone subscribes to this address and so is the only light<br />

controlled by Switch 2. Note that this example also illustrates<br />

the fact that nodes may subscribe to messages addressed to<br />

more than one distinct address. This is both powerful and<br />

flexible.<br />

Similarly, notice how both Switch 5 and Switch 6 publish<br />

to the same Garden address.<br />

The use of group and virtual addresses with the<br />

publish/subscribe communication model has an additional,<br />

substantial benefit in that removing, replacing or adding new<br />

nodes to the network does not require reconfiguration of other<br />

nodes. Consider what would be involved in installing an<br />

additional light in the dining room. The new device would be<br />

added to the network using the provisioning process and<br />

configured to subscribe to the Dining Room address. No other<br />

nodes would be affected by this change to the network. Switch<br />

2 would continue to publish messages to Dining Room as<br />

before but now both Light 3 and the new light would respond.<br />

Figure 4 - Publish / Subscribe<br />
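The switch-and-lights scenario above can be sketched as a toy publish/subscribe router. The class names and routing logic are purely illustrative, not part of any mesh stack:<br />

```python
from collections import defaultdict

class MeshNetwork:
    """Toy publish/subscribe routing over group addresses."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, address, node):
        self.subscribers[address].append(node)

    def publish(self, address, message):
        # Every node subscribed to the address processes the message.
        for node in self.subscribers[address]:
            node.receive(message)

class Light:
    def __init__(self, name):
        self.name, self.on = name, False
    def receive(self, message):
        self.on = (message == "on")

net = MeshNetwork()
lights = [Light(f"Light {i}") for i in (1, 2, 3)]
for light in lights:
    net.subscribe("Kitchen", light)
net.subscribe("Dining Room", lights[2])   # Light 3 is in both groups

net.publish("Kitchen", "on")        # Switch 1: all three lights turn on
net.publish("Dining Room", "off")   # Switch 2: only Light 3 turns off
print([light.on for light in lights])   # [True, True, False]
```

Adding a new dining room light is just another `subscribe("Dining Room", ...)` call: no other node needs reconfiguring, which mirrors the benefit described above.<br />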



H. States and Properties<br />

Elements can be in various conditions and this is<br />

represented in Bluetooth Mesh by the concept of state values.<br />

A state is a value of a certain type, contained within an<br />

element (within a server model - see below). As well as values,<br />

states also have associated behaviors and may not be reused in<br />

other contexts.<br />

As an example, consider a simple light which may either be<br />

on or off. Bluetooth Mesh defines a state called Generic OnOff.<br />

The light would possess this state item and a value of On<br />

would correspond to and cause the light to be illuminated,<br />

whereas a Generic OnOff state value of Off would reflect and<br />

cause the light to be switched off.<br />

The significance of the term Generic will be discussed later.<br />

Properties are similar to states in that they contain values<br />

relating to an element. But they are significantly different from<br />

states in other ways.<br />

Readers who are familiar with Bluetooth LE will be aware<br />

of characteristics and recall that they are data types with no<br />

defined behaviors associated with them, making them reusable<br />

across different contexts. A property provides the context for<br />

interpreting a characteristic.<br />

To appreciate the significance and use of contexts as they<br />

relate to properties, consider for example, the characteristic<br />

Temperature 8, an 8-bit temperature state type which has a<br />

number of associated properties, including Present Indoor<br />

Ambient Temperature and Present Outdoor Ambient<br />

Temperature. These two properties allow a sensor to publish<br />

sensor readings in a way that allows a receiving client to<br />

determine the context the temperature value has, making better<br />

sense of its true meaning.<br />

Properties are organized into two categories: Manufacturer,<br />

which is a read-only category, and Admin, which allows read-write access.<br />

I. Messages, States and Properties<br />

Messages are the mechanism by which operations on the<br />

mesh are invoked. Formally, a given message type represents<br />

an operation on a state or collection of multiple state values.<br />

All messages are of three broad types, reflecting the types of<br />

operation which Bluetooth Mesh supports. The shorthand for<br />

the three types is GET, SET and STATUS.<br />

GET messages request the value of a given state from one<br />

or more nodes. A STATUS message is sent in response to a<br />

GET and contains the relevant state value.<br />

SET messages change the value of a given state. An<br />

acknowledged SET message will result in a STATUS message<br />

being returned in response to the SET message whereas an<br />

unacknowledged SET message requires no response.<br />

STATUS messages are sent in response to GET messages,<br />

acknowledged SET messages or independently of other<br />

messages, perhaps driven by a timer running on the element<br />

sending the message, for example.<br />

Specific states referenced by messages are inferred from the<br />

message opcode. Properties on the other hand, are referenced<br />

explicitly in generic property related messages using a 16-bit<br />

property ID.<br />
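The GET/SET/STATUS interaction can be sketched as a toy server model. The opcode constants below are illustrative placeholders (real opcodes are assigned in the Bluetooth Mesh Model specification), and note how the SET handler is idempotent, as the earlier discussion of acknowledged messages requires:<br />

```python
# Placeholder opcodes for a toy Generic OnOff-style server.
ONOFF_GET, ONOFF_SET, ONOFF_STATUS = 0x01, 0x02, 0x03

class OnOffServer:
    """Toy server model holding a single on/off state value."""
    def __init__(self):
        self.on_off = 0

    def handle(self, opcode, value=None, acknowledged=True):
        """Process a message; return the STATUS reply to send, or None."""
        if opcode == ONOFF_GET:
            # A STATUS message carries the requested state value.
            return (ONOFF_STATUS, self.on_off)
        if opcode == ONOFF_SET:
            # Idempotent: receiving the same SET twice has the same
            # effect as receiving it once.
            self.on_off = value
            if acknowledged:
                return (ONOFF_STATUS, self.on_off)
        return None   # unacknowledged SET: no response

server = OnOffServer()
print(server.handle(ONOFF_SET, 1))   # STATUS reply confirming the new state
print(server.handle(ONOFF_GET))      # same state echoed back
```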

J. State Transitions<br />

Changes from one state to another are called state<br />

transitions. Transitions may be instantaneous or execute over a<br />

period of time called the transition time. A state transition is<br />

likely to have an effect on the application layer behavior of a<br />

node.<br />

K. Bound States<br />

Relationships may exist between states whereby a change<br />

in one triggers a change in the other. Such a relationship is<br />

called a state binding. One state may be bound to multiple<br />

other states.<br />

For example, consider a light controlled by a dimmer<br />

switch. The light would possess the two states, Generic OnOff<br />

and Generic Level with each bound to the other. Reducing the<br />

brightness of the light until Generic Level has a value of zero<br />

(fully dimmed) results in Generic OnOff transitioning from On<br />

to Off.<br />
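The dimmer example can be sketched as follows. The 0-100 level range and the restore-to-full behaviour when switching On at level zero are illustrative choices, not behaviours taken from the specification:<br />

```python
class DimmableLight:
    """Toy element with Generic Level bound to Generic OnOff."""
    def __init__(self):
        self.level = 100   # brightness, 0..100 (illustrative range)
        self.on_off = 1    # 1 = On, 0 = Off

    def set_level(self, level):
        self.level = level
        # State binding: a change in Generic Level triggers a change
        # in the bound Generic OnOff state (fully dimmed => Off).
        self.on_off = 0 if level == 0 else 1

    def set_on_off(self, on_off):
        self.on_off = on_off
        if on_off and self.level == 0:
            self.level = 100   # illustrative: restore a usable brightness

light = DimmableLight()
light.set_level(0)
print(light.on_off)   # 0: dimming to zero switched the light Off
```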

L. Models<br />

Models pull the preceding concepts together and define<br />

some or all of the functionality of an element as it relates to the<br />

mesh network. Three categories of model are recognized.<br />

A server model defines a collection of states, state<br />

transitions, state bindings and messages which the element<br />

containing the model may send or receive. It also defines<br />

behaviors relating to messages, states and state transitions.<br />

A client model does not define any states. Instead, it defines<br />

the messages which it may send or receive in order to GET,<br />

SET or acquire the STATUS of states defined in the<br />

corresponding server model.<br />

Control models contain both a server model, allowing<br />

communication with other client models and a client model<br />

which allows communication with server models.<br />

Models may be created by extending other models. A<br />

model which is not extended is called a root model.<br />

Models are immutable, meaning that they may not be<br />

changed by adding or removing behaviors. The correct and<br />

only permissible approach to implementing new model<br />

requirements is to extend the existing model.<br />

M. Generics<br />

It is recognized that many different types of device, often<br />

have semantically equivalent states, as exemplified by the<br />

simple idea of ON vs OFF. Consider lights, fans and power<br />

sockets, all of which can be switched on or turned off.<br />

Consequently, the Bluetooth Mesh Model specification<br />

defines a series of reusable, generic states such as, for example,<br />

Generic OnOff and Generic Level.<br />

Similarly, a series of generic messages that operate on the<br />

generic states are defined. Examples include Generic OnOff<br />

Get and Generic Level Set.<br />



Generic states and generic messages are used in generalized<br />

models, both generic server models such as the Generic OnOff<br />

Server and Generic Client Models such as the Generic Level<br />

Client.<br />

Generics allow a wide range of device type to support<br />

Bluetooth Mesh without the need to create new models.<br />

Remember that models may be created by extending other<br />

models too. As such, generic models may form the basis for<br />

quickly creating models for new types of devices.<br />

Figure 5 - Generic Models<br />

N. Scenes<br />

A scene is a stored collection of states which may be<br />

recalled and made current by the receipt of a special type of<br />

message or at a specified time. Scenes are identified by a 16-bit<br />

Scene Number, which is unique within the mesh network.<br />

Scenes allow a series of nodes to be set to a given set of<br />

previously stored, complementary states in one coordinated<br />

action.<br />

Imagine that in the evening, you like the temperature in<br />

your main family room to be 20 degrees Celsius, the six LED<br />

downlights to be at a certain brightness level and the lamp in<br />

the corner of the room on the table set to a nice warm yellow<br />

hue. Having manually set the various nodes in this example<br />

scenario to these states, you can store them as a scene using a<br />

configuration application and recall the scene later on, either on<br />

demand by sending an appropriate, scene-related mesh<br />

message or automatically at a scheduled time.<br />
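The evening-scene example can be sketched as a store/recall of state snapshots keyed by a 16-bit Scene Number. The registry API and node state dictionaries are illustrative:<br />

```python
class SceneRegistry:
    """Toy scene store: snapshots of node states by 16-bit Scene Number."""
    def __init__(self, nodes):
        self.nodes = nodes      # name -> mutable state dict
        self.scenes = {}

    def store(self, scene_number):
        assert 1 <= scene_number <= 0xFFFF   # 16-bit, unique in the network
        # Deep-enough copy so later edits don't alter the stored scene.
        self.scenes[scene_number] = {n: dict(s) for n, s in self.nodes.items()}

    def recall(self, scene_number):
        # One coordinated action restores every stored state.
        for name, state in self.scenes[scene_number].items():
            self.nodes[name].update(state)

nodes = {"thermostat": {"target_c": 20},
         "downlights": {"level": 60},
         "table_lamp": {"hue": "warm yellow"}}
reg = SceneRegistry(nodes)
reg.store(0x0001)                      # capture the "evening" scene
nodes["downlights"]["level"] = 100     # daytime override
reg.recall(0x0001)                     # restore the whole scene at once
print(nodes["downlights"]["level"])    # 60
```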

O. Provisioning<br />

Provisioning is the process by which a device joins the<br />

mesh network and becomes a node. It involves several stages,<br />

results in various security keys being generated and is itself a<br />

secure process.<br />

Provisioning is accomplished using an application on a<br />

device such as a tablet. In this capacity, the device used to<br />

drive the provisioning process is referred to as the Provisioner.<br />

The provisioning process progresses through five steps and<br />

these are described next.<br />

Step 1. Beaconing<br />

In support of various Bluetooth mesh features, including but not limited to provisioning, new GAP AD types (ref: Bluetooth Core Specification Supplement) have been introduced. An unprovisioned device indicates its availability to be provisioned by using one of these new AD types in its advertising packets. The user might need to start a new device<br />

advertising in this way by, for example, pressing a combination<br />

of buttons or holding down a button for a certain length of<br />

time.<br />

Step 2. Invitation<br />

In this step, the Provisioner sends an invitation to the<br />

device to be provisioned, in the form of a Provisioning Invite<br />

PDU. The Beaconing device responds with information about<br />

itself in a Provisioning Capabilities PDU.<br />

Step 3. Exchanging Public Keys<br />

The Provisioner and the device to be provisioned exchange<br />

their public keys, which may be static or ephemeral, either<br />

directly or using an out-of-band (OOB) method.<br />

Step 4. Authentication<br />

During the authentication step, the device to be provisioned<br />

outputs a random, single or multi-digit number to the user in<br />

some form, using an action appropriate to its capabilities. For<br />

example, it might flash an LED several times. The user enters<br />

the digit(s) output by the new device into the Provisioner and a<br />

cryptographic exchange takes place between the two devices,<br />

involving the random number, to complete the authentication<br />

of each of the two devices to the other.<br />

Step 5. Distribution of the Provisioning Data<br />

After authentication has successfully completed, a session<br />

key is derived by each of the two devices from their private<br />

keys and the exchanged, peer public keys. The session key is<br />

then used to secure the subsequent distribution of the data<br />

required to complete the provisioning process, including a<br />

security key known as the network key (NetKey).<br />

After provisioning has completed, the provisioned device<br />

possesses the network’s NetKey, a mesh security parameter<br />

known as the IV Index and a Unicast Address, allocated by the<br />

Provisioner. It is now known as a node.<br />

P. Features<br />

All nodes can transmit and receive mesh messages but there<br />

are a number of optional features which a node may possess,<br />

giving it additional, special capabilities. There are four such<br />

optional features: the Relay, Proxy, Friend and Low Power<br />

features. A node may support zero or more of these optional<br />

features and any supported feature may, at a point in time, be<br />

enabled or disabled.<br />

Q. Relay Nodes<br />

Nodes which support the Relay feature, known as Relay<br />

nodes, are able to retransmit received messages. Relaying is the<br />

mechanism by which a message can traverse the entire mesh<br />

network, making multiple “hops” between devices by being<br />

relayed.<br />



Mesh network PDUs include a field called TTL (Time To<br />

Live). It takes an integer value and is used to limit the number<br />

of hops a message will make across the network. Setting TTL<br />

to 3, for example, will result in the message being relayed, a<br />

maximum number of three hops away from the originating<br />

node. Setting it to 0 will result in it not being relayed at all and<br />

only travelling a single hop. Armed with some basic<br />

knowledge of the topology and membership of the mesh, nodes<br />

can use the TTL field to make more efficient use of the mesh<br />

network.<br />
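The TTL behaviour described above can be sketched as follows. The dict-based PDU is a toy representation; per the description, a PDU with TTL 0 or 1 is not relayed further, and each relay hop decrements TTL:<br />

```python
def relay(pdu, node):
    """Forward a mesh PDU if the node's Relay feature allows it.

    pdu: toy dict with 'ttl' and 'payload'.
    Returns the PDU to retransmit, or None if it must not be relayed.
    """
    if not node.get("relay_enabled"):
        return None               # Relay feature unsupported or disabled
    if pdu["ttl"] <= 1:
        return None               # TTL 0 or 1: travels no further hops
    return dict(pdu, ttl=pdu["ttl"] - 1)   # decrement TTL per hop

node = {"relay_enabled": True}
hop = {"ttl": 3, "payload": b"toy message"}
hop = relay(hop, node); print(hop["ttl"])   # 2
hop = relay(hop, node); print(hop["ttl"])   # 1
print(relay(hop, node))                     # None: no further relaying
```

Starting with TTL 3 gives at most three hops from the originator, matching the example in the text.<br />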

R. Low Power Nodes and Friend Nodes<br />

Some types of node have a limited power source and need<br />

to conserve energy as much as possible. Furthermore, devices<br />

of this type may be predominantly concerned with sending<br />

messages but still have a need to occasionally receive<br />

messages.<br />

Consider a temperature sensor which is powered by a small<br />

coin cell battery. It sends a temperature reading once per<br />

minute whenever the temperature is above or below configured<br />

upper and lower thresholds. If the temperature stays within<br />

those thresholds it sends no messages. These behaviors are<br />

easily implemented with no particular issues relating to power<br />

consumption arising.<br />

However, the user is also able to send messages to the<br />

sensor, which change the temperature threshold state values.<br />

This is a relatively rare event, but the sensor must support it.<br />

The need to receive messages has implications for duty cycle<br />

and as such power consumption. A 100% duty cycle would<br />

ensure that the sensor did not miss any temperature threshold<br />

configuration messages, but would use a prohibitive amount of power.<br />

A low duty cycle would conserve energy, but risk the sensor<br />

missing configuration messages.<br />

The answer to this apparent conundrum is the Friend node<br />

and the concept of “friendship”.<br />

Nodes like the temperature sensor in the example may be<br />

designated Low Power nodes (LPNs) and a feature flag in the<br />

sensor’s configuration data will designate the node as such.<br />

LPNs work in tandem with another node, one which is not<br />

power-constrained (e.g. it has a permanent AC power source).<br />

This device is termed a Friend node. The Friend stores<br />

messages addressed to the LPN and delivers them to the LPN<br />

whenever the LPN polls the Friend node for “waiting<br />

messages”. The LPN may poll the Friend relatively<br />

infrequently so that it can balance its need to conserve power<br />

with the timeliness with which it needs to receive and process<br />

configuration messages. When it does poll, all messages stored<br />

by the Friend are forwarded to the LPN, one after another, with<br />

a flag known as MD (More Data) indicating to the LPN<br />

whether there are further messages to be sent from the Friend<br />

node.<br />

The relationship between the LPN and the Friend node is<br />

known as friendship. Friendship is key to allowing very power<br />

constrained nodes which need to receive messages, to function<br />

in a Bluetooth mesh network whilst continuing to operate in a<br />

power-efficient way.<br />
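The store-and-poll exchange can be sketched with a simple message queue. The API below is illustrative, not the actual Friend/LPN protocol PDUs:<br />

```python
from collections import deque

class FriendNode:
    """Toy Friend node: stores messages for an LPN until it polls."""
    def __init__(self):
        self.queue = deque()

    def store(self, message):
        # Messages addressed to the sleeping LPN are queued here.
        self.queue.append(message)

    def poll(self):
        """Return (message, more_data): MD flags further waiting messages."""
        if not self.queue:
            return None, False
        msg = self.queue.popleft()
        return msg, len(self.queue) > 0

friend = FriendNode()
friend.store("set upper threshold 28C")
friend.store("set lower threshold 16C")

# The LPN wakes, then polls until the MD flag says nothing more waits.
more = True
while more:
    msg, more = friend.poll()
    print(msg, more)
```

Because the LPN chooses how often to poll, it trades message latency against battery life, exactly the balance described above.<br />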

S. Proxy Nodes<br />

There are an enormous number of devices in the world that<br />

support Bluetooth LE, most smartphones and tablets being<br />

amongst them. In-market Bluetooth devices, at the time<br />

Bluetooth mesh was adopted, did not possess a Bluetooth mesh<br />

networking stack. They do possess a Bluetooth LE stack,<br />

however, and therefore have the ability to connect to other<br />

devices and interact with them using GATT, the Generic<br />

Attribute Profile.<br />

Proxy nodes expose a GATT interface, which Bluetooth LE<br />

devices may use to interact with a mesh network. A protocol<br />

called the Proxy Protocol is defined, intended to be used with a<br />

connection-oriented bearer such as GATT. GATT<br />

devices read and write Proxy Protocol PDUs from within<br />

GATT characteristics implemented by the Proxy node. The<br />

Proxy node transforms these PDUs to / from mesh PDUs.<br />

In summary, Proxy nodes allow Bluetooth LE devices that<br />

do not possess a Bluetooth mesh stack to interact with nodes in<br />

a mesh network.<br />

Figure 6 - Smartphone communicating via a mesh proxy node<br />

T. Node Configuration<br />

Each node supports a standard set of configuration states<br />

which are implemented within the standard Configuration<br />

Server Model and accessed using the Configuration Client<br />

Model. Configuration state data is concerned with the node’s<br />

capabilities and behavior within the mesh, independently of<br />

any specific application or device type behaviors.<br />

For example, the features supported by a node, whether it is<br />

a Proxy node, a Relay node and so on, are indicated by<br />

Configuration Server states. The addresses to which a node has<br />

subscribed are stored in the Subscription List. The network and<br />

subnet keys indicating the networks the node is a member of<br />

are listed in the configuration block, as are the application keys<br />

held by the node.<br />

A series of configuration messages allow the Configuration<br />

Client Model and Configuration Server Model to support GET,<br />



SET and STATUS operations on the Configuration Server<br />

Model states.<br />

IV. THE MESH SYSTEM ARCHITECTURE<br />

A. Overview<br />

In this section we’ll take a closer look at the Bluetooth<br />

mesh architecture, its layers and their respective<br />

responsibilities. We’ll also position the mesh architecture<br />

relative to the Bluetooth LE core architecture.<br />

Figure 7 shows the mesh architecture.<br />

Figure 7 - The Bluetooth mesh architecture<br />

At the bottom of the mesh architecture stack, we have a<br />

layer entitled Bluetooth LE. In fact, this is more than just a<br />

single layer of the mesh architecture; it’s the full Bluetooth LE<br />

stack, which provides the fundamental wireless<br />

communications capabilities leveraged by the mesh<br />

architecture that sits on top of it. It should be clear that the<br />

mesh system is dependent upon the availability of a Bluetooth<br />

LE stack.<br />

We’ll now review each layer of the mesh architecture,<br />

working our way up from the bottom layer.<br />

B. Bearer Layer<br />

Mesh messages require an underlying communications<br />

system for their transmission and receipt. The bearer layer<br />

defines how mesh PDUs will be handled by a given<br />

communications system. At this time, two bearers are defined<br />

and these are called the Advertising Bearer and the GATT<br />

Bearer.<br />

The Advertising Bearer leverages Bluetooth LE’s GAP<br />

advertising and scanning features to convey and receive mesh<br />

PDUs.<br />

The GATT Bearer allows a device which does not support<br />

the Advertising Bearer to communicate indirectly with nodes<br />

of a mesh network which do, using a protocol known as the<br />

Proxy Protocol. The Proxy Protocol is encapsulated within<br />

GATT operations involving specially defined GATT<br />

characteristics. A mesh Proxy node implements these GATT<br />

characteristics and supports the GATT bearer as well as the<br />

Advertising Bearer so that it can convert and relay messages<br />

between the two types of bearer.<br />

C. Network Layer<br />

The network layer defines the various message address<br />

types and a network message format which allows transport<br />

layer PDUs to be transported by the bearer layer.<br />

It can support multiple bearers, each of which may have<br />

multiple network interfaces, including the local interface which<br />

is used for communication between elements that are part of<br />

the same node.<br />

The network layer determines which network interface(s) to<br />

output messages over. An input filter is applied to messages<br />

arriving from the bearer layer, to determine whether or not they<br />

should be delivered to the network layer for further processing.<br />

Output messages are subject to an output filter to control<br />

whether or not they are dropped or delivered to the bearer<br />

layer.<br />

The Relay and Proxy features may be implemented by the<br />

Network Layer.<br />

D. Lower Transport Layer<br />

The lower transport layer takes PDUs from the upper<br />

Transport Layer and sends them to the lower transport layer on<br />

a peer device. Where required, it performs segmentation and<br />

reassembly of PDUs. For longer packets, which will not fit into<br />

a single Transport PDU, the lower transport layer will perform<br />

segmentation, splitting the PDU into multiple Transport PDUs.<br />

The receiving lower transport layer on the other device, will<br />

reassemble the segments into a single upper transport layer<br />

PDU and pass this up the stack.<br />
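The segmentation and reassembly just described can be illustrated with a short sketch. The segment size and tuple layout here are assumptions for the example, not the values or field encodings from the mesh specification; only the field names SegO and SegN are borrowed from the text's concept of numbered segments:

```python
# Illustrative sketch of lower transport segmentation and reassembly.
# Sizes and tuple layout are assumptions for the example, not the
# encodings defined by the mesh specification.

SEGMENT_SIZE = 12  # bytes of upper transport payload per segment (assumed)

def segment(pdu: bytes):
    """Split an upper transport PDU into numbered lower transport segments."""
    chunks = [pdu[i:i + SEGMENT_SIZE] for i in range(0, len(pdu), SEGMENT_SIZE)]
    total = len(chunks)
    # each entry: (segment offset SegO, last offset SegN, payload)
    return [(seg_o, total - 1, payload) for seg_o, payload in enumerate(chunks)]

def reassemble(segments):
    """Rebuild the upper transport PDU; segments may arrive out of order."""
    seg_n = segments[0][1]
    by_offset = {seg_o: payload for seg_o, _, payload in segments}
    assert set(by_offset) == set(range(seg_n + 1)), "missing segment"
    return b"".join(by_offset[i] for i in range(seg_n + 1))

original = bytes(range(30))
segs = segment(original)
assert len(segs) == 3
assert reassemble(list(reversed(segs))) == original  # order-independent
```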

E. Upper Transport Layer<br />

The upper transport layer is responsible for the encryption,<br />

decryption and authentication of application data passing to<br />

and from the access layer. It also has responsibility for<br />

transport control messages, which are internally generated and<br />

sent between the upper transport layers on different peer nodes.<br />

These include messages related to friendship and heartbeats.<br />

F. Access Layer<br />

The access layer is responsible for defining how<br />

applications can make use of the upper transport layer. This<br />

includes:<br />

- defining the format of application data.<br />

- defining and controlling the encryption and decryption<br />

process which is performed in the upper transport layer.<br />

- verifying that data received from the upper transport layer<br />

is for the right network and application, before forwarding the<br />

data up the stack.<br />

G. Foundation Model Layer<br />

The foundation model layer is responsible for the<br />

implementation of those models concerned with the<br />

configuration and management of a mesh network.<br />



H. Model Layer<br />

The Model Layer is concerned with the implementation of<br />

Models and as such, the implementation of behaviors,<br />

messages, states, state bindings and so on, as defined in one or<br />

more model specifications.<br />

V. SECURITY<br />

A. Mesh Security is Mandatory<br />

Bluetooth LE allows the profile designer to exploit a range<br />

of different security mechanisms, from the various approaches<br />

to pairing that are possible, to individual security requirements<br />

associated with individual characteristics. Security is, in fact,<br />

totally optional, and it’s permissible to have a device which is<br />

completely open with no security protections or constraints in<br />

place. The device designer or manufacturer is responsible for<br />

analyzing threats and determining the security requirements<br />

and solutions for their product.<br />

In contrast, in Bluetooth Mesh, security is mandatory. The<br />

network, individual applications and devices are all secure and<br />

this cannot be switched off or reduced in any way.<br />

Figure 8 - security is central to Bluetooth mesh networking<br />

B. Mesh Security Fundamentals<br />

The following fundamental security statements apply to all<br />

Bluetooth mesh networks:<br />

1. All mesh messages are encrypted and authenticated.<br />

2. Network security, application security and device<br />

security are addressed independently. See “Separation<br />

of Concerns” below.<br />

3. Security keys can be changed during the life of the<br />

mesh network via a Key Refresh procedure.<br />

4. Message obfuscation makes it difficult to track<br />

messages sent within the network, providing a privacy<br />

mechanism that protects nodes from being tracked.<br />

5. Mesh security protects the network against replay<br />

attacks.<br />

6. The process by which devices are added to the mesh<br />

network to become nodes is, itself, a secure process.<br />

7. Nodes can be removed from the network securely, in a<br />

way which prevents trashcan attacks.<br />

C. Separation of Concerns and Mesh Security Keys<br />

At the heart of Bluetooth Mesh security are three types of<br />

security key. Between them, these keys provide security to<br />

different aspects of the mesh and achieve a critical capability in<br />

mesh security, that of “separation of concerns”.<br />

To understand this and appreciate its significance, consider<br />

a mesh light which can act as a relay. In its capacity as a relay,<br />

it may find itself handling messages relating to the building’s<br />

Bluetooth mesh door and window security system. A light has<br />

no business being able to access and process the details of such<br />

messages, but does need to relay them to other nodes.<br />

To deal with this potential conflict of interest, the mesh<br />

uses different security keys for securing messages at the<br />

network layer from those used to secure data relating to<br />

specific applications such as lighting, physical security, heating<br />

and so on.<br />

All nodes in a mesh network possess a network key<br />

(NetKey). Indeed, it is possession of this shared key which<br />

makes a node a member of the network. A network encryption<br />

key and a privacy key are derived directly from the NetKey.<br />

Being in possession of the NetKey allows a node to decrypt<br />

and authenticate up to the Network Layer so that network<br />

functions such as relaying, can be carried out. It does not allow<br />

application data to be decrypted.<br />
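The derivation pattern described here can be sketched as follows. Note that this is illustrative only: the mesh specification derives the network encryption key and privacy key from the NetKey with its AES-CMAC-based k2 function, whereas this sketch substitutes HMAC-SHA256 purely to show the idea of distinct, deterministically derived subkeys:

```python
# Illustrative only: the mesh specification uses an AES-CMAC-based
# derivation (k2); HMAC-SHA256 stands in here to show the pattern of
# deriving distinct subkeys from one shared NetKey.

import hmac
import hashlib

def derive(net_key: bytes, label: bytes) -> bytes:
    """Derive a 128-bit subkey from the NetKey for a given role."""
    return hmac.new(net_key, label, hashlib.sha256).digest()[:16]

net_key = bytes(16)  # placeholder 128-bit NetKey
encryption_key = derive(net_key, b"encryption")
privacy_key = derive(net_key, b"privacy")

# Distinct roles yield distinct keys, yet every node holding the same
# NetKey deterministically derives the same pair.
assert encryption_key != privacy_key
assert derive(net_key, b"privacy") == privacy_key
```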

The network may be subdivided into subnets and each<br />

subnet has its own NetKey, which is possessed only by those<br />

nodes which are members of that subnet. This might be used,<br />

for example, to isolate specific, physical areas, such as each<br />

room in a hotel.<br />

Application data for a specific application can only be<br />

decrypted by nodes which possess the right application key<br />

(AppKey). Across the nodes in a mesh network, there may be<br />

many distinct AppKeys, but typically, each AppKey will only<br />

be possessed by a small subset of the nodes, namely those of a<br />

type which can participate in a given application. For example,<br />

lights and light switches would possess the lighting<br />

application’s AppKey but not the AppKey for the heating<br />

system, which would only be possessed by thermostats, valves<br />

on radiators and so on.<br />

AppKeys are used by the upper transport layer to decrypt<br />

and authenticate messages before passing them up to the access<br />

layer.<br />

AppKeys are associated with only one NetKey. This<br />

association is termed “key binding” and means that specific<br />

applications, as defined by possession of a given AppKey, can<br />

only work on one specific network, whereas a network can host<br />

multiple, independently secure applications.<br />

The final key type is the device key (DevKey). This is a<br />

special type of application key. Each node has a unique<br />

DevKey known to the Provisioner device and no other. The<br />



DevKey is used in the provisioning process to secure<br />

communication between the Provisioner and the node.<br />

D. Node Removal, Key Refresh and Trashcan Attacks<br />

As described above, nodes contain various mesh security<br />

keys. Should a node become faulty and need to be disposed of,<br />

or if the owner decides to sell the node to another owner, it’s<br />

important that the device and the keys it contains cannot be<br />

used to mount an attack on the network the node was taken<br />

from.<br />

A procedure for removing a node from a network is<br />

defined. The Provisioner application is used to add the node to<br />

a black list and then a process called the Key Refresh<br />

Procedure is initiated.<br />

The Key Refresh Procedure results in all nodes in the<br />

network, except those which are members of the blacklist,<br />

being issued with new network keys, application keys and<br />

all related, derived data. In other words, the entire set of<br />

security keys which form the basis for network and application<br />

security are replaced.<br />

As such, the node which was removed from the network,<br />

and which contains an old NetKey and an old set of AppKeys,<br />

is no longer a member of the network and poses no threat.<br />
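The essence of the procedure can be sketched in a few lines (an illustrative model only; the real Key Refresh Procedure also re-derives application keys and related data, and distributes keys over the secured mesh itself):

```python
# Sketch of the Key Refresh idea: new keys go to every node except
# those on the blacklist, so a removed node's old keys become useless.

import secrets

def key_refresh(nodes, blacklist):
    """Issue a fresh NetKey to all nodes not on the blacklist."""
    new_net_key = secrets.token_bytes(16)
    for node in nodes:
        if node["address"] not in blacklist:
            node["net_key"] = new_net_key
    return new_net_key

nodes = [{"address": a, "net_key": b"old_key_0123456!"} for a in (1, 2, 3)]
fresh = key_refresh(nodes, blacklist={2})

assert nodes[0]["net_key"] == fresh and nodes[2]["net_key"] == fresh
assert nodes[1]["net_key"] == b"old_key_0123456!"  # removed node keeps only stale keys
```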

E. Privacy<br />

A privacy key, derived from the NetKey is used to<br />

obfuscate network PDU header values, such as the source<br />

address. Obfuscation ensures that casual, passive<br />

eavesdropping cannot be used to track devices and the people<br />

that use them. It also makes attacks based upon traffic analysis<br />

difficult.<br />

The degree of privacy offered by this technique is fit for<br />

purpose.<br />

F. Replay Attacks<br />

In network security, a replay attack is a technique whereby<br />

an eavesdropper intercepts and captures one or more messages<br />

and simply retransmits them later, with the goal of tricking the<br />

recipient into carrying out something which the attacking<br />

device is not authorized to do. An example, commonly cited, is<br />

that of a car’s keyless entry system being compromised by an<br />

attacker who intercepts the authentication sequence between the<br />

car’s owner and the car, and later replays those messages to<br />

gain entry to the car and steal it.<br />

Bluetooth Mesh has protection against replay attacks. The<br />

basis for this protection is the use of two network PDU fields<br />

called the Sequence Number (SEQ) and the IV Index.<br />

Elements increment the SEQ value every time they publish a<br />

message. A node receiving a message from an element which<br />

contains a SEQ value less than or equal to that which was in<br />

the last valid message will discard it, since it is likely that it<br />

relates to a replay attack. IV Index is a separate field,<br />

considered alongside SEQ. IV Index values within messages<br />

from a given element must always be equal to or greater than<br />

the last valid message from that element.<br />
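The replay check just described reduces to a simple comparison of (IV Index, SEQ) pairs per source element, which can be sketched directly (source addresses and values below are arbitrary examples):

```python
# Sketch of the replay check described above: a message is accepted only
# if its (IV Index, SEQ) pair is strictly newer than the last valid pair
# seen from that source element.

def is_fresh(last_seen, src, iv_index, seq):
    """Return True and record the message if it is not a replay."""
    prev = last_seen.get(src)
    if prev is not None:
        prev_iv, prev_seq = prev
        if iv_index < prev_iv or (iv_index == prev_iv and seq <= prev_seq):
            return False  # replayed or stale: discard
    last_seen[src] = (iv_index, seq)
    return True

last_seen = {}
assert is_fresh(last_seen, src=0x0001, iv_index=0, seq=5)
assert not is_fresh(last_seen, src=0x0001, iv_index=0, seq=5)  # exact replay
assert not is_fresh(last_seen, src=0x0001, iv_index=0, seq=4)  # older SEQ
assert is_fresh(last_seen, src=0x0001, iv_index=1, seq=0)      # new IV Index
```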

VI. BLUETOOTH MESH IN ACTION<br />

A. Message Publication and Delivery<br />

A network which uses Wi-Fi is based around a central<br />

network node called a router, and all network traffic passes<br />

through it. If the router is unavailable, the whole network<br />

becomes unavailable.<br />

In contrast, Bluetooth Mesh uses a technique known as<br />

managed flooding to deliver messages. Messages, when<br />

published by a node, are broadcast rather than being routed<br />

directly to one or more specific nodes. All nodes receive all<br />

messages from nodes that are in direct radio range and, if<br />

configured to do so, will then relay received messages.<br />

Relaying involves broadcasting the received message again, so<br />

that other nodes, more distant from the originating node, might<br />

receive the message broadcast.<br />

B. Multipath Delivery<br />

An important consequence of Bluetooth’s use of managed<br />

flooding is that messages arrive at their destination via multiple<br />

paths through the network. This makes for a highly reliable<br />

network and it is the primary reason for having opted to use a<br />

flooding approach rather than routing in the design of<br />

Bluetooth mesh networking.<br />

C. Managed Flooding<br />

Bluetooth mesh networking leverages the strengths of the<br />

flooding approach and optimises its operation such that it is<br />

both reliable and efficient. The measures which optimise the<br />

way flooding works in Bluetooth mesh networking are behind<br />

the use of the term “managed flooding”. Those measures are as<br />

follows:<br />

i) Heartbeats<br />

Heartbeat messages are transmitted by nodes periodically.<br />

A heartbeat message indicates to other nodes in the network<br />

that the node sending the heartbeat is still active. In addition,<br />

heartbeat messages contain data which allows receiving nodes<br />

to determine how far away the sender is, in terms of the<br />

number of hops required to reach it. This knowledge can be<br />

exploited with the TTL field.<br />
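The hop-count estimate works because a heartbeat carries the TTL it was originally sent with, and each relay decrements the TTL in transit. A minimal sketch of the arithmetic:

```python
# Sketch of how a heartbeat lets a receiver estimate distance in hops:
# the heartbeat carries its initial TTL, each relay decrements TTL, and
# the receiver compares the two.

def hops_between(init_ttl: int, received_ttl: int) -> int:
    """Hops from sender to receiver, per the scheme described above."""
    return init_ttl - received_ttl + 1

# A heartbeat sent with TTL 7 that arrives with TTL 5 crossed two
# relays, i.e. three hops in total.
assert hops_between(7, 5) == 3
# A direct neighbour receives it with the TTL unchanged: one hop.
assert hops_between(7, 7) == 1
```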

ii) TTL<br />

TTL (Time To Live) is a field which all Bluetooth mesh<br />

PDUs include. It controls the maximum number of hops, over<br />

which a message is relayed. Setting the TTL allows nodes to<br />

exercise control over relaying and conserve energy, by<br />

ensuring messages are not relayed further than is required.<br />

Heartbeat messages allow nodes to determine what the<br />

optimum TTL value should be for each message published.<br />

iii) Message Cache<br />

A network message cache must be implemented by all<br />

nodes. The cache contains all recently seen messages and if a<br />

message is found to be in the cache, indicating the node has<br />

seen and processed it before, it is immediately discarded.<br />
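The cache check and the TTL decrement combine into a single relay decision, which can be sketched as follows (an illustrative model; real nodes cache a bounded set of recent message identifiers rather than an unbounded set):

```python
# Sketch combining two managed-flooding measures described above: the
# message cache (drop anything seen before) and the TTL decrement that
# bounds how far a message is relayed.

class RelayNode:
    def __init__(self):
        self.cache = set()  # identifiers of recently seen messages

    def handle(self, msg_id, ttl):
        """Return the TTL to relay with, or None to drop the message."""
        if msg_id in self.cache:
            return None       # already seen and processed: discard
        self.cache.add(msg_id)
        if ttl <= 1:
            return None       # TTL exhausted: process locally, do not relay
        return ttl - 1        # relay onward with decremented TTL

node = RelayNode()
assert node.handle("msg-1", ttl=3) == 2     # first sighting: relay with TTL 2
assert node.handle("msg-1", ttl=3) is None  # cached duplicate: dropped
assert node.handle("msg-2", ttl=1) is None  # TTL exhausted: not relayed
```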



iv) Friendship<br />

Probably the most significant optimisation mechanism in a<br />

Bluetooth mesh network is provided by the combination of<br />

Friend nodes and Low Power nodes. As described, Friend<br />

nodes provide a message store and forward service to<br />

associated Low Power nodes. This allows Low Power nodes to<br />

operate in a highly energy-efficient manner.<br />

D. Traversing the Stack<br />

A node, receiving a message, passes it up the stack from the<br />

underlying Bluetooth LE stack via the bearer layer to the<br />

network layer.<br />

The network layer applies various checks to decide whether<br />

or not to pass the message higher up the stack or to discard it.<br />

In addition, PDUs have a Network ID field, which provides<br />

a fast way to determine which NetKey the message was<br />

encrypted with. If the Network ID is not recognized by the network<br />

layer on the receiving node, this indicates it does not possess<br />

the corresponding NetKey, is not a member of that subnet and<br />

so the PDU is discarded. There’s also a network message<br />

integrity check (MIC) field. If the MIC check fails, using the<br />

NetKey corresponding to the PDU’s Network ID, then the<br />

message is discarded.<br />

Messages are received by all nodes in range of the node<br />

that sent the messages, but many will be quickly discarded<br />

when it becomes apparent they are not relevant to this node due<br />

to the network or subnet(s) it belongs to.<br />

The same principle is applied higher up the stack in the<br />

upper transport layer. Here though, the check is against the<br />

AppKey associated with the message, and identified by an<br />

application identifier (AID) field in the PDU. If the AID is<br />

unrecognized by this node, the PDU is discarded by the upper<br />

transport layer. If the transport message integrity check<br />

(TransMIC) fails, the message is discarded.<br />
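The sequence of checks in this section amounts to a short filter chain, sketched below. The key identifiers and MIC results are simplified placeholders; the real checks involve the cryptographic operations described earlier, not boolean flags:

```python
# Sketch of the checks described in this section: a receiving node
# discards a PDU as soon as a key lookup or integrity check fails.
# Identifiers and MIC results are simplified stand-ins.

def process_pdu(pdu, known_net_ids, known_app_ids):
    """Return where the PDU is accepted, or why it is dropped."""
    if pdu["network_id"] not in known_net_ids:
        return "dropped: unknown NetKey (not a member of this subnet)"
    if not pdu["net_mic_ok"]:       # stand-in for the network MIC check
        return "dropped: network MIC failure"
    if pdu["aid"] not in known_app_ids:
        return "dropped: unknown AppKey"
    if not pdu["trans_mic_ok"]:     # stand-in for the TransMIC check
        return "dropped: TransMIC failure"
    return "delivered to access layer"

pdu = {"network_id": 0x1A, "net_mic_ok": True, "aid": 0x05, "trans_mic_ok": True}
assert process_pdu(pdu, {0x1A}, {0x05}) == "delivered to access layer"
assert process_pdu(pdu, {0x2B}, {0x05}).startswith("dropped: unknown NetKey")
```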

VII. BLUETOOTH MESH - NEW FRONTIERS<br />

This paper should have provided the reader with an<br />

introduction to Bluetooth Mesh, its key capabilities, concepts<br />

and terminology. It’s Bluetooth but not as we know it. It’s a<br />

Bluetooth technology that supports a new way for devices to<br />

communicate using a new topology.<br />

Most of all, it’s Bluetooth that makes this most pervasive of<br />

low-power wireless technologies a perfect fit for a whole new<br />

collection of use cases and industry sectors.<br />

REFERENCES<br />

[1] Bluetooth SIG, Bluetooth Mesh Specification<br />

See https://www.bluetooth.com/specifications/adopted-specifications<br />

[2] Bluetooth SIG, Bluetooth Mesh Model Specification<br />

See https://www.bluetooth.com/specifications/adopted-specifications<br />

[3] Bluetooth SIG, Bluetooth 5 Core Specification<br />

See https://www.bluetooth.com/specifications/adopted-specifications<br />

[4] Bluetooth SIG, Bluetooth Core Specification Supplement<br />

See https://www.bluetooth.com/specifications/adopted-specifications<br />



Bluetooth Low Energy Solar Beacon as IoT Enabler<br />

Cecilia Höffler, Tobias Gemmeke<br />

Institute of Integrated Digital Systems and Circuit Design<br />

RWTH Aachen University<br />

Aachen, Germany<br />

hoeffler@ids.rwth-aachen.de<br />

This paper focuses on the accuracy issue of indoor navigation<br />

systems. Initially, key effects of RF signal transmission will be<br />

reviewed. On this basis, the received signal strength indicator<br />

(RSSI) based indoor navigation methods like triangulation and<br />

finger printing will be analyzed and their insufficiencies will be<br />

examined. This includes the RSSI deviation due to static and<br />

dynamic effects in the surroundings. A critical distance of RF<br />

sources for a high accuracy indoor positioning will be assessed.<br />

With this, the earlier mentioned inaccuracies can be<br />

significantly reduced. Finally, a hardware-based blueprint will<br />

be provided, which enables the deployment of a significant<br />

number of RF sources to realize the critical distance down to<br />

one RF source / m².<br />

Keywords— Bluetooth Smart; BLE; Bluetooth Low Energy;<br />

solar beacon sticker; Indoor Navigation; RSSI variances; heat map;<br />

triangulation<br />

I. INTRODUCTION<br />

Enabling the interaction between human beings and any kind of<br />

objects is a growing desire today. Starting with the<br />

‘Internet of Things’ (IoT), people now even refer to the<br />

‘Internet-of-Everything’ (IoE), where any physical object is<br />

linked with the digital world. One key enabler is<br />

linking a real-world scenario, such as the geometry of a room, with a<br />

digital navigation system.<br />

The focus of this paper will be on indoor navigation. The most<br />

common indoor navigation methods are based on triangulation<br />

or fingerprinting algorithms. Triangulation does not need a<br />

large database and is therefore very fast. But it suffers from a<br />

low accuracy compared to fingerprinting, due to the high<br />

dependence on the room geometry. To address the related<br />

short-coming there have been quite elaborate approaches for<br />

fingerprinting with a focus on static environments [1]. Since in<br />

a real-world set-up, dynamic effects (like the unpredictable<br />

movement of people) have to be considered, this kind of<br />

approach still lacks accuracy.<br />

Currently multiple RF technologies enable these indoor<br />

navigation solutions. The most widespread ones are based on<br />

Wi-Fi and BLE (Bluetooth Low Energy) [2]. These<br />

technologies need sensors and tags. Existing implementations<br />

of such active tags suffer from at least one of the following:<br />

form factor, range, cost of maintenance or cost of installation.<br />

These hurdles have prevented the many available<br />

technologies, such as active RFID, Bluetooth beacons and Wi-Fi<br />

sensors, from achieving a massive breakthrough in the IoT.<br />

To overcome these barriers, an ideal tag would be unobtrusive<br />

while enhancing the original purpose of the enabled object.<br />

Such a tag, being maintenance-free, can, for example, reduce<br />

labor cost by issuing usage-based preventive check-ups. In<br />

addition to this, it enables efficiency increases by providing<br />

detailed insights into the value generating processes. Based on<br />

our assessment with a fully integrated System-on-Chip (SoC)<br />

stripped to the bare minimum of necessary functionality, two<br />

differentiators could be achieved at the same time: a minimized<br />

Bill-of-Materials (BoM) and a self-sustained wireless IoT<br />

system purely powered by energy harvesting.<br />

The core IP for this SoC is presented in [3]. It allows the<br />

manufacturing of an ultra-low power, secured ‘transmit-only<br />

radio tag’ that combines subthreshold operation in the digital<br />

baseband and control processor, and an ultra-low power radio<br />

front-end. This SoC is intended to be powered by a single<br />

organic photovoltaic cell allowing the implementation of a<br />

flexible, bendable, sticker sized active Bluetooth beacon. This<br />

actually means that we are capable of combining the advantages<br />

of passive tags (zero maintenance and a small form factor) with<br />

the advantages of active tags (extended range and significant<br />

transmit power). By adopting the BLE standard this technology<br />

can easily be exploited with standard technologies as integrated<br />

in common smartphones and mobile devices to connect users<br />

with the capabilities of the digital world and the richness of their<br />

physical environment.<br />

II. INDOOR NAVIGATION METHODS<br />

Indoor navigation methods considered in this paper are<br />

triangulation and fingerprinting. Their theoretical approach will<br />

be briefly introduced. Furthermore, the limitations of these<br />

methods will be discussed based on simulation results and data<br />

taken from experimental measurements in real-life scenarios.<br />

The simulation is broken down into two steps. First, an ideal RF<br />

source will be simulated in a rectangular space in Matlab.<br />

Second, for a more realistic approach, a simulation is run in<br />

WinProp, a software suite for wave propagation and radio<br />

network planning. Finally, the results will be validated in an<br />

experimental set up, which is identical to the scenario assessed<br />

in the simulation in WinProp.<br />

In our scenario, the doors consist of glass with a thickness of<br />

10 cm and are 94 cm wide. The wall itself consists of brick and<br />

the ceiling is 2.45 m high. Two concrete columns are located in<br />

the middle of the room. The floorplan on the left of Figure 1<br />

shows the basic room without brick columns.<br />



The triangulation approach suffers various short-comings.<br />

Firstly, there is a high dynamic variance of the RSSI values of<br />

the individual RF sources as is highlighted in Figure 3 for<br />

different times and beacons of the identical type and transmit<br />

power setting.<br />

Figure 1 The lab set up (6 m wide and 9.5 m long) with brick walls and glass<br />

doors with a width of 0.94 m and wooden ceiling with a height of 2.45 m<br />

and 2 brick columns<br />

This simple scenario is used for the simulation in Matlab to<br />

show the reflection in an ideal indoor set up. The WinProp<br />

simulation and the experimental results are based on the<br />

described lab set up as shown the floorplan on the right of<br />

Figure 1. For the experimental validation five Minew i6 Sticker<br />

MiniBeacons were used as RF sources and placed at well-defined<br />

positions. The beacons advertised at 2.4 GHz with<br />

a transmit power P_t of 0 dBm. As receiver, we used a commercially available<br />

smartphone (Samsung A5) to capture all effects of the whole<br />

transmission channel. The receiver was moved in discrete steps<br />

away from the transmitters, lying on a line of sight<br />

orthogonal to the wall.<br />

III. RSSI-BASED LOCALIZATION<br />

A. Triangulation<br />

In free space, the power of the received signal P_r at a certain<br />

distance d to the signal source can be calculated based on the<br />

Friis formula [4] described in (1), with the transmission power<br />

P_t, the antenna directivities of the receiver and transmitter D_r<br />

and D_t, and the wavelength of the signal λ, as follows:<br />

P_r = P_t + D_r + D_t + 20 log10(λ / (4πd)) (1)<br />
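Inverting equation (1) gives the distance estimate that RSSI-based triangulation relies on, d = (λ / 4π) · 10^((P_t + D_r + D_t − P_r) / 20). A short sketch with a round-trip check (parameter names are ours; all powers and directivities in dB):

```python
# Inverting the Friis formula (1) to estimate distance from a received
# power (RSSI) value. Powers in dBm, directivities in dB, distance in m.

import math

def distance_from_rssi(p_r, p_t, d_r=0.0, d_t=0.0, wavelength=0.125):
    """Free-space distance implied by received power p_r, per equation (1)."""
    return (wavelength / (4 * math.pi)) * 10 ** ((p_t + d_r + d_t - p_r) / 20)

# Round trip: at d = 1 m, 0 dBm transmit power and isotropic antennas,
# equation (1) predicts P_r = 20*log10(0.125 / (4*pi)), about -40 dBm.
p_r = 0.0 + 20 * math.log10(0.125 / (4 * math.pi * 1.0))
assert abs(distance_from_rssi(p_r, p_t=0.0) - 1.0) < 1e-9
```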

Based on this equation the RSSI value can be used to calculate<br />

the exact distance to a certain RF source. The position of the<br />

receiver is determined as shown in Figure 2 with at least three<br />

received RF sources and their known transmit power level.<br />
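With three anchor positions and their estimated distances, the position fix reduces to intersecting circles; subtracting the circle equations pairwise yields a linear system for (x, y). A minimal sketch (anchor coordinates below are arbitrary example values loosely matching the lab room dimensions):

```python
# Sketch of the position fix from three RF sources: subtracting the
# circle equations (x-xi)^2 + (y-yi)^2 = di^2 pairwise linearizes the
# problem, leaving a 2x2 system for (x, y).

def trilaterate(anchors, distances):
    """Solve for (x, y) from three (xi, yi) anchors and distances di."""
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = distances
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

# Beacons at three corners of a 6 m x 9.5 m room, receiver at (2, 3):
anchors = [(0.0, 0.0), (6.0, 0.0), (0.0, 9.5)]
true_pos = (2.0, 3.0)
dists = [((x - true_pos[0])**2 + (y - true_pos[1])**2) ** 0.5 for x, y in anchors]
x, y = trilaterate(anchors, dists)
assert abs(x - 2.0) < 1e-6 and abs(y - 3.0) < 1e-6
```

In practice, the distances come from equation (1) and therefore carry the RSSI errors discussed next, which is why the recovered position is far less exact than in this ideal example.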

Figure 3 RSSI value measurement of different beacons on the same position<br />

More specifically, the RSSI values are shown for a distance of<br />

1 m with a P_t of each RF source being set to 0 dBm. Hence, the<br />

triangulation approach suffers from these hardware effects.<br />

Secondly, the results are based on the assumption of an ideal,<br />

i.e. free space, scenario. However, indoor environments are<br />

dominated by obstacles like walls or furniture. Such objects<br />

cause large scale effects like multi path propagation and<br />

shadowing [6]. These effects are visible in Figure 4, e.g., the<br />

shadowing can be seen in the green regions.<br />

Figure 2 Triangulation (with the yellow star as the object and the red blue<br />

dots as the RF sources with the black circles as their signal strength at a<br />

specific distance (Ref. [5]))<br />

Figure 4 Simulation in WinProp of 2.4 GHz RF source with ray tracing in<br />

line of sight<br />



Here, a so-called heat map is created with the fingerprinting<br />

algorithm. This heat map contains for each position in a given<br />

scenario the received signal strengths of the various transmitters<br />

(its fingerprint). Hence, the calculation of distance is no longer<br />

based on the Friis formula.<br />

B. Heat Map<br />

The test set up with its material parameters and geometries was<br />

evaluated in WinProp to simulate a heat map for the static case.<br />

This simulation takes large-scale effects into account. The multi<br />

path propagation can be seen in Figure 4. Here the ray tracing<br />

of a simulated RF source with 4 dBm P_t at 2.4 GHz is<br />

displayed. The signal strength is indicated by color coding over<br />

the whole area.<br />

The shadowing effects in Figure 4 are caused by the brick<br />

columns. These effects have to be taken into account when<br />

determining the receiver position based on the received signal<br />

strength. The shadowing effects are also visible by comparing<br />

the Channel Impulse Response (CIR) of Figure 5 and Figure 6.<br />

specific attenuation [7][8], these signals can be seen with a<br />

lower power envelope on the CIR graphs. All this contributes<br />

to the measured power at the position of the receiver. The<br />

received power in line of sight is significantly higher than the<br />

received power in the shadowed area. To neglect the shadowing<br />

effects, the line of sight will be used as a reference in the later<br />

experimental evaluation in section IV.<br />

IV. EXPERIMENTAL MEASUREMENT<br />

So far, the focus lay on the large-scale effects like multi path<br />

propagation and shadowing. But for an accurate position<br />

estimation also the small-scale effects have to be taken into<br />

account. They happen over the carrier wavelength λ, which is<br />

12.5 cm for a frequency of 2.4 GHz. These effects are due to<br />

the constructive and destructive interference of signals, which<br />

are caused by the earlier mentioned multi path propagation and<br />

therefore dependent on the individual room geometries [6].<br />

The experimental test set up is placed in the already introduced<br />

room in Figure 1. The comparison of the free space simulation<br />

(blue curve), the simulation of small scale effects due to the<br />

room geometries (black curve) and the experimental results of<br />

BLE beacons (blue dots) and omnidirectional antennas (red<br />

stars) are shown in Figure 7.<br />

Figure 5 Channel Impulse Response in line of sight with a distance of 5.5 m<br />

of 2.4 GHz RF Source<br />

Figure 7 signal power propagation with constructive and destructive<br />

interferences in line of sight<br />

Figure 6 Channel Impulse Response in shadow of brick column with<br />

distance of 5.5m to 2.4 GHz RF Source<br />

The absolute distance of the receiver to the RF source is in both<br />

cases the same. By comparing the CIR in both figures, it can be<br />

observed, that the line of sight signal arriving at 18 ns delay<br />

time is further attenuated from -53 dBm to -75 dBm. Since the<br />

signal of the RF source is distributed in a radial way, there are<br />

reflected rays from the walls, which contribute to the multi path<br />

propagation. These signals have a different time of arrival, due<br />

to the longer distances to the receiver and therefore propagation<br />

time. Since the reflection on the walls contain a material<br />

Here the wall is simulated with a reflection coefficient of 0.35<br />

according to [8]. The free space simulation was calibrated with<br />

experimental results measured in a low reflection room of two<br />

omnidirectional antennas with a signal of 2.44 GHz and 0 dBm<br />

P t.<br />

The comparison of the free space simulation to the simulation<br />

in the lab set up confirm the simulation results. The measured<br />

RSSI values of the BLE beacons are significantly lower than<br />

the idealized case. This is attributed to the lower gain of the<br />

antenna in the beacons and the smart phone. However, exact<br />

antenna directivities D r and D t are unknown. But this offset can<br />

be resolved by calibration of the positioning application. These<br />

interferences lead to a non-monotonic decrease in RSSI values<br />

that is identical RSSI values are obtained at different distances.<br />



This lack of uniqueness in the RSSI value worsens with increasing distance from the RF source (starting at 1.5 m), resulting in reduced accuracy for indoor positioning. A correlation of the RSSI value with the distance between receiver and transmitter thus becomes impossible, due to the ambiguity of the RSSI values for distances above 1.25 m.<br />

For a qualitative analysis of the RSSI values of the BLE beacons mentioned in Figure 7, their distribution is plotted in Figure 8 as a function of the distance between the phone and the beacon. The box plot depicts the median as a red line and, with the whiskers, the variability outside the upper and lower quartiles. Outliers are plotted as red stars.<br />

Figure 8 Decrease of the RSSI value of one beacon depending on the distance<br />

The spread of the RSSI values at a given distance to the RF source is shown in Figure 8. It becomes clear that the spread increases with the distance. This is visualized by a tunnel which widens dramatically at a distance of 1.25 m. The significance of a given RSSI value decreases with the distance between receiver and transmitter. This effect is striking at 1.25 m and extends to higher distances. Hence 1 m will be taken as the critical distance for reliable position determination. The mean RSSI values for distances under 1 m are in the expected range and lie on the fitted curve (blue circles) in Figure 7.<br />

A critical distance of 1 m to an RF source leads to the need for one RF source per m², so that the distance of the receiver to the nearest RF source never exceeds the critical distance. This implies a high density of RF sources. To enable this high density, the requirements for RF sources are examined in the next section.<br />

V. FEASIBILITY OF SUGGESTED SOLUTIONS<br />

Even though all measurements and simulations mentioned earlier were performed in a static environment (no humans or movable objects were present), the unreliability of the correlation between distance and signal strength is already prominent for distances above 1 m. Hence it will increase further in a dynamic environment. An algorithmic approach alone cannot solve this issue; neither a heat map nor triangulation is a sufficient solution. However, the results in sections II and IV indicate that a high density of RF sources can enable a highly accurate algorithmic approach. The prerequisite for such a high RF density is a low-cost hardware solution, so currently available technologies will be compared in the following.<br />

A. Requirements for RF Sources for Indoor Navigation Applications<br />

The 3 m accuracy of GPS for outdoor navigation falls short for indoor applications due to shielding by walls and ceilings. Therefore another technical solution needs to be applied. As mentioned earlier, the most common ones are based on either BLE or Wi-Fi as RF technology. BLE beacon technology comes in two flavors: battery powered or solar powered. For the Wi-Fi solution, additional access points are needed to support the router and achieve full coverage. In the following, all RF sources will be called beacons; their specification, such as Wi-Fi or BLE, will be added where needed.<br />

The comparison of these technologies takes the earlier proposition of one RF source per m² into account. The RF technologies are compared in Figure 9 regarding their total cost of ownership (TCO) and quality of service. The TCO comprises the purchase cost as well as deployment and maintenance cost and time. The quality of service focuses on latency and on the energy consumption of the phone: Wi-Fi consumes about twice as much smartphone energy as BLE [9], [10], which makes the BLE solution preferable for the user. The analysis in Figure 9 is based on the assumption of an average building of 500 m².<br />

Figure 9 Comparison of Wi-Fi, battery powered and autarkic solar powered BLE beacons for indoor navigation<br />



The maintenance time for a beacon includes the time for finding missed beacons, which are no longer advertising, and for exchanging the battery. This is assumed to take 12 min on average. The average lifetime of a battery is expected to be 6 years.<br />

This comparison makes the superiority of the autarkic BLE beacon over the Wi-Fi and battery-powered BLE beacons distinct. BLE beacons have a fifth of the power consumption of Wi-Fi beacons. The deployment cost of Wi-Fi is almost 10 times higher than that of BLE beacons, due to the additional electrical cabling. Battery-powered BLE beacons need a significantly higher amount of maintenance time, approximately 133 h per year for battery replacements, and therefore accumulate maintenance costs of over 2900 € per year. The operation costs for Wi-Fi (visualized as maintenance cost in Figure 9) consist only of the cost of the energy consumed by its access points.<br />

The total cost of ownership (TCO) consists of the deployment and maintenance costs, including material and labor costs according to German standards. The TCO for Wi-Fi is 27700 € and for battery-powered beacons 5535 €. The TCO of the autarkic solar-powered beacon is significantly lower at 2100 €.<br />

This comparison shows the clear superiority of the autarkic beacon, which leads to the question of how autarkic BLE beacons can be realized and which requirements have to be met to achieve this energy autonomy.<br />

B. Requirements for autarkic beacon<br />

The major cost factor in an autarkic BLE beacon is its power source. Although there are multiple energy harvesting options, neither piezoelectric nor temperature-based energy harvesting generates the 48 µW needed for the advertisement of BLE packets in an average use case scenario. The only viable source is therefore a solar cell [11]. The driving cost factor of the solar cell is its active area. It can only be reduced by either increasing the available light in the surrounding area or reducing the overall power consumption of the device.<br />

It can be assumed that the main current of a BLE beacon is consumed by the SoC [12]. Therefore the current consumption of a well-performing state-of-the-art beacon is examined and taken as the baseline for the required energy budget [13].<br />

The usage of beacons for indoor navigation is based on a minimum advertisement interval of 683 ms for an average walking speed of 1.462 m/s [14] and a beacon density of 1 beacon per m². The following calculation is therefore based on an advertisement interval of 683 ms and a duty cycle of 4.5 ms (Figure 10).<br />
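The 683 ms figure can be sanity-checked from the walking speed and the beacon spacing. This back-of-the-envelope sketch (variable names are illustrative) assumes a pedestrian must receive at least one advertisement per metre travelled, which lands at ~684 ms, close to the 683 ms used above.<br />

```python
# Back-of-the-envelope check (assumed model): with one beacon per m²,
# a pedestrian should receive at least one advertisement while
# travelling 1 m past a beacon.
walking_speed = 1.462   # m/s, average walking speed [14]
beacon_spacing = 1.0    # m, from the 1 beacon / m² density

max_interval = beacon_spacing / walking_speed  # maximum interval in s
print(round(max_interval * 1000))  # -> 684 (ms)
```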

The BLE beacon in [13] operates with a 3.3 V supply voltage. Since P_t was 0 dBm in section IV, the analyzed beacon in Figure 10 also advertises with 0 dBm. The influence of a beacon's P_t on the deviation of the RSSI values is not addressed in this paper and can be analyzed in further experiments. It can be assumed that a significantly lower P_t would allow further improvements in the power management.<br />

Figure 10 Oscilloscope measurement of the BLE beacon during advertisement (voltage drop over a 10 Ω load resistance)<br />

The current profile in Figure 10 consists of four segments. The first segment is the charge-up of the capacitor. It is worth mentioning that the size of this capacitor presents a trade-off between the supply noise for each advertisement and the time and power lost charging it. Here a maximum current of 7 mA is measured. The second segment is the wake-up of the transmitter, with an average current of 1.9 mA. The third shows the three advertisements on channels 37, 38 and 39; its average current is 3.6 mA. The fourth segment represents the shutdown, with an average current of 1.9 mA. With a sleep current of 1.28 µA (Figure 11), this best-in-class commercial SoC requires a total energy of 38.4 mJ per advertisement cycle of 687.5 ms.<br />
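The per-cycle energy bookkeeping implied by this profile can be sketched as follows. The individual segment durations are illustrative assumptions (the text gives only the 4.5 ms duty cycle and the 687.5 ms cycle length), so this shows the shape of the calculation rather than reproducing the stated 38.4 mJ figure.<br />

```python
# Energy bookkeeping for one advertisement cycle (segment durations
# are assumed for illustration; only their 4.5 ms sum is given above).
V = 3.3  # supply voltage in volts

segments = [  # (average current in A, assumed duration in s)
    (7.0e-3, 0.5e-3),   # 1: capacitor charge-up (7 mA peak)
    (1.9e-3, 1.0e-3),   # 2: transmitter wake-up
    (3.6e-3, 2.0e-3),   # 3: advertising on channels 37, 38, 39
    (1.9e-3, 1.0e-3),   # 4: shutdown
]
t_cycle = 687.5e-3                              # full cycle length in s
t_active = sum(t for _, t in segments)          # 4.5 ms duty cycle
e_active = V * sum(i * t for i, t in segments)  # active energy in J
e_sleep = V * 1.28e-6 * (t_cycle - t_active)    # sleep energy in J

print(e_active + e_sleep)  # energy per cycle in joules
```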

Figure 11 Oscilloscope measurement of the BLE beacon during sleep mode (voltage drop over a 10 kΩ load resistance)<br />



The lighting conditions in a dim hallway, with 200 lx, will be taken as the worst-case scenario. Since the light intensity is an external factor, the only variable besides the active area of the PV cell is the power consumption of the BLE beacon. According to [2], most of the current consumption takes place while advertising, so an approach for lower power consumption could be to increase the advertisement interval. Since the sleep current cannot be neglected, however, its share of the power consumption grows as the advertising interval is extended. A possible tradeoff could be an advertisement interval of 1 s, with which the average power consumption can be reduced to 34 µW. With the power available from various solar cells under these lighting conditions, the price of the solar cell per beacon is in the range of $1.76 to $2.5.<br />
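The interval-versus-power tradeoff can be cross-checked with a simple assumed model: average power equals the active energy per advertisement divided by the interval, plus the constant sleep power. Backing out the active energy from the 48 µW figure at the 687.5 ms interval reproduces roughly the 34 µW stated for a 1 s interval.<br />

```python
# Assumed model: P_avg = E_active / interval + P_sleep.
# The 48 µW at the 687.5 ms interval is used to back out E_active.
V, i_sleep = 3.3, 1.28e-6          # supply voltage (V), sleep current (A)
p_sleep = V * i_sleep              # ~4.2 µW constant sleep floor

p_avg_683 = 48e-6                  # W, stated average power at 687.5 ms
e_active = (p_avg_683 - p_sleep) * 0.6875  # J per advertisement event

p_avg_1s = e_active / 1.0 + p_sleep        # average power at 1 s interval
print(round(p_avg_1s * 1e6, 1))    # -> 34.3 (µW), matching the stated 34 µW
```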

C. Proprietary Radio SoC<br />

The energy consumption can be decreased with a proprietary SoC in multiple ways. Firstly, an immediate shutdown of the active part after advertising can be realized, which leads to a total energy reduction of 21.2 %. Secondly, the wake-up time can be reduced to 100 µs through faster locking of the clock oscillator. With this the total energy consumption is reduced to 24.6 mJ, a reduction of 35.9 %. Finally, a significant decrease in the third part of the current profile in Figure 10 leads to an overall reduction of 51.5 %. This is based on an ultra-low-power oscillator whose low current profile builds on the results of [3]. Its current consumption of 65 nA at an accuracy of 420 ppm is outstanding compared to the 120 nA of the oscillator in the best-in-class commercial SoC [12]. This is achieved with a crystal-free oscillator. In summary, the total energy consumption for one advertisement interval of 687.5 ms can be reduced to 18.6 mJ. In addition, the proprietary radio SoC reduces the external component count by a factor of three compared to the commercial SoC, since its oscillator does not need a dedicated RF crystal. This reduces the bill of materials significantly.<br />
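The quoted percentages can be cross-checked against the 38.4 mJ baseline of the commercial SoC; this is plain arithmetic verification, not new data.<br />

```python
# Cross-check the quoted savings against the 38.4 mJ commercial baseline.
baseline_mj = 38.4  # mJ per 687.5 ms advertisement cycle (commercial SoC)

checks = [
    (0.359, 24.6),  # faster oscillator lock-in: stated 24.6 mJ
    (0.515, 18.6),  # plus ultra-low-power oscillator: stated 18.6 mJ
]
for reduction, stated_mj in checks:
    computed = baseline_mj * (1 - reduction)
    assert abs(computed - stated_mj) < 0.1  # consistent to rounding
print("stated reductions are consistent with the 38.4 mJ baseline")
```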

VI. CONCLUSION<br />

This study details the reliability issues of state-of-the-art RSSI-based indoor positioning methods such as triangulation and heat-map fingerprinting. The main obstacle is the RSSI variation caused by large- and small-scale effects. The analysis of the experimental and simulated RSSI values suggests a maximal distance up to which the received signal strength can still be used for localization. Based on the quantitative results, we find a critical distance of 1 m for reliable indoor navigation with RF sources. The implementation is finally discussed considering the total cost of ownership, including the energy consumption, of various RF technologies. In consequence, the best-fitting beacon technology is based on the BLE standard using a dedicated SoC that enables autarkic PV-powered operation even under dim light conditions.<br />

VII. ACKNOWLEDGMENT<br />

The authors would like to thank Prof. Dr.-Ing. Heberling and Jörg Pamp (Institute of High Frequency Technology, RWTH Aachen) as well as Prof. Dr. Heinen and Markus Scholl (Chair of Integrated Analog Circuits and RF Systems, RWTH Aachen) for their support.<br />

VIII. REFERENCES<br />

[1] R. Faragher and R. Harle, “An Analysis of the<br />

Accuracy of Bluetooth Low Energy for Indoor Positioning<br />

Applications,” Proc. 27th Int. Tech. Meet. Satell. Div. Inst.<br />

Navig. (ION GNSS+ 2014), pp. 201–210, 2014.<br />

[2] Bluetooth Special Interest Group, “Specification of the<br />

Bluetooth System Covered Core Package Version 4.2,”<br />

Dec. 2014.<br />

[3] M. Scholl, Y. Zhang, R. Wunderlich, and S. Heinen,<br />

“A 80 nW, 32 kHz charge-pump based ultra low power<br />

oscillator with temperature compensation,” Eur. Solid-State<br />

Circuits Conf., vol. 2016–Octob, pp. 343–346, 2016.<br />

[4] H. T. Friis, “A Note on a Simple Transmission<br />

Formula,” Proc. IRE, vol. 34, no. 5, pp. 254–256, 1946.<br />

[5] J. Hightower, G. Borriello, and R. Want, “SpotON: An<br />

indoor 3D location sensing technology based on RF signal<br />

strength,” Uw Cse, no. March 2000, p. 16, 2000.<br />

[6] K. N. P. A. Kushki, “Indoor Positioning with Wireless<br />

Local Area Networks (WLAN),” Encycl. GIS, pp. 469–469,<br />

2008.<br />

[7] P. Ali-Rantala and M. Keskilammi, “Indoor<br />

Propagation of Bluetooth Waves, Effect of Distance on<br />

Bluetooth Data Transmission, and Simulation of Wave<br />

Propagation,” Citeseer, pp. 1–4.<br />

[8] T. Koppel, “Reflection and Transmission Properties of<br />

Common Construction Materials at 2.4 GHz Frequency,”<br />

Elsevier, vol. 113, pp. 158–165, 2017.<br />

[9] A. Lindemann, B. Schnor, J. Sohre, and P. Vogel,<br />

“Indoor positioning: A comparison of WiFi and Bluetooth Low<br />

Energy for region monitoring,” Heal. 2016 - 9th Int. Conf. Heal.<br />

Informatics, Proceedings; Part 9th Int. Jt. Conf. Biomed. Eng.<br />

Syst. Technol. BIOSTEC 2016, vol. 5, no. Biostec, pp. 314–<br />

321, 2016.<br />

[10] G. D. Putra, A. R. Pratama, A. Lazovik, and M. Aiello,<br />

“Comparison of energy consumption in Wi-Fi and bluetooth<br />

communication in a Smart Building,” 2017 IEEE 7th Annu.<br />

Comput. Commun. Work. Conf. CCWC 2017, no. 2014, 2017.<br />

[11] D. Niyato, E. Hossain, M. Rashid, and V. Bhargava,<br />

“Wireless sensor networks with energy harvesting<br />

technologies: a game-theoretic approach to optimal energy<br />

management,” IEEE Wirel. Commun., vol. 14, no. 4, pp. 90–<br />

96, 2007.<br />

[12] J. Bernegger and M. Meli, “Comparing the energy<br />

requirements of current Bluetooth Smart solutions,” no.<br />

February, pp. 1–23, 2014.<br />

[13] Atmel, ATBTLC1000 WLCSP SoC DATASHEET.<br />

2016, pp. 1–52.<br />

[14] R. W. Bohannon, “Comfortable and maximum<br />

walking speed of adults aged 20-79 years: Reference values and<br />

determinants,” Age Ageing, vol. 26, no. 1, pp. 15–19, 1997.<br />



Artificial Neural Networks Unleash New<br />

Possibilities for Edge Intelligence<br />

Hussein Osman<br />

Lattice Semiconductor: Product Marketing Manager<br />

San Jose, CA, U.S.<br />

hussein.osman@latticesemi.com<br />

Abstract— The rapidly growing area of artificial intelligence<br />

(AI), Neural Networks (NNs) and Machine Learning offers<br />

tremendous promise as developers attempt to bring higher levels<br />

of intelligence to their systems. Engineers have been using NNs<br />

as a paradigm to implement systems that can learn and infer<br />

based on learning. Computational requirements for such systems<br />

vary widely depending upon application. Traditionally designers<br />

using deep learning techniques and floating-point math in the<br />

data center have relied on high performance and power-hungry<br />

GPUs to meet demanding computational requirements. Designers<br />

extending AI to the edge don’t have the luxury of using power-hungry GPUs. Instead they must develop computationally efficient systems that not only meet accuracy targets, but also<br />

comply with the power, size and cost constraints of the consumer<br />

market.<br />

This paper will review on-device artificial intelligence which<br />

uses NN models to compare new incoming data against a stored model and infer results. On-device AI dramatically improves user privacy by processing data locally rather than sending it back to the cloud. In addition, the paper will evaluate how technologies<br />

such as Field Programmable Gate Arrays (FPGAs) can make<br />

edge computing possible and how they can be used to optimize<br />

parallel computing. It will also explore the intelligence these low<br />

power technologies bring to battery-powered applications. Using<br />

a design example, the article will also examine how building AI<br />

into an FPGA running an open-source RISC-V processor with accelerators can dramatically reduce power consumption while shortening response time and improving security.<br />

Keywords—Artificial Intelligence, AI, Artificial Neural<br />

Networks, ANN, neural networks, machine learning, Intelligence at<br />

the Edge, Edge Intelligence, RISC-V<br />

I. ON-DEVICE AI<br />

Over the last several years Neural Networks (NNs) have<br />

become an increasingly common paradigm for engineers<br />

utilizing machine learning (ML) techniques. In these<br />

applications engineers use NNs to implement systems that can<br />

continuously learn and infer, based on that learning.<br />

Traditionally these deep learning techniques are employed in<br />

the data center in systems built around large, high<br />

performance GPUs to meet highly demanding computational<br />

requirements. In these applications systems store data and run<br />

arithmetic functions in the cloud where the use of escalating<br />

levels of power is less of a design obstacle.<br />

Recently, however, demand has been building for new ways<br />

to extend these capabilities to edge applications. Ranging from<br />

smart TVs, and security systems to intelligent doorbells and<br />

self-driving vehicles, a rising number of new applications<br />

require a more immediate response than cloud-based systems<br />

can provide. On the edge the deep learning techniques that use<br />

floating-point math in the data center are impractical. Instead<br />

designers must develop more computationally-efficient<br />

solutions that not only meet stringent accuracy targets, but<br />

also comply with the power, size and cost constraints of the<br />

consumer market.<br />

While designers can use complex machine learning<br />

techniques during training in the data center, once a device<br />

moves to the edge, it must perform inferences using arithmetic<br />

that use as few bits as possible. Designers can simplify<br />

computation by switching from floating-point to fixed-point<br />

math or, ideally, basic integers. By altering training to<br />

compensate for the quantization of floating-point to fixed-point integers, they can develop solutions that train faster with high accuracy and push the performance of fixed-point/low-precision-integer NNs close to those using floating-point math. To build the simplest edge devices, however,<br />
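The float-to-fixed-point step can be illustrated with a minimal symmetric int8 quantizer; the scaling choice here is a simplifying assumption, not the training-time scheme the text alludes to.<br />

```python
import numpy as np

# Illustrative sketch of quantization: mapping floating-point weights to
# 8-bit signed fixed-point values with a simple symmetric scale.
def quantize_int8(weights):
    scale = np.max(np.abs(weights)) / 127.0   # map the largest weight to 127
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

w = np.array([0.5, -1.27, 0.003, 1.27])
q, s = quantize_int8(w)
print(q)      # int8 approximation of the weights
print(q * s)  # dequantized values, close to the originals
```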

training must produce NN models with 1-bit weights and<br />

activations. These models are called Binarized Neural<br />

Networks (BNNs).<br />

BNNs eliminate the use of multiplication and division by<br />

using 1-bit values instead of larger numbers during runtime.<br />

This allows the computation of convolutions using just<br />

addition. Since multipliers consume more space and power<br />

than other components in a digital system, replacing them with<br />

addition offers significant power and cost savings. But as<br />

demand builds for more intelligence on the edge, how must<br />

the use of BNNs change to meet these requirements?<br />

To address this need, designers require on-device AI solutions in which the heavy machine learning (training) is still performed in the cloud, but only infrequently, perhaps once a year. To deliver a quick response, these solutions store a template locally and compare newly collected data against that template to perform inferencing. On-device AI solutions not only reduce power, cost and product footprint; by eliminating the transfer of data back to the cloud, they also offer improved security and reliability.<br />

How is the template created? Take, as an example, a TV<br />

designed to automatically turn off and save power when it<br />

does not detect a person in the room. As long as the TV<br />

www.embedded-world.eu<br />



detects a face in the room, it stays on. When it doesn’t, it<br />

powers off. In this case a template for a facial detection<br />

solution would be created in the cloud by comparing 100,000<br />

images of faces from around the world. This data is then sent<br />

from the cloud to the edge application and stored in a Field<br />

Programmable Gate Array (FPGA) or Microcontroller (MCU).<br />

The model or template is sent to the AI device in the form of<br />

weights and activations. Typically, it is stored in the internal<br />

memory of the device. A sensor directly connected to the AI<br />

device collects raw images constantly. These raw images are<br />

continually compared to the template in the AI device using<br />

the computational resources on the device to perform<br />

inferencing.<br />

Applications of this type do not have to perform the<br />

complex calculations associated with a facial recognition<br />

function. But by simply performing a facial detection function<br />

and turning off the TV when no one is present, designers can<br />

add significant new capabilities to dramatically reduce power<br />

consumption. Similar applications might include security<br />

systems that can detect whether the movement in a house was<br />

a person, a pet, or a shadow, or a doorbell that automatically<br />

rings when a person approaches the front door.<br />

II. BINARIZED NEURAL NETWORKS<br />

A recent collaboration between Lattice Semiconductor and<br />

VectorBlox Computing, a developer of high performance,<br />

soft-core processors for embedded applications, illustrates the<br />

advantages Binarized Neural Networks offer. Binarized<br />

Neural Networks reduce memory requirements by eliminating<br />

the use of multiplications and divisions and allowing the<br />

computation of convolutions using just additions and<br />

subtractions.<br />

VectorBlox needed the hardware to run their machine<br />

learning algorithms to perform inferencing at the edge. But it<br />

also needed a solution that could deliver high performance at<br />

low power. To accomplish this task Lattice Semiconductor<br />

proposed the use of its iCE40 UltraPlus Field<br />

Programmable Gate Arrays (FPGAs). The iCE40 UltraPlus is<br />

a highly energy-efficient solution for repetitive number<br />

crunching. With 8 hardened DSP blocks, highly flexible I/Os<br />

and increased memory for buffering, it offers an attractive<br />

platform for building intelligent IoT edge products.<br />

In CNN-based machine learning the compute kernel is the<br />

convolution kernel where a 3x3 window of weights is<br />

multiplied with input data and then sum-reduced into a scalar<br />

result. Input values, weights and results typically use the<br />

floating-point system. Recent implementations like that<br />

described in “BinaryConnect: Training Deep Neural Networks<br />

with Binary Weights During Propagations” by M.<br />

Courbariaux and Y. Bengio and J.P. David eliminate<br />

multiplication by using binary weights to represent +1 or -1.<br />
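The add/subtract trick can be shown with a toy 3×3 convolution step. This sketch is purely illustrative (it is not the Lattice/VectorBlox implementation): with weights constrained to +1/-1, each product degenerates to an addition or a subtraction.<br />

```python
import numpy as np

# Toy binarized 3x3 convolution step: weights of +1/-1 turn each
# multiply-accumulate into a plain add or subtract.
def binary_conv3x3(window, weights):
    """window: 3x3 input activations; weights: 3x3 entries of +1/-1."""
    acc = 0
    for x, w in zip(window.ravel(), weights.ravel()):
        acc += x if w > 0 else -x   # no multiplication needed
    return acc

win = np.arange(9, dtype=np.int32).reshape(3, 3)
w = np.array([[1, -1, 1], [-1, 1, -1], [1, -1, 1]], dtype=np.int8)
print(binary_conv3x3(win, w))  # -> 4, same result as (win * w).sum()
```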

To improve performance engineers at Lattice and<br />

VectorBlox made three enhancements to the BinaryConnect<br />

approach. First, they shrunk the network structure in half by<br />

moving from<br />

(2×128C3) – MP2 – (2×256C3) – MP2 – (2×512C3) – MP2 – (2×1024FC) – 10SFC<br />

to<br />

(2×64C3) – MP2 – (2×128C3) – MP2 – (2×256C3) – MP2 – (2×256FC) – 10SFC<br />

where C3 is the 3x3 ReLU convolution layer, MP2 is a 2 x 2<br />

max-pooling layer and FC is a fully connected layer. At that<br />

point they optimized the network by using 8-bit signed fixed-point values for all input data. Accumulation used 32-bit signed data to prevent overflow, which was then saturated to 8 bit before the next layer.<br />

Fig. 1: The Binarized CNN structure presented a 10.8% error rate.<br />

Secondly, the designers implemented a hardware accelerator<br />

for the binarized neural network. Then they used the<br />

accelerator as an ALU in the ORCA soft RISC-V processor.<br />

They enhanced the ORCA processor with a custom set of<br />

lightweight vector extensions (LVE). By streaming the matrix<br />

data through the RISC-ALU, the LVE reduced or eliminated<br />

loop, memory access and address generation overhead and<br />

improved the efficiency of matrix operations. A CNN<br />

accelerator was added as a custom vector instruction (CVI)<br />

(see figure 2) to the LVE to further improve operation.<br />

Fig. 2: A Binarized custom vector instruction boosted performance.<br />

The third and final modification in the project was the<br />

addition of an augmented RISC-V processor in the iCE40<br />

UltraPlus FPGA. To perform inferencing at the edge designers<br />

needed a solution that offered a highly parallel architecture<br />

capable of performing a large number of similar arithmetic<br />

operations at low power. One of the reasons the team chose<br />

the iCE40 UltraPlus FPGAs was because they offer very<br />

flexible I/Os to connect to the image sensors and logic<br />

resources needed to down scale and manipulate the captured<br />

image data. The FPGAs also feature 8 hardened DSP blocks<br />

314


that the developers could dedicate to more complex algorithms<br />

as well as 1 Mbit of on-chip memory which could be used to<br />

buffer data longer in low power states. The LVE operates<br />

directly on 128 Kb of scratchpad RAM that has been triple<br />

overclocked to supply two reads and one write per CPU clock.<br />

Binary weights are stored in internal RAM, so the DMA<br />

engine can efficiently transfer those values into the scratchpad<br />

and steal cycles from the CPU if any LVE operations are<br />

running.<br />

The development team used Lattice’s iCE40 UltraPlus<br />

mobile development platform to prototype and test their<br />

design. Proof-of-concept demos helped engineers quickly<br />

develop drivers and interfaces. The platform featured a 1x<br />

MIPI DSI interface up to 108 Mbps, 4x microphone bridging<br />

and a variety of sensors. The FPGA can be programmed using<br />

an on-board SPI flash memory or the USB port.<br />

The team created a person detector by training a 10-category<br />

classifier with a modified CIFAR-10 dataset that replaced deer<br />

images with duplicated images from the people superclass in<br />

CIFAR-100. To maximize performance, the team reduced the<br />

network structure further and trained a new 1-category<br />

classifier using a proprietary database of 175K images<br />

including human facial images of various ages, ethnicities, as<br />

well as people wearing glasses and hats.<br />

III. COMPACT, LOW POWER SOLUTION<br />

Operating at 24 MHz this compact CPU was implemented<br />

in 4,895 of the iCE40 UltraPlus 5K’s 5,280 4-input LUTs. It<br />

also uses four of the FPGA’s eight 16x16 DSP blocks, 26 of<br />

30 4kb (0.5kB) BRAM, and all four 32 kB SPRAM. The<br />

proposed solution can support up to 8-layer deep NNs inside a<br />

single FPGA.<br />

The accelerator on the ORCA RISC-V improves runtime<br />

of convolution layers by 73X, while the LVE improves<br />

runtime of dense layers by 8X. Use of the iCE40 UltraPlus<br />

with the accelerators results in an overall increase in speed of<br />

approximately 71X.<br />

The 1-category classifier runs in 230 ms with 0.4% error<br />

and consumes 21.8 mW. A power-optimized version designed<br />

to run at one frame/second consumes just 4.4 mW. Error rates<br />

were attributed primarily to training, not to reduced precision. In<br />

effect, thanks to the impact of the accelerator implemented in<br />

the FPGA fabric, this FPGA-based solution offers the<br />

performance of a 1.7 GHz processor in the power envelope of<br />

a 24 MHz processor.<br />

IV. CONCLUSION<br />

With analysts at the Gartner Group predicting up to 80<br />

percent of all smartphones will feature on-device AI<br />

capabilities by 2022, demand is clearly building for more<br />

intelligence at the edge. The challenge for designers lies in<br />

finding the best technology to build highly resource-efficient<br />

solutions.<br />

REFERENCES<br />

[1] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations,” in Advances in Neural Information Processing Systems 28 (NIPS 2015), C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett, Eds., pp. 3123-3131, Curran Associates, Inc., 2015.

[2] G. Lemieux, J. Edwards, J. Vandergriendt, A. Severance, R. De Iaco, A. Raouf, H. Osman, T. Watzka, and S. Singh, “TinBiNN: Tiny Binarized Neural Network Overlay in about 5,000 4-LUTs and 5mW,” 3rd International Workshop on Overlay Architectures for FPGAs (OLAF), Feb 2017.

Fig. 3: The TinBiNN solution was implemented in an iCE40 UltraPlus FPGA.



olOne: Artificial Intelligence on Chip<br />

How Industry 4.0 would benefit from a new approach to AI miniaturization<br />

Marco Calabrese, Claudio Martines<br />

Holsys S.r.l.<br />

Taranto, Italy<br />

m.calabrese@holsys.com, c.martines@holsys.com<br />

Abstract—Embedded systems are used in a wide span of contexts, from industrial processes to consumer appliances. With the prospective growth of Industry 4.0, which promises to enlarge the spectrum of embedded applications to disparate settings such as real-time anomaly detection, predictive maintenance, self-diagnostics and so on, manufacturers will be compelled to embed more intelligence into their products. Today, machine learning technologies often require heavy Cloud infrastructures, the availability of datasets for training and testing, long time to market and advanced skills, all elements that, taken together, may hinder investment decisions. As an answer to the growing need for ready-to-deploy solutions, we present olOne, the first Artificial Intelligence development environment able to bring real-time interpretation of raw sensor data onto commonly used microcontrollers in a few steps and without any specific data science skills.

Keywords—Industry 4.0; real-time sensor data processing; cyber-physical systems; holons; granular computing; computing with words

I. INTRODUCTION<br />

Industry 4.0 can be defined as the embedding of advanced<br />

cyber-physical systems (CPS) [1] into digital and physical<br />

processes [2].<br />

The rapid development of smart sensing technologies, coupled with the increasing number of Internet of Things (IoT) connected devices, lets industrial machines and smart products generate a staggering amount of data on a daily basis. Even the small amounts of raw data produced at a constant rate by each single IoT device become Big Data when gathered over a long period of time and summed over the entire installed base of devices. As a result, CPS require Big Data analytics and Machine Learning (ML) features to transform these data into meaningful information, thus enabling high-value services for the end customers [3].

If raw data are analyzed only via a centralized intelligence through Cloud services, there are a number of potential points of failure to consider, such as connectivity, bandwidth, latency, security and infrastructure costs, to cite a few. For these reasons, there is growing interest in combining Cloud services with Edge Computing solutions [4]. This trend is producing an interesting race towards AI miniaturization, which is hindered, however, by technological barriers, since most ML techniques were not engineered to fit the stringent memory and computational requirements of embedded computing.

In our view, Edge Computing is a necessary but insufficient paradigm shift for bringing AI onto every chip. Our claim is that the traditional data-oriented machine learning approach, centred around data scientist activities, also has to be rethought in favour of a more human-oriented approach that starts from (and is incrementally enriched with) the know-how of the process expert or the process manager.

The rest of the paper is organized as follows: Section II describes the current ML process and the obstacles to its implementation in many real-world industrial contexts; Section III introduces our vision of a human-oriented ML approach; Section IV presents olOne, our Integrated Development Environment (IDE) for developing AI-based sensor data processing applications; Section V reports on real-world experiments and results obtained on a commercial prototyping board; Section VI concludes.

II. EFFECTS OF DATA-CENTRIC MACHINE LEARNING

Recent trends in ML are dominated by neural networks (NN)<br />

and deep-learning (DL) technologies. They can be defined as<br />

black-box multi-layer computational models that self-tune their<br />

internal weights to fit input training data.<br />

Undoubtedly, NN and DL have proven effective in disparate application settings such as, to cite a few, medical image processing [5], automatic text generation [6] and
handwriting recognition [7]. However, it is important to stress<br />

that, as a prerequisite to their implementation, NN and DL<br />

require the availability of data to run the training phase. We<br />

concentrate on this point in the next subsection.<br />

A. Barriers to effective implementation in real-world settings<br />

Traditional ML training cannot be done without data. This<br />

point is a crucial one in the Industry 4.0 scenario.<br />

While major technology companies have been gathering and processing data for many years, the picture changes dramatically for most manufacturing firms, which have until now focused on building their products rather than on the digital transformation and competencies required by the new vision.

316


According to a recent study [8], almost one third of EU<br />

enterprises recruiting or trying to recruit ICT specialists reported<br />

having difficulty filling those vacancies, and more than one in

two companies searching for an ICT specialist found a serious<br />

shortage of people with such skills.<br />

Simply put, several industries collect little or no data, lacking the internal skills to do so; this impedes them from implementing appropriate maintenance or customer-oriented services.

Even when data is available, it does not necessarily cover all the relevant conditions one would like to keep an eye on. Although it may seem ironic, if one wanted to train a pre-explosion pattern in a plant, the surest way would be to let the explosion happen, in order to collect sufficient data for analysis!

The next barrier is the lack of the technical skills needed to mine useful insights from raw data, e.g. by appropriately tuning NN parameters. Generally, this task is performed by data scientists, a professional profile in short supply on the labor market.

Finally, much of the available know-how in terms of human experience is excluded from this process. NN and DL are trained as black-box models, e.g. to find anomalous patterns. Once a pattern has been found, further work is required to understand why it occurred.

B. Race towards AI miniaturization<br />

Today, especially in Internet of Things (IoT) applications, it is common to have NN and DL models trained in the Cloud and accessed from the edge as web services through REST API calls or similar mechanisms [9]. This architecture works well in “relaxed” settings, where the processing frequency is in the range of seconds, minutes or more. Because of network constraints, however, the centralized approach to ML computation does not fit real-world applications requiring prompt responses with millisecond latency.

Indeed, computation performed directly at the edge/fog level, or even at the thing/endpoint level, has many additional benefits such as bandwidth reduction, privacy, security, resilience to single points of failure, and so on. Unfortunately, NN and DL are engineered for computationally intensive tasks that often exceed the capabilities of the microprocessors commonly used at the edge of the network. This is driving a race among chip manufacturers towards AI acceleration platforms which, in general, are based on special-purpose architectures such as GPUs or FPGAs [10].

Nevertheless, manufacturing special-purpose chips requires significant time and investment. By contrast, since the installed base of microcontrollers is dominated by low-cost general-purpose microprocessors, such as those of the ARM® Cortex®-M family, we believe there is great potential to re-use existing solutions, provided that a paradigm shift is undertaken in the overall ML process.

III. CHOOSING A DIFFERENT PERSPECTIVE

Several decades ago [11], Minsky, one of the forefathers of AI, conjectured that intelligence, viewed as a complex process, is the manifest macro-level appearance of a number of simpler micro-level phenomena taking place at a lower observation level. This collective intelligence grows towards increasing complexity as an emergent property [12].

In the literature, these principles are investigated in studies on Granular Computing (GrC) [13-15] and in Zadeh’s approach to Computing with Words (CWW) [16]. A computational model that grows out of GrC and CWW is that of holons [17], which represents the theoretical background of our approach.

A. Granular Computing, Computing with Words and Holons<br />

According to Pedrycz’s view [13], GrC, as opposed to numeric computing (which is data-oriented), is knowledge-oriented and accounts for a new, unified way of dealing with information processing. Since knowledge is basically made of information granules, information granulation operates on the granule scale, thus defining a sort of pyramid of information processing where lower levels account for ground data and higher levels for symbolic abstraction.

GrC provides a basic framework for CWW, which consists in expressing knowledge of observed phenomena in terms of linguistic propositions rather than numerical equations. At the core of the CWW methodology lies the concept of the granule, owing to the inherent fuzziness of linguistic expressions. In Zadeh’s view, a word w is considered a label of a granule [16]. From this perspective, the use of words becomes de facto a form of granulation.
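A minimal sketch of the word-as-granule idea (our own illustration, not Zadeh’s formalism and not olOne’s engine) models a linguistic term such as “steady” as a fuzzy set over a numeric feature of a signal window, rather than as a crisp threshold:

```c
#include <assert.h>

/* Hypothetical sketch: the word "steady" as the label of a granule,
 * modelled as a trapezoidal fuzzy membership over the peak-to-peak range
 * of a signal window (fully steady below lo, not steady at all above hi). */
static double steady_membership(double range, double lo, double hi)
{
    if (range <= lo) return 1.0;
    if (range >= hi) return 0.0;
    return (hi - range) / (hi - lo); /* linear transition between lo and hi */
}
```

The graded transition between lo and hi is what distinguishes a granule from a crisp numeric rule.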

Introduced by Koestler in the late 1960s [18], a holon was defined as an entity playing the roles of an autonomous whole and a dependent part at the same time, as happens with biological cells, which are autonomous wholes that contribute as parts to the benefit of the hosting organism. By analogy, the same scheme also holds for human words in a phrase. A word, taken alone, carries its own meaning, while in a phrase it contributes to the understanding of a different semantic picture. That is why holons can be viewed as a computational model for CWW applications.

The inherent recursivity of language is a powerful means for humans to embed and transfer pieces of knowledge at different granularity levels. For example, consider an energy-from-waste plant manager describing the good-working condition of the combustion chamber as the situation in which the temperature signals from the thermocouples behave in the same way. This is a very simple and expressive statement that suffices to describe, from the human standpoint, what a “normal condition” is and, consequently, by negation, what should be considered “anomalous”. It is the machine that should work to close the gap between the semantics behind the human utterance and its algorithmic implementation.

B. Holons in real-time sensor data interpretation<br />

Physical sensors can be viewed as the CPS perception<br />

system that senses the surrounding environment, generally<br />

through periodic sampling. Since measurements are often noisy or ambiguous, real-time information extraction is demanding. If data interpretation is delayed for too long, many important phenomena can be lost, which is unacceptable in several mission-critical scenarios.

317


Since the kind of AI we present here works on raw data<br />

streams coming from sensors, it is useful to restrict our domain<br />

only to real-time signal interpretation, rather than general-purpose data processing, and in particular to the very step that

transforms raw data streams into meaningful events for upper<br />

application layers.<br />

Early work on holonic approaches to sensor data interpretation was presented in [19-20]. This paper extends it in the light of a more general approach to CWW and human-machine interaction (HMI).

C. Focusing on human know-how first<br />

As an alternative to the data-centric approach described in the previous Section, we propose an iterative deployment workflow that centers on the available human know-how about the target process and incorporates, incrementally, new insights that may emerge after deployment.

The main phases of this workflow are somewhat inspired by the so-called Plan-Do-Check-Act cycle [21] and by software development practice [22]. They are the following:

1. Target events definition: the process manager or the domain expert identifies/updates the relevant conditions, in terms of events occurring on both physical and virtual signals, that they want the AI to find for them. This phase is the equivalent of the energy-from-waste plant problem description provided before.

2. Transcription: the sensor processing flow that fires the target events is transcribed in an intuitive way, e.g. in a visual language. A real-time CWW interpretation engine can then be embedded as a widget in the visual scheme to characterize signal dynamics on the fly, letting the machine do the job of transforming a high-level concept into executable code. A well-formed flow starts from the input signals, declares some processing, and ends with the target events. For example, if we want an accelerometer level to be “steady” over a 1 second time span, we would draw a flow like the one in Figure 2 ahead in the text. If the specific concept representing the intended description were missing from the CWW vocabulary, the language should provide a vocabulary-enriching mechanism, e.g. by automatically abstracting input examples into computable and reusable models, in a way similar to the training phase of a NN (the mechanisms used to perform this abstraction are outside the scope of the present article).

3. AI automated code building: once transcription is complete, the processing flow is automatically built into an executable to run on the target device, e.g. in the form of a compiled C library to insert in the main loop.

4. Embedding and test: the executable is then embedded into the target device for deployment. The running AI may supply new insights that restart the whole process, until all the desired conditions have been declared and verified.

IV. OLONE

“olOne” (from the Italian word for holon) is an IDE<br />

engineered to quickly build and deploy real-time sensor data<br />

processing AI applications without any data science background.

In compliance with the vision and workflow illustrated in the previous section, olOne enables the user to dictate domain knowledge to the CWW engine, which will be included as a special-purpose AI library in the C/C++ embedded software.

Once deployed onto the target device, the AI’s sole objective is to interpret raw sensor data streams at run time according to the event conditions defined in the design phase, checking whether such conditions are met. In this sense, the AI acts as a “semantic transducer”, transforming numeric data streams into meaningful booleans, either for higher application layers or for taking local actions.

A. Visual design<br />

To ease program design, a visual language specific to signal processing is used. This choice stems from the observation that most of the sensor data stream analytics platforms available on the market are either SQL-oriented (such as Microsoft Azure Stream Analytics or SQLstream) or programmed ad hoc (like Oracle Stream Analytics). Both approaches use programming languages that were not originally conceived for stream data processing. Instead, olOne provides a toolbox of draggable widgets that can be arranged in the editor panel to produce a sensor data processing flow. A well-formed flow is mainly composed of the following nodes:

1. Sensor: name of the data source.<br />

2. Time Window: time span of the raw-data buffer at<br />

which AI will observe phenomena, e.g. 1 second.<br />

3. Behavior: a selector over the type of dynamics the AI will focus on; examples are level (representing a generic state change from one numerical plateau to another) and trend (which corresponds to the classical notion of direction given an array of data points).

4. Concept: an abstract Yes/No condition the AI will check for, e.g. “high level” or “rising trend”. The technology behind olOne is also capable of comparing behaviors from different sources, e.g. “humidity level higher than temperature level”, and even of abstracting new concepts from raw data, e.g. “slight rise”, to enrich the base vocabulary.

5. Boolean operator (optional): a unary or n-ary logical connector used to combine or negate processing flows.

6. Event: name of the target event.<br />

The above steps are repeated until all processing flows have been defined. A well-formed design can then be automatically built into executable code by clicking the build button.
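To make the node types concrete, one possible in-memory encoding of a single flow is sketched below. The struct, enum and field names are purely illustrative assumptions on our part, not olOne’s actual build format:

```c
#include <assert.h>
#include <string.h>

/* Illustrative encoding of one Sensor -> Time Window -> Behavior ->
 * Concept -> Event processing flow; not olOne's actual representation. */
enum behavior { BEHAVIOR_LEVEL, BEHAVIOR_TREND };

struct flow {
    const char    *sensor;    /* data source name                     */
    double         window_s;  /* observation window in seconds        */
    enum behavior  behavior;  /* type of dynamics to watch            */
    const char    *concept;   /* Yes/No condition to check            */
    const char    *event;     /* target event fired when it holds     */
};

/* Example: fire "standing_still" when the accelerometer level is
 * "steady" over a 1 second window. */
static const struct flow still_flow = {
    "accelerometer", 1.0, BEHAVIOR_LEVEL, "steady", "standing_still"
};
```

The optional Boolean-operator node would combine several such flows into one event; it is omitted here for brevity.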

318


B. AI library integration in embedded programs<br />

Once the “olOne.h” library has been returned after the build request, the AI can be inserted in the main loop of an embedded program, as shown in Figure 1. The process is straightforward: at each iteration, data read from the sensors are sent to the library as float values, then the AI interpretation is called and Yes/No results are returned, for taking local actions or providing condition-monitoring insights to other application layers.

/* initialize AI library memory before the main loop */
olOneInit();

while (1) {
    /* Get instantaneous data from the expansion board using variable s */
    /* static X_CUBE_MEMS *s = X_CUBE_MEMS::Instance(); */
    s->hts221.GetTemperature((float *)&TEMP_Value);
    s->hts221.GetHumidity((float *)&HUM_Value);
    s->lsm6ds0.Acc_GetAxes((AxesRaw_TypeDef *)&ACC_Value);

    /* send raw data to AI library */
    addValue(0, ACC_Value.AXIS_X);
    addValue(1, ACC_Value.AXIS_Y);
    addValue(2, ACC_Value.AXIS_Z);
    addValue(3, TEMP_Value);
    addValue(4, HUM_Value);

    /* call AI library interpretation function */
    interpret();

    /* get Boolean results -> 0 False, 1 True */
    int *events = getEvents();

    /* do something and then repeat the process, e.g. at 10 Hz */
    wait_ms(100);
}

Fig. 1. Excerpt of sample C code for using the AI library produced by olOne.<br />

Reference hardware is the ST STM32F401 Nucleo-64 board with MEMS<br />

Inertial and Environmental expansion module.<br />

C. Additional features<br />

In addition to visual design and automatic code building, olOne offers a number of features engineered to speed up application development and testing. The main ones are the following:

• Optimized code: the produced executable is optimized to follow strictly the logic provided in the design phase, thus minimizing the memory and computational footprint at runtime. In case time-window buffers exceed the available RAM, memory can be optimized through a granularization process that compresses many data points into a single variable.

• Multiplatform: the same design can be built for different<br />

target environments, e.g. as a C library for embedded<br />

platforms or a completely stand-alone Java program for<br />

operating system-enabled devices.<br />

• Simulation: arrays of historical data can be analyzed off-line with a given design. Results are written as a CSV file in which raw-data input columns are paired with the corresponding Yes/No output conditions.

• Live dashboard: chart panels and LED-like widgets can be composed to display raw data and event conditions read in real time from the embedded device via the serial port.

• New concept creation: concepts can be added to the base vocabulary. The user provides an array of data points, either taken from live data or sketched from a simulated example, and asks the AI to abstract the single example into a category of similar conditions.
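The granularization mentioned in the first bullet can be pictured as follows (a sketch under our own assumptions, not olOne’s actual algorithm): a run of raw samples is compressed into one summary variable, so a long time window costs a few bytes instead of a full sample buffer.

```c
#include <assert.h>

/* Hypothetical granularization: compress n raw samples into one summary
 * "granule" so long time windows fit in a few bytes of RAM. */
struct granule { float min, max, mean; };

static struct granule granulate(const float *x, int n)
{
    struct granule g = { x[0], x[0], 0.0f };
    for (int i = 0; i < n; i++) {
        if (x[i] < g.min) g.min = x[i];
        if (x[i] > g.max) g.max = x[i];
        g.mean += x[i];
    }
    g.mean /= (float)n;
    return g;
}
```

Chaining granules of granules would yield the pyramid of information-processing levels described in Section III.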

V. EXPERIMENTS USING OLONE<br />

To assess the feasibility of the proposed approach, a couple of case studies were implemented on a selected group of commercially available embedded boards, all equipped with ARM® Cortex®-M family microprocessors, such as the NXP MKL25Z128VLK4 Microcontroller Development Board (16 KB, 48 MHz, Cortex-M0+).

We report here on the results obtained with the ST STM32F401 Nucleo-64 (96 KB, 84 MHz, Cortex-M4), since the availability of pluggable sensor modules made the testing process very quick.

Case studies were chosen with the objective of detecting<br />

events involving different types of signal dynamics and AI tasks.<br />

In particular, we focused on:<br />

• level-change: through the analysis of accelerometer data. Since acceleration is the second derivative of position, accelerometers are fast trackers of movement conditions.

• trend analysis: through the comparison of temperature and humidity sensor data drifts. Since these physical quantities are correlated but of a different nature, comparing them requires the AI to perform some kind of data fusion [23].

The ST MEMS Inertial and Environmental Nucleo Expansion board was used to acquire the raw data streams for both case studies.

A. Level-change in accelerometer data<br />

The ability to detect changes in sensor data streams, e.g. faults or time-variant environmental conditions, is a key functionality in self-adaptive CPS [24].

In our experiments, we targeted level changes in acceleration (the rate of change of acceleration is known in physics as jerk) in order to detect an orientation-independent [25] “standing still” condition (no jerk on the three axes) over a 1 second time span, with a sampling frequency of 10 Hz.

Although, at first glance, the classification task may appear trivial, it is complicated by the intrinsic noise of the MEMS accelerometer [26]. Furthermore, no a-priori information, such as



Fig. 2. Visual design for the use cases described in this paper. The processing flows proceed from the data sources (widgets representing sensors, on the left) towards the output boolean events (widgets on the right). Intermediate widgets define the time window, the type of dynamics the AI will look for, and the comparators or logic connectors that fuse different flows. In the trend analysis flow, the “slight rise” concept is inserted as a specification of the trend behavior.

variance or noise power, is given to the embedded AI apart from the description visually transcribed in olOne, as it appears in Figure 2.
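As a rough baseline for what the embedded AI must decide (a naive sketch of our own, not olOne’s CWW engine), a “standing still” test over a 1 second window at 10 Hz could compare each axis’s peak-to-peak excursion against a threshold set above the accelerometer noise floor:

```c
#include <assert.h>

#define WIN 10  /* 1 second window sampled at 10 Hz */

/* Naive baseline (not olOne's engine): "standing still" holds when the
 * peak-to-peak excursion on every axis stays below a noise-tolerant
 * threshold over the whole window. */
static int standing_still(const float a[WIN][3], float thresh)
{
    for (int axis = 0; axis < 3; axis++) {
        float lo = a[0][axis], hi = a[0][axis];
        for (int i = 1; i < WIN; i++) {
            if (a[i][axis] < lo) lo = a[i][axis];
            if (a[i][axis] > hi) hi = a[i][axis];
        }
        if (hi - lo > thresh) return 0; /* too much movement on this axis */
    }
    return 1;
}
```

Note that such a fixed threshold is exactly the a-priori noise information the olOne design does without.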

B. Temperature-Humidity trend comparison<br />

In this case study, the target was the comparison of slightly rising trends of temperature and humidity over a 5 second time window. This analysis, extended to wider time spans, can be useful, for example, to detect long-term drift, a well-known problem related to sensor aging [27].

Since the “slight rise” concept is not present by default in the base vocabulary, it had to be added as a new word. An array of temperature data points was used for this purpose. The data collection experiment consisted in bringing a hand close to the board: the heat from the hand slightly increased the temperature reading. Live data were then collected and stored to perform the data abstraction task by means of the olOne new concept creation feature described in the previous Section.

It is noteworthy that the new concept could also be used as a reference condition for humidity, a physical quantity different from the one used in the learning phase.
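One naive way to make trends of different physical quantities comparable (again a sketch under our own assumptions, not the abstraction olOne performs) is to min-max normalize each window so that the fitted slope becomes dimensionless:

```c
#include <assert.h>
#include <math.h>

/* Hypothetical data-fusion helper: min-max normalize a window to [0,1],
 * then fit a least-squares slope against the sample index, so trends of
 * different physical quantities become directly comparable. */
static float norm_slope(const float *x, int n)
{
    float lo = x[0], hi = x[0];
    for (int i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    float span = (hi > lo) ? (hi - lo) : 1.0f; /* avoid division by zero */

    float sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        float y = (x[i] - lo) / span;
        sx += (float)i; sy += y;
        sxx += (float)i * (float)i; sxy += (float)i * y;
    }
    return ((float)n * sxy - sx * sy) / ((float)n * sxx - sx * sx);
}
```

After normalization, a temperature slope and a humidity slope can be compared directly, which is the essence of the fusion task in this case study.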

C. Preliminary tests and final considerations<br />

Our tests showed that both use cases could be implemented successfully on the STM Nucleo board at 10 Hz, with a minimal memory footprint (approximately 250 bytes) and a CPU usage of less than 0.1%. We also tested the same design at 100 Hz, keeping CPU usage below 1%. Some screenshots taken from the olOne live dashboard during these tests are reported in Figure 3.

Fig. 3. Example screenshots taken from the live dashboard, showing the “still” condition observed on the 3-axis accelerometer sensor and the “slight rise” condition on the humidity and temperature signals. When the AI finds that real-time data (displayed in the chart) comply with the target event conditions, the boxes reporting the event labels are highlighted and display “FIRED”.



Our experiments support the thesis that CWW computational models like the one employed in olOne can be brought to commonly used embedded architectures, even to ARM® Cortex®-M0 microprocessors, provided that a different ML approach, centred around human know-how about the process, is followed.

VI. CONCLUSIONS

Interest in AI and its potential application to Industry 4.0 is surging. However, several barriers to effective implementation in real-world settings remain, such as the lack of adequate technical skills for data processing, the availability of datasets and the very time spent on the model training required by traditional ML approaches.

In the race towards AI miniaturization, which today is mainly focused on special-purpose architectures, our proposal should be considered an attempt to provide a different perspective, one that exploits existing general-purpose CPUs without requiring a data science background for AI design and deployment.

Rooted in the CWW paradigm, the approach employed in olOne represents, in the authors’ view, a viable alternative to mainstream NN and DL implementations in the Industry 4.0 scenario.

ACKNOWLEDGMENT<br />

The authors would like to personally thank Pio Quarticelli and Danilo Pau from the STMicroelectronics Agrate Brianza site for kindly providing all the hardware and technical support needed to set up the presented experiments.

REFERENCES<br />

[1] E. A. Lee, “Cyber physical systems: design challenges”, 11th IEEE

International Symposium on Object and Component-Oriented Real-Time<br />

Distributed Computing (ISORC), Orlando, FL, 2008, pp. 363-369.<br />

[2] R. Schmidt, M. Möhring, RC. Härting, C. Reichstein, P. Neumaier, P.<br />

Jozinović, “Industry 4.0 - potentials for creating smart products: empirical<br />

research results”. In: Abramowicz W. (eds) Business Information<br />

Systems. BIS 2015. Lecture Notes in Business Information Processing,<br />

vol 208. Springer, Cham<br />

[3] J. Lee, H.A. Kao, S. Yang, “Service innovation and smart analytics for Industry 4.0 and big data environment”, 6th CIRP Conference on Industrial Product-Service Systems, Vol. 16, 2014, pp. 3-8, Elsevier.

[4] S. Teerapittayanon, B. McDanel and H. T. Kung, "Distributed deep neural<br />

networks over the cloud, the edge and end devices," 2017 IEEE 37th<br />

International Conference on Distributed Computing Systems (ICDCS),<br />

Atlanta, GA, 2017, pp. 328-339.<br />

[5] D. Shen, G. Wu, H.-I. Suk. “Deep learning in medical image analysis”.<br />

Annual review of biomedical engineering. 2017; 19:221-248.<br />

[6] I. Sutskever, J. Martens, G. Hinton, “Generating Text with Recurrent<br />

Neural Networks”, Proceedings of the 28th International Conference on

Machine Learning (ICML-11), ACM , pp. 1017-1024, June 2011.<br />

[7] M. Liwicki, A. Graves, and H. Bunke. “Neural Networks for Handwriting<br />

Recognition”. Book chapter, Computational Intelligence Paradigms in<br />

Advanced Pattern Classification, pp. 5-24, Springer, 2012.<br />

[8] S. Compagnucci, G. Berni, G. Massaro, M. Masulli, “Thinking the future<br />

of the european industry. Digitalization, Industry 4.0 and the role of EU<br />

and national policies”, EU study from I-com (Institute for<br />

competitiveness), Bruxelles, 6 September 2017.<br />

[9] C.-W. Tsai, C.-F. Lai, H.-C. Chao, A. V. Vasilakos, “Big data analytics:<br />

a survey”, Journal of Big Data, 2:21. Springer International Publishing.<br />

December 2015.<br />

[10] D. Vainbrand and R. Ginosar, "Network-on-Chip Architectures for<br />

Neural Networks," 2010 Fourth ACM/IEEE International Symposium on<br />

Networks-on-Chip, Grenoble, 2010, pp. 135-144.<br />

[11] M. Minsky, The Society of Mind, Simon and Schuster, New York, 1986.

[12] M. Ulieru, R. Doursat, “Emergent engineering: a radical paradigm shift”,<br />

Int. J. Autonomous and Adaptive Communications Systems, Vol. 4, No.<br />

1, 2011, pp. 39-60.<br />

[13] W. Pedrycz, “Granular computing: an introduction”, Proc. of the Joint 9th<br />

IFSA World Congress and 20th NAFIPS International Conference,<br />

Vancouver, BC, 2001, pp. 1349-1354 vol.3.<br />

[14] L. A. Zadeh. “Toward a theory of fuzzy information granulation and its<br />

centrality in human reasoning and fuzzy logic”. Fuzzy Sets Syst. 90, 2,<br />

pp. 111-127, September 1997.<br />

[15] L. A. Zadeh, “Some reflections on soft computing, granular computing<br />

and their roles in the conception, design and utilization of<br />

information/intelligent systems”, Springer-Verlag, Soft Computing (2):<br />

23—25. 1998.<br />

[16] L. A. Zadeh, "Fuzzy logic = computing with words," in IEEE<br />

Transactions on Fuzzy Systems, vol. 4, no. 2, pp. 103-111, May 1996.<br />

[17] M. Calabrese, Hierarchical-Granularity Holonic Modelling. Doctoral<br />

Thesis, 2011. University of Milan, Italy.<br />

[18] A. Koestler, “Some general properties of self-regulating open hierarchic<br />

order (SOHO)”, In Koestler and Smythies, 1969, 210-216.<br />

[19] V. Di Lecce, M. Calabrese, C. Martines, “From sensors to applications: a<br />

proposal to fill the gap”, Sensors & Transducers Journal, Vol. 18, Special<br />

Issue, pp. 5-13, January 2013.<br />

[20] V. Di Lecce and M. Calabrese. “Smart sensors: a holonic perspective”. In<br />

Proceedings of the 7 th international conference on Intelligent Computing:<br />

bio-inspired computing and applications (ICIC'11), De-Shuang Huang,<br />

Yong Gan, Prashan Premaratne, and Kyungsook Han (Eds.). Springer-<br />

Verlag, Berlin, Heidelberg, 290-298. 2011.<br />

[21] N. R. Tague, The Quality Toolbox, Second Edition, ASQ Quality Press,<br />

2004, pages 390-392.<br />

[22] G. Suryanarayana, T. Sharma and G. Samarthyam, "Software Process<br />

versus Design Quality: Tug of War?," in IEEE Software, vol. 32, no. 4,<br />

pp. 7-11, July-Aug. 2015.<br />

[23] B. Khaleghi, A. Khamis, F.O. Karray, “Multisensor data fusion: A review<br />

of the state-of-the-art”, Information Fusion, Vol. 14, Issue 1, January<br />

2013, Pages 28-44, Elsevier.<br />

[24] C. Alippi, V. D'Alto, M. Falchetto, D. Pau and M. Roveri, "Detecting<br />

changes at the sensor level in cyber-physical systems: Methodology and<br />

technological implementation," 2017 International Joint Conference on<br />

Neural Networks (IJCNN), Anchorage, AK, 2017, pp. 1780-1786.<br />

[25] W. Hamäläinen, M. Järvinen, P. Martiskainen and J. Mononen, "Jerkbased<br />

feature extraction for robust activity recognition from acceleration<br />

data," 2011 11th International Conference on Intelligent Systems Design<br />

and Applications, Cordoba, 2011, pp. 831-836.<br />

[26] A. Albarbar and S.H. Teay, “MEMS Accelerometers: Testing and<br />

Practical Approach for Smart Sensing and Machinery Diagnostics”, pp.<br />

19-40 in D. Zhang, B. Wei (eds.), Advanced Mechatronics and MEMS<br />

Devices II, Microsystems and Nanosystems, Springer International<br />

Publishing Switzerland 2017.<br />

[27] T. Islam, H. Saha, “Study of long-term drift of a porous silicon humidity<br />

sensor and its compensation using ANN technique”, In Sensors and<br />

Actuators A: Physical, Vol. 133, Issue 2, 2007, Pages 472-479.<br />

321


A new scalable architecture to accelerate<br />

Deep Convolutional Neural Networks for low<br />

power IoT applications<br />

Giuseppe Desoli, Thomas Boesch, Surinder Pal-Singh, Nitin Chawla<br />

ST Central Labs and Technology R&D<br />

STMicroelectronics<br />

Cornaredo (MI), Italy; Geneva, Switzerland; Noida, India<br />

giuseppe.desoli@st.com, thomas.boesch@st.com, surinder-pal.singh@st.com, nitin.chawla@st.com<br />

Abstract— Deep Convolutional Neural Networks (DCNNs) or<br />

ConvNets allow achieving state of the art results in many<br />

applications involving recognition, identification and/or<br />

classification tasks; however, they come at a high cost in terms<br />

of processing power, hindering their adoption in embedded and<br />

IoT domains, due to the scarce availability of low-cost and<br />

energy-efficient solutions. Recently a push towards an ever-increasing<br />

deployment of DCNN-based inference tasks in<br />

embedded devices supporting the edge-computing paradigm has<br />

been observed, overcoming limitations of cloud-based computing<br />

for latency, bandwidth requirements, security, privacy,<br />

scalability, and availability. At the edge, severe performance<br />

requirements must coexist with tight constraints in terms of<br />

power and energy consumption. DCNN algorithms necessitate<br />

billions of multiply-accumulate operations per second for real-time<br />

workloads, as well as local storage of millions of bytes of<br />

pre-trained weights. To cope with these constraints, low-power<br />

IoT end-nodes must resort to specialized hardware blocks for<br />

specific compute-intensive data processing, while retaining<br />

sufficient software programmability to cope with diverse<br />

computational needs. The Orlando architecture is a<br />

reconfigurable, scalable and design time parametric DCNN<br />

Processing Engine powered by an energy efficient set of HW<br />

convolutional accelerators supporting kernel compression and an<br />

on-chip reconfigurable data transfer fabric to improve data reuse<br />

and reduce on-chip and off-chip memory traffic. The Orlando<br />

SoC prototype integrates custom designed DSPs, along with an<br />

instance of the reconfigurable dataflow custom HW accelerator<br />

fabric designed in FD-SOI 28 technology with low power features<br />

and adaptive circuitry to support a wide voltage range from 1.1V<br />

to 0.575V. The chip adopts a GALS clocking architecture to<br />

reduce the clock network dynamic power and skew sensitivity<br />

due to on-chip variation at lower voltages. We achieved a power<br />

consumption of 41mW on a typical DCNN algorithm (AlexNet)<br />

with a peak layer efficiency of 2.9 TOPS/W.<br />

Keywords—Deep Learning; Neural Networks; FD-SOI; ultra-low power SoC<br />

I. INTRODUCTION<br />

DCNN based algorithms are now widely applied to a large<br />

number of hard to solve problems in classification, detection,<br />

recognition, analysis and, more recently, even synthetic signals<br />

generation in computer vision, signal processing, speech and<br />

audio applications, robotic motion, navigation, financial data<br />

analysis, medical diagnostics, and more. Since the seminal<br />

work of Y. LeCun et al. [1] and the 2012<br />

ImageNet Large Scale Visual Recognition Challenge, won by a<br />

DCNN called AlexNet [2] that for the first time significantly<br />

outperformed classical computer vision approaches, many new<br />

kinds of neural network topologies and operators have entered<br />

the state of the art, all requiring a baseline computational<br />

pattern consisting of some form of tensor convolution along<br />

with a more diverse set of additional operators deployed in a<br />

sequence of processing steps or layers.<br />
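The scale of this baseline computational pattern is easy to appreciate with a back-of-the-envelope count (an illustrative sketch, not taken from the paper; the layer dimensions are those commonly quoted for AlexNet's first convolutional layer):<br />

```python
# Rough MAC-count estimate for one convolutional layer:
# every output element needs kh*kw*c_in multiply-accumulates.
def conv_macs(out_h, out_w, c_out, kh, kw, c_in):
    return out_h * out_w * c_out * kh * kw * c_in

# AlexNet's first layer: 11x11x3 kernels, 96 output maps of 55x55.
macs = conv_macs(55, 55, 96, 11, 11, 3)
print(macs)  # 105415200 MACs for a single layer of a single frame
```

At 2 OPS per MAC and real-time frame rates, a handful of such layers already reaches the billions of operations per second mentioned in the abstract.<br />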

It is only in recent years that commodity computing<br />

hardware such as GPUs delivered the performance required to<br />

address DCNN training and inference based applications. At<br />

the same time, it is increasingly difficult to improve over<br />

the state of the art in hardware performance by way of general-purpose<br />

designs, leading to the emergence of hardware DCNN<br />

accelerators. A survey of the existing proposals in this domain<br />

is beyond the scope of this paper; some of the early works<br />

include the DianNao accelerator family [5], using a SISD<br />

architecture to process operations in parallel on a single chip,<br />

while a few other examples can be found in [3, 4, 6].<br />

Hardware accelerator design efforts have proceeded in two<br />

directions: either toward more general-purpose accelerators to<br />

support training and inference with very high throughput and<br />

efficiency, for example in servers [11], or toward specialized<br />

units addressing layers or classes of DNNs with the goal of<br />

reducing execution time and/or energy. In order to make these<br />

technologies pervasive in mobile, IoT and<br />

wearable devices, hardware acceleration provides the ability to<br />

work in real time with very limited power consumption and<br />

limited amounts of embedded memory, overcoming the limitations<br />

of fully programmable solutions.<br />




We present a scalable modular architecture called Orlando<br />

providing state-of-the-art performance and energy efficiency to<br />

design HW-accelerated Neural Processing Units (NPUs) with<br />

the following features: (1) flexible streaming HW<br />

convolutional accelerators supporting variable-bit-length kernel<br />

decompression, (2) a reconfigurable dataflow switching fabric<br />

improving data reuse and reducing the need for on-chip and<br />

off-chip memory traffic, and (3) a power-efficient array of DSPs to<br />

increase flexibility and support real-world applications. In<br />

addition, the SoC prototype designed to validate the<br />

architecture includes an ARM-based host subsystem with<br />

peripherals, a range of high-speed I/Os interfacing for imaging<br />

and other types of sensors and a chip-to-chip high-speed link to<br />

pair multiple devices together.<br />

Fig. 1. [a] Orlando 1 FD-SOI 28nm SoC prototype high level system architecture, [b] DCNN HW accelerator subsystem shown on the left, [c] a single DSP<br />

cluster shown on the right<br />

II. SOC ARCHITECTURE<br />

A. High Level System<br />

The Orlando 1 test chip prototype [Fig. 1a] integrates an<br />

ARM Cortex-M4 microcontroller with 128KB of memory,<br />

assigned with control and sequencing tasks for I/O and HW<br />

configuration and synchronization. The chip supports a number<br />

of peripherals for external communication and interfacing and<br />

includes eight programmable clusters [Fig. 1c], each one<br />

composed of two ultra-low power proprietary DSPs along with<br />

interrupt controllers, timers, and dedicated tensor transfer<br />

DMA channels. A reconfigurable dataflow accelerator fabric<br />

[Fig. 1b] connects high-speed camera interfaces with image<br />

sensor processing (ISP) pipelines, croppers, color converters,<br />

feature detectors and descriptors (FAST, Census), video<br />

encoders (MJPEG, H.264), 8 channel digital microphone<br />

interface, streaming DMAs and 8 Convolutional Accelerators<br />

(CA). The chip includes 4 SRAM banks of 1 MB each with<br />

dedicated 64-bit bus ports; each bank is composed of 64 KB memory<br />

cuts with individual sleep-line control to activate them on demand<br />

and reduce total leakage when not needed. The system<br />

parameters are chosen to sustain the execution of<br />

all convolutional stages from internal on-chip memory for<br />

DCNN topologies of a complexity similar to AlexNet<br />

without pruning, or even larger ones if fewer bits are used for<br />

activations and/or weights, to achieve high power efficiency. It<br />

is possible to connect multiple chips together via a 4-lane<br />

chip-to-chip high-speed serial link running at up to 6 Gbit/s to<br />

support larger networks without sacrificing throughput, and/or<br />

to use the chip as a co-processor. State-of-the-art DCNN<br />

topologies (e.g. VGG, ResNet, Inception-v4) are<br />

deeper, with many layers, millions of parameters,<br />

and varying kernel sizes, resulting in large bandwidth, power,<br />

and area costs often not compatible with the constraints associated<br />

with embedded devices and applications. The cost in terms of<br />

energy per access varies by almost an order of magnitude from<br />

level to level, and large gaps also exist in throughput and<br />

access latency at different levels of on-chip and external<br />

memory [Fig. 2].<br />

Fig. 2. Relative cost of accessing different levels of memory going from<br />

local buffers attached to functional units to higher levels of on-chip and<br />

external memory (energy/power per word access: local SRAM 1x, on-chip SRAM 10x, external LPDDR 100x)<br />

As a result, a common way to achieve efficiency is to<br />

define a hierarchical memory system and efficiently reuse local<br />

data in the deeper levels of the hierarchy. Accelerating DCNN<br />

convolutional layers, which account for more than 90% of total<br />

operations, calls for the efficient balancing of the computational<br />

vs memory resources for both bandwidth and area to achieve<br />



maximum throughput without hitting their associated ceilings<br />

due to architectural limitations.<br />
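The relative access costs of Fig. 2 make the value of local reuse easy to quantify; the following sketch uses purely illustrative access counts (our own numbers, not measurements from the chip):<br />

```python
# Sketch: why reusing data at the local level of the hierarchy matters.
# Relative energy per word access (after Fig. 2): local SRAM 1x,
# on-chip SRAM 10x, external LPDDR 100x. All counts are illustrative.
COST = {"local": 1, "on_chip": 10, "lpddr": 100}

def traffic_energy(accesses):
    """accesses: dict mapping memory level -> number of word accesses."""
    return sum(COST[level] * n for level, n in accesses.items())

# Same 1M-word working set, streamed from LPDDR every time vs.
# fetched once and then reused from local buffers:
no_reuse = traffic_energy({"lpddr": 1_000_000})
with_reuse = traffic_energy({"lpddr": 10_000, "local": 1_000_000})
print(no_reuse / with_reuse)  # 50.0 - 50x less memory-access energy
```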

B. DSP Sub System<br />

Each 32-bit DSP provides specific instructions (Min, Max, Sqrt,<br />

Mac, Butterfly, Average, 2-4 SIMD ALU) to accelerate typical<br />

CNN operations other than convolutions [2]. A dual load with<br />

16b saturated MAC, advanced memory buffer addressing<br />

modes and zero latency loop control execute in a single cycle<br />

while an independent 2D DMA channel allows the overlap of<br />

data transfers. The DSPs are tasked with max or average<br />

pooling, nonlinear activation, cross-channel response<br />

normalization and classification representing a small fraction of<br />

the total CDNN computation but more amenable to future<br />

algorithmic evolutions. They can operate in parallel with CAs<br />

and data transfers, synchronizing by way of interrupts and<br />

mailboxes for concurrent execution. DSPs are activated<br />

incrementally when the throughput targets require it, leaving<br />

ample margins to support additional tasks associated with<br />

complex applications, such as object localization and<br />

classification, multisensor (e.g. audio and video) DCNN-<br />

based data fusion and recognition, scene classification, etc.<br />
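As a toy illustration of the kind of non-convolutional work offloaded to the DSPs, here is a plain-Python reference for ReLU activation followed by 1-D max pooling (our own sketch of the arithmetic, not the DSP firmware):<br />

```python
# Reference arithmetic for two DSP-side layer types: nonlinear
# activation (ReLU) and max pooling, shown on a 1-D row of values.
def relu(row):
    return [x if x > 0 else 0 for x in row]

def max_pool_1d(row, size=2, stride=2):
    return [max(row[i:i + size]) for i in range(0, len(row) - size + 1, stride)]

out = max_pool_1d(relu([-3, 1, 4, -1, 5, 9]))
print(out)  # [1, 4, 9]
```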

C. The Configurable Accelerator Framework (CAF)<br />

The Orlando Neural Processing Unit (NPU) engine<br />

includes a configurable accelerator framework (CAF) [Fig. 3]<br />

with a design-time selectable number of Functional Units (FU)<br />

such as DMAs, accelerators, or I/O interfaces to external<br />

devices. A centralized, fully connected, runtime configurable<br />

stream switch interconnects all FUs with unidirectional links<br />

transporting data streams to/from different kinds of data<br />

sources and sinks. A fully automated configuration process<br />

allows the designer to quickly generate synthesizable RTL<br />

code tailored to the actual system requirements. The<br />

configuration tool suite uses predesigned FU templates provided<br />

in a central library, takes care of any signal synchronization for<br />

FUs that run on different clock domains and configures the<br />

required stream links and bus interfaces to provide access to all<br />

configuration registers in the system.<br />

Fig. 3. Orlando NPU Configurable Acceleration Framework (CAF)<br />

At runtime, an arbitrary number of concurrent virtual<br />

processing chains, limited only by the available hardware resources,<br />

can be defined to meet the specific characteristics of a task<br />

graph. These virtual processing chains can be configured and<br />

fired within a few system clock cycles and may process<br />

multiple tasks in parallel. An automatic backpressure<br />

mechanism handles the data flow control in each virtual<br />

processing chain, preventing any data overflows.<br />

The functional units available in the Orlando 1 CAF instance include:<br />

DMA ENG: 16 units, input or output, data packing/unpacking, linked-list control<br />

SENSOR IF: 2 units incl. ISP (Bayer => RGB/YUV)<br />

DISPLAY IF: DVI monitor interface<br />

CA: 8 units for 2D convolution acceleration<br />

OTHER: 1 H264 encoder, 1 MJPEG encoder, 1 MJPEG decoder, 2 census transform, 2 image croppers, 1 FAST feature detector, 4 GP color converters<br />

The interconnect supports stream multicasting to allow reuse of a<br />

data stream at multiple data sinks reducing the overall data<br />

bandwidth from/to the system bus [Fig. 4]. The unidirectional<br />

stream links are able to transport different data formats such as<br />

raster-scan images, kernel coefficients, activation data and<br />

others. Start and end tags, along with other command and<br />

message packets are used for signaling and to trigger specific<br />

actions in all FUs participating in a virtual processing chain.<br />
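The behaviour of a virtual processing chain with backpressure can be modelled in a few lines of software; `StreamLink` and `run_chain` below are our own illustrative names for a toy model, not the RTL:<br />

```python
# Toy model of a virtual processing chain: functional units connected
# by bounded FIFOs; a full link exerts backpressure on its producer.
from collections import deque

class StreamLink:
    """Bounded FIFO between two functional units."""
    def __init__(self, depth=2):
        self.q = deque()
        self.depth = depth
    def full(self):
        return len(self.q) >= self.depth

def run_chain(items, stages, depth=2):
    # one link in front of each stage plus one output link after the last
    links = [StreamLink(depth) for _ in range(len(stages) + 1)]
    pending = deque(items)
    out = []
    while pending or any(link.q for link in links):
        if links[-1].q:                       # the sink always drains
            out.append(links[-1].q.popleft())
        # service stages sink-first so freed space propagates upstream
        for i in range(len(stages) - 1, -1, -1):
            if links[i].q and not links[i + 1].full():
                links[i + 1].q.append(stages[i](links[i].q.popleft()))
        if pending and not links[0].full():   # source stalls when full
            links[0].q.append(pending.popleft())
    return out

# e.g. two toy stages standing in for crop -> convolve
print(run_chain(range(5), [lambda x: x + 1, lambda x: x * 2]))
# [2, 4, 6, 8, 10]
```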

Functional Units can have an arbitrary number of input and<br />

output stream links as well as a set of configuration registers<br />

used to enable, reset and configure their functionality. A<br />

centralized interrupt controller enables the routing of interrupt<br />

signals from any accelerator, interface or DMA engine to the<br />

DSP cores. A clock and reset management unit provides an<br />

individual clock and reset control for each FU in the system.<br />

Specialized DMA engines transform data structures<br />

accessible on the system bus into data streams injected into<br />

virtual processing chains, whereas data streams received by the<br />

DMA engines are translated back to data structures to be<br />

written to any memory location on the system bus. Extensive<br />

data packing and unpacking features in the DMA engines allow<br />

the efficient use of variable data bit width and sophisticated<br />

control mechanisms using linked lists to support autonomous<br />

processing of tensors. Interrupt signals generated by the DMA<br />

engines signal the completion of a processing task to the DSP<br />

Cores and/or central control processor.<br />
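The packing and unpacking performed by the DMA engines can be sketched as follows (a minimal software model assuming simple little-endian packing of unsigned values into 64-bit bus words; the real engines additionally handle linked lists and other formats):<br />

```python
# Sketch of DMA-style data packing: squeeze a stream of narrow
# (e.g. 8-bit) values into 64-bit bus words and unpack them again.
def pack(values, bits, word_bits=64):
    per_word = word_bits // bits
    mask = (1 << bits) - 1
    words = []
    for i in range(0, len(values), per_word):
        w = 0
        for j, v in enumerate(values[i:i + per_word]):
            w |= (v & mask) << (j * bits)
        words.append(w)
    return words

def unpack(words, bits, count, word_bits=64):
    per_word, mask = word_bits // bits, (1 << bits) - 1
    out = [(w >> (j * bits)) & mask for w in words for j in range(per_word)]
    return out[:count]

vals = [1, 2, 3, 250, 7]
assert unpack(pack(vals, 8), 8, len(vals)) == vals  # lossless round trip
```

Packing eight 8-bit weights per 64-bit word is one way the engines keep bus bandwidth proportional to the actual data width rather than the native word size.<br />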

The CAF subsystem instance in the Orlando 1 SoC<br />

prototype includes four camera interfaces (two serial and two<br />

parallel) with integrated ISPs, a display interface and various<br />

accelerators for standard image processing tasks such as color<br />

conversion, image cropping, image (MJPEG) and video (H.264)<br />

encoding. Additional accelerator blocks are available for<br />

feature point identification and tagging, such as a FAST feature<br />

point detector and two census transform blocks that allow for<br />

generating compact and illumination invariant feature<br />

descriptors.<br />

Fig. 4. The Configurable Accelerator Framework allows different kinds of<br />

virtual link connections to be created between blocks, including sources and<br />

sinks of data (e.g. simple chains, chains with forks, joins with a single<br />

interface, joins with multiple interfaces, forks and hops)<br />

D. Chip Implementation<br />

The prototype chip is manufactured with<br />

STMicroelectronics 28nm FD-SOI technology. It is designed with<br />

mono-supply SRAMs based on low-power 0.12 µm²<br />

single p-well bit cells with reduced variability, in-situ tracking<br />



of bitcell current and programmable read time for best speed<br />

and lowest dynamic power. Memories also have in-situ<br />

tracking of word line delay and slope for robust low voltage<br />

read/write across a wide voltage range from 1.1V to 0.575V<br />

[Fig. 6]. Globally asynchronous and locally synchronous<br />

clocking architecture reduces the clock network dynamic<br />

power and skew sensitivity due to on-chip variation at lower<br />

voltages and eases the use of dynamic frequency scaling. Fine-grained<br />

power gating and multiple sleep modes for memories<br />

decrease the overall dynamic and leakage power consumption.<br />

Die size [Fig. 5] is 6.2x5.5 mm², each CA is 0.27 mm²<br />

including memory, and the chip reaches 1.175 GHz at 1.1 V with<br />

a theoretical peak CA performance of 676 GOPS. The<br />

chip is capable of sustaining a wide range of operating points<br />

and can run at 200 MHz with a 0.575 V supply at 25 °C with an<br />

average power consumption of 41 mW on AlexNet using eight<br />

pipelined CAs, achieving a peak efficiency of 2.9 TOPS/W.<br />

Orlando 1 SoC prototype summary (annotated die photo in Fig. 5: OTP, high-speed camera IF, PLL, chip-to-chip link, co-processor subsystem, DSP cores and local memories, global memory subsystem):<br />

Technology: FD-SOI 28nm<br />

Chip size: (X) 6239.2 um, (Y) 5598.2 um<br />

Package: FBGA 15x15x1.83<br />

Clock freq: 200MHz – 1.175GHz<br />

Supply voltages: 0.575V – 1.1V digital, 1.8V I/O<br />

Power: 41 mW<br />

On-chip RAM: 4x1 MB, 8x192 KB, 128 KB<br />

No. of DSPs: 16<br />

Peak DSP performance (*): 75 GOPS (dual 16b MAC loop)<br />

No. of CAs: 8<br />

Peak CAs performance (*): 676 GOPS<br />

(*) 1 MAC defined as 2 OPS (ADD + MUL)<br />

Sub-tensors can be processed entirely with the local buffer<br />

resources available in each accelerator. The configurable batch<br />

size and a variable number of parallel kernels enable optimal<br />

trade-offs for the available input and output bandwidth sharing<br />

across different units and the available computing logic<br />

resources. Keeping the entire batch of feature and kernel data<br />

locally and as close as possible to the MAC units enables the<br />

optimal use of the available power budget. Feature and kernel<br />

data batches can be processed sequentially with multiple<br />

accelerators in a virtual processing chain or iteratively with<br />

intermediate results being stored in on-chip memory and<br />

fetched in the subsequent batch processing round.<br />

Various kernel sizes (up to 12x12), sub-tensor batch sizes<br />

(up to 16), and parallel kernels (up to 4) can be handled by a<br />

single CA instance, but any kernel size can be accommodated<br />

with the accumulator input. The CA includes a line buffer to<br />

fetch up to 12 feature map data words in parallel with a single<br />

memory access. A register-based kernel buffer provides 36<br />

read ports, while 36 16-bit fixed point multiply-accumulate<br />

(MAC) units perform up to 36 MAC operations per clock<br />

cycle. The kernel buffer implements pre-buffering of kernel<br />

data that are required in a subsequent processing step.<br />

An adder tree accumulates MAC results for each kernel<br />

column [Fig. 7]. The overlapping, column-wise calculation of<br />

the MAC operations allows an optimal reuse of the feature<br />

maps data for multiple MACs thus reducing the power<br />

consumption associated with redundant memory accesses.<br />
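The column-wise accumulation can be illustrated numerically (a functional sketch of the arithmetic only, with no modelling of the buffers or timing; as is common in CNN practice, this computes cross-correlation):<br />

```python
# Illustrative column-wise 2D "convolution" (valid mode): each feature
# column is multiplied against a whole kernel column, and an adder
# tree then sums the per-column results.
def conv2d_colwise(feat, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(feat) - kh + 1, len(feat[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            # per-column MACs, then the "adder tree" across columns
            col_sums = [
                sum(feat[r + i][c + j] * kernel[i][j] for i in range(kh))
                for j in range(kw)
            ]
            out[r][c] = sum(col_sums)
    return out

feat = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(conv2d_colwise(feat, kernel))  # [[6, 8], [12, 14]]
```

Because a fetched feature column contributes to every kernel column that overlaps it, the same feature words feed multiple MACs, which is the data reuse the text describes.<br />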

An optimal CA configuration for each DCNN<br />

layer is currently defined manually, while we are working on a tool to<br />

automatically generate it off-line starting from a DCNN<br />

description format such as Caffe or TensorFlow [10].<br />

Fig. 5. Orlando 1 prototype SoC built in FD-SOI 28nm technology<br />

Fig. 6. The Orlando 1 SoC prototype supports a wide DVFS range of operating<br />

conditions, from ultra-low Vdd for highest efficiency to high performance<br />

(approximately 2930 GOPS/W at 0.575 V / 200 MHz down to 801 GOPS/W at 1.1 V / 1.175 GHz)<br />

Fig. 7. Orlando NPU Convolutional Accelerator (feature line buffer, kernel buffer, 36 MAC units and adder tree). Key parameters:<br />

MAC UNITS: 36 x 16x16-bit MACs<br />

FEATURE LINE BUFFER: up to 12 lines with up to 512 pixels, or 3 lines with up to 2048 pixels<br />

KERNEL BUFFER: up to 484 kernel values, 36 read ports<br />

KERNEL SIZE: 1x1 to 12x12<br />

BATCH SIZE: up to 16<br />

PARALLEL KERNELS: up to 4<br />

FEATURE SIZE: up to 512 for kernels > 6x6, up to 1024 for kernels > 3x3, up to 2048 for others<br />

VARIOUS EXTENSIONS: kernel decompression 8 bit => 16 bit, kernel prebuffering, output stream merging, data shifting and rounding<br />

III. CNN HW ACCELERATION<br />

A. Convolutional Accelerators (CA)<br />

Convolutional accelerators can be grouped or chained<br />

together to handle varying sizes of feature maps and multiple<br />

kernels in parallel using the interconnection capabilities<br />

provided by the programmable stream switch, adapting to<br />

different neural network topologies as well as feature and<br />

kernel tensor geometries.<br />

B. Hyperparameters compression<br />

A large number of schemes have been proposed in the<br />

literature to compress CNN hyper-parameters with fewer bits,<br />

including uniform and trained quantization, pruning, weight<br />

sharing and even Huffman encoding, among other techniques [8].<br />

It is generally accepted that in many cases, at the price of only<br />

marginal decreases in output accuracy, the required precision can be<br />

lower than 16 bits and as low as eight or fewer bits [9]. In order<br />

to keep the hardware complexity limited, we have selected a<br />

relatively simple non-linear quantization<br />

scheme for which the quantization steps are defined offline with a<br />

k-means approach applied to all of the weights of each layer.<br />

This scheme is flexible enough to also accommodate linear<br />

quantization models with a min/max boundary representation<br />

such as the one adopted in TensorFlow. The Orlando<br />

Convolutional Accelerators can decompress the<br />

compressed weights at run time before storing them into the local kernel<br />

buffers, providing significant benefits in terms of total memory<br />

bandwidth reduction, while the nonlinear<br />

quantization scheme helps to minimize the loss of<br />

accuracy. Fig. 8 shows the quantizer functions for<br />

two different layers of an AlexNet compressed to eight bits per<br />

coefficient starting from their FP32 representation produced<br />

during training; as can be seen, the statistics vary<br />

significantly across layers, showing the benefits of allocating<br />

quantization steps non-uniformly and asymmetrically with<br />

respect to the center offset. The CA supports on-the-fly kernel<br />

decompression and rounding; the functionality is implemented<br />

with a lookup table populated before the processing of<br />

a tensor or sub-tensor starts.<br />
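A toy reconstruction of this flow in Python, using a crude 1-D k-means and a lookup table (our own illustration, not the authors' tool chain):<br />

```python
# Sketch of the per-layer quantization idea: 1-D k-means builds a small
# codebook (the lookup table), weights are stored as codebook indices,
# and a LUT restores approximate values at run time.
def kmeans_1d(values, k, iters=20):
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]  # crude init
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            buckets[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]
    return centers

weights = [0.11, 0.09, -0.52, -0.48, 0.10, -0.50]   # toy layer weights
lut = kmeans_1d(weights, k=2)                        # per-layer codebook
codes = [min(range(len(lut)), key=lambda i: abs(w - lut[i])) for w in weights]
decoded = [lut[c] for c in codes]                    # run-time LUT lookup
print(max(abs(w - d) for w, d in zip(weights, decoded)))  # small error
```

With the codebook learned per layer, only the short indices travel over the memory system, which is the bandwidth saving the text refers to.<br />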

Fig. 8. Kernel weights can be quantized non-linearly with 8 or fewer bits<br />

(e.g. with k-means); the Convolutional Accelerator supports decompression in HW;<br />

AlexNet top-1 classification error rate increase of 0.3% (quantizer functions<br />

shown for Layer 1 and Layer 3)<br />

On many CNN topologies, kernel weights can be quantized<br />

with an ensemble of vector codebooks for increased network<br />

compression and lower memory bandwidth without significant<br />

loss in performance. We have developed a scheme that can be<br />

applied to kernel tensors to take advantage of this; TABLE I.<br />

shows how many bits per coefficient per layer are achieved<br />

when using an ensemble of 1 to 64 vector codebooks per layer<br />

with vector lengths of 3 and 5 coefficients. The codebooks are<br />

learned with a modified version of k-means adapted to low<br />

values of k; vectors can be chosen by slicing the kernel tensors<br />

horizontally, vertically or depth-wise, grouping them into<br />

subsets, each assigned to a different codebook. The position of<br />

each vector in the original kernel determines which codebook<br />

is used to encode it in order to avoid transmitting additional<br />

bits for encoding its label. We have not observed great variations<br />

depending on the direction (x, y or z) from which the vectors<br />

are selected; however, great variability in the optimal<br />

VQ ensemble parameters is observed from one network<br />

topology to another, and even for the same network<br />

trained on a different set of classes.<br />
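One reading of the bits-per-coefficient figures in TABLE I that reproduces them exactly is to count both the per-vector indices and the codebook storage, assuming 8-bit codebook entries (this accounting is our reconstruction; the 8-bit assumption is ours):<br />

```python
import math

# Hypothetical reconstruction of the TABLE I accounting: total bits =
# per-vector codebook indices + storage of the codebooks themselves,
# divided by the number of coefficients in the layer.
def bits_per_coeff(n_params, n_codebooks, vec_len, entries, entry_bits=8):
    index_bits = (n_params // vec_len) * math.log2(entries)
    codebook_bits = n_codebooks * entries * vec_len * entry_bits
    return (index_bits + codebook_bits) / n_params

print(round(bits_per_coeff(432, 4, 3, 16), 2))       # 4.89 (layer C1)
print(round(bits_per_coeff(4608, 1, 3, 256), 2))     # 4.0  (layer C2)
print(round(bits_per_coeff(294912, 32, 3, 256), 2))  # 3.33 (layer C5)
```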


TABLE I. ACCURACY VS PARAMETER COMPRESSION WITH VQ FOR TINY YOLO [13]<br />

Vector Quantized Tiny Yolo<br />

Layer | No. codebooks | Codebook geometry | No. of parameters | Bits per coefficient<br />

C1 | 4 | 3x16 | 432 | 4.89<br />

C2 | 1 | 3x256 | 4608 | 4.00<br />

C3 | 4 | 3x256 | 18432 | 4.00<br />

C4 | 16 | 3x256 | 73728 | 4.00<br />

C5 | 32 | 3x256 | 294912 | 3.33<br />

C6 | 64 | 3x256 | 1179648 | 3.00<br />

C7 | 64 | 3x256 | 4718592 | 2.75<br />

C8 | 64 | 3x256 | 9437184 | 2.71<br />

C9 | 64 | 5x256 | 128000 | 6.72<br />

Total parameters: 15855536<br />

Network | Bits per coefficient | IOU% | Recall%<br />

FP32 | 32 | 65.17 | 81.53<br />

VQ Quantized | 2.86 | 63.28 | 79.76<br />

IV. CNN LOGICAL TO PHYSICAL MAPPING<br />

Efficient mapping of a CNN task graph to the underlying architectural computing and memory resources requires that the execution of convolutional layers is partitioned by slicing both kernel and input activation tensors. Each sub-tensor is assigned to a different convolutional accelerator, and the partial results can either be sent to memory or directly streamed into the input of another accelerator processing a different slice of the same kernel sub-tensor, for direct accumulation [Fig. 9].

The Orlando streaming architecture allows virtual channels between CAs to be created dynamically, chaining them together in a coprocessor pipeline or running them independently while broadcasting sub-tensor input data, without the need for separate memory accesses [Fig. 10]. The shape of the sub-tensors is constrained by a relatively large number of parameters, such as the available local storage for each CA (line buffers and kernel buffers), the total on-chip memory available for input and output activation maps, and the size of the kernels for a given layer compared to the maximum kernel size supported by the accelerators.

Multiple accumulation rounds are required if the iteration space exceeds any of those constraints. In addition to finding a legal schedule for sub-tensors that tessellates the whole global tensor iteration space, a mapping strategy should take into account a multi-objective cost function that includes not only performance (e.g. frames per second) but possibly also energy efficiency (frames/sec/W), while keeping external memory size and bandwidth requirements in consideration. While this is a relatively difficult scheduling problem to solve, there are effective approaches that automatically derive a nearly optimal solution, for example based on polyhedral models in the simplified iteration space of the convolutional tensor processing [7].
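As a toy illustration of the constraint side of this scheduling problem, the sketch below picks the deepest input-depth slice that fits within a single CA's line and kernel buffers, which minimizes the number of accumulation rounds along the depth axis. The names and buffer capacities are our own illustrative assumptions, not the actual Orlando parameters:

```python
from dataclasses import dataclass

@dataclass
class CAConfig:              # illustrative capacities, not actual Orlando values
    line_buffer_coeffs: int  # input feature-map coefficients one CA can hold
    kernel_buffer_coeffs: int

def feasible_depth_slices(in_depth, kh, kw, width, ca):
    """Depths of an input sub-tensor slice that fit a single CA's buffers."""
    for d in range(1, in_depth + 1):
        lines_fit = d * width * kh <= ca.line_buffer_coeffs
        kernels_fit = d * kh * kw <= ca.kernel_buffer_coeffs
        if lines_fit and kernels_fit:
            yield d

def pick_slice(in_depth, kh, kw, width, ca):
    """Largest feasible depth, i.e. fewest accumulation rounds over depth."""
    best = max(feasible_depth_slices(in_depth, kh, kw, width, ca), default=None)
    if best is None:
        raise ValueError("kernel exceeds CA-supported size; split further")
    rounds = -(-in_depth // best)   # ceiling division
    return best, rounds
```

For example, a 3x3 layer of depth 384 on a 13-wide feature map with the hypothetical buffers `CAConfig(4096, 512)` yields depth-56 slices and seven accumulation rounds; a real mapper would explore width and kernel tiling jointly, against the multi-objective cost described above.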

www.embedded-world.eu


[Figure: feature and kernel tensors sliced along depth into batches; parallel, chained, and combined parallel/chained batch execution across CAs 0..N, with DMA engines feeding feature-map and kernel data and collecting outputs K0/K1]
Fig. 9. Feature and kernel tensors are sliced into batches of variable depth, processed iteratively, and the results are accumulated.
Fig. 10. Chained and parallel sub-tensor execution on multiple CAs reduces bandwidth, power, and the number of DMA channels.

V. EXPERIMENTAL RESULTS

In the following section we provide experimental results for executing a few DCNN workloads. First, a typical AlexNet benchmark is described, for which maximum power efficiency is the target on the actual Orlando SoC prototype; then a VGG-16 workload is described (this one based on a simulated model), comparing different choices of design-time parameters for the Orlando NPU to illustrate performance for different possible configurations.

A. AlexNet on the Orlando 1 SoC prototype

When a compressed format with eight bits is adopted, as described in section III.B, AlexNet fits entirely within the internal on-chip memory, with the exception of the fully connected (FC) layers for the final classifier stages. The total amount of internal storage required is 2318 KB for parameters stored with 8 bits each and 1436 KB for feature maps with a precision of 16 bits, plus a total of ~10 MB of external RAM for FC layers compressed with a VQ scheme [Fig. 11b]. All five convolutional layers are directly mapped onto the Orlando NPU via a dynamic configuration of the configurable accelerator framework and associated CAs, while the rest of the layers are directly managed by optimized code running on the eight DSP clusters [Fig. 11a].

Performance is reported in TABLE II for each layer, in terms of processing latency, percentage utilization of the CAs' computing resources, and GOPs/sec/W for both 8- and 16-bit coefficient precision, with accumulator results always scaled back to 16 bits when storing the final convolution result. The chip operates at 200 MHz with a Vdd of 0.575 V at 25 degrees Celsius, and each convolutional layer is processed with four independent chains of two cascaded CAs.

[Figure: AlexNet layer pipeline (CONV 11x11, 5x5 and 3x3 layers with RELU/NORM/POOL, followed by FC layers) partitioned between the HW accelerators and the DSP clusters, with per-layer memory footprints; convolutions account for 85-90% of the ~832 M total operations, and the network has ~60 M parameters in total]
Fig. 11. [a] Top: AlexNet HW/SW partitioning; [b] bottom: memory footprint.

The input is a batch of a single image of size 227x227 pixels. The maximum efficiency is reached for layers 3, 4 and 5, and is equal to 2930 GOPS/sec/W, with a total average for the whole network of 2473 GOPS/sec/W for 8-bit coefficients and 2009 GOPS/sec/W for 16 bits. The average power consumption in this configuration is 41 mW and includes the power for the whole accelerator subsystem and the on-chip memories.
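These efficiency figures follow from a straightforward throughput-per-watt computation; for instance, the whole-network row of TABLE II:

```python
def gops_per_watt(mops, latency_ms, power_mw):
    """Efficiency from workload size, latency and power: MOPS/ms equals GOPS/s."""
    gops_per_sec = mops / latency_ms
    return gops_per_sec / (power_mw / 1000)

# Whole-network AlexNet figures from TABLE II (8(F)x8(W)->16 configuration):
# 1331.6 MOPS, 17.1 ms total latency, 41 mW average power.
print(round(gops_per_watt(1331.6, 17.1, 41)))  # 1899, matching the reported
# average of 1898 GOPS/sec/W within rounding
```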

TABLE II.

Layer   MOPS     Latency   Util.   16(F)x16(W)->16       16(F)x8(W)->16        8(F)x8(W)->16
                 [ms]              GOPs/W      Power     GOPs/W      Power     GOPs/W      Power
                                   max    avg  [mW]      max    avg  [mW]      max    avg  [mW]
1       210.8    2.5       80%     1228   988   86       1471  1183   72       1810  1456   58
2       447.8    6.5       86%     1475  1262   54       1767  1512   45       2175  1861   37
3       299.0    3.6       73%     1987  1445   58       2380  1731   48       2930  2131   39
4       224.2    2.7       73%     1987  1445   58       2380  1731   48       2930  2130   39
5       149.6    1.8       72%     1987  1434   58       2380  1717   48       2930  2114   39
Total   1331.6   17.1      77%     1677  1287   61       2009  1542   51       2473  1898   41

B. VGG-16 estimates on different configurations

In order to evaluate the flexibility and efficiency of the Orlando NPU template, we have estimated the performance and power consumption of VGG-16 [14] on a number of different configurations with a varying number of accelerators, assuming the availability of a high-speed external LPDDR3/4 memory interface. We assumed a total on-chip memory allocated to the DCNN workload of 512 KB (a reasonable assumption for a SoC in current silicon process technologies) and configured the CAs to support 144 and 36 MACs with 16-bit and 8-bit precision respectively (8-bit MACs have four times the throughput by way of a SIMD implementation). Results are shown in Fig. 12 in terms of frames-per-second throughput vs. power consumption, for a range of voltage/frequency operating points and for configurations of 1, 2, 4, 8 and 16 CAs respectively.



[Figure: VGG-16 frames per second vs. power (mW) for 1, 2, 4, 8 and 16 CA configurations, over Vdd/frequency operating points from 0.575 V/200 MHz to 1.1 V/1175 MHz, with throughput ranging up to 140 FPS]
Fig. 12. VGG-16 performance vs. power scaling for different Vdd ranges and CAs, with 8bpp MACs and 16 kernels in parallel.

[Figure: VGG-16 efficiency (frames/sec/W) vs. external LPDDR bandwidth (MB/sec) for 1, 2, 4, 8 and 16 CA configurations, over the same Vdd/frequency operating points]
Fig. 13. VGG-16 efficiency vs. external LPDDR bandwidth for different Vdd ranges and CAs, with 8bpp MACs and 16 kernels in parallel.

Fig. 13 shows the efficiency in terms of frames per second per Watt vs. the associated bandwidth requirements for the external LPDDR interface. A configuration with a single CA provides 1.55 FPS at 200 MHz and 0.575 V, with an efficiency of 130 FPS/W (equivalent to 1.95 TOPS/W) at 12 mW, while at the same frequency and power supply a 16-CA configuration delivers 25 FPS with an efficiency of 440 FPS/W at 56 mW. The two configurations require a throughput of 215 MB/sec and 2015 MB/sec respectively. In the high-end corner for frequency and supply (1.1 GHz, 1.1 V), the peak performance for the 16-CA configuration reaches 140 FPS, with a lower efficiency of 145 FPS/W and an associated external memory bandwidth requirement of ~11 GB/sec, still compatible with a 32-bit LPDDR4 interface.
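These corner figures follow from a direct FPS-per-watt computation, which also lets one back out the per-frame operation count implied by the TOPS/W number:

```python
def fps_per_watt(fps, power_mw):
    """Frames-per-second efficiency normalized to one Watt."""
    return fps / (power_mw / 1000)

# Single-CA and 16-CA corners at 0.575 V / 200 MHz, figures from the text:
print(round(fps_per_watt(1.55, 12)))   # 129 (reported as 130)
print(round(fps_per_watt(25, 56)))     # 446 (reported as 440)

# Converting FPS/W to TOPS/W requires the per-frame operation count;
# 1.95 TOPS/W at 130 FPS/W implies about 15 GOP per VGG-16 frame:
print(round(1.95e12 / 130 / 1e9))      # 15
```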

VI. CONCLUSIONS AND FUTURE WORK

We have described a flexible and scalable HW architecture to accelerate DCNN workloads for the design of scalable NPUs, together with its silicon validation, demonstrating its use in accelerating deep convolutional neural network operations, with a focus on convolutions, the key compute-intensive task therein. We have also addressed the problem of the large parameter space associated with these networks by incorporating a quantization scheme that is simple to implement in HW yet effective enough to compress the parameter space of embeddable networks like Tiny YOLO, which, although targeting resource-constrained devices, would otherwise still need off-chip external memory. In terms of future work, we are evolving the Orlando architecture to include HW acceleration of other non-convolutional operators such as pooling, activation functions (ranging from sigmoid, tanh and ReLU variants to custom-defined activations covering recent work like Kafnets [12]), batch normalization, and other miscellaneous operators. These new accelerators will leverage the streaming dataflow model of the Orlando to stitch together computation pipelines at runtime, allowing data to flow from one block to the other and reducing the need for subsequent compute units to access memory, thus providing an energy-efficient realization of the execution graph.

VII. REFERENCES

[1] Y. LeCun et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[2] A. Krizhevsky, I. Sutskever, G. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, pp. 1-9, Lake Tahoe, NV, 2012.
[3] J. Sim et al., "A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoE Systems," ISSCC Dig. Tech. Papers, pp. 264-266, 2016.
[4] T. Chen et al., "A High-Throughput Neural Network Accelerator," IEEE Micro, vol. 35, no. 3, pp. 24-32, 2015.
[5] Y. Chen et al., "DaDianNao: A machine-learning supercomputer," IEEE/ACM Int. Symp. on Microarchitecture, pp. 609-622, 2014.
[6] Y. Chen et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," ISSCC Dig. Tech. Papers, pp. 262-264, 2016.
[7] B. Pradelle, B. Meister, M. Baskaran, J. Springer, R. Lethin, "Polyhedral Optimization of TensorFlow Computation Graphs," 6th Workshop on Extreme-scale Programming Tools (ESPT-2017).
[8] S. Han, H. Mao, W. J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," http://arxiv.org/abs/1510.00149, 7 Jun 2017.
[9] A. Delmas Lascorz, S. Sharify, P. Judd, A. Moshovos, "Tartan: Accelerating Fully-Connected and Convolutional Layers in Deep Learning Networks by Exploiting Numerical Precision Variability," https://arxiv.org/abs/1707.09068, 27 Jul 2017.
[10] "How to Quantize Neural Networks with TensorFlow," https://www.tensorflow.org/performance/quantization.
[11] N. P. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, 26 June 2017.
[12] S. Scardapane et al., "Kafnets: kernel-based non-parametric activation functions for neural networks," https://arxiv.org/pdf/1707.04035.pdf, 23 Nov 2017.
[13] J. Redmon et al., "YOLO: Real-Time Object Detection," https://pjreddie.com/darknet/yolo, accessed 10 Jan 2018.
[14] K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," https://arxiv.org/abs/1409.1556, 10 Apr 2015.



Certification Aspects of a Connected Vehicle

Ritu Sethi
Intel Technology India Pvt. Ltd.
Bangalore, India
ritu.sethi@intel.com

Abstract—Vehicular communications is an evolving area of networking between vehicles and everything else (V2X): vehicle-to-vehicle (V2V), vehicle-to-pedestrian (V2P), vehicle-to-infrastructure (V2I) or vehicle-to-network (V2N). The IEEE-backed DSRC (dedicated short-range communications, based on the 802.11p Wi-Fi protocol) and the 3GPP-proposed LTE advancements in Cellular-V2X are two such emerging wireless technologies that will enable communication between the talking vehicles of tomorrow. While the LTE enhancements are still being standardized, DSRC-enabled products are already appearing in new cars, providing 360-degree situational awareness to enhance vehicle safety [7].

Vehicular communication puts forth a unique wireless communication scenario, with stringent requirements of fast network acquisition, ultra-low latency, very high reliability, priority for safety-critical messages and interoperability between technologies, while still conforming to security and privacy constraints. Even though wireless communication technologies individually provide specifications, and conformance to them would certify that particular component, there is still a need to certify the complete system holistically. This paper brings out these certification challenges.

Keywords—V2X; DSRC; Cellular; Safety; Certification

I. INTRODUCTION

Nearly two decades ago, the United States Department of Transportation's (USDOT) National Highway Traffic Safety Administration (NHTSA) analyzed accidental driving-related mishaps and took up an initiative to address them. It was concluded that 82% of all car crashes involved impaired drivers, and that up to 90% of car-accident-related deaths and 40% of crashes at intersections could be eliminated if vehicle-to-vehicle (V2V) communication could be enabled.

In 1999, the FCC allotted 75 MHz of spectrum in the 5.9 GHz band to be used by intelligent transportation systems (ITS) in the US. In 2008, ETSI allocated 30 MHz to be utilized for V2X-based studies in the European Union. The IEEE working group thus led the initiative of proposing an amendment to the 802.11a Wi-Fi protocol, specifically addressing automotive cases and their requirements.

Thus a new protocol, 802.11p, termed DSRC (dedicated short-range communications), evolved [1][3], specifically designed to permit the very fast data transmission that is critical in communications-based active safety applications in the automotive sector. Not to be left behind, the LTE standards evolved to take on the requirements for Cellular-V2X, bringing in many more commercial use cases. Depending on the use case, the requirements on the protocol can be readily translated into Key Performance Indicators (KPIs). For example, the safety-critical collision-avoidance use case needs an end-to-end latency of a few milliseconds, giving the driver sufficient reaction time; reliability of the order of 10^-x; support for node mobility of the order of a hundred km/h; positional accuracy of a few cm; and a communication range of the order of a few hundred meters.

With the integration of V2X into autonomous and assisted driving use cases, there is an interplay with all the other sensor-data assimilation as well. As a standalone system, there is a compelling need for the connected vehicle to be fully independent and capable of providing a functionally reliable and safe environment. As a player in the entire ecosystem, it has to contribute towards providing a low-interference and non-vulnerable operating condition.

This paper is structured as follows: Section II provides a brief on the wireless technologies used to achieve V2X, including the technical and infrastructural challenges; Section III presents the certification challenges, followed by a brief summary with closing remarks.

II. WIRELESS TECHNOLOGIES ENABLING V2X

Historically, an important wireless technology for V2X is DSRC, based on the IEEE 802.11p protocol. It defines the physical (PHY) and medium access control (MAC) layers. It has evolved from the Wi-Fi standard IEEE 802.11a and maintains the same frame structure and modulation as 802.11a. The software stack is standardized under the IEEE 1609 working group for WAVE (Wireless Access in Vehicular Environments). Many different layers have been developed for various networking and management functions for multi-channel operations, resource management and security, specifically for vehicular use cases.



At a high level, DSRC is based on the principle of each vehicle broadcasting its core state information in a Basic Safety Message (BSM), nominally 10 times per second. The BSM contains vehicle state information (location, speed, acceleration, and heading) and is sent out in all directions. The receivers, on the other hand, build a model of each neighbor's trajectory, assess the threat to the host vehicle, and warn the driver to take control if the threat becomes acute.
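The broadcast-and-model loop described above can be sketched as follows. The field set is a simplification of the BSM (real messages follow SAE J2735 and carry many more fields), and the constant-velocity projection with a fixed proximity radius is purely an illustrative threat heuristic, not a production algorithm:

```python
import math
from dataclasses import dataclass

@dataclass
class BasicSafetyMessage:
    """Simplified core-state fields of a DSRC BSM (nominally sent at 10 Hz)."""
    x_m: float        # position east, meters (real BSMs carry lat/long)
    y_m: float        # position north, meters
    speed_mps: float
    heading_rad: float

def predict(bsm: BasicSafetyMessage, t_s: float) -> tuple[float, float]:
    """Constant-velocity projection of a vehicle's trajectory."""
    return (bsm.x_m + bsm.speed_mps * t_s * math.sin(bsm.heading_rad),
            bsm.y_m + bsm.speed_mps * t_s * math.cos(bsm.heading_rad))

def acute_threat(host: BasicSafetyMessage, neighbor: BasicSafetyMessage,
                 horizon_s: float = 4.0, radius_m: float = 3.0) -> bool:
    """Warn if predicted positions come within `radius_m` inside the horizon."""
    for step in range(int(horizon_s * 10)):      # evaluate at the 10 Hz BSM rate
        hx, hy = predict(host, step / 10)
        nx, ny = predict(neighbor, step / 10)
        if math.hypot(hx - nx, hy - ny) < radius_m:
            return True
    return False
```

Two vehicles closing head-on at 15 m/s each from 60 m apart meet within two seconds, so this check would flag them; a parallel neighbor 100 m away never enters the radius.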

Not to be left behind in utilizing LTE cellular networks, 3GPP has enhanced the Release 14 specifications (and beyond) to standardize cellular access for vehicular communication. Since the cellular network provides higher capacity than local Wi-Fi networks and enjoys worldwide deployment, the existing broadcast mechanisms such as MBMS and eMBMS (evolved Multimedia Broadcast Multicast Service), SC-PTM (Single-Cell Point-to-Multipoint) and Sidelink (PC5) for device-to-device communication [2][5] can be utilized to meet the particularly demanding needs of V2X. With the help of the core network, prioritization of safety vs. non-safety messages can be easily achieved. The controlling nodes can vary transmission rate and range based on service conditions such as vehicle speed and density.

The proposed enhancements in LTE make the communication range sufficient to give the driver(s) ample response time (e.g. 4 seconds). The maximum supported relative velocity of the vehicles is 500 km/h, and the absolute velocity is 250 km/h for the vehicle-to-vehicle or vehicle-to-pedestrian use case. The standard also ensures that the subscriber identity cannot be tracked or identified by any other vehicle subscriber, single party or operator beyond the short time period required by the V2X application.
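The range requirement follows directly from the supported closing speed and the desired response time; at the 500 km/h maximum relative velocity, a 4-second warning implies roughly 556 m of range:

```python
def required_range_m(relative_speed_kmh: float, response_time_s: float) -> float:
    """Minimum communication range so that two closing vehicles still have
    `response_time_s` of warning before they would meet."""
    return relative_speed_kmh / 3.6 * response_time_s  # km/h -> m/s, then * s

print(round(required_range_m(500, 4)))  # 556 m at the maximum closing speed
```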

A. Technical Challenges

Since the underlying 802.11 protocol is based on Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA), which relies heavily on carrier sensing with back-off timers upon sensing a collision, the intensity of channel contention among vehicles in a dense urban setting increases: the high transmission collision rate leads to large channel delays. Some optimizations proposed in [4] reference the 802.11e Quality of Service (QoS) enhancements to provide priority access to certain traffic.

Optionally, high-priority traffic may use a shorter back-off time before trying to sense the channel again for activity. This improves latency for safety-critical traffic but does not address the contention, as there are no guaranteed or reserved resources in 802.11p. Another possible way to address it is the dynamic formation of Vehicular Ad-hoc NETworks (VANETs), giving an improved chance of channel access to the newly formed network of vehicles in close proximity. But such VANETs come with a high maintenance cost, due to the high mobility and the dynamic movement of the individual nodes.
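The priority mechanism can be illustrated with a toy draw of EDCA-style contention parameters; the values below follow the 802.11e pattern of smaller windows for higher-priority access categories, but they are illustrative, not the normative 802.11p profile:

```python
import random

# Smaller contention windows and AIFS mean statistically earlier channel
# access. These are illustrative EDCA-style values, not the normative set.
ACCESS_CATEGORIES = {
    "safety_critical": {"aifs_slots": 2, "cw_min": 3,  "cw_max": 7},
    "background":      {"aifs_slots": 7, "cw_min": 15, "cw_max": 1023},
}

def backoff_slots(category: str, retries: int = 0) -> int:
    """Slots a station defers before transmitting: AIFS plus a random backoff.
    The window doubles per retry (binary exponential backoff), capped at cw_max."""
    ac = ACCESS_CATEGORIES[category]
    cw = min((ac["cw_min"] + 1) * (2 ** retries) - 1, ac["cw_max"])
    return ac["aifs_slots"] + random.randint(0, cw)
```

With these numbers, safety-critical traffic's worst first-attempt deferral (2 + 3 slots) is still shorter than background traffic's best case (7 slots), yet nothing reserves the medium: both categories continue to contend and can collide under dense load, which is exactly the limitation noted above.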

While the competing technology of LTE geared towards V2X is capable of addressing the limitations of the 802.11p PHY to some extent, it comes with its own set of constraints. The cellular network is designed around centralized control and is not well suited to safety applications with strict delay requirements, as it carries signaling overhead that implies high latency. Each vehicle is required to have a valid subscription and authorization, irrespective of whether it is served by E-UTRAN or not. Even though this approach has a huge push from the 5G Automotive Association (5GAA), it is still slowly finding its way, since the specifications were only standardized in January 2017 and are still lagging in wide industry adoption.

B. Infrastructural Challenges

V2X is an incumbent technology, and before it can be deployed it needs huge investment in the setup of some or all of the following infrastructural nodes:

- Equipment at the roadside, enclosures, mountings, and network backhaul;
- Controllers and systems at traffic lights and intersections to provide signal phase and timing accurately;
- Systems and services to provide detailed maps and geometries, including road signage;
- Positioning services for resolving vehicle locations to high accuracy and precision;
- Centralized data-collection servers and analysis of the data provided by the vehicles;
- Security credential management and processes for a trusted network.

III. CERTIFICATION CHALLENGES

Recently, a car manufacturer applied for design approval of a car interior having no steering wheel or brake pedals, citing no need for manual controls in an autonomous driving vehicle. With this utmost level of automation and no manual intervention, there is an increased focus on validation and verification not just of basic functionality, but also of the utmost secure and safe operation for both the passengers within the vehicle and others nearby. Since the connected vehicle is not only responsible for its own safety-related use cases but also contributes to those of the other vehicles in its neighborhood, it becomes critical for it to abide by norms that lead to a reliable, functionally dependable, robust, secure and safe operating condition.

While the above-mentioned wireless technologies provide a backbone to enable connectivity between vehicles and prove technical viability, V2X devices do not work self-sufficiently; they depend on in-feed sensor data from the vehicle, which is usually controlled by the vehicle OEM (Original Equipment Manufacturer). In such a case, the certification of each module can be achieved with ease, but end platform-level certification is an enormous task, especially because of the tight coupling needed to achieve the interworking.

With the given constraints, certification becomes a challenge, especially with multiple competing standards and a profusion of car manufacturers and vendors offering their own solutions. Below are a few challenges that need to be addressed:

A. Inter-operational

Interoperability is a critical piece of the puzzle. To achieve the full benefit of the system, every vehicle on the road should be able to communicate with every other vehicle, and that is possible only when every vehicle follows a common protocol. Today, the industry is divided between DSRC and Cellular-V2X, with several consortiums backing their respective choices. A few of the requirements are as follows:

- Seamless user experience across 3GPP and non-3GPP interworking: enable seamless handovers across technologies.
- Ensure interoperability across multi-modal communication systems (DSRC/Cellular).
- Ensure consistency of the messages being delivered from Cellular and DSRC systems.
- Ensure interoperability and consistency across vendors, deployments, OEMs and a wide array of device manufacturers.
- An agreed-upon and standardized adoption of intersection and traffic-control management by the Department of Transportation.

B. Functionally Reliable

The system should consistently behave as expected, delivering the required functionality in any reasonable scenario. The adopted technologies should be able to provide service in all scenarios, in and out of coverage, including non-subscribed users and network gaps. Additionally, the system should adhere to the following:

- Reliable and consistent message prioritization/transmission across cities, states and countries.
- Consistency in deploying DSRC and multi-modal communication systems at all levels.
- System redundancy planning to handle natural disasters.

C. Fault Management

The ability to manage faults and raise them appropriately is an essential safety requirement. The following could provide high-level guidelines:

- The system should be able to detect network outages/gaps and raise and broadcast alarms over the available multimodal paths.
- Re-prioritization/fallback options for the decision matrix should be provided.
- Ensure service continuity, possibly through P2P (peer-to-peer)/D2D (device-to-device) with mesh.

ISO 26262 [8] further specifies the different failure modes in the message communication context.

D. Regulatory

Regulatory conformance is usually driven by local bodies and has regional flavors. Primarily, the following need to be considered:

- Regulations for connected cars need to be driven at all levels, from local governments to the Department of Transportation.
- Uniformity in deploying and building smart/connected cities through standards.
- Certification of connected cars should go beyond radio certification and is expected to cover real-world traffic scenarios and multimodal communications.

E. Performance Guarantee

Performance is a critical part of connected vehicles. Since many of the safety-related use cases depend on deterministic latencies, the following aspects are important during conformance validation:

- Reliability and latency aspects should conform to expectations: the connected vehicle should comply with its Quality of Service guarantees.
- Protocols should be resilient enough to handle worst-case scenarios of dynamic network congestion and interference.
- A performance guarantee is essential even during natural disasters in a truly autonomous world.



F. Safety and Security

Security plays another important part in the safe deployment of such applications. Starting at the component level, security needs to be built in as an integral part of the system, ensuring that at no time do privacy and security get compromised.

- Autonomous vehicles are guided by signals from RSUs, cellular networks and other vehicles; all systems in the communication pipeline should be secure from intrusion or other compromising situations.
- All multimodal communications should ensure the integrity of the communication from source to destination.
- Issues related to security should be addressed through credential management, avoiding deliberate and accidental jamming as well as advanced hacking and spoofing.

G. Functional Safety

A functionally safe system is required to operate correctly on all kinds of inputs and to safely manage any errors or failures:

- Systems must function correctly in order to avoid hazardous situations.
- Faults must be detected and controlled.
- Fallbacks to other sensing modalities such as radar and LIDAR should be possible in case of connectivity-medium failure, moving the system into a safe state.

IV. SUMMARY<br />

The current landscape indicates that DSRC is highly likely to<br />
be deployed in the US, and ITS-G5 (based on 802.11p) is<br />
somewhat likely to be deployed in Europe. In the long run,<br />
however, 4G/5G may become the V2X technology of choice. For<br />
V2X applications, 5G will likely start with 4G (LTE V2V), and<br />
LTE V2V will remain the cellular V2X solution for several years<br />
to come. The new 5G radio will augment and complement it over<br />
time and will play a major role as the technology is studied and<br />
rapidly standardized over the coming years. It is hence essential<br />
to take some of the discussed challenges around certification of<br />
such connected vehicles into consideration.<br />



Radar sensors for autonomous driving<br />

From motion measurement to 3-D imaging<br />

Karthik Ramasubramanian, Jasbir Singh, Brian Ginsburg, Dan Wang, Anil Kumar, Chethan Kumar, Sreekiran Samala, Karthik<br />

Subburaj, Shankar Ram, Anjan Prasad, Sandeep Rao, Anil Mani, Snehaprabha Narnakaje<br />

Radar and Analytics Processors, Texas Instruments<br />

Dallas, TX, USA and Bangalore, India<br />

Abstract—In recent years, the automotive industry has been<br />

making rapid strides in various advanced driver assistance<br />

systems (ADAS), with the ultimate goal of enabling fully<br />

autonomous driving. Radar sensors play a key role in this vision,<br />

due to certain inherent benefits compared to other technologies.<br />

This paper provides an overview of the industry trends in this<br />

space and highlights the disruptive change brought about by<br />

the unprecedented level of silicon integration achieved by TI’s CMOS-based<br />
radar, leading to ‘radar-on-a-chip’ sensors. Looking<br />

forward to the future, the industry is moving towards deployment<br />

of advanced ‘imaging radars’ that use multiple cascaded radar<br />

devices to achieve high angular resolution. The paper describes a<br />

4-chip cascaded radar design and demonstrates the imaging<br />

capabilities achieved that will help enable the future of<br />

autonomous driving.<br />

Keywords—radar; autonomous driving; mmwave<br />

I. INTRODUCTION<br />

Every year, a significant number of injuries and deaths in the<br />

United States, as well as all over the world, occur due to<br />

vehicle accidents. As per NHTSA statistics, in the year 2015,<br />

there were 22,144 passenger vehicle occupants who died in<br />

motor vehicle traffic crashes and an estimated 2.18 million<br />

passenger vehicle occupants who were injured. Automotive<br />

radar technology at 77GHz has the ability to significantly reduce<br />

the occurrence of these accidents, especially those involving<br />

frontal collision or blind spots, and this technology has been<br />

deployed in premium vehicles over the past decade [1, 2, 3]. The<br />

applications for radar include Adaptive Cruise Control (ACC),<br />

Autonomous Emergency Braking (AEB), Blind Spot Detection<br />

(BSD), Lane Change Assist (LCA) and Cross Traffic Alert<br />

(CTA).<br />

Radar sensors exhibit certain inherent advantages compared<br />

to other technologies, due to their ability to measure range and<br />

radial velocity precisely, as well as their ability to operate well<br />

regardless of the ambient lighting conditions and in a wide<br />

variety of environmental conditions including fog, dust and<br />

smoke. In order to make automotive radar<br />
systems more widely available, and in order to extend the use of radar<br />

technology to additional safety functions including Parking<br />

Assist and 360-degree surround-view sensing, it is important to<br />

reduce the cost and size, and improve the ease of use, of 77 GHz radar<br />

technology. This would make it possible for multiple sensors to<br />

be mounted in various spots around the vehicle, providing more<br />

advanced safety and comfort functions in a cost-effective<br />

manner, and also enable radar-based safety features to become<br />

standard offerings even in mid-end and entry-level vehicles.<br />

Further, this would promote newer in-cabin and body/chassis<br />

applications that are now emerging, such as radar-based driver<br />

vital signs monitoring, occupant detection (child left behind in<br />

car), gesture recognition, door opener and ground clearance<br />

measurement.<br />

In this paper, we discuss the trend of silicon integration and<br />

highlight the industry’s first CMOS-based radar-on-a-single-chip<br />

solution from Texas Instruments (TI), which makes radar<br />

sensors compact, cost effective and easy to use. A family of<br />

devices [4], namely AWR1243, AWR1443 and AWR1642, has<br />

been launched addressing different applications. This launch<br />

signifies the introduction of CMOS-based highly integrated<br />

77GHz RF devices into the mainstream, with the objective of<br />

accelerating deployment of radar sensors and helping designers<br />

improve safety for drivers and passengers all over the world.<br />

The key advantage offered by CMOS is the ability to<br />

integrate the RF and analog circuits together with all of the<br />

digital processing functions into a single silicon die, thus<br />

reducing the form-factor significantly and making it easy to use.<br />

In Section II, we show the high-level block diagram of the<br />

AWR1642 device and explain the key features that make it an<br />

excellent solution for corner radar applications. We also show<br />

illustrative chirp configuration examples and field test results<br />

demonstrating its functionality.<br />

One of the key challenges for radar technology is the<br />

inherently poor angular resolution. In order to overcome this<br />

limitation, ‘imaging radars’ are being developed in the industry.<br />

In Section III, we discuss this emerging imaging radar<br />

application based on multi-chip cascading. In this context, we<br />

highlight the AWR1243 front-end device that supports<br />

cascading of multiple devices and discuss the complexities<br />

involved in developing a cascaded radar sensor solution. We<br />

showcase a 4-chip cascaded radar design that employs 12<br />

transmitters and 16 receivers to achieve very high angular<br />

resolution and demonstrate its functionality through field results.<br />

II. RADAR-ON-A-CHIP SENSOR<br />

Traditionally, corner radars have been based on 24GHz<br />

technology. However, there is a shift in the industry toward the<br />

use of the 77 GHz frequency band due to emerging regulatory<br />



requirements (upcoming sunset date for 24GHz UWB radar), as<br />

well as the smaller size, larger bandwidth availability and<br />

performance advantages [2, 5].<br />

Historically, radar implementations used discrete<br />

components (PAs, LNAs, VCO, ADCs), but more integrated<br />

solutions are now becoming available. A CMOS-based radar<br />

that integrates all RF and analog functionality, as well as digital<br />

signal processing (DSP) capability into a single chip represents<br />

the ultimate radar system-on-chip solution. Such a highly<br />

integrated device significantly simplifies radar sensor<br />

implementations, enables low power, a compact form factor for<br />

the sensor, and makes the solution cost-effective.<br />

The AWR1642 uses a complex baseband architecture and provides in-phase<br />
(I-channel) and quadrature (Q-channel) outputs. There<br />

are several advantages of complex baseband architecture as<br />

described in [8].<br />

The radio processor sub-system (a.k.a. BIST sub-system)<br />

includes the digital front-end, the ramp generator and an internal<br />

processor for control and configuration of the low-level<br />

RF/analog and ramp generator registers based on well-defined<br />

API messages from the master or DSP sub-system. This radio<br />

processor takes care of RF calibration needs and self-test/monitoring<br />

functions (BIST), which makes the device easy<br />

to use.<br />

The DSP sub-system includes a TI C674x DSP clocked at<br />

600MHz for radar signal processing, typically the processing of<br />

raw ADC data until object detection. This DSP is customer<br />

programmable, giving the user full flexibility to deploy<br />
proprietary algorithms.<br />

The master sub-system includes the Arm® automotive grade<br />

Cortex® R4F processor clocked at 200 MHz, which is customer<br />

programmable. This processor handles the communication<br />

interfaces and typically implements the higher layer algorithms<br />

such as object classification and tracking. This processor may<br />

also be used to run AUTOSAR. The master sub-system supports<br />

secure boot and includes cryptographic accelerators as well.<br />

Fig. 1. CMOS based Radar-on-a-chip sensor<br />

In this context, we highlight the AWR1642 device and its<br />

key features that enable the sensor to perform advanced ADAS<br />

functions.<br />

A. High level of integration and Ease of use<br />

The AWR1642 device offers an unprecedented level of<br />

integration and includes all the RF/Analog components (LNA,<br />

PA, Synthesizer, IF, ADC) for 2 transmitters and 4 receivers, as<br />

well as built-in customer-programmable DSP and MCU<br />

processor units for radar signal processing (Figure 1). In other<br />

words, a single device handles the signals all the way from<br />

77GHz high frequency RF, to the final CAN-FD output through<br />

which the list of detected and tracked objects is sent to the central<br />

ECU of the vehicle.<br />

Figure 2 shows the block diagram of the device. As seen in<br />

the figure, the device comprises four main sub-systems – the<br />

RF/analog sub-system, the radio processor sub-system, the DSP<br />

sub-system and the master sub-system. The RF/analog sub-system<br />

includes the RF and analog circuitry – namely, the<br />

synthesizer, PAs, LNAs, mixers, IF chains and ADCs. It<br />

supports fast (sawtooth) FMCW modulation scheme, which<br />

allows range and velocity of objects to be measured using an<br />

elegant 2D FFT processing procedure [6, 7]. The RF/analog<br />

sub-system also includes the crystal oscillator, temperature<br />

sensors, voltage monitors and a General Purpose ADC.<br />
[Fig. 2 shows the four sub-systems with their main blocks: the RF/analog sub-system (LNAs, PAs, 20 GHz synthesizer, IF chains, ADCs), the TI-programmed radio (BIST) processor sub-system (digital front-end with decimation filter chain, ramp generator), the customer-programmed master sub-system (Cortex-R4F at 200 MHz with CAN-FD, DCAN, SPI/I2C, QSPI and debug interfaces) and the customer-programmed DSP sub-system (C674x at 600 MHz with 768 kB of L3 radar data memory, of which up to 512 kB can be switched to the master R4F if required).]<br />

Fig. 2. Block diagram of AWR1642<br />
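The fast-FMCW 2D FFT procedure referenced above [6, 7] can be illustrated with a short simulation: a range FFT across each chirp followed by a Doppler FFT across chirps. The chirp parameters and target below are invented for illustration and do not correspond to an actual AWR1642 configuration.<br />

```python
import numpy as np

# Minimal sketch of fast-FMCW 2D FFT processing. All parameters are
# illustrative, not an actual device configuration.
c = 3.0e8
fc = 77e9                 # carrier frequency
B, Tc = 1.0e9, 50e-6      # sweep bandwidth and chirp duration (hypothetical)
S = B / Tc                # sweep slope
N, K = 256, 128           # samples per chirp, chirps per frame
fs = N / Tc               # ADC sampling rate

r_true, v_true = 15.0, 5.0            # simulated target: 15 m, 5 m/s
f_beat = 2 * S * r_true / c           # beat frequency from range
f_dopp = 2 * v_true * fc / c          # Doppler frequency from velocity

n = np.arange(N) / fs                 # fast time within a chirp
k = np.arange(K)[:, None] * Tc        # slow time across chirps
signal = np.exp(2j * np.pi * (f_beat * n[None, :] + f_dopp * k))

spectrum = np.fft.fft2(signal)                    # Doppler FFT + range FFT
kk, nn = np.unravel_index(np.abs(spectrum).argmax(), spectrum.shape)

r_est = nn * fs / N * c / (2 * S)                 # range bin  -> meters
v_est = kk / (K * Tc) * c / (2 * fc)              # Doppler bin -> m/s
print(r_est, v_est)                               # peak recovers ~15 m, ~5 m/s
```

The range and Doppler estimates land on the nearest FFT bins, so the recovered velocity is only accurate to within the Doppler bin width; a real pipeline would interpolate around the peak.<br />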


B. Wide RF bandwidth and Multi-mode capability<br />

The range resolution of a radar sensor depends on the RF<br />

bandwidth. If the RF sweep bandwidth used is B, then the<br />

theoretical range resolution ΔR is given by:<br />
ΔR = c / (2·B),<br />
where c denotes the speed of light.<br />
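This relation can be checked numerically; the sketch below (with the speed of light rounded to 3×10⁸ m/s) reproduces the 4 GHz and 200 MHz figures quoted in the text.<br />

```python
# Range resolution of an FMCW radar: delta_R = c / (2 * B).
# Illustrative check of the bandwidth figures discussed in the text.

def range_resolution(bandwidth_hz: float, c: float = 3.0e8) -> float:
    """Theoretical range resolution in meters for sweep bandwidth B."""
    return c / (2.0 * bandwidth_hz)

print(range_resolution(4.0e9))   # 4 GHz sweep (77-81 GHz band) -> 0.0375 m
print(range_resolution(200e6))   # 200 MHz sweep (24 GHz band)  -> 0.75 m
```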

One of the primary advantages of the 77GHz band is the<br />
availability of both the 76-77GHz and the 77-81 GHz bands for<br />

automotive radar applications. The AWR1642 device supports<br />

multi-mode capability, such that the same device can be used in<br />

76-77GHz far-range use-cases, as well as 77-81GHz near-range<br />



use-cases. Also, the device supports up to 4 GHz of RF sweep<br />

bandwidth and can therefore achieve a range resolution of<br />

3.75cm. This is 20 times better resolution than a 24 GHz<br />

narrowband radar sensor that uses 200MHz of sweep bandwidth<br />

(achieving a range resolution of 75 cm).<br />

The range resolution performance is important because it<br />

signifies the ability of the sensor to separate out multiple objects<br />

in range. A sensor with good range resolution can provide a<br />

‘dense point cloud’ of detected objects and has better ability to<br />

distinguish objects such as a person standing near a car. This<br />

improves environmental modeling and object classification,<br />

which are important for developing advanced driver assistance<br />

algorithms and enabling autonomous driving features.<br />

Also, higher range resolution helps the sensor achieve better<br />

minimum distance. For automotive applications like parking<br />

assist, a minimum distance of detection is very important; the<br />

use of 77-81 GHz radar provides a significant advantage in this<br />

aspect in comparison to technologies like ultrasound sensors.<br />

Since the accuracy is also proportional to the range resolution,<br />

the AWR1642 device can achieve high accuracy as well.<br />

The availability of a fully programmable DSP in the AWR1642<br />

device allows users to implement proprietary algorithms and<br />

build innovative solutions to address these difficult challenges.<br />

Specifically, the following are some critical areas where there is<br />

continued research and advancement of algorithms to improve<br />

the performance.<br />

• Interference mitigation algorithms<br />
• Improved detection algorithms<br />
• High-resolution angle estimation algorithms<br />
• Clustering and object classification algorithms<br />

For all of the above needs, the built-in DSP improves the<br />
performance of the sensor by offering high-performance, fully<br />
programmable signal processing capability.<br />

The aforementioned features enable the AWR1642 device<br />

to be used effectively as a radar-on-a-chip sensor, especially for<br />

various corner radar applications. The table below shows an<br />

illustrative example of a multi-mode use-case, where alternating<br />

frames are used with different chirp configurations to achieve<br />

80m and 20m maximum range respectively, with the former at<br />

normal resolution and the latter at high resolution.<br />

Fig. 4. Multi-mode use-case example<br />

Fig. 3. Illustration of high resolution with 77 GHz<br />
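The trade-off behind such alternating frames can be sketched with the standard FMCW relation: the maximum beat frequency the IF chain can digitize limits the maximum range for a given chirp slope. The slope and IF-bandwidth values below are invented for illustration, not taken from the Fig. 4 configurations.<br />

```python
# FMCW maximum range is limited by the IF (beat-frequency) bandwidth:
# f_beat = 2 * S * R / c  =>  R_max = f_if_max * c / (2 * S).
# Slope and IF-bandwidth values are hypothetical.
C = 3.0e8

def max_range_m(slope_hz_per_s: float, f_if_max_hz: float) -> float:
    """Maximum unambiguous range for a chirp slope and IF bandwidth."""
    return f_if_max_hz * C / (2.0 * slope_hz_per_s)

long_range = max_range_m(slope_hz_per_s=10e12, f_if_max_hz=5.33e6)   # gentle slope
short_range = max_range_m(slope_hz_per_s=40e12, f_if_max_hz=5.33e6)  # steep slope
print(round(long_range), round(short_range))  # roughly 80 m and 20 m
```

A steeper slope sweeps more RF bandwidth in the same chirp time, which is why the short-range frame can also offer finer range resolution.<br />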

Figure 5 shows an example field test result with 80m chirp<br />

configuration. The scenario has a car driving away from the<br />

radar and it can be detected in the 2D range-Doppler heatmap as<br />

shown in the figure. This field test was done with a short-range<br />

radar antenna with gain of 10dBi. Based on the choice of<br />

antenna design and chirp configuration, even higher maximum<br />

range can be achieved with the sensor.<br />

C. DSP advantage for advanced algorithms<br />

FMCW radar technology has evolved significantly in the<br />

past several years and continues to evolve. More use-cases are<br />
being added as radar plays a larger role in modern vehicles,<br />

both for driver comfort features and safety features. The<br />

emerging use-cases also make the radar performance<br />

requirements tighter, in terms of spatial resolution, velocity<br />

resolution, object detection and classification.<br />



Fig. 5. SRR field test example with AWR1642<br />

The AWR1642 radar-on-a-chip sensor can be used as a<br />

standalone radar feeding detected and tracked objects to the<br />

ECU via the vehicle CAN bus. The availability of one CAN-FD<br />

and one CAN interface on the AWR1642 allows each sensor to<br />

communicate with the ECU over the vehicle CAN bus, as well as<br />
with other sensors over a private CAN bus. The AWR1642 can<br />

also be used as a satellite radar mounted in the corners feeding<br />

detected objects to a central radar fusion box which combines<br />

the information from the multiple sensors to generate surround<br />

coverage of the vehicle. Thus, the AWR1642 forms an excellent<br />

solution for various corner radar applications.<br />

single master AWR1243 can feed the LO output to multiple<br />

slave devices in order to maintain phase coherence.<br />

The LO synchronization is performed at 20GHz, as shown<br />

in Figure 7, which reduces the board routing loss compared to<br />

synchronizing at 77GHz. The LO signal from the master chip is<br />

sent through one or both of its output buffers, and after<br />

symmetric routing on PCB, all chips, including the master,<br />

receive this LO signal through their input buffer [9].<br />

Fig. 7. Illustration of LO distribution for 2-chip cascading<br />

Figure 8 shows the high level block diagram of a 2-chip<br />

cascaded radar implementation. In addition to the 20GHz LO<br />

signal, a digital SYNC_OUT signal from the master is also fed<br />

to the slave(s) to ensure that the ADC sampling is synchronous<br />

across all the devices.<br />

Fig. 6. Corner radar system topologies<br />

In the next section, we will cover the emerging trend of ‘imaging<br />

radars’ using multi-chip cascading to achieve high angular<br />

resolution.<br />

III. IMAGING RADAR<br />

One of the key challenges of radar technology is the angular<br />

resolution. The angular resolution depends on the number of TX<br />

and RX channels in the radar sensor. Angular resolution is very<br />
important to the future of autonomous driving: when two or more<br />
objects are at the same range and velocity (for example, two static<br />
objects at the same range), they must be separated in the angle<br />
dimension, so good angular resolution is vital in order to clearly<br />
identify objects in dense target situations. This is particularly<br />

important in scenarios such as dense urban driving conditions,<br />

drive-over or drive-under situations with small objects or<br />

overhead signposts/bridges/tunnels, and for curb detection<br />

during parking assist.<br />

In order to achieve high angular resolution, it is possible to<br />

cascade multiple radar devices and operate them in a<br />

synchronized manner to effectively increase the number of<br />

antennas. The TI AWR1243 device is a high-performance radar<br />

front-end that includes 3 transmitters and 4 receivers and sends<br />

out the ADC data via CSI2 to an external DSP/MCU. One of<br />

the key features of AWR1243 is multi-chip cascading, where a<br />

Fig. 8. 2-chip cascaded radar system<br />

A single AWR1243 device can create a virtual antenna array<br />

of 3*4 = 12 antennas. On the other hand, with two AWR1243<br />

devices cascaded, it is possible to create a virtual antenna array of up to<br />

6*8 = 48 antennas. Extending this further, with four AWR1243<br />

devices cascaded, it is possible to create a virtual antenna array<br />

of up to 12*16 = 192 antennas. This allows the 4-chip cascaded<br />

radar to achieve 16 times better angular resolution than the<br />
single-chip radar, and such a cascaded implementation can be<br />
called an ‘imaging’ radar. The antennas can be distributed<br />

between azimuth and elevation dimensions and therefore the<br />

radar can provide good resolution in both these angular<br />

dimensions.<br />
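The virtual-array arithmetic above is easy to verify. The angular-resolution estimate in the sketch below assumes all virtual elements form one uniform linear array with half-wavelength spacing, which is a simplification: the actual cascaded design splits elements between azimuth and elevation.<br />

```python
import math

# MIMO virtual array size: Ntx transmitters x Nrx receivers yield
# Ntx * Nrx virtual antenna elements (3 TX and 4 RX per AWR1243).
def virtual_elements(n_chips: int, tx_per_chip: int = 3, rx_per_chip: int = 4) -> int:
    return (n_chips * tx_per_chip) * (n_chips * rx_per_chip)

# Rule-of-thumb angular resolution of an N-element uniform linear array
# with half-wavelength spacing: theta ~ 2/N radians. Indicative only.
def angular_resolution_deg(n_elements: int) -> float:
    return math.degrees(2.0 / n_elements)

for chips in (1, 2, 4):
    n = virtual_elements(chips)
    print(chips, n, round(angular_resolution_deg(n), 2))
```

The 4-chip case yields 192 virtual elements, 16 times the single-chip count, matching the ratio quoted in the text.<br />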

Another important consideration with cascaded radar sensors is<br />

the ability to perform TX beamforming and beam-steering. This<br />

allows the transmitters to coherently transmit the RF signal in<br />



order to form a narrow beam and achieve farther maximum<br />

range. Also, by using phase shifters on each of the transmitters,<br />

the beam can be steered in any direction of interest. This<br />

capability allows the radar to scan towards the left or to the right<br />

depending on the situation at hand. The AWR1243 device<br />

includes a linear phase shifter with 6 degree step size that can be<br />

configured to achieve TX beam steering as needed.<br />
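The effect of the 6-degree phase-shifter step can be illustrated with a small steering-vector sketch. The half-wavelength element spacing and the 9-element linear layout assumed here are illustrative, not the actual antenna geometry.<br />

```python
import math

# Sketch of TX beam steering with a quantized phase shifter. The 6-degree
# step size is from the text; spacing and layout are assumptions.
STEP_DEG = 6.0

def steering_phases_deg(n_tx, steer_angle_deg, d_over_lambda=0.5):
    """Ideal per-element phases for a linear TX array, plus the same
    phases quantized to the shifter grid."""
    ideal = [
        (-360.0 * d_over_lambda * n * math.sin(math.radians(steer_angle_deg))) % 360.0
        for n in range(n_tx)
    ]
    quantized = [(STEP_DEG * round(p / STEP_DEG)) % 360.0 for p in ideal]
    return ideal, quantized

ideal, quant = steering_phases_deg(n_tx=9, steer_angle_deg=20.0)
# Circular phase error introduced by quantization, bounded by half a step.
worst_err = max(min(abs(i - q), 360.0 - abs(i - q)) for i, q in zip(ideal, quant))
print(quant, round(worst_err, 2))
```

With a 6-degree grid the per-element phase error stays within 3 degrees, which is why a relatively coarse shifter still steers the beam accurately.<br />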

Figure 9 shows a 4-chip cascaded radar implementation<br />

using AWR1243. This implementation supports TX<br />

beamforming using 9 transmit antennas to achieve high range<br />

beyond 250m. Further, using the transmit and receive channels<br />

in a MIMO radar configuration, it is possible to achieve an azimuth<br />

angular resolution of ~1.5 degrees.<br />

Fig. 10. Field test result with cascaded radar.<br />

These results demonstrate the significant improvement that<br />

is achievable with multi-chip cascading and showcase the<br />

significance of imaging radars for the future of autonomous<br />

driving.<br />

Fig. 9. 4-chip cascaded imaging radar<br />

Figure 10 shows a sample field test result from the 4-chip<br />

cascaded radar implementation. It can be noted that the three<br />

pedestrians can be clearly separated with the cascaded radar,<br />

whereas with a single chip radar, the angular resolution is not<br />

good enough to achieve clear separation. Also, the cascaded<br />

radar produces a ‘3D point cloud’ that includes elevation<br />

measurement as well, in addition to azimuth. A complete video<br />

of these field results is available in [10].<br />

IV. SUMMARY<br />

The automotive radar industry is rapidly evolving, in terms<br />

of emerging new applications, and in pursuit of the ultimate<br />

vision of autonomous driving. We presented two major<br />

advancements that may have significant impact on the future of<br />

automotive radar – a radar-on-a-chip sensor that represents an<br />

unprecedented level of integration enabling small form-factor,<br />

low-power and ease of use, and an imaging radar<br />

implementation that demonstrates high angular resolution and<br />

dense 3D point cloud capability using four cascaded radar chips<br />

for future advanced radar applications. TI’s portfolio of radar<br />

devices, spanning single chip radar to high-performance<br />

cascadable front-end solutions, thus enables developers to build<br />

a variety of radar sensor implementations from motion<br />

measurement to 3D imaging.<br />

ACKNOWLEDGMENT<br />

The authors would like to thank the team members of the TI<br />

radar team for their contributions to the development of the<br />

devices highlighted in this paper.<br />

REFERENCES<br />

[1] M. Schneider, ‘‘Automotive Radar Status and Trends’’, German<br />

Microwave Conference (GeMiC), pp. 144-147, Ulm, Germany, April<br />

2005.<br />



[2] Karl M. Strohm, Hans-Ludwig Bloecher, Robert Schneider, Josef<br />

Wenger, “Development of Future Short Range Radar Technology”,<br />

EURAD 2005.<br />

[3] Jurgen Hasch, “Driving towards 2020: Automotive Radar Technology<br />

Trends”, IEEE MTT-S International Conference on Microwaves for<br />

Intelligent Mobility, 2015.<br />

[4] http://www.ti.com/sensing-products/mmwave/awr/overview.html<br />

[5] Karthik Ramasubramanian, Kishore Ramaiah, Artem Aginskiy, “Moving<br />

from legacy 24 GHz radar to state-of-the-art 77 GHz radar”, Whitepaper<br />

available at http://www.ti.com/lit/wp/spry312/spry312.pdf<br />

[6] Donald E. Barrick, “FM/CW Radar Signals and Digital Processing”,<br />

NOAA Technical Report ERL 283-WPL 26, July 1973.<br />

[7] A. G. Stove, ‘‘Linear FMCW radar techniques’’, IEE Proceedings F,<br />

Radar and Signal Processing, vol. 139, pp. 343--350, October 1992.<br />

[8] Karthik Ramasubramanian, “Using a complex baseband architecture in<br />

FMCW radar systems”, Whitepaper available at<br />

http://www.ti.com/lit/wp/spyy007/spyy007.pdf<br />

[9] B. P. Ginsburg, et al., “A Multimode 76-to-81GHz Automotive Radar<br />

Transceiver with Autonomous Monitoring,” Accepted for publication,<br />

International Solid-State Circuits Conference (ISSCC), 2018.<br />

[10] Dan Wang, “Imaging radar using multiple single-chip FMCW<br />

transceivers”, YouTube video, 2018.<br />



Automotive Synthetic Aperture Radar<br />

Florian Fembacher<br />

Infineon Technologies AG<br />

Neubiberg, Germany<br />

Email: florian.fembacher@infineon.com<br />

Gabor Balazs<br />

Infineon Technologies AG<br />

Neubiberg, Germany<br />

Email: gabor.balazs@infineon.com<br />

Abstract—This work presents an analysis of automotive frequency<br />

modulated continuous wave synthetic aperture radar<br />

(FMCW SAR) using a 77 GHz radar, with the focus on computational<br />

and memory requirements. Such an automotive SAR can<br />

be used for imaging purposes reusing already existing automotive<br />

radar systems and might be especially useful as an information<br />

source for future autonomous driving. The presented analysis relies<br />

on a range-Doppler and wavenumber domain algorithm, which<br />

are two of the most commonly used techniques in SAR applications.<br />

Based on given constraints of an automotive embedded system,<br />

different processing frameworks for each discussed algorithm will<br />

be proposed. The results for all proposals are tested in a real<br />

application and presented in this work.<br />

Keywords—SAR, FMCW, embedded system<br />

I. INTRODUCTION<br />

Today synthetic aperture radar (SAR) is a well-established<br />
signal processing technique to create high-resolution images<br />
without the need for large antennas. It was originally<br />

invented by Carl Wiley in the early 1950s with the purpose of<br />

being used for surveillance systems. Since radar operates<br />
independently of lighting conditions and can penetrate<br />
clouds and fog, it can be used regardless of time of day and<br />
weather.<br />

So far SAR is still mainly used for aeronautical but not<br />

for automotive applications, although radar systems are already<br />

commonly used for driver assistance systems like blind spot detection,<br />

adaptive cruise control (ACC) or collision avoidance<br />

systems. While automotive radars achieve high resolution in<br />

range, due to small aperture sizes only poor angular resolution<br />

is achieved. Therefore their use is limited to applications which<br />

do not require high angular resolution.<br />

There is a high demand for automated park assist systems,<br />
which require high-resolution sensors in range and azimuth.<br />

A Bitkom survey conducted in 2017 [1] showed 69% of the<br />

interviewed drivers would be willing to hand over full control<br />

over the vehicle for automated parking. In 2016, 16% of new<br />
cars were already equipped with a park assist system and 64%<br />
with a park distance control [2].<br />

A high resolution imaging radar, which captures the road<br />

side, would be an essential tool for fully automated parking<br />

without the need for human interaction.<br />

A. Related Work<br />

Several approaches to implement an automotive SAR system<br />

can be found in literature.<br />

In [3] Wu and Zwick design a wavenumber domain SAR<br />

imaging system for parking lot scenarios and evaluate the<br />

influence of motion errors in their system using a simulation.<br />

As a result, they suggest compensating for motion only in azimuth<br />
by controlling the chirp repetition frequency (CRF).<br />

A further improved motion compensation algorithm is presented<br />

by Wu et al. in [4] with experimental results that were<br />

achieved in a real environment.<br />

In [5] the authors use different implementations of a range-Doppler<br />
domain algorithm for radar imaging and test their<br />

system with an automotive 77 GHz radar, which is moved on a<br />

rail for parking lot detection to demonstrate the effect of range<br />

migration and compare the results of different algorithms.<br />

Imaging results of measurements that were performed in a<br />

real automotive scenario using a 77 GHz radar system moved<br />

on a rail are shown in [6]. The authors compare a range-Doppler<br />

domain and line processing algorithm, which both can be used<br />

to compute high resolution radar images.<br />

In [7] a computationally efficient radar imaging algorithm<br />

for automotive applications is implemented, which uses some<br />

approximations in the signal model. The basic idea of the<br />

algorithm is to map multiple range-Doppler images onto a global<br />
Cartesian coordinate system.<br />

To test if series radar sensors are applicable for synthetic<br />

aperture radar processing, the authors in [8] use a time-domain<br />

backprojection algorithm. Processing in the Fourier domain<br />
is not possible for series sensors because of limitations on the<br />
possible radar configurations. Using a 24 GHz sensor, the authors are<br />
able to increase the azimuth resolution to a few centimeters.<br />

In the previously mentioned work it is shown that high-<br />
resolution radar images can be computed using different SAR<br />
processing algorithms. However, the main focus lies on image<br />
quality rather than on the implementation on an automotive embedded<br />
system.<br />

In the Fourier domain SAR algorithms can be computed<br />

more efficiently than in the time domain, which makes them preferable<br />

for applications in embedded systems. The range-Doppler<br />

[9] and ω-k [9] algorithms, as well as their variations, are among<br />
the most commonly used algorithms for SAR processing.<br />

Both algorithms are evaluated in this paper with respect to<br />

their imaging quality, memory requirements and performance<br />

on an embedded system to conclude if they are suitable for an<br />

automotive use case.<br />

www.embedded-world.eu<br />

340


Fig. 1. A parking lot detection scenario for a vehicle with a broadside corner<br />

radar.<br />

II. SIGNAL PROCESSING<br />

A. System Model<br />

Fig. 1 illustrates a parking lot scenario in which a broadside<br />

radar with a beamwidth of θ is mounted on a vehicle that moves<br />

with a constant velocity v. For simplicity of the following<br />

analysis, it will be assumed that the vehicle is moving on a<br />

straight line. This is usually not true for a real scenario where<br />

motion compensation has to be considered.<br />

The radar transmits frequency modulated continuous wave<br />

(FMCW) signals at a given pulse repetition frequency (PRF)<br />

and receives the echoed signals from targets. Multiple echoes<br />

over time are collected and stored in a memory buffer. One<br />

target will be measured multiple times during the movement of<br />

the vehicle and appear at different slant ranges in the recorded<br />

two dimensional signal. In the azimuth dimension the Doppler<br />

frequency shift is sampled, which is caused by the varying slant<br />

range of a target. Therefore the PRF depends directly on the<br />

vehicle's velocity and the maximum range of interest. The range<br />

profile of one target can be modeled by a hyperbolic equation<br />

r(t_a) = √(r_0² + v² t_a²)   (1)<br />

where r_0 is the range at closest approach and t_a refers to the<br />

slow time in azimuth.<br />
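The hyperbolic range history of Eq. (1) can be sketched numerically; the range at closest approach and the velocity below are illustrative values, not taken from the paper's setup.<br />

```python
import numpy as np

# Hyperbolic range history of Eq. (1): r(t_a) = sqrt(r_0^2 + v^2 * t_a^2).
# r_0 and v are illustrative values, not the paper's parameters.
r_0 = 5.0                          # range at closest approach [m]
v = 1.0                            # platform velocity [m/s]
t_a = np.linspace(-2.0, 2.0, 5)    # slow time in azimuth [s]

r = np.sqrt(r_0**2 + v**2 * t_a**2)
print(r)   # minimum r_0 at t_a = 0, symmetric hyperbola
```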

It is an essential step in SAR signal processing to correct this<br />

so-called range cell migration (RCM). The way the RCM<br />

correction (RCMC) is realized is the key difference<br />

between range-Doppler and wavenumber domain algorithms.<br />

B. Signal Model<br />

Automotive FMCW radar systems are operated in the 24<br />

GHz, 77 GHz or 79 GHz frequency bands. The<br />

possible resolution in range<br />

δr = c / (2B)   (2)<br />

is limited by the radar’s bandwidth B.<br />
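For the 1.0 GHz bandwidth used later in the measurement setup (Table II), Eq. (2) gives a range resolution of 0.15 m:<br />

```python
# Range resolution of Eq. (2) for the 1.0 GHz RF bandwidth of Table II.
c = 299_792_458.0   # speed of light [m/s]
B = 1.0e9           # RF bandwidth [Hz]

delta_r = c / (2 * B)
print(f"{delta_r:.3f} m")   # ~0.15 m range resolution
```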

Fig. 2. Block diagram showing the processing steps of the range-Doppler<br />

algorithm in a) and the ω-k algorithm in b). [a): IF signal → range compression → azimuth FFT → RCMC → azimuth compression → azimuth IFFT → radar image; b): IF signal → range compression → azimuth FFT → reference function multiply → Stolt mapping → azimuth IFFT → radar image.]<br />

The FMCW radar transmits a chirp signal, which can be<br />

mathematically described by<br />

s_t(t) = exp(j2π(f_c t + ½αt²))   (3)<br />

where f c is the radar’s carrier frequency, t the fast time<br />

variable within the PRI, and α the frequency sweep rate<br />

B/PRI.<br />

The received signal is mixed with the transmitted signal,<br />

which results in the intermediate frequency<br />

s_if(t) = exp(j2π(f_c τ + αtτ − ½ατ²))   (4)<br />

with the round trip delay time τ.<br />

The range r of a detected target is directly proportional to<br />

the beat frequency<br />

f_b = ατ = (2α/c) r.   (5)<br />

This signal model will be used in the following signal<br />

processing.<br />
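The beat-frequency relation of Eqs. (4) and (5) can be sketched with a single simulated target. The ramp time and samples per ramp follow Table II; the sampling rate and target range are illustrative assumptions.<br />

```python
import numpy as np

# Beat-frequency range estimation following Eqs. (4)-(5): the IF signal of
# one target oscillates at f_b = 2*alpha*r/c. Range recovery works by
# locating the FFT peak (range compression).
c = 3e8
B = 1.0e9                 # sweep bandwidth [Hz]
T = 51.2e-6               # ramp-up time [s] (Table II)
alpha = B / T             # frequency sweep rate
n = 256                   # samples per ramp (Table II)
fs = n / T                # IF sampling rate [Hz] (illustrative choice)
r_true = 15.0             # target range [m] (illustrative)

t = np.arange(n) / fs
tau = 2 * r_true / c                               # round-trip delay
s_if = np.exp(1j * 2 * np.pi * alpha * tau * t)    # beat term of Eq. (4)

# Range compression: FFT peak -> beat frequency -> range via Eq. (5)
spec = np.abs(np.fft.fft(s_if))
f_b = np.argmax(spec[: n // 2]) * fs / n
r_est = f_b * c / (2 * alpha)
print(r_est)
```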



C. Range Doppler Algorithm<br />

The processing of SAR signals in the Fourier domain consists<br />

of three major tasks. First, the SAR signal has to be compressed<br />

in range, which results in a hyperbolic range profile for each<br />

observed target. To focus the range profiles the RCM has to<br />

be corrected. Afterwards the SAR signal can be compressed in<br />

azimuth.<br />

Fig. 2a depicts the individual steps of the range-Doppler algorithm.<br />

The range compression can be achieved by simply<br />

applying a fast Fourier transform (FFT) in range. Due to the<br />

properties of the IF signal (cf. Eq. 4) the range for each target<br />

is directly proportional to the occurring frequencies in the IF<br />

signal. In principle the RCMC can be done in the frequency or<br />

time domain. Correcting the range migration in the frequency<br />

domain is more efficient since trajectories of targets at the same<br />

slant range measured at different azimuth times will collapse in<br />

the frequency domain and can therefore be corrected together<br />

in one step.<br />

To correct the RCM, the signal has to be shifted for every<br />

Doppler frequency f_D by<br />

r_rcm = r_0 (1/√(1 − (λf_D/(2v))²) − 1).   (6)<br />

Finally the range profiles can be compressed in azimuth<br />

by applying the matched filter<br />

h_a(f_a, r) = exp(−jπ (v²/(λr)) f_a²)   (7)<br />

where f_a describes the frequencies in azimuth.<br />

After applying an inverse Fourier transform (IFFT) in the<br />

azimuth dimension the focused radar image is received as an<br />

output.<br />
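The processing chain of Fig. 2a can be sketched as follows. This is a schematic implementation with illustrative parameters (λ roughly matching 77 GHz, v = 10 m/s): the RCMC uses a simple nearest-neighbor shift per Doppler bin rather than the paper's interpolation kernels, and the azimuth compression uses the standard FM-rate form of the matched filter.<br />

```python
import numpy as np

# Schematic range-Doppler algorithm pipeline (Fig. 2a). Nearest-neighbor
# RCMC and illustrative parameters; not the paper's implementation.
def rda(raw_if, lam, v, dr, prf):
    n_rg, n_az = raw_if.shape
    # 1) Range compression: FFT over fast time (Eq. 4 makes range ~ frequency)
    s = np.fft.fft(raw_if, axis=0)
    # 2) Azimuth FFT into the range-Doppler domain
    s = np.fft.fft(s, axis=1)
    f_d = np.fft.fftfreq(n_az, d=1.0 / prf)   # Doppler frequencies [Hz]
    r0 = np.arange(n_rg) * dr                 # slant range per range bin [m]
    # 3) RCMC: shift each Doppler bin by r_rcm of Eq. (6)
    out = np.empty_like(s)
    for j, fd in enumerate(f_d):
        r_rcm = r0 * (1.0 / np.sqrt(1.0 - (lam * fd / (2 * v)) ** 2) - 1.0)
        shift = np.round(r_rcm / dr).astype(int)
        out[:, j] = s[np.clip(np.arange(n_rg) + shift, 0, n_rg - 1), j]
    # 4) Azimuth compression with the azimuth FM rate K_a = 2v^2/(lam*r)
    #    (standard matched-filter form, cf. Eq. 7)
    ka = 2.0 * v ** 2 / (lam * np.maximum(r0[:, None], dr))
    out *= np.exp(-1j * np.pi * f_d[None, :] ** 2 / ka)
    # 5) Azimuth IFFT yields the focused image
    return np.fft.ifft(out, axis=1)

img = rda(np.random.randn(64, 128), lam=0.0039, v=10.0, dr=0.15, prf=1600.0)
print(img.shape)
```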

D. Wavenumber Algorithm<br />

In this part the ω-k algorithm (cf. Fig. 2b), which operates<br />

in the wavenumber domain, is briefly explained. The data is<br />

completely processed in the two dimensional frequency domain<br />

where the range dependence of the range-azimuth coupling<br />

can be corrected. It is especially superior to the range-Doppler<br />

algorithm in case of wide azimuth apertures that are typical for<br />

automotive applications.<br />

The SAR data is transformed into the two-dimensional<br />

wavenumber domain by applying an FFT in azimuth and range.<br />

In the frequency domain the data is already compressed in range<br />

and therefore only a correction of the RCM and a compression<br />

in azimuth is necessary.<br />

Partial focusing is achieved by multiplying the data with a<br />

two dimensional reference function<br />

h_a2D(ω, ω_D) = exp(j(√(4ω²/c² − ω_D²/v²) r_ref − 2πf_a τ)).   (8)<br />

If the RCM is small the two dimensional function can be<br />

approximated by a one dimensional function that corresponds to<br />

TABLE I<br />

COMPARISON OF PSLR AND SNR OF THE ALGORITHMS’ PTR<br />

Algorithm PSLR [dB] SNR [dB]<br />

Range-Doppler 9.9 35<br />

Range-Doppler 4-tap 9.7 35<br />

Range-Doppler no RCMC 10.6 73<br />

ω-k 1D 11.0 73<br />

ω-k 2D 10.8 85<br />

the compression function used in the range-Doppler algorithm<br />

(cf. Eq. 7) at a reference range r ref .<br />

To focus the data at the other ranges a Stolt interpolation<br />

as described in [10] is applied. After an IFFT in azimuth<br />

dimension a focused image is received.<br />
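The Stolt interpolation step can be sketched as a per-column resampling of the 2D spectrum under the substitution k_y = √(4k² − k_x²); the wavenumber grids and input data below are illustrative, and a simple linear interpolation stands in for the interpolator of [10].<br />

```python
import numpy as np

# Schematic Stolt mapping of the ω-k algorithm (Fig. 2b): each Doppler
# column is resampled from the warped grid k_y = sqrt(4k^2 - kx^2) onto
# the uniform output grid 2k. Grids and data are illustrative.
def stolt_map(spec, k, k_x):
    out = np.empty_like(spec)
    for j, kx in enumerate(k_x):
        k_y = np.sqrt(4.0 * k ** 2 - kx ** 2)   # warped range wavenumber
        col = spec[:, j]
        # np.interp works on real data, so interpolate real and imaginary
        # parts separately
        out[:, j] = (np.interp(2.0 * k, k_y, col.real)
                     + 1j * np.interp(2.0 * k, k_y, col.imag))
    return out

k = np.linspace(1600.0, 1700.0, 64)    # range wavenumbers [rad/m], ~77 GHz
k_x = np.linspace(-40.0, 40.0, 32)     # azimuth (Doppler) wavenumbers [rad/m]
spec = np.random.randn(64, 32) + 1j * np.random.randn(64, 32)
out = stolt_map(spec, k, k_x)
print(out.shape)
```

For the zero-Doppler column (k_x = 0) the mapping is the identity, since k_y = 2k there; the correction grows with |k_x|, which is why wide azimuth apertures need the full Stolt step.<br />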

III. SIMULATION RESULTS<br />

The two SAR processing algorithms are evaluated by comparing<br />

their point target response. For this evaluation a MATLAB<br />

simulation is used in which an image of a reference scene<br />

of size 10 m in range and 10 m in azimuth is computed. The<br />

point target was simulated in the middle of the scene. The radar<br />

velocity was set at 1 m/s and the PRF at 1.6 kHz, which results<br />

in 400 range samples and 16000 azimuth samples.<br />
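The quoted azimuth dimension is consistent with the simulation parameters: a 10 m azimuth scene traversed at 1 m/s takes 10 s, which at a PRF of 1.6 kHz gives 16000 azimuth samples.<br />

```python
# Consistency check of the simulated azimuth dimension quoted above.
scene_az = 10.0      # azimuth extent of the scene [m]
v = 1.0              # radar velocity [m/s]
prf = 1600.0         # pulse repetition frequency [Hz]

n_azimuth = int(scene_az / v * prf)
print(n_azimuth)     # 16000
```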

The quality of the computed images is evaluated using the<br />

peak side lobe ratio (PSLR) and signal to noise ratio (SNR).<br />

The PSLR is simply based on the ratio between the intensity<br />

of the main lobe level I_mainlobe and the intensity of the largest<br />

side lobe level I_sidelobe:<br />

PSLR = 10 log₁₀(I_sidelobe / I_mainlobe).   (9)<br />

This measure gives information about how well a SAR signal<br />

processing algorithm is able to identify weak targets.<br />

The quality of the signal is additionally evaluated using the SNR,<br />

which is the ratio of the power of the signal and the noise:<br />

SNR = 10 log₁₀(P_signal / P_noise).   (10)<br />
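Eq. (9) can be illustrated on a synthetic point target response; an unweighted sinc profile is assumed here for illustration, whose first side lobe sits at the well-known −13.26 dB. Note that with this sign convention the PSLR comes out negative whenever the side lobe is weaker than the main lobe.<br />

```python
import numpy as np

# PSLR of Eq. (9) on a synthetic point target response: an unweighted
# sinc intensity profile (illustrative, not the paper's simulated PTR).
x = np.linspace(-8.0, 8.0, 4001)
ptr = np.abs(np.sinc(x)) ** 2            # intensity profile

# Separate the main lobe (between the first nulls at x = ±1) from side lobes
main = np.abs(x) < 1.0
pslr = 10 * np.log10(ptr[~main].max() / ptr[main].max())
print(round(pslr, 2))   # ~ -13.26 dB for an unweighted sinc
```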

The results of the simulation are shown in Table I. In<br />

Fig. 3 an overview of the resulting point target responses is<br />

given. The point target responses are shown in their range<br />

and azimuth profile. The range-Doppler algorithm is capable<br />

of correctly focusing the point target using an approximated<br />

interpolation kernel. Using a full interpolation kernel does not<br />

improve the result though. For performance reasons a 4-tap<br />

kernel should be sufficient. Without the RCMC the point target<br />

is not correctly focused. The ω-k algorithm performs best<br />

with the two-dimensional reference function. Because of the<br />

large curvature of the range profile, the compression with a<br />

one-dimensional reference function results in a less focused<br />

response.<br />

A. Computational Performance<br />

To evaluate the computational requirements the SAR imaging<br />

algorithms were implemented on a NVIDIA Jetson TK1 board<br />

(NVIDIA 4-Plus-1 2.32 GHz Quad-Core ARM Cortex-A15,<br />



[Point target response plots for each algorithm: amplitude over range sample and azimuth sample.]<br />

Fig. 3. Simulation results for the point target response of each implemented SAR algorithm. Subfig. a) shows the result for the range-Doppler algorithm (RDA)<br />

without RCMC. Compared to the RDA implementation with RCMC (Subfig. b) full kernel size and Subfig. c) 4-tap kernel size) the point target response is less<br />

focused in range. As visible in Subfig. d), the ω-k algorithm with a one-dimensional reference function shows poor focusing in range since the one-dimensional<br />

approximation is in general only valid for small RCM. Therefore the ω-k algorithm with a two-dimensional reference function in Subfig. e) gives the best<br />

response.<br />

2 GB RAM). Automotive embedded systems usually only<br />

offer low computational performance. Therefore it is important<br />

to find a good trade-off between the imaging quality and<br />

computational requirements of a radar imaging algorithm. To<br />

compare the previously discussed algorithms regarding their<br />

computational performance the clock cycles for the generation<br />

of a simulated SAR signal were measured. For the simulation<br />

of the SAR signal a reference scene with a range of 10 m and<br />

azimuth size of 20 m was generated. The PRF was set at 1.6<br />

kHz and the azimuth velocity at 2 m/s.<br />

In Fig. 4 the results for the range-Doppler with a full size<br />

interpolation kernel, a 4-tap interpolation kernel and without<br />

RCMC are shown, as well as the ω-k algorithm using a one-dimensional<br />

and two-dimensional reference function for compression.<br />

Using a full size interpolation for the RCMC in case<br />

of the range-Doppler algorithm results in a total of 8216 million<br />

clock cycles, which corresponds to more than 3.5 s runtime on<br />

the microcontroller. The runtime can be improved by a large<br />

amount using an approximated interpolation kernel. In this case<br />

the algorithm runs almost three times faster. Leaving out the<br />

RCM correction gives an even better runtime performance<br />

at the cost of a less focused radar image. Depending on<br />

the application RCMC might not be necessary and therefore<br />

the range-Doppler algorithm without the RCMC might be a<br />

good option. If focusing is important, using a 4-tap interpolation<br />

kernel should deliver sufficient results. The performance of the<br />

ω-k algorithm with a one-dimensional and two-dimensional<br />

reference function is much better than the performance of the<br />

range-Doppler algorithm with RCMC. Considering the results<br />

presented in section III in which the ω-k algorithm showed<br />

much better results it seems to be the preferable option.<br />

B. Memory<br />

In general, embedded systems offer only limited<br />

memory. Depending on the sampling frequency in range and<br />



[Bar chart: CPU cycles [Mio] per algorithm, broken down into 2D FFT, CFAR, RCMC and total, for RDA, RDA 4-tap, RDA w/o RCMC, ω-k 1D and ω-k 2D.]<br />

Fig. 4. Computational load for the range-Doppler algorithm (RDA) with different interpolation kernel sizes and the ω-k algorithm with a 1D and 2D reference<br />

function.<br />

azimuth several MB of memory might be needed. For efficient<br />

processing, a layout with three memory buffers can be used.<br />

The first memory buffer is used to store the IF signal<br />

of one radar measurement. The IF signal can immediately<br />

be transformed in the Fourier domain applying a FFT. To<br />

compress the signal, all negative frequencies as well as the<br />

positive frequencies above the maximum considered frequency<br />

(depending on range) should be discarded. An intermediate<br />

buffer is filled up with the results of the range buffer, until<br />

enough measurements are stored for azimuth processing. In the<br />

azimuth buffer all azimuth computations are performed and the<br />

results are stored back in the intermediate buffer. After azimuth<br />

processing the resulting radar image can be read out of the<br />

intermediate buffer by third-party applications.<br />
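The three-buffer layout described above can be sketched as follows; the buffer sizes and the azimuth operation are illustrative assumptions, not the paper's dimensions.<br />

```python
import numpy as np

# Sketch of the three-buffer layout: a range buffer holds one IF
# measurement, an intermediate buffer accumulates range-compressed lines,
# and the azimuth processing fills an azimuth buffer. Sizes illustrative.
N_SAMPLES = 256          # samples per ramp
N_RANGE = 64             # positive-frequency bins kept after range FFT
N_AZIMUTH = 1024         # measurements needed for azimuth processing

range_buf = np.zeros(N_SAMPLES, dtype=np.float32)
inter_buf = np.zeros((N_RANGE, N_AZIMUTH), dtype=np.complex64)

def on_measurement(if_signal, col):
    """Range-compress one IF measurement and store it."""
    range_buf[:] = if_signal                 # range buffer: raw IF samples
    spectrum = np.fft.fft(range_buf)
    # Discard negative frequencies and bins beyond the maximum range
    inter_buf[:, col] = spectrum[:N_RANGE]

for col in range(N_AZIMUTH):
    on_measurement(np.random.randn(N_SAMPLES).astype(np.float32), col)

# Once the intermediate buffer is full, azimuth processing runs row-wise;
# the result would be written back for readout by other applications.
azimuth_buf = np.fft.fft(inter_buf, axis=1)   # e.g. azimuth FFT
print(inter_buf.shape, azimuth_buf.shape)
```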

IV. MEASUREMENT RESULTS<br />

In this section results from real measurements, which were<br />

taken on an outside car park, are shown. For the measurements<br />

a radar with the configuration shown in Table II was used. The<br />

radar was mounted at 90 ◦ on a movable trolley, which was<br />

pushed by hand at an approximate velocity of 2 m/s. On the<br />

radar, 8 receive antennas are used simultaneously to receive the<br />

echoed radar signal. The 8 channels were summed up before<br />

processing the signal to improve the SNR. An overview of the<br />

parking lot is given in Fig. 5.<br />

The resulting images are shown in Fig. 6. The results match<br />

the expectations from section III. The best result was achieved<br />

using the ω-k algorithm with a two-dimensional reference<br />

function. The radar images computed by the range-Doppler<br />

algorithm are better focused applying the RCMC.<br />

TABLE II<br />

PARAMETERS OF THE TEST SETUP<br />

Parameter                        Value<br />

RF carrier frequency f_c         76.0 GHz<br />

RF bandwidth                     1.0 GHz<br />

Ramp-up time t_up                51.2 µs<br />

Ramp-down time t_down            10.0 µs<br />

Samples per ramp                 256<br />

PRF                              1.5 kHz<br />

Window function at range-FFT     Hann<br />

−3 dB azimuth beam width         120° (±60°)<br />

Fig. 5. Overview of the recorded parking lot scene<br />

V. CONCLUSION<br />

A comparison of the range-Doppler and ω-k algorithm<br />

was given in this work. Both algorithms were analyzed for<br />



an automotive parking lot detection use case. To see how<br />

accurate both algorithms are, the point target response for<br />

different implementations was evaluated. As expected from the<br />

properties of each algorithm, the ω-k algorithm computes better<br />

focused radar images compared to the range-Doppler algorithm<br />

in case of a large radar beamwidth. Since the focus was on the<br />

specific requirements of the implementation on an automotive<br />

embedded system, different implementations of both algorithm<br />

were presented. The performance of each implementation was<br />

evaluated on an NVIDIA Jetson TK1 microcontroller. Additionally<br />

a suggestion for a memory buffer layout was given to allow<br />

efficient computation on an embedded system. The functional<br />

correctness of the presented implementations was tested in a<br />

real environment using a 77 GHz radar and the computed<br />

radar images were presented.<br />

An important aspect in automotive radar imaging that was<br />

not included in this work is handling motion errors. Implementations<br />

of motion compensation algorithms and their memory<br />

and computational requirements have to be considered for a<br />

complete automotive SAR application.<br />

REFERENCES<br />

[1] Deutsche Automobil Treuhand (DAT), “DAT Report 2016,” 2016.<br />

[2] “Mehrheit der Autofahrer würde dem Autopiloten das<br />

Steuer übergeben,” Bitkom, Feb 2017. [Online]. Available:<br />

https://www.bitkom.org/Presse/Presseinformation/Mehrheit-der-<br />

Autofahrer-wuerde-dem-Autopiloten-das-Steuer-uebergeben.html<br />

[3] H. Wu and T. Zwick, “Automotive SAR for parking lot detection,” in<br />

Microwave Conference, 2009 German. IEEE, 2009, pp. 1–8.<br />

[4] H. Wu, L. Zwirello, X. Li, L. Reichardt, and T. Zwick, “Motion<br />

compensation with one-axis gyroscope and two-axis accelerometer for<br />

automotive SAR,” in Microwave Conference (GeMIC), 2011 German.<br />

IEEE, 2011, pp. 1–4.<br />

[5] J. Mure-Dubois, F. Vincent, and D. Bonacci, “Sonar and radar SAR<br />

processing for parking lot detection,” in Radar Symposium (IRS), 2011<br />

Proceedings International. IEEE, 2011, pp. 471–476.<br />

[6] H. Iqbal, M. B. Sajjad, M. Mueller, and C. Waldschmidt, “SAR imaging<br />

in an automotive scenario,” in Microwave Symposium (MMS), 2015 IEEE<br />

15th Mediterranean. IEEE, 2015, pp. 1–4.<br />

[7] R. Feger, A. Haderer, and A. Stelzer, “Experimental verification of a<br />

77-GHz synthetic aperture radar system for automotive applications,”<br />

in Microwaves for Intelligent Mobility (ICMIM), 2017 IEEE MTT-S<br />

International Conference on. IEEE, 2017, pp. 111–114.<br />

[8] F. Harrer, F. Pfeiffer, A. Löffler, T. Gisder, and E. Biebl, “Automotive<br />

synthetic aperture radar system based on 24 GHz series sensors,” in<br />

Advanced Microsystems for Automotive Applications 2017. Springer,<br />

2018, pp. 23–36.<br />

[9] I. G. Cumming and F. H. Wong, Digital Processing of Synthetic Aperture<br />

Radar Data. Artech House, 2005.<br />

[10] B.-C. Wang, Digital signal processing techniques and applications in<br />

radar image processing. John Wiley & Sons, 2008, vol. 91.<br />




Fig. 6. Measurement results for different SAR algorithm implementations. Subfig. a) shows the result for the range-Doppler algorithm (RDA) with a full<br />

interpolation kernel, Subfig. b) with a 4-tap kernel and Subfig. c) without RCMC. In Subfig. d) and Subfig. e) the results for the ω-k algorithm are shown<br />

with a one-dimensional and two-dimensional reference function, respectively. The image computed with the RDA without RCMC is clearly less focused than<br />

the ones with RCMC. There is no visible difference between the full size and the 4-tap interpolation kernel though. Compared to the RDA the ω-k algorithm<br />

produces much better focused images.<br />



Visual Modeling of Self-Adaptive Systems<br />

Saivignesh Sridhar Eswari<br />

Software Designer at Nobleo<br />

Eindhoven, Netherlands<br />

s.e.saivignesh@gmail.com<br />

Juha-Pekka Tolvanen<br />

MetaCase<br />

Jyväskylä, Finland<br />

jpt@metacase.com<br />

Emil Vassev<br />

SEDEV Consult Ltd<br />

Sofia, Bulgaria<br />

emil@vassev.com<br />

Abstract—When developing autonomous systems, designers<br />

employ different kinds of knowledge to specify systems. We<br />

present a visual modeling approach created for specifying self-adaptive<br />

systems. The approach uses a model-based approach to<br />

specify the system context and ontology, addressing both<br />

structural and behavioral parts. The resulting models are used<br />

for code generation targeting knowledge reasoning frameworks and<br />

tools. The presented approach supports collaboration and<br />

communication within the safety design team, improves<br />

productivity of the team and reduces the cost of software<br />

certification.<br />

Keywords—autonomous systems; self-adaptive systems; visual<br />

modeling; model-based development; domain-specific languages;<br />

KnowLang; MetaEdit+<br />

I. INTRODUCTION<br />

Autonomous vehicles have to be safe and reliable. While<br />

certification programs and safety standards such as ISO 26262<br />

provide guidance, safety design and development of safe and<br />

reliable functionality is time-consuming and costly. Moreover,<br />

the integration and promotion of autonomy in vehicles is an<br />

extremely challenging task, and although autonomous cars are<br />

already seen on our streets, the first severe accidents prove<br />

that they are not yet as safe as we had hoped.<br />

We present a visual modeling solution developed to meet<br />

recommendations for using model-based approaches in safety<br />

design and the trend in the automotive industry toward using code<br />

generation from visual models. The presented approach is<br />

based on the KnowLang framework [1] developed for<br />

Knowledge Representation and Reasoning (KR&R) in self-adaptive<br />

systems. The modeling approach consists of a set of<br />

integrated visual models, each providing a particular view of<br />

the system, such as its overall context, structures and their<br />

relationships, along with specific behavior. The visual<br />

modeling approach makes the system's description a relatively<br />

easy task where models support communication and gathering<br />

feedback within the team. The visual modeling approach also<br />

relies on capabilities to manage complexity, such as partitioning<br />

structure from behavior by providing different user-adapted<br />

views of the system (e.g. overall and detailed) and by<br />

providing the possibility to have different views of the system<br />

(e.g. viewing only inheritance among structural elements).<br />

Another key part is tooling that enables collaborative model-based<br />

development: several engineers can edit the same<br />

specifications simultaneously with continuous integration.<br />

From the model-based specifications the implemented<br />

generator produces code for the KnowLang framework for<br />

further analysis and execution. This automates the routine and<br />

makes the KnowLang framework more accessible to<br />

engineers so that they can focus on safety design. The models<br />

can also be used for reporting, providing different views for<br />

different stakeholders and for documenting the system. By<br />

using visual modeling along with automatic code generation,<br />

we practically reduce both development time and effort,<br />

decrease certification costs and improve development<br />

productivity.<br />

In this paper, we present the KnowLang framework along with<br />

the developed visual modeling approach and code generator.<br />

This work was done to meet the needs of an automotive company.<br />

We describe the process of creating the tooling and show<br />

practical cases and examples of using the modeling approach<br />

when developing various self-adaptive systems.<br />

II. SELF-ADAPTATION AND KNOWLEDGE REPRESENTATION<br />

Autonomous systems, such as automatic lawn mowers,<br />

smart home equipment, driverless train systems, or<br />

autonomous cars, perform their tasks without human<br />

intervention.<br />

A. KnowLang<br />

KnowLang [1,2,3,4] is a framework for KR&R that aims<br />

at efficient and comprehensive knowledge structuring and<br />

awareness [5] based on logical and statistical reasoning.<br />

Knowledge specified with KnowLang takes the form of a<br />

Knowledge Base (KB) that outlines a Knowledge<br />

Representation (KR) context. A key feature of KnowLang is a<br />

formal language with a multi-tier knowledge specification<br />

model (see Fig. 1) allowing integration of ontologies together<br />

with rules and Bayesian networks [6].<br />

The language aims at efficient and comprehensive<br />

knowledge structuring and awareness. It helps us tackle [2]: 1)<br />

explicit representation of domain concepts and relationships;<br />



2) explicit representation of particular and general factual<br />

knowledge, in terms of predicates, names, connectives,<br />

quantifiers and identity; and 3) uncertain knowledge in which<br />

additive probabilities are used to represent degrees of belief.<br />

Other remarkable features are related to knowledge cleaning<br />

(allowing for efficient reasoning) and knowledge<br />

representation for autonomic behavior.<br />

Fig. 1. KnowLang Specification Model.<br />

By applying KnowLang's multi-tier specification model<br />

(see Fig. 1) we build a Knowledge Base (KB) structured in<br />

three main tiers [1, 2]: 1) Knowledge Corpuses; 2) KB<br />

Operators; and 3) Inference Primitives. The tier of Knowledge<br />

Corpuses is used to specify KR structures. The tier of KB<br />

Operators provides access to Knowledge Corpuses via special<br />

classes of “ask” and “tell” Operators where “ask” Operators<br />

are dedicated to knowledge querying and retrieval and “tell”<br />

Operators allow for knowledge update. When we specify<br />

knowledge with KnowLang, we build a KB with a variety of<br />

knowledge structures such as ontologies, facts, rules and<br />

constraints where we need to specify the ontologies first in<br />

order to provide the vocabulary for the other knowledge<br />

structures.<br />
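The "ask"/"tell" operator classes can be illustrated with a minimal mock knowledge base; this is not the actual KnowLang API, only a sketch of the access pattern described above, with hypothetical fact triples.<br />

```python
# Illustrative sketch (NOT the KnowLang API) of tiered KB access:
# "tell" operators update the knowledge base, "ask" operators query it.
class KnowledgeBase:
    def __init__(self):
        self.facts = set()

    def tell(self, fact):
        """KB operator of class 'tell': knowledge update."""
        self.facts.add(fact)

    def ask(self, fact):
        """KB operator of class 'ask': knowledge querying and retrieval."""
        return fact in self.facts

kb = KnowledgeBase()
kb.tell(("Vehicle", "hasState", "Parking"))       # hypothetical fact
print(kb.ask(("Vehicle", "hasState", "Parking"))) # True
```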

A KnowLang ontology is specified over concept trees,<br />

object trees, relations and predicates. Each concept is<br />

specified with special properties and functionality and is<br />

hierarchically linked to other concepts through “parents” and<br />

“children” relationships. For reasoning purposes every concept<br />

specified with KnowLang has an intrinsic “state” attribute that<br />

may be associated with a set of possible state values the<br />

concept instances may be in. The concept instances are<br />

considered as objects and are structured in object trees - a<br />

conceptualization of how objects existing in the world of<br />

interest are related to each other. The relationships in an object<br />

tree are based on the principle that objects have properties,<br />

where the value of a property is another object, which in turn<br />

also has properties. Moreover, concepts and objects might be<br />

connected via relations. Relations are binary and may have<br />

probability-distribution attribute (e.g., over time, over<br />

situations, over concepts' properties, etc.). Probability<br />

distribution is provided to support probabilistic reasoning and<br />

by specifying relations with probability distributions we<br />

actually specify Bayesian networks connecting the concepts<br />

and objects of an ontology.<br />
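The ontology structures described above (concepts with parent/child links and state values, plus binary relations carrying probabilities) can be sketched as plain data structures; the class and attribute names here are assumptions for illustration, not KnowLang syntax.<br />

```python
from dataclasses import dataclass, field

# Sketch of KnowLang-style ontology elements: concepts form a tree via
# parent/child links and carry possible state values; relations are binary
# and may carry a probability, which connects them to Bayesian reasoning.
@dataclass
class Concept:
    name: str
    states: list = field(default_factory=list)     # possible state values
    children: list = field(default_factory=list)   # child concepts

@dataclass
class Relation:
    source: str
    target: str
    name: str
    probability: float = 1.0   # probability-distributed relation

vehicle = Concept("Vehicle", states=["Driving", "Parking"])
av = Concept("AutonomousVehicle", states=["Driving", "Parking"])
vehicle.children.append(av)    # "parents"/"children" hierarchy link

rel = Relation("AutonomousVehicle", "ParkingLot", "detects", probability=0.9)
print(vehicle.children[0].name, rel.probability)
```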

B. Knowledge Representation<br />

When developing autonomous systems, designers employ<br />

different kinds of knowledge to derive models of specific<br />

domains of interest. There’s no standard classification system<br />

- the problem domain determines what kinds of knowledge<br />

designers might consider and what models they might derive<br />

from that knowledge [7]. Designers can use different elements<br />

to represent different kinds of knowledge. Knowledge<br />

representation (KR) elements could be primitives such as<br />

rules, frames, semantic networks and concept maps,<br />

ontologies and logic expressions [7]. These primitives might<br />

be combined into more complex knowledge elements.<br />

Whatever elements they use, designers must structure the<br />

knowledge so that the system can effectively process it and<br />

humans can easily perceive the results.<br />

In the dynamically changing automotive industry,<br />

designers need to achieve optimized designs and successful<br />

validation earlier in the automotive engineering process. Many<br />

adopt advanced automation technologies based on modeldriven<br />

development to meet this challenge. Note that various<br />

model-based approaches provide automotive engineering<br />

software for design, simulation, verification, and<br />

manufacturing, allowing one to create a digital model that<br />

drives the entire product development process. Advanced<br />

analysts and designers can use analysis and simulation<br />

solutions for kinematics, dynamics, structural, thermal, flow,<br />

motion, multi-physics, and optimization in a single<br />

environment. Seamless sharing of model data between design<br />

and analysis delivers results quickly, to impact critical design<br />

decisions. The use of visual models was also a requirement<br />

from the company for a tool for specifying self-adaptive<br />

behavior. The use of visual models is also backed by empirical<br />

research in particular when investigating studies on quality of<br />

the specification, effectiveness and efficiency. As an example,<br />

Jakšić et al. [8] performed a statistical analysis for comparing<br />

the quality, efficiency and productivity between textual<br />

representation and graphical model-based representation. They<br />

focused on feature trees applied in product lines which<br />

resemble perhaps the closest the concept trees of KnowLang.<br />

The result of the empirical study was that graphically created<br />

specification was more complete and of better quality than the<br />

textually specified ones. Also graphical modeling took less<br />

time than creating the same feature model with textual<br />

specification.<br />

C. Visual Modeling Tools<br />

While it is possible to create tools from scratch, we<br />

applied the Language Workbench approach, which provides most of the<br />

needed functionality automatically: only support for the<br />

KnowLang language, its model-based visualization, checking<br />

correctness of the specifications, code generation and<br />

integration with other tools was added. MetaEdit+ [10] was<br />

applied as the tooling as it satisfied the requirements of the<br />

automotive company. These included support for collaborative<br />



modeling, version control, integration with relevant tools<br />

applied in automotive (e.g. Simulink, HiP-HOPS), updating<br />

both models and metamodels, as well as availability of<br />

supporting services. Naturally tool support was expected for<br />

visual modeling and implementing the generators for<br />

integration with KnowLang and other targets.<br />

MetaEdit+ provides tools for developing modeling support iteratively and without programming. Language definition and language use happen in the same environment, allowing the language definition to be tested immediately and updated based on experience from using the language. The language definition follows this process:<br />

1) Defining the language concepts used to create the models.<br />

2) Setting rules for these concepts, which prevents the creation of illegal or unwanted specifications.<br />

3) Defining the visual notation used when editing and reading the models. The notation can also show information not directly part of the specification itself, such as incompleteness or model references.<br />

4) Implementing the generators that produce the required artifacts, such as code, simulation data, and tests.<br />
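As a rough illustration of how these four parts relate, the sketch below renders them as plain Python data plus a toy generator. The structures and names are hypothetical stand-ins for definitions that MetaEdit+ creates interactively, without programming:<br />

```python
# Hypothetical, simplified stand-ins for the four language-definition steps.

# 1) Language concepts used to create the models.
concepts = {"Concept": ["name", "properties", "functions", "states"]}

# 2) Rules preventing illegal or unwanted specifications.
rules = [("name must be unique within", "Concept", "Concept tree")]

# 3) Visual notation used when editing and reading the models.
notation = {"Concept": {"shape": "rectangle", "color": "green"}}

# 4) A generator producing a required artifact (here: code-like text).
def generate(model):
    return "\n".join(f"CONCEPT {name}" for name in model)

print(generate(["Passenger", "Journey"]))
```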

At any step of this process the language definition can be applied and tried out. This is also possible with the multi-user version of MetaEdit+: language engineers can define the language while others simultaneously use it for modeling. The feedback loop between language definition and language use helps to reduce errors, minimize the risk of creating unwanted language features, and improve user acceptance.<br />

III. MODEL-BASED DEVELOPMENT FOR KNOWLANG<br />

The visual modeling support for KnowLang was implemented by one person. The implementation was done incrementally during spring 2017, within a period of three calendar months, and the results were reviewed by three people.<br />

A. Defining and Formalizing Concepts<br />

The language definition started by identifying the different visual views for KnowLang specifications: concept trees expressing the domain ontology, predicates expressing complex system states, contexts specifying the environment or situation in which the concepts apply, and behavior expressed with Boolean expressions.<br />

Since the concepts of KnowLang were already defined (see Section II.A), the metamodeling process largely meant mapping the KnowLang concepts to visual modeling concepts, such as objects, their relationships, roles, and properties. Fig. 2 shows the definition of concept trees and their modeling elements. These include the various concepts used as modeling objects, their inheritance and probability-based relations shown as relationships, and roles defining how objects participate in the relationships.<br />

Fig. 2. Definition of Concept tree in KnowLang<br />

The elements modeled for representing concept trees are Metaconcept, Generic Concept, Explicit Concepts, and Relations. Each of the modeling elements shown in Fig. 2 is defined in further detail. Fig. 3 shows one such definition: the Concept with its properties. The description at the bottom of the window is used in the help system available to the modeler using KnowLang.<br />

Fig. 3. Definition of Concept of KnowLang<br />

Once defined, each part of the language specification was tried out by specifying reference systems. The other language concepts of concept trees were defined similarly, as were the other views of KnowLang, such as behavior and complex states.<br />

Since we were defining a visual language, the resulting language definition differs from the grammar definitions used for textual languages. In visual models a particular element, such as the ‘Passenger’ Concept, is entered only once and referenced elsewhere in the specification, including from other diagrams. A change in one diagram is thus reflected everywhere, without the find-and-replace or refactoring tools needed when working with plain text. Similarly, an automated trace, such as where a particular ‘Passenger’ concept is used, is directly available; this supports traceability and the production of documentation reports. A visual language also provides views and separation of concerns for knowledge representation, as well as the possibility to view and filter the specification at different levels of detail or for different audiences. For example, one reader might be interested only in plain concept inheritance, whereas another wants to see the connections. This notation part is discussed in Section III.C.<br />

B. Defining Rules<br />

Each modeling element, such as the Concept Trees or Concepts illustrated above, may have rules and constraints. For instance, concept names may have to be unique within a concept tree, or inheritance between concepts may or may not allow multiple inheritance. If such rules are defined in the metamodel, they can be checked at modeling time, preventing the creation of illegal or unwanted specifications. As it is cheaper to prevent errors than to correct them later, we also added various model-checking rules to the metamodel. For the metamodel definition MetaEdit+ provides ready-made rule templates, as applied for the KnowLang definition in Figs. 4 and 5.<br />

Fig. 4. Binding rule for directed relationship<br />

Fig. 4 illustrates the definition of a directed relationship between a set of objects. A two-directed relationship must always have at least two Directed roles when connecting any of the listed objects. Fig. 5 shows the definition of a uniqueness constraint: each concept must have a unique name within a concept tree. Similarly, rules for mandatory naming, legal connections, the number of connections, etc. were added to the metamodel.<br />

Fig. 5. Uniqueness rule for naming<br />

Rules for all other parts of KnowLang were added similarly. As we divided the visual presentation into different views, the metamodel was finalized by interlinking the views. In most cases such linking appeared automatically, as in MetaEdit+ the model elements can be reused and linked between views. For other cases, like organizing the model hierarchically, the metamodel definition was extended with linking rules. Fig. 6 illustrates some of these rules, such as that a Concept may have a State chart, and an Action Concept may have Pre-conditions and Post-conditions defined in their own views.<br />

Fig. 6. Explicit rules for linking modeling elements with views<br />
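A modeling-time check of the kind these rule templates express can be sketched as follows; the data model is hypothetical, since MetaEdit+ applies such rules directly from the metamodel rather than through user code:<br />

```python
# Sketch of a modeling-time rule (hypothetical data model): each concept
# name must be unique within its concept tree, as in the rule of Fig. 5.

def check_unique_names(concept_tree):
    """Return the list of duplicated concept names in one concept tree."""
    seen, duplicates = set(), []
    for name in concept_tree:
        if name in seen:
            duplicates.append(name)
        seen.add(name)
    return duplicates

# A duplicate would be reported while modeling, before any generation step.
print(check_unique_names(["Passenger", "Journey", "Passenger"]))
```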

C. Defining Notation<br />

Visual modeling requires a concrete syntax. We defined the syntax following the KnowLang presentation material and extended it with visual properties to convey summary information, model-linking data, and error annotations. We applied various visual variables for the notation, such as shapes, colors, and fonts, guided by [9], to improve readability, understanding, and working with the specification models. Fig. 7 illustrates the definition of the notation for a Concept element. The notation is defined with the Symbol Editor of MetaEdit+; alternatively, existing visualizations could be imported and applied.<br />

Initially the notation provided just the basics: a green rectangle showing the unique name of the concept. To manage the different views, the definition was extended with a visual cue in the upper right corner indicating whether a concept has an associated subgraph. This visualizes the rule of the language defined in Fig. 6.<br />

Fig. 7. Symbol definition for Concept<br />

Since the elements have a richer structure than just a name, an alternative representation was added: a modeler may want to see further details of an element visually. For this purpose two possible representations were provided, both shown in Fig. 8. The symbol on the left shows the minimal view; the symbol on the right shows the characteristics considered important to visualize. Note that part of the data, like properties and functions, is taken directly from the metamodel of Concept (see Fig. 3), whereas the states are retrieved from the State chart linked to the concept (see the metamodel for this part in Fig. 6).<br />

Fig. 8. Two possible visualizations for the Concept, as chosen at modeling time<br />

The notation was defined similarly for the other views and their modeling elements, and not just for the main modeling objects but also for their relationships and roles. A more complete illustration of the visualization aspects is given in the example section.<br />

D. Implementing KnowLang generators<br />

After having defined the notation, we simultaneously created models representing the knowledge for KnowLang. While these models were used to test the language, they also served as the basis for the generators. Once the models existed, we implemented generators that produce the knowledge in KnowLang and call it for compilation.<br />

The generator was implemented with the Generator Editor of MetaEdit+. Fig. 9 shows the main structure of the generator in the upper left corner. A generator called Code starts from the concept tree and, for each concept, produces information on its inheritance relationships with other concepts, as well as its defined properties, functions, and states. Each of these parts has its own subgenerator, and these subgenerators match the concepts expressed in the metamodel. At the bottom of the screen, one subgenerator is shown that handles the generation of states; it in turn calls other generators producing the behavioral logic in the Boolean expression given for the concept.<br />

Fig. 9. Definition of the generator (example)<br />
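The generator structure described above can be sketched as a tree walk with one subgenerator per part. The Python below is illustrative only: the data structures are hypothetical, and the emitted text is merely KnowLang-like, not the exact KnowLang grammar:<br />

```python
# Illustrative sketch of a tree-walking generator (hypothetical data
# structures; the output is only KnowLang-like, not the exact grammar).

def gen_states(states):
    # Subgenerator for states, analogous to the one shown in Fig. 9.
    return [f"  STATE {s}" for s in states]

def gen_concept(concept):
    # Emit the concept itself, then call a subgenerator per part.
    lines = [f"CONCEPT {concept['name']}"]
    lines += [f"  PARENT {p}" for p in concept.get("parents", [])]
    lines += [f"  PROP {p}" for p in concept.get("properties", [])]
    lines += gen_states(concept.get("states", []))
    return lines

def gen_tree(tree):
    # The top-level generator starts from the concept tree.
    return "\n".join(line for c in tree for line in gen_concept(c))

print(gen_tree([{"name": "Journey", "states": ["FlatTire"]}]))
```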

The other parts of the Generator Editor provide access to the metamodel (top right) and to the generator commands (top middle). This allows writing the generator within the context of the given metamodel, i.e., the KnowLang concepts.<br />

While the main part of the generator navigates the visual model to produce the code, it also integrates with the KnowLang reasoner by calling it at the end with the generated output. In this way the developer can easily move from the visual model to seeing the results executed in KnowLang. In addition to generating the KnowLang code, the same generator system was used to provide model-checking rules not included in the metamodel. These included model guidance (e.g., flagging a partial Boolean expression), reporting (e.g., document generation), queries on models, and the production of metrics.<br />



E. On the implementation process<br />

The implementation was done by one person during spring 2017, within a period of three months. Half of the effort went into defining the metamodel (Sections III.A-C) and the other half into implementing the generator (Section III.D). During the implementation phase, three people provided feedback on the work. The implementation was tested and verified by using the created modeling language to specify various kinds of systems and by comparing against the reference test cases.<br />

IV. EXAMPLE<br />

The modeling solution is applied to specify safety functionality in different application areas, such as autonomous cars, unmanned space explorers, and surveillance drones. We use car safety as an example next and, for brevity, show only parts of the key models. The aim of the car-safety project is to compute a set of alternative routes to the current destination, to ensure that the vehicle always runs on sufficient battery, and to drive safely around crosswalks.<br />

The modeling process starts with defining the ontology of the system with concept trees. For each concept, its properties, functions, and states are defined. Fig. 10 shows the ontology with concept trees in the MetaEdit+ modeling tool, and Fig. 11 shows a portion of this model in detail.<br />

Fig. 10. Concept tree of car safety<br />

The notation uses different colors and shapes for the model elements to help the reader identify the different KnowLang concepts. Fig. 11 shows details of the concept tree dealing with the Software Phenomenon on Journey and Route, as well as related knowledge on errors, situations, and policies. If there is a need to exemplify the ontology, corresponding instances of these concepts can be defined as object trees. Object trees were specified in the metamodel as part of the KnowLang support.<br />

Fig. 11. Concept tree for software phenomenon (partial)<br />

Behavior is described with states using Boolean expressions. The Boolean expression for AvoidCollision is shown in Fig. 12. It differentiates the InLowTraffic and InHighTraffic conditions: in high traffic, NeedFix and FlatTireAtCrosswalk are not allowed. All these states refer to Boolean expressions defined for other concepts: the Traffic conditions on the Route concept, NeedFix on the Brakefailure concept, and FlatTire on the Journey concept. These concepts were defined in the concept tree (Fig. 11).<br />

Fig. 12. Behavior expression for avoiding collision<br />

Predicates in KnowLang are considered complex system states because their evaluation depends on the evaluation of the involved concept states. Fig. 13 illustrates such a predicate, dealing with three concepts on collision avoidance.<br />
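One possible reading of the AvoidCollision behavior is sketched below; the exact operator structure of the expression in Fig. 12 is an assumption based on the description in the text (high traffic permits neither NeedFix nor FlatTireAtCrosswalk):<br />

```python
# Illustrative evaluation of the AvoidCollision behavior (the operator
# structure is an assumed reading, not the exact expression of Fig. 12).

def avoid_collision(in_low_traffic, in_high_traffic,
                    need_fix, flat_tire_at_crosswalk):
    return in_low_traffic or (
        in_high_traffic and not need_fix and not flat_tire_at_crosswalk
    )

# High traffic with a flat tire at a crosswalk violates the behavior.
print(avoid_collision(False, True, False, True))   # False
print(avoid_collision(False, True, False, False))  # True
```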



Fig. 13. Predicate (complex state) for avoiding collision<br />

Created models can be transformed at any time into input for the analyzer, reasoner, or simulation applied, provided that a generator is available. Since we applied KnowLang, the generator produces KnowLang code.<br />

Fig. 14 shows a portion of the KnowLang code generated from the visual models. The highlighted part relates to the concept Journey (Fig. 11) and its states, such as the one dealing with a flat tire, used to define the behavior in Fig. 12.<br />

KnowLang and the generated code run on top of the developed system. When the system needs to make decisions, it consults KnowLang, which provides the self-adaptation.<br />

Fig. 14. Generated code in KnowLang<br />

V. CONCLUSIONS<br />

We presented a visual modeling language for developing self-adaptive systems. The approach provides several benefits for development teams:<br />

● It supports communication and collaboration within a team: different users may take different views of the specifications and can edit the same specification simultaneously.<br />

● The model can be used to generate the code, improving productivity and removing the need to learn a particular syntax and debug coding errors.<br />

● The modeling language guides developers by partitioning the system specification into different concerns, such as concepts, dependencies, and behavior.<br />

● It reduces the cost of software certification.<br />

● It reduces the time to market of a product.<br />

We also presented the actual language-creation process, covering the metamodel with its rules, the visual notation, and the code generator. The language implementation work was done in a period of three calendar months by one person. Because the investment is modest, it pays off quickly, as all the other developers can then model with the language and run the generators creating the code. As both the modeling language and the generators are freely accessible, the presented approach also gives the company full control for making extensions in the future.<br />

ACKNOWLEDGEMENT<br />

We would like to thank Dr. ir. Ion Baroson and Prof. dr. Mark van den Brand at Eindhoven University of Technology for their collaboration and for their continuous support and guidance throughout this project. We would also like to thank Baesis Automotive for initiating this project and supporting us.<br />

REFERENCES<br />

[1] Vassev, E., Hinchey, M., Knowledge Representation for Adaptive and Self-aware Systems. In Software Engineering for Collective Autonomic Systems, Volume 8998 of LNCS. Springer, 2015.<br />

[2] Vassev, E., Hinchey, M., Knowledge Representation for Adaptive and Self-Aware Systems. In Software Engineering for Collective Autonomic Systems: Results of the ASCENS Project, Lecture Notes in Computer Science, vol. 8998. Springer Verlag, 2015.<br />

[3] Vassev, E., Hinchey, M., KnowLang: Knowledge Representation for Self-Adaptive Systems. In IEEE Computer 48 (2), 81–84, 2015.<br />

[4] KnowLang Framework for Knowledge Representation and Reasoning for Self-Adaptive Systems. http://www.knowlang.engineeringautonomy.com (accessed Jan 2018).<br />

[5] Vassev, E., Hinchey, M., Awareness in Software-Intensive Systems. In IEEE Computer 45 (12), 84–87, 2012.<br />

[6] Neapolitan, R., Learning Bayesian Networks. Prentice Hall, 2013.<br />

[7] Vassev, E., Hinchey, M., Knowledge Representation and Reasoning for Intelligent Software Systems. In IEEE Computer 44 (8), 96–99, 2011.<br />

[8] Jakšić, A., France, R., Collet, P., Ghosh, S., Evaluating the Usability of a Visual Feature Modeling Notation. In International Conference on Software Language Engineering. Springer, 2014.<br />

[9] Moody, D., The “Physics” of Notations: Toward a Scientific Basis for Constructing Visual Notations in Software Engineering. IEEE Transactions on Software Engineering, Volume 35, Issue 6, 2009.<br />

[10] MetaEdit+, http://www.metacase.com (accessed Jan 2018).<br />



IoT-Security and Product Piracy: Smart Key<br />

Management versus Secure Hardware<br />

Christian Zenger 1,2 and Mario Pietersz 2<br />

1 Ruhr-Universität Bochum<br />

Horst Görtz Institut für IT-Sicherheit<br />

Bochum, Germany<br />

christian.zenger@rub.de<br />

2 PHYSEC GmbH<br />

Universitätsstr. 142<br />

44799 Bochum, Germany<br />

mario.pietersz@physec.de<br />

The today’s fear to lose against competitors, manufacturers of<br />

physical products are urgently searching for solution to<br />

“smartify” their products, to establish new digital business<br />

models, and to offer new services. To them, digitalization means<br />

mainly the establishment of (Internet-) connectivity between their<br />

products and some digital service platform. However, many<br />

business models build on top of digitalization might lose its<br />

competitive advantage for the manufacturer if the data are not<br />

secured (available, authentic, confidential, and integer). We<br />

present a detailed overview what is arguably the most difficult<br />

part in the majority of security systems, namely device<br />

authentication and key establishment. We help answering a<br />

major question of decision makers: Which key establishment<br />

method and which (security) hardware solution reduces product<br />

piracy risk as well as cyber security risks sufficiently, is capable<br />

to start today with small charges and end up with a flexible longterm<br />

capable serial production, as well as provides a good costbenefit<br />

ratio for new IoT products? In the present paper we focus<br />

on details to find an individual answer, while potential lock-in<br />

effects of suppliers and platform providers are out of scope.<br />

Keywords—IoT-security; product piracy; key management;<br />

supply chain; ad-hoc provisioning<br />

I. INTRODUCTION<br />

We are in the midst of a deep technological transition. The result is described as the “Internet of Things” (IoT) and will affect all areas of the business world. The IoT introduces the paradigm of an ecosystem of ubiquitous embedded systems that communicate with each other throughout everyday life. This is essentially what Eric Schmidt (then Executive Chairman of Google) described in 2015 at the World Economic Forum in Davos in the following words: “[The Internet will] be part of your presence all the time. Imagine you walk into a room, and the room is dynamic. And with your permission and all of that, you are interacting with the things going on in the room.”<br />

Where once classic computers and servers communicated with each other over the Internet, today products such as coffee machines or industrial sensors communicate with their associated virtualized digital service platforms. The range of IoT devices spans inexpensive consumer electronics, highly specialized industrial products (I4.0), and medical devices. All of these systems have very different requirements and properties; for instance, the protection of personal data, the protection of production secrets, and the high availability of industrial equipment can be differentiated.<br />

Gartner [1] predicts that 20 billion devices will be connected to the Internet by 2020. This creates new product features for customers and manufacturers (e.g., remote control or plagiarism controls), as well as revolutionary new business models (e.g., pay-per-use). In addition, the analysis of collected data promises cost and revenue optimization (such as predictive maintenance). As part of this digitalization, new devices will be equipped with intelligent sensors and standardized connectivity solutions.<br />

In this time of digitalization and of fear of losing out to competitors, manufacturers of classical physical products are urgently searching for solutions to “smartify” and digitalize their products, to establish new digital business models, and to offer new services. To them, digitalization mainly means establishing (Internet) connectivity between their products and some digital service platform, enabling data sharing and artificial intelligence. However, many business models built on top of digitalization may lose their competitive advantage for the manufacturer if the data are not secured (available, authentic, confidential, and integrity-protected). Simultaneously, for the consumer and society at large, it is important that the technology is privacy-preserving.<br />



Furthermore, it is now not only honest and law-abiding manufacturers who are digitizing, but unfortunately also criminals. Depending on the criminal's “business model”, e.g., building and leasing botnets, blackmailing, or sabotage, attackers have several opportunities to compromise IoT ecosystems. Manufacturers counter those threats with encryption, and thus with cryptographic primitives that require secret cryptographic material. However, stealing cryptographic keys is almost always the simplest, most impactful, and most “commercially” scalable attack. These keys are important because the root of trust of modern security solutions is based on cryptographic keys. Thus, the question of the origin and management of these keys arises.<br />

In the following, we discuss the different approaches to key management and show that the traditional ones are not necessarily suitable for the IoT. Afterwards, procedures are described that focus on the user and on ease of use. We present a detailed overview of what is arguably the most difficult part of the majority of security systems, namely device authentication and key establishment. Today's key establishment solutions for securing the IoT ecosystem can mainly be divided into three categories:<br />

● Master secrets (e.g., hard-coded factory-default keys, easy-to-guess passwords).<br />

● Device-individual credentials integrated within the production (e.g., client certificates, symmetric tokens, etc.).<br />

● Ad-hoc, user- and device-individual key establishment (e.g., using the resurrecting duckling principle).<br />

Each approach has its advantages (e.g., cheap production, solid security, or flexible production) as well as its disadvantages (e.g., a serious undermining of the system in the case of a hack, new complexities and expenses within the supply chain, or manual provisioning), and works with standard MCUs, secure MCUs (e.g., with read-out protection), or even secure hardware. A common example of a secure element is the Trusted Platform Module (TPM). TPMs usually contain a co-processor for energy-efficient computation of cryptographic primitives as well as protected storage for keys.<br />

A major question of decision makers is: Which key establishment method and which (security) hardware solution sufficiently reduces both product-piracy and cyber-security risks, makes it possible to start today at low cost and end up with flexible, long-term-capable serial production, and provides a good cost-benefit ratio for new IoT products? In the present paper we focus on the details needed to find an individual answer, while potential lock-in effects of suppliers and platform providers are out of scope.<br />

II. TECHNICAL BACKGROUND<br />

Before a more detailed description of the classical approaches to key management is given, an overview of the protection goals in information security is provided. Based on this, the basic idea of key and identity management is explained.<br />

A. Security goals in information security<br />

Expressing the security requirements of information systems as so-called protection goals is common today. A brief description of the security goals confidentiality, integrity, availability, and authenticity follows. These are the most important and widely accepted protection goals; the list is by no means exhaustive, as much more sophisticated security goals are defined in the literature.<br />

● Confidentiality means that only legitimate parties can read a message.<br />

● Data integrity ensures that messages have not been altered.<br />

● Availability of systems implies that they have to be available on a long-term basis with an assured level of quality.<br />

● Authenticity ensures the authorship of messages.<br />

The above protection goals can be achieved with cryptographic primitives. On the basis of these primitives, it is possible to develop protocols which, in a reasonable combination, enable key and identity management. More detailed information on this can be found in Understanding Cryptography by Paar et al. [2]. A well-known example from Microsoft Windows-based Active Directory environments is the Kerberos protocol, which uses, for instance, the cryptographic primitive of symmetric encryption to achieve the protection goal of authenticity.<br />
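As a small illustration of a symmetric primitive serving a protection goal, the sketch below uses an HMAC from Python's standard library (a different symmetric primitive than the encryption Kerberos uses) to provide data integrity and authenticity with a shared key:<br />

```python
import hmac
import hashlib

# A shared secret known only to the legitimate parties (illustrative value).
KEY = b"shared-secret-key"

def tag(message: bytes) -> bytes:
    """Compute an HMAC tag that proves integrity and authenticity."""
    return hmac.new(KEY, message, hashlib.sha256).digest()

def verify(message: bytes, received_tag: bytes) -> bool:
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(tag(message), received_tag)

msg = b"usage-counter=42"
t = tag(msg)
print(verify(msg, t))                  # True: untouched message accepted
print(verify(b"usage-counter=43", t))  # False: altered message detected
```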

B. Key and identity management<br />

As the world gets more and more digitized, consumer products, industrial systems, or, more generally speaking, “things” have been extended to communicate with their corresponding digital service platforms or directly with people. This is driven by new business models in which a service provider depends on the data generated by the product, e.g., for billing purposes. Transferring this data directly via the Internet saves manual meter-reading costs and thus scales to a large number of products in the market. There are many other applications, e.g., remote maintenance access for industrial systems, in which protection of the transmitted data, and thus compliance with the protection goals, is necessary. From the perspective of the manufacturer and operator, the following questions therefore arise:<br />

● “How do we ensure that the data has not been manipulated?”<br />

● “Can we protect the information from third-party access?”<br />

● “How can we be confident that a product is exactly what it claims to be?”<br />

With appropriate key and identity management, these questions can be answered. Such a management tool can establish the necessary relationships of trust between the devices via cryptographic protocols. These are, of course, based on cryptographic key material, which can be used in protocols to achieve the required protection goals.<br />
protocols to achieve the required protection goals.<br />

355


III. CLASSIC SOLUTIONS<br />

Based on the scenario of a generic product manufacturer, the classic approaches to key management are considered first. In this chapter, particular attention is paid to the degree of security of a solution as well as to its impact on the manufacturing process.<br />

A. No security<br />

The starting point of the digitalization of a product is adding a simple connectivity solution that allows the product to communicate with the manufacturer's digital service platform via the Internet. In doing so, the product may send, among other things, usage values or status information about the individual components of the product to the service platform. This enables the manufacturer to implement new business models, for instance pay-per-use or predictive maintenance.<br />

At this point in the development process, the concept does not yet include any security requirements for achieving protection goals at all. This implies that the product communicates unsecured with the digital service platform via untrusted network segments. The lack of security opens attack vectors such as manipulated billing data and collected usage profiles, and the products can be captured and used for large-scale attacks on other systems (DDoS botnets).<br />

A relationship of trust between the product and the digital service platform does not yet exist. At the same time the manufacturing process is obviously untouched, which implies that no security also means no additional costs in production.<br />

B. Pre-shared keys<br />

After the manufacturer has created a proof of concept, he may experience a demand for state-of-the-art security features and want to apply security concepts to the product. In real scenarios, this step may be motivated by customer requirements, by regulation, or by the manufacturer's own business model.<br />

The manufacturer fulfils the protection goal of confidentiality by encrypting with the secure and standardized encryption system AES (Advanced Encryption Standard). However, every symmetric encryption primitive requires a cryptographic key, just as every door of a house requires a physical key. The required key is intended to be provisioned in the production process, i.e., stored on the product while flashing the firmware. The manufacturer can choose between two distinct options:<br />

With a master key, all products, or a large fraction of them, share the same cryptographic key material. Anyone who knows the key can encrypt and decrypt the messages. Should the key be compromised, e.g., extracted from the firmware, the security concept collapses entirely: a successful attack scales from one product to the entire product line. Variants in which the key is derived from, e.g., the serial number or MAC address fall into the same category. Although these keys may look individual per device, they are not safe from a cryptographic point of view, because the information involved becomes public as soon as the mechanism is revealed, e.g., through reverse engineering.<br />

Another option is individualized pre-shared keys. In this case, each product is given its own secret key during the production process. The advantage is that successful attacks on individual products can no longer scale to the entire product line. However, this method has serious drawbacks. The complexity of key distribution increases quadratically with the number of products: for example, with 1,000 products in the field, about 500,000 symmetric keys are required for individual communication links between all parties. Another obvious disadvantage is the inevitable linkage between the production systems and the key management system, and the fact that the devices need an (Internet) connection to the key management system when they are commissioned.<br />
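The quadratic growth is simply the number of distinct pairs, n(n-1)/2. A short sketch, using the product count from the text:<br />

```python
def pairwise_keys(parties: int) -> int:
    """Symmetric keys needed so that every pair of parties shares an
    individual key: the number of distinct pairs, n * (n - 1) / 2."""
    return parties * (parties - 1) // 2

# 1,000 products in the field -> 499,500 keys, the "about 500,000"
# quoted above; doubling the fleet roughly quadruples the key count.
print(pairwise_keys(1000))   # 499500
print(pairwise_keys(2000))   # 1999000
```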

It should be noted that in both cases at least two parties use the same symmetric key to encrypt the messages among themselves. This implies that it cannot be distinguished which of the two parties created a message. Thus, the protection goals of non-repudiation and message authentication cannot be achieved in this way. When using pre-distributed keys, a trade-off between the level of security and the integration effort must be considered. The use of master keys is trivial but insecure; the high administrative burden of individualized symmetric keys often does not justify the level of security that can be achieved.<br />
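The missing non-repudiation can be demonstrated in a few lines, using HMAC as a stand-in for the symmetric authentication primitive (a sketch; key and message are invented):<br />

```python
import hashlib
import hmac

shared_key = b"pre-shared-factory-secret"   # known to product AND platform
message = b"usage_counter=1234"

# Both holders of the key compute exactly the same authentication tag ...
tag_from_product = hmac.new(shared_key, message, hashlib.sha256).digest()
tag_from_platform = hmac.new(shared_key, message, hashlib.sha256).digest()
assert tag_from_product == tag_from_platform

# ... so a third party shown a valid tag cannot tell which of the two
# key holders authored the message: authentication only holds between
# the partners, and non-repudiation is impossible with symmetric keys.
```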

C. Public key infrastructures<br />

The manufacturer has learned from the mistakes of other manufacturers and rejects the idea of pre-distributed symmetric keys. He now turns to a so-called Public Key Infrastructure (PKI) and the use of certificates. With this approach, each product gets individual key material and the ability to cryptographically sign and verify messages. These signatures are similar to a letter seal or a signature on a document. As a result, the product can identify itself with its individual key material. In other words, the key material represents and assures the identity of the product in the form of a signed certificate. In this scenario, a centralized, trustworthy authority, the certification authority, is responsible for issuing certificates, which can be used for authenticated communication with third parties. But how are the certificates distributed individually to the products? The following three scenarios with different trust relationships can be distinguished:<br />

In the first scenario, the certificates, together with the cryptographic material, are pre-generated and transferred to the products during the production process. In addition to the public part of the key material, the private and therefore secret part must also be transferred. As with the symmetric individual keys, the transmission of the key material must be carefully protected. Important questions therefore arise, such as: Where is the production located? Which parties does the manufacturer have to trust not to copy the key material and use it for unwanted actions, e.g., counterfeit products?<br />

The second scenario is based on generating the key material during the first start of the product in the production process; it is subsequently checked and signed by a central entity. In this case, the secret part of the key material never leaves the device. However, a central instance must be reachable at production time. What happens if it fails or is blocked by DDoS attacks? How much does it cost to stop the production lines for hours or days?<br />

In the third scenario, the cryptographic material is stored in a separate, secure chip, e.g., a Trusted Platform Module (TPM), delivered by a trusted partner and embedded in the product. More details on TPMs are provided in Chapter V. The manufacturer has to trust the manufacturer of the modules on several levels. Who guarantees that the module manufacturer does not keep backup copies of the key material? What happens if the module manufacturer changes its price structure? Which PKI solutions are now tightly coupled? To which service platform is the PKI bound?<br />

Although the procedures differ significantly, they have one thing in common: in all three methods, an individual intervention in production is required for each device, each with different means and effects. This intervention in the production process is illustrated in Fig. 1. Obviously, the coupled trusted third parties are very critical elements in the manufacturing process, and the security level of their infrastructure and of the technical and organizational processes needs to be state of the art as well. The use of individual keys is considered state of the art and reasonably secure. However, the manufacturer buys this security level with new complexities, necessary relationships of trust with other parties, and higher costs. The complexity of key establishment, caused by integrating the process into production or the supply chain, prevents flexible modularization of the individual processes.<br />
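The signing and verification that such certificates rely on can be illustrated with textbook RSA (a deliberately tiny, insecure sketch using the classic worked-example parameters; real deployments use 2048-bit keys or elliptic curves):<br />

```python
import hashlib

# Textbook-RSA key pair from the classic worked example (p=61, q=53).
n, e = 3233, 17     # public: modulus and verification exponent
d = 2753            # private: signing exponent, stays on the device

def sign(message: bytes) -> int:
    """Only the holder of the private exponent d can produce this."""
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(h, d, n)

def verify(message: bytes, signature: int) -> bool:
    """Anyone who knows the public pair (n, e) can check it."""
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(signature, e, n) == h

msg = b"device-id 4711: status ok"
sig = sign(msg)
assert verify(msg, sig)
# A certificate is, in essence, such a signature by the certification
# authority over the device's public key and identity.
```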

Fig. 1. Integration of Public Key Infrastructures into the manufacturing<br />

process. The introduced dependencies and complexity can be challenging.<br />

D. Relying on standards<br />

The IoT stands for a world-wide network of interconnected, uniquely addressable objects based on standard communication protocols. The idea is thus interconnectivity based on standards. Unfortunately, all established communication standards (e.g., Wi-Fi [3], BLE [4], ZigBee [5], LoRaWAN [6]) are either insecure by design due to their cryptographic primitives or do not provide a (user-friendly) possibility to change key material.<br />

Therefore, an additional security layer has to be implemented, ideally with an over-the-air update mechanism. This in turn complicates the choice of a proper communication technology, because the claimed performance figures (energy consumption, computational effort, and time constraints) are no longer maintained.<br />

E. Interim conclusion<br />

In this chapter, classical approaches to key and identity management were presented and described. An adequate level of security implies a cost-intensive and risky intervention in production. A trade-off between costs and benefits can lead to further insecurity on the market. The step to individual keys is a major challenge for manufacturers today, as the expected costs and complexities are difficult to estimate due to a lack of information and experience. In practice, one often sees the alibi solution of delegating the problem to the user or technician who puts the machines into operation. Today's solutions of shifting the complexity towards the consumer, who is then required to introduce or change credentials, often require access to device-individual user interfaces, which is already a cumbersome process. On top of that, humans are notoriously bad at choosing good passwords, which makes relying on them a second erroneous assumption.<br />

Shifting the key establishment to the end user is a good idea for many use cases, especially those in which the user interacts with the device at least once. However, if the goal is to smartify and simplify his or her life, this will be successful if and only if the process is negligible with respect to the consumer's expense and IT expertise.<br />

IV. NOVEL APPROACHES<br />

The previous chapter provided an overview of the currently widely used approaches to key management. This chapter looks at new and future approaches. The first subchapter deals with the concept of ad-hoc authentication; two new methods based on this model are then described in the following subchapters.<br />

A. Ad-hoc authentication<br />

From the point of view of the manufacturer, the cryptographic individualization of products on the production line is a cost factor that should be avoided. One way to achieve this is to establish the actual relationship of trust at a later point in time, for example during commissioning or installation. This procedure is known, e.g., from the use of Bluetooth devices, which must be paired at least once with other devices. Security measures based on the use of PINs, however, are not sufficient. Such solutions arise when, in addition to cost, usability also plays an important role: security measures are accepted only if they do not hinder users in their actions.<br />



With regard to the application, the actual process of individualization can thus be done later; for example, the technician who performs the maintenance might do so in parallel. The realization of this approach is outlined in the following two subchapters on the basis of the generic use case.<br />

B. Decentralized key management<br />

After the manufacturer has decided on a PKI solution, his business models are secured and flourishing. For further development, the question arises whether leaner solutions with less administration effort exist. Such solutions should also provide long-term security by countering future threats such as quantum computers.<br />

A modern and resource-saving cryptographic approach called Physical Layer Security (PLS) utilizes physical properties instead of mathematical complexity. For this, the methods rely on information, e.g., sound, light effects, or electromagnetic signals, originating from the local environment of the IoT product [7]. Of particular interest is the use of electromagnetic signals, because they are already supported by any radio-enabled IoT product. The physical properties of the transmission channels that are exploited here are:<br />

The channel is symmetric, i.e., both the sender and the receiver observe identical channel characteristics.<br />

Channel observations of third parties are statistically independent of those of the sender and receiver.<br />

The channel is not trivially computable and cannot be simulated, i.e., the channel offers a reasonable amount of entropy.<br />

For reasons of simplicity, only the product authentication method is elaborated here. The product exchanges keys with a digital service platform via a standardized cryptographic protocol. In order for the service platform to be sure that it is communicating with the right product and not with an attacker, the product must be authenticated. For this, a coupling device, which may for example be a smartphone, is placed in the physical proximity of the product and thus verifies the key. This class of protocols is also referred to as context-based security, out-of-band authentication, or distance-bounding protocols [7, 8, 9].<br />
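The reciprocity property these protocols exploit can be sketched in a few lines. The RSSI traces below are invented illustration data; real protocols add information reconciliation and privacy amplification on top of the raw bits:<br />

```python
import hashlib
import statistics

# Simulated received-signal-strength (RSSI) readings in dBm.
# Alice's and Bob's readings are highly correlated (channel reciprocity);
# Eve, at a different position, observes an independent channel.
alice = [-42, -55, -61, -48, -70, -44, -66, -50]
bob   = [-41, -56, -60, -49, -71, -45, -65, -51]
eve   = [-58, -43, -47, -69, -52, -62, -46, -64]

def quantize(samples):
    """One bit per sample: above/below the median of the trace."""
    m = statistics.median(samples)
    return "".join("1" if s > m else "0" for s in samples)

key_alice, key_bob, key_eve = map(quantize, (alice, bob, eve))
assert key_alice == key_bob     # reciprocity -> identical bit strings
assert key_alice != key_eve     # independent channel -> different bits

# In practice the agreed bits are post-processed, e.g., hashed,
# before being used as a session key:
session_key = hashlib.sha256(key_alice.encode()).hexdigest()
```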

C. Key management middleware<br />

A key management middleware (KMM) is an abstraction layer that connects products with digital service platforms independently of each other. Those products can be connected to users of the digital service platforms in order to realize new business models. From the point of view of a product, a dynamic change of the digital service platform can even be realized over its lifetime, for example by using the KMM for this purpose: an over-the-air (OTA) update procedure simply updates the configuration of the product.<br />

The introduction of a KMM allows the manufacturer to decouple production from product and key management. During commissioning, the individual cryptographic material is generated; it does not leave the machine and can be authenticated by a technician or user. This process of authentication is usually called pairing and can be done, for example, via the technician's smartphone. If the technicians have a trusted relationship with the manufacturer, they can unlock individual features of the products, depending on the customer or service contract.<br />

The scenario can be developed further so that different users of different groups can also perform individual pairings with the product. As a result, comfort features can be automatically activated and used. Individual keys that can be inserted (and removed) by the user can also form the basis of protocols for the protection of digital privacy in the near future.<br />

With a KMM, the manufacturer has a solution that allows him to keep production lean, flexible, and independent and to move key management into the commissioning phase. This also excludes scenarios in which the confidential key material needs to be held by untrusted producers or suppliers within the supply chain. With the same level of security as PKI-based solutions, manufacturers can now adapt their architecture to new needs and situations; for example, the manufacturer can change the digital service platform or opt for other hardware configurations. The process has been developed further to enable secure communication even over uninvolved, unmodified, and untrusted networks. Key management that affects production is eliminated; instead, one obtains a set of keys that are created and authenticated during the start-up phase. Obviously, there is no need to place individual secret material into the product during manufacturing: individual key material is brought into the products, and authenticated, dynamically during commissioning.<br />
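A minimal sketch of this decoupling (hypothetical names; the fingerprint check stands in for the pairing step performed, e.g., on the technician's smartphone):<br />

```python
import hashlib
import secrets

class Product:
    """Leaves the factory without any individual secret material."""
    def __init__(self):
        self.secret_key = None          # nothing injected in production

    def commission(self) -> str:
        # Key material is generated on-device at first start-up and
        # never leaves the machine; only a short fingerprint is exposed
        # so that a technician or user can authenticate (pair) it.
        self.secret_key = secrets.token_bytes(32)
        return hashlib.sha256(self.secret_key).hexdigest()[:8]

device = Product()
assert device.secret_key is None        # production line stayed untouched
fingerprint = device.commission()
print("confirm on coupling device:", fingerprint)
```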

Fig. 3. Exemplary architecture of a key management middleware. Business applications are built on platforms. Products and users utilize the swapped-out key management capabilities of KMMs.<br />

Fig. 2. After the product is completed, ad-hoc commissioning can be executed to set up the product.<br />



V. SOFTWARE, HARDWARE AND HYBRIDS<br />

Security solutions in general are resource-intensive compared to the application or communication logic on embedded systems. Cryptographic algorithms in particular consume considerable power and storage. Choosing a security solution for IoT devices is always a compromise between achieving an acceptable security level, performance, flexibility, cost, and power consumption. There are three main categories of security approaches for IoT and embedded devices in general: hardware-based, software-based, and hybrid. Fig. 4 depicts a comparison of the approaches in terms of cost and performance.<br />

Fig. 4. Performance and cost versus hardware-based, software-based and<br />

hybrid solutions [10].<br />

The first category is the software-only security approach, which relies on programming the embedded general-purpose processor (GPP) to accomplish security tasks. A software-only approach achieves the goals in terms of cost and flexibility, but power consumption remains comparably high, there are no improvements in terms of silicon area, and in some cases this approach can exhaust the processing abilities of the GPP. Some examples of software-only approaches are [11]:<br />

Copyright notice and watermarking<br />

Proof-Carrying Code<br />

Custom OS<br />

The second category is the hardware-only approach. Here, ASIC (application-specific integrated circuit) technology is utilized to realize the required cryptographic algorithm in hardware. This approach gives designers the ability to accurately control the energy consumption, computational effort, and timing, but the downside is that it is not suitable when flexibility and cost effectiveness are required. Some examples of hardware-only approaches are [11]:<br />

Read-out protection<br />

Tamper-resistant packaging (for only a certain circuit or<br />

for covering the entire device)<br />

Secure coprocessors (also called secure elements,<br />

trusted platform modules/TPMs)<br />

The third category is the hybrid approach, which utilizes both hardware and software technologies to achieve a balance between processing efficiency, the required security level, and flexibility, while at the same time complying with design constraints. This approach requires collaboration between hardware designers, software designers, and security experts when designing and manufacturing the device.<br />

VI. DESIGN IMPLICATIONS<br />

Attacks on end devices focus mainly on hardware components, where the attacker needs to be physically close to the device [12]. Feasible attacks are:<br />

(1) Node tampering (replacing components to gain access to or alter sensitive information, e.g., cryptographic keys)<br />

(2) Cloning end devices (because of the relatively simple hardware architecture of many devices, cloning is easy)<br />

(3) Denial-of-service attacks (e.g., battery-draining attacks)<br />

The first attack is critical to IT security (esp. confidentiality and integrity) and privacy. The second attack is important with respect to product piracy. Countermeasures to both attacks are intrinsically motivated. However, tamper protection is mainly requested due to the consumer's need for digital privacy, while anti-cloning properties may have a significant impact on the (future) market share of a vendor and are therefore requested by the vendor's business plan and strategy.<br />

The question is how to solve problems (1) and (2) properly, using key management and software/hardware solutions, at minimal cost. From our perspective, the most important design criterion is the choice of a certain level of non-scalability of potential attacks.<br />

First of all, device-individual keys are important to prevent an attack from scaling from one device to many or all other devices. This also includes a secure key establishment and management process with no (or an extremely well protected) single point of failure. Second, the consumer must be able to change the end-to-end (E2E) key material without being an IT expert.<br />

In this context, we believe that shifting the key establishment to an upstream process, e.g., using suppliers or OEMs, is a good solution for vendors with a clear 1:1 relationship between a single user and a digital service platform. Here, TPMs are often a vehicle to securely merge the key material with the device. For devices that might be used by different people and/or connected to different user-specific service platforms, a downstream key establishment process is better suited, e.g., using the resurrecting duckling principle.<br />



The understanding of (as well as the resulting trust in) an ad-hoc trust establishment process, with respect to problem (1) and the related digital privacy, might be larger compared to pre-shared keys.<br />

TPMs are ostensibly argued to be secure against physical attacks, e.g., side channels, fault injection, and key extraction. Therefore, they can be used as a countermeasure against product piracy. They are also more energy-efficient than software solutions. Moreover, the complicated, time-intensive, and security-sensitive authenticated key establishment process, e.g., through certificate signing requests, can be handled by a certified supplier. However, finding security flaws in TPMs and cracking them is more than a professional hobby of hackers and competitors: new security flaws in TPMs are published continuously, cf. [13], and due to their static behavior, (compromised) key material of TPMs in the field is unfortunately hard to change.<br />

Table I and Table II summarize the results of the paper at hand to provide a condensed overview for product managers and decision makers. For the evaluation we introduce three metrics:<br />

Product piracy:<br />
o No product piracy prevention/detection (3)<br />
o Product piracy detection through online capabilities (2)<br />
o Product piracy prevention through hardware security (1)<br />

End-device security (non-scalability of attacks):<br />
o Low (D)<br />
o Medium (C)<br />
o High (B)<br />
o Highest (A)<br />

Costs/complexity of production, logistics and key management:<br />
o High (γ)<br />
o Medium (β)<br />
o Low (α)<br />

TABLE I. PRODUCTION LINE KEY ESTABLISHMENT<br />

                      Master/Group/   Device individual keys           Device and user individual keys<br />
                      Network keys    Symm.   Asymm.                   Symm.   Asymm.<br />
                                              privKey    privKey               privKey    privKey<br />
                      No E2E                  by Server  by Device             by Server  by Device<br />
Software              3Dα             2Cβ     2Cβ        2Cγ           2Bβ     2Bβ        2Bγ<br />
Hardware/Hybrid       3Dα             1Bβ     1Bβ        1Bγ           1Aβ     1Aβ        1Aγ<br />
Non-repudiation       -               -       -          ok            -       -          ok<br />
Offline key server    -               ok      ok         -             ok      ok         -<br />
(easy to protect)<br />

TABLE II. APPROACHES WITH AD-HOC KEY ESTABLISHMENT<br />

                      Master/Group/   Device-individual keys           Device and user individual keys<br />
                      Network keys    Symm.   Asymm.                   Symm.   Asymm.<br />
                                              privKey    privKey               privKey    privKey<br />
                      No E2E                  by Server  by Device             by Server  by Device<br />
Software              3Dα             2Bβ     2Bα        2Bα           2Bβ     2Bα        2Bα<br />
Hardware/Hybrid       3Dα             1Aβ     1Aα        1Aα           1Aβ     1Aα        1Aα<br />
Non-repudiation       -               -       -          ok            -       -          ok<br />
Offline key server    -               -       ok         ok            -       ok         ok<br />
(easy to protect)<br />



VII. DISCUSSION<br />

Finally, the strengths and weaknesses of the individual approaches are briefly discussed and the authors' opinion is shared. It is irresponsible to launch a product with network connectivity that is not sufficiently protected against the heterogeneous attack vectors. Unencrypted protocols allow every conceivable attack on the manufacturer's product and infrastructure. If one chooses protective measures based on a pre-shared common secret, the protective measures can be bypassed with reasonable effort by suitable analyses, e.g., reverse engineering a single product. What is crucial for the attacker is how well his attack scales in breadth, for instance to an entire product batch.<br />

Security based on individual certificates is currently state of the art on the Internet. With proper implementation and compliance with organizational measures, this approach can be considered secure. Due to its complexity, however, this approach is rarely implemented consistently and correctly. Security analyses carried out during various engagements of the authors quickly revealed problems in the implementation of PKI systems, both in the technical and the organizational implementation. Even with strong cryptography and a secure implementation, the security concept falls apart if the secret keys on which the PKI is based have been handed over to insecure infrastructures.<br />

As described above, there are also solutions available that shift key management from the production phase to the commissioning phase. This enables dynamic key management at the user level when the devices are put into operation. Such methods can relieve the manufacturer, who no longer has to resort to weak key management in favor of costs. Furthermore, the consumer's digital privacy needs can be served in a trustworthy way.<br />

Both pure software solutions and pure hardware solutions have different advantages and drawbacks. Hybrids that are developed in tight cooperation between hardware and software engineers could be a promising approach for future security architectures.<br />

The opinion that only TPMs can achieve a decent level of security is relatively popular these days. These modules are known to resist a wide range of attacks, as they are tamper-resistant and the key material normally does not leave the module. However, there is a trade-off between costs and the desired level of security. As serious security flaws in TPMs also surface on a regular basis, manufacturers remain skeptical.<br />

Insecure and thus misused IoT products pose a serious threat to the entire Internet through DDoS attacks. Furthermore, proprietary (in)security concepts or costly lock-in represent a huge investment risk, even for the most lucrative digital business models. Therefore, the authors see a lot of catching up to do in the area of easy integration of IoT security. This is based in particular on the fact that the security measures of the Internet are not per se suitable for applications in the Internet of Things.<br />

VIII. REFERENCES<br />

[1] Rob van der Meulen, “Gartner says 8.4 billion connected ‘Things’ will<br />

be in use in 2017, Up 31 Percent From 2016”, Gartner 2017,<br />

http://www.gartner.com/newsroom/id/3598917<br />

[2] Christof Paar, and Jan Pelzl, “Understanding cryptography”, Springer<br />

Monograph Series, 2009<br />

[3] Mathy Vanhoef, and Frank Piessens, "Key reinstallation attacks: forcing nonce reuse in WPA2", Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017<br />

[4] Mike Ryan, "Bluetooth: with low energy comes low security", 7th<br />

USENIX Workshop on Offensive Technologies, WOOT '13,<br />

Washington, D.C., USA, August 13, 2013<br />

[5] Tobias Zillner, “ZigBee exploited the good, the bad and the ugly”,<br />

Blackhat 2015<br />

[6] Gildas Avoine, and Loic Ferreira, ”Rescuing LoRaWAN 1.0”, eprint,<br />

July 2017<br />

[7] Christian T. Zenger, Jan Zimmer, Mario Pietersz, Jan-Felix Posielek,<br />

and Christof Paar, “Exploiting the physical environment for securing the<br />

Internet of Things”, ACM NSPW 2015<br />

[8] Markus Miettinen, N. Asokan, Thien Duc Nguyen, Ahmad-Reza<br />

Sadeghi, and Majid Sobhani., “Context-based zero-interaction pairing<br />

and key evolution for advanced personal devices”, ACM CCS 2014<br />

[9] Christian T. Zenger, Mario Pietersz, Jan Zimmer, Jan-Felix Posielek,<br />

Thorben Lenze, and Christof Paar, “Authenticated key establishment for<br />

low-resource devices exploiting correlated random channels”, Science<br />

Direct, Computer Networks Journal 2016<br />

[10] Sachin Babar, Antonietta Stango, Neeli Prasad, Jaydip Sen, and Ramjee Prasad, "Proposed embedded security framework for internet of things (IoT)", Wireless Communication, Vehicular Technology, Information Theory and Aerospace & Electronic Systems Technology (Wireless VITAE), 2011 2nd International Conference on, IEEE, 2011, pp. 1-5<br />

[11] Joseph Zambreno, Alok Choudhary, Rahul Simha, Bhagi Narahari, and Nasir Memon, "SAFE-OPS: a compiler/architecture approach to embedded software security", ACM Trans. Embedded Computing 4 (2005), no. 1, pp. 189-210<br />

[12] Shivangi Vashi, Jyotsnamayee Ram, Janit Modi, Saurav Verma, and Chetana Prakash, "Internet of Things (IoT): A vision, architectural elements, and security issues", I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), 2017 International Conference on, IEEE, 2017, pp. 492–496<br />

[13] Matus Nemec, Marek Sys, Petr Svenda, Dusan Klinec, and Vashek<br />

Matyas, ”The Return of Coppersmith’s Attack: Practical Factorization of<br />

Widely Used RSA Moduli”, Proceedings of the 2017 ACM SIGSAC<br />

Conference on Computer and Communications Security, CCS 2017,<br />

Dallas, TX, USA, October 30 - November 03, 2017<br />



The IoT Requires Upgradable Security<br />

Lars Lydersen<br />

Senior Director of Product Security<br />

Silicon Labs<br />

lars.lydersen@silabs.com<br />

Abstract— Many of the things we use on a daily basis are becoming smart and connected. The Internet of Things, or IoT, will improve our lives by helping us reach our fitness goals, reducing resource consumption, increasing productivity, and tracking and securing our assets. Many embedded developers realize the potential benefits of the IoT and are actively developing various applications, from connected home devices to wearables to home security systems. However, along with these benefits come risks. No one wants to design an application that is prone to hacking or data theft. Undesirable events like high-profile hacks can lead to serious brand damage and loss of customer trust and, in the worst case, slow down or permanently reduce the adoption of the IoT.<br />

Keywords— Internet of Things; Security; Hacking; Software<br />

updates.<br />

I. INTRODUCTION<br />

The Internet of Things (IoT) allows us to optimize and improve most aspects of modern life at an unprecedented scale, as billions of IoT devices unleash billions of dollars in economic value [1].<br />

In the race for time-to-market, proper security is inconvenient because it adds cost: development cost, component cost, and complexity. At the same time, in many industries it is not crucial to have adequate security; rather, not having the worst security is the key to not being hacked. The issue is that bad press and major security and privacy incidents might temporarily or permanently slow down the adoption of the IoT for improving our lives. Many are already skeptical about connecting the simple devices we rely on every day, and security researchers are calling the IoT a catastrophe waiting to happen [2]. In fact, quite recently there have been a number of highly publicized hacks that are gaining wide attention [3, 4], so one could argue that the catastrophe is already on its way.<br />

II. THE HACKING OF QUANTUM CRYPTOGRAPHY<br />

The situation resembles that of Quantum Cryptography. Quantum Cryptography [5] (often referred to as Quantum Key Distribution) is a beautiful technology that, unlike other key distribution schemes, promises unconditional security based on the laws of physics. In comparison, most key distribution schemes rely on assumptions about the computational complexity of factoring large numbers or of the discrete logarithm problem. Although the technique was invented in 1984, it took until around the year 2000 before commercial quantum cryptography systems were launched to market. Because they rely on single photons, quantum cryptography systems are complicated to build, and again time-to-market was of the essence. In 2010, the first security loophole that completely broke the security of these systems was published [6]. Quantum Cryptography is theoretically impossible to break, but in reality there were side-channels that were not considered during the design of the systems. Interestingly, no loopholes were discovered until a dedicated team was established to break into such systems. Up until then, the entire industry had been focused on making the quantum cryptography systems robust and getting them to market.<br />

Several things can be learned from the Quantum Cryptography analogy. Notably, it was widely believed that Quantum Cryptography systems were unconditionally secure, until a novel attack proved otherwise. In other words, the systems were secure only against attackers who were not aware of, or were not going to utilize, the blinding attack. This shows that there are always assumptions about the adversary (who are you secure against?), even in cases where one tries to condense the number of assumptions to a bare minimum.<br />

Another interesting lesson from the hacking of quantum cryptography is the importance of upgradable security. When the blinding attacks were discovered, the manufacturers of the systems were given a grace period to patch the vulnerabilities. It turned out that it was possible to close the vulnerabilities via software updates. This article will not discuss the distribution of those software updates (would this require a quantum-secure bootloader?), but the important point is that the security needed to be upgraded over the lifetime of the system.<br />

III. WHAT ATTACKER ARE YOU PROTECTING AGAINST?<br />

Security is not binary: secure or insecure. The question one should ask is: secure against what? The reality is that there are different levels of security, and a device can only be considered secure in the context of an attacker, namely when the level of security is higher than the capabilities of the attacker.<br />

www.embedded-world.eu<br />



Figure 1: Security upgrades are necessary to evolve with the capability level of the attacker. A high initial level of security and hardware primitives (such as extra memory) maximize the likelihood that security issues can be patched in the future.<br />

Moreover, the capabilities of the attacker are typically non-static, and therefore the security level will change over time. The improved capabilities of the attacker can come about in several different ways, from the discovery and/or publication of issues and vulnerabilities to the broader availability of equipment and tools. We have already discussed how this happened in the example of Quantum Cryptography, but let's also review a few examples of how this has happened in classical security.<br />

In 1977, the Data Encryption Standard (DES) algorithm was established as a standard symmetric cipher. DES uses a 56-bit key size, so through increases in available computational power, the standard became vulnerable to brute-force attacks. In 1998, it was shown that the algorithm could be broken via brute force in 56 hours. With DES clearly broken, triple DES (3DES), which basically runs DES three times with different keys, was established as a standard secure symmetric cipher. Regarding the security level of DES, there has been speculation that governments could already break the cipher in 1977, so DES could never resist nation-state attacks. However, since the early 2000s, DES has not even protected against hobbyists with personal computers, due to the widespread availability of computational power.<br />
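The scale of the key-size gap can be illustrated with a back-of-the-envelope calculation; the search rate below is an illustrative assumption, not a historical benchmark figure:

```python
# Rough brute-force cost estimates for DES vs. AES key sizes.
# The keys-per-second rate is an illustrative assumption only.

def years_to_search(key_bits: int, keys_per_second: float) -> float:
    """Expected time in years to find a key (half the keyspace on average)."""
    keyspace = 2 ** key_bits
    seconds = (keyspace / 2) / keys_per_second
    return seconds / (365.25 * 24 * 3600)

RATE = 1e10  # assumed 10 billion keys/s

# DES: 56-bit key falls in well under a year at this rate.
des_years = years_to_search(56, RATE)

# AES-128: the 72 extra key bits multiply the work by 2**72,
# pushing exhaustive search far beyond any feasible effort.
aes_years = years_to_search(128, RATE)

print(f"DES-56:  {des_years:.3f} years")
print(f"AES-128: {aes_years:.2e} years")
```

Each additional key bit doubles the expected search time, which is why the jump from 56 to 128 bits moves brute force from "a machine room" to "physically impossible".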

In 2001, the Advanced Encryption Standard (AES) replaced DES. But even AES does not guarantee absolute security. Even when the algorithm itself cannot easily be broken, the implementation can be hacked, as was the case with Quantum Cryptography. Differential power analysis (DPA) attacks work by measuring the power consumption or the electromagnetic radiation of the circuit performing the cryptography. The side-channel data is then used to obtain the cryptographic keys. Specifically, DPA involves capturing a large number of power consumption traces, followed by statistical analysis to reveal the key. DPA was introduced in 1998, and since then companies such as Cryptography Research Inc. (now Rambus) have sold tools to perform DPA attacks, although at a price that made the tools inaccessible to hobbyists and most researchers. Today, the hardware tools to perform advanced DPA attacks can be purchased for less than $300, and advanced post-processing algorithms are available online free of charge. Thus, the ability to conduct DPA attacks has migrated from nation-states and wealthy adversaries to nearly any hacker.<br />
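The statistical idea behind DPA can be sketched with a toy model. The Hamming-weight leakage, the noise level, and the trace count below are simplifying assumptions; a real attack targets an intermediate such as an AES S-box output and typically needs far more traces:

```python
import random

def hamming_weight(x: int) -> int:
    return bin(x).count("1")

# Toy leakage model: the device "leaks" the Hamming weight of
# (plaintext_byte XOR key_byte), plus Gaussian measurement noise.
SECRET_KEY = 0x3C
random.seed(42)  # deterministic demo

plaintexts = [random.randrange(256) for _ in range(2000)]
traces = [hamming_weight(p ^ SECRET_KEY) + random.gauss(0, 1.0)
          for p in plaintexts]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# For every possible key byte, correlate the predicted leakage with the
# measured traces; the correct guess yields the highest correlation.
best_guess = max(
    range(256),
    key=lambda k: pearson([hamming_weight(p ^ k) for p in plaintexts], traces))
print(f"recovered key byte: {best_guess:#04x}")
```

The attacker never opens the chip: the key is recovered purely from how power consumption co-varies with key-dependent data, which is why limiting an adversary's access to traces matters.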

Now let's discuss these historic lessons in the context of the longevity of an IoT device. The typical lifetime of an IoT device depends on the application, but in industrial applications 20 years is common, and that figure will be used for this discussion. A device that launched in 1998, for example, was once only vulnerable to nation-state attacks; today it must be able to withstand DPA attacks by hobbyists with $300 worth of tools, some spare time and lots of coffee. Predicting the future capabilities of a class of adversaries is very difficult if not impossible, especially over a 20-year timespan. What will the adversary look like in 2040? One might even speculate whether it will be human.<br />

The only reasonable way to counter future attack scenarios is for the security of the device to evolve with the increased capabilities of the adversary, as shown in Figure 1. This requires IoT security that is software upgradable. There is of course functionality that requires hardware primitives, which cannot be retrofitted via software updates. However, it is incredible what can be solved in software when the alternative is a truck roll. And it is clear that it is impossible to predict and account for all future attacks.<br />

IV. CONSEQUENCES FOR IOT PRODUCTS<br />

First, the product needs to be able to receive software updates securely. Let's discuss two aspects of secure software updates: a technical point of view, namely the requirements for the device and software, and a process point of view, specifically the authorization and control of releasing software updates.<br />

From a technical perspective, secure updates involve authenticating, integrity checking and potentially encrypting the software for the device. The software handling such secure updates is the bootloader, typically referred to as a secure bootloader. The secure bootloader itself, along with its corresponding cryptographic keys, constitutes the root of trust in the system and needs to have the highest level of security. This involves placing the bootloader and keys in immutable memory, such as one-time-programmable memory or read-only memory. At that point, any vulnerability in this code is equivalent to an issue in hardware and cannot be fixed in the field.<br />

The authentication and integrity check should be implemented using asymmetric cryptography, with only public keys in the device. This way, it is not necessary to protect the signature-checking key in the devices. Since protecting keys in deployed devices is (or at least should be) harder than protecting keys in the control of the device owner, it is also acceptable to use the same bootloader keys for many devices. Finally, since the device contains and uses only a public key, the signature check is secure against DPA attacks.<br />
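The bootloader-side check can be sketched with textbook RSA. The tiny key values are for illustration only; a real secure boot implementation uses at least 2048-bit RSA or ECDSA with proper signature padding, and the private exponent never leaves the vendor:

```python
import hashlib

# Toy textbook-RSA sketch of secure-boot signature checking: the device
# stores only the PUBLIC key (n, e); the private exponent d stays with
# the vendor's signing infrastructure.

p, q = 61, 53                         # vendor-side secrets (toy primes)
n = p * q                             # public modulus (in the device)
e = 17                                # public exponent (in the device)
d = pow(e, -1, (p - 1) * (q - 1))     # private exponent (vendor only)

def digest(firmware: bytes) -> int:
    # Reduce the hash mod n only because the toy modulus is tiny.
    return int.from_bytes(hashlib.sha256(firmware).digest(), "big") % n

def vendor_sign(firmware: bytes) -> int:
    """Performed once, at the vendor, with the private key."""
    return pow(digest(firmware), d, n)

def device_verify(firmware: bytes, signature: int) -> bool:
    """Performed by the secure bootloader, using only the public key."""
    return pow(signature, e, n) == digest(firmware)

image = b"firmware v1.2.3"
sig = vendor_sign(image)
print("valid image accepted:", device_verify(image, sig))
print("forged signature rejected:", not device_verify(image, (sig + 1) % n))
```

Because verification uses no secret, extracting everything from the device gains the attacker nothing: forging an update still requires the vendor's private key.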

Encrypting the software running on the IoT device has two benefits. First, it protects what vendors consider to be intellectual property (IP) from both competitors and counterfeiters. Second, encryption makes it more difficult for adversaries to analyze the software for vulnerabilities. Encrypting the new software for secure boot does, however, involve secret keys in the device, and protecting secret keys inside a device in the field is becoming increasingly hard. At the same time, newer devices have increased resistance to DPA attacks. Furthermore, a common countermeasure against DPA attacks is limiting the number of cryptographic operations that can take place, making it infeasible for the attacker to gather sufficient data to leak the key. Even though protecting the key is difficult and motivated adversaries will likely extract it, encryption does make the attack harder. Therefore, secure boot should always involve encryption of the software.<br />
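The operation-limiting countermeasure can be sketched as a simple guard around key use. The counter value and class name are illustrative assumptions; real devices enforce the limit in hardware, typically with a monotonic counter in non-volatile memory:

```python
class KeyUseLimiter:
    """Refuses to use a secret key after a fixed number of operations,
    starving a DPA attacker of the traces needed to recover the key."""

    def __init__(self, max_operations: int = 1000):
        # A real device would persist this budget in non-volatile memory
        # so a power cycle cannot reset it.
        self.remaining = max_operations

    def with_key(self, operation):
        """Run one key-using operation, or refuse if the budget is spent."""
        if self.remaining <= 0:
            raise PermissionError("key use budget exhausted")
        self.remaining -= 1
        return operation()

limiter = KeyUseLimiter(max_operations=3)
for _ in range(3):
    limiter.with_key(lambda: "decrypted block")   # allowed
try:
    limiter.with_key(lambda: "decrypted block")   # fourth use refused
except PermissionError as exc:
    print("blocked:", exc)
```

Since a firmware update only needs to be decrypted a handful of times over a device's life, even a small budget leaves a DPA attacker far short of the traces a statistical attack requires.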

Another consequence of secure updates is the likely future need for more memory in the IoT device. This is a complicated trade-off for several reasons. First, software tends to expand to fill the memory available in the device, so a larger memory requires discipline from the software team to leave room for future updates. The other complication is the value of free memory in the future versus the device's initial cost. More memory tends to increase the cost of the device, and this cost must be justified from both the device maker's and the consumer's point of view. Unfortunately, the fierce competition for market share makes many device makers myopic, and they are incentivized to prioritize cost over future security.<br />

Finally, it is important to have a plan for distributing the security updates. For most devices, these updates use the device's existing Internet connection. But in some cases, this requires adding or using physical interfaces such as USB drives (using sneakernet). It is also important to consider that the devices might be behind firewalls or, in some cases, disconnected from the Internet.<br />

Once secure updates are possible from a technical point of view, the question becomes who has the authority to sign and issue software updates. Increasingly commonly for IoT devices, the software is fully owned and managed by the device maker. This means that the device maker should have proven processes in place to internally protect the signing keys and, in particular, to control who can issue updates. This might or might not be combined with authorization from the customer or end user. In fact, given the increase in device maker responsibilities, it might sometimes even be necessary to have a mechanism for forcing updates, leaving the user no ability to opt out. For some devices, the end user must actively download an update and apply it, or at least initiate the update process. In other instances, the update is fully automatic. From a practical point of view, it is important that the scheme is fairly accommodating to different delivery mechanisms, especially if the device maker does not have direct contact with the device but has to rely on 3rd-party gateways or connectivity.<br />

V. SUMMARY<br />

The longevity of deployed IoT devices, combined with the proliferation of adversaries' tools and knowledge, makes it infeasible to create devices that will remain sufficiently secure at any security level for their lifetime. Therefore, for IoT devices to remain secure throughout their practical lifetime, it is necessary to ensure that the security of these devices is upgradable via software updates. But since an update mechanism is also an attack point, it is necessary to deploy secure bootloaders in all programmable devices in the IoT product, and to properly secure the bootloader keys. A secure bootloader is functionality that IoT vendors should expect to get from the IC manufacturers. Furthermore, IoT vendors need to plan up front for the delivery mechanisms and processes to issue updates. Luckily, secure bootloaders are readily available, and the relevant devices are already Internet connected, so enabling secure updates is a minor effort. So there is no excuse not to do it.<br />

REFERENCES<br />

[1] McKinsey & Company, The Internet of Things: Mapping the Value Beyond the Hype, June 2015.<br />
[2] B. Schneier, http://www.wired.com/2014/01/theres-no-good-way-to-patch-the-internet-of-things-and-thats-a-huge-problem/, visited 14 January 2018.<br />
[3] B. Krebs, Hacked Cameras, DVRs Powered Today's Massive Internet Outage, https://krebsonsecurity.com/2016/10/hacked-cameras-dvrs-powered-todays-massive-internet-outage/, October 2016, visited 14 January 2018.<br />
[4] E. Ronen, C. O'Flynn, A. Shamir and A. Weingarten, IoT Goes Nuclear: Creating a ZigBee Chain Reaction, http://iotworm.eyalro.net/iotworm.pdf, visited 14 January 2018.<br />
[5] N. Gisin, G. Ribordy, W. Tittel and H. Zbinden, "Quantum cryptography", Rev. Mod. Phys. 74, 145-195 (2002).<br />
[6] L. Lydersen, C. Wiechers, C. Wittmann, D. Elser, J. Skaar and V. Makarov, "Hacking commercial quantum cryptography systems by tailored bright illumination", Nat. Photonics 4, 686-689 (2010).<br />



Safety and Security from the Inside – a SoC’s<br />

Perspective<br />

Antonio J. Salazar Escobar<br />

Solutions Group, Synopsys<br />

Porto, Portugal<br />

Ralph Grundler<br />

Solutions Group, Synopsys<br />

Mountain View, California, USA<br />

Abstract—We can all agree that today's electronics need to address safety and security considerations; however, these concepts are continuously evolving and are redefined based on perspective. Keeping unwanted eyes away from your children's monitors or protecting your smart home's integrity are real concerns for today's consumers, and with a growing trend toward autonomous technology, from wearables and IoT to the cloud, including artificial intelligence, these are concerns that need to be addressed. That said, what are the issues? What do we need to secure, and how? Although a number of protocols and infrastructures exist to "secure" data communication, what is the role of the endpoint, or better yet, the SoC? From the SoC perspective, this can translate into multiple things, such as encryption, key management, securing data, or even protecting a debug port. Of course, the specifics vary depending on the SoC's purpose. The intellectual property (IP) blocks that comprise today's SoCs are increasingly complex due to an ever-growing number of features and functional expectations; an understanding of the interoperability of the IP is paramount to ascertain where and how to add the necessary elements to keep your SoC secure. This paper discusses the considerations for safety and security from the inside of the SoC, going over the role of the IP, subsystems and overall design.<br />

Keywords—Security IP, Safety, HBM, SoC<br />

I. INTRODUCTION<br />

Today's electronics are following trends where autonomy, near-ubiquity and constant connectivity are common expectations, and with them come expectations of reliability and data guarding. This has motivated the need to understand and build solutions that account for security threats and safety concerns. Nonetheless, security and safety considerations are fluid concepts, continuously evolving to address the emerging pressures, threats and conditions of an ever-changing market and its usage scenarios.<br />
scenarios.<br />

From a process perspective, security/safety considerations<br />

need to be viewed as methodological requirements, ingrained in<br />

all aspects of the product’s lifecycle. The application space and<br />

usage scenarios can have significant impact on the associated<br />

costs (time, resources and procedural), affected by industry<br />

requirements, compliance standards and added market value.<br />

For instance:<br />

• IoT: covers numerous industries, from manufacturing and transportation to healthcare, entertainment, and education. Safety-critical (healthcare) and high-impact (supply chain, facility, fleet management) applications, combined with the connected nature of IoT devices, drive the strengthening of security and reliability requirements as key factors [1]. Security/safety profiles can vary based on the considered usage scenario for the same device. For instance, a remotely accessed camera for monitoring children requires different considerations than one for parking lot surveillance.<br />

• Automotive: although automotive electronic systems such as infotainment and motor control have to meet stringent automotive standards, AEC Q100 and IATF 16949 (formerly ISO/TS 16949), advanced driver assistance systems (ADAS) are accelerating the adoption of the functional safety standard ISO 26262. In addition, efforts such as the E-safety vehicle intrusion protected applications (EVITA) project address security considerations related to remote and physical attacks on the integrity of the system and on inter/intra-vehicular communication, which can cause damage to the system, vehicle and structures, as well as injury to the passengers and/or passers-by.<br />

• Media/Entertainment: content protection has gained significant traction in the past decades, driven by the digital revolution and content sharing. Content protection schemes require a number of collaborative components at all stages of data sharing, advancing methods and frameworks such as digital rights management (DRM), High-bandwidth Digital Content Protection (HDCP), digital transmission content protection (DTCP) and the MovieLabs digital distribution framework (MDDF).<br />

Neither security nor safety is just a feature to be added at a late stage of product development. They need to be an integral part of the design from the start, and at all levels.<br />

II. CONCEPTS OF SAFETY AND SECURITY<br />

The level at which a design needs to consider safety/security requirements is not always evident. For example, researchers were able to demonstrate ways that hackers can put people in harm's way by remotely accessing a Tesla Model S car's control system and gaining access to the car's controller area network (CAN) bus [2]. The vulnerability allowed the researchers to access and control key systems (the braking system, engine and sunroof, among others). Tesla followed up by providing a software update and enforcing a code-signing policy for any new firmware installed on the CAN bus [3]. This approach was able to address the foreseeable issues; however, can a solely software-based strengthening of the attack path resolve the underlying issues? Hardware elements, such as a trusted platform module (TPM) or even a hardware security module (HSM), could be required instead.<br />

Development procedures need to account for all stakeholders, following clear and traceable requirements supported by strong documentation and verification results. Designers need to consider opposing forces during planning. In a "top-down" view, designers must consider both SoC design elements and the safety documentation required by certain specifications/standards. On the other hand, they must take a "bottom-up" approach to visualize how coverage closure can be achieved along with the required documentation, reports, verification specifications, etc. (Fig. 1).<br />

Figure 1: SoC development “top down” and “bottom up”<br />

methodologies for ADAS systems<br />

Addressing safety/security concerns could entail methodological and process-level inclusion that accounts not only for internal processes, but also for 3rd-party methodologies. For instance, when selecting IP or IP subsystems, designers need to ensure that the provider has the necessary processes and infrastructure. While at the design level this might imply monitors, redundancy, or observability, the IP provider must have a "Safety & Quality Culture" to detect, manage, and address possible hazards, associated quality management systems that quantify and qualify errors, and standards of communication that minimize misinterpretations.<br />

Understanding the use case, methodologies and functional dependencies becomes paramount, in particular due to the increasing complexity of module interoperability within a design. In other words, understanding how to construct and use one's defenses to shield against attacks/malfunctions depends on understanding how the different elements that compose your system interact and their associated dependencies. Within an SoC, this translates to understanding the different IP and IP subsystems present in your design, as well as the dedicated and shared resources.<br />

For instance, observability is a functional requirement of error recovery to meet a design's safety goals. By adding functionality such as debug, performance monitors, and watchdog timers in the IP subsystem, the SoC or the system software can decide on the best mediating action when a fault occurs that requires recovery to a safe state. Often this can enhance the possibility of graceful recovery, where the rest of the system is not affected. Other opportunities to leverage observability are measuring or fine-tuning performance to make sure the needs of the complete system are met, either in different systems or in different operating modes. Regardless of the foreseeable benefits, there is inherent risk in any monitor/control path that needs to be weighed.<br />

Foreseeing potential weaknesses can be challenging; however, understanding a design's inter-dependencies can help determine where to focus efforts and when additional logic might be required, either for added robustness (such as built-in self-testing or self-repair components) or for strengthening security (such as gateways or tamper-resistance constructs).<br />

Cost becomes a decision factor for many design choices, related to time, effort, area, and so on. In general, the added value to the end product needs to justify the associated cost; that said, cost can be mitigated from a product life-cycle perspective by introducing good practices. For instance, documentation can prove a valuable resource for cost mitigation. From safety documentation to test plans, checks for linting errors, Clock-Domain Crossing (CDC) and Reset Domain Crossing (RDC), thorough documentation helps ensure that the implementation is clear and more seamless. When running required checks on the final SoC, the SoC engineer has these documents as a reference should any questions arise.<br />

III. SOC ARCHITECTURE CONSIDERATIONS<br />

Today's SoCs are composed of an increasing number of processor, memory, interface, general logic and peripheral blocks, connected through progressively complex interconnect structures. Designing and implementing each individual block and interconnect arrangement has become prohibitively costly, and reliance on pre-designed blocks is commonplace.<br />



additional resources to serve as gateways [6]. Another path is to utilize cJTAG (IEEE 1149.7) in lieu of JTAG, to add a star topology and individual device addressing, which would facilitate overlaying security/safety considerations per block. A simpler strategy might be to use an auxiliary port, such as a UART, to establish a formal authentication procedure that manages access to the JTAG TAP (see Fig. 3-c). In the end, the approach is dependent on the overall needs of the design and even the reusability of the resources for multiple purposes. One could argue that a key consideration is to have flexible and scalable solutions, thus future-proofing for unforeseen scenarios.<br />
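An authentication procedure over an auxiliary port, as discussed above, can be sketched as a shared-secret challenge-response gate. The key handling, message format and class names are illustrative assumptions; a production design would use per-device keys, and often asymmetric challenge-response, enforced in hardware:

```python
import hashlib
import hmac
import os

DEVICE_KEY = b"per-device-debug-secret"  # provisioned at manufacturing

class DebugPortGate:
    """Keeps the JTAG TAP disabled until the debug tool proves knowledge
    of the debug key via an HMAC challenge-response over, e.g., a UART."""

    def __init__(self, key: bytes):
        self._key = key
        self._challenge = b""
        self.tap_enabled = False

    def issue_challenge(self) -> bytes:
        # Fresh random nonce per attempt prevents replaying old responses.
        self._challenge = os.urandom(16)
        return self._challenge

    def submit_response(self, response: bytes) -> bool:
        expected = hmac.new(self._key, self._challenge,
                            hashlib.sha256).digest()
        # Constant-time compare avoids leaking the expected value.
        self.tap_enabled = hmac.compare_digest(expected, response)
        return self.tap_enabled

# Debug tool side (legitimate tool knows the key):
gate = DebugPortGate(DEVICE_KEY)
nonce = gate.issue_challenge()
response = hmac.new(DEVICE_KEY, nonce, hashlib.sha256).digest()
print("TAP enabled:", gate.submit_response(response))
```

The TAP stays in its default-disabled state, so an attacker with physical access to the pins but without the key gains no scan-chain visibility.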

Figure 2: Generic SoC Architecture<br />

Therefore, SoC designers benefit from IP providers that actively collaborate with their customers to provide more than just an IP block. Experts in the IP, with a system-level understanding, can engage with designers at all technical levels to ensure their SoC requirements are met, and help assure that integration, testing, and integrity checks go smoothly all the way to tape-out. By engaging at the subsystem level, all parties can consider clock domain crossing, reset domains, power management, and testability concerns. Moreover, by leveraging IP and interoperability knowledge, a more systemic methodology can be applied.<br />

Of particular concern can be testing and verification considerations, key aspects of assessing requirement traceability and implementation. Embedded instruments and design-for-testability/debug structures have become commonplace within today's SoCs. Hardware/software co-design strategies such as prototyping and emulation can have a significant impact by accelerating bring-up and providing engineers added dedicated time to understand interoperability and IP functionality, as well as to evaluate the impact on the overall design requirements, thus strengthening the security/safety profile.<br />

Validation and debug structures can conflict with security/safety requirements [4] and ultimately affect the SoC architecture. For instance, an IEEE 1149.1 (commonly known as JTAG) port is frequently found in today's systems (see Fig. 3-a). This port has been time-tested and proven to be a useful resource for debugging, programming and data retrieval, among other uses. However, it represents a side-channel risk, providing physical access to system resources.<br />

Implementing an authentication protocol over the JTAG TAP can require the development of SoC-specific communication software and IP wrapper re-design [5], which can prove time- and resource-consuming, and in the end might not properly address the concerns. An alternative could be to structure the boundary scan path to include IEEE 1687 (also referred to as IJTAG) elements (see Fig. 3-b), which can facilitate securing access to specific areas of the SoC by leveraging Segment Insertion Bit (SIB) and Test Data Register (TDR) blocks with<br />
Figure 3: (a) Generic IEEE 1149.1 implementation. (b) IEEE 1687<br />

architecture for embedded instrument connectivity. (c) IEEE 1149.1<br />

security augmented implementation<br />

IV. THE HARDWARE ROOT OF TRUST<br />

Adding ad-hoc security/safety mechanisms to each individual block could prove costly and run against the overall design requirements with regard to performance, area and timing considerations. SoCs can benefit from centralized approaches that capitalize on reusable structures and policies, such as a trusted execution environment (TEE). A properly structured TEE provides a framework in which the system can confidently run its privileged software and functions. Obviously, the architectural approach can have a significant impact on the methodologies to be implemented. For instance, an SoC can be augmented with an external trusted platform module (TPM), while using an integrated hardware secure module (HSM) [7] or a TEE implemented as a security subsystem with its dedicated secure CPU [8] can prove more cost-efficient by saving space on the PCB and avoiding signal-path exposure outside the silicon.<br />

An approach to a centralized, flexible and scalable SoC security framework comes through the implementation of a Hardware Root of Trust, i.e., a hardware-protected TEE and device-unique keys capable of implementing an array of security functions, creating the basis (a root) for trust in the SoC. In general, a Hardware Root of Trust would include a number of modules that support different operations in a runtime-secure, tamper-resistant manner. To better understand the necessary components, it is important to consider the expected functionality and all operation phases. Table I summarizes a number of common security tasks and the associated functions.<br />

Table I. Security goals and associated security functions<br />
Operation Phase | Goal | Security Functions<br />
Power Off | Non-volatile storage protection | encryption/decryption<br />
Power Off | Tamper protection | authentication<br />
Power Off | Device data binding | device-unique key storage<br />
Power Up | Boot image protection | code/data validation, signature check<br />
Power Up | Device identity check | authentication / identification<br />
Runtime | Malicious instruction monitoring | continuous transaction monitoring<br />
Runtime | Access point protection | authentication / permission control<br />
Runtime | Secure communication | integrity and confidentiality<br />

HSMs exist that address the different security functionality<br />

presented in Table I, through the use of a secure CPU and<br />

hardware-based resources within a safe zone or security<br />

perimeter. The secure CPU provides an inherently trusted<br />

location for software components, which support the security<br />

functions, to run. Support blocks would include a secure<br />

memory to serve as a safe space for runtime data, a True Random<br />

Number Generator (TRNG) for producing a high level of<br />

entropy, and even a dedicated clock/counter for reliable time<br />

measurements. Additional blocks, such as hardware<br />
cryptographic accelerators with side-channel attack<br />
countermeasures (error detection, power and timing<br />
randomization, and so on), could further improve performance<br />
(see Fig. 4). The management of<br />

security features is thus facilitated by a programmable security<br />

framework.<br />

Figure 4: Example HSM Architecture from Synopsys<br />
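The defining property of the security perimeter described above can be captured in a short sketch: key material is generated and used inside the module, and only results such as signature tags cross the boundary. The class below is a toy model, not any vendor's API; HMAC-SHA256 stands in for the module's hardware cryptographic accelerators, and all names are invented for illustration.

```python
import hashlib
import hmac
import secrets

class HardwareSecureModule:
    """Toy model of an HSM security perimeter: the device-unique key is
    created inside the module and is used, but never read out."""

    def __init__(self):
        self._device_key = secrets.token_bytes(32)  # stays inside the perimeter

    def sign(self, message: bytes) -> bytes:
        # Key material is used inside the module; only the tag crosses out.
        return hmac.new(self._device_key, message, hashlib.sha256).digest()

    def verify(self, message: bytes, tag: bytes) -> bool:
        return hmac.compare_digest(self.sign(message), tag)

hsm = HardwareSecureModule()
tag = hsm.sign(b"boot image v1")
assert hsm.verify(b"boot image v1", tag)
assert not hsm.verify(b"tampered image", tag)
```

The point of the structure is that callers can obtain and check tags, but no method exposes the key itself, which is what the hardware perimeter enforces in silicon.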

V. SECURITY STANDARDS<br />

HDCP (High-bandwidth Digital Content Protection) is most<br />
commonly used with HDMI (High-Definition Multimedia<br />
Interface) for video content protection, but can also be used for<br />
other key-exchange-protected data encryption, such as audio or<br />
digital files, and appears in DisplayPort and USB applications.<br />
HDCP, like other encryption engines, leverages a TRNG, which<br />
is used not only for key generation for the HDCP cipher but also<br />
to add noise entropy to the system and make power-monitoring<br />
attacks more difficult. Typically, the HDCP block is located close to the<br />

HDMI block but the TRNG is usually located centrally in the<br />

system unless needed as part of a subsystem, as shown in Fig. 5.<br />

Both blocks will require additional memories which are not<br />

shown.<br />

Figure 5: Synopsys HDMI RX Subsystem Showing HDCP and TRNG<br />
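A TRNG in such a system is normally paired with continuous health tests on its raw noise source. As an illustration (not a description of any particular vendor's TRNG), the sketch below implements a repetition count test in the style of NIST SP 800-90B, which flags a source that becomes stuck at one value; the cutoff of 34 is an assumed parameter that in practice depends on the source's assessed entropy.

```python
import secrets

def repetition_count_test(samples, cutoff=34):
    """Flag the raw noise source if any value repeats 'cutoff' times in a
    row; a healthy entropy source should essentially never do this."""
    run_value, run_length = None, 0
    for s in samples:
        if s == run_value:
            run_length += 1
            if run_length >= cutoff:
                return False  # source considered failed
        else:
            run_value, run_length = s, 1
    return True

# A byte stream from the OS CSPRNG should pass easily.
assert repetition_count_test(secrets.token_bytes(4096))
# A stuck-at source fails.
assert not repetition_count_test(b"\x00" * 64)
```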

Communications and networking can be secured at many<br />
levels. MACsec (Media Access Control Security, IEEE<br />

Standard 802.1AE) can be used to secure Ethernet on the link<br />

layer. IPsec (Internet Protocol Security, IETF Standard RFCs)<br />

can be used to secure networks end to end. Both of them can<br />

benefit from a Hardware Root-of-Trust that protects the control<br />

plane software for authentication and key negotiation by<br />

providing secure key generation, and public key cryptography<br />

operations inside the security module. The session keys can also<br />

be injected to the data plane cryptography engines directly from<br />

the Hardware Secure Module, without exposing the keys in the<br />

host system. For higher bandwidth applications, a security HW<br />

accelerator leveraging an encryption hardware engine like AES-<br />

GCM (Advanced Encryption Standard-Galois Counter Mode)<br />

and a TRNG is required. For encrypting data in point-to-point<br />

applications, typically a MACsec hardware accelerator is used<br />

in-line with an Ethernet MAC and the physical layer as shown<br />

in Fig. 6.<br />



Figure 6: MACsec hardware accelerator used in-line with Ethernet<br />

MAC and Physical Layer<br />
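The text notes that bandwidth support can be expanded by speeding up the AES engine clock or adding AES engines in parallel, from 5 Gbps up to terabit rates. That sizing reduces to simple arithmetic; the 16-byte-per-cycle datapath and 800 MHz clock below are invented illustrative numbers, not figures for any real accelerator.

```python
import math

def engines_required(target_gbps, bytes_per_cycle=16, clock_mhz=800):
    """How many parallel AES datapaths are needed for a target line rate.
    Assumes each engine processes one 128-bit block (16 bytes) per cycle;
    both parameters are illustrative assumptions."""
    per_engine_gbps = bytes_per_cycle * 8 * clock_mhz / 1000.0
    return math.ceil(target_gbps / per_engine_gbps), per_engine_gbps

# At these assumed numbers one engine sustains 102.4 Gbps, so a 5 Gbps
# MACsec port needs a single engine, while a 400 Gbps link needs four.
assert engines_required(5)[0] == 1
assert engines_required(400)[0] == 4
```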

In some applications, because of legacy hardware, gate count<br />
for multiple ports, or other system reasons, the MACsec engine is<br />
placed as a look-aside accelerator as shown in Fig. 7.<br />

Figure 7: Ethernet Subsystem showing MACsec engine as look-aside<br />
accelerator<br />

Either implementation is highly flexible, as the supported<br />
bandwidth can be expanded either by speeding up the AES engine<br />
clock or by adding additional AES engines in parallel, supporting<br />
5 Gbps to terabit applications.<br />

IPsec can be implemented as a look-aside accelerator, but<br />
typically it is a hardware accelerator coupled to a processor. IPsec<br />
provides more cipher and security options, such as encrypting just<br />
the data payload for transmission via routers, or encrypting the<br />
entire IP packet and creating a new header. The latter is known as<br />
ESP (Encapsulating Security Payload) tunnel mode and would be<br />
used in VPN (Virtual Private Network) applications.<br />

VI. CONCLUSION<br />

The growing complexity, numbers, and interconnection of<br />
embedded systems require a revision and prioritization of<br />
security and safety considerations, continuously evolving to<br />
meet an ever-growing set of requirements driven by consumer<br />
and market expectations. Not only are connected devices<br />
becoming omnipresent and ever connected, they are becoming<br />
ingrained within everyday life, forcing a reevaluation of the<br />
value associated with protecting personal information and even<br />
safeguarding our physical selves. Addressing the wide and<br />
constantly evolving array of threats requires investment in<br />
understanding these attacks and delivering solutions at all levels<br />
of the design and all stages of the product lifecycle.<br />

Security and safety considerations need to address<br />
foreseeable threats; however, the usage scenarios of the end<br />
products are at times a moving target, requiring some solutions<br />
to be adaptable and scalable so as to be future-proof.<br />
Leveraging third-party IP know-how and subsystem understanding<br />
can save time and lower design risks, while early software<br />
development can strengthen the final results.<br />

REFERENCES<br />

[1] Columbus, L. (2017, December 10). 2017 Roundup of Internet of Things<br />

Forecasts. Retrieved from www.forbes.com:<br />

https://www.forbes.com/sites/louiscolumbus/2017/12/10/2017-roundup-of-internet-of-things-forecasts/#4db5a3411480<br />

[2] Keen Security Lab of Tencent (2017, July 27). New Car Hacking<br />

Research: 2017, Remote Attack Tesla Motors. Retrieved from<br />

keenlab.tencent.com: https://keenlab.tencent.com/en/2017/07/27/New-<br />

Car-Hacking-Research-2017-Remote-Attack-Tesla-Motors-Again/<br />

[3] A. Greenberg, (2016, September 27). Tesla Responds to Chinese Hack with<br />
a Major Security Upgrade. Retrieved from Wired:<br />
https://www.wired.com/2016/09/tesla-responds-chinese-hack-major-security-upgrade/<br />

[4] R. Sandip, J. Yang, A. Basak and S. Bhunia, “Correctness and Security at<br />

Odds. Post-silicon Validation of Modern SoC Design,” DAC 2015, San<br />

Francisco, CA, USA, http://dx.doi.org/10.1145/2744769.2754896<br />

[5] G.M. Chiu and J. Li, “A secure test wrapper design against internal and<br />

boundary scan attacks for embedded cores,” IEEE Trans. on Very Large<br />

Scale Integration Systems, vol. 20, No. 1, pp. 126-134, January 2012.<br />

[6] S. K. K, N. Satheesh, A. Mahapatra, S. Sahoo and K. K. Mahapatra,<br />

"Securing IEEE 1687 Standard On-chip Instrumentation Access Using<br />

PUF," 2016 IEEE Int. Symp. on Nanoelectronic and Information Systems<br />

(iNIS), Gwalior, 2016, pp. 56-61. doi: 10.1109/iNIS.2016.024<br />

[7] A. Elias, (2017, October 30). Understanding Hardware Roots of Trust.<br />

DesignWare Technical Bulletin. Retrieved from www.synopsys.com:<br />

https://www.synopsys.com/designware-ip/technical-bulletin/understanding-hardware-roots-of-trust-2017q4.html<br />

[8] R. Collins, (2017, October 30). Securing High-Value Targets with a<br />

Secure IP Subsystem. DesignWare Technical Bulletin. Retrieved from<br />

www.synopsys.com: https://www.synopsys.com/designware-ip/technical-bulletin/securing-high-value-targets-2017q4.html?elq_mid=9452&elq_cid=32291<br />



Cyber Security for Automobiles<br />

BlackBerry’s 7-Pillar Recommendation<br />

Sandeep Chennakeshu<br />

BlackBerry Technology Solutions<br />

Ottawa, Canada<br />

Abstract—Auto cyber security is on national agendas because<br />

automobiles are increasingly connected to the Internet and other<br />

systems and bad actors can commandeer a vehicle and render it<br />

dangerous, amongst other undesirable outcomes. The problem is<br />

complex, and the point solutions that exist today are fragmented,<br />
leaving a very porous and “hackable” system. BlackBerry<br />

provides a 7-Pillar recommendation to harden automobile<br />

electronics from attack. The solution is intended to make it<br />

significantly harder for an attacker to create mischief. This paper<br />

describes the 7-pillars and how BlackBerry can help.<br />

I. THE PROBLEM<br />

Cyber security for automobiles is on the national agenda of<br />

several countries. Why? There are four industry trends that<br />

make modern cars vulnerable to cyber attacks and potential<br />

failures:<br />

• Automobiles are increasingly accessible by wireless<br />

and physical means to the outside world and bad<br />

actors.<br />

• Software will control all critical driving functions, and<br />
if bad actors can access and modify or corrupt the<br />
software, it can lead to accidents and potential fatalities.<br />
The larger the amount of software in an automobile, the<br />
larger the attack surface.<br />

• Autonomous automobiles will be driverless. By design<br />

these automobiles will talk to each other and<br />

infrastructure by wireless means. This further<br />
exacerbates the vulnerability problem insofar as it increases the<br />
number of access points through which an automobile<br />

may be breached. When this happens, the concomitant<br />

effects could be viral, as one car can infect another and<br />

so on.<br />

• Autonomous automobiles will deploy artificial<br />

intelligence, deep neural networks, and learning<br />

algorithms. These automobiles will learn from context.<br />

This means that software that was installed as being<br />

safety and security certified at production will morph<br />

with time, and there need to be new ways to ensure<br />

that the automobile is still safe and secure over its<br />

lifetime.<br />

This threat is amplified by the following characteristics of<br />

the automobile:<br />

• The electronics in a car (hardware + software) is built<br />

from components supplied by tens of vendors in<br />

multiple tiers who have no common cyber security<br />

standards to adhere to as they build their components.<br />

This makes the supply chain for the car complex and<br />

porous with respect to cyber security. Every vendor and<br />

every component is a point of vulnerability.<br />

• The electronics in a car is a complex network of<br />

distributed computers called electronic control units<br />

(ECUs). An ECU is a piece of hardware and software<br />

that controls an important function in the automobile<br />

such as braking, steering, power train, digital<br />

instrument cluster, infotainment and more banal<br />

functions such as window control and air conditioning.<br />

These ECUs are networked by buses<br />

(physical wires or optical fibre), which carry<br />

messages using some defined protocol. This<br />

interconnected network allows ECUs to talk to each<br />

other. Safety critical and non-safety critical ECUs<br />

interact through this network. Some of these ECUs<br />

can be accessed by wireless means or physical access<br />

(e.g. USB drive). Access means potential infection.<br />

Hence, it is paramount to isolate safety-critical and<br />

non-safety critical ECUs.<br />

• A car lives for 7 to 15 years. Over this period of time<br />

its software must be updated. This time period brings<br />

risk, as hackers become more sophisticated over time<br />

and users of cars may download software that may<br />

contain malware.<br />

Current practices and standards are inadequate. For<br />

example, functional safety standards like ISO 26262 (ASIL-A<br />

to ASIL-D), information sharing like Auto-ISAC, software<br />

coding guidelines like MISRA and the NHTSA 5-Star overall<br />

safety scores (which are more concerned with collision) add value<br />

but do not solve the cyber security and safety problem<br />

described. These are point solutions, not holistic solutions.<br />
There is a need for a much more holistic cyber security solution<br />

for automobiles.<br />

II. EXPERIENCE DRIVES INNOVATION<br />

BlackBerry has a long history of cyber security with deep<br />

involvement in multiple facets of a holistic cyber security<br />

solution. As such BlackBerry understands the issues that need<br />



to be solved and has innovated to solve the same. It is therefore<br />

no surprise that BlackBerry:<br />

• Is regarded as the gold standard in government,<br />

regulated industry and enterprise mobile security.<br />

• Has been a leading supplier of reliable and safe<br />

software to the automobile industry for decades.<br />

• Supplies managed PKI (certificates) services, crypto<br />

tool kits and asset management (key injection) to<br />

major companies.<br />

• Operates a global over the air (OTA) secure software<br />

update service that has updated over 100 million<br />

devices in over 100 countries, with updates every<br />

week for over a decade.<br />

• Has built a safety-aware culture amongst our<br />

automobile software developers through training,<br />

work methods and practices to secure safety<br />

certification and extends this training to its customers.<br />

• Developed and deployed world-class vulnerability<br />

assessment and penetration testing methods and tools.<br />

• Maintains an active and alert security incident<br />

response team that monitors common vulnerabilities<br />

and exposures and reacts to address the same in<br />

products with industry leading response times.<br />

• Has built a FedRAMP certified emergency<br />

notification service that can be used to provide alerts<br />

when issues occur with bulletins on precautions to be<br />

taken by those impacted before a solution is delivered.<br />

• Is building a Rapid Incident Response Network to<br />

share information between enterprises to learn and act<br />

more quickly.<br />

BlackBerry’s experience and the products that it brings to<br />

bear on cyber security are extensive and valuable to the auto<br />

industry. BlackBerry’s DNA is security. We use our deep<br />

experience, vast repertoire of tools, practices and knowledge to<br />

innovate and stay ahead. It is via this accumulated knowledge<br />

and insight that we have developed the 7-pillar<br />

recommendation that is described below.<br />

III. THE 7-PILLAR RECOMMENDATION<br />

Safety and security are inseparable. Our approach to the<br />

problem is to look at the whole system and try to get as close<br />
as possible to a system where there is an absence of unreasonable<br />
risk.<br />

The 7-pillars recommended by BlackBerry are outlined<br />

briefly below. These pillars are described for automobiles but<br />

can be extended to other devices and markets.<br />

A. The 7-Pillar Recommendation<br />

1) Secure the supply chain:<br />

a) Root of trust: Ensure that every chip and electronic<br />

control unit (ECU) in the automobile can be properly<br />

authenticated and is loaded with trusted software,<br />
irrespective of vendor tier or country of manufacture. This<br />

involves injecting every silicon chip with a private key during<br />

its manufacturing stage to serve as the root of trust in<br />

establishing a “chain of trust” method to verify every<br />

subsequent load of software. This mechanism verifies all<br />

software loaded.<br />

b) Code Scanning: Use sophisticated binary static code<br />

scanning tools during software development to provide an<br />

assessment which includes: open source code content, the<br />

exposure of this open source code to common vulnerabilities<br />

and indicators of secure agile software craftsmanship. These<br />

data can be used to improve the software to reduce its security<br />

risk prior to production builds.<br />

c) Approved for Delivery: Ensure that all vendors and<br />

vendor sites are certified via a vulnerability assessment and<br />

are required to maintain a certificate of “approved for<br />

delivery”. This evaluation needs to be performed on a<br />

continuous basis.<br />
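The chain-of-trust boot flow described in 1a), where each stage verifies the next before handing over control, anchored in a value provisioned at manufacture, can be sketched in a few lines. The sketch is illustrative only: it anchors trust in a stored SHA-256 hash, whereas a production root of trust verifies asymmetric signatures derived from the injected key, and all names here are invented.

```python
import hashlib

def digest(image: bytes) -> bytes:
    return hashlib.sha256(image).digest()

def boot_chain(root_of_trust: bytes, stages):
    """stages: list of (image, expected_hash_of_next) tuples.
    Each stage is verified against the expectation carried by the
    previous stage before it is 'executed'."""
    expected = root_of_trust
    executed = []
    for image, next_hash in stages:
        if digest(image) != expected:
            raise RuntimeError("verification failed; halting boot")
        executed.append(image)
        expected = next_hash
    return executed

app = (b"application", None)
loader = (b"bootloader", digest(app[0]))
rot = digest(loader[0])  # anchored in fuses / ROM at manufacture

assert boot_chain(rot, [loader, app]) == [b"bootloader", b"application"]
```

Replacing any stage breaks the chain: a modified bootloader no longer matches the anchored value, and a modified application no longer matches the hash the bootloader carries.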

2) Use Trusted Components:<br />

a) Proven Components with Defense in Depth: Use a<br />

recommended set of components (hardware and software) that<br />

have proper security and safety features and have been<br />

verified to be hardened against security attacks. Create a<br />

security architecture that is layered and deep. For example:<br />

Hardware (System on Chips - SOCs) must be secure in<br />

architecture and have access ports protected (e.g. debug ports,<br />

secure memory etc.). SOCs should store a secret key, as<br />

described above, and act as the root of trust for secure boot<br />

verifying software that is loaded. The operating system must<br />

be safety certified and must have multi-level security features<br />

such as access control policies, encrypted file systems,<br />

rootless execution, path space control, thread level anomaly<br />

detection etc. Applications should also be protected as<br />

described below.<br />

b) Application Management: All applications that are<br />

downloaded should be certified and signed by proper<br />

authorities. A signed manifest file sets permissions for<br />
the resources in the system that the application will and will not<br />
be allowed to access. The applications must always run<br />
in a sandbox and be managed over their lifecycle.<br />
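A minimal model of this manifest-based permission control might look like the following. The HMAC stands in for the proper authority's digital signature, and the manifest fields and key are invented for illustration; the essential behavior is default deny plus rejection of any tampered manifest.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"authority-key"   # stands in for the signing authority's key

def sign_manifest(manifest: dict) -> bytes:
    blob = json.dumps(manifest, sort_keys=True).encode()
    return hmac.new(SIGNING_KEY, blob, hashlib.sha256).digest()

def request_access(manifest: dict, signature: bytes, resource: str) -> bool:
    """Grant access only if the manifest signature checks out and the
    resource is on the manifest's allow-list (default deny)."""
    if not hmac.compare_digest(sign_manifest(manifest), signature):
        return False   # tampered or unsigned manifest
    return resource in manifest.get("allowed", [])

m = {"app": "nav", "allowed": ["gps", "display"]}
sig = sign_manifest(m)
assert request_access(m, sig, "gps")
assert not request_access(m, sig, "can_bus")   # not on the allow-list
m["allowed"].append("can_bus")                 # tampering breaks the signature
assert not request_access(m, sig, "can_bus")
```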

3) Isolation:<br />

a) ECU isolation: Use an electronic architecture for the<br />

automobile that isolates safety critical and non-safety critical<br />

ECUs and can also “run-safe” when anomalies are detected.<br />

b) Trusted Messaging: Ensure that all communication<br />

between the automobile and the external world and between<br />

modules (ECUs) in the car is authentic and trusted.<br />

4) In Field Health Check:<br />

a) Analytics and Diagnostics: Ensure that all ECU<br />
software has integrated analytics and diagnostics software that<br />

can capture events and logs and report the same to a cloud<br />

based tool for further analysis and preventative actions.<br />

b) Security Posture: Ensure that a defined set of metrics<br />

can be scanned regularly when the vehicle is in the field,<br />

either on an event driven (e.g. when an application is<br />

downloaded) or periodic basis to assess the security posture of<br />



the software and take actions to address issues via over the air<br />

software updates or via vehicle service centers.<br />

5) Rapid Incident Response Network:<br />

a) Crisis Connect Network: Create an enterprise<br />

network to share common vulnerabilities and exposures<br />

(CVE) among subscribing enterprises such that expert teams<br />

can learn from each other and provide bulletins and fixes<br />

against such threats.<br />

b) Early Alerts: Typically, when a CVE is discovered<br />

there is a time lag between discovery of the issue and the fix.<br />

This time lag is a “risk period” and it is necessary to alert<br />

stakeholders on what to do with advisories until a fix can be<br />

deployed.<br />

6) Life Cycle Management System:<br />

When an issue is detected, using Pillar 4, proactively reflash<br />

a vehicle with secure over the air (OTA) software<br />

updates to mitigate the issue.<br />

7) Safety/Security Culture:<br />

Ensure that every organization involved in supplying auto<br />

electronics is trained in safety/security with best practices to<br />

inculcate this culture within the organization. This training<br />

includes a design and development culture as well as IT<br />

system security.<br />

IV. HOW DOES BLACKBERRY ADHERE TO THE 7-PILLAR<br />
RECOMMENDATION?<br />

This section shares what BlackBerry provides by way of<br />

solutions and services to the 7-pillar recommendation.<br />

A. BlackBerry’s Solutions and Services<br />

1) Secure the supply chain:<br />

a) Root of trust: BlackBerry’s Certicom unit provides<br />

Asset Management equipment that can be used to inject keys<br />

into chips at silicon foundries or test houses. This system has<br />

been proven in over 450 million smart phone chips deployed.<br />

Furthermore, BlackBerry Certicom’s managed-PKI service<br />

issues certificates that can be included as part of each ECU<br />

while they are being manufactured. These certificates have<br />

been deployed in over 100 million Zigbee devices and 10<br />

million cars.<br />

b) Code Scanning: BlackBerry is developing a novel<br />

binary code scanning and static analysis tool that can provide<br />

a list of open source software files included in a build, as well<br />

as the files that are impacted by vulnerabilities and can list a<br />

wide variety of metrics/cautions that tell a developer what to<br />

improve to reduce the security debt of the code (secure agile<br />

software craftsmanship). This is a cloud based tool and hence<br />

BlackBerry can continuously upgrade the tool with new<br />

“execution engines” (engines that add new capabilities to do<br />

deeper scans) to enhance its capability and even add custom<br />

features for the auto industry.<br />

c) Approved for Delivery: BlackBerry Cyber Security<br />

services can conduct “bug bashes” and “penetration testing”<br />

on products and IT infrastructure to assess if the enterprise can<br />

be certified as secure and approved for delivery.<br />
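A common way for key-injection equipment to give every chip a unique key, as in 1a) above, is to diversify a master key by device identity. The following sketch shows the general idea with an HMAC-based derivation; it is not a description of Certicom's actual scheme, and the key sizes and label are assumptions.

```python
import hashlib
import hmac

def derive_device_key(master_key: bytes, device_id: bytes) -> bytes:
    """Per-device key diversification: the injection equipment holds the
    master key and derives a unique key per chip from its identity, so
    compromising one device does not expose the rest of the fleet."""
    return hmac.new(master_key, b"device-key|" + device_id,
                    hashlib.sha256).digest()

master = b"\x01" * 32            # held only by the asset-management system
k1 = derive_device_key(master, b"chip-0001")
k2 = derive_device_key(master, b"chip-0002")
assert k1 != k2 and len(k1) == 32   # every chip gets its own 256-bit key
```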

2) Use Trusted Components:<br />

a) Proven Components and Defense in Depth:<br />

BlackBerry QNX runs in 60 million cars and offers safety<br />

certified secure software from an operating system and<br />

hypervisor to a host of platforms and components that are<br />

designed with defense in depth security. Further, BlackBerry<br />

can lend its expertise to hardware providers to assess security<br />

risks with their chip and module designs. BlackBerry<br />

Certicom also offers hardened security crypto toolkits and<br />

means to inject hardware with secret keys.<br />

b) Applications: All applications that are downloaded<br />

should be certified and signed by proper authorities. The<br />

signature of the application and a signed manifest file set<br />
permissions for the resources in the system that the application<br />
will get access to. BlackBerry has fundamental patents in this<br />

area and can ensure that applications are signed properly.<br />

Further, when built on the QNX operating system, applications<br />

will be managed with the right access permissions, path space<br />

restrictions and sandboxing to ensure the system is safer.<br />

3) Isolation:<br />

a) ECU isolation: BlackBerry recommends that all<br />

ECUs that are safety critical be run on a network that is<br />

physically isolated from ECUs that have external physical<br />

access or are not safety critical. Any non-safety-critical ECU's<br />
access to a safety-critical ECU should only be mediated by a<br />
security gateway, which enforces strict policies. This gateway<br />

could have a firewall with a single outbound port, similar to<br />

BlackBerry enterprise servers. All traffic will be authenticated<br />

and encrypted with rolling keys. Domain controllers that<br />

manage multiple virtual functions (e.g. braking, steering,<br />

powertrain) can be isolated by a safety certified hypervisor<br />

such as provided by QNX. Any one system can fail without<br />

“crashing” the other virtual systems or functions. This<br />

hypervisor-based isolation can also be used for safety certified<br />

and non-safety certified functions that share a single domain<br />

controller.<br />

b) Trusted Messaging: Messaging between ECUs and<br />

the outside world needs to be trusted. All external<br />

communication can be managed by the security gateway as<br />

described above for safety and non-safety critical ECUs.<br />

Messaging between ECUs should be authenticated and<br />
encrypted. As described in Pillar 1, each ECU has a unique<br />
private key and birth certificate, which can be authenticated by<br />
the security gateway. The gateway can subsequently issue<br />
keys to the ECU, which can be used to sign the messages it sends<br />
to other ECUs, so that receiving ECUs know each message is<br />
signed and comes from an authentic source. Chips can be<br />
designed to render such protocols very fast. BlackBerry<br />

Certicom has developed such a protocol.<br />
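The gateway flow described above, authenticating an ECU against its provisioned key and then issuing a session key used to sign inter-ECU messages, can be sketched as follows. This is a simplified model, not Certicom's protocol: HMAC over a random challenge stands in for certificate-based authentication, and all names are invented.

```python
import hashlib
import hmac
import secrets

class SecurityGateway:
    """Authenticate an ECU by its provisioned device key, then issue a
    session key used to sign inter-ECU messages."""

    def __init__(self, provisioned):
        self._provisioned = provisioned          # ecu_id -> device key
        self._session_keys = {}

    def authenticate(self, ecu_id, challenge, response):
        key = self._provisioned.get(ecu_id)
        if key is None:
            return None                          # unknown ECU
        expected = hmac.new(key, challenge, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, response):
            return None                          # failed challenge
        sk = secrets.token_bytes(16)
        self._session_keys[ecu_id] = sk
        return sk

def sign_msg(session_key, payload):
    return payload, hmac.new(session_key, payload, hashlib.sha256).digest()

def verify_msg(session_key, payload, tag):
    return hmac.compare_digest(
        hmac.new(session_key, payload, hashlib.sha256).digest(), tag)

dev_key = secrets.token_bytes(16)
gw = SecurityGateway({"brake-ecu": dev_key})
challenge = secrets.token_bytes(16)
resp = hmac.new(dev_key, challenge, hashlib.sha256).digest()
sk = gw.authenticate("brake-ecu", challenge, resp)
payload, tag = sign_msg(sk, b"brake pressure: 42")
assert verify_msg(sk, payload, tag)
assert not verify_msg(sk, b"brake pressure: 99", tag)
assert gw.authenticate("rogue-ecu", challenge, resp) is None
```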

4) In Field Health Checks:<br />

a) Analytics and Diagnostics: BlackBerry is developing<br />

analytics and diagnostic clients that can be embedded in<br />

ECUs, which can monitor events and log crashes and<br />

anomalies. These data are sent to the cloud, where they can be<br />
analyzed for valuable information and acted upon.<br />

b) Security Posture: BlackBerry is developing a cloud-based<br />
tool that can access ECUs in the automobile and scan<br />
key metrics either on a periodic basis or on an event-driven<br />
(e.g. when an application is downloaded) basis. This allows<br />
the automaker to scan the automobile in pseudo real time<br />
and take action when there is a security or safety<br />
risk.<br />
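In outline, such a posture check compares the scanned metrics against a policy and reports violations. The metric names and thresholds below are invented for illustration, not actual BlackBerry metrics.

```python
def posture_scan(metrics, policy):
    """Compare scanned metrics against a policy and return the names of
    the violated checks; an empty list means a healthy posture."""
    findings = []
    for name, (op, threshold) in policy.items():
        value = metrics.get(name)
        ok = (value == threshold) if op == "eq" else (value <= threshold)
        if not ok:
            findings.append(name)
    return findings

policy = {"debug_port_locked": ("eq", True),
          "days_since_update": ("max", 90),
          "unsigned_apps": ("max", 0)}
vehicle = {"debug_port_locked": True,
           "days_since_update": 120,
           "unsigned_apps": 0}
assert posture_scan(vehicle, policy) == ["days_since_update"]
```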

5) Rapid Incident Response Network:<br />

a) Crisis Connect: BlackBerry is creating an enterprise<br />

network to share common vulnerabilities and exposures<br />

(CVE) among subscribing enterprises. This allows a network<br />

of skilled resources to share and act faster than if they were<br />

fragmented.<br />

b) Early Alerts: Typically, when a CVE is discovered<br />

there is a time lag to the fix. This time lag is a “risk period”.<br />

BlackBerry is developing a scheme to use its AtHoc<br />

emergency notification service to alert customers on<br />

precautions that can be taken during a risk period until a fix is<br />

deployed.<br />

6) Life Cycle Management System:<br />

BlackBerry has deployed a global, secure over the air<br />

(OTA) software update service. This service is unique in<br />

regard to its scalability, deployment options and security. The<br />

service was derived from its smartphone software update<br />

service, which served over 100 million devices in over 100<br />

countries with outstanding reliability. This service is now<br />

being deployed for automobiles, with the management console<br />

for administering complex deployments.<br />
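On the vehicle side, the update client's essential job is to accept only packages that verify against the OEM's key and move the version forward, blocking rollback to a vulnerable build. The sketch below illustrates this structure with HMAC standing in for the service's real signature scheme; all names and fields are assumptions.

```python
import hashlib
import hmac

OEM_KEY = b"oem-release-key"   # placeholder; a real service signs asymmetrically

def package_update(image: bytes, version: int) -> dict:
    meta = version.to_bytes(4, "big") + hashlib.sha256(image).digest()
    return {"image": image, "version": version,
            "sig": hmac.new(OEM_KEY, meta, hashlib.sha256).digest()}

def install(update: dict, current_version: int) -> int:
    """Accept the update only if the signature verifies and the version
    moves forward (anti-rollback)."""
    meta = (update["version"].to_bytes(4, "big")
            + hashlib.sha256(update["image"]).digest())
    expected = hmac.new(OEM_KEY, meta, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, update["sig"]):
        raise ValueError("bad signature")
    if update["version"] <= current_version:
        raise ValueError("rollback rejected")
    return update["version"]   # flash and advance the version counter

upd = package_update(b"ecu-fw-2.1", version=21)
assert install(upd, current_version=20) == 21
```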

7) Safety/Security Culture:<br />

BlackBerry has developed training to inculcate a safety<br />

and security awareness culture in its organizations working on<br />

safety and security software. This training includes education,<br />

processes, methods, tools and behaviours that are best<br />

practices and can be shared with a wider audience.<br />

While not every aspect of this 7-Pillar defense is deployed<br />

commercially, the overall framework is sufficient to build a set<br />

of standard requirements and criteria to achieve enhanced<br />

safety and security in automobiles.<br />

V. POLICY AND RECOMMENDATIONS<br />

Policy for automobiles is set by government bodies such as<br />
NHTSA (National Highway Traffic Safety Administration)<br />

and DOT (Department of Transportation). Typically,<br />

automakers do not support a common set of policies and their<br />

argument is that it stifles innovation and can raise costs.<br />

However, there have been some policies that have been<br />
successful: mandating seat belts (passive restraint systems) and<br />
airbags (supplemental restraint systems), and the NHTSA 5-Star<br />
scoring system for cars set in 1998 (mainly for front-impact<br />
collision), which was later extended to the entire car in 2011.<br />

Likewise, we feel that NHTSA and DOT can mandate a<br />

minimum set of requirements such as the 7-pillars with certain<br />

criteria to be met to achieve a certain score. A 5-Star scoring<br />

system can be used to initially educate consumers and later to<br />

make their score a differentiator for their automobiles.<br />

However, implementations should not be mandated. This<br />

should be left to the automakers to differentiate their offerings.<br />

The scoring would be set based on how many of the<br />

recommended requirements are followed and how many<br />

objective criteria are met with tests. These requirements can<br />

also secure involvement with insurance companies to create the<br />

basis for insurance rates.<br />

Another area for policy and standardization is vehicle to<br />

vehicle and vehicle to infrastructure communication,<br />

collectively called V2X. This communication protocol,<br />

frequency bands, message structures, latency, security and<br />

misbehaviour management must be standardized. Here again<br />

we recommend that the standard focus on what is required<br />

rather than implementation, which should be up to the<br />

automakers and their eco system. The perfect example is 3GPP<br />

standards set by ETSI. In fact, they could set the V2X standard.<br />

They understand wireless and interoperability and can hence be<br />

efficient in creating such a standard.<br />

Privacy and security of data is another important topic for<br />

policy makers and regulators. For starters, automakers have<br />

expressed concerns regarding their ability to trust the data from<br />

another automobile (especially from a different automaker) or<br />

from the infrastructure (e.g. traffic light) the automobile is<br />

communicating with. In this regard standardization, as<br />

suggested above, will help. An equally important concern is<br />

how does one protect the rich data that an autonomous car<br />

collects regarding a consumer’s preferences and behaviours<br />

such as drive routes, favourite places to visit, travel times,<br />

applications downloaded and even transactions handled via the<br />

automobile.<br />

Autonomous cars pose several challenges to regulators,<br />

automakers and insurance companies. Regulators need to<br />

ensure that there is a national framework and individual states<br />

do not set up fragmented rules. Will automakers make their<br />

own policies for actions that their driverless cars will take<br />

when confronted with a particular situation where machine<br />

learning and judgement can cause different outcomes for 2<br />

different car models or brands? Will such policies and rules be<br />

regulated?<br />

Insurance companies and underwriters will need to work<br />

with lawmakers and automakers to make the liability borne by<br />

an autonomous car proportional to the revenue of each<br />

component, and hence their contributing vendors, and not let<br />

the purchasing departments of automakers make this decision.<br />

These choices and resulting policy or regulations are unclear at<br />

this time.<br />

Intellectual property presents another challenge. There is a<br />

lot of innovation in autonomous cars. Will innovation ever<br />
come to fruition, or will it be mired in inter partes reviews, as<br />
today, and IP wars? Will the auto industry be like the cellphone<br />

industry? Will regulators set rules on the maximum stack of<br />

royalties that can be charged per car with appropriate<br />

allotments to patent holders, using certain rules, or will it be<br />

market driven?<br />

There are many unknowns. However, we need to make a<br />

start. We recommend beginning by defining the key<br />

requirements and criteria that make the automobile safer and<br />

more secure. Towards this end, we suggest starting with the 7-<br />

Pillars Recommendation by BlackBerry.<br />



ACKNOWLEDGMENT<br />

This white paper contains thoughts and ideas from several<br />

contributors from different parts of BlackBerry. Among those<br />

are Adam Boulton, Chris Hobbs, Chris Travers, Christine<br />

Gadsby, Grant Courville, Jim Alfred, John Wall, Justin Moon,<br />

Scott Linke and members of their teams. As such, this is as<br />

much their contribution as the author's.<br />



Secure Boot Essentials<br />

Prevent Edge node attacks by securing your firmware<br />

Donnie Garcia<br />

NXP Semiconductor, IoT and Security Solutions<br />

Austin, Texas, United States of America<br />

Donnie.Garcia@nxp.com<br />

Abstract— The reality of a world filled with smart and aware<br />

devices is a world of attack possibilities against the<br />

technology our society relies upon. Just consider the scenario<br />

where an IoT edge node is attacked by replacing firmware to<br />

allow access to a trusted network. In today’s Internet of Things<br />

(IoT) world of connected devices, phishing scams perpetrated by<br />

re-purposing edge nodes are a real threat. Therefore, a plan for the<br />

development, manufacturing and deployment of IoT edge node<br />

devices must be made. The complexities of life cycle management<br />

create a demanding environment where the end developers must<br />

make use of a range of hardware security features, software<br />

components and partnerships to achieve their security goals and<br />

prevent malicious firmware from being installed onto IoT edge<br />

node devices.<br />

Essential to sustaining end-to-end security is a secure and<br />

trusted boot, which can be achieved with the right MCU<br />

hardware capabilities and ARM® mbed TLS. This paper will<br />

introduce a life cycle management model and detail the steps for<br />

how to achieve a secure boot with a lightweight implementation<br />

leveraging NXP® ARM Cortex®-M based microcontrollers with<br />

mbed TLS cryptography support.<br />

Keywords—Security, IoT Edge Node, Phishing, Secure Boot,<br />

Cryptography, Lifecycle Management<br />

I. INTRODUCTION<br />

Secure designs begin with a security model consisting of<br />

policies, an understanding of the threat landscape and the<br />

methods used to enforce physical and logical security. To protect<br />

firmware execution within today’s threat landscape, there must<br />

be a policy to only allow execution of authenticated firmware, a<br />

secure boot. The methods used to enforce this policy must rely<br />

on MCU security technology to create a protected boot flow.<br />

The boot firmware can contain public key cryptography to<br />

authenticate application code. In addition to these components<br />

that are integrated in the end device, there are tools and processes<br />

that must be leveraged in the manufacturing environment. These<br />

include using manufacturing hardware for code signing and host<br />

programs for provisioning. This paper will provide an overview<br />

of the essential components of implementing a secure boot from<br />

the concept and planning phases all the way through<br />

deployment. To aid the developer, a real-world implementation<br />

using actual hardware and tools will be explored.<br />

II. SECURE BOOT SYSTEM ARCHITECTURE<br />

A. Components of a secure boot<br />

The design of a secure boot to achieve authentication of<br />

application firmware requires the integration of numerous<br />

components. Fig. 1 represents the system level view of the<br />

components and how they interact with one another.<br />

Fig. 1: Secure boot architecture diagram<br />

At the base of Fig. 1 there is the hardware providing physical<br />

and logical security. This is where microcontroller capabilities<br />

are necessary to protect data, perform cryptography and<br />

monitor access to memories and peripherals. Secondly, sitting<br />

above the hardware must be unchangeable boot code. This<br />

code must always run when the device is powered. This boot<br />

code contains low level drivers to set up relevant security<br />

peripherals, a cryptography stack for performing authentication<br />

and or confidentiality of data and in many cases a way to load<br />

application code (a bootloader).<br />

With the unchangeable boot code present on the hardware,<br />

application code that is present or loaded on the edge device is<br />

authenticated upon every boot. Application code can be<br />

changed but the cryptographic authentication applied to the<br />

code by the boot code ensures that the changes are only and<br />

always provided by a trusted entity. Application code can<br />

make use of all or a portion of the microcontroller resources as<br />

determined by the boot code. This is because upon boot, the<br />

www.embedded-world.eu<br />



boot code is always executed first, ensuring proper memory<br />

resource management and protection.<br />
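As a sketch of this control flow, the boot-time decision might look like the following C fragment. This is a toy illustration only: the image header layout is invented, and the FNV-1a checksum merely stands in for the public-key signature verification described later; nothing here is the paper's actual implementation.<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative image header the immutable boot code might find at a
 * fixed flash offset; the real layout is product-specific. */
typedef struct {
    uint32_t length;   /* application size in bytes              */
    uint32_t entry;    /* application entry point                */
    uint32_t tag;      /* expected integrity tag (a stand-in for
                          an ECDSA signature over a digest)      */
} image_header_t;

/* Toy FNV-1a checksum standing in for SHA digest + ECDSA verify. */
uint32_t toy_digest(const uint8_t *data, uint32_t len)
{
    uint32_t h = 2166136261u;
    for (uint32_t i = 0; i < len; i++) {
        h = (h ^ data[i]) * 16777619u;
    }
    return h;
}

/* Runs on every reset, before any application code: the application
 * is allowed to execute only if its tag checks out. */
int boot_authenticate(const image_header_t *hdr, const uint8_t *app)
{
    return toy_digest(app, hdr->length) == hdr->tag;
}
```

On real silicon, a successful check would be followed by a jump to the application entry point, and a failed check by a recovery or halt path.<br />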

Represented on the left of Fig. 1 are tools used in the<br />

manufacturing and deployment of the device. The<br />

microcontroller must be programmed, so tools for key<br />

management, creating firmware files and connecting and<br />

downloading firmware into the device are needed to implement<br />

the secure boot design. With these components considered, the<br />

goal of authenticating application firmware upon every boot is<br />

achievable.<br />

III. SECURE BOOT ESSENTIALS<br />

A. Essential pre-design: Security Model<br />

When designing a secure system, it is important to apply a<br />

security model. A security model is built from policies, the<br />

threat landscape and methods as shown in Fig. 2. This model<br />

provides a framework for understanding and designing to the<br />

security goals of the device. The methods, or how the security<br />

policies are enforced to achieve product goals, are made<br />

possible by the security technology that is integrated into the<br />

embedded controllers.<br />

Fig. 2: Security Model<br />

As an example, for the case of protecting firmware with a<br />

secure boot, a security model would be represented by what is<br />

shown in Fig. 3. As shown in the figure, there is a policy that<br />

only authenticated firmware should ever be allowed to be<br />

executed. The threat landscape typical for an IoT edge node is<br />

attackers will have physical access to the device and so its<br />

communication and debug ports could be exploited. Lastly, the<br />

methods that make use of microcontroller security technology<br />

supporting trust, cryptography and anti-tamper will be<br />

employed to enforce the security policy to the levels demanded<br />

by the threat landscape.<br />

Fig. 3: Example Security Model for Secure Boot<br />

With a security model in place, tradeoffs on the level of security<br />

versus cost and performance can be made during development.<br />

B. Essential hardware features<br />

At the hardware level, there are several functions the<br />

microcontroller must support. These are controlling the boot<br />

flow of the device, protecting memory resources and making<br />

firmware immutable. The following sections will detail how<br />

this is achieved for a specific MCU device, the NXP Kinetis<br />

K28 150MHz device.<br />

1) Control of boot flow<br />

Kinetis MCUs are architected to boot up from internal<br />

memory. This protects against the threat of hijacking an<br />

embedded application by changing an external memory<br />

device. Some Kinetis devices such as the K28 150MHz MCU<br />

have an internal ROM. For this secure boot implementation,<br />

the internal ROM is bypassed so that the trusted secure boot<br />

code can be customized using internal flash. This is done by<br />

setting non-volatile control register bits [BOOTSRC_SEL] as<br />

highlighted in Fig. 4 from reference manual section 7.3.4, Boot<br />

Sequence. Once configured this way, the RESET module state<br />

machine of the K28_150MHz device will ensure that internal<br />

flash will be fetched and the secure boot code will always run.<br />

Fig. 4: Boot Source Select bit<br />



2) NVM Protection<br />

As detailed in section 33.3.3.6 of the K28_150MHz reference<br />

manual, “The FPROT registers define which program flash<br />

regions are protected from program and erase operations.<br />

Protected flash regions cannot have their content changed;<br />

that is, these regions cannot be programmed and cannot be<br />

erased…”<br />

The protected region size is chip specific: regions are<br />

defined as the program flash size divided by 32. In the case of a<br />

2MB flash like the K28_150 device, these are 64KB blocks.<br />

This is substantial space for this secure boot implementation,<br />

but for smaller flash size devices, multiple blocks could be<br />

configured. As shown in Fig. 5, the FPROT3[PROT0] control<br />

bit must be set and the unchangeable boot code placed at<br />

memory map location 0x0000_0000 to protect the secure boot<br />

code.<br />

Fig. 5: Using Flash Block Protection<br />
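Concretely, the pre-configured values could be emitted from the boot project as a 16-byte array that the linker pins to address 0x400. The sketch below is illustrative, not the paper's code: byte offsets follow the reference manual's flash configuration field layout, and the FSEC/FOPT values shown are typical development (unsecured) settings. Note that the FPROT bits are active-low, so a 0 bit marks a region as protected.<br />

```c
#include <stdint.h>

/* Sketch of the 16-byte flash configuration field (0x400-0x40F).
 * Offsets per the reference manual:
 *   0x400-0x407  backdoor comparison key
 *   0x408  FPROT3   0x409  FPROT2   0x40A  FPROT1   0x40B  FPROT0
 *   0x40C  FSEC     0x40D  FOPT     0x40E  FEPROT   0x40F  FDPROT
 * FPROT bits are active-low: clearing FPROT3[PROT0] protects the
 * lowest program flash region (64KB on a 2MB part), which holds the
 * unchangeable boot code placed at 0x0000_0000. */
const uint8_t flash_config[16] = {
    0xFF, 0xFF, 0xFF, 0xFF,   /* backdoor key (unused here)         */
    0xFF, 0xFF, 0xFF, 0xFF,
    0xFE,                     /* FPROT3: PROT0=0 -> region 0 locked */
    0xFF, 0xFF, 0xFF,         /* FPROT2..FPROT0: unprotected        */
    0xFE,                     /* FSEC: unsecured, for development   */
    0xFF,                     /* FOPT                               */
    0xFF, 0xFF                /* FEPROT, FDPROT                     */
};
```

In a production build, FSEC would instead carry the secured value chosen per the chip security settings below, and a real project would place this array with a linker section attribute rather than rely on default placement.<br />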

3) Chip security settings<br />

Once development of the secure boot code is completed,<br />

the chip security setting can be set to disable access from<br />

JTAG/SWD port and restrict data accesses to internal<br />

memory. See reference manual section 9.2 Flash Security. The<br />

only allowable flash command once the security is enabled is<br />

the mass erase operation. This ensures that the data residing<br />

inside the chip cannot be read, only destroyed. Furthermore,<br />

the mass erase operation can also be disabled if the MEEN bit<br />

in the FSEC register is set to %01. See reference manual<br />

section 33.3.3.3 Flash Security Register (FTFE_FSEC).<br />

a) Configuration fields<br />

The control registers for controlling boot flow, setting flash<br />

block protect and chip security settings are all part of a block<br />

of non-volatile registers as detailed in section 33.3.1 Flash<br />

configuration field description. As detailed in Fig. 6, these<br />

registers are physically located in the memory map starting at<br />

address 0x0_400. These registers are also mirrored into<br />

peripheral registers to represent the settings that have been<br />

pre-configured. For the case of flash block protection<br />

(FPROT), the settings can be changed during run time to<br />

increase areas of protection, but never decrease protection.<br />

This allows the secure boot code to dynamically protect<br />

regions of flash by increasing areas of protection if desired.<br />

Fig. 6: Flash Configuration Field<br />
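The increase-only rule for FPROT can be modelled in a few lines. The function below is a behavioural sketch, not device code: since a 0 bit means protected, an update may clear bits (add protection) but is refused if it would set a cleared bit back to 1 (remove protection).<br />

```c
#include <stdint.h>

/* Behavioural model of run-time FPROT updates: protection may only
 * grow. A 0 bit = protected region; a write may clear bits, but any
 * write that would re-set a cleared bit (unprotect) is rejected. */
int fprot_update(uint8_t *fprot, uint8_t requested)
{
    if (requested & ~*fprot) {  /* tries to turn some 0 bit into 1 */
        return -1;              /* refused: would reduce protection */
    }
    *fprot = requested;         /* only clears bits: protection grows */
    return 0;
}
```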

C. Essential Software and Tools<br />

1) ARM mbed TLS<br />

To satisfy the cryptography needed for the secure boot<br />

implementation, the solution uses the MCUXpresso Software<br />

Development Kit (SDK) configured with ARM mbed TLS<br />

support. The MCUXpresso SDK software abstracts the<br />

interface to the available hardware peripherals with a package<br />

consisting of peripheral drivers, middleware, board specific<br />

configurations and application code. Within the package there<br />

are many demo applications. For ARM mbed TLS, there are<br />

two demo applications that can be leveraged to gain a working<br />

knowledge of the software library. These are the test and<br />

benchmark applications.<br />

When ARM mbed TLS support is ported onto Kinetis<br />

devices, the software is configured to make use of available<br />

microcontroller hardware resources. In the case of the<br />

K28_150MHz MCU, this is using the MMCAU cryptographic<br />

accelerator block that assists with AES, DES and hash<br />

operations.<br />

Formerly PolarSSL, the ARM mbed TLS library is<br />

perfectly aligned to the needs of the secure boot development.<br />

The library is well documented and supported with numerous<br />

discussion forum posts and code examples. The library is<br />

available as open source under the Apache 2.0 license, which<br />

allows the code to be used in closed source projects. In<br />

addition, the library was created to be modular, with the<br />

constraints of embedded systems in mind, allowing<br />

developers to fine-tune their use of the library for the needs of<br />

specific applications.<br />

As a representation of the alignment to our needs for secure<br />

boot, Fig. 7 details the main use cases for the library. As<br />

shown on the left, the library has modules related to key<br />

exchange. The specific capabilities provided by the public key<br />

module are represented on the right. Here we see the functions<br />

which we have introduced in the system architecture diagram<br />

(Fig. 1) for generating a public key pair, signing a message,<br />

and verifying signatures. The hardware abstraction provided<br />

by these functions greatly eases the burden on the end<br />

developer for completing the necessary cryptographic<br />

operations.<br />



Fig. 7: ARM mbed TLS Design<br />

The ARM mbed TLS source files which are critical for an<br />

ECDSA implementation of the secure boot are: ec_curve.h,<br />

eccurve_config.h, ecdsa.h, ecdsa.c and ec_curve.c. Importing<br />

these files allows the end developer to make use of the ecdsa<br />

context structure defining the key information and the<br />

supporting APIs related to the ecdsa operations. Specifically,<br />

these APIs include ecdsa_genkey for public key generation. In<br />

addition, for transferring curve information<br />

ec_use_known_curve_param API is used. Depending on the<br />

lifecycle stage of the device, the ecdsa_sign and ecdsa_verify<br />

APIs are used. The curve selection is made in the<br />

eccurve_config.h file. Here you can see the options for a<br />

scalable security level based on the curves supported by mbed<br />

TLS. There is support for ECDSA curves ranging from<br />

SECP192 to SECP521.<br />

2) Bootloader and Provisioning tools<br />

a) Bootloader<br />

Providing the bootloader functions is the NXP Kinetis<br />

Bootloader product known as KBOOT. As shown in Fig. 8,<br />

KBOOT embedded software consists of peripheral interfaces, a<br />

command and data processor and memory interfaces. KBOOT<br />

is provided as full source code and can be modified for end<br />

use. The KBOOT reference manual<br />

details the command API that is supported by the command<br />

and data processor block. In addition to the base commands<br />

for downloading firmware, the command API includes the<br />

ability to direct the device to execute firmware. This<br />

functionality is used in the factory setting to execute specific<br />

functions and extract signature and key data.<br />

Depending on the end device, KBOOT supports<br />

provisioning for all available memory interfaces. For example,<br />

on the K28_150MHz MCU, in addition to RAM and Flash,<br />

KBOOT can manage the placement of data into external serial<br />

NOR flash via the QuadSPI interface.<br />

b) Provisioning Tools<br />

In addition to the KBOOT software which runs on the device,<br />

KBOOT also includes other tool packages that run on<br />

Linux®, Mac® or Windows® host machines. These are<br />

shown below in Fig. 9.<br />

elftosb: processes binaries, elf and SREC files into secure binaries (special formats that work with KBOOT); capable of encrypting files and generating keys.<br />

blhost: command line program that interfaces to a Kinetis MCU running KBOOT; supports every KBOOT command.<br />

Kinetis Flash Tool: graphical user interface to a Kinetis MCU running KBOOT; easier to use than blhost, but not as powerful.<br />

Figure 9: KBOOT Tools<br />

For the processing of binaries, elf files and srecords there<br />

is a tool named elftosb. The elftosb tool takes commands from<br />

BD files. BD, short for boot descriptor file, is an input<br />

command file used by elftosb to create secure binary files (sb<br />

file). The sb file contains commands and firmware data that is<br />

sent to the device that is running the KBOOT bootloader. The<br />

blhost tool is used to process the sb files and interface<br />

to the devices running KBOOT. Also worth mentioning are<br />

the Kinetis Flash Tool and the Kinetis MCU host application,<br />

but these are not used in this implementation.<br />

Both elftosb and blhost are provided as source code and<br />

can be built for different operating systems. Fig. 10 shows a<br />

typical workflow for using the KBOOT tools.<br />

Kinetis MCU Host: a Kinetis K66 application that performs host functionality to a Kinetis MCU running KBOOT.<br />

Fig. 8: KBOOT Block Diagram<br />

There are processor defines for configuring which<br />

peripheral interfaces should be enabled. This serves a dual<br />

purpose as it allows for a way to optimize for code size and<br />

addresses security because it disables interfaces to the<br />

bootloader functions from unsupported peripheral interfaces.<br />

An example of how to use these defines is shown in the<br />

KBOOT reference manual section 11.6 Modifying a<br />

Peripheral Configuration Macro.<br />

Fig. 10: Typical KBOOT Tools Workflow<br />



Moving from left to right, first the elftosb tool is used. Based<br />

on commands passed by a BD file, the elftosb tool takes input<br />

firmware files and creates the secure binary. With a secure<br />

binary, at a different time and place, a host PC running blhost<br />

tool can be used to provision a Kinetis microcontroller like the<br />

K28_150MHz device that is running KBOOT.<br />

IV. LIFECYCLE VIEW<br />

The secure boot design which was detailed in the previous<br />

sections is a critical component to maintaining the lifecycle<br />


In the development phase, the product owner develops a<br />

factory security tool and security tool firmware. This tool is<br />

used to generate public key/private key pairs, sign application<br />

firmware and interface securely to a cloud service provider.<br />

The product owner also develops the root of trust firmware<br />

such as the secure bootloader. This firmware performs secure<br />

boot and secure boot loading. This stage is where sensitive<br />

data such as product IDs and service IDs are generated. These<br />

secrets can be passed to the cloud service provider in the<br />

development phase.<br />

For the case of a controlled manufacturing site that is in a<br />


Fig. 11: Lifecycle view for a secure IoT edge node<br />

of the device. As shown in Fig. 11, the IoT edge node device<br />

flows through several phases. These are shown on the left of<br />

the diagram as Development, Manufacturing and Deployment.<br />

Within these stages of the lifecycle the product could be in<br />

Secure Environments or Less-Trust Environments as shown at<br />

the top of the diagram. For example, in the development stage,<br />

application code could be developed by external developers<br />

which would be in a Less-Trust Environment. Alternatively, if<br />

the firmware development is handled by trusted internal<br />

developers then this would be in the more Secure<br />

Environment.<br />

secure environment, the factory security tool is used only to<br />

sign application firmware. Then standard tools can be used to<br />

place the root of trust firmware and signed application<br />

firmware. Microcontroller security mechanisms are used to<br />

protect the root of trust firmware. For the scenario where a<br />

less-trusted manufacturing site is used, then the factory<br />

security tool could be deployed there. The factory security tool<br />

can interface to the cloud service provider securely to get the<br />

root of trust firmware. The root of trust firmware must be<br />

securely placed on to the end device. Once the secure<br />

bootloader is on the end device, then the device will only<br />

accept and execute signed application code.<br />



To implement such a lifecycle requires preset agreements<br />

with multiple parties, such as application code developers,<br />

external manufacturing sites, cloud service providers and<br />

component manufacturers. There are policies and audits which<br />

need to be in place. The complexities of lifecycle management<br />

create a demanding environment where the end developer<br />

must make use of all available hardware, software and<br />

partnerships to achieve their security goals and prevent<br />

malicious firmware from being installed onto IoT edge node<br />

devices.<br />

Throughout the lifecycle, there are important policies that<br />

govern how the device should be handled. These are detailed<br />

below as the Security Policies, Firmware Loading Policies,<br />

Assembly Policies and User Policies. Some examples are<br />

shown in the following Table 1.<br />

addition to bootloader functions are to generate a PUB/PRIV<br />

key pair and to generate the signature for application code<br />

using the private key.<br />

2) Secure Boot Firmware<br />

This bootloader application is for use in a deployed device.<br />

The main security functions in addition to bootloader functions<br />

are to check the signature of application code using the public<br />

key, and only allow execution of the application code if the<br />

signature is authentic.<br />

The firmware for the Factory SecTool and Secure Boot is<br />

completely independent of application code development.<br />

Application code development can occur on a different target<br />

device, by different developers. As shown in Fig. 12 below,<br />

TABLE I. POLICIES FOR LIFECYCLE MANAGEMENT<br />

Software security policies: ensure that the application code maintains the security of the end device. Examples: no prompts for sensitive data such as Enter PIN or password; a list of words that the end device should not say.<br />

Firmware loading policies: ensure that the proper steps are taken and controls are in place to protect the programming of the end device. Examples: password control for firmware source binaries; upon receiving the microcontroller, the device should be completely erased to ensure that it is in a known state (no unwanted firmware).<br />

Assembly policies: ensure that only approved components are used. Example: all components should be inspected for proper markings during assembly.<br />

User policies: provide guidelines for the end user to maintain the security of the device. Examples: visual inspection of the device for tampering; the device should be physically protected behind locked doors.<br />

A. Lifecycle with target SoC and tools<br />

The following section relates the NXP Kinetis K28<br />

150MHz device secure boot implementation and KBOOT<br />

tools to the lifecycle view introduced in Fig. 11.<br />

B. Development Stage<br />

During the product development stage, there are two<br />

separate firmware developments which are done in the secure<br />

environment (please refer to Fig. 11). Both developments are<br />

based on the software described in the previous sections,<br />

KBOOT and ARM mbed TLS.<br />

The two developments are:<br />

1) Factory Security Tool Firmware<br />

This bootloader application is for use in a secure<br />

manufacturing environment. The main security functions in<br />

memory mapping on the left, this development can follow a<br />

traditional development flow for microcontrollers with<br />

firmware located at the start of the NVM space. During the<br />

manufacturing stage, the resulting firmware files can be<br />

relocated as shown on the memory mapping on the right to<br />

work with the secure boot firmware, which includes KBOOT<br />

and mbed TLS cryptography.<br />



Fig. 12: Memory Map for App. Development<br />

C. Manufacturing Stage<br />

After the application code has been audited against security<br />

policy guidelines as shown in Fig. 11, the following steps can<br />

be taken to complete the manufacturing of end devices that<br />

use a secure boot. Steps 1 and 2 are represented at the top of<br />

Fig. 13, and you’ll find steps 3 and 4 at the bottom.<br />

1) Application SREC is combined with Factory BD file to<br />

create the Factory Secure Binary (Factory.SB)<br />

2) Using HW with the Factory Security Tool firmware, the<br />

Factory.sb is downloaded and blhost commands are used to<br />

extract binaries for signature and public keys.<br />

3) Application SREC is combined with signature binary to<br />

make the Production secure binary (Production.sb)<br />

4) Production secure binary is used to program final hardware<br />
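As an illustrative sketch of these steps, the tool invocations might look as follows. The file names, serial port and BD-file grammar here are assumptions based on the KBOOT tool documentation, not taken from the paper:<br />

```
# Steps 1/3: wrap the application SREC (plus, for production, the
# signature binary) into a secure binary; a BD file drives elftosb:
#
#   sources { appImage = "application.srec"; }
#   section (0) { erase all; load appImage; reset; }
#
elftosb -V -c production.bd -o production.sb

# Steps 2/4: send the secure binary to a device running KBOOT:
blhost -p COM3 -- receive-sb-file production.sb
```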

Once a public key/private key pair is generated in steps 1 and<br />

2, the programming of the production image can occur on all<br />

devices that will be protected by the same private key.<br />

Variations of this implementation can be made to address<br />

multiple key pairs and roll back protections. For example,<br />

multiple public key/private key pairs can be generated and<br />

stored onto the device during the manufacturing stage and then<br />

selected based on version settings.<br />

V. CONCLUSION<br />

In today’s connected world, the protection of firmware is<br />

an essential component to delivering solutions that safeguard<br />

device manufacturers and their customers. Essential to<br />

sustaining end-to-end security is a secure and trusted boot,<br />

which can be achieved with the right MCU hardware<br />

capabilities and ARM mbed TLS. Though a secure boot is<br />

achievable, as demonstrated in the previous sections, the end<br />

design is closely linked to the target platform. The developer<br />

must have detailed knowledge about the hardware and tools.<br />

As the drive towards lower power and higher performance<br />

efficiency for IoT edge nodes continues, there exists an<br />

opportunity for standardization and abstraction to ensure<br />

adoption of secure boot for more end designs.<br />

REFERENCES<br />

[1] http://www.nxp.com/docs/en/reference-manual/KBTLDR200RM.pdf<br />

[2] http://www.nxp.com/docs/en/reference-manual/K28P210M150SF5RM.pdf<br />

[3] https://tls.mbed.org/high-level-design<br />

[4] https://tls.mbed.org/module-level-design-public-key<br />

Fig. 13: Manufacturing with KBOOT Tools<br />



Security Filters for IoT Domain Isolation<br />

Dr. Dominique Bolignano<br />

Prove & Run<br />

Paris, France<br />

dominique.bolignano@provenrun.com<br />

Abstract — Network segregation is key to the security of the<br />

Internet of Things but also to the security of more traditional<br />

critical infrastructures or SCADA systems that need to be more<br />

and more connected and allow for remote operations. We believe<br />

traditional firewalls or data diodes are not sufficient considering<br />

the new issues at stake and that a new generation of filters is<br />

needed to replace or complement existing protections in these<br />

fields.<br />

Keywords— Internet of Things; firewalls; filters; data diodes;<br />

security; formal methods; embedded devices; connected car.<br />

1 INTRODUCTION<br />

Modern IoT (i.e. Internet of Things) security architectures<br />

generally make use of partitions to define security domains and<br />

try to impose strict information-flow policies on the messages<br />

that transit from one domain to another. Typically, this is<br />

achieved by forcing all messages to transit through dedicated<br />

filters. The correct implementation of such filters is essential<br />

for the whole security of the system as the only path available<br />

to hackers to perform remote attacks, when the architecture is<br />

well designed, is to send triggering messages through these<br />

filters. Gateways in new automotive architectures are<br />

a representative example of devices that implement filters. They<br />

are typically used to control the information flows between<br />

various security domains, such as the powertrain domain, the<br />

infotainment domain, the comfort domain, etc.<br />

The proposed approach is meant to be applied to filters but<br />

only in situations where it is possible to explicitly identify and<br />

characterize commands and responses that are allowed to go<br />

through a given filter. As we will see this is always the case (or<br />

should always be the case) to meet the new security<br />

requirements arising when connecting critical systems (e.g.<br />

Cyber Physical Systems), or connecting SCADA systems (e.g.<br />

Operational Technology Systems connected to the IT<br />

infrastructure), in embedded automotive, aeronautic, or railway<br />

equipment, and more generally the IoT. For the IoT, this is<br />

mainly due to the fact that the large volume of connected<br />

devices creates huge opportunities and extremely good<br />

business models for hackers.<br />
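In practice, such a filter reduces to an explicit whitelist: every command that may cross the domain boundary is enumerated and characterized, and anything not characterized in advance is dropped. The fragment below is a purely illustrative sketch; the message identifiers, length bounds and rule table are invented for the example.<br />

```c
#include <stddef.h>
#include <stdint.h>

/* A message is forwarded only if its command ID is explicitly
 * whitelisted AND its payload length is within the range that
 * command allows -- unknown commands never pass (default deny). */
typedef struct {
    uint16_t id;       /* command identifier            */
    uint8_t  min_len;  /* allowed payload length range  */
    uint8_t  max_len;
} rule_t;

const rule_t rules[] = {
    { 0x0101, 1, 2 },  /* e.g. set cabin temperature    */
    { 0x0204, 4, 4 },  /* e.g. read odometer response   */
};

int filter_allows(uint16_t id, size_t len)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        if (rules[i].id == id)
            return len >= rules[i].min_len && len <= rules[i].max_len;
    }
    return 0;          /* default deny */
}
```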

In this paper we will first explain why there is a new<br />

challenge. We will then explain how this new challenge can be<br />

addressed in general, and then show how the security of the<br />

more demanding filters can be achieved.<br />

2 THE NEW CHALLENGE WITH REMOTE ATTACKS<br />

In this section we will show that the new challenge is<br />

mainly due to the existence of new business models for<br />

hackers. In the past, reaching an acceptable level of security<br />

mainly boiled down to implementing a few basic ingredients:<br />

cryptographic algorithms and protocols (such as digital<br />

signatures and encrypted communications), secure elements,<br />

etc. However the advent of the IoT and the need to connect<br />

remotely to SCADA and critical systems are changing the<br />

security paradigm. There is now a real business model for<br />

hackers and organized crime syndicates in performing remote<br />

attacks. By investing a few million euros, they are now almost certain to identify potential large-scale remote attacks on current connected architectures, with a very high return on investment. In the IoT<br />

industry hackers can for example send a few devices to<br />

"reverse-engineering consultants" located in countries where<br />

this can be done legally or without too much risk. With the<br />

proper reconstructed documentation, they can then ask<br />

"creative" hacking consultants to prepare an attack. With such<br />

a budget at hand it is almost always possible to identify<br />

dramatic large-scale attacks, at least by exploiting bugs and<br />

errors that always exist in the OS and protocol stacks that are<br />

included in the Trusted Computing Base (TCB) of a device.<br />

Such errors can usually be found in the software architecture,<br />

or in the design, implementation or configuration of a device.<br />

www.embedded-world.eu<br />



The business model is usually obvious: such attacks typically make it possible at least to block the normal operation of the targeted infrastructure, causing damage far beyond the investment. In many cases such attacks<br />

could even create more dramatic situations that might lead to<br />

loss of life. An attack similar to the well-publicized Jeep attack<br />

would correspond roughly to an investment of less than half a million dollars (an estimate based on the authors' detailed description of the identification phase of the attack) and, if performed on a massive scale by criminal organizations, could have led to the death of a very large number of people. These new business models (which, in the case of the IoT, exploit the combination of high volume and potential physical impact) bring unprecedented requirements on resistance to logical attacks; this is clearly a disruption in security needs.<br />

Security risks for high-volume transactions (such as payment systems) were (and are) mitigated by proper risk management. Such risk management techniques are far less effective (and in some cases not applicable) when it comes to<br />

IoT systems, as actions cannot be delayed or canceled as<br />

financial transactions can be. It is for example not practically<br />

possible to detect and block in real time an attack that would<br />

make all cars of a certain model turn right at a given time.<br />

In the next section, we elaborate on the fact that it is always possible to exploit the weaknesses of the OSs or protocol stacks that are part of the TCB.<br />

2.1 The Challenge of Securing OSs, Kernels and Protocol<br />

Stacks<br />

Various public databases (such as [2]) provide statistics on<br />

public bugs or vulnerabilities on all kinds of software. These<br />

databases clearly show that current OSs and kernels suffer<br />

from a great number of errors and weaknesses, no matter who<br />

writes them, and no matter how long they have been in the<br />

field. For example, new errors are still reported in the<br />

thousands every year on “well-known” systems such as Linux.<br />

This situation is basically due to the inherent complexity of<br />

such OSs and kernels, which rely more and more on complex<br />

and sophisticated hardware. OSs and kernels are by nature<br />

concurrent and hugely complex because of the need to support<br />

various kinds of peripherals (interrupt handling becomes<br />

more and more difficult), the performance objectives (e.g.<br />

complexity of cache management), the resource consumption<br />

issues (e.g. need for a sophisticated power management), etc.<br />

This complexity increases with time, increases with new IoT<br />

architectures and increases when it comes to real<br />

microprocessors (as opposed to microcontrollers).<br />

Even Trusted Execution Environments (TEEs), i.e. small<br />

security OSs that were introduced to very significantly reduce<br />

the size of the TCB, are regularly attacked ([11], [12], [16]).<br />

The real challenge (and the only known solution) is to produce OSs, kernels and software stacks for the TCB and to demonstrate that they are as close as possible to “zero-bug”, i.e. free from errors (in their design and implementation) that could potentially be exploited for logical attacks.<br />

Traditional software engineering techniques such as<br />

exhaustive testing or code inspections are clearly not sufficient<br />

anymore to bring the level of assurance that is needed to secure<br />

complex open systems. This is due to the fact that there are too<br />

many different situations to consider for a kernel designer or<br />

tester and no real methods to review the quality of such kernel<br />

code in a systematic way, besides the use of proof techniques.<br />

Instead we believe the only valid response to such<br />

complexity is a special class of formal methods, which are<br />

known as deductive techniques or proof techniques. Even other<br />

formal methods such as static analysis or model checking are<br />

not fully addressing the problem at hand. More details are<br />

presented in [1].<br />

2.2 Two Representative Attacks<br />

Many attacks on IT systems are reported every day. Here<br />

we use two very different ones as a matter of illustration. The<br />

first one is the so-called 2015 attack on the Ukrainian power<br />

grid [14]. It is quite representative of problems coming from<br />

the complexity of the general architecture of large-scale IT<br />

systems. Such attacks indeed exploit weaknesses in the general<br />

architecture or in its configuration.<br />

The second is the so-called Heartbleed attack which is one<br />

of the many attacks and vulnerabilities that were found on<br />

SSL/TLS over time [12]. The latter attack is very representative<br />

of attacks that exploit the complexity of the software itself.<br />

Such bugs are very similar to the bugs that can be found in<br />

error-prone software components such as OS kernels or<br />

communication stacks.<br />

Errors are not only found in software. They can also happen<br />

at the hardware level and lead to logical and remote attacks<br />

such as the recently announced Meltdown [7] and Spectre [8]<br />

attacks. Other cache attacks had been demonstrated in the past<br />

([9], [10]) and new ones will probably be found in the future.<br />

We believe that hardware designs should also eventually be formally proven, at least for their TCB parts (MMU, ARM<br />

TrustZone mechanism, etc.). This will not prevent non-logical<br />

attacks such as the Rowhammer attack presented in [4], but it<br />

would prevent at least a large majority of logical attacks.<br />

However, errors in hardware that can be exploited for large<br />

scale remote attacks are very rare (one or two are found every<br />

year as of now) and they can usually be addressed by proper<br />

software countermeasures. Prove & Run has developed<br />

ProvenCore, a formally proven OS kernel that relies on only a few simple hardware mechanisms and can be used to implement a very secure firmware update mechanism, so that the risks from such hardware attacks are not only minimized but, when they materialize, can be easily fixed through a very robust over-the-air firmware update.<br />

3 ADDRESSING THE NEW CHALLENGE<br />

The proposed approach to design an extremely secure filter<br />

builds on the approach we presented in [1]. We recall here<br />

briefly this approach before presenting new ideas that can be<br />

used to develop this filter. Some of these ideas are patent<br />

pending.<br />



First it is important to use state-of-the-art security<br />

methodologies such as the one proposed by the Common<br />

Criteria framework. In particular we assume that for each<br />

architecture and use case a proper risk analysis and threat<br />

model are made available, and that a proper security target has<br />

been defined and is used to guide the security architect, the<br />

developers, the testers and the security evaluator. It is worth<br />

noticing that such documents can be reused from one<br />

evaluation to another so as to further reduce costs.<br />

We also recommend as described in [1] to explicitly<br />

describe a clear “security rationale” that fully explains the<br />

hypotheses, conditions and reasons why the security<br />

architecture meets the desired security level. The security<br />

rationale should not only describe the countermeasures used to address each threat but also provide a justification as detailed and convincing as an informal mathematical proof.<br />

The last step of the approach is to define an architecture<br />

that is based on a TCB that contains only formally proven<br />

kernels and protocol stacks. So in the end the security rationale<br />

for the most complex parts of the TCB must rely on formally<br />

proven software (and a tool is necessary to check that the proof itself is free of errors), whereas the other, simpler parts of<br />

the security rationale are presented as an informal proof which<br />

can be easily audited by experts. Now, instead of formally verifying large OSs and kernels such as Linux or Android, where new features and drivers are added on an ongoing basis to address new requirements, we propose to use a separate formally proven secure OS kernel (in our case ProvenCore) to handle the peripherals that need to be secured and to run secure applications, in a way that allows us to:<br />

• Retain the normal OS (for example Linux, Android or<br />

any other proprietary OS or RTOS) and thus benefit<br />

from all its features,<br />

• Push the normal OS outside of the TCB, so that any<br />

error in the normal OS cannot be used to compromise<br />

the TCB,<br />

• Use a proven OS to perform security functions.<br />

Our formally proven kernel, ProvenCore, was designed in a<br />

way that makes it generic enough to be used as COTS<br />

(Commercial Off-the-Shelf) in virtually any IoT architecture.<br />

We describe here how this can be done on ARM<br />

architectures that account for the vast majority of the IoT<br />

market, but the same approach can be transposed to other CPU<br />

architectures.<br />

On ARM architectures and in particular on the Cortex-A<br />

and Cortex-M families of ARM microprocessors and<br />

microcontrollers, a security mechanism called TrustZone<br />

provides a low-cost alternative to adding a dedicated security<br />

core or co-processor, by splitting the existing processor into<br />

two virtual processors backed by hardware-based access<br />

control mechanisms. This lets the processor switch between<br />

two states, i.e. two worlds, typically the “Normal World” on<br />

one side and the “Secure World” on the other side. Therefore<br />

TrustZone can be used as an extremely small, security-oriented asymmetric hypervisor that allows:<br />

• The so-called Normal World to run on its own,<br />

potentially oblivious of the existence of the Secure<br />

World and,<br />

• The Secure World to have extra privileges such as the<br />

ability to have some part of the memory, as well as<br />

some hardware peripherals, exclusively visible and<br />

accessible to itself.<br />

In the proposed architecture the proven secure OS kernel,<br />

i.e. ProvenCore in our case, runs in the Secure World, and the<br />

rich but error-prone OS (Linux, Android, etc.) runs in the<br />

Normal World.<br />
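The Normal/Secure World split described above can be illustrated with a toy model (illustration only; in a real TrustZone system this partitioning is enforced by the bus and memory-system hardware, not by software checks like these):

```python
class TrustZoneMemory:
    """Toy model of TrustZone-style partitioning: each page is tagged
    secure or non-secure; the Secure World can access everything,
    while the Normal World only sees non-secure pages."""

    def __init__(self):
        self.pages = {}  # page number -> (secure?, value)

    def write(self, world, page, value, secure=False):
        if world == "normal" and secure:
            raise PermissionError("Normal World cannot create secure pages")
        self.pages[page] = (secure, value)

    def read(self, world, page):
        secure, value = self.pages[page]
        if secure and world != "secure":
            # the hardware access control makes secure pages invisible
            raise PermissionError("secure page invisible to Normal World")
        return value
```

In this model the Normal World can run "oblivious" of the Secure World: it simply never observes the pages reserved for it.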

4 PROPOSED APPROACH AND SOLUTION<br />

Here the key assumption (or in other words the requirement<br />

that is to be met for the proposed solution to be applicable) is<br />

that the list of commands and arguments that we want to allow<br />

in each direction can be made explicit and fully characterized.<br />

In other words the security architect or administrator must be<br />

able to express a precise filtering security policy on the<br />

commands and arguments that must go across the filter from<br />

one security domain to the other. This may be difficult to do<br />

within a standard information system: when security is not<br />

considered a high priority the administrator is often not in a<br />

position to fully characterize all the commands and arguments<br />

in use nor even to identify all information flows. However,<br />

defining such a filtering security policy is a must as soon as a<br />

high level of security is needed e.g. for connected SCADA and<br />

critical systems. If a filtering security policy goes beyond a few<br />

trivial commands taking no arguments, then the<br />

implementation of this policy as a filter must be formally<br />

proven. In the next section we will explore how formally<br />

proven filters can address the challenge of critical IoT systems.<br />
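As a concrete illustration of such an explicit characterization, a filtering security policy can be captured as a whitelist that enumerates every allowed command together with the exact domain of each of its arguments (the command and argument names below are invented for this sketch):

```python
# Hypothetical explicit filtering security policy: every allowed
# command is listed with the exact domain of each of its arguments.
ALLOWED_COMMANDS = {
    "set":  {"mode": {"on", "off"}},
    "read": {"sensor": {"temp", "pressure"}},
}

def is_allowed(command: str, args: dict) -> bool:
    """Accept a command only if it is whitelisted and every argument
    (no more, no fewer) takes a value from its declared domain."""
    spec = ALLOWED_COMMANDS.get(command)
    if spec is None or set(args) != set(spec):
        return False
    return all(value in spec[name] for name, value in args.items())
```

Anything not explicitly enumerated, including a whitelisted command with a missing or extra argument, is rejected by construction.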

4.1.1 Connected Critical Systems and SCADAs<br />

In the case of critical or SCADA systems it is usually<br />

necessary to accept incoming commands sent through a VPN<br />

by authorized remote agents either to perform routine<br />

maintenance and configuration or to exert manual control, at<br />

least in the case of an emergency situation where some remote<br />

administrators or decision makers need to take action quickly.<br />

In this case it is quite easy to identify and characterize the list<br />

of allowed incoming commands and outgoing responses 1 . The<br />

filtering security policy may be stateless or state-based. For<br />

example, an authorized user might be required to authenticate<br />

before issuing a command that modifies the configuration<br />

of the system. In this case the corresponding filtering security<br />

policy will obviously be state-based (i.e. identification and<br />

authentication are required before accepting a given<br />

command).<br />
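Such a state-based policy can be sketched as a small state machine (the states and command names here are hypothetical):

```python
# Minimal state-based filtering policy: configuration commands are
# accepted only after a successful authentication event.
class StatefulFilter:
    def __init__(self):
        self.authenticated = False

    def accept(self, command: str) -> bool:
        if command == "authenticate":   # credential checking is assumed
            self.authenticated = True   # to happen upstream (e.g. VPN)
            return True
        if command == "set_config":
            return self.authenticated   # state-dependent decision
        if command == "read_status":
            return True                 # always-allowed query
        return False                    # everything else is dropped
```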

In the case of the 2015 attacks on the Ukrainian power grid<br />

[14] it appears that only a weak security policy was enforced,<br />

i.e. users with only a low-level credential could still send any<br />

commands and receive any response from critical systems. In<br />

1 The control of outgoing responses is less sensitive but still makes attacks<br />

more difficult and is also useful in case confidentiality is at stake.<br />



their comprehensive report, Booz Allen Hamilton recommends, among other measures, (1) installing a stateful firewall or data diode, and (2) using a stronger authentication mechanism (such as two-factor authentication) for some of the accesses. Using a<br />

stateful applicative firewall would make it possible to enforce a proper security policy, but the security level of existing firewalls 2 is not sufficient to cope with potential attacks (considering the return on investment that could be obtained by organized crime). A data diode is simpler and<br />

therefore can be brought to the right level of security (for<br />

example some data diodes have obtained an EAL7 Common<br />

Criteria certification) but can only make sure that the flow of<br />

information goes in a single direction: it cannot selectively<br />

block some commands and allow others. In addition, such<br />

systems usually require bidirectional communications, so data<br />

diodes are not adequate for this purpose. The filter we propose<br />

in this paper brings the benefits of both, i.e. the resistance of a<br />

data diode with the selectivity and programmability of an<br />

applicative firewall.<br />

In the case of the Ukrainian critical infrastructure we would<br />

have proposed to clearly identify the list of remote commands<br />

that were acceptable for each authorized (and authenticated)<br />

user. This list could have been used as the base of a filtering<br />

security policy.<br />

4.1.2 Embedded Devices and the IoT<br />

In the case of embedded automotive, aeronautic, or railway<br />

connected equipment, or more generally any equipment part of<br />

the IoT, such filters will for example be placed in the gateways<br />

that exist for most of these systems, but may also be placed<br />

elsewhere (e.g. within the Telematic Control Unit of a car).<br />

In the automotive industry, this approach could be used to<br />

filter incoming V2X alerts coming from the car gateway.<br />

Today these alerts are delivered to the driver only through the<br />

dashboard, but in the very near future these alerts might be<br />

forwarded directly to the brake-control system, forcing the car<br />

to slow down. Filtering security policies similar to the ones<br />

described in the previous section may for example apply to<br />

data exchanged between the OEM and the car, and/or<br />

commands between various domains inside the car (such as<br />

chassis, engine or infotainment domains).<br />

Because of the new business models available to enterprising hackers, high-level security policies need to be expressed and enforced by the gateways. It is not easy (at best error-prone, and in some cases impossible with the right level of precision) to express such policies on the low-level objects (such as IP packets) that firewalls normally use.<br />

The administrator in charge of configuring such firewalls, or the security architect defining the gateway, has to use low-level concepts such as ports, whereas they would like to implement a high-level security policy in which they could precisely specify and restrict the types of high-level commands or responses that get in or out.<br />

2 See the list of existing certified firewalls<br />

https://www.commoncriteriaportal.org/pps/<br />

As we will see in the following section the resistance of<br />

such implementations is not high enough to cope with the<br />

remote attacks at stake. Thus, even if the firewalls are properly<br />

configured, hackers will still have many ways to attack such<br />

entry points. They will typically bypass information-flow<br />

policies by exploiting bugs and errors commonly found in<br />

protocol stacks and OSs used to implement such firewalls. In<br />

fact, the security level reached by the most secure firewalls is<br />

usually very limited. In addition, the most secure ones have an<br />

expensive bill of material, which does not fit well with<br />

embedded systems requirements.<br />

4.1.3 Limitations of Traditional Firewalls<br />

The firewall is indeed the right concept for controlling and building the segregation of an architecture, but it has two significant drawbacks: (1) the configuration of a firewall is usually expressed in terms of low-level protocol concepts such as ports, IP addresses, etc., and making sure that such a configuration implements the correct high-level security policy is difficult and, at best, very error-prone; (2) most importantly, the TCB of a firewall includes at least its OS as well as its protocol stacks, both of which are very error-prone. In practice, the complexity of the attack surface prevents this architecture from meeting the<br />

highest level of security, which is a must for the use-cases at<br />

hand. The first drawback can be avoided by using applicative firewalls, so that security policies can be expressed with higher-level concepts very close to the objects of the security policy at hand; configuring the firewall to implement the right security policy then becomes simple and not error-prone.<br />

The second drawback is much more difficult to cope with<br />

and in fact we believe, as we will try to show in the next<br />

sections, that the TCB which necessarily includes at least one<br />

OS is very error prone (i.e. the TCB is complex and not<br />

formally proven as it should be).<br />

The attack surface of a traditional firewall is indeed<br />

unnecessarily large. In order to better understand this, let us<br />

consider an extremely simple (and unrealistic) security policy<br />

which is meant to impose that only the text command “set” can<br />

be sent remotely and that this command has a single mandatory<br />

parameter whose values can be only “on” or “off”. Let us<br />

consider here that these commands are sent using TCP/IP on an<br />

Ethernet network and let us consider in a first step, for the sake<br />

of simplicity, that we are not using a VPN or more generally<br />

that messages are not signed or encrypted.<br />

Even if this security policy is only to accept two possible<br />

commands: “set on” and “set off”, the degrees of freedom for<br />

the attacker are huge, and hence so is the attack surface. First, at<br />

the lexical level, the attacker could insert spaces in the text<br />

command (or other allowed delimiters such as tabs) in an<br />

attempt to exploit, for example, implementation bugs they have<br />

found in the lexical analyzer. They could in the same way<br />

exploit bugs in the syntactic analyzer (typically after reverse<br />

engineering it). The chances that they find problems that lead<br />

to real attacks there are limited because lexical and syntactic<br />

analysis is a well-understood software engineering problem<br />

with lots of available scientific know-how and tools. However<br />

such weaknesses may still exist anyway (inadequate grammar<br />

type, buffer overflow due to improper memory configuration,<br />



etc.). What is important in this case is that such degrees of<br />

freedom will typically exist within each layer of the protocol<br />

stack (e.g. application layer, host-to-host transport layer,<br />

internet layer, network interface layer), which enlarges the<br />

attack surface, raising the likelihood of finding an exploitable<br />

bug. Wireless communication links are more exposed to these<br />

issues compared to wired ones because radio technologies (i.e.<br />

GSM, WiFi, Bluetooth, ZigBee, etc.) are usually complex and<br />

very error-prone. In addition, in an OS such as Linux, protocol stacks are part of the kernel, which makes attacks even<br />

simpler. In any case attackers will have an extremely large<br />

surface of attack (i.e. many degrees of freedom) to try to<br />

exploit bugs in the various protocol layers or in the OS itself.<br />
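By contrast, a filter that leaves no lexical degrees of freedom can be sketched as an exact-match check (a deliberately minimal sketch of the “set on”/“set off” policy above):

```python
from typing import Optional

# A filter with zero lexical degrees of freedom: only the exact byte
# strings "set on" and "set off" pass. Added spaces, tabs, case changes
# or trailing newlines are rejected outright, so none of them can be
# used to probe the implementation on the receiving side.
ACCEPTED = {b"set on", b"set off"}

def filter_command(raw: bytes) -> Optional[bytes]:
    """Return the command unchanged if it matches exactly, else None."""
    return raw if raw in ACCEPTED else None
```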

4.2 Proposed Architecture<br />

Instead of filtering low-level packets we propose to filter<br />

high-level commands and arguments directly, using a so-called<br />

“protocol break.” We propose to implement this filter as a formally proven (or at least highly secure) application (stateful or stateless, depending on the requirements of the task) that operates only on high-level commands and arguments, running<br />

on a formally proven and secure OS. This OS will have to<br />

guarantee a number of security properties (such as separation,<br />

integrity, etc.) and, in addition, will have to enforce<br />

configurable information-flow policies between its<br />

components. This information-flow policy will make sure that<br />

communication flows coming from the outside (e.g. incoming<br />

commands) go through the filtering application which is the<br />

one applying the filtering security policy.<br />

Fig. 1.<br />

Following is an example of such an architecture in which<br />

we use ProvenCore to guarantee the security properties<br />

required to host the filtering application such as isolation,<br />

confidentiality and integrity [1]. As presented in Figure 1<br />

ProvenCore also enforces a (programmable) information-flow<br />

policy between the various security applications and between<br />

the hardware peripherals and the corresponding drivers and<br />

other security applications. This policy ensures that there is no<br />

possibility for an incoming command or outgoing response to<br />

somehow bypass the filtering application. It is materialized by<br />

the black arrows that represent the only authorized<br />

communication channels.<br />

In this architecture the twin protocol stacks used to support<br />

the protocol break execute as distinct processes on the same<br />

instance of ProvenCore. Since ProvenCore guarantees the<br />

integrity and separation of the processes it executes, even a<br />

severe problem within the hardware drivers or in the protocols<br />

stack themselves will not lead to any security problem besides<br />

a lack of availability 3 .<br />
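The protocol break itself can be modeled conceptually as follows (queues stand in for the two isolated protocol-stack processes; all names are illustrative, and ProvenCore's actual inter-process communication mechanisms are not shown):

```python
from queue import Queue

# Conceptual model of the protocol break: the outer protocol stack
# delivers raw commands into one queue, the filtering application
# re-emits only validated commands into a second queue read by the
# inner protocol stack. Nothing else crosses the boundary.
def protocol_break(inbox: Queue, outbox: Queue, policy) -> None:
    while not inbox.empty():
        raw = inbox.get()
        cmd = policy(raw)      # parse + check against the filtering policy
        if cmd is not None:
            outbox.put(cmd)    # forward the canonical, validated form only
```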

In the example above the filtering application implements<br />

two filtering security policies: one on incoming commands,<br />

one on outgoing responses. More than one filtering application can be used with more complex topologies, in which incoming (resp. outgoing) messages are routed to different<br />

filters according to their nature, but the overall principles<br />

remain unmodified.<br />

Such an architecture allows us to design a filter that can be<br />

formally proven or more generally brought to the highest level<br />

of certification. We have summarized our architecture in Fig. 2.<br />

Fig. 2.<br />

The TCB is composed of (1) a formally proven kernel, here<br />

ProvenCore which is the very first formally proven kernel on<br />

the market with the proper security features to support this<br />

filtering architecture, and (2) a formally proven filtering<br />

application, which is by itself a very simple application, even though it includes not only the filtering per se but also the lexical and syntactic analysis of commands and data. This architecture thus allows us<br />

to obtain a filter (or an applicative firewall) whose TCB is<br />

entirely formally proven to satisfy the given filtering policy<br />

expressed in a simple, high-level formal language.<br />

In other words with traditional firewalls we had to cope<br />

with a very error prone TCB with a large attack surface, not<br />

surprisingly inadequate to meet the highest level of security.<br />

With this new kind of filter we are relying on a bulletproof, formally proven TCB, which in addition can be proved to implement exactly the intended filtering function. Not surprisingly, such a formally proven filter can be brought to the very highest levels of security.<br />

But there is more to it. Even with a bullet proof filter there<br />

is still the problem that we might be forced to authorize<br />

potentially damaging commands (i.e. it is very likely that we<br />

have to accept as part of the filtering security policy some<br />

commands that are dangerous but necessary). So the remaining problem is not tampering with the filter (or the security policy) but the fact that some valid commands<br />

3 The lack of availability that would result from a successful attack on the<br />

protocol stacks can be mitigated by adding complementary security<br />

applications running in parallel to detect such attacks (such as a specialized<br />

IDS, i.e. Intrusion Detection System) and providing a security application in<br />

charge of reloading a new update over the air (or even inspecting and repairing the<br />

other software components). This is not featured here as it is out of scope of<br />

the current paper.<br />



may be used to attack the receiving side. Our artificially simple “set on”/“set off” filtering security policy illustrates plainly that attackers have almost no degree of freedom left to perform an<br />

attack on the receiving side. The only commands that can be<br />

sent are “set on” and “set off” as planned and the filtering<br />

application will leave absolutely no degree of freedom in the<br />

way any of them can be expressed. The situation would be<br />

exactly the same for more complex and realistic filtering<br />

security policies: the only degree of freedom left is indeed the<br />

one allowed by the filtering policy itself. But the commands<br />

that are defined as being acceptable by the filtering security<br />

policy could be dangerous by themselves. For example, most<br />

embedded devices will need a "firmware_update" command to<br />

manage the firmware update process for the whole platform.<br />

For this reason, it is usually also important to make sure that<br />

incoming commands have not been tampered with and have<br />

been issued by authorized and trusted persons. In other words it<br />

is necessary to add proper authentication, and also guarantee<br />

the integrity and potentially the confidentiality of the<br />

commands. Guaranteeing these security properties is typically<br />

the role of a proper VPN. Here we propose to integrate a VPN<br />

application that can be brought to the same level of security as<br />

the filtering application(s). This will give the simplified<br />

architecture presented in Figure 3.<br />
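The combination of authentication and filtering can be sketched as a MAC check preceding the policy check (the key and message format are invented for illustration; a real deployment would rely on a full VPN protocol with key management and replay protection):

```python
import hashlib
import hmac
from typing import Callable, Optional

# Illustrative shared secret only; real systems use a VPN with proper
# key management, not a hard-coded key.
KEY = b"demo-shared-secret"

def authenticate_then_filter(
    tag: bytes, raw: bytes, policy: Callable[[bytes], Optional[bytes]]
) -> Optional[bytes]:
    """Drop the message unless its MAC verifies, then apply the policy."""
    expected = hmac.new(KEY, raw, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # unauthenticated traffic never reaches the policy
    return policy(raw)
```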

Fig. 4.<br />

Now the same benefits can be achieved for any kind of<br />

(stateful or stateless) filtering security policy. Another<br />

significant advantage is that this can be achieved without any<br />

impact on the bill of materials and therefore at very little cost.<br />

Therefore such filters are not only much more secure than<br />

existing ones, but this architecture is applicable to cost-sensitive devices sold in large volumes. The only costly<br />

investment was the design, implementation and formal proof of<br />

the security of ProvenCore, an investment which has been done<br />

once and for all and can benefit the huge volumes of compatible devices from various market segments. Depending on the situation, these filters can be used to replace existing filters or to complement them (e.g. placed in sequence with another firewall or an IPS).<br />

Fig. 3.<br />

Using a proper VPN thus further reduces the attack surface,<br />

and shows the benefit that can be obtained by the use of these<br />

new generation of filters. Our artificially simple filtering<br />

security policy makes it easy to see that an attacker would have<br />

only one degree of freedom left: the possibility of (either)<br />

slowing down (or theoretically accelerating although this<br />

would be much harder) the reception of ingoing commands.<br />

Attackers would have no other degree of freedom and thus the<br />

attack surface for performing any attack would be almost nil.<br />

Here the fact that the TCB is formally proven and can be brought<br />

to the highest levels of security is key. It allows the filtering<br />

application itself to be brought to the highest level of security<br />

and we believe that such a possibility is a real breakthrough in<br />

the firewalling/filtering world.<br />

4.3 A Practical Implementation<br />

In practice, the architecture presented above can be easily<br />

implemented on an ARM processor using the architecture<br />

presented in Figure 4.<br />

5 CONCLUSION<br />

In this paper we have shown why it is very difficult (or<br />

even impossible) to bring traditional firewalls and filters to the<br />

required level of security. We have proposed an approach that<br />

allows us to build new filters based on protocol breaks where<br />

the software TCB is made very simple and is just composed of<br />

a formally proven kernel, namely ProvenCore here (which is<br />

currently seeking a Common Criteria EAL7 certification), and<br />

a few security applications that can also be easily formally<br />

proven. The other parts of the software stack that normally compose a firewall, such as the drivers, the protocol stack, and the regular OS, are here kept outside of the TCB. This is why<br />

such filters can be brought to levels of security that only simple<br />

physical data diodes could previously meet.<br />

REFERENCES<br />

[1] D. Bolignano, “Proven Security for the Internet of Things,” in<br />

proceedings of the Embedded World Conference 2016, February 2016.<br />

[2] National Vulnerability Database. NIST. [Online]. Available:<br />

https://web.nvd.nist.gov/view/vuln/search [Accessed 15 Jan. 2016].<br />

[3] C. Miller and C. Valasek. "A survey of remote automotive attack<br />

surfaces". [Online] Available:<br />

http://illmatics.com/remote%20attack%20surfaces.pdf [Accessed 15 Jan.<br />

2016].<br />

[4] M. Seaborn and T. Dullien, "Project Zero: Exploiting the DRAM rowhammer bug to gain kernel privileges," 2015. [Online]. Available: http://googleprojectzero.blogspot.fr/2015/03/exploiting-dram-rowhammer-bug-to-gain.html.<br />

[Accessed 15 Jan. 2016].<br />

387


[5] "ADAC deckt Sicherheitslücke auf - BMW-Fahrzeuge mit<br />

'ConnectedDrive' können über Mobilfunk illegal von außen geöffnet<br />

werden,” 2015. [Online]. Available:<br />

https://www.adac.de/infotestrat/adac-im-einsatz/motorwelt/bmwluecke.aspx?ComponentId=227555&SourcePageId=6729.<br />

[Accessed 15<br />

Jan. 2016].<br />

[6] C. Miller and C. Valasek, "Remote Exploitation of an Unaltered<br />

Passenger Vehicle". IOActive, Seattle, WA, Tech. Rep., 2015. [Online].<br />

Available:<br />

http://www.ioactive.com/pdfs/IOActive_Remote_Car_Hacking.pdf.<br />

[Accessed 15 Jan. 2016].<br />

[7] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, S. Mangard, P.<br />

Kocher, D. Genkin, Y. Yarom and M. Hamburg, "Meltdown," [Online].<br />

Available: https://arxiv.org/abs/1801.01207 [Accessed 11 Jan. 2018].<br />

[8] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S.<br />

Mangard, T. Prescher, M. Schwarz and Y. Yarom, "Spectre Attacks:<br />

Exploiting Speculative Execution," [Online]. Available:<br />

https://arxiv.org/abs/1801.01203 [Accessed 11 Jan. 2018].<br />

[9] D.A. Osvik, A. Shamir and E. Tromer, "Cache Attacks and<br />

Countermeasures: The Case of AES," in Pointcheval D. (eds) Topics in<br />

Cryptology – CT-RSA 2006. CT-RSA 2006. Lecture Notes in Computer<br />

Science, vol 3860. Springer, Berlin, Heidelberg.<br />

[10] M. Lipp, D. Gruss, R. Spreitzer, C. Maurice, S. Mangard:<br />

"ARMageddon: Cache Attacks on Mobile Devices," in 25th USENIX<br />

Security Symposium (USENIX Security 16). Austin, TX : USENIX<br />

Association, August 2016.<br />

[11] C. Cohen, “AMD-PSP: fTPM Remote Code Execution via crafted EK<br />

certificate,” [Online]. Available:<br />

http://seclists.org/fulldisclosure/2018/Jan/12 [Accessed 11 Jan. 2018].<br />

[12] G. Beniamini, "Trust Issues: Exploiting TrustZone TEEs," [Online]. Available: https://googleprojectzero.blogspot.com/2017/07/trust-issues-exploiting-trustzone-tees.html<br />

[13] TLS/SSL Explained – Examples of a TLS Vulnerability and Attack,<br />

Final Part. [Online]. Available:<br />

https://www.acunetix.com/blog/articles/tls-vulnerabilities-attacks-finalpart/<br />

[Accessed 11 Jan. 2018].<br />

[14] When The Lights Went Out - A Comprehensive Review Of The 2015<br />

Attacks On Ukrainian Critical Infrastructure. [Online]. Available:<br />

https://www.boozallen.com/content/dam/boozallen/documents/2016/09/<br />

ukraine-report-when-the-lights-went-out.pdf [Accessed 11 Jan. 2018].<br />

[15] Internet Security Threat Report, Volume 22, April 2017, [Online].<br />

Available:<br />

https://www.symantec.com/content/dam/symantec/docs/reports/istr-22-<br />

2017-en.pdf [Accessed 11 Jan. 2018].<br />

[16] G. Beniamini, "Extracting Qualcomm's KeyMaster Keys - Breaking Android Full Disk Encryption," [Online]. Available: http://bits-please.blogspot.fr/2016/06/extracting-qualcomms-keymaster-keys.html<br />

[Accessed 15 Jan. 2018].<br />

www.embedded-world.eu<br />

388


From Matlab To FPGA in<br />

Manageable Steps, a True Story in<br />

Double Precision<br />

Mike Looijmans<br />

System Expert<br />

Topic Embedded Products B.V.<br />

Best, The Netherlands<br />

mike.looijmans@topic.nl<br />

Abstract—The growing computational power of our<br />

machines seems to only increase our hunger for even more<br />

teraflops. But at the same time we strive for low power<br />

consumption and flexibility. Our desktop CPUs have seen relatively little improvement over the past decade, unlike the<br />

highly capable hybrid systems that combine CPU, GPU and<br />

FPGA architectures, like Xilinx' Zynq MPSoC. FPGA based<br />

cloud computing provides computational power by the hour to<br />

those who need it.<br />

While FPGA devices offer a unique balance of flexibility and<br />

efficiency, programming these devices has usually been restricted<br />

to that handful of specialists who have the necessary knowledge<br />

and skills. This has been the major limiting factor in the broad<br />

adoption of these systems. And hybrid CPU/FPGA systems only<br />

appear to increase the amount of skill required, by requiring the<br />

engineer to also cope with the complexity of coupling the<br />

subsystems together.<br />

In this presentation I will show the complete flow from a Matlab model to its implementation in a hybrid CPU/FPGA system, a Xilinx Zynq. All that is required is a general<br />

understanding of what an FPGA is, and how it can be used to<br />

implement mathematical algorithms. No VHDL or Verilog<br />

experience required.<br />

The Matlab function in question is the discrete wavelet<br />

transform, often used in signal compression and pattern<br />

recognition. The algorithm implementation uses double-precision<br />

floating point math, usually frowned upon by FPGA engineers,<br />

but we will see later that this poses no problem for the hardware.<br />

We ported the implementation to plain C++ (or C) code and wrote test code to verify it. This test code is used throughout the project to detect regressions. We ran the code on the target CPU platform to get a baseline benchmark. The next step was to optimize the code for a more efficient implementation, and to test and benchmark it again on a desktop PC. Then we added an FPGA card to this desktop machine, to make it mimic the target.<br />

We used Dyplo to produce a bitstream for that FPGA card in a<br />

few clicks, which also adds high speed data transfer capability<br />

between the desktop CPU and FPGA. We'll pass the algorithm's<br />

C++ code on to Dyplo for implementation on the FPGA, and<br />

using the existing test code we can verify its performance. In a<br />

few iterations, the algorithm runs at the required speed, and<br />

within the resource limits. We can use the exact same software<br />

and tools to generate the final software and FPGA firmware for<br />

the Zynq target.<br />

In a few days work, we produced an implementation for the<br />

discrete wavelet transform on a hybrid CPU/FPGA platform that<br />

outperforms the CPU-only implementation in both speed and<br />

power efficiency. During the design we used test-driven software development, and at a sustainable pace arrived at the set goals.<br />

Keywords—acceleration; Dyplo; FPGA; wavelet<br />

I. INTRODUCTION<br />

Before diving into the technical challenge here, let's first<br />

describe the project.<br />

Delirium (acute brain failure) affects over 3 million<br />

hospitalized patients in Europe every year. It is a potentially<br />

fatal medical emergency that regularly leads to long-term<br />

cognitive impairment (dementia), longer hospital admission<br />

and higher healthcare costs. Its effect increases as the episode<br />

lasts longer, so timely detection is essential. To date, delirium is detected too seldom and too late, using subjective and ineffective methods.<br />

The project's mission is: "Safe and accurate delirium<br />

monitoring in routine hospital care". The DeltaScan Monitor is<br />

to be a brain activity analyzer that performs an algorithmic<br />

recognition of delirium, a combination of hardware and<br />

software and hence, an embedded device.<br />

Our customer, Prolira, has composed and verified a clinical<br />

mathematical model of the detection algorithm, and converted<br />

the algorithm into C++ code. From the C++ implementation,<br />

we have already learned that a modern desktop PC is able to<br />

run the algorithm within the given performance limits. Our<br />

challenge now is to implement this algorithm on a portable,<br />

battery-powered platform.<br />


389


II. PLATFORM<br />

From analysis of the mathematical model we derive that the<br />

core operations are discrete wavelet transforms (DWT) [1],<br />

both forward and inverse (iDWT). The discrete wavelet<br />

transform is based on convolution [2] operations, which in turn<br />

are multiply-accumulate (MAC) operations. These are very<br />

suitable for implementation on a broad range of accelerators,<br />

like SIMD, GPU, DSP and FPGA. Since a desktop PC is<br />

capable of meeting the performance requirements, we can be<br />

assured that it will also be possible to map the algorithm on a<br />

number of embedded platforms.<br />

Apart from running the algorithm, the embedded device<br />

must also acquire data in realtime from the frontend probe<br />

using an analog-to-digital converter (ADC), pre-process the<br />

data, provide appropriate clock and sample signals for the<br />

ADC, interpret the results and display them on a screen,<br />

provide a graphical user interface (GUI) for controlling the<br />

device, and monitor the battery status and the integrity of<br />

various other components.<br />

These combined requirements lead to the selection of a<br />

Xilinx Zynq 7000 series device, in particular the Topic Miami<br />

7030 system-on-module (SoM), as the central processing part.<br />

This is a combined dual-core ARM CPU and FPGA fabric,<br />

tightly coupled, in a single chip. We run Linux on the system,<br />

which gives us driver support for all peripherals on the board<br />

as well as the foundation for a GUI, and allows for a<br />

convenient hardware abstraction and programming<br />

environment. The task of ADC data acquisition is offloaded to<br />

the FPGA, so we can avoid using a real-time OS.<br />

III. IMPLEMENTATION<br />

The first approach is to implement all processing on the<br />

CPUs, using the C++ code as is. The data is being processed in<br />

sets of 4096 samples in double-precision floating point format.<br />

Each set must be processed in 2 milliseconds, to meet the<br />

performance deadlines. The C++ implementation does not<br />

come close to that, but we already anticipated that. What we do<br />

want to determine in this phase is where the performance<br />

bottlenecks are. Not surprisingly, this turns out to be the<br />

discrete wavelet transform function, which takes about 3<br />

milliseconds on this CPU, and each set requires 5 DWT<br />

operations. If we can offload the DWT operations to the FPGA,<br />

the system will meet the performance requirements.<br />

IV. DEVELOPMENT SETUP<br />

We'll first prototype the system, so we can work in a convenient development environment. Instead of an embedded board, we'll use a generic desktop PC with a PCIe<br />

card holding a comparable FPGA (a Kintex 160 series, similar<br />

fabric but with more logic cells). This sounds like introducing<br />

yet another issue, but this is completely mitigated by the<br />

operating system. Both systems run Linux, so whether the<br />

FPGA is connected through PCIe or directly integrated in the<br />

chip is completely transparent to our software thanks to<br />

hardware abstraction.<br />

What remains is the job of creating a "bitstream" for the FPGA to load, so we can communicate with it. This is quite an easy task: start the Dyplo DDE, create a new project, select the FPGA board from the list, and it's basically done. What's left for us is<br />

to select how many parallel DMA transfers we want, and what<br />

parts of the fabric we wish to use for our algorithms. I had already done this for another project, so I simply re-used everything from that. It's very convenient to create a basic project with<br />

lots of bandwidth and free space that can be used for<br />

prototyping. The one I use divides the FPGA into 8 reprogrammable<br />

regions and provides 4 DMA channels, though<br />

for this project we'd only need one of each.<br />

Once that first static bitstream is loaded onto the board, we<br />

can plug it into the PC. Since Dyplo can reprogram the<br />

"algorithm" parts of the FPGA any time through the PCIe bus,<br />

we won't need to do this again.<br />

To send and receive data through the DMA channels, all we<br />

need to do in software is open a device file and read or write it,<br />

depending on the direction of the data flow. Before we can do<br />

that, we call a function that programs the algorithm into a part<br />

of the FPGA, and another to set up the data path between the<br />

DMA and algorithm nodes. All that is left for us to do now is<br />

actually implement the algorithm, writing (unit)test code to<br />

check the results and to measure performance.<br />
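As an illustration of this read/write pattern, here is a minimal C++ sketch; an ordinary file stands in for the Dyplo DMA device node (the real node paths and the Dyplo configuration and routing calls are product-specific and omitted):<br />

```cpp
#include <cstdio>
#include <vector>

// In the real setup "path" would be a Dyplo DMA device node (an assumption:
// actual node names are product-specific); here an ordinary file stands in
// and simply echoes the data back instead of an FPGA-transformed result.
std::vector<double> loopback_through_file(const char* path,
                                          const std::vector<double>& samples) {
    // "Send": write the sample block to the device file.
    FILE* f = std::fopen(path, "wb");
    std::fwrite(samples.data(), sizeof(double), samples.size(), f);
    std::fclose(f);

    // "Receive": read the result block back from the device file.
    std::vector<double> result(samples.size());
    f = std::fopen(path, "rb");
    std::fread(result.data(), sizeof(double), result.size(), f);
    std::fclose(f);
    return result;
}
```

The point of the abstraction is that the same open/read/write calls work whether the node is backed by PCIe DMA or by the Zynq's internal interconnect.<br />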

V. ALGORITHM CONVERSION<br />

Now that we have unit tests in place, we can start on the conversion of the C code into a hardware description language (HDL). To aid in the conversion, we'll be using Dyplo.<br />

Though the FPGA excels at processing data at high speed, things like dynamic memory allocation cannot be practically implemented in logic. So the very first step for us is to refactor the code, changing for example std::vector manipulations into simple arrays, and moving all memory management outside the algorithm's main body. This turns out to be the major part of the work. As an added bonus, these changes also make the C++ algorithm run considerably faster on the CPU.<br />
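The kind of refactoring meant here can be sketched as follows; the function and the fixed set size are illustrative, not the project's actual code:<br />

```cpp
#include <vector>
#include <cstddef>

// HLS-unfriendly: dynamic allocation inside the algorithm body.
std::vector<double> scale_dynamic(const std::vector<double>& in, double k) {
    std::vector<double> out;              // heap allocation: not synthesizable
    out.reserve(in.size());
    for (double v : in) out.push_back(v * k);
    return out;
}

// HLS-friendly: fixed-size arrays, caller owns all memory.
const std::size_t SET_SIZE = 4096;        // one data set, as in the paper
void scale_static(const double in[SET_SIZE], double out[SET_SIZE], double k) {
    for (std::size_t i = 0; i < SET_SIZE; ++i)
        out[i] = in[i] * k;               // plain loop over a static bound
}
```

The static version gives the synthesizer known bounds and no heap, which is also why it tends to run faster on the CPU.<br />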

Fig. 1. Processing pipeline<br />

So the system we want to build now must transmit a data<br />

set of 4096 samples from CPU memory to the FPGA, run the<br />

algorithm in FPGA logic and return the result, also 4096<br />

values, back to the CPU for further processing. We'll use Dyplo<br />

for the communication between CPU and FPGA, and to assist<br />

in the implementation of the FPGA logic.<br />

Fig. 2. Wavelet filterbank implementation (source: Wikipedia [1])<br />

The wavelet transform is implemented as a "filterbank" [1]. This repeatedly applies a low-pass filter g[n] and a matching high-pass filter h[n] to the data set, thus converting the sample data into the wavelet domain. With 4096<br />

390


samples per set, there are 11 stages. With a filter kernel size of 8, the first stage requires 4096x8 MAC operations, and the remaining stages require an equal amount in total, so the conversion needs about 65536 MAC operations per set.<br />
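One such filterbank stage can be sketched as follows; the 2-tap Haar kernels used in the test are only for brevity (the project's kernels were 8 taps long), and the inner loop is exactly the multiply-accumulate work counted above:<br />

```cpp
#include <vector>
#include <cstddef>

// One analysis stage of a wavelet filterbank: convolve the input with a
// low-pass kernel g and a high-pass kernel h, then keep every second output
// (downsampling by 2). The "approx" half feeds the next stage; the "detail"
// half holds this stage's wavelet coefficients.
struct StageOut { std::vector<double> approx, detail; };

StageOut dwt_stage(const std::vector<double>& x,
                   const std::vector<double>& g,
                   const std::vector<double>& h) {
    StageOut out;
    for (std::size_t n = 0; n + g.size() <= x.size(); n += 2) {
        double lo = 0.0, hi = 0.0;
        for (std::size_t k = 0; k < g.size(); ++k) {  // multiply-accumulate
            lo += x[n + k] * g[k];
            hi += x[n + k] * h[k];
        }
        out.approx.push_back(lo);
        out.detail.push_back(hi);
    }
    return out;
}
```

With an 8-tap kernel and 4096 inputs, this stage performs 4096 x 8 MACs, matching the count given above.<br />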

In the Dyplo development environment, we create a new C/C++ "task" and point it to the algorithm's C++ implementation file. We select the function to instantiate in logic; Dyplo then creates the interface logic around it and creates a Vivado HLS project that will do the main conversion. This step completes without errors, so we can create a bitstream and run a<br />

functional test on the FPGA in the PC. This verifies that the<br />

HDL implementation is actually performing correctly in real<br />

hardware. This first attempt processes data at about 1.8MB/s,<br />

corresponding to 17 milliseconds per set. Some tuning and<br />

optimizing is required still to meet the target of 2 milliseconds.<br />

We open the Vivado HLS project and add the C test code to it. This allows us to simulate the end result without actually implementing it, and gives quick feedback on changes in terms of the number of clock ticks required to process one set of data. The major steps that contributed:<br />

- The first step of the algorithm processes 4096 samples and produces 2048 results; the next 9 loops produce the remaining 2048 results. Splitting this into two parts allows the hardware to implement them in parallel, which doubles the speed.<br />
- Adding the 8 filter results was done in a loop. Rewriting this as a binary tree gives another 2x speedup.<br />
- The convolution filter coefficients were accessed in reversed order. Reversing the coefficient array instead lets the code use them in natural order.<br />
- Inlining the high-pass and low-pass parts into a single function allows them to run in parallel and again doubles the speed.<br />
- Inserting the PIPELINE directive into the convolution code instructs HLS to re-use this block more efficiently and, combined with a few minor changes, made for a 5 times faster implementation.<br />
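The binary-tree rewrite of the 8-tap summation can be illustrated as follows (tap values and function names are ours, not the project's):<br />

```cpp
// Sequential accumulation: each addition depends on the previous result,
// so hardware must chain 7 adders one after another.
double sum_loop(const double t[8]) {
    double acc = 0.0;
    for (int i = 0; i < 8; ++i) acc += t[i];
    return acc;
}

// Balanced binary tree: the additions at each level are independent, so a
// synthesizer can schedule them in parallel (3 adder levels instead of 7).
double sum_tree(const double t[8]) {
    double a = (t[0] + t[1]) + (t[2] + t[3]);
    double b = (t[4] + t[5]) + (t[6] + t[7]);
    return a + b;
}
```

Both produce the same result; only the dependency structure, and hence the achievable parallelism, differs.<br />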

The end result was an HDL implementation that used 363<br />

microseconds per set of 4096 samples, well below our set goal<br />

of 2 milliseconds. Since each set requires 65536 MAC<br />

operations, the implementation is performing about 180 million<br />

double precision floating point multiply-accumulate operations<br />

per second.<br />

Since the embedded board uses a similar FPGA, all that needed to be done was to run the place-and-route for the final<br />

embedded board, cross-compile the test benches, and copy<br />

these to the board for a final test to confirm the results.<br />

VI. FINAL RESULTS<br />

The measured speed of the algorithm is 363 microseconds<br />

per set of 4096 samples. This includes the overhead of sending<br />

the data to FPGA and back.<br />

The algorithm takes less than 15% of the resources, and can<br />

be instantiated multiple times to further increase throughput. In<br />

this case, it has been chosen to place 2 instances in the system,<br />

which is convenient since there are also 2 CPU cores to deliver<br />

and process the data.<br />

The FPGA consumes 200mW while actively running two instances of the algorithm. The CPU consumes 400mW during 2 milliseconds to perform the same calculation, so the total power saving is substantial: for 2 sets, the FPGA would use 200mW during 363 microseconds, while the CPU would use 400mW for 2 milliseconds, so the CPU solution uses about 11 times more energy for the same amount of data.<br />

Total time spent on this part of the project was about 4 days<br />

on the C++ conversion to make it suitable for FPGA synthesis,<br />

and another day to optimize performance.<br />

VII. FINDINGS<br />

What we learned during this project:<br />

A remarkably fast implementation cycle was achieved by a software engineer with no prior FPGA experience. Tools have apparently evolved to a point where knowledge of a hardware description language is no longer a requirement. Not only can the algorithm be described in C code; the test benches are written in C code as well.<br />

The best C to HDL optimizations were accomplished by<br />

code manipulation. Changing the code to reflect what we want<br />

the hardware to do was far more efficient than attempting to<br />

guide the HDL process using only compiler directives.<br />

Maintaining double-precision floating point does not pose a<br />

problem for hardware synthesis. This leads to a much shorter<br />

time to market because a conversion to fixed point usually<br />

requires a much more thorough understanding of the domain.<br />

The bandwidth of the CPU-FPGA communication can<br />

become a dominating parameter. Describing this in detail is<br />

material for a separate technical paper.<br />

An agile approach using a test driven design methodology<br />

assured progress and traceability.<br />

REFERENCES<br />

[1] https://en.wikipedia.org/wiki/Discrete_wavelet_transform<br />

[2] https://en.wikipedia.org/wiki/Convolution<br />


391


Partitioning of computationally intensive tasks<br />

between FPGA and CPUs<br />

Tobias Welti, MSc (Author)<br />

Institute of Embedded Systems<br />

Zurich University of Applied Sciences<br />

Winterthur, Switzerland<br />

tobias.welti@zhaw.ch<br />

Matthias Rosenthal, PhD (Author)<br />

Institute of Embedded Systems<br />

Zurich University of Applied Sciences<br />

Winterthur, Switzerland<br />

matthias.rosenthal@zhaw.ch<br />

Abstract—With the recent development of faster and more<br />

complex Multiprocessor System-on-Chips (MPSoCs), a large<br />

number of different resources have become available on a single<br />

chip. For example, Xilinx's Zynq UltraScale+ is a powerful<br />

MPSoC with four ARM Cortex-A53 CPUs, two Cortex-R5 real-time cores, an FPGA fabric and a Mali-400 GPU. Optimal<br />

process partitioning between CPUs, real-time cores, GPU and<br />

FPGA is therefore a challenge.<br />

For many scientific applications with high sampling rates and<br />

real-time signal analysis, an FFT needs to be calculated and<br />

analyzed directly in the measuring device. The goal of<br />

partitioning such an FFT in an MPSoC is to make best use of the<br />

available resources, to minimize latency and to optimize<br />

performance. The paper compares different partitioning designs<br />

and discusses their advantages and disadvantages. Measurement<br />

results with up to 250 MSamples per second are shown.<br />

Keywords—FPGA; UltraScale+ MPSoC; partitioning; ARM<br />

NEON; SIMD; asymmetric multi-processing; high performance<br />

FFT; low latency processing<br />

I. INTRODUCTION<br />

The transition from field-programmable gate arrays<br />

(FPGAs) to System-on-Chips (SoCs) in 2011 was the<br />

unavoidable development when FPGAs needed to execute ever<br />

more complex software programs. The soft-core processors<br />

available for inclusion in the programmable logic were either<br />

not powerful enough or took up too many logic resources. The<br />

combination of hardware processors with the FPGA, interconnected through a high-performance bus, showed the potential of this architecture.<br />

With the recent development of faster and more complex<br />

Multiprocessor System-on-Chips (MPSoCs), many different<br />

resources are available on one chip. For example, Xilinx's<br />

Zynq UltraScale+ MPSoC combines up to four ARM Cortex-<br />

A53 application processor cores, two ARM Cortex-R5 realtime<br />

cores, an ARM Mali-400 GPU as well as an FPGA fabric<br />

with programmable logic, on-chip memory, hardware<br />

multipliers (DSP slices) and many high-throughput I/Os. The<br />

challenge for the system architect has now become finding the optimal execution environment for each of the design's processes: the<br />

partitioning. The goal is to make best use of the available<br />

resources, minimizing latency and optimizing performance.<br />

In this paper, we use the Fast Fourier Transform as a<br />

computationally expensive algorithm that can be accelerated<br />

through several means:<br />

- multiprocessing on several cores of the same type (Symmetric Multiprocessing)<br />
- vector processing using a special instruction set for Single Instruction, Multiple Data (SIMD), available on most current processors<br />
- using additional, different processors than the ones the main software is run on (Asymmetric Multiprocessing)<br />
- generating accelerator functions that run in the FPGA fabric and using them as external functions<br />
- running the whole algorithm in the FPGA fabric, controlled by a CPU core<br />
- running the algorithm standalone in the FPGA<br />

For each method, we present the communication paths and<br />

software architecture, along with performance data.<br />

The FFT is a well-studied algorithm, and many papers have been published on methods for its efficient execution on specific multiprocessor architectures [1], [2], [3], [4], [5]. It is not<br />

the goal of this paper to improve on these methods, but to<br />

provide an overview and an understanding of the possibilities<br />

available in today's devices.<br />

The paper is organized as follows:<br />

Section II introduces the FFT algorithm and how it can be<br />

calculated on multiple processing devices. In Section III, we<br />

discuss the partitioning methods based on software, executing<br />

on processor cores. The FPGA-based methods are explored in<br />

Section IV. Section V elaborates on ways of collaboration<br />

between the FPGA and processors. Finally, in Section VI we<br />

sum up the advantages of the presented methods.<br />


392


II. FFT PARTITIONING<br />

The discrete Fourier Transform (DFT) is used to transform<br />

a sequence of samples from the time domain into the frequency<br />

domain to analyze the frequency components of the sampled<br />

signal. Spectral analysis, measuring and controlling, signal<br />

processing and quantum computing are but a few applications<br />

of the DFT. The DFT has a very high computational cost of O(N²). The Fast Fourier Transform (FFT) improves efficiency<br />

of the transform by reducing the number of redundant<br />

calculations. This is achieved by splitting the sequence into<br />

smaller parts and performing the Fourier Transform on these as<br />

shown in Fig. 1. In doing so, the computational cost can be<br />

reduced to O(N log N). Note that the splitting includes a<br />

reordering of the input values, effectively selecting every other<br />

input value for each subset.<br />
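The even/odd splitting described above corresponds to the textbook recursive radix-2 formulation, sketched here in C++ (an illustration only, not one of the optimized library implementations discussed later):<br />

```cpp
#include <complex>
#include <vector>
#include <cmath>
#include <cstddef>

using cd = std::complex<double>;

// Recursive radix-2 decimation-in-time FFT. N must be a power of two.
// Splitting into even/odd halves at every level reduces the cost from the
// DFT's O(N^2) to O(N log N).
std::vector<cd> fft(const std::vector<cd>& x) {
    const std::size_t n = x.size();
    if (n == 1) return x;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {   // the reordering step
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    const std::vector<cd> e = fft(even), o = fft(odd);
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n / 2; ++k) {   // combine ("butterfly")
        const cd w = std::polar(1.0, -2.0 * pi * (double)k / (double)n) * o[k];
        out[k]         = e[k] + w;
        out[k + n / 2] = e[k] - w;
    }
    return out;
}
```

The combine loop is the step that needs data from both halves, which is exactly the all-to-all communication discussed below when the halves live on different cores.<br />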

Fig. 1. Principle of the FFT algorithm.<br />

Since the FFT algorithm is a divide-and-conquer approach,<br />

it is well suited for parallel processing on multiple processors.<br />

Each core can perform the smaller FFT on its part of the data,<br />

independent of the remaining data. However, as shown in Fig.<br />

1, there will be at least one step requiring data from other cores<br />

when combining the smaller FFTs into the complete spectrum.<br />

This all-to-all communication is a critical step because it<br />

requires synchronization of the cores. The optimization of this<br />

step has been the subject of several publications, e.g. [2] and<br />

[4].<br />

The technique of calculating smaller FFTs and combining them into larger spectra makes it possible to process FFTs larger than the processor's memory efficiently, provided that the currently unused data is stored in an efficient way. Possible implementations are published in [2], [4] and [6].<br />

III. MULTICORE PROCESSING FOR FFT<br />

A. Available resources<br />

The Xilinx Zynq UltraScale+ MPSoC portfolio offers<br />

multiple ranges of SoCs with varying numbers of processor<br />

cores and FPGA fabric resources. In this paper, we use the<br />

XCZU9EG device, an SoC with the following resources:<br />

- four ARM Cortex-A53 application processing cores, running at 1100 MHz and featuring the NEON instruction set<br />
- two ARM Cortex-R5 real-time processing cores with tightly-coupled memory (TCM) for low-latency access, running at 500 MHz<br />
- an ARM Mali-400 GPU<br />
- an FPGA fabric with 600,000 system logic cells, 32 Mbit of FPGA memory and 2520 hardware multipliers (DSP slices)<br />

Fig. 2 shows the block diagram of the available resources.<br />

The ARM Cortex-A53 core is a mid-range application<br />

processing core that balances power usage vs. performance. It<br />

is equipped with the ARMv8 instruction set, including<br />

NEONv2 SIMD instructions for vectorized execution on<br />

multiple data (up to 128 bit wide). Four A53 cores make up the<br />

Application Processing Unit (APU), in the bottom right of Fig.<br />

2. The ARM Cortex-R5 core is a real-time processor with a<br />

focus on fast reaction to events. Its 128 kB of TCM allow very<br />

fast memory accesses, but in turn limit the amount of data that<br />

can be worked on. Two R5 cores form the Real-time<br />

Processing Unit (RPU) in the top right of Fig. 2. The Level 3<br />

interconnect enables fast data transfers between the APU, the<br />

RPU and the FPGA fabric with on-chip memory, DSP slices<br />

and programmable logic.<br />

B. Executing the FFT in Software<br />

When executing an FFT in software, you have the choice of<br />

several FFT libraries, many of them capable of exploiting both<br />

multiprocessing and vector processing.<br />

ARM Ne10 [7] provides highly optimized NEON routines written in assembly, and its FFT algorithm makes use of these. However, it does not support multiprocessing.<br />

kissFFT [8] is a very lightweight library with the goal of being<br />

easy to use and moderately efficient while supporting<br />

multiprocessing. However, it makes use of NEON instructions<br />

only to execute four separate FFTs in parallel instead of<br />

accelerating one FFT transform.<br />

The fastest and most versatile FFT library that was tested in<br />

our work is FFTW3 [9], exploiting both multi-processing and<br />

NEON instructions and including a mechanism to optimize the<br />

algorithm for the available hardware. This mechanism will test<br />

many possible FFT optimization algorithms, measuring<br />

performance and selecting the fastest one as described in [10].<br />

This is done in order to make the best possible use of first and<br />

second level caches, memory access speeds and other hardware<br />

characteristics. For our implementations, we used FFTW3 on<br />

the A53 and Ne10 on the R5.<br />

Fig. 2. Block diagram of MPSoC.<br />

393


C. Implementations<br />

The software-only implementations were run on the Xilinx<br />

PetaLinux operating system, using the FFTW3 library to<br />

calculate double precision floating-point complex FFTs. Using<br />

double precision limits the speed improvement for the NEON<br />

instruction set to a factor of two, because the NEON registers<br />

are 128 bit wide and can therefore accommodate only two<br />

double precision floating-point values.<br />
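The factor-of-two limit can be pictured without intrinsics: a 128-bit NEON register holds exactly two doubles, so a vectorized loop advances two lanes per iteration, as in this plain-C++ sketch (function names are ours):<br />

```cpp
#include <cstddef>

// Scalar loop: one double multiply-add per iteration.
void axpy_scalar(const double* x, double* y, double a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

// Two-wide loop mirroring what a 128-bit SIMD unit does with doubles:
// two independent lanes per iteration, hence at best a 2x speedup.
void axpy_pairs(const double* x, double* y, double a, std::size_t n) {
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {   // "vector" body: two lanes at once
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
    }
    for (; i < n; ++i)             // scalar tail for odd n
        y[i] += a * x[i];
}
```

With single-precision floats, four values fit into the same register, which is why the theoretical speedup would be four instead of two.<br />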

The following five scenarios were tested:<br />

a. Single-core A53<br />

b. Single-core A53 with NEON instructions<br />

c. Symmetric Multi-Processing (SMP) with four A53<br />

d. Symmetric Multi-Processing (SMP) with four A53<br />

with NEON instructions<br />

e. Asymmetric Multi-Processing (AMP) with the R5 as<br />

coprocessor<br />

Scenarios a-d require no special software stack except the<br />

pthread library for SMP.<br />
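For scenarios c and d, the SMP work split can be sketched with std::thread (pthreads underneath on Linux); the doubling loop is a stand-in for the per-slice FFT work, an assumption made for brevity:<br />

```cpp
#include <thread>
#include <vector>
#include <cstddef>

// Split one buffer across worker threads, as in the SMP scenarios: each
// thread processes a disjoint slice, so no locking is needed while the
// threads run. The last thread picks up any remainder.
void process_smp(std::vector<double>& data, unsigned nthreads) {
    std::vector<std::thread> pool;
    const std::size_t chunk = data.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t lo = t * chunk;
        const std::size_t hi = (t + 1 == nthreads) ? data.size() : lo + chunk;
        pool.emplace_back([&data, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i)
                data[i] *= 2.0;    // stand-in for the per-slice FFT
        });
    }
    for (std::thread& th : pool) th.join();
}
```

The expensive step the sketch omits is the combine phase, where the slices must exchange data, which is exactly the synchronization point discussed in Section II.<br />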

Scenario e requires additional frameworks and drivers for<br />

communication between the Master A53 core and the remote<br />

R5 core to enable AMP. Fig. 3 shows the software architecture.<br />

Two operating systems are required: Linux on the APU master<br />

CPU and FreeRTOS on the RPU slave CPU. First, the APU<br />

boots Linux and uses the OpenAMP framework to load the<br />

RPU firmware into the TCM via a DMA transfer. The RPU is<br />

then booted out of the TCM. The remoteproc driver handles the<br />

life cycle management, allocates the required resources and<br />

creates a virtual I/O (virtIO) device for each remote processor.<br />

RPMsg is the Remote Processor Messaging API, which provides inter-processor communication between processes running on<br />

independent cores.<br />

The flow of OpenAMP booting and software execution is<br />

as follows (as in [11]):<br />

1. The remote processor is configured as a virtual IO<br />

device, shared memory for message passing is<br />

reserved.<br />

2. The Master loads the firmware for the remote<br />

processor into its memory, then boots the remote<br />

processor.<br />

3. After booting, the remote processor creates the virtIO<br />

and RPMsg channels and informs the master.<br />

4. The Master invokes the callback channel and<br />

acknowledges the remote processor and application.<br />

5. The remote processor invokes the RPMsg channel.<br />

6. The RPMsg channel is established, both Master and<br />

Slave can initiate communication via RPMsg calls.<br />

7. During operation, communication buffers in reserved<br />

shared DDR memory are used to pass messages<br />

between the Master and the Slave. Usually, these<br />

buffers are small. To load larger amounts of data, such<br />

as the FFT input and output data, the data is written to<br />

or read from on-chip memory (OCM) of the R5, and<br />

the pointers are passed via message buffer.<br />
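A minimal sketch of the message layout described in step 7. The struct, field names and addresses are illustrative and not part of the OpenAMP API: the point is that the small shared message buffer carries only a command plus pointers and sizes referring to OCM, never the FFT samples themselves.<br />

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request passed through the small RPMsg buffer.  The bulk
 * data stays in on-chip memory; only its location travels in the message. */
typedef struct {
    uint32_t cmd;       /* e.g. a hypothetical CMD_RUN_FFT               */
    uint32_t n_points;  /* FFT size, e.g. 4096                           */
    uint64_t src_addr;  /* physical address of the input samples in OCM  */
    uint64_t dst_addr;  /* physical address reserved for the result      */
} fft_request_t;

/* Pack the request into the (small) shared message buffer ...           */
static void pack_request(uint8_t *buf, const fft_request_t *req) {
    memcpy(buf, req, sizeof *req);
}

/* ... and unpack it again on the remote (R5) side.                      */
static void unpack_request(const uint8_t *buf, fft_request_t *req) {
    memcpy(req, buf, sizeof *req);
}
```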

Shutdown proceeds in the reverse sequence of the booting<br />

and initialization process.<br />

D. Performance<br />

To provide an overview of the performance, FFTs of three<br />

sizes (4’096, 16’384 and 65’536 data points) were calculated<br />

using the implementation scenarios a-d. The Cortex-R5 can<br />

only perform 4’096 point FFTs due to its limited amount of<br />

OCM.<br />

Fig. 3. Software stack for Asymmetric Multi-Processing (as in [11])<br />



Table I shows the achieved calculation times in<br />

microseconds, comprised of the times for loading the input<br />

data, executing the FFT and storing the result in memory. We<br />

also show the feasible sampling rates that would allow the<br />

CPUs to keep up a seamless processing of the input data.<br />

TABLE I. SOFTWARE FFT PERFORMANCE<br />

Scenario            time (µs) / max. rate (MSa/s)<br />
                    4’096          16’384         65’536<br />
a  A53              320 / 12       1600 / 10      8670 / 7<br />
b  A53 NEON         290 / 14       1460 / 11      7760 / 8<br />
c  4x A53           120 / 33        511 / 32      2770 / 23<br />
d  4x A53 NEON      114 / 35        434 / 37      2290 / 28<br />
e  R5 (a)          1455 / 3          -- / --        -- / --<br />
(a) R5 time includes OpenAMP communication overhead (approx. 100 µs)<br />
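The "max. rate" column follows directly from the measured times: for seamless processing, the CPU must finish one N-point FFT before the next N samples arrive, so the feasible rate is N divided by the calculation time. A one-line helper makes this explicit (values below are taken from Table I):<br />

```c
#include <assert.h>
#include <math.h>

/* Feasible streaming sample rate for seamless processing: the N-point
 * FFT must complete within the time it takes to acquire N new samples,
 * so rate_max = N / t_fft.  Samples per microsecond equals MSa/s. */
static double max_rate_msa(double n_points, double t_us) {
    return n_points / t_us;
}
```

For scenario a this gives 4096 / 320 µs ≈ 12.8 MSa/s, which the table rounds down to 12 MSa/s.<br />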

It is evident from the data that the FFT scales well for<br />

multiprocessing. When using four A53 cores, a speed-up of up to a factor of 3.3 is observed. Scaling is better for large FFTs because there are more calculation steps that do not require all-to-all communication.<br />

More detailed tests have been run with one to four cores,<br />

but for the sake of brevity, the data is not shown here. In<br />

summary, the speedup corresponds almost directly to the<br />

number of cores as long as there remains at least one core for<br />

execution of the processes of the operating system. When all<br />

cores are used for multi-processing, the speedup is capped. A<br />

reasonable explanation is that the FFT is competing with the<br />

other processes, resulting in many context switches.<br />
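As a back-of-envelope check, assuming Amdahl's law holds (a simplification that ignores the OS contention discussed above), the observed 3.3x speedup on four cores implies that roughly 93% of the FFT runtime is parallelizable:<br />

```c
#include <assert.h>

/* Amdahl's law gives S(n) = 1 / ((1 - p) + p/n) for parallel fraction p
 * on n cores.  Solving for p from an observed speedup S:
 *     p = (1 - 1/S) / (1 - 1/n)
 * This is a rough model only, not a measurement from the paper. */
static double parallel_fraction(double speedup, double n_cores) {
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_cores);
}
```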

Using the NEON instruction set, a speedup of roughly 10%<br />

is observed for single-processing. For multi-processing,<br />

enabling NEON yields a speed gain of 5-20%. This is nowhere near the factor of two that could be expected, given that two values can be processed at the same time. One likely reason is that the NEON unit has its own execution pipeline and registers: if the algorithm cannot be expressed entirely in NEON instructions, data must be transferred between NEON and standard registers.<br />

The R5 is clearly not designed for computationally heavy<br />

tasks; its purpose is to react to events in real time. It<br />

has to be noted that the communication overhead of the<br />

OpenAMP framework contributes approximately 100 µs to the<br />

execution time, but even if this overhead could be avoided, the<br />

R5 would be no competition for the A53s.<br />

IV. ACCELERATORS IN FPGA<br />

Traditionally, bringing an algorithm to programmable logic<br />

means writing HDL code or using an existing IP core. Today,<br />

there are tools to generate HDL code from software code.<br />

Xilinx provides the SDSoC (Software Defined System on<br />

Chip) toolchain that generates logic blocks from your C-code<br />

along with the required data transfer logic. To allow your<br />

software to interface with this computation block, SDSoC compiles a software library with the required functions for configuring the FFT core, loading and storing data, as well as the necessary interrupt service functions.<br />

We have compared the performance of the following<br />

scenarios:<br />

f. SDSoC-Accelerator controlled by A53<br />

g. FFT IP-core controlled by A53<br />

h. FFT IP-core working standalone<br />

Scenario f: An SDSoC accelerator core can only be<br />

implemented for a fixed FFT size. Therefore, three accelerators<br />

were implemented in the programmable logic, clocked at<br />

300 MHz. The big advantage of performing the FFT in<br />

programmable logic is that the processor core can perform<br />

other tasks in the meantime. The processor will still be required<br />

for loading and storing the data. Fig. 4 shows the block<br />

diagram of this setup.<br />

Scenario g: The Xilinx FFT IP-core can be configured for<br />

different FFT sizes at runtime. One instance of the FFT core is<br />

therefore sufficient for our tests. The processor will load the<br />

input data into on-chip memory in the FPGA fabric, then start<br />

the FFT core and finally transfer the processed data back to<br />

DDR memory. These transfers can be done by DMA, leaving<br />

the CPU free for other tasks. This setup is shown in Fig. 5.<br />

Scenario h: If the input data is acquired in the FPGA fabric,<br />

there is no sense in transferring the data to DDR memory first,<br />

then loading it to FPGA on-chip memory for the FFT. Instead,<br />

the FFT core is configured for constant, standalone operation<br />

on the input data stream and a DMA stream is set up for<br />

transfer of the output data to a range of reserved DDR memory,<br />

where the APU can retrieve the processed data for analysis<br />

(See Fig. 6).<br />

Fig. 4. SDSoC accelerators in FPGA.<br />

Fig. 5. FFT IP block, controlled by A53<br />



efficiency reasons, this is best done in on-chip memory<br />

(BRAM).<br />

Fig. 6. FFT IP block, self-controlled<br />

Table II shows the achieved execution times for FPGA-accelerated FFT on the Zynq UltraScale+ MPSoC. Scenarios f and g have similar performance, showing that the SDSoC-generated HDL code is efficient and compares well to the manually optimized HDL code of the FFT IP.<br />

TABLE II. FPGA FFT PERFORMANCE<br />

Scenario            time (µs) / max. rate (MSa/s)<br />
                    4’096          16’384         65’536<br />
f  SDSoC            108 / 38       410 / 38       1720 / 38<br />
g  IP-A53           101 / 41       400 / 41       1680 / 39<br />
h  IP-standalone     51 / 250      202 / 250       807 / 250<br />

Fig. 7. Partitioning FFT, parallel processing in FPGA<br />

The more efficient data path in scenario h, which omits the transfer of the input data from DDR to on-chip memory, easily explains the performance difference between scenarios g and h. In fact, the limiting factor for the sampling rate is the<br />

DMA transfer rate from FPGA fabric into DDR memory.<br />

V. COMBINING FPGA AND PROCESSING SYSTEM<br />

The results in Sections III and IV show clearly that the<br />

FPGA easily outperforms the pure software implementations.<br />

Nevertheless, there are limits to the size of FFT that can be<br />

executed in the FPGA. The FFT IP core can process up to<br />

65’536 points for one FFT. The SDSoC toolchain would allow creating accelerator functions for larger FFT sizes, but the<br />

available FPGA resources (DSP slices and on-chip memory)<br />

would be exhausted quickly.<br />

We have explored ways for the FPGA and processing<br />

system to collaborate in processing the FFT. The goal was a<br />

65’536 point FFT, using fewer resources in the FPGA while<br />

still maintaining good performance.<br />

As shown in Fig. 1, the FFT algorithm is divided into clearly<br />

defined steps that can be processed in separate units. Our idea<br />

was to process the first steps of the FFT in the FPGA grid, then<br />

transfer the data into processor memory and do the remaining<br />

steps in software, as shown in Fig. 7. The FFT would be split<br />

into four 16’384 point FFTs in the FPGA. These smaller FFTs<br />

can be processed either in parallel or in series.<br />

Parallel processing requires four FFT cores with four times<br />

the resource usage. For serial processing (Fig. 8), only one FFT<br />

core is implemented, but the data of the three remaining FFTs<br />

must be stored until the core is ready for processing. For<br />

Fig. 8. Partitioning FFT, serial processing in FPGA<br />

We found that the amount of BRAM resources used is<br />

similar for both the parallel and the serial approach. Table III<br />

shows the number of BRAM blocks and DSP slices used.<br />

Values in parentheses show the percentage of all available<br />

resources. Because the amount of data to be stored or<br />

processed is the same as for a 65’536 point FFT, our approach<br />

even uses roughly the same amount of BRAM as the full<br />

65’536 point FFT core. With BRAM being the most limited<br />

FPGA resource for this application, there is no gain from<br />

partitioning the FFT between FPGA and processing system.<br />

Furthermore, the FFT calculation needs to be finished in the<br />

processing system, adding more latency and resource usage to<br />

the bill.<br />



TABLE III. RESOURCE REQUIREMENTS OF PARTITIONED FFT<br />

Scenario       Configuration       BRAM        DSP<br />
Parallel FFT   4x 16k FFT          232 (27%)   180 (7%)<br />
Serial FFT     1x 16k FFT & BRAM   244 (25%)    45 (2%)<br />
Full FFT       1x 64k FFT          238 (27%)    54 (2%)<br />

VI. DISCUSSION<br />

For the FFT, we have shown that the FPGA fabric is able to<br />

perform several times faster than the complete processing<br />

system of the Zynq UltraScale+ MPSoC. This power can be<br />

harvested in several ways, be it as stand-alone FFT processor<br />

or as an external accelerator function.<br />

Depending on the amount of processing to be done apart<br />

from the FFT, doing the whole transform in the processing<br />

system can also be an option, leaving more room in your<br />

FPGA.<br />

The decision where to execute an algorithm depends on<br />

many factors, such as:<br />

• Where does your data originate? Try to keep it local, reducing the amount of data transfer.<br />

• What are the required data rates? Can the amount of data be transferred over the L3 interconnect without interfering with the remaining processes?<br />

• How well can your algorithm be split up and processed in parallel? The more an algorithm can be parallelized, the better the FPGA will perform in comparison to the processing system.<br />

• How many FPGA resources can you spare for your algorithm?<br />

In the end, it falls to the system architect to choose where and how the data is to be processed. A deep<br />

understanding of the algorithm and both processing system and<br />

FPGA hardware is required.<br />

REFERENCES<br />

[1] P. N. Swarztrauber, "Multiprocessor FFTs," Parallel Computing, Vol. 5,<br />

Issues 1–2, pp. 197-210, 1987.<br />

[2] J. Sánchez-Curto, P. Chamorro-Posada, "On a faster parallel<br />

implementation of the split-step Fourier method," Parallel Computing,<br />

Vol. 34, Issue 9, pp. 539-549, 2008.<br />

[3] T. H. Cormen, D. M. Nicol, "Performing out-of-core FFTs on parallel<br />

disk systems," Parallel Computing, Vol. 24, Issue 1, pp. 5-20, 1998.<br />

[4] E. Chu, A. George, "FFT algorithms and their adaptation to parallel<br />

processing," Linear Algebra and its Applications, Vol. 284, Issues 1–3,<br />

pp. 95-124, 1998.<br />

[5] S. Xue, J. Wang, Y. Li and Q. Peng, "Parallel FFT implementation<br />

based on multi-core DSPs," 2011 International Conference on<br />

Computational Problem-Solving (ICCP), Chengdu, pp. 426-430, 2011.<br />

[6] R. Lyons, "Computing large DFTs using small FFTs", [Online]:<br />

https://www.dsprelated.com/showarticle/63.php<br />

[7] ARM Ne10 Project [Online]: https://projectne10.github.io/Ne10/<br />

[8] kissFFT [Online]: https://sourceforge.net/projects/kissfft/<br />

[9] FFTW3 [Online]: http://www.fftw.org/<br />

[10] M. Frigo, S. G. Johnson, "The Design and Implementation of FFTW3,"<br />

Proc. IEEE, vol. 93, no. 2, pp. 216-231, 2005.<br />

[11] Xilinx User Guide UG1211, "Zynq UltraScale+ MPSoC Software<br />

Acceleration Targeted Reference Design", [Online]:<br />

https://www.xilinx.com/support/documentation/boards_and_kits/zcu102<br />

/2017_2/ug1211-zcu102-swaccel-trd.pdf. Xilinx, Inc. 2017<br />




Analyzing the Generation and Optimization of an<br />

FPGA Accelerator using High Level Synthesis<br />

Sebastian Kaltenstadler<br />

Ulm University<br />

Ulm, Germany<br />

sebastian.kaltenstadler@missinglinkelectronics.com<br />

Stefan Wiehler<br />

Missing Link Electronics<br />

Neu-Ulm, Germany<br />

stefan.wiehler@missinglinkelectronics.com<br />

Ulrich Langenbach<br />

Beuth University of Applied Sciences<br />

Berlin, Germany<br />

ulrich.langenbach@beuth-hochschule.de<br />

Abstract—Multi-Processor System-on-Chip FPGAs can utilize<br />

programmable logic for compute-intensive functions, using so-called Accelerators, implementing a heterogeneous computing<br />

architecture. Thereby, Embedded Systems can benefit from the<br />

computing power of programmable logic while still maintaining<br />

the software flexibility of a CPU. As a design option to<br />

the well-established RTL design process, Accelerators can be<br />

designed using High-Level Synthesis. The abstraction level for<br />

the functionality description can be raised to algorithm level<br />

by a tool generating HDL code from a high-level language like<br />

C/C++. The Xilinx tool Vivado HLS allows the user to guide the<br />

generated RTL implementation by inserting compiler pragmas<br />

into the C/C++ source code. This paper analyzes the possibilities<br />

to improve the performance of an FPGA accelerator generated<br />

with Vivado HLS and integrated into a Vivado block design. It<br />

investigates how much the pragmas affect the performance and resource cost, and shows problems the tool has with coding style.<br />

I. INTRODUCTION<br />

Heterogeneous computer architectures are an increasingly popular way for modern computing systems to further increase computing power. There are multiple ways to compensate<br />

for the stagnation of single-core CPU performance, ranging from instruction set extensions, multi-core processors and<br />

GPUs to coprocessors on FPGAs. Cryptographic and hashing<br />

functions for example can be accelerated on an FPGA. The<br />

advantages of an implementation on an FPGA are almost<br />

ASIC-like computing performance, quick adaption to new<br />

protocols and standards as well as low energy consumption<br />

[7]. To develop such a coprocessor, the logic of the algorithm<br />

has to be described with VHDL/Verilog on Register Transfer<br />

Level. This is complicated because of the high level of detail<br />

and the high susceptibility to errors due to the low level of<br />

abstraction on Register Transfer Level. To lower development<br />

time, one can raise the abstraction level to Algorithm level<br />

through High-Level Synthesis (HLS). Instead of a complicated<br />

description of the accelerator in VHDL/Verilog, High-Level<br />

Synthesis uses standard C/C++ code to describe the logic.<br />

The HLS-Tool Vivado-HLS offers compiler pragmas to further<br />

define the hardware architecture of the C/C++ code. With those<br />

pragmas, the developer can create different implementations<br />

of the same algorithm without touching the functionality by<br />

just inserting or deleting one line in the source code. With<br />

those capabilities, Vivado-HLS can be used for design space<br />

exploration. In this paper, an FPGA coprocessor is used to<br />

accelerate AES encryption and decryption calls from the Linux<br />

Kernel Crypto API.<br />

II. DEFINITIONS AND ABBREVIATIONS<br />

A. Definitions<br />

This paragraph specifies how common FPGA build flow<br />

terms are used in this paper.<br />

Synthesis is the whole process of High-Level Synthesis. It<br />

basically summarizes all design steps from Vivado HLS.<br />

Implementation summarizes all steps from Vivado.<br />

Vivado Synthesis is the Synthesis step inside of the Implementation.<br />

Bitstream is the output of the implementation. It is used to<br />

program the FPGA.<br />

B. Abbreviations<br />

This paragraph gives a short summary of all abbreviations<br />

used in this paper.<br />

AES stands for Advanced Encryption Standard. See section<br />

III-A for an explanation.<br />

BRAM stands for Block RAM. BRAM is one of the resources<br />

on an FPGA. BRAMs are arranged in slices of 36 KBit.<br />

FF stands for Flip Flop. FFs are one of the resources on an<br />

FPGA.<br />

HLS stands for High-Level Synthesis. See section III-C for a<br />

brief explanation.<br />

II stands for Initiation Interval. See section III-B for an<br />

explanation.<br />

LUT stands for Lookup table. LUTs are one of the resources<br />

on an FPGA. They build the logic gates inside of the<br />

FPGA.<br />

RTL stands for Register-Transfer Level.<br />



III. BASICS<br />

This chapter gives a short summary of the most important<br />

basics of the paper. It explains how AES works, what the<br />

design steps are within Vivado HLS and how optimization<br />

with HLS works.<br />

A. AES<br />

The Advanced Encryption Standard (AES) [4] is an encryption<br />

algorithm designed by Joan Daemen and Vincent Rijmen, standardized in 2001, and one of the most important encryption algorithms<br />

today. It is a symmetric block cipher, which means it uses the same key for encryption and decryption. A block cipher encrypts and decrypts blocks of data of a constant size; in this case, the block size is 128 bits or 16 bytes. These<br />

16 Bytes are arranged in a 2-dimensional 4x4-array called<br />

state. The algorithm consists of four base operations repeated<br />

in multiple rounds. These operations are AddRoundKey, SubBytes, MixColumns and ShiftRows. AddRoundKey is a simple XOR of the state with the round key. SubBytes replaces all bytes of the state according to a substitution<br />

box, called S-Box. In this paper, the SubBytes operation was<br />

implemented using arrays with precomputed values as Lookup<br />

tables (this does not mean the LUT hardware resource on the<br />

FPGA, but the basic concept of Lookup tables in software).<br />

ShiftRows cyclically shifts each row of the state according to its row number. MixColumns mixes the 4 bytes of each column so that every input byte affects every output byte. Like SubBytes, MixColumns is implemented using precomputed arrays as lookup tables. To encrypt more than 16 bytes, an operation mode is<br />

required. In this work Cipher Block Chaining (CBC) is used.<br />

In this mode, the ciphertext of a block depends on the plaintext<br />

and the ciphertext of the previous block. This data dependency<br />

cannot be resolved; thus the encryption cannot be pipelined.<br />

In decryption, however, this dependency does not exist, which<br />

enables pipelining.<br />
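The CBC data dependencies become visible in a few lines of C. The single-byte "block cipher" below (an XOR with the key) is a deliberately trivial stand-in for AES, used only to expose the loop structure:<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 1-byte "block cipher" -- a stand-in for AES, not real crypto. */
static uint8_t enc(uint8_t p, uint8_t key) { return (uint8_t)(p ^ key); }
static uint8_t dec(uint8_t c, uint8_t key) { return (uint8_t)(c ^ key); }

/* CBC encryption: c[i] = E(p[i] XOR c[i-1]).  Each iteration needs the
 * PREVIOUS ciphertext, so the loop cannot be pipelined. */
static void cbc_encrypt(const uint8_t *p, uint8_t *c, size_t n,
                        uint8_t key, uint8_t iv) {
    uint8_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        c[i] = enc((uint8_t)(p[i] ^ prev), key);
        prev = c[i];
    }
}

/* CBC decryption: p[i] = D(c[i]) XOR c[i-1].  All inputs (the ciphertext
 * blocks) are known up front, so the iterations are independent and can
 * be pipelined. */
static void cbc_decrypt(const uint8_t *c, uint8_t *p, size_t n,
                        uint8_t key, uint8_t iv) {
    for (size_t i = 0; i < n; i++)
        p[i] = (uint8_t)(dec(c[i], key) ^ (i == 0 ? iv : c[i - 1]));
}
```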

B. Performance of digital circuits<br />

To evaluate the results of High-Level Synthesis, we need<br />

to measure the performance of an FPGA accelerator which<br />

is determined by its throughput. The throughput is influenced<br />

by many different characteristics of the accelerator, which are<br />

listed below:<br />

Clock period is the time period of one clock cycle. All<br />

registers in the design are connected to the same clock<br />

to synchronize read and write operations.<br />

Blocksize is the amount of data that can be read and computed<br />

at once. The unit is bit or byte. In this paper, the blocksize<br />

is 128 bit or 16 Byte.<br />

Latency is the number of clock cycles after reading data until<br />

the result is available at the output registers.<br />

Initiation Interval is the number of clock cycles after reading<br />

a block, until the circuit can read new data.<br />

With these quantities the throughput can be computed as<br />

shown in eq. (1).<br />

BW = BS_total · f / (L_init + L_single + II · (n − 1))    (1)<br />

Fig. 1. Design flow of Vivado HLS [2].<br />

with the throughput BW, the total amount of data BS_total, the clock frequency f, the latency of the initialization process L_init, the latency for a single block of data L_single, the initiation interval II and the total number of data blocks n, which is BS_total / 16 bytes. Without pipelining, the initiation interval<br />

and the latency for a single block are the same, so the terms<br />

are interchangeable. Since the initialization latency is constant at around 1000 clock cycles, it does not influence the throughput for large amounts of data. This simplifies the formula to eq. (2).<br />

BW = BS_total · f / (II · n)    (2)<br />

This formula assumes that there is no input data stream stall<br />

when processing a stream at the input of the accelerator and<br />

the data is read from the output immediately, so it does not<br />

get slowed down by back pressure. The result of this formula<br />

is used as a metric for the performance of the generated<br />

accelerators.<br />
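Eq. (2) can be worked through with illustrative numbers (not measured values from this paper): with a 16-byte blocksize, a 200 MHz clock and an initiation interval of 1, the ceiling is 16 B × 200 MHz = 3.2 GB/s, and an initiation interval of 10 divides that by ten.<br />

```c
#include <assert.h>
#include <math.h>

/* Throughput per eq. (2): BW = BS_total * f / (II * n).  Since
 * BS_total = n * blocksize, the n cancels and this reduces to
 * blocksize * f / II. */
static double throughput_bytes_per_s(double blocksize_bytes,
                                     double f_hz, double ii) {
    return blocksize_bytes * f_hz / ii;
}
```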

C. High-Level Synthesis with Vivado HLS<br />

High-Level Synthesis is the Synthesis of a hardware description<br />

on Register-Transfer-Level (RTL) from a description<br />

on algorithm level. In this paper, we generate an AES accelerator<br />

from C/C++ source code with Vivado HLS. For a<br />

more detailed introduction to HLS see [6]. Figure 1 shows<br />

the design flow of Vivado-HLS. The source code can be<br />

any C/C++ implementation, as long as it makes no<br />

use of dynamic memory allocation. It is recommended to<br />

use a generic implementation instead of an optimized one<br />

for a special compiler or processor. Since an FPGA works<br />

differently than a normal CPU, optimizations for a CPU are<br />

not suited for an FPGA and might even worsen the results. The<br />

correctness of the code at the algorithm level can be checked<br />

with the C-Simulation using a testbench. Now, the interface of<br />

the accelerator has to be declared with the Interface pragma.<br />

399


It describes how the interface has to be generated from the<br />

parameters of the top level function and which type of bus<br />

has to be used. In our case, we used an AXI-Stream interface<br />

for the data input and output with an additional AXI-Lite<br />

interface for control signals like starting the encryption or<br />

changing the encryption key. Once the code is synthesizeable,<br />

it can be optimized using additional pragmas. A list of all used<br />

optimization pragmas and a short description is given below:<br />

Loop Tripcount lets the user specify a minimum and maximum<br />

number of iterations for a loop. This does not<br />

influence the synthesis, but helps to get a precise latency<br />

estimation.<br />

Array Partition partitions an array in multiple smaller arrays.<br />

The default behavior is to only generate one input<br />

and output port for every array. By partitioning it into<br />

smaller arrays with an input and an output port for each<br />

sub-array, the manipulation of single cells of the array<br />

can be parallelized.<br />

Loop Unroll generates multiple instances of the code body of<br />

a loop. If there are no data dependencies between loop<br />

iterations, they can be parallelized by creating multiple<br />

instances of the body.<br />

Pipeline creates a pipelined architecture for a specified function.<br />

This increases the throughput by reducing the initiation<br />

interval.<br />

Inline eliminates the hierarchy level of sub-functions and<br />

dissolves their logic into the logic of the caller function.<br />

The default behavior of Vivado HLS generates one<br />

module for every function in the source code with a submodule<br />

for every sub-function. The logic optimization<br />

only works inside a module on one hierarchy level. By<br />

eliminating those borders with inlining, it is easier for the<br />

optimization to simplify and shorten the RTL description.<br />
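As a sketch of how these pragmas appear in source, an AddRoundKey-style step with the directives above (the function and array names are illustrative, not the code used in this paper). `#pragma HLS ...` is the real Vivado HLS directive syntax; an ordinary C compiler ignores unknown pragmas, so the same source still compiles and runs in software:<br />

```c
#include <assert.h>
#include <stdint.h>

/* AddRoundKey-style step annotated with Vivado HLS pragmas.  In HLS,
 * ARRAY_PARTITION gives every state byte its own register port, UNROLL
 * instantiates 16 parallel XOR operations, and PIPELINE II=1 lets a new
 * block enter every clock cycle.  A plain C compiler ignores the pragmas
 * and executes the loop sequentially with identical results. */
static void add_round_key(uint8_t state[16], const uint8_t round_key[16]) {
#pragma HLS ARRAY_PARTITION variable=state complete dim=1
#pragma HLS PIPELINE II=1
    for (int i = 0; i < 16; i++) {
#pragma HLS UNROLL
        state[i] ^= round_key[i];
    }
}
```

Inserting or deleting one such pragma line changes the generated architecture without touching the functionality, which is what makes this style of design space exploration cheap.<br />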

Before starting the C-Synthesis, we need to specify a target frequency at which the accelerator should operate.<br />

The C-Synthesis generates the VHDL/Verilog code from the<br />

C/C++ source code. The RTL code can be verified using<br />

the C/RTL-Co-Simulation. Now the code gets packaged and<br />

exported with the IP-Packager and can later be included into<br />

a Vivado block design. More details on Vivado HLS can be<br />

found in [2].<br />

IV. TEST SETUP<br />

Figure 2 shows the complete design flow with Vivado HLS<br />

and Vivado. A detailed view of the Synthesis is depicted<br />

in fig. 1. Both tools, Vivado and Vivado HLS, were used<br />

in version 2016.2. Newer versions could not be used because<br />

of incompatibilities with the hardware driver seen in<br />

fig. 4. Starting point is a C/C++ implementation of the AES-<br />

Algorithm. The one used for this paper can be found at [8].<br />

After running through all the steps explained in section III-C<br />

Vivado HLS returns estimations for the resource demand and<br />

the performance of the accelerator, including the initiation<br />

interval and latency. The user can go through the whole hierarchy of the design and see these estimations for every single sub-module.<br />

Fig. 2. Design flow using Vivado HLS and Vivado [3].<br />

The generated IP Core is part of the block design displayed<br />

in fig. 3 at the position highlighted in red. The clock is set to<br />

the estimated clock given by Vivado-HLS. After implementation,<br />

Vivado returns the actual resource costs and a timing<br />

analysis, including signal paths that fail the timing constraints.<br />

Performance tests are conducted on a Xilinx ZC706 board,<br />

featuring a Zynq 7045 SoC. Through a custom device driver,<br />

explained in [1], the Linux OS on the processing system<br />

(PS) of the Zynq SoC on the board is able to accelerate<br />

AES calls of the Linux Kernel Crypto API with the FPGA<br />



Fig. 3. Vivado block design [1].<br />

Fig. 4. Test setup with connection to the Crypto API [1].<br />

accelerator. Figure 4 shows the whole test setup. This does<br />

not always work; there are two different types of failure<br />

that we observe. The first one is the creation of incorrect<br />

logic. When the driver is loaded, the Crypto-API verifies the<br />

correctness with a testbench. If the generated logic is incorrect,<br />

this returns an error message stating that the ciphertext is<br />

wrong. With the other type of failure, the measurement halts at initialization. Since both the driver and the functionality stay the same for all tests, this points to<br />

a failure in the Synthesis. In the results section we do not distinguish between the two types of failure and only state whether the<br />

design passed the test or not.<br />

The accelerator is generated with different sets of optimization<br />

pragmas in five tests. Each test contains 9 designs with<br />

the same set of pragmas, but a different target frequency. It<br />

ranges from 100 MHz to 260 MHz in steps of 20 MHz. For<br />

target frequencies higher than 260 MHz, the design always<br />

fails to meet the timing constraints, so these implementations/frequencies<br />

are not considered in the evaluation. The<br />

different optimizations are as follows:<br />

Test 1 contains no pragmas for optimizing the architecture.<br />

The Interface pragma is inserted, because it is necessary<br />

for the tool to synthesize the IP core. Also the Loop<br />

Tripcount pragma is inserted to generate more precise<br />

estimation results for performance. With this pragma, the<br />

user is able to define a maximum and minimum amount<br />

of loop iterations for the latency estimation.<br />

Test 2 contains the Array Partition pragma. By default, for<br />

every array in the source code, the Synthesis generates<br />

one BRAM with only one read and write port. This<br />

decreases the performance, because the single array elements<br />

can only be read or modified in a sequential<br />

manner. By partitioning the array into multiple smaller<br />

memory blocks with a read and write port for each block,<br />

access to different array elements can be parallelized.<br />

Test 3 extends Test 2 with added Loop Unrolling pragma. It<br />

creates multiple instances of the loop body to calculate<br />

the results in parallel as long as there are no data<br />

dependencies in between the loop iterations.<br />

Test 4 extends Test 3 with added Pipeline pragma. The<br />

Pipeline pragma allows the user to define an Initiation<br />

Interval for the pipeline. We used Initiation Intervals of 10, 3 and 1 clock cycles to test the influence of different<br />

intervals on the performance and the resource cost.<br />

Test 5 extends Test 4 with an Initiation Interval of 1 clock<br />
cycle, with the Inline pragma added. The Inline pragma shifts<br />
the inlined function to a higher hierarchy level and<br />
eliminates hierarchical borders. This does not optimize<br />
the architecture directly, but the following optimization<br />
step in the Synthesis has much more freedom to combine<br />
operations and to remove unnecessary registers<br />
between operations, which reduces latency and resource<br />
cost.<br />
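As an illustration of how these pragma sets appear in HLS C++ source, the sketch below combines the pragmas of Tests 1 to 5 around a placeholder computation. The function and port names are hypothetical and the loop body is not the evaluated AES implementation; only xtime is real AES arithmetic (the GF(2^8) doubling from [4]). An ordinary C++ compiler ignores the HLS pragmas, so the code also runs as plain software.

```cpp
#include <cstdint>

// GF(2^8) doubling (xtime) from AES [4]; the INLINE pragma of Test 5 would
// dissolve this function boundary so the scheduler can merge it into the
// caller and drop the registers at the former interface.
static uint8_t xtime(uint8_t x) {
#pragma HLS INLINE
    return (uint8_t)((x << 1) ^ ((x & 0x80u) ? 0x1B : 0x00));
}

// Placeholder round-style step over the 16-byte state (NOT the real AES
// round). ARRAY_PARTITION (Test 2) splits the state into registers instead
// of a single BRAM; UNROLL (Test 3) replicates the loop body so all 16
// byte operations can be scheduled in parallel.
void round_step(uint8_t state[16], const uint8_t round_key[16]) {
#pragma HLS ARRAY_PARTITION variable=state complete dim=1
    for (int i = 0; i < 16; ++i) {
#pragma HLS UNROLL
        state[i] = (uint8_t)(xtime(state[i]) ^ round_key[i]);
    }
}

// Block loop: INTERFACE (Test 1) selects the bus protocol per port,
// LOOP_TRIPCOUNT (Test 1) bounds the latency estimate only, and PIPELINE
// with an explicit II (Test 4) lets a new block enter every II clock cycles.
void process_blocks(uint8_t *blocks, const uint8_t round_key[16], int n) {
#pragma HLS INTERFACE axis port=blocks
#pragma HLS INTERFACE s_axilite port=n
    for (int b = 0; b < n; ++b) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=64
#pragma HLS PIPELINE II=1
        round_step(&blocks[16 * b], round_key);
    }
}
```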

Since the goal is to test the capabilities of the tool and<br />
compare the influence of different optimizations, the optimizations<br />
focus on the decryption of AES. Encryption and<br />
decryption consist of the same operations in a different order,<br />
but with the operation mode CBC, pipelining is only possible<br />
for the decryption. Section V therefore contains only the<br />
results of the decryption, since the rest (encryption and key<br />
expansion) is not included in the optimization process.<br />

V. RESULTS<br />

The highest RTL hierarchy level of the decryption always<br />

looks the same, no matter which optimizations were applied.<br />

The RTL description is displayed in Fig. 5. While the two<br />

blocks CBC-XOR and AES-Decryption are directly influenced<br />

by the C/C++ source code, the FSM (finite state machine) is<br />

automatically generated.<br />

The tool always returns a specific estimation for resource costs<br />

and performance. In the following paragraphs, the ranges given<br />

for some values are the minimum and maximum values<br />

for the 9 different target frequencies.<br />

A. Test 1: no pragmas<br />

The code without any optimization pragmas generates an<br />

implementation with low resource costs.<br />

Fig. 5. Synthesized Logic AES-CBC-Decryption<br />

The estimated demand of LUTs ranges from 3654 to 3660, the amount of FFs<br />

ranges from 592 to 1155 and 9 BRAM slices are required.<br />

While the estimation of the BRAM is accurate, the actually<br />

required amount of LUTs and FFs is lower than the estimation.<br />

The required amount of LUTs ranges from 593 to 638, the<br />

required amount of FFs from 573 to 689. The minimal latency<br />

ranges from 1011 to 1683, the maximal latency from 1371 to<br />

2235 clock cycles, the initiation interval is identical. This leads<br />

to an estimated throughput of up to 3.06 MB/s according to<br />

eq. (2) for a target frequency of 260 MHz. The maximum for<br />

the target frequency seems to be around 180 MHz, since all<br />

higher frequencies fail to meet the timing constraints after the<br />

implementation. The actual measurement of the throughput<br />

failed for all designs in Test 1.<br />

B. Test 2: array partitioning<br />

Due to the added array partitioning, the estimated and<br />

the actual performance and resource requirements rise. The<br />

BRAM estimation and the actually required amount rise to<br />

40 BRAM slices. The estimated number of LUTs ranges from<br />

4918 to 5044, the FFs from 905 to 1946. The minimal latency<br />

ranges from 416 to 895 clock cycles, the maximum from<br />

562 to 1203. The highest estimated throughput is 6.55 MB/s<br />

for a target frequency of 240 MHz. The actual resource<br />

requirements are again lower than the estimation. The LUT<br />

demand ranges from 2363 to 2679, the FF demand from 880<br />

to 1212. All designs passed the functionality test, the highest<br />

throughput was 5.16 MB/s for a target frequency of 240 MHz.<br />

C. Test 3: loop unrolling<br />

The loop unrolling further increases the performance and<br />

the resource costs. The estimated number of LUTs drops to a<br />

range from 4602 to 4883. This is because the loop calculations<br />

now happen in parallel. Before, there was only one instance,<br />

which was reused for every loop iteration. That required<br />

a control logic which disappears for parallel computation.<br />

The amount of registers rises, since every register has to be<br />

duplicated for every parallel path. This leads to an increase of<br />

the estimated FFs to a range of 1590 to 2529. The required<br />

BRAM stays the same, since the methods that require BRAM<br />

do not contain loops. The minimal latency ranges from 50<br />

to 131, the maximum from 70 to 182. The actual amount of<br />

LUTs ranges from 1774 to 2338, the FFs range from 1592 to<br />

2012. The BRAM demand and estimation are identical. Apart<br />

from a target frequency of 180 MHz, all designs passed the<br />

measurement test with the highest throughput of 26.15 MB/s<br />

for a target frequency of 120 MHz.<br />

D. Test 4: pipelining<br />

1) Initiation Interval = 10 clock cycles: Pipelining only<br />

has a small impact on the minimal latency, but the maximal<br />

latency is now equal to the minimal latency. This is necessary<br />
because pipelining requires a constant latency. For an initiation<br />

interval of 10 clock cycles it ranges from 44 to 131 clock<br />

cycles, the initiation interval is the specified 10 clock cycles.<br />

The LUTs estimation ranges from 7519 to 9693, the estimated<br />

FFs range from 1982 to 4247. The BRAM estimation rises to<br />

80 slices. Since pipelining requires additional instances of all<br />

base operations, the resource consumption increases. 189 MHz<br />

seems to be the maximum for the estimated frequency, since<br />

no design achieves a higher frequency. The highest estimated<br />

throughput is 301 MB/s. After implementation, the required<br />

amount of LUTs is between 2617 and 4028. The required FFs<br />

range from 1736 to 3820. Apart from a target frequency of<br />

180 MHz, all designs pass the measurement test. The highest<br />

measured throughput is 27.24 MB/s for a target frequency of<br />

100 MHz. This is significantly lower than the estimation of<br />

301 MB/s. The reason for this is explained in section VI, since<br />

it influences all pipelined designs.<br />

2) Initiation Interval = 3 clock cycles: The resource costs<br />

increase again, since the synthesis now creates 5 instances of<br />

all base operations. The latency ranges from 44 to 88 clock<br />

cycles. The LUTs estimation ranges from 17499 to 22541, the<br />

FFs from 2381 to 7097. The BRAM estimation rises to 200<br />

slices. The maximum for the estimated frequency seems to<br />

be 189 MHz again, the maximal estimated throughput is 1006<br />

MB/s. The implementation only needs between 4800 and 6635<br />

LUTs and between 2271 and 3915 FFs. Apart from a target<br />

frequency of 180 MHz all designs pass the measurement test.<br />

The highest throughput was achieved by the design with a<br />

target frequency of 140 MHz with 31.28 MB/s, which is again<br />

significantly lower than the estimations.<br />

3) Initiation Interval = 1 clock cycle: There are now 14<br />

instances of every base operation. This leads to a LUT estimation<br />

between 44790 and 45047 and a FF estimation between<br />

3077 and 15603. The BRAM estimation and utilization rise<br />

to 528 slices. The latency drops to a range from 42 to 84 clock<br />

cycles, the highest estimated throughput is 3019 MB/s for<br />

target clocks between 180 and 260 MHz. The actually required<br />

amount of LUTs ranges from 4653 to 5366, the FFs from 1085<br />

to 6857. Target frequencies from 120 to 180 MHz fail, the rest<br />

passes the measurement test. The highest throughput is 30.36<br />

MB/s.<br />

E. Test 5: Inlining<br />

Inlining the base operations decreases the latency to a range<br />

from 28 to 68 clock cycles. The estimated amount of LUTs<br />
ranges from 13660 to 15327 and the estimated FFs range<br />

from 3431 to 13239. The estimated BRAM usage stays at<br />

528 slices. The maximal estimated throughput is 93 MB/s for<br />

a target frequency of 140 MHz when taking into account that<br />

pipelining does not work. The actual LUT usage ranges from<br />

5713 to 8163, the FFs from 3288 to 9918. No design passed<br />

the measurement test.<br />

VI. ANALYSIS<br />

All pipelined designs have an increased resource utilization<br />

in comparison to the designs without pipelining due to<br />

additional instances. The analysis with the Integrated Logic<br />

Analyzer shows that there is always valid data at the input<br />

and the output always waits for the accelerator to read data.<br />

So the problem does not originate from the environment, but<br />

from the accelerator itself.<br />

As an example, we investigate the design of Test 4 with an<br />

initiation interval of 10 clock cycles and a target frequency<br />

of 100 MHz. This design should achieve up to 181.81 MB/s,<br />

but the measurement only shows a throughput of 27.24<br />

MB/s. The synthesis shows that it created an additional<br />

instance of every base operation and they still exist after the<br />

implementation. Pipelining with an initiation interval of 10<br />
clock cycles means that a single data block may occupy one<br />
instance for at most 10 clock cycles.<br />

AES consists of many rounds, as explained in section III-A,<br />

which consist of the four base operations AddRoundKey,<br />

SubBytes, MixColumns and ShiftRows. Depending on the<br />

key size there are between 10 and 14 complete rounds per<br />

data block. The scheduling diagram in Vivado HLS shows<br />

that SubBytes and MixColumns are occupied for 2 clock<br />

cycles in each round. This sums up to between 20 and 28<br />
busy cycles each for SubBytes and MixColumns per data block.<br />
Divided between the two generated instances of each base<br />
operation, this still results in 10 to 14 clock cycles against an<br />
initiation interval of 10 clock cycles. So at least 3 instances<br />
of every base operation are needed to actually enable<br />
pipelining with an initiation interval of 10 clock cycles.<br />
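This instance arithmetic reduces to one ceiling division. The sketch below is a back-of-the-envelope model of the reasoning above, not tool output; the numbers follow the worst case of 14 rounds at 2 busy cycles per round (28 busy cycles per data block).

```cpp
// Best achievable initiation interval with `instances` copies of an
// operation that is busy `busy_cycles` clock cycles per data block:
// ceil(busy_cycles / instances), computed with integer ceiling division.
int achieved_ii(int busy_cycles, int instances) {
    return (busy_cycles + instances - 1) / instances;
}
```

For 28 busy cycles, the two generated instances give an achievable initiation interval of 14, missing the target of 10, while three instances give exactly 10.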

For the same reason, the design with an initiation interval<br />

of 3 clock cycles would need at least 14 instances of every<br />

base operation. For an initiation interval of 1 clock cycle<br />

and a target frequency of 100 MHz the High-Level Synthesis<br />

generates 14 instances of all operations plus an additional<br />

AddRoundKey instance for the first round. Because the<br />
operations SubBytes and MixColumns are always occupied<br />

for 2 clock cycles at a time, it is still not possible to achieve<br />

the requested initiation interval.<br />

To check whether this problem is specific to the tool version,<br />
we repeated the synthesis step of Test 4 with tool version<br />
2017.2, which was the newest available version at the<br />
time. The synthesis results, however, stayed the same: the<br />
resource estimations were identical, as was the generated<br />
architecture. Even with this version, there were always too few<br />
instances of the base operations to enable pipelining.<br />

The goal of this paper was to evaluate the High-Level<br />

Synthesis with Vivado HLS. The generation of an FPGA<br />

accelerator with High-Level Synthesis is faster and easier<br />

compared to writing HDL code. It is possible to generate a<br />

completely different architecture for the accelerator within<br />

60 to 90 minutes including synthesis and implementation.<br />

Optimization pragmas like array partitioning and loop<br />

unrolling work just like they are supposed to. This enables<br />

the user to generate accelerators that are faster than most<br />
software solutions for suitable problems such as encryption or hash<br />
algorithms.<br />

Pipelining, however, does not provide the throughput it should.<br />
Inlining can even change the logic and thus break the<br />
design if applied in the wrong place. The tool seems to<br />
depend heavily on the correct coding style. During the whole<br />
optimization process, only one error message occurred,<br />
stating that the placed optimization pragma does not work at<br />
its position. This was when trying to pipeline the encryption<br />
despite the operation mode making it impossible. In every<br />
other case, the tool reported a correct synthesis; the problems<br />
were only observed when actually loading the design onto an<br />
FPGA and measuring the actual throughput. This behavior<br />
is not reliable and not usable for real-life applications. The<br />
problems with pipelining and inlining keep the user from<br />
creating high-performance, high-throughput designs.<br />

Another, less important problem is that the resource<br />
estimations, especially for the LUTs, are far higher than<br />
the actual usage after implementation. A possible reason<br />
for this could be an incorrect state machine which creates<br />
unnecessary states and logic that later get optimized away<br />
by the logic optimization during implementation.<br />

VII. OUTLOOK<br />

There are multiple ways of progressing with this work. In<br />
this paper, we changed the target frequency for the High-Level<br />
Synthesis. Keeping the same design and increasing<br />
the frequency at the implementation step by step could show<br />
how accurate the frequency estimation is and give another<br />
possibility to increase the throughput. One could also compare<br />
the results of different implementations to find the best coding<br />
style for High-Level Synthesis. One could also try out single<br />
optimization pragmas to see the individual effect of the<br />
pragmas, but the results of this step highly depend on the<br />
algorithm and its implementation, so there would not be a<br />
general knowledge gain from these tests.<br />

VIII. CONCLUSION<br />

In this work we showed how to generate an AES FPGA<br />

accelerator with High-Level Synthesis. It turned out to be<br />

faster and easier compared to standard RTL development<br />

with VHDL/Verilog. Especially for developers who are not<br />
experienced in RTL development, it is a way to still profit<br />
from the compute power of an FPGA. It is also useful for<br />
design space exploration, since it is possible to generate<br />
a completely different architecture within minutes by just<br />
inserting or removing a compiler pragma.<br />



However, the tool is not yet ready to be used for real-life<br />
applications. A main flaw is its dependence on coding style.<br />
It is probably necessary to write code optimized for High-Level<br />
Synthesis, comparable to code optimized for special<br />
processors. Yet this would take away the biggest advantage<br />
of being easy to use. In the current state it is not possible for<br />
the user to generate reliable high-performance, high-throughput<br />
designs.<br />

Another problem is the inaccuracy of the resource estimations,<br />
apart from the BRAM estimations. The estimations were<br />
always too high; the real LUT usage is only a fraction of<br />
the estimation. This would make it possible to implement a<br />
design even though the tool estimates a usage of more than<br />
100%. In the current state, the tool has to improve its<br />
reliability before it can be integrated into a professional real-life<br />
workflow.<br />

REFERENCES<br />

[1] S. Wiehler, CPU-Offloading von Transformationsfunktionen aus dem Linux-Kernel, 2016.<br />
[2] Xilinx, Vivado Design Suite User Guide: High-Level Synthesis (UG902), v2016.2, 2016.<br />
[3] Xilinx, https://www.xilinx.com/content/dam/xilinx/imgs/applications/isolationdesign-flow/idf-flowchart.jpg.thumb.319.319.png, accessed 16.1.2018.<br />
[4] National Institute of Standards and Technology, FIPS PUB 197: Advanced Encryption Standard (AES), 26.11.2001.<br />
[5] Xilinx, Vivado Design Suite: AXI Reference Guide (UG1037), 15.6.2017.<br />
[6] P. Coussy, D. D. Gajski, M. Meredith and A. Takach, An Introduction to High-Level Synthesis, 2009.<br />
[7] A. DeHon, Fundamental Underpinnings of Reconfigurable Computing Architectures, 3.3.2015.<br />
[8] kokke, https://github.com/kokke/tiny-AES128-C, 2017.<br />



Test automation for reengineered modules using test<br />

description language and FPGA<br />

T. Krawutschke, G. Hartung, N. Kopshoff<br />

Faculty of Information, Media and Electrical Engineering<br />

TH Köln<br />

Köln, Germany<br />

M. Schulze, G.B. Faluwoye, C. Hoffmann<br />

OTL Elektrotechnik und Audio<br />

Bonn, Germany<br />

Abstract— Reverse engineering is a very important technique<br />

in systems where the overall system lifetime is much longer than<br />

the lifetime of its electronic and digital components. Obsolete<br />

components (e.g. processors from the 1970s, digital logic<br />

components) are replaced with FPGAs. The systematic test of<br />
these reverse engineered devices is the subject of a project<br />
carried out by scientists of TH Köln and engineers of the<br />
company OTL, which specializes in reverse engineering.<br />

Keywords—Reverse Engineering, Test Automation,<br />

Obsolescence<br />

I. INTRODUCTION<br />

The process of reverse engineering includes the<br />

identification of system parts and their interrelationships with<br />

the aim to develop a representation of the system in another<br />

form or at a higher level of abstraction [1].<br />

A reverse engineered electronic device (e.g. a board that is<br />
part of a system assembled from different devices in a rack)<br />
that is operated in railway vehicles needs approval before it<br />
can be used for transportation. Several engineering standards<br />
have to be considered. EN 50126 is one of them and<br />
defines the lifecycle of an electronic device or system. The<br />
extended V-Model considers a phase of “change,<br />
retrofit/upgrade” (see Fig. 1). Reverse engineering is part of this<br />

phase, when the requirement documents and engineering plans<br />

of the original devices are missing.<br />

Fig. 1. Extended V-Model (DIN EN 50126)<br />

Several standards related to<br />
the development of safety critical systems demand a complete<br />

chain of documentation of the development process. The<br />

combination of a Test Description Language (TDL) and a Test<br />

Automaton improves test coverage and documentation<br />
compared to manual testing.<br />

FPGAs are used in two application areas: first, as the target<br />
technology for a replacement device to replace obsolete logic<br />
components and microcontrollers; second, as a measurement<br />
tool and pattern generator for the Test Automaton. The general<br />
process of reverse engineering and the usage of an FPGA as<br />
target technology are shown in the next chapter. The third<br />
chapter introduces the TDL and the Test Automaton, using<br />
FPGAs for the data processing of measurements and stimulus<br />
generation. The fourth chapter describes the collaboration of<br />
the software interpreting the TDL and the hardware of the Test<br />
Automaton. The last chapter gives an example of a reverse<br />
engineered device that is part of a railway slide protection<br />
system.<br />

II. GENERAL REVERSE ENGINEERING PROCESS WITH FPGA<br />

The aim of the reverse engineering process with FPGA as<br />

target technology is the development of a model that is<br />

functionally equivalent to the original device. The reverse<br />

engineering process of a digital electronic device without<br />

support of TDL and a Test Automaton consists of five steps (as<br />

illustrated in Fig. 2):<br />

The first step involves the analysis of the behavior of the<br />
original circuit and the digital components, which requires a test<br />
setup with a pattern generator and a logic analyzer. By<br />
stimulating the original device and measuring its reactions<br />
(measured values of outgoing signals), a set of test cases is<br />
produced which is used for verification of the next steps.<br />

In the second step, the analysis of the circuit and the logic<br />
components of the original device produces a simulatable<br />
VHDL model of the device. This model may not be fully<br />
synthesizable, which means that the describing VHDL code (or<br />

parts of it) cannot be transferred to an FPGA. The simulatable<br />
model should represent the whole electronic device including<br />
parts that are external to the FPGA. Analog parts that<br />
interface to the digital circuit may be emulated to achieve a<br />
simulation model with the same pin-out as the original<br />
device. For this initial step it is sufficient to have a purely<br />
simulative model. This VHDL model becomes the Device<br />
Under Test (DUT).<br />

Fig. 2. General design flow reverse engineering<br />

A Testbench that stimulates the DUT using the test cases<br />
from step 1 is created in the third step. The hierarchical<br />
concept of VHDL allows the arrangement of the derived model<br />
as a subordinate of the Testbench. The Testbench connects to the<br />
inputs and outputs of the DUT. In general it contains a list of<br />
stimuli that are passed to the inputs of the DUT, and it evaluates<br />
the responses of the model that are observed at its outputs.<br />
Moreover, the Testbench contains assertions about the<br />
expected behavior. The evaluation of the assertions is reported<br />
as a simple pass/fail expression (see Fig. 3). If differences show<br />
up, they may result either from a misunderstanding of the<br />
original device, which leads to a change in the VHDL model, or<br />
from an uncritical derivation of timings, which leads to a change<br />
in the assertions of the Testbench. The result of this step is a<br />
validated VHDL model of the original device and a set of<br />
refined test cases that can be used to test the reverse<br />
engineered device.<br />

Fig. 3. VHDL represents the reverse engineered original device<br />

The design and partitioning of the reverse engineered<br />
device is part of the fourth step. Until this step the HDL that<br />
describes the FPGA is independent of the FPGA vendor. With<br />
the definitive selection of the hardware, vendor specific<br />
constraints are added. These cover pin assignment, logic<br />
levels, clock settings etc. Should the selected FPGA component<br />
become obsolete later in the lifecycle, a replacement with<br />
another FPGA is possible. The VHDL description stays<br />
unchanged; the constraints are adapted to the new FPGA<br />
component.<br />

The vendor tool synthesizes, maps and fits the VHDL and<br />
constraint files to the FPGA. A post-fitting simulation reflects<br />
the expected timing behavior of the programmed FPGA. The<br />
test cases are now used to validate the synthesized design<br />
against the desired behavior before a prototype is built.<br />
Finally, a validation of the reverse engineered device is done<br />
with the same setup and test cases that were developed for the<br />
validation in the first step (see Fig. 4).<br />

III. REVERSE ENGINEERING WITH TDL AND A TEST AUTOMATON<br />

The process of reverse engineering outlined in the last<br />
chapter is a proven method. It takes extensive effort in<br />
every step. Moreover, there are two major difficulties within it:<br />

• The process is error prone due to the difficulties in<br />

describing test cases correctly. VHDL pattern files are<br />

not really a good documentation of a test case.<br />

Fig. 4. The prototype of the reverse engineered device is tested<br />


Fig. 5. Testbench and test definition are derived from a single TDL document<br />

• The tests of the reverse engineered device as well as<br />
of the original device are done by hand. Besides the<br />
effort of doing several tests manually, it raises the<br />
serious question of how these tests are documented in a<br />
manner which is acceptable for the required test chain<br />
(see section I).<br />

To overcome these difficulties we defined a Test Description<br />
Language (TDL) and a Test Automaton which allow<br />
digital devices to be tested automatically.<br />

The Test Description Language (TDL) is defined in<br />
accordance with ETSI Standard ES 203 119-1 [2]. A TDL file<br />
contains one or more test case definitions. Different test<br />
descriptions can be implemented in a single TDL file, e.g. to<br />
define tests that deliberately fail in order to provoke an intended failure<br />
report. A test description itself describes the stimuli and the<br />
expected responses of a test sequence. The TDL syntax is<br />
implemented as a Domain Specific Language (DSL [5]) which<br />
is processed by a parser and code generators as follows: the<br />
Xtext [3] framework is used in conjunction with the abstract<br />
syntax tree defined in accordance with the ETSI standard to<br />

// begin declarative part, define participating instances<br />

TDLan Specification achskarte{<br />

...<br />

Test Configuration ak_cf{<br />

instantiate AKT as DUT of type hardware having {<br />

...<br />

instantiate TB_a as Tester of type TB having{<br />

...<br />

// Test Automaton Hardware I/Os<br />

SignalAdapter Configuration de0_nano_output {<br />

attach geber_def 0 downto 0 to position 4 downto 4;<br />

attach fl 1 downto 0 to position 2 downto 1;<br />

logiclevel TTL;<br />

...<br />

// connect instances<br />

connect gate stim_out to gate eight_bits_in;<br />

// begin test description<br />

Test Description test_geberdef {<br />

use Test configuration: ak_cf{<br />

...<br />

TB_a sends bit value of b0 to gate f1; gate f1 waits for<br />

(877 microseconds);<br />

...<br />

AKT sends bit value of b0 to gate geber_def;<br />

...<br />

Fig. 6. Example TDL Code<br />

implement the parser for the TDL DSL. The code generator<br />
realized on top of the parsed structure creates the VHDL<br />
Testbenches and test definition files (see Fig. 5).<br />
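As a hypothetical sketch of what such a generator backend might emit (not the project's actual Xtend templates; all signal names are illustrative), one parsed TDL "sends ... waits" pair could be turned into a VHDL testbench fragment driving the signal plus a matching assertion:

```cpp
#include <sstream>
#include <string>

// Emits a VHDL testbench fragment for one stimulus: drive `signal` to
// `value`, wait `wait_us` microseconds, then assert the expected value.
std::string emit_stimulus(const std::string& signal, char value, int wait_us) {
    std::ostringstream vhdl;
    vhdl << "    " << signal << " <= '" << value << "';\n"
         << "    wait for " << wait_us << " us;\n"
         << "    assert " << signal << " = '" << value << "'\n"
         << "        report \"" << signal << " mismatch\" severity error;\n";
    return vhdl.str();
}
```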

The Test Automaton is a real hardware device. It consists<br />
of different Signal Processing Modules (SPM) for measuring or<br />
stimulus generation under the control of a PC, on which the files<br />
describing the tests and the measurements are archived. Each<br />
SPM contains an FPGA/memory combination for data<br />
processing and an interface to transfer measurement or stimulus data<br />
from/to the control PC (see Fig. 9). The SPMs are synchronized<br />
by a common clock and trigger distribution unit. The control<br />
PC is not only used as the data storage but also as the central<br />
documentation storage, since every test action is logged.<br />

To integrate the test scenario into the test description<br />
documents, we introduced a declarative part syntax in the TDL<br />
(see Fig. 6). This allows describing the test members and their<br />
setup. The interfaces of the DUT are declared and the required<br />
SPMs are configured and instantiated. The connection between<br />
Test Automaton and DUT is defined, so that the tester knows<br />
how to connect the DUT to the SPMs and, moreover, the VHDL<br />
Testbench has input and output ports matching the VHDL<br />
model of the original device. The last two lines of the TDL<br />
example code in Fig. 6 show examples for the test<br />
description, containing one stimulus and one assertion part of<br />
the TDL. With these elements, the TDL description is not only<br />
capable of defining the complete test scenario for a reverse<br />
engineered device but is also a key element in the provable and<br />
documented process of testing it.<br />

IV. TEST PROCESS WITH TDL AND THE TEST AUTOMATON<br />

The process of reverse engineering stays similar to the<br />
one described before (see Fig. 7): the TDL file is created during<br />
the analysis part of the reverse engineering process and used<br />
for creating VHDL test benches for simulations and for providing<br />
data for the measurements with the real hardware (reverse<br />
engineered and original device).<br />

With our Xtext/Xtend based generators, several files (the VHDL<br />
Testbench, stimuli files used by the pattern generators, and a<br />
converter to ‘translate’ measured data into VHDL form) are<br />
generated that support the different tests (shown in Fig. 9).<br />

While step 2, the analysis and creation of a first model of<br />

the reverse engineered device, is unchanged, in step 3 we use<br />

the generated Testbench to simulate the VHDL model that was<br />

derived from the original device. The generated assertions are<br />

evaluated by the VHDL simulator which collects results in the<br />

simulation report. The same is done in step 4: the post-fitting<br />

VHDL model is tested using the generated Testbench.<br />

In step 5 the measurement is prepared by generating several<br />
files with the test pattern generator from the TDL stimulus data<br />
for the different SPMs. The Xtext generators have already mapped<br />
the stimuli to the output ports (defined in the declarative part of<br />
the TDL) and converted them to the exchange format for the<br />
SPMs. Before the testing starts, the stimulus data sets are<br />
uploaded to the dedicated output SPMs of the Test Automaton.<br />
A global trigger signal starts the pattern generation in parallel<br />
on all SPMs and activates the input SPMs to capture the<br />

measurement data during the test process. After the completion<br />
of the test sequence the binary measurement data is<br />
downloaded from the input SPMs to the control PC. Since the<br />
measured data is distributed over several files for each SPM,<br />
the data is merged into a single file and converted to the Value<br />
Change Dump (VCD, see [4]) format based on the TDL test<br />
definition.<br />

The VCD file containing the measurements is used for the<br />
validation of the measured data. The assertions are already<br />
present in the common Testbench used to simulate the<br />
behavioral VHDL model in the third step. To include the VCD<br />
data in the simulation, an interfacing VHDL file is needed. As<br />
already mentioned, the Xtext generator prepared this interfacing<br />
VHDL file using the input and output port naming defined in<br />
the TDL file in the first step. A simulation run with the<br />
simulator will now check the results of the measurements and<br />
generate a report stating whether the measured device<br />
behaves as expected.<br />

The use of the VCD data format has an additional benefit:<br />
the measurement can easily be displayed in a waveform<br />
viewer for inspection and the waveforms can be used for<br />
documentation of the test results.<br />

Fig. 7. Reverse engineering flow with TDL<br />
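A minimal sketch of the VCD text format mentioned above (see [4] for the full specification): a header declaring the signals, followed by timestamped value changes. The signal name below is borrowed from the TDL example and is illustrative; a real SPM merge would emit one $var entry per captured DUT pin.

```cpp
#include <sstream>
#include <string>

// Builds a minimal VCD file as a string: header, one wire declaration,
// and two timestamped value changes.
std::string make_vcd() {
    std::ostringstream vcd;
    vcd << "$timescale 1 ns $end\n"
        << "$scope module dut $end\n"
        << "$var wire 1 ! geber_def $end\n"   // '!' is this signal's id code
        << "$upscope $end\n"
        << "$enddefinitions $end\n"
        << "#0\n"        // t = 0 ns: signal low
        << "0!\n"
        << "#877000\n"   // t = 877 us (cf. the wait in the TDL example)
        << "1!\n";
    return vcd.str();
}
```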

V. EXAMPLE OF A REVERSE ENGINEERED DEVICE<br />

The slide protection system of a city railway consists of<br />
several devices, some of them containing 8-bit<br />
microcontrollers introduced in the 1970s. Here, the focus is set<br />
on one of the devices that measures and evaluates the speed of<br />
an individual axle of the railcar. The speed is measured by a<br />
pulse encoder and the resulting frequency is transformed and<br />
digitized in the device. If the value of deceleration is too high,<br />
the device assumes imminent sliding of the train wheels and<br />
influences the pneumatic brake system by triggering valves to<br />
reduce the brake force.<br />

Since some of the original components are obsolete, the company OTL decided to reverse engineer the device using an FPGA, which replaces nearly all of the original digital components, including the microprocessor, which is mapped onto a freely available emulation. Neither the original documentation nor the software source was available, so the reverse engineering process was difficult. At first, several test cases<br />

were developed to stimulate the device with different speed<br />

profiles and the resulting valve activation sequences were<br />

noted. Additionally, the embedded software was examined to<br />

find the stimulation sequences of the different programmed<br />

reactions on speed changes. Test cases for the self-test feature<br />

and the failure detection were developed.<br />

The reverse engineered device contains an FPGA for the<br />

digital logic including the microcontroller emulation running<br />

the original software. The input electronics interfacing to the pulse encoder and the output electronics controlling the valves are almost unchanged. A synthesizable VHDL model of the FPGA contents was developed and embedded in a (not synthesizable) VHDL model of the whole device. This allows the simulation of the complete device.<br />

Fig. 8. Original device (left) and prototype<br />

An adapter board was developed to interface the original<br />

and the reverse engineered device to the Test Automaton to<br />

shift voltage levels and emulate the valves and their possible<br />

failures like short circuit or cable break. The detection of these<br />

failures is a function of the device that is tested.<br />

Individual tests were carried out with a prototype of the Test Automaton using the original and the reverse engineered devices as DUT, as well as in simulation with testbenches derived from TDL.<br />

VI. SUMMARY AND OUTLOOK<br />

We showed a test concept that integrates all test definitions<br />

in a single TDL file. Testbenches, stimuli and the evaluation of<br />

the measured data are derived from this single source to<br />

maintain consistency during the whole reverse engineering<br />

process. The simulation used to develop the prototype is tested<br />

with the same test data as the original device and the prototype<br />

of the reverse engineered device. Each simulation and<br />

measurement is documented by a test report. The usage of a<br />

DSL and code generators lets us automatically build the different test representations from a single TDL file.<br />

A useful extension for the data management and<br />

organization would be the addition of a database with a data model specialized for testing, as realized in ASAM ODS [6].<br />

The proof of test coverage and automatic test generation would<br />

be desirable features but would require additional research in the<br />

field of testing. To ease the process of approving devices with<br />

embedded software, one can think of replacing simple<br />

programs with state machines and a finite state space to<br />

achieve test coverage of 100%.<br />

ACKNOWLEDGMENT<br />

We would like to thank the Federal Ministry for Economic Affairs<br />

and Energy of Germany for supporting this project and Gudrun<br />

Neumann from TÜV Saar for guidance in the process of<br />

approving safety related electronic devices.<br />

REFERENCES<br />

[1] Elliot J. Chikofsky. Reverse Engineering and Design Recovery: A<br />

Taxonomy. IEEE Software, 1990.<br />

[2] ETSI ES 203 119 v1.1.1. Methods for Testing and Specification (MTS);<br />

The Test Description Language (TDL); Specification of the Abstract<br />

Syntax and Associated Semantics. European Telecommunications<br />

Standards Institute (ETSI).<br />

[3] Xtext. http://www.eclipse.org/Xtext; Eclipse.<br />

[4] VCD, IEEE Computer Society: 1364-2001 IEEE Standard Verilog<br />

Hardware Description Language.<br />

[5] DSL, Martin Fowler. Domain Specific Languages. 1st edition. Addison<br />

Wesley, 2010. ISBN: 0321712943, 9780321712943.<br />

[6] ASAM ODS, Association for Standardization of Automation and<br />

Measuring Systems, Open Data Services, www.asam.net<br />

Fig. 9. Software and hardware architecture of the Test Automaton<br />



Hardware Deceleration<br />

The Challenges of Speeding up Software<br />

Kris Chaplin – Embedded Technology Specialist<br />

Intel Programmable Solutions Group<br />

Holmers Farm Way<br />

High Wycombe, Buckinghamshire, UK<br />

Kris.chaplin@intel.com<br />

Abstract— Developing a custom ASIC, or designing for a SoC<br />

FPGA, gives us the potential to create very specific accelerators<br />

to speed up software bottlenecks. However, this is not without its<br />

challenges. How do you account for cached data and translation<br />

from virtual to physical addresses when moving data payloads<br />

from user space into the hardware? Moving data from the SoC<br />

FPGA to the accelerator and back potentially has a significant<br />

software overhead before accelerators can be started. This paper<br />

will discuss techniques and mechanisms to allow hardware<br />

accelerators to accelerate, rather than slow down, a system (even<br />

accounting for the potential overhead required).<br />

Keywords—FPGA; acceleration; OpenCL; HLS; VHDL;<br />

Verilog<br />

I. INTRODUCTION<br />

FPGA devices can often play a vital role in accelerating<br />

functions that cannot be performed quickly or efficiently<br />

enough in software. In many cases, the bespoke nature of<br />

accelerators can make custom designs faster and more power<br />

efficient than their software counterparts. There is, however, no single acceleration technique that will work for all<br />

scenarios, and a broad range of acceleration methodologies can<br />

be used.<br />

This paper addresses some techniques that can be used to<br />

determine an accelerator strategy, and highlights some of the<br />

pitfalls that can be encountered should a sub-optimal<br />

acceleration strategy be used.<br />

II. DEFINING AN ACCELERATION STRATEGY<br />

It is vital that the system architect understand the data flow<br />

throughout their system to make informed choices about where<br />

acceleration would make sense versus where it would not.<br />

Also, there are different techniques that can be used to<br />

implement acceleration, each with their own unique benefits<br />

and drawbacks. This can quickly cause confusion and<br />

indecision. In some cases, offload to hardware can cause a<br />

reduction in performance, so it is in no way a ‘one size fits all’<br />

solution.<br />

When architecting a system for which performance is<br />

critical, one approach is to take an existing optimized algorithm<br />

or system and choose a faster processor to improve<br />

performance further. This has in the past been valid, with<br />

Moore’s law allowing for a steady increase in performance<br />

over time. However, this exponential increase in performance<br />

is not limitless [1] and cannot alone solve all increased<br />

performance needs. When looking for breakthrough<br />

performance gains, it can sometimes not be enough to just ‘go<br />

faster’. In some cases, no faster solution exists, or is too<br />

expensive. It is then vital that alternative acceleration solutions<br />

are explored.<br />

Software profiling techniques, with tools such as the GNU<br />

Profiler (GPROF) [2] or Arm® DS-5 [3], are key to<br />

understanding which parts of a given algorithm are taking the<br />

most time to execute, and which architectural features of a<br />

processor are being used at any moment in time. By<br />

understanding these performance bottlenecks, it is possible to<br />

identify candidates for software optimization or acceleration.<br />

Once target functions or workloads are identified, the<br />

architect can then decide on which hardware acceleration<br />

technique would be the most appropriate and efficient.<br />

Hardware acceleration implementation techniques can be<br />

broadly divided into several categories:<br />

A. Bump in the wire / pre-processing / post-processing<br />

If an input/output interface is presented to the FPGA/ASIC<br />

rather than directly into the processor, then there is the<br />

possibility of performing work on the external data stream<br />

without direct involvement of the processor. This is known as<br />

pre- or post-processing.<br />

Fig. 1. Pre-processing data prior to CPU input<br />

www.embedded-world.eu<br />



For example, in a video system, functions such as deinterlacing,<br />

scaling, color conversion and format conversion<br />

can be performed on the incoming video stream before a Direct<br />

Memory Access (DMA) controller copies the data into a<br />

framebuffer. In this way, those memory-intensive functions<br />

can be completely removed from the CPU workload.<br />

Another example would be streaming Ethernet traffic. If<br />

this data was input via FPGA pins, then firewall or routing<br />

specifications could be implemented in FPGA hardware before<br />

authorized packets are received by the processor.<br />

1) Potential for performance gain<br />

For a data stream where it is feasible to pre-process data such that it reduces the CPU workload, the performance gains should be relatively predictable. In addition, as the FPGA has<br />

a real-time, deterministic architecture, the data stream can<br />

benefit from these characteristics, with real-time tasks<br />

performed with clock-cycle accuracy.<br />

2) Potential for degradation in performance<br />

Pre- or post-processing of data will add to the latency<br />

between the processor and the I/O. In some instances, the<br />

latency of the transaction is critically important to the overall<br />

system performance, and needs to be minimized.<br />

B. Tightly coupled instruction set extension<br />

Some soft processor cores have a facility to allow for the extension of the instruction set into the fabric of an FPGA. This can be via a bespoke custom instruction interface [4] or<br />

via a dedicated FIFO channel from the processor core [5].<br />

These interfaces allow for a custom instruction set to be<br />

created, and in some cases direct access to the register file of<br />

the processor.<br />

This form of acceleration is appropriate to very fine-grained<br />

acceleration, where simple register inputs and outputs are used,<br />

such as a binary operation, rotation or hash.<br />

1) Potential for performance gain<br />

If the accelerated function would be used frequently by<br />

critical code in the system, then an acceleration can be<br />

achieved. Tight code loops can be accelerated compared to a<br />

multi-instruction approach due to the close proximity of the<br />

accelerator to the CPU (in or close to the instruction pipeline). Ideally, all data should be internal to the processor; any external data fetches may slow down the custom instruction to the point where software can be just as fast. These interfaces also tend to run at the same clock speed as the processor, and as such the instruction needs to be designed so as not to become the critical timing path of the design.<br />

2) Potential for degradation in performance<br />

In architectures where the custom instruction can cause a CPU stall, it is important to ensure that the efficiency of the CPU pipeline is maintained. If the custom instruction has a dependency on an external data source, it could stall for a long time waiting for a new data value.<br />

Fig. 2. Custom Instruction Accelerator connected to soft CPU<br />

C. Tightly-coupled memory-mapped accelerator<br />

Processor systems implemented within FPGA architectures have the advantage of locally exposing processor system buses to the FPGA fabric. This allows for tight integration of a memory-mapped interface for custom acceleration logic. In the case of soft processors, the latency of such interfaces can be in the order of a few clock cycles. Connecting to faster hard processor cores can have a latency of some tens of CPU clock cycles, due to clock domain crossing and CPU interconnect latencies.<br />

Memory-mapped accelerators have the advantage of being potentially asynchronous to the CPU core. In contrast to custom instruction implementations, the coupling with the processor is looser, and as such the accelerator can run in a different clock domain and be far more complex.<br />

Fig. 3. Memory Mapped Accelerator via FPGA Bridge in SoC device<br />

1) Potential for performance gain<br />

If data can be streamed to the accelerator, with the results being read later, then its pipeline can be filled and performance is maximized. Further performance gains can be achieved if the accelerator acts as a bus master or is filled by a DMA engine, as this further offloads the CPU from driving the bus transactions. In this scenario the CPU can work in parallel on other tasks while waiting for the acceleration transaction to complete. The accelerator can have local memory for data storage (parameters/intermediate values) and can also stream data results directly back to memory without direct CPU involvement.<br />

D. External, bus-connected accelerator card<br />

When the FPGA and processor are on physically separate boards, chips or die, an interface needs to be established between the components. Industry-standard memory-mapped interfaces such as PCI Express® can be used to couple the FPGA accelerator to the host processor system. With vendor-specific acceleration hardware, it is also possible to use bespoke interfaces to connect at lower latency and provide cache-coherent interfaces.<br />

1) Potential for performance gain<br />

Where data can be constantly streamed over the interface to<br />

the accelerator card, and resultant calculations returned, the<br />

latency of the link only serves as an initial latency, and once<br />

the pipeline is filled, full-bandwidth use can be made of the<br />

accelerator.<br />

2) Potential for degradation in performance<br />

Compared to on-chip interfaces, board-to-board standards<br />

such as PCI Express generally have higher latency. This is due<br />

to the serial nature of the protocol as well as the transaction<br />

overhead. As such, it is even more important to be able to<br />

understand and compensate for this latency in the design of the<br />

data flow between CPU and accelerator. Additionally, as a<br />

shared resource, external system buses can have lower<br />

performance when the interconnect is heavily loaded from<br />

other masters.<br />

E. Cache-coherent accelerator with on-chip processor<br />

interface<br />

In any processor system that has more than one CPU core,<br />

there is an architectural consideration that needs to be<br />

addressed – cache coherency. With more than one processor<br />

under control of the same operating system, working on<br />

common workloads, it is likely at times that data that had been<br />

modified on one CPU core will need to be worked on by another. For performance reasons, it is highly likely that this<br />

data is cacheable, such that the data can be stored local to the<br />

CPU, and so both level 1 caches (local to the processor) and<br />

level 2 caches (common between multiple processors) are<br />

enabled.<br />

A hardware mechanism is needed to maintain cache<br />

coherency across these cores. That is, if data is cacheable on<br />

both processors, and changes are made on both processors<br />

during the lifetime of the data, the changes need to be<br />

automatically communicated to each cache to maintain correct<br />

data integrity. This is the purpose of cache snooping<br />

mechanisms, under the control of a cache coherency unit<br />

(CCU).<br />

When an accelerator makes use of system memory to<br />

transfer data to the host CPU, it can make sense to extend the<br />

cache snooping mechanisms into the accelerator. In this way,<br />

data can be cache-enabled on the processor and does not need<br />

to be flushed to main memory before being re-fetched by the<br />

accelerator to do work.<br />

For architectures with high latency to external memory,<br />

certain hardware workloads could be impossible to accelerate<br />

without cache coherency due to the delays involved in the<br />

flushing of data prior to acceleration.<br />

Depending on the architecture, some FPGAs with on-die<br />

processors allow for the accelerator to participate in cache coherency with the processor. By using such interfaces, the user can mitigate some of the performance limitations associated with flushing data to memory. An Accelerator Coherency Port (ACP) allows an arbitrary master to participate in the Snoop Control Unit's (SCU) view of cacheable memory.<br />

1) Potential for performance gain<br />

Participation in cache coherence can have major<br />

performance benefits in some instances compared to memory-mapped systems. In a system that is to offload part of a<br />

process to hardware, time would usually be taken flushing<br />

cached data to a memory device, to hand over payloads to the<br />

accelerator. By enabling the accelerator to directly access the<br />

L2 cache, and participate in snooping of L1 CPU local caches,<br />

this flushing operation need not happen.<br />

2) Potential for degradation in performance<br />

Participation with the caches needs to be managed to prevent<br />

‘cache thrashing’. If large payloads are being moved through<br />

the ACP interface, and the data is not already present in L1 or<br />

L2 cache, then the L2 Cache unit would fetch new cache lines<br />

to serve the request. If the ACP-connected accelerator is<br />

reading megabytes of data through this interface, then the<br />

cache will soon be filled, and then re-filled with data to service<br />

the requests. On return to normal CPU operation, previously<br />

cached instructions and/or data will be lost and time will be<br />

taken to refresh the caches with commonly-used data. This can<br />

be mitigated to a certain extent using cache way locking based<br />

on master, however this would then reduce the overall cache<br />

size available to each master in the system.<br />

III. VIRTUALIZATION AND ITS EFFECTS ON HARDWARE ACCELERATION<br />

Virtualization of operating systems is increasingly common<br />

– especially in multi-user environments, and in systems that<br />

require more than one operating system to function.<br />

Fig. 4. Accelerator card connected to Host CPU Via PCI Express link<br />

Fig. 5. Cache-coherent accelerator via ACP<br />



At a very general level, one of the ideas behind<br />

virtualization is to allow for multiple operating systems to run<br />

on a given hardware architecture under control of a hypervisor.<br />

The hypervisor sets up the system and the facilities available to each guest operating system, and routes interrupts, service calls and exceptions appropriately. Depending on the type of<br />

hypervisor used, the guest OS may not need to have any<br />

specific awareness of being run in a virtual environment.<br />

With the mechanisms that allow for virtualization to<br />

function, there is a need for an additional level of address<br />

decoding. A guest operating system such as Linux makes use<br />

of virtual addressing, which through the memory management<br />

unit (MMU) is converted to a physical address. However, with<br />

the addition of a hypervisor, this physical address is in fact<br />

defined as an intermediate physical address (IPA). The<br />

hypervisor controls an additional level of address decode that<br />

allows this intermediate physical address to be further mapped<br />

into a final, actual physical bus address.<br />

The reason for this additional stage of decoding complexity<br />

is to allow for multiple operating systems to map system<br />

memory and peripherals to the same address (intermediate<br />

physical address), but in reality, not conflict in physical system<br />

memory or peripherals available to the system.<br />

This additional memory decode however is not without its<br />

challenges to hardware acceleration solutions – especially those<br />

that rely on accessing system memory to share data. Whilst the<br />

operating system may communicate an intermediate physical<br />

address to initiate a DMA transaction from system memory, a<br />

further decode is required to translate the IPA to a physical<br />

address. In software, this is an additional overhead that would<br />

need to be implemented, and this can therefore increase the<br />

delay for the data to initially become available to the hardware<br />

accelerator. In processor architectures that support<br />

virtualization, it may be possible to use a System Memory<br />

Management Unit (SMMU). One of the functions of a SMMU<br />

is to provide hardware functions to automate the lookup and<br />

decode of intermediate physical addresses into physical<br />

addresses, therefore offloading the processor and caching common table lookups locally in the hardware. The<br />

SMMU can potentially be used both by dedicated hardware<br />

blocks, such as DMA and Ethernet cores, as well as custom<br />

accelerator logic, such as an FPGA.<br />

Fig. 6. Virtual to physical address translation under hypervisor control<br />

IV. LANGUAGES THAT CAN BE USED FOR ACCELERATION<br />

The user has a choice in the input languages that can be<br />

used to develop FPGA custom accelerators.<br />

A. Hardware Description languages<br />

VHDL and Verilog are examples of Hardware Description<br />

Languages (HDLs) that can be synthesized into hardware at a<br />

low level. These are considered hardware-focused languages<br />

and have been used for decades to describe and implement<br />

ASIC and FPGA systems at a hardware description level. It<br />

can be argued that HDL languages give the developer the<br />

greatest control over the implementation of the resultant FPGA<br />

hardware, however as a low-level, descriptive language, the<br />

downside is the complexity and training needed to truly<br />

achieve this. In addition, some functional changes that appear<br />

trivial at a higher level of abstraction can cause heavily<br />

optimized HDL code to be dramatically different – especially if<br />

primitives are instantiated in the code.<br />

B. High Level Languages<br />

1) OpenCL<br />

OpenCL [6] is a programming framework based on the C<br />

language. A software engineer can use OpenCL to describe parallelism, writing kernels that describe their functionality with C and the OpenCL APIs. By using the features of OpenCL, the<br />

developer can describe ‘kernels’ that would provide the<br />

acceleration function. OpenCL enables developers to target<br />

different accelerator targets, such as GPU, CPU, DSP and<br />

FPGA with the same source code, however in reality<br />

optimizations would need to be made for each class of<br />

accelerator to get the most out of its architecture.<br />

In general terms, the OpenCL framework defines a host<br />

CPU and infrastructure for accelerator kernels to be<br />

implemented within the FPGA. As such, the OpenCL<br />

development environment tends to assume that the entire<br />

FPGA is available to the host as a resource. It is possible to<br />

create other HDL in an FPGA as part of the OpenCL Board<br />

Support Package, however this is an advanced use case, and<br />

requires knowledge of HDL languages and FPGA design<br />

methodology.<br />

2) C/C++<br />

C and C++ can be used to describe the functionality of an<br />

accelerator, and these accelerators can be compiled into a<br />

memory-mapped IP block for inclusion in an FPGA design.<br />

This allows direct use of the C language without OpenCL-specific extensions. The system designer will take the resultant<br />

output IP block implemented from the C/C++ code and<br />

integrate it into the FPGA processor design using system<br />

integration tools and/or HDL.<br />

3) Other languages such as MATLAB/Simulink<br />

Vendors such as MathWorks® [7] have an ecosystem and<br />

environment around bespoke design techniques such as<br />

MATLAB®, and the block-level design tool Simulink®.<br />

Tools exist to integrate designs developed in these tools into IP<br />

blocks that can be implemented in FPGA fabric and memory-mapped to the processor core as an accelerator.<br />



V. SUMMARY<br />

The performance improvement of a processor system with<br />

accelerators is greatly influenced by architecture choices. The<br />

architect needs to consider latency and data flow dependencies<br />

to make sound implementation choices. In that way pipeline<br />

bubbles that can negatively affect performance are minimized.<br />

REFERENCES<br />

[1] Williams, RS “The End of Moore’s Law – What’s next”, Computing in<br />

Science and Engineering Issue 2 Mar-Apr 2017<br />

[2] S. Graham, P. Kessler, M. McKusick “gprof: a Call Graph Execution<br />

Profiler”, Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction, pp. 120-126<br />

[3] R. Maiden “DPD Profiling and Optimization with Altera SoCs”; WP-<br />

01248-2.0 May 2016<br />

[4] Intel “Nios II Custom Instruction user Guide”; UG-N2CSTNST,<br />

December 2017<br />

[5] H. Rosinger "Connecting Customized IP to the MicroBlaze Soft<br />

Processor Using the Fast Simplex Link (FSL) Channel"; XAPP529<br />

(v1.3) May 12 2004<br />

[6] https://www.khronos.org/opencl/<br />

[7] https://www.mathworks.com/<br />



ARM Cortex-M and RTOSs are Meant for Each Other<br />

Jean J. Labrosse<br />

Micriµm Software, part of the Silicon Labs Portfolio<br />

Weston, FL, USA<br />

Jean.Labrosse@Micrium.com<br />

Abstract— A great majority of today's embedded systems are designed around 32-bit CPUs,<br />

which are integrated into microcontroller units (MCUs) that also include complex<br />

peripherals, such as Ethernet, USB host, device, SDIO, LCD controllers and<br />

more. Integrating these peripherals demands the use of an RTOS kernel.<br />

Introduced in 2004, the ARM Cortex-M architecture is currently the most popular 32-bit<br />

architecture on the market, adopted by most if not all major MCU manufacturers. The<br />

Cortex-M was designed from the outset to be RTOS kernel friendly: dedicated RTOS tick<br />

timer, context switch handler, interrupt service routines written in C, tail-chaining, easy<br />

critical section management as well as other useful features. Once an RTOS kernel is ported<br />

to the Cortex-M using a given toolchain, the exact same port (i.e., CPU adaptation code) can<br />

be used with any Cortex-M implementation. Not only does Cortex-M excel at integer CPU<br />

operations, many Cortex-M MCU implementations are also complemented with a floating-point unit (FPU), DSP extensions, memory protection unit (MPU) and a highly versatile<br />

debug access port.<br />

Keywords: RTOS; Embedded System; Interrupts; Kernel; Debugging; IoT; Micrium<br />



I. INTRODUCTION<br />

A real-time operating system (aka real-time kernel or RTOS) provides many benefits when used<br />

with today’s CPUs and MCUs. A real-time kernel is software that manages the time of a CPU<br />

(Central Processing Unit) or MPU (Micro Processing Unit) as efficiently as possible. Most kernels<br />

are written in C and require a small portion of code written in assembly language in order to adapt<br />

the kernel to different CPU architectures.<br />

When you design an application (your code) with an RTOS kernel, you simply split the work<br />

into tasks, each responsible for a portion of the job. A task (also called a thread) is a simple<br />

program that thinks it has the Central Processing Unit (CPU) completely to itself. On a single CPU,<br />

only one task can execute at any given time. Your application code also needs to assign a priority to<br />

each task based on the task's importance, as well as a stack (RAM) for each task. In fact, adding low-priority tasks will generally not affect the responsiveness of a system to higher-priority tasks.<br />

A task is also typically implemented as an infinite loop. The kernel is responsible for the<br />

management of tasks. This is called multitasking.<br />

Multitasking is the process of scheduling and switching the CPU between several sequential tasks.<br />

Multitasking provides the illusion of having multiple CPUs and maximizes the use of the CPU, as<br />

shown in Figure 1. Multitasking also helps in the creation of modular applications. With a real-time kernel, application programs are easier to design and maintain.<br />

Fig 1. RTOS decides which task the CPU will execute based on events.<br />

Most commercial RTOSs are preemptive, which means that the kernel always runs the most<br />

important task that is ready-to-run. Preemptive kernels are also event driven, which means that<br />

tasks are designed to wait for events to occur in order to execute. For example, a task can wait for<br />

a packet to be received on an Ethernet controller; another task can wait for a timer to expire, and<br />

yet another task can wait for a character to be received on a UART. When the event occurs, the<br />

task executes and performs its function, provided it is the highest-priority ready task. If the event that the<br />

task is waiting for does not occur, the kernel runs other tasks. Waiting tasks consume zero CPU<br />

time. Signaling and waiting for events is accomplished through kernel API calls. Kernels allow<br />

you to avoid polling loops, which would be a poor use of the CPU’s time. Below is an example of<br />

how a typical task is implemented:<br />



void MyTask (void)<br />
{<br />
    while (1) {                     // Tasks are infinite loops.<br />
        Wait for an event to occur; // Task consumes no CPU time while waiting!<br />
        Perform task operation;<br />
    }<br />
}                                   // A task doesn’t return<br />

A kernel provides many useful services to a programmer, such as multitasking, interrupt<br />

management, inter-task communication and signaling, resource management, time management,<br />

memory partition management and more.<br />

An RTOS can be used in simple applications where there are only a handful of tasks, but it is a<br />

must-have tool in applications that require complex and time-consuming communication stacks,<br />

such as TCP/IP, USB (host and/or device), CAN, Bluetooth, Zigbee and more. An RTOS is also<br />

highly recommended whenever an application needs a file system to store and retrieve data as well<br />

as when a product is equipped with some sort of graphical display (black and white, grayscale or<br />

color). Finally, an RTOS provides an application with valuable services that make designing a<br />

system easier.<br />



II. THE ARM CORTEX-M<br />

In 2004, ARM introduced a new family of CPU cores called Cortex-M (M stands for<br />

Microcontroller) based on a RISC (Reduced Instruction Set Computer) architecture. The first<br />

Cortex-M was called the Cortex-M3, and the family has evolved to include a number of derivative<br />

cores: Cortex-M0/M0+, Cortex-M4, high performance Cortex-M7 and the recently introduced<br />

Cortex-M23 and M33 with TrustZone-M.<br />

The programmer’s model (see Figure 2) of the Cortex-M processor family is highly consistent.<br />

For example, R0 to R15, PSR, CONTROL and PRIMASK are available to all Cortex-M<br />

processors. Two special registers, FAULTMASK and BASEPRI, are available only on the Cortex-<br />

M3, Cortex-M4, Cortex-M7 and Cortex-M33, and the floating-point register bank and FPSCR<br />

(Floating-Point Status and Control Register) are available on the Cortex-M4, Cortex-M7 and Cortex-<br />
M33 as part of the optional floating-point unit. Some Cortex-M implementations are also equipped with<br />

a Memory Protection Unit (MPU).<br />

Fig 2. Cortex-M programmer’s model.<br />



The Cortex-M was designed from the outset to be RTOS kernel friendly such that once an RTOS<br />

kernel is ported to the Cortex-M using a given toolchain, the same port (i.e., CPU adaptation code)<br />

can be used with any Cortex-M implementation. This is especially true for Cortex-M3, -M4, -M7 and<br />

-M33.<br />

Dedicated Timer for RTOS Tick<br />

The Cortex-M includes a 24-bit timer, called SysTick, intended for RTOS suppliers to use as the system<br />

heartbeat, which is used to handle time delays and timeouts. The timer also has a preassigned<br />

interrupt vector (#15). This means that the exact same timer initialization code can be used across<br />

all Cortex-M implementations, irrespective of the MCU supplier.<br />

Dedicated Context Switch Handler<br />

The context switching code for most RTOS kernels is implemented through an exception handler,<br />

and the Cortex-M has a dedicated exception handler (#14) for exactly that purpose. This handler is<br />

called PendSV. This means that the exact same context switching code can be used across all<br />
Cortex-M implementations, irrespective of the MCU supplier.<br />

System Service Calls<br />

The CPU allows two modes of operation: Privileged and Non-Privileged. Privileged mode allows<br />

privileges that are typically meant for an operating system, such as disabling/enabling interrupts,<br />

accessing debug features, altering the configuration of an MPU, etc. Non-Privileged mode is<br />

typically meant to be used by application code, which accesses system services through a dedicated<br />

exception handler called the SVC Handler. Again, this mechanism is the same across different<br />

Cortex-M implementations, making the code portable.<br />

ISRs Written in C<br />

The Cortex-M also allows you to write ISRs (Interrupt Service Routines) directly in C as shown<br />

below. This avoids having to learn assembly language, making the code easier to read and support.<br />

All the programmer needs to do is populate the vector table with a pointer to the ISR code.<br />

void MyISR (void)<br />
{<br />
&nbsp;&nbsp;&nbsp;&nbsp;Process interrupting device;<br />
}<br />

Dedicated Stack for ISRs<br />

Upon accepting an exception or an interrupt, the Cortex-M pushes onto the interrupted task’s stack<br />

the contents of eight CPU registers (R0-R3, R12, LR, PC and PSR), and, if the Cortex-M has an<br />

FPU, 17 FPU registers (S0-S15 and FPSCR). The Cortex-M then switches to a dedicated stack to<br />
process the exception or interrupt. This feature removes the requirement to allocate extra RAM<br />
in each task’s stack to accommodate interrupt handling, including nested interrupts.<br />

The NVIC<br />

The Nested Vectored Interrupt Controller (NVIC) supports up to 240 interrupts, each with up to<br />

256 levels of priority. Although the number of interrupts is fairly consistent across the Cortex-M<br />

family, it is always good to check the silicon manufacturer’s data sheet.<br />

Stack Limit Registers (M33 Only)<br />

The recently announced Cortex-M33 contains stack limit registers, which are designed to prevent<br />

and detect stack overflows, one of the most common problems encountered in RTOS-based<br />

applications. There are two stack limit registers (one for the MSP and one for the PSP).<br />
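A minimal host-side model illustrates the idea; on the real Cortex-M33 the comparison against PSPLIM/MSPLIM is performed in hardware and a violation raises a fault before memory is corrupted (the variable and function names here are illustrative):<br />

```c
#include <assert.h>

/* Model of the M33 process-stack limit: a push that would cross
 * PSPLIM faults instead of silently corrupting adjacent memory. */
static unsigned long psp, psplim;
static int stack_fault;

static void push_words(unsigned n)
{
    unsigned long new_sp = psp - 4UL * n;
    if (new_sp < psplim) {         /* would overflow: hardware raises a fault */
        stack_fault = 1;
        return;
    }
    psp = new_sp;
}
```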



CLZ Instruction<br />

The Cortex-M contains a special instruction called Count Leading Zeros (CLZ). Although<br />

originally intended to be used to normalize floating-point numbers, the CLZ instruction can be used<br />

by the RTOS kernel’s scheduler to determine the priority of the highest priority task that is ready<br />

to run. This greatly accelerates the scheduling process of the kernel.<br />
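The technique can be sketched in C: if each ready priority is a bit in a word, with priority 0 held in the most significant bit, counting leading zeros yields the highest ready priority in a single instruction. On GCC/Clang, `__builtin_clz` compiles to CLZ on the Cortex-M; the bitmap layout below is illustrative, not a particular kernel's:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative ready bitmap: bit 31 represents priority 0 (highest),
 * bit 0 represents priority 31 (lowest). The bitmap is assumed
 * non-empty, since the idle task is always ready. */
static uint32_t os_rdy_grp;

static unsigned os_highest_ready(void)
{
    /* Leading-zero count equals the highest ready priority number. */
    return (unsigned)__builtin_clz(os_rdy_grp);
}
```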

Load and Store Exclusive Instructions<br />

Special CPU instructions allow easy implementation of semaphores as well as mutual exclusion<br />

semaphores, which are common in most modern-day RTOSs.<br />
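LDREX/STREX implement an optimistic read-modify-write: the store-exclusive fails if another context touched the location, forcing a re-read and retry. A host-side analogue using C11 atomics (a sketch of the pattern, not any particular RTOS's implementation) shows the same retry loop for a counting semaphore:<br />

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int sem_count;       /* number of available tokens */

/* Non-blocking acquire: the compare-exchange loop mirrors the
 * LDREX/STREX pattern, where a failed store-exclusive re-reads
 * the value and retries. */
static int sem_try_acquire(void)
{
    int cur = atomic_load(&sem_count);
    while (cur > 0) {
        if (atomic_compare_exchange_weak(&sem_count, &cur, cur - 1))
            return 1;              /* token taken atomically */
    }
    return 0;                      /* none available */
}
```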

Easy Critical Section Management<br />

Most RTOS kernels need to disable interrupts when entering a critical section and enable interrupts<br />

upon leaving the critical section. However, it is important to preserve the state of the interrupt<br />

disable mask prior to entering the critical section so that the same state can be restored upon leaving<br />

the critical section. The Cortex-M allows us to easily implement this in assembly language as<br />

follows:<br />

CPU_SR_Save_DI:<br />
&nbsp;&nbsp;&nbsp;&nbsp;MRS&nbsp;&nbsp;&nbsp;R0, PRIMASK&nbsp;&nbsp;&nbsp;; Return the current interrupt mask in R0<br />
&nbsp;&nbsp;&nbsp;&nbsp;CPSID&nbsp;I&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;; Disable interrupts<br />
&nbsp;&nbsp;&nbsp;&nbsp;BX&nbsp;&nbsp;&nbsp;&nbsp;LR<br />
CPU_SR_Restore_EI:<br />
&nbsp;&nbsp;&nbsp;&nbsp;MSR&nbsp;&nbsp;&nbsp;PRIMASK, R0&nbsp;&nbsp;&nbsp;; Restore the previously saved interrupt mask<br />
&nbsp;&nbsp;&nbsp;&nbsp;BX&nbsp;&nbsp;&nbsp;&nbsp;LR<br />

Entering a critical section is done by calling CPU_SR_Save_DI(), which returns the CPU’s<br />
current interrupt-disable state. Leaving the critical section is handled by calling<br />
CPU_SR_Restore_EI() and passing it the previously saved state.<br />
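The save/restore pair matters when critical sections nest. A host-side model (illustrative names; on the target these map to the MRS/CPSID/MSR sequence shown earlier) makes the point:<br />

```c
#include <assert.h>

/* Model of PRIMASK: 0 = interrupts enabled, 1 = disabled. */
static unsigned primask;

static unsigned cpu_sr_save_di(void)        /* save state, then disable */
{
    unsigned sr = primask;
    primask = 1;
    return sr;
}

static void cpu_sr_restore_ei(unsigned sr)  /* restore the saved state */
{
    primask = sr;
}
```

Restoring the inner section's saved state leaves interrupts disabled, so the outer critical section remains protected until its own restore runs.<br />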

Kernel Aware (KA) and Non-Kernel Aware (nKA) Interrupts<br />

The above method disables all interrupts, which might not be desirable under certain circumstances.<br />

Disabling all interrupts affects the responsiveness of your application to highly time-sensitive<br />

events. It is possible to allocate specific time-sensitive interrupt service routines (ISRs) outside the<br />

reach of the RTOS. These are called non-kernel-aware (nKA) ISRs, and, as the name implies, they<br />

simply bypass the RTOS kernel. nKA ISRs are ISRs that have a higher priority than kernel-aware<br />

(KA) ISRs.<br />

Figure 3 shows the priority levels of ISRs and tasks for a typical Cortex-M CPU. If the RTOS needs<br />

to protect a critical section, it will set the Cortex-M CPU’s BASEPRI register to 0x40 and thus<br />

disable KA ISRs (priority values of 0x40 and above; on the Cortex-M, a lower value means a higher<br />
priority). Since the values 0x00 and 0x20 denote higher priorities, those ISRs are still allowed to<br />
interrupt the CPU, even if the RTOS is in the middle of a critical section.<br />



Fig 3. Cortex-M interrupt priority levels.<br />

Figure 4 shows that nKA ISRs are significantly more responsive than KA ISRs. Of course, nKA<br />

ISRs are not allowed to invoke any of the kernel services. However, it’s possible to have an nKA<br />

ISR trigger a KA ISR by using the interrupt vector of an unused peripheral and manually triggering<br />

the interrupt by writing to the NVIC->ISPR[n] associated with the peripheral.<br />
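The ISPR write breaks down as follows: each ISPR register holds 32 pending bits, so the register index and bit mask are derived from the IRQ number (this is what the CMSIS helper NVIC_SetPendingIRQ computes internally):<br />

```c
#include <assert.h>
#include <stdint.h>

/* Compute which NVIC->ISPR[n] register and which bit pend a given IRQ.
 * Writing the mask to that register triggers the (unused) peripheral's
 * interrupt from software, as described above. */
static void ispr_locate(unsigned irqn, unsigned *reg, uint32_t *mask)
{
    *reg  = irqn >> 5;              /* irqn / 32 */
    *mask = 1UL << (irqn & 0x1F);   /* irqn % 32 */
}
```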

Fig 4. Responsiveness of nKA vs KA ISRs<br />



Low Power Mode<br />

The Cortex-M contains a special instruction called Wait For Interrupt (WFI) that allows the<br />

processor to enter a low power state. The kernel can call this instruction whenever there are no<br />

tasks that are ready to run. In other words, this instruction would be called by the kernel’s Idle<br />

Task. The amount of energy savings is highly MCU manufacturer-specific. As its name implies,<br />

the low power state exits when an interrupt occurs.<br />

Optional FPU with Lazy Stacking<br />

Although not RTOS specific, the optional FPU allows applications that require floating-point<br />

computations to be greatly accelerated, which can also help reduce power consumption. The<br />

floating-point unit adds registers, which increases the overhead during a context switch. However,<br />

the Cortex-M logic is smart enough to only save the FPU registers onto the stack if the task actually<br />

made use of the FPU.<br />

Optional MPU<br />

Some Cortex-M implementations are equipped with a Memory Protection Unit (MPU), which can<br />

easily be programmed to protect against stack overflows and prevent code from executing out of<br />

RAM. Stack overflows occur when the programmer doesn’t allocate sufficient stack space for a<br />

task. As the stack overflows, memory locations used for other purposes are corrupted, which can<br />

cause unusual problems that might go undetected until the product is actually deployed. RTOS<br />

kernels might have a mechanism to check for stack overflows, but, often, the detection occurs too<br />

late. The MPU prevents RAM corruption by immediately detecting stack overflows. The newest<br />

Cortex-M cores based on the v8M architecture contain an improved MPU allowing regions to be<br />

as small as 32 bytes, aligned on 32-byte boundaries.<br />

Tail Chaining<br />

Although not a direct feature needed by an RTOS kernel, tail chaining reduces the amount of time<br />

it takes to handle back-to-back interrupts of the same or lower priority. This feature significantly<br />

reduces interrupt latency, which is always desirable in real-time applications.<br />

The CoreSight Debugger<br />

Most Cortex-M cores have a special debug port that contains a free-running 32-bit cycle counter that can be<br />

used for CPU and execution time measurements. This is not a feature that is actually needed by an<br />

RTOS kernel but is quite useful for obtaining performance data on an application.<br />

The CoreSight debugger also offers on-the-fly reads and writes allowing PC-based applications like<br />

Micrium’s µC/Probe to monitor or change (at run-time) RTOS as well as application variables.<br />

µC/Probe is a data visualization tool that allows developers to display or change (at run-time) the<br />

current values of variables without requiring the application to be instrumented. µC/Probe has built-in<br />
µC/OS-III® (also available for µC/OS-II® and µC/OS-5™) kernel awareness, which means that<br />

it can display the current state of kernel objects, such as tasks, semaphores, mutexes, message<br />

queues, etc. A quick glance at the task screen in µC/Probe will show whether the system is behaving<br />

as expected. µC/Probe can display or change the value of any variables in an application as long as<br />

those are declared global or static. This allows developers to run “what-if” scenarios with PID-loop<br />

gains, change scaling offsets, etc.<br />

The CoreSight debugger also allows tools like Segger’s SystemView (Figure 5) or Percepio’s<br />

Tracealyzer to stream and store onto a PC the task execution profile of an application. This type of<br />

tool is invaluable when determining whether or not an RTOS-based application can meet its timing<br />

requirements.<br />



Fig 5. Segger’s SystemView for µC/OS-III<br />

III. SUMMARY<br />

The Cortex-M was truly designed from the outset to be RTOS kernel friendly.<br />

Special instructions help with scheduling and with exclusive access to shared resources, and it is<br />
easy to disable/enable interrupts for critical sections, to let the CPU enter a low-power mode when<br />
running the idle task, and so on.<br />

The interrupt handling mechanism is especially well suited for supporting real-time applications<br />

through its responsive NVIC, tail-chaining feature, support of non-Kernel Aware and Kernel Aware<br />

ISRs and more.<br />

Industrial applications can especially benefit from the floating-point capability of the optional FPU<br />

module, the protection offered by the MPU and, with the Cortex-M33, prevention and detection of<br />

stack overflows through the stack limit registers.<br />






Automating Power Management in MCU Operating<br />

Systems<br />

Nick Lethaby<br />

Connected Microcontrollers<br />

Texas Instruments<br />

Goleta, USA<br />

nlethaby@ti.com<br />

Abstract—With Internet of Things (IoT) applications fueling<br />

an increase in battery-powered connected sensors and actuators,<br />

power management has become a critical technology for MCU<br />

developers. Advanced power management features implemented<br />

in silicon are of limited use unless complemented by a software<br />

layer that enables such features to be easily leveraged. The<br />

importance of ease-of-use is accentuated in the IoT market,<br />

where many developers lack embedded expertise. This paper<br />

presents an RTOS-based power management framework that<br />

automates power management in wireless MCU applications<br />

without developers having to implement specific power<br />

management code or have their applications decide when to enter<br />

specific low power states. We overview the underlying component<br />

implementations required to achieve this, including power-aware<br />

drivers that enable the OS to understand when specific peripherals<br />

may be turned off, and efficient tracking of future events, such as<br />

periodic functions and timeouts, by the RTOS. We next discuss a<br />

power policy program that decides when to transition to a lower<br />

power state and which state to transition to. We conclude by<br />

looking at power consumption benchmark numbers based on<br />

an ARM Cortex-M wireless microcontroller.<br />

Keywords—power management; real-time operating system;<br />

MCU;<br />

I. INTRODUCTION<br />

The emergence of the Internet of Things (IoT) promises to<br />

greatly increase the deployment of low-cost sensors or<br />

actuators, such as intelligent lighting, industrial data loggers,<br />

asset tracking tags, and smoke detectors, which will need to<br />

communicate to the internet. These sensors and actuators<br />

(henceforth referred to as ‘IoT nodes’) will often need to run<br />

for months or years on coin cell or AA batteries. As a result,<br />

energy efficiency will be a critical concern for developers.<br />

Users of laptops, mobile phones, and tablets are<br />

accustomed to having the operating system control power saving<br />

activities such as dimming displays or hibernation of the<br />

system after periods of no usage. However, these devices are<br />

based on sophisticated operating systems such as Windows,<br />

Linux, iOS, or Android. The low cost nature of IoT nodes will<br />

result in many implementations using MCUs with limited on-chip<br />
memory, precluding the use of such high-level operating<br />

systems. While traditional MCU developers are often satisfied<br />

with a set of low-level libraries for managing the hardware<br />

functionality, such an approach will often be insufficient for<br />

IoT nodes for several reasons:<br />

Over the last decade, new silicon processes have created<br />

significantly more power leakage compared to devices built<br />

using older CMOS processes. To achieve the energy efficiency<br />

optimal for IoT nodes, more sophisticated power management<br />

features are being designed into MCUs aimed at IoT<br />

applications. Only providing a low-level software interface to<br />

these creates a learning curve for potential users, making it less<br />

likely they will exploit them.<br />

Achieving optimal energy efficiency will require using<br />

more complex power-down modes, where much of the SoC<br />
(CPU, peripherals, and memory) is shut down or power cycled.<br />

The silicon vendor should provide higher-level functions that<br />

implement these ultra-low power states reliably to insulate the<br />

user from device-specific complexities. In addition, these<br />

higher-level power management solutions should address<br />

issues such as maintaining a reliable timebase in applications<br />

that are spending significant time in sleep modes.<br />

Many IoT devices are originating from companies not<br />

traditionally associated with embedded systems development<br />

and it is anticipated that there will be insufficient traditional<br />

embedded developers to address all the opportunities available<br />

in the IoT marketplace. Developers of MCU-based IoT nodes<br />

who lack prior embedded development experience will<br />

certainly not want to be dealing with low-level register-abstraction<br />

APIs. They will expect something much closer to<br />

what is available in Windows or Linux where one can select a<br />

specific power down mode or have the operating system<br />

actively manage power.<br />

In the wireless MCU space, the Software Development Kits<br />

(SDKs) used by embedded developers commonly include<br />

multitasking kernels, network connectivity and device drivers.<br />

In this paper, we will examine the implementation of a power<br />

management framework that provides automated power<br />

management for wireless MCUs using popular embedded<br />

RTOS offerings such as FreeRTOS and TI-RTOS. We will<br />

demonstrate the effectiveness of this framework using power<br />

consumption data obtained from the Texas Instruments<br />



SimpleLink® CC2640R2 wireless MCU, which supports<br />

Bluetooth Low Energy (BLE) communication. This MCU<br />
is based on the widely used ARM® Cortex®-M3 core.<br />

II. HOW AN RTOS HELPS POWER MANAGEMENT<br />

Except for the simplest designs, using an RTOS has some<br />

inherent advantages for energy efficient designs. The first of<br />

these is that the preemptive multitasking design paradigm<br />

encourages interrupt-driven rather than polling-based drivers.<br />

This eliminates unnecessary CPU cycles spent simply polling<br />

peripheral registers. The second generic advantage is the OS<br />

automatically drops into an idle thread when there is nothing to<br />

do, clarifying when power saving techniques can be applied.<br />

Furthermore, as we will see in later discussion, some of the<br />

more advanced power management capabilities require that the<br />

device drivers communicate with a centralized database that<br />

tracks which resources are in use. This fits naturally into an<br />

OS, which typically manages some or all of a system’s<br />

peripherals. Beyond these natural advantages, a power-aware<br />

RTOS must offer numerous other capabilities to achieve an<br />

optimal low power operating performance. We will examine<br />

the specific power management techniques that combine to<br />

produce a comprehensive framework. However, before getting<br />

into the specifics of the software, we will briefly overview<br />

some of the essential hardware power management features that<br />

must be present on the device.<br />

III. HARDWARE POWER MANAGEMENT FEATURES<br />

To comprehend the software power management<br />

techniques explained later, it is necessary for the developer to<br />

have a basic understanding of some of the underlying hardware<br />

features that assist in effective power management:<br />


Clock Gating: Clock Gating enables the clock to be<br />

turned off for a particular peripheral, which in turn<br />

reduces the power consumed by the peripheral’s logic.<br />

Power Domains: Although turning off the clock to a<br />

peripheral eliminates most power consumption,<br />

depending on the process used to manufacture the<br />

device, there will often still be some power drain due<br />

to leakage. To address this issue, a SoC may<br />

implement power domains to completely shut off<br />

power to a particular circuit. Unlike clock gates, which<br />

will usually have a one-to-one correspondence to a<br />

peripheral, a power domain typically controls multiple<br />

peripherals, such as all the UARTs or all the serial I/O<br />

peripherals.<br />

Wake-up Generator: To implement very aggressive<br />

low power states, both the CPU and virtually all the<br />

peripherals domains are powered down. Since no<br />

interrupts can normally reach the CPU in these<br />

circumstances, additional logic that enables a subset of<br />

peripherals to wake up the CPU is required. The SoC<br />

designer must decide which interrupts can wake up the<br />

CPU and ensure that the wake-up generation logic is<br />

able to catch these interrupts, take the CPU out of reset<br />

so it can respond to the interrupt, and then forward the<br />

interrupt to the correct vector.<br />


CPU-independent High-resolution Timer: Since the<br />

great majority of embedded applications have some<br />

time-driven events, it is essential that an accurate<br />

timebase can be maintained across power saving<br />

modes. This requires a timer to be kept active while the<br />

rest of the SoC is powered down. This timer must have<br />

sufficient resolution to maintain something similar to a<br />

1ms tick count and sufficient width to avoid rollovers<br />

during periods of deep sleep. The required resolution<br />

and width will depend on the CPU clock rate and how<br />

long the application will sleep for.<br />

Fast wake up time and appropriate run-time<br />

performance: Although not explicitly used for power<br />

management, the ability of the SoC to wake up<br />

quickly, complete work quickly, and go back to a low-power<br />
state quickly is of paramount importance to<br />

maximize time in low power states. Important design<br />

choices here include having the high-frequency clock<br />

source stabilize quickly and selecting the right CPU<br />

speed and performance so that the work can be done<br />

quickly.<br />
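The width requirement from the timer feature above can be checked with a quick calculation: the counter needs enough bits to represent the longest sleep period at the chosen resolution. A sketch:<br />

```c
#include <assert.h>

/* Minimum counter width (in bits) to count max_ticks without rollover. */
static unsigned bits_needed(unsigned long long max_ticks)
{
    unsigned bits = 0;
    while (max_ticks) {
        bits++;
        max_ticks >>= 1;
    }
    return bits;
}
```

For example, a 1 ms tick over a 7-day sleep (604,800,000 ticks) needs 30 bits, while a 32-bit millisecond counter rolls over after roughly 49.7 days.<br />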

We will discuss how an RTOS power manager utilizes<br />

these features, beginning with a discussion on how to minimize<br />

run-time power consumption.<br />

IV. “CPU ACTIVE” POWER MANAGEMENT TECHNIQUES<br />

Minimizing power consumption while the CPU is active<br />

primarily means aggressively managing power consumed by<br />

peripherals such as timers, serial ports, and radios. To do so,<br />

the RTOS power manager is reliant on the clock gating and<br />

power domains designed into the CC2640R2 silicon, which<br />

enable inactive peripherals to be powered down. Leveraging<br />

this hardware requires knowing when a particular peripheral is<br />

in use or not. Such knowledge can be tracked by an operating<br />

system and its associated device drivers. Each device driver<br />

must declare a dependency on the specific peripheral it will<br />

use. For example, when the SPI driver is invoked, it declares a<br />

dependency to the OS power manager on the specific SPI port<br />

(e.g. SPI2). The OS power manager knows the clock gate and<br />

power domain that are associated with SPI2 and verifies that<br />

these are enabled. If they are not, it enables them. When the<br />

driver completes execution, it informs the OS power manager<br />

to release the dependency on the chosen SPI. The power<br />

manager maintains a database of dependency counts on the<br />

clock gates and power domains. Whenever the dependency<br />

count for a clock gate or power domain goes to zero, the power<br />

manager is responsible for disabling them to reduce power.<br />

These peripheral power downs are done during normal system<br />

run-time and help increase energy efficiency.<br />
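The dependency database described above amounts to reference counting per clock gate or power domain; a minimal sketch (illustrative names, not the actual TI-RTOS Power driver API):<br />

```c
#include <assert.h>

#define NUM_DOMAINS 4
static int dep_count[NUM_DOMAINS];   /* drivers currently using each domain */
static int domain_on[NUM_DOMAINS];   /* 1 = clocked/powered */

static void power_set_dependency(int d)
{
    if (dep_count[d]++ == 0)
        domain_on[d] = 1;            /* first user: enable clock gate / power domain */
}

static void power_release_dependency(int d)
{
    if (--dep_count[d] == 0)
        domain_on[d] = 0;            /* last user gone: power the domain down */
}
```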

V. MAXIMIZING CPU POWER STATE EFFICIENCIES<br />

In many IoT nodes, it will be common for the SoC to spend<br />

much or even most of its time in some form of sleep mode. To<br />

maximize energy efficiency, it is critical to not only maximize<br />

the amount of time spent in sleep modes, but also appropriately<br />

utilize the most power efficient sleep modes where possible.<br />

Achieving the most power efficient sleep state will typically go<br />

beyond just putting the CPU into a sleep state. It may be<br />

desirable to power down memories in addition to on-chip<br />

426


peripherals. It is also essential to have a real-time clock or<br />

high-resolution timer kept alive across power downs to ensure<br />

proper functioning of the application’s time-based functions. In<br />

the CC2640R2 implementation, the real-time clock is part of<br />

the “always on” hardware, so the application always has access<br />

to it. However, in other silicon implementations, it may be<br />

necessary for the power manager to specifically keep a timer or<br />

clock alive. There are a number of different techniques that can<br />

be utilized to ensure that sleep modes are as efficient as<br />

possible. We will begin with a discussion of tick suppression.<br />

VI. TICK SUPPRESSION<br />

Embedded applications typically employ a regular timer<br />

interrupt as a ‘heartbeat’. This timer interrupt is used as the<br />

basis for calculating when any time-based activities such as<br />

periodic functions or timeouts should occur. For RTOS-based<br />

applications, this timer interrupt is known as the system tick,<br />

but no-OS applications will typically have a similar regular<br />

timer tick.<br />

In practice, ticks execute periodically, at a rate sufficient for<br />

the most granular timing needed by the application. As a result,<br />

most system ticks will not result in a time-driven function<br />

being executed. In energy efficient applications, it is clearly<br />

undesirable to be woken up from a low-power state just to<br />

service the system tick timer interrupt and then find there is<br />

nothing to do. Fortunately the OS knows when any periodic<br />

functions or timeouts are due to occur. To implement tick<br />

suppression, the OS reprograms the timer associated with the<br />

system tick so the next timer interrupt only occurs when the<br />

next time-based function must run. As illustrated in figure 1,<br />

this approach can eliminate the majority of timer interrupts<br />

associated with the system tick.<br />

In the TI-RTOS implementation, the user simply has to set<br />

a configuration parameter to enable tick suppression. An<br />

alternative approach is to provide application-driven control<br />

through APIs. However, this forces the tick suppression logic<br />

into the application code as well as adding the overhead of<br />

APIs calls to a relatively simple operation. The core overhead<br />

of tick suppression is low as reprogramming the timer<br />

peripheral is simply a register write. TI-RTOS and most other<br />

RTOSs automatically track the next tick interval when work is<br />

scheduled for so this information is always available. A minor<br />

side effect is that it may take somewhat longer to execute OS<br />

system calls that must return tick counts, especially on<br />

architectures with poor math performance. This is because the<br />

count must be calculated, versus just returning a count variable<br />

that is simply incremented upon each timer interrupt.<br />
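In outline (illustrative names; the real logic is internal to the kernel), tick suppression programs a single timer compare for the next due event, and on wake-up derives the tick count from the free-running timer, which is the calculation noted above:<br />

```c
#include <assert.h>

static unsigned long tick_at_sleep;   /* tick count when suppression began */
static unsigned long timer_at_sleep;  /* free-running timer value at that point */

/* Program one wake-up instead of N periodic tick interrupts. */
static unsigned long ticks_to_sleep(unsigned long now, unsigned long next_event)
{
    return next_event - now;
}

/* Reconstruct the tick count on wake-up from elapsed timer cycles. */
static unsigned long ticks_on_wake(unsigned long timer_now,
                                   unsigned long timer_hz,
                                   unsigned long tick_hz)
{
    return tick_at_sleep + (timer_now - timer_at_sleep) * tick_hz / timer_hz;
}
```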

VII. A POWER POLICY MANAGER<br />

In earlier versions of the TI-RTOS power manager that<br />

worked on DSPs in mobile phone applications, decisions on<br />

when to go a particular low power state and which power state<br />

to select were pushed up to the application. Once a decision<br />

had been made to go to a specific power state, a register/notify<br />

framework enabled the power manager to notify relevant<br />

system entities such as device drivers, which would then take<br />

steps to complete any activities and prepare for a power state<br />

change. Once all the system entities had reported that they were<br />

ready, the power manager would then proceed with the power<br />

state change. This approach was sufficient in the mobile phone<br />

space where large application development teams incorporate<br />

power management experts and the non-deterministic nature of<br />

the notification process is acceptable when the main CPU is<br />

running a high-level operating system such as Android, which<br />

inherently has a lot of overhead.<br />

For IoT node applications, a simpler and lower-overhead<br />

approach is required. For the reasons discussed earlier<br />

concerning tick suppression, the OS power manager is well placed<br />

to make any decision about transitioning to a different<br />

power state. A function called a power policy manager was<br />

developed to provide a simple way to automatically decide on<br />

and manage power transitions. The register/notify framework<br />

was scaled back and greater use was made of a concept known<br />

as a constraint to simplify decisions about power state<br />

transitions. The power policy manager is configurable by the<br />

developer but comes with a set of default policies that can be<br />

used without the user having to understand significant levels of<br />

detail.<br />

When a multitasking OS-based application has nothing to<br />

do, it drops into an idle loop and the OS can invoke the power<br />



policy manager. The role of the power policy manager is to<br />

determine which low power state can be entered at this point. It<br />

is always safe to simply place the ARM core in a<br />

WaitForInterrupt (WFI) state as the core register contents are<br />

fully maintained and application execution can be resumed<br />

with minimal latency. However, since other power states offer<br />

much greater power savings, the policy manager will first<br />

determine if one of these can be entered.<br />

A common reason an application may drop into the idle<br />

loop is because one or more tasks are blocked waiting for<br />

peripheral IO operations to complete. If completing these IO<br />

operations or any other function is essential for the system’s<br />

correct operation, the application needs to be able to<br />

communicate this to the OS power manager. In the power<br />

manager implementation for the CC2640R2, the application<br />

informs the power manager of such critical functions by setting<br />

constraints. An example of when a constraint is appropriate<br />

would be when transmitting data over a BLE or 802.15.4 radio.<br />

An application that is waiting for acknowledgement or data<br />

from the wireless network would typically block on a<br />

semaphore. If no other application task needs to run, the<br />

application will then drop into the idle loop and the power<br />

policy would be run. Obviously, it would not be appropriate to<br />

shut down the radio and put the CPU into a long latency deep<br />

sleep mode, because this would result in the incoming BLE<br />

packets being lost. To prevent this from happening, the BLE<br />

stack or radio driver would set a constraint while it was<br />

operating. When its action was complete, it would release the<br />

constraint. The constraint should be limited to only the power<br />

down modes that would impair successful operation. For<br />

example, going into an IDLE state (see next section for more<br />

details of the different CC2640R2 power states) may be safe<br />

for a particular operation, but not going into a STANDBY<br />

state. The power manager tracks constraints in a relatively<br />

similar manner to dependencies. However, it is important to<br />

understand that the power policy only checks for constraints,<br />

not dependencies. The assumption is that power downs can be<br />

done regardless of on-going peripheral activity unless a<br />

peripheral’s associated stack or device driver sets a constraint.<br />
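To make the constraint mechanism concrete, the following is a minimal sketch in C. The names (`pwr_set_constraint`, `pwr_release_constraint`, `pwr_policy_choose`) are illustrative and are not the actual CC2640R2 driver API: constraints are reference-counted per power state, and the policy simply picks the deepest state that no driver has vetoed.<br />

```c
#include <assert.h>

/* Illustrative constraint tracking, loosely modeled on the power
 * manager described above; names and types are hypothetical. */
enum pwr_state { PWR_WFI, PWR_IDLE, PWR_STANDBY, PWR_NUM_STATES };

static unsigned constraint_count[PWR_NUM_STATES];

/* A driver sets a constraint while an operation would be impaired
 * by the given state, and releases it when the operation is done. */
void pwr_set_constraint(enum pwr_state s) { constraint_count[s]++; }

void pwr_release_constraint(enum pwr_state s)
{
    assert(constraint_count[s] > 0);
    constraint_count[s]--;
}

/* Policy: choose the deepest state not blocked by any constraint.
 * WFI is always safe, so it is the fallback. */
enum pwr_state pwr_policy_choose(void)
{
    if (constraint_count[PWR_STANDBY] == 0) return PWR_STANDBY;
    if (constraint_count[PWR_IDLE] == 0)    return PWR_IDLE;
    return PWR_WFI;
}
```

In this sketch, a BLE stack would call `pwr_set_constraint(PWR_STANDBY)` before a transfer and release it on completion, leaving IDLE and WFI available to the policy in the meantime.<br />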

Assuming constraints are not preventing the system from<br />

transitioning to a lower power state, the power policy manager<br />

must weigh information from various sources to decide on<br />

which power saving mode to invoke. Each power saving mode<br />

is characterized by a specific latency, obtained by<br />

combining the time taken to perform the power down operation<br />

and time required for the SoC to fully wake-up and be ready<br />

for normal system execution. Similar to the technique used in<br />

tick suppression, the power policy will check when the next<br />

periodic functions or timeouts are due to occur and then<br />

compare this time against the latencies of the different power<br />

states. It will then choose the lowest applicable power state and<br />

program the appropriate wake-up configuration. The power<br />

policy understands the wake-up latencies from each power<br />

state and therefore will program the wake-up to occur<br />

sufficiently early to ensure the processor is ready to respond<br />

instantly to perform the previously scheduled work. When the<br />

power policy triggers a transition to a new power state, it will<br />

invoke callback functions registered by drivers that need<br />

notification of sleep transitions to shut down the peripheral’s<br />

activity. The default implementations of these callbacks are<br />

minimalistic and based on the assumption it is safe (due to no<br />

constraint being set) to shut down the peripheral as quickly as<br />

possible.<br />
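The latency comparison described above can be sketched as a table walk from the deepest power state to the lightest, choosing the first state whose combined entry-plus-wake-up latency fits before the next scheduled event. This is an illustrative sketch only: the function and type names are hypothetical, and the latency figures are rough values derived from the wake-up times in Table 1 plus an assumed, equal entry time.<br />

```c
#include <stddef.h>

/* Hypothetical descriptor for one power saving mode. */
typedef struct {
    const char *name;
    unsigned    total_latency_us;  /* assumed enter + wake-up, combined */
} pwr_mode_t;

/* Ordered from deepest saving to lightest; figures are illustrative
 * round-ups of Table 1's wake-up latencies, not datasheet values. */
static const pwr_mode_t modes[] = {
    { "STANDBY", 28 },  /* ~14 us down + ~14 us up, assumed */
    { "IDLE",    3 },   /* ~1.4 us each way, rounded up       */
    { "WFI",     0 },   /* effectively free, always applicable */
};

/* Pick the deepest mode whose round-trip latency still leaves the
 * CPU ready before the next periodic function or timeout is due. */
const pwr_mode_t *pwr_select_mode(unsigned us_until_next_event)
{
    for (size_t i = 0; i < sizeof modes / sizeof modes[0]; i++)
        if (modes[i].total_latency_us < us_until_next_event)
            return &modes[i];
    return &modes[2]; /* WFI fallback */
}
```

The wake-up would then be programmed `total_latency_us` early, mirroring the tick-suppression technique described above.<br />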

VIII. SOC-SPECIFIC POWER STATES<br />

A key attribute of the Power Manager is that it provides<br />

proven implementations of a pre-defined set of power states for<br />

a device. These are extensively tested to ensure reliable<br />

transitions to and from the mode and eliminate the need for<br />

development from scratch.<br />

The power states for the CC2640R2 are listed in Table 1 as<br />

an example of those that can be present for a device power-optimized<br />

for an IoT node. As can be seen from the data in the<br />

table, to achieve ultra-low power consumption, it is important<br />

to implement SoC-specific power states that do much more<br />

than simply sleep the main CPU.<br />

WaitForInterrupt mode simply results in gating the clock to<br />

portions of the main CPU. This may be used in any situation as<br />

it has virtually no latency. The primary role of the power policy<br />

manager is to determine if the IDLE or STANDBY modes can<br />

be used, as these greatly reduce power consumption, especially<br />

the latter. The IDLE mode will additionally power off some<br />

CPU logic completely, while retaining state of vital registers. It<br />

should be noted that no actions are taken in either the<br />

WaitForInterrupt or IDLE implementations to turn off<br />

peripherals. As a result, the actual power usage will vary<br />

depending on which peripheral and associated power domains<br />

are active.<br />

In STANDBY mode, all peripheral domains are powered<br />

down, except for always on logic used for wake-up generation.<br />

The real-time clock in the ALWAYS ON domain is used to<br />

maintain an accurate time base while in this state. The device’s<br />

SRAM is put in retention mode and the power supply is duty-cycled<br />

to achieve further power savings, while maintaining<br />

sufficient charge to maintain vital state.<br />

The shutdown mode is provided for applications that wish<br />

to sleep for hours or even days. The main advantage of this<br />

mode compared to simply turning the whole SoC off is that any<br />

pin can be used to cause the SoC to power back up and there is<br />

no need for additional external circuitry to turn on the SoC.<br />

Because shutdown is used for very long power downs, the<br />

default power policy manager does not utilize it. The<br />

application can invoke it directly if appropriate or modify the<br />

power policy manager to use it.<br />

Power State      | Wake-up time to ACTIVE state | Current used<br />
ACTIVE           | Not applicable               | 4.145 mA<br />
WaitForInterrupt | A few cycles                 | 2.028 mA<br />
IDLE             | 1.4 µs                       | 796 µA<br />
STANDBY          | 14 µs                        | 1-2 µA<br />
SHUTDOWN         | 700 µs                       | 0.1 µA<br />
Table 1: Wake-up latencies and power consumption for the pre-defined power states of the TI CC2640R2, an MCU with integrated BLE<br />



IX. SUMMARY<br />

With the advent of the IoT triggering an explosion in<br />

battery-powered connected sensors and actuators, power<br />

management has become a critical technology for MCU<br />

developers. While aggressive power management strategies<br />

require specific features to be implemented in the silicon itself,<br />

it is equally important that a software layer be provided that<br />

enables such features to be easily leveraged. This is especially<br />

true in the IoT market, where many developers lack embedded<br />

experience. We illustrated RTOS-based power management<br />

components that provide low-level libraries for managing<br />

peripheral clocks and domains and transitioning to and from<br />

specific power states. These are complemented by power-aware<br />

drivers that enable the OS to understand when specific<br />

peripherals may be turned off. Finally, the OS power manager<br />

has the intelligence to decide when to transition to a lower<br />

power state, eliminating the need for the application to manage<br />

such details and simplifying the process for developers.<br />

ACKNOWLEDGMENT<br />

I would like to thank Scott Gary, Senior Member of Technical<br />

Staff at Texas Instruments, for providing technical insight on<br />

software power management in operating systems.<br />



The state of embedded open source software in 2018<br />

Rod Cope<br />

CTO<br />

Rogue Wave Software<br />

Louisville, CO, USA<br />

rod.cope@roguewave.com<br />

It’s no surprise that the adoption of open source software for<br />

embedded development has caught up to the rest of the world – the<br />

advantages are just too great – so what data, trends, and lessons<br />

can we learn? Like commercial software, open source presents<br />

technical, security, and quality challenges but it also adds skills,<br />

experience, and maintenance considerations into the mix. As<br />

developers of embedded devices with strict resource, performance,<br />

and reliability requirements, how do we ensure open source is<br />

managed and deployed effectively?<br />

Rod Cope, CTO of Rogue Wave Software, discusses the state<br />

of open source use in embedded device development today, using<br />

statistics, use cases, and examples from around the industry and<br />

specific technical support tickets. By delving into popular<br />

packages and tools across the application stack, common issues<br />

and solutions are extracted to form a representative framework<br />

for how open source is used in development environments and<br />

production devices. The use of open source has implications for<br />

package selection, integration, team staffing, and maintenance,<br />

and these topics and more are covered to provide specific best<br />

practices for teams to guide their development efforts:<br />

• How to identify potential risk areas for your project<br />

• Steps to better manage open source within the team<br />

• Where to find help if something goes wrong<br />

Keywords— open source software, software security, software quality, best<br />

practices, use cases, embedded software development<br />

I. INTRODUCTION<br />

Open source is everywhere and continues to be a growing<br />

trend for embedded systems development, presenting new and<br />

unique challenges to software teams. Open source software<br />

(OSS) is replacing commercial versions of development tools<br />

and packages, and slowly taking up residence on embedded<br />

target platforms. A recent survey of readers of EETimes and<br />

Embedded shows that the use of open source operating systems,<br />

without commercial support, has grown to 41 percent, up from<br />

31 percent in 2012 1 . Similarly, VDC Research states that “Free<br />

and/or publicly available, open source operating systems such as<br />

Debian-based Linux, FreeRTOS, and Yocto-based Linux<br />

continue to lead new stack wins, with nearly half of surveyed<br />

embedded engineers expecting to use some type of free, open<br />

source OS on their next project.” 2<br />

With more embedded devices connecting to the Internet of<br />

Things (IoT), back-end data processing and analytics are also<br />

embracing OSS. As teams get comfortable with, or bow to<br />

outside pressures to adopt open source, it’s important to<br />

understand where the risks are and how the industry is<br />

overcoming them.<br />

This paper examines data extracted from the Klocwork static<br />

code analysis tool and findings in industry literature (white<br />

papers, articles, and blogs) to identify major areas of open source<br />

risk for embedded systems and steps to better manage open<br />

source within development teams.<br />

II. WHERE OPEN SOURCE IS DEPLOYED<br />

There is a high degree of probability that an embedded<br />

software developer has seen or used open source software.<br />

Today, OSS is easy to get and often fills in technical gaps for<br />

which there is no commercial equivalent. Plus, developers<br />

prefer packages that are popular and have strong communities<br />

behind them.<br />

A. Popular repositories<br />

A survey of the most popular open source hosting sites is<br />

listed in Table 1, illustrating the popularity of OSS projects<br />

today.<br />

TABLE I. USAGE STATISTICS FROM POPULAR OPEN SOURCE HOSTING SITES<br />
Site        | Users          | Projects<br />
Bitbucket   | 6,000,000 a    | Unknown<br />
GitHub      | 27,000,000 b   | 75,000,000 b<br />
LaunchPad   | 4,140,275 c    | 41,141 d<br />
SourceForge | “Millions” e   | 500,000 e<br />
a blog.bitbucket.org/2016/09/07/bitbucket-cloud-5-million-developers-900000-teams/<br />
b github.com/about<br />
c launchpad.net/people<br />
d launchpad.net/projects<br />
e sourceforge.net/about<br />

1 m.eet.com/media/1246048/2017-embedded-market-study.pdf<br />
2 www.vdcresearch.com/images/pr/2016/nov/EMB-Embedded-OS-11-29-16.html<br />



B. OSS deployments<br />

Given the less-stringent requirements on development<br />

environments versus embedded targets, it’s fair to say that most<br />

open source packages are used in the areas of software build and<br />

management tools. Yet there are other, perhaps surprising areas,<br />

where OSS is deployed:<br />

Hardware modelling tools – while electronic design<br />

automation (EDA) tools have been strongly held by proprietary<br />

vendors, open source alternatives are growing in popularity and<br />

features. Examples include Icarus Verilog, Verilator, and GNU<br />

Emacs 3 .<br />

Compiler tool chains – the GNU Compiler Collection (GCC)<br />

has been is use for 30 years and is, by far, the most popular code<br />

compiler supporting C, C++, Ada, Fortran, and other languages.<br />

Other open source compilers include Clang, Portable C<br />

Compiler (pcc), and Tiny C Compiler (TCC).<br />

Software libraries – focusing on embedded targets, there are<br />

many considerations when choosing which software libraries to<br />

include. Notably, with processor and memory resources at a<br />

premium, the library footprint is an important consideration.<br />

Popular examples include newlib, the C runtime library<br />

maintained by Red Hat 4 , and Qt for Device Creation 5 , a version<br />

of the widely-used Qt framework that supports various<br />

embedded targets and is free under the (L)GPL license.<br />

Debuggers – GDB is a popular open source option for source<br />

code debugging, and is often integrated into open source IDEs,<br />

such as Eclipse CDT, NetBeans, and SlickEdit 6 . Eclipse itself<br />

recently celebrated 15 years of supporting embedded systems<br />

development 7 .<br />

Version management systems – open source version control<br />

has a long history (RCS and Subversion, for example) and<br />

has evolved to include distributed and cloud-based systems,<br />

such as Git.<br />

Build systems – this is a broad category that ranges from build<br />

tools, such as GNU make, to modern continuous integration<br />

tools, such as Jenkins and Buildbot.<br />

Operating systems – real-time operating systems (RTOS) for<br />

embedded are characterized by their modularity and footprint,<br />

and there are several open source options available: FreeRTOS,<br />

eCos, and uClinux are a few examples. Linux is the most popular<br />

system, and the Yocto Project offers a complete development<br />

environment for embedded systems 8 .<br />

Databases – for development environments or back-end IoT<br />

servers, MySQL and PostgreSQL are widely-used database<br />

options, while SQLite is popular for embedded targets 9 .<br />

Web servers – while there’s a seeming disparity between the<br />

large processing and memory requirements of a web server and<br />

what’s available on a typical embedded device, connectivity is<br />

critical to IoT development and has driven the need for onboard<br />

HTTP. Busybox has a built-in httpd server and lighttpd is<br />

optimized for resource-constrained environments.<br />

III. THE CHALLENGES OF OSS USE<br />

The greatest strength of open source, and the reason it exists,<br />

also presents the biggest challenges to developers of embedded<br />

systems. As OSS packages can be developed and distributed by<br />

anyone, including commercial companies, to solve myriad<br />

technical needs, it is nearly impossible for a software<br />

development team to be able to support the ones they use. Most<br />

teams focus on the skills necessary to deliver new features, less<br />

so on the skills required to support any OSS packages integrated<br />

into environments and systems – especially if multiple packages<br />

are being used.<br />

The challenges of OSS use can be broken down into two<br />

areas, security risks and technical risks. Security risks are flaws<br />

in deployed software that can allow malicious entities access to<br />

program control or sensitive data, either on-board the system or<br />

through remote connections. A recent example is the Krack<br />

vulnerability in Wi-Fi devices, which allowed attackers to<br />

exploit a flaw in the WPA2 protocol and had the potential to<br />

affect “a seemingly infinite list of embedded and Internet of<br />

Things devices from companies like Linksys.” 10<br />

Technical risks are defined as coding, configuration, or<br />

architectural errors that can cause improper behavior or<br />

performance of one or more open source packages. While this<br />

covers a wide range of possibilities, an illustrative example is<br />

the Nest software bug, which caused the battery to drain<br />

prematurely and deactivate the device 11 .<br />

IV. TECHNICAL RISKS AND SOLUTIONS<br />

Running Klocwork static code analysis on the popular Boost<br />

C++ libraries, there is the potential for several bugs to impact<br />

the behavior and performance of code. Note that some of these<br />

bugs are also potential security flaws, categorized by the<br />

identified Common Weakness Enumeration (CWE) entry.<br />

TABLE II. POTENTIAL BUGS IN BOOST 1.62.0<br />
Issue type (Klocwork checker name) | Related CWE enumeration | Description | Number of reported issues<br />
Operands of different size in bitwise operation (CWARN.BITOP.SIZE) | N/A | When bitwise operations have operands of different sizes, unexpected results may occur | 44<br />
Void function returns value (VOIDRET) | CWE-394: Unexpected Status Code or Return Value | Functions declared as void returning a value may indicate a logic problem in code | 23<br />

3 opencores.org/howto/eda<br />
4 sourceware.org/newlib/<br />
5 www.qt.io/download<br />
6 sourceware.org/gdb/wiki/GDB%20Front%20Ends<br />
7 www.eclipse.org/community/eclipse_newsletter/2017/october/article2.php<br />
8 www.yoctoproject.org/about<br />
9 www.embedded-computing.com/embedded-computing-design/the-ins-and-outs-of-embeddeddatabases-for-the-iot<br />
10 www.wired.com/story/krack-wi-fi-wpa2-vulnerability/<br />
11 www.engadget.com/2016/01/14/nest-software-bug/<br />



TABLE II (continued). POTENTIAL BUGS IN BOOST 1.62.0<br />
Issue type (Klocwork checker name) | Related CWE enumeration | Description | Number of reported issues<br />
Possible dereference of end iterator (ITER.END.DEREF.MIGHT) | N/A | When an iterator is dereferenced when its value could be equal to end() or rend(), unexpected results may occur | 13<br />
Function returns address of local variable (LOCRET.RET) | CWE-562: Return of Stack Variable Address | When a function returns a pointer to a local variable, it returns a stack address that will be invalidated after return, potentially causing unexpected results | 12<br />
Uninitialized variable (UNINIT.STACK.MUST) | CWE-457: Use of Uninitialized Variable | Uninitialized data may contain values that cause unexpected results | 11<br />

The complexity of the Boost library code is too high to<br />

reproduce in this paper but, to illustrate one of the above<br />

findings, the following general example shows mismatched<br />

operands in a bitwise operation.<br />

1 typedef unsigned int u32;<br />

2 typedef unsigned long long u64;<br />

3 u32 get_u32_value(void);<br />

4 u64 get_u64_value(void);<br />

5 void example(void) {<br />

6 u32 mask32 = 0xff;<br />

7 u64 mask64 = 0xff;<br />

8 u32 value32 = get_u32_value();<br />

9 u64 value64 = get_u64_value();<br />

...<br />

10 value64 &= ~mask32;<br />

11 }<br />

Line 10 shows a 32-bit mask used with 64-bit data, which<br />

may cause unpredictable behavior.<br />
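The failure mode can be isolated in two small helper functions (illustrative names, not Boost code): the complement of the 32-bit mask is computed in 32 bits and then zero-extended, silently clearing the upper half of the 64-bit operand unless the mask is widened first.<br />

```c
#include <stdint.h>

/* Buggy: ~mask is evaluated as a 32-bit value (0xFFFFFF00 for
 * mask == 0xff), then zero-extended to 64 bits, so the upper
 * 32 bits of v are cleared as a side effect. */
uint64_t clear_mask_buggy(uint64_t v, uint32_t mask)
{
    return v & ~mask;
}

/* Fixed: widening the mask before complementing keeps the upper
 * 32 bits of v intact. */
uint64_t clear_mask_fixed(uint64_t v, uint32_t mask)
{
    return v & ~(uint64_t)mask;
}
```

With `v = 0x1122334455667788` and `mask = 0xff`, the buggy form yields `0x0000000055667700` while the fixed form yields `0x1122334455667700`.<br />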

While these types of technical risks apply to any open source<br />

code, there are several steps to consider when selecting and<br />

implementing any package:<br />

1) Identify any known bugs/issues with the package<br />

and, if necessary, check for newer versions to see if<br />

they have been mitigated<br />

2) Follow best practices for the set up, configuration,<br />

and deployment of packages<br />


3) Run functional tests against the package, in<br />

isolation, before integrating into the overall<br />

application<br />

4) Define a process for supporting the package,<br />

including identifying resources to solve issues in<br />

production<br />

V. SECURITY RISKS AND SOLUTIONS<br />

Code security is a popular subject across the software<br />

industry, no less so for embedded systems where user safety and<br />

privacy are of paramount concern. The types of code issues that<br />

can introduce vulnerabilities include buffer overflows, tainted<br />

data, uninitialized data, and dangling pointers, to name a few.<br />

Additionally, the configuration of open source packages can<br />

contribute to attack surfaces.<br />

The MySQL zero-day vulnerabilities in 2016 (CVE-2016-<br />

6662 and CVE-2016-6663) exemplify bugs that are both<br />

inherent in open source code and can be mitigated through<br />

package configuration. Both vulnerabilities could allow<br />

attackers to execute code with root privileges, even if kernel<br />

security were enabled with default active policies for the<br />

MySQL service on some major Linux distributions. While<br />

patches were eventually released, there were configuration<br />

changes proposed by the reporting researcher to protect servers<br />

in the meantime 12 . Being aware of, and having the skills to<br />

implement these changes is not necessarily something<br />

developers consider, but it’s essential to protecting critical<br />

embedded systems.<br />

Another example is SQLite, a popular database used in<br />

embedded systems. Running Klocwork static analysis on an<br />

older version, SQLite 3.15.0, yielded two, potentially<br />

significant, buffer overflow vulnerabilities.<br />

TABLE III. POTENTIAL VULNERABILITIES IN SQLITE 3.15.0<br />
Issue type (Klocwork checker name) | Related CWE enumeration | Description | Number of reported issues<br />
Buffer Overflow - Array Index Out of Bounds (ABV.GENERAL) | CWE-120: Buffer Copy without Checking Size of Input | Array bounds violation: Access to an array element that is outside of the bounds of that array | 17<br />
Buffer Overflow (class or structure) - Array Index Out of Bounds (ABV.MEMBER) | CWE-120: Buffer Copy without Checking Size of Input | Array bounds violation in a class or structure: Access to an array element that is outside of the bounds of that array | 4<br />

The complexity of the SQLite code is too high to reproduce<br />

in this paper but, to illustrate one of the above issues, the<br />

following general example shows an array bounds violation.<br />


12 thehackernews.com/2016/09/hack-mysql-database.html<br />



1 int main()<br />
2 {<br />
3     char fixed_buf[10];<br />
4     sprintf(fixed_buf, "Very long format string\n"); // Line 4. ABR<br />
5     return 0;<br />
6 }<br />

Line 4 shows a string of 24 characters being passed into an<br />

array fixed_buf[] of size 10, which may unintentionally<br />

overwrite adjacent memory.<br />
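A common mitigation, sketched below with an illustrative helper name, is to bound every write by the destination size using snprintf, which truncates rather than overrunning and NUL-terminates the destination whenever the size is non-zero.<br />

```c
#include <stdio.h>

/* Copy msg into dst without ever writing past dstsz bytes.
 * snprintf truncates the output to fit and, for dstsz > 0,
 * always leaves dst NUL-terminated. copy_message is an
 * illustrative name, not from the SQLite sources. */
void copy_message(char *dst, size_t dstsz, const char *msg)
{
    snprintf(dst, dstsz, "%s", msg);
}
```

Applied to the example above, a 10-byte buffer receives only the first nine characters ("Very long") plus the terminating NUL, instead of 25 bytes spilling into adjacent memory.<br />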

The OWASP Project has identified ten best practices for<br />

embedded application security, adding embedded-specific<br />

guidelines, such as firmware and included libraries, to general<br />

secure coding principles 13 . It’s up to the development team to<br />

decide how to implement policies and tests for these issues, but<br />

most organizations adopt static and dynamic analysis to offset<br />

development resources and prevent mistakes.<br />

TABLE IV. OWASP EMBEDDED TOP 10 BEST PRACTICES<br />
E1 – Buffer and Stack Overflow Protection<br />
E2 – Injection Prevention<br />
E3 – Firmware Updates and Cryptographic Signatures<br />
E4 – Securing Sensitive Information<br />
E5 – Identity Management<br />
E6 – Embedded Framework and C-Based Hardening<br />
E7 – Usage of Debug Code and Interfaces<br />
E8 – Transport Layer Security<br />
E9 – Data Collection, Usage, and Storage – Privacy<br />
E10 – Third Party Code and Components<br />
Steps to consider to secure applications that include open<br />

source code:<br />

1) Identify any known vulnerabilities within the package<br />

and version by searching the U.S. Government’s<br />

National Vulnerability Database<br />

2) Follow best practices for the secure configuration and<br />

deployment of packages<br />

3) Run security tests against the package, following the<br />

OWASP embedded top 10 practices above<br />

4) Train the development team to be able to understand,<br />

identify, and mitigate risks<br />

VI. WHERE DEVELOPERS GET HELP<br />

For technical and security risks, open source presents a<br />

unique challenge as packages typically do not offer commercial<br />

levels of support. There are three options for development teams<br />

to solve their issues:<br />

Self-support – relying on internal developers to be<br />

knowledgeable about packages, including staying ahead of<br />

technical issues, security vulnerabilities, and getting trained to<br />

deal with problems and the potential for poor documentation.<br />

Community support – relying on the community to help with<br />

set up, deployment, and production issues, and dealing with the<br />

possibility of slow response or the lack of a solution to a specific<br />

problem.<br />

Commercial support – relying on an outside organization<br />

that knows how to prevent and troubleshoot issues with the<br />

package or set of packages, following a guaranteed level of<br />

service and cost to meet project requirements.<br />

The best source of help is dependent on the unique<br />

requirements of the development team and often combines<br />

aspects of all three options. For embedded systems with strict<br />

mission- and safety-critical requirements, it’s highly<br />

recommended to find a level of support that is not only timely,<br />

but also offers the expertise necessary to cover the technical and<br />

security risks identified in this paper.<br />

VII. SUMMARY<br />

With over 37 million users across popular hosting sites and<br />

many different types of deployment scenarios, open source use<br />

continues to grow for embedded systems development,<br />

presenting technical, security, and support challenges that are<br />

different from traditional proprietary software packages. By<br />

identifying issues up front, following best practices for use, and<br />

running tests against the packages before integration, these<br />

challenges can be mitigated. By adopting a level of support that<br />

is timely and has the necessary package expertise, overall risks<br />

can be minimized.<br />

13 www.owasp.org/index.php/OWASP_Embedded_Application_Security#tab=Embedded_Top_10_Best_Practices<br />



Developing Safety Autonomous Driving Solutions<br />

Based on the Adaptive AUTOSAR Standard<br />

Leo Hendrawan<br />

Senior Member Technical Staff – Customer Support<br />

Wind River System, Germany<br />

Andrei Kholodnyi<br />

Senior Architect – CTO Office<br />

Wind River System, Germany<br />

ABSTRACT<br />

Since the first release of its standard in 2003, AUTOSAR[1]<br />

has established itself as one of the primary software development<br />

standards for the global automotive industry. As the automotive<br />

industry is now facing some of its greatest opportunities and<br />

challenges from the prospect of autonomous driving, new<br />

standards are needed to handle the complexity regarding<br />

software architecture for controlling the increasing number of<br />

E/E contents in the autonomous vehicle. The recent advent of the<br />

Adaptive AUTOSAR standard can help accommodate the<br />

extensive and complex requirements of autonomous driving by<br />

enabling a flexible, dynamic, and service-oriented platform while<br />

still complying with stringent functional safety<br />

standards and properly engaging with established platforms.<br />

The standard itself builds on technologies and standards which<br />

are already established in the industry, such as multi-core high-end<br />

processors with MMU support, high-speed Ethernet<br />

connectivity, hypervisor/ virtualization, POSIX PSE51, C++11<br />

for application development, ISO26262/ASIL compliance, etc.<br />

This presentation provides an example of an Adaptive<br />

AUTOSAR implementation based on VxWorks® RTOS from<br />

Wind River. As one of the very few solutions available on the<br />

market which is already fulfilling the requirements described<br />

above, VxWorks is a strong example of a foundational software<br />

platform for Adaptive AUTOSAR-based autonomous driving<br />

development. We will also explain how VxWorks<br />

features/profiles for Safety, Security, Connectivity, and Device<br />

Management fit the basic components of Adaptive AUTOSAR<br />

standard.<br />

Keywords— Autonomous Driving, Adaptive AUTOSAR, POSIX<br />

PSE51, VxWorks, Safety, ISO26262<br />

I. INTRODUCTION<br />

Thanks to a sustained industry push over the past several years, not<br />

only the electric car but also the self-driving car has become a<br />

near-term reality rather than something that exists only in<br />

science-fiction movies. However, this does not come without<br />

challenges. It is estimated that an autonomous car will generate<br />

around 4,000 GB (4 terabytes) of data per day in the future [2],<br />

coming from various sensors (cameras, LIDAR, radars, etc.) and<br />

high-speed communication links (5G, V2X, etc.). The complexity of<br />

processing and managing these data is therefore growing<br />

exponentially.<br />

The (classic) AUTOSAR (AUTomotive Open System<br />

ARchitecture) standard has become the de-facto software<br />

standard in the automotive industry for embedded software<br />

applications in ECUs over the last decade. However, its<br />

implementation still lacks the versatility needed for the<br />

complexity of connected, autonomous driving applications.<br />

The AUTOSAR consortium has therefore come up with<br />

the new Adaptive AUTOSAR platform to accommodate the<br />

challenges of implementing connected and autonomous vehicle<br />

applications while also bridging the classic AUTOSAR and<br />

infotainment applications in the vehicle.<br />

Furthermore, safety becomes the key issue in the<br />

implementation of the new standard, as driving applications<br />

concern human life.<br />

II. ADAPTIVE AUTOSAR<br />

Adaptive AUTOSAR was proposed by the AUTOSAR<br />

consortium in 2017. The main goal is to define a software<br />

standard for advanced driving assistance applications by<br />

offering a high degree of flexibility and modularity in the form<br />

of a service-oriented architecture. While the classic AUTOSAR standard is<br />

well defined for static and efficient implementation of<br />

application on top of microcontrollers and standard<br />

communication channel such as CAN bus, Adaptive<br />

AUTOSAR is defined on top of technologies which can cope<br />

with the high processing power and communication<br />

requirements such as multicore microprocessors, gigabit<br />

Ethernet communications, over-the-air update, etc. In order to<br />

have the flexibility between platforms and operating systems,<br />

Adaptive AUTOSAR also embraces other standards such as<br />

C++ and POSIX. As the intention is to have the same code<br />

running on top of any platform and operating systems, it is<br />

necessary to consider carefully the functional safety support of<br />

the underlying platform and operating system.<br />

Figure 1 shows the basic architecture of Adaptive<br />

AUTOSAR.<br />



A. Adaptive Applications (AA)<br />

Adaptive Applications (AA) are the applications that implement the connected and autonomous driving functionality. Each application is implemented as one or more operating-system processes, each of which may contain one or more threads and has a separate address and name space from every other application/process. To communicate with other AAs, an Adaptive Application may only use the ARA Communication Manager explicitly and no other means, such as conventional IPC (Inter-Process Communication).<br />

- REST: alternative communication management for AA based on a RESTful API.<br />

- Diagnostic: implementation of UDSonIP (Unified Diagnostic Services on Internet Protocol).<br />

- Persistency: mechanisms for storing information in non-volatile memory.<br />

- Platform Health Management: supporting fail-safe applications by means of supervision.<br />

- Update and Configuration Management: supporting<br />

flexible update of software and configurations through<br />

over-the-air updates.<br />

- Time Synchronization: offering a time-synchronization mechanism between applications.<br />

Fig. 1. Basic Adaptive AUTOSAR Architecture [3]<br />

B. AUTOSAR Runtime for Adaptive Applications (ARA)<br />

The AUTOSAR Runtime for Adaptive Applications<br />

(ARA) is an abstraction layer for the underlying hardware and<br />

operating system which is called Adaptive Platform (AP). The<br />

ARA abstraction layer is comparable to the AUTOSAR RTE<br />

(Run Time Environment) of the classic AUTOSAR. ARA<br />

provides standard C++ (or other language support in the<br />

future) interfaces to the Adaptive Platform, which consists of a collection of Functional Clusters.<br />

C. Adaptive Platform Foundation and Adaptive Platform<br />

Services<br />

The Functional Clusters of the Adaptive Platform can be categorized into two main groups: the Adaptive Platform Foundation, providing the fundamental functionality of the Adaptive Platform, and the Adaptive Platform Services, providing the standard services of the AP. From the Adaptive Application (AA) point of view, however, the two are almost indistinguishable due to the standard C++ interfaces.<br />

The common/basic Adaptive Platform Foundation and Services are [3]:<br />

- Execution Management: managing platform and application execution (based on the Machine and Application Manifests).<br />

- Communication Management: managing communication between Adaptive Applications, either in a service-oriented manner or via static language/network bindings.<br />

III. SAFETY COMPLIANT OS FOR ADAPTIVE AUTOSAR<br />

As mentioned earlier, the Adaptive AUTOSAR standard aims at a high degree of portability. It is therefore important for users to select the underlying platform and operating system carefully to ensure the required functional safety capabilities. The international functional safety standard for road vehicles is ISO 26262, which defines the Automotive Safety Integrity Levels (ASIL), ranging from level A (lowest) to D (highest). As the safety concept of autonomous driving is still evolving, the automotive industry can draw on safety-related concepts already established in other industries. Taking the example of the VxWorks RTOS (Real Time Operating System), a well-established COTS certifiable solution, the following features help users implement safety-critical applications for autonomous driving:<br />

A. Real Time Process (RTP) with Time and Space Partition<br />

Scheduling<br />

The VxWorks RTOS kernel gives application processes pre-emptive scheduling, complemented by time-partition and core/CPU-affinity policies. With pre-emptive scheduling, critical applications, which are usually implemented as high-priority tasks, get a predictable response time, ensuring the safety of the system. Time partitioning guarantees that RTP tasks have access to the CPU in their specified time windows. Figure 2 illustrates how the time-partitioning scheduler works in the VxWorks 7 Safety Profile. CPU affinity avoids migrating a task between cores during execution, again ensuring predictability.<br />
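As an informal illustration of the idea (this is not VxWorks code; the partition names and window lengths below are invented for the example), a time-partition schedule can be modelled as a repeating major frame of fixed windows, each owned by one partition:<br />

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One entry of an illustrative time-partition schedule: the named
// partition owns the CPU for `duration_ticks` within the major frame.
struct Window {
    std::string partition;
    std::uint32_t duration_ticks;
};

// Returns the partition that owns the CPU at `tick`. The schedule
// repeats every major frame (the sum of all window durations).
std::string OwnerAt(const std::vector<Window>& schedule, std::uint32_t tick) {
    std::uint32_t frame = 0;
    for (const auto& w : schedule) frame += w.duration_ticks;
    std::uint32_t t = tick % frame;
    for (const auto& w : schedule) {
        if (t < w.duration_ticks) return w.partition;
        t -= w.duration_ticks;
    }
    return "";  // unreachable for a non-empty schedule
}
```

Because the window boundaries are fixed, a misbehaving partition cannot starve the others: it loses the CPU when its window ends, regardless of its own behaviour.<br />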

B. Resource Access Control<br />

To prevent a malfunctioning task from damaging the system and putting it into an unsafe state, it is necessary to control all resources available in the system, such as memory, objects (shared memory, message queues, semaphores, etc.), and even system calls. The VxWorks 7 Safety Profile supports this with hard-coded data structures that explicitly define the access control for each resource that needs to be protected.<br />
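The principle of statically defined access control can be pictured as a compile-time, deny-by-default table mapping each task to the resources it may use. The sketch below is purely illustrative: the task and resource names are invented, and the actual VxWorks 7 Safety Profile data structures differ.<br />

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative static access-control entry: one task and the resources
// it is explicitly allowed to use. Names are invented for the example.
struct AccessEntry {
    std::string task;
    std::vector<std::string> allowed_resources;
};

// Hard-coded table, fixed at build time rather than built up at runtime.
const std::vector<AccessEntry> kAccessTable = {
    {"brakeControl", {"msgQ_brake", "shm_sensor"}},
    {"logging",      {"msgQ_log"}},
};

// Deny-by-default check: a task may use a resource only if the pair is
// present in the hard-coded table.
bool MayAccess(const std::string& task, const std::string& resource) {
    for (const auto& e : kAccessTable) {
        if (e.task != task) continue;
        return std::find(e.allowed_resources.begin(),
                         e.allowed_resources.end(),
                         resource) != e.allowed_resources.end();
    }
    return false;
}
```

Since the table is immutable, even a compromised task cannot grant itself access to resources it was not assigned at design time.<br />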



Fig. 3. helloAdaptiveWorld basic ara::com example<br />

In practice, the well-defined abstraction layer of Adaptive AUTOSAR also makes it possible to use multiple operating systems on multicore hardware with a hypervisor. Critical applications with hard real-time requirements can then run on top of a safety-certified operating system, while non-critical applications run on top of another operating system. Figure 4 illustrates what such a solution might look like, with the VxWorks 7 RTOS and the Linux operating system running on multicore hardware.<br />

Fig. 2. VxWorks Time Partitioning Scheduler Example [4]<br />

C. Support of Certified Hardware Platform and Software<br />

Tools<br />

The implementation of safety-critical applications also requires functional-safety-compliant software running on a safety-compliant hardware platform. Using appropriate software tools and development standards further improves confidence when developing safety-relevant applications. One of the common development standards used for automotive application development is Automotive SPICE (Software Process Improvement and Capability Determination). Automotive SPICE is used, for example, in the development of the DIAB compiler, one of the compiler tools of the VxWorks RTOS.<br />

IV. IMPLEMENTATION OF ADAPTIVE AUTOSAR ON VXWORKS 7<br />

The goal of Adaptive AUTOSAR is a high degree of flexibility and portability. Two key standard components are required for this: C++ and POSIX. As VxWorks 7 supports both standards, running the ARA stack on VxWorks is straightforward. Examples such as the basic ara::com helloAdaptiveWorld are already running on multiple hardware platforms. Figure 3 illustrates the basic helloAdaptiveWorld example.<br />
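To give a flavour of the service-oriented pattern, the sketch below is a hand-written stand-in, not the generated ara::com API: class and event names such as HelloServiceSkeleton are invented, whereas real proxy and skeleton classes are generated from the AUTOSAR service interface description. The skeleton side publishes an event and the proxy side consumes it through a registered handler.<br />

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hand-written stand-in for an ara::com-style event channel.
class HelloServiceSkeleton {
public:
    using Handler = std::function<void(std::uint32_t)>;

    // Proxy side: subscribe to the event with a reception handler.
    void Subscribe(Handler h) { handlers_.push_back(std::move(h)); }

    // Skeleton side: publish a new sample to every subscriber. This is
    // the only channel an Adaptive Application uses to reach another AA;
    // no raw IPC is involved at the application level.
    void Send(std::uint32_t value) {
        for (const auto& h : handlers_) h(value);
    }

private:
    std::vector<Handler> handlers_;
};

// Proxy-side application logic: remember the latest received sample.
std::uint32_t last_sample = 0;

void ConnectProxy(HelloServiceSkeleton& service) {
    service.Subscribe([](std::uint32_t v) { last_sample = v; });
}
```

Because the application touches only the communication-management interface, the same application code can be rebuilt unchanged on any platform that provides the ARA stack.<br />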

Fig. 4. Multiple OS Adaptive AUTOSAR Implementation<br />

V. CONCLUSIONS<br />

Adaptive AUTOSAR is defined to cope with the challenging requirements of implementing complex connected and autonomous vehicle applications. As it offers a high degree of flexibility, proven safety-compliant solutions must be considered for the underlying layers (the operating system) to ensure the success of its deployment.<br />

REFERENCES<br />

[1] www.autosar.org<br />

[2] B. Krzanich, “Data Is The New Oil In The Future of Automated<br />

Driving”. Retrieved November 2016, from:<br />

https://newsroom.intel.com/editorials/krzanich-the-future-of-automated-driving/<br />

[3] “Explanations of Adaptive Platform Design”. March 2017. AUTOSAR.<br />

[4] “RTP Time Partition Scheduling” Retrieved December 2017 from:<br />

https://knowledge.windriver.com/en-us/000_Products/000/020/000/020/020/000_Programmer's_Guide%2C_<br />

Edition_17/0F0/050<br />



Could Virtualization be the key to reducing<br />

complexity within the automotive E/E architecture?<br />

Nicholas Ayres, Daniel Hayes<br />

DIGITS<br />

De Montfort University<br />

Leicester, United Kingdom<br />

nick.ayres@dmu.ac.uk, daniel.hayes@dmu.ac.uk<br />

Dr Lipika Deka, Dr Benjamin N. Passow<br />

DIGITS<br />

De Montfort University<br />

Leicester, United Kingdom<br />

lipika.deka@dmu.ac.uk, benpassow@ieee.org<br />

Abstract—The vehicle embedded system also known as the<br />

electronic control unit (ECU) has transformed the humble motor<br />

car making it more efficient, environmentally friendly and safer,<br />

but has led to a system which is highly complex. The modern<br />

motor vehicle's electronic/electrical (E/E) architecture has become<br />

one of the most software-intensive machines we use in our day to<br />

day lives. As new technologies such as vehicle autonomy and<br />

connectivity are introduced and new features are added to<br />

existing Advanced Driver Assistance Systems (ADAS), an<br />

increase in overall complexity will no doubt continue. To address<br />

these future challenges the motor vehicle will require a radically<br />

new approach to the current E/E architecture. Virtualization has<br />

had a resurgence, transforming data centers and facilitating huge growth in cloud storage; as such, it can effectively address the increasing complexity of the vehicle E/E architecture. Converting a hardware- and software-based ECU into a virtual environment transforms it into a virtualized ECU (VCU), exploiting some of the major benefits of a virtualized environment.<br />

Keywords—Embedded System, ECU, Virtualization, E/E<br />

Architecture<br />

I. INTRODUCTION<br />

Since the introduction of the electronic control unit (ECU), the modern motor car can no longer be considered a solely mechanical device. ECUs, of which there are in excess of 70 [1], [2] embedded in the modern motor car, monitor and control a wide range of software-based functions and applications. This software incorporates over 100 million lines of code [3] responsible for sending and receiving data to and from numerous sensors and actuators, often with real-time constraints, across several automotive domains. In automotive terms, a domain is a “means to group mechanical and electronic systems” [4] connected over multiple in-vehicle networks such as CAN, LIN and MOST. As more tasks and functions come under the umbrella of the automotive electronic/electrical (E/E) architecture, there has been a huge growth in the number of ECUs deployed throughout the car, resulting in a highly decentralized, rigid and complex system.<br />

Virtualization, the creation of a virtual version of a device or resource, could provide the automotive E/E architecture with a number of key benefits, including flexibility, availability, scalability and security. Applied in an automotive context, these could address the overall increasing complexity of the E/E architecture. This paper explores some of the main potential benefits virtualization could provide and how complexity in the E/E architecture can be not only addressed but reduced.<br />

II. BACKGROUND<br />

Since Karl Benz built what is considered the first modern motor vehicle, the Benz Patent-Motorwagen, in 1886 [5], the humble car has been transformed, not just in looks but in function. 1977 saw General Motors release the Oldsmobile Toronado, regarded as the first car to include an electronic control unit (ECU); this first implementation managed the electronic spark timing [6] of the combustion process. Since ECUs were introduced, software has become an integral part of the motor car, much like any mechanical component that aids in its function and operation. ECUs benefit the driver with a safer, more efficient and more comfortable ride, but benefits can also be seen with regard to the vehicle itself, such as lower CO2 emissions, reduced mechanical wear and higher efficiency in operation. Vehicle systems are no longer mechanically linked together, but rather consist of software-driven hardware connected between driver input and vehicle output. Gone are the days when depressing the accelerator pedal would simply propel the vehicle into motion. In a typical modern motor vehicle, every time the accelerator pedal is depressed a whole plethora of electronic tasks is initiated; software algorithms ensure that parameters such as ignition timing, air-to-fuel ratios, temperatures and pressures are all kept at an optimum, ensuring that the vehicle accelerates as efficiently as possible [7].<br />

Autonomous technologies currently being developed, and in some cases deployed, in a number of makes and models of road vehicles form the safety-driven advanced driver assistance system (ADAS). These safety systems include technologies such as park assist, adaptive cruise control, and lane keeping and departure assist. Facilitating this new technology will require a substantial increase in hardware, software and network communication, putting more complexity and pressure on the already encumbered E/E architecture; it is clear that a new approach is required to tackle the inherent complexity of the E/E architecture.<br />

III. MOTIVATION FOR CHANGE, PAST & PRESENT<br />

The motor car has had to adapt from generation to<br />

generation in order to meet the challenges of new vehicle-based technology. Since their introduction, ECUs were<br />

connected to sensors and actuators using a point-to-point (P2P) wiring system, routing these dedicated wired connections through the vehicle's existing wiring harness, as shown in Fig. 1 below. As each original equipment manufacturer (OEM) began to incorporate more and more technology into the running and monitoring of their vehicles, the wiring harnesses became costly, unwieldy and overly complex.<br />

Fig. 1. ECU P2P Connection compared to CANBus Network<br />

A new method of connection-based information interchange was required to address this complexity. To cope with the issues surrounding point-to-point connections within early E/E architectures, the controller area network (CANbus) was developed in the mid-1980s by Bosch and became the standard in-vehicle communication technology in almost every modern motor vehicle.<br />

With the increasing use of ECU-based systems within the vehicle, development costs have escalated to the point where 30-35% of a vehicle's total cost is associated with its electronics and software [8]. The past and projected overall cost can be seen in Fig. 2 [9]. Again the motor industry rose to this challenge, and in 2003 the AUTomotive Open System ARchitecture (AUTOSAR) consortium was formed to provide a standardized framework promoting software reusability in an attempt to lower automotive software development costs.<br />

Fig. 2. Automotive E/E Cost vs Overall Vehicle Cost<br />

In an effort to reduce the number of ECUs and associated hardware, the "car of the future" will incorporate "centralized, multifunctional, multipurpose hardware" [10] that is less reliant on an increasing number of ECUs, sensors, actuators and communication media. System on chip (SoC) and multiprocessor system on chip (MPSoC) technologies are being introduced into the E/E architecture, often incorporating one ECU operation or function per core [11] and consolidating many low-level, individual ECUs. Reducing the number of ECUs with MPSoC/SoC could, however, be seen as only a temporary solution given new, more complex large-scale system integration applications [12]. As new technologies such as ADAS and vehicle autonomy enter the automotive arena, complexity is and will remain a fundamental issue, and to address these future challenges the motor vehicle will require a radically new approach to the current E/E architecture.<br />

IV. THE MAIN SOURCES OF COMPLEXITY WITHIN THE<br />

MODERN AUTOMOTIVE E/E ARCHITECTURE<br />

Although the motor industry has historically met the challenges facing it with key technologies such as vehicle networking and software architecture standardization, complexity still exists within the current E/E architecture, which now faces new challenges, including:<br />

A. Increasing Numbers of Physical ECUs<br />

The number of ECUs has grown dramatically [13] in the 40 years since their first introduction in 1977, and this trend is set to continue over the coming decades [10]. As more legacy features and functions move from mechanical to electronic control, and with the addition of new ADAS and vehicle-autonomy subsystems, this number is inevitably set to rise, bringing an associated rise in development and individual component costs, weight, required operating and application software, and network traffic.<br />

B. Decentralized E/E Architecture<br />

ECUs control a vast and different array of vehicle tasks<br />

and functions over four key functional domains including<br />

chassis, powertrain, comfort, and infotainment [14], [15].<br />

Vehicle functional domains represent a logical distribution of<br />

ECU hardware and in-vehicle functions throughout the<br />

vehicle. ECUs are typically located near the components they control or monitor, but many functions and operations are distributed over a number of ECUs across several domains, with communication between them achieved through in-vehicle networks. As more functions and features become available, ECUs increasingly require data from sensors and ECUs in other functional domains, which has led to a highly decentralized E/E vehicle architecture.<br />

C. Multiple In-vehicle Networks<br />

As the number of ECUs has grown from model to model over the years, the increasing amount of network traffic on the primary CANbus network has become an issue, as it struggles to cope with the volume and variety of applications generating network traffic [16]. This has resulted in multiple in-vehicle network media and protocols co-existing to handle the various types of in-vehicle communication. The local interconnect network (LIN) is a limited-node, small-bandwidth protocol which supports simple, non-critical, low-priority ECUs such as climate control and seat and wing-mirror position motors. In contrast, the media orientated systems transport (MOST) network has been designed and<br />



implemented to handle the high requirements of streaming video, voice, and other data suited to infotainment and hi-fidelity systems, which require a high-speed network.<br />

D. Embedded Vehicle Software<br />

Vehicle software is a major component of the modern motor car, from the ECU operating system supporting functional applications to the features consumed within the HMI device. ECUs within the E/E architecture perform over 2000 individual vehicle-related functions [17], from engine management to passenger comfort. As more ECUs are introduced into the E/E architecture, ever more lines of software code will inevitably be required to drive those embedded systems. As more and more directly linked mechanical functions are replaced with ECU functionality, their software inevitably has to interact with multiple sensors, actuators, and other ECUs, often across multiple domains of responsibility, increasing overall complexity. Over the coming decade, the modern motor vehicle will see an influx of new safety features, ADAS and vehicle autonomy. The autonomous motor vehicle has been hailed as offering a wealth of benefits not just to our day-to-day lives but to society in general, from making our streets safer through a reduction in traffic accidents to greater access to independent mobility solutions, including for the elderly and non-car owners/drivers [18]. The amount of embedded software, as well as the data generated to support these new systems, is set to increase dramatically.<br />

The scope for even more lines of software code and E/E complexity is rapidly becoming a pressing issue, one which requires a robust and secure strategy to ensure that the software within the motor car is correct, up to date and, above all, safe and secure.<br />

E. Embedded Software Updates<br />

ECUs and their associated functions are governed by software which is often designed and written months or even years before the vehicle is driven off the sales forecourt. ECU software is preloaded during the manufacturing process and is in many cases fixed to the specific hardware it has been written for. Software is now a critical component of the modern motor car, but all software code is vulnerable to errors, especially during the design, coding and implementation stages of the vehicle's development process, which must be addressed when discovered. If software errors are not addressed, they may expose the OEM or supplier to liability if something goes wrong due to an inherent flaw in their software. Once a flaw has been discovered and rectified, the new, updated code needs to be deployed and installed on the target vehicle or vehicles in a manner that is safe, secure and minimally disruptive for the customer, but also cost-effective for the OEM and supplier.<br />

There is no doubt that today's modern car has technologically advanced into a highly complex device [19]. The vehicle's E/E architecture has become one of the most hardware- and software-intensive machines we use in our day-to-day lives. There have been numerous initiatives, frameworks and standards to address this increasing complexity, but it is clear that a new approach is required before the E/E architecture reaches saturation point.<br />

V. ECU VIRTUALIZATION<br />

Virtualization is a technology which could address the fundamental challenges concerning the modern automotive E/E architecture. A classic example of where virtualization technology has not just enhanced but transformed an industry is the traditional datacenter: these used a model that relied on individual and often underutilized servers dedicated to a particular role or task within an organization, and as new tasks and roles were introduced, new dedicated hardware and software was typically installed. The modern motor car's E/E architecture is very much in tune with the traditional data center model, and many comparisons can be drawn: vehicle ECUs provide hardware and software resources to their dedicated clients, actuators and sensors, acting very much like individual servers. Virtualization technology can be applied to ECU functions, converting them to virtual instances and transforming an ECU into a virtual electronic control unit (VCU); see Fig. 3 below.<br />

Fig. 3: Automotive VCU Based System<br />

These VCUs could not only provide similar functionality to their hardware-based counterparts but also additional benefits: virtualization could enhance the vehicle E/E architecture with improvements in flexibility, availability, scalability, utilization, security, and software, as detailed below.<br />

A. Flexibility<br />

Hardware resources can be modified, often 'on the fly', to meet the peak demands of the system. In stark contrast, physical ECUs have their hardware fixed at design and subsequent manufacture, so that only the hardware required to run their embedded software is included. A virtual system can adapt to the current situation and provide additional resources as and when they are required, especially when the original software is upgraded or replaced.<br />

B. Consolidation<br />

It is clear from the modern data center that consolidation<br />

has played a large part in reducing individual bespoke servers as well as lowering the energy used to power those devices. Automotive virtualization not only consolidates but also addresses the decentralized nature of distributed ECUs, with the associated benefits for physical accessibility, maintenance, and replacement.<br />

C. Scalability<br />

A VCU is a software-based system and as such can be<br />

changed if an error in the code is discovered or additional code<br />

is added to provide additional functionality. The system is able<br />

to increase available memory, CPU cores, and other required<br />

resources to meet the demands of the newly updated system.<br />

Currently, ECU code that has been modified to correct a flaw or to provide additional functions or features may not scale to the underlying hardware, producing a reduction in overall performance.<br />

D. Utilization<br />

Physical ECUs have their hardware fixed at design and subsequent manufacture, so that only the hardware required to run their embedded software is included; given the large numbers of vehicles produced, this keeps ECU hardware costs to a minimum. In contrast, a system designed to run multiple virtual machines has a large pool of resources available to cope with peak workloads, and these virtual resources can be allocated as and when required.<br />

E. Security<br />

Virtualization provides a separation of services whereby<br />

these services can be separated into individual virtual machines<br />

[20]. If a service running in a particular VM becomes compromised for any reason, it will not directly affect other VMs on the same system. Critical vehicle functions can not only be segregated into their own VMs but can also be run on a dedicated system, whether that pertains to a particular domain or a common vehicle function.<br />

F. Software<br />

History has shown that with any software there is always<br />

the need to periodically update the code to fix previously<br />

undiscovered bugs and vulnerabilities or offer new features and<br />

enhancements for the consumer. Many embedded systems in a<br />

motor vehicle do not allow or provide any form of mechanism<br />

to update embedded software. Such embedded system code is fixed during production, making it more secure but isolated when it comes to any future software updates. A virtualized environment can be much more accessible, as a VM is, in essence, a stored software image containing all the files required for its operating system, applications, and overall configuration. VM images are held in some form of permanent storage; providing access to this centralized storage medium allows new software to be deployed easily, replacing faulty or obsolete VMs and their corresponding VCUs with updated code.<br />

Although virtualization offers many benefits, it does have drawbacks. One example is overhead: latency is introduced into the system by the additional layers of abstraction between the application software and the underlying hardware, which matters especially under real-time constraints, but this can be reduced with hardware assist. A single point of failure (SPOF) is also a concern, as multiple VMs operate on a single device, but secondary redundant systems can provide an increased level of redundancy.<br />

Virtualization has already begun to enter the automotive domain, but primarily within the human-machine interface (HMI) unit, which provides the vehicle occupants with infotainment services as well as key functions and features for controlling or adjusting vehicle parameters. Infotainment systems are often coupled with some form of connectivity mechanism, either built into the vehicle or provided by mirroring a smart device such as a smartphone. Although access to these different services is through the same HMI device, a clear separation of access has to be in place to provide not just shared functionality but, more importantly, security from external access and threats to the interconnected underlying critical systems.<br />

VI. CONCLUSION<br />

In summary, virtualization can bring many benefits to the E/E architecture as well as address aspects of its overall complexity. Virtualization is not a complete panacea; it has disadvantages, including overhead and SPOF, but some of its main drawbacks are being addressed. As ADAS and vehicle autonomy become mainstream technologies in upcoming vehicle makes and models, the challenges surrounding E/E architecture complexity must be addressed. Virtualization is one technology that can not only meet these challenges but also add further benefits, especially with regard to embedded software. The data center environment has benefitted vastly from virtualization, and many parallels can be drawn from that industry when applying it in an automotive context. ECUs can be consolidated onto centralized, high-specification yet redundant hardware which is flexible and scalable as and when the system demands.<br />

REFERENCES<br />

[1] S. Fürst, "AUTOSAR Adaptive Platform for Connected and Autonomous Vehicles," 1st December 2015. [Online]. Available: https://www.autosar.org/fileadmin/files/presentations/AUTOSAR_Adaptive_Platform_FUERST_Simon.pdf. [Accessed 11 October 2016].<br />

[2] J. A. Cook, "Control, Computing and Communications: Technologies<br />

for the Twenty-First Century Model T," IEEE, special issue on<br />

automotive power electronics and motor drives, vol. 95, pp. 334-355,<br />

2007.<br />

[3] A. Sangiovanni-Vincentelli and M. Di Natale, "Embedded system<br />

design for automotive applications," Computer 40, pp. 42-51, 2007.<br />

[4] F. Simonot-Lion and Y. Trinquet, "Vehicle Functional Domains and<br />

Their Requirements," in Automotive Embedded Systems handbook,<br />

Boca Raton, CRC Press, 2009, pp. 22-43.<br />

[5] S. J. C. Nixon, The Invention of the Automobile, Country Life, 1936.<br />

[6] R. N. Charette, "This Car Runs on Code," 1 February 2009. [Online]. Available: http://spectrum.ieee.org/transportation/systems/this-car-runs-on-code. [Accessed 13 October 2016].<br />

[7] D. Work, A. Bayen and Q. Jacobson, "Automotive Cyber Physical<br />

Systems in the Context of Human Mobility," National Workshop on<br />

High-Confidence Automotive Cyber-Physical Systems, pp. 3-4, 2008.<br />

[8] M. Shavit, A. Gryc and R. Miucic, "Firmware Update Over The Air<br />

(FOTA) for Automotive Industry," SAE, 2007.<br />

[9] R. Chitkara, W. Ballhaus, B. Kliem, S. Berings and B. Weiss, "Spotlight<br />

on Automotive," PwC, 2013.<br />



[10] M. Broy, "Challenges in Automotive Software Engineering," in<br />

Proceedings of the 28th international conference on Software<br />

engineering , Shanghai, 2006.<br />

[11] M. Urbina and R. Obermaisser, "Multi-core architecture for AUTOSAR based on virtual electronic control units," in Emerging Technologies & Factory Automation, Luxembourg City, 2015.<br />

[12] K. Suzuki, "Automotive Electronics Trend in Automotive Industry," Nikkei Automotive Technology, 28 January 2015. [Online]. Available: https://www.slideshare.net/kenjisuzuki397/car-electronization-trend-in-automotive-industry-44007679. [Accessed 16 January 2018].<br />

[13] D. Reinhardt and M. Kucera, "Domain Controlled Architecture," in<br />

Third International Conference on Pervasive and Embedded Computing<br />

and Communication Systems , Barcelona, 2013.<br />

[14] M. Strobl, M. Kucera, A. Foeldi, T. Waas, N. Balbierer and C. Hilbert,<br />

"Towards automotive virtualization," in International Conference on<br />

Applied Electronics (AE), 2013.<br />

[15] D. Reinhardt, D. Kaule and M. Kucera, "Achieving a scalable E/E-architecture using AUTOSAR and virtualization," SAE International Journal of Passenger Cars - Electronic and Electrical Systems, pp. 489-497, 2013.<br />

[16] D. Reinhardt and M. Kucera, "Domain controlled architecture A New<br />

Approach for Large Scale Software Integrated Automotive Systems," in<br />

Third International Conference on Pervasive and Embedded Computing<br />

and Communication Systems, Barcelona, 2013.<br />

[17] M. Broy, H. I. Kruger, A. Pretschner and C. Salzmann, "Engineering<br />

Automotive Software," Proceedings of the IEEE, pp. 356-373, 2007.<br />

[18] R. Ramos, "Self-Driving Vehicles -- Are We Nearly There Yet?," 10<br />

October 2016. [Online]. Available:<br />

http://www.eetimes.com/author.asp?section_id=36&doc_id=1330599&.<br />

[Accessed 11 October 2016].<br />

[19] G. de Boer, P. Engel and W. Praefcke, "Generic remote software update<br />

for vehicle ECUs using a telematics device as a gateway," in Advanced<br />

Microsystems for Automotive Applications, Berlin, Springer, 2005, pp.<br />

371-380.<br />



Cycle Approximate Simulation of RISC-V Processors<br />

Lee Moore, Duncan Graham and Simon Davidmann, Imperas Software Ltd., and<br />
Felipe Rosa, Universidade Federal do Rio Grande do Sul<br />

Abstract<br />

Historically, architectural estimation, analysis and optimization for SoCs and embedded systems have been done using manual spreadsheets, hardware emulators, FPGA prototypes, or cycle approximate and cycle accurate simulators. The precision of the latter comes at the cost of performance and modeling flexibility. Instruction accurate simulation models in virtual platforms have the speed necessary to cover the range of system scenarios, can be available much earlier in the project, and are typically an order of magnitude less expensive than cycle approximate or cycle accurate simulators. Previously, because of a lack of timing information, virtual platforms could not be used for timing estimation. We report here on a technique for dynamically annotating timing information onto the instruction accurate simulation results. This has achieved an accuracy of better than +/-10%, which is appropriate for early design architectural exploration and system analysis. This Instruction Accurate + Estimation (IA+E) approach is constructed using Open Virtual Platforms (OVP) processor models plus a library that can introspect the running system and calculate an estimate for the cycles taken to execute the current instruction. Not only can these add-on libraries dynamically inspect the running system and estimate timing effects, they can annotate calculated instruction cycle timing back into the simulation and affect the timing of the simulation.<br />

Introduction<br />

Performance and power consumption are two key attributes of any SoC and<br />

embedded system. Systems often have hard timing requirements that must be met, for<br />

example in safety critical systems where reaction time is of paramount importance.<br />

Other systems, particularly battery powered systems, have power consumption<br />

limitations.<br />

Because of the importance of these characteristics, many techniques have been<br />

developed for estimation of performance and power consumption. Recently, with the<br />

explosion of system scenarios that must be considered, this job has become much<br />

more difficult.<br />

Instruction accurate simulation has previously not been considered as a potential<br />

technique for timing and power estimation, because it is instruction accurate and does<br />

not model processor microarchitecture details: there is no information about timing or<br />

power consumption of instructions and actions in instruction accurate models and<br />

simulators. Recently some universities, using the Open Virtual Platforms (OVP)<br />

models and OVPsim simulator [1], have experimented with adding this information<br />

into the instruction accurate simulation environment as libraries, with no changes to<br />

the models or simulation engines [2]. These efforts have shown great promise, with<br />



timing estimation results within +/- 10% of the actual timing results for the hardware<br />

for limited cases.<br />

We report here on the further development of this technique, and the extension of this<br />

technique for RISC-V ISA based processors. This is critical for the RISC-V<br />

ecosystem, since for RISC-V semiconductor vendors to win embedded system sockets,<br />

their customers are going to want to know about the timing and power consumption of<br />

those SoCs when running different application software.<br />

Current State of the Art<br />

Historically, SoC architectural estimation, analysis and optimization has been done using manual spreadsheets, hardware emulators, FPGA prototypes, cycle approximate simulators, or cycle accurate and performance simulators such as Gem5 [3]. These all have significant drawbacks: insufficient accuracy, high cost, dependence on RTL availability (meaning that the technique is only usable later in the project, when the RTL design is complete), low performance, limited ability to support a wide range of system scenarios, or great complexity in obtaining good results. Table 1 summarizes the strengths and weaknesses of each technique.<br />

Technique | Strength | Weaknesses<br />
Manual spreadsheets | Ease of use | Lack of accuracy; inability to support estimations with real software<br />
Hardware emulators | Cycle accurate | High cost (millions USD); needs RTL; < 5 MIPS performance<br />
FPGA prototypes | Cycle accurate | High cost (hundreds of thousands USD); needs RTL<br />
Cycle approximate simulation | Good performance | Lack of accuracy; lack of availability of models<br />
Cycle accurate simulation | Cycle accurate | High cost (hundreds of thousands of USD); lack of availability of models<br />
Gem5 | Microarchitectural detail | A lot of work to develop a model of a specific microarchitecture and to get realistic traces of the SoC<br />

Table 1. Strengths and weaknesses of currently used techniques for timing and power estimation.<br />

Instruction Accurate Simulation<br />

Instruction set simulators (ISSs) have long been used by software engineers as a<br />

vehicle for software development. Over the last 20 years, this technique has been<br />

extended to support not only modeling of the processor core, but also modeling of the<br />

peripherals and other components on the SoC. The advantages of these simulators are<br />

their performance, typically hundreds of millions of instructions per second (MIPS),<br />

and the relative ease of building the necessary models. However, the simulator<br />

engines and models are instruction accurate, and are not built to support timing and<br />

power estimation.<br />



The performance of these simulators comes from the use of Just-In-Time (JIT) binary translation engines, which translate the instructions of the target processor (e.g. Arm) to instructions on the host x86 PC. This enables users to run the same executables on the instruction accurate simulator as on the real hardware, such that the software cannot tell that it is not running on hardware. Peak performance with these simulators can reach billions of instructions per second. A more typical use case, such as booting SMP Linux on a multicore Arm processor, takes less than 10 seconds on a desktop x86 machine.<br />
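As a sketch of what "instruction accurate" means in practice, the loop below steps a hypothetical two-instruction toy ISA one instruction at a time, advancing only architectural state with no notion of cycles. The opcodes and register file are invented for illustration and bear no relation to the OVP or Imperas APIs.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical 3-byte toy ISA: 0x01 rd, imm -> load immediate;
   0x02 rd, rs -> add register. Purely illustrative. */
typedef struct {
    uint32_t pc;
    uint32_t reg[8];
} cpu_t;

/* Execute one instruction; instruction accurate: architectural state
   only, no cycle or timing model. */
static void step(cpu_t *cpu, const uint8_t *mem)
{
    uint8_t op = mem[cpu->pc];
    uint8_t a  = mem[cpu->pc + 1];
    uint8_t b  = mem[cpu->pc + 2];
    switch (op) {
    case 0x01: cpu->reg[a] = b;            break; /* li  rd, imm */
    case 0x02: cpu->reg[a] += cpu->reg[b]; break; /* add rd, rs  */
    }
    cpu->pc += 3;
}

/* Run a fixed number of instructions; return register 0. */
uint32_t run(cpu_t *cpu, const uint8_t *mem, size_t ninstr)
{
    for (size_t i = 0; i < ninstr; i++)
        step(cpu, mem);
    return cpu->reg[0];
}
```

A JIT engine replaces the `switch` dispatch with translated host code, but the observable architectural behavior is the same.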

There are also significant libraries of models available, and it is easier to build<br />

instruction accurate models than models with timing or power consumption<br />

information, or real implementation details. One such library and modeling<br />

technology is available from OVP. The OVP processor model library includes<br />

models of over 200 separate processors (e.g. Arm, MIPS, Power, Renesas, RISC-V),<br />

plus a similar number of peripheral models. Most of these models are available as<br />

open source. The C APIs for building these models are also freely available as an<br />

open standard from OVP.<br />

Instruction Accurate Simulation Plus Estimation<br />

Instruction accurate simulation holds the promise of faster simulation performance to support examination of more system scenarios, plus lower cost and earlier availability. With the Imperas APIs and dynamic model introspection, it is easy to add timing and power estimation capabilities into the instruction accurate simulation environment. The approach of adding these capabilities as libraries combines the annotation techniques and binary interception libraries used with JIT simulation engines. Annotation techniques can be imagined as a full instruction trace which is then annotated with the timing or power information. However, annotation alone requires significant host PC memory and can slow the simulation.<br />

Binary interception libraries are used with the Imperas JIT simulators to enable the<br />

non-intrusive addition of tools, such as code coverage and profiling, to the simulation<br />

environment. Combining these techniques maintains the high simulator performance<br />

with minimal memory costs. This combined technique is being called Instruction<br />

Accurate + Estimation (IA+E).<br />

In the Imperas simulation products, which require the use of OVP models, it is possible to create a standalone library module with entry points that are called when instructions are executed. This library can introspect the running system and calculate an estimate for the cycles taken to execute the current instruction, taking into account the overhead of different memory and peripheral component latencies. Not only can these add-on libraries dynamically inspect the running system and estimate timing effects, they can annotate calculated instruction cycle timing back into the simulation and affect (i.e. stretch) the timing of the simulation. An overview of the simulation architecture is shown in Figure 1.<br />



Figure 1. Overview of the Imperas IA+E simulation environment.<br />

For processors, the instruction estimation algorithm includes:<br />

• a mixture of table look ups for simple instructions<br />

• dynamic calculations for data dependent instructions<br />

• adjustments due to code branches taken<br />

• taking into account effects of memory and register accesses<br />
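The estimation steps above can be sketched as a per-instruction callback that combines a static lookup table with dynamic adjustments. The opcode classes, cycle counts and penalties below are illustrative placeholders in the spirit of Figure 2, not the actual Imperas library interface.

```c
#include <stdint.h>

/* Illustrative opcode classes with base cycle counts, as would be
   calibrated from a reference CPU datasheet (cf. Figure 2). */
enum { OP_ALU, OP_LOAD, OP_STORE, OP_BRANCH, OP_CLASSES };

static const uint32_t base_cycles[OP_CLASSES] = {
    [OP_ALU]    = 1,
    [OP_LOAD]   = 2,
    [OP_STORE]  = 2,
    [OP_BRANCH] = 1,
};

/* Estimate cycles for one executed instruction: a table lookup for the
   class, plus a dynamic penalty when a branch is actually taken and an
   extra wait-state count reported by the memory model. */
uint32_t estimate_cycles(int op_class, int branch_taken, uint32_t mem_wait)
{
    uint32_t c = base_cycles[op_class];
    if (op_class == OP_BRANCH && branch_taken)
        c += 2;                /* hypothetical taken-branch penalty */
    if (op_class == OP_LOAD || op_class == OP_STORE)
        c += mem_wait;         /* back-annotated memory latency */
    return c;
}
```

The accumulated total is what the library annotates back into the simulation to stretch simulated time.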

A view of the timing estimation mechanism is shown in Figure 2.<br />

(Load, store, branch, jump, barrier, etc.)<br />

assembly code          ISA timing information (Instr. / Cycles)<br />
bcs 25d0               bcs / 3<br />
ldr r3,[pc,#172]       ldr / 2<br />
str r3,[r7,#16]        str / 2<br />
ldr r3,[pc,#168]       ldr / 2<br />
ldr r3,[r3]            ldr / 2<br />
str r3,[r7,#20]        str / 2<br />
b 255e                 b   / 3<br />

Calibration from a reference CPU datasheet.<br />

Figure 2. Simplified view of the timing estimation mechanism.<br />

For memory subsystems and peripheral components, table lookup and dynamic estimation can be made, and timing back-annotated into the simulation to simulate the delay effects of slow memories and other components.<br />



With this Instruction Accurate + Estimation (IA+E) approach, there is a separation of processor model functionality and timing estimation. This means that while building a functional model there is no need to worry about timing or cycle complexity. Only when more detailed timing is needed is it necessary to add the extra timing data that enables the Imperas IA+E timing tools to provide cycle approximate timing simulation for RISC-V processors.<br />

This extra timing data is added in two steps. First, the cycle information is added to<br />

the library. Second, the time per cycle, which is dependent upon the specific<br />

semiconductor process and physical implementation details, is added.<br />
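The second step then reduces to scaling the accumulated cycle count by the implementation's clock period; a minimal sketch, where the clock frequency is an assumed configuration parameter supplied by the processor designer.

```c
#include <stdint.h>

/* Convert an accumulated cycle count into simulated time. clock_hz
   depends on the specific semiconductor process and physical
   implementation, so it is supplied as configuration. */
uint64_t cycles_to_ns(uint64_t cycles, uint64_t clock_hz)
{
    /* time [ns] = cycles * 1e9 / f; multiply first to keep integer
       precision for realistic clock rates. */
    return (cycles * 1000000000ULL) / clock_hz;
}
```

For example, 100 cycles at an assumed 100 MHz clock corresponds to 1000 ns of simulated time.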

The approach of providing the timing data as a separately linked dynamic program<br />

enables RISC-V processor designers to create a cycle approximate timing simulation<br />

for their specific processor implementation - without sharing any internal information.<br />

IA+E simulation runs slower than normal instruction accurate simulation, with a typical overhead of about 50%. Still, this puts IA+E simulation performance at 100-500 MIPS.<br />

IA+E does have some limitations. The technique has currently been proven only for simple single-core processors with no cache and an in-order pipeline.<br />

Results<br />

This IA+E technique was first tested with Arm Cortex-M4 based processors. The results were much better than expected, with an average estimation error within +/- 5% of the actual device. The device was an STMicroelectronics STM32F on a standard development board, running the FreeRTOS real-time operating system, with 39 different benchmark applications used. Almost all timing estimation errors were within +/- 10% of actual timing values. Figure 3 shows these results.<br />

Figure 3. Timing estimation results for IA+E simulation show average errors of<br />

better than +/- 5% over 39 different benchmarks for Arm Cortex-M4.<br />



IA+E was recently extended to support RISC-V processors, using publicly available information (from the processor vendors' data books) to build the cycle data libraries. In the data below, showing processor implementations from Andes Technology, Microsemi and SiFive, only cycle data is presented, since comparing timing across the various implementations would not be an accurate comparison. In keeping with this theme, different benchmark applications were used for each of the processors. All benchmarks were run with a range of compiler optimization settings, and estimated cycles were reported first assuming 1 cycle per instruction (i.e. using IA), then using the IA+E technique. These results are shown in Figure 4.<br />

Figure 4a. IA+E cycle estimation results for the Andes N25 processor.<br />

Figure 4b. IA+E cycle estimation results for the Microsemi Mi-V RV32IMA<br />

processor.<br />

Figure 4c. IA+E cycle estimation results for the SiFive E31 processor.<br />



Conclusions<br />

The Instruction Accurate + Estimation (IA+E) technique developed here has shown<br />

excellent results for timing estimation of in-order processors. It also has the benefits<br />

of easy model building, high performance to enable examination of multiple<br />

benchmarks and system scenarios, and lower cost than other techniques. In this paper,<br />

the IA+E technique has been extended to support RISC-V processors. Further work<br />

is needed to apply this technique to power estimation, and to more complex<br />

processors.<br />

Acknowledgements<br />

We would like to thank Andes Technology, Microsemi, and SiFive for access to their<br />

processor datasheets/databooks.<br />

References<br />

1. www.OVPworld.org<br />

2. Felipe Da Rosa, Luciano Ost, Ricardo Reis, Gilles Sassatelli. Instruction-<br />

Driven Timing CPU Model for Efficient Embedded Software Development<br />

Using OVP. ICECS: International Conference on Electronics, Circuits, and<br />

Systems, Dec 2013, Abu Dhabi, United Arab Emirates.<br />

3. Gem5, www.gem5.org<br />



Comparing Automotive Secure Gateway Design<br />

Approaches<br />

Carmelo Loiacono<br />

Field Applications Engineer<br />

Green Hills Software<br />

Turin, Italy<br />

carmelo@ghs.com<br />

Abstract— Considering the complexity of today's cars, guaranteeing their security is not an obvious task. A hacker could attack the on-board networks of the car even without physical access, and sending malicious messages to ECUs over the CAN bus can potentially compromise the safety of the vehicle. To prevent such attacks, automotive architectures introduced the Secure Gateway. Since the Secure Gateway is a complex system, bad design can compromise the security, and potentially the safety, of the car. We focus on analyzing Secure Gateway design methods, compare different design approaches, and give guidelines to guarantee the security of the whole system. Finally, we discuss the advantages of using a separation kernel and the important hardware requirements for Secure Gateways.<br />

(Figure 1 connectivity labels: Cloud (3G-LTE-GPRS), V2V (IEEE 802.11p), V2I (IEEE 802.11p), External Devices (USB-WIFI-BT))<br />

Keywords—Automotive Secure Gateway, Separation Kernel,<br />

System Security<br />

I. INTRODUCTION<br />

The vehicle and mobility industry is dealing with the trend<br />

of bringing different electronic domains onto a single platform.<br />

This leads to the challenge of enabling applications with more<br />

strict security and safety requirements to work in a trusted<br />

environment on a single platform. Vehicle internal networks are<br />

now more connected to external devices, thereby exposing the<br />

internal network to the outside world.<br />

Moreover, the evolution of Vehicle to Everything (V2X) communication has increased data exchange with external resources via Wi-Fi, 3G, and LTE networks. Automotive ECUs could be subject to external attacks that aim to control their software behavior. Such attacks arrive as data over regular communication channels (e.g. an external network) and, once resident in program memory, trigger pre-existing hardware and software vulnerabilities. By exploiting such flaws, these attacks can subvert the execution of the software and gain control over its behavior [1]. Figure 1 shows how modern cars are connected to different sources, increasing the attack surface.<br />

Secure Gateways (SGs) are used to separate the internal vehicle networks from the external one, i.e. to protect the internal communications from potential attacks coming from external sources.<br />

Fig. 1 Modern Connected Cars<br />

SGs are crucial for the security of the vehicle, so compromising the security of the SGs can compromise the security of the whole system. There are two main aspects to consider for SG security: low-level security, related to the operating system running on them, and security of the data and applications. In this paper we analyze different possible ways to design SGs with respect to security and safety aspects. We also provide suggestions and guidelines for designing secure SGs.<br />

The rest of the paper is organized as follows. Section II compares SG designs and gives design guidelines. Section III presents methods to manage I/O devices in a virtualized system. Finally, Section IV concludes the paper with summarizing remarks.<br />

II. SECURE GATEWAYS SOFTWARE DESIGN<br />

With increasing intelligence, modern vehicles are equipped with more and more sensors, such as sensors for detecting road conditions and driver fatigue, sensors for monitoring tire pressure and water temperature in the cooling system, and advanced sensors for autonomous control [2].<br />

In addition, the increasingly interconnected nature of a vehicle's control modules means there is no safety without security. Security features must cover not just physical access and protection of confidential information, but also critical safety systems.<br />

For this reason, using a SG to protect the internal network from external attacks is very important for the security and safety of the vehicle. Figure 2 shows the main software components of a SG. The Application Environment is composed of non-critical applications with no safety or security requirements. Security services are used to guarantee the confidentiality and integrity of messages exchanged between SG components or between ECUs, using symmetric and asymmetric cryptography. Indeed, SGs are mixed-criticality systems in which jobs run with different security and safety requirements. SGs should be designed to ensure that the execution of non-trusted applications does not compromise the execution of the others. This can be achieved thanks to the separation properties offered by separation kernels.<br />

A. Separation Kernels<br />

Separation Kernels offer advanced features to embedded systems software developers who need to ensure that heterogeneous software components are free from interference, protect the information flow, and reinforce the car communication system with respect to security and safety requirements. A well-designed Separation Kernel must ensure that errors within a process do not propagate through the whole system; this can be done by confining the writing space of each process to a specific memory area. The Separation Kernel consists of "compartments" named partitions, and a process runs in each of these partitions. A process running in a partition can be composed of multiple tasks (threads). Inside a partition separation is not guaranteed, whereas separation is ensured between different partitions. The key benefits of the separation kernel are the following: to act as an error container, to allow different critical processes to execute without interference on a single hardware platform, to ensure confidentiality of sensitive data, and to allow new features to be integrated without having to re-test the entire system.<br />

Operating systems that do not use separation as their foundation can enter undefined states, deadlock, and exhibit non-deterministic execution flow. This can have serious consequences, especially in the automotive field. A separation kernel designed for use in critical systems, such as SGs, must ensure that computational and memory resources are always available to each process running in a partition. Another important property of security-oriented operating systems is prevention of denial-of-service attacks. Usually, such attacks are avoided by assigning each process a fixed amount of CPU and memory resources. Moreover, the static allocation of time resources ensures that each process executes in a given time window. This preserves the integrity of the processes by preventing execution outside their temporal windows.<br />
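Such a fixed time-window scheme can be pictured as a static table mapping the position in a repeating major frame to a partition; the partition IDs and window lengths below are invented for illustration, not taken from any particular separation kernel.

```c
#include <stdint.h>

/* Static partition schedule: the major frame repeats forever and every
   window is fixed at integration time, so no partition can starve
   another of CPU time (denial-of-service containment). */
typedef struct {
    int      partition;   /* partition scheduled in this window */
    uint32_t length_us;   /* fixed window length in microseconds */
} window_t;

static const window_t schedule[] = {
    { 0, 500 },   /* e.g. network stack / gateway filtering */
    { 1, 300 },   /* e.g. security services */
    { 2, 200 },   /* e.g. non-critical applications */
};
enum { NWINDOWS = sizeof schedule / sizeof schedule[0] };

/* Return the partition that owns the CPU at time t (microseconds). */
int partition_at(uint64_t t_us)
{
    uint64_t major = 0;
    for (int i = 0; i < NWINDOWS; i++)
        major += schedule[i].length_us;

    uint64_t off = t_us % major;          /* position in major frame */
    for (int i = 0; i < NWINDOWS; i++) {
        if (off < schedule[i].length_us)
            return schedule[i].partition;
        off -= schedule[i].length_us;
    }
    return -1;                             /* unreachable */
}
```

Because the table is immutable at run time, a misbehaving partition cannot extend its own window.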

The main requirements of SGs are: real-time behavior, safety, security, reliability, and performance. The microkernel architecture, adopted by some separation kernels such as the INTEGRITY RTOS [3] from Green Hills Software, ensures that the kernel is easy to test and verify, so that it can be shown free of bugs and security holes. In microkernel architectures only basic services are part of the kernel: support for communication between partitions (IPC), virtual memory management, and scheduling. More complex services run inside the partitions, which allows for a safer and more reliable kernel. Figure 3 shows a separation architecture built on a microkernel. Notice that some services, such as file system management and device drivers, run in separate partitions, independent of the microkernel, which implements only basic functions.<br />

B. Linux on Secure Gateways<br />

Embedded Linux, with its kernel and software packages consisting of millions of lines of code, provides an attractive set of ready-made software that is also useful for SG design. Since it is virtually impossible to test millions of lines of code, it is inevitable that Linux will continue to contain security vulnerabilities and software bugs. Also, the increasingly interconnected nature of embedded systems allows hackers to exploit those vulnerabilities, sometimes even letting them perform remote attacks.<br />

When it is not possible to replace Linux with a Separation Kernel operating system, a powerful method for improving the security of SGs that run Linux is to use a hypervisor that guarantees separation between the system software components.<br />

Fig. 2 Secure Gateways Software Components<br />

Fig. 3 Separation MicroKernel Architecture<br />

A hypervisor is a layer of software below the OS that runs at a higher privilege level than the OS and virtualizes the hardware resources. Because of the higher privilege level, the integrity of the hypervisor remains intact even if the OS is compromised. A hypervisor that is designed from the ground up to be secure and reliable offers significant advantages over hardware for implementing low-level security. Also, it can provide multiple levels of privilege so that a service with sensitive data can run in an isolated "compartment", or partition, alongside a service with less sensitive information. While these different levels of security can run concurrently, they are never able to see or modify each other's data.<br />

There are other approaches to supporting multiple OS contexts than using a Separation Kernel or a classic Type-1 hypervisor. Linux Containers (LXC) [4] is a method for running multiple isolated Linux systems (containers) on a control host using a single Linux kernel. FreeBSD Jails [5] follow a similar container scheme with a BSD-compatible userland. However, containers do not address kernel-level attacks, in particular against device drivers, which run privileged on Linux and BSD systems. Containers are less secure than Separation Kernels and hypervisors because the kernel that hosts the containers has a much larger attack surface. The smaller attack surface of the latter decreases the probability that a privilege escalation attack will allow an attacker to compromise the security of a virtual machine and affect other components of the system.<br />

III. I/O DEVICE SECURITY<br />

An important aspect of the security of a virtualized system is device management. In particular, we refer to devices that will be made available to a Guest OS (Linux) on the SGs described in Section II.B. Figure 4 shows a use case where a system such as a SG uses high-speed expansion ports that permit direct memory access (DMA devices). If the Separation Kernel were to give complete control of DMA devices to the Linux Guest, the security of the whole system could be compromised. Indeed, using DMA, the Guest OS could instruct the device to read or write directly to any area of main memory, including the kernel. Unless specific protection is in place, an attacker can use such a facility to gain direct access to part or all of the physical memory address space of the system, bypassing all security mechanisms.<br />

For this reason, many modern SoCs have introduced functionality to limit the scope of what a DMA device can access: the IOMMU. The IOMMU provides a programmatic interface to define which ranges of addresses the device can access. This allows device drivers to run purely in a Separation Kernel partition, or in a Guest OS. While direct device access from the Guest is strongly discouraged where an IOMMU is not capable of protecting the system, this is a common practice, taken as a compromise for the sake of either maintainability or time-to-market.<br />
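Conceptually, the IOMMU amounts to an allow-list of address windows that the kernel programs before a device is handed to a guest. The following simplified software model of that check (the window layout is invented for illustration) shows why a transfer must fall entirely inside a granted window.

```c
#include <stdint.h>

/* Simplified model of an IOMMU allow-list: a DMA transfer is permitted
   only if it falls entirely inside a window granted to that device. */
typedef struct {
    uint64_t base;   /* start of a granted address window */
    uint64_t size;   /* window length in bytes */
} dma_window_t;

int dma_allowed(const dma_window_t *win, int nwin,
                uint64_t addr, uint64_t len)
{
    for (int i = 0; i < nwin; i++) {
        /* Ordered to avoid unsigned overflow/underflow: first confirm
           the start and length fit, then the offset within the window. */
        if (addr >= win[i].base &&
            len <= win[i].size &&
            addr - win[i].base <= win[i].size - len)
            return 1;   /* whole transfer inside one granted window */
    }
    return 0;           /* would touch memory outside the allow-list */
}
```

A transfer that even partially escapes every window is rejected, which is exactly the property that stops a compromised guest from reaching kernel memory.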

Without such hardware protection, DMA devices should<br />

instead be managed by the Separation Kernel to ensure that a<br />


flaw in a Guest OS device driver cannot wrongfully program the<br />

DMA hardware and cause potentially fatal memory corruption.<br />

More precisely, the DMA requests need to be handled by the<br />

Separation Kernel, while the more complex part of the driver can<br />

still run in the Guest. This pushes the driver implementation toward a specific, para-virtualized model, to ensure the required behaviour can still be achieved. The added complexity is the price to pay for<br />

keeping the system robust, safe and secure, and experience<br />

demonstrates that the overhead is actually smaller than<br />

anticipated when the interface is correctly designed.<br />

Fig. 4 Improving security using a Separation Kernel and Virtualization<br />

IV. CONCLUSION<br />

Secure Gateways (SGs) are complex systems and they have<br />

to guarantee the security of the vehicle from external attacks.<br />

Using a secure and reliable Separation Kernel offers several advantages for improving the safety and security of the SGs, assuring the separation between critical and non-critical software components while providing multi-level protection.<br />

Typically such SW components can manage different types of<br />

buses, i.e., the whole purpose of the separation solution between<br />

different domains is to act as a gateway on the same hardware<br />

box, and possibly add filtering and gateway protection features.<br />

Finally, Separation technology opens up new scenarios in the automotive world and improves the security of the whole system.<br />

REFERENCES<br />


[1] Erlingsson Ú., Younan Y., Piessens F. (2010) Low-Level Software<br />

Security by Example. In: Stavroulakis P., Stamp M. (eds) Handbook of<br />

Information and Communication Security. Springer, Berlin, Heidelberg.<br />

[2] N. Lu, N. Cheng, N. Zhang, X. Shen and J. W. Mark, "Connected<br />

Vehicles: Solutions and Challenges," in IEEE Internet of Things Journal,<br />

vol. 1, no. 4, pp. 289-299, Aug. 2014.<br />

[3] https://www.ghs.com/products/rtos/integrity.html<br />

[4] https://linuxcontainers.org/it/<br />

[5] https://www.freebsd.org/it/<br />

www.embedded-world.eu<br />

451


Smart Contracts for Industry 4.0 Using<br />

Blockchain<br />

Christoph Reich<br />

University of Applied Science Furtwangen<br />

Christoph.reich@hs-furtwangen.de<br />

Abstract<br />

Because of the digital transformation of enterprises, stronger collaboration between companies is expected, in order to achieve customer-adapted, individual hybrid business models. These inter-enterprise business models need secure, reliable, and repeatable ways to log and monitor the information flow between manufacturing machines, users and service providers.<br />

Blockchain technology allows end-to-end trust chains, the creation of digitized Service Level Agreement (SLA) contracts, and evidential control of the data flow between the enterprises.<br />

Introduction<br />

The progressive conversion of the manufacturing industry to Industry 4.0 technologies and the<br />

associated networking of the individual components of production facilities, logistics and employees<br />

enable companies to collect detailed information about their processes and products, in order to<br />

analyse data for process optimization, condition monitoring, predictive maintenance, etc.¹ The use<br />

of data across company boundaries results in further innovative hybrid business models. According<br />

to vbw², strong growth in these Industry 4.0 hybrid business models is expected. Manufacturing<br />

companies are turning into industrial service providers who will offer their customers individual<br />

industrial services. Future value chains will be highly networked structures with a large number of<br />

involved people, IT systems, automation components and machines.<br />

A fundamental requirement for the acceptance of the provision and exchange of information between<br />

the customers/users and the service providers is the trust in the underlying systems, the strict<br />

adherence to Service Level Agreements (SLAs) and the observance of the protection goals<br />

availability, confidentiality, authenticity, integrity and traceability in information processing across<br />

organizational and corporate boundaries. Blockchain technology (such as Hyperledger or Ethereum), as a distributed database, ensures encrypted, immutable, permanent (persistent), traceable and auditable<br />

storage of cross-company information with guaranteed integrity. The essential basis of the blockchain<br />

concept is the technique of distributed consensus building, which replaces trust in a third party with<br />

trust in a collective of participants, technology and cryptography.<br />

This paper starts with an introduction of blockchain and smart contracts, followed by an overview of<br />

possible Industry 4.0 use cases, where blockchain is an interesting approach, and ending with a<br />

conclusion.<br />

¹ Institut der deutschen Wirtschaft Köln; „Digitalisierung und Mittelstand – Eine Metastudie“;<br />

https://www.iwkoeln.de/_storage/asset/312105/storage/master/file/10916485/download/IW-<br />

Analyse_2016_109_Digitalisierung_und_Mittelstand.pdf; Nov. 2016<br />

² vbw - die bayrische Wirtschaft; „Neue Wertschöpfung durch Digitalisierung Analyse und<br />

Handlungsempfehlungen“; 2017; https://www.vbw-bayern.de/Redaktion/Frei-zugaengliche-Medien/Abteilungen-<br />

GS/Forschung-Technologie/2017/Downloads/vbw_Zukunftsrat_Handlungsempfehlung-V14RZ-Ansicht.pdf<br />



Blockchain and Smart Contracts<br />

The blockchain technology and smart contracts have the capability to solve some of the industry’s<br />

crucial problems, like provable product traceability, autonomous payment, etc.<br />

Blockchain<br />

The blockchain technology should be tamper-proof thanks to a clever combination of proven<br />

technologies and encryption mechanisms. In addition, it should make its users independent of<br />

monolithic systems and the associated risks. The best-known blockchain application to date is Bitcoin, a worldwide, decentralized digital payment currency.<br />

The blocks of a blockchain are individual records strung together in chronological order. Each block contains a hash (a kind of checksum) of the previous block. Before a new block can be recorded in the blockchain, it has to be verified, including its hash, as shown in Figure 1 below:<br />

Figure 1: Blockchain<br />

The hash chaining guarantees that already written blocks can hardly be changed afterwards, or only with very high effort. A change in an older block causes the hashes of all following verified blocks to deviate, making the change immediately apparent. The blockchain is checked in a decentralized network by all authorized subscribers. Only if consensus among all subscribers is achieved can a new block be added. Blockchain features are:<br />

• Transaction transparency<br />

• Decentralized data records<br />

• Tamper-proof<br />
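The chaining and tamper-detection property described above can be sketched in a few lines (an illustrative Python model; a real blockchain additionally uses Merkle trees, signatures and a consensus protocol, and the field names here are invented):<br />

```python
import hashlib
import json

def block_hash(block):
    """Hash of a block's data together with the previous block's hash."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, data):
    """Link a new record to the current end of the chain."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    block = {"data": data, "prev_hash": prev}
    block["hash"] = block_hash(block)   # hash over data + prev_hash
    chain.append(block)

def verify(chain):
    """Every block must match its own hash and link to its predecessor."""
    for i, block in enumerate(chain):
        expected = block_hash({"data": block["data"],
                               "prev_hash": block["prev_hash"]})
        if block["hash"] != expected:
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

chain = []
append_block(chain, "machine 7: temp 41 C")
append_block(chain, "machine 7: maintenance done")
assert verify(chain)

chain[0]["data"] = "machine 7: temp 20 C"   # falsify an old record
assert not verify(chain)                    # the deviation is detected
```

Changing the old record invalidates its hash, so every later block's `prev_hash` link no longer matches, which is why falsification is detectable.<br />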

Blockchain technology provides a platform for industry to irreversibly store data, values or properties of things in a network. For example, production data or important measured values can be registered, as well as contracts or agreements that have been concluded. This creates a platform that is trusted by all participants within a production or supply chain network. The blockchain thus creates trust between partners who are not yet familiar with each other or with each other's processes.<br />

Smart Contracts<br />

A smart contract is a contract implemented in the blockchain as a piece of software in which various contractual conditions can be stored. The contractual conditions defined in the digital contract are automatically monitored, and specified actions are carried out automatically based on the information received [Christidis16].<br />

Ensuring the traceability (notary function) and the immutability in the use of such smart contracts is<br />

thus a key requirement for the contractual implementation of collaborative processes between several<br />

companies. Previous prototype implementations using blockchain technologies, however, focus<br />

primarily on financial applications (Bitcoin) or applications in supply chain management<br />

[Tschorsch16]. Smart contracts and the associated automation can be used to improve many processes and, in some cases, to reduce the need for certified inspection bodies, provided the consistency of the information is ensured by a smart contract and audit-proof storage. Once information has been confirmed by the smart contract, it is documented in an audit-proof way and can be integrated in a variety of contexts. Thus, from<br />



a technological point of view, the blockchain is a natural tool for process optimization. If, for example, a video can only be imported into a community platform when the corresponding audio rights are available, the entire monitoring process can be omitted, since this consistency is easy to maintain through smart contracts. A simple example of a coin transfer that is checked to be balanced is shown in Figure 2.<br />

Figure 2: An Example Smart Contract on Ethereum [ethereum]<br />
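The logic of such a contract can be paraphrased outside the blockchain as follows (an illustrative Python sketch of the balance check only, not the Solidity source referenced in [ethereum]; the class and account names are assumptions):<br />

```python
class TokenContract:
    """Toy version of a 'minimum viable token': transfers may only
    move existing coins, so the total supply stays balanced."""

    def __init__(self, owner, supply):
        self.balances = {owner: supply}

    def transfer(self, sender, receiver, amount):
        # Reject transfers that would overdraw the sender.
        if amount <= 0 or self.balances.get(sender, 0) < amount:
            raise ValueError("insufficient balance")
        self.balances[sender] -= amount
        self.balances[receiver] = self.balances.get(receiver, 0) + amount

token = TokenContract("factory_A", 1000)
token.transfer("factory_A", "supplier_B", 250)
assert token.balances == {"factory_A": 750, "supplier_B": 250}
assert sum(token.balances.values()) == 1000   # supply stays balanced
```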

Industry 4.0 Blockchain Use Cases<br />

The prerequisite for using blockchain technology in the industrial environment is an infrastructure that operates the blockchain and ensures access to the blockchain for all subscribers. Blockchain infrastructures are divided into public (e.g. Ethereum) and private. A private blockchain is made available only to one's own network. The infrastructure for this can be operated by the company itself or by a cloud provider. It consists of distributed servers, each operating one node of the blockchain network. Ideally, each participant in a production network operates a blockchain node.<br />

Each node is connected to the others, and each node has a complete copy of all the data stored in the blockchain. If legacy systems, sensors, machines, etc. must be connected, a unique identity and protection against identity manipulation, a tamper-proof connection and protection against malware have to be ensured. This also applies to all other physical and logical blockchain subscribers, e.g. Manufacturing Execution Systems (MES) or gateways, that want to put information into the blockchain. In semi-automated manufacturing processes, it is also necessary to include people's activities in the blockchain, which requires support for the management of identities and the use of appropriate input and output devices.<br />

Recently, interest in exploiting blockchain in the manufacturing industry has increased dramatically. Applications of blockchain for supply chain management and auditing are usually mentioned first: smart contracts make each step of a supply chain more transparent, allowing product movement to be tracked from the factory to the store shelves or along the value chain. IoT devices can write location data straight to a smart contract, which simplifies the tracking process. Such a feature provides real-time visibility of an entire supply chain and may improve a business, e.g. by detecting products that are stuck at customs, and reduces the risk of fraud and theft as well.<br />

Before a detailed use case is described, additional use cases are summarized in the following table, based on the paper by Bahga and Madisetti [Bahga2016]:<br />



Application | Short Description<br />

On-Demand Manufacturing | Manufacturing services (such as CNC machining or 3D printing) provided by sending transactions to the machines.<br />

Smart Diagnostics & Machine Maintenance | Machines will be able to monitor their state, diagnose problems, and autonomously place service, consumables replenishment, or part replacement requests with the machine maintenance vendors.<br />

Product Certification | The manufacturing information for a product (such as the manufacturing facility details, machine details, manufacturing date and parts information) is recorded to prove the authenticity of the products.<br />

Tracking Supplier Identity & Reputation | Applications track various performance parameters (such as delivery times, customer reviews and seller ratings) for sellers.<br />

Registry of Assets & Inventory | Applications for maintaining records of manufacturing assets and inventory.<br />

Reliable interactive maintenance<br />

Regular maintenance work is depicted in so-called maintenance plans, which the machine and system builder creates for the respective system. Which maintenance tasks have to be carried out remotely by the machine builder, and which by the plant operator himself, is specified in service maintenance contracts. These contracts vary from no-service to full service by<br />

the machine builder. Figure 3 shows a small example of blockchain subscribers for a possible smart<br />

contract (pseudo code). The digitized contract checks the daily maintenance (cleancheck), logs<br />

the hourly temperature (temp) and allows remote maintenance between 15:00 and 18:00<br />

(productionLineCheck).<br />
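The rules described above can be paraphrased as follows (an illustrative Python sketch; the rule names cleancheck, temp and productionLineCheck are taken from the text, while the class and method structure are assumptions, not the pseudo code of Figure 3):<br />

```python
from datetime import datetime

class MaintenanceContract:
    """Toy version of the maintenance rules described in the text."""

    def __init__(self):
        self.log = []   # append-only service logbook (the ledger)

    def cleancheck(self, day, done):
        # Daily maintenance must be confirmed once per day.
        self.log.append((day, "cleancheck", done))
        return done

    def temp(self, timestamp, value_celsius):
        # The hourly temperature reading is simply recorded.
        self.log.append((timestamp, "temp", value_celsius))

    def production_line_check(self, timestamp):
        # Remote maintenance is permitted only between 15:00 and 18:00.
        allowed = 15 <= timestamp.hour < 18
        self.log.append((timestamp, "productionLineCheck", allowed))
        return allowed

contract = MaintenanceContract()
contract.temp(datetime(2018, 2, 27, 9, 0), 41.5)
assert contract.production_line_check(datetime(2018, 2, 27, 16, 30))
assert not contract.production_line_check(datetime(2018, 2, 27, 19, 0))
```

On a blockchain platform, every entry in the logbook would be a verified transaction, giving the maintenance transparency discussed below.<br />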

In addition to maintenance-specific smart SLA contracts, the individual maintenance plans are also<br />

to be mapped and monitored by smart contracts to attest to the fulfillment of the maintenance tasks.<br />

If damage occurs despite the maintenance measures, the service logbook of the blockchain platform (the ledger) provides maintenance transparency, and the responsibility for the repair is clearly defined.<br />

Figure 3: Pseudo Code of a Smart Contract for Maintenance<br />



In addition to the regular maintenance intervals, individual maintenance steps can be checked and the proof of actions immutably stored. The maintenance contracts can be easily customized according to the customer's requirements and to each individual machine. With information about the equipment (e.g., age, replacement parts, etc.), maintenance plans can be further optimized. All in all, this leads to individual maintenance contracts that change over time, implemented as smart contracts in the blockchain.<br />

Conclusion<br />

This paper introduced blockchain technology and described smart contracts. Thanks to the decentralized nature of blockchain technology, peers can interact in a trustless, auditable manner. Smart contracts allow us to automate complex multi-step processes to reach an agreement without involving any intermediaries. The paper concluded by describing some Industry 4.0 blockchain use cases. The full potential of blockchain technology has not yet been discovered, and further exploitation of the technology for the manufacturing industry is required.<br />

References<br />

[Christidis16] K. Christidis and M. Devetsikiotis, "Blockchains and Smart Contracts for the Internet<br />

of Things," in IEEE Access, vol. 4, pp. 2292-2303, 2016.<br />

[Tschorsch16] F. Tschorsch and B. Scheuermann, "Bitcoin and Beyond: A Technical Survey on<br />

Decentralized Digital Currencies," in IEEE Communications Surveys & Tutorials, vol. 18, no. 3,<br />

pp. 2084-2123, thirdquarter 2016.<br />

[ethereum] Minimum Viable Token Coin Example; Source: https://www.ethereum.org/token<br />

[Bahga2016] Arshdeep Bahga, Vijay K. Madisetti; “Blockchain Platform for Industrial Internet of<br />

Things”; Journal of Software Engineering and Applications; Vol. 9, No. 10 (2016), Article ID 71596, 14 pages; doi:10.4236/jsea.2016.910036<br />



How connected cars are driving connected payments<br />

James Carroll<br />

CTO, Solutions Team<br />

Mobica<br />

Wilmslow, United Kingdom<br />

Jim.carroll@mobica.com<br />

Abstract— Technologies associated with IoT are<br />

inexpensive, low powered and frequently based on common<br />

software platforms. Recent rapid development in the<br />

automotive/haulage industries allows the integration of this tech<br />

into cars/trucks. These developments have coincided with the<br />

simultaneous rise in FinTech with the use of technology<br />

supporting financial services such as payments, fostering a<br />

commonality of innovative development across these domains.<br />

This paper describes innovative opportunities created by<br />

merging one market with another and also driven by IoT.<br />

Illustrated by the implementation of an autonomous "pay at<br />

pump" use case as an example, the integration of IoT and<br />

payments technology into car IVI systems and fuel pumps will<br />

allow the financial transaction associated with buying fuel to be<br />

handled completely autonomously. The deployment of computer<br />

vision technology for authentication, and connectivity for<br />

communication with payment authorities, will facilitate<br />

autonomous payments in a large range of scenarios, including, for example, road-toll payments.<br />

I. INTRODUCTION<br />

Modern cars generally include an in-vehicle infotainment<br />

(IVI) system. These systems usually combine multimedia,<br />

navigation, radio and telephony functions. Software to execute<br />

these functions requires a sophisticated operating system (OS)<br />

to provide underlying services. Rather than develop a fully<br />

custom OS, automotive manufacturers (OEM) and Tier 1<br />

suppliers tend to adopt existing, licensed OSs, such as Linux,<br />

Windows or QNX on which to build IVI systems. Such OSs<br />

require significant processor power to execute applications and<br />

services at an appropriate level of performance. Suitable<br />

System-on-Chips (SoC) for such applications are provided by<br />

semiconductor companies such as Intel, Renesas, NVIDIA and<br />

Qualcomm. Each SoC typically includes multiple CPU cores, a<br />

DSP and a GPU, in addition to an array of on-chip peripheral<br />

hardware.<br />

This combination of application software, complex OS and<br />

SoC is similar to the approach taken to build mobile phones.<br />

Application software such as ApplePay, AndroidPay and<br />

SamsungPay in combination with Near Field Communication<br />

(NFC) or Bluetooth Low Energy (BLE) peripheral hardware and<br />

techniques like Host Card Emulation (HCE) and QR codes allow<br />

a mobile to be used as a payment device.<br />

It is therefore possible to build payment technologies into<br />

automotive IVI systems using a similar approach, enabling the<br />

car to become a payment device.<br />

Note that although NFC is commonly used in mobile phones<br />

to facilitate payments, it may be impractical for automotive use<br />

cases. The range for NFC communication is 20 cm at most with<br />

most systems working at a range of less than 4 cm between<br />

receiver and transmitter. Reliably positioning a car with this<br />

proximity to an NFC reader is difficult. BLE has a much larger<br />

range, so requires less accuracy; QR codes are based on cameras,<br />

and have a still larger range, but require “line of sight”; wired<br />

solutions are the most reliable and secure, but may be considered<br />

less “user friendly”. None of these alternatives are yet<br />

standardised for payment use cases.<br />

Some IVI systems, when combined with specific mobile<br />

phones, provide support for “projection” type functionality -<br />

based on Android Auto and Apple CarPlay. With these systems,<br />

application software running on a mobile phone and connected<br />

to the IVI system renders onto the IVI system; the application is<br />

fully usable using the IVI system only. In this use case, the car<br />

may use the phone as a payment device. This use case is not<br />

considered in this paper; here we focus only on integrated<br />

solutions.<br />

Additionally, there are specific variants of Android available<br />

for use in IVI systems - O.Car is a recent version. As these<br />

variants present the same Application Programming Interfaces<br />

(API) for developers, existing software will be readily portable<br />

to these devices. However, the underlying hardware or software<br />

enablers are not available in IVI devices at present.<br />

A car is of little use as a payment device if there are no devices available capable of accepting payments from it.<br />

Some devices such as toll road booths are already able to accept<br />

payment without human intervention. Other use cases, such as<br />

fuel payment at the pump and fast food drive-throughs are<br />

equipped for the use of card readers. Relatively simple changes<br />

are required to support IVI based payments.<br />



II. THE “PAY AT PUMP” USE CASE<br />

Fig. 1. Connected Car Pay at Pump Usage Model<br />

Figure 1 illustrates how a connected car pay at pump model would work:<br />

● The driver fuels the car;<br />

● When triggered by the driver, the IVI system communicates with the pump, using a secure, short range wireless connection, providing authentication. This may happen whilst refueling is in progress;<br />

● When the driver has completed refueling and the authentication process, the car transfers money to the fuel company electronically, over the internet.<br />

III. IN-CAR TECHNOLOGY<br />

Fig. 2. IVI Software Components<br />

Figure 2 illustrates a typical architecture of an IVI system. In common with all modern OS deployments, a layered approach is taken. The diagram is not exhaustive - it shows only the components relevant to the use of connected cars as payment devices. The middleware and Board Support Package (BSP) components can be considered as core OS components. The purpose of each of the layers is:<br />

• Application: provides the user with a method of accessing the payment services supported by the IVI system;<br />

• Middleware: provides hardware abstraction and protocols for the IVI system to communicate with external devices (pump and payment providers);<br />

• BSP: controls underlying hardware directly.<br />

Figure 2 shows that much of the required technology is already available (but not necessarily enabled) in the core OSs deployed in IVI systems:<br />

• Short range wireless and wired protocols: WiFi, BLE,<br />

NFC, USB;<br />

• Encryption algorithms;<br />

• Application environments: C/C++, Java, JavaScript,<br />

HTML5;<br />

• Internet connectivity - WiFi, 4G;<br />

• Hardware support - BLE, NFC, camera, image<br />

processing;<br />

• Security frameworks.<br />

The most significant software work required is porting and<br />

enabling of existing OS features and hardware support for the<br />

target devices. It is possible that device drivers for specific<br />

hardware modules may be unavailable directly, e.g. NFC,<br />

encryption, DSP hosted algorithms, MMU schemes, TEE.<br />

However, in such cases, it is likely that the semiconductor<br />

vendor will provide a base driver for a reference platform.<br />

Application software may need to be developed from<br />

scratch, depending on the underlying APIs and the nature of<br />

existing applications. At a minimum, applications will require<br />

porting from existing mobile platforms, considering different<br />

input methods, display geometry, hardware platforms and safety<br />

requirements.<br />

Software to be deployed in cars is subject to strict engineering processes - there is likely to be significant additional<br />

software test work associated with the deployment of these<br />

features. Similarly, all software handling financial transactions<br />

is subject to regulation. This may lead to additional certification<br />

work.<br />

Note that it is common to deploy software in a virtualised<br />

environment within an IVI system. This may imply an additional<br />

level of system validation - at both OS and fully integrated<br />

system levels.<br />

458


Where open source software supporting payment use cases<br />

is deployed in an IVI system, there are also additional policy and<br />

procedural aspects of the software to consider:<br />

• Is the license suitable and acceptable?<br />

• How will 3rd party changes to the software be handled<br />

(down streaming)?<br />

• Will changes to the software deployed in the IVI system<br />

be made available to 3rd parties (upstreaming)?<br />

IV. PUMP TECHNOLOGY<br />

For the connected car pay at pump use case, we also need to<br />

consider the necessary changes to the fuel pump. It is already<br />

commonplace to allow a user to pay for their fuel at the pump,<br />

using a payment card. In this regard, the pump includes the<br />

hardware and software of standalone payment terminals. These<br />

terminals frequently include an embedded OS, such as Windows<br />

or Linux - this implies that the pump is designed in such a way<br />

to make software modification relatively simple.<br />

Generic payment terminals usually permit chip, magnetic<br />

strip and contactless modes of payment. Existing Pay at Pump<br />

solutions are mostly based on chip and PIN; some form of<br />

enhanced contactless payment must be added to the pump. As<br />

the payment terminals are based on embedded OSs, the<br />

necessary protocols and hardware abstractions are already<br />

available. The addition of hardware support for contactless<br />

payment is similar to the IVI scenario.<br />

Assuming that a wireless protocol is to be deployed for ease<br />

of use, the key decision to be made is the transport for<br />

communication between IVI system and pump. The following<br />

table summarises available transports and their suitability for<br />

communications between car, pump and local network.<br />

Transport | IVI / Pump | Pump / Network<br />

NFC | N | N<br />

BLE | Y | N<br />

WiFi | Y | Y<br />

Camera | Y | N<br />

Fig. 3. Table of Connected Car Pay at Pump transports<br />

The table shows that only WiFi can support all of the required communication channels. It would be possible to use BLE or camera based solutions for IVI to pump communication and WiFi for pump to local network communication, but embedded devices are typically resource limited; it may not be practical to deploy multiple transports in the pump. Minimising software “footprint” is a common design goal.<br />

It may be tempting to add other forms of authentication to the pump, such as facial or fingerprint recognition. However, there are several drawbacks to this:<br />

• All require additional hardware, such as fingerprint sensors or cameras. These would add to the software “footprint” and the overall BOM cost of the pump. In the case of cameras, it may be possible to make use of camera technology already deployed at fuel stations for security purposes;<br />

• The hardware will have many users and will be exposed to the elements; it may be easily damaged;<br />

• The human element of security in financial transactions has been demonstrated to be the most easily compromised.<br />

V. SAFETY<br />

Software intended for deployment in cars is developed according to standardised production processes (e.g. ISO26262), coding guidelines (e.g. MISRA) and risk classification (e.g. ASIL). The primary reason for this is driver, passenger and road user safety. Engineering and quality processes are also employed to identify defects early. The cost of fixing defects late in the product lifecycle is very high in the automotive sector, where vehicle recalls can cost many millions of Euros. Early defect resolution is a key goal.<br />

In general, safety standards are applied to driver systems such as Engine Control Units (ECU), Instrument Clusters (IC), Advanced Driver Assist Systems (ADAS) and autonomous driving. IVI systems are not typically subject to the same safety standards. The main safety issue for IVI systems is one of driver distraction. There are clear regulations in this domain, which are addressed at the specification stage - no additional process requirements are made of the development process.<br />

Increasingly, clusters are being integrated with IVI systems, running on the same SoC. The most common approach in such a scenario is to isolate safety critical and non-safety critical functions using OS virtualisation, based on hypervisors. An example architecture is illustrated in Figure 4, below.<br />



Fig. 4. IVI and Cluster Virtualisation Architecture<br />

The safety issue for an IVI system in this architecture is one<br />

of shared hardware access. The safety critical system demands a<br />

guaranteed minimum access to underlying hardware. Where the<br />

hardware request is in conflict with an IVI system request, IVI<br />

performance may be compromised in favour of the safety critical<br />

system. An example of this is the use of the SoC’s GPU, which<br />

will be used to render both speedometer and payment<br />

applications simultaneously. In the pay at pump use case, this is<br />

unlikely to be an issue - the performance demands of the IVI<br />

system are not great and the speedometer will not be in use as<br />

the car is stationary!<br />

For future developments, this issue cannot be ignored:<br />

developments in the automotive industry are likely to introduce<br />

other conflicts. Cameras are increasingly being included in cars<br />

for driver assist (ADAS) applications - such cameras may also<br />

be used for IVI applications, including payment authentication.<br />

It is likely that this trend for “shared” hardware in the context of<br />

the car will continue.<br />

VI.<br />

SECURITY<br />

In the same way that automotive sector has safety at the heart<br />

of its technology, the financial sector focuses on security - fraud<br />

prevention and data security are crucial to the success of these<br />

businesses.<br />

In the context of software development, safety and security<br />

bear some similarities. Both aspects:<br />

● Aim to prevent damage or loss to individuals;<br />

● Are subject to standards and regulation;<br />

● Are implemented using technical and process measures.<br />

This means that industries experienced in the<br />

implementation of safety critical software should be able to<br />

adapt to the development of secure software (and vice versa).<br />

The main standards deployed for the development of<br />

financial software include PCI DSS and EMV. These standards<br />

are primarily concerned with specifying how the devices and<br />

processes should work, rather than how they are developed. PCI<br />

DSS describes information security; EMV describes how<br />

payment devices work and ensures compatibility across<br />

payment providers.<br />

Other important technologies and techniques in the<br />

implementation of financial software include:<br />

● Card tokenisation - for the substitution of sensitive data with non-sensitive data, minimising the handling of secure data;<br />
● Single-use keys - for the one-time encryption and decryption of data, minimising the risk associated with key loss or theft;<br />
● Encryption - for the conversion of data (in storage and during transmission) between readable and non-readable formats, preventing the use of fraudulently acquired data.<br />
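To make the first of these concrete, card tokenisation can be sketched in a few lines. This is a toy in-memory vault for illustration only; the class and its methods are hypothetical, and production systems use certified PCI token service providers backed by hardware security modules.

```python
import secrets

class TokenVault:
    """Illustrative token vault: substitutes a card number (PAN) with a
    random surrogate so downstream systems never handle the real PAN."""

    def __init__(self):
        self._token_to_pan = {}

    def tokenise(self, pan: str) -> str:
        # Non-sensitive surrogate: random digits, same length as the PAN.
        token = "".join(secrets.choice("0123456789") for _ in range(len(pan)))
        self._token_to_pan[token] = pan
        return token

    def detokenise(self, token: str) -> str:
        # Only the vault, inside the secure perimeter, can map back.
        return self._token_to_pan[token]

vault = TokenVault()
token = vault.tokenise("4111111111111111")
assert vault.detokenise(token) == "4111111111111111"
```

The merchant-facing side of the system stores and transmits only the token; a breach of those systems therefore yields no usable card data.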

These techniques are based on the implementation of<br />

algorithms; in modern SoC applications, these algorithms are<br />

frequently executed on a DSP or GPU rather than the host CPU.<br />

Although mobile payment applications may currently host such<br />

algorithms on CPUs, it may be advantageous to move these<br />

algorithms to a co-processor in an IVI system. Because the SoC is<br />
running many functions, performance enhancements may be<br />
achieved by sharing the load between processors in this way.<br />

OS deployments in IVI systems currently include security<br />

features associated with the OS itself - for example data caging<br />

and cryptography. Other related features provided by third<br />
parties are readily integrated, such as the Linux SMACK<br />
module or TrustZone-specific device drivers.<br />

The introduction of payment related features into cars may<br />

make the car a target for malicious hackers. The creation of the<br />

pay at pump use case introduces an additional point of failure<br />

into the payment chain in the form of the car itself. A focus of<br />

payment technology is to reduce such vulnerabilities; a full<br />

security assessment will be a precursor to the development and<br />

deployment of such technology. Although penetration testing is<br />

prevalent in automotive software, it is by no means mandatory.<br />

If this threat is realised, it is likely that penetration testing will<br />

become a required part of the development process.<br />

As vulnerabilities are identified and corrected, an effective<br />

way of deploying software updates to consumer owned vehicles<br />

will be required. Such mechanisms are used on all mobile<br />
platforms today, and are how mobile payment software is<br />
updated. Similar mechanisms are<br />
available for IVI systems today, but they lack "immediacy";<br />
updates are typically done by dealerships, not pushed out to<br />
users when available. There are no technical barriers to the use<br />
of such methods in IVI systems: the challenge is being able to<br />
guarantee that Over The Air (OTA) updates do not cause issues<br />
for users, which would result in costly recalls.<br />



VII. CONCLUSION<br />

All of the necessary software and hardware components and<br />

processes for the creation of the connected car pay at pump use<br />

case are currently available in some form. The creation of a<br />

commercially viable, fully integrated solution is the next step.<br />

There are no technical or regulatory barriers to doing so.<br />

This use case is merely an example of the possibilities of the<br />

concept. There are myriad other potential use cases:<br />

● Road tolls<br />
● Parking<br />
● Car taxation<br />
● Servicing<br />
● Regulatory periodic vehicle testing<br />
● Drive-through restaurants<br />
● Car rental<br />
● Insurance<br />
● In-car entertainment<br />

Autonomous driving extends potential use cases further;<br />

many use cases implemented for non-autonomous driving may<br />

also require modification when implemented in autonomous<br />

environments.<br />

Many of the technology vendors working in the automotive<br />

and financial technology sectors are already working on proofs<br />

of concept integrating payments into cars. This includes:<br />

● semiconductor vendors<br />
● automotive OEMs and tier 1s<br />
● platform and OS providers<br />
● payment providers<br />

It is anticipated that the overlap of the Automotive and<br />
FinTech sectors will come to fruition in 2018, as IVI systems<br />
mature to provide a more generic (if not open) platform on which<br />
3rd party developers can provide innovative new software and<br />
services. The IVI system is at the heart of the connected car; in<br />
the same way that the mobile market rapidly expanded as mobile<br />
platforms matured, so too will the connected car market. The<br />
majority of large businesses now have distinct digital and mobile<br />
strategies for selling and deploying their products and services.<br />
Within a few years, it is likely that many of these will also have<br />
a distinct connected car strategy. Such strategies will, as a matter<br />
of course, include an element of monetisation; this requires<br />
integrated payment solutions. Will mPayment technology spawn<br />
a sub-branch to support connected cars - cPayments?<br />
The rise of the FinTech companies, challenger banks and<br />
changes to regulation of the financial services market in the EU<br />
have all stimulated significant innovation. In particular, the<br />
Payment Services Directive (PSD2) is allowing smaller<br />
financial organisations to provide services competitively. PSD2<br />
is focused on electronic payments and will therefore extend the<br />
use of such payments further. PSD2 compels banks to allow 3rd<br />
parties to extend electronic payment options - for example to<br />
assign a payment card to the bank account or to initiate a money<br />
transfer from the account. The regulation changes are also<br />
allowing finance companies to sell each other's products and<br />
services, broadening competition and creating niches in which<br />
the smaller FinTech companies can innovate. This is changing<br />
the world of mobile payment technology. It is also likely to have<br />
the same impact on connected car payments. Further, the new<br />
FinTech services being developed for deployment on mobile<br />
will also be deployed in an automotive context.<br />
In the embedded and automotive domains, the most<br />
significant opportunities for innovation in this area arise from<br />
the integration of IoT solutions. Sensors being added to cars and<br />
the smart cities in which they drive provide a plethora of<br />
potential new services, such as smart parking and motor<br />
insurance. These services will require payment solutions - users<br />
will increasingly expect to be able to make such payments from<br />
within their cars.<br />
VIII. REFERENCES<br />
[1] Mobile & NFC Council, "Host Card Emulation (HCE)," Smart Card Alliance, June 18, 2015<br />
[2] Tod E. Kurt, "NFC & RFID on Android," ThingM, 2011, https://www.slideshare.net/todbotdotcom/nfc-rfid-on-android<br />
[3] Visa, "The Connected Car: Visa Looks Ahead," March 2015, https://usa.visa.com/visa-everywhere/innovation/visa-connectedcar.html<br />
[4] Chris Giordano & Jim Carroll, DiSTI & Mobica, "Hardware Convergence & Functional Safety: Optimal Design Methods in Today's Automotive Digital Instrument Clusters," June 2016, https://www.disti.com/hardwareconvergence-functional-safety-whitepaper/<br />
[5] MISRA, https://www.misra.org.uk/<br />
[6] National Instruments, "What is the ISO 26262 Functional Safety Standard?," April 2014, http://www.ni.com/white-paper/13647/en/<br />
[7] EMVCo, https://www.emvco.com/<br />


ATM Protection Using<br />

Embedded Machine Learning Solutions<br />

Antonio Rizzo, Francesco Montefoschi,<br />

Alessandro Rossi, Maurizio Caporali<br />

University of Siena<br />

Siena, Italy<br />

antonio.rizzo@unisi.it, francesco.montefoschi@unisi.it,<br />

alessandro.rossi2@unisi.it, maurizio.caporali@unisi.it<br />

Antonio J. Peña, Marc Jorda<br />

Barcelona Supercomputing Center (BSC)<br />

Barcelona, Spain<br />

antonio.pena@bsc.es, marc.jorda@bsc.es<br />

Gianluca Venere<br />

SECO Srl<br />

Arezzo, Italy<br />

gianluca.venere@seco.com<br />

Carlo Festucci<br />

Monte dei Paschi di Siena<br />

Siena, Italy<br />

carlo.festucci@mps.it<br />

Abstract— ATMs are an easy target for fraud attacks, like<br />

card skimming/trapping, cash trapping, malware and physical<br />

attacks. Attacks based on explosives are a rising problem in<br />

Europe and many other parts of the world. A report from the<br />

EAST association shows an 80% rise in such attacks between<br />
the first six months of 2015 and the same period of 2016. This trend is particularly<br />

worrying, not only for the stolen cash, but also for the significant<br />

collateral damages to buildings and equipment [1].<br />

We developed a video surveillance application based on Intel<br />

RealSense depth cameras that can run on Seco’s A80 Single<br />

Board Computer. The camera can be embedded in the ATM’s<br />

chassis, focusing on the area under the screen, where explosive-based<br />
attacks begin. The use of depth cameras avoids privacy-related<br />
regulatory issues. The computer vision analysis rests on<br />

Machine Learning algorithms. We designed a model based on<br />

Convolutional Neural Networks able to discriminate between<br />

regular ATM usage and breaking attempts. The dataset has been<br />

built by recording and tagging depth videos where different<br />

people stage withdrawals and attacks on a retired ATM,<br />

replicating the actions the thieves do, thanks to the knowledge of<br />

the Security Department of the Monte dei Paschi di Siena Bank.<br />

The results show that the implemented architecture is able to<br />

classify depth data in real-time on an embedded system, detecting<br />

all the test attacks in a few seconds.<br />

Keywords— Bank Security; Machine Learning; Convolutional<br />

Neural Networks; Computer Vision; Intel RealSense; Single Board<br />

Computer<br />

I. INTRODUCTION<br />

In recent years, global digitalisation and the<br />
consolidation of information technologies have markedly changed our<br />
daily life and the way we interact, both at the local and<br />
global level. This digital revolution is also changing how users<br />
access banks and financial services, turning a relationship<br />
based on personal trust into a mainly online service<br />
with sporadic human interaction. This shift, and the<br />
resulting change in bank branch structure, naturally affects<br />
criminal behaviour in this environment. International<br />
sector studies [2] show that, although the use of explosives<br />
and other physical attacks continues to spread, in the long term<br />
attacks will shift towards cyber and logical approaches. In<br />

by seven countries in Europe during the year 2017.<br />

Moreover, statistics from ABI (the Italian Banking<br />
Association) show a marked increase in attacks on ATMs<br />
alongside a reduction in bank branch robberies. This is<br />
due both to the juridical categorisation of the committed crime<br />
and to the lower amount of money that can be stolen in a<br />



robbery. Indeed, security systems are generally concentrated<br />

on the branch rather than on the ATM area, which is usually<br />

located outside of the buildings. This also allows perpetrators<br />

to perform their assaults at night. An important<br />
issue to consider about these attacks is their collateral<br />
effects: the violence involved often<br />
leads to serious physical damage to buildings and objects in the<br />
neighbourhood of the targeted area, such as cars - and that is the<br />
best-case scenario, in which no human is involved.<br />
Given these premises, it is clearly fundamental to<br />
develop technologies capable of preventing this<br />
kind of situation. Crucial features of such a system are a low<br />
false-alarm rate and promptness in detecting the<br />
potential risk, both to alert the relevant control systems and,<br />
in the first place, to automatically discourage the<br />
ongoing criminal action with deterrents.<br />

In this paper we propose ATMSense, an automatic<br />

surveillance system based on video stream analysis of depth<br />

frames. This approach allows the actions performed in front of<br />
the ATM to be analysed in real time, while preserving the privacy of<br />
customers. Depth images are processed by a Machine Learning<br />
algorithm in order to predict the nature of the ongoing situation.<br />
Although the tests were performed on data recorded in our<br />
laboratory, the quality of the obtained results lays the<br />
groundwork for in-depth experimentation in the field.<br />

II. RELATED WORKS<br />

A. Video Surveillance<br />

Recent advances in Deep Learning techniques, and in<br />
particular in approaches dedicated to Computer Vision<br />
[3][4], have led to cutting-edge improvements in Image and Video<br />
Analysis algorithms. Although methodologies for Video<br />
Surveillance and, more generally, for Action Recognition [5]<br />
based on other approaches have been investigated in the past,<br />
achieving good results in restricted scenarios, Deep<br />
Learning methods now provide state-of-the-art results, at<br />
least in the short term. Taking these results into account, and the<br />
possibility of fast and portable prototyping of such algorithms,<br />
it seems reasonable to follow this direction, towards<br />
technologies that should become even more widespread and<br />
consolidated in the future. Moreover, such approaches should<br />
also scale directly when facing new kinds of<br />
specific situations and types of attack.<br />

B. ATMs Protection<br />

As ATMs came to play a central role in customer<br />
services, many systems have been developed to improve<br />
the security of these interactions. Systems designed to<br />
deal with identity theft [6][7][8], interactions with forged<br />
documents and certificates [9], and the detection of various<br />
specific dangerous situations [10][11] have been developed<br />
through the investigation and integration of various<br />
hardware devices. However, the most common approach is<br />
analysis by surveillance cameras, trying to recognise the<br />
actions that characterise a potentially critical scenario [13]. In other<br />
cases, more specific systems have been oriented towards face<br />
detection and tracking [14] or to the recognition of partially<br />
occluded faces and bodies [15][16].<br />

Fig. 1. ATMSense uses a depth camera connected to a Single Board<br />

Computer to analyse the surroundings of an ATM.<br />

In our approach, we turn to a relatively new technology,<br />
image analysis through depth cameras, which is, to<br />
the best of our knowledge, unexplored in this context. This allows us to<br />
combine the representational capabilities of video processing with the<br />
need for customer privacy protection, for both ethical and<br />
legal reasons.<br />

III. ATMSENSE<br />

ATMSense is intended to discriminate people's behaviour<br />

exhibited in front of an ATM, in order to detect risky situations<br />

at an early stage. The sensor used to analyse the scene is the<br />

Intel RealSense depth camera. Using the depth image instead<br />

of the RGB one provides great advantages: we can avoid<br />

dealing with personal data and privacy issues; the image is<br />

unaffected by lighting conditions; from a computational point<br />

of view, we can rely on a slight improvement by reducing the<br />

input channels from three to one. Depth images are processed<br />

on a Single Board Computer (Seco A80) with image<br />

processing techniques and Convolutional Neural Networks.<br />

A. Intel RealSense<br />

Intel RealSense is a family of depth cameras providing<br />
several video streams: RGB, depth and infrared.<br />

ATMSense is compatible with two camera models. The<br />

short-range RealSense SR300 can be placed in the ATM<br />

chassis, focused on the ATM keyboard area. The long-range<br />
RealSense R200 camera is intended to be placed above the<br />
ATM, covering the whole scene of interest. As stated in the<br />
Results section, the performance is similar for both cameras.<br />
The short-range camera is suited to being embedded in new ATMs,<br />
while the long-range camera fits better as an external add-on for already<br />
installed ATMs.<br />

Whichever camera is used, the depth video stream is used<br />

to classify what is going on in the ATM area. For debugging<br />

purposes RGB streams can be collected, but they are used<br />
neither for training the system nor at runtime.<br />



Fig. 3. On the left, a frame from the Intel RealSense R200. On the<br />
right, the same frame after preprocessing to reduce the noise and subtract<br />
the background.<br />

Fig. 2. Seco A80 Single Board Computer.<br />

In fact, relying on the RGB stream would create a<br />
dependency on factors, such as lighting conditions, that we<br />
want to avoid. Moreover, dealing with faces and other<br />
personal images can be an issue under privacy laws. Having<br />
only a low-resolution shape of the person does not allow<br />
personal identification.<br />

B. Seco A80<br />

Seco A80 [17] (depicted in Figure 2) is a low power Single<br />

Board Computer based on the Intel Braswell CPU family, up<br />

to the quad-core Intel Pentium N3710. RAM is<br />
modular, with two DDR3L SO-DIMM slots. The board<br />
offers standard desktop connectivity: USB 3.0 ports, HDMI<br />
output, M.2 for SSDs and Gigabit Ethernet ports.<br />
By providing standard UEFI firmware, it runs<br />
mainstream x86 operating systems. Our tests were done on<br />

Ubuntu 16.04, although any modern Linux distribution<br />

providing Python 2.7 can be used.<br />

C. Image Processing<br />

Depth images collected from the cameras are preprocessed<br />

before the classification. In this phase we want to remove both<br />

the noise and the background objects. The noise is intrinsic in<br />

the camera sensor and is reduced using a cascade of standard<br />

image processing filters (i.e. median filtering, erosion, depth<br />

clipping and so on). This technique leads to the generation of<br />

one video frame starting from 5 frames read from the depth<br />

camera. Although the dynamics of the system scales down<br />

from 30 fps to 6 fps, the information necessary to classify the<br />

images is preserved. The background suppression is related to<br />

the environment in which the ATM is located, and includes<br />

the device itself. The background is subtracted (using kNN-based<br />
techniques), making the solution independent of the<br />
particular ATM machine and environment. Moreover, in order to<br />
improve the generalization capabilities of learning algorithms,<br />
it is better to provide only the necessary information.<br />

Fig. 4. Intel RealSense SR300 frames are less noisy. Background<br />

information is removed from the right image.<br />

The difference between the original image read from the<br />

camera and the cleaned version is visible in Figure 3 (Intel<br />

R200) and Figure 4 (Intel SR300).<br />
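The filter cascade above can be sketched as follows. This is a NumPy approximation for illustration: the temporal median stands in for the paper's filter chain, the simple thresholded difference stands in for the kNN background subtractor, and the clip range and threshold values are illustrative, not the authors' settings.

```python
import numpy as np

def preprocess(frames, background, clip_mm=(400, 3000), bg_thresh_mm=60):
    """Collapse a window of noisy depth frames (5, as in the paper)
    into one cleaned frame with the background removed."""
    stack = np.stack(frames).astype(np.float32)
    # Temporal median suppresses the sensor's speckle noise.
    frame = np.median(stack, axis=0)
    # Depth clipping discards readings outside the region of interest.
    frame = np.clip(frame, *clip_mm)
    # Background subtraction: keep only pixels that differ clearly
    # from the profiled background.
    mask = np.abs(frame - background) > bg_thresh_mm
    return np.where(mask, frame, 0.0)

# Toy demonstration: a static background plus a "person" in 5 noisy frames.
background = np.full((240, 320), 2000.0)     # flat scene at 2 m
person = background.copy()
person[60:180, 100:220] = 900.0              # nearer object in front of the ATM
noisy = [person + np.random.normal(0, 5, person.shape) for _ in range(5)]
clean = preprocess(noisy, background)
assert clean[120, 160] > 0                   # person region survives
assert clean[10, 10] == 0                    # background is suppressed
```

Collapsing every 5 frames into 1 is what reduces the effective rate from 30 fps to 6 fps while keeping the information needed for classification.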

D. Convolutional Neural Networks<br />

Once we get a cleaned stream from the camera, we need to<br />

perform computations needed to predict the state of the current<br />

scene. As already said, the algorithmic approach relies on<br />

Deep Learning techniques. In particular, Convolutional Neural<br />

Networks (CNNs) represent the state-of-the-art in almost all<br />

Computer Vision applications, such as Image Segmentation and<br />
Classification, Object Detection and Recognition. This kind of<br />
architecture is biologically inspired by the human visual<br />
system [18], and its characterising property is expressed<br />
through the concept of the receptive field. These elements are a<br />
sort of pattern detector, used to generate internal<br />
feature maps representing the presence of specific shapes in<br />
each region of the image. This process is reiterated<br />
throughout several layers (see Figure 5) to come up with a<br />
numerical 1-D vector by iteratively performing dimensionality<br />
reduction (Max-Pooling) and producing an encoding of the<br />
original image. Hence, the obtained representation can be fed<br />
to a standard Artificial Neural Network (ANN) classifier<br />
which performs the desired predictions. This<br />
composition provides high representational capability, a<br />
relatively simple training procedure (derived<br />
straightforwardly from the standard Back-Propagation<br />
algorithm), and a weight-sharing policy between hidden units<br />
that reduces the computational cost.<br />

However, the large number of parameters (of the order of<br />
tens of millions) of such algorithms requires a correspondingly<br />
large dataset to achieve effective training, leading to<br />
accurate and general predictions.<br />
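The receptive-field idea can be made concrete with a minimal NumPy sketch. The single hand-written edge filter below is purely illustrative; it is not a filter from the trained network, whose weights are learned by back-propagation.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: each output pixel is the response of one
    receptive field, i.e. a local pattern detector slid over the image."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Max-pooling: keep the strongest response in each block,
    halving the spatial resolution (the dimensionality-reduction step)."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3] = 1.0                            # a vertical edge
kernel = np.array([[1.0, -1.0]])             # responds to horizontal change
fmap = np.maximum(conv2d(image, kernel), 0)  # ReLU activation
pooled = max_pool(fmap)
assert pooled.shape == (3, 2)
```

Stacking several such convolution + ReLU + pooling stages, then flattening the final feature maps into a 1-D vector for a dense classifier, gives exactly the layer composition described above.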



TABLE II. FRAME BY FRAME CLASSIFICATION ACCURACY<br />
Camera | Sequence dim. | Withdrawal | Attack | Average<br />
SR300 | 1 | 92.91% | 94.30% | 93.60%<br />
SR300 | 5 | 92.69% | 92.50% | 92.59%<br />
SR300 | 10 | 92.86% | 94.48% | 93.67%<br />
R200 | 1 | 90.78% | 90.75% | 90.76%<br />
R200 | 5 | 92.18% | 92.46% | 92.32%<br />
R200 | 10 | 91.16% | 92.19% | 91.67%<br />

Fig. 5. Convolutional Neural Network architecture.<br />

IV. EXPERIMENTAL SETTING<br />

In order to collect the required data, we reproduced in our<br />

laboratory the real working environment by installing<br />

ATMSense on a decommissioned ATM provided by the Monte dei<br />
Paschi di Siena Bank. As a prototype, we taped an Intel<br />
RealSense SR300 to the ATM frame, and we installed the<br />
R200 camera on a support above the ATM. With<br />
both cameras connected, we recorded 132 depth videos<br />
simulating both the withdrawal and the attack scenarios,<br />
representing the two classes to be discriminated by the classifier.<br />
To improve variability and generalisation, these videos have<br />
been staged by several actors in different sessions, using<br />

different light conditions (which only slightly affect the<br />

acquired images). Videos have been manually labelled at the<br />

single frame level. Background profiling has been carried out<br />

by recording 25 videos without any kind of interaction with<br />

the ATM.<br />

A. CNN Training<br />

In the training phase, pre-processed videos (as stated in<br />

section III.C) are split between Train and Test sets as reported in<br />
Table I. The dataset is then generated by separating and<br />
shuffling sequences of consecutive frames together with the<br />
corresponding labels. In this way we obtained about 250,000<br />

and 30,000 labelled samples for training and test respectively.<br />
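The sample-generation step can be sketched as follows. Labelling each sequence by its final frame is an assumption for illustration; the paper does not state the exact rule used to derive a sequence's label from the per-frame tags.

```python
import random

def make_samples(video, labels, seq_len):
    """Cut one labelled video into overlapping sequences of consecutive
    frames; here each sample inherits the label of its final frame
    (an assumption - the paper does not state the exact rule)."""
    return [(video[i:i + seq_len], labels[i + seq_len - 1])
            for i in range(len(video) - seq_len + 1)]

video = [f"frame{i}" for i in range(10)]
labels = ["withdrawal"] * 6 + ["attack"] * 4    # per-frame manual tags
samples = make_samples(video, labels, seq_len=5)
random.shuffle(samples)                         # shuffle before training
assert len(samples) == 6
```

Applying this to all 132 recorded videos, with the sequence lengths of 1, 5 and 10 used in the experiments, yields the roughly 250,000 training and 30,000 test samples reported above.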

The training phase has been performed within the Keras<br />

framework using the TensorFlow backend. This enables an<br />

easy implementation capable of exploiting the multi-GPU<br />

cluster (provided by Barcelona Supercomputing Center). Since<br />

this process requires several hours to complete, and<br />
considerable trial-and-error testing was necessary to find<br />
the best hyper-parameters and network configurations, we also<br />
investigated a few settings related to<br />
computational issues. In practice, preliminary tuning of a<br />
few variables (e.g. the mini-batch size of the network forward<br />
step) halved the execution time of the training<br />
phase. With this tuning, the overall train-validation-test<br />

process has been accelerated by a scaling factor of 1.86 while<br />

maintaining the same accuracy.<br />


V. RESULTS<br />

Different CNN architectures have been tested, but we report<br />
only the results of the best one, composed of three convolutional<br />
layers, with ReLU non-linear activations and Max-Pooling<br />
for dimensionality reduction. The fully connected<br />
classification layer is composed of 256 hidden units. All the<br />
architectures have been tested on different datasets, generated<br />
using different lengths of the input sequences. The<br />
classification accuracies of the best networks are reported in<br />
Table II.<br />

Since the predictions are, in practice, not perfect, we added<br />
an additional layer to refine the operational performance.<br />
This layer determines whether to raise an alarm, based on<br />
majority voting over a buffer of recent network predictions (of<br />
length varying from 10 to 20 elements): an alarm is<br />
raised only if more than 95% of the last predictions are<br />
classified as attacks. This allows each<br />
video in the test set to be classified correctly in a more realistic<br />
scenario. We can find many configurations in which no false<br />
alarm is raised on withdrawal videos while, at the same time, all<br />
the attacks are detected. In Table III we report statistics on the<br />
detection time with respect to the beginning of an assault. For<br />
brevity, we only report the best case for each sequence length.<br />
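The voting layer can be sketched in a few lines. The class name and the choice to stay silent until the buffer is full are assumptions for illustration; the paper specifies only the buffer length range and the 95% threshold.

```python
from collections import deque

class AlarmFilter:
    """Majority-voting layer over the last `size` CNN predictions:
    raise an alarm only when more than 95% of them are 'attack'."""

    def __init__(self, size=15, threshold=0.95):
        self.buffer = deque(maxlen=size)
        self.threshold = threshold

    def update(self, prediction: str) -> bool:
        self.buffer.append(prediction)
        if len(self.buffer) < self.buffer.maxlen:
            return False                      # assumed: wait for a full buffer
        attacks = sum(p == "attack" for p in self.buffer)
        return attacks / len(self.buffer) > self.threshold

f = AlarmFilter(size=10)
for _ in range(9):
    assert not f.update("attack")             # buffer not yet full
assert f.update("attack")                     # 10/10 attacks -> alarm
```

At 6 classified frames per second, a 10-to-20-element buffer corresponds to roughly 1.7 to 3.3 seconds of evidence, which is consistent with the detection times in Table III.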

The reported detection times are acceptable for a real<br />
situation, since a potential attack can be detected in a few<br />
seconds, giving the Surveillance Control Room enough time to<br />
analyse the scene and, possibly, take dissuasive action or call<br />
security. From a practical point of view, the additional layer<br />
that filters the network's predictions by majority voting is<br />
fundamental to reaching the final results. We can also observe<br />

that feeding the classifier with a sequence of frames (5 or 10 in<br />
our tests) rather than a single frame does not lead to a<br />
remarkable improvement; in the end, this choice only<br />
delays the system's response. This may be because<br />
the scene understanding task is collapsed to a two-class<br />
classification problem. It also seems reasonable, from an<br />
external point of view, that a human could<br />
decide from a single picture of the scene whether an assault is<br />
taking place or not.<br />
TABLE I. NUMBER OF VIDEOS USED FOR CNN TRAINING<br />
Set | Withdrawal | Attack<br />
Train | 54 | 42<br />
Test | 18 | 18<br />
Total | 72 | 60<br />
TABLE III. ASSAULT DETECTION TIMES<br />
Camera | Seq. dim. | Avg (sec) | Min (sec) | Max (sec)<br />
SR300 | 1 | 3.00 | 2.50 | 5.16<br />
SR300 | 5 | 2.69 | 2.33 | 5.50<br />
SR300 | 10 | 4.25 | 4.00 | 7.00<br />
R200 | 1 | 1.89 | 1.66 | 3.33<br />
R200 | 5 | 4.45 | 4.00 | 10.00<br />
R200 | 10 | 4.27 | 4.00 | 8.00<br />

A. Real-Time Classification<br />

After the training, we tested the real-time performance on<br />

a Seco A80 SBC. Having relatively little computing power<br />
available, the application creates separate threads to<br />
parallelise the computation. The first handles the USB<br />
connection with the Intel RealSense camera and stores the<br />
incoming video frames in a buffer; another preprocesses<br />
the incoming frames, subtracting the background and reducing<br />
the noise; the last classifies the image.<br />
The A80 SBC can execute all the computation in real time.<br />
The heaviest threads are the image preprocessor, which runs in<br />
23 ms, and the CNN classifier, which runs in 17 ms.<br />
Considering that we need to classify 6 frames per second, the<br />
available computational power is more than enough for real-<br />
time operation.<br />
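The three-thread design can be sketched with Python's standard library. The stage bodies below are placeholders for the real camera read, filter cascade and CNN inference; the queue sizes and sentinel mechanism are illustrative choices, not details from the paper.

```python
import queue
import threading

# Three-stage pipeline mirroring the description above:
# capture -> preprocess -> classify, decoupled by bounded queues.
raw_q = queue.Queue(maxsize=30)
clean_q = queue.Queue(maxsize=30)
results = []
STOP = object()                               # sentinel to shut the pipeline down

def capture(n_frames):
    for i in range(n_frames):
        raw_q.put(f"frame{i}")                # stand-in for a USB camera read
    raw_q.put(STOP)

def preprocess():
    while (frame := raw_q.get()) is not STOP:
        clean_q.put(frame + ":clean")         # stand-in for denoise + bg-subtract
    clean_q.put(STOP)

def classify():
    while (frame := clean_q.get()) is not STOP:
        results.append((frame, "withdrawal"))  # stand-in for the CNN forward pass

threads = [threading.Thread(target=capture, args=(6,)),
           threading.Thread(target=preprocess),
           threading.Thread(target=classify)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(results) == 6
```

Because the stages run concurrently, the per-frame budget is set by the slowest stage (23 ms for preprocessing) rather than the sum of all stages, comfortably under the 167 ms available at 6 fps.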

VI. CONCLUSION<br />

In this work we propose an application of Automatic<br />
Video Analysis to improve the surveillance and security of<br />
ATMs. In laboratory tests, the system detects attacks<br />
very quickly, both when the depth camera is integrated into<br />
the ATM itself and when it is installed nearby. Moreover, the<br />
approach employs off-the-shelf technologies whose total cost<br />
is low when compared with the cost of an ATM<br />
or with the potential financial and collateral damage. The<br />
software solution is general, although<br />
additional data collection and a re-training phase will be<br />
necessary depending on the particular needs of specific situations.<br />
Although the current solution is customised for a single<br />
mode of assault, the obtained results have allowed us to schedule,<br />
in the short term, a more realistic experimentation phase in the field.<br />
Indeed, the very fast attack detection time will allow the<br />
Surveillance Control Room to intervene promptly. Moreover,<br />
the high accuracy reduces the possibility of false alarms.<br />

VII. FUTURE WORK<br />

Detection accuracy in a real-world scenario could be<br />
improved by collecting further data, enlarging the statistical<br />
variety of events analysed by the system. In general, more<br />
training data helps the CNN to generalise better, rather<br />
than over-fitting the training examples.<br />
The depth footage recorded for training is focused on<br />
explosive-based attacks. New videos could be recorded with<br />
a view to detecting additional kinds of ATM assault,<br />
providing more complete surveillance coverage.<br />
The downside of having more depth videos is the need to<br />
manually tag the frames. A complementary approach<br />
could be to introduce a Novelty Detection algorithm running<br />
in parallel with the CNN. As an example, the solution we<br />
proposed in [19] for bank branch Audio-Surveillance could be<br />
adapted to this scenario. This algorithm would be totally<br />
unsupervised, and capable of detecting any kind of anomaly<br />
arising from unexpected user behaviour. An arbiter<br />
would take the outputs of both algorithms as input and<br />
make the final decision.<br />

ACKNOWLEDGMENT<br />

This work was supported by Monte dei Paschi Bank grant<br />

DISPOC017/6. We thank NVIDIA for its support through the<br />
BSC/UPC NVIDIA GPU Center of Excellence. Antonio J.<br />
Peña is co-financed by the Spanish Ministry of Economy and<br />

Competitiveness under Juan de la Cierva fellowship number<br />

IJCI-2015-23266.<br />

REFERENCES<br />

[1] European Association for Secure Transactions: ATM Explosive Attacks<br />

surge in Europe, https://www.association-secure-transactions.eu/atmexplosive-attacks-surge-in-europe/,<br />

2016<br />

[2] European Association for Secure Transactions: EAST Publishes<br />

European Fraud Update 3-2017, https://www.association-securetransactions.eu/east-publishes-european-fraud-update-3-2017/,<br />

2017<br />

[3] J. Deng el al. “A large-scale hierarchical image database,” IEEE<br />

Conference on Computer Vision and Pattern Recognition (CVPR), pp.<br />

248-255, 2009<br />

[4] A. Krizhevsky et al., “Imagenet classification with deep convolutional<br />

neural networks”, Advances in neural information processing systems<br />

(NIPS), pp. 1097-1105, 2012<br />

[5] S. Herath et al., “Going deeper into action recognition: A survey,” Image<br />

and Vision Computing, pp. 4-21, 2017<br />

[6] F. Puente et al., “Improving online banking security with hardware<br />

devices,” 39th Annual International Carnahan Conference on Security<br />

Technology (CCST), pp. 174-177, 2005<br />

[7] H. Lasisi and A.A. Ajisafe, “Development of stripe biometric based<br />

fingerprint authentications systems in Automated Teller Machines,” 2nd<br />

International Conference on Advances in Computational Tools for<br />

Engineering Applications (ACTEA), pp. 172-175, 2012<br />

[8] R. AshokaRajan et al., “A novel approach for secure ATM transactions<br />

using fingerprint watermarking,” Fifth International Conference on<br />

Advanced Computing (ICoAC), pp. 547-552, 2013<br />

[9] H. Sako et al., “Self-defense-technologies for automated teller<br />

machines,” International Machine Vision and Image Processing<br />

Conference (IMVIP), pp. 177-184, 2007<br />

[10] M.M.E. Raj and A. Julian, “Design and implementation of anti-theft<br />

ATM machine using embedded systems,” International Conference on<br />

Circuit, Power and Computing Technologies (ICCPCT), pp. 1-5, 2015<br />

[11] S. Shriram et al., “Smart ATM surveillance system,” International<br />

Conference on Circuit, Power and Computing Technologies (ICCPCT),<br />

pp. 1-6, 2016<br />

[12] A. De Luca et al. , “Towards understanding ATM security: a field study<br />

of real world ATM use,” Proceedings of the sixth symposium on usable<br />

privacy and security, 2010<br />

[13] N. Ding et al. “Energy-based surveillance systems for ATM machines,”<br />

8th World Congress on Intelligent Control and Automation (WCICA),<br />

pp. 2880-2887, 2010<br />

[14] Y. Tang et al. “ATM intelligent surveillance based on omni-directional<br />

vision,” World Congress on Computer Science and Information<br />

Engineering (WRI), pp. 660-664, 2009<br />

[15] I-P. Chen et al., “Image processing based burglarproof system using<br />
silhouette image,” International Conference on Multimedia Technology<br />
(ICMT), pp. 6394-6397, 2011<br />

www.embedded-world.eu<br />

466


[16] X. Zhang, “A novel efficient method for abnormal face detection in<br />

ATM,” International Conference on Audio, Language and Image<br />

Processing (ICALIP), pp. 695-700, 2014<br />

[17] Seco SBC A80, http://www.seco.com/prods/it/sbc-a80-enuc.html, 2017<br />

[18] K. Fukushima, “Neocognitron: A self-organizing neural network model<br />

for a mechanism of pattern recognition unaffected by shift in position,”<br />

Biological Cybernetics, pp. 193-202, 1980<br />

[19] A. Rossi et al. “Auto-Associative Recurrent Neural Networks and Long<br />

Term Dependencies in Novelty Detection for Audio Surveillance<br />

Applications,” IOP Conference Series: Materials Science and<br />

Engineering, 2017<br />



Powering the Processor<br />

Basics of power conversion<br />

George Slama<br />

Wurth Electronics Midcom Inc.<br />

Watertown, SD USA<br />

george.slama@we-online.com<br />

Abstract— This paper covers the basics of power conversion, from<br />
linear regulators to switching converters: the non-isolated buck, boost and<br />
SEPIC converters used in low-voltage and battery systems, and the<br />
isolated off-line flyback converters commonly used in adapters. It<br />
introduces the major components, various control methods,<br />
compensation, ancillary circuits, safety and electromagnetic<br />
interference.<br />

Keywords—power; microprocessor; linear; switch-mode;<br />
inductor; capacitor; buck; boost; SEPIC; flyback; EMI; module<br />

I. INTRODUCTION<br />

For a microprocessor to perform any useful function, it needs<br />

clean regulated power from some type of power source. This<br />

might come from the wall receptacle or an energy storage<br />

device like a battery. The power source is usually variable and<br />

subject to harsh conditions. To begin with, the AC power from<br />

a wall receptacle has the wrong voltage and is of the wrong type<br />

– fluctuating between zero and as high as 375 volts peak,<br />

positive and negative, 50 or 60 times a second. Additionally, it<br />
is subject to transients from other connected devices and from<br />
lightning strikes. Batteries, on the other hand, may have the right<br />
type of voltage, DC, but they do not stay at a constant level: the<br />
voltage drops as energy is depleted, and it changes with<br />
temperature and load.<br />

Modern microprocessors have extremely fine features and<br />

therefore require precise voltages to operate without being<br />

damaged. Gone are the days of 15 V tolerant CMOS digital<br />

logic ICs! Today processor cores need 3.3 V, 1.8 V or even 1.2<br />

V. These voltages must be tightly regulated while at the same<br />

time current demand is dynamic – suddenly changing from<br />

quiescent to full load when the system goes from sleep mode to<br />

active.<br />

II. VOLTAGE REGULATION<br />
A. Linear regulators<br />

The simplest form of regulated voltage is to use a Zener<br />
diode. This is a solution only for low power, and where strict<br />
voltage levels are not required.<br />

To increase the precision (or tightness) of the regulation and<br />

to extend the power capacity, a linear regulator can be used.<br />

These consist of a precision voltage reference, an error<br />

amplifier and a semiconductor device acting as a controllable<br />

resistance. It’s an adjustable power voltage divider. In the past,<br />

they would be discrete circuits but today linear regulators come<br />

as complete integrated circuits with many additional features<br />

built in such as thermal shutdown. Though not as efficient as<br />

switching regulators, they are quiet from an electrical noise<br />

perspective, fast and inexpensive.<br />

Fig. 1. Linear regulator.<br />

The main problem is efficiency: they must dissipate, as heat,<br />
the product of the voltage difference between input and output<br />
and the load current. For a large input and a small output this<br />
can exceed the power actually delivered! Linear regulators<br />
are often used in two situations: one is to regulate a low-current,<br />
poorly regulated secondary or auxiliary output on a switching<br />
power supply; the other is where only small shifts in voltage are<br />
required, for instance a processor that runs on 3.3 V but gets<br />
its power from a 5 V USB connection, or generating voltages<br />
for sensors.<br />
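This dissipation arithmetic is easy to make concrete. The following Python sketch assumes an ideal series regulator whose pass element drops the full input-output difference at the load current; the function name and the quiescent-current parameter are illustrative, not from any datasheet.<br />

```python
def linreg_power(v_in, v_out, i_load, i_ground=0.0):
    """Dissipation and efficiency of an ideal series linear regulator.

    The pass element drops (v_in - v_out) at the full load current, so
    p_diss = (v_in - v_out) * i_load, plus v_in * i_ground for the
    regulator's own ground (quiescent) current when it matters.
    """
    if v_in < v_out:
        raise ValueError("a linear regulator can only step down")
    p_out = v_out * i_load
    p_diss = (v_in - v_out) * i_load + v_in * i_ground
    return p_diss, p_out / (p_out + p_diss)
```

Dropping a 5 V USB rail to 3.3 V at 0.5 A wastes 0.85 W for 66 % efficiency; dropping 12 V to 3.3 V at the same current wastes 4.35 W for only 27.5 %, which is why switching regulators take over at larger voltage differences.<br />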

In their integrated form, three terminal regulators come with<br />

either fixed output voltages or a pin where a resistive divider<br />

can set the output voltage. Different package styles determine<br />

the power handling ability. Over time, the minimum voltage<br />

drop between input and output has decreased from about 3 V to<br />

0.1 V as the pass transistor has been replaced by MOSFETs.<br />

The very low voltage units are called LDOs, for low drop-out.<br />

B. Switching regulators<br />

Switching converters are the heart of modern power<br />

conversion. The concept is based on chopping up a DC voltage<br />

into pulses, storing or converting the pulse energy in capacitors,<br />




inductors or a transformer and then passing those pulses<br />

through a filter, which averages them back to a steady voltage.<br />

Regulation comes from being able to set and adjust the pulse<br />

width automatically. If the pulse width were zero, the output<br />

would be zero. Similarly, at 50% the output could equal half the<br />

input voltage. If the load current increases and causes the output<br />

voltage to drop, the pulse width increases to compensate.<br />

Similarly, if the input voltage were to increase, the pulse width<br />

decreases to compensate. Thus, there is both input line and load<br />

regulation. Clever use of inductors and capacitors as energy<br />

storage devices can even boost the output voltage higher than<br />

the input.<br />

III. MAIN POWER COMPONENTS<br />

It is important to understand the components that make up a<br />

power supply and how their characteristics affect the design. A<br />

large part of the time spent designing is the selection of these<br />

parts because they affect the ultimate performance and cost of<br />

the unit. There are three main categories – capacitors, magnetics<br />

(inductors and transformers) and switches (including diodes).<br />

A. Capacitors<br />

Capacitors come in a large variety of types and styles. They<br />

are energy storage devices and therefore their capacity is size<br />

and material dependent. The two most obvious characteristics<br />

are capacitance value and voltage rating. Additional<br />
considerations include equivalent series resistance (ESR),<br />

equivalent series inductance (ESL), peak and RMS current<br />

rating, tolerance, aging effects, temperature limits, maximum<br />

dv/dt rating and failure mechanisms.<br />

Capacitors are divided into two general types – polarized<br />

and non-polarized. Polarized capacitors are electrolytic and<br />

super caps, and must be connected correctly. They tend to be<br />

large, have wide tolerances, store more energy, and have higher<br />

ESR and ESL. They mostly serve as bulk storage devices. Nonpolarized<br />

types are ceramic and metal film capacitors. They<br />

tend to be smaller, hence lower energy capacity, have lower<br />

ESR and ESL making them suitable for higher frequency<br />

operation. Typically, they are used for decoupling and filtering<br />

high frequencies. Ceramic capacitors can fail from overvoltage<br />

and mechanical stresses – like cracking.<br />

B. Magnetics (Inductors and transformers)<br />

Inductive components are one of the most important<br />

components in a switching power supply. Inductors function to<br />

limit current rate of change and store energy that provides<br />

power when the switch is off. Coupled inductors used in flyback<br />
transformers perform the same function but allow a greater<br />
difference between input and output voltages and can provide<br />
galvanic isolation. Transformers are used to convert voltages and<br />

currents to different levels in real time. Inductors store energy<br />

in their magnetic fields and release it. Various specialized<br />

inductors also provide filtering of common-mode and<br />
differential-mode noise.<br />

C. Power Switches/diodes<br />

Fast transistors and diodes make switching power supplies<br />

possible. Diodes can be considered as switches because they are<br />

one-way devices. The main characteristics of interest are the<br />

forward voltage drop when conducting, the reverse recovery<br />

time and the breakdown voltage. The forward voltage drop plays<br />

into efficiency, and at low output voltages synchronous rectifiers<br />
are replacing them. Synchronous rectifier is a term applied to<br />
MOSFETs that are used as diodes. Reverse recovery time is the<br />

time it takes the diode to stop the current flow when the polarity<br />

reverses. A slow diode can allow large transient currents that<br />

reduce efficiency and can create noise.<br />

Most small switching power supplies today use metal oxide<br />

semiconductor field-effect transistors, commonly called<br />

MOSFETs. N-channel types are predominant because they are<br />

smaller, more rugged, and less expensive. The MOSFET is a<br />

voltage driven device with a high turn-on threshold whose gate<br />

capacitance requires a high current transient for fast switching.<br />

The voltage drop is the fixed on-state resistance (Rdson) times<br />

the current. The bipolar transistor (BJT) is a current driven<br />

device with low turn-on threshold and low voltage drop. Other<br />

devices like IGBTs and Thyristors are used in applications that<br />

are more specialized.<br />

IV. SWITCHING POWER TOPOLOGIES<br />

Switching topologies come in two classes. Non-isolated<br />
supplies share a common ground between the input and<br />
output, whereas isolated supplies have some form of galvanic<br />
isolation between the input and output. Aside from technical<br />
reasons where it might be necessary, isolation is required in<br />
power supplies connected to the mains (the AC power receptacle)<br />
for the safety of the user.<br />

In the following explanations, the switching element is<br />
represented as a simple switch. It could be a transistor, a P-channel<br />
MOSFET or an N-channel MOSFET, with or without a driver.<br />

In many cases the diode could be a synchronous rectifier to<br />

increase efficiency.<br />

A. Buck<br />

Buck or step-down converters always have lower output<br />

voltage than the input voltage. The output voltage is directly<br />

proportional to the duty cycle. Compact size, high efficiency,<br />

fast response, and the ability to be shut-off make them an<br />

attractive alternative to linear regulators.<br />

Fig. 2. Buck regulator.<br />

When S1 closes the current increases linearly in L1, storing<br />

energy in its magnetic field as it charges C1 and supplies the<br />

load. Diode D1 is back biased with its cathode at Vin. When S1<br />

opens the magnetic field in L1 starts to collapse, reversing its<br />

voltage polarity as it tries to maintain the current flow. Now D1<br />



becomes forward biased and current continues to flow to<br />

recharge the capacitor and to the load. The low pass filter formed<br />

by L1 and C1 smooths the pulses into a steady voltage. The<br />

voltage-time products of the on period and the off period must<br />
be equal.<br />

B. Boost<br />
Boost or step-up converters always have an output voltage<br />
higher than the input voltage. The circuit uses the same<br />
components, only rearranged. This time the output voltage is<br />
proportional to 1/(1-D), where D is the duty cycle. The practical<br />
limit for voltage boosting is 2-3 times the input. The input<br />
voltage must not rise above the output, for then D1 would<br />
conduct, connecting the input to the output without regulation.<br />
Fig. 3. Boost regulator.<br />
When S1 is closed the current flows through L1, increasing<br />
linearly, storing energy in its magnetic field. The load current<br />
comes solely from the capacitor C1. Diode D1 is back biased<br />
because its anode is tied to ground by S1. When S1 opens, the<br />
magnetic field of L1 starts to collapse, reversing the voltage.<br />
This voltage will rise until the diode D1 conducts, then<br />
passing current to recharge the capacitor C1 and to the load.<br />
C. Buck-boost<br />
Inverting buck-boost converters provide a stable output<br />
voltage whether the input voltage is higher or lower. One caveat<br />
is that the output voltage polarity is opposite to the input. This<br />
can still be used with batteries because the battery can be left<br />
floating. Because S1 does not have a ground connection, a level<br />
shifter is required, which adds complexity to the design.<br />
Fig. 4. Buck-boost regulator.<br />
In this circuit, when S1 closes the current once again flows<br />
through the inductor, increasing linearly, storing up energy in<br />
the magnetic field. Diode D1 is back-biased because its cathode<br />
is at the input voltage. The load current comes solely from the<br />
capacitor C1. When S1 opens, the magnetic field of L1<br />
collapses, reversing the voltage, allowing D1 to conduct,<br />
recharging the capacitor and supplying the load. Note, however,<br />
the polarity reversal from the input.<br />
D. SEPIC<br />
The single ended primary inductance converter (SEPIC) is a<br />
two-stage derivative of the buck-boost converter, used when the<br />
required output voltage could be higher or lower than the input<br />
voltage. This converter does not invert the output voltage<br />
polarity and has the advantage of low ripple current. This makes<br />
it ideal for power supplies that use batteries as a power source<br />
because, with a common ground rail, it can recharge the batteries<br />
and at the same time power the circuit. The capacitor C1 provides<br />
inherent output short-circuit protection.<br />
Fig. 5. SEPIC regulator.<br />
The SEPIC converter operates as follows. When S1 closes,<br />
the current flows through the inductor L1, increasing linearly,<br />
storing energy in its magnetic field. At the same time, capacitor<br />
C1, which would have previously been charged to Vin,<br />
discharges its energy into L2. The diode D1 is back-biased and<br />
capacitor C2 supplies the load. When S1 opens, L1 transfers its<br />
energy to C1 while L2 transfers its energy through D1 into C2<br />
and the load. L1 and L2 can be on the same core.<br />
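The ideal duty-cycle relations for these non-isolated topologies can be gathered into one Python sketch. These are the textbook lossless, continuous-conduction formulas only; real converters deviate, especially near the duty-cycle extremes, and the function and topology names are purely illustrative.<br />

```python
def ideal_vout(topology, v_in, duty):
    """Ideal CCM output voltage for the non-isolated topologies above.

    Lossless components are assumed; a negative result indicates the
    inverted output polarity of the buck-boost.
    """
    if not 0.0 <= duty < 1.0:
        raise ValueError("duty cycle must be in [0, 1)")
    d = duty
    formulas = {
        "buck": v_in * d,                     # always below v_in
        "boost": v_in / (1.0 - d),            # always above v_in
        "buck-boost": -v_in * d / (1.0 - d),  # inverted polarity
        "sepic": v_in * d / (1.0 - d),        # above or below, same polarity
    }
    return formulas[topology]
```

For example, a buck at 50 % duty halves its input, while a boost at the same duty doubles it.<br />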

E. Isolated Flyback<br />

The flyback converter is an isolated version of the buck-boost<br />
converter. Instead of a single-winding inductor, it uses<br />

two coupled windings on one core but it is still an inductor. One<br />

winding is used during the first half of the cycle to store the<br />

energy in the magnetic field and the other is used to harvest the<br />

energy for the output. This provides two useful benefits in<br />

situations like off-line converters - galvanic isolation for safety<br />

and a large turns ratio allowing the converter to operate with a<br />

large input range and still have a reasonable pulse width. If a<br />

buck converter had to drop 400 V to 5 V the pulse width would<br />

be extremely narrow, too narrow to be useful. The turns ratio<br />

allows the converter to operate at reasonable duty cycles. It<br />



makes the wide input voltage range seen in universal battery<br />

chargers that can operate from 85 to 400 VDC possible.<br />
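The 400 V-to-5 V argument can be checked numerically with the ideal relations. In this sketch the 20:1 turns ratio is a hypothetical value chosen purely for illustration.<br />

```python
def buck_duty(v_in, v_out):
    """Ideal buck duty cycle: D = Vout / Vin."""
    return v_out / v_in

def flyback_duty_ccm(v_in, v_out, turns_ratio):
    """Ideal CCM flyback duty cycle, with turns_ratio = Np / Ns.

    Volt-second balance on the coupled inductor gives
    Vin * D = turns_ratio * Vout * (1 - D), hence
    D = n * Vout / (Vin + n * Vout).
    """
    n_vout = turns_ratio * v_out
    return n_vout / (v_in + n_vout)

# Dropping 400 V to 5 V: a plain buck would need a 1.25 % pulse,
# while a flyback with a 20:1 winding runs at a comfortable 20 %.
```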

Fig. 6. Isolated flyback converter.<br />

The flyback converter works like the boost converter.<br />

When S1 closes, the current flows through the inductor L1,<br />

increasing linearly, storing up energy in the magnetic field.<br />

Diode D1 is back-biased because its anode is effectively at<br />

ground by the dot convention of the windings (the dotted ends<br />

of a winding are at the same polarity). Capacitor C1 supplies the<br />

load. When S1 opens, the magnetic field of L1 collapses,<br />

reversing the voltage polarity, allowing D1 to conduct,<br />

recharging the capacitor and supplying the load.<br />

F. Discontinuous and continuous mode<br />

Most of the converters mentioned can operate in several<br />

ways with respect to the currents flowing in the storage<br />

elements. In discontinuous current mode (DCM) the current<br />

returns to zero during each period. The advantage is lower turn<br />

on losses in the switch because no current is flowing when it<br />

turns on. Generally, the inductor is smaller but the ripple and<br />

peak currents are higher.<br />

In continuous current mode (CCM) the inductor current does<br />

not return to zero, so the current is switched while flowing, increasing<br />

switching losses. The inductor is larger due to the constant DC<br />

bias but the ripple is smaller resulting in lower peak currents.<br />

In the quest for better efficiency, other methods have been<br />

developed such as boundary mode, valley switching, multimode,<br />

pulse skipping and so on.<br />
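Whether a given design lands in CCM or DCM at a particular load can be estimated from the inductor ripple current. The sketch below does this for an ideal buck converter; the ideal relations are standard, but any specific component values used with it are illustrative only.<br />

```python
def buck_ripple_current(v_in, v_out, inductance_h, f_sw_hz):
    """Peak-to-peak inductor ripple current of an ideal buck in CCM.

    During the on-time D / f_sw the inductor sees (v_in - v_out), so
    delta_i = (v_in - v_out) * D / (L * f_sw), with D = v_out / v_in.
    """
    duty = v_out / v_in
    return (v_in - v_out) * duty / (inductance_h * f_sw_hz)

def buck_mode(v_in, v_out, inductance_h, f_sw_hz, i_load):
    """'CCM' while the average load current exceeds half the ripple;
    below that the inductor current reaches zero each cycle: 'DCM'."""
    ripple = buck_ripple_current(v_in, v_out, inductance_h, f_sw_hz)
    return "CCM" if i_load > ripple / 2.0 else "DCM"
```

A 12 V-to-5 V buck with a 22 uH inductor at 500 kHz, for instance, slips into DCM when the load falls below roughly 130 mA.<br />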

V. CONTROL METHODS FOR REGULATION<br />

A. Hysteretic control<br />


Hysteretic control, sometimes called “bang-bang” control,<br />
is the simplest and least expensive method to implement. It is<br />
merely a voltage comparator that compares the measured output<br />
voltage against a voltage reference and turns the power switch<br />
on if it is too low or off if it is too high. The hysteresis is the<br />
difference between the two levels and determines the amount of<br />
output voltage ripple that will be present.<br />
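The bang-bang behaviour is easy to see in a toy simulation. In the sketch below the plant is reduced to a voltage that ramps while the switch is on and sags while it is off; all step sizes are purely illustrative, not a real converter model.<br />

```python
def simulate_hysteretic(v_ref=3.3, hysteresis=0.1, v_start=3.0,
                        charge_step=0.02, discharge_step=0.01, steps=2000):
    """Toy discrete-time model of a bang-bang regulator.

    Returns (min_v, max_v) of the output after the start-up transient,
    i.e. the ripple band the comparator settles into.
    """
    low, high = v_ref - hysteresis / 2, v_ref + hysteresis / 2
    v, switch_on = v_start, True
    seen = []
    for _ in range(steps):
        v += charge_step if switch_on else -discharge_step
        if v >= high:        # too high -> turn the switch off
            switch_on = False
        elif v <= low:       # too low -> turn it back on
            switch_on = True
        seen.append(v)
    settled = seen[len(seen) // 2:]   # ignore the start-up transient
    return min(settled), max(settled)
```

With the defaults, the output settles into a ripple band of roughly the programmed hysteresis centred on the 3.3 V reference.<br />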

B. Constant on-time control<br />

This type of control also operates at a variable frequency.<br />

The on time is constant, set by a timer and the off time varies<br />

by comparing to the limits as in hysteretic control. The<br />
advantages are a simple, stable control system with few<br />
components, fast response, and high efficiency at light loads.<br />

C. Constant off-time control<br />

Though similar to constant on-time, this control scheme<br />

suffers under light load, where the frequency must increase and<br />

the pulse width decrease, causing performance issues. However,<br />
it has a place in some specialized applications like fast charging<br />

of flash capacitors.<br />

D. Voltage mode control<br />

This technique operates at a fixed frequency and varies the<br />

duty cycle in proportion to the error difference between the<br />

actual output voltage and a reference voltage. This means it can<br />

only respond to changes in the load voltage. Since it does not<br />

measure load current or input voltage, it must wait for the effect<br />

on the load voltage. There is always a delay of several clock<br />

periods before the control loop reacts and stabilizes. The control<br />

needs to be compensated to avoid instability and overshoot.<br />

Fig. 7 shows a typical implementation of a voltage-mode<br />

PWM controller. The error amplifier (EA) measures the<br />

difference between a highly accurate voltage reference and the<br />

output voltage scaled down by the voltage divider formed by R1<br />
and R2. The error amplifier output is proportional to the difference<br />

between the reference and the output voltage. This feeds to the<br />

PWM comparator where it is compared to a linear, periodic<br />

ramp voltage, which starts at zero at the beginning of each clock<br />

cycle. The latch also turns on the power switch at the beginning<br />

of each cycle. When the ramp voltage crosses the error voltage,<br />

the latch resets, turning off the power switch.<br />
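One cycle of that ramp-and-latch mechanism reduces to a simple comparison, sketched here in idealized form (no propagation delay, blanking or minimum on-time; the function name is illustrative).<br />

```python
def pwm_duty_from_error(v_error, ramp_peak=1.0, steps=1000):
    """One cycle of the voltage-mode modulator described above.

    The latch turns the switch on at the start of the cycle; stepping a
    linear ramp from 0 to ramp_peak, the comparator resets the latch at
    the first point where the ramp crosses the error voltage. Returns
    the resulting duty cycle.
    """
    on_steps = 0
    for i in range(steps):
        ramp = ramp_peak * i / steps
        if ramp >= v_error:       # comparator trips, latch resets
            break
        on_steps += 1             # switch stays on
    return on_steps / steps
```

The duty cycle simply tracks the error voltage as a fraction of the ramp peak, clipping at 0 and 100 %.<br />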

The controller can overshoot and then undershoot so the<br />

voltage is always oscillating around the desired level. The<br />

feedback response is often slowed down to reduce this behavior<br />

with the disadvantage the converter will be slower to respond<br />

to sudden changes.<br />

Fig. 7. Voltage mode control.<br />

E. Current mode control<br />


This control technique builds on voltage mode control by<br />

adding a second control loop based on the switch current. There<br />




is an inner control loop regulating the current and an outer<br />

control loop that regulates the voltage as in voltage mode. The<br />

loops run at different speeds. The current loop reacts on a<br />
pulse-by-pulse basis, whereas the voltage loop is slower, being after<br />

the output filter.<br />

Fig. 8 shows a typical implementation of a current-mode<br />
PWM controller. The principal difference is that the current sense<br />
voltage replaces the ramp voltage. Operation is similar. At the<br />

beginning of the cycle, the latch is set and the switch turned on.<br />

The current sense voltage, typically measured across a shunt<br />

resistor, rises until it meets the error voltage. The latch is reset<br />
and the power switch turns off until the cycle starts again.<br />

Fig. 8. Current mode control.<br />

One benefit is that the inductive element builds up the same<br />

energy level regardless of the input voltage. A change in input<br />

voltage affects the rate of rise and duration of the charging<br />

current – taking longer for lower voltages and less time for<br />

higher voltages. This scheme adjusts on a pulse-by-pulse basis<br />

without the voltage control loop.<br />

A second benefit is pulse-by-pulse current limit by merely<br />

clamping the maximum error amplifier’s output voltage level.<br />

A third benefit is faster response time to load changes. An<br />

increase in load would cause the error voltage to increase,<br />

extending the charging duration. The inner current loop would<br />

have limited effect until the output voltage rose to the regulated<br />

level again. Current control is the preferred mode for most<br />

designs.<br />
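The first two benefits can be illustrated with a small sketch of a single idealized pulse: the on-time scales inversely with the input voltage for the same peak current, and clamping the error voltage caps the peak current directly. All component values used with it are placeholders, not from any design.<br />

```python
def peak_cm_on_time(v_error, v_in, inductance, r_sense,
                    v_clamp=None, t_max=5e-6):
    """On-time of one idealized peak-current-mode pulse.

    The switch current ramps at v_in / L; the pulse ends when the
    sensed voltage i * r_sense reaches the (optionally clamped)
    error voltage. The clamp is the pulse-by-pulse current limit.
    """
    v_target = v_error if v_clamp is None else min(v_error, v_clamp)
    i_peak = v_target / r_sense          # current at the comparator trip
    di_dt = v_in / inductance            # inductor charging slope
    return min(i_peak / di_dt, t_max)    # bounded by the clock period
```

Halving v_in doubles the on-time while the peak current, and hence the energy stored per pulse, stays the same.<br />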

F. Multi-phase control<br />


One of the consequences of our ever-increasing digital<br />
world is processors that use low input voltages with high<br />
current loads. Additionally, to save energy, these loads are<br />
dynamic: the current demands are large and fast while a<br />
tight voltage level must be maintained. From a practical point of<br />
view, a single buck regulator can work up to around 20 A. Beyond<br />

that, multi-phase controllers have been developed to run<br />

multiple regulators in parallel but offset in phase. This divides<br />

the load, reduces the ripple and allows faster response to load<br />

changes. The components are smaller, with less stress and the<br />

thermal load is distributed over a larger area.<br />


G. Feedback loop compensation<br />

All voltage regulators using negative feedback rely on the<br />

corrective nature of the feedback signal to compensate for<br />

errors generated in the forward path. In practice, the gain and<br />

phase of both the forward and feedback paths vary with<br />

frequency, and therefore it is possible that at some frequency<br />

(or frequencies) the output voltage will respond too slowly<br />

resulting in inadequate performance, or too fast, causing<br />

oscillation or ringing.<br />

The term frequency compensation describes the design of<br />

feedback circuits that take into account the frequency response<br />

of the forward path and ensure that the frequency response of<br />

the feedback signal compensates it in such a way that the system<br />

provides adequate performance and is stable.<br />

There are three types of compensation schemes known as<br />

Type 1, Type 2 and Type 3, as shown in fig. 9. The criterion<br />
for stability in a system employing negative feedback is that the<br />
loop gain must be less than 0 dB when the loop phase reaches 360°.<br />

The term phase margin refers to the loop phase at the<br />

frequency where the loop gain equals 0 dB, and the term gain<br />

margin refers to the loop gain at the frequency where the loop<br />

phase equals 360° (see fig.10). Gain margin and phase margin<br />

are both terms that describe the stability of a negative feedback<br />

system in qualitative terms. In general, higher gain and phase<br />

margins indicate a more stable system.<br />

Generally, a system with a gain margin greater than 10 dB and<br />

a phase margin greater than 45° will perform adequately in most<br />

applications. Additionally, the loop gain should exhibit a slope<br />

of -20 dB per decade as it passes through the 0 dB axis.<br />

Explaining and determining the proper compensation is beyond<br />

the scope of this paper.<br />
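Even without going into compensator design, the gain crossover and phase margin are easy to evaluate numerically for a known loop transfer function. The sketch below, using only the Python standard library, sweeps an example two-pole loop; the pole frequencies are illustrative, and the phase margin is expressed in the usual form of 180° minus the loop's phase lag at crossover.<br />

```python
import cmath
import math

def phase_margin_deg(loop_gain, f_lo=1.0, f_hi=1e7, points=20000):
    """Find the 0 dB crossover of loop_gain(f) on a log sweep and
    return the phase margin there (180 deg minus the phase lag).

    loop_gain is any callable returning the complex loop gain at a
    frequency f in Hz.
    """
    crossover = None
    for i in range(points):
        f = f_lo * (f_hi / f_lo) ** (i / (points - 1))
        if abs(loop_gain(f)) <= 1.0:      # first frequency at/below 0 dB
            crossover = f
            break
    if crossover is None:
        raise ValueError("no 0 dB crossover in the swept range")
    lag_deg = -math.degrees(cmath.phase(loop_gain(crossover)))
    return 180.0 - lag_deg

def example_loop(f, f0=1e3, f_p=10e3):
    # Integrator with unity gain at f0, plus a power-stage pole at f_p.
    return (f0 / (1j * f)) * (1.0 / (1.0 + 1j * f / f_p))
```

With the crossover a decade below the second pole, this example loop shows a phase margin in the mid-80s of degrees, comfortably above the 45° rule of thumb quoted above.<br />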

Fig. 9. Types of compensation circuits (Type 1, Type 2, Type 3).<br />




Zener diode is used to trigger an SCR across the output terminals,<br />

effectively shorting them and causing a fuse to blow.<br />


E. Shorts<br />

Short circuits caused by a load failure (or accident) can be<br />

protected by fuses but it is preferred if the controller can limit<br />

the current and resume operation once the short is removed.<br />

Current foldback is one means; pulse-by-pulse current<br />
limiting, inherent in current mode control, is another.<br />

Fig. 10. Gain and phase margins illustrated on a Bode plot.<br />
VI. ANCILLARY CIRCUITS<br />
A. Under voltage lockout<br />

Under voltage lockout protection circuits allow for<br />

controlled operation during power-up and power-down<br />

sequences. The UVLO circuit ensures that Vcc is adequate to<br />

make the controller fully operational before enabling the output<br />

stage. This may be part of the controller, an external circuit or<br />

a combination of the two. Today microprocessors often use<br />

multiple voltages, and the turn-on sequence and timing are<br />
critically important. Specialized power controllers are often<br />
available to match a large microprocessor, with multiple<br />
switching regulator controllers, LDOs, timing and<br />

communication built into one chip.<br />

B. Soft start<br />

At power up, to limit inrush current due to uncharged<br />

capacitors or a connected load, it is desirable to increase the<br />

PWM pulse width gradually, starting at zero duty cycle. Most<br />

modern controller ICs have this important function built-in or a<br />

means to achieve it with external circuitry.<br />

C. Fault management<br />

Power supply faults can be divided into two categories –<br />

human safety and circuit failure. Human safety is covered by<br />

international and national standards such as IEC 62368-1 2nd edition,<br />

“Audio/video, information and communication technology<br />

equipment – Part 1: Safety requirements”. This standard and<br />

others cover all potential hazards – electrical, fire, chemical,<br />

mechanical, thermal and radiation. They impose extra safety<br />

features beyond what’s needed for functional operation. For<br />

example, off-line transformers are slightly larger to<br />

accommodate extra spacing (creepage and clearance<br />

requirements) and extra insulation (double or reinforced).<br />

D. Overvoltage protection<br />

Should the controller or some critical component fail it is<br />

desirable to have some means of protecting downstream circuits<br />

for systems where the input voltage is higher than the output. A<br />

typical over voltage protection circuit is the ‘crowbar’ where a<br />

VII. ELECTROMAGNETIC NOISE<br />

The many benefits of switching power supplies, primarily<br />

their small size and high efficiency come with the price of<br />

needing to deal with the fast switching voltage and current<br />

waveforms. These fast transients may cause noise, which can<br />

interfere with other electronic devices. Generally, noise below<br />

30 MHz is conducted – either as differential mode or as<br />

common mode, and noise above 30 MHz is radiated – either<br />

magnetic or electric.<br />

Differential mode noise, also known as normal mode, is the<br />

disturbance across the power or signal lines. It follows the<br />

normal current paths with current flowing down a wire in one<br />

direction and returning on another. In common mode noise the<br />

disturbance is across multiple lines with an external conduction<br />

path like earth ground or chassis. The currents return through a<br />

different path than the normal one.<br />

There are strict standards limiting the amount of noise<br />

allowed. The International Special Committee on Radio Interference (CISPR) publishes CISPR 22, the most widely accepted standard. Consequently, all power supplies must be<br />

designed and built to meet the standards. Most off-line power<br />

supplies need some type of additional input filtering in the form<br />

of inductors and capacitors. The entire topic is one of<br />

specialization requiring unique equipment and setups to test, all<br />

of which is beyond the scope of this paper.<br />

Fig. 11. Minimum basic mains line filter.<br />

VIII. MODULES<br />

Complete power supplies as modules have been available for<br />

some time. More recently with the increase in operating<br />

frequency and miniaturization, modules the size of large ICs<br />

have become available. These contain the complete power<br />

supply – controller with the feedback loop, the magnetics and<br />

even some capacitance. Only the bulk input capacitor and<br />

perhaps a voltage divider to set the output need to be added.<br />

473


This can save considerable time in designing, building and<br />

testing a power supply.<br />

IX. SUMMARY<br />

Linear regulators are simple and easy to use. Though less efficient and limited to reducing voltages, they offer low noise and fast transient response. Switching regulators are more complex and more efficient, and can be used to either reduce or increase the voltage; a prime example is extending the operating time of battery-based systems. Today, products with microcontrollers often use both in the same system, using each solution to its best advantage.<br />




Achieving Ultra Low Power in Embedded Systems<br />

Understand where your power goes and what you can do to make things better<br />

Herman Roebbers<br />

Embedded Systems<br />

Altran Netherlands B.V.<br />

Eindhoven, The Netherlands<br />

Herman.Roebbers@altran.com<br />

Abstract— Over the last years, the need to reduce energy consumption has been growing. This article focuses on the possibilities for reducing energy consumption in embedded systems. We argue that energy consumption is a system issue and therefore a matter of making compromises. Energy consumption can be reduced by software, but only as far as the hardware allows. There are many things that can be done to reduce energy consumption; the goal is to define an approach for achieving lower energy consumption. Criteria for the selection of an appropriate MCU are also presented. Conclusion: many (unexpected) things can have a big impact on your achievable battery lifetime. Look beyond just the CPU/processor and software in order to achieve better results.<br />

Keywords— Ultra Low Power; approach; embedded; system<br />

issue; reducing energy consumption<br />

I. INTRODUCTION<br />

In recent years the need to reduce energy consumption has grown. On the one hand this is instigated by governments (e.g. EnergyStar), on the other hand by the need to do more with the same or less energy (think of mobile phone battery lifetime, or Internet-of-Things node battery lifetime). In this article we will focus on the background of energy consumption in embedded systems and how to reduce this consumption (or its effect). This article covers part of a two-day Ultra Low Power workshop on this subject, available via the High Tech Institute (http://www.hightechinstitute.nl), T2prof and Altran.<br />

That energy consumption is an important issue is illustrated by how loudly chip manufacturers advertise their energy-efficient chips. There are even benchmarks for the energy efficiency of embedded processors: the EEMBC ULPMark™ (http://www.eembc.org/ulpbench) CP (Core Profile) and PP (Peripheral Profile), IoTMark-BLE, and the soon-to-be-released SecureMark.<br />

Energy consumption is an important point in all sorts of systems. It becomes ever more important in the IoT world, where the biggest consumer is usually the radio. All sorts of solutions are tried to keep the radio on for as short a time as possible. This leads to non-standard protocols that use much less energy than standard protocols.<br />

It is important to realize that energy consumption is a system issue, and a matter of weighing one thing against another and making compromises. Energy consumption can be reduced by software, but only as far as the hardware allows. It is also a multidisciplinary effort, because both the software and hardware disciplines must be involved in the design in order to achieve the desired goal.<br />

For this article we limit ourselves to smaller embedded<br />

systems like sensor nodes. These systems are typically asleep for<br />

a large proportion of the time. Depending on what functionality<br />

is required during sleep and how fast the system must wake up,<br />

the system can sleep lighter or deeper.<br />

There are many measures that can reduce energy<br />

consumption. The goal is to define an approach that should lead<br />

to less energy consumption. That approach is detailed in this<br />

article as well as in the workshop.<br />

II. CATEGORIES OF MECHANISMS FOR ENERGY REDUCTION<br />

The mechanisms for energy reduction fall into three main<br />

categories. TABLE 1 lists commonly used mechanisms per<br />

category. This list is not exhaustive. Different vendors may use<br />

different names for the same mechanism.<br />

A. Software only (includes compiler)<br />

The energy reduction mechanism is solely implemented in<br />

the software domain.<br />

B. Software and hardware combined<br />

Hardware and software together implement an energy<br />

reduction mechanism.<br />

C. Hardware only<br />

The energy reduction mechanism is implemented at the<br />

hardware level.<br />

Each of the hardware mechanisms mentioned in the table<br />

below may or may not be available in your system. If the<br />

hardware does not support it then software cannot use it.<br />



TABLE 1. POWER MANAGEMENT MECHANISMS (power management works at all these levels)<br />

Level | Mechanism | Category (Domain)<br />
Application | Event driven architecture; Use Low Power modes; Select radio protocol; … | A (Software)<br />
Operating System | Power API; Operation Performance Points API; Tickless operation | B (Software & Hardware)<br />
Driver | Use DMA; Use HW event mechanisms; Suspend / resume API | B (Software & Hardware)<br />
Board | Dynamic Voltage and Frequency Scaling; Power gating via I/O pin; Controlling voltage regulator via I/O pin; Clock frequency management; Controlling device shutdown pins by I/O pin | B (Software & Hardware)<br />
Chip | Power gating; Offer Low Energy Modes; (Automatic) clock gating; Clock frequency management | C (Hardware)<br />
IP block / chip | Dynamic Power Switching; Adaptive Voltage Scaling; Static Leakage Management | C (Hardware)<br />
IP block / RTL | Power Gating State Retention; Automatic power / clock gating | C (Hardware)<br />
Transistor | Body bias; FinFET; TriGate FET; Sub-threshold operation | C (Hardware)<br />
Substrate | SOI, FD-SOI | C (Hardware)<br />

III. SIMPLE THINGS TO DO<br />

A. Look at the OS configuration (if there is an OS)<br />

Operating Systems use a periodic scheduler invocation<br />

(‘tick’) to check whether the currently executing process is still<br />

allowed to use the processor or whether it should be descheduled in<br />

favor of some other process. This periodic invocation can take<br />

quite some time, and also happens if no processes are ready for<br />

execution. In this case a so-called idle task is executed, which<br />

usually consists of a simple while (1) {}; loop, just<br />

burning energy.<br />

Some Operating Systems (e.g. Linux and FreeRTOS) offer<br />

what is known as a tickless configuration to make the CPU sleep<br />

until either a timer expires or an interrupt occurs. The standard<br />

scheduler tick timer (default 100 Hz for Linux versions prior to<br />

version 3.10) is then no longer necessary. In versions before 3.10<br />

the kernel configuration option CONFIG_NO_HZ enables this behavior; in later versions it is CONFIG_NO_HZ_IDLE. For FreeRTOS to be used in this way, configUSE_TICKLESS_IDLE must be set to 1. When applicable,<br />

this is a very simple way to (possibly substantially) reduce<br />

power.<br />
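As a concrete example, in FreeRTOS the tickless idle mode is enabled from FreeRTOSConfig.h (the threshold below is the stock default; ports may also supply their own low-power sleep implementation):

```c
/* FreeRTOSConfig.h (fragment) */
#define configUSE_TICKLESS_IDLE               1  /* stop the tick while idle */
/* Only enter tickless sleep when at least this many idle ticks remain: */
#define configEXPECTED_IDLE_TIME_BEFORE_SLEEP 2
```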

B. Look at the architecture of the application<br />

If we look at the architecture of the application software we<br />

can distinguish two major types: Super loop or event driven. The<br />

super loop goes around one big loop all of the time, often not<br />

sleeping at any time. In order to reduce energy consumption we<br />

would like the system to sleep as long as possible between<br />

successive passes through the loop. It depends on the application<br />

whether sleeping is allowed at all and what the maximum<br />

sleeping time can be. It may, however, be quite possible to do<br />

some sleeping at the end of the loop without causing any<br />

problem and in doing so save substantial energy.<br />

IV. APPROACH FOR OBTAINING ULTRA LOW POWER<br />

We will now describe our approach toward achieving ultra-low power in a step-by-step fashion. Basically the strategy is: use the facilities the hardware offers. We can do this in steps, roughly in the order these features were offered over time.<br />

A. In the beginning<br />

In the beginning there was only one bus master in the system:<br />

the CPU. It could read data from instruction memory and read<br />

from and write data to data memory and peripherals. In order to<br />

check for an event the CPU had to resort to polling:<br />

while (!event_occurred())<br />

{};<br />

This piece of code keeps the CPU busy, as well as the code<br />

memory and the bus. Both the CPU and the code memory (flash<br />

usually) are big contributors to the total energy consumption,<br />

especially when code memory isn’t cached.<br />

B. Phase 2: Introducing Direct Memory Access (DMA)<br />

At some point in time a second bus master is introduced: The<br />

DMA unit. It is capable (after being programmed by the CPU)<br />

to access memory and peripherals autonomously. It can also<br />

generate an interrupt to the CPU to signal completion of its task,<br />

e.g. copying of peripheral data to memory or vice versa. This<br />

DMA unit can operate in parallel with the CPU, but they cannot<br />

access the bus simultaneously. While the DMA is copying data,<br />

the CPU can check a variable in memory for DMA completion.<br />

Pseudocode of the Interrupt Service Routine (ISR):<br />

void ISR_DMA_done(void)<br />
{<br />
    ... /* clear interrupt */<br />
    ready = true;<br />
}<br />



The main program:<br />

volatile bool ready = false;<br />
setup_peripherals_and_DMA();<br />
start_DMA();<br />
while ( ! ready )<br />
{<br />
    __delay_cycles(CHECK_INTERVAL);<br />
}<br />

Here we check another variable, but not continuously.<br />

The __delay_cycles() function executes NOP<br />

instructions during CHECK_INTERVAL. This keeps the data<br />

bus free so that the DMA unit isn’t hindered by the CPU’s data<br />

accesses and so may complete its assignment more quickly. The CPU<br />

is still fetching code from instruction memory, though.<br />

C. Stop the CPU clock when possible<br />

A relatively recent addition to the CPU’s capabilities is<br />

stopping the CPU clock until an interrupt occurs, saving power<br />

by doing so. This can be in the form of a<br />

WAIT_FOR_INTERRUPT instruction, which removes the<br />

clock from the CPU core until an interrupt occurs. ARM CPU<br />

cores offer the WFI instruction for this purpose, others such as<br />

MSP430 set a special bit in the processor status register to<br />

achieve the same effect. This does not affect our interrupt service<br />

routine. Our main program code changes thus:<br />

volatile bool ready = false;<br />
setup_peripherals_and_DMA();<br />
start_DMA();<br />
while ( ! ready )<br />
{<br />
    __WFI(); /* special insn, CPU sleeps */<br />
}<br />

In the new situation the CPU is stopped by disabling its clock<br />

until the interrupt occurs. This saves energy in several ways: The<br />

CPU is not active, instruction memory is not read and both the<br />

data bus and the instruction bus are completely available for the<br />

DMA unit to use. Most new processors support this trick.<br />

D. Events<br />

Later CPUs have the notion of events that also can be used<br />

to wake up the CPU from sleep. This mechanism is quite similar<br />

to that of using the interrupt, except that no ISR gets invoked.<br />

This saves some overhead if the ISR didn’t have to do anything<br />

other than wake the CPU. Using this mechanism requires that<br />

the CPU have an instruction to WaitForEvent. ARM Cortex<br />

processors have the WFE instruction; others, such as the MSP430, don’t have it.<br />

E. Passing events around: Event router<br />

When this event mechanism is coupled with peripherals that<br />

can produce and consume events using some programmable<br />

event connection matrix (‘event router’), a very powerful system<br />

emerges. In the case of Silabs EFM32 series the mechanism is<br />

referred to as the Peripheral Reflex System; Nordic calls it PPI (Programmable Peripheral Interconnect). The MSP430 has something a bit simpler than the other<br />

two.<br />

This mechanism allows quite complex interactions between peripherals to take place without CPU involvement. This allows the CPU to go into a deeper sleep mode and save more energy.<br />

As an example we can configure a system to do the following<br />

without any CPU interaction: On a rising edge on a given I/O<br />

pin an ADC conversion is started. The conversion done event<br />

triggers the DMA to read the conversion result and store it into<br />

memory, incrementing the memory address after each store.<br />

After 100 conversions the DMA transfer is done, generating an<br />

event to the CPU to start a new acquisition series and to process<br />

the buffered data.<br />

F. Controlling power modes<br />

The latest ULP processors have a special hardware block to<br />

manage energy modes and transitions between them in the<br />

system, combined with managing clocks and power gating<br />

peripherals in certain energy modes: The Energy Control Unit in<br />

EFM32, or Power Management Module for MSP430 for<br />

instance. They can save a lot of time otherwise required to<br />

program many registers when going to or coming out of sleep.<br />

They can also manage retaining peripheral register content at<br />

retention voltage (lower than operational voltage), such that the<br />

peripheral can immediately resume operation when power is<br />

restored. This hardware mechanism is called State Retention<br />

Power Gating.<br />

The main program is now:<br />

setup_hw_for_event_generation();<br />
configure_sleep(); /* this is the extra step */<br />
start_DMA();<br />
__WFE(); /* CPU sleeps, low power mode */<br />

Using a deeper sleep can make a difference of more than a<br />

factor of a thousand!<br />

We have just seen what stepwise refinements we can<br />

implement to reduce energy consumption. Each step can be<br />

implemented as a logical successor to the previous one.<br />

V. WHAT TO LOOK FOR WHEN SELECTING AN MCU<br />

There are a number of parameters that one can look at and<br />

compare to select the best MCU for the application at hand. Here<br />

is one set of parameters:<br />

1) What is the active current (µA/MHz), and at what voltage<br />

2) What is the performance of the CPU (CoreMark/MHz)<br />

3) What is the sleep current in each of the low power modes<br />

intended to be used<br />

4) What is the wake-up time from each of these low power<br />

modes.<br />

5) What is the power consumption of each of the<br />

peripherals used<br />

6) What peripherals are available in which low power<br />

modes<br />



7) Can peripherals operate autonomously (e.g. be<br />

controlled by a DMA engine)<br />

8) Is there a hardware event mechanism to orchestrate<br />

hardware-based event production and consumption<br />

9) Do the available low power modes fit well with the<br />

application<br />

10) Are the peripherals designed for ultra low power<br />

operation (e.g. Low Energy UART, Low Power Timer)<br />

11) Can sensors be operated with low energy consumption<br />

(e.g. Low Energy sensor interfaces)<br />

12) Are there “on-demand oscillators”<br />

The answers to these questions serve as a guide to an informed<br />

selection of the MCU type to use for best performance for the<br />

given application. They can be used as input for a power<br />

model of the application and, together with a battery model, can help predict the battery/charge lifetime for the application.<br />

VI. WHAT ELSE CAN ONE DO?<br />

There are still many more factors that can all play a role in the overall energy consumption. These are factors not obvious to many people, such as:<br />

• Regulator efficiency<br />
• Switching sensors off when not in use: prepare your hardware to be able to do so<br />
• Clocks: how to set them for lowest energy consumption<br />
• Voltages: lower is better, and the fewer the better<br />
• Compiler: can make a 50 % difference<br />
• Compiler settings: can make a 50 % difference<br />
• Where to locate critical code / data<br />
• How to measure the consumption<br />
• I/O pin settings<br />
• Battery properties in relation to the energy consumption profile<br />
• Look for possibilities to make use of energy harvesting to prolong battery lifetime<br />

During the workshop many of these issues and others will be<br />

addressed and illustrated through hands-on sessions.<br />

VII. CONCLUSIONS<br />

Ultra-Low Power is a system thing. Hardware alone or<br />

software alone cannot achieve the lowest consumption.<br />

We have shown a stepwise approach to reducing energy<br />

consumption.<br />

In order to realize the maximum energy reduction one has to<br />

understand the details of the hardware and write the software to<br />

use available features.<br />

Energy savings can be found in unexpected places.<br />

It is possible to reduce consumption by more than a factor of a thousand in certain scenarios.<br />

ACKNOWLEDGMENT<br />

The author wishes to thank Altran for giving him the<br />

opportunity to investigate this subject matter and his colleagues<br />

for helpful feedback during the development of the workshop<br />

and for reviewing related publications [1].<br />

REFERENCES<br />

[1] H. Roebbers, “Hoe spaar je energie in een embedded systeem?,” Bits &<br />

Chips 08, pp. 34-39, October 2015.<br />



Understanding Power Management and Processor<br />

Performance Determinism<br />

Ben Boehman<br />

Enterprise, Embedded, and Semi-Custom Business<br />

Advanced Micro Devices, Inc.<br />

Austin, TX USA<br />

Abstract—High-performance embedded systems crave the<br />

processing power of modern x86 processors, but current hardware<br />

architectures consistently prioritize peak performance over<br />

deterministic behavior. Advanced power management methods<br />

exploit inherent part-to-part variations, boosting core frequencies<br />

in unpredictable ways. Adding to this, PC architectures tend to<br />

target specific processor power constraints that can artificially<br />

clamp operating frequencies to maintain thermal and electrical<br />

specs. This creates scenarios where the power-density of the<br />

workload defines the effective operating frequency of the CPU,<br />

further reducing predictability. Real-time operating systems help address determinism in the software domain, but they cannot address it at the hardware level. Once these hardware<br />

implications are understood, designers will know what to look for<br />

when choosing processors for embedded systems where<br />

performance determinism is an important factor. Discover<br />

methods to disable features of modern processors that reduce<br />

hardware determinism.<br />

Keywords – determinism; deterministic; x86; performance;<br />

power; management; real-time; AMD;<br />

I. INTRODUCTION<br />

Embedded system applications span a tremendous range of<br />

uses and some of these devices become mission critical<br />

equipment where performance behavior must be highly<br />

predictable. Embedded system designers in these markets are<br />

familiar with the use of real-time operating systems to improve<br />

determinism at the software level, but variations introduced by<br />

hardware are often overlooked. For this work, hardware<br />

determinism is defined as a guaranteed, predictable response<br />

time to an event, assuming a fixed sequence of code and input<br />

stimuli. Deterministic systems can replicate that predictability<br />

across all units. The increased demand for high-performance<br />

embedded systems has also driven a trend toward usage of PC-compatible x86 processors from desktop and notebook product<br />

lines, though their power and performance architectures are not<br />

designed with determinism in mind. Even product variants<br />

targeted at embedded markets tend to retain the favoritism<br />

toward performance prevalent in the PC models. Power<br />

management behavior in leading x86 processors has consistently<br />

striven to squeeze the last drop of performance out of every<br />

device, including exploitation of inherent part-to-part variations.<br />

This paper will review the source of these variations, discuss<br />

common power management behaviors that exploit them, and<br />

review methods of mitigation. Focus will be on common<br />

desktop, notebook, and embedded processors in the 6-65W<br />

power range and may not be reflective of x86 server processors.<br />

II. SILICON BASICS<br />

a. DEFINING OPERATIONAL LIMITS<br />

Before power management behaviors can be discussed, it is<br />

important to understand the fundamental limitations of silicon<br />

integrated circuits. In fact, the primary purpose for power<br />

management in such devices is to ensure these limitations are<br />

not exceeded so that device reliability and functionality are<br />

maintained. There are many factors that affect silicon-based<br />

transistor performance, but the focus here is to briefly<br />

familiarize readers with the most significant factors affecting<br />

x86 processors in their typical operating ranges.<br />

Processor frequency is possibly the most obvious of<br />

performance limiting factors. Even consumers have become<br />

quite familiar with equating frequency to performance.<br />

Frequency defines how fast the logic of the device is clocked,<br />

and thus how fast instructions are executed. Performance will<br />

not be equivalent when comparing two processors of equivalent<br />

frequency and different architecture, but it is generally true that<br />

increasing frequency will increase execution performance.<br />

Frequency in a processor can be limited by several underlying<br />

factors, but the most basic are voltage and current. Those<br />

familiar with transistor mechanics know that voltage has a key<br />

relationship to frequency. Faster switching of the transistors<br />

requires increasing voltage to overcome the resistive and<br />

capacitive elements of the transistor. However, higher voltage<br />

increases ageing effects (Gielen, 2013), putting practical limits<br />

on voltage application to ensure product longevity. Faster<br />

switching of transistors also generates higher currents as those<br />

capacitive elements are charged and discharged. While<br />

individual transistor currents may be very small, modern<br />

processors can have several billion transistors (Cutress, 2017),<br />

so this current adds up quickly. The processor die is typically<br />

mounted on a package of some kind and there are also real,<br />

practical limitations to how much current can be delivered to the<br />

die effectively. Every digital IC must deal with the balancing act<br />

of transistor voltage and current to yield a useful frequency.<br />

The combination of Ohm's and Joule's laws teaches us that all<br />

this voltage and current generates power, and that both<br />

parameters have a direct relationship with power. In fact, most processor frequency limitations also boil<br />



down to power or current limits. Faster switching of transistors<br />

increases current and may also require increasing voltage, and<br />

doing either will increase power. Integrated circuits of every<br />

kind must provide designers with a maximum power<br />

consumption limit so that systems can be adequately designed to<br />

handle the current and cooling requirements. Power limits are<br />

often the most significant performance limiting factor,<br />

especially at the lower end of a device family’s power range.<br />

Modern processors based on the x86 architecture tend to be<br />

power limited rather than frequency limited with heavy<br />

workloads. The reasons will be discussed in later sections.<br />

Die temperature is a simple factor to consider, though not the<br />

most obvious. As the processor operates, consumed power is<br />

converted to heat. Heat affects transistor operating<br />

characteristics, as well as the rate of diffusion of the doping<br />

elements in the silicon that form the transistor junctions.<br />

Eventually, diffusion will change the electrical properties of the<br />

transistors until they fail to operate correctly and the processor<br />

will reach the end of its life. Limiting junction temperature in<br />

the device is critical for maintaining its expected longevity.<br />

Manufacturers will set maximum die temperatures for their<br />

products that must be followed. Maintaining this temperature<br />

limit is an important task for the power management entity in the<br />

processor.<br />

b. LEAKAGE POWER<br />

Another basic principle of silicon transistors is that they leak<br />

current across junctions and to the substrate (Kaushik, 2003).<br />

The amount of leakage current in a processor of a particular<br />

process type will vary largely by applied voltage and<br />

temperature, and it can become quite significant in today's high-performance processors. This is because the same factors that<br />

are required to make transistors switch faster (i.e., achieve<br />

higher frequency) also increase leakage. All this leakage current<br />

creates additional power that must be counted as part of the<br />

device’s total power consumption. Naturally, leakage power<br />

effectively reduces the amount of the device’s total power<br />

envelope that can be consumed as active power (i.e., power used<br />

in transistor switching that does work). Figure 1 below shows<br />

the leakage power distribution for a current, undisclosed AMD<br />

processor based on a 14nm FinFET process as a percentage of<br />

total processor power.<br />

Figure 1 - Leakage power distribution for an undisclosed AMD product based on a 14nm FinFET process.<br />

Leakage power is exponentially related to die temperature,<br />

often doubling several times over the operating temperature of<br />

an integrated circuit (Wolpert & Ampadu, 2012). This means<br />

that device power will increase as the device temperature rises,<br />

even if the rest of the operating scenario is unchanged (i.e., fixed<br />

clock frequency, voltage, and workload). CPU manufacturers<br />

must either leave enough headroom to accommodate this<br />

potential increase in power over the temperature range, or have<br />

a power management scheme that is dynamic with device<br />

temperature. Figure 2 below shows how leakage power is<br />

affected by temperature in that same AMD processor family.<br />

Figure 2 - Leakage power over temperature for a typical sample of an undisclosed AMD product based on a 14nm FinFET process.<br />

c. PART-TO-PART VARIATIONS<br />

The silicon photolithography process used to create<br />

semiconductors has inherent imperfections that manifest as<br />

variations in transistor construction and thus affect their<br />

operational characteristics. These variations not only exist<br />

between batches of silicon wafers, but even across a single<br />

wafer. Such variations may require die in one area of the wafer<br />

to have a higher voltage to achieve the same frequency than its<br />

neighbors, or cause its leakage power to be greater. Figure 1<br />

illustrates leakage power variations quite well. Since power is a<br />



key factor in determining achievable performance for a given<br />

device, performance variations follow suit.<br />

Processor manufacturers sort these die into groups targeting<br />

various product models with different specifications (e.g., 25W<br />

vs. 35W) to maximize yield. The amount of variation possible<br />

across units is defined by the specific model, and lower cost<br />

models will tend to allow wider variance. It is important to<br />

understand why these variations exist before discussing how<br />

power management exploits them.<br />
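A minimal sketch of the binning idea described above; the leakage thresholds and model names are hypothetical, not actual AMD or Intel binning criteria.

```python
# Hypothetical sketch: sort die into product bins by measured leakage power.
# Thresholds and model names are invented for illustration only.
def bin_die(leakage_w):
    if leakage_w <= 3.0:
        return "25W model"   # low-leakage die fit the tighter power budget
    if leakage_w <= 5.0:
        return "35W model"
    return "reject"

wafer = [2.1, 2.9, 3.4, 4.8, 5.6]        # leakage samples across one wafer
bins = [bin_die(w) for w in wafer]
```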

d. WORKLOAD POWER DENSITY<br />

Understanding power management behavior in complex<br />

microprocessors also requires understanding the concept of<br />

workload power density. This concept essentially means that<br />

different workloads (i.e., executed instruction sequences) will<br />

generate different amounts of power consumption in the<br />

processor, even at the same utilization level. This is to say that<br />

the central processing unit (CPU) core power incurred by two<br />

workloads can be significantly different even if the core is 100%<br />

utilized (i.e., consistently busy executing instructions) in both<br />

cases. This situation can occur because different instructions<br />

stimulate different amounts of transistor logic inside the core.<br />

As an example, one can imagine that a complex floating-point<br />

calculation will trigger more transistor activity in the CPU than<br />

a simple data movement operation. Data movement from one<br />

CPU general purpose register to another involves a minor<br />

number of gates while a complex AVX or SSE instruction to<br />

perform a multiply accumulate operation at 256 bits wide may<br />

activate many thousands of gates. Workloads may repeat such<br />

operations as part of an algorithm, compounding the power<br />

consumption increase. The potential difference in power<br />

between workloads becomes even larger when considering that<br />

nearly all x86 microprocessors sold today are multi-core, and<br />

most have integrated many other functions that were previously<br />

external. Integration of the graphics processing unit (GPU) is<br />

the most significant, as it is a very large processing core on its<br />

own. As consumer use-cases have become increasingly<br />

graphical, the GPU in some x86 processors can be even larger<br />

(i.e., more transistors) than the CPU cores. This is especially<br />

true for companies like AMD, who specifically target high<br />

performance integrated graphics in their microprocessors.<br />

Mixed workloads that execute a combination of CPU and GPU<br />

instructions simultaneously can experience the effects of<br />

workload power density differences on both core types.<br />

Allocation of the power budget to these various cores is one<br />

challenge of processor power management that will be explored<br />

further in later sections.<br />

To illustrate the difference in workload power density,<br />

power consumption was measured with two different CPU-only<br />

workloads on a random sample of an AMD embedded RX-<br />

421BD SoC based on the “Excavator” CPU core. Both<br />

workloads can saturate a single CPU core while sustaining max<br />

frequency, so utilization will stay at 100% for the core under test.<br />

The Prime95 workload represents an extreme case (often<br />

referred to as a “thermal virus”), and power values have been<br />

normalized to that level.<br />

Figure 3 - CPU core power, normalized to Prime95. Workloads: Prime 95 v29.3 b1 Large FFT; Microsoft SysInternals CPU Stress v1.0<br />

The data in figure 3 show that the power consumption of the<br />

less power-dense workload was only 57% of Prime 95 with a<br />

single CPU core active. When extrapolated across multiple<br />

physical cores, it is easy to see that power variation by workload<br />

can grow quite large. In this test case, the CPU was able to<br />

maintain maximum frequency (i.e., 3.5GHz) on the active core<br />

without reaching power or current throttling, so no frequency<br />

reduction was required.<br />
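The single-core measurements above can be extrapolated across cores to show how workload power density interacts with a power budget. Only the 0.57 ratio comes from Figure 3; the 8 W per-core figure and the 25 W budget below are invented for illustration.

```python
# Extrapolate package CPU power across active cores from a normalized
# single-core measurement, then compare against a hypothetical budget.
def package_cpu_power(per_core_norm, n_cores, core_max_w=8.0):
    # core_max_w: invented figure for one core running Prime95 at max frequency
    return per_core_norm * core_max_w * n_cores

budget_w = 25.0                            # hypothetical CPU power budget
prime95_4c = package_cpu_power(1.00, 4)    # thermal-virus class workload
stress_4c  = package_cpu_power(0.57, 4)    # lighter workload from Figure 3
```

With four cores active, the invented budget is exceeded by the thermal-virus workload but not by the lighter one, so only the former would force frequency throttling.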

The power density of GPU workloads can be compared in<br />

the same way. The graph below compares a simple 3D workload<br />

from the Microsoft DirectX 9 SDK (“blobs”) to Furmark, an<br />

extreme GPU workload falling in the thermal virus class. GPU<br />

frequency was artificially limited to 720MHz to avoid power<br />

limit throttling and expose the full potential power consumption<br />

difference. A comparison of the RX-421BD processor power<br />

for both workloads is shown in Figure 4.<br />

[Chart residue removed. Bar charts: "Workload Dependent CPU Power Consumption (1 Core)" (Prime 95 vs. CPU Stress, normalized to Prime95) and "Workload Dependent GPU Power Consumption" (Furmark vs. Blobs, normalized to Furmark); vertical axes 0%–120%.]<br />

Figure 4 - GPU power, normalized to Furmark. Workloads: Furmark v1.18.2.0; Microsoft DirectX 9 SDK "Blobs"<br />

The GPU power data shows the Blobs application consumed<br />

only 82% of the power of Furmark, confirming a difference in<br />

power density. It is also worth noting that the increase in power<br />

dissipation with the heavier workload will raise die temperature<br />

in a given system environment. The higher temperature will<br />

increase leakage power, adding to the power difference. Truly<br />



comparing the power difference caused only by the workload<br />

would require tight control of the die temperature, which was not<br />

attempted in this test. However, the few degrees of difference<br />

observed here do not significantly affect the results.<br />

III. PROCESSOR POWER MANAGEMENT<br />

Previous sections establish the key observation that power is<br />

inextricably linked to temperature, frequency and<br />

voltage/current. Power management in modern processors is all<br />

about controlling these parameters to control power<br />

consumption, while maximizing workload performance.<br />

Current processors from AMD and Intel contain dedicated<br />

microcontrollers that are independent of the x86 processor cores<br />

to administer power management. The firmware in the<br />

microcontroller is tailored in some ways for the product’s<br />

intended use case. For example, mobile products will be more<br />

aggressive in the use of power saving features like clock and<br />

power gating in the interest of improving battery life. Desktop<br />

and server processors that are always wall powered will tend to<br />

favor performance and only save power when it has minimal<br />

impacts on performance.<br />

a. DEFINING POWER LIMITS<br />

Definition of the maximum power consumption is a common<br />

starting point when defining processor models. Manufacturers<br />

choose power levels to address various use-cases with differing<br />

power restrictions, and performance (i.e., frequency) is largely<br />

derived from that. X86 processors are largely marketed by their<br />

Thermal Design Power (TDP), even though it is a specification<br />

related to the thermal solution requirement and not a maximum<br />

electrical power that the device can consume. Maximum<br />

sustainable power levels will be equal to or greater than TDP,<br />

depending on the product. This paper will focus on the<br />

maximum sustained power of the processor when discussing it<br />

as a limit.<br />

b. SCENARIO DEFINED PERFORMANCE<br />

The power management controller of the processor monitors<br />

key parameters to ensure the processor specifications for<br />

maximum power, current and temperature are not exceeded. If<br />

changes in the operating scenario cause any one parameter to<br />

approach its limit, the controller must throttle the processor’s<br />

performance to compensate. This throttling usually takes the<br />

form of reducing operating frequency of the core(s) consuming<br />

the largest amounts of power (i.e., CPU and GPU), as they have<br />

the biggest impact. Reducing frequency often allows voltage<br />

reduction for additional power savings. Reductions in power<br />

consumption will reduce temperature and current, helping the<br />

processor to stay within these specifications. These<br />

adjustments can happen as often as every millisecond for very<br />

quick response to changes in the operating environment or even<br />

the workload (Howse, 2015). Previously, x86 processors<br />

moved between discrete “performance states” (specific<br />

combinations of voltage and frequency at which cores can<br />

reliably operate) that differed by hundreds of megahertz and<br />

required suspension of execution during transitions. Newer<br />

Intel 7 th Generation Core Processors and AMD Ryzen<br />

Processor architectures allow much more granular frequency<br />

changes for better efficiency and, at least in the Ryzen case,<br />

uninterrupted execution.<br />
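A minimal sketch of the throttling behavior described above, evaluated once per control tick; the P-state table and limits are hypothetical, not a vendor's actual tables.

```python
# Minimal sketch of one power-management control step: each ~1 ms tick,
# compare telemetry against limits and step the P-state down (or back up).
# P-state table (frequency MHz, voltage V) and limits are invented.
PSTATES = [(3500, 1.20), (3000, 1.10), (2500, 1.00), (2000, 0.90)]

def next_pstate(idx, power_w, temp_c, power_limit_w=15.0, temp_limit_c=90.0):
    over = power_w >= power_limit_w or temp_c >= temp_limit_c
    if over and idx < len(PSTATES) - 1:
        return idx + 1            # throttle: lower frequency (and voltage)
    if not over and idx > 0:
        return idx - 1            # headroom available: restore frequency
    return idx
```

Because each lower P-state carries a lower voltage, stepping down reduces both frequency-proportional and voltage-squared components of dynamic power, which is why throttling recovers power margin quickly.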

Since power consumption varies with the workload, one can<br />

recognize why achieving maximum frequency of a core may<br />

not always be possible. What if a very power dense workload<br />

is run on a CPU core at maximum frequency and causes the<br />

device to exceed its power limit? What if that workload is then<br />

run on multiple cores further exceeding the limit? What if a<br />

graphics workload is suddenly introduced on the integrated<br />

GPU simultaneously? In these cases, the power management<br />

controller has no choice but to throttle frequencies to maintain<br />

power and current limits. Many system designers erroneously<br />

assume that processor manufacturers configure their products<br />

to ensure that cores can sustain maximum frequency for any<br />

workload in all configurations. This is definitely not the case.<br />

Doing so would require these vendors to continuously search<br />

out the worst-case (i.e., most power dense) workload in<br />

existence, characterize the power usage on their architecture,<br />

and set the product’s maximum frequency low enough to<br />

accommodate it safely across all units of that model (including<br />

their part-to-part variations). This “fixed frequency” model is<br />

no longer used by most x86 processors in the PC and embedded<br />

spaces. Ignoring the fact that the worst-case workload could<br />

keep changing over time, the reality is that defining the max<br />

frequency in this way would be extremely limiting and easily<br />

reduce the operating frequency of a CPU core to a fraction of<br />

its potential because of the wide variance in workload power<br />

density. The consequence would be that lighter workloads with<br />

less power density would also be limited to this reduced<br />

frequency, even if it would have been safe to execute them<br />

much faster. An artificial performance limitation would be<br />

created to guarantee a predictable maximum frequency that is<br />

achievable for all workloads. A better approach for general-purpose<br />

processors is to define the max frequency by silicon<br />

capability and allow the power management controller to<br />

dynamically provide the best performance possible for the<br />

specific operating scenario in real-time.<br />

Designers should remember that the operating scenario not<br />

only includes the workload (i.e., the exact instruction sequences<br />

running on processor cores) but also its timing and usage of<br />

integrated peripheral functions and I/O. With high levels of<br />

integration in modern processors, I/O power cannot be ignored<br />

(in this instance, logic power for the I/O interfaces will be put<br />

in the same category as the power used by the physical I/O<br />

pins). Interfaces like system memory, Serial-ATA, Ethernet,<br />

PCI Express, audio, and USB are commonly integrated and they<br />

all consume power. I/O power is largely dependent on the<br />

system configuration and usage model. For example, a network<br />

gateway device may not implement any SATA devices, while a<br />

network attached storage (NAS) system may have many. The<br />

NAS unit use-case will involve lots of ethernet activity<br />

(increasing power used in that logic), while a machine<br />

controller may have very little. The portion of the total power<br />

envelope consumed by I/O can’t be used by compute cores, so<br />

changes in configuration or usage model can impact achievable<br />

core performance when processors are power limited.<br />



Including the system configuration and I/O usage model in the<br />

workload definition is key when attempting to improve<br />

performance determinism.<br />
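The I/O budgeting argument above reduces to a simple subtraction; the per-interface power figures below are invented placeholders, not datasheet values.

```python
# Sketch: power available to compute cores is the package limit minus what
# the configured I/O consumes.  Per-interface figures are hypothetical.
IO_POWER_W = {"sata_port": 0.5, "gbe_port": 0.7, "usb3_port": 0.3}

def core_budget(package_limit_w, config):
    io_w = sum(IO_POWER_W[name] * count for name, count in config.items())
    return package_limit_w - io_w

nas     = core_budget(15.0, {"sata_port": 4, "gbe_port": 2})   # storage box
gateway = core_budget(15.0, {"gbe_port": 4})                   # network box
```

Even with identical processors and package limits, the two invented configurations leave different amounts of power for the cores, which is exactly why achievable core performance shifts with system configuration.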

c. EXPLOITING DEVICE VARIATIONS<br />

The natural result for the power limited (versus fixed<br />

frequency) model is that performance is maximized for each<br />

workload, but frequency is not predictable with workload<br />

changes. In any scenario where the workload triggers<br />

temperature or power-limit throttling, performance can be<br />

degraded relative to the fixed-frequency model. System designers<br />

can avoid temperature throttling by developing enough<br />

headroom into the thermal solution to ensure maximum<br />

temperature is never reached. After all, the maximum sustained<br />

power level is a known quantity and airflow and ambient<br />

temperature limits can be specified for the final system. Power<br />

throttling is a more difficult challenge due to the part-to-part<br />

variations discussed earlier that affect power consumption.<br />

Two samples of the same processor model could have<br />

differences in their leakage power, causing one unit to reach its<br />

power limit at a lower average frequency even when running an<br />

identical workload under identical operating conditions.<br />

Vendors happily exploit this difference by allowing the lower<br />

leakage units to spend more time at higher frequency, yielding<br />

better performance. Earlier discussions of voltage<br />

dependencies reveal why different processor units of the same<br />

model can also have different voltage requirements to achieve<br />

a given clock frequency. This difference can be exploited by<br />

fusing unit-specific voltage vs. frequency curves into each part<br />

that enable the power management controller to minimize core<br />

voltage. Reductions like this to active power allow those units<br />

to further increase average frequencies before reaching power<br />

limits. Fortunately, lower leakage devices tend to also require<br />

higher voltages to reach the same frequency as a higher leakage<br />

device, so these two factors work to cancel each other out rather<br />

than compound. Despite this, material differences in the<br />

consumed power can remain.<br />

Many real-world PC use-cases have been found to be<br />

bursty, where applications often sit idle waiting for user input<br />

and then perform some activity before waiting again. This<br />

could be a user starting a program or loading a new web page.<br />

Periods of inactivity will naturally coincide with low power and<br />

lower die temperature. Some processors take advantage of this<br />

situation by defining a maximum power limit that is greater<br />

than the sustained power limit. The processor can be allowed<br />

to reach this higher power consumption for a short amount of<br />

time that is “thermally insignificant”. Thermal solutions have<br />

a relatively large thermal inertia, meaning it takes a while for<br />

the processor to raise its temperature to a steady-state value.<br />

Increasing the power limit in this way allows for short periods<br />

of increased performance benefiting bursty workloads, but at<br />

the cost of performance determinism. The operating<br />

environment now has another mechanism by which to affect<br />

performance, and workloads may have to run for several<br />

minutes to reach a steady state behavior.<br />
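One way such a "thermally insignificant" burst allowance could work is a running-average limit, similar in spirit to the running-average power limits used on x86 parts. The algorithm and all constants below are an illustrative sketch, not a vendor's actual implementation.

```python
# Sketch of a burst budget: instantaneous power may exceed the sustained
# limit as long as an exponentially weighted running average stays below
# it.  Limits and the smoothing factor are invented for illustration.
def allows_burst(samples_w, sustained_w=15.0, burst_w=25.0, alpha=0.2):
    avg = samples_w[0]
    for p in samples_w:
        if p > burst_w:
            return False          # hard instantaneous cap always enforced
        avg = alpha * p + (1 - alpha) * avg
        if avg > sustained_w:
            return False          # average has exhausted the thermal budget
    return True
```

A short burst above the sustained limit passes because the average barely moves, while a sustained excursion fails almost immediately — mirroring the bursty-workload benefit and the loss of determinism described above.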

d. REGULATOR TELEMETRY<br />

Since processor performance limitations boil down to power<br />

in so many ways, accurately determining power consumption is<br />

critical to maximizing performance. Measuring processor<br />

power in real time requires accurate current sensing, which is not<br />

practical for implementation on high-speed digital process<br />

technologies. Until recently, processor power management<br />

technology relied on power curves derived from actual power<br />

measurements at manufacturing test time with a reference<br />

workload. Values were programmed into the processor and<br />

combined with run-time data from complex activity monitors in<br />

the logic. Management algorithms calculated power usage to<br />

ensure power limit adherence. This method allowed some<br />

exploitation of part-to-part variations but still required moderate<br />

guard-banding due to the inaccuracy of the activity monitors’<br />

power estimates. Conservative estimations of power<br />

consumption leave performance headroom untapped. A recent<br />

change seen with AMD processors is use of power telemetry<br />

data from the regulators powering the primary voltage rails.<br />

Real-time voltage and current data allows the power<br />

management unit to be much more accurate in its total power<br />

calculation. Doing so enables every variation of the unit that<br />

affects power consumption to be factored in along with<br />

instantaneous environmental circumstances (i.e., temperature)<br />

and exploited for performance gain. Naturally, maximizing<br />

performance in this way increases non-determinism across units.<br />
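The difference between the two accounting methods can be sketched as follows; the rail values, event weights, and 15% guard-band are all invented for illustration.

```python
# Sketch: total power computed from regulator telemetry (voltage, current
# per rail) versus an activity-monitor estimate that needs a guard-band.
def telemetry_power(rails):
    """rails: list of (volts, amps) samples reported by the regulators."""
    return sum(v * i for v, i in rails)

def estimated_power(activity_counts, w_per_count, guard_band=1.15):
    # activity-based estimate is inflated by a guard-band (hypothetical 15%)
    raw = sum(c * w for c, w in zip(activity_counts, w_per_count))
    return raw * guard_band

telem = telemetry_power([(1.2, 8.0), (1.0, 3.0)])   # measured: 12.6 W
est   = estimated_power([100, 40], [0.09, 0.05])    # estimated with margin
```

The guard-banded estimate reports more power than is actually consumed, and that gap is exactly the performance headroom that regulator telemetry recovers.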

IV. REDUCING EFFECTS OF POWER MANAGEMENT ON PERFORMANCE DETERMINISM<br />

Maximization of performance at the cost of determinism<br />

works well for consumer use-cases where the user does not rely<br />

on repeatable performance across multiple systems. Enterprise<br />

and embedded systems can be quite different and may not be<br />

able to tolerate performance variations across units. However,<br />

it is important to differentiate the need for a minimum performance<br />

versus true performance determinism. For digital signage or<br />

casino gaming machine examples, a minimum performance<br />

need likely applies. Functionality and user satisfaction are not<br />

affected if the frame processing time varies slightly across units<br />

as long as it is fast enough to meet the level of the content (e.g.,<br />

60fps) in all cases. Units that complete work faster may simply<br />

spend more time idle between frames, which would be<br />

unobservable to the user. Special cases like industrial machine<br />

controllers or military applications may require truly repeatable<br />

performance due to sensitive timing interactions. Even some<br />

datacenters desire such repeatability so that job execution time<br />

can be predicted regardless of which system it is scheduled on.<br />

It should be clarified that true hardware determinism is not<br />

possible with modern x86 CPU architectures. Small timing<br />

variations can exist because of interactions between hardware<br />

and software, and hardware interrupts can occur with<br />

unpredictable timing. Some amount of variation will always<br />

exist, but there are ways to improve the situation, particularly for<br />

variations caused by power management. One thing that system<br />

designers can count on is that improving performance<br />

determinism will come with a cost to peak performance.<br />

a. MINIMUM PERFORMANCE LEVEL<br />

Ensuring a minimum performance level begins with testing<br />

in a worst-case environment. The specific workload of interest<br />



must be run on a worst-case processor sample operated at<br />

maximum temperature. The frequency behavior of the<br />

processor and the resulting performance of the workload should<br />

represent the lowest level of any sample in the distribution. If<br />

the performance is still acceptable, then the processor model<br />

choice is sufficient. System designers can have confidence that<br />

all samples of the chosen model will perform at this level or<br />

better. Of course, changing processor models or modifying the<br />

workload means testing must be repeated. Unfortunately,<br />

worst-case samples are rare and processor vendors don’t<br />

usually supply them upon request. Holding the processor die<br />

under tight temperature control while running an active<br />

workload can also be difficult, and usually requires specialty<br />

thermal equipment like thermal stream blowers or oil baths.<br />

Many embedded designers will need an alternative method to<br />

ensure their minimum performance level.<br />

Most x86 processors sold today specify a base and boost<br />

frequency for CPU cores. A few models even do the same for<br />

integrated GPUs. A good rule of thumb has been that base<br />

frequency should be sustainable for all processor samples, but<br />

designers must understand when this can be broken. Processor<br />

vendors generally do intend for base frequency to be sustainable<br />

on all cores of a CPU under “real world” workloads.<br />

This distinction matters because the gap in worst-case power<br />

density between real and synthetic applications can be very large. The example<br />

provided earlier used two synthetic workloads for uniformity of<br />

power usage but it still illustrates the point. Real applications<br />

tend to have a mixture of compute, memory, and I/O operations<br />

while synthetic “power viruses” can intentionally loop on small,<br />

power dense instruction sequences. As previously discussed,<br />

defining base frequency with the most power dense workload<br />

available would be extremely conservative and would<br />

artificially limit performance of more typical workloads. The<br />

catch is that definition of “real-world” is subjective and varies<br />

by vendor and product. Vendors will choose reference<br />

workloads to represent the worst-case of the real-world, and then<br />

use it to define base frequency of the product. If the reference<br />

workload is known, both it and the custom workload can be<br />

compared on a random sample. If the power density of the<br />

custom workload is less than the reference workload, then it<br />

should be able to sustain base frequency across all units of the<br />

model distribution. To get an accurate measurement, testing<br />

should be performed on a specific sample/system in a fixed<br />

configuration and environmental conditions. CPU core<br />

frequency boost should be disabled to prevent reaching<br />

temperature or power/current limits (boost disable is commonly<br />

available in system BIOS 1 firmware options of most x86<br />

platforms). If either workload can reach these infrastructure<br />

limits then results will be skewed. Both AMD and Intel provide<br />

tools to log processor power, but they do not publicly disclose<br />

reference workloads. Such information must be obtained under<br />

non-disclosure agreements. If comparison is successful and the<br />

custom workload’s power density is less than the reference<br />

workload, then the performance of the boost-disabled scenario<br />

should be achievable across all units. Capping frequency in this<br />

way does reduce peak performance, but that is the sacrifice<br />

required for consistency.<br />
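The comparison workflow described above reduces to a simple check against the reference workload's measured power; the 5% margin below is an arbitrary example, not vendor guidance.

```python
# Sketch of the decision step: with boost disabled on one test unit,
# measure package power for the custom and reference workloads.  Base
# frequency should be sustainable if the custom workload draws no more
# power, minus a small margin (invented 5%) for measurement noise.
def base_freq_sustainable(custom_w, reference_w, margin=0.05):
    return custom_w <= reference_w * (1.0 - margin)

verdict = base_freq_sustainable(11.0, 13.0)   # hypothetical measurements
```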

1 Basic Input / Output System<br />

If characterization of the custom workload reveals it is more<br />

power dense than the reference workload, then further frequency<br />

reduction is required to ensure minimum performance across<br />

units. In addition to disabling boost states, frequency should be<br />

reduced until power/current measurements are below those of<br />

the reference workload on the test unit. Once a suitable<br />

frequency limit is determined, performance can be evaluated for<br />

acceptability. A challenge with this case is that setting a CPU<br />

frequency limit below base frequency can require more invasive<br />

software modification. On Linux, a userspace daemon (e.g.,<br />

cpufreqd) or the kernel’s cpufreq interface can hold cores at a<br />

specific P-state, and this mechanism can be used to limit CPU<br />

frequency with some processor architectures. For Windows<br />

operating systems, custom modifications must be made to the<br />

ACPI 2 PSS table in BIOS that communicates supported CPU P-<br />

states to the operating system (OS) (Unified Extensible<br />

Firmware Interface Forum, 2017). Higher, unwanted frequency<br />

states can be removed, and the table rebuilt. The same method<br />

can be used for other operating systems that support the ACPI<br />

_PSS table. Modification of the table takes significant BIOS<br />

expertise and access to source code. Once a P-state is identified<br />

that brings power density of the custom workload below that of<br />

the reference workload, performance can again be evaluated for<br />

acceptance. It should be noted that this method of comparing<br />

power density is less exact than the ideal method of using a true<br />

worst-case sample. Including some reasonable margin into the<br />

operating point is wise.<br />
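A sketch of the table-editing step described above, using a simplified two-field stand-in for real _PSS packages (which carry more fields, e.g., latency, control, and status values):

```python
# Sketch: remove _PSS entries above a chosen frequency cap, as a BIOS
# developer would when rebuilding the table.  Entries are simplified to
# (core_frequency_mhz, power_mw); real _PSS packages carry more fields.
def cap_pss(pss_table, max_mhz):
    capped = [entry for entry in pss_table if entry[0] <= max_mhz]
    if not capped:
        raise ValueError("cap would remove every P-state")
    return capped

pss = [(3500, 15000), (3000, 11000), (2500, 8000), (1600, 5000)]
limited = cap_pss(pss, 2500)   # drops the 3500 and 3000 MHz states
```

The OS then only ever sees the remaining states, so it cannot request a frequency above the cap — which is the point of rebuilding the table.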

Mixed workloads that highly utilize both CPU and GPU<br />

cores simultaneously complicate the ability to confirm if a<br />

custom workload is more power dense than the reference<br />

workload. If a GPU base frequency is defined at all, reference<br />

workloads for CPU and GPU are likely measured independently<br />

so there is not significant interaction. Using the workload power<br />

density comparison method would also require tools to provide<br />

power data separately for each core type, which processor<br />

vendors do not typically provide. Workloads of this type cannot<br />

establish a reliable minimum performance operating point<br />

without assistance from the processor manufacturer. In cases<br />

where GPU performance is not critical, its maximum frequency<br />

could be set to a very low value in the interest of limiting its<br />

contribution to processor power and possibly avoid<br />

power/current throttling of CPU cores (AMD integrated GPUs<br />

use a vendor-specific “PowerPlay” table in BIOS to define<br />

frequency states for the GPU, much like the ACPI PSS table for<br />

CPUs; they can be edited, but this approach is not universal to<br />

other vendors). However, without a way to quantify the power<br />

usage to a reference point, ensuring repeatable performance<br />

requires adding a large, arbitrary guard-band to core frequencies.<br />

There will always be some uncertainty about coverage for worst-case<br />

samples.<br />

b. DETERMINISTIC PERFORMANCE<br />

Any system that desires performance determinism from the<br />

processor will need to start disabling power management<br />

features to get as close as possible to the old fixed-frequency<br />

model. The first to go are those features that provide temporary<br />

performance improvements based on real-time environmental<br />

factors. Examples discussed earlier include temperature-based<br />

2 Advanced Configuration and Power Interface Specification<br />



boosting or time-based power excursions (e.g., AMD STAPM 3<br />

and sub-features of Intel DPTF 4 ). These features often don’t<br />

provide value anyway for embedded use-cases where a<br />

workload is run continuously. The ability to disable them may<br />

not be exposed in off-the-shelf embedded platforms, but most<br />

can be turned off by the system BIOS developer. Review the<br />

processor documentation thoroughly to understand which of<br />

these features can be disabled.<br />

Frequency boosting is also scenario driven and therefore<br />

must be disabled. Even if a frequency in the boost range is<br />

sustainable for the custom workload on a worst-case sample,<br />

current processors do not allow fixed frequency operation in this<br />

range. Operating systems are unaware of boost frequencies, so<br />

there are no OS-level mechanisms to set them. To fix CPU<br />

frequency, a value at or below base must be used. From there,<br />

evaluation of the workload power density compared to the<br />

reference workload can be used to determine if CPU frequency<br />

must be further reduced below base frequency. The method<br />

described earlier still applies.<br />

After the new maximum frequency has been established and<br />

set, states below this must also be eliminated to ensure fixed-frequency<br />

behavior of cores. Latency is increased when cores<br />

transition to low frequency during idle periods, reducing<br />

deterministic behavior. As described in the previous section, OS<br />

frequency governors can be used with Windows and Linux to<br />

set “performance” mode which will ensure hardware does not<br />

go below base frequency. This method has been verified on<br />

AMD Embedded R-series and G-series processors, as well as<br />

Intel 7 th Generation Core processors. Excursions below base<br />

frequency can still occur if triggered by thermal throttling, but<br />

proper design, as outlined here, can prevent it. Despite being<br />

effective for Windows and Linux, designers requiring<br />

determinism will likely be running a real-time OS. RTOSs with<br />

support for ACPI PSS tables can still use that method. Others<br />

with no CPU P-state management must rely on the platform<br />

BIOS to set the desired state before the OS handoff.<br />

If the workload is mixed for CPU and GPU, the same<br />

complications previously discussed apply. Extreme guard-banding<br />

could ensure fixed frequency operation across all units<br />

of a model distribution, but there is no reliable method to<br />

confirm that without manufacturer support. For applications<br />

that are willing to go the extra mile to secure deterministic<br />

performance, custom screening can be implemented where each<br />

sample is pre-tested to ensure operation within specific limits.<br />

Obviously, this kind of screening is very costly in both<br />

infrastructure and labor but has found use in military markets<br />

where cost sensitivity is low.<br />

V. VENDOR PROVIDED HARDWARE DETERMINISM<br />

AMD has recognized the demand for improved determinism in some enterprise and high-end embedded applications and has introduced dual operating modes in their EPYC line of enterprise processors to address differing needs. A “Power Determinism Mode” offers higher performance by taking advantage of many of the mechanisms described earlier in this paper, including part-to-part variations (Fruehe, 2017), though server processors are more conservative in this area. The processor will exploit some of these differences to reach a maximum (i.e., deterministic) power consumption for a given workload (at a given temperature) and thus maximize performance while maintaining infrastructure power limits.<br />

“Performance Determinism Mode” offers the unique ability to achieve the same performance with every processor of a given TDP. Creating repeatable performance requires part-specific power and frequency curve data to be fused into the device at production time. This data essentially provides a negative performance offset that can be used to make each individual unit replicate the performance of a worst-case unit of the entire model distribution. The power management controller also uses the predictable calculated-power method based on activity monitors instead of regulator telemetry data. Enablement of the feature is a simple compile-time option in the BIOS firmware.<br />

In performance determinism mode, part-to-part variations will only result in differences in power consumption for a given workload (at a given temperature) while performance (derived from frequency behavior) is minimally impacted and remains consistent. This type of reliable hardware determinism can only be provided by the processor manufacturer, and it ensures that only the minimum necessary performance sacrifice is made to achieve that determinism. The performance determinism mode certainly simplifies system architecture for designers looking for improved performance determinism for enterprise applications, and its existence is noteworthy given the topic of this paper. However, it is yet to be seen if this kind of feature will find its way into lower-power embedded processor products from AMD or Intel.<br />

3<br />

Skin Temperature Aware Power Management<br />

4<br />

Dynamic Power and Thermal Framework<br />

www.embedded-world.eu<br />



REFERENCES<br />

Cutress, I. (2017, February 22). AMD Launches Zen. Retrieved from Anandtech.com: http://www.anandtech.com/show/11143/amd-launch-ryzen-52-more-ipc-eight-cores-for-under-330-preorder-today-on-sale-march-2nd<br />

Fruehe, J. (2017). Power / Performance Determinism. Moor Insights and Strategy.<br />

Maricau, E., & Gielen, G. (2013). Analog IC Reliability in Nanometer CMOS. Analog Circuits and Signal Processing, DOI: 10.1007/978-1-4614-6163-0_2, 23-28.<br />

Howse, B. (2015, November 26). Examining Intel's New Speed Shift Tech on Skylake: More Responsive Processors. Retrieved from Anandtech: https://www.anandtech.com/show/9751/examiningintel-skylake-speed-shift-more-responsive-processors<br />

Roy, K., Mukhopadhyay, S., & Mahmoodi-Meimand, H. (2003). Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits. Proceedings of the IEEE.<br />

Unified Extensible Firmware Interface Forum. (2017, May). Retrieved from Unified Extensible Firmware Interface Forum: http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf<br />

Wolpert, D., & Ampadu, P. (2012). Managing Temperature Effects in Nanoscale Adaptive Systems (pp. 22-24). Springer.<br />



Understanding Where Power Goes in Energy<br />

Efficient Systems<br />

Rod Watt<br />

Director of System Architecture, Arm<br />

Cambridge, UK.<br />

Abstract—The traditional way of measuring the efficiency of a<br />

system is to measure the overall system power while running a<br />

series of synthetic benchmarks and see which scores highest<br />

while minimizing power consumption, or simply to run use<br />

cases and measure the battery drain.<br />

Of course, the entire system including the main processor, the<br />

memory, the Wi-Fi chip set, the LCD and the rest of the circuitry<br />

all contribute to the drain on the battery. Measuring the power at<br />

the battery level will certainly provide data on how the entire<br />

system is performing but will not give the granularity required to<br />

really understand where the power is going during the task.<br />

This paper will discuss techniques and procedures used to allow<br />

systems to be measured down to the SoC level, leading to a much<br />

deeper understanding of the system’s overall efficiency.<br />

It will also compare the differences in CPU activity and usage<br />

between synthetic benchmarks and traditional everyday use<br />

cases.<br />

Keywords—power, energy, efficient, synthetic, workloads<br />

I. INTRODUCTION<br />

When consumers are deciding to buy a new device, they will<br />

typically start off with a few basic requirements in terms of<br />

functionality. For example, the choice of a set top box may<br />

include requirements covering connectivity, picture quality,<br />

applications support and so on. Looking at what is available in the marketplace, several systems will no doubt meet these basic requirements. Looking beyond the top-level features, the consumer may choose to look at the detailed specifications of the system. Details such as the number and speed of the processors and the size of the memory in the system may give an indication of which system is the highest-performing solution, but it can be difficult to decide this based purely on the specifications. Equally, a system may come with a higher-specified power supply, which may suggest that this system consumes more power, but again, it may be naive to assume this.<br />

II. TRADITIONAL METHODS OF COMPARISON<br />

There are two basic methods for comparing the<br />

performance of a system.<br />

1) Use Cases<br />

Basically, this involves using the system and running typical<br />

workloads. For example, if it were a set top box, the user may<br />

choose to play some video content and check for picture<br />

quality and download speed. Playing a graphics-intensive game and assessing performance in terms of screen lag and sustained frames per second would indicate how the graphics processors compare in the different systems.<br />

2) Synthetic Benchmarks<br />

If the user wishes to stress the system further, synthetic<br />

benchmarks may be of use. Although these do not typically<br />

replicate what a user would actually do, they will stretch the<br />

system and attempt to push the compute subsystems to their<br />

limits.<br />

Although both methods will give an indication of the<br />

performance of the system, neither will provide any<br />

information on the power or energy that was consumed to<br />

attain that performance. Without an appreciation of the power<br />

consumption, these measurements will only provide part of the<br />

answer.<br />

III. POWER VS ENERGY<br />

Frequently, commentators will use the words “Power” and<br />

“Energy” when talking about system consumption. However,<br />

it’s important to understand the differences between these two<br />

and how they should be used when discussing overall system<br />

efficiency.<br />



Power is an instantaneous measurement that deals with a point<br />

in time.<br />

P(t) = V(t) x I(t) (1)<br />

where P(t) is the power in Watts at time t, V(t) is the voltage in Volts at time t, and I(t) is the current in Amps at time t.<br />

It is important to note that, in this equation, the time during which the current flows is not considered. The power being measured is purely the product of the voltage that is being applied and the current that is flowing.<br />

In contrast, energy, typically measured in Joules, is the amount of energy that is consumed over time:<br />

E = ∫ P(t) dt (2)<br />

where E is the energy in Joules, P(t) is the power in Watts at time t, and dt is the time interval.<br />

Alternatively, power can be described as the rate at which energy is consumed:<br />

P = E / dt (3)<br />

where P is the power in Watts, E is the energy in Joules, and dt is the time interval.<br />

What's important to note here is that energy takes time into consideration, whereas power is normally just an instantaneous measurement taken at a specific time.<br />

To illustrate the difference, consider two systems, “A” and “B”. System “A” consumes a high level of current for a short space of time, whereas system “B” consumes a lower average current over a much longer period.<br />

Comparing the power consumption of both systems (assuming both systems run off the same voltage), System “A”, with its higher average current, will consume the higher power. However, since System “A” only draws this high current for a short period of time, it may consume less energy than System “B”, which is drawing a (lower) current for a much longer time.<br />

It's energy that is the important measurement, not just power at an instantaneous time. It's energy that will drain the battery (current and power over time) and will also increase the electricity bill!<br />

Figure 1 Power vs. Energy<br />

IV. DEFINING EFFICIENCY<br />

Efficiency is defined as the ratio of useful work performed by a machine or a process to the total energy expended:<br />

Eff = W / E (4)<br />

where Eff is the efficiency, W is the work or task completed, and E is the energy.<br />

In this equation, efficiency is calculated as a relative number, not an absolute value. This is because how “Work” is measured will vary depending on the workload that is being run.<br />

For example, if the workload is a synthetic benchmark, the measure of work could be the score that is reported. Comparing the efficiency of both systems (calculated by dividing the benchmark score by the energy consumed to complete the benchmark) would allow the user to calculate the relative efficiency of both systems.<br />

Similarly, the systems' efficiency could also be compared using traditional use cases. In this example, the “Work” will be defined depending on what the use case actually is. For example, if the use case is a video, the work could be defined as the frames per second that can be achieved, or it could be defined as the speed at which the video was loaded. If the use case was an application, the work could be defined as the speed at which the application is started.<br />

As long as the definition of the work is consistent during testing, the calculated efficiency can be used to compare the two systems.<br />
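The distinction between power and energy can be illustrated with a short calculation. The following Python sketch is not from the paper; the voltage, currents and durations for systems “A” and “B” are invented purely to show that the higher-power system can be the lower-energy one.<br />

```python
# Power vs Energy for the hypothetical systems "A" and "B" described above.
# All numbers here are invented for illustration.

VOLTAGE = 5.0  # assume both systems run off the same voltage rail

def power_w(current_a):
    # Equation (1): P = V x I
    return VOLTAGE * current_a

def energy_j(current_a, duration_s):
    # Equation (2) for a constant draw: E = P x t
    return power_w(current_a) * duration_s

p_a, e_a = power_w(2.0), energy_j(2.0, 5.0)    # System A: high current, short time
p_b, e_b = power_w(0.5), energy_j(0.5, 60.0)   # System B: low current, long time

print(p_a, p_b)  # 10.0 W vs 2.5 W: A draws the higher power
print(e_a, e_b)  # 50.0 J vs 150.0 J: yet A consumes the lower energy
```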



V. EFFICIENCY – A REAL EXAMPLE<br />

For this example, consider two similar systems, System A and<br />

System B. The challenge is to decide which one is the “best”.<br />

Of course, being best can mean many things but, in this<br />

example, Efficiency will be considered the measure of choice.<br />

Firstly, running a simple synthetic benchmark will give an<br />

indication of performance.<br />

TABLE I.<br />

Measurement System A System B<br />
Score 43329 12350<br />

In this case, System A scores 43329 vs. 12350 on System B, suggesting that System A is approximately three times “better” than System B.<br />

This test can then be repeated; however, this time the system voltage and current will be noted. This allows the average system power to be calculated.<br />

TABLE II.<br />

Measurement System A System B<br />
Score 43329 12350<br />
Average Current 1898mA 1048mA<br />
Average Voltage 3.4V 2.78V<br />
Average Power 6495mW 2920mW<br />

Looking at the average power consumption shows that although System A did achieve the higher score, it did so while drawing more current and hence more power. So, although System A is providing higher performance, it's taking more power to achieve this. At this stage, an efficiency calculation could be done but, as discussed earlier, for this to be accurate the energy, not the power, needs to be considered.<br />

Running the tests again, but this time noting the time in addition to the other metrics, allows the energy, and hence the energy efficiency, to be calculated.<br />

TABLE III.<br />

Measurement System A System B<br />
Score 43329 12350<br />
Average Current 1898mA 1048mA<br />
Average Voltage 3.4V 2.78V<br />
Average Power 6495mW 2920mW<br />
Time For Test 5.59s 13.95s<br />
Consumed Energy 2527uAH 2836uAH<br />
Efficiency 17.1 4.4<br />

Figure 2 Relative Scores<br />

Even though System A consumed a higher current, it did so over a much shorter time, so the overall energy consumption was less. Calculating the relative energy efficiency (score/energy), it can be seen that System A is nearly four times more energy efficient than System B for this specific workload. Other workloads may give different results.<br />

So, it can be seen, in order to calculate the efficiency of a<br />

system, two main measurements are required:<br />

1) The energy consumption of the system<br />

2) A measure of work<br />
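The calculation above reduces to a single division once both measurements are in hand. The sketch below uses the figures from TABLE III; the helper name is our own, not from the paper.<br />

```python
# Relative efficiency from the TABLE III measurements: work (benchmark
# score) divided by consumed energy. Only the ratio between systems is
# meaningful, since "work" here is a unitless score.

def efficiency(score, energy_uah):
    return score / energy_uah

eff_a = efficiency(43329, 2527)   # System A
eff_b = efficiency(12350, 2836)   # System B

print(round(eff_a, 1), round(eff_b, 1))   # ~17.1 vs ~4.4, as in Table III
print(round(eff_a / eff_b, 1))            # System A is roughly 3.9x more efficient
```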

VI. THE PRACTICALITIES OF POWER MEASUREMENT<br />

Different systems will require different methods for measuring<br />

power. Ultimately, it will come down to a number of variables,<br />

including<br />

1) The type of system<br />

2) The accuracy required<br />

3) Time and money<br />

This paper will discuss a few of the more common methods.<br />

A. Fuel Gauges<br />

Many consumer systems, such as mobile phones and tablets<br />

contain a “Fuel Gauge” which monitors the battery capacity<br />

and the current flow from it. Typically, these are used in<br />

mobile devices to report the amount of energy that is<br />

remaining in the device.<br />

It is possible to access the fuel gauge via software and also to<br />

monitor the current flow from the battery. Although this is<br />

non-invasive, the results can vary greatly in accuracy.<br />
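On Linux-based consumer devices, the fuel gauge is often exposed through the power_supply sysfs class, so a software read can be very simple. The sketch below assumes that convention; the node path and names vary between kernels and handsets and are not taken from this paper.<br />

```python
# Minimal sketch of reading a fuel gauge through the Linux power_supply
# sysfs class. Node names and availability vary between devices; the
# path below is a common convention, not guaranteed to exist.
from pathlib import Path

BATTERY = Path("/sys/class/power_supply/battery")

def read_gauge(node):
    """Return the integer value of a gauge node, or None if unavailable."""
    try:
        return int((BATTERY / node).read_text())
    except (OSError, ValueError):
        return None

voltage_uv = read_gauge("voltage_now")   # microvolts
current_ua = read_gauge("current_now")   # microamps

if voltage_uv is not None and current_ua is not None:
    # Instantaneous battery power, P = V x I
    print((voltage_uv / 1e6) * (current_ua / 1e6), "W")
```

Sampling such readings repeatedly and integrating over time gives an energy estimate, subject to the accuracy caveats noted above.<br />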



Additionally, not all consumer systems have fuel gauges on-board.<br />

Figure 3 Fuel Gauge Accuracy<br />

B. Measure at the Battery<br />

When a consumer device is powered from a battery, it may be possible to bypass this battery and power the system from a metered power supply. The practicalities of removing and bypassing the battery will be system specific. Some systems, with removable batteries, make this a relatively easy process. In this case, the battery's terminals may be isolated and wires added to bypass it. In the case of a system which is not designed to have a removable battery, this process can be trickier. Physically removing the battery, getting access to the terminals and adding the bypass wires can be problematic. Assuming this process can be achieved, a metered power supply can be connected to power the system. Although this process is a good alternative when a fuel gauge is not accessible, it is invasive, with the level of difficulty depending on the specific system.<br />

C. Measure at Power Jack<br />

Typically, development boards will be powered from a DC power supply which connects to a power jack on the board. In this case, it's relatively easy to modify the cable to enable it to be plugged into a metered power supply. This has the advantage that the development board itself does not have to be modified, just the power lead. It may be advisable to purchase an additional power lead to keep the original one intact.<br />

D. Measure at USB Connector<br />

Some development boards, such as the Raspberry Pi, do not use a separate power lead for the system, electing instead to use a USB-style connector. Similar to modifying a normal power lead, the USB lead can also be modified to enable an external power supply to be used. In the case where the USB connector is also used to transmit data, for example as a debug port, this method has the disadvantage that this capability will be lost. Typically, this port would be connected to a personal computer that would supply power in addition to providing the debug port.<br />

E. USB Power Monitor<br />

Relatively inexpensive USB power monitors can be used to measure the power being provided to the USB port. These devices are totally non-invasive (there is no need to modify the board) and also enable data to be transmitted and received. Unfortunately, they only indicate the instantaneous current and voltage, not the average.<br />

F. Modified USB Cable<br />

To overcome the problem of only being able to monitor<br />

instantaneous current and voltage, another method involves<br />

modifying the USB cable to enable direct measurement. In<br />

this case, a shunt resistor, typically 5-10 mΩ, is inserted into<br />

the power lines within the USB cable. By measuring the<br />

voltage drop across this shunt resistor, it is possible to<br />

calculate the current flow. When the absolute voltage of this<br />

power line is also noted, (by tapping into one side of the shunt<br />

resistor and the ground wire), the power consumption can be<br />

calculated.<br />

This method does enable accurate measurements of the power<br />

but requires relatively expensive test equipment. Typically, an<br />

instrument known as a data acquisition unit, or DAQ, is<br />

used to monitor the power rail continually. This can be a<br />

relatively expensive piece of equipment. Development boards<br />

can use numerous USB cable types, for instance micro, mini and<br />

USB-C; in this case, a cable of each type used will need to be<br />

modified.<br />
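The arithmetic behind the shunt method can be sketched as follows. The shunt value and the DAQ samples are invented for illustration; a real capture would contain thousands of samples.<br />

```python
# Shunt-resistor power measurement as described above: the DAQ records the
# voltage drop across the shunt plus the rail voltage, from which current,
# power and energy are derived. Sample values are invented for illustration.

R_SHUNT = 0.005  # 5 mOhm shunt inserted into the USB power line

# (time_s, shunt_drop_v, rail_v) tuples as a DAQ might log them
samples = [(0.0, 0.0025, 5.01), (0.1, 0.0030, 5.00), (0.2, 0.0028, 5.00)]

energy_j = 0.0
prev_t = prev_p = None
for t, v_drop, v_rail in samples:
    current = v_drop / R_SHUNT    # Ohm's law: I = V_shunt / R
    power = v_rail * current      # instantaneous power: P = V x I
    if prev_t is not None:
        # trapezoidal approximation of E = integral of P dt
        energy_j += 0.5 * (power + prev_p) * (t - prev_t)
    prev_t, prev_p = t, power

print(energy_j, "J consumed over the capture window")
```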

G. Modified USB Power Monitor<br />

One solution to the problem of having to modify multiple<br />

USB cables, is to modify a USB power monitor. Although<br />

these devices are designed to show the instantaneous current<br />

and voltage, it is possible to tap into the internal power rails<br />

and shunt resistors, which can then in turn be monitored using<br />

a DAQ as previously described. As the USB interface to these<br />

devices tends to be a standard USB type A connector (plug in,<br />

socket out), these enable various USB cables to be used.<br />

H. Measure at SoC<br />

All of the processes described above allow the total system<br />

power and, by implication, total system energy to be<br />

measured. Although this is of great interest from a user point<br />

of view (a user wants to know when the battery will run out or<br />



how much the system costs to run), it may be of less interest to<br />

a developer or engineer. When trying to understand the<br />

efficiency of a system, of course the total system<br />

power/energy is of interest. But, to truly understand the<br />

efficiency of the system, it’s important to have greater<br />

granularity; a deeper insight into which part of the system is<br />

consuming most/least energy.<br />

In order to understand this level of detail, it’s necessary to tap<br />

into various sections of the system and monitor the energy at a<br />

more detailed level. For example, a system may have separate<br />

power rails for the central processing unit, the graphics<br />

processor, the memory interface and the other peripherals.<br />

By adding shunt resistors into these individual power rails and<br />

using a DAQ, similar to the process described for modifying<br />

the USB cable, it is possible to monitor the individual power<br />

rails around the system. Although this method is very<br />

invasive, and potentially tricky, it does provide a very detailed<br />

insight into the energy consumption around the system.<br />
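Once per-rail energies have been captured, turning them into a breakdown is straightforward. In the sketch below, the rail names and Joule figures are invented for illustration.<br />

```python
# Turning per-rail energy measurements into a percentage breakdown, as in
# the per-rail shunt approach described above. Values are invented.

rail_energy_j = {"cpu": 12.4, "gpu": 21.7, "dram": 6.1, "peripherals": 2.8}

total_j = sum(rail_energy_j.values())
for rail, e in sorted(rail_energy_j.items(), key=lambda kv: -kv[1]):
    print(f"{rail:12s} {e:6.1f} J  {100 * e / total_j:5.1f}%")
```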

I. Summary of Methods<br />

A summary of the various methods described is shown in the<br />

tables below. Each method will have its own challenges and<br />

will provide its own level of accuracy. Ultimately, cost, time<br />

and the level of accuracy needed will determine the best<br />

method to utilize.<br />

TABLE IV.<br />

Method | Advantages | Disadvantages<br />
Fuel Gauge | Non-invasive. | Not always available.<br />
Measure at Battery | Works with/without fuel gauge. Accurate. | Invasive.<br />
Measure at Power Connector | Dev board does not require modification. | Not all boards have separate power inputs.<br />
Measure at USB Connector | Allows access to boards without separate power connectors. | Lose access to debug port.<br />
USB Power Monitor | Non-invasive. Enables data port to be used. | Only shows instantaneous power.<br />
Modified USB Cable | Enables data port to be used. Provides continuous power monitoring. | Each USB cable needs to be modified. Additional, expensive equipment required.<br />
Modified USB Power Monitor | Enables data port to be used. Can be used with any USB cable. Provides continuous power monitoring. | Additional, expensive equipment required.<br />
Measure at SoC | Provides power breakdown. | Very invasive. Can be tricky.<br />

TABLE V.<br />

Method | Accuracy | Granularity<br />
Fuel Gauge | Variable | System level<br />
Measure at Battery | Accurate | System level<br />
Measure at Power Connector | Accurate | System level<br />
Measure at USB Connector | Accurate | System level<br />
USB Power Monitor | Not accurate | System level<br />
Modified USB Cable | Accurate | System level<br />
Modified USB Power Monitor | Accurate | System level<br />
Measure at SoC | Very accurate | SoC level<br />

VII. BENCHMARKS VS REAL-LIFE WORKLOADS<br />

When deciding which workloads to use to exercise the system<br />

under test, a user typically has two choices.<br />

1) Synthetic Benchmarks<br />

These are workloads that have been designed specifically to<br />

test certain aspects of the design such as processor<br />

performance, frames per second achievable from the<br />

graphics processor, memory bandwidth and so on<br />

2) Real-Life use cases<br />

These are general workloads that a user would typically run<br />

on the system. Typically, different systems would run<br />

different types of workloads. For example, for a mobile<br />

phone, a real-life use case could be a low-end game. For a<br />

tablet, a use case could be web browsing.<br />

The table below summarizes the general advantages and<br />

disadvantages of each.<br />

TABLE VI.<br />

Method | Advantages | Disadvantages<br />
Synthetic Benchmarks | Easily repeatable. Provide a “score”. Designed to stress the major components. | Synthetic - not necessarily representative of what the system actually does. Variable/dubious quality. Can be tuned for a target device.<br />
Real-Life Use Cases | Not synthetic, this is what users actually do. Use is important, not quality. Cannot be tuned for a target device. | Difficult to repeat. Don't always stress the subsystem. Don't typically provide a final score.<br />

In order to test which workload, synthetic or real-life, would<br />

best represent the efficiency of the system, the following tests<br />

were conducted.<br />



Three different mobile phones were used for the experiment.<br />

The phones varied in cost and specification.<br />

The table below summarizes the specifications of each of the<br />

handsets used.<br />

TABLE VII.<br />

Attribute | Premium | Mid-Tier | Entry<br />
Cost | > $400 | > $200, < $400 | < $200<br />
Thickness | < 8mm | < 9mm | < 10mm<br />
Screen | > 1080p | < 1080p | < 720p<br />
CPU | 4 x big + 4 x LITTLE | 4 x little + 4 x LITTLE | 4 x LITTLE<br />
GPU | > 70fps T-Rex | > 40fps T-Rex | < 40fps T-Rex<br />
Memory | > 2GB | > 1GB | < 1GB<br />
Camera(s) | > 12MP | > 8MP | < 8MP<br />

Each handset had a different set of attributes; these were categorized as follows.<br />

1) Cost<br />
The average cost of the handset, SIM free, measured in US Dollars.<br />

2) Thickness<br />
The thickness of the handset, measured in mm.<br />

3) Screen<br />
The screen density.<br />

4) CPU<br />
The type and number of CPU clusters supported in each device. “4 x big” refers to 4 x Cortex-A57 processors, “4 x LITTLE” refers to 4 x Cortex-A53 processors running at > 1.5GHz, and “4 x little” refers to 4 x Cortex-A53 processors running at < 1.5GHz.<br />

5) GPU<br />
The performance of the graphics processor was measured by running the GFX-Bench T-REX benchmark and noting the scores. The GPU performance was rated on the scores achieved on this test.<br />

6) Memory<br />
The amount of DRAM in the handset, measured in GB (Gigabytes).<br />

7) Camera(s)<br />
The number of pixels the camera can support, measured in MP (Megapixels).<br />

VIII. COMPARING BENCHMARKS<br />

A series of benchmarks were run on each of the three<br />

handsets. The tests targeted some of the main functions of<br />

each handset such as the processor, the graphics, memory and<br />

storage. In addition to noting the scores of each test, the<br />

amount of energy that was consumed during the test was also<br />

measured. This enabled the efficiency of each test to be<br />

calculated. At the end, the efficiency of all the tests was<br />

averaged out across the three handsets and the results<br />

compared. A summary of the results is as follows.<br />

1) The results were normalized around the mid-tier<br />

device. So, in terms of the mid-tier device, its performance,<br />

energy and efficiency were all normalized to 1.<br />

2) The premium device was 2.7 x more efficient than the<br />

mid-tier device. This advantage was a combination of higher<br />

performance and lower energy consumption<br />

3) The low-end device was 40% less efficient than the<br />

mid-tier device. This disadvantage was a combination of lower<br />

performance and higher energy consumption<br />
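The normalization described above can be sketched as follows. The raw performance and energy numbers are invented; only the method (dividing each device's efficiency by the mid-tier figure) reflects the text.<br />

```python
# Normalizing efficiency results around the mid-tier device, as described
# above. The raw scores and energies are invented for illustration.

raw = {
    "premium": {"perf": 3.4, "energy_j": 120.0},
    "mid":     {"perf": 1.9, "energy_j": 180.0},
    "entry":   {"perf": 1.1, "energy_j": 175.0},
}

def normalized_efficiency(raw, baseline="mid"):
    eff = {name: r["perf"] / r["energy_j"] for name, r in raw.items()}
    return {name: round(e / eff[baseline], 2) for name, e in eff.items()}

print(normalized_efficiency(raw))  # the baseline device maps to 1.0 by construction
```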

Not surprisingly, the premium handset outperformed the other<br />

two in terms of benchmark scores, undoubtedly due to the<br />

higher specified CPU, GPU, memory and screen. What may<br />

be more surprising, was that in general, the premium handset<br />

achieved these high scores while consuming less energy. The<br />

higher rated processors were able to complete the time bound<br />

synthetic benchmarks in a shorter time and hence minimized<br />

the energy consumption.<br />

IX. COMPARING WORKLOADS<br />

The tests were then repeated but this time using real-life<br />

workloads. A series of tests were run on each of the three<br />

handsets. The tests targeted various workloads such as social<br />

and messaging, web access and gaming. Again, the<br />

performance, energy consumption and efficiency were<br />

calculated and averaged across all the tests. A summary of the<br />

results is as follows.<br />

1) As before, the results were normalized around the mid-tier<br />

device. So, in terms of the mid-tier device, its<br />

performance, energy and efficiency were all normalized to 1.<br />

2) Again, the premium device was the most efficient, but<br />

this had now dropped to 1.8 x more efficient than the mid-tier<br />

device. This advantage was a combination of higher<br />

performance and lower energy consumption<br />



3) This time, the low-end device was 13% more efficient<br />

than the mid-tier device. This advantage was a combination of<br />

higher performance and lower energy consumption<br />

The results of running real-life workloads on the handsets were<br />

very different than running synthetic benchmarks. The<br />

performance advantage of the premium device was not as<br />

marked as previously; often very similar to the mid-tier and<br />

entry level phones.<br />

In some cases, the entry level device actually out-performed<br />

the mid-tier device. The entry level device had a smaller<br />

screen, in terms of pixel count. This reduces the amount of<br />

data the phone may have to process during the workloads.<br />

X. BENCHMARKS VS REAL-LIFE - CONCLUSIONS<br />

Benchmarks are designed to stress the main processing<br />

elements within a system. Although a premium handset will<br />

show a performance and efficiency advantage over the others,<br />

when running less stressful, real-life workloads, this advantage<br />

is reduced.<br />

As the cost of the phone goes up, so does its overall<br />

specification. For example, the premium device and the mid-tier<br />

device both have bigger (in terms of screen density)<br />

screens than the entry level device. More processing power is<br />

required to service these higher screen densities and more<br />

power is required to run them.<br />

Synthetic benchmarks are designed to stress individual<br />

subsystems and measure peak performance. As they tend to<br />

target different parts of the system, (central processor, graphics<br />

processor, memory), no one benchmark really has the complete<br />

answer. In order to get a clearer understanding of the<br />

efficiency of the system, it’s advisable to run a number of these<br />

tests and look at the overall trend.<br />

Workloads are more representative of how the overall device<br />

will perform as they tend to put less stress on the main<br />

subsystems, instead spreading the load around the entire system.<br />

From this point of view, running real-life workloads will<br />

provide a better indication of the system efficiency whereas<br />

synthetic benchmarks will give an indication of the peak<br />

performance that an individual sub-system within the device<br />

can attain.<br />

XI. OVERALL BENEFITS OF EFFICIENT SYSTEMS<br />

Perhaps the most obvious benefit of a more efficient system is<br />

extended battery life. From a marketing point of view, a claim<br />

that the system provides a longer battery life than its<br />

competitors is a clear advantage. Manufacturers will use these<br />

claims in their marketing campaigns, sellers will use this to<br />

compare product lines and technical journalists like to quote<br />

this data in systems reviews. However, the design of efficient<br />

systems will also provide some secondary advantages.<br />

1) Cheaper / Fewer Components<br />

Power Management ICs or PMICs – devices that can be used<br />

to regulate the power inside a consumer device – can be<br />

relatively expensive. A recent analysis of the Bill of Materials<br />

(BOM) of a modern mobile device found that 4% of the<br />

silicon cost could be directly attributed to voltage regulation<br />

and power generation.<br />

If a system is designed to be more efficient, and hence able to<br />

run off a lower current rating, the cost of these components<br />

can be reduced.<br />

Lower current/power devices tend to be less expensive. If the<br />

current required for a power rail can be reduced, it may be<br />

possible to share the power generation of this power rail<br />

between power regulators, hence reducing the component<br />

count and BOM cost.<br />

2) Printed Circuit Board Stack up<br />

When designing a Printed Circuit board (PCB), there are a<br />

number of rules that the designer follows.<br />

PCBs tend to have an even number of layers and these layers<br />

are typically symmetrical around the middle layers. When<br />

routing high speed signals, such as dynamic memory address<br />

and data lines, these signals should be routed adjacent to a<br />

solid power or ground plane. This ensures that the signal has a<br />

fast and uninterrupted return path.<br />

When there are numerous power planes within a PCB, some<br />

designers will choose to use split power planes where one<br />

plane on the PCB can support multiple power rails. Although<br />

this helps reduce the layer count, it does restrict the routing of<br />

traces on adjacent signal planes as these cannot be routed<br />

across these split power planes. The alternative to split power<br />

planes is to add additional layers – this obviously has a cost<br />

implication. Typically, each additional two layers will add<br />

10% to the overall cost of the PCB.<br />

An efficient design may be able to replace split power planes<br />

with thick signal traces, due to the reduction in current. As<br />

well as helping to reduce the overall layer count, these can be<br />

routed in such a way as to minimize the impact of the high-speed signals on the adjacent layers.<br />

In addition, if the current is reduced, power traces can be<br />

shared between components, again, easing the layout<br />

challenges.<br />

3) Easier Thermal Design<br />

With higher energy dissipation come thermal challenges. A system that consumes more energy will naturally require additional cooling. Ideally a system should be<br />

passively cooled, i.e. cooling naturally in the surrounding air,<br />

rather than be actively cooled by a fan. There are a number of<br />

reasons for this.<br />



a) Fans and heat sinks will add to the overall cost of the<br />

system.<br />

b) The introduction of moving parts will introduce<br />

another point of failure, potentially reducing the overall<br />

reliability of the system.<br />

c) Typically, a larger enclosure is required to house the<br />

additional fans, adding cost.<br />

d) To enable air flow to/from these fans, additional<br />

tooling costs will be incurred in adding air vents and slots.<br />

e) Overall this bigger, noisier and heavier design will<br />

be less elegant than a more efficient, passively cooled design.<br />

4) Component Failure<br />

Systems tend to fail following the classic “bathtub” curve<br />

where the majority of the failures happen at the start or the end<br />

of a system’s life cycle.<br />

The “infant mortality” failures seen early in a system’s life cycle are mainly due to faulty components. This<br />

can be for a number of reasons but will include manufacturing<br />

and process defects. Typically, if a system survives this early<br />

stage in its life, additional defects don’t tend to occur until<br />

well into the system’s life cycle. Again, these “wear out”<br />

failures can occur for a number of reasons but one of the most<br />

common is thermal stress.<br />

Thermal stress is caused by a component continually heating up and cooling down – one of the byproducts of an inefficient system in which excess current, energy and heat must be dissipated. If the system is designed to be more power efficient, this thermal stress can be reduced and the life cycle of the product extended.<br />

XII. SUMMARY<br />

When comparing systems and deciding “which is best”, efficiency is a key measurement. Performance is an important metric, but without understanding the “cost” of that performance, it’s of limited use.<br />

Although the two words are frequently used together, it’s important to remember that Power is not the same as Energy. Power tends to represent a point in time and is calculated by multiplying the instantaneous current by the instantaneous voltage. This does not give any indication of the duration for which the current was being drawn at any specific voltage, so it’s not a true representation of the “cost” of completing the workload.<br />

There are numerous ways of measuring and verifying your design’s energy consumption. These methods vary greatly in accuracy and complexity. When defining a strategy for measuring energy, the user must decide how much time, effort and cost they are willing to spend on the problem.<br />

Synthetic benchmarks will give an indication of performance, but they don’t tend to measure efficiency. They are great for showing peak performance of the main subsystems in a device but don’t highlight a subsystem’s combined impact on user experience. So, although they stress the system, they are not necessarily representative of use cases. Care must be taken when using them as they can be manipulated.<br />

Consumers run workloads, or “real-life use cases”, every day, not synthetic benchmarks. In terms of understanding the efficiency of the system, these can be of limited use as they don’t stress the system by pushing it to run at its peak. They don’t tend to provide an indication of performance (the movie just played!) and they can be difficult to repeat consistently. However, they do demonstrate that the total system is important, not just the processor subsystem: everything has an effect on the system efficiency.<br />

Although the obvious benefit of an efficient system is extended battery life or reduced cost to operate, there are additional benefits to making an energy-efficient system. The benefits of an energy-efficient, low-power design go beyond battery life to include reduced system costs, increased reliability and overall design simplicity.<br />



Internet of Threats? – A Code Quality Management<br />

Strategy<br />

Mark Rhind<br />

Senior Technical Consultant, PRQA<br />

Ashley Park House, 42-50 Hersham Road, Walton on Thames<br />

Surrey, KT12 1RZ, United Kingdom<br />

Mark_Rhind@prqa.com<br />

Abstract— HP Security Research (2015) found that 70% of the<br />

most commonly used IoT devices, such as smart thermostats and<br />

home security systems, contain serious security vulnerabilities.<br />

The rising number of complex connected devices invites attacks on<br />

multiple fronts, from client applications and cloud services to<br />

firmware and applications. We need to prevent the Internet of<br />

Things (IoT) from becoming the “Internet of Threats”.<br />

How should we protect ourselves?<br />

The answer lies in finding software vulnerabilities in the<br />

applications as early as possible in the development stage. This can<br />

be achieved by incorporating code quality management including<br />

static analysis into your software development process.<br />

In this paper, we will outline the different types of software<br />

verification and provide advantages and drawbacks for each of<br />

them. We will explain that the most effective and proven<br />

methodology is to use static analysis tools with a coding standard.<br />

We will then provide the added benefits of using a static analysis<br />

tool.<br />

Keywords—Software engineering; security; internet of things<br />

I. JUST BUGS?<br />

As a security person, you need to repeat this mantra:<br />

"security problems are just bugs"<br />

In what has become a somewhat infamous tirade on the<br />

Linux Kernel mailing list, Linus Torvalds asserted that "security<br />

problems are just bugs"; that the primary purpose of software<br />

hardening strategies is often debugging [1].<br />

Looking beyond the personal interests of the parties involved<br />

in this exchange, Torvalds makes a valid point. Studies have<br />

found that 64% of the vulnerabilities described in CERT<br />

National Vulnerability Database were the result of programming<br />

errors [2].<br />

However, it might also be fair to suggest that Torvalds'<br />

statements risk trivializing a complex and costly problem;<br />

creating bug-free software remains a significant challenge that<br />

is rarely - if ever - achieved. In this era of software flaws that<br />

have global implications and massive financial impact, is it<br />

appropriate to describe a programming error as "just" a bug? To<br />

take one example, the OpenSSL Heartbleed Bug has been<br />

estimated to have a cost in excess of $500M [3].<br />

HP Security research found that 70% of the most commonly<br />

used smart devices still contain serious security vulnerabilities<br />

[4]. With the number of internet-connected devices projected to<br />

reach more than 20 billion by 2020 [5], this raises an important<br />

question: How do organizations ensure that these devices are<br />

secure and bug-free?<br />

II. HARDENING VS. DEBUGGING<br />

A. Hardening<br />

The term software hardening is growing in popularity to<br />

describe a range of strategies and techniques to secure software<br />

or devices against intrusion or misuse. Presently, there is little<br />

formal consensus on what is covered by software hardening, but<br />

the term is frequently used to describe strategies for security-by-design, which may include:<br />

● Layered security, or defense in depth<br />
● Applying the principle of least privilege<br />
● Encrypting communication where possible<br />
● Securely storing sensitive data<br />
● Enforcing secure configuration, such as minimum password requirements<br />

These design strategies are commonly verified by automated<br />

or manual penetration testing.<br />

Conventional penetration testing methodologies for IoT<br />

devices rely on testing the complete electronic ecosystem for a<br />

specific device. This includes the hardware, software - including<br />

any operating system, communications protocols, mobile<br />

applications, cloud services, and so on [6][7]. Testing of this<br />

breadth is often very expensive to perform and can be of limited<br />

value until the product is almost ready for launch.<br />



hashOut.data = hashes + SSL_MD5_DIGEST_LEN;<br />

hashOut.length = SSL_SHA1_DIGEST_LEN;<br />

if ((err = SSLFreeBuffer(&hashCtx)) != 0)<br />

goto fail;<br />

if ((err = ReadyHash(&SSLHashSHA1, &hashCtx)) != 0)<br />

goto fail;<br />

if ((err = SSLHashSHA1.update(&hashCtx, &clientRandom)) != 0)<br />

goto fail;<br />

if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)<br />

goto fail;<br />

if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)<br />

goto fail;<br />

^<br />

MISRA C:2012 Rule-15.6 (qac-9.4.0-2212) Body of control statement is not enclosed within<br />

braces.<br />

goto fail;<br />

if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)<br />

^<br />

MISRA C:2012 Rule-2.1 (qac-9.4.0-2880) This code is unreachable.<br />

goto fail;<br />

Fig. 1. Example MISRA C:2012 violations in file sslKeyExchange.c from the Apple SSL/TLS library. Retrieved from:<br />

https://opensource.apple.com/source/Security/Security-55471/libsecurity_ssl/lib/sslKeyExchange.c<br />

While the value of software hardening strategies in securing<br />

devices is widely accepted, it is also possible for them to be<br />

undermined by the presence of a programming error. This is<br />

clearly demonstrated by the 2014 Apple "goto fail" SSL bug [8].<br />

In this case, a duplicated goto statement, obscured by erroneous indentation,<br />

rendered Apple's official SSL/TLS library insecure, opening OS<br />

X and iOS devices to man-in-the-middle attacks.<br />

This flaw was present in released products, demonstrating<br />

that the software verification strategies used in this case were<br />

clearly ineffective. Research has shown that the cost of fixing<br />

defects doubles after the implementation phase and rises by six<br />

times for defects that must be fixed post-release [9].<br />

This suggests that even vulnerabilities that are discovered by<br />

penetration testing will cost significantly more to fix than those<br />

identified during the development of the code.<br />

B. Debugging<br />

A widely used strategy for debugging software is to employ<br />

a static analysis tool. In contrast to other testing tools, static<br />

analysis identifies issues in the source code without executing it.<br />

This allows static analysis to be utilized at any point in the<br />

development process.<br />

Taking the aforementioned Apple SSL vulnerability, it is<br />

apparent that this error could have been detected much earlier in<br />

the release process had the code been analyzed with a static<br />

analysis tool during development. Additionally, errors of this<br />

type are well known and documented in many coding standards,<br />

including MISRA and CERT (see Fig. 1.) The obvious<br />

implication is that, had an appropriate formal coding standard<br />

been used and enforced in the development of this library, this<br />

software would not have been released containing this<br />

vulnerability.<br />

Many well documented software vulnerabilities were<br />

previously recognized as serious software errors. A typical<br />

example of this is Buffer Overflow. "Classic Buffer Overflow"<br />

is ranked third on the CWE Top 25 Most Dangerous Software<br />

Errors [10] - however, issues of this type have been recognized<br />

for more than two decades and are addressed in coding standards<br />

as old as MISRA C:1998.<br />

Despite the awareness of buffer overflow as an attack vector,<br />

vulnerabilities of this nature are still frequently discovered.<br />

CVE-2017-1000251, published in September 2017, describes a<br />

potential buffer overflow vulnerability in the Bluetooth stack in<br />

the Linux kernel, which could be exploited to allow remote code<br />

execution [11].<br />

Static analysis tools have also been proven effective in<br />

identifying common vulnerabilities in modern software. Studies<br />

have found that even in addressing widely understood<br />

vulnerabilities such as SQL injection, static code analysis tools<br />

typically have much higher coverage than penetration testing<br />

tools [12].<br />

C. Hardening and Debugging<br />

By their nature, static analysis tools cannot entirely replace<br />

penetration testing. However, penetration testing is typically<br />

expensive - both in terms of time and money. Detection of<br />

common errors - such as buffer overflows - will rarely be the<br />

most effective use of the resources required for a comprehensive<br />

penetration test, particularly if it's possible to detect errors of this<br />

type much earlier in the development process. In addition, issues<br />

detected by penetration testing will be much costlier to resolve -<br />

particularly if follow-up testing is required.<br />

In contrast, static analysis tools can be built into the<br />

software development lifecycle at the implementation phase,<br />

allowing issues to be identified and resolved much earlier.<br />

Typically, organizations using static analysis tools report that<br />

they find - and fix - errors earlier in the development process and<br />



discover more defects overall than organizations not using static<br />

analysis [13].<br />

Therefore, it should be apparent that a combined approach is<br />

most effective. Static analysis will identify a large proportion of<br />

security issues and programming errors during a project's<br />

implementation phase. This can be supported by penetration<br />

testing during the integration and testing phase to verify the<br />

implementation of the design's security features.<br />

III. CODING STANDARDS<br />

Static analysis tools are often most effective when they are<br />

being used to enforce a well-defined set of coding guidelines. In<br />

industries that deal with safety-critical software, the use of a<br />

coding standard supported by suitable analysis tools is already<br />

standard practice. Industrial standards for the automotive and<br />

medical industries, such as ISO 26262 [14] and IEC 62304 [15],<br />

effectively mandate the use of these tools. However, in the wider<br />

field of embedded software, the use of these technologies is still<br />

relatively uncommon.<br />

A 2017 study by the Barr Group [16] reported that 60% of<br />

the organizations surveyed expected to be developing devices<br />

with some degree of network connectivity. Yet only two thirds<br />

of the organizations surveyed reported that they use a written<br />

coding standard, and only half reported that they use a static<br />

analysis tool.<br />

A. Enforcing a Coding Standard<br />

There is a significant range of free and commercial coding<br />

standards available. While selecting the correct one for any<br />

application is certainly not trivial, that decision is often less<br />

important than simply deciding to use a coding standard. It has<br />

been well established that coding conventions - including<br />

applying a coding standard - are only effective if the decision is<br />

made at the outset of the project [17]. Selecting a coding<br />

standard before beginning development and enforcing it<br />

throughout the software implementation will produce code that<br />

has fewer defects in addition to being consistent and easier to<br />

maintain. Developing a product and then attempting to make it<br />

safe and/or secure is costly and potentially dangerous.<br />

Yet the practical considerations are often more complex.<br />

Greenfield projects where all decisions can be made in advance<br />

are a rarity for the majority of organizations. Most product<br />

development involves some degree of pre-existing code.<br />

This is often true for consumer goods where the importance<br />

of time-to-market invites rapid development cycles, making<br />

code-reuse an appealing option.<br />

Conversely, industrial machinery typically has an expected<br />

lifetime measured in decades. It is costly and impractical to<br />

replace equipment of this nature, yet the benefits of connected<br />

infrastructure drive a desire to add this functionality into existing<br />

equipment [18].<br />

In both of these cases, products will contain, or interface<br />

with, a significant body of legacy code. In all probability, this<br />

code has not been developed in line with modern practices and<br />

almost certainly has not been developed with the security<br />

demanded by the IoT in mind.<br />

B. Legacy Code<br />

Historically, conformance of legacy code to a coding<br />

standard has not been enforced [19]. The conventional wisdom<br />

was that legacy code was "proven in the field" - if it had operated<br />

defect-free for a significant period of time, it was considered<br />

unlikely that any remaining errors in the code would lead to<br />

sudden failures.<br />

However, this principle cannot be applied when<br />

incorporating connective functionality. Additional interfaces<br />

open up a much broader range of attack vectors. It is possible for<br />

relatively innocuous defects that cannot cause a failure under<br />

normal operating conditions to become serious vulnerabilities if<br />

exploited by an attacker.<br />

This is particularly apparent - and frightening - in the 2016<br />

attack on Ukraine's power grid. Attackers succeeded in seizing<br />

control of power stations' SCADA software through public<br />

networks, resulting in a blackout in parts of the country that<br />

lasted for several hours. The attackers were able to move<br />

laterally through the network, first infiltrating the business<br />

networks and from there, gain access to the production<br />

networks.<br />

One element of this sophisticated and coordinated attack<br />

involved the hackers exploiting vulnerabilities in serial-to-ethernet connectors, commonly used to interface legacy<br />

industrial equipment with modern computers. The attackers<br />

were able to upload malicious firmware to these devices,<br />

compromising operators' ability to respond to the attack [20].<br />

C. Analysing Legacy Code<br />

While it is no longer acceptable to simply assume that legacy<br />

code is secure, it will typically be infeasible to make it fully<br />

compliant with any given coding standard. Fortunately, this is<br />

an issue that has been addressed in several industries and the<br />

solutions proposed in those cases can be applied to IoT devices.<br />

IEC 62304 introduces the concept of Software of Unknown<br />

Provenance (SOUP). This is defined as either "off-the-shelf"<br />

(also known as third-party) software, or software that has been<br />

previously developed without adequate records of the<br />

development process.<br />

This standard lays out a set of requirements for the use of<br />

software of this nature, largely addressing the process of<br />

incorporating this software into the device. IEC 62304 requires<br />

the manufacturer to:<br />

● Document the requirements that make it necessary to use this software<br />
● Define the software architecture that ensures this software operates in appropriate conditions<br />
● Monitor the software's lifecycle, including patches and new versions<br />
● Perform a risk analysis on the use of this software<br />
● Manage the configuration of the software<br />

These principles are just as relevant to IoT devices as they<br />

are to the medical industry. It would be most appropriate to<br />



address these requirements during the design and planning<br />

stages of the project.<br />

Specifically addressing compliance with a coding standard,<br />

the MISRA Compliance:2016 guidelines [21] make a distinction<br />

between native code - defined as the code developed within the<br />

scope of the project, and adopted code - which includes third-party, auto-generated and legacy code.<br />

These guidelines set out several key requirements for<br />

claiming MISRA compliance in projects that make use of<br />

adopted code. These include:<br />

● There shall be no violations of a Mandatory MISRA<br />

Guideline<br />

● Violations of a MISRA Required Guideline must be<br />

supported by a formal deviation<br />

However, it is recognized that adopted code is unlikely to<br />

have been developed following the same processes and criteria<br />

as the code under active development. This means that violations<br />

of MISRA guidelines are likely to be unavoidable, particularly<br />

in system-wide guidelines that consider both the code under<br />

active development and the legacy code.<br />

For this reason, the MISRA Compliance guidelines propose<br />

using a Guideline Re-Categorization Plan. This allows certain<br />

guidelines to be "disapplied" - essentially ignored altogether -<br />

while others may be escalated to being Mandatory.<br />

For legacy code, this principle allows an assessment to be<br />

made of the potential impact of non-compliance with any<br />

particular guideline in the coding standard, and appropriate<br />

requirements enforced based on this assessment.<br />

While the MISRA Compliance guidelines are written to<br />

complement the MISRA coding standards, the principles<br />

described could be applied to many other coding standards with<br />

little modification. The majority of coding standards incorporate<br />

some concept of the importance or severity of the guidelines -<br />

for instance, CERT's "Severity" - which can be trivially mapped<br />

to the MISRA categories.<br />

IV. ANALYSIS TOOLS<br />

All coding standards require at least one analysis tool to be<br />

effectively enforced. When addressing legacy code, tool<br />

selection becomes particularly important. Any analysis tool used<br />

in this manner must have a robust mechanism for suppressing<br />

warnings for guidelines that have been disapplied or deviated<br />

from without masking genuine defects in the code.<br />

PRQA's QA·Verify incorporates a dynamic suppression<br />

system, designed for this purpose. This allows for specific<br />

warnings to be suppressed, with supporting processes to record<br />

a formal deviation. In addition, these suppressions can be<br />

applied across multiple versions of the project with full<br />

traceability.<br />

In any project that is adopting and revising legacy code in<br />

this manner, there will come a point where all priority issues<br />

have been resolved in the legacy code in preparation for new<br />

development to begin. At this stage, it is important that the<br />

chosen coding standard is enforced in its entirety on the newly<br />

developed code. In addition, it is necessary to identify defects in<br />

the interface between the new and legacy code.<br />

This process can be greatly simplified by creating a baseline<br />

of the legacy code before new functionality is implemented -<br />

essentially a known state of the project before any new<br />

development begins.<br />

QA·Verify includes the functionality to create an intelligent<br />

baseline. This means that warnings will only be issued for new<br />

code that is added, or for issues arising from newly added code.<br />

This allows developers to easily identify new issues in the<br />

project, without having to manually filter out any remaining<br />

warnings in the legacy code.<br />

V. CONCLUSIONS<br />

It is clear that the number of internet-connected devices is<br />

continuing to grow at an incredible rate. This includes critical<br />

infrastructure, and devices and equipment responsible for safety- or mission-critical functions. Therefore, ensuring these devices<br />

are both defect-free and secure is of great importance.<br />

In many cases, it can be demonstrated that serious security<br />

vulnerabilities are caused by common programming errors. This<br />

means that, in order to ensure the security of connected devices,<br />

it is critical to ensure the code is free of errors.<br />

In the process of developing a secure device, penetration<br />

testing and code analysis are complementary verification<br />

techniques. Code analysis, in which a suitable coding standard<br />

is applied and enforced with a static analysis tool, will detect<br />

many programming errors and security vulnerabilities early in<br />

the development process, reducing the cost of fixing these<br />

defects.<br />

In projects that make heavy use of adopted code, fully<br />

enforcing a coding standard is often infeasible. However,<br />

existing strategies described by industrial standards can be<br />

reapplied to ensure that the product is robust and defect-free.<br />

REFERENCES<br />

[1] Torvalds, Linus, 2017. Re: [GIT PULL] usercopy whitelisting for v4.15-<br />

rc1 [Online]. Linux Kernel mailing list. Available at:<br />

http://lkml.iu.edu/hypermail/linux/kernel/1711.2/01701.html<br />

[2] Heffley, J. and Meunier, P., 2004. Can source code auditing software<br />

identify common vulnerabilities and be used to evaluate software<br />

security? Proceedings of the 37th Hawaii International Conference on<br />

System Sciences - 2004.<br />

[3] Kerner, Sean Michael, 2014. Heartbleed SSL flaw's true cost will take<br />

time to tally [Online]. eWeek. Available at:<br />

http://www.eweek.com/security/heartbleed-ssl-flaw-s-true-cost-willtake-time-to-tally<br />

[4] HP, 2015. Internet of things research study. Hewlett Packard Enterprise.<br />

[5] Van der Meulen, 2015. Gartner says 6.4 billion connected "things" will<br />

be in use in 2016, up 30 percent from 2015 [Online]. Gartner. Available<br />

at: https://www.gartner.com/newsroom/id/3165317<br />

[6] Tierney, Andrew, 2017. IoT security testing methodologies [Online].<br />

PenTestPartners. Available at: https://www.pentestpartners.com/securityblog/iot-security-testing-methodologies/<br />

[7] Francis, Ryan 2017. How to conduct an IoT pen test [Online].<br />

NetworkWorld. Available at:<br />

https://www.networkworld.com/article/3198495/internet-of-things/howto-conduct-an-iot-pen-test.html<br />



[8] Ducklin, Paul, 2014. Anatomy of a “goto fail” – Apple’s SSL bug<br />

explained, plus an unofficial patch for OS X! [Online]. Naked security by<br />

Sophos. Available at:<br />

https://nakedsecurity.sophos.com/2014/02/24/anatomy-of-a-goto-failapples-ssl-bug-explained-plus-an-unofficial-patch/<br />

[9] Briski, K. A. et al., 2008. Minimizing code defects to improve software<br />

quality and lower development costs. IBM Development solutions white<br />

paper.<br />

[10] Christey, Steve, 2011. 2011 CWE/SANS top 25 most dangerous software<br />

errors [Online]. CWE. Available at: https://cwe.mitre.org/top25/<br />

[11] CVE, 2017. Vulnerability details : CVE-2017-1000251 [Online]. CVE.<br />

Available at: https://www.cvedetails.com/cve/CVE-2017-1000251/<br />

[12] Antunes, N. and Vieira, M., 2009. Comparing the effectiveness of<br />

penetration testing and static code analysis on the detection of sql<br />

injection vulnerabilities in web services. 2009 15th IEEE Pacific Rim<br />

International Symposium on Dependable Computing.<br />

[13] Balacco, S. and Rommel, C., 2011. The increasing value and complexity<br />

of software call for the reevaluation of development and testing practices<br />

[Online]. VDC Research Whitepaper. Available at:<br />

http://info.prqa.com/hubfs/Whitepapers/PRQA-VDC-white-paper-<br />

2011.pdf<br />

[14] ISO 26262-6, 2011. Road vehicles — Functional safety — Part 6: Product<br />

development at the software level. International Organization for<br />

Standardization.<br />

[15] IEC 62304:2006. Medical device software - Software life-cycle<br />

processes. European Committee for Electrotechnical Standardization.<br />

[16] Barr Group, 2017. Embedded systems safety & security survey.<br />

[17] McConnell, S., 2004. Code complete second edition. Redmond,<br />

Washington: Microsoft Press.<br />

[18] Intel, 2014. Connecting legacy devices to the internet of things (IoT)<br />

[Online]. Intel Solution Brief. Available at:<br />

https://www.intel.com/content/dam/www/public/us/en/documents/soluti<br />

on-briefs/connecting-legacy-devices-brief.pdf<br />

[19] MISRA, 1998. Guidelines for the use of the C language in vehicle based<br />

software. The Motor Industry Software Reliability Association.<br />

[20] E-ISAC, 2016. Analysis of the cyber attack on the Ukrainian power grid<br />

[Online]. Industrial Control Systems. Available at:<br />

https://ics.sans.org/media/E-ISAC_SANS_Ukraine_DUC_5.pdf<br />

[21] MISRA Compliance:2016. Achieving compliance with MISRA coding<br />

guidelines. Motor Industry Software Reliability Association. Available at:<br />

https://www.misra.org.uk/LinkClick.aspx?fileticket=w_Syhpkf7xA%3d<br />

&tabid=57<br />



Combining Static and Dynamic Analysis<br />

Paul Anderson<br />

GrammaTech, Inc.<br />

Ithaca, NY. USA.<br />

paul@grammatech.com<br />

Abstract— Static analysis tools are useful for finding serious programming defects and security vulnerabilities in source and binary code. These tools inevitably report some false positives: bugs that are highly unlikely to manifest as real problems in deployed code. Consequently, results must be inspected by a human to determine whether they warrant action, and most tools provide program-understanding features to make this easier. This inspection process, known as warning triage or assessment, can be much more effective if it is guided by information from dynamic analyses such as code coverage, crash analysis, and performance profiling. For example, a static analysis report of a resource leak that occurs on a path that has not been tested is more likely to be a real undiscovered bug than one that occurs in code that has been tested much more comprehensively. Furthermore, the results of static analysis tools can be used to guide testing too: for example, a developer can save a great deal of effort if the static analysis can prove that it is fundamentally impossible to achieve full condition coverage.

This paper describes how the results of static analyses and dynamic analyses can be fused to allow developers to get more value from both processes, and to produce higher-quality software more efficiently.

Keywords—static analysis; dynamic analysis; test coverage; crash analysis; defect reduction

I. INTRODUCTION TO STATIC ANALYSIS

The examples in this paper use CodeSonar (the advanced static analysis tool that I work on) to illustrate how static and dynamic analysis tools can be integrated. However, the techniques and principles are not unique to CodeSonar: several other advanced static analysis tools are commercially available and have features similar to those described here.

Roughly speaking, advanced static analysis tools work as follows. First they create a model of the entire program, which they do by reading and parsing each input file. The model consists of representations such as abstract syntax trees for each compilation unit, control-flow graphs for each subprogram, symbol tables, and the call graph. Checkers that find defects are implemented in terms of various kinds of queries on those representations. Superficial bugs can be found by pattern matching on the abstract syntax tree or the symbol tables. The really serious bugs are those that cause the program to fail, such as null pointer dereferences and buffer overruns, and these require sophisticated queries to find. Those queries can be thought of as abstract simulations: the analyzer simulates the execution of the program, but instead of using concrete values, it uses equations that model the abstract state of the program. If an anomaly is encountered, a warning is generated.

The defects found fall into three main categories:

1. Bugs that violate the fundamental rules of the runtime, thereby causing the program's behavior to be undefined. These include memory errors such as null pointer dereferences and buffer overruns, concurrency errors such as data races, and bugs such as use of uninitialized memory.

2. Defects that arise because the program breaks the rules of a standard API. For example, the C library does not specify what happens when the same file descriptor is closed twice; doing this deliberately makes no sense, so it is probably a bug. Leaks of finite resources such as memory also fall into this category.

3. Inconsistencies or contradictions in the code. These may not cause the program to crash, but they likely indicate that the programmer misunderstood an important property of the code. For example, a condition that is either always true or always false is unlikely to be intentional because it leads to dead code.

Static analysis tools are also useful for finding violations of coding standards such as MISRA; these mostly fall into the third of the above categories. In addition, such tools allow users to define their own domain-specific rules.
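The pattern-matching style of checker mentioned above can be illustrated in miniature with Python's own ast module. The toy checker below flags `if` conditions that are literal constants, i.e. always true or always false, the kind of inconsistency described in category 3. It is a sketch for illustration only; real tools perform such queries on C/C++ program models, not on Python source.

```python
import ast

def constant_conditions(source: str) -> list[int]:
    """Report line numbers of `if` statements whose condition is a literal
    constant -- always true or always false, hence likely unintentional."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # An ast.Constant test (e.g. `if True:`) can never vary at runtime.
        if isinstance(node, ast.If) and isinstance(node.test, ast.Constant):
            hits.append(node.lineno)
    return sorted(hits)
```

The same shape (walk the tree, match a node pattern, report a location) underlies AST-based checkers in general.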



Figure 1. An example warning from an advanced static-analysis tool.

These tools are useful because they are good at finding defects that occur only in unusual circumstances, and because they can do so very early in the development process: they can yield value before the code is even ready to be tested. They are not intended to replace or supplant traditional testing techniques, but are complementary to them.

Figure 1 shows an example warning report from CodeSonar for a null pointer dereference. The report shows the path through the code that must be taken for the bug to trigger, with interesting points along the way highlighted. An explanation of the reasoning the tool used to conclude there was a bug is given at the point at which the pointer dereference occurs.

When a warning is generated, it is written to a database. Many advanced static analysis tools, including CodeSonar, allow users to annotate warnings: a user can mark a warning as a true or false positive, give it a priority, assign it to someone to fix, or attach a note. It is important that this information be persistent; that is, if a later run of the analysis detects the same warning, even if the code has changed, the information should continue to be associated with the warning. This is known as persistent triage.

Some of the value of integrating CodeSonar with dynamic analysis tools comes from allowing persistent triage to be used for the results of the dynamic analyses.

II. COMBINING STATIC AND DYNAMIC ANALYSIS

This section describes two ways in which the results of dynamic analysis can be integrated with static analysis: metrics about execution can be imported and associated with elements of the program model, and information about anomalies such as memory errors can be imported into the database of warnings.

Metrics about program executions are essential for understanding performance characteristics and for identifying parts of the code that are not adequately tested. These metrics can be used to help a user interpret the static results, which is especially useful for prioritizing them. Some examples are the following:

• Information about how many times a procedure is called can be very helpful in determining whether certain defects are serious or not. The best examples are resource leaks, which are often insignificant in code that is called rarely, but highly serious in code that is called a lot.

• Data about whether a path through the code has been tested can help an analyst determine whether a static analysis warning is a true or a false positive. A buffer overrun that is reported on a path that is not tested is more likely to be a latent defect than one that is reported on a path that is executed a lot.

• A memory profile, which shows which locations in the code are responsible for dynamically allocating memory, can be used to highlight which leaks reported by static analysis are most hazardous.

Similarly, some dynamic analyses generate reports of anomalies such as invalid memory accesses, resource leaks, and program crashes. The most obvious way to integrate these with a static analysis tool is to import those reports into the tool database.

The following sections describe how the results of several different classes of tool can be imported. For each example, the process of combining static and dynamic analysis results described in this paper is the following:

• A dynamic analysis is run, and the results are stored in a set of files.

• The static analysis is invoked, and the dynamic analysis results are used to augment the static results.

• The results of both analyses are presented through the same user interface.

In each of these examples, the integration is done either with a plug-in to CodeSonar, or by setting configuration parameters.

III. Time Profiling

One of the simplest forms of dynamic analysis is time profiling, which helps developers understand how much time each part of the program takes to execute. There are many tools, both commercial products and open source, available for collecting raw timing data from executions, and for converting that data into a form that is convenient for consumption. Profiles may be gathered at function and statement granularity. For simplicity, the following section describes function-level profiling information only.

Most profilers will gather data such as the following:

• The number of calls to a function

• The time spent in the function itself

• The time spent in the function and everything it calls transitively

A simple approach to combining the static and dynamic results is to import this data as metrics. The following examples show the results of running a profile on an open-source calculator program named bc, with the GNU profiler gprof.

Figure 2. A screenshot of a visualization of a call graph with metrics about dynamic execution superimposed.

After the program is executed, gprof is invoked as follows:

gprof -b -L -p --inline-file-names bc >gprof.txt

This writes the profiling information to the file gprof.txt. The first few lines of an example run are shown below. Note that the functions and the files in which they are found are identified by name.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self               self     total
 time    seconds  seconds     calls  ms/call  ms/call  name
18.75       0.09     0.09         3    30.01   160.04  execute (execute.c:67)
18.75       0.18     0.09   1021353     0.00     0.00  bc_multiply (number.c:639)
16.67       0.26     0.08    971853     0.00     0.00  bc_divide (number.c:742)
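Converting such a flat profile into comma-separated values takes only a short script. The sketch below assumes the column layout shown in the sample above (gprof with -b -L -p --inline-file-names); rows without a call count, and header lines, are simply skipped.

```python
import csv
import re

# Matches a gprof flat-profile row such as:
#   18.75  0.18  0.09  1021353  0.00  0.00  bc_multiply (number.c:639)
ROW = re.compile(
    r"^\s*(?P<pct>\d+\.\d+)\s+\d+\.\d+\s+(?P<self>\d+\.\d+)\s+"
    r"(?P<calls>\d+)\s+\S+\s+\S+\s+(?P<name>\S+)\s+\((?P<loc>[^)]+)\)")

def gprof_to_csv(profile, out):
    """Convert a gprof flat profile (read line by line) to CSV on `out`."""
    writer = csv.writer(out)
    writer.writerow(["function", "file", "line",
                     "percent_time", "self_seconds", "calls"])
    for line in profile:
        m = ROW.match(line)
        if not m:
            continue  # header lines and rows without call counts
        fname, lineno = m.group("loc").rsplit(":", 1)
        writer.writerow([m.group("name"), fname, lineno,
                         m.group("pct"), m.group("self"), m.group("calls")])
```

The resulting CSV is exactly the kind of per-procedure metric file that the plug-in described next can consume.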

It is relatively easy to write a simple program that can convert this information into a comma-separated-value (CSV) file. A plug-in for CodeSonar is available that can then read that file and create metrics for each procedure. Once these metrics are in the CodeSonar database, they can be viewed in several ways. The simplest way to see them is alongside some of the built-in metrics. They can also be shown in the visualization tool. Figure 2 shows a screen capture of a visualization of the call graph. In this particular instance, the size of the rectangles is proportional to the percentage of time spent in each function, and the intensity of the red is proportional to the number of static analysis warnings found in the item. From this, the user can easily pick out the places in the code that are both consuming most time during
execution (the size of the box), as well as those that are potentially most risky (the intensity of the red). This will help the user focus on the parts of the program most likely to benefit from increased scrutiny. Selecting the box reveals a link that allows the user to see all of those static analysis warnings.

Figure 3. A screenshot showing a warning generated from the test effectiveness metric.

IV. Code Coverage

Code coverage tools measure how much of the code is exercised during execution. There are several forms of coverage, the most popular of which are statement coverage and condition coverage. Again, there are both open-source tools (e.g., gcov) and commercial tools such as CTC Testwell from VerifySoft (http://www.verifysoft.com/en_ctcpp.html). The examples described below were generated using Testwell.

Coverage tools typically generate metrics on test effectiveness; a standard metric is the Test Effectiveness Ratio (TER), defined as the proportion of elements exercised by the tests, expressed as a percentage of the whole. In Testwell, a TER can be generated for each of the different kinds of coverage it supports. Additionally, Testwell can show which parts of the code did not get exercised by the tests.
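As a quick sanity check of the definition, TER is simply exercised-over-total expressed as a percentage. A minimal sketch (the zero-denominator convention here is our own assumption, not Testwell's):

```python
def ter(exercised: int, total: int) -> float:
    """Test Effectiveness Ratio: elements exercised by the tests as a
    percentage of all elements of that kind (statements, conditions, ...)."""
    if total == 0:
        return 100.0  # assumption: nothing to cover counts as fully covered
    return 100.0 * exercised / total
```

For example, with 40 of 50 statements executed, ter(40, 50) gives 80.0.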

Testwell will create a data file in a convenient format (JSON). Again, it is a simple matter to write a CodeSonar plug-in that will read this file and create metrics corresponding to the TER values. Those metrics will then show up in the CodeSonar user interface in the same way as demonstrated for the profiler metrics described above.

Those metrics can then be used to generate static analysis warnings. For example, generating a warning for a procedure whose TER is less than 80% is as simple as adding the following lines to the CodeSonar configuration file for the project:

METRIC_WARNING_CONDITION = TER[PROCEDURE] < 80
METRIC_WARNING_CLASS_NAME = Low Test Effectiveness
METRIC_WARNING_BASE_RANK = 5.0
METRIC_WARNING_SIGNIFICANCE = RELIABILITY

Figure 3 shows such a CodeSonar warning.

The previous example demonstrated how to specify a new warning class in CodeSonar using configuration-file parameters. CodeSonar also has an API in which it is possible to implement checkers in a more general fashion. Figure 4 shows a warning that was generated for a condition that was not exercised during the tests. This warning was generated by the same script that was used to import the test-execution metrics.

Plug-ins for CodeSonar can be written in several languages (Python, C++, C, Scheme, Java, and C#); the API for accessing the program model and for generating metrics and warnings is available in all of those languages. The plug-ins written for this paper were written entirely in Python. A few snippets from that script are shown below.

First, a warning class is created:

untested_condition = cs.analysis.create_warningclass(
    "Untested Condition",
    "", 2.0,
    cs.warningclass_flags.PADDING,
    cs.warning_significance.RELIABILITY)

When the Testwell file is read, the script identifies the file and line number where the untested condition is found. The warning is then reported as follows:

untested_condition.report(
    sfile.arbitrary_instance(),
    probe['line'], proc, str(msg),
    cs.report_flags.ALREADY_XML_ENCODED)

Figure 4. A Testwell untested condition imported as a CodeSonar warning.

Writing checkers such as these in Python is usually fairly straightforward.

V. Crash Analysis

If a program crashes during execution, the operating system may arrange for a memory dump of the process to be written to a file. On Linux and other Unix systems, this is referred to as a core dump; debuggers such as gdb may be used to examine the state of the program at the point when it crashed. The most useful information is usually the stack trace, and it is a fairly simple matter to import the stack trace into CodeSonar.

The screenshot in Figure 6 shows an example of a crash dump that was imported into CodeSonar. The mechanism for doing this is straightforward: a simple script looks for core files and invokes gdb in batch mode as follows:

gdb exe corefile --batch -q -ex bt

The output of this is read and converted into a form that can be imported into CodeSonar with a simple plug-in that reads the data and creates warnings.
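A sketch of such a script is below. It shells out to the exact gdb command shown above and assumes gdb's usual `#N  func (args) at file:line` backtrace format; frames without source information are skipped, and the frame strings in the comments are hypothetical.

```python
import re
import subprocess

# Matches backtrace frames such as (hypothetical examples):
#   #0  do_free (p=0x55) at crash.c:14
#   #1  0x000055d2 in main () at crash.c:23
FRAME = re.compile(r"^#(?P<n>\d+)\s+(?:0x[0-9a-fA-F]+ in )?(?P<func>\S+)"
                   r".*?\s+at\s+(?P<file>[^:\s]+):(?P<line>\d+)\s*$")

def parse_backtrace(text):
    """Turn gdb 'bt' output into (frame, function, file, line) tuples."""
    frames = []
    for raw in text.splitlines():
        m = FRAME.match(raw)
        if m:
            frames.append((int(m.group("n")), m.group("func"),
                           m.group("file"), int(m.group("line"))))
    return frames

def backtrace_from_core(exe, corefile):
    """Invoke gdb in batch mode, exactly as in the text, and parse the trace."""
    out = subprocess.run(["gdb", exe, corefile, "--batch", "-q", "-ex", "bt"],
                         capture_output=True, text=True).stdout
    return parse_backtrace(out)
```

The tuples returned by parse_backtrace are the file/line/function data that the import plug-in would turn into warnings.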

VI. Memory Analysis

Memory analysis tools find errors such as resource leaks, use of invalid addresses, and buffer overruns. Valgrind with the Memcheck module is a popular option for developers on Linux systems (http://valgrind.org).

Valgrind can be invoked in a manner that creates an XML file containing the report of errors; the following command runs the program named crash and writes the report to crash.vg.xml:

valgrind --leak-check=yes --xml=yes \
    --xml-file=crash.vg.xml ./crash
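The structure of such an import can be sketched with Python's standard XML parser. The element names below follow Memcheck's usual XML layout (`<error>` with `<kind>`, `<what>`/`<xwhat>`, and `<stack>`/`<frame>` children), but should be treated as assumptions to be checked against the protocol version actually produced:

```python
import xml.etree.ElementTree as ET

def valgrind_errors(fileobj):
    """Yield (kind, message, frames) for each <error> in a Memcheck XML
    report; frames are (function, file, line) triples from its stacks."""
    root = ET.parse(fileobj).getroot()
    for err in root.iter("error"):
        kind = err.findtext("kind", default="")
        # Leak errors describe themselves in <xwhat><text>, others in <what>.
        msg = err.findtext("what") or err.findtext("xwhat/text") or ""
        frames = [(f.findtext("fn", "?"), f.findtext("file", "?"),
                   f.findtext("line", "?")) for f in err.iter("frame")]
        yield kind, msg, frames
```

Each yielded record carries exactly the fields needed to create a warning with a stack of supporting events.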

The screenshot in Figure 5 shows a warning generated from having imported the report into CodeSonar. In this instance the memory had already been freed. The report shows the location where the second free took place (line 14); the stack trace at the point of the illegal free is represented by other events in the report. The report also shows two further stack traces: the stack at the point where the memory was previously freed (line 13), and the stack at the point where the memory was allocated (line 12).

In this case it was most convenient to convert the XML file into a SARIF file. SARIF (Static Analysis Results Interchange Format) is designed to facilitate integrating static analysis tools. A plug-in for CodeSonar for importing these files is in development and is available upon request.

In the case of Valgrind, persistent triage of results can be very useful. Normal practice calls for users of Valgrind to maintain "suppressions files", which tell the tool to refrain from generating certain reports. This is useful because although many reports are technically true positives, they have been judged to be either acceptable or harmless. Managing this file for a team of programmers on a large project can be tedious. A reasonable alternative is to mark Valgrind warnings in CodeSonar as False Positive or Don't Care, and then automatically generate a suppressions file for use in future runs.
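Generating the suppressions file from the triaged warnings is then mechanical. The sketch below emits Valgrind's standard suppression-entry syntax from the call-stack functions of each warning marked False Positive or Don't Care; the (name, kind, stack) record format is our own assumption, standing in for whatever the warning database exports.

```python
def make_suppression(name, suppression_kind, stack_functions):
    """Render one entry of a Valgrind suppressions file, e.g. for
    suppression_kind 'Memcheck:Free' with the offending call stack."""
    body = ["{", "   " + name, "   " + suppression_kind]
    body += ["   fun:" + fn for fn in stack_functions]
    body.append("}")
    return "\n".join(body)

def suppressions_file(triaged):
    """triaged: iterable of (name, kind, stack) records, one per warning
    marked False Positive or Don't Care in the warning database."""
    return "\n".join(make_suppression(*t) for t in triaged)
```

The result can be passed to later runs with Valgrind's --suppressions option.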



Figure 6. A crash report imported as a CodeSonar warning.

Figure 5. A report from Valgrind as a CodeSonar warning.

VII. CONCLUSIONS

Static analysis tools and dynamic analysis tools are powerful and complementary approaches to finding and eliminating programming errors. As demonstrated above, it is feasible to use the results of each style of analysis to help strengthen or augment the other. This is relatively easy to accomplish because modern tools are designed and built in a way that allows them to be integrated.

VIII. REFERENCES

CodeSonar: http://www.grammatech.com



Finding Safety Defects and Security Vulnerabilities by Static Analysis

Daniel Kästner, Laurent Mauborgne, Christian Ferdinand
AbsInt GmbH
66123 Saarbrücken, Germany

Abstract—Static code analysis has evolved to be a standard technique in the development process of safety-critical software. It can be applied to show compliance with coding guidelines, and to demonstrate the absence of critical programming errors, including runtime errors and data races. In recent years, security concerns have become more and more relevant for safety-critical systems, not least due to the increasing importance of highly automated driving and pervasive connectivity. While in the past static analyzers have primarily been applied to demonstrate classical safety properties, they are also well suited to address data safety and to discover security vulnerabilities. This talk gives an overview and discusses practical experience.

Keywords—static analysis, abstract interpretation, runtime errors, security vulnerabilities, functional safety, cybersecurity

I. INTRODUCTION

Some years ago, static analysis meant manual review of programs. Nowadays, automatic static analysis tools are gaining popularity in software development, as they offer a tremendous increase in productivity by automatically checking the code against a wide range of criteria. Many software projects are developed according to coding guidelines, such as MISRA C, CERT, or CWE, aiming at a programming style that improves clarity and reduces the risk of introducing bugs. Compliance checking by static analysis tools has become common practice.

In safety-critical systems, static analysis plays a particularly important role. A failure of a safety-critical system may cause high costs or even endanger human beings. With the growing amount of software-implemented functionality, preventing software-induced system failures becomes an increasingly important task. One particularly dangerous class of errors is runtime errors, which include faulty pointer manipulations, numerical errors such as arithmetic overflows and division by zero, data races, and synchronization errors in concurrent software. Such errors can cause software crashes, invalidate separation mechanisms in mixed-criticality software, and are a frequent cause of errors in concurrent and multi-core applications. At the same time, these defects are also at the root of many security vulnerabilities, including exploits based on buffer overflows, dangling pointers, or integer errors.

This is recognized by the MISRA C standard in a particular rule which recommends deeper analysis: “Minimization of runtime failures shall be ensured by the use of at least one of (a) static analysis tools/techniques; (b) dynamic analysis tools/techniques; (c) explicit coding of checks to handle runtime faults.” ([23], rule 21.1).
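Option (c) from the quoted rule can be illustrated with a one-line guard (a sketch in Python for brevity; in C the same pattern wraps the division operator, where an unguarded division by zero is undefined behavior):

```python
def checked_div(num: int, den: int, fallback: int = 0) -> int:
    """Explicit coding of a check to handle a runtime fault: the guard turns
    a potential division-by-zero fault into defined, handleable behavior.
    The fallback convention here is purely illustrative."""
    if den == 0:
        return fallback  # fault handled explicitly instead of propagating
    return num // den
```

The other two options (a) and (b), static and dynamic analysis, are the subject of the rest of this paper.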

In safety-critical software projects, obeying coding guidelines such as MISRA C is strongly recommended by all current safety standards, including DO-178B, DO-178C, IEC 61508, ISO 26262, and EN 50128. In addition, all of them explicitly consider demonstrating the absence of runtime errors as a verification goal. This is often formulated indirectly by addressing runtime errors (e.g., division by zero, invalid pointer accesses, arithmetic overflows) in general, and additionally considering corruption of content, synchronization mechanisms, and freedom from interference in concurrent execution [3]. Semantics-based static analysis has become the predominant technology to detect runtime errors and data races.

Abstract interpretation is a formal methodology for semantics-based static program analysis [8]. It supports formal soundness proofs (it can be proven that no error is missed) and scales to real-life industry applications. Abstract interpretation-based static analyzers provide full control and data coverage and allow conclusions to be drawn that are valid for all program runs with all inputs. Such conclusions may be that no timing or space constraints are violated, or that runtime errors or data races are absent: the absence of these errors can be guaranteed [16]. Nowadays, abstract interpretation-based static analyzers that can detect stack overflows and violations of timing constraints [28], and that can prove the absence of runtime errors and data races [10, 17], are widely used for developing and verifying safety-critical software. From a methodological point of view, abstract interpretation-based static analyses can be seen as equivalent to testing with full data and control coverage. They do not require access to the physical target hardware, can be easily integrated in continuous verification frameworks and model-based development environments [18], and allow developers to detect runtime errors as well as timing and space bugs in early product stages.

In the past, security properties have mostly been relevant for non-embedded and/or non-safety-critical programs. Recently, due to increasing connectivity requirements (cloud-based services, car-to-car communication, over-the-air updates, etc.), more and more security issues are arising in safety-critical software as well. Security exploits like the Jeep Cherokee hacks [30], which affect the safety of the system, are becoming more and more frequent. In consequence, safety-critical software development faces novel challenges which previously have been addressed only in other industry domains.

On the other hand, as outlined above, safety-critical software is developed according to strict guidelines which effectively reduce the relevant subset of the programming language used and improve software verifiability. As an example, dynamic memory allocation and recursion are often forbidden, or used only in a very limited way. In consequence, much stronger code properties can be shown for safety-critical software than for non-safety-critical software, so that security vulnerabilities can also be addressed in a more powerful way.

The topic of this article is to show that some classes of defects can be proven to be absent in the software, so that exploits based on such defects can be excluded. On the other hand, additional syntactic checks and semantic analyses become necessary to address security properties which are orthogonal to safety requirements. Throughout the article we focus on software aspects only, without addressing safety or security properties at the hardware level.

II. SECURITY IN SAFETY-CRITICAL SYSTEMS

Functional safety and security are aspects of dependability, in addition to reliability and availability. Functional safety is usually defined as the absence of unreasonable risk to life and property caused by malfunctioning behavior of the software. The main goals of information security or cybersecurity (for brevity denoted as ‘security’ in this article) traditionally are to preserve confidentiality (information must not be disclosed to unauthorized entities), integrity (data must not be modified in an unauthorized or undetected way), and availability (data must be accessible and usable upon demand).

In safety-critical systems, safety and security properties are intertwined. A violation of security properties can endanger the functional safety of the system: an information leak could provide the basis for a successful attack on the system, and a malicious data corruption or denial-of-service attack may cause the system to malfunction. Vice versa, a violation of safety goals can compromise security: buffer overflows belong to the class of critical runtime errors whose absence has to be demonstrated in safety-critical systems. At the same time, an undetected buffer overflow is one of the main security vulnerabilities, which can be exploited to read unauthorized information, to inject code, or to cause the system to crash [32]. To emphasize this, in a safety-critical system the definition of functional safety can be adapted to define cybersecurity as the absence of unreasonable risk to life and property caused by malicious misuse of the software.
and property caused by malicious misusage of the software.<br />

The convergence of safety and security properties also becomes apparent in the increasing role of data in safety-critical systems. There are many well-documented incidents where harm was caused by erroneous data, corrupted data, or inappropriate use of data; examples include the Turkish Airlines A330 incident (2015), the Mars Climate Orbiter crash (1999), and the Cedars-Sinai Medical Centre CT scanner radiation overdose (2009) [11]. The reliance on data in safety-critical systems has grown significantly in the past few years; consider, e.g., data used for decision-support systems, data used in sensor fusion for highly automated driving, or data provided by car-to-car communication or downloaded from a cloud. As a consequence, there are ongoing activities to provide specific guidance for handling data in safety-critical systems [11]. At the same time, these data also represent safety-relevant targets for security attacks.

A. Coding Guidelines

The MISRA C standard [23, 24] was originally developed with a focus on the automotive industry, but is now widely recognized as a coding guideline for safety-critical systems in general. Its aim is to avoid programming errors and to enforce a programming style that enables the safest possible use of C. A particular focus is on dealing with undefined/unspecified behavior of C and on preventing runtime errors. As a consequence, it is also directly applicable to security-relevant code.

The most prominent coding guidelines targeting security aspects are ISO/IEC TS 17961, the SEI CERT C Coding Standard, and the MITRE Common Weakness Enumeration (CWE).

The ISO/IEC TS 17961 C Secure Coding Rules [15] specify rules for secure coding in C. They do not primarily address developers, but rather aim at establishing requirements for compilers and static analyzers. MISRA C:2012 Addendum 2 [25] compares the ISO/IEC TS 17961 rule set with MISRA C:2012. Only 4 of the C Secure rules are not covered by the first edition of MISRA C:2012 [24]. MISRA C:2012 Amendment 1 [26] contains 14 additional guidelines (one directive and 13 rules) with a focus on covering additional security concerns, which now also cover the previously unhandled C Secure rules. This illustrates the strong overlap between the safety- and security-oriented coding guidelines.

The SEI CERT C Coding Standard belongs to the CERT Secure Coding Standards (https://www.securecoding.cert.org). While emphasizing the security aspect, CERT C [14] also targets safety-critical systems: it aims at “developing safe, reliable and secure systems”. CERT distinguishes between rules and recommendations, where rules are meant to provide normative requirements and recommendations are meant to provide general guidance; the book version [14] describes the rules only. A particular focus is on eliminating undefined behaviors that can lead to exploitable vulnerabilities. In fact, almost half of the CERT rules (43 of 99) target undefined behaviors according to the C standard.

The Common Weakness Enumeration (CWE) is a software community project (https://cwe.mitre.org) that aims at creating a catalog of software weaknesses and vulnerabilities. The goal of the project is to better understand flaws in software and to create automated tools that can be used to identify, fix, and prevent those flaws. There are catalogues for several programming languages, including C. In the latter, once again, many rules are associated with undefined or unspecified behaviors.

B. Vulnerability Classification

Many rules are shared between the different coding guidelines, but there is no common structuring of security vulnerabilities. CERT C roughly structures its rules according to language elements, whereas ISO/IEC TS 17961 and CWE are structured as flat lists of vulnerabilities. In the following, we list some of the most prominent vulnerabilities, which are addressed in all coding guidelines and which belong to the most critical ones at the C programming level. The presentation follows the overview given in [32].

1) Stack-based Buffer Overflows

An array declared as a local variable in C is stored on the runtime stack. A C program may write beyond the end of the array due to index values being too large or negative, or due to invalid increments of pointers pointing into the array. In that case a runtime error has occurred whose behavior is undefined according to the C semantics. As a consequence the program might crash with a bus error or segmentation fault, but typically adjacent memory regions will simply be overwritten. An attacker can exploit this by manipulating the return address or the frame pointer, both of which are stored on the stack, or by indirect pointer overwriting, thereby gaining control over the execution flow of the program. In the first case the program jumps to code injected by the attacker into the overwritten buffer instead of executing the intended function return. In case of overflows on array read accesses, confidential information stored on the stack (e.g. in temporary local variables) might be leaked. An example of such an exploit is the well-known W32.Blaster.Worm¹.
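As an illustrative sketch (our own code, not from the paper), the following C fragment shows the off-by-one write pattern described above, together with a bounds-checked variant; the function names are hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* Vulnerable pattern: the loop condition uses <= and writes one
 * element past the end of buf -- undefined behavior that may
 * overwrite adjacent stack memory (e.g. the return address). */
void fill_unchecked(char *buf, size_t n, char c) {
    for (size_t i = 0; i <= n; i++)  /* BUG: should be i < n */
        buf[i] = c;
}

/* Defensive variant: the requested length is clamped to the real
 * capacity before writing, so no out-of-bounds access can occur. */
size_t fill_checked(char *buf, size_t cap, size_t n, char c) {
    size_t len = n < cap ? n : cap;
    memset(buf, c, len);
    return len;
}
```

Note that a sound static analyzer would flag the first function in any context where the off-by-one write is reachable.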

2) Heap-based Buffer Overflows

Heap memory is dynamically allocated at runtime, e.g. by calling the malloc() or calloc() implementations provided by dynamic memory allocation libraries. Just like stack-allocated arrays, dynamically allocated arrays may be read or written beyond their boundaries. In case of a read access, information stored on the heap might be leaked – a prominent example is the Heartbleed bug in OpenSSL (cf. CERT vulnerability 720951²). Via write operations attackers may inject code and gain control over program execution, e.g. by overwriting management information of the dynamic memory allocator stored in the accessed memory chunk.

3) General Invalid Pointer Accesses

Buffer overflows are special cases of invalid pointer accesses; they are listed here as separate points due to the large number of attacks based on them. However, any invalid pointer access is a security vulnerability – other examples are null pointer accesses and dangling pointer accesses. Accessing such a pointer is undefined behavior which can cause the program to crash or behave erratically. A dangling pointer points to a memory location that has been deallocated, either implicitly (e.g. data stored in the stack frame of a function after its return) or explicitly by the programmer. A concrete example of a dangling pointer access is the double-free vulnerability, where already freed memory is freed a second time. This can be exploited by an attacker to overwrite arbitrary memory locations and execute injected code [32].
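A minimal sketch of a common double-free mitigation (our own illustration): nulling the pointer after free, so that an accidental second free() becomes a defined no-op instead of undefined behavior. The helper name safe_free is hypothetical:

```c
#include <stdlib.h>

/* Freeing through this helper resets the caller's pointer, so an
 * accidental second call sees NULL -- and free(NULL) is defined
 * to do nothing -- instead of freeing the same chunk twice. */
void safe_free(void **p) {
    free(*p);
    *p = NULL;
}
```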

4) Uninitialized Memory Accesses

Automatic variables and dynamically allocated memory have indeterminate values when not explicitly initialized. Accessing them is undefined behavior which can cause the program to behave erratically or in unexpected ways. Uninitialized variables can also be used for security attacks: in CVE-2009-1888³, potentially uninitialized variables passed to a function were exploited to bypass the access control list and gain access to protected files [14].

5) Integer Errors

Integer errors are not exploitable vulnerabilities by themselves, but they can be the cause of critical vulnerabilities like stack- or heap-based buffer overflows. Examples of integer errors are arithmetic overflows and invalid cast operations. If, e.g., a negative signed value is used as the size argument of a memcpy() call, it will be interpreted as a large unsigned value, potentially resulting in a buffer overflow.
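The signed-to-unsigned conversion mentioned above can be sketched as follows (illustrative code, not from the paper; copy_checked is a hypothetical wrapper):

```c
#include <stddef.h>
#include <string.h>

/* A negative int converted to size_t wraps around to a huge value
 * (e.g. -1 becomes SIZE_MAX), so memcpy would run far past the
 * destination buffer. Rejecting negative lengths first avoids this. */
int copy_checked(char *dst, size_t cap, const char *src, int len) {
    if (len < 0 || (size_t)len > cap)
        return -1;              /* refuse suspicious lengths */
    memcpy(dst, src, (size_t)len);
    return 0;
}
```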

6) Format String Vulnerabilities

A format string is copied to the output stream, with occurrences of %-conversions representing arguments to be popped from the stack and expanded into the stream. A format string vulnerability occurs if attackers can supply the format string, because it enables them to manipulate the stack, once again making the program write to arbitrary memory locations.
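The classic defense against this pattern (an illustrative sketch, not from the paper) is never to pass untrusted input as the format argument, but only as data:

```c
#include <stdio.h>

/* Render a user-supplied message into a log line. If msg were
 * passed as the format string itself, any %-conversions it
 * contains would be interpreted (with %n even writing memory).
 * Passing it as the argument of "%s" treats it as opaque data. */
int render_log(char *out, size_t cap, const char *msg) {
    return snprintf(out, cap, "%s", msg);  /* msg is data, not format */
}
```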

7) Concurrency Defects

Concurrency errors may lead to concurrency attacks which allow attackers to violate the confidentiality, integrity, and availability of systems [31]. In a race condition the program behavior depends on the timing of thread executions. A special case is a write-write or read-write data race, where the same shared variable is accessed by concurrent threads without proper synchronization. In a Time-of-Check-to-Time-of-Use (TOCTOU) race condition, the checking of a condition and its use are not protected by a critical section. This can be exploited by an attacker, e.g., by changing the file handle between the accessibility check and the actual file access. In general, attacks can be run either by creating a data race due to missing lock/unlock protections, or by exploiting existing data races, e.g., by triggering thread invocations.
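A minimal sketch of the write-write data race described above and its lock-based repair (illustrative POSIX-threads code, not from the paper):

```c
#include <pthread.h>

/* Shared counter and the mutex protecting it. Without the lock,
 * two threads executing counter++ race: the read-modify-write
 * sequences interleave and updates are lost. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* critical section start */
        counter++;
        pthread_mutex_unlock(&lock); /* critical section end */
    }
    return NULL;
}

long run_two_workers(void) {
    pthread_t a, b;
    counter = 0;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;
}
```

With the lock/unlock pair removed, the final count would be nondeterministically smaller than expected — exactly the class of defect a sound interleaving analysis must report.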

Most of the vulnerabilities described above are based on undefined behaviors, and among them buffer overflows seem to play the most prominent role in real-life attacks. Most of them can be used for denial-of-service attacks by crashing the program or causing erroneous behavior. They can also be exploited to inject code and cause the program to execute it, and to extract confidential data from the system. It is worth noting that from the perspective of a static analyzer most exploits are based on potential runtime errors: when an unchecked value is used as an index into an array, the error will only occur if the attacker manages to provide an invalid index value. The obvious conclusion is that safely eliminating all runtime errors due to undefined behaviors in the program significantly reduces the risk of security vulnerabilities.

¹ https://en.wikipedia.org/wiki/Blaster_(computer_worm)
² http://www.kb.cert.org/vuls/id/720951
³ CVE-2009-1888: SAMBA ACLs Uninitialized Memory Read. https://nvd.nist.gov/vuln/detail/CVE-2009-1888

C. Analysis Complexity

While semantics-based static program analysis is becoming more widespread for safety properties, there is practically no such analyzer dedicated to security properties. This is mostly explained by the difference in complexity between safety and security properties. From a semantic point of view, a safety property can always be expressed as a trace property. This means that to find all safety issues, it is enough to look at each execution trace in isolation. If it were not for the overhead and the fact that the detection would occur too late, it would be possible to catch all safety issues using monitors that detect any safety issue during the execution of the critical software.

This is no longer possible for security properties. Most of them can only be expressed as properties of sets of traces, or hyperproperties [7]. A typical example is non-interference [27]: to express that the final value of a variable x can only be affected by the initial value of y, one must consider each pair of possible execution traces with the same initial value of y, and check that the final value of x is the same for both executions. It was proven in [4] that any other definition (tracking assignments, etc.) considering only one execution trace at a time would miss some cases or add false dependencies. This additional level of sets has direct consequences on the difficulty of tracking security properties soundly.

Other examples of hyperproperties are secure information flow policies, service level agreements (which describe acceptable availability of resources in terms of mean response time or percentage uptime), observational determinism (whether a system appears deterministic to a low-level user), and quantitative information flow.

Finding expressive and efficient abstractions for such properties is a young research field (see [30] for a promising approach), which is the reason why no sound analysis of such properties appears in industrial static analyzers. The best solution using the current state of the art consists in using dedicated safety properties as approximations of the security property, with non-standard semantics such as the taint propagation described in Sec. IV.B.

III. PROVING THE ABSENCE OF DEFECTS

In safety-critical systems the use of dynamic memory allocation and recursion typically is forbidden or only permitted in limited ways. This simplifies the task of static analysis, such that for safety-critical embedded systems it is possible to formally prove the absence of runtime errors, or to report all potential runtime errors which still exist in the program. Such analyzers are based on the theory of abstract interpretation [8], a mathematically rigorous formalism providing a semantics-based methodology for static program analysis.

A. Abstract Interpretation

The semantics of a programming language is a formal description of the behavior of programs. The most precise semantics is the so-called concrete semantics, which closely describes the actual execution of the program on all possible inputs. Yet in general the concrete semantics is not computable, and even under the assumption that the program terminates, it is too detailed to allow for efficient computations. The solution is to introduce an abstract semantics that approximates the concrete semantics of the program and is efficiently computable. This abstract semantics can be chosen as the basis for a static analysis. Compared to an analysis of the concrete semantics, the analysis result may be less precise but the computation may be significantly faster.

A static analyzer is called sound if the computed results hold for any possible program execution. Abstract interpretation supports formal correctness proofs: it can be proved that an analysis will terminate and that it is sound, i.e., that it computes an over-approximation of the concrete semantics. Imprecision can occur, but it can be shown that it will always occur on the safe side. In runtime error analysis, soundness means that the analyzer never omits to signal an error that can appear in some execution environment. If no potential error is signaled, definitely no runtime error can occur: there are no false negatives. If a potential error is reported, the analyzer cannot exclude that there is a concrete program execution triggering the error; if there is no such execution, this is a false alarm (false positive). This imprecision is on the safe side: it can never happen that there is a runtime error which is not reported.

The difference between syntactic, unsound semantic, and sound semantic analysis can be illustrated with the example of division by zero. In the expression x/0 the division by zero can be detected syntactically, but not in the expression a/b. When an unsound analyzer does not report a division by zero in a/b, it might still happen in scenarios not taken into account by the analyzer. When a sound analyzer does not report a division by zero in a/b, this is a proof that b can never be 0.
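A tiny example of the situation a sound analyzer must reason about (our own illustration, not from the paper):

```c
/* A sound analyzer that stays silent on the division below must
 * prove that b is never 0 on any path reaching it -- which the
 * explicit guard establishes. Without the guard, a sound tool must
 * raise an alarm, since some execution might pass b == 0. */
int safe_div(int a, int b) {
    if (b == 0)
        return 0;   /* error value chosen only for illustration */
    return a / b;
}
```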

B. Astrée

In the following we concentrate on the sound static runtime error analyzer Astrée [5, 22]. It reports program defects caused by unspecified and undefined behaviors according to the C norm (ISO/IEC 9899:1999 (E)) [6], program defects caused by invalid concurrent behavior, and violations of user-specified programming guidelines, and it computes program properties relevant for functional safety. Users are notified about:

• integer/floating-point division by zero
• out-of-bounds array indexing
• erroneous pointer manipulation and dereferencing (buffer overflows, null pointer dereferencing, dangling pointers, etc.)
• data races
• lock/unlock problems, deadlocks
• integer and floating-point arithmetic overflows
• read accesses to uninitialized variables
• unreachable code
• violations of optional user-defined assertions to prove additional runtime properties, e.g., to guarantee that output variables are within the expected value ranges
• violations of coding rules (MISRA C:2004/2012 incl. Amendment 1, ISO/IEC TS 17961, CERT, CWE) and code metric thresholds; the supported code metrics include the statically computable HIS metrics (HIS 2008), e.g., comment density and cyclomatic complexity
• non-terminating loops.

Astrée computes data and control flow reports containing a detailed listing of accesses to global and static variables, sorted by functions, variables, and processes, and containing a summary of caller/callee relationships between functions. The analyzer can also report each effectively shared variable, the list of processes accessing it, and the types of the accesses (read, write, read/write).

The C99 standard does not fully specify data type sizes, endianness, or alignment, which can vary with different targets or compilers. Astrée is informed about these target ABI settings by a dedicated configuration file in XML format and takes the specified properties into account.

The design of the analyzer aims at reaching the zero-false-alarm objective, which was accomplished for the first time on large industrial applications at the end of November 2003. For keeping the initial number of false alarms low, a high analysis precision is mandatory. To achieve high precision Astrée provides a variety of predefined abstract domains, including the following:

• The interval domain approximates variable values by intervals.
• The octagon domain [19] covers relations of the form x ± y ≤ c for variables x and y and constants c.
• Floating-point computations are precisely modelled while keeping track of possible rounding errors.
• The memory domain empowers Astrée to exactly analyze pointer arithmetic and union manipulations. It also supports a type-safe analysis of absolute memory addresses.
• The clock domain has been specifically developed for synchronous control programs and supports relating variable values to the system clock [9].
• With the filter domain [12], digital filters can be precisely approximated.

Fig. 1: Astrée GUI with alarm overview
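To give a flavor of the interval domain listed above, here is a minimal sketch (our own illustration, not Astrée code) of interval addition and join — the operations an analyzer applies at assignments and at control-flow merges:

```c
/* An interval [lo, hi] over-approximates the set of values a
 * variable may take at a program point. */
typedef struct { int lo, hi; } Interval;

/* Abstract addition: x + y lies in [x.lo + y.lo, x.hi + y.hi].
 * (A real domain would also account for overflow; omitted here.) */
Interval itv_add(Interval x, Interval y) {
    return (Interval){ x.lo + y.lo, x.hi + y.hi };
}

/* Join at a control-flow merge: the smallest interval containing
 * both inputs. It may include values present in neither branch --
 * the over-approximation that keeps the analysis sound. */
Interval itv_join(Interval x, Interval y) {
    return (Interval){ x.lo < y.lo ? x.lo : y.lo,
                       x.hi > y.hi ? x.hi : y.hi };
}
```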

Any remaining alarm has to be manually checked by the developers – and this manual effort should be as low as possible. Astrée explicitly supports investigating alarms in order to understand the reasons for their occurrence. Alarm contexts can be interactively explored, the computed value ranges of variables can be displayed for each context, the call graph is visualized, and a program slicer is available to identify the program parts contributing to a selected defect. By fine-tuning the precision of the analyzer to the software under analysis, the number of false alarms can be further reduced.

To deal with concurrency defects, Astrée has been extended by a sound low-level concurrent semantics [20] which provides a scalable sound abstraction covering all possible thread interleavings. The interleaving semantics enables Astrée, in addition to the classes of runtime errors found in sequential programs, to report data races and lock/unlock problems, i.e., inconsistent synchronization. The set of shared variables does not need to be specified by the user: Astrée assumes that every global variable can be shared, and discovers which ones are effectively shared and on which ones there is a data race. After a data race, the analysis continues by considering the values stemming from all interleavings. Since Astrée is aware of all locks held at every program point in each concurrent thread, it can also report all potential deadlocks.

In some situations data races may be intended behavior. As an example, a lock-free implementation where one process only writes to a variable and another process only reads from it may be correct, although there actually is a data race. However, a prerequisite is that all variable accesses involved are atomic. Astrée explicitly supports such lock-free implementations by providing means to specify the atomicity of basic data type accesses as part of the target ABI specification. Data race alarms explicitly distinguish between atomic and non-atomic accesses.
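In C11 terms, the single-writer/single-reader pattern mentioned above can be sketched with an _Atomic variable (illustrative code, not from the paper):

```c
#include <stdatomic.h>

/* One thread only calls publish(), another only calls observe().
 * Because every access is atomic, no load can observe a torn,
 * half-written value -- the precondition under which such a
 * lock-free data race can be acceptable. */
static _Atomic int sensor_value;

void publish(int v) { atomic_store(&sensor_value, v); }
int  observe(void)  { return atomic_load(&sensor_value); }
```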

Thread priorities are exploited to reduce the amount of spurious interleavings considered in the abstraction and to achieve a more precise analysis. A dedicated task priority domain supports dynamic priorities, e.g., according to the Priority Ceiling Protocol used in OSEK systems. Astrée includes a built-in notion of mutual exclusion locks, on top of which the actual synchronization mechanisms offered by operating systems can be modeled (such as POSIX mutexes or semaphores [13]); program-enforced mutual exclusion is also exploited by Astrée to reduce spurious interleavings. When these features are insufficient to match the concurrency semantics of the analyzed program, Astrée reverts to unrestricted preemption, which ensures sound analysis coverage for all concurrency models, including execution on multi-core processors. In particular, Astrée is not limited to collaborative threads or to discrete sets of preemption points.

Programs to be analyzed are seldom run in isolation; they interact with an environment. In order to soundly report all runtime errors, Astrée must take the effect of the environment into account. In the simplest case the software runs directly on the hardware, in which case the environment is limited to a set of volatile variables, i.e., program variables that can be modified by the environment concurrently and for which a range can be provided to Astrée by formal directives. More often, the program is run on top of an operating system, which it can access through function calls to a system library. When analyzing a program using a library, one possible solution is to include the source code of the library with the program. This is not always convenient (if the library is complex), nor possible, e.g. if the library source is not available, not fully written in C, or ultimately relies on kernel services (as for system libraries). An alternative is to provide a stub implementation, i.e., to write, for each library function, a specification of its possible effect on the program. Astrée provides stub libraries for the ARINC 653 standard, the OSEK/AUTOSAR standards [2, 1], and for POSIX threads.
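The idea of a stub can be illustrated generically (this sketch does not use Astrée's actual directive syntax; a real stub would use the analyzer's nondeterministic-choice primitives, and all names below are hypothetical):

```c
/* In a real analysis, this would be the analyzer's nondeterministic
 * choice primitive; for illustration we give a trivial executable
 * model that simply returns the lower bound. */
int nondet_int_in_range(int lo, int hi) {
    (void)hi;
    return lo;
}

/* Hypothetical stub for a temperature-sensor library call whose
 * source is unavailable: the stub states only the function's
 * possible effect -- it returns some value in [-40, 125] and
 * modifies no other program state. */
int read_temp(void) {
    return nondet_int_in_range(-40, 125);
}
```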

A particularity of OSEK is that system resources, including tasks, are not created dynamically at program startup; instead they are hardcoded in the system: a specific tool reads a configuration file in OIL format (OSEK Implementation Language) describing these resources and generates a dedicated version of the system to be linked against the application. To support this workflow, Astrée provides its own OIL file reader and automatically creates the implementation code from the OIL file. Combining the C sources of the OSEK application, the fixed OSEK stub provided with Astrée, and the C file automatically generated from the OIL file, we obtain a stand-alone application without any undefined symbols that can be analyzed with Astrée and faithfully models the execution of the application in an OSEK environment. This workflow enables a high level of automation with minimal configuration when analyzing OSEK applications.

Practical experience with avionics and automotive industry applications is reported in [21, 17]. These reports show that industry-sized programs of millions of lines of code can be analyzed in acceptable time with high precision for runtime errors and data races.

IV. CONTROL AND DATA FLOW ANALYSIS

Safety standards like DO-178C and ISO 26262 require performing control and data flow analysis as a part of software unit testing and in order to verify the software architectural design. Investigating control and data flow is also the subject of the Data Safety guidance [11], and it is a prerequisite for analyzing confidentiality and integrity properties as part of a security case. Technically, any semantics-based static analysis is able to provide information about data and control flow, since this is the basis of the actual program analysis. However, data and control flow analysis has many aspects, and for some of them tailored analysis mechanisms are needed.

Global data and control flow analysis gives a summary of variable accesses and function invocations throughout program execution. In its standard data and control flow reports, Astrée computes the number of read/write accesses for every global or static variable and lists the location of each access along with the function from which the access is made and the thread in which the function is executed. The control flow is described by listing all callers and callees for every C function, along with the threads in which they can run. Indirect variable accesses via pointers as well as function pointer call targets are fully taken into account. Astrée also provides a call graph enhanced by data flow and concurrency information, which can be interactively explored.

Fig. 2: Call Tree Visualization enhanced by Data Flow and Concurrency Information

More sophisticated information can be provided by two dedicated analysis methods: program slicing and taint analysis. Program slicing [29] aims at identifying the part of the program that can influence a given set of variables at a given program point. Applied to a result value, e.g., it shows which functions, which statements, and which input variables contribute to its computation. Taint analysis tracks the propagation of specific data values through program execution. It can be used, e.g., to determine program parts affected by corrupted data from an insecure source. In the following we give a more detailed overview of both techniques.

A. Program Slicing

A slicing criterion of a program P is a pair ⟨s, V⟩ where s is a statement and V is a set of variables of P. Intuitively, a slice is a subprogram of P which has the same behavior as P with respect to the slicing criterion ⟨s, V⟩. Computing a statement-minimal slice is an undecidable problem, but using static analysis, approximate slices can be computed. As an example, Astrée provides a program slicer which can produce sound and compact slices by exploiting the invariants from Astrée's core analysis, including points-to information for variable and function pointers. A dynamic slice does not contain all statements potentially affecting the slicing criterion, but only those relevant for a specific subset of program executions, e.g., only those in which an error value can result.

Computing sound program slices is relevant for demonstrating safety and security properties. It can be used to show that certain parts of the code or certain input variables might influence, or cannot influence, a program section of interest.

B. Taint Analysis

In the literature, taint analysis is often mentioned in combination with unsound static analyzers, since it allows efficiently detecting potential errors in the code, e.g., array-index-out-of-bounds accesses or infeasible library function parameters [14, 15]. Inside a sound runtime error analyzer this is not needed, since typically more powerful abstract domains can track all undefined or unspecified behaviors. Inside a sound analyzer, taint analysis is primarily a technique for analyzing security properties. Its advantage is that users can flexibly specify taints, taint sources, and taint sinks, so that application-specific data and control flow requirements can be modeled.

In order to leverage this efficient family of analyses in sound analyzers, one must formally define the properties that may be checked using such techniques. Then it is possible to prove that a given implementation is sound with respect to that formal definition, leading to clean and well-defined analysis results. Taint analysis consists of discovering data dependencies using the notion of taint propagation. Taint propagation can be formalized using a non-standard semantics of programs, where an imaginary taint is associated with some input values. Consider a standard semantics using a successor relation between program states, where a program state is a map from memory locations (variables, program counter, etc.) to values in V. The tainted semantics then relates tainted states, which are maps from the same memory locations to V × {taint, notaint}, such that projecting on V yields the same relation as the standard semantics.

To define what happens to the taint part of a tainted value, one must define a taint policy. The taint policy specifies:

• Taint sources: a subset of input values or variables such that in any state, the values associated with these input values or variables are always tainted.
• Taint propagation: describes how the taint gets propagated. Typical propagation is through assignment, but more complex propagation can take more control flow into account, and may not propagate the taint through all arithmetic or pointer operations.
• Taint cleaning: an alternative to taint propagation, describing all the operations that do not propagate the taint. In this case, all assignments not containing a taint-cleaning operation will propagate the taint.
• Taint sinks: an optional set of memory locations. This has no semantic effect, except to specify conditions under which an alarm should be emitted when verifying a program (an alarm must be emitted if a taint sink may become tainted for a given execution of the program).
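The policy above can be made concrete with a toy tainted-value type (our own illustration, not Astrée's implementation): each value carries a taint flag, arithmetic propagates it, and a sink check raises an alarm when a tainted value may reach the sink:

```c
#include <stdbool.h>

/* A value paired with its taint flag, mirroring V x {taint, notaint}. */
typedef struct { int v; bool tainted; } TVal;

/* Taint source: values read from an untrusted input are tainted. */
TVal from_untrusted(int v) { return (TVal){ v, true  }; }
TVal from_constant(int v)  { return (TVal){ v, false }; }

/* Propagation through arithmetic: the result is tainted if either
 * operand is. Projecting away the flag gives ordinary addition. */
TVal t_add(TVal a, TVal b) {
    return (TVal){ a.v + b.v, a.tainted || b.tainted };
}

/* Sink check: returns true (alarm) if a tainted value may reach
 * the sink, e.g. an array index or a format string argument. */
bool sink_alarm(TVal x) { return x.tainted; }
```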

A sound taint analyzer computes an over-approximation of the memory locations that may be mapped to a tainted value during program execution. The soundness requirement ensures that no taint sink warning will be overlooked by the analyzer. The tainted semantics can easily be extended to a mix of different hues of tainting, corresponding to an extension of the taint set associated with values. Propagation can then become more complex, with taints not just being propagated but also changing hue depending on the instruction. Such extensions lead to a rather flexible and powerful data dependency analysis while remaining scalable.

V. CONCLUSION

In this article we have given an overview of code-level defects and vulnerabilities relevant for functional safety and security. We have shown that many security attacks can be traced back to behaviors undefined or unspecified according to the C semantics. By applying sound static runtime error analyzers, a high degree of security can be achieved for safety-critical software, since the absence of such defects can be proven. In addition, security hyperproperties require additional analyses to be performed which, by nature, have a high complexity. We have given two examples of scalable dedicated analyses, program slicing and taint analysis. Applied as extensions of sound static analyzers, they allow further increasing confidence in the security of safety-critical embedded systems.

ACKNOWLEDGMENT

The work presented in this paper was funded within the project ARAMiS II by the German Federal Ministry for Education and Research with the funding ID 01|S16025. The responsibility for the content remains with the authors.

REFERENCES<br />

[1] AUTOSAR (AUTomotive Open System ARchitecture). http://-<br />

www.autosar.org.<br />

[2] OSEK/VDX Operating System. Version 2.2.3, 2005.<br />

[3] AbsInt GmbH. Safety Manual for aiT, Astrée, StackAnalyzer, 2015.<br />

[4] Mounir Assaf, David A. Naumann, Julien Signoles, Eric Totel, and<br />

Frédéric Tronel. Hypercollecting semantics and its application to static<br />

analysis of information flow. CoRR, abs/1608.01654, 2016.<br />

[5] B. Blanchet, P. Cousot, R. Cousot, J. Feret, L. Mauborgne, A. Miné, D.<br />

Monniaux, and X. Rival. A Static Analyzer for Large Safety-Critical<br />

Software. In Proc. of PLDI’03, pages 196–207. ACM Press, June 7–14<br />

2003.<br />

[6] JTC1/SC22. Programming languages – C, 16 Dec. 1999.<br />

[7] Michael R. Clarkson and Fred B. Schneider. Hyperproperties. Journal of<br />

Computer Security, 18:1157–1210, 2010.<br />

[8] P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model<br />

for static analysis of programs by construction or approximation of<br />

fixpoints. In Proc. of POPL’77, pages 238–252. ACM Press, 1977.<br />

[9] Patrick Cousot, Radhia Cousot, Jérôme Feret, Antoine Miné, Laurent<br />

Mauborgne, David Monniaux, and Xavier Rival. Varieties of Static<br />

Analyzers: A Comparison with ASTRÉE. In First Joint IEEE/IFIP<br />

Symposium on Theoretical Aspects of Software Engineering, TASE 2007,<br />

pages 3–20. IEEE Computer Society, 2007.<br />

[10] D. Delmas and J. Souyris. ASTRÉE: from Research to Industry. In Proc.<br />

14th International Static Analysis Symposium (SAS2007), number 4634<br />

in LNCS, pages 437–451, 2007.<br />

[11] SCSC Data Safety Initiative Working Group [DSIWG]. Data Safety<br />

(Version 2.0)[SCSC-127B]. Technical report, Safety-Critical Systems<br />

Club, Jan 2017.<br />

[12] Jérôme Feret. Static analysis of digital filters. In Proc. of ESOP’04,<br />

volume 2986 of LNCS, pages 33–48. Springer, 2004.<br />

[13] IEEE Computer Society and The Open Group. Portable operating system<br />

interface (POSIX) – Application program interface (API) amendment 2:<br />

Threads extension (C language). Technical report, ANSI/IEEE Std.<br />

1003.1c-1995, 1995.<br />

[14] CERT, Software Engineering Institute. SEI CERT C Coding Standard –<br />

Rules for Developing Safe, Reliable, and Secure Systems. Carnegie<br />

Mellon University, 2016.<br />

[15] ISO/IEC. Information Technology – Programming Languages, Their<br />

Environments and System Software Interfaces – Secure Coding Rules<br />

(ISO/IEC TS 17961), Nov 2013.<br />

[16] D. Kästner. Applying Abstract Interpretation to Demonstrate Functional<br />

Safety. In J.-L. Boulanger, editor, Formal Methods Applied to Industrial<br />

Complex Systems. ISTE/Wiley, London, UK, 2014.<br />

[17] D. Kästner, A. Miné, L. Mauborgne, X. Rival, J. Feret, P. Cousot,<br />

A. Schmidt, H. Hille, S. Wilhelm, and C. Ferdinand. Finding All<br />

Potential Runtime Errors and Data Races in Automotive Software. In<br />

SAE World Congress 2017. SAE International, 2017.<br />

[18] D. Kästner, C. Rustemeier, U. Kiffmeier, D. Fleischer, S. Nenova,<br />
R. Heckmann, M. Schlickling, and C. Ferdinand. Model-Driven Code<br />
Generation and Analysis. In SAE World Congress 2014. SAE International, 2014.<br />

[19] A. Miné. The Octagon Abstract Domain. Higher-Order and Symbolic<br />

Computation, 19(1):31–100, 2006.<br />

[20] A. Miné. Static analysis of run-time errors in embedded real-time parallel<br />

C programs. Logical Methods in Computer Science (LMCS), 8(26):63,<br />

Mar. 2012.<br />

[21] A. Miné and D. Delmas. Towards an Industrial Use of Sound Static<br />

Analysis for the Verification of Concurrent Embedded Avionics<br />

Software. In Proc. of the 15th International Conference on Embedded<br />

Software (EMSOFT’15), pages 65–74. IEEE CS Press, Oct. 2015.<br />

[22] A. Miné, L. Mauborgne, X. Rival, J. Feret, P. Cousot, D. Kästner,<br />

S. Wilhelm, and C. Ferdinand. Taking Static Analysis to the Next Level:<br />

Proving the Absence of Run-Time Errors and Data Races with Astrée.<br />

Embedded Real Time Software and Systems Congress ERTS².<br />

[23] MISRA Limited. MISRA-C:2004 Guidelines for the use of the C<br />

language in critical systems, Oct. 2004.<br />

[24] MISRA Limited. MISRA-C:2012 Guidelines for the use of the C<br />

language in critical systems, Mar. 2013.<br />

[25] MISRA Limited. MISRA-C:2012 – Addendum 2. Coverage of MISRA<br />

C:2012 against ISO/IEC TS 17961:2013 "C Secure", Apr. 2016.<br />

[26] MISRA Limited. MISRA-C:2012 Amendment 1 – Additional security<br />

guidelines for MISRA C:2012, Apr. 2016.<br />

[27] A Sabelfeld and A. C. Myers. Language-based information-flow<br />

security. IEEE Journal on Selected Areas in Communications, 21(1):5–<br />

19, 2003.<br />

[28] Jean Souyris, Erwan Le Pavec, Guillaume Himbert, Victor Jégu,<br />

Guillaume Borios, and Reinhold Heckmann. Computing the worst case<br />

execution time of an avionics program by abstract interpretation. In<br />

Proceedings of the 5th Intl Workshop on Worst-Case Execution Time<br />

(WCET) Analysis, pages 21–24, 2005.<br />

[29] Mark Weiser. Program slicing. In Proceedings of the 5th International<br />

Conference on Software Engineering, ICSE ’81, pages 439–449. IEEE<br />

Press, 1981.<br />

[30] Wired.com. The jeep hackers are back to prove car hacking can get much<br />
worse. https://www.wired.com/2016/08/jeep-hackers-return-high-speed-steering-acceleration-hacks/, 2016.<br />

[31] Junfeng Yang, Ang Cui, John Gallagher, Sal Stolfo, and Simha<br />
Sethumadhavan. Concurrency attacks. In Fourth USENIX Workshop on<br />
Hot Topics in Parallelism (HotPar '12), 2012.<br />

[32] Yves Younan, Wouter Joosen, and Frank Piessens. Code injection in C<br />
and C++: A survey of vulnerabilities and countermeasures. Technical<br />

report, Departement Computerwetenschappen, Katholieke Universiteit<br />

Leuven, 2004.<br />



C++17: Analysis and risk mitigation of security<br />

vulnerabilities<br />

Walter Capitani<br />

Product manager, Klocwork<br />

Rogue Wave Software<br />

Ottawa, ON, Canada<br />

walter.capitani@roguewave.com<br />

Abstract— With the recent approval of the C++17 language standard,<br />
along with new features introduced in C++11 and C++14,<br />
embedded developers are writing code in all sorts of exciting new<br />
ways. However, these new features also introduce new points of<br />
failure and new attack vectors for hackers.<br />

This paper explores the impact of these new features on<br />

software quality and identifies new and expanded security<br />

vulnerabilities and attack vectors that can be exploited. Based on<br />

an analysis of the standards by language experts and actual<br />

running code, sample vulnerabilities and defects will be presented,<br />

and techniques and standards for reducing risk will be evaluated.<br />

Keywords— C++, software security, software quality, best practices, embedded software development<br />

I. INTRODUCTION<br />

C++ continues to be one of the most popular programming<br />

languages in the world, even 33 years after its first release. The<br />

latest version of the C++ standard, C++17, was released in<br />

December 2017, introducing many new features to simplify the<br />

language, support large-scale systems, and improve<br />

concurrency. With every version, the community tries to<br />

improve code security by adapting to strategies employed by<br />

malicious entities, yet there remain some ways in which the<br />

language can be exploited.<br />

TABLE I. C++ POPULARITY ACROSS DIFFERENT SITES<br />
<br />
Site | Method | Ranking<br />
GitHub | Opened pull requests | 6 (a)<br />
TIOBE | Search query hits | 3 (b)<br />
IEEE Spectrum | 12 different metrics | 4 (c)<br />
RedMonk | Data from GitHub and StackOverflow | 6 (d)<br />
<br />
a. octoverse.github.com<br />
b. tiobe.com/tiobe-index/<br />
c. spectrum.ieee.org/static/interactive-the-top-programming-languages-2017<br />

This paper examines the C++17 standard (ISO/IEC<br />

14882:2017) to identify potential security vulnerabilities and<br />

provide examples. It is hoped that this identification will assist<br />

compiler developers, application developers, and software<br />

testers in the remediation of the vulnerabilities.<br />

II. IMPROPER DEALLOCATION OF DYNAMICALLY-<br />

ALLOCATED RESOURCES<br />

Potential security issues can be introduced by the incorrect<br />

usage of smart pointers, a problem that was introduced in C++11<br />

and not addressed in C++17. The following code sample, from<br />

the CERT C++ Coding Standard (MEM51-CPP: Properly<br />

deallocate dynamically allocated resources 1 ), illustrates the<br />

issue:<br />

1 #include <memory><br />
2<br />
3 struct S {};<br />
4<br />
5 void f() {<br />
6   std::unique_ptr<S> s{new S[10]};<br />
7 }<br />

Here, a std::unique_ptr<S> is declared to hold a pointer to<br />
a single object but is initialized with an array of S objects. When the<br />
std::unique_ptr is destroyed as it goes out of scope on line<br />
7, undefined behavior results because, by default, delete is<br />
called instead of delete[]. This could cause abnormal<br />
application termination, memory leaks, or other issues.<br />

To avoid this issue, the std::unique_ptr should be declared<br />
to hold an array of S objects, to ensure the correct deleter is<br />
called upon destruction, and std::make_unique<S[]>() should be<br />
called to initialize the smart pointer, which alerts the user if the<br />
resulting std::unique_ptr is not of the correct type. This<br />
solution is below.<br />

d. redmonk.com/sogrady/2017/06/08/language-rankings-6-17<br />
1. wiki.sei.cmu.edu/confluence/display/cplusplus/MEM51-CPP.+Properly+deallocate+dynamically+allocated+resources<br />



1 #include <memory><br />
2<br />
3 struct S {};<br />
4<br />
5 void f() {<br />
6   std::unique_ptr<S[]> s = std::make_unique<S[]>(10);<br />
7 }<br />

III. LAMBDA OBJECTS<br />

Referencing EXP61-CPP from the CERT C++ Coding<br />
Standard 2 , undefined behavior may result when a lambda object<br />
captures the this pointer and then outlives its enclosing object.<br />
Although that rule is written against C++14, the following code<br />
sample illustrates the same issue for C++17.<br />

1 #include <iostream><br />
2 #include <memory><br />
3<br />
4 class C {<br />
5 public:<br />
6   C() : p(std::make_unique<int>(10)) { }<br />
7   ~C() { p.release(); }<br />
8<br />
9   auto f() {<br />
10     return [this] { return p.get(); };<br />
11   }<br />
12<br />
13 private:<br />
14   std::unique_ptr<int> p;<br />
15 };<br />
16<br />
17 int main()<br />
18 {<br />
19   auto myf = C().f();<br />
20   auto pp = myf();<br />
21   *pp += 2;<br />
22   return 0;<br />
23 }<br />

Here, the function f() returns a lambda that captures the<br />
this pointer of an object of class C. On line 19, the temporary<br />
object C() is destroyed, but the lambda still holds a reference to<br />
it. On line 20, the captured this object no longer exists, which<br />
may cause unpredictable behavior when the freed memory is used<br />
on line 21. In this example, a null pointer dereference was forced<br />
(which should cause a crash of the application) by the code on line<br />
7. Without line 7, the error is subtler and would be a use of freed<br />
memory.<br />

The fix for this is to either extend the lifetime of the object<br />

of class C or copy the object of class C when creating the lambda<br />

object. This sample illustrates the former idea:<br />

1 #include <iostream><br />
2 #include <memory><br />
3<br />
4 class C {<br />
5 public:<br />
6   C() : p(std::make_unique<int>(10)) { }<br />
7   ~C() { p.release(); }<br />
8<br />
9   auto f() {<br />
10     return [this] { return p.get(); };<br />
11   }<br />
12<br />
13 private:<br />
14   std::unique_ptr<int> p;<br />
15 };<br />
16<br />
17 int main()<br />
18 {<br />
19   C c;<br />
20   auto myf = c.f();<br />
21   auto pp = myf();<br />
22   *pp += 2;<br />
23   return 0;<br />
24 }<br />

IV. BREAKING OF BACKWARD COMPATIBILITY<br />

As a general goal, the creators of the C++ standards attempt<br />
to maintain backward compatibility between versions; however,<br />
there are cases where compatibility is broken to correct<br />
undesired behavior present in older versions. This can present<br />
potential issues when a user attempts to use newer behavior with<br />
a compiler enforcing an older standard.<br />

An example of this case is over-aligned types, which CERT<br />

has a rule for, MEM57-CPP: Avoid using default operator new<br />

for over-aligned types 3 . In this code sample from the CERT<br />

website, the new expression invokes the default operator new on<br />

line 6, which constructs an object of the user-defined type Vector<br />

with an alignment of 32 bytes (line 1), exceeding the typical<br />

alignment of 16 bytes for most implementations. This can cause<br />

unpredictable behavior if this object is passed into SIMD (single<br />

instruction, multiple data) vectorization instructions, which<br />

require specific aligned arguments.<br />

1 struct alignas(32) Vector {<br />

2 char elems[32];<br />

3 };<br />

4<br />

5 Vector *f() {<br />

6 Vector *pv = new Vector;<br />

7 return pv;<br />

8 }<br />

This behavior was fixed in the C++17 standard 4 but can still<br />

occur if using an older compiler. This can cause a security issue<br />

such as abnormal termination of the application.<br />

To avoid this, the best practice would be for developers to<br />

know which standard their compiler enforces, which standard<br />

they are coding to, and to avoid programming practices that<br />

differ between the two unless the implications are clearly<br />

understood. The use of a static analysis tool would help automate<br />

the detection of these issues.<br />

2. wiki.sei.cmu.edu/confluence/display/cplusplus/EXP61-CPP.+A+lambda+object+must+not+outlive+any+of+its+reference+captured+objects<br />
3. wiki.sei.cmu.edu/confluence/display/cplusplus/MEM57-CPP.+Avoid+using+default+operator+new+for+over-aligned+types<br />
4. open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0035r4.html<br />



V. IMPRECISION WITH THE STANDARD<br />

With a standard as broad and complex as C++ (and the<br />
inclusion of many different opinions and practices), there is the<br />
possibility of imprecision in the guidelines that leaves room for<br />
interpretation. The C++ committee recognizes this and<br />
maintains a list of defect reports for future improvement 5 , but<br />
examples can still be observed today.<br />

Defect report 2176 documents an issue with throwing<br />
destructors that can still arise with the C++17-compliant Clang<br />
compiler. This code sample is from the defect report:<br />

1 #include <cstdio><br />
2<br />
3 struct X {<br />
4   X() { puts("X()"); }<br />
5   X(const X&) { puts("X(const X&)"); }<br />
6   ~X() { puts("~X()"); }<br />
7 };<br />
8<br />
9 struct Y { ~Y() noexcept(false) { throw 0; } };<br />

10<br />

11 X f() {<br />

12 try {<br />

13 Y y;<br />

14 return {};<br />

15 } catch (...) {<br />

16 }<br />

17 return {};<br />

18 }<br />

19<br />

20 int main() {<br />

21 f();<br />

22 }<br />

The issue is that the current compiler implementation prints<br />
X() twice but ~X() only once: an object is constructed twice but<br />
destroyed only once, which is incorrect. This can cause security<br />
issues such as memory leaks and incorrect resource handling.<br />

Note that there is a proposed resolution to correct this<br />

behavior 6 .<br />

VI. SUMMARY<br />

C++ continues to be a popular programming language and<br />

the release of the C++17 standard illustrates its longevity. With<br />

this release comes potential security vulnerabilities that<br />

developers should be aware of and take steps to prevent. Best<br />

practices for avoidance include following the general secure<br />

coding guidelines listed in the CERT C++ Coding Standard,<br />

ensuring dynamically allocated resources are declared and<br />

dereferenced correctly, and understanding any gaps or<br />

compatibility issues between the standard and compiler being<br />

used. A useful method for implementing these best practices to<br />

prevent potential security vulnerabilities is to use a static code<br />

analysis tool.<br />

5. open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html<br />
6. open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html<br />



How can you sustain high performance with<br />

functional safety features?<br />

Jon Taylor<br />

Senior Technology Manager, Arm<br />

Cambridge, UK<br />

Abstract— Embedded systems markets are seeing a relentless<br />

push for higher performance, while at the same time markets such<br />

as drones and robotics are introducing requirements for<br />

functional safety. Software developers concerned about<br />

developing for safe systems may be using coding standards such as<br />

MISRA, or safety-certified compilers and operating systems. The<br />

underlying hardware is also adding more features to support<br />

functionally safe systems, and it’s imperative that software<br />

engineers understand these features to get best performance from<br />

the design and meet their safety goals.<br />

This paper considers methods to achieve both high<br />

performance and high levels of functional safety, and will help<br />

software engineers understand features in both existing and new<br />

hardware platforms. It will discuss features including software test<br />

libraries, error correcting memories and detecting both software<br />

and hardware faults at runtime. It also considers at a higher level<br />

how hardware features such as additional privilege levels for<br />

virtualization allow new software development methodologies.<br />

Keywords— Functional safety, embedded software, real-time<br />

I. INTRODUCTION<br />

Functional safety is concerned with the mitigation of faults<br />

that might cause hazards in a system in operation. With respect<br />

to hardware, there are two main sources of fault to consider. The<br />

first is systematic faults – faults that might occur due to mismatches<br />

between requirements and implementation. These are addressed<br />

with rigorous design and verification processes, and further<br />

discussion of this is outside the scope of this paper. The other is<br />

runtime, random faults. These can occur for many reasons, but<br />

the more common occurrences are due to events such as<br />

radiation particle strikes, or the system itself failing due to age.<br />

To detect these faults, additional hardware, software, or a<br />

combination can be used. This paper considers some of these<br />

features, and the effects they may have on software performance.<br />

Consideration is given to areas that particularly affect<br />

performance. This paper does not cover techniques to develop<br />

software in a manner to meet functional safety goals; it is about<br />

understanding hardware features and the effects they can have<br />

on software performance.<br />

New markets, particularly involving autonomous systems,<br />

are now mixing the requirement for very high levels of compute<br />

with high levels of functional safety, and this provides a<br />

challenge for system designers. This can be made more<br />

challenging when these systems also add hard real-time response<br />

requirements too.<br />

II. RUN TIME TESTING<br />

In safety-critical applications, there is a concern that latent faults<br />

can accumulate over time, which may eventually prevent safe<br />

operation of a device. Imagine a car air-bag. For the vast<br />

majority of its life, it will be quiescent, and the majority of its<br />

circuitry will not be used. But in an accident it will need to<br />

deploy, and previously unused circuitry will be used. If faults<br />

have accumulated in that circuitry over time, it may operate incorrectly.<br />

To protect against this occurring, diagnostic tests may be run<br />

during normal operation. These can include testing both<br />

memories and logic.<br />

Sometimes these tests can be run only at boot, but depending<br />

on application and safety analysis, there can also be a<br />

requirement for testing during normal operation at runtime.<br />

A. Memory Test<br />

While error-correcting (ECC) memory can detect and correct<br />

for some faults, this only happens when any specific address is<br />

accessed. Memory Built-in Self Test (BIST) can operate over<br />

the whole memory array, in what is often called “scrubbing”.<br />

This prevents faults accumulating over time. When a memory<br />

is being accessed for testing, it is unavailable to the processor<br />

core. This can potentially cause performance (or availability)<br />

problems, particularly if the operation runs for an extended<br />

period of microseconds or milliseconds. In the case of a cache,<br />

the processor could clean the cache, then continue operating<br />

from backing memory while the test completes, albeit at lower<br />

performance. However, companies such as Arm have<br />

introduced an alternative approach, called on-line MBIST. This<br />

allows one or two addresses to be tested in isolation, in a very<br />

short time period (a few tens of cycles). This can be done with<br />

minimal interference to normal operation, but over a longer<br />

time period can provide coverage of the whole memory. It can<br />

also be used to test whole memories efficiently during boot.<br />

For the developer, on-line MBIST is interesting as it allows<br />
fine-grained control over what testing happens and how frequently.<br />



In the case where an error is detected, it could also run a test<br />

sequence to check whether the error is transient or permanent.<br />

B. Software test<br />

Software test is an elegant solution to the requirement for runtime<br />

testing of logic. In the same way that an RTOS can<br />

interleave different tasks, the software test routines can be<br />

considered as other tasks. They consist of carefully crafted test<br />

functions, written in assembler to ensure that instruction<br />

sequences and register use don’t change. By doing this,<br />

the processor designers can measure fault coverage of these<br />

sequences against the processor logic to ensure the desired<br />

coverage targets are achieved (this can only be done in<br />

simulation – it’s not something that can be done on silicon). The<br />

functions are then wrapped with C to provide a more<br />

programmer friendly interface.<br />

The benefit of having a software test library is that a system<br />

designer can choose how often to interleave code sequences<br />

with normal operation to achieve their coverage goals, and<br />

these can be run as small sections of code, relatively frequently,<br />

so they don’t affect availability in the way logic BIST would.<br />

Both software and memory BIST can be controlled by software<br />

running on the processor itself and are readily interleaved with<br />

normal operation. The third kind of coverage (logic BIST) is<br />

somewhat different.<br />

C. Logic test<br />

Logic BIST involves using manufacturing test logic (scan) to<br />

load patterns in and out of the processor logic, and is therefore<br />

destructive of processor state. These patterns are designed to<br />

provide test coverage that all the processor logic is operating<br />

correctly. Often logic BIST will be used during boot to ensure<br />

a device is functioning correctly prior to normal operation (for<br />

instance, after turning a car's ignition on, the ECUs may be tested<br />

with LBIST).<br />

Running logic BIST will destroy any state in the processor, so<br />

it means taking a processor out of use while all the processor<br />

state is scanned out (and saved), the test run, then state restored.<br />

Unless you have multiple processors and some redundant<br />
capacity, this impact on availability usually makes run-time logic<br />
BIST prohibitive. It also requires additional capability<br />

outside the processor to manage the saving and restoring of<br />

processor state.<br />

In a many-processor system, it might be part of the availability<br />

strategy that processors are rotated out of and back into use to<br />

allow high levels of fault coverage during normal operation.<br />

Logic BIST is mostly outside of scope for software developers;<br />

however, if you are considering a scheme of taking cores in and<br />

out of use, then power sequences need to be considered, along<br />

with the ability to migrate tasks between processors. For system<br />

developers the consideration is around what level of availability<br />

must be maintained to achieve the total performance<br />

requirements needed, coupled with how often the testing needs<br />

to be performed.<br />

III. ERROR-CORRECTING MEMORY<br />

A common feature on many processors now is error-correcting<br />

memory. As process geometries have shrunk, memory has<br />

become more susceptible to bit flips caused by radiation, even<br />

on earth. One of the more common types of ECC protection is<br />

able to correct single bit errors, and detect double bit errors<br />

(SEC-DED). Usually the operation of this is transparent to the<br />

developer, however there are cases where it can affect<br />

performance, or even operation of the device.<br />

A single-bit error can usually be corrected in-line (the encoding<br />

used to detect the error contains enough information to correct<br />

it), i.e. the processor doesn’t have to perform any further<br />

memory accesses. However, the corrected value will need to be<br />

re-written to the memory to prevent an increase in the number<br />

of errors over time – i.e. single-bit correctable errors becoming<br />

double-bit uncorrectable errors. This will usually happen<br />

without affecting program operation – although if several errors<br />

occur in the middle of multiple back-to-back memory accesses,<br />

it may incur a slight performance penalty. A double-bit error<br />

may even be recoverable: if the access is a read from a clean<br />

cacheline, the data can be re-fetched from main memory –<br />

although this of course incurs a performance overhead.<br />

A performance impact is more likely when accessing<br />

small items of data. The ECC code adds an overhead in terms<br />

of memory storage. For example, SEC-DED ECC for a byte<br />

would require five additional check bits. The encoding of this<br />

code becomes more efficient as the data size increases – a 32-<br />

bit word can be protected with seven bits of ECC coding. The<br />

tradeoff is that ECC is calculated across the whole data – so if<br />

you write a byte to memory which has ECC chunk size of 32-<br />

bits, it will require a read-modify-write operation to calculate<br />

the new ECC value with the new byte of data. Again, this is<br />

always transparent to the software, but if many accesses are<br />

made close to each other in time, the processor may not be able<br />

to perform all of these operations without affecting<br />

performance. This is something that may need to be considered<br />

by the system designer too. For instance, an instruction cache<br />

may use 64-bit ECC encoding (as values are rarely written), but<br />

a data cache may use 32-bit ECC as this is a common size of<br />

access.<br />

Another benefit of using a smaller ECC chunk size is higher<br />

resilience to faults (fig 1). When a chunk size of 64-bits is used,<br />

a two-bit error within this word may result in an uncorrectable<br />

error. If 32-bit chunks had been used and one fault occurred in<br />

each half of the 64-bit data value, then this would be correctable<br />

(assuming a SEC-DED scheme). To know which to use comes<br />

back to the system designer to consider expected error rate and<br />

availability requirements, along with the most common size of<br />

memory access.<br />

[Figure content: two single-bit errors that fall in separate 32-bit ECC chunks are each corrected, while the same two errors within one 64-bit ECC chunk can only be detected.]<br />

Figure 1 - comparison of ECC schemes<br />



A final consideration for ECC is what happens if the processor<br />

encounters a permanent fault. In this situation, an error is<br />

detected during a read operation. The processor tries to correct<br />

this by performing a write of the corrected data, before<br />

performing the read again. However in the case of a permanent<br />

fault, the error will recur. A hard error cache allows this location<br />

to be marked as bad, and instead the hard error cache is used to<br />

replace this location. This allows the processor to continue to<br />

make forward progress. Having this hard error cache requires<br />

additional memory, so it may be only a very small number of<br />

entries that can be supported. So long as the errors encountered<br />

are only single bit, forward progress can still be made, albeit<br />

with reduced performance.<br />

As well as considering the performance impact of ECC,<br />

software developers may also want to track the rate of errors<br />

occurring. ECC may operate transparently to the developer, or<br />

it may be recorded in special registers such that software can<br />

track what errors are occurring and measure how often they<br />

occur. This could be used to predict failure and indicate to a<br />

user that a hardware failure might be imminent.<br />

IV. MEMORY PROTECTION AND MANAGEMENT<br />

Many embedded processors these days include a memory<br />

protection unit or memory management unit. Effective use of<br />

these is critical to safe software development. These are usually<br />

supported by operating systems, but may not be enabled by<br />

default or if you run a bare-metal environment. The advantages<br />
of being able to protect sections of memory or peripherals from<br />
code that shouldn't be able to access them should be obvious,<br />

but the performance aspect may need a little more<br />

consideration.<br />

A. Memory Protection Units (MPU)<br />

For hard real-time applications, a processor with a memory<br />

protection unit is a good solution. These typically have a fixed<br />

number of regions. The main advantage of an MPU is that<br />

lookups are deterministic, and usually place no additional<br />

overhead on a memory access. At a basic level, they can be used<br />

for stack protection (a stack overflow results in a trap, rather<br />

than data corruption), marking code as read-only, and data as<br />

execute-never. In a system with an operating system, the OS<br />

may need to reconfigure the MPU between different tasks.<br />

Depending on the frequency of the task switches and<br />

complexity of the memory map, this reconfiguration should take only a<br />
small amount of time, but it is something the developer may need to<br />

consider.<br />

B. Memory management units (MMU)<br />

Operating systems such as Linux use an MMU to abstract<br />

applications from the underlying memory system. In processors<br />

with hardware virtualization support, a second stage of<br />

translation is added such that only the hypervisor knows the<br />

physical memory map; guest OSs have their accesses translated,<br />

and applications within those guests have a second layer of<br />

translation. Like MPUs, this also provides isolation and<br />

freedom from interference between guest operating systems and<br />

applications.<br />

Having memory address translation is a very flexible approach,<br />

and makes software more easily portable between platforms,<br />

but there can be a hidden cost. Unlike an MPU, with a fixed<br />

number of regions, typically an MMU does not have the same<br />

limit. A processor cannot store all these translations internally,<br />

so it uses a cache (called a Translation Lookaside Buffer, or<br />

TLB). The performance variation occurs depending on whether<br />

it hits in this translation cache, or goes to the main page tables<br />

stored in main memory (this is called a “page table walk” and,<br />

because it accesses main memory, may take many cycles). Fig<br />

2 shows the different approaches of MPU and MMU and how<br />

applications see the memory map of the device.<br />

[Figure 2 – MMU vs MPU: with an MMU, applications (App1, App2) and the kernel occupy virtual address spaces that stage-1 translation maps onto physical addresses; with an MPU, tasks (Task1, Task2) and the kernel are checked against protection regions but the physical address equals the virtual address.]<br />
<br />
V. VIRTUALIZATION<br />
<br />
Memory protection relies on the concept of task privilege.<br />
Privileged code is trusted and validated to perform system<br />
control, while application code typically runs unprivileged and is<br />

www.embedded-world.eu<br />

519


therefore unable to alter the system configuration in case of an<br />

error. Good practice is to minimize the amount of code required<br />

to operate in privileged mode (bug rates are often considered<br />

proportional to code size, so a smaller code base reduces the<br />

probability of bugs in the code and a smaller code base is<br />

simpler to validate fully).<br />

Virtualization has become commonplace in servers, where it is<br />

used to run multiple guest operating systems on the same<br />

physical hardware. This has now become possible in embedded<br />

processors, where the introduction of an additional privilege<br />

level to newer processors allows a hypervisor to be used to<br />

control the system. While virtualization is possible without this<br />

additional privilege, it requires the guest OSs to be<br />

paravirtualized (paravirtualization is a software-based<br />

virtualization solution which requires guest OSs to be ported to<br />

the virtualization API). The extra privilege level allows full<br />

virtualization and higher levels of performance. The<br />

introduction of these processors is particularly helpful in mixed<br />

criticality applications, where instead of having to integrate<br />

software from multiple vendors into a single partition, a<br />

hypervisor can be used to run the different partitions<br />

individually, isolating them in both memory and time.<br />

From a performance perspective, there are several factors that<br />

can affect operation when running under a hypervisor, some<br />

related to the processor hardware, others related to the<br />

hypervisor software itself. While hypervisors are often<br />

designed to be transparent to software developers, it is still<br />

worth considering different use models as this may affect<br />

whether use of a hypervisor is appropriate for your application.<br />

The biggest factor is whether multiple guests are running on a<br />

single core or not. If the OS is pinned to a specific core, and that<br />

core is not shared with other guests, there should be absolutely<br />

minimal overheads. Once shared though, the two main<br />

considerations are:<br />

a) There will be times when your guest is not<br />

running, so longer interrupt latencies will be<br />

expected<br />

b) Access to peripherals may be slower, particularly<br />

if shared between guests<br />

Regarding peripherals, whether there are multiple guests<br />

sharing the same core or peripheral may affect which choice of<br />

access model is used, which in turn affects performance (see<br />

fig. 3, Peripheral access models). Direct access is the most<br />

performant, but least flexible, while virtualized is the most<br />

flexible, but has a higher overhead to all guests. A compromise<br />

is the shared model, where a single guest has direct access, with<br />

other guests accessing through the primary owner.<br />
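The trade-off between the three access models can be sketched as a simple cost function; the guest/owner numbering and the relative cost figures below are invented for illustration:<br />

```c
/* Sketch of the three peripheral access models discussed above.
 * The cost figures are illustrative assumptions only. */
typedef enum {
    ACCESS_DIRECT,      /* one guest owns the device outright        */
    ACCESS_SHARED,      /* one owner; other guests go via the owner  */
    ACCESS_VIRTUALIZED  /* hypervisor mediates every access          */
} access_model_t;

/* Relative cost of a device access for a given guest: lowest when
 * the guest reaches the device directly, higher when another party
 * (the owning guest or the hypervisor) must be involved. */
int access_cost(access_model_t model, int guest, int owner)
{
    switch (model) {
    case ACCESS_DIRECT:
        return 1;                         /* only the owner may access  */
    case ACCESS_SHARED:
        return (guest == owner) ? 1 : 3;  /* extra hop through the owner */
    case ACCESS_VIRTUALIZED:
        return 5;                         /* trap to the hypervisor     */
    }
    return -1;
}
```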

A further performance consideration is for systems using<br />

MMUs. The cost of a page table walk has already been<br />

discussed; however, with two stages of memory translation in a<br />

virtualized system, there is now the potential for two levels of<br />

page table walk. Understanding the cost of this can be<br />

particularly challenging when trying to work out worst-case<br />

execution times. It can be mitigated to some extent by managing<br />

the number of pages in use (if the number of pages used is<br />

smaller than the TLB size, then accesses should hit in the TLB).<br />

Some architectures may allow TLB entries to be locked,<br />

ensuring accesses to critical sections of code or data are<br />

consistent in timing, although this can reduce average<br />

performance as a fraction of the TLB is then not available for<br />

normal code.<br />
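The mitigation described above amounts to a simple arithmetic check. A sketch, with page size, TLB size and locked-entry count as parameters (typical values such as 4 KiB pages and a few dozen TLB entries are assumptions for illustration):<br />

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch: will a task's working set stay resident in the TLB? */
size_t pages_needed(size_t working_set_bytes, size_t page_bytes)
{
    /* Ceiling division: partial pages still occupy a TLB entry. */
    return (working_set_bytes + page_bytes - 1) / page_bytes;
}

bool fits_in_tlb(size_t working_set_bytes, size_t page_bytes,
                 size_t tlb_entries, size_t locked_entries)
{
    /* Entries locked for critical code or data shrink the share of
     * the TLB left for everything else. */
    return pages_needed(working_set_bytes, page_bytes)
           <= tlb_entries - locked_entries;
}
```

For example, with 4 KiB pages, 48 TLB entries and 8 locked entries, a working set of 40 pages still fits, but one page more would start forcing page table walks.<br />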

In complex systems, using hypervisors to isolate different<br />

components is often an easier approach than trying to integrate<br />

everything into a single OS.<br />

Microkernel hypervisors are typically very efficient and with a<br />

small code footprint that makes certification for safety or<br />

security more straightforward. Applications class processors<br />

have supported virtualization for a number of years, but now<br />

processors such as the Arm Cortex-R52 mean that virtualization<br />

can be used even where the application has very hard real-time<br />

requirements, as it uses a two-stage MPU rather than an MMU.<br />

The hypervisor may be used to merge multiple guest operating<br />

systems onto the same processor, or in a multi-processor system<br />

can manage the system configuration and error handling, using<br />

core-pinning such that each guest is tied to a particular core.<br />

A full discussion of virtualization techniques and<br />

considerations is outside the scope of this paper.<br />

VI. REDUNDANCY<br />

For the most demanding safety applications, the methods<br />

described above still do not provide adequate hardware fault<br />

coverage. Sometimes it is necessary to execute a program<br />


multiple times and ensure the results are consistent. Until now,<br />

systems requiring the highest levels of functional safety have<br />

generally had constrained processing requirements. With the<br />

rise of applications such as autonomous driving, there is now a<br />

need to consider how to achieve high levels of safety on high<br />

performance processors.<br />

A. Dual core lockstep (DCLS)<br />

As mentioned earlier, from a software developer’s perspective<br />

one of the simplest methods of improving fault coverage is to<br />

use dual-core lockstep processors. Two identical copies of the<br />

processor logic execute the same code, on the same data, and<br />

the results are continuously compared by the hardware. There<br />

is usually some temporal separation between these copies to<br />

avoid common mode failures (i.e. where a particle strike causes<br />

the same error in both processors and the outputs, although<br />

erroneous, still match).<br />

To software engineers, dual-core lockstep processors offer<br />

simplicity in that they are transparent during normal operation<br />

and software can run unmodified. The hardware will detect an<br />

error when the processors diverge (for example due to a<br />

hardware bit-flip caused by a radiation strike), and this can be<br />

handled by the system. However, there are still features within<br />

a system that software engineers need to consider that can affect<br />

performance. When a fault is detected, this has to be handled at<br />

the system level, and recovery would require the processors to<br />

be reset. There is no method to resynchronize the processors or<br />

detect which processor has the fault.<br />

While effective and simple, DCLS has some costs too. The first<br />

is the lack of flexibility – everything running on a DCLS core<br />

is executed twice, whether it needs to be or not. It also does not<br />

provide for any diversity in software. The other major cost is<br />

physical: it requires additional comparison logic, and a second<br />

copy of the execution logic (memories can be protected by ECC<br />

so don’t need duplicating).<br />

B. Software lockstep<br />

This is an alternative approach to DCLS, sometimes also called<br />

redundant execution. Both approaches take a single set of input<br />

data, operate on it twice, and check the results. However, while<br />

DCLS compares the processor outputs every cycle, in software<br />

lockstep the checking process is under software control. The<br />

main benefit of this approach is flexibility. In a DCLS system,<br />

everything running on the processor is checked, whether<br />

required from a safety perspective or not. Similarly, identical<br />

software is run on both copies of the logic; in the DCLS system,<br />

the checking process is transparent to software.<br />

Software lockstep allows this to be selectively managed – not<br />

everything has to be run in duplicate, either creating more CPU<br />

time for additional processing or allowing a processor to be put<br />

to sleep to save energy. The redundant processors could even<br />

be separate SoCs.<br />

So far, we have assumed that the same software will run on both<br />

processors, but another possibility with software lockstep is to<br />

have diverse software implementations. In this case the<br />

comparison may be checking that both answers are within an<br />

expected range, rather than identical.<br />
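The comparison step can be sketched in a few lines. The channel functions and the tolerance below are illustrative assumptions, not part of any particular lockstep framework:<br />

```c
#include <stdbool.h>

/* Sketch of a software-lockstep check: two channels compute the
 * same quantity and a comparison decides whether they agree. */
typedef double (*channel_fn)(double);

/* Example diverse channels: the same scaling computed two ways. */
double scale_v1(double x) { return x * 0.5; }
double scale_v2(double x) { return x / 2.0; }
/* A faulty channel, to show a detected divergence. */
double scale_bad(double x) { return x * 0.6; }

/* With identical software the tolerance would be zero (exact
 * match); with diverse implementations, agreement within an
 * expected range is accepted instead. */
bool lockstep_check(channel_fn a, channel_fn b, double input,
                    double tolerance)
{
    double diff = a(input) - b(input);
    if (diff < 0.0)
        diff = -diff;
    return diff <= tolerance;
}
```

Unlike DCLS, nothing forces every workload through this path: only the functions that carry a safety requirement need to be executed and checked twice.<br />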

While there are benefits of software lockstep, one of the biggest<br />

challenges of this approach is proving the fault coverage of the<br />

system, particularly if diverse software is used. Remember that<br />

the goal is ultimately to protect against errors in the system<br />

causing harm – if we can’t measure what fault coverage is<br />

achieved, we cannot meet the safety target.<br />

C. Heterogeneous platforms<br />

As application performance requirements continue to grow,<br />

developers are looking to heterogeneous platforms combining<br />

CPUs, GPUs and custom accelerators. At the same time,<br />

software running on these platforms (such as machine learning<br />

algorithms) may be hard to validate to the highest levels of<br />

functional safety.<br />

Use of decomposition techniques may be needed, splitting<br />

applications into different ASILs with appropriate hardware<br />

and software for each part of the system to achieve the safety<br />

goal at system level. This could include using either or both<br />

redundancy techniques already discussed.<br />

Heterogeneous solutions also allow tailoring of the compute to<br />

the workload – for instance using hard real-time processors<br />

mixed with applications class processors – seen in SoCs such<br />

as the Renesas R-Car H3 and Xilinx Zynq UltraScale+.<br />

From a performance perspective, unfortunately there is no<br />

single answer to what the best solution is. However, having a<br />

good understanding of the tradeoffs of the different solutions<br />

will help a system or software designer work out the best option<br />

for their use case.<br />

VII. CONCLUSIONS<br />

Safety certified software, tools and hardware have been in<br />

common use for some time. However, as the markets in which<br />

safety is important continue to grow, ever more developers will<br />

be required to think about these use cases. Through this paper<br />

we have discussed some of the most common hardware features<br />

used to achieve functional safety goals, and the impact they can<br />

have on software performance.<br />

Most important of all is for the developer to understand the<br />

safety requirement of their application use case – what level of<br />

fault coverage is required. Once developers know what their safety<br />

goals are, they can decide which combination of software and<br />

hardware features to use to achieve them.<br />

Many of the features are provided automatically by the<br />

hardware, with little requirement for the software developer to<br />

take action. However, it is important for the software developer<br />

to understand these mechanisms and how they can affect<br />

performance when they are active (e.g. ECC). Other features<br />

require active involvement from the software (such as memory<br />

management), but if done optimally, have minimal impact on<br />

performance, making it possible to sustain high performance<br />

with high levels of functional safety.<br />



Balancing functional safety with performance<br />

intensive systems<br />

Marcus Nissemark<br />

Field Applications Engineer<br />

Green Hills Software<br />

Sweden<br />

marcusn@ghs.com<br />

Abstract—When creating performance intensive systems that<br />

will be used in critical applications like autonomous driving,<br />

walking robots or semi-autonomous clinical operation machines<br />

there are many challenges. The performance requirements drive<br />

the need for fast multicore CPUs and the usage of GPUs for<br />

computation, creating heavy computing platforms running on<br />

general purpose operating systems. There is still a need for real-time<br />

behavior and meeting functional safety requirements in<br />

these scenarios, and such system challenges will be discussed in<br />

this paper.<br />

Keywords—hard real-time, functional safety, separation,<br />

virtualization<br />

I. REAL-TIME BEHAVIOR REQUIREMENTS<br />

A real-time system is one that has well-defined and fixed<br />

time constraints, making it highly deterministic. Processing<br />

must be done within the defined constraints or the system will<br />

fail. This need for predictability is one of the key factors that<br />

drives the need for real-time behavior of the high-performance<br />

system. The performance intensiveness of these systems<br />

relates to the ability to run computation algorithms for<br />

transforming and processing sensor data. The system can have<br />

direct coupled sensors, remote sensors, or even a combination<br />

of these. The input data of these sensors often need to be<br />

correlated in time as different sensors may detect the same<br />

object, but focusing on different properties. The correlation<br />

then requires determinism, possibly through time-stamping, to<br />

make sure that sensor sampling and correlation is done within<br />

a specific time window, to be used in the computation flow.<br />
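The time-window correlation described above can be sketched as follows; the microsecond timestamp unit and the window size are illustrative assumptions:<br />

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: deciding whether two time-stamped sensor samples may be
 * fused in the same computation step. */
typedef struct {
    uint64_t timestamp_us; /* when the sample was taken */
    double   value;        /* the measured quantity */
} sample_t;

bool samples_correlatable(const sample_t *a, const sample_t *b,
                          uint64_t window_us)
{
    uint64_t delta = (a->timestamp_us > b->timestamp_us)
                   ? a->timestamp_us - b->timestamp_us
                   : b->timestamp_us - a->timestamp_us;
    /* Only samples taken within the same time window are fused;
     * anything older must be discarded or re-acquired. */
    return delta <= window_us;
}
```

The deterministic part is not this arithmetic but guaranteeing that sampling and this check run with bounded latency, which is what the RTOS scheduling discussed below provides.<br />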

For software execution, this means that we need to be able to<br />

schedule jobs to be executed when certain events occur, with<br />

minimal latency. This is typically something that a Real-Time<br />

Operating System (RTOS) can address and help sort out. In<br />

particular, the coordination of input acquisition, processing,<br />

and output to actuators or further analysis systems is one of<br />

the main drivers to stay away from general purpose OSes.<br />

The high-performance computing systems of today<br />

normally use large 64-bit multicore Systems-on-Chip (SoCs),<br />

running at gigahertz speed using gigabytes of live memory,<br />

optionally controlling GPU, FPGA and/or DSPs for building<br />

an Artificial Intelligence (AI) framework. The complexity of<br />

these systems in terms of hardware configuration drives the<br />

need for an advanced operating system to leverage<br />

controllability of the system, from an application point of<br />

view.<br />

Embedded Linux is frequently chosen as the operating<br />

system, and used in the industry to control such computing<br />

platforms. However, despite many claims that Linux can meet<br />

real-time requirements [1], Linux was never designed to do so,<br />

and you need to patch and change the default kernel operation<br />

to achieve this [2], rendering a lot of the middleware on the<br />

platform unusable. That, however, is a separate discussion.<br />

The general assumption is that you will need a production-grade<br />

RTOS to control your high-performance critical system,<br />

given the real-time constraints and hardware complexity.<br />

Several such operating systems are readily available, but we<br />

need to consider their ability to have or to be able to reach<br />

Functional Safety (FuSa) according to the relevant industrial<br />

standards, as later sections will show. The choice of operating<br />

system is beyond the scope of this paper, but worth noting is<br />

that there are only a handful of these RTOSes available which<br />

fulfill both hard real time and functional safety requirements.<br />

Typical examples, non-exhaustive, are INTEGRITY from<br />

Green Hills Software [3], QNX OS for Safety [5], and<br />

VxWorks from Wind River [4].<br />

II. SAFETY REQUIREMENTS<br />

The next challenge is the overall need for functional safety,<br />

which applies to both hardware and software of the high-performance<br />

system. In the automotive world the standard is<br />

called ISO 26262, which is derived from the industrial<br />

standard IEC 61508 [6]. In a simplified way, functional safety<br />

for software can be interpreted as a requirement that the software<br />

be proven to have a sufficiently low level of systematic faults to<br />

be usable in safety solutions. This level is the amount<br />

of risk reduction achieved by system safety processes and<br />


safety requirements. It is generally described as the safety<br />

integrity level, ASIL for automotive, or SIL for industrial, and<br />

can be illustrated as below in Fig. 1.<br />

[Risk plotted as probability (high–low) against severity (minor–major); greater risk requires a higher Safety Integrity Level]<br />

Fig. 1. Safety Integrity Level illustrated.<br />

In the context of a performance intensive system, possibly<br />

based on a general-purpose operating system solution like<br />

Linux, it means that all of the code running in the system<br />

needs to undergo extensive test and verification as well as deal<br />

with development process-related questions throughout the<br />

lifetime of the software product. The goal of such testing is to<br />

reduce the amount of systematic faults, aka bugs, which could<br />

cause the system to stop fulfilling its function adequately.<br />

Special care needs to be taken for code that is running in<br />

privileged mode (kernel), or that can affect the stability of the<br />

system (drivers, which on Linux also run in the kernel). This of course<br />

includes the operating system code, which in many cases can<br />

be millions of lines of code. Therefore, achieving safety with<br />

such operating systems is not very likely [7].<br />

One of the prerequisites of the software running on a given<br />

hardware platform is that the hardware itself fulfills its<br />

function and executes the code correctly. However, once<br />

deployed, the system may encounter some disturbances,<br />

maybe due to the environment or aging of the system.<br />

Consequently, the system itself needs to additionally take care<br />

of the random hardware faults that will occur, which drives the<br />

need for a safety architecture. Typically, high-performance<br />

systems use hardware that have not been designed with<br />

hardware fault tolerance or self-diagnostics capabilities. This<br />

means that a typical 1oo1D system architecture could be<br />

deployed [8], which adds separate diagnostic coverage to<br />

detect faults as seen in Fig. 2.<br />

[Environment hazards reduced to a remaining risk; Input → Logic → Output, with a separate Diagnostics channel monitoring the path]<br />

Fig. 2. Typical 1oo1D system architecture<br />

Diagnostics allow a detected dangerous event to be converted<br />

into a safe failure, which can be used to increase<br />

confidence in the system and underlying hardware. In turn,<br />

this can increase the safety integrity level of the system. This<br />

is an important and complex effort, and understanding the<br />

safety context is non-trivial. It has to be considered at design<br />

time since safety needs to be designed from the beginning.<br />

Furthermore, in any safety application, safety has precedence<br />

over performance, which means that the system designer must<br />

consider some performance being dedicated to the safety<br />

functions of the system, especially those related to diagnostics.<br />
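A minimal sketch of the 1oo1D idea referred to above: a separate diagnostic check converts an out-of-range (dangerous) output into a safe failure. The range limits and the zero fallback value are assumptions for illustration:<br />

```c
#include <stdbool.h>

/* Sketch of the 1oo1D pattern: a single logic channel whose output
 * is passed on only if a diagnostic check agrees it is sane;
 * otherwise the system falls back to a safe state. */
typedef struct {
    double out;        /* value sent to the actuator */
    bool   safe_state; /* true: output suppressed, safe fallback used */
} actuation_t;

actuation_t oneoo1d_step(double logic_output,
                         double diag_min, double diag_max)
{
    actuation_t act;
    if (logic_output >= diag_min && logic_output <= diag_max) {
        act.out = logic_output; /* diagnostics pass: use the output */
        act.safe_state = false;
    } else {
        act.out = 0.0;          /* detected dangerous event -> safe failure */
        act.safe_state = true;
    }
    return act;
}
```

Note that the diagnostic channel consumes some of the platform's performance on every cycle, which is exactly the design-time budget mentioned above.<br />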

III. SEPARATION AS MITIGATION FOR SAFETY<br />

So, there is a need for mitigations to solve these robustness<br />

challenges. The functional safety standards allow separation of<br />

functionality into different elements; each element can then be<br />

treated at a different level of criticality as long as the<br />

separation method guarantees freedom from interference.<br />

Diagnostic channels can also be separated, to simplify the<br />

safety architecture. The most straightforward separation would<br />

be the division into different hardware components, like<br />

multiple CPUs, or the division of functionality between<br />

heterogeneous CPU cores in modern SoCs.<br />

The division of functionality between homogeneous cores of<br />

a multicore SoC does not suffice, because sufficient separation is<br />

not achieved. For instance, such installations share<br />

caches, memory bus and other chip internals. These systems<br />

need to manage the separation and functionality division in<br />

software. Such separation can be done with techniques like a<br />

separation kernel or a hypervisor.<br />

A. Separate CPUs<br />

When using separate CPUs for performance intensive<br />

systems, they are often divided into high performance CPUs,<br />

possibly with a GPU, and separate lower speed MCUs that<br />

usually bear the safety compliance requirement. In other<br />

words, a performance domain and a safety domain are<br />

configured. These domains need to exchange data, feeding the<br />

calculations with input, and getting processed data back.<br />

Normally this transitions data to and from the safety domain<br />

through some physical connection, typically through simple<br />

buses like UART, SPI, or I2C, but designs using PCI, Ethernet<br />

or shared external memory can also be seen.<br />

Although the physical separation is good from the safety<br />

standard perspective, you still need to consider the<br />

communication path. Complex buses like Ethernet and PCI<br />

may require that the device driver also goes through the<br />

formal process of functional safety certification if there is any<br />

point of interference in that path, and even when running on<br />

safety MCUs that use an MPU for memory protection it is a non-trivial<br />

task. This is one of the reasons safety solutions tend to<br />

use the simpler buses, as they are easier to prove to be correct.<br />
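One reason the simpler buses are easier to argue about is that their framing can be protected with very small, analysable mechanisms. A sketch of an additive-checksum frame of the kind such inter-domain links often carry; the frame layout is an assumption for illustration, not any specific protocol:<br />

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Sketch: payload integrity for a message sent over a simple bus
 * (UART/SPI/I2C) between the safety and performance domains. */
uint8_t frame_checksum(const uint8_t *payload, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + payload[i]);
    return (uint8_t)(~sum + 1); /* two's complement of the byte sum */
}

bool frame_valid(const uint8_t *payload, size_t len, uint8_t checksum)
{
    /* The receiver recomputes and compares, so corruption on the
     * bus is detected (within the limits of an 8-bit additive
     * checksum; a CRC would give stronger coverage). */
    return frame_checksum(payload, len) == checksum;
}
```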

An example architecture of hardware separation can be seen in<br />

Fig. 3.<br />


Fig. 3. Example of hardware separation architecture<br />

However, those simpler buses may come with a<br />

performance penalty, compared to the more complex but faster<br />

buses. If your system needs to move a lot of data between<br />

different executing entities, i.e. between the safety and<br />

performance domains, the bus usage can become a bottleneck<br />

in both the performance and safety MCU, as not enough data<br />

can be transferred between the domains. Additionally, this<br />

architecture is not flexible in algorithm scalability; if some<br />

performance intensive algorithm needs to run in the safety<br />

domain there is a limit to the execution capabilities the smaller<br />

MCUs can handle.<br />

The smaller MCU may very well be running an RTOS to<br />

alleviate some of the constraints above, but the main drawback<br />

of running safety algorithms on a low performance MCU still<br />

exists. Furthermore, the performance CPU may also need to<br />

run a real-time operating system, not necessarily safety<br />

critical, but still capable of dealing with the real-time<br />

predictability requirements of the system.<br />

B. Hardware consolidation / Heterogeneous core<br />

architectures<br />

The other side of the hardware separation is hardware<br />

consolidation. Better integrated System-on-Chip (SoC) start to<br />

propose a small core with a dedicated memory, clock and<br />

power management on the same die as the bigger systems, to<br />

allow higher-performance communication. SoC vendors try to<br />

bring these separate CPUs and MCUs into one SoC, we have<br />

seen examples like Xilinx Zynq Ultrascale MPSoC [9],<br />

Renesas R-Car H3 [10], or NXP i.MX 8 [11]. An example can<br />

be seen in Fig. 4.<br />

[Fig. 4. Hardware consolidation separation architecture: a safety<br />

domain (ASIL A and ASIL C applications on a safety RTOS, on an<br />

MCU core) and a performance domain (QM applications on a<br />

performance RTOS, on the CPU core(s)), within a single SoC]<br />
When doing so they need to consider, design and test that<br />

the separation between the different cores is free from<br />

interference, as the performance domain and the safety domain<br />

separation still must exist. This means that no dynamic<br />

changes on one side shall allow the other side to behave<br />

differently. If the cores share configuration registers, memory<br />

buses or caches, these are apparent sources of interference which<br />

need extra protection. In these cases, the SoC vendor needs to<br />

assure us that the separation is proven for usage according to the<br />

safety standard. This is non-trivial, as these modern SoCs have<br />

vast functionality, and keeping everything under control is critical.<br />
Clearly this type of SoC removes some of the<br />

communication and data exchange bottlenecks, at least if they<br />

can safely use SoC-internal shared memory or similar paths.<br />

But, there is still the issue of scalability of algorithms between<br />

safety domain and performance domain, as the safety side<br />

normally is locked down to the smaller MCU that now is built<br />

into the large SoC.<br />

What is needed is a way to run both safety algorithms and<br />

performance algorithms on the same cores in any SoC. In<br />

other words, there is a need for high-level software that<br />

can provide isolation and separation. A common misconception about<br />

software separation is that virtualization using a<br />

hypervisor is the only way to achieve it. However, the separation<br />

kernel architecture must also be considered.<br />

C. Software separation kernels<br />

The software separation kernel was originally presented by<br />

John Rushby in a 1981 paper [12]. He describes this as "the<br />

task of a separation kernel is to create an environment which is<br />

indistinguishable from that provided by a physically<br />

distributed system: it must appear as if each regime is a<br />

separate, isolated machine and that information can only flow<br />

from one machine to another along known external<br />

communication lines. One of the properties we must prove of<br />

a separation kernel, therefore, is that there are no channels for<br />

information flow between regimes other than those explicitly<br />

provided." Therefore, a proper separation kernel provides<br />

isolation equivalent to hardware isolation.<br />
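Rushby's property can be pictured as a static channel table enforced on the kernel's send path: information may flow between regimes only along explicitly configured channels. The partition names and the channel table below are illustrative assumptions:<br />

```c
#include <stdbool.h>

/* Sketch of a separation-kernel flow policy.  The partitions and
 * the allowed channels are invented for illustration. */
#define NUM_PARTITIONS 3

enum { PART_SAFETY = 0, PART_PERF = 1, PART_DIAG = 2 };

/* allowed[src][dst] is true only for explicitly provided channels. */
static const bool allowed[NUM_PARTITIONS][NUM_PARTITIONS] = {
    /* to:  SAFETY PERF   DIAG  */
    { false, true,  true  },  /* from SAFETY */
    { true,  false, false },  /* from PERF   */
    { false, false, false },  /* from DIAG   */
};

/* The kernel's send path: any flow outside the table is refused,
 * so no information channel exists between regimes other than
 * those explicitly configured. */
bool kernel_send(int src, int dst)
{
    if (src < 0 || src >= NUM_PARTITIONS ||
        dst < 0 || dst >= NUM_PARTITIONS)
        return false;
    return allowed[src][dst];
}
```

In a certified separation kernel this policy is fixed at configuration time and itself forms part of the safety or security evidence.<br />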

Originally designed for security solutions, separation<br />

kernels can also be very useful in safety applications. The<br />

separation kernel solution allows for a division of applications<br />

into multiple levels of criticality, which significantly helps in<br />

the overall system architecture, i.e. creating a performance<br />

domain and a safety domain within the same RTOS. Green<br />

Hills Software’s INTEGRITY RTOS is an example of an<br />

RTOS separation kernel architecture, and Wind River’s<br />

VxWorks also provides a separation kernel profile. Both these<br />

operating systems, and a few others, have also undergone<br />

formal proof or functional safety pre-certifications, as<br />

previously mentioned. Therefore, these solutions already<br />

provide safety evidence that they have undergone the testing<br />

and scrutiny to claim that their solutions are adhering to the<br />

safety standard [13].<br />

On such a system, only a subset of the applications need to<br />

be assigned a safety integrity level, and the rest can be kept as<br />

regular quality code. This works in favor of scalable safety<br />


algorithms, as you can run such performance algorithms<br />

within the safety domain, but you do not have to run all of<br />

them there. Because code in applications is isolated by design<br />

from code in other applications, there is no longer a need to<br />

test and certify all the code at the highest level. This helps<br />

with process-related items when following ISO 26262 or other<br />

standards, because the effort in certification is as a<br />

consequence greatly lowered. Thus, the safety domain and the<br />

performance domain will run on the same SoC, as long as the<br />

separation kernel supports the actual cores of the SoC. This<br />

can be exemplified with Fig. 5.<br />

Fig. 5. Separation kernel architecture with a safety domain and a performance<br />

domain.<br />

The performance of these systems is also important, but it is<br />

assumed that most real-time operating systems actually work in<br />

favor of general performance. System performance is itself a vast<br />

topic, which means there is no trivial formula for measuring it. Operating<br />

systems can make heavy use of caching, with an expectation<br />

that some operations will be cache hits (probably fast) and<br />

some will be cache misses (probably slow). Interrupts often<br />

occur at unpredictable times, possibly resulting in altered<br />

performance of the code they break up. Different scheduling<br />

disciplines will insert varying delays into the performance of<br />

individual processes. Even if you have access to the operating<br />

system code, down at the hardware level there are caches and<br />

pipelines and other optimizations that you cannot see at all,<br />

except that they produce varying performance results. The<br />

end-user application still needs to be optimized in the context<br />

in which it will execute; only then can the real performance be<br />

measured.<br />

Another useful benefit of separation kernels is that, besides<br />

the original design idea to support security, they also allow for<br />

consolidation of multiple software functions on the<br />

same SoC, while still being logically separated; as long as the<br />

platform has sufficient performance capability, it is reasonable<br />

to integrate more and more independent functionalities on the<br />

platform without hardware redesign.<br />
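The consolidation described above is typically realized through strict time and space partitioning. The following sketch shows the idea of a fixed, cyclic partition schedule; the partition names and window lengths are hypothetical, not taken from any particular separation kernel:

```python
# Minimal sketch of the fixed, cyclic time partitioning a separation
# kernel can use to consolidate independent functions on one SoC.
# Partition names and window lengths are hypothetical examples.

MAJOR_FRAME = [            # one repeating major frame, in milliseconds
    ("ASIL-C partition", 4),
    ("ASIL-A partition", 2),
    ("QM partition",     4),
]

def partition_at(t_ms):
    """Return the partition that owns CPU time t_ms into the schedule."""
    frame_len = sum(width for _, width in MAJOR_FRAME)
    offset = t_ms % frame_len          # position inside the major frame
    for name, width in MAJOR_FRAME:
        if offset < width:
            return name
        offset -= width

# Adding another QM function only extends the major frame; the safety
# partitions keep their guaranteed windows without hardware redesign.
```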

D. Separation through virtualization<br />


The other method for software separation is generally called hardware virtualization, which is related to, but distinct from, the separation kernel approach. Hardware virtualization means that special execution modes are used to allow multiple<br />

operating systems to share the CPU and memory of the SoC.<br />

The framework that allows this is called a hypervisor, or<br />

virtual machine monitor, which creates these virtual machine<br />

instances that can run different operating systems on a single<br />

physical hardware platform. Virtualization can be hardware<br />

accelerated through the usage of modern CPU features like<br />

ARM-VE and Intel VT-x, generally called hardware-accelerated virtualization. This type of virtualization is<br />

different from operating system virtualization, where the<br />

instances, or so-called containers, share a single operating<br />

system kernel. The containers are not covered in this paper.<br />

There are several examples of hardware virtualization<br />

hypervisors, like the open-source Xen Project [17] and PikeOS<br />

from Sysgo [18]. Introducing hypervisors in the software<br />

architecture for safety is not without challenges because it<br />

adds complexity. The hypervisor needs to do memory<br />

separation and protection as well as scheduling of different<br />

workloads and management of the privilege levels of the<br />

virtualized guest operating systems. The hypervisor itself<br />

becomes the highest privileged software in the system, which<br />

means that in the context of safety applications, the hypervisor<br />

itself also needs to be considered safety relevant. In essence, a<br />

hypervisor is scheduling several operating systems like a<br />

separation kernel schedules applications.<br />

Hypervisors can also be categorized: into native or bare-metal hypervisors (Type 1) and hosted hypervisors (Type 2).<br />

Type 1 hypervisors run directly on the hardware, see Fig. 6,<br />

while hosted hypervisors run similarly to regular applications<br />

on the actual operating system. The distinction between the<br />

types is not clear as some configurations can be ambiguous. In<br />

some Type 1 implementations there is also an initial guest<br />

domain with higher privilege levels and dedicated peripheral<br />

access, called domain 0. The latter is typically seen in the Xen<br />

hypervisor solution. It is clear that such solutions would also<br />

require that the entire domain 0 guest is safety-relevant in a<br />

safety implementation.<br />

Fig. 6. Typical Type 1 hypervisor separation architecture.<br />

Besides the general need for separation to achieve safety,<br />

the main benefit of using virtualization in the context of safety is the ability to re-use high-performance algorithms<br />

designed for an operating system like Linux, while providing<br />

an efficient path to isolate such applications from safety<br />



applications running in the same system in a virtualized safety<br />

RTOS. Furthermore, the hypervisor should allow IPC (Inter<br />

Process Communication) mechanisms to support data transfer between the safety domain and the performance domain, as this type of data exchange is important.<br />
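As a rough illustration of such an inter-domain exchange, the sketch below models a bounded, validated channel between the performance and safety domains. All names are hypothetical; a real hypervisor would expose this via shared memory and virtual interrupts rather than Python objects:

```python
from collections import deque

# Sketch of a hypervisor-mediated IPC channel between a performance
# domain (producer) and a safety domain (consumer). The channel is
# bounded so the performance domain cannot exhaust safety-domain
# resources, and every message is validated before the safety domain
# consumes it. All names are illustrative, not a real hypervisor API.

class DomainChannel:
    def __init__(self, capacity=8):
        self.queue = deque()
        self.capacity = capacity

    def send(self, msg):
        """Called from the performance domain; drops when full."""
        if len(self.queue) >= self.capacity:
            return False               # back-pressure, never block the safety side
        self.queue.append(msg)
        return True

    def receive(self, validate):
        """Called from the safety domain; returns only validated data."""
        while self.queue:
            msg = self.queue.popleft()
            if validate(msg):
                return msg
        return None                    # nothing trustworthy available
```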

The drawback is that Type 1 virtualization of operating<br />

systems can add undesired latency to event handling, which<br />

causes issues in real-time application scenarios and harms determinism in general. Furthermore, scheduling of workloads across the<br />

cores on the SoC becomes non-trivial and can affect<br />

performance negatively.<br />

Type 2 hypervisors are normally seen as lacking performance, and are limited by the host operating system's safety and security solutions. This is because the hypervisor does not take advantage of hardware acceleration, does not allow direct device assignment (pass-through), and/or runs on a non-real-time operating system.<br />

Using a separation architecture microkernel can overcome<br />

those drawbacks, creating an interesting hypervisor solution.<br />

Furthermore, native applications in such a scenario can meet<br />

both safety and security requirements, as well as real-time low<br />

latency requirements. An example of this hypervisor<br />

architecture is Green Hills INTEGRITY Multivisor [16]. This<br />

alternative hypervisor architecture is illustrated below in Fig.<br />

7.<br />

Fig. 7. Separation kernel based Type 2 hypervisor architecture.<br />
IV. AI AND GPU PROCESSING IN A SAFETY CONTEXT<br />

Many recent high-performance computing systems do not use only the CPU for computation; it is also common to use GPU solutions to accelerate algorithms that can execute in multiple parallel implementations. This is normally done<br />

through programming extensions like OpenCL or the<br />

proprietary framework CUDA [15]. In fact, implementations<br />

of deep learning algorithms or other neural network<br />

techniques can make use of the massive parallelism that GPU<br />

acceleration provides, which basically is the driving force for<br />

the development of Artificial Intelligence (AI) solutions today<br />

[14].<br />

Normally, the GPU is under the control of a general-purpose operating system and its device drivers, which feed the GPU with massively parallel executable workloads. This is basically how frameworks like OpenCL work. In the<br />

context of a safety system, the control functions for these workloads and the resulting outcome may be the safety-critical information that needs extra protection. Therefore, it is a fair assumption that the controlling system must itself be capable of meeting safety requirements. Then workload control and workload results can be made safety-critical, while the actual computation is not.<br />

This further separates the safety and performance domains<br />

into separate processing entities, but can make use of fast<br />

CPUs for controlling the even faster GPU calculations in a safe manner.<br />
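The control path described above can be sketched as follows; the GPU computation is mocked by a plain function, since only the dispatch and the result checks lie in the safety domain. All function names are illustrative:

```python
# Sketch of the control path described in the text: the safety domain
# dispatches workloads and plausibility-checks the outcome, while the
# actual computation (mocked here by a plain Python function) remains
# outside the safety-critical boundary. All names are illustrative.

def gpu_compute(workload):
    """Stand-in for an OpenCL/CUDA kernel launch in the QM domain."""
    return [x * x for x in workload]

def safe_dispatch(workload, plausible):
    """Safety-side dispatch: submit work, then validate the results."""
    result = gpu_compute(workload)        # not safety-critical
    if len(result) != len(workload):      # safety-critical checks
        raise RuntimeError("result size mismatch")
    if not all(plausible(r) for r in result):
        raise RuntimeError("implausible GPU result")
    return result
```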

Both the separation kernel architecture and the<br />

virtualization solution can support this control path, although<br />

using virtualization adds an extra transition path for the required safety-critical data. Such data needs to travel to and from the safety domain via the virtualized guest operating system on its way from and to the GPU, creating a usable but complex path.<br />

V. CONCLUSION<br />

Balancing functional safety with performance-intensive<br />

systems requires separation through hardware or software. The<br />

hardware separation is limited in scalability and data<br />

transportation, but provides an easily proven safety versus<br />

performance domain separation. Software separation on the<br />

other hand provides scalable solutions for performance versus<br />

safety, but sometimes requires a complex software solution<br />

through virtualization. The middle path of using a separation<br />

kernel architecture provides the least complex path for<br />

software separation, allowing for scalable safety applications<br />

running on high performance CPU cores. When it comes to<br />

controlling GPU computations for safety, the controlling<br />

system most likely also needs to be capable of meeting safety<br />

requirements, or at least involve a hypervisor based control<br />

solution which incorporates safety.<br />

VI. REFERENCES<br />

[1] Intro to Real-Time Linux for Embedded Developers<br />

https://www.linuxfoundation.org/blog/intro-to-real-timelinux-for-embedded-developers<br />

[2] HOWTO setup Linux with PREEMPT_RT properly<br />

https://wiki.linuxfoundation.org/realtime/documentation/<br />

howto/applications/preemptrt_setup<br />

[3] INTEGRITY Real-Time Operating System<br />

https://www.ghs.com/products/rtos/integrity.html<br />

[4] VxWorks Safety Profile Product Overview<br />

https://www.windriver.com/products/product-<br />

overviews/Safety-Profile-for-VxWorks_Product-<br />

Overview/<br />

[5] QNX OS for Safety<br />

http://blackberry.qnx.com/en/products/certified_os/safekernel<br />

[6] ISO26262 Wikipedia<br />

https://en.wikipedia.org/wiki/ISO_26262<br />

www.embedded-world.eu<br />

526


[7] R H Pierce, “Preliminary assessment of Linux for safety<br />

related systems”, 2002. ISBN 0 7176 2538 9<br />

[8] 1oo1D Architecture,<br />

http://www.globalspec.com/reference/76370/203279/1oo<br />

1d-architecture<br />

[9] Zynq UltraScale+ MPSoC product site<br />

https://www.xilinx.com/products/silicondevices/soc/zynq-ultrascale-mpsoc.html<br />

[10] Renesas Generation 3 Automotive Computing Platform<br />

https://www.renesas.com/enus/solutions/automotive/products/rcar-h3.html<br />

[11] i.MX 8 Series Application Processors<br />

https://www.nxp.com/products/processors-andmicrocontrollers/applications-processors/i.mxapplications-processors/i.mx-8-processors:IMX8-<br />

SERIES<br />

[12] John Rushby, "The Design and Verification of Secure<br />

Systems," 8th ACM Symposium on Operating System<br />

Principles, pp. 12-21, Asilomar, CA, December 1981.<br />

(ACM Operating Systems Review, Vol. 15, No. 5).<br />

[13] Safety Automation Equipment List by Exida<br />

http://www.exida.com/SAEL/Green-Hills-Software-<br />

INTEGRITY-RTOS<br />

[14] Jensen Huang, “Accelerating AI with GPUs: A New<br />

Computing Model”, 2016<br />

https://blogs.nvidia.com/blog/2016/01/12/accelerating-aiartificial-intelligence-gpus/<br />

[15] General purpose computing on graphics processing units<br />

https://en.wikipedia.org/wiki/Generalpurpose_computing_on_graphics_processing_units<br />

[16] INTEGRITY Multivisor, Virtualization Architecture for<br />

Secure Systems<br />

https://www.ghs.com/products/rtos/integrity_virtualizatio<br />

n.html<br />

[17] The open source standard for hardware virtualization<br />

https://xenproject.org/users/virtualization.html<br />

[18] PikeOS Hypervisor<br />

https://www.sysgo.com/products/pikeos-hypervisor/<br />



Obtaining Worst-Case Execution Time Bounds on<br />

Modern Microprocessors<br />

Daniel Kästner, Markus Pister, Simon Wegener, Christian Ferdinand<br />

AbsInt GmbH<br />

D-66123 Saarbrücken, Germany<br />

info@absint.com<br />

Abstract—Many embedded control applications have real-time requirements. If the application is safety-relevant, worst-case execution time bounds have to be determined in order to demonstrate deadline adherence. If the microprocessor is timing-predictable, worst-case execution time guarantees can be computed by static WCET analysis. For high-performance multi-core architectures with degraded timing predictability, WCET<br />

bounds can be computed by hybrid WCET analysis which<br />

combines static analysis with timing measurements. This article<br />

summarizes the relevant criteria for assessing timing predictability,<br />

gives a brief overview of static WCET analysis and<br />

focuses on a novel hybrid WCET analysis based on non-intrusive<br />

real-time instruction-level tracing.<br />

Keywords— worst-case execution time, static analysis, real-time<br />

tracing, timing predictability, path analysis, functional safety<br />

I. INTRODUCTION<br />

In real-time systems the overall correctness depends on the<br />

correct timing behavior: each real-time task has to finish<br />

before its deadline. All current safety standards require reliable<br />

bounds of the worst-case execution time (WCET) of real-time<br />

tasks to be determined.<br />

With end-to-end timing measurements, timing information is determined only for one concrete input. Due to caches and<br />

pipelines the timing behavior of an instruction depends on the<br />

program path executed before. Therefore, usually no full test<br />

coverage can be achieved and there is no safe test end criterion.<br />

Techniques based on code instrumentation modify the code<br />

which can significantly change the cache and pipeline behavior<br />

(probe effect): the times measured for the instrumented<br />

software do not necessarily correspond to the timing behavior<br />

of the original software.<br />
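The probe effect can be illustrated with a toy direct-mapped instruction cache: a single inserted probe instruction shifts the addresses of the code behind it and changes the conflict-miss pattern, so the instrumented binary's timing no longer reflects the original. All addresses and cache parameters are invented:

```python
# Toy demonstration of the probe effect: a direct-mapped instruction
# cache with 4 lines. Inserting one probe instruction shifts later code
# addresses, which changes the conflict-miss pattern, so the measured
# (instrumented) behavior differs from the original program's behavior.
# All sizes and addresses are invented for illustration.

def miss_count(trace, lines=4):
    cache = [None] * lines
    misses = 0
    for addr in trace:
        idx = addr % lines            # direct-mapped index
        if cache[idx] != addr:
            misses += 1
            cache[idx] = addr
    return misses

original = [0, 4] * 10       # loop: both instructions collide in line 0
instrumented = [0, 5] * 10   # one probe shifts the second instruction
```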

One safe method for timing analysis is static analysis by<br />

Abstract Interpretation which provides guaranteed upper<br />

bounds for WCET of tasks. Static WCET analyzers are<br />

available for complex processors with caches and complex<br />

pipelines, and, in general, support single-core processors and<br />

multi-core processors. A prerequisite is that good models of the<br />

processor/System on-Chip (SoC) architecture can be<br />

determined. However, there are modern high performance<br />

SoCs which contain unpredictable and/or undocumented<br />

components that influence the timing behavior. Analytical<br />

results for such processors are unrealistically pessimistic.<br />

A hybrid WCET analysis combines static value and path<br />

analysis with measurements to capture the timing behavior of<br />

tasks. Compared to end-to-end measurements the advantage of<br />

hybrid approaches is that measurements of short code snippets<br />

can be taken which cover the complete program under analysis.<br />

Based on these measurements a worst-case path can be<br />

computed. The hybrid WCET analyzer TimeWeaver avoids the<br />

probe effect by leveraging the embedded trace unit (ETU) of<br />

modern processors, like Nexus 5001 [16], which allows a<br />

fine-grained observation of a core’s program flow.<br />

TimeWeaver reads the executable binary, reconstructs the<br />

control-flow graph and computes ranges for the values of<br />

registers and memory cells by static analysis. This information<br />

is used to derive loop bounds and prune infeasible paths. Then<br />

the trace files are processed and the path of longest execution<br />

time is computed. The computed time estimate provides<br />

valuable feedback for assessing system safety and for<br />

optimizing worst-case performance. TimeWeaver also provides<br />

feedback for optimizing the trace coverage: paths for which<br />

infeasibility has been proven need no measurements; loops for<br />

which the analyzed worst-case iteration count has not been<br />

measured are reported.<br />
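The final step, computing the path of longest execution time from measured snippet times, can be sketched as follows. This is a simplified illustration of the principle (longest path over a loop-collapsed control-flow graph), not TimeWeaver's actual algorithm; all block names and cycle counts are hypothetical:

```python
# Simplified hybrid-WCET illustration: each basic block is weighted
# with its worst observed trace time, loops with their statically
# derived iteration bound, and the longest path through the acyclic,
# loop-collapsed CFG is computed. Not TimeWeaver's actual algorithm.

def wcet_estimate(cfg, entry, exit, worst_time, loop_bound):
    """cfg: block -> successor list; times in cycles; bounds >= 1."""
    order, seen = [], set()
    def dfs(b):                       # post-order DFS for a topological order
        seen.add(b)
        for s in cfg.get(b, []):
            if s not in seen:
                dfs(s)
        order.append(b)
    dfs(entry)
    longest = {b: float("-inf") for b in seen}
    longest[entry] = worst_time[entry] * loop_bound.get(entry, 1)
    for b in reversed(order):         # relax edges in topological order
        for s in cfg.get(b, []):
            cost = worst_time[s] * loop_bound.get(s, 1)
            longest[s] = max(longest[s], longest[b] + cost)
    return longest[exit]
```

For a diamond-shaped CFG A → {B, C} → D with worst observed times A=10, B=50, C=30, D=5 cycles and a static bound of 4 iterations on loop block B, the estimate follows the B branch: 10 + 4·50 + 5 = 215 cycles.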

In this article we give an overview of timing predictability<br />

in general and provide criteria for selecting suitable WCET<br />

analysis methods. We will outline the methodology of hybrid<br />

WCET analysis and report on practical experience with the tool<br />

TimeWeaver.<br />

II. TIMING PREDICTABILITY<br />

In general, a system is predictable if it is possible to predict<br />

its future behavior from the information about its current state.<br />

We consider predictability under the assumption that the<br />

hardware works without unexpected errors. Hardware faults<br />

like soft errors or transient faults have to be addressed by<br />

specific error handling mechanisms to ensure overall system<br />

safety.<br />

In [4] the program input and the hardware state in which<br />

execution begins are identified as the primary sources of<br />

uncertainty in execution time. Hardware-related timing<br />

predictability can be expressed as the maximal variance in<br />

execution time due to different hardware states for an arbitrary<br />

but fixed input. Analogously, software-related timing<br />

predictability corresponds to the maximal variance in<br />

execution time due to different inputs for an arbitrary but fixed<br />



hardware state. A basic assumption is uninterrupted program<br />

execution without interferences. In a concurrent system,<br />

interferences due to concurrent execution additionally have to<br />

be taken into account.<br />

To ensure the correct timing behavior it is necessary to<br />

demonstrate the deadline adherence of each task. To this end,<br />

the worst-case execution time of each task has to be<br />

determined, i.e. the concept of software-related predictability<br />

as defined above can be reduced to the predictability of the<br />

worst-case execution path.<br />

This leads to the following two main criteria for execution<br />

time predictability:<br />

• It must be possible to determine an upper bound of the<br />

maximal execution time which is guaranteed to hold.<br />

• To enable precise bounds on the maximal execution time to be determined, the behavioral variance, i.e. the maximal variance in execution time due to different hardware states, has to be as low as possible. In general, the larger the behavioral variance is,<br />
o the more the execution time depends on the execution history,<br />
o the less meaningful is one particular execution time measurement in a specific execution context, and<br />
o the larger can be the gap between the largest measured execution time and the true worst-case execution time.<br />
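The role of behavioral variance can be made concrete with invented numbers: measuring the same code path from different initial hardware states yields a wide band of execution times, and any single measurement may badly underestimate the true worst case:

```python
# Toy illustration of behavioral variance: execution times (cycles) of
# the same code path measured from different initial hardware states.
# All numbers are made up for illustration only.

samples = {
    "cold caches": 900,
    "warm caches": 120,
    "warm caches, mispredicted branch": 310,
}

# The spread across hardware states is the behavioral variance band.
variance_band = max(samples.values()) - min(samples.values())

# One arbitrary measurement can miss the worst case by a wide margin.
single_measurement = samples["warm caches, mispredicted branch"]
underestimation = max(samples.values()) - single_measurement
```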

Even in single-core processors timing predictability is<br />

compromised by performance-enhancing hardware<br />

mechanisms like caches, pipelines, out-of-order execution,<br />

branch prediction and other mechanisms for speculative<br />

execution, which can cause significant variations in timing<br />

depending on the hardware state. Interestingly, hardware<br />

speculation has recently been discovered to constitute a critical<br />

security vulnerability [21, 19].<br />

For multi-core processors all challenges to timing<br />

predictability are relevant that apply to single-core processors.<br />

In addition, there are new challenges imposed by the multi-core<br />

design. In the following we will first discuss timing<br />

predictability on single-core processors and then address<br />

specific challenges for multi-core processors.<br />

A. Single-Core Processors<br />

For simple non-pipelined architectures adding up the<br />

execution times of individual instructions is enough to obtain a<br />

bound on the execution time of a basic block. However,<br />

modern embedded processors try to maximize instruction-level parallelism by sophisticated performance-enhancing<br />

features, like caches, pipelines, or speculative execution.<br />

Pipelines increase performance by overlapping the executions<br />

of consecutive instructions. For timing measurements this<br />

means that there may be big variations between the execution<br />

times measured with different starting states of the hardware.<br />

Furthermore, there may be a significant gap between the largest<br />

measured execution time and the true worst-case execution<br />

time. For a timing analyzer it means that it is not feasible to<br />

consider individual instructions in isolation. Instead, they have<br />

to be analyzed collectively—together with their mutual<br />

interactions—to obtain tight timing bounds. In the following<br />

we will give an overview of timing-relevant hardware features<br />

and discuss their effect on timing measurements and on static<br />

analysis methods.<br />

In general, the challenges for timing analysis of single-core<br />

architectures originate from the complexity of the particular<br />

execution pipeline and the connected hardware devices.<br />

Commonly used performance-enhancing features are caches,<br />

pipelines, out-of-order execution, speculative execution<br />

mechanisms like static/dynamic branch prediction and branch<br />

history tables, or branch target instruction caches. Many of<br />

these hardware features can cause timing anomalies [29] which<br />

render WCET analysis more difficult. Intuitively, a timing<br />

anomaly is a situation where the local worst-case does not<br />

contribute to the global worst-case. For instance, a cache miss<br />

—the local worst-case—may result in a globally shorter<br />

execution time than a cache hit because of hardware scheduling<br />

effects. In consequence, it is not safe to assume that the<br />

memory access causes a cache miss; instead both machine<br />

states have to be taken into account. An especially difficult class of timing anomalies are domino effects [22]: a system exhibits a<br />

domino effect if there are two hardware states s, t such that the<br />

difference in execution time (of the same program starting in s,<br />

t respectively) may be arbitrarily high. E.g., given a program<br />

loop, the executions never converge to the same hardware state<br />

and the difference in execution time increases in each iteration.<br />

In consequence, loops have to be analyzed very precisely and<br />

the number of machine states to track can grow high. For<br />

timing measurements this means that the difference between<br />

measured and true worst-case execution time caused by an<br />

incomplete hardware state coverage can grow arbitrarily high.<br />
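A contrived toy model illustrates the anomaly: here the local worst case (a first access that misses) produces the globally shorter run, because a hit would leave the pipeline free for speculative prefetching that evicts a line needed later. All latencies are invented:

```python
# Contrived toy model of a timing anomaly: a first access that HITS in
# the cache leaves the pipeline free to speculatively prefetch, which
# evicts the line needed by the second access. The local worst case (a
# miss on the first access) therefore leads to the globally shorter
# execution. All latencies are invented for illustration.

HIT, MISS = 1, 100          # access latencies in cycles
FIRST_MISS_STALL = 10       # the first access misses in a small buffer

def total_cycles(first_access_hits):
    if first_access_hits:
        t = HIT               # fast first access ...
        second_hits = False   # ... but speculation evicted the second line
    else:
        t = FIRST_MISS_STALL  # slow first access squashes speculation
        second_hits = True    # the second line survives in the cache
    t += HIT if second_hits else MISS
    return t
```

The local worst case (miss, 11 cycles total) is globally better than the local best case (hit, 101 cycles total), so an analysis may not simply assume the miss.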

The article [37] categorizes the timing compositionality of<br />

computing architectures according to the presence of timing<br />

anomalies. Fully compositional architectures, such as the<br />

ARM7, contain no timing anomalies; individual components,<br />

e.g., basic blocks, can be considered separately and their worst-case information can be combined. Compositional architectures<br />

only contain bounded timing effects, i.e., additional delays<br />

(e.g., due to an access to a shared resource or due to a<br />

preemption or interrupt) can be bounded by a constant and<br />

added to the local worst-case figures (e.g. TriCore 1797). Non-compositional architectures contain domino effects, i.e.,<br />

unbounded anomalies (e.g. PowerPC 755). Depending on the<br />

state of the pipeline and the predictors, the occupancy of<br />

functional units, and the contents of the caches—i.e., the<br />

execution history—an instruction needs only a few or several<br />

hundred cycles to complete its execution [8]. A rigorous<br />

definition of compositionality is given in [14].<br />

As the runtime of embedded control software often is<br />

dominated by load/store operations, memory subsystems<br />

nowadays introduce queues before the caches to buffer these operations<br />

and overcome early stall conditions like cache misses. Often<br />

this is complemented by fast data forwarding for consecutive<br />

accesses into cache lines that have already been requested by<br />

previous pending instructions, where the requested data might<br />

already be present in the core. This helps to reduce the number<br />

of transactions over the slow system bus. In the abstract model<br />

of the timing analysis, the representation of these hardware<br />

features has to be close to the concrete hardware to achieve<br />



satisfactory analysis precision. Due to their size, especially the<br />

dynamic branch prediction and the branch history tables<br />

consume a significant number of bits in the abstract state<br />

representation which increases the memory consumption of the<br />

analysis. Unknown or not precisely known effective addresses<br />

of memory requests further increase the timing analysis search<br />

space due to the number of possible scenarios (cache hit/miss,<br />

fast data forward or not, …). Concerning processor caches,<br />

both precision and efficiency depend on the predictability of<br />

the employed replacement policy [28, 8]. The Least-Recently-<br />

Used (LRU) replacement policy has the best predictability<br />

properties. Employing other policies, like Pseudo-LRU<br />

(PLRU), First-In-First-Out (FIFO), or Random, yields less<br />

precise WCET bounds because fewer memory accesses can be<br />

precisely classified. Furthermore, the efficiency degrades<br />

because the analysis has to explore more possibilities. Another<br />

deciding factor is the write policy. Typically, there are two<br />

main options: write-through where a store is directly written to<br />

the next level in the memory hierarchy, and write-back where<br />

the data is written into the next hierarchy level if the concrete<br />

memory cell is evicted from the cache. The write-back policy<br />

induces timing uncertainty because the precise point in time<br />

when the write-back occurs is hard to predict; for example, it<br />

might happen after a task switch and slow down a different<br />

(and possibly higher-priority) task than the one that issued the<br />

store operation in the first place. Another timing analysis<br />

challenge is to model processor external devices which are<br />

typically connected with the caches over the system bus. Such<br />

devices are memory controllers for static (SRAM, Flash) or<br />

dynamic memory (DRAM, DDR or QDR) or controllers for<br />

system communication (CAN, FlexRay, AFDX). The<br />

corresponding bus protocol and memory chip timing have to be<br />

modeled precisely.<br />
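The predictability advantage of LRU noted above can be demonstrated with a small simulation of a single fully-associative set: after k distinct accesses, a k-way LRU set reaches a state that is independent of its initial contents, while a FIFO set keeps depending on its starting state:

```python
# Sketch of why LRU is more predictable than FIFO: after accessing k
# distinct blocks, a k-way LRU set holds exactly those blocks in
# recency order regardless of its initial contents, whereas a FIFO set
# still depends on the starting state. Model: one fully-associative
# set, list index 0 = most recent / newest entry.

def run_lru(state, accesses, ways=4):
    state = list(state)
    for b in accesses:
        if b in state:
            state.remove(b)          # hit: move to most-recent position
        elif len(state) >= ways:
            state.pop()              # miss: evict least-recently-used
        state.insert(0, b)
    return state

def run_fifo(state, accesses, ways=4):
    state = list(state)
    for b in accesses:
        if b not in state:           # a hit leaves FIFO order unchanged
            if len(state) >= ways:
                state.pop()          # miss: evict the oldest entry
            state.insert(0, b)
    return state
```

A static analysis can therefore classify LRU contents precisely after a bounded access history; for FIFO the uncertainty about the initial state persists.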

Individually, each of the above features can be modeled<br />

without complexity problems. Only their combination can<br />

actually result in a large number of possible system states<br />

during the abstract simulation of a basic block. Smart system<br />

configurations as described in [18] can decrease both the<br />

execution time variability and the analysis complexity. In<br />

consequence, the complexity of timing analysis decreases such<br />

that highly complex processors like the Freescale PowerPC<br />

7448 can be handled. At the same time the accuracy of timing<br />

measurements will be improved.<br />

Some events in modern architectures are either<br />

asynchronous to program execution (e.g., interrupts, DMA) or<br />

not predictable in the model (e.g., ECC errors in RAM or some<br />

hardware exceptions). Their effect on the execution time has to<br />

be incorporated externally, i.e., by adding penalties based on<br />

the worst-case occurrence of the events to the computed<br />

WCET, or by statistical means.<br />

B. Multi-Core Processors<br />

Whereas timing analysis of single-core architectures<br />

already is quite challenging, the timing behavior of multi-core<br />

architectures is even more complex. A multi-core processor is a<br />

single computing component with two or more independent<br />

cores; it is called homogeneous if it includes only identical<br />

cores, otherwise it is called heterogeneous. Thus, all<br />

characteristic challenges from single-cores are still present in<br />

the multi-core design, but the multiple cores can independently<br />

run multiple instructions at the same time. Some multi-core<br />

processors can be run in lockstep mode where all cores execute<br />

the same instruction stream in parallel. This typically<br />

eliminates interferences between the cores, so from a timing<br />

perspective the processor behaves like a single-core.<br />

When the processor is not run in lockstep mode, the inter-core parallelism becomes relevant. To interconnect the several<br />

cores, buses, meshes, crossbars, and also dynamically routed<br />

communication structures are used. In that case, the<br />

interference delays due to conflicting, simultaneous accesses to<br />

shared resources (e.g. main memory) are the main cause of<br />

imprecision. On a single-core system, the latency of a memory<br />

access mostly depends on the accessed memory region (e.g.<br />

slow flash memory vs. fast static RAM) and whether the<br />

accessed memory cell has been cached or not. On a multi-core system, the latency also depends on the memory accesses of<br />

the other cores, because multiple simultaneous accesses might<br />

lead to a resource conflict, where only one of the accesses can<br />

be served directly, and the other accesses have to wait. The<br />

shared physical address space requires additional effort in order<br />

to guarantee a coherent system state: Data resident in the<br />

private cache of one core may be invalid, since modified data<br />

may already exist in the private cache of another core, or data<br />

might have already been changed in the main memory. Thus,<br />

additional communication between different cores is required<br />

and the execution time needed for this has to be taken into<br />

account. Multi-core processors which can be configured in a<br />

timing-predictable way to avoid or bound inter-core<br />

interferences are amenable to static WCET analysis [18, 36].<br />

Examples are the Infineon AURIX TC275 [17], or the<br />

Freescale MPC 5777.<br />

The Freescale P4080 [13] is one example of a multi-core platform where the interference delays have a huge impact on<br />

the memory access latencies and cannot be satisfactorily<br />

predicted by purely static techniques. It consists of eight<br />

PowerPC e500mc cores which communicate with each other<br />

and the main memory over a shared interconnect, the CoreNet<br />

Coherency Fabric. The main problem for static analysis<br />

approaches is that the publicly available documentation about<br />

the CoreNet is not enough to statically predict its behavior.<br />

Nowotsch et al. [24] measured maximal write latencies of 39<br />

cycles when only one core was active, and maximal write<br />

latencies of 1007 cycles when all eight cores were running.<br />

This is more than 25 times longer than the observed best case.<br />

A sound WCET analysis must take into account the interference delays caused by resource conflicts. Unless<br />

interference is avoided by means of the overall software<br />

architecture, ignoring these delays might result in<br />

underestimation of the real WCET whereas assuming full<br />

interferences at all times might result in huge overestimation.<br />
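A coarse full-interference bound in the spirit of [24] can be written down directly: every shared-resource access of the analyzed core may be delayed once by each other active core. This is a simplified illustration, not the exact analysis of the cited paper, and all example numbers are hypothetical:

```python
# Simplified full-interference bound in the spirit of [24]: every
# shared-memory access of the analyzed core may be delayed by one
# pending access of each other active core. A coarse illustrative
# model, not the exact analysis of the paper.

def interference_bound(wcet_isolation, n_accesses, n_cores, worst_delay):
    """Upper bound on the WCET when n_cores share the interconnect."""
    extra = n_accesses * (n_cores - 1) * worst_delay
    return wcet_isolation + extra
```

With hypothetical numbers (10000 cycles in isolation, 200 shared accesses, 8 cores, 39 cycles worst per-access delay) the bound grows to 64600 cycles, showing how quickly full-interference assumptions inflate the estimate.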

To improve predictability of avionics systems the<br />

Certification Authorities Software Team (CAST) [5] advocates<br />

either deactivating or controlling existing interference channels. If<br />

deactivation is not possible the software architecture has to be<br />

able to prevent or bound the interferences. One hardware<br />

element where such mechanisms are required is the<br />

interconnect, i.e., the Network-on-Chip (NoC) or shared bus<br />

connecting main memory to the individual cores. Several<br />



approaches to address interference on shared memory accesses<br />

have been discussed in literature, most of them in the context<br />

of Integrated Modular Avionics (IMA). They typically rely on<br />

a time-triggered static scheduling scheme, e.g., corresponding<br />

to the avionics standard ARINC 653. As an example, with the<br />

approaches of [30] or [24] precise static WCET bounds can be<br />

computed, albeit at the cost of high computational complexity.<br />

For systems which do not implement such rigorous software<br />

architectures or where the information needed to develop a<br />

static timing model is not available, hybrid WCET approaches<br />

are the only solution.<br />

III. WCET GUARANTEES ON PREDICTABLE PROCESSORS<br />

The most successful formal method for WCET computation<br />

is Abstract Interpretation-based static program analysis. Static<br />

program analyzers compute information about the software<br />

under analysis without actually executing it. Semantics-based<br />

static analyzers use an explicit (or implicit) program semantics<br />

that is a formal (or informal) model of the program executions<br />

in all possible or a set of possible execution environments.<br />

Most interesting program properties—including the WCET—<br />

are undecidable in the concrete semantics. The theory of<br />

abstract interpretation [6] provides a formal methodology for<br />

semantics-based static analysis of dynamic program properties<br />

where the concrete semantics is mapped to a simpler abstract<br />

model, the so-called abstract semantics. The static analysis is<br />

computed with respect to that abstract semantics, enabling a<br />

trade-off between efficiency and precision. A static analyzer is<br />

called sound if the computed results hold for any possible<br />

program execution. Applied to WCET analysis, soundness<br />

means that the WCET bounds will never be exceeded by any<br />

possible program execution. Abstract interpretation supports<br />

formal soundness proofs for the specified program analysis.<br />

Like model checking and theorem proving, it is recognized as a<br />

formal method by the DO-178C and other safety standards (cf.<br />

Formal Methods Supplement [26] to DO-178C [27]). It is<br />

based on a mathematically rigorous concept and provides the<br />

highest possible confidence in the correctness of the results (cf.<br />

IEC-61508, Ed. 2.0 [15], Table C.18).<br />

In addition to soundness, further essential requirements for<br />

static WCET analyzers are efficiency and precision. The<br />

analysis time has to be acceptable for industrial practice, and<br />

the overestimation must be small enough to be able to prove<br />

the timing requirements to be met.<br />

Over the last few years, a more or less standard architecture<br />

for timing analysis tools has emerged [9, 11]. It requires neither<br />

code instrumentation nor debug information and is composed<br />

of three major building blocks:<br />

• control-flow reconstruction and static analyses for control<br />

and data flow,<br />

• micro-architectural analysis, computing upper bounds on<br />

execution times of basic blocks,<br />

• path analysis, computing the longest execution paths<br />

through the whole program.<br />

The data flow analysis of the first block also detects<br />

infeasible paths, i.e., program points that cannot occur in any<br />

real execution. This reduces the complexity of the following<br />

micro-architectural analysis. Basic block timings are<br />

determined using an abstract processor model (timing model) to<br />

analyze how instructions pass through the pipeline, taking<br />

cache-hit/cache-miss information into account. This model<br />

defines a cycle-level abstract semantics for each instruction's<br />

execution, yielding a certain set of final system states. After<br />

the analysis of one instruction has been finished, these states<br />

are used as start states in the analysis of the successor<br />

instruction(s). Here, the timing model introduces nondeterminism<br />

that leads to multiple possible execution paths in<br />

the analyzed program. The pipeline analysis has to examine all<br />

of these paths.<br />

In the following sections we will focus on the commercially<br />

available tool aiT [1] which implements the architecture<br />

described above. It is centered around a precise model of the<br />

microarchitecture of the target processor and is available for<br />

various 16-bit and 32-bit single-core and multi-core<br />

microcontrollers. aiT determines the WCET of a program task<br />

in several phases corresponding to the reference architecture<br />

described above, which makes it possible to use different<br />

methods tailored to each subtask [34]. In the following we will<br />

give an overview of each analysis stage.<br />

• In the decoding phase the instruction decoder reads and<br />

disassembles the input executable(s) into its individual<br />

instructions. Architecture specific patterns decide whether<br />

an instruction is a call, branch, return or just an ordinary<br />

instruction. This knowledge is used to reconstruct the basic<br />

blocks of the control flow graph (CFG) [33]. Then, the<br />

control flow between the basic blocks is reconstructed. In<br />

most cases, this is done completely automatically.<br />

However, if a target of a call or branch cannot be statically<br />

resolved, the user can provide annotations to guide<br />

the control flow reconstruction.<br />

• The combined loop and value analysis determines safe<br />

approximations of the values of processor registers and<br />

memory cells for every program point and execution<br />

context. These approximations are used to determine<br />

bounds on the iteration number of loops and information<br />

about the addresses of memory accesses. Contents of<br />

registers or memory cells, loop bounds, and address ranges<br />

for memory accesses may also be provided by annotations<br />

if they cannot be determined automatically. Value analysis<br />

information is also used to identify conditions that are<br />

always true or always false. Such knowledge is used to<br />

infer that certain program parts are never executed and<br />

therefore do not contribute to the worst-case execution<br />

time or the stack usage.<br />

• In the micro-architectural analysis phase cache and pipeline<br />

analysis has to be combined because the pipeline analysis<br />

models the flow of instructions through the processor<br />

pipeline and therefore computes the precise instant of time<br />

when the cache is queried and its state is updated. The<br />

combined cache and pipeline analysis represents an<br />

abstract interpretation of the program's execution on the<br />

underlying system architecture. The execution of a<br />

program is simulated by feeding instruction sequences<br />

from a control-flow graph to the timing model which<br />

computes the system state changes at cycle granularity and<br />

keeps track of the elapsing clock cycles. The correctness<br />

proofs according to the theory of abstract interpretation<br />

have been conducted by Thesing [35]. The cache analysis<br />

presented by [10] is incorporated into the pipeline analysis.<br />

At each point where the actual hardware would query and<br />

update the contents of the cache(s), the abstract cache<br />

analysis is called, simulating a safe approximation of the<br />

cache effects. The result of the cache/pipeline analysis<br />

either is a worst-case execution time for every basic block,<br />

or a prediction graph that represents the evolution of the<br />

abstract system states at processor core clock granularity<br />

[7].<br />

• The path analysis phase uses the results of the combined<br />

cache/pipeline analysis to compute the worst-case path of<br />

the analyzed code with respect to the execution timing.<br />

The execution time of the computed worst-case path is the<br />

worst-case execution time for the program. Within the aiT<br />

framework, different methods for computing this worst-case<br />

path are available.<br />

aiT has been successfully employed in the avionics [12, 11, 31]<br />

and automotive [23] industries to determine precise bounds on<br />

execution times of safety-critical software. It is available for a<br />

variety of microprocessors ranging from simple processors like<br />

ARM7 to complex superscalar processors with timing<br />

anomalies and domino effects like the Freescale MPC755 or<br />

MPC7448, and multicore processors like Infineon AURIX<br />

TC27x.<br />

IV. HYBRID WCET ANALYSIS<br />

Techniques to compute worst-case execution time<br />

information from measurements are either based on end-to-end<br />

measurements of tasks, or they construct a worst-case path<br />

from timing information obtained for a set of smaller code<br />

snippets in which the executable code of the task has been<br />

partitioned. With end-to-end timing measurements, timing<br />

information is only determined for one concrete input. As<br />

described above, due to caches and pipelines the timing<br />

behavior of an instruction depends on the program path<br />

executed before. Therefore, full test coverage usually cannot be<br />

achieved and there is no safe test end criterion. Approaches that<br />

instrument the code to obtain timing information about the<br />

code snippets of a task modify the code, which can significantly<br />

change the cache and pipeline behavior (probe effect): the<br />

times measured for the instrumented software do not<br />

necessarily correspond to the timing behavior of the original<br />

software.<br />

The solution which is implemented in the hybrid WCET<br />

analysis tool TimeWeaver [2] combines static context-sensitive<br />

path analysis with non-intrusive real-time instruction-level<br />

tracing to provide worst-case execution time estimates. By its<br />

nature, an analysis using measurements to derive timing<br />

information is aware of timing interference due to concurrent<br />

execution and multicore resource conflicts, because the effects<br />

of asynchronous events (e.g. activity of other running cores or<br />

DRAM refreshes) are directly visible in the measurements. The<br />

probe effect is completely avoided since no code<br />

instrumentation is needed. The computed estimates are safe<br />

upper bounds with respect to the given input traces, i.e.,<br />

TimeWeaver derives an overall upper timing bound from the<br />

execution time observed in the given traces. Thus, the coverage<br />

of the analyzed code by the input traces is an important metric<br />

that influences the quality of the computed WCET estimates.<br />

The trace information needed for running TimeWeaver is<br />

provided out-of-the-box by embedded trace units of modern<br />

processors, like NEXUS IEEE-ISTO 5001 [16] or ARM<br />

CoreSight [3]. They allow the fine-grained observation of a<br />

program execution on single-core and multicore systems.<br />

Examples for processors supporting the NEXUS trace interface<br />

are the NXP QorIQ P- and T-series processors (using either an<br />

e500mc or an e5500/e6500 core).<br />

A. NEXUS Traces<br />

On the PowerPC architecture TimeWeaver relies on<br />

NEXUS program flow trace messages. Such traces consist of<br />

trace segments separated by trace events. TimeWeaver maps<br />

the events to points in the control-flow graph (trace points) and<br />

the segments to program paths between these points. This is<br />

done for those parts of the trace that reach from the call of the<br />

routine used as analysis entry till the end of that routine or any<br />

other feasible end of execution. Such parts are called trace<br />

snippets. A single trace may contain several trace snippets.<br />

TimeWeaver can operate on one or more traces given as trace<br />

files, each containing one or more trace snippets.<br />

A NEXUS trace event encodes its type, a time stamp<br />

containing the elapsed CPU cycles since the last trace event<br />

and the contents of the branch history buffer, which can be<br />

used to reconstruct execution path decisions and to map<br />

trace segments to the control-flow graph of the corresponding<br />

executable.<br />

Microprocessor debugging solutions like the Lauterbach<br />

PowerDebug Pro [20] make it possible to record NEXUS trace events as<br />

they are emitted during program execution and to export them<br />

in various formats. TimeWeaver can process those exports for<br />

its timing analysis as described below.<br />

Here is a sample NEXUS trace excerpt (with some<br />

information removed) in ASCII format:<br />

+056 TCODE=1D PT-IBHSM F-ADDR=F1F4 HIST=2 TS=8847<br />

+064 TCODE=21 PT-PTCM EVCODE=A TS=88F1<br />

+072 TCODE=1C PT-IBHM U-ADDR=03DC HIST=1 TS=8D62<br />

+080 TCODE=21 PT-PTCM EVCODE=A TS=8E2F<br />

+088 TCODE=21 PT-PTCM EVCODE=A TS=8FBA<br />

+096 TCODE=21 PT-PTCM EVCODE=A TS=9105<br />

+104 TCODE=1C PT-IBHM U-ADDR=02CC HIST=1 TS=9275<br />

+112 TCODE=1C PT-IBHM U-ADDR=01F0 HIST=1 TS=93BF<br />

+120 TCODE=21 PT-PTCM EVCODE=A TS=997B<br />

+128 TCODE=1C PT-IBHM U-ADDR=0044 HIST=1 TS=9B02<br />

+136 TCODE=21 PT-PTCM EVCODE=A TS=9F21<br />

This output has been generated using the following<br />

command in the Lauterbach Trace32 tool:<br />

Trace.export.ascii nexus /showRecord<br />

Each line corresponds to a trace event. The number at the<br />

beginning of the line is the trace record number. The second<br />

and third columns represent the particular trace event type<br />

followed by type-specific information like branch history and<br />

program address information associated with the event. The TS<br />

number at the end is a time stamp.<br />

Debugging solutions differ in the format in which they<br />

export trace data. Some debuggers allow the user to configure<br />

the output. TimeWeaver can currently import traces which<br />

have been exported by Lauterbach, PLS or iSYSTEM<br />

debuggers. Whenever the format is configurable, we have<br />

identified a minimal set of information needed to perform the<br />

TimeWeaver analysis. Additionally, TimeWeaver can be easily<br />

extended to support other trace formats.<br />

B. TimeWeaver Toolchain<br />

The main inputs for TimeWeaver are the fully linked<br />

executable(s), timed traces and the location of the analyzed<br />

code in memory (entry point, which usually is the name of<br />

a task or function). Optionally, users can specify further<br />

semantic information for the analysis, such as targets of computed<br />

calls, loop bounds, values of registers and memory cells. This<br />

information is used to fine-tune the analysis. The analysis<br />

proceeds in several stages: decoding, loop/value analysis, trace<br />

analysis, and path analysis. Most steps in this tool chain are<br />

shared with aiT, leveraging its powerful static analysis<br />

framework.<br />

The decoding phase of TimeWeaver is mostly identical to<br />

the decoding phase of aiT. One important difference is that<br />

when encountering call targets which cannot be statically<br />

resolved, TimeWeaver can be instructed to extract the targets<br />

of unresolved branches or calls from the input traces. To this<br />

end there is a feedback loop between the CFG reconstruction<br />

and the trace analysis step (cf. Fig. 1). As an alternative, the<br />

same user annotations can be used as in the aiT tool chain.<br />

In the next phase, several microarchitectural analyses are<br />

performed on the reconstructed CFG starting with the<br />

combined loop and value analysis, again identical to the aiT tool<br />

chain. It determines possible values of registers and memory<br />

cells, addresses of memory accesses, as well as loop and<br />

recursion bounds. Based on this, statically infeasible paths are<br />

computed, i.e., parts of the program that cannot be reached by<br />

any execution under the given configuration. This is important<br />

because each detected infeasible path increases the trace<br />

coverage. Such paths are pruned from further analysis. If the<br />

value analysis cannot compute a loop bound or if the computed<br />

bound is not precise enough, users can specify custom bounds<br />

by means of annotations which are used by the analysis. The<br />

loop transformation allows loops in the CFG to be handled as<br />

self-recursive routines to improve analysis precision [32].<br />

After value analysis, the analyzer has annotated each<br />

instruction in the control-flow graph with context-sensitive<br />

analysis results. This context-sensitivity is important because<br />

the precision of an analysis can be improved significantly if the<br />

execution environment is considered [32]. For example, if a<br />

routine is called with different register values from two<br />

different program points, the execution time in both situations<br />

might be different. Depending on the context settings, this is<br />

taken into account leading to higher precision in the analysis<br />

result.<br />

Fig. 1. TimeWeaver tool chain structure<br />

In the trace analysis step the given traces are analyzed such<br />

that each trace event is mapped to a program point in the<br />

control-flow graph. This mapping defines the trace points and<br />

segments mentioned above and is not only necessary for the<br />

whole analysis but also ensures that the input trace matches the<br />

analyzed binary. In case a preemptive system has been traced,<br />

interrupts are detected and reported. The extracted timing<br />

information, i.e., the clock cycles that elapsed<br />

between two consecutive trace points, is annotated to the CFG<br />

in a context-sensitive manner.<br />

After the trace conversion, a CFG which combines the<br />

results of value analysis and traced execution timings (both<br />

context-sensitive) is available. This graph is the input for the<br />

next step, the path analysis phase. Here, the trace segment<br />

times alongside the control-flow graph are used to generate an<br />

integer linear program (ILP) formulation to compute the worst-case<br />

execution path with respect to the traced timings. At this<br />

point, the recorded times for each pair of trace segment and<br />

analysis context are maximized. The ILP formulation is<br />

structurally the same as in the path analysis of aiT [33] with the<br />

exception that the involved execution times are not computed<br />

by a micro-architectural pipeline analysis but are extracted<br />

from the input traces. The generated ILP is fed to a solver<br />

whose solution is the worst-case execution path alongside its<br />

costs, i.e., the WCET estimate of the analyzed task. This<br />

solution is annotated to the CFG for the final step, namely<br />

reporting and visualization.<br />
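
Structurally this follows the well-known implicit path enumeration technique: writing x_i for the execution count of CFG edge i and t_i for the maximal observed time of the corresponding trace segment, the ILP maximizes the accumulated time subject to flow conservation. A generic sketch; the concrete constraint set used by aiT and TimeWeaver is tool-specific:<br />

```latex
\max \sum_i t_i \, x_i
\quad \text{subject to} \quad
\sum_{e \in \mathrm{in}(b)} x_e \;=\; \sum_{e \in \mathrm{out}(b)} x_e
\;\; \text{for every basic block } b,
\qquad x_{\mathrm{entry}} = 1,
\qquad x_\ell \le \mathrm{bound}_\ell
\;\; \text{for every loop back edge } \ell .
```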

As mentioned above, the input traces might contain<br />

asynchronous events like DRAM refreshes which can lead to<br />

exceptionally high trace segment times. TimeWeaver can<br />

address these with a filter for trace segment times based on<br />

their cumulative frequency (CF), i.e. their occurrence<br />

percentage. The threshold refers to a percentage of occurrences<br />

ordered by execution times (as in the survival graph, see<br />

below). A threshold of 0% is passed by all occurrences. A<br />

threshold of 5% is passed by all but the 4 most expensive ones<br />

(in terms of execution time) if there are 100 occurrences, by all<br />

but the 9 most expensive ones if there are 200 occurrences, etc.<br />

Trace segment times that do not pass the specified threshold<br />

are ignored in the ILP generation. The filter function is applied<br />

for each trace segment separately. TimeWeaver can also<br />

simulate the effect of the CF filter in its statistics view to<br />

experiment with different filter values.<br />

C. TimeWeaver Result Reporting and Visualization<br />

Besides the global WCET estimate and the execution path<br />

triggering it, TimeWeaver offers a variety of reporting<br />

facilities:<br />

• WCET estimate per routine (including cumulative<br />

information of called sub routines),<br />

• Context-specific WCET estimate per routine (including<br />

cumulative information of called sub routines),<br />

• Determined loop bounds (distinguishing between traced,<br />

analyzed, and effective bounds) including loop scaling<br />

conflicts,<br />

• Variance of trace segment times (context-sensitive),<br />

• Trace coverage with respect to the number of basic<br />

blocks and instructions in the analyzed code, and<br />

• Memory access information along the computed worst-case<br />

path.<br />

In addition to the statistics described above, TimeWeaver<br />

provides the following visualizations:<br />

• Analysis result graph to interactively explore the results,<br />

• Per trace segment distribution graph for the recorded<br />

segment times (cf. Fig. 2), and a<br />

• Per trace segment survival graph to show the<br />

cumulative frequency of the recorded segment times<br />

(cf. Fig. 3).<br />

Fig. 2. Sample distribution graph of a trace segment<br />

Fig. 3. Sample survival graph of a trace segment<br />

D. WCET Estimate Extrapolation<br />

As mentioned above, TimeWeaver computes the global<br />

WCET estimate based on the observed execution times of trace<br />

segments. The times are maximized per trace segment and the<br />

maximized times are composed to identify the worst-case path<br />

with respect to those figures.<br />

Whereas, in general, one would need to measure all possible<br />

execution paths of the analyzed program for coverage<br />

reasons (which is impractical for real-world applications),<br />

TimeWeaver can compute an upper bound on the global<br />

execution time of the analyzed program based on the trace<br />

segment times extracted from the input traces.<br />

This way, it is only necessary to trace all possible execution<br />

paths between two consecutive trace points. By inserting<br />

custom trace points, the user can further decrease the required<br />

number of measurements. Fig. 4 illustrates this by showing<br />

three consecutive trace points (TP1, TP2, and TP3) and the<br />

possible execution paths between each of them. TimeWeaver<br />

composes the WCET estimate for the time between TP1 and<br />

TP3 by the sum over the maximized trace segment time<br />

between TP1→TP2 and the maximized trace segment time<br />

between TP2→TP3. Thus, the measurements need to cover the<br />

four execution paths between TP1→TP2 as well as the<br />

three execution paths between TP2→TP3. Without that time<br />

composition, all 12 execution paths between TP1→TP3 need<br />

to be measured.<br />

E. Loop Scaling<br />

For loops, there might be a gap between the maximum of<br />

the observed iteration counts in the input traces (traced bound)<br />

and the statically possible maximum iteration count (analyzed<br />

bound) which is computed by the value analysis. The bound<br />

actually used for the ILP generation is the so-called effective<br />

bound which is the analyzed bound if it is finite and applicable<br />

(cf. scaling conflicts below) and otherwise the traced bound.<br />

On request, the intersection of the analyzed and traced bounds<br />

is used.<br />

If the effective bound is higher than the traced bound, the<br />

maximum observed execution time (context-sensitively) for<br />

one loop iteration is scaled up to the effective bound. This<br />

overcomes the necessity to trace each loop in the analyzed task<br />

with its worst-case iteration count, which might be hard to<br />

achieve because loop conditions often are data-dependent and<br />

thus can be complex to trigger.<br />

However, loop scaling as described above is not always<br />

directly applicable. It requires each trace to pass a trace point<br />

inside the loop body. If there is at least one traced execution<br />

path through the loop body without a trace point, scaling<br />

cannot be done and only the traced bounds are used for this<br />

loop. Such a situation is called an event loop scaling conflict.<br />

The solution is to either trace the worst-case loop iteration<br />

count or to ensure that each traced path through the loop body<br />

passes a trace point (by inserting custom trace points).<br />

There is another situation which triggers a loop scaling<br />

conflict: if due to the context settings of the analysis a loop is<br />

virtually unrolled more times than the corresponding loop body<br />

has been executed in the trace, scaling cannot be applied<br />

either. The reason is that the scaling is applied in the last loop<br />

context, i.e., in that context which represents the last loop<br />

iteration(s). In that case, there is no traced loop body time in<br />

the trace mapped to this context which prevents scaling. Such a<br />

conflict is called an unroll loop scaling conflict. To solve this<br />

conflict, one can either trace the worst-case iteration count of<br />

the corresponding loop or decrease the (virtual) loop unrolling<br />

during analysis of this particular loop to the traced<br />

bound.<br />

Fig. 4. Execution paths between trace points<br />

V. EXPERIMENTAL RESULTS ON TIMEWEAVER<br />

To evaluate TimeWeaver for PowerPC, we recorded<br />

program executions on an NXP T1040 [25] evaluation board<br />

using a Lauterbach PowerDebug Pro JTAG debugger.<br />

A. Loop scaling<br />

Execution times for loops can be scaled up from the<br />

maximum observed execution time of the loop body. This can<br />

be seen in the analysis of the following program:<br />

1 volatile int sensor;<br />

2<br />

3 int helper (int x)<br />

4 {<br />

5 int result = x;<br />

6 result += sensor;<br />

7 return result + 3;<br />

8 }<br />

9<br />

10<br />

11 int main (void)<br />

12 {<br />

13 int result = 0;<br />

14<br />

15 result += helper(256);<br />

16<br />

17 int i;<br />

18 int loop_bound = (sensor-0xDEADBEEF)+5;<br />

19<br />

20 /* Loop with statically unknown bound */<br />

21 for (i=0 ; i


Application Trace [cycles] Estimate [cycles] Diff [%]<br />

crc 809068 829039 2.47<br />

edn 4788025 4791420 0.07<br />

eratosthenes sieve 368345 369803 0.40<br />

dhrystone 168093 177314 5.49<br />

md5 127857 131718 3.02<br />

nestedDepLoops 2747357 2747359 0.00<br />

sha 23426161 23815350 1.66<br />

Avionics Task 420677 498028 18.38<br />

Automotive Task 1 65058 71964 10.62<br />

Automotive Task 2 27215 28967 6.44<br />

Automotive Task 3 17386 18595 6.95<br />

Automotive Task 4 101749 109302 7.42<br />

Tab. 1. TimeWeaver Result Comparison<br />

For each application, the maximum observed end-to-end<br />

time has been extracted from the traces and compared with the<br />

WCET estimate computed by TimeWeaver. The difference<br />

represents the overestimation of TimeWeaver resulting from<br />

the composition of trace segment times to a global estimate. On<br />

average, the TimeWeaver results from the table above are<br />

5.24% above the maximum observed end-to-end times from the<br />

traces.<br />

VIII. CONCLUSION<br />

In this article we have given a definition of timing<br />

predictability and discussed hardware features which increase<br />

the difficulty of obtaining safe and precise worst-case<br />

execution time bounds, both on single-core and multicore<br />

processors. We have described the methodology of static<br />

worst-case execution time analysis which can provide<br />

guaranteed WCET bounds on complex processors, if the<br />

timing behavior of the processor is well specified, and<br />

asynchronous interferences can be controlled or bounded.<br />

Hybrid worst-case execution time analysis allows to obtain<br />

worst-case execution time bounds even for systems where<br />

these conditions are not met. We have given an overview of<br />

the hybrid WCET analyzer TimeWeaver which combines<br />

static value and path analysis with timing measurements based<br />

on non-intrusive instruction-level real-time traces. The trace<br />

information covers interference effects, e.g., by accesses to<br />

shared resources from different cores, without being distorted<br />

by probe effects since no instrumentation code is needed. The<br />

analysis results include the computed WCET bound with the<br />

time-critical path, and information about the trace coverage<br />

obtained. They provide valuable feedback for optimizing trace<br />

coverage, for assessing system safety, and for optimizing<br />

worst-case performance. Experimental results show that with<br />

good trace coverage safe and precise WCET bounds can be<br />

efficiently computed.<br />

ACKNOWLEDGMENT<br />

This work was funded within the project ARAMiS II by the<br />

German Federal Ministry for Education and Research (BMBF)<br />

with the funding ID 01IS16025B, and within the BMBF<br />

project EMPHASE with the funding ID 16EMO0183. The<br />

responsibility for the content remains with the authors.<br />

REFERENCES<br />

[1] AbsInt GmbH. aiT Worst-Case Execution Time Analyzer Website.<br />

http://www.AbsInt.com/ait.<br />

[2] AbsInt GmbH. TimeWeaver Website. http://www.AbsInt.com/timeweaver.<br />

[3] ARM Ltd. CoreSight Program Flow Trace PFTv1.0 and PFTv1.1<br />

architecture specification, 2011. ARM IHI 0035B.<br />

[4] Philip Axer, Rolf Ernst, Heiko Falk, Alain Girault, Daniel Grund, Nan<br />

Guan, Bengt Jonsson, Peter Marwedel, Jan Reineke, Christine<br />

Rochange, Maurice Sebastian, Reinhard von Hanxleden, Reinhard<br />

Wilhelm, and Wang Yi. Building timing predictable embedded systems.<br />

ACM Transactions on Embedded Computing Systems, 13(4):82:1–<br />

82:37, 2014.<br />

[5] Certification Authorities Software Team (CAST). Position Paper<br />

CAST-32A Multi-core Processors, November 2016.<br />

[6] P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model<br />

for static analysis of programs by construction or approximation of<br />

fixpoints. In 4th POPL, pages 238–252, Los Angeles, CA, 1977. ACM<br />

Press.<br />

[7] Christoph Cullmann. Cache persistence analysis for embedded real-time<br />

systems. PhD thesis, Universitaet des Saarlandes, Postfach 151141,<br />

66041 Saarbruecken, 2013.<br />

[8] Christoph Cullmann, Christian Ferdinand, Gernot Gebhard, Daniel<br />

Grund, Claire Maiza, Jan Reineke, Benoît Triquet, and Reinhard<br />

Wilhelm. Predictability considerations in the design of multi-core<br />

embedded systems. In Proceedings of Embedded Real Time Software<br />

and Systems, pages 36–42, May 2010.<br />

[9] Andreas Ermedahl. A Modular Tool Architecture for Worst-Case<br />

Execution Time Analysis. PhD thesis, Uppsala University, 2003.<br />

[10] Christian Ferdinand. Cache Behavior Prediction for Real-Time Systems.<br />

PhD thesis, Saarland University, 1997.<br />

[11] Christian Ferdinand, Reinhold Heckmann, Marc Langenbach, Florian<br />

Martin, Michael Schmidt, Henrik Theiling, Stephan Thesing, and<br />

Reinhard Wilhelm. Reliable and precise WCET determination for a<br />

real-life processor. In Proceedings of EMSOFT 2001, First Workshop<br />

on Embedded Software, volume 2211 of Lecture Notes in Computer<br />

Science, pages 469–485. Springer-Verlag, 2001.<br />

[12] Christian Ferdinand and Reinhard Wilhelm. Fast and Efficient Cache<br />

Behavior Prediction for Real-Time Systems. Real-Time Systems, 17(2-<br />

3):131–181, 1999.<br />

[13] Freescale Inc. QorIQ P4080 Communications Processor Product<br />

Brief, September 2008. Rev. 1.<br />

[14] Sebastian Hahn, Jan Reineke, and Reinhard Wilhelm. Towards<br />

compositionality in execution time analysis: Definition and challenges.<br />

SIGBED Rev., 12(1):28–36, March 2015.<br />

[15] IEC 61508. Functional safety of electrical/electronic/programmable<br />

electronic safety-related systems, 2010.<br />

[16] IEEE-ISTO. IEEE-ISTO 5001 TM -2012, The Nexus 5001 TM Forum<br />

Standard for a Global Embedded Processor Debug Interface, 2012.<br />

[17] Infineon Technologies AG. AURIXTM TC27x D-Step User’s Manual,<br />

2014.<br />

[18] D. Kästner, M. Schlickling, M. Pister, C. Cullmann, G. Gebhard,<br />

R. Heckmann, and C. Ferdinand. Meeting Real-Time Requirements<br />

with Multi-Core Processors. Safecomp 2012 Workshop: Next<br />

Generation of System Assurance Approaches for Safety-Critical<br />

Systems (SASSUR), September 2012.<br />

[19] Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike<br />

Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael<br />

Schwarz, and Yuval Yarom. Spectre attacks: Exploiting speculative<br />

execution. ArXiv e-prints, January 2018.<br />

[20] Lauterbach GmbH. Lauterbach Website. http://www.lauterbach.com.<br />

[21] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher,<br />

Werner Haas, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval<br />

Yarom, and Mike Hamburg. Meltdown. ArXiv e-prints, January 2018.<br />

[22] Thomas Lundqvist and Per Stenström. Timing anomalies in<br />

dynamically scheduled microprocessors. In Real-Time Systems<br />

Symposium (RTSS), December 1999.<br />

www.embedded-world.eu<br />

536




Missing Relationship between Software FTAs and<br />

System FTA on Multi-Core Platforms<br />

– Identification and Resolving<br />

Hossam H. Abolfotuh<br />

Functional Safety Department<br />

eJad L.L.C<br />

Cairo, Egypt<br />

Hossam.Abolfotuh@ejad.com.eg<br />

Esam Mamdouh<br />

Functional Safety Department<br />

eJad L.L.C<br />

Cairo, Egypt<br />

Esam.Mamdouh@ejad.com.eg<br />

Abstract—Functional safety is a key player in the<br />

development of Advanced Driver Assistance Systems (ADAS).<br />

The primary objective of applying safety analysis on software<br />

architectural design is to anticipate potential scenarios of failure.<br />

This kind of analysis aims to identify how failures originate at the<br />

low-levels of the design and how combinations or sequences of<br />

such low-level failures propagate to higher levels leading to a<br />

safety goal violation. Such analysis can be realized by applying the software Fault Tree Analysis (FTA) method. Applying software FTA to ADAS architectures is challenging, because ADAS software is mainly developed on multi-core platforms. This paper discusses how software FTA can be performed on a multi-core platform, taking the dependencies between the cores into consideration; it also discusses how these software FTAs are linked with the system FTA to reach a consistent analysis.<br />

Keywords—Software; Automotive; Multi-Core; Functional Safety; ISO 26262; FTA; Fault Tree Analysis<br />

I. INTRODUCTION<br />

Functional safety highly impacts the automotive industry, especially now that autonomous driving is being adopted. In critical systems, such as the radar and camera applications that participate in autonomous driving, functional safety is a must. In fact, there are many systematic failures that can lead to a violation of the safety goals, which in turn may put passengers’ lives at risk. This raises the need for performing a systematic safety analysis on the system, software and hardware levels. The primary objective of applying safety analysis is to anticipate potential scenarios of failure. This kind of analysis aims to identify how failures originate at the low levels of the design and how combinations or sequences of such low-level failures propagate to higher levels, leading to a safety goal violation. Such analysis can be realized by applying the software Fault Tree Analysis (FTA) method according to ISO 26262 [1]. FTA is a top-down approach, which is more appropriate for software applications than bottom-up approaches. This paper discusses how software FTA can be performed on a multi-core platform, taking the dependencies between the cores into consideration; it also discusses the linkage of these software FTAs with the system FTA to reach a consistent safety analysis.<br />

II. FAULT TREE ANALYSIS (FTA)<br />

A. What is Fault Tree Analysis?<br />

In general, the FTA is a deductive top-down analysis<br />

approach, in which an undesired state of a system is analyzed<br />

using Boolean logic to combine a series of lower-level events.<br />

The undesired states of the system are defined as a set of Top Level Events (TLEs) that represent the failure events which lead to these states and affect the critical system outputs. The analysis then traces these events down to their root causes, which are known as Basic Events (BEs). After defining these BEs, a list of safety mechanisms is provided to tolerate them.<br />

B. How is Fault Tree Analysis performed?<br />

The FTA is performed separately for each TLE in the set. Each TLE will be the starting point of a fault tree, as shown in Fig. 1. For instance, consider (TLE1) as a “Failure in<br />

Object Detection” in a radar system. Working backward from<br />

this top event it might be determined that it is caused by one of<br />

two events (E); the first one is a failure in transmission of<br />

messages containing list of detected objects (E1), while the<br />

second one is a failure in the object detection algorithm (E2).<br />

This condition is represented in the fault tree diagram as a<br />

logical OR between these possible causes as shown in Fig. 1.<br />

Following (E2), it might be determined that it is caused by<br />

one of two events; the first one is a failure in radar signal<br />

processing algorithm to create the list of detected objects (E3),<br />

while the second one is a memory corruption in the list of<br />

detected objects (E4). This is another logical OR. A design<br />

improvement using “Memory Protection Unit” can be<br />

implemented to protect the critical data – such as list of<br />

www.embedded-world.eu<br />

538


Fig. 1. Fault Tree diagram example<br />

detected objects – from corruption. This is a safety mechanism<br />

(SM1), added in the form of a logical AND with the memory corruption basic event (BE1).<br />
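The Boolean structure of this example tree can be expressed directly in code. The following is a minimal illustrative sketch (the event names follow the example above; the encoding is ours, not part of any FTA tool):<br />

```c
#include <stdbool.h>

/* Toy encoding of the fault tree example above: TLE1 ("Failure in
 * Object Detection") fires if message transmission fails (E1) OR the
 * object detection algorithm fails (E2). E2 in turn fires if signal
 * processing fails (E3) OR memory corruption of the objects list (BE1)
 * occurs AND the Memory Protection Unit safety mechanism (SM1) fails
 * to contain it. */
static bool tle1_fires(bool e1, bool e3, bool be1, bool sm1_failed)
{
    bool e4 = be1 && sm1_failed; /* AND gate: basic event plus failed SM */
    bool e2 = e3 || e4;          /* OR gate below E2 */
    return e1 || e2;             /* OR gate at the top of the tree */
}
```

With SM1 in place, the memory-corruption path contributes to TLE1 only when the basic event occurs and the safety mechanism fails at the same time, which is exactly what the AND gate in Fig. 1 expresses.<br />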

III. CHALLENGES FACING FTA<br />

The first challenge in performing FTA is that the TLEs are defined on the system level, where the safety goals are defined. After performing the FTA on the system level, there will be some events that need deeper analysis on either the software or the hardware level. This requires clear identification of the relations between the different applied analyses (e.g. system FTA and software FTA) to obtain a consistently integrated FTA at the end. Problems in this step will lead to difficulties in the integration phase of the different FTAs performed on the different development levels.<br />

Another challenge facing FTAs on multi-core platforms is that FTA is usually performed separately on each core, without considering the inter-dependencies between the cores during the safety analysis phase. This leads to missing possible failures resulting from these inter-dependencies, and consequently to missing safety mechanisms.<br />

A. Complications during linking Software to System FTA<br />

Starting the FTA at the software level apart from the system FTA is a big mistake. The software may have its own TLEs, but it is never independent from the system. Any system FTA will end up with some BEs that need to be analyzed in depth in the software architecture. If these events are not taken into consideration when performing the software FTA, a gap will arise in the integration phase of the system and software FTAs. In section IV, a proposal illustrates how this gap can be covered.<br />

B. Missing dependencies between FTAs on multi-cores<br />

On a multi-core platform, the relations between the cores create dependencies between the events. These dependencies must be taken into consideration during the FTA of each core. As a result, FTA cannot be performed on each core in complete separation from the other cores. Separation may nevertheless be needed, for instance when each core is the responsibility of a different team. Separation is therefore allowed, but only after considering the dependencies and translating them into new events.<br />

For example, the data transferred through Inter Process Communication (IPC) is a critical source of dependencies that should be considered during FTA on a multi-core platform.<br />

IV. THE PROPOSED METHODOLOGY<br />

In this paper, a systematic methodology is proposed to address the previously mentioned challenges. This methodology is based on separating the work as flexibly as possible without dropping any dependency. The system FTA can be performed by a team separate from the software team. Even the software FTAs themselves can be split among different cores. In the end, however, the dependencies must be handled in a systematic, well-organized way.<br />

A. Integrating System FTA with Software FTA<br />

Starting with the challenge of linking the system FTA to the software FTAs, the dependency is clear, since the system FTA generates additional TLEs for the software. In order to make sure no links are missed between the system and software analyses, the set of software TLEs must be stated clearly and must include two categories. The first category contains the software TLEs created to correspond to those BEs of the system FTA that need deeper analysis in the software, for example the generation of correct critical output by the software algorithms. The second category contains the TLEs that arise from the software architecture itself and have no clear mapping to system BEs; new BEs shall be created in the system FTA and linked to these software TLEs, for example the sequence and timing constraints of executing a specific software algorithm.<br />

B. Performing Software FTA on multi-core<br />

The second challenge is performing the analysis on multi-core platforms. The dependencies between the cores shall be studied and defined. In order to make sure no links are missed between the cores, a set of TLEs must be stated clearly for each core, including two categories of TLEs. The first category contains this core's share of the entire software set of TLEs, based on the core functionality, for example the generation of the list of objects by the main core. The second category shall cover all the dependencies in which this core is the source of a dependency. It adds a new TLE on the source side and expects a corresponding BE on the receiver side of the dependency, for example the transmission of car information such as speed and yaw rate from the main core (core-0) to the other cores (core-1 and core-2), which use this information in their algorithms, as shown in Fig. 2.<br />

Fig. 2. IPC on a tri-core platform must be taken into consideration during FTA on each core.<br />



V. CASE STUDY<br />

In the case study, these solutions are applied to a Medium Range Radar system for Emergency Braking Assistance, developed on the tri-core platform "MPC5774-RaceRunner" [2]. Core-0 is the main core, which communicates with the centralized ECU via the CAN bus; it receives car information such as speed and steering angle, and sends the list of detections. Core-1 is responsible for radar signal processing. Core-2 is responsible for executing the algorithms that detect the objects. This system has a main TLE defined as "Existing object not reported by Radar ECU". The system and software FTAs were performed using Medini Analyze.<br />

On the system level, a high-level failure – identified by<br />

system FTA – is the transmission of a wrong list of detected<br />

objects. On the software level, this can be easily mapped to a<br />

TLE – (TLE2) in this example – and deeply analyzed in<br />

software. This link is highlighted in green color. On the other<br />

hand, based on the nature of the software, another failure arises from the transmission of a correct but obsolete list of detected objects. This failure can occur due to a timing violation. It can be tolerated using a timing protection safety mechanism such as flow control monitoring. Accordingly, a<br />

BE shall be added in the system FTA – highlighted in purple –<br />

as shown in Fig. 3 in order to be linked to (TLE4) that is<br />

created in the software FTA.<br />

Later in the project, it was required to apply software FTA on the three cores. During the software FTA on core-2, a TLE related to the generation of the objects list is analyzed. This list is later sent to core-0, so a link between core-0 and core-2 – highlighted in blue color – shall appear in the different software FTAs: a BE in the software FTA of the receiving core (core-0), as shown in Fig. 4, is linked to (TLE5) in the source core (core-2), as shown in Fig. 5.<br />

Afterwards, by analyzing the dependencies between the two cores, it was found that the transfer of critical vehicle information – such as speed and yaw rate – from core-0 to core-2 was missing in the FTAs. A BE appeared in the software FTA of the receiving core (core-2), as shown in Fig. 5, which was not linked to any TLE. Accordingly, an additional (TLE3) was created in the source core (core-0), as shown in Fig. 6, and the two were linked together, as highlighted in orange color.<br />

Fig. 3. System FTA diagram<br />

Fig. 5. Software FTA diagram on Core-2-TLE5<br />

Fig. 4. Software FTA diagram on Core 0-TLE2<br />

Fig. 6. Software FTA diagram on Core-0-TLE3<br />



VI. CONCLUSION<br />

In this paper, a systematic methodology was introduced to identify missing relationships when performing FTA on multi-core platforms. In order to obtain consistently integrated system and software FTAs, an additional set of TLEs must be defined from the system FTA to make sure they link with the software FTAs.<br />

Furthermore, when performing the software FTAs across different cores, the dependencies between the cores shall be considered, because they add more TLEs to each core in addition to its original share of the software TLEs.<br />

Covering these two challenges ensures the integrity of the different FTAs performed on the system and software levels.<br />

VII. REFERENCES<br />

[1] International standard, “Road Vehicles – Functional Safety”, ISO<br />

Standard 26262, first edition, Nov. 2011.<br />

[2] NXP Datasheet, “MPC5775K Reference Manual”, Document Number:<br />

MPC5775KRM, Rev. 2, 2/2014.<br />



Stopping Buffer Overruns<br />

Connecting static and dynamic code analysis<br />

Mark Hermeling<br />

GrammaTech, Inc.<br />

Ithaca, NY, USA<br />

mhermeling@grammatech.com<br />

Abstract—Buffer overruns are abundant in many deployed<br />

software systems, open source or commercial, enterprise and<br />

embedded. They are causing an embarrassing number of<br />

software security issues. A system is only as secure as its weakest<br />

link and a buffer overrun may provide the attacker a foothold<br />

into the system. Static analysis has been used for decades to<br />

detect buffer overruns, but they still occur, as static analysis is not perfect and is prone to both false positives and false negatives. This<br />

paper will explain why buffer overruns are hard to detect and<br />

propose how we can combine static and dynamic analysis to help<br />

detect and resolve them.<br />

Keywords—buffer overrun; static analysis; dynamic analysis; functional testing; security; quality<br />

I. INTRODUCTION<br />

Static analysis has been a key technology in software<br />

developer’s toolboxes for decades. Static analysis is the<br />

technique of performing detailed analysis on source code<br />

without actually executing the software. This means that the<br />

technique can be applied very early in the software<br />

development lifecycle, before the team is even able to perform<br />

significant unit, integration, or system testing dynamically.<br />

Static analysis finds serious programming mistakes, such as<br />

buffer overruns, which can lead to security vulnerabilities. It<br />

can find these mistakes with very little effort so the use of<br />

static analysis is highly recommended by many practitioners,<br />

organizations as well as standards bodies that concern<br />

themselves with high quality software products in practically<br />

all verticals such as automotive, transportation, industrial,<br />

medical, networking, consumer electronics, aerospace and<br />

defense.<br />

Static analysis has evolved significantly from the flexible<br />

linters of the early days that scanned for simple code patterns to<br />

today’s advanced, whole program, static analysis tools that<br />

explore paths through the code and perform abstract analysis of<br />

that code. Even these advanced tools are not perfect on<br />

significantly sized, real-world code bases. Static analysis tools<br />

suffer from false positives, warnings that the tool emits on code<br />

that is not defective, as well as false negatives, where the tool fails to emit a warning on code that is actually defective.<br />

In this paper, we will explore buffer overruns and explore<br />

what makes some buffer overruns difficult to detect by static<br />

analysis tools. We will then explore techniques that can be<br />

employed to detect these buffer overruns dynamically during<br />

run-time.<br />

II. SIMPLE BUFFER OVERRUNS<br />

A buffer overrun in its simplest form is a read or write of<br />

data after the end of an object in memory. This can happen in<br />

many different programming languages, but this paper will<br />

focus on C and C++. Objects can of course be allocated either<br />

statically or dynamically. A very simple buffer overrun can be<br />

of the form in Figure 1.<br />

char buf[10];<br />

…<br />

buf[10] = 'a';<br />

Figure 1 -- Basic buffer overrun<br />

In this case, memory for an array of 10 characters is allocated, and later the 11th element of the array (index 10) is accessed. The result of this access is undefined; in most implementations the character ‘a’ will be written to whatever memory area is next to this buffer, overwriting what was there before.<br />

These types of faults are easy for static analysis tools to detect: the index value is hard-coded, and static analysis tools will easily flag this as a buffer overrun. In the real world, though, indexes are not hard-coded and can come from a variety of sources: device input, user input, network input, file reads, random variables and the like. There are many patterns that static analysis tools have a hard time reasoning through, such as the code in Figure 2.<br />

int i;<br />

char * s;<br />

s = (char *) malloc(100);<br />

...<br />

i=0;<br />

while (s[i] != '\0')<br />

i++;<br />

Figure 2 -- More complex buffer overrun<br />



In this example, it is harder for a static analysis tool to<br />

detect a buffer overrun. Whether or not there is an overrun here<br />

will depend on the value of the string pointed to by the variable s, and especially on whether it is null-terminated. The string can be populated anywhere in the program and can come from user, network or other input that the tool may not be able to track.<br />

Advanced static analysis tools will catch this in some cases, but<br />

not all.<br />
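One way to make the loop in Figure 2 both analyzable and safe even for non-terminated input is to carry the allocation size along and bound the scan explicitly. A minimal sketch (the function name is ours, not from any tool or standard):<br />

```c
#include <stddef.h>

/* Bounded variant of the scan in Figure 2: the capacity of the buffer
 * is passed explicitly, so the loop cannot run past the allocation
 * even when the contents are not null-terminated. */
static size_t bounded_strlen(const char *s, size_t cap)
{
    size_t i = 0;
    while (i < cap && s[i] != '\0')
        i++;
    return i;
}
```

POSIX systems offer strnlen with the same contract; the point here is that the explicit bound gives a static analysis tool a provable loop invariant, so the warning disappears for the right reason.<br />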

III. DATA TAINT<br />

Instead, what a static analysis tool may do is flag a warning to the user, indicating that a particular variable has been read from a suspicious source and has not been sufficiently checked for erroneous or suspicious values. A suspicious source could, for example, be user input, network input or file input. Take the example in Figure 3.<br />

int c;<br />

char buf[10];<br />

c = getchar();<br />

buf[c] = 'a';<br />

Figure 3 -- Data taint<br />

The last line could lead to a buffer overrun, depending on the user input read on the third line. Static analysis tools cannot predict user input, but they can flag line 4 as tainted data. This<br />

code will lead to problems in fielded systems. Data taint is<br />

further explained in the white paper ‘Protecting against Tainted<br />

Data in Embedded Apps with Static Analysis’ 2 .<br />
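The usual remedy for a tainted index is to range-check it before use. A minimal sketch of Figure 3 with the check added (the function name and error convention are ours, for illustration only):<br />

```c
#define BUF_LEN 10

/* Range-checks the tainted value before it is used as an index into
 * buf, rejecting anything outside [0, BUF_LEN). getchar() can also
 * return EOF (-1), which the lower bound rejects as well. */
static int store_at(char buf[BUF_LEN], int c)
{
    if (c < 0 || c >= BUF_LEN)
        return -1;      /* reject out-of-range or EOF input */
    buf[c] = 'a';
    return 0;
}
```

Once the tainted value flows through an explicit check like this, the data-taint warning is resolved: every path to the array access is provably in bounds.<br />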

IV. CONTROL FLOW<br />

In the previous, very simple examples, file name and line<br />

number would be sufficient information to understand the<br />

problem. In real-world code, though, it is not sufficient to just provide file name and line number. As mentioned before, advanced static analysis tools perform whole-program analysis; for a specific problem they will indicate why a particular statement is considered a problem, and the tool will also provide the path of execution that it analyzed. This path is important for the software developer to understand the reasoning and come up with a fix for the problem.<br />

The control flow of a particular problem may include<br />

multiple different function calls across different compilation<br />

units, if statements, switch statements, for loops and the like.<br />

The control flow can also contain pointer dereferences,<br />

including function pointers. All these constructs can make<br />

analysis quite complex.<br />

V. RECALL AND PRECISION<br />

Static analysis tool vendors work hard at creating tools that<br />

can do deep analysis and find as many problems as they can.<br />

The goal is always to have a high recall, where recall is defined<br />

as the percentage of real-world problems the tool is able to<br />

identify. A problem that the tool is not able to identify is<br />

referred to as a false negative. However, the tool also needs to<br />

have high precision, which is defined as the proportion of<br />

results that are true positives. A false positive is where the tool<br />

reports a warning where no problem exists.<br />
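In terms of counts, with tp true positives, fn false negatives and fp false positives, the two measures defined above work out as follows (a small illustrative sketch):<br />

```c
/* Recall: fraction of real defects the tool finds (tp out of tp+fn).
 * Precision: fraction of emitted warnings that are real (tp out of
 * tp+fp). Both lie in [0, 1]; higher is better. */
static double recall(int tp, int fn)
{
    return (double)tp / (double)(tp + fn);
}

static double precision(int tp, int fp)
{
    return (double)tp / (double)(tp + fp);
}
```

For example, a tool that finds 80 of 100 real defects while raising 20 spurious warnings has recall 80/100 = 0.8 and precision 80/100 = 0.8: the 20 missed defects are false negatives, the 20 spurious warnings false positives.<br />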

For safety and security-critical code recall is typically more<br />

important than precision. A false negative that is lurking in a<br />

fielded product can have disastrous impacts. Still, a tool needs<br />

to have sufficient precision, or developers lose trust in it. This<br />

means that a static analysis tool cannot flag every construct that it thinks may be problematic; it needs sufficient evidence that there is a case where the construct is a true problem.<br />

Take for example the code in Figure 2. Should the tool<br />

issue a warning here or not if it is unable to trace the origin of<br />

the content of the string?<br />

Static analysis is focused on preventing programming<br />

mistakes. Buffer overruns are one type of these, but there are<br />

many others such as null-pointer dereferences, dead code,<br />

wrong type casts and the like. There are typically four<br />

categories of problems that static analysis tools can catch:<br />

1. Behavior that is undefined by the language. This is<br />

the category that buffer overruns fall in.<br />

2. API misuse. An example would be to do a send<br />

without opening a socket.<br />

3. Suspicious behavior. Dead code, for example.<br />

4. Coding standard violations.<br />

Using static analysis to catch these programming mistakes<br />

significantly improves the quality of the code in the source<br />

code repository. While many senior programmers sometimes<br />

complain that they do not need a tool to watch over their<br />

shoulder, the reality is that a) everybody makes mistakes and b)<br />

not everybody is a senior programmer. Static analysis helps<br />

everybody write better code. Static analysis does not verify<br />

functional correctness, though. That is what functional testing<br />

is supposed to address.<br />

VI. FUNCTIONAL TESTING<br />

Once code is sufficiently fleshed out, it can be tested.<br />

Testing typically happens at different levels, from unit testing,<br />

where a single function or set of functions is tested, to integration testing, where multiple components come together, to system testing, where the system is tested in its entirety.<br />

Testing mostly focuses on functional correctness, which<br />

means verifying whether an input has the desired effect. The<br />

effect could be an output, or a change in system state. Testing<br />

typically starts at the unit-test level and is driven through<br />

testing harnesses, either hand-written, or built through<br />

automation tools from vendors like VectorCast, QA Systems,<br />

VerifySoft, or the like. These tools not only make creating test<br />

harnesses easier, they also facilitate execution of the test cases<br />

on desktop, host or embedded targets and collecting and<br />

reporting of the results.<br />

2 https://resources.grammatech.com/whitepapers/protecting-against-tainteddata-in-embedded-apps-with-static-analysis<br />



The challenge with functional testing is that it can easily<br />

overlook the state corruption caused by buffer overruns. There<br />

are two reasons for this:<br />

1) Problems may only occur in corner cases;<br />

2) State corruption is not detected.<br />

The first problem can be dealt with by exhaustive testing.<br />

Functional testing needs to test not just the ‘happy path’, where<br />

all input is correct and expected and we are making sure the<br />

algorithm works. Testing also needs to try and break the<br />

algorithm by providing malformed input, or going outside of<br />

data ranges. One of the famous examples of this is the<br />

Heartbleed bug in OpenSSL, caused by a simple programming<br />

error, where malformed input could trick a server into a buffer<br />

overrun and share too much sensitive information.<br />

Testing tools help with this, as do techniques such as fuzz<br />

testing (fuzzing) 3 , which generates input values in a way that<br />

tries to steer an algorithm into corner cases. We will not delve deeper into this in this paper.<br />

The second problem is due to the fact that functional testing<br />

tools do not directly look for state corruption. They<br />

generally look for the right output that corresponds to a given<br />

input. They may not detect even the simplest buffer overrun<br />

examples presented earlier unless the overwrites cause<br />

incorrect output or abnormal termination. This is often not the<br />

case with buffer overruns.<br />

VII. CATCHING BUFFER OVERRUNS DYNAMICALLY<br />

Let’s assume that we have proper unit testing that tests the happy path, but that also tests corner cases, as is often the case for code that is security- and safety-critical. Projects that build these types of products have the focus, and are given the time and resources, to make sure that their software is exhaustively tested. Projects drive this by making sure that they have<br />

complete code coverage, meaning that they have executed (and<br />

hence tested) every statement or condition outcome in the code<br />

at least once. While this is good, this is not sufficient to prove<br />

that there are no buffer overflows in the program; for that, you<br />

would have to test all paths through the source<br />

code. 100% statement or condition coverage is no guarantee<br />

that you also have 100% path coverage.<br />
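
A small C example (invented for illustration) makes the gap concrete: the first two test cases below execute every statement of build_message, yet the one path that overflows an 8-byte buffer, header and footer combined, is never exercised by them:<br />

```c
#include <string.h>

#define N 8  /* intended size of the output buffer */

/* Two independent branches. Tests (1,0) and (0,1) together reach 100%
 * statement coverage, but only the path taking BOTH branches writes a
 * 9th byte into an N-byte buffer. */
int build_message(char *buf /* assumed to hold N bytes in production */,
                  int add_header, int add_footer) {
    int len = 0;
    if (add_header) {
        memcpy(buf + len, "HDR:", 4);
        len += 4;
    }
    memcpy(buf + len, "data", 4);
    len += 4;
    if (add_footer) {
        buf[len++] = '!';  /* 9th byte when the header is also present */
    }
    return len;  /* 9 > N on the header-plus-footer path */
}
```

The tests in the sketch use an oversized buffer so the overflowing path can be counted safely; in production, only full path coverage would reveal the 9-byte case.<br />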

Still, we have to detect when the program writes or reads<br />

outside of a buffer and corrupts the state of a program. This<br />

typically involves special treatment of memory allocations, the<br />

addition of canaries around memory areas and inspection of<br />

memory accesses into these areas. There are a number of<br />

different existing tools for this, each with their own benefits<br />

and disadvantages. These tools monitor memory accesses<br />

during execution and when they see a suspicious access they<br />

provide some amount of feedback in a log file, or on standard<br />

output. The output is generally a memory region where the<br />

problem happened and a short stack trace.<br />
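
The canary technique can be sketched in a few lines of C; this is a deliberately minimal illustration with made-up names, not how any of the tools below actually implement it:<br />

```c
#include <stdlib.h>
#include <string.h>

/* Surround each allocation with known guard bytes and check them later. */

#define CANARY 0xAAu
#define GUARD  8   /* guard bytes on each side */

unsigned char *guarded_alloc(size_t n) {
    unsigned char *raw = malloc(n + 2 * GUARD);
    if (!raw) return NULL;
    memset(raw, CANARY, GUARD);              /* front canary */
    memset(raw + GUARD + n, CANARY, GUARD);  /* rear canary */
    return raw + GUARD;                      /* pointer handed to the program */
}

/* Returns 1 if both canaries are intact, 0 if the buffer was overrun. */
int guarded_check(const unsigned char *p, size_t n) {
    const unsigned char *raw = p - GUARD;
    for (size_t i = 0; i < GUARD; i++) {
        if (raw[i] != CANARY || raw[GUARD + n + i] != CANARY)
            return 0;
    }
    return 1;
}

void guarded_free(unsigned char *p) { free(p - GUARD); }
```

Writing one byte past a 4-byte buffer destroys the rear canary, so guarded_check reports the overrun even though the program neither crashed nor produced wrong output.<br />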

Valgrind 4 is an instrumentation framework for dynamic<br />

analysis tools. It is popular and extremely flexible and people<br />

3 https://en.wikipedia.org/wiki/Fuzzing<br />

4 http://valgrind.org/<br />

have built a number of different tools on top of it, including a<br />

memory error detector, which would suit our needs. However,<br />

the execution-time overhead that Valgrind requires is<br />

significant, which makes it not always feasible to use.<br />

AddressSanitizer (ASan) 5 is another popular solution that is<br />

faster than Valgrind and available with Clang and GCC<br />

compilers.<br />

Both Valgrind and ASan are considered debugging tools:<br />

tools that developers use when they hit problems and try to<br />

figure out how to resolve them. Both report on memory<br />

problems by giving addresses that can then be resolved through<br />

the debugger to point to a location in the source code.<br />

GrammaTech has also recently announced a product to<br />

detect these state corruptions, CodeSonar/X 6 , an addition to<br />

GrammaTech’s static analysis tool CodeSonar. This solution<br />

differs from Valgrind and ASan in that it is more performant<br />

in both the time and space dimensions, can be used during<br />

the development cycle, and can be left in deployed systems<br />

as well. It supports different<br />

operating systems (including embedded operating systems like<br />

VxWorks) and can be made to support additional compilers.<br />

The technology behind GrammaTech’s CodeSonar/X is<br />

derived from its participation in DARPA’s Cyber Grand<br />

Challenge 7 .<br />

VIII. PUTTING IT ALL TOGETHER<br />

So far this paper has argued that both static analysis and<br />

dynamic analysis are required. Combining static and<br />

dynamic analysis is the next step to assist projects to build<br />

better software faster.<br />

Static analysis can be done early in the development<br />

lifecycle and it catches programming mistakes early in the<br />

process and improves the source code that ends up in the<br />

source control repository.<br />

Functional testing is always required and should exercise<br />

the code as much as possible, touching as many code paths as<br />

is realistic, as early in the software development lifecycle as<br />

possible. Different projects will have different requirements for<br />

the depth of analysis here: an internet-connected fridge will<br />

spend less time on low level functional testing compared to the<br />

auto-pilot function of an airplane, or algorithms for self-driving<br />

cars.<br />

Dynamic state corruption detection is a great asset and can<br />

be integrated into all layers of the testing cycle. Combined with<br />

proper test coverage, this provides an additional layer of fault<br />

detection.<br />

Detecting the problems is one part of the puzzle; the second<br />

part is to help developers understand the problems. To do this,<br />

GrammaTech CodeSonar can combine the output of state<br />

corruption tools (Valgrind, ASan and of course CodeSonar/X)<br />

with its static analysis results.<br />

5 https://github.com/google/sanitizers/wiki/AddressSanitizer<br />

6 https://www.grammatech.com/products/codesonar<br />

7 https://www.darpa.mil/program/cyber-grand-challenge<br />

www.embedded-world.eu<br />



Any state corruptions are reported in the static analysis<br />

tool’s user interface and combined with the static analysis<br />

warnings. This delivers two main benefits:<br />

• Confirmation of existing warnings<br />

• Detection of false negatives<br />

The confirmation of existing warnings as true positives<br />

happens when a dynamically found warning appears on the<br />

same line as a similar warning that was found statically. This is<br />

an immediate sign to the software engineer that the problem is<br />

a serious one and should be high priority to fix.<br />

The detection of false negatives happens when a state<br />

corruption occurs where static analysis had not previously<br />

reported a problem. This will also result in a high priority<br />

warning report and provides not just filename and line-number,<br />

but reports on the execution trace as well.<br />

IX. EXAMPLES<br />

A couple of examples demonstrate combining static and<br />

dynamic analysis. Figure 4 shows a traditional static buffer<br />

overflow warning on line 30. Intermixed in the output are two<br />

dynamically detected warnings on lines 21 and 30. The ‘Invalid<br />

Write’ warnings were detected by CodeSonar/X during runtime.<br />

The warning on line 30 shows the power of combining<br />

static and dynamic analysis. The static warning would have<br />

been found first in the software development lifecycle. Once<br />

the developer checks in the code, this warning would have<br />

been flagged immediately, even before the code is executed.<br />

With the dynamic tests, though, there is now proof that this<br />

problem has been hit during testing, which should increase its<br />

priority.<br />

The warning on line 21 shows that dynamic tests can find<br />

things that were missed statically.<br />

Figure 4 -- Static and dynamic warnings intermixed<br />

As a second example, we can combine static analysis on<br />

source code with the dynamic results from Valgrind and get<br />

something similar, see Figure 5. In this case not a buffer<br />

overrun, but an ‘abort()’ call hit during execution.<br />

Figure 5 -- Crash observed through Valgrind, reported in CodeSonar<br />

X. SUMMARY<br />

Buffer overruns can lead to exploitable vulnerabilities, and<br />

these can be costly: cyber vulnerabilities cost a company<br />

approximately $15.4 million per instance according to Forbes 9 .<br />

Any reasonable effort that we can make to reduce the number<br />

of buffer overruns that make it into fielded products seems<br />

justified.<br />

Static analysis is not new, functional testing is not new,<br />

state corruption detection is not new, but the combination of<br />

the three together provides exciting new capabilities to the<br />

software development teams. Applying these three<br />

technologies requires proper investing in testing infrastructure<br />

and investment in proper test cases and is in no way free.<br />

However, combining these technologies promises to find<br />

difficult-to-find problems earlier and hence reduce the number<br />

of fielded buffer overruns, which handsomely justifies the<br />

investment.<br />

9 https://www.forbes.com/sites/moneybuilder/2015/10/17/an-average-cybercrime-costs-a-u-s-company-15-4-million/#2bdf663032cb<br />



X-Ray Your Software Supply Chain<br />

Creating Automated Security Gates<br />

Ralf Huuck<br />

Software Integrity Group<br />

Synopsys<br />

Sydney, Australia<br />

ralf.huuck@synopsys.com<br />

Abstract— Software security has become a key challenge for<br />

embedded systems. This is particularly true for connected<br />

products such as those that can be found in the IoT space or the<br />

autonomous driving market. One of the big unknowns is third-party<br />

and open source software. In this work we present the results<br />

of the analysis of over 120,000 software artifacts. For each we<br />

identified the open source components and compared them with<br />

the known software vulnerabilities. The results are striking.<br />

Moreover, we advise on how to integrate such a security scanning<br />

activity into the software development lifecycle (SDLC) and how to manage the supplier<br />

relationship.<br />

Keywords—software composition; security; automated security<br />

gates; CVE; CVSS; open source security study<br />

I. INTRODUCTION<br />

As seen with the IoT-based MIRAI botnet, security<br />

vulnerabilities can have their root cause several layers down in<br />

the supply chain. This is particularly threatening for complex<br />

and deep supply chains as prevalent in domains such as<br />

automotive and industrial control systems.<br />

In this work, we present our results from security scanning<br />

over 120,000 embedded software packages across a wide<br />

range of application domains. We automatically decomposed<br />

each software package into its components and cross-matched<br />

each component with its known security vulnerabilities as<br />

recorded in the National Vulnerability Database (NVD). We<br />

explain the purpose of the NVD, how to use it and how to<br />

make sense of the recorded Common Vulnerability Exposure<br />

(CVE) entries.<br />

We detail our findings by listing the components that are<br />

most commonly used and those that most commonly have a<br />

vulnerability, as well as their age and the likelihood of existing<br />

patches that would remedy the situation. Moreover, we give an<br />

overview of the most critical vulnerabilities and the<br />

prevalence of “celebrity” bugs still active in embedded<br />

software.<br />

To remedy the situation, we explain what an automated,<br />

trustworthy supply chain process could look like, built<br />

on various scanning and security gates across embedded<br />

suppliers, integrators and vendors. In particular, we take into<br />

account that many embedded vendors are not security experts.<br />

II. THE STATE OF OPEN SOURCE COMPONENTS<br />

A. Background<br />

Synopsys regularly publishes research into vulnerabilities<br />

in open source components. It summarizes the results of<br />

software uploaded to the Protecode platform [2]. Protecode is<br />

an analysis software that examines binary files for existing<br />

open source components, determines the version numbers of<br />

the open source components, and compares these components<br />

with existing databases of vulnerabilities. The primary<br />

database of vulnerabilities is the National Vulnerability<br />

Database (NVD) as maintained by NIST [3]. This database<br />

contains some 90,000 entries that document known<br />

vulnerabilities (CVEs), their causes and their implications.<br />

Moreover, each CVE is assigned a vulnerability score in the<br />

Common Vulnerability Scoring System (CVSS) [4]. The<br />

CVSS score ranges between 0 and 10; the higher the score, the<br />

more critical the vulnerability.<br />
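
For orientation, the qualitative bands can be sketched as a small C helper; the cut-offs below follow the CVSS v3.0 rating scale (None 0.0, Low 0.1–3.9, Medium 4.0–6.9, High 7.0–8.9, Critical 9.0–10.0), and other CVSS versions band scores differently:<br />

```c
/* Maps a CVSS score to its qualitative severity band (CVSS v3.0 scale). */
typedef enum { SEV_NONE, SEV_LOW, SEV_MEDIUM, SEV_HIGH, SEV_CRITICAL } severity_t;

severity_t cvss_severity(double score) {
    if (score <= 0.0) return SEV_NONE;
    if (score <= 3.9) return SEV_LOW;
    if (score <= 6.9) return SEV_MEDIUM;
    if (score <= 8.9) return SEV_HIGH;
    return SEV_CRITICAL;
}
```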

B. Open source component security study 2016/2017<br />

The 2016/2017 Software Composition Study included about<br />

130,000 uploads to the Protecode platform. Over 16,000<br />

different components and versions were automatically<br />

identified. Figure 1 shows a breakdown of the most frequently<br />

identified components by task and application area. About two-thirds<br />

of all components are utilities for Windows and Linux<br />

tools, network protocols such as SSL and HTTP, and media<br />

libraries for jpg, png or XML.<br />

While this is not unexpected, it is interesting to note that<br />

common utilities are implicitly trusted. In fact, such basic<br />

utilities are often not even considered 3rd-party software that<br />

would be subjected to a rigorous security analysis. At the same<br />

time these utility functions are part of standard deployments to<br />

establish network connections, parse data formats and read from<br />

files or databases. Any weakness in these components can easily<br />

be imagined to have larger security implications.<br />



In addition, the security vulnerabilities found are<br />

generally not new, as Figure 2 shows: About 50% of all<br />

vulnerabilities are four years old or older. In most cases, there is<br />

a newer and safer version for the component in question<br />

available. It is, however, not used. It is worth noting that<br />

security vulnerabilities are typically discovered over time. This<br />

means a component that could be considered perfectly fine<br />

today might not be secure tomorrow as new research and<br />

insights are obtained. This is particularly difficult for<br />

manufacturers to control and correct.<br />

A typical example of outdated components is the Heartbleed<br />

vulnerability. First widely publicized in 2014, it gained a lot of<br />

press as an SSL vulnerability affecting a large number of the<br />

world’s web servers. In our study about 3 years later we find that<br />

Heartbleed is still in the top 50% of all found CVEs. This means<br />

it is still widely prevalent. Other celebrity bugs such as<br />

Stagefright or Ghost, however, only occur sporadically and can<br />

be assumed to be generally addressed.<br />

Fig. 1. Overview of Top 20 components detected.<br />

In our study we were able to identify around 9,000 security<br />

vulnerabilities with corresponding CVEs in the overall 16,000<br />

different components. This means a large number of the<br />

components identified cannot be considered secure. We note,<br />

however, that having a security vulnerability is not the same as<br />

being exploitable. It only means there exists an exploit<br />

possibility for that component under the right circumstances.<br />

Whether these circumstances are present cannot necessarily be<br />

verified automatically and was not part of this study.<br />

Fig. 2. Age of CVEs by initial detection year.<br />

For the software supply chain these findings have serious<br />

implications: It cannot be assumed that third-party<br />

software is generally secure. In fact, the opposite is more likely.<br />

Furthermore, given the delay between the introduction of a<br />

security vulnerability and its detection through researchers there<br />

is a likelihood that even in the best case secure products that are<br />

out today might need some updates in the future. As a result, a<br />

strong and lasting supplier-manufacturer relationship is<br />

advisable.<br />

In the following, we make some suggestions on how to<br />

structure an automated security scanning process.<br />

III. AUTOMATION & INTEGRATION INTO THE SDLC<br />

It is unrealistic to dispense with components from open<br />

source and third-party providers. These are useful<br />

ingredients to deliver products less costly and relatively quickly.<br />

Moreover, it is by no means proven that open source<br />

components are worse than off-the-shelf software. In fact, the<br />

reverse is often the case. Our study shows, however, that the<br />

security of third-party components needs to be vetted.<br />

The vetting of third-party software used to be a complex and<br />

specialized domain. However, new software solutions such<br />

as Protecode, Sonatype, or Black Duck are available in the<br />

marketplace to perform these evaluations automatically [5].<br />

Moreover, these software solutions can be integrated<br />

automatically into the development process. This means that, for<br />

example, a DevOps Jenkins process can be started that runs<br />

an automated analysis with every product build, discovering<br />

open source components and comparing them with<br />

their known security vulnerabilities. These results can then be<br />

made available promptly to the software development teams and<br />

quality teams.<br />

In order to achieve this automated vetting pipeline, the right<br />

understanding and processes must be in place. This means the<br />

organization must be set up to have this vetting process as part<br />

of their release or development plan. There needs to be an owner<br />

to define the acceptable policies and actions that need to be taken<br />

should a component fail the security qualification. Moreover, it<br />

is advisable to integrate this vetting process early and<br />



continuously into the SDLC, as a once-off check just before<br />

release often does not leave time to apply appropriate fixes.<br />
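
The gate logic at the heart of such a pipeline can be sketched in C; the component names, scores, and threshold below are purely illustrative, and a real gate would consume the scanner's report rather than a hard-coded list:<br />

```c
/* Automated security gate: the build fails when any identified component
 * carries a known CVE whose CVSS score reaches the policy threshold. */
typedef struct {
    const char *name;
    double max_cvss;  /* highest CVSS score among the component's known CVEs */
} component_t;

/* Returns 0 when the gate passes, otherwise the number of violating components. */
int security_gate(const component_t *c, int n, double threshold) {
    int violations = 0;
    for (int i = 0; i < n; i++)
        if (c[i].max_cvss >= threshold)
            violations++;
    return violations;
}
```

The policy owner mentioned above would set the threshold and decide what happens on failure, for example breaking the Jenkins build or opening a ticket.<br />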

IV. SUMMARY<br />

In this work we presented our insights of scanning software<br />

products for known third-party components and their<br />

vulnerabilities. We showed that a large number of security<br />

vulnerabilities can be identified in common products.<br />

Moreover, we showed that software components are often<br />

used that are outdated and for which newer and patched<br />

versions exist.<br />

We indicated what an automated SDLC approach to vetting<br />

software components against known vulnerabilities might look like.<br />

Finally, we believe it is advisable to communicate this security<br />

vetting process with the suppliers to increase awareness on the<br />

supplier side as well, establish contractual terms for mitigation<br />

and patches, and encourage the suppliers to proactively initiate<br />

their own scanning to avoid passing down low-security<br />

components. As a result, higher quality and more secure<br />

software can be produced without much overhead, enabling<br />

any market player to stand out as a premium vendor.<br />

REFERENCES<br />

[1] Constantinos Kolias, Georgios Kambourakis, Angelos Stavrou and<br />

Jeffrey Voas. DDoS in the IoT: Mirai and Other Botnets. IEEE Computer,<br />

Volume 50/7, 2017.<br />

[2] Synopsys Software Integrity Group. The State of Software Composition<br />

2017. https://www.synopsys.com/software-integrity/resources/analystreports/state-of-software-composition-2017.html<br />

[3] Harold Booth, Doug Rike, Gregory A. Witte. The National Vulnerability<br />

Database (NVD): Overview. ITL Bulletin, December 2013.<br />

[4] Peter Mell, Karen Scarfone, and Sasha Romanosky. 2006. Common<br />

Vulnerability Scoring System. IEEE Security and Privacy 4, 6<br />

(November 2006), 85-89.<br />

[5] Millar S. Vulnerability Detection in Open Source Software: The Cure<br />

and the Cause. Queen's University Belfast, 2017.<br />



My Processor is Inside of an FPGA<br />

What Do I Do Now?<br />

Glenn Steiner<br />

Xilinx, Inc.<br />

San Jose, CA USA<br />

Abstract—With the drive to increase integration, reduce<br />

system costs, accelerate performance, and enhance reliability,<br />

software developers are discovering the processor they are<br />

targeting may be embedded inside of an FPGA. This paper will<br />

help you, the system architect or software developer, understand<br />

how you can architect and develop software, and even accelerate<br />

code via FPGA accelerators.<br />

Keywords— FPGA, Programmable Logic, SoC, System on a<br />

Chip, Extensible Processing Platforms, Multicore, Reconfigurable<br />

Architectures, Reconfigurable Systems, Programmable Systems<br />

I. INTRODUCTION<br />

As a software developer, you may have just been told that<br />

your next software project will be targeting a processor inside<br />

of an FPGA. How will this impact your development process<br />

and what benefits might you gain with this tight integration<br />

of processor and FPGA? Starting from the basics of what<br />

FPGAs are (in terms of software programming), this paper<br />

provides a simple-to-understand primer of what modern<br />

FPGAs with embedded processors can do. Next we will<br />

describe how one develops and debugs embedded processor<br />

applications. Finally, we will wrap up with examples of how<br />

high level synthesis tools can move software to<br />

programmable logic hardware enabling dramatic software<br />

acceleration.<br />

As product designs increase in complexity, there is a need<br />

to use integrated components, such as Application Specific<br />

Standard Products (ASSPs), to address design requirements.<br />

Years ago, engineers chose individual components for<br />

processor, memory, and peripherals, and then pieced these<br />

elements together with discrete logic. More recently,<br />

engineers search through catalogs of ASSP processing<br />

systems attempting to find the nearest match to meet system<br />

requirements. When additional logic or peripherals are<br />

required, an FPGA is frequently mated with an ASSP to<br />

complete the solution. Over the last few years, FPGA sizes<br />

have increased, providing sufficient space to accommodate<br />

complete processor and logic systems within a single device.<br />

Software engineers are now faced with developing and<br />

debugging code targeting a processor inside of an FPGA and<br />

in some cases fear doing so. In this paper we will describe<br />

FPGAs and the process of creating and debugging code for<br />

FPGA embedded processors.<br />

II. WHAT IS AN FPGA?<br />

A Field Programmable Gate Array (FPGA) is an integrated<br />

circuit containing logic that may be configured and connected<br />

after manufacturing or “in the field”. Where in the past<br />

engineers purchased a variety of logic devices and then<br />

assembled them into a system design via connections on a<br />

printed circuit board, today hardware designers can implement<br />

complete system designs within a single device. In their<br />

simplest form FPGAs contain:<br />

• Configurable Logic Blocks<br />

• AND, OR, Invert & many other logic functions<br />

• Configurable interconnect enabling Logic Blocks to be<br />

connected together<br />

• I/O Interfaces<br />

With these elements an arbitrary logic design may be created.<br />

Note: With the transition of embedded processors integrated<br />

with FPGAs, and the concept of both programmable processors<br />

and programmable FPGAs, the idea that FPGAs are now<br />

Programmable Logic aligns with the terminology of<br />

Programmable Processors. Thus, in this paper we will use<br />

Programmable Logic to describe the FPGA logic inside of an<br />

All Programmable device.<br />

Hardware engineers usually write code in HDL (typically<br />

either Verilog or VHDL) and then “compile” the design into an<br />

“object file” which is loaded into the device for execution. On<br />

the surface the HDL programs can look very much like High<br />

Level Languages such as C.<br />

The following is an implementation of an 8 bit counter<br />

written in Verilog courtesy of www.asic-world.com. One can<br />

see many constructs taken from today’s high level languages:<br />

© Copyright 2017 Xilinx<br />



//----------------------------<br />

// Design Name : up_counter<br />

// File Name : up_counter.v<br />

// Function : Up counter<br />

// Coder : Deepak<br />

//----------------------------<br />

module up_counter (<br />

out ,// Output of the counter<br />

enable ,// enable for counter<br />

clk ,// clock Input<br />

reset // reset Input<br />

);<br />

//---Output Ports--------------<br />

output [7:0] out;<br />

//----Input Ports--------------<br />

input enable, clk, reset;<br />

//---Internal Variables--------<br />

reg [7:0] out;<br />

//------Code Starts Here-------<br />

always @(posedge clk)<br />

if (reset) begin<br />

out <= 8'b0;<br />

end else if (enable) begin<br />

out <= out + 1;<br />

end<br />

endmodule<br />


IP power-gating options. The fourth power domain is the<br />

programmable logic (PL).<br />

o 2 tightly coupled memories (TCM): Connected to the Cortex-R5s, each with 4 individually power-gated banks<br />

o On-Chip Memory (OCM): 4 individually power-gated banks<br />

o 2 USBs: Each individually power-gated<br />

Figure 1: Zynq UltraScale+ MPSoC Power Domains<br />

1) Battery Power Domain<br />

The battery power domain, which can be powered by an<br />

external battery, contains battery-backed RAM (BBRAM) for<br />

an encryption key, and a real-time clock with external crystal<br />

oscillator to maintain time even when the device is off.<br />

2) Full-Power Domain<br />

The full-power domain consists of the Application Processor<br />

Unit, with the ARM® Cortex-A53 processors, the Graphics<br />

Processing Unit, the DDR memory controller, and the high-performance<br />

peripherals including PCI Express®, USB 3.0,<br />

DisplayPort, and SATA.<br />

3) Low-Power Domain<br />

The low-power domain consists of a Real-time Processor<br />

Unit (RPU) with the ARM Cortex-R5 processors, static On-Chip<br />

Memory (OCM), the Platform Management Unit (PMU), the<br />

Configuration and Security Unit (CSU), and the low-speed<br />

peripherals.<br />

4) Programmable Logic<br />

The Programmable Logic power domain consists of logic<br />

cells, block RAMs, DSP blocks, XADC, I/Os, and high-speed<br />

serial interfaces. Some devices include the video codec, PCIe<br />

Gen-4, UltraRAM, CMAC, and Interlaken.<br />

C. Power Islands for Fine-Grain Power Management<br />

Within the full- and low-power domains, there are multiple<br />

power islands. Each island is capable of being power-gated<br />

locally within the device. The following islands can be power-gated:<br />

• Full-Power Domain<br />

o 4 ARM Cortex-A53 application processors: Each can be individually power-gated<br />

o L2 cache servicing the Cortex-A53 processors<br />

o 2 pixel processors in the Graphics Processing Unit: Each can be individually power-gated<br />

• Low-Power Domain<br />

o 2 ARM Cortex-R5 processors: Power-gated as a pair<br />

VI. HOW THE HARDWARE ENGINEER IMPLEMENTS A PROCESSING SYSTEM DESIGN<br />

Tools allow the rapid assembly of processor systems via<br />

wizards. Using drop-down lists or check boxes, one simply<br />

specifies the targeted part, the desired processor, and<br />

peripherals. The processor and data processing systems in the<br />

device can be connected by graphically connecting bus<br />

interfaces.<br />

VII. HOW THE SOFTWARE ENGINEER CREATES AND DEBUGS<br />

CODE<br />

The software development process follows the following steps:<br />

1. Create a software development workspace and import<br />

the hardware platform.<br />

2. Create the software project and Board Support<br />

Package<br />

3. Create the software<br />

4. Run and debug the software project<br />

5. Optional: Profile the software project<br />

Steps 3, 4 and 5 are familiar to most developers. Steps 1 and<br />

2 may be new to some developers but are straightforward. We<br />

will use the Eclipse development environment as an example for<br />

the above steps.<br />

1. Creating a Software Development Workspace and<br />

Importing the Hardware Platform:<br />

After starting Eclipse the user is prompted for a workspace<br />

to use. A workspace is simply a directory path where project<br />

files are to be stored. Next the user specifies the hardware<br />

platform (design). This file is automatically generated by the<br />

hardware development tools and describes the processor system<br />

including memory interfaces and peripherals including memory<br />

maps. The file is output from the hardware development tools<br />

and the hardware engineer will typically supply this file to the<br />

software developer. Once specified, the hardware platform is<br />

imported and this step is complete.<br />

2. Creating the Software Project and Board Support<br />

Package (BSP)<br />

The Board Support Package (BSP) contains the libraries and<br />

drivers that software applications can utilize when using the<br />

provided Application Program Interfaces (APIs). A software<br />

project is the software application source and settings.<br />

For Xilinx C projects, Eclipse automatically creates<br />

Makefiles that will compile the source files into object files and<br />

link the object files into an executable.<br />



Next the system generates the BSP and automatically loads<br />

the applicable drivers based upon the defined hardware platform<br />

and operating system. These drivers are then compiled.<br />

3. Creating the Software<br />

At this point one may either import a software example or<br />

create code from scratch. As one saves code Eclipse<br />

automatically compiles and links the code reporting out any<br />

compiler or linker errors.<br />

4. Running and debugging the software project<br />

With FPGAs there is one step that must be completed prior<br />

to executing code; the FPGA must be programmed. In Eclipse<br />

the user simply selects Tools > Program FPGA. This step takes<br />

the hardware design created by the hardware engineer and<br />

downloads it to the FPGA. Once completed the user may select<br />

the type of software to be built:<br />

Debug – Turns off code optimization and inserts<br />

debugging symbols<br />

Release – Turns on code optimization<br />

Note: For profiling one uses the –pg compile option.<br />

Finally the user may run the code by selecting Run and<br />

defining the type of run configuration and compiler options. If<br />

Release has been selected the processor will immediately begin<br />

code execution. Otherwise, the processor will execute a few<br />

boot instructions and will stop at the 1st line of source code and<br />

the Debug perspective will appear in Eclipse.<br />

From the Debug perspective the user may view the source or<br />

object code, registers, memory and variables. They may single<br />

step code at either the source or object level and may set<br />

breakpoints for code execution.<br />

5. Profile the software project<br />

Should the user desire they may profile code and view the<br />

number of function calls as well as see the percentage of time<br />

spent in any given function.<br />

VIII. SOFTWARE ACCELERATION VIA PROGRAMMABLE LOGIC<br />

With All Programmable Devices, one has the unique<br />

opportunity of turning software code into hardware accelerators.<br />

In the past one had to do such via tedious manual steps of<br />

creating a hardware engine that performed the desired software<br />

function; attach DMA engines to move data between the<br />

accelerator and memory; and create software interfaces between<br />

the replaced function(s) and the hardware accelerators and<br />

associated memory. Today there are modern C to HDL tools<br />

such as the Xilinx SDSoC environment that automate this<br />

process. When developing accelerators with such a tool the user<br />

performs the following steps:<br />

a. Profile and identify time critical functions<br />

b. Use the C to HDL tool to automatically create:<br />

i. The hardware representation and HDL code<br />

of the function to be accelerated<br />

ii. Attached DMA engines to move data to and<br />

from the accelerator<br />

iii. Replacement hardware functions for the<br />

original software functions<br />

c. Tune the design using provided performance data<br />

including logic utilization, estimated clock cycles and<br />

latency<br />

Design tuning allows the user to optimize the design for<br />

performance via increased pipelining, which allows more<br />

computations to be done in parallel per clock cycle, or via<br />

parallelizing computations by having multiple computation<br />

pipes running at the same time.<br />

Dramatic acceleration of software functions can be obtained<br />

using this methodology. A few examples include:<br />

Algorithm – Hardware Acceleration vs. Software<br />

MRI Back Projection Algorithm – 8x<br />

16k Fast Fourier Transform (FFT) – 10x<br />

Optical Flow – 25x<br />

Stereo Local Block Matching – 25x<br />

2D Video Optical Filter – 30x<br />

Binary Neural Network – 9,000x<br />

IX. CONCLUSION<br />

For cost, power, size, and overall system efficiency, embedded<br />

processors with programmable logic are becoming primary<br />

design choices. Software engineers do not need to consider an<br />

FPGA embedded processor as a mystery or any more difficult to<br />

program than an external processor. Industry standard<br />

development environments such as Eclipse are now being<br />

provided by FPGA vendors at competitive costs and are<br />

customized for FPGA embedded processing. Within these<br />

environments users can create, compile, link and download<br />

code, and as necessary debug their designs in the same manner<br />

as they have done in the past with external processors. FPGA<br />

embedded processors have extensive IP libraries, drivers and OS<br />

support. Finally, modern C to HDL tools enable software<br />

engineers to automatically build hardware accelerators for<br />

software functions yielding orders of magnitude improvement in<br />

software performance.<br />

REFERENCES<br />

[1] Xilinx, Inc., “Zynq UltraScale+ Device Technical Reference Manual,”<br />

December, 2017.<br />

[2] Xilinx, Inc., “Zynq UltraScale+ MPSoC Software Developer Guide,”<br />

November, 2017.<br />

[3] Xilinx, Inc., “Xilinx Software Development Kit (SDK),”<br />

[4] Xilinx, Inc., “SDSoC Environment User Guide,” December, 2017.<br />


© Copyright 2017 Xilinx<br />



Yocto Project Linux as a Platform for<br />

Embedded Systems Design<br />

Alex González García<br />

Software Engineering Manager<br />

Digi International Inc.<br />

Logroño, Spain<br />

Abstract—Given the wide variety and individuality of<br />

embedded devices, choosing an operating system is not simple.<br />

This paper examines the process of selecting an embedded device<br />

operating system by highlighting the most important decision<br />

factors and weighing the available options. It discusses the benefits<br />

of using the Yocto Project to build a custom Linux-based<br />

embedded operating system. Keywords—yocto; debian;<br />

embedded; buildroot; operating system<br />

I. INTRODUCTION<br />

The choice of an operating system (OS) is one of the most<br />

critical decisions in embedded product design. Embedded<br />

systems, as opposed to general-purpose computers, are not a<br />

homogeneous group of devices and cannot be treated as a single<br />

entity. Every embedded device is unique. For instance, a single<br />

widely used architecture such as ARM encompasses embedded<br />

devices ranging from 32-bit microcontrollers to 64-bit multicore<br />

CPUs.<br />

While they range in features and complexity, embedded<br />

devices also share common OS considerations:<br />

• Power consumption<br />

• Security, particularly with always-connected devices and the internet of things<br />

• Quick start-up time<br />

• Networking stacks<br />

• Some amount of real-time (RT) determinism and low latency<br />

• User interface, possibly graphical<br />

The choice of an embedded OS is also influenced by cost and<br />

time to market, both of which are directly proportional to system<br />

complexity.<br />

¹ https://www.freertos.org/<br />

² https://www.mbed.com/<br />

II. MICROCONTROLLER- OR MICROPROCESSOR-BASED SYSTEMS<br />

Embedded devices exist on a spectrum of complexity. At the<br />

low end, which generally equates to lower cost and faster time<br />

to market, are embedded devices with a microcontroller<br />

(MCU). These devices typically lack memory management<br />

(MMU-less). Embedded developers working with MCUs are<br />

intimate with the hardware and microprocessor architecture and<br />

make extensive use of JTAG debuggers. The application is<br />

usually bundled with the OS on a single flat memory model.<br />

MCUs usually run in-house developed, bare-metal OSs. There<br />

has also been a recent uptick in the use of open source<br />

alternatives such as FreeRTOS¹, mbed², or Zephyr³.<br />

Microprocessor (CPU)-based systems are higher up the<br />

complexity spectrum. For example, in an ARM architecture,<br />

systems-on-chip (SoCs) extend the available MCU interfaces with<br />

more complex blocks like HDMI and USB controllers. They may also<br />

provide graphical, video, or cryptographic acceleration.<br />

III. REAL TIME OR GENERAL PURPOSE OPERATING SYSTEM<br />

The most important consideration when choosing an embedded<br />

OS is determining the optimal amount of real-time capabilities.<br />

Determinism and low latency requirements will prescribe either<br />

a real time OS (RTOS) or a general purpose OS (GPOS). An<br />

RTOS is the most complex; hence, cost and time to market are<br />

also the highest. Software development for RTOS and GPOS<br />

also requires substantially different skill sets.<br />

Hybrid approaches that use the real-time responsiveness of<br />

MCUs with the high processing and graphical capabilities of a<br />

GPOS on a CPU are also possible.<br />

On the GPOS side, embedded Linux, in its multiple facets, has<br />

become the standard choice. This is also true for soft RT<br />

³ https://www.zephyrproject.org/<br />



capabilities (hard RT Linux is possible but significantly<br />

increases the complexity). In a 2017 embedded market survey<br />

[1], 62% of projects were using embedded Linux in one form<br />

or another, and 82% were thinking of using embedded Linux in<br />

2018.<br />

IV. EMBEDDED LINUX<br />

The choice of embedded Linux already implies a jump in<br />

software complexity, and the software skills needed on<br />

embedded Linux teams are broader than those of traditional<br />

embedded developers. Disregarding the steep learning curve of<br />

embedded Linux for an RTOS embedded developer is a common<br />

mistake that can significantly increase the time to market of<br />

embedded Linux projects.<br />

As system complexity increases, the embedded developer role<br />

changes. Embedded Linux teams are bigger and more<br />

heterogeneous than traditional embedded teams and typically<br />

consist of three distinct roles:<br />

• BSP developer<br />

• Application developer<br />

• System developer<br />

BSP developers work on bootloaders and the Linux kernel. The<br />

bootloader developer is very close to the traditional embedded<br />

developer—intimate with the hardware and running on a flat<br />

memory system. However, the development work to be done in<br />

a bootloader is limited to hardware bring-up and the execution<br />

of the Linux kernel. Embedded Linux kernel developers, on the<br />

other hand, have more in common with desktop PC kernel<br />

developers than with traditional embedded developers. JTAG<br />

devices are of little use beyond bring-up, and the CPU board<br />

support package (BSP) is usually provided by the manufacturer.<br />

Even drivers are usually provided either by the community or<br />

the device manufacturer, so an embedded Linux kernel<br />

developer does device tree customization and maybe some<br />

driver debugging and development.<br />

Application developers work at a high level, abstracted by the<br />

Linux kernel. Application development work is very similar to<br />

desktop application development; in many cases, applications<br />

can initially be developed on a PC before cross-compiling them<br />

for the embedded hardware. User interfaces typically use the Qt<br />

framework⁴ or are actually web-based interfaces.<br />

Programming languages are no longer limited to C and C++.<br />

Applications are now developed in Python, Node.js, and even<br />

Java.<br />

An embedded Linux application is not a traditional monolithic<br />

entity. Rather, it usually comprises a collection of<br />

applications that communicate and cooperate. Typical<br />

embedded Linux application services include:<br />

• Process monitoring and watchdog services, which are in charge of monitoring the rest of the system and restarting other applications or the whole system when a failure occurs<br />

• Messaging services like D-Bus or similar<br />

• Network managers, including Wi-Fi, cellular, and other technologies<br />

• Configuration managers, which translate user interface commands into configuration changes<br />

• User interfaces, and possibly CLIs or web interfaces<br />

• Logging services<br />

• A main application<br />

All these services collaborate to provide the experience of a<br />

single embedded application.<br />
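The process-monitor role described above can be sketched in a few lines of shell; a real product would use systemd, runit, or a dedicated watchdog daemon, and the `supervise` helper and its retry limit here are purely illustrative:

```shell
# Toy process monitor: rerun a command whenever it exits, up to a retry
# limit, as an init system or watchdog service would for a failed app.
supervise() {
    cmd=$1
    max=$2
    n=0
    while [ "$n" -lt "$max" ]; do
        $cmd                          # run the supervised application
        echo "app exited, restart #$n"
        n=$((n + 1))
    done
}

supervise false 3                     # a failing app is restarted 3 times
```

In a real system the monitor would also escalate (for example, reboot the whole device) once the per-application retry budget is exhausted.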

System developers are the people in charge of system<br />

integration and the build system, including root filesystem<br />

customization and software development kit (SDK) generation.<br />

While a traditional embedded application would be bundled<br />

with the OS, Linux requires a user space or root filesystem<br />

which contains the runtime set of applications and libraries.<br />

Building this root filesystem has always been complex and is<br />

one of the reasons why embedded Linux adoption has been<br />

slow.<br />

The SDK is used as an interface between the development roles,<br />

allowing teams to scale in size and specialize to a certain<br />

degree. For example, application and BSP developers use and<br />

update an SDK but do not usually need to be concerned with<br />

root filesystem customization.<br />

A. Choosing an embedded Linux distribution<br />

A Linux distribution is an operating system based on the Linux<br />

kernel and GNU⁵ software, most importantly the GNU<br />

toolchains, libraries, and development tools.<br />

A distribution sets the policies for the system and includes<br />

components such as:<br />

• The selection of supported packages<br />

• The initialization system to use<br />

• The graphical backend<br />

• System-wide choices, like the Bluetooth stack<br />

• Graphical environments<br />

An embedded Linux distribution provides:<br />

• The bootloader<br />

• The Linux kernel<br />

• The user space or root filesystem<br />

• A software development kit<br />

⁴ https://www.qt.io/<br />

⁵ https://www.gnu.org/<br />



A distribution is generated in one of two ways:<br />

1. Customize an existing binary Linux distribution such<br />

as Debian⁶<br />

2. Build a Linux distribution from source<br />

1) Binary Linux distributions<br />

Binary distributions contain pre-built binary packages that are<br />

added (downloaded from the cloud) or removed from a system<br />

using a package manager. Systems are usually bootstrapped on<br />

target, and on-target compilation is easy and common.<br />

Binary-based distributions have the lowest complexity and<br />

quickest time to market if the hardware is already supported by<br />

the distribution, so they initially appear to be an easy solution.<br />

However, they have several drawbacks that make them<br />

inadequate for embedded products.<br />

Because package maintenance is taken care of by the<br />

distribution provider, binary distributions offer very limited<br />

package configuration. Packages are also generic, so they are<br />

usually heavily patched to cover a wide range of use cases<br />

instead of focusing on embedded application needs.<br />

The policies and architectural choices are pre-defined and offer<br />

few customization options. Embedded products are unique, so<br />

customization of the binary distribution is often necessary. This<br />

leads to manual non-standard builds that are difficult to<br />

reproduce and trace. Even if very little customization is needed,<br />

package maintenance becomes a problem once the distribution<br />

maintenance period ends and manual non-standard builds are<br />

required.<br />

Performing package updates via package managers is also<br />

unsuited for embedded devices. After several updates, there is<br />

no way to guarantee that the deployed system is the same as the<br />

tested system. Also, losing power while updating could leave<br />

the system in an inconsistent state.<br />

Binary distributions produce bigger systems, so images are<br />

larger and slower to boot. They are also more complex systems<br />

that require more resources to run and are more difficult to<br />

secure.<br />

Finally, binary distributions are not easily portable, so they do<br />

not scale well to run on multiple platforms.<br />

In summary, binary distributions have a high maintenance burden<br />

and low reproducibility, which are disadvantages for embedded<br />

systems.<br />

2) Build from source<br />

Building from source is more complex and has traditionally<br />

meant longer development cycles. However, it allows<br />

maximum flexibility to architect the system with full package<br />

configuration and no pre-determined choices. It also allows for<br />

package maintenance as long as necessary.<br />

A system built specifically for an embedded system provides a<br />

more compact system with smaller images that are faster to<br />

boot. Also, reduced system complexity allows it to run on fewer<br />

resources and makes it easier to secure.<br />

It is also highly portable with good scalability to multiple<br />

hardware platforms.<br />

In summary, a system built from source offers good<br />

maintainability and reproducibility.<br />

The longer development cycle of the do-it-yourself approach,<br />

even with the help of projects like Cross Linux From Scratch⁷<br />

and crosstool-NG⁸, placed embedded Linux projects at a<br />

disadvantage. However, standard tools have emerged that<br />

greatly simplify the building of custom Linux systems. The two<br />

most prominent tools are Buildroot⁹ and the Yocto Project¹⁰.<br />

a) Buildroot<br />

Buildroot is an easy to learn tool for small projects and small<br />

teams. It uses the kbuild system (like the Linux kernel) as its<br />

configuration tool, which means the configuration is only kept<br />

in one place. It can be thought of as an image generator more<br />

than a distribution builder. It does not generate binary packages,<br />

and does not support package managers or native on-target<br />

compilation.<br />
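Because the configuration lives in a single kbuild-style file, a Buildroot system can be captured in a defconfig. The fragment below is a hypothetical sketch: the symbol names are Buildroot kconfig options, but the selection and hostname are illustrative:

```shell
# Sketch of a Buildroot configuration fragment (a defconfig).
# Typical flow (not run here):
#   make BR2_DEFCONFIG=./demo_defconfig defconfig
#   make          # builds toolchain, kernel, rootfs into output/images/
BR2_aarch64=y                       # target architecture
BR2_TOOLCHAIN_EXTERNAL=y            # use a prebuilt cross-toolchain
BR2_TARGET_GENERIC_HOSTNAME="demo"  # system-wide policy lives here too
BR2_PACKAGE_BUSYBOX=y               # package selection
BR2_TARGET_ROOTFS_EXT2=y            # image format to generate
```

Keeping the whole system in one such file is what makes Buildroot configurations easy to review and reproduce.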

It also has a good selection of well-maintained packages, and<br />

custom external packages can be added. Buildroot only<br />

performs full system updates; this is good for production<br />

systems but not for development, as the images must always be<br />

updated as a whole.<br />

It also has no concept of a build cache and often needs to<br />

perform full system rebuilds instead of incremental builds.<br />

Buildroot has a three-month release cadence and a long term<br />

support (LTS) release every year.<br />

Although at this point it has diverged considerably, OpenWrt¹¹<br />

is an example of a distribution that originated in Buildroot. It is<br />

focused on networking devices, particularly routers.<br />

Buildroot is a good choice for small projects and teams, as it<br />

keeps the complexity low while significantly reducing time to<br />

⁶ https://www.debian.org/ports/<br />

⁷ http://trac.clfs.org/<br />

⁸ http://crosstool-ng.github.io/<br />

⁹ https://buildroot.org/<br />

¹⁰ https://www.yoctoproject.org/<br />

¹¹ https://openwrt.org/<br />



market. However, it does not scale as well as Yocto for multiple<br />

platforms or bigger teams.<br />

b) The Yocto Project<br />

The Yocto Project is a distribution builder that provides a<br />

reference distribution called Poky. Its OpenEmbedded¹² build<br />

system is based on BitBake, a task scheduler written in Python,<br />

with package recipes that are structured in layers. It supports a<br />

large number of packages, and the layers facilitate software<br />

reuse. But since the layer ownership is distributed, maintenance<br />

can be problematic.<br />

Configuration is scattered in distro, machine, image, and local<br />

configuration files. Bitbake parses all configuration and recipes,<br />

resolves dependencies, and prepares and executes a list of tasks.<br />

The build output is binary packages, which are then installed<br />

into a root filesystem image. Package managers running on the<br />

target are supported and especially useful for development. The<br />

Yocto Project can be used to create binary-based distributions,<br />

but it is mostly used as an image generator.<br />
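The recipe format can be illustrated with a minimal hypothetical recipe, for example `hello_1.0.bb` in a custom layer. All names are placeholders and `<checksum>` stands in for a real license-file checksum; this is a sketch of the format, not a recipe from an actual project:

```bitbake
# Minimal illustrative BitBake recipe: builds one C file and installs it.
SUMMARY = "Example hello application"
LICENSE = "MIT"
LIC_FILES_CHKSUM = "file://${COMMON_LICENSE_DIR}/MIT;md5=<checksum>"

SRC_URI = "file://hello.c"
S = "${WORKDIR}"

do_compile() {
    ${CC} ${CFLAGS} ${LDFLAGS} hello.c -o hello
}

do_install() {
    install -d ${D}${bindir}
    install -m 0755 hello ${D}${bindir}
}
```

BitBake schedules the recipe's fetch, compile, install, and package tasks, then installs the resulting binary package into the root filesystem image.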

The Yocto Project has a six-month release cadence, and<br />

maintains both the current and previous software releases.<br />

It has a steeper learning curve than Buildroot. However, once<br />

proficiency is achieved it also reduces the system complexity<br />

and the time to market while scaling well to multiple platforms<br />

and bigger teams.<br />

An example of a Yocto Project-based distribution is<br />

Ångström¹³.<br />

V. CONCLUSION<br />

Although there is no one-size-fits-all embedded operating<br />

system, embedded Linux covers the majority of use cases for<br />

microprocessor-based solutions. Even so, real-time<br />

considerations must be taken into account.<br />

However, the increased software complexity and different<br />

software skillset required for embedded Linux development are<br />

also important to consider.<br />

The choice between using a binary distribution and taking on<br />

the complexity of building a Linux distribution from source is<br />

made easier with the use of system builders like Buildroot and<br />

the Yocto Project. Even though Buildroot is a great tool for<br />

smaller projects, the scalability of platforms and development<br />

workflows with the Yocto Project makes it the de facto standard<br />

for embedded Linux systems.<br />

A more detailed discussion of the differences between the<br />

Yocto Project and Buildroot can be found at [2], while a<br />

detailed comparison with Debian can be found at [3].<br />

REFERENCES<br />

[1] EETimes/Embedded 2017 Embedded Market Study by AspenCore<br />

https://m.eet.com/media/1246048/2017-embedded-market-study.pdf<br />

[2] Alexandre Belloni and Thomas Petazzoni, "Buildroot vs.<br />

OpenEmbedded/Yocto Project: A Four Hands Discussion". Embedded<br />

Linux Conference 2016.<br />

https://elinux.org/images/7/7a/Bellonipetazzoni.pdf<br />

[3] Mads Doré Hansen, "Yocto/Debian Comparison White Paper".<br />

https://www.prevas.dk/download/18.58aaa49815ce6321a327da/1506087244328/Yocto_Debian_Whitepaper.pd<br />

¹² https://www.openembedded.org/<br />

¹³ http://www.angstrom-distribution.org/<br />



Boot Time<br />

Benefits & Drawbacks of Linux Sleep and Hibernate<br />

Thom Denholm<br />

Technical Product Manager<br />

Datalight, Inc.<br />

Bothell, WA<br />

Thom.Denholm@datalight.com<br />

Abstract— Embedded designs need to start up quickly, and<br />

those based on Linux and Android must overcome challenges of<br />

initializing many peripherals and complex applications. Many<br />

designs today rely on a sleep or hibernate solution. What are the<br />

risks of these options, and is there a better alternative?<br />

This session will examine strategies to optimize boot time<br />

including a detailed discussion of trade-offs to consider when<br />

working to perfect your users' experience. Bill of materials costs,<br />

power budgets, and required development team expertise will be<br />

examined.<br />

Keywords—Linux kernel; boot time; hibernate; sleep; Android<br />

I. INTRODUCTION<br />

Consumers desire an instant-on experience, so embedded<br />

designs need to start up quickly. Systems that once had no<br />

processor or a simple startup ROM set that standard for the<br />

consumer, who now expects similar start times from embedded<br />

devices that are considerably more complex. The Linux<br />

Kernel adds significantly to this overhead.<br />

Android, running on top of Linux, further adds to the<br />

burden. Google is aware of the problem, and even targeted<br />

faster boot speed with their latest release, Oreo. [1] Above both<br />

the Linux Kernel and Android environment is the application,<br />

which may need to initialize graphic environments and/or load<br />

databases to start up.<br />

The goal of this paper is to survey available options to<br />

improve overall device startup time, and also shed some light<br />

on the risks and benefits of the various approaches.<br />

II. DEFINING THE PROBLEM<br />

For an embedded device to be ready for input, it must start<br />

the hardware, drivers and Kernel, then any application<br />

environment and finally the application. When people focus on<br />

startup time, it is usually the hardware and Kernel that get the<br />

most attention.<br />

Boot tracer and other utilities can be used to measure the<br />

boot process. When that data is routed through the<br />

bootgraph.pl script, the result is a colored chart that breaks<br />

down those results [2]. Some steps can be removed or postponed,<br />

and others may be shortened through code changes. Another<br />

technique is to parallelize initialization, allowing two or<br />

more operations to start together.<br />
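The measurement flow above can be sketched as a pair of on-target commands; the kernel-source path is a placeholder, and the device must have booted with initcall timing enabled:

```shell
# Hedged sketch, run on the target device (not on a build host).
# Boot with "initcall_debug printk.time=1" on the kernel command line so
# each driver initcall is timestamped, then render the chart offline.
dmesg > boot.log
perl /path/to/kernel-source/scripts/bootgraph.pl < boot.log > bootgraph.svg

# On systemd-based images, a quick per-service breakdown of user space:
systemd-analyze blame
```

The resulting SVG shows which initcalls dominate the boot, pointing at the drivers worth deferring or trimming.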

NAND media startup can be particularly pernicious.<br />

Drivers must usually scan the entire media to remap the wear<br />

leveling, and that is before whatever scans and checks are<br />

required by the file system for reliability validation.<br />

Work done to reduce Linux startup time can minimize the<br />

time spent in cold boot, but it can have unexpected costs for<br />

longer term projects. Modifications will usually apply only to<br />

the current Linux kernel used on the project; any changes will<br />

require much of this work be performed again. This knowledge<br />

can also be something of a special skill, meaning staffing<br />

changes could significantly impact your project.<br />

III. APPLICATION BEYOND THE KERNEL<br />

Just starting the Kernel isn’t enough for most embedded<br />

designs; any required applications need to start as well. Any of<br />

these could involve loading data files and databases, each of<br />

which takes time to read from the media. Other major startup<br />

hurdles include graphic libraries and external communication.<br />

It is very difficult to optimize the startup time for a complex<br />

application to any great degree.<br />

The Android environment is really a specialized user<br />

interface application, and it must start up as well. Like Windows<br />

and iOS, this environment loads applications and initializes<br />

information. The system may not be fully ready for tens of<br />

seconds (or even minutes). It contains its own task manager<br />

which can be used to disable some app startup to improve that<br />

speed somewhat, but this knowledge is also subject to change<br />

with new versions.<br />



One interesting option is to store a suspended version of the<br />

application (sometimes called a snapshot). This is popular with<br />

virtual machines, which have their state suspended instead of<br />

starting from scratch each time. For the entire design, this<br />

would be known as hibernation; more on that later.<br />

IV. OVERALL DESIGN<br />

While the entire design is going through various stages of<br />

startup, it is drawing power. This is being used to read from the<br />

media and allow the processor to work, but most designs<br />

drive a display screen as well. Whether your customers see a<br />

splash screen or the status of the boot cycle, this additional<br />

display time draws current for the device. Reducing the entire<br />

system startup time has a secondary benefit of reducing power<br />

usage and improving device lifetime. This is especially useful<br />

if the device must initialize many times on one battery charge.<br />

V. SLEEP MODE<br />

One solution is to use sleep mode. Here, the processor<br />

suspends nearly all activity (exceptions include DRAM<br />

refresh). To the end user, the device appears to be off, though it<br />

is in fact consuming a small amount of power to maintain<br />

operation. This can eventually drain a battery. It is less risky<br />

when the device is plugged in, but power can (and will) be<br />

interrupted.<br />

Many devices will add a small indication that the device is<br />

not “off” but merely sleeping – a slowly blinking light, for<br />

example – to make sure the customer is aware of this risk, and<br />

will not arbitrarily unplug a device. This is often referred to as<br />

a heartbeat, which is appropriate – when the power is lost, the<br />

device is dead. When power is lost, the status of everything in<br />

memory will also be lost – loaded applications, open files and<br />

non-committed program operations.<br />

In addition to adding a visible heartbeat, there are other<br />

changes required to support sleep mode. Of the options<br />

surveyed here, this is the most complex from a hardware<br />

design standpoint.<br />

VI. HIBERNATION<br />

The alternative to sleep is hibernation. In this case, rather<br />

than suspending operations in memory, the machine state<br />

contents are committed to the media. Committing and restoring<br />

that state can take some time, and should be done in a<br />

power-fail-safe manner. If the commit is not complete, the device will<br />

have to cold boot – the same as if power was lost in sleep<br />

mode.<br />
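Both modes map onto the standard Linux power-management interface under /sys/power; which states are actually available depends on the kernel configuration and BSP support. A sketch of the on-target commands (not meant for a build host):

```shell
# On-target sketch: requires kernel power-management and BSP support.
cat /sys/power/state          # lists supported states, e.g. "freeze mem disk"

echo mem  > /sys/power/state  # suspend-to-RAM: the "sleep" mode above
echo disk > /sys/power/state  # suspend-to-disk: hibernation
```

Writing `disk` triggers the commit of the machine state to the configured swap or image partition before powering down.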

The major advantage of both sleep and hibernate is that<br />

both allow the device to avoid the time needed to reload the<br />

applications and their associated files. In contrast, optimizing<br />

system loading for a faster cold boot does very<br />

little to improve the startup time of the application. Also,<br />

neither of these methods requires significant updates when the<br />

Kernel or Android version changes, and no special Linux<br />

development expertise is required.<br />

Hibernate requires support in the BSP and drivers. One<br />

disadvantage of Hibernate is the amount of data that needs to<br />

be committed and restored, which grows as the system DRAM<br />

footprint grows. The additional time required for this I/O also<br />

draws power, and could require a status screen to reassure the<br />

user, which requires even more. There are a few techniques<br />

which could be used to improve this operation. Each will<br />

require changes to the code to perform both the hibernate and<br />

the restore.<br />

VII. SKIP UNUSED PORTIONS<br />

Large portions of memory in a system are blank – either<br />

unused or allocated as part of a program’s stack or heap. Not<br />

only is there no point in committing those unused portions to<br />

the media, but those writes (and subsequent reads) are costly in<br />

terms of time and flash media life. The solution is to shrink<br />

the committed image by skipping those unused portions.<br />

Performing this operation requires knowledge of the<br />

individual drivers and the application, but in this case a little<br />

knowledge can yield significant savings. One source for this<br />

knowledge is the Linux kernel source code, which is freely<br />

available. Here the hibernate code can simply skip the unused<br />

portions; the restore code should initialize that memory to the<br />

values expected by the various drivers and applications.<br />

VIII. FURTHER COMPRESSION<br />

While storing blocks to the media, why not take advantage<br />

of compression algorithms to further reduce the footprint? This<br />

operation would require changes to both the hibernation and<br />

restoration code, but would be well worth it. Allocated<br />

program space in memory is far more compressible than media<br />

and images, and the time required to read back the data from<br />

the disk is even further reduced.<br />
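The effect is easy to demonstrate: data dominated by zero-filled (unused) pages compresses to a tiny fraction of its size. The file below is synthetic, standing in for a hibernation image, and the sizes are illustrative:

```shell
# Toy demonstration: 1 MiB of zeros (a stand-in for unused RAM pages)
# compresses to roughly a kilobyte, so far less data is written to and
# later read back from the flash media.
dd if=/dev/zero of=ram.img bs=1024 count=1024 2>/dev/null
gzip -c ram.img > ram.img.gz   # compress, keeping the original
wc -c ram.img ram.img.gz       # compare raw vs. compressed sizes
```

Real memory images contain non-zero program data as well, so the ratio is smaller in practice, but allocated program space still compresses far better than already-compressed media files.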

IX. ANOTHER ALTERNATIVE: REMOVE THE TIME REQUIRED TO CREATE THE HIBERNATION IMAGE<br />

For devices which present the same interface to the<br />

customer each time they start up, another alternative would be<br />

to create the hibernation image just one time, then restore that<br />

image each time. This allows for the fastest possible device<br />

ready state – Linux Kernel and application running, files preloaded.<br />

This "factory configuration" is the same one used each<br />

time, with no per-user or site customization. Going further, an<br />

application could be modified to use a small configuration file<br />

for this data, reading it again when the system detects it has<br />

been restored.<br />



X. BILL OF MATERIALS<br />

None of the hibernation techniques discussed comes without a<br />

cost. Some additional storage media is required to save these<br />

images, with the amount required dependent on how much<br />

overall DRAM is in the system, and whether compressed<br />

images are an option. Fortunately, media storage has come<br />

down in price, and media vendors are pushing larger size parts<br />

while using the same board footprint.<br />

XI. CONCLUSION<br />

Devices with Linux are often far more complex (and<br />

slower to start up) than those designed with an RTOS. Many<br />

techniques are available to accelerate the startup on Linux.<br />

Only hibernation keeps the application loaded AND protects<br />

against power failure. Some clever modifications to hibernate<br />

are available to speed the restore time and reduce the storage<br />

time, even removing it completely. This solution delivers<br />

improved overall device start time with the least additional<br />

Linux kernel modifications.<br />

REFERENCES<br />

[1] Sameer Samat, Aug 21 2017,<br />

https://blog.google/products/android/android-oreo-superpowers-comingdevice-near-you/<br />

[2] Chung-Yeh Wang, Dec 31 2012,<br />

http://linuxonarm.blogspot.com/2012/12/boottime-patchubuntunexus7.html<br />



Connecting Sub-1 GHz Low-Power IoT Nodes to the<br />

Internet Using 802.15.4<br />

Nick Lethaby<br />

Connected Microcontrollers<br />

Texas Instruments<br />

Goleta, USA<br />

nlethaby@ti.com<br />

Abstract—The Sub-1 GHz spectrum provides publicly<br />

available bands that allow long-range low-power communication,<br />

rendering it ideal for many IoT applications. Unfortunately, the<br />

Sub-1 GHz band supports many different PHYs and lacks a<br />

standard networking solution. This has limited its use in the IoT<br />

market because developers must have implementation expertise<br />

in low-level RF communications. IEEE extended support to the<br />

Sub-1 GHz band in the ‘g’ amendment of the IEEE 802.15.4<br />

specification, creating an opportunity for standards-based<br />

protocol implementations. We will overview an 802.15.4g-based<br />

protocol stack implementation that enables sensors and actuators<br />

to connect to the cloud using Sub-1 GHz radios, including which<br />

802.15.4 standards are useful and which additional proprietary<br />

implementation is needed for a full stack. This includes a Linuxbased<br />

stack and gateway implementation to bridge the wireless<br />

network to the internet using a serial port abstraction for the<br />

Sub-1 GHz radio device. We will conclude with benchmark data<br />

demonstrating network reliability and potential battery life for<br />

application usage scenarios with an ARM Cortex M-based Sub-1<br />

GHz wireless microcontroller.<br />

Keywords—IoT; Sub-1 GHz; 802.15.4;<br />

I. INTRODUCTION<br />

With Internet of Things (IoT) applications triggering large<br />

scale deployment of wirelessly connected sensor and actuator<br />

nodes, cost-effective implementation will depend on keeping<br />

deployment costs low. The choice of wireless technology will<br />

significantly affect these costs as it will determine how many<br />

gateways or intermediate routers are required, whether mains<br />

power is required, and the hardware costs, such as processing<br />

power and memory, of the end node.<br />

In many IoT applications, mains power may not be<br />

conveniently available, adding to the deployment costs unless a<br />

node can function for long periods on a battery or similar<br />

power source. In addition, in applications such as agriculture, warehouse asset tracking, or industrial plants, nodes must be placed over a wide area and in locations where metal or concrete may attenuate or obstruct the signal. This creates challenges for wireless technologies such as Wi-Fi or BLE, which have somewhat limited range and do not easily adapt to radio environments where the signal must turn corners.<br />

The Sub-1 GHz spectrum provides publicly available<br />

bands that allow long-range communication and good<br />

penetration. This band has been extensively proven in IoT<br />

application segments such as smart metering and home alarms,<br />

which are based on proprietary network implementations. It is<br />

also the band used by emerging Low Power Wide Area<br />

Network technologies like Sigfox and LoRa. Unlike<br />

connectivity technologies such as Ethernet, BLE, or Wi-Fi,<br />

which have a very limited number of PHYs, Sub-1 GHz allows<br />

a very wide range of different PHYs. As a result, there is no standard networking solution. This has limited its use in the broader IoT market because developers have needed implementation expertise in low-level RF communications and the associated protocol stacks.<br />

In 2011, IEEE extended support to the Sub-1 GHz band<br />

with the ‘g’ amendment of the IEEE 802.15.4 specification,<br />

creating an opportunity for more standards-based protocol<br />

implementations. Since 802.15.4 is purely a MAC-layer<br />

standard, it will always require additional custom stack layers<br />

to be developed for a fully functional implementation.<br />

However, using 802.15.4 as the starting point gives the many engineers who know this standard a far more familiar stack for Sub-1 GHz applications than a fully custom implementation would.<br />

Since connecting low power wireless networks to the<br />

internet requires a gateway, the 802.15.4 stack implementation<br />

for Sub-1 GHz will be discussed for both low power sensor<br />

nodes and a Linux-based gateway that provides internet<br />

connectivity, including which components of 802.15.4 were<br />

used and which custom stack elements needed to be<br />

implemented.<br />

However, we will begin with a summary of the network<br />

requirements since these strongly influenced many of the<br />

implementation choices.<br />

II. NETWORK REQUIREMENTS SPECIFICATION<br />

Any wireless network design must first determine the<br />

optimal blend of cost, range, data rate, and power consumption<br />

for the targeted applications as these factors heavily influence<br />

the implementation path. These requirements are summarized below. In the specific context of Sub-1 GHz, a key requirement was very low power operation, as this involves a trade-off with transmission range:<br />

- Robustness: For IoT applications such as smoke alarms or predictive maintenance, sensor data must be delivered reliably to enable a response.<br />
- Scalability: IoT applications will often have dozens to hundreds of sensors on an individual wireless network.<br />
- Latency: An IoT application like a home alarm must deliver data in a timely manner.<br />
- Security: It is very important to minimize the opportunity for unauthorized access or eavesdropping.<br />
- Regulatory compliance: Since the Sub-1 GHz band is subject to national or regional telecommunication rules, the network implementation must conform to these in areas such as FCC channel occupancy requirements.<br />
- 1 km range: The ability to deploy IoT sensors over a whole building or factory in a simple star network topology avoids the cost and complexity of mesh networking with intermediate routers or range extenders. It is important to understand the relationship between range and data rate in wireless transmission: as the range increases, the possible data rates decrease. As a result, for an equivalent amount of data, devices must stay active longer (and thereby consume more power) when transmitting at longer range. To achieve very long battery life, the range requirement was therefore limited to enable a data rate of 50 kbps.<br />
- Low power nodes: Since many IoT sensors and actuators lack access to mains or solar power and it is not cost-effective to frequently replace batteries, multiple years of operation on a coin cell battery was required. To achieve this, the network must allow devices to sleep for long periods without being woken purely for network synchronization purposes.<br />
- Two-way communication: Some nodes will be actuators that wait for commands or related input from the network. The network must be able to support sending out commands as well as simply receiving sensor data.<br />
- Low cost nodes: To achieve low cost, it was required that the network implementation work on an embedded MCU with<br />


enables the receiver to verify that the packet contents were not altered (countering man-in-the-middle attacks).<br />

Asynchronous mode: 802.15.4 is designed for<br />

low power operation and, unlike many network<br />

implementations, avoids the need for a device to<br />

regularly synchronize with the network. This<br />

allows a sensor node to sleep until it has a reason<br />

to connect to the network, such as for transmitting<br />

data.<br />

Broadcast mode: While sensors may be able to<br />

sleep until they must transfer data, an actuator,<br />

such as an LED lighting controller, needs to be able<br />

to respond quickly to a user command. The<br />

802.15.4 broadcast mode allows actuators to act as<br />

beacons waiting for input from the network.<br />

While 802.15.4 provided standards-based solutions that<br />

addressed many of the network implementation requirements,<br />

there were several areas where additional standards or<br />

proprietary techniques were utilized in the implementation:<br />

- Frequency hopping: Although 802.15.4 includes frequency hopping standards, we chose the Wi-SUN (Wireless Smart Ubiquitous Networks) frequency hopping scheme as the basis for our implementation, as it is simpler and designed directly for 802.15.4g. We enhanced the Wi-SUN implementation to add support for sleepy devices and broadcast (beacon) mode.<br />
- Logical Link Controller: 802.15.4g is purely a MAC layer standard and does not address the functions of the Logical Link Controller or of higher layers in the network stack, such as network formation and management. It was therefore necessary to implement a proprietary application to provide the additional networking functions needed to bridge sensor data from 802.15.4g to the internet.<br />
- Security: 802.15.4 does not define standards for device authentication or secure key exchange. These are generally regarded as essential in modern IoT network implementations.<br />

We will discuss the Logical Link Controller and the security enhancements to 802.15.4 in greater detail in subsequent sections.<br />

V. LOGICAL LINK LAYER<br />

As discussed earlier, 802.15.4g is purely a MAC layer<br />

standard. A significant amount of additional functionality must<br />

be added to have a viable wireless network. Although we titled this software module the Logical Link Controller in our implementation, it is important to understand that it provides much more than what is typically thought of as logical link layer functionality. The key functions implemented were:<br />

- Network Management: This starts and closes the network and ensures that the network connections are functional. This functionality is entirely implemented on the gateway.<br />
- Device Management: The gateway adds devices that wish to join, removes devices that wish to leave or are not responding, and maintains device connections. On the device side, a device must verify that its connection is still open. If not, it should look for a new network.<br />
- Service Discovery: This identifies which types of devices are connected to the network, such as temperature or soil moisture sensors, so that these can be configured by applications for appropriate reporting.<br />

VI. INTERNET CONNECTIVITY<br />

Connecting the 802.15.4 wireless network to the internet<br />

requires a gateway. A software stack that performed the<br />

gateway function was implemented on an embedded Linux<br />

board, which connected to the Sub-1 GHz radio using a USB<br />

cable (see Figure 1).<br />

The Sub-1 GHz radio was operated as a network processor and treated as a Linux character device. The 802.15.4g MAC-layer functions continue to run on the MCU in the network processor, just as on a node device. In addition, the network processor required additional software to enable it to communicate with the Linux gateway stack through a serial device driver. On the Linux side, the gateway stack contains a network processor interface (NPI) layer that serializes data and commands between the network processor and the gateway application.<br />

The Linux gateway application implements the Logical<br />

Link Controller functions described earlier such as network<br />

formation and management and service discovery. It is also<br />

responsible for data transmission from the higher level<br />

applications to the radio. It provides a simple API layer that offers calls such as transmit, receive, network open and close, and device join and remove. The gateway also encapsulates data received from the NPI into JSON objects, which can easily be consumed by the cloud. For sensor objects, we chose the Internet Protocol for Smart Objects (IPSO) definitions, as these<br />

provide a standard for formatting sensor data. The end user can<br />

create an application that posts the JSON data into an MQTT<br />

queue or whatever other mechanism is required to pass the data<br />

into the cloud for storage and processing.<br />

VII. NETWORK SECURITY<br />

As described earlier, AES-CCM addresses several aspects<br />

of network security. However, to offer the security expected in<br />

a modern IoT network, these capabilities must be enhanced<br />

further. Since AES is a symmetric cipher, it suffers from the<br />

drawback of all symmetric encryption: namely how to securely<br />

exchange the key required for subsequent secure transmission.<br />

A second weakness of AES-CCM concerns device authentication: while AES-CCM can authenticate the message contents, it has no mechanism for verifying that a device joining the network is legitimate rather than a potential bad actor.<br />

To authenticate devices, a password mechanism was<br />

chosen. The device manufacturer must embed a unique IEEE<br />

address and an 8- or 16-character alphanumeric passcode into<br />

each device. When a user wishes to commission the node into<br />

the network they take that information, which will typically be<br />

encapsulated in a Quick Response (QR) code, and input them<br />

into the gateway application. The gateway application<br />

computes the unique token ID by performing a hash operation<br />

of the concatenated string of the passcode and IEEE address<br />

and then opens the network for joining. The node computes the<br />

unique token ID from its embedded IEEE address and passcode<br />

information, and then attempts to join using a TLS-like<br />

exchange over 802.15.4 that is secured over the token-derived<br />

link. If the device-generated token matches the token generated<br />

on the gateway, then the connection is allowed. An alternative<br />

to a password-based approach would be to use certificates.<br />

These would offer even stronger device authentication, as certificates contain additional information such as the manufacturer's identity or the manufacturing location. A certificate-based approach is also more efficient when a large number of devices must join at the same time, as it eliminates the need to manually enter the device credentials.<br />

As intimated above, a TLS-like sequence is used to pass the AES key between the node and the gateway. Elliptic-curve Diffie-Hellman (ECDH) was the chosen method to generate and securely exchange the AES key. The same AES key is used to encrypt all future communication between the device and the gateway, unless the node is completely power cycled, in which case it will generate and exchange a new key.<br />

VIII. POWER CONSUMPTION RESULTS<br />

Table 1 shows the current consumption and projected battery life for a node, based on testing the stack in specific application scenarios. The results indicate that a node can potentially operate for multiple years on a coin cell battery. It should be stressed that battery life is completely dependent on the application profile: how frequently, and for how long, a node is awake. However, the profiles illustrated are quite reasonable for a sensor.<br />

Node Application Profile | Without Frequency Hopping (Average Current / Predicted Battery Life) | With Frequency Hopping (Average Current / Predicted Battery Life)<br />
Sends data every 3 mins | 1.9 µA / 13.9 years | 2.2 µA / 11.9 years<br />
Sends data every 3 mins and polls every minute | 4 µA / 6.6 years | 6.7 µA / 4 years<br />
Table 1: Predicted coin cell battery life for two different application profiles using the Sub-1 GHz 802.15.4g stack<br />

IX. SUMMARY<br />

The Sub-1 GHz band offers many attributes that are highly suitable for IoT sensor network applications, including long range, robustness, and very low power operation. The long range and low power can significantly reduce deployment and operating costs by allowing simple star-network configurations and eliminating the need to make mains power available or frequently install new batteries.<br />

The Sub-1 GHz band has been used for some years in select<br />

IoT segments based on proprietary networks, such as smart<br />

meters and home alarms. A major barrier to wider adoption has<br />

been the need for developers to have the requisite experience to<br />

implement low-level RF protocols. 802.15.4g provides a basis<br />

for a robust, reliable standards-based wireless networking stack<br />

for Sub-1 GHz with low-power operation. Although 802.15.4g<br />

usage limits Sub-1 GHz range to around 1 km, this is still<br />

significantly more than competing low-power wireless<br />

technologies.<br />

Since 802.15.4 focuses on the MAC layer only, additional<br />

custom implementation effort is required to produce a stack<br />

that can make IoT data easily available for transmission to the<br />

cloud. We described the encapsulation of the radio into a Linux<br />

device driver, a network formation and management<br />

application, and the use of the IPSO data formats to provide<br />

sensor data in an easily consumable format for IoT platform<br />

agents.<br />

ACKNOWLEDGMENT<br />

I would like to thank Roberto Sandre, of the Connected MCU<br />

organization at Texas Instruments, for providing technical<br />

insight into the 802.15.4-based stack for Sub-1 GHz.<br />



emb::6: An Open-Source IoT stack for Multiple<br />

IPv6 Communication Protocols<br />

Nidhal Mars, Lukas Zimmermann, Manuel Schappacher and Axel Sikora<br />

Institute of Reliable Embedded Systems and Communication Electronics,<br />

University of Applied Sciences Offenburg, D77652 Offenburg, Germany<br />

{nidhal.mars, lukas.zimmermann, manuel.schappacher, axel.sikora}@hs-offenburg.de<br />

Abstract—6LoWPAN (IPv6 over Low-Power Wireless Personal Area Networks) is attracting more and more attention for the seamless connectivity of embedded devices in the Internet of Things. It can be observed that most of the available solutions follow an open-source approach, which significantly accelerates the development of technologies and markets. Although the currently available implementations are in pretty good shape, all of them come with some significant drawbacks. It was therefore decided to start the development of our own implementation, which takes the advantages of the existing solutions but tries to avoid their drawbacks and supports multiple communication protocols. This paper describes the emb::6 implementation and its characteristics. It also covers the extension to support the Thread protocol and 6TiSCH (IPv6 over the Time-Slotted Channel Hopping mode of IEEE 802.15.4e) networks. The presented implementation is available as an open-source project under [1].<br />

Keywords—6LoWPAN; ContikiOS; IEEE802.15.4; Thread<br />

Network; 6Tisch<br />

I. INTRODUCTION<br />

The Internet Protocol is the building block not only of the legacy Internet but also of the upcoming Internet of Things (IoT), which enables small, reasonably powerful, energy- and cost-efficient embedded devices to communicate not only with each other but also seamlessly with the existing Internet. On one side, these devices have interfaces to the physical world; on the other, they connect to the virtual world of databases and servers in the Internet and are thus a cornerstone of a Cyber-Physical System (CPS).<br />

The lower three layers of the protocol stack are already well defined with IEEE 802.15.4 [2] and 6LoWPAN [3][4][5]. 6LoWPAN was developed to enable IPv6 connectivity for constrained embedded devices that use 802.15.4 low-power wireless communication. Although open issues remain with regard to the selection of the physical layer, unified commissioning procedures, routing functions and parameters, and security, a reasonable level of interoperability has already been achieved that is comparable with other, more homogeneous protocol stacks.<br />

It can be observed that most of the available solutions follow an open-source approach, which significantly accelerates the development of technologies and markets. Examples include BLIP from TinyOS, RIOT OS [6], OpenWSN [7], and µIPv6 from Contiki. Although the currently available implementations are in pretty good shape, all of them come with some significant drawbacks.<br />

After a thorough analysis of existing 6LoWPAN solutions with regard to their maturity and maintenance, the authors of this article decided to develop a new network stack that fulfills industry-grade application requirements and provides comprehensive parametrization and commissioning capabilities.<br />

II. PROPOSED SOLUTION<br />

A. Design Principles<br />

The initial development of the emb::6 network stack started as a fork of Contiki OS including µIPv6; however, to meet the requirements of an industry-grade network stack, several Contiki-related core parts were removed or reworked. The most important aspects are:<br />
Architecture part:<br />
- event-driven paradigm for network stack management<br />
- improved modularity for the use of different Data Link Layer implementations, e.g. to also support a separately developed Wake-On-Radio-enabled IEEE 802.15.4 stack [10] or optional security enhancements such as (D)TLS [11]<br />
- clear separation between functional parts<br />
- seamless integration into other software environments<br />
Implementation part:<br />
- reduced usage of macros<br />
- a flexible, modular, and clear build system (SCons)<br />
- improved possibilities for parameterization thanks to extended APIs<br />
- improved portability due to extended abstraction at the HAL; any combination of transceivers, MCU, sensors, and periphery is possible, even simulation on a PC<br />
- conservation of the µIPv6 core in a manner suitable for regular bug fixing<br />



B. Architecture<br />

Figure 1 shows the basic architecture of the emb::6 network stack, with its networking core in the middle of the block diagram. The networking core handles the network-related tasks, mainly the communication part, and these tasks have been split up into several layers. Beginning at the top with the Application Layer (APL), which usually serves as the interface to the device application, requests are forwarded layer by layer down to the physical layer (PHY), which is responsible for the implementation of the RF-module drivers.<br />

Figure 1: Protocol Stack of emb::6<br />

A brief description of the layers follows:<br />

- Application layer. The application layers (APLs) are the highest layers of the emb::6 networking stack and are located above the transport layer (TPL). The APL is an optional part of the stack. Depending on the application, different APLs may be used; the following are currently included:<br />
o CoAP. CoAP is an HTTP-like protocol adapted and optimized for the Internet of Things (IoT). It is based on RESTful services [12].<br />
o ETSI M2M. According to the Global Standards Collaboration Machine-to-Machine Task Force, more than 140 organizations around the world are involved in M2M standardization. A considerable effort is made by ETSI to decrease M2M market fragmentation by defining a horizontal service platform for M2M interoperability. The proposed solution provides a RESTful Service Capability Layer (SCL) [13] accessible via open interfaces to enable developing services and applications independently of the underlying network.<br />
o LWM2M. LightweightM2M is a device management protocol standardized by the Open Mobile Alliance, designed to meet the requirements of applications. LightweightM2M is not restricted to device management; it is also able to transfer service and application data.<br />
- Transport layer. The transport layer is based on the µIPv6 embedded TCP/IP stack. By default only UDP is supported, but TCP can be enabled on request.<br />
- Network layer. The network layer contains two sublayers, the upper IPv6 layer and the lower 6LoWPAN adaptation layer. The IPv6 layer includes the routing protocol (RPL), ICMPv6, and the neighbor discovery protocol (NDP). The 6LoWPAN adaptation layer provides IPv6 and UDP header compression and fragmentation to transport IPv6 packets with a maximum transmission unit (MTU) of 1280 bytes over IEEE 802.15.4 with an MTU of 127 bytes.<br />
- MAC layer. A reduced-functionality implementation of IEEE 802.15.4.<br />
- PHY layer. The physical layer is represented by the radio-interface driver and supports hardware-dependent functionality of the transceiver, e.g. CSMA and auto-retransmission.<br />

Besides the networking core, a separate so-called Utility Module implements common functionality such as timer and event handling, which is used by all other layers and modules.<br />

C. Implementation and Parametrization<br />

To support different hardware platforms, including different microcontrollers, RF modules, and target boards, all hardware-dependent parts of the emb::6 networking stack are encapsulated in a separate so-called Board Support Package (BSP), which accesses a hardware-dependent hardware abstraction layer (HAL). This allows the emb::6 networking stack to be ported easily across different hardware platforms.<br />

The implementation of the emb::6 networking stack makes use of so-called structure-based interfaces to build up the complete stack. Each of the stack, HAL, and utility parts therefore has a well-defined interface description; the parts can be connected during initialization (cf. Figure 3), depending on the required configuration, using e.g. C structures as shown in Figure 2 for the network stack. This makes it possible to dynamically change, add, or remove functionality (e.g. change the compression algorithm) by providing different modules conforming to the given interface.<br />

The emb::6 networking stack was designed to be highly scalable and configurable. Configurations can therefore be made at compile time as well as at runtime, whereby compile-time parameters mainly reduce the functionality of the stack in exchange for lower memory and performance requirements. This makes it possible to use the stack on more constrained devices as well.<br />



typedef struct netstack {<br />
    const struct netstack_headerCompression* hc;<br />
    const struct netstack_highMac* hmac;<br />
    const struct netstack_lowMac* lmac;<br />
    const struct netstack_framer* frame;<br />
    const struct netstack_interface* inif;<br />
} s_ns_t;<br />
Figure 2: emb::6 network stack interface structure<br />

Figure 3: Example emb::6 initialization sequence.<br />

Runtime parameters have been implemented in layer- and utility-based configuration structures. Parameters are set during stack initialization and can be changed at runtime. Common use cases for such runtime parameters are e.g. transceiver output power or device addresses.<br />

In order to configure the emb::6 stack at compile time and to manage the overall complexity of different software and hardware configurations, a module-based approach was designed, handled by the SCons build system [14].<br />

D. Code Size<br />

Since the emb::6 network stack was mainly developed for use with resource-constrained embedded devices, benchmarks, especially regarding memory consumption in flash and RAM, are key points of the stack implementation. As the emb::6 network stack can be configured in many ways and every change in a configuration affects the resulting memory usage, it is nearly impossible to provide a single common number here. However, Figure 4 gives a basic overview of the memory consumption of a full function device (FFD) in comparison to a reduced function device (RFD). The different configurations are based on a sample implementation for different targets, built with the GNU GCC compiler and code optimization enabled.<br />
activated.<br />

Stack<br />

Configuration<br />

Setup initial networking<br />

stack parameters<br />

loc_initialConfig()<br />

Setup application dependent<br />

layer types<br />

loc_demoAppsConf()<br />

Initialize stack layers<br />

emb6_init()<br />

Initialize application<br />

loc_demoAppsInit()<br />

stk3600<br />

Flash/RAM<br />

45.7 / 4.3kB<br />

Stack parameters (SConsTargets):<br />

- MAC-Address<br />

- TX Power<br />

- RX Sensivity<br />

- Modulation<br />

Application Configuration (SConsTargets):<br />

- COAP (client/server)<br />

- UDP-Alive<br />

Stack layers:<br />

- all emb6 layers<br />

- BSP with radio driver<br />

Application Initialization (SConsTargets):<br />

- COAP (client/server)<br />

- UDP-Alive<br />

xpro_212b<br />

Flash/RAM<br />

46.9 / 4.3kB<br />

atany900<br />

Flash/RAM<br />

46.6 / 2.8kB<br />

atany900_rfd<br />

Flash/RAM<br />

26.4 / 1.0kB<br />

COAP: 11.6 / 2.2kB 11.9 / 2.2kB 12.5 / 2kB<br />

RPL: 13.2 / 0.3kB 14.4 / 0.3kB 13.3 / 0.2kB 10.1 / 0.1kB<br />

IPV6: 16.4 / 1.5kB 15.5 / 1.5kB 15.3 / 1.3kB 10.5 / 0.7kB<br />

6LOWPAN: 4.5 / 0.3kB 5.1 / 0.3kB 5.8 / 0.3kB 5.8 / 0.2kB<br />

Figure 4: Memory overview estimation for the emb::6 networking stack<br />

E. Demo Applications<br />

To provide usability and an easy entry into the emb::6 networking stack, it comes with a number of demo applications providing basic functionality, e.g. to establish a network. Simple UDP-based socket applications are included, as well as demos using application protocols such as CoAP.<br />

III. EXTENSION TO SUPPORT THREAD PROTOCOL<br />

A. Overview<br />

To enrich our emb::6 stack, we chose to support a recent development based on 6LoWPAN named Thread. It comes with extensions toward a more media-independent approach, which additionally promises true interoperability.<br />
Our extension mainly covers the layer 2 and layer 3 requirements of the Thread specification [15]. The implementation covers Mesh Link Establishment (MLE) and network layer functionality as well as the 6LoWPAN mesh-under routing mechanism based on MAC short addresses. The development has been verified on a virtualization platform and allows dynamic establishment of network topologies based on Thread's partitioning algorithm. Note that the parts related to commissioning and security are not supported yet.<br />

B. Thread protocol<br />

The Thread protocol is an open standard for reliable, cost-effective, low-power, wireless device-to-device communication. It is designed specifically for connected home applications where IP-based networking is desired and a variety of application layers can be used on the stack. The Thread standard is based on the IEEE 802.15.4 (2006) MAC and physical layer operating at 250 kb/s in the 2.4 GHz band.<br />

Figure 5 illustrates a general overview of our Thread stack implementation architecture [16]. This work mainly concentrates on MLE and the network layer as described in chapters 4 and 5 of the Thread specification.<br />

Figure 5 Thread network stack<br />

www.embedded-world.eu<br />



C. Thread device types<br />

The Thread network uses different types of devices, as illustrated in Figure 6.<br />
- Border Router: A specific type of router that supports multiple interfaces besides IEEE 802.15.4 in order to connect with other networks, e.g. Wi-Fi, Ethernet, etc.<br />
- Router: Provides routing services to the network and handles joining and security services for devices trying to join the network. Routers are not allowed to operate as sleepy end devices, but may downgrade their functionality and become REEDs (Router-eligible End Devices).<br />
- Leader: The device that makes decisions within the Thread network and manages router ID assignments. The Leader is the first active router on the network; a new Leader can be elected if connectivity is lost.<br />
- Router-eligible End Devices: REEDs have the capability to become routers without user interaction, if necessary.<br />
- End Devices: End devices communicate only through their parent router and cannot forward messages to other devices. To save energy they can sleep for a period of time and poll their associated router for data once they are awake.<br />

MLE resolves such asymmetric links by allowing a node to send periodic link-local multicast messages containing an estimated link quality for all its links. In addition, MLE exchanges link costs between nodes by sending MLE advertisement messages.<br />

MLE advertisement messages are used to exchange bidirectional link quality between neighboring routers. All routers periodically exchange single-hop MLE advertisement packets containing link cost information. These periodic advertisements allow routers to quickly detect changes in the set of neighboring routers, for instance when a new router joins the network, an existing router has been downgraded to a REED, or a router has lost its connection to the Thread network.<br />

Regarding the architecture, MLE cannot be clearly placed in the OSI model. Instead, it operates alongside the stack, using UDP (User Datagram Protocol) as its transport protocol. The same architecture is found in other systems that make use of MLE, such as ARM mbed OS [19]. Figure 7 shows the different protocol modules used by ARM mbed OS and the interaction of the MLE protocol with the existing layers.<br />

Figure 6 Thread device types [17]<br />

D. Mesh Link Establishment<br />

The existence of many asymmetric radio links within an IEEE 802.15.4 network is one of the main issues when establishing links between nodes. Thread uses the Mesh Link Establishment (MLE) protocol [18] to resolve such problems, among other capabilities.<br />

In this section, we give an overview of the capabilities of the MLE layer and its architecture. Furthermore, we highlight the main MLE processes that have been implemented.<br />

MLE capabilities and architecture<br />

MLE is a protocol used to configure and secure radio links dynamically as the topology and physical environment change. This is done by exchanging IEEE 802.15.4 radio parameters between nodes, such as addresses, node capabilities and frame counters.<br />

MLE allows all nodes to synchronize periodically and share radio link parameters to adapt to any change in the topology, such as the joining of new devices. Furthermore, MLE can detect unreliable links before any effort is spent on configuring them. For example, a link between two devices that is strong in one direction may be unusable due to weak signal strength in the other direction; MLE detects and resolves such asymmetry.<br />

Figure 7 ARM 6LoWPAN stack alongside OSI model [19]<br />

MLE processes and test cases<br />

In Thread networks, all devices join the network either as an end device or as a REED. Joining devices always try to attach to an active Thread router, from which they are allocated a 16-bit short address. In case such a join attempt fails, a second request is sent to both routers and REEDs.<br />

Figure 8 shows such a mesh link establishment scenario. We use four nodes whose MAC addresses end in 0xAA, 0xB0, 0xC0 and 0xB1, respectively. Possible radio links are defined statically by the environment and are delineated by the gray line.<br />

Figure 8 Testing scenario for MLE joining process.<br />



Figure 9 shows the trace output of the nodes for this<br />

scenario. Window A corresponds to node 1 (0xAA) and<br />

window B to node 2 (0xB0).<br />

- Node 1 (0xAA) is the first active node in the network. The<br />

joining process fails since no other routers are available<br />

(line A.12). Consequently, the node creates a new partition<br />

and starts operating as a parent (lines A.13 and A.14).<br />

- Node 2 (0xB0) attaches to node 1 after exchanging four<br />

handshake messages. Node 2 operates as a child after<br />

receiving the CHILD ID RESPONSE (lines B.18 and<br />

B.20).<br />

- Node 3 (0xC0) sends a multicast PARENT REQUEST. Node 1 and node 2 receive the message (lines A.22 and B.13), but only node 1 replies, due to the scan mask TLV (the first request should be answered only by active routers). In case more than one parent responds, the joining device compares them and selects the best device to be its parent, using the connectivity TLV received in the parent response and the calculated two-way link quality (derived from the link margin TLV in the parent response and the RSSI of the response itself, as explained in Table 1). The handshake then continues normally between node 3 and node 1. Finally, node 3 (0xC0) starts operating as a child.<br />

- Node 4 (0xB1) has only one possible radio link, to node 2. However, node 2 (0xB0) only replied to the second PARENT REQUEST (line B.24). This is explained by the fact that only the active router should reply to the first request (node 2 is operating as a child at that moment). Once node 2 receives a CHILD ID REQUEST, it sends a request to the leader to become a router (line B.26). Finally, node 2 switches its mode to active router (line B.27) and node 4 starts operating as a child.<br />

Figure 9: Trace output of the MLE joining process (window A: node 1, 0xAA; window B: node 2, 0xB0). Only the window A trace is recoverable here; the line numbers A.x and B.x referenced in the text belong to the original figure:<br />

MLE UDP initialized :<br />
lport --> 19788<br />
rport --> 19788<br />
MLE protocol initialized.<br />
[+] JP Send mcast parent request to active router<br />
==> MLE PARENT REQUEST sent to : ff02::2<br />
[+] JP Waiting for incoming response from active router<br />
[+] JP Send mcast parent request to active Router and REED<br />
==> MLE PARENT REQUEST sent to : ff02::2<br />
[+] JP Waiting for incoming response from active Router and REED<br />
Joining process failed.<br />
Starting new partition.<br />
MLE : Node operating as Parent.<br />
[+] SNY process: Send Link Request to neighbor router.<br />
==> MLE LINK REQUEST sent to : ff02::2<br />
[+] SNY process: Synchronization process finished.<br />
MLE PARENT RESPONSE sent to : fe80::250:c2ff:fea8:b0<br />
MLE PARENT RESPONSE sent to : fe80::250:c2ff:fea8:c0<br />
MLE CHILD ID RESPONSE sent to : fe80::250:c2ff:fea8:b0<br />
MLE CHILD ID RESPONSE sent to : fe80::250:c2ff:fea8:c0<br />
Child linked with id : 1 and timeout is : 10<br />
Child linked with id : 2 and timeout is : 10<br />

Window B repeats the MLE initialization and the first parent request before it is cut off.<br />


An EID is a stable IPv6 address that uniquely identifies a Thread interface within a Thread partition. EIDs are not directly routable, because the Thread routing protocol only exchanges route information for RLOCs. To deliver an IPv6 datagram with an EID as the IPv6 destination address, a Thread device must perform an EID-to-RLOC lookup. When attaching to a partition, a node must retrieve an RLOC IPv6 address from a router. The RLOC's 16 least significant bits are called the RLOC16 and encode the router ID and child ID of the node. Routers themselves carry child ID 0. Figure 10 shows the RLOC16 structure.<br />

Figure 10: RLOC16 structure<br />
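The RLOC16 packing can be sketched in C. The 6-bit router ID / 10-bit child ID split used below is an assumption based on the addressing scheme described here (up to 32 active routers, child ID 0 for routers); the helper names are illustrative and not taken from the emb::6 sources.<br />

```c
#include <stdint.h>

/* Assumed RLOC16 layout: router ID in the upper 6 bits, child ID in the
 * lower 10 bits. Routers themselves carry child ID 0. */
#define RLOC16_CHILD_ID_BITS 10u
#define RLOC16_CHILD_ID_MASK 0x03FFu

static uint16_t rloc16_make(uint8_t router_id, uint16_t child_id)
{
    return (uint16_t)(((uint16_t)router_id << RLOC16_CHILD_ID_BITS) |
                      (child_id & RLOC16_CHILD_ID_MASK));
}

static uint8_t rloc16_router_id(uint16_t rloc16)
{
    return (uint8_t)(rloc16 >> RLOC16_CHILD_ID_BITS);
}

static uint16_t rloc16_child_id(uint16_t rloc16)
{
    return (uint16_t)(rloc16 & RLOC16_CHILD_ID_MASK);
}
```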

A router retrieves its router ID from the partition leader by sending a CoAP address query message. RLOC addresses are only used for communicating control traffic and delivering IPv6 datagrams to their destinations. Since no RLOC address is available when initially sending an address query message, the EID is used; intermediate nodes must perform an EID-to-RLOC lookup in order to forward the packet to the partition leader and vice versa. The child ID is allocated by the parent node and communicated through the MLE attachment process.<br />

Routing algorithm<br />

A Thread network has up to 32 active routers that use next-hop routing for messages based on their routing database. The path cost calculation in this database is performed by applying a distributed Bellman-Ford algorithm (cf. RIPng) [20]. The routing database is a set consisting of the neighbor router table (Link Set), the routing table (Route Set) and all valid router IDs (Router ID Set). All routers advertise their routing table periodically; the rate at which routing advertisements are sent is determined by an instance of the Trickle algorithm. In order to keep track of the validity of shared data in the network, routers attach an incrementing ID sequence number to the routing data. After looking up the shortest path for a route, a router generates the IPv6 RLOC address of the destination router using its router ID.<br />
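The distributed Bellman-Ford update reduces to a relaxation step per received advertisement: a router adopts a neighbor as next hop only if the incoming link cost plus the neighbor's advertised cost improves on the stored path cost. A minimal sketch, where the structure, names and the cost cap standing in for the "unreachable" metric are all illustrative rather than the emb::6 data layout:<br />

```c
#include <stdint.h>

#define COST_INFINITE 0xFFu  /* illustrative 'unreachable' metric */

typedef struct {
    uint8_t next_hop_router_id;
    uint8_t path_cost;
} route_t;

/* One Bellman-Ford relaxation step; returns 1 if the route improved. */
static int route_relax(route_t *r, uint8_t neighbor_id,
                       uint8_t link_cost, uint8_t advertised_cost)
{
    uint16_t cost = (uint16_t)link_cost + (uint16_t)advertised_cost;

    if (cost >= COST_INFINITE || cost >= r->path_cost)
        return 0;                      /* no improvement, keep old route */

    r->next_hop_router_id = neighbor_id;
    r->path_cost = (uint8_t)cost;
    return 1;
}
```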

All tables that are part of the routing database have been implemented as linked lists. When looking for a routing entry, linked list structures can be used as a mask when iterating through all list entries; the benefit of this approach is that predefined fields can be accessed easily. Since embedded devices are usually subject to memory constraints, we implemented a least recently used (LRU) replacement policy for the Link Set and Route Set. The most recently accessed item is moved to the head of the linked list by modifying the appropriate pointers. As a result, when transmitting fragmented packets the lookup iteration for subsequent fragments terminates after the first list element. When inserting a new element, the last element of the linked list is removed if the number of elements would otherwise exceed a defined maximum.<br />
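The move-to-front behaviour can be sketched as follows; the entry type and field names are illustrative, not the actual emb::6 structures. Touching an entry unlinks it and re-inserts it at the head, so a repeated lookup (e.g. for subsequent fragments) hits the first element:<br />

```c
#include <stddef.h>
#include <stdint.h>

typedef struct route_entry {
    uint16_t dest_rloc16;          /* illustrative lookup key */
    struct route_entry *next;
} route_entry_t;

/* Look up 'dest' and move the matching entry to the head of the list
 * (LRU move-to-front). Returns the possibly new head. */
static route_entry_t *lru_touch(route_entry_t *head, uint16_t dest)
{
    route_entry_t *prev = NULL;
    route_entry_t *cur = head;

    while (cur != NULL && cur->dest_rloc16 != dest) {
        prev = cur;
        cur = cur->next;
    }
    if (cur == NULL || prev == NULL)
        return head;               /* not found, or already at the head */

    prev->next = cur->next;        /* unlink ... */
    cur->next = head;              /* ... and re-insert at the head */
    return cur;
}
```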

Link cost determination<br />

The Link Set stores information about neighboring routers, including the measured link margin (RSSI) in dB. The link margin also plays a leading role in parent selection during the attachment process. The measured one-way link margin may change at runtime due to the noise floor or altered environmental conditions. To smooth out short-term volatility, Thread devices must apply an exponentially weighted moving average (EWMA) to the link margins of each neighbor. Equation (1) shows the EWMA calculation, where M_{t−1} is the currently stored link margin for a specific neighbor, Y_t is the most recently measured link margin and M_t is the newly calculated link margin for that neighbor.<br />

M_t = α · Y_t + (1 − α) · M_{t−1}    (1)<br />

To avoid costly floating point computations on the microcontroller, equation (1) has been rewritten as equation (2).<br />

M_t = (Y_t + (1/α − 1) · M_{t−1}) / (1/α)    (2)<br />

The exponential smoothing factor α (equation (3)) is used as the weighting and is defined as either 1/8 or 1/16 [21].<br />

α = {α ∈ R | 0 ≤ α ≤ 1}    (3)<br />

Since 1/α is a power of two, the multiplication and the division in equation (2) reduce to bit shifts. This allows exploiting the benefits of integer calculations without glaring rounding errors.<br />
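With 1/α = 2^k, equation (2) becomes a shift-and-add update. A minimal sketch, where shift = 3 corresponds to α = 1/8 and shift = 4 to α = 1/16 (function name and integer widths are illustrative):<br />

```c
#include <stdint.h>

/* Integer EWMA of the link margin, avoiding floating point:
 * M_t = (Y_t + (2^shift - 1) * M_{t-1}) >> shift, i.e. alpha = 1/2^shift.
 * Truncation in the final right shift is the only rounding that occurs. */
static uint16_t link_margin_ewma(uint16_t m_prev, uint16_t y, unsigned shift)
{
    uint32_t num = (uint32_t)y + (((1u << shift) - 1u) * m_prev);
    return (uint16_t)(num >> shift);
}
```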

Routing advertisement<br />

Distributed routing algorithms reduce per-node computational costs by sharing route data; with non-distributed algorithms, each node has to expand a graph by incrementally improving path costs. In Thread, nodes acting as routers use MLE advertisements to advertise their routing table to neighboring routers. A practicable approach to determining the rate at which advertisements are sent is to make it depend on the rate of change of the routing data. Thread uses the Trickle algorithm to generate dynamic, randomized transmission windows: if the routing entries are stable, the rate is reduced to a minimum.<br />

The flowchart in Figure 11 shows our implementation of the Trickle algorithm. We use a timer that recalculates its expiration time after each timeout. The limits of the time slots can be defined via C macros. After initialization, the Trickle timer runs independently of other processes.<br />

Figure 11: Flowchart of Trickle timer implementation<br />
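The timer recalculation can be sketched as follows; the interval bounds stand in for the C macros mentioned above and the randomness source is an illustrative placeholder (cf. RFC 6206 for the full Trickle algorithm, which additionally uses a redundancy counter):<br />

```c
#include <stdint.h>
#include <stdlib.h>

#define TRICKLE_IMIN_MS  1000u   /* illustrative lower interval bound */
#define TRICKLE_IMAX_MS 64000u   /* illustrative upper interval bound */

typedef struct { uint32_t interval_ms; } trickle_t;

/* Routing data changed (inconsistency): fall back to the minimum rate. */
static void trickle_reset(trickle_t *t)
{
    t->interval_ms = TRICKLE_IMIN_MS;
}

/* Recompute the next expiration: fire at a random point in the second
 * half of the current interval, then double the interval (capped), so a
 * stable network advertises less and less often. */
static uint32_t trickle_next_timeout(trickle_t *t)
{
    uint32_t half = t->interval_ms / 2u;
    uint32_t timeout = half + (uint32_t)rand() % half;

    if (t->interval_ms < TRICKLE_IMAX_MS)
        t->interval_ms *= 2u;
    return timeout;
}
```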



Unicast packet forwarding<br />

Routing inside a Thread network is performed using RLOC IPv6 addresses. Unicast packets are forwarded by applying a mesh under strategy on the 6LoWPAN layer: routed packets include a 6LoWPAN mesh header carrying the originator and final RLOC16 addresses. When receiving a packet that includes a 6LoWPAN mesh header, a routing table lookup is performed without decompressing the packet. We therefore extended the emb::6 implementation to support mesh under routing for unicast packets. During the MLE joining process, the 16-bit short MAC address is set to the RLOC16 assigned to the router. IPv6 packets from higher layers, e.g. the application layer, usually use the EID of the destination device; routers must then perform an EID-to-RLOC lookup to retrieve the router ID of the destination router. The EID-to-RLOC lookup mechanism consists of CoAP messages targeting CoAP resources provided by routers [22]. The router responsible for the given EID sends a response message including its router ID. Each router maintains an EID-to-RLOC map cache holding a list of recently used lookups; this avoids frequently sending lookup messages when transmitting fragmented packets. An end device, by contrast, has no routing capability and must forward packets to its parent router; in this case the packet is sent without a 6LoWPAN mesh header.<br />
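For illustration, writing a 6LoWPAN mesh header with 16-bit originator and final addresses looks roughly as follows (cf. RFC 4944, section 5.2: dispatch bits 10, then the V and F flags set to 1 for 16-bit short addresses, then 4 bits of HopsLeft). Buffer handling and names are illustrative:<br />

```c
#include <stddef.h>
#include <stdint.h>

/* Write a 6LoWPAN mesh header (RFC 4944) carrying 16-bit originator and
 * final RLOC16 addresses. Returns the number of bytes written. */
static size_t mesh_header_write(uint8_t *buf, uint8_t hops_left,
                                uint16_t orig_rloc16, uint16_t final_rloc16)
{
    /* dispatch: 10 | V=1 | F=1 | HopsLft -> 0xB0 | hops (hops < 15) */
    buf[0] = (uint8_t)(0xB0u | (hops_left & 0x0Fu));
    buf[1] = (uint8_t)(orig_rloc16 >> 8);
    buf[2] = (uint8_t)(orig_rloc16 & 0xFFu);
    buf[3] = (uint8_t)(final_rloc16 >> 8);
    buf[4] = (uint8_t)(final_rloc16 & 0xFFu);
    return 5;
}
```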

Slot Number (ASN). The pairwise assignment of a directed communication between two devices, in a given timeslot on a given channel offset, is called a link.<br />

During a timeslot, one node typically sends a frame, and another sends back an acknowledgement if it successfully receives that frame. If an acknowledgement is not received within the timeout period, retransmission of the frame waits until the next transmit timeslot (in any active slotframe) assigned to that address. Figure 13 shows the structure of transmit and receive timeslots in TSCH mode. Note that CCA before transmission is a configurable option in timeslots.<br />

IV. INTEGRATION OF 6TISCH PROTOCOL<br />

A. 6TiSCH Overview<br />

Slotted networks with guaranteed quality of service are increasingly required and represent an elegant solution for many industrial applications. We therefore decided to integrate the 6TiSCH protocol. Figure 12 shows where this protocol operates within the stack. The 6top layer is a logical link control sitting between the IP layer and the TSCH MAC layer, providing the link abstraction that is required for IP operations. The 6top operations are specified in [23]. The 6top sublayer hides the complexity of the schedule from the upper layers. Time Slotted Channel Hopping (TSCH) is a MAC layer defined in the IEEE 802.15.4e-2012 amendment [24].<br />

Figure 12: 6TiSCH architecture<br />

Figure 13: The structure of transmit and receive timeslots in IEEE 802.15.4e TSCH mode [25].<br />

C. Channel Hopping<br />

One advantage of channel hopping is that it mitigates channel impairments: frequency diversity reduces the effects of interference and multipath fading. It also increases network capacity, since one timeslot can be used by multiple links at the same time.<br />

Figure 14 shows an example of how to calculate the current channel of each link in a given timeslot. In this example, each link rotates through 6 available channels over 6 cycles. The channel is calculated using the following equation:<br />

Ch = CH_Table[(ASN + ChannelOffset) % Number_of_Channels]<br />
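This computation is a direct table lookup; a minimal sketch with an assumed 6-entry hopping table (the IEEE 802.15.4 channel numbers chosen here are only an example):<br />

```c
#include <stdint.h>

/* Illustrative 6-entry channel hopping table (2.4 GHz channel numbers). */
static const uint8_t ch_table[6] = { 11, 14, 17, 20, 23, 26 };

/* Ch = CH_Table[(ASN + ChannelOffset) % Number_of_Channels] */
static uint8_t tsch_channel(uint64_t asn, uint8_t channel_offset)
{
    return ch_table[(asn + channel_offset) % 6u];
}
```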

B. Time Slotted Operation<br />

All nodes in the network are synchronized on a slotted time base. A slotframe is a collection of timeslots repeating in time; the number of timeslots in a given slotframe determines how often each timeslot repeats. The total number of timeslots that has elapsed since the start of the network is called the Absolute<br />

Figure 14: Frequency calculation<br />



D. Synchronization<br />

Device-to-device synchronization is necessary to maintain<br />

connection with neighbors in a slotframe-based network. There<br />

are two methods for a device to synchronize to the network:<br />

- Acknowledgment-based synchronization involves the receiver calculating the delta between the expected time of frame arrival (explained in Figure 15) and its actual arrival, and providing that information to the sender in its acknowledgment. This allows a sender node to synchronize to the clock of the receiver.<br />

1. The transmitter node sends a packet, timestamping the start symbol.<br />
2. The receiver timestamps the actual time of reception of the start symbol.<br />
3. The receiver calculates: TimeAdj = Expected Time − Actual Measured Time<br />
4. The receiver informs the sender of TimeAdj.<br />
5. The transmitter adjusts its clock by TimeAdj.<br />
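These steps reduce to a signed delta that travels back in the acknowledgment; a minimal sketch (tick units and function names are assumptions, not part of any cited stack):<br />

```c
#include <stdint.h>

/* Receiver side: delta between expected and measured start-symbol time.
 * Positive TimeAdj means the frame arrived early, negative means late. */
static int32_t time_adjust(uint32_t expected_ticks, uint32_t actual_ticks)
{
    return (int32_t)(expected_ticks - actual_ticks);
}

/* Transmitter side: apply the TimeAdj reported in the acknowledgment. */
static uint32_t clock_apply(uint32_t clock_ticks, int32_t time_adj)
{
    return clock_ticks + (uint32_t)time_adj;
}
```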

Configuration [I-D.ietf-6tisch-minimal] specification, and does not preclude other scheduling operations from co-existing on the same 6TiSCH network.<br />

- Neighbor-to-Neighbor Scheduling refers to the dynamic adaptation of the bandwidth of the links that are used for IPv6 traffic between adjacent routers.<br />
- Remote Monitoring and Schedule Management refers to the central computation of a schedule and the capability to forward a frame based on the cell of arrival.<br />
- Hop-by-hop Scheduling refers to the possibility of reserving cells along a path for a particular flow using a distributed mechanism.<br />

As Figure 16 shows, with RPL either static or neighbor-to-neighbor scheduling can be used. However, the current implementation supports only static scheduling.<br />

- Frame-based synchronization involves the receiver calculating the delta between the expected time of frame arrival and its actual arrival, and adjusting its own clock by the difference. This allows a receiver node to synchronize to the clock of the sender.<br />
1. The receiver timestamps the actual time of reception of the start symbol.<br />
2. The receiver calculates: TimeAdj = Expected Time − Actual Time<br />
3. The receiver adjusts its own clock by TimeAdj.<br />

Figure 16: Routing, Forwarding and scheduling [26]<br />

V. SUMMARY AND OUTLOOK<br />

With emb::6, a flexible and modular 6LoWPAN stack supporting multiple protocols has been developed, and its basic functionality and performance have been tested using an automated testbed. Next development steps will include the missing commissioning and security parts of the Thread protocol and support for further scheduling mechanisms of the 6TiSCH protocol. The implementation has already been released and is available as an open-source project on GitHub [1].<br />

Figure 15: Time Adjustment calculation [25]<br />

A node will only synchronize to its time-parent, where the tree formed by the time parents is rooted at the gateway. This forms a synchronization tree and ensures that all nodes in the network have a common notion of time.<br />

E. Scheduling<br />

The 6TiSCH architecture identifies four ways a schedule can be managed and CDU cells can be allocated:<br />
- Static Scheduling refers to the minimal 6TiSCH operation whereby a static schedule is configured for the whole network for use in a slotted-aloha fashion. The static schedule is distributed through the native methods in the TSCH MAC layer. It is specified in the Minimal 6TiSCH<br />

REFERENCES<br />

[1] emb::6, https://github.com/hso-esk/emb6.<br />

[2] IEEE 802.15.4-2011, "Part 15.4: Low-Rate Wireless Personal Area Networks (LR-WPANs)," September 2011.<br />

[3] https://tools.ietf.org/html/rfc4944.<br />

[4] https://tools.ietf.org/html/rfc6282.<br />

[5] https://tools.ietf.org/html/rfc6775.<br />

[6] http://www.riot-os.org/.<br />

[7] https://openwsn.atlassian.net/wiki/.<br />

[8] https://openwsn.atlassian.net/wiki/display/OW/uRES.<br />

[9] http://dunkels.com/adam/pt/.<br />

[10] N. M. Phuong, M. Schappacher, A. Sikora, Z. Ahmad, A. Muhammad, "Real-Time Water Level Monitoring using Low-Power Wireless Sensor Network", Embedded World Conference, Feb. 2015, Nuremberg.<br />

[11] A. Yushev, A. Walz, A. Sikora, "Securing Embedded Communication with TLS1.2", embedded world Conference 2015, Nuremberg, 24.-26. Feb. 2015.<br />

[12] RFC7252, The Constrained Application Protocol<br />

(CoAP), June 2014.<br />

[13] ETSI TS 102.690 v2.1.1. Machine-to-Machine<br />

communications (M2M); Functional architecture.<br />

October 2013.<br />

[14] Knight, Steven, "Building Software with SCons", Computing in Science and Engineering 7.1 (2005): 79-88.<br />

[15] Thread Group, Inc., Thread Specification, Revision<br />

1.1.0 (July 2016).<br />

[16] A Sikora, Funknetzwerke für das Internet der Dinge:<br />

6LoWPAN OpenSource-Projekt: emb6, Elektronik<br />

Wireless 2016, (2016).<br />

[17] B Curtis, S Ashon. Thread Open House. Thread Group,<br />

(May 2016).<br />

[18] R. K. Kelsey, Mesh Link Establishment, (Oct. 2013).<br />

[19] ARM mbed 6LoWPAN Stack Overview,<br />

https://docs.mbed.com/docs/arm-ipv66lowpanstack/en/latest/02_N_arch/,<br />

(04.02.2017).<br />

[20] Malkin, G and R Minnear, RIPng for IPv6, RFC 2080,<br />

(Jan. 1997).<br />

[21] T Agami Reddy, Applied data analysis and modeling for<br />

energy engineers and scientists, in. Boston, MA:<br />

Springer US, pp. 253–288 (2011).<br />

[22] Z Shelby, K Hartke, C Bormann, The Constrained<br />

Application Protocol (CoAP), RFC 7252, (2014).<br />

[23] https://tools.ietf.org/html/draft-ietf-6tisch-6top-protocol-<br />

09.<br />

[24] http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=61<br />

85525.<br />

[25] Kris Pister, Chol Su Kang, Kuor Hsin Chang, Rick<br />

Enns, Clint Powell, José A. Gutierrez, Ludwig Winkel,<br />

Time Slotted, Channel Hopping MAC, 1 Sep, 2008.<br />

[26] https://tools.ietf.org/html/draft-ietf-6tisch-architecture-<br />

13.<br />



Unraveling Mesh Networking Options<br />

Benchmarking Zigbee, Thread and Bluetooth Low Energy Mesh Protocol Stacks<br />

Tom Pannell<br />

Senior Director of IoT Marketing<br />

Silicon Labs<br />

Austin, Texas<br />

tom.pannell@silabs.com<br />

Skip Ashton<br />

Vice President of Software<br />

Silicon Labs<br />

Boston, Massachusetts USA<br />

skip.ashton@silabs.com<br />

Abstract—Developers of home and building automation<br />

products have many wireless protocol choices. Zigbee and<br />

proprietary wireless controls dominate these markets today.<br />

Thread and Bluetooth mesh are new entrants to this market.<br />

Bluetooth and Wi-Fi are mature protocols that are also popular.<br />

The deployed networks, regardless of the underlying protocol,<br />

must be robust. The robustness of a network is quantified by<br />

measuring throughput, latency and reliability. These<br />

measurements depend on installation size and other system-level<br />

requirements. “One size does not fit all” when it comes to mesh<br />

networking protocol choices. Each protocol presents unique<br />

characteristics and advantages, depending on the use case and end<br />

application. Understanding the inner workings of mesh technology<br />

goes beyond a list of key features. More importantly, developers<br />

need to understand how these network protocols perform in the<br />

key areas of power consumption, throughput, latency, scalability,<br />

security and Internet Protocol (IP) connectivity. Zigbee, Thread<br />

and Bluetooth mesh are all designed differently from the ground<br />

up, and how the mesh is implemented can have an impact on<br />

performance and robustness.<br />

Keywords— mesh network; Bluetooth, Bluetooth mesh,<br />

embedded IP; low power, IEEE 802.15.4, Zigbee, Thread, BLE<br />

I. HUMANITY’S SEARCH FOR COMFORT AND SAFETY<br />

The desire to control our environment is a central aspect of<br />

human history and behavior, which has led to the establishment<br />

of permanent houses, farms, transportation and communications<br />

infrastructure, and cities. Life as we know it is a result of humans<br />

searching for comfort, convenience and a connection to the<br />

world around them. It is fundamental to the human condition to<br />

want more, to make things easier and create more comfort.<br />

Wireless technologies have developed in our modern era to<br />

enable humans to communicate over long distances and control<br />

aspects of their life to enhance comfort, convenience and<br />

security.<br />

Wireless communication has become part of the fabric of our<br />

daily lives. We have Bluetooth in our phones, Zigbee to control<br />

the buildings where we work and live, proprietary wireless is in<br />

our factories and Z-wave in the security systems that protect our<br />

homes. These wireless technologies exist to make our lives<br />

easier and more efficient. This trend of wireless connectivity has<br />

no end in sight as common objects are becoming increasingly<br />

more connected.<br />

II. WIRELESS CONNECTIVITY<br />

A. Wireless SoCs<br />

Wireless SoCs have become cost effective enough to be<br />

added to the “things” that provide us convenience, safety and<br />

comfort daily. A “thing” becomes an “IoT” device when<br />

wireless connectivity is added. Many of today’s IoT devices<br />

were previously things that didn’t have wireless connectivity to<br />

the Internet. Changing regulations and consumer expectations<br />

are forcing product manufacturers to add wireless connectivity<br />

to a myriad of products and systems to meet regulatory<br />

requirements, stay competitive or create the potential for new<br />

revenue streams.<br />

When developers choose to build IoT devices, they must<br />

consider how the end product is used and the ecosystem in which<br />

these products will operate.<br />

B. Types of Wireless Networks<br />

There are many competing wireless technologies in the IoT.<br />

Two basic topologies exist: mesh and star (Figure 1). Mesh is<br />

often preferred over star networks in home and building<br />

automation due to mesh’s ability to scale to many more nodes<br />

and cover long distances. Star networks rely on a point-to-point<br />

connection between an end-node and a central device. If the<br />

environment changes after the network is installed, a star<br />

network can fail. Mesh, on the other hand, is distributed and self-healing.<br />

If the environment changes or a node fails after the<br />

network is deployed, the mesh network can heal itself.<br />



C. Which Network is Best for Home and Building Automation?<br />

Zigbee is commonly used in building and home automation.<br />

More recently Thread and Bluetooth mesh are being considered<br />

for these applications. Z-wave is a mesh technology that is<br />

popular in home security and home automation. However, this<br />

paper does not cover Z-wave due to the lack of access to a<br />

comparable test network where results can be verified.<br />

Home and building automation includes a combination of<br />

energy harvesting devices, battery-powered devices and line-powered<br />

devices. Lighting and thermostats are typically line<br />

powered because they are part of the infrastructure, but that<br />

doesn’t mean power consumption can be ignored. Devices that<br />

are part of the infrastructure and are AC-powered must be<br />

managed carefully due to new government regulations limiting<br />

“vampire power.” Batteries usually power remote sensors and<br />

control elements. That means the mesh must comprehend two<br />

fundamentally different use cases from a power perspective.<br />

III. USE CASES<br />

There are many possible use cases in home and building<br />

automation. A few are discussed below.<br />

A. Comfort Use Case<br />

Consider, for example, lighting and environmental control in<br />

a theater or museum. These installations usually have hundreds<br />

to thousands of nodes. The lights, motors for curtains and blinds<br />

need to be controlled in a precise and choreographed way. All<br />

the lights need to dim simultaneously, and the motors controlling<br />

the curtains should all work in concert. Slight differences are<br />

noticeable and would detract from the experience of the<br />

audience. The home has similar requirements. If you are creating<br />

a scene with lights and window shades, the user expects a<br />

seamless and choreographed experience where all lights dim<br />

simultaneously and all window shades move in unison.<br />

B. Safety Use Case<br />

An environment like a warehouse may have different<br />

lighting needs than a theater. Often the lights are turned on in a<br />

section simultaneously. However, it doesn’t really matter if<br />

those lights turn on together or if it takes a few seconds for all to<br />

illuminate. The user experience and expectation are different.<br />

On the other hand, if certain lights need to turn on quickly due<br />

to a power outage, suddenly time does matter.<br />

C. Convenience Use Case<br />

A developer may want to add additional services to the<br />

wirelessly controlled lights in the warehouse described above. It<br />

may not matter if every light turns on in unison in the<br />

installation. However, it could matter how robust the network is<br />

if the developer wants to add additional services. A service that<br />

is gaining popularity in mesh installations is asset tracking. In<br />

this instance, the designer relies on the control network to also<br />

transmit data about the assets being tracked by the installed<br />

infrastructure. In this example, throughput and latency matter in<br />

terms of how quickly the asset information will propagate<br />

through the network.<br />

D. Which Mesh Protocol Is Best?<br />

The answer is not so simple. There are fundamental<br />

architectural differences between Zigbee, Thread and Bluetooth<br />

mesh. Zigbee and Thread can use flooding when required but<br />

generally use a routing mesh to minimize network overhead that<br />

can interfere with messaging. Bluetooth mesh uses a flooding<br />

mesh but allows configuration of the devices to act as routers to<br />

reduce the impact of the flooding. The Bluetooth Special Interest<br />

Group (SIG) calls this “managed flooding” [1].<br />

Zigbee and Thread networks include routing nodes and end<br />

nodes. The routing nodes are usually line powered and serve as<br />

the backbone to the mesh. The end nodes are normally battery<br />

powered, operating on the periphery of the mesh, and use routers<br />

to relay messages for them. The routing table is established when<br />

the mesh is created. The routing table is a directory of sorts that<br />

tells each device how to communicate to other devices in the<br />

mesh. In this manner, one node can efficiently communicate to<br />

another node by sending messages in a precise route through the<br />

mesh. This has a positive effect on throughput of the mesh and<br />

can reduce latency as the mesh grows.<br />
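The next-hop routing just described can be sketched as a per-node table lookup. The node names, table contents, and `route` helper below are hypothetical illustrations, not Zigbee or Thread stack code.

```python
# Sketch: each node holds a routing table mapping a destination to the
# next hop, so a message follows a precise route instead of flooding.

def route(routing_tables, src, dst, max_hops=16):
    """Follow next-hop entries from src to dst; return the path taken."""
    path = [src]
    node = src
    while node != dst:
        if len(path) > max_hops:
            raise RuntimeError("routing loop or unreachable destination")
        node = routing_tables[node][dst]  # look up the next hop toward dst
        path.append(node)
    return path

# A hypothetical 4-node line topology: A - B - C - D
tables = {
    "A": {"D": "B"},
    "B": {"D": "C"},
    "C": {"D": "D"},
}

print(route(tables, "A", "D"))  # ['A', 'B', 'C', 'D']
```

Because the tables are built when the mesh forms, each forwarding step is a constant-time lookup, which is why routing scales better than flooding as the mesh grows.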

Routing mesh is historically preferred to a flooding mesh<br />

because it provides more efficient communications and<br />

predictable performance. On the other hand, it is more difficult<br />

to implement for the developers of the stack.<br />

IV. PACKET STRUCTURE<br />

A. Zigbee and Thread Packet Structure<br />

Both Zigbee and Thread use IEEE 802.15.4 with 127 byte<br />

packets and an underlying data rate of 250 kbps. While the PHY<br />

headers are the same, the packet structure is different, resulting<br />

in slightly different payload sizes. Zigbee packet format is<br />

shown in Figure 2 and results in a 68 byte payload. For payloads<br />

above 68 bytes, Zigbee fragments into multiple packets. Thread<br />

packet format is shown in Figure 3 and results in a 63 byte<br />

payload. For payloads above 63 bytes, the Thread stack<br />

fragments using 6LoWPAN. Silicon Labs’ mesh performance<br />

data is based on payload size as this is the design parameter of<br />

concern when building an application.<br />

As noted above, each of these networks fragments larger<br />

messages into smaller ones. For Zigbee, fragmentation occurs at<br />

the application layer and is performed end to end from the source<br />

to the destination. For Thread, the fragmentation is done at the<br />

6LoWPAN layer, also end to end from source to destination.<br />



For unicast forwarding within these networks, the message<br />

is forwarded as soon as the device is ready to send. For multicast<br />

forwarding, there generally are networking requirements for<br />

how messages are forwarded. These include:<br />

a. For Zigbee devices, a multicast message is forwarded<br />

by a device only after jitter of up to 64 milliseconds. However,<br />

the initiating device has a gap of 500 milliseconds before<br />

retransmitting the initial message.<br />

b. For Thread devices, RFC 7731 MPL forwarding is<br />

used. The trickle timer is set to 64 milliseconds so the devices<br />

back off a random amount up to this time before retransmitting.<br />
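The multicast timers above can be sketched numerically. The 64 ms relay jitter, 500 ms initiator gap, and 64 ms trickle interval come from the text; the model itself (uniform jitter, worst-case per-hop accumulation) is an illustrative simplification, not either stack's actual scheduler.

```python
# Sketch of the multicast forwarding timers described in the text.
import random

RELAY_JITTER_MS = 64           # max random delay before a device forwards
ZIGBEE_INITIATOR_GAP_MS = 500  # gap before the initiator retransmits
THREAD_TRICKLE_MS = 64         # RFC 7731 MPL trickle timer interval

def relay_delay_ms(rng=random):
    """Random forwarding delay for one relaying device (Zigbee or Thread)."""
    return rng.uniform(0, RELAY_JITTER_MS)

def worst_case_flood_ms(hops):
    """Upper bound on multicast propagation if every hop waits the full jitter."""
    return hops * RELAY_JITTER_MS

print(worst_case_flood_ms(7))  # 448 ms across a 7-hop network
```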

B. Bluetooth LE Packet Structure<br />

Bluetooth low energy has the following packet structure to<br />

minimize time on air and energy consumption.<br />

Bluetooth mesh further refined this packet structure to add<br />

the mesh and security capabilities.<br />

IVI/Network ID (1 byte) | CTL/TTL (1 byte) | Sequence Number (3 bytes) | Source Address (2 bytes) | Dest Address (2 bytes) | Packet Payload (12 or 16 bytes) | NWK MIC (4 or 8 bytes)<br />

Figure 4 – Bluetooth Mesh Packet Format<br />

This means Bluetooth mesh has only 12 or 16 bytes available<br />

for payload, and beyond this the packets are segmented into<br />

individual packets and reassembled at the destination. This<br />

segmented packet carries a header identifying the segment and<br />

12 bytes of application payload except for the last segment<br />

which can be shorter. However, there are additional backoff<br />

requirements in the Bluetooth mesh specification that space out<br />

these segmented packets, increasing latency and decreasing<br />

throughput. As all of our throughput and latency analysis is<br />

based on application payload, we can see that Bluetooth mesh<br />

will require more packets than Zigbee or Thread because of this<br />

lower packet payload size.<br />
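The fragmentation arithmetic above can be made concrete. The payload limits (68 bytes per Zigbee packet, 63 per Thread packet, 12 bytes per Bluetooth mesh segment once a message must be segmented) come from the text; the helper is a sketch of the ceiling division, not any stack's implementation.

```python
# Sketch: how many over-the-air packets/segments a payload fragments into.
import math

MAX_PAYLOAD_BYTES = {"zigbee": 68, "thread": 63, "bt_mesh_segment": 12}

def packets_needed(protocol, payload_bytes):
    """Packet count for a given application payload (ceiling division)."""
    return math.ceil(payload_bytes / MAX_PAYLOAD_BYTES[protocol])

for proto in ("zigbee", "thread", "bt_mesh_segment"):
    print(proto, packets_needed(proto, 100))
# A 100-byte payload needs 2 Zigbee packets, 2 Thread packets, and
# 9 Bluetooth mesh segments -- the root of the latency and throughput
# penalty discussed for Bluetooth mesh.
```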

V. ROUTING VS FLOODING MESH<br />

As stated previously, Zigbee, Thread and Bluetooth mesh<br />

were designed for home and building automation. Zigbee<br />

supports several routing techniques including flooding of the<br />

mesh for route discovery or group messages, next hop routing<br />

for controlled messages in the mesh, and many-to-one routing to<br />

a gateway, which then uses source routing out to devices. It is<br />

normal for a Zigbee network to use all of these methods<br />

simultaneously.<br />

Thread also supports next hop routing as well as flooding.<br />

However, thread networks maintain next hop routes to all routers<br />

as part of normal network maintenance instead of a device<br />

performing route discovery. Thread also minimizes the number<br />

of active routers to address scalability to large networks.<br />

Previously, this has been viewed as a limitation for embedded<br />

802.15.4 networks because the network flooding in the presence<br />

of a large number of routers limited the frequency and reliability<br />

of multicast traffic. Note that the thread network manages the<br />

number and spacing of active routers, and user intervention or<br />

management is not required.<br />

Bluetooth mesh supports managed flooding. This is a slight<br />

spin on flooding mesh in that the user can designate which<br />

powered devices participate in the flooding. This will reduce the<br />

impact of flooding but requires the user to determine the<br />

appropriate density and topology for routers in their network and<br />

this can be difficult. As network conditions change over time,<br />

which devices participate in the flood may also need to change<br />

and this would require user intervention. Bluetooth also has end<br />

devices similar to Zigbee or Thread, and these are called<br />

“friendship” devices. A friendship device is coupled with an<br />

adjacent powered node, and packets for the friend are stored by<br />

the line-powered node. The friend will wake periodically to ask<br />

the neighbor if there are any packets. The powered node only<br />

saves the packet for a defined period of time so the “friend”<br />

needs to check in with its paired relay node.<br />

Figure 5 – Bluetooth Mesh Example [2]<br />
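The friendship behavior just described amounts to store-and-forward with an expiry: the line-powered node buffers packets for its sleepy "friend", which polls periodically, and buffered packets are only held for a defined time. The class, names, and timings below are illustrative, not taken from the Bluetooth mesh specification.

```python
# Sketch of a friend node's buffer with a hold-time limit.
from collections import deque

class FriendQueue:
    def __init__(self, hold_time_ms):
        self.hold_time_ms = hold_time_ms
        self.q = deque()  # entries of (arrival_ms, packet)

    def store(self, now_ms, packet):
        """Line-powered node buffers a packet destined for the friend."""
        self.q.append((now_ms, packet))

    def poll(self, now_ms):
        """Friend wakes and collects packets; expired ones are dropped."""
        fresh = [p for t, p in self.q if now_ms - t <= self.hold_time_ms]
        self.q.clear()
        return fresh

fq = FriendQueue(hold_time_ms=1000)
fq.store(0, "on")
fq.store(900, "off")
print(fq.poll(1100))  # ['off'] -- the first packet aged out before the poll
```

This is why the friend must poll often enough: anything buffered longer than the hold time is lost.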

Our study of mesh topologies analyzes both small and large<br />

networks. These networks can behave very differently, and the<br />

routing and management techniques often need to change when<br />

considering a 10-node network or a 200-node network.<br />

Typically, in a small network, devices are within 1 or 2 hops<br />

and very simple routing or flooding can be suitable. As the<br />

network grows in size, it adds complexity such as more hops<br />

between devices, density of devices which may interfere with<br />

each other when sending messages, and more concern over<br />

latency and reliability. If a flood type message is used to turn on<br />

100 lights, it is normally not acceptable for 98 or 99 of the 100<br />

lights to turn on or off. This type of problem is rare in a 10-node<br />

network and may become common in a 100-node network.<br />
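A back-of-the-envelope model shows why this failure mode scales with node count: if each light independently receives a group message with probability p, the chance that all N lights respond is p**N. The p value below is an assumption for illustration, not a measured figure.

```python
# Sketch: probability that every node in a group responds to one command.
def all_nodes_ok(p_per_node, n_nodes):
    """Chance that all n_nodes receive the message, assuming independence."""
    return p_per_node ** n_nodes

p = 0.999  # assumed per-node delivery probability (illustrative)
print(round(all_nodes_ok(p, 10), 3))   # ~0.99: failures rare at 10 nodes
print(round(all_nodes_ok(p, 100), 3))  # ~0.905: roughly 1 in 10 group
                                       # commands leaves a light unchanged
```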

VI. FIGURES OF MERIT<br />

In the previously cited use cases, the designer desires a<br />

robust network for the application. The figures of merit to be<br />

measured in assessing the robustness of a network are<br />

throughput, latency and reliability. These three measurements<br />

can accurately predict the robustness of a network for a given<br />

installation.<br />

Throughput defines the scalability of the network (how<br />

many devices can be sending normal traffic) and also the<br />

behavior for higher data operations such as pushing a firmware<br />

update to devices.<br />

Latency describes how long it takes for an action to happen.<br />

It is a critical parameter for any interaction involving end users<br />

(as opposed to machine-to-machine communications) as most<br />

people can detect operations that take longer than 100<br />

milliseconds. For processes where simultaneous operation is<br />



desired, such as turning on multiple lights, the timing must be<br />

lower than 100 ms so that end users do not complain of a<br />

“popcorn” effect as lights turn on in succession.<br />

Reliability is taken for granted, but when interacting with<br />

everyday devices such as lights and switches, users expect<br />

nearly 100 percent reliability. As a matter of practice, Silicon<br />

Labs tests to 99.999 percent reliability.<br />

These are the most critical aspects of the mesh network to<br />

measure and strongly relate to the design goals for devices and<br />

wireless systems no matter what technology or underlying<br />

wireless is used.<br />

VII. TEST SETUP<br />

To minimize the variability of device testing, the test can be<br />

performed in fixed topologies where the RF paths are wired<br />

together through splitters and attenuators to ensure the topology<br />

does not change over time and testing. This is used for the 7 hop<br />

testing to ensure the network topology. MAC filtering can also<br />

be used to achieve the network topology.<br />

Large network testing is best conducted in an open-air<br />

environment where device behavior is based on existing and<br />

varying RF conditions. The Silicon Labs lab in Boston,<br />

Massachusetts (USA) is used for this open-air testing process.<br />

The wireless conditions in the open-air testing environment<br />

have typical Wi-Fi and Zigbee traffic present as noise. This<br />

traffic comes from a typical building control system that operates<br />

independently of the test network and of any tests being performed.<br />

[Chart: latency in milliseconds (0–200) versus hops (1–7); series: 10- and 20-byte Thread, 8- and 11-byte Bluetooth unsegmented, 8- and 16-byte Bluetooth segmented]<br />

Figure 6 – Latency vs Hops<br />

This chart (Figure 6) shows the average latency per hop for<br />

a Thread network versus Bluetooth mesh unsegmented and<br />

segmented packets. Zigbee data is not included as it is similar to<br />

Thread. In this example, we can see for these smaller payloads<br />

the Bluetooth unsegmented and Thread latency is very similar<br />

out to 6 hops. As we add the Bluetooth segmented packet and<br />

increase the payload to 16 bytes, the latency increases<br />

substantially due to the additional packets being transmitted.<br />

[Chart: latency in milliseconds (0–1200) versus payload size in bytes (10–290) at 4 hops, comparing Thread and Bluetooth mesh]<br />

Figure 7 – Thread vs Bluetooth Mesh Latency<br />

Looking at 4 hop data with increasing payload (Figure 7),<br />

Bluetooth mesh has higher latency as it has to use segmented<br />

messages. This shows the importance of Bluetooth mesh devices<br />

trying to keep payloads within one packet to avoid this increased<br />

latency in applications where it is an important factor.<br />

Silicon Labs has published additional details about mesh<br />

network performance testing in the application note, AN1132.<br />

VIII. TEST RESULTS<br />

Comprehensive test results are available in AN1132<br />

published by Silicon Labs.<br />

IX. CONCLUSION<br />

The choice of mesh network depends on the end application<br />

or ecosystem. There are many established ecosystems such as<br />

Philips Hue, Amazon Echo Plus, Comcast Xfinity and countless<br />

others. If a device manufacturer wants to interoperate with these<br />

ecosystems, Zigbee is an optimal choice.<br />

If the ecosystem has not been specified for the application,<br />

then many protocol choices are available. Thread and Bluetooth<br />

mesh are both viable options and the most commonly considered<br />

aside from Zigbee. Development tools provided by the IC<br />

vendor matter greatly in terms of how quickly a mesh can be<br />

developed. Tools such as packet trace and a multi-node energy<br />

profiler can ensure whichever mesh is chosen will be robustly<br />

designed. Ultimately, the network size, the required latency,<br />

desired throughput and overall reliability will drive the choice of<br />

mesh protocols.<br />

ACKNOWLEDGMENT<br />

The authors would like to thank Matteo Paris, Dave Fiore,<br />

Alex Showers, Hannu Mallat and Petri Pitkanen, all from Silicon<br />

Labs, for their tireless work to collect and analyze mesh network<br />

performance.<br />



REFERENCES<br />

[1] Woolley, M. (2017, August 1). “An Intro to Bluetooth Mesh Part 2.”<br />

https://blog.bluetooth.com/an-intro-to-bluetooth-mesh-part2<br />

[2] Di Marco, P., Skillermark, P., Larmo, A., & Arvidson, P. (2017, July 22).<br />

“Bluetooth Mesh Networking.”<br />

https://www.ericsson.com/en/publications/white-papers/bluetooth-mesh-networking<br />



Meeting the Challenge of Coexistence<br />

in the Connected Home<br />

Brian G. Bedrosian<br />

VP of Marketing, IoT Business Unit<br />

Cypress<br />

San Jose, California<br />

Abstract—The increasing number of IoT-based devices in the<br />

home has led to a densely populated network. Without coexistence<br />

measures at the hardware, software, and system levels, the<br />

performance, reliability, and quality of the user experience will be<br />

negatively affected. This paper details coexistence requirements,<br />

considers the impact of mesh networks, and explores technologies<br />

like Real Simultaneous Dual Band (RSDB) that enable<br />

collaborative coexistence.<br />

Keywords—IoT; Connected Home; Coexistence; RSDB;<br />

Bluetooth Mesh; Collaborative Coexistence; Managed Arbitration;<br />

Global Coexistence Interface; GCI; Connected Car; Connected<br />

Auto;<br />

I. INTRODUCTION<br />

As could be seen across nearly the entire show floor at the<br />

Consumer Electronics Show this year, the connected smart<br />

home is a reality. The proliferation of connected devices and<br />

the variety of Internet of Things (IoT) applications is<br />

staggering. According to Cisco, we can expect there to be as<br />

many as 50 billion “things” and more than 50 million homes<br />

connected to the Internet by 2020. Everything seems to be<br />

getting connected, from our alarm clocks to our lights to our<br />

kitchens. Our bodies are becoming connected in a multitude of<br />

manners through a wide range of sensors. Even our pets and<br />

livestock have been affected by the IoT.<br />

The IoT was initially driven by smartphones. Smartphones<br />

provided a simple way for users to interface with connected<br />

systems. In addition, the availability of standardized and<br />

TCP/IP networked wireless technologies has led to the<br />

adoption of many protocols in the 2.4 GHz band, including Wi-Fi,<br />

Bluetooth, Bluetooth Low Energy (BLE), and 802.15.4 applied as<br />

ZigBee and Thread.<br />

Early smart home applications focused on adjusting the<br />

temperature, controlling lights, and streaming media. With low<br />

cost, low power wireless technology, it has become possible to<br />

add intelligence to the home in a way that goes far beyond<br />

simple sensors or digital entertainment. More advanced<br />

applications driving greater innovation today include home<br />

security, water quality monitoring, pollution detection, smart<br />

appliances, and many others.<br />

II. COMPETITION WITHIN THE CONNECTED HOME<br />

Increasing the number of connected devices, however, is<br />

making the home a densely populated network (see Figure 1).<br />

Further complicating the issue is the potential of other nearby<br />

densely populated networks. In an apartment building, for<br />

example, the connected home could be surrounded on all sides<br />

by other networks attempting to utilize the same spectrum<br />

simultaneously.<br />

Fig. 1. The increasing number of connected devices has turned the home into<br />

a densely populated network.<br />

Additionally, the proliferation of voice services like Alexa<br />

require greater quality of service (QoS) amidst increasing<br />

multimedia streaming to maintain their value. The need to<br />

support advanced capabilities like voice services is on the<br />

rise. According to Consumer Intelligence Research Partners,<br />

Amazon has sold 20 million Echo units since 2015, with 15<br />

million of these sold in the last 12 months. Similarly, Google<br />

Home units have seen tremendous growth, with sales estimated<br />

at 7 million since their debut last year [1]. The connected<br />

home, one that can listen and respond to us, continues to<br />

evolve, as does its wireless needs.<br />

With so many wireless technologies sharing spectrum in the<br />

connected home, there is a critical need for robust wireless<br />



coexistence measures to ensure throughput performance,<br />

reliability, and quality across multiple radios and use cases.<br />

Consider the increasing importance of reliability and fidelity in<br />

the connected home. Audio distribution across several<br />

simultaneous channels is becoming the norm, and any<br />

significant interruption or interference will negatively impact<br />

the user experience.<br />

Without some form of coexistence measures in place, the<br />

connected home will not be able to provide the level of<br />

responsiveness, reliability, or fidelity consumers are<br />

demanding. There is often too much competition for<br />

bandwidth. There are a great many protocols in use, and as<br />

each uses different methods for securing bandwidth, this<br />

creates additional contention. Aggravating this problem is that<br />

many devices are designed as if they are the only connected<br />

device in the room. They don’t take into account how crowded<br />

the home is getting.<br />

For example, some devices “talk” too much, consuming<br />

more than a reasonable share of the available bandwidth. They<br />

might also interrupt other devices in their communications<br />

because their data is of a “higher” priority. Some devices<br />

broadcast with more power than they need, effectively shouting<br />

over quieter, more cooperative devices. This creates<br />

undesirable contention, triggering retransmissions, reducing<br />

effective range, and increasing the difficulty of finding a clear<br />

slot in which to broadcast. This in turn lowers the<br />

quality of real-time streaming, potentially resulting in glitches a<br />

user can hear or see.<br />

The bottom line is that when devices don’t coexist with<br />

each other, bandwidth is wasted, reliability drops, and quality<br />

suffers. In truth, as the number of connected devices continues<br />

to rise in our homes, coexistence may well determine the<br />

success and rate of adoption of IoT technologies in the home.<br />

III. THE NEED FOR COEXISTENCE<br />

Coexistence refers to well-defined measures that manage<br />

medium access and connection when radios in the same<br />

location are operating simultaneously in adjacent or<br />

overlapping radio frequency spectrums using different<br />

protocols.<br />

The three most common coexistence issues are, from most<br />

to least severe:<br />

Overlapping spectrum use, such as happens in the 2.4<br />

GHz band between Wi-Fi, Bluetooth, and 802.15.4<br />

Adjacent frequency spectrum<br />

Harmonics and intermodulation distortion<br />

To provide effective coexistence that addresses these<br />

issues, the following requirements typical of area-constrained<br />

environments like the connected home must be met:<br />

Multi-use: Users perceive that they need simultaneous<br />

operation across several devices, such as viewing<br />

photos while listening to high-fidelity audio and issuing<br />

voice commands to adjust the lights.<br />

Frequency Domain: It may be the case that the<br />

requirements for applications coexisting together are<br />

higher than is required for standalone operation.<br />

Devices that cannot provide enough performance to<br />

meet these additional requirements may end up bringing<br />

down the performance of every device with which<br />

they are sharing bandwidth.<br />

Time Domain Arbitration: Allocation of bandwidth<br />

must take into account the varying QoS requirements of<br />

each device, application, and potentially user.<br />

To be able to meet these requirements, coexistence must be<br />

considered at the beginning of the design process, otherwise<br />

devices may not operate properly when they are deployed in<br />

densely populated networks. In addition, developers must also<br />

consider usage scenarios at the outset of design so these<br />

requirements can be clearly outlined, understood, and<br />

addressed. If usage scenarios are not considered, developers<br />

may find that they are locked out of certain systems due to<br />

incompatible design choices. Finally, sufficient bandwidth<br />

must be allocated for each operation/application with<br />

appropriate quality of service (QoS) capabilities to ensure a<br />

quality user experience.<br />

A traditional approach to coexistence is to use frequency<br />

domain methods that prevent radios from using the same<br />

spectrum. Overlapping bands between protocols, however,<br />

prevents this approach from being the sole effective approach<br />

in the connected home.<br />

IV. COLLABORATIVE COEXISTENCE<br />

Devices targeted for the connected home will need to be<br />

able to operate under a wide range of challenging conditions.<br />

Failure to do so may prevent devices from achieving sufficient<br />

coexistence. Such devices may have poor market acceptance<br />

from consumer reviews once it is discovered that they offer<br />

substandard performance compared to devices that have been<br />

designed with coexistence as an early design priority. And, if<br />

the industry cannot work together as a whole to ensure better<br />

coexistence, this will delay overall adoption of the connected<br />

home.<br />

For devices to work together to share bandwidth, they need<br />

a mechanism to manage arbitration across devices, protocols,<br />

and applications. Collaborative coexistence provides a<br />

methodology by which Wi-Fi, Bluetooth, and 802.15.4 can be<br />

collocated. To be successful, managed arbitration must be<br />

applied at all levels of design:<br />

Hardware: Apply spatial separation and filters<br />

Software: Time domain multiplexing of radios with<br />

time synchronization and radio frames<br />

System: Combined intelligence between devices<br />

through mechanisms like Packet Traffic Arbitration and<br />

the Global Coexistence Interface<br />

Collaboration at the hardware level takes place between<br />

radios collocated in the same device. Leading Wi-Fi /<br />

Bluetooth combination chips implement highly sophisticated<br />

hardware mechanisms and algorithms that provide enhanced<br />

collaborative coexistence between subsystems. These<br />



mechanisms enable Wi-Fi and Bluetooth to operate<br />

simultaneously while ensuring maximum access time<br />

utilization and high throughput. This is essential to guarantee<br />

QoS for applications such as wireless audio.<br />

Integrating collaborative coexistence capabilities in<br />

connected devices can provide optimal performance and a<br />

superior user experience. Coexistence begins with the interface<br />

between the Wi-Fi / Bluetooth and 802.15.4 subsystems.<br />

Collaboration between Wi-Fi and Bluetooth can be<br />

implemented according to IEEE 802.15.2 Packet Traffic<br />

Arbitration and using the Global Coexistence Interface (GCI).<br />

This interface must be simple to facilitate effective arbitration.<br />

Instead of using a system bus where coexistence signals would<br />

have to complete with other system messages, a dedicated 3-<br />

wire coexistence interface enables optimal signaling between<br />

subsystems.<br />

Figure 2 shows the GCI in action. Only three signals are<br />

used for handshaking. The Bluetooth subsystem asserts<br />

RF_ACTIVE to request antenna access before transmitting and<br />

uses STATUS to indicate both the priority and the Tx/Rx slots.<br />

TX_CONF then allows or denies the request: if antenna access<br />

is granted, TX_CONF is asserted; if not, it remains de-asserted.<br />

When the Bluetooth subsystem completes its transaction, it<br />

de-asserts RF_ACTIVE.<br />

Fig. 3. The Global Coexistence Interface (GCI) is a dedicated 3-wire<br />

interface that works with an external ZigBee radio to maximize Wi-Fi<br />

throughput, voice quality, and link performance.<br />

Managed arbitration is essential to the success of the<br />

connected home. A unified connectivity coexistence<br />

framework like collaborative coexistence is needed to enable a<br />

superior IoT experience across multiple radio technologies.<br />

This widely supported industry effort encompasses best<br />

practices like APIs, programmability to accommodate different<br />

use cases and environments, and support for multiple OSes and<br />

platforms. Collaborative coexistence makes optimal use of the<br />

spectrum to provide the best user experience. It also facilitates<br />

an effective software ecosystem with interoperability and<br />

openness to inspire innovation and new business models.<br />

V. REALIZING THE PROMISE OF BLE MESH<br />

New Bluetooth Mesh technology promises to accelerate the<br />

adoption of connected home technology by simplifying<br />

provisioning and management of nodes in applications such as<br />

smart lighting control and home medical applications. BLE<br />

meshes are being used to deploy a great variety of wireless<br />

sensors and controllers, both line-powered and battery-operated.<br />

Users will be able to easily set up secure networks<br />

and directly control devices through their mobile devices (see<br />

Figure 4).<br />

Fig. 2. The Global Coexistence Interface (GCI) uses only three signals for<br />

handshaking.<br />

The decision-making part of this exchange is done by the<br />

Packet Traffic Arbitrator (PTA). The PTA allows or denies the<br />

Bluetooth subsystem’s initial request, as well as its request for<br />

access to the antenna. Note that the PTA has a real-time<br />

information exchange with the Bluetooth subsystem to<br />

determine its priority. In general, high-priority requests are<br />

granted.<br />

Note that the GCI can be applied by external radios such as<br />

Thread and ZigBee to access the PTA. Figure 3 shows how the<br />

dedicated 3-wire coexistence interface works with an external<br />

ZigBee radio. Three GPIO signals are used for the interface:<br />

WLAN Request, WLAN Priority, and WLAN Grant. Using the<br />

prioritization approach between data types and applications,<br />

optimal performance can be achieved, resulting in maximum<br />

Wi-Fi throughput, voice quality, and link performance.<br />
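The arbitration decision described above can be sketched as a priority comparison: each radio asserts a request line with a priority, and the PTA grants the antenna to the highest-priority active request. The signal names echo the text (request/priority/grant), but the policy below is a deliberate simplification, not the actual chip logic.

```python
# Sketch of a Packet Traffic Arbitrator (PTA) grant decision.

def pta_grant(requests):
    """requests: dict of radio -> (request_asserted, priority).
    Returns the radio granted antenna access, or None if no requests."""
    active = {r: prio for r, (asserted, prio) in requests.items() if asserted}
    if not active:
        return None
    # Highest priority wins; ties broken by name for determinism.
    return max(sorted(active), key=lambda r: active[r])

# Bluetooth asserts its request with high priority while a lower-priority
# WLAN request is pending:
grant = pta_grant({"bluetooth": (True, 2), "wlan": (True, 1)})
print(grant)  # bluetooth
```

In the real interface the grant is signalled back by asserting TX_CONF (or WLAN Grant for an external radio); here the return value stands in for that line.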

Fig. 4. BLE Mesh technology simplifies provisioning and management of<br />

nodes in applications such as smart lighting control and home medical<br />

applications. However, meshes without coexistence measures can create<br />

bursts of interference that can disrupt real-time functions within the connected<br />

home like voice control and audio streaming.<br />

The increasing presence of BLE Mesh in the home,<br />

however, is only going to aggravate coexistence issues and<br />

introduces a unique set of challenges for developers. Consider<br />

a mesh-based home lighting system. Rather than connecting<br />

through the power line, each light has its own wireless radio.<br />

To keep cost and energy consumption down, this radio uses<br />

BLE and communicates to the home controller through other<br />

BLE-enabled lights, thus creating a mesh. To turn on a specific<br />

light, the controller sends out a message to the nearest<br />

connected light, which then relays the message to another light<br />

until the desired light is reached.<br />

Without some point of intelligent control, this mesh can<br />

wreak havoc on the connected home environment. A mobile-based<br />

application might broadcast commands continuously to<br />

ensure the message gets through in a timely fashion. Each node<br />

in the mesh will do the same. As a result, such commands may<br />



create a burst of interference that visibly disrupts real-time<br />

communications such as voice control or audio playback.<br />

Mesh coexistence is ideally implemented outside of any application requirements, so that the reliability of the home environment does not depend on every app developer correctly applying Bluetooth Mesh coexistence measures. Rather, coexistence technology is built into a dedicated mesh controller.<br />

Figure 5 shows the command flow for an Android device<br />

controlling a BLE mesh-based lighting system. The Host<br />

Controller Interface (HCI) enables the Android device to<br />

communicate with the dedicated mesh controller via a UART<br />

connection, thus completely abstracting all wireless transmit,<br />

receive, and coexistence functionality. The dedicated mesh<br />

controller is then able to schedule mesh communications<br />

collaboratively with the rest of the Wi-Fi / Bluetooth connected<br />

home environment to implement coexistence measures<br />

effectively and ensure reliable communications.<br />
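The HCI-over-UART link mentioned above uses the standard Bluetooth UART (H4) framing: a one-byte packet indicator, a 16-bit little-endian opcode, a one-byte parameter length, and the parameters. The sketch below frames such a command; the mesh opcode and payload layout are hypothetical, since the real command set is defined by the mesh controller firmware:

```python
import struct

# Illustrative HCI command framing for a UART (H4) transport:
# 0x01 packet indicator, 16-bit little-endian opcode, 1-byte
# parameter length, then the parameters themselves.
def hci_command(opcode: int, params: bytes) -> bytes:
    return struct.pack("<BHB", 0x01, opcode, len(params)) + params

# Hypothetical vendor-specific opcode for a "set light" mesh command;
# the actual opcodes are defined by the mesh controller firmware.
MESH_SET_LIGHT = 0xFC10
packet = hci_command(MESH_SET_LIGHT, bytes([0x05, 0x01]))  # node 5, ON
print(packet.hex())  # 0110fc020501
```

The Android host simply writes such frames to the UART; all wireless transmit, receive, and coexistence behavior stays inside the controller.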

Fig. 5. Coexistence can be transparently implemented in BLE meshes with a<br />

dedicated mesh controller using a Wi-Fi / Bluetooth enabled platform like<br />

Cypress’ Wireless Internet Connectivity for Embedded Devices (WICED) as<br />

shown here. Rather than require every application to apply coexistence<br />

correctly, the dedicated mesh controller abstracts coexistence functionality<br />

and collaborates with the rest of the connected home network to ensure<br />

reliable communications.<br />

VI. REAL SIMULTANEOUS DUAL BAND (RSDB)<br />

Work continues to be done to improve the efficiency of<br />

wireless technology through coexistence. One of the recent<br />

innovations available to developers is WLAN Real<br />

Simultaneous Dual Band (RSDB) technology.<br />

RSDB brings the capabilities of a high-end router to IoT<br />

applications. By collocating 2.4 GHz Wi-Fi, 5 GHz Wi-Fi, and<br />

Bluetooth, a single WLAN RSDB controller can implement coexistence measures that ensure optimal use of bandwidth, real-time fidelity, and overall network reliability.<br />

Collocating wireless radios in this way greatly simplifies<br />

many elements of the connected home network. Users have a<br />

more satisfying experience because they can seamlessly use<br />

any part of the bandwidth all of the time. Because wireless<br />

traffic aggregates in the RSDB controller, capacity can be<br />

efficiently allocated across all available spectrum to multiple<br />

users and for all use cases. A centralized controller also makes<br />

it possible to support multiple independent streaming channels<br />

in a manner that eliminates contention to maximize QoS and<br />

quality of content.<br />

Bandwidth utilization is optimized through fully concurrent<br />

operation in 2.4 GHz and 5 GHz bands between Wi-Fi and<br />

Bluetooth across all active applications. Because the controller<br />

manages coexistence for all traffic it carries, the losses due to<br />

interference from contention can be substantially reduced. As a<br />

result, RSDB is capable of providing full Bluetooth throughput<br />

(>2 Mbps), 802.11n throughput (> 50 Mbps), and 802.11ac<br />

throughput (> 300 Mbps), all at the same time without<br />

degradation.<br />

System design can also be simplified through the use of a<br />

dongle-based architecture. This refers to the ability of the<br />

controller to offload certain tasks from the host processor and<br />

simplify system integration. For example, 802.11 processing of<br />

Ethernet packets exchanged between the controller and host<br />

can be handled on-chip. Additional offloading includes<br />

Preferred Network Offload (PNO) and Address Resolution<br />

Protocol (ARP) processing.<br />

To further ease design and integration, RSDB technology<br />

can be implemented in a processor- and operating-system-agnostic manner. This makes it much easier to introduce RSDB<br />

into environments like the connected home.<br />

VII. THE CONNECTED CAR<br />

One of the key markets for RSDB beyond the connected<br />

home is the connected car. Rising use of Wi-Fi in vehicles,<br />

added to existing Bluetooth usage, is only going to increase<br />

capacity demands. In addition, the densely populated nature of<br />

the car presents a highly challenging environment for<br />

coexistence, and RSDB is a key technology for supporting true<br />

use-case concurrency.<br />

For example, a family of four will typically have two to<br />

four cell phones in addition to a tablet or two. One cell may be<br />

delivering navigation information, several streaming music,<br />

while the tablets stream video. The car itself could have active voice controls, be tethered to a phone, and share displays.<br />

Any of these devices could also be accessing the Internet or<br />

providing hotspot capabilities for another device. This is a<br />



tremendous number of radios and real-time data streams to accommodate simultaneously in such a confined space.<br />

To provide reliable connectivity that can stream quality<br />

video and high-fidelity audio, the connected car needs<br />

technology like RSDB. Its dual-band capabilities can provide<br />

the needed throughput and reliability. For example, the 2.4<br />

GHz band could be used for real-time audio streaming and data<br />

delivery while the 5 GHz band is used to carry streaming<br />

video. Aggregating data in this way makes it easier to<br />

interweave data without negatively impacting quality (see<br />

Figure 6).<br />

Fig. 6. The dual-band capabilities of RSDB enable the 5 GHz band to carry<br />

video and the 2.4 GHz band to interweave real-time audio streaming and data<br />

delivery without negatively impacting quality.<br />
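Such a band-assignment policy can be sketched as a simple lookup. The rules below are illustrative assumptions; a real RSDB controller schedules traffic dynamically based on load and coexistence state:

```python
# Illustrative static band-assignment policy for an RSDB controller:
# latency-sensitive audio and bulk data on 2.4 GHz, high-bandwidth
# video on 5 GHz (assumed rules matching the example in the text).
def assign_band(stream_type: str) -> str:
    policy = {
        "audio": "2.4GHz",
        "data":  "2.4GHz",
        "video": "5GHz",
    }
    return policy.get(stream_type, "2.4GHz")  # default to 2.4 GHz

print(assign_band("video"))  # 5GHz
```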

The IoT is growing quickly, and our homes and cars are<br />

only going to get more crowded. Wireless technologies like<br />

BLE, 802.11ac, and RSDB are essential for the IoT to move<br />

forward. By implementing collaborative coexistence measures<br />

in hardware, software, and at the system level, developers can<br />

ensure the performance, reliability, and fidelity of the<br />

connected home.<br />

REFERENCES<br />

[1] CIRP, 2017<br />

[2] Cypress WICED ® IoT Developer Community:<br />

www.cypress.com/wicedcommunity; 2017<br />



Building IoT Solution Effectively<br />

Simon Chudoba<br />

IQRF Alliance z.s., CEO<br />

Jicin, Czech Republic<br />

simon.chudoba@iqrf.org<br />

The Internet of Things is a young but very promising market segment that is catching the attention of many companies all around the world. Technicians and business people alike want to realize a simple, fast and cost-effective Proof of Concept project to evaluate both the technical and business aspects of a specific use case. This is not a simple task, since IoT is a very complex area with hundreds of elements that must fit together. The goal of the members of the IQRF Alliance is to provide these elements, from end devices through gateway hardware and software up to clouds and mobile apps, so that building up an IoT project is a matter of a couple of days. How far we have come, what is ready and what challenges lie ahead are the key questions answered in this paper.<br />

Internet of Things, IQRF Technology, IQRF Alliance,<br />

IQRF Ecosystem, Wireless Mesh Network, fog/edge<br />

computing<br />

I. INTRODUCTION<br />

IoT seems to be, and at the end of the day must be, very simple. For the user it should be just a matter of using a smart phone or tablet to monitor, manage and control a home, business, city or any other “thing”. On the other hand, if you take a closer look at the IoT ecosystem, you realize it is a large puzzle of dozens or rather hundreds of pieces that must fit together.<br />

Building a well-working solution, and doing it easily, quickly and cost-effectively, is a big challenge even for a very experienced team. There is no company worldwide that can realize an IoT project from A to Z: manufacturing all components, writing all software, running its own clouds, providing its own mobile apps, marketing the solutions, deploying, maintaining and supporting them.<br />

Fig. 1. Internet of Things puzzle<br />

This is why you need 1) an open community providing 2) an ecosystem of ready elements for building an IoT solution quickly and effectively.<br />

With this challenge in mind, and with the proven wireless mesh technology IQRF [1] in hand, we started building the IQRF Alliance [2] a couple of years ago, so that you can find all the necessary IoT elements in one place and get your IoT pilot project up and running within a couple of days.<br />

II. ALLIANCE – BUILDING IOT COMMUNITY<br />

Although we are talking about the Internet of Things here, first of all you need to bring together people who will analyze customer needs, develop and manufacture appropriate devices, put together reasonable solutions and provide valuable services to end customers. We believe that the best way to do this is to build a community of cooperating commercial and non-profit entities sharing the same goals and values.<br />

IQRF Alliance is an open international community of IoT<br />

professionals (developers, manufacturers, cloud providers,<br />

telco operators, system integrators, research and innovation<br />

centers, technical high schools and universities) providing<br />

wireless solutions for IoT and M2M communication based on<br />

the IQRF platform.<br />

The IQRF Alliance focuses on three areas: community, interoperability and promotion.<br />

COMMUNITY<br />

In the community area we focus on real and effective cooperation among the members: system integrators share the needs arising from their market opportunities with manufacturers and SW and cloud providers, so that these partners develop what the end customer really needs. The IQRF Alliance also supports joint pilot projects, since we see them as the most effective way to build and sell IoT solutions with significant added value for the end customer. Two examples of joint IoT projects can be found in Section V of this document, and more at [3].<br />

The IQRF Alliance currently (October 2017) has around 80 members from 17 countries [4], and the number is steadily growing. The member portfolio is very broad, ranging from global corporations through successful SMEs to small start-ups.<br />

Fig. 2. Members of the IQRF Alliance, October 2017<br />

INTEROPERABILITY<br />

The IQRF platform, specifically the IQRF DPA framework [5], provides built-in wireless compatibility, so devices from different manufacturers can communicate in one wireless mesh network. The trouble was that each device was usually controlled with different commands and provided data in a slightly different structure (based on manufacturer preference). This made integration of devices from different manufacturers more complex and prevented the use of key IQRF functions such as Fast Response Commands [6].<br />

The IQRF Alliance members therefore agreed to standardize the most commonly used commands and sensor/meter quantities. In October 2017 the IQRF Alliance released the first version of the IQRF Interoperability Standard and published it on its website [7]. The standardization enables controlling devices without integrating special commands, and reading sensor/meter data without special parsing algorithms.<br />

Every certified device gets a unique HWPID (Hardware Profile ID), so a gateway or cloud can recognize what type of device is connected. Currently (October 2017) the IQRF Alliance is testing the IQRF Repository, which contains all relevant information about certified products so that a gateway or cloud can download it automatically. In the second stage, the Repository will include drivers for IQRF-certified devices, so a gateway can start controlling these devices automatically.<br />
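A gateway-side HWPID lookup can be sketched as follows. This is a hypothetical, simplified view of a repository-style lookup; the IDs and record fields are illustrative assumptions, not the actual IQRF Repository schema:

```python
# Hypothetical, simplified view of an IQRF-Repository-style lookup:
# the gateway reads the HWPID a node reports and uses it to fetch the
# matching product record (the IDs and fields below are illustrative).
REPOSITORY = {
    0x002A: {"product": "CO2/temperature/humidity sensor", "standard": "IQRF Sensor"},
    0x0FC1: {"product": "Relay actuator", "standard": "IQRF Binary Output"},
}

def identify(hwpid: int) -> str:
    record = REPOSITORY.get(hwpid)
    if record is None:
        return "unknown device (not certified or not in repository)"
    return f"{record['product']} ({record['standard']})"

print(identify(0x002A))  # CO2/temperature/humidity sensor (IQRF Sensor)
```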

Fig. 3. IQRF Ecosystem<br />

PROMOTION<br />

The third key area covered by the IQRF Alliance is promotion of products and solutions based on IQRF Technology. The IQRF Alliance uses different channels to communicate the benefits of the IQRF Ecosystem to IoT professionals, such as its website, social media, participation in conferences and exhibitions, organization of the IQRF Summit and local meet-ups, and much more.<br />

III. ECOSYSTEM – BUILDING IOT PORTFOLIO<br />

In order to build your IoT solution effectively, you need ready-made components so you don't have to waste time developing everything from A to Z. That would not only be a very time- and money-consuming process, but would also require considerable skills and know-how.<br />

With this in mind, the IQRF Alliance supports its members in preparing ready-made devices, software, clouds, services, mobile apps, etc., so that putting together an IoT solution really is a job of just a couple of days.<br />

Fig. 4. IQRF Ecosystem [8]<br />

In the following text we will describe the key attributes of all levels of an IoT solution. That said, as the Alliance is focused on wireless connectivity, we will not go into much detail on the cloud level in this document.<br />

A. Wireless connectivity<br />

One of the first challenges of any IoT solution is last-mile communication. The well-known and massively used wireless technologies such as GSM/LTE, WiFi or Bluetooth do not fit the needs of most IoT use cases well: low power, high numbers and high density of connected devices, low data rates, reliability, security, and so on.<br />

Thus, there is a boom of new technologies for IoT, especially in the area of Wireless Wide Area Networks (WWAN), such as LoRa or Sigfox. These technologies are designed mainly for collecting data from remote sensors. On the other hand, there are many IoT use cases that the features and parameters of WWAN technologies do not fit well either. These are typically real-time (local) control applications (lights, heating, air conditioning, motors) and deep-indoor applications (large buildings, underground, tunnels, industrial operations, etc.).<br />

For these types of applications, wireless mesh networking technologies are a much better fit.<br />



Fig. 5. Positioning of IQRF Technology<br />

IQRF<br />

IQRF [1] is a mature technology connecting devices to the IoT via wireless mesh networks. IQRF provides simple integration, standards-based security, interoperability of end devices, robust and reliable mesh networking, low-power operation and full bidirectional communication.<br />

TABLE I. BASIC IQRF PARAMETERS<br />

SW: OS + DPA + Appl. + SDK<br />
Band: 433 / 868 / 916 MHz<br />
Network topology: mesh<br />
Range (device-to-device): 500+ meters<br />
Range (device-to-gateway): tens of kilometers<br />
Native multi-hop: 240 hops per packet<br />
Routing algorithm: oriented flooding<br />
Security: multilayer, AES-128, dynamic keys<br />
Directionality: bidirectional<br />
End devices OTA management: for all operations needed<br />
Main benefit: easy adoption / reliability<br />
Low power: several years on a battery<br />

There is no technology fitting every use case. Table II lists the typical parameters of projects that IQRF fits best:<br />

TABLE II. IQRF BEST-FIT TYPE OF PROJECTS<br />

Data acquisition: sensor / operation data – tens of bytes<br />
Control: actuators (ON/OFF, dimming, rotation, ...)<br />
Gateway: local control and data processing (fog/edge computing)<br />
Number of nodes per GW: tens / hundreds<br />
Ready infrastructure and signal coverage: not needed<br />
Cost of wireless operation: free of charge<br />
Density of nodes: < 200 m from each other to ensure robust (redundant) mesh networking<br />
Environment: outdoor / indoor / deep indoor / RF harsh<br />
Power: ultra-low-power – 5+ years on battery (a)<br />
OTA upgrades: yes, all levels (OS, plug-ins, custom app.)<br />
Robustness and reliability: very high due to mesh networking<br />
Cloud: any cloud, standard protocols (MQTT, https)<br />
(a) Depends on use case, type of battery, etc.<br />

As a consequence of these typical project parameters, Table III lists the typical use cases of the IQRF technology:<br />

TABLE III. IQRF USE CASES<br />

Smart City: street lighting, street parking, traffic monitoring and control, environment sensors, waste management, ...<br />
Smart Building: indoor / emergency / design lighting, HVAC control, environment monitoring, metering, operation monitoring, ...<br />
Industry 4.0: machine and tool monitoring, employee and forklift tracking, infrastructure monitoring, ...<br />

B. End devices<br />

In order to be flexible when putting together your IoT project, you need a wide range of interoperable sensors and actuators.<br />

Interoperable means that the devices not only communicate in one network, but also that the actuators are controlled with the same commands and the sensors provide data in the same structure. Interoperability thus significantly simplifies the integration of devices from multiple manufacturers in one network.<br />

An overview of available IQRF end devices can be found at [8].<br />

C. Gateways<br />

In the IQRF Ecosystem, gateways are the key component of the whole design. Gateways do not only provide a link from the IQRF network to the Internet; they are the control unit of the complete IQRF network. This means that they collect data from sensors, analyze them, and control actuators in the network based on the results. Naturally, they also report data up to a connected cloud and receive commands from the cloud or users.<br />



This “fog/edge computing” approach enables far greater flexibility and reliability than standard cloud-controlled installations and is the future of real-time IoT.<br />

When talking about IQRF gateways, we mean not only hardware but also the included software and remote management.<br />

HARDWARE<br />

Regarding gateway hardware, the goal is to be as independent as possible of any specific hardware and to let the integrator choose it according to his priorities. Nowadays virtually any Linux computer can operate as an IQRF gateway. In general, two things are needed:<br />

an IQRF transceiver connected to the gateway through an SPI or USB connector/protocol<br />

the IQRF Daemon – universal software which can control the IQRF network and communicate with a cloud or a mobile app<br />
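The daemon's link to a cloud or app typically carries small JSON messages over MQTT. The sketch below shows what such a request might look like; the field names are illustrative assumptions for the sketch, not the exact iqrf-daemon JSON API:

```python
import json

# Sketch of a request that a cloud service or mobile app could publish
# over MQTT for the gateway daemon to execute. The field names below are
# illustrative assumptions, not the exact iqrf-daemon JSON API.
def make_request(node_addr: int, command: str, request_id: int) -> str:
    msg = {
        "msgid": request_id,  # lets the app match the asynchronous response
        "addr": node_addr,    # network address of the target IQRF node
        "cmd": command,       # e.g. "READ_SENSORS" or "SET_OUTPUT"
    }
    return json.dumps(msg)

payload = make_request(node_addr=3, command="READ_SENSORS", request_id=1)
print(payload)
# An MQTT client would publish `payload` to the gateway's request topic
# and subscribe to a response topic carrying the matching "msgid".
```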

IQRF DAEMON<br />

Fig. 6. IQRF Daemon [9]<br />

The IQRF Daemon is the second key building block of an IQRF gateway. It provides all the necessary services for controlling an IQRF network: gateway configuration, remote access through the UDP protocol, a link to local user applications through MQ messaging, and communication with a remote cloud or app through MQTT messaging.<br />

REMOTE GATEWAY MANAGEMENT<br />

A very important, or rather must-have, service is remote management of the gateways. If you need to perform an upgrade or change a configuration, you must be able to do it not only remotely but also for dozens or even hundreds of gateways simultaneously. You can check the IQRF-ready remote gateway management systems from RehiveTech at [10].<br />

D. Remote visualization and control<br />

Another important layer of any IoT solution is data storage, analysis, visualization and the user control interface. Currently these tasks are usually covered by cloud solutions, mobile applications and integration platforms.<br />

The IQRF Ecosystem is fully open to any cloud solution that communicates over standard protocols such as MQTT or HTTPS. Thanks to this, you as a system integrator or customer have full flexibility to use any cloud or platform.<br />

The IQRF Alliance cooperates with providers and integrators of the key cloud services, such as Microsoft Azure or IBM Bluemix, as well as with smaller cloud service providers such as Inteliments, CIS or CTI software.<br />

Part of the IQRF Ecosystem is also a universal mobile app by Master Internet that enables you to build and control an IQRF network directly from your cell phone.<br />

IV. HOW TO BUILD YOUR IOT PILOT PROJECT<br />

In the previous paragraphs we described what you need as a baseline to start your IoT pilot project and realize it effectively. In this chapter we focus on a step-by-step guide to building a simple IoT solution and extending it into a real IoT installation.<br />

A. Start with the IoT Starter Kit<br />

Members of the IQRF Alliance joined forces and put together a starter kit [11] containing all you should need to start your IoT project.<br />

Fig. 7. IoT Starter Kit by IQRF Alliance members<br />

There are two IQRF wireless kits – a sensor kit providing temperature, illumination and potentiometer inputs, and a relay kit. These kits are enough for you to learn how to collect data from sensors and how to control actuators. You can get the kits up and running, connected in a wireless mesh network, using the IQRF IDE by following the online video tutorials [12].<br />

The UP board is a computer for makers and professionals, bridging the gap between hobby and industrial computers [13]. It usually serves as a gateway controlling the IQRF wireless mesh network and connecting it to the Internet through Ethernet, WiFi, GSM or LTE.<br />

STEP-BY-STEP GUIDE<br />

To make the UP board work as an IQRF gateway, you need to do the following steps:<br />

1. Install and configure Linux<br />



2. Install and configure IQRF Daemon that will handle<br />

the control of your IQRF network<br />

3. Install Node-RED for basic control of your network.<br />

4. Install MQTT broker so you can get connected to one<br />

of the supported cloud services such as Microsoft<br />

Azure, IBM Bluemix, etc.<br />

Everything you need to realize these steps is available on the IoT Starter Kit GitHub [14].<br />

B. Add more end-devices<br />

There is a growing portfolio of IQRF interoperable devices – both sensors and actuators. You can see the complete portfolio of IQRF-related products, solutions and services on the IQRF Marketplace [8] and purchase end-device samples at the IQRF Alliance e-shop [15]. You can select the devices you need, purchase them in one single e-shop and bond them to your wireless network.<br />

C. Test different Software from Github<br />

Just as you can extend your solution with different end devices, you can also test different software for your gateway. Go to the IQRF GitHub extensions [16], where you can download software and/or demo access to different services free of charge.<br />

D. Test different clouds and mobile apps<br />

There are a number of cloud and mobile app providers and integrators in the IQRF Alliance providing access to Microsoft Azure, IBM Bluemix, Inteliglue, Master App, etc. Based on the documentation available on the IQRF GitHub, you can test different products and find the one that best fits the goals of your IoT project [16].<br />

E. Work with a system integrator<br />

Potential cooperation with a system integrator depends on the scale of your project, the experience of your team and the timeframe you have for realizing it. You can do everything yourself, or you can cooperate with system integrators or consultants who can help you get your project up and running much faster and more effectively.<br />

V. CASE STUDIES<br />

This paper would be mere theory without mentioning real case studies in which the approach described above was taken. More case studies can be found at [3].<br />

A. Air quality monitoring in a Prague school<br />

IDEA<br />

Based on the assumption that the air in schools is bad, and that students therefore have concentration problems, Protronix and its partners (O2 IT Services, IQRF Alliance, MICRORISC, Camea, ...) decided to carry out a four-month measurement campaign. The CO₂, temperature and relative humidity values were monitored. The data were continuously analyzed, followed by recommendations for ventilation and other corrective actions.<br />

SOLUTION<br />

This solution consists of:<br />

10 combined sensors of CO₂, temperature and relative humidity<br />

an IQRF wireless mesh network for data transfer<br />

a UP-board-based gateway enabling data transfer from the IQRF network to the TCP/IP network<br />

O2 data storage and a web application with visualization of the measured data<br />

RESULTS<br />

Fig. 8. CO2 concentration graph of a monitored classroom<br />

As a result, it was found that minimum recommended<br />

values of relative air humidity had not been reached for most of<br />

the school time and maximum allowed CO₂ values had been<br />

exceeded for almost half of the time. These variables and their<br />

values are directly linked to the concentration and health of<br />

students.<br />

CONCLUSION<br />

Thanks to the use of ready-made end devices, a gateway and a remote management system, this project was very easy and cost-effective to realize.<br />

B. Water metering<br />

IDEA<br />

CETIN, the key Czech telecommunication infrastructure provider, wanted to evaluate IQRF technology and compare it with W-MBUS for reading data from water meters. The goal was to do so cost- and time-effectively.<br />

SOLUTION<br />

CETIN involved five members of the IQRF Alliance in this project:<br />

Mainstream technologies providing integration services,<br />

data analysis in MS Azure and visualization in<br />

PowerBI;<br />

AAEON providing UP board as a gateway;<br />

IQRF Tech with IQRF Daemon and customization<br />

services;<br />

RehiveTech with their remote management system of<br />

gateways<br />

Bitspecta providing W-MBUS / IQRF protocol bridge<br />

RESULTS<br />

As a result, using the services and products of other IQRF Alliance members, CETIN was able to evaluate the benefits of mesh networking in the area of water metering within a very limited budget and time frame.<br />

There are many other running projects where the cooperation of members and the use of ready IQRF ecosystem elements are the key to successful and effective pilots.<br />



VI. FURTHER DEVELOPMENT<br />

As you might expect, we are definitely not stopping where we are. There is plenty of work ahead to make your life, as a user of the IQRF Ecosystem, even easier.<br />

A. Community<br />

As mentioned, the community is the base of the whole ecosystem. We will involve more partners with more skills in the IQRF community, so that the overall flexibility of the Alliance steadily grows.<br />

B. Ecosystem<br />

From the IQRF Ecosystem perspective, we see the weakest point in the limited portfolio of ready end devices and gateways. This is a challenge and an opportunity for manufacturers to develop and produce the sensors and actuators the market needs.<br />

C. Standard<br />

The IQRF Alliance will not only extend the current standard but will also develop an online repository of IQRF-certified devices, so building up a wireless network will literally be “plug-and-play”.<br />


VII. CONCLUSION<br />

The Internet of Things is a very complex ecosystem, and enabling pilot projects to be done quickly and effectively is not a simple task. The IQRF Alliance has taken a number of steps to prepare a ready and open ecosystem that includes not only end devices and gateways but also gateway software, clouds, mobile apps, services and development tools.<br />

These days (October 2017), using the IoT Starter Kit you can build your device-to-screen IoT solution in a matter of hours, and within a couple of days extend it, using different end devices and software, into a pilot-project solution. There is a limited number of end devices you can use for your project at the moment, but the portfolio is growing quickly, following market needs.<br />

You can join us in our effort to make the IoT really useful and cost-effective for the end user. You are always welcome to come on board the IQRF Alliance.<br />

VIII. REFERENCES<br />

[1] IQRF Technology, www.iqrf.org<br />

[2] IQRF Alliance, www.iqrfalliance.org<br />

[3] IQRF Alliance, Case studies http://iqrfalliance.org/case-studies/<br />

[October, 2017]<br />

[4] IQRF Alliance members http://iqrfalliance.org/alliance<br />

[5] IQRF DPA Framework http://iqrf.org/technology/dpa<br />

[6] Fast Response Command, Youtube tutorial<br />

https://www.youtube.com/watch?v=kK48A9MMfQU<br />

[7] IQRF Interoperability Standard http://www.iqrfalliance.org/techDocs/<br />

[8] IQRF Ecosystem http://iqrfalliance.org/products<br />

[9] IQRF Daemon https://github.com/iqrfsdk/iqrf-daemon<br />

[10] RehiveTech Management System<br />

http://iqrfalliance.org/case-studies/remote-iqrf-network-managementsystem-by-rehivetech<br />

[11] IoT Starter Kit, http://iqrfalliance.org/product/iot-starter-kit<br />

[12] IQRF Tutorials http://iqrf.org/support/video-tutorial-set<br />

[13] Up Board http://www.up-board.org/up<br />

[14] IoT Starter Kit GitHub https://github.com/iqrfsdk/iot-starter-kit<br />

[15] IQRF Alliance eshop https://iqrf.shop/<br />
[16] IoT Starter Kit SW and cloud extensions https://github.com/iqrfsdk/iot-starter-kit/tree/master/extensions<br />



Supporting multiple protocols (BLE/IEEE 802.15.4)<br />

concurrently in a single chip<br />

Steve Urbanski<br />

Secure Connected MCU<br />

NXP Semiconductors<br />

Chicago, IL<br />

Steve.Urbanski@nxp.com<br />

Abstract—The Internet of Things (IoT) landscape is expanding, driving the need for devices to support multiple<br />

protocols in a single chip. Bluetooth Low Energy (BLE) is readily<br />

available in many personal devices today, which makes it a good technology to use for command and control of IoT networks. IEEE 802.15.4 is a mature technology that enables low-power mesh networks such as Zigbee and Thread, which makes it a good technology to use for an IoT network.<br />

This paper addresses ways to enable multi-protocols running<br />

BLE and IEEE 802.15.4 concurrently on a single chip. It reviews<br />

techniques that need to be considered in both hardware and<br />

software and the limitations encountered when supporting<br />

networks running in both technologies. This paper will also<br />

review practical use cases that require the use of these<br />

techniques and the need for supporting multiple protocols<br />

concurrently in a single chip.<br />

Keywords—Multiprotocol; Bluetooth Low Energy; BLE; IEEE<br />

802.15.4; IoT; Thread; Zigbee<br />

I. INTRODUCTION<br />

One of the challenges of supporting multiple protocols<br />

concurrently in a single chip is that the radio resource needs to<br />

be shared. Therefore the utilization of the radio for each<br />

technology needs to be carefully considered, more specifically,<br />

how much time each technology needs the radio resource for<br />

normal operation.<br />

This paper will review the communication fundamentals for<br />

normal BLE and 802.15.4 operation, focusing on the amount of<br />

time the radio resource is needed to complete fundamental<br />

tasks. It will then discuss concurrent operation of BLE and<br />

802.15.4 and techniques to use for supporting concurrent<br />

operation of these technologies in a single chip.<br />

Lastly, this paper will analyze a practical use case using an<br />

NXP KW41Z device running BLE and 802.15.4 concurrently<br />

while utilizing some of the techniques described in this paper.<br />

It will review the packet error rates (PER) of different<br />

experiments and discuss what network parameters make a<br />

difference and how to configure an error free network.<br />

II. BLE COMMUNICATION FUNDAMENTALS [1]<br />

The BLE Link Layer has five defined states. Four of them<br />

are non-connected states – Standby, Advertising, Scanning and<br />

Initiating – and one is defined as the Connection State. This<br />

paper will focus on operations in the Connection State and the<br />

timing fundamentals associated with this state as it relates to<br />

sharing the radio resource on a single chip.<br />

A. Connection State<br />

The Connection State is entered when an Advertiser and an<br />

Initiator successfully exchange connection Protocol Data Units<br />

(PDUs). When the two devices enter the Connection State, one<br />

takes on the Master Role, the other takes on the Slave Role.<br />

The master is in control of the timing of the Connection Event.<br />

Each Connection Event starts with the master sending a Data<br />

PDU to the slave. The slave may respond depending on the<br />

timing. This paper will review three timing parameters of the<br />

connection state – the Connection Interval (CI), the Slave<br />

Latency (SL) and the Supervision Timeout (STO).<br />

The Connection Interval is the time between connection<br />

events. Its value is a multiple of 1.25 ms in the range of 7.5 ms<br />

to 4.0 s. In many systems, a value around 50 ms is commonly<br />

observed.<br />

The Slave Latency is the number of consecutive connection<br />

events a slave can ignore before it needs to respond to the<br />

master. This helps a slave device save power by allowing it to<br />

sleep for longer periods of time. Its value is an integer in the<br />

range of 0 to ((STO / (CI * 2)) – 1) but can be no larger than<br />

500.<br />

If the connection gets lost for any reason, the Supervision<br />

Timeout is used as a fallback mechanism to prevent a device<br />

from getting stuck in the connection state. If no communication<br />

is received within the supervision timeout period, the device<br />

will exit the connection state and transition to the standby state.<br />

Its value is a multiple of 10 ms in the range of 100 ms to 32.0 s<br />

and needs to be larger than (1 + SL) * 2 * CI.<br />
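The interplay of these three parameters can be captured in a short validity check. The sketch below (illustrative Python, not any vendor API) encodes exactly the constraints just described:

```python
# Validate BLE connection parameters against the constraints above:
# CI is a multiple of 1.25 ms in [7.5 ms, 4.0 s]; STO is a multiple of
# 10 ms in [100 ms, 32 s] and must exceed (1 + SL) * 2 * CI; SL is at
# most 500 and at most STO / (CI * 2) - 1. Illustrative helper only.

def validate_ble_connection(ci_ms, sl, sto_ms):
    if not (7.5 <= ci_ms <= 4000) or (ci_ms * 100) % 125 != 0:
        return False  # CI must be a multiple of 1.25 ms in range
    if not (100 <= sto_ms <= 32000) or sto_ms % 10 != 0:
        return False  # STO must be a multiple of 10 ms in range
    if sto_ms <= (1 + sl) * 2 * ci_ms:
        return False  # STO must be larger than (1 + SL) * 2 * CI
    if sl < 0 or sl > min(500, sto_ms / (ci_ms * 2) - 1):
        return False  # SL capped at 500 and by the supervision timeout
    return True
```

For instance, the commonly observed CI of 50 ms with SL = 2 is valid with a 4 s supervision timeout, while a 7.5 ms CI with a large slave latency quickly violates the STO bound.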

Fig. 1 shows an example of the connection state with a<br />

slave latency of 2.<br />

www.embedded-world.eu<br />



Figure 1: An example of the connection state with a slave latency of 2 (Supervision Timeout = 6 × Connection Interval).<br />

III. IEEE 802.15.4 COMMUNICATION FUNDAMENTALS<br />

The IEEE 802.15.4 communication protocol [2] is used as<br />

the lower Medium Access Control (MAC) and Physical (PHY)<br />

software layers of Zigbee [3] and Thread networks [4]. They<br />

both use a contention based access mode to access the shared<br />

channel which utilizes a Carrier Sense Multiple Access with<br />

Collision Avoidance (CSMA-CA) backoff algorithm.<br />

There are four frame types defined:<br />

- Beacon: Used for synchronization and broadcast of data<br />

- Data: Used for data transmission. Maximum payload size is 127 octets<br />

- MAC command: Used to carry MAC management commands<br />

- Acknowledgement: Used to acknowledge data and command frames. Frame size is 8 octets (256 us)<br />

This paper will focus on the Data Frame type since it is the<br />

largest frame type and predominately used in an active<br />

network. The maximum data frame size consists of 127 bytes<br />

for the MAC PDU and 6 bytes for the PHY header. This equals<br />

a max frame size of 133 bytes (4.256 ms).<br />
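These durations follow directly from the 2.4 GHz 802.15.4 PHY rate of 250 kbit/s, at which one octet takes 32 us on air; a quick check:

```python
# Frame durations at the 802.15.4 2.4 GHz PHY rate of 250 kbit/s:
# one octet = 8 bits / 250 kbit/s = 32 us on air.

US_PER_OCTET = 32

def frame_duration_us(octets):
    return octets * US_PER_OCTET

assert frame_duration_us(133) == 4256  # max data frame -> 4.256 ms
assert frame_duration_us(8) == 256     # acknowledgement frame as sized above
```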

The transmission of Data and MAC Command frames<br />

utilize the unslotted CSMA-CA backoff algorithm whereas the<br />

Beacon and Acknowledgement frames use no checking<br />

mechanism for transmission.<br />

The unslotted CSMA-CA algorithm maintains two<br />

variables for each transmission attempt<br />

NB: The number of times the CSMA-CA algorithm<br />

was required to backoff while attempting the current<br />

transmission. This value shall be initialized to zero<br />

before each new transmission attempt, where:<br />

o NB = NB + 1 for every unsuccessful transmission<br />

attempt<br />

o If NB > macMaxCSMABackoffs, the transmission is<br />

considered a failure, where<br />

• macMaxCSMABackoffs is an integer value<br />

between 0 and 5<br />

BE: The Backoff Exponent defines how many backoff<br />

periods a device shall wait before attempting to assess a<br />

channel. BE is initialized to the value of macMinBE,<br />

where<br />

o BE = min(BE + 1, macMaxBE) for every<br />

unsuccessful transmission attempt<br />

o macMinBE is an integer value between 0 and<br />

macMaxBE<br />

o macMaxBE is an integer value between 3 and 8<br />

o one backoff period is equal to aUnitBackoffPeriod<br />

symbols which is defined to be 20. This translates to<br />

320 us.<br />

Note that if macMinBE is set to zero, collision<br />

avoidance will be disabled during the first iteration of<br />

this algorithm.<br />

For each transmission attempt, the transmitter waits for a<br />

random number of backoff units between 0 and (2^BE – 1). After<br />

the delay, a Clear Channel Assessment (CCA) is performed. If<br />

the channel is clear, the transmitter proceeds. If not, the NB<br />

and BE variables are updated as follows<br />

BE = min(BE + 1, macMaxBE)<br />

NB = NB + 1<br />

If NB > macMaxCSMABackoffs, the transmission is<br />

considered a failure, otherwise the transmitter waits<br />

again using a new backoff delay value as described<br />

above.<br />
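The backoff loop above can be sketched as follows. `cca_clear` is a stand-in callable for the radio's Clear Channel Assessment, and the parameter defaults mirror the NXP example values; this is an illustrative sketch, not the NXP implementation:

```python
import random

# Sketch of the unslotted CSMA-CA backoff loop described above.
# Returns (success, number_of_backoffs_used).

UNIT_BACKOFF_US = 320  # aUnitBackoffPeriod = 20 symbols at 16 us/symbol

def csma_ca_transmit(cca_clear, mac_min_be=3, mac_max_be=5,
                     mac_max_csma_backoffs=4):
    nb, be = 0, mac_min_be
    while True:
        # wait a random number of backoff units in [0, 2^BE - 1];
        # in a real MAC the radio would idle for this long
        delay_us = random.randint(0, 2 ** be - 1) * UNIT_BACKOFF_US
        if cca_clear():
            return True, nb        # channel clear: proceed to transmit
        nb += 1
        be = min(be + 1, mac_max_be)
        if nb > mac_max_csma_backoffs:
            return False, nb       # channel access failure
```

With the defaults, a permanently busy channel fails after five CCA attempts, matching the maximum-backoff sequence shown in Fig. 2.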

Fig. 2 shows an example of the NXP stack transmitting a<br />

data frame preceded by a maximum number of backoffs<br />

allowed by the system. Here macMinBE = 3, macMaxBE = 5<br />

and macMaxCSMABackoffs = 4.<br />

Figure 2: NXP stack transmitting a data frame preceded by a maximum number of backoffs (each CCA takes 128 us; the backoff wait windows grow from 0–2.2 ms at BE = 3 to 0–9.92 ms at BE = 5, followed by the 4.256 ms data frame).<br />



Any data or MAC command frame can be sent with an<br />

acknowledgement request. This requires the recipient to send<br />

an acknowledgement frame back to the sender when the<br />

message has been properly received. Without this feedback<br />

mechanism, the transmission of the frame is just assumed to be<br />

successful.<br />

The acknowledgement frame needs to be sent within<br />

aTurnaroundTime = 12 symbols (192 us) after the reception of<br />

the last symbol of the data or MAC command frame. However,<br />

the originator will wait up to macAckWaitDuration = 54<br />

symbols (864 us) for the acknowledge frame to be received<br />

before the transmission attempt is considered failed.<br />

If the transmission attempt is considered failed, the<br />

originator will repeat the process of transmitting the frame and<br />

waiting for the acknowledgement up to a maximum of<br />

macMaxFrameRetries times, where<br />

macMaxFrameRetries = integer value between 0 and 7<br />

If the maximum number of frame retries has been reached, a<br />

transmission failure notice is sent to the next higher layer of the<br />

software stack.<br />
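The retry behaviour condenses into a few lines; `send_and_wait_ack` is a hypothetical stand-in for transmitting the frame and waiting up to macAckWaitDuration for the acknowledgement:

```python
# Sketch of the MAC-level retry loop described above: the frame is
# retransmitted until an acknowledgement arrives or macMaxFrameRetries
# is exhausted, after which the failure is reported upward.

def transmit_with_ack(send_and_wait_ack, mac_max_frame_retries=3):
    for attempt in range(1 + mac_max_frame_retries):
        if send_and_wait_ack():
            return True, attempt           # acknowledged on this attempt
    return False, mac_max_frame_retries    # notify the next higher layer
```

With the default of 3 retries, a frame is sent at most four times, as in the worst-case sequence of Fig. 3.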

Fig. 3 shows an example of the NXP stack transmitting a<br />

data frame with acknowledgement preceded by a maximum<br />

number of acknowledgement failures. Here<br />

macMaxFrameRetries = 3 and it is assumed that all CCAs pass<br />

with a zero delay wait time prior to transmitting.<br />

Figure 3: NXP stack transmitting a data frame with acknowledgement preceded by a maximum number of acknowledgement failures (each 4.256 ms data frame is followed by an 864 us ACK wait; retries 0 through 3 with macMaxFrameRetries = 3).<br />

IV. CONCURRENT OPERATION OF BLE AND 802.15.4<br />

When using a single chip for concurrent operation of BLE<br />

and 802.15.4, the best use case is when the BLE side is<br />

configured as a slave and the 802.15.4 side is configured as an<br />

end device (a device without children that communicates<br />

only with a single parent). This configuration is the least timing<br />

restrictive for both technologies. On the BLE side, the slave<br />

can take advantage of the Supervision Timeout period (see Fig.<br />

1). On the 802.15.4 side, the end device is in control of when<br />

data is transferred. See Fig. 4.<br />

Figure 4: How data is transmitted in a nonbeacon-enabled 802.15.4 PAN. A child transmits data to its parent directly, acknowledged if requested; a parent delivers data only after the child polls it with a Data Request.<br />

As Fig. 4 illustrates, every 802.15.4 data transfer is initiated<br />

by the child device and can be scheduled when it’s convenient.<br />

This means that the end device can schedule all of the 802.15.4<br />

communication around the BLE communication, which has a<br />

stricter timing regimen. Recall that in the Connection State, the<br />

BLE master device must transmit at every connection interval<br />

and the slave must respond within the supervision timeout<br />

period. See Fig. 1.<br />

Since not all situations can take advantage of this best case<br />

configuration, another prominent use case to consider is when<br />

the 802.15.4 device is configured as a coordinator. Here the<br />

control of when the 802.15.4 side communicates is forfeited<br />

and the controller must respond to the end device(s) attached to<br />

it, all while maintaining the BLE connection.<br />

For this concurrent operation, one strategy that can be used<br />

is to give the BLE operation higher priority over the 802.15.4<br />

operation – essentially letting BLE run as needed while filling<br />

the time gaps with 802.15.4 operations. Like in the previous<br />

example, this strategy is effective because the 802.15.4<br />

protocol is less restrictive in its timing requirements.<br />

This strategy has been used in the NXP KW41Z hybrid<br />

device. This device supports BLE and 802.15.4 protocols<br />

concurrently in a single chip. The software has a Mobile<br />

Wireless System (MWS) Coexistence block that arbitrates the<br />

use of the radio hardware resource. It is essentially a set of<br />

APIs that allow higher layers of the software to request access<br />

to the radio resource. The MWS natively gives priority to BLE<br />

allowing it to abort ongoing 802.15.4 transactions even if they<br />

have already been started. If this happens, the 802.15.4<br />

transaction will be restarted once the BLE transaction has been<br />

completed.<br />

While this strategy is bulletproof for any BLE<br />

communication and any 802.15.4 transmission, it is vulnerable<br />

to 802.15.4 receptions. As described above, in this<br />

configuration, the device is not in control of the 802.15.4<br />

transactions and must listen for end node (child) devices. If a<br />

child tries to communicate while the parent is in the middle of<br />

a BLE transaction, the 802.15.4 packet could be lost.<br />

To measure the effects of this, a test was conducted using<br />

the NXP KW41Z hybrid device [5]. In this experiment, there<br />

were 3 devices – the KW41Z hybrid device, a Smartphone and<br />

another KW41Z device configured as an 802.15.4 end device.<br />

See Fig. 5.<br />



The results of this experiment demonstrate that the 802.15.4<br />

PER falls to around 1% as long as acknowledgement is used<br />

and the CI is kept around 50 ms or higher. The other variables<br />

do not play as big of a role in the outcome.<br />

To further reduce this 1% PER, there are a number of<br />

techniques that can be used to minimize any packet loss. One<br />

technique is to relax the 802.15.4 parameters:<br />

macMaxFrameRetries – This is the number of retries if<br />

no ACK is received. The default value is 3 but can go as<br />

high as 7. Having more retries improves the chance of<br />

getting the 802.15.4 data packet through.<br />

CCA Backoff time – This is the amount of time the<br />

transmitter waits before it performs a CCA. This<br />

essentially spreads out the transmission events which<br />

gives it more time to clear a BLE event if it’s in progress.<br />

Other techniques can be used to eliminate or reduce the<br />

PER. This involves adding retry mechanisms at higher layers<br />

of the software, such as the network and/or application layers.<br />

This is a common technique used in Zigbee and Thread<br />

systems.<br />
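A rough model shows why stacking retry mechanisms is so effective: if attempts fail independently with probability p, each additional retry multiplies the residual PER by p. The numbers below are illustrative, not measurements from this paper:

```python
# Residual PER with r MAC retries, assuming independent attempts with
# per-attempt loss probability p: the frame is lost only if the original
# transmission and every retry all fail, i.e. p ** (1 + r).

def residual_per(p_single, retries):
    return p_single ** (1 + retries)

# e.g. a 20% single-attempt loss with the default 3 retries leaves a
# residual PER of 0.2 ** 4 = 0.16%; an extra application-layer retry
# on top squares that again.
```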

Figure 5: Packet Error Rate (PER) Experiment Setup<br />

For this experiment, an 802.15.4 Packet Error Rate (PER)<br />

test was performed to determine the impact to this protocol<br />

when BLE is running in the same device. The Smartphone was<br />

configured as the BLE Master and established a connection<br />

with the hybrid device using a selected connection interval.<br />

Then the Hybrid device created an 802.15.4 network. The end<br />

device connected to the Hybrid device on the 802.15.4<br />

network. With both networks running, the end device sent 802.15.4<br />

data packets to the hybrid device with a selected time interval<br />

between packets. The Hybrid device measured the PER.<br />

This experiment was run 36 different ways, varying the<br />

following parameters<br />

- 802.15.4 Acknowledgement Enabled (Yes, No)<br />

- 802.15.4 Payload Size (0, 100 bytes)<br />

- BLE Connection Interval (7.5, 50, 360 ms)<br />

- 802.15.4 Message Interval Rate -- MIMS (10, 50, 100 ms)<br />
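The 36 runs are simply the Cartesian product of these four parameter sets (2 × 2 × 3 × 3 = 36):

```python
from itertools import product

# Enumerate the experiment configurations as the Cartesian product of
# the four parameter sets listed above.

ack_enabled = (True, False)
payload_bytes = (0, 100)
conn_interval_ms = (7.5, 50, 360)
message_interval_ms = (10, 50, 100)

configs = list(product(ack_enabled, payload_bytes,
                       conn_interval_ms, message_interval_ms))
assert len(configs) == 36
```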

The results are shown in Fig. 6. Observe the clear<br />

difference between the Message Acknowledge Enabled versus<br />

Disabled results. This is because in the enabled case, the<br />

message is retransmitted if the acknowledgement is not<br />

received, which significantly lowers the PER.<br />

The next biggest noticeable difference is in the BLE<br />

Connection Interval (CI). As Fig. 6 shows, when the CI is at<br />

the lowest allowed value of 7.5 ms, the PER is dramatically<br />

higher than when a more typical rate such as 50 ms is used.<br />

V. CONCLUSION<br />

This paper reviewed the communication fundamentals of<br />

BLE and IEEE 802.15.4, focusing on the amount of time the<br />

radio resource is needed to complete fundamental tasks. It then<br />

reviewed techniques to support multiple protocols concurrently<br />

in a single chip. Lastly, it showed a practical use case applying<br />

these strategies and the effects it had on packet error rate of the<br />

802.15.4 network.<br />

ACKNOWLEDGEMENT<br />

The author would like to thank the team members of the<br />

NXP Microcontroller Systems Engineering team for their help<br />

in running the experiments and providing their valuable<br />

feedback.<br />

REFERENCES<br />

[1] Bluetooth SIG, Inc, “Bluetooth Core Specification,” v5.0, December<br />

2016,<br />

https://www.bluetooth.com/specifications/bluetooth-core-specification<br />

[2] IEEE Standards Association, “Wireless Medium Access Control (MAC)<br />

and Physical Layer (PHY) Specifications for Low-Rate Wireless<br />

Personal Area Networks (WPANs),” IEEE802.15.4-2006,<br />

http://standards.ieee.org/findstds/standard/802.15.4-2006.html<br />

[3] Zigbee Alliance, zigbee Specification, Revision 22 1.0, zigbee<br />

Document 05-3474-22, April 19, 2017, http://www.zigbee.org/zigbee-for-developers/network-specifications/zigbeepro/<br />

[4] Thread Group, Inc, “Thread 1.1.1 Specification,”<br />

https://www.threadgroup.org/ThreadSpec<br />

[5] S. Lopez and J.C. Pacheco, “Thread + Bluetooth Low Energy<br />

Coexistence,” unpublished<br />



Figure 6: Experiment Results. Packet error rate versus 802.15.4 message interval (MIMS) for BLE connection intervals of 7.5, 50 and 360 ms and payload sizes of 0 and 100 bytes, with message acknowledge enabled and disabled.<br />



Automatic Tracking of Li-Fi Links for Wireless<br />

Industrial Ethernet<br />

René Kirrbach<br />

Fraunhofer Institute for Photonic Microsystems IPMS<br />

Dresden, Germany<br />

Rene.kirrbach@ipms.fraunhofer.de<br />

Michael Faulwaßer, Tobias Schneider, Robert<br />

Ostermann, Dr. Alexander Noack<br />

Fraunhofer Institute for Photonic Microsystems IPMS<br />

Dresden, Germany<br />

{michel.faulwaßer, robert.ostermann, tobias.schneider,<br />

alexander.noack}@ipms.fraunhofer.de<br />

Abstract— The ongoing digitalization of our environment<br />

leads to continuously increasing data traffic. Especially in<br />

industrial environments, automation is an omnipresent<br />

trend. Autonomous systems incorporate a rising amount of<br />

sensors as well as continuous machine-to-machine (M2M)<br />

communication.<br />

Wireless communications can simplify the data<br />

transmission and enable connectivity to dynamic parts like<br />

moving, vibrating or rotating components. Due to the open<br />

nature of the communication channel, engineers have to<br />

face a number of challenges, e.g. security issues,<br />

interferences and regulation of irradiated power.<br />

Radio frequency (RF) technologies are used in manifold<br />

applications, but in certain scenarios they are still<br />

cumbersome, because of signal interference and hard<br />

real-time requirements.<br />

The so-called Li-Fi technology is ideal for autonomous<br />

systems in Industry 4.0 since optical communications offer<br />

reliable and high data rate communication links with low-latency<br />

characteristics. However, the engineer typically<br />

has to face a trade-off between the link’s range, coverage<br />

and data rate. This contradiction can be overcome by<br />

forming a small, steerable spot.<br />

In this paper we present a compact Li-Fi tracking system<br />

based on a steerable optical wireless link, which enables<br />

real-time full-duplex bi-directional data communication<br />

with a data rate of 1.289 Gbit/s. This approach shows the<br />

feasibility and handling of an energy efficient wireless link,<br />

thanks to its 12-bit-precise beam alignment by using micro<br />

mirrors. We describe the optical setup and introduce a<br />

tracking algorithm which enables fully autonomous link<br />

establishment and thus simple installation. Data rate<br />

measurements underline the high performance of the<br />

wireless link whereas the system’s mobility is<br />

characterized by measurements of the settle time of the<br />

steered beam.<br />

Keywords— Li-Fi; optical wireless; infrared; real-time; light<br />

communication; 1 Gbps; IrDA; mobile; IoT; Industry 4.0; M2M<br />

I. INTRODUCTION<br />

The ongoing digitalization of our environment leads to<br />

continuously increasing data traffic. Especially in industrial<br />

environments, automation is an omnipresent trend.<br />

Autonomous systems incorporate a rising amount of sensors as<br />

well as continuous machine-to-machine (M2M)<br />

communication. The resulting enormous data volumes are<br />

often transmitted using wired interconnections.<br />

Wireless communications can simplify data transmission<br />

and enable the connectivity to dynamic parts like moving,<br />

vibrating or rotating components. Due to the open nature of the<br />

communication channel, engineers have to face a number of<br />

challenges, e.g. security issues, interferences and regulation of<br />

irradiated power.<br />

Radio frequency (RF) technologies are used in manifold<br />

applications, but in certain scenarios they are still cumbersome.<br />

For instance, wireless hard real-time operation requires a careful<br />

analysis of the channel and its environment to avoid<br />

interferences. Easier handling by using exactly defined<br />

communication spots is possible by utilizing light.<br />

The so-called Li-Fi technology is ideal for autonomous<br />

systems in Industry 4.0 since optical communications offer<br />

reliable and high data rate communication links with<br />

low-latency characteristics. However, there is a trade-off<br />

between communication range and coverage. The latter is<br />

determined by the system’s field-of-view (FOV) which<br />

describes the defined light spot where the link is established.<br />

Unfortunately, a large FOV reduces the received power and the<br />

related maximum communication range. This contradiction can<br />

be overcome by forming a small, steerable spot as shown in<br />

Fig. 1.<br />



magnification can be achieved with additional lenses at the<br />

transmitter's exit.<br />

Our Li-Fi tracking system is equipped with a<br />

1 Gbit/s Ethernet adapter, which enables easy integration.<br />

Fig. 1: Optical wireless tracking scenario. If the position of one or both<br />

transceivers change, then the beam is deflected accordingly and<br />

interruption-free data transmission is possible.<br />

Manifold principles of beam deflection and their practical<br />

feasibility have been shown including tiltable mirrors [1],<br />

Risley-Prisms [2], decentered lenses [2], Pockels- and<br />

Kerr-cells [3], acousto-optic gratings [4], liquid crystal based<br />

spatial light modulators [5] and many more. Brandl et al. [6]<br />

and Wang et al. [7] already demonstrated the feasibility of<br />

micro mirrors for beam steering of optical wireless links.<br />

Here we present a similar approach based on micro mirrors,<br />

but with different system design. Thanks to their small form<br />

factor and fast response, we can design compact and dynamic<br />

systems. In contrast to Brandl et al. [6], we introduce a tracking<br />

algorithm for fully autonomous link establishment without the<br />

need for a special photodetector. Unlike Wang et al. [7], our<br />

tracking approach does not require spot size<br />

adjustment and thus enables a simpler system design.<br />

Our paper is structured as follows: In section II, we<br />

describe our Li-Fi tracking system. We give detailed<br />

information about the used micro mirrors and our search<br />

algorithm. In section III, we present an analysis of the steerable<br />

beam as well as data rate and latency measurements that prove the<br />

practical feasibility of the concept.<br />

II. SYSTEM DESCRIPTION<br />

A. Overview<br />

Fig. 2 shows a 3D model of our tracking system. We<br />

combine a Fraunhofer IPMS standard transceiver with two<br />

1D micro mirrors from our institute. The term “1D” indicates<br />

that the mirror is able to statically keep its tilt around one axis.<br />

The communication beam is reflected at the mirrors. If a mirror<br />

is tilted by the angle θ, the beam is deflected by an angle 2θ,<br />

because of the law of reflection. In order to enable 2D beam<br />

steering, one mirror enables deflection along the horizontal axis<br />

and the other one along the vertical axis. Additional lenses are<br />

used for beam forming. These lenses influence the actual<br />

steering angle of the beam. Optimally, further angle<br />

Fig. 2: Fraunhofer IPMS tracking system. The PCB of the transceiver is<br />

highlighted green. The micro mirror PCBs are mounted vertically and colored<br />

yellow. The steered beam is highlighted blue. The receiver optic is not shown<br />

here.<br />

B. MEMS Mirror<br />

Each micro mirror has its own controller called<br />

Fraunhofer QsDrive. The controller has a USB interface and is<br />

connected to a host system, which can easily control the<br />

mirrors by simple commands given by its API. For analysis<br />

and loopback tests the algorithm is controlled via<br />

Matlab R2015a. In a second step the tracking algorithm will be<br />

moved to an arbitrary microcontroller by porting it to C code.<br />

The mirrors can be tilted by 5° in positive and negative<br />

direction. This could form a FOV with full angle of 20°.<br />

However, because of the influence of the additional lenses the<br />

actual FOV will be slightly smaller. The settle time t_s of the<br />

micro mirrors is specified as 5 ms. The 12-bit-precise<br />

addressing scheme provides us 4096 steps in beam deflection<br />

for horizontal and vertical direction separately. A window in<br />

front of the mirror's surface is used for hermetic encapsulation.<br />
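To put the 12-bit addressing in perspective, a back-of-the-envelope calculation (ignoring the lenses, which the authors note alter the actual steering angle):

```python
# Angular resolution of the 12-bit mirror addressing described above.
# A mechanical tilt of theta deflects the beam by 2 * theta (law of
# reflection), so 4096 steps over the +/-5 deg mechanical range give
# the optical step size below.

MECHANICAL_RANGE_DEG = 10.0   # -5 deg .. +5 deg
STEPS = 2 ** 12               # 12-bit addressing -> 4096 steps

mechanical_step_deg = MECHANICAL_RANGE_DEG / STEPS
optical_step_deg = 2 * mechanical_step_deg  # reflection doubles the tilt
# optical_step_deg ~= 0.00488 deg per addressing step
```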

For this project we use two 1D micro mirrors because of<br />

their availability. For the next revision we will use one tiltable<br />

2D micro mirror, which simplifies the optical concept and<br />

improves the optical performance of our tracking system.<br />

C. Tracking Algorithm<br />

A search algorithm is necessary for fully automated link<br />

establishment. Fig. 3 illustrates the principle of our algorithm.<br />

First, both transceivers scan their FOV coarsely and<br />

transmit their current mirror tilt angles in each point. When the<br />

beam reaches the receiver of the opposite transceiver by<br />

chance, this opposite transceiver detects the angle position of<br />

device one in that point. From now on transceiver two<br />

transmits both mirror positions, its own ones and the received<br />

ones from transceiver one. As soon as the second transceiver's<br />

beam hits transceiver one, the first transceiver knows the<br />

correct mirror tilts for its mirrors. With this information, the<br />

mirrors are configured and the received mirror position of<br />

transceiver two is transmitted back. Lastly, transceiver two<br />

receives these mirror tilts and configures its mirrors<br />

correspondingly.<br />

Fig. 4: Relative irradiance E_rel of the spot profile at a distance of 15 cm.<br />

Fig. 3: Coarse-scanning part of the search algorithm. The sequence is analogous if the<br />

beam of transceiver 2 first hits transceiver 1.<br />

At this point a data link is established. However, the mirror<br />

tilts may not be ideal, i.e. only the edge of the beam hits the<br />

opposite transceiver. Therefore, a fine-scanning algorithm with<br />

smaller step size is initiated in order to find the maximum. The<br />

fine-scanning starts as soon as the transceiver knows its right<br />

mirror positions. We suggest the gradient ascent or hill-climb<br />

algorithm. Both methods may not find the global maximum of<br />

the spot. However, if the spot is homogeneous enough, even a<br />

local maximum should be sufficient. Therefore, we investigate<br />

the shape of the spot in section III.A.<br />
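A minimal hill-climb sketch for this fine-scanning step might look as follows; `measure(h, v)` is a hypothetical stand-in for a received-signal-strength readout at a pair of 12-bit mirror positions, and the step size is an assumed value (the paper does not prescribe this implementation):

```python
# Hill-climb fine scan: starting from the coarse mirror position, step
# to whichever neighbour (in mirror addressing steps) yields a stronger
# received signal, until no neighbour improves on the current position.

def hill_climb(measure, h, v, step=8, max_iters=100):
    best = measure(h, v)
    for _ in range(max_iters):
        moved = False
        for dh, dv in ((step, 0), (-step, 0), (0, step), (0, -step)):
            val = measure(h + dh, v + dv)
            if val > best:
                best, h, v, moved = val, h + dh, v + dv, True
                break
        if not moved:
            break  # local maximum reached
    return h, v, best
```

Re-running the same loop from the found position (as the paper suggests) lets the link follow a slowly moving spot.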

As soon as the fine-scanning algorithm is finished, it is<br />

initiated again. Thereby, we can follow a moving spot. This<br />

enables communication within a dynamic scenario. Both<br />

devices can be moved within their corresponding FOVs<br />

without interrupting the communication link.<br />

The required time for link establishment depends on the<br />

transceivers' positions and thus on the number of scanned<br />

points until the coarse scanning algorithm finds the opposite<br />

device. The time can be approximated with N ∙ t_s, where N is<br />

the number of scanned points and t_s the settle time of the<br />

mirror. The settle time is measured in section III.C.<br />
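Under this approximation, the worst-case link-establishment time scales linearly with the number of coarse-scan points. The settle time below follows the worst-case measurement in section III.C; the 16 × 16 grid is an assumed illustration, not a value from the paper:

```python
# Worst-case coarse-scan link-establishment time: N points times the
# mirror settle time t_s (assumed 7.5 ms, the worst case measured in
# section III.C).

def link_establishment_time_ms(points_h, points_v, settle_time_ms=7.5):
    n = points_h * points_v  # points scanned in the worst case
    return n * settle_time_ms

# e.g. a 16 x 16 coarse grid: 256 * 7.5 ms = 1920 ms worst case
```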

III. EXPERIMENTAL RESULTS<br />

A. Beam profile<br />

Fig. 4 shows the spot in a distance of 15 cm. As expected, a<br />

circular spot is formed. It exhibits a donut profile, i.e. the<br />

minimum in the center of the spot is surrounded by a<br />

ring-shaped maximum of the irradiance. Moreover, a speckle<br />

pattern all over the spot and a weak ghosting effect at the left<br />

side can be observed.<br />

B. Field of View and Bit Error Rate<br />

Next, the functionality of the system is evaluated.<br />

Therefore, both transceivers are placed in front of each other.<br />

Transceiver 2 is moved within the plane perpendicular to the<br />

optical axis along the horizontal x-axis (red) and along the<br />

vertical y-axis (blue) respectively. Next, the tracking algorithm<br />

is initiated and the mirrors are tilted correspondingly. Fig. 5<br />

illustrates the measurement setup and the bit error rate (BER)<br />

over displacement to the optical axis. Data transmission takes<br />

place in bi-directional full-duplex mode with a data rate of<br />

1.289 Gbit/s in both directions.<br />

If we assume BER < 10^-8, we can establish a robust link<br />

over an area with an extent of 22.5 cm in horizontal and<br />

23.5 cm in vertical direction. This corresponds to full angles of<br />

12,7° and 13,22° respectively.<br />

Fig. 5: Top: Measurement setup. Bottom: BER over displacement along<br />

horizontal x and vertical y axis separately. The distance in this scenario was<br />

80 cm for practical reasons.<br />



C. Settle time<br />

Table I shows the settle times of the mirrors for different mirror tilts. The settle times range from 7.1 ms to 7.5 ms. They increase only slightly when the mirrors are tilted by larger angles.<br />

TABLE I: MEASURED SETTLE TIMES FOR DIFFERENT MIRROR TILTS.<br />

Delta Mirror Tilt | Settle Time t<sub>s</sub><br />

0.1° | 7.1 ms<br />

0.5° | 7.2 ms<br />

1.0° | 7.3 ms<br />

5.0° | 7.3 ms<br />

10.0° | 7.5 ms<br />

As soon as a communication link is established, the system<br />

provides real-time communication. One communication<br />

channel exhibits an electrical latency of about 5 ns, plus the time for the light to travel through the optical channel. The value<br />

was measured from the signal input port of transceiver 1 to the<br />

signal output port of transceiver 2.<br />
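The end-to-end latency of one channel can be estimated from the figures above. Only the 5 ns electrical latency is from the text; the second term is ordinary free-space time of flight.<br />

```python
# One-way channel latency: measured electrical latency plus the
# free-space time of flight added by the optical channel.

C_M_PER_S = 299_792_458.0  # speed of light

def channel_latency(distance_m, electrical_latency_s=5e-9):
    return electrical_latency_s + distance_m / C_M_PER_S

# At the 80 cm distance of the Fig. 5 setup, time of flight adds about
# 2.7 ns, i.e. roughly 7.7 ns in total.
```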

IV. DISCUSSION<br />

A. Beam profile<br />

The shape of the beam profile is satisfactory. The speckle<br />

pattern results from the laser diode and cannot easily be<br />

avoided without influencing the shape of the spot.<br />

However, by adjusting the collimator lenses, the donut<br />

profile can be avoided and a more homogeneous power density<br />

within the spot can be achieved. This is generally useful for the<br />

fine-scanning algorithm.<br />

The ghosting effect results from reflections at the windows<br />

of the micro mirror packages. It could be minimized by<br />

applying an anti-reflection coating to the windows. However,<br />

since the ghosting effect is quite weak, the additional expense<br />

of such a coating is not justified.<br />

B. Field of View and Bit Error Rate<br />

According to Fig. 5, our system is able to find the opposite transceiver within the FOV. However, the achieved FOV is still below 20° in full angle. This is mainly due to the following factors:<br />

• Additional lenses within the optical setup influence the actual deflection angle<br />

• The optical gain of the receiver optics decreases at higher angles<br />

• A larger deflection angle causes spot distortion, which results in a larger spot size and therefore in lower irradiance<br />

C. Settle time<br />

The measured settle times are slightly larger than the 5 ms specified by the manufacturer. We controlled the mirrors with Matlab R2015a in this measurement setup. As a result, we introduced additional overhead and thus additional latency. If the logic of the tracking algorithm is integrated into a chip on the PCB, we should be able to reduce the latency to nearly 5 ms.<br />

Both devices can change their position within the FOV without losing the data link. However, the settle time of the mirrors fundamentally limits this movement. If device one is fixed and the other one is moving, the theoretical maximum angular speed ω of device two along one axis is given by equation (1), where Δθ is the step size of the fine-scanning algorithm.<br />

ω = Δθ / t<sub>s</sub> (1)<br />

A large step size Δθ in the fine-scanning algorithm results in higher dynamics; for example, with Δθ = 0.5° and the measured t<sub>s</sub> = 7.2 ms, equation (1) yields a maximum trackable angular speed of about 69°/s. However, the precision of the scanning algorithm decreases with larger steps.<br />

V. CONCLUSION<br />

In this paper we introduced a fully automatic Li-Fi tracking system which provides a large FOV and large communication distances at the same time. Our tracking algorithm allows fully automated link establishment without any additional information. The transceivers allow full-duplex, bi-directional data communication with a data rate of 1.289 Gbit/s and BER < 10<sup>−8</sup>. As soon as the link is established, real-time-capable communication with latencies of only about 5 ns is possible. Therefore, our Li-Fi system is ideal for wireless industrial communication.<br />

For the next revision, we plan to replace the 1D micro mirrors with a single 2D micro mirror. This will further simplify the optical setup and lead to further miniaturization.<br />

VI. ACKNOWLEDGEMENTS<br />

The authors thank Fraunhofer for funding this system within the framework of the project 600601 Autotrack.<br />

VII. REFERENCES<br />

[1] Davis, S. R.; Farca, G.; Rommel, S. D.; Martin, A. W.; Anderson, M. H.: Analog, Non-Mechanical Beam-Steerer with 80 Degree Field of Regard. In: Proceedings of SPIE 6971 (2008), 24 March. DOI 10.1117/12.783766<br />

[2] Gibson, J.; Duncan, B.; Bos, P.; Sergan, V.: Wide-angle beam steering for infrared countermeasures applications. In: SPIE Proceedings 4723 (2002), 3 September, pp. 100–111<br />

[3] Nakamura, K.; Miyazu, J.; Sasaki, Y.; Imai, T.; Sasaura, M.; Fujiura, K.: Space-charge-controlled electro-optic effect: optical beam deflection by electro-optic effect and space-charge-controlled electrical conduction. In: Journal of Applied Physics 104 (2008), No. 14. DOI 10.1063/1.2949394<br />



[4] Römer, G. R. B. E.; Bechtold, P.: Electro-optic and acousto-optic laser beam scanners. In: Physics Procedia 56 (2014), 4 September, pp. 23–39. DOI 10.1016/j.phpro.2014.08.092<br />

[5] Tholl, H. D.: Novel Laser Beam Steering Techniques. In: SPIE Proceedings 6397 (2006), 22 September, No. 639708-14<br />

[6] Brandl, P.; Zimmerman, H.: Optoelectronic Integrated Circuit for Indoor Optical Wireless Communication with Adjustable Beam. (2013), 3 July. ISBN 978-1-4673-5822-4<br />

[7] Wang, K.; Nirmalathas, A.; Lim, C.; Skafidas, E.: 4 × 12.5 Gb/s WDM Optical Wireless Communication System for Indoor Applications. In: Journal of Lightwave Technology 29 (2011), pp. 1988–1996<br />



Design of On-Chip RFID<br />

Transponder Antennas<br />

Dr. Andreas Heinig<br />

Drahtlose Mikrosysteme<br />

Fraunhofer IPMS<br />

Dresden, Germany<br />

andreas.heinig@ipms.fraunhofer.de<br />

Abstract— Manufacturing the transponder antenna directly<br />

on top of the silicon substrate of the transponder integrated<br />

circuit has the advantage of a cheap and miniaturized<br />

transponder tag. No additional mounting and joining processes are<br />

necessary. For passive transponder tags the complete system can<br />

be fabricated using the standard CMOS process.<br />

The size of the antenna depends on the wavelength of the<br />

transmission frequency. In the presented project a frequency of<br />

61 GHz was chosen. This makes it possible to design a chip of<br />

about one square millimeter. In modern silicon technologies,<br />

there is more than enough space behind the area of the antenna<br />

for the necessary electronic circuit. Due to the high frequency<br />

and the small antenna diameter, only a very small amount of<br />

energy is expected to be available for powering the electronic<br />

circuit. Therefore, only identification and authentication with<br />

embedded cryptographic functions are planned for applications<br />

with a reading range of about five millimeters in the first steps.<br />

More energy-critical functions like passive sensor measurements<br />

will be addressed in the future.<br />

In general, the design process of the transponder antenna is an iterative process based on high-frequency electromagnetic field simulations. The process is similar to the design of customized UHF antennas. The antenna type is a slot antenna, consisting of the antenna slot itself and a surrounding frame. The conductive substrate material results in unavoidable losses. Nevertheless, an antenna gain of 1.5 dB is achieved. The antenna gain can be boosted up to 5.6 dB with a metallic backplane of the right thickness on the chip. Additional losses originating from the metallic filling structures in the circuit are already included in these numbers. The filling structures are necessary for technological reasons in the manufacturing process: the metallic coverage has to be between a minimum and a maximum value. The filling area behind the frame does not reduce the antenna quality; the electronic circuit is also placed in this area. The filling directly below the antenna structure itself should be at the minimum allowed value. A critical point is the influence of the filling on the antenna impedance parameters. These parameters have to be fitted to the input parameters of the electronic circuit for an optimal match. Unfortunately, the filling structures are restricted by the technology rules and are too complex to be simulated in full in the field simulations. On the other hand, the antenna and the electronic parts are manufactured in the same steps, so post-process matching is impossible. Therefore, a first antenna on a silicon substrate is manufactured and verified to find differences between simulation and real measurement. This will help to derive a simplified model of the filling structures for the simulation and to allow a precise match of the antenna impedance to the complex-conjugate circuit impedance after manufacturing.<br />

Keywords— GHz-Transponders, RFID, Antenna-on-Silicon,<br />

Embedded Antenna, Antenna Design<br />

I. INTRODUCTION<br />

RFID systems are currently being actively developed<br />

worldwide by various research and industrial companies and<br />

represent a multi-billion dollar future market. The goals are<br />

often to speed up processes or reduce logistics costs in tracking<br />

and controlling the corresponding mobile assets. Another area<br />

of application is real-time monitoring and reliable<br />

administration in the pharmaceutical, medical or military<br />

industries. RFID offers many advantages over the widely used<br />

barcode or data matrix code. Reading several tags can be done<br />

simultaneously and fully automatically. The information can<br />

not only be read unidirectionally, but also actively written. This<br />

is an advantage, especially for poorly networked processing<br />

chains, because the information can be passed on to subsequent<br />

stages, such as validation. In addition, other systems such as<br />

sensors or cryptographic modules can be integrated in the ID<br />

tags, which can enormously expand the range of functions and<br />

possible applications. An RFID system generally consists of a<br />

transponder, the ID tag that carries the information and a reader<br />

that is required for reading. There are roughly three different<br />

types, which differ in the operating principle of the transponders. The<br />

transponders working in the lower frequency ranges 125 kHz<br />

and 13.56 MHz are based on the physical operating principle of<br />

the magnetic coupling (loosely coupled transformer), while the<br />

systems from 868 MHz use the backscatter principle known<br />

from radar technology. In this case, an electromagnetic wave<br />



from the antenna of the reader spreads freely in space and, depending on the digital information to be transmitted (0 or 1), is reflected from the antenna of the transponder back to the reader or not. The development of the magnetically coupled transponder can today be considered technically complete. In practice, however, there are large fields of application (such as travel passports, identity cards, EC and credit cards, but also medical implants). With the use of near field communication (NFC) in connection with smartphones, further applications such as the electronic purse or telemedicine devices have been developed for the 13.56 MHz frequency range. Most of the new developments are currently taking place in the frequency band from 850 MHz to 950 MHz (internationally not uniformly standardized). In addition to pure identification systems, transponders with sensors and, in the future, with cryptography functions are increasingly gaining in importance. The IPMS plays a key role in this development. For applications to protect against piracy these transponders are not suitable, because in this frequency range the antenna is still several centimeters in size; a simple dipole antenna is 17 cm long. The frequency range of 2.45 GHz, which is also released for RFID applications, has no significance in practical application. Only in the frequency range above 60 GHz do antenna dimensions of one to two millimeters in length become so small that integration on the silicon chip is technically and economically sensible. The frequency of 61 GHz allows an antenna size that makes integration on a chip with an area of 1 mm² to 2 mm² practical. This frequency band is released for long-range devices.<br />

A particular challenge is the energy supply of such<br />

transponders. Since the chip is to work passively, without its<br />

own energy source, the necessary power must be transmitted<br />

via the carrier field. At 61 gigahertz, very little power will be<br />

available on the chip for the electrical function. Since the<br />

transmission power of the reader is limited by standard<br />

regulations to 100mW, a very efficient low-power circuit must<br />

be developed for the transponder tag.<br />

The antenna has a particular influence on the efficiency of the transponder. Appropriate geometries must be developed for on-chip antennas (OCA). Another challenge is the optimal<br />

adaptation of the antenna to the electrical circuit. Only this<br />

makes it possible to minimize the transmission losses and to<br />

optimally use the minimum power budget. The integration of<br />

the antennas is realized in the standard CMOS process. Only in<br />

this way can the chip be realized within the envisaged cost<br />

range. Therefore, the technological requirements in the<br />

backend process have to be taken into account during<br />

development.<br />

There is still no standard protocol for the communication<br />

between reader and transponder in this frequency range. The<br />

starting point for the development will be the EPC-G2V2<br />

standard, which was developed for the UHF frequency range. It<br />

also already contains the command-state definition that is<br />

required for the authentication function. When<br />

implementing the function in the 61GHz chip, the power<br />

consumption must be significantly reduced.<br />

II. COMMON SYSTEM DESCRIPTION<br />

The project will develop a CMOS chip and a reader (Figure<br />

1). Both parts communicate with each other and thus represent<br />

the transponder system. The selection of the 60 GHz band<br />

enables the integration of all components in one chip on the<br />

chip side. The antenna on top of the transponder chip is the<br />

topic of this presentation.<br />

Fig. 1. Overview of the 61GHz Transponder System.<br />

The target application of the system is the wireless<br />

identification and authentication of assets. The reader should be<br />

brought close to the transponder; ranges on the order of 5 mm<br />

are being considered.<br />

The FD-SOI technology used is based on an ultra-thin silicon layer, which is separated from the silicon substrate by a thin buried oxide layer. The transistor channel is made in the ultra-thin silicon layer. Due to the small silicon thickness, the transistor channel is fully depleted in the off state ("fully depleted"), which significantly improves the turn-off behavior of the transistor. Due to the complete depletion of the silicon layer, implantation steps in the channel can be almost completely eliminated, whereby the mobility of the charge carriers in the channel, and thus also the on-current, is increased. The buried oxide layer reduces the parasitic capacitances in the transistor structure, improves the field penetration of the gate onto the channel, and produces a nearly ideal sub-threshold slope, which is reflected in a further improved on-current.<br />

III. ANTENNA DESIGN<br />

The aim of the antenna development is to obtain the<br />

smallest possible antenna geometry that can be produced<br />

directly on the chip with the technology also used as standard<br />

for the integrated circuit, which achieves an antenna gain > 0<br />

dB. A particular challenge is the influence of the silicon<br />

substrate and the metal structures contained therein on the<br />

antenna located in the uppermost levels. The selected working<br />

frequency of 61GHz determines the size of the antenna. The<br />

optimal antenna length is in the usual antenna types in the order<br />

of half the wavelength of the operating frequency, plus a<br />

dependent of material parameters truncation factor. The half<br />

wavelength is in our case at λ / 2 ~ 2.44mm. The materials<br />

specified by the chip substrate still lead to a significant<br />

shortening factor, so that λ / 2 antennas are suitable for the<br />

selection of an antenna in the project. Through a literature<br />

review and the input and simulation of simplified models of<br />

different antenna geometries, a double slot antenna was chosen<br />

as the most promising. In contrast to the example of the normal<br />

dipole antenna, in which a conductor itself represents the<br />

antenna, in a slot antenna a recess in the metallic conductor is<br />

the actual antenna structure. This makes the antenna<br />



characteristics relatively independent of metal around the<br />

antenna, such as the electronic circuit structures of the chip.<br />
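The half-wavelength figure quoted above can be reproduced directly. The effective permittivity used below only illustrates the shortening caused by the chip materials; the actual value is not stated in the paper, so 4.0 is a placeholder.<br />

```python
# Quick check of the half-wavelength estimate; eps_eff = 4.0 is an
# illustrative placeholder for the material-dependent shortening.

def half_wavelength_mm(freq_hz, eps_eff=1.0):
    c = 299_792_458.0
    return c / (2 * freq_hz * eps_eff ** 0.5) * 1e3

free_space = half_wavelength_mm(61e9)              # ~2.46 mm, in line with ~2.44 mm in the text
shortened = half_wavelength_mm(61e9, eps_eff=4.0)  # ~1.23 mm, illustrative only
```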

The structure of the antenna is shown in Figure 2.<br />

Fig. 2. Used antenna geometry (antenna slots, surrounding frame, and matching waveguide).<br />

In the illustration, metallic surfaces are shown in blue. At the top, horizontally, the two slots of the antenna are visible in the metal. In the middle, two further slots connect downwards, which form a waveguide (CPW = coplanar waveguide). This couples the signal from the antenna and feeds it to the transponder circuit, which is later electrically contacted at the lower end of the waveguide. The geometry of the waveguide can additionally be used to adapt the antenna to the electrical circuit. In addition to the antenna characteristics themselves, the impedance of the antenna must be the complex conjugate of the input impedance of the transponder circuit. This is necessary in order to achieve an optimum energy transfer from the electric field via the antenna into the transponder circuit and to have a good reflection coefficient for the data transmission from the transponder back to the reading device. The geometrical distances entered in the figure affect the antenna properties and the impedance of the antenna.<br />

In order to develop the antenna for the 61 GHz transponder chip, a model adjustable in all geometric dimensions has been created. With this, the antenna can be entered into the high-frequency simulation software "HFSS", a product of ANSYS, Inc. An analytical calculation of the geometries is not possible because of the complex relationships. Due to the large number of variables, a large number of time-consuming simulations is necessary in order to iteratively approach the best possible solution. The antenna structure is located in the highest metallic conductor level of the technology used, "22FDX". All levels of the silicon wafer are implemented in the simulation model, which together comprise 32 different substrate, trace and via levels. This leads to a very complex simulation model, which places high demands on the computer technology used and requires a significant amount of time per simulation. First, in an iterative process, the geometries were determined by varying the antenna length, the width of the slot, and the extent of the enclosing frame to give optimal antenna characteristics. Thereafter, the antenna was adapted to the input impedance of the transponder circuit. Since the circuit was developed in parallel with the antenna, its input impedance was known neither from simulation nor from measurement at that time, so an estimated target value of Z<sub>A</sub> = (5 + j10) Ω was given for the antenna. Since separate chips were provided for the first trials on silicon for antenna and circuit development, this is not a limitation; the important question to clarify is whether the properties of the developed antenna can also be metrologically verified, and the specific adaptation to the chip then takes place in further steps. The impedance characteristics of the antenna were adjusted substantially via the length and width of the waveguide. Since any resizing affects all parameters, an iterative process of readjusting all parameters is necessary. Figure 3 shows as an example the dependence of the antenna impedance on the length of the waveguide.<br />

Fig. 3. Complex impedance with respect to the matching feed length.<br />
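The conjugate-match requirement described above can be made concrete with the power-wave reflection coefficient. The (5 + j10) Ω value is the estimated antenna target from the text; the 50 Ω comparison point is our illustration.<br />

```python
# The conjugate-match condition in numbers: the power-wave reflection
# coefficient vanishes exactly when the antenna impedance equals the
# complex conjugate of the circuit input impedance.

def reflection_coefficient(z_antenna, z_circuit):
    return (z_antenna - z_circuit.conjugate()) / (z_antenna + z_circuit)

gamma_matched = reflection_coefficient(5 + 10j, 5 - 10j)       # conjugate match -> 0
gamma_detuned = abs(reflection_coefficient(5 + 10j, 50 + 0j))  # badly mismatched
```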

Combined with the other variables, this results in a multidimensional result field which, due to its complexity, makes it necessary to use the computer-assisted optimization options offered by the HFSS software and to develop end-to-end optimization programs. As a result, it was possible to develop a chip antenna which, with an antenna gain of ~1.55 dB, is significantly better than the target value. Figure 4 shows the directional characteristic (radiation pattern) of the designed antenna.<br />

Fig. 4. Antenna Gain characteristic with high impedance wafer material.<br />

It was evaluated to what extent a metal surface behind the antenna focuses the radiation and leads to a higher antenna gain. This was confirmed: the antenna gain increases to ~5.9 dB when the reader is above the chip, which corresponds to the normal application. Figure 5 shows the corresponding diagram. It is technologically easy to later mount the transponder chip on a metallic substrate or to metallize the back of the wafer to take advantage of this. However, the gain value depends on the thickness of the material; the values shown are valid for a substrate thickness of 700 μm, which corresponds to a wafer thickness of<br />



approximately 730 μm during production. If the distance decreases, the antenna gain decreases. The change is very small down to a 300 μm substrate, but from 300 μm to 100 μm substrate thickness the antenna gain drops back to the value without backside metallization. That is, if the additional antenna gain is to be exploited, the wafer must not subsequently be thinned below 320 μm thickness. For initial measurements of the antenna, it was planned to mount the antenna chip on a PCB of 0.5 mm thickness as a carrier for easier handling and to use unthinned chips. The circuit board has a copper surface as a backside metallization. The simulation model was adjusted accordingly; Figure 6 shows this in the appropriate size ratio.<br />

Fig. 5. Antenna gain characteristic with reflector and high impedance wafer.<br />

Fig. 6. Chip carrier with reflector for test.<br />

IV. TRANSFER TO CHIP LAYOUT<br />

For the design of the test chips, test pads were added in order to be able to measure the antenna via needle probes. On the later transponder chip these pads are not necessary. The test pads affect the impedance of the antenna. Since an adaptation to a specific target impedance is not necessary for the probe chip and the resulting impedance is in the desired window, a correction of the influence of the test pads was omitted for reasons of expense. Figure 7 shows the resulting test chip including the technologically required marginal and auxiliary structures. The chip size predetermined by the antenna is 1.3 mm x 0.7 mm, which is within the desired target range. The areas under the antenna frame can later be used for the transponder circuit. For organizational reasons, a chip size of 2 mm x 2 mm had to be occupied for the production of the test chip, so the remaining area was left empty in order to exclude an influence on the antenna properties.<br />

Fig. 7. Antenna test chip layout.<br />

Very problematic for the antenna design is the technological requirement that all metal levels have a certain minimum and maximum metal coverage; otherwise the chip cannot be produced. Excluded from this rule is the top metal level, in which the antenna itself is located. However, the additional metal surfaces a few nanometers to micrometers below the antenna have a large impact on the antenna parameters. For the antenna gain, the filling under the actual antenna has a negative effect, while the filling under the frame has a positive effect. In total, this leads to a loss of antenna gain, which must be accepted. The above-mentioned antenna gain of 5.9 dB is already the final value reached; the losses due to the filling structures of approx. 3 dB are already included here. Without filling structures, the antenna with a metallic background would even achieve over 8 dB antenna gain. The antenna impedance is also affected by the filling. Here, however, the change is not tolerable, and it is necessary to counteract the influence with changes in the antenna geometry in order to ensure the match between antenna and transponder circuit. Here arises the problem that the technologically predefined filling structure is so delicate and complex that the space requirement and the computing time for its simulation are beyond any reasonable scope. The filling structure was therefore modeled in a greatly simplified form. The measurements of the test chip must show whether this simplification emulates the influence of the filling structure accurately enough. Despite the simplification, the simulation of the antenna with filling structure remains very computationally intensive; more than 60 hours of simulation time are necessary for a single simulation point. This dramatically complicates the parameter optimization of the antenna. A suitable procedure for this is still to be determined; for the test chip, the adjustment was not necessary and was not carried out for scheduling reasons. The impedance shown in Figure 8 results for the implemented test chip as a simulation result.<br />
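Returning to the reflector result above, one plausible sanity check — our interpretation, not a statement from the paper — is that the favourable 700–730 μm substrate is close to half a guided wavelength in silicon, assuming ε<sub>r</sub> ≈ 11.7.<br />

```python
# Sanity check (our interpretation): half a guided wavelength in
# silicon at 61 GHz, with eps_r = 11.7 assumed for the substrate.

def guided_half_wave_um(freq_hz, eps_r):
    c = 299_792_458.0
    return c / (2 * freq_hz * eps_r ** 0.5) * 1e6

reflector_spacing = guided_half_wave_um(61e9, 11.7)  # ~720 um
```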



Fig. 8. Simulated complex impedance values of the antenna.<br />

V. MEASUREMENT<br />

The designed antenna test chip was manufactured under the name "autag1a" in the target technology. The test chips were then mounted on a circuit-board carrier (Figure 9), and the impedance was measured with a needle-probe test station and a network analyzer. The measurement setup is first calibrated on a specially designed calibration substrate. For this purpose, structures for "short circuit" (0 Ω), "match" (50 Ω) and "open" (∞) are located on the calibration substrate.<br />
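The short/match/open calibration corresponds to the classic three-term one-port error model; a sketch under the assumption of ideal standards (Γ = −1, 0, +1), which a real calibration substrate would refine with manufacturer-supplied standard definitions.<br />

```python
# Three-term one-port error model behind a short / match / open
# calibration, assuming ideal standards: -1 (short), 0 (match), +1 (open).

def one_port_cal(gm_short, gm_load, gm_open):
    """Solve for directivity e00, source match e11 and the combined
    term delta = e00*e11 - e10*e01 from the three raw measurements."""
    e00 = gm_load                       # a perfect load reflects nothing
    e11 = (2 * e00 - gm_short - gm_open) / (gm_short - gm_open)
    delta = gm_short * (1 + e11) - e00  # from the short-standard equation
    return e00, e11, delta

def corrected(gm, e00, e11, delta):
    """De-embed a raw reflection measurement through the error model:
    gamma_actual = (gm - e00) / (gm * e11 - delta)."""
    return (gm - e00) / (gm * e11 - delta)
```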

The measured values obtained are very stable and reproducible; they are also very homogeneous across the various test chips on the printed circuit boards as well as for the test chip without PCB assembly. The measured values initially deviated significantly from the simulated values. After a further refinement of the model with structures in the top metal layer that were initially not included in the simulation model (pads, crack stop), and with better modelling of the filling structure and the measurement environment such as needles and mountings (Figure 11), the measurement could be exactly matched with the simulation (


extensive electronic circuits in the direct area of the antenna<br />

can be taken into account (Figure 13).<br />

All technologically necessary structures, such as fillings to<br />

achieve the required metal coverages, pads and other essential<br />

metal surfaces, which can potentially influence the antenna,<br />

were considered in advance in the design process and allow an<br />

optimal match between antenna and circuit. The influences on<br />

the antenna gain could be recognized and optimized for the<br />

best possible gain.<br />

The design methodology can also be applied to other<br />

transponder antennas in other frequency ranges. Figure 15<br />

shows a set of printed board antennas for a other IPMS<br />

transponder chip in the 869MHz frequency band.<br />

Fig. 13. UHF printed-board antenna with complex electronics.<br />

VIII. CONCLUSION<br />

The article shows the development of an on-chip antenna for the 61 GHz frequency range. The selected target frequency enables chip sizes in the range of 1 mm². The transponder electronics are also located on the chip, so that a very small transponder can be created. The parameters determined by simulation in the design process could be verified on a first sample chip. This is particularly important for the matching of antenna and transponder circuit, since both are inseparable from each other and a subsequent change is not possible. Reasonable antenna parameters can also be achieved on standard wafers with high conductivity of the material. Figure 14 shows a photograph of the realized chip on standard wafer material.<br />

Fig. 15. Photo of several UHF antenna designs.<br />

ACKNOWLEDGMENT<br />

The work on this topic was supported by the project<br />

“PROSECCO: PROduct SECurity and COmmunication” of the<br />

Collaborative R & D project funding of the SAB (Sächsische<br />

AufbauBank) and the Fraunhofer internal Project “Radar-Tag:<br />

System for the authentication of assets”.<br />

REFERENCES<br />

[1] Heiß, M.: "Antennenentwurf für Radio-Frequency-Identification (RFID)-Sensor-Transponder" [Antenna design for RFID sensor transponders], Dissertation, Technical University of Dresden, 2014<br />

[2] Lischer, S.: "A 24 GHz RFID System-on-a-Chip with On-Chip Antenna, Compatible to ISO 18000-6C / EPC C1G2", IEEE COMCAS 2015<br />

[3] Lischer, S.: "Ein ISO 18000-6C / EPC C1G2 kompatibles 24-GHz-RFID-Ein-Chip-System mit integrierter Antenne" [An ISO 18000-6C / EPC C1G2 compatible 24 GHz single-chip RFID system with integrated antenna], MST-Kongress 2015<br />

[4] Fonte, A.; Saponara, S.; Pinto, G.; Neri, B.: "Feasibility Study and On-Chip Antenna for Fully Integrated µRFID Tag at 60 GHz in 65 nm CMOS SOI", IEEE International Conference on RFID-Technologies and Applications, 2011, pp. 457–462<br />

[5] Guo, L. H.: "A Small OCA on a 1 × 0.5-mm² 2.45-GHz RFID Tag – Design and Integration Based on a CMOS-Compatible Manufacturing Technology", IEEE Electron Device Letters, No. 2, February 2006, pp. 96–98<br />

[6] Dagan, H.: "A Low-Power Low-Cost 24 GHz RFID Tag with a C-Flash Based Embedded Memory", IEEE Journal of Solid-State Circuits, Vol. 49, No. 9, September 2014, pp. 1942–1957<br />

Fig. 14. Chip photo of the test antenna.<br />



Demystifying Why Your ADC Does Not Perform To<br />

The Datasheet And What You Can Do To Improve<br />

Performance<br />

Christy She<br />

Connected MCU Systems<br />

Texas Instruments<br />

Dallas, Texas, U.S.<br />

Chris Sterzik<br />

Connected MCU Applications<br />

Texas Instruments<br />

Dallas, Texas, U.S.<br />

Abstract—Noise is a complex problem that challenges even<br />

the most experienced analog engineer working with sensor nodes<br />

in the IoT. The complexity comes from the number and types of<br />

noise sources. These sources can be within the microcontroller<br />

(MCU), on the board, or in the environment. As MCU<br />

integration and speeds increase, the internal noise has<br />

increased as well. Additionally, the rise of the IoT and wireless<br />

connectivity has increased the noise in the environment. This<br />

paper explains why the datasheet doesn’t tell the whole<br />

story of integrated performance and generally represents a<br />

subset of use cases. This subset represents characterization of<br />

individual peripherals and functions with the remainder of the<br />

SoC peripherals and functions in a sleep or idle state. This<br />

paper shows data from real-world examples implementing<br />

different techniques to reduce noise. These examples include<br />

noise introduced by SPI and sub-1 GHz wireless connectivity,<br />

which is generalized to I2C and UART as well as BLE and<br />

Wi-Fi. Using specific use cases and showing how to generalize to<br />

other wired or wireless configurations, MCU developers can<br />

apply the concepts discussed in this paper to successfully<br />

integrate precision analog measurements into their sensor node<br />

designs.<br />

Keywords— Microcontroller; MCU; ADC; analog to digital<br />

converter; differential; high performance ADC; coexistence;<br />

electromagnetic compatibility (EMC); sensor nodes; IoT;<br />

ratiometric measurements; ADC calibration<br />

I. INTRODUCTION<br />

Most analog to digital converters (ADCs) have<br />

configurability that affects their performance. Thus, a single<br />

datasheet value may not cover performance for all possible use<br />

case configurations. ADCs integrated into a microcontroller<br />

(MCU) often have even more configurability in order to<br />

optimize the ADC for power and performance across varied<br />

use cases.<br />

Integrated circuit (IC) manufacturers want to show the best<br />

performance possible; thus, they select the configuration that<br />

shows the best performance. In a few cases, manufacturers will<br />

split parameters to show how specific configurations affect<br />

performance. Therefore, you must pay careful attention to test<br />

conditions including the typical test conditions to know if the<br />

data sheet performance for a parameter of interest applies to<br />

your use case.<br />

The next section goes into the details of why the ADC’s<br />

datasheet performance may not represent your use case and<br />

gives guidance on what performance to expect. It is followed by<br />

a section describing how to maximize ADC performance for<br />

your use case.<br />

II. WHY ADC DATASHEET PERFORMANCE MAY NOT BE APPLICABLE TO YOUR USE CASE<br />

This section lists the common configuration parameters that<br />

affect performance, with some guidelines on how to take a data<br />

sheet whose parametric conditions do not match your use case<br />

and still know what performance you can expect.<br />

A. Reference Choice<br />

There are two main parts to the reference voltage which<br />

affect the performance of the ADC: accuracy and voltage.<br />

1) Accuracy: Accuracy is driven by what reference is<br />

used. For ADCs integrated on an MCU, the reference options<br />

may include (in order of increasing accuracy): the supply,<br />

internal reference, or an external reference (separate chip).<br />

The supply as the reference is the lowest current option but<br />

is usually noisier, as it supplies the digital circuitry (which has<br />

switching noise). One common technique to mitigate or protect<br />

the analog supply from digital switching noise is to use a filter<br />

between the analog and digital supplies if there are separate<br />

pins. Similarly, to isolate noise on the supply from the<br />

reference, connect the external supply to the ADC’s external<br />

reference pin using a ferrite bead (a passive electric<br />

component) and decoupling filters to reduce noise, as shown in<br />

Fig. 1.<br />

Using a ferrite bead is a common practice to isolate noise,<br />

especially between analog and noisy switching digital signals.<br />

Reference [1] provides details about the use of a ferrite bead<br />

and although it is written around a phase-locked-loop (PLL) it<br />



Fig. 1. Example connection of the supply voltage to the ADC external<br />

reference pin. (The digital supply is filtered through a ferrite bead and<br />

decoupling capacitors into the analog supply/reference domain; a reference<br />

multiplexer selects between the external and internal references.)<br />

is applicable to an ADC as well. Also, the supply used for the<br />

ADC reference generally cannot be a direct connection from a<br />

battery, because the voltage will decay over the battery’s<br />

lifetime, whereas the ADC reference voltage must be<br />

known to calculate the ADC converted voltage.<br />

The internal reference typically provides lower noise than<br />

the supply at the cost of increased current consumption. Even<br />

when filtering the supply and applying it to the external<br />

reference path, as described earlier, the internal reference is<br />

typically a lower noise option.<br />

Applications that need better accuracy, especially over a<br />

wide temperature range, may require an external reference.<br />

External references are available with better accuracy and a<br />

lower temperature coefficient/drift (generally the two dominant<br />

error factors). External reference voltages are available with a<br />

temperature coefficient in the single-digit parts-per-million<br />

(ppm)/°C range, versus 25 ppm/°C to 50 ppm/°C for references<br />

integrated into an MCU. For more details on how to select a<br />

voltage reference and example calculations of total reference<br />

voltage error, refer to [2].<br />

There are two alternatives to using an external reference to<br />

improve DC reference accuracy across temperature:<br />

a. Calibration: In production, create a lookup table (or a<br />

single point if the temperature range is small) of the actual<br />

reference voltage (on select devices, some manufacturers<br />

actually measure this during device production and store it on<br />

chip) and use it in software to either correct the raw ADC code<br />

or adjust the ADC result for the inaccurate reference voltage.<br />

Equation (1) is the correction equation:<br />

ADC_corrected = ADC_raw × (measured_VREF / VREF) (1)<br />

where VREF is the ideal ADC reference voltage and<br />

measured_VREF is the measured ADC reference voltage. If<br />

you are correcting across temperature, a temperature<br />

measurement must be taken at the time of the ADC<br />

measurement to know which measured reference voltage value<br />

to use in the lookup table.<br />
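As a sketch of this calibration flow, the following applies equation (1) with a small, purely hypothetical lookup table of measured reference voltages (the table temperatures and voltages are invented for illustration, not taken from any device):

```python
# Sketch of reference-voltage calibration per equation (1).
# The lookup-table values are hypothetical production-calibration data.
VREF_IDEAL = 2.5  # ideal ADC reference voltage (V)

# measured_VREF at a few calibration temperatures (degrees C -> volts)
vref_cal_table = {-40: 2.504, 25: 2.500, 85: 2.496}

def corrected_code(adc_raw, temp_c):
    """Correct a raw ADC code for the actual (measured) reference voltage."""
    # pick the calibration point closest to the measured die temperature
    nearest = min(vref_cal_table, key=lambda t: abs(t - temp_c))
    measured_vref = vref_cal_table[nearest]
    return adc_raw * measured_vref / VREF_IDEAL

print(corrected_code(2048, 30))  # uses the 25 C point: 2048 * 2.500/2.5 = 2048.0
```

A real implementation would typically store the table in nonvolatile memory during production test and interpolate between points rather than picking the nearest one.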

b. Ratiometric measurement: In applications where the<br />

voltage used to excite the sensor is the same voltage used as<br />

the reference for the ADC, the measurement is called<br />

ratiometric. Because the same voltage excites the sensor and<br />

serves as the ADC reference, any error in that voltage is<br />

canceled out. For ratiometric measurements, either an external<br />

reference can be used, or the internal reference if it can be<br />

made available outside the device. You can also take a<br />

ratiometric measurement with a current source exciting the<br />

sensor: place a resistor between the positive and negative<br />

ADC reference pins and route the excitation current through<br />

that resistor. For a detailed example with a resistance<br />

temperature detector (RTD), refer to [3].<br />
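The cancellation can be illustrated numerically. The sketch below assumes a hypothetical resistive sensor in a voltage divider whose excitation voltage also serves as the ADC reference; all component values are invented for illustration:

```python
# Sketch showing why ratiometric measurements cancel excitation-voltage error.
FULL_SCALE = 4095  # 12-bit ADC

def adc_code(v_in, v_ref):
    """Ideal ADC transfer function."""
    return round(FULL_SCALE * v_in / v_ref)

def divider_code(r_sensor, r_top, v_exc):
    # sensor voltage tracks the excitation...
    v_sensor = v_exc * r_sensor / (r_sensor + r_top)
    # ...and the same excitation is the ADC reference, so v_exc cancels
    return adc_code(v_sensor, v_exc)

# A 5% error in the excitation voltage does not change the result:
print(divider_code(1000, 3000, 3.3))    # 1024
print(divider_code(1000, 3000, 3.135))  # 1024 as well
```

Only the resistor ratio reaches the ADC result; the excitation value drops out of the math entirely.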

2) Voltage: If the integrated ADC supports a range for the<br />

input reference voltages, then understanding how the voltage<br />

level affects performance is important. Selecting a lower<br />

reference voltage reduces the least significant bit (LSB) size<br />

so that the overall (full-scale) range is decreased in order to<br />

resolve smaller changes in voltage. This reduction of the<br />

signal via the reference voltage level affects performance, as<br />

shown in the signal-to-noise ratio (SNR) equation (2):<br />

SNR(dB) = 20 × log10(rms_SIGNAL / rms_NOISE) (2)<br />

where rms_SIGNAL is the rms value of the full-scale ADC input<br />

(at most the reference voltage) and rms_NOISE is the rms noise.<br />

Figure 2 shows how the SNR decreases as the reference<br />

voltage decreases. Given the same noise, when the signal is<br />

smaller (in the case of a lower reference voltage), SNR is<br />

lower. Thus, to maximize performance keep in mind the full<br />

dynamic range of the ADC and, if required, to pre-condition or<br />

amplify the ADC input to use the full ADC dynamic range.<br />

When cost is more important than performance, choose the<br />

smallest reference voltage level that will always be larger than<br />

the input signal. For example, assuming an ideal reference, if<br />

an input signal is 1 V max and voltage references of 1 V and 2 V<br />

are available, then amplifying the input by a factor of 2 and<br />

using the 2 V reference would provide better SNR than<br />

measuring the 1 V directly with a 1 V reference.<br />
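Applying equation (2) with an assumed, purely illustrative noise floor shows the benefit quantitatively: doubling the signal (2x gain with the 2 V reference) buys about 6 dB of SNR.

```python
import math

# Sketch of equation (2): SNR for a fixed noise floor as the signal
# amplitude (bounded by the reference voltage) changes.
RMS_NOISE = 100e-6  # 100 uV rms, an assumed constant noise floor

def snr_db(rms_signal):
    return 20 * math.log10(rms_signal / RMS_NOISE)

# a sine spanning a 1 V range has rms amplitude (1/2)/sqrt(2)
snr_1v = snr_db(0.5 / math.sqrt(2))
# amplifying the input 2x and using a 2 V reference doubles the signal
snr_2v = snr_db(1.0 / math.sqrt(2))
print(round(snr_2v - snr_1v, 1))  # 6.0 dB improvement
```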

B. Supply Voltage<br />

MCUs have a fairly wide operating range to support many<br />

applications – specifically battery-powered applications. This<br />

wide range does not always propagate to the ADC, which may<br />

require a higher minimum supply voltage. If a device has this<br />

limitation, then you can find the minimum supply voltage for<br />

ADC operation in the data sheet, usually in an ADC parametric<br />

table row.<br />

Depending on the ADC’s architecture and design, there<br />

may be performance degradation at lower supplies so look<br />

carefully at the test conditions. Data sheets show test<br />

conditions in different ways including: footnotes, a column in<br />

the data sheet, or in the table title. Some datasheets supplement<br />

table entries with graphs that show how performance changes<br />

over voltage or temperature. In a battery powered application,<br />

understanding the performance over the range of the<br />

operational battery voltage is critical to a successful design. If<br />

your application needs a lower supply than the datasheet shows<br />

the ADC parametric at, you should measure the performance at<br />

the minimum supply of your application to know if it meets<br />

your performance requirements.<br />

Also note that when the supply varies, as is the case of a<br />

direct battery connection, some parametric values can change<br />



Fig. 2. SNR vs. reference voltage. (Vertical axis: Signal-to-Noise Ratio,<br />

58 dB to 72 dB; horizontal axis: Reference Voltage, 1.00 V to 3.00 V.)<br />

across the supply voltage range. The power supply rejection<br />

ratio (PSRR) is one measure, but also look for any parameter<br />

with units of per V of supply. Examples of parameters which<br />

may be affected by the supply are gain and offset errors, though<br />

this is ADC-architecture dependent. Some ADCs may be<br />

sub-regulated (with an internal low-dropout regulator (LDO),<br />

for instance) so that they always see the same supply voltage<br />

independent of the device supply; in that case, the ADC only<br />

sees the small ripple at the LDO output.<br />

C. Multiple Modes<br />

MCUs typically offer multiple modes to allow you to<br />

customize ADC tradeoffs such as speed, performance, and<br />

current. Unless your use case matches the data sheet test<br />

conditions, your application will have different performance,<br />

current, and sample-rate limits.<br />

Several things affect the sample rate, including the mode,<br />

conversion clock frequency, and sample time. The device data<br />

sheet will list the minimum sample time for a specific source<br />

resistance and capacitance. But if the source you are measuring<br />

has a larger source resistance, the ADC needs a longer sample<br />

time to maximize its performance. The manufacturer should<br />

document a minimum sample time equation for the ADC in the<br />

datasheet and/or reference manual. Reference [4] provides a<br />

minimum sample time equation and an example calculation for<br />

a specific device.<br />
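The exact equation is device specific, but a generic first-order estimate (settling of the sampling capacitor to within 1/2 LSB) illustrates how source resistance stretches the required sample time; all component values below are assumptions, not taken from any datasheet:

```python
import math

# Generic first-order settling estimate for the minimum sample time.
# The sampling capacitor must charge through the source and internal
# resistance to within 1/2 LSB of an N-bit result, which takes
# (N + 1) * ln(2) time constants.
def min_sample_time(r_source, r_internal, c_sample, n_bits):
    tau = (r_source + r_internal) * c_sample  # RC time constant
    return tau * (n_bits + 1) * math.log(2)

# 10 kOhm source, hypothetical 1 kOhm internal resistance, 10 pF cap, 12 bits
t = min_sample_time(10e3, 1e3, 10e-12, 12)
print(f"{t * 1e9:.0f} ns")  # on the order of 1 microsecond
```

Doubling the source resistance roughly doubles the required sample time, which is why high-impedance sensors often need a buffer in front of the ADC.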

Use the datasheet maximum current values that apply to your<br />

configuration; typical currents can be obtained by characterizing<br />

the current for your application across devices. Some<br />

datasheets will have typical curves showing how current varies<br />

with different configurations. Current is often the result of<br />

multiple parameters. For a more detailed list of low power<br />

features and configurability of one specific ADC refer to [5].<br />

D. Datasheet Use Case Only Has the ADC Operating<br />

To showcase best-case ADC performance, datasheet<br />

performance numbers often use a low-power mode where the<br />

central processing unit (CPU) is not active, to minimize on-chip<br />

noise. And if there is an option to choose between an internal<br />

low-dropout (LDO) regulator and a direct-current-to-direct-current<br />

(DC/DC) converter, the LDO is used to minimize on-chip<br />

noise. Note that some ADC architectures/layouts may be less<br />

sensitive to on-chip noise.<br />

If you have the luxury of limiting what is on during the ADC<br />

measurement, the datasheet performance may be a good<br />

indicator of the level of performance you can reach. But if you<br />

have noisy signals (i.e., high-speed signals, especially clocks)<br />

on your board or in the CPU, it is good to bench-test the<br />

performance early on to make sure the ADC is meeting your<br />

needs. See Section III, How Differential Signaling Can Address<br />

Noise, for more details on what you can do to help this,<br />

including board layout techniques and differential inputs.<br />

E. Datasheet only Considers Noise from the ADC<br />

The previous section discussed additional on-chip noise not<br />

accounted for in datasheet performance numbers. This section<br />

discusses additional noise off-chip (prior to the ADC input<br />

coming on chip). For datasheet performance measurements,<br />

signal generators are used and directly connected to the ADC<br />

so the input signal has very low noise. In a real application, the<br />

input signal has noise from the board/external environment in<br />

addition to any noise from preconditioners in the analog front<br />

end. In the signal chain, the noise of each component in front<br />

of ADC degrades the signal into the ADC; thus, you must<br />

consider the noise of each component to determine the signal<br />

chain performance, not just the ADC performance alone. If<br />

your input signal is only 10 bits due to noise, an 11 effective<br />

number of bits (ENOB) ADC will still only give you 10 bits of<br />

information because the rest of the bits are noise. Examples of<br />

the additional components in front of the ADC include<br />

operational amplifiers (for amplification, filters, or current-to-voltage<br />

conversion), passives for resistor-capacitor (RC)<br />

filtering, and bias voltages.<br />
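One way to budget this is to combine the uncorrelated noise sources by root-sum-of-squares and convert the result to effective bits; the noise figures below are illustrative only, not measurements of any particular part:

```python
import math

# Sketch of a signal-chain noise budget: uncorrelated rms noise sources
# combine by root-sum-of-squares, and the total sets the effective
# number of bits (ENOB) via the standard ENOB = (SINAD - 1.76) / 6.02.
def chain_enob(full_scale, noise_sources_rms):
    total_noise = math.sqrt(sum(n**2 for n in noise_sources_rms))
    # SINAD of a full-scale sine (rms = FS / (2*sqrt(2))) over this noise
    sinad_db = 20 * math.log10((full_scale / (2 * math.sqrt(2))) / total_noise)
    return (sinad_db - 1.76) / 6.02

# ADC alone (assumed 150 uV rms) vs ADC plus a noisy front end (500 uV rms)
print(round(chain_enob(3.3, [150e-6]), 1))          # ADC limit
print(round(chain_enob(3.3, [150e-6, 500e-6]), 1))  # chain limit, ~2 bits worse
```

The larger source dominates the root-sum-of-squares, which is the numerical form of the statement above: a noisy front end caps the information the ADC can deliver regardless of its own ENOB.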

III. HOW DIFFERENTIAL SIGNALING CAN ADDRESS NOISE<br />

(GETTING CLOSE TO THE DATASHEET NUMBERS)<br />

The last two sections (D and E) of the previous section<br />

highlight the significance of noise in hindering the achievement<br />

of datasheet performance. This section focuses on differential<br />

signaling as a means to address both of these components.<br />

Differential signaling is an invaluable tool in the engineers’<br />

toolbox for addressing noise during analog measurements. The<br />

strength of differential signaling is in the simplicity of<br />

removing noise as common-mode. The challenge is designing<br />

a circuit so that the noise is in fact common to both conductors<br />

of the differential pair. This challenge is extended to both the<br />

embedded hardware engineer and the integrated circuit (IC)<br />

designer. In an IC design, a great example of this challenge is<br />

substrate noise. The substrate acts as the bridge or ‘medium’<br />

between a component or peripheral generating noise and the<br />

integrated ADC. Similarly, at the board level, neighboring<br />

digital signals can couple with the analog traces. The strength<br />

of that coupling is often augmented by poor ground structures,<br />

forcing long return paths, which increase electromagnetic field<br />

fringing. Finally, with radiated immunity, the differential<br />

spacing should be relatively small compared to the distance<br />

from the radio. This highlights the use of symmetry in<br />

differential signaling in order to cancel or reject common-mode<br />

signals, such as noise.<br />

A. Addressing ADC Noise Internal to the MCU<br />

At the IC level, the power management architecture can<br />

contribute noise to the system and should be considered when<br />



Fig. 3. A) Internal LDO regulator with single-ended measurements. B) Internal DC/DC regulator with single-ended<br />

measurements. C) Internal DC/DC regulator with differential measurements.<br />

comparing the benefits of one architecture over another. For<br />

internal voltage regulation, the IC may use an LDO or a<br />

DC/DC. While the DC/DC is often the more efficient of the<br />

two, Fig. 3.B shows that the DC/DC also contributes more<br />

noise relative to the LDO in Fig. 3.A. Noise appears as an<br />

increase in the difference between the minimum and maximum<br />

voltages returned by the ADC. In both Fig. 3.A and Fig. 3.B,<br />

the ADC is measuring a DC voltage at approximately 250 ksps<br />

for 32 ms. The variation in the conversion result is more than 6<br />

times greater with the DC/DC than with the LDO.<br />

By comparison, if you were to make the same measurement<br />

with the DC/DC in differential mode, (see Fig. 3C), the overall<br />

noise is decreased and the difference between the LDO and<br />

DC/DC performance is minor. Fig. 3 shows performance in<br />

volts instead of LSBs, with the vertical axis converted to mV,<br />

since the LSB of the differential mode is twice that of the<br />

single-ended mode to account for the support of signed results.<br />

The variance in the differential measurement is less than half<br />

the variance in the single-ended implementation, showing that<br />

the majority of the noise from the DC/DC is seen as<br />

common-mode by the ADC.<br />

The DC voltage being measured is treated as a differential<br />

input, where Vss is the negative input to the ADC. So even<br />

though the signal itself is a single-ended signal, measuring in<br />

differential mode enabled a reduction in the noise and<br />

moreover reduced the noise penalty of using the DC/DC<br />

regulator. This is very good news: engineers can take<br />

advantage of the benefits of the DC/DC while eliminating the<br />

associated noise cost.<br />
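A simple numerical model illustrates the mechanism: ripple that appears identically on both ADC inputs survives a single-ended measurement but cancels in the differential subtraction. The signal level and ripple amplitude are arbitrary assumptions:

```python
import random

# Idealized model of common-mode rejection in a differential measurement:
# regulator ripple couples equally onto both the A+ input and the Vss
# (A-) input, so it cancels in the subtraction.
random.seed(1)
V_SIGNAL = 1.2  # DC voltage under measurement

single_ended, differential = [], []
for _ in range(1000):
    cm_noise = random.gauss(0, 0.010)  # assumed 10 mV rms common-mode ripple
    v_pos = V_SIGNAL + cm_noise
    v_neg = 0.0 + cm_noise             # the Vss input sees the same ripple
    single_ended.append(v_pos)         # measured against a quiet ground
    differential.append(v_pos - v_neg) # common-mode term cancels

spread_se = max(single_ended) - min(single_ended)
spread_diff = max(differential) - min(differential)
print(spread_se, spread_diff)  # the differential spread collapses toward 0
```

In this idealized model the cancellation is perfect; the residual noise seen in Fig. 3.C corresponds to the portion of the real ripple that is not perfectly common to both inputs.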

B. Addressing ADC Noise External to the MCU (Neighboring<br />

Signals)<br />

The noise from the internal regulator is only one possible<br />

source of noise. Other possible noise sources can be<br />

neighboring digital signals, such as I2C or SPI<br />

communications, as well as digital stimuli like a pulse-width<br />

modulated (PWM) waveform. As a general rule it is<br />

recommended to keep these signals as physically far away as<br />

possible from the ADC pins and if possible, inactive during the<br />

ADC measurements. Typically, most IC manufacturers<br />

intentionally keep digital signals away from the analog by<br />

creating dedicated analog pins. In smaller packages, however,<br />

some digital functions may be multiplexed with analog pins or<br />

the digital input/output (I/O) pins can be neighboring the<br />

analog pins. In Fig. 4, the analog input is located immediately<br />

next to a 48-MHz clock output (full rail-to-rail swing) to<br />

represent an SPI clock.<br />

As shown in Fig. 5 and Fig. 6, the increase in noise<br />

(variance) seen with the addition of the neighboring clock<br />

output is greater with the single-ended measurement as<br />

compared with the differential. In the single-ended case, only<br />

the signal A+ is used and the complementary input is left in<br />

general-purpose I/O (GPIO) mode and actively driven low, to<br />

DVSS. In the differential case, the complementary input is<br />

externally connected to AVSS (see Fig. 6).<br />

Although small when compared to the single-ended<br />

example, the differential result indicates that noise is still<br />

present. As a point of discussion, it is important to notice that<br />

the clock is relatively close to the positive leg of the differential<br />

measurement when compared to the separation between the<br />

positive and negative signals of the differential pair. Therefore<br />

the relative coupling will not be equal and the noise will not<br />

appear completely as common-mode. Additionally, the printed<br />

circuit board (PCB) layers below the top signal layer are not<br />

shown which would show the signal return paths. This is a<br />

four-layer PCB, with the 3rd layer providing an almost<br />

completely solid plane so that return currents may follow the<br />

‘path of least impedance’ [6], which for high frequency signals,<br />

such as the 48MHz clock will be directly below the trace. The<br />

second layer provides reference voltages and is split in several<br />

places complicating the coupling between the signal and<br />

ground plane return path. While a more complicated (greater<br />

than 4 layer) PCB can be used to help bring the ground plane<br />

closer to the signal, most issues can be resolved by simply<br />

moving the ‘aggressor’ signal (SPI clock) farther away from<br />

the ‘victim’ (A+/-). Another point to make is the orientation<br />

of the clock signal relative to the analog input(s). Keeping the<br />

Fig. 4. Clock adjacent to ADC input.<br />



Fig. 5. Crosstalk: ‘A’ induced from the adjacent clock onto the single-ended ADC input vs. ‘No Noise’.<br />

Fig. 6. Crosstalk: ‘A’ induced from the adjacent clock onto the differential ADC input vs. ‘No Noise’.<br />

signals separated may not always be possible or worse, the<br />

signals may need to cross. In order to keep coupling to a<br />

minimum, signals should not run in parallel and when needed<br />

cross at a 90 degree angle [6].<br />

As a final note, the idea of noise coupling into the ADC<br />

from neighboring signals is not limited to the ADC input pins.<br />

In the case of an external reference, noise could also be<br />

coupled into the reference before entering the IC and similar<br />

precautions should be taken.<br />

C. Addressing ADC Noise from RF Sources<br />

Making analog measurements coincident with wireless<br />

communication is typically not recommended; in practice, any<br />

communication is done after a measurement, in order to<br />

convey a subset or summary of the<br />

measurement(s). The radio source used in Fig. 7 was an<br />

evaluation module (EVM) which was transmitting 100 random<br />

packets at 50kB (868 MHz, 2-GFSK, 2 kHz deviation). The<br />

EVM was placed adjacent to the MCU test board, so that the<br />

MCU (and ADC) under test was approximately 6cm from the<br />

EVM PCB antenna. Fig. 7 shows that the differential<br />

configuration is superior in noise immunity to the single-ended<br />

one.<br />

Again, the key is that the energy is induced or coupled<br />

uniformly on both the positive and negative inputs of the<br />

differential ADC, so the signal is rejected as common-mode.<br />

And again, the differential is far from ideal and merits<br />

discussion on the potential sources.<br />

The most notable difference between the experiments with<br />

the clock and sub-1GHz radio is the relative coupling area,<br />

shown in Fig. 8. In the case of the clock the coupling area was<br />

most related to where the clock trace ran parallel with the ADC<br />

input lines. After this parallel run the signals diverged: the<br />

ADC signals went off-board to the voltage source being<br />

measured, while the SPI signal terminated at another receiver<br />

input.<br />

It is the off-board connection with minimal shielding which<br />

provides a potential path for the radio energy to couple into the<br />

ADC. Moreover, any differences in electrical length between<br />

the positive and negative inputs to the ADC can cause the<br />

coupled noise to be differential rather than common-mode. One<br />

powerful way to minimize differences in electrical length<br />

between the positive and negative inputs of the ADC is by<br />

designing signal paths which are symmetrical. Fig. 9 is taken<br />

from [8], with the different axes of symmetry for the inputs and<br />

outputs highlighted.<br />

The testing in this section was intended to show the breadth<br />

of improvement made available by differential signaling. The<br />

improvement was seen at an application or implementation<br />

level with interference from a neighboring radio, which can be<br />

applied to Bluetooth and Wi-Fi applications where<br />

electromagnetic compatibility (EMC) is needed.<br />

Fig. 7. Noise induced from a nearby radio.<br />

Fig. 8. Off-board routing of differential inputs.<br />

Improvement was also seen at the board level with<br />

cross-coupling (crosstalk) from a neighboring digital signal.<br />

And finally, improvement was even seen at the IC level, where<br />

a noisy regulator was chosen to achieve lower-power operation<br />

and the sacrifice in ADC performance was mitigated.<br />

IV. CONCLUSION<br />

While differential signaling can be a great tool to achieve<br />

the ADC performance found in datasheets, it cannot supersede<br />

the need to understand the datasheet parameters. Interpreting<br />

ADC datasheet performance to see which device will meet<br />

your needs can be difficult due to all of the nuances and<br />

dependencies of performance described in this paper. This<br />

paper touched on some of the main performance dependencies<br />

and provided tips and trends to help you decide from the<br />

datasheet whether an ADC may meet your performance needs.<br />

In some cases, bench tests in your lab using your specific<br />

configuration may still be required. This paper covered the<br />

main points to look for, but remember there are more.<br />

An application using an ADC cares about the whole analog<br />

front end’s performance. This paper covered the ADC and the<br />

voltage reference, but additional analog front-end pieces which,<br />

if present, must be considered are gain stages, filters, and bias<br />

voltages. The ADC is the last piece of the analog front end, but<br />

additional post-conversion digital filtering can further improve<br />

performance. Also, if the ADC samples well above the<br />

Nyquist rate of the input signal, over-sampling can be<br />

implemented at the system level to improve SNR, as out-of-band<br />

quantization and thermal noise can be filtered out [9].<br />

Fig. 9. An example of signal path symmetry.<br />
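The averaging gain can be sketched numerically: with uncorrelated noise, averaging an oversampled signal reduces the rms noise by the square root of the oversampling ratio (roughly one extra bit per 4x), which is the system-level effect the text describes:

```python
import random

# Monte Carlo sketch of oversampling and averaging with uncorrelated,
# unit-variance noise (values are illustrative, not device data).
random.seed(0)

def rms_error(n_samples, n_trials=2000):
    """rms of the mean of n_samples noisy readings of a zero signal."""
    err2 = 0.0
    for _ in range(n_trials):
        avg = sum(random.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
        err2 += avg * avg
    return (err2 / n_trials) ** 0.5

# noise of the average drops by sqrt(OSR): 16x oversampling -> ~4x lower
print(round(rms_error(1) / rms_error(16), 1))  # close to 4.0
```

This only holds while the noise is uncorrelated between samples; correlated interference (a tone from a nearby clock, for example) does not average away and must be handled by the layout and filtering techniques discussed earlier.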

A good starting point to learn more about ADCs is Texas<br />

Instruments’ Precision Labs online classroom, with on-demand<br />

courses and tutorials. The Introduction to Analog and Digital<br />

Converters section explains many of the electrical parametric<br />

values you find in ADC datasheets [10].<br />

REFERENCES<br />

[1] K. Mustafa, “Filtering Techniques: Isolating Analog and<br />

Digital Power Supplies in TI’s PLL-Based CDC<br />

Devices,” Texas Instruments, Application report<br />

SCAA048, October 2011. [Online]. Available:<br />

http://www.ti.com/lit/an/scaa048/scaa048.pdf<br />

[2] D. Megaw, “Voltage Reference Selection Basics,”<br />

SNVA602. [Online]. Available:<br />

http://www.ti.com/lit/an/snva602/snva602.pdf<br />

[3] C. Hall, “It’s in the math: how to convert an ADC code<br />

to a voltage (part 2),” Texas Instruments E2E<br />

Community blog post [Online]. Available:<br />

https://e2e.ti.com/blogs_/archives/b/precisionhub/archive/<br />

2016/04/29/it-39-s-in-the-math-how-to-convert-an-adccode-to-a-voltage-part-2<br />

[4] MSP432P4xx Family Technical Reference Manual page<br />

848 22.2.6.3 Sample Timing Considerations, SLAU356<br />

December 2017. [Online]. Available:<br />

http://www.ti.com/lit/ug/slau356h/slau356h.pdf<br />



[5] C. She, “Top 12 ways to achieve low power using the<br />

features of an integrated ADC,” [Online]. Available:<br />

http://e2e.ti.com/blogs_/b/msp430blog/archive/2016/06/0<br />

6/top-12-ways-to-achieve-low-power-using-the-featuresof-an-integrated-adc<br />

[6] N. Gray, “The problem of ADC and mixed-signal<br />

grounding and layout for dynamic performance while<br />

minimizing RFI/EMI,” Texas Instruments, SNAA113.<br />

[Online]. Available:<br />

http://www.ti.com/lit/wp/snaa113/snaa113.pdf<br />

[7] DAC3482 Data Sheet, Section 10.1 Layout Guidelines,<br />

Texas Instruments, SLAS7487768. [Online]. Available:<br />

http://www.ti.com/product/DAC3482/datasheet/layout#S<br />

LAS7487768<br />

[8] X. Ramus, “The PCB Layout for Low Distortion High-<br />

Speed ADC Drivers,” Texas Instruments, application<br />

report SBAA113, April 2014. [Online]. Available:<br />

http://www.ti.com/lit/an/sbaa113/sbaa113.pdf<br />

[9] “Precision ADC with 16-bit Performance,” Texas<br />

Instruments, application report SLAA821. [Online].<br />

Available: http://www.ti.com/lit/an/slaa821/slaa821.pdf<br />

[10] Texas Instruments Precision Labs - ADCs [Online].<br />

Available: https://training.ti.com/ti-precision-labs-adcs<br />



Embedded Algorithms for Motion Detection and<br />

Processing<br />

Smart sensors with embedded configurable algorithms and machine learning<br />

processing software pave the way to advance innovation and reduce consumption at<br />

system level<br />

M. Castellano¹, R. Bassoli², M. Bianco¹, A. Cagidiaco¹, C. Crippa², M. Ferraina¹, M. Leo¹, S.P. Rivolta¹<br />

¹STMicroelectronics, Castelletto, Italy; ²STMicroelectronics, Agrate, Italy<br />

marco.castellano@st.com<br />

Abstract—MEMS inertial modules are powerful and versatile<br />

converging technologies: mechanical and electronic functions are<br />

merged into a single component, ready to offer physical data to<br />

users about the environment (through wearables or equipment on<br />

which a sensor is mounted). In recent years, drastic reduction in<br />

the power consumption of inertial sensors has opened the door to<br />

a new world of applications. IoT is certainly one, but not the only<br />

example of what can be achieved using battery-operated devices.<br />

This technology is ubiquitous, and innovative smart sensors, able<br />

to further reduce energy consumption and recognize and interpret<br />

their environment autonomously, are on the horizon. New sensors<br />

are able to provide the application with the right feedback<br />

precisely when the application needs it. This paper introduces a<br />

programmable and configurable embedded digital module which<br />

further reduces system power consumption, moving part of the<br />

intelligence into the sensor, and thus keeping the main processor<br />

in sleep mode. The digital module is composed of two embedded<br />

reconfigurable blocks able to solve two main sets of application<br />

requirements. The first block has been developed for systematic<br />

motion recognition using a reconfigurable Finite State Machine;<br />

application examples are motion/no-motion, human gesture and<br />

industrial applications. The second block has been developed for<br />

statistical-based context awareness; using a decision-tree<br />

approach it is possible to perform human activity recognition<br />

(stillness, walking, vehicle motion, etc.), carry-position detection<br />

(on wrist, in pocket, on table, etc.) and machine activity and<br />

movement recognition. These two blocks can be programmed and<br />

mutually concatenated by using a simple GUI running on a<br />

common PC to exploit the full configurability of the digital module<br />

and to meet user needs easily, quickly and effectively. The<br />

embedded digital module allows moving all or part of the<br />

algorithm elaboration to a custom, low-power environment on the<br />

sensor side, reducing communication to the main processor, and<br />

thus reducing overall power consumption.<br />

Index Terms— MEMS, smart sensor, sensor networks, low<br />

power, autonomous system, embedded algorithms, gesture<br />

recognition, context awareness, machine learning, decision tree,<br />

IoT.<br />

I. INTRODUCTION<br />

During the past 10 years, the number of IoT applications has<br />

increased exponentially. Most IoT applications involve<br />

measuring a physical quantity in a location that may not already<br />

have a power source available. It’s often not feasible to add<br />

wiring, so a battery solution is a preferred option, and wireless<br />

connectivity for data transmission is a must. At a minimum, the<br />

IoT application needs a sensor to get data, a medium over which<br />

to transmit, and a battery to supply power to both operations. In a design of this type, a trade-off arises: maximize battery life, or maximize the frequency of data transmission?<br />

A key tool available to the application designer to manage<br />

this trade-off is an elaboration unit, which can perform the<br />

measurements and transmission effectively and efficiently. The<br />

computational unit is usually a general-purpose microcontroller,<br />

targeted for low-power consumption. Since wireless data transmission dominates power consumption with respect to the other processes involved, the strategy in IoT application design is to move computation to the node side whenever doing so reduces communication. For example, let’s<br />

suppose that we have to design a healthcare product which<br />

sounds an alarm when the standard deviation of a certain<br />

measured parameter exceeds a given threshold. A good design choice, considering battery longevity, is to run the algorithm on the transmitter side so that wireless transmission occurs only during the alarm event.<br />
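The healthcare example above can be sketched in a few lines (an illustration only; `radio_send` is a hypothetical callback standing in for the radio driver, not an API from the paper):<br />

```python
import math

def stddev(samples):
    """Population standard deviation of one measurement window."""
    n = len(samples)
    mean = sum(samples) / n
    return math.sqrt(sum((x - mean) ** 2 for x in samples) / n)

def process_window(samples, threshold, radio_send):
    """Run the algorithm on the transmitter side: power the radio
    only when the alarm condition is met, otherwise stay silent."""
    if stddev(samples) > threshold:
        radio_send(b"ALARM")  # hypothetical radio driver callback
        return True
    return False              # no transmission, no radio energy spent
```

Only the windows that actually trigger the alarm cost radio energy; every quiet window is handled entirely on the node.<br />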

The objective of this paper is to introduce a new step in the<br />

reduction of product power consumption thanks to innovative<br />

sensors. The new inertial module LSM6DSOX from<br />

STMicroelectronics allows moving all or part of the algorithm<br />

elaboration to a custom low-power environment in the sensor.<br />

The broad configurability of this approach guarantees a wide<br />

spectrum of applications. This paper is organized as follows: the next section presents the rationale behind the embedded algorithms and their advantages in an application example. Two sections then describe the embedded algorithms themselves. The last section is dedicated to the supporting software, which is user-friendly and can be rapidly adapted to the creation of new applications.<br />



II. EMBEDDED ALGORITHMS SCENARIO<br />

As introduced in the previous section, a simple model of an<br />

IoT application is composed of a transmitter/receiver apparatus,<br />

an elaboration unit, actuators or sensors connected to the<br />

elaboration unit, and a battery. In order to show the advantages<br />

of embedded algorithms on the sensor side, we’ll introduce an<br />

application use case. Although we provide an example, the<br />

subject can be easily extended to many other use cases.<br />

The analyzed use case is a smart bracelet able to perform user-activity recognition and give feedback on it: how long the user has walked, has been in a vehicle, etc. Of course the smart bracelet should also show the date and time to the user, prompted by a wrist-tilt gesture. A key component to perform both transmission and elaboration is a Bluetooth low-energy system-on-chip [4].<br />

This solution embeds a complete Bluetooth network<br />

processor and an elaboration unit for running application code.<br />

The elaboration unit is composed of a low-power<br />

microcontroller, NVM memory for user programs, memories<br />

for data and programming (mirror of NVM) and common<br />

interfaces (SPI, I²C, etc.). With this kind of solution, the<br />

application is able to read/write to the sensor and actuators from<br />

the interfaces, execute an algorithm computation and connect<br />

to an end component (computer/smartphone) by means of<br />

Bluetooth communication protocol. A rough power budget of<br />

this solution can be estimated from the following proposed<br />

system example. The “smart” Bluetooth module with<br />

embedded microcontroller in general has different power<br />

modes. The most common modes are listed below:<br />

a) Sleep mode: this mode is used to minimize power<br />

consumption by turning off or putting most of the internal<br />

blocks in low-power conditions. In sleep mode an interrupt is<br />

monitored for waking up, or an internal RTC timer can be used.<br />

Plenty of options can be configured for blocks in the system in<br />

this mode. Exiting from this mode requires some time to regain<br />

full operative mode (0.5-2 ms). Current consumption of this<br />

mode is in the 0.5-2 µA range.<br />

b) Microcontroller active mode: radio transmitter/receiver<br />

off, microcontroller fully-active elaborating. Current<br />

consumption range of this mode is 1-3 mA.<br />

c) Radio transmitting/receiving mode: Device is<br />

communicating, power consumption is 3-20 mA.<br />

The “smart” Bluetooth power consumption values above are just a rough estimation based on the product datasheet; the purpose of this exercise is to show the advantage of embedding algorithms on the sensor side. Before making the full computation of current consumption in this smart-bracelet case, other useful assumptions must be defined. First, the microcontroller in the smart Bluetooth module is connected to an inertial module using an I²C/SPI interface, configured to<br />

generate sensor data at 25 Hz Output Data Rate. The<br />

microcontroller exits from sleep-mode, reads data from the<br />

sensor and executes the activity-recognition algorithm every<br />

time a sample is generated using an embedded 16 MHz clock<br />

domain. The high-quality activity-recognition algorithm case<br />

requires a mean elaboration time of 4 ms. The Bluetooth<br />

transmission is sporadic on user request (once a day).<br />

Figure 1: Microcontroller sleep to active timing<br />

Figure 1 shows the timing of the duty cycle of the microcontroller while running the algorithm. Tstart is the turn-on time of the microcontroller, Talgo is the execution time of the algorithm, and Todr is the time between sensor reads.<br />

A basic formula for the mean current ITOT, containing its main contributors, is:<br />
ITOT = IBUS + ISLEEP + falgo · IUCORE · (Tstart/2 + Talgo)<br />
IBUS is the current related to the interface bus read; for the SPI bus the contribution should be < 1 µA, while for I²C the range is roughly 2-5 µA. The other variables have already been introduced. A power budget can now be estimated for the sensor-plus-microcontroller system, since radio transmission has been considered negligible due to its sporadic nature. Taking the middle of the declared range of each parameter, an ITOT of around 230 µA is obtained.<br />

The LSM6DSOX embedded algorithm, reconfigured for implementing “activity recognition”, requires less than 8 µA. We would like to point out that we are referring to the same high-quality algorithm, with exactly the same performance as when running on a microcontroller. A significant advantage of the embedded solution is that the data is already available inside the component, so the IBUS contribution is absent. Another contribution which is completely missing in the embedded solution is Tstart, needed for safely exiting the microcontroller sleep state. Estimating ITOT with the two terms Tstart and IBUS set to zero leaves about 200 µA, essentially the falgo · IUCORE · Talgo term; compared with the 8 µA of the embedded solution, porting the algorithm from the microcontroller to the sensor thus yields roughly a 25-fold reduction in power consumption.<br />
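Plugging the mid-range values from the text into the formula gives the budget above (a back-of-the-envelope sketch; the exact figures depend on the actual devices chosen):<br />

```python
# Mid-range parameter values from the text (units: A, s, Hz)
I_BUS   = 3.5e-6    # I2C read current, middle of the 2-5 uA range
I_SLEEP = 1.25e-6   # sleep current, middle of the 0.5-2 uA range
I_UCORE = 2.0e-3    # active MCU current, middle of the 1-3 mA range
T_START = 1.25e-3   # wake-up time, middle of the 0.5-2 ms range
T_ALGO  = 4.0e-3    # mean algorithm execution time
F_ALGO  = 25.0      # 25 Hz output data rate -> one run per sample

def mean_current(i_bus, t_start):
    """ITOT = IBUS + ISLEEP + falgo * IUCORE * (Tstart/2 + Talgo)."""
    return i_bus + I_SLEEP + F_ALGO * I_UCORE * (t_start / 2 + T_ALGO)

mcu_budget = mean_current(I_BUS, T_START)  # ~236 uA, "around 230 uA"
no_overhead = mean_current(0.0, 0.0)       # ~200 uA with IBUS, Tstart at zero
ratio = no_overhead / 8e-6                 # vs. <8 uA on the sensor: ~25x
```

The ~200 µA remainder divided by the 8 µA of the sensor-side implementation is where the roughly 25-fold figure comes from.<br />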

Where does the magic come from? The main consideration<br />

comes from the general-purpose microcontroller versus the<br />

application-specific digital logic. STMicroelectronics has been<br />

a pioneer and a leader in inertial MEMS modules since the<br />

beginning of the MEMS era. The ST software library and<br />

customer requests are well known and consolidated, so the<br />

strategy has been to collect and divide the most common<br />

application use cases into two sets. The first set is composed of<br />

algorithms well-suited to using a Finite State Machine, the<br />

second set is based on applications which need statistical<br />

analysis (based on pattern analysis) and that can be implemented<br />

with a decision tree in an effective way. For the two sets,<br />

a collection of “metacommands” has been implemented, in order<br />

to cover existing algorithms, and to guarantee wide reconfigurability<br />

for new custom requests. At the end of the process, an arithmetic analysis was carried out to find the most effective low-power custom arithmetic logic.<br />

614


Arithmetic simplification has been done to tailor to application<br />

needs, without impacting algorithm performance. In the two<br />

following sections the two blocks and the metadata are<br />

presented.<br />

III. MOTION DETECTION FINITE STATE MACHINE<br />

The purpose of the FSM block is to provide tools that allow<br />

writing compact programs able to recognize user gestures. Each<br />

gesture requires a specific PROGRAM, thus many programs<br />

can be written and concatenated, making an array of programs<br />

named PROGMEM, as shown in Figure 2, to be processed by<br />

an interpreter resident in ROM.<br />

Each program is made of two parts, a data section and an<br />

instructions section. In more detail, a data section is made of a<br />

fixed length part, present in all the programs, and a variable<br />

length part, whose size is specific for each program. Finally, the<br />

instructions section, i.e. the executable part, is made of<br />

conditions and commands, the latter sometimes requiring<br />

parameters to be executed.<br />

If neither the reset nor the next condition is true, the next sample is awaited and both conditions are evaluated again.<br />

Inside the fixed data section, two bytes store respectively the<br />

address of the reset instruction and the address of the current<br />

instruction, i.e. the program pointer is updated every time a next<br />

condition is true or forced to the reset address in case a reset<br />

condition is true.<br />

Since a condition is coded over four bits, a maximum of sixteen<br />

different conditions can be coded. There are four types of<br />

conditions, namely timeouts, threshold comparisons, zero-crossing detection and decision-tree checks. Timeout conditions are true when a counter, preset with a timeout value, reaches zero, while threshold comparisons are true when enabled inputs (such as the accelerometer X, Y, Z axes or the norm V = √(X² + Y² + Z²)) are higher (or lower) than a programmed threshold; zero-crossing detection is true when an enabled input crosses zero, and the decision-tree check condition is true when<br />

the tree result matches the expected result. If the counter has not<br />

yet reached zero or an enabled input is not yet higher (or lower)<br />

than a programmed threshold, or no zero-crossing event has<br />

been detected, or the decision-tree result does not match, then<br />

the condition is false and the program pointer is not updated:<br />

when the next input sample arrives, the conditions are evaluated<br />

again until one of the two (reset or next) becomes true.<br />
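The reset/next evaluation described above can be modeled behaviorally (a sketch for illustration, not the silicon interpreter; only four of the sixteen conditions are shown, with nibble codes taken from TABLE I.):<br />

```python
# Condition nibbles from TABLE I (only a subset is modeled here)
NOP, TI1, GNTH1, LNTH1 = 0x0, 0x1, 0x5, 0x7

def eval_condition(code, sample, ctx):
    """Return True when the 4-bit condition holds for the current sample."""
    if code == NOP:
        return False                  # never advances on its own
    if code == TI1:
        ctx["t1"] -= 1                # timeout: preset counter reaches zero
        return ctx["t1"] <= 0
    if code == GNTH1:
        return sample > ctx["thrs1"]  # enabled axis above threshold 1
    if code == LNTH1:
        return sample <= ctx["thrs1"] # enabled axis at/below threshold 1
    raise ValueError("condition not modeled in this sketch")

def step(cond_byte, sample, ctx):
    """One FSM step: reset has priority over next; otherwise wait."""
    reset, nxt = cond_byte >> 4, cond_byte & 0x0F
    if reset != NOP and eval_condition(reset, sample, ctx):
        ctx["pp"] = ctx["rp"]         # jump back to the reset pointer
    elif eval_condition(nxt, sample, ctx):
        ctx["pp"] += 1                # advance to the next line of code
    # else: program pointer unchanged; wait for the next input sample
```

With the condition byte 0x73, for instance, the high nibble (LNTH1) is the reset condition and the low nibble (TI3 in the real device, TI1 in this sketch) is the next condition.<br />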

Figure 2: Programs organization in memory<br />

The gesture recognition interpreter decodes each instruction of<br />

a program’s instructions section and executes it by operating on<br />

data located in the data section. Each program recognizing a<br />

specific gesture realizes a simple, programmable FSM and is totally independent of the other programs.<br />

The instructions section is the operative part of the FSM. It is<br />

made of conditions and commands. Instructions are also called<br />

states. Each condition is coded in one byte; the highest nibble<br />

codes a reset condition, while the lowest nibble codes a next condition.<br />

A condition is one line of code. Any time a program is executed,<br />

each condition is evaluated in both its parts, i.e. reset and next,<br />

with this priority. If a reset condition is true, the code is<br />

restarted from the beginning, whereas if a next condition is true<br />

the execution progresses to the next line of code.<br />

Figure 3: Example of a program instruction section<br />

TABLE I. shows the possible conditions, TABLE II. shows all<br />

the available commands. TABLE III. shows an example of the<br />

instruction section of a PROGRAM, mixing conditions and<br />

commands.<br />

TABLE I. CONDITIONS<br />
0x0 NOP: No execution on current sample<br />
0x1 TI1: Timeout 1 expired<br />
0x2 TI2: Timeout 2 expired<br />
0x3 TI3: Timeout 3 expired<br />
0x4 TI4: Timeout 4 expired<br />
0x5 GNTH1: Any triggered axis > THRS1<br />
0x6 GNTH2: Any triggered axis > THRS2<br />
0x7 LNTH1: Any triggered axis ≤ THRS1<br />
0x8 LNTH2: Any triggered axis ≤ THRS2<br />
0x9 GLTH1: All triggered axes > THRS1<br />
0xA LLTH1: All triggered axes ≤ THRS1<br />
0xB GRTH1: Any triggered axis > −THRS1<br />
0xC LRTH1: Any triggered axis ≤ −THRS1<br />
0xD PZC: Any triggered axis crossed zero, positive slope<br />
0xE NZC: Any triggered axis crossed zero, negative slope<br />
0xF CHKDT: Check result from decision tree vs. expected<br />
TABLE II. COMMANDS<br />

0x00 STOP: Stop execution and wait for new start<br />
0x11 CONT: Continue execution from reset-point<br />
0x22 CONTREL: Like CONT but reset temporary mask<br />
0x33 SRP: Set reset-point to next address/state<br />
0x44 CRP: Clear reset-point to first program line<br />
0x55 SETP: Set parameter in the program data section<br />
0x66 SELMA: Select MASKA and TMASKA as current mask<br />
0x77 SELMB: Select MASKB and TMASKB as current mask<br />
0x88 SELMC: Select MASKC and TMASKC as current mask<br />
0x99 OUTC: Write the temporary mask in the output register<br />
0xAA STHR1: Set new value to THRESH1<br />
0xBB STHR2: Set new value to THRESH2<br />
0xCC SELTHR1: Select THRESH1 instead of THRESH3<br />
0xDD SELTHR3: Select THRESH3 instead of THRESH1<br />
0xEE SISW: Swap sign to opposite in selected mask<br />
0xFF REL: Reset temporary mask to default<br />
0x12 SSIGN0: Set UNSIGNED comparison mode<br />
0x13 SSIGN1: Set SIGNED comparison mode<br />
0x14 SRTAM0: Do not reset temporary mask after a next condition is true<br />
0x21 SRTAM1: Reset temporary mask after a next condition is true<br />
0x23 SINMUX: Set input multiplexer<br />
0x24 STIMER3: Set new value to TIMER3 register<br />
0x31 STIMER4: Set new value to TIMER4 register<br />
0x32 SWAPMSK: Swap mask selection MASKA ↔ MASKB<br />
0x34 INCR: Increase long counter +1<br />
0x41 JMP: Jump address for two Next conditions<br />
0x42 CANGLE: Clear angle<br />
0x43 SMA: Set MASKA and TMASKA<br />
0xDF SMB: Set MASKB and TMASKB<br />
0xFE SMC: Set MASKC and TMASKC<br />
0x5B SCTC0: Clear the Time Counter TC on next condition true<br />
0x7C SCTC1: Don’t clear the Time Counter TC on next condition true<br />
0xB5 SETR: Set external registers at given address with given data<br />
0xC7 UMSKIT: Unmask interrupt generation when setting OUTS<br />
0xEF MSKITEQ: Mask interrupt if OUTS does not change<br />
0xF5 MSKIT: Mask interrupt generation when setting OUTS<br />

TABLE III. EXAMPLE OF A PROGRAM INSTRUCTION SECTION<br />
STATE 0 (NOP GNTH1): go to the next state if an enabled input is greater than TH1; otherwise wait<br />
STATE 1 (LNTH1 TI3): stay over TH1 for TI3 seconds; if the input goes down before, restart from STATE 0<br />
STATE 2 (OUTC): after TI3 seconds, output the temporary mask and the interrupt<br />
STATE 3 (SRP): set the reset pointer to STATE 4<br />
STATE 4 (NOP LNTH1): go to the next state if the triggered input is lower than TH1; otherwise wait<br />
STATE 5 (GNTH1 TI3): stay under TH1 for TI3 seconds; if the input goes up before, restart from STATE 4<br />
STATE 6 (CRP): after TI3 seconds, clear the reset pointer to STATE 0<br />
STATE 7 (CONTREL): output the temporary mask and the interrupt, reset the temporary mask and continue from STATE 0<br />


The interpreter decodes instructions and executes them; actions<br />

are based on data stored inside the data section.<br />

Each program realizing an FSM able to recognize a specific<br />

gesture has its own data set. For example, the instruction section in the previous example needs the following data to work properly:<br />

- Threshold 1 value<br />

- Timeout 3 value<br />

- Mask to enable the relevant axis among XYZV to trigger<br />

the events<br />

- Timer to count wait times<br />

These are the most commonly used data, also called<br />

resources; however five more data are available to be declared<br />

and used:<br />

- Hysteresis value, to be added/subtracted to/from threshold<br />

values when performing comparisons of the kind “greater<br />

than” / ”less than”<br />

- Decimation mechanism, in case the FSM has to be<br />

executed not at every input sample but at a lower<br />

frequency; in these cases two bytes must be reserved, one<br />

with the decimation value and another for the decimation<br />

counter<br />

- Previous axis sign (PAS), declared and used in case a zero-crossing condition PZC or NZC is present in the<br />

instructions section, to store the previous input sample<br />

XYZV signs<br />

- Memory locations to store gyroscope integrated angles<br />

and ODR period duration when using FSM with such<br />

input data<br />

- Decision-tree interface handling<br />

Simpler programs than the example above could use less data, whereas more complex programs could use more (e.g. additional thresholds, timeouts and/or masks). In order to implement an efficient and effective data-instructions structure, a fixed data section, present in all the<br />

programs, stores information about the amount of resources to<br />

be used by the program. The user must carefully fill it and<br />

reserve memory locations accordingly in the variable data<br />

section.<br />

Six bytes store information about the variable-data section and<br />

the instructions section:<br />

CONFIG_A: masks, thresholds, long timeouts, short timeouts<br />

CONFIG_B: decimation, hysteresis, gyro angles, PAS<br />

SIZE: length in bytes of the whole program data + instructions<br />

SETTINGS: flags used in the instruction section processing<br />

PP: program pointer<br />

RP: reset pointer<br />

The variable data section is normally different in size between<br />

two FSM programs. It collects all parameters needed by the<br />

program, such as masks (1-3), thresholds (1-3), timeouts (1-4)<br />

etc. The resources in the variable data part are declared in the<br />

CONFIG_A and CONFIG_B bytes belonging to the fixed data<br />

part. In this way the interpreter can easily process the variable<br />

data part of a given program knowing exactly what is stored<br />

there owing to the information obtained from the fixed data<br />

part.<br />

Returning to the previous TABLE III. example, TABLE IV.<br />

and TABLE V. show the data section consistent with the<br />

instruction section.<br />

TABLE IV. EXAMPLE OF A FIXED DATA PART<br />
CONFIG_A = 01010001: 1 mask, 1 threshold, 1 short timeout<br />
CONFIG_B = 00000000: no other resources<br />
SIZE = 00010100: 20 bytes length<br />
SETTINGS = 00000000: no special flags<br />
PP = 00000000: starting value of the program pointer<br />
RP = 00000000: starting value of the reset pointer<br />

TABLE V. EXAMPLE OF A VARIABLE DATA PART<br />

THRESH1_<br />

LSB<br />

THRESH1_<br />

MSB<br />

00000000<br />

00111000<br />

MASKA 10 00 00 00<br />

0.5 g accelerometer X+<br />

axis threshold<br />

X+ accelerometer axis<br />

mask<br />

TMASKA 00 00 00 00 Temporary mask A<br />

TC 00000000 Timer starting value<br />

TI3 00001111 Timeout3: 15 samples<br />

The whole example FSM program is thus made of 20 bytes: 12 bytes of data section (shown in TABLE IV. and TABLE V.) and 8 bytes of instruction section (shown in TABLE III.). Starting from the data of a three-axis accelerometer, this 20-byte FSM example program is able to detect wrist tilt on/off, useful in a smartwatch or fitness bracelet application.<br />
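To make the byte layout concrete, the 20-byte wrist-tilt program can be assembled as follows (a sketch based on TABLES I.-V.; the opcodes are taken from the tables, while the overall packing order is an assumption for illustration, so the device documentation remains the authoritative format):<br />

```python
# Condition nibbles (TABLE I) and command opcodes (TABLE II)
NOP, TI3, GNTH1, LNTH1 = 0x0, 0x3, 0x5, 0x7
OUTC, SRP, CRP, CONTREL = 0x99, 0x33, 0x44, 0x22

def cond(reset, nxt):
    """Pack a condition byte: reset in the high nibble, next in the low."""
    return (reset << 4) | nxt

# Fixed data section (TABLE IV): CONFIG_A, CONFIG_B, SIZE, SETTINGS, PP, RP
fixed = bytes([0b01010001, 0b00000000, 20, 0, 0, 0])

# Variable data section (TABLE V): THRESH1 (0.5 g), MASKA, TMASKA, TC, TI3
variable = bytes([0x00, 0x38, 0b10000000, 0, 0, 15])

# Instruction section (TABLE III): the wrist-tilt on/off state machine
instructions = bytes([
    cond(NOP, GNTH1),  # STATE 0: wait until input > TH1
    cond(LNTH1, TI3),  # STATE 1: stay over TH1 for TI3 samples, else reset
    OUTC,              # STATE 2: output temporary mask and interrupt
    SRP,               # STATE 3: set reset pointer to STATE 4
    cond(NOP, LNTH1),  # STATE 4: wait until input <= TH1
    cond(GNTH1, TI3),  # STATE 5: stay under TH1 for TI3 samples, else reset
    CRP,               # STATE 6: clear reset pointer back to STATE 0
    CONTREL,           # STATE 7: output, clear mask, continue from STATE 0
])

program = fixed + variable + instructions
assert len(program) == program[2] == 20  # SIZE byte matches the total length
```

Note how STATE 1 packs LNTH1 (reset) and TI3 (next) into the single byte 0x73, matching the nibble convention described earlier.<br />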

IV. MACHINE LEARNING PROCESSING (MLP)<br />

The Finite State Machine presented in the previous section relies on deductive reasoning: it starts out with a hypothesis and examines the possibilities to reach a specific logical state. For motion detection algorithms this<br />

implies finding “rules” to be satisfied in a sequence of events.<br />

This approach works for most gesture detection algorithms, but surely not for all. For example, a phone-up to phone-down gesture algorithm can be solidly based on the fact that the gravity detected by the accelerometer in the phone lies mainly on one axis and will be inverted on that axis over time.<br />

Gesture definition can be changed based on a few parameters:<br />

definition of axis, threshold and time to complete the sequence.<br />

A different motion algorithm like walking detection could<br />

hardly be defined by means of a simple state machine, since the<br />

number of variables would dramatically increase: sensor<br />

positioning, frequency, terrain and personal behavior render the<br />



sensed signal widely variable. From the last example it is possible to extract a more general concept: while the statistical variance of the phone-up to phone-down gesture across a population is narrow, allowing deductive-reasoning application design, a walking gesture exhibits broad statistical variance, so deductive reasoning should be abandoned in favor of inductive reasoning.<br />

The idea behind Machine Learning Processing is to allow the<br />

implementation on silicon of data-driven algorithms, exploiting<br />

the capability of building a model from input patterns. Over the<br />

last decade the explosion of the Internet and IOT has made<br />

available an enormous quantity of information. Following the<br />

increase in quantity of data, tools to manage these collections of<br />

data have been developed, in order to make them effective for<br />

applications. MLP is considered a suitable solution to implement<br />

data-driven algorithms on inertial sensors. MLP is highly<br />

reconfigurable, effective in the field of inertial sensors,<br />

implemented in an ultra-low-power domain, and therefore suitable for battery-operated products, for example IoT devices.<br />

An important branch of machine learning is data mining:<br />

“data mining is an interdisciplinary field bringing together<br />

techniques from machine learning, pattern recognition, and<br />

statistics" [1][2] with the aim of knowledge discovery.<br />

The output of the data-mining tool is a decision tree: the application design starts from a collection of patterns, and ends with loading the resulting decision tree onto the MLP. The<br />

entire process of the application design is supervised by<br />

supporting software that is described in the next section. In the<br />

present section the set of basic blocks behind the MLP is<br />

introduced.<br />

The general scheme is illustrated in Figure 4.<br />

As inputs for the algorithm, the user can configure data from up to 3 sensors. The gyroscope and accelerometer blocks are internal to the sensor, but data from an external sensor such as a magnetometer can be read over an embedded I²C master. Input sensor data is composed of the axis and the magnitude values of the physical sensor (TABLE VI.).<br />

TABLE VI. INPUT TYPES FOR MLP<br />
Accelerometer: accX, accY, accZ (axis); accV, accV² (magnitude)<br />
Gyroscope: gyX, gyY, gyZ (axis); gyV, gyV² (magnitude)<br />
External sensor: magX, magY, magZ (axis); magV, magV² (magnitude)<br />

A wide set of configurable filters is available to condition the input data, as illustrated in the following table (TABLE VII.).<br />
TABLE VII. FILTER TYPES IN MLP<br />
Order 1: High Pass, Generic IIR<br />
Order 2: Band Pass, Generic IIR<br />

Both raw and filtered data can be used as inputs for the<br />

feature block: this block performs statistical computation of<br />

data, and can be configured to output up to 19 different statistical<br />

features. The list of the available features is given in TABLE<br />

VIII. There are two main sets of features, triggered and<br />

windowed: the former are elaborated at a feature event, the latter<br />

at fixed window time intervals. While all features can be<br />

calculated as windowed or triggered depending on user<br />

configuration, only a subset of these features can generate a<br />

trigger.<br />

TABLE VIII. AVAILABLE STATISTICAL FEATURES IN MLP<br />

Figure 4: MLP General Scheme<br />

From the figure it is possible to deduce the boundary between<br />

software and hardware layers. The application starts from<br />

patterns of sensor data which describe the knowledge that MLP<br />

has to understand while running. For example an activity<br />

recognition algorithm starts from patterns involving activities to<br />

be recognized (walking, running, moving vehicle, no motion,<br />

etc.), with the aim that the MLP outputs the result of the current activity directly from the sensor data.<br />

FEATURE: TRIGGER GENERATION<br />
Mean: No<br />
Variance: No<br />
Energy: No<br />
Peak to Peak: No<br />
Zerocross: No<br />
Zerocross trigger gen: Yes<br />
Positive Zerocross: No<br />
Positive Zerocross trigger gen: Yes<br />
Negative Zerocross: No<br />
Negative Zerocross trigger gen: Yes<br />
Peak detector: No<br />
Peak detector trigger gen: Yes<br />
Positive peak detector: No<br />
Positive peak detector trigger gen: Yes<br />
Negative peak detector: No<br />
Negative peak detector trigger gen: Yes<br />
Min: No<br />
Max: No<br />
Duration: No<br />
Clock Feature: Yes<br />
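As a behavioral sketch of the feature block (my own minimal reimplementation for illustration; the silicon computes these in dedicated low-power hardware), a few of the TABLE VIII. features over one window could be computed as:<br />

```python
def window_features(samples):
    """Compute a few TABLE VIII-style statistics over one sample window."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    energy = sum(x * x for x in samples) / n
    peak_to_peak = max(samples) - min(samples)
    # Zero crossings: sign changes between consecutive samples
    # (one simple definition among several possible ones)
    zerocross = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return {
        "mean": mean,
        "variance": variance,
        "energy": energy,
        "peak_to_peak": peak_to_peak,
        "zerocross": zerocross,
        "min": min(samples),
        "max": max(samples),
    }
```

Windowed features would be evaluated at fixed time intervals, while triggered features would be evaluated whenever one of the trigger-generating features fires.<br />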

At the end of the feature configuration step, the software tool described in the following section can output a configuration file to be loaded on the device for MLP configuration, and an ARFF file for the data-mining tool. The ARFF file obtained matches the silicon implementation of the MLP computation. The data-mining tool processing the ARFF file is able to refine (or “determine”) the best set of features for a specific application case, and outputs a decision tree together with its statistical performance.<br />

After elaboration and feedback from the data-mining tool, it<br />

is possible to reprocess the data and optimize the set of features.<br />

When the performance matches the expectation, the decision<br />

tree can be loaded on the MLP by means of a configuration file<br />

produced by the STM software tool.<br />
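Conceptually, the decision tree loaded on the MLP is a chain of threshold checks on the computed features. A minimal sketch (the node layout, feature names and class labels here are invented for illustration and are not the device’s binary format):<br />

```python
# A node is either a class label (leaf) or a (feature, threshold, left, right)
# tuple; left is taken when feature <= threshold, right otherwise.
TREE = (
    "variance", 0.05,
    "stillness",           # low variance -> not moving
    ("mean", 1.2,
     "walking",            # moderate mean acceleration
     "vehicle"),           # high mean -> vehicle-like vibration profile
)

def classify(node, features):
    """Walk the tree until a leaf (class label string) is reached."""
    while not isinstance(node, str):
        feature, threshold, left, right = node
        node = left if features[feature] <= threshold else right
    return node
```

Each evaluation costs only a handful of comparisons, which is what makes a decision tree a good fit for an ultra-low-power datapath.<br />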

V. SUPPORTING SOFTWARE<br />

Two dedicated tools have been developed to allow the<br />

programmability of the MEMS sensor, the first for the Finite<br />

State Machine configuration, the second for decision-tree<br />

configuration using a statistical-based / machine-learning<br />

approach. These tools make the device configuration process<br />

easy and fast.<br />

The tools for Finite State Machine and decision-tree<br />

configuration work as an extension of the Unico GUI (the<br />

Graphical User Interface for all the MEMS sensor<br />

demonstration boards available in the STMicroelectronics<br />

portfolio [5]). Unico interacts with a motherboard [5][6] based on<br />

the STM32 microcontroller, which enables the communication<br />

between the MEMS sensor and the PC GUI. The software<br />

visualizes the output of the sensors in both graphical and<br />

numerical format, and allows the user to save or generally<br />

manage data coming from the device.<br />

Unico allows access to the MEMS sensor registers, enabling<br />

fast prototype of register setup and easy testing of the<br />

configuration directly on the device. It is possible to save the<br />

current registers configuration in a text file, and load a<br />

configuration from an existing file. In this way, the sensor can<br />

be reprogrammed in few seconds.<br />

The Finite State Machine and Machine Learning tools abstract the process of register configuration by automatically generating configuration files for the device. The user just needs to set a few parameters in the GUI and the configuration file is generated automatically. A set of configuration files is already available and can be distributed to users. The user can modify these configurations, and can also create their own library of configuration files by generating new configurations with the tools.<br />

A. Finite state machine tool<br />

The State Machine tool extension of Unico allows the user to configure the state machines and test their functionality. Several tabs are available in this tool:<br />

- A configuration tab for setting up the state machines, writing the configuration to the MEMS sensor, and loading and saving configuration files (Figure 5).<br />

- An interrupt tab showing sensor data and the interrupts generated by the state machine execution (Figure 6).<br />

- A debug tab for injecting data into the sensor and debugging the state machine execution sample by sample (Figure 7).<br />

Figure 5: Finite State Machine configuration tab<br />

Figure 6: Sensor data and Interrupt generation tab<br />


www.embedded-world.eu


Figure 7: Debug tab and step-by-step data insertion<br />

B. Machine learning tool<br />

The statistics-based / machine-learning algorithms require the collection of data logs, which is possible using the Unico GUI. An expected result must be associated with each data log (e.g. no motion, walking, running). The tool collects these data patterns to compute the features.<br />

Figure 9: Configuration tab<br />

The ARFF file is the starting point for the decision-tree generation process. The decision tree can be generated by different machine-learning tools. Weka [7], software developed by the University of Waikato, is able to generate a decision tree starting from the Attribute-Relation File. Through Weka it is possible to evaluate which attributes are good for the decision tree, and different decision-tree configurations can be implemented by changing the parameters available in Weka.<br />
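At each node of such a decision tree, the attribute whose split best separates the classes is selected. As a rough illustration of one common selection criterion, the sketch below computes the information gain of a threshold split on a toy feature; the feature values, labels and threshold are invented for illustration and are not taken from the paper's dataset or from Weka's internals.<br />

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        ent -= p * math.log2(p)
    return ent

def information_gain(values, labels, threshold):
    """Gain of splitting (feature value, label) pairs at `threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    split_ent = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - split_ent

# Toy feature: mean acceleration norm per window; labels: activity class.
feature = [0.1, 0.2, 0.15, 1.2, 1.4, 1.1]
labels = ["still", "still", "still", "walk", "walk", "walk"]
print(information_gain(feature, labels, 0.5))  # perfect split -> gain 1.0
```

A split with zero gain would leave the classes as mixed as before; Weka's tree builders apply this kind of criterion recursively over all candidate attributes.<br />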

Figure 8: Data Patterns tab<br />

The tool allows selecting filters to be applied to the raw data, and features to be computed from the filtered data. The computed features will be the attributes of the decision tree. After a few steps, an Attribute-Relation File (ARFF) is generated by the tool.<br />
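An ARFF file pairs a header of attribute declarations with comma-separated data rows. The following minimal sketch emits such a file for a set of computed features; the relation name, attribute names and values are hypothetical, chosen only to illustrate the format consumed by Weka, not the output of the ST tool.<br />

```python
def write_arff(relation, attributes, classes, rows):
    """Emit a minimal ARFF document as a string.

    attributes: numeric feature names; classes: nominal class values;
    rows: (feature_values, class_label) pairs.
    """
    lines = [f"@RELATION {relation}", ""]
    for name in attributes:
        lines.append(f"@ATTRIBUTE {name} NUMERIC")
    lines.append("@ATTRIBUTE class {" + ",".join(classes) + "}")
    lines += ["", "@DATA"]
    for values, label in rows:
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines)

# Hypothetical feature set for an activity-recognition log.
arff = write_arff(
    "activity",
    ["acc_mean", "acc_peak"],
    ["still", "walking", "running"],
    [([0.98, 1.02], "still"), ([1.1, 2.4], "walking")],
)
print(arff)
```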

Figure 10: Attributes view in Weka<br />



Figure 11: Decision-tree generation in Weka<br />

Once the decision tree has been generated, it can be uploaded to the ST tool to complete the generation of the register configuration for the MEMS sensor.<br />

The Unico GUI, by accessing the sensor registers, can read the status of the decision-tree outputs.<br />

VI. APPLICATION CASE EXAMPLE<br />

Starting from the example presented in the second section, some current consumption measurements have been taken. An activity recognition algorithm has been chosen as the example because it offers two benefits: its performance can be unambiguously evaluated on a patterns database, and its current consumption, when running on common general-purpose microcontrollers, is on the order of hundreds of µA. The MLP can easily be configured, by means of the supporting software presented in the previous section, to run the activity recognition algorithm.<br />

TABLE IX. CURRENT REQUIREMENTS<br />

Implementation: Current mean [µA]<br />

MLP on LSM6DSOx (additional current consumption): 7<br />

Cortex-M3 STM32L152RE @ 32 MHz: 240<br />

TABLE IX. summarizes the current requirement of the activity recognition algorithm running on a Cortex-M3 [8][9][10], and the additional current requirement for the same algorithm running on the LSM6DSOx MLP.<br />

VII. CONCLUSIONS<br />

The world is becoming more connected: devices are linked together to exchange massive quantities of data. IoT applications rely on three key building blocks: sensing, intelligence and connectivity. In this paper a highly configurable digital module embedded in an inertial sensor has been introduced. The digital module adds intelligence to the sensor, allowing significant power savings at system level. To make application prototyping immediate, supporting configuration software for the digital module is supplied along with the hardware. The application case in the previous section clearly shows that, thanks to the digital module, the reduction in current consumption is dramatic: roughly 240 µA on a Cortex-M3 versus 7 µA of additional consumption on the MLP. Smart sensors are enablers for new applications where battery life is crucial.<br />
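As a back-of-the-envelope illustration of the TABLE IX. figures, the sketch below converts the two current draws into battery lifetimes. The 220 mAh coin-cell capacity is a hypothetical assumption, and it pretends that only the recognition function draws current; real systems have many other consumers, so these numbers only illustrate the ratio.<br />

```python
# Rough battery-life illustration of the TABLE IX. figures (assumption:
# a 220 mAh coin cell, with only the recognition function drawing current).
CAPACITY_MAH = 220.0

def lifetime_hours(current_ua):
    """Hours until the cell is drained at a constant current in microamps."""
    return CAPACITY_MAH / (current_ua / 1000.0)

mcu_hours = lifetime_hours(240.0)  # algorithm on the Cortex-M3
mlp_hours = lifetime_hours(7.0)    # additional current of the MLP

print(f"Cortex-M3: {mcu_hours / 24:.0f} days")
print(f"MLP:       {mlp_hours / 24:.0f} days")
print(f"ratio:     {mlp_hours / mcu_hours:.1f}x")
```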

ACKNOWLEDGMENT<br />

The authors thank the STMicroelectronics Analog MEMS Sensor division for discussions, encouragement and support.<br />

REFERENCES<br />

[1] S. Sumathi and S.N. Sivanandam: Introduction to Data Mining<br />

Principles, Studies in Computational Intelligence (SCI) 29, 1–20<br />

(2006).<br />

[2] V. Sze, Y. H. Chen, J. Emer, A. Suleiman and Z. Zhang,<br />

"Hardware for machine learning: Challenges and opportunities,"<br />

2017 IEEE Custom Integrated Circuits Conference (CICC),<br />

Austin, TX, 2017, pp. 1-8.<br />

[3] V. Sze, "Designing Hardware for Machine Learning: The<br />

Important Role Played by Circuit Designers," in IEEE Solid-State<br />

Circuits Magazine, vol. 9, no. 4, pp. 46-54 , Fall 2017.<br />

[4] STMicroelectronics, “Bluetooth® low energy wireless system-on-chip,”<br />

BlueNRG-2 datasheet, November 2017,<br />

[DocID030675 Rev 2].<br />

[5] STMicroelectronics Analog MEMS Sensor Application Team, Unico GUI User manual, Rev. 5, October 2016.<br />

[6] STMicroelectronics Technical Staff, STEVAL-MKI109V3<br />

Professional MEMS Tool motherboard for MEMS adapter<br />

boards, July 2016<br />

[7] Ian H. Witten, Eibe Frank, and Mark A. Hall. 2011. Data Mining:<br />

Practical Machine Learning Tools and Techniques (3rd ed.).<br />

Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.<br />

[8] STMicroelectronics, “Ultra-low-power 32-bit MCU ARM®-based Cortex®-M3 with 512KB Flash, 80KB SRAM, 16KB EEPROM, LCD, USB, ADC, DAC,” STM32L151xE STM32L152xE datasheet, Rev. 9, August 2017.<br />

[9] STMicroelectronics Technical Staff, STM32 Nucleo-64 boards,<br />

NUCLEO-XXXXRX NUCLEO-XXXXRX-P data brief, Rev. 10<br />

December 2017.<br />

[10] STMicroelectronics Technical Staff, Sensor and motion algorithm software expansion for STM32Cube, X-CUBE-MEMS1 data brief, Rev. 10, November 2017.<br />



Parallel Architectures for Object-Based Sensor<br />

Fusion on Automotive Embedded Systems<br />

Florian Fembacher<br />

Infineon Technologies AG<br />

Neubiberg, Germany<br />

Email: florian.fembacher@infineon.com<br />

Abstract—Autonomous or highly automated driving has been an emerging development in science and industry over the last decades. Sensor fusion, in which information from different sensors is combined to achieve higher measurement accuracy or to create new information, is a key enabler for this technology. Although much research has been done on developing and improving algorithms for advanced driver assistance systems (ADAS), the development of automotive embedded hardware capable of running those applications is still in its beginnings. Automotive embedded systems have to meet special requirements compared to other embedded applications; important factors are unit cost, energy consumption and safety requirements (cf. ISO 26262). Since power consumption increases proportionally with CPU frequency, it seems reasonable to use multi-core architectures running at a lower frequency to meet the computational requirements.<br />

This paper studies the benefits of using parallel architectures for object-based sensor fusion on automotive embedded systems. For the evaluation, a simulation using Kalman filtering for the state estimation and an auction algorithm for the data association was implemented. All simulations were performed on an NVIDIA DRIVE PX 2 board containing four ARM A57 cores and a Pascal GPU. The results show that multi-core processors can be used to efficiently speed up object-based sensor fusion in embedded systems, whereas a GPU-based implementation largely suffers from high latency caused by memory accesses.<br />

Keywords—sensor fusion, embedded system<br />

I. INTRODUCTION<br />

In recent years a lot of effort has been made to develop highly automated vehicles [1], [2], [3]. Many advanced driver assistance systems (ADAS) such as brake assist, lane keeping or traffic jam assistants have already come onto the market in the last couple of years. The promise of all these systems is to offer more safety by preventing accidents and to increase the driver’s comfort by carrying out simple driving tasks. Currently most progress in autonomous driving can be seen on highways, since driving in urban areas is much more challenging due to the number of monitoring and decision tasks that have to be performed in such a scenario.<br />

Table I illustrates the six levels of automation defined by<br />

SAE J3016. In the first three levels the human driver is<br />

actually monitoring the driving environment, whereas in the<br />

last three levels the automated driving system is monitoring<br />

the environment. At level 0 the driver is responsible for all<br />

driving tasks. At level 1, simple systems are assisting the driver<br />

by either steering or accelerating, but still the human driver is<br />

needed to perform all other driving tasks. At level 2, which is<br />

TABLE I<br />

AUTOMATION LEVEL ACCORDING TO SAE J3016<br />

Level 0: Driver only, no assistance<br />

Level 1: Driver Assistance<br />

Level 2: Partial automation<br />

Level 3: Conditional automation<br />

Level 4: High automation<br />

Level 5: Full automation<br />

already available in premium vehicles, the system is capable of both steering and accelerating the vehicle. At level 3 all driving tasks are performed by the system, but the human driver has to intervene if necessary. At level 4 the system is essentially driving autonomously, but the driver still has to take over the dynamic driving tasks if the system cannot handle a certain situation. Finally, at level 5 the system performs all driving modes and no human interaction is needed any more. The following work focuses on level 2 and its requirements for an automotive ECU.<br />

For autonomous driving capabilities it is necessary to model the surrounding world of a vehicle in real time. At level 2, radar and camera sensors are used to monitor objects in the surrounding world. These objects are used to create a model that serves as an abstract representation of the real world. An object model usually contains geometric and dynamic information. To create a global state, all the sensor input has to be fused on a central unit. To keep track of all objects over time, an association between sensor measurements and detected objects has to be performed. For complex rural or urban areas this approach might not be sufficient, since it lacks important information such as free and occupied space. For this purpose, occupancy grids [4], which model the occupancy of the environment by a probability grid, seem more promising. In contrast to the object-based approach, this representation lacks dynamic information about objects. To compensate for the respective advantages and disadvantages, a combination of both approaches is possible [5].<br />

In 2016 Rakotovao et al. published two papers in which they describe approaches to integrating grid-based sensor fusion on less powerful automotive embedded systems. In [6] they present an efficient traversal algorithm to find cells covered by a sensor beam, and in [7] an integer-based implementation for ECUs without floating-point units.<br />

The objective of this paper is to analyze the suitability of parallel<br />



architectures for object-based sensor fusion with a dynamically changing number of objects on less powerful automotive embedded systems. For this purpose a general multi-target tracking software framework for the evaluation of the computational requirements of object-based sensor fusion was developed. Multi-target tracking is a well-studied subject from the 80s and 90s and will not be discussed in great detail; further information can be found in [8], [9] and [10]. A fully working multi-target tracking system implemented entirely with recurrent neural networks was published in 2016 by Milan et al. Although this approach showed promising results, it is not suitable for current automotive embedded system architectures.<br />

The Kalman filter (KF) and its variants are probably the most widely used approach for target tracking. For the data association, the auction algorithm (AA) is a suitable choice for embedded systems because of its low computational requirements. For the implementation, the KF was chosen for the state estimation and the AA for the data association; both will be described in further detail in section II, including a detailed complexity analysis. In section III, results for different levels of parallelism are presented.<br />

In [11] a parallel architecture for Kalman tracking in FPGAs is presented, resulting in a significant speed-up. An optimization method for graphics processing units can be found in [12]. Nevertheless, the presented approaches are mainly suitable for state models with large dimensions and do not focus on the computation of several hundred KFs in parallel. In the given multi-target tracking application, typically a constant-velocity or constant-acceleration model with a low dimension is used. For this reason a SIMD accelerator was used to speed up the matrix computations, while the KF computation for different targets was distributed over the available cores. A profound theoretical discussion of parallel implementations of the AA can be found in [13].<br />

II. BACKGROUND<br />

A. Sensor Fusion<br />

1) Framework: Fig. 1 shows a general setup of a multi-target tracking system. The process can be described in four basic steps that have to be executed repeatedly. In the first step, the sensor input and the predictions of the existing tracks have to be associated. If a track can be associated with a measurement, it is updated in the second step; otherwise a new track has to be initialized. In the next step, track management is needed to delete invalid tracks or to fuse tracks that are recognized as belonging to the same object. In the last step, the existing tracks are predicted.<br />

2) Time Dependencies: In this section we elaborate on the time requirements and dependencies that arise for a multi-target tracking system. First we define the following sets:<br />

Definition 1: S = {s_1, ..., s_n | s_i (i ∈ {1, ..., n}) is the i-th sensor in the multi-target tracking system}.<br />

Definition 2: D = {d_1, ..., d_n | d_i (i ∈ {1, ..., n}) is the dimension of a measurement vector of sensor s_i}.<br />

Definition 3: M = {m_1, ..., m_n | m_i (i ∈ {1, ..., n}) is the maximal number of objects that are observable by sensor s_i}.<br />

Fig. 1. Framework of a multi-target tracking system (blocks: Sensor Data → Association → State Update on association, or State Initialisation on no association → Track Management → State Prediction).<br />

Definition 4: ∆T = {δt_1, ..., δt_n | δt_i (i ∈ {1, ..., n}) is the time span between two consecutive measurements of sensor s_i}.<br />

Definition 5: P = {p_1, ..., p_n | p_i (i ∈ {1, ..., n}) is the maximal processing time needed for an object list measured by sensor s_i}.<br />

As depicted in Fig. 2, we make the assumption that all sensors send their object lists asynchronously and in correct time order; that is, if the acquisition time of one sensor is earlier than that of a second one, the first list is also sent before the second one. Let δt_max = max_i {δt_i} be the longest update interval existing in the system and s_max = argmax_{i=1,...,n} {δt_i} the corresponding sensor. To keep the time dependencies, all measurements generated in this time interval have to be processed before s_max sends the next measurement list:<br />

T_p = Σ_{i=1}^{n} p_i ≤ δt_max    (1)<br />

The total processing time T_p depends directly on the number of sensors |S|, the corresponding measurement dimensions d_i, and the maximal number of observable objects m_i for each sensor. In general it can be expected that both the number of sensors in an automotive multi-target tracking system and the number of observed objects will increase in the near future. For this reason it is indispensable to use algorithms that offer real-time performance while still providing sufficient precision for the ADAS functions that rely on the provided object tracks.<br />
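Inequality (1) can be checked with a few lines of code. The sensor set below, with its update intervals and worst-case processing times, is purely hypothetical and only illustrates the schedulability test.<br />

```python
# Schedulability check of inequality (1): the total per-cycle processing
# time T_p must fit within the longest sensor update interval.
# All numbers are hypothetical, for illustration only.
sensors = {
    "front_radar":  {"dt_ms": 50.0, "p_ms": 8.0},
    "front_camera": {"dt_ms": 66.0, "p_ms": 20.0},
    "rear_radar":   {"dt_ms": 50.0, "p_ms": 8.0},
}

def schedulable(sensors):
    total_p = sum(s["p_ms"] for s in sensors.values())  # T_p
    dt_max = max(s["dt_ms"] for s in sensors.values())  # longest update interval
    return total_p <= dt_max, total_p, dt_max

ok, total_p, dt_max = schedulable(sensors)
print(f"T_p = {total_p} ms, dt_max = {dt_max} ms, schedulable: {ok}")
```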

In the following, we chose the KF for the track estimation and the AA for the data association, as depicted in Fig. 1. Both algorithms, together with a complexity analysis, are discussed in the following sections. The KF and its variations for nonlinear object models are commonly used estimation filters in autonomous driving. The AA belongs to the class of nearest-neighbor association algorithms. Other well-known techniques, which are much more complex and therefore not targeted in this study, are the Joint Probabilistic Data Association Filter (JPDAF) [14] and the Multiple Hypothesis Tracking filter (MHT) [15].<br />

B. Kalman Filter<br />

The KF is a discrete-time, recursive, linear implementation of the Bayes filter, originally published by R. E. Kalman in 1960 [16]. It is well studied and used in numerous<br />




Algorithm 1: Discrete Kalman filter<br />

1: Input: A_k, P_k, Q_k, R_k, x_k, z_k<br />

2: Output: P_{k+1}, x_{k+1}<br />

3: Prediction Step:<br />

4: x_{k+1} ← A_k x_k<br />

5: P_{k+1} ← A_k P_k A_k^T + Q_k<br />

6: Correction Step:<br />

7: K_k ← P_{k+1} H_k^T (H_k P_{k+1} H_k^T + R_k)^{−1}<br />

8: x_{k+1} ← x_{k+1} + K_k (z_k − H_k x_{k+1})<br />

9: P_{k+1} ← (I − K_k H_k) P_{k+1}<br />

Fig. 2. Time dependencies between n sensors sending measurements to a<br />

fusion unit.<br />

applications. A more detailed introduction can be found in [17], [18] and [19]. The filter addresses the problem of predicting a state x_k ∈ R^n, given only a measurement z_k ∈ R^m. The relationship between two states is expressed by the linear difference equation<br />

x_{k+1} = A_k x_k + B u_k + w_k.    (2)<br />

The square matrix A ∈ R^{n×n} maps the state at time step k to the state at time step k+1. Accordingly, matrix B ∈ R^{n×l} relates a control vector u_k ∈ R^l to the state vector at time k. In the following multi-target tracking application the input vector is assumed to be zero; therefore equation (2) can be shortened to<br />

x_{k+1} = A_k x_k + w_k,    (3)<br />

with a Gaussian noise vector w_k that models the system’s uncertainty. The relation between the measurement vector and the state vector is modeled by<br />

z_k = H_k x_k + v_k    (4)<br />

with matrix H ∈ R^{m×n} relating the state vector x_k to the measurement z_k. Again, uncertainty is modeled by a Gaussian noise vector v_k.<br />

The complete KF is shown in Algorithm 1. It consists of two steps. First, in the prediction step, the a priori state x_{k+1} is computed using the difference equation given in equation (3). Additionally, in line 5 the a priori covariance matrix P_{k+1} is computed using the transition matrix A and a matrix Q modeling the process uncertainty. In the correction step, the updated state and covariance are computed. First, in line 7, the gain K_k is computed, minimizing the a posteriori error covariance. With the gain and the innovation, the a posteriori state is computed in line 8. Finally, in line 9, the a posteriori covariance is updated.<br />
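One predict/correct cycle of Algorithm 1 can be sketched in a few lines of NumPy, here with the 4-dimensional constant-velocity model used in section III; the numeric values of dt, Q, R and the measurement are placeholders, not the paper's tuning.<br />

```python
import numpy as np

def kalman_step(A, H, Q, R, x, P, z):
    """One predict/correct cycle of the discrete Kalman filter (Algorithm 1)."""
    # Prediction step: a priori state and covariance
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Correction step: gain, innovation, a posteriori state and covariance
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Constant-velocity model: state [px, py, vx, vy], all four measured.
dt = 0.1
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.eye(4)
Q = 0.01 * np.eye(4)   # placeholder process noise
R = 0.10 * np.eye(4)   # placeholder measurement noise

x, P = np.zeros(4), np.eye(4)
z = np.array([1.0, 2.0, 0.5, -0.5])
x, P = kalman_step(A, H, Q, R, x, P, z)
print(x)
```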

1) Complexity: Summarizing the equations given in Algorithm 1, we have 8 matrix-matrix multiplications, 3 matrix-vector multiplications, 3 matrix-matrix additions, 2 vector additions and one matrix inversion. The complexity of these<br />

TABLE II<br />

MATRIX AND VECTOR DIMENSIONS<br />

Matrices: A (n×n), B (n×l), P (n×n), Q (n×n), R (m×m), K (n×m), I (n×n), H (m×n)<br />

Vectors: x (n), z (m), u (l)<br />

operations is listed in Table III. As can be seen from the table, the total runtime is bounded by O(n³). The matrix and vector dimensions are given in Table II. Using single-precision floating-point format, the memory requirement for all matrices is n(3n+4m) × 32 bit, and (n+m) × 32 bit for the state and measurement vectors. Additional memory space needed for temporary results is not considered.<br />
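Plugging the constant-velocity dimensions of section III (n = m = 4) into the stated formulas gives the per-filter footprint; the 512-target figure is only an illustrative extrapolation, not a number from the paper.<br />

```python
def kf_matrix_bits(n, m):
    """Matrix storage for one KF instance, per the formula in the text."""
    return n * (3 * n + 4 * m) * 32

def kf_vector_bits(n, m):
    """State plus measurement vector storage."""
    return (n + m) * 32

n = m = 4  # constant-velocity model of section III
total_bytes = (kf_matrix_bits(n, m) + kf_vector_bits(n, m)) // 8
print(total_bytes)        # bytes per filter instance
print(512 * total_bytes)  # illustrative total for 512 tracked targets
```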

C. Auction Algorithm<br />

One important step in a target tracking application is to assign measurements to existing tracks, and to create a new track if none exists. In this scenario the AA [20] is chosen, since it can be easily parallelized and in general delivers an optimal solution. The algorithm solves the problem of assigning n existing tracks to m measurements; the assignment between tracks and measurements is bijective. Algorithm 2 shows the AA for symmetric problems, i.e. the number of tracks equals the number of measurements. A multi-target tracking application will in general be asymmetric, but still the<br />

TABLE III<br />

COMPLEXITY<br />

Multiplication A_k x_k: O(n²)<br />

Multiplication A_k P_k A_k^T: O(n³)<br />

Addition A_k P_k A_k^T + Q_k: O(n²)<br />

Multiplication H_k P_k H_k^T: O(mn²)<br />

Addition H_k P_k H_k^T + R_k: O(m²)<br />

Inversion (H_k P_k H_k^T + R_k)^{−1}: O(m³)<br />

Multiplication P_k H_k^T (H_k P_k H_k^T + R_k)^{−1}: O(mn²)<br />

Multiplication H_k x_k: O(mn)<br />

Subtraction z_k − H_k x_k: O(m)<br />

Multiplication K(z_k − H_k x_k): O(mn)<br />

Addition x_k + K(z_k − H_k x_k): O(n)<br />

Multiplication K_k H_k: O(mn²)<br />

Subtraction I − K_k H_k: O(n²)<br />

Multiplication (I − K_k H_k) P_k: O(n³)<br />



symmetric approach can be used by adding dummy tracks or measurements. Other possible modifications for the asymmetric assignment problem are discussed in [20]. The AA runs iteratively and stops when a feasible assignment S has been found or some time bound is reached. An assignment S is called feasible if every track is assigned to exactly one measurement.<br />

The AA consists of three phases. In the first phase a gating area for each track is computed. This approach allows us to reject assignments that are highly improbable: only measurements that lie within the gating area G are considered for an assignment,<br />

d²_ij = y_ij S^{−1} y_ij^T < G    (5)<br />

with y_ij being the innovation vector between track i and measurement j, and S being the innovation covariance.<br />
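The gating test of equation (5) is a Mahalanobis-distance check. A small sketch follows, with an invented innovation covariance; the gate value 9.21 is the 99% chi-square quantile for two degrees of freedom, an assumed (not the paper's) threshold choice.<br />

```python
import numpy as np

def gate(residual, S_inv, G):
    """Mahalanobis gating test of equation (5): accept if d^2 < G."""
    d2 = float(residual @ S_inv @ residual)
    return d2 < G, d2

S_inv = np.linalg.inv(np.diag([0.5, 0.5]))  # inverse innovation covariance
ok, d2 = gate(np.array([0.3, -0.2]), S_inv, G=9.21)
print(ok, d2)
```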

In the bidding phase, for each track i the measurements with the best and second-best net value are searched:<br />

j_i = argmax_{j∈A(i)} {a_ij − p_j}    (6)<br />

v_i = max_{j∈A(i)} {a_ij − p_j},  w_i = max_{j∈A(i), j≠j_i} {a_ij − p_j}    (7)<br />

where a_ij = 1/d²_ij is the assignment value and p_j the current price of measurement j. If there is no second-best value, then w_i is set to a value much smaller than v_i. The bid of track i for measurement j_i is finally computed as<br />

b_{i j_i} = p_{j_i} + v_i − w_i + ε    (8)<br />

with ε being a constant that limits the error with respect to the optimal solution.<br />

The third phase is the assignment phase, in which the new price p_j is set to the highest bid that was given for measurement j. Then every pair (i, j) whose measurement received a higher bid is removed from the assignment set S, and the new pair (i_j, j) is added.<br />

1) Complexity: The AA consists of only a small number of mathematical operations. As Table IV shows, the gating phase performs, for every track i, one matrix-vector multiplication and one vector subtraction to form the residual, plus two further matrix-vector multiplications for the quadratic gating form. In the following bidding phase, 4 scalar additions are performed for each track until a feasible solution is found or some time bound is reached. In the last phase no mathematical operation is performed.<br />

III. RESULTS<br />

For the implementation the aforementioned KF and AA were used. To ensure the functional correctness of the implementation, it was verified whether targets were correctly assigned to their tracks. The KF used a constant-velocity model, so the state and measurement vector both had an equal dimension of 4 (position in x, position in y, velocity in x and velocity in y). The implementation was run for different numbers of targets on an NVIDIA DRIVE PX 2; its hardware specification is shown in Table V. The DRIVE PX 2 has 4 ARM Cortex-A57<br />

Algorithm 2: Data Association<br />

1: Input: predicted states X_{k+1}, measurements z_{k+1}, gate G := (H_k P_k H_k^T + R_k)^{−1}<br />

2: Output: assignment set S_{k+1}<br />

3: Gating<br />

4: for all tracks i do<br />

5:   for all measurements j do<br />

6:     residual ← z_j − H x_i<br />

7:     diff ← residual^T · G · residual<br />

8:     if diff < ε then<br />

9:       a_ij ← 1/diff<br />

10:    else<br />

11:      a_ij ← −∞<br />

12: repeat<br />

13:   for all tracks i do<br />

14:     Bidding Phase<br />

15:     j_i ← argmax_{j∈A(i)} {a_ij − p_j}  {measurement with max net value}<br />

16:     v_i ← max_{j∈A(i)} {a_ij − p_j}<br />

17:     w_i ← max_{j∈A(i), j≠j_i} {a_ij − p_j}  {second-best measurement}<br />

18:     b_{i j_i} ← p_{j_i} + v_i − w_i + ε  {bid for the chosen measurement}<br />

19:   Assignment Phase<br />

20:   for all measurements j do<br />

21:     p_j ← max_{i∈P(j)} b_ij<br />

22:   for all pairs (i, j) whose measurement j received a higher bid do<br />

23:     if (i, j) ∈ S then<br />

24:       S ← S \ {(i, j)}<br />

25:     S ← S ∪ {(i_j, j)}<br />

26: until each track is assigned to one measurement<br />
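The bidding and assignment phases above can be sketched compactly in Python for the symmetric case. Gating is omitted here: the value matrix a[i][j] is assumed to be precomputed (e.g. as 1/d²_ij), and the numbers below are invented for illustration.<br />

```python
def auction(a, eps=0.01, max_rounds=1000):
    """Symmetric auction: assign each track i to one measurement j,
    maximizing the summed values a[i][j] (bidding + assignment phases)."""
    n = len(a)
    prices = [0.0] * n
    owner = [None] * n          # owner[j] = track currently holding measurement j
    unassigned = list(range(n))
    rounds = 0
    while unassigned and rounds < max_rounds:
        i = unassigned.pop(0)
        # Bidding phase: best and second-best net value for track i
        net = [a[i][j] - prices[j] for j in range(n)]
        ji = max(range(n), key=lambda j: net[j])
        vi = net[ji]
        wi = max(net[j] for j in range(n) if j != ji) if n > 1 else vi - eps
        # Assignment phase: raise the price, displace the previous owner
        prices[ji] += vi - wi + eps
        if owner[ji] is not None:
            unassigned.append(owner[ji])
        owner[ji] = i
        rounds += 1
    return owner

values = [[10.0, 2.0], [8.0, 3.0]]  # a[i][j], e.g. 1/d_ij^2
print(auction(values))
```

The eps term is the bid increment of equation (8); a smaller eps yields an assignment closer to the optimum at the cost of more bidding rounds.<br />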

TABLE IV<br />

COMPLEXITY<br />

Multiplication H x_{k+1}: O(mn)<br />

Subtraction z_{k+1} − H x_{k+1}: O(m)<br />

Multiplication residual^T · G · residual: O(m²)<br />

Subtraction a_ij − p_j: O(1)<br />

Addition p_{j_i} + v_i − w_i + ε: O(1)<br />

cores, 2 Denver cores and one Pascal GPU. The two Denver superscalar processors, which support the ARMv8 instruction set, are connected to the 4 ARM Cortex-A57 cores. For the evaluation on the CPU the two Denver cores were deactivated to achieve consistent results on the ARM Cortex-A57 processor. The four Cortex-A57 cores were run at approximately 2 GHz; in theory they deliver a total performance of 64 GFLOPS (single precision). Each core has 32 kB of L1 data cache and shares an L2 cache of 2 MB. The GPU is based on the Pascal architecture with 256 CUDA cores running at 1275 MHz and has a theoretical peak performance of 653 GFLOPS.<br />

For benchmarking, four different implementations were compared. The reference implementation was single-threaded, without using the advanced SIMD and floating-point unit of the<br />



TABLE V<br />

TESTING HARDWARE<br />

CPU: instruction set ARMv8, 4 x A57 cores, frequency 1996 MHz, L1 cache (I/D) 48 kB / 32 kB, L2 cache 2 MB<br />

GPU: Pascal architecture, 256 CUDA cores, clock rate 1275 MHz, global memory 7686 MB, shared memory 48 kB, constant memory 64 kB, 32768 block registers<br />

RAM: LPDDR4 protocol, memory size 8 GB<br />

Cortex-A57 cores. In the second one, the advanced SIMD and floating-point unit was used explicitly for all matrix and vector operations. In the third implementation, an additional level of parallelism was added by parallelizing the application object-based using OpenMP. In this case, object-based means that the state update and prediction for different targets were distributed over the available cores; for the AA, the for loops in Algorithm 2 were parallelized. Finally, the same object-based parallelization was implemented using the CUDA 8.0 framework. The results for the KF are presented in Subfig. 3 a) and for the AA in Subfig. 3 b), for different numbers of targets.<br />
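The object-based decomposition can be sketched independently of OpenMP: each target's filter update is an independent work item that a pool of workers can process. The Python threads below only illustrate the decomposition (the paper's implementation used OpenMP on the Cortex-A57 cores); the model matrix and target states are placeholders.<br />

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def update_target(state):
    """Per-target work item: one constant-velocity prediction step."""
    A = np.array([[1, 0, 0.1, 0],
                  [0, 1, 0, 0.1],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]])
    return A @ state

# Object-based parallelization: targets are distributed over workers,
# mirroring the OpenMP loop over targets described in the text.
targets = [np.array([float(i), 0.0, 1.0, 0.0]) for i in range(512)]
with ThreadPoolExecutor(max_workers=4) as pool:
    predicted = list(pool.map(update_target, targets))

print(predicted[0])
```

Because the per-target updates share no state, the same decomposition maps directly to OpenMP threads or CUDA thread blocks.<br />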

As already discussed in section II-B1, the KF has a cubic runtime. For the target tracking, the KF was computed for a growing number of targets. As the runtime for each single KF stays the same, a linear increase of the total runtime was expected, as can be seen in the runtime diagram. Using the SIMD accelerator, a total speed-up of 1.2 was achieved for 512 targets. Since there are not that many matrix and vector operations in the data association algorithm, no significant difference between those two implementations could be observed. By distributing the computation over all four cores and using the SIMD device, a speed-up of approximately 4.1 was observed for the KF computation and 3.7 for the data association.<br />

For the CUDA implementation, a batched processing function of the NVIDIA cuBLAS library was used. Since the overhead for creating and starting the streams was already much higher than the overall processing time on the CPU, the runtime was not considered for the KF computation. For the data association, a speedup of 3.4 was achieved. As a result, it can be seen that the parallelization on the GPU cannot exploit its capability to process multiple data simultaneously. The reason for this behavior is the slow memory access of the GPU: the distance matrix computed in the AA is about 1 MB, so it does not fit into the fast shared memory, while it completely fits into the shared L2 cache of the A57 cores. Certainly the CUDA implementation could be optimized further, but at most a limited advantage over a CPU implementation can be expected. Given these memory constraints, it seems best to equip such a multi-target embedded system with sufficient cache sizes and fast memory access. Furthermore, an object-based parallelization appears to be the most promising general solution approach.<br />
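The memory argument can be checked with a quick calculation: a single-precision distance matrix for 512 targets and 512 measurements occupies exactly 1 MiB, far beyond the 48 kB of shared memory per GPU block, yet well within the 2 MB L2 cache shared by the A57 cores.

```python
targets, measurements = 512, 512
matrix_bytes = targets * measurements * 4   # float32 distance matrix
print(matrix_bytes / 2**20, "MiB")          # 1.0 MiB

shared_mem = 48 * 1024                      # GPU shared memory per block
l2_cache = 2 * 1024 * 1024                  # A57 L2 cache
assert matrix_bytes > shared_mem            # does not fit in the GPU's fast memory
assert matrix_bytes < l2_cache              # fits in the CPU's L2 cache
```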

IV. CONCLUSION<br />

Sensor fusion is a key technology for future autonomous driving. Automation level 2 is already available in premium cars and will certainly be established in the broad market in the near future. Today's automotive ECUs are very limited in memory and computational resources and are therefore used only for safety-critical functions, while sensor fusion is usually performed on more powerful systems. Since the computational power of an embedded system is directly limited by its energy consumption, parallel architectures can be used to overcome those limits.<br />

In this paper, the suitability of parallel embedded architectures was investigated using a multi-target tracking implementation based on the KF and AA. The multi-target tracking was performed on up to four ARM Cortex-A57 cores and an embedded Pascal GPU. It was shown that parallel architectures offer an efficient way to speed up object-based sensor fusion on a microprocessor architecture that is especially optimized for low energy consumption.<br />

It was possible to speed up the computationally critical data association by a factor of 3.7 by distributing the load over four cores and using a SIMD accelerator at the same time. With this implementation, 512 targets could be associated in less than 20 ms, which is well within the update period of a camera or radar system. A comparable speedup was achieved using a GPU developed by NVIDIA for autonomous driving tasks. In the given scenario, however, the GPU implementation mainly hits the so-called "memory wall", which prevents an efficient use of the parallel architecture.<br />



Fig. 3. Results for different levels of parallelism (single core, SIMD, OpenMP, CUDA) for the Kalman filter and the auction algorithm. The runtime was measured for a growing number of targets; the assignment phase was executed 30 times in each case to get comparable results. Subfig. a) shows the measured runtime for the Kalman filter and Subfig. b) the runtime for the auction algorithm.<br />



DeepAPI – bringing deep learning to the edge device with a use case in food recognition<br />

Spiros Oikonomou, Nikos Fragoulis, Vassilis Tsagaris, Christos Theoharatos<br />

Irida Labs S.A<br />

Patras, Greece<br />

tsagaris@iridalabs.gr<br />

Abstract—In this paper, we present an innovative approach for real-time food product identification, based on Artificial Intelligence (AI) and deep learning methods, that provides high accuracy without depending on the cloud or high-end processing systems.<br />

Index Terms—DeepAPI, convolutional neural networks, CNN,<br />

SqueezeNet, food, image classification, deep learning.<br />

I. INTRODUCTION<br />

During the past years, convolutional neural networks (CNNs) have become established as the dominant technology for real-world visual understanding tasks. A significant research effort has been put into the design of very deep architectures able to construct high-order representations of visual information. The accuracy obtained by deep architectures such as GoogLeNet [1] and the more recent ResNet [2] on image classification and object detection tasks proved that depth of representation is indeed key to a successful implementation.<br />

The main focus up to now has been on implementations for mainstream PC-like or cloud-based computing systems, in order to deploy deep learning approaches in diverse technological areas like automotive, transportation, the Internet of Things (IoT), medical and more. However, meeting particular performance requirements on embedded platforms is, in general, difficult and complex. A possible workaround to this problem is heterogeneous computing, which exploits every computing resource present on an embedded system (CPU, GPU, DSP) by off-loading part of the load to each, thereby increasing the overall computational capacity and thus processing speed. There are, however, cases where a multicore CPU is the only available resource on an embedded system. A reasonable question then arises: is it possible to achieve fast inference speed?<br />

We specialize in solving embedded vision problems of this kind. To this end, we took a step forward and evaluated the performance of a SqueezeNet CNN [3] model on a multi-core, multi-cluster CPU.<br />

II. DEEPAPI AND SQUEEZENET CNN ARCHITECTURE<br />

In this section we introduce DeepAPI, a software library, and explain why it is useful for developers. Moreover, we describe the architecture of SqueezeNet and analyze the reasons why this architecture is suitable for embedded applications. Finally, we briefly refer to the Food-101 [5] database. Fig. 1 shows which processing steps are performed off-device and which are performed on-device.<br />

Fig. 1. The processing steps which are performed off-device and the<br />

processing steps which are performed on-device.<br />

A. DeepAPI<br />

The future brings a wave of embedded devices able to respond to their environment through embedded intelligence. Deep learning is a proven technology able to achieve this intelligence, but it requires extremely complex processing to be performed at high speed and, at the same time, within a low energy budget.<br />

To achieve those features, a holistic approach should be followed:<br />

Tweak the models: use special models and special compression techniques, mimicking the human brain, which result in significantly more economical models.<br />



Tweak the code: use heterogeneous programming technologies, in which every available computing unit – multi-cluster CPUs, the GPU and the DSP – is exploited synergistically to carry out the complex tasks of deep learning inference in reasonable time and within a limited power budget.<br />

To help developers embed deep learning technology into their own systems and applications, we developed DeepAPI. DeepAPI is a software library consisting of high-performance deep learning models, highly optimized for embedded computing systems and implemented using a variety of approaches and techniques so as to suit a wide range of applications.<br />

DeepAPI also includes the necessary software tools to allow users to train the model of interest themselves and then use the training results to build the final application. DeepAPI supports platforms based on various ARM/Mali and Snapdragon family processors, and the list keeps growing.<br />

B. Tweaking the algorithms: The SqueezeNet architecture<br />

A basic downside of deep learning architectures is that they require hundreds of megabytes of coefficients for the convolutional kernels to operate. Such requirements can render the embedded implementation of such networks rather prohibitive. Imagine a scenario where a CNN has to operate on a video stream captured by a smartphone, in order to produce real-time video annotation. The allocation and data transfers needed to load, e.g., 600 MB of coefficients into an embedded device's memory is a rather intense workload, particularly when it has to be completed within a limited time, starting when the user opens the camera app and ending when the video recording starts.<br />

In order to address such issues, research effort has very recently shifted towards architectures that produce significantly fewer coefficients. In particular, the recently presented SqueezeNet [3] architecture is able to achieve levels of classification accuracy on ImageNet similar to the baseline AlexNet [4] architecture while using 50 times fewer coefficients. The smart combination of small convolutional kernels and an architecture that lets information flow through different paths facilitates the construction of sufficiently high-order image representations suitable for a large variety of applications. A coefficient size of 3 MB, easily reduced further by a factor of 5 via model-compression techniques, makes SqueezeNet a very appealing architecture for embedded implementations.<br />

Fig. 2 shows two SqueezeNet CNN architectures for classifying the 101 food categories of the Food-101 database [5]. The original SqueezeNet architecture is shown in Fig. 2a and our implementation in Fig. 2b. The main differences between the architectures are the number of outputs of the conv10 layer and the existence of the fc11 layer, which is a fully connected layer. Our SqueezeNet architecture begins with a standalone convolution layer (conv1), followed by 8 Fire modules (fire2–fire9) and one convolution layer (conv10), ending with a final fully connected layer (fc11). A Fire module is comprised of a squeeze convolution layer, which has only 1x1 filters, feeding into an expand layer that has a mix of 1x1 and 3x3 convolution filters. During network training, we noticed that with these changes the network converged more easily and the classification accuracy was higher. At this point, we have to mention that the network was trained on the Food-101 database by fine-tuning the SqueezeNet model pre-trained on ImageNet [6].<br />

The number of filters per Fire module is gradually increased from the beginning to the end of the network. SqueezeNet performs max-pooling with a stride of 2 after conv1, fire4, fire8 and conv10.<br />

Fig. 2. SqueezeNet Food-101 architecture. (a) The original SqueezeNet architecture; (b) our SqueezeNet architecture.<br />



C. Food-101 Database<br />

In the food recognition use case, the SqueezeNet architecture has been trained to perform image tagging and is able to discriminate between the 101 food categories tagged in the Food-101 database [5], comprising some 101,000 images. The database is balanced, as each food category consists of 1,000 images.<br />

The database was augmented by cropping each image at 5 different frames and vertically mirroring each of them. The training was done using the Caffe [7] deep learning framework.<br />
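The ten-fold augmentation described above (five crops, each mirrored) can be sketched with NumPy. The crop positions (four corners plus center) are our choice for illustration; the paper does not specify them.

```python
import numpy as np

def augment(img, size):
    """Return 5 crops of an image plus a mirrored copy of each (10 samples)."""
    h, w = img.shape[:2]
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]       # corners + center
    crops = [img[y:y + size, x:x + size] for y, x in offsets]
    # mirror each crop about the vertical axis
    return crops + [np.flip(c, axis=1) for c in crops]
```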

III. RESULTS<br />

As mentioned in the previous section, the training was done using the Caffe deep learning framework; the accuracy achieved in terms of average recognition rate is 72% for Rank 1, 85% for Rank 3 and >90% for Rank 5, as shown in Table 1.<br />

TABLE I. CLASSIFICATION ACCURACY OF SQUEEZENET<br />
Rank      Accuracy (%)<br />
Rank 1    72<br />
Rank 3    85<br />
Rank 5    91<br />

The validity of the proposed approach has been verified for the SqueezeNet architecture on different platforms. For each implementation, the inference time was measured; the results are shown in Table 2.<br />

TABLE II. INFERENCE SPEED (IS) OF SQUEEZENET FOR A CPU-ONLY (MULTI-CORE) IMPLEMENTATION ON DIFFERENT PLATFORMS.<br />
Processor                                   Mean IS (msec)   Min IS (msec)<br />
1. Snapdragon 820 @ MDP820                  43.7             37.3<br />
2. Snapdragon 808 @ LG G4                   99.4             78.3<br />
3. Snapdragon 801 @ LG G3                   133.0            123.3<br />
4. Mediatek MT6797 @ Redmi Note 4           38.9             33.9<br />
5. HiSilicon Kirin 935 @ Huawei CRR-L09     75.9             64.2<br />
6. Exynos 8890 Octa @ S7-Edge               33.5             28.1<br />
7. AllWinner A80                            94.1             84.0<br />

As seen in the above table, inference speed varies with the process technology, clock speed and number of cores. However, all platforms prove efficient enough to support a real-time recognition task like the food recognition problem.<br />

On the Exynos 8890 @ S7-Edge platform, DeepAPI achieves a mean inference speed of 33.5 msec and a minimum of 28.1 msec. The second-fastest platform is the Mediatek MT6797 @ Redmi Note 4, which achieves a mean inference speed of 38.9 msec and a minimum of 33.9 msec. DeepAPI achieves a mean of 43.7 msec and a minimum of 37.3 msec on the Snapdragon 820 @ MDP820, which is fast enough. Based on Table 2, we also observe that DeepAPI achieves decent inference speed with the SqueezeNet architecture for food recognition even on an older processor platform like the Snapdragon 801 @ LG G3.<br />
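A mean inference time of 33.5 msec corresponds to roughly 30 classifications per second, which is what makes the CPU-only deployment viable for live camera input. The conversion uses the Table 2 figures:

```python
mean_is_msec = {"Exynos 8890": 33.5, "MT6797": 38.9, "Snapdragon 820": 43.7}
fps = {name: 1000.0 / t for name, t in mean_is_msec.items()}
print(fps)   # roughly 29.9, 25.7 and 22.9 frames per second
```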

IV. CONCLUSION<br />

Based on the experimental results in Table 2, we can answer the question raised earlier: is it possible to achieve fast inference speed if a multicore CPU is the only available resource on an embedded system? The answer is that it is possible to achieve fast inference speed together with good classification results.<br />

This is due to the proper design, development and implementation of DeepAPI, as well as to the architecture of SqueezeNet, a convolutional neural network that fits embedded systems offering only a multicore CPU.<br />

The fastest inference speed we achieved is 28.1 msec, on the Exynos 8890 Octa @ S7-Edge platform. Based on the results in Table 2, DeepAPI also achieves sufficiently fast inference speed on older platforms like the Snapdragon 801 @ LG G3.<br />

As a final conclusion, we can say that with DeepAPI we can achieve real-time food recognition at the edge device.<br />

REFERENCES<br />

[1] Szegedy, Christian, et al. “Going deeper with convolutions.”<br />

Proceedings of the IEEE Conference on Computer Vision and<br />

Pattern Recognition. 2015.<br />

[2] He, Kaiming, et al. “Deep residual learning for image<br />

recognition.” arXiv preprint arXiv:1512.03385 (2015).<br />

[3] Iandola, Forrest N., et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size.” arXiv preprint arXiv:1602.07360 (2016).<br />


Deep Learning Requirements for Autonomous<br />

Vehicles<br />

Gordon Cooper<br />

Synopsys, Solutions Group<br />

Mountain View, CA, USA<br />

gordon.cooper@synopsys.com<br />

Abstract—Deep-learning techniques for embedded vision are<br />

enabling cars to 'see' their surroundings and have become a<br />

critical component in the push toward fully autonomous vehicles.<br />

The early use of deep learning for object detection, e.g., pedestrian<br />

detection and collision avoidance, is evolving toward scene<br />

segmentation where every pixel of a high-resolution video stream<br />

must be identified. Embedded vision solutions will be a key enabler<br />

for making automobiles fully autonomous. Giving an automobile<br />

a set of eyes – in the form of multiple cameras and image sensors<br />

– is a first step, but it also will be critical for the automobile to<br />

interpret content from those images and react accordingly. To<br />

accomplish this, embedded vision processors must be hardware-optimized<br />

for performance while achieving low power and small<br />

area, have tools to program the hardware efficiently, and have<br />

algorithms to run on these processors. This presentation will<br />

discuss the current and next-generation requirements for ADAS<br />

vision applications, including the need for deep-learning<br />

accelerators. It will discuss how coming changes in deep learning<br />

will improve ADAS performance, and discuss how to evaluate the<br />

hardware and software tools needed to quickly deploy ADAS<br />

applications with high-definition resolutions.<br />

II. DEEP LEARNING VS MACHINE LEARNING VS ARTIFICIAL<br />

INTELLIGENCE<br />

Artificial intelligence is a broad category (Fig. 1). Until very<br />

recently, AI has been associated more with science fiction than<br />

automotive reality. AI conjures up images of self-aware<br />

androids or rogue robots taking over the world. In the simplest<br />

definition, however, artificial intelligence is human levels of<br />

intelligence exhibited by machines. An automobile exhibiting<br />

human levels of driving would certainly be classified as an<br />

example of artificial intelligence. Machine learning is an<br />

application of artificial intelligence that uses algorithms to<br />

analyze large amounts of data and then infers some information<br />

about the real world from the data.<br />

Keywords—embedded vision; deep learning; IP; CNN;<br />

convolutional neural network; automotive; advanced driver<br />

assistance system; ADAS; SoC design<br />

I. INTRODUCTION<br />

There is an arms race between the major automotive<br />

manufacturers – and some of the biggest tech companies – to be<br />

the first to bring autonomous driving vehicles to the masses.<br />

With about 94% of all accidents attributed to human error, the<br />

rise of autonomous vehicles will save thousands of lives daily<br />

and billions in dollars lost to road crashes. To hand over control<br />

from a person to a machine requires a high confidence in the<br />

machine’s decision making process. Deep learning techniques<br />

provide the building blocks to reach the level of artificial<br />

intelligence needed for machines to make the decisions<br />

necessary to replace human drivers. An understanding of deep<br />

learning requirements is important to best implement this new<br />

technology.<br />

Fig. 1: Hierarchy showing how deep learning relates to artificial intelligence<br />

Neural networks are a class of machine learning algorithms – modeled after the human brain – in which a neuron represents the computational unit and the network describes how these units are connected to each other. Until recently, neural networks were limited to only a couple of layers. But with algorithmic advances combined with the acceleration of computing power brought on by GPUs, more layers have been added to neural networks, improving their performance. Any neural network with more than just input and output layers – that is, with intermediate 'hidden' layers – is considered a deep neural network (Fig. 2). A deep neural network could have one or hundreds of hidden layers. These deep neural networks provide the state-of-the-art implementations for deep learning. A practical example of<br />



deep-learning techniques is enabling cars to 'see' their<br />

surroundings using computer vision hardware and software.<br />
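The computation described above – weights multiplied with inputs at each node, layer by layer – reduces to a few lines of NumPy. This is a generic illustration, not any particular production network:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, layers):
    """Run x through a stack of (weights, bias) pairs. The layers between
    input and output are the 'hidden' layers that make the network deep."""
    for W, b in layers[:-1]:
        x = relu(W @ x + b)     # each node: weighted sum of inputs, then nonlinearity
    W, b = layers[-1]
    return W @ x + b            # linear output layer
```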

Fig. 2. An example of a deep neural network. This network is 'deep' because of the hidden layers between the input and output. Each node represents a computational unit, with weights multiplied with inputs to form the output.<br />

III. DEEP LEARNING TECHNIQUES<br />

Computer vision represents a good starting point for understanding deep learning techniques as they apply to automobiles. Most pattern recognition tasks, like detecting a pedestrian in front of your car, are part of a broad class of "object detection" techniques. Traditionally, a computer vision algorithm was hand-crafted for each object to be detected. Examples of algorithms used for detection include Viola-Jones and, more recently, Histogram of Oriented Gradients (HOG). The HOG algorithm looks at the edge directions within an image to try to describe objects (Fig. 3). HOG was considered state of the art for pedestrian detection as late as 2014. It had a reasonable level of accuracy, but a significant restriction was the amount of work required to convert detecting a pedestrian into detecting, for example, a dog.<br />

Fig. 3. Example of Histogram of Oriented Gradients (HOG) applied to pedestrian detection.<br />

The important breakthrough of deep neural networks is that object detection no longer has to be a hand-crafted coding exercise. Deep neural networks allow features to be learned automatically from training examples. Although the concept of deep neural networks has been around for a long time, only recently have semiconductors achieved the processor performance to make them a practical reality. In 2012, a convolutional neural network (CNN)-based entry into the annual ImageNet competition showed a significant improvement in accuracy on the task of image classification over traditional computer vision algorithms (Fig. 4). Research in neural networks accelerated, as did the improvements in accuracy. By 2015, for the ImageNet task of classifying a thousand objects, neural networks had not only far surpassed traditional computer vision techniques, they were beating human detection.<br />

Fig. 4. Error rates for ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners have dropped dramatically since 2012, when deep learning was introduced.<br />

The neural networks used to win the ImageNet Large Scale Visual Recognition Challenge were convolutional neural networks (CNNs), which are the current state of the art for efficiently implementing deep neural networks for vision. CNNs are more efficient because they reuse a lot of weights across the image. CNN-based pedestrian detection solutions have been shown to have better accuracy than algorithms like HOG and, perhaps more importantly, it is easier to retrain a CNN to look for a bicycle than it is to write a new hand-crafted algorithm to detect a bicycle instead of a pedestrian.<br />

IV. DEEP LEARNING APPLIED TO AUTOMOTIVE OBJECT<br />

DETECTION<br />

Auto manufacturers are including more cameras in their cars,<br />

as shown in Fig. 5. A front facing camera can detect pedestrians<br />

or other obstacles and, with the right algorithms, assist the<br />

driver in braking. A rear-facing camera – mandatory in the<br />

United States for most new vehicles starting in 2018 – can save<br />

lives by alerting the driver to objects behind the car, out of the<br />

driver’s field of view. A camera in the cars cockpit facing the<br />

driver can identify and alert for distracted driving. And most<br />

recently, adding four to six additional cameras can provide a<br />

360-degree view around the car. Giving an automobile a set of<br />

eyes – in the form of multiple cameras and image sensors – is a<br />

first step, but it also will be critical for the automobile to<br />

interpret content from those images and react accordingly.<br />



Fig. 5. Cameras, enabled by high-performance vision processors, can "see" if<br />

objects are not in the expected place.<br />

To replace human decision making, a front facing camera,<br />

for example, has to be consistently faster than the driver in<br />

detecting and alerting for obstacles. While an ADAS system can<br />

physically react faster than a human driver, it needs embedded<br />

vision to provide real-time analysis of the streaming video and<br />

know what to react to.<br />

Vision processing solutions will need to scale as future<br />

demands call for more processing performance. A 1MP image<br />

is a reasonable resolution for existing cameras in automobiles.<br />

However, more cameras are being added to the car and the<br />

demand is growing from 1MP to 3MP or even 8MP cameras.<br />

The greater a camera’s resolution, the farther away an object can<br />

be detected. There are simply more bits to analyze to determine<br />

if an object, such as a pedestrian, is ahead. The camera frame rate (FPS) is also important: the higher the frame rate, the lower the latency and the more stopping distance remains available. For a 1MP<br />

RGB camera running at 15 FPS, that would be 1280x1024<br />

pixels/frame times 15 frames/second times three colors or about<br />

59M bytes/second to process. An 8MP image at 30fps will<br />

require 3264x2448 pixels/frame times 30 frames/second times<br />

three colors or about 720M bytes/second.<br />
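The two data rates above follow directly from resolution, frame rate and color depth:

```python
def bytes_per_second(width, height, fps, channels=3):
    # pixels per frame x frames per second x color channels (1 byte each)
    return width * height * fps * channels

rate_1mp = bytes_per_second(1280, 1024, 15)   # ~59 MB/s
rate_8mp = bytes_per_second(3264, 2448, 30)   # ~720 MB/s
print(rate_1mp / 1e6, rate_8mp / 1e6)
```

The jump from 1MP@15fps to 8MP@30fps is roughly a 12x increase in raw pixel bandwidth, which is why vision processing solutions must scale.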

This extra processing performance can’t come with a disproportionate spike in power or die area. Automobiles are consumer items under constant price pressure, so low power is very important. Vision processor architectures have to be optimized for power and area and yet still retain programmability.<br />

V. CHIP OPTIONS FOR DEEP LEARNING IMPLEMENTATIONS<br />

Implementing deep learning in embedded applications requires a lot of processing power with the lowest possible power<br />

consumption. Processing power is needed to execute<br />

convolutional neural networks – the current state-of-the-art for<br />

embedded vision applications – while low power consumption<br />

will extend battery life, improving user experience and<br />

competitive differentiation. To achieve the lowest power with<br />

the best CNN graph performance in an ASIC or SoC, designers<br />

are turning to dedicated CNN engines.<br />

GPUs helped usher in the era of deep learning computing.<br />

The performance improvements gained by shrinking die<br />

geometries combined with the computational power of GPUs<br />

provide the horsepower needed to execute deep learning<br />

algorithms. However, the larger die sizes and higher power<br />

consumed by GPUs, which were originally built for graphics and<br />

repurposed for deep learning, limit their applicability in power-sensitive<br />

embedded applications.<br />

Vector DSPs–very large instruction word SIMD processors–<br />

were designed as general purpose engines to execute<br />

conventionally programmed computer vision algorithms. A<br />

vector DSP’s ability to perform simultaneous multiplyaccumulate<br />

(MAC) operations help it execute the twodimensional<br />

convolutions needed to execute a CNN graph more<br />

efficiently than a GPU. Adding more MACs to a vector DSP will<br />

allow it to process more CNN computations per cycle, improving the frame<br />

rate. More power and area efficiency can be gained by adding<br />

dedicated CNN accelerators to a vector DSP.<br />

The best efficiency, however, can be achieved by pairing a<br />

dedicated yet flexible CNN engine with a vector DSP (Fig. 6).<br />

A dedicated CNN engine can support all common CNN<br />

operations (convolutions, pooling, elementwise) rather than just<br />

accelerating convolutions and will offer the smallest area and<br />

power consumption because it is custom designed for these<br />

parameters. The vector DSP is still needed for pre- and post-processing<br />

of the video images.<br />

Fig. 6. Adding a CNN engine to an embedded vision processor enables the<br />

system to learn through training.<br />

A dedicated CNN engine is also optimized for memory and<br />

register reuse. This is just as important as the number of MAC<br />

operations that the CNN engine can perform each second,<br />

because if the processor doesn’t have the bandwidth and<br />

memory architecture to feed those MACs, the system will not<br />

achieve the optimal performance. A dedicated CNN engine can<br />

be tuned for optimal memory and register re-use in state-of-the-art<br />

networks like ResNet, Inception, Yolo, and MobileNet.<br />

Even lower power can be achieved with a hardwired ASIC<br />

design. This can be the desired solution when the industry agrees<br />

on a standard. For example, video compression using H.264 was<br />

implemented on programmable devices before the standard was<br />

settled on, and implemented on ASICs afterwards. While CNN<br />

has emerged as the state-of-the-art standard for embedded vision<br />

implementation, how the CNN is implemented is evolving and<br />

remains a moving target, requiring designers to implement<br />

flexible and future-proof solutions.<br />

VI. TRAINING AND DEPLOYING DEEP LEARNING CNNS<br />

As mentioned earlier, a CNN is not programmed. It is<br />

trained. A deep learning framework, like Caffe or TensorFlow,<br />

will use large data sets of images to train the CNN graph –<br />

refining coefficients over multiple iterations – to detect specific<br />

features in the image. Fig. 7 shows the key components for CNN<br />

graph training, where the training phase uses banks of GPUs in<br />

the cloud for the significant amount of processing required.<br />

www.embedded-world.eu<br />

634


Fig. 7. Components required for graph training<br />

The deployment – or “inference” – phase is executed on the<br />

embedded system. Development tools, such as Synopsys’s<br />

MetaWare EV Toolkit, take the 32-bit floating point weights or<br />

coefficients output from the training phase and scale them to a<br />

fixed-point format. The goal is to use the smallest bit resolution<br />

that still produces equivalent accuracy compared to the 32-bit<br />

floating point output. Fewer bits in a multiply-accumulator<br />

means less power required to calculate the CNN and smaller die<br />

area (leading to lower cost) for the embedded solution. Most<br />

object detection or classification tasks need 8 bits of resolution<br />
to achieve the same accuracy as the 32-bit Caffe output.<br />

Advanced tools take the weights and the graph topology (the<br />

structure of the convolutional, non-linearity, pooling, and fully<br />

connected layers that exist in a CNN graph) and map them into<br />

the hardware for the dedicated CNN engine. Assuming there are<br />

no special graph layers, the CNN is now “programmed” to detect<br />

the objects that it’s been trained to detect.<br />

Fig. 8 shows the inputs and outputs of an embedded vision<br />

processor. The streaming images from the car’s camera are fed<br />

into the CNN engine that is preconfigured with the graph and<br />

weights. The output of the CNN is a classification of the contents<br />

of the image.<br />

Fig. 8. Inputs and outputs of embedded vision processor<br />

VII. DEEP LEARNING ALGORITHMS<br />

Some of the earliest implementations using CNNs were<br />

based on the neural networks or graphs used by the ImageNet<br />

winners. AlexNet was popular as a benchmark initially,<br />
although it has since fallen out of favor because some of its layers<br />
are inefficient or obsolete. VGG and versions of GoogleNet and ResNet<br />

are still popular as classification graphs. These graphs will take<br />

a two-dimensional image and return a probability that the<br />
image includes one of the objects that the graph was trained to<br />
recognize (Fig. 8). There is also an evolving class of<br />

localization graphs – CNNs that will not only identify what is in<br />

the picture, but will identify where the object is. RCNN (regional<br />

CNN), Faster RCNN, SSD and Yolo (Fig. 9) are examples of<br />

these graphs.<br />

Fig. 9. A TinyYolo CNN graph running on Synopsys DesignWare EV61<br />

processor provides an example of object detection and localization for<br />

automotive and surveillance applications<br />



We’ve discussed object classification of pedestrians (or<br />

bicycles or cars or trucks) that can be used for collision<br />

avoidance – an ADAS example. CNN engines with high enough<br />

performance can also be used for scene segmentation –<br />

identifying all the pixels in an image. The goal for scene<br />

segmentation is less about identifying specific pixels than it is to<br />

identify the boundaries between types of objects in the scene.<br />

Knowing where the road is compared to other objects in the<br />

scene provides a great benefit to a car’s navigation.<br />

Fig. 10. Scene segmentation identifies the boundaries between types of objects<br />

Much of the research has been aimed at improving the accuracy of<br />
object detection or recognition. As the accuracy improves, the<br />
focus has begun to shift to achieving high accuracy with fewer<br />
computations. Fewer computations will both lower bandwidth<br />
and improve power consumption of the implementation. In<br />
addition to new graphs, a lot of research has been focused on<br />
optimizing the existing CNN graphs by pruning coefficients or<br />

compressing features – the intermediate outputs of each layer of<br />

CNN computations. An important requirement is to make sure<br />

the CNN hardware and software supports the latest techniques<br />

of compression and pruning.<br />

VIII. POWER CONSIDERATIONS FOR DEEP LEARNING<br />

IMPLEMENTATIONS<br />

Deep learning algorithms like CNN have to process a lot of<br />

pixels in a short amount of time. This requires a significant<br />

amount of computations and lots of data transferred across an<br />

internal AXI bus. There is no question that power – or energy<br />

consumed – is high on the concern list for SoC designers, even<br />

in automotive designs.<br />

For a given process node, the easiest way to lower power is<br />

to start by lowering the frequency of the design. Other low<br />

power techniques include near-threshold logic where the logic<br />

runs at a lower voltage, greatly reducing the power required to<br />

switch the transistor. Minimizing external bus bandwidth also<br />

helps cut power. The less external bus activity, the less power is<br />

consumed. For an embedded vision application, increasing the<br />

size of internal memory will decrease bandwidth and thereby<br />

lower power, even though it will increase the overall area of the<br />

design. Another way to minimize bandwidth – and cut power – is<br />

to use compression techniques on CNN graphs to reduce the<br />

computations and memory usage.<br />

For the most power sensitive embedded vision applications,<br />

a vision processor with a dedicated CNN engine could be the<br />

difference between meeting the design’s power budget or<br />

missing it. Choosing a dedicated CNN engine seems intuitive,<br />

but how do you measure the power before silicon is available?<br />

Consider an application having to meet a performance<br />

threshold within a tight power and thermal budget such as a<br />

battery powered camera in the cockpit of the car to identify<br />

driver alertness. Facial recognition – depending on desired<br />

frame size, frame rate and other parameters – might require a<br />

few hundred GMAC/s of embedded vision processing power.<br />

An ASIC or SoC design must now find an embedded vision<br />

solution that can execute that network within the design’s power<br />

budget – let’s say several hundred mW.<br />

Unfortunately, comparing vision processor IP is not simple.<br />

Bleeding edge IP solutions often haven’t reached silicon yet, and<br />

every implementation is different, making it difficult to calculate<br />

and compare power or performance between IP options. No<br />

benchmark standards exist for comparing CNN solutions. An<br />

FPGA prototyping platform might provide accurate benchmarks<br />

but not accurate power estimates.<br />

One way to calculate power consumption is to run a RTL or<br />

Netlist based simulation to capture the toggling of all the logic.<br />

This information, using the layout of the design, can provide a<br />

good power estimate. For smaller designs, the simulation could<br />

complete in hours, e.g., running CoreMark or Dhrystone on an<br />

embedded RISC core. For large designs, however, the simulation<br />
runs slowly: larger CNN graphs with high frame rate requirements<br />
could take days or even weeks to reach a steady state at which<br />
power can be measured. There is a real risk of IP vendors skipping such<br />

arduous power measurements in favor of estimating power<br />

through shortcuts with smaller simulation models, pushing the<br />
problem downstream to the SoC vendors, who must then sign off<br />
on the IP vendor’s power analysis claims.<br />

Low power requirements aren’t limited to designs using<br />

small CNN graphs. An autonomous vehicle, for example, might<br />



require significant embedded vision performance – one or more<br />

8MP cameras running at 60 fps could require 20 to 30 TMAC/s<br />

of computational power – all within the lowest possible power<br />

budget. Note that these TMAC/s requirements might also be<br />

listed as tera-operations per second (TOP/s). Since a MAC cycle<br />

includes two operations (one multiply and one accumulate),<br />

MAC/s are converted to Ops/s by multiplying by two.<br />
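In code, the conversion is a single factor of two (a minimal helper for illustration):

```c
/* A MAC counts as two operations (one multiply, one accumulate),
 * so MAC/s convert to operations/s by multiplying by two. */
static double tmacs_to_tops(double tmacs)
{
    return 2.0 * tmacs;
}
/* e.g. the 20 to 30 TMAC/s quoted above corresponds to 40 to 60 TOP/s */
```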

For this application, having a dedicated CNN for the lowest<br />

power is only helpful if it can scale to higher levels of<br />

performance needed. Embedded vision processors such as<br />

Synopsys’ EV6x family address this challenge in two ways – by<br />

scaling the number of MACs within each CNN engine, and then<br />

by scaling multiple instances of the CNN engine on the bus<br />

fabric, e.g., a tailored NoC or standard AXI.<br />

IX. DEEP LEARNING AND FUNCTIONAL SAFETY<br />

Automotive manufacturers must ensure systems function<br />

correctly to avoid hazardous situations. The highest<br />

performance, lowest power CNN engine is of little use in an<br />

automotive design if it cannot meet critical safety requirements<br />

– like the ISO 26262 standard and Automotive Safety Integrity<br />

Levels (ASIL) – without significant loss of functionality. Safety-certified<br />

products must be able to detect and manage faults. As<br />

deep learning moves from an ADAS system whose only job is to<br />
alert the driver (e.g., lane departure warnings) to the primary<br />

decision maker driving the vehicle, fault detection becomes<br />

more critical. ASIL D (Fig. 11) is required for the most safety-critical<br />

components (


FOC SoC - Field Oriented Control Servo on Chip<br />

Dr. Lars Larsson<br />

Research & Development<br />

TRINAMIC Motion Control GmbH & Co. KG<br />

Hamburg, Germany<br />

Abstract— Field-Oriented Control (FOC), or Vector Control<br />

(VC), has been a well-known method for the energy-efficient<br />
commutation of electromagnetic motors for almost half a<br />
century. So far, FOC has been implemented completely in software.<br />

There are processors available with integrated hardware<br />

supporting base transformations (Clarke, Park, iPark, iClarke)<br />

that are required for FOC. In addition to these base<br />

transformations, the realization of FOC requires PI controllers,<br />

and different peripheral function blocks such as pulse width<br />

modulation (PWM), analog digital converter (ADC), and<br />

interfaces for encoder and for Hall signals (analog or digital), to<br />

form a complete FOC servo control system.<br />

This article discusses the advantages of a full implementation<br />

of the FOC in hardware on a single chip together with easy-to-use<br />
peripheral units for rotor position determination by an encoder<br />
and for current measurement with integrated scaling for signal<br />
conditioning as required for FOC. The implementation encompasses<br />
the most real-time critical inner FOC current regulation loop,<br />

together with the less time critical velocity control loop, and the<br />

less time critical position control loop. Altogether, integrated<br />

analog and digital building blocks form a Field-Oriented Control<br />

Servo-on-Chip – a FOC SoC.<br />

Keywords—Field Oriented Control; Vector Control; Servo<br />

Control; Hardware Implementation of FOC on a Chip; SoC<br />

I. INTRODUCTION<br />

The initial setup of the FOC is usually very time<br />

consuming and complex, although source code is freely<br />

available for various processors. This is because the FOC has<br />

many degrees of freedom that all need to fit together in a chain<br />

in order to work.<br />

Currently available processor architectures used for FOC<br />
limit the inner current regulation loop to PWM frequencies<br />
within a typical range of 20 kHz to 30 kHz – just outside the<br />
audible frequency range – which is sufficient when the<br />
PWM frequency is limited by the switches of the power stage.<br />

In contrast to software solutions, a hardware solution enables<br />

current regulation update rates up to 100 kHz and beyond. In<br />

addition, a hardware solution allows permanent monitoring of<br />

critical limits without eating up performance, unlike software<br />

solutions. On the other hand, software is more suitable for<br />

implementation of communication protocol handling and for<br />

flexible adaptation of application specific requirements.<br />

The integration of the FOC as a SoC (System-on-Chip)<br />

drastically reduces the number of required components and<br />

reduces the required printed circuit board (PCB) space. The high<br />

integration of FOC, together with velocity controller and<br />

position controller as a SoC, enables the FOC as a standard<br />

peripheral component that transforms digital information into<br />

physical motion.<br />

Compact size together with high performance and energy<br />

efficiency especially for battery powered mobile systems are<br />

enabling factors when embedded goes autonomous.<br />

Fig. 1. Illustration of FOC Basic Principle by Cartoon [8].<br />

II.<br />

FOC<br />

The base functions of the FOC are straightforward<br />

mathematics. Implementation of these base functions using<br />

floating-point arithmetic is possible in a simple way. A PC<br />

equipped with a processor with an integrated hardware floating-point<br />
unit can achieve a performance within the range of one<br />

million FOC calculations per second (Table 1). However, in<br />

terms of cost and energy a PC is not really an option for<br />

current control of a single motor with FOC. Different<br />

additional components are required to build a full FOC system<br />

for motor control.<br />




Fig. 2. FOC is an efficient method to turn a wheel [8].<br />

III.<br />

WHY FOC?<br />

It is a method for turning an electric motor smoothly with<br />

low torque ripple in the most energy efficient way. FOC is<br />

suitable for both motorized and regenerative operation. The<br />

method is proven over many years by many applications.<br />

IV.<br />

WHAT IS FOC?<br />

The Field Oriented Control was independently developed<br />

by K. Hasse, TU Darmstadt, 1968 [1], and by Felix Blaschke,<br />

TU Braunschweig, 1973 [2]. Theory of motor control [3] and<br />

control technology in general [4, 5] are fundamental for FOC,<br />

while the implementation of FOC brings more technical<br />

constraints into it [6, 7].<br />

The FOC is a current regulation scheme for electric motors<br />

that takes the orientation of the magnetic field of the stator of<br />

the motor and angle of the rotor of the motor with its magnetic<br />

axis into account. The FOC controls the torque so that<br />
the motor delivers the amount of torque that is requested as the target<br />
torque. The FOC maximizes active power and minimizes idle<br />
power – which finally results in the lowest power dissipation – by<br />
intelligent closed-loop control, illustrated by the cartoon (Fig.<br />

1). FOC is an efficient method to turn a wheel applying<br />

tangential force (represented by I Q) only while zeroing<br />

radial force (represented by I D) as the result of field<br />

oriented closed-loop control (Fig. 2).<br />


V. WHY FOC AS A PURE HARDWARE SOLUTION?<br />

The basic implementation of the inner FOC loop in<br />

software with C using double precision arithmetic is relatively<br />

easy. At first sight, one can achieve a performance that is<br />

more than sufficient by executing the code on a PC with a CPU<br />

with floating-point unit (FPU). Nevertheless, the initial setup of<br />

the FOC is usually very time consuming and complex,<br />

although source code is freely available for various processors.<br />

This is because the FOC has many degrees of freedom that all<br />

need to fit together in a chain in order to work.<br />

The hardware FOC as an existing standard building block<br />

drastically reduces the effort in system setup. With an off-the-shelf<br />

building block, the starting point of FOC is no longer the<br />

setup and implementation of the FOC itself and the creation<br />

and programming required for the interface blocks. Instead,<br />

only the parameters for the FOC have to be set up. Real parallel<br />
processing of hardware blocks decouples the higher-level<br />

application software from high-speed real-time tasks and<br />

simplifies the development of application software. With a<br />

field oriented control servo-on-chip realized as system-on-chip<br />

as a building block, users are free to use their qualified CPU<br />
together with their qualified tool chain. A field oriented control<br />

servo-on-chip as a hardware building block frees the user from<br />

fighting with processor-specific challenges concerning interrupt<br />

handling and direct memory access. The TMC4671 is such a<br />

FOC SoC [8]. There is no need for a dedicated tool chain to<br />

access the TMC4671 registers and to operate it. Only SPI (or UART)<br />

communication needs to be implemented for a given user CPU.<br />

FOC as a SoC (System-on-Chip) is in contrast to the<br />

classical FOC servo controller formed by a motor block and a<br />

separate controller box wired with motor cable and encoder<br />

cable. The high integration of FOC available as a standard<br />

peripheral component enables FOC for embedded applications<br />

where turning a motor is just part of an embedded application<br />

and not the primary application itself. A typical software FOC<br />

system architecture is outlined by Fig. 3 with the FOC as part<br />

of the application software. The challenge of this architecture is<br />

the software emulation of parallel processing of different tasks<br />

that might cause disturbance to the FOC itself. The pure<br />

hardware based FOC architecture is outlined by Fig. 4. In<br />

hardware, parallel processing of different tasks can be realized<br />

in a natural way. The challenge of hardware design is the<br />

higher effort in realizing the basic arithmetic function<br />

compared to software. The FOC as a standard SoC hardware<br />

component can fully encapsulate all real-time tasks from the<br />

software side.<br />


Fig. 3. Typical Software FOC System Architecture.<br />

Fig. 4. Hardware Based FOC System Architecture (TMC4671)<br />



VI.<br />

HOW DOES FOC WORK?<br />

Two force components generated by two current<br />

components act on the rotor of an electric motor. One<br />

component just pulls in the radial direction (I D), while the<br />
other component, pulling tangentially (I Q), applies torque.<br />

The ideal FOC - apart from field weakening - performs a<br />

closed-loop current regulation that results in a pure torque<br />

generating current I Q without direct current I D.<br />

From a top level perspective, FOC for three-phase motors<br />

uses three phase currents of the stator interpreted as a current<br />

vector (Iu; Iv; Iw) and calculates three voltages interpreted as a<br />

voltage vector (Uu; Uv; Uw) taking the orientation of the rotor<br />

into account in a way that only the torque generating current I Q<br />

results. As for two-phase motors, the FOC uses two phase<br />

currents of the stator interpreted as a current vector (Ix; Iy) and<br />

calculates two voltages interpreted as a voltage vector (Ux; Uy)<br />

taking the orientation of the rotor into account in a way that<br />

only a torque generating current I Q results. To do so, the<br />

knowledge of some static parameters (number of pole pairs of<br />

the motor, number of pulses per revolution of a used encoder,<br />

orientation of encoder relative to magnetic axis of the rotor,<br />

count direction of the encoder) is required together with some<br />

dynamic parameters (phase currents, orientation of the rotor).<br />

VII. WHAT IS REQUIRED FOR FOC?<br />

The FOC is based on a couple of transformations that need to<br />

be implemented. It takes the actual current vector together with<br />

the actual electrical angle from the motor and calculates a<br />

voltage vector that is applied to the motor.<br />

The FOC for three-phase (FOC3) permanent magnet<br />

synchronous motors (PMSM [9]) maps a three-dimensional<br />

current vector (Iu; Iv; Iw) together with an angle φ to a three-dimensional<br />
voltage vector (Uu; Uv; Uw):<br />

(Uu; Uv; Uw) = FOC3((Iu; Iv; Iw); φ)   (1)<br />

The FOC for two-phase (FOC2) permanent magnet<br />
synchronous motors (stepping motors [10]) maps the actual<br />
two-dimensional current vector (Ix; Iy) together with<br />
the actual electrical angle φ to a two-dimensional voltage vector<br />
(Ux; Uy) that is applied to the motor:<br />
(Ux; Uy) = FOC2((Ix; Iy); φ)   (2)<br />

Fig. 5. Inner FOC Loop Architecture for Three-Phase Motors (FOC3)<br />
Fig. 6. Inner FOC Loop Data Flow for Three-Phase Motors (FOC3)<br />
Fig. 7. Inner FOC Loop Architecture for Two-Phase Motors (FOC2)<br />
Fig. 8. Inner FOC Loop Data Flow for Two-Phase Motors (FOC2)<br />



The voltage vectors (Uu; Uv; Uw) and (Ux; Uy), respectively, drive<br />
the currents (Iu; Iv; Iw) and (Ix; Iy) toward the desired target current<br />
vectors satisfying the condition (I Q = I TARGET; I D = 0). To do so,<br />

the three currents need to be known together with the electrical<br />

angle of the magnetic axis of the rotor of the motor.<br />

The angle can be measured with different kinds of sensors<br />

(incremental encoder, analog encoder, digital Hall sensors, and<br />

analog Hall sensors). Alternatively, the angle can be estimated<br />

sensorlessly by measuring voltages and currents together with a<br />

mathematical model of the motor.<br />

A. Current Measurement, Offset Cleaning and Scaling for FOC<br />

The phase currents are essential state parameters for the<br />

FOC. The currents can be measured using sense resistors with<br />

sense amplifiers giving analog voltages that represent the<br />

measured currents.<br />

Alternatively, one can use isolated sense amplifiers with<br />

integrated delta-sigma modulators [11, 12, 13] that give digital<br />

delta-sigma signal streams representing the measured currents.<br />

Currents can also be measured based on Hall sensors.<br />

Whatever type of current measurement is selected, before the<br />

measured current values are available for processing within the<br />

FOC loop, they must be freed from offsets and scaled to the<br />

value range of the FOC loop.<br />
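This offset-and-scale step can be sketched in C as follows; the structure fields and the Q15 fixed-point format are illustrative assumptions, not actual TMC4671 registers:

```c
#include <stdint.h>

/* Minimal sketch of offset cleaning and scaling of a raw ADC current
 * sample into the FOC value range. Names and the Q15 format are
 * illustrative assumptions only. */
typedef struct {
    int32_t offset;  /* ADC reading at zero current, from calibration */
    int32_t scale;   /* fixed-point gain, Q15: 32768 represents 1.0   */
} current_cal_t;

static int16_t adc_to_foc_current(int32_t raw, const current_cal_t *cal)
{
    int32_t i = (raw - cal->offset) * cal->scale / 32768;
    if (i >  32767) i =  32767;  /* saturate to the 16-bit FOC range */
    if (i < -32768) i = -32768;
    return (int16_t)i;
}
/* with offset 2048 and unity scale (32768), raw 2148 maps to +100 */
```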

B. Electrical Rotor Angle, Orientation and Direction<br />

The electrical angle of the rotor is an essential state variable<br />

required for the FOC. An encoder measures the mechanical<br />

angle of the rotor in terms of its resolution, the number of<br />

positions per revolution (PPR). Some encoder vendors call this<br />

counts per revolution (CPR) or pulses per revolution (PPR),<br />

others give lines per revolution (LPR) or line counts (LC)<br />

where PPR might mean either CPR or a quarter of LPR.<br />

Nevertheless, the FOC needs to know the electrical angle<br />
normalized to its internal numerical representation. To map a<br />

mechanical angle measured by an encoder, one needs to take<br />

the number of pole pairs (NPP) of the used motor into account.<br />

The direction of rotation is an additional parameter as a<br />

degree of freedom. A possible phase shift between measured<br />

encoder angle and rotor angle needs to be taken into account by<br />

initialization. Hall signals that give absolute positions within<br />
each electrical period might have a different direction of<br />
revolution – a different sign of φ – than the rotor of the motor has.<br />

(Iq; Id) = PARK(φ) ∗ CLARKE ∗ (Iu; Iv; Iw)   (3)<br />
with<br />
Iv = −(Iu + Iw),   (4)<br />
(Uq; Ud) = PID((Eq; Ed)) with (Eq; Ed) = (Iq; Id) − (Iq_target; Id_target),   (5)<br />
PID(E) = P ∗ E(t) + I ∗ ∫ E(t) dt + D ∗ (d/dt) E(t),   (6)<br />
(Uu; Uv; Uw) = iCLARKE ∗ iPARK(φ) ∗ (Uq; Ud)   (7)<br />
with (matrices written row by row, rows separated by semicolons)<br />
CLARKE = (2/3) ∗ [ 1, −1/2, −1/2 ; 0, √3/2, −√3/2 ],   (8)<br />
PARK(φ) = [ cos(φ), sin(φ) ; −sin(φ), cos(φ) ],   (9)<br />
iCLARKE = [ 1, 0 ; −1/2, √3/2 ; −1/2, −√3/2 ],   (10)<br />
iPARK(φ) = [ cos(φ), −sin(φ) ; sin(φ), cos(φ) ].   (11)<br />

C. Basic Transformations and Functions of FOC<br />

The basic operations of FOC are Clarke Transformation,<br />

Park Transformation (PARK), PI control, inverse Park<br />

Transformation (iPARK), and inverse Clarke Transformation<br />

(iCLARKE).<br />

Fig. 9. Pure Hardware FOC Servo Controller on Chip as Engineering<br />
Sample in compact QFN Package (11.5 x 6.5 mm, 76 pins, 0.4 mm pitch).<br />



foc3(iu, iv, iw, phi, iq_tg, id_tg, &uu, &uv, &uw)<br />
{<br />
clarke(iu, iv, iw, &ia, &ib);<br />
park(ia, ib, phi, &id, &iq);<br />
pid(id, id_tg, Pd, Id, Dd, &ud, dt);<br />
pid(iq, iq_tg, Pq, Iq, Dq, &uq, dt);<br />
ipark(ud, uq, phi, &ua, &ub);<br />
iclarke(ua, ub, uu, uv, uw);<br />
}<br />

Fig. 10. C Code Structure of Three-Phase FOC3.<br />

foc2(ix, iy, phi, iq_tg, id_tg, &ux, &uy)<br />
{<br />
park(ix, iy, phi, &id, &iq);<br />
pid(id, id_tg, Pd, Id, Dd, &ud, dt);<br />
pid(iq, iq_tg, Pq, Iq, Dq, &uq, dt);<br />
ipark(ud, uq, phi, ux, uy);<br />
}<br />

Fig. 11. C Code Structure of Two-Phase FOC2.<br />

With these basic transformations of the FOC implemented<br />
together with a PI controller as software functions, the inner<br />
FOC loop needs to be calculated periodically by calling either<br />
foc3() or foc2() once per PWM cycle. An overview of reported<br />
FOC loop performance achieved by different processors<br />
compared to the FOC SoC is given in Table I.<br />

TABLE I. INNER FOC LOOP PERFORMANCE OUTLINE<br />
Platform | FOC loops / s [Citation]<br />
Intel i7-3777, 3.4 GHz, MinGW gcc 4.4.0, PC | 1.1 M [14] (a)<br />
AMD Ryzen Threadripper 1950X, 3.4 GHz, MinGW gcc 4.4.0, PC | 1.1 M [14] (a)<br />
Intel i7-3250M, 2.9 GHz, MinGW gcc 4.4.0, X230 Laptop | 1.1 M [14] (a)<br />
Intel Atom x7-Z8700, 1.6 GHz, MinGW gcc 4.4.0, Surface 3 | 450 k [14] (a)<br />
TMC4671, 25 MHz, pure hardware FOC SoC | 250 k [8]<br />
Texas Instruments AM437x, ARM Cortex-A9, 1 GHz | 47 k [15]<br />
STMicroelectronics STM32F103ZE, 72 MHz, 85% FOC | 2 × 20 k [16]<br />
Microchip Technology dsPIC33FJ32MC204, 21 MIPS, 66% FOC | 20 k [17]<br />
XC886/XC888, 96 MHz, 58% FOC | 20 k [18]<br />
Atmel (Microchip) AT32UC3B0256, 42 MHz, 35% FOC | 10 k [19]<br />
(a) Performance estimation of foc3() on a PC.<br />

Normally, an interrupt timer of highest priority triggers the periodic call of the inner FOC loop. This might become an issue for a software solution when another interrupt needs to be executed for protocol handling or monitoring tasks.<br />

VIII. DELTA-SIGMA CONVERTERS AND INTERFACE<br />

The phase currents need to be measured as input values for the FOC. Analog signal conditioning is done close to the power stage, with the measured currents available either as analog voltages from sense amplifiers like [20] or as digital delta-sigma data streams. Analog-to-digital converters (ADCs) based on the delta-sigma conversion principle [21] are also widely used in audio signal processing [22]. For control, delta-sigma conversion has the advantage that resolution can be traded against speed digitally. Additionally, delta-sigma sampling can measure throughout the entire PWM period, making it insensitive to noise and spikes due to switching events. Digitization of analog encoder signals or analog Hall signals can also be realized with delta-sigma ADCs, with the advantage that the bandwidth and resolution required for the encoder are digitally adjustable.<br />

Delta-Sigma ADCs integrated as part of the FOC SoC can internally operate with a 100 MHz delta-sigma sampling rate, giving good performance. External delta-sigma modulators for current measurement typically operate within a delta-sigma oversampling frequency range of 10 MHz to 20 MHz, which is sufficient for digitizing analog current signals in the typical frequency range from 0 Hz to a few kHz. The advantage of external isolated delta-sigma modulators for current measurement is that the sense amplifier is galvanically isolated. Additionally, digital delta-sigma data streams are quite insensitive to spikes and noise compared to analog sense amplifier voltages.<br />

For the FOC SoC with an integrated ADC engine designed to process either internal or external delta-sigma data streams, their usage reduces to selecting the type of delta-sigma source (internal or external, delta-sigma clock input or delta-sigma clock output) together with the delta-sigma clock frequency and the decimation rate, which determine the resolution and speed of the ADC channel.<br />
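The resolution-versus-speed adjustment via the decimation rate can be illustrated with a third-order sinc (sinc3) decimation filter, the filter commonly used on delta-sigma bitstreams. This is a generic sketch, not the FOC SoC's actual ADC engine: doubling the decimation rate R roughly adds three bits of resolution for a sinc3 filter while halving the output sample rate.

```c
#include <stdint.h>

/* sinc3 (CIC) decimator: three integrators run at the modulator bit rate,
   three differentiators run at the decimated output rate. */
typedef struct {
    int32_t  i1, i2, i3;   /* integrator stages          */
    int32_t  d1, d2, d3;   /* differentiator delay cells */
    uint32_t r, count;     /* decimation rate R, bit counter */
} sinc3_t;

/* Feed one modulator bit (0 or 1); returns 1 when *out holds a new sample.
   Full scale of the output is R^3. */
int sinc3_step(sinc3_t *f, int bit, int32_t *out)
{
    f->i1 += bit ? 1 : -1;
    f->i2 += f->i1;
    f->i3 += f->i2;
    if (++f->count < f->r)
        return 0;                       /* still accumulating */
    f->count = 0;
    int32_t c1 = f->i3 - f->d1;  f->d1 = f->i3;
    int32_t c2 = c1 - f->d2;     f->d2 = c1;
    *out = c2 - f->d3;           f->d3 = c2;
    return 1;
}
```

For a constant all-ones bitstream the output settles at the full-scale value R^3 after the filter's three-sample transient.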

IX. BENEFITS OF DELTA-SIGMA CONVERTERS FOR FOC<br />

Delta-Sigma ADCs integrated as part of a FOC SoC offer several advantages from the application point of view. With a delta-sigma signal-processing engine, external delta-sigma modulators are supported as well. Digital processing of high-frequency delta-sigma data streams requires dedicated hardware.<br />

A. Current Sensing<br />

From the current regulation point of view, the continuous oversampling of delta-sigma ADCs gives an advantage for the closed-loop current regulation itself, because the ADC values represent the mean current over a sample period or decimation period.<br />

www.embedded-world.eu<br />



B. Minimized Phase Shift by Simultaneous Sampling<br />
With delta-sigma ADCs integrated as part of a FOC SoC sampling all channels in parallel, the challenge of phase shift between ADC channels with multiplexed analog inputs disappears. The advantage of simultaneous sampling becomes especially relevant when using high-resolution analog sine-cosine encoders as position sensors.<br />

C. Adjustable ADC Resolution vs. ADC Bandwidth<br />

From a system-building point of view, delta-sigma ADCs integrated as part of a FOC SoC with digitally processed delta-sigma data streams enable flexible adjustment of resolution versus bandwidth, covering a wide range of applications with a single ADC hardware, in contrast to the application-specific selection and interfacing of different types of external ADCs with different speeds, resolutions, and interface types.<br />

D. Support of External Delta-Sigma Modulators<br />

From a system-building point of view, support of external isolated delta-sigma modulators enables applications where measurement of the phase currents is challenging due to high potential differences.<br />

X. UNIFIED POSITION SENSOR INTERFACE<br />

For a FOC SoC as a standard peripheral component, different types of position sensor interfaces (digital Hall sensors, digital incremental encoders, analog Hall sensors, and analog sine-cosine encoders) need to be supported in a unified way, taking motors with different numbers of pole pairs into account. In hardware, this can be realized by a set of registers that hold the relevant parameters of the available position sensors, mapping them to a normed electrical period for commutation. This decouples the used position sensor from the FOC itself.<br />

XI. PWM<br />

Pulse width modulation (PWM) is essential for energy-efficient power conversion. Where processors provide generic PWM peripherals, our integrated FOC SoC solution provides an integrated PWM unit dedicated to closed-loop control of brushed DC motors, two-phase stepper motors with FOC2, and three-phase permanent magnet synchronous motors (PMSM) with FOC3. The application just needs to select the type of motor. Optionally, the application can select pulse width modulation (PWM) or space vector pulse width modulation (SVPWM) for more efficient voltage usage with one control bit, whereas the PWM units of processors require more or less complex configuration that, in the worst case, might require re-compilation of the processor firmware. Additionally, the PWM frequency of the FOC SoC can be changed at any time during motion by setting a single parameter, in contrast to processors that require changes of several parameters of the PWM unit and might require adjustment or re-compilation of the FOC for another PWM frequency. Changing the PWM frequency is especially useful for low-inductance motors like [23] to reduce the phase current ripple, the associated torque ripple, and the supply current ripple. High PWM frequencies can reduce the power dissipation within the motor itself [24] due to lower current ripple at higher frequencies. On the other hand, higher PWM frequencies cause higher power dissipation within MOS-FET or IGBT power stages and within their gate drivers. MOS-FETs like [25, 26] with low gate charge are able to switch fast in the 100 kHz PWM frequency range with relatively low power dissipation using 1 A gate drivers like [27]. It makes sense to increase the PWM frequency dynamically when a motor runs at high speed and to decrease it when the motor runs at low speed or is at rest.<br />
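The speed-dependent PWM frequency policy described above can be sketched as a simple lookup; the speed thresholds and frequency steps here are invented for illustration, not values from the FOC SoC:

```c
#include <stdint.h>
#include <math.h>

/* Pick a PWM frequency from the motor velocity: high frequency at high
   speed to cut current and torque ripple, low frequency near standstill
   to cut switching losses. Thresholds are illustrative assumptions. */
uint32_t select_pwm_freq_hz(float velocity_rpm)
{
    float v = fabsf(velocity_rpm);
    if (v > 3000.0f)
        return 100000u;   /* high speed: 100 kHz, low-ripple region      */
    if (v > 300.0f)
        return 50000u;    /* mid speed: 50 kHz                           */
    return 25000u;        /* low speed / standstill: 25 kHz, low losses  */
}
```

On the FOC SoC this amounts to writing a single frequency parameter during motion; a hysteresis band around each threshold would avoid toggling between frequencies in a real design.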

Fig. 12. Raw Delta-Sigma ADC Values of Phase Currents (left) and offset-free, amplitude-scaled ADC Values (right) as Input for the FOC Engine.<br />




Fig. 13. FOC SoC Architecture with multi-ported Register Bank, primary Application Interface (SPI) and Real-Time Monitoring Interfaces (DBGSPI, UART)<br />

With the PWM frequency, one can control whether the power dissipation takes place more within the power stage or more within the motor. In contrast to MOS-FET stages, GaN-FET power stages can operate with low power dissipation even at PWM frequencies up to the MHz range, which could be processed by the FOC hardware where beneficial for applications.<br />

An additional parameter that affects the power dissipation is the so-called break-before-make (BBM) time. For the FOC SoC, this time is programmable, even during motion, separately for high-side switches and low-side switches in steps of 10 ns for fine-tuning. For gate drivers that handle their own BBM time, the BBM time handling can be disabled. Programmability of BBM times is essential for a FOC SoC solution as well as for the PWM units of processors.<br />

With FOC as a SoC, as a standard peripheral solution, one can focus on setting up and parameterizing the FOC itself instead of first implementing the FOC and then setting it up as a second step. With a FOC SoC, one can use any qualified processor with its qualified tool chain, in contrast to running the FOC in software on a processor. FOC as hardware takes care of all real-time critical tasks and keeps the processor free from the FOC, so that it can process the user application and protocol handling, which fit better to software. A hardware FOC that processes different types of position sensors and current sensors in a uniform way decouples application software development from those sensors. This decoupling of real-time tasks by dedicated hardware simplifies application software development, where turning a motor with FOC is just one component of an embedded system.<br />

XII. MULTI-PORTED COMMUNICATION INTERFACES<br />
Multi-ported user interfaces enable real-time monitoring of internal parameters while the FOC is running (fig. 13). Realizing this in hardware does not disturb the execution of the FOC operations. Multi-ported access implemented in software might disturb execution of the FOC when it consumes too much processing power.<br />

XIII. CONCLUSION<br />

The FOC has many degrees of freedom, due to a chain of parameters that all need to fit together for a successful setup of the FOC. With a hardware FOC available as a building block providing all necessary functionality, one can focus on parameterizing the FOC for the given application itself. The ability to look at internal registers in real time, in parallel to the running application, enables monitoring and initial setup with external tools without re-compiling software.<br />

REFERENCES<br />
[1] K. Hasse, Zur Dynamik drehzahlgeregelter Antriebe mit stromrichtergespeisten Asynchron-Kurzschlußläufermaschinen, Dissertation, TH Darmstadt, 1969.<br />

[2] Felix Blaschke, Das Verfahren der Feldorientierung zur Regelung der<br />

Drehfeldmaschine, Dissertation, TU Braunschweig. 1974.<br />

[3] W. Leonhard, Control of Electrical Drives, 3rd Edition, Springer, 2003.<br />
[4] W. Leonhard, Einführung in die Regelungstechnik: Lineare und nichtlineare Regelvorgänge für Elektrotechniker, Vieweg, 1992.<br />

[5] Michael A. Johnson, PID Control: New Identification and Design<br />

Methods, Springer, 2005.<br />

[6] Nguyen Phung Quang, Praxis der feldorientierten<br />

Drehstromantriebsregelungen, expert Verlag, 1993.<br />

[7] Nguyen Phung Quang, Jörg-Andreas Dittrich, Vector Control of Three<br />

Phase AC Machines, System Development in the Practice, Second<br />

Edition, Springer-Verlag, 2015.<br />

[8] TMC4671 Preliminary Datasheet 0v90, TRINAMIC Motion Control<br />

GmbH & Co. KG, September 29, 2017, www.trinamic.com<br />

[9] T. J. E. Miller, J. R. Hendershot, Design of Brushless Permanent-Magnet Machines, Motor Design Books LLC, 2010.<br />



[10] P. Acarnley, Stepping Motors: A Guide to Theory and Practice, Institution of Engineering and Technology, 4th edition, 2002.<br />

[11] AD7400 Isolated Sigma-Delta Modulator, Analog Devices, 2013.<br />

[12] AD7401 Isolated Sigma-Delta Modulator, Analog Devices, 2015.<br />

[13] AD7403 16-Bit Isolated Sigma-Delta Modulator, Analog Devices, 2015.<br />

[14] L. Larsson, TRINAMIC, Performance Estimation of FOC Loop with Double Precision Arithmetic in C with MinGW, unpublished, 2017.<br />

[15] TIDU701–December 2014 AM437x Single Chip Motor Control<br />

Benchmark, Texas Instruments, 2014.<br />

[16] AN3165 Application Note Digital PFC and dual FOC MC integration,<br />

STMicroelectronics, 2010, p. 16.<br />

[17] Jorge Zambada, Debraj Deb, Application Note AN 1078, Sensorless<br />

Field Oriented Control of a PMSM, Microchip Technology Inc., 2010.<br />

[18] Field Oriented Control Using XC886/888 MCU, Application Brief, Infineon Technologies, 2007; XC886/888 CM/CLM 8-Bit Flash Microcontroller, Sensorless Field Oriented Control for PMSM Motors, AP08059 Application Note V1.0, Infineon Technologies, 2007.<br />

[19] AVR32723: Sensor Field Oriented Control for Brushless DC motors<br />

with AT32UC3B0256, Atmel, 2009.<br />

[20] AD8418 Bidirectional Current Sense Amplifier, Analog Devices, 2013.<br />

[21] Shanthi Pavan, Richard Schreier, Gabor C. Temes, Understanding Delta-<br />

Sigma Data Converters, IEEE Press Series on Microelectronic Systems,<br />

Wiley, Second Edition, 2017.<br />

[22] Udo Zoelzer, Digital Audio Signal Processing, Wiley, 2008.<br />

[23] PMSM 3274G024BP4 3692 Datasheet, Faulhaber 2018.<br />

[24] Shunsuke Amano, Kan Akatsu, Study on High Frequency Inverter with<br />

100kHz Current Feedback Control by Using FPGA, 2014 17th<br />

International Conference on Electrical Machines and Systems (ICEMS),<br />

Oct. 22-25, 2014, Hangzhou, China.<br />

[25] BSZ068N06NS OptiMOS 60V Power-Transistor Data Sheet, Rev. 2.0<br />

2013-10-17, Infineon Technologies, 2013.<br />

[26] BSC030N08NS5, OptiMOS, 80V Power-Transistor Data Sheet,<br />

Rev.2.2 2014-11-10, Infineon Technologies, 2014.<br />

[27] LM5109B High Voltage 1A Peak Half-Bridge Gate Driver, Data Sheet,<br />

Texas Instruments, 2016.<br />

[28] L. Larsson, Hardware FOC-Servo-Regler mit integrierten Schnittstellen<br />

für autonomen Betrieb, Forum Elektromagnetismus 2017, Technische<br />

Akademie Esslingen (TAE) & Hochschule Heilbronn Campus<br />

Künzelsau (HHN) - Reinhold-Würth-Hochschule, February 16-17, 2017,<br />

Tagungshandbuch 2017, pp. 131-141<br />



High Speed Interfaces in Cost Optimized FPGAs<br />

Ted Marena<br />

Director of SoC FPGA Marketing<br />

Marketing Chair RISC-V Foundation<br />

Microsemi Corporation<br />

3870 N 1st Street, San Jose, CA 95134<br />

ted.marena@microsemi.com<br />

https://www.linkedin.com/in/tedmarena/<br />

Abstract—This document explores high speed interfaces that<br />

are now available in cost optimized FPGAs. The presentation will<br />

explain interfaces such as 1Gb Ethernet, JESD204B, PCIe,<br />

HDMI and DDR4 memory interfaces and how they can be<br />

utilized in cost optimized, mid-range density FPGAs. In the<br />

presentation, design examples showing these interfaces and steps<br />

necessary to implement these functions will be shown. For each<br />

design, the typical power consumption will be provided so<br />

engineers can judge for themselves the benefits of the new class of<br />

cost optimized, mid-range FPGAs. We will go into detail on the<br />

FPGA densities offered as well as the package sizes that could be<br />

leveraged for embedded designs.<br />

I. INTRODUCTION<br />

Industrial designs are increasingly requiring higher<br />

performance interfaces. Protocols such as DDR4 memory,<br />

10Gigabit Ethernet, JESD204B, Gigabit Ethernet, PCIe and<br />

more are becoming commonplace. These higher speed<br />

interfaces are often found on high end FPGAs which are often<br />

overkill and cost prohibitive for most embedded designs. Now<br />

there exists a new class of mid-range density FPGAs which are<br />

cost optimized, consume lower power and offer smaller form<br />

factors with generous high speed interfaces.<br />

II. MARKET DYNAMICS<br />
Although the industrial segment is unique, it shares several characteristics with other vertical markets. The requirement for better value and lower cost is a growing driver for industrial designs. In addition, faster and more numerous networking interfaces are becoming commonplace. Finally, faster processing performance in many embedded designs is now the new norm. These factors result in architectures which require interfaces such as Gigabit Ethernet, transceivers up to 12.7 Gbps for 10 Gb Ethernet, JESD204B ADC/DAC, PCIe interfaces, HDMI 2.0b and, lastly, DDR4 memory buses. Now that these types of interfaces are available in cost-optimized mid-range FPGAs, system architects can address the latest market dynamics for their products.<br />
<br />
III. KEY HIGH SPEED INTERFACES<br />
The most common interface being leveraged in many industrial designs is Gigabit Ethernet. Most commonly, an FPGA interfaces to a PHY via a serial SGMII interface. In the past, an SGMII interface required using high-speed transceivers in FPGAs, but with new cost-optimized mid-range FPGAs, SGMII interfaces are now available on generic GPIO pins.<br />
<br />
A. SGMII on GPIO<br />
Power-efficient Gigabit Ethernet interfaces are often required in industrial system architectures. Many embedded product developers are using Gigabit Ethernet for an increasing number of connections. No longer only for data payloads, these links are becoming ubiquitous for control, management, status, and more. Traditional mid-range FPGAs can support these 1 Gbps speeds, but they require transceivers to implement 1G SGMII interfaces (as well as other high-speed interfaces). Ideally, a device would have generic I/O pins that could support SGMII, as the following illustration shows.<br />

Low-end FPGAs and traditional mid-range FPGAs do not have this feature, so they must rely on transceivers. These transceiver interfaces are precious and frequently scarce unless very expensive, higher-density FPGA fabrics are used. A very large FPGA fabric is often not required in industrial<br />



designs, but designers are forced to choose these devices because they require additional transceivers. In addition, these larger devices dictate larger package form factors. These existing solutions increase both power consumption and cost, in opposition to the lower-cost demands of the industrial market.<br />

The new PolarFire FPGAs offer cost-optimized mid-range densities and address the requirement for numerous GigE links via SGMII on GPIOs. What differentiates this family is that it incorporates a clock and data recovery (CDR) circuit into high-speed LVDS I/Os that can support 1.25 Gbps. This allows the device to support SGMII interfaces on several select GPIO pins. Using this architecture, designers can reduce the cost, size and power of their designs versus traditional high-end FPGAs.<br />

B. Transceiver to support 10Gb Ethernet, JESD204B, PCIe,<br />

HDMI 2.0b and more<br />

Although industrial and embedded designs are not typically<br />

very high performance, the processing needs are increasing and<br />

the interfaces are also getting faster. These factors necessitate<br />

that FPGAs can support serial interfaces up to 12.5Gbps, so<br />

that these common interfaces can be supported:<br />

• PCIe Gen2 requires 5Gbps<br />

• HDMI 2.0b needs 6Gbps<br />

• 10Gb Ethernet requires 10Gbps<br />

• JESD204B can run up to 12.5Gbps<br />

These high-speed serial interfaces require transceivers able to operate at the speeds listed above. The performance for these rates is trivial for high-end FPGAs or mid-range FPGAs that are built off of high-end architectures. The issue with these devices is that they are costly and often beyond the budget of many embedded designs. Low-density FPGAs often do not have transceivers, and those that incorporate them do not support the performance rates listed. Fortunately, cost-optimized, mid-range density FPGAs with the right mix of LEs (logic elements) and transceivers can support the required data rates. Below is a table of one such FPGA family: the PolarFire FPGAs support a range of densities from 100k to 500k LEs, and each device offers up to 12.7 Gbps transceivers.<br />

Package balls / spacing       | 325 / 0.5 mm | 484 / 0.8 mm | 484 / 1 mm | 536 / 0.5 mm | 784 / 1 mm | 1152 / 1 mm<br />
Max LEs                       | 192K | 300K | 300K | 300K | 481K | 481K<br />
Max CDR GPIOs (SGMII)         | 8    | 14   | 13   | 15   | 20   | 24<br />
Transceivers up to 12.7 Gbps  | 4    | 4    | 8    | 4    | 16   | 24<br />
Max possible SGMII interfaces | 12   | 18   | 21   | 19   | 36   | 48<br />

These devices enable industrial architects not only to support the latest high-speed serial interfaces but also to implement the necessary board functions with adequate LEs on chip. In addition, because this family offers both SGMII on GPIO and transceivers, designers can often select smaller package sizes and densities, thus lowering the system cost and reducing the power needed for their FPGA functionality.<br />

C. DDR4 Interfaces<br />

Many embedded designs are required to interface to other parts of the overall system; stand-alone boards are, by and large, the exception and not the norm. Because designs often communicate and network with other system components, data is being transmitted to and from most industrial designs. As the data rates increase, so does the need to store the data so it can be processed and acted upon. Hence the need for memory on many industrial designs.<br />

The most common memory that engineers tend to connect to an FPGA is DDR DRAM. There are several generations to choose from, and generally speaking the best choice is to use memory which has been shipping for some time, but not the absolute newest standard. For DRAMs, the best cost per bit and the architecture that will be supported for numerous years is DDR4. Although DDR3 is still a viable choice for designs, the majority of new designs are choosing DDR4 because it will offer reduced pricing in the future, faster performance, and wider single-chip data buses.<br />

Today there are no low-density FPGAs that support DDR4 memory interfaces; one must go to mid-range density FPGAs for DDR4 interfaces. Previously, mid-range FPGAs built off of high-end architectures were the only choice. The issue with these devices is that they are costly, consume high power, and come in very large form factors. One should look to new cost-optimized mid-range FPGAs to offer the required DDR4 performance in smaller packages and at lower cost to meet the new demands on industrial designs. Below are a few cost-optimized mid-range FPGA devices offered in smaller package sizes that support DDR4 interfaces.<br />

D. Conclusion<br />

With the growing demands of higher performance<br />

interfaces, more connectivity and lower costs for industrial<br />

designs, system architects and engineers need to look for new<br />

solutions. Today’s cost optimized, mid-range density FPGAs<br />

solve these design challenges. These devices offer great value and lower power consumption while still providing the capabilities<br />

demanded by modern industrial designs.<br />



Lucky Seven<br />

Taking Advantage of COM Express Type 7<br />

Ansgar Hein<br />

Marketing<br />

iesy GmbH & Co. KG<br />

Meinerzhagen, Germany<br />

sales@iesy.com<br />

Abstract—This talk delivers facts and features of the latest<br />

PICMG development as well as give insights on custom solutions<br />

based on this Server on Module standard.<br />

Keywords—COM Express; Type 7; PICMG; Server on Module;<br />

custom; solution; speed; memory; connectivity; flexibility;<br />

virtualization; size; power; performance; 10GbE; customization;<br />

Micro-Server; ATX; 19”; QuadServer<br />

I. INTRODUCTION<br />

COM Express Type 7 is a brand-new standard introduced by PICMG. However, Type 7 is not a replacement for the existing and well-established Type 6 pin-out. In place of all audio and video interfaces of Type 6, it offers four 10GbE ports and a total of 32 PCI Express lanes in order to support high computing performance and high-speed communication while reducing power consumption at the same time.<br />

II. DIFFERENCE BETWEEN TYPE 6 AND TYPE 7<br />

There are several reasons why Type 7 was introduced. One<br />

is the availability of server-class SOC processors. Another is<br />

the support of 10 Gigabit Ethernet and NC-SI signals, as well as the definition of a larger number of PCIe lanes for high-speed data transfer. The differences are as follows:<br />

A. Added in Type 7<br />

4 x 10GBaseKR Ethernet<br />

NC-SI<br />

32 x PCI express lanes<br />

2 x SATA<br />

4 x USB 3.0 / 2.0<br />

B. Removed in comparison to Type 6<br />

DDI [0:2]<br />

SATA [2:3]<br />

AC97 / HDA Audio<br />

VGA<br />

LVDS/eDP<br />

USB 2.0 [4:7]<br />

III. ADVANTAGE NO. 1: SPEED<br />

When it comes to speed, COM Express Type 7 is unparalleled in the market for embedded computing modules. First, because of its 32 PCI Express 3.0 lanes in place of the 16 PCI Express lanes of COM Express Type 6. Compared to Type 6 modules, Type 7 modules offer a 40-fold increase in bandwidth when it comes to network connectivity. Second, because of its support for the M.2 socket and thus a wide range of expansions, ranging from storage to connectivity. Third, the COM Express Type 7 standard comes with server-grade processors, making it a true server-on-module approach with Intel® Xeon® class processors. A further plus is its headless design which, in combination with a Baseboard Management Controller, makes COM Express Type 7 modules the perfect match for any server application you can think of.<br />

IV. ADVANTAGE NO. 2: STORAGE<br />

Looking at the difference between Type 6 and Type 7, one might wonder why and how the removal of two SATA ports can lead to an advantage for storage, since server applications always have a high demand for storage capabilities. The use of fast SSDs in place of SATA disks makes the SATA interface a bottleneck. This is where NVMe (non-volatile memory express) comes in: a new specification for connecting mass storage via PCI Express, and Type 7 supports this development through its increased number of PCI Express lanes.<br />
<br />
M.2 NVMe is probably the most advanced specification for internal extension cards currently available for embedded systems. COM Express Type 7 now makes full use of this form factor, for example to connect SSDs at a higher speed compared to mSATA. Since NVMe reduces I/O overhead and latencies (see Table I), a paradigm change is about to take place in mass storage solutions for server environments, leading to an increased use of M.2 SSDs. Further to this, the slim design of M.2 allows for smaller footprints in storage solutions. However, one downside at present is the limited capacity of max. 8 TB for M.2 storage modules and a higher price compared to mSATA solutions, but both will change for the better in the near future. Talking about the future: in 2016, Intel announced Optane SSD products based on the brand-new 3D XPoint technology, offering a thousand times more performance and durability than NAND flash technology. NVMe already supports Optane.<br />



TABLE I. COMPARISON OF SATA AND NVME<br />
<br />
                                | SATA                     | NVMe<br />
Maximum Queue Length            | 1 queue, 32 commands     | 65535 queues, 65536 commands per queue<br />
Interrupts                      | 1 single interrupt       | 2048 MSI-X<br />
Parallelism and Multi-threading | needs synchronization locking for execution | no locking<br />

V. ADVANTAGE NO. 3: CONNECTIVITY<br />

The COM Express Type 7 standard has been released<br />

because of the bandwidth bottleneck in connected devices that<br />

need to interact in (industrial) applications, where many<br />

devices exchange massive data streams and need to be<br />

synchronized in real time (i.e. IoT, telemedicine). Further to<br />

this the new PICMG standard provides for the addition of up to<br />

four 10 Gigabit Ethernet (10GbE) interfaces on the baseboard.<br />

An increased number of PCI Express lanes (32 instead of 16)<br />

provides a wealth of connectivity and interface options.<br />

While all 10-GbE interfaces on the module are defined as<br />

10GBASE-KR single backplane lanes to keep them from being<br />

bound to predefined physical interfaces, the PHY is not places<br />

on the module itself, but on the baseboard. This allows for even<br />

greater flexibility, since the interfaces can be implemented as<br />

interchangeable SFP+ modules. Further to this it is also<br />

possible to combine the performance of several 10-GbE signals<br />

into in a PHY for 40GBASE-KR4 for example.<br />

VI. ADVANTAGE NO. 4: FLEXIBILITY<br />

Talking about flexibility, there are many use-cases for<br />

COM Express Type 7 across various markets because of its<br />

versatile high-speed approach. Especially in Industry 4.0<br />

environments there is a huge need for server-like appliances<br />

with high-speed connections, fast memory and computing<br />

performance. Due to its compact size, COM Express Type 7<br />

allows for small-scale housing. At iesy we have developed four<br />

different types of baseboards as platforms for customization:<br />

A. embedded 5x5 – for micro-servers on the shopfloor<br />

Micro-Server based on the Mini STX form-factor (formerly<br />

known as Intel 5x5), making it suitable for standard housings:<br />

1 × COM Express Type 7 basic module<br />

Intel® Atom C3xxx with up to 16 cores<br />

up to 4 × SFP+ for 4 × 10 GBit Ethernet<br />

2 × 1000/100/10 MBit Ethernet<br />

3 × DDR4 SO-DIMM socket with up to 48 GB<br />

2 × USB 3.0<br />

2 × M.2 M-Key slot (1× PCIe x4, 1× SATA multiplex)<br />

1 × M.2 A-Key slot (2× PCIe x1, 1× USB 2.0, I²C)<br />

1 × Baseboard Management Controller (BMC)<br />

Dimensions: 140mm × 147mm × 55mm (incl. cooling<br />

solution)<br />

B. Flex ATX – for legacy form-factors<br />

Makes use of the advantages of the popular Flex ATX<br />

form-factor and can thus easily be implemented in existing<br />

infrastructures and Flex ATX housings:<br />


1 × COM Express Type 7 basic module<br />

Intel® Atom & Intel® Xeon® with up to 16 cores<br />

up to 6 × SFP+ for 6 × 10 GBit Ethernet<br />

2 × 1000/100/10 MBit Ethernet<br />

2 × USB 3.0<br />


2 × M.2 M-Key slot, i.e. for NVMe SSD<br />

1 × RS232<br />

Dimensions: 229mm × 191mm × 46mm<br />

C. Basic Size – for minimal footprint<br />

Sized exactly to meet the dimensions of a COM Express<br />

basic module, this minimalistic approach delivers the most<br />

compact COM Express Type 7 experience possible while at the<br />

same time providing a maximum of high-speed interfaces:<br />


1 × COM Express Type 7 basic module<br />

Intel® Atom & Intel® Xeon® with up to 16 cores<br />

4 × SFP+ for 4 × 10 GBit Ethernet<br />

2 × 1000/100/10 MBit Ethernet<br />

2 × USB 3.0<br />


1 × M.2 M-Key slot, i.e. for NVMe SSD<br />

1 × mini PCIe slot<br />

Dimensions: 125mm × 90mm × 58mm<br />

D. 19” QuadServer – for datacenters on the shopfloor<br />

Designed with industrial applications and datacenters in<br />

mind, but without the need for datacenter cooling solutions,<br />

this high-end rackmount server appliance offers extensive<br />

processing power as well as switching and storage capabilities:<br />


4 × COM Express Type 7 basic module<br />

Intel® Xeon® with up to 64 cores per HU<br />

16 × 10 GBit Ethernet (thereof 4 × external)<br />

8 × M.2 M-Key slot, for NVMe SSD (up to 64 TB)<br />

8 × USB 3.0<br />


1 × integrated Switch with 48 hosts & 48 clients and<br />

up to 128 GBit max. transfer rate<br />

4 × boot devices via SSD<br />

Dimensions: 483mm × 44mm × 530mm (incl. fans)<br />



Fig. 1. embedded 5x5 baseboard with cooling solution / © www.iesy.com<br />

Fig. 2. Flex ATX baseboard with BMC and 6 × 10 GbE / © www.iesy.com<br />

Fig. 3. Basic Size with 95mm × 125mm footprint / © www.iesy.com<br />

Fig. 4. QuadServer with 4 × COM Express Type 7 / © www.iesy.com<br />

VII. ADVANTAGE NO. 5: VIRTUALIZATION<br />

A few years ago, there were only few use-cases for<br />

virtualization in server-environments. Today there are several<br />

trends, for example Software-Defined Networking and<br />

Network Functions Virtualization (SDN/NFV) in carrier-grade<br />

business, or the demand to separate real-time systems in<br />

industrial applications from IoT connectivity. These two trends<br />

lead to an increasing demand in virtualized server technologies.<br />

Most vendors of COM Express Type 7 modules support<br />

virtualization technologies, like the RTS hypervisor, which is well-accepted<br />

in industrial and medical real-time applications.<br />

Especially Industry 4.0 applications require redundant edge<br />

or fog servers right on the shop-floor, consisting of dedicated<br />

infrastructure components, such as firewalls, routers, load-balancers<br />

and storage servers. All of these can be virtualized<br />

with software-based solutions while all configurations are<br />

interconnected through redundant 10 GbE interfaces. The<br />

virtualized environments are hardware independent and allow<br />

for the development of multi-tenant nodes for faster<br />

implementation of heterogeneous machines, systems or sensor<br />

networks. This results in more agile, flexible and scalable<br />

installations which are well-suited to meet the requirements of<br />

all kinds of Industry 4.0, M2M and IoT services.<br />

Extensive remote monitoring and management systems are<br />

required when using virtualization at the above-mentioned<br />

level. Using a Baseboard Management Controller (BMC) helps<br />

you with out-of-band management tasks, such as rebooting a<br />

system, mounting virtual media, accessing the console from<br />

remote, managing firmware or tracking physical conditions as<br />

well as checking event logs. COM Express Type 7 modules do<br />

not have a BMC onboard and not all iesy baseboards are<br />

equipped with one. However, COM Express Type 7 supports<br />

the Network Controller Sideband Interface (NC-SI) and thus<br />

offers the possibility to run OpenBMC which allows for a wide<br />

range of system administration features and helps in remotely<br />

managing (virtualized) servers at large scale.<br />
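As an illustration of such out-of-band management, the sketch below (Python) flags sensors approaching their critical thresholds. The field names follow the Redfish-style schema that OpenBMC exposes over REST, but the sensor names and values here are invented for the sketch; a real client would fetch the JSON from the BMC instead of using a canned payload.<br />

```python
import json

# Illustrative Redfish-style thermal payload; sensor names and values
# are invented for this sketch, not read from a real BMC.
SAMPLE_THERMAL = json.dumps({
    "Temperatures": [
        {"Name": "CPU Temp", "ReadingCelsius": 62, "UpperThresholdCritical": 95},
        {"Name": "Inlet Temp", "ReadingCelsius": 34, "UpperThresholdCritical": 55},
    ]
})

def critical_sensors(thermal_json, margin=10):
    """Return names of sensors within `margin` degrees of their critical limit."""
    data = json.loads(thermal_json)
    flagged = []
    for t in data.get("Temperatures", []):
        reading = t.get("ReadingCelsius")
        limit = t.get("UpperThresholdCritical")
        if reading is not None and limit is not None and limit - reading <= margin:
            flagged.append(t["Name"])
    return flagged
```

With the sample payload above, no sensor is within the default 10-degree margin; widening the margin flags them one by one.<br />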

VIII. ADVANTAGE NO. 6: SIZE<br />

Size matters. The modular approach within the COM<br />

Express specification strikes a careful balance between cost<br />

and performance and results in a variety of COM Express form<br />

factors and board sizes defined in the standard. Right now there<br />

are seven different versions that rely on a set of commonly<br />

defined connectors and mounting holes as well as common<br />

signaling where appropriate. Though currently only available<br />

in basic size (95mm × 125mm), COM Express Type 7 modules<br />

deliver a server-on-module approach at high-density level,<br />

allowing for compact casing or mounting of several modules<br />

into one 19” 1U system. As shown in Section VI, iesy already<br />

provides several blueprints ranging from legacy to microservers.<br />

Heat dissipation becomes critical if you want or need<br />

to harvest the full potential of existing COM Express Type 7<br />

modules with high-performance processors using up to 65W<br />

TDP. However, there is a wide range of efficient cooling<br />

solutions along with extended temperature range modules that<br />

help conquer this embedded-systems challenge.<br />

A server-like application at the size of the COM Express<br />

Type 7 basic module – even with BMC – becomes possible.<br />



This makes the new PICMG standard a versatile form factor<br />

for leveraging Industry 4.0 applications and bringing<br />

datacenters to the shop floor or empowering innovative high-bandwidth<br />

applications in other fields, such as healthcare or<br />

autonomous driving.<br />

IX. ADVANTAGE NO. 7: POWER<br />

Compared to off-the-shelf servers, COM Express Type 7<br />

powered solutions require much less power. M.2 SSD storage,<br />

low-power CPUs and less heat dissipation are the main reasons<br />

for this consistent low-power approach with server-grade<br />

processors and a thermal profile below 65 W TDP. While<br />

several application scenarios require low power, the ever-rising<br />

price of energy and 24/7 operation of server applications have<br />

an impact on future product pricing as well as sustainability.<br />
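To put the 65 W figure into perspective, a rough back-of-the-envelope calculation (Python) compares annual energy costs for continuous operation. The electricity price of 0.30 per kWh and the 250 W figure for an off-the-shelf server are assumptions for illustration, not figures from this paper.<br />

```python
def annual_energy_cost(avg_power_w, price_per_kwh=0.30, hours_per_year=24 * 365):
    """Annual energy cost for a device drawing avg_power_w continuously."""
    kwh = avg_power_w / 1000.0 * hours_per_year
    return kwh * price_per_kwh

# Assumed comparison: 65 W COM Express Type 7 module vs. a 250 W rack server.
module_cost = annual_energy_cost(65)    # ~170.82 per year at the assumed price
server_cost = annual_energy_cost(250)   # ~657.00 per year
savings = server_cost - module_cost
```

Even under these rough assumptions, each replaced server saves several hundred currency units per year, which compounds quickly across a shop floor running 24/7.<br />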

X. CONCLUSION<br />

COM Express Type 7 is the first true server-on-module<br />

approach, and not to be confused with Type 6. Its<br />

headless design along with the seven advantages highlighted<br />

above makes it the ultimate choice for Small Form Factor<br />

(SFF) applications. Further to this it is cost-efficient, easy to<br />

upgrade as newer COM Express modules are developed and<br />

suitable for a wide range of applications, from commercial to<br />

rugged environments.<br />

ACKNOWLEDGMENT<br />

iesy would like to thank its partners congatec and Kontron for their continued<br />

support in developing customized solutions based on COM Express Type 7<br />

solutions as well as in delivering facts and figures and valuable feedback for<br />

this talk.<br />



PCB Design Problems and Solutions for Embedded Supercomputing<br />

Dr. Andreas Döring<br />

IBM Research Laboratory<br />

Rüschlikon, Switzerland<br />

Rainer Asfalg<br />

Altium Europe GmbH<br />

Global Head of Technical Sales & Support<br />

Munich, Germany<br />

rainer.asfalg@altium.com<br />

Abstract—The DOME microserver achieves a high density for<br />

highly-efficient computing by using commodity components.<br />

Modularity allows adaptation to a specific environment. Water<br />

cooling allows ruggedized packaging. A new IO card is presented<br />

that allows the direct attachment of sensors and actuators.<br />


I. INTRODUCTION<br />

Improvements in performance and energy efficiency in<br />

computing nowadays result only to a small degree from CMOS technology<br />

advances, if at all. Hence, architectural contributions<br />

such as higher degrees of parallelism and the integration of<br />

specialized accelerators are used to meet the growing demands in<br />

supercomputing and high-end embedded systems. Building a<br />

system from commercial components makes use of a<br />

development ecosystem, provides shorter economic upgrade<br />

cycles, and offers a wider heterogeneity. The DOME<br />

microserver was developed in a cooperation of IBM Research<br />

Zurich, IBM Netherlands, and ASTRON as technology<br />

preparation of the exascale computing requirements of the<br />

Square Kilometer Array (SKA) radiotelescope. While the<br />

project focus lay on a large supercomputer installation, the<br />

properties fit the high-end embedded space very well. The main<br />

features are modularity, low volume/high density, water<br />

cooling (allowing closed packaging), cost and energy<br />

efficiency, and system management including temperature and<br />

power supervision.<br />

Important enablers for these characteristics are the<br />

printed circuit boards (PCBs) employed for the modules and the<br />

backplane. While the use of HDI and advanced materials such as<br />

Megtron 4 by Panasonic pushes the boards to the edge of conventional<br />

technology, they can still be produced by several suppliers in<br />

reasonable time.<br />

II. SYSTEM OVERVIEW<br />

A passive backplane provides slots for four different module<br />

types: compute/storage, power conversion, network switch, and<br />

USB hub. Furthermore, the backplane carries the SFP and<br />

SFP+ cages/connectors for the external network. For each slot a<br />

different connector is used: 3M SPD08 for compute and hub<br />

slots, Molex Impact for the switch, and Molex EXTreme<br />

Poweredge for the power converter. While the main rail for<br />

power distribution at 12 V is part of the cooling system, the<br />

backplane connects two regulated power converter slots with 7<br />

different rails for DRAM supply and I/O voltages. Using shared<br />

power converters for the minor regulated supply voltages of<br />

modern SoCs is more cost-, volume-, and energy-efficient than<br />

using dedicated converters on each module. The compute<br />

modules are cooled by a passive copper plate with an integrated<br />

heat pipe. The pitch of 7.6 mm from one compute node to the next<br />

is partitioned into 0.8 mm for backside components, 1.8 mm<br />

PCB thickness, 2 mm heat spreader, and 3 mm for the top-side<br />

components. For higher components, such as BGA packages or<br />

inductors, recesses or cut-outs are milled into the heat spreader.<br />

For the main SoC the package cap is removed such that the<br />

silicon die is directly cooled from its backside.<br />

The connectors for the switch and the compute modules allow<br />

data rates beyond 10 Gbps on a differential pair, which allows<br />

operation of the switch with 64 ports and the modules with up<br />

to 6 ports of 10 Gbit Ethernet (Base10G), in addition to PCI<br />

Express 2.0 and Serial ATA 2. Further modules of the system<br />

support mSATA solid state disks and M.2 form factor flash memory.<br />

A. Compute Modules<br />

Currently, three compute modules have been developed, based<br />

on the NXP QorIQ processor T4240, on the NXP QorIQ<br />

processor LS2088, and on a Xilinx Kintex UltraScale FPGA,<br />

respectively. These cards measure 139 × 62.5 mm and have power<br />

converters for the core voltage and switches for all other supply<br />

rails. They are managed over USB by a Cypress PSoC 5LP<br />

controller. The main features are summarized in Table 1.<br />
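The 7.6 mm node pitch is simply the sum of the four stack-up contributions given for each compute node. The short Python check below verifies that partitioning and the node count it permits; the 400 mm of usable mounting width is an invented example value, not a figure from the paper.<br />

```python
# Per-node stack-up from the text, in millimetres.
STACKUP_MM = {
    "backside components": 0.8,
    "PCB thickness": 1.8,
    "heat spreader": 2.0,
    "top-side components": 3.0,
}

def node_pitch_mm(stackup=STACKUP_MM):
    """The card-to-card pitch is the sum of the stack-up layers."""
    return sum(stackup.values())

def max_nodes(available_width_mm, pitch_mm):
    """How many compute nodes fit side by side in a given width."""
    return int(available_width_mm // pitch_mm)
```

The tight pitch is what makes the density argument work: at 7.6 mm per node, dozens of water-cooled nodes fit in a width where conventional air-cooled boards would fit only a handful.<br />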


Table 1: Compute nodes<br />

Node: NXP T4240 (28 nm bulk, 43 W)<br />
ISA/Logic: PPC64, 24 e6500 cores @ 1.8 GHz<br />
Memory: 24 GB, 3 channels DDR3L, 72 bit ECC<br />
I/O: 4 × 10 GbE, PCIe x8, 2 × SATA, USB, SDHC<br />

Node: NXP LS2088 (28 nm bulk, 35 W)<br />
ISA/Logic: ARMv8, 8 A72 cores @ 2 GHz<br />
Memory: 32 GB, 2 channels DDR4, 72 bit ECC<br />
I/O: 6 × 10 GbE, PCIe x4/x2/x1/x1, 2 × SATA<br />

Node: Xilinx Kintex UltraScale (20 nm FinFET)<br />
ISA/Logic: 726K logic cells, 2760 DSP slices, 38 Mb RAM<br />
Memory: 16 GB, 2 channels DDR4, 72 bit ECC<br />
I/O: 32 transceivers, GbE/SATA/PCIe as needed<br />

T4240 compute module<br />

B. Power Converter<br />

The Power Converter Module generates regulated supply<br />

voltages for DRAM and I/O. The output voltages are<br />

programmable such that several DRAM types and IO standards<br />

can be used in the compute modules. Furthermore, a fixed<br />

voltage rail is needed for internal purposes, but extra current is<br />

fed to the backplane as well. The design is optimized for<br />

maximum power per backplane area at the given width/height.<br />

The main limiters are the passive components (power inductors<br />

and capacitors) and the backplane connector.<br />

C. Ethernet Switch<br />

In order to connect a high number of modules with each other<br />

and the outside, a 64-port Ethernet switch supporting 10 Gbps<br />

and 40 Gbps is integrated in the same form factor (139 mm ×<br />

55 mm) but with a greater depth due to the high-speed, high-pinout<br />

connector and the cooling requirements (195 W TDP).<br />

Two stacked PCBs integrate the Intel FM6364 ASIC, the<br />

connector, the core power converters, clock generators, and<br />

configuration memory. The combination of 128 differential<br />

pairs at 10 Gbps and the high supply currents (120 A on one of<br />

the rails) results in a 3.6 mm thick 28-layer PCB. Since the same<br />

ASIC is used in conventional datacenter switches, compatibility<br />

with many standard network protocols can be achieved. The<br />

management of the switch is implemented in a T4240 compute<br />

module.<br />

III. INDUSTRY INPUT/OUTPUT<br />

The combination of features of the microserver triggered early<br />

interest in its application in high-end embedded systems, for<br />

vehicles or image processing in production, for instance. In fact,<br />

the liquid cooling allows the use of tight enclosures for dusty or<br />

otherwise dirty environments. Furthermore, the microserver has<br />

no moving parts – except the pump for the cooling liquid, but<br />

robust pumps are widespread. In addition, the modular design<br />

allows a customized heterogeneous design tuned for a particular<br />

application. However, the datacenter-oriented design of the<br />

microserver provides 10 Gb Ethernet as its primary interface,<br />

which does not match most industry applications. Therefore, a<br />

modified version of the FPGA module was developed, which<br />

can be equipped with a daughter card. This daughter card<br />

supports the physical interfaces needed in industrial and<br />

embedded applications, including the Internet of Things,<br />

Services and People (Table 2).<br />

Table 2: Interfaces of Industry-IO card<br />

USB 2.0 host: 2<br />
Optocoupler in: 4<br />
Optocoupler out: 4<br />
LVDS: 7 pairs<br />
CAN: 2<br />
Output level shift: 18<br />
Input level shift: 12<br />
Isolated USB: 1<br />
MIPI PHY: 1+1<br />
Serial (RS232, RS485, etc.): 2-4<br />

IV. SUMMARY<br />

The DOME microserver design targets a balance between<br />

commercially available components and production processes<br />

on one side and aggressive density and performance on the<br />

other. As a research project, risk and design cost had to<br />

be considered as well; for a commercial product additional<br />

features could be added to the modules by embedding passive<br />

components (e.g. decoupling capacitors) into the PCB or using<br />

chip-on-board technology for selected components.<br />

QorIQ and Layerscape are trademarks of NXP, PSoC is a<br />

trademark of Cypress Semiconductor. EXTreme Poweredge<br />

and Impact are trademarks of Molex.<br />



Camera Standards for Embedded Vision Systems<br />

Dr. Fritz Dierks<br />

Basler AG<br />

Ahrensburg, Germany<br />

friedrich.dierks@baslerweb.com<br />

Abstract—The market for PC-based machine vision has bred<br />

a vibrant ecosystem of camera, frame grabber, and image<br />

processing library vendors whose products work quite seamlessly<br />

together. The industrial embedded vision market in contrast is<br />

still in need of such an ecosystem which would allow customers to<br />

pick suitable camera modules for their embedded system from<br />

multiple vendors especially for low and medium unit volumes<br />

and combine them equally seamlessly with software libraries of<br />

other parties. This article describes how the co-evolution of<br />

interface standards and component vendors’ portfolios has made<br />

the PC-based camera ecosystem possible and explains where the<br />

difficulties in the embedded vision market come from and how to<br />

overcome them.<br />

Keywords—camera; sensor module; embedded system;<br />

interface standard; standardization; ecosystem<br />

I. INTRODUCTION<br />

Interface standards make products from different vendors<br />

work seamlessly together. Thus by looking at the history of<br />

interface standards one can see how the ecosystem (or food<br />

chain) of a certain market has evolved over time and<br />

understand the underlying market mechanisms. The market for<br />

digital machine vision cameras is approximately 20 years old<br />

and quite mature already while the market for embedded<br />

camera modules for industrial purposes is still in its early<br />

stages and its food chain is not yet fully established.<br />

This article first looks back at how things evolved in the<br />

digital machine vision camera market in order to explain the<br />

underlying market mechanisms. Then it shows how these<br />

mechanisms apply to the embedded camera module market,<br />

sketches how a camera ecosystem for this market could look,<br />

and derives what that would mean for the corresponding<br />

interface standards.<br />

II. THE MACHINE VISION CAMERA MARKET<br />

A. Structure of Camera Interface Standards<br />

For connecting a camera to a PC in a plug&play manner<br />

two layers need to be standardized:<br />

The transport layer deals with enumerating devices,<br />

accessing the camera’s low-level registers, retrieving<br />

stream data from the device, and delivering events. The<br />

transport layer is governed by the hardware interface.<br />


Depending on the interface type, the transport layer for<br />

PC-based systems requires a dedicated frame grabber<br />

(e.g. Camera Link, Camera Link HS, CoaXPress) or a<br />

bus adapter (e.g. IEEE 1394, GigE Vision, USB3<br />

Vision). For embedded systems the typical hardware<br />

interface is MIPI CSI-2 which is normally already<br />

integrated into the embedded processor.<br />

The feature layer describes the functional properties<br />

of a camera such as Exposure Time or Gain as well as<br />

the format of the video, chunk, and event data. Two<br />

major approaches exist for standardization: one can<br />

describe a fixed register layout (e.g. IIDC for machine<br />

vision cameras, or CCS for embedded systems<br />

respectively) or provide a machine readable selfdescription<br />

of the cameras’ features (see the GenICam<br />

approach which works for both worlds).<br />

A transport layer description should be agnostic to camera<br />

functionality while a feature layer description should be valid<br />

for different transport layers in order to maximize reusability.<br />

In most cases, the hardware interfaces on which industrial camera<br />

transport layers are based originate from the consumer market.<br />

Examples are GigE Vision based on Gigabit Ethernet, USB3<br />

Vision based on USB 3.0, or MIPI CSI-2 originating from the<br />

mobile phone market. Every time a new suitable interface is<br />

introduced in the consumer market the industrial camera<br />

market tends to adopt it, which is why new<br />

interface standards constantly have to be created. The following<br />

sections describe some important steps in the evolution of<br />

camera standards.<br />

Benefits of Standards<br />

For customers:<br />
Being assured of “betting on the right horse”<br />
Getting cheaper and better products due to competition in the market<br />

For vendors:<br />
Faster market growth, since standards attract customers<br />
Re-using know-how and IP<br />
Looking strong in the eyes of customers when visibly spending effort on standardization (“handicap principle”)<br />

B. Transport Layer without Standardized Feature Layer<br />

The first digital cameras required dedicated frame grabbers<br />

resulting in a market food chain of camera vendors and frame<br />

grabber vendors (Fig. 1). A typical transport layer standard<br />

from this time is Camera Link [1] which provides video data<br />

transfer via LVDS and camera configuration via a serial port.<br />

Camera Link has no standardized feature layer so camera and<br />

frame grabber vendors have to team up and manually create<br />

configuration files to make the connections work.<br />

The customer has to design his system using the frame<br />

grabber vendor’s proprietary SDK which makes changing<br />

frame grabbers hard to do. Many frame grabber vendors also<br />

developed image processing libraries which were closely tied<br />

to the vendors’ respective SDKs.<br />

Fig. 1<br />
Camera Link – no standardized feature layer<br />
green: camera vendor, blue: frame grabber vendor<br />

C. Fixed Register Layout (IIDC)<br />

When Apple started promoting the IEEE 1394 bus<br />

interface (aka FireWire) [2], it was soon adopted as a transport<br />

layer by the machine vision industry because it allowed<br />

building systems without expensive frame grabbers,<br />

using cheap bus adapter cards instead (Fig. 2 top).<br />

Since the adapter card’s hardware interface was standardized<br />

and supported by Windows as well as Linux, developing one<br />

driver for each operating system was sufficient to support a<br />

wide range of systems. The corresponding drivers were<br />

provided by the camera vendors for free, which changed the<br />

food chain of the market considerably, since the frame grabber<br />

vendors no longer dominated the SDK and thus the<br />

programming interface to the customer.<br />

The frame grabber vendors, however, still had software<br />

libraries to sell, so they created their own drivers to stay in<br />

business (Fig. 2 bottom). This was possible since the feature<br />

layer for IEEE 1394 cameras had been standardized by<br />

defining a fixed register layout called IIDC [3], which was very<br />

much inspired by a design from Sony, who created one of the<br />

first industrial 1394 cameras at that time.<br />

Fig. 2<br />
IEEE 1394 – fixed register layout (IIDC)<br />
green: camera vendor, blue: frame grabber vendor, grey: consumer market<br />

The key problem with a fixed register layout, however, is<br />

that supporting custom features is very tricky, especially when<br />

the customer is using a 3rd-party imaging library. While for the<br />

camera vendor adding a custom feature to their products often<br />

is a good business case, imaging library vendors in many cases<br />

don’t make enough money on their SW licenses to justify<br />

supporting those custom features through their SDKs, which<br />

makes the overall business case difficult to realize.<br />

GenICam Standard<br />
Established in 2006<br />
>180 member companies<br />
Hosted by the European Machine Vision Association (EMVA)<br />
Free membership<br />
Foundation of all modern machine vision camera standards such as GigE Vision, USB3 Vision, CoaXPress, Camera Link HS, Camera Link 3.0<br />
Maintains an open source reference implementation<br />
Two international meetings per year, taking place in turn in Asia, Europe, and America<br />

D. Self-Describing Camera Features (GenICam)<br />

When Intel announced it would add Gigabit Ethernet support to<br />

their chipsets, the industry teamed up in order to standardize<br />

this new interface for machine vision cameras. The transport<br />

layer was named GigE Vision [4] and defined on top of UDP<br />

packets. While it was quite easy to agree on the transport layer<br />

standard it was next to impossible to agree on a fixed register<br />

layout though the group tried for over one year: too many<br />

companies had developed their own proprietary custom<br />

features they wanted to see in the standard and discussing all<br />

that with full implementation details simply took too long. In<br />

addition the companies forming the standard committee were<br />

of similar strength so no company was strong enough to<br />

impose its solution on the others.<br />

In order to overcome these problems the GenICam<br />

standard [5] was created as unified feature layer standard for<br />

GigE Vision and all future transport layers in the machine<br />

vision industry (Fig. 3). The key idea is to describe the features<br />

of a camera in an abstract way and provide an XML feature<br />

description file defining how to map the features to the<br />

camera’s control registers. In GenICam each feature is defined<br />

by a Name, a Type, and a Meaning. The feature describing the<br />

amplification of a camera’s video signal (the meaning) is for<br />

example named ‘Gain’ and is of type Float. Associated with each<br />

type is a software interface that allows querying the<br />

implementation details of the features, such as minimum,<br />

maximum etc. The most important types supported are integer,<br />

float, enumeration, string, command, and bool. GenICam<br />

exposes the features of a camera through a feature tree so the<br />

camera is fully self-describing.<br />

Fig. 3<br />

GigE Vision – self describing camera (GenICam)<br />

green: camera vendor, grey: consumer market<br />

The GenICam standard has several modules: The GenApi<br />

module defines the syntax of the XML description language<br />

while the SFNC (= standard feature naming convention)<br />

defines a list of over 600 camera features each by Name, Type,<br />

and Meaning. This architecture overcomes the problems of the<br />

fixed register layout standards.<br />


Since the standard does not define the implementation<br />

details of each feature, agreeing on the feature layer<br />

became quite easy. In addition this scheme opened up<br />

room for competition based on different quality of<br />

implementation.<br />

The XML description language is the same for<br />

standard and custom features; the difference is just a<br />

flag. This allowed the camera vendors to deliver any<br />

kind of custom features to their customers even if these<br />

were using 3rd-party image processing libraries. In<br />

addition it made growing the standard feature list easy:<br />

typically a vendor would implement something new as<br />

custom feature and then submit it for standardization<br />

having already provided a proof of concept.<br />
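The Name/Type/Meaning idea can be illustrated with a toy description file. The sketch below (Python) parses two features the way a GenICam consumer conceptually does; note that the XML schema here is heavily simplified and invented for illustration only — real GenApi description files are far richer, with register mappings, converters, and inter-node dependencies.<br />

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified description file -- NOT the real GenApi schema.
FEATURE_XML = """
<RegisterDescription>
  <Float Name="Gain">
    <Min>0.0</Min>
    <Max>24.0</Max>
    <Address>0x0814</Address>
  </Float>
  <Enumeration Name="PixelFormat">
    <EnumEntry Name="Mono8"/>
    <EnumEntry Name="RGB8"/>
    <Address>0x0820</Address>
  </Enumeration>
</RegisterDescription>
"""

def load_features(xml_text):
    """Build a Name -> {type, limits, register address} map from the XML."""
    root = ET.fromstring(xml_text)
    features = {}
    for node in root:
        info = {"type": node.tag, "address": node.findtext("Address")}
        if node.tag == "Float":
            info["min"] = float(node.findtext("Min"))
            info["max"] = float(node.findtext("Max"))
        elif node.tag == "Enumeration":
            info["entries"] = [e.get("Name") for e in node.findall("EnumEntry")]
        features[node.get("Name")] = info
    return features
```

A consumer only ever deals with the abstract feature (‘Gain’, Float, 0.0–24.0); the register address is an implementation detail the description file hides, which is exactly what makes custom features transparent to 3rd-party software.<br />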

The on-the-fly interpretation of the XML file is done using<br />

an open source reference implementation which is maintained<br />

by the standard committee and available for many operating<br />

systems including Linux on ARM which makes it applicable<br />

also for embedded systems. The code has been optimized for<br />

performance and quality and is used as the core of most SDKs<br />

in the industry. As a result different cameras behave very<br />

consistently in different software environments since camera<br />

vendors can use the reference implementation for testing their<br />

products.<br />

It turned out that vendors kept their products’ proprietary<br />

SDKs and used GenICam under-the-hood only. This is partly<br />

for historical reasons and partly because the SDK is an<br />

important tool for differentiation in the market. A corollary<br />

from that observation is that having a standardized SDK is not<br />

crucial for the forming of an ecosystem. This is an important<br />

lesson for the embedded systems market.<br />

GigE Vision and GenICam were a huge success so it was<br />

decided to build all new transport layers on top of GenICam in<br />

order to re-use as much existing IP as possible. Newer<br />

GenICam based standards are USB3 Vision [6], CoaXPress<br />

[7] and Camera Link HS [8]. Even existing standards like<br />

IEEE 1394 [2] defined with a fixed register layout are covered<br />

by GenICam through creating a corresponding XML file, and<br />

also Camera Link [1] is about to create a new release, v3.0,<br />

which will make GenApi support mandatory, finally making it<br />

support plug&play.<br />

E. Standardized Transport Layer API (GenTL)<br />

It turned out that standardizing the feature layer was not<br />

enough in the long run. For image processing library vendors it<br />

became more and more unattractive to create transport layer<br />

drivers and frame grabbers for the different interface<br />

technologies. Therefore a transport layer API was standardized<br />

within the GenICam standard named GenTL (Fig. 4). It<br />

defines an abstract C interface which allows using transport<br />

layer drivers from different vendors in a generic way.<br />

CoaXPress was the first transport layer standard which made<br />

the support for GenTL mandatory and several large image<br />

processing library vendors who have no history in building<br />

frame grabbers rely now on GenTL to access cameras which<br />

made most camera vendors support it. As a result a customer<br />

can now freely combine cameras, frame grabbers, drivers, and<br />

image processing libraries of different vendors even in mixed<br />

interface systems.<br />
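As a small illustration of how a GenTL consumer gets started, the sketch below (Python) performs only the discovery step: GenTL producers are shipped as shared libraries with a .cti suffix and are located via the GENICAM_GENTL64_PATH environment variable. Actually loading a producer and calling its abstract C interface (e.g. via ctypes) is omitted here.<br />

```python
import os

def find_gentl_producers(search_path=None):
    """Scan the GenTL producer search path for *.cti libraries.

    GenTL producers are shared libraries with a .cti suffix; consumers
    locate them via the GENICAM_GENTL64_PATH environment variable,
    whose entries are separated like PATH.
    """
    if search_path is None:
        search_path = os.environ.get("GENICAM_GENTL64_PATH", "")
    producers = []
    for directory in search_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            if name.lower().endswith(".cti"):
                producers.append(os.path.join(directory, name))
    return producers
```

This discovery convention is what lets an image processing library pick up transport layer drivers from any vendor without being compiled against them.<br />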

Fig. 4<br />

CoaXPress – standardized transport layer API (GenTL)<br />

green: camera vendor, blue: frame grabber vendor, grey: image<br />

processing library vendor<br />

The development sketched in this section – the forming of an ecosystem and the creation of multiple GenICam based standards – is one of the key reasons why the machine vision market is highly competitive and offers so many different camera products. This development is very much in the interest of customers, but also of vendors, since it has driven the considerable growth of the market. The next section will use these insights to analyze the situation in the embedded camera module market.<br />

III.<br />

EMBEDDED CAMERA MODULE MARKET<br />

In recent years a variety of embedded processors has become available whose computing power now makes them very attractive for vision applications. For projects with very large unit volumes, such as mobile phones or consumer devices, it is possible to get camera modules with Android support quite easily. However, when it comes to industrial applications, which tend to have only low or medium unit volumes and often require Linux support, things are different: due to the current market structure and missing interface standards, only few offerings exist, and in many cases they are restricted to certain sensors or processor types.<br />

A. Technical Peculiarities of Embedded Vision Systems<br />

Currently the physical interface of choice for embedded<br />

camera modules is the MIPI CSI-2 interface [9] using the D-PHY. It has most traits of a transport layer interface, meaning that it can transfer video data and allows configuration of the camera module via I2C, without dealing with camera features such as gain. The MIPI CSI-2 interface was originally<br />

designed to connect raw sensors directly to processors inside<br />

mobile phones but it can also be used to connect camera<br />

modules to a processor provided the customer can live with the<br />

short cable length of around 30 cm dictated by the D-PHY. If not, there are technologies which allow extending the cable<br />

length to 15 m, e.g. by bridging the CSI-2 interface over coax<br />

cable using technologies such as GMSL [10], FPD-Link [11],<br />

or V-by-One HS [12].<br />

Before the raw image from a sensor is delivered to the<br />

customer it must be pre-processed, e.g. de-bayered, de-noised,<br />

and defect pixel corrected. This is done in the ISP (image<br />

signal processor) which in most machine vision cameras is<br />

implemented using an FPGA inside the camera. However, since low-cost FPGAs do not support the electrical interface required by CSI-2/D-PHY, this is not an option for embedded camera modules. The two basic alternatives are<br />

using a separate ISP on the camera module or using the ISP<br />

many embedded processors have already built in.<br />
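To make the pre-processing step concrete, the following sketch shows a deliberately simple nearest-neighbour demosaic of an RGGB Bayer pattern. Real ISPs use far more sophisticated interpolation, plus de-noising and defect-pixel correction; this is only an illustration of what "de-bayering" means:<br />

```python
def debayer_rggb_nearest(raw, width, height):
    """Nearest-neighbour demosaic of an RGGB Bayer pattern.

    raw: flat list of pixel values, row-major, len == width * height.
    Returns a flat list of (r, g, b) tuples. Width and height are
    assumed to be even.
    """
    def px(x, y):
        return raw[y * width + x]

    out = []
    for y in range(height):
        for x in range(width):
            # Snap to the top-left corner of the containing 2x2 RGGB cell.
            cx, cy = x - (x % 2), y - (y % 2)
            r = px(cx, cy)
            g = px(cx + 1, cy)      # one of the two green samples
            b = px(cx + 1, cy + 1)
            out.append((r, g, b))
    return out
```
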

B. Using an External ISP (Pass-Through Mode)<br />

If the ISP resides in the camera module the CSI-2 video<br />

port of the embedded processor runs in so called pass-through<br />

mode meaning any internal ISP is bypassed. The external ISP<br />

can be implemented in different manners (Fig. 5):<br />

- Some sensors have an ISP on-chip, which typically has limited flexibility and performance. Since the customer does not want to deal with the raw register interface of the ISP and do the necessary configuration/calibration himself, a camera firmware is required which needs to reside on the processor.<br />
- Some sensor vendors provide companion chips for their sensors, which tend to be more powerful but of course do not come for free. The firmware topic is the same as with on-chip ISPs.<br />
- There are dedicated ISP chips which can be programmed to behave like a stand-alone camera module. In this case the camera firmware can reside on the camera module, but a driver is still required on the processor side to allow customers access to the camera.<br />

Since the programming interface of the CSI-2 port is different for every processor family, a special driver currently needs to be implemented for every embedded processor to be used with a camera module. This is one of the difficulties preventing the formation of an ecosystem for embedded vision, since it adds a lot of cost on the camera maker's side if they want to support the multitude of embedded processor types available in the market. Compare the situation to PC-based systems: network interface cards and bus adapter cards have a standardized bus interface, so implementing one driver for Windows and one for Linux is sufficient to cover the majority of the market, which helped a lot in creating the existing ecosystem.<br />

Fig. 5<br />

Pass-through mode<br />

blue: processor vendor, grey: other parties<br />

Another problem with external ISPs is cost and/or quality: sensors with an on-chip ISP tend to implement only simple ISP algorithms, and dedicated ISP chips cost money, which is economically feasible only for high-end and thus expensive sensors.<br />

C. Using the Internal ISP<br />

From the customer's point of view the optimum solution would be to pay only for the raw sensor in the camera module and use the ISP inside the processor, which is typically much more powerful than a reasonably priced external ISP (Fig. 6). The challenge here is to get access to the ISP's programming interface, which tends to be very complex and is mostly designed with high unit volume use cases in mind, where developers from the processor and the sensor vendor meet and co-design the vision system. This is an even larger obstacle for creating an ecosystem, since the processor vendors are very reluctant to allow access to their ISPs: they (rightfully) fear that the amount of support required from their side is not justified by the revenue to be expected from industrial applications.<br />

Fig. 6<br />

Using the internal ISP<br />

blue: processor vendor, grey: other parties<br />



D. Existing SDK Standards<br />

In the embedded world, several standardization attempts exist with respect to cameras. A first group of standards concentrates on the SDK the customer uses to access the camera module.<br />

- For Android, the Camera HAL3 [13] provides a very sophisticated API which contains many features that are also useful for industrial applications. The interface, however, is focused on Android, and most industrial applications today require Linux support.<br />
- The standard video interface for Linux is GStreamer [14], based on the low-level interface Video4Linux [15]. GStreamer focuses on supplying video streams and lacks the support and flexibility required for more advanced industrial imaging use cases.<br />
- For quite a while the Khronos group attempted to create a camera access standard dubbed OpenKCam [16]. However, during an Embedded Vision Alliance meeting in Dec 2017 it was reported that the initiative has not yet found enough contributors.<br />
- Last but not least, there are some proprietary SDKs provided by embedded processor vendors, such as LibArgus from NVIDIA [17], which can be seen as a de-facto standard for that particular processor family.<br />

None of these standards, however, supports a food chain in which camera module vendors can act independently from the embedded processor vendors, who control the proprietary transport layer API and the internal ISP. As the example of the machine vision camera market shows, a standardized SDK for customers is not even necessary for the creation of an ecosystem; the important interface to standardize is the one between the domain of the camera vendor and that of the processor vendor, which is the transport layer.<br />

E. Fixed Register Layout Revisited (CCS)<br />

A quite recent development is the release of the MIPI CCS<br />

(camera command set) standard [18] which provides a fixed<br />

register layout for MIPI CSI-2 sensors like the IIDC standard<br />

provided for IEEE 1394 cameras (see section II.C). The idea is<br />

that one generic driver would be sufficient for a multitude of<br />

sensors and that this driver could be supplied by the processor<br />

vendors which control the interface and on-board ISP. This<br />

would take away much of the value currently generated by<br />

camera makers and reduce the ecosystem players to mostly the<br />

sensor and processor vendors (Fig. 7).<br />

Fig. 7<br />

CCS – fixed register layout<br />

purple: sensor vendor, blue: processor vendor<br />

The problem with this approach is its inflexibility when it comes to custom extensions. Industrial applications tend to have much more complex use cases than consumer applications like mobile phones, so the feature set defined in CCS will need custom extensions. Any extension, however, would have to be created by the processor vendors governing the generic CCS driver, which would result in the same situation as today: customizations are only accepted for very high unit volumes. The CCS standard will therefore most likely not help in creating a vibrant ecosystem for the industrial embedded market. Nevertheless, CCS is a good starting point for new sensors, since it allows a lot of the base functionality to be set up with standard IP on the driver side, and it can be supported and extended by GenICam (see next section).<br />
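The appeal of a fixed register layout is that a single generic driver can probe and configure any compliant sensor. The sketch below illustrates the principle with hypothetical register offsets; the actual CCS register map is defined in the MIPI specification [18]:<br />

```python
# Hypothetical register offsets for illustration only; the real layout
# is defined by the MIPI CCS specification.
REG_MODEL_ID = 0x0000
REG_FRAME_RATE = 0x0100

class FixedLayoutSensor:
    """Simulated sensor exposing a fixed register map over an I2C-like bus."""

    def __init__(self, model_id):
        self.regs = {REG_MODEL_ID: model_id, REG_FRAME_RATE: 30}

    def read(self, addr):
        return self.regs[addr]

    def write(self, addr, value):
        self.regs[addr] = value

def generic_probe(sensor):
    """One generic driver works for every sensor honouring the layout."""
    return {"model": sensor.read(REG_MODEL_ID),
            "fps": sensor.read(REG_FRAME_RATE)}
```
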

F. Using GenICam in Pass-Through Mode<br />

This section explains how GenICam can be applied to camera modules that come with an external ISP, or a sensor with integrated ISP, using the processor in pass-through mode. The key challenge is to overcome the lack of a standard driver for the CSI-2 ports on the different systems. This can be solved by implementing a GenTL interface adapter for each processor family, which can then be used as the connection point for the camera SDK driver (Fig. 8). The GenICam reference<br />

implementation contains a GenTL producer framework which<br />

makes it quite easy to provide such an adapter which would be<br />

open source and could even be made part of the processor’s<br />

board support package. The XML file describing the camera’s<br />

features would be installed together with the camera firmware<br />

on the processor’s file system.<br />

Fig. 8<br />

Using GenICam in pass-through mode<br />

green: camera vendor, blue: processor vendor, grey: open source<br />

With this scheme, camera vendors would be able to provide their products independently of the processor vendors, which would most probably start the desired ecosystem. To get this going, a combined effort of the camera and processor makers would be required to create the necessary GenTL adapters for the most important processor families.<br />

G. Using GenICam in Combination with the Internal ISP<br />

Starting an ecosystem for this case is far more difficult than for the pass-through case, since it requires an open and standardized API for the internal ISP. This is not an easy task due to the complexity of the ISP's functionality and its many parameters for tuning the image quality. Currently, a semi-open ecosystem seems to be the best achievable outcome, where each processor vendor teams up with a small number of camera module vendors and grants them access to its ISP. This at least relieves the processor vendors of the burden of dealing with the industrial imaging market, with which they are normally not familiar.<br />



In order to keep the look and feel of the camera interfaces similar to the external ISP case, GenApi can also be used here to expose the camera's feature layer (Fig. 9). The sensor driver just needs to implement a layer of pseudo registers described by an XML file, so from the user's point of view there is no difference to camera modules based on the pass-through mode.<br />
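The pseudo-register idea can be sketched as follows: register reads and writes arriving from GenApi are dispatched to the processor vendor's ISP entry points. The address and the ISP callbacks below are hypothetical; in practice the mapping is published to the SDK through the camera's GenICam XML file:<br />

```python
class IspPseudoRegisters:
    """Pseudo-register space mapping register accesses to ISP API calls.

    GAIN_ADDR is a made-up address for illustration; the real addresses
    would be whatever the camera's XML file declares.
    """

    GAIN_ADDR = 0x1000

    def __init__(self, isp_set_gain, isp_get_gain):
        # Callbacks standing in for the processor vendor's ISP entry points.
        self._set_gain = isp_set_gain
        self._get_gain = isp_get_gain

    def write(self, addr, value):
        if addr == self.GAIN_ADDR:
            self._set_gain(value)
        else:
            raise ValueError(f"unmapped pseudo register 0x{addr:04x}")

    def read(self, addr):
        if addr == self.GAIN_ADDR:
            return self._get_gain()
        raise ValueError(f"unmapped pseudo register 0x{addr:04x}")
```

A GenApi "Gain" feature pointing at `GAIN_ADDR` would then behave exactly like a hardware register, even though the value actually lives inside the processor's ISP.<br />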

Fig. 9<br />

Using GenICam in combination with the internal ISP<br />

green: camera vendor, blue: processor vendor<br />

IV.<br />

CONCLUSION<br />

The market for industrial embedded vision systems would benefit from an ecosystem of independent camera vendors like the one the much more mature PC-based machine vision camera market has developed over the last 20 years, which has resulted in a vibrant market of cameras even for projects with low to medium unit volumes. What is missing to get this going is an interface standard that separates the product arena of the camera vendor from that of the processor vendor. The existing standards either do not address this interface, since they focus on the programming interface to the customer, or they are not flexible enough, since they are based on a fixed register layout. Using the well-established GenICam standard for the industrial embedded vision market would solve that problem and most probably trigger an ecosystem in this emerging market as well.<br />

REFERENCES<br />

[1] Camera Link standard.<br />

https://www.visiononline.org/vision-standards.cfm<br />

[2] IEEE 1394 standard<br />

http://www.1394ta.org/developers/specifications/StandardsOrientation<br />

V5.0.pdf<br />

[3] IIDC 1394-based digital camera specification<br />

http://1394ta.org/wp-content/uploads/2015/07/2003017.pdf<br />

[4] GigE Vision standard.<br />

https://www.visiononline.org/vision-standards.cfm<br />

[5] GenICam standard.<br />

http://www.genicam.org<br />

[6] USB3 Vision standard.<br />

https://www.visiononline.org/vision-standards.cfm<br />

[7] CoaXPress standard.<br />

http://www.coaxpress.com/<br />

[8] Camera Link HS standard.<br />

https://www.visiononline.org/vision-standards.cfm<br />

[9] MIPI CSI-2 standard.<br />

https://www.mipi.org/specifications/csi-2<br />

[10] GMSL MIPI CSI-2 bridge over coax<br />

https://www.maximintegrated.com/en/products/interface/high-speedsignaling/gmsl.html<br />

[11] FPD-Link<br />

http://www.ti.com/lsds/ti_de/interface/fpd-link/camera-serdesoverview.page<br />

[12] V-by-One HS<br />

http://www.thine.co.jp/en/products/pr_details/V-by-OneHS.html<br />

[13] Android Camera HAL3<br />

https://source.android.com/devices/camera/camera3<br />

[14] GStreamer multimedia framework<br />

https://gstreamer.freedesktop.org/<br />

[15] Video4Linux framework<br />

https://www.linuxtv.org/<br />

[16] OpenKCam call for participation<br />

https://www.khronos.org/openkcam<br />

[17] NVIDIA LibArgus camera API<br />

https://developer.nvidia.com/embedded/jetpack<br />

[18] MIPI CCS standard<br />

https://www.mipi.org/specifications/camera-command-set<br />



High-Resolution Multi-Camera Methodology for<br />

Autonomous Vision System Solution Development<br />

Michaël Uyttersprot<br />

Avnet Silica<br />

Merelbeke, Belgium<br />

michael.uyttersprot@avnet.eu<br />

Mario Bergeron, Luc Langlois<br />

Avnet<br />

Quebec, Canada<br />

mario.bergeron@avnet.com<br />

luc.langlois@avnet.com<br />

Abstract— Recent advances in embedded vision have evolved<br />

from passive video capturing devices to fully autonomous vision<br />

systems. Self-driving cars, drones, and autonomous guided robots require real-time parallel processing, low latency, and in some cases low power consumption. Multiple camera modules provide<br />

surround view and sensor fusion improves the overall vision<br />

system, while artificial intelligence and machine learning herald<br />

tremendous improvements for recognition and learning tasks in<br />

autonomous vision systems. This paper describes a multi-camera<br />

development platform for autonomous vision systems supporting<br />

six camera modules with up to 4K UHD resolution. The core of<br />

the solution is a Xilinx Zynq UltraScale+ MPSoC combining a<br />

64-bit processing system and programmable logic. By leveraging<br />

the processing system’s quad-core ARM Cortex-A53 to run<br />

traditional software tasks coupled with hardware-accelerated<br />

functions executing in programmable logic, system designers can<br />

achieve performance gains orders of magnitude higher than<br />

traditional software-based computer vision systems. A design<br />

methodology based on the Xilinx reVISION stack is presented,<br />

with hardware-accelerated OpenCV algorithms commonly used<br />

in ADAS and other autonomous vision systems.<br />

Keywords— High-Resolution Multi-Camera, Autonomous<br />

Vision System, Embedded Vision, SoC, FPGA<br />

I. INTRODUCTION<br />

Building autonomous vision systems grows increasingly<br />

challenging, requiring multidisciplinary expertise in optics,<br />

image sensors, computer vision and deep learning. Selecting<br />

the right development platform and design methodology is<br />

crucial for a successful implementation. A multi-camera<br />

system is part of an embedded vision design and several design<br />

recommendations need to be taken into account.<br />

The amount of data involved in a multi-camera approach can be enormous. This is especially the case with high-resolution, real-time video, and may require parallel processing<br />

or dedicated vision processing devices. Autonomous vision<br />

applications are real-world systems with continuously changing<br />

conditions for light, motion, or orientation, which creates<br />

uncontrolled and changeable variables in the system. Relying<br />

on simulations only will not work and real-world experiments<br />

are necessary, but can be very time consuming. Appropriate<br />

computer vision and machine learning algorithms are required<br />

to manipulate and analyze the video data and deal with real-world conditions and unexpected circumstances.<br />

It is clear that fine-tuning between hardware, software,<br />

computer vision and machine learning, under real-world<br />

conditions, is a difficult task. It is important for a developer to<br />

use the right tools to reduce development time and risk. In<br />

order to ease this task, this paper proposes a full solution, close<br />

to an end application, for multi-camera designs. The solution<br />

includes the design methodology with hardware, software<br />

environment, drivers, and computer vision and machine<br />

learning.<br />

In the following sections, we first discuss multi-camera applications and system requirements (section II), then detail the different hardware platform building blocks (section III), continue with the design methodology and example details (section IV), and finish with a conclusion (section V).<br />

II.<br />

MULTI-CAMERA APPLICATIONS AND REQUIREMENTS<br />

Cameras are used in a wide range of applications today.<br />

New applications demand higher resolution image sensors at<br />

faster frame rates, enabling sharper images and better object<br />

detection at longer distances. Faster frame rates reduce latency<br />

and improve reaction time of an autonomous system at the<br />

expense of higher-performance signal processing combined<br />

with more complex computer vision and machine learning<br />

algorithms. Sensor fusion integrates various sensor<br />

technologies such as image sensors paired with a thermal<br />

sensor, LIDAR or radar. The goal of sensor fusion is to (a)<br />

improve application or system performance, (b) correct<br />

deficiencies of the individual sensors, (c) or provide better<br />

accuracy such as for position and orientation. Additionally, for<br />

time-critical reaction such as obstacle avoidance, various<br />

sensor sources must be synchronized and processed with low<br />

latency.<br />

Machine learning, and in particular deep learning, has<br />

enabled rapid progress towards recognition and classification<br />

tasks in autonomous systems. For most complex recognition tasks, deep learning is more efficient, and in many cases more accurate, than traditional computer vision, but it requires significantly more computing power. The<br />

implementation of deep learning inference, which is the<br />

deployment of a pre-trained deep learning neural network,<br />

requires GPUs or FPGAs. FPGAs have the advantage of high<br />

flexibility and low power consumption. Companies like DeePhi<br />

Tech recognize a very large market value for deep learning and<br />

provide FPGA deep learning platforms for autonomous<br />

systems [1]. Studies prove that even binarized neural network<br />

inference can run efficiently on an FPGA for very fast<br />

classification of objects, with high accuracy and low power<br />

consumption [2]. FPGAs can run small, compressed machine<br />

learning accelerators, and can be dynamically reconfigured, to<br />

adapt the accelerator on the fly for the required acceleration<br />

task or for any deep learning topology [3]. Typically, computer<br />

vision and machine learning are combined as a hybrid model;<br />

computer vision for initial detection, deep learning for<br />

verification, classification and recognition tasks. In some cases,<br />

autonomous systems even run multiple deep learning networks,<br />

each with their own specific tasks.<br />
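The hybrid model described above can be sketched in a few lines: a cheap computer vision stage proposes candidate regions, and a (here mocked) learned classifier verifies each one. Both stages and their thresholds are toy stand-ins, not a real detector or network:<br />

```python
def detect_regions(frame):
    """Toy computer vision stage: keep regions whose mean brightness
    exceeds a threshold (a stand-in for a real detector)."""
    return [r for r in frame if sum(r["pixels"]) / len(r["pixels"]) > 100]

def classify(region):
    """Toy deep learning stage: a stand-in for a neural network classifier."""
    return "pedestrian" if max(region["pixels"]) > 200 else "background"

def hybrid_pipeline(frame):
    """Computer vision for initial detection, a classifier for verification."""
    return [(r["id"], classify(r)) for r in detect_regions(frame)]
```
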

The multi-camera development platform can target<br />

different applications including:<br />

- Automotive and advanced driver assistance systems (ADAS) – cars have multiple image cameras, radar and LIDAR for situational awareness with a 360° field of view, and to support safety outside and inside the car. The number of high-resolution cameras will increase to meet the impending challenges of fully autonomous driving. Camera resolution will increase from 1–2 Mp today up to 8 Mp in the longer term, and the frame rate will increase from 10–30 fps today to 60 fps in the future [4]. ADAS include functionality for lane departure warning, traffic sign recognition, park assist, pedestrian detection, adaptive cruise control, passenger monitoring with drowsy driver detection, and blind spot detection.<br />
- Unmanned aerial vehicles (UAVs) and drones – UAVs and drones are equipped with multiple cameras to provide a flight view and to perform analytics of the surrounding area. UAVs are used to inspect agricultural fields, power lines, wind turbines or buildings. Additional applications are security and surveillance for police or fire brigades, search and rescue, and professional photography.<br />
- Autonomous guided robots – autonomous robots deploy various sensors with multiple cameras, with strong coordination of sensing, motion and decision-making for fast reaction to environmental situations.<br />
- Virtual and augmented reality (VR/AR) – multiple cameras capture 360° video and the individual video streams are stitched together to create a virtual or augmented environment.<br />

Key requirements of the different components of the<br />

solution include:<br />

- Reliability – reliability and safety are a top priority for autonomous systems, and functional safety is critical to avoid hazardous situations<br />
- Real-time execution – fast reaction time is required, with low latency in hardware and software<br />
- Flexibility – hardware and software need to be highly flexible and reconfigurable to be future-proof and to meet different configurations for multi-camera designs<br />
- Power consumption – power consumption must be minimized, especially for battery-powered applications like drones<br />
- Computer vision and machine learning – extensive capabilities to implement computer vision and machine learning with deep learning inference are crucial for object detection, recognition, verification and classification<br />
- Automotive qualified – this paper proposes a solution valid for autonomous systems including automotive applications, meaning that all components need to be available in automotive grade with long lifetime availability.<br />

III.<br />

MULTI-CAMERA PLATFORM<br />

The high-resolution multi-camera solution consists of four core hardware building blocks: (A) camera modules, (B) a multi-sensor FMC module, (C) a multi-processor SoC or SOM, and (D) a carrier board.<br />

Fig. 1. Overall system with camera modules, multi-sensor FMC module,<br />

multi-processor SoC or SOM, and a carrier board<br />

A. Camera modules<br />

An image pipeline starts at the camera module combining<br />

image sensor, lens, control electronics, and an interface. Poor<br />

lens quality, or mismatch in image sensor specifications, will<br />

affect the whole image pipeline, and cannot always be<br />

improved or recovered by the processing system. This paper<br />

describes modules combining an image camera and a serializer from<br />

MARS, a modular automotive reference system developed by<br />

ON Semiconductor [5]. With the modular approach, developers<br />

can build different combinations of image sensors, coprocessors<br />

or ISPs, and communication standards. The<br />

modules are component boards with consistent signal/power interconnect definitions that enable swapping of individual boards,<br />

creating a wide range of options for experimenting, while<br />

eliminating the need for constructing custom boards. The result<br />



is a highly flexible solution where the various modules with<br />

different image sensors and lenses, and different field of view<br />

(FOV), are interchangeable. The modules are miniaturized with<br />

a form factor of 25mm by 25mm and can be mounted on a<br />

vehicle for real-world testing. MARS camera modules are<br />

equipped with M12 lenses with different FOV options. The<br />

camera module supports high bandwidth and image resolution,<br />

and data can be transferred over several meters to the multi-processor SoC with the serializer. The serializer is part of a serializer/deserializer (SerDes) implementation and will be described in detail below with the multi-sensor FMC module.<br />

Both the MARS camera modules and the serializers are ideal<br />

components for the high-resolution multi-camera development.<br />

Fig. 2. ON Semiconductor MARS camera module with serializer board<br />

B. Multi-sensor FMC module<br />

The multi-sensor FMC module is an interconnection board<br />

between the camera modules and the host carrier board. It is<br />

not a stand-alone module, but rather a plug-in module designed<br />

to interface with FMC-compatible carrier boards. The FMC<br />

module was developed by Avnet [6] to support multiple<br />

cameras for Xilinx Zynq UltraScale+ embedded vision<br />

applications in automotive ADAS, augmented reality, and<br />

UAV or drones. The board has an FMC connector and 6<br />

FAKRA connectors supporting up to 4 (four) 2Mpixel and 2<br />

(two) 8Mpixel camera modules through low cost 50Ω coax<br />

cables. The FMC module uses the Maxim Integrated quad-channel GMSL deserializer for the 2 Mpixel cameras, and a dual GMSL2 deserializer for the 8 Mpixel camera modules [7]. Communication with the Xilinx Zynq UltraScale+ is<br />

delivered by MIPI CSI-2. The FMC-LPC connector<br />

specification, MIPI CSI-2 interface, and SerDes are explained<br />

below.<br />

Fig. 3. Multi-sensor FMC module<br />

FMC-LPC. The FPGA mezzanine card (FMC) is an ANSI standard daughter card for carrier boards containing an FPGA; it provides a standard form factor, connectors, and a modular interface to the FPGA located on the carrier board. FMC supports data throughput up to 10 Gb/s for<br />

individual signaling and 40 Gb/s overall bandwidth between<br />

mezzanine and carrier card [8]. The multi-sensor FMC module<br />

has a Low-Pin Count (LPC) connector with 160 pins.<br />

MIPI CSI-2. MIPI CSI-2 is a camera serial interface (CSI)<br />

and is a specification of the mobile industry processor interface<br />

(MIPI). It defines the interface between cameras and host<br />

processors. It is scalable, robust, low-power, high-speed, cost-effective, and has low electromagnetic interference (EMI). It is<br />

a widely used camera interface for single- or multi-camera<br />

implementations. It is typically used for high performance<br />

video and still image applications, including interconnection<br />

between 4K resolution cameras, head-mounted virtual reality<br />

devices, automotive applications, and UAV or drones. MIPI<br />

CSI-2 is a lane-scalable specification and the data stream is<br />

distributed between the lanes. Applications that require extra<br />

bandwidth beyond that provided by one data lane, or those<br />

trying to avoid high clock rates, can expand the data path to<br />

two, three, or four lanes, and obtain approximately linear<br />

increases in the peak bus bandwidth.<br />
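The lane-scaling rule of thumb can be put into numbers. The helpers below assume an illustrative 2.5 Gbps per D-PHY lane (actual rates depend on the D-PHY version and clocking) and ignore protocol overhead and blanking:<br />

```python
def csi2_peak_gbps(lanes, gbps_per_lane=2.5):
    """Approximate peak CSI-2 bandwidth: scales roughly linearly with lanes.

    The 2.5 Gbps/lane default is illustrative only.
    """
    return lanes * gbps_per_lane

def required_gbps(width, height, fps, bits_per_pixel):
    """Raw video payload rate in Gbps, ignoring overhead and blanking."""
    return width * height * fps * bits_per_pixel / 1e9
```

Under these assumptions a 1080p60 stream of 12-bit raw pixels needs about 1.5 Gbps and would fit comfortably even on a two-lane link, while higher resolutions or frame rates push designs toward four lanes.<br />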

Serializer and deserializer. SerDes is used for high-speed<br />

data transmission over extended distances. The serializer takes multiple data line inputs, condenses them into a smaller number of outputs at a higher data rate, and transmits the condensed data over a cable. The deserializer captures the serialized data and outputs the recovered original data, usually to a host processor. SerDes reduces the cost of connectors and cables, reduces noise and EMI, and delivers high-speed data<br />

transmission over long distances. Gigabit multimedia serial<br />

link SerDes (GMSL) are Maxim Integrated proprietary<br />

transceivers and provide a compression-free alternative to<br />

Ethernet with a 10x increase in data rates, 50 percent lower<br />

cabling costs, and better EMC compared to Ethernet. GMSL<br />

chipsets can drive 15 meters of coax or shielded twisted pair<br />

(STP) cabling, with margin required for robust and versatile<br />

designs. Spread-spectrum capability is built into each serializer<br />

and deserializer to improve the EMI performance of the link,<br />

without the need for an external spread-spectrum clock.<br />

GMSL2 is an improved version of GMSL supporting 6–12 Gbps serial data rates, multi-streaming functionality, and<br />

advanced diagnostics (like polling of remote registers to ensure<br />

link/system integrity). GMSL and GMSL2 are ideal for data<br />

transmission for megapixel multi-camera systems.<br />
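A quick link-budget check illustrates why: the sketch below compares a raw stream's payload rate against a serial link's line rate. Protocol overhead, line coding, and blanking are ignored here, so real margins are smaller than this estimate suggests:<br />

```python
def stream_gbps(width, height, fps, bits_per_pixel):
    """Raw pixel payload rate in Gbps, ignoring protocol overhead."""
    return width * height * fps * bits_per_pixel / 1e9

def fits_link(width, height, fps, bits_per_pixel, link_gbps):
    """True if the raw stream's payload fits the given serial link rate."""
    return stream_gbps(width, height, fps, bits_per_pixel) <= link_gbps
```

For example, a 3840×2160 sensor at 30 fps with 12-bit raw pixels produces roughly 3 Gbps of payload and fits a 6 Gbps GMSL2 link, while the same sensor at 60 fps with 24-bit pixels would not.<br />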

C. Multi-processor SoC and SOM<br />

A multi-processing system-on-chip (MPSoC) integrates<br />

several processing devices, including additional hardware<br />

functions, into a single silicon chip. Along with a processing<br />

unit, a SoC can contain a GPU, FPGA, memory, peripheral<br />

controllers, power management circuits, and may even contain<br />

wireless radios, or other integrated circuits. A SoC reduces the<br />

overall hardware complexity of a system, the number of<br />

external components, and power consumption because of the<br />

optimized hardware implementation.<br />

An efficient alternative to custom chip-down designs is a<br />

system-on-module (SOM), a small form-factor, ready-to-use<br />

computing module. A SOM combines all core hardware<br />

components, including SoC, external memory, and power regulation, on a pre-engineered compact board, significantly

reducing development time and risk. Both approaches are<br />

described below.<br />

www.embedded-world.eu<br />

663


1) Zynq UltraScale+ MPSoC<br />

Zynq UltraScale+ MPSoC devices are multiprocessor<br />

system-on-chips (MPSoC). Zynq UltraScale+ MPSoC is a<br />

family with three distinct variants: (a) Zynq UltraScale+ CG with dual ARM Cortex-A53 application processors and dual Cortex-R5 real-time processors, (b) Zynq UltraScale+ EG with quad ARM Cortex-A53, dual Cortex-R5 processors and a GPU, and (c) Zynq

UltraScale+ EV, similar to the EG version, but with an<br />

additional video codec [9]. The Zynq UltraScale+ EV devices<br />

with H.264/H.265 video codec can simultaneously encode and decode up to 4Kx2K (60fps).

Zynq UltraScale+ MPSoCs have two main hardware parts:<br />

(1) a processing system – PS, and (2) a programmable logic –<br />

PL:<br />

• The PS consists of an application processor unit (APU), a real-time processing unit (RPU), a graphics processing unit (GPU) in the case of the EG and EV devices, a configuration security unit (CSU), a variety of peripherals, integrated memory, and high-speed communications interfaces.

• The PL is equivalent to an FPGA and contains programmable logic and interconnections. The advantage of the FPGA part is its re-programmability: it acts as a large logic circuit that can be configured according to a design and, if changes are required, re-programmed with an update. Additional Xilinx or third-party IP blocks can be implemented in the PL.

Fig. 4. Zynq UltraScale+ EV MPSoC block diagram with the details of the<br />

processing system (PS) and programmable logic (PL)<br />

The interconnection between the PS and PL is supported by<br />

the advanced extensible interface (AXI), a burst-oriented, open standard protocol that allows the connection and management of many controllers and peripherals in a multi-master design, and communication between IP cores. Each AXI port contains independent read and write

channels.<br />

Zynq UltraScale+ MPSoC devices can operate in different<br />

modes: (1) the PS can work in a standalone mode without<br />

attaching any additional fabric IP, or (2) IP cores can be<br />

instantiated in fabric and attached to the Zynq UltraScale+ PS<br />

as a PS+PL combination. The second option is usually the most appropriate, as it takes advantage of software-hardware co-design, with high-bandwidth building blocks in fabric and lower-bandwidth parts running on the processors.

The different core units of the MPSoC have different<br />

functions and specifications, which are described below:<br />

Application processor unit (APU). The APU includes several ARM microprocessor units (MPUs) and is ideal for

high-level vision processing. MPUs are easy to program,<br />

because the tools, libraries and programming structure are<br />

similar to those for PC applications and leverage existing<br />

standard APIs like OpenCV or V4L. This reduces the learning<br />

curve required to program new hardware and applications. An<br />

MPU can run very complex algorithms, but at high image resolutions and frame rates it quickly runs out of horsepower and cannot achieve real-time performance, so additional hardware acceleration is required.

Real-time processing unit (RPU). RPUs are ideal for high-performance, real-time applications. RPUs are similar to MPUs, but allow higher-performance interrupt handling and quicker response to real-time events. The high performance and high determinism of RPUs make them ideal for functional safety and security applications.

Graphics processing unit (GPU). GPUs were initially designed for displaying video on computer monitors, but for several years they have also been used for general-purpose computing.

GPUs are also used for training and inference for deep<br />

learning. A GPU has a massively parallel architecture<br />

consisting of thousands of small, efficient cores designed for<br />

handling multiple tasks simultaneously. Compared with a<br />

MPU, which has only a few cores optimized for sequential<br />

serial processing, a GPU can perform tasks much faster than an<br />

MPU, due to its parallel processing. Memory latency is hidden by very fast context switching, but GPUs typically have higher power consumption and heat generation than other processing devices.

Field-programmable gate array (FPGA). The FPGA forms the PL and is an integrated circuit designed to be configured after manufacturing. This differs from application-specific integrated circuits (ASICs) and application-specific standard products (ASSPs). ASICs and ASSPs have the advantage of high performance and low power consumption, but due to their higher initial engineering cost they are only economical for high-volume manufacturing and long production runs. They are not suitable for rapid prototyping and they are not reconfigurable: once manufactured, they cannot be reprogrammed. This lack of flexibility has led to the use of FPGAs to implement all the desired functionality directly in the programmable logic, including the required custom hardware accelerators, or even the architecture of a microprocessor (soft processor core). FPGAs have low power consumption and are better suited for low-level processing than general-purpose hardware.

Digital Signal Processors (DSP) can be part of an FPGA and<br />

offer single cycle multiply and accumulation operations, in<br />

addition to parallel processing capabilities and integrated<br />

memory blocks. DSPs deliver excellent overall performance<br />

across low-, mid- and high-level vision processing.<br />
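The single-cycle multiply-accumulate (MAC) operation mentioned above is the building block of filters such as the finite impulse response (FIR) filter. The plain C++ below only illustrates the arithmetic; in an FPGA, the inner loop would be unrolled across parallel DSP slices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A 1-D FIR filter: y[n] = sum_k h[k] * x[n-k]. Each term of the sum is
// one multiply-accumulate (MAC), the operation a DSP slice performs in a
// single cycle; hardware runs the taps in parallel, this code runs them
// sequentially.
std::vector<int> fir(const std::vector<int>& x, const std::vector<int>& h) {
    std::vector<int> y(x.size(), 0);
    for (size_t n = 0; n < x.size(); ++n)
        for (size_t k = 0; k < h.size() && k <= n; ++k)
            y[n] += h[k] * x[n - k];  // one MAC per tap
    return y;
}
```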



Configuration security unit (CSU). The security unit with<br />

cryptographic capabilities can be used for hardware<br />

acceleration of cryptographic functions. The secure boot<br />

functionality in Zynq UltraScale+ MPSoC allows support for<br />

confidentiality, integrity, and authentication of partitions.<br />

Secure boot is accomplished by combining the hardware root<br />

of trust (HROT) capabilities of the Zynq UltraScale+ device<br />

with the option of encrypting all boot partitions. The HROT is<br />

based on the RSA-4096 asymmetric algorithm in conjunction with SHA-3/384, which is hardware accelerated, or SHA-2/256, implemented in software. Confidentiality is provided using the 256-bit advanced encryption standard in Galois/counter mode (AES-GCM).

2) UltraZed-EV SOM<br />

An efficient alternative to custom chip-down designs is a<br />

system-on-module (SOM). Designers can use a SOM as a<br />

reference to create their own vision system, or can drop the<br />

SOM into their final product with their custom-designed carrier<br />

board. The UltraZed-EV SOM, developed by Avnet [10],<br />

enables designers to build multi-camera systems for<br />

automotive ADAS, surveillance, and other embedded vision<br />

applications, and because of the H.264/H.265 video codec unit integrated into the MPSoC EV, it is possible to simultaneously

encode and decode up to 4Kx2K (60fps).<br />

The Avnet UltraZed-EV SOM is a production-ready, high<br />

performance, full-featured SOM with the Zynq UltraScale+<br />

EV. The SOM includes all the necessary functions, such as onboard<br />

dual system memory, high-speed transceivers, Ethernet,<br />

USB, and configuration memory. The UltraZed-EV provides<br />

access to 152 user I/O pins, 26 PS multiplexed I/O pins (MIO)<br />

and gigabit transceivers GTR/GTH. GTR and GTH are gigabit<br />

transceivers to support the most common serial high speed<br />

interconnects. GTR supports a maximum data rate of 6.0 Gb/s,<br />

while GTH supports a maximum data rate of 16.3 Gb/s. The<br />

UltraZed-EV SOM has 4 high-speed PS GTR transceivers along with 4 GTR reference clock inputs, and 16 high-speed PL GTH transceivers along with 8 GTH reference clock

inputs through three I/O connectors. Avnet also provides Linux<br />

board support packages (BSP), to reduce the time required to<br />

bring up an operating system on the SOM, allowing developers<br />

to immediately start developing their differentiating algorithms<br />

and applications.<br />

Fig. 5. UltraZed-EV SOM<br />

D. Carrier board<br />

A carrier board, carrier card, or evaluation board provides the mounting option for the multi-sensor FMC module and additional modules or cards. It also includes the Zynq

UltraScale+ MPSoC device or connector for the Avnet<br />

UltraZed-EV SOM. It has an Ethernet connector, SD Card<br />

interface, I/O interfaces, peripherals, video output, and power<br />

supplies. Several configurations are available to build the<br />

multi-camera solution. Although the multi-sensor FMC module has only 6 connectors for camera modules, it is possible to extend the number of cameras if the carrier board has more than one FMC connector. The description

below explains the type of boards with compatible MPSoC or<br />

SOM, the possible configuration of the FMC module, and the<br />

number of camera modules that can be connected:<br />

• Xilinx ZCU102 evaluation board [11]: this board is equipped with the Zynq UltraScale+ EG (ZU9EG). The board has 2 FMC connectors and can support up to 10 camera modules with the following configuration: 4x 2Mp modules and 2x 8Mp modules on the first FMC connector, and 4x 2Mp modules on the second FMC connector.

• Xilinx ZCU104 evaluation board [12]: this board is equipped with the Zynq UltraScale+ EV (ZU7EV) and has an additional video codec. The board has 1 FMC connector and can support up to 6 camera modules with the following configuration: 4x 2Mp modules and 2x 8Mp modules.

• Avnet UltraZed EV carrier card [13]: this carrier card has a socket for the Avnet UltraZed-EV SOM. The carrier card has 1 FMC connector and can support up to 5 camera modules with the following configuration: 4x 2Mp modules and 1x 8Mp module. Both the carrier card and SOM can be part of the UltraZed EV starter kit, bundled to provide a complete system for prototyping and evaluation.

IV. DESIGN METHODOLOGY

The design methodology is based on reVISION, a stack<br />

including a broad range of development resources for (1)<br />

algorithm development, (2) application development and (3)<br />

platform development from Xilinx or from third parties.<br />

reVISION is more responsive than typical SoCs & embedded<br />

GPUs, and delivers up to 6x better images/sec/Watt in machine<br />

learning, 42x higher frames/sec/Watt for computer vision<br />

processing, and 1/5th the latency [14]. reVISION works in<br />

synergy with existing development tools for hardware and<br />

software application design, including Xilinx SDx<br />

environment, PetaLinux, and additional reVISION libraries for<br />

computer vision and machine learning applications. SDx consists of the SDAccel and SDSoC development

environments, and the Vivado Design Suite. SDAccel is<br />

typically used for applications in data centers and for PCIe<br />

based accelerator systems, and is beyond the scope of this<br />

paper. SDSoC, Vivado Design Suite, PetaLinux and reVISION<br />

libraries will be explained in detail in this section and are the<br />

core components for the high-resolution multi-camera solution.<br />

Additionally, we will explain the vision capture pipeline and a<br />

workflow example.<br />



A. SDSoC<br />

The software-defined system on chip environment<br />

(SDSoC) is an Eclipse-based integrated development<br />

environment (IDE) for implementing embedded systems using<br />

Zynq devices. SDSoC includes a full-system optimizing C/C++<br />

compiler, providing an intuitive programming model for<br />

software engineers to write applications in C/C++. The SDSoC<br />

system compiler creates a complete embedded system on the<br />

device by compiling the application into hardware and<br />

software, including a complete boot image with firmware,<br />

operating system, and application executable. SDSoC performs program analysis, task scheduling, and binding onto programmable logic and the embedded MPUs, as well as hardware and software code generation that automatically implements communication between hardware and software

components. SDSoC is built on the general Xilinx SDK<br />

(XSDK), inheriting many of its tools including debuggers,<br />

performance analysis, command-line tools, GNU toolchain<br />

such as GNU C library (glibc), and standard libraries like<br />

OpenCV. SDSoC also supports open computing language<br />

(OpenCL), with OpenCL kernels that target the programmable logic of Zynq devices. OpenCL is a framework for writing

programs that execute across heterogeneous platforms<br />

consisting of CPUs, GPUs, DSPs, FPGAs and other processors<br />

or hardware accelerators.<br />

B. Vivado Design Suite<br />

Vivado Design Suite is a software suite produced by Xilinx<br />

for synthesis and analysis of hardware description language (HDL) designs, with features for SoC development and high-level synthesis (Vivado HLS). Vivado performs timing

analysis, examines register-transfer level (RTL) diagrams,<br />

simulates a design's reaction to different stimuli, and<br />

configures the target device with the programmer.<br />

Vivado HLS accelerates IP creation by enabling C, C++, and SystemC specifications to be directly targeted into Xilinx

devices without the need to manually create RTL. The Vivado<br />

IP integrator makes it easy to add hardware IPs to existing<br />

design source and create connections for ports, such as clock<br />

and reset. Vivado HLS and IP integrator are used for hardware<br />

system development. Configuration of the embedded<br />

processors, peripherals, and the interconnection of these<br />

components, also takes place in Vivado.<br />

C. PetaLinux<br />

PetaLinux is a brand name used by Xilinx to provide a full<br />

embedded Linux system specifically targeting FPGA-based<br />

SoC designs. It includes the Linux OS as well as a complete<br />

configuration, build, and deploy environment for Xilinx<br />

silicon. Because it is a standard Linux OS with Linux drivers,<br />

and with standard application programming interfaces (APIs), a<br />

developer has the advantage of incorporating existing<br />

functional software blocks and facilitating porting of<br />

applications from other processors. Software ecosystems like OpenCV or GStreamer can be used within PetaLinux. Real-time video capturing on Linux can be achieved with Video4Linux (V4L), which includes a collection of device

drivers and an API. V4L has the advantage that programmers<br />

can easily add video support to applications without a lot of<br />

development effort, because it supports USB webcams and<br />

image sensors from many image sensor vendors, and is<br />

supported by many existing libraries and applications.<br />

PetaLinux refers to an individual software package, but it is<br />

not a standalone embedded Linux development solution. The<br />

workflow for PetaLinux consists of multiple layers in which it<br />

relies on other Xilinx software like Vivado and SDSoC. It is based on Yocto and adds the hardware/software interface (HSI) from Vivado, and special tools for boot image creation.

PetaLinux consists of three key elements: (a) pre-configured<br />

binary bootable images, (b) fully customizable Linux for<br />

Xilinx SoC devices, and (c) PetaLinux SDK, including tools<br />

and utilities to automate complex tasks across configuration,<br />

build, and deployment.<br />

PetaLinux reference board support packages (BSPs) are<br />

available as reference designs that help the developer start working with a fully optimized design, and they can be customized afterwards for the developer's own projects. A BSP

includes all the necessary design and configuration files, pre-built and tested hardware and software images, ready to

download on a carrier board, or for booting in the system<br />

emulator (Quick EMUlator). Developers can customize the<br />

boot loader, Linux kernel, or Linux applications. They can add<br />

new kernels, device drivers, applications, libraries, and boot<br />

and test software stacks on the QEMU or on physical hardware<br />

via a network connection or JTAG.<br />

D. reVISION libraries<br />

The reVISION stack includes xfOpenCV and xfDNN,<br />

libraries for algorithm development and execution. xfOpenCV<br />

has a broad set of acceleration-ready OpenCV functions for<br />

computer vision processing. xfDNN is a library intended for<br />

machine learning inference implementation. For application<br />

level development on top of the algorithm development, Xilinx<br />

supports industry-standard frameworks including OpenVX for<br />

computer vision and Caffe for machine learning. SDSoC is<br />

used to enable algorithm and/or application development in C,<br />

C++ and/or OpenCL, by using the reVISION resources. The<br />

SDSoC Environment can also be used to expand the reVISION<br />

resources with new acceleration-ready software libraries.<br />

1) xfOpenCV library<br />

The open source computer vision library (OpenCV) is a software library aimed at real-time computer vision on still images or video. The library includes a large

number of algorithms for filtering and image optimization,<br />

tracking of moving objects, image stitching, recognition and<br />

classification. Advanced computer vision algorithms used for<br />

image and video processing in 2D and 3D are part of the<br />

library. The standard OpenCV library, with several thousand functions, can be used within the Xilinx C/C++ embedded

software environments on the PS, but it is even more<br />

interesting to use hardware-accelerated functions running in<br />

PL. Xilinx provides the xfOpenCV library, an FPGA device<br />

optimized and hardware accelerated OpenCV library, intended<br />

for application developers using Zynq devices. xfOpenCV<br />

library functions are similar in functionality to their OpenCV<br />

equivalent. xfOpenCV provides a software interface for<br />

computer vision functions accelerated on an FPGA device. The<br />

xfOpenCV library is designed to work in the SDSoC<br />

development environment. SDSoC performs two steps<br />



automatically, which drastically increases productivity when<br />

accelerating functions:<br />

• Hardware accelerators are created for each of the xfOpenCV function instantiations in the C/C++ code.

• Data movers are instantiated when needed to access image data from/to external memory.

The following figure shows the reVISION platform with an<br />

xfOpenCV function implemented in the hardware [15]:<br />

2) xfDNN library<br />

Xilinx deep neural network library (xfDNN) is optimized<br />

for machine learning and deep learning inference applications.<br />

The reVISION Stack with xfDNN enables deployment of<br />

trained networks on a Zynq UltraScale+ MPSoC for inference.<br />

xfDNN is designed for maximum compute efficiency at 16-bit<br />

and 8-bit integer data types. xfDNN includes support for the<br />

most popular neural networks including AlexNet, GoogLeNet,<br />

SqueezeNet, SSD, and FCN. Additionally, the stack provides<br />

library elements including pre-defined and optimized implementations of convolutional neural network (CNN) layers, required to build custom neural networks. Zynq UltraScale+

MPSoC devices are ideal for deep learning inference, achieving better results in images/sec/Watt compared to embedded GPUs.
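The 8-bit integer data type relies on quantizing trained floating-point weights. The sketch below shows plain symmetric linear quantization for illustration; xfDNN's actual quantization scheme is more elaborate, and the helper shown is not part of the library API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric linear quantization: map float weights in [-max|w|, +max|w|]
// onto the int8 range [-127, 127]. Illustrative only; this helper is not
// part of the xfDNN API.
struct Quantized {
    std::vector<int8_t> values;
    float scale;  // dequantize a value with: value * scale
};

Quantized quantize_int8(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    const float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    Quantized q{{}, scale};
    for (float v : w)
        q.values.push_back(static_cast<int8_t>(std::lround(v / scale)));
    return q;
}
```

Trading float multiplies for int8 MACs is what lets the DSP slices in the PL process many more operations per watt.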

E. Vision capture pipeline<br />

The MIPI CSI-2 capture pipeline uses the V4L Linux<br />

framework and is implemented in the PL. It consists of the<br />

“MIPI CSI-2 sub-system”, the “AXI-Stream switcher” (for the<br />

case of the quad GMSL deserializer), the “Image Pipeline”,<br />

and the “Frame Buffer Write”. The figures below illustrate the<br />

pipeline for the quad GMSL and dual GMSL2:<br />

Fig. 6. xfOpenCV kernel on the reVISION platform<br />

The main challenge of running OpenCV-based applications<br />

on embedded hardware is the fact that all functions access<br />

image data from external memory. As more and more OpenCV functions are called, more and more accesses to external memory are required, which increases the latency and power

consumption of the entire system. Xilinx resolved this<br />

challenge by providing xfOpenCV functions which can infer a<br />

streaming pipeline implementation. The following image<br />

illustrates this using a stereo vision algorithm, calculating depth<br />

from two (stereo) cameras. xfOpenCV directly infers<br />

pipelining functions from one to the next, avoiding frame<br />

buffers and external memory.<br />

Fig. 8. MIPI capture pipelines for quad GMSL (A) and dual GMSL2 (B)<br />

Fig. 7. Comparison of traditional computer vision memory access (A) with<br />

xfOpenCV (B)<br />

The following steps describe the MIPI CSI-2 capture<br />

pipeline:<br />

1. The image sensors provide raw image data via<br />

SerDes over MIPI CSI-2 link<br />

2. MIPI CSI-2 sub-system receives and decodes the<br />

incoming data stream to AXI4-Stream<br />

3. Image Pipeline manipulates the data for each<br />

image stream, including:<br />

o Demosaic IP converts the raw image<br />

format to RGB<br />



o Gamma IP provides per-channel gamma<br />

correction functionality<br />

o Color correction<br />

o Color space conversion converts the<br />

RGB image to YUV422<br />

o Video scaling resizes the images<br />

4. Frame Buffer Write IP writes the output of the<br />

Image Pipeline to memory<br />
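Of the pipeline stages listed above, per-channel gamma correction is commonly realized as a small lookup table that is precomputed once and then applied to every pixel. The sketch below is a hedged software model of that idea, not the implementation of the Xilinx Gamma IP.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Software model of a per-channel gamma stage: precompute a 256-entry
// lookup table once, then apply it to every pixel of that channel.
// Illustrative only; not the implementation of the Xilinx Gamma IP.
std::array<uint8_t, 256> make_gamma_lut(double gamma) {
    std::array<uint8_t, 256> lut{};
    for (int i = 0; i < 256; ++i)
        lut[i] = static_cast<uint8_t>(
            std::lround(255.0 * std::pow(i / 255.0, gamma)));
    return lut;
}
```

Applying the stage then reduces to one table lookup per pixel, which is why it maps so cheaply to hardware.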

The Image Pipeline is flexible and is determined by the developer. It can include Xilinx IP cores and/or third-party IP cores. logicBRICKS/Xylon is one of the partners that provide optimized IP cores for ADAS applications [16].

F. Workflow example<br />

With the reVISION stack, developers can start with an<br />

existing platform with all the interfaces and a reference BSP in<br />

place, and can concentrate on developing their own<br />

differentiating algorithms and applications. The typical<br />

reVISION workflow involves the following steps:<br />

• Select the hardware platform including carrier card, FMC module and cameras

• Build a reference BSP and run it on the hardware

• Start application development in C/C++ in SDSoC, with the OpenCV library

• Cross-compile OpenCV applications to the PS of the Zynq UltraScale+

• Profile and identify bottleneck functions in the code and select the potential candidates for hardware acceleration

• Make minimal modifications to call the Xilinx optimized xfOpenCV functions, instead of the standard OpenCV functions

• Recompile the xfOpenCV functions for hardware acceleration using SDSoC. SDSoC not only builds hardware accelerators from the xfOpenCV functions, but also instantiates the data movers, such as DMA engines, needed to transfer the data to/from external DRAM memory

• Execute the final optimized application on the embedded hardware

In the near future, a complete reVISION compliant<br />

development solution will be proposed. The solution will<br />

include the Avnet UltraZed EV carrier card with Avnet<br />

UltraZed-EV SOM and the multi-sensor FMC module, which<br />

leverages the ON Semiconductor MARS cameras and Maxim Integrated GMSL serial links. The solution will include

PetaLinux board support package with (1) V4L2 compliant<br />

Linux drivers to access the cameras on the Multi-Camera FMC<br />

module, and (2) pre-compiled OpenCV library for computer<br />

vision functions. The solution will be fully integrated into the<br />

reVISION stack.<br />

V. CONCLUSION<br />

This paper describes a multi-camera methodology for<br />

autonomous vision systems based on the Xilinx reVISION<br />

stack. It includes development resources for algorithm<br />

development, application development and platform<br />

development. The stack is highly optimized for computer<br />

vision and machine learning tasks used in automotive ADAS, UAVs/drones, autonomous guided robots and VR/AR

applications. The core of the solution is a Xilinx Zynq<br />

UltraScale+ MPSoC combining multiple processing units, and<br />

programmable logic. The architecture of the MPSoC is ideal<br />

for a software-hardware co-design with high-bandwidth building blocks in the programmable logic, and lower-bandwidth parts and the operating system running on the processors. The combination of the carrier board with the multi-sensor FMC module and optimized MARS camera modules provides a robust solution for high-resolution multi-camera

development. The platform with reference BSP simplifies the<br />

design and reduces development time and effort significantly.<br />

REFERENCES<br />

[1] DeePhi Deep Learning Platforms. https://www.xilinx.com/video/corporate/deephi-deep-learning-platforms.html. Last visited 19.1.2018.
[2] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. ArXiv e-prints, 2016.
[3] Kortiq Small and Efficient CNN Accelerator. https://www.xilinx.com/video/corporate/kortiq-small-efficient-cnn-accelerator.html. Last visited 19.1.2018.
[4] Synopsys - The impact of AI on autonomous vehicles. Technology webinar, 14.12.2017.
[5] ON Semiconductor Modular Automotive Reference System product page. http://www.onsemi.com/PowerSolutions/content.do?id=18780. Last visited 19.1.2018.
[6] Avnet multi-sensor FMC. Not announced, January 2018. Will be public on http://ultrazed.org.
[7] Maxim Integrated gigabit multimedia serial link (GMSL) serializer and deserializer product page. https://www.maximintegrated.com/en/products/interface/high-speed-signaling/gmsl.html. Last visited 19.1.2018.
[8] FMC card product page. https://www.xilinx.com/products/boards-and-kits/fmc-cards.html. Last visited 19.1.2018.
[9] Xilinx Zynq UltraScale+ MPSoC product overview. https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html. Last visited 19.1.2018.
[10] Avnet UltraZed-EV SOM product page. http://ultrazed.org/product/ultrazed-ev%E2%84%A2-som. Last visited 19.1.2018.
[11] Xilinx ZCU102 evaluation board user guide. https://www.xilinx.com/support/documentation/boards_and_kits/zcu102/ug1182-zcu102-eval-bd.pdf. Last visited 19.1.2018.
[12] Xilinx ZCU104 evaluation board. Not announced, January 2018. Will be public on the Xilinx website.
[13] Avnet UltraZed EV carrier card. Not announced, January 2018. Will be public on http://ultrazed.org.
[14] Xilinx reVISION developer zone. https://www.xilinx.com/products/design-tools/embedded-vision-zone.html. Last visited 19.1.2018.
[15] Xilinx OpenCV User Guide. https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_1/ug1233-xilinx-opencv-user-guide.pdf. Last visited 19.1.2018.
[16] logicBRICKS / Xylon IP Cores. https://www.logicbricks.com/Products/logiADAK-VDF.aspx. Last visited 19.1.2018.



Embedded Vision Systems<br />

used as Sensors in IoT Applications<br />

How vision sensors can be used in different industrial areas<br />

Marcus-Michael Müller<br />

Strategic Business Initiatives<br />

Basler AG<br />

Ahrensburg, Germany<br />

Marcus.Mueller@baslerweb.com<br />

Abstract— To understand how embedded vision systems can<br />

be transformed into sensors and used in IoT applications you<br />

have to comprehend what these systems do and how they are<br />

integrated into the IoT environment. An embedded vision system<br />

consists of a camera module, a processing unit and a kind of<br />

operating system software. With special software running on the embedded vision system that can analyze the pictures or image streams taken by the camera, the embedded vision system can be turned into a vision sensor. This vision sensor delivers metadata derived from the image, and this data can then be used in IoT applications for analytics, interpretation or prediction.

the human vision system for decades and the result we named<br />

computer vision.<br />

In the beginning, the systems were huge and quite<br />

expensive and the processing power was very low. But during<br />

the last 10 years the processing units became smaller, more<br />

powerful and more energy efficient. So the system was made<br />

much smaller and named smart camera or IoT camera system,<br />

if it is connected to a cloud. If you put all this together and use<br />

it with software for a computer vision or machine vision<br />

application then you can call it an embedded vision system.<br />

The combination of cameras and processing boards can be<br />

utilized in several professional B2B applications to collect data derived from images captured by the camera.

In several areas, like retail, factory automation or smart cities<br />

embedded vision systems already help to deliver the required<br />

data for an IoT application. In smart buildings, for example, the current reading of old electric meters can be transmitted to the utility company. All these systems have to be trained, for example

with neural networks to improve the results or to reduce the<br />

setup time of a system.<br />

In none of these cases is the embedded vision system used for capturing or streaming video; a system used that way is simply an IoT or IP camera. The application software on the embedded vision

system is the most relevant tool to derive the data out of images<br />

that are taken by a camera. This data can be used for analytics<br />

and even to start actions or stop machines.<br />

I. FROM HUMAN VISION TO AN EMBEDDED VISION SYSTEM

What do we understand when we talk about a computer<br />

vision system and what is the difference between a vision<br />

system and an embedded vision system?<br />

There is a vision system all of us are familiar with: Our<br />

own eye with its optic nerve in connection with our brain.<br />

A part of this system is used for image recognition, processing<br />

and interpretation – 24h a day. We have tried to rebuild<br />

Fig. 1: From Human Vision to Embedded Vision<br />

www.embedded-world.eu<br />



II. THE DEFINITION OF THE INTERNET OF THINGS<br />

The Internet of Things (IoT) can be described as the intelligent connectivity of physical devices driving massive gains in efficiency, business growth, and quality of life [1]. The IoT can be separated into the Industrial IoT and the Consumer IoT:<br />

• In the Industrial IoT, critical machines and sensors in high-stakes industries (e.g. energy, aerospace, defense or healthcare) are connected, and failures can result in life-threatening or other emergency situations. Special sensors for industrial applications and data aggregation are used, and deep learning leads to more efficiency.<br />

• In the Consumer IoT you will find more consumer-level devices, such as wearable fitness tools, smart home automation, glasses or automatic pet feeders. They are convenient, and breakdowns do not immediately create emergency situations. Networked home appliances can be classified as IoT gadgets, and sometimes they also lead to lower power consumption.<br />

In recent years the Internet of Things has become something of a buzzword, although it just means that devices are connected to the internet. All smartphones and fitness trackers are “things” that deliver data into a cloud, regardless of whether the data is used for any analytics or not.<br />

At the IoT World Forum in Chicago in 2014, the 28 members of the IoT World Forum’s Architecture, Management and Analytics Working Group (made up of Cisco, IBM, Rockwell Automation, Oracle, Intel and a variety of others) presented an IoT Reference Model [2]. This model was proof that the major industry players were working closely together to move the Internet of Things from the realm of hype to something real, with the necessity of an open, standards-based approach.<br />

Fig. 2: IoT World Forum Reference Model [2]<br />

The reference model has seven tiers. Starting at the lowest tier there are physical devices and controllers (the things), then there is connectivity and, above that, edge computing where, for example, you might want to do some initial aggregation, de-duplication and analysis. These lower three levels can be considered operational technology (OT), whereas the remaining four levels are IT. The lowest level in the IT part of the stack is storage, succeeded in turn by data abstraction, applications, and collaboration and (business) processes [3]. According to this model, an embedded vision system is capable of handling the first three levels if the video analytics and processing is done at the edge, in the device.<br />

In the IoT environment it is normally not useful to stream images and analyze them in the cloud, because of the huge amount of data that is generated and the high bandwidth resources that are required. Sometimes real-time interactions are needed, and this means that the data processing has to be done at the edge.<br />

III. WHAT TURNS AN EMBEDDED VISION SYSTEM INTO A VISION SENSOR<br />

Today, many different sensors are used in IoT devices or machines to imitate the different human senses, such as touch, hearing, smell, taste or balance. These sensors help to analyze the current status of a machine, or can measure a heart rate. But the most complex sense is the sense of vision. If an embedded vision system is to be used as a vision sensor, it has to be defined what the vision sensor should be able to “see”, because no existing system is capable of knowing what a human brain knows.<br />

A vision sensor can detect many things in an image or image stream: products, colors, empty or full, correctly or incorrectly assembled, defects, motion, light, dark, barcodes, people, faces or animals.<br />

Fig. 3: Examples of what a trained embedded vision system can see<br />

It is a question of training, analytics and output. Machine vision solutions have been on the market for years, in which high-end computer systems with machine vision cameras extract the information from a picture using libraries and algorithms. These systems are not able to learn and have to be preconfigured. In factory automation such systems are very efficient and optimized for the things they should classify or detect.<br />



Modern embedded processing boards and new algorithms have led to new possibilities and to smaller, more energy-efficient systems. These are more flexible, because they can be trained and improved by using machine learning with neural networks. Training a neural network with machine learning can turn an embedded vision system into a special vision sensor for several different IoT applications. Many companies offer frameworks or special software in the cloud that can reduce the time needed to train a system and improve the results.<br />

A typical vision sensor classifies the images at the edge of the cloud, which means that the pictures are not transferred into the cloud and classified there. A typical classification software would be a neural network. This neural network runs on the embedded vision system itself, transforming it into a vision sensor that delivers only data derived from the classified images. It is also possible to do the analytics in the cloud, but then the system acts like an IP camera that streams images into the cloud, not like a vision sensor.<br />

IV. EXAMPLE: HOW TO SET UP AND TRAIN AN EMBEDDED VISION SYSTEM<br />

For the initial setup and to train the system, one needs an<br />

industrial camera, a processing board and a connection to the<br />

web. Then you have to decide what the system should detect or<br />

classify.<br />

To build a vision sensor that is able to classify images of a defined product, for example, a neural network and a deep learning framework can be used. Such a vision sensor can detect whether a product was assembled correctly or whether it has other defects.<br />

For this, the embedded vision system must be connected to the web to send the pictures into the cloud. Images of different states, such as correctly assembled parts as well as different defects of the product, have to be taken and sent to the cloud as a foundation for training the deep learning model.<br />

In a controlled setting, 100 to 200 images of the same product from different angles are sufficient to train a neural network. These images are uploaded and then classified by human inspectors: the products in the pictures are labeled as correctly or incorrectly assembled, or as defective or good parts.<br />

With these qualified images a data scientist can train a neural network. A deep learning framework such as MXNet, Caffe, TensorFlow or CNTK helps to train the models; with such a framework a neural network can be built much faster. The speed of the training process can be increased by using the power of more processors, which is why it is useful to do the training in the cloud. However, the resulting trained model runs on a simple processing board and can be distributed to the embedded vision system.<br />

Fig. 4: How to train an embedded vision system<br />
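The train-in-the-cloud, run-at-the-edge flow described above can be sketched without any particular deep learning framework. The toy below uses a plain numpy softmax classifier on feature vectors as a stand-in for a real neural network; all data, names and shapes are illustrative and not from the paper. The weight matrix produced by `train` plays the role of the trained model that would be distributed to the board.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(features, labels, classes=2, lr=0.1, epochs=200):
    """'Cloud' side: train a linear softmax classifier (the 'model')."""
    n, d = features.shape
    W = np.zeros((d, classes))
    onehot = np.eye(classes)[labels]
    for _ in range(epochs):
        logits = features @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * features.T @ (p - onehot) / n   # cross-entropy gradient step
    return W

def classify(W, feature):
    """'Edge' side: only the class label leaves the device."""
    return int(np.argmax(feature @ W))

# Synthetic "good part" (class 0) vs "defective part" (class 1) features.
good = rng.normal(loc=+1.0, size=(100, 8))
bad = rng.normal(loc=-1.0, size=(100, 8))
X = np.vstack([good, bad])
y = np.array([0] * 100 + [1] * 100)

W = train(X, y)                                   # done in the cloud
label = classify(W, rng.normal(loc=+1.0, size=8)) # done on the board
```

The point of the split is that training needs many processors and labeled data, while inference is a single matrix product that fits on a small processing board.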

Once the system is trained and the neural network runs on it, products can be classified. If the system is not able to classify a product or status, a picture of this product is uploaded to a special area where a human user can label and classify the pictures manually. As soon as enough new material has been collected, a new model can be trained and deployed on the embedded vision system, so the results can be improved continuously. There are already many pre-trained neural networks available that can be adapted to specific use cases.<br />
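The classify-or-escalate loop just described can be sketched as below. Note that `predict_proba`, the 0.8 confidence threshold and the in-memory labeling queue are all hypothetical stand-ins for the deployed network and the cloud upload area.

```python
import numpy as np

labeling_queue = []  # stand-in for the upload area where humans label images

def predict_proba(image):
    # Placeholder for the deployed model: mean brightness decides between
    # "good" and "defective" with a made-up confidence value.
    score = float(image.mean())
    return {"good": score, "defective": 1.0 - score}

def classify_or_flag(image, threshold=0.8):
    probs = predict_proba(image)
    label, conf = max(probs.items(), key=lambda kv: kv[1])
    if conf >= threshold:
        return label                  # confident: only the label is reported
    labeling_queue.append(image)      # uncertain: queue for manual labeling
    return None

bright = np.full((4, 4), 0.95)        # clearly "good"
murky = np.full((4, 4), 0.5)          # ambiguous

print(classify_or_flag(bright))       # classified on the device
print(classify_or_flag(murky))        # flagged for human labeling
```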

V. B2B APPLICATIONS WHERE EMBEDDED VISION SYSTEMS ARE ALREADY USED AS SENSORS<br />

The following three examples show in which B2B applications embedded vision systems are used, or will be used, as sensors, and how neural networks help to improve the results of these sensors.<br />

In retail, for example, companies want to know who is in the shop or who reacts to an advertisement. They need only the age, gender or attention time of the customer. Special algorithms on the processing board create this data; no image has to be transmitted or stored. The “IoT camera” works as a sensor and sends only data to the cloud.<br />

In factory automation, optical inspection can be done by embedded vision systems, and these systems can be trained with neural networks to improve the results or to reduce the setup time of a system. Think of a case in which the visual inspection system has to detect defective parts or inspect assembled components. For this, the system is trained with pictures that are transmitted to a cloud, where humans analyze and label them. After the system has been trained, it can autonomously improve its results by using machine learning algorithms.<br />

Embedded vision systems in a traffic environment can reduce traffic jams or help drivers to find free parking spaces. The system can be trained to analyze the traffic flow by extracting the relevant information from a stream of images. Traffic data from different vision sensors is sent to the cloud, where it is collected and analyzed in real time. This current traffic information can be sent to mobile devices, and traffic lights can be controlled to reduce traffic jams. The systems can also be trained to detect free parking spaces or number plates, so that free parking spaces are shown on mobile devices and the parking fee is automatically charged to the driver’s account [4].<br />

In all cases the right combination of hardware and software delivers the best results for a specific application: the hardware provides the image quality, and the software is the tool that transforms the images into data.<br />

CONCLUSION<br />

Today embedded vision systems can already be used as vision sensors in several IoT applications, but not as a standard out-of-the-box product or solution. Many companies offer complete IoT platforms for training systems, using deep learning frameworks and artificial intelligence to build neural networks. A neural network on an embedded vision system transforms it into a special vision sensor that is able to classify something in an image.<br />

These platforms can also be used to analyze the combined data of different sensors, for example to improve workflows or the results of a production process.<br />

In 2009 the ImageNet [5] database was published. This database consists of over 10 million images that have been hand-annotated to indicate which objects are pictured. ImageNet is used in visual object recognition research and for the training of neural networks, which also means that a lot of pre-trained models are already available.<br />

A lot of new and innovative solutions utilizing embedded vision systems to imitate the human sense of vision will appear on the market in the future. Internet of Things applications will use the data of these sensors for analytics, predictions or interpretations, improving the results continuously.<br />

REFERENCES<br />

[1] The Internet of Things, Cisco Connect 2015, https://www.cisco.com/web/offer/emear/38586/images/Presentations/P11.pdf<br />

[2] IoT Reference Model, IoT World Forum, June 2014, Author: Jim Green<br />

[3] The Internet of Things Reference Model, January 2015, Author: Philip Howard, https://www.bloorresearch.com/2015/01/the-internet-of-things-reference-model/<br />

[4] IoT Applications in the Smart City, https://www.baslerweb.com/en/vision-campus/markets-and-applications/iot-applications-in-the-smart-city/<br />

[5] ImageNet, https://en.wikipedia.org/wiki/ImageNet<br />



Closing the loop in Additive Manufacturing – An<br />

embedded solution for real-time melt pool monitoring<br />

Christos Theoharatos, Vangelis Vassalos, Dimitrios Besyris and Vassilis Tsagaris<br />

Computer Vision Systems, IRIDA Labs S.A.<br />

26504 Patras, Greece<br />

{htheohar, vassalos, dbes, tsagaris}@iridalabs.gr<br />

Abstract— Direct Metal Deposition (DMD), as part of the rapidly expanding Additive Manufacturing (AM) market, is a delicate process, and any variation in the machine’s condition and process parameters can result in the production of defective parts in terms of surface quality. For this reason, real-time monitoring of the DMD process is required to control part manufacturing. In this work, a novel vision-based solution for closing the loop in AM is presented. Our work is based on the real-time monitoring of the DMD process with a comprehensive vision sensing system that interacts with the machine process algorithms in order to detect and correct deposition errors, leading to an optimal shape of the manufactured part and optimal material properties, and contributing towards zero-defect AM. The interoperable vision system is designed to monitor the size, shape and intensity of the melt pool. The solution performs on-camera image processing directly on the hardware subsystem’s FPGA for closed-loop AM melting process monitoring.<br />

Keywords—Additive manufacturing; melt pool monitoring; laser<br />

metal deposition; FPGA processing;<br />

I. INTRODUCTION<br />

Additive Manufacturing (AM) brought great innovation in<br />

the field of complex shape part manufacturing, since shape<br />

complexity does not imply additional costs and, moreover, the<br />

use of material is globally reduced compared to traditional<br />

technologies [1]. However, despite the evident advantages,<br />

current AM technologies have important drawbacks that<br />

severely challenge their wide adoption. For instance, AM is still<br />

limited to the production of small parts only, the final surface<br />

quality still requires machining as further processing, the scrap<br />

rate is high and the quality certification is an issue. In short, AM<br />

processes are not industrially robust yet.<br />

As part of the AM technology, Direct Metal Deposition<br />

(DMD) is a powder-based process for building up 3D metallic<br />

parts layer-by-layer, in which a laser is used to melt metal<br />

powder onto a substrate [2]. Due to the increased quality<br />

requirement of 3D parts produced through a DMD process,<br />

knowledge of correlations between the main process parameters<br />

such as laser power, laser head velocity, feed rate and powder<br />

mass stream, and the melt pool behavior, is needed [3]. For this<br />

reason, a necessity has nowadays emerged for being able to<br />

monitor the part quality, detect defect formations and make<br />

corrections or repairs in situ, as a part is being built. Therefore,<br />

the real-time monitoring of the melt pool within a laser<br />

deposition process is an essential part of AM, to better<br />

understand and control the thermal behavior of the process, as<br />

well as to detect any unpredicted fault and continuously control<br />

the interior quality of the machine [4, 5].<br />

In this work, a novel, vision-based, solution for closing the<br />

loop in the AM market is presented. Our work is conducted as<br />

part of the BOREALIS and SYMBIONICA H2020 projects and<br />

is based on the real-time monitoring of the DMD process with a<br />

comprehensive vision sensing system that interacts with the<br />

machine process algorithms in order to detect and correct<br />

deposition errors, leading to an optimal shape of the manufactured part and optimal material properties, and contributing towards zero-defect AM. The interoperable vision system is designed to<br />

monitor the size, shape and intensity of the melting pool thanks<br />

to a sophisticated sensing system (camera and spectroscopy<br />

integrated system) and process parameters corrections<br />

elaborated and implemented at Numerical Controller (NC) level.<br />

The solution implements complex image processing algorithms<br />

directly on the hardware subsystem’s FPGA for closed-loop AM<br />

melting process monitoring. Using the DMD monitoring system<br />

that will be described in the sequel, results of the melt pool<br />

monitoring procedure and parameter estimation will be<br />

presented on data coming from different DMD processes,<br />

followed by distinctive metrics representing the resources<br />

occupied within the FPGA programming unit such as Look-Up<br />

Tables (LUTs), Flip-Flop pairs (FFs), Block-RAMs (BRAMs)<br />

and embedded Arithmetic and Logic Units (e-ALUs). In<br />

addition, the algorithmic methodology for effectively<br />

segmenting the melt pool distribution, as well as the entire DMD<br />

distribution (that is, the melt pool plus the tail that is depicted<br />

due to the movement of the laser head and the associated melting<br />

process) will be shortly presented.<br />

II. HARDWARE METHODOLOGY AND DATA<br />

A. Hardware Set-up<br />

The associated hardware subsystem comprises an Optronis CXP6 high-speed camera with 540 fps at 1696×1708<br />

resolution, equipped with a specific optical system, which is<br />

positioned inside the machine tool head to capture images in-axis with the laser beam. The camera is connected through a<br />

CoaXPress interface to a frame grabber which utilizes a Xilinx<br />

FPGA for performing the melt pool monitoring pipeline in real time.<br />

The frame grabber focuses on high-speed image<br />

acquisition with up to four CoaXPress cameras, giving access to<br />

advanced machine vision applications. The design process followed consists of three main phases. In the first phase, the<br />

algorithm was developed and simulated using artificial data.<br />

This phase ended with the development of bit-accurate models<br />

for each hardware component in the datapath. The second phase<br />

includes the implementation of the hardware components and<br />

the simulation of the whole design. After successful completion<br />

of these two steps, we used the design tools provided by the<br />

FPGA vendor for synthesis and place-and-route procedures so<br />

as to produce the final bitstream file to program the FPGA,<br />

included in the frame grabber. In the third phase, a software<br />

(C++) application was developed in order to manipulate the<br />

results from the FPGA processing, upload them to the Host DDR<br />

memory, package them and send them to the Numerical<br />

Controller.<br />
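A back-of-the-envelope calculation shows why the frames are processed on the frame grabber's FPGA rather than streamed to a host. The 8-bit monochrome pixel assumption below is illustrative; the text does not state the camera's bit depth.

```python
# Raw data rate of the camera described above, assuming 8-bit mono pixels.
width, height, fps = 1696, 1708, 540
bytes_per_pixel = 1  # assumption, not stated in the paper

frame_bytes = width * height * bytes_per_pixel
rate_bytes_per_s = frame_bytes * fps

print(f"{frame_bytes / 1e6:.1f} MB per frame")          # about 2.9 MB
print(f"{rate_bytes_per_s / 1e9:.2f} GB/s raw stream")  # about 1.56 GB/s
```

At roughly 1.5 GB/s of raw pixels, reducing each frame to a handful of statistics on the FPGA is far cheaper than moving the images anywhere.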

B. Equipment and Data Description<br />

In order to carry out the experimental investigation, the DMD process is provided by a Laserdyne small-scale machine equipped with a 1 kW CW fiber laser coupled with a two-axis scan<br />

head. An innovative deposition process with high speed scanner<br />

has been developed, based on the high frequency movement of<br />

the laser beam for high energy efficiency controlling melt pool<br />

size and dimensions. The ablation process makes use of a 100W<br />

infrared pulsed laser source in the ns range. For the initial<br />

experimental analysis of the melt pool geometry and intensity,<br />

single tracks and circular movements were built up, using the<br />

nominal values of the machine’s physical parameters.<br />

III. IMAGE PROCESSING AND PARAMETER ESTIMATION<br />

In this section, the methodology for monitoring the DMD<br />

process and extracting the necessary statistical parameters is<br />

presented. In order to do that, the main objective is to robustly<br />

segment each frame into two overlapping regions R1 and R2, where R1 is the melt pool area and R2 is the entire distribution, that is,<br />

the melt pool along with the tail. Following that, the extraction<br />

of various geometrical shape features / properties of each region<br />

that assist in quantifying their size and shape, as well as intensity<br />

features of the two regions, is carried out. In short, our algorithm<br />

can be summarized in the flowchart of Fig. 1 and is briefly<br />

explained in the following subsections.<br />

A. Frame Recording<br />

Initially, video recording is adjusted in order to provide<br />

focused and well-contrasted images of the melt pool. This is<br />

done by setting suitable values to the camera parameters such as<br />

exposure time and frame rate (the latter value is limited by the<br />

maximum operational frequency achieved by the FPGA design),<br />

given all possible values within the operational range of energy<br />

intensities. Figs. 2 (a) and (b) illustrate two typical frames from<br />

different set-ups of the DMD process, related to a laser power of<br />

300W and a deposition speed of 500 mm/min. In both frames,<br />

the two distributions, that is the melt pool (i.e. the brightest<br />

region) and the melt pool plus the tail (i.e. the entire distribution<br />

that is separated from the background), that need to be<br />

segmented from each frame are clearly distinguished.<br />

B. Image Binarization<br />

Next, image binarization is performed for initially<br />

segmenting the images into the two overlapping regions R1 and R2 defined previously. The segmentation is performed by thresholding the original image at two different levels, t1 and t2.<br />

Therefore, thresholding is used to segment the current frame, by<br />

setting all pixels whose intensity values are above a threshold to<br />

a foreground value and all the remaining pixels to a background<br />

value. Formally,<br />

R1 = {(x, y) ∈ ROI | I(x, y) > t1}  (1)<br />

R2 = {(x, y) ∈ ROI | I(x, y) > t2}  (2)<br />

The simplest way to use image binarization is to choose a<br />

global threshold value, and classify all pixels with values above<br />

this threshold as white and all others as black. This approach is<br />

problematic since there is no standard way of effectively<br />

selecting the correct thresholds, given the intensity variations of<br />

the melt pool images when altering the physical parameters of<br />

the system (i.e. energy power, deposition rate etc.). Therefore,<br />

an adaptive multi-thresholding technique was implemented for<br />

changing the threshold dynamically over the images and<br />

segmenting the two regions R1 and R2 from the background.<br />

Local adaptive thresholding selects an individual threshold<br />

for each pixel, based on the range of intensity values in its local<br />

neighborhood. This allows for thresholding an image whose<br />

global intensity histogram doesn't contain distinctive peaks. In<br />

our case, adaptive thresholding can provide better estimation in<br />

case the internal physical parameters of the laser deposition<br />

process change over time. The most well-known and commonly<br />

utilized adaptive thresholding technique is Otsu’s method [6],<br />

which assumes the image contains two classes of pixels -<br />

foreground and background, and has a bi-modal histogram. In<br />

our case, a two-step procedure was adopted; after separating the<br />

background pixels, foreground was further segmented into two<br />

distributions, the melt pool and the overall distribution<br />

containing the tail. The effectiveness of the method is due to its<br />

attempt to minimize their combined spread (intra-class<br />

variance). Figs. 2 (c) and (d) present the melt pool segmentation<br />

results of the images in Figs. 2 (a) and (b), while Figs. 2 (e) and<br />

(f) present the same results for the entire distribution.<br />
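The two-step Otsu procedure described above can be sketched in numpy as follows. The synthetic three-level image (background, tail, pool) and the thresholds it produces are illustrative only, not data from the paper.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Classic Otsu: return the cut maximizing inter-class variance."""
    hist, edges = np.histogram(values, bins=bins, range=(0.0, 1.0))
    p = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(p)                 # probability of the class below the cut
    mu = np.cumsum(p * centers)       # cumulative mean
    mu_t = mu[-1]
    w1 = 1.0 - w0
    valid = (w0 > 0) & (w1 > 0)
    sigma_b = np.zeros_like(w0)       # between-class variance per candidate cut
    sigma_b[valid] = (mu_t * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    return edges[np.argmax(sigma_b) + 1]   # upper edge of the best bin

# Synthetic frame: dark background, warmer "tail", hot "melt pool".
img = np.full((64, 64), 0.05)
img[20:50, 20:50] = 0.5        # tail region (900 px, including the pool area)
img[30:40, 30:40] = 0.9        # melt pool (100 px)

t2 = otsu_threshold(img.ravel())       # step 1: background vs. foreground
t1 = otsu_threshold(img[img > t2])     # step 2: tail vs. melt pool
R2 = img > t2                          # entire distribution
R1 = img > t1                          # melt pool only
```

Because the second threshold is computed only on foreground pixels, it adapts automatically when laser power or deposition rate shift the overall intensity level.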

C. Morphological Filtering<br />

In order to refine the binarization results and filter out sparks<br />

that are apparent due to powder and particle reflections,<br />

morphological filtering is applied for smoothing both regions.<br />

This is a non-linear operation related to the shape or morphology of features in an image; it relies only on the relative ordering of pixel values, not on their numerical values, and is therefore especially suited to the processing of binary images [7].<br />

Morphological filters may act as filters of shape, filtering out any<br />

details that are smaller in size than the structuring element. A<br />

combination of “opening” and “closing” operations using<br />

different structural elements is utilized here for refining the<br />

segmentation result, which is illustrated in Figs. 2 (g), (h) and<br />

(i), (j) for the melt pool and the entire distribution respectively.<br />

As it is clearly demonstrated, the sparks are effectively filtered<br />

out, and therefore the resulting melt pool distribution is segmented<br />

correctly in order to extract the necessary geometry parameters.<br />
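A minimal numpy sketch of the "opening" operation (erosion followed by dilation) with a 3×3 square structuring element shows how isolated sparks vanish while a large melt pool blob survives; the binary image below is synthetic.

```python
import numpy as np

def shift_stack(mask):
    """Stack the 3x3 neighborhood of every pixel (zero-padded borders)."""
    padded = np.pad(mask, 1)
    return np.stack([padded[i:i + mask.shape[0], j:j + mask.shape[1]]
                     for i in range(3) for j in range(3)])

def erode(mask):
    return shift_stack(mask).all(axis=0)   # pixel kept only if all 9 neighbors set

def dilate(mask):
    return shift_stack(mask).any(axis=0)   # pixel set if any of 9 neighbors set

def opening(mask):
    return dilate(erode(mask))

mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True    # large blob: the melt pool
mask[2, 2] = True          # isolated "spark" pixel

cleaned = opening(mask)
# The single-pixel spark is eroded away and never comes back; the 16x16
# blob is shrunk by one pixel and then restored by the dilation.
```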



Read Frame In → Image Binarization → Morphological Filtering → Shape Statistics / Intensity Statistics → Average Filtering → Output Pn<br />

Fig. 1. Flowchart of the algorithmic pipeline.<br />


Fig. 2. (a), (b) Melt pool images of the DMD process; Binarized images of (c), (d) the melt pool and (e), (f) the melt pool plus the tail; Finally segmented images<br />

following the morphological operators of (g), (h) the melt pool and (i), (j) the melt pool plus the tail respectively.<br />

D. Extraction of Statistical Parameters<br />

In the next step, a variety of shape and intensity statistics are<br />

extracted at the back end of the image pipeline system from both<br />

regions in order to quantify their properties and be fed to a<br />

Numerical Controller, so as to be used and transformed to<br />

physical parameters of the DMD process. The set of features /<br />

parameters that are extracted for both regions R1 and R2 includes the<br />

Major and Minor Axis Lengths, the Area, the Center of Gravity<br />

(or Mass) and the Average Intensity. Apart from the above<br />

statistical parameters, other quite interesting and important<br />

features can be extracted from the two distributions like the<br />

perimeter, which is the path that surrounds the two-dimensional<br />

shape, the aspect ratio, which is the ratio of its sizes in different<br />

dimensions and describes the proportional relationship between<br />

its width (major axis) and its height (minor axis), the<br />

eccentricity, which can be thought of as a measure of how much<br />

the conic section deviates from being circular, or other intensity<br />

features such as the maximum, minimum and median intensities.<br />
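The listed region statistics can be sketched in numpy as follows. Deriving the major and minor axis lengths from the eigenvalues of the pixel-coordinate covariance matrix is the standard moments-based approach; it is assumed here, since the paper does not spell out its exact formulas.

```python
import numpy as np

def region_stats(mask, image):
    """Area, center of gravity, mean intensity and axis lengths of a region."""
    ys, xs = np.nonzero(mask)
    area = len(xs)
    cx, cy = xs.mean(), ys.mean()                 # center of gravity
    mean_intensity = image[mask].mean()
    cov = np.cov(np.stack([xs, ys]))              # 2x2 coordinate covariance
    evals = np.sort(np.linalg.eigvalsh(cov))[::-1]
    # Axis lengths of the ellipse with the same second moments.
    major, minor = 4.0 * np.sqrt(evals)
    return {"area": area, "centroid": (cx, cy),
            "mean_intensity": mean_intensity,
            "major_axis": major, "minor_axis": minor}

img = np.zeros((40, 40))
mask = np.zeros((40, 40), dtype=bool)
mask[10:20, 5:35] = True       # elongated 10 x 30 "melt pool" stand-in
img[mask] = 0.8

stats = region_stats(mask, img)
```

Running the same function on R1 and R2 yields the per-frame parameter set that is packed and sent to the Numerical Controller.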

E. Average Filtering<br />

As a last step, a temporal moving average filter is applied on<br />

a sequence of statistical values in order to smooth out the<br />

temporal variation, like a rapid variation or movement of the<br />

melt pool image or even a large spark that is not entirely filtered<br />

out by the morphological operator. For example, after<br />

calculating the area at frame i, its value is set to be the average<br />

of our estimation and the k previous values. The value of k<br />

depends on the amount of regularization we need to perform.<br />

Typically a value of k = 3 can smooth the results without losing<br />

much temporal information.<br />
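The temporal averaging step can be sketched as below; the area sequence with a spark-induced outlier is made up for illustration.

```python
def smooth(values, k=3):
    """Replace each value by the mean of itself and up to k previous values."""
    out = []
    for i, v in enumerate(values):
        window = values[max(0, i - k):i + 1]   # current + up to k previous
        out.append(sum(window) / len(window))
    return out

areas = [100, 102, 98, 400, 101, 99]   # frame 3 carries a spark artifact
print(smooth(areas))                   # the 400 outlier is damped to 175
```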

IV. RESULTS AND DISCUSSION<br />

Figure 3 illustrates a snapshot of the process parameters<br />

estimation for a single DMD frame, in which intensity and<br />

geometrical features are shown in graphical mode. Regarding<br />

the hardware design, the following operations were<br />

implemented on the FPGA, as described in Section III, that is:<br />

• Frame acquisition, with the parallelism level set to x8 (instead of x20, the hardware operator default) to minimize HW resource utilization. It must be noted that a parallelism reduction saves FPGA resources but also lowers the overall bandwidth (BW) of the link, since BW = ClockFrequency × Parallelism.<br />

• Full multi-adaptive threshold-based segmentation to produce the two binarized distributions.<br />

• Morphological filters on each distribution.<br />

• Computation of shape statistics for each distribution, along with the melt pool centroid distances from the back end and the front end of the tail distribution.<br />

• Average intensity computation for both distributions.<br />

• Packing of the results into 2×8 vectors for transfer to the host DDR memory.<br />
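The parallelism/bandwidth trade-off noted above, put into numbers. Only the proportionality BW = ClockFrequency × Parallelism comes from the text; the pixel clock value is an assumption for illustration.

```python
# Illustrative pixel clock; CoaXPress link clocks vary by configuration.
clock_hz = 62.5e6

bw_x20 = clock_hz * 20   # default parallelism of the hardware operator
bw_x8 = clock_hz * 8     # reduced parallelism chosen in the design

print(f"x20: {bw_x20 / 1e9:.2f} Gpixel/s, x8: {bw_x8 / 1e9:.2f} Gpixel/s")
print(f"bandwidth retained at x8: {bw_x8 / bw_x20:.0%}")
```

Dropping from x20 to x8 keeps only 40 % of the link bandwidth, which is the price paid for the FPGA resource savings in Table I.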

Table 1 provides the percentage of the FPGA resources’ fill<br />

levels of each processing module separately, as well as the total<br />

amount of the complete design, after successful Synthesis and<br />

Place-and-Route procedures using XILINX ISE v14.7.<br />



Fig. 3. Snapshot of the process parameter estimation.<br />

TABLE I. FPGA RESOURCE UTILIZATION<br />

Hardware Operators                     | Resources Utilization (%)<br />
                                       | LUTs | FFs | BRAMs | e-ALUs<br />
Frame Acquisition                      |  19  |  9  |  12   |  10<br />
Full Adaptive Thresholding             |  12  |  7  |  27   |   3<br />
Morphological Filters (both distr.)    |  10  |  2  |   5   |   0<br />
Shape Statistics Tail                  |   5  |  3  |   0   |   0<br />
Shape Statistics Melt Pool             |   2  |  2  |   0   |   0<br />
Intensity Statistics (for both distr.) |   6  |  4  |   3   |   1<br />
Merging Statistics & DmaToPC           |   6  |  7  |   0   |   0<br />
TOTAL                                  |  60  | 34  |  47   |  14<br />

V. CONCLUSIONS<br />

A real-time software and hardware implementation is<br />

presented here for monitoring the DMD process, based on a<br />

comprehensive vision sensing system that interacts with the<br />

machine process algorithms in order to detect and correct<br />

deposition errors. The vision system is targeted to monitor the<br />

size, shape and intensity characteristics of the melting pool,<br />

performing on-camera image processing directly on the<br />

hardware subsystem’s FPGA for closed-loop AM monitoring.<br />

Live monitoring of the melt pool geometry on the working<br />

surface during the deposition makes it possible to optimize the overall<br />

process since the camera’s optical information can provide, after<br />

processing, measurements of the melt pool intensity and<br />

geometry and tune, in real time, the process parameters. Results<br />

of the melt pool monitoring procedure and parameter estimation<br />

are presented on data coming from a specific DMD process,<br />

followed by distinctive metrics of the resources occupied within<br />

the FPGA device such as LUTs, FFs, BRAMs and e-ALUs. As<br />

a next step, the association and correlation between the melt pool<br />

behavior and process parameters like laser power, laser head<br />

velocity, feed rate and powder mass stream will be presented.<br />

ACKNOWLEDGMENT

This work has been financed by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 678144, the "SYMBIONICA" project.




Addressing the Challenges of Creating Infra-Red Vision Systems for the IIoT and IoT

Adam P Taylor
Director
Adiuvo Engineering & Training Ltd
Harlow, United Kingdom
Adam@adiuvoengineering.com

Abstract—Embedded vision is ubiquitous, used across a range of applications, for example ADAS, vision-guided robotics and, of course, the IoT and IIoT. Along with the visible spectrum, embedded vision systems increasingly rely upon the wider electromagnetic spectrum, such as the infra-red spectrum. This spectrum is used in IIoT applications for monitoring the temperature of key equipment to provide prognostics, and in IoT applications where it enables users to see in low-light conditions or to measure the temperature of a sleeping baby remotely.

Keywords—IIoT, IoT, FPGA, SoC, Infra-Red, Embedded Vision.

I. INTRODUCTION

One of the advantages of embedded vision systems is their ability to observe wavelengths outside those visible to humans. This enables an embedded vision system to provide superior performance across a range of applications and deployments.

Two common deployments of embedded vision systems are within the Internet of Things (IoT) and its industrial counterpart, the Industrial Internet of Things (IIoT). Indeed, IoT and IIoT deployments continue the trend of ubiquity of embedded vision. IoT and IIoT applications are diverse: IoT deployments include monitoring, security and surveillance, while IIoT applications are dominated by Industry 4.0 solutions, including positioning, guidance, identification and inspection.

Many IoT and IIoT applications benefit from imaging outside the visible spectrum, utilizing the infra-red element of the electromagnetic spectrum. Using infra-red enables the embedded vision system to sense background thermal radiation. As the imager works with the background thermal radiation, no scene illumination is required, making IR solutions ideal for imaging in total darkness or poor visibility, and hence well suited to industrial, automotive and security applications. The use of IR sensors also allows the creation of thermographic applications which accurately measure the temperature of the scene contents. One example application is in renewable energy, where IR imaging can be combined with drones to monitor the performance of solar arrays and detect early failures from the increasing temperature of failing elements.

Working outside the visible range requires the correct selection of imaging device technology. If the system operates within the near-IR spectrum or below, we can use devices such as Charge Coupled Devices (CCDs) or CMOS 1 (Complementary Metal Oxide Semiconductor) Image Sensors (CIS); however, as we move further into the infra-red spectrum we need to use specialized IR detectors.

The need for specialized sensors in the IR domain is in one part due to the excitation energy required by silicon-based imagers such as CCDs or CIS. These typically require a photon energy of about 1 eV to excite an electron, whereas at IR wavelengths photon energies range from 1.7 eV down to 1.24 meV. As such, IR imagers tend to be based upon HgCdTe or InSb, which have lower excitation energies and are often combined with a CMOS readout IC (ROIC) to control and read out the sensor.
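As a quick sanity check on these numbers, the photon energy follows directly from E = hc/λ. The sketch below is purely illustrative (it is not part of any sensor driver) and uses the constant hc ≈ 1.2398 eV·µm:

```c
#include <math.h>

/* Photon energy in electron-volts for a wavelength given in
 * micrometres: E = h*c / lambda, with h*c ~= 1.2398 eV*um. */
static double photon_energy_ev(double wavelength_um)
{
    const double hc_ev_um = 1.2398; /* Planck constant x speed of light */
    return hc_ev_um / wavelength_um;
}
```

At 0.7 µm (the red edge of the visible band) this gives roughly 1.77 eV, which silicon can detect; at 1 mm (the far-IR limit) it gives roughly 1.24 meV, far below the excitation energy of silicon, matching the range quoted above.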

II. COOLED OR UNCOOLED

IR systems fall into two categories: cooled and uncooled. Cooled thermal imagers use image sensor technology based upon HgCdTe or InSb semiconductors. To provide useful images, a cooled thermal imager requires a cooling system which brings the sensor down to a temperature of 70 to 100 Kelvin. This is required to reduce the thermal noise generated by the sensor to below that generated by the scene contents. Using a cooled sensor therefore brings increased complexity, cost and weight for the cooling system, and the system also takes time (several minutes) to reach operating temperature and generate a usable picture.
temperature and generate a useable picture.<br />

Uncooled IR sensors can operate at room temperature and use microbolometers in place of an HgCdTe or InSb sensor. In a microbolometer, each pixel changes resistance when IR radiation strikes it, and this resistance change defines the temperatures in the scene. Typically, microbolometer-based thermal imagers have much-reduced resolution compared to a cooled imager. They do, however, make thermal imaging systems simpler, lighter and less costly to create. For this reason, many IoT and IIoT applications will use uncooled image sensors like the FLIR Lepton.

1 We can use different coatings upon the imaging device to affect its wavelength performance.

The radio module provides WiFi and Bluetooth communications for wireless connectivity, while the programmable logic is used to receive VoSPI, perform direct memory access with the DDR, and output video for a local display. The high-level architecture of the solution is shown in figure 2.

One additional concern is export compliance: cooled thermal imagers offer higher performance and resolution than their uncooled counterparts, and as such cooled thermal imaging solutions are often subject to stricter export compliance regimes than uncooled solutions, restricting the available markets.

Creating an uncooled thermal imager presents a range of challenges for the embedded vision designer. It requires flexible interfacing to connect to the selected device and display, along with the processing capability to implement any additional image processing on the video stream. Of course, as many of these devices are hand-held or power constrained, power efficiency also becomes a significant driver. The solution must also be secure, both remotely via its internet connection and physically.

III. ARCHITECTURE

The FLIR Lepton is a thermal imager which operates in the long-wave IR spectrum. It is a self-contained camera module with a resolution of 80 by 60 pixels (Lepton 2) or 160 by 120 pixels (Lepton 3). Configuration of the Lepton is performed over an I2C bus, while the video is output over SPI using a Video over SPI (VoSPI) protocol. These interfaces make it ideal for use in many embedded systems which require the ability to image in the IR region.
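For reference, a minimal sketch of VoSPI packet handling is shown below. It assumes the Lepton 2 packet layout from FLIR's VoSPI documentation (164-byte packets: a 2-byte ID, a 2-byte CRC and 160 payload bytes, with discard packets marked by 0xF in the low nibble of the first ID byte); the function names are illustrative, not from any vendor driver:

```c
#include <stdint.h>
#include <stdbool.h>

#define VOSPI_PACKET_BYTES 164u /* 2-byte ID + 2-byte CRC + 160-byte payload */

/* A VoSPI packet is a "discard" packet when the low nibble of the
 * first ID byte is 0xF; otherwise the ID's low 12 bits carry the
 * packet (line) number within the frame. */
static bool vospi_is_discard(const uint8_t *pkt)
{
    return (pkt[0] & 0x0F) == 0x0F;
}

static uint16_t vospi_packet_number(const uint8_t *pkt)
{
    return (uint16_t)(((pkt[0] & 0x0F) << 8) | pkt[1]);
}

/* Frame sync: skip discard packets until packet number 0 arrives. */
static bool vospi_is_frame_start(const uint8_t *pkt)
{
    return !vospi_is_discard(pkt) && vospi_packet_number(pkt) == 0;
}
```

In the design described here, the packets themselves arrive through the Quad SPI core; this logic only decides which ones begin and belong to a valid frame.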

One example combines the Lepton with a Xilinx Zynq Z7007S device mounted on a MiniZed development board. As the MiniZed board supports WiFi and Bluetooth, it is possible to create both IIoT/IoT applications and traditional imaging solutions with a local display, in this case a 10-inch touch display.

Figure 2 High Level Architecture

Within the image processing pipeline, we can instantiate custom image processing functions generated using High Level Synthesis, or use pre-existing IP blocks such as the Image Enhancement core, which provides noise filtering, edge enhancement and halo suppression.

This high-level architecture requires translation into a detailed design within Vivado; the following IP blocks are used to create the hardware solution:

• Quad SPI Core – configured for single-mode operation; receives the VoSPI from the Lepton
• Video Timing Controller – generates the video timing signals for the output display
• VDMA – reads an image from the PS DDR into a PL AXI Stream
• AXI Stream to Video Out – converts the AXI streamed video data to parallel video, with timing syncs provided by the Video Timing Controller
• Zed_ALI3_Controller – display controller for the 7-inch touch screen display
• Wireless Manager – provides interfaces to the radio module for Bluetooth and WiFi. While not used in this example, including this module within the hardware design means that adding wireless communications later requires only additional software development.

When these IP blocks are combined with the Zynq processing system and the necessary AXI interconnect IP, we obtain a detailed hardware design as shown in figure 3.

Figure 1 MiniZed & FLIR Lepton

To create a tightly integrated solution, we can use the processing system (PS) of the Zynq to configure the Lepton using the I2C bus. The PS also provides an interface to the radio module.


This limits access to hardware functions within the Zynq, including programmable logic (PL) peripherals.

The XADC provided within the Zynq device provides the ability to monitor both device temperature and voltages, raising alarms should user-specified limits be breached, along with the ability to monitor external anti-tamper features. These features, provided by the underlying architecture of the Zynq, create a sound base upon which higher-level, software-based security solutions can be implemented.
software based security solutions can be implemented.<br />

Figure 3 Detailed hardware design in Vivado

IV. SECURITY SOLUTION

When developing an IoT or IIoT solution, we need to ensure the solution is secure from malicious hackers, unauthorized access and modification. A secure solution for the IoT or IIoT should, as a minimum, provide:

• Secure Boot – the ability to decrypt an encrypted boot image. Secure boot should also provide cryptographic authentication of the image.
• Authentication – only authorized users should be able to connect to the IoT/IIoT system. Strong passwords and authentication protocols should be used.
• Secure Communication – communication to and from the IoT/IIoT device should be encrypted.
• Secure Data – data stored within the system should be secure; encryption standards such as AES, Simon or Speck can be used to secure data.
• Anti-Tamper – able to detect unauthorized access attempts to the system. This may include monitoring the presence of enclosure lids, device voltages and temperatures.
voltages and temperatures.<br />

The Zynq device provided on the MiniZed enables the implementation of a secure solution. The Zynq is capable of securely booting both the PS and the PL with a three-stage process comprising Hashed Message Authentication Code (HMAC), Advanced Encryption Standard (AES) decryption and RSA authentication. Both the AES and HMAC use 256-bit private keys, while the RSA uses 2048-bit keys; the security architecture of the Zynq also allows JTAG access to be enabled or disabled.

These security features are enabled when generating the boot file and the configuration partitions for the non-volatile boot media. It is also possible to define a fall-back partition such that, should the first-stage boot loader fail to load its application, it will fall back to another copy of the application stored at a different memory location.
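The fall-back behaviour can be sketched as follows. The addresses and the load function here are hypothetical stand-ins for the real first-stage boot loader logic, which would perform the authentication and decryption steps described above before accepting an image:

```c
#include <stdbool.h>

/* Hypothetical image-load routine: returns true if the application
 * image at 'addr' loads and passes its integrity/authenticity checks. */
typedef bool (*load_fn)(unsigned long addr);

#define PRIMARY_ADDR  0x00100000ul /* illustrative partition addresses */
#define FALLBACK_ADDR 0x00200000ul

/* Try the primary partition first and, should it fail, fall back to
 * the copy at a different location. Returns the address that was
 * booted, or 0 if both fail (enter a secure locked-down state). */
static unsigned long boot_with_fallback(load_fn load)
{
    if (load(PRIMARY_ADDR))
        return PRIMARY_ADDR;
    if (load(FALLBACK_ADDR))
        return FALLBACK_ADDR;
    return 0;
}

/* Test doubles for the two scenarios. */
static bool all_good(unsigned long addr)    { (void)addr; return true; }
static bool primary_bad(unsigned long addr) { return addr != PRIMARY_ADDR; }
```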

Once the device is successfully up and running, further security can be implemented using the Arm TrustZone architecture within the PS to implement orthogonal worlds.

V. SW DEFINITION

Most of the IP blocks included within the Vivado design require configuration by application software developed within SDK. This provides the flexibility to change operational parameters as the product evolves, for example accommodating a larger display or changing the sensor from the Lepton 2 to the Lepton 3. The application software configures the video timing via the Video Timing Controller, and configures the Video Direct Memory Access controller to read frames from the memory-mapped DDR and convert them into an AXI Stream compatible with the image processing pipeline.

Following the initialization of the IP blocks, the application software performs the following:

• Configures the FLIR Lepton to perform Automatic Gain Control (AGC)
• Synchronises with the VoSPI data to detect the start of a valid frame
• Applies a digital zoom to scale up the image to use the 800-pixel by 480-line display efficiently. This can be achieved by outputting each pixel either 8 or 4 times, depending upon the sensor selected.
• Transfers the frame to the DDR memory. As the FLIR Lepton only outputs 8-bit data when AGC is enabled, this is mapped to the green channel of the RGB display.
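The digital zoom and green-channel mapping in the steps above can be sketched as a nearest-neighbour replication. This is an illustrative implementation, not the project's actual SDK code:

```c
#include <stdint.h>

/* Replicate each 8-bit AGC pixel 'factor' times horizontally and
 * vertically (nearest-neighbour zoom), packing the value into the
 * green channel of a 24-bit 0x00RRGGBB output word. The factor
 * would be 8 for the Lepton 2 (80x60) or 4 for the Lepton 3
 * (160x120) on an 800x480 display. */
static void zoom_to_green(const uint8_t *src, int w, int h,
                          uint32_t *dst, int factor)
{
    int dw = w * factor; /* destination width in pixels */
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            uint32_t rgb = (uint32_t)src[y * w + x] << 8; /* green only */
            for (int dy = 0; dy < factor; dy++)
                for (int dx = 0; dx < factor; dx++)
                    dst[(y * factor + dy) * dw + (x * factor + dx)] = rgb;
        }
}
```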

When the completed program is executed on the MiniZed with the FLIR Lepton connected and outputting to a 10-inch touch-sensitive display, the output of the FLIR can be seen very clearly, as demonstrated in figure 4.

The application above, however, only addresses the lower-level software controlling the imager. To communicate over the internet, the MiniZed's WiFi capabilities need to be used. To do this we need an operating system which provides not only the appropriate WiFi stack but also allows the implementation of the authentication, secure communication and overall security solution. This is provided by updating the PetaLinux operating system running on the MiniZed. PetaLinux is a Linux distribution provided by Xilinx for the Zynq and Zynq UltraScale+ MPSoC devices. With the PetaLinux OS updated, the MiniZed's WiFi capabilities can be used to communicate images captured from the FLIR Lepton.

Figure 4 Final system, connected to display

VI. CONCLUSION

Imaging within the IR domain provides a very significant benefit in many IoT and IIoT applications. The creation of an imaging system based upon an uncooled thermal imager presents a number of challenges in interfacing, security, power efficiency and performance. Heterogeneous SoCs allow us to create a solution which is flexible, secure and power efficient.

VII. AUTHOR BIOGRAPHY

Adam Taylor is a world-recognized expert in the design and development of embedded systems and FPGAs for several end applications. Throughout his career, Adam has used FPGAs to implement a wide variety of solutions from RADAR to safety-critical control systems, with interesting stops in image processing and cryptography along the way. He currently holds an executive position within a major European defence company. Prior to that he was most recently the Chief Engineer of a space imaging company, responsible for several game-changing projects. Adam is the author of numerous articles on electronic design and FPGA design, including over 230 blogs on how to use the Zynq & Zynq MPSoC for Xilinx. Adam is a Chartered Engineer and Fellow of the Institution of Engineering and Technology; he is also the owner of the engineering and consultancy company Adiuvo Engineering and Training (www.adiuvoengineering.com).



Securing tomorrow's IoT devices: the new potential for integrating sophisticated security functions into the microcontroller

Jack Ogawa
Senior Director of Marketing, MCU Business Unit
Cypress Semiconductor Corp.
San Jose, California

Abstract—Internet of Things (IoT) devices, which transmit and receive data and commands over the world's universal network, are exposed to a far greater variety and number of threats than earlier products that supported older machine-to-machine (M2M) communication, typically over a closed, private network. The security functions and resources required to protect an IoT device against these security threats are today available in specialized, discrete ICs such as:

• a secure element – a system-on-chip combining a microcontroller with on-board cryptographic capabilities, secure memory and interfaces
• secure non-volatile memory ICs, which typically feature a cryptographic engine for pairing the memory securely to authorized devices

However, the use of such discrete ICs in IoT devices has the effect of increasing their component count, complexity and bill-of-materials cost compared to designs that use the integrated security capabilities of the host MCU (or in some cases an applications processor). The crucial question for IoT device designers, then, is whether the capabilities of the host MCU are sufficient to counter the threats of spoofing, tampering, repudiation, information disclosure, denial of service and elevation of privilege.

Keywords—IoT; IoT Security; MCU Security; Microcontroller Security; Data Integrity; Trusted Firmware; Encryption; Firmware Authenticity; Malware; Inter-Processor Communication

I. INTRODUCTION

By all accounts, IoT (Internet of Things) devices are forecast to become ubiquitous. IoT devices, powered by semiconductors, will make every imaginable process smart. From simply turning on a light to more complex processes such as outpatient care or factory control, IoT devices utilizing sensing, processing, and cloud connectivity will dramatically improve their effectiveness. IoT device applications are diverse, and their promise and impact are quite literally unbounded.

The ubiquitous application of IoT devices introduces security challenges. For example, traditional lighting control is relatively primitive: it's a power circuit with a physical switch. Operating the switch requires physical proximity. Securing this process against unauthorized use simply requires physical protection of the switch. Now consider lighting control in its smart incarnation as an IoT device. The physical switch is replaced by light and proximity sensors, logic (typically implemented in a microcontroller, or MCU), and wireless connectivity to a Cloud-based application. In becoming smart (enlightened!), a light switch is transformed into an embedded client that works with an application server through a network. Securing the smart light switch has become much more complicated. The good news is that secure microcontrollers can greatly enhance the security of the IoT device and accelerate the design cycle.

This paper examines a method for determining the security requirements of an IoT device and presents Cypress' PSoC 6 secure MCU as a solution that meets these requirements.

II. IOT DEVICE SECURITY ANALYSIS

The idea of securing IoT devices can be daunting. A bit of research immediately reveals large bodies of knowledge regarding cryptography, threats, security objectives, and myriad other subjects. Faced with this overwhelming information, often the first question IoT device designers ask is "how do I judge security?", closely followed by "where do I start?".

As shown in figure 1, the first step in the analysis process is to identify the data assets handled by the IoT device and their secure properties. The next steps are to identify threats that target these assets, define security objectives to resist these threats, and finally derive the requirements to satisfy the security objectives. By meeting these requirements, a microcontroller-based design supports the security objectives and ultimately preserves the secure properties of the assets. Finally, the design should be evaluated to determine whether it achieves the objectives; typically, this evaluation applies threat models to the design to assess the attack resistance of the device.

Fig. 1. Analysis process for designing secure IoT devices: Identify Data Assets, Identify Threats, Define Security Objectives, Requirements.

III. DATA ASSETS

The value of every IoT device is built upon data, and how that data is managed. Data assets take various forms in an embedded system, such as a unique ID, firmware, a password, or an encryption key. Each data asset has secure properties: inherent characteristics of the data asset that the system relies upon as the basis of trusting it. There are three secure properties: confidentiality, integrity, and authenticity.

Confidentiality: Encryption is the process of encoding data in such a way that only trusted actors can read it, thus maintaining confidentiality. Correspondingly, if an actor can read encrypted data, they are assumed to be trusted. Encryption algorithms utilize keys for encryption and decryption; therefore, secure handling and storage of keys is a critical requirement for secure IoT devices. There are generally two types of encryption algorithms: symmetric (shared-key) and asymmetric (public-key) encryption. In shared-key schemes, the encryption and decryption keys are the same, and communicating parties must both have the same key to achieve secure communication. In public-key schemes, the encryption key (public key) is published for anyone to use for encrypting messages, but only the receiving actor has the decryption key (private key). Public-key schemes are useful for securing many-to-one communications.

Integrity: Data integrity assessment is required for data assets that are immutable; examples are boot firmware and configuration data. Assessing data integrity involves applying a cryptographic hash function to the data asset. A hash function maps data of arbitrary size to a fixed-size bit string called a hash. The probability of the same hash being generated for two data sets is made very small by the choice of hash bit length; therefore, for a given application and a properly chosen hash length, hashes can be considered unique to a data set. If a data set is changed, its hash will also change. Data integrity can therefore be determined by comparing a provided hash representing the original data set to a calculated hash of the data set as received.

Authenticity: Authenticity, when combined with integrity, establishes trust, and it is therefore a critical cornerstone of a secure IoT device. Typically, a Public Key Infrastructure (PKI) is used for this purpose. In a PKI scheme, a digital signature (simplistically, the hash of a data set encrypted by the signing actor using a private key) is embedded in the data set. Separately, the verifying actor receives a certificate issued by a Certificate Authority (CA). The certificate contains the corresponding public key, along with the identity of the signing actor. The verifying actor uses the public key to decrypt the hash that was embedded in the data set and compares it to the calculated hash. If they match, the verifying actor is assured that the data has not changed since it was signed, and that it was provided by the signing actor as attested by the CA.
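The integrity check described above reduces to recomputing and comparing a hash. The sketch below uses FNV-1a purely to keep the example self-contained; FNV-1a is NOT a cryptographic hash, and a real device must use one such as SHA-256:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* FNV-1a, a non-cryptographic hash, stands in here for SHA-256 so
 * the flow stays self-contained and runnable. */
static uint32_t fnv1a(const uint8_t *data, size_t len)
{
    uint32_t h = 2166136261u;        /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;              /* FNV prime */
    }
    return h;
}

/* Integrity check: recompute the hash of the received data set and
 * compare it against the hash provided with the original. */
static bool integrity_ok(const uint8_t *data, size_t len, uint32_t provided)
{
    return fnv1a(data, len) == provided;
}
```

Any single-bit change in the data set changes the computed hash, so the comparison fails and the asset is rejected.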

Fig. 2. Overview of Digital Signatures. Source: www.docusign.com.

It is critical to comprehensively identify the data assets in the IoT device, since each subsequent step relies on this step. Some examples of data assets are:

• Hardware ID – a unique identifier for the device
• Trusted Firmware – implements Trusted Applications (TAs) that support security objectives
• User Data – data used by the application
• Configuration – data used to configure the device, including network information
• Keys – data used for crypto operations

Each data asset will have secure properties. For the example data assets:

Data Asset          Secure Properties
Hardware ID         Integrity
Trusted Firmware    Integrity, Authenticity
User Data           Confidentiality, Integrity
Configuration       Confidentiality
Keys                Integrity, Confidentiality

IV. THREATS

Threats target data assets. The goal of threat identification is to expose vulnerabilities in the device's ability to maintain the secure properties of its data assets when attacked. For design purposes, threats that do not target data assets (and similarly, vulnerabilities or data assets without a particular attack method) cannot by definition be evaluated, and must therefore be treated with extra scrutiny.



The previous data asset examples may face the following threats:

Threat              Targeted Data Asset
Spoofing            Configuration
Man in the Middle   User Data, Keys
Malware             Trusted Firmware
Tamper              All

V. SECURITY OBJECTIVES

With the threats identified, security objectives can now be defined. Security objectives are defined at an application level, in essence providing implementation requirements. Some security objectives can be implemented as Trusted Applications (TAs) that execute in an isolated execution environment provided by the secure MCU. The isolated execution environment comprehensively protects the TAs and the data that they use and process. The IoT device application itself operates in an unsecure execution environment and communicates with TAs in the isolated execution environment through an API that uses an inter-processor communication (IPC) channel. The TAs in turn utilize the resources available in the hardware (such as crypto accelerators and secure memory) to support the objective.
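The API-over-IPC pattern can be sketched as a mailbox shared between the two environments. The mailbox layout, opcodes and dispatcher below are invented for illustration; on a real secure MCU the "ring" would go through a hardware IPC block rather than a direct call:

```c
#include <stdint.h>

/* Hypothetical IPC mailbox shared between the unsecure core and the
 * secure core running the Trusted Applications. */
typedef struct {
    uint32_t opcode;
    uint8_t  payload[16];
    uint32_t result;
} ipc_mailbox;

enum { TA_OP_SIGN = 1 };

/* Secure-side dispatcher: only whitelisted opcodes reach a TA, so
 * the unsecure application never touches keys or secure memory. */
static void ta_dispatch(ipc_mailbox *mb)
{
    if (mb->opcode == TA_OP_SIGN) {
        mb->result = 0;              /* success */
        mb->payload[0] ^= 0xA5;      /* stand-in for a real signing TA */
    } else {
        mb->result = 0xFFFFFFFFu;    /* rejected: unknown operation */
    }
}

/* Unsecure-side API wrapper: fill the mailbox and "ring" the secure
 * core (modelled here as a direct call). */
static uint32_t ta_call(ipc_mailbox *mb, uint32_t opcode)
{
    mb->opcode = opcode;
    ta_dispatch(mb);
    return mb->result;
}
```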

Continuing with the example, the threats identified previously can be countered by the following security objectives:

• Secure State: ensures that the device maintains a secure state even in case of failure of verification of firmware integrity and authenticity. Counters Malware and Tamper threats.

VI. REQUIREMENTS

At this point, the analysis provides a logically connected model of data assets, threats, and security objectives. From this picture, a list of required capabilities or features for a secure MCU can be compiled. This list can of course then be used as solution implementation criteria for the particular IoT device application.

Note that the requirements for a security objective may change according to the life cycle stage (design, manufacturing, inventory, end use, termination) of the IoT device, and this should be considered as well.

The analysis of the example can now be presented.

Notes:
1. Ideally implemented as a TA in an isolated execution environment
2. C = Confidentiality, I = Integrity, A = Authenticity
3. SEF = Secure Element Functionality
4. Dead = In a non-operational state

Fig. 3. Summary of security objectives and countered threats.

• Access Control: the IoT device authenticates all actors (human or machine) attempting to access data assets, preventing unauthorized access. Counters Spoofing and Malware threats where the attacker modifies firmware or installs an outdated, flawed version.
• Secure Storage: the IoT device maintains confidentiality (as required) and integrity of data assets. Counters Tamper threats.
• Firmware Authenticity: the IoT device verifies firmware authenticity prior to boot and prior to upgrade. Counters Malware threats.
• Communication: the IoT device authenticates remote servers and provides confidentiality (as required) and maintains integrity of exchanged data. Counters Man in the Middle (MitM) threats.

VII. CONCLUSION

This paper presents an analysis method for determining the requirements of a secure IoT device. By creating a logically connected model of data assets, threats against these assets, and security objectives to counter the threats, a list of requirements can be derived that can be used as criteria for implementation solutions.

The vast majority of IoT devices will be built upon MCUbased<br />

embedded systems. This growth opportunity is<br />

attracting a new breed of MCUs that offer security features and<br />

capabilities to maintain the secure properties of data assets.<br />

Cypress’ PSoC 6 secure MCUs are among the first of these new<br />

MCUs. The PSoC 6 MCU architecture was designed for IoT<br />

device applications, offering ultra-low power for extended<br />

battery life, efficient processing capacity, and hardware-based<br />

security features that support security objectives:<br />

Isolated Execution Environment: PSoC 6 secure MCUs<br />

isolate secure operations from unsecure operations through the<br />

use of hardware isolation technology:<br />

www.embedded-world.eu<br />

683


Configurable protection units are used to isolate<br />

memory, cryptography, and peripherals<br />

Inter-Processor Communication (IPC) channels<br />

between the Arm Cortex-M4 and Cortex-M0+ cores<br />

are provided to support isolated API-based interaction.<br />


Ideal for implementing Trusted Applications that<br />

support the security objective of an IoT device<br />

Integrated Secure Element functionality: The hardware<br />

isolation technology in PSoC 6 supports isolated key storage<br />

and crypto operations, delivering secure element functionality<br />

in addition to the isolated execution environment.<br />


Ideal for secure key storage<br />

Optional pre-installed root of trust to support secure<br />

boot with a chain of trust<br />

Isolated, hardware-accelerated cryptographic operations:<br />

Includes AES, 3DES, RSA, ECC, SHA-256 and SHA-512, and<br />

True Random Number Generator (TRNG).<br />

Life cycle management: eFuse-based life cycle<br />

management capability ensures secure behavior in the event of<br />

security errors such as firmware hash check failures.<br />

The forecasted explosion of IoT devices will be driven by<br />

the availability of cost effective, easy to design, and easy to use<br />

wireless connectivity to the Cloud. The ability for an<br />

embedded system to send and receive data is a fundamental<br />

enabler for smartness. Unfortunately, this ability is also an<br />

enabler for threats against the very data that makes an IoT<br />

device valuable. The more valuable the data, the more critical<br />

that IoT devices implement security capabilities that protect<br />

this data. Secure MCUs such as Cypress’ PSoC 6 MCUs<br />

address the needs of secure IoT devices.<br />

Fig. 4. PSoC 6 Secure MCU’s isolated execution environment enabled<br />

through hardware isolation technology.<br />

REFERENCES<br />

[1] Cypress PSoC 6 MCU Community:<br />

https://community.cypress.com/community/psoc-6; 2017<br />



Delivering high-mix, high-volume secure<br />

manufacturing in the distribution channel<br />

Steve Pancoast<br />

Secure Thingz, Inc.<br />

IoT and Embedded Systems Security<br />

San Jose, California, USA<br />

Rajeev Gulati<br />

Data I/O Corp.<br />

Software, Semiconductor, and Systems Technology<br />

Redmond, Washington, USA<br />

Abstract— This paper examines the cryptographic foundational<br />

elements required to establish roots-of-trust in silicon to design,<br />

manufacture and deliver secure devices. Recent advancements in<br />

security and programming technology, when designed in, streamline<br />

the manufacturing process, scale and deliver trusted devices to<br />

partners and OEMs cost-effectively. Other topics for discussion<br />

include impacts on manufacturing and downstream provisioning<br />

processes, as well as new technology in security provisioning and<br />

data programming that OEMs of any size can implement.<br />

Keywords— security, secure manufacturing, OEM, Internet of<br />

Things (IoT); root of trust; microcontroller (MCU); embedded security;<br />

device provisioning; supply chain of trust; IP protection; IoT<br />

device; certificate signing; cryptographic key pairing,<br />

authentication, and decryption; hardware security module (HSM);<br />

public key infrastructure<br />

I. INTRODUCTION<br />

As billions of new IoT products come online every year, the<br />

opportunity to co-opt these products for nefarious purposes<br />

grows exponentially. In addition, the supply chain for these<br />

IoT products may be susceptible to threats such as cloning and<br />

intellectual property (IP) theft. The effects should not be<br />

underestimated: unauthorized attacks can significantly impact<br />

an OEM’s revenue, profits, brand and reputation. Because of<br />

this, pressure is building for each and every IoT device to<br />

include security features that prevent the device from being<br />

used by an unauthorized agent. Zero security is no longer an<br />

option – security is now a must-have.<br />

Techniques used to combat the above threats include<br />

secure provisioning and programming of the products, along<br />

with operational security measures such as establishing trusted<br />

mutual authentication between the IoT device and a remote<br />

server, securing communication to and from the IoT device<br />

and securing the firmware running on the device itself. These<br />

capabilities can be enabled by features in secure<br />

semiconductor devices such as Secure Elements (SEs) and<br />

Secure Microcontrollers (Secure MCUs) as long as they are<br />

properly and securely provisioned.<br />

This paper describes common security issues facing OEMs when<br />

developing and manufacturing IoT devices, including establishing a<br />

supply chain of trust, creating a root of trust, and a solution for the<br />

secure provisioning and programming of Secure Elements and<br />

Secure MCUs. The paper also details<br />

the component architecture of the Data I/O SentriX system<br />

used for secure provisioning of Secure Elements and Secure<br />

MCUs, and it provides an example of how devices can be<br />

provisioned for mutual authentication during manufacturing.<br />

II. CHAIN OF TRUST<br />

A. Supply Chain of Trust<br />

In creating secure products, the product developer (the<br />

OEM) should adopt a “zero / low trust” approach across the<br />

supply chain to minimize vulnerabilities and IP loss or theft.<br />

The OEM should continually authenticate and individualize<br />

deliverables across the supply chain as far as possible. This<br />

involves establishing the chain of trust across the entire<br />

product lifecycle, including the end customer, who will need a<br />

way to securely apply software updates to the product.<br />

Figure 1 – Supply Chain of Trust: Silicon vendor (SEs, MCUs) → Programming Facility → Contract Manufacturing → OEM → Customer (SW Updates)<br />

A typical supply chain of trust is shown in Figure 1. The<br />

chain of trust should start with silicon vendors (for the Secure<br />

Element or Secure MCU) and continue with programming<br />

solution providers and contract manufacturers all the way<br />

through to the OEM, who develops the end products, and even<br />

the end customer who needs to securely update the product in<br />

the field. Think of the chain of trust as a process flow - any<br />

step in the process builds upon the security of the previous<br />

step. Once the context of the overall supply chain is<br />

understood, focus can be placed on the Programming Center<br />

and the Provisioning System that resides there.<br />





B. Root of Trust<br />

The security of an IoT product starts with a secure<br />

“root of trust” (RoT) that must be securely provisioned into<br />

the product, usually within an SE or a Secure<br />

MCU itself. The root of trust typically consists of four key<br />

items, three of which are shown in Figure 2.<br />

Figure 2 – Components of a Root of Trust<br />

1. A unique product asymmetric key pair that is<br />

provisioned into the product and is secure and<br />

immutable. The private part of the key pair must be<br />

protected and provisioned/programmed into the SE or<br />

secure MCU so that it is never exposed, but can be<br />

used for authentication purposes (see #3).<br />

2. A unique identity that is secure and can be<br />

validated (typically a product certificate). Every<br />

connected product should have a unique identity<br />

certificate. The most common implementation of this<br />

uses signed product certificates that can be verified<br />

by a certificate authority (CA). Unlike web browsers<br />

that connect to multiple sites, most IoT devices<br />

connect back to just the OEM’s own site, so the CA<br />

can be the OEM itself (i.e. a self-signed CA) or a<br />

third-party CA can also be used. The key principle is<br />

that there needs to be a verifiable certificate chain<br />

from each product back to a trusted CA.<br />

3. A secure way to authenticate the identity of the<br />

product (i.e. tie the product to the certificate). The<br />

SE or MCU provides a cryptographic method to<br />

authenticate that the public key (from the product<br />

certificate that was previously validated) matches the<br />

corresponding private key in the SE or MCU.<br />

4. A secure and immutable boot path (for MCU<br />

solutions). In addition to the items above, a secure<br />

MCU must also provide a secure boot mechanism<br />

where the integrity of the initial boot software is<br />

cryptographically verified before executing it. This<br />

process continues successively where the boot<br />

software verifies the integrity (signature) of the<br />

subsequent software before it is executed, and so on.<br />
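The verify-then-execute chain in item 4 can be sketched as follows. This is an illustrative Python toy, not production code: the HMAC tag stands in for a real digital signature, and the root key (a hypothetical constant here) would live in immutable, protected hardware.<br />

```python
import hashlib
import hmac

# Toy model of a verified boot chain. The HMAC tag is a stand-in for a
# real signature; ROT_KEY is a hypothetical key anchored in the RoT.
ROT_KEY = b"immutable-root-of-trust-key"

def sign(image: bytes) -> bytes:
    return hmac.new(ROT_KEY, image, hashlib.sha256).digest()

def verify_and_boot(stages):
    """Verify each stage's tag before 'executing' it; halt on failure."""
    booted = []
    for name, image, tag in stages:
        if not hmac.compare_digest(sign(image), tag):
            raise RuntimeError("integrity check failed at stage " + name)
        booted.append(name)  # stand-in for transferring control to the stage
    return booted

stages = [(name, image, sign(image)) for name, image in
          [("bootloader", b"stage1"), ("application", b"stage2")]]
booted = verify_and_boot(stages)  # both stages verify and "run"
```

Any stage whose image no longer matches its tag stops the chain, which is the behavior the secure boot requirement calls for.<br />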

The RoT represents the base level of security information<br />

that must be protected in the secure device against readout and<br />

tampering, etc. This is usually done with a variety of<br />

hardware and software protection methods in the device.<br />

Beyond the RoT, many layers of operational security software<br />

are required for different classes of IoT products, but each of<br />

these solutions typically relies on some initial security starting<br />

point that is implicitly trusted. So it is critical that the RoT<br />

be secure and properly protected at manufacturing.<br />

C. Secure Provisioning and Programming<br />

To implement the chain of trust process, the OEM should<br />

define roles for its own ecosystem partners and suppliers in<br />

the product’s lifecycle, including partners in the development<br />

and manufacturing of the product. The OEM should take<br />

ultimate ownership for the security of its own products and<br />

protect its own Intellectual Property. One of the key areas<br />

often overlooked by OEMs is the secure provisioning and<br />

programming of the RoT and product software either at a<br />

Programming Center or contract manufacturer. In addition, it<br />

is important for the OEM to address how software updates can<br />

be securely deployed for its own products.<br />

The challenges of a secure manufacturing solution should<br />

not be understated. Secure devices (SEs and MCUs) must be<br />

produced securely anywhere in the world with an OEM’s keys<br />

(RoT) and product software protected. OEMs must restrict<br />

access to secrets, but most OEMs need to trust third parties<br />

for high volume production. With an automated security<br />

provisioning and data programming solution, Programming<br />

Centers and contract manufacturers are able to handle more<br />

customers with less audit overhead because the provisioning<br />

and programming are managed cryptographically. The<br />

OEM’s secrets are protected inside a hardware security<br />

module (HSM), and over-production is eliminated.<br />

For Secure MCUs, the security problem is more<br />

complicated. Besides the need to securely provision the RoT<br />

into the MCU, there is a need to securely program the OEM’s<br />

application software / firmware into the MCU to protect<br />

against IP theft. The OEM should also provide a solution to<br />

securely update the software in its products after production.<br />

Secure Thingz has created a solution where the software can<br />

be securely programmed / updated as it is “mastered” with a<br />

secure system at the OEM and sent to the programming<br />

facility. Mastered software images are encrypted and<br />

protected against any modifications and can only be decrypted<br />

and installed by the targeted device or family of devices.<br />

III. PROVISIONING SYSTEM ARCHITECTURE<br />

A provisioning system component architecture is a turnkey<br />

solution that enables an OEM to securely provision<br />

component devices like Secure Elements and Secure MCUs.<br />

A common usage model for the provisioning system, shown in<br />

Figure 3, involves its setup at a Secure Programming Center.<br />

Secure Programming Centers provide important provisioning<br />

services to OEMs at various volume levels ranging from first<br />

article to hundreds of thousands of units. Automated security<br />



provisioning and data programming systems may also be<br />

similarly set up at OEM-controlled factories or factories owned<br />

by contract manufacturers.<br />

IV. MUTUAL AUTHENTICATION USE CASE<br />

Now that the component architecture of the system has<br />

been reviewed, consider a real world use case of mutual<br />

authentication to see how the provisioning system is used.<br />

Mutual authentication requires both parties to have a<br />

provisioned root of trust as outlined in Section II. Two devices<br />

attempting to communicate with each other will use this root of<br />

trust to cryptographically authenticate before exchanging data.<br />

A brief study of how mutual authentication works will<br />

establish the device provisioning requirements. One of the<br />

devices could be an update server and the other device an IoT<br />

product, but many variants are common.<br />

Figure 3 – Secure Provisioning System Architecture<br />

While the secure provisioning of devices may be<br />

outsourced by an OEM to a Secure Programming Center,<br />

OEMs need to provide both public and secret information that<br />

is necessary to provision their devices. As an example, an<br />

OEM may need to provide a Secure Programming Center with<br />

private signing keys, certificates, certificate templates,<br />

production counts and other important information. For the<br />

purpose of this paper, and as shown in Figure 3, such material<br />

will be referred to as “OEM Public and Private Information”.<br />

OEMs create and manage the OEM Public and<br />

Private Information on their premises in a highly secure<br />

environment. Such information is company secret and of very<br />

high value to an OEM, yet this information needs to be<br />

transmitted to a Secure Programming Center so that devices<br />

can be provisioned. The OEM Secret Wrapping Tool is a<br />

subsystem that enables OEM Public and Private Information<br />

to be cryptographically signed and encrypted, or “wrapped”.<br />

This allows the information to be protected and be securely<br />

transmitted to a specific provisioning system at a<br />

Programming Center over an unsecure Internet connection.<br />

The unwrapping of the OEM’s secret information only occurs<br />

inside a secure, tamper-resistant HSM where it is stored and<br />

protected. This wrapping process takes place at the OEM’s<br />

premises, and the wrapped file is then sent to a<br />

Secure Programming Center.<br />
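As a rough illustration of sign-then-encrypt wrapping, the sketch below uses only Python's standard library. The HMAC "signature", the SHA-256-based keystream cipher, and both keys are toy stand-ins for the asymmetric, HSM-backed operations a real wrapping tool would use.<br />

```python
import hashlib
import hmac
import secrets

SIGNING_KEY = b"oem-signing-key"      # hypothetical stand-in keys; in practice
TRANSPORT_KEY = b"hsm-transport-key"  # these are asymmetric and HSM-protected

def _keystream(key, nonce, length):
    # Derive a pseudo-random keystream from SHA-256 (toy stream cipher).
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def wrap(secret):
    tag = hmac.new(SIGNING_KEY, secret, hashlib.sha256).digest()   # "sign"
    nonce = secrets.token_bytes(16)
    keystream = _keystream(TRANSPORT_KEY, nonce, len(secret))
    ciphertext = bytes(a ^ b for a, b in zip(secret, keystream))   # "encrypt"
    return nonce, ciphertext, tag

def unwrap(nonce, ciphertext, tag):
    keystream = _keystream(TRANSPORT_KEY, nonce, len(ciphertext))
    secret = bytes(a ^ b for a, b in zip(ciphertext, keystream))
    if not hmac.compare_digest(
            hmac.new(SIGNING_KEY, secret, hashlib.sha256).digest(), tag):
        raise ValueError("signature check failed: file modified in transit")
    return secret

nonce, ct, tag = wrap(b"OEM device certificate signature key")
assert unwrap(nonce, ct, tag) == b"OEM device certificate signature key"
```

The point of the structure is that the wrapped file is both unreadable and unmodifiable in transit; only the holder of the transport key (the target HSM) can recover it, and any tampering is detected at unwrap time.<br />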

The Secure Programming Center is responsible for<br />

provisioning devices for the OEM customer at its premises.<br />

This process starts with creation of a Security Product using<br />

the specific system programming software. During product<br />

creation, the wrapped OEM Public and Private Information is<br />

imported into the programming system and cryptographically<br />

bound to a Unique Product ID and a Unique OEM ID within<br />

the HSM. Thus a unique product representation (OEM ID and<br />

Product ID) is created.<br />

Figure 4 – Mutual Authentication Between Two Devices<br />

The application and devices are developed by OEM A, who<br />

is thus the owner of Device 1, Device 2 and the application.<br />

OEM A is therefore responsible for provisioning both devices<br />

for mutual authentication.<br />

This process generally requires creating a Public Key<br />

Infrastructure (PKI) system among the various devices in the<br />

application. Identities for Device 1 and Device 2 are created<br />

by associating an Identity Key Pair with each device. As<br />

shown in Figure 4, the private part of the Identity Key Pair for<br />

each device is stored in read and write protected storage on the<br />

device (and thus never exposed to the outside world). The<br />

public part of the key pair is stored in a Device Certificate in<br />

write-protected storage and represents the Public Identity of the<br />

device (and is available to share with the outside world).<br />

Assigning ownership of Device 1 and Device 2 to OEM A<br />

requires that an Identity Key Pair for OEM A be created, as<br />

shown in Figure 4. Ownership is then assigned by signing the<br />

Device Certificate for each device with the Private Key from<br />

the OEM Identity Key Pair.<br />

The Public Identity of the OEM is represented by the OEM<br />

Root CA certificate, which contains the Public Key of the<br />

OEM. In this example, we are assuming that the OEM is also<br />

the Root Certificate Authority (where the certificate signature<br />

chain of trust needs to terminate), thus the OEM Root CA<br />

certificate is signed by the Private Key of the OEM. Note that<br />

in most cases, the Root CA is used to sign intermediate CA<br />

certificates and these are, in turn, used to sign device<br />

certificates. However, for this example, the Root CA will be<br />

used directly for simplicity.<br />



The OEM Root CA certificate is also associated with<br />

Device 1 and Device 2 so that during the device authentication<br />

process, the chain of trust starting with the Device Certificate<br />

for Device 1 (or Device 2) can be dynamically verified to<br />

terminate at the same Root CA certificate as the OEM Root CA<br />

certificate stored on the device.<br />
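The chain walk itself can be sketched structurally. This is a toy Python model with a hypothetical issuer table; real verification also checks each link's signature, validity period, and extensions, which are omitted here for brevity.<br />

```python
# Toy walk up a certificate chain: each certificate names its issuer,
# and the walk must end at the root CA certificate the device already
# stores and trusts. Per-link signature checks are omitted.
TRUSTED_ROOT = "OEM Root CA"

ISSUER_OF = {                       # hypothetical chain
    "Device 2": "Intermediate CA",
    "Intermediate CA": "OEM Root CA",
    "OEM Root CA": "OEM Root CA",   # the root is self-signed
}

def terminates_at_trusted_root(cert, max_depth=5):
    for _ in range(max_depth):      # bound the walk to avoid loops
        issuer = ISSUER_OF.get(cert)
        if issuer is None:          # unknown issuer: chain cannot be built
            return False
        if issuer == cert:          # self-signed: end of the chain
            return cert == TRUSTED_ROOT
        cert = issuer
    return False

assert terminates_at_trusted_root("Device 2")
assert not terminates_at_trusted_root("Rogue Device")
```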

Once the above PKI system is set up, the mutual<br />

authentication algorithm works as follows:<br />

1. Device 1 requests Device 2 to send its Certificate.<br />

2. Device 2 sends its Device Certificate to Device 1.<br />

3. Device 1 authenticates Device 2 by validating Device<br />

2’s Device Certificate. This is done in two steps.<br />

a. Cryptographic validation that Device 2 certificate<br />

is from OEM A – this involves ensuring that the<br />

Device Certificate is signed by the Private Key of the<br />

OEM Identity Key Pair and by verifying the<br />

signature chain of trust starting with the Device<br />

Certificate of Device 2 and terminating at the<br />

OEM Root CA Certificate. Each device has a<br />

local copy of the OEM certificate that it trusts.<br />

b. Authentication of the identity of Device 2 using<br />

the Device 2 Public Key from the Device 2<br />

certificate and performing a challenge response<br />

algorithm with Device 2.<br />

Since Device 2 is the only entity that knows the<br />

Device 2 private key, it is the only device that can<br />

successfully respond to the challenge, thus<br />

proving that the Public Key belongs to Device 2.<br />

The details of the challenge response mechanism<br />

are not covered in this paper.<br />

4. If validation of the Device Certificate by Device 1 is<br />

successful, Device 1 sends an affirmative authentication<br />

response to Device 2.<br />

5. Device 2 executes a complementary sequence to<br />

authenticate Device 1.<br />
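A minimal sketch of the challenge-response in step 3b, assuming textbook RSA with toy parameters (the primes below are Mersenne primes chosen only for illustration; real deployments use vetted libraries, padding schemes, and much larger keys):<br />

```python
import hashlib
import secrets

# Toy textbook-RSA challenge-response; NOT secure, illustration only.
p, q = 2147483647, 2305843009213693951
n, e = p * q, 65537                      # Device 2's public key
d = pow(e, -1, (p - 1) * (q - 1))        # private key, never leaves the device

def respond(challenge):
    """Device 2 'signs' the challenge with its private key."""
    h = int.from_bytes(hashlib.sha256(challenge).digest(), "big") % n
    return pow(h, d, n)

def verify(challenge, response):
    """Device 1 checks the response using only the public key (n, e)."""
    h = int.from_bytes(hashlib.sha256(challenge).digest(), "big") % n
    return pow(response, e, n) == h

challenge = secrets.token_bytes(16)          # fresh random challenge
assert verify(challenge, respond(challenge))             # genuine Device 2
assert not verify(challenge, respond(challenge) + 1)     # tampered response
```

Because only Device 2 holds the private exponent, a correct response over a fresh challenge ties the public key in the (already validated) certificate to the device presenting it.<br />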

V. PROVISIONING IMPLEMENTATION<br />

Now that mutual authentication requirements have been<br />

discussed, the provisioning of Secure Element devices using the<br />

provisioning system can be outlined. From the discussion of<br />

the provisioning architecture and the mutual<br />

authentication use case, the following requirements<br />

must be supported when provisioning devices:<br />

1. The OEM creates the following at the OEM Premise:<br />

a. An OEM Identity Key Pair<br />

b. An OEM Root CA Certificate<br />

c. A Device Certificate Template<br />

d. Production Count for number of devices to be<br />

produced at the Secure Programming Center<br />

2. The OEM securely transmits the following OEM<br />

information to the Secure Programming Center:<br />

a. OEM Device Certificate Signature Key (which is<br />

the Private Key from the OEM Identity Key pair)<br />

b. OEM Root CA Certificate<br />

c. A Device Certificate Template<br />

d. A Production Count<br />

e. Unique serial numbers to be programmed into<br />

devices (optional and not shown)<br />

The above information flow is shown in Figure 5.<br />

Figure 5 – Secure Transfer of OEM Information to Provisioning System<br />

The OEM Identity Key is secret, must be protected and is<br />

wrapped before transfer to the specific provisioning system.<br />

The OEM Secret Wrapping Tool targets a specific HSM<br />

system at the Secure Programming Center, so the identity<br />

certificate of the specific Guardian HSM at the target<br />

programming system is required. OEM Public Information is not<br />

secret and technically does not need to be wrapped; however,<br />

this is often a convenient method of transfer.<br />

Once securely stored inside the provisioning system, the<br />

OEM Information is used to create a Job Package, which<br />

initiates the provisioning cycle for a batch of Devices. The<br />

simplified provisioning flow for a single device is as follows:<br />

1. Generate a Device Identity Key Pair for each device.<br />

2. Create a Device certificate using the Public Key of the<br />

Identity Key Pair.<br />

3. Sign the Device certificate with OEM Device<br />

Certificate Signature Key.<br />

4. Program the Device Certificate into the Device Write<br />

Protected storage.<br />

5. Program the Root CA Cert into the Device Write<br />

Protected storage.<br />

6. Lock the Device.<br />
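The six steps above map naturally onto a small data model. The sketch below is illustrative only: the HMAC tag stands in for the OEM's certificate signature, and the "key generation" is simulated with a hash.<br />

```python
import hashlib
import hmac
from dataclasses import dataclass

OEM_SIGNATURE_KEY = b"oem-device-cert-signing-key"  # hypothetical; held in HSM

@dataclass
class Device:
    private_key: bytes = b""   # read- and write-protected storage
    certificate: bytes = b""   # write-protected storage
    root_ca_cert: bytes = b""  # write-protected storage
    locked: bool = False

def provision(device, serial, root_ca_cert):
    device.private_key = hashlib.sha256(b"keygen" + serial).digest()  # 1. key pair
    public_key = hashlib.sha256(device.private_key).digest()          #    (toy derivation)
    cert_body = serial + public_key                                   # 2. device certificate
    signature = hmac.new(OEM_SIGNATURE_KEY, cert_body,
                         hashlib.sha256).digest()                     # 3. sign certificate
    device.certificate = cert_body + signature                        # 4. program certificate
    device.root_ca_cert = root_ca_cert                                # 5. program root CA cert
    device.locked = True                                              # 6. lock the device

dev = Device()
provision(dev, b"SN0001", b"toy-oem-root-ca-certificate")
assert dev.locked and dev.private_key != b""
```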

Once this provisioning flow has been executed, devices are<br />

protected from modification and prepared for mutual<br />

authentication in the field.<br />



VI. SUMMARY<br />

This paper has described common security issues facing<br />

IoT OEMs. In order to secure IoT devices, OEMs must<br />

establish a supply chain of trust, create a root of trust in each<br />

device and ensure the secure provisioning of secure elements<br />

and secure MCUs for those devices. A component architecture<br />

of a secure provisioning system was discussed and an example<br />

of mutual authentication was used to demonstrate the necessary<br />

device provisioning steps. This fundamental provisioning<br />

architecture can also be used for more advanced usage models,<br />

including secure boot of an MCU and the authentication and<br />

encryption of firmware. These advanced usage models can be<br />

implemented with a Secure Element or a Secure MCU.<br />

VII. TRADEMARKS<br />

Secure Deploy is a registered trademark of Secure Thingz, Inc.<br />

SentriX is a registered trademark of Data I/O Corporation.<br />

VIII. REFERENCES<br />

[1] Microsoft, “Securing Public Key Infrastructure (PKI),” May 2014.<br />

https://docs.microsoft.com/en-us/previous-versions/windows/itpro/windows-server-2012-R2-and-2012/dn786443(v=ws.11)<br />

[2] Jones, Scott, “Secure Authenticators answer the call to solve IoT device<br />

embedded security needs”, Embedded World 2017, unpublished.<br />



Providing Cryptography for Your System<br />

How to Port Transport Layer Security (TLS)<br />

Ron Eldor<br />

IoT Services Group<br />

Arm<br />

Netanya, Israel<br />

Janos Follath<br />

IoT Services Group<br />

Arm<br />

Cambridge, United Kingdom<br />

Abstract— It’s unimaginable for a device designed to be<br />

secure in a modern connected environment to not use<br />

cryptography. Authentication, secure communication, secure<br />

boot and firmware update services, all heavily rely on<br />

cryptographic protocols and primitives. However, such features<br />

come at a cost, including enlarged code size and decreased<br />

performance, which can be challenging on constrained devices.<br />

This can be made easier with hardware-accelerated<br />

cryptographic functions. Performing cryptography in hardware<br />

removes the workload from the microcontroller, decreasing<br />

power consumption and reducing overall code size. In this paper,<br />

we will review the most common cryptographic primitives,<br />

provide an overview of the porting process and demonstrate its<br />

implementation through a case study of integrating Arm Mbed<br />

TLS into an Mbed OS target platform. There are several<br />

challenges that may arise when integrating TLS technology into<br />

an operating system instead of directly into the application. For<br />

example, it may be non-trivial to expose configuration options to<br />

the application developer and/or to the silicon manufacturer.<br />

This presentation will outline Arm’s approach to implementing<br />

TLS within an OS.<br />

Keywords—cryptography; TLS; Mbed TLS; Mbed OS;<br />

CryptoCell; TrustZone; hardware accelerator<br />

I. INTRODUCTION<br />

With the Internet of Things growing by leaps and bounds,<br />

security related issues are also growing, and need to be<br />

effectively addressed now. As the ecosystem increases in size,<br />

securing the embedded platforms is more important than ever.<br />

A basic component of securing platforms is adding<br />

cryptography with some security protocol, such as Transport<br />

Layer Security (TLS). The downside of adding cryptography is<br />

the decrease in performance on a Microcontroller Unit (MCU)<br />

with already low performance, and enlarging code size on a<br />

memory limited platform at the same time. These problems can<br />

be mitigated by using a hardware-accelerated cryptography<br />

engine which offloads operations from the MCU and reduces<br />

code size.<br />

Integrating a hardware accelerator into a product takes time<br />

and effort during both development and maintenance, even<br />

more so because this integration is a security critical task in<br />

itself. This can be mitigated by using an embedded operating<br />

system that integrates an accelerator, where the code is<br />

maintained by the developers of the driver and/or the operating<br />

system.<br />

In this paper we discuss how porting Mbed TLS and Mbed<br />

OS to a platform with Arm TrustZone Cryptocell-310 resulted<br />

in a 15-53x performance increase, and we describe the main<br />

integration tasks this involved. Section II<br />

shares some of the experimental data from this task. Section III<br />

gives a short background in cryptography, discussing the terms<br />

used throughout the paper. Section IV describes how Mbed<br />

TLS and Mbed OS support the process of porting and in<br />

Section V some representative tasks are discussed in detail.<br />

Section VI closes the paper with a summary.<br />

II. RESULTS<br />

Cryptography is a necessary but expensive operation, both<br />

in performance and in code size. In a constrained environment,<br />

where both these setbacks can be a major limitation, hardware<br />

accelerated cryptography comes into use. There are several<br />

potential benefits of using hardware acceleration for<br />

cryptography:<br />

1) Throughput increase<br />

2) Code size (therefore area) reduction (as most of the<br />

cryptography is done in hardware)<br />

3) Power reduction<br />

4) Isolation of cryptographic keys from potentially flawed<br />

or even malicious software<br />

We will study the potential increase in performance and<br />

throughput in this paper.<br />

A. Performance<br />

As mentioned, offloading the cryptography to a hardware<br />

acceleration core improves the performance. It allows the TLS<br />

handshake to finish faster and application data to be encrypted<br />

more quickly - providing an overall smoother user experience.<br />

Cryptographic operations are performed using dedicated<br />

hardware highly optimized for the target algorithms, which<br />

leads to significant improvement in the computation time.<br />

Table I presents the comparison of the Mbed OS<br />

Benchmark [9] application output, when running Mbed TLS<br />



with a pure software implementation and with hardware<br />

acceleration via Arm TrustZone CryptoCell-310. The<br />

measurements were done on a platform embedding Arm<br />

Cortex-M4F and TrustZone CryptoCell-310, both supplied<br />

with a 64MHz clock. The benchmark application was compiled<br />

with the Arm Compiler 5 toolchain.<br />

TABLE I. COMPARISON OF MBED TLS BENCHMARK WITH AND WITHOUT ARM TRUSTZONE CRYPTOCELL-310<br />

Algorithm | Improvement Ratio<br />

SHA-256 1:15.81<br />

AES-CBC-128 1:24.06<br />

AES-CCM-128 1:25.49<br />

ECDSA-secp384r1 (sign) 1:19.32<br />

ECDSA-secp256r1 (sign) 1:35.76<br />

ECDSA-secp384r1 (verify) 1:28.21<br />

ECDSA-secp256r1 (verify) 1:53.59<br />

ECDHE-secp384r1 (handshake) 1:20.89<br />

ECDHE-secp256r1 (handshake) 1:38.79<br />

ECDHE-Curve25519 (handshake) 1:43.65<br />

ECDH-secp384r1 (handshake) 1:20.50<br />

ECDH-secp256r1 (handshake) 1:40.29<br />

ECDH-Curve25519 (handshake) 1:46.73<br />

III. CRYPTOGRAPHIC BACKGROUND<br />

Authentication, secure communication, secure boot and<br />

firmware update services all heavily rely on cryptographic<br />

protocols and primitives. There is a wide range of crypto<br />

primitives and protocols in use for various purposes around the<br />

industry. The TLS protocol is one of the most widespread<br />

conventional protocols – it provides confidentiality, integrity<br />

and authenticity over an untrusted channel. It is a very versatile<br />

protocol, supporting numerous algorithms for authentication,<br />

key exchange and encryption. When starting a TLS connection,<br />

a handshake takes place. The peers negotiate algorithms to use,<br />

authenticate each other and share temporary key material to use<br />

during the session. Technically, TLS optionally supports<br />

mutual authentication. In the HTTPS case the server usually<br />

authenticates, and the client does not, but there are many other<br />

use cases other than HTTPS, where both sides do want to<br />

authenticate.<br />
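Python's standard ssl module illustrates how mutual authentication is opted into: a server context only demands a client certificate when its verify mode says so. The certificate file paths in the comments are hypothetical.<br />

```python
import ssl

# One-way TLS (the HTTPS default): the server presents a certificate,
# the client verifies it, and the client itself stays anonymous.
server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
assert server_ctx.verify_mode == ssl.CERT_NONE   # client certs not requested

# Mutual TLS: the server additionally requires and verifies a client cert.
server_ctx.verify_mode = ssl.CERT_REQUIRED
# server_ctx.load_cert_chain("server.crt", "server.key")   # server identity
# server_ctx.load_verify_locations("client-ca.crt")        # trusted client CA
```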

One way of authenticating the peer is with the help of<br />

digital signatures. Only someone in the possession of the<br />

private key can produce valid signatures, which anybody can<br />

verify with the help of the public key. The Elliptic Curve<br />

Digital Signature Algorithm (ECDSA) is well suited for<br />

embedded applications, as it has good performance and small<br />

key size, while still providing strong cryptographic security.<br />

The security of these cryptographic primitives is always<br />

conditional on the computational power the adversary has<br />

available. If there is an 80-bit long key and the adversary can<br />

try all 2^80 keys fast enough, then he can break the scheme<br />

(although it is worth noting that on average he only needs to try<br />

2^79). A primitive has x-bit security if the best attack to break it<br />

requires computational power equivalent to trying 2^x<br />

symmetric keys.<br />

Keys, secrets and random numbers are generated with the<br />

help of random number generators (RNG). The unpredictability<br />

of their output is crucial to the security of the system. For<br />

example, if a 256-bit long output of the RNG is used as a key<br />

in a 256-bit strong scheme and the RNG output can be<br />

predicted with a chance of 0.00001%, then the security is<br />

reduced to 23 bits. Secure systems must have some way of<br />

generating random numbers in a secure way. Ideally, the<br />

generation of randomness is based on a physical phenomenon,<br />

providing a high level of entropy (also known as “True<br />

Random Number Generator”).<br />

IV. MBED TLS AND MBED OS<br />

In this case study we ported Mbed OS and Mbed TLS to a<br />

platform equipped with Arm TrustZone CryptoCell-310. The<br />

Mbed TLS library is not just a free, open-source, highly<br />

configurable TLS stack designed with use in embedded<br />

systems in mind – it is also a cryptographic library. It provides<br />

compile time configuration options enabling the use of<br />

cryptographic hardware accelerators [1]. Its portable C code, minimal dependencies, modular structure and built-in thin abstraction layer make it easy to port to new platforms [2].<br />

Because of this, porting Mbed TLS for a single, specific<br />

application can be simple and easily achieved, but if integration<br />

into a multipurpose, generic platform is required, then there are<br />

several actors and viewpoints that have to be taken into account<br />

to preserve Mbed TLS’s high configurability.<br />

When integrating Mbed TLS into operating systems,<br />

instead of directly into the application, care should be taken for<br />

the following reasons:<br />

• It is non-trivial to expose configuration options to the<br />

application developer and/or to the driver developer,<br />

due to the fact that Mbed TLS has a lot of compile-time<br />

options that enable the user to fine-tune the memory<br />

footprint, performance and functionality.<br />

• The crypto engines have to be integrated and available<br />

as both an operating system service and as part of Mbed<br />

TLS.<br />

• The above two considerations have to be addressed in a<br />

way that enables a straightforward and usable<br />

integration process for the driver developer.<br />

Mbed OS is a free, open-source embedded operating<br />

system, which is pre-integrated with Mbed TLS. The<br />

integration with Mbed OS addressed the above considerations,<br />

by dividing the options into three groups, and then providing<br />

three different mechanisms to access them. The first, most integrated, set of options is part of the Mbed Hardware<br />

Abstraction Layer (HAL) and can be activated through Mbed<br />

OS target configuration [3]. The second set provides a more<br />

flexible way for the driver developer to provide the driver code<br />

[4]. The third set of options has a similarly flexible but<br />

different configuration method [5]. The integration process is<br />



still ongoing and options can move from the second group into<br />

the first, when they have stood the test of time.<br />

Since the nature of this case study is integration and not<br />

application development, the first two mechanisms are used.<br />

The TRNG is being integrated into the Mbed OS HAL and the<br />

other Arm TrustZone CryptoCell crypto acceleration services<br />

are being made available by the second mechanism.<br />

V. INTEGRATION<br />

When porting a hardware cryptography engine, the<br />

signature of the public API functions and the types of the<br />

context structures in the driver are unlikely to coincide with the<br />

ones in Mbed TLS. Furthermore, in some cases the byte order<br />

used by Mbed TLS might differ from the one preferred by the<br />

engine. Both differences arise in the case of Arm TrustZone<br />

CryptoCell-310 and need adjustment to achieve full<br />

integration. In the first case this adjustment means converting<br />

the input from the Mbed TLS type to the types prescribed by<br />

the hardware, passing it to the driver API function and<br />

converting back the output from the hardware driver type to the<br />

Mbed TLS type. In the second case a simple change in byte<br />

order is necessary. These translations have a downside of<br />

adding code and decreasing performance. However both of<br />

these effects are negligible and well justified by the gain<br />

provided by the cryptography engine. In the rest of this chapter, we present examples of the integration challenges encountered when porting Arm TrustZone CryptoCell-310 to Mbed TLS on Mbed OS and the ways to overcome them.<br />

A. Type mismatch in output<br />

The Mbed TLS function for ECDSA signature,<br />

mbedtls_ecdsa_sign(), receives two output parameters of type<br />

mbedtls_mpi and follows the standard SEC1 [6]. However, the<br />

hardware driver's signature function outputs a single byte<br />

buffer representing the signature. Fortunately, translating the<br />

output byte buffer to mbedtls_mpi can be done with the<br />

mbedtls_mpi_read_binary() function (Fig. 1).<br />
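Fig. 1 is not reproduced in this text-only version. The translation it shows parses a big-endian byte buffer into a multi-precision integer; in real code this is a call to mbedtls_mpi_read_binary(&r, buf, len). The stand-in below does the same parse into a plain uint64_t so the sketch stays self-contained (actual ECDSA signature components are of course far larger than 64 bits).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for mbedtls_mpi_read_binary(): interpret a
 * big-endian byte buffer, as produced by the hardware driver, as an
 * integer. Real code fills an mbedtls_mpi instead of a uint64_t. */
static uint64_t read_binary_be(const uint8_t *buf, size_t len)
{
    uint64_t x = 0;
    for (size_t i = 0; i < len; i++)
        x = (x << 8) | buf[i];
    return x;
}
```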

B. Type mismatch in input<br />

The Mbed TLS random function signature is int<br />

(*f_rng)(void * context, unsigned char * output, size_t length)<br />

and the Arm TrustZone CryptoCell-310 random function<br />

signature is int (*f_rng)(void * context, uint16_t length, uint8_t<br />

* output). This means that the random function callback given<br />

as a parameter to a function, such as mbedtls_ecdsa_sign(),<br />

cannot simply send the function pointer to the hardware<br />

accelerator driver. To overcome this, a wrapper function and<br />

context have been created.<br />

First, a structure called mbedtls_rand_func_container is<br />

defined, which will contain the context and the Mbed TLS<br />

random function pointer (Fig. 2).<br />

After that, the function mbedtls_to_cc_rand_func() is<br />

created with the hardware driver's signature, which calls the<br />

Mbed TLS callback function (Fig. 3).<br />

Last, an mbedtls_rand_func_container is initialized with the Mbed TLS RNG parameters and passed to the hardware accelerator driver, along with mbedtls_to_cc_rand_func() (Fig. 4).<br />
Fig. 2. Definition of mbedtls_rand_func_container<br />
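Figs. 2-4 are not reproduced here. Based on the two callback signatures quoted above, the wrapper pattern might look like the following sketch; the struct and function names follow the text, while the test RNG at the bottom is a stand-in of ours for the real Mbed TLS f_rng.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Container pairing the Mbed TLS RNG callback with its context (Fig. 2). */
typedef struct {
    int (*f_rng)(void *ctx, unsigned char *output, size_t len); /* Mbed TLS style */
    void *ctx;
} mbedtls_rand_func_container;

/* Wrapper with the CryptoCell-style signature, length before buffer
 * (Fig. 3): the driver calls this, and it forwards to the stored
 * Mbed TLS callback with the argument order Mbed TLS expects. */
static int mbedtls_to_cc_rand_func(void *mbedtls_rng, uint16_t len,
                                   uint8_t *output)
{
    mbedtls_rand_func_container *c = mbedtls_rng;
    return c->f_rng(c->ctx, output, len);
}

/* Stand-in Mbed TLS RNG for testing: fills the buffer with one byte. */
static int fake_rng(void *ctx, unsigned char *output, size_t len)
{
    memset(output, *(unsigned char *)ctx, len);
    return 0;
}
```

In use, the container is filled with the Mbed TLS f_rng and its context, and the pair (container, mbedtls_to_cc_rand_func) is handed to the driver in place of a single incompatible function pointer.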

C. Difference in standards<br />

Mbed TLS and Arm TrustZone CryptoCell-310 comply<br />

with different standards. In the majority of the cases, the<br />

standards are either the same or very similar, however<br />

sometimes there are minor differences, which eventually led to<br />

some small complications during the porting process. In our<br />

case Mbed TLS follows SEC1 [6] and this standard leaves the<br />

details of transforming random bits to key candidates to the<br />

implementation. Mbed TLS implements a key generation<br />

method to be suitable for generating ephemeral keys for<br />

deterministic signing too. Namely it complies with RFC 6979<br />

[7] and implements key generation as described in section 3.3.<br />

Arm TrustZone CryptoCell-310 on the other hand follows the<br />

FIPS 186-4 [8] standard and implements key generation as<br />

described in section B.5.2. To be precise both of these describe<br />

the ephemeral key generation, as SEC1 calls it, or the per-message secret number generation, in the terms of FIPS 186-4. This is intentional, because this mode of key generation is the one relevant to this case.<br />

The major difference between the two standards is that FIPS 186-4 ensures that the key k is in the interval [1, n-1] by checking whether the candidate is greater than n-2: if so, a new candidate is generated; otherwise one is added and the result is accepted. RFC 6979, on the other hand, checks whether k > n-1 or k = 0, generating a new k if so and accepting it unchanged otherwise. Both algorithms are correct, but given the same random input their outputs differ (k+1 and k, respectively). The only case in which this poses a problem is testing: the difference makes predefined test vectors fail.<br />
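The difference can be illustrated with toy numbers. The sketch below uses uint64_t stand-ins for the big integers involved, and the function names are ours, not from either standard's reference code; a return of -1 means "reject, draw a new candidate".

```c
#include <assert.h>
#include <stdint.h>

/* FIPS 186-4 B.5.2 style: reject candidates above n-2, otherwise
 * shift into [1, n-1] by adding one. */
static int fips_k(uint64_t c, uint64_t n, uint64_t *k)
{
    if (c > n - 2)
        return -1;          /* reject: caller draws a new candidate */
    *k = c + 1;             /* accept, shifted into [1, n-1] */
    return 0;
}

/* RFC 6979 section 3.3 style: reject candidates outside [1, n-1],
 * otherwise accept unchanged. */
static int rfc6979_k(uint64_t c, uint64_t n, uint64_t *k)
{
    if (c > n - 1 || c == 0)
        return -1;          /* reject */
    *k = c;                 /* accept as-is */
    return 0;
}
```

For any candidate both routines accept, the FIPS result exceeds the RFC 6979 result by exactly one, which is why shared test vectors diverge.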

To overcome this difference, the<br />

mbedtls_to_cc_rand_func() function defined previously, can be<br />

modified to decrease the output of f_rng by one, before this<br />

callback is returned to the hardware driver (Fig. 5).<br />

D. Difference in byte order<br />

Arm TrustZone CryptoCell-310 and Mbed TLS use a<br />

different byte ordering, which needs to be adjusted when<br />

passing raw data between components. For example, the<br />

hardware handles keys in byte buffers in little endian byte<br />

order. However, converting to the mbedtls_mpi structure when<br />

using mbedtls_mpi_read_binary() function, the buffer has to be<br />

in big endian byte order. In order for the values of the<br />

Fig. 1. Conversion of byte array to mbedtls_mpi<br />

Fig. 3. Definition of mbedtls_to_cc_rand_func<br />

www.embedded-world.eu<br />



Fig. 4. Combination of the components<br />

generated keys to be the same, the output of the random bit<br />

generator needs to be translated to match the byte order of<br />

Mbed TLS. A straightforward way to do this is to extend mbedtls_to_cc_rand_func() with this translation functionality (Fig. 6).<br />

Like in the previous subsection this is only an issue when<br />

using predefined test vectors and it will not affect the<br />

correctness of operation in production. These modifications<br />

come with a slight increase in code size and penalty in<br />

performance, but these are negligible in most use cases and can<br />

be mitigated by turning them off in production if absolutely<br />

necessary.<br />
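Fig. 6 itself is not reproduced here; the core of the translation it adds is a plain in-place byte-order reversal over the RNG output buffer, so that the driver's little-endian layout matches the big-endian layout mbedtls_mpi_read_binary() expects. A minimal sketch (helper name ours):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of a buffer in place: converts between the
 * little-endian layout used by the hardware and the big-endian layout
 * expected by Mbed TLS. */
static void reverse_bytes(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len / 2; i++) {
        uint8_t t = buf[i];
        buf[i] = buf[len - 1 - i];
        buf[len - 1 - i] = t;
    }
}
```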

Fig. 6. Change of the byte order<br />
VI. SUMMARY<br />
Adding cryptography to IoT products is a requirement for security and therefore for their overall success. Unfortunately, software implementation of cryptography comes with performance and memory costs on constrained platforms. Adding cryptographic hardware accelerators to the system can potentially solve both problems.<br />
Porting Mbed TLS and Mbed OS to a platform with Arm TrustZone CryptoCell-310 can be done with minimal integration engineering and can result in significant performance improvement: we measured a 15-53x improvement on our target platform. Integrating the accelerator in an operating system instead of directly into the application reduces overall development and maintenance cost. The typical integration tasks performed during the porting were addressing mismatches in function signatures, byte order and followed standards between Mbed TLS and Arm TrustZone CryptoCell-310, all of which can be solved with negligible overhead.<br />

REFERENCES<br />

[1] S. Butcher, “Alternative cryptography engines implementation,”<br />

https://tls.mbed.org/kb/development/hw_acc_guidelines<br />

[2] M. Pégourié-Gonnard, “Porting Mbed TLS to a new environment or<br />

OS,” https://tls.mbed.org/kb/how-to/how-do-i-port-mbed-tls-to-a-newenvironment-OS<br />

[3] “Arm Mbed Reference,” https://os.mbed.com/docs/v5.7/mbed-os-apidoxy/group__hal__trng.html<br />

[4] “Mbed Handbook – Mbed TLS Hardware Acceleration,”<br />

https://docs.mbed.com/docs/mbed-os-handbook/en/latest/advanced/tls_hardware_acceleration/<br />

[5] “Mbed OS Reference – Security/TLS,”<br />

https://os.mbed.com/docs/v5.7/reference/tls.html<br />

[6] D. R. L. Brown, “SEC 1: Elliptic Curve Cryptography,”<br />

http://www.secg.org/sec1-v2.pdf<br />

[7] T. Pornin, “Deterministic Usage of the Digital Signature Algorithm<br />

(DSA) and Elliptic Curve Digital Signature Algorithm (ECDSA),”<br />

https://tools.ietf.org/html/rfc6979<br />

[8] “Digital Signature Standard (DSS) - FIPS PUB 186-4,”<br />

http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.186-4.pdf<br />

[9] “Mbed TLS Benchmark Application on Mbed OS Platform,”<br />

https://github.com/ARMmbed/mbed-os-exampletls/tree/master/benchmar<br />

Fig. 5. Adapting mbedtls_to_cc_rand_func<br />



How Next-Generation Security ICs Deliver a<br />

Stronger Level of Protection<br />

Scott Jones<br />

Maxim Integrated<br />

Micros, Security & Software Business Unit<br />

Dallas, TX, USA<br />

Abstract—An IC-based physically unclonable function (PUF)<br />

has desirable properties that can be utilized by chips that<br />

implement cryptographic functionality. A new PUF<br />

semiconductor solution takes advantage of the random analog<br />

characteristics of MOSFET transistors, the fundamental building<br />

block of CMOS ICs. The PUF is constructed from an analog<br />

circuit element with inherent randomness in I-V characteristics.<br />

At the chip level, the PUF solution is constructed from an array of<br />

these elements sized according to the number of bits required to<br />

achieve the cryptographic requirements of the chip. When needed,<br />

the PUF is operated to derive a per-chip random, unique, and<br />

repeatable binary value that is only accessible by chip crypto<br />

blocks. Thereafter the PUF-derived key value is instantaneously<br />

erased and does not exist in digital form. For a PUF output to be<br />

used as a cryptographic key value, it must be highly reliable and<br />

have appropriate crypto quality. This new PUF solution has<br />

demonstrated the ability to satisfy both requirements.<br />

Keywords—PUF, physically unclonable function, security IC,<br />

security chip, secure authenticator, crypto, cryptography, ChipDNA<br />

I. INTRODUCTION<br />

In a world where embedded electronic systems continue to<br />

come under attack, cryptography provides flexible and effective<br />

tools to address a myriad of potential security threats.<br />

Accordingly, a variety of options exist to implement crypto<br />

solutions with both hardware and software approaches. Given<br />

the dedicated and optimized implementations, it is understood<br />

that a hardware-based solution, i.e. a dedicated security IC, is the<br />

most effective formulation for the root of trust and the way to<br />

provide the countermeasures and protection that prevent<br />

numerous types of common attacks.<br />

The reality is that there are valuable assets associated with<br />

embedded systems that face relentless threats. For example,<br />

such systems encounter intrusions such as theft of intellectual<br />

property, introduction of malware to disrupt or destroy<br />

equipment, unauthorized access to sensitive communication,<br />

tampering with data produced from IoT endpoints, etc. Security<br />

ICs and cryptographic solutions are available today to<br />

address these threats. However, the security ICs themselves can<br />

become the target of attack by an adversary attempting to<br />

circumvent or break the security.<br />

II. ATTACKS ON SECURITY ICS<br />

With an assumption of a security IC-based protection<br />

solution, there are two general categories of attack scenarios:<br />

non-invasive[1] and invasive. Non-invasive attacks consist of<br />

operational measurements, sometimes combined with other<br />

externally applied stimuli, in an effort to obtain cryptographic<br />

keys or other sensitive data. Examples of such efforts include<br />

differential or simple power/electromagnetic analysis<br />

(DPA/SPA/DEMA/SEMA) or the inducing of fault states<br />

through voltage glitching, extreme thermal conditions, or laser<br />

and timing attacks. While the non-invasive attack vectors are<br />

technically complex to address, there are established circuits and<br />

algorithmic countermeasures that are proven effective in<br />

protecting the security IC and sensitive stored data from being<br />

compromised.<br />

Invasive attacks on a security IC consist of direct die-level<br />

circuit probing, modification, deprocessing and reverse<br />

engineering, again with the objective of compromising the<br />

solution by obtaining keys, disabling functionality, or<br />

completely reverse engineering the design to a netlist for<br />

reproduction. The skill set and required tools are more complex<br />

than in the non-invasive scenarios, but they do exist and are<br />

commonly used to attack the security ICs that protect high-value<br />

assets. For example, Fig. 1 and Fig. 2 are examples of the output<br />

from tools that may be used with an invasive attack to first image<br />

a portion of an IC and then extract the netlist and schematics<br />

from the imaging. An attacker would repeat this process for the<br />

entire IC with the ultimate goal of gaining some insight to launch<br />

a sub-circuit attack, or producing a database to replicate the IC.<br />




Fig. 1. Imaged security IC area for schematic/netlist extraction<br />

III. PUF – DECISIVE INVASIVE ATTACK COUNTERMEASURE<br />

A decisive technology that has emerged to provide strong<br />

protection against the invasive threat is the physically<br />

unclonable function (PUF)[4]. A PUF is a function derived<br />

from the complex and variable physical/electrical properties of<br />

ICs. Because PUF is dependent on random physical factors<br />

(unpredictable and uncontrollable) that exist natively and/or are<br />

incidentally introduced during a manufacturing process, it is<br />

virtually impossible to duplicate or clone. PUF technology<br />

natively generates a digital fingerprint for its associated security<br />

IC, which can be utilized as a unique key/secret to support<br />

cryptographic algorithms and services including<br />

encryption/decryption, authentication, and digital signature.<br />

A PUF implementation from Maxim Integrated operates on<br />

the naturally occurring random variation and mismatch of the<br />

analog characteristics of fundamental semiconductor MOSFET<br />

devices. This randomness originates from factors such as oxide<br />

variation, device-to-device mismatch in threshold voltage, and<br />

interconnect impedances. Similarly, the wafer manufacturing<br />

process introduces randomness through imperfect or non-uniform deposition and etching steps. Paradoxically,<br />

semiconductor device parameter variation is normally a<br />

challenge that IC designers face during development. For Maxim’s PUF design, in contrast, it is the fundamental basis and is deliberately exploited.<br />

Fig. 3 provides a simplified block diagram of the Maxim<br />

PUF architecture showing an example key size of 128 bits.<br />

Shown within the PUF core block is a 16x16 array of 256 PUF<br />

elements each of which is an analog structure. Through factory<br />

conditioning these 256 elements are combined into 128 pairs.<br />

From structure to structure, random I/V characteristics due to the previously described parameters exist, and these are utilized to generate binary 1/0 values through precision circuit-level comparison of each element within a pair. For example,<br />

elements {2,1} and {14,16} could constitute a pair, and I/V<br />

characteristics of each would be compared to derive a bit value.<br />

This is repeated with each of the 128 pairs to produce a 128-bit<br />

PUF key output (for this key size example).<br />
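The pairing scheme just described can be sketched as follows. This is an illustration only: the integer "readings" stand in for analog I/V measurements, the pairing table stands in for the factory conditioning, and none of the names come from Maxim's actual design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEYBITS 128  /* example key size from the text */

/* Derive KEYBITS key bits from 2*KEYBITS element readings: each
 * factory-assigned pair of elements is compared, and the comparison
 * outcome becomes one bit of the PUF key. */
static void derive_key_bits(const int elem[2 * KEYBITS],
                            const int pair[2 * KEYBITS], /* 2 indices per bit */
                            uint8_t bits[KEYBITS])
{
    for (size_t i = 0; i < KEYBITS; i++)
        bits[i] = elem[pair[2 * i]] > elem[pair[2 * i + 1]] ? 1 : 0;
}
```

Because the key exists only as these comparisons, nothing digital needs to be stored; re-running the comparisons regenerates the same bits as long as each pair's ordering margin is preserved over the device lifetime.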

Fig. 2. Schematic output from a tool that imaged the area<br />

Like in the non-invasive situation, there are circuit solutions<br />

available to combat invasive attacks. One example consists of<br />

top-level die shields that are actively monitored for a tamper<br />

event and combined with detection circuitry that takes defensive<br />

counteraction. However, the skills and equipment of attackers<br />

employing invasive techniques quickly evolve and have<br />

historically been a challenge to decisively defeat.<br />

Fig. 3. Block diagram of Maxim Integrated’s PUF architecture<br />



From an invasive attack perspective, any probing or<br />

attempted analog measurement of a PUF element causes the<br />

analog electrical characteristic to change due to factors including<br />

capacitive/inductive/resistive loading. As a result, it is not<br />

possible to extract any key data through invasive measurements.<br />

Also, due to the statistical nature of imperfect manufacturing<br />

techniques, there is no known method to discern any key<br />

information from inspection methods. Similarly, even<br />

knowledge of PUF element pairing does not reveal any<br />

information about the key value that would ultimately be derived<br />

from the analog characteristics of the PUF element structures.<br />

Finally, the PUF key value only exists digitally when a<br />

cryptographic operation is performed; thereafter, it is<br />

instantaneously erased. Combined, these attributes of this PUF<br />

design result in a solution that is highly immune to invasive<br />

attacks.<br />

IV. PUF RELIABILITY AND CRYPTO QUALITY<br />

From a cryptographic perspective, reliability and<br />

randomness are critical characteristics that a PUF solution must<br />

exhibit. For use as a cryptographic key, or root thereof, the PUF<br />

output must have 100% reliability, meaning PUF-derived key bit<br />

values must be repeatable over time and all operating conditions.<br />

For semiconductor devices, this evaluation is performed using<br />

JEDEC[5]-defined, industry-proven methods of reliability study.<br />

This includes selecting and subjecting a statistically significant<br />

sample set of devices to environmental and operational stress<br />

conditions that enable evaluation of lifetime reliability<br />

performance. These stresses include high-temperature operating<br />

life (HTOL), temperature cycling, packaging and solder reflow<br />

influences, voltage and temperature drift, and highly accelerated<br />

temperature/humidity stress testing (HAST). Performing a<br />

reliability qualification study using these proven methods results<br />

in a statistical assessment of how a design will perform over the<br />

life of its use in a system. For example, consider that a system end product could have a design life of 10 years and operate within −40°C to +85°C environments with power sources that can fluctuate by ±10%.<br />

Equally critical with a PUF solution is the requirement for<br />

high-performance cryptographic quality, with a key property<br />

being randomness. Low-quality randomness can create a<br />

cryptographic attack vulnerability through predictability<br />

weakness. Statistical test suites, including NIST[2] SP 800-<br />

22[3], provide an industry-proven means to measure<br />

randomness of PUF output. Evaluation against the test suite<br />

provides several metrics which determine whether the PUF<br />

output is consistent with a random sequence. To be statistically<br />

significant, these tools require large data sets for the analysis,<br />

e.g. 20kbit sequences. Therefore, the output from a large set of<br />

PUF instances is required and used for the assessment.<br />

V. RELIABILITY STUDIES ON PUF<br />

The reliability of Maxim’s PUF was proven from results<br />

obtained via a lifetime reliability analysis as described<br />

previously. Fundamentally, the reliability study produced data<br />

to understand the shift from aging, temperature/voltage drift, IC<br />

packaging, PCB assembly, etc., of the PUF elements. Relative<br />

to the time-zero characteristics of two PUF paired elements, the<br />

post-reliability study paired elements have been shown to<br />

consume ~7% of the total margin available to maintain the<br />

stability of the output binary value. The final output from the<br />

analysis is a PUF key error rate (KER) of ≤5ppb, where KER is<br />

defined as the probability that 1 bit within the total key produced by the PUF, e.g. 256 bits, would flip over the life of the<br />

product.<br />

A randomness assessment of the PUF relied on performance<br />

to NIST standard SP 800-22 monobit, poker, runs test, and long<br />

run test. These are test suites that evaluate whether output data<br />

is consistent with a random sequence. Assessment results for<br />

each of the four tests validate excellent performance with respect<br />

to randomness.<br />

To evaluate immunity to invasive attack and reverse<br />

engineering, the Maxim PUF solution was evaluated by a<br />

leading US-based company[6] that specializes in die level<br />

security assessments and IC reverse engineering expertise. Within the given assessment time frame, there was no compromise of PUF operation, and the assessment reached the qualitative conclusion that the solution is “highly effective and resistant against physical Reverse Engineering attacks”.<br />

VI. PUF USE CASES<br />

Numerous use cases exist for a PUF solution. Three are<br />

shown in Fig. 4, Fig. 5, and Fig. 6. In Fig. 4, to secure all stored<br />

data on a security IC, the PUF derived key is used to<br />

encrypt/decrypt data as needed using an algorithm such as AES.<br />

Any NVM data extracted from an invasive attack is useless<br />

given its encrypted state and inability to obtain the PUF-based<br />

decryption key. Fig. 5 shows the use of PUF as the unique<br />

private key for ECDSA signing operations. For this case the<br />

device would compute its own public key from the PUF private<br />

key and a certificate would be installed in NVM by a certificate<br />

authority prior to end-use deployment. In Fig. 6, the PUF private<br />

key is the root private key for the security IC and is used in<br />

conjunction with the end system to establish a “root of trust”<br />

with the security IC for subsequent services.<br />

Fig. 4. Encrypting IC NVM with the PUF secret key<br />

Fig. 5. ECDSA signing with PUF as the private key<br />




Fig. 6. PUF as trust anchor private key<br />

VII. MAXIM’S COMMERCIAL PUF-BASED SECURITY IC<br />

Maxim introduced its first PUF-based security IC, the<br />

DS28E38[7], in November 2017. The DS28E38 is an ECDSA<br />

authenticator that utilizes the company’s ChipDNA PUF<br />

output as key content to cryptographically secure all device-stored data. Optionally, under user control, ChipDNA is used as<br />

the private key for ECDSA signing operations. The device<br />

provides a core set of cryptographic tools derived from<br />

integrated blocks including asymmetric (ECC-P256) and<br />

symmetric (SHA-256) hardware engines, a FIPS/NIST-compliant true random number generator (TRNG), 2Kb of<br />

secured EEPROM, a decrement-only counter, and a unique 64-<br />

bit ROM identification number (ROM ID). The ECC<br />

public/private key capabilities operate from the NIST-defined P-<br />

256 curve to provide a FIPS 186-compliant ECDSA signature-generation function. A block diagram of the DS28E38 is shown<br />

in Fig. 7.<br />

VIII. SUMMARY<br />

Embedded systems have electronic assets that can be<br />

protected by cryptography. Security ICs with cryptographic<br />

functions provide optimal protection, but, ultimately, become<br />

the attack point by those attempting to compromise the assets.<br />

Furthermore, attackers are becoming increasingly sophisticated<br />

in their techniques. A decisive countermeasure to the invasive<br />

attack is the PUF, which, due to its inherent qualities, can be<br />

highly immune to reverse-engineering methods.<br />

IX. TRADEMARKS<br />

ChipDNA is a trademark of Maxim Integrated Products, Inc.<br />

X. REFERENCES<br />

[1] https://en.wikipedia.org/wiki/Side-channel_attack<br />

[2] National Institute of Standards and Technology, NIST, Current Federal<br />

Information Processing Standards (FIPS)<br />

https://www.nist.gov/itl/current-fips.<br />

[3] https://csrc.nist.gov/Projects/Random-Bit-Generation/Documentation-and-Software<br />

[4] https://en.wikipedia.org/wiki/Physical_unclonable_function<br />

[5] JEDEC standards for microelectronics https://www.jedec.org/<br />

[6] MicroNet Solutions, Inc. http://micronetsol.net/<br />

[7] https://www.maximintegrated.com/en/products/digital/memory-products/DS28E38.html<br />

Fig. 7. Block diagram of Maxim’s PUF-based secure authenticator<br />



Timon, Rex and Tux<br />

How TPMs and On-Chip Security Modules improve Trust and Security in GNU/Linux<br />

Dipl.-Ing. Michael Roeder<br />

Technology Engineering and Services CE<br />

Avnet Silica<br />

Poing, Germany<br />

Michael.Roeder@avnet.eu<br />

Dipl.-Inf. Martin Hecht<br />

Technology Engineering and Services CE<br />

Avnet Silica<br />

Berlin, Germany<br />

Martin.Hecht@avnet.eu<br />

Abstract—Although Hardware Security Modules (HSM) to<br />

accelerate cryptographic operations and to perform authenticated<br />

or encrypted boot have been integrated into numerous SoC for<br />

years, they are rarely used in today’s applications. Implications of<br />

using them (both positive and negative) are mostly unknown to the<br />

majority of designers.<br />

At the same time, Trusted Platform Modules (TPM) are<br />

established more and more in embedded and industrial<br />

applications and support for TPM 2.0 in the Linux kernel has<br />

arrived. This prompts the question of to what extent TPMs can take<br />

over some of these functionalities.<br />

This paper gives an introduction into both technologies and<br />

their advantages and disadvantages for certain use-cases.<br />

We look into scenarios like encrypted, authenticated and<br />

measured boot over the various boot stages and the use of<br />

hardware security in the Linux Kernel and in applications such as<br />

OpenSSL, StrongSwan, along with the respective stacks involved.<br />

We show ways to combine hardware security technologies and<br />

software algorithms to create best-in-class solutions but also<br />

explore which hardware functionalities are currently supported in<br />

software and what is missing to create a complete, trusted solution.<br />

Keywords— Security, Trust, trusted boot, authentication, TPM,<br />

TPM2.0, HSM, trust architecture, measurement, attestation<br />

I. INTRODUCTION<br />

Over the past weeks, while the authors were finishing this<br />

paper, the Meltdown and Spectre attacks against modern<br />

processors and SoCs were intensively discussed in professional<br />

and popular press. One of the positive outcomes of such<br />

immense public interest for security concerns is that lots of<br />

people start re-thinking their concepts (and side-effects) of<br />

security and data protection. This is especially important in the<br />

consumer/IOT space, where for devices such as IP cameras or<br />

garage door openers, security used to be an afterthought. Now,<br />

that these products enter the market in big quantities and are sold<br />

even in discounter supermarkets, both public and government<br />

are alerted to potential misuse and the dangers posed by cracking attempts against these devices. Federal agencies have started<br />

to look into criteria to be met for devices transmitting personal<br />

data over open communication channels and how to ensure the<br />

integrity of such devices.<br />

SoCs have been equipped with cryptographic accelerators<br />

for years and some like NXP’s Layerscape families, the<br />

i.MX6UL3 or i.MX8 offer amazing hardware security features<br />

by combining crypto accelerators (hardware security engines,<br />

HSM) with tamper detection features. However, these features<br />

are vendor- and part-specific, and it is hard to base a common<br />

security strategy on proprietary features.<br />

At the same time, Trusted Platform Modules (TPM) are<br />

established more and more in embedded and industrial<br />

applications and support for TPM 2.0 in the Linux kernel has<br />

arrived for some devices. This prompts the question to what<br />

extent TPMs can take over some of these functionalities.<br />

This paper gives an introduction into both technologies and<br />

their advantages and disadvantages for certain use-cases.<br />

However, this paper can only give an introduction and<br />

completely leaves out implementation specifics and some<br />

details. Feel free to contact the authors for more details on the<br />

topics.<br />

1 HARDWARE-ACCELERATED SECURITY<br />

In this chapter we take a closer look at HSMs and TPMs as<br />

hardware implementations to provide security in embedded<br />

systems. We also discuss some generic advantages hardware<br />

security implementations provide over software.<br />

1.1 Motivation<br />

In 1995, former NSA Chief Scientist Robert Morris said:<br />

“Systems built without requirements cannot fail;<br />

they merely offer surprises. Usually unpleasant!”<br />

Some of the basic security requirements, when talking about<br />

SoC-based systems (as usually mentioned in threat analysis<br />

documents and security requirement sheets) are:<br />

• Access Control: access and remote access to the device has to be denied to unauthorized users<br />

• Anti-Cloning: measures against overbuilding and counterfeiting of devices<br />

• IP Protection: the manufacturer’s intellectual property (e.g. software, FPGA netlists) is protected against theft<br />

• Confidentiality: data is encrypted, especially in communication to the outside world or when written to memories<br />

• Resilience: the device can detect attacks and initiate measures to protect data<br />

• Data Integrity: the data generated by or exchanged with the system is protected against modification<br />

• Non-Repudiation: the device can prove that data was generated by it and check that data arriving at it has the correct origin.<br />

In the following chapters, we will give a short overview<br />

of how and why these requirements are addressed in recent<br />

SoC hardware security modules (HSMs) and the limitations<br />

imposed by them. We will also show which functionality Trusted<br />

Platform Modules (TPMs) provide that can be added as a peripheral to existing<br />

systems to enhance security. We will discuss how TPMs can be<br />

used to complement or replace the integrated SoC functionality,<br />

along with the advantages and disadvantages of doing this.<br />

1.2 HSMs (Hardware Security Modules)<br />

First, we take a look at the basic functionalities provided<br />

by HSMs in SoCs. These are as follows:<br />

• Trust: measures taken and functionality provided to ensure that the system can be trusted and is untampered after boot and during operation. This includes secure and encrypted boot functionality, certified true random number generators, secure storage and use of individual keys, and protection against access from the non-secure world from either software (malware) or hardware (JTAG, debugging pins). ARM TrustZone offers some basic support to isolate trusted from untrusted software parts and to restrict system access of untrusted software. Some SoCs offer far more advanced hardware features to assist software (e.g. hypervisors or secure operating systems) in separating software and restricting access to specific hardware resources using a rule-based system. Trust mechanisms are crucial to provide Access Control, Anti-Cloning, IP Protection and Non-Repudiation capabilities to the product.<br />

• Hardware Crypto Engines: acceleration of crypto algorithms to offload the CPU and add additional security and key protection to crypto processes. This unit is usually closely linked to the units providing Trust and Tamper Resistance, but can also be leveraged from user applications to provide Confidentiality, Data Integrity and Non-Repudiation at user level.<br />

• Tamper Resistance: provides protection against attacks that move the SoC out of its regular specifications. This includes units providing a secure RTC and active tamper pin monitoring along with temperature, clock and voltage monitors. The tamper unit is closely connected to all key storages in the crypto accelerators to ensure deletion of critical memories upon a tamper attempt. These units are required to provide Resilience and Anti-Cloning protection.<br />

HSMs provide the best performance, power efficiency<br />

and hardware cost and are ideally integrated into the SoC<br />

functionality. Therefore they can provide a comprehensive<br />

“security package” to the SoC user to leverage and are most<br />

easily used to achieve common security targets. For example,<br />

the integrated secure key memory may have a dedicated<br />

connection to the crypto unit to provide the encryption key to it<br />

without being snooped over system busses, or the tamper<br />

detection unit automatically erases the secure key storage and<br />

other critical memory areas, if an attack is detected.<br />

However, HSMs also have some disadvantages in the<br />

following areas.<br />

• Standardization and Reusability: Most SoC vendors<br />

either develop their security modules as internal IP by<br />

themselves or buy third party IP which is then integrated<br />

into the SoC. There are no standardized ways in which these<br />

modules are designed, integrated and used in the system<br />

scope. Therefore, a security concept and software written<br />

for one system can’t be easily migrated to a different one.<br />

HSMs, drivers and usage concepts will most likely differ<br />

even among members of the same family of one vendor.<br />

This gets worse if a common security concept has to be<br />

developed and maintained among several different<br />

platforms in a company.<br />

• Certification and Trust in Implementation<br />

Correctness: this problem arises if security certifications<br />

such as Common Criteria (EALn) or NIST are desired for<br />

the end product. Achieving such a certification usually<br />

requires providing a lot of material, sometimes including<br />

(semi-)formal verification reports or enabling source<br />

code/HDL review to the auditor. The SoC end user is totally<br />

dependent on the SoC manufacturer to assist in providing<br />

(and disclosing) this data, which in most cases will not<br />

happen. Even access to functional documentation is<br />

sometimes restricted to users or only available under NDA,<br />

which further decreases the trust level in these solutions.<br />

“Security through Obscurity” comes to mind.<br />

Unfortunately, this is not only a theoretical point: security<br />

problems in SoC hardware implementations that, with an<br />

open and public implementation, would probably have been<br />

detected within months are exposed on a regular basis in<br />

such obscure implementations (the last one known to the<br />

authors: [1]).<br />

• Ecosystem: the complete ecosystem (software stacks,<br />

drivers, manufacturing utilities) is provided by the SoC<br />

vendor and therefore single-source and supported only by<br />

one company. Sometimes SoC vendors are hesitant to<br />

integrate complete support for their solutions into u-boot or<br />

Linux mainline to avoid exposing too much knowledge<br />

about the actual implementation so that users are still<br />

required to stick with proprietary versions.<br />



1.3 TPMs (Trusted Platform Modules)<br />

In contrast to HSMs, which are internal to the SoC, Trusted<br />

Platform Modules (TPM) are external low-cost cryptographic<br />

modules. Trusted computing platforms may use a TPM to<br />

enhance privacy and security scenarios that software alone<br />

cannot achieve. A TPM offers the four primary capabilities<br />

Authentication, Platform Integrity, Secure Communication,<br />

and IP Protection. Depending on the version of the TPM,<br />

different cryptographic algorithms are implemented in the<br />

module. Additionally, TPMs include a small, secured nonvolatile<br />

memory that can be used by user space applications to<br />

store confidential information. Other units of the TPM can be<br />

used to implement policies to manage access to this memory.<br />

The hardware specification of TPMs is maintained by the<br />

Trusted Computing Group (TCG) [2] as a non-profit<br />

organization. The TCG also drove the specification to be<br />

accepted as international standard ISO/IEC 11889/15 which<br />

corresponds to TPM 2.0. All specifications as well as the<br />

according API are open to allow wide adoption and integration<br />

into any operating system and application software.<br />

[Figure: TPM Family 2.0 Block Diagram — I/O (I2C, LPC, SPI), Execution Engine, Key Generation, RNG, Power Detection, asymmetric/symmetric/hash engines, management and authorisation units, non-volatile memory (Platform, Endorsement and Storage Seeds, monotonic counters) and volatile memory (PCR banks, keys in use, sessions)]<br />

1.3.1 TPM Hardware and internal Firmware<br />

As of today, TPMs are mostly separate small hardware<br />

modules that are hardened against several forms of electrical,<br />

environmental and physical attacks to ensure that neither keys<br />

can be stolen nor the implemented cryptographic algorithms<br />

influenced to break the system security. Other<br />

implementations such as firmware TPMs are possible.<br />

In general, TPMs are passive components that neither<br />

measure nor monitor nor control anything directly on the system.<br />

They are physically connected by using standardized bus<br />

systems such as I2C, LPC or SPI to receive commands and<br />

send responses. So they cannot influence the host system actively, e.g.<br />

by stopping the execution of some kind of code on the host CPU<br />

or generating a system reset. The owner of the system even has<br />

the responsibility to manage the TPM by turning it on or off or<br />

to reset and initialize it. Unlike HSMs, TPMs are relatively slow<br />

and cannot be used as cryptographic accelerators for<br />

encryption. The bus system is usually speed limited and<br />

depending on the particular TPM implementation some<br />

commands have execution times of several seconds. The most<br />

relevant use cases are key generation, key encryption to store<br />

keys externally, key signature and certification, and last but not<br />

least improved random number generation.<br />

In this paper we will focus on TPM version 2.0 and skip<br />

version 1.2 for the simple reason that TPM 1.2 only implements<br />

SHA-1 as its cryptographic hash algorithm. As of today, there exist<br />

several serious attacks against SHA-1 [3] which leads to the<br />

conclusion that TPM 1.2 cannot be used to enhance the security<br />

of a system. Instead, in TPM 1.2 based security concepts, the<br />

TPM itself is now usually considered to be the weak point of that<br />

system. TPM 2.0 comprises all features of TPM 1.2 but with<br />

significant enhancements like an offering of several mandatory<br />

and optional algorithms instead of just a few isolated ones.<br />

As shown in the block diagram above, a TPM comprises<br />

several hardware blocks. An important block is the non-volatile<br />

memory. In the production process the TPM vendor programs<br />

four individual and unique primary seeds. Three of these are<br />

permanent ones which only change when the TPM2 is cleared:<br />

Endorsement (EPS), Platform (PPS) and Storage (SPS).<br />

Additionally, the TPM firmware implements a Key Derivation<br />

Function (KDF). A seed (which is simply a long random<br />

number) is hereby used as input to the KDF along with the key<br />

parameters and the algorithm to produce a key based on this<br />

seed. The KDF is deterministic, so if you input the same<br />

algorithm and the same parameters you will get the same key<br />

again. There’s also a Null seed, which is used for ephemeral keys<br />

and changes with every reboot, reset or power-on. Seeds<br />

are never exposed by the TPM. The simple, unprotected physical<br />

connection between the TPM and CPU invites the idea to snoop<br />

on that connection to explore exported and imported keys.<br />

However, key import or export is always handled as encrypted<br />

key blobs (e.g. using AES) to ensure that keys generated in the<br />

TPM are protected. Using the clear command destroys all keys<br />

generated based on EPS, PPS and SPS along with these keys<br />

themselves.<br />
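The deterministic, seed-based key derivation described above can be illustrated with a short sketch. This is not the actual TPM 2.0 KDFa (which follows NIST SP 800-108 and runs entirely inside the TPM); it is a simplified counter-mode HMAC construction with hypothetical labels and values, included only to show why derivation from a seed is repeatable:

```python
import hashlib
import hmac

def kdf(seed: bytes, label: bytes, context: bytes, n_bytes: int) -> bytes:
    """Simplified counter-mode HMAC KDF in the spirit of TPM 2.0's KDFa
    (NIST SP 800-108). The real TPM never exposes the seed; this sketch
    only illustrates why derivation is deterministic."""
    out = b""
    counter = 1
    while len(out) < n_bytes:
        msg = counter.to_bytes(4, "big") + label + b"\x00" + context
        out += hmac.new(seed, msg, hashlib.sha256).digest()
        counter += 1
    return out[:n_bytes]

seed = b"\x11" * 32          # stands in for e.g. the Storage Primary Seed
k1 = kdf(seed, b"STORAGE", b"key-params-A", 32)
k2 = kdf(seed, b"STORAGE", b"key-params-A", 32)
k3 = kdf(seed, b"STORAGE", b"key-params-B", 32)
assert k1 == k2      # same seed + same parameters -> same key
assert k1 != k3      # different parameters -> different key
```

Because the seed never leaves the TPM, an attacker who knows the key parameters still cannot reproduce a derived key.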

TPMs also contain several monotonic, un-resettable<br />

counters which can be used for instance to count the number of<br />

firmware updates or other events.<br />

There also exists a special volatile memory block on each<br />

TPM. This block comprises memory locations to store keys and<br />

session information. For TPM 2.0 this block also contains two<br />

banks of 24 Platform Configuration Registers (PCR) that can<br />

be used for measurement, which works as follows:<br />

• The PCRs cannot be written directly.<br />

• PCR 0 to 15 can only be extended by a value after an initial reset directly after power-on.<br />

• An individual reset of PCR 16 to 23 can be triggered by the user.<br />


The extend formula as calculated internally in the TPM is the<br />

following:<br />

PCR[i]_{n+1} := hash( PCR[i]_n || extend_value )<br />

The index i specifies which PCR register will be extended and n<br />

is the current state of the PCR register. PCR registers are usually<br />

extended multiple times with data like sets of code,<br />

configuration data or policies to calculate a measure of this data.<br />

In other words, the measurement value of a PCR after some<br />

extensions is a measure of the code which was used as extend<br />

values. Due to the nature of a strong hash function, the PCR<br />

values change significantly even for minor changes in the extend<br />

values. At the same time, it is close to impossible to calculate<br />

PCR extend values to achieve a desired PCR value. Comparing<br />

PCR values in certain system states (e.g. after boot up) versus<br />

saved reference values is therefore a way to assess the<br />

trust state of a system.<br />
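The extend-and-compare mechanism can be sketched in a few lines. This is a plain SHA-256 simulation of the formula above, not real TPM I/O, and the component names are made up:

```python
import hashlib

def pcr_extend(pcr: bytes, extend_value: bytes) -> bytes:
    # PCR[i]_{n+1} := hash(PCR[i]_n || extend_value)
    return hashlib.sha256(pcr + extend_value).digest()

pcr = b"\x00" * 32  # PCRs start zeroed after reset
for component in (b"first-stage bootloader", b"u-boot", b"kernel"):
    pcr = pcr_extend(pcr, hashlib.sha256(component).digest())
reference = pcr  # golden value saved for later comparison

# Any change in any stage yields a completely different final value:
tampered = b"\x00" * 32
for component in (b"first-stage bootloader", b"evil-u-boot", b"kernel"):
    tampered = pcr_extend(tampered, hashlib.sha256(component).digest())
assert tampered != reference
```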

Another important block of a TPM is the Execution Engine,<br />

a small internal MCU which executes the protocol stack for host<br />

communication and controls the asymmetric and symmetric<br />

engines, key generation, key entropy checking, module self-test<br />

and other operations. An additional power detection block<br />

monitors external events like power on and is an essential part<br />

of the tamper detection.<br />

Depending on the particular purpose of the system, the TCG<br />

publishes so-called Platform Profile specifications to define a<br />

mandatory set of capabilities of the TPM as well as optional<br />

extensions for certain use cases. One example is the PC Client<br />

Platform TPM Profile (PTP) Specification for TPM family 2.0<br />

(which can be used for Embedded Systems).<br />

The picture below shows mandatory and optional algorithms<br />

and curves for elliptic curve cryptography as defined in the<br />

PTP. Other profiles, e.g. for Automotive and Automotive Thin<br />

Clients applications exist as well.<br />

1.4 Why not simply use a software implementation?<br />

The following chapters will show that for many use cases,<br />

speed and security can be greatly improved when using hardware-<br />

assisted cryptographic modules. However, in recent ARMv8,<br />

Intel and AMD CPUs specific instructions have been<br />

implemented that help to speed up algorithms such as AES<br />

significantly [4]. So for mere speed reasons, or if the key is<br />

exposed to parts of the software or operating system anyway,<br />

using such optimized implementations is a real option. In this<br />

paper’s chapter about HSMs and TPMs on application level<br />

(chapter 4) we will also have a look at these software<br />

implementations and compare them with the ones in HSMs.<br />

However, the topic of the next chapter, establishing trust on an<br />

embedded system is a good example of a use case that highly<br />

profits from hardware implementations.<br />

2 TRUSTED SYSTEMS<br />

In this chapter, we take a closer look at possibilities of adding<br />

trust to embedded systems. This point is especially important for<br />

dynamically adaptable and network-connected systems (that<br />

might be attacked through this network), but it is also a sensible<br />

measure to prevent system crashes or malfunctioning<br />

software from harming the system. Examples of<br />

dynamically adaptable systems are systems that support remote<br />

updates, user software installation or user customization.<br />

When looking at trusted systems, the terms root of trust and<br />

chain of trust are important to understand so we look at these<br />

first.<br />

2.1 Root of Trust, Chain of Trust<br />

Similar to real life, trust in the embedded systems world has<br />

to be “earned”. However, unlike humans who have the luxury of<br />

taking time to develop their own assessment whom to trust, in<br />

the embedded world trust has to be immediate in most cases.<br />

Therefore, a concept called “chain of trust” is used, which is<br />

based on inheritance. It starts with an implicitly trusted<br />

component, the so called root of trust. This component then<br />

evaluates other components and decides if they can be trusted as<br />

well. If so, these components are added to the trust base, can be<br />

executed and can act themselves as new assessors of trust for<br />

other components. This way, a chain of trust is constructed<br />

which (ideally) results in a completely assessed and trusted<br />

system. The picture below illustrates a chain of trust based on<br />

the boot process of an embedded system.<br />



In this case, the ROM boot loader acts as the root of trust.<br />

Authentication of components (e.g. u-boot) is usually done<br />

based on hash values of the binary code of these components<br />

using either the complete component or selected (security<br />

relevant) parts of it. A root of trust is established by an<br />

unchangeable, implicitly trusted piece of code. Naturally, this<br />

code needs to be reviewed extensively for potential<br />

vulnerabilities and functional correctness and should therefore<br />

be kept as small as possible. In most SoC implementations<br />

supporting hardware authentication, the root of trust is generated<br />

in the ROM bootloader (or BIOS code, if applicable). Since the<br />

complete trust of a system is inherited from the root of trust, it<br />

stands and falls with its correctness. Therefore special diligence<br />

needs to be exercised when selecting and evaluating the root of<br />

trust. In the next chapter, we will look at specific hardware<br />

implementation concepts to authenticate components and their<br />

advantages and disadvantages.<br />

2.2 Authenticated and Encrypted Boot using HSMs<br />

The basic task of authenticating software is very simple:<br />

generate a hash value and compare it to a reference value of this<br />

hash. If they match, the software is authenticated. However, this<br />

prompts some questions with conflicting answers:<br />

• Where should the reference hash be stored? Since the reference hash is the foundation of the decision whether an image is valid, it should be stored in a secure, unmodifiable location, such as in fuse arrays.<br />

• What can I do if my software changes (updates) and the hash needs to be updated? To update the hash, it needs to be stored in a modifiable memory, such as flash.<br />

• How can I dynamically store multiple hashes to validate multiple images, and how can I assign a hash to an image? Fuse memory is usually limited and the assignment should be fixed for security reasons.<br />

The solution to these conflicting answers is using<br />

private/public key cryptography, which works as follows. To<br />

sign an image for a target system, a combination of private and<br />

public key is generated during production on a development PC.<br />

This private key is then used to encrypt the hash value, and<br />

the resulting signature is attached directly to the image along with the matching public<br />

key used to verify this encrypted hash. Due to the nature of<br />

private/public key cryptography, it is very simple to verify the<br />

encrypted hash using the public key, but practically impossible<br />

to guess the private key to generate such a signature (e.g. with<br />

the updated hash value after modifying an image). To ensure that<br />

attackers can’t just switch to a completely new pair of<br />

private/public key and authenticate their own images, a hash<br />

value of the public key is saved to an immutable internal<br />

memory location and checked against before using the public<br />

key. This methodology ensures that multiple images can be<br />

signed using the same private/public key with a fixed<br />

consumption of one-time programmable memory, as long as the<br />

signatures of the hashes are attached to their respective images.<br />

The scheme to the right shows the process of signing an image<br />

on the host PC in a secure environment. The blue and green<br />

blocks identify the cryptographic functions used, while yellow<br />

blocks represent actual keys that are generated and used in the<br />

process. After generating a pair of (secret) private and (publicly<br />

available) public key, the private key is used to sign the hash<br />

value of the software image to be authenticated. Usually,<br />

SHA-256 is used as the hash function and RSA for private/public<br />

key cryptography. The generated signature is then attached to<br />

the image along with the public key and the complete binary is<br />

written to flash memory. Using the attached public key, the<br />

image signature can then be authenticated by the SoC. To prevent<br />

attackers from simply replacing the pair of private and public key<br />

with their own and using these to sign a new (modified)<br />

image, a hash of the public key to be used is generated and<br />

burned to the SoC fuses (one-time programmable memory).<br />

On the SoC side, this process is simply inverted as shown in<br />

the image on the next page. After extracting signature and public<br />

key from flash, the signature is checked against the calculated<br />

hash of the image. If this succeeds and the hash of the public key<br />

matches the one saved in the fuses, the authentication succeeds<br />

and the image is booted.<br />
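The sign-and-verify flow can be illustrated with a deliberately tiny textbook-RSA sketch. The toy key sizes and the absence of padding exist purely to show the roles of the private key, the public key and the fused public-key hash; a real implementation uses 2048+ bit RSA with proper padding (e.g. PKCS#1):

```python
import hashlib

# Toy textbook RSA with tiny primes -- for illustration only.
p, q = 61, 53
n, e = p * q, 17
d = pow(e, -1, (p - 1) * (q - 1))      # private exponent

def sign(image: bytes) -> list:
    digest = hashlib.sha256(image).digest()
    return [pow(b, d, n) for b in digest]   # "encrypt" hash with private key

def verify(image: bytes, signature: list, pubkey=(n, e)) -> bool:
    digest = hashlib.sha256(image).digest()
    return [pow(s, pubkey[1], pubkey[0]) for s in signature] == list(digest)

# A hash of the public key is burned to fuses at production time:
fused_pubkey_hash = hashlib.sha256(f"{n},{e}".encode()).digest()

image = b"u-boot image"
sig = sign(image)
# Boot-time checks: pin the public key, then verify the signature.
assert hashlib.sha256(f"{n},{e}".encode()).digest() == fused_pubkey_hash
assert verify(image, sig)                  # authentic image boots
assert not verify(b"modified image", sig)  # tampered image is rejected
```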

To make this process of authenticating an image as attack-safe<br />

as possible, it is usually implemented in hardware, using<br />

Finite State Machines (FSMs) or small, completely isolated<br />

controllers. NXP’s CAAM engine is an example of such<br />

hardware, which can perform authentication in a completely<br />

automated way. Highly simplified, after being told where the<br />

respective image to authenticate is located, it performs the<br />

authentications and then updates the chip “secure” state. If the<br />

images has been authenticated successfully, it keeps the state in<br />

“secure”, otherwise it changes it to “unsecure”. Once the state<br />

has been changed to “unsecure” it can’t be changed back to<br />

“secure” without a hard reset. This way, a chain of trust is<br />

established. The authentication can be invoked from software<br />

(e.g. ROM bootloader or u-boot) by jumping into functions<br />



located in the SoC’s ROM with some registers informing about<br />

the current trust state.<br />

If encryption for the images is added to the boot process, an<br />

additional step is required which involves encrypting the image<br />

key with an integrated device key, therefore generating a<br />

cryptographic blob which can also be attached to this image.<br />

This process allows users to select their individual key, but still<br />

keeps security at a high level by encrypting this key with unique,<br />

random keys not known to anybody and specific to the device.<br />

In other words, if two devices boot images which are encrypted<br />

with the same key, the cryptographic blobs of these keys are<br />

different, while the encrypted image itself is the same.<br />
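The device-specific blobbing of a shared image key can be sketched as follows. A toy XOR "wrap" stands in for the real AES-based blob mechanism, and the device keys here are random stand-ins for the fused, device-unique keys:

```python
import hashlib
import os

def toy_wrap(image_key: bytes, device_key: bytes) -> bytes:
    """Toy key-blob wrap: XOR with a hash-derived keystream. Real SoCs
    use an AES-based blob mechanism keyed by a fused device-unique key."""
    stream = hashlib.sha256(device_key + b"blob").digest()
    return bytes(a ^ b for a, b in zip(image_key, stream))

image_key = os.urandom(32)                       # OEM-chosen image encryption key
dev_a, dev_b = os.urandom(32), os.urandom(32)    # device-unique fused keys

blob_a = toy_wrap(image_key, dev_a)
blob_b = toy_wrap(image_key, dev_b)
assert blob_a != blob_b                      # same image key, device-specific blobs
assert toy_wrap(blob_a, dev_a) == image_key  # XOR wrap is its own inverse
```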

This process greatly simplifies remote updates of images,<br />

because it allows OEMs to use the same private key and<br />

encryption key on multiple images, but still minimizes the attack<br />

surface for parallelized brute-force attacks.<br />

This high level of security is achieved by using hardware<br />

security units. In pure software, it would be impossible to<br />

realize, starting with the inability of processor cores to boot<br />

encrypted code, which would imply keeping some code<br />

unencrypted and unsigned. As a consequence, this code and the<br />

keys used for accessing and verifying flash program code would<br />

have to be stored on protected, one-time programmable<br />

memories, making it expensive and hard to update and adapt to<br />

exposed security leaks. Another advantage of using hardware<br />

security for authentication is the possibility to integrate these<br />

units with other security units on the SoC. For example, the<br />

symmetric units can also be used to export/import user specific<br />

keys or user data into cryptographic blobs. Upon detected<br />

attacks or images not being authenticated, internal key memories<br />

can be cleared and access to critical IP modules can be<br />

prevented.<br />

If this method is implemented in a functionally correct way, it<br />

is the most practical and secure way of establishing a root of<br />

trust. However, it also has some disadvantages:<br />

• License Restrictions: Some open source licenses (e.g.<br />

GPLv3) impose the requirement on a system that users<br />

should be allowed to completely exchange the operating<br />

system on these devices. [5] for example states:<br />

“When it comes to security measures governing a<br />

computer’s boot process, GPLv3’s terms lead to one<br />

simple requirement: Provide clear instructions and<br />

functionality for users to disable or fully modify any boot<br />

restrictions, so that they will be able to install and run a<br />

modified version of any GPLv3-covered software on the<br />

system.”<br />

Because HSM-assisted trusted boot cannot be disabled in<br />

hardware once enabled, measures (e.g. using an<br />

intermediate boot loader) need to be implemented to ensure<br />

that this is possible. Contact the authors for more<br />

information on this topic.<br />

• Trust / Certification issues: as mentioned in chapter 1.2.<br />

• Little flexibility for bug fixing: Since the implementation<br />

is completely in hardware with no software interaction, it<br />

offers little flexibility of later improvement or bug fixing if<br />

errors are detected or changes have to be made due to<br />

exposed problems or security holes in later system updates.<br />

2.3 Measured Boot with TPM<br />

As mentioned in the last chapter, depending on the profile<br />

specification the PCR extend functionality can be used to<br />

measure dedicated parts of code, configuration data and policies.<br />

This functionality can for example be used to measure the<br />

components involved in the boot of a system. The table below<br />

shows how PCRs are defined to be used on UEFI-enabled x86<br />

systems and a recommendation of how to adapt this usage on ARM<br />

systems with GNU/Linux.<br />

PCR    PCR usage (UEFI + x86)                                                       PCR usage (ex. for ARM)<br />
0      SRTM, BIOS, Host Platform Extensions, Embedded Option ROMs and PI Drivers    Boot ROM, if accessible<br />
1      Host Platform Configuration                                                  First Stage Boot Loader<br />
2      UEFI driver and application Code                                             u-boot (including SPL)<br />
3      UEFI driver and application Configuration and Data                           u-boot Environment<br />
4      UEFI Boot Manager Code (usually the MBR) and Boot Attempts                   GPT/Partition Table<br />
5      Boot Manager Code Configuration and Data (for use by the Boot Manager Code)<br />
6      Host Platform Manufacturer Specific<br />
7      Secure Boot Policy                                                           Linux IMA<br />
8-15   Defined for use by the Static OS                                             Linux IMA<br />
16     Debug<br />

It is important to note, that all software code performing<br />

measuring operations (with or without using TPMs) has to be<br />

authenticated to be part in the chain of trust, otherwise the<br />

measurements performed by it can’t be trusted. TPMs by<br />

themselves are not able to establish a root of trust, since they<br />



(unlike HSMs) need to be controlled by software code running<br />

on the SoC (chicken/egg problem). So replacing HSM<br />

authentication with TPM measurement functionality can only<br />

start with the first code that allows the TPM to be accessed and<br />

can be authenticated by outside means. We will look into ways<br />

how to mitigate this fact for more complex systems in chapter<br />

2.4. Another important thing to know is that PCR extends of big<br />

chunks of code usually imply a large time penalty due to the<br />

limited operating speed of TPMs. Therefore, most of the time,<br />

the preferred approach would be to authenticate a software<br />

algorithm (utilizing trusted HSMs) to calculate a hash of the<br />

chunk in question and then extend the TPM PCR with the result<br />

of this calculation.<br />
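That hash-then-extend approach can be sketched like this, with plain SHA-256 standing in for both the host/HSM-side hashing and the actual TPM extend command:

```python
import hashlib

def pcr_extend(pcr: bytes, value: bytes) -> bytes:
    # Simulated TPM PCR extend: PCR' = hash(PCR || value)
    return hashlib.sha256(pcr + value).digest()

# A large boot payload (e.g. a kernel image) is hashed by the fast,
# already-authenticated host-side code (or HSM) ...
payload = b"\xAB" * (8 * 1024 * 1024)          # stand-in for an 8 MiB kernel
digest = hashlib.sha256(payload).digest()

# ... and only the 32-byte digest crosses the slow TPM bus:
pcr = pcr_extend(b"\x00" * 32, digest)
assert len(digest) == 32 and len(pcr) == 32
```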

No complete, adaptable and re-usable implementation to<br />

support measured boot across all boot stages exists to the<br />

knowledge of the authors. So in the remainder of this subchapter,<br />

we will outline the current state of TPM support in the most<br />

relevant components of the boot chain.<br />

Naturally, the ideal place to start TPM measurement would<br />

be the ROM bootloader implemented by SoC vendors which<br />

would come close to having an (immutable) root of trust, as long<br />

as the vendor in question is trusted or the boot loader code is<br />

publicly available to check and certify. However, at the time of<br />

writing this, no SoC known to the author offers TPM support in<br />

the ROM bootloader. However, some SoC vendors offer<br />

support to modify and extend their first stage bootloader (FSBL)<br />

implementation, so some basic routines for TPM initialization<br />

and PCR-based measurement can be added there. If this is not<br />

possible, u-boot (or u-boot’s SPL) would be the first stage in<br />

which measurement can be started for components following in<br />

the chain of trust. This implies, that HSM-based methods have<br />

to be used before to authenticate u-boot itself and previous<br />

components in the chain of trust.<br />

Mainline U-Boot currently only offers support for self-test, provisioning, un-provisioning and PCR extension for TPM 1.2. Advanced features such as integrated measurement of payload data in U-Boot and support for TPM 2.0 are missing. The authors have implemented this support and are evaluating the solution in internal and customer projects. The solution currently supports automated measurement of boot payloads such as the Linux kernel and of configuration data such as device tree binaries, initial RAM disks or complete FIT images. This makes it possible to measure and boot an initial GNU/Linux system. Please get in contact if you are interested in using or contributing.<br />

To support selective measurement and remote attestation in a running GNU/Linux system, the kernel's Integrity Subsystem was introduced in Linux 2.6.30 (with extensions in 3.3 and 3.7); it can be used to implement measured boot of the kernel. Essentially, it allows detecting whether files have been accidentally or maliciously modified, both locally and remotely. File measurements can be appraised against a known "good" value stored as an extended attribute in the filesystem, or via a server over a secure IPsec connection ("remote attestation"). IMA uses the Extended Verification Module (EVM) to guarantee the integrity of the binding between a file and its extended attributes. IMA currently offers the following integrity functions:<br />

Collect: measure a file before it is accessed.<br />
Store: add the measurement to a kernel-resident list and, if a hardware Trusted Platform Module (TPM) is present, extend the IMA PCR.<br />
Attest: if present, use the TPM to sign the IMA PCR value, to allow remote validation of the measurement list.<br />
Appraise: enforce local validation of a measurement against a "good" value, stored as a hash or signature in the 'security.ima' extended attribute of the file, protected by EVM.<br />
Protect: protect a file's security extended attributes (including the appraisal hash) against off-line attack.<br />

Hash values are extended into PCR 10, so the final aggregate hash in PCR 10 is the record of the state of the measured files and directories, for example after booting. IMA also offers some built-in policies that can be enabled on the kernel command line. The kernel log contains all information about which files have been appraised and which executables have been started. The ima-evm-utils package provides the tool evmctl, which can be used to produce and verify digital signatures and to store them into the extended attributes used by IMA. For further information on the configuration of IMA see the official documentation [6] or contact the authors.<br />
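The Collect/Appraise pair can be illustrated with a toy model (a plain dict stands in for the 'security.ima' extended attribute, and hashing happens explicitly; real IMA hooks file access inside the kernel):

```python
import hashlib
import os
import tempfile

def measure(path: str) -> str:
    """Collect: hash the file content before it would be used."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# stand-in for the known-good values IMA keeps in security.ima
good_values = {}

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"trusted payload")
good_values[path] = measure(path)

appraise_ok = measure(path) == good_values[path]   # unmodified file passes

with open(path, "ab") as f:                        # simulate tampering
    f.write(b"!")
appraise_after_tamper = measure(path) == good_values[path]
os.unlink(path)
```

Any modification of the file changes its digest, so the appraisal against the stored value fails.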

2.4 Local and Remote Attestation<br />

Between the two extremes of purely local and purely remote attestation, other ways of verifying system correctness are possible that combine some of the advantages of both worlds. Remote attestation provides the best security and is highly immune to local attacks, because the actual attestation is done on a remote system. However, it relies on a working communication channel to the attestation server and, even more fundamentally, on the attestation server itself. Not all application use cases allow remote connections, and highly available servers are costly and require significant maintenance while still posing a considerable security risk themselves.<br />

Local attestation, on the other hand, e.g. as performed by IMA's Appraise function, implies trusting the local self-protection capabilities of the attestation software itself and its functional correctness. This means that it needs to be authenticated by a component in the chain of trust (e.g. a minimal system contained in an initial ramdisk). This approach is inflexible, time-consuming and complicated for components as dynamic as an embedded Linux system, and it also provides no run-time protection of the authenticating system after boot.<br />

An intermediate path between these approaches is to use a secure, trusted and isolated component on the local system to authenticate the remainder of it. Two ways of implementing this come to mind: using ARM TrustZone or using an embedded hypervisor. Both technologies rely on hardware features in modern processors to isolate trusted from untrusted software and to restrict the access of untrusted components to trusted ones. Embedded hypervisors have been described in detail in [7], more information about TrustZone can be found in [8], and a trusted OS implementation is described in [9]. Hypervisors offer far more advanced methods to implement attestation, but the basic concepts are the same and rely on such a trusted component with a minimal Trusted Computing Base (TCB) [7], which is security-certified and provides attestation services to the rest of the system. This component is installed on the system and, after authentication by the root of trust, acts as a new, flexibly expandable measurement and authentication agent that can remain unchanged even when functionality or updates are added to the system, thereby also allowing less flexible roots of trust to be used for the initial authentication.<br />
be used for initial authentication.<br />

Different concepts for performing the measurement are imaginable, with increasing security but also increasing complexity.<br />

a) Triggered<br />

In this concept, the trusted component provides cryptographic measurement and attestation services to the untrusted system. Instead of performing the measurement itself, the untrusted system requests it as a service from the trusted component. The advantage is that a common, simple API can be used which requires no knowledge of the underlying hardware (e.g. the TPM) on the untrusted side and is highly portable and self-contained. All information about expected measurement values and other secrets such as keys is kept securely within the trusted component. For example, after loading a payload (such as the Linux kernel) into memory, the bootloader could request an attestation of this payload from the trusted component (providing the memory location, size and potentially also the start address). The trusted component then measures the payload and starts it if the measurement matches the expected value.<br />

To increase security, the payload can be encrypted with a key that is sealed to this PCR measurement value in the TPM. The trusted component would then measure the payload and afterwards try to retrieve the key. If the measurement was correct, the key is available to decrypt the payload in memory and to initiate booting it on the untrusted side. If the measurement was not correct, the payload cannot be decrypted and booting it is therefore impossible.<br />

The biggest advantage of this concept is its flexibility when system updates are performed. In this case, the trusted component retrieves the new images via a secure connection, places them into (publicly accessible) memory and performs a reference measurement to obtain the new PCR values to which the keys are sealed.<br />

Alternatively, if the channel through which the update image is retrieved is insecure, it is sufficient to receive the new reference measurement value through a secure (e.g. mutually authenticated) connection. If the image is compromised while being fetched from the public update source, it will not match the measurement, the TPM will not unseal the key, and decryption is therefore impossible. In this case, a new update will be triggered by the secure component.<br />

The secure component uses the TPM to securely store keys and measurement reference values, so that secrets cannot be extracted from mass storage. Only after the secure component itself has been authenticated and measured will the TPM reveal secrets, and only if the correct measurements and PCR extends are performed will further secrets be made available by the TPM to continue the boot process into subsequent stages.<br />
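The seal/unseal idea can be modelled as follows (a deliberately toy construction: the key is masked with a pad derived from the PCR value; a real TPM keeps the sealing material inside the chip and enforces the PCR policy itself):

```python
import hashlib
import os

def _pad(pcr: bytes) -> bytes:
    # derive a 32-byte mask from the PCR value
    return hashlib.sha256(b"seal-pad" + pcr).digest()

def seal(key: bytes, pcr: bytes) -> bytes:
    """Bind a 32-byte key to a PCR value (toy model)."""
    return bytes(a ^ b for a, b in zip(key, _pad(pcr)))

def unseal(blob: bytes, pcr: bytes) -> bytes:
    """Only the PCR value used at seal time recovers the key."""
    return bytes(a ^ b for a, b in zip(blob, _pad(pcr)))

good_pcr = hashlib.sha256(b"expected payload").digest()
key = os.urandom(32)
blob = seal(key, good_pcr)

recovered = unseal(blob, good_pcr)                          # correct measurement
wrong = unseal(blob, hashlib.sha256(b"tampered").digest())  # wrong measurement
```

With the correct measurement the original key comes back; with any other PCR value the result is garbage and the payload stays encrypted.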

b) Active<br />

While in the triggered measurement scenario the measurement is initiated from the insecure side, active measurement allows the secure component to initiate measurements on the insecure side. This works similarly to remote attestation, using quote requests and expecting quotes from the attested system. The only difference is that the secure component acts as the (implicitly trusted) attestation server; all concepts known from remote attestation (such as including nonces in the quote request) can be used here.<br />

c) Observing<br />

Observing attestation takes the active attestation concept one step further and is an interesting option for embedded hypervisors. Instead of communicating with the payload virtual machine, it measures the VM's contents from the outside. This happens entirely without payload VM interaction and leverages the access rights of the VM's virtual machine monitor (VMM) [7]. The main advantage is that no modifications (hooks, functions, channels) have to be made to the payload virtual machine itself, and the software running in the virtual machine cannot even know that it is being measured. The main drawbacks are the implementation complexity and a potential impact on the performance of the virtual machine during measurement, so the measurement times need to be carefully selected; alternatively, time-critical tasks the virtual machine is performing need to be shifted to another VM while the measurement runs. A further major advantage is that the actually executed memory content is measured instead of an image before execution. Some SoCs offer hardware support for performing the actual measurement (programmable hash engines with DMA support), which, if these engines are trusted, can be leveraged to accelerate it. Furthermore, the hypervisor can use a TPM to store measurements, reference values and the encryption keys that might be necessary to access the VM.<br />

2.5 Software Implementations for File System Security<br />

Besides the very hardware-centric (HSM and TPM) techniques of authentication and measurement mentioned above, there are various methods in a GNU/Linux system to enhance security through software-based authentication and encryption. Some of these include support for hardware crypto acceleration (explicitly or implicitly through the kernel crypto API) and are therefore mentioned briefly in this chapter.<br />

The Linux kernel supports various methods to implement block-level integrity protection, such as DM-Verity [10] and DM-Integrity [11].<br />

DM-Verity uses a cryptographic hash tree to authenticate block devices. The hashes are computed using kernel crypto services and thereby leverage HSMs through the kernel's crypto API. DM-Verity can only verify read-only partitions; updates or changes to such a partition therefore require a new integrity setup. It is thus suited to system partitions that are supposed to remain unchanged.<br />
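The hash-tree idea behind DM-Verity can be sketched as a minimal Merkle tree over fixed-size blocks (the real on-disk format, salt and superblock are omitted):

```python
import hashlib

BLOCK_SIZE = 4096

def hash_tree_root(device: bytes) -> bytes:
    """Hash every block, then hash pairs of digests until one root remains."""
    level = [hashlib.sha256(device[i:i + BLOCK_SIZE]).digest()
             for i in range(0, max(len(device), 1), BLOCK_SIZE)]
    while len(level) > 1:
        level = [hashlib.sha256(b"".join(level[i:i + 2])).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

image = bytes(3 * BLOCK_SIZE)        # stand-in for a read-only partition
root = hash_tree_root(image)         # stored and signed out of band

tampered = bytearray(image)
tampered[5000] ^= 1                  # flip one bit in the second block
tamper_detected = hash_tree_root(bytes(tampered)) != root
```

Flipping a single bit anywhere in the device changes the root digest, which is why a signed root hash is enough to authenticate the whole partition.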

DM-Integrity, which was recently merged into mainline Linux with kernel 4.12, supports the authentication of read/write partitions and supports journaling. It also leverages the kernel crypto API and therefore profits from HSM acceleration.<br />



DM-Crypt [12] provides support for encrypting block devices on GNU/Linux and works with both DM-Verity and DM-Integrity. LUKS (Linux Unified Key Setup) [13] provides some extensions to it, mostly to simplify key handling. DM-Crypt uses the Linux kernel cryptographic API and is therefore able to leverage HSMs. The extension tpm2_luks [14] can be used to store keys in TPM 2.0 modules and to seal access to a key to specific PCR values, so that keys can only be accessed in safe, measured states.<br />

2.6 Conclusion: secure / authenticated / measured boot<br />

In the past subchapters we have looked at both authenticated boot using on-chip HSMs and measured boot using TPMs to complement purely software-based approaches. But what is the "ideal" method to implement trusted systems? The following table summarizes the advantages and disadvantages of both methods:<br />

Aspect | On-chip HSM | TPM | Pure software<br />
boot speed (also see next chapter) | ++ | -- | + (ARMv7) to ++ (ARMv8)<br />
can establish root of trust | + | - | -<br />
ease of use | + | + | ++<br />
open-source support | - | + | ++<br />
standardization / vendor independence | -- | + | -<br />
pre-certification | -- | ++ | -<br />
financial costs | + | -- | ++<br />
integration into SoC security system | ++ | - | --<br />
algorithm flexibility | - | + (TPM 2.0) | ++<br />
updates / security fixes possible | -- | + | ++<br />

For most of today's applications, the preference of system architects is to leverage TPMs as far as possible. The major problems with using TPMs for a complete authentication flow are their limited speed and their inability to establish a root of trust without a remote connection for attestation. This, however, means that HSM-based methods have to be used anyway to complement TPMs. A common approach is therefore to use HSMs where they are absolutely required (root of trust) or show significant advantages (integration into the SoC security system) and otherwise to compensate for the TPM's disadvantages using (TPM-authenticated) software routines, which allows for higher flexibility and easier certification of the algorithms. Some recent security exposures suggest that this approach is the right one: NXP's High Assurance Boot had security flaws that are hard to mitigate in deployed systems. At about the same time, a problem with RSA key generation in Infineon's SLB9670 TPM was exposed. That problem could be fixed with a simple firmware update (because the TPM runs software on a specially protected microprocessor) and took much less time to be detected due to the wide use of this TPM.<br />

The table below shows an example of how an authenticated boot flow could be implemented, leveraging both HSMs and TPMs.<br />

Step | Actions | Payload<br />
ROM bootloader (no changes possible) | authenticate u-boot (including hash algorithms for TPM) using integrated HSM hardware support | u-boot (incl. TPM support)<br />
u-boot | initialize TPM; measure u-boot binary in memory and extend PCR (alternatively: remote attestation); load FIT image to RAM; authenticate FIT image using certified hash algorithm and extend PCR | FIT image (kernel, device tree, initial RAM disk)<br />
Linux kernel | initialize HSMs and TPM support; load initial RAMdisk; enable IMA kernel measurement | initial RAMdisk<br />
Initial RAMdisk | set up dm-integrity and dm-crypt with keys from TPM (protected by PCR state); mount encrypted, authenticated root filesystem partition | root filesystem partition (block device)<br />
Application level | set up HSMs/SW engines using CryptoDev; set up access to key storage and crypto functions in TPM (protected by IMA / PCR sealing) | see next chapter<br />

Using hypervisors to provide authentication mechanisms to virtual machines offers advanced possibilities at the cost of increased complexity. However, for dynamic systems that can also leverage other advantages of hypervisors, this concept should be taken into consideration.<br />

3 HSMS AND TPMS ON APPLICATION LEVEL<br />

In the last chapter we evaluated how hardware security modules are used to authenticate a system during boot and to assure that it can be trusted. However, cryptography is also a common requirement at application level. Some common tasks include:<br />
<br />
- Encrypting/decrypting sensitive data<br />
- Key storage<br />
- IPsec tunnelling<br />



We start with an overview of how HSMs and TPMs are used at user-space level in GNU/Linux. Then we look into the specific support for some of these scenarios.<br />

3.1 HSM on Application Level<br />

In this chapter, we describe HSM support and benchmarking<br />

in GNU/Linux for kernel and userspace.<br />

3.1.1 Hardware<br />

The HSMs of recent SoCs can be leveraged as cryptographic accelerators by user applications. We have analysed two specific implementations: NXP's CAAM and the HSM used in recent Marvell SoCs, the SafeXcel EIP-197. These units support acceleration of all major cryptographic algorithms, including AES, DES/3DES, RC4, MD5 and SHA-256, as well as advanced message authentication and authenticated encryption algorithms. They include a secure, NIST-certified random number generator, support the import and export of cryptographic blobs to DDR or flash memory, and provide both DMA support and memory-mapped slave interfaces. Internally, the units are accessed through memory-mapped portals (job rings), which are essentially FIFOs that can be loaded with cryptographic job descriptors. Highly simplified, a job descriptor is a structure describing the cryptographic task to be performed ("encrypt using AES-256"), the key to be used ("key #2 from internal key storage") and the payload to perform the task on.<br />
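A highly simplified model of a job ring and its descriptors (the field names are illustrative, not the actual CAAM or EIP-197 descriptor layout):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class JobDescriptor:
    operation: str   # task to perform, e.g. "encrypt-aes256"
    key_ref: int     # e.g. slot number in the internal key storage
    payload: bytes   # data to process (in hardware: a DMA address and length)

# the job ring behaves like a FIFO the CPU fills and the HSM drains
job_ring = deque()
job_ring.append(JobDescriptor("encrypt-aes256", key_ref=2, payload=b"plaintext"))
job_ring.append(JobDescriptor("hash-sha256", key_ref=0, payload=b"image"))

first = job_ring.popleft()   # the HSM processes jobs in submission order
```

The per-job descriptor handling is also where the fixed overhead comes from that the benchmarks in section 3.1.3 make visible for small block sizes.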

3.1.2 GNU/Linux support<br />

For both units, mainline Linux driver support is available. Enabling them at kernel level is a matter of adding the unit to the device tree and enabling and loading the drivers. In the case of the SafeXcel EIP-197, a binary firmware also has to be provided to the driver and loaded into the unit. Successful enablement of the HSM adds it to the kernel crypto API, so that the cryptographic services can be used from kernel space, e.g. to provide IPsec support. This can be checked by analysing the output of /proc/crypto, which lists all available algorithms and services along with their priority of use. An example is shown below with two different SHA-256 implementations: first the implementation provided by the SafeXcel EIP-197, then one provided by the kernel itself, realized in software using the ARM 64-bit NEON crypto extensions. Note the different priorities.<br />

HSM implementation:<br />
name : sha256<br />
driver : safexcel-sha256<br />
module : crypto_safexcel<br />
priority : 300<br />
refcnt : 1<br />
selftest : passed<br />
internal : no<br />
type : ahash<br />
async : yes<br />
blocksize : 64<br />
digestsize : 32<br />
<br />
SW implementation:<br />
name : sha256<br />
driver : sha256-arm64-neon<br />
module : kernel<br />
priority : 150<br />
refcnt : 1<br />
selftest : passed<br />
internal : no<br />
type : shash<br />
blocksize : 64<br />
digestsize : 32<br />

So in this example, if the "sha256" algorithm is requested, the implementation provided by the SafeXcel HSM will be used because it has the higher priority.<br />
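The priority-based selection can be reproduced by parsing /proc/crypto-style output (a small sketch; here the input is an inline sample rather than the live file):

```python
def best_driver(proc_crypto: str, algo: str):
    """Return the driver with the highest priority for a given algorithm name."""
    best, best_prio = None, -1
    entry = {}
    for line in proc_crypto.splitlines() + [""]:
        if ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            entry[key] = value
        else:  # a blank line ends an entry
            if entry.get("name") == algo and int(entry.get("priority", "0")) > best_prio:
                best, best_prio = entry["driver"], int(entry["priority"])
            entry = {}
    return best

sample = """name : sha256
driver : safexcel-sha256
priority : 300

name : sha256
driver : sha256-arm64-neon
priority : 150
"""
```

Running `best_driver(sample, "sha256")` selects the HSM entry, mirroring what the kernel crypto API does internally when an algorithm is requested by name.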

In the second example below, for AES-CBC, two implementations have the same priority. The easiest way to select which one is used is to unload the module providing the undesired implementation or to deselect it in the kernel build configuration. However, there are more sophisticated ways to select specific implementations of an algorithm at a finer granularity (contact the authors for details).<br />

HSM implementation:<br />
name : cbc(aes)<br />
driver : safexcel-cbc-aes<br />
module : crypto_safexcel<br />
priority : 300<br />
refcnt : 1<br />
selftest : passed<br />
internal : no<br />
type : skcipher<br />
async : yes<br />
blocksize : 16<br />
min keysize : 16<br />
max keysize : 32<br />
ivsize : 16<br />
chunksize : 16<br />
walksize : 16<br />
<br />
SW implementation:<br />
name : cbc(aes)<br />
driver : cbc-aes-ce<br />
module : kernel<br />
priority : 300<br />
refcnt : 1<br />
selftest : passed<br />
internal : no<br />
type : skcipher<br />
async : yes<br />
blocksize : 16<br />
min keysize : 16<br />
max keysize : 32<br />
ivsize : 16<br />
chunksize : 16<br />
walksize : 16<br />

But how can kernel cryptographic services be accessed from user space? Several ways exist, e.g. AF_ALG [15], which provides services through sockets, or CryptoDev [16], which implements an OpenBSD-compatible /dev/crypto device. Both methods have their advantages for specific applications; in this paper we take a closer look at CryptoDev.<br />

CryptoDev is provided as an out-of-tree kernel module; support to build and include it is provided, for example, through the Yocto Project, along with patches to work with recent Linux kernels. When loaded, it provides a new character device, /dev/crypto, which can be used to access cryptographic services from user space. Plenty of examples illustrate how to use CryptoDev in applications [17], and it is also supported by well-known crypto libraries such as GnuTLS [18] and OpenSSL [19].<br />

3.1.3 Performance / Benchmarking<br />

To get a first impression of the speed an HSM can provide, it should be benchmarked from user space. OpenSSL is a great way to perform speed tests and comparisons between various engines and implementations using HSMs, CPU crypto-acceleration functions or plain software algorithms. To do this, it should be compiled with HAVE_CRYPTODEV and USE_CRYPTODEV_DIGESTS defined, which Yocto does automatically if the CryptoDev recipe is included in IMAGE_INSTALL. The command<br />
<br />
openssl speed -engine cryptodev -elapsed -evp [cipher]<br />
<br />
starts the benchmark of a specific cipher algorithm implementation accelerated by an HSM supported by CryptoDev. The parameter -elapsed is important to get realistic results: it tells OpenSSL to measure the time actually required to get the answer back from the engine (as opposed to the internal time spent just loading the engine and processing the results).<br />

Diagram 1 in the appendix shows the performance of NXP's CAAM engine on an ARMv7 Layerscape LS1021 device for AES-128 and AES-256, compared to running in software on an ARM Cortex-A7 at 1.2 GHz.<br />

Several observations are interesting to note and generally true for most ARMv7 SoCs:<br />
<br />
- The HSM shows a significant speed advantage over the software implementation for larger block sizes (up to 10 times).<br />
- For smaller block sizes (below 512 bytes), the overhead of loading the engine with job descriptors and the memory transfer latencies involved is so high that the achieved throughput is lower than in software.<br />
- Even for small block sizes, using the HSM might still be preferable to save CPU cycles and because of its better energy efficiency.<br />
- The HSM only achieves optimal results when used by multiple threads (OpenSSL option: -multi n). In other words, a single thread cannot exploit the engine's maximum performance.<br />
- CPU load (for providing data to the HSM, interface handling and housekeeping) is about 40% to 50% of the load compared to actually executing the algorithm in software (setup: CryptoDev, OpenSSL). By using the HSM, CPUs can be relieved of crypto load and better total system energy efficiency can be achieved.<br />

Diagram 2 in the appendix shows the same benchmark performed on two 64-bit SoCs (Marvell Armada 8040 with SafeXcel EIP-197 HSM and NXP LS1046 with CAAM HSM). The core frequencies were limited to 1 GHz to achieve comparable results.<br />

The following observations are interesting to note and generally true for most ARMv8 SoCs:<br />
<br />
- The ARMv8 crypto extensions show a significant speed advantage over the HSM implementation for all block sizes.<br />
- For larger block sizes, the HSM implementation can reach speeds similar to the single-core software implementation when multiple threads are used to push data into the HSM; additional threads do not increase the crypto speed further.<br />
- The HSM implementation shows little speed degradation compared to software implementations for increasing key lengths on AES and other algorithms, profiting from parallel hardware implementations.<br />
- CPU load (for providing data to the HSM, interface handling and housekeeping) is about 40-50% of the load compared to actually executing the algorithm (setup: CryptoDev, OpenSSL). From a performance point of view, using the HSM from user space is therefore not recommended. Example: on the NXP LS1046, for 8 KiB block sizes, two CPUs have to be loaded to 70% each to achieve an AES throughput through the CAAM engine equivalent to running the algorithm on one single core (90% load). For smaller block sizes, the throughput achieved in software cannot be reached on the CAAM at all.<br />
- The use of HSMs might still be desired because of their better energy efficiency, or if performance is not relevant.<br />
- The main advantage of HSMs is realized when they are used by other hardware engines, such as the Ethernet offloading system (DPAA/DPAA2 on NXP SoCs).<br />

As the diagrams and further experimental results (available through the authors) show, for most algorithms there is a break-even block size at which the HSM implementation starts to become more efficient (in terms of energy, speed or total system load) than a software implementation. This break-even point has to be determined from experimental results. From an optimization point of view, it is then possible (e.g. in CryptoDev) to distribute tasks to either the software or the hardware engine, depending on these break-even points. In a more advanced design (implemented either in CryptoDev or directly in the kernel crypto framework), a concept can be implemented that distributes crypto operations across both HSMs and software implementations (e.g. pinned to crypto-designated cores) to build a speed- and energy-optimized high-end crypto system. Please contact the authors for more information on this.<br />
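The break-even dispatch can be sketched with a simple throughput model (the constants below are illustrative assumptions, not measured values for any specific SoC):

```python
def throughput(size_bytes: float, setup_us: float, rate_bytes_per_us: float) -> float:
    """Effective throughput when every job pays a fixed setup cost."""
    return size_bytes / (setup_us + size_bytes / rate_bytes_per_us)

# illustrative engines: the HSM is faster in steady state but pays a
# large per-job setup cost (descriptor handling, DMA latency)
def hsm_tp(size: int) -> float:
    return throughput(size, setup_us=50.0, rate_bytes_per_us=800.0)

def sw_tp(size: int) -> float:
    return throughput(size, setup_us=1.0, rate_bytes_per_us=100.0)

def choose_engine(size: int) -> str:
    """Dispatch a job to whichever engine yields higher effective throughput."""
    return "hsm" if hsm_tp(size) > sw_tp(size) else "software"
```

With these model parameters, small jobs are routed to software while large jobs go to the HSM, which is exactly the dispatch behaviour described above.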

3.2 TPMs on Application Level<br />

In this chapter we describe the support for TPM 2.0 in GNU/Linux at kernel and user-space level, together with some application examples.<br />

3.2.1 GNU/Linux Support<br />

Driver support implementing TIS and TCTI for TPM 2.0 was mainlined with kernel 4.8 for several vendors. To avoid the limitation of one user-space application blocking the TPM for exclusive use, an access broker and resource management daemon was added with kernel 4.12. On top of the driver, however, some more infrastructure is required to actually use a TPM.<br />

In addition to the module specification itself, the TCG also defines the TPM Software Stack (TSS), which specifies the System API (SAPI) and the TPM Command Transmission Interface (TCTI). There is also an additional Enhanced SAPI to simplify user-space access to TPM functions from several programming languages.<br />

At the moment several TSS implementations for GNU/Linux exist, e.g. from Intel [20] and IBM [21]. The figure illustrates the basic concept on a GNU/Linux system: a kernel driver is used to communicate with the TPM, e.g. using the TPM Interface Specification (TIS). This driver creates the /dev/tpm0 device to tunnel TCTI communication to the TPM. Optionally (depending on CONFIG_HW_RANDOM_TPM in the kernel configuration), another device for reading random numbers can be created. One layer above resides the TCTI TPM device provider service as part of the TSS, which is mostly used by the SAPI layer. The SAPI is implemented as a user-space library to be linked dynamically or statically. Applications may also connect to the TPM device provider directly. Possible modifications and extensions of this layered stack to support more complex systems (virtualization, remote TPM access) are explained in more detail in the TSS System Level API and TCTI Specification.<br />

Before a TPM can be used on a specific system, it needs to be provisioned. During provisioning, the TPM is enabled and activated. In a next step, the endorsement primary key pair is created using TPM2_CreatePrimary. Because the calculation of the key pair is based on the Endorsement Seed in conjunction with the KDF, it can be recreated at any time as long as the TPM has not been cleared, and it therefore does not need to be stored in the NVM. Thanks to the several asymmetric algorithms and the added template entropy of TPM 2.0, there can be more than one primary endorsement key pair or storage key pair, as well as other keys. On GNU/Linux this is a manual process, which can of course be automated by a script. A useful feature is the possibility to set up policies on the TPM that control access to keys or NVRAM data depending on, e.g., specific PCR values (“sealing”).<br />
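To make the sealing idea concrete, the following is a minimal Python sketch of the PCR extend operation that such policies evaluate; it illustrates the hash-chaining principle only and is not a real TSS call (a SHA-256 PCR bank is assumed):<br />

```python
import hashlib

def pcr_extend(pcr: bytes, measurement: bytes) -> bytes:
    # Extend step: new PCR = H(old PCR || H(measured data)),
    # mirroring a measurement event followed by the extend operation.
    digest = hashlib.sha256(measurement).digest()
    return hashlib.sha256(pcr + digest).digest()

# PCRs start as all zeroes after a reset.
pcr = bytes(32)
for stage in (b"bootloader", b"kernel", b"initramfs"):
    pcr = pcr_extend(pcr, stage)

# "Sealing": a key or NVRAM area is released only if the PCR equals the
# value recorded at provisioning time, i.e. the same software booted.
expected = pcr
tampered = pcr_extend(pcr_extend(bytes(32), b"bootloader"), b"evil kernel")
assert tampered != expected
```

Because every boot stage is folded into the chained hash, any change in any measured component yields a different final PCR value, and the policy check fails.<br />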

3.2.2 TPM application use cases in GNU/Linux<br />
The following lists some interesting tools complementing the TSS in userspace.<br />

[Figure: TSS software stack on GNU/Linux — a GNU/Linux application uses the System API (SAPI) implementation instance, which communicates over the TCTI with the TCTI TPM Device Provider; the provider reaches the local TPM through /dev/tpm0 and the kernel TPM driver.]<br />

tpm2-tools: This project implements most of the TSS SAPI functions as command-line tools to access TPM 2.0 functionality. Available from https://github.com/01org/tpm2-tools<br />
tpm2-abrmd: An implementation of a TPM2 access broker and resource management daemon. Available from https://github.com/01org/tpm2-abrmd<br />
eltt2: The Embedded Linux TPM toolbox, a Swiss army knife for accessing basic TPM 2.0 functionality from the Linux command line. It communicates directly over the TCTI. Available from https://github.com/Infineon/eltt2<br />

The SAPI is the standardized interface to these functions; alternatively, the tpm2-tools are available from the command line. For example, the tpm2_getrandom function reads a random number from the TPM’s hardware random number generator. Other functions may be used to create and manage keys for the TPM’s various symmetric and asymmetric encryption algorithms, to encrypt small portions of data, or for key management and the distribution of encrypted keys for symmetric cryptography. Users can export and import their keys, and the key itself is encrypted before it leaves the TPM, so that it never appears outside the TPM in plaintext.<br />

As mentioned before, PCRs 17 to 24 are not used for boot measurement and can therefore be used by userspace applications; this includes the ability to reset these registers. For example, these PCRs could be used for application-specific measurements to monitor the health of specific software products and configurations, or for software license management. In the same way, the TPM’s NVRAM can be used by user applications. However, users need to consider the specified maximum number of write cycles and the data retention limitations of the specific TPM used.<br />

3.3 Advanced Use Cases<br />
In this chapter we briefly look into some advanced use cases at application level, combining HSMs and TPMs.<br />

3.3.1 Encrypting / decrypting sensitive data<br />

Encrypting and decrypting sensitive data is a very common requirement at application level. For small payloads, this task can easily be done using a TPM, which offers the following advantages:<br />
• Key generation and protection without exposing keys to the SoC world<br />
• Key access rights can be restricted or granted based on the trust state of the system<br />
• Encryption is performed with maximum security against snooping<br />

However, in the following situations, combining the TPM with other methods might be preferred:<br />
• Huge payloads that would take too long to encrypt/decrypt on the TPM, or high throughput demands<br />
• The SoC’s tamper detection system should be leveraged to protect and destroy keys and sensitive data upon attack<br />

In these cases, there are two possibilities:<br />
Exclusive use of the HSM: This is the preferred way if the SoC’s tamper detection system shall be leveraged. Most HSMs support advanced methods to import and export cryptographic blobs into memory, as well as advanced key generation methods. However, these methods are highly proprietary and require access to the detailed documentation of the HSM (e.g. the “Security Reference Manual”).<br />
Combining HSM and TPM: This method offers the best of both worlds by using the TPM as a highly secure and certified key generator and key storage, and the HSM to actually encrypt the data using this key. This task can be performed either with proprietary software that directly accesses the HSM or with the methods described in chapter 4.1.2, e.g. through OpenSSL. There are ongoing efforts to integrate a “tpm2” engine into OpenSSL [22], and some manual approaches [23] to achieve this goal.<br />
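The split of roles can be sketched as follows. Everything here is illustrative: the function and variable names are our invention, and a toy SHA-256 counter-mode stream cipher stands in for the HSM’s AES engine, whose real interface would be a proprietary driver or an OpenSSL engine:<br />

```python
import hashlib, os

def hsm_stream_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Stand-in for the HSM's bulk cipher: XOR with a SHA-256
    # counter-mode keystream (illustrative only, not for production).
    out = bytearray()
    for i in range(0, len(data), 32):
        pad = hashlib.sha256(key + nonce + i.to_bytes(8, "big")).digest()
        out += bytes(a ^ b for a, b in zip(data[i:i + 32], pad))
    return bytes(out)

# 1. The TPM generates and guards the symmetric key; here it is a plain
#    variable, whereas a real TPM would only release it under policy.
data_key = os.urandom(32)

# 2. The HSM encrypts the bulk payload with that key.
nonce = os.urandom(16)
payload = b"large sensor log ..." * 1000
ciphertext = hsm_stream_cipher(data_key, nonce, payload)

# 3. Applying the same keystream again restores the payload.
assert hsm_stream_cipher(data_key, nonce, ciphertext) == payload
```

The design point is the division of labor: key generation and storage stay in the certified TPM, while the throughput-critical cipher work runs in the HSM.<br />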

3.3.2 IPSEC<br />

If a kernel driver for an HSM is available and the required algorithms are supported, the HSM can be used to accelerate IPsec traffic. TPMs can be integrated into this solution as key generators and key storage, and to perform low-performance cryptographic operations such as verifying signatures. Solutions such as strongSwan support the use of TPM 2.0 through a plugin. [24] and [25] describe the possibilities offered by this, including remote attestation of IMA results.<br />

3.3.3 Network Acceleration<br />

Typically, the most powerful HSMs are found on networking processors, such as NXP’s Layerscape or Marvell’s Armada 8040 family, to accelerate the encryption of network traffic. To achieve this, these modules need to be deeply embedded into the hardware network acceleration modules of these SoCs, such as the queue, buffer and frame managers in NXP’s DPAA or DPAA2. If such an acceleration system is used in conjunction with optimized network stacks that leverage all components, Ethernet traffic can be encrypted at line speed on such platforms. Please contact the authors for more information on this subject.<br />

4 WHAT IS IT ABOUT THE TITLE?<br />

In a way, TPMs and HSMs, the two companions to Tux (the Linux kernel mascot), resemble “The Lion King” character Timon [26] and the canine character Rex from an Austrian/Italian TV show [27]. Timon is small in size, smart and very good at self-marketing, similar to how TPMs are positioned by some parties as the solution to every existing security concern; Rex is a well-minded, persistent and usually underestimated K9, ready to jump in to help and save lives whenever needed, quite similar to what HSMs do in an SoC, especially when no one else is there to help.<br />

5 CONCLUSION<br />

Hardware-accelerated cryptography support is an extremely valuable addition to enhance GNU/Linux security. However, while some applications are well enabled (userspace cryptography, IMA, image authentication), support for others is missing completely (TPM 2.0 and measurement support in mainline U-Boot, TPM integration as a key storage into dm-crypt). Due to rising security demands, this will most likely improve in the future and help to replace proprietary HSM-enabled solutions with TPMs. The ARMv8 crypto extensions show great speed for most algorithms, so that the predominant use of HSMs (crypto speed) will most likely also be complemented and replaced by pure software implementations in the future. The maturity and adoption of such solutions for productive use also depends highly on community evaluation and implementation. So feel free to contact the authors to discuss collaboration on an implementation, or just to give some feedback about your use cases and experiences.<br />

REFERENCES<br />

[1] https://community.nxp.com/docs/DOC-334996<br />

[2] https://trustedcomputinggroup.org/<br />

[3] https://shattered.io/<br />

[4] https://github.com/torvalds/linux/blob/master/arch/arm64/crypto/aes-ce-cipher.c; https://www.linaro.org/blog/core-dump/accelerated-aes-for-the-arm64-linux-kernel/<br />

[5] https://www.fsf.org/campaigns/secure-boot-vs-restrictedboot/whitepaper.pdf<br />

[6] https://sourceforge.net/p/linux-ima/wiki/Home<br />

[7] Roeder et al., “Tux Airborne - Encapsulating Linux: real-time, safety and security with a trusted microhypervisor”, Embedded World Conference 2016<br />

[8] https://www.arm.com/products/security-on-arm/trustzone<br />

[9] https://github.com/OP-TEE/optee_os<br />

[10] https://www.kernel.org/doc/Documentation/device-mapper/verity.txt<br />

[11] https://github.com/torvalds/linux/blob/master/Documentation/device-mapper/dm-integrity.txt<br />
[12] https://github.com/torvalds/linux/blob/master/Documentation/device-mapper/dm-crypt.txt<br />

[13] https://gitlab.com/cryptsetup/cryptsetup<br />

[14] https://github.com/rqou/tpm2-luks<br />

[15] http://www.chronox.de/libkcapi/html/ch01s02.html<br />

[16] http://cryptodev-linux.org/<br />

[17] https://github.com/cryptodev-linux/cryptodev-linux/blob/master/examples/<br />

[18] http://www.gnutls.org/<br />

[19] http://www.openssl.org/<br />

[20] https://github.com/tpm2-software/tpm2-tss<br />

[21] https://sourceforge.net/projects/ibmtpm20tss/<br />

[22] https://dguerriblog.wordpress.com/2016/03/03/tpm2-0-and-openssl-on-linux-2/<br />

[23] https://mta.openssl.org/pipermail/openssl-dev/2016-December/008924.html<br />

[24] https://wiki.strongswan.org/projects/strongswan/wiki/TPMPlugin<br />

[25] https://www.strongswan.org/docs/ConnectSecurityWorld_2016.pdf<br />

[26] https://en.wikipedia.org/wiki/The_Lion_King<br />

[27] https://en.wikipedia.org/wiki/Inspector_Rex<br />



APPENDIX<br />

[Chart: LS1021 CAAM vs. Cortex-A7 @ 1.2 GHz — OpenSSL aes-128-cbc and aes-256-cbc speed in KB/s over block sizes 16 B to 8 KB, comparing CAAM offload with 1, 2 and 3 threads against single-threaded software on one A7 core.]<br />

[Chart: Marvell A8040 SafeXcel vs. NXP LS1046 CAAM vs. ARMv8 Cortex-A72 crypto @ 1 GHz — OpenSSL aes-128-cbc and aes-256-cbc speed in KB/s over block sizes 16 B to 8 KB, comparing CAAM offload with 1 to 4 threads, SafeXcel offload, and single-threaded software on an A72 core.]<br />



TPM 2.0 for Enhanced Security in<br />

Software Updates of Industrial Systems<br />

Dr.-Ing. Florian Schreiner<br />

Embedded Security Solutions<br />

Infineon Technologies AG<br />

Munich, Germany<br />

Florian.Schreiner@infineon.com<br />

Abstract— Industry 4.0 enhances the communication and data exchange between devices in a smart factory. This requires enhancing the functionalities of the devices and also increases the complexity of their software. More software complexity also implies more potential security issues and bugs. This can be mitigated with frequent remote software updates, which address bugs and take the latest known threats into account. These updates in turn need a high level of protection in order to prevent misuse of and attacks on their deployment. The Trusted Platform Module (TPM) is a standardized technology that increases the security of the deployment and installation of software updates by acting as a trust anchor, because it protects keys and data with a high security level.<br />

Keywords— Software Update; Security; TPM; Trusted<br />

Computing; Standardization;<br />

I. INTRODUCTION<br />

Industrial automation and the Industry 4.0 movement are leading to substantially more connected devices in factories and production lines. As the number of connected devices increases, so do the opportunities for attacks on such devices, their communication channels and their stored data.<br />

The challenges are the enhanced functionalities and the complexity of the software in the devices, which also extend the possibilities for security issues and bugs. This can be mitigated with frequent remote software updates, which address bugs and take the latest known threats into account. These updates in turn need a high level of protection in order to prevent misuse of and attacks on their deployment.<br />

The problem for a system with a security bug is the protection of the cryptographic keys required for deploying an update. These keys need to be stored and managed in a secured environment that is separated from the main software of the device.<br />

Such a secured environment is the Trusted Platform<br />

Module (TPM), which is a standardized technology to increase<br />

the security in devices and to protect cryptographic keys and<br />

data with a high security level. The TPM 2.0 is the latest<br />

Trusted Computing technology, which provides modern<br />

algorithms, easier integration of cryptographic functions and<br />

the crypto-agility concept. Crypto-agility is important for<br />

industrial devices, as they have a long lifetime and therefore<br />

require a smooth transition to new upcoming algorithms in the<br />

future.<br />

This paper provides a short introduction to the new functionalities of the TPM 2.0 standard and their application in industrial devices. The focus is on the protection of a remote software update process that uses the TPM as key storage and uses policies to protect the key usage.<br />

II. APPLICATION SCENARIO<br />

A. Secured Software Update<br />

The software update of an industrial device is an innovative approach to enhancing such devices. It enables faster adaptation in smart factories by optimizing industrial devices in order to reduce potential risks or to enhance performance.<br />

The performance can be increased with optimized parameters, or the execution of new functionalities can be enabled. There are also negative aspects, because software updates can introduce risks like failures, bugs or errors in the software. Such failures can cause significant financial damage, because of the high cost of an interruption of the production line. If such a software error is detected, a software update can be developed and released in order to remove the potential risk.<br />

A software update can be executed locally at the device or as a remote software update via a network connection. The local update has the advantage that the environment of the update is more secure, because physical presence at the device is required to execute the update. However, the local update also incurs higher costs and more resources, because the operator needs to establish a direct connection to the device and the update needs to be planned into the production process.<br />



A remote software update reduces these costs, because the update is deployed over a network connection. However, this also increases the attack potential, as the device is reachable by a larger number of other devices or intruders, which could potentially misuse or intercept the normal operation of the update process. Such a threat could also be exploited more easily if there are bugs and errors in the software of industrial devices. Therefore, a secured software update process is required in order to validate the transmitted update package and to verify that the update was installed correctly.<br />

B. Threats for secured software updates<br />

The challenge for a secured software update mechanism is to design an architecture and cryptographic process of adequate quality to protect against the wide variety of known threats. The architecture of the software update is especially critical, because an optimal security concept is required to achieve a high security level. A software update sequence involves several aspects:<br />
• Authorization of the update<br />
• Verification of the authenticity, integrity and confidentiality of the update package<br />
• Verified installation of the update<br />

The authorization of the update is required to protect against misuse of the update process. Only authorized parties are allowed to start a firmware update process, and only when the device is in the right state. The authorization is typically done by the operator or the owner, who decides when the update can be started. The owner or operator is generally not known during the manufacturing of the industrial device; therefore, a flexible authorization mechanism is required so that the right party can be identified during the lifetime of the device.<br />

Another threat is the manipulation of the update package. Cryptographic mechanisms need to be integrated into the secured update architecture so that the authenticity, integrity and confidentiality of the update are protected. This can be addressed with encryption in combination with a signature of the update package.<br />
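The verification side of this scheme can be sketched as follows. This is only an illustration: an HMAC stands in for the package signature (a real deployment would use an asymmetric signature, e.g. RSA or ECC, verified with a key protected by the TPM), and all function names are our invention:<br />

```python
import hashlib, hmac

def sign_package(update: bytes, signing_key: bytes) -> bytes:
    # Authenticity and integrity: MAC over the package digest
    # (stand-in for an asymmetric signature created by the vendor).
    digest = hashlib.sha256(update).digest()
    return hmac.new(signing_key, digest, hashlib.sha256).digest()

def verify_and_install(update: bytes, tag: bytes, signing_key: bytes) -> bool:
    expected = sign_package(update, signing_key)
    if not hmac.compare_digest(tag, expected):
        return False   # reject the package before touching the flash
    # ... decrypt if needed, write to flash, re-verify after reboot ...
    return True

tag = sign_package(b"firmware v2", b"\x42" * 32)
assert verify_and_install(b"firmware v2", tag, b"\x42" * 32)
assert not verify_and_install(b"firmware v2 (tampered)", tag, b"\x42" * 32)
```

The constant-time comparison (`hmac.compare_digest`) and the verify-before-install ordering are the two properties a real implementation must preserve.<br />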

The last step in an update process is the installation of the new software on the device. This installation can be verified so that manipulations during the execution of the update can also be detected. Mechanisms like secured boot additionally verify the integrity of the installed software after a reboot of the device.<br />

A general problem for these cryptographic mechanisms is that the essential keys and authorization credentials need to be stored securely in the device. Storing this secret data in plain form in the memory of the host processor would not be optimal, because an attacker could gain access to the keys by, e.g., reading the flash memory or remotely exploiting a bug in the software. In such a case, the attacker can misuse the update mechanism or read, change or clone the software of the device.<br />

C. The Trusted Platform Module (TPM)<br />

The Trusted Platform Module (TPM) is a standardized technology that increases the security of software updates by acting as a trust anchor, because it protects keys and data with a high security level. The TPM 2.0 as specified in [1] is the latest Trusted Computing<br />

technology, which provides modern algorithms, easier<br />

integration of cryptographic functions and the crypto-agility<br />

concept. Crypto-agility is important for industrial devices, as<br />

they have a long lifetime and therefore require a smooth<br />

transition to new upcoming algorithms in the future.<br />

The TPM provides standard cryptographic functionalities<br />

and interfaces to protect the data in industrial systems and<br />

enhance communication security. It supports a wide variety of<br />

functionalities including basic device authentication, embedded<br />

system life cycle protection and system integrity in<br />

combination with secured boot. These functionalities offer<br />

flexibility in the integration of the security and enable dynamic<br />

security enhancements over the lifetime of the device. The<br />

generic TPM functionalities are shown in Fig. 1.<br />

Fig. 1. Generic TPM functionalities for industrial systems<br />

The TPM is a hardware security device in which all<br />

functions and the cryptographic operations are executed in a<br />

protected environment. Internally the TPM consists of several<br />

different blocks, which can be accessed via an external<br />

interface, e.g. I2C or SPI. This external interface provides a<br />

mechanism for the authorization of users and the execution of<br />

cryptographic operations with secret keys. Fig. 2 shows the<br />

components of the TPM 2.0 with the supported algorithms,<br />

protocols and functionalities.<br />

Fig. 2. Functional components of a TPM 2.0<br />

The TPM 2.0 has a variety of key management functionalities, which allow keys to be created internally in the TPM and stored under the TPM’s protection. Furthermore, the usage of the keys can be limited to authorized parties, in order to verify whether the requesting party is allowed to use the corresponding key.<br />

Furthermore, the TPM has a set of Platform Configuration Registers (PCRs), which store the measurement data collected during a boot process. These registers are used to verify the system integrity, and they are cryptographically protected with a hash algorithm.<br />

The TPM chip’s high level of resistance to attacks is achieved with several countermeasures in hardware. Examples of these countermeasures are analog sensors that supervise the input and output pins of the chip in order to detect whether the pins are being used to manipulate the chip’s operations; example threats are spikes or glitches applied to the pins. Furthermore, the TPM contains sophisticated internal memory encryption, so that data is stored securely even inside the chip. The TPM additionally has more than 50 other security features, so that it achieves a high level of security, which is also certified using the Common Criteria process. Approved TPM products, which fulfill the requirements of the standardization organization Trusted Computing Group (TCG), are listed in [2].<br />

D. Advantages of the TPM in Software Updates<br />

The enhanced security and trustworthiness offered by a TPM provide several benefits when used for software updates.<br />

Attacks on industrial devices can become widely known through publication at conferences and in social media, news and the press. This can damage the brand of products and even affect the reputation of the whole company regarding trustworthiness and reliability. Attacks like reading flash memory or exploiting software bugs (e.g. Heartbleed) can nowadays be mounted even with the knowledge of students. If a secret key is extracted without authorization, there are several methods to use the key in order to obtain confidential data.<br />

The TPM protects these keys, as they are only stored and used inside the chip. Therefore, software bugs in cryptographic libraries on the host processor (e.g. Heartbleed) are not a security problem for the keys in the TPM, and the TPM keys can be considered trustworthy even after the host software has been attacked or compromised. This enhances the capabilities of the software update mechanism, because it allows a system to be recovered after an attack has occurred.<br />

The large set of functionalities and the included security evaluation of the TPM mean a significant cost reduction compared to other security implementations, which can require high implementation effort and cost. Some technologies like virtualization can require the use of proprietary extensions, which lead to high implementation effort and reduce interoperability; furthermore, the threat resistance of these integrated technologies is often limited. The TPM is internationally standardized by the Trusted Computing Group (TCG), offering extensive interoperability with current IT systems, operating systems and network protocols like SSL/TLS. Additionally, it provides a high level of security based on smartcard technology, which offers strong protection against currently known threats.<br />

III. SYSTEM ARCHITECTURE WITH A TPM<br />

This section explains how the TPM is used to set up a secure software update of a device. The update is distributed by a cloud server; Fig. 3 shows an overview of the system and its components. The update package is signed and encrypted in the cloud server. The encrypted package is sent to the device, which uses the TPM to decrypt it. This can only be done if the operator or owner was authorized beforehand. After that, the signature of the package is verified. If all operations were successful, the update is installed on the device.<br />

The public/private key pair in the TPM can, for example, be generated during the manufacturing of the device. In this operation, the authorization process for the key is also defined; examples are a password or a signature validation, which can later be used to verify the operator authorizing the software update. After the key creation, the public key can be read from the TPM and stored in the cloud server.<br />

[Figure: client device containing host CPU, storage/flash and TPM; the software update is delivered by the cloud server.]<br />
Fig. 3. Overview of a secured update with a TPM<br />

The TPM also supports further enhanced protection mechanisms. It provides version management, which is controlled by the cloud server and can also be included in the authorization mechanism. The version management checks the current firmware version: only if the version matches the versions allowed by the backend does the TPM authorize the key to decrypt the firmware update package. This protects against unapproved firmware versions as well as rollback attacks to old firmware versions.<br />
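A minimal sketch of such a version-gated authorization policy follows; the function name, parameters and policy shape are our invention, and a real TPM would enforce this through policy sessions, e.g. against an NV counter:<br />

```python
def authorize_decrypt(current_version: int, candidate_version: int,
                      allowed_versions: set) -> bool:
    # The backend (cloud server) controls the set of allowed versions.
    if candidate_version not in allowed_versions:
        return False   # unapproved firmware version
    if candidate_version <= current_version:
        return False   # rollback (or re-install) attempt blocked
    return True        # the TPM may now release the decryption key

assert authorize_decrypt(3, 4, {4, 5})
assert not authorize_decrypt(3, 2, {2, 4})   # rollback blocked
assert not authorize_decrypt(3, 6, {4, 5})   # not on the allow-list
```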

IV. SUMMARY AND OUTLOOK<br />

New and increasingly sophisticated security threats are constantly developing as a result of the widespread adoption of Industry 4.0, as well as of the new application areas and types of devices being connected, all of which are attractive to potential attackers. With the TPM standard, engineers and developers of industrial devices have a versatile and highly efficient solution available that enhances the security of a broad variety of use cases and industrial systems.<br />



REFERENCES<br />

[1] Trusted Platform Module Library, Family 2.0, Revision 01.38,<br />

September 2016, Trusted Computing Group,<br />

https://trustedcomputinggroup.org/tpm-library-specification/<br />

[2] TPM Certified Products, January 2018, Trusted Computing Group, https://trustedcomputinggroup.org/membership/certification/tpm-certified-products/<br />
[3] TSS Feature API Specification, Family 2.0, Revision 00.12, November 2014, Trusted Computing Group, https://trustedcomputinggroup.org/tss-feature-api-specification/<br />



Secure Updates of Artificial Intelligence Applications<br />

Used in Autonomous Driving<br />

Antonino Mondello<br />

Principal Design Engineer<br />

Micron Technology Inc.<br />

amondell@micron.com<br />

Alberto Troia<br />

Memory System Architect<br />

Micron Technology Inc.<br />

atroia@micron.com<br />

Abstract— Many applications are developed by adopting intelligent hardware and algorithms that use the same methodology biological structures use to solve real-life problems; this approach involves strategies and implementations based upon neural networks, genetic algorithms, deep learning, and other forms of artificial intelligence. One of the main benefits of artificial intelligence is the inherent ability to arrive at a solution to a problem in a very short time compared to alternative implementations, while guaranteeing the robustness and integrity of the solution, including protection against unauthorized changes. Ensuring protection against unauthorized changes requires that updates of contents (i.e. datasets) are accepted only if the trust between the sender and the receiver has been verified. The application of artificial intelligence in conjunction with system-level security is a cornerstone of the enabling elements needed to realize autonomous driving.<br />

Keywords—Artificial intelligence; Machine Learning; Deep Learning; Neural Network; Genetic Algorithm; Neuron; Gene; HASH; HMAC; SHA2; Digest; Weight Matrix; Secure Storage; Memory; Automotive; Autonomous Driving<br />
I. ARTIFICIAL INTELLIGENCE OVERVIEW<br />
One definition of artificial intelligence (AI) is the capability of a machine to imitate intelligent human behavior. The main purpose of introducing AI is to be flexible in adapting to new circumstances that were not foreseen when the hardware was planned and designed.<br />
A. Neural Network overview<br />
In recent years, most AI systems have been based on neural networks that emulate the functionality of animal brains. An Artificial Neural Network (ANN) is a set of cells called neurons, which can be interconnected or independent via a set of connections, as depicted in Fig. 1.<br />
The neural signal processed by a generic neuron i is sent to neuron j after multiplication by certain numerical constants called synaptic weights (w_mn). For conceptual simplicity, the ANN is organized in consecutive layers of neurons (see Fig. 1). The output of an internal ANN layer of neurons, after receiving the signals from the previous layer, is provided to the next layer of neurons until it reaches the output of the ANN; this kind of ANN is called a feedforward ANN. The first layer is the input layer; it interfaces the network with the external world using R inputs. Generally, the primary purpose of this layer is to precondition the magnitude of the inputs.<br />
Fig. 1. Generic artificial neural network structure<br />

Some ANNs might have neurons that directly influence the input via the synaptic weight w_mn (see Fig. 1); in general, the output of any specific neuron might be the input of a neuron in one of the previous layers. These types of networks are called Recurrent Artificial Neural Networks (RANNs); the recurrence can affect the stability of the entire structure. It is customary to call the feedforward ANN a non-recurrent ANN.<br />

To briefly introduce the ANN functionality, we may start from the description of a single neuron. Consider a generic neuron m of the network, represented below in Fig. 2; the relationship between its inputs and its output is given by the formula:

a_m = f_m[b_m + Σ_{k=1}^{R} (w_mk · p_k)]    (1)



where f_m(n) is a real function called the activation function.
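As a concrete illustration, formula (1) for a single neuron can be sketched in a few lines of Python; the weights, bias and log-sigmoid activation below are hypothetical example values:

```python
import math

def logsigmoid(t):
    # Log-sigmoid activation: f(t) = 1 / (1 + e^-t)
    return 1.0 / (1.0 + math.exp(-t))

def neuron_output(weights, bias, inputs, activation):
    # a_m = f_m(b_m + sum_k w_mk * p_k), as in formula (1)
    n = bias + sum(w * p for w, p in zip(weights, inputs))
    return activation(n)

# Example: a two-input neuron with hypothetical weights and bias
a = neuron_output(weights=[0.5, -0.25], bias=0.1,
                  inputs=[1.0, 2.0], activation=logsigmoid)
```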

Finally, an implementation of a feedforward ANN requires the management of two sets of N matrices, W_m and B_m, and N functions f_m(n). Such matrices must be stored in nonvolatile memory and updated when needed, according to the update policy implemented.

Fig. 2. Neuron I/O relationship<br />

In an ANN, the activation functions can be, in principle, different for each neuron; in practice, only a few different kinds of functions are used. The most common activation functions are described in TABLE I.

TABLE I. DIFFERENT KINDS OF ACTIVATION FUNCTIONS

Hard limiter: f(t) = b if t ≥ 0; f(t) = a if t < 0
Linear: f(t) = a · t
Log-sigmoid: f(t) = 1 / (1 + e^(−t))
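The three activation functions of TABLE I can be written directly in Python; the limiter constants a and b default to 0 and 1 here, a hypothetical choice:

```python
import math

def hard_limiter(t, a=0.0, b=1.0):
    # Hard limiter: f(t) = b if t >= 0, a if t < 0
    return b if t >= 0 else a

def linear(t, a=1.0):
    # Linear activation: f(t) = a * t
    return a * t

def logsigmoid(t):
    # Log-sigmoid: f(t) = 1 / (1 + e^-t)
    return 1.0 / (1.0 + math.exp(-t))
```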

The constant b_m, which is not always present (b_m = 0), is called the bias of the neuron. The relationship between the inputs and outputs of the entire network depends on the choice of the activation function, the bias, and the synaptic weights w_mk. The process that permits us to define the bias and the synaptic weights is called the learning process. We will briefly describe the ANN learning strategies in the following paragraphs.

B. Synaptic weight matrix of a neural network<br />

Regardless of the algorithm used to define the numbers w_mn and b_m, or of the choice of the activation functions f_m(n), the mathematical description of an ANN can be represented using standard matrix formats. This permits a very compact notation in the theory formulation and equations organized in a form suitable for a software implementation of the network.

In a feedforward ANN, the synaptic weights of a generic m-th layer can be organized in a rectangular matrix W_m, where the element w_mn ∈ W_m indicates the weight of the connection (synapse) from neuron n to neuron m of the artificial network. The output of each layer can be written as:

A_0 = P
A_m = F_m(B_m + W_m · A_{m−1}),  ∀m ∈ [1, …, N]
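The layer recursion (A_0 = P, each A_m computed from A_{m-1}) can be sketched in plain Python with matrices as nested lists; the network shape and weight values below are hypothetical:

```python
import math

def logsigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def layer_forward(W, B, A_prev, f):
    # One layer: A_m = F_m(B_m + W_m . A_{m-1})
    return [f(b + sum(w * a for w, a in zip(row, A_prev)))
            for row, b in zip(W, B)]

def feedforward(P, layers):
    # layers: list of (W_m, B_m, f_m) tuples; A_0 = P
    A = P
    for W, B, f in layers:
        A = layer_forward(W, B, A, f)
    return A

# Hypothetical 2-input, 2-hidden-neuron, 1-output network
layers = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.1], logsigmoid),  # layer 1
    ([[1.0, -1.0]],               [0.0],      logsigmoid),  # layer 2
]
out = feedforward([1.0, 2.0], layers)
```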

Fig. 3. Feedforward neural network<br />

To implement a vehicle with autonomous driving capabilities, the different functions can be implemented using feedforward ANNs; sometimes it is necessary to use more powerful ANN architectures, such as the Recurrent Artificial Neural Network (RANN).

The implementation of RANNs poses several additional problems; in fact, they behave like dynamic systems, which means that the output depends not only on the current input of the network, but also on the previous values of the inputs, the outputs and the internal states. Because of this dynamic nature, RANNs possess a more complicated mathematical structure than feedforward networks and require very large matrices.

A general model of a RANN is depicted in Fig. 4. To guarantee correct functionality at each layer, a Tapped Delay Line (TDL) is used to avoid critical race conditions at the output of the neurons.

Fig. 4. Recurrent neural network model<br />

For a complete modeling of the network layers, it is necessary to introduce the matrices listed in TABLE II; it can be proven that these matrices suffice to describe the behavior of the entire RANN.
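Using the TABLE II notation, a single RANN layer step can be sketched in Python; this toy layer feeds its own one-step-delayed output back through LW (a one-tap delay line), and all weight values are hypothetical:

```python
import math

def logsigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def rann_layer_step(IW, LW, B, p, a_prev, f):
    # a_m(t) = f(B_m + IW . p(t) + LW . a_m(t-1))
    # a_prev is the delayed output held in the Tapped Delay Line (TDL)
    out = []
    for row_in, row_rec, b in zip(IW, LW, B):
        n = b
        n += sum(w * x for w, x in zip(row_in, p))        # input contribution
        n += sum(w * a for w, a in zip(row_rec, a_prev))  # recurrent contribution
        out.append(f(n))
    return out

# Run the layer over a short input sequence (hypothetical weights)
IW = [[1.0]]
LW = [[0.5]]
B = [0.0]
a = [0.0]  # TDL initially holds zero
for p in ([1.0], [0.0], [0.0]):
    a = rann_layer_step(IW, LW, B, p, a, f=logsigmoid)
```

Note how the output keeps evolving after the input returns to zero: this is the dynamic behavior described above.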



TABLE II. MATRICES NEEDED TO DESCRIBE RANNS

P: Input vector of the network; in this network model, the inputs can be connected to all layers.
A_m: Output vector of layer m.
B_m: Vector of the neuron biases of layer m.
LW_m,l: Matrix of synaptic weights from layer l to layer m.
IW_m,l: Matrix of synaptic weights associated with the network input l related to layer m.

Fig. 5 represents a generic RANN layer; the matrices of TABLE II determine the contributions to the neuron inputs, while the outputs are calculated using formula (1) (see [1][2][5][8] for more details).

Fig. 5. Recurrent neural network m-th layer<br />

From this overview of ANNs, we can draw the conclusion<br />

that the implementation of a neural network requires the<br />

manipulation of large amounts of data stored in matrix form.<br />

Such data must be stored in a nonvolatile memory device and<br />

must be written, read, and updated in a secure manner. This<br />

challenge will be addressed in the next paragraphs where we<br />

describe a possible hardware implementation.<br />

C. Learning of a neural network overview<br />

The learning or training of the ANN [1][2][5] consists of setting the values of the weight matrices LW_m,l, IW_m,l and B_m. The learning strategies can be divided into two distinct categories, which define two different paradigms of learning:

• Supervised learning: Previously collected input vectors are presented to the ANN. The output produced by the network is observed and the deviation from the expected answer is measured. The weights are corrected according to the magnitude of the error, in the manner defined by the specific error function used. This learning strategy is also known as learning with a teacher.

A special class of supervised algorithms is given by the reinforcement learning strategy [5]; reinforcement algorithms do not compare the response at each input of the network with the theoretical one; instead, as a measure of the error, they use a score function that provides a measure of the global performance of the network. The synaptic weights are changed according to the score value. Another supervised class of algorithms is represented by learning with error correction: here, the magnitude of the error, together with the input vector, determines the magnitude of the corrections to the weights.

• Unsupervised learning [5] is used in all situations where,<br />

for a given input, the exact numerical output value is not<br />

known; in other words, the teacher is not available. A<br />

typical example of a problem without a teacher is the<br />

problem of classification. Suppose, for example, there<br />

are some points in a two-dimensional space to be<br />

classified into three clusters. For this task, we can use a<br />

classifier network with three output lines, one for each<br />

class. Each of the three computing units at the output<br />

must specialize by providing a non-zero value in correspondence with the input elements of its cluster. If

one unit is not zero, the others must be silent. In this case,<br />

we do not know a priori which unit is going to specialize<br />

on which cluster. Generally, we do not even know how<br />

many well-defined clusters are present; the network must<br />

organize itself to be able to associate clusters with units.<br />
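The "learning with error correction" strategy above can be illustrated with a minimal delta-rule sketch for a single linear neuron; the learning rate, epoch count and training data are hypothetical choices for this example:

```python
def train_delta_rule(samples, epochs=50, lr=0.1):
    # samples: list of (inputs, target); weights and bias start at zero.
    # Each step corrects the weights in proportion to the error and the input.
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for p, target in samples:
            a = b + sum(wi * pi for wi, pi in zip(w, p))  # linear activation
            err = target - a                              # deviation from the teacher
            w = [wi + lr * err * pi for wi, pi in zip(w, p)]
            b += lr * err
    return w, b

# Hypothetical supervised task: learn target = 2 * x
samples = [([x], 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
w, b = train_delta_rule(samples)
```

After training, the weight approaches 2 and the bias approaches 0, i.e. the neuron has learned the teacher's mapping.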

Tens of training algorithms are typically employed for each<br />

learning paradigm. A detailed description of these can be found<br />

in [1][2][5][7].<br />

D. Reasons to update the ANN<br />

Once the ANN is designed, trained, implemented and<br />

installed within an autonomous vehicle, there are many reasons<br />

why it will need updating:<br />

• The system can have self-learning capabilities, and after an extended period of time on the road, a new set of synaptic weights may be available to improve the autonomous driving capabilities.

• The producer of the vehicle (i.e. the car maker and/or its hardware provider) propagates an update to the system over the air because a new version of the AI set of matrices is available.

• Malfunctioning equipment requires a local update by the<br />

auto dealership using the OTA/SOTA port.<br />

• The vehicle owner purchases new services.<br />

In the next paragraph, we will describe the problem and the methodology used for an ANN update.

II. WEIGHT MATRIX UPDATE<br />

Fig. 6 shows the flow from data gathering to the deployment of the ANN. The first step in training an ANN is to collect data from the field. This can be accomplished using a data center built into the vehicle that gathers information from the various sensors in the vehicle and uses that information to identify the detected object, using, for example, a supervised learning methodology. The developer of the ANN usually performs a pre-processing of the dataset to optimize the functionality of the ANN. The data collected to generate an ANN is divided into three categories:



• Training datasets<br />

• Validation datasets<br />

• Testing datasets<br />

The reason why the data is broken into three datasets falls outside the scope of this paper; in general, however, it is to evaluate whether the ANN demonstrates appropriate levels of accuracy and to ensure it does not overfit the training datasets.

A factory server is typically used by developers to train, validate and test the ANN functionality. The objective of this phase is to generate an ANN that is ready to be deployed in the field.

Fig. 6.<br />

Factory vehicle updates flow<br />

The ANN model, which is deployed in the field, requires<br />

constant updates. These updates must originate from a certified<br />

authority (i.e. the original developer server). The update of the<br />

ANN model is needed mainly because the autonomous vehicle<br />

can detect events where the local ANN does not have a unique<br />

output. While the local ANN has the capability to self-learn, the<br />

new combination of inputs and outputs must be sent to the<br />

factory server for validation. The factory server will store this<br />

new combination of inputs and outputs as an additional dataset<br />

to run a new learning process that updates all autonomous<br />

vehicles deployed in the field.<br />

The update process is the main reason why the storage device used to store the ANN cannot be a read-only memory device; it instead requires Flash media. The process of updating the system must be implemented in a secure environment, regardless of the communication channel (i.e. on-board diagnostic port, over-the-air update, etc.).

If the media used to store the artificial intelligence data is easily accessible, it could be changed in an unauthorized manner. The manipulation of this data can change the behavior of the system or, worse, threaten the safety of the driver.

As we will describe in the next paragraph, any attempt to protect data with legacy protection features is a risk, as these security systems can be easily hacked. Therefore, there is a need to implement a robust protection scheme for the system and the media where the data is stored, to ensure a high level of protection against malicious attack.

III. CRYPTOGRAPHIC METHODS FOR VEHICLE SAFETY

We are observing a real transformation of the world, where<br />

most electronic devices will be interconnected and capable of<br />

exchanging messages and communicating with each other. This<br />

can be considered a new evolutionary era, where machines are<br />

self-learning and act independently. In this new era, data storage<br />

devices can’t be considered simply containers of data, but rather<br />

integral elements of electronic devices that contain very<br />

sensitive data—data which is mandatory for the correct behavior<br />

of the system. As an example, the large amount of data that an ADAS system must process must contain as few errors as possible and, at a minimum, the application controller should be able to recognize their presence. This is especially true for data such as the weight matrix of an artificial intelligence algorithm.

Updating the weight matrix is a very critical operation and it<br />

must be done in a very secure fashion. The weight matrix is<br />

usually stored in a NAND, managed NAND, or NOR storage<br />

device.<br />

For the past decades, memory architects and designers have<br />

proposed various protection schemes against accidental or<br />

unintended modification of data. Some protection schemes were<br />

based on a simple command, protected by a password, which<br />

permitted the protection of selected portions of data in the<br />

memory array. This level of protection ultimately proved to be<br />

ineffective because hackers could readily detect the password<br />

simply by sniffing commands from board buses and reusing<br />

them later. Capturing and reusing transition on the bus is known<br />

as ‘replay attack.’<br />

Another practice used to break a weakly protected storage system is the use of a non-original memory component, whose content is exactly the same as that of the original component and which is able to emulate it; this kind of technique is called a component replacement attack.

Modern protection schemes must provide protection against<br />

these types of attacks and others, while at the same time enabling<br />

over-the-air firmware (and data) updates.<br />

International organizations like the Trusted Computing Group (TCG) have proposed new security paradigms [9], based on consolidated cryptographic concepts [6], [7] that have become standard in recent years; refer to [10], [11], [12], [13] by the National Institute of Standards and Technology, U.S. Department of Commerce (NIST).

The new security paradigms and implementations also<br />

involve the end-point components; these paradigms suggest that<br />

each edge component contributes to the security of the entire<br />

electronic system. The strength of this approach is based on the fact that each device has the embedded capability to prove its identity (to ensure that a component replacement attack fails) and to confirm, to the provider of the update, that the stored data has not been changed by a malicious entity. Guaranteeing the authenticity of the data is accomplished via cryptographic measurements.

A. System Secure Zone definition<br />

Most electronic systems are implementing mechanisms to<br />

make sure that data is checked and validated using cryptographic<br />

measurements. Some common definitions include:<br />



• Trusted Platform Module (TPM): A specialized chip, or<br />

part of a system, on an endpoint device that stores<br />

cryptographic keys specific to the host system for<br />

hardware authentication.<br />

• Trusted Execution Environment (TEE): A silicon area inside a controller, isolated from the other circuits and able to implement cryptographic calculations.

• Secure Element (SE): A secure (i.e. tamper-resistant) element able to store secrets and, possibly, perform cryptographic calculations.

For instance, a TPM implementation needs components able to support the assigned rules and perform the operations required to ensure that the data used and updated on a board is secure.

Initially, such concepts were introduced with the intent to protect PC BIOS integrity and to guarantee secure remote updates, but the techniques proved broadly applicable beyond the PC BIOS [16]. Due to their inherent strength, they have also been employed in the automotive field to address safety concerns.

B. Secure communications problem<br />

A TPM inside an electronic system is implemented using<br />

secure components. The components communicate with each<br />

other via a standard communication protocol.<br />

Due to the presence of secure components in the system, the<br />

first feature to be implemented is trusted communication<br />

between the components; refer to Fig. 7.

Fig. 7. TPM implementation<br />

The definition of trusted communication applies when there<br />

is a methodology (i.e. cryptographic signature) to ensure the<br />

integrity of the message and the non-repudiation of the message<br />

(i.e. the clear identification of the sender). This does not imply<br />

that the content of the message must be encrypted (i.e.<br />

maintained secret and/or not visible), but only requires<br />

maintaining a certain knowledge about the origination of the<br />

message. In fact, the main properties of a message sent to the<br />

communication bus must comply with the following:<br />

• Authenticity: Assure that the sender is who we think it is.<br />

• Integrity: The message is not altered intentionally by a<br />

hacker or randomly by noise.<br />

• Non-repudiation: Sender cannot deny that the message<br />

was sent.<br />

Networked TPMs whose messages possess these main properties can operate inside the same system, in parallel, exchanging messages.

Fig. 8. TPMs communication over an unsecure channel<br />

One possible implementation of a secure TPM inside a<br />

system can be the use of some cryptographic features and<br />

properties to create the message signature.<br />

Payload | Signature

Fig. 9. Simple message structure

A message is defined as signed when it contains a signature. The message content is usually referred to as the payload, while the piece of information appended to the message is called the signature. The signature of a message is calculated using a one-way function called a Message Authentication Code (MAC) function, as in (2):

Signature = MAC(key, payload)    (2)

The key contains secret information, known only to the<br />

entities connected to the TPM, to ensure the security of the<br />

system. The key is never shared in a clear way across the<br />

communication bus. The key may be written into each device in the factory, and/or can be shared by using protocols based on,

for example, a public key cryptographic scheme [9][10]. To<br />

further improve upon the security of the entire system, there can<br />

be more than one key for a given pair of TPMs connected in the<br />

communication bus.<br />
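Formula (2) maps directly onto the HMAC primitive available in standard cryptographic libraries. A minimal Python sketch using HMAC-SHA256 follows; the key and payload values are hypothetical:

```python
import hmac
import hashlib

def sign(key: bytes, payload: bytes) -> bytes:
    # Signature = MAC(key, payload), here instantiated with HMAC-SHA256
    return hmac.new(key, payload, hashlib.sha256).digest()

def verify(key: bytes, payload: bytes, signature: bytes) -> bool:
    # Recompute the signature locally and compare in constant time
    return hmac.compare_digest(sign(key, payload), signature)

key = b"shared-secret-known-to-both-TPMs"  # never sent in clear on the bus
msg = b"update weight matrix block 7"      # hypothetical payload
sig = sign(key, msg)
```

A tampered payload or a wrong key makes `verify` fail, which is exactly the receiver-side check described above.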

There are several functions that can be used to generate the signature of a message; however, these functions must satisfy some key requirements:

• “Easy” to calculate: Given a message X, the calculation of MAC(key, X) does not require sophisticated hardware or software resources.

• “Hard” to invert: Given MAC(key, X), it must be “impossible” to determine the message X or the key; in other words, the calculation of [MAC(key, X)]^(−1) must be infeasible.

• “Negligible” collision probability: Given two messages X ≠ Y, MAC(key, X) ≠ MAC(key, Y).

The cryptographic strength of the MAC, together with the secrecy of the key, guarantees the authenticity, the integrity and the non-repudiation of the messages exchanged inside the TPM system and between TPM regions. In fact, only a sender that possesses the (secret) key can generate a signature that the receiver can verify against the message field; otherwise, the message is discarded by the receiver. In summary, the receiver refuses the message any time its local calculation does not match the signature contained in the received message.

A powerful MAC function used in many cryptographic systems is HMAC-SHA256. This function is described in [14][13] and is very robust, in that it satisfies all the requirements defined for a MAC function.

HMAC-SHA256 can process a message of up to 2^64 bits in length and produces a 256-bit signature. As of today, there is no known way to invert this MAC, and no collision conditions have been identified.

C. Message replay problem<br />

One methodology to hack the data communication on a bus is based on recording the messages transmitted through the channel. A recorded message can be used later either to take control of the whole system and/or to generate a certain malfunction in the system under certain circumstances. This kind of attack, known as a replay attack, can be executed successfully without knowledge of the key (known only to the senders and the receivers). A system that can avoid this type of attack is called robust against replay attacks.

One of the methodologies to avoid this kind of attack is to insert a source of variability into the message; this source is called freshness. The purpose of the freshness field is to ensure that the same signature, and hence the same message, is not used twice at different times or in different applications. The content of a message supporting the anti-replay methodology can look like Fig. 10, where the freshness is an additional field after the payload.

Payload | Freshness | Signature

Fig. 10. Message structure with anti replay field<br />

Let us consider what might be used as the freshness field. The freshness is a special field whose value changes at every cycle: a random number, the value of a counter incremented by a clock, etc. The strategy is to change/generate the freshness content every time a message is sent; this value must be well known to both the sender and the receiver of a TPM system. None of the elements communicating on the TPM bus is required to encrypt or hide the content; in the communication protocol, the sender simply prepares the signed message, using the freshness, and sends it to the receiver.

Fig. 11. Communication between two components<br />

When the message arrives at the receiver, the first operation is to check the security of the message:

• The value of the freshness field must be the expected one.

• The signature calculated on the receiver side as Signature = MAC(key, payload | freshness) must match the value appended to the message itself.

Let us look at some of the methodologies used to generate a string that matches the rules for a freshness field. The simplest way to create a freshness field is to use a timestamp; this method needs synchronization between the clock domains, between different TPMs, and between the components inside the TPMs. Another way is to use a number generated only once, called a NONCE (Number used ONCE); this method is not easily implemented because it requires the existence and maintenance of a database of all the generated NONCEs to avoid repetition. One of the best methodologies, suitable for most applications and, above all, for the automotive market, is to use a Monotonic Counter (MC). The Monotonic Counter is a generic digital circuit that increments its value on a certain event; that event can be each message sent, a clock source, a power event, etc. The main property of the MC is that its value cannot be decremented in any way, and its operation must not be compromised by power loss.

Assuming now that the event triggering the increment of the MC is the sending of a message, two messages containing the same payload field are guaranteed to have different signatures, thanks to the freshness value. Introducing the MC value, a message can look like Fig. 12.

Payload | MC | Signature

Fig. 12. Generic message structure<br />

One possible implementation could be to use a unique<br />

freshness value (i.e. value of the monotonic counter) across the<br />

whole system.<br />
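The effect of the monotonic counter can be sketched as follows: two messages with identical payloads obtain different signatures because the MC value differs. The key value and the 8-byte MC encoding are hypothetical choices:

```python
import hmac
import hashlib

class Sender:
    def __init__(self, key: bytes):
        self.key = key
        self.mc = 0  # monotonic counter: increments only, never decrements

    def send(self, payload: bytes):
        self.mc += 1  # freshness: incremented on every message sent
        data = payload + self.mc.to_bytes(8, "big")
        sig = hmac.new(self.key, data, hashlib.sha256).digest()
        return payload, self.mc, sig  # Payload | MC | Signature

key = b"hypothetical-shared-key"
tx = Sender(key)
m1 = tx.send(b"unlock")
m2 = tx.send(b"unlock")  # same payload, different MC value
```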

D. Secure communication protocol<br />

Consider the subsystem shown in Fig. 11. To make sure that only authorized ECMs communicate with each other, the following protocol can be used:

The sender:<br />

• prepares the message including payload, destination of<br />

the message (optional field), message correction<br />

information (optional) and other relevant strings that<br />

will specify the content of the entire message;<br />

• generates the freshness, i.e. incrementing the MC value;<br />

• calculates the signature by using the formula (2);<br />

• packs the message as per Fig. 10;<br />

• sends the message through the unsecure channel.

The receiver:<br />

• recognizes that the message is addressed to it;

• reads the content of the message;<br />

• unpacks the message to retrieve the MC value;<br />

• checks the value of the freshness, i.e. monotonic counter<br />

value:<br />

• if the freshness does not correspond to the expected one<br />

the message is discarded;<br />



• if the freshness corresponds to the expected one, the<br />

message is accepted and the security check, as per next<br />

step, is executed;<br />

• calculates the signature using formula (2), with the content of the message and the known secret key;

• compares the signature of the message with the<br />

signature coming out from the calculation;<br />

• if the signatures are the same, the message is accepted<br />

and eventually executed; otherwise, the message is<br />

discarded because the content is considered<br />

corrupted/hacked.<br />
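The receiver-side steps above (check freshness first, then the signature, and discard on any mismatch) can be sketched in Python; the key and the 8-byte MC encoding are hypothetical choices:

```python
import hmac
import hashlib

def make_message(key: bytes, payload: bytes, mc: int):
    # Sender side: pack payload | MC | signature, as in Fig. 10
    sig = hmac.new(key, payload + mc.to_bytes(8, "big"), hashlib.sha256).digest()
    return (payload, mc, sig)

class Receiver:
    def __init__(self, key: bytes):
        self.key = key
        self.expected_mc = 1  # next freshness value the receiver will accept

    def accept(self, message) -> bool:
        payload, mc, sig = message
        if mc != self.expected_mc:
            return False  # replayed or out-of-order message: discard
        calc = hmac.new(self.key, payload + mc.to_bytes(8, "big"),
                        hashlib.sha256).digest()
        if not hmac.compare_digest(calc, sig):
            return False  # corrupted/hacked content: discard
        self.expected_mc += 1
        return True

key = b"hypothetical-shared-key"
rx = Receiver(key)
msg = make_message(key, b"read UID", 1)
```

Replaying a captured message fails the freshness check, and a message signed with the wrong key fails the signature check.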

For the remainder of this paper, we will refer to this protocol<br />

as secure communication protocol or secure communications<br />

criteria.<br />

The secure communication criteria can be used to build a secure command set usable over an unsecure channel. If the devices’ command set contains freshness and signature fields as shown in Fig. 13, then commands are accepted only if the message matches the secure communication criteria.

Command op-code | Command parameters | MC | Signature

Fig. 13. Secure command structure

The component that executes the command provides a command response. The response must likewise be organized as shown in Fig. 14 and is accepted by the receiver if, and only if, it matches the security criteria.

Command response | MC | Signature

Fig. 14. Secure command response

An example of a secure command is one that provides the<br />

unique identification code (UID) of a device (this is normally<br />

written into each device in the factory). In this case, the command op-code is the one related to the request-UID command, and the command response contains the requested UID in the command response field.

E. Device identification memory content measurement<br />

A storage device such as a NOR/NAND memory must guarantee the genuineness of the stored data and, if necessary, recover the data from a genuine copy to guarantee system functionality in case of a hacker attack or system malfunction. Changes to stored data can be detected through the use of cryptographic tools, including hash functions.

A hash function is a function that maps data of arbitrary size to a fixed size. A function is eligible as a hash function if it is:

• “Easy” to calculate: Given a message X, the calculation of HASH(X) has low complexity.

• “Hard” to invert: [HASH(X)]^(−1) must be impossible to calculate.

• “Negligible” collision probability: Given two messages X ≠ Y, HASH(X) ≠ HASH(Y).

Cryptographic literature proposes many hash functions [9][10]. A common one is SHA-256, described in [13] and already mentioned in this paper in conjunction with the HMAC.

Once a HASH is calculated on a genuine data pattern, the<br />

result is called a golden hash. This data is stored in an area that<br />

is not user accessible and is compared on demand with the<br />

current HASH calculated at the moment of the request. The<br />

result of this comparison enables the system to understand if the<br />

array content is genuine or was accidentally or intentionally<br />

modified.<br />
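The golden-hash comparison can be sketched with SHA-256 from a standard library; here a byte string stands in for the protected memory region:

```python
import hashlib

def compute_hash(data: bytes) -> bytes:
    # SHA-256 digest of the protected region
    return hashlib.sha256(data).digest()

def is_genuine(current_content: bytes, golden_hash: bytes) -> bool:
    # Compare the on-demand hash with the golden hash stored
    # in a non-user-accessible area
    return compute_hash(current_content) == golden_hash

# Hypothetical stand-in for the stored ANN weight data
weight_matrix = b"\x01\x02\x03\x04" * 256
golden = compute_hash(weight_matrix)  # taken while the content is genuine

# A single flipped byte is enough to change the digest completely
tampered = b"\xff" + weight_matrix[1:]
```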

If requested, the memory can provide the hash result as a<br />

command result.<br />

The memory can be configured to automatically hash the<br />

array or part of the array at each power cycle, and in case of data<br />

corruption, to restore a hidden version of the information.<br />

In this case, the restored information cannot be the latest<br />

data, but it might be just a version that permits safety to be<br />

maintained within the whole system. This is not a rule; it is up<br />

to the platform implementer to establish the frequency and the<br />

policies of the hidden information updates; the secure command<br />

set allows the designer to manage them accordingly.<br />

IV. SECURE UPDATE OF ARTIFICIAL INTELLIGENCE<br />

The methodology introduced in the above paragraphs can be used to update the artificial intelligence algorithms within the vehicle and, in general, to update all the features needed, while maintaining a high level of security.

ACKNOWLEDGMENT<br />

The authors wish to thank Robert Bielby for the interesting<br />

suggestions during the paper preparation; Barbara Kolbl for the<br />

great availability in coordinating the review sessions;<br />

Shelley Frost for help with editing the text; Lance Dover,<br />

Francesco Tomaiuolo and Tommaso Zerilli for the exchange of<br />

ideas on the implementation of the security features in silicon.<br />

REFERENCES<br />

[1] C. Bishop, Pattern Recognition And Machine Learning, Springer, 2006,<br />

ISBN 9780387310732.<br />

[2] I. Goodfellow et al., Deep Learning, MIT Press, 2016, ISBN 9780262035613.

[3] J. J. Hopfield, “Neural Networks and Physical Systems with Emergent Collective Computational Abilities,” PNAS, vol. 79, pp. 2554–2558, 1982.

[4] Y. LeCun et al., “Deep Learning,” Nature, 2015, doi:10.1038/nature14539.

[5] S. Marsland, Machine Learning: An Algorithmic Perspective, 2nd ed., CRC Press, 2014, ISBN 9781466583283.

[6] A. Mondello, Pianificazione frequenziale automatica nei sistemi radiomobili cellulari mediante reti neurali (Automatic frequency planning in cellular mobile radio systems using neural networks), Master’s thesis, Politecnico di Torino, 1988.

[7] R. Rojas, Neural Networks a systematic introduction, Springer, 1996,<br />

ISBN 9783540605058.<br />

[8] M. T. Hagan et al., Neural Network Design, 2nd ed., ISBN 978-0971732117.

[9] V. V. Yaschenko, Cryptography: An Introduction. AMS, 2002.<br />

[10] N. Ferguson et al. Cryptography Engineering. Wiley, 2010.<br />



[11] D. Challener et al., A Practical Guide to Trusted Computing, IBM Press, 2007.<br />

[12] TCG, "Trusted Platform Module Library, Part 1: Architecture," March 13, 2014. Available: http://www.trustedcomputinggroup.org/files/static_pagefiles/C2122862-1A4B-B2940289FD15408693D/TPM%20Rev%202.0%20Part%201%20-%20Architecture%2001.07-2014-03-13.pdf<br />

[13] FIPS 180-2, "SHA-256: Secure Hash Algorithm." Available: http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf<br />

[14] FIPS 198-1, "HMAC-SHA-256: Hash-Based Message Authentication Code." Available: http://csrc.nist.gov/publications/fips/fips198-1/FIPS-198-1_final.pdf<br />

[15] NIST SP 800-147, "BIOS Protection Guidelines." Available: http://csrc.nist.gov/publications/nistpubs/800-147/NIST-SP800-147-April2011.pdf<br />

[16] NIST SP 800-155, "BIOS Integrity Measurement Guidelines" (draft). Available: http://csrc.nist.gov/publications/drafts/800-155/draft-SP800-155_Dec2011.pdf<br />

Physical Unclonable Functions to the Rescue<br />

A New Way to Establish Trust in Silicon<br />

Geert-Jan Schrijen<br />

Intrinsic ID<br />

Eindhoven, The Netherlands<br />

geert.jan.schrijen@intrinsic-id.com<br />

Cesare Garlati<br />

prpl Foundation<br />

USA<br />

cesare@prplFoundation.org<br />

Abstract — As billions of devices connect to the Internet,<br />

security and trust become crucial. This paper proposes a new<br />

approach to provisioning a root of trust for every device, based<br />

on Physical Unclonable Functions (PUFs). PUFs rely on the<br />

unique differences of each silicon component introduced by<br />

minute and uncontrollable variations in the manufacturing<br />

process. These variations are virtually impossible to replicate. As<br />

such they provide an effective way to uniquely identify each<br />

device and to extract cryptographic keys used for strong device<br />

authentication. This paper describes cutting-edge real-world<br />

applications of SRAM PUF technology applied to a hardware<br />

security subsystem, as a mechanism to secure software on a<br />

microcontroller and as a basis for authenticating IoT devices to<br />

the cloud.<br />

Keywords — Security; Internet of Things; Physical Unclonable<br />

Function; Authentication<br />

I. INTRODUCTION<br />

The Internet of Things already connects billions of devices<br />

and this number is expected to grow into the tens of billions in<br />

the coming years [5]. To build a trustworthy Internet of Things,<br />

it is essential for these devices to have a secure and reliable<br />

method to connect to services in the cloud and to each other. A<br />

trustworthy authentication mechanism based on device-unique<br />

secret keys is needed such that devices can be uniquely<br />

identified and such that the source and authenticity of<br />

exchanged data can be verified.<br />

In a world of billions of interconnected devices, trust<br />

implies more than sound cryptography and resilient<br />

transmission protocols: it extends to the device itself, including<br />

its hardware and software. The main electronic components<br />

within a device must have a well-protected security boundary<br />

where cryptographic algorithms can be executed in a secure<br />

manner, protected from physical tampering, network attacks or<br />

malicious application code [18]. In addition, the cryptographic<br />

keys at the basis of the security subsystem must be securely<br />

stored and accessible only by the security subsystem itself. The<br />

actual hardware and software of the security subsystem must<br />

be trusted and free of known vulnerabilities. This can be<br />

achieved by reducing the size of the code to minimize the<br />

statistical probability of errors, by properly testing and<br />

verifying its functionality, by making it unmodifiable for<br />

regular users and applications (e.g. part of secure boot or in<br />

ROM) but updateable upon proper authentication (to mitigate<br />

potential vulnerabilities before they are exploited on a large<br />

scale). Ideally, an attestation mechanism is integrated with the<br />

authentication mechanism to assure code integrity at the<br />

moment of connecting to a cloud service [3].<br />

However, we are not there yet. We also need to be able to<br />

trust the actual generation and provisioning of the<br />

cryptographic keys into the security subsystem. Without trust<br />

in the key generation and injection process we cannot assure<br />

that keys are sufficiently random and that every device in fact<br />

obtains a unique key, which is the basic assumption for secure<br />

device identification. In addition, the provisioning must<br />

guarantee that private keys are not known outside the device,<br />

cannot be extracted or cloned, and that public keys are<br />

unmodifiable without proper authentication.<br />

A trustworthy Internet of Things requires a trust continuum<br />

from chip manufacturing through code development, device<br />

manufacturing, software and key provisioning, all the way to<br />

connecting to the actual cloud service. Central to the capability<br />

of a device to authenticate to the cloud is its digital identity,<br />

which is protected by the security subsystem. Devices that<br />

make up the Internet of Things use a broad variety of silicon<br />

components. It will therefore be a daunting challenge to roll out<br />

a universal security solution that works seamlessly for all<br />

possible microchip technologies in a consistent cost-effective<br />

way.<br />

The further outline of this paper is as follows. Section II<br />

articulates the importance of device root keys as a basis for a<br />

digital device identity and authentication. Section III introduces<br />

SRAM-based PUF as an innovative, flexible and cost-effective<br />

way to bootstrap and secure such root keys in a universal way<br />

on the widest possible variety of microchip technologies.<br />

Finally, section IV highlights some relevant real-world<br />

applications.<br />

II. DEVICE IDENTITY AND AUTHENTICATION<br />

To securely authenticate a device that is connecting to a<br />

cloud service or for unmanned machine-to-machine<br />

connectivity, every single device must provide a strong<br />

cryptographic identity. Such identity typically consists of an<br />

asymmetric key pair, composed of a public key and a private<br />

key. The private key must be kept secret in the device and<br />

ideally should never leave the device security boundary. The<br />

public key on the other hand can be output and communicated<br />

to external entities. According to the current PKI model, before<br />

the key pair can be used for device authentication a trusted<br />

entity needs to assert that the public key in fact belongs to a<br />

specific device (e.g. specific brand, model, serial number). This<br />

assertion is created in the form of a digital certificate. The<br />

trusted entity is typically the OEM that manufactures the<br />

device, although many variations in the supply chain setup are<br />

possible.<br />

Devices are authenticated by sending their digital<br />

certificate, which includes the public key, to the verifying<br />

entity, e.g. the cloud service or another device. The verifying<br />

party checks the contents of the certificate and verifies by the<br />

known public key that it is correctly signed by a party it trusts.<br />

The device’s public key that is in the certificate can then be<br />

used to verify the authenticity of the device by means of<br />

established authentication protocols. For example, a challenge-response<br />

protocol can be used in which the verifying party<br />

generates a random number and sends it to the device. The<br />

device generates a response value using its private key to<br />

compute a digital signature on the received challenge. The<br />

verifying party receives the response and verifies that the<br />

signature is correct using the device’s public key. Alternative<br />

authentication schemes based on asymmetric keys are possible.<br />

For example, when the device sets up a secure HTTP<br />

connection to the cloud service using the TLS protocol, the<br />

client authentication check is done as part of the TLS<br />

handshake. This use case is described in section IV.C.<br />
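The challenge-response exchange described above can be sketched in a few<br />
lines. The textbook-RSA parameters below are deliberately tiny and insecure,<br />
chosen only to make the sign/verify arithmetic visible; a real device would use<br />
a vetted implementation (e.g. ECDSA) with keys generated inside the security<br />
boundary:<br />

```python
import hashlib

# Toy, insecure textbook-RSA parameters for illustration only.
p, q = 61, 53
n = p * q                          # public modulus
e = 17                             # public exponent
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent (kept on-device)

def h(msg: bytes) -> int:
    # Hash the challenge and reduce it into the RSA domain.
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % n

def device_sign(challenge: bytes) -> int:
    # The device signs the received challenge with its private key d.
    return pow(h(challenge), d, n)

def verifier_check(challenge: bytes, signature: int) -> bool:
    # The verifying party checks the signature with the public key (n, e).
    return pow(signature, e, n) == h(challenge)

challenge = b"random-nonce-1234"      # generated by the verifying party
sig = device_sign(challenge)          # computed inside the device
assert verifier_check(challenge, sig)              # authentication succeeds
assert not verifier_check(challenge, (sig + 1) % n)  # tampered signature fails
```

A fresh random challenge per session prevents an attacker from replaying a previously observed response.<br />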

The asymmetric key pair that forms the device identity<br />

needs to be securely stored inside the security subsystem. This<br />

can be achieved via key wrapping, a process that involves<br />

encrypting the private key within the security boundary before<br />

storing it in non-volatile memory (NVM). The root key, used to<br />

encrypt the other secrets, must be device-unique and securely<br />

stored inside the security boundary: see the use case in section<br />

IV.A. Besides encrypting additional secrets for permanent<br />

storage, the root key can also be used to derive additional<br />

private/public key pairs directly via a cryptographic key<br />

derivation mechanism. Such keys can be used to authenticate<br />

and establish secure channels with multiple devices.<br />
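The key derivation mechanism mentioned above can be illustrated with an<br />
HKDF-style construction (per RFC 5869) built from the standard library; the<br />
root key value and the purpose labels below are placeholders, not values from<br />
any real system:<br />

```python
import hmac
import hashlib

def hkdf(root_key: bytes, info: bytes, length: int = 32,
         salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) sketch: derive a purpose-specific key from the
    PUF-reconstructed root key. The `info` label separates key purposes."""
    # Extract: concentrate the root key material into a pseudorandom key.
    prk = hmac.new(salt or b"\x00" * 32, root_key, hashlib.sha256).digest()
    # Expand: stretch the PRK into `length` output bytes bound to `info`.
    okm, block = b"", b""
    for i in range((length + 31) // 32):
        block = hmac.new(prk, block + info + bytes([i + 1]),
                         hashlib.sha256).digest()
        okm += block
    return okm[:length]

root = bytes(32)  # stand-in for the key reconstructed from the SRAM PUF
k_wrap = hkdf(root, b"nvm-key-wrapping")
k_tls = hkdf(root, b"tls-client-identity")
assert k_wrap != k_tls                            # distinct key per purpose
assert k_wrap == hkdf(root, b"nvm-key-wrapping")  # reproducible on-device
```

Because derivation is deterministic, each purpose-specific key can be recomputed on demand from the root key instead of being stored.<br />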

Provisioning root keys into a chip is an essential step in<br />

establishing a root of trust anchored in hardware. Traditional<br />

key storage methods require the root keys to be injected at an<br />

early stage in the production chain. This process implies that<br />

secret keys are handed over from device manufacturer to<br />

silicon manufacturer, and hence are revealed to different parties<br />

in the production chain. This creates undesired liabilities for<br />

both parties as the root keys are known outside the device’s<br />

security boundary. In the IoT this problem is enormously<br />

amplified by the sheer number of devices. In this emerging<br />

scenario, distribution and potential leakage of root keys<br />

becomes the single most important problem [9].<br />

To overcome these limitations, a flexible new key<br />

provisioning method is needed that enables secure<br />

programming of device root keys at any stage in the production<br />

process, allowing a device maker to reduce its dependency on<br />

other trusted parties. Physical Unclonable Functions (PUFs)<br />

based on SRAM memory are an ideal candidate for providing a<br />

universal cost-effective solution to this root key programming<br />

and storage problem.<br />

III. PHYSICAL UNCLONABLE FUNCTIONS<br />

Physical Unclonable Functions (PUFs) are known as<br />

electronic design components that derive device-unique silicon<br />

properties, or silicon fingerprints, from integrated circuits<br />

(ICs). The tiny and uncontrollable variations in feature<br />

dimensions and doping concentrations lead to a unique<br />

threshold voltage for each transistor on a chip. Since even the<br />

manufacturer cannot control these exact variations for a<br />

specific device, the physical properties are de facto unclonable.<br />

These minute variations do not influence the intended<br />

operation of the integrated circuit. However, they can be<br />

detected with specific on-chip circuitry to form a device-unique<br />

silicon fingerprint. The implementation of such a measurement<br />

circuit is called a PUF circuit. There are several<br />

alternatives for implementing PUF circuits in an IC. They<br />

vary from comparing path delays and frequencies of free<br />

running oscillators to measuring startup data from memory<br />

components [10]. A particularly promising PUF technology is<br />

based on SRAM memory. The SRAM PUF has excellent<br />

stability over time, temperature and supply voltage variations<br />

and it provides the highest amount of entropy. Furthermore, it<br />

is available as a standard component in almost every IC. The<br />

latter aspect has important advantages in terms of deployment,<br />

testability and time to market. SRAM PUFs can be used in<br />

standard chips by software access to uninitialized SRAM<br />

memory at an early stage of the boot process. Hence, it is not<br />

required to integrate special PUF circuitry into the hardware of<br />

the chip when using SRAM PUF technology.<br />

A. SRAM PUF<br />

SRAM PUFs are based on the power-up values of SRAM<br />

cells. Every SRAM cell consists of two cross-coupled<br />

inverters. In a typical SRAM cell design, the inverters are<br />

designed to be nominally identical. However, due to the minute<br />

process variations during manufacturing, the electrical<br />

properties of the cross-coupled inverters will be slightly out of<br />

balance. In particular, the threshold voltages of the transistors<br />

in the inverters will show some random variation. This minor<br />

mismatch gives each SRAM cell an inclination to power-up<br />

with either a logical 0 or a logical 1 on its output, which is<br />

determined by the stronger of the two inverters. Since this<br />

variation is random, on average 50% of the SRAM cells have 0<br />

as their preferred startup state and 50% have 1. Note that<br />

SRAM memory is normally used by writing data values into<br />

the memory and reading back the written values at a later point<br />

in time. To use SRAM as a PUF, one simply reads out the<br />

memory contents of the SRAM before any data has been<br />

written into it.<br />

One can evaluate the behavior of this SRAM PUF based on<br />

two main properties for PUFs: reliability and uniqueness. Over<br />

the past years thorough analysis of SRAM PUF data has been<br />

performed. Startup patterns have been measured under various<br />

conditions, from SRAM implemented in several technology<br />

nodes (180nm down to 14nm) by several foundries with<br />

different processes.<br />

Fig. 1: A 6-transistor SRAM cell; two cross-coupled inverters are formed by<br />

left inverter consisting of PMOS transistor PL and NMOS transistor NL and<br />

right inverter consisting of PMOS transistor PR and NMOS transistor NR.<br />

Left and right access transistors are indicated as AXL and AXR respectively.<br />

Extensive tests performed by leading PUF vendors and<br />

universities (e.g. in [10],[17]) have yielded the following<br />

results:<br />

• Reliability: Most of the bit cells in an SRAM array have a<br />

strongly preferred startup value which remains static over<br />

time and under varying operational conditions. A minority<br />

of cells consist of inverters that are coincidentally well<br />

balanced and result in bit cells that will sometimes start up<br />

as a 0 and sometimes as a 1. This causes limited “noise”<br />

(or, deviation from the initial reference measurement) in<br />

consecutive SRAM startup measurements. Tests<br />

demonstrate that the noise level of the SRAM PUF under<br />

extensive environmental conditions (e.g. temperatures<br />

ranging from -55˚C to 125˚C) and over years of lifetime<br />

(see also [12]) is sufficiently low to extract cryptographic<br />

keys with overwhelming reliability when using appropriate<br />

post-processing techniques.<br />

• Uniqueness: Extensive testing demonstrates that the startup<br />

pattern of an SRAM array is unique for every IC and even<br />

for a specific memory (region) within every IC. It is highly<br />

unpredictable from chip to chip and hence provides a large<br />

amount of entropy. The amount of entropy is sufficiently<br />

high to efficiently generate secure and unique cryptographic<br />

keys suitable for a broad range of applications.<br />
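The reliability and uniqueness properties above can be illustrated with a toy<br />
statistical model of SRAM startup behavior; the cell count and the 6% flip<br />
probability below are illustrative assumptions, not measured values:<br />

```python
import random

def sram_startup(preferred, flip_prob, rng):
    # One power-up readout: each cell yields its preferred value,
    # except that unstable cells occasionally flip.
    return [b ^ (rng.random() < flip_prob) for b in preferred]

def frac_hd(a, b):
    # Fractional Hamming distance between two readouts.
    return sum(x != y for x, y in zip(a, b)) / len(a)

rng = random.Random(42)  # fixed seed: this is only a model
n_cells = 4096
# Manufacturing variation fixes a random preferred value per cell, per device.
dev_a = [rng.randrange(2) for _ in range(n_cells)]
dev_b = [rng.randrange(2) for _ in range(n_cells)]

# Reliability: two readouts of the SAME device differ only by noise.
intra = frac_hd(sram_startup(dev_a, 0.06, rng),
                sram_startup(dev_a, 0.06, rng))
# Uniqueness: readouts of DIFFERENT devices disagree on ~50% of cells.
inter = frac_hd(sram_startup(dev_a, 0.06, rng),
                sram_startup(dev_b, 0.06, rng))
print(f"intra-device HD ~ {intra:.3f}, inter-device HD ~ {inter:.3f}")
assert intra < 0.25 < inter
```

The large gap between intra-device and inter-device distance is exactly what lets a Fuzzy Extractor correct the noise while keeping devices distinguishable.<br />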

B. Root Key Storage with PUFs<br />

PUFs can be used to reconstruct a device-unique<br />

cryptographic root key on the fly, without storing secret data<br />

in non-volatile memory. Since PUF responses are noisy, they<br />

cannot be used directly as a cryptographic key. To remove the<br />

noise and to extract sufficient entropy, a so-called Fuzzy<br />

Extractor is needed. A Fuzzy Extractor or Helperdata<br />

Algorithm is a cryptographic primitive that turns PUF<br />

response data into a reliable cryptographic root key.<br />

The Fuzzy Extractor (see Fig. 2) has two modes of<br />

operation: Enrollment and Key Reconstruction.<br />

In Enrollment mode, which is typically executed once over<br />

the lifetime of the chip, the Fuzzy Extractor reads out an<br />

SRAM PUF response and computes the so-called Helperdata<br />

that is then stored in (non-volatile) memory accessible to the<br />

chip [11].<br />

Whenever the cryptographic root key is needed by the chip,<br />

the Fuzzy Extractor is used in the Key Reconstruction mode. In<br />

this mode a new SRAM PUF response is read out and<br />

Helperdata is applied to correct the noise. A hash function is<br />

subsequently applied to reconstruct the cryptographic root key.<br />

In this way the same key can be reconstructed under varying<br />

external conditions such as temperature and supply voltage.<br />

Important: by design the Helperdata does not contain any<br />

information on the cryptographic key itself and it can therefore<br />

be safely stored in any kind of unprotected Non-Volatile<br />

Memory (NVM) on- or off-chip. At rest, when the device is<br />

powered down, no secret is ever present in memory making<br />

traditional expensive anti-tamper requirements obsolete.<br />

Fig. 2: A Fuzzy Extractor operates in two basic modes: i) In Enrollment mode<br />

(steps 1-2) Helperdata is generated based on a measured SRAM PUF<br />

response, ii) In the Key Reconstruction mode (steps 3-5) the Helperdata is<br />

combined with a fresh SRAM PUF response for reconstructing the device-unique<br />

cryptographic root key.<br />
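A minimal code-offset Fuzzy Extractor can be sketched as follows, using a<br />
simple repetition code for noise correction. This is a didactic sketch that<br />
assumes uniformly random PUF bits; commercial implementations use far stronger<br />
error-correcting codes and additional countermeasures:<br />

```python
import hashlib
import random

REP = 15  # repetition-code length: each secret bit is stored 15 times

def enroll(puf_bits, secret_bits):
    # Enrollment: Helperdata = PUF response XOR repetition-encoded secret.
    code = [b for b in secret_bits for _ in range(REP)]
    return [p ^ c for p, c in zip(puf_bits, code)]

def reconstruct(noisy_puf, helper):
    # Reconstruction: XOR a fresh (noisy) readout with the Helperdata,
    # majority-decode each block to remove the noise, then hash into a key.
    noisy_code = [p ^ h for p, h in zip(noisy_puf, helper)]
    bits = [int(sum(noisy_code[i:i + REP]) > REP // 2)
            for i in range(0, len(noisy_code), REP)]
    return hashlib.sha256(bytes(bits)).digest()

rng = random.Random(7)
secret = [rng.randrange(2) for _ in range(64)]     # chosen at enrollment
puf = [rng.randrange(2) for _ in range(64 * REP)]  # reference PUF readout
helper = enroll(puf, secret)                       # stored in plain NVM
noisy = [p ^ (rng.random() < 0.05) for p in puf]   # fresh readout, 5% noise
key = reconstruct(noisy, helper)
assert key == hashlib.sha256(bytes(secret)).digest()  # same key despite noise
```

With uniform PUF bits the Helperdata reveals nothing about the key by itself, which is why it can sit in unprotected memory.<br />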

C. Fuzzy Extractor implementations<br />

A Fuzzy Extractor is typically implemented inside a chip in<br />

one of the following basic forms:<br />

• Hardware IP: A hardware IP module that is connected to a<br />

dedicated SRAM memory. The Fuzzy Extractor hardware<br />

IP block directly controls the SRAM memory interface to<br />

read out the PUF values. The cryptographic key can be<br />

output via a dedicated interface to a cryptographic<br />

accelerator. The security advantages of such an<br />

implementation are discussed in the next subsection.<br />

Besides security advantages, a Fuzzy Extractor<br />

implemented in hardware is typically faster and more<br />

power efficient than the equivalent software<br />

implementation.<br />

• Software IP: A software library that can access a dedicated<br />

portion of the overall SRAM memory. It is preferable that<br />

the SRAM portion used by the PUF algorithm is not shared<br />

with other software. Memory management units, silicon<br />

firewalls and trusted execution environments (TEEs) should be<br />

used if available. The Fuzzy Extractor does not<br />

contain any secrets, so it does not need to be encrypted.<br />

However, it is important to guarantee the integrity of the<br />

software itself. This can be achieved with a secure boot<br />

setup or by locking down the software on the chip with<br />

alternative mechanisms provided by the chip itself.<br />

Advantages of the software variant include flexible<br />

deployment options, e.g. retrofitting existing devices in the<br />

field and integration with other security components, with<br />

minimal or no hardware changes.<br />

D. Security level provided by PUFs<br />

Using the PUF to reconstruct a cryptographic root key has the<br />

following security advantages:<br />

• Keys are reconstructed on the fly when needed and are<br />

present only temporarily within the security boundary of<br />

the chip. This greatly reduces the attack surface and time<br />

window for exploiting potential vulnerabilities.<br />

• When the chip is powered down, no physical traces of the<br />

key are present in the chip.<br />

• Guaranteed randomness from the physics of the silicon<br />

results in full entropy keys.<br />

• Root keys are generated within the security boundary of<br />

the chip rather than being injected from the outside,<br />

resulting in a safer and more flexible provisioning process<br />

throughout the supply chain.<br />

It is important to observe that the Fuzzy Extractor must be<br />

implemented and integrated in a secure manner to minimize<br />

the exposure to various attack vectors including software<br />

vulnerabilities, side-channel and invasive attacks. Various<br />

countermeasures are possible and this is an area where<br />

established PUF vendors have developed considerable<br />

proprietary IP.<br />

The actual security level achieved depends largely on the<br />

integration of the Fuzzy Extractor with the security subsystem.<br />

One of the design goals is to make sure that only the Fuzzy<br />

Extractor can access the SRAM PUF. In case of a hardware<br />

integration this is assured by connecting a dedicated SRAM<br />

memory directly to the Fuzzy Extractor and making sure that<br />

there are no software interfaces to it. To this end, it is for<br />

example preferable to use a Built-In Self Test instead of a scan<br />

chain [2]. In case of a software implementation one needs to<br />

make sure that access control settings of the chip are set up<br />

correctly. For example, this is done by using a memory<br />

management unit to reserve access to the SRAM PUF region<br />

of the memory dedicated to the Fuzzy Extractor software, by<br />

locking down the software image using firmware lock bits, by<br />

applying secure boot or by integrating into a TEE.<br />

Additionally, the in-circuit debug facilities need to be<br />

disabled.<br />

Another design goal is to make sure that the cryptographic<br />

key that is output by the Fuzzy Extractor is transported<br />

securely to the cryptographic software that requires it. In case<br />

of a hardware implementation, this can be arranged by<br />

connecting the output bus of the Fuzzy Extractor hardware via<br />

a direct internal connection to a cryptographic coprocessor. In<br />

case of a software implementation, one needs to make sure that<br />

any registers used to store the key are cleared as soon as<br />

possible and cannot be accessed by untrusted processes.<br />

Similar measures as described in the previous paragraph can be<br />

taken to lock down the security boundary of the chip.<br />

E. Known attacks to PUFs<br />

Delay-based PUFs such as Arbiter PUFs and Ring-Oscillator<br />

PUFs promise a large space of independent challenge-response<br />

pairs that can be used for special authentication<br />

schemes [6][7]. In practice, however, it turns out that<br />

implementations of such PUFs are broken by modelling<br />

attacks, showing that responses are predictable given a limited<br />

subset of challenge-response pairs [15][16].<br />

Memory-based PUFs such as the SRAM PUF are not<br />

susceptible to such attacks. The attacks that have been<br />

demonstrated on SRAM PUFs have been conducted only in<br />

non-realistic laboratory setups and do not form a threat to<br />

practical implementations. For example, with highly<br />

specialized equipment such as laser scanners it seems possible<br />

to read out SRAM memory contents by observing photo<br />

emissions during repeated read cycles [14]. This method is,<br />

however, feasible only in antiquated large technology nodes<br />

(e.g. 300 nm) and does not scale down to smaller modern<br />

technology nodes. In addition, the documented attacks require<br />

a situation where many consecutive SRAM read operations are<br />

executed sequentially on the same SRAM address range; a<br />

situation that does not occur in a good Fuzzy Extractor<br />

implementation. The work presented in [8] uses such a readout<br />

method in combination with a Focused Ion Beam to “clone” a<br />

PUF response from a first to a second SRAM memory. It<br />

should be noted that this is feasible only in obsolete large<br />

technology nodes (demonstrated on 600nm technology) and<br />

that it is only practical to clone a very limited number of bits<br />

with significant effort. In addition, commercial<br />

implementations include various proprietary countermeasures<br />

that make these kinds of attack simply infeasible. As of today<br />

there are no documented successful attacks on commercial-grade<br />

SRAM PUF implementations.<br />

IV. USE CASES<br />

This section offers some real-world examples of successful<br />

SRAM PUF applications.<br />

A. Secure key vault<br />

The SRAM PUF can be used to provide a cryptographic<br />

root key for a hardware security subsystem. The Fuzzy<br />

Extractor IP block is integrated with the security system IP.<br />

The chip-unique cryptographic root key that is reconstructed<br />

from the SRAM PUF feeds directly into the cryptographic<br />

module, for example an AES core. Fig. 3 shows a typical<br />

security subsystem architecture.<br />

To initialize the system, the PUF must be enrolled: a first<br />

readout of the SRAM startup values is used by the Fuzzy<br />

Extractor to compute the Helperdata (steps 0 and 1 in Fig. 3).<br />

Once the Helperdata is stored in the chip’s non-volatile<br />

memory (NVM), the enrollment step is completed.<br />

Fig. 3: Secure key vault based on SRAM PUF depicting Enrollment steps 0<br />

and 1 (dotted lines); Key reconstruction steps 2,3,4, and Encryption of data<br />

generated on processor in steps 5 and 6.<br />

The enrollment step establishes the device-unique<br />

cryptographic root key in the security subsystem. To<br />

reconstruct this key for use, the Helperdata is read from NVM<br />

and combined with a readout of the SRAM startup values in<br />

the Fuzzy Extractor (steps 2 and 3). The reconstructed key is<br />

fed into the AES core (step 4). Data that is being processed by<br />

the CPU can be securely stored by feeding it to the security<br />

subsystem, where it is encrypted using the AES module and<br />

stored in NVM (steps 5 and 6). Note that besides just<br />

encrypting the data, the AES core can also be used to protect<br />

the integrity of the data by computing additional<br />

authentication tags or by using an authenticated encryption<br />

mode such as AES-GCM.<br />

When the processor requires the secure data, steps 2, 3 and<br />

4 are repeated to reconstruct the cryptographic root key and<br />

load it into the AES core. Steps 6 and 5 are then reversed in<br />

direction to feed the encrypted data to the AES block, have the<br />

AES block decrypt it and feed it back to the CPU.<br />

This mechanism makes it possible to keep secrets in<br />

otherwise unprotected non-volatile memory. Note that only<br />

encrypted data and non-sensitive Helperdata is ever stored in<br />

NVM. No secret is ever stored in permanent memory. The<br />

cryptographic root key that is reconstructed from the SRAM<br />

PUF is not known anywhere outside the security boundary.<br />

Therefore, the data that is securely stored in the chip’s NVM<br />

can be decrypted only on the same chip on which it has been<br />

generated. Transferring it to any other target device is not a<br />

concern, even if the Helperdata is copied along with it. The<br />

Helperdata can be used only with the specific SRAM<br />

fingerprint of the chip that generated it in the first place.<br />

B. Software protection in microcontroller<br />

This section describes a use case where the SRAM PUF is<br />

used to protect software IP on a microcontroller. We assume<br />

the microcontroller has an internal flash memory where its<br />

program code can be stored. Before code is executed it is<br />

loaded into an internal SRAM memory. A small part of the<br />

SRAM memory is reserved to be used as PUF. This can be<br />

achieved by instructing the compiler to exclude a certain part<br />

of the SRAM from the memory map, assuring that it will not<br />

be “visible” by other software.<br />

We furthermore assume that the microcontroller has some<br />

access control mechanisms to:<br />

1. Lock down the software in the flash memory to prevent any<br />

modification<br />

2. Disable in-circuit debug facility<br />

Except for a few low-end microcontrollers, these access<br />

control mechanisms are quite common.<br />

1) Setup phase<br />

To securely set up the system, we use a provisioning PC in<br />

a trusted environment to load the code in the flash memory of<br />

the microcontroller, as depicted in step 1 of Fig. 4. This is the<br />

software that will be executed at runtime (see next section).<br />

The software consists of:<br />

• A boot image containing the Fuzzy Extractor algorithm and<br />

the cryptographic cipher algorithms used to decrypt the<br />

software image<br />

• A software image encrypted with key S. Initially the<br />

software has an empty header. At the end of the setup phase<br />

the header will be overwritten with a uniquely encrypted<br />

header per device.<br />

After storing the software code in flash memory, the<br />

provisioning PC loads a temporary enrollment image in the<br />

executable SRAM of the device. This is depicted in step 2.<br />

The enrollment image contains the Fuzzy Extractor algorithm,<br />

as well as a cryptographic cipher that can be used to encrypt a<br />

header for the software image in flash. Furthermore, it<br />

contains the software image encryption key S.<br />

When execution of the enrollment image is triggered (step<br />

3), the SRAM PUF is read out (step 4) and Helperdata is<br />

created by the Fuzzy Extractor algorithm. The Helperdata is<br />

stored in the flash memory (step 5). Based on the Helperdata<br />

and the SRAM PUF readout, the cryptographic root key of the<br />

device K is reconstructed by the Fuzzy Extractor. Using the<br />

cryptographic cipher in the enrollment image, the software<br />

image encryption key S is encrypted with the device-unique<br />

key K. The resulting value, denoted as E[K](S), is written in<br />

the header of the encrypted software image (step 6). The flash<br />

memory now contains an encrypted software image, with a<br />

header that is specifically encrypted for the device it is stored<br />

on.<br />

At the end of the setup phase the enrollment image is removed from SRAM. The provisioning PC triggers the necessary mechanisms in the microcontroller to lock the software images in flash and to disable the debug port.<br />

Fig. 4: SRAM PUF-based software protection mechanism, setup phase.<br />

2) Runtime operation<br />

Fig. 5: SRAM PUF-based software protection mechanism, runtime operation.<br />

The runtime flow is depicted in Fig. 5. First the microcontroller boot loader copies the first boot image into the SRAM of the microcontroller (step 1) and triggers execution (step 2). The boot stage code reads the SRAM PUF values (step 3) as well as the Helperdata (step 4). The Fuzzy Extractor algorithm in the boot image uses these values to reconstruct the device-unique root key K. The key K is used to decrypt the header of the software image (step 5). Decrypting the software image header results in the software image key S, which is then used to decrypt the software image in flash (step 6) as it is being copied to execution SRAM (step 7). When the full software image is decrypted and available in the SRAM, execution of the image is triggered (step 8).<br />

The PUF plays an essential role in providing the microcontroller with a device-unique cryptographic root key that is used to bind the software image to the specific device. The root key is only temporarily reconstructed in working memory to decrypt the header of the software image. Likewise, the decrypted software image key is only temporarily present in working memory to decrypt the software image. When the device is powered off, the plain software disappears from the execution SRAM memory. Only encrypted values are left in the flash memory.<br />

The software protection method described in this section can be retrofitted to existing devices as it is completely software based. Still, the root of trust originates from the SRAM PUF in hardware. The core component that enables this mechanism is the Fuzzy Extractor, which enables key reconstruction from a standard SRAM memory available in the microcontroller. An open source reference implementation of such a Fuzzy Extractor is available as part of the prpl Security Framework, see [19].<br />
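The wrap-and-unwrap of the image key S (writing E[K](S) into the header during setup, recovering S again at runtime) can be condensed into a short sketch. SHA-256 in counter mode is used here purely as a stand-in for the cryptographic cipher in the enrollment and boot images, and all key material is placeholder data:

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    # SHA-256 in counter mode as a stand-in for the real cipher.
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Setup phase: the image key S is wrapped under the PUF-derived root key K
# and written into the image header as E[K](S).
K = hashlib.sha256(b"puf-derived-root-key").digest()  # from Fuzzy Extractor
S = hashlib.sha256(b"software-image-key").digest()
header = xor(S, keystream(K, len(S)))                 # E[K](S)
firmware = b"example firmware image"
encrypted_image = xor(firmware, keystream(S, len(firmware)))

# Runtime: reconstruct K via the Fuzzy Extractor, unwrap S from the
# header, then decrypt the image while copying it to execution SRAM.
S_unwrapped = xor(header, keystream(K, len(header)))
decrypted = xor(encrypted_image, keystream(S_unwrapped, len(encrypted_image)))
```

Because only `header` and `encrypted_image` ever reach flash, a cloned flash dump is useless on another chip: without that chip's SRAM PUF, K cannot be reconstructed and S stays wrapped.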

C. Device authentication to the cloud<br />

In this use case scenario, we describe how the SRAM PUF<br />

is used as a basis to connect IoT end nodes securely to a cloud<br />

service such as Amazon Web Services or Microsoft Azure<br />

cloud. We assume that the IoT device employs an off-the-shelf<br />

microcontroller as its main processing unit. An OEM (Original<br />

Equipment Manufacturer) owns both the devices and the<br />

service that is running in the cloud. The situation is depicted in<br />

Fig. 6.<br />

1) Installation phase<br />

In the installation phase (step 1) the OEM installs its IoT<br />

Service on the cloud platform of choice. The cloud service has<br />

its own private/public key pair denoted d S/Q S. This key pair is<br />

used to authenticate the service toward its clients. Furthermore,<br />

the cloud service knows the public key Q CA of a trusted<br />

Certificate Authority. This public key is used to verify device<br />

identity certificates of the end nodes that connect to the cloud<br />

service.<br />

The OEM also provides a software image to the Contract<br />

Manufacturer for installation on the IoT devices (step 2).<br />

Embedded in this software image is the URL of the cloud<br />

service, as well as the public key Q S of the cloud service. This<br />

key is used to authenticate the OEM IoT service toward the<br />

device. The software image contains the following<br />

submodules:<br />

• Fuzzy Extractor: The software library that reads out the<br />

uninitialized SRAM values from a reserved part of the<br />



SRAM of the IoT device in order to reconstruct a device-unique cryptographic key K.<br />

• TLS & crypto library: A software library that contains<br />

cryptographic functionality for securing a network<br />

connection using the Transport Layer Security protocol<br />

[20].<br />

• Connectivity library: A network stack running on the IoT<br />

device, which enables the device to connect to Internet<br />

services. It will typically set up a TCP/IP stack over a<br />

physical network connection such as ethernet or Wi-Fi.<br />

Furthermore, it will support a connectivity protocol such as<br />

MQTT (Message Queuing Telemetry Transport) to run on<br />

top of the TCP/IP stack [21].<br />

• OEM Application: The actual application software that<br />

provides the device with the intended functionality.<br />

2) Setup phase<br />

Every device will go through a setup phase in the production environment of the Contract Manufacturer, which operates on behalf of the OEM. As part of this enrollment step, the Fuzzy Extractor reads out the SRAM PUF values (step 3) and generates Helperdata (step 4), which is stored in non-volatile memory. The device-unique cryptographic key K is output by the Fuzzy Extractor and used with a Key Derivation Function in the TLS crypto library to derive an asymmetric elliptic curve device key pair d D/Q D. The private key of this key pair is never stored in any non-volatile memory; it is reconstructed on the fly only when needed. The public device key Q D is sent via the contract manufacturer PC or Automated Test Equipment to the Certificate Authority service (step 6). The CA generates a device certificate, which includes the device public key Q D as well as a signature created with the CA private key d CA. Optionally the certificate may include other chip or device IDs. The device certificate, denoted as $[d CA](Q D), is stored in non-volatile memory on the device (step 7). After this step the device has an “identity” in the form of a public-key certificate.<br />

Note that this phase implements a one-time-trust event where the contract manufacturer assures that the device public key Q D is valid for the specific device and triggers the generation of a certificate at the CA. The contract manufacturer is trusted for correctly requesting certificates for public keys of the devices. It does not have to be trusted to handle any sensitive private keys.<br />

Fig. 6: Cloud authentication mechanism based on SRAM PUF.<br />

3) Runtime operation<br />

Once the IoT device is in the field, it can autonomously set up secure connections to the OEM IoT Service. First, the Fuzzy Extractor is used to reconstruct the device-unique cryptographic key K from a readout of the SRAM PUF (step 8) and the Helperdata (step 9). The cryptographic key K is then used by the crypto library to derive the asymmetric key pair d D/Q D (step 10) and prepare for cryptographic support of the secure network connection.<br />
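Step 10 can be illustrated with a minimal sketch. HMAC-SHA256 stands in for the Key Derivation Function, and exponentiation in a multiplicative group modulo a prime stands in for elliptic curve point multiplication; the group parameters and the derivation label are illustrative only, not those of any real deployment.

```python
import hashlib
import hmac

# Toy group parameters standing in for the elliptic curve.
P_MOD = 2**127 - 1   # a Mersenne prime (illustrative modulus)
G = 3                # generator, stand-in for the curve base point P

def derive_keypair(K: bytes):
    # KDF: HMAC-SHA256(K, label) -> private scalar d_D.
    # pow(G, d, P_MOD) stands in for the point multiplication Q_D = d_D * P.
    d = int.from_bytes(
        hmac.new(K, b"device-key-v1", hashlib.sha256).digest(), "big"
    ) % (P_MOD - 1)
    return d, pow(G, d, P_MOD)

K = hashlib.sha256(b"sram-puf-root-key").digest()  # from the Fuzzy Extractor
d_D, Q_D = derive_keypair(K)

# The derivation is deterministic: after a reboot, re-running the Fuzzy
# Extractor yields the same K and therefore the same key pair, so neither
# d_D nor Q_D ever needs to be stored in non-volatile memory.
d_again, Q_again = derive_keypair(K)
```

This determinism is exactly what makes the scheme work without key storage: the "stored" private key is really a recomputation recipe rooted in the silicon.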



The connectivity library contacts the Internet service via the URL that is fixed in the OEM software image (step 11). A TLS connection is then set up where the server is authenticated toward the device based on the public key Q S that is stored in the OEM SW image (fetched via step 11). The Device Certificate (obtained via step 12) is used to authenticate the client IoT device toward the OEM IoT cloud service. Setting up the TLS connection (step 13) uses support from the crypto algorithms in the TLS layer (step 14) and on a high level proceeds as follows [20], see also Fig. 7:<br />

a. Client and Server exchange initial messages where the client sends to the server a list of ciphers that it supports. The server compares this list with the ciphers that it supports and selects its preferred cipher that both sides support. In this case we assume that TLS_ECDHE_ECDSA is supported by the client and selected for setting up the secure connection. This cipher combination uses elliptic curve Diffie-Hellman key exchange to set up a shared session key, and the elliptic curve digital signature algorithm for authentication (i.e. message signing).<br />

b. The server determines the elliptic curve parameters, including the elliptic curve base point P. The server randomly generates an ephemeral elliptic curve key pair d SR/Q SR, where Q SR = d SR∙P, and signs the ephemeral public key Q SR with its private key d S using the ECDSA signature algorithm. Note that the operator “∙” denotes point multiplication over the elliptic curve. The signature value is denoted as $[d S](Q SR).<br />

c. Then the server sends the signed ephemeral public key $[d S](Q SR) to the client, together with the elliptic curve parameters.<br />

d. The client uses the server’s public key Q S to verify that Q SR was signed correctly.<br />

e. The client sends its public key certificate to the server. The server uses the CA public key Q CA to verify the certificate and to be assured of the correct device’s public key Q D.<br />

f. The client also randomly generates an ephemeral elliptic curve key pair d DR/Q DR, where Q DR = d DR∙P. The public ephemeral key Q DR is sent back to the server.<br />

g. The client uses its private key d D to sign the TLS transcript (messages exchanged in steps a-f) and sends the signature to the server.<br />

h. The server verifies the signature using the previously verified device public key Q D.<br />

i. The client computes a shared secret as S = d DR∙Q SR = d DR∙d SR∙P over the elliptic curve group.<br />

j. The server computes the same shared secret as S = d SR∙Q DR = d SR∙d DR∙P.<br />

Now that both client and server side have the same shared key S, symmetric session keys are derived from it to encrypt and authenticate further messages that are exchanged between both sides. Note that authentication of the client IoT device toward the server is done through steps e, g and h. The private device key d D that is used for this authentication step is derived from the PUF key K. When the IoT device is powered off, no private keys are present. No sensitive data is ever stored in any NVM memory.<br />

Fig. 7: Simplified overview of TLS key agreement steps based on ECDH protocol.<br />
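Steps b, f, i and j can be sketched with a common simplification: ordinary Diffie-Hellman modulo a prime replaces elliptic curve point multiplication, so `pow(G, d, P_MOD)` plays the role of d∙P. The parameters are illustrative only, and the signing steps are omitted to keep the sketch focused on the key agreement itself.

```python
import secrets

P_MOD = 2**127 - 1   # toy prime modulus, stand-in for the curve group
G = 3                # stand-in for the curve base point P

# Step b: the server generates an ephemeral key pair d_SR / Q_SR.
d_SR = secrets.randbelow(P_MOD - 2) + 1
Q_SR = pow(G, d_SR, P_MOD)          # stands in for Q_SR = d_SR * P

# Step f: the client generates an ephemeral key pair d_DR / Q_DR.
d_DR = secrets.randbelow(P_MOD - 2) + 1
Q_DR = pow(G, d_DR, P_MOD)          # stands in for Q_DR = d_DR * P

# Steps i and j: each side combines its own private value with the other
# side's public value and arrives at the same shared secret S.
S_client = pow(Q_SR, d_DR, P_MOD)   # (G^d_SR)^d_DR = G^(d_SR*d_DR)
S_server = pow(Q_DR, d_SR, P_MOD)   # (G^d_DR)^d_SR = G^(d_DR*d_SR)
```

Because both exponent orders yield the same group element, client and server agree on S without it ever crossing the wire, which is the property the session keys are then derived from.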

The SRAM PUF provides the flexibility to instantiate a<br />

device-unique key in the device and form the basis of a device<br />

identity (through the device certificate). No IDs or keys have to<br />

be injected by the silicon manufacturer. The OEM can decide<br />

to run the enrollment step at any semi-trusted time and place in<br />

the production chain. This has the advantage that the OEM can<br />

take device security into its own hands, without having to rely on<br />

key injection by the silicon manufacturer and secure handover<br />

of installed keys. This reduces key provisioning costs in the<br />

production chain considerably.<br />

V. CONCLUSIONS<br />

SRAM-based Physical Unclonable Functions form a<br />

universal method to securely store cryptographic keys in the<br />

chips of IoT devices. SRAM PUF provides hardware-rooted<br />

security that is enabled via software. When the device is<br />

powered down, no secrets are stored in memory, making<br />

cryptographic keys impossible to extract. In addition, SRAM PUF provides a high degree of flexibility throughout the device supply chain. Every device can generate its own keys at any desired point in the production chain. The entropy of these<br />

keys is determined by randomness in the physics originating<br />

from minute and uncontrollable process variations in the<br />

silicon production process. This makes PUF-based<br />

implementations much more resilient than traditional key<br />

injection options. The flexibility of the SRAM PUF process<br />

results in cost reductions as external key management<br />

infrastructure is kept to a minimum. SRAM PUF technology<br />

works reliably on any device that has silicon SRAM onboard: it<br />

will become the option of choice to establish trust in silicon for<br />

billions of devices that make the future Internet of Things.<br />



REFERENCES<br />

[1] M. Bhargava, C. Cakir, and K. Mai, “Comparison of bi-stable and delay-based Physical Unclonable Functions from measurements in 65nm bulk<br />

CMOS,” in Custom Integrated Circuits Conference (CICC), 2012 IEEE,<br />

2012, pp. 1–4.<br />

[2] M. Cortez, G. Roelofs, S. Hamdioui, G. Di Natale, “Testing PUF-Based<br />

Secure Key Storage Circuits”, DATE conference 2014,<br />

https://www.dateconference.com/files/proceedings/2014/pdffiles/07.7_2.pdf<br />

.<br />

[3] Trusted Computing Group, Device Identity Composition Engine<br />

workgroup, https://trustedcomputinggroup.org/work-groups/dicearchitectures/<br />

.<br />

[4] Y. Dodis, L. Reyzin, and A. Smith, “Fuzzy extractors: How to generate<br />

strong keys from biometrics and other noisy data,” in Advances in<br />

Cryptology - EUROCRYPT 2004, ser. Lecture Notes in Computer<br />

Science, Springer Berlin Heidelberg, 2004, vol. 3027, pp. 523–540.<br />

[5] Gartner newsroom, “Gartner Says 6.4 Billion Connected Things Will Be<br />

in Use in 2016, Up 30 Percent From 2015”,<br />

https://www.gartner.com/newsroom/id/3165317 .<br />

[6] B. Gassend, D. Clarke, M. van Dijk, S. Devadas, “Silicon physical<br />

random functions” In: ACM Conference on Computer and<br />

Communications Security (ACM CCS). pp. 148–160. ACM, New York,<br />

NY, USA (2002).<br />

[7] B. Gassend, D. Clarke, M. van Dijk, S. Devadas, “Silicon physical<br />

random functions” In: ACM Conference on Computer and<br />

Communications Security (ACM CCS). pp. 148–160. ACM, New York,<br />

NY, USA (2002).<br />

[8] C. Helfmeier, C. Boit, D. Nedospasov, and J.-P. Seifert, “Cloning<br />

physically unclonable functions,” in Hardware-Oriented Security and<br />

Trust (HOST), 2013 IEEE International Symposium on, 2013, pp. 1–6.<br />

[9] Intrinsic ID whitepaper, “Flexible Key Provisioning with SRAM PUF”,<br />

https://www.intrinsic-id.com/resources/white-papers/white-paperflexible-key-provisioning-sram-puf/<br />

.<br />

[10] S. Katzenbeisser, U. Kocabas¸, V. Rozic, A.-R. Sadeghi, I.<br />

Verbauwhede, and C. Wachsmann, “PUFs: Myth, Fact or Busted? A<br />

Security Evaluation of Physically Unclonable Functions (PUFs) Cast in<br />

Silicon,” in Cryptographic Hardware and Embedded Systems (CHES)<br />

2012, ser. Lecture Notes in Computer Science, Springer Berlin<br />

Heidelberg, 2012, vol. 7428, pp. 283–301.<br />

[11] J.-P. Linnartz and P. Tuyls, “New shielding functions to enhance privacy<br />

and prevent misuse of biometric templates,” in Audio- and Video- Based<br />

Biometric Person Authentication, ser. Lecture Notes in Computer<br />

Science, Springer Berlin Heidelberg, 2003, vol. 2688, pp. 393–402.<br />

[12] R. Maes, V. van der Leest, “Countering the effects of silicon ageing on<br />

SRAM PUFs”, HOST 2014.<br />

[13] D. Merli, F. Stumpf, G. Sigl, “Protecting PUF Error Correction by<br />

Codeword Masking”, Cryptology ePrint Archive,<br />

https://eprint.iacr.org/2013/334.pdf .<br />

[14] D. Nedospasov, J.-P. Seifert, C. Helfmeier, and C. Boit, “Invasive PUF<br />

analysis,” in Fault Diagnosis and Tolerance in Cryptography (FDTC),<br />

2013 Workshop on, 2013, pp. 30–38.<br />

[15] U. Rührmair, J. Sölter, F. Sehnke, X. Xu, A. Mahmoud, V. Stoyanova,<br />

G. Dror, J. Schmidhuber, Wayne Burleson, S. Devadas “PUF Modeling<br />

Attacks on Simulated and Silicon Data”, IACR Eprint archive 2013,<br />

http://sharps.org/wp-content/uploads/RUHRMAIR-IACR.pdf .<br />

[16] U. Rührmair, J. Sölter, “PUF Modeling Attacks: An Introduction and<br />

Overview”, DATE 2014,<br />

https://pdfs.semanticscholar.org/a023/dd6069b664b0e53dfa5366d3c881<br />

a6876583.pdf .<br />

[17] G.-J. Schrijen and V. van der Leest, “Comparative analysis of SRAM<br />

memories used as PUF primitives,” in Design, Automation Test in<br />

Europe Conference Exhibition (DATE) 2012, March 2012, pp. 1319 –<br />

1324.<br />

[18] Synopsys whitepaper, “Securing the Internet of Things – An Architect’s<br />

Guide to Securing IoT Devices Using Hardware Rooted Processor<br />

Security”, https://hosteddocs.emediausa.com/arc_security_iot_wp.pdf .<br />

[19] PRPL Foundation, security working group:<br />

https://prpl.works/category/prpl-security/ , PRPL PUF-API:<br />

https://github.com/prplfoundation/prpl-puf-api/tree/December-2017 ,<br />

Security Framework application note: https://prpl.works/applicationnote-july-2016/<br />

.<br />

[20] Wikipedia, “Transport Layer Security”,<br />

https://en.wikipedia.org/wiki/Transport_Layer_Security#Clientauthenticated_TLS_handshake<br />

[21] Wikipedia, “MQTT”, https://en.wikipedia.org/wiki/MQTT .<br />



How to Incorporate Low-Resource Cryptography<br />

Into a Highly Constrained Real-World Product<br />

Derek Atkins<br />

SecureRF Corporation<br />

Shelton, CT, USA<br />

datkins@SecureRF.com<br />

Drake Smith<br />

SecureRF Corporation<br />

Shelton, CT, USA<br />

dsmith@SecureRF.com<br />

Abstract—The Internet of Things (IoT) has a problem: the<br />

small devices that power the IoT are insecure because these<br />

devices have few, if any, options for providing authentication and<br />

data integrity. These embedded devices lack the computing,<br />

memory, and/or energy resources needed to implement today’s<br />

standard security methods. This leaves most IoT systems<br />

vulnerable to attack.<br />

Before revealing an alternative that enables security on<br />

devices as small as the ubiquitous 8051 8-bit microcontroller, we<br />

will first show you how to identify security threats and how to<br />

determine security requirements. We will provide some<br />

techniques for evaluating your products and deployment<br />

scenarios for susceptibility to spoofing and impersonation,<br />

message tampering, and eavesdropping. We will introduce some<br />

effective countermeasures to protect against these threats<br />

together with their suggested security strengths. As an example<br />

of good security protocol design, we will consider a typical IoT<br />

use case where a base station must communicate with a remote<br />

sensor in a secure manner. We will discuss some potential<br />

exploits and attacks and then outline a security protocol that<br />

mitigates those threats.<br />

Next, we will introduce Group Theoretic Cryptography<br />

(GTC) as an alternative to resource-intensive RSA and ECC. We<br />

will explain why RSA, ECC, and Diffie-Hellman are a poor fit for<br />

highly-constrained devices such as battery-less sensors with<br />

microcontrollers (MCUs) having low clock rates and low bit-width architectures. We will present a GTC-based suite of<br />

quantum-resistant cryptographic methods that have been<br />

designed specifically for constrained environments.<br />

We will conclude with a discussion on how to incorporate<br />

GTC-based security into real-world products. You will learn<br />

about the availability of cryptographic libraries that you can<br />

incorporate into your own code that implement the low-resource<br />

methods discussed in this presentation. Using these libraries, you<br />

will see typical run-times plus ROM and RAM utilization for a<br />

range of microcontrollers and processor cores.<br />

Keywords—Internet of Things; IoT; Public Key Cryptography;<br />

Group Theoretic Cryptography; Ironwood Key Agreement<br />

Protocol; Walnut Digital Signature Algorithm; WalnutDSA<br />

I. INTRODUCTION<br />

As the Internet of Things (IoT) grows larger, the devices<br />

attaching to networks continually get smaller. Finding devices<br />

with low clock speeds, limited RAM and ROM, and<br />

microcontrollers with 16 or even 8 bits is not uncommon.<br />

While this does not reduce or eliminate the requirement for<br />

cryptographic authentication, it does reduce the usability or<br />

practicality of currently established methods. On some of the<br />

smaller devices where you can make it fit, an Elliptic Curve<br />

Cryptography (ECC) authentication still takes 10-60 seconds to<br />

complete.<br />

Unfortunately, security solutions are still required for<br />

authentication and data protection in networked devices.<br />

Without security, communications can be compromised,<br />

risking data or, worse, safety. Networked vehicles have already<br />

been hacked, enabling the attacker to control a vehicle, shut<br />

down the engine, enable the brakes, or even drive it remotely.<br />

All this is possible because there is no security on these<br />

devices.<br />

This paper will describe good security design and practice<br />

by leveraging the Intel DE10-Nano development kit, which<br />

utilizes a next-generation Group Theoretic cryptosystem (GTC)<br />

for quantum-resistant key agreement and digital signature<br />

evaluation. The board is co-developed using Intel FPGA<br />

technology and is delivered with GTC technology<br />

demonstrations for everyday use. Specifically, code and<br />

documentation is provided to enable the DE10-Nano to<br />

authenticate to small sensor nodes, and to run a speed test<br />

showing the performance of the technology.<br />

II. THE DE10-NANO<br />

The DE10-Nano is a development kit built around an Intel<br />

Cyclone V System-on-Chip FPGA, which combines a dual-core ARM Cortex-A9 with space for programmable logic,<br />

which enables design flexibility between hardware/software<br />

interfaces. Users can reconfigure the hardware and link with<br />

software to create high-performance, low-power systems.<br />

Leveraging the Cyclone V, the DE10-Nano enables developers<br />

to rapidly develop embedded applications and test different<br />



configurations of hardware and software in an easy-to-access<br />

platform.<br />

The DE10-Nano is meant to be the medium-area device,<br />

meaning it talks to larger devices but also talks to smaller<br />

devices, like an Arduino or even smaller sensors or nodes. So,<br />

while the DE10-Nano does contain dual Cortex-A9<br />

processors—virtual super-computers in the realm of IoT—it is<br />

expected to communicate with devices with much lower-caliber capabilities, perhaps even as low as an 8-bit 8051<br />

microcontroller. This implies that any security solution must<br />

be capable of running on those tiny devices.<br />

The DE10-Nano comes with one security solution [1] that<br />

leverages SecureRF’s Ironwood Key Agreement Protocol™ (Ironwood KAP™) and Walnut Digital Signature Algorithm™ (WalnutDSA™). These methods not only work effectively on<br />

the DE10-Nano, but can be used on those tiny devices as well.<br />

III. SECURITY RISK AND THREAT ANALYSIS<br />

Before making any security choices it is best to understand<br />

the risks, threats, and possible mitigation techniques available.<br />

Specifically, it is important to look at the potential<br />

vulnerabilities, the attack surfaces, the likelihood of attack, the<br />

cost of an attack, and the cost of protecting against the attack.<br />

A threat analysis is invaluable to determine what needs to<br />

be protected and how. The analysis provides a list of concerns,<br />

ranks them, and considers different ways a system can be<br />

attacked. Next it evaluates the risk of those attacks: how likely they are, how much damage they would cause, and what it would cost to correct.<br />

When analyzing risk, one approach is to look at the asset<br />

value versus the attack cost. For example, a bank protecting<br />

millions of dollars in assets is not going to protect it with a<br />

system that can be broken with only a few dollars of effort<br />

(under-protected risk). On the other side, protecting a $1 asset<br />

using a system that would cost a million dollars to break is,<br />

most likely, completely overkill.<br />

Threats can stem from direct physical access, where an<br />

attacker can touch, push, probe, twist, or otherwise manipulate<br />

the target. In addition to direct attacks, this can include side<br />

channels like differential power analysis (DPA), glitching<br />

attacks, timing attacks, or even listening to sounds emanating<br />

from the device.<br />

Another source of threats is the network. Open services,<br />

built-in accounts with hard-coded passwords, non-patched<br />

systems containing software bugs; the sources of network-based attacks are endless.<br />

Threat analysis is best done by a professional (or at least<br />

someone who considers themselves extremely paranoid).<br />

However, the thought process of analyzing, enumerating, and<br />

ordering the threats is an important step before proceeding to<br />

countermeasures.<br />

IV. EARLY COUNTERMEASURES<br />

Once the threats are enumerated and the risks are<br />

understood, the next step to protect a system is applying<br />

countermeasures. The goal is to mitigate the risks and defend<br />

against the threats. These countermeasures can take many<br />

forms.<br />

The first form of countermeasure is physical protection. For<br />

example, encasing a circuit board in a block of epoxy will<br />

prevent any access to individual items on the board. Only<br />

wires/cables that explicitly protrude from the epoxy are<br />

accessible, limiting what an attacker can do. This would<br />

prevent targeted attacks between chips on the board,<br />

monitoring the transmission bus, watching memory, changing<br />

out the CPU, etc.<br />

However, even encapsulation cannot protect against certain<br />

types of side channel attacks. Most likely, a box still has a<br />

power adapter (unless there’s a battery inside the epoxy, which<br />

would limit the lifetime of the device). An attacker could<br />

attempt DPA using that power input. DPA-specific protections<br />

require specific hardware and software mitigations, but those<br />

are specific to DPA.<br />

Other countermeasures include processes in place to control<br />

people’s actions and behaviors, software mitigations such as<br />

better security on the system, keeping systems patched with<br />

fixes, and cryptographic protections.<br />

Another countermeasure is a self-contained secure boot<br />

solution, with an integrated secure update infrastructure. A<br />

secure boot solution enables a device to cryptographically<br />

verify the authenticity and integrity of firmware before it is<br />

loaded and run on the system. This prevents an attacker from<br />

making changes to the underlying code. At startup the system<br />

will verify the firmware, usually by checking a digital<br />

signature, and only if the signature is valid will it continue. All<br />

that is required is that the public key of the signer be available<br />

and non-writable (meaning an attacker cannot replace it).<br />

A secure update solution enables cryptographic validation<br />

of firmware updates before they are stored in place. It protects<br />

the device from unwanted or invalid updates. Together with the<br />

secure boot solution the device is assured of correct code.<br />
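The boot-time check described above can be sketched as follows. A real implementation verifies an asymmetric signature (e.g. ECDSA or WalnutDSA) against a non-writable public key; HMAC-SHA256 stands in here only so the sketch stays within the Python standard library, and all names and key material are illustrative.

```python
import hashlib
import hmac

# In a real device this key material is fixed at manufacture and
# non-writable; an HMAC key stands in for the signer's public key.
VERIFY_KEY = b"burned-into-rom"

def sign_firmware(image: bytes) -> bytes:
    # Performed by the vendor at build time (stand-in for a real
    # digital signature over the firmware image).
    return hmac.new(VERIFY_KEY, image, hashlib.sha256).digest()

def secure_boot(image: bytes, signature: bytes) -> bool:
    # Verify authenticity and integrity before the image may run.
    expected = hmac.new(VERIFY_KEY, image, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)

firmware = b"application image v1.2"
sig = sign_firmware(firmware)

boots_ok = secure_boot(firmware, sig)               # genuine image runs
boots_tampered = secure_boot(firmware + b"!", sig)  # modified image refused
```

The same check, applied to an incoming update image before it is written to flash, gives the secure-update half of the mechanism.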

V. WHY EXISTING SOLUTIONS FAIL<br />

Security is often considered only as an afterthought. It is<br />

not usually a feature. It is rarely customer-visible (except when<br />

it is not working), it adds cost, and it reduces performance<br />

(compared to a completely insecure system). And until there is<br />

a major break, customers rarely ask for it. What this means is<br />

that in the rapid pace of development, an actual customer-facing feature is more likely to get implemented than a<br />

strategic security solution.<br />

Moreover, because security is only added later, the<br />

manufacturer attempts to bolt it onto the side of the working<br />

product. Of course, this trick never works.<br />

Security requires a holistic approach. A good security<br />

architecture is necessarily going to touch every part of the<br />

system, from the hardware up to the user interface, and<br />

everywhere in between. If security is not considered at the<br />

onset, then adding security can become a daunting task.<br />

Adding it piecemeal often does not work, or if it can work, it<br />

works insufficiently.<br />



Next, cryptography is a requirement for good security. A<br />

security solution without cryptography is just an attack waiting<br />

to happen. Yet not just any cryptography will do; one must apply the correct methods.<br />

Symmetric methods like AES are perfect for data<br />

encryption. AES is efficient and generally available on most<br />

systems. However, keys must be managed for AES to work<br />

properly. To manage those keys properly in a large scale,<br />

distributed system, the best approach is to use an Asymmetric<br />

system.<br />

However, on small, IoT devices the constraints of the<br />

system might restrict the ability to add legacy asymmetric<br />

cryptographic systems. Specifically, many of the systems in<br />

use today either will not fit or, if they can be made to fit, will<br />

not perform adequately in the low-resource environments. For<br />

example, fitting ECC in an 8-bit processor like an 8051 is<br />

nearly impossible, or if it can be fit (possibly with extremely<br />

low security), it will still take minutes to calculate its answer.<br />

Using RSA could take even longer.<br />

Imagine having to wait several minutes for a device to power up because its secure boot solution takes that long to validate the firmware.<br />

VI. GROUP THEORETIC CRYPTOGRAPHY<br />

In 2005, [2] introduced the world to E-Multiplication, a<br />

lightweight, quantum-resistant, one-way function rooted in<br />

infinite group theory, matrices, permutations, and arithmetic<br />

over small finite fields. Implementations of E-Multiplication<br />

are small and extremely efficient, even on 16- or 8-bit<br />

microcontrollers.<br />

E-Multiplication is the basis for several Group Theoretic cryptographic methods, of which the most interesting are Ironwood KAP [3] and WalnutDSA [4].<br />

Ironwood KAP is a Diffie-Hellman-like key agreement<br />

scheme that enables two devices, which may never have met<br />

before, to exchange public keys and, using those and their own<br />

private keys, generate a shared secret. That shared secret can<br />

then be used to authenticate the devices or by a method like<br />

AES to encrypt data between the devices.<br />

Due to E-Multiplication being so efficient, Ironwood can<br />

compute a shared secret on even an 8-bit 8051 in about 200 ms.<br />

This would enable even the smallest of devices to compute a<br />

shared secret and authenticate itself.<br />

Ironwood is also interesting because the two sides need to<br />

perform different amounts of work. In other words, the method<br />

itself has asymmetric implementation requirements. The lighter<br />

side of the method is often 50 times faster, meaning an 8051<br />

could execute the other side of the method in about 4 ms.<br />

WalnutDSA, on the other hand, is a quantum-resistant<br />

digital signature scheme that enables one party to create a<br />

signature on a message that can be verified by a second party,<br />

that ensures that the message has not been modified, and<br />

proves the message came from the first party. Digital<br />

signatures are used for certificates, to prove identity, and in<br />

some challenge-response authentication systems.<br />
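Challenge-response authentication, one of the uses just named, follows a simple pattern: the verifier sends a fresh random challenge, the device signs it, and the verifier checks the result. The sketch below uses an HMAC shared secret as a stand-in for the WalnutDSA sign/verify pair; with a real signature scheme the verifier would hold only the device's public key. All names are illustrative.

```python
import hashlib
import hmac
import secrets

DEVICE_KEY = b"device-secret"  # stand-in for the device signing key

def device_respond(challenge: bytes) -> bytes:
    # The device proves key possession by "signing" the fresh challenge.
    return hmac.new(DEVICE_KEY, challenge, hashlib.sha256).digest()

def verifier_check(challenge: bytes, response: bytes) -> bool:
    # The verifier checks the response against the challenge it issued.
    expected = hmac.new(DEVICE_KEY, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)

challenge = secrets.token_bytes(16)  # fresh nonce defeats replay attacks
ok = verifier_check(challenge, device_respond(challenge))

# A captured response to an old challenge fails against a new one.
replayed = verifier_check(secrets.token_bytes(16),
                          device_respond(challenge))
```

Because every round uses a fresh nonce, recording one exchange gains an attacker nothing, which is exactly the property spoofing and impersonation countermeasures need.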

WalnutDSA signature verification is extremely fast, even<br />

on IoT edge devices. For example, on an 8051 a WalnutDSA<br />

signature can be verified in 35 ms, and on an ARM Cortex-M3,<br />

WalnutDSA verifies the signature in 5.7 ms in software! This<br />

works out to about 40 times faster than an ECDSA signature<br />

validation, using half the code size, half the RAM, and also<br />

providing the future-proof characteristics of quantum<br />

resistance.<br />

Because WalnutDSA is so fast and lightweight it means<br />

that even the lowest-end IoT device can benefit from its use.<br />

For example, there is no way that you could get a near-real-time PKI working on an 8-bit 8051 using legacy methods, but<br />

leveraging WalnutDSA enables that. An 8051 could validate a<br />

certificate quickly, in tens of milliseconds, enabling a whole<br />

class of new applications. Combined with Ironwood, these low-end<br />
devices can perform full end-to-end authentication,<br />

validation, and connection security.<br />

WalnutDSA is currently under review as part of the<br />

National Institute of Standards and Technology (NIST) Post-<br />

Quantum Standardization Process, and has the fastest<br />

verification times of all accepted methods as reported by NIST.<br />

VII. INCORPORATING GTC IN YOUR PRODUCTS<br />

Ironwood KAP and WalnutDSA are available in SDKs and<br />

IP Cores for integration into various levels of devices, from<br />

Linux and Windows systems down to the ARM Cortex-M0, Texas<br />
Instruments MSP430, the 8-bit 8051, and<br />
Atmel AVR platforms, as well as FPGAs and custom ASIC<br />

designs.<br />

Integrating Ironwood KAP is as simple as making one<br />

function call to generate the shared secret from the private data<br />

that gets provisioned onto the device and the public key sent<br />

from the other side. All the hard work is abstracted away,<br />

making it simple to use. Additional APIs are available to<br />

leverage that shared secret into an authentication protocol or<br />

data encryption module.<br />
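As a concrete picture of that one-call integration surface, consider the sketch below. It is an illustration under stated assumptions: the function names are hypothetical, and the toy math is classic Diffie-Hellman over a tiny prime, standing in for Ironwood's proprietary E-Multiplication; it is neither Ironwood nor quantum-resistant.<br />

```c
#include <stdint.h>

/* Hypothetical one-call KAP surface, modeled on the integration flow
 * described in the text: provisioned private data plus the peer's
 * public key in, shared secret out. The math is a toy Diffie-Hellman
 * stand-in, NOT Ironwood's E-Multiplication. */

#define TOY_P 2147483647u   /* 2^31 - 1, a Mersenne prime */
#define TOY_G 7u

static uint32_t modexp(uint32_t base, uint32_t exp)
{
    uint64_t r = 1, b = base % TOY_P;
    while (exp) {
        if (exp & 1u) r = (r * b) % TOY_P;
        b = (b * b) % TOY_P;
        exp >>= 1;
    }
    return (uint32_t)r;
}

/* Public key derived from the device's provisioned private data. */
uint32_t kap_public_key(uint32_t private_key)
{
    return modexp(TOY_G, private_key);
}

/* The single call: returns a shared secret ready to feed an AES key
 * schedule or an authentication exchange. */
uint32_t kap_shared_secret(uint32_t private_key, uint32_t peer_public)
{
    return modexp(peer_public, private_key);
}
```

Both sides call kap_shared_secret with their own private data and the other side's public key and arrive at the same value; a real deployment would then derive the AES key from that secret with a KDF.<br />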

WalnutDSA is just as simple to integrate. A single API<br />

takes a hashed message, signature, and public key, and returns<br />

a response of valid or not-valid. API calls for hashing are<br />

available for convenience, or the developer could use their<br />

own. Some hardware has embedded hash functions that can be<br />

leveraged for improved performance.<br />
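The verify call described above can be pictured with the following sketch. The API shape (hashed message, signature, and public key in; valid or not-valid out) follows the text, but the names and the keyed-checksum "scheme" standing in for WalnutDSA's group-theoretic math are illustrative assumptions only.<br />

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { SIG_INVALID = 0, SIG_VALID = 1 } sig_result;

/* Stand-in "signature": a keyed checksum over the message hash. A real
 * WalnutDSA signature is created with a private key and checked with
 * the public key; this toy uses one key for both, purely to show the
 * single-call verification surface. */
static uint32_t toy_sign(const uint8_t *hash, size_t len, uint32_t key)
{
    uint32_t s = key;
    size_t i;
    for (i = 0; i < len; i++)
        s = (s * 31u) ^ (uint32_t)hash[i];
    return s;
}

/* The single verify call: hashed message, signature, and public key in,
 * valid / not-valid out. */
sig_result toy_verify(const uint8_t *hash, size_t len,
                      uint32_t signature, uint32_t public_key)
{
    return (toy_sign(hash, len, public_key) == signature)
               ? SIG_VALID : SIG_INVALID;
}
```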

Together, WalnutDSA and Ironwood KAP can provide a<br />

full suite of authentication, integrity, and secure<br />

communication technologies, which can easily integrate into<br />

existing protocols or, better yet, become the basis for a secure<br />

platform, including a secure boot and secure update solution.<br />

Moreover, adding these features requires as little as 3000-7000<br />

bytes of code.<br />

VIII. CONCLUSIONS<br />

Group Theoretic Cryptography has provided quantum-resistant<br />
public-key methods that are small, efficient, and<br />

practical even on the smallest of today’s IoT devices.<br />

Leveraging the Ironwood KAP and WalnutDSA, developers<br />

can integrate modern PKI concepts on tiny devices while<br />



adding very little code and reducing the performance impact<br />

compared to legacy cryptographic methods.<br />

After performing a threat and risk analysis to discover the<br />
biggest threats and the best places to add protections, developers<br />
can apply GTC technologies to mitigate many problems with better<br />
efficiency and performance than legacy cryptographic<br />
methods.<br />

With WalnutDSA under consideration by NIST, the future<br />

is quantum-resistant.<br />

REFERENCES<br />

[1] Intel Corporation and SecureRF Corporation, “How to Authenticate<br />

Remote Devices with the DE10-Nano Kit,” August 2017,<br />

https://software.intel.com/en-us/articles/how-to-authenticate-remote-devices-with-the-de10-nano-kit.<br />

[2] I. Anshel, M. Anshel, D. Goldfeld, S. Lemieux, Key Agreement, the<br />

Algebraic Eraser™, and Lightweight Cryptography, Algebraic methods<br />

in cryptography, Contemp. Math., vol. 418, Amer. Math. Soc.,<br />

Providence, RI, 2006, pp. 1–34.<br />

[3] I. Anshel, D. Atkins, D. Goldfeld, P. E. Gunnells, Ironwood Meta Key<br />

Agreement and Authentication Protocol, to appear.<br />

[4] I. Anshel, D. Atkins, D. Goldfeld, P. E. Gunnells, WalnutDSA™: A<br />

Quantum-Resistant Digital Signature Algorithm, to appear.<br />



Practical Use of MISRA C and C++<br />

By Greg Davis<br />

Director of Engineering, Compiler Development<br />

Session 27: 1 Mar 2018, 09:30-11:00<br />

Copyright 2013-2018 by Greg Davis<br />

Introduction<br />

No software engineering process can guarantee secure code, but following the right<br />

coding guidelines can dramatically increase the security and reliability of your code.<br />

Many embedded systems live in a world where a security breach can be catastrophic.<br />

Embedded systems control much of the world’s critical infrastructure, such as dams,<br />

traffic signals, and air traffic control. These systems are increasingly communicating<br />

together using COTS networking and in many cases using the internet itself. Keeping<br />

yourself out of the courtroom, if not common decency, demands that all such systems<br />

be developed to be secure.<br />

There are many factors that determine the security of an embedded system. A well-conceived<br />
design is crucial to the success of a project. Also, a team needs to pay<br />

attention to its development process. There are many different models of how software<br />

development ought to be done, and it is prudent to choose one that makes sense. Finally,<br />

the choice of operating system can mean the difference between a project that works well<br />

in the lab and one that works reliably for years in the real world.<br />

Even the most well thought-out design is vulnerable to flaws when the implementation<br />

falls short of the design. This paper focuses on how one can use a set of coding<br />

guidelines, called MISRA C and MISRA C++, to help root out bugs introduced during<br />

the coding stage.<br />

MISRA C and C++<br />

MISRA stands for Motor Industry Software Reliability Association. It originally<br />

published Guidelines For the Use of the C Language In Critical Systems, known<br />

informally as MISRA C, in 1998. A second edition of MISRA C was introduced in 2004,<br />

and MISRA C++ followed in 2008. The most recent edition of MISRA, a third<br />
edition of MISRA C, also known as MISRA C3, was released in 2012/2013. More<br />

information on MISRA and the standards themselves can be obtained from the MISRA<br />

web site at http://www.misra.org.uk.<br />

The purpose of the MISRA C and MISRA C++ guidelines is not to promote the use of C or<br />

C++ in critical systems. Rather, the guidelines accept that these languages are being used<br />

for an increasing number of projects. The guidelines discuss general problems in<br />



software engineering and note that C and C++ do not have as much error checking as<br />

other languages do. Thus the guidelines hope to make C and C++ safer to use, although<br />

they do not endorse C or C++ over other languages.<br />

MISRA C is a subset of the C language. In particular, it is based on the ISO/IEC<br />

9899:1990 C standard, which is identical to the ANSI X3.159-1989 standard, often called<br />

C ’89. Thus every MISRA C program is a valid C program. The MISRA C subset is<br />

defined by 143 rules and 16 directives that constrain the C language and the software<br />

development process. Correspondingly, MISRA C++ is a subset of the ISO/IEC<br />

14882:2003 C++ standard. MISRA C++ is based on 228 rules, many of which are<br />

refinements of the MISRA C rules to deal with the additional realities of C++.<br />

For notational convenience, we will use the terms “MISRA”, “MISRA C” or “MISRA<br />

C++” loosely in the remainder of the document to refer to either the defining documents<br />

or the language subsets.<br />

What is MISRA?<br />

MISRA is written for safety-critical systems, and it is intended to be used within a<br />

rigorous software development process. The standard briefly discusses issues of software<br />

engineering, such as proper training, coding styles, tool selection, testing methodology,<br />

and verification procedures.<br />

MISRA also talks about the ways to ensure compliance with all of the rules. Some of the<br />

rules can be verified by a static checking tool or a compiler. Many of the rules are<br />

straightforward, but others may not be or may require whole-program analysis to verify.<br />

Management needs to determine whether any of the available tools can automatically<br />

verify that a given rule is being followed. If not, this rule must be checked by some kind<br />

of manual code review process. Where it is necessary to deviate from the rules, project<br />

management must give some form of consent by following a documented deviation<br />

procedure. Other non-mandatory “advisory” rules do not need to be followed so strictly,<br />

but cannot just be ignored altogether.<br />

The MISRA rules are not meant to define a precise language. In fact, most of the rules<br />

are stated informally. Furthermore, it is not always clear if a static checking tool should<br />

warn too much or too little when enforcing some of the rules. The project management<br />

must decide how far to go in cases like this. Perhaps a less strict form of checking that<br />

warns too little will be used throughout most of the development, until later when a<br />

stricter checking tool will be applied. At that point, somebody could manually determine<br />

which instances of the diagnostic are potential problems.<br />

Most of the rules have some amount of supporting text that justifies the rules or perhaps<br />

gives an example of how the rule could be violated. Many of the rules reference a<br />

source, such as parts of the C or C++ standards that state that such behavior is undefined<br />

or unspecified.<br />



Before exploring how one could use MISRA, let’s familiarize ourselves with the<br />

concepts and some examples of the rules of MISRA.<br />

Taxonomy of the Rules<br />

The MISRA rules are classified according to the C or C++ constructs that they restrict.<br />

For example, some of the categories are Environment, Control Flow, Expressions,<br />

Declarations, etc. However, I find that most of the rules also fall into a couple of groups<br />

according to the errors that they prevent.<br />

The first group of rules consists of those that intend to make the language more portable.<br />

For example, the language does not specify the exact size of the built-in data types or how<br />

conversions between pointer and integer are handled. So, an example of a rule is one that<br />

says:<br />

C Directive 4.6/C++ Rule 3-9-2 (advisory):<br />

Typedefs that indicate size and signedness should be used in place of the<br />

basic numerical types.<br />

This rule effectively tries to avoid portability problems caused by the implementation-defined<br />
sizes of the basic types. We will return to this rule in the next section.<br />

Another source of portability problems is undefined behavior. A program with an<br />

undefined behavior might behave logically, or it could abort unexpectedly. For example,<br />

using one compiler, a divide by 0 might always return 0. However, another compiler<br />

may generate code that will cause hardware to throw an exception in this case. Many of<br />

the MISRA C rules are there to forbid behaviors that produce undefined results because a<br />

program that depends on undefined behaviors behaving predictably may not run at all if<br />

recompiled with another compiler.<br />
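One way to see the point: rather than relying on whichever behavior a particular compiler gives a division by zero, portable code makes that case explicit. A small sketch (the helper name is ours, not MISRA's):<br />

```c
#include <stdint.h>

/* Division with the zero-divisor case handled explicitly, so the result
 * does not depend on undefined behavior that varies between compilers
 * and targets. */
uint32_t safe_div(uint32_t num, uint32_t den, uint32_t fallback)
{
    return (den != 0u) ? (num / den) : fallback;
}
```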

Unlike this first group of rules that guard against portability problems, the second group<br />

of rules intends to avoid errors due to programmer confusion. While such rules don’t<br />

make the code any more portable, they can make the code a lot easier to understand and<br />

much less error prone. Here’s an example:<br />

C Rule 7.1/C++ Rule 2-13-2 (required):<br />

Octal constants (other than zero) and octal escape sequences (other than<br />

“\0”) shall not be used.<br />

By definition, every compiler should do octal constants the same way, but as I will<br />

explain later, octal constants almost always cause confusion and are rarely useful.<br />



A few other rules are geared toward making code safe for the embedded world. These<br />

rules are more controversial, but adherence to them can avoid problems that many<br />

programmers would rather sweep under the carpet.<br />

Examples of the Rules<br />

We will start by reviewing the rules mentioned above.<br />

Octal constants (other than zero) and octal escape sequences (other than “\0”)<br />

shall not be used. (C Rule 7.1/C++ Rule 2-13-2/Required)<br />

To see why this rule is helpful, consider:<br />

line_a |= 256;<br />

line_b |= 128;<br />

line_c |= 064;<br />

The first statement sets bit 8 of the variable line_a. The second statement sets bit 7<br />

of line_b. You might think that the third statement sets bit 6 of line_c. It<br />

doesn’t. It sets bits 2, 4, and 5. The reason is that in C any numeric constant that<br />

begins with 0 is interpreted as an octal constant. Octal 64 is the same as decimal<br />

52, or 0x34.<br />

Unlike hexadecimal constants that begin with 0x, octal constants look like<br />

decimal numbers. Also, since octal only has 8 digits, it never has extra digits that<br />

would give it away as non-decimal, the way that hexadecimal has a, b, c, d, e, and<br />

f.<br />

Once upon a time, octal constants were useful for machines with odd-word sizes.<br />

These days, they create more problems than they’re worth. MISRA C prevents<br />

programmer error by forcing people to write constants in either decimal or<br />

hexadecimal.<br />
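The arithmetic in the example above can be checked directly (a small sketch; the wrapper function is ours):<br />

```c
/* 064 reads like decimal sixty-four but is an octal constant: it equals
 * decimal 52 (0x34), i.e. bits 2, 4, and 5 -- and bit 6 is NOT set. */
int octal_example(void)
{
    int line_c = 0;
    line_c |= 064;
    return line_c;
}
```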

<br />

Typedefs that indicate size and signedness should be used in place of the basic<br />

types. (C Directive 4.6/C++ Rule 3-9-2/Advisory)<br />

This is a portability requirement. Code that works correctly with one compiler or<br />

target might do something completely different on another. For example:<br />

int j;<br />

for (j = 0; j < 64; j++) {<br />

if (arr[j] > j*1024) {<br />

arr[j] = 0;<br />

}<br />

}<br />



On a target where an int is a 16-bit quantity, j*1024 will overflow and become a<br />

negative number when j >= 32. MISRA C suggests defining a type in a header<br />
file that is always 32 bits. For example, one could define a header file called<br />
misra.h that does this. It could define a 32-bit type as follows:<br />
<br />
#include <limits.h><br />

#if (INT_MAX == 0x7fffffff)<br />

typedef int SI_32;<br />

typedef unsigned int UI_32;<br />

#elif (LONG_MAX == 0x7fffffff)<br />

typedef long SI_32;<br />

typedef unsigned long UI_32;<br />

#else<br />

#error No 32-bit type<br />

#endif<br />

Then the original code could be written as:<br />

SI_32 j;<br />

for (j = 0; j < 64; j++) {<br />

if (arr[j] > j*1024) {<br />

arr[j] = 0;<br />

}<br />

}<br />

Strict adherence to this rule will not eliminate all portability problems based on<br />

the sizes of various types 1 , but it will eliminate most of them. Other MISRA rules<br />

(notably 10.1 and 10.3) are meant to fill in these gaps.<br />

The potential drawback to such a rule is that programmers understand the concept<br />

of an “int”, but badly-named types may disguise what the type represents.<br />

Consider a “generic_pointer” type. Is this a void * or some integral type that is<br />

large enough to hold the value of a pointer without losing data? Problems like<br />

this can be avoided by sticking to a common naming convention. Although there<br />

1<br />

The “integral promotion” rule states that before chars and shorts are operated on, they are cast up to an<br />

integer if an integer can represent all the values of the original type. Otherwise, they are cast up to an<br />

unsigned integer. The following code will behave differently on a target with a 16-bit integer (where it will<br />

return 0) than it will on a target with a 32-bit integer (where it will return 65536).<br />

UI_32 a()<br />

{<br />

UI_16 x = 65535;<br />

UI_16 y = 1;<br />

return x+y;<br />

}<br />



will be a slight learning curve for these names, it will pay off over time.<br />

Another problem is that using a type like UI_16 may be less efficient than using<br />

an “int” on a 32-bit machine. While it would be unsafe to use an int in place of a<br />

UI_16 if the code depends on the value of the variable being truncated after each<br />

assignment, in many cases the code does not depend on this. In some cases, an<br />

optimizing compiler can remove the extra truncations; in the rest, the extra cycles<br />

can be considered the price of safety.<br />

This next rule is specific to MISRA C.<br />

<br />

Function types shall be in prototype form with named parameters, and the<br />
prototype shall be visible at both the function<br />
definition and call. (C Rule 8.2/Required)<br />

Consider the following code:<br />

File1.c:<br />
<br />
static F_64 maxtemp;<br />
<br />
F_64 GetMaxTemp(void)<br />
{<br />
    return maxtemp;<br />
}<br />
<br />
void SetMaxTemp(F_64 x)<br />
{<br />
    maxtemp = x;<br />
}<br />
<br />
File2.c:<br />
<br />
void IncrementMaxTemp(void)<br />
{<br />
    SetMaxTemp(GetMaxTemp() + 1);<br />
}<br />

This code may look OK, but it will not work as expected with most compilers. C<br />

has some rather dangerous rules that assume the type of a function when the<br />
function has not been declared. In File2.c, GetMaxTemp is called but never<br />
declared. A conforming ANSI/ISO C compiler will assume that GetMaxTemp()<br />

returns an int. In reality, GetMaxTemp will return a double. Depending on the<br />

architecture and compiler different things will happen, but this code will rarely<br />

work the right way.<br />

MISRA C avoids this problem by forcing the user to declare functions before they<br />

are used. This rule is absent from MISRA C++ since the C++ language has long<br />

required this.<br />

Of course, the requirement that a global<br />

function be declared before it is used helps ensure that the declaration of a<br />

function matches the definition.<br />
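A sketch of the fix the rule demands for the GetMaxTemp example: put the prototypes in one shared header so the compiler checks every call against the real signature. The header is shown inline here for brevity, and F_64 is assumed to be a typedef for double, as the paper's examples imply.<br />

```c
/* temps.h -- shared prototypes, included by both File1.c and File2.c */
typedef double F_64;

F_64 GetMaxTemp(void);
void SetMaxTemp(F_64 x);
void IncrementMaxTemp(void);

/* File1.c */
static F_64 maxtemp;

F_64 GetMaxTemp(void) { return maxtemp; }
void SetMaxTemp(F_64 x) { maxtemp = x; }

/* File2.c */
void IncrementMaxTemp(void)
{
    /* With the prototype visible, the compiler knows GetMaxTemp
     * returns a double instead of assuming it returns an int. */
    SetMaxTemp(GetMaxTemp() + 1);
}
```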

In fact, another rule states:<br />

742


An external object or function shall be declared once in one and only one file. (C<br />

Rule 8.5/C++ Rule 3-2-2/Required)<br />

This rule works along with the previous rule to ensure that objects and functions<br />
will be compiled consistently.<br />

<br />

The value of an object with automatic storage duration shall not be read before it<br />

has been set. (C Rule 9.1/Required)<br />

In C and C++, automatic variables have an undefined value before they are<br />

written to. Unlike in Java, they are not implicitly given a value like 0. This<br />

sounds like good programming practice, so few people would disagree with this<br />

rule in most cases. But, how about the following case:<br />

extern void error(void);<br />

UI_32 foo(UI_8 arr[4])<br />

{<br />

UI_32 acc, j;<br />

UI_32 err = 0;<br />

for (j = 0; j < 4; j++)<br />

acc = (acc << 8) | arr[j];<br />
return acc;<br />
}<br />


they are used.” The description of the rule goes on to discuss dubious embedded<br />

environments that do not initialize static variables to zero before further requiring:<br />

“Each class constructor shall initialize all non-static members of its class.”<br />

<br />

The right hand operand of a logical && or || operator shall not contain side<br />

effects. (C Rule 13.5/C++ Rule 5-14-1/Required)<br />

A side-effect is defined as an expression that accesses a volatile object, modifies<br />

any object, writes to a file, or calls off to a function that does any of these things,<br />

possibly through its own function calls.<br />

The nomenclature “side-effect” may sound ominous and undesirable, but after<br />

some reflection, it becomes clear that a program cannot do much of anything<br />

useful without side-effects.<br />

An example of where this rule is helpful is as follows:<br />

file_handle *ptr;<br />

success = packet_waiting(ptr) &&<br />

process_packet(ptr);<br />

This may work fine in a lot of cases. But, even if it is safe, it can easily become a<br />

hazard later. For example, a programmer might think that process_packet() is<br />

always called. Therefore, he reasons, it should be safe to close a file or free some<br />

memory in process_packet().<br />

A safer way to write this would be:<br />

file_handle *ptr;<br />
success_1 = packet_waiting(ptr);<br />
success_2 = process_packet(ptr);<br />
success = success_1 && success_2;<br />
<br />
or:<br />
<br />
file_handle *ptr;<br />
success = 0;<br />
if (packet_waiting(ptr)) {<br />
    if (process_packet(ptr)) {<br />
        success = 1;<br />
    }<br />
}<br />

depending on the true intent of the code.<br />

This rule is not a portability or safety issue, per se, because the behavior of the ||<br />



and && operators is well defined. But the rule is intended to eliminate a<br />

common source of programming errors.<br />

The final two rules that I will survey are perhaps the most controversial.<br />

<br />

<br />

The memory allocation and deallocation functions of stdlib.h shall not be used.<br />

(C Rule 21.3/C++ Rule 18-4-1/Required)<br />

Functions shall not call themselves, either directly or indirectly. (C Rule<br />

17.2/C++ Rule 7-5-4/Required under MISRA C, Advisory under MISRA C++)<br />

One problem with dynamic memory is that it needs to be used carefully in order<br />

to avoid memory leaks that could cause a system to run out of memory. Also,<br />

since implementations of malloc() may vary, heap fragmentation may not be the<br />

same between different toolchains.<br />

Likewise, recursion needs to be used carefully or otherwise a system could easily<br />

exceed the amount of available stack space.<br />
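A common embedded alternative that satisfies the allocation rule (a sketch, not taken from the paper): a fixed-size static pool whose worst-case footprint is known at link time and which cannot fragment.<br />

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed-block allocator backed by static storage: no heap, no
 * fragmentation, and the total memory cost is visible in the map file. */

#define POOL_BLOCKS  8
#define BLOCK_BYTES 64

static uint8_t pool[POOL_BLOCKS][BLOCK_BYTES];
static uint8_t in_use[POOL_BLOCKS];

void *pool_alloc(void)
{
    size_t i;
    for (i = 0; i < POOL_BLOCKS; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            return pool[i];
        }
    }
    return NULL;   /* caller must handle exhaustion explicitly */
}

void pool_free(void *p)
{
    size_t i;
    for (i = 0; i < POOL_BLOCKS; i++) {
        if (p == pool[i]) {
            in_use[i] = 0;
            return;
        }
    }
}
```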

Applying MISRA<br />

MISRA C and MISRA C++, in their entirety, are obviously not for everyone. MISRA<br />

was designed for the automotive market where reliability is of the utmost importance, but<br />

manufacturers in other markets, such as game machines, may be able to tolerate less<br />

reliability in order to cram more features into the product. But, in terms of security, a<br />

simple and well-conceived design usually wins. It’s hard to imagine an extremely secure<br />

design that doesn’t lend itself to quite a number of the MISRA rules.<br />

As discussed earlier, some of the rules in the standard are advisory. One need not always<br />

follow them, although they are not supposed to just be ignored. Even the mandatory<br />

rules do not need to be observed everywhere. But, a manufacturer wishing to claim that<br />

his product is MISRA compliant must have a list of where it was necessary to deviate<br />

from the rules, along with other documentation mentioned in the standard.<br />

A looser approach might suffice in many cases where total compliance is not necessary.<br />

For example, let’s consider dynamic memory allocation. Some projects might only use<br />

dynamic memory in rare circumstances. It might be wise for an embedded development<br />

team to look through their uses of dynamic memory to verify that each use<br />
is truly safe.<br />

Consider the following example:<br />

#include <stdlib.h><br />
<br />
typedef unsigned int UI_32;<br />
<br />
extern UI_32 receive_sample(void);<br />
extern UI_32 checksum_data(UI_32 length, UI_32 *data);<br />
void send_reply(UI_32 reply);<br />
extern void panic(void);<br />
<br />
/* This thread loops endlessly, receiving a<br />
 * packet of variable length and replying<br />
 * with the checksum of the packet.<br />
 */<br />
void checksum_thread(void)<br />
{<br />
    while (1) {<br />
        /* Get the length of the next packet */<br />
        UI_32 length = receive_sample();<br />
        if (length != 0) {<br />
            UI_32 count, reply;<br />
            /* Allocate memory for the next packet */<br />
#ifdef __cplusplus<br />
            UI_32 *data = new UI_32[length];<br />
#else<br />
            UI_32 *data = (UI_32 *)malloc(sizeof(UI_32) * length);<br />
#endif<br />
            for (count = 0; count < length; count++) {<br />
                data[count] = receive_sample();<br />
            }<br />
            reply = checksum_data(length, data);<br />
            send_reply(reply);<br />
        }<br />
    }<br />
}<br />


There are a couple of programming errors in the example:<br />

1. The C code does not check that malloc returns a non-NULL pointer. In C++, a<br />

call to new that cannot be fulfilled will result in a throw, but the surrounding code<br />

would need to be analyzed to see whether it could correctly handle the exception.<br />

A secure embedded system will probably need to restart the thread in a way that is<br />

consistent with its design.<br />

2. The memory allocated is never freed.<br />
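One way the allocation path could address both errors is sketched below. The receive/checksum/reply/panic stubs are stand-ins added so the fragment is self-contained; they are not part of the original example.<br />

```c
#include <stdlib.h>

typedef unsigned int UI_32;

/* Stand-in stubs so the sketch compiles on its own; the real functions
 * come from the surrounding example. */
static UI_32 next_sample = 1;
static UI_32 last_reply;
static int panicked;

static UI_32 receive_sample(void) { return next_sample++; }
static void send_reply(UI_32 reply) { last_reply = reply; }
static void panic(void) { panicked = 1; }

static UI_32 checksum_data(UI_32 length, const UI_32 *data)
{
    UI_32 i, sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    return sum;
}

void process_one_packet(UI_32 length)
{
    UI_32 count, reply;
    UI_32 *data = (UI_32 *)malloc(sizeof(UI_32) * length);

    if (data == NULL) {       /* error 1: check the allocation */
        panic();
        return;
    }
    for (count = 0; count < length; count++)
        data[count] = receive_sample();
    reply = checksum_data(length, data);
    free(data);               /* error 2: release the packet buffer */
    send_reply(reply);
}
```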

This kind of analysis might lead to other insights. For example, there is often an upper<br />

bound on the size of most inputs. If that is true in this case, then the programmer could<br />
just as well have used a static or automatic array of fixed size<br />
instead of malloc. Even if these sorts of transformations are not possible, it can still<br />
be instructive to look at the places where memory allocation is used. This requirement<br />
will tend to discourage unnecessary uses of malloc.<br />

Of course, a development team could use most of MISRA, but totally disregard rules that<br />

do not seem practical for their application given the amount of development time that<br />

they have. For example, a team could follow all of the required MISRA rules, except for<br />

the rule that prohibits dynamic memory allocation. They could also decide to follow<br />

many of the useful advisory guidelines, such as Directive 4.6 (which says to use length-specific<br />

types instead of the built-in types). Later on, perhaps after completing the next<br />

milestone, the team could reconsider any rules that they chose to disregard in the last<br />

pass.<br />

It might also be necessary to add additional rules beyond what MISRA calls for. For<br />

example, MISRA C++ allows exception handling, but a given system may not be able<br />
to accommodate the ROM cost of the exception-handling tables, especially if exceptions are rarely used. If the compiler<br />

offers an option that excludes exception handling in order to generate better code, this<br />

might be the right thing to do. Others claim that exception handling makes a program<br />

difficult to analyze.<br />

One thing that makes MISRA particularly attractive is that a number of embedded tools<br />

vendors are already checking these rules in their compilers and code checkers. This off-the-shelf<br />
support makes MISRA easier to adopt than other alternatives that specify rules but have<br />

little infrastructure to back them up.<br />

Conclusion<br />

MISRA is a valuable tool for programming teams trying to write highly secure and<br />

reliable code. The rules are well thought out and provide many insights into likely errors<br />

and constructs that may cause security problems. Almost anyone who writes C or C++<br />

code will find MISRA’s coding guidelines useful. Consistent use of MISRA will increase<br />

the security of your software.<br />



Write Safe AND Secure Application Code with<br />

MISRA C:2012<br />

Mark W. Richardson<br />

Lead Field Application Engineer<br />

LDRA<br />

Wirral, UK<br />

mark.richardson@ldra.com<br />

I. INTRODUCTION<br />

When examined with a critical eye, the commonly held belief<br />

that security-critical and safety-critical code are hugely different from<br />
each other is a conundrum. Why would that be?<br />

Within the safety domain, the aim for software developers is<br />

to produce code that performs as required, whilst ensuring that<br />

erroneous behaviour does not result in an accident.<br />

Within the security domain, the aim is to produce software<br />

that performs as required whilst ensuring that manipulation of<br />

input data does not result in denial of service or the leaking of<br />

sensitive data.<br />

Best practice for the development of either safety or security<br />

critical code is to apply a formalised software development<br />

process, starting with a set of requirements and tracing those<br />

requirements through to executable code. Undefined,<br />

unspecified and implementation-defined behaviours within the<br />

C language can lead to safety or security failures. And data<br />

handling errors such as invalid values, domain violations,<br />

tainted data, and leaking of confidential information can<br />

prevent both safety and security objectives from being<br />

realised.<br />

With so much commonality between perceived optimal<br />

practices for safety and security critical code, it is a puzzle as<br />

to why there is a common misconception that MISRA i is just<br />

for safety-related not for security-related projects. In response<br />

to that misconception, in April 2016, MISRA released<br />

“MISRA C:2012 – Addendum 2 ii ” which highlights which of<br />

the 46 C Secure iii rules are covered by the MISRA C:2012 iv<br />

guidelines.<br />

Even though MISRA C:2012 Amendment 1 v was written to<br />

further ensure complete coverage of the C Secure rules in the<br />

MISRA C:2012 standard, to a large extent it does so by<br />

enhancing the language of existing checks. For the most part,<br />

these enhancements explain why those checks are important<br />

from a security perspective with reference to the ISO C Secure<br />

Guidelines, particularly with regards to the use of<br />

"untrustworthy data.“<br />

In other words, the original MISRA C:2012 document has<br />

always targeted concerns such as buffer overruns and memory<br />

errors, and they have always been important for both safety<br />

and security. It has always promoted the detection of<br />

inconsistent data use, pertinent for all critical code. More<br />

generally, it has always aimed to ensure that defects are not<br />

introduced into the code, rather than adopting a set of checks<br />

to try and identify them after the fact.<br />

II. THE IMPORTANCE OF PROCESS STANDARDS AND<br />

GUIDELINES<br />

Safety-critical industries such as aerospace, automotive, rail,<br />

and medical, use process standards that address the rigour in<br />

which activities need to be performed during the development<br />

life cycle stages with respect to the functional safety of the<br />

system being developed. Coding standards and guidelines,<br />

such as MISRA C, are a critical part of this process. MISRA C<br />

defines a subset of the C language suitable for developing any<br />

application with high-integrity or high-reliability<br />

requirements. Although MISRA guidelines were originally<br />

designed to promote the use of the C language in safetycritical<br />

embedded applications within the motor industry, they<br />

have gained widespread acceptance in many other industries<br />

as well.<br />

The illustration in Figure 1 is a typical example of a table<br />

from ISO 26262-6:2011 vi , which mirrors similar tables both<br />

in IEC 61508 vii , and in other derivatives such as IEC 62304 viii<br />

(used in the development of medical devices). It shows the<br />

coding and modelling guidelines to be enforced during<br />

implementation, superimposed with an indication of where<br />

compliance can be confirmed using automated tools.<br />

These guidelines combine to make the resulting code more<br />

reliable, less prone to error, easier to test, and/or easier to<br />

maintain.<br />

www.embedded-world.eu<br />



IV. MISRA C SECURITY AMENDMENTS<br />

After the publication of MISRA C:2012, the WG14 ix<br />

committee responsible for maintaining the C standard<br />

published the ISO/IEC 17961:2013 C Language Security<br />

Guidelines x , designed to limit the use of the C language to a<br />

subset excluding the more vulnerable features of the language.<br />

The intention was for all rules to be enforceable using static<br />

analysis such that their detection could be automated without<br />

generating excessive false positives.<br />

Figure 1 - Mapping the capabilities of the LDRA tool suite to<br />

“Table 6: Methods for the verification of the software<br />

architectural design” specified by ISO 26262-6:2011<br />

III.<br />

THE SAFE AND SECURE SYSTEM<br />

The enterprise computing community has traditionally taken a<br />

“fail-first and patch-later” approach to secure system<br />

development. This development life-cycle consists of a largely<br />

laissez-faire attitude to development, and the subsequent<br />

application of penetration tests, fuzz tests and fault injection to<br />

expose and correct any unwanted behaviour. Such a reactive<br />

approach is not adequate when safety critical applications are<br />

involved, where functional safety standards already demand a<br />

much more proactive development approach (Figure 2) – and<br />

that proactive attitude is equally essential where a connected<br />

system must be dependable, trustworthy and resilient in order<br />

to protect critical data.<br />

Developers of functionally safe systems in accordance with<br />

such as DO-178, ISO 26262 and IEC 61508 are required to<br />

perform a functional safety risk assessment as part of the<br />

development lifecycle. Not only does it make sense to mirror<br />

that approach to perform a functional security risk assessment,<br />

but it is obligatory if those security risks represent a potential<br />

safety risk too. The identification of security risks involved in<br />

developing and deploying the product should be assessed and<br />

mitigation activities reflected in the security requirements. The<br />

design and coding stages can then also reflect the aspects of<br />

security requirements along with functional and non-functional<br />

requirements.<br />

It was in response to ISO/IEC 17961 that the MISRA<br />

committee developed “MISRA C:2012 – Addendum 2”,<br />

highlighting which of the 46 C Secure rules are covered by the<br />

original MISRA C: 2012 guidelines. MISRA C:2012<br />

Amendment 1 was written to further ensure complete<br />

coverage of the C Secure rules. The amendment is an<br />

extension MISRA C:2012.<br />

It establishes 14 new guidelines for secure C coding to<br />

improve the coverage of the concerns highlighted by the ISO<br />

C Secure Guidelines including, for example, issues pertaining<br />

to the use of “untrustworthy” data—a well-known security<br />

vulnerability. By following the additional guidelines,<br />

developers can more thoroughly analyse their code and can<br />

assure regulatory authorities that they have adopted best<br />

practice. This is becoming critical in many fields of endeavour<br />

including the automotive industry, the Industrial Internet of<br />

Things (IIoT), and the medical device sector – in short,<br />

wherever security threats have led to OEM demands for<br />

developers to prove that their software meets the highest<br />

standards for security as well as safety.<br />

V. INSECURE CODING EXAMPLES AND RELATED RULES

To put the amendment into context, it is useful to review examples of where the additional rules apply.

Example 1: Rule 12.5

Rule 12.5 states, "The sizeof operator shall not have an operand which is a function parameter declared as 'array of type'."

Many developers use the sizeof operator to calculate the size of an array. In a normal scenario that works fine. But when that approach is applied to an array passed as a function parameter, the parameter is passed as a "pointer to type". Consequently, an attempt to calculate the number of elements usually returns an incorrect value, as illustrated in Figure 2 – and in this case, results in an array bound being exceeded.

Figure 2 - The traditional V software development life cycle model incorporates security activities from the early stages



void f1 (void)
{
    char ch;
    ch = (char)getchar();
    if (EOF != (int)ch) /* Non-compliant - getchar returns an int
                           which is cast to a narrower type */
    {
    }
}

Figure 2: Source code example

Automatic Detection of Rule Violation at an Early Stage

A static analysis tool can be used to check for the use of such syntax (Figure 3).

Peer reviews represent a traditional approach to enforcing adherence to such guidelines, and whilst they still have an important part to play, automating the more tedious checks using tools is far more efficient, less prone to error, repeatable, and demonstrable (Figure 4).

Figure 3 - The LDRA TBvision tool detects the MISRA C:2012 rule violation for the "sizeof" operator example

Example 2: Rule 22.7

Rule 22.7 states, "The macro EOF shall only be compared with the unmodified return value from any Standard Library function capable of returning EOF."

An EOF (End Of File) return value from standard library functions is used to indicate that the relevant stream has either reached the end of the file, or that an error has occurred in reading from or writing to that file. The macro EOF is defined as an "int" with a negative value.

If the EOF value is captured in a variable of incorrect type, then it may become indistinguishable from a valid character code. It is therefore important to use an "int" to store the return code from functions such as "getchar()" or "fgetc()", and to avoid the common practice of storing the result in a char.

Figure 4 - LDRA TBvision reports the Rule 22.7 (EOF comparison with char) violation in the source code.

VI. CHOOSING A LANGUAGE SUBSET

Although there are several language subsets (or, less formally, "coding standards") to choose from, these have traditionally been focused primarily on safety rather than security. More recently, with the advent of the Industrial Internet of Things, connected cars, and connected heart pacemakers, that focus has shifted towards security to reflect the fact that systems such as these, once naturally secure through isolation, are now increasingly accessible to aggressors.

There are, however, subtle differences between the various subsets, perhaps a reflection of the development dichotomy between designing for security and appending some measure of security to a developed system. To illustrate this, it is useful to compare and contrast the approaches taken by the authors of MISRA C and CERT C with respect to security.



A. Retrospective adoption

MISRA C:2012 categorically states that "MISRA C should be adopted from the outset of a project. If a project is building on existing code that has a proven track record then the benefits of compliance with MISRA C may be outweighed by the risks of introducing a defect when making the code compliant."

This contrasts in emphasis with the assertion of the CERT C xi authors that although "the priority of this standard is to support new code development…. A close-second priority is supporting remediation of old code".

Of course, as with the system as a whole, the level of risk involved with the compromise of the system will influence the approaches to be adopted. Certainly, the retrospective application of any language subset is better than nothing, but late adoption does not represent best practice.

B. Relevance to safety, high integrity and high reliability systems

MISRA C:2012 "define[s] a subset of the C language in which the opportunity to make mistakes is either removed or reduced. Many standards for the development of safety-related software require, or recommend, the use of a language subset, and this can also be used to develop any application with high integrity or high reliability requirements". The accurate implication of that statement is that MISRA C was always appropriate for security-critical applications, even before the security enhancements introduced by MISRA C:2012 Amendment 1.

CERT C attempts to be more all-encompassing, as reflected in its introductory suggestion that "safety-critical systems typically have stricter requirements than are imposed by this standard … However, the application of this coding standard will result in high-quality systems that are reliable, robust, and resistant to attack".

C. Decidability

The primary purpose of a requirements-driven software development process as exemplified by ISO 26262 is to control the development process as tightly as possible, to minimize the possibility of error or inconsistency of any kind. Although that is theoretically possible by manual means, it will generally be far more effective if software tools are used to automate the process as appropriate.

In the case of static analysis tools, that requires that the rules can be checked algorithmically. Compare, for example, the excerpts shown in Figure 5, both of which address the same issue. The approach taken by MISRA is to prevent the issue by disallowing the inclusion of the pertinent construct. CERT C instead asserts that the developer should "be aware" of it.

Of course, there are advantages in each case. The CERT C approach is clearly more flexible; something of particular value if rules are applied retrospectively. MISRA C:2012 is more draconian, and yet by avoiding the side effects altogether the resulting code is certain to be more portable, and it can be automatically checked by a static analysis tool. It is simply not possible for a tool to check whether a developer is "aware" of side effects – and less possible still to ascertain whether "awareness" equates to "understanding".

The net effect is that a static analysis tool can make the same checks, but the detection of an issue has different implications. For MISRA – "You have a violation that either needs to be removed or a deviation introduced". For CERT – "Did you mean to do this?" The former is clearly easier to police.

Figure 5: Contrasting approaches to the definition of coding rules

D. Precision of rule definitions

The stricter, more precisely defined approach of MISRA not only lends itself to a standard more suitable for automated checking. It also addresses the issue of language misunderstanding more convincingly than CERT C.

Evidence suggests that there are particular characteristics of the C language which are responsible for most of the defects found in C source code xii, such that around 80% of software defects are caused by the incorrect usage of about 20% of the available C or C++ language constructs. By restricting use of the language to avoid the parts that are known to be problematic, it becomes possible to avoid writing the associated defects into the code and, as a result, software quality greatly increases.

This approach also addresses a more subtle issue surrounding the personalities and capabilities of individual developers. Simple statistics tell us that of all the C developers in the world, 50% of them have below-average capabilities – and yet it is very rare indeed to find a development team manager who would acknowledge that they recruit any such individuals. In any software development team there will be some who are more able than others, and it is human nature for people not to highlight the fact if there are things they don't understand. Furthermore, it is common for less experienced programmers to be writing code, especially in large teams; typically the most experienced members will be involved in management and requirements definition, with the new intake being used to code from the decomposed requirements.

Figure 6 uses the handling of variadic functions to illustrate how this approach differs from that of CERT C. CERT C calls for developers to "understand" the associated type issues, but doesn't suggest how a situation might be handled where a developer is, despite the best of intentions, harbouring a misunderstanding.

A counter argument might be that there will be developers who are very aware of the type issues associated with variadic functions, who make very good use of them, and who may feel affronted by the tighter restrictions on their use. However, for highly safety- or security-critical systems, MISRA would assert that because the "opportunity to make mistakes is either removed or reduced", that is a price well worth paying.

Figure 6: Comparing differing precision of rule definition

VII. CONCLUSIONS

Best practice for the development of either safety- or security-critical code is to apply a formalised software development process, starting with a set of requirements and tracing those requirements through to executable code. Even so, undefined, unspecified and implementation-defined behaviours within the C language can lead to safety or security failures in the resulting code base. And data handling errors such as invalid values, domain violations, tainted data, and leaking of confidential information can prevent both safety and security objectives from being realised.

MISRA C:2012 is not the only coding standard option for those with a need to develop secure code. For example, the correct application of either CERT C or MISRA C:2012 will certainly result in more secure code than if neither were to be applied. However, for safety- or security-critical applications, MISRA C is considerably less error prone, both because it is specifically designed for such systems and as a result of its stricter, more decidable rules. Conversely, there is an argument for using the CERT C standard if the application is not critical but is to be connected to the internet for the first time. The retrospective application of CERT C might then be a pragmatic choice to make, though it would likely be accompanied by a list of issues where confirmation of intent is required.

COMPANY DETAILS

LDRA
Portside
Monks Ferry
Wirral
CH41 5LH
United Kingdom
Tel: +44 (0)151 649 9300
Fax: +44 (0)151 649 9666
E-mail: info@ldra.com

CONTACT DETAILS

Presentation Co-ordination
Mark James
Marketing Manager
E-mail: mark.james@ldra.com

Presenter
Mark Richardson
Lead Field Applications Engineer
E-mail: mark.richardson@ldra.com


i. MISRA – The Motor Industry Software Reliability Association. https://www.misra.org.uk/Publications/tabid/57/Default.aspx

ii. MISRA C:2012 – Addendum 2: Coverage of MISRA C:2012 against ISO/IEC TS 17961:2013 "C Secure". ISBN 978-906400-15-6 (PDF), April 2016.

iii. ISO/IEC TS 17961:2013 Information technology – Programming languages, their environments and system software interfaces – C secure coding rules.

iv. MISRA C:2012 – Guidelines for the Use of the C Language in Critical Systems. ISBN 978-1-906400-10-1 (paperback), ISBN 978-1-906400-11-8 (PDF), March 2013.

v. MISRA C:2012 – Amendment 1: Additional security guidelines for MISRA C:2012. ISBN 978-906400-16-3 (PDF), April 2016.

vi. ISO 26262-6:2011 Road vehicles – Functional safety – Part 6: Product development at the software level.

vii. IEC 61508-1:2010 Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 1: General requirements.

viii. IEC 62304 Medical device software – Software life cycle processes. Consolidated Version, Edition 1.1, 2015-06.

ix. International standardization working group for the programming language C, JTC1/SC22/WG14. http://www.open-std.org/jtc1/sc22/wg14/

x. ISO/IEC TS 17961:2013 Information technology – Programming languages, their environments and system software interfaces – C secure coding rules.

xi. SEI CERT C Coding Standard. https://wiki.sei.cmu.edu/confluence/display/c/SEI+CERT+C+Coding+Standard

xii. Jim Bird, "Applying the 80:20 Rule in Software Development", Nov 15, 2013. https://dzone.com/articles/applying-8020-rule-software



Hypervisors in Embedded Systems
Applications and Architectures

Jack Greenbaum
Green Hills Software, Inc.
Santa Barbara, California, USA
jackg@ghs.com

Cesare Garlati
prpl Foundation
Santa Clara, California, USA
cesare@prplFoundation.org

Abstract — As microprocessor architectures have evolved with direct hardware support for virtualization, hypervisor software has become not just practical in embedded systems, but present in many commercial applications. This paper discusses embedded systems use cases for hypervisors, including their use in workload consolidation and security applications.

Keywords — hypervisor; virtualization; virtual machine; guest OS; embedded systems; security; IoT; Internet of Things.

I. INTRODUCTION

Hypervisors are a type of operating system software that allows multiple traditional operating systems to run on the same microprocessor [1]. They were originally introduced in traditional IT data centers to solve workload balancing and system utilization challenges. Initial hypervisors required changes to the guest OS to compensate for a lack of hardware support for the isolation required between guest operating systems. As microprocessor architectures have evolved with direct hardware support for virtualization, hypervisors have become not just practical in embedded systems, but are present in deployed applications [2]. Hypervisors are here to stay in embedded systems. This paper discusses embedded systems use cases for hypervisors, including their use in workload consolidation and security applications.

Hardware support for virtualization in modern microprocessors has been the necessary enabler for virtualization to move from the data center to embedded systems. All of the major processor architectures have evolved with virtualization extensions; notable examples include Intel VT-x, the ARM Virtualization Extensions, and the MIPS VZ extensions. This support includes a distinct hypervisor execution mode at a higher privilege level than the traditional supervisor mode, and IOMMUs to isolate the peripheral devices used by different guest operating systems from each other. Without an IOMMU the unique IO requirements of embedded systems cannot be properly separated. The Intel version is called VT-d, and most ARM processors have a "System MMU". In the data center the IOMMU is often associated with Single Root I/O Virtualization, or SR-IOV.

The rest of this paper focuses on the use cases for hypervisors in embedded systems, and introduces the capabilities that hypervisors provide to implement these use cases.

II. USE CASES

A. Consolidation

The most common use of hypervisors is to consolidate multiple workloads onto a single platform in order to reduce size, weight, power, or cost. This is the same use case that has driven broad adoption of hypervisors in the IT server space. As servers have grown to have more capacity than any single application requires, virtualization lets one server combine multiple applications. But integrating multiple applications from different customers onto one operating system puts too many constraints on what functions can be combined. Virtualization instead runs multiple operating systems on the same hardware, allowing complete applications to run on the same hardware with very little interaction.

Consolidation use cases are becoming common in automotive systems. One example is combining the instrument cluster and in-vehicle infotainment (IVI) systems into a single electronic control unit (ECU). The instrument cluster is typically built on a real-time operating system (RTOS), while IVI is often built on Linux or another general-purpose OS (GPOS). The real-time and safety requirements of the instrument cluster cannot be met by a GPOS, and the media libraries required for IVI are expensive to port to an RTOS. Therefore, integrating these two functions into one OS is not feasible. But a hypervisor with real-time and safety guarantees can run both the RTOS and the GPOS on the same processor within a single ECU. This saves not only cost (by having only one processor and circuit board), but also space in a vehicle that is increasingly full of ECUs for modern safety features.

B. Legacy Operating Systems

As systems evolve over time, it often becomes necessary to make a shift to a new operating environment to enable new features. Preserving the existing features of the system would then require porting already tested and field-proven software to the new platform. Virtualization, on the other hand, allows running the existing operating environment alongside the new software on the same processor. One example is a software defined radio. Over time the product requirements may evolve to require a transition from a simple LCD user interface to a graphical user interface (GUI). The radio software may have a high cost to recertify. By using virtualization, the radio protocol software can be maintained while a second OS with a modern GUI library runs alongside. By running the radio protocol software unchanged (or with minimal change), a GUI can be added while minimizing or eliminating recertification costs for the radio protocol software. This use case is most common in deeply embedded and very cost sensitive applications.

C. Multiple Independent Levels of Security (MILS)

A third example is a combination of the consolidation and legacy cases. The application in this case is to provide security isolation between two different workloads that have different security postures. Two examples are running Trusted Execution Environments and dual-persona smart phones.

Trusted Execution Environments (TEEs) provide security-critical processing in an environment isolated from the rest of the system. Use cases include secure boot, cryptographic services, and security-critical device feature management, including the IOMMU. This is similar to consolidation in that cryptographic services have traditionally been offloaded to a separate, smaller core. The advantage of running cryptographic services in a TEE on the main processor cores is typically higher performance than the traditional approach.

A dual-persona phone acts like two separate smart phones. In the common application, one partition is an operationally secure partition, while the other partition can be updated or reconfigured by the user. Typically, the secure partition is controlled by a business IT department or a government entity that manages compartmentalized information. Such a partition may have access to restricted networks and therefore contains high-value encryption keys and information. The software load on the secure partition is often locked down and verified at boot time. The second partition is often called a user partition; it has access to the public internet, and can install apps from an app store and access other unsecured content. The underlying assumption is that the hypervisor provides a higher level of isolation than the individual operating systems being virtualized.

III. HYPERVISOR CAPABILITIES

All hypervisors provide isolated sharing of System on a Chip (SoC) resources, but differ in the scope and depth of support for sharing the different hardware elements.

A. Memory Sharing

The most basic hypervisor runs on a multi-core SoC and provides only for sharing of memory. Each CPU core runs a separate software load. The hypervisor configures the SoC's virtualized memory management – see the section below – to restrict each CPU to a portion of the memory address space of the SoC, including both RAM and peripheral registers. This allows multiple operating systems to run on a single SoC with disjoint peripherals and secure shared access to RAM.

B. CPU Sharing

A more capable hypervisor also allows sharing of individual CPU cores via time slicing. This allows different workloads to have access to all CPU resources during times of heavy demand, and to partition that access based on priority. For example, when consolidating RTOS and GPOS workloads, the RTOS is typically given priority on the CPUs, while the GPOS gets a guaranteed minimum amount of execution time. The GPOS has full access to the CPUs when the RTOS is idle. Note that a hypervisor that supports CPU sharing in this way typically must be written with real-time behavior in mind, and is often based on an RTOS.

C. Peripheral Sharing

Another set of hypervisor features revolves around sharing of peripherals, such as mass storage, communication links, and GPUs. Embedded systems are often cost sensitive, so the ability to share devices such as eMMC mass storage is required. There are several different techniques for implementing peripheral sharing, including mediated pass-through, device emulation, and paravirtualization. Each approach has its strengths and weaknesses. A full discussion of these concepts is beyond the scope of this paper, but when considering the use of a hypervisor the sharing of devices is as important to consider as the sharing of the CPU.

IV. CONCLUSION

Hypervisors have moved from the data center to embedded systems, enabled by hardware support in modern microprocessors. We have outlined the common use cases for virtualization, and considerations for device sharing.

REFERENCES

[1] Security Guidance for Critical Areas of Embedded Computing, prpl Foundation, January 2016 – https://prpl.works/security-guidance/
[2] prplSecurity Framework Application Note, prpl Foundation, July 2017 – https://prpl.works/application-note-july-2016/



Digging Into Embedded Virtualized Systems
Overcoming the Barriers to Debugging Hardware and Software

Khaled Jmal
Lauterbach GmbH
Höhenkirchen-Siegertsbrunn, Germany
khaled.jmal@lauterbach.com

Rudolf Dienstbeck
Lauterbach GmbH
Höhenkirchen-Siegertsbrunn, Germany
rudolf.dienstbeck@lauterbach.com

Abstract — In order to save money, the functions of several electronic devices are consolidated on a common hardware unit. A hypervisor separates the functions on the software side. This makes debugging more challenging, but by no means impossible.

Keywords — hypervisor; debugging; awareness

I. INTRODUCTION

Hypervisor – embedded software developers are currently faced with this term all the time. There is almost a hype around this technology (pun intended). For instance, it seems to be a focal point of discussion at the moment in the automotive, aviation and aerospace segments, as well as in the field of medical technology. However, what impact does this have on the development cycle and, in particular, on debugging? Debugging tools, particularly those that access the hardware (e.g. JTAG debuggers), need to take a great deal into consideration when a hypervisor is utilized on the target system. Naturally, the developer wants a tool at their disposal that shows them the complete status of the embedded system, including all components such as the hypervisor, guest operating systems and guest processes.

II. SEVERAL MACHINES ON A SINGLE PIECE OF HARDWARE

A hypervisor allows different virtual machines (VMs), also called guest machines, to run on a single piece of hardware. This permits, for example, several operating systems to run on the same host. The hypervisor is responsible for allowing these operating systems to run on a single computer, either by dividing the CPU across the operating systems in a time-slice technique, or by dynamically assigning the individual cores to different guests in a multi-core environment. Everybody is aware of hypervisors on desktop computers, such as VMware or VirtualBox, which can be used, for example, to run one (or several) complete Linux distribution(s) on Windows. Other examples that are also utilized in embedded systems include Xen, KVM, Jailhouse and QEMU.

A concrete application from the embedded systems segment may be structured as follows: the objective is for a car dashboard to work with an industrial Linux distribution, for the infotainment system to operate using Android, for the air conditioning to utilize FreeRTOS, and for the engine control to work with an AUTOSAR stack. In the past, four (and even more) different hardware platforms were actually required for this purpose. However, all of these functions are now integrated into a single system and, where possible, even on a single CPU.

Why? The first reason can be attributed to costs.<br />

Nowadays, embedded systems are so powerful that a single<br />

system is able to complete all of these tasks. Furthermore, it is<br />

also cheaper to produce and install an integrated hardware<br />

module rather than four different systems. This is the primary<br />

motivation as every penny counts, especially in the automotive<br />

industry. As an "add-on", a hypervisor provides an extra layer<br />

of security and protection. The hypervisor is able to monitor all<br />

guests and act accordingly in the event of issues, e.g. by<br />

restarting a guest. It is also essential to protect the guests from<br />

unwanted interaction. A technical prerequisite for this is to<br />

ensure that all guests are kept separate from each other in terms of hardware via an independent Memory Management Unit (MMU) (Figure 1).<br />
Fig. 1. A hypervisor coordinates the operation of several virtual machines on a real machine.<br />

In terms of hardware, the individual guests can be separated<br />

from each other if the CPU provides a complete hardware<br />

abstraction. In order to do so, three things must be virtualized<br />

in principle: The memory, the peripheral equipment and the<br />

CPU itself. The guest's operating system should not even know<br />

that it is running in a virtualized machine. This requires that the<br />

MMU supports two stages of address translation. The first<br />

stage translates the guest virtual address to a guest physical<br />

address also called intermediate address. The intermediate<br />

address is then translated in a second MMU stage of the<br />

hypervisor to the real physical address. The peripheral<br />

equipment is also virtualized ("virtual I/O") in order to ensure<br />

that each guest is able to interact with the environment. In<br />

doing so, the hypervisor decides which guest may access which<br />

piece of peripheral equipment and which guest responds to interrupts. Finally, each guest receives one or several virtual<br />

CPUs that are mapped on the actual cores via a scheduler. In<br />

doing so, the number of virtual CPUs of a specific guest can be<br />

lower or greater than the number of real cores.<br />
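The two translation stages described above can be illustrated with a toy Python sketch (page size, page numbers and table contents are invented; real MMU tables are multi-level hardware structures):<br />

```python
PAGE = 4096  # toy 4 KiB page size

# Stage 1: guest virtual page -> guest physical ("intermediate") page,
# maintained by the guest operating system.
stage1 = {0x400: 0x100, 0x401: 0x101}

# Stage 2: intermediate page -> real physical page,
# maintained by the hypervisor.
stage2 = {0x100: 0x8000, 0x101: 0x9F00}

def translate(guest_va):
    """Full two-stage walk: guest virtual -> intermediate -> physical."""
    page, offset = divmod(guest_va, PAGE)
    intermediate = stage1[page]      # stage 1 (guest MMU tables)
    physical = stage2[intermediate]  # stage 2 (hypervisor tables)
    return physical * PAGE + offset

# The guest only ever sees stage 1; the hypervisor adds stage 2 underneath.
assert translate(0x400 * PAGE + 0x42) == 0x8000 * PAGE + 0x42
```

The point of the sketch is that the guest OS can manage stage 1 as if it owned the hardware, while the hypervisor transparently relocates the guest in physical memory via stage 2.<br />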

III. HYPERVISOR IMPACT ON DEBUGGERS<br />

There are in principle two debugging methods: software-controlled run mode debugging and hardware-controlled stop mode debugging.<br />

A. Run Mode Debugging<br />

The run mode debugging method involves loading additional debug software (also called a "debug agent") onto the target platform, which performs the actual debugging. Single-step mode, breakpoints, etc. are all managed by this piece of<br />

software. A typical example is the use of a gdbserver to<br />

remotely debug a Linux process. The debugger user interface<br />

on the development computer then communicates with the<br />

debug agent e.g. via a serial interface or Ethernet. On a<br />

breakpoint hit, only the component to be debugged, e.g. the<br />

Linux process, is stopped. The rest of the system will continue<br />

to run. This is the reason why this method is called “run<br />

mode”. Such a debug session only requires an appropriate<br />

communication channel. If an underlying hypervisor is present,<br />

the channel is simply routed through it (Figure 2). Once this<br />

route has been established, neither the debugger nor the agent<br />

is aware that a hypervisor is present in-between them, i.e. the<br />

debugging is "hypervisor agnostic". This method is perfect if<br />

the system needs to continue during the debugging, e.g.<br />

because protocols need to be served. Run mode debugging is<br />

completely sufficient to debug a single component, for instance<br />

a process within a single machine. However, this method reaches its limits as soon as the guest operating system or the hypervisor is involved. In this case a different debugging approach that allows a system-wide view is required.<br />
Fig. 2. Run mode debugging with a gdbserver<br />

B. Stop Mode Debugging<br />

When debugging, developers generally want to see everything: the hypervisor, all guests and all guest processes, all at the same time! This is, in<br />

principle, not possible in run mode for the aforementioned<br />

reasons. But it is possible in stop mode, which is the main<br />

strength of this option. In hardware-controlled stop mode<br />

debugging, the debugger is connected directly to the processor<br />

via a dedicated interface which is typically JTAG. The<br />

debugger uses this interface to control the CPU itself, e.g. stop<br />

it, trigger individual program steps, read the registers or<br />

memory. This also means that the entire system, including all<br />

processes, guests and – of course – the hypervisor, is stopped in<br />

the event of a breakpoint. In such a case, no more interrupts are<br />

operated, no communication protocols run and no VM, process<br />

or task changes take place. The CPU is effectively "frozen",<br />

which is why it is called "stop mode". Since a hardware<br />

debugger accesses the system via the CPU, it can initially only “see” the components that are exposed by the MMU in this<br />

state, i.e. only the guest currently running on the CPU and only<br />

the currently active process. The debugger is however able to<br />

do slightly more than that: thanks to a temporary, minimal manipulation of the MMU registers, it can also directly read the physical address space and the current "intermediate" (= "guest physical") address space. However, all debug symbols<br />

belonging to the processes and guests are stored on virtual<br />

addresses, meaning that this additional view is not particularly<br />

useful to begin with. Therefore, the debugger needs to translate<br />

the virtual address to the corresponding physical address, i.e.<br />

perform the MMU table walk, before accessing the physical<br />

address space. This can be done for the current context by<br />

reading the page table pointers from the MMU registers.<br />

However, for the debugger to be able to see everything beyond<br />

the current status, the information about the MMU tables of the<br />

single tasks, virtual machines and the hypervisor needs to be<br />

extracted from the guest operating systems and from the<br />

hypervisor. The debugger needs also to be “aware” of the<br />

hypervisor as well as of the single guest operating systems.<br />

This requires a "hypervisor awareness", an "OS awareness" for<br />

each guest and an "MMU awareness" for both the hypervisor<br />

as well as for each guest.<br />

IV. DEBUGGER NEEDS TO HAVE "AWARENESS"<br />

A hypervisor awareness is used to determine the list of the<br />

virtual machines, their IDs, virtual CPUs and the MMU<br />

settings. The awareness uses the hypervisor debug symbol<br />

information (ELF/DWARF) in order to read the necessary<br />

information from the system. The hypervisor awareness is also<br />

responsible for managing the layout of the stage 2 MMU<br />

translation so that the debugger has access to all VMs. An "OS<br />

awareness" is additionally required for each guest in order to<br />

analyze the content of a guest operating system. The awareness<br />

is also developed specifically for each OS in use. This<br />

awareness then determines the processes of the operating<br />

system and the MMU settings within the VM as well as the<br />

MMU table layout (stage 1 MMU translation). For this<br />

purpose, the awareness then uses the debug symbol information belonging to the respective operating system. As a result, the debugger is able to illustrate a hierarchical tree of the entire system. Processes, threads and other resources can be illustrated (Figure 3).<br />
Fig. 3. A tree structure illustrates the target system layout<br />
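The hierarchical system view assembled from the hypervisor and per-guest OS awareness can be pictured with a toy Python sketch (the hypervisor, VM and process names below are invented for illustration):<br />

```python
# Hypothetical target layout, as a debugger could assemble it from
# hypervisor awareness (list of VMs) and per-guest OS awareness
# (processes/tasks). All names are invented.
system = {
    "Xen hypervisor": {
        "VM 1 (Linux)": ["init", "dashboard_ui"],
        "VM 2 (FreeRTOS)": ["vControlLoop"],
    }
}

def dump(tree, indent=0):
    """Render the nested layout as an indented tree, one line per node."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * indent + name)
        if isinstance(children, dict):
            lines.extend(dump(children, indent + 1))
        else:
            lines.extend("  " * (indent + 1) + leaf for leaf in children)
    return lines

print("\n".join(dump(system)))
```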

With this awareness of the system layout, the debugger can<br />

read the list of guests and processes as well as their MMU<br />

tables from the system. Equipped with this knowledge, the<br />

debugger can now perform the MMU table walk for each virtual address of a guest or process itself, i.e. bypassing the hardware MMU, and read the respective data directly from physical memory. Using this method, the<br />

debugger accesses all addresses belonging to all guests and all<br />

processes, irrespective of whether they are virtual, intermediate<br />

or physical. And all of this is done at the same time!<br />

Various commands and windows can be applied specifically to a certain machine or certain process. For instance, the process<br />

taking place on a Linux machine and the task being performed<br />

by a FreeRTOS device can be shown at the same time. The<br />

loaded debug symbols can be assigned to a certain machine or<br />

certain process. Using a machine ID and a process ID, each<br />

virtual address is unambiguous.<br />
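The disambiguation by machine ID and process ID can be sketched in a few lines of Python (the IDs, addresses and symbol names are invented for illustration):<br />

```python
# A virtual address alone is ambiguous: the same address can exist in
# every guest and every process. Keyed by machine ID and process ID,
# it becomes unique.
symbols = {
    # (machine_id, process_id, virtual_address) -> symbol
    (1, 17, 0x80000000): "linux_app.main",
    (2,  3, 0x80000000): "freertos.vControlLoop",
}

def resolve(machine_id, process_id, vaddr):
    """Look up the symbol for an address within one machine/process context."""
    return symbols.get((machine_id, process_id, vaddr), "<unknown>")

# One and the same virtual address maps to different symbols per context.
assert resolve(1, 17, 0x80000000) == "linux_app.main"
assert resolve(2, 3, 0x80000000) == "freertos.vControlLoop"
```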

If the software hits a breakpoint, the entire system will<br />

be stopped as described above. The debugger then<br />

automatically switches to the (real) core that stopped at the<br />

breakpoint and displays the current machine and process on<br />

this core. This allows the user to immediately see the<br />

conditions that led to this break. Naturally, it is possible to<br />

manually switch to other cores and their "current machines".<br />

Moreover, the user is not only able to switch the view to other<br />

hardware cores; he can also switch to other, currently inactive<br />

guest systems. As a result, a symbolic access to all of the<br />

functions and variables of other machines is possible at all<br />

times. If the registers are not loaded in a real core at this moment in time, the debugger reads the values from the hypervisor or guest system memory. Using these values, the debugger determines the current stack frame in order to, for instance, display the current call hierarchy of a task's functions.<br />

Straightaway, the developer sees the current progress of the<br />

task and why it may be potentially waiting.<br />

Lauterbach has created a reference implementation with the<br />

Xen hypervisor and the Linux and FreeRTOS guests on a<br />

Hikey board that demonstrates the functionality. The MMU<br />

support implemented in the TRACE32 debugger and an<br />

expansion of the address management to virtualized systems<br />

permit access to all components at all times. This enables a<br />

debugging of the hypervisor, all guest operating systems and<br />

all guest processes. Consequently, even a retrospective analysis of a memory dump is possible without any problems.<br />



Autonomous Driving needs Safety and Security<br />

Dr. Ciwan Gouma<br />

SYSGO AG<br />

Manager Business Development Automotive<br />

Klein-Winternheim, Germany<br />

ciwan.gouma@sysgo.com<br />

Abstract— Internet in cars, vehicles communicating with<br />

each other and with the infrastructure - many new and very<br />

important functions for driver assistance and autonomous<br />

driving will become reality in the future.<br />

What ideas and concepts from the IT security and the<br />

avionics industry can we use? How can synergies be derived<br />

from the joint implementation of safety and security<br />

requirements, which also increase efficiency for developers,<br />

SW architects and testers? What requirements should a<br />

MILS Operating System (Multiple Independent Levels of<br />

Safety/Security) meet to minimize risks, reduce<br />

development times and reduce development costs?<br />

Keywords — Automotive CyberSecurity Overview & recommendations,<br />

from Safety to Security, mixed Criticalities,<br />

Adaptive AUTOSAR & Security<br />

I. GAMECHANGER – CONNECTED CAR<br />

Self-driving vehicles will soon hit the road – the automotive industry is facing rapid changes with countless challenges: handling vast data collections while managing uncompromised security, and real-time decision making combined with new mobility services. OEMs and Tier 1 suppliers are facing shorter design cycles and have to handle requests for a more personalised experience.<br />

By 2021 we expect about 200 million connected cars; about 90% of all cars will have internet access [1]. By 2025, research institutes expect about 470 million connected cars and about 7 million autonomously driving cars (autonomous driving level 4/5, see Figure 1) [2], [3].<br />

Automated driving systems will monitor the driving environment. Autonomous cars above driving automation level 3 (see Figure 1) require high safety certifications.<br />
The current situation, with more than 100 Electronic Control Units (ECUs) in a car, has increased the cyberattack surface tremendously because of their connectivity.<br />

It is obvious that we have to rethink cybersecurity and vehicle safety. By 2020, almost every new car will be connected, putting OEMs' current structures at risk, because the current communication and energy on-board network topology as well as the software architecture are not able to handle future requirements: complexity, security, costs [4].<br />
Figure 1: Levels of Autonomous Driving [13] [14]<br />

Last but not least, the incredibly fast-evolving Artificial Intelligence (AI) is also a compelling reason to think about new approaches, some of which will be presented below.<br />

II. OTHER PERSPECTIVES – LEARNING FROM AVIONICS AND IT SECURITY<br />

The many years of experience in IT security should also be<br />

taken into account in the automotive industry. Thus, the<br />

established and proven technology of firewalls and<br />

cryptography can be used.<br />

But consider “Crypto won’t save you either” by Peter Gutmann [5]:<br />
• It lists a lot of prominent hacks; for none of them was cracking crypto necessary<br />
• All of them targeted the integration<br />

Thus, what we may learn from IT security:<br />

• Security is an integral system property<br />
o Establish end-to-end security<br />




• Security is a process<br />
o Establish easy-to-use, verifiable and secure update procedures<br />

We may observe a lot of similarities between the avionics industry's challenges of recent years and current trends in the automotive industry, such as:<br />

• Tremendous changes for the network-based infrastructure<br />
o Aircraft today are network-based (AFDX & IP)<br />
• Increasing usage of common computing resources<br />
o Integrated Modular Architecture (IMA), Open World<br />
• Open World domain with COTS software<br />
o Wi-Fi products, Linux<br />
• New IT services<br />
o Pilots (tablets), passengers, crew, maintenance<br />
• Increasing integration and information flow between systems<br />
• Aircraft is heavily connected to other IT services, integration of several domains<br />
o Airlines, ATC<br />
• Aircraft is connected to the INTERNET<br />

Concepts and solutions already in use and accepted for aircraft are:<br />

1. Security by design<br />

a. Proper separation and control of<br />

functionalities (freedom of interference, no<br />

error propagation, minimizing the attack<br />

surface)<br />

b. Proper separation and control of information<br />

flows.<br />

c. Proper compositional certification approach.<br />

2. Introduction of “Multiple Independent Levels of<br />

Safety/Security” (MILS) systems [6].<br />

Figure 2: MILS Architectural Approach<br />

MILS is a high-assurance security architecture that supports<br />

the coexistence of untrusted and trusted components, based on<br />

verifiable separation mechanisms and controlled information<br />

flow [6].<br />

More findings, learnings from avionic industry regarding<br />

safety and security certification are discussed in [14].<br />

III. OTHER BENEFITS – SAFETY & SECURITY STANDARDS<br />

ISO 26262 2nd Edition:<br />
a) Potential interaction between safety and security<br />
b) Cybersecurity threats to be analyzed as hazards<br />
c) Monitoring activities for cybersecurity, including incident response tracking<br />
d) Refer also to SAE J3061, ISO/IEC 27001 and ISO/IEC 15480<br />
Common Safety and Security Base:<br />
SAE J3101 – Hardware-Protected Security for Ground Vehicles Applications:<br />
a) Secure boot<br />
b) Secure storage<br />
c) Secure execution environment<br />
d) Other hardware capabilities ...<br />
e) OTA, authentication, detection, recovery mechanisms ...<br />
SAE J3061 – Cybersecurity Guidebook for Cyber-Physical Vehicle Systems:<br />
a) Enumerate all attack surfaces, conduct threat analysis<br />
b) Reduce attack surface<br />
c) Harden hardware and software<br />
d) Perform security testing (penetration, fuzzing, etc.)<br />

The ISO 26262 is already well established as the safety<br />

standard for certification and confirmation purposes within the<br />

automotive industry. This safety standard already refers to<br />

several security standards as shown in the table above.<br />

The SAE J3101 standardization document provides recommendations for security-relevant functions and procedures; the table lists the most important ones.<br />
The SAE J3061 document [7] is a guidebook with guidance on how to secure the software part of an automotive system.<br />

There are organizational and procedural similarities between<br />

Safety – Software Life-Cycle and Security - Software Life-<br />

Cycle. Taking these common efforts into account and taking<br />

Automotive Cybersecurity as part of the vehicle development<br />

life cycle from the very outset, may reduce the effort<br />

enormously.<br />

A MILS Operating System (MILS OS), in other words the MILS approach, is the architectural principle addressing requirements from MILS standards such as development processes, risk modelling, verification & validation, and automotive domain specifics.<br />
For usage as a MILS OS, it is recommended to use a multi-core hypervisor in order to realize the benefits of modern multi-core hardware systems. A further example of a successful multi-core hypervisor implementation can be found in [8].<br />



IV. SUMMARY – BENEFITS FROM THE MILS APPROACH: AUTOMOTIVE EXAMPLES<br />

Driver assistance systems and autonomous driving are currently, alongside electromobility, the most important topics in automotive development. Many "autonomous" or "partially autonomous" systems are already on the road or in final test phases. Complete autonomy, as described by Level 5, is still a big step away [9]. Besides the vehicle-to-vehicle infrastructure, there are on the one hand the technical systems in the car, the focus of this article; on the other hand, legal challenges of responsibility in accidents as well as privacy issues have to be clarified before the series introduction of autonomous systems.<br />

This article presents technical approaches that can<br />

successfully address both safety and automotive cybersecurity<br />

requirements. For questions on data protection as well as legal<br />

and other important ethical questions please refer to [10], [11].<br />

A MILS OS is a cost-efficient and practical base for OEMs and/or Tier 1 suppliers to provide a powerful and modern multi-domain safe & secure automotive platform, which may integrate big data handling, sensor fusion and artificial intelligence algorithms while at the same time minimizing security risks and reducing development efforts.<br />

Figure 3: Control the network traffic by using a security monitor<br />

application and firewall for communication between 3 VMs<br />

Figure 3 presents a real example with separated domains (VMs) as a secure-by-design system. This example shows how the complexity can be managed, as well as the safe & secure integration of other operating systems or 3rd-party components. Besides that, important security functions like secure boot, secure update and over-the-air update of features and firmware can easily be integrated.<br />
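The idea of a security monitor that only forwards explicitly allowed inter-domain traffic can be reduced to a toy Python sketch (domain names and policy rules are invented; a real MILS system enforces this in the separation kernel, not in application code):<br />

```python
# MILS-style information-flow control: traffic between separated domains
# (VMs) passes only if an explicit policy rule allows the flow.
ALLOWED_FLOWS = {
    ("infotainment", "gateway"),
    ("gateway", "dashboard"),
}

def monitor(src, dst, message):
    """Security monitor: forward only explicitly allowed flows."""
    if (src, dst) not in ALLOWED_FLOWS:
        raise PermissionError(f"flow {src} -> {dst} denied")
    return (dst, message)

# An allowed flow passes; an unlisted direct path is blocked.
assert monitor("infotainment", "gateway", "ping") == ("gateway", "ping")
try:
    monitor("infotainment", "dashboard", "bypass")  # no direct path allowed
    blocked = False
except PermissionError:
    blocked = True
assert blocked
```

The design point is a default-deny whitelist: any flow not explicitly permitted never reaches the other domain.<br />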

Furthermore, we may see AUTOSAR Adaptive, the next automotive platform evolution, as an ideal candidate for a MILS operating system. Figure 4 shows an example architecture which combines AUTOSAR Adaptive and other operating systems. A MILS OS may provide the base for an ASIL D AUTOSAR Adaptive system by simply providing a ‘SafePOSIX’ API.<br />

Figure 4: Hypervisor combines Safety AUTOSAR ADAPTIVE and<br />

Linux (Source: Vector)<br />

Take away:<br />

• One multi-domain platform, integrated AI, sensor fusion<br />

and big data handling to create symbiosis between<br />

humans, cars and surroundings.<br />

• Enabling new mobility services<br />

• Secure the car with strict separated and secure domains,<br />

providing safe & secure inter-domain communication<br />

• Maximize data privacy and effective usage and minimize<br />

cyber risks<br />

• Reduce development costs and time to market with<br />

configurable platforms and easy and safe integration of 3rd<br />

party components.<br />

How to create a MILS platform:<br />
• Understand and follow the standards and recommendations<br />
• First, secure the hardware<br />
o Securing the HW is not part of this paper. For more information on how to provide a higher safety level on non-safe HW, please refer to [12]<br />
• Then secure the software<br />
o The system integration concept, i.e. the architecture, is the most important security measure<br />
o Check the following features of your platform: secure boot, secure update over the air, monitoring, assessment, notifications, remediation, a safe & secure SW life-cycle; establish end-to-end security<br />

V. REFERENCES<br />

[1] VDC Research Group, Inc., "Hypervisor & Secure Operating Systems: Safety, Security, and Virtualization Converge in the IoT," 2015.<br />
[2] PwC, "The 2017 PwC Strategy& Digital Auto Report," https://www.strategyand.pwc.com/media/file/2017-Strategyand-Digital-Auto-Report.pdf, September 2017.<br />
[3] VDC Research Group, Inc., "The Global Market for IoT & Embedded Operating Systems; Automotive Drives Revenue, ECUs Drive Developer Mindshare," 2017.<br />
[4] Ernst & Young, "Automotive Cybersecurity," http://www.ey.com/gl/en/industries/automotive, 2016.<br />
[5] P. Gutmann, "Crypto won't save you either," talk at Linux.conf.au 2015, Auckland, New Zealand, 2015. [Online]. Available: https://www.youtube.com/watch?v=_ahcUuNO4so.<br />
[6] H. Blasum, S. Tverdyshev, B. Langenstein, J. Maebe, B. De Sutter, B. Leconte, B. Triquet, K. Müller, A. Söding-Freiherr von Blomberg and A. Tillequin, "MILS Architecture, Whitepaper," in EURO-MILS: Secure European Virtualisation for Trustworthy Applications in Critical Domains, www.euromils.eu, 2014.<br />
[7] SAE International, SAE J3061 – Cybersecurity Guidebook for Cyber-Physical Vehicle Systems, January 2016.<br />
[8] S. Nordhoff, "How hypervisor operating systems can cope with multi-core certification challenges," in Aviation Electronics Europe, Munich, 2016.<br />
[9] F. Walkembach and C. Berg, "Eine für alle; Einheitliche Plattform für alle Autofunktionen," Automobile Elektronik, pp. 22-24, 11-12/2016.<br />
[10] Bundesregierung, "Strassenverkehrsgesetz, Automatisiertes Fahren auf dem Weg," 2017. [Online]. Available: https://www.bundesregierung.de/Content/DE/Artikel/2017/01/2017-01-25-automatisiertes-fahren.html.<br />
[11] The National Academies Press, "A Look at the Legal Environment for Driverless Vehicles," https://www.nap.edu/download/23453, 2016.<br />
[12] M. Özer, "Safety-Architektur für Plattformen mit komplexer Hardware; SIL-4 trotz unsichere Hardware," in Tagungsband Embedded Software Engineering Kongress 2017, Sindelfingen, www.ese-kongress.de, 2017.<br />
[13] SAE International, "Automated Driving – Levels of Driving Automation Are Defined in New SAE International Standard J3016," www.sae.org/misc/pdfs/automated_driving.pdf.<br />
[14] S. Le Merdy, SYSGO AG, "Avionics Application: Security for Safety in PikeOS," https://www.sysgo.com/services/knowledgecenter/whitepapers, 2017.<br />



Building Modern Industrial Applications with Open<br />

Standards and Open-source Software<br />

Frank Meerkötter (Author)<br />

Development Lead<br />

basysKom GmbH<br />

Darmstadt, Germany<br />

This paper offers arguments for building industrial HMIs with<br />

open standards and open-source software by showcasing a<br />

solution built on Qt, Linux and OPC-UA.<br />

OPC UA, Qt OpcUa, HMI, Qt, Qt Quick, Embedded Linux,<br />

Yocto, Open Source, FOSS<br />

I. INTRODUCTION<br />

Traditionally, HMIs for industrial automation are built<br />

using proprietary tools, components and interfaces. In the worst<br />

case, a solution of this kind is created with a proprietary tool,<br />

requiring a proprietary runtime and a proprietary<br />

communication interface, both often only available on<br />

Windows.<br />

This paper offers arguments for building industrial HMIs<br />

with open standards and open-source software by showcasing a<br />

solution built on Qt, Linux and OPC-UA. It will compare such<br />

a solution with a traditional approach. It will also discuss the<br />

advantages and disadvantages of both, taking into account<br />

different kinds of scenarios and applications, as well as our<br />

experience in the field. The showcase reflects what we found in<br />

our customer projects.<br />

A. Target Scope<br />

There are two kinds of cases that one needs to differentiate<br />

when talking about industrial applications or HMIs.<br />

Case one are plant manufacturers or industrial<br />

integrators that need to provide an HMI for the<br />

machinery inside a specific production line or even a<br />

complete plant. The given combination of machines and<br />

their setup is individual for most installations. The use<br />

cases such an HMI needs to fulfill are typically well<br />

defined and properly addressed by traditional industrial<br />

HMI software. The amount of budget that can be spent<br />

on HMI customization or application development is<br />

typically limited, as the resulting software is a one-off<br />

solution. This type of HMIs/industrial applications is<br />

well served by the "configuration, not programming"<br />

approach of traditional HMI software.<br />

Case two are machine manufacturers with machines<br />

produced in medium to large series. HMIs for these<br />

kinds of machines are also often done with traditional<br />

HMI software (at least as long as the application falls<br />

into the "comfort zone" of such tools). HMI software of<br />

this kind is not a one-off development and also an<br />

important point of differentiation for the manufacturer.<br />

This means more effort can be spent and it can make<br />

sense to look outside the world of traditional HMI<br />

software.<br />

This paper will focus on the second case.<br />

II. TRADITIONAL INDUSTRIAL HMI SOFTWARE<br />

What is "traditional HMI Software"? There is a large<br />

number of products, so we can answer this question only for<br />

the typical case which looks like this:<br />

Industrial HMI software consists of a graphical editor and a<br />

run time. The editor is used on a development machine to<br />

create the screens of the HMI inside a graphical composer and<br />

to implement the UI logic. It provides a library equipped with<br />

often needed graphical widgets. In addition, it often provides<br />

wizards that guide the creation of frequently needed<br />

components. It also contains pre-built blocks of typical functionality such as alarm management, recipe management, access to historical data and reporting. Most of the time it has<br />

a way to discover and import machine interfaces (symbols,<br />

variables, addresses). While it is possible to customize the UI logic with simple scripts, the focus is on configuration, not on<br />

software development. The runtime is used to execute the HMI<br />

that has been created with the editor. A product might be able<br />

to produce output for several different runtimes.<br />

A. Advantages of Industrial HMI Software<br />
• No deep software development skills are needed.<br />
• Many prepackaged components and existing application-specific functionality.<br />
• Ability to import machine interfaces to work with, either from a live machine or through several file-based exchange mechanisms.<br />
• Support through the tool vendor.<br />
• Ability to get results quickly.<br />

B. Disadvantages of Industrial HMI Software<br />
• It can be hard to create high-quality HMIs.<br />
• It can be hard to extend an HMI as soon as one leaves the "comfort zone" of a given tool or the application surpasses a certain size.<br />
• Availability of runtimes. Older solutions often only provide a Windows runtime, while more modern solutions have become more flexible, also providing runtimes for Android/iOS or the web browser. Still, the cooperation of the given vendor is needed to get a runtime for a specific hardware/OS combination.<br />
• Vendor lock-in. The given HMI is developed with a specific product of a given vendor. The resulting implementation cannot be ported easily to another vendor.<br />
• License fees (Windows, communication driver, HMI software and runtimes).<br />
• Version control is often lacking. Examples include binary project files or XML-based formats which are often also hard to handle reasonably in version control.<br />

III. MODERN HMI SOFTWARE DEVELOPMENT<br />

The following section describes an approach based on open<br />

standards and open source software to build a machine HMI. It<br />

is most suitable for scenarios where the HMI is not a one-off<br />

solution, there are high demands for HMI quality and/or the<br />

application will become complex/large. One of the strengths of<br />

this approach is its flexibility and openness - it becomes<br />

possible to switch out the hardware, the OS and other<br />

components.<br />

The HMI this essay refers to is built with Qt & Qt Quick,<br />

running on an ARM SBC with an OpenEmbedded/Yocto-based Linux as operating system. It is using OPC-UA via Qt<br />

OpcUa and open62541 to communicate with its PLC. A<br />

slightly modified version of this stack could also be used with<br />

an X86 industrial PC running Windows or an Android tablet.<br />

A. Qt and Qt Quick<br />

<br />

Qt is an open source C++ framework delivering the<br />

building blocks for cross-platform HMI and<br />

application development. Within Qt there is Qt Quick,<br />

a technology geared towards rapidly building modern,<br />

animated, smartphone-like HMIs. A Qt Quick<br />

application is typically structured in two parts: an<br />

application backend written in C++, which contains the<br />

business logic and a frontend which is a pure UI<br />

written in QML. QML is a JSON-like language used to describe the HMI declaratively (as opposed to<br />

programming it imperatively). Qt ships numerous cross<br />

platform modules for tasks such as network<br />

communication, database access, printing or XML and<br />

JSON processing. Specifically for industrial applications, it supports e.g. CAN adapters, Modbus, serial ports and OPC-UA. Qt is available<br />

under a dual licensing scheme, either as a commercial<br />

product from The Qt Company or under the (L)GPL. Qt takes<br />

API and ABI stability very seriously. Strict<br />

compatibility is kept within a major release series.<br />

Historically this means ~7-8 years.<br />

Qt is accompanied by its own integrated development environment, Qt Creator. HMIs based on QML are either programmed by hand or created via the Qt Quick Designer, which provides a graphical editor.<br />
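The backend/frontend split described above can be illustrated with a small, Qt-free sketch. In real Qt code this role is played by `QObject` with `Q_PROPERTY` and signals, and the frontend is QML; the `Property<T>` template and `MachineBackend` below are hypothetical names used only to show the pattern of a C++ backend notifying a declaratively bound UI.<br />

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, Qt-free sketch of the backend/frontend split: the C++
// backend owns the state and emits change notifications; the (QML)
// frontend merely binds to it and reacts.
template <typename T>
class Property {
public:
    explicit Property(T initial) : value_(std::move(initial)) {}

    const T& get() const { return value_; }

    void set(T v) {
        if (v == value_) return;                   // no-op writes emit nothing
        value_ = std::move(v);
        for (auto& cb : subscribers_) cb(value_);  // notify bound views
    }

    // A QML binding would register itself here.
    void onChanged(std::function<void(const T&)> cb) {
        subscribers_.push_back(std::move(cb));
    }

private:
    T value_;
    std::vector<std::function<void(const T&)>> subscribers_;
};

// Example backend object: a spindle speed exposed to the HMI.
struct MachineBackend {
    Property<int> spindleRpm{0};
};
```

A frontend label subscribed via `onChanged` updates automatically whenever the business logic writes a new value, which is exactly the division of labour the QML/C++ structure aims for.<br />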

B. OPC-UA<br />

OPC-UA is a communication standard for industrial<br />

applications. It is standardized by the OPC Foundation and also published as an international standard by the IEC as IEC 62541.<br />

OPC-UA is the successor to the old OPC standard (now<br />

dubbed OPC-classic). OPC-UA is, unlike OPC-classic,<br />

platform independent.<br />

Qt OpcUa is developed by basysKom. It will be a standard<br />

module of Qt, starting with Qt 5.11 which will be available<br />

mid-2018. It provides an easy to use, Qt-ish API for OPC-UA<br />

clients. It does not implement its own stack, but rather wraps<br />

existing stacks - one of these is open62541.<br />

open62541 is an open source project which implements a<br />

portable OPC-UA stack in C. The source is licensed under<br />

MPL2. It provides functionality for server and client-side<br />

development.<br />

C. Embedded-Linux and Yocto<br />

OpenEmbedded and Yocto have emerged as the standard tooling for creating custom Linux firmware images. They allow the creation of a range of systems, from desktop-like to single-purpose. Their modular approach separates BSP-specific parts from application-specific parts, making it easy to switch out the underlying hardware.<br />
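The BSP/application separation can be sketched with a hypothetical `local.conf` fragment; the machine and layer names below are illustrative placeholders, not taken from the paper:<br />

```conf
# build/conf/local.conf (illustrative sketch)
# Switching the underlying hardware is essentially a one-line change:
# the BSP layer supplies the machine definition, while the
# application-specific layers stay untouched.
MACHINE = "raspberrypi3"        # or e.g. "imx6ullevk", "qemux86-64"

# Application-specific parts live in their own layer (e.g. meta-myhmi),
# which adds Qt and the HMI application to the image.
IMAGE_INSTALL_append = " myhmi qtbase qtdeclarative"
```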

D. Advantages of this approach<br />

• Allows building high-quality HMIs (animated, fluid, smartphone-like).<br />
• Scalable across machine variants as well as application size/complexity.<br />
• Flexibility and freedom to implement individual requirements.<br />
• Cross-platform.<br />
• No vendor lock-in, as components can be replaced throughout the stack. Examples include choosing a different PLC vendor, replacing the open62541 stack with a commercial offering, or replacing the QML UI with a web-based solution by placing a REST/WebSocket server on top of the existing application backend.<br />
• Opportunity to significantly reduce license fees.<br />

764


• Enables the use of cheap ARM SBCs (as opposed to full industrial PCs).<br />

E. Disadvantages of this approach<br />

• Requires the skill (and will) to perform actual software development.<br />
• Does not scale for one-off scenarios.<br />
• Less guidance from a tool.<br />
• Less pre-packaged and pre-arranged industry-specific functionality.<br />

IV. CONCLUSION<br />
Our experience shows that it is beneficial to work with open standards and open source software to build HMIs and applications for machines. This approach really shines when the application is individual or complex and has high requirements on HMI quality. It becomes possible to add new features and functionality without being restricted by a given HMI tool. It also becomes easier to scale an application, either across machine variants or in delivered features.<br />
The open nature of this approach allows an evolution of the application stack, or the integration of new machine interfaces, without being strictly tied to the product lifecycle of a specific tool vendor. The cross-platform nature of the presented application stack gives an example of how to future-proof an investment against changes in hardware or OS availability, and opens opportunities to reduce software license costs.<br />

www.embedded-world.eu<br />

765


User Experience as an<br />

Industry 4.0 Innovation Driver<br />

David C. Thömmes, B.Sc, CEO Shapefield<br />

Senior Software & UX Engineer<br />

Microsoft MVP "Windows Development"<br />

Shapefield UG<br />

D-66115 Saarbrücken, Germany<br />

www.shapefield.de<br />

Abstract—Apparently everyone has heard of user experience<br />

design and usability, but only very few manufacturers seem to<br />

develop software that is focused on the user. If you look at the<br />

user interfaces of current products, it's obvious that there's a<br />

massive catching up to do. Today, a positive user experience (UX)<br />

is becoming more and more a success factor and a serious buying<br />

criterion for many companies. In the smartphone era, users are<br />

accustomed to user-friendly and well-usable interfaces. New<br />

technical achievements, such as HoloLens, are driving expectations. In his presentation, David C. Thömmes gives you an<br />

exciting insight into a user-centered design process, the current<br />

state of the market and current technology trends. The most<br />

important phases, terms and UX methods are presented with<br />

illustrative examples. Be inspired and get new impulses!<br />

Keywords— UX design, UI design, GUI design, user experience<br />

design, user interface design, graphical user interface, interaction<br />

design, UI development, UI engineering<br />

I. INTRODUCTION<br />

Fig. 1 shows a typical 2D user interface, perhaps reminding one manufacturer or another in an industrial context of their own creations. The shown user interface undoubtedly offers a<br />

button for every function and convinces with pleasant<br />

aesthetics. The user is immediately aware of the different<br />

functions and the learnability of the interface can only be<br />

referred to as good. Obviously, the previous statements were<br />

sarcasm. In this situation, brave manufacturers eventually tend<br />

to reprogram the existing user interface. Modern UI<br />

frameworks such as WPF with XAML, Qt with QML or<br />

HTML5 with AngularJS are often used for this. Taking a closer<br />

look at the reprogrammed interface, often no improvement can<br />

be recognized. Poor operating concepts are adopted without<br />

reflection, and so the opportunity for a proper redesign of the<br />

user interface is lost. New UI frameworks do not automatically<br />

lead to an attractive and well-usable interface and a positive<br />

user experience. A good user interface is the result of an<br />

interdisciplinary design process that consciously puts the user<br />

at the center of the design. But what exactly does user<br />

experience mean?<br />

II. USER EXPERIENCE<br />

User experience describes the sum of all the experiences a<br />

user collects with a digital product [1]. This includes the<br />

entirety of all possible points of contact such as advertising,<br />

websites, ordering processes, product design or installation.<br />

The user experience is not limited to the actual period of using the product; the time before and after the usage also gains in importance. Since user experience should be<br />

understood as a holistic approach, the term UX design reflects<br />

an interdisciplinary conglomeration of different disciplines.<br />

The pure UI design is only a partial discipline in addition to<br />

important core disciplines such as interaction design, product<br />

design and usability engineering. Optimally, the user<br />

experience should be stimulated on all different levels. Every<br />

point of contact with the product should be designed with the<br />

same quality and dedication for a positive user experience.<br />

Reduced to the aspects that are relevant during the usage of a<br />

product, the term user experience reveals new facets, such as<br />

usability.<br />

Fig. 1: Typical user interface from industrial sector<br />


766


III. USABILITY<br />

First of all, usability is a part of the standard EN ISO 9241,<br />

which describes guidelines for human-computer interaction. In<br />

the section 9241-11, usability is defined as: "the extent to<br />

which a product can be used by certain users in a particular<br />

context of use to achieve specific goals effectively, efficiently<br />

and satisfactorily" [2][3]. This means usability depends on which users use the product, in which work environment, and which tasks are solved. Here, the factors effectiveness,<br />

efficiency and satisfaction can be considered.<br />

Effectiveness<br />

Effectiveness describes how effectively a task can be<br />

handled. For example, is the user able to configure the machine<br />

to his needs?<br />

Efficiency<br />

The factor of efficiency expresses the temporal, economic<br />

and cognitive costs involved in achieving the goal. How long<br />

does it take the user to find an alarm in the system? How many<br />

clicks are needed for this process? How exhausting was the<br />

search for the user?<br />

Satisfaction<br />

Satisfaction is subjective and arises when the expectations<br />

of a product or system are exceeded. Positive emotions,<br />

feelings of happiness and aesthetics play a decisive role here.<br />

IV. DESIGN PROCESS<br />

One possible process for ensuring good usability is known<br />

as user-centered design (UCD). It describes a highly iterative<br />

design process that focuses on the user's needs as the<br />

foundation for design. The fundamental idea is that, at the<br />

beginning, as much information as possible is collected about<br />

the different user groups. Based on the information gathered, a<br />

design phase follows in which hypotheses are prepared as<br />

drafts, concepts, screens, etc. Subsequently, the products of the<br />

design phase are evaluated by various empirical or analytical<br />

methods. It is reviewed whether the designed hypotheses<br />

actually work for the user and what degree of usability has<br />

been achieved. Potential problems are detected by this<br />

procedure and, if necessary, corrected by returning to a<br />

previous phase. Through the iterative alternation of the<br />

different phases, the development of the products is an integral<br />

part of the process. Step by step, an approximation to the<br />

optimal result is achieved. Depending on the company and<br />

project different interpretations, phases and methods are<br />

applied. Fig.2 shows a possible UCD variant.<br />

V. USER ANALYSIS<br />

The ultimate goal of user analysis is to get to know the user<br />

and his needs and to prepare the results with corresponding<br />

documentation methods. The main focus is on work processes,<br />

working environment and contextual general conditions. Figure<br />

3 shows an engineer working with a complex CAD program.<br />

By carrying out a context analysis, for example, the working<br />

environment of the user can be understood almost unadulterated. A context analysis is a combination of<br />

observation and subsequent questioning. For one day, the UX<br />

designer becomes a shadow for the user and accompanies him<br />

at his everyday work.<br />

VI. DESIGN<br />

During the design phase, the information and results from<br />

the analysis phase are transformed into creative solutions. This<br />

phase is subdivided into the development of conceptual and<br />

visual design. The conceptual design of a user interface<br />

documents corresponding design decisions regarding the<br />

navigation structure, information architecture, interaction<br />

paradigms, controls and layouts. For this purpose, individual<br />

screens are often visualized as wireframes. Important areas of<br />

the user interface such as the alarm system, help or the<br />

displayed status are designed and arranged. In this step, the<br />

concrete visual design is less relevant, since with wireframes it<br />

is possible to collect reliable user feedback in an early state of<br />

the project. Completely formed screens with colors or effects<br />

could distract from the actual concept and distort the<br />

impression. Fig.4 shows a conceptual design of an engine<br />

control. The conceptual design is followed by the visual<br />

design. Shapes, colors, fonts, icons, effects, proportions and<br />

arrangements can have a significant influence on the perception<br />

and value of a user interface. As part of the visual design, these<br />

attributes are arranged in a well-defined composition and by<br />

this, the user interface gets its appearance. This is where the<br />

important first impression comes from, long before usability or<br />

functionality play a role.<br />

Fig. 2: User centered design<br />

Fig. 3: Engineer working with a CAD software<br />

767


Fig. 4: Conceptual design for motor control<br />

VII. EVALUATION<br />

Without appropriate evaluation, the results of the design<br />

phase are always just hypotheses. Interactive prototypes allow<br />

an evaluation of these results, for example as a part of a<br />

usability test. For this purpose, the existing static screens are<br />

implemented as interactive software fragments and real users<br />

are confronted with the prototype. Recruited users receive<br />

concrete tasks and are observed during the use of the product.<br />

Classically, a usability test is performed in a usability lab.<br />

While a user is in the so-called user's room, the UX designer watches the events from a second room. For support and documentation, video, screen and audio signals are transmitted from the user's room. As a cost-effective alternative to the<br />

classic usability test, the method of a focus group becomes<br />

more and more popular these days. A focus group is a<br />

moderated group discussion with relevant users. Usually in the<br />

course of one day several design hypotheses are discussed<br />

openly in the group and presented with the help of wireframes,<br />

screens and interactive prototypes. By this, users get the<br />

opportunity to try out new operating concepts live and to share<br />

their experiences directly with other users. The momentum of<br />

the group quickly creates user feedback and corresponding<br />

problems, concerns and comments can be discussed<br />

transparently.<br />

VIII. SPECIFICATION AND IMPLEMENTATION<br />

After the design has been evaluated by appropriate<br />

methods, the project has to be processed and documented for<br />

the development. Typically, a style guide is written for this<br />

purpose. It contains basic design resources such as colors,<br />

fonts, and control specifications as well as guidelines for using<br />

these controls and general usability information. Style guide<br />

documents easily become very extensive. Correspondingly, the production costs are high and, at the same time, the document is hard to consume. In order to make a faster leap into<br />

development, lightweight specifications are becoming more<br />

and more prevalent. They are called design manuals. Usually,<br />

they only contain the essential interfaces between design and<br />

development and are deliberately reduced to the essentials. Fig.<br />

5 shows a button with dimensions.<br />

The actual design process ends with the completion of the<br />

specification. But this only covers half of the project. After<br />

this, the technical implementation of the user interface is<br />

usually carried out simultaneously with the overarching<br />

development project. Every pixel and every distance is<br />

essential. Depending on the scope and complexity of the<br />

design, there are interesting challenges for the role of the UI<br />

Engineer. Especially with modern UI frameworks attractive<br />

and rich user interfaces can be realized with a reasonable effort<br />

these days. For example, WPF offers incredible possibilities<br />


768


Fig. 5: Button dimensions<br />

with styles, data templates and control templates [4]! Qt has<br />

also improved through QtQuick and QML [5], which opens up<br />

new perspectives for the technical realization.<br />

IX. CONCLUSION<br />

From the user's point of view, the user interface is the face of<br />

the application. It does not matter if the application is a<br />

machine control or a complex ERP system. The user likes it if<br />

it's easy to use and nice to look at. But a positive user<br />

experience is no coincidence, but the result of a solid design<br />

process and a skillful technical implementation. In addition, the<br />

world of users is now undergoing massive change.<br />

Digitalization and Industry 4.0 are the new triggers. Almost<br />

monthly, new devices are released with innovative interaction<br />

paradigms, such as the Leap Motion or the Apple Watch.<br />

Additionally, there is the trend of artificial intelligence paired<br />

with language interfaces such as Google Home or Amazon<br />

Alexa. Innovations already exist, but for many<br />

manufacturers, the development of a contemporary 2D user<br />

interface would be an advance. It's time for a change.<br />

AUTHOR<br />

David C. Thömmes (B.Sc.) studied media informatics at the<br />

University of Applied Sciences Kaiserslautern and discovered<br />

his passion for human-computer interaction and software<br />

engineering. David developed his first user interfaces with VBA and Delphi in 2004. As a Senior Software & UX Engineer<br />

as well as Managing Director of Shapefield, his passion is<br />

today the user-centered design and the technical development<br />

of impressive user interfaces. Prior to that, he was responsible<br />

for the development department of the renowned UX service<br />

provider Ergosign for almost 5 years in the role of Senior<br />

Software Engineer & Field Lead "Software Engineering<br />

Standards". At the beginning of 2015, David left Ergosign at<br />

his own request and laid the foundation for Shapefield a few<br />

months later. By working on various projects with different<br />

technologies, he has a profound knowledge in the development<br />

of desktop, web, embedded and mobile applications.<br />

Technically his heart beats for XAML, QML, C#, C ++ and<br />

PHP. For his achievements, David was honored with the<br />

Microsoft MVP Award 2016 and 2017.<br />

REFERENCES<br />

1. https://www.nngroup.com/articles/definition-user-experience<br />

2. https://de.wikipedia.org/wiki/Gebrauchstauglichkeit_(Produkt)<br />

3. https://de.wikipedia.org/wiki/EN_ISO_9241<br />

4. https://docs.microsoft.com/en-us/dotnet/framework/wpf/controls/styling-and-templating<br />

5. https://www.qt.io<br />

769


Real-Time Holographic Solution for True 3D-Display<br />

A. Kaczorowski, S.J. Senanayake, R. Pechhacker, T. Durrant, M. Kaminski and D. F. Milne<br />

VividQ Research and Development Division<br />

Cambridge, UK, CB3 0AX<br />

darran.milne@vivid-q.com<br />

Abstract— Holographic display technology has been a topic<br />

of intense academic research for some time but has only<br />

recently seen significant commercial interest. The uptake has<br />

been hindered by the complexity of computation and sheer<br />

volume of the resulting holographic data, meaning it takes up to several minutes to compute even a single frame of holographic video, rendering it largely useless for anything but<br />

static displays. These issues have slowed the development of<br />

true holographic displays. In response, several easier-to-achieve, yet incomplete, 3D-like technologies have arisen to<br />

fill the gap in the market. These alternatives, such as 3D<br />

glasses, head-mounted-displays or concert projections are<br />

partial solutions to the 3D problem, but are intrinsically<br />

limited in the content they can display and the level of realism<br />

they can achieve.<br />

Here we present VividQ's Holographic Solutions, a<br />

software package containing a set of proprietary state-of-the-art algorithms that compute holograms in milliseconds on<br />

standard computing hardware. This allows three-dimensional<br />

holographic images to be generated in real-time. Now users<br />

can view and interact with moving holograms, have<br />

holographic video calls and play fully immersive holographic mixed-reality games. VividQ's Solutions are a vital component for Industry 4.0, enabling IoT with 3D holographic imaging.<br />

The software architecture is built around interoperability with<br />

leading head-mounted-display and head-up-display<br />

manufacturers as well as universal APIs for CAD, 3D gaming<br />

engines and Windows based engineering software. In this<br />

way, VividQ software will become the new benchmark in 3D<br />

enhanced worker/system interaction with unrivalled 3D<br />

imaging, representation and interactivity.<br />

Keywords— Digital Holography, GPU, Augmented Reality,<br />

Mixed Reality, Optical Systems, Display Technology<br />

I. INTRODUCTION<br />

Owing to the recent increased interest in 3D display,<br />

multiple technologies have emerged to deliver a convincing<br />

3D experience [1-4]. These largely rely on multi-view or<br />

stereoscopic representations designed to "trick" the eye into<br />

providing correct depth cues to make the projections appear<br />

three-dimensional. However, these depth cues are often<br />

limited and in some cases can cause accommodation-vergence<br />

conflicts leading to nausea and headaches for the user.<br />

Holographic display, on the other hand, aims to precisely<br />

recreate the wave-front created from a 3D object or scene,<br />

creating a true 3D image of the input scene with all the correct<br />

depth cues intact. This makes holographic display an ideal<br />

candidate for augmented/mixed reality applications, as it<br />

provides 3D virtual objects that appear in focus with their<br />

surroundings.<br />

With advances in 3D sensors together with dramatic<br />

increases in computational power, Digital Holography (DH)<br />

has become a topic of particular interest. In DH, holograms<br />

are calculated from point cloud objects, extracted from 3D<br />

data sources such as 3D cameras, game engines or 3D design<br />

tools, by simulating optical wave propagation [5][6][7]. This<br />

simulation can take multiple forms depending on the desired<br />

quality of the recovered image and whether the holograms are<br />

chosen to be in the far or near field. The resulting hologram<br />

may then be loaded onto a suitable digital display device with<br />

associated optical set-up for viewing.<br />

A conceptually simple approach for hologram generation is<br />

the ray-tracing method [8][9][10], in which the paths from<br />

each point on the object to each hologram pixel are computed<br />

and aggregated to produce the hologram representation. While<br />

the ray-tracing method is physically intuitive, it is highly<br />

computationally expensive. To address this issue, many<br />

modifications [11-14] and alternative solutions such as the polygon [14-18] and image-based methods [19][20] have been proposed. In this paper, we describe a real-time holographic display system that uses a different algorithmic approach<br />

based on a Layered Fourier Transform (LFT) scheme [21][22].<br />

We demonstrate how data may be extracted from a 3D data source, in this case the Unity engine, and streamed to a holographic<br />

generation engine, containing the LFT algorithms. The LFT<br />

algorithms are highly parallelized and optimized to run via<br />

CUDA kernels on NVidia Graphics Processing Units (GPUs).<br />

The resulting holograms are then output to a suitable display<br />

device, in this case a Ferroelectric LCoS Spatial Light<br />

Modulator (FLCOS SLM).<br />


770


In the following we describe the various components of the<br />

real-time holographic display architecture. In section II, we<br />

discuss the streaming of game data from the Unity engine and<br />

the standardization of Unity data to a generic 3D format. In<br />

Section III, we present the real-time Hologram Generation<br />

Engine (HGE) before going into the display setup and driver<br />

in Section IV. We discuss results and future work in Section<br />

V.<br />

II. DATA CAPTURE AND STANDARDIZATION<br />

To stream data to the Hologram Generation Engine, we<br />

must first extract 3D data from a suitable source. In this case,<br />

we choose the Unity engine. Unity is a well-known gaming<br />

platform that may be used to create entire 3D scenes using<br />

pre-rendered assets.<br />

A. 3D Data Streaming from Unity<br />

Key to the performance of the real-time process is that 3D<br />

content rendered within Unity is passed to the holographic<br />

generation engine without having to copy memory from the<br />

CPU to GPU and back again. The process is summarized in<br />

Fig.1. Unity uses a concept of Shaders to create objects<br />

known as Textures that describe virtual scenes. The standard<br />

Unity Shaders create Textures as colour maps that specify RGB colours in a 2D grid. While this is suitable for rendering<br />

to a standard 2D display, such as a monitor or stereoscopic<br />

device, this is insufficient to capture depth information about a<br />

scene as required for 3D Holography. Instead, a custom<br />

Shader was implemented that renders a colour map with depth<br />

(Z) to create a four-channel Colour-Depth-Map (CDM) with<br />

channels RGBZ. Each CDM is rendered at the resolution of the Spatial Light Modulator (in this case the rendering is actually performed at half the SLM resolution, due to the binary nature of the device giving rise to twin images). Unity renders<br />

into the CDM texture within an OpenGL context. This allows<br />

it to pass the CDM texture object directly to the HGE. Within<br />

the HGE, the CUDA-OpenGL-Interop library is utilized to<br />

make the data available to the HGE’s custom kernel functions,<br />

contained in a set of C++ DLLs. This way, Unity is able to<br />

render the 3D scene and the information is passed straight to<br />

the hologram algorithms without multiple memory copies<br />

between the CPU and GPU. In this sense, the OpenGL context<br />

acts as a translator between the two steps, allowing us to pass<br />

a pointer to the texture directly to the DLLs holding the HGE<br />

algorithms. While this implementation is based on OpenGL,<br />

one could consider alternative approaches using Direct3D or<br />

Vulkan. Direct3D is widely used in the game industry and<br />

represents a natural next step in the evolution of the streaming<br />

solution as it contains libraries similar to the CUDA-OpenGL-<br />

Interop. For Vulkan there is currently no such support, but it is<br />

likely that there will be in the near future.<br />
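The four-channel Colour-Depth-Map layout described above can be sketched on the CPU. This is an illustrative sketch only (the real CDM is filled by a custom GPU shader at SLM resolution); the `CDM` struct and its members are hypothetical names, not VividQ's actual API.<br />

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of a four-channel Colour-Depth-Map: interleaved
// R, G, B, Z floats per pixel, row-major. In the real pipeline a custom
// Unity shader writes this texture directly on the GPU.
struct CDM {
    std::size_t width  = 0;
    std::size_t height = 0;
    std::vector<float> data;  // width * height * 4 floats, R,G,B,Z order

    CDM(std::size_t w, std::size_t h)
        : width(w), height(h), data(w * h * 4, 0.0f) {}

    // Write one pixel: colour channels in [0,1], z = scene depth.
    void set(std::size_t x, std::size_t y,
             float r, float g, float b, float z) {
        float* p = &data[(y * width + x) * 4];
        p[0] = r; p[1] = g; p[2] = b; p[3] = z;
    }

    float depth(std::size_t x, std::size_t y) const {
        return data[(y * width + x) * 4 + 3];
    }
};
```

The depth channel is what distinguishes the CDM from an ordinary RGB render target: it is exactly the per-pixel z that the layer-based hologram algorithm later discretizes.<br />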

Fig. 1. Unity Streaming Process: Unity renders a 3D<br />

scene, the shader creates a custom texture which is passed to<br />

an array on the GPU (cudaArray) where the hologram will be<br />

calculated.<br />

B. Data Standardization<br />

While Unity is a useful tool for gaming applications and<br />

technology demonstrations, for Holography to be available for<br />

more general applications, one should define a process to<br />

stream and work on data from arbitrary 3D data sources.<br />

Three-dimensional data is present in many forms across<br />

multiple software and hardware platforms. To compute<br />

holograms, fundamentally we require data in a point-cloud<br />

format. A point cloud can be thought of simply as a list of 3D<br />

coordinates, specifying the geometry of the object, along with<br />

a set of attributes of the cloud e.g. colours (the CDM texture<br />

from Unity can be thought of as a flattened point cloud with each<br />

point of the grid specifying (x,y)-coordinates and the depth<br />

channel providing the z). While point clouds are a common<br />

and intuitive data type, so far no standard point cloud format<br />

has emerged that is compatible with the majority of 3D source<br />

data systems. To overcome this issue in holographic<br />

applications, and allow arbitrary sources to be streamed to the<br />

HGE, we present a new Point Cloud class structure that<br />

incorporates the essential features of 3D data required for<br />

holographic computation.<br />

C. Point Cloud Class<br />


The point cloud class, PointCloud, provides a common<br />

framework for data passing through the real-time holographic<br />

display system. This allows 3D data to be passed around in<br />

memory rather than in file format for fast processing.<br />

PointCloud is an abstract base class that allows derivative<br />

classes to specify specific point cloud representations. In the<br />

holographic generation case, we are interested in two<br />

771


particular types of point cloud representation: 3D and 2.5D<br />

point clouds. The 3D case refers to a PC that contains<br />

complete geometric information of a given object while the<br />

2.5D case occurs when using a PC viewed from a particular<br />

perspective. In this case (assuming the object is not<br />

transparent), one may neglect points that are occluded.<br />

The base class and inheritance structure of PointCloud is<br />

designed to be generic and easily extensible so one may define<br />

further derivative classes for higher dimensional PCs or PCs<br />

with attributes specific to the chosen application or data<br />

source. The base class contains generic file reading and<br />

writing methods but there is no embedded algorithmic<br />

functionality. Instead, all parts of the holographic system<br />

architecture may accept an instance of these types and run<br />

algorithms using the data contained in them.<br />
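The base-class-plus-derivatives structure described above can be sketched as follows. The actual VividQ interface is not public, so all class and member names here are hypothetical; the sketch only shows an abstract geometry interface with a full-3D derivative and a 2.5D derivative that stores one visible depth per (x, y) grid cell, matching the CDM layout.<br />

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the PointCloud hierarchy: an abstract base
// class exposing geometry, with representation-specific derivatives.
class PointCloud {
public:
    virtual ~PointCloud() = default;
    virtual std::size_t size() const = 0;                      // point count
    virtual std::array<float, 3> point(std::size_t i) const = 0;
};

// Full 3D representation: complete geometric information of the object.
class PointCloud3D : public PointCloud {
public:
    void add(float x, float y, float z) { pts_.push_back({x, y, z}); }
    std::size_t size() const override { return pts_.size(); }
    std::array<float, 3> point(std::size_t i) const override { return pts_[i]; }
private:
    std::vector<std::array<float, 3>> pts_;
};

// 2.5D representation: a regular (x, y) grid with a single visible
// depth per cell, i.e. occluded points are already discarded.
class PointCloud25D : public PointCloud {
public:
    PointCloud25D(std::size_t w, std::size_t h)
        : w_(w), depth_(w * h, 0.0f) {}
    void setDepth(std::size_t x, std::size_t y, float z) { depth_[y * w_ + x] = z; }
    std::size_t size() const override { return depth_.size(); }
    std::array<float, 3> point(std::size_t i) const override {
        return {float(i % w_), float(i / w_), depth_[i]};
    }
private:
    std::size_t w_;
    std::vector<float> depth_;
};
```

Because every stage of the pipeline accepts the abstract `PointCloud`, a new data source only needs a new derivative class, not changes to the hologram algorithms.<br />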

With data streamed via the Unity process or through the<br />

generic point cloud class we may now compute the<br />

holographic representation of the data to be displayed on the<br />

SLM for viewing. In the next section we discuss the theory<br />

behind the hologram generation process, outline the algorithm<br />

in the HGE and describe the expected outputs.<br />

III. REAL-TIME HOLOGRAM GENERATION<br />

A physically intuitive generation method for the calculation<br />

of digital holograms is a direct simulation of the physical<br />

holographic recording process. In this model, objects are<br />

represented by a point cloud where points in the cloud are<br />

assumed to emit identical spherical light waves that propagate<br />

towards a fixed 2D "holo-plane" offset from the cloud. The<br />

resulting interference pattern is calculated on the surface of the<br />

holo-plane to yield the digital hologram. While this method is<br />

conceptually simple and can produce high quality holograms,<br />

it is computationally intensive and time consuming to<br />

implement. To reduce the computational load of hologram<br />

generation we make use of a layer-based Fourier algorithm.<br />

This method partitions the point cloud into parallel, two-dimensional<br>

layers by choosing a discretization along one axis<br>

of the object. Points that do not intersect one of the discrete<br />

layers are simply shifted along the axis of discretization to the<br />

closest layer. To construct the hologram a Discrete Fourier<br />

Transform (DFT) is applied to each of the layers. The DFT is<br />

implemented by the Fast Fourier Transform (FFT) algorithm.<br />

To account for the varying depths, a simulated effective lens<br />

correction is calculated and applied to each layer. The<br />

transformed and depth corrected layers are summed to yield<br />

the final hologram. So for a hologram, H, with holo-plane<br />

coordinates (α, β), the construction is described by:<br />

H(α, β) = Σᵢ exp(i·zᵢ(α² + β²)) · FT[Aᵢ(x, y)],<br>

where Aᵢ(x, y) is the i-th object layer and zᵢ is the depth<br>

parameter for the i-th layer. The sum is defined over all the<br>

layers in the discretization.<br>
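The layer summation described above can be sketched in code. The following is a minimal, illustrative C++ implementation: a naive 2D DFT stands in for the cuFFT-accelerated FFTs used in the real system, and the grid size, normalization and lens-correction constant are assumptions for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using Cplx = std::complex<double>;
using Field = std::vector<std::vector<Cplx>>;

// Naive 2D DFT, standing in for the cuFFT-based FFTs of the real system.
Field dft2(const Field& a) {
    const std::size_t N = a.size(), M = a[0].size();
    const double PI = std::acos(-1.0);
    Field out(N, std::vector<Cplx>(M));
    for (std::size_t u = 0; u < N; ++u)
        for (std::size_t v = 0; v < M; ++v) {
            Cplx sum{0.0, 0.0};
            for (std::size_t x = 0; x < N; ++x)
                for (std::size_t y = 0; y < M; ++y) {
                    double ang = -2.0 * PI * (double(u * x) / N + double(v * y) / M);
                    sum += a[x][y] * Cplx{std::cos(ang), std::sin(ang)};
                }
            out[u][v] = sum;
        }
    return out;
}

// H(α, β) = Σ_i exp(i·z_i(α² + β²)) · FT[A_i(x, y)]:
// transform each depth layer, apply its lens-correction phase, and sum.
Field layerHologram(const std::vector<Field>& layers, const std::vector<double>& z) {
    const std::size_t N = layers[0].size(), M = layers[0][0].size();
    Field H(N, std::vector<Cplx>(M, Cplx{0.0, 0.0}));
    for (std::size_t i = 0; i < layers.size(); ++i) {
        Field Fi = dft2(layers[i]);
        for (std::size_t a = 0; a < N; ++a)
            for (std::size_t b = 0; b < M; ++b) {
                double phase = z[i] * (double(a * a) + double(b * b));  // lens correction
                H[a][b] += Cplx{std::cos(phase), std::sin(phase)} * Fi[a][b];
            }
    }
    return H;
}
```

For a single layer containing one point emitter at the origin, the resulting hologram has unit magnitude everywhere: the DFT of a delta is flat, and the lens correction is a pure phase factor.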

The implementation of the LFT method in the HGE is<br />

complicated by two issues. First, three coloured holograms<br />

(RGB) must be created to achieve full colour holographic<br />

images. This is achieved in this case by including a loop over<br />

the colours and essentially running the algorithm three times.<br />

The resulting holographic images can then be overlaid in the<br />

hologram replay field to give the final full colour holographic<br />

image. Note that the three coloured holograms will not yield<br />

the same size of image in the replay field due to the different<br />

wavelengths, diffracting at different rates on the display. To<br />

account for this the input point cloud or CDM for each colour<br />

channel must be scaled to ensure the images overlap exactly.<br />
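The required scaling follows from the fact that the replay-field extent grows linearly with wavelength. One plausible scheme, sketched below, pre-scales each channel's input inversely with its wavelength relative to a chosen reference; the wavelengths and the choice of reference are illustrative, not taken from the HGE.

```cpp
#include <cassert>
#include <cmath>

// Replay-field size scales with wavelength λ, so to make the three colour
// images overlap, each channel's input is pre-scaled inversely with λ.
// The reference wavelength is an illustrative choice, not a system value.
double channelScale(double lambda, double lambdaRef) {
    return lambdaRef / lambda;  // longer wavelengths get shrunken inputs
}
```

With a green 520 nm reference, a red 640 nm channel would be scaled down (factor 0.8125) and a blue 450 nm channel scaled up, so all three replay images land at the same size.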

The second issue is that the output display element – in<br />

this case a FLCoS SLM – is a binary phase device. Hence, the<br />

hologram H(α, β), with which in general takes complex<br />

values, representing both amplitude and phase, must be<br />

quantized to just two phase values i.e. 1-bit per pixel. This<br />

causes a severe reduction in quality in the resulting hologram<br />

and noise reduction methods must be applied to discern a<br />

viewable image. In general, even non-FLCoS devices, such as<br />

Nematic SLMs, cannot represent full phase and must quantize<br />

to some finite number of phase levels (usually 256 levels, i.e.<br>

8 bits per pixel). While an individual binary hologram may<br>

give a very poor reconstruction quality, it is possible to use<br />

time-averaging to produce high quality images with low noise<br />

variance. Such a scheme is implemented in the HGE to<br />

account for the limitations of the output display device.<br />
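The benefit of time-averaging can be illustrated numerically: averaging N statistically independent noisy reconstructions reduces the noise variance by roughly a factor of N. The sketch below uses a synthetic Gaussian noise model and arbitrary frame counts as stand-ins; it is not the HGE's actual noise model.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Average N noisy "frames" of a constant-intensity replay field; the variance
// of the averaged image falls roughly as 1/N, which is the basis of the
// time-averaging noise-reduction scheme. The noise model is illustrative.
std::vector<double> timeAverage(int frames, int pixels, std::mt19937& rng) {
    std::normal_distribution<double> noise(0.0, 1.0);  // unit-variance speckle stand-in
    std::vector<double> avg(pixels, 0.0);
    for (int f = 0; f < frames; ++f)
        for (int p = 0; p < pixels; ++p)
            avg[p] += (1.0 + noise(rng)) / frames;  // true intensity 1 plus noise
    return avg;
}

// Sample variance over all pixels of a frame.
double variance(const std::vector<double>& v) {
    double mean = 0.0;
    for (double x : v) mean += x;
    mean /= v.size();
    double var = 0.0;
    for (double x : v) var += (x - mean) * (x - mean);
    return var / v.size();
}
```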

A. Algorithm Structure<br />

Given that we require three-colour holograms<br>

composed of multiple object layers, and that noise reduction must<br>

be applied as part of the process, the HGE algorithm proceeds<br />

as follows:<br />

1. The CDM Texture is passed from Unity to the C++<br />

DLL that wraps the underlying CUDA kernels.<br />

2. For each colour the CDM is scaled in size to account<br />

for the variable rates of diffraction of the three laser<br />

fields.<br />

3. FFTs are applied to the CDM data using the cuFFT<br />

CUDA Library to give full-phase holograms.<br />

4. The full-phase holograms are quantized to give<br />

binary phase holograms.<br />

5. The time-averaging algorithm is applied to eliminate<br />

noise in the replay field image.<br />

6. The holograms are stored in memory as a 24-bit<br />

Bitmap.<br />

The output of this procedure is a 24-bit Bitmap that can be<br />

streamed directly to the FLCoS SLM.<br />
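Step 4 of the pipeline, the binary phase quantization, can be sketched as follows. Thresholding on the sign of the real part is one common binarization rule, used here for illustration rather than as the HGE's actual quantizer.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Quantize a full-phase (complex-valued) hologram to 1 bit per pixel for a
// binary phase SLM: 0 -> phase 0, 1 -> phase π. Sign-of-real-part is one
// simple binarization rule, shown here for illustration.
std::vector<int> binarizePhase(const std::vector<std::complex<double>>& holo) {
    std::vector<int> bits;
    bits.reserve(holo.size());
    for (const auto& h : holo)
        bits.push_back(h.real() >= 0.0 ? 0 : 1);
    return bits;
}
```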

The majority of the algorithmic work is handled by several<br />

custom CUDA kernels, which are responsible for handling the<br />

CDM object, creating layers, preparing them for FFT and<br />

www.embedded-world.eu<br />




Fig. 2. The holographic display apparatus: The FLCoS<br />

SLM is synchronized to three laser diodes (RGB) via a custom<br />

micro-controller. The holographic image created by the<br />

reflected light is enlarged by an optical setup and viewed via<br>

a beam-splitter eye-piece. (a) Micro-controller, (b) SLM<br />

Driver, (c) SLM, (d) Laser-diode array (RGB). (e) Image<br />

enlarging optics, (f) Eye-piece<br />

merging to create the holograms. These kernels are called via<br />

C++ wrapper functions that expose the functionality without<br />

the need to interact directly with the low level CUDA C code.<br />

IV. DEVICE DRIVERS AND HOLOGRAM DISPLAY<br>

With the holograms computed, all that remains is to display<br>

them on a suitable output device. Holographic images of this type<br>

cannot be viewed on a typical display such as an LED or OLED panel,<br>

as these do not allow for phase modulation of light. Instead we<br />

use a reflective phase-only Spatial Light Modulator. These<br />

devices allow us to modulate the phase of incoming coherent<br />

light to generate desired interference patterns in the reflected<br />

wave-front to create the holographic image.<br />

The device used here is a ForthDD Ferroelectric Liquid<br />

Crystal on Silicon (FLCoS) SLM with 2048 x 1536 pixels of<br />

pitch 8.2μm. The device comes equipped with a control unit<br />

and drivers to allow developers to alter settings such as refresh<br>

rate and colour sequencing. To create the holographic images,<br />

RGB lasers are collimated and directed to the SLM surface<br>

that is displaying the holograms. The prototype display uses<br />

off-the-shelf optical components from ThorLabs and is<br />

designed to replicate an augmented reality style display. In<br />

this scheme, the holograms are reflected back to a beam-splitter<br>

which acts as an eye-piece to achieve the augmented<br>

reality effect (Fig. 2).<br />

For this implementation, we create three colour holograms<br>

(R, G and B) which must be displayed sequentially in a time-multiplexed<br>

fashion. To achieve this, a custom Arduino<br>

microcontroller was developed that synchronizes the RGB<br />

frames with three laser diodes. These frames are shown at high<br />

frequency to ensure that the images are time-averaged with<br />

respect to the viewer’s eye to give a single full-colour image<br />

(Fig. 3).<br />

Fig. 3. Augmented Reality Holographic Elephant Image.<br />

Photographed directly through the eye-piece with DSLR<br />

camera. The image is constructed by overlaying RGB holographic<br>

elephants in the replay field to create single full-colour<br />

elephant.<br />

V. RESULTS AND DISCUSSION<br />

The real-time holographic generation and display process has<br />

been tested on an NVidia GTX 1070 – a mid-range gaming<br>

GPU. Running a 3D Unity game with a depth resolution of<br>

between 32 and 64 depth layers (the number of layers computed<br>

depends on the content in the scene and is determined at run-time),<br>

the GTX 1070 yields a framerate of 52-55 Hz. This<br>

creates a smooth gaming experience assuming a static display<br />

as tested here, but a framerate of 90-120 Hz would be required<br />

to achieve a seamless mixed reality display system. Increasing<br />

the memory available to the GPU would be a first step to<br />

allow more holograms to be computed in parallel. Indeed, the<br>

new-generation Volta architecture from NVidia, which makes use of<br>

half-float calculations, would give a significant speed-up to the<br>

HGE and is projected to allow for >90 Hz in the system.<br>

Moving to a dedicated ASIC would improve this further by<br />

running at lower power in a more compact package, suitable<br />

for portable, untethered devices. As 80% of the compute time<br />

in the HGE is spent performing FFT, a dedicated embedded<br />

solution would provide significant speed up over the current<br />

generic approach.<br />

In this optical setup, the image size and quality are<br>

constrained by several physical factors. The eye-box and field<br />

of view are very small due to the dimensions of the SLM and<br />

the optics used to expand the image size. The holographic<br />

images also pick up noise due to speckle effects from the laser<br />

diodes and there is also some residual noise in the replay field<br />

due to the quantization errors created by the binary SLM.<br />

These issues can be addressed primarily through higher<br />

quality SLMs with smaller pixel pitch and higher resolution.<br />

Nematic-type, 8-bit SLMs with 4K x 2K resolution and pixel<br>

pitch of 3.74 μm are currently available, with 8K devices likely<br>

to emerge within the near future. The higher resolution and<br />

smaller pitch of these devices allow for wider fields of view<br />

and finer detail in the holographic images. Additionally, one<br />

can consider waveguide solutions combined with eye-tracking<br />

for accurate eye-box mapping to ensure the viewer never loses<br />



sight of the holographic image. Such schemes are the subject<br />

of current research and development.<br />

VI. CONCLUSION<br>

Here we have presented an end-to-end holographic generation<br />

and display system that allows 3D data to be extracted directly<br />

from a Unity game, full 3D holograms to be computed and<br />

then streamed to an augmented reality holographic display.<br />

The hologram generation algorithms achieve a depth<br />

resolution of between 32 and 64 layers while maintaining a framerate<br>

>50 Hz on a 2k x 1.5k SLM. While the hardware required to<br />

view the holographic images is in a nascent state, such an<br />

advance in the algorithmic side will enable the development of<br />

high-quality, fully interactive holographic display systems that<br />

are suitable for mass adoption.<br />

ACKNOWLEDGMENT<br />

The authors would like to thank E. Bundyra, M. Robinson<br />

and M. Ippolito for providing funding and business development<br />

support in this project. We would also like to thank Prof T.<br />

Wilkinson and CAPE at the University of Cambridge for<br />

supporting the project in its early stages.<br />



Integrating Capacitive Touch Technology into<br />

Electronic Access Control Products<br />

Walter Schnoor<br />

System Applications, MSP Microcontrollers<br />

Texas Instruments, Inc.<br />

Dallas, TX U.S.A.<br />

Abstract—Capacitive touch brings appealing aesthetics,<br />

enhanced security possibilities, and improved reliability to<br />

electronic access control systems. These key benefits are often<br />

counteracted by higher average power consumption, touch<br />

detection reliability issues in exterior installations when the touch<br />

panel is exposed to moisture, and additional system cost and<br />

integration complexity. However, microcontrollers equipped<br />

with capacitive touch sensing technology can be designed with<br />

key features to address these system challenges. This paper<br />

discusses system design techniques for addressing the common<br />

challenges with capacitive touch in electronic access control<br />

products, including how to reduce the average current draw of a<br />

12-button capacitive touch keypad into the single digit<br />

microamperes and system integration techniques to reduce<br />

system cost and improve moisture tolerance.<br />

Keywords—capacitive touch; capacitive sensing;<br />

microcontroller; electronic lock; electronic access control; human<br>

machine interface<br />

I. INTRODUCTION<br />

Smart home products have seen their popularity amongst<br />

consumers rise significantly in recent years. Take 2016 for<br />

example - it’s estimated that 80 million smart home devices<br />

were delivered to customers, marking a 64% increase from<br />

2015 [1]. Smart home product manufacturers are now even<br />

more optimistic about the future of the industry. The smart<br />

home industry is expected to reach US $120 billion<br>

globally by 2022, “but not without consumer acceptance first”<br />

[2]. One of the most fascinating things about the onset of smart<br />

connected home products is the impact that they have had on<br />

mature industrial segments such as doorbells, thermostats, and<br />

access control products. In order to accelerate consumer<br />

acceptance of new smart products in these existing market<br />

segments, manufacturers have leaned heavily on not only the<br />

connectivity and functionality of their products, but also on the<br />

product’s aesthetics, its security features, and its long-term<br />

reliability. Aesthetics, security, and reliability have become<br />

key differentiators in the aforementioned market segments, and<br />

are looked at critically by residential and commercial<br />

consumers. It is this market need that has driven adoption of<br />

capacitive touch sensing technology in smart home products,<br />

specifically in electronic access control products.<br />

Electronic access control products, such as the electronic<br />

door lock, often use a short-range wireless connection such as<br />

Bluetooth Low Energy (BLE) or near field communication<br />

(NFC) to validate a user and unlock the controlled function.<br />

However, in the event that a user doesn’t have their mobile<br />

device or radio frequency identification (RFID) tag, a keypad<br />

may be used as a backup mechanism to allow access. While<br />

the keypad may not be the primary means of authentication, its<br />

inclusion in the electronic lock is often preferred by consumers<br />

because of the flexibility that it offers. For example, the owner<br />

of a home could enable temporary key codes for visiting guests<br />

so that they may come and go as they please for a period of<br />

time. Likewise, a business could issue a temporary security<br />

code to a contractor rather than issuing them a tag. The<br />

downside of including a mechanical keypad in an electronic<br />

lock is that the keypad itself takes up considerable space in the<br />

product and is often visually unappealing. Mechanical keypads<br />

also have the potential to become a security weakness, as<br />

fingerprints and dirt or grease smudges can leave a history of<br />

which keys were often pressed by users, allowing someone to<br>

extrapolate possible codes by observing the keys. Finally,<br />

mechanical keypads come with reliability concerns, as moving<br />

parts and electrical contacts experience fatigue over time.<br />

Capacitive touch sensing technology enables designers of<br />

electronic access control products to improve market<br />

acceptance of their products by offering premium aesthetics,<br />

enhanced security possibilities, and improved robustness with<br />

respect to a traditional mechanical keypad. However,<br />

capacitive touch is not without its own unique challenges. For<br />

example, capacitive touch is an “always-on” activity, meaning<br>

that the touch sensors must be actively scanned at a periodic<br />

interval to determine if a user has touched a key. When<br />

contrasted with a mechanical button, such as a membrane<br />

switch, capacitive touch carries a power consumption penalty,<br>

which is significant in an electronic access control application<br>

that may be required to run off of a set of “AA” batteries for 12<br />

months or more. In addition to the power consumption<br />

challenge, capacitive touch sensors are often susceptible to<br />

false touch detections when subjected to exterior environments<br />

where rain and snow are present.<br />



In this paper, the benefits of capacitive touch sensing for<br />

electronic access control products are presented. In addition,<br />

the unique challenges of capacitive touch in this application are<br />

analyzed from a technical standpoint and solutions to those<br />

challenges are proposed.<br />

II. BENEFITS OF CAPACITIVE TOUCH SENSING<br>

Capacitive touch technology arms product designers with<br />

the ability to create abstract human-machine interfaces (HMIs)<br />

using the same fundamental technology found in touchscreens.<br />

A capacitive touch sensor consists of a conductive structure, or<br />

set of structures, from which an electric field is projected out<br />

through an insulating dielectric overlay material to the free<br />

space just above the overlay. When a user comes into close<br />

proximity of the overlay, or touches the overlay, the electric<br />

field is changed due to the presence of the user. It is this<br />

change in electric field that is measured via some type of<br />

acquisition method, referred to as a capacitance-to-digital<br />

conversion. Capacitance to digital conversion usually involves<br />

translating the sensing electrode’s capacitance into some<br />

measurable quantity (usually a time, current, or voltage that<br />

varies proportionally with the capacitance of the electrode).<br />
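As an illustration of the capacitance-to-digital idea, consider a charge-time method: the electrode is charged through a known resistance while a timer counts until the voltage crosses a threshold, so the count is proportional to the electrode capacitance (t = RC·ln(Vdd/(Vdd − Vth))). All component values below are hypothetical, not taken from any specific device.

```cpp
#include <cassert>
#include <cmath>

// Charge-time capacitance-to-digital conversion: timer ticks counted until
// the RC-charged electrode voltage crosses Vth; the count is proportional
// to C. R, Vdd, Vth and the tick period are illustrative values only.
long countsForCapacitance(double farads) {
    const double R = 100e3;     // series resistance, ohms (hypothetical)
    const double Vdd = 3.3;     // supply voltage, volts (hypothetical)
    const double Vth = 2.2;     // comparator threshold, volts (hypothetical)
    const double tick = 10e-9;  // timer resolution, seconds (hypothetical)
    double t = R * farads * std::log(Vdd / (Vdd - Vth));  // charge time to Vth
    return std::lround(t / tick);
}
```

A touch that adds a few picofarads to the electrode then shows up directly as a proportional increase in the count.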

Microcontrollers with capacitive touch sensing technology are<br />

commonly available from several integrated circuit<br />

manufacturers. These devices allow for the creation of<br />

anything from a single capacitive touch button that replaces a<br />

mechanical button, to complex designs with many buttons,<br />

positional sensors, and short-range proximity sensors. In recent<br />

years capacitive touch capable microcontrollers have become<br />

quite cost optimized, with broad portfolios available from<br />

several manufacturers to allow product designers to select the<br />

best device that meets their needs based on the complexity and<br />

budget of a given design. In this section, the key benefits of<br />

capacitive touch in electronic access control products are<br />

presented, including aesthetics, enhanced security possibilities,<br />

and improved reliability.<br />

A. Appealing Aesthetic Design<br />

When compared with mechanical pushbuttons and<br />

membrane switches, capacitive touch sensors give mechanical<br />

designers considerable freedom to improve the aesthetics of a<br />

product. Because capacitive touch sensors have no moving<br />

parts, the keypad can be designed to fit into the mechanical<br />

enclosure by taking on a variety of shapes and configurations,<br />

rather than the mechanical keypad dictating the size and shape<br />

of the enclosure. A typical capacitive touch button mechanical<br />

stack-up is illustrated in Fig. 1. The stack-up consists of a label<br />

(decal, silkscreen, or other) on top of an overlay material<br />

(typically the product enclosure), bonded to the sensor<br />

implementation, which is generally a printed circuit board<br />

assembly [3].<br />

Fig. 1 shows a typical thickness FR-4 rigid core printed<br />

circuit board, but this is not a requirement. Flexible circuits<br />

may also be utilized to create sensors that curve to match the<br />

contour of a product enclosure. If transparent sensors are<br />

required, indium tin oxide (ITO) sputtered onto polyethylene<br />

terephthalate (PET) or glass may be used. This optically clear<br />

implementation is commonly used in capacitive touch screens<br />

[3].<br />

Fig. 1. Typical Capacitive Touch Button Stackup<br />

A defining feature of capacitive touch technology is the<br />

ability to create abstract sensors beyond just touch buttons that<br />

fit the form-factor of the product in which they reside.<br />

Common examples of this include touch slider sensors and<br />

scroll wheel sensors. Fig. 2 shows how a capacitive touch<br />

button or slider sensor may be constructed to wrap around a<br />

cylindrical product enclosure. Slider sensors are capable of<br />

reporting the position of a user touching the sensor with high<br />

accuracy and resolution; positional accuracy greater than 8 bits<br />

(256 points) is not unheard of with higher performance<br />

microcontrollers.<br />
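High-resolution position reporting from a slider is typically obtained by interpolating between the per-electrode signals, for example with a signal-weighted centroid; that interpolation is what yields better-than-8-bit resolution from only a handful of electrodes. The sketch below assumes hypothetical touch-delta counts and an electrode-index coordinate system.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Estimate the touch position along a slider as the signal-weighted centroid
// of the per-electrode touch deltas (illustrative; real firmware adds
// normalization, clamping and filtering).
double sliderPosition(const std::vector<double>& deltas) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < deltas.size(); ++i) {
        num += deltas[i] * static_cast<double>(i);
        den += deltas[i];
    }
    return num / den;  // position in electrode-index units
}
```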

Fig. 2. Abstract Sensor Shape Examples<br />

Since the capacitive touch sensor itself is mounted inside<br />

the product, the exterior of the product can maintain a flush,<br />

seamless appearance. This allows smooth uninterrupted<br />

artwork to be used on the overlay material to identify key<br />

locations. Capacitive touch can also be used to “hide” the<br />

presence of keys when they are not in use. For example, a<br />

capacitive touch keypad could be implemented on an FR-4<br />

PCB and bonded to a plastic overlay material. The overlay<br />

stack may be designed with LED backlighting provisions such<br />

that the button locations and button identifying marks are not<br />

visible until the LED backlighting is illuminated. A short-range<br>

capacitive proximity sensor could be used to control<br>

whether the keypad is illuminated or not. In this way, only<br />

when a user approaches the keypad do the keys activate and<br />

become visible.<br />

As a single overlay material and PCB assembly is all that is<br />

needed mechanically to implement a complex capacitive touch<br />

interface, it is quite easy to create product variants with<br />

different colors and textures. There is no need to match the<br />

color of mechanical keys to the color of the product enclosure,<br />

for example. Likewise, the product will color-fade evenly with<br />

UV exposure because it is constructed of a single material,<br />

rather than a composite of multiple materials as would be the<br />

case with mechanical switches or a membrane switch overlay.<br />

B. Enhanced Security Possibilities<br />

Capacitive touch sensors offer the possibility of improving<br />

the security of electronic access control products by providing<br />

the ability to scramble or hide the history of previous<br />

keystrokes. A drawback to including a keypad in an access<br />

control product is that previous keystrokes can be visible if an<br />

authorized user leaves fingerprints, dirt, or grease on the keys.<br />

If a trace of the keys which are commonly pressed is visible to<br />

an intruder, it limits the possible passcode combinations that<br />

the intruder must try, making a brute force attack more<br />

feasible. There are several different ways in which capacitive<br />

touch sensing may be used to counteract this scenario.<br />

1) Use of a slider or wheel for number selection: A slider<br />

or wheel sensor may be implemented as the method for<br />

selecting a passcode character. This method is effective at<br />

hiding previous passcode character entries because the starting<br />

number or character in a list of valid characters may be<br />

randomized whenever the capacitive touch system is awoken.<br />

For example, a short-range proximity sensor could be used to<br>

wake up and activate a capacitive scroll wheel sensor. Upon<br />

wake-up, the product would randomly select a starting digit.<br />

From there, the user would use the capacitive scroll wheel to<br />

select the next digit in their passcode. This method requires<br />

the use of at least a single-digit display to give feedback to the<br />

user regarding which character they have currently selected.<br />

2) Use of buttons with scrambled values: A numeric<br />

keypad may be implemented using capacitive touch sensors<br />

with a single-digit display element present at each button, such<br />

that the numeric value corresponding with a given button is<br />

randomized for each code entry. As in method 1 above, a short-range<br>

proximity sensor may be used to wake up a hidden keypad<br />

when a user is close to the keypad. At that time, the number-to-button<br>

assignment is randomly selected and displayed for<br>

the user at that instance in time. Segmented LED displays<br />

may be used for indicating the current key mapping.<br />

Alternatively, a monochrome liquid crystal display could be<br />

utilized, with touch sensors installed over the display. In this<br />

configuration, the touch sensors would need to be optically<br />

clear, necessitating the use of ITO or another optically clear<br>

conductor for the sensors.<br />
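The scrambled number-to-button assignment described above amounts to drawing a fresh random permutation of the digits 0-9 on each wake-up. A minimal sketch follows; the RNG seeding and display update are outside the scope of this fragment, and a real product would seed from a hardware entropy source rather than a fixed value.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <random>

// On each wake-up, assign the digits 0-9 to the ten buttons in random order,
// so residue on the keys reveals which buttons are used but not which digits
// they carried. Illustrative only.
std::array<int, 10> scrambleKeypad(std::mt19937& rng) {
    std::array<int, 10> mapping{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::shuffle(mapping.begin(), mapping.end(), rng);
    return mapping;  // mapping[button] = digit shown on that button's display
}
```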

C. Improved Reliability<br />

As there are no moving parts in a capacitive touch user<br />

interface, capacitive touch offers a nearly infinite lifetime in<br />

terms of number of presses. Mechanical switchgear is often<br />

rated to a certain lifetime of presses, after which performance is<br />

not guaranteed. With a capacitive solution there is no material<br />

fatigue and no electrical contacts that can corrode over time.<br />

Capacitive touch solutions can offer improved ESD<br />

immunity, as the mechanical enclosure can be sealed tightly<br />

with the keypad sensors fully contained inside of the enclosure.<br />

Typical polycarbonate (PC) and acrylonitrile butadiene styrene<br />

(ABS) plastics offer dielectric breakdown voltages of 15kV per<br />

mm of thickness or higher [4].<br />

III. CHALLENGES OF CAPACITIVE TOUCH SENSING<br>

While there are clear benefits to using capacitive touch<br />

technology to implement a keypad in an electronic access<br />

control product, there are still challenges that need to be<br />

addressed. In this section, the challenges of power<br />

consumption, environmental influence, and system integration<br />

will be addressed.<br />

A. Power Consumption<br />

Unlike mechanical switches, capacitive touch sensors<br />

require active scanning at a periodic rate by the controlling<br />

processor. The scan rate is a configurable parameter, with the<br />

tradeoff being between response time and power consumption.<br />

Scan rates for a typical capacitive touch keypad are generally<br />

in the range of 8 Hz to 100 Hz. The operational flow of a<br />

capacitive touch controller is a loop, in which the following<br />

tasks must be performed:<br />

1. All sensors in the system must be measured (an<br />

equivalent digital value for the external capacitance<br />

being measured is obtained)<br />

2. The digital values representing the current state of each<br />

sensor are post-processed. This involves the<br />

following, at a minimum:<br />

a. Application of noise filtering to the new raw<br />

samples<br />

b. A threshold comparison between the updated<br />

filtered samples and their historical, long term<br />

references is performed on an electrode by<br />

electrode basis to determine if there was<br />

enough deviation in any of the measurements<br />

with respect to their idle state to signify that a<br />

touch or proximity event has taken place<br />

c. If a touch was detected, it is de-bounced,<br />

validated, and reported<br />

d. If no touch was detected, the historical long-term<br>

reference value for each sensor is<br>

updated to reflect any temperature or<br />

environmental drift that may be occurring<br />

Fig. 3 shows this process visually.<br />
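The per-sensor post-processing loop above can be sketched in C. This is an illustrative sketch, not code from a particular touch controller: the fixed-point IIR shifts, the threshold handling, and the assumption that a touch reduces the measured count (as in a mutual-capacitance design) are all assumptions for the example, and de-bouncing (step 2c) is omitted for brevity.<br />

```c
#include <stdint.h>

/* Illustrative per-sensor state: the IIR-filtered sample and the
 * slowly tracking long-term (un-touched) reference. */
typedef struct {
    int32_t filtered;
    int32_t reference;
} sensor_state_t;

/* First-order fixed-point IIR low-pass: out += (in - out) / 2^shift. */
static int32_t iir_step(int32_t out, int32_t in, unsigned shift)
{
    return out + ((in - out) >> shift);
}

/* One pass of steps 2a-2d for a single sensor; returns 1 on touch.
 * Assumes the measured count decreases on touch. */
int process_sample(sensor_state_t *s, int32_t raw, int32_t threshold)
{
    s->filtered = iir_step(s->filtered, raw, 2);            /* 2a: noise filter   */
    int32_t delta = s->reference - s->filtered;             /* 2b: deviation      */
    if (delta > threshold)
        return 1;                                           /* 2c: report touch   */
    s->reference = iir_step(s->reference, s->filtered, 6);  /* 2d: drift tracking */
    return 0;
}
```

In a real controller, step 2c would additionally debounce the detection over several consecutive samples before reporting it.<br />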

www.embedded-world.eu<br />




Fig. 3. Capacitive Touch Application Loop<br />

In many solutions, the measurement is controlled by a<br />

processor running acquisition software, and the measurement<br />

results are also interpreted by the processor. This means that<br />

the power consumption of the capacitive touch solution can be<br />

quite high, because the CPU is needed to perform the<br />

measurements and interpret the results on a sample-by-sample<br />

basis. What is interesting about this problem is that the<br />

keypads in smart building applications are typically used for less<br />
<br />
than 1% of total runtime: you may access your<br />
<br />
door keypad once or twice daily, for 10 seconds at a time. The<br />
<br />
rest of the time, the keypad is not being actively used, but it is<br />
<br />
still necessary to actively scan it and post-process the<br />

measurement results.<br />

To address this issue, integrated circuit manufacturers are<br />

now beginning to automate their scanning and post-processing,<br />

so that a processor does not need to wake up and execute<br />

software at all! A digital state machine can be constructed to<br />

periodically scan a set of sensors, apply an IIR filter for AC<br />

noise rejection, perform proximity and/or touch threshold<br />

detection, and apply a second IIR filter to track for changing<br />

environmental conditions- all without any software execution<br />

needed. Touch sensing microcontrollers that implement this<br />

technique have been shown to reduce average current<br />

consumption by >30% for a basic proximity sensor, and >50%<br />

for an application with 4 capacitive touch buttons [3].<br />

In addition to IC techniques for reducing power<br />

consumption, system design techniques may be used to<br />

optimize a capacitive touch system for low power. Adding a<br />

short range capacitive proximity sensor around the buttons of<br />

an electronic access control keypad can reduce average current<br />

by using the proximity sensor to wake up the keypad. In this<br />

way, it is only necessary to scan the proximity sensor regularly,<br />

until proximity is detected. When proximity is detected, all<br />

sensors in the keypad are then activated and their status is made<br />

available to the system. Fig. 4 below illustrates how this<br />

concept may be implemented in a sensor design. This specific<br />

sensor design is used in the BOOSTXL-CAPKEYPAD<br />

evaluation module [4]. This sensor design uses the mutual capacitance measurement method, in which the change in capacitance between two sensing elements (two conductors) is measured to detect touch and proximity [3].<br />

Fig. 4. BOOSTXL-CAPKEYPAD Sensing Electrode Pattern<br />

In this approach, it is still necessary to infrequently measure<br />

the entire keypad (for example, every 5 minutes) to refresh the<br />

long-term, un-touched reference values for all of the keys in the<br />

keypad. This ensures that valid references are present when a<br />

proximity event is detected. This is needed because the<br />

reference, un-touched digital capacitance values will drift as a<br />

function of IC temperature. For this power-saving technique to<br />

be effective, the proximity sensing distance must be kept small<br />

(for example, 8 centimeters or less). Achieving long range<br />

proximity sensing (for example, >10cm) requires considerable<br />

measurement resolution, scan time, and sensor area. The<br />

average power required to detect proximity at >10cm will often<br />

be larger than the average power required to measure 12<br />

capacitive touch buttons.<br />
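The wake-on-proximity policy described above can be sketched as a small scan scheduler. The structure, the names, and the tick-based refresh policy below are illustrative assumptions, not taken from any specific device; a real implementation would drive this from the scan timer interrupt.<br />

```c
#include <stdint.h>

/* Illustrative wake-on-proximity scan scheduler: only the proximity
 * channel is scanned until it detects, with an occasional full-keypad
 * scan to keep every key's long-term reference valid. */
enum scan_mode { SCAN_PROX_ONLY, SCAN_FULL_KEYPAD };

typedef struct {
    enum scan_mode mode;
    uint32_t ticks_since_refresh; /* scan periods since last full refresh */
    uint32_t refresh_period;      /* e.g. the number of periods in 5 min  */
} scheduler_t;

/* Decide what to scan this period; returns the number of sensors. */
unsigned schedule_scan(scheduler_t *s, int prox_detected, unsigned num_keys)
{
    s->mode = prox_detected ? SCAN_FULL_KEYPAD : SCAN_PROX_ONLY;
    if (s->mode == SCAN_PROX_ONLY) {
        /* Infrequent reference refresh against temperature drift. */
        if (++s->ticks_since_refresh >= s->refresh_period) {
            s->ticks_since_refresh = 0;
            return 1 + num_keys;   /* proximity + full keypad refresh */
        }
        return 1;                  /* proximity sensor only */
    }
    s->ticks_since_refresh = 0;
    return 1 + num_keys;           /* proximity + all keys active */
}
```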

Fig. 5 shows the power profile of the BOOSTXL-<br />

CAPKEYPAD EVM, which uses the MSP430FR2522<br />

microcontroller to implement a 12-key numeric keypad with a<br />

proximity sensor. In the wake-on-proximity operating mode,<br />

the BOOSTXL-CAPKEYPAD reaches approximately 8µA of<br />

average current at 3.3V. Notice how the instantaneous current<br />

may be as high as 2.3mA; this is the current during the<br />

measurement of the proximity sensor. If all 12 electrodes in<br />

the keypad were measured continuously, rather than just the<br />

proximity sensor, the time that the capacitive touch controller<br />

would spend in that high current state would be larger, leading<br />

to higher average current.<br />
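The relationship between the 2.3 mA instantaneous peaks and the roughly 8 µA average is plain duty-cycle arithmetic, sketched below. The 1 µA sleep current and 0.3% active duty cycle used in the usage example are assumed values chosen only to show that such an average is plausible; they are not measured figures from the EVM.<br />

```c
/* Duty-cycle average: I_avg = I_sleep * (1 - d) + I_active * d.
 * All currents in microamps; 'duty' is the fraction of time spent
 * in the high-current measurement state. */
double average_current_ua(double sleep_ua, double active_ua, double duty)
{
    return sleep_ua * (1.0 - duty) + active_ua * duty;
}
```

With the assumed 1 µA sleep current and a 0.3% duty cycle at 2300 µA, this gives about 7.9 µA, in line with the figure quoted above.<br />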



Fig. 5. BOOSTXL-CAPKEYPAD Proximity Sensor Power Duty Cycle<br />

B. Environmental Influences<br />

Because capacitive touch sensing is based on<br />
<br />
analyzing changes in an electric field<br />

over time, it should not come as a surprise that capacitive touch<br />

sensors can be adversely affected by environmental influences<br />

such as moisture build-up on the keypad overlay. Moisture<br />

tolerance is important because electronic access control<br />

products are often designed to be installed outdoors where the<br />

keypad may be exposed to rain water. Even in the case of an<br />

indoor-only product, the keypad will need to tolerate being<br />

cleaned with water and cleaning solution. Despite these<br />

challenges, with proper system design it is possible to develop<br />

a capacitive touch keypad that is robust in the presence of<br />

moisture due to rainfall or cleaning of the touch surface.<br />

Moisture tolerance may be improved by the addition of a<br />

guard sensing channel near the affected touch sensors. In the<br />

case of the keypad design shown in Fig. 4 with the additional<br />

proximity sensor, the proximity sensor may also be repurposed<br />

as a guard sensing channel. A guard sensing channel<br />

simply acts as a mask. If moisture builds up on the sensing<br />

overlay, it will often be present across the guard channel as<br />

well as the capacitive touch buttons themselves. If the guard<br />

channel goes into detection, that detection may be used as a<br />

mask against the buttons. When the guard channel is in detect,<br />

the keypad becomes locked and touches are not allowed. This<br />

method also works well for the cleaning scenario. If cleaning<br />

solution is applied to the touch panel, the guard channel will<br />

detect this and that information can be used to lock out the<br />

keypad.<br />
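In firmware, the guard-channel lockout described above reduces to a simple mask; representing the button detections as a bitmask is an illustrative choice, not a requirement of any particular device.<br />

```c
#include <stdint.h>

/* Guard-channel lockout: while the guard sensor is in detect (moisture
 * spread across the overlay), all button detections are suppressed. */
uint16_t apply_guard_mask(uint16_t button_detects, int guard_in_detect)
{
    return guard_in_detect ? (uint16_t)0 : button_detects;
}
```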

The mutual capacitance measurement topology can also be<br />

used to improve moisture tolerance, and even enable touch<br />

detection when capacitive touch sensors are covered in running<br />

water. In a mutual capacitance measurement, the capacitance<br />

being measured is the capacitance between two sensing<br />

electrodes. This means that the designer of the electrodes gains<br />

considerable control over the electric field when compared<br />

with the self-capacitance measurement topology, in which the<br />

electric field comes out from the sensor in all directions. It is<br />

possible to design a mutual capacitance electrode geometry that<br />

limits nearby ground and contains the electric field between the<br />

two electrodes being measured in the mutual capacitance<br />

mode.<br />

Another key benefit of the mutual capacitance<br />

measurement topology is that when water is present on an<br />

overlay panel, it increases the mutual<br />
<br />
capacitance; a touch or proximity event, by<br />
<br />
contrast, decreases the mutual<br />

capacitance. This behavior enables the processor interpreting<br />

the measurement results to differentiate between water and a<br />

valid touch, because they create different changes in the<br />

system. By actively monitoring for changes due to moisture<br />

versus changes due to a touch, it is possible to actively control<br />

the touch thresholds in a system and enable accurate touch<br />

detection even with water flowing over the capacitive touch<br />

keypad area.<br />
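Because water shifts the mutual-capacitance measurement in the opposite direction from a touch, the discrimination reduces to a sign test on the deviation from the un-touched reference. The single shared threshold below is a simplification for the sketch; real firmware would use separate thresholds and re-calibrate on a moisture event, as described above.<br />

```c
/* Sign-based classification for a mutual-capacitance sensor: a touch
 * decreases the measured capacitance, water on the overlay increases it. */
typedef enum { EVT_NONE, EVT_TOUCH, EVT_MOISTURE } touch_event_t;

touch_event_t classify(long reference, long sample, long threshold)
{
    long delta = sample - reference;
    if (delta <= -threshold)
        return EVT_TOUCH;     /* capacitance decreased: finger */
    if (delta >= threshold)
        return EVT_MOISTURE;  /* capacitance increased: water  */
    return EVT_NONE;
}
```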

Texas Instruments (TI) has subjected a 12-button numeric<br />

keypad capacitive touch design to IPX5 moisture tolerance<br />

testing, and full touch detection was possible under all IPX5<br />

test conditions. These results were achievable due to the<br />

following key system parameters:<br />

- Use of the mutual capacitance measurement topology<br />

- Use of an electrode geometry that limited ground and<br />

other conductors on the sensor layer of the PCB,<br />

leaving just TX and RX patterns<br />

- Use of moisture-specific firmware that monitors for the<br />

presence of water by looking for a reverse-touch<br />

scenario, at which point sensors are re-calibrated to<br />

operate in the presence of moisture<br />

These techniques, when combined, significantly<br />

increase the reliability of capacitive touch solutions in<br />

electronic access control products that are installed outdoors.<br />

C. System Integration and Development<br />

At first glance, integrating capacitive touch sensors into a<br />

product that previously used mechanical buttons can seem like<br />

a challenging task. Admittedly, it is hard to beat the simplicity<br />

of a mechanical push button when it comes to hardware and<br />

firmware development. On the hardware development side,<br />

capacitive touch mandates that careful attention be paid to<br />

mechanical stack-up consistency, electrode geometry, and trace<br />

routing. On the firmware side, capacitive touch often involves<br />

adding a new microcontroller to an application, which means<br />

adding a new firmware development flow.<br />

Fortunately for product designers looking to integrate<br />

capacitive touch, there has never been a better time to get<br />

started than right now. Competitive pressure on IC<br />

manufacturers has led to the creation of a significant amount of<br />



high quality literature, tools, and devices to address common<br />

system integration challenges.<br />

1) Literature: The majority of the major IC manufacturers<br />

that offer capacitive touch technology now also offer high<br />

quality literature to educate the product designer that is new to<br />

capacitive touch on best practices that address not only<br />

firmware development, but also schematic capture, PCB<br />

layout, and mechanical design. System level challenges<br />

including noise immunity, moisture tolerance, and low power<br />

design are addressed.<br />

2) Tools: Just like literature, IC manufacturers also offer<br />

platform tools that enable designers to quickly start their<br />

designs without having to be an expert on a particular<br />

technology or device. Some of these tools will even generate<br />

code for you to run on the capacitive touch controller.<br />

3) Devices: Microcontrollers with integrated capacitive<br />

touch are now available in a variety of memory densities,<br />

package sizes, and peripheral configurations. In many cases,<br />

it’s possible to find a microcontroller for an electronic access<br />

control product that can integrate the capacitive touch control<br />

with some or all of the other application functions. When it is<br />

possible to use a single microcontroller for the application<br />

functions and the capacitive touch interface, the bill of<br />

materials (BOM) cost of adding capacitive touch to a product<br />

can become quite small. IC-based capacitive sensing<br />

measurement technology has also improved considerably in<br />

recent years. Features such as parasitic capacitance offset<br />

have been implemented, enabling longer shielded capacitive<br />

sensing trace runs on PCBs without significantly increasing<br />

power consumption and measurement time.<br />

The available literature, tools and devices on the market<br />

today make integrating capacitive touch sensors into electronic<br />

access control systems easier and faster than ever. The<br />

combination of benefits provided by capacitive touch sensing<br />

now outweighs the system integration challenges of the past.<br />

IV. CONCLUSIONS<br />

Capacitive touch sensing technology is increasingly being<br />

adopted into smart home and electronic access control products<br />

due to the clear advantages it provides product designers. The<br />

technology has opened up new ways to mechanically design<br />

enclosures that must contain keypads, and the potential security<br />

and robustness benefits are desired by end users. As the<br />

integrated circuit industry continues to remove challenges to<br />

adoption by reducing average power consumption, reducing the<br />

impacts of external environmental influence, and lowering the<br />

cost of system integration, the total cost of integrating<br />

capacitive touch from a bill of materials (BOM) standpoint as<br />

well as a time-to-market standpoint will continue to decrease<br />

and adoption of the technology will continue to increase.<br />

ACKNOWLEDGMENT<br />

W.S. thanks Yiding Luo of Texas Instruments for his<br />

significant contributions to capacitive touch sensing moisture<br />

tolerance research and development.<br />

REFERENCES<br />

[1] D. Olick, “Why 2017 will finally be the year of the smart home:<br />

consumers figure it out” CNBC, Jan 2017.<br />

[2] I. Berger, “Is it smart to have a smart home?”, The Institute, IEEE, May<br />

2017<br />

[3] CapTIvate Technology Guide, Design Guide Chapter, Texas<br />

Instruments, Inc., Revision 1.60.00.00, Dec 2017<br />

[4] Electrical properties of plastic materials, Professional Plastics<br />



Accelerating 3D Graphics Performance With EGL<br />

Image on Zynq UltraScale+ MPSoC<br />

Alok Gupta<br />
<br />
Platforms Processing Group<br />
<br />
Xilinx, San Jose, CA<br />

alok.gupta@xilinx.com<br />

Abstract—When texture content is updated very<br />
<br />
often, more or less every frame, classic functions like glTexImage2D<br />
<br />
and glTexSubImage2D are very inefficient. These functions are not<br />
<br />
suitable because data is copied and converted in the drivers<br />
<br />
from CPU to GPU memory in order to be compliant with the<br />
<br />
Khronos standard, which results in lower than expected graphics<br />
<br />
rendering frame rates for video textures. Fortunately, a lesser-<br />
<br />
known solution that is more efficient exists. As with some design<br />
<br />
choices, this increase in efficiency comes with some increase in<br />
<br />
effort. This paper describes different texturing techniques<br />
<br />
with the EGLImage extension, where CPU and GPU share the same<br />
<br />
physical memory and copying data is not required, and helps users<br />
<br />
choose the proper method to avoid performance issues in<br />
<br />
certain situations.<br />

Keywords—GPU,Graphics, OpenGLES, EGL Image, Zero-copy<br />

I. INTRODUCTION<br />

Each new generation of devices comes with an expectation<br />

of better performance and user experience. An indicator of<br />

performance, that is perhaps the most easily observed by the<br />

average consumer, is the performance of 2-D and 3-D graphics,<br />

and thus, graphics performance capabilities have become<br />

paramount to the success of a new device. Most modern mobile<br />

devices, such as smartphones and tablets, are powered by SoC<br />

application processors that contain dedicated graphics<br />

processing units (GPUs). These processors, in their entirety, are<br />

designed to efficiently accelerate graphics operations while<br />

maintaining a balance between power and performance.<br />

OpenGL is an open API, standardized by a not-for-profit<br />

technology consortium called The Khronos Group, which has<br />

been in use for some time to enable developers to draw 3-D<br />

graphics on a variety of devices, and OpenGL ES is a subset of<br />

OpenGL that is designed to accommodate the unique demands<br />

of mobile devices. In OpenGL ES, every object on the screen is<br />

represented as a series of triangles each of which is defined by a<br />

set of three vertices. Images, which are referred to as textures,<br />

are transposed over the surfaces of these triangles, as determined<br />

by the application. Hundreds or thousands of these textured<br />

triangles sum to form a scene that represents anything a<br />

developer or artist could imagine.<br />

II. PROBLEM STATEMENT<br />

A. Rapid Texture Updates<br />

While textures can originally exist in a variety of different<br />

formats, they ultimately exist as raw, uncompressed, color data<br />

in memory before they are applied to an object. With high<br />

quality display resolution, the need for higher resolution textures<br />

also increases. For example, the equation below shows that a 32-bit<br />
<br />
texture the size of a 1080p display requires almost eight<br />
<br />
megabytes of memory: 1920 (x) × 1080 (y) × 4 bytes per pixel ≈ 7.9 MB.<br />
<br />
On several occasions, customers have tried to use sequences<br />

of rapidly updating textures to create animation. For example,<br />

every frame of a YouTube video is actually just a series of new<br />

textures being rapidly displayed. In order for the GPU to draw a<br />

texture on an object, it must exist in a special area of system<br />

memory called video RAM (VRAM). VRAM is technically just<br />

regular system memory, but it exists within a predetermined<br />

address range. While there are several ways to upload textures<br />

into VRAM, customers have been observed using a method that<br />

involves uploading a fresh texture for every frame. This<br />

consumes significant memory bandwidth and various other<br />

system resources. Furthermore, this method was never intended<br />

to support animation.<br />
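The memory-size arithmetic from the equation above generalizes to any uncompressed texture:<br />

```c
#include <stdint.h>

/* Raw size of an uncompressed texture: width * height * bytes per pixel. */
uint64_t texture_bytes(uint32_t width, uint32_t height, uint32_t bits_per_pixel)
{
    return (uint64_t)width * height * (bits_per_pixel / 8);
}
```

For a 1080p 32 bpp texture this yields 8,294,400 bytes, roughly 7.9 MB.<br />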

B. Pros and cons of glTexImage2D API<br />

A common way to get texture data into VRAM is through<br />

the use of the glTexImage2D function. This function is designed<br />

to upload a texture to a memory region in VRAM where it can be<br />

reused throughout the program. The benefit of this is in scenarios<br />

like scrolling, where only the offset at which the texture is<br />
<br />
displayed is altered. However, it is not suitable for<br />

situations such as animation where it would require frequent<br />

modification of the content in the texture. It was not designed to<br />

be called on every frame with updated texture content, but<br />

unfortunately this has been observed in practice. When<br />

glTexImage2D is called, the GPU driver copies the texture data<br />

into a temporary buffer, and queues it to be uploaded to VRAM.<br />

If the handle to the new texture is currently in use, as would most<br />

likely be the case with animation, the existing texture must be<br />

ghosted. This means that the current on-screen texture and the<br />

newly uploaded one must exist in VRAM at the same time. This<br />

involves allocating another buffer in VRAM. Customers<br />



generally want their applications to achieve a frame rate of 60<br />

frames per second (FPS), meaning all this needs to be done in<br />

about 16 milliseconds. The memory flow, under these<br />

conditions, can be seen in Figure 1.<br />


Performing two extra memory transfers and allocating two<br />

additional buffers are time-consuming operations. It is<br />

important to understand the implication of the glTexImage2D<br />

call and use it only in cases where the content is not frequently<br />

updated.<br />

III. SOLUTION<br />

An alternative to calling glTexImage2D to place a texture<br />

into VRAM is to make use of EGL Images. EGL Images are an<br />

OpenGL ES extension that allows the sharing of image data<br />

between two processes. OpenGL ES is very lenient with what it<br />

allows developers to do with EGL Images, and thus, developers<br />

are burdened with slightly more responsibility in exchange for<br />

increased flexibility. EGLImages are an important building block<br />

when displaying Video content as OpenGL ES textures. The<br />

reason the Khronos group came up with the idea of EGLImage<br />

was to be able to share buffers across rendering APIs (OpenVG,<br />

OpenGL ES, and OpenMAX) without the need for extra copies.<br />

IV. ADVANTAGES OF EGL IMAGES<br />

EGL Images are designed to be shared between processes.<br />

One thread can produce content into an EGL Image, while<br />

another consumes the content. For example: Thread A decodes<br />

an H.264 video stream and places the next frame’s data in an<br />

EGL Image. Thread B then displays this EGL Image, as a<br />

texture, in the YouTube application. An additional texture<br />

upload is unnecessary as the client application is directly editing<br />

memory that already exists in VRAM address space. In addition,<br />

when a texture is uploaded using glTexImage2D, the bits are<br />

automatically rearranged by the driver in a process called<br />

twiddling. This can increase performance when the texture is<br />

read multiple times, but the actual twiddling process is time-<br />
<br />
consuming and of little benefit if the texture is going to be used for only one frame. EGL Images are not twiddled by the driver upon upload. All of this results in substantial memory bandwidth and CPU usage savings.<br />

V. COMPLEXITIES INTRODUCED BY EGL IMAGES<br />

When all rendering operations for all running applications<br />

are complete for a given frame, the resulting image is stored in<br />

a buffer called the framebuffer. The framebuffer, like other<br />

graphics buffers, is stored in VRAM. Most OpenGL programs<br />

use a technique called double buffering; a technique that makes<br />

use of two separate framebuffers. The first framebuffer, referred<br />

to as the front buffer, is the buffer that is currently being<br />

displayed on the screen of the device. The second buffer,<br />

referred to as the back buffer, is the buffer that the GPU is<br />

asynchronously rendering new content into, off-screen.<br />

Each time the display updates, the front buffer and back buffer<br />

switch places. This means that the GPU never renders content<br />

directly to the screen and as a result, the user is only presented<br />

with frames that have been completely rendered. Were the GPU<br />

to render directly to the screen, data from two different frames,<br />

at some point in time, would be present on screen<br />

simultaneously. This effect, known as tearing, would be very<br />

noticeable to the user. OpenGL usually handles double buffering<br />

from behind the scenes as is the case with glTexImage2D.<br />

However, a drawback of EGL Images is that the developer must<br />

independently implement this technique if they wish to avoid<br />

tearing. This is generally accomplished by allocating two EGL<br />

Images and alternating between the two as new content is<br />

produced. In most use cases, producers of content write into<br />

these two buffers asynchronously from the consumer application<br />
<br />
that is reading them. This means that content can be produced<br />
<br />
faster than it is consumed, and the consumer discards the extra<br />

information. Clearly, the production and consumption of this<br />

EGL Image pair needs to be thread-safe. Any synchronization<br />

method can be used to accomplish this, but it is the responsibility<br />

of the developer to implement.<br />
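A minimal sketch of such a hand-rolled swap chain is shown below, using C11 atomics and plain buffer indices in place of real EGL Images. It is deliberately simplified: it shows only the publish/acquire handoff, and a production implementation would also have to stop the producer from rewriting a buffer while the consumer is still reading it (for example with a third buffer or a handoff flag).<br />

```c
#include <stdatomic.h>

/* Two-buffer swap chain handoff, with plain indices (0/1) standing in
 * for the two EGL Images. The producer publishes the index of the
 * buffer it just filled; the consumer always reads the most recently
 * published index, silently skipping frames produced in between. */
typedef struct {
    _Atomic int latest;  /* index of the most recently published buffer */
    int writing;         /* index the producer is currently filling     */
} swapchain_t;

/* Producer side: publish the finished buffer, move on to the other. */
void producer_publish(swapchain_t *sc)
{
    atomic_store(&sc->latest, sc->writing);
    sc->writing ^= 1;
}

/* Consumer side: pick whichever buffer was published last. */
int consumer_acquire(swapchain_t *sc)
{
    return atomic_load(&sc->latest);
}
```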

VI. RESULTS<br />

While EGL Images have been used successfully in many<br />

consumer devices, data sets comparing the use cases improved<br />

by EGL Images were not readily available. A sample application<br />

was written to compare EGL Image performance versus<br />

glTexImage2D under ideal conditions. The reference application<br />
<br />
uses the default OpenGL texture upload API (glTexImage2D) and<br />
<br />
serves as the baseline for benchmarking; the<br />
<br />
other application uses a zero-copy EGL Image texture<br />
<br />
and is implemented with DRM/DMA_BUF_EXT. The results<br />
<br />
are 5X faster than the classic copy implementation, as shown in<br />

Table 1.<br />




VII. PROGRAM DESCRIPTION<br />

A sample program was created to measure the performance<br />

gains realized from using EGL Images versus glTexImage2D. The<br />
<br />
application was architected so that as many components as<br />
<br />
possible could be shared regardless of the texturing method.<br />
<br />
While the data was gathered on a Linux-powered device, it was<br />

written entirely in native code. These efforts were taken to<br />

ensure minimal overhead, fair data points, and consistently<br />

reproducible results.<br />

When the application is executed using glTexImage2D, a<br />

single thread, known as the render loop, is spawned. The render<br />

loop function is executed once per frame. As seen in Figure 2,<br />

the content is generated, uploaded, and rendered serially.<br />

VIII. FOR DEVELOPERS<br />

If you are looking to update a texture in real time, note that there<br />
<br />
are two types of textures: those natively supported<br />
<br />
in OpenGL ES, and image formats not<br />
<br />
supported in OpenGL ES natively, which can be supported via<br />
<br />
additional extensions, e.g. GL_OES_EGL_image_external.<br />
<br />
Textures specified in this way can be sampled as textures or<br />
<br />
used as framebuffer attachments as if they were native objects.<br />
<br />
You cannot use a normal texture to render a camera or video preview;<br />
<br />
you have to use the TEXTURE_EXTERNAL_OES target defined by this extension.<br />
<br />
This extension provides a mechanism for creating EGLImage<br />
<br />
texture targets from EGLImages, and it defines the new<br />
<br />
texture target TEXTURE_EXTERNAL_OES.<br />

A. Example code snippets EGL/OpenGL<br />

When the application is executed using EGL Images, it<br />
<br />
spawns two threads. The first thread generates content, and the second thread, which consists of the render loop, consumes the content. The program flow for the EGL Image case can be seen in Figure 3. Note that the buffers referred to in the figure are in reference to the developer-created swap chain. An EGL Image is simply a texture whose content can be updated without having to re-upload to VRAM (meaning no call to glTexImage2D). One of the only drawbacks, besides increased code complexity, is that the application developer has to handle synchronization themselves.<br />
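The code figures for this section did not survive extraction. The fragment below is an illustrative reconstruction based on the Khronos EGL_KHR_image_base, EGL_EXT_image_dma_buf_import, and GL_OES_EGL_image_external specifications, not the paper's original snippet; `dmabuf_fd`, `width`, `height`, `display`, and `tex` are assumed to come from the application's setup code, and error handling is omitted.<br />

```
/* Import a dma-buf as an EGLImage and bind it as an external texture. */
EGLint attribs[] = {
    EGL_WIDTH,                     width,
    EGL_HEIGHT,                    height,
    EGL_LINUX_DRM_FOURCC_EXT,      DRM_FORMAT_ARGB8888,
    EGL_DMA_BUF_PLANE0_FD_EXT,     dmabuf_fd,
    EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
    EGL_DMA_BUF_PLANE0_PITCH_EXT,  width * 4,
    EGL_NONE
};

EGLImageKHR image = eglCreateImageKHR(display, EGL_NO_CONTEXT,
                                      EGL_LINUX_DMA_BUF_EXT,
                                      (EGLClientBuffer)NULL, attribs);

glBindTexture(GL_TEXTURE_EXTERNAL_OES, tex);
glEGLImageTargetTexture2DOES(GL_TEXTURE_EXTERNAL_OES, image);
/* From here on, writing into the dma-buf updates the texture with no
 * glTexImage2D call and no copy. */
```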



You also need to change your fragment shader like this,<br />

adding the #extension declaration and declaring your<br />

texture uniform as samplerExternalOES:<br />

B. Example code snippets GLSL<br />
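The shader listing itself was lost in extraction; a minimal OpenGL ES 2.0 fragment shader of the kind described above would look like the following (the varying and uniform names are illustrative):<br />

```
#extension GL_OES_EGL_image_external : require
precision mediump float;

varying vec2 vTexCoord;               // from the vertex shader
uniform samplerExternalOES uVideoTex; // the EGLImage-backed texture

void main()
{
    gl_FragColor = texture2D(uVideoTex, vTexCoord);
}
```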

CONCLUSION<br />

While EGLImages are known and understood by select<br />

graphics experts, there exists a lack of documentation that<br />

prevents many from comprehending their implementation<br />

implications and performance impact. As illustrated by the<br />

aforementioned findings, the use of appropriate texture upload<br />

methods results in a significant performance improvement. Now<br />

that the memory flow, implementation details, and performance<br />

gains have been explained in this paper, hopefully more developers<br />
<br />
will start using EGLImages; they are well supported by the Arm Mali-400<br />
<br />
GPU on the Xilinx Zynq UltraScale+ MPSoC.<br />

ACKNOWLEDGMENT<br />

The author of this paper would like to thank Yashu Gosain,<br />

Glenn Steiner and Louie Valena for providing technical counsel<br />

& feedback for completing the paper.<br />

REFERENCES<br />

[1] J. Leech, “EGL_KHR_image_base.txt,” Khronos API Registry,<br />
<br />
December 1, 2010. [Accessed September 24, 2012]<br />

[2] J. Neider, T. Davis, M. Woo, OpenGL Programming Guide, Addison<br />

Wesley, 1993<br />

[3] The Android Open Source Project, (2010) Android (Version 4.0.4)<br />

[Source Code]. Available at http://sourceandroid.frandroid.com/frameworks/base/opengl/tests/gl2_yuvtex/<br />

[Accessed September, 24, 2012]<br />

[4] EGL_KHR_image_base<br />

https://www.khronos.org/registry/EGL/extensions/KHR/EGL_KHR_im<br />

age_base.txt<br />



Multicore Approach on AUTOSAR Systems,<br />

Performance Impact Analysis<br />

Eng. Roberto Agnelli<br />

Teoresi Group S.p.A.<br />

Torino, Italy<br />

roberto.agnelli@teoresigroup.com<br />

PhD. Niki Regina<br />

Teoresi Group S.p.A.<br />

Turin, Italy<br />

niki.regina@teoresigroup.com<br />

Abstract—In multicore system applications with a high<br />
<br />
degree of complexity, the large amount of data communication<br />
<br />
and the resulting management affect performance. The<br />
<br />
complex device drivers, which do not share a core with the basic<br />
<br />
software, have to use the runtime environment, while this is not<br />
<br />
strictly necessary in a single-core approach. Moreover, in a single-<br />
<br />
core application the interaction between different software<br />
<br />
components can be realized by specific interfaces. The present<br />
<br />
article shows that the usage of a multicore approach<br />
<br />
needs to take these different problems into account, in<br />
<br />
particular by considering the software architecture. Moreover,<br />
<br />
all these perspectives will be illustrated by real examples that<br />
<br />
highlight the loss of effectiveness and performance of the<br />
<br />
multicore compared to the single-core approach.<br />

Keywords—multi-core; single-core; software architecture;<br />

software components.<br />

I. INTRODUCTION<br />

The usage of multicore architecture in the automotive<br />

sector, and in particular in the safety-oriented systems, is<br />

becoming widespread due to the increase in computing power.<br />
<br />
When facing heavy application processing and hardware<br />
<br />
management, parallel computing is considered<br />
<br />
the most efficient solution. Moreover, considering new automotive<br />
<br />
standard regulations such as ISO 26262 for safety-critical<br />
<br />
systems, the required computational redundancy forces the use of<br />
<br />
the multicore approach more often than in the past.<br />

However, while this approach leads to an<br />
<br />
increase in computational power, it is also necessary<br />
<br />
to consider the effort required for the effective co-operation of the<br />
<br />
systems used for a specific functionality. For example, the<br />

tasks and the routines of an application layer communicate and<br />

trigger several events, and consequently it is also necessary to<br />
<br />
manage the low-level hardware and the operating system.<br />

Hence, a significant effort must be considered, both for the<br />
<br />
synchronization of the events placed in the various cores and<br />
<br />
for the coordination and management of low-level access,<br />

such as CAN bus communication and peripheral devices.<br />

Nowadays, it is also necessary to consider that the<br />

automotive AUTOSAR standard defines the fundamental<br />

software architecture and it organizes and simplifies the<br />

approach to the application development. However, if the<br />

AUTOSAR standard is used in a multicore approach, it must<br />

be considered that the basic software will manage all the<br />

requests coming from several software components and the<br />

complex device drivers placed in the other cores.<br />

For example, in multicore system applications each basic<br />

software access coming from a different core engages a<br />

synchronization with the operating system. This operation is<br />

definitely heavier than the one completed with a simple task<br />

switch in a schedule table of a single core. Moreover, if the<br />

modules of the complex device driver allocated to the<br />

management of the functionalities are in different cores, they<br />

need to use the runtime environment for the co-operation. This<br />

is not strictly necessary in a single core approach. Likewise, the<br />

complex device drives have to use the runtime environment in<br />

a multicore approach context if they do not share the core with<br />

the basic software.<br />

This article highlights the problems that the application of a multicore approach must take into account, with particular attention to the software architecture. The performance losses are illustrated with real examples. In particular, the paper focuses on the degradation of effectiveness and performance by comparing the multicore software architecture with the better-known single-core one.<br />
The article is divided into seven sections. Section II analyzes the general problems of a multicore architecture; sections III to VI are dedicated to specific problems, and section VII concludes.<br />

II. MULTICORE ARCHITECTURE GENERAL PROBLEMS<br />

A. The Event-Triggered Approach<br />

In an AUTOSAR multicore software architecture, the operating system runs on each core with independent applications and is synchronized by hardware and software timing procedures; a hardware counter defines the “timing slice”.<br />

www.embedded-world.eu<br />

785


The scheduling of the tasks can follow two different approaches:<br />
• The schedule table: this approach is easier to implement and saves the operating system a great deal of effort.<br />
• The event-triggered approach: this enables specific functionalities to be triggered on request, using specific alarms.<br />

With the second approach, the operating system can also manage category 2 interrupts and start different tasks. In this way, a task executing on a specific core can be interrupted by bus messages or by high-priority requests coming from other cores.<br />
Moreover, the event-triggered approach can manage the scheduling of the different cores through specific events. It is also possible to divide a complex real-time process into several sections and split them across the different cores. These features explain the widespread usage of this approach compared to the schedule table.<br />
However, all the advantages just cited create an operating system overhead of around 20% compared to the first approach. In fact, account must be taken of:<br />
• the triggering and management of interrupts<br />
• the context switch<br />
• the determination of the request origin<br />
• the destination process<br />
• and so on.<br />
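The sources of overhead above can be made concrete with a minimal, host-runnable sketch (this is an illustrative model, not AUTOSAR OS code; the task and event identifiers are invented). A schedule table dispatches with a single precomputed lookup, while an event-triggered dispatcher must first scan pending events to determine the request origin before routing to the destination task:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only, not AUTOSAR OS code; task/event ids invented. */

#define NUM_TASKS 4

/* Schedule table: the task to run in each time slice is precomputed,
 * so dispatch is a single array lookup. */
static const uint8_t schedule_table[8] = {0, 1, 0, 2, 0, 1, 0, 3};

static uint8_t table_dispatch(uint32_t tick) {
    return schedule_table[tick % 8];          /* O(1), no origin analysis */
}

/* Event-triggered: pending events must be scanned to determine the
 * request origin before the destination task is known. */
static uint32_t pending_events;               /* one bit per source */
static const uint8_t event_to_task[NUM_TASKS] = {0, 1, 2, 3};

static void raise_event(unsigned source) { pending_events |= 1u << source; }

static int event_dispatch(void) {             /* returns task id, -1 if idle */
    for (unsigned src = 0; src < NUM_TASKS; ++src) {  /* find origin */
        if (pending_events & (1u << src)) {
            pending_events &= ~(1u << src);   /* acknowledge the request */
            return event_to_task[src];        /* route to destination */
        }
    }
    return -1;
}

/* Helper: raise an event from a given source and dispatch it. */
static int demo_roundtrip(unsigned src) {
    raise_event(src);
    return event_dispatch();
}
```

The extra scan and acknowledgement in `event_dispatch` stand in for the origin determination, interrupt handling and context switch that the real operating system must perform on every cross-core request.<br />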

B. OS Application & Memory Constraint<br />

Another important aspect of the multicore software architecture that can limit its performance is memory management and the implementation of the software structures for memory access.<br />

A multicore system uses the OS Application concept: it identifies a functional unit of software, which is then assigned to a specific core. The OS Application defines the regions in which the memory operates and allocates the tasks, the alarms and the interrupts.<br />
However, this imposes some constraints on software execution: for example, all the runnables of a software component have to belong to the same OS Application. This is a remarkable difference between single-core and multicore applications. In the former, all the processes can belong to the same OS Application, and all the resources can be shared with a gain in performance. In the latter, processes that belong to different cores sit in different OS Applications, and the need for communication between these cores leads to a significant decrease in performance.<br />
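The constraint can be stated in a few lines of C (an illustrative model only; the OS Application descriptors and core assignments below are invented, not generated AUTOSAR configuration): a runnable's OS Application fixes its core, so a call between runnables whose OS Applications sit on different cores necessarily goes through the cross-core machinery.<br />

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only; OS Application and core assignments invented. */

typedef struct {
    int core;            /* core this OS Application is bound to   */
    int mem_region;      /* memory region its tasks/alarms live in */
} OsApplication;

static const OsApplication os_apps[] = {
    { .core = 0, .mem_region = 0 },   /* OsApp 0: AUTOSAR core   */
    { .core = 1, .mem_region = 1 },   /* OsApp 1: secondary core */
};

/* A call between runnables stays cheap only if both OS Applications are
 * bound to the same core; otherwise cross-core communication is needed. */
static bool needs_cross_core(int caller_app, int callee_app) {
    return os_apps[caller_app].core != os_apps[callee_app].core;
}
```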

C. Scheduling<br />

In a single-core application it is possible to design the software architecture so that complex procedures, for example those which require frequent interaction with each other, are in the same task of the OS Application. This design method allows RAM to be shared with a high level of performance, and several applications use this approach with proven results.<br />

On the contrary, in a multicore software architecture the complex procedures that need to interact across different cores necessarily sit in different OS Applications, and consequently in different software stacks.<br />
For example, if a runnable in a different OS Application is called through an alarm, a context switch is started, with non-negligible processing time and memory costs. Moreover, the AUTOSAR standard foresees that specific consistency mechanisms ensure communication between different OS Applications without data corruption; this adds a further increase in computational load.<br />
Comparing the single-core and multicore software architectures, the following important differences emerge:<br />

• Single core: it is possible both to predict the correct task sequences through the schedule table and to optimize system performance by minimizing context switches.<br />
• Multicore: if high interaction between different cores is needed, it is not easy to predict the scheduling of each core. Moreover, possible occurrences and worst cases must be taken into account in order to set up the deadlines for all the tasks correctly.<br />

D. Memory Boundary<br />

Sharing data between different cores creates problems of authorization and contention for memory accesses. The AUTOSAR safety architecture implements a memory protection mechanism that acts on a specific peripheral of the microprocessor, the Memory Protection Unit (MPU). The MPU is used when AUTOSAR requires an operating system scalability class of three or higher.<br />
The MPU is responsible for assigning exclusive access permissions to a specific memory area and for protecting it from accesses belonging to other OS Applications. Configuring it correctly can take a long time.<br />

As for the OS Application and memory constraints, task scheduling differs between single-core and multicore software architectures. In the former it is possible to work within a single OS Application and consequently within a single memory region. In the latter it is much more complicated: exclusive regions must be defined where the different processes can share their data. In this case there are specific buffer memory regions with shared access permissions (see Fig. 1). Here, support variables are widely used, protected by semaphores during write and read operations.<br />

As highlighted in the previous section, the multicore software architecture suffers from this problem. The MPU and its supporting machinery bring an overhead into the system due to the copies, the procedures and the other operations necessary to preserve the integrity of the data. Several tests make it clear that the advantage of the parallelism introduced by the multicore approach is considerably reduced by all the structures and operations needed for the interaction of the different processes.<br />

Fig. 1 shows the shared memory concept. If data in application core 1 must be used by application core 2, the following operations must be carried out:<br />
1. Copy of the data<br />
2. Sharing of the memory region<br />
3. Access to the data from the other core<br />
Fig. 1 Shared memory<br />
III. IOC<br />
In the previous chapter some of the principal problems have been highlighted. However, the AUTOSAR standard provides specific tools to manage the requests between different cores. The most important is the Inter OS Application Communicator (IOC).<br />
This tool is provided as part of the multicore operating system. It can be used through two different elements: the RunTime Environment (RTE) software component, and specific low-level sections of the software; these elements have to be part of the AUTOSAR stack or of a Complex Device Driver.<br />
An IOC transfer proceeds in three phases:<br />
• Protection with a spinlock for the integrity of the data. In this phase the source process takes the spinlock in order to avoid concurrent access.<br />
• Copy of the data into the shared buffer. Once the first phase concludes, the data is copied towards the destination process; at the end of this phase the spinlock is released.<br />
• Production of the trigger event for the destination process. At the end of the procedure, the source process sets a trigger that allows the destination process to schedule the reading of the data.<br />
It is important to underline that an RTE procedure does not require the IOC if both processes are in the same OS Application, or better still on the same core; in this case the RTE performs very well. Hence, as the number of interactions between elements or processes located on different cores grows, the communication overhead increases, reducing the advantage of multicore parallelism.<br />
Fig. 2 depicts the IOC concept. Processes on the same core can operate in the same OS Application with specific tools. On the contrary, the use of the IOC is mandatory if the processes communicate across different cores.<br />
Fig. 2 IOC Architecture<br />
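The three phases of an IOC transfer can be sketched on a desktop host with C11 atomics standing in for the OS spinlock and event services (a minimal sketch only; the channel layout and function names are invented, whereas a real IOC is generated from the AUTOSAR configuration):<br />

```c
#include <assert.h>
#include <stdatomic.h>

/* Host-runnable sketch of the three IOC phases; names are invented and a
 * real IOC is generated by the AUTOSAR tooling, not hand-written. */

typedef struct {
    atomic_flag lock;        /* phase 1: spinlock protecting the buffer */
    int         buffer;      /* shared buffer between the two cores     */
    atomic_int  data_ready;  /* phase 3: trigger event for destination  */
} IocChannel;

static IocChannel ch = { .lock = ATOMIC_FLAG_INIT };

/* Source core: take the lock, copy into the shared buffer, release the
 * lock, then raise the trigger event. */
static void ioc_send(int value) {
    while (atomic_flag_test_and_set(&ch.lock)) { /* spin */ }
    ch.buffer = value;                       /* phase 2: copy          */
    atomic_flag_clear(&ch.lock);             /* release the spinlock   */
    atomic_store(&ch.data_ready, 1);         /* phase 3: trigger event */
}

/* Destination core: once the trigger is seen, copy the data out under
 * the same lock. Returns 1 if data was read, 0 if nothing was pending. */
static int ioc_receive(int *out) {
    if (!atomic_exchange(&ch.data_ready, 0))
        return 0;
    while (atomic_flag_test_and_set(&ch.lock)) { /* spin */ }
    *out = ch.buffer;
    atomic_flag_clear(&ch.lock);
    return 1;
}

/* Helper: one full send/receive round trip. */
static int ioc_demo(void) {
    int v = -1;
    ioc_send(42);
    (void)ioc_receive(&v);
    return v;
}
```

Even in this toy form, every transfer costs a lock acquisition, a copy and a trigger, which is exactly the overhead the paper attributes to cross-core communication.<br />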

IV. INTER PROCESS COMMUNICATION<br />

In the AUTOSAR standard the RTE is responsible for the software management of the functionalities. All the software and hardware components below this “mask” can be considered to be at the same level from the system architecture point of view. In particular, the RTE provides interfaces that standardize the communication method, so the interactions between the different software components look similar in a single-core or multicore approach.<br />
However, it is important to highlight the main difference between the two methods: in the single core, resource management is usually focused on execution time, whereas the multicore approach gives prime importance to the parallelism of the processes.<br />



Moreover, a multicore software architecture needs the IOC, as described in the previous chapter. In a single core, software components that need to share information or functionality use direct client/server or sender/receiver ports, which saves execution time compared to the multicore approach. Hence, even though the RTE interfaces are identical in the multicore software architecture, all the core-to-core communication problems already cited remain.<br />

Fig. 3 Application-Application Communication<br />

Fig. 3 shows the multicore software architecture. If the software components are on the same core there is direct communication, always managed by the RTE. If cross-core communication between two software components on different cores is needed, the IOC must be used, with all the problems cited in section III.<br />

V. CDD-CDD COMMUNICATION<br />

The Complex Device Driver (CDD) is responsible for all the elements that cannot be managed through the basic software modules. Each feature not present in the AUTOSAR stack, such as access to sensors, actuators or peripheral devices, is implemented in a CDD. Moreover, a CDD contains several code sections that do not belong to a specific software component. The code and elements in the CDDs represent a significant portion of the ECU software. In a multicore approach they are distributed across the cores according to peripheral availability or load distribution: functionalities that require a high load go to secondary cores, while safety functionalities go to lockstep cores.<br />
For example, in a multicore approach the sensor functionalities can be split across cores: one core can be responsible for accessing and managing the peripheral device, parsing and reading the data, while a lockstep core can be used for the safety functionalities.<br />

In this scenario too, CDD-to-CDD communication suffers from the problems already cited. In a single core, global variables or specific function calls can be used for data communication. On the contrary, in a multicore software architecture the resources (processing time, memory variable allocation) are independent. As for inter-process communication, the IOC services of the operating system must be used: the CDD sections necessarily pass through the RTE and then use the core-to-core communication.<br />

Fig. 4 represents the CDD communication process. If the two CDDs are on the same core, the information exchange is faster and easier. On the contrary, if two CDDs on different cores need to communicate, the use of the IOC is essential, which is an additional problem in terms of performance.<br />

VI. BASIC SOFTWARE COMMUNICATION<br />

The functionalities of the AUTOSAR stack cover all the relevant aspects, such as communication, diagnosis, ECU management and memory access. All these aspects are made available via a vendor tool and are integrated with the RTE and the operating system.<br />
The basic software resides on a single specific core, and any access to the stack functionalities is accomplished inside it. Any request from a secondary core to a basic software functionality must pass through all the cross-core communication tools, and the problems cited previously arise (memory accesses, delayed task activation, changes of scheduling, and so on). On the contrary, in a single-core software architecture an API can activate some of the modules, at least for the CDDs, followed by the call to the RTE for the software.<br />

As an example, consider the Diagnostic Event Manager of AUTOSAR, the module responsible for diagnosis, which collects all the problems of a running execution. In a multicore software architecture, if a software component on a secondary core needs to raise a diagnostic trigger, the access must always follow the same procedure: a cross-core call through the RTE, with the possible overhead already discussed.<br />
Accesses from the basic software or the CDDs to the AUTOSAR core can use a specific diagnostic API module available to the low-level software. This option is definitely useful for managing the various events and improves performance by saving processing time. On the contrary, accesses from the AUTOSAR core to the secondary cores use the Diagnostic Communication Manager (DCM) module, which is responsible for the services activated by the UDS protocol. The DCM service also gives the user some utilities for diagnosis testing and debugging.<br />

In a multicore software architecture it can be necessary to activate, on the secondary cores, some functionalities managed by the UDS protocol. Two different cases can be considered:<br />
1. If the software routine runs on a secondary core, the activation follows the cross-core communication procedure already described.<br />
2. If the software routine is on the same core as the basic software, the messages are received and processed by immediately triggering an RTE call to the specific function.<br />
Fig. 4 CDD-CDD Communication<br />

Fig. 5 Basic Software Communication<br />

Fig. 5 describes the basic software communication in a multicore software architecture. In this picture, Core 1 is the AUTOSAR core. For a call from OS Application 2 to a functionality of CDD 2, it is possible to bypass the IOC and use a direct API call. On the contrary, a call from a software component in OS Application 1 must pass through the IOC, with all the problems already described.<br />
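The routing rule of the two cases above can be captured in a tiny dispatcher (a hedged sketch only: the function, enum and core names are invented, and the real DCM/RTE interfaces are generated from the AUTOSAR configuration):<br />

```c
#include <assert.h>

/* Illustrative dispatcher only; names are invented, the real DCM/RTE
 * interfaces are generated from the AUTOSAR configuration. */

#define BSW_CORE 0   /* core hosting the basic software */

typedef enum { VIA_DIRECT_RTE_CALL, VIA_CROSS_CORE_IOC } ActivationPath;

/* Case 1: routine on a secondary core -> cross-core procedure (IOC).
 * Case 2: routine co-located with the basic software -> immediate RTE
 * call, with no IOC overhead. */
static ActivationPath activate_uds_routine(int routine_core) {
    return (routine_core == BSW_CORE) ? VIA_DIRECT_RTE_CALL
                                      : VIA_CROSS_CORE_IOC;
}
```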

VII. CONCLUSION<br />

The choice between a single-core and a multicore software architecture needs particular attention. Even if the multicore approach is becoming more popular thanks to process parallelism and higher computing power, it does not always deliver better performance than the more classical single-core approach.<br />

All the problems highlighted in this article underline this last statement. The overhead created by communication, synchronization and memory management increases the load on the operating system. Before choosing between a single-core and a multicore architecture, it is important to consider the interactions between the cores and how the processes are placed on them.<br />

Given all the problems cited in this article, it is fundamental to have a clear view of the microprocessor and the software architecture in order to estimate the operating system load in the multicore case: if the processes are independent, the multicore solution is preferable. Otherwise, the advantages of the multicore shrink and the classical single-core architecture still represents an optimal choice.<br />

REFERENCES<br />
[1] M. Becker, D. Dasari, V. Nélis, M. Behnam, L. M. Pinho, T. Nolte, “Investigation on AUTOSAR-Compliant Solutions for Many-Core Architectures,” 2015 Euromicro Conference on Digital System Design, pp. 95-103.<br />
[2] AUTOSAR, standard 4.2, www.autosar.org.<br />
[3] ISO/DIS 26262-1 – Road Vehicles – Functional Safety, International Organization for Standardization / Technical Committee Std., 2009.<br />
[4] B. B. Brandenburg, J. H. Anderson, “On the Implementation of Global Real-Time Schedulers,” 2009 30th IEEE Real-Time Systems Symposium.<br />
[5] R. Nicole, “Comparison of Service Call Implementations in an AUTOSAR Multi-Core OS,” 9th IEEE International Symposium on Industrial Embedded Systems (SIES), June 2014, pp. 199-205.<br />
[6] AUTOSAR – Guide to Multi-Core Systems, AUTOSAR Std. V1.1.0, Rev. R4.1 Rev3, 2014.<br />



Applied Machine Learning on Low-energy<br />

Platforms<br />

Running Machine Learning Optimally on Heterogeneous, Low-energy Arm Platforms<br />

Robert Elliott<br />

Technical Director, Machine Learning<br />

Arm Ltd.<br />

Cambridge, England<br />

Mark O’Connor<br />

Director, Deep Learning<br />

Arm Ltd.<br />

Grasbrunn, Germany<br />

Neural network frameworks such as TensorFlow, PyTorch and<br />

Caffe have revolutionized machine learning and computer vision on<br />

desktop PCs and on servers in the cloud, and are poised to do the<br />

same at the edge. But running these frameworks optimally and<br />

within a low-power budget provides one of the biggest challenges yet<br />

for developers.<br />

To help, modern systems-on-chip (SoCs) offer a variety of<br />

processor core types – CPUs, GPUs, DSPs and other accelerators –<br />

each suited to different parts of typical machine learning pipelines.<br />

But mapping these frameworks to run seamlessly across these cores,<br />

whilst minimizing power-sapping operations such as memory copies,<br />

can be complex and time-consuming to implement. Optimizing for<br />

one platform is often challenge enough, but with such a huge variety<br />

of potential target platforms, the prospect of optimizing for each one<br />

limits the feasibility of write-once applications that run optimally<br />

across multiple devices.<br />

This paper looks at the development environments available on<br />

low-energy platforms and how middleware libraries can simplify the<br />

process of reaching high efficiency. It also explores the tools and<br />

techniques available for deploying a neural network on these<br />

platforms. This is illustrated with examples, highlighting some of the<br />

work Arm is doing to enable machine learning wherever compute<br />

happens.<br />

Finally, we look ahead to approaches being proposed for future<br />

heterogeneous systems and future network optimization techniques<br />

which could provide significant performance and efficiency<br />

improvements in future systems. What will the implication be for the<br />

world of machine learning once these tools and frameworks have<br />

evolved, enabling this exciting set of new use cases across embedded<br />

devices of all shapes and sizes?<br />

Machine Learning; TensorFlow; Caffe; Mobile; Android; Mali<br />

ARMv8.2-A; Arm Cortex-A; Arm Cortex-M; Neural Network;<br />

Artificial Intelligence; Arm Accelerator; Arm Compute Library;<br />

Arm inference engine; ACL<br />

I. MACHINE LEARNING’S ADOPTION AND IMPORTANCE<br />

Machine learning is the term of the moment; no matter which<br />

part of technology you work in it’s likely that you’re hearing a<br />

lot about it. You’d be forgiven for thinking it’s a fad, but there<br />

has been such a rapid adoption of machine learning algorithms<br />

for solving key problems that it’s proving to be much more than<br />

that. At Arm, we’re in the fortunate position of being able to<br />

observe this adoption over a wide array of markets, from mobile<br />

phones and smart homes to agriculture and servers. One thing is clear: machine learning is solving real problems – such as face recognition, object detection and scene segmentation – with amazing accuracy [1][2][3][4][5]. It has also become clear that the<br />

availability of large data sets, along with improving techniques<br />

and network complexity, is making it possible to deploy<br />

machine learning on embedded SoCs.<br />

Over the past few years it has become possible to beat human accuracy in a number of applications. For some key problems, such as identifying objects and understanding spoken words, the problem seems solved at the algorithmic level; a fantastic summary of the problems being solved, and why you should care, is maintained by the EFF [1], providing insight into why everyone should be paying more attention to the advances in this field. More recently there has also been significant work on reducing computational requirements [7][6] to make these solutions work within very limited processing budgets [8][10]. In practice, this means we have entered the age of machine learning on practically any SoC, and on many devices within that system.<br />

II. MODERN SYSTEM-ON-CHIP DESIGNS<br />

As most will know, modern SoCs comprise a number of key parts common to most designs, along with other elements chosen to solve the problems specific to the target market. The key components of a design are the memory subsystem and CPU, and often a GPU and display controller. There are then more market-specific functions, such as high-resolution video decode and image processing for camera sensor input.<br />

For machine learning this has meant two things: firstly, due to its pervasiveness, all devices in the embedded platform need good machine learning performance. Secondly, the stretch for performance density in the most demanding cases is, once again, pushing into the world of dedicated accelerators that can make the hard domain-specific decisions needed.<br />

Figure 1: A modern embedded or mobile System on Chip<br />

The recent trend, which we argue will not be short-lived, is the introduction of optimizations and dedicated accelerators for machine learning. Since the applications of machine learning are so wide-ranging, the expectation is that many future systems – whether they contain accelerators or not – will have a high requirement for machine learning performance. This leads to the need to achieve high performance across all compute devices present in the system, be they CPUs with dedicated instruction sets, GPUs with extended fixed-function units for ML algorithms, or bespoke hardware accelerating one or more classes of neural network.<br />

A. Fitting machine learning use cases to platforms<br />

You may be wondering what kinds of use case can map to the wide array of devices in an SoC. In practice there are quite a few promising areas, from the very small to the very big, where traditional algorithms can be improved, or new algorithms deployed, that exploit machine learning. The key benefit is replacing the complex, fragile, laborious and impenetrable sequences of conditional logic built up to manage the complex behaviors arising from seemingly simple inputs.<br />
1) Microcontrollers<br />

• Power management and scheduling – when deciding how and when to adjust operating points, reassign tasks to different cores and balance throughput against efficiency, schedulers can benefit from training on known good behaviors, followed by reinforcement learning to become better suited to the specific device they are running on. This approach can also use data for which it would be impossible to code rules, such as detailed cache state and the access patterns of programs.<br />
access patterns of programs.<br />

• Security auditing – where patterns of behavior of a<br />

system in normal use can be observed and abnormal<br />

behaviors can be caught quickly[16].<br />

• Object detection – to enable low power modes as part of<br />

complex actions, such as unlocking a mobile phone or<br />

taking a camera image, without using up a battery too<br />

quickly, or basic object detection to reduce false-positives<br />

when waking a camera (over and above a simple PIR or<br />

change detection).<br />

• Key-word or Key-noise spotting[20] – as part of a<br />

connected world, having simple detection devices spread<br />

out as part of an interactive environment or as part of a<br />

security system detecting atypical sounds like breaking<br />

glass.<br />
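The microcontroller use cases above typically run small quantized networks. As a flavor of the inner kernel involved, here is a minimal 8-bit integer fully-connected layer in plain C, of the sort a key-word spotter evaluates per audio frame (an illustrative sketch only: the power-of-two requantization scheme, the sizes and the sample weights are invented, not taken from any particular library):<br />

```c
#include <assert.h>
#include <stdint.h>

/* Minimal int8 fully-connected layer; quantization scheme (power-of-two
 * rescale) and all dimensions are illustrative assumptions. */

#define IN_DIM  4
#define OUT_DIM 2

static void fc_int8(const int8_t in[IN_DIM],
                    const int8_t weights[OUT_DIM][IN_DIM],
                    const int32_t bias[OUT_DIM],
                    unsigned shift,              /* requantization shift */
                    int8_t out[OUT_DIM]) {
    for (int o = 0; o < OUT_DIM; ++o) {
        int32_t acc = bias[o];                   /* 32-bit accumulator   */
        for (int i = 0; i < IN_DIM; ++i)
            acc += (int32_t)in[i] * weights[o][i];
        acc >>= shift;                           /* rescale back to int8 */
        if (acc > 127) acc = 127;                /* saturate             */
        if (acc < -128) acc = -128;
        out[o] = (int8_t)acc;
    }
}

/* Helper: run the layer on invented sample data; packs both outputs
 * into one int for easy checking. */
static int fc_demo(void) {
    const int8_t in[IN_DIM] = {1, 2, 3, 4};
    const int8_t w[OUT_DIM][IN_DIM] = {{1, 1, 1, 1}, {2, 0, -2, 0}};
    const int32_t b[OUT_DIM] = {0, 4};
    int8_t out[OUT_DIM];
    fc_int8(in, w, b, 1, out);
    return out[0] * 1000 + out[1];
}
```

Keeping the accumulator at 32 bits while storing weights and activations at 8 bits is what lets such kernels fit the memory and power budgets of microcontroller-class devices.<br />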

2) CPUs<br />

• Implementation of SLAM techniques for autonomous<br />

vehicles – navigating the world is possible on CPUs – not<br />

always for fast moving cars or drones, but for those with<br />

constrained movement, it’s often sufficient to work on<br />

CPUs.<br />

• Simple natural language processing for robotics and<br />

home devices – recognizing requests, particularly<br />

complex or compound statements, needs to work even<br />

when network connectivity is spotty or unavailable. This<br />

is possible on CPU or GPU today.<br />

• Face recognition/identification – allowing entry to<br />

shared areas or recording participants of a meeting while<br />

maintaining privacy by keeping data local.<br />

3) GPUs<br />

• Complex NLP – answering questions rather than acting<br />

on commands.<br />

• Secure face identification – the additional steps taken for<br />

anti-spoofing and low delay for the user when being used<br />

for activities like unlocking a mobile phone.<br />

• Image processing (such as style transfer networks[15])<br />

and face point registration – used extensively in the<br />

social media world to provide entertaining content for<br />

users<br />

4) Dedicated Accelerators<br />

• High resolution scene extraction – for safety during fast<br />

movement, automobiles require low latency to respond<br />

quickly. Moving this processing to dedicated accelerators<br />

can notably reduce cost.<br />

• Complex NLP for non-connected responses – the cost<br />

of processing large amounts of audio can be a notable<br />

TCO expense for busy services. Moving this to more<br />

efficient accelerators can be driven by cost.<br />

• Faster face identification (mobile phone face unlock)<br />

– premium experiences where fast response to the user is<br />

key.<br />

III. INTRODUCTION TO SOFTWARE APPROACHES<br />

To manage all of the aforementioned devices and provide a manageable experience, we also need a stable, performance-optimized software stack that does some of the heavy lifting of device selection and routine tuning.<br />

There are many ways to deploy machine learning on embedded platforms. Today the most common are bespoke frameworks, which deliver performance on specific platforms or processors, or running the full framework on the CPUs available in the platform. The first of these choices has issues of portability; the second, issues with the level of optimization of the software. To help with this problem, we are curating a list of frameworks and libraries with support for Arm hardware at:<br />
https://developer.arm.com/technologies/machine-learning-on-arm/frameworks<br />

For the widest deployment of a network architecture where the goal is functionality and reasonable performance, using a full machine learning framework deployed with a general backend is the best choice today. This provides the widest functional support, and the ability to modify and experiment with networks if needed.

However, this route does not provide the maximum performance possible. For that, it's necessary to focus on optimized libraries with a narrower applicability – something that requires a specifically optimized inference engine. This is the approach being taken [21] for Android deployment, and something which Arm is making available for other embedded platforms via the Arm software stack, which we will explore further here.

For ongoing optimizations of machine learning primitives and a stable inference engine, the approach we are supporting is:

Arm's inference engine – an optimized inference engine for 32-bit float and 8-bit integer:
https://developer.arm.com/technologies/arm-nn-sdk

Arm's Compute Library – optimized low-level routines for computer vision and machine learning, focusing on CNNs for 32-bit float and 8-bit integer today, across a wide array of Arm CPUs and GPUs:
https://developer.arm.com/technologies/compute-library
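To make concrete what such low-level routines compute, here is a naive reference 2D convolution for a single channel. This is purely an illustrative sketch of the arithmetic (our own code, not a Compute Library routine); the point of an optimized library is to produce the same result with vectorized, cache-blocked kernels rather than this simple loop nest.

```cpp
#include <cassert>
#include <vector>

// Reference single-channel 2D convolution (valid padding, stride 1).
// Input is inH x inW, kernel is kH x kW, both stored row-major.
std::vector<float> conv2d(const std::vector<float>& in, int inH, int inW,
                          const std::vector<float>& k, int kH, int kW) {
    const int outH = inH - kH + 1;
    const int outW = inW - kW + 1;
    std::vector<float> out(outH * outW, 0.0f);
    for (int y = 0; y < outH; ++y)
        for (int x = 0; x < outW; ++x)
            for (int ky = 0; ky < kH; ++ky)
                for (int kx = 0; kx < kW; ++kx)
                    out[y * outW + x] +=
                        in[(y + ky) * inW + (x + kx)] * k[ky * kW + kx];
    return out;
}
```

Optimized implementations often lower this loop nest to a GEMM (matrix multiply) so that the same highly tuned kernel serves many layer shapes.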

A. The different approaches available for machine learning deployment in an embedded system

1) Direct integration
The direct integration approach looks to embed libraries and routines directly into the codebase of the machine learning framework. This means either call-by-call operation on the layers of a neural network, or a runtime handover of a graph representing the network to be operated upon.

This might, at first, seem like a promising approach to keeping the development environment the same, but in practice we often find that overheads introduced by running the full framework are detrimental to overall performance or impractical for shipping in production. This approach also limits the time available for optimization, as there is no offline step where the full neural network graph can be observed and modified for better performance.

One area where this approach is particularly useful, however, is training, where full flexibility and access to the vast set of already implemented operators is key.

2) Importing from a graph description file
The file import approach takes the output graph and trained weights from a machine learning training framework and converts this to target a specifically designed inference engine. This conversion can happen as a compilation step or as a runtime step. The inference engine is used to run one or more graphs by implementing a subset of the functionality found in a full training and inference machine learning framework.

This allows for a much smaller runtime, where the capabilities are constrained to meet memory limitation targets, and also provides an opportunity to perform an offline optimization stage and further improve performance. Critically, this also allows a decoupling of the machine learning training framework from the production deployment, which tends to improve the design and efficiency of inference engines targeting different platforms.

Quite often this approach is the right practical choice for deploying machine learning solutions today.
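The decoupled structure just described can be reduced to a small 'parse, optimize, run' skeleton. The sketch below is entirely hypothetical (the types and names are ours, not any real engine's API, and it maps scalars rather than tensors); it shows only the shape of such an engine.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical miniature inference engine illustrating the
// import -> optimize -> run split. Names are illustrative only.
struct Node {
    std::string op;
    std::function<float(float)> fn;  // toy: each op maps float -> float
};

struct Graph {
    std::vector<Node> nodes;
};

// "Import": build a graph from a (toy) trained-model description.
Graph importGraph() {
    return {{{"scale",    [](float x) { return x * 2.0f; }},
             {"identity", [](float x) { return x; }},
             {"bias",     [](float x) { return x + 1.0f; }}}};
}

// "Optimize": an offline pass over the whole graph, e.g. dropping no-ops.
Graph optimize(const Graph& g) {
    Graph out;
    for (const auto& n : g.nodes)
        if (n.op != "identity") out.nodes.push_back(n);
    return out;
}

// "Run": execute the optimized graph on an input (here, a scalar).
float run(const Graph& g, float x) {
    for (const auto& n : g.nodes) x = n.fn(x);
    return x;
}
```

The key property is that `optimize` sees the whole graph before any execution, which is exactly the offline step the direct-integration approach lacks.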

3) Compilation flows
Compilation flows, though long imagined, are just beginning to appear and represent a promising approach to higher performance when running neural networks. The two key advantages are exploitation of the existing corpus of compilation expertise in compiler frameworks and developers (the graph optimization problem is well known in these circles), and the ability to properly represent and optimize tensor processing and its mapping onto modern CPUs, GPUs and hardware accelerators.

The compilation of graphs can result in the fusion of operations, which reduces memory footprint (also exploited in other approaches, but potentially more thoroughly with a compiler flow). It can also reduce active memory footprint and bandwidth, by working on a smaller data set which fits in cache and makes better re-use of data loaded from memory.

Today, however, there are still hurdles to overcome in the optimization process, and it's often more practical to work with direct integration, as this allows full flexibility for training.
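A concrete instance of the fusion mentioned above is folding a batch-norm into the preceding convolution or fully-connected layer, so one fused operation replaces two passes over memory. The sketch below shows just the arithmetic of the fold for a single weight (our own illustration, not engine code).

```cpp
#include <cassert>
#include <cmath>

// Batch norm applied after a linear op:
//   y = gamma * (w*x + b - mu) / sqrt(var + eps) + beta
// folds into a single linear op y = w'*x + b' with:
//   s = gamma / sqrt(var + eps),  w' = w * s,  b' = (b - mu) * s + beta
struct Folded {
    float w;
    float b;
};

Folded foldBatchNorm(float w, float b, float gamma, float beta,
                     float mu, float var, float eps = 1e-5f) {
    const float s = gamma / std::sqrt(var + eps);
    return {w * s, (b - mu) * s + beta};
}
```

In a real network the same fold is applied per output channel of the convolution, and the batch-norm layer disappears from the graph entirely.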

B. System software design

Practically speaking, any of the aforementioned approaches is a reasonable way to get machine learning networks running on an embedded platform, and Arm's approach has been to develop a low-overhead inference engine with the ability to import from file. This allows the same framework to target both Cortex-A class cores, found in high-end mobile, and Cortex-M class cores, found in processing environments with just kilobytes of memory to play with.

Arm expects machine learning to become a natural part of programming environments, requiring support not just for large networks executing on accelerator hardware, but also for tiny embedded networks, situated as a natural part of the program execution [13]. Being in a position to specify the system design and software allows us to design and balance every element to ensure the most efficient and cost-effective designs that meet the rapidly evolving needs of a machine learning-based world.

Figure 2: Arm's machine learning software and platform stack

As mentioned in the introduction, machine learning is seen to have relevance to all classes of device, and so naturally the software seeks to enable and exploit this. To translate this view of the world into a useable software stack, we have developed the Arm inference engine to allow work to be distributed to devices and take advantage of the key optimizations of each.

Figure 3: Arm's inference engine software

IV. ACCELERATING NEURAL NETWORKS

To get the highest performance, it's often preferable to run a smaller codebase that is dedicated as an inference engine, rather than a full machine learning framework. This approach requires preparation of a model and weights in a training environment, using a machine learning framework, then taking this model and weight set and using a converter to prepare an optimal representation for the inference engine.

Figure 4: The Arm usage flow

The major stages used in this process are:

• Import or build the graph
  o Take the graph as input from a TensorFlow pb file or Caffe caffemodel/prototxt
  o Alternatively, build the graph 'manually' from within your application using the runtime graph-building API
  o This graph represents the network architecture and the weights to be used
• Run the optimization process to allow the engine to optimize the graph and operators
  o Optimize the graph, replacing suitable sequences with single operations, fusing stages
  o This emits the tuned graph object, which can be passed input and output tensors for an instance of processing
• Run the inference process on the optimized graph
  o This can be repeated as needed with additional input data

A framework that targets all devices in the system makes it easy to select and schedule between them. We're working on optimizing each of these paths to deliver maximum performance on every device. We have also open-sourced our Compute Library with optimized primitives under a permissive license, so you can take a look, compile it for other platforms and provide feedback.

A. Heterogeneous performance
Our experience with high-performance math libraries shows that – despite the promise of advanced compilation flows for multi-device targeting – well-tuned computational primitives and operators are required to unlock peak performance from hardware. Over time, these common libraries also result in a benchmark that allows new designs to be focused and optimized. In effect, this ensures that, over time, these libraries will provide continued good performance over multiple releases, new hardware designs and new versions of the software stack.

Arm's inference engine library, working on full network definitions, also allows us to target workloads to the right device, or parallelize across devices, where it makes sense. This allows for both more efficient execution when targeting the optimal device for a workload, and higher throughput if speed of work completion is key. In addition, the ability to easily select between optimized devices within a framework means that portability is notably easier.

Today, choosing the right device for offloading a network node is a manual operation. This allows a developer to profile the platform, choose the right device for processing that stage of the network and bake that choice into the graph description. A future step will be to allow networks to be automatically tuned for the platform, making this level of manual control optional rather than mandatory.
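That manual step can be as simple as profiling each candidate backend once and baking the fastest choice into the deployment. A minimal sketch of the selection (the device names and timing values here are hypothetical placeholders, not measured results):

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <string>

// Pick the backend with the lowest profiled inference time.
// In practice the timings come from running the real network
// stage on each device of the target platform.
std::string pickDevice(const std::map<std::string, double>& msPerRun) {
    std::string best;
    double bestMs = std::numeric_limits<double>::infinity();
    for (const auto& [device, ms] : msPerRun) {
        if (ms < bestMs) {
            bestMs = ms;
            best = device;
        }
    }
    return best;
}
```

Automatic tuning would amount to running this profiling loop on-device at install or first-run time instead of at development time.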

B. Standards

The benefit of reducing the set of core operators is being seen in ongoing standardization efforts such as ONNX [11] and NNEF [12]. While it may be some time before these take off, they promise to open up an ecosystem of interoperating tools, frameworks and inference engines, making the development of neural networks easier and faster.

V. A PRACTICAL EXAMPLE OF DEPLOYMENT

1) Preparing a model in TensorFlow for deployment on an embedded platform
Preparing a model for TensorFlow deployment today involves removing unnecessary nodes and ensuring the operations used are available in the TensorFlow distributions on mobile devices, e.g. by removing training-specific operations in the model's computational graph. Optionally, it can also involve modifying the weights and operations to reduce file size and improve speed, at the expense of accuracy. This is accomplished through TensorFlow's graph_transforms tool, built from the TensorFlow source [8] with:

bazel build tensorflow/tools/graph_transforms:transform_graph

2) 32-bit floating-point model
To build a 32-bit floating-point version of the graph ready for mobile TensorFlow deployment:

bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=resnetv1_50_fp32.pb \
  --out_graph=optimized_resnetv1_50_fp32.pb \
  --inputs='Placeholder' \
  --outputs='resnet_v1_50/predictions/Reshape_1' \
  --transforms='strip_unused_nodes(type=float, shape="1,224,224,3")
    fold_constants(ignore_errors=true)
    fold_batch_norms
    fold_old_batch_norms'

This has the largest file size and highest accuracy, but also has the highest computational requirements. For deployment in a mobile or embedded device, we can perform more preparation steps which make the model run more quickly and with similar accuracy.

3) 8-bit weights and operations
There are many techniques for retaining as much accuracy as possible, such as gradient thresholding and retraining, but these are beyond the scope of this paper. Applying naive quantization is straightforward and does not require additional passes through the training data:

bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=resnetv1_50_fp32.pb \
  --out_graph=optimized_resnetv1_50_int8.pb \
  --inputs='Placeholder' \
  --outputs='resnet_v1_50/predictions/Reshape_1' \
  --transforms='
    add_default_attributes
    strip_unused_nodes(type=float, shape="1,224,224,3")
    remove_nodes(op=Identity, op=CheckNumerics)
    fold_constants(ignore_errors=true)
    fold_batch_norms
    fold_old_batch_norms
    quantize_weights
    quantize_nodes
    strip_unused_nodes
    sort_by_execution_order'

This produces a file 25% of the size that uses 8-bit integer operations for faster inference, at the expense of accuracy.
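Under the hood, this style of 8-bit quantization is an affine mapping, real_value ≈ scale * (q - zero_point), with parameters chosen from the observed float range. The sketch below illustrates the arithmetic only; it is our own illustration, not TensorFlow's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Affine (asymmetric) 8-bit quantization:
//   real_value ≈ scale * (quantized_value - zero_point)
struct QuantParams {
    float scale;
    int32_t zeroPoint;
};

// Derive parameters from the observed float range of a tensor.
QuantParams chooseParams(float minVal, float maxVal) {
    minVal = std::min(minVal, 0.0f);  // range must contain zero so that
    maxVal = std::max(maxVal, 0.0f);  // real 0.0 is exactly representable
    const float scale = (maxVal - minVal) / 255.0f;
    const int32_t zeroPoint =
        static_cast<int32_t>(std::round(-minVal / scale));
    return {scale, zeroPoint};
}

uint8_t quantize(float x, const QuantParams& q) {
    const int32_t v =
        q.zeroPoint + static_cast<int32_t>(std::round(x / q.scale));
    return static_cast<uint8_t>(std::clamp(v, 0, 255));
}

float dequantize(uint8_t x, const QuantParams& q) {
    return q.scale * (static_cast<int32_t>(x) - q.zeroPoint);
}
```

The roundtrip error of any value inside the chosen range is bounded by one quantization step (the scale), which is why accuracy loss grows with the dynamic range of the weights.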

4) Benchmarking optimized models
It's important to benchmark optimized models on real hardware. TensorFlow contains optimized 8-bit routines for Arm CPUs but not for x86, so 8-bit models will run much more slowly on an x86-based laptop than on a mobile Arm device. You can build the TensorFlow Android benchmark application with:

bazel build -c opt --cxxopt=-std=c++11 \
  --crosstool_top=//external:android/crosstool \
  --cpu=armeabi-v7a \
  --host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
  tensorflow/tools/benchmark:benchmark_model

With the Android deployment device (in this case a HiKey 960) connected, run:

adb shell "mkdir -p /data/local/tmp"
adb push bazel-bin/tensorflow/tools/benchmark/benchmark_model /data/local/tmp
adb push optimized_resnetv1_50_fp32.pb /data/local/tmp
adb push optimized_resnetv1_50_int8.pb /data/local/tmp

The benchmarks are run with:

adb shell '/data/local/tmp/benchmark_model \
  --num_threads=1 \
  --graph=/data/local/tmp/optimized_resnetv1_50_fp32.pb \
  --input_layer="Placeholder" \
  --input_layer_shape="1,224,224,3" \
  --input_layer_type="float" \
  --output_layer="resnet_v1_50/predictions/Reshape_1"'

adb shell '/data/local/tmp/benchmark_model \
  --num_threads=1 \
  --graph=/data/local/tmp/optimized_resnetv1_50_int8.pb \
  --input_layer="Placeholder" \
  --input_layer_shape="1,224,224,3" \
  --input_layer_type="float" \
  --output_layer="resnet_v1_50/predictions/Reshape_1"'

5) Performance comparison
Accuracy should be evaluated using application-specific data, as the impact of quantization on accuracy can vary. In terms of compute performance, the above networks show the following performance on the HiKey 960 development platform with stock firmware, Android and CPU frequency settings:

Figure 5: Performance of Resnet50 (standard configuration) running on different inference implementations

As described previously, different approaches to the deployment of machine learning software can have a material impact on performance. In this example, we can see that deployment in Arm's inference engine, where the whole graph can be accelerated on device (even before fusion is possible), can produce much higher performance. Reduction of round trips to user-space software, removal of data transfers between layers, and specifically optimized inference routines all contribute to this performance difference.

Figure 6: Performance of Mobilenet v1 1.0_224 running on different inference implementations

The Arm inference engine and SYCL implementation are both running on Mali in this example, and differences are mostly attributed to the above overheads. For accelerated CPU routines, direct integration into TensorFlow is more straightforward, and so the performance of those devices is more easily achieved.

It should be noted, however, that continued efforts are being made in a number of these codebases, which will materially change performance over time.

B. Using Arm's inference engine
The general flow of using the inference engine follows the graph import software model, and this is further broken down into the 'Import, Optimize, Run' pattern, expecting that the input network weights come from an independent training process.

Figure 7: General Arm inference flow

The initial process is to create a network. In this example, we use the TensorFlow parser to take an input graph and convert it into our runtime graph representation, armnn::INetwork, which can then be used in the normal Arm inference flow. For this graph, we also need to connect the input and output tensors that are used when running inference to capture data. These are named based on choices of the model and so depend on the model you pass.

First, we create the parser:

// Create a network from a file on disk, using (in
// this case) the TensorFlow parser
std::unique_ptr<ITfParser> parser(ITfParser::Create());

Then we parse the network, in this case coming from a text input representing the mnist network, using inputTensorInfo to specify the inputs to the graph:

// Call the parser function with the input network,
// which can be binary or text
armnn::TensorInfo inputTensorInfo({ 1, 784, 1, 1 },
    armnn::DataType::Float32);

std::unique_ptr<armnn::INetwork> network =
    parser->CreateNetworkFromTextFile(
        armnn::DataType::Float32,
        "simple_mnist_tf.prototxt",
        inputTensorInfo);

We also get input and output bindings based on the textual name of the node in the graph:

// Get the input and output bindings based on node
// name in the graph
m_InputBindingInfo = parser->
    GetNetworkInputBindingInfo("input");
m_OutputBindingInfo = parser->
    GetNetworkOutputBindingInfo("output");

Once these steps have been completed, we can continue using the Arm inference stack, as we would for this or any other input path. From this point on, the code we use is common, regardless of the framework we started with. Very simply, our next steps are to take the graph and optimize it to make an immutable graph ready for running on the device we choose, then to run inference by enqueueing the workload and reading the result.

First, we run the optimization flow to produce our graph, optimized for the devices it will run on, with all nodes' functions created and internal memory objects for processing and data transfer between devices:

// The optimize step, which finalizes the graph
// ready for running inference
std::unique_ptr<armnn::IOptimizedNetwork> optNet =
    armnn::Optimize(*network,
        m_GraphContext->GetDeviceSpec());

Next, load the graph into the execution context:

// Load the network into the context.
armnn::Status ret = m_GraphContext->
    LoadNetwork(m_NetworkIdentifier,
        std::move(optNet));

The context is what records the devices we will execute on, typically one of the following:

armnn::Compute::CpuAcc;
armnn::Compute::GpuAcc;

The final step is to run the inference for the network on a given input, and capture the output:

armnn::Status ret = m_GraphContext->
    EnqueueWorkload(m_NetworkIdentifier,
        MakeInputTensors(&input.image[0]),
        MakeOutputTensors(&output[0]));

C. Deployment to GPU with TensorFlow + SYCL
TensorFlow models can also be executed on the Arm Mali GPU via OpenCL, targeting the SYCL compiler. This workflow is currently experimental and under active development, but initial results are encouraging – particularly as a way to run general or experimental TensorFlow graphs that are not yet heavily optimized by hand. A detailed walkthrough for installing TensorFlow SYCL with ComputeCpp for deployment on Arm Mali G71 devices can be found here:

https://developer.codeplay.com/computecppce/latest/tensorflow-arm-setup-guide

Once this is installed, 32-bit floating-point models will be deployed onto the GPU, allowing a wide array of pre-existing models to be used. This is useful for evaluating multiple networks and experimenting on the target platform before deployment.

VI. PERFORMANCE

As previously described, measuring performance on the target platform is key for getting accurate figures, particularly as there are a number of factors that can have an impact, and different network configurations running the same routines can produce notably different performance across implementations and devices; one device might be faster for some networks and another device faster for others.

Even if you are able to use a previously optimized network provided on our developer portal, there are still tradeoffs between power, performance and convenience, as can be seen below.

It's possible for this performance balance to change when looking at different models:

[Chart: Mobilenet v1 1.0_224 single-batch inference time in ms – 8x Mali G71 with Arm Compute Library (32-bit float), 4x and 1x Arm Cortex-A73 with TensorFlow Mobile (8-bit integer), and 4x and 1x Arm Cortex-A73 with TensorFlow Mobile (32-bit float); measured times of 38, 158, 155, 304 and 433 ms]

And of course, as software is further optimized, big leaps can be seen.



Figure 8: Alexnet speedup of new versions of Compute Library

Figure 9: Matrix multiply speedup in SYCL and Compute Library

It's also worth being mindful of the performance benefits, bandwidth savings and energy savings from using smaller datatypes, for example switching from 32-bit float to 8-bit integer. It has been shown that accuracy loss for a network is negligible when doing so, provided that retraining is also performed. In the case illustrated in this throughput graph, using 8-bit for matrix multiply results in a speedup of around 1.62x and, notably, a 2.46x reduction in bandwidth.

VII. THE FUTURE

There's so much more that can be done in this space: adding further optimizations such as fusion to inference engines, improving machine learning frameworks to enable accelerators, providing better development environments and tools for cloud-to-edge deployment, and exploiting compiler technology to further improve optimization. Of all the things that can be done, there are a few really interesting areas to look at:

• Seamless deployment from cloud to edge – making the training experience easier and providing better tooling for performance, adjusting model complexity, and accuracy tuning

• More advanced heterogeneous scheduling – better tools for static scheduling of workloads across devices in the SoC in the first instance, and hopefully improving dynamic scheduling in future

• Network compilers – taking advantage of the full network and operator code to produce interleaved scheduling to make maximum re-use of caches, to simplify arithmetic sequences, and to reduce memory accesses and bandwidth

This rapidly evolving field is continuing to solve more complex problems and improve performance at an amazing rate. Why not take a look at our developer community [18] and try [19] some of these techniques out for yourself?

REFERENCES

[1] https://www.eff.org/ai/metrics
[2] https://arxiv.org/pdf/1705.02498.pdf
[3] https://arxiv.org/pdf/1706.06969.pdf
[4] https://www.forbes.com/sites/michaelthomsen/2015/02/19/microsofts-deep-learning-project-outperforms-humans-in-image-recognition/#4381b2f9740b
[5] https://www.youtube.com/watch?v=k4ovpelG9vs
[6] https://arxiv.org/pdf/1712.05877.pdf
[7] https://arxiv.org/pdf/1707.01083.pdf
[8] https://arxiv.org/pdf/1711.07128.pdf
[9] https://community.arm.com/processors/b/blog/posts/high-accuracy-keyword-spotting-on-cortex-m-processors
[10] https://github.com/ARM-software/ML-KWS-for-MCU
[11] https://onnx.ai/
[12] https://www.khronos.org/nnef/
[13] https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms
[14] https://arxiv.org/pdf/1712.01208.pdf
[15] https://arxiv.org/pdf/1508.06576.pdf
[16] https://pages.arm.com/iot-security-manifesto.html?utm_medium=Website&utm_source=Arm-HomepageHero&campaign=SecurityManifesto
[17] http://zhiyisun.github.io/2017/02/15/Running-Google-Machine-Learning-Library-Tensorflow-On-ARM-64-bit-Platform.html
[18] https://developer.arm.com/technologies/machine-learning-on-arm
[19] https://developer.arm.com/technologies/machine-learning-on-arm/developer-material/how-to-guides/teach-your-raspberry-pi-yeah-world
[20] http://developer.arm.com/-/media/Files/pdf/The%20Power%20of%20Speech%20Supporting%20Voice-Driven%20Commands%20in%20Small%20Low-Power%20Microcontrollers.pdf
[21] https://developer.android.com/ndk/guides/neuralnetworks/index.html



Triple Core ARM® Based MCU Architecture for Radiation Environments

Balaji, V. (ARM Holdings Ltd.), Bannatyne, R. (VORAGO Technologies), Iturbe, X. (ARM Holdings Ltd.)

Abstract— This technical paper proposes an ARM-based microcontroller that has been optimized for use in conditions of extreme radiation. Namely, several radiation-mitigating techniques are combined to address different types of failures that occur in CMOS devices when exposed to radiation. The proposed microcontroller integrates three Cortex-R5 CPUs in lock-step mode and implements a quick error recovery mechanism to cope with radiation-provoked soft errors. The microcontroller is proposed to be manufactured using VORAGO Technologies' HARDSIL® technology, which immunizes the device against radiation-induced latch-up. HARDSIL® also allows operation during exposure to a significant level of Total Ionizing Dose (TID), typically up to 300 krad(Si). Single Event Upsets (SEU) due to radiation particle strikes on memory are mitigated by Error Detection and Correction (EDAC) and Scrub Engine subsystems that operate on the program and data memories of the device.

Keywords—ARM; MCU; microcontroller; radiation; SEU; latchup; HARDSIL

I. ADDRESSING LATCH-UP IN RADIATION ENVIRONMENTS

A major problem that CMOS semiconductors face in extreme environments is 'latch-up'. Under conditions of high temperature and radiation, the CMOS device is exposed to conditions where parasitic transistors can be switched on by high-temperature silicon effects or by an ionizing radiation strike. All bulk CMOS wafers contain millions of parasitic structures (that resemble and behave like a thyristor) spread across the wafer. This is a byproduct of the CMOS wafer architecture and is usually not a problem if the device is operated within a limited specification, but at high temperature or when radiation is present, latch-up occurs when the parasitic structure is triggered. Fig. 1 illustrates a cross-section of a CMOS device structure, with the bipolar parasitic transistor structure shown in the well and substrate area of the wafer.

Latch-up occurs if the parasitic bipolar transistors become forward biased and switch on. The transistors will drive each other into saturation and create a short circuit from Vdd to Vss. When latch-up occurs, a high current will flow through the short circuit. Latch-up will be sustained if the combined gain of the NPN and PNP parasitic structure is greater than unity. This can result in permanent damage. To get out of a latch-up condition, the device must be reset.

At high temperature, junction leakage current increases as electron-hole pairs are generated in the silicon lattice. The forward bias voltage of parasitic transistors that reside on CMOS silicon structures is also reduced, leading to a reduced trigger current that is the onset of the latch-up condition. Similarly, a particle strike on the die can create charge that switches on the parasitic structure.

Immunizing against latch-up is not easy. Hardening electronic components for extreme environments has been achieved by using specialized semiconductor manufacturing processes such as Silicon-on-Insulator (SOI). This approach is effective but expensive, as it is a 'boutique' process that is not compatible with the sizeable CMOS infrastructure that is the standard in the industry.

Fig. 1. Cross Section of a Commercial Twin-Well CMOS Device Showing the Parasitic Bipolar Transistor Structure

Another approach that has been developed to address latch-up in extreme environments is to modify standard CMOS by adding a 'Buried Guard Ring' (BGR) to the existing CMOS substrate.

The HARDSIL® process [1] is a modification to standard CMOS designs that includes a vertical and horizontal implant in the die. This approach immunizes against latch-up by creating a highly conductive layer underneath the CMOS devices and wells, combined with a high-conductivity connection to well contacts. The HARDSIL® approach enables high-temperature and radiation-tolerant operation by reducing the parasitic resistance so that the parasitic NPN cannot turn on, and by reducing the gain of the parasitic transistors so that the bipolars cannot sustain latch-up. HARDSIL® has been implemented on space-grade semiconductors and has proved to be effective for latch-up immunization.

The HARDSIL® BGR is implemented by adding 1-2 mask steps and 3 implants during the wafer manufacturing stage. No special equipment is required to implement HARDSIL®, and standard CMOS design tools and manufacturing equipment are used. There are no negative effects in terms of transistor performance or power consumption. HARDSIL® can be implemented on any CMOS integrated circuit at any processing geometry node. This is very significant for designers of extreme-environment electronics systems, as it unlocks the door to using the latest state-of-the-art semiconductor products, rather than being limited to the small pool of tried-and-tested components that are never the best-fit products for a state-of-the-art design.

The BGR is shown under the transistor well areas in Fig. 2.

Fig. 2. Buried Guard Ring Structure in CMOS Device Using HARDSIL®

II. ARM-BASED MICROCONTROLLER ARCHITECTURE<br />

The proposed microcontroller is based on the recently<br />

announced ARM Triple Core Lock-Step (TCLS) architecture [2].<br />

This architecture is shown in Fig. 3 and includes three lock-stepped<br />

Cortex-R5 CPUs coordinated by a TCLS Assist Unit.<br />

At every clock cycle, the instructions to be executed by the<br />

microcontroller are read from a shared memory and distributed<br />

to the triplicated CPUs. The CPU outputs are majority-voted and<br />

forwarded to memories and I/O ports, preventing CPU errors<br />

from propagating to other parts of the system. Simultaneously,<br />

error detection logic in the TCLS Assist Unit checks whether there<br />

is any mismatch in the outputs delivered by the three CPUs. If<br />

there is a mismatch, this logic identifies whether it is a<br />

correctable error (i.e., only one of the CPUs delivers a different<br />

set of outputs) or an un-correctable one (i.e., all CPUs deliver<br />

different outputs). If the error is correctable, a resynchronization<br />

logic takes over the control to correct the architectural state of<br />

the erroneous CPU by resynchronizing all the CPUs. If the error<br />

is un-correctable, the entire system where the TCLS processor is<br />

integrated must be reset.<br />
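Conceptually, the majority voting and error classification performed by the TCLS Assist Unit can be sketched in C. This is an illustrative software model only; the real Assist Unit is hardware logic operating on the CPU output buses:

```c
#include <stdint.h>

/* Bitwise 2-of-3 majority vote over the outputs of the three CPUs:
   any single corrupted value is masked. */
uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Mismatch classification, mirroring the text above: 0 = no error,
   1 = correctable (only one CPU disagrees), 2 = un-correctable
   (all three CPUs deliver different outputs). */
int classify(uint32_t a, uint32_t b, uint32_t c)
{
    if (a == b && b == c) return 0;
    if (a == b || b == c || a == c) return 1;
    return 2;
}
```

In the correctable case, the voted value is already correct and only the dissenting CPU's architectural state needs resynchronization; in the un-correctable case no majority exists and a reset is required.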


Fig. 3. Proposed ARM-based microcontroller architecture<br />

In TCLS, the CPU resynchronization process is automatic<br />

and transparent to the software. It consists of pushing out the<br />

architectural state of the three CPUs and then restoring the<br />

majority voted values back. Two remarks are important here.<br />

First, the error recovery process can be completed in less than<br />

2,500 clock cycles (less than 2.5 us @ 450 MHz) as there is no<br />

need to correct the memory state, whose integrity is protected<br />

using ECC. Secondly, the TCLS architecture is fail-functional, as<br />

it can continue working correctly in the event of a single CPU<br />

error using the two remaining functionally correct CPUs until all<br />

critical computations are completed and there is enough time to<br />

resynchronize the three CPUs.<br />

Unlike related space-qualified processors, the TCLS can<br />

deliver comparable performance (i.e., CPU clock frequency) to<br />

the COTS Cortex-R5 processor widely used in terrestrial<br />

automotive applications. Finally, note that the ARM TCLS<br />

architecture can be potentially used with any ARM CPUs,<br />

including performance-oriented A-class CPUs.<br />

III. MITIGATING AGAINST RADIATION EFFECTS THAT UPSET<br />

MEMORIES<br />

To mitigate against an SEU that could flip a memory bit, an<br />

Error Detection and Correction (EDAC) subsystem is proposed.<br />

Error Correcting Code (ECC) memories have the ability to<br />

detect a flipped memory bit and correct it. The VA10820<br />

ARM® Cortex®-M0 microcontroller Error Detection &<br />

Correction sub-system implements a Hamming Code based<br />

solution that detects two errors and corrects one PER BYTE.<br />

This means that there can be four flipped bits per 32-bit word<br />

and the microcontroller will still operate normally. As words are<br />

fetched by the CPU, the EDAC automatically performs<br />

detection and correction on these words in the course of normal<br />

CPU operation. There is still a risk however that particle strikes<br />

can flip bits on areas of the memory array that are not regularly<br />

being fetched by the CPU. This increases the likelihood that<br />

there will be more than a single bit error, creating an<br />

uncorrectable error. For this reason, a ‘Scrub Engine’ has also<br />

been integrated into the VA10820.<br />
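The per-byte correct-one/detect-two behaviour can be illustrated with a textbook Hamming SECDED code. The sketch below is a generic Hamming(12,8) code with an added overall parity bit — an assumption for illustration, not the VA10820's actual EDAC encoding:

```c
#include <stdint.h>

/* Data bits occupy the non-power-of-two positions 3..12;
   Hamming parity sits at positions 1,2,4,8; overall parity at bit 0. */
static const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};

/* Encode one byte into a 13-bit codeword. */
uint16_t secded_encode(uint8_t d)
{
    uint16_t cw = 0;
    for (int i = 0; i < 8; i++)
        if (d & (1u << i))
            cw |= 1u << data_pos[i];
    for (int p = 1; p <= 8; p <<= 1) {        /* make each parity group even */
        int parity = 0;
        for (int j = 1; j <= 12; j++)
            if ((j & p) && (cw & (1u << j)))
                parity ^= 1;
        if (parity)
            cw |= 1u << p;
    }
    int all = 0;                              /* overall parity over bits 1..12 */
    for (int j = 1; j <= 12; j++)
        if (cw & (1u << j))
            all ^= 1;
    if (all)
        cw |= 1u;
    return cw;
}

/* Decode: 0 = clean, 1 = single error corrected, 2 = double error
   detected (uncorrectable). The decoded byte is written to *out. */
int secded_decode(uint16_t cw, uint8_t *out)
{
    int syn = 0;                              /* syndrome = error position */
    for (int p = 1; p <= 8; p <<= 1) {
        int parity = 0;
        for (int j = 1; j <= 12; j++)
            if ((j & p) && (cw & (1u << j)))
                parity ^= 1;
        if (parity)
            syn |= p;
    }
    int all = 0;                              /* parity over the whole word */
    for (int j = 0; j <= 12; j++)
        if (cw & (1u << j))
            all ^= 1;

    int status;
    if (syn == 0 && all == 0)
        status = 0;
    else if (all == 1) {                      /* odd number of flips: single error */
        if (syn >= 1 && syn <= 12)
            cw ^= 1u << syn;                  /* flip the erroneous bit back */
        status = 1;                           /* syn == 0: overall parity bit itself */
    } else
        status = 2;                           /* even flips, nonzero syndrome */

    uint8_t d = 0;
    for (int i = 0; i < 8; i++)
        if (cw & (1u << data_pos[i]))
            d |= 1u << i;
    *out = d;
    return status;
}
```

Because each byte is protected independently, up to four single-bit errors per 32-bit word — one in each byte — remain correctable, as noted above.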

The purpose of the Scrub Engine is to prevent accumulated<br />

errors and is an important part of the overall strategy to prevent<br />

uncorrectable bit flips due to radiation strikes. The Scrub Engine<br />

operates independently of the ECC system and will operate in<br />

the background of regular CPU activity to periodically examine<br />

the contents of each memory location and correct any bit-flip<br />

errors. This prevents the build-up of accumulated errors to<br />

reduce the possibility of a double-bit error that is uncorrectable.<br />



The Scrub Engine frequency can be adjusted so that a full<br />

memory scrub can be implemented regularly enough to be<br />

effective based on the radiation conditions of the environment at<br />

any time. A recommended approach is to measure the number<br />

of errors that the EDAC system encounters and use that<br />

information to adjust the scrub rate to a reasonable level.<br />
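That adjustment policy can be sketched as a simple mapping from the observed error count to the next scrub interval; the function name and thresholds below are illustrative assumptions, not VA10820 register semantics:

```c
/* Illustrative adaptive scrub policy: the number of EDAC corrections
   observed since the last full scrub selects the next scrub interval. */
unsigned next_scrub_interval_ms(unsigned errors_since_last_scrub)
{
    if (errors_since_last_scrub > 10)
        return 100;     /* harsh radiation environment: scrub often */
    if (errors_since_last_scrub > 2)
        return 500;     /* moderate upset activity */
    return 2000;        /* quiet environment: scrub rarely */
}
```

The supervisory software would read the EDAC error counter after each full scrub pass and reprogram the Scrub Engine with the returned interval.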

IV. ARM DEVELOPMENT ECOSYSTEM<br />

A development ecosystem for a microcontroller is the broad<br />

range of tools and support that is required to get the device up and running<br />

in the embedded system. The ecosystem includes<br />

hardware development tools that can be used to prototype<br />

systems, software packages that allow a designer to create<br />

high-level language code, and programming and debugging tools. An<br />

effective development ecosystem also includes code that can be<br />

used in the embedded system such as a Real-Time Operating<br />

System (RTOS) and communications stacks. Application notes<br />

and online support communities are also an important part of an<br />

effective development ecosystem.<br />

Embedded designers usually prefer to use devices based on<br />

the ARM Cortex architecture because the ARM ecosystem is<br />

large, mature and continues to evolve with the latest state-of-the-art<br />

tools.<br />

As explained in section II, the ARM TCLS architecture is<br />

transparent to the software programmer and hence, a user of the<br />

TCLS architecture is automatically granted access to the<br />

entire ARM ecosystem. Furthermore, the general error recovery<br />

process in TCLS does not require any user intervention. If<br />

required, in hard real-time applications, the user can keep track<br />

of the occurrence of errors in the TCLS architecture and start the<br />

CPU resynchronization process on demand. This is controlled<br />

by means of some flags in internal TCLS registers that can be<br />

read and written from the user application.<br />

V. CONCLUSION<br />

As more money is invested in commercial space, there is a<br />

demand for state-of-the-art products that can operate under<br />

extreme radiation while remaining affordable and offering<br />

leading-edge performance. The space industry has typically<br />

used legacy products that have been processed using specialized<br />

hardening techniques that are very expensive. The combination<br />

of leading edge ARM-based technology, the huge ecosystem of<br />

development tools around it and low-cost hardening technology<br />

is very attractive. This approach will simplify the job for<br />

designers, reduce costs and ultimately help enable low-cost<br />

reliable commercial space systems.<br />

REFERENCES<br />

[1] VORAGO Technologies, “Technology: An Overview of VORAGO’s<br />

HARDSIL ® Technology”, VORAGO Technologies,<br />

www.voragotech.com<br />

[2] X. Iturbe, B. Venu, E. Ozer and S. Das, “A Triple Core Lock-Step (TCLS)<br />

ARM Cortex-R5 Processor for Safety-Critical and Ultra-Reliable<br />

Applications”, Proc. of the IEEE/IFIP Intl. Conf. on Dependable Systems<br />

and Networks, 2016.<br />



Dynamic Memory Allocation & Fragmentation in C<br />

& C++<br />


Colin Walls<br />

Mentor, a Siemens business<br />

Newbury, UK<br />

colin_walls@mentor.com<br />

Abstract—In C and C++, it can be very convenient to allocate<br />

and de-allocate blocks of memory as and when needed. This is<br />

certainly standard practice in both languages and almost<br />

unavoidable in C++. However, the handling of such dynamic<br />

memory can be problematic and inefficient. For desktop<br />

applications, where memory is freely available, these difficulties<br />

can be ignored. For embedded – generally real time –<br />

applications, ignoring the issues is not an option. Dynamic<br />

memory allocation tends to be non-deterministic; the time taken<br />

to allocate memory may not be predictable and the memory pool<br />

may become fragmented, resulting in unexpected allocation<br />

failures. In this paper the problems are outlined in detail and an<br />

approach to deterministic dynamic memory allocation is detailed.<br />


I. C/C++ MEMORY SPACES<br />

It may be useful to think in terms of data memory in C and<br />

C++ as being divided into three separate spaces:<br />

Static memory. This is where variables, which are defined<br />

outside of functions, are located. The keyword static does not<br />

generally affect where such variables are located; it specifies<br />

their scope to be local to the current module. Variables that are<br />

defined inside of a function, which are explicitly declared<br />

static, are also stored in static memory. Commonly, static<br />

memory is located at the beginning of the RAM area. The<br />

actual allocation of addresses to variables is performed by the<br />

embedded software development toolkit: a collaboration<br />

between the compiler and the linker. Normally, program<br />

sections are used to control placement, but more advanced<br />

techniques, like Fine Grain Allocation, give more control.<br />

Commonly, all the remaining memory, which is not used for<br />

static storage, is used to constitute the dynamic storage area,<br />

which accommodates the other two memory spaces.<br />

Automatic variables. Variables defined inside a function,<br />

which are not declared static, are automatic. There is a<br />

keyword to explicitly declare such a variable – auto – but it is<br />

almost never used. Automatic variables (and function<br />

parameters) are usually stored on the stack. The stack is<br />

normally located using the linker. The end of the dynamic<br />

storage area is typically used for the stack. Compiler<br />

optimizations may result in variables being stored in registers<br />

for part or all of their lifetimes; this may also be suggested by<br />

using the keyword register.<br />

The heap. The remainder of the dynamic storage area is<br />

commonly allocated to the heap, from which application<br />

programs may dynamically allocate memory, as required.<br />

II. DYNAMIC MEMORY IN C<br />

In C, dynamic memory is allocated from the heap using<br />

some standard library functions. The two key dynamic memory<br />

functions are malloc() and free().<br />

The malloc() function takes a single parameter, which is<br />

the size of the requested memory area in bytes. It returns a<br />

pointer to the allocated memory. If the allocation fails, it<br />

returns NULL. The prototype for the standard library function is<br />

like this:<br />

void *malloc(size_t size);<br />

The free() function takes the pointer returned by<br />

malloc() and de-allocates the memory. No indication of<br />

success or failure is returned. The function prototype is like<br />

this:<br />

void free(void *pointer);<br />

To illustrate the use of these functions, here is some code to<br />

statically define an array and set the fourth element’s value:<br />

int my_array[10];<br />

my_array[3] = 99;<br />

The following code does the same job using dynamic<br />

memory allocation:<br />

int *pointer;<br />



pointer = malloc(10 * sizeof(int));<br />

*(pointer+3) = 99;<br />

The pointer de-referencing syntax is hard to read, so normal<br />

array referencing syntax may be used, as [ and ] are just<br />

operators:<br />

pointer[3] = 99;<br />

When the array is no longer needed, the memory may be<br />

de-allocated thus:<br />

free(pointer);<br />

pointer = NULL;<br />

Assigning NULL to the pointer is not compulsory, but is<br />

good practice, as it will cause an error to be generated if the<br />

pointer is erroneously utilized after the memory has been de-allocated.<br />

The amount of heap space actually allocated by<br />

malloc() is normally one word larger than that requested.<br />

The additional word is used to hold the size of the allocation<br />

and is for later use by free(). This “size word” precedes the<br />

data area to which malloc() returns a pointer.<br />

There are two other variants of the malloc() function:<br />

calloc() and realloc().<br />

The calloc() function does basically the same job as<br />

malloc(), except that it takes two parameters – the number<br />

of array elements and the size of each element – instead of a<br />

single parameter (which is the product of these two values).<br />

The allocated memory is also initialized to zeros. Here is the<br />

prototype:<br />

void *calloc(size_t nelements, size_t elementSize);<br />

The realloc() function resizes a memory allocation<br />

previously made by malloc(). It takes as parameters a<br />

pointer to the memory area and the new size that is required. If<br />

the size is reduced, data may be lost. If the size is increased and<br />

the function is unable to extend the existing allocation, it will<br />

automatically allocate a new memory area and copy data<br />

across. In any case, it returns a pointer to the allocated<br />

memory. Here is the prototype:<br />

void *realloc(void *pointer, size_t size);<br />
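A short worked example of realloc() in use, with the minimum error handling; the helper name grow_example is ours, added for illustration:

```c
#include <stdlib.h>

/* Grow a 10-element array to 20 elements with realloc().
   Contents up to the old size are preserved. */
int grow_example(void)
{
    int *p = malloc(10 * sizeof(int));
    if (p == NULL)
        return -1;
    p[9] = 99;                          /* value that must survive the resize */

    int *q = realloc(p, 20 * sizeof(int));
    if (q == NULL) {                    /* on failure, p is still valid */
        free(p);
        return -1;
    }

    int kept = q[9];                    /* still 99 after the resize */
    q[19] = 123;                        /* the new space is usable (uninitialized) */
    free(q);
    return kept;
}
```

Note that the original pointer must not be freed after a successful realloc(), since the function may have moved the block and freed the old one itself.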

III. DYNAMIC MEMORY IN C++<br />

Management of dynamic memory in C++ is quite similar to<br />

C in most respects. Although the library functions are likely to<br />

be available, C++ has two additional operators – new and<br />

delete – which enable code to be written more clearly,<br />

succinctly and flexibly, with less likelihood of errors. The new<br />

operator can be used in three ways:<br />

p_var = new typename;<br />

p_var = new type(initializer);<br />

p_array = new type [size];<br />

In the first two cases, space for a single object is allocated;<br />

the second one includes initialization. The third case is the<br />

mechanism for allocating space for an array of objects.<br />

The delete operator can be invoked in two ways:<br />

delete p_var;<br />

delete[] p_array;<br />

The first is for a single object; the second de-allocates the<br />

space used by an array. It is very important to use the correct<br />

de-allocator in each case.<br />

There is no operator that provides the functionality of the C<br />

realloc() function.<br />

Here is the code to dynamically allocate an array and<br />

initialize the fourth element:<br />

int* pointer;<br />

pointer = new int[10];<br />

pointer[3] = 99;<br />

Using the array access notation is natural.<br />

De-allocation is performed thus:<br />

delete[] pointer;<br />

pointer = NULL;<br />

Again, assigning NULL to the pointer after de-allocation is<br />

just good programming practice.<br />

Another option for managing dynamic memory in C++ is to use<br />

the containers in the Standard Template Library. This may be<br />

inadvisable for real time embedded systems, as the containers allocate heap memory behind the scenes.<br />

IV. ISSUES AND PROBLEMS<br />

As a general rule, dynamic behavior is troublesome in real<br />

time embedded systems. The two key areas of concern are<br />

determination of the action to be taken on resource exhaustion<br />

and non-deterministic execution performance.<br />

There are a number of problems with dynamic memory<br />

allocation in a real time system.<br />

The standard library functions (malloc() and free())<br />

are not normally reentrant, which would be problematic in a<br />

multithreaded application. If the source code is available, this<br />

should be straightforward to rectify by locking resources using<br />

RTOS facilities (like a semaphore).<br />

A more intractable problem is associated with the<br />

performance of malloc(). Its behavior is unpredictable, as<br />



the time it takes to allocate memory is extremely variable. Such<br />

non-deterministic behavior is intolerable in real time systems.<br />

Without great care, it is easy to introduce memory leaks<br />

into application code implemented using malloc() and<br />

free(). This is caused by memory being allocated and never<br />

being de-allocated. Such errors tend to cause a gradual<br />

performance degradation and eventual failure. This type of bug<br />

can be very hard to locate.<br />

Memory allocation failure is a concern. Unlike a desktop<br />

application, most embedded systems do not have the<br />

opportunity to pop up a dialog and discuss options with the<br />

user. Often, resetting is the only option, which is unattractive.<br />

If allocation failures are encountered during testing, care must<br />

be taken with diagnosing their cause. It may be that there is<br />

simply insufficient memory available – this suggests various<br />

courses of action. However, it may be that there is sufficient<br />

memory, but not available in one contiguous chunk that can<br />

satisfy the allocation request. This situation is called memory<br />

fragmentation.<br />

V. MEMORY FRAGMENTATION<br />

The best way to understand memory fragmentation is to<br />

look at an example. For this example, it is assumed that there is<br />

a 10K heap. First, an area of 3K is requested, thus:<br />

#define K (1024)<br />

char *p1, *p2;<br />

p1 = malloc(3*K);<br />

Then, a further 4K is requested:<br />

p2 = malloc(4*K);<br />

3K of memory is now free.<br />

Some time later, the first memory allocation, pointed to by<br />

p1, is de-allocated:<br />

free(p1);<br />

This leaves 6K of memory free in two 3K chunks.<br />

A further request for a 4K allocation is issued:<br />

p1 = malloc(4*K);<br />

This results in a failure – NULL is returned into p1 –<br />

because, even though 6K of memory is available, there is not a<br />

4K contiguous block available. This is memory fragmentation.<br />

It would seem that an obvious solution would be to defragment<br />

the memory, merging the two 3K blocks to make a<br />

single one of 6K. However, this is not possible because it<br />

would entail moving the 4K block to which p2 points. Moving<br />

it would change its address, so any code that has taken a copy<br />

of the pointer would then be broken. In other languages (such<br />

as Visual Basic, Java and C#), there are de-fragmentation (or<br />

“garbage collection”) facilities. This is only possible because<br />

these languages do not support direct pointers, so moving the<br />

data has no adverse effect upon application code. This defragmentation<br />

may occur when a memory allocation fails or<br />

there may be a periodic garbage collection process that is run.<br />

In either case, this would severely compromise real time<br />

performance and determinism.<br />

VI. MEMORY WITH AN RTOS<br />

A real time operating system may provide a service which<br />

is effectively a reentrant form of malloc(). However, it is<br />

unlikely that this facility would be deterministic.<br />

Memory management facilities that are compatible with<br />

real time requirements – i.e. they are deterministic – are usually<br />

provided. This is most commonly a scheme which allocates<br />

blocks – or “partitions” – of memory under the control of the<br />

OS.<br />

A. Block/partition Memory Allocation<br />

Typically, block memory allocation is performed using a<br />

“partition pool”, which is defined statically or dynamically and<br />

configured to contain a specified number of blocks of a<br />

specified fixed size. For Nucleus OS, the API call to define a<br />

partition pool has the following prototype:<br />

STATUS NU_Create_Partition_Pool(NU_PARTITION_POOL *pool,<br />

CHAR *name, VOID *start_address, UNSIGNED pool_size,<br />

UNSIGNED partition_size, OPTION suspend_type);<br />

This is most clearly understood by means of an example:<br />

status = NU_Create_Partition_Pool(&MyPool, "any name",<br />

(VOID *) 0xB000, 2000, 40, NU_FIFO);<br />

This creates a partition pool with the descriptor MyPool,<br />

containing 2000 bytes of memory, filled with partitions of size<br />

40 bytes (i.e. there are 50 partitions). The pool is located at<br />

address 0xB000. The pool is configured such that, if a task<br />

attempts to allocate a block, when there are none available, and<br />

it requests to be suspended on the allocation API call,<br />

suspended tasks will be woken up in a first-in, first-out order.<br />

The other option would have been task priority order.<br />

Another API call is available to request allocation of a<br />

partition. Here is an example using Nucleus OS:<br />

status = NU_Allocate_Partition(&MyPool, &ptr, NU_SUSPEND);<br />

This requests the allocation of a partition from MyPool.<br />

When successful, a pointer to the allocated block is returned in<br />



ptr. If no memory is available, the task is suspended, because<br />

NU_SUSPEND was specified; other options, which may have<br />

been selected, would have been to suspend with a timeout or to<br />

simply return with an error.<br />

When the partition is no longer required, it may be de-allocated thus:<br />

status = NU_Deallocate_Partition(ptr);<br />

If a task of higher priority was suspended pending<br />

availability of a partition, it would now be run.<br />

There is no possibility for fragmentation, as only fixed size<br />

blocks are available. The only failure mode is true resource<br />

exhaustion, which may be controlled and contained using task<br />

suspend, as shown.<br />

Additional API calls are available which can provide the<br />

application code with information about the status of the<br />

partition pool – for example, how many free partitions are<br />

currently available.<br />

Care is required in allocating and de-allocating partitions,<br />

as the possibility for the introduction of memory leaks remains.<br />

B. Memory Leak Detection<br />

The potential for programmer error resulting in a memory<br />

leak when using partition pools is recognized by vendors of<br />

real time operating systems. Typically, a profiler tool is<br />

available which assists with the location and rectification of<br />

such bugs.<br />

VII. REAL TIME MEMORY SOLUTIONS<br />

Having identified a number of problems with dynamic<br />

memory behavior in real time systems, a better approach can<br />

be proposed.<br />

A. Dynamic Memory<br />

It is possible to use partition memory allocation to<br />

implement malloc() in a robust and deterministic fashion.<br />

The idea is to define a series of partition pools with block sizes<br />

in a geometric progression; e.g. 32, 64, 128, 256 bytes. A<br />

malloc() function may be written to deterministically select<br />

the correct pool to provide enough space for a given allocation<br />

request. This approach takes advantage of the deterministic<br />

behavior of the partition allocation API call, the robust error<br />

handling (e.g. task suspend) and the immunity from<br />

fragmentation offered by block memory.<br />
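Such a pool-backed malloc() might be sketched as follows. The pool bookkeeping here is a deliberately simplified stand-in, not the Nucleus API; for brevity every slot occupies 256 bytes regardless of its pool's block size, and a production version would record the owning pool in a hidden header so that freeing is O(1):

```c
#include <stddef.h>

/* Pools with block sizes in geometric progression, as described above. */
#define NUM_POOLS       4
#define BLOCKS_PER_POOL 8

static const size_t pool_block_size[NUM_POOLS] = {32, 64, 128, 256};
static unsigned char pool_mem[NUM_POOLS][BLOCKS_PER_POOL][256];
static unsigned char pool_used[NUM_POOLS][BLOCKS_PER_POOL];

/* Bounded search: at most BLOCKS_PER_POOL iterations, hence deterministic. */
static void *pool_alloc(int p)
{
    for (int b = 0; b < BLOCKS_PER_POOL; b++)
        if (!pool_used[p][b]) {
            pool_used[p][b] = 1;
            return pool_mem[p][b];
        }
    return NULL;                       /* pool exhausted */
}

void *det_malloc(size_t size)
{
    for (int p = 0; p < NUM_POOLS; p++)       /* smallest block size that fits */
        if (size <= pool_block_size[p])
            return pool_alloc(p);
    return NULL;                              /* request larger than any pool */
}

void det_free(void *ptr)
{
    for (int p = 0; p < NUM_POOLS; p++)
        for (int b = 0; b < BLOCKS_PER_POOL; b++)
            if ((void *)pool_mem[p][b] == ptr) {
                pool_used[p][b] = 0;
                return;
            }
}
```

Because only fixed-size blocks are ever handed out, this allocator cannot fragment; its only failure mode is genuine pool exhaustion.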

VIII. CONCLUSIONS<br />

C and C++ use memory in various ways, both static and<br />

dynamic. Dynamic memory includes stack and heap.<br />

Dynamic behavior in embedded real time systems is<br />

generally a source of concern, as it tends to be nondeterministic<br />

and failure is hard to contain.<br />

Using the facilities provided by most real time operating<br />

systems, a dynamic memory facility may be implemented which is deterministic, immune from fragmentation and offers good error handling.<br />



Optimized – Cost Effective Implementation of<br />

Widely-Used Safety Mechanisms in Heterogeneous<br />

Software Architectures<br />

Esam Mamdouh<br />

Functional Safety Department<br />

eJad L.L.C<br />

Cairo, Egypt<br />

Esam.Mamdouh@ejad.com.eg<br />

Abstract— Functional safety is a key player in the<br />

development of Advanced Driver Assistance Systems (ADAS).<br />

Most ADAS software architectures are developed on either<br />

multi-core targets or multi-chip processors, both of which can<br />

be considered heterogeneous software architectures.<br />

Heterogeneous Software Architectures require special attention in order to utilize the available<br />

software capabilities to implement the safety recommendations<br />

defined by ISO 26262. Following these recommendations in such<br />

complex software architectures has become a major challenge<br />

facing the developers of safety critical applications. Current<br />

methodologies for deploying the safety critical features mainly<br />

rely on component redundancy with extra development time and<br />

effort. This paper will introduce an optimized – cost effective<br />

implementation of safety critical features. The main idea of the<br />

presented approaches is to simplify the implementation of the<br />

safety critical features by utilizing the available capabilities of the<br />

applicable system. These approaches are applied on a case study<br />

in the automotive industry for a Medium Range Radar<br />

application where it is classified as a safety critical application.<br />

The results of this approach show a significant performance<br />

improvement on the multi-core target/processor, and emphasize<br />

the cost saved by avoiding duplicated component development<br />

for safety critical features.<br />

Keywords— ISO 26262; Functional Safety; Multi-core; ADAS;<br />

MPU; IPC; Flow Control Monitoring; Watchdog;<br />

I. INTRODUCTION<br />

Heterogeneous Software Architectures require special<br />

attention in order to utilize the available software capabilities to<br />

implement the additional safety requirements as requested by<br />

the standard ISO 26262 [1]. Some of these additional safety<br />

requirements are used to tolerate some failures in the system. In<br />

this case, they are called safety mechanisms and normally<br />

defined in the Technical Safety Concept (TSC) during the<br />

safety analysis phase according to part 4 of [1]. This paper<br />

explains how the commonly used safety mechanisms such as<br />

Flow Control Monitoring, Memory Protection and Stack<br />

Protection are implemented in a multi-core platform whose<br />

Hossam H. Abolfotuh<br />

Functional Safety Department<br />

eJad L.L.C<br />

Cairo, Egypt<br />

Hossam.Abolfotuh@ejad.com.eg<br />

system’s functions originally do not require multi-tasking on<br />

all cores (e.g. a simple scheduler may be enough) and hence a<br />

multi-core OS is not required. In the proposed solution, only an<br />

ASIL single-core OS is used on one core, while the other two<br />

cores do not need an OS, which saves the high cost of an ASIL<br />

multi-core OS.<br />

Normally, the mentioned safety mechanisms are implemented<br />

through multiple instances of the safety critical components,<br />

such as one watchdog per core; in other cases they are achieved<br />

by complex techniques that rely on OS communication overhead,<br />

which adds CPU load and degrades system performance.<br />

This research starts by demonstrating the most popular safety<br />

mechanism, Flow Control Monitoring, which is applied in most<br />

ADAS systems, and discusses how to avoid the duplicated effort<br />

of using multiple watchdog instances. It then presents an<br />

effective implementation of Memory Protection for proper<br />

software partitioning between ASIL-x and QM components,<br />

followed by a smart implementation of Stack Protection built on<br />

MPU functionality. Finally, these safety mechanisms are applied<br />

in a case study of a Medium Range Radar application in the<br />

automotive industry, illustrating the improved results of the<br />

proposed solutions.<br />

II. FLOW CONTROL MONITORING<br />

A. Current implementation and challenges<br />

The first widely used safety mechanism is the Flow Control<br />

Monitoring. Its main purpose is to ensure the correct execution<br />

of the program sequence. As shown in Fig. 1, the current<br />

implementation for performing a flow control monitoring on a<br />

multi-core platform is typically achieved using multiple<br />

instances of ASIL watchdog stack for each core in order to<br />

implement aliveness supervision and logical supervision as<br />

described by the AUTOSAR standard; this is actually an<br />

expensive solution as it requires a perfect synchronization<br />

between the multiple watchdog instances to report the final<br />

status of the system accurately and on time.<br />



Fig. 1. Multiple instances of Watchdog stack on a Tri-Core platform<br />

B. Proposed Solution<br />

In the suggested proposal, the watchdog stack is only<br />

deployed on the first core (the one having an OS with the<br />

required ASIL) and handles the flow control monitoring on the<br />

other two cores by utilizing the existing watchdog module of<br />

the first core. This is achieved by implementing a simplified<br />

flow control monitoring with the basic required functions on<br />

the other two cores. The implementation includes the<br />

definition of the necessary check points in the program<br />

sequence running on these two cores; then it reports the status<br />

of these check points to the main watchdog stack on the first<br />

core over the Inter-Processor Communication (IPC) as<br />

illustrated in Fig. 2.<br />

The first core then calculates the status of the supervised<br />

entities of the other cores received over the IPC and reports the<br />

overall status to the system. In case of any detected violation,<br />

the system will enter the relevant safe state or perform reset<br />

according to what is described in the TSC.<br />

This solution can be generalized to cover the flow control<br />

monitoring in a multi-chip system (e.g., microcontroller and<br />

DSP) relying on inter-chip communication (e.g., SPI<br />

communication) instead of IPC.<br />
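The checkpoint reporting scheme can be modeled in a few lines of C, with the IPC represented as a shared ring buffer; all names here are illustrative, not an AUTOSAR or production interface:

```c
#include <stdint.h>

/* Shared ring buffer standing in for the IPC channel. */
#define CP_QUEUE_LEN 16

static volatile uint8_t cp_queue[CP_QUEUE_LEN];
static volatile unsigned cp_head, cp_tail;

/* Called on a supervised core at each checkpoint in the program sequence. */
void cp_report(uint8_t checkpoint_id)
{
    cp_queue[cp_head % CP_QUEUE_LEN] = checkpoint_id;
    cp_head++;
}

/* Called on the core running the ASIL watchdog stack. Verifies that the
   checkpoints arrived in the configured order: 0 = OK, -1 = a checkpoint
   is missing (aliveness violation), -2 = wrong order (logical violation). */
int cp_check_sequence(const uint8_t *expected, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        if (cp_tail == cp_head)
            return -1;
        if (cp_queue[cp_tail % CP_QUEUE_LEN] != expected[i])
            return -2;
        cp_tail++;
    }
    return 0;
}
```

On a violation, the first core would report the failure to the main watchdog stack, which then drives the system into the safe state or reset defined in the TSC.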

III. MEMORY PROTECTION<br />

A. Current implementation and challenges<br />

Another commonly used safety mechanism is the Memory<br />

Protection which is used to protect the critical memory<br />

partitions that contain the critical data identified by the safety<br />

analysis. In the mixed ASIL software architecture, there is a<br />

possible risk may be caused by an unauthorized accesses from<br />

the QM partition on the ASIL partition. This an unauthorized<br />

access may corrupt the ASIL data and hence leads to a safety<br />

goal violation. Therefore the MPU is typically handled by an<br />

OS with at least Scalability Class 3 (SC3) to support the<br />

software partitioning for mixed ASIL software architecture.<br />

This solution requires an OS on all cores which in turn<br />

acquires an expensive multi-core OS license. On the other hand<br />

it will degrade the performance due to the overhead of Inter-OS<br />

Communication (IOC) used in context switching between QM<br />

partition and ASIL partition.<br />

B. Proposed Solution<br />

To avoid such a complex implementation, it is proposed to<br />

develop a Safety Element out of Context (SEooC) MPU driver<br />

to be used on all cores taking into consideration the different<br />

compiler options of each core. This MPU driver provides a<br />

simple interface that allows the application, through a simple<br />

wrapper called the “Memory Protection Wrapper”, to switch the<br />

MPU device ON/OFF according to the safety level context<br />

change. The software architecture proposed for the memory<br />

protection is illustrated in Fig. 3.<br />

This is valid mainly when there are only two safety levels (e.g., QM and ASIL-x), which is the common case in mixed-ASIL software architectures. In other words, it restricts the access of the lower-ASIL software to the memory partitions belonging to the higher-ASIL software.<br />
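With only two safety levels, the wrapper reduces to a simple ON/OFF switch around the context change. The sketch below assumes a hypothetical driver interface (`Mpu_Enable`/`Mpu_Disable`) standing in for the SEooC MPU driver; the names and the enum are illustrative, not taken from the paper.<br />

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the low-level SEooC MPU driver (hypothetical names). */
static bool mpu_enabled;
static void Mpu_Enable(void)  { mpu_enabled = true;  }
static void Mpu_Disable(void) { mpu_enabled = false; }

typedef enum { CTX_QM, CTX_ASIL } SafetyContext;

/* Memory Protection Wrapper: with only two safety levels, entering the
 * QM context arms the MPU so QM code cannot write the ASIL partitions;
 * the trusted ASIL context runs with the MPU switched off.            */
void MemProtWrapper_SwitchContext(SafetyContext next)
{
    if (next == CTX_QM) {
        Mpu_Enable();   /* protect ASIL partitions from QM accesses */
    } else {
        Mpu_Disable();  /* trusted ASIL-x context */
    }
}

bool MemProtWrapper_IsMpuOn(void) { return mpu_enabled; }
```

This is what Fig. 4 later shows as the simplified MPU ON/OFF switch at each ASIL context change.<br />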

Fig. 3. Proposed software architecture of the memory protection<br />

Fig. 2. Single instance of Watchdog stack with simplified flow control<br />

monitoring<br />

806


IV. STACK PROTECTION<br />

A. Current implementation and challenges<br />

Stack Protection is typically realized using an OS that provides a separate stack for each task or interrupt. Thus, the OS of the first core is responsible for protecting its stacks as configured. For the other cores, due to the non-preemptive nature of their tasks, a single stack can be considered safe, so no multi-core OS is needed to protect the stacks on those cores. This single stack is usually protected against stack overflow by placing magic-number patterns at its border, which are checked periodically by software. This continuous check adds overhead to the software processing, consumes a considerable amount of CPU load, and affects the overall system performance.<br />
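The conventional magic-number check described above can be sketched as follows. The guard value, stack size, and function names are hypothetical; a real implementation would place the pattern at the linker-defined stack border and run the check from a periodic task.<br />

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STACK_WORDS   64u
#define MAGIC_PATTERN 0xDEADBEEFu   /* hypothetical guard value */

/* Simulated single stack; index 0 models the border guard word. */
static uint32_t stack_area[STACK_WORDS];

void stack_guard_init(void)
{
    stack_area[0] = MAGIC_PATTERN;  /* pattern placed at the border */
}

/* Periodic software check: if the pattern has been overwritten, a
 * stack overflow has occurred.  Running this continuously is the CPU
 * load the proposed MPU-based solution eliminates.                   */
bool stack_guard_intact(void)
{
    return stack_area[0] == MAGIC_PATTERN;
}
```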

B. Proposed Solution<br />

A solution was proposed that takes advantage of the MPU feature limiting accesses across different cores. Using this feature, the stack of each core is located at the top border of that core&#8217;s memory space, so that a stack overflow is treated as an unauthorized write attempt into the memory space of the other core. A memory access violation is thus detected, and the safety reaction causes a reset.<br />
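The MPU decision that turns an overflow into a detected violation can be modelled as a simple per-core range check. The region base/limit addresses below are invented for illustration; on the real device they would come from the memory map of Fig. 5 and be programmed into the MPU region descriptors.<br />

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-core RAM windows: each core may only write inside
 * its own region; anything else is an MPU access violation.          */
typedef struct { uint32_t base; uint32_t limit; } CoreRegion;

static const CoreRegion core_region[2] = {
    { 0x40000000u, 0x40010000u },   /* core 1 RAM (example addresses) */
    { 0x40010000u, 0x40020000u },   /* core 2 RAM (example addresses) */
};

/* Model of the MPU check for a write by `core` to address `addr`.
 * Because each stack sits at the top border of its core's region,
 * growth past the border lands in the neighbouring core's region and
 * is rejected, triggering the reset reaction in hardware.            */
bool mpu_write_allowed(unsigned core, uint32_t addr)
{
    return addr >= core_region[core].base && addr < core_region[core].limit;
}
```

Unlike the magic-number scheme, this check costs no CPU cycles: the MPU hardware performs it on every bus access.<br />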

V. CASE STUDY<br />

In the case study, the mentioned safety mechanisms were implemented on a tri-core target, the &#8220;MPC5774-RaceRunner&#8221; [2], with the ASIL-D Micro Controller Abstraction Layer (MCAL) provided by NXP. The application deployed on the target is a Medium Range Radar application, in which the front radars are rated ASIL-B at the software level.<br />

The software architecture is based on AUTOSAR version 3.2.1, complemented by ASIL-B stacks from Vector such as WdgM, E2E, and SafeOS.<br />

In the following three sub-sections, the implementation of the mentioned safety mechanisms is illustrated in the light of the Medium Range Radar application.<br />

A. Flow Control Monitoring<br />

The first core is supplied with an ASIL-B single-core OS and an AUTOSAR package including the ASIL-B watchdog stack used for flow control monitoring on that core, while the checkpoint statuses of the other two cores are communicated through the IPC as defined in section II.<br />

The configuration of the watchdog stack on the main core includes two additional checkpoints for special IPC messages containing the checkpoint statuses of the other two cores.<br />

B. Memory Protection<br />

The other two cores do not need an OS because they do not have any preemptive tasks, so the optimized approach discussed above can be applied without impacting the safety aspects. According to the ASIL decomposition at the system level, there are only two safety levels (QM, ASIL-B) defined in the software architecture; therefore the simplified memory protection solution explained in section III can be implemented as shown in Fig. 4.<br />

Fig. 4. Context ASIL change using the simplified MPU ON/OFF switch<br />

Fig. 5. Stack location for Core1 and Core2 of MCU 'MPC5774-RaceRunner’<br />

C. Stack Protection<br />

The single stacks of the other two cores (Core1 and Core2) are located at the top boundaries of their memory maps, so that the MPU can detect any stack overflow and perform an MCU reset in hardware. An additional memory hole is inserted at the bottom of each stack and configured in the MPU as a restricted memory area to detect stack underflow. The memory layout of the MCU MPC5774-RaceRunner is illustrated in Fig. 5.<br />

VI. SUMMARY/CONCLUSIONS<br />

The major advantage of the proposed solutions is that they are cost-effective alternatives, using only one ASIL OS instead of a multi-core ASIL OS. The proposed flow control monitoring solution saves the effort of developing and configuring multiple instances of the ASIL watchdog stack for the other cores. For the memory protection, the proposed solution performs better because it avoids the IOC overhead caused by using an OS configured with SC3. Finally, the proposed stack protection requires no development effort and places no processing load on the CPU.<br />

VII. REFERENCES<br />

[1] International standard, “Road Vehicles – Functional Safety”, ISO<br />

Standard 26262, first edition, Nov. 2011.<br />

[2] NXP Datasheet, “MPC5775K Reference Manual”, Document Number:<br />

MPC5775KRM, Rev. 2, 2/2014.<br />

www.embedded-world.eu<br />



Design security into your code. Don’t just hope to<br />

remove insecurity<br />

Mark A. Pitchford<br />

Technical Specialist<br />

LDRA<br />

Wirral, UK<br />

mark.pitchford@ldra.com<br />

I. INTRODUCTION<br />

If someone constructed a suspension bridge by guessing at<br />

steel cabling sizes and then loading the deck to see whether it<br />

collapsed, you would be unlikely to suggest that he was a<br />

great civil engineer. And if a lift manufacturer sized their<br />

motors by trying them to see whether they caught fire, you<br />

wouldn’t expect their electrical engineers to win many<br />

awards.<br />

And yet these approaches are exactly analogous to how<br />

security critical software developers often approach their<br />

work.<br />

The development cycle for traditional security markets is a<br />

largely reactive one, where code is developed mostly on an<br />

informal agile basis, with no risk mitigation and no coding<br />

guidelines. The resulting executables are then subjected to<br />

performance, penetration, load and functional tests to attempt<br />

to find the vulnerabilities that almost certainly result. The<br />

hope, rather than the expectation, is that all issues will be<br />

found and the holes adequately plugged.<br />

In short, this paper challenges secure software developers to embrace the concept that it is far better to design in security than to hope to remove insecurity.<br />

II. THE TRADITIONAL APPROACH TO ENTERPRISE<br />

SOFTWARE SECURITY<br />

Figure 1 shows an extract from a slide show based on a<br />

popular text book. The book itself is focused on the<br />

development of software for enterprise systems iii , and was<br />

published as recently as 2011. It typifies an approach to<br />

enterprise software development that focuses only on “end<br />

user business requirements”, with no clear regard for system<br />

security or safety. It goes on to place focus on testing only<br />

after the development phase, such that the application is<br />

developed in accordance with a specification and, when it is<br />

completed, it is tested to see whether requirements are met,<br />

and to “eliminate errors or bugs”.<br />

Safety critical software development belongs to a different<br />

world, with a process that would be far more familiar to<br />

exponents of the more traditional engineering disciplines. A<br />

process that consists of defining requirements, creating a<br />

design to fulfil those requirements, developing a product that<br />

is true to the design, and then testing it to show that it is.<br />

This paper argues that whether their product is safety critical<br />

or not, it is time for security critical software developers to<br />

embrace that same, sound engineering lifecycle. In doing so,<br />

it will compare and contrast the difference in focus between<br />

CERT C i ’s application centric approach to the detection of<br />

issues, versus MISRA ii ’s ethos of using design patterns to<br />

prevent their introduction. It will advocate the use of<br />

reactive penetration and load tests to prove that the product is<br />

sound, rather than to find out where it isn’t.<br />

Figure 1: The traditional enterprise development lifecycle,<br />

with a test phase only after development<br />

It is possible, of course, that security could indeed be one of<br />

the &#8220;business requirements&#8221; even if it is not explicitly<br />

highlighted as such. Even assuming that to be the case, it<br />

remains no surprise that many established security test<br />



techniques focus on the “develop first, test later” model<br />

reflected in Figure 1. Penetration testing iv , for example, is an<br />

authorized simulated attack on a computer system, performed<br />

to evaluate the security of that system. The test is performed<br />

to identify both strengths and vulnerabilities – that is, the<br />

potential for unauthorized parties to gain access to the<br />

system's features and data - enabling a full risk assessment to<br />

be completed.<br />

Fuzz testing v is a related technique where large amounts of<br />

data in varying formats are sent to the inputs of an<br />

application. For example, “File Fuzzing” involves taking a<br />

well-formed file, modifying it to introduce fuzz data, and<br />

then driving the program to open the modified file. The<br />

application will then process the fuzz data and its response<br />

can be monitored.<br />
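The file-fuzzing loop described above can be sketched in a few lines. The toy two-byte &#8220;FZ&#8221; format, the parser, and the function names are invented for illustration; real fuzzers mutate genuine file formats and monitor the target process for crashes and hangs.<br />

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy "file" format: 2-byte magic "FZ" followed by payload.  The parser
 * under test must reject malformed input rather than misbehave.        */
bool parse_file(const uint8_t *buf, size_t len)
{
    if (len < 2 || buf[0] != 'F' || buf[1] != 'Z') {
        return false;               /* malformed input rejected */
    }
    return true;
}

/* File fuzzing: start from a well-formed buffer, flip a random byte to
 * introduce fuzz data, then drive the parser with the modified "file"
 * and monitor its response.                                            */
void fuzz_parser(unsigned seed, int iterations)
{
    srand(seed);
    for (int i = 0; i < iterations; i++) {
        uint8_t buf[8] = { 'F', 'Z', 1, 2, 3, 4, 5, 6 };
        buf[rand() % sizeof buf] = (uint8_t)(rand() % 256);  /* fuzz data */
        (void)parse_file(buf, sizeof buf);   /* must not crash or hang */
    }
}
```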

Such techniques, then, fit with the development lifecycle<br />

model advocated in Figure 1. The idea is that armed with<br />

such information, developers and IT engineers can hope to<br />

“plug the gaps” with the aim of ensuring that the system is<br />

adequately secure.<br />

III. SAFE & SECURE APPLICATION CODE DEVELOPMENT<br />

This traditional approach to secure software development is<br />

mostly a reactive one – develop the software, and then use<br />

penetration, fuzz and functional test to expose any<br />

weaknesses. Useful though that is, in isolation it is not good<br />

enough to comply with a functional safety standard such as<br />

DO-178C vi (in the aerospace sector), IEC 62304 vii (medical<br />

devices) or ISO 26262 viii (automotive) which implicitly<br />

demands that security factors with a safety implication are<br />

considered from the outset, because a safety-critical system<br />

cannot be safe if it is not secure.<br />

Using ISO 26262 as an example, Figure 2 illustrates a V-<br />

model with cross-references to both the ISO 26262 standard<br />

and to tools likely to be deployed at each phase in the<br />

development of today’s highly sophisticated and complex<br />

automotive software. This serves as a reference to illustrate<br />

how the introduction of a security perspective impacts each<br />

phase. (Note that other process models such as agile and<br />

waterfall can be equally well-supported.)<br />

Figure 2: Software-development V-model with cross-references<br />

to ISO 26262 and standard development tools<br />

The outputs from the system design phase (top left) include<br />

technical safety requirements refined and allocated to<br />

hardware and software. In a connected system, these will<br />

include many security requirements because the action to be<br />

taken to deal with each safety-threatening security issue needs<br />

to be proportionate to the risk. Hazard analyses are performed<br />

to assess risks associated with safety, whereas threat analyses<br />

identify risks associated with security. Detailed hazard<br />

analysis may involve Fault Tree Analysis (FTA) whereas<br />

threat analysis may consist of Attack Tree Analysis (ATA),<br />

but each contributes key information to the safety case.<br />

Maintaining traceability between these requirements and the<br />

products of subsequent phases can cause a major project<br />

management headache.<br />

The specification of software requirements involves their derivation from the system design, isolating the software-specific elements and detailing the evolution of lower-level, software-related requirements, including those with a security-related element.<br />

Note that the application of such a process does not negate<br />

the value of penetration and fuzz testing. However, it makes<br />

it much more likely that such techniques will provide<br />

evidence of the robustness of systems, rather than being used<br />

to expose their vulnerabilities.<br />

Figure 3: Graphical representation of Control and Data<br />

Flow as depicted in the LDRA tool suite<br />



Next comes the software architectural design phase, perhaps<br />

using a UML graphical representation. Static analysis tools<br />

help here by providing graphical representations of the<br />

relationship between code components for comparison with<br />

the intended design (Figure 3).<br />

Figure 4 illustrates a typical example of a table from ISO<br />

26262-6:2011, relating to software design and<br />

implementation. It shows the coding and modelling<br />

guidelines to be enforced during implementation,<br />

superimposed with an indication of where compliance can be<br />

confirmed with the aid of automated tools.<br />

Figure 4: ISO 26262 coding and modelling guidelines<br />

The “use of language subset” (topic 1b in the table)<br />

exemplifies the impact of security considerations on the<br />

process. Language subsets have traditionally been viewed as<br />

an aid to safety, but security enhancements to the MISRA<br />

C:2012 standard and security-specific standards such as<br />

CWE ix and CERT C reflect an increasing interest in the role<br />

they have to play in combating security issues. These too can<br />

be checked by means of static analysis (Figure 5). Despite<br />

being nominally similar, the underlying ethos can differ<br />

considerably between these language subsets, as discussed<br />

later.<br />

Figure 6: Unit testing with the LDRA tool suite<br />

Figure 6 shows how the software interface is exposed at the<br />

function scope, allowing the user to enter inputs and expected<br />

outputs to form the basis of a test harness. That harness is then<br />

compiled and executed on the target hardware, and actual and<br />

expected outputs compared. Such a technique is useful not<br />

only to show functional correctness in accordance with<br />

requirements, but also to show resilience to issues such as<br />

border conditions, null pointers and default switch cases – all<br />

important security considerations.<br />
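A unit-test harness of the kind described, pairing inputs with expected outputs and exercising boundary values, can be sketched as below. The function under test (`sat_add`) and the table layout are invented for illustration; a tool-generated harness would additionally cross-compile and execute this on the target hardware.<br />

```c
#include <limits.h>
#include <stddef.h>

/* Illustrative function under test: saturating add that must cope
 * with boundary values instead of overflowing.                       */
int sat_add(int a, int b)
{
    if (a > 0 && b > INT_MAX - a) return INT_MAX;
    if (a < 0 && b < INT_MIN - a) return INT_MIN;
    return a + b;
}

/* Minimal harness in the style of tool-generated unit tests: each case
 * holds the inputs and the expected output; actual and expected values
 * are compared when the harness executes.                              */
typedef struct { int a, b, expected; } TestCase;

int run_tests(const TestCase *cases, size_t n)
{
    int failures = 0;
    for (size_t i = 0; i < n; i++) {
        if (sat_add(cases[i].a, cases[i].b) != cases[i].expected) {
            failures++;
        }
    }
    return failures;   /* zero means all actual outputs matched */
}
```

Note that the table deliberately includes the INT_MAX/INT_MIN border conditions the surrounding text calls out as security-relevant.<br />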

In addition to showing that software functions correctly,<br />

dynamic analysis is used to generate structural coverage<br />

metrics. Both MISRA C:2012 (Dir 3.1) and the security<br />

standard CWE (Figure 7) require that code coverage analysis<br />

is used to ensure that there is no hidden functionality<br />

designed to potentially increase an application’s attack<br />

surface and expose weaknesses.<br />

Figure 7: CWE requirement for code coverage analysis<br />

Figure 5: Coding standards violations as represented by the<br />

LDRA tool suite<br />

Dynamic analysis techniques (involving the execution of<br />

some or all of the code) are applicable to unit, integration<br />

and system testing. Unit testing is designed to focus on<br />

particular software procedures or functions in isolation,<br />

whereas integration testing ensures that safety, security and<br />

functional requirements are met when units are working<br />

together in accordance with the software architectural design.<br />

IV. CHOOSING A LANGUAGE SUBSET<br />

Although there are several language subsets (or, less formally, &#8220;coding standards&#8221;) to choose from, these have traditionally focused primarily on safety rather than security. More recently, with the advent of the Industrial Internet of Things, connected cars, and connected heart pacemakers, that focus has shifted towards security, reflecting the fact that such systems, once naturally secure through isolation, are now increasingly accessible to aggressors.<br />



There are, however, subtle differences between the subsets, perhaps reflecting the development dichotomy between designing for security and appending some measure of security to a developed system. To illustrate this, it is useful to compare and contrast the approaches taken by the authors of MISRA C and CERT C with respect to security.<br />

A. Retrospective adoption<br />

MISRA C:2012 x categorically states that “MISRA C should<br />

be adopted from the outset of a project. If a project is building<br />

on existing code that has a proven track record then the<br />

benefits of compliance with MISRA C may be outweighed by<br />

the risks of introducing a defect when making the code<br />

compliant.”<br />

This contrasts in emphasis with the assertion of the CERT C<br />

authors that although “the priority of this standard is to<br />

support new code development…. A close-second priority is<br />

supporting remediation of old code&#8221;.<br />

If static analysis tools are to enforce such rules, the rules must be checkable algorithmically. Compare, for example, the<br />

excerpts shown in Figure 8, both of which address the same<br />

issue. The approach taken by MISRA is to prevent the issue<br />

by disallowing the inclusion of the pertinent construct. CERT<br />

C instead asserts that the developer should “be aware” of it.<br />

Of course, there are advantages in each case. The CERT C<br />

approach is clearly more flexible; something of particular<br />

value if rules are applied retrospectively. MISRA C:2012 is<br />

more draconian, yet by avoiding the side effects altogether the<br />

resulting code is certain to be more portable, and perhaps more<br />

importantly, it can be automatically checked by a static<br />

analysis tool. It is simply not possible for a tool to check<br />

whether a developer is “aware” of side effects – and less<br />

possible still to ascertain whether “awareness” equates to<br />

“understanding”.<br />

Of course, as with the system as a whole, the level of risk involved in a compromise of the system will inform the approaches to be adopted. Certainly, the retrospective<br />

application of any subset is better than nothing, but it does not<br />

represent best practice.<br />

B. Relevance to safety, high integrity and high reliability<br />

systems<br />

MISRA C:2012 “define[s] a subset of the C language in which<br />

the opportunity to make mistakes is either removed or<br />

reduced. Many standards for the development of safety-related<br />

software require, or recommend, the use of a language subset,<br />

and this can also be used to develop any application with high<br />

integrity or high reliability requirements”. The accurate<br />

implication of that statement is that MISRA C was always<br />

appropriate for security critical applications even before the<br />

security enhancements introduced by MISRA C:2012<br />

Amendment 1 xi .<br />

CERT C attempts to be more all-encompassing, covering application programming (e.g. POSIX) as well as the C language. That is reflected in its introductory suggestion that<br />

“safety-critical systems typically have stricter requirements<br />

than are imposed by this standard … However, the application<br />

of this coding standard will result in high-quality systems that<br />

are reliable, robust, and resistant to attack”.<br />

V. DECIDABILITY<br />

The primary purpose of a requirements-driven software<br />

development process as exemplified by ISO 26262 is to<br />

control the development process as tightly as possible to<br />

minimize the possibility of error or inconsistency of any kind.<br />

Although that is theoretically possible by manual means, it<br />

will generally be far more effective if software tools are used<br />

to automate the process as appropriate.<br />

Figure 8: Contrasting approaches concerning the decidability<br />

of coding rules<br />

VI. PRECISION OF RULE DEFINITIONS<br />

The stricter, more precisely defined approach of MISRA not only lends itself to automated checking; it also addresses the issue of language misunderstanding more convincingly than CERT C.<br />

Evidence suggests that there are particular characteristics of<br />

the C language which are responsible for most of the defects<br />

found in C source code xii , such that around 80% of software<br />

defects are caused by the incorrect usage of about 20% of the<br />

available C or C++ language constructs. By restricting the use<br />

of the language to avoid the parts that are known to be<br />

problematic, it becomes possible to avoid writing associated<br />

defects into the code and as a result, the software quality<br />

greatly increases.<br />

This approach also addresses a more subtle issue surrounding<br />

the personalities and capabilities of individual developers.<br />

Simple statistics tell us that of all the C developers in the<br />

world, 50% of them have below average capabilities – and yet<br />

it is very rare indeed to find a development team manager who<br />



would acknowledge that they recruit any such individuals.<br />

More than that, in any software development team, there will be<br />

some who are more able than others and it is human nature for<br />

people not to highlight the fact if there are things they don’t<br />

understand.<br />

Figure 9 uses the handling of variadic functions to illustrate<br />

how this approach differs from that of CERT C. CERT C calls<br />

for developers to “understand” the associated type issues, but<br />

doesn’t suggest how a situation might be handled where a<br />

developer is, despite the best of intentions, harbouring a<br />

misunderstanding.<br />

A counter argument might be that there will be developers<br />

who are very aware of the type issues associated with variadic<br />

functions, who make very good use of them, and who may feel<br />

restricted by the prohibition of their use. However, for highly<br />

safety or security critical systems, MISRA would assert that<br />

because the “opportunity to make mistakes is either removed<br />

or reduced”, that is a price well worth paying.<br />
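The type issue at stake can be made concrete. MISRA C:2012 Rule 17.1 bans the features of `<stdarg.h>` outright, because nothing in a variadic prototype lets the compiler check the count or types of the extra arguments; a wrong count, or a `long` passed where `va_arg` expects an `int`, is undefined behaviour. The sketch below contrasts a variadic sum with a fully checkable fixed-signature alternative; the function names are illustrative.<br />

```c
#include <stdarg.h>
#include <stddef.h>

/* Variadic version: the compiler cannot verify that the caller really
 * passes `n` ints - a mismatched count or type is undefined behaviour
 * that no prototype will catch.  MISRA C:2012 Rule 17.1 therefore bans
 * <stdarg.h> altogether.                                              */
int sum_variadic(size_t n, ...)
{
    va_list ap;
    int total = 0;
    va_start(ap, n);
    for (size_t i = 0; i < n; i++) {
        total += va_arg(ap, int);   /* trusts the caller completely */
    }
    va_end(ap);
    return total;
}

/* Fixed-signature alternative: the array/length pair carries the same
 * information, but every argument type is checked by the compiler and
 * the construct is trivially analysable by tools.                     */
int sum_array(const int *vals, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += vals[i];
    }
    return total;
}
```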

VII. BI-DIRECTIONAL TRACEABILITY<br />

The principle of bi-directional traceability runs throughout<br />

the V-models of standards such as DO-178C, IEC 62304,<br />

and ISO 26262, with each development phase required to<br />

accurately reflect the one before it. In theory, if the exact<br />

sequence of the standard is adhered to, then the<br />

requirements will never change and tests will never throw<br />

up a problem. But life’s not like that.<br />

For example, it is easy to imagine these processes as they<br />

relate to a “green field” project. But what if there is a need<br />

to integrate many different subsystems? What if some of<br />

those are pre-existing, with requirements defined in<br />

widely different formats? What if some of those systems<br />

were written with no security in mind, assuming an<br />

isolated system? And what if different subsystems are in<br />

different development phases?<br />

Then there is the issue of requirements changes. What if<br />

the client has a change of heart? A bright idea? Advice<br />

from a lawyer that existing approaches could be<br />

problematic?<br />

Should changes become necessary, revised code would<br />

need to be reanalysed statically, and all impacted unit and<br />

integration tests would need to be re-run (regression<br />

tested). Although that can result in a project management<br />

nightmare at the time, in an isolated application it lasts<br />

little longer than the time the product is under<br />

development.<br />

Figure 9: Comparing differing precision of rule definition<br />

A. A question of priorities<br />

The correct application of either CERT C or MISRA C:2012<br />

will certainly result in more secure code than if neither were to<br />

be applied. However, for safety or security critical<br />

applications, MISRA C is considerably less error prone both<br />

because it is specifically designed for such systems and as a<br />

result of its stricter, more decidable rules. Conversely, there is<br />

an argument for using the CERT C standard because it is more<br />

tolerant, perhaps if an application is not critical but is to be<br />

connected to the internet for the first time. The retrospective<br />

application of CERT C would then be a pragmatic choice to<br />

make.<br />

Connectivity, with its inherent need for security, changes<br />

all that. Whenever a new vulnerability is discovered, there<br />

is the potential for a resulting change of requirement to<br />

cater for it, coupled with the additional pressure of<br />

knowing that a speedy response could be critically<br />

important if products are not to be compromised in the<br />

field. Indeed, many IoT systems are very difficult to patch<br />

once in service.<br />

Automated bi-directional traceability links requirements<br />

from a host of different sources through to design, code<br />

and test. The impact of any requirements changes – or,<br />

indeed, of failed test cases - can be assessed by means of<br />

impact analysis, and addressed accordingly. Artefacts can<br />

be automatically re-generated to present evidence of<br />

continued compliance to the appropriate standard.<br />

During the development of a traditional, isolated system,<br />

that is clearly useful enough. But connectivity demands<br />

the ability to respond to vulnerabilities, because each<br />

newly discovered vulnerability implies a changed or new<br />

requirement, and one to which an immediate response is<br />

needed – even though the system itself may not have been<br />

touched by development engineers for quite some time. In<br />

such circumstances, being able to isolate what is needed<br />



and automatically test only the functions implemented<br />

becomes something much more significant.<br />

VIII. CONCLUSIONS<br />

The “develop first, test later” development lifecycle so often<br />

applied to enterprise software security is too prone to error<br />

where the application under development is critical in nature.<br />

Sometimes the requirement to do more is implicit in the safety<br />

implications if security is breached, but the same principle<br />

applies even when the concern centres on only the sensitivity<br />

of data. Happily, there are numerous examples of functional safety process standards such as ISO 26262 in the automotive<br />

industry, DO-178C in aerospace, and IEC 62304 in medical<br />

devices, and these provide a more stringent model for<br />

developers of security critical applications to adopt.<br />

These functional safety standards require the use of coding<br />

rules, and those specified by standards such as CERT C and MISRA<br />

C:2012+AMD 1 are designed for use in secure software<br />

development. MISRA’s mission statement to “…provide<br />

world-leading, best practice guidelines for the safe<br />

application of both embedded control systems and standalone<br />

software” contrasts with CERT C’s wider remit, and so<br />

MISRA C:2012 perhaps lends itself better to highly critical<br />

applications especially in view of the fact that more of its rules<br />

are designed to be automatically decidable by static analysis<br />

tools.<br />

COMPANY DETAILS<br />

LDRA<br />

Portside<br />

Monks Ferry<br />

Wirral CH41 5LH<br />

United Kingdom<br />

Tel: +44 (0)151 649 9300<br />

Fax: +44 (0)151 649 9666<br />

E-mail:info@ldra.com<br />

Presentation Co-ordination<br />

Mark James<br />

Marketing Manager<br />

E:mark.james@ldra.com<br />

Presenter<br />

Mark Pitchford<br />

Technical Specialist<br />

E:mark.pitchford@ldra.com<br />

The nature of the connected system means that the software<br />

development lifecycle effectively continues after product<br />

release. Tools designed to support bi-directional traceability<br />

during development provide the ideal platform to ensure that<br />

responses to security breaches are as rapid as possible, and<br />

that the resulting modified codebase is as compliant to<br />

standards as the version initially released.<br />

i. SEI CERT C Coding Standard, https://www.securecoding.cert.org/confluence/display/c/SEI+CERT+C+Coding+Standard<br />
ii. MISRA &#8211; The Motor Industry Software Reliability Association, https://www.misra.org.uk/<br />
iii. Business Driven Information Systems, Paige Baltzan, McGraw-Hill Education, 2011<br />
iv. TechTarget Definition: pen test (penetration testing), http://searchsoftwarequality.techtarget.com/definition/penetration-testing<br />
v. TechTarget Definition: fuzz testing (fuzzing), http://searchsecurity.techtarget.com/definition/fuzztesting<br />
vi. RTCA DO-178C, &#8220;Software Considerations in Airborne Systems and Equipment Certification&#8221;, prepared by SC-205, December 13, 2011<br />
vii. IEC 62304, Medical device software &#8211; Software life cycle processes, Consolidated Version, Edition 1.1, 2015-06<br />
viii. ISO 26262, Road vehicles &#8212; Functional safety &#8212; Part 6: Product development at the software level<br />
ix. CWE &#8211; Common Weakness Enumeration, https://cwe.mitre.org/<br />
x. MISRA C:2012, Guidelines for the use of the C language in critical systems, March 2013<br />
xi. MISRA C:2012 &#8211; Amendment 1: Additional security guidelines for MISRA C:2012, ISBN 978-906400-16-3 (PDF), April 2016<br />
xii. Applying the 80:20 Rule in Software Development, Jim Bird, Nov 15 2013, https://dzone.com/articles/applying-8020-rule-software<br />



Partitioning of Algorithms for Distributed<br />

Computation<br />

Andreas Rechberger<br />

Institute of Technical Informatics<br />

Graz University of Technology<br />

Graz, Austria<br />

Eugen Brenner<br />

Institute of Technical Informatics<br />

Graz University of Technology<br />

Graz, Austria<br />

Abstract&#8212; Early evaluation of the computing needs is a crucial step when developing embedded systems. Providing measurable metrics for the performance demanded to implement a specified algorithm involves a large amount of target dependency. This paper aims at providing a generalized method that can be applied before mapping the algorithm onto dedicated hardware. The goal is to quantify the runtime with reasonable accuracy when the algorithm is applied to a dedicated hardware architecture. We focus on the analysis and extraction steps of this process and discuss their challenges in transforming the algorithm implementation into a form suited for distribution analysis. Finally, some basic methods for hardware mapping of the generalized algorithm are presented.<br />

Keywords—algorithm, partitioning, compiler, LLVM, data flow<br />

graph<br />

I. INTRODUCTION<br />

When designing embedded systems it is vital to determine the required computing power in advance. With today's<br />
micro-controllers, signal processors and programmable logic the main problem usually is not obtaining sufficient<br />
processing capability, but choosing the appropriate computing platform.<br />

The majority of embedded systems do not operate in an isolated stand-alone environment, but rather communicate and<br />
interact with others. This holds true at system scale, where for example multiple Internet of Things (IoT) sensors<br />
cooperate, as well as at intra-device scale, where a general purpose controller teams up with digital signal<br />
processors (DSPs) for wireless communication and data acquisition.<br />

Within section II a method to analyse the data flow and<br />

processing needs of arbitrary algorithms is described. Such<br />

an analysis is one of the initial steps required in order to<br />

distribute the various processing tasks to the components<br />

within a system.<br />

By combining the results of the dynamic and static analysis<br />

the data flow as well as the control flow graph can be extracted.<br />

Section III describes the challenges that occur when extracting<br />

the computational effort and data amount out of the<br />

analysis data.<br />

Examples for such challenges are:<br />

• Dealing with aspects of the static single assignment (SSA) form of the intermediate language in the presence of<br />
loop constructs (Φ-nodes).<br />

• Handling of code sequences that do not contribute to<br />

the algorithm itself, such as calling subroutines with the<br />

required parameter passing.<br />

• Decomposing of data access to aggregate structures (arrays).<br />

With these techniques applied the impact of the compiler's optimization level can be minimized. The goal is to<br />
achieve similar results whether or not the compiler performs aggressive function inlining.<br />

Finally a proof of concept tool has been implemented which is able to automatically process an application, identify<br />
the function to be analysed, run the static and dynamic analysis and generate a combined data and control flow graph.<br />

With this graph a reasonable metric of the computational<br />

effort, as well as the required processing data for each part<br />

of the algorithm can be given. This provides the required<br />

quantitative data material for formulating a suitable system<br />

partitioning. Such a partition might be on a macroscopic scale, like partitioning computation between a web server<br />
and a (resource limited) embedded client with constrained data flow in between, or on a microscopic scale, like a<br />
general purpose<br />

CPU paired with a DSP.<br />

II. ANALYSIS METHOD<br />

Prior to analysing the data flow dependencies of an algorithm, it has to be formulated in a machine readable manner.<br />
This usually means expressing the sequence of computations in a suitable programming or scripting language. While<br />
data flow analysis could operate directly on the engineer's input (for example by analysing the C++, C or Matlab<br />
code, or the mathematical formulas), this approach does not allow dynamic analysis to be performed.<br />
<br />
Especially for descriptions in imperative languages even very simple constructs - like iterating over all elements<br />
of a data array - cannot always be satisfactorily analysed statically. For the previous example (iterating over a<br />
data array) this would require the data size to be a compile time constant.<br />



While almost all languages support idioms for compile time computations, for reasons of simplicity it is not always<br />
desirable to formulate the code in such a way. In the case of the C++ language, completely determining all template<br />
parameters and propagated constants basically requires a fully functional compiler front-end.<br />

Modern compilers for high level languages aim to separate<br />

the language front end, the optimization stages and the code<br />

generation. This simplifies the handling of multiple language<br />

front ends for similar languages like various dialects of C or<br />

C++ as well as handling significantly different languages like<br />

Ada, Java and C++ or Go in a single compiler project.<br />

The GNU compiler collection (GCC front ends) supports<br />

a wide collection of languages (common ones like C/C++<br />

and Java, as well as less prominent ones like Ada, Pascal,<br />

Mercury, Cobol, Go or Modula-2). The GCC suite uses several<br />

internal code representation formats called GIMPLE, RTL and<br />

GENERIC [1]. Other high level compilers, like the LLVM project compiler, use a single intermediate language [2]. The<br />
intermediate languages used within the optimization phase, and as such those which are handed over to the code<br />
generators, usually have static single assignment (SSA) form. This has shown to be beneficial for various<br />
optimization techniques, like dead code elimination, constant propagation and variable range analysis.<br />

In order to maintain the benefits of using a well known language, and to be able to utilize the front end processing<br />
and tooling already present, the analysis is suitably based on a representation different from the language used to<br />
describe the algorithm. As basically all high level languages (such as C++, Java, C#) are transformed into a<br />
generalized intermediate representation by a standard compilation flow [3], a method operating on this level of<br />
abstraction is beneficial.<br />

Not only does this exempt the algorithmic analysis from the details of the front end language, it also allows<br />
combining all front end languages the compiler is able to handle [4], [5].<br />

Based on the intermediate representation the algorithm can be<br />

analysed dynamically by means of executing an instrumented<br />

binary in addition to a purely static analysis.<br />

Instrumentation<br />

As code base for the analysis tool the compiler framework of the LLVM project has been used, specifically the C++<br />
front end. While the control flow graph can be trivially extracted via static analysis, for a dynamic analysis the<br />
algorithm has to be executed. Within the control flow graph the algorithm is decomposed into a set of elementary<br />
blocks, which are connected via conditional or unconditional branches. Within the LLVM assembly language such an<br />
elementary block is called a BasicBlock. Simplified, this is a sequence of arbitrary instructions terminated by a<br />
branch instruction which denotes the next block to be executed.<br />

As example a very simple function (Listing 1) which computes the sum of all elements in an array is used. This<br />
operation is called foldl (left fold) [6]. In contrast to the<br />

entry:<br />
br label %for.cond<br />
<br />
for.cond:<br />
%sum.0 = phi i32 [ 0, %entry ], [ %add, %for.body ]<br />
%i.0 = phi i64 [ 0, %entry ], [ %inc, %for.body ]<br />
%exitcond = icmp eq i64 %i.0, 3<br />
br i1 %exitcond, label %for.cond.cleanup, label %for.body<br />
<br />
for.body:<br />
%arrayidx = getelementptr inbounds i32, i32* %array, i64 %i.0<br />
%0 = load i32, i32* %arrayidx, align 4, !tbaa !3<br />
%add = add nsw i32 %0, %sum.0<br />
%inc = add nuw nsw i64 %i.0, 1<br />
br label %for.cond<br />
<br />
for.cond.cleanup:<br />
ret i32 %sum.0<br />
<br />
Fig. 1. Control Flow Graph for Array Sum<br />

given C code, other programming languages might have built-in support for this operation (like Haskell), or<br />
implement it via library functions (std::accumulate in C++). The control flow of the intermediate language<br />
representation (Listing 2) consists of four blocks (Fig. 1).<br />

Listing 1<br />
FOLDL C CODE<br />
<br />
1 #define N 3<br />
2 int TestFunction(const int array[N])<br />
3 {<br />
4     int sum = 0;<br />
5     for (size_t i = 0; i < N; ++i)<br />
6         sum += array[i];<br />
7     return sum;<br />
8 }<br />
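The std::accumulate library function mentioned above expresses the same left fold as Listing 1; a minimal sketch (the wrapper name SumArray is ours, not from the paper):

```cpp
#include <cassert>
#include <iterator>
#include <numeric>

// Library-level left fold equivalent to Listing 1: std::accumulate
// walks the range once, threading the running sum through exactly
// the adder chain that appears in the data flow graph of Fig. 2.
int SumArray(const int (&array)[3])
{
    return std::accumulate(std::begin(array), std::end(array), 0);
}
```

The compiler lowers both formulations to essentially the same for.cond/for.body loop shown in Fig. 1.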


1) Load the source module (LLVM assembly language code)<br />
2) Instrument the module's functions<br />
3) Execute the module's entry function<br />
4) Post-process the generated tracking data<br />

As the execution of the module to be analysed takes place within the thread context of the analysis tool, its<br />
functions and global variables are shared. For the tracking a single function call is inserted at the very beginning<br />
of each BasicBlock.<br />

For reasons of simplicity within the instrumentation code generator this is done in a two-step approach. The<br />
function call inserted into the module operates with untyped memory addresses. It conveys the address of a second<br />
level function as well as a reference to the tracking instance and the first instruction of the block to be executed.<br />

While using raw memory addresses (or void pointers) is generally considered bad practice, it allows the type<br />
information of the algorithm to be analysed and the tracking module to be fully decoupled. The first level function<br />
(called SpringBoard) simply re-generates the type information and invokes the corresponding trace function within<br />
the tracker, while the second level function (called Trampoline) calls the actual tracking functions of the<br />
instrumentation handler's object instance (CodeTracker_t).<br />
<br />
The first level function is depicted in Listing 3.<br />

[Graph omitted: the input array is decomposed into the slice nodes array_Slice_0 .. array_Slice_2, which feed the<br />
adder chain add (0 | 5) → add (1 | 12) → add (2 | 19) → ret (3 | 24).]<br />
Fig. 2. Data Flow Graph for Array Sum<br />


Listing 3<br />

FIRST LEVEL TRACKING FUNCTION (SPRING BOARD)<br />

#include <cstdint><br />
<br />
namespace CodeTracker<br />
{<br />
    class CodeTracker_t;<br />
}<br />
namespace llvm<br />
{<br />
    class Instruction;<br />
}<br />
<br />
typedef int (*Trampoline)(CodeTracker::CodeTracker_t*, llvm::Instruction*);<br />
<br />
extern "C"<br />
{<br />
    void TrackBasicBlock_SpringBoard(uint64_t FctPtr, uint64_t me, uint64_t InstrcPtr)<br />
    {<br />
        Trampoline trampoline = reinterpret_cast<Trampoline>(FctPtr);<br />
        trampoline(reinterpret_cast<CodeTracker::CodeTracker_t*>(me),<br />
                   reinterpret_cast<llvm::Instruction*>(InstrcPtr));<br />
    }<br />
}<br />

Regular instructions (for example binary operations (add, sub, mul, ... [7])) can be directly added into the data<br />
flow graph. For data flow and dependency analysis each node is accompanied by some metadata. Within the metadata set<br />
the cycle count (incremented upon each instruction traced) is the most prominent entry. It provides a reliable<br />
mechanism to identify the most recent graph node in case the instruction (or basic block) is executed multiple<br />
times. With this approach the data flow and control flow elements can be composed into a single directed acyclic<br />
graph (DAG). The data flow graph for Listing 1 is shown in Fig. 2.<br />
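The cycle-count bookkeeping described above can be sketched as a minimal tracker; all names here (Tracker, Trace, Latest) are illustrative stand-ins, not the paper's actual CodeTracker_t API:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified tracker: every traced instruction bumps a global cycle
// counter, and each dynamic graph node records the cycle at which it
// was produced.  When the same static instruction executes again, the
// node with the highest cycle is the most recent one.
struct Node
{
    std::string instr;
    unsigned cycle;
};

class Tracker
{
    unsigned cycle_ = 0;
    std::map<std::string, std::vector<Node>> nodes_;
public:
    void Trace(const std::string& instr)
    {
        nodes_[instr].push_back(Node{instr, cycle_++});
    }
    // Most recent dynamic instance of a static instruction.
    const Node& Latest(const std::string& instr) const
    {
        return nodes_.at(instr).back();
    }
    unsigned Cycles() const { return cycle_; }
};
```

Because the counter only ever increments, the resulting cycle values are strictly monotonic, which is exactly the property the text relies on for node identification.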

[Graph omitted: Fig. 3 extends the adder chain of Fig. 2 with the loop house-keeping nodes — the inc increments with<br />
their icmp / br pairs.]<br />
Fig. 3. Control/Data Flow Graph for Array Sum<br />
<br />
Besides the expected adder tree the graph depicts the dissolved elements of the input array (the slice nodes), as well<br />

as a single instruction (add, Listing 1, Line 19) of which the graph nodes are built. The numbers within the<br />
parentheses denote the operation's instruction cycle (second number) and its rank (first number). The instruction<br />
rank denotes a virtual instruction cycle within a fully parallelized execution. A node with a rank of 0 only<br />
requires static inputs in order to be computed, while a node of rank N has at least one input which is of rank N − 1.<br />
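The rank definition above admits a direct recursive computation over the DAG; a sketch under that definition (the Deps map and node names are hypothetical, and a production version would memoize the recursion):

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Maps a compute node to the operand nodes it consumes.  Static
// inputs (constants, array slices) simply do not appear as keys.
using Deps = std::map<std::string, std::vector<std::string>>;

// Rank of a node: 0 if it depends only on static inputs, otherwise
// one more than the highest rank among its compute-node operands.
int Rank(const Deps& deps, const std::string& node)
{
    int r = -1;                        // stays -1 if all inputs are static
    auto it = deps.find(node);
    if (it != deps.end())
        for (const auto& op : it->second)
            if (deps.count(op))        // operand is itself a compute node
                r = std::max(r, Rank(deps, op));
    return r + 1;                      // all-static inputs -> rank 0
}
```

On the adder chain of Fig. 2 this reproduces the ranks shown there: the first add is rank 0 (slices only), each following add is one higher, and ret is highest.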

By inspecting the instruction cycles of the add nodes it is obvious that there are additional instructions executed.<br />
The instruction cycle is therefore guaranteed to be strictly monotonic (hence unique), but not necessarily<br />
contiguous. Extending the value nodes with the operations required to perform the loop house-keeping yields Fig. 3.<br />

This reveals that a second adder tree is required (upper left corner). This adder tree reflects the pointer/index<br />
arithmetic of the loop implementation. Each of the increment operations is followed by an icmp, br (integer compare<br />
and branch) pair used to implement iterating over the input array. This<br />



aspect depends on whether or not the optimization steps of the compiler have unrolled the loop. Within this paper<br />
the optimization has been configured not to unroll loops in order to demonstrate the generic case.<br />

III. DATA FLOW EXTRACTION<br />

For extracting the data flow from the instrumented code execution some extensions to the previously described method<br />
are required. Some of these are caused by the fact that the LLVM assembly language is an SSA based language, while<br />
others are a property of the LLVM assembly language itself.<br />
<br />
The most prominent member of the first group (SSA induced) is the handling of Φ nodes (Listing 2, Line 8-9). These<br />
are required to deal with values produced by one out of multiple possible predecessor blocks. Commonly this pattern<br />
is used to implement loop counters, which are loaded with the value 0 in case the predecessor has been the entry<br />
block, while being assigned the value n + 1 during all other iterations.<br />
being denoted the value n + 1 during all other iterations.<br />

Resolving the Φ nodes to their corresponding value can easily be achieved, provided that the tracking engine keeps a<br />
record of the previously executed basic blocks. This queue is basically similar to a function call stack (a last in<br />
first out (LIFO) queue) but operates on BasicBlock level rather than on function scope. Dealing with function calls<br />
as such is a necessity of the LLVM assembly language, which shares this property with almost all programming<br />
languages. Depending on the optimization settings of the language front-end certain functions will be inlined, but<br />
this process is not reliable enough to relieve the tracking module from handling function calls.<br />
function calls.<br />

For properly embedding the data and control flow of a subroutine into the global DAG the following two tasks need to<br />
be performed. First the input arguments and return values of the called function need to be mapped to the<br />
corresponding nodes within the parent block's scope, and second the function's instructions as such are to be<br />
processed with the proper instruction cycle offset, corresponding to the current instruction cycle count of the<br />
parent function when handling the call instruction. Resolving the value nodes out of the subroutine is a recursive<br />
issue as soon as the depth of the call tree exceeds two. With this approach the function calls transparently vanish<br />
within the analysis graph, as depicted in Listing 4 and Fig. 4, which demonstrate that the arguments a1 and b1 of<br />
the innermost function f1 are resolved to the top level inputs a and b. This behaviour is independent of the<br />
in-lining behaviour (optimization level) of the compiler.<br />
(optimization level) of the compiler.<br />

A. Aggregate Data and Memory Access<br />

When dealing with data arrays, or aggregate data structures<br />

in general, another aspect of the intermediate language has to<br />

be taken into account. Whenever a computational operation is<br />

applied to data stored in an array it is required to explicitly<br />

reference a single entry within the aggregate set. In this context<br />

the generation of the control and data flow graph competes<br />

with the vectorisation optimization passes of the compiler.<br />

As the vectorization capabilities of the front end are usually<br />

Listing 4<br />

RECURSIVE ARGUMENT LOOKUP<br />

int TestFunction(int a, int b) __attribute__((noinline));<br />
extern "C"<br />
{<br />
    int f1(int a1, int b1) { return a1 + b1; }<br />
    int f2(int a2, int b2) { return f1(a2, b2); }<br />
    int f3(int a3, int b3) { return f2(a3, b3); }<br />
    int f4(int a4, int b4) { return f3(a4, b4); }<br />
}<br />
<br />
int TestFunction(int a, int b)<br />
{<br />
    return f4(a, b);<br />
}<br />
<br />
int main(int argc, char* argv[])<br />
{<br />
    return TestFunction(1, 2);<br />
}<br />

[Graph omitted: the inputs a and b feed directly into add (0 | 0) [RecCallTree.cpp:4:42] (add.i.i.i.i), followed by<br />
ret (1 | 1) [RecCallTree.cpp:12:5].]<br />
Fig. 4. Recursive Argument Lookup<br />

limited to a certain width (usually less than 4) there is not much lost by inhibiting the vector optimization in the<br />
front end. Throughout this paper the front end and optimization configuration has been chosen such that the<br />
vectorization passes are disabled. Doing so causes memory access to happen via the pointer arithmetic like scheme of<br />
the getelementptr instruction. This closely resembles the index operator of the C language.<br />

Digging into the details of the C index operator (and as such the getelementptr) reveals that not only an arbitrary<br />
number of arguments is required, but also that resolving the corresponding element of the aggregate structure<br />
requires the run-time values of the arguments. The analysis concept so far performs a dynamic analysis of which<br />
blocks are to be traced (and their order of execution), backed up by a purely static analysis of the block content<br />
itself. Hence the operand data for the array indexing is only available as a reference to the variable or<br />
instruction computing it, but not the actual value it has been assigned when executing the array indexing.<br />

In order to make these values available to the control and data flow extraction engine the dynamic part of the<br />
instrumentation requires some extension. First of all the static processing of the elementary block has to be<br />
interrupted upon reaching an instruction which requires the actual values of its operands. Doing so requires the<br />
instrumentation mechanism to be changed. Rather than simply inserting an informative<br />



callback (“this block is now executed”) at the beginning of each elementary block, and recursively doing so for all<br />
subroutines, the memory access instructions need to be handled. In contrast to a simple informative callback, the<br />
extended tracing function is required to additionally convey the actual values of the arguments.<br />

The fully instrumented code is shown in Listing 5. The numeric arguments of the Track*() functions denote the memory<br />
addresses of the referenced instructions and trampoline functions as described in section II. Their values have been<br />
simplified from the 64-bit address space to low digit decimals for better demonstration. Each elementary block<br />
starts with a regular tracking spring board function as demanded by the previous analysis method, while computing<br />
the memory address (Listing 5, Line 23) requires three steps: pausing the static sweep over the elementary block,<br />
fetching the parameter values and finally assigning them to the proper node in the graph. The fact that there are<br />
two invocations of the springboard function within the loop body (Lines 20, 22) is attributable to the<br />
implementation details of the analysis tool. The first one prepares the value tracking and is interrupted before<br />
evaluating the getelementptr instruction. The second invocation continues the static analysis to the end of the<br />
elementary block (Line 27).<br />
<br />
The nodes for aggregate memory access are labelled as slices in the previous graphs, with each slice representing a<br />
single entry within the aggregate or array. This decomposition allows better separation of the control and data flow<br />
graph in future processing steps. For proper handling of subsequent accesses into the aggregate the index values for<br />
multiple iterations need to be recursively processed.<br />
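The run-time index resolution for getelementptr-style accesses boils down to stride arithmetic; a simplified sketch (GepOffset is a hypothetical helper, not part of the paper's tool):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Byte offset of a getelementptr-style access: the sum of each
// run-time index multiplied by the size of the type it steps over.
// For the flat i32 array of Listing 5 a single index i with stride
// sizeof(i32) = 4 selects slice array_Slice_i.
std::size_t GepOffset(const std::vector<std::size_t>& indices,
                      const std::vector<std::size_t>& strides)
{
    std::size_t offset = 0;
    for (std::size_t d = 0; d < indices.size(); ++d)
        offset += indices[d] * strides[d];   // index * sizeof(stepped type)
    return offset;
}
```

The traced index values from @Track_GEP_i64 feed this arithmetic, mapping each dynamic access to its slice node.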

Listing 5<br />

FOLDL LLVM CODE<br />

1 ; Function Attrs: minsize noinline norecurse nounwind optsize readonly uwtable<br />
2 define i32 @TestFunction(i32* nocapture readonly %array) local_unnamed_addr #0<br />
3 {<br />
4 entry:<br />
5   call void @TrackBasicBlock_SpringBoard(i64 140, i64 516, i64 5167)<br />
6   br label %for.cond<br />
7 <br />
8 for.cond:        ; preds = %for.body, %entry<br />
9   %sum.0 = phi i32 [ 0, %entry ], [ %add, %for.body ]<br />
10  %i.0 = phi i64 [ 0, %entry ], [ %inc, %for.body ]<br />
11  call void @TrackBasicBlock_SpringBoard(i64 140, i64 516, i64 679)<br />
12  %exitcond = icmp eq i64 %i.0, 3<br />
13  br i1 %exitcond, label %for.cond.cleanup, label %for.body<br />
14 <br />
15 for.cond.cleanup:        ; preds = %for.cond<br />
16  call void @TrackBasicBlock_SpringBoard(i64 140, i64 516, i64 3240)<br />
17  ret i32 %sum.0<br />
18 <br />
19 for.body:        ; preds = %for.cond<br />
20  call void @TrackBasicBlock_SpringBoard(i64 140, i64 516, i64 9328)<br />
21  call void @Track_GEP_i64(i64 1406, i64 516, i64 9328, i64 %i.0)<br />
22  call void @TrackBasicBlock_SpringBoard(i64 140, i64 516, i64 9328)<br />
23  %arrayidx = getelementptr inbounds i32, i32* %array, i64 %i.0<br />
24  %0 = load i32, i32* %arrayidx, align 4, !tbaa !3<br />
25  %add = add nsw i32 %0, %sum.0<br />
26  %inc = add nuw nsw i64 %i.0, 1<br />
27  br label %for.cond<br />
28 }<br />

Differing from the example shown in Listing 5, the loop could for example have routed the pointer argument through a<br />
Φ node and kept the index operand constant at the value 1 (Listing 6, Lines 3 and 10). This demands that the<br />
tracking engine perform the arithmetic required to resolve the final index position on its own.<br />

Listing 6<br />

LOOP ALTERNATIVE (INSTRUMENTED)<br />

1 for.cond:        ; preds = %for.body, %entry<br />
2   %Val.addr = phi i32 [ 0, %entry ], [ %add, %for.body ]<br />
3   %First.addr = phi i32* [ %arraydecay.i, %entry ], [ %inc.ptr, %for.body ]<br />
4   %cmp = icmp eq i32* %First.addr, %add.ptr.i<br />
5   br i1 %cmp, label %".exit", label %for.body<br />
6 <br />
7 for.body:        ; preds = %for.cond<br />
8   %0 = load i32, i32* %First.addr, align 4, !tbaa !9<br />
9   %add = add nsw i32 %0, %Val.addr<br />
10  %inc.ptr = getelementptr inbounds i32, i32* %First.addr, i64 1<br />
11  br label %for.cond<br />

B. Detecting Input and Output Arguments<br />

Another challenge is the automated detection of input and output parameters of the function to be analysed.<br />
Automating this process is required when retaining the language front end independence of the analysis tool is a<br />
design goal. As the LLVM assembly language does support constant data types, it can safely be concluded that any<br />
data which is marked as constant within the intermediate language is to be treated as input. However the inverse is<br />
not necessarily the case. This is particularly true as the analysis tool distinguishes between the algorithm's entry<br />
function (usually the main function in a C or C++ program) and the function implementing the algorithm (denoted<br />
TestFunction within this paper). Differentiating between those functions allows the algorithm to conveniently be<br />
developed as a regular application, but prevents the analysis from tracking functionality which shall not contribute<br />
to the performance results - like reading input data from a file.<br />

Therefore, to achieve reliable as well as correct results, the analysis tool determines the input (and output)<br />
parameters dynamically by tracing the data flow. Nodes which only have output edges are considered inputs, nodes<br />
with input edges only as outputs, and nodes with both edge types as combined input/output arguments. It is to be<br />
mentioned that the latter will break the acyclic property of the graph. This classification is done for the<br />
individual elements of aggregates/arrays, such that the later processing is capable of decomposing an array onto<br />
multiple hardware entities.<br />
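The edge-degree classification described above is straightforward to state in code; a sketch (the Role enum and the Internal case for isolated nodes are our additions, not the paper's terminology):

```cpp
#include <cassert>

enum class Role { Input, Output, InputOutput, Internal };

// Classify a graph node by its edge degrees, following the rule
// above: only outgoing edges -> input argument, only incoming
// edges -> output, both on an argument node -> combined
// input/output (which breaks the acyclic property of the graph).
Role Classify(unsigned inEdges, unsigned outEdges)
{
    if (inEdges == 0 && outEdges > 0) return Role::Input;
    if (outEdges == 0 && inEdges > 0) return Role::Output;
    if (inEdges > 0 && outEdges > 0)  return Role::InputOutput;
    return Role::Internal;  // isolated node, not part of the flow
}
```

Applied per array element, this is what allows a single aggregate to be split across multiple hardware entities.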

C. Raising the Abstraction Level<br />

One of the greatest drawbacks of evaluating the algorithm at LLVM assembly language scope is its very low level view<br />
of things. For data flow and sequencing this has shown to be a very effective abstraction. Quite commonly, though,<br />
it is beneficial to raise the abstraction level for certain computations. This can be the case for well known<br />
functions, like computing the Fast Fourier transform (FFT), or if the targeted hardware natively supports them. On<br />
some CPU architectures, for example, this is the case for the square root function (sqrt).<br />

Rather than decomposing each function into its assembly instructions the instrumentation engine handles them similar<br />
to high level instructions. These functions are treated as simple compute nodes, with their corresponding inputs and<br />
outputs. Listing 7 and Fig. 5 demonstrate the method for computing the L2 norm of a real valued vector<br />
(x = √(∑_{i=0}^{N} a_i²)) implemented as a variant of the foldl function of Listing 1. While the<br />



[Graph omitted: each input slice a_Slice_0_0 .. a_Slice_0_2 is squared via an fmul node [cmath:239:15] and<br />
accumulated through the fadd chain fadd (1 | 7), fadd (2 | 14), fadd (3 | 21) [Vector_Norm.cpp:13:42], followed by<br />
sqrtf (4 | 26) [cmath:287:10] and ret (5 | 27) [Vector_Norm.cpp:12:5].]<br />
Fig. 5. VectorNorm Dataflow<br />

std::pow function has been decomposed, and optimized by inlining and special coding within the runtime library<br />
(x² = x · x), the sqrt function has been kept as a high level primitive.<br />

This can also be done at a much different scope, as is the case for a sparsely distributed system with a web server<br />
and a resource limited interface node [8]. In this case a geo-location algorithm (GPS data to address lookup) is<br />
run, with only the server part being capable of running the lookup in full detail (solving the point in polygon<br />
problem for a large set of complex polygons). Such polygons occur when mapping a GPS location to a district with a<br />
certain area, as district boundaries are usually of highly irregular shape.<br />

Listing 7<br />

VECTOR NORM<br />

#include <cmath><br />
#include <numeric><br />
#include <iterator><br />
<br />
template <typename T, size_t N><br />
T TestFunction(const T(&)[N]) __attribute__((noinline));<br />
<br />
template <typename T, size_t N><br />
T TestFunction(const T(& a)[N])<br />
{<br />
    using namespace std;<br />
    return sqrt(accumulate(begin(a), end(a), T(0),<br />
        [](T sum, T next){ return sum + pow(next, 2); }));<br />
}<br />
<br />
float a[] = { 1., 2., 5. };<br />
<br />
int main()<br />
{<br />
    return TestFunction(a);<br />
}<br />

D. Parallelization and Distribution<br />

Decomposing the computation onto several hardware modules requires the control/data flow graph to be divided. For an<br />
initial approach the separation can be based on the graph model itself. Considering that finding the optimal<br />
partitioning with respect to parallelisation is an intricate set of problems, some of which are NP-complete [9],<br />
aiming for a generalized exact and optimal solution is not economic. Also, doing the partitioning with a dedicated<br />
set of computation nodes in mind (an embedded CPU like the ARM Cortex-M4, or a specific digital signal processor<br />
(DSP)) might require further processing of the computation graph prior to partitioning it. The parallelisation width<br />
(the number of operations which can be performed simultaneously) is highly dependent on the architecture of the<br />
hardware. The ARM Cortex-M4 for example is only capable of simultaneously processing a very restricted set of data<br />
types (small integer types of 8 or 16 bits), while a decent DSP usually can handle at least two floating point<br />
instructions in parallel. As such vectorization can significantly improve the performance of the computation, but<br />
does not alter the input and output dependencies. Under the assumption that a good partitioning is dominated by<br />
minimizing the data flow between distinct computation engines, its impact on the partitioning results is limited.<br />

A higher impact, however, can occur if instruction reordering is utilized. Assuming N to be even, (1) and (2)<br />
are mathematically identical, while their data flow graphs are very different: the former consists of a single<br />
large adder tree (as depicted in Fig. 5), while the latter results in two virtually independent adder trees.<br />
Generally, a composition of nodes representing a fold-like instruction (such as this adder tree) can be executed<br />
in log2(N) cycles, provided a sufficient number of computing nodes (N/2) is available.<br />

n = \sqrt{\sum_{i=0}^{N} x_i^2} \quad (1)<br />
<br />
n = \sqrt{\sum_{i=0}^{N/2} x_i^2 + \sum_{i=N/2+1}^{N} x_i^2} \quad (2)<br />

This is partially covered by the instruction rank (refer to Section II). Its purpose is to provide a simple<br />
metric for parallelisation and distribution without performing instruction reordering. Such a reordering<br />
procedure would increase the utilization of a distributed computing structure, but it requires a certain<br />
(mathematical) understanding of the instructions within the graph, namely their commutative and associative properties.<br />

Another challenge in partitioning the graph is dealing with the control flow. Considering an obviously<br />
separable, artificial function like (3) from Fig. 6, it is trivial to conclude that isolating the left and<br />
right parts of the computation is optimal.<br />

c(i) = \begin{cases} 2a(\lfloor i/2 \rfloor), & i \in 2\mathbb{N} \\ 2b(\lfloor i/2 \rfloor), & i \in 2\mathbb{N}+1 \end{cases} \quad (3)<br />

Not only do the computations allow an easy separation; the input data (vectors a and b) can also be fed<br />
independently into two distinct computation engines. This is only true in cases where the loop is fully<br />
unrolled. As soon as the control flow nodes are additionally considered (Fig. 7), a significant asymmetric<br />
portion of the graph appears, caused by the sequential (imperative) nature of the LLVM assembly language.<br />
Care has to be taken when partitioning such graphs, either by duplicating the control flow nodes (the left<br />
part of Fig. 7) for the detached components, or by applying an across-the-board performance decrease to the<br />
simplified half.<br />
<br />
Fig. 6. Vector split example (data flow).<br />
Fig. 7. Vector split example (data and control flow).<br />

IV. CONCLUSION<br />

In this paper an approach to analyse algorithms with respect to their distribution capability has been<br />
presented. Using a rather low-level representation (the LLVM assembly language) as the base for the analysis<br />
has proven to be suitable, especially with respect to the reuse of existing tools and optimization techniques,<br />
and the ability to raise the abstraction level when needed. As a proof of concept, a tool has been implemented<br />
which performs a combined static and dynamic analysis of the algorithm by means of an executable specification.<br />
For common low-level routines (fold instructions and vector norm) the generated data and control flow graphs<br />
match the expected well-known results. Additionally, the decomposition of aggregate data structures/arrays has<br />
been demonstrated. By addressing the needs of vectorization and parallelisation, the initial steps towards a<br />
semi-automated partitioning have been made.<br />

b_Slice_0_0<br />

V. FUTURE WORK<br />

With a DAG-based representation of the data flow as well as the control flow needs of an algorithm, future<br />
work can focus on evaluating the performance metrics when running it on a dedicated set of computation<br />
entities. Considering a computation entity as a composition of nodes offering the capability to execute a set<br />
of instructions, the control data flow graph can be mapped to that of the hardware. The instructions can be<br />
either the low-level idioms of the LLVM assembly language or higher-level abstractions. With the goal of<br />
finding a generic graph-based representation of such a hardware component, the mapping process leads to a<br />
graph-on-graph mapping procedure. With the ability to replay the algorithm on various hardware models and<br />
under varying boundary conditions, a reasonable estimate of the performance metric for a specific<br />
hardware/algorithm pair can be generated. This would allow an easy exploration of different hardware models<br />
running the same algorithm.<br />

REFERENCES<br />

[1] J. Merrill, “GENERIC and GIMPLE: A new tree representation for entire functions,” in Proceedings of the 2003 GCC Summit, 2003.<br />
[2] C. Lattner and V. Adve, “The LLVM compiler framework and infrastructure tutorial,” in LCPC’04 Mini Workshop on Compiler Research Infrastructures, West Lafayette, Indiana, Sep. 2004.<br />
[3] A. Dijkstra, J. Fokker, and S. D. Swierstra, “Implementation and application of functional languages,” in Implementation and Application of Functional Languages, O. Chitil, Z. Horváth, and V. Zsók, Eds. Berlin, Heidelberg: Springer-Verlag, 2008, ch. The Structure of the Essential Haskell Compiler, or Coping with Compiler Complexity, pp. 57–74. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-85373-2_4<br />
[4] D. A. Terei and M. M. T. Chakravarty, “An LLVM backend for GHC,” in ACM SIGPLAN Haskell Symposium, Baltimore, MD, United States, 2010.<br />
[5] C. Lattner, “LLVM and Clang: Next generation compiler technology,” in The BSD Conference, 2008.<br />
[6] G. Hutton, “A tutorial on the universality and expressiveness of fold,” J. Funct. Program., vol. 9, no. 4, pp. 355–372, Jul. 1999. [Online]. Available: http://dx.doi.org/10.1017/S0956796899003500<br />
[7] LLVM Project. (2018) LLVM language reference. [Online]. Available: https://llvm.org/docs/LangRef.html<br />
[8] S. Tani, A. Rechberger, B. Süsser-Rechberger, R. Teschl, and H. Paulitsch, “Application of crowdsourced hail data and damage information for hail risk assessment in the province of Styria, Austria,” EGU 2017, IE2.1/NH9.19, April 2017. [Online]. Available: http://meetingorganizer.copernicus.org/EGU2017/EGU2017-6822.pdf<br />
[9] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1990.<br />


www.embedded-world.eu


How to efficiently combine test methods for an<br />

automated ISO 26262 compliant software<br />

unit/integration test<br />

Markus Gros<br />

Vice President Marketing & Sales<br />

BTC Embedded Systems AG<br />

Berlin, Germany<br />

markus.gros@btc-es.de<br />

Abstract— The verification of embedded software in today’s development projects is becoming more and more of a<br />
challenge. This is particularly true for the automotive industry, where we can observe rapidly growing software<br />
complexity combined with shortened development cycles and an increasing number of safety-critical applications.<br />
New methodologies like Model-based design or agile processes on the one hand clearly help to make development<br />
more efficient; on the other hand, they bring additional challenges related to the test process. One effect,<br />
for example, is that tests need to be executed earlier and more often and, due to the Model-based development<br />
approach, on more execution levels like MIL/SIL/PIL. One more dimension of complexity comes from the fact that<br />
one test method is not enough to gain the necessary confidence regarding the correctness and robustness of the<br />
system under test. This conclusion is also<br />

and robustness of the system-under-test. This conclusion is also<br />

part of several standards like ISO 26262, which recommend a<br />

combination of different test activities on model and code level.<br />

This paper presents a concept for an integrated verification<br />

platform for models and production code, which addresses the<br />

challenges explained above by focusing on three main aspects:<br />

integration, separation and automation. The integration aspect<br />

can be divided into two different approaches. First of all, the platform should be integrated with other<br />
development tools like the modelling tool, requirements management tool or code generator.<br />

All information needed for the verification of a component<br />

should be extracted as automatically as possible, including<br />

information about interfaces, data types, data ranges,<br />

requirements or code files. As this kind of information is needed<br />

in a similar way for different verification methods, the second<br />

integration approach consists of integrating different test<br />

methodologies on top of a shared database within one<br />

environment. The first obvious benefit is that the information<br />

described above needs to be extracted only once for all<br />

verification activities which can include guideline checking, static<br />

analysis, dynamic analysis and formal methods. We will also<br />

describe a second benefit coming from the fact that these different methods can deeply leverage each other’s results.<br />

Separation means that software units shall be thoroughly verified<br />

before they are integrated into software components. Integrated<br />

components are then being verified according to the software<br />

architecture definition. The verification platform should support<br />

this divide and conquer approach as recommended and<br />

described in ISO 26262 or Automotive SPICE. One final topic to<br />

be discussed is automation, which should be made possible by a<br />

complete API as well as integration with technologies like<br />

Jenkins. The discussed verification platform approach automates<br />

many testing activities, from the more mundane work of developing MBD and code-centric test harnesses to the<br />
more sophisticated activities of automatic test generation.<br />

Keywords—Model-based Development, ISO 26262, Software<br />

Unit Test, Software Integration Test<br />

I. INTRODUCTION<br />

In today’s development projects for embedded software, the<br />

complexity is growing in many dimensions, which in particular brings many challenges for the test and verification process.<br />

The size of software in terms of lines of code and number of<br />

software components is constantly growing, which obviously<br />

also increases the number of test projects and test cases. On top<br />

of this, the Model-based development approach is becoming<br />

more and more popular and, despite all advantages, it brings<br />

some additional challenges to the testing workflow because test<br />

activities need to be done on model level as well as on code<br />

level. While these observations seem to lead to an increasing<br />

test effort, it is also obvious that the competitive pressure in the<br />

industry leads to a need to control or even reduce development<br />

cost and time. The amount of test activities is even further<br />

increased by the adoption of agile development methods,<br />

which require frequent repetition of test tasks on slightly<br />

modified components. As a consequence, software tools are<br />

introduced in the process in order to automate tasks like test<br />

execution, test evaluation or report generation.<br />

One more challenge we can see in particular in the automotive industry is that software is more and more<br />
taking over safety-critical features related to steering or braking,<br />

slowly leading the way to fully autonomous vehicles. The level<br />

of confidence which is needed for these kinds of features can<br />

only be achieved by combining multiple test methods. This is<br />

also reflected in standards like ISO 26262 and leads to a<br />

growing number of software tools which contribute to the<br />



overall quality metrics. While on one hand specialized software<br />

tools for individual verification tasks are available, the growing<br />

number of tools inside development projects becomes more<br />

and more difficult to manage. Reasons are:<br />

• Every software tool comes with specific<br />

limitations regarding the supported environment<br />

(e.g. versions of Microsoft Windows, Matlab etc.)<br />

and the supported language subset (e.g. supported<br />

Simulink blocks). Cross-checking all limitations<br />

before selecting the tools and tool versions for a<br />

specific project is always a time-consuming and<br />
error-prone task.<br />

• While different software tools in a project address<br />

different use cases, they often also have features<br />

and needs in common. One example in the<br />

verification context is the fact that every tool<br />

needs information about the system under test (or<br />

SUT), which typically includes details about<br />

interface, data ranges or the list of files needed to<br />

compile or simulate. Importing the SUT into<br />

different tools is not only a redundant task, it is<br />

also error-prone, as the user needs to learn and<br />

apply different workflows for a similar task.<br />

• As software tools often use different file formats<br />

for storing data or reports, users need to learn<br />

different tool specific aspects and need to store<br />

and analyze reports in different environments and<br />

formats. For automation, APIs, if available at all,<br />

might be based on different concepts or are only<br />

available in different programming languages.<br />

• When different test methods in a model-based<br />

process are applied independently, they typically<br />

do not benefit from each other’s results.<br />

This paper presents the concept of a test platform for<br />

software unit test and software integration test within a model-based<br />
development process including automatic code<br />

generation. While Section II presents the core features of the<br />

platform, sections III to VI focus on the main benefits that we<br />

call integration, separation and automation. Several aspects of<br />

the described approach have already been integrated in the<br />

commercial tool BTC EmbeddedPlatform.<br />

II. CORE FEATURES<br />

This chapter describes some common needs and features<br />

that we find in a redundant way in different tools being<br />

designed for different test methods. The benefits of providing<br />

these features once and making them available to different test<br />

methods will be described in section III.<br />

A. Import of the system under test<br />

The starting point of any test activity is to provide<br />

information about the SUT to the test tool. As we assume a<br />

model-based development process, we will consider at least<br />

two levels for test activities: Simulink/Stateflow models as well<br />

as production C code. Relevant information includes:<br />

• List of needed files and libraries for model<br />

(models, libraries, data dictionaries, .m/.mat files)<br />

and code level (.c/.h files, include paths)<br />

• Structure of subsystems in the model and structure<br />

of functions in the production code<br />

• List of interface objects on both levels. The main<br />

interface types are inputs, outputs as well as<br />

calibration parameters and observable internal<br />

signals. Interface objects can be scalar variables,<br />

vectors, arrays, or they can be structured in the form of<br />

bus signals or C code structures. Additional<br />

important information for each interface object<br />

includes data types, scalings and data ranges.<br />

• For test execution, a test frame needs to be<br />

available on both levels. In particular on unit test<br />

level, this might include the need to generate stub<br />

implementations for external functions and<br />

variables.<br />

B. Requirements<br />

The traceability to requirements is an important aspect of<br />

test methods like requirements-based testing or formal<br />

verification. The platform should be able to link test artifacts to<br />

requirements in a bi-directional way.<br />

C. Debugging<br />

If tests fail, the platform should support debugging<br />
activities on model and code level.<br />

D. Reporting<br />

It should be possible to generate report documents for all<br />

test activities in an open format like html. Creating the different<br />

types of reports with a common look and feel can support<br />

clarity and make them easier to read.<br />

III. TOOL INTEGRATION<br />

A tight integration between the test platform and other tools<br />

used inside the development project is a key prerequisite for an<br />

efficient and automated workflow. In this context, we can<br />

identify three main types of tools to connect to.<br />

In the context of a model-based development approach with<br />

automatic code generation, the most important tools to<br />

integrate with are the modelling environment (e.g.<br />

Simulink/Stateflow) and the code generator (e.g. dSPACE<br />

TargetLink or EmbeddedCoder). This integration should<br />

enable a highly automated import and analysis of the SUT as<br />

described in II.A. A manual setup of the test project or a semiautomated<br />

approach with third-party formats like Excel should<br />

be avoided for efficiency reasons and to avoid errors.<br />

As requirements play an important role, the platform should<br />

provide a direct connection to requirements management tools<br />

like IBM DOORS or PTC Integrity. It should be possible to<br />

automatically import the desired subset of requirements and to<br />

write information about test results back to the requirements<br />

management tool as additional attributes.<br />

Especially in larger projects where a lot of developers and<br />

test engineers are involved, a global data management platform<br />

822


might be available providing features like centralized access to<br />

all development and test artifacts, version and variant<br />

management or the control of access rights. This kind of tool<br />

also has the potential to collect quality metrics for different<br />

components and make them accessible on a project wide level.<br />

Therefore, the test platform should be able to integrate with<br />

such a data management platform in a bi-directional way in<br />

order to obtain information about the SUT and in order to<br />

provide test metrics back to it.<br />

IV. INTEGRATION OF TEST METHODS<br />

As already mentioned above, the needed confidence for the<br />

development of embedded systems can only be achieved by a<br />

combination of different test methods. Combining different test methods inside one platform will bring two<br />
main benefits. The first obvious benefit is that the features described in II can be<br />

accessed and shared by the test methods, increasing efficiency<br />

and avoiding the need for redundant tasks. Being located in the<br />

same environment, some of the relevant test methods also have<br />

the potential to benefit from having information about each<br />

other’s results. Relevant tasks in this context are:<br />

a. Requirements-based Testing: Functional test cases<br />

should be derived from requirements and applied on<br />

model and code level. The creation of these test cases<br />

clearly benefits from the detailed information the<br />

platform has about the SUT including available<br />

interface variables and data ranges. This way, the test<br />

editor can already protect the user against invalid data<br />

entry. Other platform features which are needed for<br />

this task contain the capability to run simulations, the<br />

availability of requirements as well as debugging and<br />

reporting features.<br />

b. Analysis of equivalence classes and boundary values:<br />

Both methods are recommended by ISO 26262 and<br />

target an analysis of different values and value ranges<br />

for interface variables. These tasks will benefit from<br />

the fact that the platform already contains information<br />

about all available functions, their interface signals and<br />

the data ranges. The outcome of this activity should be<br />

a set of test cases which cover the defined variable<br />

ranges and values, therefore it makes sense to combine<br />

this analysis with the Requirements-based Testing<br />

activity.<br />

c. Analysis of model and code coverage: In order to<br />

assess the completeness of the test activities, structural<br />

coverage metrics should be measured on model and<br />

code level. Due to an integration with the<br />

Matlab/Simulink environment, model coverage can<br />

easily be measured via standard mechanisms. For code<br />

coverage, the code needs to be instrumented and all<br />

available tests need to be executed on the instrumented<br />

code. As the platform should have access to a<br />

compileable set of code and header files, this analysis<br />

can be handled fully automatically.<br />

d. Check for Modelling and coding guidelines: These<br />

kind of static analysis methods can be fully automated<br />

in case the list of model and code artifacts is available.<br />

Modelling guidelines for example can check for<br />

prohibited block types, wrong configuration settings or<br />

violations of naming rules. An example for coding<br />

guidelines are the widely used MISRA C rules.<br />

e. Analysis of runtime errors: This static analysis is<br />

typically done on code level by applying the abstract<br />

interpretation method. This methodology requires<br />

access to the list of code and header files and it also<br />

benefits from getting information about data ranges of<br />

variables. If some analysis goals are already covered<br />

by existing tests, it might be possible to exclude them from<br />

the analysis to increase efficiency.<br />

f. Resource consumption: This means analyzing the<br />

resource consumption on the target processor regarding<br />

RAM, ROM, stack size and execution time. One<br />

option is to measure these metrics during the test<br />

execution on a real or virtual processor, which the<br />

platform should be able to call. This measurement is of<br />

course only possible, if a sufficient set of test cases is<br />

available, which covers different paths in the software.<br />

g. Structural Test Generation: In order to maximize<br />

structural coverage metrics on model and code level,<br />

test cases can be generated automatically either by<br />

random methods or using model checking. This task<br />

can benefit dramatically from the availability of<br />

requirements-based test cases, as only uncovered parts<br />

need to be analyzed. Structural tests can be used e.g.<br />

for showing robustness of the SUT and for Back-to-<br />

Back as well as regression testing.<br />

h. Back-to-Back Testing: Back-to-Back Testing between<br />

models and code is (highly) recommended by ISO<br />

26262 and it obviously requires test cases (functional<br />

and/or structural), the ability to run them on the<br />

different execution levels and the generation of<br />

corresponding reports.<br />

i. Formal Specification: Textual (or informal)<br />

requirements often leave some room for ambiguities or<br />

misunderstandings. Expressing requirements in semi-formal<br />

or formal notation (as recommended by ISO<br />

26262) not only improves their quality, it also allows<br />
them to be used as a starting point for some highly<br />

automated and efficient verification methods (see<br />

below). The formalization process requires information<br />

about the architecture of the SUT and it should also<br />

provide traceability to existing informal requirements<br />

from which the formal notation is derived. Both are<br />

already provided by the platform concept.<br />

j. Requirements-based Test Generation: As the<br />

previously described formalized requirements are<br />

machine-readable, they can be used as a starting point<br />

for an automatic generation of test cases which will test<br />

and cover the requirements. If these requirements don’t<br />

describe the full behavior of the system, the SUT itself<br />

(available in the platform) can contribute to the<br />

process. If manual test cases already exist, they can be<br />

analyzed regarding their requirements coverage, so that<br />

only missing tests need to be generated.<br />



k. Formal Test: In a Requirements-based Testing process,<br />

every test case is usually only evaluated with respect to<br />

the requirement from which it has been derived. A<br />

situation where a particular test case violates a<br />

different requirement typically goes undetected. By<br />

performing a Formal Test, all test cases are evaluated<br />

against all requirements, which dramatically increases<br />

the testing depth without the need to create additional<br />

test data. Obviously, this method benefits from a<br />

platform in which formalized requirements and<br />

functional/structural test cases are managed together<br />

for a particular SUT.<br />

l. Formal Verification: The number of possible value<br />

combinations for input signals and calibration values is<br />

almost infinite for a typical software component. It is<br />

therefore obvious that even a large number of test<br />

cases can never cover all possible paths through the<br />

component. Formal Verification with Model-Checking<br />

technology can automatically provide a complete<br />

mathematical proof that shows a requirement cannot be<br />

violated by the analyzed SUT. This guarantees that<br />

there is no combination of input signals and calibration<br />

values that would drive the system to a state in which<br />

the requirement is violated. The analysis takes the SUT<br />

as well as the formalized requirement(s) as an input. If<br />

a requirement can be violated, a counter example is<br />

provided in form of a test case, which can then be<br />

debugged to find the root cause for the possible<br />

requirement violation.<br />

V. SEPARATION<br />

The growing complexity in today’s embedded software<br />

development projects can only be managed by a divide and<br />

conquer approach. This concerns different disciplines including<br />

requirement authoring, software architecture design, software<br />

development and also testing. System requirements need to be<br />

broken down into smaller units as part of a bigger architecture.<br />

Afterwards, these units should be developed and tested<br />

independently before being integrated. This process is also<br />

reflected in the so-called V-Cycle as well as in ISO 26262<br />

which on software level contains a clear separation between<br />

software unit test and software integration test.<br />

The test platform should support this approach mainly in<br />

two ways. First of all, the tool should be flexible enough to<br />

separate the SUT structure from the model structure. This<br />

means, it should be possible to individually test<br />

subsystems/subcomponents which are managed inside one<br />

single model or code file. Therefore, it is necessary to separate<br />

individual subsystems from their original model and embed<br />

them in a newly created test frame. A similar approach is also<br />

needed on code level. When it comes to the integration testing<br />

phase, the tool should be able to focus on the new tasks which are<br />

related to potential integration issues. It should not be<br />

necessary to repeat activities (like importing unit test cases) on<br />

the integration level again. This also means, for example, that<br />
metrics like MC/DC coverage on individual units should be<br />

excluded from the test process, as this has already been shown<br />

in the unit test. This can be achieved by avoiding the code<br />

annotation for the units during the integration testing.<br />

VI. AUTOMATION<br />

As mentioned before, the number of test executions needed<br />

within a project is growing constantly. One obvious reason is<br />

the growing number of functions and features that need to be<br />

tested. Also, the introduction of model based development with<br />

its different simulation levels MIL, SIL and PIL contributes to<br />

this effect. However, probably the biggest contribution comes<br />

from the fact that agile development methods become more<br />

popular, which leads to tests being created early and more<br />

frequently within a project, up to a situation where tests (at<br />

least for the modified modules) run automatically as part of<br />

nightly builds within a continuous integration approach.<br />

For maximum flexibility in this context, the platform<br />

should provide a complete API, allowing the automation of all tool<br />

features including test execution and reporting. An integration<br />

with established continuous integration environments like<br />

Jenkins is also helpful and can reduce the need to manually<br />

script standard workflows.<br />

VII. CONCLUSION<br />

This paper presented a concept for a verification platform<br />

focusing on the software unit test and software integration test<br />

of embedded software as part of an ISO 26262 compliant<br />

model-based development process. As software becomes more and more safety-critical in automotive applications,<br />
more test methods need to be combined to achieve sufficient confidence, leading to more tools being introduced into the<br />

process. This number of independent tools leads to several<br />

challenges and problems, which were described in section I. As<br />

a solution, we propose a platform concept which provides some<br />

common core features (described in section II) on top of which<br />

the different test methods can be realized. This way they can<br />

benefit from a shared database which provides general and<br />

reusable information about the system under test, avoiding<br />

redundant tasks that would need to be repeated for every test<br />

method in different tool environments. We also described three<br />

key features of this platform: integration, separation, and

automation. Several aspects of this concept are already<br />

implemented in the commercially available product BTC<br />

EmbeddedPlatform, which is also certified for ISO 26262 by<br />

German TÜV Süd. Thanks to an open Eclipse-based<br />

architecture, additional test methods described in this paper<br />

could be added in the future, either by BTC Embedded Systems or by third parties.

824


Continuous Integration and Test<br />

from Module Level to Virtual System Level<br />

Johannes Foufas, Martin Andreasson<br />

Volvo Car Corporation<br />

Gothenburg, Sweden<br />

Michael Hartmann, Andreas Junghanns<br />

QTronic GmbH<br />

Berlin, Germany<br />

Abstract— Software-in-the-Loop (SiL) testing is a strategic sweet spot between Model-in-the-Loop (MiL) and Hardware-in-the-Loop (HiL) testing. We show in this paper how to use automatic C-code instrumentation to harness the superior properties of SiL technology for module tests, even when the C-code is generated as a few large controller functions combining the modules to be tested.

Furthermore, we show how to re-use module test

specifications in integration and system tests by separating the<br />

test criteria from the test stimulus. We call these test criteria<br />

requirements watchers and define them as system invariants.<br />

This powerful technique, combined with efficiently handling<br />

large numbers of controller variants by annotating watchers and<br />

scripts, allows the automatic validation of hundreds of<br />

requirements in module, integration, and system tests, improving software quality dramatically very early in the software development process.

Last but not least, we extend the idea of continuous<br />

integration to continuous validation to leverage all of the above to<br />

reach high levels of software maturity very early in the software<br />

development process. That will also benefit later test phases – like<br />

HiL system and system integration tests – by dramatically<br />

reducing commissioning efforts.<br />

Keywords— Software-in-the-Loop; continuous integration<br />

I. MOTIVATION AND CHALLENGES<br />

Engineers are under pressure to deliver improvements at a<br />

growing pace while satisfying an increasing number of regulatory demands concerning performance, safety, reliability

and ecology. The combination of more functionality and<br />

smaller turnaround times between new versions requires new<br />

methods of test and validation to keep software quality up to<br />

par. While traditional testing on the target hardware maintains<br />

a role in integration testing and satisfying strict safety norms, it<br />

is too slow, resource intensive and late with feedback for<br />

earlier phases of the control-software development cycle to<br />

increase robustness in a meaningful way.<br />

Common unit/module test approaches rely on MiL, which is prone to missing certain classes of bugs. SiL

simulation can alleviate these concerns by providing a testable<br />

system that is much closer to the C-code reality: using the<br />

generated C-code, the target integer variable scaling and the<br />

(variant-coded) parameter values for the target system, often<br />

even including parts of the basic-software and communication<br />

stacks [1,2]. And despite being so close to reality, SiL still offers all the strong points of MiL: a cheap and early-available execution platform (the PC), determinism, flexibility when integrating into different simulation tools (for example as FMUs), fully accessible and debuggable internals, easy automation for all system variants, and many more benefits.

But moving to SiL is not without challenges. First and most<br />

obviously, hardware-dependent parts of the control software<br />

cannot be included and suitable SiL-abstractions have to<br />

replace the missing code. Recent standardized software architectures, like AUTOSAR or ASAM MDX, ease such replacement and IO connectivity considerably, as standard APIs can be provided by the SiL platform, or standard description formats can be used to generate the connection layers, e.g. SiL AUTOSAR RTE generation from .arxml files. Even for pre-AUTOSAR ECUs this task can be handled quite efficiently these days: a limited number of tier-1 suppliers produced a limited number of vendor-dependent RTOS-inspired architectures that allow for high levels of reuse [3].

Another challenge is dealing with generated C-code for<br />

module test. The generated code is optimized for target use and<br />

may fuse many software modules into one large C-function<br />

(task). Stimulating individual software modules from the<br />

outside is not possible. Regenerating individual modules is out<br />

of the question, because changing the code generation process would lead to different C-code, defeating the purpose of SiL: to test exactly the code that will be compiled for the target, without changes. The solution: we instrument the generated C-code to gain control over all input variables of the module(s) under test.

Ideally one would like to reuse tests from MiL to SiL to<br />

HiL. However, the different levels of simulation detail,<br />

restrictions on measurement bandwidth, availability of the<br />

execution platforms, setup cost for different variants, etc. require a more sophisticated test strategy than “simple reuse”.

Focusing on the strengths of each platform and running each

test as early as possible will frontload, as one example,<br />

application layer function and integration tests to SiL, while<br />

leaving hardware related diagnostic tests on the HiL platform.<br />

And optimizing control strategies as early as possible will<br />

move these tasks to MiL simulations. Re-using test definitions<br />

is therefore limited by the different test goals and platform-related restrictions.

However, module and system-level tests can still share the<br />

same requirements, if not the same test focus. The solution to<br />

high levels of reuse for test specifications is separating the<br />

www.embedded-world.eu<br />

825


implementation of the requirements tests from the stimulus.<br />

While classic test automation combines test stimulus and<br />

requirement tests into the same script, we define requirement<br />

watchers as formal, stimulus and system-state independent<br />

invariants: conditions that must always hold. Engineers need to<br />

spend more time and care in writing such requirement<br />

watchers, but the payoff justifies this extra effort: Requirement<br />

watchers can be tested with any kind of stimulus, be it scripted tests, field measurements, short test vectors, hour-long load-collective simulations, or auto-generated test stimuli (e.g. by TestWeaver) [4]. Here we will show how to reuse module

requirements defined for module testing in system-level testing<br />

when written as requirement watchers.<br />

The increasing number of variants of control systems requires special measures during test and validation to reduce the manual matching of test cases to variants of the control software. We show how annotating requirement watchers and stimuli with filter properties enables automatic selection of relevant test cases.

Continuous Integration is a state-of-the-art method to detect<br />

integration problems. Combining CI with more than<br />

rudimentary tests is difficult if the target binary is the test<br />

object. Using SiL as the execution platform allows high levels of

automation for large numbers of tests because they can run on<br />

the same platform as the build process: the PC. Extending the<br />

idea of Continuous Integration (nightly builds) to Continuous<br />

Validation (nightly test) improves early detection of large<br />

classes of software problems considerably.<br />

II. TESTING AT VOLVO CARS CORPORATION

At Volvo Cars Corporation (VCC), SiL testing is at the<br />

core of a new Continuous Integration strategy. Through<br />

increasing the frequency of integration points and<br />

corresponding tests, control software reaches a higher level of<br />

maturity when final acceptance tests are carried out close to<br />

production. In order to achieve this, a large number of tests<br />

need to be defined and used throughout the development<br />

process.<br />

One concern so far has been the incompatibility of test<br />

cases and stimuli between MiL and SiL setups. The structure<br />

that is designed by a developer in modeling tools is often<br />

disregarded during code generation. This means testing is<br />

limited to module level, with modules growing in scope over<br />

time. Developers, on the other hand, design around smaller units represented as subsystems.

Fig. 1: Basic module with subfunctions<br />

The difficulty in test design for large models can be<br />

illustrated by the simple example in Figure 1. Subfunction A is<br />

defined by a set of requirements that define the behavior of the<br />

outputs (y) as a function of the intermediate signals (m).<br />

Historically, testing these requirements in anything other than<br />

MiL simulation would require the engineer to invert<br />

Subfunction B in order to design the correct set of inputs (u)<br />

for the test.<br />

Fig. 2: Function requiring transient stimuli<br />

For more complex modules, this approach is very costly<br />

and error-prone. As loops inside functions and state diagrams<br />

are introduced, tests for simple functionality require<br />

increasingly complicated transient stimuli.<br />

We aim to present an instrumentation approach that offers<br />

the opportunity to bypass parts of a function and allows<br />

developers to define stimuli and test criteria around arbitrarily<br />

small subfunctions of a module.<br />

Fig. 3: System under test with bypassing: test stimuli can be defined as m(t)

The requirements that are defined using this process shall<br />

remain independent of the stimulus and usable throughout all<br />

levels of testing, up to integration and robustness tests.<br />

III. INSTRUMENTATION APPROACH

Modelling tools like Simulink allow developers to<br />

structure their models into subsystems which can be used like<br />

atomic blocks. The subsystem can be copied or moved freely<br />

across models and can be tested independently in MiL. When<br />

generating code from a model using TargetLink, the complete model is represented by a single C function. Statements that are

generated from blocks within a subsystem are spread across the<br />

entire compilation unit. This means a subsystem cannot be<br />

executed on its own, preventing any kind of meaningful unit<br />

testing. To remove this limitation from SiL tests, we analyze<br />

the resulting C-code and inject bypass opportunities wherever a<br />

measurable signal is written.<br />

The injected code remains inactive unless the source is<br />

compiled for a SiL target and the user enables bypassing for<br />



the respective variable. This way, MISRA compliance of the production software is ensured even if the instrumented code makes it into release builds by accident.
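As a rough illustration of this idea, the injected hooks could look like the following sketch. The names and macro shape are ours, not the actual instrumentation output; SIL_BUILD is a hypothetical build switch (defined inside the block so the sketch is self-contained):

```c
#include <stdint.h>

#define SIL_BUILD 1            /* hypothetical switch; absent in target builds */

#ifdef SIL_BUILD
/* One bypass slot per measurable signal.  Only the test harness
   enables it and supplies the stimulus value. */
typedef struct {
    int     enabled;
    int16_t value;
} bypass_t;

bypass_t bypass_m1;

/* Injected after every write to a measurable signal: if the bypass is
   enabled, the stimulus overwrites the computed value.  In target
   builds the macro expands to nothing, so the release code is unchanged. */
#define BYPASS(slot, var) do { if ((slot).enabled) (var) = (slot).value; } while (0)
#else
#define BYPASS(slot, var) do { } while (0)
#endif

int16_t m1;                    /* intermediate signal m from Fig. 3     */
int16_t y1;                    /* output of the subfunction under test  */

void controller_step(int16_t u)
{
    m1 = (int16_t)(u * 2);     /* "Subfunction B" computes m ...        */
    BYPASS(bypass_m1, m1);     /* ... injected hook may override it     */
    y1 = (int16_t)(m1 + 1);    /* "Subfunction A" consumes m            */
}
```

With the bypass disabled, `controller_step` behaves like the original code; enabling it lets a test stimulate m(t) directly, as in Fig. 3.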

As code generators tend to use temporary, local variables<br />

where signals are not specifically made measurable, further<br />

analysis of the generated code is necessary. In cases where<br />

such a temporary variable is always equal to a measurable<br />

signal, it has to be set to the correct value as well. This<br />

specifically applies to signals transcending subsystem borders,<br />

which can be represented by two different variables in code.<br />

Fig. 4: Instrumentation of temporary variables<br />

State machines can be bypassed entirely, so no transitions are necessary to provide the system under test with the correct state and/or corresponding flags.

After the code is instrumented, the virtual basic software is<br />

automatically set up with regard to task scheduling and

supplier-dependent modifications. Compilation results in a<br />

virtual ECU containing the entire OEM-part of the control<br />

software which can be coupled with a plant model and/or other<br />

ECUs for system-level simulation.<br />

Without recompilation, engineers can trim the V-ECU to fit<br />

their use-case. Depending on a specification file provided by<br />

the user, the Virtual ECU will reconfigure its scheduler to only<br />

execute a subset of the included functions. The same<br />

specification can be extended by a detailed interface<br />

specification listing the ports of a subsystem. If this<br />

specification is present, all bypasses on the input side are<br />

activated and the variables are overwritten by stimuli during<br />

simulation.<br />

IV. DESIGN OF STIMULUS-INDEPENDENT TESTS

The instrumentation method described reduces the effort in<br />

test design significantly. Unit-Tests of small subfunctions can<br />

be created through traditional scripting and deployed as part of<br />

an automated test framework. While this method can produce<br />

comprehensive results with regard to verification and coverage,

it relies heavily on developers being able to foresee all possible<br />

problems.<br />

During the specification phase, requirements are written in<br />

a broad scope. Often a requirement will define a certain<br />

behavior that shall be true under certain conditions. In essence:<br />

Condition A => Behavior B<br />

Defining test cases around such requirements would be<br />

difficult, especially if the condition contains several continuous<br />

signals. The widespread approach of testing by creating a<br />

stimulus and checking for a specific reaction fails to capture a<br />

large number of possible scenarios as engineering hours and<br />

therefore the number of defined test cases are limited.<br />

In addition, a stimulus-reaction based test becomes obsolete<br />

once the object under test is integrated into a system, as the<br />

previously defined stimulus often cannot be reproduced due to<br />

its artificial nature.<br />

Side effects that appear based on the interaction of several<br />

components cannot be tested. A developer might cover all the<br />

expected combinations of outputs from another module or subfunction, but faulty signals resulting from a bug in that module might not be considered.

TestWeaver by QTronic provides the means to define<br />

requirements in a way that closely resembles the original<br />

specification. The test for a requirement is defined by<br />

precondition and expected behavior instead of stimulus and<br />

reaction.<br />

The definition of a requirement watcher entails conditions<br />

to activate the instrument and the criteria to be tested. A<br />

watcher intended to test the simple example above would<br />

remain inactive until Condition A is met and, once active, check for Behavior B.

For more complex cases, additional options such as tolerance times can be specified. Inverse usage, i.e. the specification of unwanted behavior, is also supported.
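A requirement watcher of this shape can be sketched as a small state machine evaluated at every simulation step. The structure and names below are our illustration, not TestWeaver's actual API:

```c
/* Watcher for "Condition A => Behavior B": inactive until the
   condition holds; while armed, the behavior must hold within an
   optional tolerance (grace period) after activation. */
typedef struct {
    int (*condition)(void);   /* Condition A: arms the watcher        */
    int (*behavior)(void);    /* Behavior B: must hold while armed    */
    int tolerance_steps;      /* steps the behavior may lag behind    */
    int armed_for;            /* internal: steps since activation     */
    int violated;             /* latched verdict                      */
} watcher_t;

/* Called once per simulation step, for any stimulus whatsoever. */
void watcher_step(watcher_t *w)
{
    if (!w->condition()) {
        w->armed_for = 0;     /* condition gone: disarm               */
        return;
    }
    w->armed_for++;
    if (!w->behavior() && w->armed_for > w->tolerance_steps)
        w->violated = 1;      /* requirement broken: latch it         */
}

/* Example invariant: "whenever speed > 100, the limiter is active". */
int speed, limiter;
int cond_fast(void)     { return speed > 100; }
int behav_limited(void) { return limiter != 0; }
```

Because the watcher only reads signals, it can ride along with scripted tests, field measurements, or generated scenarios alike.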

Each requirement can be tested at every point in time<br />

during a simulation. Requirements defined at subsystem level<br />

remain valid in system context and vice versa, and can be tested regardless of scope. As requirement watchers do not require

write-access to any signals, the definitions implemented for<br />

unit tests are still applicable in larger contexts where the code<br />

instrumentation might be omitted. Module and integration tests<br />

can thus be executed on final production code.<br />

Fig. 5: Requirement watchers can be reused throughout and refined for different scopes. All requirements are tested at every point. Stimuli are selected from a pool where applicable.

Code coverage is measured with Testwell’s CTC++. The<br />

decoupled requirements described above provide the option to<br />

use any input vector to increase coverage. Any scripts or<br />

measurements that are available can be added to the stimulus<br />

pool and simulated. This way, high code coverage can be<br />



achieved without specifically designing additional tests. The<br />

requirement definitions can also be reused with TestWeaver’s<br />

scenario generation for focused explorative tests, further<br />

increasing coverage and robustness.<br />

V. CONTINUOUS INTEGRATION AND VERIFICATION<br />

At VCC Powertrain, code is deployed to a Jenkins-based<br />

continuous integration system. Pipelines are defined to<br />

automatically build virtual ECUs and run applicable tests.<br />

Commits by function developers into the common model base<br />

trigger the execution of interface verification and module tests<br />

as well as integration tests relevant to the module, in SiL and HiL.

Fig 6: A typical Jenkins Pipeline.<br />

As a result, function developers get quick and reliable<br />

feedback about the behavior of their models in the context of a<br />

wider system. To verify and keep track of code quality and<br />

open issues, the full test suite is executed nightly.<br />

VI.<br />

CONCLUSION<br />

In this paper we present a number of critical building<br />

blocks necessary to improve software maturity early in the<br />

software development process. Software-in-the-Loop (SiL)<br />

allows test execution in a Continuous Validation process of the<br />

target C-code. Instrumentation of the target C-code allows<br />

manipulation of any input of the software module, enabling

module tests even if target code generation merges many<br />

modules into larger C-functions (tasks).<br />

When expressing module and system requirements as requirement watchers, we can reuse them more easily in most of the test stages, more than compensating for the extra effort of defining requirements as invariants.

Annotating requirement watchers and stimulation scripts<br />

with variant information allows automatic filtering to matching<br />

ECU configurations. This way, a single test database can be<br />

used to handle a multitude of variants while at the same time ensuring that all relevant requirements will be tested on all variants with all test stimuli, reaching code-coverage and requirement-coverage goals more quickly and more easily than with traditional test methods.

As the virtual ECU can be reconfigured within Silver to<br />

include or exclude any function in the entire application<br />

software, build times are kept to a minimum.<br />

In order to reduce the amount of work needed to design<br />

tests even further, closed loop simulations including detailed<br />

plant models will be integrated into the VCC CI and CT<br />

toolchain. Reusing the existing requirement watchers,<br />

TestWeaver’s scenario generation will be employed in order to<br />

increase robustness and test coverage even further.<br />

[1] Brückmann, Strenkert, Keller, Wiesner, Junghanns: Model-based<br />

Development of a Dual-Clutch Transmission using Rapid Prototyping<br />

and SiL. International VDI Congress Transmissions in Vehicles 2009,<br />

30.06.–01.07.2009, Friedrichshafen, Germany

[2] Rui Gaspar, Benno Wiesner, Gunther Bauer: Virtualizing the TCU of<br />

BMW's 8 speed transmission, 10th Symposium on Automotive<br />

Powertrain Control Systems, 11. - 12.09.2014, Berlin, Germany<br />

[3] René Linssen, Frank Uphaus, Jakob Maus: Software-in-the-Loop at the<br />

junction of software development and drivability calibration, 16th

Stuttgart International Symposium (FKFS), 15. - 16.03.2016, Stuttgart,<br />

Germany<br />

[4] Mugur Tatar: Enhancing the test and validation of complex systems with<br />

automated search for critical situations, VDA Automotive SYS<br />

Conference, 06. - 08.07.2016, Berlin, Germany<br />



Self-testing in Embedded Systems<br />


Colin Walls<br />

Mentor, a Siemens business<br />

Newbury, UK<br />

colin_walls@mentor.com<br />

Abstract—All electronic systems carry the possibility of<br />

failure. An embedded system has intrinsic intelligence that<br />

facilitates the possibility of predicting failure and mitigating its<br />

effects. This paper reviews the options for self-testing that are<br />

open to the embedded software developer. Testing algorithms for<br />

memory are outlined and some ideas for self-monitoring software<br />

in multi-tasking and multi-CPU systems are discussed.<br />

Keywords—embedded software, self-testing<br />

I. INTRODUCTION<br />

Things go wrong. Electronic components die. Systems fail.<br />

This is almost inevitable and, the more complex systems

become, the more likely it is that failure will occur. In complex<br />

systems, however, that failure might be subtle; simple systems<br />

tend to just work or not work.<br />

As an embedded system is "smart", it seems only<br />

reasonable that this intelligence can be directed at identifying<br />

and mitigating the effects of failure...<br />

Self-testing is the broad term for what embedded systems<br />

do to look for failure situations. This paper primarily identifies<br />

some of the key issues.<br />

Broadly, an embedded system can be broken down into 4<br />

components, each of which can fail:<br />

• CPU<br />

• Peripherals<br />

• Memory<br />

• Software<br />

II. CPU FAILURE<br />

CPU failure is not too common, but is far from unknown.<br />

Unfortunately, there is very little that a CPU can do to predict<br />

its own demise. Of course, in a multicore system, there is the<br />

possibility of the CPUs monitoring one another.<br />

III. PERIPHERAL FAILURE<br />

Peripherals can fail in many and varied ways. Each device<br />

has its own possible failure modes. To start with, the self-test<br />

software can check that each peripheral is responding to its<br />

assigned address and has not failed totally. Thereafter, any<br />

further self-test is very device dependent. For example, a<br />

communications port may have a "loop back" mode, which<br />

enables the self test to verify transmission and reception of<br />

data.<br />
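To make the loop-back idea concrete, here is a sketch of such a self test. The register layout and the `uart_sync` hook are hypothetical stand-ins; a real port would use the device's datasheet registers and poll a receive-ready flag with a timeout:

```c
#include <stdint.h>

/* Hypothetical memory-mapped register block of a serial port. */
typedef struct {
    volatile uint8_t ctrl;   /* bit 0: loop-back enable */
    volatile uint8_t tx;     /* transmit data register  */
    volatile uint8_t rx;     /* receive data register   */
} uart_regs_t;

#define UART_LOOPBACK 0x01u

/* Platform hook: wait until the transmitted byte has looped back.
   This mock simply copies tx to rx when loop-back is enabled. */
static void uart_sync(uart_regs_t *u)
{
    if (u->ctrl & UART_LOOPBACK)
        u->rx = u->tx;
}

/* Returns 1 if every test pattern sent re-appears on the receive
   side, 0 otherwise.  The control register is restored afterwards. */
int uart_selftest(uart_regs_t *u)
{
    static const uint8_t patterns[] = { 0x00u, 0xFFu, 0xAAu, 0x55u };
    uint8_t saved = u->ctrl;
    int ok = 1;

    u->ctrl |= UART_LOOPBACK;
    for (unsigned i = 0; i < sizeof patterns; i++) {
        u->tx = patterns[i];
        uart_sync(u);
        if (u->rx != patterns[i])
            ok = 0;
    }
    u->ctrl = saved;
    return ok;
}
```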

IV. MEMORY FAILURE<br />

Memory is, of course, a critical component of an embedded system and is certainly subject to failure from time to time.

Considering how much memory is installed in modern systems,<br />

it is surprising that catastrophic failure is not more common.<br />

Like all electronic components, the most likely time for<br />

memory chips to fail is on power up, so it is wise to perform a<br />

comprehensive test then, before vital data is committed to<br />

faulty memory.<br />

If a memory chip is responding to being addressed, there<br />

are broadly two possible failure modes: stuck bits [i.e. bits that<br />

are set to 0 or 1 and will not change]; cross-talk [i.e.<br />

setting/clearing one bit has an effect on one or more other bits].<br />

If either of these failures occurs while software is running, it is<br />

very hard to trace. The simplest test to look for these failures<br />

on start-up is a "moving ones" [and "moving zeros"] test. The<br />

logic for moving ones is simple:<br />

set every bit of memory to 0
for each bit of memory
{
   verify that all bits are 0
   set the bit under test to 1
   verify that it is 1 and that all other bits are 0
   set the bit under test to 0
}

A moving zeros test is the same, except that 0 and 1 are<br />

swapped in this code.<br />

Coding this test such that it does not use any RAM to<br />

execute [assuming start up code is running out of flash] is an<br />

interesting challenge, but most CPUs have enough registers to<br />

do the job.<br />
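Translated into C over a word-addressed RAM region, a moving-ones pass could look like the sketch below. This is for illustration only: the real power-up test must run register-only from flash, as noted above, and for large memories the full cross-talk check over all other bits is usually restricted to neighbouring words, since checking everything is quadratic in the memory size.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Moving-ones test: walk a single 1 through every bit position and
   verify that no other bit changes (detects stuck bits and cross-talk).
   Returns 1 on pass, 0 on the first failure. */
int moving_ones(uint32_t *mem, size_t words)
{
    memset(mem, 0, words * sizeof *mem);

    for (size_t w = 0; w < words; w++) {
        for (unsigned b = 0; b < 32; b++) {
            uint32_t bit = (uint32_t)1 << b;

            mem[w] = bit;
            if (mem[w] != bit)             /* bit stuck at 0?       */
                return 0;
            for (size_t k = 0; k < words; k++)
                if (k != w && mem[k] != 0) /* cross-talk elsewhere? */
                    return 0;

            mem[w] = 0;
            if (mem[w] != 0)               /* bit stuck at 1?       */
                return 0;
        }
    }
    return 1;
}
```

A moving-zeros pass swaps the roles of 0 and 1, exactly as in the pseudocode.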



Of course, such comprehensive testing cannot be performed<br />

on a running system. A background task of some type can carry<br />

out more rudimentary testing using this kind of logic:<br />

for each byte of memory
{
   turn off interrupts
   save memory byte contents
   for values 0x00, 0xff, 0xaa, 0x55
   {
      write value to byte under test
      verify value of byte
   }
   restore byte data
   turn on interrupts
}
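In C, with interrupt locking reduced to two hypothetical platform hooks (empty stubs here; a Cortex-M port might use `__disable_irq()`/`__enable_irq()`), this background test could be sketched as:

```c
#include <stdint.h>
#include <stddef.h>

/* Platform-specific interrupt control; stubbed out for the sketch. */
static void irq_disable(void) { }
static void irq_enable(void)  { }

/* Non-destructive byte test for a background task: each byte is
   saved, exercised with four patterns, and restored before
   interrupts are re-enabled.  Returns 1 on pass, 0 on failure. */
int background_ram_test(volatile uint8_t *mem, size_t len)
{
    static const uint8_t patterns[] = { 0x00u, 0xFFu, 0xAAu, 0x55u };
    int ok = 1;

    for (size_t i = 0; i < len; i++) {
        irq_disable();               /* nothing may touch the byte now */
        uint8_t saved = mem[i];
        for (unsigned p = 0; p < sizeof patterns; p++) {
            mem[i] = patterns[p];
            if (mem[i] != patterns[p])
                ok = 0;
        }
        mem[i] = saved;              /* restore live data              */
        irq_enable();
    }
    return ok;
}
```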

These testing algorithms, as described, assume that all you<br />

know about the memory architecture is that it spans a series of<br />

[normally contiguous] addresses. However, if you have more<br />

detailed knowledge - which memory areas share chips or how<br />

rows and columns are organized - more optimized tests may be<br />

devised. This is desirable, as a slow start up will impact user<br />

satisfaction with a device.<br />


V. SOFTWARE ERROR CONDITIONS<br />

Software failure is obviously a possibility and defensive<br />

code may be written to avoid some possible failure modes. Of<br />

course, a bug in the software might lead to a totally<br />

unpredictable failure.<br />

All non-trivial software has bugs. Obviously, well-designed software is likely to have fewer, and the application of modern embedded software development tools can keep them to a

minimum. Of course, specific bugs cannot be predicted<br />

[otherwise they could be eradicated], but certain types of<br />

software problem can be identified and it may be possible to<br />

spot a problem before it becomes a disaster.<br />

I would divide such software problems into two broad<br />

categories:<br />

• data corruption<br />

• code looping<br />

As a significant amount of embedded code is written in C,<br />

that means that developers are likely to be making use of<br />

pointers. Used carefully, pointers are a powerful feature of the<br />

language, but they are also one of the most common sources of<br />

programmer error. Problems with pointer usage are hard to<br />

identify statically and the bugs introduced might manifest<br />

themselves in subtle ways when the code is executed. Some<br />

things, like dereferencing a null pointer are easily detected, as<br />

they normally cause a trap. Others are harder, as a pointer<br />

could end up pointing just about anywhere - more often than<br />

not it will be to a valid address, but, unfortunately, it may not<br />

be the correct one. There is little that self-testing code can do<br />

about this. There are, however, two special cases of pointer<br />

usage where there is a chance: stack overflow and array bound<br />

violations.<br />

Stack overflow should not occur, as the stack allocation<br />

should be carefully determined and its usage verified during the<br />

debug phase. However, it is quite possible to overlook a special<br />

situation or make use of a less testable construct [like a<br />

recursive function]. A simple solution is to include an extra<br />

word at either end of the stack space - "guard words". These<br />

are pre-loaded with a specific value, which is monitored by a<br />

self-test task [which may run in the background]. If the value<br />

changes, the stack limits have been violated. The value should<br />

be chosen carefully. An odd number is best, as that would not<br />

represent a valid address for most processors. Perhaps<br />

0x55555555. So long as the value is "unlikely" - so not<br />

0x00000001 or 0xffffffff, for example - the chance of a false alarm is about 1 in 4 billion.

In some languages, there is built-in detection for addressing<br />

outside the bounds of an array, but this introduces a runtime<br />

overhead, which may be unwelcome. So this is not<br />

implemented in C. Also, it is possible to access array elements<br />

using pointers, instead of the [ ] operator, so any checking<br />

might be circumvented. The best approach is to just check for<br />

buffer overrun type of errors by locating a guard word at the<br />

end of an array and monitoring in the same way as the stack<br />

overflow check.<br />
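The same guard-word mechanism serves both cases. A minimal sketch, with names of our choosing (a real system would typically place the stack guards at the task stack limits via the linker script):

```c
#include <stdint.h>

#define GUARD 0x55555555u   /* odd, "unlikely" value, as discussed above */

/* A data buffer fenced by guard words at both ends.  The identical
   layout applies to a task stack: one guard word at each limit. */
struct guarded_buf {
    uint32_t guard_lo;
    uint8_t  data[64];
    uint32_t guard_hi;
};

void guarded_init(struct guarded_buf *b)
{
    b->guard_lo = GUARD;
    b->guard_hi = GUARD;
}

/* Polled periodically by a background self-test task: returns 0 as
   soon as an overrun (or stack overflow) has smashed a guard. */
int guarded_check(const struct guarded_buf *b)
{
    return b->guard_lo == GUARD && b->guard_hi == GUARD;
}
```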

Code should never get stuck in an infinite loop, but a logic<br />

error or the non-occurrence of an expected external event might<br />

result in code hanging. In any kind of multi-threaded<br />

environment - either an RTOS or mainline code with ISRs - it<br />

is possible to implement a "watchdog" mechanism. Each task<br />

that runs continuously [which might be just the mainline code]<br />

needs to "check in" with the watchdog task [which may be a<br />

timer ISR] every so often. If a timeout occurs, action needs to<br />

be taken. I discussed this matter, from a different perspective,<br />

in a blog about user displays a little while ago.<br />
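A minimal software watchdog of this kind can be sketched with a check-in bitmask (names and task count are ours):

```c
#include <stdint.h>

/* Each continuously running task owns one bit.  Tasks set their bit
   ("check in"); the watchdog timer ISR verifies that every bit was
   set since the last tick, then clears the mask for the next period. */
#define TASK_COUNT 3u
#define ALL_TASKS  ((1u << TASK_COUNT) - 1u)

static volatile uint32_t checkins;

void task_check_in(unsigned task_id)
{
    checkins |= 1u << task_id;
}

/* Called from the watchdog timer ISR.  Returns 1 if all tasks are
   alive; 0 means some task failed to check in and action must be
   taken (alarm, safe state, or reset, depending on the application). */
int watchdog_tick(void)
{
    int all_alive = (checkins & ALL_TASKS) == ALL_TASKS;
    checkins = 0;
    return all_alive;
}
```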

So, what is to be done when a stack overflow, array bound<br />

violation or hanging task is detected? This depends on the<br />

application. It may be necessary to stop the system, sound an<br />

alarm of some kind, or simply reset the system. The choice<br />

depends on many factors, but broadly the goal is for something<br />

better than a crashed system.<br />

VI. FAILURE RECOVERY AND REPORTING<br />

A final question is what to do if a failure is detected. Of<br />

course, this will be different for every system. Broadly, the<br />

system should be put in a safe state [shut down?] and the user<br />

advised.<br />

VII. CONCLUSIONS<br />

It is important to accept that failure of parts of a system is a<br />

possibility. Consideration must be given to all possible failure<br />



modes. Code may be added to a system to monitor its “health”<br />

and take action if a failure is detected. That action may be a<br />

warning to the user or perhaps a rectification of the problem.<br />



Efficient Software Variants Testing<br />

Michael Wittner<br />

Razorcat Development GmbH<br />

Berlin, Germany<br />

www.razorcat.com<br />

Abstract—The challenge in testing software variants is that<br />

every variant needs to be tested completely. In the following, a method to reuse and inherit variant tests is introduced. By defining base tests from which variant tests inherit, redundant work can be avoided. For every application change, tests need to be maintained in one place only.

Keywords—unit integration testing, requirement, certification,<br />

variant management<br />

I. INTRODUCTION<br />

Safety-critical standards in various industries, such as ISO<br />
26262 in automotive engineering or IEC 62304 in the medical<br />
industry, demand complete code coverage. This requires that<br />
every single software variant be tested completely. In<br />
practice this is often realized by copying the tests of one variant<br />
and adapting the copy to the respective other variant.<br />
New software requirements and software changes increase the<br />
cost of such variant testing, because those changes have to be<br />
implemented redundantly in all variants. Besides the high<br />
effort of maintaining and extending such tests, there is also a<br />
high risk of copy-and-paste mistakes, which might<br />
eventually lead to undiscovered safety-critical defects in the<br />
application.<br />

A. What is a variant?<br />

There are various ways to create software variants<br />
(e.g. of C/C++ source code):<br />
• Enabling/disabling code parts via defines<br />
• Generating code variants with tools (e.g. out of<br />
MATLAB)<br />
• Copying, renaming, and changing the source file<br />
• Executing identical sources on different hardware<br />
platforms (applicable for high safety requirements)<br />

A software variant is defined by a particular software<br />

module configuration (e.g. a C source file). Such a variant does<br />

not necessarily need to be functional; it might as well be an<br />
abstract variant. Only through specific settings (mostly defines)<br />
does an abstract base variant turn into actually applicable<br />
software variants.<br />

B. Test goal: Code coverage<br />

To obtain complete code coverage of each variant, one<br />
might simply add the measured coverage of the variant-specific<br />
code to the measured coverage of the commonly used code.<br />
Fig. 1 shows a simple example of a code variant in which<br />
another value should be added to the variable “level”. The<br />
yellow-marked programming error (the missing addition<br />
operator in line 16) could remain undiscovered if the commonly<br />
used code in lines 19-23 remains untested in the variant.<br />

Fig. 1. Code variant with mistake<br />

Therefore it is not enough to test only individual parts of the<br />
variant code assembled by defines and to add up the code<br />
coverage measurements: every code variant needs to be treated<br />
as an independent program, because hidden or added parts<br />
can influence the shared parts.<br />

II. SOLUTION APPROACH<br />

In the following, a method to establish and maintain variant<br />
tests is presented by means of an example. The example<br />
contains a function for indicating the tank level status of<br />
various vehicles (passenger cars and trucks). An additional<br />
difficulty is that the vehicle variant “truck” can be equipped<br />
with a supplementary tank whose fuel level should also be<br />
considered.<br />

A. Example function<br />

The example is about a function that supplies a status<br />

related to the filling level of a vehicle tank. The specification is<br />

graphically displayed in Fig. 2: The function is expected to<br />



give a warning or an alarm when the fuel level falls below<br />

defined marks. If none of this is the case, the function is supposed<br />

to deliver the value “normal”.<br />

On the other hand, the variant hierarchy definition above all<br />
serves clarity in the case of deeply nested software<br />
variants. Within the presented method the variants are arranged<br />
in a variant tree, which can have many levels and displays the<br />
software’s variant structure. This tree serves as an orientation<br />
and shows which test should be created on which variant level.<br />

Fig. 4. Variants hierarchy<br />

Fig. 2. Fuel level value definition with calculation thresholds<br />

A simple implementation of this function could look as<br />
shown in Fig. 3. Through a variant configuration (#define<br />
TRUCK), the supplementary tank’s fuel level can be<br />
included in the calculation.<br />
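A rough C sketch of such an implementation follows. The threshold values and names are hypothetical; only the #define TRUCK variant mechanism is taken from the text:<br />

```c
#define ALARM_LEVEL  5          /* alarm below this mark (hypothetical) */
#define WARN_LEVEL  10          /* warning below this mark (hypothetical) */

typedef enum { FUEL_ALARM, FUEL_WARNING, FUEL_NORMAL } fuel_status_t;

/* #define TRUCK */             /* variant configuration */

#ifdef TRUCK
static int aux_level;           /* supplementary tank, trucks only */
void set_aux_level(int l) { aux_level = l; }
#endif

fuel_status_t fuel_status(int level)
{
#ifdef TRUCK
    level += aux_level;         /* include the supplementary tank */
#endif
    if (level < ALARM_LEVEL)
        return FUEL_ALARM;
    if (level < WARN_LEVEL)
        return FUEL_WARNING;
    return FUEL_NORMAL;
}
```

With TRUCK defined, the shared threshold comparisons run on a different effective input, which is exactly why the common code must be re-tested per variant.<br />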

The example above shows a variant hierarchy of various<br />

vehicles and includes variants with fixed or optional<br />

supplementary tanks for trucks. The variant “truck” in this case<br />

could be either an abstract configuration or a concrete vehicle<br />
type.<br />

C. Variant tests definition<br />

Tests can be divided into two types: basic tests and variant<br />
tests. Basic tests refer to abstract functionality, e.g. the basic<br />
configuration of a software module. On this level all potential<br />
test cases of the variants derived from this basic model can be<br />
defined. Initial test data for the basic tests can also be specified,<br />
though it does not have to be complete. (One does not want to<br />
execute these tests anyway.) Possible test cases can be defined on<br />
the top level of a variant tree, e.g. by means of a classification tree.<br />
The test specification below takes all possible variant<br />
configurations of our example into account.<br />

Fig. 3. Implementation of the filling level function<br />

B. Software hierarchy analysis<br />

The number of software variants to test follows<br />
automatically from all possible software configurations. For the<br />
test it has to be taken into account whether a variant (e.g. the<br />
basic configuration) can actually be tested, depending on<br />
whether it is executable software or just a basic configuration<br />
that is not a functional unit on its own. Sometimes abstract tests<br />
can be defined for those abstract variants. However, actually<br />
executable tests only arise through further implementation of<br />
such tests for a concrete variant.<br />

Fig. 5. Test specification for all variants<br />

The test specification describes the necessary tests that are<br />

now propagated to the children in the variant tree: The<br />

inherited tests can be changed, hidden or completed with<br />

specific tests in every variant. In Fig. 6, for example, all tests<br />
that do not refer to the variant “passenger car” are hidden.<br />



The value “40” within base test case 2.1 is a normal tank level<br />
for an 80-liter tank. This value needs to be increased significantly<br />
within the “Truck” variant, assuming a 1000-liter truck tank.<br />
Therefore the value has been overwritten with “500” for the<br />
variant test.<br />

Fig. 6. Test cases for the variant “passenger car“<br />

As a consequence every superordinate variant test only<br />

needs to be created once and then maintained in only one place<br />

for every new requirement and application change.<br />

D. Unique identification of test cases<br />

A “Universally Unique Identifier” (UUID) is assigned to<br />
every basic test case to uniquely identify test cases. These<br />
UUIDs are unique world-wide and over time. If a test<br />
case is now passed on to a variant, the inherited and possibly<br />
modified test case is still the same (basic) test case. During a<br />
review of the tests, the inherited test cases can be unambiguously<br />
compared with the basic tests or with the tests of other variants.<br />

The UUID assignment also enables geographically<br />
distributed work on variant tests, because all test cases are<br />
uniquely identified and tests can therefore be merged<br />
and updated again without any problem.<br />
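The mechanism can be sketched in C (names and fields are hypothetical): the UUID travels with a test case when it is inherited, so an inherited and locally modified copy can still be matched with its basic test:<br />

```c
#include <string.h>

/* Each basic test case carries a UUID; inherited copies keep it. */
typedef struct {
    char        uuid[37];   /* e.g. "f81d4fae-7dec-11d0-a765-00a0c91e6bf6" */
    const char *name;
    int         expected;   /* a test datum that a variant may overwrite */
} test_case_t;

/* Two test cases are the "same" basic test iff their UUIDs match. */
int same_basic_test(const test_case_t *a, const test_case_t *b)
{
    return strcmp(a->uuid, b->uuid) == 0;
}

/* Inherit a base test into a variant: the UUID is preserved,
   the test data may be overwritten. */
test_case_t inherit(const test_case_t *base, int overwritten_expected)
{
    test_case_t v = *base;
    v.expected = overwritten_expected;
    return v;
}
```

This mirrors the “40” vs. “500” example: the truck variant overwrites the value, yet both copies remain comparable as one basic test case.<br />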

E. Rules for test case inheritance<br />

In our example all possible test cases were defined on the<br />

top level. For some variants (e.g. “passenger car”) only some of<br />

the test cases make sense. Therefore the following inheritance<br />

operations are necessary:<br />

• Changing inherited test data<br />
• Deleting/hiding inherited test cases<br />
• Adding additional test cases<br />

On the test data level it also needs to be distinguished whether a<br />
value was inherited or only defined locally. Synchronizing the<br />
variants updates all inherited values. The following<br />
value statuses therefore result:<br />
• Value was inherited<br />
• Value was inherited and overwritten<br />
• Value was defined locally for this variant test<br />

These values can easily be distinguished through color<br />
coding as shown in Fig. 7: the light blue colored values were<br />
inherited and the purple colored values were overwritten in the<br />
variant.<br />
The example in Fig. 7 shows the inheritance of values from<br />
the base tests to the tests of the variant “Truck”. Most of the<br />
values were assigned within the test specification in the<br />
classification tree, so they are displayed with a grey background.<br />

Fig. 7. Color coding of inherited and overwritten values<br />

III. RESULTS<br />

The following strategy was used for the presented variant<br />

management solution:<br />

• Deduction of the variant definition from the application<br />
design<br />
• Definition of all possible test cases on the highest level<br />
for all variants<br />
• Hiding of the test cases not needed in every variant<br />
• Completing/implementing tests separately in every<br />
variant<br />

The advantage of this approach lies in the centralized<br />
test case specification in one single classification tree. This<br />
increases transparency and offers a complete overview of all<br />
tests in a review. Hiding unnecessary test cases in the individual<br />
variants is relatively easy, while at the same time it directs the<br />
test engineer’s mind to essential questions such as: which test<br />
case is in fact relevant for the actual variant?<br />

A. “Natural” way of creating test cases<br />

For a test engineer it is normally easier to develop tests for<br />

a concrete software variant than thinking about abstract tests<br />

for all potential software variants. Depending on the kind of<br />
software to be tested, it can make sense to first develop a<br />
complete test suite for a specific variant and then transfer these<br />
tests to the topmost base variant. This way the tests can be<br />

inherited down the variant hierarchy and the test engineer can<br />

use all the features of the variant management.<br />

B. Variants within the classification tree<br />

One could also think of introducing variants already into<br />

the test specification (i.e. the classification tree). Filtering of<br />

sub trees according to the selected variant would result in<br />

specific test specifications for each software variant. The whole<br />
classification tree itself would still be available and<br />
maintained at the topmost level of the variant hierarchy.<br />

Reviewers could either look at the overall tree or at each<br />

specific filtered variant test specification.<br />



The Impact of Test Case Quality<br />

Frank Büchner<br />

Principal Engineer Software Quality<br />

Hitex GmbH<br />

Karlsruhe, Germany<br />

frank.buechner@hitex.de<br />

Abstract—Even “good looking” sets of test cases can fail to<br />

detect defects in the source code, e.g. during unit testing, even if<br />

the tests achieve 100% code coverage. However, how do we<br />

develop good tests? This paper tries to give some insights.<br />

Keywords— Test case specification, equivalence partitioning,<br />

boundary values, code coverage, mutation testing, error seeding,<br />

Classification Tree Method (CTM).<br />

I. INTRODUCTION<br />

When it comes to testing, a lot of effort is spent selecting<br />

the “right” testing tool. However, often this effort is expended<br />

for a secondary goal. Certainly, you need a tool that works<br />

for you, your development environment, your project, and your<br />

process. However, what is paramount for good testing is not<br />

the testing tool, but the quality of the test cases. Only “good”<br />

test cases will find defects in the software.<br />

II. SIMPLE EXAMPLE<br />

The specification for a simple test object could be as<br />

follows:<br />

A start value and a length define a range of values.<br />

Determine if a given value is within the defined range or not.<br />

The end of the range shall not be inside the range. All data<br />

types are integer.<br />
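A possible C implementation of this test object could look as follows (a sketch; the name and the convention that the range is [start, start + length) are assumptions):<br />

```c
/* Test object: is `value` inside the range defined by `start` and
   `length`? The end of the range (start + length) shall NOT be
   inside the range. */
int in_range(int start, int length, int value)
{
    /* A typical boundary defect here would be writing `<=` instead
       of `<`, which wrongly puts the end of the range inside it. */
    return value >= start && value < start + length;
}
```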

The following three test cases pass and reach 100%<br />
(MC/DC) coverage.<br />
Fig. 1. These three test cases pass and reach 100% coverage<br />

So, what is wrong with our test cases? All three test cases<br />
pass and we have reached 100% code coverage. The answer is<br />
that we have not tested all requirements (i.e. we have not<br />
tested that the end of the range is not in the range) and, in<br />
consequence, we have not tested using boundary values. A test<br />
using the values 5, 2 and 7 for start, length and value respectively<br />
fails because of a software defect in the test object. This<br />
software defect probably is a wrong check, e.g. ‘<=’ instead of ‘<’.<br />


For each of the defective implementations, the second column<br />
tries to give a combination of the likelihood that (a) the<br />
error is made by the programmer and (b) the defect is not<br />
detected by a review. The likelihood for (a) is considered to be<br />
related to the number of wrong characters in the relation (e.g.<br />
(i <= j) differs from the correct (i < j) by a single character).<br />


Fig. 5. The minimum() function with a programming defect<br />

For the example of a sorting function, extreme inputs<br />
could be an array that is already sorted, one sorted in<br />
reverse order, or one in which all elements have the same value.<br />

D. Illegal Values<br />

Let us go back to the simple example of section II. The<br />

specification says, “All data types are integer”. This also holds<br />
for the length of the range and, as a consequence, the length<br />
could be negative. Is a negative length a valid input? Probably<br />
not. At least, it is an interesting test to try the input 5 for the start<br />
and -2 for the length, and to see whether the implementation<br />
considers 4 to be inside the range or not.<br />

As a rule of thumb: Always look for (maybe) invalid input<br />

and construct test cases out of it.<br />

IV. EQUIVALENCE PARTITIONING<br />

A ubiquitous problem related to test case specification is<br />

that an input variable can take on too many values, and it is not<br />

possible / not efficient to use all values in test cases, especially<br />

in combination with many values of other input variables<br />

(“combinatorial explosion”). Generation of equivalence classes<br />
(also called “equivalence partitioning”) solves this problem.<br />

Equivalence partitioning divides all input values into classes.<br />

Values are assigned to the same class, if the values are<br />

considered equivalent for the test. Equivalent for the test means<br />

that if one value out of a certain class causes a test to fail and<br />

hence reveals an error, every other value out of this class will<br />

also cause the same test to fail and will reveal the same error.<br />

In other words: It is not relevant for testing which value out<br />

of a class is used for testing, because they all are considered to<br />

be equivalent. Therefore, you may take an arbitrary value out<br />

of a class for testing, even the same value for all tests, without<br />

decreasing the relevance of the tests. However, the prerequisite<br />
for this is that the equivalence partitioning was done correctly.<br />
This is the responsibility of the person applying equivalence<br />
partitioning.<br />
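The idea can be made concrete with a small C driver. The input, the class boundaries and the representative values below are invented for illustration:<br />

```c
/* One representative per equivalence class of the input "speed". */
enum { V_MAX = 250 };           /* hypothetical maximum speed */

int speed_is_valid(int speed)
{
    return speed >= 0 && speed <= V_MAX;
}

/* Classes: negative | zero | normal | v_max | above v_max */
static const int representative[] = { -10, 0, 120, V_MAX, V_MAX + 1 };
static const int expect_valid[]   = {   0, 1,   1,     1,         0 };

/* Exercise the check once per class; by the equivalence assumption,
   any other member of a class would yield the same verdict. */
int all_classes_pass(void)
{
    unsigned n = sizeof representative / sizeof representative[0];
    for (unsigned i = 0; i < n; i++)
        if (speed_is_valid(representative[i]) != expect_valid[i])
            return 0;
    return 1;
}
```

Five values stand in for the whole integer input domain, which is the entire point of the method.<br />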

Fig. 6. Example for equivalence partitioning (according to shape)<br />

V. THE CLASSIFICATION TREE METHOD (CTM)<br />

The Classification Tree Method (CTM) is a method for test<br />

case specification supporting the methods for test case<br />

derivation discussed so far.<br />

The CTM starts by analyzing the requirements. The<br />

requirements determine which inputs are relevant (i.e. which<br />

inputs should be varied, i.e. which inputs should have<br />

different values during the test). In the next step the possible<br />

input values are divided into classes according to the<br />

equivalence partitioning method. The third step is to consider<br />

boundary / extreme / invalid input values. These three steps<br />

result in the classification tree. The classification tree forms the<br />

upper part of the (graphical) representation of the test case<br />

specification according to the CTM. The root of the tree is at<br />

the top; the tree grows from top to bottom; classifications have<br />

frames; classes are without frames; the leaf classes form the<br />

head of the combination table. The combination table consists<br />

of lines, each line specifying a test case. Markers on the<br />

respective line select equivalence classes, from which the<br />

values for the test are taken. A human draws the tree, giving<br />

names to classifications and classes, and sets the markers on<br />

the test case line, i.e. test case specification is a human activity<br />

(subject to human error, unfortunately).<br />
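A combination table like the one described can also be modelled directly in C, which makes the rule that every class must be used at least once mechanically checkable. The seven table lines below follow the suspension example; the exact names are assumptions:<br />

```c
/* Leaf classes of the two classifications, encoded as enums. */
typedef enum { SPEED_NEGATIVE, SPEED_ZERO, SPEED_NORMAL,
               SPEED_VMAX, SPEED_TOO_HIGH } speed_class_t;
typedef enum { ANGLE_LEFT, ANGLE_CENTRAL, ANGLE_RIGHT } angle_class_t;

/* One line of the combination table = one test case specification:
   one marker per classification, selecting the class the value is
   taken from. */
typedef struct {
    const char   *name;
    speed_class_t speed;
    angle_class_t angle;
} ctm_test_case_t;

static const ctm_test_case_t table[] = {
    { "normal left",    SPEED_NORMAL,   ANGLE_LEFT    },
    { "normal central", SPEED_NORMAL,   ANGLE_CENTRAL },
    { "normal right",   SPEED_NORMAL,   ANGLE_RIGHT   },
    { "zero",           SPEED_ZERO,     ANGLE_CENTRAL },
    { "v_max",          SPEED_VMAX,     ANGLE_CENTRAL },
    { "negative",       SPEED_NEGATIVE, ANGLE_CENTRAL },
    { "above v_max",    SPEED_TOO_HIGH, ANGLE_CENTRAL },
};

/* A valid CTM specification uses every leaf class at least once. */
int every_class_used(void)
{
    int speed_used[5] = {0}, angle_used[3] = {0};
    unsigned n = sizeof table / sizeof table[0];
    for (unsigned i = 0; i < n; i++) {
        speed_used[table[i].speed] = 1;
        angle_used[table[i].angle] = 1;
    }
    for (int s = 0; s < 5; s++) if (!speed_used[s]) return 0;
    for (int a = 0; a < 3; a++) if (!angle_used[a]) return 0;
    return 1;
}
```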

Fig. 7. Test case specification using the Classification Tree Method (CTM)<br />

In the figure above an example for the test case<br />

specification according to the CTM is given. The root of the<br />

tree is labelled “Suspension”, i.e. the test object obviously is a<br />

suspension. Also quite obviously, two inputs are relevant for<br />

the test: “Speed” and “Steering Angle”. “Speed” and “Steering<br />

Angle” are classifications (in frames), at the topmost level also<br />

called “test relevant aspects”. Both classifications are divided<br />

into equivalence classes (which do not have a frame). For<br />

“Steering Angle” there are three equivalence classes: “left”,<br />

“central”, and “right”. From the classification tree, we cannot<br />

conclude which values are inside a certain class, e.g. “left”, and<br />

how the values are represented. This is implementation-dependent,<br />
and this is not relevant for the CTM, being a black-box<br />
test specification method. (The test case specification is<br />

abstract.) If one does not take “central” as an extreme steering<br />

angle position, no boundary / invalid / extreme values are<br />

forced for “Steering angle”. This is different for “Speed”. The<br />

classification “Speed” is divided into the two equivalence<br />

classes “valid” and “invalid”. The latter class guarantees that<br />



invalid values for speed will be used during testing, because in<br />

a valid specification according to the CTM, all classes<br />
present in the tree need to be used in at least one test case<br />
specification. The class “invalid” is divided again using<br />

the classification “Too low or too high?”. This results in<br />

additional classes “negative” and “> v_max”. Test cases using<br />

values from these two classes will find out what happens if the<br />

unexpected hits the (software) test object. The valid speeds are<br />

divided into “normal” speeds and “extreme” speeds. We can<br />

assume that the class “zero” for a valid speed contains only one<br />
value (probably the value 0), as does the class “v_max”, which<br />
probably contains only the maximum speed as specified in the<br />
requirements.<br />

The combination table (the lower part of the figure above)<br />

consists of seven lines and, hence, specifies seven test cases.<br />

The test case specifications can have names. The markers set<br />
on each line indicate which classes provide a value for the test.<br />
The values are eventually combined to form a test case<br />
specification, which depicts the purpose of a test case. In our<br />
case, this is also indicated by the name of the test case<br />
specification, but this does not always have to be the case.<br />

From the test case specification, it is clearly visible that<br />

there are only three “normal” test cases (the first three test<br />

cases). For instance, a test case specification that requires<br />

testing e.g. low speed with the steering angle right does not<br />

exists. This is obvious. If you feel that three normal test case<br />

specifications are not enough, you might opt to add an<br />

additional one. However, the question is not if three is enough;<br />

the point is that it is obvious for everyone that there are only<br />

three. This is an important advantage of the CTM.<br />

The unit testing tool TESSY includes an editor for<br />

classification trees, i.e. unit tests for TESSY can be specified<br />
using the CTM.<br />

VI. RECOMMENDATIONS FROM ISO 26262<br />

ISO 26262:2011 lists in part 6, section 9, table 11 the<br />

methods for deriving test cases for software unit testing [7].<br />

Fig. 8. Methods for deriving test cases from ISO 26262<br />

Hint for the interpretation of the table in the figure above:<br />

The recommendation depends on the Automotive Safety<br />

Integrity Level (ASIL). ASIL ranges from A to D, where D is<br />

the highest level (i.e. the level requiring the most effort to<br />

reduce risk). Methods that are “highly recommended” are<br />

marked by a double plus sign (“++”); methods that are<br />

“recommended” are marked by a single plus sign (“+”).<br />

Methods numbered 1a, 1b, 1c, … are alternative entries;<br />

methods numbered 1, 2, 3, … are consecutive entries. For<br />

alternative entries, an appropriate combination of methods shall<br />

be applied in accordance with the ASIL; for consecutive<br />

entries, all methods shall be applied in accordance with the<br />

ASIL.<br />

Method 1a of table 11 requires that the test cases for<br />

software unit testing are derived from the requirements. This is<br />

highly recommended for all ASILs. Starting from the<br />

requirements is the naive approach.<br />

Method 1b of table 11 requires that generation and analysis<br />

of equivalence classes is used to derive test cases for software<br />

unit testing. This is recommended for ASIL A and highly<br />

recommended for ASIL B to D.<br />

Method 1c of table 11 requires analysis of boundary values<br />

to derive test cases for software unit testing. This is<br />

recommended for ASIL A and highly recommended for ASIL<br />

B to D.<br />

Method 1a, 1b, and 1c were already discussed in the<br />

preceding sections of this paper.<br />

Method 1d of table 11 requires error guessing to derive test<br />

cases for software unit testing. This is recommended for all<br />

ASILs. Error guessing is discussed in the following section.<br />

A. Error Guessing<br />

Error guessing usually requires an experienced tester who is<br />

able to find error-sensitive test cases from experience. Hence, it<br />

is usually an unsystematic method (as opposed to the first three<br />

methods). [I admit, you could use checklists or failure reports<br />

of previous systems or something similar as basis for<br />

guessing.] Error guessing relates to thinking about possible<br />

invalid / unexpected / extreme test cases, because this is<br />

actually error guessing. If a system under test has two buttons,<br />

and it is supposed that only one of these buttons is pressed at a<br />

time: What happens if the two buttons are pushed<br />

simultaneously? Can a button be pushed too fast / too often /<br />

too long? These are examples of error guessing.<br />

VII. ALTERNATIVES<br />

This section discusses alternative methods for test case<br />

derivation, which were not discussed in the previous sections.<br />

A. Test Cases from the Source Code<br />

It is tempting to use a tool to automatically generate test<br />
cases from the source code, e.g. with the objective that the test<br />

cases reach 100% code coverage. Different technical<br />

approaches exist, e.g. genetic algorithms or backtracking. Both<br />

open source and commercial tools implement these approaches.<br />

So why not leverage these tools? Generating test cases from the<br />

source code has some aspects that you should be aware of:<br />

1. Omissions: You will not detect omissions in the code. I.e. if<br />

a requirement is “if the first parameter is equal to the<br />

second parameter, then an error shall be returned” and the<br />

implementation of this check is missing: This problem will<br />

not be detected by test cases derived from the source code.<br />

This is evident. You need test cases that check if each<br />

requirement is implemented correctly. Such a test case will<br />

detect the missing implementation.<br />

2. Correctness: You cannot decide from the code whether it is<br />
correct or not. E.g. you cannot decide whether a decision<br />
should read (i < j) or (i <= j) just by looking at the code.<br />
You need to check the behavior of the code against the<br />
requirements.<br />

Because of these two aspects, it is not sufficient to use only<br />

test cases generated automatically from the source code; you<br />

need test cases that test the requirements (at least you need a<br />

test oracle). But isn’t it still a good idea to let a tool do a lot<br />
of the work and check afterwards whether the generated test<br />
cases also test the requirements and, if not, change/extend the<br />
tests accordingly?<br />

Recently I came across a study [5] that tries to answer<br />

exactly that question. The main statements from this study are:<br />

1. Automatically generated test suites achieve higher code<br />

coverage than manually created test suites.<br />

2. Using automatically generated test suites does not lead to<br />

detection of more defects.<br />

3. Automatically generated test cases have a negative effect<br />

on the ability to capture intended class behavior.<br />

4. Automated test generation does not guarantee higher<br />

mutation scores.<br />

The study used the tool EvoSuite that automatically<br />

generates JUnit tests for Java classes. It was an empirical study,<br />

where students tried to detect defects in Java code, some of<br />

them starting from test cases generated by EvoSuite, some of<br />

them creating the test cases by themselves.<br />

The conclusion I draw from this study is that automated test<br />

case generation does not bring an advantage to testing (more<br />
defects found, less effort spent, etc.). On the other hand, it is<br />
also not a disadvantage.<br />

Obviously, conditions of the study can be discussed<br />

(programming language used, programming skills of the<br />

students, etc.) but the tenor is surprising in my opinion.<br />

B. Random Test Data / Fuzzing<br />

Like the generation of test cases from the source code, it is<br />

also tempting to use automatically generated test input data.<br />

Many test cases can be generated and run quite effortlessly<br />
once automated test execution is in place. However, a<br />
(functional) test case needs an expected result, and it can be<br />
quite an effort to verify that every one of those many test cases<br />
delivers the expected result, unless you have some kind of test oracle at<br />

hand. Running randomly generated test cases without checking<br />

the expected result is robustness testing. Only obvious<br />

misbehavior (e.g. denial of service, crash, etc.) will be detected.<br />

On the other hand, this can lead to the surprising detection of<br />

safety and security vulnerabilities. The process of stressing a<br />

test object with syntactically correct, but more or less randomly<br />

generated test input is called “fuzzing”.<br />

C. Artificial Intelligence<br />

Nowadays, artificial intelligence (AI) can accomplish many<br />

astonishing things, but I am currently not aware that AI is used<br />

successfully in the automated generation of test cases for<br />

embedded systems.<br />

VIII. MUTATION TESTING<br />

As we have seen in the previous sections, 100% code<br />
coverage does not guarantee the quality of the test cases. But<br />
how can we rate the quality of our test cases? One possibility is<br />
mutation testing (called “error seeding” in IEC 61508 [8]).<br />
Having a set of passing test cases, you can mutate your code.<br />
Mutation means changing the code semantically while keeping it<br />
syntactically correct. E.g. you can change a decision from (i < j)<br />
to (i <= j).<br />
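As a minimal C illustration of such a seeded error (reusing the range-check example from section II; names are assumptions), only a boundary-value test distinguishes the original from the mutant:<br />

```c
/* Original decision ... */
int in_range(int start, int length, int value)
{
    return value >= start && value < start + length;
}

/* ... and a mutant that flips `<` to `<=` (the seeded error). */
int in_range_mutant(int start, int length, int value)
{
    return value >= start && value <= start + length;
}

/* The boundary-value test from section II kills this mutant; a test
   suite without boundary values may let it survive despite 100%
   code coverage. */
int mutant_killed(void)
{
    return in_range(5, 2, 7) != in_range_mutant(5, 2, 7);
}
```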


RISC-V; the Software and Hardware Aspects of an<br />

Open Source ISA<br />

Rob Oshana<br />

Vice President, Software Engineering<br />

Microcontrollers, NXP Semiconductor<br />

Austin, TX, USA<br />

robert.oshana@nxp.com<br />

Abstract—In this paper we discuss innovation and<br />

education using RISC-V. RISC-V is an open, free ISA<br />

enabling a new era of processor innovation through open<br />

standard collaboration.<br />

Keywords—RISC-V, ISA, Chisel<br />

I. INTRODUCTION<br />

RISC-V is a high-quality, license-free, royalty-free RISC<br />

ISA specification originally from UC Berkeley. The standard<br />
is maintained by the non-profit RISC-V Foundation. This<br />

technology is suitable for all types of computing systems, from<br />

microcontrollers to supercomputers. There are numerous<br />

proprietary and open-source cores in the industry today. The<br />

technology is experiencing rapid uptake in industry and<br />

academia, and is supported by a growing shared software<br />

ecosystem. RISC-V technology can also be used for<br />

experiments in innovation and education and these two areas<br />

will be explored in this paper.<br />

RISC-V has these key characteristics:<br />
• Simple: far smaller than other commercial ISAs<br />
• Clean-slate design: clear separation between user and<br />
privileged ISA, avoiding microarchitecture- or<br />
technology-dependent features<br />
• Modular: an ISA designed for<br />
extensibility/specialization, implying a small<br />
standard base ISA with multiple standard extensions<br />
and sparse, variable-length instruction encoding<br />
for vast opcode space<br />
• Stable: the base and standard extensions are<br />
frozen, and additions come via optional extensions, not<br />
new versions<br />
• Community-designed: developed with leading<br />
industry/academic experts and software developers<br />

II. SOFTWARE DRIVES HARDWARE<br />

Our interest in RISC-V is driven primarily by a model of<br />
software driving hardware architecture. One of the tools we<br />

experimented with is Chisel. Chisel is an open-source<br />

hardware construction language from UC Berkeley that<br />

supports hardware design using parameterized generators and<br />

layered domain-specific hardware languages.<br />

We chose to use Chisel for architectural investigation<br />
primarily because:<br />
• Chisel can be used by software teams for ISA<br />
architectural exploration<br />
• It supports use-case investigation, e.g. extending RISC-V<br />
with custom instructions<br />

As an example we chose an IP checksum for an IPv4 packet:<br />
ipcsum rd, off(rs)<br />
The Chisel workflow we used went from design to<br />
simulation and testing. Chisel allows relatively easy<br />
integration with existing Verilog RTL designs. Chisel is easy<br />
to extend and benchmark; it is possible to benchmark<br />
specialized RISC-V instructions against C programs.<br />

The advantages of Chisel architecture evaluation include:<br />

• High-level programming-language features<br />

• Opens the hardware to software engineers and architects<br />

• Lower SLOC compared with equivalent Verilog (~1/3)<br />

• Free and fast simulation tools such as Verilator<br />

• Program execution can be analyzed at a lower level than usual debuggers (e.g. gdb) allow<br />

• Easy integration into existing Verilog RTL projects (minimal Verilog glue logic may be required)<br />

• From GPP to ASIP through custom instructions<br />



We chose to evaluate an IP checksum for an IPv4 packet – ipcsum rd, off(rs). This benchmark has a 20-byte IPv4 header located at offset off from the address in register rs; the checksum is stored in register rd.<br />

We developed a Chisel class implementing the instruction (~60 lines), as shown in Figure 1.<br />

• Hardware cost: functional unit, decoding logic, unit<br />

control logic<br />
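The paper does not reproduce the benchmarked C source, but the computation ipcsum performs is the standard Internet (one's-complement) checksum of RFC 1071. The sketch below is an illustrative host-testable stand-in, not the paper's actual code:<br />

```c
#include <stdint.h>
#include <stddef.h>

/* One's-complement checksum over an IPv4 header (RFC 1071 style).
 * Illustrative stand-in for the C routine benchmarked against ipcsum. */
static uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    /* Sum the header as big-endian 16-bit words. */
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);

    /* Fold carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Running the routine over a header whose checksum field is zeroed yields the value to store in that field; running it over the completed header yields 0, the usual receive-side validity check.<br />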

III. EDUCATION<br />

We are also interested in broader educational endeavors<br />

with RISC-V-based architectures. We are building a package<br />

of hardware and software to allow for educational<br />

experimentation. Our system includes multiple RISC-V cores<br />

and associated peripherals and interconnect, plus software<br />

enablement (Figure 3).<br />

Figure 1. Chisel benchmark<br />

The Chisel design flow we used is shown in Figure 2. The<br />

key steps used in this flow are:<br />

• Step 1 - Chisel compiler<br />

• Step 2 - Verilator translator<br />

• Step 3 - Integration into Verilog code base (manual or<br />

by extending Chisel BlackBox class functionality)<br />

• Step 4 – Implement test scenarios and build the<br />

emulator<br />

Figure 3. Microcontroller device based on RISC-V<br />

Conclusion<br />

The “openness” of RISC-V allows different ways of working<br />

in a hardware/software environment. We are moving towards a<br />

model of “software drives hardware” in the creation of IoT<br />

systems. RISC-V can enable innovation and education, which<br />

is our primary interest in this technology.<br />

Figure 2. Chisel design flow<br />

We chose to benchmark a set of specialized RISC-V instructions against the sample C program. The target was a RISC-V processor – rv32_3stage (z-scale) from the Sodor designs (https://github.com/ucb-bar/riscv-sodor).<br />

The C IP-checksum function was compiled with GCC (riscv32, -O2), which produced a 35-cycle execution time. The ipcsum instruction takes 7 cycles (1F/5EX/1WB), a 5X speedup over the C implementation.<br />

ACKNOWLEDGMENTS<br />

I would like to thank Alex Badicioio for his contributions to<br />

this paper.<br />

REFERENCES<br />

[1] Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, Krste Asanović (EECS Department, UC Berkeley), "Chisel: Constructing Hardware in a Scala Embedded Language," Design Automation Conference, 2012<br />

[2] Jonathan Bachrach, Krste Asanović, John Wawrzynek (EECS Department, UC Berkeley), "Chisel Tutorial," Design Automation Conference, October 26, 2012<br />

This hardware/software tradeoff can be summarized as:<br />

• Software gain: execution speed, code size reduction<br />



When is a Custom SoC<br />

the Right Choice for IoT Products?<br />

Economic benefits and challenges when building Custom SoCs<br />

Michele Riga<br />

Embedded and Automotive<br />

Arm<br />

110 Fulbourn Road, Cambridge, UK<br />

michele.riga@arm.com<br />

The Internet of Things (IoT) will result in an explosion in the<br />

number of connected devices. Many of these devices will be built<br />

using standard off-the-shelf silicon, while others will significantly benefit<br />

from using a System-on-Chip (SoC), customized for the specific<br />

application. Benefits of a custom SoC/ASIC include reduced cost,<br />

improved performance and added functionality, all within smaller<br />

form factors. Designing a custom SoC no longer requires large<br />

investments, thanks to the availability of proven IP and many<br />

competitive design services, as well as access to mature process<br />

nodes - enabling SoC design at reduced cost. This paper explores<br />

the economic benefits and challenges of building custom SoCs,<br />

supported by a real case study of an industrial control application,<br />

with real world data on cost, size and power consumption, and the<br />

impact of feature enhancements.<br />

Keywords - ASIC, custom SoC, custom ASIC, IoT, PCB, FPGA,<br />

MCU, microcontroller, Cortex-M, sensor IC, mixed-signal,<br />

microchip.<br />

I. INTRODUCTION<br />

The past decades have been marked by an exponential growth in the number of digital devices shipped every year and present in everyday life – a glance at the exponential growth in the number of microchips shipped by Arm’s partners in the last 20 years (Fig. 1) conveys the scale of this impressive growth.<br />


Fig. 1. Exponential growth over time of the number of chips shipped by<br />

Arm’s partners.<br />

Many new devices have reached consumers, such as first mobile phones and then smartphones, wearables, and so on, while various existing products have become more and more digital, such as automobiles, white goods, and many others.<br />

This is just the beginning of the new IoT revolution. In the<br />

next few years, all areas of life and work will be shaped by the<br />

explosion of new technologies: a “typical family home could<br />

contain more than 500 smart devices by 2022” [1], cars will<br />

become autonomous and connected, powered by hundreds of<br />

digital devices, and also many industrial and agricultural<br />

applications will become smart and part of the IoT. It is<br />

predicted that this explosion of new technologies and new<br />

applications will lead to a trillion connected devices in the next<br />

decades.<br />

Many of these devices will share similar functionalities and<br />

requirements, and will be built using off-the-shelf parts chosen<br />

from the large selection provided by large manufacturers. A<br />

part, however, will be characterized by very specific needs,<br />

while also having very strict requirements in performance,<br />

power efficiency and cost – these may be difficult to address<br />

with off-the-shelf parts.<br />

For this second group of applications, the realization of<br />

custom SoCs will enable the creation of compelling and<br />

differentiated solutions, providing large improvements in all<br />

performance, power, area, and cost, as further detailed in the<br />

following sections.<br />

II. SILICON TECHNOLOGY<br />

Everyone who has been active in the silicon industry in the last few decades has heard at least once about Moore’s Law, which states that transistor count doubles approximately every two years. Similarly, the semiconductor manufacturing process used to build chips has evolved, enabling fabrication of designs with transistors of smaller dimensions. Traditionally, a process node takes its name from the size of its transistors, with process nodes that have evolved from several µm to a few nm.<br />

www.embedded-world.eu<br />



The exponential growth in the number of transistors, the<br />

basic building block of silicon hardware, has enabled the<br />

digital revolution, with devices capable of greater and greater<br />

performance over time. It is astonishing to think how the<br />

processing power of a modern smartphone has similar<br />

processing power compared to supercomputers of a few<br />

decades ago – for instance, the Cray-2 supercomputer from<br />

1985, which was the fastest machine in the world for its time,<br />

has a processing power equal to that of an iPhone 4 [2].<br />

The continuous technology push for more advanced process nodes and for an increase in the number of transistors per chip has followed Moore’s Law quite consistently, sustained by ever increasing investments. At the same time, this astonishing progress in chip technology opens up completely new technological paradigms. Previous generations of process nodes can be used for all those applications where added value is provided by incorporating functionalities that do not necessarily scale according to Moore’s Law, such as sensors and actuators. These latter application areas, characterized by functional diversification, are termed More-than-Moore [3].<br />

The availability of mature process nodes provides an<br />

opportunity for the realization of custom SoCs, mixed-signal<br />

devices that combine both functional diversification and digital<br />

components (Fig. 2). This is an opportunity that is becoming<br />

more appealing than ever today, as access to previous<br />

technology becomes cheaper (Fig. 3), while still being<br />

technologically valid, with the possibility to build full masksets<br />

on older process nodes with just a few thousand dollars of<br />

investment.<br />

III. CUSTOM SOC BENEFITS<br />

There are many potential reasons for and benefits of building a custom SoC:<br />

• Make a product more compelling and differentiated, as new features can be added (e.g. connectivity)<br />

• Lower the overall cost of end products, as many discrete<br />

components can be replaced with only one microchip<br />

• Reduce component count, overall complexity, and Printed<br />

Circuit Board (PCB) size – potentially providing higher<br />

efficiency and reliability<br />

• Protect the technology solution for a product, making it<br />

harder, or even impossible, to reverse engineer and copy<br />

• Reduce the complexity of the supply chain, assuming<br />

complete ownership of the microchip design, and thus<br />

ensuring long life supply of the component – through the<br />

use of established foundries and process nodes<br />

• Meet performance and/or cost requirements for a specific<br />

application or product that are impossible to reach with a<br />

PCB or FPGA solution.<br />

IV. CASE STUDY<br />

When deciding to build a custom SoC, the first step is<br />

usually to determine the key targets that the project is<br />

aiming for, and then make the various design choices based on<br />

these targets. The following section analyzes a case study to<br />

provide details on why and how a company decided to build a<br />

custom SoC.<br />

A company active in the oil and gas industry used to build<br />

valve controllers to sense pressure and temperature, while also<br />

performing controlling functions. The solution was based on a<br />

PCB containing a large variety of off-the-shelf parts, digital<br />

and analog.<br />

For its next-generation product, the company decided to replace<br />

the many off-the-shelf parts with one integrated solution. Key<br />

drivers were to reduce costs, improve reliability, and simplify<br />

the inventory and supply management - since some of the<br />

vendors were planning to discontinue the production of<br />

components used in the current solution. In addition to that,<br />

they planned to add connectivity capability, so as to be able to<br />

remotely manage valves deployed in the field and hence reduce<br />

the management costs.<br />

This company had no in-house expertise for the<br />

development of silicon hardware. Therefore, they decided to<br />

entirely outsource the project to S3 Group, one of the many<br />

design houses that provides full design services for the creation<br />

of custom SoCs.<br />

Fig. 2. “The combined need for digital and non-digital functionalities in<br />

an integrated system is translated as a dual trend […]:<br />

miniaturization of the digital functions […] and functional<br />

diversification (More-than-Moore)” [3].<br />

Fig. 3. Process node cost for 80nm, 130nm, and 65nm over time, showing<br />

a steady decline in price (courtesy of IMEC).<br />



The S3 Group built for them a low-power chip based on a<br />

cost-effective process node, 180nm, integrating a Digital-to-<br />

Analog Converter (DAC), an Analog-to-Digital Converter<br />

(ADC), and many interfaces to enable easy connectivity, such<br />

as I²C, UART, SPI – all packaged in a low-power design<br />

consuming 160 µW/MHz. Overall results of this project have<br />

been impressive, with large improvements in cost, power, and<br />

area:<br />

• 80% cost reduction<br />

• 70% power consumption reduction<br />

• 75% smaller PCB size<br />

In addition, the new solution considerably simplified inventory and supply management. The company no longer had to deal with many different vendors, each with a different product roadmap, or stock many different components. With the new custom SoC approach, owned completely by the company itself, they only needed to deal with a single point of contact provided by the S3 Group.<br />

V. ECONOMICS OF CUSTOM SOC<br />

As seen in the previous sections, there are many potential<br />

benefits in building custom SoCs. At the same time, silicon<br />

design is not an easy job, and a custom SoC project should not<br />

be started without full awareness of all the project phases and<br />

costs involved in such a project.<br />

From a high-level perspective, there are five different<br />

stages in custom SoC creation, each with its own required<br />

know-how and related costs: SoC definition, IP selection,<br />

Design & integration, Verification, Implementation (Fig. 4).<br />

Fig. 4. High-level project phases when creating a custom SoC.<br />

A. SoC definition<br />

As with all projects, the first step is defining the SoC and<br />

its key requirements. Typically, there are some feature<br />

requirements, such as security, wireless connectivity, etc. In<br />

addition, there are some minimum performance, power and<br />

area targets to be met.<br />

The performance target directly affects the minimum frequency of the device. Power and efficiency targets can be very important in general, but since custom SoCs usually replace PCB solutions, they typically provide power savings so large that power is not necessarily a focus area. Area, instead, is another important factor, since it directly translates into silicon cost.<br />

B. IP selection<br />

When building a microchip, many required components can<br />

be easily purchased from third-party vendors. On the open market<br />

it is possible to find processor IP, peripherals, radio, and IP to<br />

perform almost any function at a variety of price points.<br />

At the heart of a microchip there is usually a<br />

microprocessor, which can be programmed to perform all the<br />

required tasks. Typically, processors are one of the most<br />

complex IPs in the design and often there is a license fee and<br />

royalties to be paid on the units shipped. Access to IP usually<br />

involves a negotiation to discuss and agree all the details of the<br />

licensing contract, which can take several months to complete.<br />

Arm is the major IP provider, and many off-the-shelf<br />

devices are currently based on Arm Cortex-M processors. To<br />

address the needs of companies that want to build custom SoCs<br />

with reduced initial investment and rapid access to the IP, Arm<br />

has recently enhanced the Arm DesignStart program. One of<br />

the key changes is the DesignStart Pro offering, that enables<br />

companies to quickly access the Cortex-M0 and Cortex-M3<br />

processors through a fixed-term contract, with $0 upfront fee<br />

and a success-based royalty model. Together with the<br />

processor of choice, both DesignStart Pro packages also<br />

provide a wide range of building blocks and peripherals to<br />

build or customize the memory system. In addition to that, the<br />

Cortex-M0 DesignStart Pro provides a simple example system,<br />

while Cortex-M3 DesignStart Pro contains a fully validated<br />

subsystem, named CoreLink SSE-050.<br />

C. Design & integration<br />

Just as each of the off-the-shelf parts in a PCB design needs to be connected, with custom SoCs all the separate components need to be connected in the integrated design. This task is usually performed using a<br />

Hardware Description Language (HDL), such as Verilog.<br />

These are specialized computer languages that enable the<br />

designer to provide a description of the electronic circuit in a<br />

text format; the code is then used as input to Electronic Design<br />

Automation (EDA) tools that convert it into actual transistors,<br />

ready to be built into a chip.<br />

One of the ways to reduce the effort required to perform<br />

this task is to use a good starting point, such as the preassembled<br />

CoreLink SSE-050 IoT subsystem (Fig. 5), included<br />

in the Cortex-M3 DesignStart Pro package. It contains all the<br />

common elements of such a system, and can be used as a<br />

starting point or simply as a reference design:<br />

• Cortex-M3 processor<br />

• Configurable memory system<br />

• Ready-made connectivity to Flash memory with<br />

integrated Flash cache to improve performance and<br />

power efficiency<br />

• Connectivity to the peripherals (not included,<br />

available from many third-party providers).<br />

• Real-time clock<br />

• True Random Number Generator (TRNG) to provide<br />

the foundation for security, that can be upgraded (for a<br />

fee) to one of the Arm CryptoCell IP for enhanced<br />

functionalities<br />

• Dedicated port for radio integration, with pre-verified<br />

integration of the Arm Cordio Radio (available for a<br />

fee)<br />



D. Verification<br />

Deciding on the right set of IP and assembling the system<br />

together in an appropriate way is very important. Similarly,<br />

ensuring that the assembled system meets the functional<br />

requirements and functions properly is essential. Quite<br />

counterintuitively, the amount of effort necessary for<br />

verification usually exceeds that for Design & Integration.<br />

Verification is time consuming. Generally, it is necessary to<br />

run thorough verification tests to ensure that all main use cases<br />

are covered. The reason is that the earlier in the cycle a bug or issue is found, the easier and less expensive it is to solve. When an issue is found during the development stage, it is usually quite easy to fix, and it is generally just a matter of normal engineering effort. However, if a bug is found after tape-out, it might become very difficult to solve. At that point, all that can be done is in software, either by changing the sequence of operations performed so as to ensure that the bug is not exposed<br />

or, potentially, by completely disabling a feature. As a result, a<br />

bug in silicon can result in decreasing the value of the custom<br />

solution or potentially can make the whole development<br />

worthless, if a required feature is completely compromised and<br />

cannot be used.<br />

There are various ways to limit the effort required on<br />

verification when developing hardware:<br />

1. Divide and conquer: verification is more effective<br />

when done first in small blocks. Focus on a small block<br />

of the design, or potentially an IP. For instance, when a<br />

Digital Signal Processor (DSP) block is implemented in-house, it is highly advisable to perform all the<br />

verification on that block separately, before integrating<br />

it with the rest of the system, so as to ensure that it<br />

functions properly in all the possible corner cases that<br />

might become particularly difficult to stimulate when<br />

running the whole system.<br />

2. Reuse: there is no need to reinvent the wheel. Literally<br />

hundreds of IPs can be found already verified,<br />

removing any need to perform block-level verification.<br />

The more complex the IP, the more time-consuming<br />

the verification. Processors are usually very complex,<br />

with the possibility to handle a large variety of<br />

different operations. Rather than building the processor<br />

solution in house, or using unproven solutions, Arm<br />

Cortex-M0 and Cortex-M3 processors can provide the<br />

computing solutions for many different applications –<br />

these processors are fully verified and proven in over<br />

20 billion devices to date.<br />
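The divide-and-conquer point can be made concrete with a host-side block-level test bench: exercise a block's corner cases in isolation before integrating it. The saturating adder below is a hypothetical stand-in for an in-house DSP building block, not an example from the text:<br />

```c
#include <stdint.h>

/* Hypothetical block under test: a 16-bit saturating adder,
 * standing in for an in-house DSP building block. */
static uint16_t sat_add_u16(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;             /* widen to catch overflow */
    return s > 0xFFFF ? 0xFFFF : (uint16_t)s; /* clamp instead of wrapping */
}
```

Block-level tests can then hammer exactly the corner cases (overflow boundaries, identities) that become hard to stimulate once the block sits inside a full-system simulation.<br />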

Cutting corners in verification means taking large risks.<br />

Various examples from the past can help to understand the<br />

huge potential cost of finding a bug in the field, after the silicon<br />

hardware has been manufactured. For instance, in the 90s a<br />

large microchip provider had to recall a whole series of chips<br />

from the market and provide upon-request replacements [4]<br />

because one of the key features did not work in an extremely<br />

unlikely, but possible, scenario [5].<br />

E. Implementation<br />

The last step in the journey to building a custom SoC is to<br />

realize the physical implementation and start manufacturing. If<br />

the implementation is done in-house, this phase involves<br />

carrying out the physical design using an EDA tool and a<br />

physical library developed for a specific foundry and process<br />

node. The majority of custom SoCs are built with both digital<br />

and analog parts, and this requires a slightly more elaborate<br />

process that is nowadays supported by the major EDA vendors.<br />

Through Arm DesignStart, companies can access, with $0<br />

upfront fee, hundreds of physical libraries from 18 partner<br />

foundries, with availability of Logic IP, Standard Cell,<br />

Embedded Memory Compilers, and Interface IP. Arm is the<br />

industry’s leading supplier of foundation physical IP and<br />

processor implementation solutions providing access from the<br />

most mature and least expensive process nodes to the most<br />

Fig. 5. CoreLink SSE-050 system block diagram, illustrating the IP included in DesignStart, other IP available from Arm, and any additional third-party IP<br />

that can be easily integrated into the system.<br />



leading-edge ones (Fig. 6). This enables the implementation of<br />

SoC solutions which address performance, power and cost<br />

requirements for literally all application markets.<br />

Many leading applications require the use of the latest and<br />

most advanced process nodes to be competitive in the market,<br />

and to offer tangible benefits to the users. However, for<br />

applications like custom SoCs, the frequency and power targets<br />

required can usually be met with more mature process nodes,<br />

that also enable mixed-signal designs. Section II Silicon<br />

Technology already addressed how pricing for mature process<br />

nodes is falling considerably over time, providing an offering<br />

that is technologically compelling and cost effective.<br />

When the implementation is ready, the last step is to<br />

connect with the selected design house or foundry for<br />

production. Tape-out costs can vary a lot, mostly depending on<br />

the process node and the type of chosen maskset, which is a<br />

series of physical masks used during the photolithography steps<br />

of semiconductor fabrication. There are three possible masksets<br />

that can be created:<br />

• Multi-project wafer (MPW): this maskset consists of<br />

many projects, potentially from different customers,<br />

distributing the cost of the maskset among all the<br />

projects involved. It can be used for early prototypes, or<br />

also for full production when a very low volume is<br />

required.<br />

• Multi-layer mask (MLM): this maskset contains various<br />

masks that are combined into one, reducing the overall<br />

number of masks required and thus the cost. This solution<br />

reduces the non-recurrent costs, but it results in higher<br />

variable production cost (wafer cost), since more foundry<br />

production time is required for production.<br />

• Full maskset: the most optimized masks for at-scale<br />

production in the chosen process node. Ideal for large<br />

volume projects.<br />

Which maskset to choose mostly depends on the<br />

production volume and the die size. For larger volume<br />

implementations, after the initial test tape-out with MPWs or<br />

MLMs, it is cost effective to move to a full production<br />

maskset. For low-volume implementations, or mid-volume<br />

ones with small die size, MPWs and MLMs are usually very<br />

effective solutions. Many foundries now provide accessible<br />

MPWs or MLMs services, for process nodes such as 90nm,<br />

80nm, 65nm and 40nm.<br />

F. Outsourcing<br />

The implementation phase can easily be outsourced to one of the<br />

many design houses, as in the case study above. Many of<br />

these design houses have services to help in any phase of the<br />

custom SoC realization, providing the possibility of<br />

outsourcing part of the design project, from definition, to<br />

design, integration, and verification. Some of them can provide<br />

a complete service, delivering an already manufactured and<br />

packaged microchip starting only from a requirement<br />

specification document that they can also help define.<br />

To help OEMs and companies that have never built silicon<br />

before, Arm has established the Approved Design Partner<br />

program. This program connects companies to various selected<br />

design houses, audited and chosen for the quality of the<br />

services that they can provide, and with a proven track record<br />

of success using Arm IP.<br />

G. Overall cost considerations<br />

As seen in the journey to build a custom SoC, there are<br />

various key cost areas:<br />

• Engineering costs for design, integration, and verification<br />

– depending on the internal expertise, the engineering<br />

tasks can be easily outsourced<br />

• Cost of IP – there are many IP providers to fit any<br />

requirement. Some, like Arm DesignStart, offer a success-based<br />

royalty model to access proven IP, completely removing<br />

any upfront license fee and thus transforming<br />

this into a pure variable cost<br />

• EDA tools access – there are many EDA tool providers that nowadays offer competitive pricing for small or medium-sized projects<br />

Fig. 6. Physical IP available through DesignStart with no upfront fee: addressing all markets with IP from the most mature process nodes to the leading edge.<br />

• Manufacturing and packaging – many foundries now<br />

provide MPW and MLM services to enable projects with<br />

small volumes. Many design houses and silicon aggregators<br />

can help to connect with the foundries, providing full<br />

support.<br />

All this has made the realization of custom SoCs cost-effective and affordable for small companies or projects with low production volumes (Fig. 7). An analysis performed by IMEC shows that, on 180nm, production of only a few thousand units is required to make a custom SoC project cost-effective,<br />

providing a positive Return-On-Investment (ROI), together<br />

with all the other benefits of custom SoCs seen in III Custom<br />

SoC Benefits.<br />

Fig. 7. Minimum number of units required for an investment in a custom SoC. On 180nm, at current technology cost, a few thousand units per year are the minimum requirement (courtesy of IMEC).<br />

VI. BEYOND SILICON<br />

Assembling the microchip and manufacturing it is not the<br />

end of the project. Hardware has little value when not paired<br />

with software to run and use all the functions implemented.<br />

Software is usually built using development tools and a<br />

compiler, which translates high-level code into machine code,<br />

ready to be executed by the microprocessor embedded in the<br />

SoC. Proven IP from Arm, such as the Cortex-M0 or Cortex-M3<br />

processors, have full support from all the major compilers and<br />

development tools. In addition, when applying for DesignStart<br />

Pro, companies receive a free 90-day time-limited license to<br />

the Keil MDK Professional tool and/or IAR Embedded<br />

Workbench, both of which provide a compiler and debugger in<br />

a GUI.<br />

Depending on the selected processor, there might be some<br />

restrictions on the programming language that can be used. For<br />

instance, many 8-bit and 16-bit microcontrollers need to be<br />

programmed using low-level assembly code. Other processors<br />

are built to enable ease of programming and use. For all the<br />

Cortex-M processor family, a lot of effort has been put into<br />

removing the need for the user to write any assembly code. In<br />

fact, all code, including exception handlers and fault handlers,<br />

can be written in high level languages such as C or C++. In<br />

addition, for any specific situation where an assembly code<br />

instruction would be useful and cannot be inferred from high<br />

level language code, there is the free-to-use Cortex Microcontroller Software Interface Standard (CMSIS), which provides ways to access the processor’s internal registers and standard calls for low-level assembly instructions.<br />

Depending on the requirements and complexity of the<br />

custom solution, it might be possible to develop software that<br />

runs directly on the hardware, also called bare-metal<br />

environment. When a more scalable and modular solution is<br />

required, a Real-Time Operating System (RTOS) can run on<br />

the microprocessor, on top of which the various software tasks<br />

can be developed. Using a standard architecture, such as Arm<br />

Cortex-M processors, enables companies to choose literally<br />

any RTOS provider, saving considerably in development and<br />

porting effort.<br />

Fig. 8. Arm processors provide access to the largest technology ecosystem, with tools, compilers, OS support, accessible software, a thriving developer base, and a<br />

large wealth of resources.<br />



Choosing a proven architecture and vendor for key IP such<br />

as the microprocessor enables access to an established base of<br />

developers who have already built hardware or software using<br />

that architecture or processor family, and are therefore familiar<br />

with all kinds of issues that can be encountered when building<br />

a custom solution. For instance, in the last 20 years, thousands<br />

of companies have joined the Arm partnership and built chips<br />

based on Arm IP. This ecosystem has grown into the largest<br />

technology ecosystem in the world, providing a wide choice of<br />

development tools, compilers and RTOS, and also access to an<br />

extremely large developer base, as evidenced by more than 5<br />

million downloads of CMSIS in 2016. All of this is backed by<br />

the largest open-access development resource library – with<br />

thousands of articles, how-to guides and other resources easily<br />

accessible online.<br />

In summary, the choice of microprocessor IP has direct implications for the development time required to build the software necessary to make use of the hardware solution. Using proven and established vendors can considerably reduce development risk, not only on the hardware side but also in software development.<br />

VII. CONCLUSIONS<br />

Custom SoCs can provide huge benefits to IoT and<br />

embedded applications. Thanks to the huge progress in the<br />

production processes and technology nodes, as well as the<br />

availability of mask sets at reduced cost, companies can now<br />

access suitable process nodes for advanced mixed-signal<br />

designs at competitive prices.<br />

In addition, there is a large variety of IP available, from<br />

proven complex microprocessor IP, such as the Cortex-M3, to<br />

peripherals and accelerators. Pricing for this IP can vary considerably depending on the IP chosen and the vendor; some of these<br />

companies, like Arm through DesignStart, offer access to their<br />

IP with no upfront cost and a success-based royalty model,<br />

transforming the use of the IP into a pure variable cost that<br />

depends on the volume shipped.<br />

Finally, many design houses offer design services,<br />

providing the possibility to outsource part or all of the<br />

development cycle of a custom SoC, potentially providing<br />

tested, packaged chips to make the process even simpler.<br />

The barrier to developing microchips is lowering, with<br />

reduced investment size and lower development risk. This will<br />

result in an explosion in the number of custom SoC solutions<br />

that will power future IoT and embedded applications.<br />

REFERENCES<br />

[1] Contained in Gartner Special Report "Digital Business Technologies".<br />

[2] Published on Experts-exchange.<br />

[3] W. Arden, M. Brillouet, P. Cogez, M. Graef, B. Huizing, R. Mahnkopf,<br />

“More-than-Moore” white paper, International Technology Roadmap<br />

for Semiconductors, 2011.<br />

[4] “Intel adopts upon-request replacement policy on Pentium processors<br />

with floating point flaw; Will take Q4 charge against earnings".<br />

Business Wire. 1994-12-20<br />

[5] “Statistical Analysis of Floating Point Flaw: Intel White Paper”, 9 July<br />

2004. p. 9. Solution ID CS-013007. Retrieved 5 April 2016.<br />



The Business Case for Affordable Custom Silicon<br />

Edel Griffith, Darren Hobbs, Dermot Barry<br />

S3 Semiconductors<br />

Dublin, Ireland<br />

info@s3semi.com<br />

Abstract—This paper will show how custom silicon – every customer’s ideal: a highly optimized, efficient system designed exactly to their application requirements – is no longer the high-cost solution it once was, and is now a very feasible option even at lower volumes.<br />

Keywords—custom integrated chips, custom ASICs, System<br />

Integrators, OEMs<br />

I. INTRODUCTION<br />

Historically, custom chips or ASICs were considered cost-prohibitive and only possible for companies that were shipping<br />

millions of units a year. These days however, custom integrated<br />

chips are possible for many device makers and original<br />

equipment manufacturers (OEMs) who in the past may have<br />

found such designs outside their budgets. With high volume<br />

consumer products pushing the cutting edge of advanced process<br />

nodes, foundry capacity at mature process nodes is being freed<br />

up. These mature process nodes are almost fully depreciated, making it more affordable than ever to fabricate custom chips on them.<br />

Year on year the number of discrete components in a system<br />

increases. As the number of components increase, the associated<br />

cost and size of the printed circuit board (PCB) also increases.<br />

Being able to integrate these components onto a single custom<br />

integrated chip has the potential to offer considerable bill of<br />

materials (BOM) cost and footprint savings. In this paper, S3<br />

Semiconductors will examine two recent case studies whereby<br />

BOM cost and footprint savings achieved were of the order of<br />

80-90% reduction on the previous generation products where<br />

discrete components on a PCB were utilized.<br />

S3 Semiconductors will also show how inputting a number<br />

of variables into their online BOM calculator allows you to<br />

quickly attain an estimate of the breakeven volume and total<br />

BOM savings achievable using a custom ASIC versus a discrete<br />

solution. Taking inputs such as how many devices are expected<br />

to be manufactured every year, the expected product lifetime,<br />

coupled with some design questions like how many data<br />

converters are on board, what level of integrated processing is<br />

needed, and whether you have RF requirements and what type<br />

of connectivity is required. The results will show how there is a<br />

business case for affordable custom silicon now available to<br />

even low- and medium-volume applications.<br />

We will then discuss two recent case studies, from two very different market segments, and show how moving from a discrete solution to a custom integrated chip delivered reductions of over 80% in both BOM cost and size for two companies.<br />

II. INDUSTRY CHANGES<br />

Technology and manufacturing techniques have moved on a<br />

lot in recent years. Changes in market and technology are<br />

driving the demand for custom systems-on-chip (SoCs) but are also<br />

helping to reduce the cost of such solutions versus similar<br />

solutions in the past.<br />

A. Market Changes<br />

Industry 4.0 (or the fourth industrial revolution) is a<br />

collective term embracing many contemporary automation, data<br />

exchange and manufacturing technologies. It facilitates the<br />

vision and execution of a modular ‘smart factory’ in which<br />

cyber-physical systems monitor physical processes, create a<br />

virtual copy of the physical world and make decentralized<br />

decisions. Over the Industrial Internet of Things (IIoT), cyber-physical systems communicate and cooperate with each other<br />

and with humans in real time, and via the Internet of Services.<br />

Sensors in the global IIoT market generated revenue of $3.77B. This is estimated to increase to $11.23B in 2021, a CAGR of 16.8% [1]. Within this market, the largest share, 38%, is attributed to industrial control applications.<br />
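The quoted growth figures can be sanity-checked with a few lines of arithmetic. Since the base year of the $3.77B figure is not stated in the text, the sketch below assumes a roughly seven-year horizon to 2021, the assumption under which the numbers reproduce the quoted rate:<br />

```python
# Sanity check of the IIoT sensor market growth figures quoted above.
# Assumption (not stated in the source): the $3.77B base-year figure
# is about seven years before the 2021 forecast.
start_revenue_b = 3.77    # base-year revenue, $B
end_revenue_b = 11.23     # 2021 forecast, $B
years = 7                 # assumed horizon

cagr = (end_revenue_b / start_revenue_b) ** (1.0 / years) - 1.0
print(f"implied CAGR: {cagr:.2%}")  # close to the quoted 16.8%
```
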

As the level of automation in factories grows and processing<br />

becomes more automated, the number of inputs and outputs in<br />

industrial electronic systems has also increased. There are now<br />

inputs coming from many different sources – multitudes of<br />

sensors, as well as the traditional switches, keyboards, touch<br />

pads, encoders and scanners to name a few. The sensors are<br />

required to monitor input conditions like pressure, temperature,<br />

humidity, air quality, acceleration or a host of other<br />

environmental conditions. On the other side, there are also many<br />

outputs like drivers for display, heaters, motors, actuators or<br />

switches.<br />

Factory automation requires real-time data. Process<br />

monitoring reduces waste and improves efficiency. Being able<br />

to access this information remotely and in real time allows for the optimization of resources, energy and manpower. In cases of<br />

variation in performance, operators can be warned in advance to<br />



mitigate production downtime. Predictive maintenance<br />

and improved efficiencies should impact positively on the<br />

bottom-line.<br />

In the past, solutions like these sensor-based systems were<br />

designed using printed circuit boards (PCBs) containing discrete<br />

off-the-shelf components. These catalogue products are<br />

designed to service many different applications and therefore<br />

can be over specified for what the application actually requires.<br />

This can add to the cost of the products but also means that<br />

OEMs requiring low to medium volumes can become lower-tier customers to the component suppliers, and the risk of obsolescence is greater if the high-volume customers no longer require the product. Obsolescence can also cause major issues<br />

if single or multiple components on a board are no longer<br />

manufactured. In the best case, a new component can be used to replace the obsolete part; in the worst case, a whole re-spin of the board may be required, with potential changes to functionality if an exact replacement product is not available.<br />

B. Technology Changes<br />

Process nodes (the minimum feature size in an integrated<br />

circuit) have been reducing every year, as consumer device<br />

manufacturers try to push the boundaries of silicon performance.<br />

This push has seen the number of transistors on-chip continuing<br />

to increase. Designers of custom electronics are moving towards<br />

the most advanced process nodes available, enabling even<br />

thinner mobile phones and faster computers. These high-volume<br />

opportunities are therefore centered around cutting-edge<br />

processes. To make these new generations of ICs, ever more sophisticated fabs are opening, which in turn has left the world with many fabs that are no longer quite at the cutting edge.<br />

Designers may often see these fabs as outdated, but the production technology is proven, with high yield and reliability, and, being fully depreciated and lower cost, they present an opportunity to produce better products.<br />

According to reports by SEMI [2], the foundry market is<br />

projected to grow to $97.5B by 2025. While the volume growth<br />

will be in process nodes like 10nm and below, it is interesting to<br />

observe that still over 50% of the market will be for processes<br />

greater than 20nm.<br />

Fig. 1. Foundry Market by Feature Dimension<br />

So, while the volume demand may be centered around the<br />

smaller process nodes as foundries build newer and newer<br />

factories to meet the demand for the ever-smaller geometries,<br />

they find themselves with mature, reliable factories at capable<br />

geometries that are not at the risky leading edge. The foundries wish to maximize fab utilization across all process nodes. They want to get the best return on the investments already made in these process nodes, and as the factories are fully depreciated, their costs can be lower.<br />

As a result, an excellent opportunity exists to be able to<br />

leverage this technology for custom integrated chips in an<br />

economical manner.<br />

S3 Semiconductors, with over 20 years’ experience in the semiconductor market, has been fostering and developing partnerships with these foundries for many years and as a result has access to a broad range of technology nodes that allow performance-leading mixed-signal and RF IP and custom integrated chips to be developed.<br />

III. CASE STUDIES<br />

Here, we will examine two recent developments whereby the end customer achieved cost and area savings while also realizing additional favorable results. S3semi takes a holistic view of the discovery process, which can enable the development of a chip that allows for future-proofing and lets customers take advantage of market changes, scale, and enter parallel application areas.<br />

Finally, we will examine an online calculator that S3<br />

Semiconductors has developed to assist customers in comparing<br />

the savings possible when choosing a custom integrated chip<br />

route rather than using commercial catalogue standard products.<br />

A. Case Study 1<br />

A major supplier of plant equipment to the oil and gas sector<br />

had been brainstorming internally while working on their<br />

product roadmap to try and understand options to remain<br />

competitive while maintaining control of their product costs.<br />

They approached S3semi to establish and quantify the exact benefits they could expect from going a custom integrated chip route.<br />

The key criteria that became apparent during these early<br />

discussions with the customer were the need for:<br />

• Ability to allow for portfolio tiering<br />

• Multiple sensor interfaces (pressure, temperature,<br />

diagnostics)<br />

• Integrated smart control loop<br />

• Accurate valve positioning<br />

• Multiple communications protocols<br />

• Integrated ARM processor and PIC controller<br />

• A SoC that was designed to be intrinsically safe<br />

• Low power<br />



The incumbent solution consisted mainly of discrete, commercial off-the-shelf components. The<br />

total bill of materials (BOM) cost for the solution was high, and while the end customer was not happy with the prevailing costs, they also felt it was unnecessarily over-specified for the<br />

performance required and did not allow them to implement<br />

product tiering. A feasibility study followed by a design study<br />

phase established the details of the integration options possible.<br />

Central to the discussions were the sensing needs,<br />

measurement needs, control and programmability needs,<br />

connectivity needs and the security needs for the desired<br />

solution. Within 3 months the functional specification of the<br />

custom chip was clear and a detailed project plan for<br />

implementation of the design right up to the qualification and<br />

production was presented and agreed with the customer.<br />

The final custom integrated chip was manufactured using a<br />

0.18µm TSMC foundry process. The final system-on-chip<br />

(SoC) delivered:<br />

• Analog Front End<br />

o 14-bit ultra-low power SAR ADCs<br />

o 12-bit control DACs<br />

o Power switches<br />

o Analog Multiplexers<br />

o Analog Operational Amplifiers<br />

• Multiple industrial communication interfaces<br />

o FOUNDATION Fieldbus<br />

o Highway Addressable Remote Transducer<br />

(HART)<br />

• ARM Cortex-M4 core<br />

• PIC microcontroller<br />

• FLASH & SRAM memories<br />

• Multiple peripheral interfaces<br />

o SPI<br />

o UART<br />

o I²C<br />

o Parallel<br />

Fig. 2. Visual Representation of the benefits of custom integrated chips<br />

B. Case Study 2<br />

S3 Semiconductors has for many years been designing and<br />

manufacturing custom integrated circuits (ICs) that ensure<br />

seamless integration of analog and digital subsystems within<br />

wired and wireless communication systems.<br />

Mobile satellite services (MSS), a more niche area of<br />

wireless communications, provides two-way voice and data<br />

communications to users worldwide who are mobile or in<br />

remote locations. The terminals range in size from handheld to<br />

laptop-size units and can be mounted in a vehicle, with<br />

communications maintained while the vehicle is moving.<br />

Today’s solutions that incorporate satellite terrestrial<br />

modems are generally ASSP-based. As a result, they tend to be large, either not optimized enough or over-specified. They can be noisy, with poor signal integrity, and have<br />

poor blocker performance. They are also inefficient and costly.<br />

MSS customers are no different to other communications<br />

customers. They are demanding greater asset tracking,<br />

monitoring and control. And they require all of this with<br />

increased broadband speed and no disruption in remote<br />

locations.<br />

A satellite operator that services the MSS industry had heard that integration could offer many benefits for their next-generation modem product, and that with enhanced connectivity they could introduce new functionality, which in turn could open up new service-centric revenue streams.<br />

A meeting was arranged with S3 Semiconductors and the<br />

customer to review their product features and understand their<br />

product roadmap. Taking a holistic view of this discovery<br />

process, the key criteria for the customer quickly became apparent:<br />

• Footprint that was much less than that of the current<br />

discrete solution<br />

• Integrated L-band transceiver that could support<br />

multiple modulation schemes<br />

• Choice of converter line-ups (super-het, zero or low-IF,<br />

direct-up conversion)<br />

• Low power<br />

• Economic semiconductor integration node<br />

The development schedule was 12 months and the final<br />

custom ASIC was developed on time and within budget. The<br />

final product was fabricated on a 0.18µm RF-CMOS process from TSMC and included:<br />

• Embedded algorithms for DC offset correction<br />

• RC Time constant calibration<br />

• IP2 Calibration<br />

• IQ Gain/Phase Calibration<br />

• Image Rejection Calibration for low IF, AGC and AFC<br />



• Integrated receiver blocks<br />

• Integrated transmitter blocks<br />

• BIST, Analog-test multiplexer, SPI and<br />

communications digital signal processor<br />

C. Online Calculator<br />

When people think about custom silicon their first thoughts<br />

are inevitably about the perceived high cost associated with it.<br />

However, they also tend to look at the cost of development only.<br />

An area that is often missed in comparisons is the cost of the<br />

product over its entire lifecycle. In application areas within IIoT,<br />

the lifetime of products can be 10 years or more. There is also<br />

the largely unconsidered cost associated with sourcing, storing and<br />

testing the large number of components involved when a<br />

discrete solution is employed, as well as the placement costs<br />

incurred in the board assembly operation. Consideration should also be given to reductions in assembly costs and improved reliability resulting from the significant drop in overall component count when the custom chip incorporates this functionality on a single die.<br />

Taking these elements into account, the true comparison cost<br />

should consider the number of devices expected to be<br />

manufactured every year and the expected product lifetime.<br />

Then, consideration can be given to the design aspect of the<br />

product. S3 Semiconductors has developed an easy-to-use online calculator that takes these inputs and gives a budgetary estimate of the potential savings that can be made by going a custom ASIC route.<br />

other analog components will be needed. The next question concerns the level of processing power needed: is a cost-effective 8-bit MCU all that is required, or will the design need one or more 64-bit multicore processors? Finally, the calculator asks whether wireless connectivity is needed. When all the above inputs are entered into the free S3semi online calculator (https://www.s3semi.com/bom-calculator/), the output shows a break-even volume and the total savings that can be expected. A graphical representation of this is shown in Figure 3 versus the cost associated with going with a discrete solution. This example shows that a custom chip route will give total savings of over $15 million, with a break-even volume of 73,909, on a product that ships 50,000 units per year for 6 years.<br />
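The economics behind such a calculator can be sketched in a few lines. This is an illustrative model only, not S3semi’s actual formula; the NRE figure and unit costs below are hypothetical, and a real estimate would also fold in the sourcing, inventory and test costs discussed earlier:<br />

```python
import math

def break_even_volume(nre_cost, discrete_unit_cost, asic_unit_cost):
    """Units at which cumulative ASIC cost drops below discrete cost."""
    saving_per_unit = discrete_unit_cost - asic_unit_cost
    if saving_per_unit <= 0:
        return None  # a custom chip never pays back
    return math.ceil(nre_cost / saving_per_unit)

def lifetime_saving(nre_cost, discrete_unit_cost, asic_unit_cost,
                    units_per_year, lifetime_years):
    """Total BOM saving over the product lifetime, net of NRE."""
    total_units = units_per_year * lifetime_years
    return total_units * (discrete_unit_cost - asic_unit_cost) - nre_cost

# Hypothetical example: $2.5M NRE, $60 discrete BOM vs $8 ASIC BOM,
# 50,000 units/year over a 6-year product lifetime.
volume = break_even_volume(2_500_000, 60.0, 8.0)
saving = lifetime_saving(2_500_000, 60.0, 8.0, 50_000, 6)
```

With these invented inputs the break-even lands near 48,000 units, so a product shipping 50,000 units per year would pay back its NRE within the first year of production.<br />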

IV. RESULTS<br />

Two case studies for different end markets were highlighted<br />

in the previous section. In both instances, the end customer was<br />

looking for a smaller and cheaper solution without having to<br />

compromise on performance. The results achieved and the<br />

savings made versus the previous discrete implementation are<br />

detailed below.<br />

Case study 1<br />

In case study 1, the industrial customer wanted to develop a custom integrated chip that allowed them to remain competitive while maintaining control of their product costs. The final SoC offered many advantages over the previous solution. It was shown to:<br />
• Achieve a BOM cost reduction of 90%<br />
• Substantially reduce the footprint, with a SoC in a 19mm x 19mm TFBGA package<br />
• Meet the low power budget supplied by the 4-20mA control loop<br />
• Allow portfolio tiering<br />

Fig. 3. Calculator inputs and output<br />

The calculator looks for inputs on the number of devices<br />

being shipped per year and for how many years. Then it queries<br />

how many data converters you expect will be required and what<br />

Case study 2<br />

In case study 2, the communications customer wished to<br />

achieve substantial area savings while introducing new<br />

functionality to their product. When the final solution was compared with the previous solution (see the photo comparison below), it delivered:<br />

• 80% reduction in size<br />

• Improved signal integrity and reliability<br />

• Reduced Power<br />

• Large saving in electronics BOM<br />

Much of the previous discussion has been centered on the<br />

BOM cost savings that custom silicon affords the end user. An<br />

additional saving that may not be obvious is that of area. While<br />

a user might recognize that going a custom route will save area,<br />

they may not realize the actual extent of the savings possible.<br />

Some key products that would be on board a typical analog front-end design are highlighted in Table 1 below. The<br />



typical sizes of an integrated option versus a typical discrete<br />

option are also shown in Fig. 4. As can be seen, the savings<br />

are considerable.<br />

Product Integrated Size Discrete Size<br />
RTC 0.07 mm² 15.2 mm²<br />
12-bit DAC 0.09 mm² 10 mm²<br />
14-bit ADC 0.24 mm² 14.7 mm²<br />
1.8V LDO 0.06 mm² 8.26 mm²<br />
4:1 Analog Mux 0.21 mm² 9 mm²<br />
Power Switch 0.01 mm² 4.2 mm²<br />
Table 1: Area comparison of products<br />
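Summing the two columns of Table 1 shows how large the cumulative effect is. The snippet below simply totals the published figures (die area only; package and board routing overhead are not included):<br />

```python
# (integrated mm², discrete mm²) pairs taken from Table 1
areas = {
    "RTC":            (0.07, 15.2),
    "12-bit DAC":     (0.09, 10.0),
    "14-bit ADC":     (0.24, 14.7),
    "1.8V LDO":       (0.06, 8.26),
    "4:1 Analog Mux": (0.21, 9.0),
    "Power Switch":   (0.01, 4.2),
}

integrated_total = sum(i for i, _ in areas.values())
discrete_total = sum(d for _, d in areas.values())
reduction = 1.0 - integrated_total / discrete_total
print(f"{integrated_total:.2f} mm² vs {discrete_total:.2f} mm² "
      f"({reduction:.1%} smaller)")
```

For these six functions alone, roughly 61 mm² of discrete silicon collapses to under 1 mm² of integrated area, a reduction of about 99%.<br />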

Fig. 4. Case study 2 – Before and After product size comparison.<br />

Creating a fully custom integrated chip design for your<br />

product can ensure you are getting exactly the technical<br />

functionality you need, in a package that is usually considerably<br />

smaller, cheaper and more efficient than one made up of<br />

numerous off the shelf components. Coupled with this is easier<br />

integration of the single custom chip and cheaper and faster final<br />

testing of your end product as you have less components to deal<br />

with. You can also enjoy the peace of mind knowing you will<br />

be able to produce the same device for many years to come and<br />

not risk obsolescence.<br />

V. CONCLUSION<br />

Custom integrated chips are becoming increasingly popular. A totally bespoke chip that incorporates most of the functionality you need in a single device is the ideal for designers. Taking advantage of available capacity at mature process nodes and using an older geometry removes many of the potential headaches of leading-edge nodes, since one has access to all that the mature nodes offer, such as extensive libraries of silicon-proven IP. The resultant chip is usually smaller, cheaper<br />

and more efficient. Differentiation becomes easier and your<br />

intellectual property is better protected as a single chip is much<br />

more difficult to reverse engineer than a board.<br />

Going the custom route often does require you to work with<br />

an accomplished design partner. This offers the advantage of<br />

leveraging the experience of that partner and removes the need to build this competence within your company. The ideal<br />

partner should have extensive experience in the area of custom<br />

chip development. S3 Semiconductors has more than 20 years<br />

of experience in designing analog and mixed-signal custom<br />

silicon, and can help guide you through the complete process.<br />

S3semi can be the perfect partner to help you realize large cost<br />

savings. Using the online calculator, customers get immediate feedback on the costs associated with going the custom chip route. This is backed by 20 years of custom silicon design, including the two case studies highlighted earlier, and by easy access to the most economical, highest-quality production facilities, ensuring higher performance, lower power and proven yield for low-cost custom silicon.<br />

REFERENCES<br />

[1] Frost & Sullivan, Analysis of Sensors in the Global Internet of Industrial<br />

Things Market, 2015<br />

[2] Dr. Handel Jones, Semiconductor Industry from 2015 to 2025, International Business Strategies (IBS), http://www.semi.org/en/node/57416<br />

[3] James O’Riordan, The Compelling Economics for OEMs to Commission<br />

their own Semiconductor Chips.<br />



Demand based and Software guided Selection of<br />

Microcontrollers<br />

Thomas Stolze, Klaus-Dietrich Kramer<br />

Department of Automation and Computer Science<br />

Harz University<br />

Wernigerode, Germany<br />

Abstract— Recent years of development have led to a huge diversification of the microcontroller market. Even though ARM-based microcontrollers have become widespread across most applications, more and more controller families and derivatives are becoming available for product development. On the one hand, this situation appears great for development engineers, as they can select devices from a large pool of hardware. On the other hand, comparisons between these derivatives, especially with a focus on project-based requirements, become very complex and time-consuming. They may even influence the time-to-market of the final application. Furthermore, a quantified selection is not possible with common methods.<br />

A solution to this selection problem is given in the current paper. It focuses on the selection process for microcontrollers in the very early stages of development; a more detailed view of the topic is given in [1].<br />

Choosing the best-suited microcontroller is not a trivial task, so the paper first deals with the elementary key domains that have to be investigated. These key domains are separated into calculation power, properties of peripherals, memory and core features, and external factors such as economic values and software support. Then an algorithmic selection process based on these domains is presented that takes care of user constraints and allows different microcontrollers to be assessed so that they become easily comparable. A software tool has been developed that supports the user in this process. As a result, each considered microcontroller can not only be sorted into the common groups of suited and non-suited devices, but also gets a score that makes it directly comparable to other ones in all relevant domains. With this method it is possible to quantify the suitability of each microcontroller. It is even possible to evaluate microcontrollers regarding their reserves for future developments. A new vector-based visualisation of microcontroller qualification supports the selection of the best microcontroller, too. This new approach leads to a well-founded selection and helps developers make the correct decision in a very short time. It thus also helps lower development costs.<br />

Keywords—microcontroller; selection; choice; evaluation<br />

I. INTRODUCTION<br />

In recent years the microcontroller market has developed into a billion-dollar market [2], offering a huge number of different microcontroller families with their respective derivatives, providing 8-, 16- and 32-bit controllers for every kind of use. The demand from industrial and consumer applications reinforces this trend, especially as mobile and IoT devices become more and more widespread.<br />

But what sounds like a big deal for consumers has a major<br />

drawback for developers. The development process and<br />

especially the selection of well-suited controllers become more and<br />

more complex. It is not only the number of possibly suitable<br />

devices, it is more the comparability of devices from different<br />

vendors with different properties and features. But the selection<br />

of an optimally suited device is essential for the later success of<br />

the product or application. What's more, this process is limited<br />

by time and costs. The longer the selection process takes, the<br />

later the real production can start, and the more expensive the<br />

whole development gets. And, even worse, selecting a device<br />

that may not fulfill a requirement very late in the development<br />

process may cause huge costs for changing to a better one or it<br />

may even lead to project failure. Because of this dilemma<br />

developers have to take the utmost care when selecting a<br />

certain derivative, but have to do it effectively, too. So, from a<br />

developer's point of view, what are possible helpful<br />

approaches?<br />

II. SELECTION OF A MICROCONTROLLER<br />

There are different ways to make a choice, each of which has pros and cons. The most intuitive way is a manual selection process. Here the developer will attempt to gather information about microcontrollers that seem suitable for a given task with the corresponding constraints and then, based on a comparison of the information, make a decision. Because all the research is done manually (e.g. individually deciding which devices to look at, extracting information from data sheets), this tends to be the most time-demanding way. The<br />

developer has to manually process and compare various<br />

properties. Furthermore, it is also the most limited way, since<br />

the selection will be affected by personal experiences and<br />

relations with certain manufacturers and devices.<br />



A second strategy is to use comparison tools provided by<br />

hardware manufacturers. These often come in the form of websites<br />

that allow developers to put their requirements into forms that<br />

automatically limit the displayed results to those which fulfill<br />

all of them. All other derivatives are discarded. While this<br />

procedure helps developers to save time, it also limits the<br />

results to those of the respective manufacturer. That means a<br />

comprehensive comparison may only be possible if other<br />

manufacturers are included manually by the developer, or by<br />

using web-tools that are provided by controller distributors<br />

with a portfolio supporting different manufacturers.<br />

However, both selection strategies have other drawbacks<br />

which shall be discussed. First, the processing power is in most<br />

cases not part of the selection process, especially when using<br />

web tools. Therefore developers either have to test<br />

recommended devices on their own, or they have to rely on<br />

benchmark results published by manufacturers or third parties<br />

which may not be applicable to current application demands.<br />

Unfortunately, even today results of the Dhrystone benchmark are often published to compare microcontroller systems, but this type of benchmark is outdated and was developed for demands other than those of today's microcontrollers<br />

[3]. Possible solutions are the EEMBC benchmarks, for<br />

instance the Autobench benchmark [4]. It provides detailed<br />

performance information for many use cases, but it has to be<br />

either run by developers themselves or by EEMBC as part of a<br />

commissioned work. As processing power is an important<br />

selection constraint, it is necessary to have reliable results, but<br />

this may lead to additional time and cost efforts. It may even be<br />

impossible to realize pre-development tests because of the<br />

mentioned efforts, or even because appropriate test software is<br />

not written yet. To summarize, a combination of an automatic<br />

selection including reliable performance data for multiple<br />

hardware manufacturers would be a convenient solution. That<br />

would also minimize efforts of the developers.<br />

Second, the comparison which is made only divides all<br />

considered devices into two groups: those which are suited for<br />

the intended use, and those which are not. There is no<br />

quantifying examination which may lead to a more<br />

sophisticated selection process showing an ordered lineup of<br />

suited devices. The selection process would become more<br />

transparent and flexible if there were dedicated qualifications<br />

for each derivative. An example may be a microcontroller that<br />

is disregarded in the old-fashioned scheme because of a minor<br />

lack of memory, but which would outperform other ones in all<br />

other aspects. Here a quantifying approach would allow a more detailed view of the situation and thus help to find a workaround, e.g. some compiler optimization with regard to code size, that makes this controller the number one to choose.<br />

Third, selecting a microcontroller is not only limited to<br />

processing power or peripheral demands. Furthermore, it is<br />

also depending on factors like costs per device, availability and<br />

software support. Therefore, these factors have to be included<br />

in the selection process, too. Some of these criteria are included<br />

in manual and web-based searches, e.g. the costs per device.<br />

Other constraints like the support by different development<br />

environments still have to be examined separately.<br />

In conclusion, there are certain useful methods that partly support the selection process, but additional features are necessary to assess all aspects of the selection. In order to comprehensively support the selection<br />

process a new assessment system has been developed that<br />

integrates all mentioned aspects to make a well-founded<br />

selection without time-demanding and costly research. This<br />

assessment system shall be described subsequently.<br />

III. VECTOR-BASED BENCHMARK OF EMBEDDED CONTROLLERS<br />

The main purpose of the development of the vector-based<br />

benchmark of embedded controllers (abbreviated "VBEC") is<br />

to support the selection process by providing a software tool<br />

that automatically calculates the qualification of<br />

microcontrollers with respect to given requirements. The<br />

qualification is displayed in the form of dedicated values for<br />

different domains. Therefore, the following three assessment<br />

domains were defined:<br />

• processing power<br />

• peripherals, memory, features<br />

• external factors<br />

In order to work, VBEC communicates with an underlying<br />

database which contains all relevant information for different<br />

microcontrollers. The data maintenance is realized by a special<br />

frontend to collect data and manage datasets. This is done by<br />

the administrator. Developers use the VBEC frontend to put in<br />

the project's requirements and get the assessment results. The<br />

following figure illustrates the software architecture.<br />

Fig. 1: Software Architecture of VBEC<br />

Each derivative is saved together with benchmark results from tests that were previously run on the device. The derivative is also linked to general information<br />

about its peripherals, memory system and special processor<br />

features like floating point units (FPU). The dataset for each<br />

device is being completed by information about unit costs at<br />

different distributors, availability information in terms of lead<br />

times at different distributors and software support in form of<br />

integrated development environments supporting the device.<br />

Developers using the software system key in their project-specific requirements via the VBEC frontend; for this purpose, a form-based software has been developed. As a first step, each<br />



element used for the detailed comparison has to be weighted<br />

according to project specific needs. That ensures a calculation<br />

where every single aspect matches the current project's<br />

priorities and is not only generalized information. It is also<br />

possible to declare requirements as essential, so that these have<br />

to be fulfilled in any case. Afterwards, the project's<br />

requirement values are collected, ordered by the three domains.<br />

In the domain of calculation power this is a score value<br />

which specifies the overall performance needed. The score is<br />

automatically calculated using 12 different benchmark<br />

modules. Developers can decide whether all of them are<br />

weighted equally or in a more detailed fashion, e.g. weight<br />

certain benchmark modules higher than others to match the<br />

intended software design. The underlying database contains<br />

result values of the 12 different benchmark modules for each<br />

microcontroller, ranging from simple math operations over<br />

library functions to FFTs or cosine-transforms. These<br />

benchmarks were pre-run so that developers neither have to do<br />

the test work on their own, nor have to rely on numbers from<br />

the web which may not be reliable enough or do not reflect the<br />

project adequately.<br />

In the second domain the demands regarding peripherals,<br />

memory or special processor features have to be defined. This<br />

is done either by entering numeric values for the required elements (e.g. the number of available AD-channels) or by checking required features to search for, for instance an FPU.<br />

The third domain is processed in a similar fashion, selecting<br />

values for external factors like maximum allowed costs per<br />

unit. The datasets provide detailed information to support the<br />

search, so that it is possible to keep costs up-to-date with<br />

regard to certain distributors and buying quantities.<br />

Additionally, this domain handles lead time and software<br />

support by a certain number of IDEs. Especially the<br />

information about usable IDEs may contribute to the overall<br />

project costs because of licenses that have to be acquired.<br />

Showing possible alternatives may help to decide for or against<br />

an IDE.<br />

Developers may now directly compare their inputs with<br />

two real microcontrollers selectable from a list of all<br />

derivatives available in the database. An easy comparison is<br />

given since all relevant values are lined up next to each other.<br />

Using a coloring scheme it is possible to mark underachievements<br />

orange and violations of the essential<br />

requirements red, helping to find any problems very quickly.<br />

Moreover, VBEC is now able to calculate triple values of<br />

the qualification for each microcontroller in the database,<br />

where each value represents the qualification in one of the<br />

three domains. In order to determine those triple values, a<br />

single qualification for each property has to be calculated first.<br />

This is done by using the following equation:<br />
<br />
q_i = a_i / r_i        (1)<br />
<br />
where a_i denotes the available and r_i the required quantity of property i. The calculation can be done in a non-saturated or a saturated way, where in the latter the ratio between available and required quantity is limited to 1 or 100 %. The following formula is then used to summarize the single qualifications q_i into a domain qualification Q:<br />
<br />
Q = ( \sum_{i=1}^{n} w_i \cdot q_i ) / ( \sum_{i=1}^{n} w_i )        (2)<br />
<br />
Equation (2) shows the calculation of the domain qualification Q by summarizing the single qualifications q_i and the according weights w_i using the weighted arithmetic mean. As a result, VBEC calculates three Q values between 0 and 1 (or between 0 % and 100 % with regard to percent) when all q_i are saturated. If they are not, the resulting Q may be greater than 100 %, indicating that there are over-achievements which may be used for further development. The three Q values provide information about how well a device matches the requirements in the three described domains. They form the basis of the selection decision.<br />
<br />
The triple values can also be interpreted as a vector drawn within a 3D qualification cube to simplify the comparison of the results (see fig. 2). The axes of the cube represent the three domains. Devices that are fully suitable and fulfill all requirements will have a qualification triple of Q = {1, 1, 1} and show up as a vector from (0, 0, 0) to (1, 1, 1) in the 3D cube. That is why point (1, 1, 1) can be seen as an optimum, whereas all deviations of result vectors from this point are under-achievements, as long as the values are calculated with the saturated method. If they are not saturated, then it is possible to get values greater than 1, indicating that there is an over-achievement or some future reserves.<br />
<br />
Fig. 2: 3D qualification cube for two compared microcontrollers A and B (example assessment)<br />



Fig. 2 shows a graphical comparison between two different microcontrollers assessed with the same requirements. It can easily be seen that controller B matches all requirements and therefore its qualification vector ends at (1, 1, 1). On the contrary, microcontroller A does not fulfill all requirements. In fact, it falls short in all three domains and has disadvantages on all three axes, even though only a slight one on the axis "external factors". By rotating the view this situation becomes clearer, as shown in fig. 3.<br />

collect performance data. As of now, the database contains 25 microcontrollers from major manufacturers as representatives of different microcontroller families. Their specification data<br />

was also saved in the database. Based upon that information<br />

VBEC has been tested with several example projects including<br />

the development of E-Bike control modules, surveillance of<br />

chemical processes, robot vehicles and consumer goods like<br />

home meteorological stations. It turned out that VBEC<br />

achieves an easy and fast comparison and provides useful<br />

information for selecting the best suited microcontroller.<br />

However, it is noticeable that the database is yet too small and<br />

needs to be extended. The more microcontrollers there are, the<br />

more meaningful VBEC gets. It also turns out that some under-achieving devices may still be usable because their missing elements can be substituted. This is often the case when the affected device is cheaper than others, which may allow implementing<br />

workarounds using some extra hardware.<br />

Fig. 3: 3D qualification cube, rotated (example assessment)<br />

VBEC can also calculate an overall qualification value from the triple values, allowing microcontrollers to be compared by that single value. Developers<br />

may decide whether to do that using the arithmetic mean, the<br />

geometric mean or the length of the assessment vector in the<br />

3D qualification cube. The geometric mean represents a more<br />

pessimistic view because it emphasizes under-achievements,<br />

whereas the arithmetic mean represents a more balanced view<br />

and the length of the assessment vector provides a more<br />

optimistic view emphasizing the over-achievements.<br />

By selecting one of these methods the overall qualification value can be fitted to the project's peculiarities.<br />

Furthermore, VBEC is able to show a detailed comparison<br />

table including main information like the triple values, the<br />

overall qualification values and other relevant properties like<br />

current device costs. That table can be sorted in different ways,<br />

so it is possible to quickly determine which microcontroller<br />

scores the highest triple values or overall qualification.<br />

Detailed module based benchmark results are provided by a<br />

performance diagram. By using these techniques, developers<br />

only need to pick the best-suited derivatives from the table and<br />

the diagram.<br />

IV. CURRENT TESTS<br />

Before VBEC could be tested, many microcontrollers with<br />

their according development kits had to be benchmarked to<br />

V. CONCLUSION AND FUTURE DEVELOPMENT<br />

Summarizing the presented software-guided selection of<br />

microcontrollers VBEC is able to effectively support<br />

developers making the correct choice. It integrates all relevant<br />

aspects of the selection process and provides data in order to<br />

compare microcontrollers directly with the project's<br />

requirements. Moreover, it automatically assesses all<br />

derivatives that are saved in the database and thus provides an<br />

easy selection. With the help of the 3D rendering and tables it<br />

is very easy to compare different controllers.<br />

In order to work properly and provide adequate data VBEC<br />

has to be updated with new devices as they appear on the<br />

market. That includes not only putting the specifications into<br />

the database, but also to run benchmark tests on every single<br />

device. The effort for doing so is high - but developers benefit<br />

from these tests, and the tests only have to be run once.<br />

Nevertheless, an appropriate way to keep VBEC up to date seems to be to select certain representative derivatives from each microcontroller family as a reasonable middle course.<br />

REFERENCES<br />

[1] T. Stolze, Application Criteria of Single Chip Microcontrollers/<br />

Embedded Controllers, Ilmenau, Isle Verlag, 2018, unpublished. (ISBN<br />

978-3-938843-90-1)<br />

[2] IC Insights, Inc., “MCU Market Forecast to Reach Record High<br />

Revenues Through 2020: After recent years of price erosion, MCU<br />

ASPs are forecast to rise and help lift sales to new highs,” Scottsdale<br />

AZ: IC Insights Inc., Research Bulletin, August 2016.<br />

[3] A. Weiss, “Dhrystone Benchmark : History, Analysis, 'Scores' and<br />

Recommendations,” El Dorado Hills, CA: EEMBC, 2002, URL:<br />

http://www.eembc.org/techlit/datasheets/dhrystone_wp.pdf, last access<br />

2018-01-18.<br />

[4] EEMBC, “AUTOBENCH™: An EEMBC Benchmark,” El Dorado Hills, CA: EEMBC, 2015, URL:<br />

https://www.eembc.org/benchmark/automotive_sl.php, last access 2018-<br />

01-18.<br />



Tips and Tricks for Debugging<br />

Greg Davis, Director of Engineering, Compiler Development<br />

Green Hills Software, Inc.<br />

Copyright 2011‐2018 by Greg Davis<br />

When people think about what a software engineer does for a living, they say he or she writes<br />

code. While this is certainly true, it is hardly an accurate description of what the job is like since<br />

writing code is unlike just about every other form of writing. A journalist might write several<br />

articles in a given month on a variety of topics in his domain. At a high level, the process of<br />

authoring a particular article might be like the process of writing a piece of code. You might<br />

start with an outline view of the whole, then work on the key points until you arrive at a rough<br />

draft. Then you might start fixing flaws and improving on various aspects until you arrive at<br />

something that is usable. This is probably where the similarity ends. A journalist may follow<br />

such a process to write his article, then he may take a step back, do some more research, and then<br />

start over again on a new article. The next article may stand independently of his previous<br />

works. Or, as is often the case, it may build upon previous articles. Yet once an article is<br />

published, it is on the record as published, and revisions are usually limited to corrections to<br />

factual errors.<br />

This alone sets the world of an engineer far apart from the world of a journalist. Except when<br />

changing jobs, very rarely is an engineer ever really done with his code. Typically, an<br />

engineer works on an accumulating body of code; from version to version, new capabilities are<br />

added, but much of the old code remains. The old code needs to constantly be revisited to make<br />

it work with new aspects of the evolving system.<br />

An even more overlooked aspect of being a software engineer is the act of debugging. Studies<br />

show that an engineer spends roughly 80% of his time debugging, so it is surprising how little<br />

attention is paid to this time consuming activity. Debugging is unlike how we write code. When<br />

we write code, we start with a conceptual model of how we believe our code works, and then we<br />

augment the code to match the model of how we believe the code should work. The fact that our<br />

conceptual model of a program’s behavior is flawed is what leads to needing to debug in the first<br />

place. On the other hand, debugging is centered on the reality of how your program is<br />



actually behaving. Debugging is the process of discovering how things really work so that we<br />

can set them right.<br />

We have seen books upon books about how to write code. Books on topics such as software<br />

design, software development methodologies, teamwork, coding styles, and the like are<br />

commonplace. But relatively little has been written about debugging. Similarly, universities are<br />

full of classes teaching about operating systems, algorithms, programming languages, theory,<br />

graphics, but little is actually taught about debugging. Students are forced to develop their own<br />

approaches to debugging based on the tools at hand.<br />

This paper focuses on techniques that are effective in debugging a wide range of embedded<br />

systems.<br />

Connecting to Your Target<br />

The first step in debugging a system is to get everything running under a debugger. While there<br />

are certain circumstances where a debugger is not appropriate, the vast majority of debugging<br />

can and should be done using a debugger. The main alternative, printf-debugging, will be<br />

discussed later.<br />

In order for your debugger to function, it needs to be able to connect to your embedded system.<br />

There are three main varieties of debug connections.<br />

JTAG Probe – The probe is a device that connects to your processor through special<br />

debugging channels, which are typically exposed by extra pins on the processor.<br />

Although there are many kinds of debug channels, JTAG is the most well known, and it<br />

is sometimes used as a generic term to refer to any such debug channel. Using this debug<br />

channel, the probe can inspect registers and memory, and it can start and stop the<br />

processor.<br />

Debug Agent – A debug agent is a process running under an operating system that is used<br />

to control other threads and processes on the system. The debug agent uses the operating<br />

system to inspect memory, set breakpoints, and to start and stop processes.<br />



Monitor – A monitor is like a debug agent, except that it runs directly on the hardware<br />

rather than on top of an operating system. It uses a timer to periodically halt the system<br />

so that it can communicate with the debugger.<br />

Typically a JTAG Probe or Monitor is used if you are running without an operating system or<br />

with a microkernel. When running under a full fledged operating system or RTOS, a Debug<br />

Agent is typically used, although a JTAG probe is still useful for debugging problems in the<br />

hardware or kernel.<br />

Basic Debugging<br />

Once your system is connected to your debugger, you can start debugging. To illustrate a basic<br />

debugging session, let’s pretend that your embedded system is a traffic light monitor. It is<br />

connected to traffic lights and traffic sensors, and it periodically sends signals over a serial port<br />

to a nearby computer. Your system works fine when it first starts up, but after a while it begins<br />

to send corrupted packets over the serial port.<br />

What you might do to start debugging is to set a breakpoint at the communication routine that<br />

sends packets over the serial port. You restart the system, and every time a packet is sent, you hit<br />

the breakpoint. This means that the system has stopped running, and it is in a mode where you<br />

can inspect the state of the program. Perhaps the function looks something like:<br />

error_t send_message(char *message, size_t message_len)<br />
{<br />
    ...<br />
}<br />

A logical next step is to view the message parameter and to look at it to see if it is corrupted. If<br />

it is corrupted, you know that the problem is that the caller of send_message passed along a<br />

corrupt message. You should now start looking at the caller to see how that happened. If, on the<br />

other hand, the message parameter looks correct, you might set the system running again.<br />

Now that the system is running again, the message should be sent over the serial port. Now look<br />



at the computer on the other end of the serial port and see what the message looks like. If it<br />

looks OK, then you have not yet experienced the problem. You can let the system keep running<br />

and repeat the process the next time the breakpoint at send_message is hit. However, if the<br />

message looks corrupt, then most likely there is some sort of problem in the sending of the<br />

message. At that point, you can review how a UART works and start stepping forward in<br />

send_message() to see when things start to go wrong.<br />

This sort of investigative approach is the bread and butter of debugging. If you are new to<br />

debugging with a debugger, you can stop reading here and start trying it out on your own. You’ll<br />

find that using a debugger is light years ahead of what you were doing before.<br />

Call Stacks<br />

Another very useful feature of debuggers is the call stack. The call stack shows the functions<br />

that are currently active. For example, if you had found that the message parameter passed<br />

into send_message() was corrupted, you would want to look to see how it had happened.<br />

The call stack is the first thing to check. The call stack might show:<br />

main() -> event_loop() -> status_check() -> periodic_update() -> send_message()<br />

At this point, you might ask the debugger to climb up from the current function,<br />

send_message() to the periodic_update() function to see what is going on.<br />



At this point, you can see the context where periodic_update() called<br />

send_message(). There are a few possibilities:<br />

1. status_message could have come out corrupted from get_status_message()<br />

2. The status could have been corrupted at some point after get_status_message()<br />

3. The accessor functions GET_MESSAGE() and GET_LEN() might be to blame.<br />

It’s hardly conclusive at this point, but at least we know something already. Without a call<br />

stack, you’d probably end up setting breakpoints earlier in the program to try to debug before<br />

the call to send_message().<br />
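As a concrete illustration, the calling context might look like the following sketch. This is a hypothetical reconstruction: only the names send_message(), get_status_message(), GET_MESSAGE() and GET_LEN() come from the discussion above; the status type, the buffer size and the stub bodies are assumptions.

```c
#include <stddef.h>
#include <string.h>

typedef int error_t;

/* Hypothetical status record held between the producer and the sender. */
typedef struct {
    char   data[64];
    size_t len;
} status_t;

/* Suspect 3: the accessors themselves could be wrong. */
#define GET_MESSAGE(s) ((s).data)
#define GET_LEN(s)     ((s).len)

static status_t status_message;

/* Suspect 1: the message could already come out of here corrupted. */
static void get_status_message(void)
{
    strcpy(status_message.data, "LIGHTS OK");
    status_message.len = strlen(status_message.data);
}

/* Stub so the sketch is self-contained; the real routine writes the
 * packet out over the serial port. */
error_t send_message(char *message, size_t message_len)
{
    (void)message;
    (void)message_len;
    return 0;
}

error_t periodic_update(void)
{
    get_status_message();
    /* Suspect 2: status_message could be corrupted between the call
     * above and the call below. */
    return send_message(GET_MESSAGE(status_message),
                        GET_LEN(status_message));
}
```

Each of the three possibilities maps onto a different place to set the next breakpoint, which is exactly what the call stack buys you.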

Hardware Breakpoints<br />

Software breakpoints are used to stop the execution of the program when a given line is reached.<br />

The way they work is that the debugger replaces the first instruction at that source line with a<br />

trap instruction. When the trap instruction is executed, the program stops and the debugger gains<br />

control. The implementation details of software breakpoints vary from system to system.<br />



Unlike software breakpoints, hardware breakpoints come in two varieties, hardware execution<br />

breakpoints, and hardware data breakpoints. Hardware execution breakpoints are like the<br />

standard breakpoints in that they stop the program when a given line of code is executed.<br />

Typically, hardware execution breakpoints are unnecessary as long as the program is running out<br />

of RAM. However, when the program is running out of ROM, hardware execution breakpoints<br />

are probably the only execution breakpoint that will be available. Hardware breakpoints are<br />

implemented in the CPU itself, so the capabilities vary from system to system.<br />

Hardware data breakpoints are unlike software breakpoints in that they watch over a designated piece of<br />

memory. The capabilities of hardware breakpoints vary from target to target, but generally they<br />

can be set to halt on a read, a write, or either a read or write, to the designated memory.<br />

A common use case is when you find that a given piece of memory is corrupted: you set a write-only hardware breakpoint on the memory so that you can see every time the memory is being<br />

modified. In my experience, most data is set once or twice and then left alone until it goes out<br />

of scope or is freed. So a hardware breakpoint will quickly show you the culprit. The biggest<br />

problem in practice is identifying where to set the hardware breakpoint. You might know that<br />

your system is crashing because a packet is being corrupted, but unless the packet always ends<br />

up at the same location in memory, you’ll need some way to identify which packet is being<br />

corrupted. You’ll need to stop the system and set the hardware breakpoint once this packet is<br />

created.<br />

Hardware breakpoints can also be implemented under an operating system by unmapping the<br />

page that the data resides at. When the data is accessed, there will be a page fault, and the<br />

operating system can take over. However, since pages generally contain many variables, an<br />

operating system implementation slows down the system whenever unrelated variables that<br />

reside on the same page are accessed.<br />

Hardware data breakpoints are also redundant if your system is running under a simulator<br />

or is interpreted. The simulator or interpreter should have such a capability already. But for the<br />

rest of us, hardware data breakpoints are invaluable.<br />



Printf Debugging<br />

Printf debugging is the practice of instrumenting your code with print statements as a means of<br />

debugging. Typically, you add printf statements to the code, run it, and see what was going on<br />

before the bug occurred. Usually the information is incomplete, so you typically iterate by<br />

adding more printf statements to the code in places where the details are sketchy. At some point,<br />

you end up with enough print statements in the code that you are able to determine what is going<br />

wrong.<br />

One deficiency with printf debugging is that it requires multiple compilations as you keep adding<br />

more printf’s to the code. Printf debugging can also make bugs disappear since the printf’s add a<br />

fair bit of overhead to the code, change the timing all over, cause the optimizer to unnecessarily<br />

make pessimistic assumptions, and perturb the register allocation in the routines calling printf().<br />

Printf debugging is all too often used in cases where there is no debugger available, or when the<br />

user doesn’t know how to use it. This is generally a bad reason to resort to printf debugging.<br />

There are a few legitimate cases where a debugger cannot be used to debug a system, but these<br />

are few and far between.<br />

A legitimate reason to use printf debugging is when there is a lot of data to sort through to<br />

narrow down where the problem lies. If you can direct the output onto the host machine's disk<br />

or into a large memory buffer, you can sift through it all. Often times printf debugging is used to<br />

narrow down the problem to a particular software component at which time a debugger can be<br />

brought up on the component to track down the problem further.<br />
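The memory-buffer variant of this idea can be sketched as a small trace facility that formats into a ring buffer instead of a slow console, so the logging perturbs timing far less. The buffer size and the function name are assumptions made for this sketch; the buffer is then dumped from the debugger or a post-mortem tool.

```c
#include <stdarg.h>
#include <stdio.h>

#define TRACE_SIZE 4096

/* The debugger (or a post-mortem dump) reads these symbols directly. */
static char     trace_buf[TRACE_SIZE];
static unsigned trace_head;

/* printf-style logging into the ring buffer; old entries are
 * overwritten once the buffer wraps around. */
static void trace_printf(const char *fmt, ...)
{
    char line[128];
    va_list ap;

    va_start(ap, fmt);
    int n = vsnprintf(line, sizeof line, fmt, ap);
    va_end(ap);
    if (n < 0)
        return;                     /* formatting error: drop the entry */
    if ((size_t)n >= sizeof line)
        n = sizeof line - 1;        /* entry was truncated by vsnprintf */

    for (int i = 0; i < n; i++) {
        trace_buf[trace_head] = line[i];
        trace_head = (trace_head + 1) % TRACE_SIZE;
    }
}
```

A usage such as trace_printf("packet %d\n", seq) costs one vsnprintf and a short copy, with no I/O on the target, which makes the "bug disappears when I add printf" effect much less likely.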

Debugging for Comprehension<br />

As I stated earlier, one of the big reasons that we end up debugging in the first place is because<br />

our conceptual model of how a system works is flawed. During the process of software<br />

development, debugging can be used to increase your understanding of the code.<br />



“Why Isn’t this Failing Every Time?”<br />

Sometimes we look at an incomprehensible piece of code and it looks so dumb, so broken, and<br />

so wrong that we wonder how the code has survived for so long. We wonder what the author<br />

could have possibly been thinking. Was he pulling an all nighter? This is totally broken!<br />

Sometimes, we are the author, and we still have no clue.<br />

Sometimes we even vaguely remember writing the code, and this still doesn’t help.<br />

Chances are, the code isn’t totally broken. There’s probably some condition that makes the code<br />

work, at least most of the time. The only way to understand this is to see what’s going on when<br />

the code is executed. Set a breakpoint on the code, start looking around at the relevant<br />

variables, and prepare to be amazed.<br />

“Is this Code Ever Reached?”<br />

Sometimes something looks so ancient or quaint that you have a hard time believing it matters<br />

anymore. Maybe the code is commented using acronyms that you haven’t heard since you were<br />

in high school.<br />

A debugger can be useful to browse code between different modules. It can bring you to call<br />

points in different modules far faster and more accurately than the corresponding grep command.<br />

Often times, the code isn’t a function, but just a part of a function that is only reached under can<br />

be met.<br />

A more common technique is to replace the code in question:<br />
<br />
...<br />
if (guard_var == 3 && ptr == NULL) {<br />
    ... // you have a hard time believing this code works<br />
<br />
with:<br />
<br />
...<br />
if (guard_var == 3 && ptr == NULL) {<br />
    send_message("Dubious code", 13);<br />
    // Loop until you bring up a debugger to debug further<br />
    while (one) ; // one is a volatile variable with value 1<br />
    ... // you have a hard time believing this code works<br />

Then test your system like you normally would. Oftentimes, you’ll see the “Dubious code” message, and the program hangs due to the while (one) loop. You can attach to the program and see what’s happening. From the call stack and from inspecting different variables, you can probably figure out everything that you needed to know. Occasionally, you’ll find that the code is never reached. That means one of two things: either the code can be deleted, or you need to improve your testing system.<br />

“I hope this works…”<br />

Once you’ve identified a problem, you modify the code to fix it. If you’re like most people, you often find yourself hoping that your code does the trick. You run your system again, and if it works properly, you consider your job done.<br />

A better technique is to first watch your new code run under the debugger. Oftentimes, your new code is still wrong, and a quick look at it under the debugger will set you back on the right track more quickly than waiting for your system to fail again, and the feedback will be immediate. Another advantage is that even if your code fixes the problem, it might not fix it for the reason you think. When you see the code run, you might realize what you were doing wrong.<br />



Advanced Debugging Techniques<br />

Data Visualization<br />

Some data structures are difficult to visualize quickly in order to understand their contents. For example, when viewing a C++ STL string, it typically requires descending a level into the data structure to find the string. STL maps are even more complicated. Consider an STL map that maps from a C++ string into an STL list of integers. The fully resolved name for this data type is:<br />

std::map<std::string, std::list<int> ><br />

Actually viewing the data structure requires descending a data structure which is typically<br />

implemented as a tree. While there are efficiency reasons for the data structure being<br />

implemented as a tree, it isn’t convenient to have to view it as such if you’re just trying to figure<br />

out what strings the map contains and what they map to.<br />

Many debuggers offer the capability of viewing STL data structures simply. For example, an<br />

instance of the sorted map above that contains the words “better debugging is good” with each<br />

word having the index 1, 2, 3, and 4, respectively, might look like:<br />

The importance of this goes beyond debugging standard C++ data structures. A real world<br />

embedded application contains numerous important data structures. These data structures are<br />

probably performance optimized, but they need to be debugged by real people.<br />



If your debugger has data visualization capabilities, they are probably extensible. It is well worth your time to write extensions so that you can view your own data structures. It might take<br />

several hours to a couple of days to write the extension, but this will quickly pay off if these data<br />

structures are often used. If your debugger does not have these visualization capabilities built in,<br />

the next best thing is to provide a textual representation of the data. You could easily iterate<br />

through the above data structure to print out something like:<br />

{{"better", {1}},<br />
{"debugging", {2}},<br />
{"good", {4}},<br />
{"is", {3}}}<br />

This textual representation can be viewed from the debugger or dumped out to a file on your host<br />

computer.<br />

Conclusion<br />

We have examined a number of ways to debug a system. These techniques can greatly speed up<br />

the process of debugging when compared to other approaches.<br />



Jump Starting Code Development to Minimize Bugs<br />

Jacob W. Beningo<br />

Beningo Embedded Group<br />

Linden, Michigan USA<br />

jacob@beningo.com<br />

Abstract—Debugging an embedded system is one of the greatest challenges that face developers during the development cycle. Using a mix of traditional and modern development techniques such as assertions, RTOS Aware debugging, streaming trace and code reviews can greatly decrease the time spent debugging and keep debugging from becoming a major development challenge.<br />

I. INTRODUCTION<br />

Debugging an embedded system is one of the most time-consuming and expensive activities that embedded software<br />

developers engage in. Survey results [1] show that the average<br />

team can spend as much as 20% of a development cycle<br />

debugging their software. Despite these survey results, during<br />

2017 I polled developers at three Embedded Systems<br />

Conferences in Boston, Minneapolis and San Jose in addition to<br />

the Arm Technical Conference and found that from the several<br />

hundred engineers I encountered, these developers spent on<br />

average 40% of their development cycle debugging! Combining<br />

these two results, in a yearlong project, these developers are<br />

spending anywhere from 2.4 to 4.8 months debugging their<br />

software!<br />

Developers can easily prevent, detect and eliminate bugs to<br />

dramatically decrease the time they spend debugging their<br />

embedded system. In this paper, we are going to examine several<br />

techniques ranging from traditional techniques such as<br />

assertions and code reviews to modern techniques like real-time<br />

tracing that can be used to quickly detect bugs. In the last section of this paper, we develop a robust process that readers can follow and implement to decrease the time they spend debugging and spend more time innovating.<br />

II. WHERE DO BUGS COME FROM?<br />

I don’t think there are very many developers who start out<br />

wanting to put bugs into their software. Sure, many of us enjoy<br />

a debugging challenge but given the pressure that we are under<br />

to deliver product, the preference is to just skip the bugs and get<br />

the job done right the first time. The problem is that it is truly<br />

impossible to develop a bug-free system, and anyone who tells you otherwise is just kidding themselves and trying to fool you.<br />

We can certainly do everything in our power to mitigate and<br />

minimize the bugs that are present, but they are still undoubtedly<br />

there. Edsger W. Dijkstra once stated:<br />

“Program testing can be used to show the presence of bugs,<br />

but never to show their absence!”<br />

We can never truly declare that we develop bug-free code<br />

because no matter how strictly we test the system, we can only<br />

show that under certain circumstances the system behaves as<br />

expected.<br />

We could discuss how bugs are introduced into a system at<br />

great length, but since we want to focus on bug prevention and<br />

detection techniques, let’s just examine a few possibilities. The<br />

most common causes (in my opinion) for bugs in an embedded<br />

system are:<br />

• Faulty requirements<br />

• Poorly architected software<br />

• Complex implementation<br />

• Not using industry best practices<br />

• Rushed development cycle<br />

These are just a few examples, but there are dozens more that we could undoubtedly list. The trick is to develop a process for preventing and immediately detecting bugs in a system. Before developing a process or procedure, it is useful to first evaluate your debugging skills.<br />

III. HOW SOPHISTICATED ARE DEBUGGING TECHNIQUES?<br />

Back when I first started developing embedded software<br />

twenty years ago, it always felt like we would cross our fingers,<br />

press the debug button and hope for the best. If things appeared<br />

to work, we would heave a sigh of relief and nervously announce<br />

the system was working. If something went wrong, we now had<br />

to guess and infer what on earth was going on in that little silicon<br />

core that we could only glimpse at through breakpoints, watches<br />

and maybe printf, if it didn’t interfere with the system’s real-time<br />

performance. In general, these techniques are inefficient, error<br />

prone and require a lot of guess work.<br />

www.embedded-world.eu<br />



The modern developer has far more than these simple,<br />

traditional techniques available to them. Newer techniques<br />

include but are not limited to:<br />

• Statistical profiling by sampling the program counter<br />

(PC)<br />

• Data profiling<br />

• Task and data tracing<br />

• Instruction tracing<br />

In order to determine how sophisticated a debugger you are, review the techniques that are listed in Figure 1. Rank yourself on a scale from 0 to 10, where 10 indicates technique mastery and 0 indicates that either the technique is never used or that you know little to nothing about it. Sum up the total for each debugging technique and see where you currently rank in Figure 2.<br />

Fig. 1. This diagram highlights common debugging techniques, starting with simple breakpoints and moving to more modern and complex techniques in a clockwise direction. (See the text for how to rank yourself.)<br />

Debugging Technique Evaluation<br />
Rank | Status<br />
0 – 40 | Stumbling in the dark ages<br />
40 – 60 | Crawling out of the abyss<br />
60 – 80 | Expert bug squasher<br />

Fig. 2. Rankings for how sophisticated your use of debugging techniques is.<br />

As you can see from the table, in order to be truly efficient at debugging an embedded system, you need to be an expert in at least six of these debugging techniques.<br />

Let’s now examine several different techniques that developers can use to prevent and find software bugs, before exploring a simple process developers can follow to rid themselves of bugs.<br />

IV. REVISION CONTROL SYSTEMS<br />

Using a revision control system doesn’t really help prevent bugs, but it can be very useful in finding them. More often than not, a bug will surface or be discovered in code, which then leads the developer to track down when the bug was introduced and what might have changed in the code to create it. Finding the bug can be very difficult if a revision control system is not used.<br />

A revision control system, when used properly, will allow a developer to revert their code and create difference reports that can be critical to discovering where the bug may have crept into the system. For this reason and many others, every development team, whether a hundred engineers or just one, should use a revision control system.<br />

V. CODING STYLE GUIDES AND STANDARDS<br />

Two useful techniques that developers can use to help minimize the opportunity for bugs to get into their software are to use a good coding style guide and industry coding standards.<br />

A. Coding Style Guides<br />

A coding style guide is nothing more than a guide that specifies how the software will be organized and how it should look. The style guide specifies things such as:<br />

• Naming conventions<br />
• White space and tab spacing<br />
• Documentation blocks<br />
• Where brackets go (new line or inline)<br />
• Etc.<br />

The reason that a coding style guide should be used is that it<br />

helps provide a uniform look and feel to the software even if<br />

multiple developers with different preferences are working on<br />

the code base.<br />

A uniform look and feel can remove distractions during a code<br />

review from minor nuance differences in the coding style and<br />

allow a developer to focus in on the code and finding potential<br />

implementation flaws.<br />

Every developer has their own preferences, so it’s a good idea to create a style guide for your own team’s work. Jack Ganssle has put together a useful template that can be leveraged and modified for any particular team’s purpose [2].<br />

In addition to using a style guide, it can also be helpful to put<br />

together template header and source files that match your style<br />

guide. My preference is to comment my code so that<br />

documentation can be generated by Doxygen. For this reason,<br />

I’ve created several Doxygen templates [3] that already meet my<br />

style guide and can be copied and pasted into any new module<br />

that is being created.<br />

B. Coding Standards<br />

A coding standard is a set of industry best practices that a<br />

developer can follow that removes the opportunity for error and<br />

confusion. The two best and perhaps well-known examples are<br />

MISRA-C and CERT-C which are internally renown and proven<br />

coding standards.<br />



MISRA-C provides developers with a set of mandatory and<br />

advisory rules for which C constructs are safe to use in safety<br />

critical applications. These recommendations eliminate potential<br />

issues that are associated with the C Standard and provide<br />

developers with a C subset that they can use in their application.<br />

CERT-C provides developers with a secure coding standard<br />

that is designed to help developers create secure software.<br />

Following this standard obviously also helps improve software robustness: it not only minimizes the opportunity for someone to successfully hack the application but also decreases the opportunity for bugs to exist in the software.<br />

Both of these standards are designed to help developers prevent bugs from ever entering their application code. To properly follow these standards, developers can use code analysis techniques, which help prevent bugs as well.<br />

VI. CODE ANALYSIS TO PREVENT BUGS<br />

There are several different code analyses that can be performed in order to prevent bugs from getting into the software. The three analysis types that I have found to be the most useful are:<br />

• Static code analysis<br />
• Dynamic code analysis<br />
• McCabe Cyclomatic Complexity<br />

Before committing any code to a repository, it is useful to first perform each analysis on the code base and resolve any issues that might be found. Let’s briefly look at each analysis.<br />

A. Static Code Analysis<br />

Static code analysis scans a developer’s software while the code is still in source form and is not yet executing on the target platform. Static analysis provides developers with an automated tool that goes beyond the checks performed by the compiler, such as precision tracking, initialization checking, value tracking, strong type checking and macro analysis [5], and can detect potential bugs in an application.<br />

Static analysis can be used not just to detect potential issues in the way that C/C++ was written but also to check whether the code meets the rules of coding standards such as MISRA-C or a team’s style guide.<br />

B. Dynamic Code Analysis<br />

Dynamic code analysis is performed on the software while it is executing on the embedded target. Dynamic code analysis can provide developers with useful information such as:<br />

• Stack usage<br />
• Heap usage<br />
• Execution timing<br />
• System inputs and outputs<br />

C. McCabe Cyclomatic Complexity<br />

The typical human mind can only simultaneously keep track of between 7 and 10 pieces of information before it starts to lose track of the big picture. When developing a software function, these numbers don’t change. It turns out that a developer creating a function can only keep track of 7 to 10 paths through the function; beyond that, the risk increases that bugs will exist in the code or be introduced during maintenance. In computer science, the number of paths through a particular function is considered a measurement of the function’s complexity and has a special name: the McCabe Cyclomatic Complexity measurement [4].<br />

Cyclomatic complexity quantitatively measures the number of linearly independent paths through a software function. The greater the function complexity, the greater the risk that bugs will exist in that code. Figure 3 shows how the function complexity measurement relates to the risk that a bug exists in the code.<br />

Complexity versus Reliability Risk<br />
Complexity | Reliability Risk<br />
1 – 10 | A simple function, little risk<br />
11 – 20 | More complex, moderate risk<br />
21 – 50 | Complex, high risk<br />
51+ | Untestable, high risk<br />

Fig. 3. The greater the complexity, the greater the risk that bugs will be present in the function.<br />

As the reader can see, as a function’s complexity rises above 10, the risk starts to increase dramatically. For this reason, a developer can analyze their software functions for cyclomatic complexity, and any functions that have a value greater than 10 can be reworked and simplified.<br />

Monitoring the function complexity along with performing<br />

static and dynamic analysis can help detect and prevent bugs.<br />

VII. RTOS AWARE DEBUGGING<br />

Many modern embedded systems now employ a real-time operating system (RTOS) to schedule tasks and manage the complex timing requirements and microcontroller resources. Introducing an RTOS into an embedded system has many advantages, but it can also introduce potential issues related to:<br />

• Memory management<br />

• Stack utilization<br />

• System timing<br />

• etc<br />

Developers will often need some way to know:<br />

• how much stack space is being utilized by a task<br />

• the minimum, maximum and average period at which a task executes<br />

• the minimum, maximum and average task response<br />

times<br />

• task, semaphore, queue and other RTOS resource states<br />

and availability<br />



Developers can use RTOS Aware debugging to monitor these<br />

critical features within their application which can help them<br />

answer important design questions, detect stack overflows and<br />

many other potential bugs. These debugging techniques are<br />

often instrumented within the RTOS and it is the responsibility<br />

of the IDE toolchain provider to make these RTOS details<br />

readily available to the developer.<br />

A simple example of RTOS Aware Debugging can be seen in Figure 4. In this example, the stack of a Blinky task (or thread) is being monitored while executing worst-case system test cases in order to determine the worst-case stack usage. As you can see, the stack size is 1024 bytes, but the maximum stack used was 232 bytes. In this example, we have dramatically oversized the stack, but this example could easily have shown that the stack had overflowed.<br />

Fig. 4. Using RTOS Aware debugging to monitor the stack usage in e2 Studio<br />

running the Renesas Synergy SSP.<br />

Another useful example is to use RTOS Aware debugging to monitor the period, frequency and response times of the tasks in a system. Figure 5 shows an example using SEGGER SystemView, where we can see how many times individual tasks executed along with other useful information such as task frequency, execution time and other data.<br />

Fig. 5. Using RTOS Aware debugging to monitor task execution.<br />

RTOS Aware debugging provides insights into an embedded<br />

system that not only helps a developer debug the system but also<br />

gain insights into its behavior, execution and response times.<br />

With this kind of information, debugging a system can be<br />

extremely fast and efficient.<br />

VIII. A SIMPLE BUG PREVENTION AND DETECTION PROCESS<br />

The best approach any developer can take to detecting bugs<br />

is to prevent them from ever entering their system. While we<br />

have discussed several ways developers can do this, completely<br />

preventing bugs from entering the system is unrealistic. We are, after all, human, and unexpected interactions and behaviors are bound to spring up in our code. Therefore, the best way to really<br />

prevent and detect bugs is to develop a robust process for<br />

detecting them the moment that they appear in the system. Bug<br />

detection requires that a very simple process be followed when<br />

developers start writing their software.<br />

Over the years I have put together a checklist for the bug<br />

prevention and detection process that developers can follow at<br />

the beginning of the software implementation phase. The goal is<br />

to setup all the tools necessary to immediately detect a bug and<br />

monitor the system performance and behavior so that with every<br />

version commit, there is a baseline for how the system is<br />

expected to behave and perform. By doing so, if a suspected bug<br />

is detected, developers can go back through their revision<br />

control system, examine the baseline data and track down where<br />

the bug came from and hopefully arrive at a quick solution to<br />

remove it.<br />

A. Phase 1 – Project Setup<br />

There are several steps that should be done before an IDE is<br />

ever opened that can help prevent software bugs. These include:<br />

• Setup revision control system<br />

• Creating the empty project<br />

• Creating the project directory structure<br />

• Setting the white tab spacing<br />

These steps start to lay the baseline for, first, being able to revert code and perform a forensic bug analysis, and second, avoiding the issue of inconsistent tab spacing, which can be distracting and an eyesore.<br />

B. Phase 2 – Documentation Facility Configuration<br />

The configuration phase is used to put in place the tools<br />

necessary to ensure good documentation. In this phase I do<br />

several things such as:<br />

• Add Doxygen code templates (examples can be found in [3])<br />

• Configure Doxygen wizard<br />

• Import skeleton HALs and APIs<br />

• Create a version log<br />

• Create a hardware configuration module<br />

C. Phase 3 – Code Analysis<br />

There are many potential causes for bugs and using<br />

automated tools that can highlight potential issues in the code<br />

can dramatically decrease time spent debugging by detecting<br />

these issues in an automated fashion. For this reason, there are<br />

several steps to take in the code analysis setup phase such as:<br />

• Setup static code analysis tool<br />

• Setup software metrics analyzer tool<br />

• Setup dynamic code analysis tool (if one is available)<br />

Running these tools on every build will help to identify potential<br />

bugs in the code along with areas that are highly complex and<br />

could pose future risk for bug injection during maintenance.<br />

D. Phase 4 – Scheduler Setup<br />

At this stage, the toolchain is configured, and we are ready<br />

to start bringing up a board to begin software development. The<br />

first thing that is usually done at this stage is to get the hardware<br />

doing something. In this phase, the following items should be<br />

completed:<br />



• Setup an RTOS or bare-metal scheduler<br />
o Will need a timer or system tick<br />
• Setup a single task to blink an LED (the electrical engineer’s “Hello World” program)<br />

E. Phase 5 – Setup RTOS Aware Debugging (Optional)<br />

If the application being developed uses an RTOS, developers should utilize RTOS Aware debugging techniques.<br />

These techniques are often implemented directly into the IDE<br />

and can help developers monitor task stack usage, semaphores,<br />

message queues and other RTOS-related objects. RTOS Aware debugging elements are often implemented in the IDE, and the following should be set up at this stage:<br />

• Setup and become familiar with the IDE RTOS Aware<br />

debugging capabilities<br />

• Setup task stack monitoring and recording<br />

F. Phase 6 – Setup Debug Messages and Tracing<br />

I believe this is perhaps the most critical phase in the entire<br />

process. Up to this point we have been doing everything that we<br />

can to prevent bugs from entering the system. Now, we setup the<br />

tools to catch bugs when they show themselves in our<br />

application. At this point, a developer should setup the<br />

following:<br />

• Setup the trace channel<br />
o Serial<br />
o TCP/IP<br />
o RTT<br />
• Setup trace tool(s)<br />
o SystemView (SEGGER)<br />
o Percepio Tracealyzer<br />
• Setup printf<br />
o UART driver implementation<br />
o UART mapped to printf<br />
• Configure assert<br />
o assert function implemented<br />
• Configure real-time data graphing<br />
• Setup watch points on critical variables<br />

These tools will allow a developer to output debug messages, halt the moment a bug is detected and monitor their application’s performance, all before any production or prototype code is written.<br />

G. Phase 7 – Record a Baseline<br />

In order to understand how the system behavior and<br />

execution change over time, it’s important that an initial baseline<br />

trace be taken. These baselines should be taken periodically and<br />

can be referenced when the system starts to misbehave to<br />

identify potential causes. This helps to provide an application<br />

footprint and prevents developers from scratching their heads<br />

and wondering when a specific behavior was introduced in the<br />

code. They can instead simply refer to their baseline traces and<br />

observe the change.<br />

At this point, a developer should:<br />

• Perform a baseline trace<br />

• Perform a statistical analysis on the function execution<br />

H. Phase 8 – Software Implementation<br />

At this point, all the necessary tools to prevent and detect<br />

bugs are setup and ready to be put into action. The developer can<br />

now start to implement their software. Even during the<br />

development phase, there are several things that a developer<br />

should be doing in order to minimize bugs and quickly catch the<br />

ones that do make it into the system:<br />

• Schedule regular code reviews<br />

• Run analysis tools on every version before committing<br />

• Monitor their system trace and debug messages<br />

• Perform a baseline trace with every new version<br />

IX. CONCLUSIONS<br />

The modern developer has a wide range of techniques<br />

available to them to prevent and detect bugs that will help<br />

minimize the time spent debugging an embedded system. The<br />

problem faced by many developers is that in the fast-paced push<br />

to get products to market, they may feel they don’t have the time<br />

to follow a disciplined approach or learn more modern<br />

techniques. With the average developer spending 2.4 – 4.8<br />

months debugging their system, there is plenty of time to employ<br />

the processes and tools discussed in this paper which have the<br />

potential to decrease the debugging time by as much as, if not more than, 50%.<br />

REFERENCES<br />

[1] Aspencore, “2017 Embedded Market Survey”, 2017.<br />

https://www.embedded.com/electronics-blogs/embedded-marketsurveys/4458724/2017-Embedded-Market-Survey<br />

[2] www.ganssle.com/misc/fsm.doc<br />

[3] J. Beningo, “Doxygen C Templates” https://www.beningo.com/162-<br />

code-templates/<br />

[4] McCabe, Thomas Jr. Software Quality Metrics to Identify Risk.<br />

Presentation to the Department of Homeland Security Software<br />

Assurance Working Group, 2008.<br />

(http://www.mccabe.com/ppt/SoftwareQualityMetricsToIdentifyRisk.pp<br />

t#36) and Laird, Linda and Brennan, M. Carol. Software Measurement<br />

and Estimation: A Practical Approach. Los Alamitos, CA: IEEE<br />

Computer Society, 2006.<br />

[5] http://www.gimpel.com/html/lintfaq.htm<br />

[6] J. Beningo, “Embedded Software Start-up Checklist”,<br />

https://www.beningo.com/tools-embedded-software-start-up-checklist/.<br />

Feb 2016<br />



On-Chip Debug and Test Infrastructures<br />

of Embedded Systems from the Users Perspective<br />

Jens Braunes<br />

PLS Programmierbare Logik & Systeme GmbH<br />

Lauta, Germany<br />

Jens.Braunes@pls-mc.com<br />

Abstract—In the world of embedded systems developers are<br />

facing widely different challenges when it comes to debugging<br />

and test of software. On one hand, there is a need for costefficient<br />

standard components fulfilling the requirement of<br />

minimal integration effort. On the other hand, extremely<br />

powerful multicore systems, used in automotive and industrial<br />

applications, where the requirements on debugging and system<br />

observability are much higher, are present. Of course, that<br />

affects the available interfaces, and not least, the on-chip debug<br />

and trace functions.<br />

The paper will give a brief overview of all aspects of debug<br />

support that is implemented on hardware. This includes<br />

interfaces, on-chip debug support and trace solutions. The focus<br />

of the paper is on widely used solutions and implementations<br />

which are standardized or have become quasi industrial<br />

standards.<br />

Keywords—Debugging; debug interfaces; on-chip debug; on-chip trace; JTAG; CoreSight; Nexus; multicore<br />

I. INTRODUCTION<br />

With ever more complex embedded applications, error diagnostics and test get more and more expensive. The efficiency of the development and test process for the software of today’s multicore systems depends significantly on the internal debug infrastructure of the particular chip.<br />

As we learned from the past, for low-cost standard systems the semiconductor industry does not invest very much in the debug infrastructure. For the automotive and industrial areas, however, it is entirely different. Because of the rise of more and more powerful multicore systems, the software development process makes higher and higher demands on debugging and observability. Interfaces for accessing the system as well as on-chip debug and trace functions have to address this concern.<br />

The paper will give a brief overview of present common debug interfaces of embedded systems as well as of the on-chip debug hardware providing essential functions for system observation and multicore debugging. Finally, the paper will give an overview of available trace implementations for trace-based debugging, non-invasive system observation and trace-based analysis of the system’s run-time behavior.<br />

II. DEBUG INTERFACES<br />

Debugging and test with real hardware depend crucially upon efficient communication with the target and the possibility to observe the system state from the outside. In the simplest case, the software itself reports its state and important values. This is known as ‘printf debugging’ and typically uses a console for text messages or a serial interface. Of course, ‘printf debugging’ requires additional code in the application and has a great impact on the run-time behavior. It is completely unsuitable for debugging multicore or real-time critical applications.<br />

Dedicated debug interfaces allow far more efficient and convenient access to the system, but require some effort and incur extra costs. Across all microcontroller architectures and semiconductor vendors, the IEEE 1149.1 JTAG (Joint Test Action Group) interface is still the most common. Originally developed for testing integrated circuits, it plays the role of a quasi-standard for debug access. However, a JTAG implementation is relatively expensive in terms of required pins: at least five (TDI, TDO, TCK, TMS, TRST) are needed. In some cases additional pins are required, for instance for target reset, reference voltage and vendor-specific signals of the debug system. That makes JTAG unattractive for cost-sensitive and small devices. Furthermore, regarding speed and robustness against disturbances, JTAG is no longer state-of-the-art.<br />

In recent years, a number of different, often vendor-specific, alternatives to JTAG have emerged. Due to the market dominance of some microcontroller architectures, some quasi industry standards have become apparent. First of all, ARM’s SWD (Serial Wire Debug) has to be mentioned, which is part of the ARM CoreSight Debug and Trace IP (intellectual property) [1]. It needs only two pins, one for bidirectional data transfer and one for the clock. Another advantage over JTAG is the roughly two times higher transfer speed: with a 50 MHz clock, up to 4 MB/s can be realized. SWD uses a packet-oriented protocol that allows simple error detection and increases robustness against disturbances. The new SWD protocol version 2, introduced with CoreSight SoC-600 [2], specifies a so-called multi-drop architecture which allows addressing multiple processors in a multi-processor system through one single debug interface.<br />
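As a rough sanity check, the usable bandwidth of such a serial debug link can be sketched as raw bit rate times protocol efficiency. The 65% efficiency figure below is an illustrative assumption, not a documented SWD parameter; with it, a 50 MHz clock lands near the 4 MB/s quoted above:<br />

```c
#include <stdint.h>

/* Back-of-the-envelope throughput of a serial debug link: raw bit rate
 * scaled by an assumed protocol efficiency (header, ACK and turnaround
 * overhead).  The efficiency value is illustrative only. */
static uint32_t link_throughput_bytes_per_s(uint32_t clock_hz,
                                            uint32_t efficiency_percent)
{
    /* 64-bit intermediate avoids overflow for clock rates above ~85 MHz */
    return (uint32_t)(((uint64_t)clock_hz * efficiency_percent) / 100u / 8u);
}
```

The same estimate applied to the other interfaces below (DAP at 160 MHz, LPD at 10 MHz) gives the right order of magnitude for the quoted block transfer rates.<br />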

www.embedded-world.eu<br />

874


The Device Access Port (DAP) from Infineon is another vendor-specific debug interface. DAP is used exclusively in Infineon products such as the AURIX multicore microcontroller family. Like SWD, DAP manages the target communication with only two pins (bidirectional data transfer, clock). For error detection, the transmitted data are protected by a CRC code. With a maximum clock rate of 160 MHz, DAP is one of the fastest debug interfaces at the moment and achieves up to 15 MB/s for block data transfers. Furthermore, DAP can be used in two additional modes: wide mode and SPD (single-pin DAP). In wide mode, an additional pin increases the data rate to up to 30 MB/s for block reads or writes. In contrast, SPD reduces the pin count to a single pin. In SPD mode, each data bit is encoded by the distance between SPD signal edges. This way, no separate clock signal has to be transmitted for decoding the DAP data. With SPD, the achieved speed is not very high and is comparable to a 1.3 MHz regular DAP connection, but SPD is suitable for transmitting debug signals over CAN.<br />
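The edge-distance idea can be illustrated with a small decoder sketch. The 16-tick threshold and the LSB-first bit order are assumptions made for this example; Infineon’s actual SPD timing parameters differ:<br />

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative single-pin (SPD-style) decoder: each bit is encoded by the
 * time between two consecutive edges on the single data line.  A short
 * gap represents a 0, a long gap a 1, so the receiver recovers the bits
 * without a separate clock signal.  The 16-tick threshold is made up for
 * this sketch and does not reflect Infineon's real timing. */
#define SPD_THRESHOLD_TICKS 16u

static uint8_t spd_decode_bit(uint32_t ticks_between_edges)
{
    return (ticks_between_edges >= SPD_THRESHOLD_TICKS) ? 1u : 0u;
}

/* Decode eight edge distances into a byte, LSB first (assumed order). */
static uint8_t spd_decode_byte(const uint32_t gaps[8])
{
    uint8_t value = 0;
    for (size_t i = 0; i < 8; i++)
        value |= (uint8_t)(spd_decode_bit(gaps[i]) << i);
    return value;
}
```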

A real successor to JTAG, officially standardized as IEEE 1149.7, is cJTAG, implementations of which can be found in some devices from NXP. cJTAG is backward compatible with JTAG but needs only two pins. Some extensions of the protocol address multicore and multi-processor systems, e.g. support for multiple test access ports (TAPs), as well as systems with power management.<br />

Renesas also provides a proprietary debug interface for its microcontrollers, called low-pin-count debug (LPD). LPD can be used with different pin counts – two at minimum – which can be configured by the user. The comparatively efficient protocol allows data transfers of up to 1 MB/s using a 10 MHz clock.<br />

III. DEBUGGING OVER FUNCTIONAL INTERFACES<br />

For several years, there have been attempts to avoid dedicated debug interfaces completely in order to save costs. Instead, functional interfaces like CAN, Ethernet or USB are to be used for debugging. Without the need for dedicated debug interfaces, this can be much cheaper, because functional interfaces are often already implemented on the devices and can possibly be instantiated a second time for debugging purposes.<br />

One example is DXCPL (DAP over CAN Physical Layer). DXCPL uses the physical layer of CAN for transferring the debug signals of Infineon’s single-pin DAP. Because of the limited bandwidth of CAN, the achievable transfer speed is only 10 to 40 KB/s. Hence, DXCPL is used primarily for in-field debugging, where the actual debug interface (DAP, etc.) is no longer physically accessible, e.g. because of the housing of an ECU (electronic control unit).<br />

Another interesting alternative is the reuse of standardized interfaces used by calibration tools. A working group of the ASAM (Association for Standardization of Automation and Measuring Systems), for example, is currently pursuing the goal of utilizing the XCP protocol [3], which until now has been used exclusively for the calibration of ECUs, for debugging purposes as well. In the future, debugging of ECU software will be possible under real or even extreme conditions.<br />

ARM goes considerably further with its SoC-600 specification [2]. SoC-600 offers a library of IP blocks that allows using almost any functional interface for debugging, including USB, CAN bus, Ethernet or WiFi.<br />

In general, using functional interfaces for debugging has advantages but also some disadvantages. A clear advantage is that debugging remains possible even if dedicated debug interfaces are no longer accessible, e.g. later in the field. It may also save costs, because expensive, specialized hardware solutions on the target as well as on the tool side can be substituted by cost-efficient standard components and IP blocks which are often already implemented. On the other hand, as a disadvantage, the interfaces may no longer be available to the actual application, and the software itself has to make sure that the interface is initialized properly and the debug channel is opened.<br />

IV. ON-CHIP DEBUG SYSTEMS<br />

From the user’s perspective, the debug system provided by the chip is much more important than the debug interfaces, because the offered functions determine how deeply the system can be observed and how users can control the application from outside. In the end, the debug tool running on the PC relies on them, and the functionality it provides depends strongly on the available on-chip debug functions.<br />

The debug infrastructure, also known as the on-chip debug system, has primarily two tasks:<br />
1. Provide target information, e.g. memory and register contents. In addition, users should be able to modify them.<br />
2. Control the program execution on the target. That includes:<br />
− breaking the running application, triggered both by the debugger and by breakpoints,<br />
− starting the halted application,<br />
− single stepping.<br />

As with the debug interfaces, on-chip debug systems are dominated by vendor-specific and architecture-specific solutions. In the end, the debug tool has to hide the differences and should provide a common user interface.<br />

The only real standard among the frequently implemented on-chip debug solutions is Nexus (IEEE-ISTO 5001) [4]. The Nexus standard defines four compliance classes, where each higher class builds on the lower one. An excerpt of the compliance classes and the associated debug functions can be found in Table I. A chip vendor has to implement at least all required functions of a class to conform to that class. In order to fulfil the two basic tasks mentioned above, a realization of all Nexus class 1 functions is sufficient. On the market, Nexus-compliant implementations can be found for Power Architecture based SoCs from NXP and STMicroelectronics, but also in some RH850 devices from Renesas.<br />



TABLE I. EXCERPT FROM NEXUS COMPLIANCE CLASSES.<br />
(◼ = required; columns: Class 1 | Class 2 | Class 3 | Class 4)<br />
STATIC DEVELOPMENT FEATURES a<br />
Read or write user registers and memory in debug mode: ◼ ◼ ◼ ◼<br />
Single-step instruction in user mode and re-enter debug mode: ◼ ◼ ◼ ◼<br />
Enter / exit a debug mode from / to user mode: ◼ ◼ ◼ ◼<br />
Stop program execution on instruction/data breakpoint and enter debug mode (minimum 2 breakpoints): ◼ ◼ ◼ ◼<br />
Ability to set breakpoint or watchpoint: ◼ ◼ ◼ ◼<br />
DYNAMIC DEVELOPMENT FEATURES b<br />
Read or write memory locations while program runs in real time: – – ◼ ◼<br />
a. Development features available on halted target<br />
b. Development features available on running target too<br />
<br />
With the already mentioned CoreSight, ARM offers a complete set of debug IP including breakpoints, watchpoints (data breakpoints), reading and writing of memory at run-time as well as trace and cross-trigger functionality (Fig. 1). Due to the basic CoreSight concept of defining a set of configurable IP blocks, silicon vendors and IP licensees can decide which debug functions will actually be implemented. They often take customer requests into account, but in the end cost usually plays the decisive role.<br />
<br />
A completely proprietary solution comes from Infineon. The OCDS (On-Chip Debug Solution) is a hardware block used exclusively in their own architectures (TriCore, C16x and successors). Of course, OCDS supports breakpoints, data breakpoints (at least for data addresses) and provides access to memory and register contents at run-time. With the AURIX family – the latest TriCore based multicore devices – the cross-triggering was completely reworked and is now suitable for processors with a large number of cores. Fig. 2 shows the so-called OCDS cross trigger switch, which also allows sending or receiving signals to or from external pins.<br />

All on-chip debug systems provide hardware breakpoints. These are in fact based on hardware comparators for code addresses, sometimes also for data addresses. A comparator hit triggers a configured debug action, which can be, for example, issuing a HALT signal. The debug tool hides this hardware realization and offers a breakpoint to the user. However, hardware breakpoints are a limited resource: for typical microcontrollers and embedded multicore processors, only two to eight hardware breakpoints are available per core. Once all hardware breakpoints are in use, the debug tool has to fall back to software breakpoints. Software breakpoints are generally based on a code patch. The debug tool replaces the original instruction at the desired breakpoint location by a special breakpoint instruction, which is provided by selected processor architectures, or sometimes by an illegal instruction. This patched instruction causes a trap when it is executed, which is caught by the debug tool. The debug tool revokes the code patch and presents the halted application to the user. Software breakpoints are only applicable if the code to be patched is executed from RAM. For applications located in flash, software breakpoints are much more complicated to realize, and users have to make do with the available hardware breakpoints.<br />
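The patch/restore mechanism can be sketched as follows. The buffer stands in for code in target RAM, and the ARM Thumb BKPT opcode (0xBE00) serves as the breakpoint instruction; a real debug tool would perform these writes through the debug interface rather than directly:<br />

```c
#include <stdint.h>

/* Sketch of a software breakpoint: the original instruction at the
 * target address is saved and replaced by a dedicated breakpoint
 * instruction (on ARM Thumb, BKPT #0 encodes as 0xBE00).  Executing
 * the patched location traps into the debugger, which then restores
 * the original instruction before resuming. */
#define THUMB_BKPT 0xBE00u

typedef struct {
    uint16_t *addr;   /* patched location in target RAM  */
    uint16_t  saved;  /* original instruction to restore */
} sw_breakpoint_t;

static void bp_set(sw_breakpoint_t *bp, uint16_t *addr)
{
    bp->addr  = addr;
    bp->saved = *addr;              /* remember original instruction */
    *addr     = (uint16_t)THUMB_BKPT; /* patch in the trap */
}

static void bp_clear(const sw_breakpoint_t *bp)
{
    *bp->addr = bp->saved;          /* revoke the code patch */
}
```

This also makes the flash limitation obvious: the patch is a plain memory write, which only works when the code resides in RAM.<br />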

V. MULTICORE RUN-CONTROL<br />
The emergence of multicore microcontrollers and SoCs is of course accompanied by the need for enhanced debug functionality, especially for run-control: breaking, stepping and starting. Depending on the application and the debug scenario, the cores have to be synchronized, e.g. to halt them at the same time at a breakpoint. Because of the significant differences between the cores’ clock frequencies and the clock of the debug interface, an external synchronization by the debug probe or the debug tool itself is not practical; the latencies to trigger a halt signal for all cores would be too high. This approach would lead to a completely inconsistent view of the halted system. In fact, a cross-trigger mechanism is required to do the signaling for run-control directly on-chip. As already mentioned in section IV, some of the on-chip debug solutions already provide such cross-triggering functionality.<br />

However, different clock domains, signal delays as well as different pipeline depths cause latencies when entering the debug mode after a break or when leaving the debug mode once the system is started again. Simultaneous breaking, stepping or starting of multiple cores is therefore only quasi-synchronous. But with a slippage of only a few cycles or executed instructions, it is low enough to be ignored in most use cases.<br />
<br />
Fig. 1. System overview of ARM CoreSight Debug and Trace IP, including components for target access, run-control, cross-triggering and trace.<br />
<br />
Fig. 2. The Infineon OCDS trigger switch allows a quasi-synchronous break of multiple cores. The signal distribution via different trigger lines is configurable.<br />

At this point, it must be noted that cross-triggering for synchronized run-control and for trace may compete, especially for CoreSight and for particular Nexus implementations. CoreSight, for example, uses the same cross-trigger matrix for distributing HALT signals to the cores as for triggering the trace capture. As a consequence, the user must expect that simultaneous trace recording and debugging with breakpoints is only possible to a limited extent.<br />

VI. TRACE-BASED SYSTEM OBSERVATION<br />

Besides traditional debug support (stop/go, memory read/write), today’s microcontrollers often provide on-chip trace for observing the system behavior at run-time and for exact measurements. Trace-based debugging and measurement has a great advantage over traditional debugging: the observation is non-intrusive, i.e. on-chip trace does not influence the run-time behavior of the system. However, a significant additional expense arises on the silicon as well as on the tool side. Trace units are quite expensive because of their required chip area. They need to be directly connected to the cores and buses and have to transmit the captured trace data across the chip and over an appropriate interface to the debug tool. Therefore, chip vendors try to find a trade-off for their implementations, which has direct consequences for the users:<br />

1. The observation of chip-internal activities is limited. For CoreSight implementations, for example, only program trace is available in many cases; data trace is omitted for cost efficiency. Or the number of simultaneously observable cores is limited, as is the case for Infineon’s MCDS (Multicore Debug Solution).<br />
2. Either the captured data is buffered in an on-chip trace memory before it is transferred off-chip to the debug tool, or the captured data is directly transferred and stored in the debug probe. The former needs only a debug interface to transmit the data, which makes the approach quite cost-efficient in terms of pin count, but requires much more chip area for implementing the on-chip trace memory. The latter requires a specific and much more powerful trace interface, which goes along with higher pin counts, chip area and costs.<br />

Especially the second point needs to be detailed a bit more from the user’s point of view. As described, trace data can either be buffered in on-chip trace memory or be directly transferred off-chip via a high-bandwidth trace interface and stored by the debug tool. In the first case, the recording time is strictly limited by the capacity of the on-chip trace memory – depending on the device, typically a few KB up to 2 MB are available. The resulting recording time until the trace memory is filled is in the range of a few milliseconds. An exact number can hardly be stated; it is strongly influenced by the executed code and the compression algorithms used by the trace hardware. While this is suitable for trace-based debugging – for example for tracing back the history of an exception – it is seldom applicable for trace-based analysis like code coverage or profiling. One exception is a special trace mode available in the latest Infineon MCDS implementations, called Compact Function Trace (CFT). CFT records only function entries and exits, which saves a lot of trace memory and is sufficient for trace-based profiling and call graph analysis even with a limited amount of trace memory.<br />
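The order of magnitude is easy to verify: the recording window is simply buffer capacity divided by the compressed trace data rate. The 500 MB/s rate below is purely illustrative; real rates depend heavily on the executed code and the trace compression:<br />

```c
#include <stdint.h>

/* Rough fill-time estimate for an on-chip trace buffer in microseconds.
 * The trace data rate is an assumed, code-dependent figure; the point is
 * only that even a 2 MB buffer fills within a few milliseconds. */
static uint32_t trace_window_us(uint32_t buffer_bytes,
                                uint32_t trace_rate_bytes_per_s)
{
    /* 64-bit intermediate avoids overflow of bytes * 1000000 */
    return (uint32_t)(((uint64_t)buffer_bytes * 1000000u) /
                      trace_rate_bytes_per_s);
}
```

At an assumed 500 MB/s a 2 MB buffer yields a window of roughly 4 ms, and a 16 KB buffer only a few tens of microseconds, which is why longer analyses need the off-chip path described next.<br />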

For longer trace recordings, where a large amount of data arises, the captured data needs to be transferred via an appropriate interface to the debug tool. For several years, parallel interfaces were used for that purpose (e.g. ARM CoreSight, Nexus). However, at a reasonable expense in terms of pin count, the achievable bandwidth is limited to approximately 250 MB/s at maximum. For today’s multicore systems, which also run at high clock rates, this bandwidth is often no longer sufficient. For this reason, serial high-speed interfaces are more commonly used nowadays. Current implementations of serial trace interfaces achieve, with only four lines – two differential lines per data lane and two differential clock lines – somewhat higher transfer speeds than parallel interfaces. It can be anticipated that serial trace interfaces of some future devices will transmit data at several GB/s.<br />

At the moment, serial trace interfaces relying on the Xilinx AURORA protocol are implemented by Infineon for the AURIX device family as well as by NXP and STMicroelectronics for the latest Power Architecture devices (MPC/SPC57xx, SPC58x). With HSSTP (High Speed Serial Trace Port), ARM has defined a serial high-speed interface which is also based on the AURORA protocol [5]. First devices supporting HSSTP can be expected within the next two years.<br />

Indeed, serial high-speed trace interfaces are much more elaborate regarding their hardware realization on both the silicon and the tool side (Fig. 3 shows a typical setup). But to achieve the required bandwidth for ever more powerful microcontrollers with more cores, more complex interconnects and increasing clock frequencies, serial trace interfaces represent a good trade-off between pin count and on-chip logic. The main reason is that, with ever higher integration density, the costs of transistors and thus of the required logic are much lower than those of the pins that would be required to increase the bandwidth of parallel trace interfaces in the same way. Especially in the area of deeply embedded systems, with high demands on functional safety and hard real-time requirements, forgoing high-bandwidth trace interfaces altogether is not an option. For non-intrusive debugging, measurement and in particular system analysis, a comprehensive trace is essential.<br />
<br />
Fig. 3. Example of a setup for serial high-speed trace interfaces (Universal Access Device 3+ from PLS with an AURORA trace POD and an Infineon AURIX device)<br />

VII. CONCLUSION<br />

In view of the current market for microcontrollers and embedded SoCs, the provided debug infrastructure is mostly vendor-specific. Apart from JTAG, truly vendor-independent standards are seldom implemented. As a consequence, the decision for a specific microcontroller or platform architecture is in most cases also a decision for a specific debug infrastructure. System designers and integrators have to consider this when they plan the next generation of their products or completely new ones. The decision concerns not only the debug interfaces, on-chip debug systems and trace; it also influences the ecosystem, especially the debug and system analysis tools.<br />

REFERENCES<br />
[1] ARM Limited, “CoreSight Debug Trace”, https://www.arm.com/products/system-ip/coresight-debug-trace<br />
[2] ARM Limited, “ARM CoreSight SoC-600 Technical Reference Manual”<br />
[3] R. König, “ASAM AE MCD-1 XCP SW-Debug V1.0”, Release Presentation, https://www.asam.net/standards/detail/mcd-1-xcp/<br />
[4] Nexus 5001 Forum, “IEEE-ISTO 5001-2012, The Nexus 5001 Forum Standard for a Global Embedded Processor Debug Interface”, http://nexus5001.org/nexus-5001-forum-standard/<br />
[5] Xilinx, Inc., “Aurora 8B/10B Protocol Specification”, https://www.xilinx.com<br />



Debugging Live Cortex®-M based Embedded Systems<br />

Jean J. Labrosse<br />

Micriµm Software, part of the Silicon Labs Portfolio<br />

Weston, FL, USA<br />

Jean.Labrosse@Micrium.com<br />

Abstract— Debugging embedded systems has always been challenging. Now, however, MCUs based on the ARM Cortex-M architecture have a secret weapon: the CoreSight™ Debug and Trace port. CoreSight is a block of IP that resides alongside all Cortex-M CPUs and offers varied capabilities based on the actual Cortex-M core found on the MCU you are using. CoreSight has many features, including the ability to start/stop a target. It contains a breakpoint unit, includes a data watchpoint, allows printf()-like output, has an optional instruction trace capability, and enables developers to read and write memory locations (including I/Os, since those are memory-mapped) without interfering with the CPU.<br />

In the past, this last feature has been underused by tool vendors, yet it offers unprecedented<br />

insight into a running embedded system. There are many applications where you simply<br />

cannot stop at a breakpoint and examine variables using the debugger: process control,<br />

engine control, communications protocols and more. Indeed, using printf() statements,<br />

which requires instrumenting your code, is not practical in these situations. Instead, having<br />

a tool allowing values of interest to be read directly from memory and displayed graphically<br />

has much greater value; you can show trends, oscillations and other abnormalities that<br />

would not be immediately apparent with just a numeric representation.<br />

Keywords: RTOS; Data Visualization; Dashboards; Real-Time; Debugging; IoT<br />



I. INTRODUCTION<br />

Most, if not all, embedded systems read data from sensors, process that data, and, most<br />

likely, produce some form of output. The data read or produced by these systems often needs to<br />

be monitored and possibly displayed for human use. In many cases, the data is only available and<br />

meaningful when the system is running. An example of this is the engine/compressor control<br />

system shown in Figure 1. There is a tremendous amount of data being read, computed and output<br />

when the engine is running, such as spark plug firing angles, fuel injector flow rates, cylinder<br />

temperatures, RPM, valve positions, etc. The only way to debug such a system is to look at the<br />

overall system during operation because the dynamics of one subsystem will affect the operation<br />

of another. For example, an increasing engine load must be compensated for with an increase in<br />

fuel.<br />

Fig 1. Industrial Engine / Compressor<br />

You find this situation in many real-world and real-time applications, such as flight control<br />

systems, chemical reactions, food processing plants, printing presses and more.<br />

There are a number of ways developers display the status of their real-time systems.<br />

LEDs<br />

Developers trying to determine whether or not their code is running as expected often turn to LEDs.<br />

These components can be manipulated with relatively little code, and most evaluation boards<br />

include at least one or two of them. LEDs, then, are a low-cost method for gaining visual feedback<br />

from an embedded system.<br />

Although the feedback provided by LEDs can prove useful to developers, a blinking light hardly<br />

constitutes a wealth of information. Using LEDs, developers can see which portions of their<br />

application code are being executed, but other more advanced diagnostics cannot easily be<br />

performed. LEDs are ill-suited for displaying the values of variables, for instance.<br />



printf() statements<br />

printf() is another means of obtaining feedback from embedded systems. With printf(),<br />

developers can display the contents of memory buffers, the values of error codes, the results of<br />

analog-to-digital conversions, and other important information. In order for them to do so, however,<br />

their software must include drivers for printf(), and their development environment must<br />

include some sort of console for viewing printf() output.<br />

The code associated with printf() is generally not trivial. It includes both drivers and the<br />

function itself. In some development environments, the addition of a single printf() call to an<br />

application can bring about an increase in code size of as much as 10 kBytes. An application’s<br />

RAM footprint can also increase substantially as a result of printf().<br />

Larger memory footprints are not the only side effect of printf() usage; application<br />

performance can also be affected. Typically, any drops in performance are only noticeable in<br />

debugging, since completed systems generally do not utilize printf(). Even these changes can<br />

be harmful, though. They can actually create new bugs or mask existing ones. This phenomenon<br />

is sometimes referred to as the Heisenberg effect.<br />

The Heisenberg effect aside, unnecessary printf() calls are pollution, and developers must often<br />

generate an excessive amount of it in order for printf() to provide a comprehensive view of<br />

their systems. Thus, even on high-performance platforms with abundant memory resources, the<br />

use of printf() can be problematic.<br />

Full Graphics Display<br />

Developers dealing with complex systems often find graphical feedback to be more helpful than<br />

text. The graphical LCDs present on some hardware platforms are one means of obtaining such<br />

feedback.<br />

Although the information provided by an LCD can prove highly beneficial, the use of a display for<br />

monitoring an embedded system can cause many of the same problems associated with LEDs and<br />

printf(). For instance, graphical displays, when used for monitoring, are a source of code<br />

pollution. Developers must add code to their application whenever they wish to display new data.<br />

Since LCDs typically serve as user interfaces (not diagnostic tools) in completed systems, this extra<br />

code must eventually be removed.<br />

Graphical displays also necessitate drivers, and these drivers can be highly complex. Accordingly,<br />

displays can also be a source of the Heisenberg effect. Even in systems that will ultimately employ<br />

display drivers as part of a user interface, the utilization of these drivers for debugging can introduce<br />

unforeseen problems.<br />

Debugger Live Watch Feature<br />

Many developers turn to a debugger for feedback from their embedded systems. A variety of<br />

information can be gleaned from a typical debugger, including the values of variables. These values<br />

are usually listed in what is known as a watch window, and some actually provide live watch<br />

capabilities, displaying values while the target is running. Different tools offer slightly different<br />

versions of the watch window, but, in most debuggers, it is little more than a table of variables and<br />

their values.<br />

A key limitation of live watch windows is that they are typically refreshed only once per second.<br />

Live watch windows also only show numerical data, whereas, in some cases, the information being<br />

conveyed would be substantially improved with a graphical representation of the same data.<br />



Unfortunately, a debugger is not a practical tool when it comes to monitoring live data from<br />

deployed applications. What’s needed is a tool that can serve both the embedded developer as well<br />

as field service personnel.<br />

Commercial MMIs<br />

Many commercial Man-Machine Interfaces (MMIs) are available and are often found alongside<br />

Programmable Logic Controllers (PLCs) in factory floor automation. Such tools are rarely used by<br />

embedded systems engineers, but the visual aspect of such tools is exactly what many developers<br />

need. Unfortunately, these types of tools are often ignored either for reasons of cost or lack of<br />

interface mechanisms to the embedded system being developed; MMI software is typically<br />

compatible with PLC communications protocol but little else.<br />

Fig 2. Typical Man-Machine Interface Display Screen<br />



II. THE ARM CORTEX-M CORESIGHT™ DEBUG PORT<br />

ARM Cortex®-M processors are equipped with special and very powerful debug hardware built onto each chip. CoreSight contains features that require stopping the processor or changing its program execution flow; these are considered invasive. Such features can be problematic when monitoring and controlling a live system because, in many cases, we cannot afford to stop the CPU at a breakpoint. CoreSight also provides capabilities that are non-intrusive, which allow us to monitor and control live systems without halting the CPU:<br />
- On-the-fly memory/peripheral access (read and write)<br />
- Instruction trace (requires that the chip also include an Embedded Trace Macrocell, ETM)<br />
- Data trace<br />
- Profiling using profiling counters<br />

Figure 3 shows a simplified block diagram of the relationship between the CoreSight debug port,<br />

the CPU and the Memory/Peripherals.<br />

Fig 3. Relationship between CoreSight, CPU and Memory/Peripherals on a Cortex-M<br />

III. TOOLS FOR TESTING/DEBUGGING LIVE SYSTEMS

Figure 4 shows how CoreSight connects to your development environment.

F4-1 Your development environment typically consists of an Integrated Development Environment (IDE) that includes a code editor, compiler, assembler, linker, debugger and possibly other tools.

F4-2 When you are ready to debug your application, download your code to the target through a debugger interface, such as the Segger J-Link.

F4-3 J-Link connects to the CoreSight debug port and is able to start/stop the CPU, download code, program the onboard flash, and more. J-Link can also read and write directly to memory as needed, while the target is executing code.

F4-4 Micrium's µC/Probe is a stand-alone, vendor-agnostic, Windows-based application that reads the ELF file produced by the toolchain. The ELF file contains the code that was downloaded to the target as well as the names of all globally accessible variables, their data types and their physical locations in the target memory.

F4-5 µC/Probe allows a user to display or change the value (at run time, i.e. live) of virtually any variable or memory location (including I/O ports) on a connected embedded target. The user simply populates µC/Probe's graphical environment with gauges, numeric indicators, tables, graphs, virtual LEDs, bar graphs, sliders, switches, push buttons and other components and associates each of these with variables or memory locations in your embedded device. µC/Probe doesn't require you to instrument the target code in order to display or change variables at run time. By adding virtual sliders or switches to µC/Probe's screens, you can easily change parameters of your running system, such as filter coefficients and PID loop gains, or actuate devices and test I/O ports.

F4-6 µC/Probe sends requests to J-Link to read or write from/to memory.

F4-7 J-Link requests are converted to CoreSight commands, which are fulfilled, and the variable values are displayed graphically on µC/Probe's screens.

F4-8 Another highly useful tool for testing/debugging live embedded systems is Segger's SystemView. This tool typically works in conjunction with an RTOS and displays the execution profile of your tasks and ISRs on a timeline. You can thus view how long each task takes to execute (minimum/average/maximum), when tasks are ready to run, when execution actually starts for each task, when ISRs execute and much more. SystemView can help you uncover bugs that could otherwise go unnoticed, possibly for years. However, SystemView requires that you add code to your target that records RTOS events and ISRs. SystemView also consumes a small amount of RAM to buffer events.

F4-9 J-Link allows multiple processes to access CoreSight concurrently, so you can use all three tools at once.
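As F4-4 and F4-5 note, a tool of this kind only needs the monitored data to live in globally accessible variables, since their names and addresses appear in the ELF symbol table; no target-side instrumentation is required. A minimal sketch of what that looks like on the target (the variable names and the update function here are hypothetical illustrations, not part of any µC/Probe API):

```c
#include <stdint.h>

/* Global (non-static) variables are emitted into the ELF symbol table,
 * so a host tool can resolve their names to addresses and read or write
 * them over the debug port while the CPU keeps running. */
volatile uint32_t g_motor_rpm;      /* hypothetical: shown on a gauge      */
volatile int32_t  g_temperature_c;  /* hypothetical: shown numerically     */
volatile uint32_t g_pid_kp_q16;     /* hypothetical: tuned from a slider   */

/* The application only updates its own state; it contains no
 * tool-specific code. */
void control_step(uint32_t rpm, int32_t temp_c)
{
    g_motor_rpm     = rpm;
    g_temperature_c = temp_c;
    /* g_pid_kp_q16 may be changed asynchronously by the host tool,
     * so the control loop should re-read it each cycle. */
}
```

Because the host side reads memory through CoreSight while the application runs, marking such variables `volatile` also keeps the compiler from caching values the tool may change underneath the application.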

Fig 4. Tools for debugging and testing live systems.

IV. SUMMARY

Embedded systems are often black-box devices with little or no display capability, making it difficult to see what is happening inside. Developers use different techniques to show what's going on inside these devices, but, often, this requires additional hardware and instrumented code.

This paper presented tools available from Micrium (µC/Probe) and Segger (SystemView) that can provide unique insights into your live embedded systems while remaining non-intrusive. Both tools should be used in the early stages of product development, as the feedback they provide can help in better optimizing your design.

V. REFERENCES

[1] Micrium, "µC/Probe, Graphical Live Watch®," https://micrium.com/ucprobe/about/
[2] Segger, "SystemView for µC/OS," https://www.micrium.com/systemview/about/, www.segger.com/systemview.html
[3] Segger, "Debug Probes," https://www.segger.com/jlink-debug-probes.html
[4] Silicon Labs, "Simplicity Studio," http://www.silabs.com/products/mcu/Pages/simplicity-studio.aspx

Time Sensitive Networks for Industry 4.0

Thomas Leyrer
Texas Instruments Incorporated
Freising, Germany
t-leyrer@ti.com

Abstract— The digital revolution in the manufacturing process demands a communication standard which meets the requirements of the manufacturing floor. Additional sensing technology for predictive maintenance adds new quality-of-service requirements to the industrial network. Managing different communication requirements for motion control, programmable logic control and predictive maintenance is the key challenge of applying the IEEE Time Sensitive Network (TSN) standard to the trends in the industrial automation market.

Keywords— Industry 4.0, real-time control, real-time Ethernet, Time Sensitive Networks, Industrial Ethernet

I. INTRODUCTION

Industrial communication and control systems in a factory continuously increase the efficiency and flexibility of a production system. Modern factory floors support multiple control systems for different applications. Figure 1 shows the various control systems of a production cell with wired and wireless communication interfaces.

Industrial control systems use Programmable Logic Controllers (PLC) to automate a large 24-volt input/output (IO) system. These IO modules can reside next to the PLC CPU or connect through Industrial Ethernet to remote IO systems at the machine. For cabinet-deployed IO functions, protection class IP20 is sufficient. Machine-deployed IOs support the high protection class IP67. Industrial sensors connect either via the IO system or directly over Industrial Ethernet to the PLC network.

Many products require tight control of temperature, humidity, air purity and light during the production process to maintain best and consistent product quality. These parameters become even more important with additive manufacturing. Besides handling of raw material and product, the automated transport of chips is part of the smart factory.

Machine tools support concurrent processing of multiple parts with multiple tools. Such machines have up to 100 axes which are very dynamic in speed and very precise in position. Automated tool changers with data logging and quality checks of the tool support full online documentation of the production process. User-friendly control panels support additional visualization of all process parameters and connect the machine to the Information Technology (IT) world.

Cameras are used to detect the presence and position of objects. More enhanced systems include precise measurement and quality checks with reference parts. In addition, the surroundings of machines and robots are scanned for safe workspaces. Multiple cameras or scanners are combined to provide a real-time view of the production process. Direct integration into control systems enables a more efficient collaboration of man and machines.
Figure 1: Industry 4.0 Production Cell

A special variant of industrial control exists for the manipulators of industrial robots. Up to seven axes allow flexible movement in all directions. A number of interfaces are needed to integrate a robot arm into a production system. Interaction with PLC, tools and cameras enables a higher degree of automation. Through additional sensing technologies and functional safety design, collaboration with humans is possible.

Electronic tools in the manufacturing line have dedicated control units which are specific to the function, like welding, painting and milling. These tools may only power up when they are taken out of the tool magazine.

Flow of raw material, products and packaging requires identification, tracking and transportation. In certain applications, parallel movement with multi-carrier systems increases the throughput of the system. Palletizing is used to stack many objects in one area to reduce the transport overhead.

All components on the manufacturing floor are connected and controlled inside a network domain which is called Operational Technology (OT). In order to protect this domain from the IT world, a secure gateway function is needed. Such a gateway adjusts the data format and communication protocol at the edge between the IT and OT domains.

Many different control systems with hundreds of sensors and actuators share a common, deterministic backbone to make sure information flows at the right speed and at the correct time. The time-sensitive network can only meet this requirement through wired communication interfaces. However, wireless technologies gain more momentum for service ports and mesh networks collecting less time-critical process data.

Autonomous guided vehicles (AGV) are a new way of transport in factories. They use certain intelligence to find the shortest route, pick the right parts and avoid collisions with other vehicles or humans. Equipped with a robot arm and vision system, AGVs can take over complex tasks of machine loading and unloading without human interaction.

While there are different control systems in the Industry 4.0 production cell, they can be viewed as a generic real-time control system using industrial communication to connect remote IO devices to a central processing unit.

II. INDUSTRIAL REAL-TIME CONTROL

A. Field Level Control Systems

Production systems are organized in various levels to manage the production process. Optimization of the production process is only possible with a fully transparent and timely accurate view of the IO functions. Compared to consumer and office communication, industrial control systems need to be real-time deterministic and safe. Figure 2 shows the basic functions and parameters of an industrial control system.

There is cyclic exchange of IO data between devices on the field level of a manufacturing floor and the IO controller (IOC), which manages multiple IO devices (IOD) organized in a line or ring structure. Multiple IOCs can be connected at the control level to exchange data between various machines or between machines and automation components such as robots, conveyor belts and tool magazines. Communication to the office level bridges between the OT and the IT. This bridge function requires security for authentication and data while still maintaining the timing context of IO operation.

IO communication over Industrial Ethernet is repeated with a pre-configured cycle time. There are different classes of cycle times in a production system, ranging from 31.25 µs for motion control applications to more than 10 ms for a complete manufacturing site. Based on the communication cycle time (t_cycle), the IOC sends new data packets to IODs. The IOC serves as a timing master with a local time reference (t_ref). The IODs synchronize to the master time reference, and all IODs in the network have the same understanding of time. The time synchronization inside an IO communication network allows giving all input and output data at each node a reference time. For example, IOD2 data_in2 is captured at a pre-configured time t_in_2. The same behavior exists for the output data, which is triggered at time t_out_2.

The time-synchronized control of input and output data over a network of many IO devices serves as a basis for the programmable logic controllers (PLC) and multi-axis motion controllers used in machine tools and robotics applications.

B. Fieldbus and Industrial Ethernet

Figure 2 - Industrial Control System

The introduction of the PLC has led to a need for deterministic communication interfaces and protocols. Over time, serial-based communication was replaced with Ethernet-based communication. The transition to Ethernet technology allows building larger systems and exchanging more data in a single packet. With a 100 Mbit/s data rate, there is enough bandwidth to control hundreds of devices in a short cycle time.
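A quick back-of-the-envelope check of that bandwidth claim (a sketch; the 8 bytes of preamble/SFD and 12 bytes of inter-frame gap per frame are an assumption of standard Ethernet overhead, not stated in the text):

```c
#include <stdint.h>

/* Wire time of one Ethernet frame in nanoseconds, including
 * 8 bytes of preamble/SFD and 12 bytes of inter-frame gap. */
int64_t frame_wire_time_ns(int64_t frame_bytes, int64_t rate_mbps)
{
    int64_t bits = (frame_bytes + 8 + 12) * 8;
    return bits * 1000 / rate_mbps;   /* ns = bits / (Mbit/s) * 1000 */
}

/* How many frames of a given size fit into one communication cycle. */
int64_t frames_per_cycle(int64_t cycle_ns, int64_t frame_bytes,
                         int64_t rate_mbps)
{
    return cycle_ns / frame_wire_time_ns(frame_bytes, rate_mbps);
}
```

At 100 Mbit/s a 64-byte frame occupies 6.72 µs on the wire, so roughly 148 such frames fit into a 1 ms PLC cycle; with several IO channels aggregated per frame, this is consistent with controlling hundreds of devices.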

While there is enough bandwidth with 100 Mbit/s Industrial Ethernet to drive a single control system, a converged network for multiple applications requires even more flexibility and bandwidth. Table 1 lists examples of different control systems with key parameters of cycle time, bandwidth, jitter and number of devices in the network.

Control System | Cycle Time | #Devices | Bandwidth [Mbit/s] | Jitter
PLC | 1 ms | 512 | 100 | 250 ns
CNC | 125 µs | 128 | 100 | 40 ns
Robotics | 250 µs | 64 | 100 | 250 ns
Vision | 4 ms | 8 | 1000 | 100 ns
Transport | 10 ms | 16 | 100 | 1 µs

Table 1 – Examples of Industrial Control Parameters

The most critical applications in terms of timing parameters are CNC machines. Inside the multi-axis application there are still serial-based encoder protocols, because Industrial Ethernet cycle times do not reach the required 10 µs today. Larger PLC systems may span hundreds of devices with cycle times in milliseconds. For automated transportation solutions the cycle time can be even slower, and wireless communication can be used.

A new challenge for industrial control systems is the integration of machine vision. The bandwidth of vision sensors in high-throughput applications such as printing and bottling exceeds the 100 Mbit/s data rate of today's Industrial Ethernet protocols. In addition, there is a high computation challenge to extract relevant object information out of an image capture at more than 50 frames per second.

A converged network carrying multiple disciplines in one Ethernet cable is the opportunity for IEEE 802.1Q [1] Ethernet bridges to gain significant footprint in industrial control systems. Being real-time deterministic for one discipline is solved with protocols such as Profinet and EtherCAT. Supporting multiple control systems as a backbone for an Industry 4.0 production cell requires a new approach, which is discussed next.

III. TIME SENSITIVE NETWORKS

A. TSN Standard

The IEEE 802.1 Working Group is further developing Ethernet bridge technology for different networks. The customer bridge with Ethertype 0x8100 typically applies to industrial networks. With the addition of a Virtual Local Area Network (VLAN) header inside an Ethernet packet, there are more quality-of-service options based on VLAN identifiers and priority flags. For example, the classification of packets used in the context of motion control can use a unique identifier. The association of motion control traffic with a stream can now be managed in the network. The forwarding rules at a Customer-VLAN (C-VLAN) bridge for a motion stream can be mapped to a certain traffic class and priority. These mechanisms are used by TSN to support a more deterministic distribution of IO data over the Ethernet network.

Figure 3 shows the basic flow of an Ethernet packet through a customer bridge. Before packets are processed for the forwarding decision, certain rules for the ingress port apply. For example, a port can have a state in which only port-to-port traffic is allowed and all other traffic is dropped. Another ingress rule is port membership: only VLAN IDs which are registered can enter the bridge. One enhancement of TSN is stream filtering and policing under the 802.1Qci module [2]. A stream can be filtered at the receive port in case it exceeds the data rate specified for the stream or the time window for the stream is not open.

Figure 3 – TSN enhancements to customer bridge

The IEEE 802.1Q standard defines store-and-forward bridges. This means a packet needs to be received without errors to ensure error-free forwarding. The penalty of the store-and-forward scheme is high latency and high jitter through a daisy-chained network of industrial IO devices. With the combination of the 802.3br [3] and 802.1Qbu [4] standard modules, it is now possible to reduce the latency and jitter to about 1/10 of a maximum-sized packet with 1500 bytes. Frame preemption works with a maximum latency of 123 bytes; only packets larger than 123 bytes can be interrupted by express packets. Maximum jitter in a customer bridge with frame preemption at gigabit rate is therefore about 1 µs. This number comes closer to what dedicated Industrial Ethernet standards reach at 100 Mbit. However, the latency and jitter numbers of protocols such as Profinet IRT and EtherCAT are still a factor of 10 better.
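The 1 µs figure can be reproduced from serialization time alone (a sketch; preamble and inter-frame gap are ignored here, which makes the numbers slightly optimistic):

```c
#include <stdint.h>

/* Serialization time of a byte run on the wire, ignoring preamble/IFG. */
int64_t serialize_ns(int64_t bytes, int64_t rate_mbps)
{
    return bytes * 8 * 1000 / rate_mbps;   /* ns at rate_mbps Mbit/s */
}
```

A 123-byte non-preemptable fragment takes 123 x 8 = 984 ns, about 1 µs at 1 Gbit/s, while a full 1500-byte frame takes 12 µs, so preemption indeed cuts the worst-case blocking to roughly one tenth.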

With the introduction of the 802.1Qbv [5] time-aware shaper (TAS), it is possible to separate streams for real-time (RT) packets into time windows and avoid jitter caused by non-real-time (NRT) packets. The time-aware shaper assumes there is time synchronization according to 802.1AS-rev [6]. The network and the streams are managed using the 802.1Qcc [7] standard module. Before packets are transferred to egress queues, the filter database (FDB) defines the route of a packet through a bridge. From an IEEE 802.1 Ethernet bridge standard perspective, the frame filter looks at the Ethernet header, including the VLAN tag, and its FDB entry to decide which egress ports the packet goes to. Industrial Ethernet protocols like EtherCAT do not have
this extra step of forwarding decision, as all frames pass through a network node on the fly. The bridge delay for automatic frame forwarding is in the range of 320 ns. Industrial Ethernet protocols which make the forwarding decision based on the Ethernet header and protocol header reach a forwarding delay of < 3 µs when using a cut-through scheme, i.e. forwarding starts once the decision point in the frame has been verified. Cut-through switching at 100 Mbit reduces the latency from 125 µs to 3 µs. At gigabit rate this latency reduces from 12.5 µs to 1 µs.
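The gap between the two schemes compounds over a daisy-chained line of IO devices, since store-and-forward pays the full frame serialization time at every hop (a sketch; the ~3 µs cut-through figure from the text is taken as a given per-hop constant):

```c
#include <stdint.h>

/* Per-hop store-and-forward latency: the whole frame must be received
 * before forwarding starts, so it is at least the serialization time
 * of the frame (preamble/IFG and processing excluded). */
int64_t store_forward_hop_ns(int64_t frame_bytes, int64_t rate_mbps)
{
    return frame_bytes * 8 * 1000 / rate_mbps;
}

/* Worst-case accumulated latency over a daisy chain of bridges. */
int64_t chain_latency_ns(int64_t hops, int64_t per_hop_ns)
{
    return hops * per_hop_ns;
}
```

A 1500-byte frame at 100 Mbit/s serializes in 120 µs per hop (the text's 125 µs adds preamble, IFG and internal processing), so 50 daisy-chained devices accumulate about 6 ms with store-and-forward versus about 150 µs with 3 µs cut-through hops, which is why cut-through matters for line topologies.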

For TSN to reach the latency and jitter performance of today's Industrial Ethernet, it requires a combination of the time-aware shaper and cut-through switching. Figure 4 shows the transmit queues, shaper and transmit selection logic with time-aware gates. In theory, the IEEE standard would allow mixing various shapers, cyclic queuing and frame preemption on a single transmit port. Engineering such a complex traffic model over a larger network is very difficult. What makes sense is the time-aware shaper for IO data, mapping different control systems into different streams with dedicated time windows.

Figure 4 - Transmit Port Scheduler

B. TSN Profile for Industry 4.0 Production Cell

With the use cases described in the introduction of this paper, the following converged network configuration using TSN with cut-through switching is discussed. IO packets representing one control system characteristic are mapped into one stream. This stream is identified through a single VLAN ID and mapped to one traffic class which has its own transmission window.

Fig 5. Communication Cycle with TAS: a 100 µs cycle in which time-aware gates open per queue (queue 7, motion: 9 x 64 B = 6.0 µs; queue 6, IO: 32 x 64 B = 21.5 µs; queue 5, vision: 4 x 512 B = 17.0 µs), followed by best-effort traffic with strict priority; the RT windows use cut-through, the NRT remainder store and forward.

Cyclic queuing and forwarding as defined in 802.1Qch [8] supports mapping of different streams to traffic classes which are controlled through time gates in a cyclic manner. In the example shown in Figure 5, there are three traffic classes with associated streams executed in reserved time windows, by defining gate open and close times exclusively for one traffic class.

Motor control parameters such as PWM output data, current values and position data can be transferred over the TSN network at the frequency of the PWM cycle time. An 8 kHz torque loop is executed with a 125 µs cycle time on the TSN network. The size of the Ethernet packet can be the minimum of 64 bytes. For one 3-phase motor, the PWM data fits into 12 bytes, so the motor parameters in one 64-byte packet serve up to 3 axes. The mapping of various control systems onto one gigabit TSN network, as shown in Figure 5, supports 9 motion control packets. In total, control parameters for 27 motors are mapped in this example.
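The motion-stream arithmetic above can be checked mechanically (a sketch; the 18 bytes of Ethernet header and FCS assumed for an untagged minimum frame are not stated in the text):

```c
#include <stdint.h>

/* Usable payload of a minimum 64-byte Ethernet frame:
 * 64 - (6 dst + 6 src + 2 ethertype + 4 FCS) = 46 bytes. */
int64_t min_frame_payload(void)
{
    return 64 - (6 + 6 + 2 + 4);
}

/* Motors served by the motion window: bytes_per_motor bytes of PWM
 * data per 3-phase motor, 'packets' minimum-sized frames per cycle. */
int64_t motors_served(int64_t packets, int64_t bytes_per_motor)
{
    return packets * (min_frame_payload() / bytes_per_motor);
}
```

46 / 12 = 3 motors per packet, and 9 packets per 125 µs window give the 27 motors quoted above; with a 4-byte VLAN tag the payload drops to 42 bytes, which still serves 3 motors.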

PLC IO devices can be single sensor/actuator devices, concentrations of multiple channels in a decentralized remote IO, or DIN-rail IO systems with a modular backplane architecture that concentrates many IO modules into a single frame. For the first two device types a typical packet size is 64 bytes. Only the cabinet-deployed remote IO uses larger packet sizes of up to 1500 bytes.

Machine vision sensors span a wide range of image sizes and frame rates. For 2D line scanners and lower-resolution 3D time-of-flight cameras, the bandwidth of 100 Mbit/s Ethernet is sufficient. Adding multiple machine vision sensors to one TSN network requires a gigabit Ethernet network. The frame rate of a scanner is in the range of 100 Hz per image. However, the image size may exceed the 1500-byte limit of Ethernet frames. A full image can be spread out over multiple frames in a cyclic manner and still meet the frame rate, which is more than 100x slower than the TSN cycle time. In the example of Figure 5, there are 4 packets defined in the vision stream, with 512 bytes for each camera. Higher-resolution cameras do not stream raw images over the network, as the required bandwidth exceeds the gigabit data rate. These cameras compress the original image to stream over gigabit interfaces. A vision computer system decodes the stream and runs analytics to detect objects. In addition, machine vision sensors have local intelligence to measure and detect objects. The immediate processing of images supports a much faster reaction time for control systems.

Another traffic class for Industry 4.0 production systems is condition monitoring, to enable predictive maintenance through data analytics outside the classical PLC system. The maintenance cycles to replace tools, lubricate and replace bearings are days, weeks or even years. Data collection for condition monitoring can therefore be part of the non-real-time (NRT) traffic window. For NRT, as part of a cyclic TSN profile, it is important that the frame does not overlap into the next communication cycle, which starts with a TAS window.

IV. EMBEDDED PROCESSOR WITH INTEGRATED TSN SWITCH

The examples given in the previous chapter describe a converged network for industrial control systems based on TSN. All devices in the network need to work on a common time base. IEEE 802.1AS is the protocol which enables a common understanding of time in an Industrial Ethernet network. There is at least one timing master in the system which provides the reference time. Such reference time is then used by each slave to manage the communication parameters of the TSN switch. In the described example, these are the gate times for different streams and the cycle time.

In addition to the time delay through the physical layer and TSN switch, there is an unknown delay which comes with different cable lengths. To compensate for this variable delay, the 802.1AS protocol supports peer-to-peer line delay measurement. Cable delay, Ethernet PHY delay and System-on-Chip (SoC) bridge delay are summed up to calculate the same master time on each network node. A frequency drift between crystals on different network nodes is compensated by comparing the time stamps of Ethernet packets with the local time. Figure 6 shows three different time domains on a forwarding bridge. The receive function of the Ethernet PHY recovers timing from the received symbols. A receive PLL follows the clock oscillator of the previous network node. The interface between the Ethernet PHY and the TSN switch uses the recovered receive clock. On the TSN switch device, a receive time stamp (RX_TS) is taken with a local clock of the device. This time domain is also used for the transmit time stamp (TX_TS), which is needed in case the forwarding delay of the bridge is not constant for time synchronization packets.
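The peer-to-peer delay measurement works with four time stamps: the initiator sends a request at t1, the responder receives it at t2 and answers at t3, and the initiator receives the answer at t4. The mean link delay is then ((t4 - t1) - (t3 - t2)) / 2, independent of the offset between the two clocks. A sketch:

```c
#include <stdint.h>

/* IEEE 802.1AS peer-to-peer mean link delay from the four Pdelay
 * time stamps (t1, t4 in initiator time; t2, t3 in responder time).
 * The constant offset between the two clocks cancels out. */
int64_t mean_link_delay_ns(int64_t t1, int64_t t2, int64_t t3, int64_t t4)
{
    return ((t4 - t1) - (t3 - t2)) / 2;
}
```

For example, with a true one-way delay of 320 ns and a responder clock 5 µs ahead of the initiator: t1 = 0, t2 = 5320, t3 = 6320, t4 = 1640, and the formula recovers the 320 ns link delay.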

Figure 6 - Time Domains on a Single Network Node

Ethernet PHYs use an external clock source to generate the transmit clock between the bridge and the physical layer. As an optimization, one can use a single oscillator source for the SoC bridge and the Ethernet PHY transmit; this will reduce the jitter in the system. An alternative location for time stamp capture is the physical layer. The challenge for physical-layer time stamps comes with gigabit packets, different time synchronization protocols and the need for an extra interface to transfer time stamp data from the Ethernet PHY to the SoC. Time stamp accuracy in embedded processors such as the next-generation Sitara™ [9] processor AM6x is 4 ns for gigabit Ethernet using a 250 MHz clock reference. This time base and accuracy is used for clock synchronization. It can be tuned in 1 ns steps and supports time synchronization over larger networks to well below 100 ns. The jitter between TSN network nodes has a direct impact on the scheduler and time-aware shaper. If the time synchronization jitter is in the range of 1 µs, two minimum-sized packets of 64 bytes cannot be transmitted in a TAS window at gigabit rate.

A flexible and deterministic intellectual property (IP) core to support TSN is the Programmable Real-time Unit and Industrial Communications Subsystem (PRU-ICSS) [10]. Figure 7 shows the new generation of Sitara™ embedded processors with an integrated gigabit version of PRU-ICSS. The external interface is gigabit MII to the Ethernet PHYs. Each instance of PRU-ICSS supports two physical Ethernet ports. With three PRU-ICSS subsystems on one SoC, an implementation with two rings on the field level and one ring on the control level is possible. PRU-ICSS provides programmable logic above the physical layer with non-pipelined cores, a broadside data bus of up to 1000 bits width and hardware timers for Ethernet traffic control. TSN time-aware gates use the hardware timer to transfer packets with zero latency between transmit queue selection and the transmit interface to the physical layer. The PRU can transfer Ethernet packets of 64 bytes in 4 ns. This high-throughput architecture of 128 Gb/s to the physical ports and host port serves as a basis for low-latency gigabit TSN bridges. In order to maintain the low-latency communication up to the application CPU, multiple interfaces with DMA support are available. A high-speed, high-bandwidth crossbar switch distributes multiple real-time (RT), non-real-time (NRT) and network management (NW) interfaces to various masters in the system. A bus master can also be a direct memory access (DMA) peripheral which takes the payload of an IO packet and writes the data directly to an external interface.

Figure 7 - AM6x Embedded Processor with Gigabit TSN Switches

Industry 4.0 production systems connect operational technology (OT) on the factory floor with information technology (IT) inside the company and off-site cloud services. The edge gateway at this boundary between OT and IT requires network security. Security accelerators on embedded processors with the latest ciphers and key management guarantee high throughput of the gateway.

V. CONCLUSION
The Industry 4.0 reference architecture provides a guideline for connectivity through the life cycle of a product. This guideline addresses issues of compatibility, security and communication. The modern production system has many different control characteristics, and in order to reach a converged network, a very flexible communication standard is required. TSN is one of the core standards for wired communication and has a rich set of standard modules to enable a converged network. Time synchronization and the time-aware shaper are the modules most relevant to real-time deterministic communication over Ethernet.

Current Industrial Ethernet standards specify the timing parameters to ensure real-time deterministic communication. In addition, they provide certification tests, interoperability events and application interfaces to control systems. The challenge for TSN to be adopted, and finally accepted as a replacement for existing Industrial Ethernet standards, is the specification and validation of a set of industrial profiles. The most critical parameter for industrial control systems is deterministic, low-latency bridge delay. The IEEE TSN working groups need to address time synchronization jitter of a few tens of ns, cut-through switching and fixed delay through the forwarding bridge in order to compete with existing standards.

An embedded processor solution family with integrated gigabit switch technology was presented. It supports both the current Industrial Ethernet standards and future standards based on TSN. The real-time deterministic behavior of the PRU-ICSS and multiple interfaces to application CPUs on the SoC ensure a real-time data path up to the application controller. SoC integration of security acceleration IP and multiple instances of the PRU-ICSS enable a flexible edge gateway with control CPU and cloud connection.<br />

REFERENCES<br />

[1] IEEE Std 802.1Q-2014 (Revision of IEEE Std 802.1Q-2011) - IEEE Standard for Local and Metropolitan Area Networks - Bridges and Bridged Networks.<br />
[2] IEEE Std 802.1Qci - IEEE Standard for Local and Metropolitan Area Networks - Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks - Amendment: Per-Stream Filtering and Policing.<br />
[3] IEEE Std 802.3br - IEEE Standard for Ethernet - Amendment 5: Specification and Management Parameters for Interspersing Express Traffic.<br />
[4] IEEE Std 802.1Qbu - IEEE Standard for Local and Metropolitan Area Networks - Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks - Amendment: Frame Preemption.<br />
[5] IEEE Std 802.1Qbv - IEEE Standard for Local and Metropolitan Area Networks - Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks - Amendment: Enhancements for Scheduled Traffic.<br />
[6] IEEE Std 802.1AS-Rev - IEEE Standard for Local and Metropolitan Area Networks - Timing and Synchronization for Time-Sensitive Applications.<br />
[7] IEEE Std 802.1Qcc - IEEE Standard for Local and Metropolitan Area Networks - Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks - Amendment: Stream Reservation Protocol (SRP) Enhancements and Performance Improvements.<br />
[8] IEEE Std 802.1Qch - IEEE Standard for Local and Metropolitan Area Networks - Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks - Amendment: Cyclic Queuing and Forwarding.<br />
[9] Sitara(TM) Processors, http://www.ti.com/processors/sitara/overview.html<br />
[10] Programmable Real-time Unit and Industrial Communication SubSystem (PRU-ICSS), http://processors.wiki.ti.com/index.php/PRU-ICSS<br />


TSN<br />

Future Industrial Ethernet Standard or just AVB 2.0?<br />

Dipl.-Ing. (FH) Torsten Rothe<br />

Technology Engineering and Services CE<br />

Avnet EMG AG<br />

Rothrist, Switzerland<br />

Torsten.Rothe@avnet.com<br />

Abstract - Over the last decades Ethernet has become the de-facto standard for high-bandwidth standardized communication between computing devices. Originally invented for Local Area Networks, it has been adopted in many other use cases, such as VoIP communication or AVB (Audio Video Bridging). Ethernet is standardized, exists everywhere and is one of the most cost-efficient ways to connect a huge number of communicating devices. However, when it comes to industrial control applications, additional requirements such as determinism, redundancy and short latencies arise. These requirements have so far been addressed by industrial Ethernet protocols such as EtherCAT or ProfiNET. With the emerging and now mostly finalized IEEE 802.1 TSN (Time-Sensitive Networking) standard this is about to change. Having evolved from the existing 1722.1 AVB standard, it adds abilities such as frame preemption and seamless redundancy to also meet industrial requirements. As an open standard driven by both the automotive and the industrial world, TSN is destined to become the next big step in industrial Ethernet communication. In this paper we take a closer look at TSN from an industrial automation viewpoint and discuss different use cases that could be enabled by this new standard. We discuss the various IEEE 802.1 sub-standards and their respective significance to these use cases, and also review the feasibility of TSN for replacing traditional industrial Ethernet protocols in these scenarios.<br />

Keywords - TSN; 802.1; 1722.1; AVB; real-time Ethernet; deterministic Ethernet; industrial automation; industrial control<br />

I. INTRODUCTION<br />

When Ethernet was defined in the 1970s, the focus was on best-effort communication and on achieving maximum throughput versus cost for the bandwidth available. Clearly, at that time determinism was neither considered nor a design target. However, over the years Ethernet has become increasingly important also for deterministic communication applications due to its openness and wide acceptance. Today it is available in virtually all infrastructures, is easy to use and, despite a multitude of different vendors, compatibility is ensured by standardized and mandatory extensive testing. It is robust and can work seamlessly across media boundaries. The telecom world has already made the change from classic circuit designs, such as Sonet/SDH, to packet-oriented designs using Ethernet, leveraging AVB and, in the future, TSN. So what keeps us from using TSN also as an IEEE-standardized replacement for proprietary and vendor-driven industrial field bus standards? In this paper we will take a closer look at whether TSN is just an extended version of AVB or whether it is suitable to become the future de-facto standard for deterministic Ethernet communication.<br />

We start with a short overview of the current status of TSN from an industrial application perspective and provide a short market overview of the existing deterministic industrial Ethernet solutions. This is followed by an introduction to the TSN standards relevant for industrial applications and their current state of development. We also take a closer look at the most important standards of TSN from an industrial control perspective, followed by more information on the configuration of such a network. Finally, we look into a test where we compare the influence of hardware and software support in a redundant ring system. This is concluded by an outlook on the next relevant steps for TSN for industrial applications.<br />

II. MARKET OVERVIEW<br />

Figure 1 [1] shows the 2016 summary of today's Industrial Networking Protocols and their respective worldwide market share. Obviously this is divided into (traditional) field-bus protocols and Ethernet-based solutions.<br />

Figure 1: 2016 Industrial network protocol shares according to HMS<br />



When looking at Figure 1, two major trends can be observed. There is a quite stable market for field-bus interfaces with only moderate growth, and a fast-growing market for industrial Ethernet-based applications which even today is as big as the one for field busses. If the shown growth rates remain constant, in three years from now Ethernet-based applications will have twice the market of traditional field busses and will therefore have outgrown traditional field-bus protocols. However, the emerging trend of Industry 4.0, which demands much higher data rates, will most likely further accelerate these growth rates. This means that even in 2019, Ethernet-based protocols will make up the majority of freshly installed industrial networks.<br />

While all of the different real-time Ethernet standards have their pros and cons, they have one major drawback in common: they are not compatible with each other. A major reason for this is that most of the big vendors have adopted (or invented) one of these standards and are pushing it into the market. If one of these standards gains market share, this also means that some vendors are gaining market share. So there is not even an interest in improving interoperability. However, for Industry 4.0 to become a success, one standardized and interoperable industrial Ethernet solution is required. The current fragmentation aggravates installation, configuration and maintenance of such networks and results in significantly higher costs. On the other hand, many Industry 4.0 applications require hard real-time networks with a bounded latency. TSN now promises to offer these features as an IEEE standard and is widely seen as the prime candidate to become the future industrial Ethernet standard. Let's have a closer look at its features.<br />

III. TSN - HISTORY, CONCEPT AND TECHNICAL FEATURES<br />

With the current fragmentation of standards, a merge of OT (operation technology) and IT (information technology) in an industrial plant is impossible within a reasonable cost structure. However, with Industry 4.0 this is a major demand in the market, and so far it was not possible without expensive solutions. This was a major reason why TSN was advanced from the existing AVB (Audio Video Bridging) standard by the Avnu Alliance [2]. AVB, which had been developed for high-quality audio and video stream transmission over Ethernet, was the first non-proprietary Ethernet solution for best-effort real-time stream data transmission and was originally standardized as IEEE 1722.1. The TSN group was founded in 2012 out of the AVB group. This was necessary because some of the AVB standards were very interesting for industrial applications, but AVB alone could not fulfill all their needs. By separating both standards, the AVB group could finish their work and TSN had the chance to create enhancements and additional standards with clear design targets from the beginning. The most important design targets of TSN were:<br />

- Open standard to guarantee interoperability and long-term stability<br />
- Real-time with guaranteed latency, low jitter and zero congestion loss<br />
- Reduced cost and complexity for installation and maintenance of networks<br />
- Coexistence of (real-time) industrial control and best-effort standard Ethernet traffic, resulting in a convergence of OT and IT<br />
- Immunity of control-traffic determinism against best-effort traffic influences<br />
- Vendor independence<br />

The new TSN standard should therefore support all typical use cases in industry, with some examples shown below.<br />

Vertical | Use Case<br />
Industrial Automation | Machine control, PLC, motion control, safety applications<br />
Automotive | Media streaming, infotainment, but also control applications and, in the future, connected cars or autonomous driving (ADAS)<br />
Industrial Control | Machine control<br />
Power Generation Plants and Substations | Power providers have proprietary networks today for substation management. Common requirements are reaction times below 3 ms for a substation to initiate the disconnect from the power grid.<br />
Transportation | Passenger information, transport control<br />
Building Automation | Monitoring and control of heating, ventilation, air conditioning, lighting, access control<br />
Table 1<br />

Today, AVB and TSN are merged into the IEEE 802.1 standard. Most sub-standards originate from the AVB 1722.1 standards, for example [3, 4]:<br />

- 802.1BA: Audio Video Bridging (AVB) Systems<br />
- 802.1AS: Timing and Synchronization for Time-Sensitive Applications (gPTP)<br />
- 802.1Qat: Stream Reservation Protocol (SRP)<br />
- 802.1Qav: Forwarding and Queuing for Time-Sensitive Streams (FQTSS)<br />

The TSN standard defines additions to enable additional use cases such as industrial Ethernet. Some of these are looked at in more detail in the next chapter. The following table provides an overview of these additions and their current status as of the publishing time of this paper.<br />

Standard | Name | Status | Ref<br />
IEEE 802.1AS-Rev | Timing and Synchronization for Time-Sensitive Applications | Draft 6.0 | [5]<br />
IEEE 802.1Qbv | Enhancements for Scheduled Traffic | active | [6]<br />
IEEE 802.1Qbu | Frame Preemption | active | [7]<br />
IEEE 802.1Qca | Path Control and Reservation | active | [8]<br />
IEEE 802.1Qcc | Stream Reservation Protocol (SRP) Enhancements | Draft 2.0 | [9]<br />
IEEE 802.1Qci | Per-Stream Filtering and Policing | Draft 2.1 | [10]<br />
IEEE 802.1CB | Frame Replication and Elimination for Reliability | Draft 2.9 | [11]<br />
IEEE 802.3br | Interspersing Express Traffic | active | [12]<br />
IEEE 802.1Qch | Cyclic Queuing and Forwarding | Draft 2.2 | [13]<br />
IEEE 802.1Qcp | YANG Data Model | Draft 2.0 | [14]<br />
IEEE 802.1Qcr | Asynchronous Traffic Shaping | Draft 0.3 | [15]<br />
Table 2: TSN's Industrial Protocol Features<br />



The most important features of industrial networks are:<br />

- Configurable, predictable and guaranteed end-to-end latency and bandwidth between nodes<br />
- A common time basis with minimal jitter and time deviation between participating nodes<br />
- No (or minimal) packet loss, through features to avoid congestion loss and concepts for media redundancy<br />

Some of the TSN additions mentioned in the last chapter help to realize these features. These are:<br />

- Timing and Synchronization for Time-Sensitive Applications<br />
- Enhancements for Scheduled Traffic<br />
- Frame Preemption<br />
- Frame Replication and Elimination for Reliability<br />
- Cyclic Queuing and Forwarding<br />
- Per-Stream Filtering and Policing (IEEE 802.1Qci)<br />

In the following sections we give a short introduction to these and explain how they contribute to meeting industrial requirements. All these enhancements are realized as Layer 2 features in hardware to ensure real-time capabilities. However, some of them need additional software support on higher levels. The additional features are only enabled if all required participants (e.g. both nodes of a link) support them, thereby keeping compatibility with existing Ethernet nodes. For the end user this means that applications can be developed as usual, adding TSN features gradually when needed.<br />

A. Common Time Basis<br />

One basic necessity for realizing deterministic applications with multiple participants is to establish a common time basis through time synchronization. Contrary to time synchronization in IT networks, it is not focused on global time but on minimizing time deviations between nodes. For TSN, the existing AVB standard 802.1AS (gPTP, generalized Precision Time Protocol) [16] was improved into the 802.1AS-Rev standard. Even with AVB, deviations of less than ±500 ns over 7 hops to a time grand master could be achieved by using the BMCA (Best Master Clock Algorithm) [17]. However, if the grand master itself fails, gPTP requires a significant time to switch to a new time grand master. In 802.1AS-Rev [5], multiple domains and time masters can be assigned, which allows switching to a redundant time master without delays in case of failures. To ensure a seamless switchover in critical applications, redundant time masters can be predefined. For AVB, the major application of time synchronization was to ensure synchronous play-out of audio and video streams for multiple talkers and to synchronize samples from multiple listeners. For industrial control applications, it is necessary to guarantee minimum deviations in the start and end of transfer cycles and to synchronize the time slots for IEEE802.1Qbv scheduled traffic [17, 18], which we are looking at in the next section.<br />
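The offset estimation underlying PTP-style synchronization can be sketched in a few lines. This is a hedged illustration of the generic two-timestamp-pair exchange (Sync and Delay_Req), not the 802.1AS peer-delay state machines; the function name and timestamp values are hypothetical, and a symmetric path delay is assumed.<br />

```python
# Sketch of the offset/delay computation behind PTP-style time sync.
# Assumes a symmetric propagation delay between master and slave.

def ptp_offset_and_delay(t1, t2, t3, t4):
    """t1: master sends Sync, t2: slave receives it,
    t3: slave sends Delay_Req, t4: master receives it.
    All timestamps in ns, each taken with the local clock."""
    offset = ((t2 - t1) - (t4 - t3)) / 2.0  # slave clock minus master clock
    delay = ((t2 - t1) + (t4 - t3)) / 2.0   # one-way propagation delay
    return offset, delay

# Hypothetical example: slave clock runs 500 ns ahead, true path delay 100 ns
offset, delay = ptp_offset_and_delay(t1=0, t2=600, t3=1000, t4=600)
# offset -> 500.0 ns, delay -> 100.0 ns
```

Each bridge in a gPTP domain repeats a measurement of this kind per hop, which is why the residual deviation grows with the hop count.<br />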

B. Traffic Shaping and Scheduling<br />

Industrial process control is characterized by cyclically recurring as well as sporadic events. To combine these synchronous and asynchronous transfers with non-mission-critical traffic, TSN has introduced a Time Aware Scheduler, IEEE802.1Qbv [6]. This enhancement became necessary because with the previous means of prioritization (802.1Q and the AVB 802.1Qav Credit Based Traffic Shaping) it is not possible to send packets at a fixed time. IEEE 802.1Q only cares about the prioritization of packets. This means that best-effort traffic that is already queued in the outbound queue of a switch, or that is already in processing, cannot be interrupted by prioritized traffic frames. This leads to cycle times being missed, which is critical for control applications. Similarly, AVB's 802.1Qav Credit Based Traffic Shaper is only concerned with ensuring a guaranteed bandwidth for prioritized traffic. This fully meets the requirements of audio and video applications, but not the requirements for an accurate end-to-end latency as needed for industrial applications. In AVB, streams are defined and reserved between talker and listener endpoints. Both have a common understanding of time, to ensure that these streams are transmitted and played in a timely accurate manner. The intermediate nodes, like switches, are responsible for forwarding the data streams. However, since streams are sent sufficiently far in advance, credit-based traffic shaping is adequate to ensure that streams arrive at the listener in time and with the necessary bandwidth. Only the endpoint uses time-based traffic shaping while playing the streams. Requirements for industrial control are different, since packets need to arrive at all participating stations at a precisely defined and exact time [19, 20].<br />

The Time Aware Scheduler (IEEE802.1Qbv) resolves this by introducing time-slot-based scheduling. Figure 2 shows such a state-of-the-art switch with multiple (usually 8) port output queues representing different frame priorities. Traffic frames are pushed into these queues based on assigned classes and priorities to be forwarded to the MAC (media access controller). In a switch without a time aware scheduler according to IEEE 802.1Q, queues with higher priority are normally served first. Lower priority queues are only served when all higher priority queues have been served and have become empty. However, if a lower-priority frame is already being processed towards the MAC, a freshly arrived high-priority frame has to wait until this lower-priority frame is processed, leading to unpredictable delays of high-priority traffic through a switch. To resolve this issue, the time aware scheduler allows the definition of (repeating) cycles (starting and ending at the t0 times in Figure 2). At pre-defined times in each cycle (d in the figure), all best-effort traffic queues are stopped and one of the defined high-priority traffic queues is granted by the time aware scheduler for a time TS1. This ensures that periodically (with the cycle time period), high-priority traffic passes through the switch at pre-defined times. After this slot is finished, the best-effort queues are served according to IEEE 802.1Q. However, one requirement of Ethernet is that a frame, once its transmission has started, has to be processed completely by the MAC to allow CRC (cyclic redundancy check) checking. Therefore, even at the beginning of the priority time slot of a new cycle, a big (best-effort) frame might not be finished yet and would violate the priority scheduling. To avoid this, a so-called guard band (GB) is introduced at the end of each cycle, in which no new frames are scheduled at all (blocking all existing queues) and only already scheduled frames are drained. The length of the guard band depends on the maximum allowed length of the frames processed by the switch. It should be chosen as small as possible, because (by blocking all traffic) it limits the total possible throughput of the switch. For time aware shaping to work, all network participants have to have a common, synchronized time basis to meet the agreed priority time slots and start new cycles at exactly the same time.<br />
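The cycle-and-gate mechanism described above can be sketched as a small schedule lookup. The entry format and helper below are illustrative simplifications I am assuming for clarity, not the standard's gate control list encoding; the cycle and slot durations are hypothetical.<br />

```python
# Minimal sketch of evaluating an 802.1Qbv-style gate control list:
# given a repeating cycle and a list of (duration, open-gates) entries,
# return which queues may transmit at a given point in time.

def open_queues(gcl, cycle_time_ns, t_ns):
    """gcl: list of (duration_ns, gates), where gates is the set of
    queue numbers whose transmission gate is open in that slot."""
    t = t_ns % cycle_time_ns          # position inside the repeating cycle
    for duration, gates in gcl:
        if t < duration:
            return gates
        t -= duration
    return set()                      # fell past all entries: nothing open

# Hypothetical 1 ms cycle: 200 us slot for time-critical queue 3,
# then the best-effort queues 0-2, then a guard band with all gates closed.
GCL = [(200_000, {3}), (787_700, {0, 1, 2}), (12_300, set())]

open_queues(GCL, 1_000_000, 100_000)    # -> {3}          (time-critical slot)
open_queues(GCL, 1_000_000, 1_500_000)  # -> {0, 1, 2}    (next cycle, TS2)
open_queues(GCL, 1_000_000, 995_000)    # -> set()        (guard band)
```

Because the lookup uses the modulo of a shared time value, the same list yields the same gate states on every synchronized node, which is exactly why Qbv depends on the common time basis of the previous section.<br />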

Figure 2: Time Aware Shaper [19] (output queues Q0-Q3 feeding the MAC via a gate control list: at t0-gb all non-time-critical gates are blocked, at t0 the time-critical queue Q3 is open, at t0+d the best-effort queues Q0-Q2 are served; the schedule repeats each cycle with slots TS1, TS2 and guard band GB)<br />

C. Frame Preemption<br />

In order to optimize the usage of the available bandwidth compared to time aware shaping alone, an Ethernet frame preemption feature (IEEE802.1Qbu [7] for switches and IEEE802.3br [12] for MACs) has been added to the TSN standard. It allows large lower-priority frames to be interrupted by higher-priority frames. The interruption can occur at 64-byte boundaries (the minimum length of an Ethernet frame). This allows the guard band length to be reduced significantly, since the guard band only needs to allow the next 64-byte chunk to finish before starting the priority slot. For a detailed analysis of the minimal required guard times, see [21]. An interrupted frame continues transmission after the transfer of the prioritized frame is finished [19, 22]. Since interrupting frames results in CRC violations, both the sender and the receiver have to support frame preemption for it to be enabled. Figure 3 illustrates the concept.<br />

Figure 3: Frame Preemption (top, without preemption: an interfering frame forces a full-length guard band GB before time slot TS1 of each cycle; bottom, with preemption: the interfering frame is split into part 1 and part 2 around the prioritized traffic, allowing a much shorter guard band)<br />

In the upper part of the figure, the processing of an interfering frame without frame preemption is shown. The (low-priority) white frame requires a guard band which is as long as the largest possible interfering frame. With frame preemption, the situation is different, as shown in the lower part. Although the white frame's (low-priority) transmission has already started, it is preempted at its next 64-byte boundary, and the blue frame and all the prioritized traffic are transmitted. After this is done, the transmission of the white frame is continued. Seamless switching between the prioritized and the preempted frame is achieved by introducing two different kinds of MACs, called pMAC (preemptible MAC) and eMAC (express MAC). The pMAC handles the preemption of non-priority queue frames, handshaking with the eMAC (responsible for priority frames) to ensure a seamless handover at the MAC.<br />
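The guard band saving can be estimated with back-of-the-envelope wire-time arithmetic. The sketch below assumes a 1 Gbit/s link and, for clarity, ignores preamble and inter-frame gap overheads; [21] gives the exact analysis of the minimal required guard times.<br />

```python
# Guard band sizing at 1 Gbit/s, following the reasoning in the text:
# without preemption the guard band must cover the largest possible
# interfering frame; with preemption only a final 64-byte fragment.

def wire_time_ns(frame_bytes, link_bps=1_000_000_000):
    """Time to serialize frame_bytes onto the wire, in nanoseconds."""
    return frame_bytes * 8 * 1e9 / link_bps

gb_no_preemption = wire_time_ns(1522)  # max VLAN-tagged Ethernet frame
gb_preemption = wire_time_ns(64)       # minimum Ethernet frame/fragment

# gb_no_preemption -> 12176.0 ns, gb_preemption -> 512.0 ns:
# roughly a 24x shorter guard band per cycle.
```

The freed-up time directly becomes usable best-effort bandwidth, since the guard band blocks all queues for its entire duration.<br />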

For synchronous traffic, time aware shaping with frame preemption is a feasible solution. However, for (sporadically occurring) traffic in control networks, such as alarm messages, status messages, etc., this is not the case. In order to ensure low-latency communication for this kind of traffic, other mechanisms are required. Currently, the existing Traffic Shaper from AVB, IEEE802.1Qav, could be used, but it wasn't actually designed for deterministic systems. As stated in chapter B (Traffic Shaping and Scheduling), one of the main issues with IEEE802.1Qav is that it cannot be predicted at which moment in time any individual message arrives at the listener. This is essential, for example, in an application where sporadic events need deterministic/predictable behavior. Therefore the IEEE defined a new standard (IEEE P802.1Qch) in order to cope with the demand for such sporadic events.<br />

The IEEE P802.1Qch standard - on top of the behaviour defined in IEEE802.1Qbv - defines the assignment of bandwidth for sporadic events, but only when such an event occurs. This means the periodic traffic in each cycle remains untouched, but sporadic events get some bandwidth and prioritization assigned when needed. The bandwidth used in such an event could otherwise be used for best-effort traffic. The standard defines the traffic between two neighbouring hops, therefore allowing a maximum end-to-end latency to be guaranteed, which is one cycle time more per hop compared to IEEE802.1Qbv [20, 23].<br />
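The per-hop bound stated above turns into a simple linear worst-case formula. The sketch below is one common formulation of that bound under the text's "one extra cycle per hop" reasoning; the hop count and cycle time are hypothetical example values.<br />

```python
# Illustrative worst-case end-to-end latency under cyclic queuing and
# forwarding: each bridge forwards a frame one cycle after receiving it,
# so the bound grows linearly with the number of hops.

def cqf_worst_case_latency_us(hops, cycle_time_us):
    return (hops + 1) * cycle_time_us

cqf_worst_case_latency_us(hops=5, cycle_time_us=250)  # -> 1500 us
```

The practical consequence is a trade-off: shortening the cycle time tightens the latency bound but shrinks the bandwidth available per cycle for each traffic class.<br />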

D. Functions to protect against Malfunctions and Faults<br />

Another important aspect for industrial networks is to maximize availability and immunity against single failing nodes. Basically, two different general types of failures are imaginable: misbehaving nodes and failing nodes. Misbehaving nodes are usually harder to detect than completely failing nodes. Some examples of misbehaviors affecting the network are:<br />

- erroneous communication or corruption of messages<br />
- slow communication or forwarding performance<br />
- flooding the network with traffic<br />

While the first two misbehaviors can be detected by verifying messages and timing in the receiving node, the last one is a serious danger to the whole network, because it can prevent and overrule other important communication in the network and make it fail entirely. So in the next subchapter we will take a look at Ingress Traffic Control, a method to protect nodes against traffic flooding.<br />

Failing nodes are usually easier to detect. Examples of such failures are:<br />

- failure due to faulty (re-)configuration<br />
- failure due to software bugs, attacks or single event upsets<br />
- physically damaged devices (e.g. power failure)<br />
- media faults ("cut power or network cable")<br />

Usually such defects can be mitigated in the network by enabling a second (redundant) communication channel which does not rely on the defective device. In the second subchapter we will look into concepts to implement such redundancy.<br />

1) Ingress Traffic Control<br />

Several vendors have implemented proprietary mechanisms to protect switches against erroneous ingress traffic. TSN defines "Per-Stream Filtering and Policing" (IEEE802.1Qci) as an interoperable standard. Fundamentally, it allows frame counting, filtering, policing and service class selection to be performed for a frame, based on the particular data stream to which the frame belongs and on a synchronized cyclic time schedule. Policing and filtering functions include the detection and mitigation of disruptive transmissions, such as high-bandwidth traffic or packet flooding by other systems in a network, improving the robustness of that network [20, 23].<br />
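A token bucket is one simple flow-metering building block of the kind such per-stream policing can apply against flooding. The class below is a generic sketch under that assumption, not the standard's flow-meter state machine; the rate and burst figures are hypothetical.<br />

```python
# Generic token-bucket policer: frames conforming to the configured
# rate/burst pass, excess frames from a flooding stream are dropped.

class TokenBucketPolicer:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = float(burst_bytes)  # start with a full burst credit
        self.last_t = 0.0

    def allow(self, t, frame_bytes):
        """Return True if a frame of frame_bytes arriving at time t
        (seconds) conforms; False means the policer drops it."""
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (t - self.last_t) * self.rate)
        self.last_t = t
        if frame_bytes <= self.tokens:
            self.tokens -= frame_bytes
            return True
        return False

# 1 MB/s with a 1500-byte burst: back-to-back flooding is throttled.
p = TokenBucketPolicer(1_000_000, 1500)
p.allow(0.0, 1500)     # True  (burst credit available)
p.allow(0.0001, 1500)  # False (only ~100 bytes refilled in 100 us)
```

Per-stream policing additionally keys such meters to individual streams and to the synchronized cycle, so one misbehaving talker cannot consume another stream's budget.<br />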

2) Redundancy<br />

A typical fault is, for example, an open/short-circuited media line or a broken plug connection. For IT network systems, a variety of spanning tree protocol implementations for switches exist, such as STP (IEEE 802.1D) [24] and RSTP (IEEE 802.1w) [25]. These implementations reconfigure the network if there is a redundant path available. However, for real-time Ethernet packet-integrity requirements, this technology is not suitable for several reasons. First of all, it completely neglects endpoint availability. So, in case of a cable loss between a switch and an endpoint, this endpoint would be permanently disconnected from the network. Furthermore, in case of a malfunction, all packets would be lost until the reconfiguration is finished, along with the time synchronization. Finally, for simple loose contacts at plugs, resulting in toggling (on/off) connections, protocols such as RSTP would cause the network to constantly reconfigure and fail. One solution to handle such malfunctions is building a permanent redundant path to endpoints. HSR (High-availability Seamless Redundancy) and PRP (Parallel Redundancy Protocol) according to IEC 62439-3 are existing solutions. Highly simplified, HSR achieves redundancy by establishing a ring structure between endpoints, while PRP uses redundant parallel cabling. Preferences are mostly dictated by physical and environmental restrictions. TSN adds the same mechanisms into the new 802.1CB [11] standard, which works as follows: to achieve redundancy, the packets are duplicated at the sender endpoint and sent (redundantly) via separate paths to the recipient. The receiver forwards the first arriving packet to the application level. The (later arriving) duplicate gets discarded. The concept is illustrated in Figure 4, showing an HSR-like ring structure. Talker T duplicates the packet and injects it into the red and blue data paths [26].<br />

Figure 4: Frame Replication and Elimination (talker T replicates the frame, bridges B forward it along both ring directions, listener L removes duplications)<br />

This task can also be taken over by intelligent bridges or switches, depending on the technology available at the endpoint. The duplicated packets are distinguished by "sequence numbers". In this example, the red packet arrives first at the listener L and is directly forwarded to the application. The redundant blue packet is eliminated to prevent infinite loop cycles. If the red packet had been lost due to an issue in the red path, the blue packet would have arrived first and been used instead. If no duplicate arrives at the endpoint, this is an indication of a defect in the network. Compared to HSR and PRP (IEC 62439-3), IEEE802.1CB is not tied to a specific topology and can also use more than two redundant paths. This means that parallel and ring structures can be combined in the same network. However, this requires compliance with the required latencies on these paths and thus a configuration of the TSN network [19].<br />
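The listener-side behaviour described above reduces to "first sequence number wins". The sketch below illustrates this with an unbounded set of seen sequence numbers for brevity; actual 802.1CB recovery functions use a bounded history window, which this simplification deliberately omits.<br />

```python
# Minimal sketch of duplicate elimination at the listener: the first
# frame carrying a given sequence number is delivered, the redundant
# copy arriving over the other path is silently discarded.

def eliminate_duplicates(frames):
    """frames: iterable of (sequence_number, payload) in arrival order,
    possibly interleaved from several redundant paths."""
    seen = set()
    for seq, payload in frames:
        if seq not in seen:      # first arrival wins
            seen.add(seq)
            yield seq, payload   # deliver to the application

# Interleaved arrivals from the "red" and "blue" paths of Figure 4:
rx = [(1, "red"), (1, "blue"), (2, "blue"), (2, "red"), (3, "red")]
list(eliminate_duplicates(rx))
# -> [(1, 'red'), (2, 'blue'), (3, 'red')]
```

Note that for sequence number 2 the blue copy wins: the application always gets the faster of the two paths, which is what makes the failover seamless.<br />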

IV. NETWORK CONFIGURATION<br />

To enable huge Industry 4.0 networks, just having a common<br />

deterministic communication standard is not enough. A<br />

common standard for network configuration, along with tools to<br />

design, optimize and monitor such networks is required as well.<br />

Compared to classic Ethernet, the TSN standard allows for a<br />

huge variety of configuration alternatives and, for most<br>
functions to work, requires a common understanding between<br>

communicating devices. To give an example, even just<br />

configuring IEEE802.1Qbv on a single node is quite complex.<br />

However, for IEEE802.1Qbv to work, priority slots, cycle times<br />

and message queue priorities between communicating devices<br />

have to be aligned and maximum latencies and required<br />

bandwidth have to be calculated. This is especially true for larger<br />

networks with many hops and different topologies.<br />

But let’s move from bottom to top. For the configuration of<br />

single network devices (e.g. an MPU attached to a small switch<br />

to connect to the TSN network) a solution using NETCONF [27]<br />

as the management protocol exists. NETCONF provides<br />

mechanisms to read and write standardized configuration files to<br />

nodes and also offers layers to ensure secure and reliable<br />

transport of the configuration data. The configuration data are<br />

usually XML files and can be generated from a YANG [28]<br />

language model. YANG is a standardized language which can<br />

www.embedded-world.eu<br />

897


be used to describe network configuration and state data in a<br>

more human readable format than XML [29].<br />

But having means to configure single nodes is not sufficient<br />

for network configuration, because time-critical networks are<br />

more than just the sum of their atomic nodes. Standardized<br />

methods and tools to calculate, monitor and push common<br />

network configuration to these nodes are required.<br />

One approach to further standardize configuration of nodes<br />

could be to use OPC UA. OPC UA is a standard protocol to<br />

support client/server or publisher/subscriber based<br />

communication between nodes and can also be used to distribute<br />

configuration as well as status and monitoring information. By<br />

supporting authentication and encryption, OPC UA is inherently<br />

secure and reliable. There are ongoing discussions about using<br />

OPC UA for TSN Network configuration [30] and organizations<br />

such as VDMA [31] and ODVA have already committed to this<br>

[32].<br />

The complexity of calculating and finding a solution that fulfils all<br>

bandwidth and timing requirements grows exponentially with<br />

the number of nodes. Automatic monitoring, administration and<br />

reconfiguration of devices is essential to improve turnaround<br />

times and reduce costs, because just adding or replacing a single<br />

node might require a recalculation and reconfiguration of the<br />

network. As a consequence, the complete network has to be<br />

administered and controlled by a central authority or tool. Right<br />

now, a few solutions exist in the market [33, 34]. Although these<br />

solutions provide cross-vendor support, they are not vendor<br />

independent [35, 36].<br>

V. EXPERIMENTAL RESULTS<br />

A. Concept<br />

Unfortunately, at the time of writing, no TSN-conformant switch<br>
for evaluating features such as frame preemption, time-aware<br>
scheduling or redundancy was available. The authors<br>

therefore decided to focus on evaluating the behavior of a<br />

redundancy ring structure using HSR, which is fully compliant<br />

to 802.1CB and for which current implementations exist. The<br />

concepts can directly be re-used and expanded, once TSN<br />

hardware becomes available.<br />

B. Test Setup<br />

The most important feature to test in an HSR setup is the<br />

redundancy itself by ensuring that no packets are lost if the ring<br />

is opened at a single point of failure. For performance<br />

measurement, two basic indicators exist, latency and throughput.<br />

Three different kinds of latency exist for a redundant network<br />

node: the egress (t_e), the ingress (t_i) and the cut-through latency<br>
(t_t). Ingress and egress latencies describe the latencies of packets<br>

passing from the CPU to the redundant port, or from the<br />

redundant port to the CPU respectively. While these latencies<br />

are definitely interesting for local optimization of the node’s<br />

software stack, they have little effect on the total ring<br />

performance, since they add only once each to the total latency<br />

of a packet (at the sending and the receiving node). The<br>
cut-through latency, on the other hand, has a very big influence,<br>

because in a worst case scenario (ring open between two<br />

neighboring nodes, packet transmitted between these two<br />

nodes), the total latency of a packet in a network with N nodes<br />

is:<br />

t_max = t_e + Σ_{n=1}^{N} t_t,n + t_i<br>

In a sane redundant network, the average latency should be<br>
t_max/2. This is why latency measurements were limited to t_t. For<br>
most industrial networks, the network throughput is less<br>
important than the latency, but it is an important means of detecting<br>
bottlenecks. For the sake of simplicity and traceability, no<br>

complicated analysis tools were utilized. Instead, ping [37] was<br />

used to detect packet losses and to measure latencies, while iperf<br />

[38] was used to analyze the network throughput.<br />

To measure the cut-through latency of each node, the<br />

following setup was used:<br />

1. Directly connect sender and receiver node with one<br />

Ethernet cable and measure ping time.<br />

2. Add an additional node (the device under test) between<br />

the sender and receiver and ping again. The cut through<br />

time is the time difference between these two ping<br />

times.<br />

3. To ensure that all nodes work correctly as expected and<br />

as a sanity test, close the redundant Ethernet port<br />

between sender and receiver (hereby creating a ring) and<br />

ensure that the ping times return to the ones measured in<br />

step 1. Open and close the Ethernet connections at<br />

various steps to ensure that no packet losses occur<br />

during the ping.<br />
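The steps above reduce to simple arithmetic. The sketch below uses invented numbers (all microsecond values are illustrative, not measurements from this paper):

```python
# Illustrative ping round-trip times (invented, in microseconds):
rtt_direct_us = 500.0     # step 1: sender and receiver directly connected
rtt_with_dut_us = 580.0   # step 2: device under test inserted in the path

# Per the procedure above, the cut-through time is the difference
# between the two ping times.
t_t_us = rtt_with_dut_us - rtt_direct_us

# Worst case in an open ring with N nodes (formula from the text),
# assuming equal per-node cut-through latencies:
#   t_max = t_e + sum(t_t,n for n = 1..N) + t_i
N = 10
t_e_us = 20.0             # egress latency at the sending node (invented)
t_i_us = 20.0             # ingress latency at the receiving node (invented)
t_max_us = t_e_us + N * t_t_us + t_i_us
t_avg_us = t_max_us / 2   # expected average in a sane redundant ring

print(t_t_us, t_max_us, t_avg_us)  # 80.0 840.0 420.0
```

The worst case grows linearly with the ring size, which is why per-node cut-through latency dominates the total ring performance.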

The same principle was used to measure the network throughput.<br />

To ensure that performance limitations are detected, both sender<br />

and receiver have to perform with a higher throughput than the<br />

actual device under test. Our setup therefore used known-good<br>
deterministic software HSR devices, which perform<br>

worse in terms of latency, but very well in terms of throughput<br />

compared to hardware devices.<br />

Figure 5 shows the principle of the test setup with three boards. Each<br>
node is a different evaluation board configured<br>
as a DANH (Double Attached Node HSR).<br>
Node A is the sending device, node B is the device under<br>
test (DUT) and node C is the receiver. At the beginning of every<br>
test we connect nodes A and C directly and measure the ping<br>
round-trip time of the two devices in both directions, from A to C and<br>

C to A (Route A). In the next step we connect node B like in<br />

figure 5 below and disconnect route A. By doing this, we ensure<br />

that the system is forced to reroute the ping-test via node B<br />

which is the DUT. At the end we close route A again. Repeating<br />

the test should show us the same results as in the previous test<br />

from A to C. In the above-mentioned test via node B (route A open)<br>

the signal takes one more hop and therefore experiences a minor<br>
delay when passing through node B. This delay is called the<br>
cut-through time t_t.<br>




The main advantages of the software solution are clearly the<br>
cost and the flexibility. But, as the measured delays (t_t) in table 3<br>

show, the hardware implementation from Microchip and<br />

Renesas can forward the frames without any help from the CPU<br />

in times below 1 µs. The software solution from NXP on the<br>

other hand needs an average of 40 µs to forward a frame. The<br />

LS1021A TSN board needs 80 µs. This additional delay is likely<br />

the result of the Real-Time extension which guarantees better<br />

determinism at the cost of lower performance. All results were<br />

measured without additional CPU load. In real-world<br />

applications, there usually would be an application running in<br />

parallel on the CPU, which could have an unpredictable impact<br />

on the software bridging and HSR implementation. This might<br />

result in even larger forwarding delays. The conclusion is that<br>
for deterministic applications like TSN, a hardware-implemented<br>
redundancy (as defined in IEEE802.1CB) seems<br>

inevitable. Otherwise it is impossible to guarantee determinism<br />

for the system. For further details about the results and test setup,<br />

please contact the authors.<br />

Figure 5 HSR Ring<br>

C. Results<br />

From our test we got four different solutions. Two with HSR<br />

software implementation and two with hardware<br />

implementation. From NXP [39] we used the LS1021A TSN<br />

reference design (running Linux and the PREEMPT_RT Linux<br />

patch) and the LS1021A Tower Board (running mainline Linux<br />

4.14). Both are equipped with a dual core Cortex-A7 and two<br />

SGMII (Serial Gigabit Medium Independent Interface) ports,<br />

each with a single PHY. Both boards enable the HSR bridge<br />

function between the two ports on a software basis. As a first<br />

examples of HSR hardware implementations, we used the<br />

Microchip [40] 7 port KSZ9477 switch attached to a SAMA5<br />

Cortex-A5 MPU as management CPU. The second hardware<br />

implementation example is Renesas’ RZ/N1D [41], an<br />

integrated solution with a dual Core-A7 and an on-chip 5 port<br />

real-time Ethernet switch. The following table shows the<br>
average cut-through latency results t_t (measured with 100 pings).<br>

DUT | t_AtoC / µs | t_CtoA / µs | t_t / µs<br>

MCP


REFERENCES<br />

[1] https://www.hms-networks.com/images/librariesprovider6/defaultalbum/company-images/network-shares-according-tohms.jpg?sfvrsn=ff60d2d6_2<br />

[2] http://avnu.org/<br />

[3] http://www.ieee802.org/1/pages/avbridges.html,<br />

[4] http://www.ieee802.org/1/pages/tsn.html<br />

[5] http://www.ieee802.org/1/pages/802.1AS-rev.html<br />

[6] http://standards.ieee.org/findstds/standard/802.1Qbv-2015.html<br />

[7] http://standards.ieee.org/findstds/standard/802.1Qbu-2016.html<br />

[8] http://www.ieee802.org/1/pages/802.1ca.html<br />

[9] http://www.ieee802.org/1/pages/802.1cc.html<br />

[10] http://www.ieee802.org/1/pages/802.1ci.html<br />

[11] http://www.ieee802.org/1/pages/802.1cb.html<br />

[12] http://standards.ieee.org/findstds/standard/802.3br-2016.html<br />

[13] http://www.ieee802.org/1/pages/802.1ch.html<br />

[14] http://www.ieee802.org/1/pages/802.1cp.html<br />

[15] http://www.ieee802.org/1/pages/802.1cr.html<br />

[16] http://ieeexplore.ieee.org/document/7466451/<br />

[17] http://www.elektroniknet.de/elektronik-automotive/bordnetzvernetzung/viel-mehr-als-nur-echtzeit-141430.html<br />

[18] http://www.strategiekreis-automobilezukunft.de/public/projekte/seis/das-sichere-ip-basiertefahrzeugbordnetz/pdfs/TP2_Vortrag4.pdf<br />

[19] D. Pannel and J. Bergen, "IEEE TSN Standards Overview &<br />

Update," Marvell, 2015.<br />

[20] Dr. René Hummen, Stephan Kehrer and Dr. Oliver Kleineberg,<br />

"TSN-Time-Sensitive-Networking-White-Paper-<br />

EMEA_EN.pdf," Hirschmann, 2016.<br />

[21]https://zenodo.org/record/263879/files/2016ETFA-TUBS.pdf<br />

[22] U. Schulze, "Keine Zeit verschwenden," iX, no. 1, pp. 94-96, 2018.<br />

[23] https://mentor.ieee.org/802.24/dcn/17/24-17-0020-00-sgtgcontribution-time-sensitive-and-deterministic-networkingwhitepaper.pdf<br />

[24] http://ieeexplore.ieee.org/document/1309630<br />

[25] http://ieeexplore.ieee.org/document/4039960/<br />

[26] http://netmodule.com/en/technologies/industrialethernet/IEC62439<br />

[27] https://tools.ietf.org/html/rfc624<br />

[28] http://www.yang-central.org/twiki/bin/view/Main/WebHome<br />

[29] https://www.nxp.com/docs/en/user-guide/OPEN-LINUX-IND-<br />

UM.pdf<br />

[30] http://opcconnect.opcfoundation.org/2017/12/opc-ua-over-tsn-anew-frontier-in-ethernet-communications/<br />

[31] https://ias.vdma.org/viewer/-/article/render/15646006<br />

[32] https://www.odva.org/Optimization-40/Optimization-of-Machine-<br />

Integration-OMI<br />

[33] http://www.hirschmann.de/de/Hirschmann/Industrial_Ethernet/<br />

Netzmanagement/Industrial_HiVision_Network_Management_D<br />

E/index.phtml1<br />

[34] https://www.tttech.com/products/industrial/deterministicnetworking/network-configuration/slate-xns/<br />

[35] A. Hennecke and S. Weyer, "http://www.computerautomation.de/feldebene/vernetzung/artikel/143840/,"<br />

[Online].<br />

[36] George A. Ditzel and Paul Didier, "Time Sensitive Network (TSN)<br>
Protocols and use in EtherNet/IP Systems," 2015 ODVA Industry<br>
Conference & 17th Annual Meeting, Frisco, Texas, USA,<br>
October 13-15, 2015.<br>

[37] https://en.wikipedia.org/wiki/Ping_(networking_utility)<br />

[38] https://iperf.fr/<br />

[39] https://www.nxp.com/support/developer-resources/referencedesigns/time-sensitive-networking-solution-for-industrialiot:LS1021A-TSN-RD?fsrch=1&sr=1&pageNum=1<br />

[40] http://www.microchip.com/DevelopmentTools/<br />

ProductDetails.aspx?PartNO=EVB-KSZ9477<br />

[41] https://www.renesas.com/en-eu/products/microcontrollersmicroprocessors/rz/rzn/rzn1d.html#productInfo<br />

[42] https://www.iiconsortium.org/press-room/11-28-17.htm<br />

[43] https://lni40.de/<br />



Demystifying Time Aware Traffic Shaping<br />

Technologies for TSN<br />

A Case Study for Linux Driver Enabling<br />

Ong, Boon Leong<br />

Internet of Things Group<br />

Intel Corporation<br />

Penang, Malaysia<br />

boon.leong.ong@intel.com<br />

Abstract— The evolution of IEEE802.1 Audio Video Bridging<br />

(AVB) Task Group to Time-Sensitive Network (TSN) Task Group<br />

and the creation of Avnu Alliance have generated much attention<br />

in industrial, automotive and professional Audio/Video<br />

applications. As hardware Intellectual Properties are created<br />

according to IEEE standards, software components are developed<br />

to enable them. This paper provides a concise yet extensive<br>
introduction to various TSN technologies and the readiness of the TSN<br>
framework in the Linux world. In addition, the paper describes a<br>

modular approach that has been taken in-house to develop a<br>
TSN-capable Ethernet driver.<br>

Keywords— Linux; Time Sensitive Network; TSN; IEEE802.1<br />

Qav; IEEE802.1 Qbv; IEEE802.1 Qbu; Frame Preemption; Gate<br />

Control; Credit Based Shaper; Traffic Class; Ethtool; Networking.<br />

I. INTRODUCTION<br />

The Internet has been around for decades and the range of<br />

applications that are powering it has grown tremendously from<br />

simple web pages with text and pictures, Internet Relay Chat to<br />

on-demand streaming of audio/video contents, Voice over<br />

Internet Protocol telephony and two-way interactive video calls.<br />

The need for networking bandwidth too has sky-rocketed, from<br>
Fast Ethernet to Gigabit Ethernet (1G), 10G, 40G, 100G and<br>
beyond in the data-center backhaul. Ethernet technology was<br>
originally designed to provide best-effort delivery for lightly<br>
loaded networks and has since been enhanced to provide traffic<br>
prioritization for certain data-streaming applications that need<br>
bandwidth reservation.<br>

The dawn of the Internet of Things and market movements such<br>
as Industry 4.0 (“Smart Factory”) and autonomous vehicles are<br>
driving Ethernet technology to provide data transfer in a reliable<br>
and timely manner. This is known as Time-Sensitive<br>
Networking, and “TSN” has become a buzzword in the<br>
Ethernet technology domain, especially for automotive and<br>
industrial automation applications, ever since the evolution of the<br>
IEEE802.1 Audio Video Bridging (AVB) Task Group (TG) into the<br>
TSN TG, because the scope of the former TG has grown beyond<br>

time-sensitive A/V stream. The objectives of TSN are manifold:<br />

(a) time synchronization, (b) deterministically low latency for<br />

scheduled traffics (industrial and automotive control loop) and<br />

bandwidth reserved traffics (audio and video streaming), (c)<br />

bandwidth utilization reservation and (d) fault<br />

tolerance/reliability. In layman’s terms, TSN enables networked<br>
applications/entities to interact in a real-time fashion: with bounded<br>
delay and a well-known, shared time base.<br>

For data centers and embedded devices, a Linux-based<br>
Operating System (OS) from commercial companies or<br>
community releases, e.g., Red Hat Enterprise Linux, SUSE<br>
Enterprise Linux, Ubuntu Linux, Yocto Project/OpenEmbedded-built<br>
Linux and OpenWrt, is a popular choice due to its feature-rich<br>
networking stack and broad device driver support for many<br>
different types of network interface controllers (NICs). The<br>
same argument applies to the Android mobile OS, which also<br>
uses the Linux kernel. For decades, the software industry has seen<br>

a continuous innovation on higher level protocols above the<br />

standard socket interface provided by Linux kernel and within<br />

the Linux kernel. To name a few, NAPI (New API) packet<br />

processing framework for receive interrupt mitigation,<br />

iptables for configuring Netfilter (an IP rules-based packet<br />

filtering and packet mangling), tc (traffic control) for<br />

configuring Linux kernel packet scheduler i.e. queuing<br />

discipline and XDP (eXpress Data Path) pre-stack packet<br />

processing. ethtool is also another popular utility for showing<br />

and making changes to the parameters belong to Ethernet NIC,<br />

e.g., speed, duplex mode, auto-negotiation, checksum offload,<br />

DMA ring sizes, interrupt moderation/coalesce and receive flow<br />

hashing for load-balancing across multi-queue NICs.<br />

The TSN TG belongs to the IEEE802 standards committee and is not<br>
chartered for standards certification. The Avnu Alliance was formed with<br>
the aim of creating an interoperable TSN ecosystem and has<br>
certification labs that test and certify commercial products against<br>
a rich set of conformance and interoperability tests. Though in<br>
its infancy, the Avnu Alliance already has a first set of conformance<br>
tests on time synchronization, i.e., IEEE802.1AS, and is partnering<br>



with Open Platform Communications (OPC) Foundation to<br />

provide conformance testing and certification of OPC UA over<br />

TSN devices [1] for the industrial ecosystem. To fuel the<br />

creation of AVB/TSN solutions, the OpenAvnu project [2],<br>
sponsored by the Avnu Alliance, aims primarily to provide<br>
building-block components. The project contains both<br>
GPLv2-licensed kernel driver ingredients and BSD-licensed<br>
user-space sample applications, libraries, and daemons.<br>

In this paper, the two key technologies of TSN time<br />

synchronization and traffic shaping are discussed in Section II<br />

and Section III. Section IV focuses on the current state of the art<br />

for TSN technologies in the Linux kernel and other parallel<br>
kernel-related projects such as traffic control and ethtool. An overview<br>

of the technologies offered in OpenAvnu is also described in<br />

Section IV. Section V describes the modular software<br>
architecture approach taken for enabling a TSN-capable Ethernet<br>
kernel driver despite the lack of a complete TSN framework in<br>
the Linux networking subsystem at the time this paper was created.<br>

Section VI provides two examples covering how to apply<br />

various TSN technologies for a network that carries a mix of<br />

traffic patterns: scheduled, time-sensitive and best effort.<br />

II. OVERVIEW OF TIME SYNCHRONIZATION<br />

IEEE Std. 1588-2008 [3] also popularly known as Precision<br />

Time Protocol Version 2 (PTPv2) enhances the accuracy of time<br />

synchronization between two networked nodes from<br />

millisecond (achievable by Network Time Protocol (NTP)) to<br />

microsecond or sub-microsecond. This is made possible as<br />

packet time-stamping is done at hardware level instead of<br />

software level in the case of NTP. The transport of PTP message<br />

can be over UDP/IPv4, UDP/IPv6, IEEE802.3 Ethernet and<br />

several industrial automation control protocols, e.g.,<br />

DeviceNET, ControlNET, and PROFINET.<br />

IEEE Std. 802.1AS-2011 [4] also known as generalized<br />

Precision Time Protocol (gPTP) is based on IEEE Std 1588-<br />

2008 but differs in various aspects as documented in section 7.5<br />

of the specification. For example, gPTP has a faster best master<br>
clock algorithm (BMCA) convergence time, all gPTP messaging<br>
is done only over the IEEE 802 MAC, and the gPTP time domain can<br>
span across heterogeneous networks, e.g., Ethernet,<br>

Wireless, Media over Coax Alliance and HomePlug.<br />

To bring about new enhancements and performance<br />

improvements to IEEE Std. 802.1AS-2011, IEEE 802.1ASbt<br />

was started late 2011 and eventually superseded by IEEE<br />

802.1AS-Rev which is still at the draft stage as of this writing.<br />

Examples of the enhancements are Link Aggregation support,<br>
Fine Timing Measurement for IEEE 802.11 transport, one-step<br>
processing, faster grandmaster changeover and further reduced<br>
BMCA convergence time.<br>

III. OVERVIEW OF TRAFFIC SHAPING<br />

In packet-switched computer networking technology such as<br />

Ethernet, network packets flow through an inter-connected mesh<br />

of network bridges/switches in bandwidth optimized fashion.<br />

Depending on the bandwidth utilization at the certain point of<br />

time, traffic congestions may happen unpredictably. Such<br />

randomness contributes towards varying transmission latency in<br />

the application data flow and eventually causes poor service<br />

experience to its users.<br />

Quality of Service (QoS) in computer networking is about<br />

ensuring certain application data flows are given higher priority<br />

over others. IEEE Std. 802.1Q-2005 defines (1) Virtual Local<br />

Area Network (VLAN) which includes Priority Code Point<br />

(PCP) for marking packet priority, (2) strict priority transmission<br />

selection algorithm for prioritization of traffics. Multiple queues<br />

are added in both ingress and egress side of a networked device<br />

in order to reorder higher priority packets over lower priority<br />

packets. Traffic prioritization helps improve the application<br />

service quality in a lightly-loaded network. In Ethernet<br>
technology, a frame that is in the midst of transmission must be<br>

fully transmitted together with its checksum or else it is treated<br />

as a corrupted frame. Therefore, a higher priority frame is not<br />

allowed to be transmitted until an earlier lower priority packet is<br />

completely transmitted. The transmission latency of the higher<br />

priority frame becomes greater if the earlier frame has a long<br>
payload, such as a jumbo frame. Clearly, the situation worsens if<br>

the entire network is heavily loaded. In short, traffic prioritization<br>
does not assure low transmission time with bounded latency.<br>
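A quick back-of-the-envelope sketch makes the blocking effect concrete (preamble and interframe gap are ignored for simplicity):

```python
def serialization_delay_us(frame_bytes, link_mbps):
    """Time the wire stays busy with one frame (preamble/IFG ignored)."""
    return frame_bytes * 8 / link_mbps  # bits / (Mbit/s) gives microseconds

# A high-priority frame that just missed its chance must wait until the
# lower-priority frame already on the wire has finished, e.g. a maximum
# VLAN Ethernet frame of 1522 bytes:
print(round(serialization_delay_us(1522, 100), 1))   # 121.8 us at Fast Ethernet
print(round(serialization_delay_us(1522, 1000), 2))  # 12.18 us at Gigabit Ethernet
```

Even at gigabit speed, one maximum-size frame in flight adds over 12 µs of worst-case waiting time, which is the gap that time-aware shaping and frame preemption (discussed below) aim to close.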

Internet applications such as audio/video streaming, VoIP<br />

telephony, and two-way video calls use the Real-time Transport Protocol (RTP), which<br>
runs over the User Datagram Protocol (UDP) on the Internet Protocol<br>
(IP). The data streams of such applications carry large contents<br>
and are sensitive to transmission latency. RTP streams contain a<br>

time-stamp and a sequence number which are used to manage<br />

stream transmission jitter, packet loss and out-of-order delivery. For<br>
in-vehicle infotainment or professional AV systems whereby the media<br>
source, speakers, and display unit are located close together,<br>
Ethernet-based AVB technology is a better option than RTP because AVB<br>

uses IEEE 1722 AV Transport Protocol Layer 2 payload to carry<br />

multiple streams and has less header overhead, no IP and UDP<br />

headers.<br />

Both AVTP and RTP media streams require low bounded<br />

latency and latency variation in the packet-switched network and<br />

this means reserving transmission bandwidth for AV streams on<br />

the usually congested network. The recommendation to map and<br />

regenerate VLAN tag encoded priority for bandwidth reserved<br />

streams and a controlled-bandwidth queue draining algorithm<br>
called Credit-Based Shaper (CBS) are defined in IEEE Std.<br>

802.1Qav-2009 [5]. CBS, in essence, is a means to space out AV<br>

streams as far as possible to prevent the formation of long bursts<br />

of high priority traffic that both (1) degrade QoS offered by<br />

lower priority traffic classes and (2) interfere with other high<br />

priority traffic [6]. IEEE Std. 802.1Qat-2010 [7] describes<br />

Stream Reservation Protocol (SRP) for registering and<br />

deregistering AV streams and their associated Traffic<br />

Specification (TSpec). SRP has been implemented on top of an<br />

existing network management protocol called the Multiple<br />

Registration Protocol (MRP). Both Multiple VLAN Registration<br />

Protocol (MVRP) and Multiple MAC Registration Protocol<br />

(MMRP) use MRP and may be used with SRP [8].<br />

It is worth noting that, per section 33.6.1 of [5], end stations that<br>
are SR talkers shall apply the CBS algorithm to the per-stream queue<br>
and the per-traffic class queue. The TSpec that describes the bandwidth<br>

reservation for an SR class, for specification scalability reasons,<br>
does not include the overhead of the underlying Ethernet MAC<br>
service. Section 34.4 of [5] describes the way to calculate frame-level<br>
bandwidth requirements (used as the CBS idle slope) based on the SR<br>
stream TSpec.<br>
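A hedged sketch of that frame-level calculation follows. The 42-byte wire overhead is the usual Ethernet framing value and the TSpec numbers are invented; section 34.4 of [5] should be consulted for the normative formula:

```python
# Per-frame wire overhead: preamble+SFD (8) + Ethernet header with VLAN
# tag (18) + FCS (4) + interframe gap (12) = 42 bytes (assumed values).
WIRE_OVERHEAD_BYTES = 8 + 18 + 4 + 12

def idle_slope_bps(max_payload_bytes, frames_per_interval, interval_us):
    """Frame-level reserved bandwidth, used as the CBS idleSlope.

    The TSpec itself excludes MAC overhead (for scalability, per [5]),
    so the overhead is added back at frame level before converting
    to bits per second.
    """
    wire_bytes = max_payload_bytes + WIRE_OVERHEAD_BYTES
    return wire_bytes * 8 * frames_per_interval * 1_000_000 / interval_us

# Invented TSpec: one 1000-byte payload per SR class A interval (125 us).
print(idle_slope_bps(1000, 1, 125))  # 66688000.0 bits/s, i.e. ~66.7 Mbit/s
```

The result is the rate at which CBS credit is replenished; a talker reserving this stream would program it as the idleSlope of the SR queue.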



Control applications in automotive and industrial networks<br />

require even lower latencies than AV applications, and their traffic<br>

pattern is categorized as scheduled traffic by IEEE Std.<br />

802.1Qbv-2015 [9]. IEEE Std. 802.1Qbv-2015 defines Time-<br />

Aware Shaper (TAS) whereby the selection of transmit frame<br />

from transmit queue is controlled by the associated gate control<br />

that opens or closes based on a pre-defined time schedule called<br />

gate control list (gate, open/close, time interval). To protect<br />

scheduled traffic from being delayed by other traffic, a guard<br>
band with a duration as long as the time for transmitting the longest<br>
VLAN Ethernet frame (1522 bytes) is set before the gate open time<br>

for scheduled traffic. In other words, there shall be no frame<br>

transmission, i.e. loss of bandwidth usage, during guard band in<br />

order for scheduled traffics to be selected for transmission<br />

without delay.<br />
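A gate control list can be modeled as a cyclic schedule of (open-gates, interval) entries. The cycle and intervals below are purely illustrative, not taken from any standard profile:

```python
# Gate control list: (set of open traffic classes, interval in microseconds).
# Illustrative 250 us cycle: the scheduled class (TC7) gets an exclusive
# 50 us window, then the best-effort classes share the remainder.
GCL = [
    ({7}, 50),
    ({0, 1, 2, 3}, 200),
]
CYCLE_US = sum(interval for _, interval in GCL)

def open_gates(t_us):
    """Return the traffic classes whose transmit gate is open at t_us."""
    t = t_us % CYCLE_US          # the schedule repeats every cycle
    for gates, interval in GCL:
        if t < interval:
            return gates
        t -= interval

print(open_gates(10))    # {7}: inside the scheduled-traffic window
print(open_gates(120))   # {0, 1, 2, 3}: best-effort window
print(open_gates(260))   # {7}: 260 us wraps around into the next cycle
```

A frame is only eligible for transmission selection while its traffic class appears in the returned set, which is what gives scheduled traffic its contention-free window.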

To reduce the effect of bandwidth loss in guard band, IEEE<br />

Std. 802.3br-2016 [10] enhances the capability of Media Access<br />

Control (MAC) to include express MAC (eMAC), preemptable<br />

MAC (pMAC) and MAC Merge sublayers for the purpose of<br />

interspersing express traffic (scheduled traffic) with<br />

preemptable traffic into a normal Ethernet frame transparent to<br />

Physical Layer. IEEE Std. 802.1Qbu-2016 [11], complementary<br />

to IEEE Std. 802.3br, defines a means to (1) map traffic priority<br>

to frame preemption (FPE) status (express or preemptable) and<br />

(2) hold or release the transmission of preemptable frames in<br />

pMAC. Since preemption occurs only if at least 60 bytes of the<br />

preemptable frame have been transmitted and at least 64 bytes<br />

(including the frame CRC) remain to be transmitted, the guard<br />

band for scheduled traffic can be reduced to as small as 64 bytes.<br />
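The 60/64-byte constraint can be stated directly as a predicate (a simplified check; the standards bound fragment counts and sizes in more detail):

```python
MIN_SENT_BYTES = 60    # minimum already-transmitted fragment size
MIN_REMAIN_BYTES = 64  # minimum remainder, including the frame CRC

def can_preempt(frame_len, bytes_sent):
    """True if an express frame may cut in at this point of transmission."""
    remaining = frame_len - bytes_sent
    return bytes_sent >= MIN_SENT_BYTES and remaining >= MIN_REMAIN_BYTES

print(can_preempt(1522, 100))  # True: both fragments are large enough
print(can_preempt(1522, 30))   # False: sent fragment still below 60 bytes
print(can_preempt(120, 60))    # False: only 60 bytes would remain (< 64)
```

Because a preemptable frame can be cut almost anywhere past its first 60 bytes, the guard band no longer has to cover a full 1522-byte frame, only the small remainder that cannot be fragmented.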

Prior to the introduction of TAS and FPE, the capability to<br />

determine the time to pre-fetch data from system memory into<br />

NIC internal memory and the transmission time of a packet from<br />

internal memory to physical line is available in Intel Ethernet<br />

Controller I210. Both of these times (prefetching and<br>
transmission) of a packet are calculated based on the per-packet<br>
LaunchTime value set (in 32-nanosecond units) in the associated<br>

transmission descriptor entry. LaunchTime is available on<br />

stream reservation (SR) transmission queue and not in best effort<br />

queue. By assigning SR queue with higher priority than best<br />

effort queue and proper configuration of LaunchTime, we can<br />

segregate time-sensitive traffics from best effort traffics, i.e., a<br />

close resemblance of traffic pattern modulated by TAS.<br />
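Since the LaunchTime value is expressed in 32-nanosecond units, a driver has to quantize the desired transmit time when filling the descriptor. The rounding policy below is an assumption for illustration, not taken from the I210 datasheet:

```python
LAUNCH_TIME_UNIT_NS = 32  # LaunchTime granularity stated for the I210

def to_launch_ticks(launch_ns):
    """Quantize a desired launch time (ns) to descriptor ticks.

    Rounding down is an assumed policy: it avoids launching later
    than requested; the hardware's actual behavior would need to be
    confirmed against the datasheet.
    """
    return launch_ns // LAUNCH_TIME_UNIT_NS

def quantization_error_ns(launch_ns):
    """Nanoseconds lost when the launch time is quantized to ticks."""
    return launch_ns - to_launch_ticks(launch_ns) * LAUNCH_TIME_UNIT_NS

print(to_launch_ticks(1_000_000))        # 31250 ticks for a 1 ms offset
print(quantization_error_ns(1_000_005))  # 5 ns lost to quantization
```

The sub-32 ns quantization error is far below the microsecond-scale latencies discussed here, which is why LaunchTime can approximate TAS-style traffic segregation.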

IV. CURRENT STATE OF ART: TSN SUPPORT IN LINUX<br />

This section provides a glimpse of what is currently<br />

supported in OpenAvnu project and Linux kernel as of this<br />

writing.<br />

A. OpenAvnu project<br />

Table I tabulates a partial list of software components<br />

available under the OpenAvnu [2] open source project. The I210 driver<br>
(igb_avb.ko) is an out-of-tree Linux kernel module (maintained<br>

outside of Linux mainline) with an intention to offer hardware<br />

capabilities such as (1) direct transmit and receive descriptor<br />

ring and data buffer access (known as media queue [12]) with<br />

LaunchTime configuration, (2) hardware PTP clock access, (3)<br>

CBS configuration, and (4) receive flex filter configuration<br />

through I210 user-space library (libigb). The driver<br />

architecture bypasses the TCP/IP suite in Linux networking<br />

stack and avoids its associated data path latency because<br />

IEEE1722 AVTP frame is Layer 2 protocol without TCP and IP<br />

headers.<br />

Essentially, OpenAvnu sample applications and gPTP<br />

daemon listed in Table 1 use the Application Programming<br />

Interface (API) defined by libigb directly for all time-sensitive<br>
data path operations. The drawbacks of this approach are<br>
its scaling capability as the Linux mainline evolves and the<br>
ease of porting other types of Ethernet NIC to the OpenAvnu<br>
framework.<br>

TABLE I. PARTIAL LIST OF OPENAVNU SOFTWARE<br>
Component | Directory Path | Description<br>
MRP Daemon | daemons/mrpd | MRP daemon that supports MSRP, MMRP & MVRP.<br>
gPTPd | daemons/gptp | gPTP daemon.<br>
MAAP | daemons/maap | MAC Address Acquisition Protocol for allocating multicast MAC address for AVTP.<br>
AVTP talker & listener | examples/simple_talker, examples/simple_rx | Sample AVTP talker & listener that register with MRPD and use the assigned VLAN ID & PCP for Ethernet traffic prioritization.<br>
AVTP Live Stream | examples/live_stream | Sample AVTP talker and listener that can be piped with other AV applications, e.g. GStreamer and ALSA.<br>
MRP Client | examples/mrp_client | Sample MRP client to demonstrate stream joining/leaving and MMRP or MVRP query.<br>
I210 driver & library | kmod/igb, lib/igb | Linux kernel module and library for Intel Ethernet NIC I210 to demonstrate AVB.<br>

B. Linux and its ecosystem<br />

The linuxptp project [13] contains time-synchronization user-space programs (ptp4l and phc2sys) to synchronize (1) the hardware PTP clock (in the Ethernet NIC) with the master clock (remote end station), and (2) the hardware PTP clock with the system clock (local end station). Unlike OpenAvnu gPTPd, the linuxptp project uses the APIs of modern Linux mainline and makes use of ethtool to query and set hardware time-stamping of the Ethernet NIC. The per-packet transmit or receive time-stamp value recorded in the transmit or receive descriptor by the Ethernet controller is subsequently passed to ptp4l as a control message with cmsg_level SOL_SOCKET, cmsg_type SCM_TIMESTAMPING and payload type struct scm_timestamping [14].<br>
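As an illustrative sketch (not taken from ptp4l itself), the receive side of this mechanism walks the ancillary data of a message looped back on the socket error queue. The structure below mirrors struct scm_timestamping from <linux/errqueue.h>; the helper name and the fallback constant are assumptions for this example.

```c
/* Sketch of how a user-space program such as ptp4l reads a hardware
 * timestamp back from the kernel: the sent packet is looped on the
 * socket error queue (recvmsg() with MSG_ERRQUEUE) with an attached
 * SOL_SOCKET / SCM_TIMESTAMPING control message. */
#include <string.h>
#include <sys/socket.h>
#include <time.h>

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING 37          /* equals SO_TIMESTAMPING on Linux */
#endif

/* Local mirror of struct scm_timestamping:
 * ts[0] = software, ts[1] = legacy, ts[2] = raw hardware timestamp. */
struct scm_timestamping_sketch {
    struct timespec ts[3];
};

/* Walk the control messages of a received msghdr and pull out the raw
 * hardware timestamp slot; returns 0 on success, -1 if not present. */
static int extract_hw_timestamp(struct msghdr *msg, struct timespec *hw)
{
    for (struct cmsghdr *c = CMSG_FIRSTHDR(msg); c != NULL;
         c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level == SOL_SOCKET &&
            c->cmsg_type == SCM_TIMESTAMPING) {
            struct scm_timestamping_sketch t;
            memcpy(&t, CMSG_DATA(c), sizeof(t));
            *hw = t.ts[2];           /* raw hardware timestamp slot */
            return 0;
        }
    }
    return -1;
}
```

In a real program the msghdr would be filled by recvmsg() on the error queue; the parsing loop itself is the part ptp4l-like tools share.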

The Linux traffic control subsystem provides traffic shaping on egress and traffic policing on ingress. The user-space application tc, developed in parallel with the Linux kernel, is used to configure the traffic control operation inside the kernel, e.g. queue disciplines (qdisc) for scheduling, classes for shaping and filters for classification or policing.<br>

IEEE 802.1Qav Credit Based Shaper support has recently been added to Linux mainline [15][16][17]. It is implemented as the cbs classful qdisc with user-configurable parameters (locredit, hicredit, sendslope and idleslope) set through the tc application, and it supports a software fallback for NICs that do not have hardware CBS capability. The Multiqueue Priority Qdisc (mqprio) is a classful qdisc that maps traffic flows to hardware queues: a traffic flow (as identified by the socket option SO_PRIORITY, 16 priorities in total) is mapped to a traffic class, which is 1:1 mapped to a hardware queue. To attach CBS to a hardware queue, a cbs qdisc is attached as a child of the mqprio qdisc. Thus, cbs qdisc support for per-traffic-class queues is available in Linux mainline.<br>

There has been recent discussion and a Request For Comments (RFC) submission on the netdev mailing list about LaunchTime technology [19]. The technique stores the per-packet LaunchTime value in a control message (cmsg) with cmsg_level SOL_SOCKET, cmsg_type SO_TXTIME and a 64-bit unsigned integer payload. Control messages are sent to the Linux kernel through the sendmsg() socket API. The tbs qdisc was proposed as a means to reorder transmit frames before they are committed to the network device hardware queue.<br>
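A minimal sketch of the per-packet interface proposed in [19] is shown below. Note that this interface was not yet in mainline at the time of writing; the SCM_TXTIME constant (which upstream later defined equal to SO_TXTIME, value 61) and the helper name are assumptions for this example.

```c
/* Sketch of the proposed per-packet LaunchTime interface: the
 * application places a 64-bit absolute transmit time (nanoseconds)
 * into a SOL_SOCKET control message handed to sendmsg(). */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SCM_TXTIME
#define SCM_TXTIME 61                /* matches the later mainline value */
#endif

/* Fill the first control message of msg with the launch time; the
 * caller must have set msg_control/msg_controllen to a buffer of at
 * least CMSG_SPACE(sizeof(uint64_t)) bytes before calling. */
static void set_launch_time(struct msghdr *msg, uint64_t txtime_ns)
{
    struct cmsghdr *c = CMSG_FIRSTHDR(msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_TXTIME;
    c->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
    memcpy(CMSG_DATA(c), &txtime_ns, sizeof(txtime_ns));
    msg->msg_controllen = CMSG_SPACE(sizeof(txtime_ns));
}
```

After this call the msghdr can be passed to sendmsg(); a qdisc such as the proposed tbs then releases the frame at the requested time.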

The decades-old ethtool application, developed in parallel with Linux mainline, has primarily been used to query and control general hardware settings of Ethernet NICs and PHYs. In addition, ethtool covers the settings for advanced receive flow classification and steering capabilities such as the receive flow hash indirection hardware table and the receive side scaling hash key. Based on discussion [18] on the Linux netdev mailing list, the community is leaning towards the Linux traffic control subsystem instead of ethtool for enabling the TSN-related traffic scheduling and shaping capabilities discussed above. To summarize the state of TSN support in Linux mainline today: there is no CBS support for per-stream queues, no per-traffic-class gate control (IEEE 802.1Qbv), no per-transmission-port frame preemption (IEEE 802.1Qbu) and no LaunchTime technology.<br>

V. CASE: LINUX DRIVER ENABLEMENT FOR TSN<br />

For the purpose of validating the health of a next-generation TSN-capable Ethernet NIC before design tape-in, we have developed a Linux kernel driver that covers the traffic shaping capabilities LaunchTime, IEEE 802.1Qbv and IEEE 802.1Qbu. As seen in Section IV, the TSN framework is still in the software architecture definition phase in Linux mainline as of this writing. Therefore, to ensure that source code developed today can be easily adapted to the future TSN framework in Linux mainline, we have taken a modular approach for the kernel driver, which has two sub-components: (1) the TSN Core Library and (2) the TSN Glue Logic. For LaunchTime, we adopted the approach contributed in [19]. In the ensuing sections, for brevity, we refer to IEEE 802.1Qbu frame preemption as FPE and to the IEEE 802.1Qbv Enhancements for Scheduled Traffic as EST.<br>

The TSN Core Library contains a collection of functions implemented for the TSN hardware capabilities in the Ethernet controller, serving two purposes: (1) hooks for the device driver frameworks' general initiation and setup (Fig. 1), and (2) run-time, user-driven input for TSN configuration (Fig. 2).<br>

Fig. 1. Functions for TSN Init, Setup, ISR and Enable/Disable.<br />

Fig. 2. Functions for TSN Configuration.<br />

Fig. 3. TSN Configuration via ethtool ioctl.<br />

The Linux kernel has well-defined PCI driver and network device frameworks that include a list of common management callback functions (hooks), as displayed in Fig. 1. Fig. 1 shows how these common management hooks are associated with six TSN Core Library functions, whose purposes are tabulated in Table II. Across multiple Linux versions, the design of the PCI driver and network device frameworks is fairly stable and mature, so the implementation of the functions in Table II is fairly future-proof. IEEE 802.1Qbu [11] Annex R.2 describes how the EST and FPE capabilities can be used in isolation, which is why separate enable/disable functions were developed. IEEE 802.3br [10] Section 99.4.2 describes the method to determine link partner preemption capability through the use of verify and response mPackets. When a user enables frame preemption through set_fpe_enable(), a verify mPacket is sent to the link partner automatically. In return, a frame-preemption-capable link partner sends a response mPacket, which is processed within the interrupt service routine fpe_irq_status(). Likewise, when the physical link does not support full-duplex mode, frame preemption is automatically disabled, as is the automatic mPacket response from the local station.<br>
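The handshake just described can be summarized by a small state sketch. This is an illustrative model of the behavior, not the actual driver source; the type and function names are hypothetical.

```c
/* Hedged sketch of the IEEE 802.3br verify/response handshake:
 * enabling FPE sends a verify mPacket; preemption becomes active only
 * after the link partner's response mPacket arrives, and a drop to
 * half-duplex disables it again. */
#include <stdbool.h>

enum fpe_state { FPE_OFF, FPE_VERIFYING, FPE_ACTIVE };

struct fpe {
    enum fpe_state state;
    bool verify_sent;                /* verify mPacket sent to partner */
};

static void set_fpe_enable_sketch(struct fpe *f, bool enable)
{
    if (enable && f->state == FPE_OFF) {
        f->verify_sent = true;       /* hardware emits the verify mPacket */
        f->state = FPE_VERIFYING;
    } else if (!enable) {
        f->state = FPE_OFF;          /* also stops the auto response */
        f->verify_sent = false;
    }
}

/* Called from the interrupt path when a response mPacket is received. */
static void fpe_irq_response_sketch(struct fpe *f)
{
    if (f->state == FPE_VERIFYING)
        f->state = FPE_ACTIVE;
}

/* Preemption requires full duplex; disable it on a half-duplex link. */
static void fpe_link_change_sketch(struct fpe *f, bool full_duplex)
{
    if (!full_duplex)
        set_fpe_enable_sketch(f, false);
}
```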

TABLE II. TSN CORE LIBRARY FUNCTIONS FOR DEVICE DRIVER FRAMEWORK<br>

Function Name | Description<br>
tsn_init | Discover hardware capabilities, e.g. support for FPE and EST and the depth of the gate control list.<br>
tsn_setup | Set up interrupts for TSN, e.g. gate control errors (EST) and preemption support in the link partner (a) (FPE).<br>
est_irq_status, fpe_irq_status | Interrupt service routines for EST and FPE.<br>
set_est_enable, set_fpe_enable | Enable/disable EST and FPE independently (b).<br>
a. IEEE 802.3br Section 99.4.2.<br>
b. IEEE 802.1Qbu Annex R.2, preemption used in isolation.<br>

TABLE III. TSN CORE LIBRARY FUNCTIONS FOR RUN-TIME CONFIGURATION<br>

Function Name | Description<br>
set_est_gce | Configure a gate control entry according to index: command, gate state and time interval. EST command: SetGateStates (a). FPE commands: Set-And-Hold-MAC (b) and Set-And-Release-MAC (b).<br>
set_est_gcl_len, get_est_gcl_len | Set/get the length of the gate control list as bounded by the depth of the list discovered by tsn_init().<br>
set_est_gcl_times | Set the gate-control-associated time parameters (c) AdminBaseTime, AdminCycleTime and AdminCycleTimeExtension, and trigger a GCL change from the Admin copy to the Operational copy.<br>
get_est_gc_cfgs | Get the gate control configurations in the state machines, i.e. the gate control list and its associated time parameters.<br>
reconfigure_cbs | Reconfigure the CBS IdleSlope parameter of a per-traffic-class queue based on total gate control open time (d).<br>
set_fpe_config, get_fpe_config | Set/get the preemption state of a traffic class queue.<br>
get_fpe_pmac_sts | Get the current pMAC holdRequest state (e).<br>
get_tsn_err_stat, clear_tsn_err_stat | Get/clear error status related to EST & FPE.<br>
set_tsn_tunables, get_tsn_tunables | Set/get tunable parameters for hardware tuning/offset and TSN standards, e.g. holdAdvance and releaseAdvance for FPE (e).<br>
a. IEEE 802.1Qbv Table 8-6.<br>
b. IEEE 802.1Qbu Table 8-6.<br>
c. IEEE 802.1Qbv Section 8.6.9.4.<br>
d. IEEE 802.1Qbv Section 8.6.8.2.<br>
e. IEEE 802.1Qbu Section 12.30.1.<br>

Fig. 2 shows the part of the TSN Core Library that is mainly developed for user-driven TSN run-time configuration, such as setting the gate control list and getting the holdRequest state of the pMAC. A functional summary of these TSN functions is provided in Table III. The Gate Control List (GCL) is a collection of gate control entries that define the gate operational state (open or closed) and the interval/duration of the operation. A gate control is 1:1 associated with a traffic-class queue. The set_est_gce() function is used repeatedly, each time with a different entry index, to program the GCL. To set the length of the gate control list, set_est_gcl_len() is used. The main reason for not implementing a set_est_gcl() that programs the entire GCL is to offer the flexibility of modifying individual gate control entries of an older GCL at will before committing the newly updated GCL to the state machine. The set_est_gcl_times() function is meant for setting time-related GCL parameters such as base time, cycle time and cycle time extension, and it implicitly triggers a GCL commit to the hardware, i.e. a switch of GCL ownership from the administrative copy to the operational copy, as described in IEEE 802.1Qbv [9] Section 8.6.9.4.7. Therefore, there is no need to offer a separate function to set ConfigChange. IEEE 802.1Qbv Section 8.6.8.2 discusses the formula to adjust the CBS idleSlope value inversely to the total gate open time during the gating cycle for the queue; it is implemented in reconfigure_cbs(), which is executed whenever (1) the idleSlope value set by the tc command changes, as discussed in Section IV part B, or (2) a new GCL or gate control entry is configured.<br>
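The inverse relationship can be illustrated with a small helper. This reflects our reading of the statement above (when a gate is open for only a fraction of the cycle, idleSlope is scaled up by cycleTime/gateOpenTime to preserve the reserved rate); it is a sketch, not a verbatim transcription of the standard's formula or of reconfigure_cbs().

```c
/* If a queue's gate is open for only open_ns out of every cycle_ns,
 * the credit-based shaper can transmit during that fraction alone, so
 * the configured idleSlope is scaled up to keep the effective reserved
 * bandwidth unchanged. */
static unsigned long long scaled_idle_slope(unsigned long long idle_slope_bps,
                                            unsigned long long cycle_ns,
                                            unsigned long long open_ns)
{
    return idle_slope_bps * cycle_ns / open_ns;
}
```

For example, a 10 Mbit/s idleSlope on a queue whose gate is open for 25% of the gating cycle would be reconfigured to 40 Mbit/s.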

The value of framePreemptionAdminStatus, as defined in IEEE 802.1Qbu [11] Section 12.30.1.1.1, deserves some further explanation. The “priority” mentioned in the standard for framePreemptionAdminStatus refers to the frame priority as defined by the VLAN tag PCP (8 priorities in total). The per-frame-priority value of framePreemptionAdminStatus specifies whether a frame with that priority shall be transmitted using the preemptable or the express MAC service. As multiple frames with different priorities may be enqueued to the same traffic class queue, the values of framePreemptionAdminStatus set for these priorities must be consistent. As discussed in Section IV part B, the priority of a transmit frame is specified per socket session through the socket option SO_PRIORITY (per-stream priority). With hardware offload, the mqprio qdisc maps these socket priorities to traffic class queues, which in turn are 1:1 mapped to hardware transmit queues in the Ethernet controller. Based on the above, we conclude that it is sufficient to specify the preemption configuration at the hardware queue level instead of the data stream level, and this is what the set_fpe_config() function was designed for.<br>

The TSN Glue Logic, shown in Fig. 2 and Fig. 3, is a thin layer of logic that binds the TSN Core Library to a higher-layer networking subsystem such as traffic control or ethtool. Before the discussion of the TSN framework on the Linux mailing list [18], and for ease of prototyping for hardware validation, the TSN Glue Logic was implemented to hook the TSN Core Library into the ethtool subsystem as shown in Fig. 3. The areas of development for this purpose are as follows:<br>

Linux kernel: include/uapi/linux/ethtool.h, include/linux/ethtool.h and net/core/ethtool.c.<br>

ethtool application [20]: ethtool-copy.h and ethtool.c.<br>

There are two ethtool.c files shown in Fig. 3, labelled Circle-1 and Circle-2, for user space and kernel space respectively. The user-space ethtool.c mainly contains operations to parse string-based inputs for commands and TSN parameters into the data structures newly introduced for TSN capabilities in ethtool-copy.h. The same set of data structure definitions is mirrored in include/uapi/linux/ethtool.h, a file in the Linux kernel project. The kernel-space ethtool.c, on the other hand, covers data marshalling operations, i.e. data copying between kernel and user space for the above-mentioned input parameters and responses, and eventually calls the ethtool_ops functions implemented in TSN Glue Logic, as labelled by Circle-3. The function prototypes of ethtool_ops are defined in include/linux/ethtool.h.<br>

Fig. 4 shows an excerpt of the TSN configuration implementation that uses the software architecture we have just discussed. The command ethtool set-gcl file FILE is for setting the administrative copy of the GCL. The content of the text file named FILE is a list of gate control entries that follow the format suggested in [18]: a gate control operation, a gate state and a time interval. The gate control operation is a character chosen from the set {S, H, R}, corresponding to the {SetGateStates, Set-And-Hold-MAC, Set-And-Release-MAC} gate control operations defined in IEEE 802.1Qbv and IEEE 802.1Qbu. The next field, the gate state, is simply a hexadecimal number which represents the state of the gate controls. For example, to set the 1st, 4th and 8th gate controls open for 10 microseconds and stop the pMAC from sending preemptable frames, the gate control entry in the FILE text file is “H 0x89 10000”.<br>

Fig. 4. An example of TSN configuration implementation.<br>

To set the GCL-associated time parameters, e.g. base time, cycle time and cycle time extension, the command ethtool set-est-info cycle N.D base N.D ext N.D is used, whereby N is the numerator value of the time in seconds and D is the denominator value of the time, with up to nanosecond accuracy.<br>

From Fig. 4, it should be clear that the ethtool set-gcl command calls the drv_set_gcl() function in TSN Glue Logic to parse the GCL stored in the ethtool_gcl data structure. For each of the gate control entries, drv_set_gcl() repetitively calls set_est_gce() to set the value of the respective gate control entry. Before drv_set_gcl() exits, it calls set_est_gcl_len() to set the length of the GCL. To set up the GCL-associated time-related parameters and commit the GCL to hardware, the ethtool set-est-info command is used. This command, as shown in Fig. 4, calls drv_set_est_info() and then set_est_gcl_times() to set the GCL-related time parameters.<br>

VI. EXAMPLES: APPLICATION OF TSN TECHNOLOGIES<br>

In this section, we take a look at how the various TSN technologies discussed earlier can be used to provide network service for three traffic patterns: scheduled, time-sensitive and best effort.<br>

Fig. 5. TSN technologies for a network with scheduled, time-sensitive and best-effort traffic.<br>



Fig. 5 shows an example of how CBS, gate control and frame preemption can be used in an end point that hosts various applications generating a mix of scheduled (industrial and automotive control), time-sensitive (A/V streams) and best-effort traffic.<br>

Scheduled traffic is periodic and short and by nature needs very low latency. These frames are enqueued to the highest-priority transmit queue, which in Fig. 5 is TxQ3. At the fixed period of the gating cycle, only the gate control of TxQ3 opens while the gate controls belonging to the other traffic class queues must close. The interval during which the TxQ3 gate is open is normally small in order not to add significant delay to frames from other queues. To hold back earlier preemptable frames from delaying the actual start time of a scheduled frame, the set-and-hold-MAC gate control operation is used. The scheduled frame from TxQ3 is serviced by the express MAC (eMAC) service.<br>

For time-sensitive traffic, SR Class A and Class B data streams are enqueued to TxQ2 and TxQ1, which are attached to independent credit-based shapers. The CBS parameters of each (idleSlope, sendSlope, locredit and hicredit) are set with different values to ensure that an SR Class A stream generates 8000 packets per second (125 µs apart) and a Class B stream generates 4000 packets per second (250 µs apart). A convenient Python script (calc_cbs_params.py) is provided in [15] for calculating these parameters based on the bandwidth allocated to each SR stream. The gate control pattern for TxQ2 and TxQ1 may be set as shown in the GCL of Fig. 5:<br>

At T02, Gate#2 and Gate#1 are opened and Gate#0 is closed; this helps prevent best-effort frames from delaying SR frames. In addition, the set-and-release-MAC gate control operation should be used to allow the preemptable MAC service to transmit the frames queued to it.<br>

At T03, Gate#2, Gate#1 and Gate#0 are all open; this avoids best-effort frames being starved by the scheduled and time-sensitive traffic.<br>

Fig. 6. TSN technologies for a network with time-sensitive and best-effort traffic.<br>

The strict-priority scheduler ensures that frames from TxQ2, TxQ1 and TxQ0 are selected for transmission through the preemptable MAC service in the correct traffic-class order.<br>

Fig. 6 shows the usage of CBS for SR Class A and Class B streams connected to the express MAC service, while best-effort traffic is sent through the preemptable MAC service. Such a configuration reduces the latency that a best-effort frame introduces to an SR frame to the transmission time of as little as 123 bytes, as explained in IEEE 802.1Qbu [11] Annex R.2. Without frame preemption, the latency imposed on an SR frame can be as large as a maximum-size frame, i.e. 1522 bytes.<br>
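Converting these byte counts into blocking time on a 100 Mbps link makes the benefit concrete; the helper below is a simple illustration, not part of the driver.

```c
/* Worst-case blocking of a time-critical frame behind a frame already
 * on the wire: without preemption the full 1522-byte maximum-size frame
 * must drain; with IEEE 802.1Qbu only a non-preemptable remainder on
 * the order of the 123 bytes quoted above remains. */
static double wire_time_us(unsigned bytes, double mbps)
{
    return bytes * 8.0 / mbps;   /* bits over Mbit/s yields microseconds */
}
```

At 100 Mbps, wire_time_us(1522, 100.0) is about 121.76 µs of blocking, versus roughly 9.84 µs for 123 bytes with preemption.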

VII. CONCLUSION<br />

In conclusion, we have discussed two important elements of TSN, time synchronization and traffic shaping, and provided a concise introduction to many TSN-related standards. The current state of the art of TSN-related software components, e.g. gPTPd, linuxptp, tc and ethtool, was discussed, and we concluded that at the time of this writing the TSN framework in the Linux networking subsystem is still in the definition stage. We have demonstrated how, with a modular software architecture (TSN Core Library and TSN Glue Logic), a TSN-capable Ethernet kernel driver can be developed for IP validation, with a solution that scales to the future TSN framework in the Linux kernel, which appears to be heavily based on the traffic control subsystem. Lastly, the paper provided two examples of how various TSN technologies may be used to cater for a network with a mix of traffic patterns: scheduled, time-sensitive and best effort.<br>

ACKNOWLEDGMENT<br />

Much of the discussion in this paper is derived from various IEEE standards, public mailing list discussions and numerous papers; the author would like to thank all of the contributors listed in the references section.<br>

In addition, the author would like to thank the Ethernet and TSN hardware and software teams in the Intel Corporation Internet of Things Group (IOTG) for countless hours of effort towards the creation and validation of TSN technologies in Intel products. Special mentions go to Mr. Gavin Hindman, Mr. Jesus Sanchez-Palencia, Mr. Kweh Hock Leong, Mr. Vinicius Gomes and Mr. Voon Weifeng of Intel Corporation for the good partnership in the Linux kernel driver development.<br>

REFERENCES<br />

[1] Avnu Alliance Delivers First TSN Conformance Tests for Industrial<br />

Devices. http://avnu.org/wp-content/uploads/2014/05/Avnu-SPS-IPC-<br />

Conformance-Testing-Release-FINAL-UPDATED.pdf<br />

[2] https://github.com/AVnu/OpenAvnu<br />

[3] IEEE, “IEEE Standard for a Precision Clock Synchronization Protocol<br>

for Networked Measurement and Control Systems”, IEEE Std 1588-<br />

2008.<br />

[4] IEEE, “Timing and Synchronization for Time-Sensitive Applications in<br />

Bridged Local Area Networks”, IEEE Std 802.1AS-2011.<br />

[5] IEEE, “Forwarding and Queuing Enhancements for Time-Sensitive<br />

Streams”, IEEE Std 802.1Qav-2009.<br />

[6] M. D. Johas Teener et al., "Heterogeneous Networks for Audio and Video:<br />

Using IEEE 802.1 Audio Video Bridging," in Proceedings of the IEEE,<br />

vol. 101, no. 11, pp. 2339-2354, Nov. 2013.<br />

[7] IEEE, “Stream Reservation Protocol”, IEEE Std 802.1Qat-2010.<br />



[8] Levi Pearson, “Stream Reservation Protocol”, Avnu Alliance Best<br />

Practices, Nov. 2014.<br>

[9] IEEE, “Enhancements for Scheduled Traffic”, IEEE Std 802.1Qbv-<br />

2015.<br />

[10] IEEE, “Specification and Management Parameters for Interspersing<br />

Express Traffic”, IEEE Std 802.3br-2016.<br />

[11] IEEE, “Frame Preemption”, IEEE Std 802.1Qbu-2016.<br />

[12] Eric Mann, “Linux Network Enabling Requirements for Audio/Video<br />

Bridging”, Linux Plumber 2012.<br />

https://linuxplumbers.ubicast.tv/videos/linux-network-enablingrequirements-for-audiovideo-bridging-avb/<br />

[13] https://sourceforge.net/projects/linuxptp/<br />

[14] https://www.kernel.org/doc/Documentation/networking/timestamping.txt<br>

[15] Vinicius Costa Gomes, “TSN: Add qdisc based config interface for CBS”,<br />

https://marc.info/?l=linux-netdev&m=150820212927379&w=2<br />

[16] Vinicius Costa Gomes, “net/sched: Introduce Credit Based Shaper (CBS)<br />

qdisc”, https://marc.info/?l=linux-netdev&m=150820214127384&w=2<br />

[17] Vinicius Costa Gomes, “net/sched: Add support for HW offloading for<br />

CBS”, https://marc.info/?l=linux-netdev&m=150820212927381&w=2<br />

[18] Vinicius Costa Gomes, “[RFC] TSN: Add qdisc-based config interfaces<br />

for traffic shapers”,<br />

https://marc.info/?l=linux-netdev&m=150422919415560&w=2<br />

[19] Jesus Sanchez-Palencia, “Time based packet transmission”,<br />

https://marc.info/?l=linux-netdev&m=151623051317593&w=2<br />

[20] The ethtool project.<br />

https://www.kernel.org/pub/software/network/ethtool/<br />



TSN and OPC UA for Industrial Automation<br />

Challenges in Getting Fieldbus-like Performance<br />

and Scalability within Convergent Networks<br />

Henke, Torben; Zahn, Peter; Frick, Florian; Lechler, Armin<br />

Institute for Control Engineering of Machine Tools and Manufacturing Units<br />

University of Stuttgart<br />

Stuttgart, Germany<br />

Torben.Henke@isw.uni-stuttgart.de<br />

Abstract—Convergent communication networks with a<br />

multitude of differing data flows are a promising approach to<br />

increase information consistency and device interoperability and<br />

so enable new applications and business models in industrial<br />

automation and production environments. A required key<br />

feature for the application of control applications is deterministic<br />

real-time behavior of such networks. With the presence of IEEE<br />

802.1 TSN, there is a promising vendor and application<br />

independent networking technology available. This paper<br />

analyzes typical parameters and requirements of fieldbus<br />

technology which is used today for real-time applications and<br />

their relation to TSN mechanisms within convergent networks.<br />

Also, the potential of future realizations using TSN in<br />

conjunction with OPC UA Pub/Sub is shown. Focus is laid on<br />

specific challenges which will require some further work to<br />

enable fieldbus-like performance over such infrastructures.<br />

Time Sensitive Networking (TSN), an extension of the IEEE 802.1 bridging standard that is currently under development, extends standard Ethernet with real-time capability. Thus, TSN makes it possible to replace a core functionality of existing fieldbus systems, namely deterministic data exchange, with a vendor- and industry-independent standard.<br>

For component providers, a uniform communication standard means a significant reduction in development effort. In addition, due to the expected wide adoption of TSN in the automotive, industrial, IT, entertainment and finance industries, network adapters will be available in large quantities and at correspondingly lower prices. This leads to lower equipment costs.<br>

Keywords—TSN, OPC UA Pub/Sub, fieldbus, communication<br />

I. INTRODUCTION<br>

In the course of Industry 4.0, machine communication across different layers is of increasing importance to enable new technical applications and business models. Direct data exchange between individual components of a production machine, within the factory network or even up to cloud infrastructures will be requested more and more.<br>

Industrial communication within single production machines and automation devices today is based on fieldbus technology, which guarantees real-time-capable communication and takes care of the semantics of data and device descriptions (profiles) as well as the configuration of the communication entities. A major part of the fieldbus standards does not allow direct communication between devices in the fieldbus network and the rest of the IT network using a common protocol. Furthermore, the multitude of different, non-interoperable fieldbus systems leads to high costs. On the one hand, this is due to higher development effort, as component manufacturers have to adapt their devices to the various fieldbuses. On the other hand, the communication hardware required to guarantee real-time behavior is only produced in small quantities, which makes it quite expensive.<br>

Fig. 1. Pyramid of automation<br />

II. INDUSTRIAL COMMUNICATION – STATE OF THE ART<br>

Industrial communication networks are vital resources for transferring information related to control, diagnostics, tracing and configuration within production environments. Different layers of communication can be distinguished; to date, the pyramid of automation is mostly used as the underlying model. Here, from bottom to top, the requirements concerning time determinism decrease while the abstraction and amount of data increase.<br>



On the field level, typically small telegrams with a short cycle time, often depending on control loop frequencies, are exchanged. While today several specialized fieldbus environments are used for that purpose, things get more challenging when using standard IT protocols. In the next section, the related requirements are presented.<br>

A. Fieldbus Technologies and their Requirements<br />

Field buses are used today for communication between field devices such as sensors, actuators and automation devices. Their main purpose is to transmit process data between the participants with deterministic real-time behavior, often in the form of cyclic telegrams. Additionally, service- and configuration-related data has to be considered. To ensure the deterministic behavior, the communication topology, the participants and the timing schedule are often configured statically within given bounds.<br>

The following describes typical requirements placed on fieldbus systems today. Their weighting depends on the respective application, resulting in different fieldbus environments being used for different applications.<br>

1) CYCLE TIME<br />

The cycle time specifies the length of one communication cycle, in which a data transfer between the participating network nodes takes place. Often, actual values, which were sampled at the beginning of the cycle, and set points, which become valid with the next cycle, are transferred. The cycle time is limited both by the transmission time of a packet and by the processing time in the subscribers. With a bandwidth of 100 Mbps on an Ethernet link, it takes 5.76 µs to send one frame of minimum size. If several frames are sent from/to a subscriber, the Inter Frame Gap (IFG) of 0.96 µs must be added, so that the minimum total duration per frame is 6.72 µs. In addition to the transmission time, there are also the propagation delay on the line and the time that the participants need to process the data.<br>
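These figures can be reproduced with a short calculation; note that our reading of the 5.76 µs value is that it includes the 8-byte preamble/SFD on top of the 64-byte minimum frame (64 bytes alone would take only 5.12 µs at 100 Mbps).

```c
/* Wire time of a minimum-size Ethernet frame at 100 Mbps:
 * 64-byte frame + 8-byte preamble/SFD = 72 bytes = 576 bits -> 5.76 us;
 * adding the 12-byte (0.96 us) Inter Frame Gap gives 6.72 us. */
static double frame_time_us(unsigned frame_bytes, int with_ifg)
{
    unsigned wire_bytes = frame_bytes + 8 + (with_ifg ? 12 : 0);
    return wire_bytes * 8.0 / 100.0;     /* bits over 100 Mbit/s -> us */
}
```

frame_time_us(64, 0) gives the single-frame figure, frame_time_us(64, 1) the back-to-back per-frame duration.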

Depending on the application, different demands are placed on cycle times. Typically, for decentralized control systems, 1 to 2 ms are sufficient, while cycle times between 25 and 250 µs are required for a central control system. As the cycle time corresponds to a dead time within an overlaid control loop, it is often also bounded from a control performance point of view.<br>

2) SYNCHRONICITY<br />

Synchronization of several entities is crucial, for example, in motion tasks, where several drives have to operate coupled at high speeds. A slight shift in the time-position profile of a drive may lead to poor product quality or even machine damage. Fieldbuses for such applications must therefore provide a common time base for all components; the accuracy requirements of the process determine the amount of allowed jitter. Various protocols and methods exist to compensate for uncertainties in communication delays as well as clock differences.<br>

3) RELIABILITY

Reliability is of importance in automation and control technology because a system failure leads to a standstill of the plant and thus to financial loss for the operator. In order to minimize such situations, various options for setting up redundant structures are available. One example is line redundancy, where several independent communication channels are established between the participants. If one is interrupted or disturbed, it is still possible to communicate via the others. If a frame is forwarded multiple times in the network, care has to be taken that it is processed only once at the receiver.
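The duplicate-discard requirement can be illustrated with a minimal filter. Real redundancy protocols (e.g. PRP/HSR) use bounded sequence-number windows; the unbounded set below is only a sketch of the idea, and the class and field names are invented for illustration.

```python
class DuplicateFilter:
    """Accept each (sender, sequence number) pair only once.

    Minimal illustration of duplicate elimination for line-redundant
    networks; real protocols use bounded drop windows instead of an
    unbounded set of seen keys.
    """
    def __init__(self):
        self.seen = set()

    def accept(self, sender, seq):
        key = (sender, seq)
        if key in self.seen:
            return False        # redundant copy from the second path
        self.seen.add(key)
        return True             # first copy: process it

f = DuplicateFilter()
print(f.accept("dev1", 7))   # True  - first copy is processed
print(f.accept("dev1", 7))   # False - duplicate is discarded
```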

4) NUMBER OF STATIONS

For decentralized control scenarios with many distributed stations, it is important that a fieldbus can handle a sufficient number of stations. The maximum number of participants is limited by two factors: on the one hand by the addressing range of the protocol, on the other hand by the amount of data exchanged between participants. If the amount of data exceeds the available bandwidth, the scheduled transmission cannot be finished within a cycle.

5) DATA RATE

To achieve short cycle times with a high number of subscribers, the available data rate and the amount of application payload are decisive. The latter depends on how the protocol is structured and how much data is transferred per subscriber. For an Ethernet-based fieldbus, for example, each data packet sent has a minimum size of 64 bytes. If it contains only two bytes of payload, bandwidth utilization is inefficient. A high data rate is also desirable in terms of minimum transmission delays.
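The efficiency penalty of small payloads can be quantified. The sketch below assumes an 18-byte Ethernet header/FCS, the 64-byte minimum frame, plus preamble/SFD and IFG; the exact overhead accounting is an assumption for illustration, not taken from the paper.

```python
def payload_efficiency(payload_bytes, min_frame=64):
    """Share of wire time that carries application payload for one frame.

    Assumes 18 bytes of Ethernet header/FCS, padding up to the 64-byte
    minimum frame, plus 8 bytes preamble/SFD and 12 byte times of IFG.
    """
    frame = max(min_frame, payload_bytes + 18)
    wire_bytes = frame + 8 + 12
    return payload_bytes / wire_bytes

print(f"{payload_efficiency(2):.1%}")     # ~2.4% for 2 bytes of payload
print(f"{payload_efficiency(1400):.1%}")  # ~97% for a large frame
```

With two bytes of payload in a minimum-size frame, barely 2–3% of the wire time is useful, which is exactly the inefficiency the text describes.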

Fig. 2. Overview: Requirements concerning fieldbus technologies<br />

B. Key mechanisms and representative protocols

To give an overview of the different mechanisms that can be used to fulfill the presented requirements, some typical fieldbus protocols are explained in the following section.

1) EtherCAT

EtherCAT uses so-called summing frames; all participants form a logical ring. The initial telegram is sent from the master to the first slave in the ring and from there in sequence from slave to slave before it finally reaches the master again. The summing frame acts as a placeholder for all data exchanged in the current cycle. The slaves know where in the summing frame their input data are located and to which address they should write their output data. In order to keep the processing delay in the slaves as low as possible, the summing frames are processed in hardware on the fly. This means that the frame is not received completely before it is processed; instead, the received data is continuously interpreted and forwarded, so that a delay of only a few bits occurs. By using summing telegrams within Ethernet frames, the protocol efficiency can be significantly increased.

In addition to the direct exchange of cyclic process data, it is possible to tunnel other communication through a specific mailbox mechanism. In order to avoid collisions in the network, only one master must be present; no other network device may transfer data on its own initiative, and devices that are not compliant must not be connected to the network. Direct communication between the devices is only possible to a limited extent: data can only be transferred directly from a slave arranged earlier in the logical ring to a later slave. In the other direction, the data must first be forwarded by the master in the next cycle.

As the master sends and receives standard Ethernet frames, no special hardware is required on its side. Only slave devices need a network interface capable of processing frames on the fly.
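The offset-based read/write scheme of the summing frame can be sketched in a few lines. The slave list, offsets, and 4-byte data sizes below are hypothetical; real EtherCAT slaves perform these accesses in hardware while the frame passes through, not on a complete buffer.

```python
# Hypothetical configuration: where each slave's data lives in the frame.
SLAVES = [
    {"name": "slave1", "in_off": 0, "out_off": 4},
    {"name": "slave2", "in_off": 8, "out_off": 12},
]

def cycle(frame, outputs):
    """One pass of the summing frame through the logical ring.

    Each slave reads its input bytes and writes its output bytes at
    fixed, pre-configured positions in the shared frame.
    """
    inputs = {}
    for s in SLAVES:                       # ring order: slave1, slave2, ...
        inputs[s["name"]] = bytes(frame[s["in_off"]:s["in_off"] + 4])
        frame[s["out_off"]:s["out_off"] + 4] = outputs[s["name"]]
    return inputs  # the master sees all written data when the frame returns

frame = bytearray(16)                      # the summing frame as placeholder
ins = cycle(frame, {"slave1": b"\x01\x02\x03\x04",
                    "slave2": b"\xaa\xbb\xcc\xdd"})
print(frame.hex())
```

Because every slave touches only its own fixed region, one frame can carry the process data of the whole ring in a single cycle.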

2) Profinet

Profinet works according to the provider-consumer model. All network participants send their data to the respective recipients at fixed, pre-configured points in time without being asked. If a controller sends output data, it acts as the provider of the data, and the receiving field devices are the consumers of this data. This is reversed for the input data.

The devices participating in a Profinet network are divided into three classes:

1. Supervisor: used to configure and diagnose participants of a Profinet network.
2. IO controller: PLCs and NC controls are located here.
3. IO device: all decentralized I/O field devices, such as bus terminals and drives.

Several IO controllers may be present in a Profinet network simultaneously, and an IO controller can consume the data of different IO devices. Cyclic communication between IO devices is not supported in Profinet.

Dynamic Frame Packing (DFP) makes it possible to transmit data to several subscribers within one Ethernet frame, similar to a summing frame. To do this, the participants must be placed in a line. Within a DFP frame, the Profinet frames for the participants are arranged in the order of the line. The first node takes its frame and forwards a new Ethernet frame together with the remaining Profinet frames for the following nodes.

3) EtherNet/IP

EtherNet/IP is based on the very common TCP/IP and UDP/IP layers, while the exchange of data between the nodes is carried out via the Common Industrial Protocol (CIP) from layer 5 of the ISO/OSI model upward. CIP is an object-oriented data protocol for communication between field devices and controllers. A CIP object consists of data and services, among other things, and a network node is described as a collection of objects. Within a node, the data of the objects is mapped to the internal data. Communication between nodes also takes place via so-called communication objects.

Two different communication relations are available: point-to-point connections using TCP/IP are used for the exchange of non-real-time data, while real-time data is exchanged via UDP/IP packets and transmitted as multicast. Here, four different modes are available:

- Cyclic transmission of the full payload
- Cyclic transmission of incremental data only
- Requesting individual nodes to send their data (polling)
- Requesting all nodes to send their data at the same time

In conclusion, TABLE I gives an overview of typical performance parameters of the presented fieldbus protocols. It shows that the requirements of the application have to be taken into account when selecting a protocol, as there is no global optimum.

TABLE I. COMPARISON OF TECHNICAL FIELDBUS PARAMETERS

                             EtherCAT              PROFINET              EtherNet/IP
Min. cycle time              11 µs                 31.25 µs              1 ms
Synchronicity                ± 20 ns               < 1 µs                < 200 ns
Data rate                    100 Mbit/s            1 Gbit/s              Ethernet max.
RT mechanism                 Exclusive bus access  Exclusive bus access  Quality of Service
Frame format                 Summing frame         Ethernet              TCP/IP
Topology                     arbitrary             arbitrary             arbitrary
Direct cross communication   no                    optional              yes

Due to the high demands on determinism and the incompatibility of most fieldbuses with standard Ethernet, it is necessary to separate devices belonging to Information Technology (IT) from those belonging to Operational Technology (OT). As a result, several physically independent networks are set up.

III. PATH TO CONVERGENT NETWORKS

One of the main justifications for fieldbuses is the lack of determinism in standard Ethernet. With TSN, however, standard Ethernet becomes real-time capable. This makes it possible to create so-called convergent networks, in which different traffic classes coexist, which offers various advantages over proprietary, sealed infrastructures. Separate cabling is no longer necessary. If all devices are in one single network, it is also possible to access individual devices directly from upper layers. This adds value for the customer, as no special tools or methods are needed anymore to access devices; they can instead be configured, for example, via an integrated web server. Also, direct access to huge amounts of production data allows them to be used for big-data analytics.



Another advantage of a convergent network is its lower technical complexity. This concerns both the network itself and the devices, as standardized hardware and stacks can be used. The complexity of the field devices is reduced by the fact that they no longer have to be adapted to several different fieldbuses. On the other hand, much more effort has to be put into the management of such complex networks.

When considering convergent networks for control and automation applications, some basic functionalities have to be provided. On the one hand, it is necessary to have a uniform time base throughout the entire network, which ensures synchronous operation of distributed systems. Time-deterministic data transmission is also mandatory and is currently the main obstacle for convergent networks. Another challenge to be solved is the configuration of such networks: in addition to the traditional network parameters determining switching and routing, a convergent network in control technology also includes timing-related parameters.

Besides the requirements of the network itself, there are further aspects that need to be considered. These include migration strategies for existing devices and protocols. Efficient methods for integrating such devices into a convergent network will be crucial for the acceptance of such novel communication infrastructures. Interoperability between devices is moving in a similar direction. Although it is often not absolutely necessary at the process-data level, the benefit from convergent networks correlates with the availability of data.

IV. BENEFITS FROM TSN & OPC UA

One technical solution that covers the requirements for such convergent networks is currently being specified by the IEEE under the term TSN. It comprises a collection of standards that extend the IEEE 802.1 Ethernet standard and cover various aspects of time-critical data transmission via Ethernet. Some standards, such as IEEE 802.1Qbv [1], deal with the time-deterministic transmission of data; others increase the transmission reliability of the data through redundancy. IEEE 802.1AS-Rev [2] specifies a time synchronization protocol; however, there are several preliminary variants currently in the field which are not fully interoperable.

TSN focuses only on the transport layer and does not specify higher-level protocols and descriptions. Here, OPC UA has gained increasing importance for enabling manufacturer-independent communication and is already widely used for diagnosis and configuration data, although to date it does not provide any real-time guarantees.

In addition to the pure data description, OPC UA also contains mechanisms for the verification and protection of data through encryption, which also makes it suitable for Industry 4.0 scenarios. OPC UA basically works according to the client-server principle, whereby a device can be both client and server simultaneously. This is not suitable for cyclic communication with low cycle times, as the overhead in the communication increases and bandwidth efficiency drops. A solution to this problem is the Publisher/Subscriber (Pub/Sub) extension for OPC UA [3]. Here, a publisher transmits its data, cyclically or on change, as multicast telegrams into the network, and other devices can subscribe to and receive this data. As a device can also take both roles at the same time, arbitrary communication relations can be established.

The combination of OPC UA Pub/Sub and TSN is expected to be the future multi-vendor communication solution in convergent networks. However, there are still some challenges to be solved before broad industrial application, as shown below.

1) Resource Consumption

For devices with limited resources, a full OPC UA stack can be challenging with respect to CPU, RAM, and the available memory resources. The OPC Foundation is currently working on a possible reduced communication model, in which the content of the messages and the position of individual data items within them are permanently configured at startup. This eliminates the need for a complete decoding of Pub/Sub messages and allows static access to the process data in them.

2) Network Configuration

Another challenge is the configuration of the network devices according to the needed communication relations. Several models for this are currently being developed which are not necessarily compatible. Also, they have not yet been integrated into common engineering tools, which is crucial for the industrial acceptance of the technology.

3) Scalability

A third challenge when using OPC UA Pub/Sub and TSN for control technology in convergent networks is the limited scalability, depending on the scenario. This can be seen from two different points of view when simply increasing the number of devices that participate in the cyclic communication, even with small payloads.

On the one hand, real-time data traffic can consume a large part of the network's bandwidth. In an example network with 100 devices that send 90 bytes (including protocol overhead) to the central controller every millisecond, this application occupies 72% of the bandwidth of a network with a data rate of 100 Mbps. Of course, network devices and links with 1 Gbps or more are common today. However, in industrial line installations with less powerful devices, lower speeds will still be used in the coming years for reasons of robustness or cost.

On the other hand, data transfer needs time, and several frames cannot be received at the same time by a centralized controller. If the data from all devices in this example shall be present at the controller at the same point in time, the earliest telegram has to be sent 720 µs in advance, which means the data are more outdated than in a scenario with fewer devices.
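Both figures follow from a short calculation. The illustrative sketch below treats the 90 bytes per device as the complete on-wire cost per frame (ignoring preamble and IFG, as the paper's numbers do).

```python
def cyclic_load(n_devices=100, bytes_per_frame=90,
                cycle_ms=1.0, data_rate_mbps=100):
    """Bandwidth share and serialization time of the cyclic traffic."""
    bits_per_cycle = n_devices * bytes_per_frame * 8
    capacity_bits = data_rate_mbps * 1e6 * cycle_ms / 1e3  # bits per cycle
    utilization = bits_per_cycle / capacity_bits
    serialization_us = bits_per_cycle / data_rate_mbps     # Mbps = bits/µs
    return utilization, serialization_us

util, t_us = cyclic_load()
print(f"bandwidth utilization: {util:.0%}")  # 72%
print(f"serialization time:    {t_us:.0f} us")  # 720 us
```

100 devices × 90 bytes × 8 bits = 72,000 bits per millisecond, i.e. 72% of the 100,000 bits a 100 Mbps link carries per cycle; sent back to back, those frames occupy the link for 720 µs.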

Furthermore, the high data volume, especially when using complex layered stacks, leads to a high computing load on the controller, which is then not available for the control application itself. This is especially important when using embedded devices with limited resources.

One possible solution could be suitable aggregation methods for many-to-one communication using Pub/Sub. Together with an exact time base in the network, the transmission time could be optimized for each device to guarantee minimum total latency of the transmission.

V. SUMMARY AND OUTLOOK

TSN together with OPC UA has the potential to be an enabling technology for convergent networks and thus to shape the future production environment. As many aspects are subject to current discussions and specification activities, several questions remain open to date, which need to be solved before broad industrial application.

To bridge the gap between network and automation device vendors as well as machine builders, the industry working group "TSN for Automation" [4] was established at the Institute for Control Engineering of Machine Tools and Manufacturing Units at the University of Stuttgart. Its scope is to build a bridge between the IT and automation worlds and to assist in getting started with the technology. The latest information from various related committees is provided, as well as an accessible laboratory together with a reference solution for TSN end devices. Solutions to the named challenges regarding scalability and configuration will be addressed in future research work in close cooperation with partners from industry.

REFERENCES

[1] LAN/MAN Standards Committee of the IEEE Computer Society: IEEE Std 802.1Qbv-2015 (Amendment to IEEE Std 802.1Q-2014 as amended by IEEE Std 802.1Qca-2015, IEEE Std 802.1Qcd-2015, IEEE Std 802.1Q-2014/Cor 1-2015), IEEE Standard for Local and metropolitan area networks—Bridges and Bridged Networks—Amendment 25: Enhancements for Scheduled Traffic, 2015.
[2] LAN/MAN Standards Committee of the IEEE Computer Society: IEEE 802.1: 802.1AS-Rev - Timing and Synchronization for Time-Sensitive Applications. URL: http://www.ieee802.org/1/pages/802.1ASrev.html
[3] OPC Foundation: OPC Unified Architecture Part 14: PubSub, Release Candidate 1.04.24, 2017.
[4] http://www.tsn4automation.com



Beyond the Capabilities of Wireshark: Effective and Efficient Generation of Mostly-Valid Messages for Bad-Case Testing of Communication Protocol Implementations

Dipl.-Phys. Andreas Walz, Prof. Dr.-Ing. Axel Sikora
Institute of Reliable Embedded Systems and Communication Electronics (ivESK)
Offenburg University of Applied Sciences


Introduction

(Security) testing of software is highly important
(Mostly) positive testing is much more common than negative testing
• Verify correct behavior of the DUT given valid and expected input
• What about mostly-valid, invalid, or generally unexpected input?
Negative (bad-case) testing
• Cumbersome for rich and complex message formats
• Numerous implicit and explicit consistency requirements
• Many ways messages can be invalid
• Involves parsing, interpretation, and manipulation of “on-the-wire” encoded messages
Here: a powerful concept & toolbox supporting negative testing campaigns with mostly-valid messages for complex communication protocol stacks

2018-03-01, embedded world Conference 2018


Agenda<br />

1. The hassle with negative testing<br />

2. The concept of Generic Message Trees (GMTs) and<br />

GMT manipulation operators<br />

3. How GMTs can help with negative testing …<br />

4. The TLS Presentation Language (TPL) and automatic<br />

code generation<br />



A Typical/Simple Test Setup

Transport Layer Security (TLS) implementations
• Server: the Device Under Test (DUT)
• Client: the test agent

Message flow: the TLS client (test agent) sends a ClientHello to the TLS server (DUT), which answers with a ServerHello.





A Typical/Simple Test Setup<br />

TLS Client<br />

Test Agent<br />

“read-only” <br />

160301011c0100011803035c6ac7070b212bef9e9c8c03c131802<br />

5e3b38257ff9e42bbb9bb4f004bdcc2e5000082c030c02cc028c0<br />

24c014c00a00a3009f006b006a0039003800880087c032c02ec02<br />

ac026c00fc005009d003d00350084c02fc02bc027c023c013c009<br />

00a2009e0067004000330032009a009900450044c031c02dc029c<br />

025c00ec004009c003c002f00960041c011c007c00cc002000500<br />

04c012c00800160013c00dc003000a00ff0100006d000b0004030<br />

00102000a00340032000e000d0019000b000c00180009000a0016<br />

00170008000600070014001500040005001200130001000200030<br />

00f0010001100230000000d0020001e0601060206030501050205<br />

03040104020403030103020303020102020203000f000101<br />

020000430303497a816ad6e3411002c14a172aad5935e4a7d9bd2<br />

d782a27658b7aa5cd1be7c21019c33f380b13a56a9dec0da6d89b<br />

08d9c01300000bff01000100000b00020100<br />

“On-the-wire” representation<br />

TLS Server<br />

DUT<br />







Structured Message Manipulation<br />

Parsing / dissecting<br />

Tree-like representation<br />

Flat “on-the-wire” representation<br />

160301011c0100011803035c6ac7070b212bef9e9c8c03c131802<br />

5e3b38257ff9e42bbb9bb4f004bdcc2e5000082c030c02cc028c0<br />

24c014c00a00a3009f006b006a0039003800880087c032c02ec02<br />

ac026c00fc005009d003d00350084c02fc02bc027c023c013c009<br />

00a2009e0067004000330032009a009900450044c031c02dc029c<br />

025c00ec004009c003c002f00960041c011c007c00cc002000500<br />

04c012c00800160013c00dc003000a00ff0100006d000b0004030<br />

00102000a00340032000e000d0019000b000c00180009000a0016<br />

00170008000600070014001500040005001200130001000200030<br />

00f0010001100230000000d0020001e0601060206030501050205<br />

03040104020403030103020303020102020203000f000101<br />

Serialization / encoding<br />



Generic Message Trees<br />

Protocol messages are represented as a tree data structure<br />

• Called Generic Message Trees (GMTs*)<br />

• Similar to parse trees<br />

• Structure (i.e. composition rules) given by protocol definition<br />

• Internal nodes: composite structures<br />

• Leaf nodes: atomic data with raw (binary) data representation<br />

A. Walz and A. Sikora, "Exploiting Dissent: Towards Fuzzing-based Differential Black Box Testing of TLS<br />

Implementations," in IEEE Transactions on Dependable and Secure Computing, doi: 10.1109/TDSC.2017.2763947<br />
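As a rough illustration of the GMT idea, a tree of composite (internal) and atomic (leaf) nodes that serializes back to a flat byte string might look as follows. The node names echo the TLS example from the slides; the classes and hex values are invented for illustration and are not the toolbox's actual data structures.

```python
class Leaf:
    """Atomic field: a name plus its raw (binary) data representation."""
    def __init__(self, name, raw):
        self.name, self.raw = name, raw
    def serialize(self):
        return self.raw

class Node:
    """Composite structure: serializes by concatenating its children."""
    def __init__(self, name, children):
        self.name, self.children = name, children
    def serialize(self):
        return b"".join(c.serialize() for c in self.children)

# Toy ClientHello fragment: record length 0x0006 covers the 4-byte
# handshake header plus the 2-byte version field.
msg = Node("TLSRecord", [
    Leaf("record_header", bytes.fromhex("1603010006")),
    Node("Handshake", [
        Leaf("handshake_header", bytes.fromhex("01000002")),
        Leaf("client_version", bytes.fromhex("0303")),
    ]),
])
print(msg.serialize().hex())  # flat "on-the-wire" representation
```

Parsing/dissecting builds such a tree from the flat bytes; serialization is the inverse walk shown here.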





Input Generation Strategy<br />

1. Use valid message as input<br />

2. Convert message to tree (GMT) representation<br />

3. Apply randomized or deterministic manipulation(s)<br />

4. Serialize to flat message and use as test message<br />

Random or deterministic message manipulation<br />



Message Manipulation (1)<br />

Different generic manipulation operators are available:<br />

Deterministic operators<br />

• Voiding / Removing operators<br />

Void or remove node/subtree<br />

• Duplicating operator<br />

Duplicate subtree and add as sibling<br />

Randomized (fuzzing) operators<br />

• Truncating fuzz operator<br />

Truncate subtree (may remove nodes)<br />

• Integer fuzz operator<br />

Randomize value of integer or<br />

enumeration field<br />

• Content fuzz operator<br />

Fill a leaf node with random content<br />

(random raw data)<br />

• Appending fuzz operator<br />

Append random content to a leaf node<br />



Message Manipulation (2)<br />

Two special operators:<br />

• Repairing operator<br />

Restore consistency among children<br />

of a tree node (from last to first)<br />

• Repairing fuzz operator<br />

Traverse tree towards root and<br />

apply repairing operator on each<br />

visited node with a fixed probability<br />



A Few Concrete Examples …<br />

1. Manipulating a Single Length Field<br />

2. Removing a certain message component<br />

3. Full-fledged randomized testing<br />





Example 1:<br />

Manipulating a Single Length Field<br />

TLSRecord record;<br />

record.dissect(message);<br />

record.propSet(".value@**/extensions/_N", 0);<br />





Example 2:<br />

Removing a Certain Message Component<br />

Remove the ec_point_formats extension from a TLS ClientHello message<br />



Example 2:<br />

Removing a Certain Message Component<br />

Implementation in C++ code using GMTs<br />

TLSRecord record;<br />

record.dissect(message);<br />

Cursor cursor(record);<br />

cursor.seekByPath("**/Extension:ec_point_formats%");<br />

cursor.doRemove();<br />

RepairingOperator repairer;<br />

repairer.apply(cursor);<br />



Example 2:<br />

Removing a Certain Message Component<br />

Before manipulation<br />



Example 2:<br />

Removing a Certain Message Component<br />

After manipulation<br />



Example 3:<br />

Randomized Testing: Input Generation<br />

Select and apply random operator<br />

Perform randomized repairing<br />

Recursive manipulation of subtree<br />

A. Walz and A. Sikora, "Exploiting Dissent: Towards Fuzzing-based Differential Black Box Testing of TLS<br />

Implementations," in IEEE Transactions on Dependable and Secure Computing, doi: 10.1109/TDSC.2017.2763947<br />



Example 3:<br />

Randomized Testing: Test Diversity<br />

A. Walz and A. Sikora, "Exploiting Dissent: Towards Fuzzing-based Differential Black Box Testing of TLS<br />

Implementations," in IEEE Transactions on Dependable and Secure Computing, doi: 10.1109/TDSC.2017.2763947<br />





Example 3:<br />

Randomized Testing: Identified Bug<br />

Inconsistent treatment of length fields by MatrixSSL 3.8.4<br />

A. Walz and A. Sikora, "Exploiting Dissent: Towards Fuzzing-based Differential Black Box Testing of TLS<br />

Implementations," in IEEE Transactions on Dependable and Secure Computing, doi: 10.1109/TDSC.2017.2763947<br />





GMT Approach<br />

Benefits:<br />

• User-friendly navigation through and access to message fields<br />

• Allows (semi-)automatic manipulation of messages<br />

• Allows rapid prototyping<br />

Available as open-source library (C++) under 3-clause BSD license<br />

[https://github.com/phantax/gmt-cpp]<br />

How to obtain format-specific dissectors?<br />



TLS Presentation Language (1)<br />

TLS Presentation Language (TPL)<br />

• Data format description language (not a programming language!)<br />

• Introduced with the draft specification of SSL by Netscape (now TLS)<br />

• Used to describe (“present”) the on-the-wire format of SSL/TLS messages<br />

• Enhanced version (eTPL*) used for automatic code generation of parsers/dissectors<br />

• Suitable not only for (D)TLS, but also other protocols<br />

[RFC 5246]<br />

struct {<br />

ProtocolVersion client_version;<br />

Random random;<br />

SessionID session_id;<br />

CipherSuite cipher_suites;<br />

CompressionMethod compression_methods;<br />

select (extensions_present) {<br />

case false: struct {};<br />

case true: Extension extensions;<br />

};<br />

} ClientHello;<br />


* A. Walz and A. Sikora, "eTPL: An enhanced version of the TLS<br />

presentation language suitable for automated parser generation,"<br />

IDAACS 2017, doi: 10.1109/IDAACS.2017.8095200


TLS Presentation Language (2)<br />

Elements of the (enhanced) TLS Presentation Language<br />

• Basic built-in types (integer fields)<br />

• Enumerated fields<br />

• Composite (constructed) types<br />

• Variants (dynamic choices within composite types)<br />

• Vectors (both fixed and variable length)<br />

[RFC 5246]<br />

struct {<br />

ProtocolVersion client_version;<br />

Random random;<br />

SessionID session_id;<br />

CipherSuite cipher_suites;<br />

CompressionMethod compression_methods;<br />

select (extensions_present) {<br />

case false: struct {};<br />

case true: Extension extensions;<br />

};<br />

} ClientHello;<br />



eTPL/GMT Tool Chain<br />

Tool-chain flow: a format definition written in (e)TPL is fed into the eTPL<br />

parser and code generator (etpl-tool, implemented in Python)<br />

[https://github.com/phantax/etpl-tool]; the generated format-specific C++ code,<br />

together with user code, builds on the generic GMT library (gmt-cpp)<br />

[https://github.com/phantax/gmt-cpp].<br />



Summary & Conclusion<br />

Addressed the hassle related to bad-case (negative) testing of protocol implementations<br />

• How to obtain mostly-valid test messages<br />

Built an efficient bidirectional bridge between two types of protocol message representations<br />

• Flat “on-the-wire” representation<br />

• Tree-like representation<br />

Presented the GMT concept and the eTPL/GMT tool chain<br />

• Systematic message manipulation on a structured message representation<br />

• All the message parsing/encoding done automatically<br />

• Dealing with complex messages can be made easy, efficient, and user-friendly<br />



Thank you for your attention!<br />

Questions… ?<br />

Prof. Dr. Axel Sikora, Dr.-Ing. Dipl.-Wirt.-Ing.<br />

Scientific Director<br />

Institute of Reliable Embedded Systems and Communication Electronics<br />

Andreas Walz, Dipl.-Phys.<br />

Research Engineer<br />

Institute of Reliable Embedded Systems and Communication Electronics<br />

Phone +49 (0)781 205-416<br />

Fax +49 (0)781 205-45 416<br />

axel.sikora@hs-offenburg.de<br />

Badstraße 24<br />

77652 Offenburg<br />

www.hs-offenburg.de<br />

Phone +49 (0)781 205-4803<br />

Fax +49 (0)781 205-45 4803<br />

andreas.walz@hs-offenburg.de<br />

Badstraße 24<br />

77652 Offenburg<br />

www.hs-offenburg.de<br />



References<br />

• A. Walz and A. Sikora, "Exploiting Dissent: Towards Fuzzing-based Differential Black Box<br />

Testing of TLS Implementations," in IEEE Transactions on Dependable and Secure<br />

Computing, doi: 10.1109/TDSC.2017.2763947<br />

• A. Walz and A. Sikora, "eTPL: An enhanced version of the TLS presentation language<br />

suitable for automated parser generation," IDAACS 2017, doi:<br />

10.1109/IDAACS.2017.8095200<br />

• gmt-cpp: https://github.com/phantax/gmt-cpp<br />

• etpl-tool: https://github.com/phantax/etpl-tool<br />



System and Device Design Recommendations for<br />

CAN FD Networks<br />

Holger Zeltwanger<br />

CAN in Automation (CiA) e. V.<br />

90429 Nuremberg, Germany<br />

headquarters@can-cia.org<br />

Abstract—Several semiconductor manufacturers have already<br />

implemented the CAN FD protocol in stand-alone controller<br />

chips as well as microcontrollers. Some OEMs (original equipment<br />

manufacturers) have started to integrate CAN FD networks in<br />

their in-vehicle network architectures. This paper provides some<br />

guidelines and recommendations, in particular for the bit-timing<br />

settings for the arbitration phase and the data-phase.<br />

Keywords—CAN FD, in-vehicle network, bit-timing, network<br />

topology, ringing suppression, phase-margin<br />

I. INTRODUCTION AND BASICS<br />

In 2012, the CAN FD (controller area network with flexible<br />

data-rate) protocol was launched at the 13 th international CAN<br />

Conference in Hambach castle. In the meantime, the protocol<br />

has been internationally standardized in ISO 11898-1:2015.<br />

This standard just specifies the CAN FD and the Classical<br />

CAN protocol. In order to avoid misunderstandings, it was<br />

agreed to use the term ISO CAN FD or just CAN FD, when the<br />

implementation complies with the ISO 16845:2016 conformance<br />

test plan. Implementations based on CAN FD version 1.0<br />

should be named non-ISO CAN FD.<br />

Transceiver chips and System Base Chips (SBCs) compliant<br />

with ISO 11898-2:2016 optionally support bit-rates up to 2<br />

Mbit/s or 5 Mbit/s. Parameters for higher bit-rates are not specified.<br />

Nevertheless, higher bit-rates can be achieved, for example by<br />

limiting the operating temperature.<br />

Neither ISO 11898-1:2015 nor ISO 11898-2:2016<br />

provides any system and device design recommendations,<br />

etc. The legacy standards contained some system and<br />

device design rules.<br />

The new editions of ISO 11898-1 and ISO 11898-2 are<br />

written for semiconductor manufacturers. Device designers<br />

need additional guidelines and recommendations for the CAN<br />

FD device interface. Normally, these depend on the CAN<br />

FD system design given by the OEM.<br />

The ISO 11898-1:2015 document does not specify the interface<br />

to the host controller in detail. It just gives some basic<br />

information, which is not sufficient for interoperability and<br />

system design aspects. For example, the oscillator frequency is<br />

not specified, because this is a device design issue. The CiA<br />

601-2 CAN controller interface specification recommends<br />

using 20 MHz, 40 MHz, or 80 MHz. Other frequencies should<br />

not be used. Another recommendation in this document is the<br />

number of bit-timing registers to be implemented. The ISO<br />

standard just requires a small register, which is sufficient for<br />

some bit-rate combinations. The CAN FD protocol may use<br />

two bit-rates: one for the arbitration phase and another one or<br />

the same for the data-phase. In case of using a large ratio between<br />

arbitration and data-phase bit-times, the standardized<br />

size of the bit-timing registers is not appropriate. Therefore the<br />

CiA 601-2 document recommends for the arbitration-phase<br />

register a programmability of 5 time-quanta (tq) to 385 tq. The<br />

configurability for the data-phase register should be the range<br />

from 4 tq to 49 tq. Additionally, the CiA 601-2 specification<br />

contains some recommendations regarding interrupt sources<br />

and message buffer behaviors.<br />

In order to understand the ISO 11898-2:2016 standard from<br />

a device designer’s point-of-view, the CiA 601-1 specification<br />

provides some useful information about the transceiver loop<br />

delay symmetry, the bit-timing symmetry, and the transmitter delay<br />

compensation (TDC). This document explains how to interpret<br />

and consider the parameters given by the transceiver chip suppliers.<br />

II. BIT-TIMING SETTINGS FOR CAN FD<br />

A. General guidelines<br />

As said, the ISO 11898 series does not specify device or system<br />

design aspects. In order to achieve interoperability of devices,<br />

the bit-timing should be the same in all nodes. This is<br />

nothing new for engineers familiar with Classical CAN network<br />

designs. However, in Classical CAN networks there are<br />

some tolerances allowed regarding the bit-timing settings. They<br />

are necessary, when nodes with different oscillator frequencies<br />

are in the same network. Typically, the sample-point (SP) is<br />

given as a range such as 85 % to 90 % with a nominal value of<br />

87,5 % (CANopen). The SP is between the phase segment 1<br />

and the phase segment 2 of a bit-time. The bit-time comprises<br />

the synchronization segment (always one time-quantum), the<br />

propagation segment, the phase segments 1 and 2.<br />

In CAN FD networks, the rules and recommendations<br />

need to be stricter, because higher bit-rates bring the network<br />

closer to the physical limits. Of course, when not using<br />

the bit-rate switch function, the bit-timing is as in Classical<br />

CAN. But when using two bit-rates, the system designer should<br />

take care that all nodes apply the very same bit-timing settings.<br />

The nonprofit SAE (Society of Automotive Engineers) International<br />

association developed two recommended practices<br />

for CAN FD node and system designers. The SAE J2284-4<br />

document specifies a bus-line network running at 2 Mbit/s with<br />

all necessary device and system parameters including the bit-timing<br />

settings. The SAE J2284-5 document does the same for<br />

a point-to-point CAN FD communication running at 5 Mbit/s.<br />

The given parameter values derive mainly from General<br />

Motors' first CAN FD system designs. Reading between the<br />

lines, the specification can also be adapted for other topologies<br />

and bit-rates.<br />

The Japan Automotive Software Platform and Architecture<br />

(JasPar) association also develops guidelines for CAN FD<br />

device and system design. The Japanese nonprofit group cooperates<br />

with CAN in Automation (CiA). Both associations exchange<br />

documents and comment on each other's drafts. Recently,<br />

there was a joint meeting to discuss the ringing suppression, in<br />

order to achieve higher bit-rates or to support hybrid topologies<br />

such as multi-star networks.<br />

Currently, CiA has released its CiA 601-3 document. Besides<br />

the oscillator frequency (see above), it recommends the<br />

bit-timing configuration and gives some optimization hints for the<br />

phase margin. This includes recommendations for the topology<br />

and the device design (especially limiting parasitic capacitance).<br />

B. Bit-timing configuration recommendations<br />

The bit-timing configuration has two aspects: Setting the<br />

nominal time-quantum for the arbitration phase and the data<br />

time-quantum for the data-phase, as well as setting the related<br />

sample-points including the secondary sample-point (SSP)<br />

in the data-phase, when the TDC is used.<br />

The recommendations given below consider that with each<br />

resynchronization, a receiving node can correct a phase error of<br />

sjw_D in the data-phase and sjw_A in the arbitration phase. The<br />

larger the ratio sjw_D:BT_D, the larger the resulting CAN clock<br />

tolerance in the data-phase. The same holds for the arbitration<br />

phase with sjw_A:BT_A. The absolute number of resynchronizations<br />

per unit of time increases towards higher bit-rates. However,<br />

the absolute value of sjw_D or sjw_A decreases proportionally<br />

with the bit-time. In other words, a higher bit-rate leads to<br />

more, but smaller resynchronizations. A CAN FD node performs<br />

the bit-rate switching at the SP of the BRS (bit-rate<br />

switch) bit and the CRC (cyclic redundancy check) delimiter<br />

bit. All three available SPs (arbitration phase SP, data phase<br />

SP, and data phase SSP) can be chosen independently.<br />

In the arbitration phase, the nodes are synchronized and<br />

need the propagation segment as a waiting time for the round-trip<br />

of the bit-signal. In the data-phase, the nodes are not synchronized;<br />

therefore, no delays need to be considered. Nevertheless,<br />

phase segment 1 should be large enough to guarantee<br />

a stable signal.<br />

For the data-phase bit-timing settings all the following recommendations<br />

should be considered. For the arbitration phase,<br />

just recommendations 3 and 5 apply.<br />

Recommendation 1: Choose the highest available CAN<br />

clock frequency<br />

This allows shorter values for the tq. Use only recommended<br />

CAN clock frequencies (see above).<br />

Recommendation 2: Set the BRP_A bit-rate prescaler equal<br />

to BRP_D<br />

This leads to identical tq values in both phases and prevents<br />

an existing quantization error from turning into a phase<br />

error during bit-rate switching inside the CAN FD data frame.<br />

Recommendation 3: Choose BRP_A and BRP_D as low as<br />

possible<br />

Lower BRPs lead to shorter tq, which allows a higher resolution<br />

of the bit-time. This has the advantage that the SP can be<br />

placed more accurately at the optimal position. The size of the<br />

synchronization segment is shorter and reduces the quantization<br />

error. Additionally, the receiving node can synchronize<br />

more accurately to the transmitting node, which increases the<br />

available robustness.<br />

Recommendation 4: Configure all CAN FD nodes to have<br />

the same arbitration phase SP and the same data phase SP<br />

The simplest way to achieve this is to use the identical bit-timing<br />

configuration in all CAN nodes. This is not always<br />

possible, when different CAN clock frequencies are used. Within a<br />

node, the arbitration phase SP and the data phase SP can differ<br />

without any impact on robustness. Different SPs across the CAN<br />

FD nodes, however, reduce robustness, because they lead to different<br />

lengths of the BRS bits and CRC delimiter bits in the different<br />

nodes and a phase error introduced by the bit-rate switching.<br />

The SSP can be different in the CAN nodes without influencing<br />

robustness.<br />

Recommendation 5: Choose sjw_D and sjw_A as large as possible<br />

The maximal possible values are min(ps1_A/D, ps2_A/D). A<br />

large sjw_A value allows the CAN node to resynchronize quickly<br />

to the transmitting node. A large sjw_D value maximizes the<br />

CAN clock tolerance.<br />

Recommendation 6: Enable TDC for data bit-rates higher<br />

than 1 Mbit/s<br />

In this case, BRP_D shall be set to 1 or 2 (see ISO 11898-<br />

1:2015). It is not recommended to configure the TDC with a<br />

fixed value, because of the large transmitter delay variations.<br />

C. SP positioning<br />

The SP locations of the arbitration phase and the data-phase<br />

may be different. If in the arbitration phase the SP is at the very<br />

far end of the bit-time, the maximum possible network length<br />

can be achieved. Sampling earlier reduces the achievable network<br />

length, but increases robustness. A value of higher than<br />

80 % is not recommended for automotive applications due to<br />

robustness reasons.<br />

956


[Figure: n CAN nodes (CAN node 1 … CAN node n), each µC connected via TX/RX to a transceiver driving CAN_H/CAN_L of the CAN network topology under test; scope probes capture TX1 (trigger), the bus at two points, and RX1/RX2. Trigger once at all nodes and measure once at all nodes for all trigger positions; result: matrix with n² measurements.]<br />
Fig. 1. Measurement setup for the evaluation of the asymmetry introduced by the topology (source: CiA 601-3)<br />

The SP location in the data-phase depends on the maximum<br />

possible bit asymmetries. There are two asymmetries, one for<br />

the worst lengthening of dominant bits (A1) and another for the<br />

worst shortening of dominant bits (A2) in a given network setup.<br />

Both values are given normally in ns. Both values are the<br />

sum of asymmetries caused by the physical network elements<br />

including transceiver, cabling, connectors, and optional circuitry<br />

(e.g. galvanic isolation). In order to avoid compensations,<br />

absolute values are added. ISO 11898-2:2016 specifies the<br />

asymmetry values for 2 Mbit/s and for 5 Mbit/s qualified transceivers.<br />

The asymmetries caused by the other physical network<br />

components are given by datasheets or needs to be estimated or<br />

measured. The system designer selects the worst-case connections<br />

in network and calculates or measures the both asymmetry<br />

values. Another option is to simulate it. There are providers<br />

offering such simulation services.<br />

A1_topology and A2_topology values are different for every communication<br />

relationship. This means in a setup with n CAN<br />

nodes there are n² values for A1_topology and n² values for<br />

A2_topology. To represent the worst case, the maximal A1_topology<br />

and the maximal A2_topology values are used to calculate A1 and A2,<br />

respectively. With the CiA 601-3 specification, CiA provides a<br />

spreadsheet to prove the robustness of the chosen bit-timing<br />

settings and the sample-points.<br />

D. Phase margin (PM) calculation<br />

The PM is the allowed shift of a bit edge towards the SP of<br />

the bit, at a given tolerance of the CAN clock frequency (df_used).<br />

In other words, this is the edge shift caused by physical layer<br />

effects that is still tolerated by the CAN protocol.<br />

The worst-case bit sequence, i.e. the one that leads to the lowest<br />

PMs, is when the transmitting node sends five dominant bits<br />

followed by one recessive stuff bit (for details see /CiA601-1/).<br />

This is the longest possible sequence of dominant bits followed<br />

by a recessive bit inside a frame. Current transceiver designs<br />

cause the largest bit asymmetry at this bit sequence, i.e. the<br />

recessive bit is typically shorter than its nominal value. Further<br />

effects additionally raise the asymmetry: e.g. asymmetric rise<br />

and fall times, bus topology, EMC jitter, etc.<br />

The PM1 and PM2 values, given in s (seconds), can be calculated<br />

by the equations (1) and (2) defined in the CiA 601-3 specification,<br />

with PM1 = phase margin 1, PM2 = phase margin 2, BT_D =<br />

data-phase bit-time, PS2_D = data-phase phase segment 2, and<br />

df_used = the applied CAN clock tolerance.<br />

III. OPTIMIZATION HINTS<br />

The transceiver chips or the SBCs cause a significant part<br />

of the overall asymmetry. Therefore, it is recommended to<br />

always use components qualified for higher bit-rates. Even for<br />

2-Mbit/s CAN FD networks, 5-Mbit/s-qualified chips should be<br />

chosen.<br />

A badly designed wiring harness can add considerable asymmetry.<br />

The following recommendations should be considered:<br />

• Use a linear topology, terminated at both ends.<br />

• Reduce the total bus length.<br />

• Limit the number of CAN nodes.<br />

• Avoid long, unterminated stubs, which are branches<br />

from the well-terminated CAN lines; use stubs in the<br />

“cm-range” instead of the “m-range”. Consider a high-ohmic<br />

termination of otherwise unterminated stubs.<br />


• Optimize the low-ohmic termination (resistor position<br />

and resistor value). Another option is to increase the<br />

low-ohmic termination resistance (e.g. 124 Ω instead<br />

of 120 Ω) to compensate for the high-ohmic terminations<br />

in systems with many nodes.<br />

• Reduce the number of stubs per star point. The more<br />

stubs are connected to one star point, the higher the reflection<br />

factor gets.<br />

• If a star point with many branches is required<br />

due to mechanical constraints, avoid identical stub<br />

lengths per star point.<br />

• If multiple star points are required, keep a significant<br />

distance between the star points.<br />

• Cable cross-section: increase it to approximately 2 ×<br />

0,35 mm² for the CAN_H and CAN_L wires.<br />

Besides these system design recommendations, the device<br />

designer should consider the following hints:<br />

• Limit the parasitic capacitance of the device. The parasitic<br />

capacitance of the device includes the following<br />

parameters: additional ESD protection elements; parasitic<br />

capacitance of the connector; parasitic capacitance<br />

of the CAN_H or CAN_L wire; parasitic capacitance<br />

of the CMC; the parasitic capacitance of the transceiver<br />

input pins. All this parasitic capacitance should be<br />

below 80 pF per channel.<br />

• CAN_H and CAN_L PCB tracks from connector to<br />

transceiver should be of equal length and routed in parallel.<br />

• Keep the TXD and RXD PCB tracks between host<br />

controller and transceiver short.<br />

• Configure the host controller TXD output pin with<br />

strong push-pull behavior: a pull-up or pull-down resistor<br />

behavior can cause additional asymmetries and<br />

propagation delays.<br />

• Avoid any serial components like logical gates or resistors<br />

within the TXD and RXD connection lines between<br />

host controller and transceiver. In case galvanic<br />

isolation is required, take care of the potential additional<br />

asymmetry and select components accordingly.<br />

• Use a CAN clock source with lower clock jitter.<br />

• Avoid galvanic isolation, or use a galvanic isolation solution<br />

that adds only a small asymmetry.<br />

In order to optimize the PM, the following hints should be<br />

considered:<br />

• Optimize the bit-timing configuration by reducing the<br />

tq length. This increases PM1 by reducing the quantization<br />

error.<br />

• Use a CAN clock with lower tolerance (df_used). This<br />

improves PM1 and PM2.<br />

IV. OTHER PHYSICAL LAYER OPTIONS<br />

Besides galvanic isolation, there are some other options,<br />

which the system and device designer may consider. European<br />

carmakers often use common-mode chokes, for example. Further<br />

add-on circuitry includes a split-termination (two 60-Ω<br />

resistors) with a capacitor to ground.<br />

The CiA community discusses a ringing suppression option,<br />

which will be specified in the CiA 601-4 document. It is<br />

still under development. In general, such ringing suppression<br />

circuitry dynamically changes the network impedance to reduce<br />

the ringing at the beginning of the bit-time. Before the SP,<br />

the impedance is dynamically switched back to the nominal<br />

value. Two approaches are discussed:<br />

• Ringing suppression circuitry on the critical receiving<br />

nodes (CiA 601-4 version 1.0)<br />

• Ringing suppression circuitry on the transmitting nodes<br />

The updated CiA 601-4 will just specify the requirements<br />

and not the implementations. The automotive industry is highly<br />

interested in ringing suppression. It would allow higher<br />

bit-rates (8 Mbit/s is desired) or tolerate higher asymmetries<br />

caused by the network topology.<br />

The common-mode choke specification for CAN FD networks<br />

will be given in CiA 601-6. This document is also under<br />

development. It will mainly contain recommendations on how<br />

to measure the values to be provided in datasheets. The goal is<br />

to make datasheet values more comparable than they are today.<br />

CiA members are also working on a cable specification<br />

(CiA 110). It is intended to define parameters and how to<br />

measure them, in order to make datasheets comparable.<br />

V. SUMMARY AND OUTLOOK<br />

The ISO 11898 series does not provide device and system<br />

design recommendations and specifications. CiA, JasPar, and<br />

SAE give those in their documents. Some of these documents<br />

are still under development. CiA, for example, provides in the<br />

CiA 601 series additional node and system design recommendations<br />

and design guidelines.<br />

A requirement specification for ringing suppression circuitry<br />

is under development (CiA 601-4). Guidelines for common-mode<br />

chokes are also in preparation (CiA 601-6).<br />

REFERENCES<br />

[1] Arthur Mutter: "Robustness of a CAN FD bus system – About oscillator<br />

tolerance and edge deviations”, 14th international CAN Conference<br />

(iCC 2013), Paris, France, 2013<br />

[2] Florian Hartwich: “The configuration of the CAN bit-timing”, 6th<br />

international CAN Conference (iCC 1999), Turin, Italy, 1999<br />

[3] Marc Schreiner: “CAN FD system design”, 15th international CAN<br />

Conference (iCC 2015), Vienna, Austria, 2015<br />

[4] Y. Horii, “Ringing suppression technology to achieve higher data rates<br />

using CAN FD,” 15th international CAN Conference (iCC 2015),<br />

Vienna, Austria, 2015<br />

[5] CAN Newsletter magazine 2012 to 2017 (several articles), Nuremberg,<br />

Germany.<br />

[6] CiA 601 series, Nuremberg, Germany<br />



CANopen FD<br />

Embedded network as base for IoT applications<br />

Reiner Zitzmann<br />

CAN in Automation GmbH<br />

Nuremberg, Germany<br />

headquarters@can-cia.org<br />

Abstract—In 2012, Bosch presented the improved CAN with flexible data rate (CAN FD). Since then, the international standardization of CAN FD has been finalized, the conformance test plan has been released, implementation guidelines for system and device design are available (CiA 600 document series), and the first microcontrollers from several manufacturers are on the market. As CAN FD is therefore ready to use, CAN in Automation and its members would like to offer the advantages of CAN FD to CANopen users as well. Therefore the CiA working group SIG application layer has prepared the improved and simple-to-use CANopen FD. CANopen FD combines the advantages of the well-accepted CANopen with being well equipped to meet future requirements in embedded networking.<br />

CANopen FD offers a lengthened PDO that provides high data throughput for big-data applications with a large data base. The larger PDO payload additionally accommodates sophisticated security measures. The new USDO, which replaces the classical SDO, supports high design flexibility: any CANopen FD device is able to access any other CANopen FD device. In contrast to classical CANopen, no system designer is required; cross-communication between CANopen FD devices can be established dynamically at runtime. This supports not only the trend toward flexible systems, where the end user acts as system designer by adding or removing system components at runtime. Configuration and diagnostics are also simplified, as a tool can access any network participant at runtime.<br />

CANopen FD was released by CAN in Automation in the document CiA 1301. In addition to the aforementioned features, CANopen FD provides an improved EMCY write service and a comprehensive error history, including time stamps and detailed information on the type of error. Nevertheless, most of the well-known CANopen functionality was kept, so that CANopen users can easily transition to CANopen FD and can reuse most of their CANopen know-how.<br />

Keywords—CAN FD, CANopen FD, Internet of things (IoT),<br />

Big data, Security<br />

I. INTRODUCTION<br />

More and more applications can be controlled and monitored via web-based applications, e.g. from a tablet or smartphone. These web-based applications, such as temperature control in a private household or the rental of a bike from a bike-sharing provider, rely on data that is often generated in an embedded or deeply embedded network of the application to be monitored or even controlled. To enable as many reasonable web-based applications as possible in the future, system designers of embedded networks will provide huge amounts of data in addition to the control data that is actually required.<br />
As a consequence of this trend, we can forecast a higher demand for data throughput and a high degree of communication flexibility at the embedded network level. To meet these requirements in the best way, CAN in Automation (CiA) has updated the well-accepted CANopen application layer and communication profile so that it combines the advantages of the new CAN FD data link layer with those of classical CANopen.<br />

II. CAN WITH FLEXIBLE DATA RATE (CAN FD)<br />

In 2012, on the occasion of CiA’s 13th international CAN conference, Bosch presented the improved CAN, called CAN FD [1]. It is as reliable as classical CAN but enables the user to achieve a higher data throughput in the embedded network.<br />

[Figure 1 shows the structure of a CAN FD frame: an arbitration phase at 50 kbit/s to 1 Mbit/s, a data transmission phase carrying up to 64 byte of data at a freely selectable transmission rate, and an ACK phase at 50 kbit/s to 1 Mbit/s.]<br />
FIGURE 1 – CAN WITH FLEXIBLE DATA RATE (CAN FD)<br />

As illustrated in Figure 1, a CAN FD frame is able to carry<br />

up to 64 byte of data. To increase the efficiency, the data field<br />

of a CAN FD frame can be transmitted with a higher<br />

transmission speed. Because of additional features for checking<br />



the data integrity [9], CAN FD can provide data at least with<br />

the same reliability as classical CAN.<br />

As many car manufacturers plan to use CAN FD to complement automotive Ethernet, we can expect microcontrollers with integrated CAN FD controllers to become as widely available as they are today for classical CAN. An improved CAN is thus available, combining robustness, flexibility, and simplicity with high data throughput and increased reliability. By updating CANopen with regard to CAN FD, CiA intends to make these attributes available to CANopen users.<br />

III. CANOPEN FD<br />

A. Impacts of CAN FD on CANopen<br />

In order to offer the benefits of CAN FD to their users, CAN-based higher layer protocols need to be adapted. Among others, the standardized CANopen protocol [5], successfully deployed in many projects, was updated as well. As a result, in September 2017, CAN in Automation released the CAN FD-based successor of CANopen, CANopen FD [6].<br />

At the beginning of the review process, the CAN in Automation (CiA) working group “SIG application layer” evaluated [2] which of the existing CANopen services would benefit from CAN FD and its increased data throughput. CANopen services can be divided into data-transport-oriented services and network-management-oriented services. A detailed examination of the network-management-oriented services shows that these services do not suffer from any bandwidth issues; most of them use only a few bytes of the classical CAN data frame. An overview is provided in Table I.<br />

TABLE I. CANOPEN NETWORK-MANAGEMENT-ORIENTED SERVICES<br />
CANopen service | Used data bytes<br />
Synchronization service | 0 to 1 byte<br />
Time stamp service | 6 byte<br />
EMCY write service | 8 byte<br />
Network management service | 2 byte<br />
Error control services | 1 byte<br />

Only the EMCY write service, which is intended for diagnostic tasks, utilizes the maximum size of eight byte of a classical CAN frame. A closer examination shows, however, that the EMCY write service uses just three standardized data bytes; the remaining data field can optionally be filled with manufacturer-specific error information. As a result of the evaluation of the network-management-related services, the SIG application layer decided to keep these services and the related protocols unchanged in an updated, CAN FD-capable version of CANopen [4]. Only the EMCY write service is to be reorganized to provide more detailed error diagnostic information, including a time stamp.<br />
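As an illustration of that classical frame layout, the sketch below packs an 8-byte EMCY data field. It assumes the CiA 301 coding of the three standardized bytes (16-bit error code plus 8-bit error register) followed by up to five manufacturer-specific bytes; the helper name is our own, not part of any specification.<br />

```python
import struct

def pack_emcy(error_code: int, error_register: int,
              manufacturer: bytes = b"") -> bytes:
    """Pack a classical 8-byte CANopen EMCY data field.

    Layout assumed here (CiA 301): 16-bit error code (little endian),
    8-bit error register, then up to 5 manufacturer-specific bytes,
    zero-padded to fill the classical CAN frame.
    """
    if len(manufacturer) > 5:
        raise ValueError("at most 5 manufacturer-specific bytes")
    return struct.pack("<HB", error_code, error_register) + \
           manufacturer.ljust(5, b"\x00")
```

For example, `pack_emcy(0x8130, 0x11)` yields an 8-byte field in which only the three standardized bytes are populated.<br />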

Subsequently, the CiA working group evaluated the data-transport-oriented services of CANopen. As illustrated in Table II, these services are limited by the size of the classical CAN frame data field. Mapping these services to CAN FD offers the possibility of providing enhanced functionality to the user.<br />

TABLE II. CANOPEN DATA-TRANSPORT-ORIENTED SERVICES<br />
CANopen service | Used data bytes<br />
Service data object, expedited | Payload limited to 4 byte by classical CAN frame<br />
Service data object, segmented | Payload limited to 7 byte per classical CAN frame<br />
Service data object, block transfer | Payload limited to 7 byte per classical CAN frame<br />
Process data object | Payload limited to 8 byte by classical CAN frame<br />
Multiplex process data object | Payload limited to 4 byte by classical CAN frame<br />

Both the Process Data Object (PDO) and the different types of Service Data Object (SDO) are limited by the size of the classical CAN data field. Mapping these objects to CAN FD adds increased performance to CANopen [3].<br />

B. Adaptation of the Process Data Object (PDO)<br />

CANopen PDOs, intended for high-priority command and status information, are determined by two parameter sets: the PDO communication parameters and the PDO mapping parameters. In general, the description of PDOs can remain unchanged. The communication parameters specify the CAN identifier used for a PDO, the event that triggers the transmission of a PDO, and some busload management features. Only the decision as to which kind of CAN frame is used for the PDO depends on whether a classical CAN or a CAN FD data link layer is present. Everything else is completely independent of the data link layer used and can therefore remain unchanged. As CANopen FD uses the CAN FD frame format throughout, no additional settings selecting the type of CAN-ID are required. As a result, the entire PDO communication parameter set can remain unchanged.<br />

With regard to the PDO mapping parameters, the result is rather similar. Currently, the content of a PDO is determined by a link list that allows linking 64 different parameters to one PDO. If the smallest unit to be linked to a PDO has the size of one byte, the existing table of 64 references is sufficient to fill a PDO with 64 byte of payload. In this regard it is very advantageous that the SIG application layer decided to maintain only data in the CANopen object dictionary whose size is an integer multiple of 1 byte. Therefore, the style of today’s CANopen PDO mapping parameter sets can be reused for CANopen FD in the same way. With regard to the description of PDOs, no adaptations are required from the specification’s point of view. Stack manufacturers and users have to be aware that PDOs can no longer be triggered by means of CAN remote frames, as CANopen FD does not support them. The implementation of PDOs can remain largely unchanged; only today’s 8-byte limit has to be adapted to the CAN FD frame sizes used (see Table III).<br />



TABLE III. SIZE OF DATA FIELD IN CAN FD FRAMES<br />
CAN FD DLC | Size of CAN FD data field<br />
0 | 0 byte<br />
1 | 1 byte<br />
2 | 2 byte<br />
3 | 3 byte<br />
4 | 4 byte<br />
5 | 5 byte<br />
6 | 6 byte<br />
7 | 7 byte<br />
8 | 8 byte<br />
9 | 12 byte<br />
10 | 16 byte<br />
11 | 20 byte<br />
12 | 24 byte<br />
13 | 32 byte<br />
14 | 48 byte<br />
15 | 64 byte<br />

Users have to be aware that the run-time verification of PDO lengths is of limited use in CAN FD-based systems [3]. As CAN FD increases the size of the data field in fixed steps, misconfigurations may be harder to detect. If, for example, due to a configuration error a transmitter sends just 17 instead of the intended 18 process data bytes in a PDO, a receiver cannot detect this error by checking the length of the CAN FD frame: in either case it receives a 20-byte data field, as this is the next supported CAN FD frame size.<br />
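The stepwise padding described above can be made concrete with a small helper. This is a generic sketch: the size values follow Table III, while the function name is our own, not taken from CiA 1301.<br />

```python
# CAN FD DLC (0..15) to data field size in bytes, as listed in Table III.
CANFD_DLC_TO_LEN = [0, 1, 2, 3, 4, 5, 6, 7, 8, 12, 16, 20, 24, 32, 48, 64]

def padded_size(payload_len: int) -> int:
    """Return the size of the CAN FD data field actually transmitted."""
    if not 0 <= payload_len <= 64:
        raise ValueError("CAN FD payload is limited to 0..64 bytes")
    # Pick the smallest supported data field that holds the payload.
    return next(size for size in CANFD_DLC_TO_LEN if size >= payload_len)

# A misconfigured 17-byte PDO and the intended 18-byte PDO both travel
# in a 20-byte data field, so a pure length check cannot tell them apart:
assert padded_size(17) == padded_size(18) == 20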

C. Adaptation of the Service Data Object (SDO)<br />

In contrast to the PDO, adapting the SDO to CAN FD was much more complex. Today’s SDOs use a bit-coded command specifier to distinguish between the different SDO protocols. Unfortunately, there is no coding left to indicate a new, enhanced, CAN FD-based SDO service, so there was no way to simply extend the well-known CANopen SDO service. In addition, the members of the SIG application layer saw limitations in the existing SDO service, especially for systems that are modified at run-time, e.g. by the end user; such systems would require a system integrator to configure the necessary cross-communication. Furthermore, the members of the SIG application layer recommend a strict separation between systems based on classical CAN and on CAN FD. The SIG application layer therefore discarded the idea of adapting the well-known SDO service and instead introduced an entirely new CANopen service, the Universal SDO (USDO).<br />

D. Universal Service Data Object (USDO)<br />

Introducing a new service offers the chance to overcome the limits of the existing SDO service. In times of the Internet of Things (IoT) and Industry 4.0, embedded networks such as CANopen face more and more challenges. There may be a demand for remote diagnostics, remote monitoring, or remote (fleet) management of the applications. To meet the requirements of such applications, high design flexibility is demanded: communication links are not necessarily known in advance but are established during runtime. Furthermore, the end user accessing an embedded network remotely is not necessarily an expert in CAN and CANopen. As end users may only be familiar with their application, they would prefer logical addressing over geographical addressing with many CANopen details (e.g. a command “set temperature in the living room to 22 °C” might be more convenient than the corresponding CANopen command “write 22 to index 3000h, sub-index 05h, in the device with CANopen node number 35”). Furthermore, if an embedded network is managed from outside the network, communication services that encapsulate many single tasks in one comprehensive service would be appreciated. If, for example, the very same configuration has to be applied to several or all devices, a confirmed broadcast and multicast service would be of interest. For accessing cascaded CANopen systems, an inherently routing-capable service would be appreciated.<br />

To be prepared for future challenges, the USDO provides solutions for all of these requirements. The USDO, as illustrated in Figure 2, allows unicast, multicast, and broadcast communication relationships between one client and one or several servers. The coding of the “Destination address” determines whether one or several servers are addressed.<br />

[Figure 2 shows the USDO download protocol between client and server. The USDO download request carries the fields Destination address (byte 0), Command specifier (byte 1), Session ID (byte 2), Index (bytes 3 and 4), Sub-index (byte 5), Data type (byte 6), Size (byte 7), and application data (bytes 8 up to 63). The USDO download response carries Destination address, Command specifier, Session ID, Index, and Sub-index (bytes 0 to 5). The destination address is coded as follows: 00h = broadcast (to all nodes); 01h to 7Fh = unicast (to the node with the indicated node-ID); 80h to FFh = multicast (to the nodes that are part of the indicated group).]<br />
FIGURE 2 – CANOPEN FD USDO DOWNLOAD<br />
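The destination-address ranges of Figure 2 can be expressed as a small classifier. This is an illustrative sketch with a function name of our own choosing, following the coding listed above.<br />

```python
def usdo_destination_kind(addr: int) -> str:
    """Classify a one-byte USDO destination address (coding of Figure 2)."""
    if not 0x00 <= addr <= 0xFF:
        raise ValueError("destination address must fit in one byte")
    if addr == 0x00:
        return "broadcast"   # to all nodes
    if addr <= 0x7F:
        return "unicast"     # to the node with the indicated node-ID
    return "multicast"       # to the nodes of the indicated group
```

For example, `usdo_destination_kind(0x23)` classifies the address as unicast to node-ID 23h.<br />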

In addition to a “Command specifier”, which informs the server about the type of the intended access, a “Session ID” is provided in the protocol. This allows one client to run several USDO transfers in parallel with the very same server. In contrast to today’s SDOs, the USDO therefore offers the opportunity to execute a program download to a device and to monitor the process via an additional USDO communication session. During such a USDO access, the client can upload or download single entries of a USDO server’s object dictionary. Furthermore, the USDO may provide source as well as final destination address information, as illustrated in Figure 3.<br />



[Figure 3 compares the local and the long-distance (remote) USDO upload request. The local request carries Local destination address, Command specifier, Session ID, Index, and Sub-index (bytes 0 to 5). The long-distance request additionally carries Destination network-ID, Destination node-ID, Source network-ID, and Source node-ID (bytes 0 to 9). The network-ID is coded as follows: 00h = network-ID unknown, only limited USDO remote protocol handling possible; 01h to 7Fh = unique network-ID; FFh = unconfigured CANopen FD device.]<br />
FIGURE 3 – COMPARISON OF LOCAL AND REMOTE USDO UPLOAD REQUEST<br />

As the network-ID and node-ID of the initial client (source) and of the final server (destination) are always provided in the USDO remote access format, data exchange across local network borders can be realized rather simply.<br />
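As a sketch, the addressing part of such a long-distance request could be modeled as follows. The field and method names are our own illustration; the actual on-wire coding is defined in CiA 1301, and only the network-ID value ranges of Figure 3 are used here.<br />

```python
from dataclasses import dataclass

@dataclass
class RemoteUsdoAddress:
    """Addressing fields of a long-distance USDO request (after Figure 3)."""
    dest_network_id: int
    dest_node_id: int
    src_network_id: int
    src_node_id: int

    def routable(self) -> bool:
        # 01h to 7Fh marks a unique network-ID; 00h means the network-ID
        # is unknown (only limited remote handling possible) and FFh marks
        # an unconfigured CANopen FD device, so neither allows full routing.
        return all(0x01 <= nid <= 0x7F
                   for nid in (self.dest_network_id, self.src_network_id))
```

With both network-IDs in the unique range, e.g. `RemoteUsdoAddress(1, 35, 2, 10)`, the request can be routed across network borders; an unknown (00h) network-ID cannot.<br />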

In one of the next versions, the USDO shall be given the ability to transfer complete arrays or records by means of Multiple Sub-index Access (MAS). This will simplify the transfer of data structures. In addition, the SIG application layer intends to meet IoT requirements such as logical addressing, which will be derived from the solution currently being developed by the SIG CANopen Internet of Things.<br />

E. Current status of CANopen FD<br />

The basic CANopen FD specification CiA 1301 was released in September 2017, and further specifications will follow. Currently, CiA working groups are specifying network start-up and management, including dynamic node-ID, network-ID, and bit timing adjustment. As a prerequisite for the obligatory conformance testing, the related CiA working groups are focusing on updating the conformance test plan as well as the electronic device description. In parallel, CiA working groups are evaluating CiA device and application profiles, with the focus on making the best use of the new CANopen FD features.<br />

IV. CANOPEN FD AND IOT<br />

Today’s and future embedded systems are faced with the requirement of generating the data base for many web-based applications. Lots of data have to be provided to a huge data pool so that many web applications can generate added value and provide convenient functions to customers. For predictive maintenance, for example, lots of data, such as the working hours in total, since the last service, or today, have to be communicated in addition to the pure control data. The increased data throughput caused by communicating these “nice-to-have” data can be handled rather easily by means of the lengthened CANopen FD PDOs as well as the higher communication speed.<br />

For remote diagnostics, system maintainers would like to have remote access to the embedded network, e.g. to upload diagnostic data or to update the firmware of old-fashioned devices.<br />

To enable such use cases in a simple way, CANopen FD can reuse emerging solutions that are currently being developed by the CiA working group SIG CANopen_IoT. This working group defines the “user handling” within a CANopen (FD)-to-IoT gateway [7]. Depending on the registered user and his or her user class, the gateway application provides gateway resources to that user. In this context, “resources” may be access rights (read or write) to dedicated network participants, memory and processing power in the gateway, etc.<br />

The CANopen FD USDO makes it easy to dynamically establish communication channels to any CANopen FD device in a sub-layered CANopen FD network architecture. So that an external user can learn what the CANopen FD network architecture looks like, system discovery services are introduced by the SIG CANopen_IoT. For the purposes of system discovery, configuration, and diagnostics, new attributes in the electronic device description and configuration file formats provide a mapping of a generic function to the application-specific use case [7]. Along with the so-called nodelist information, GraphML-capable tools, for example, are intended to provide an application-related visualization of the entire control system. An IoT application accesses the gateway and can discover the sub-layered CANopen (FD) system, e.g. based on the nodelist.graphml files. As soon as the system is known, the IoT application can use the remote CANopen access services according to CiA 309 [8].<br />

Currently, the SIG CANopen_IoT specifies the mapping of the generic access services to HTTP requests. This should free the IoT application from needing knowledge of any CANopen (FD) specifics. The logical addressing will additionally support this: CANopen device descriptions already allow logical addressing based on the reference designation system, and CANopen FD will complement this by enhancing the CANopen FD USDO with logical addressing.<br />

Connecting an embedded network to the IoT immediately raises the issue of preventing unintended access to that embedded network. The working group TF security is currently developing a solution that allows CANopen FD users to guarantee that only the intended parties get access to specific data. The TF benefits from the larger payload of the exchanged CAN FD frames, which allows, for example, the transfer of security keys in a single CANopen FD PDO.<br />

V. SUMMARY<br />

CiA released CANopen FD in the specification CiA 1301 in September 2017. Several CANopen FD protocol stack manufacturers have tested the new CANopen FD features on the occasion of CiA plug fests.<br />
On the one hand, CANopen FD keeps the basic attributes of the well-known CANopen; on the other hand, it enriches CANopen with extended data throughput and higher design flexibility. Therefore, today’s CANopen users can reuse most of their existing CANopen knowledge when migrating from CANopen to CANopen FD and can focus on the new functionality. Depending on the communication bit rates used, they can use CANopen FD in existing network topologies.<br />
CANopen FD users will benefit from the new USDO, which serves in CANopen FD as a multi-function tool.<br />



Establishing any kind of communication relationship, dynamically and depending on the use case, will meet the requirements of future system design.<br />
The larger payload provided by CAN FD data frames will support the introduction of safety and security solutions in CANopen FD systems. CiA working groups are currently working on this subject.<br />
The basic specification CiA 1301 has been released. Microcontrollers with integrated CAN FD controllers are available from several semiconductor manufacturers. CiA provides recommendations for CAN FD device design and system design. Everything is therefore available for CANopen networking in the next decade.<br />

REFERENCES<br />

[1] CAN in Automation, Florian Hartwich, Robert Bosch GmbH, CAN with<br />

Flexible Data-Rate, Proceedings of the 13th international CAN<br />

Conference<br />

[2] CAN in Automation, Heinz-Jürgen Oertel, Using CAN with flexible<br />

data-rate in CANopen systems, Proceedings of the 13th international<br />

CAN Conference<br />

[3] CAN in Automation, Dr. Martin Merkel, Ixxat Automation GmbH,<br />

CANopen on CAN FD, Proceedings of the 14th international CAN<br />

Conference<br />

[4] CAN in Automation, SIG application layer meeting minutes 2013 to<br />

2017<br />

[5] CiA 301, CANopen application layer and communication profile,<br />

Version 4.2<br />

[6] CiA 1301, CANopen FD application layer and communication profile,<br />

Version 1.0<br />

[7] CAN in Automation, SIG CANopen Internet of Things (IoT) meeting<br />

minutes 2012 to 2018<br />

[8] CiA 309, CANopen access from other networks – Part 1: General<br />

principles and services, Version 2.0<br />

[9] ISO 11898-1:2015. Road vehicles – Controller area network – Part 1:<br />

Data link layer and physical signalling<br />



High Integrity Software Is Fundamental to<br />

Autonomous Embedded Systems<br />

Jeffrey Fortin<br />

Vector Software<br />

Vector Informatik<br />

East Greenwich, RI USA<br />

jeffrey.fortin@vectorcast.com<br />

The use of autonomous systems is expected to grow substantially over the next decade. This emerging technology is a disruptive force in the market, enabling new business models that challenge current mobility solutions. But will these autonomous systems be trusted and accepted by the public? For these predictions to come true, the new autonomous systems must have at least the same level of integrity as the existing solutions we use today.<br />

These new autonomous embedded systems are expected to<br />

have a significant amount of functionality implemented in<br />

software. This is foreshadowed in the recent trend of software<br />

playing a significant role in automation systems in general.<br />

Understanding the behavior of this software is the key to assessing its integrity. To meet the goal of a fully automated driving system, for instance, we must, at a minimum, adopt the proven methods for developing high integrity systems currently used by the safety-critical industries.<br />

The automobile industry itself has taken the lead in establishing standards for software integrity. ISO 26262 and MISRA are the two software standards that apply to the verification and validation of vehicle-based software.<br />

The application of these methods must be balanced with business objectives. The systems must be neither under-tested nor over-tested, with the amount and level of testing determined by the level of risk.<br />

Keywords—Autonomous; Automated Software Testing; High Integrity Software; Safety; ISO 26262; MISRA<br />

I. INTRODUCTION<br />

Integrity means being trustworthy. High Integrity means<br />

having a high degree of trust. A fundamental principle of an<br />

autonomous embedded system is that it must be trusted to do<br />

the right thing. It is this trust that provides the value of the<br />

system.<br />

To address this issue of trust, the automobile industry has<br />

established standards for software integrity. ISO 26262 and<br />

MISRA are the two standards that apply to the verification and validation of vehicle-based software. ISO 26262 is a functional safety standard entitled “Road vehicles – Functional safety”.<br />

The standard is an adaptation of the Functional Safety standard<br />

IEC 61508 for electrical/electronic/programmable electronic<br />

safety-related systems. Part 6 of the ISO 26262 standard<br />

addresses the recommendations for software testing and<br />

verification as part of the standard for software development.<br />

Recommended activities include both unit level and system<br />

level testing such as functional tests and structural coverage<br />

tests. Test tools that support capture and reporting of structural<br />

code coverage are highly recommended in the standard for all<br />

Automotive Safety Integrity Levels (ASIL) defined by ISO<br />

26262.<br />

These standards must, however, be practical from a business perspective. The level of testing effort must be correlated to the associated level of risk.<br />

replaced with an automated repeatable software quality testing<br />

process allowing for rapid innovation (agility) while at the<br />

same time maintaining the integrity levels mandatory for an<br />

autonomous embedded system.<br />

II. AN AUTOMATION EXAMPLE<br />

Let us take an example from the automotive industry. Early<br />

automobiles required a high degree of operator interaction for<br />

the automobile to function. The operator had to be mindful of a<br />

wide array of complex interactions and be skilled in<br />

understanding the correct settings and operations required to<br />

run the automobile. The Ford Model-T was a very popular<br />

early automobile. For the owners of these vehicles, here are the<br />

steps that were needed to start the Model-T:<br />

1. Check the fuel level by raising the front seat cushion, inserting a dipstick, and verifying that you have enough fuel.<br />
2. Check the oil by going underneath the car and opening the top petcock; if oil drips out, you have enough oil.<br />



3. Making sure the ignition is switched off, go to the front of the car and prime the engine by pulling the choke and turning the engine crank.<br />

4. Get in the car, turn on the ignition and adjust the<br />

spark advance to the top, open the throttle slightly.<br />

5. Get out of the car and turn the crank, making sure<br />

to use your left arm, that way if the engine should<br />

backfire there is less chance of breaking your arm.<br />

6. If all goes well, the car will start.<br />

Over time these systems, such as the starter and the carburetor choke, became automatic, meaning the operator no longer needed to worry about that aspect of the operation of the<br />

car. (It must have been a great relief to no longer have to risk<br />

breaking an arm just to start the car.) These improved systems<br />

worked automatically and could be relied on to do the right<br />

thing. They also improved the overall safety of the car. In<br />

today’s modern automobile we have moved beyond<br />

mechanical automation systems to using Electronic Control<br />

Units (ECUs), leveraging the power of computers and software<br />

to provide sophisticated control systems for anti-lock braking,<br />

collision detection, and lane change warnings just to name a<br />

few. Ultimately this pattern continues and now we see the<br />

emergence of Advanced Driver Assistance Systems (ADAS)<br />

and Automated Driving Systems promising a future when<br />

manually operating an automobile will be considered reckless<br />

behavior.<br />

III. AUTOMATION IS NOT THE SAME AS AUTONOMOUS<br />

Automation systems surround us in our daily lives and have<br />

become an essential part of industry. Robotics, industrial<br />

control, integrated avionics systems and medical devices all<br />

leverage automation. But when we use the term “autonomous”<br />

we are really taking these systems to a new level. A familiar example use case is the autonomous vehicle: a vehicle capable of being navigated and maneuvered not under the control of a human but under computer control across a range of driving situations. Because of this higher level of automation, the risks are higher, making it all the more important to ensure the integrity levels necessary for these systems to be trusted and successful in the marketplace.<br />

IV. STEPS NEEDED TO DEVELOP AN AUTONOMOUS EMBEDDED SYSTEM<br />

To develop an autonomous system, we can look to the steps<br />

that are used currently for the development of high integrity<br />

systems.<br />

A. Requirements<br />

The fundamental starting point for development is a set of testable requirements. Taking the time to ensure the requirements are well understood and testable is paramount.<br />

Software bugs are often the result of poorly written<br />

requirements. When developers are given incomplete<br />

requirements, they will make assumptions to fill in the gaps.<br />

These assumptions are then encoded into the system and, in<br />

most cases, are not consistent with the overall system<br />

requirements. Ensure your requirements are correct and can be<br />

tested.<br />

B. Using Test-Driven Development<br />

An effective way to ensure your requirements are testable<br />

and correct is to use a Test-driven development approach. In<br />

this approach the tests are written up front and agreed upon by<br />

the system engineers to fulfil the intent of the requirements. In<br />

this way the development proceeds with the goals or objectives<br />

already defined, limiting wasted effort on coding incorrect<br />

software or software that does not meet the requirements. The<br />

tests act as guardrails on the development process, allowing rapid development with a lower risk of introducing costly and potentially dangerous bugs.<br />
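The red-green cycle described above can be sketched for an embedded C unit. The requirement and function below are invented for illustration and are not taken from any particular project.<br />

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical requirement (illustrative only): "Emergency braking
 * shall engage when the gap to an obstacle is below 5.0 m and the
 * vehicle speed is above 30 km/h." */
bool emergency_brake_required(double gap_m, double speed_kmh)
{
    return (gap_m < 5.0) && (speed_kmh > 30.0);
}

/* The test encodes the agreed requirement as concrete input/output
 * pairs. In TDD it is written and run first, before the function body
 * exists, to confirm that it fails. */
void test_emergency_brake_requirement(void)
{
    assert(emergency_brake_required(4.0, 50.0));   /* close and fast: engage */
    assert(!emergency_brake_required(10.0, 50.0)); /* sufficient gap: no-op  */
    assert(!emergency_brake_required(4.0, 20.0));  /* low speed: no-op       */
}
```

In practice the test function is agreed upon and run first (and fails); the implementation shown is the minimal code that makes it pass.<br />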

C. Use Industry Best Practices<br />

Look to industry best practices such as ISO 26262 and<br />

MISRA that are used to develop high integrity automotive<br />

systems. Take advantage of the lessons learned from years of<br />

industry expertise. Even if your system is not regulated, it still<br />

needs to be trusted in order for it to provide the value intended.<br />

D. Leverage Test Automation<br />

Test Automation allows you to free scarce resources for use<br />

on other tasks. Tests should be fast and easy for anyone to run.<br />

This facilitates collaboration, so everyone is involved in<br />

improving the integrity of the system.<br />

V. AN INDUSTRY BEST PRACTICE - ISO 26262<br />

ISO 26262 defines four Automotive Safety Integrity Levels (ASILs), with ASIL A being the least critical and ASIL D the most critical. For each level there are associated requirements for<br />

testing. There are five types of tests defined in the standard but<br />

Requirements-based tests are highly recommended for all<br />

integrity levels. This reflects the importance requirements play<br />

in high integrity systems.<br />

The standard is also very clear that structural code coverage<br />

metrics must be collected. The standard defines three types of<br />

coverage metrics: Statement coverage, Branch coverage and<br />

Modified Condition/Decision Coverage. Structural coverage<br />

metrics can be measured using software tools making this task<br />

much easier to conduct. The standard also calls out tool<br />

confidence levels and the methods that are used to qualify a<br />

tool.<br />
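The difference between these coverage levels can be illustrated on a small C decision; the function and its conditions below are invented for illustration.<br />

```c
#include <stdbool.h>

/* Illustrative decision with two conditions. Branch coverage needs only
 * two tests (decision true, decision false). MC/DC additionally requires
 * showing that each condition can independently flip the outcome, which
 * for this decision needs three vectors. */
bool airbag_fire(bool crash_detected, bool occupant_present)
{
    return crash_detected && occupant_present;
}

/* A minimal MC/DC set for the decision above:
 *   (true,  true)  -> true    baseline
 *   (false, true)  -> false   only crash_detected changed, outcome flipped
 *   (true,  false) -> false   only occupant_present changed, outcome flipped */
```

Coverage tools count which of these vectors the test suite has actually exercised, which is what makes the metric measurable by software.<br />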

VI. THE MISRA CODING STANDARDS<br />

MISRA, the Motor Industry Software Reliability Association, has developed standards focused on language restrictions that mitigate reliability faults. To test<br />

against these standards, static analysis tools are used to<br />

examine the code and find any violations of the rules specified<br />

within a given standard. For example, there is a rule that a<br />

switch statement needs to include a default case. Using an<br />

automatic static analysis tool greatly simplifies the code<br />

inspection requirement to show compliance with the many<br />

rules that are specified in the standards.<br />
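A minimal sketch of the switch-default rule in practice is shown below; the gear-selection function and its values are invented for illustration.<br />

```c
#include <stdint.h>

typedef enum { GEAR_PARK, GEAR_DRIVE, GEAR_REVERSE } gear_t;

/* Without the default case, a corrupted or out-of-range `gear` value
 * would fall through silently. The default routes it to a defined safe
 * value, and a static analysis tool checks for its presence
 * mechanically. */
int32_t select_torque_pct(gear_t gear)
{
    switch (gear) {
    case GEAR_PARK:
        return 0;
    case GEAR_DRIVE:
        return 100;
    case GEAR_REVERSE:
        return -100;
    default:            /* required by the MISRA rule cited above */
        return 0;       /* safe fallback for unexpected values    */
    }
}
```
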

VII. AN EXAMPLE TEST AUTOMATION ENVIRONMENT<br />

The effort required to analyze and test all the code and to<br />

collect the code coverage metrics is made much easier with the<br />

use of automated testing tools. These tools can parse the code<br />

and automatically generate a test driver that can be used to call<br />

the function under test and to instrument the code to collect the<br />

code coverage metrics. They also have the ability to<br />

automatically generate detailed test reports and provide testing<br />

metrics that can be used to show compliance with standards<br />

and to determine release readiness. An example test harness is<br />

shown in Fig 1. Here we can see the original source code, the<br />

test driver that will be used to call the Module Under Test as<br />

well as stub functions that are used to mock the software<br />

interfaces that are external to the unit. In addition, it may be<br />

desirable to have real functions present in the test environment.<br />
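A hand-written sketch of such a harness is shown below. In practice the driver and stub are generated by the tool; all names here are invented for illustration.<br />

```c
#include <assert.h>

/* Module under test (invented for illustration): converts a raw ADC
 * reading to a temperature using an external calibration interface. */
int get_calibration_offset(void);   /* external interface, stubbed below */

int adc_to_temperature(int raw)
{
    return (raw / 4) + get_calibration_offset();
}

/* Stub standing in for the external interface: it returns a value set
 * by the test driver instead of touching real hardware. */
int stub_calibration_offset;
int get_calibration_offset(void)
{
    return stub_calibration_offset;
}

/* Generated-style test driver: configure the stub, call the module
 * under test, check the expected output. */
void test_adc_to_temperature(void)
{
    stub_calibration_offset = -10;
    assert(adc_to_temperature(100) == 15);  /* 100/4 + (-10) */
}
```
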

The test automation environment automatically generates<br />

test harnesses for each code coverage metric. For ISO 26262<br />

testing, the type of code coverage is defined for each integrity<br />

level. For efficiency, it’s important to be able to perform the<br />

right level of testing for the given level. In this way there is a<br />

balance between the effort required to perform the test and the<br />

associated risk level.<br />

Fig 1 Automated creation of a test harness<br />

A. Support for Test-Driven Development<br />

The test automation system should support a Test-driven<br />

Development methodology. This means it must have the ability<br />

to generate a test harness purely on the interface definition for<br />

the unit under test and the test inputs and expected outputs. In<br />

this way the code developers can use the test harness to prove<br />

the work they have done meets the requirements and they have<br />

a clear definition of done. Before any code is written the test<br />

should be run to be sure it fails. If the test can pass with no code written, it has no value in proving the code is correct. The developer then adds just enough code to make the test pass. Ideally only the code necessary to pass the test should be written, as any additional code is not required and wastes effort that could be spent elsewhere. These tests should<br />

be incorporated into the overall test process and run whenever<br />

a change is made to the code or to the test itself.<br />

B. Reporting and Metrics<br />

Included with the test automation system should be a<br />

reporting and metrics facility. Most software development<br />

standards require test artifacts that must be shown to an auditor<br />

to confirm compliance. Safety standards such as ISO 26262<br />

also have reporting requirements that are audited to show<br />

compliance with the regulations.<br />

Metrics drive the efficiency of the development process.<br />

They focus the resources on the highest risk areas and provide<br />

insight into the progress of the development. With metrics,<br />

clear release readiness criteria can be defined and tracked.<br />

Ideally these metrics should be visible to the entire team so<br />

everyone is aware of the project status and can participate in<br />

improving the overall quality and integrity of the software.<br />

C. Tool Qualification<br />

The tools used for test automation should be evaluated for<br />

use by a certification authority. These authorities will evaluate<br />

the software development procedures used to develop the tool<br />

and certify that the tool fulfills the requirements for a particular<br />

safety related standard such as ISO 26262. The certification<br />

organization acts as a trusted agent providing a recognized<br />

authority for the qualification of the tool.<br />

VIII. CONCLUSION<br />

The growth of autonomous embedded systems will create high demand for efficient testing methods that prove the integrity of the software. Within the next decade, today's autonomous systems will look as outdated as a Model T does to us now. But for this growth to happen, the systems must be trustworthy. Following the steps outlined in this paper will go a long way toward meeting the challenge of providing a trusted autonomous embedded system in a way that is in line with business objectives.<br />

Safety & Security Testing of Cooperative<br />

Automotive Systems<br />

Dominique Seydel, Gereon Weiss<br />

Application Architecture Design & Validation<br />

Fraunhofer ESK<br />

Munich, Germany<br />

{dominique.seydel, gereon.weiss}@esk.fraunhofer.de<br />

Daniela Pöhn, Sascha Wessel<br />

Secure Operating Systems<br />

Fraunhofer AISEC<br />

Garching, Germany<br />

{daniela.poehn, sascha.wessel}@aisec.fraunhofer.de<br />

Franz Wenninger<br />

Design, Test & System Integration<br />

Fraunhofer EMFT<br />

Munich, Germany<br />

franz.wenninger@emft.fraunhofer.de<br />

Abstract— Cooperative behavior of automated traffic<br />

participants is a next step towards the goals of reducing the<br />

number of traffic fatalities and optimizing traffic flow. The<br />

notification of a traffic participant’s intentions and coordination<br />

of driving strategies increase the reaction time for safety<br />

functions and allow foresighted maneuver planning. When<br />

developing cooperative applications, a higher design complexity<br />

has to be handled, as components are distributed over<br />

heterogeneous systems that interact with a varying timing<br />

behavior and less data confidence. In this paper, we present a<br />

solution for the development, simulation and validation of<br />

cooperative automotive systems together with an exemplary<br />

development flow for safety and security testing.<br />

Keywords— automotive safety; cooperative applications;<br />

security testing; validation; autonomous systems; ITS<br />

I. INTRODUCTION<br />

In comparison to the development of traditional ADAS<br />

functions, testing and simulation of connected applications<br />

have to consider the interaction of heterogeneous systems that<br />

are distributed within a wireless networked architecture. As the<br />

communication link is less reliable than common<br />

input sensors, the application has to cope with varying timing<br />

behavior and less data confidence. However, the higher<br />

complexity in the development process of cooperative<br />

applications is justified by several advantages. They arise from the fact that foreign traffic participants are no longer solely observed from the outside to predict their behavior, but give insights into their status, intentions and their involvement in<br />

cooperative maneuvers. This results in an increased reliability<br />

of predicted vehicle movements, which in turn can be used for<br />

safety functions and allow an increased reaction time of safety<br />

mechanisms. In terms of driving comfort, the traffic<br />

participant’s cooperation allows foresighted maneuver<br />

planning.<br />

By the current state of available tools, the development, test<br />

and certification of autonomous systems is complex, costly in<br />

terms of time and equipment, potentially hazardous and often<br />

incomplete. For instance, when it comes to complex applications<br />

that require a distributed consensus, e.g. Merging Assistance<br />

[1], an application distributed among various foreign entities<br />

has to be validated. Thus, it appears that development and<br />

simulation environments are not yet ready to rapidly develop<br />

prototypes of cooperative driving functions.<br />

Therefore, we provide an approach for an integrated testing<br />

environment that can cover the whole innovation cycle for<br />

prototype development of cooperative automotive systems.<br />

Incorporating safety and security aspects, it starts from the<br />

design of applications through simulation to integrating and<br />

validating the respective prototypes. Thus, our approach<br />

supports the whole development process for prototyping and<br />

testing cooperative functions.<br />

The following Chapter II gives an overview of an efficient<br />

approach for the development of cooperative applications.<br />

Further aspects of the simulation and testing phase are<br />

discussed in Chapter III. The current scope of software analysis<br />

is presented in Chapter IV and a secured deployment process in<br />

the following Chapter V. We conclude our work with Chapter<br />

VI and provide a brief outlook to next steps.<br />

II. RAPID APPLICATION DEVELOPMENT<br />

A. Application Development Cycle<br />

The testbed combines several aspects of the Vehicle-to-X<br />

(V2X) application development life cycle. It covers<br />

development steps beginning from application design,<br />

continuous integration into simulation and testing environments, and tools performing functional and security analyses, up to a secure deployment and update process. The<br />

application development flow of the innovation and testbed concept is shown in Fig. 1. It is also designed so that single aspects can be used in a building-block style when developing innovative applications.<br />

The testing layer consists of several testing methods that<br />

are specific for each step in the development process and<br />

include several simulation environments, integration testing<br />

and field tests. The testbed provides the ezCar2x ® framework,<br />

described in [2], that allows testing connected applications<br />

within a simulation environment, using network and traffic<br />

simulation as well as integration of hardware-in-the-loop, e.g.<br />

Road Side Units (RSUs). Another main feature of the testbed is<br />

a combined simulation and field testing approach, where<br />

virtual and real traffic participants can be tested in a<br />

synchronized environment.<br />

Additional analysis tools make it possible to examine the developed<br />

application. On the one hand, our DANA tool (“Description<br />

and Analysis of Networked Applications”) for functional<br />

validation can be used iteratively in every testing step [3]. For<br />

the integration testing of the application, the analysis toolbox<br />

provides application security testing methods in order to detect<br />

software vulnerabilities in an early development stage. Using<br />

static application security testing (SAST) as well as dynamic<br />

application security testing (DAST) allows quick analyses<br />

during integration testing in order to detect potential software<br />

vulnerabilities.<br />

Finally, the application is built and subsequently signed<br />

within the software repository and pushed to the update server,<br />

which is part of the back end. The update server again signs the<br />

application and deploys it to V2X devices.<br />

B. Application Design<br />

For the initial development step of designing a cooperative<br />

driving function, the testing environment comprises interfaces<br />

to common automotive modelling tools, like Matlab Simulink<br />

or ADTF. The deployed application uses the ezCar2x ®<br />

framework, an ETSI ITS (Intelligent Transport Systems)<br />

compliant communication stack, which can either run on real<br />

communication hardware or on a virtual node within a network<br />

simulation. Furthermore, application security testing can be<br />

conducted with static and dynamic methods.<br />

Fig. 1 Application Development Cycle using the Testbed Services<br />

If enhanced safety mechanisms are required for the<br />

intended application, state-of-the-art software methods, e.g.<br />

graceful degradation strategies [4] or model-based<br />

communication [5], can be incorporated into the application<br />

model within this development step as well. For example, a<br />

connected application with safety-critical functionality, such as<br />

Platooning, strongly depends on Quality of Service (QoS)<br />

parameters of the communication link. Our safety function for<br />

Resilient Control uses these QoS parameters, such as the<br />

current Packet Loss Rate (PLR), to decide which degradation<br />

mode is sufficient, e.g. readjusting the distance to the vehicle<br />

ahead. The safety mechanisms for resilient control are<br />

developed as a generic component and can be integrated into common automotive software architectures, such as AUTOSAR,<br />

AUTOSAR Adaptive and further concepts. Also existing<br />

architectures from non-safety domains like infotainment, as<br />

developed by the GENIVI Alliance, can be integrated to handle<br />

the unreliability of the communication link.<br />
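A minimal sketch of such a QoS-driven mode selection is shown below. The thresholds and mode names are invented for illustration and are not taken from the described framework.<br />

```c
/* Hypothetical mapping from a measured Packet Loss Rate (PLR) to a
 * degradation mode, in the spirit of the resilient-control function
 * described above. All thresholds are illustrative. */
typedef enum {
    MODE_NOMINAL,        /* full cooperative control, minimum gap      */
    MODE_INCREASED_GAP,  /* readjust the distance to the vehicle ahead */
    MODE_FALLBACK        /* leave the platoon, rely on local sensors   */
} degradation_mode_t;

degradation_mode_t select_degradation_mode(double plr)
{
    if (plr < 0.05) {
        return MODE_NOMINAL;      /* link healthy: keep nominal behavior */
    }
    if (plr < 0.20) {
        return MODE_INCREASED_GAP; /* degraded link: widen the gap       */
    }
    return MODE_FALLBACK;          /* link unusable: local control only  */
}
```

Keeping the mapping in one pure function makes the degradation policy itself easy to cover with requirements-based tests.<br />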

Another aspect that is becoming more relevant for application design and validation is so-called Plastic Architectures. Parts of an application can be distributed over several entities. For example, in the case of a Collision Warning application [6], the interaction of the originating, the warning and optionally edge or cloud components forms the overall function. As the specific architecture may change frequently depending on the context, it becomes formable, or plastic. In the future, parts of the application may also be relocated dynamically at runtime, e.g. from cloud over edge to in-vehicle components. Thereby, the system boundaries change dynamically depending on the current communication relations. Although there are concepts to solve the underlying network aspects [7], these runtime conditions already have to be covered within the design phase of the specific applications.<br />

III. SIMULATION & TESTING<br />

One goal of simulation and testing for cooperative<br />

automated driving is to achieve a fail-operational behavior of<br />

the application, even when the context information is of low confidence. Therefore, the coverage of the test cases used within the simulation environment and during virtual testing should be as realistic and as comprehensive as possible. This is achieved by (automatically) defining reference scenarios and generating variations of them, e.g. stochastic variations [8].<br />

One of the parameter variations is the realistic behavior of<br />

the communication channel during a certain driving scenario.<br />

Therefore, a network simulation tool, e.g. ns-3 or OMNET++,<br />

and a traffic simulation tool, e.g. SUMO, VTD or CarMaker,<br />

are integrated into the simulation environment. Our testbed<br />

could also be integrated with other microscopic and<br />

macroscopic traffic simulation tools, as each of them has<br />

advantages when testing a specific connected application.<br />

A. Simulation Environment<br />

The suggested concept combines three different simulation<br />

aspects into one integrated simulation environment.<br />

The first component is a traffic simulator that is used to<br />

model and run driving test cases on a realistic road network.<br />

The second component is a network simulation tool for<br />

evaluating applications under real communication conditions.<br />

For the heterogeneous use of common vehicular<br />

communication technologies, e.g. 802.11p, 4G or LTE, the<br />

ezCar2x ® framework provides additional network layer<br />

components. The network simulation tool also facilitates<br />

interfaces to control the traffic simulation and to integrate hardware-in-the-loop or vehicle-in-the-loop tests (as for<br />

Virtual Platooning), e. g. including RSUs.<br />

The third component of the testbed is for test control.<br />

Traces from all simulation components are monitored and<br />

analyzed within the test control component. For ensuring the<br />

security of cooperative systems, testing covers white-, gray-,<br />

and black-box approaches (e.g. Data-Flow Analysis, Fuzzing<br />

or Penetration Testing). In order to validate the applications,<br />

test cases have to reach full coverage and should therefore be<br />

generated (semi-)automatically for each application.<br />

B. Integrated and Hybrid Simulation<br />

As already described in Chapter II.A, the application<br />

implementation is deployed on each virtual V2X node within<br />

the network simulation environment. Together with the<br />

ezCar2x ® Framework each virtual node can be equipped with<br />

developed applications and also with V2X communication<br />

ability. Hence, its interaction with other nodes can be simulated<br />

as realistically as possible.<br />

The interaction between all virtual nodes is realized using a<br />

virtual wireless channel. Thereby, we consider the specific<br />

characteristics of each communication technology by using<br />

individual channel models, e.g. dedicated models for ITSG5,<br />

LTE or 5G. This virtual wireless channel can also be used to<br />

integrate real hardware into the simulation by using a channel<br />

proxy and creating a mirror node for each hardware component<br />

within the network simulation.<br />

The network simulation is coupled with a macroscopic<br />

traffic simulation for large scale traffic scenarios, e.g., to test<br />

security mechanisms for V2X messages, and with a<br />

microscopic traffic simulation for smaller driving scenarios,<br />

e.g. Cooperative Merging [6] or Platooning. The coupling via a<br />

control interface is needed to synchronize the behavior of<br />

communication nodes and traffic participants in each<br />

simulation for the given driving scenario.<br />

The environment can be extended with hybrid simulation<br />

capabilities by including hardware-in-the-loop. An RSU<br />

comprising an application, e.g. Smart Lighting, can be<br />

integrated into the simulation loop, by connecting it to the<br />

wireless channel interface of the simulation environment. The<br />

RSU can again interact with further communication hardware,<br />

e.g. test vehicles that are in communication range. The RSU<br />

can also be connected with sensors that are integrated into the<br />

testing environment and which can provide status data to<br />

generate event messages, e.g. Decentralized Environmental<br />

Notification Messages (DENMs).<br />

C. Sensor Integration<br />

The effectiveness of a connected application’s simulation<br />

depends on how realistic the input data for a certain driving<br />

scenario is. The input data from distributed sources, e.g. the<br />

status data within Cooperative Awareness Messages (CAMs)<br />

from other vehicles or roadside sensors, have to be<br />

synchronized during the recording and replay phases.<br />

Synchronization is required to setup the intended driving<br />

conditions for the application that are to be tested. To achieve<br />

realistic input data, it is beneficial to have sensors integrated in<br />

an early development step, through Hardware-in-the-Loop<br />

(HiL) or recorded input data stream integration.<br />

The realistic sensor data can also be used to develop<br />

algorithms for sensor analysis, timing and improvement of<br />

machine learning processes. In addition, to develop tamperproof<br />

algorithms it is advantageous to have a realistic behavior<br />

of sensor data available in order to avoid manipulation or<br />

attacks with sensor hardware or sensor data.<br />

Within our simulation environment, vehicle sensors and<br />

infrastructure sensors are integrated as components in<br />

ezCar2x® via generic sensor interfaces. When recording test<br />

data, the real sensors can easily be included into the<br />

synchronized recording process. The same setup can be used<br />

during field tests, where infrastructure sensors usually are<br />

integrated into RSUs and thereby provide their environment<br />

data. Thus, virtual, hybrid and integration testing can be carried<br />

out with low effort.<br />

IV. SOFTWARE ANALYSIS<br />

In each step of the development process it is beneficial to<br />

perform additional analyses to get detailed knowledge of the<br />

overall system and the applications behavior for debugging,<br />

monitoring, security, and validation purposes. In this chapter,<br />

we give an overview of our methods and tools for software<br />

analysis.<br />

An exemplary flow of the development process for an<br />

application is shown in Fig. 2. The application under<br />

development can be prototyped as implemented source code or<br />

as a software model to be further developed and optimized<br />

within the testbed.<br />

A. Monitoring and Functional Validation<br />

For software validation and verification, model-based<br />

techniques are advantageous during the design and integration<br />

phase. Our DANA platform [3], an open and modular<br />

environment based on Eclipse, is a tool built for specifying and<br />

analyzing networked applications. For this purpose, the<br />

specified valid behavior of the application is described as a<br />

layered reference model. This model provides a basis for<br />

further model-based development steps. On the one hand, it<br />

can be used for various transformations of behavior models,<br />

e.g., for generating test cases or code for running simulations.<br />

On the other hand, it can be used for static analyses to check<br />

conformance to modeling guidelines, metrics for interfaces,<br />

and the compatibility of behavior models. The model-based<br />

approach also allows a quick integration of new message<br />

sources, e.g. additional communication protocols or wireless<br />

channels. Furthermore, the DANA tool can be used for<br />

verifying and validating software interface behavior, as<br />

messages in these interfaces can contain complex data and<br />

include intricate interactions.<br />

In our proposed testbed we use DANA as a central<br />

monitoring tool to have all the status, debug and behavior<br />

Fig. 2 Exemplary Development Flow for Safety & Security Testing<br />

information, the error messages and timing data centrally<br />

available from each component. This aggregation helps to<br />

simplify and to speed up the debugging process during<br />

development and runtime. Further validation checks can be<br />

applied on this collected data, as described in the previous<br />

paragraph.<br />

B. Application Security Testing<br />

By integrating static as well as dynamic testing methods<br />

into the development process, applications can be tested against<br />

a broad spectrum of software vulnerabilities, as outlined in [9].<br />

Detecting vulnerabilities in early development stages is crucial,<br />

as it prevents the necessity of expensive software patching<br />

post-release. Therefore, we propose a work flow, where<br />

application security testing is part of continuous integration.<br />

Static code analysis as part of SAST allows inspecting<br />

program behavior without actually running the program. A<br />

large number of tools is available for the programming languages typically used within the V2X domain, detecting violations of software requirements, e.g. erroneous program behavior, unreachable or dead code. Moreover, many<br />

tools are specifically designed to detect program flaws that lead<br />

to potential vulnerabilities, e.g. buffer overflows.<br />

SAST depends on the application source code and<br />

vulnerability specification as input, where the latter defines<br />

what kind of vulnerabilities should be detected by the tool. The<br />

static analysis then runs fully automated and outputs a report of<br />

the detected potential vulnerabilities.<br />

Available tools can be applied directly on source code as<br />

well as on binaries and bytecode. The applied methods differ in<br />

efficiency and effectiveness depending on the underlying<br />

approach. Broadly, applied methods can be divided into lexical<br />

scanning and data or control flow analysis.<br />
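A classic example of a flaw that data-flow-based SAST reports is an unbounded copy of externally controlled input. The functions below are invented for illustration.<br />

```c
#include <stdio.h>
#include <string.h>

/* Flawed version: a received message field is copied into a fixed-size
 * buffer with no bounds check. Data-flow analysis tracks `msg` from the
 * external interface into strcpy() and reports a possible overflow. */
void handle_station_id_flawed(const char *msg)
{
    char id[16];
    strcpy(id, msg);             /* flagged: possible buffer overflow */
    printf("station: %s\n", id);
}

/* Repaired version: the copy is bounded and the result is always
 * null-terminated. */
void handle_station_id(const char *msg, char *id, size_t id_len)
{
    strncpy(id, msg, id_len - 1);
    id[id_len - 1] = '\0';
}
```
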

While SAST has shortcomings, e.g. detection of<br />

vulnerabilities in authentication, access control or<br />

cryptographic protocols, as well as uncovering flaws in the<br />

security design, DAST can tackle these problems. Furthermore,<br />

it provides deeper analysis and can imitate attack scenarios.<br />

Since applications are executed within this testing method,<br />

DAST depends on predefined input data. These inputs either<br />

represent specific test cases or are randomly chosen.<br />

Dynamic code analysis is a white-box testing approach.<br />

This means that the internal structure of an application, e.g.<br />

source code or intermediate representation, is available and can<br />

be leveraged for analyzing the application.<br />

Monitoring of function calls by hooking into security<br />

critical functions, e.g. system calls, is one typical testing<br />

method. In addition, function parameter analysis is applied,<br />

inspecting the input-output relations of function calls. More<br />

sophisticated methods like dynamic taint analysis are able to<br />

analyze execution paths an attacker may use to exploit an<br />

application. This method can also be applied if source code of<br />

an application is not available.<br />

All of these methods aim at detecting potential<br />

vulnerabilities that occur during runtime of the application.<br />

This allows verifying alarms that have been reported by SAST.<br />

Furthermore, it can be used as preparation for penetration<br />

testing, as it provides potential entry points for actual attacks.<br />
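A minimal sketch of dynamic testing with randomly chosen inputs is shown below: an invented message-field decoder is executed repeatedly while a runtime invariant is checked on every call, roughly as a DAST tool would do.<br />

```c
#include <assert.h>
#include <stdlib.h>

/* Decoder for a hypothetical V2X speed field (invented for
 * illustration): raw values outside the valid encoding are rejected. */
int parse_speed_field(int raw)
{
    if (raw < 0 || raw > 16383) {
        return -1;                  /* reject out-of-range encodings */
    }
    return raw / 100;               /* decoded value in m/s */
}

/* Fuzz-style driver: feed in-range and out-of-range inputs and check
 * a runtime invariant on every execution. */
void fuzz_parse_speed(unsigned iterations)
{
    srand(1234u);                   /* fixed seed: reproducible run */
    for (unsigned i = 0; i < iterations; i++) {
        int raw = (rand() % 40000) - 2000;  /* spans both input classes */
        int v = parse_speed_field(raw);
        /* invariant: input is either rejected or decoded into range */
        assert(v == -1 || (v >= 0 && v <= 163));
    }
}
```

An invariant violation during such a run is exactly the kind of dynamically confirmed finding that can corroborate a SAST alarm or feed penetration testing.<br />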

V. SECURE DEPLOYMENT AND PROTOTYPING<br />

To rapidly transfer the developed application into a<br />

prototype, as needed for field tests, an integrated deployment<br />

process is beneficial. The testbed offers an integrated solution<br />

containing a secure V2X platform, a continuous integration<br />

workflow, secure deployment and update processes and a<br />

communication link to back end or cloud platforms.<br />

Novel cooperative functions can be integrated with<br />

ezCar2x® into secure ITS prototype devices. These build upon<br />

trust2X [10], a hardened platform that includes hardware- and<br />

software-based security in order to isolate and protect<br />

processes and data of cooperative driving functions from other<br />

operating systems (e. g. AUTOSAR), functional modules and<br />

communication interfaces (e. g. backend communication for<br />

secure software updates and app deployment).<br />

The hardened V2X platform trust2X is a Linux-based operating system that provides hardware- and software-based security features for target devices of V2X applications.<br />

Its main goal is the secure isolation of different software<br />

entities that run concurrently on top of a single Linux kernel.<br />

ezCar2x® can run in a Linux container, isolated from other<br />

guest operating systems (e.g. AUTOSAR) and other functional<br />

modules or services (e.g. TLS, VPN). By isolating software<br />

entities, each entity is protected against compromised or<br />

malfunctioning software that runs in a different container.<br />

Therefore, safety- and security-critical functions can run within<br />

an isolated environment, unaffected by failure or compromise<br />

of separate entities in the system.<br />

Each commit to the central software repository<br />

automatically triggers a build of the application under<br />

development. While each build is self-testing, SAST and<br />

DAST are triggered automatically. Static security testing tools<br />

are applied either directly on source code, or on compilation<br />

products such as bytecode or binaries. Thus single commits can be tested with lightweight analysis tools, as can individual software modules or the complete application.<br />

Static application testing tools can run without user interaction<br />

and provide a report, which lists each finding of potential<br />

vulnerabilities. DAST tools might require user interaction<br />

depending on the applied testing method.<br />

If a potential vulnerability has been found, the<br />

continuous integration server alerts the development team.<br />

Based on the review of the testing reports either further<br />

security tests can be applied, or the developer commits a patch<br />

in order to eradicate the vulnerability. If the application has<br />

been successfully tested and no potential vulnerabilities were<br />

found, the continuous integration server triggers a merge<br />

request, or directly merges into master. Subsequently, the application is built and signed.<br />

VI. CONCLUSION<br />

We provided an approach for an integrated testing<br />

environment that covers the whole development process for<br />

prototyping and testing cooperative functions. Incorporating<br />

safety and security aspects starting from the design phase, the complex task of simulating cooperative applications and<br />

several testing steps have been described. Finally, a software<br />

solution for the validation and deployment of the prototypes<br />

was presented, which makes tools available for the whole<br />

development cycle. Thereby, a toolkit is provided that is<br />

intended to rapidly bring an idea for a connected application<br />

into a prototype with a decreased investment risk.<br />

In the future, all testbed services described in the previous<br />

chapters could also be made available as an online web-service.<br />

For this purpose, a next step of the described solution is to<br />

enable access to configurable or pre-configured simulations via<br />

an online service. On this website, users could remotely control<br />

simulation parameters, define new scenarios and get qualified<br />

evaluation results. This ongoing development addresses the<br />

rapid development of connected applications by abstracting away technology know-how and shortening time to market.<br />

Thus, innovators and developers can concentrate on the actual<br />

function and idea of the intended application and are able to<br />

experience, improve, and validate their solution in early stages<br />

prior to competitors.<br />

ACKNOWLEDGMENT<br />

This project was partially funded by the Bavarian Ministry<br />

of Economic Affairs and Media, Energy and Technology<br />

within the High Performance Center Secure Networked<br />

Systems.<br />

REFERENCES<br />

[1] Ntousakis, I. A., Nikolos, I. K., & Papageorgiou, M. (2017). Cooperative<br />

Vehicle Merging on Highways-Model Predictive Control (No. 17-<br />

00930).<br />

[2] Roscher, K., Bittl, S., Gonzalez, A. A., Myrtus, M., and Jiru, J. (2014).<br />

ezCar2X: Rapid-Prototyping of Communication Technologies and<br />

Cooperative ITS Applications on Real Targets and Inside Simulation<br />

Environments, In: 11th Conference Wireless Communication and<br />

Information. vwh, pp. 51 – 62.<br />

[3] Drabek, C., Weiss, G. (2017) DANA - Description and Analysis of<br />

Networked Applications. In: International Workshop on Competitions,<br />

Usability, Benchmarks, Evaluation, and Standardisation for Runtime<br />

Verification Tools (RV-CuBES), pp. 71-80.<br />

[4] Schleiss P., Drabek C., Weiss G., Bauer B. (2017) Generic Management<br />

of Availability in Fail-Operational Automotive Systems. In: Tonetta S.,<br />

Schoitsch E., Bitsch F. (eds) Computer Safety, Reliability, and Security.<br />

SAFECOMP 2017.<br />

[5] Moradi-Pari, E., Mahjoub, H. N., Kazemi, H., Fallah, Y. P., and<br />

Tahmasbi-Sarvestani, A. (2017). Utilizing Model-Based Communication<br />

and Control for Cooperative Automated Vehicle Applications. IEEE<br />

Transactions on Intelligent Vehicles.<br />

[6] Zhang, R., Cao, L., Bao, S., & Tan, J. (2017). A method for connected<br />

vehicle trajectory prediction and collision warning algorithm based on<br />

V2V communication. International Journal of Crashworthiness, 22(1),<br />

15-25.<br />

[7] An, X., et al. (2017) On end to end network slicing for 5G<br />

communication systems. In: Transactions on Emerging<br />

Telecommunications Technologies, 28. Jg., Nr. 4.<br />

[8] Damm W., Heidl P. (eds.) (2017) Positionspapier und Roadmap zu „Hochautomatisierte Systeme: Testen, Safety und Entwicklungsprozesse“, SafeTRANS e. V. http://www.safetrans-de.org/de/Aktuelles/?we_objectID=2, accessed 18.1.2018.<br />

[9] Eckert, Claudia. (2017) Cybersicherheit beyond 2020!. In: Informatik-<br />

Spektrum, 40. Jg., Nr. 2, pp. 141-146.<br />

[10] Waidner M. (2018) Safety und Security. In: Neugebauer R. (eds)<br />

Digitalisierung. Springer Vieweg, Berlin, Heidelberg<br />



Sensor Simulation<br />

Validation of safety-related sensors with real time capability<br />

Dr.-Ing. Kristian Trenkel<br />

Research<br />

iSyst Intelligente Systeme GmbH<br />

Nürnberg, Germany<br />

Kristian.Trenkel@isyst.de<br />

Abstract— Currently there are no suitable solutions for the real-time simulation of advanced sensors with digital interfaces, although this is necessary for development and testing. The presented solution enables the simulation of sensors in real time; both the simulation of sensor values and the injection of errors are possible. Thanks to the CAN interface used, the solution can easily be integrated into existing test systems. Furthermore, it is easy to use at the developer's workstation.<br />

Keywords— Simulation of sensors, Simulation of sensor buses, Environment simulation<br />

I. INTRODUCTION<br />

In the automotive industry the number of electronic control units (ECUs) and the number of functions increase with every vehicle generation. More and more functions need information about the environment of the vehicle, and the number of sensors is therefore increasing. Besides sensors with analog or PWM interfaces, digital sensor interfaces like SPI, SENT or PSI5 are becoming more and more important in the development of automotive ECUs. These interfaces lead to new requirements for development and test.<br />

For development and test it is necessary to simulate the sensor values, and the sensor's behavior in the case of faults, in real time. With a real sensor, not all possible behaviors and tolerances can be covered. Compared to the simulation of analog or PWM signals, the simulation of sensors with a digital interface is much more complex: besides the sensor signal, the sensor provides more and more diagnostic and configuration information. For safety-relevant applications (e.g. according to ISO 26262) it is necessary to test all elements of the sensor communication with a focus on fault detection. The currently available sensor simulation systems do not provide real-time simulation. Over the commonly used USB interface it is only possible to program fixed values or predefined signal sequences. Also, there is no way to synchronise the simulation with the test system.<br />

The introduced simulation platform makes it possible to simulate sensor values in real time, meaning a response time below 1 µs. The system uses a single CAN interface for integration into the test system. It is possible to simulate the real, dynamic behavior of a sensor in a simulated test environment such as a HIL system, or on a PC at the workstation of a development engineer. The presented system from iSyst Intelligente Systeme GmbH allows the simulation of sensors with an SPI, PSI5, LIN or SENT interface [1]. This enables the engineer to carry out effective development and testing of sensors with digital interfaces in safety-relevant areas, too.<br />

The paper shows the possibilities and limits of the sensor<br />

simulation platform. The advantages of the presented solution<br />

are illustrated by results from real test projects.<br />

II. STATE OF THE ART<br />

Until now, sensors that output the sensor value as an analog or PWM signal have been widely used in the automotive sector, but also in other areas of industry. These sensors are typically connected via three wires, which carry the supply voltage (e.g. 5 V), ground and the actual sensor signal. The transmitted information can be encoded as a current or a voltage. It is not possible to exchange further information, such as diagnostic information, with the sensor. An error is detected by checking whether the actual sensor signal lies within defined limits.<br />



III. DIGITAL SENSOR INTERFACES<br />

A. Used sensor interfaces<br />

Currently, sensors with digital interfaces are frequently used. On the one hand, these make it possible to receive several sensor signals from a single sensor; on the other hand, diagnostic and status information can be obtained directly from the sensor. This enables extended fault detection, which in turn is important for safety-critical applications such as airbag control units.<br />

SPI, PSI5 and SENT are examples of digital sensor<br />

interfaces.<br />

SPI (Serial Peripheral Interface) [2] is a bi-directional, synchronous serial protocol that is often used for communication between ICs on boards.<br />

PSI5 (Peripheral Sensor Interface 5) [3] is a two-wire<br />

interface that can be operated synchronously or asynchronously.<br />

A current modulation with Manchester encoding is used for<br />

communication from the sensor (slave) to the control unit<br />

(master). Modulation of the supply voltage is used for<br />

communication from the control unit to the sensor. PSI5 has<br />

been developed for the connection of sensors in the automotive<br />

sector.<br />
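The Manchester coding on the sensor-to-ECU link can be sketched as follows. This is an illustrative sketch only: it assumes the IEEE convention (a '0' bit sent as high-then-low, a '1' bit as low-then-high) and abstracts the two current levels of the PSI5 modulation into logic half-bits.<br />

```c
#include <stdint.h>

/* Manchester coding sketch for the PSI5 sensor->ECU link.
 * Convention assumed here (IEEE): bit 0 -> half-bits {1,0},
 * bit 1 -> half-bits {0,1}.  On the wire these would be the two
 * current levels of the PSI5 modulation. */

/* Encode n bits (LSB first) into 2*n logic half-bits. */
static void manchester_encode(uint16_t bits, int n, uint8_t *halfbits)
{
    for (int i = 0; i < n; i++) {
        uint8_t b = (uint8_t)((bits >> i) & 1u);
        halfbits[2 * i]     = b ? 0u : 1u;
        halfbits[2 * i + 1] = b ? 1u : 0u;
    }
}

/* Decode 2*n half-bits; returns -1 on an invalid symbol (no mid-bit
 * transition), which a receiver would flag as a coding error. */
static int manchester_decode(const uint8_t *halfbits, int n, uint16_t *bits)
{
    uint16_t out = 0;
    for (int i = 0; i < n; i++) {
        uint8_t h0 = halfbits[2 * i];
        uint8_t h1 = halfbits[2 * i + 1];
        if (h0 == h1)
            return -1;
        if (h0 == 0u)        /* {0,1} encodes a '1' bit */
            out |= (uint16_t)(1u << i);
    }
    *bits = out;
    return 0;
}
```

A real receiver additionally has to recover the bit clock from the guaranteed mid-bit transitions, which the sketch leaves out.<br />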

SENT (Single Edge Nibble Transmission - SAE J2716) [4]<br />

is a unidirectional, asynchronous protocol using three wires for<br />

supply voltage, ground and signal. The signal is transmitted as<br />

modulated signal voltage with constant amplitude and different<br />

pulse lengths for each nibble (4 bits). It has been developed for the<br />

connection of sensors in the automotive sector.<br />
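The pulse-length coding described above can be sketched as follows; the 12-tick base length, 56-tick sync pulse and nominal 3 µs tick follow the figures commonly quoted for J2716, and the status and CRC nibbles of a real SENT frame are omitted.<br />

```c
#include <stdint.h>

/* SENT (SAE J2716) pulse-length coding sketch: each 4-bit nibble is
 * sent as a pulse of 12 + value ticks (value 0 -> 12 ticks, value 15
 * -> 27 ticks), preceded by a 56-tick sync pulse.  The 3 us tick is
 * the nominal value; real sensors deviate within a tolerance, which
 * is exactly what a sensor emulator must be able to reproduce. */

#define SENT_TICK_US    3u
#define SENT_SYNC_TICKS 56u
#define SENT_BASE_TICKS 12u

static uint32_t sent_nibble_ticks(uint8_t nibble)
{
    return SENT_BASE_TICKS + (nibble & 0x0Fu);
}

static uint8_t sent_ticks_nibble(uint32_t ticks)
{
    return (uint8_t)((ticks - SENT_BASE_TICKS) & 0x0Fu);
}

/* Frame length in ticks for n data nibbles (sync + data only; the
 * status and CRC nibbles of a real frame are omitted here). */
static uint32_t sent_frame_ticks(const uint8_t *nibbles, int n)
{
    uint32_t t = SENT_SYNC_TICKS;
    for (int i = 0; i < n; i++)
        t += sent_nibble_ticks(nibbles[i]);
    return t;
}
```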

In addition to the actual sensor value or sensor values,<br />

advanced sensors provide a large number of diagnostic<br />

functions. They perform cyclical self-tests (e. g. testing of the<br />

clock source and memory) and report the results to the connected<br />

processor of the system. Furthermore, the sensors can also<br />

monitor the processor by providing watchdog functions. The<br />

processor in turn checks the sensor values (e. g. checking for the<br />

presence of noise) to ensure that the sensor functions correctly.<br />

B. Testing of sensors<br />

During the development and testing of embedded systems<br />

with sensors, the task now is to support and test all functions that<br />

the real sensor provides via its interface. The real sensors can<br />

only be used to implement the behaviour of this special sensor,<br />

which lies somewhere in the permissible tolerance band. On the<br />

one hand, this makes it difficult for software developers to check<br />

their implemented plausibility checks. On the other hand, the test<br />

department cannot enforce a faulty behaviour of the sensor and<br />

therefore cannot check the correctness of the monitoring. In<br />

safety-critical applications, however, the testing of failures and<br />

the safeguarding of monitoring functions are an absolute<br />

necessity.<br />

In order to enable the realization of these development and<br />

test tasks, a platform for the emulation of sensors was developed,<br />

whose function and possibilities are illustrated by the following<br />

example.<br />

Figure 1: Emulation module for sensors with SPI interface<br />

IV. EXAMPLE OF USE<br />

In the following, the challenges and solution proposals for<br />

safeguarding acceleration sensors connected via SPI are<br />

presented. The procedure is shown by way of example on a<br />

control unit, which uses three acceleration sensors to provide<br />

measured values for the Electronic Stability Program (ESP). The<br />

recording and transmission of the sensor values is classified as<br />

ASIL B (Automotive Safety Integrity Level). All error detection<br />

mechanisms of the sensor were tested as well as the transmission<br />

of the sensor data and error states via FlexRay to other ECUs.<br />

Therefore, it was necessary to integrate the sensor emulation into<br />

a hardware-in-the-loop (HIL) test system.<br />

V. INTEGRATION INTO THE TEST SYSTEM<br />

Sensor simulation, or rather emulation, is integrated into the HIL test system via the CAN bus. Only the default values for the sensor data are transmitted cyclically, in a 500 µs pattern, via the CAN bus from the HIL test system to the sensor emulation; the error injection is controlled on demand to keep the load on the CAN bus as low as possible. This makes it possible to define the sensor data from the HIL test system in real time. In the given example, the sensor data is queried by the control unit with a cycle time of 1 ms. The transfer of the sensor values takes place on the SPI interface according to the requests of the ECU. Furthermore, the sensor emulation also generates noise on the sensor signals synchronously with the queries of the ECU, which is necessary for a realistic simulation of the sensor behaviour.<br />
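The noise generation can be sketched as follows. The generator and the ±3 LSB amplitude are assumptions for illustration, not the actual emulator implementation; the point is that one sample is drawn per ECU query, keeping the noise synchronous with the 1 ms request cycle.<br />

```c
#include <stdint.h>

/* Illustrative sketch of the noise generation: on each SPI query from
 * the ECU, the emulator returns the default value from the HIL system
 * plus a small pseudo-random deviation.  The LCG and the +/-3 LSB
 * amplitude are assumptions, not the actual emulator implementation. */

static uint32_t noise_state = 0x12345678u;

static int32_t noise_sample(int32_t amplitude)
{
    noise_state = noise_state * 1664525u + 1013904223u;   /* LCG step */
    /* map the high bits onto the range [-amplitude, +amplitude] */
    return (int32_t)(noise_state >> 16) % (2 * amplitude + 1) - amplitude;
}

/* Called once per ECU query, so the noise stays synchronous with the
 * 1 ms SPI request cycle described above. */
static int16_t emulated_accel(int16_t default_value)
{
    return (int16_t)(default_value + noise_sample(3));
}
```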



Figure 2: Test system integration<br />

The schematic structure of the test system can be seen in Figure 2. Python and the test automation system iTestStudio are used to implement and execute the tests on the control PC. This is connected to the dSPACE real-time computer in the HIL test system via a proprietary connection. The control PC can also read the FlexRay communication (rest bus simulation) between the HIL test system and the control unit and thus also check the sensor values on the FlexRay. Furthermore, the control PC can use the Universal Measurement and Calibration Protocol (XCP) to read and write variables within the ECU software. With this test system design, both white-box and black-box tests for the sensors are possible.<br />

VI. EXECUTION OF THE TESTS<br />

In the following, two tests are described in more detail to illustrate the possibilities of sensor emulation and the HIL test system.<br />

The plausibility of the sensor data is checked in the first test case. The plausibility check is carried out in the real application by a second ECU (ESP), which also has acceleration sensors and makes these values available on the FlexRay. In the example, the second ECU is simulated by the HIL test system. It is now possible to provide sensor values on the FlexRay and, by means of sensor emulation, via SPI in real time to the ECU to be tested. The result of the plausibility check must also be measured as a value on the FlexRay. Different plausible and non-plausible sensor values can be specified and the evaluation by the ECU can be checked.<br />

A second test case is the falsification of the cyclic redundancy check (CRC) checksum within the responses on the SPI interface. The error injection can be performed for individual responses or for all of them. In the case of a faulty CRC, the control unit must mark the sensor values on the FlexRay bus as invalid, which can easily be checked by the control PC.<br />

Furthermore, the sensor emulation makes it possible to change all values provided by the sensor (e.g. status bits, chip ID and clock counter) via CAN. In addition, faulty lengths for the responses, a sensor failure and a missing jitter of the sensor values can be set.<br />

Errors in the requests to the sensor emulation are detected and reported cyclically every 500 ms to the HIL test system via CAN.<br />
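The CRC test case can be sketched as follows. The CRC-8 polynomial 0x1D with init value 0xFF and the frame layout are assumptions for illustration, not the checksum definition of the actual sensor.<br />

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the CRC falsification test case: compute a CRC-8 over an
 * SPI response, then inject an error so the ECU's check must fail.
 * Polynomial 0x1D with init 0xFF is a common automotive choice but is
 * an assumption here, not the sensor's actual checksum definition. */

static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0xFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x1D)
                                : (uint8_t)(crc << 1);
    }
    return crc;
}

/* Response frame: payload followed by one CRC byte. */
static void response_seal(uint8_t *frame, size_t payload_len)
{
    frame[payload_len] = crc8(frame, payload_len);
}

static int response_ok(const uint8_t *frame, size_t payload_len)
{
    return crc8(frame, payload_len) == frame[payload_len];
}

/* Error injection: flip one payload bit after sealing. */
static void inject_crc_fault(uint8_t *frame)
{
    frame[0] ^= 0x01u;
}
```

Because a CRC is linear, flipping a single payload bit after sealing always changes the expected checksum, so the control unit's check must fail for the injected frame.<br />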

VII. CONCLUSION<br />

For the development and test of embedded systems with advanced sensors, real-time sensor emulation is essential. Only with the help of an emulation is it possible to implement and test all functions, including the diagnostic functions, and to safeguard all plausibility checks.<br />

REFERENCES<br />

[1] iSyst GmbH, “Test Components” [Online]. Available: https://www.isyst.de/en/products/test-components/<br />

[2] Wikipedia, “Serial Peripheral Interface” [Online]. Available: http://de.wikipedia.org/wiki/Serial_Peripheral_Interface<br />

[3] Robert Bosch GmbH, “PSI5 Peripheral Sensor Interface 5.” [Online]. Available: http://psi5.org/<br />

[4] SAE International, “J 2716 - SENT - Single Edge Nibble Transmission for Automotive Applications,” 27.01.2010. [Online]. Available: http://standards.sae.org/j2716_201001/<br />



Effective Power Interruption Testing<br />

How Best to Fail<br />

Thom Denholm<br />

Technical Product Manager<br />

Datalight, Inc.<br />

Bothell, WA<br />

Thom.Denholm@datalight.com<br />

Abstract—From dropped batteries to system failures,<br />

embedded designs need solid power interruption testing.<br />

Reliability demands for embedded products have increased as the<br />

desired lifetime of high reliability products has grown. To achieve<br />

the most comprehensive reliability test in the least time, stress<br />

testing must utilize I/O at the point of power interruption.<br />

This session will survey the failure points of file systems and<br />

flash media, with a discussion of the most effective strategies for<br />

ensuring that test design accounts for the variety of real world<br />

failures that can occur. Validation of data and hardware<br />

requirements will also be discussed.<br />

Keywords—reliability; NAND; flash media; power interruption;<br />

file system; O_DIRECT<br />

I. INTRODUCTION<br />

When your team is responsible for validating the reliability<br />

of a design for the embedded marketplace, you need to do more<br />

than just tick a box or read a marketing document. Real testing<br />

of reliability involves getting into the guts of the embedded<br />

design, and covers everything between the application and the<br />

hardware. This whitepaper is focused on testing the file system<br />

and its interactions. Catching a vulnerability in design or<br />

testing is far more efficient (and far less expensive) than<br />

handling problems in the field.<br />

II. DEFINITION OF RELIABILITY<br />

Let’s start by defining just what reliability is – whatever a<br />

given customer thinks it means.<br />

For some customers, reliability is just being able to turn on,<br />

start up, work properly. This is akin to a pre-electronic<br />

appliance, ready when you needed it. Other customers expect<br />

their favorite settings or programmed routes to be available.<br />

They want changes they have made to be reflected in their next<br />

use of the device. Still other customers have an even higher<br />

definition of reliability. They want to start up with settings that<br />

were just changed before the device was shut off. Medical<br />

designs are one class of device where this level of reliability is<br />

mandated – it must be known exactly what the device was<br />

programmed for, even if the power is removed immediately<br />

thereafter.<br />

It is safe to say that even though the needs of the individual<br />

customer vary, the bar must be set quite high. Likely this is<br />

even higher than the average developer expects.<br />

A. System reliability vs Data on the media<br />

Examining the previous cases from the computer’s point of<br />

view, the first is related directly to the system files. Sudden<br />

power loss could corrupt other files, as long as it doesn’t affect<br />

the system files. One trick used by developers is to put the<br />

system files on a separate partition (or even separate media) to<br />

achieve this goal. Files can be marked read only, though that<br />

only keeps the file system from touching them. Media failures<br />

can overwhelm both attributes and partitions to corrupt files,<br />

and this is a particular problem with NAND flash media. [1]<br />

To save the end user’s settings, those changes have to make<br />

it to the media. On embedded Linux, for instance, those<br />

changes must make it through the write-back cache and any<br />

buffers. The data also needs to clear any hardware block cache.<br />

In time, all the data will eventually be flushed – if the user<br />

waits long enough, their settings will be saved.<br />

In addition to user data, the file system metadata also must<br />

be in the appropriate place. For standard Linux file systems, a<br />

journal is used to commit data immediately, and these<br />

journaled metadata writes also make it to the media in time.<br />

For most file systems, Linux flush and fsync commands are<br />

used to achieve complete control of the file system, which is<br />

the only way to ensure data is committed to the media. [2]<br />

That control over the file system and block device are the<br />

key to the final use case, where the data must be committed<br />

immediately – waiting for the block cache to time out is not an<br />

option.<br />

The next step is to refine the granularity even more – what<br />

happens when the power is interrupted?<br />

B. Write interrupted mid-block<br />

Writes can be interrupted at any point, especially when<br />

power is lost unexpectedly. This results in two options – the<br />

system could be in the middle of writing a block or in between<br />

writing blocks.<br />

On older magnetic media, an interruption in the middle of<br />

writing a block would leave a partial write. The block being<br />

written is corrupted – it likely contains fragments of new data<br />



and old data. If this was the only copy of that data, it (and the<br />

entire file that contained that block) are now useless. Keeping a<br />

copy can be done via techniques such as writing to another file<br />

and then renaming, or using a copy-on-write file system.<br />

Testing for this situation should be part of any comprehensive<br />

device test.<br />
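The write-to-another-file-and-rename technique mentioned above can be sketched as follows, assuming a POSIX system where rename(2) atomically replaces the target; the helper name and the .tmp suffix are illustrative.<br />

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of the write-then-rename technique: the new contents go to a
 * temporary file, are forced to the media with fsync(), and only then
 * replace the original via rename(2), which is atomic on POSIX file
 * systems.  A power interruption at any point leaves either the
 * complete old file or the complete new one. */
static int save_atomic(const char *path, const void *buf, size_t len)
{
    char tmp[4096];

    if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }
    return rename(tmp, path);   /* atomically replaces the old copy */
}
```

For full durability the containing directory would also need an fsync() after the rename, which the sketch omits for brevity.<br />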

New NAND based media, from SSDs to eMMC to SD and<br />

Compact Flash, has an even larger problem with an interrupted<br />

write. It is so large that vendors specifically recommend<br />

maintaining power during block writes just to avoid it. Going<br />

into further detail on this type of failure is outside the scope of<br />

this paper. Assuming that media power is maintained with a<br />

capacitor or other technique, an interrupted write becomes<br />

instead write interruption between blocks.<br />

C. Write interrupted between blocks<br />

From the user perspective, each write consists of one or<br />

more blocks of data, along with any metadata that must be<br />

written. If a given write is a small amount of data, up to one<br />

block, with no metadata changes, then power interruption<br />

won’t be a problem. For larger writes, an interruption between<br />

write blocks will interrupt what is known as an atomic<br />

operation.<br />

File systems such as FAT don’t do anything special for an<br />

atomic operation, so interrupting one of those is not much<br />

better than a mid-block interruption on magnetic media. The<br />

file being written to has some updated blocks and some not yet<br />

updated. For most user data, this is enough to render the file<br />

useless. Earlier techniques like writing to a separate file or<br />

using a file system that works with atomic writes will at least<br />

keep the older file intact.<br />

Testing to make sure longer writes are committed the way a<br />

system designer expects is another requirement for<br />

comprehensive testing.<br />

III. VALIDATE THE WRITTEN DATA<br />

At this point, the developer has an expectation of how data<br />

will be written and/or recovered as part of a power interruption.<br />

This is all assuming the writes happen in the order they were<br />

initiated, which is not always the case.<br />

Another important consideration is whether data has been overwritten. If a file data write is not a completely atomic operation, then a multiple-block data write may be only partially completed. This would leave it to the application to understand just what one of those partial writes means. There are file systems which provide the granularity for a completely atomic write; is that what you need for your customers?<br />

Metadata should also be part of that atomic write. If it isn't, one of two situations can occur. If the data is written but the metadata has not been, the data would be lost as the system recovered from power interruption. A potentially worse case is if the metadata is written but the data has not yet been: a system recovering in this situation would then try to open the updated file and find either garbage or, worse, other data from the media – a potential security risk.<br />

IV. ALL I NEED IS O_DIRECT<br />

On Linux file systems, there is a particularly persistent myth that opening a file with O_DIRECT is all that is required for data to be reliably stored on the media; the fsync() call would not be necessary in this case. To assess the validity of this, our team measured the performance of sequential writes using fsync(), O_DIRECT, and neither.<br />

Performance was high – faster than the physical speed of 225 MB/s – for tests with neither protection, and for tests with only O_DIRECT. When fsync() was factored in, performance dropped to a reasonable number below the physical maximum of the device.<br />

We did find that when the amount of data written is larger than any cache in the physical media, the write performance is roughly the same with or without fsync(). For reliability purposes, when the data absolutely has to be on the media, either write large files or use fsync() – O_DIRECT is not enough.<br />
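A minimal sketch of the durable write path discussed above: regardless of the flags the file was opened with, it is the explicit fsync() that forces the data out of the write-back cache and onto the media before the function returns. The helper name is illustrative.<br />

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Durable write sketch: write() may land in the page cache and any
 * device cache; only fsync() guarantees the data (and the file's
 * metadata) has reached the media before we return success. */
static int write_durable(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t written = write(fd, buf, len);              /* may be cached */
    int rc = (written == (ssize_t)len) ? fsync(fd) : -1; /* flush to media */

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```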

V. STRESS!<br />

That covers the basic reliability testing for normal system<br />

use. The next thing to focus on is system stress – unusual cases<br />

that will likely occur in the field.<br />



The first of these, which most developers don't examine, is what happens when the disk is full. Besides the potential performance implications, a disk-full situation can generate more writes as garbage is collected. Related to that are extreme situations in the number of files – does the system latency increase noticeably beyond the first 100 or so files? Both of these situations lead to a larger write error window – and a larger potential for failure.<br />

Another stressor is a system update. This is a situation that<br />

is especially important to test thoroughly, especially the results<br />

of a potential power interruption. Atomic writes here can be a<br />

major factor which allows the device to recover from a failed<br />

update.<br />

Extreme use cases also hit the media especially hard. In the<br />

case of NAND flash media, thousands of reads from a given<br />

file can cause read disturb, adding bit errors to the media. If these bit errors are not dealt with (by a process called scrubbing), the correctable errors can grow into uncorrectable errors.<br />
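A scrub decision can be sketched as a simple threshold on the corrected bit errors reported by the ECC; the 75% threshold is an assumption for illustration.<br />

```c
/* Scrub decision sketch for NAND read disturb: once the corrected bit
 * errors in a block approach the ECC capability, rewrite (scrub) the
 * block before the errors become uncorrectable.  The 75% threshold is
 * an assumption for illustration, not a vendor recommendation. */
static int needs_scrub(unsigned corrected_bits, unsigned ecc_capability)
{
    return 4u * corrected_bits >= 3u * ecc_capability;
}
```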

Other potential media failures include some of these items:<br />

• Specific write patterns – on NAND media without a<br />

randomization filter, writing all zeroes (0x00) is<br />

actually worse for the flash than writing other patterns.<br />

• Hot Spots – media locations that are prone to failure for<br />

reasons unknown to the developer (and possibly to the<br />

vendor)<br />

VI. DISCARDS (THE TRIM STATEMENT)<br />

Another storage related item is discards – using the trim<br />

command to inform the media that data is no longer in use. On<br />

devices where discards are not used (or under-utilized), latency<br />

can increase noticeably once all the blocks on the media have<br />

been written once. [3] This increased latency causes a noticeable<br />

drop in performance, and of course when writes take longer,<br />

the potential error window grows.<br />

For that matter, the firmware on most NAND based<br />

solutions is a black box. What is happening when the media is<br />

busy discarding data or wear leveling or garbage collecting?<br />

What happens when the power is interrupted during those<br />

operations? Using the file system to generate these sorts of<br />

failures can be very hit or miss, and a custom media test should<br />

be developed to validate its operation during a variety of power<br />

interruptions.<br />

VII. TACTICS FOR EFFECTIVE TESTING<br />

We have examined a number of methods that can be used<br />

to generate more effective power interruption testing, so now<br />

we must put them all together.<br />

First of all, interrupting an embedded device while<br />

quiescent or while reading will result in nothing being lost,<br />

unless it is updating something in the background. This means<br />

power interruption testing needs to trigger those background<br />

writes where possible and, most importantly, focus on the<br />

writes.<br />

To validate this assumption, we modified the standard<br />

power interruption tests regularly performed by Datalight.<br />

These tests are on Linux, and can utilize any file system. For<br />

this test, we chose VFAT. This stochastic test performs random<br />

operations, reading and writing, creating and removing folders,<br />

etc. We found that with default values, a power interruption<br />

would cause a chkdsk failure 5.3% of the time. When we<br />

halved the chances of write operations occurring, these failures<br />

dropped to 1.4% of tests.<br />
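The weighting used in the stochastic test can be sketched as follows; the operation set, the write percentage and the deterministic generator are illustrative, not Datalight's actual test harness.<br />

```c
#include <stdint.h>

/* Sketch of a stochastic test driver: operations are picked with a
 * configurable write weight, so halving the write percentage halves
 * the share of operations exposed to a power interruption mid-write.
 * The operation set and the deterministic LCG are illustrative. */

typedef enum { OP_WRITE, OP_READ, OP_MKDIR, OP_REMOVE } fs_op;

static uint32_t rng = 1u;

static uint32_t next_pct(void)           /* deterministic LCG, 0..99 */
{
    rng = rng * 1664525u + 1013904223u;
    return (rng >> 16) % 100u;
}

static fs_op pick_op(unsigned write_pct)
{
    uint32_t r = next_pct();
    if (r < write_pct)
        return OP_WRITE;
    /* distribute the remainder across the other operations */
    switch (r % 3u) {
    case 0:  return OP_READ;
    case 1:  return OP_MKDIR;
    default: return OP_REMOVE;
    }
}
```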

These random failures during writes are likely to exercise<br />

the safety routines of the file system and application. The next<br />

step is validating the data written, not just the structure. Make<br />

sure that what is most important to your customer is being<br />

confirmed here – data ordering, overwrites, and completely<br />

atomic operations.<br />

Data order is most important to databases, and has some<br />

importance to journaled file systems. If data is overwritten in<br />

place, then an entire operation must be atomic to prevent a<br />

corrupt state – half old data, half new data, all useless.<br />
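One common way to make such torn writes detectable can be sketched as follows. This is an illustrative record layout, not any vendor's actual implementation: the field sizes, the CRC-32 choice, and the `Record` structure are assumptions. Each record carries a sequence number and a checksum computed last, so an interrupted overwrite fails validation instead of silently surfacing as half old, half new data:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical on-media record layout: payload guarded by a sequence
// number and a checksum written last. A power cut mid-write leaves a
// record whose checksum does not match -- detectably torn rather than
// silently half old / half new.
struct Record {
    uint32_t sequence;     // monotonically increasing write counter
    uint8_t  payload[56];
    uint32_t checksum;     // covers sequence + payload
};

// Simple CRC-32 (reflected, polynomial 0xEDB88320) over a byte buffer.
inline uint32_t crc32(const uint8_t* data, std::size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Compute and store the checksum over everything that precedes it.
inline void seal(Record& r) {
    r.checksum = crc32(reinterpret_cast<const uint8_t*>(&r),
                       offsetof(Record, checksum));
}

// After an interrupted write, decide whether the record is intact.
inline bool is_intact(const Record& r) {
    return r.checksum == crc32(reinterpret_cast<const uint8_t*>(&r),
                               offsetof(Record, checksum));
}
```

A post-interruption validation pass then treats any record failing `is_intact` as torn, which is exactly the evidence a custom media test needs to collect.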

While the user data is important, the system files can be<br />

even more important. If corruption of other files affects system<br />

files, the entire device can be rendered unusable. The same<br />

could happen if power interrupts an unprotected system update.<br />

Most file systems provide a utility based on chkdsk to detect<br />

these sorts of failures, though they can’t usually correct them<br />

very well.<br />

The original MS-DOS chkdsk found data that was on the<br />
media but not represented in the file allocation table (FAT), and<br />
also found a number of errors within the FAT. It was not able to<br />
connect that data to file names, so lost chains and<br />
other errors resulted in nearly useless files such as<br />
FILE0001.CHK – taking up space but not useful for most<br />
applications.<br />

VIII. SUMMARY<br />

The best testing meets both the needs of the strictest users<br />
and the goal of stressing the system effectively.<br />

Interruptions during writes will demonstrate the most failures<br />

and accurately reflect field stress. Other factors to consider<br />

include validation of data, atomic operations and cases that also<br />

stress the media. Real reliability testing is more than just a<br />
requirement; it also leads to the long-term success of the embedded<br />
design.<br />

REFERENCES<br />

[1] P. Slocum, "Are read only partitions safe from corruption if there's also<br />
a read/write partition on the same sd card?",<br />
https://raspberrypi.stackexchange.com/questions/67035/<br />
[2] T. Denholm, "Reliably committing data in Linux", first presented at<br />
Embedded World 2017,<br />
https://www.datalight.com/resources/whitepapers/reliably-committing-data-in-linux<br />
[3] T. Denholm, "Performance drop without discards", May 31, 2017,<br />
https://www.datalight.com/blog/2017/05/31<br />



Using Google Test for Safety-critical Software<br />

Development<br />

Miroslaw Zielinski<br />

Principal Software Engineer<br />

Parasoft<br />

Krakow, Poland<br />

miroslaw.zielinski@parasoft.com<br />

Abstract— This paper explores the essential elements of the<br />

unit testing environment for safety-critical projects. It describes<br />

how to augment open source unit testing frameworks to<br />

successfully certify the software.<br />

Keywords—unit testing; Google Test; safety-critical software;<br />

software certification<br />

I. INTRODUCTION<br />

The volume of safety-sensitive software has grown<br />

significantly, along with the continuously increasing number of<br />

connected devices and recent advancements in AI, notably AI’s<br />

applications to autonomous driving. As a result, it is becoming<br />

much more difficult to work around software certification. This<br />

is because the project modules subject to the rigorous criteria<br />

of safety standards, such as IEC 61508, ISO 26262, and DO-<br />
178B/C, are becoming a much bigger part of the codebases. To<br />

this end, software must be rigorously tested.<br />

Safety standards mandate numerous testing practices and<br />

processes on software development. There is a cost associated<br />

with all of them. Some software quality practices, such as unit<br />

testing, require significant investment in tools and impact<br />

development schedules. In addition to the initial investment in<br />

tools and process implementation, there is an overhead related<br />

to the creation and maintenance of test cases proportional to the<br />

amount of created code. A thorough implementation of unit<br />

testing, including all the reporting required to get certification<br />

credit, causes it to be one of the most expensive techniques for<br />

assuring software quality. This is especially true for the C and C++<br />
languages. The cost associated with implementing this<br />
technology, as well as qualifying the tool chain itself, makes<br />
the difficult process of selecting a unit testing solution very<br />
important because it affects the development process in many<br />
ways.<br />

In this paper, we explore the essential components of a<br />

complete unit testing solution required for developing<br />

safety-critical software. We also discuss the feasibility of<br />

building the unit testing solution based on software that is free<br />

for commercial use. The discussion includes commercial<br />
tools for functionalities that lack sufficiently capable<br />
counterparts in the open source world. The main intention of<br />

this paper is to analyze practices and methods recommended by<br />

the safety standards at a high-level and link them to specific<br />

features of the unit testing solution. This will help build an<br />

understanding of where it is reasonable to rely on the open<br />

software. The discussion assumes that C and C++ languages<br />

are the most popular for safety-critical software development<br />

and uses Google Test as an example of a free unit testing<br />

framework. Most conclusions, however, apply equally well to<br />

any of the frameworks available in the open source ecosystem.<br />

II. WHY CONSIDER OPEN SOURCE UNIT TESTING SOLUTIONS?<br />

Teams producing safety-critical systems traditionally select<br />

commercial unit testing solutions. This is mainly because open<br />

frameworks do not provide sufficient functionality to assure<br />

successful certification of the software, especially for the most<br />

stringent levels of safety. Commercial unit testing solutions<br />

usually come with modules for creating unit tests,<br />

stubbing/mocking, calculating various code coverage metrics,<br />

and, of course, colorful reporting. All-in-one solutions that<br />

completely satisfy the requirements imposed by safety<br />

standards, such as ISO 26262 or DO-178C, certainly have many<br />

benefits. A significant benefit is tool certification (in terms of<br />

IEC 61508 and related standards) and qualification kits for<br />

other standards, which greatly reduce tool qualification<br />

workloads. Not only do we have a good answer for all<br />

requirements from safety standards, but we can also consult our<br />

vendor’s support team to address any concerns.<br />

What are the benefits of an open source unit testing<br />

framework when a commercial solution can meet the<br />

requirements put forth by a standard? If we assume for the<br />

moment that a hypothetical open source unit testing tool meets<br />

all safety standards requirements, the following benefits can be<br />

achieved:<br />

• Relying on a popular free framework increases our<br />

ability to find software engineers already familiar with<br />

the tool.<br />

• Developers are more willing to learn and use a popular<br />

and readily available solution.<br />

• There are usually many open libraries and modules with<br />

existing sets of unit tests that may potentially be<br />

integrated into our projects. The same is valid for the<br />



code developed in-house. Modules developed without<br />

a safety standard in mind and covered with open source<br />

test cases sometimes fall unexpectedly into the scope of<br />

software certification.<br />

• Test cases created in an open source format protect our<br />

investment in unit testing and free us from the<br />

constraints of a vendor’s commercial solution. If a<br />

company decides to give up the commercial tool or<br />

switch vendors, all test cases created with the abandoned<br />

solution may need to be rewritten or imported.<br />

• It is easier to function in long supply chains (as is<br />
common with automotive software) when executing the test<br />
cases that validate the supplied source code does not<br />
require commercial tools.<br />

These benefits represent important decision factors and can<br />

have strategic importance in many cases.<br />

There are also disadvantages. The most severe is that by<br />

selecting an open source unit testing framework, we cover only<br />

a fraction of the requirements that are typically imposed by the<br />

safety standard. We will return to this point later, but a quick<br />

example is the support for structural code coverage. Open<br />

source solutions provide reasonable support for only the<br />

simplest metrics. If our project requires a more advanced<br />

metric, such as MC/DC (modified condition/decision<br />

coverage), we will need to augment the selected framework<br />

with a commercial solution to provide code coverage statistics.<br />

There are many more functionalities in the broader<br />
area of unit testing that may require a commercial plugin to<br />

provide sufficient functionality.<br />

Despite the disadvantages, the fact remains that creating<br />

unit tests is a very expensive process. When we decide to fully<br />

base our unit testing process on a commercial solution, we will<br />

need to implement the unit test cases in the format supported<br />

by the commercial tool, which ties us to the vendor for a long<br />

time. Attempts to reuse code created along with the tests will<br />

require using the same tooling. If our toolbox includes an open<br />

source framework augmented with dedicated tools to<br />

implement specific sub-functions, reusing the assets we create<br />

is much easier. We can hand the created code and test cases over to<br />

our contractor, enabling them to use their own coverage tools<br />

and verify that code quality is as expected. Relying on the<br />

freely available formats for unit test creation is a reasonable<br />
choice in the absence of a standardized format for describing<br />
unit tests.<br />

III. WHAT DO WE NEED FROM A UNIT TESTING SOLUTION?<br />

The essential features of a solution depend on the safety<br />

standard and our project’s risk classification level (SIL, ASIL,<br />

or DAL). The discussion includes not only the core features of<br />

the unit testing framework but also the accompanying<br />

functionalities, such as stubbing/mocking tools, traceability<br />

frameworks (which are obligatory to assure completeness of<br />

testing), and the ability to produce the required reports.<br />

For the sake of simplicity, let's analyze the requirements from<br />
two popular industry standards: ISO 26262 (automotive) and<br />
DO-178C (aviation). Table 1 lists a selection of important<br />
unit testing methodologies required to meet the<br />
objectives of these standards. The selection focuses on the most<br />
important practices only and presents a generalized view.<br />

In contrast to ISO 26262, DO-178C does not explicitly<br />

require unit testing. The standard does, however, impose<br />

requirements that are often difficult to meet without<br />

implementing a unit testing process. As a result, many<br />

organizations effectively assume that unit testing is a de-facto<br />

requirement for DO-178C compliance. Looking at the<br />

objectives from safety standards, the solution for unit testing<br />

would ideally contain:<br />

• Unit testing framework (assertions, test suites,<br />

execution automation)<br />

• Code coverage tool<br />

• Stubbing/mocking framework<br />

• Integration with hardware processor or simulator<br />

• Reporting<br />

• Validation test cases to qualify the entire solution<br />

• Tool certification and/or tool qualification kit<br />

Each of these modules plays a specific role in the<br />

development process and is expected to generate artifacts that<br />

support the certification claim. The following sections discuss<br />

each module, highlight crucial features, and assess the<br />

feasibility of using freely available modules.<br />

A. Unit Testing Framework<br />

Safety standards do not list any specific features expected<br />

from the unit testing framework itself. There are, however,<br />

some requirements stemming from the framework<br />

implementation process and safety-oriented structures in the<br />

organization responsible for controlling and documenting the<br />

verification and validation processes. One such requirement<br />

relates to reports generated by the framework. It is a common<br />

practice that test results are reviewed by a separate, dedicated<br />

team in the organization. For this purpose, unit test execution<br />

should be well documented. Generated reports shall contain a<br />

section that details the following:<br />

• The function or method tested<br />

• The initial values for the parameters<br />

• The configuration and expectations related to test<br />

doubles<br />

• Results of all assertions, including assertions that were<br />

positively verified<br />

• Correlation with the requirement validated by the given<br />

test<br />
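The "results of all assertions, including assertions that were positively verified" requirement is worth a concrete sketch, since most frameworks report only failures by default. The helper below is a hypothetical, framework-agnostic illustration (the `CheckLog` name and `REQ-…` IDs are invented): every check is recorded, pass or fail, tagged with the requirement it validates, giving reviewers a self-contained execution record.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical logging assertion helper: every check is recorded,
// whether it passed or failed, together with the requirement it
// validates -- producing the review-friendly record described above.
struct CheckLog {
    std::vector<std::string> lines;

    // Record one assertion: requirement ID, a human-readable
    // description, and the verdict. Returns the verdict so it can be
    // chained into the framework's own assertion macros.
    bool check(const std::string& requirement,
               const std::string& what,
               bool condition) {
        std::ostringstream os;
        os << (condition ? "PASS" : "FAIL")
           << " [" << requirement << "] " << what;
        lines.push_back(os.str());
        return condition;
    }
};

// Example use inside a test body (REQ-042 is a placeholder ID):
//   CheckLog log;
//   log.check("REQ-042", "speed limited to 100", limiter(140) == 100);
```

In Google Test, the same idea can be wired in via custom assertion macros or per-test metadata, so the generated report carries both verdicts and requirement correlation without reviewers opening the test source.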

This level of detail may seem like overkill, but in<br />
reality it allows reviewers to confirm the correctness of the test case and its<br />
execution result without looking at the body of the test case. It<br />
simplifies the work by facilitating the “independent review”<br />
process. All required information is in one document, and<br />
there is no need to reach into the code base for additional<br />



TABLE 1: REQUIRED UNIT TESTING METHODOLOGIES FOR DO-178C AND ISO 26262<br />
Methodology | DO-178C | ISO 26262<br />
Unit testing (methodology) | 6.4.c, 6.4.d | Part 6, clause 9<br />
Requirements-based testing / traceability | 6.5.a, 6.5.b, 6.5.c | Part 6, 9.4.2<br />
Statement coverage | 6.4.4.c (Levels A, B, C) | Part 6, 9.4.4 (ASIL A, ASIL B)<br />
Branch coverage | 6.4.4.c (Levels A, B) | Part 6, 9.4.4 (ASIL B, ASIL C, ASIL D)<br />
MC/DC coverage | 6.4.4.c (Level A) | Part 6, 9.4.4 (ASIL D)<br />
Fault injection / robustness test cases | 6.4.2.2 | Part 6, 9.4.2<br />
Test environment representative of production env. | 6.4.1 | Part 6, 9.4.5<br />
Software tool qualification | 12.2 | Part 8, 11<br />

information. An example of such a report from a<br />
commercial unit testing framework is presented in Fig. 1.<br />

Free unit testing frameworks do not usually provide<br />

sufficiently detailed reports out of the box. Frameworks with<br />

an open architecture, such as Google Test, can usually integrate<br />

multiple plugins that contribute to the execution, which makes<br />

the extension relatively simple. When extending the open<br />

source framework, users should consider:<br />

• Test case association with the requirements<br />

• Extending assertions to generate messages for positive<br />

and negative assertions verification<br />

• Dedicated macros for outputting additional meta-data<br />

about the test case<br />

Safety standards also suggest that the test environment<br />

should be as close as possible to the production environment.<br />

As a result, executing unit test cases on the target processor or<br />

at least on the processor simulator is desirable. Teams often try<br />

to limit this type of testing because it is more time consuming<br />

and difficult to automate than testing on the host platform. But<br />

even if source code can easily be tested on the host computer, a<br />

periodic verification with the target processor is typically<br />

conducted to avoid difficult equivalence arguments and to prove<br />

that differences between the target and the host processor do<br />

not hide potential errors.<br />

For example, DO-178C states that “The difference between<br />

the target computer and the emulator or simulator, and the<br />

effects of these differences on the ability to detect errors and<br />

verify functionality should be considered.” [1]. Preparing the<br />

data that supports the claim that host and on target testing are<br />

equivalent is not easy, especially for the more stringent levels<br />

of safety. In most cases, it is easier to adapt the unit testing<br />

framework for on-target test execution. Commercial unit<br />

testing frameworks include dedicated support for a large<br />

collection of cross-development environments. Support usually<br />

includes the integration with debuggers, allowing seamless<br />

communication for uploading test binaries and downloading<br />

results. Open source frameworks require some modification to<br />

adapt for on-target execution:<br />

• Cross compilation of the framework with the target<br />

compiler<br />

• Implementation of a plug-in that outputs results from<br />

the target to the host machine<br />

• Scripts to automate the interaction with the target to<br />

upload the test binary, start execution, and download the<br />

results<br />

• Conformance to the hardware resource limitations of the<br />

target, such as processing speed and available memory<br />

Although these gaps seem challenging at first, they are not<br />

difficult to bridge. Google Test requires a<br />
“C++98-standard-compliant compiler,” which is a reasonable requirement. A<br />

bigger challenge would be if C-only compilation were<br />

required. In this case, we need to look for a C-based unit<br />

testing framework, such as CUnit or cmocka. Implementation<br />

of the communication layer to transport testing results<br />

commonly requires providing a function that can transport a<br />

buffer from the target to the host. The remaining part of report<br />

building happens in the upper layer of the framework. Finally,<br />

automation of test execution can typically be achieved<br />
with the interface provided by the debugger. Most<br />
cross-development environments support some simple scripting to<br />

automate debugging activities, which is more than enough to<br />

automate unit tests execution. It is important, however, that an<br />

open source testing framework can also work within the<br />

hardware resource constraints of an embedded target.<br />

Frameworks with heavy memory and processing requirements<br />

might be impractical for embedded devices.<br />
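The communication layer mentioned above usually reduces to a single function that moves a buffer from target to host. A minimal sketch of that seam, assuming a pluggable transport hook (the `uart0_write` name is an invented placeholder for whatever a given board support package provides):

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical transport hook: on the host it defaults to stdout; on
// target it would be re-pointed at a board-specific routine such as a
// UART or semihosting write. Names here are placeholders.
using TransportFn = int (*)(const char* buf, unsigned len);

static int stdout_transport(const char* buf, unsigned len) {
    return static_cast<int>(std::fwrite(buf, 1, len, stdout));
}

static TransportFn g_transport = &stdout_transport;

// Upper layers of the test framework call this to ship a chunk of the
// report; report formatting itself stays unchanged above this seam.
int report_write(const char* buf) {
    return g_transport(buf, static_cast<unsigned>(std::strlen(buf)));
}

// On target, before running tests (board routine is an assumption):
//   extern int uart0_write(const char*, unsigned);
//   g_transport = &uart0_write;
```

Everything above this function (assertion evaluation, report building) is unaware of the transport, which is why the adaptation tends to be localized and manageable.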

B. Code Coverage<br />

Once we have settled on a unit testing framework, the<br />
next step is to select a code coverage tool. Coverage<br />

metrics are consistently required by all safety standards. The<br />

implied objective is to identify code that was not exercised by<br />

requirements-based testing and refine the tests, requirements,<br />

or both. The type of required metrics depends on the risk level<br />

associated with the system (see Table 1). The process of<br />

refining the tests and requirements with the help of a coverage<br />

report is time consuming. Thus, coverage reports should be<br />

easy to analyze and well-integrated with other components of<br />

the unit testing solution to minimize the amount of manual<br />

work. The coverage tool should at least include the following<br />

features:<br />

• Support for all required coverage metrics<br />

• Ability to present coverage results generated per<br />
execution of a specific test case<br />
• Ability to present coverage results in the context of a<br />
specific selected requirement (traceability)<br />



Fig. 1: Example report from a commercial unit testing solution.<br />

• Ability to merge coverage results from different testing<br />

sessions and different working stations<br />

• Ability to merge coverage results from different types<br />

of testing, such as unit testing, integration testing, and<br />

system level testing<br />

• Ability to collect results from host, target, and simulator<br />

• Ability to annotate reports with additional information,<br />

such as date, tester name, session ID, tool identification<br />

(i.e., compiler and linker hashes)<br />

• Ability to collect coverage results on a per-build basis<br />

to allow for comparisons between builds and baselines<br />

There are popular coverage tools in the free software<br />

domain, including GNU gcov and Clang-based tools. There are<br />

also several promising projects in early phases of development,<br />

but none of them satisfy all the enumerated criteria. One of the<br />

most significant limitations is lack of support for statement and<br />

MC/DC coverage. Most of the available free tools only support<br />

line and branch coverage metrics. Line coverage can<br />

sometimes be used as a replacement for statement coverage if<br />

we engage static analysis tools to enforce a coding convention<br />

to place only one statement in each line. Such an approach,<br />

however, is far from convenient and actually obfuscates the<br />

code. There does not seem to be a good option among free<br />

tools for MC/DC coverage, which represents a significant<br />

roadblock to adoption in high safety integrity level projects.<br />

C. Traceability<br />

Support for traceability from requirements to code and<br />

associated tests is another important feature. Traceability<br />

facilitates requirements-based testing and, in the context of code<br />
coverage, means correlating a test case with the code<br />
coverage results generated when executing that test case.<br />

Additionally, the test-case-to-requirement correlation helps<br />

developers understand how well a specific requirement is being<br />

tested.<br />

A traceability framework needs to provide bidirectional<br />

links between all important artifacts created during the<br />

software verification process. In the context of code coverage<br />

tools, the important element is to assure the ability to annotate<br />

code coverage results with the information about the executed<br />

test case. Commercial solutions for unit testing offer these<br />

capabilities out of the box, whereas open source unit testing<br />

frameworks require integration with the coverage tool to fulfill<br />

this requirement. A convenient method of doing this is via the<br />



API, which is often provided by coverage tools. Such an API<br />

enables the integration of the unit testing framework with the<br />

coverage tool. There are commercial solutions that offer<br />

integration APIs, enabling easy interaction with any testing<br />

framework.<br />

The API does not have to be complex. In most cases,<br />

simple functions to notify about test start/stop events are<br />

sufficient:<br />

void TestStart(const char* testName);<br />

void TestStop(void);<br />

This kind of API assumes that calling TestStart annotates<br />

the coverage results stream with the ID of the executed test<br />

case. Calling TestStop closes the results section assigned to the<br />

specific test case. This simple integration schema is presented<br />

in Fig. 2.<br />

With the API discussed above, integration of coverage tools<br />

with unit testing frameworks is relatively simple. Most of the<br />

open frameworks support plug-ins for monitoring test<br />

execution, which can be used to send messages to the coverage<br />

tool about the beginning and end of the test execution. The<br />

following example shows how to use Google Test’s<br />

testing::TestEventListener interface to bridge the unit testing<br />

framework and coverage tool:<br />

class CoverageAnnotator : public ::testing::EmptyTestEventListener<br />
{<br />
public:<br />
  virtual void OnTestStart(const ::testing::TestInfo& test_info)<br />
  {<br />
    TestStart(test_info.test_case_name()); /* Coverage tool API call */<br />
  }<br />
  virtual void OnTestEnd(const ::testing::TestInfo& test_info)<br />
  {<br />
    TestStop(); /* Coverage tool API call */<br />
  }<br />
};<br />

Reports generated from the integrated unit testing<br />

framework and coverage tool allow developers to review<br />

coverage results generated by a specific unit test. An example<br />

showing a collection of Google Test test cases with associated<br />

coverage results is shown in Fig. 3.<br />

D. Target-based Testing and Metrics Collection<br />

ISO 26262 and DO-178C recommend that the test environment<br />
be as close as possible to the production environment.<br />

This means that results, including code coverage, shall be<br />

collected from the target processor or at least from the<br />

simulator. The code coverage tool therefore needs to support<br />
collecting results from the embedded hardware. The subject is<br />

broad, but with some simplification, code coverage can be<br />

collected using two types of technology:<br />

• Source code instrumentation (injecting extra code into<br />

original code)<br />

• Processor core trace logic (collecting information about<br />

instructions executed by the core)<br />

Source instrumentation is flexible and can be applied at<br />

build time. It is independent of the hardware and allows<br />

execution on the target processor. Dedicated integration may<br />

be required, however, to work with specific cross-compilers.<br />

The technology supports all known coverage metrics, from the<br />

statement through the path and condition coverage up to full<br />

MC/DC coverage. The main limitation of this technology is<br />

that it imposes some overhead on the execution time (i.e., time<br />

for executing the injected instrumentation) and it increases the<br />

footprint of the binary executable, which may be an issue for<br />

smaller MCUs.<br />
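What such instrumentation injects can be sketched concretely. The tool assigns each statement an index and inserts a probe before it; the hit table is dumped after the run and mapped back to source lines. The `COV_HIT` macro, probe numbering, and the instrumented function below are illustrative assumptions, not any specific tool's output:

```cpp
#include <cstddef>

// Hypothetical statement-coverage instrumentation: the tool assigns
// each statement an index and injects COV_HIT(i) before it. The hit
// table is dumped after the run and mapped back to source lines.
constexpr std::size_t kNumProbes = 3;
static unsigned g_hits[kNumProbes] = {0};
#define COV_HIT(i) (++g_hits[i])

// Instrumented version of:
//   int clamp(int v) { if (v > 100) v = 100; return v; }
int clamp_instrumented(int v) {
    COV_HIT(0);                            // function entry statement
    if (v > 100) { COV_HIT(1); v = 100; }  // the clamping branch
    COV_HIT(2);                            // the return statement
    return v;
}
```

Running only `clamp_instrumented(50)` leaves probe 1 at zero, which is exactly the uncovered-branch evidence a coverage report surfaces; the extra increments and the hit table are the execution-time and footprint overhead described above.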

An instruction trace is an alternative method of collecting<br />

code coverage metrics. This method requires dedicated support<br />

from the hardware. Processors must contain core trace logic,<br />

which generates a stream of information about machine<br />

instructions executed by the core. This information is recorded<br />

and later mapped to the high-level language (C/C++) to<br />

provide source code level coverage metrics. This approach is<br />

also feasible for simulators. The significant advantage of this<br />

technology is that it does not impose any overhead on<br />

execution time or binary footprint, which may be important for<br />

testing portions of the code in which timing is critical.<br />

Fig. 2: Diagram showing a simple Google Test integration with a code coverage solution.<br />



Fig. 3: Example report from a commercial tool showing Google Test test cases associated with coverage results.<br />

The severe limitation of this technology is that mapping of<br />

machine instructions to the high-level language structures is<br />

not trivial. For C/C++, the object code does not contain enough<br />

information to support full tracing of machine instructions to<br />

high-level language constructs. In effect, solutions that are<br />

currently available only support statement and condition<br />

coverage in a reasonable way. More complex metrics are not<br />

supported. This eliminates this methodology from the<br />

applications in the projects where MC/DC coverage is<br />

required.<br />

Practical implementation of the unit testing solution for<br />

software certification induces some additional requirements on<br />

the coverage tools, which do not stem directly from the safety<br />

standards. Example is the ability to merge coverage results<br />

from different execution sessions into one report, proving there<br />

are no unexpected gaps in the testing process. The need for this<br />

functionality is well known to any practitioner. In real-world<br />

projects, some sections of the code can be tested only when<br />

using the actual hardware, while others can be examined using<br />

simpler, less expensive setups with simulators or even host<br />

processors. The ability to combine the testing results from PIL<br />
(processor-in-the-loop) and SIL (software-in-the-loop) testing<br />
sessions saves time. Time savings can also be<br />
obtained by combining system-level coverage testing<br />

results with unit testing coverage results.<br />

The analysis of the features expected from the code<br />

coverage tool deployed for safety-critical software<br />

development suggests that free coverage tools are not yet at a<br />

maturity level that would allow organizations to integrate them<br />

into safety-oriented production development environments. At<br />
the moment, commercial tools are better suited to this goal.<br />

E. Fault Injection and Robustness Test Cases<br />

Fault injection is a method used in software testing to<br />

assure that the system can safely handle all the errors that<br />

survived the verification and validation processes, the so-called<br />

residual errors. This method assumes that there is a safety<br />

mechanism that can bring the system into a safe state when an<br />

unexpected error occurs. The goal of fault injection testing is to<br />

prove that those safety mechanisms are there and are effective.<br />

This methodology is explicitly listed in ISO 26262 in<br />

context of unit testing: “includes injection of arbitrary faults in<br />

order to test safety mechanism” [2]. DO-178C also discusses<br />

this methodology “Robustness test cases demonstrate the<br />

ability of the software to respond to abnormal inputs and<br />

conditions.” [3]<br />

There are numerous studies and publications discussing the<br />

effectiveness and viable approaches to fault injection testing<br />



[4], [5]. Fault injection testing is recommended by safety<br />

standards for software testing at unit, integration, and system<br />

levels. The easiest and most common way of implementing this<br />

methodology is mutating a section of the software and<br />

observing the response of the test suites. In the context of unit<br />

testing, a legitimate strategy is to replace a function or method<br />

with an alternative implementation that returns a mutated value<br />

or injects a side effect, such as a modification of a global<br />

variable (although this capability is not limited to the fault<br />

injection testing only). The natural application is isolating<br />

tested components during unit testing to make the tests faster,<br />

more robust, and less complicated. An ideal framework should<br />

offer the ability to intercept any function or method call in the<br />

tested code and:<br />

• Stub it (provide a dummy implementation in case the<br />

original definition is not available)<br />

• Simulate the return value<br />

• Modify global variables or object state<br />

• Check asserted expectations about the call, such as the<br />

values of the parameters used for the call, etc.<br />

• Perform a proxy call to an original symbol, potentially<br />

with modified parameters or other side effects<br />

There are several technologies that can help achieve the<br />

above goals. Some of them only help when replacing the<br />

original calls with a “test double,” while others support the process of<br />

programming the expected behavior. For complete control of<br />

the outcome, however, users may need to rely on the<br />

combination of the following:<br />

• Mocking frameworks (e.g., Google Mock or<br />

CppUMock).<br />

• Link time substitution: replacing object files containing<br />

original definitions with prepared test doubles.<br />

• Runtime function pointer substitution: for every<br />

function expected to be replaced with the test double,<br />

declare a corresponding function pointer and assign it<br />

with the default definition. At test time, the pointer can<br />

be reassigned to an alternative implementation. This<br />

approach is usable only for C-style functions.<br />

• Source code instrumentation, which uses a dedicated<br />

tool to analyze the source code and apply<br />

instrumentation that replaces the invocations of<br />

functions with desired “test doubles.”<br />

• Binary instrumentation: another dedicated tool that<br />

analyzes the binaries at project link time and performs<br />

all the required rewiring to call desired test doubles in<br />

place of the original functions.<br />

• Preprocessor substitution: renaming function<br />

calls using preprocessor macros.<br />

F. Test Framework Fault Injection Implementation<br />

Mocking frameworks, such as Google Mock, offer very<br />

convenient APIs for programming expected behavior, the<br />

benefits of which are the readability of programmed<br />

expectations for mock object behavior and the ability to store<br />

those definitions together with the test case. A serious<br />

limitation, however, is that mocking with frameworks like<br />

GMock only works effectively for virtual methods of C++<br />

classes. Mocking C-style functions or non-virtual methods<br />

requires a significant redesign of the code, which is often<br />

unacceptable. If our code is compiled as C, we may not be able to<br />

use a mocking framework at all, because many of them rely on<br />

C++ language features. In general, mocking<br />

frameworks seem to be a viable option for fault injection<br />

testing and isolation testing, but only if this was planned from<br />

the beginning of the project and the software architecture was<br />

designed with this intention.<br />

There is also the possibility of using the linker for<br />

substituting test doubles for the real code. At least two<br />

approaches are possible. The first is link-time substitution, which<br />

assumes entire modules can be replaced with alternative<br />

implementations during the linking phase [6]. This approach may<br />

work in the early phases of the project where it is easy to<br />

substitute entire modules with test implementations. But as<br />

complexity increases, it is more and more difficult to inject<br />

alternative implementations. In real-world projects, this<br />

approach is seldom used and is rarely recommended.<br />

The second approach is to rewire calls to inject alternate<br />

code at link time with the dedicated support of the linker. Some<br />

linkers, such as the GNU linker, provide the ability to redirect<br />

all references to a symbol to an alternative definition. On some<br />

platforms, the GNU linker accepts the "--wrap" option<br />

with the name of the symbol to be rewired (which needs<br />

to exist somewhere in the objects or libraries). After gaining<br />

some experience with this approach, practitioners find it to be<br />

quite powerful, as it enables point injections to existing objects<br />

or code without modifications. Problems emerge, however,<br />

when used with C++ because the information passed to the linker<br />

must be in the form of a mangled symbol name, which is<br />

inconvenient and error prone. In general, it is an interesting<br />

alternative if fault injection testing is conducted in a<br />

limited scope and the toolchain offers the corresponding<br />

functionality. Overall, using this approach during unit testing<br />

for standard mocking and isolation may be too inconvenient<br />

due to the difficulty in managing test doubles.<br />

And finally, there is the possibility of injecting alternative<br />

behavior into the tested code using so-called code patching.<br />

This process usually requires user-provided configuration for<br />

the test doubles and the instrumentation of the source code it<br />

replaces. Although more technologically advanced than<br />

previously described methods, this approach gives a good level<br />

of flexibility. Users do not have to design their code in a<br />

specific way to accommodate stubbing and mocking. There<br />

is no need to remove the original definition of a mocked<br />

function from the test binary. Moreover, most fault injection<br />

frameworks support so-called proxy calls, where the injected<br />

test double performs some of the operations required by the test<br />

scenario and then invokes the original function, which is useful in<br />

many situations. There are commercial solutions that offer<br />

these capabilities, which prove useful for fault injection<br />

testing, as well as for regular isolation during unit testing.<br />



A critical factor related to all the described implementation<br />

techniques is how to program the behavior of the test double.<br />

How do we express the action required to inject a fault into the<br />

testing process? Mocking frameworks, such as Google Mock,<br />

include a convenient and easy-to-use API. The significant<br />

advantage of the Google Mock API is that the definition of the<br />

mock can be stored inside the test case. Those who have tried<br />

implementing unit tests manually understand how important it<br />

is to see the preconditions of the test and the test double<br />

definition in one place. If an alternative method is used, such as<br />

link-time substitution, we will need to program the test double<br />

behavior inside the test double’s definition, which complicates<br />

maintaining the alternative logic for multiple test cases. There<br />

are interesting frameworks, such as CppUMock, that address<br />

exactly this issue by providing generic functionality that<br />

enables the separation of the test-specific logic definition for<br />

the test double from its body. This allows storing the test<br />

doubles configuration together with the test case.<br />

G. Tool Qualification<br />

When making a decision about tools for safety-critical<br />

development, it is important to consider the tool qualification<br />

process. According to ISO 26262, “The objective of tool<br />

qualification is to provide evidence of software tool suitability<br />

for use when developing safety-related item or element.” [7].<br />

Safety standards differ in terminology and requirements related<br />

to this process, but the guidance generally requires that the<br />

process starts with the tool classification. The tool<br />

classification process determines whether qualification is<br />

required, as well as the objectives and appropriate methods to<br />

qualify the tool. The actual qualification is conducted<br />

according to guidelines stemming from the classification<br />

process. A commonly chosen method for qualification is based<br />

on the validation of the software tool in the development<br />

environment. This method assumes that the software tool has well-defined<br />

functional requirements and that appropriate test cases<br />

are available to validate those requirements.<br />

Commercial tools usually offer dedicated qualification kits,<br />

which significantly simplify the qualification process. For<br />

example, a commercial code coverage tool will most likely<br />

provide a required set of test cases together with expected<br />

result definitions to help confirm the correct operation of the<br />

tool. It’s reasonable to check with the vendor before purchasing<br />

the tool whether qualification support is provided for the project-specific<br />

environment.<br />

Open source components of the development environment,<br />

such as a unit testing framework, will require additional work<br />

to qualify them. Teams will need to prepare a definition of<br />

functional requirements and collect appropriate test cases that<br />

prove the correctness of functionality. In many cases, it is<br />

possible to reuse the test cases created for the open source tool<br />

for standard quality control. Google Test, for example, is<br />

distributed together with a reasonable set of tests that can be<br />

reused for the qualification process. The qualification process<br />

does not have to cover the entire functionality of the tool—just<br />

the features of the tool that are actually used in the<br />

development process.<br />

The documentation created during the qualification process<br />

should contain instructions for the developers—the so-called<br />

“safety manual,” which clearly defines which functionalities of<br />

the tool are qualified and can be used for safety-critical<br />

development, as well as the settings required for safe usage of<br />

the tool. It is sufficient to perform the tool qualification process<br />

once for a given project, assuming we will not change the<br />

versions of the tools.<br />

The qualification process for a solution containing many<br />

separate components is challenging, especially if the solution<br />

contains open source components that were not designed with<br />

the qualification activities in mind. In the likely case of<br />

selecting validation as the qualification method, the most<br />

expensive part of the process will be the definition of<br />

functional requirements for the open source project and<br />

preparation of the test cases to validate the requirements. The<br />

cost of preparing the validation test cases can be reduced by<br />

using the test cases (if they exist) shipped with the open source<br />

tool.<br />

Unless it is separately regulated by the end customer or the<br />

business’s internal development policy, there are no<br />

obstacles to using open source tools for safety-critical<br />

software development other than the qualification process.<br />

IV. SUMMARY<br />

The aim of this paper was to discuss the feasibility of<br />

building a unit testing solution for safety critical systems based<br />

on free software components. Due to the stringent requirements<br />

of safety critical software standards, such a solution will have<br />

to be a mixture of open source and commercial tools in most<br />

cases. An obvious drawback of such a mixed solution is that the<br />

cost of maintenance and tool qualification will likely be higher<br />

than using a uniform commercial solution. It doesn’t mean,<br />

however, that this approach is unreasonable. An important<br />

aspect of open source solutions is the reliance on open<br />

standards and formats, which secures the investment made in<br />

test case implementation and eases the exchange of compliance<br />

artifacts in the customer’s supply chain. The final decision<br />

about a unit testing solution must be made by the development<br />

team based on their understanding of the specifics of their<br />

development environment, project requirements, and customer<br />

expectations. The concept of using an open source unit testing<br />

framework supported with commercial tools for advanced code<br />

coverage and test doubles should be considered a viable<br />

solution. With the growing number of software projects that<br />

require certification, and the increasing amount of open source<br />

code integrated into safety critical systems, this approach is<br />

likely to gain popularity across software organizations.<br />

REFERENCES<br />

[1] DO-178C, Software Considerations in Airborne Systems and Equipment<br />

Certification, RTCA, Inc. December 13, 2011, 4.4.3.b<br />

[2] ISO 26262 Road vehicles – Functional safety, part 6, 9.4.2, Table 12,<br />

subscript a<br />

[3] DO-178C, Software Considerations in Airborne Systems and Equipment<br />

Certification, RTCA, Inc. December 13, 2011, 6.4.2.2<br />

[4] D. Cotroneo, R. Natella, “Fault Injection for Software Certification,”<br />

IEEE Security & Privacy, July 2013<br />



[5] J. M. Voas and G. McGraw, Software Fault Injection: Inoculating<br />

Programs Against Errors. John Wiley & Sons, Inc., 1998<br />

[6] J. W. Grenning, Test-Driven Development for Embedded C, Pragmatic<br />

Bookshelf, September 2014<br />

[7] ISO 26262 Road vehicles – Functional safety, part 8, 11.1<br />
